Thursday, August 13, 2020

Sounding an Alarm on Data Mining

Back in 2013, someone going under the nom de plume of Economist Hulk wrote on Twitter (all caps, natch): "WHEN FACTS CHANGE, HULK SMASH FACTS UNTIL THEY FIT HIS PRE-CONCEIVED THEORY. HULK CALL THIS ‘ECONOMETRICS’."

Gordon Tullock offered an aphorism in a similar spirit, which he attributed to verbal comments from Ronald Coase ("A Comment on Daniel Klein's 'A Plea to Economists Who Favor Liberty,'" Eastern Economic Journal, Spring 2001, 27:2, pp. 203-207). Tullock wrote: "As Ronald Coase says, 'if you torture the data long enough it will confess.'"

Ronald Coase (Nobel 1991) put the point just a little differently in a 1981 lecture, "How Should Economists Choose?", while attributing a similar point to Thomas Kuhn. Coase wrote (footnotes omitted): 
In a talk I gave at the University of Virginia in the early 1960s ... I said that if you torture the data enough, nature will always confess, a saying which, in a somewhat altered form, has taken its place in the statistical literature. Kuhn puts the point more elegantly and makes the process sound more like a seduction: "nature undoubtedly responds to the theoretical predispositions with which she is approached by the measuring scientist." I observed that a failure to get an exact fit between the theory and the quantitative results is not generally treated as calling for the abandonment of the theory but the discrepancies are put on one side as something calling for further study. Kuhn says this: "Isolated discrepancies ... occur so regularly that no scientist could bring his research problems to an end if he paused for many of them. In any case, experience has repeatedly shown that in overwhelming proportion, these discrepancies disappear upon closer scrutiny." Because of this, Kuhn argues that "the efficient procedure" is to ignore them, a conclusion economists will find it easy to accept.
Gary Smith offers an overview of what these issues are all about in "Data Mining Fool's Gold" (Journal of Information Technology, posted "Online First" as a forthcoming paper on May 22, 2020, subscription needed for access). Smith offers what he calls "the paradox of big data":
It is tempting to believe that patterns are unusual and their discovery meaningful; in large data sets, patterns are inevitable and generally meaningless. ... Data-mining algorithms—often operating under the label artificial intelligence—are now widely used to discover statistical patterns. However, in large data sets streaks, clusters, correlations, and other patterns are the norm, not the exception. While data mining might discover a useful relationship, the number of possible patterns that can be spotted relative to the number that are genuinely useful has grown exponentially—which means that the chances that a discovered pattern is useful is rapidly approaching zero. This is the paradox of big data:
It would seem that having data for a large number of variables will help us find more reliable patterns; however, the more variables we consider, the less likely it is that what we find will be useful.
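To make the paradox concrete, here is a minimal simulation sketch in Python. It is not Smith's algorithm: the function name, the sample size of 100 observations, and the candidate counts are all assumptions chosen for illustration. Every series is pure random noise, so any correlation the search turns up is meaningless by construction.

```python
import numpy as np

def best_spurious_correlation(n_obs=100, var_counts=(10, 100, 1000, 10000), seed=0):
    """Strongest |correlation| between a random target and k unrelated random variables."""
    rng = np.random.default_rng(seed)
    target = rng.standard_normal(n_obs)                    # pure noise
    noise = rng.standard_normal((n_obs, max(var_counts)))  # pure noise, one column per variable

    # Standardize so that the mean of the products is the Pearson correlation.
    t = (target - target.mean()) / target.std()
    z = (noise - noise.mean(axis=0)) / noise.std(axis=0)
    abs_corr = np.abs(z.T @ t) / n_obs

    # Searching the first k columns mimics mining a data set with k variables.
    return {k: float(abs_corr[:k].max()) for k in var_counts}

results = best_spurious_correlation()
for k, r in results.items():
    print(f"{k:>6} unrelated variables: strongest |correlation| = {r:.2f}")
```

Because the larger searches include the smaller ones, the strongest correlation found can only rise as more variables are added, even though none of the variables has any connection to the target.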
Along with useful background discussion, Smith offers some vivid examples of data mining gone astray. For instance, Smith put a data-mining algorithm to work on President Donald Trump's tweets in the first three years of his term. He found: 
It turned out that the S&P 500 index of stock prices is predicted to be 97 points higher 2 days after a one-standard-deviation increase in the Trump’s use of the word president, ... my 10-fold cross-validation data-mining algorithm discovered that the low temperature in Moscow is predicted to be 3.30°F higher 4 days after a one-standard-deviation increase in Trump’s use of the word ever, and that the low temperature in Pyongyang is predicted to be 4.65°F lower 5 days after a one-standard-deviation increase in the use of the word wall. ... I considered the proverbial price of tea in China. I could not find daily data on tea prices in China, so I used the daily stock prices of Urban Tea, a tea product distributer headquartered in Changsha City, Hunan Province, China, with retail stores in Changsha and Shaoyang that sell tea and tea-based beverages. The data-mining algorithm found that Urban Tea’s stock price is predicted to fall 4 days after Trump used the word with more frequently.
Indeed, Smith created a random variable, put it into the data-mining algorithm, and found:
The data-mining algorithm found that a one-standard deviation increase in Trump’s use of the word democrat had a strong positive correlation with the value of this random variable 5 days later. The intended lessons are how easy it is for data-mining algorithms to find transitory patterns and how tempting it is to think up explanations after the fact. ... That is the nature of the beast we call data mining: seek and ye shall find.
At this point, a common response is something like: "Well, of course it's possible to find correlations that don't mean anything. What about the correlations that do mean something?" The response is true enough in some sense: perhaps in looking at a bucket-full of correlations, one of them may suggest a theory that can be tested in various ways to see if it has lasting power. But also notice that when you start making judgments that certain statistical findings "mean something" and other statistical findings derived in exactly the same way do not "mean something," you aren't actually doing statistics any more. The data isn't telling you the answer: you are deciding on other grounds what the answers are likely to be.

Smith's example about Trump's tweets is not just a hypothetical example. Several studies have tried to find whether Trump's tweets--or a selection of Google search terms, or other big data sets--would cause the stock market to rise or fall in some systematic way. 

Two other examples from Smith involve investment companies called EquBot and Voleon. They were going to run their investments with data-mining tools, not human intuition. But a funny thing happened on the way to the profits: returns at both companies somewhat underperformed the S&P 500. One plausible explanation is that the data-mining algorithms kept finding correlations that weren't lasting or valid, so when the algorithm made investments on the basis of those correlations, it was either investing mostly randomly, like an index fund, or even counterproductively.
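The idea of a correlation that is not lasting can be illustrated with a small holdout experiment. The sketch below is hypothetical and says nothing about how either firm actually trades: the "returns" and the 5,000 candidate "signals" are random noise, and the function name and sample sizes are assumptions. The code picks the signal that best fits the first 250 days, then checks how the same signal does over the next 250 days.

```python
import numpy as np

def mined_pattern_holdout(n_train=250, n_test=250, n_candidates=5000, seed=1):
    """Pick the best-fitting noise signal in-sample, then score it out-of-sample."""
    rng = np.random.default_rng(seed)
    n = n_train + n_test
    returns = rng.standard_normal(n)                  # hypothetical daily returns (noise)
    signals = rng.standard_normal((n, n_candidates))  # hypothetical candidate signals (noise)

    def corr_with_returns(rows):
        # Pearson correlation of every signal with returns over the chosen rows.
        r = returns[rows]
        s = signals[rows]
        r = (r - r.mean()) / r.std()
        s = (s - s.mean(axis=0)) / s.std(axis=0)
        return s.T @ r / len(r)

    train_corr = corr_with_returns(slice(0, n_train))
    test_corr = corr_with_returns(slice(n_train, n))
    best = int(np.argmax(np.abs(train_corr)))  # the "discovered" pattern
    return best, float(train_corr[best]), float(test_corr[best])

best, train_r, test_r = mined_pattern_holdout()
print(f"signal {best}: in-sample r = {train_r:.2f}, out-of-sample r = {test_r:.2f}")
```

The winning signal was selected precisely because it happened to line up with the training data, so its in-sample correlation overstates what it can do on data it has not seen. Trading on such a signal amounts to trading on noise.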

Or you may remember that about 10 years back, Google Flu Trends sought to predict flu outbreaks by looking at Google searches. The early evidence looked promising. Smith describes what followed: "However, after issuing its report, Google Flu Trends over-estimated the number of flu cases for 100 of the next 108 weeks, by an average of nearly 100% (Lazer et al., 2014). Google Flu Trends no longer makes flu predictions." As another example, a British insurance company in 2016 decided to base its car insurance rates on data mined from Facebook posts, and found that it was charging different rates according to whether people were more likely to mention Michael Jordan or Leonard Cohen, a process "which humans would recognize as ripe with errors and biases."

I do not mean to imply that using algorithms to search systematically through big data is not useful, just that the results must be interpreted with care and precision. Otherwise, ridiculousness can result. For some discussions of how big data and machine learning can be used in economics, starting points from the journal that I edit include: