In a talk I gave at the University of Virginia in the early 1960s ... I said that if you torture the data enough, nature will always confess, a saying which, in a somewhat altered form, has taken its place in the statistical literature. Kuhn puts the point more elegantly and makes the process sound more like a seduction: "nature undoubtedly responds to the theoretical predispositions with which she is approached by the measuring scientist." I observed that a failure to get an exact fit between the theory and the quantitative results is not generally treated as calling for the abandonment of the theory but the discrepancies are put on one side as something calling for further study. Kuhn says this: "Isolated discrepancies ... occur so regularly that no scientist could bring his research problems to an end if he paused for many of them. In any case, experience has repeatedly shown that in overwhelming proportion, these discrepancies disappear upon closer scrutiny." Because of this, Kuhn argues that "the efficient procedure" is to ignore them, a conclusion economists will find it easy to accept.
It is tempting to believe that patterns are unusual and their discovery meaningful; in large data sets, patterns are inevitable and generally meaningless. ... Data-mining algorithms—often operating under the label artificial intelligence—are now widely used to discover statistical patterns. However, in large data sets streaks, clusters, correlations, and other patterns are the norm, not the exception. While data mining might discover a useful relationship, the number of possible patterns that can be spotted relative to the number that are genuinely useful has grown exponentially—which means that the chances that a discovered pattern is useful is rapidly approaching zero. This is the paradox of big data:
It would seem that having data for a large number of variables will help us find more reliable patterns; however, the more variables we consider, the less likely it is that what we find will be useful.
It turned out that the S&P 500 index of stock prices is predicted to be 97 points higher 2 days after a one-standard-deviation increase in Trump’s use of the word president, ... my 10-fold cross-validation data-mining algorithm discovered that the low temperature in Moscow is predicted to be 3.30°F higher 4 days after a one-standard-deviation increase in Trump’s use of the word ever, and that the low temperature in Pyongyang is predicted to be 4.65°F lower 5 days after a one-standard-deviation increase in the use of the word wall. ... I considered the proverbial price of tea in China. I could not find daily data on tea prices in China, so I used the daily stock prices of Urban Tea, a tea product distributer headquartered in Changsha City, Hunan Province, China, with retail stores in Changsha and Shaoyang that sell tea and tea-based beverages. The data-mining algorithm found that Urban Tea’s stock price is predicted to fall 4 days after Trump used the word with more frequently.
The data-mining algorithm found that a one-standard-deviation increase in Trump’s use of the word democrat had a strong positive correlation with the value of this random variable 5 days later. The intended lessons are how easy it is for data-mining algorithms to find transitory patterns and how tempting it is to think up explanations after the fact. ... That is the nature of the beast we call data mining: seek and ye shall find.
At this point, a common response is something like: "Well, of course it's possible to find correlations that don't mean anything. What about the correlations that do mean something?" The response is true enough in some sense: perhaps in looking at a bucket-full of correlations, one of them may suggest a theory that can be tested in various ways to see if it has lasting power. But also notice that when you start making judgments that certain statistical findings "mean something" and other statistical findings derived in exactly the same way do not "mean something," you aren't actually doing statistics any more. The data isn't telling you the answer: you are deciding on other grounds what the answers are likely to be.
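A small simulation can make the "seek and ye shall find" point concrete. The sketch below is not Smith's algorithm. It is a toy illustration in which every series is pure random noise, and the function name `best_spurious_correlation`, the sample size of 50 observations, and the variable counts are all assumptions chosen for the example. It searches through more and more noise variables and reports the strongest correlation it can find with an equally random target.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def best_spurious_correlation(n_vars, n_obs=50, seed=0):
    """Search n_vars pure-noise series for the one most correlated
    (in absolute value) with a pure-noise target.

    Hypothetical helper for illustration: no series has any real
    relationship to the target, so every correlation found is spurious.
    """
    rng = random.Random(seed)
    target = [rng.gauss(0, 1) for _ in range(n_obs)]
    best = 0.0
    for _ in range(n_vars):
        candidate = [rng.gauss(0, 1) for _ in range(n_obs)]
        best = max(best, abs(pearson(candidate, target)))
    return best


# Same seed each time, so each larger search contains the smaller ones.
results = {k: best_spurious_correlation(k) for k in (1, 10, 100, 1000)}
for k, r in results.items():
    print(f"{k:>5} noise variables: strongest |correlation| = {r:.2f}")
```

Because each larger search includes the smaller ones, the strongest correlation can only stay the same or rise as more variables are considered, even though nothing in the data is related to anything else. The search procedure has no way of telling a "meaningful" winner from this kind of noise.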
Smith's example about Trump's tweets is not just a hypothetical example. Several studies have tried to find whether Trump's tweets--or a selection of Google search terms, or other big data sets--would cause the stock market to rise or fall in some systematic way.
Two other examples from Smith involve investment companies called Equbot and Voleon. They were going to run their investments with data-mining tools, not human intuition. But a funny thing happened on the way to the profits: returns at both companies somewhat underperformed the S&P 500. One plausible explanation is that the data-mining algorithms kept finding correlations that weren't lasting or valid, so when the algorithm made investments on the basis of those correlations, it was either investing mostly randomly, like an index fund, or even counterproductively.
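The idea that mined correlations may not last can also be sketched in a few lines. This is only an illustration under stated assumptions, not a model of how either firm actually invests: all series are random noise, and the helper `mine_then_validate`, the 200 candidate variables, and the 50/50 split of 100 observations are hypothetical choices. The code picks the predictor with the strongest correlation in the first half of the sample and then checks how that same predictor does in the second half.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def mine_then_validate(n_vars=200, n_obs=100, seed=0):
    """Pick the noise series that best 'predicts' a noise target in the
    first half of the sample, then score it on the second half.

    Returns (in_sample_r, out_of_sample_r). Hypothetical helper.
    """
    rng = random.Random(seed)
    target = [rng.gauss(0, 1) for _ in range(n_obs)]
    half = n_obs // 2
    best_r, best_series = 0.0, None
    for _ in range(n_vars):
        candidate = [rng.gauss(0, 1) for _ in range(n_obs)]
        r = pearson(candidate[:half], target[:half])
        if abs(r) > abs(best_r):
            best_r, best_series = r, candidate
    out_r = pearson(best_series[half:], target[half:])
    return best_r, out_r


# Repeat the experiment with different seeds and average the results.
trials = [mine_then_validate(seed=s) for s in range(20)]
mean_in = sum(abs(i) for i, _ in trials) / len(trials)
mean_out = sum(abs(o) for _, o in trials) / len(trials)
print(f"average |r| where the pattern was mined: {mean_in:.2f}")
print(f"average |r| on later, unseen data:       {mean_out:.2f}")
```

In this setup the pattern that looked best during the search tends to shrink toward zero on fresh data, which is what one would expect if a strategy built on it ended up trading more or less at random.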
I do not mean to imply that using algorithms to search systematically through big data is not useful, just that the results must be interpreted with care and precision. Otherwise, ridiculousness can result. For some discussions of how big data and machine learning can be used in economics, starting points from the journal that I edit include:
- Mullainathan, Sendhil, and Jann Spiess. 2017. "Machine Learning: An Applied Econometric Approach." Journal of Economic Perspectives, 31 (2): 87-106.
- Athey, Susan, and Guido W. Imbens. 2017. "The State of Applied Econometrics: Causality and Policy Evaluation." Journal of Economic Perspectives, 31 (2): 3-32.
- Varian, Hal R. 2014. "Big Data: New Tricks for Econometrics." Journal of Economic Perspectives, 28 (2): 3-28.