Monday, April 20, 2015

The Data Revolution and Economic Research

Empirical research in economics is being revolutionized (and no, that word is not too strong) by two major new sources of data: administrative data and private sector data. Liran Einav and Jonathan Levin explain in "Economics in the age of big data," which appears in Science magazine, November 7, 2014 (vol 346, issue 6210; the "Review Summary" is p. 715 and the "Review" article itself is pp. 1243089-1 to 1243089-6). Science is not freely available online, but many readers will have access through library subscriptions.

To grasp the magnitude of the change, you need to know that two or more decades ago, economists had only a few main sources of data: there was data produced by the government for public consumption, like all the economic statistics from the Bureau of Economic Analysis or the Bureau of Labor Statistics, the surveys from the US Census, and a few other major surveys. Sometimes, economists also constructed their own data by working in library archives or carrying out their own surveys. For example, I remember as an undergraduate back around 1980 doing basic empirical exercises where you wrote programs (stored on punch cards!) to find correlations between GDP, unemployment, interest rates, and car sales. I remember as a graduate student in the early 1980s compiling data on the miles-per-gallon of new cars, which involved collecting the annual paper brochures from the US Department of Transportation and then inputting the data to a computer file (no more punch cards by then!). As Einav and Levin put it: "Even 15 or 20 years ago, interesting and unstudied data sets were a scarce resource."

One of the major new sources of data is "administrative data," which is data collected by the government in the course of administering various programs. As Einav and Levin point out, some of the most prominent results in empirical economics in recent years are based on administrative data.

For example, the evidence that most of the rise in income inequality is at the very top of the income distribution is based on IRS tax data. Evidence on wide variation in health care spending, how people and providers react to different health insurance provisions, and the use of certain health care treatments across states (thus implying that some health care providers in some states may be overdiagnosing or underdiagnosing) is often based on administrative data from Medicare and Medicaid. Evidence on how teachers can affect student academic achievement is based on a combination of student test scores and the patterns of how teachers are assigned to classrooms.

Of course, the use of administrative data for research raises legitimate privacy issues. But just to be clear, it's the existence of this data in government hands that raises the privacy concerns in the first place. Before the administrative data is received by researchers, it is "anonymized" so that it should be impossible to identify individuals. Einav and Levin sum up:
The potential of administrative data for academic research is just starting to be realized, and substantial challenges remain. This is particularly true in the United States, where confidentiality and privacy concerns, as well as bureaucratic hurdles, have made accessing administrative data sets and linking records between these data sets relatively cumbersome. European countries such as Norway, Sweden, and Denmark have gone much farther to merge distinct administrative records and facilitate research. ... However, even with today’s somewhat piecemeal access to administrative records, it seems clear that these data will play a defining role in economic research over the coming years.

The other major source of new data comes from private efforts, either by firms or by researchers. Your credit card company, your insurance company, your cable access provider, and many other firms have a lot of information about your life and your preferences. They are already doing in-house research on this data, but in some cases, they are pairing with research economists to work on anonymized forms of the data. For example, Einav and Levin have done research with eBay data on how Internet sales taxes affect online shopping.

Some companies are taking the next step and publishing data. Einav and Levin write:
Already the payroll service company ADP publishes monthly employment statistics in advance of the Bureau of Labor Statistics, MasterCard makes available retail sales numbers, and Zillow generates house price indices at the county level. These data may be less definitive than the eventual government statistics, but in principle they can be provided faster and perhaps at a more granular level, making them useful complements to traditional economic statistics. 

Similarly, Google publishes a "Flu Trends" list which seeks to provide early warning of flu outbreaks, faster than the statistics from the Centers for Disease Control and Prevention, by using data from search queries.

Researchers can create their own data sets by "scraping" the web: that is, by writing programs that will download data from various websites at regular intervals. One of the best-known of these projects is the Billion Prices Project run by Alberto Cavallo and Roberto Rigobon at MIT. Their program downloads detailed data on prices and product characteristics from websites all over the world every day on hundreds of thousands of products. For a sense of the findings that can emerge from this kind of study, here's one graph showing the US price level as measured by the Billion Prices Project and the official Consumer Price Index. They are fairly close. Next, look at the price level from the Billion Prices Project and the official measure of inflation in Argentina. It's strong publicly available evidence that the government in Argentina is gaming its inflation statistics.
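
For readers curious what "scraping" looks like in practice, here is a minimal Python sketch. To be clear, this is not the Billion Prices Project's actual code or method. The page markup (a `class="product"` tag with `data-name` and `data-price` attributes), the two sample pages, and the function names are all hypothetical, and the index formula (a geometric mean of price changes for products seen on both days) is just one simple choice among many. A real scraper would download the pages from the web on a schedule; here two short strings stand in for the pages fetched on two different days.

```python
from html.parser import HTMLParser
from math import exp, log


class PriceParser(HTMLParser):
    """Collect {product name: price} from tags carrying price attributes."""

    def __init__(self):
        super().__init__()
        self.prices = {}

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        # Hypothetical markup: <li class="product" data-name=".." data-price="..">
        if attrs.get("class") == "product" and "data-price" in attrs:
            self.prices[attrs["data-name"]] = float(attrs["data-price"])


def scrape_prices(html):
    """Parse one downloaded page into a dict of product prices."""
    parser = PriceParser()
    parser.feed(html)
    return parser.prices


def price_index(base, current):
    """Geometric mean of price ratios for products seen on both days (base = 100)."""
    common = sorted(set(base) & set(current))
    if not common:
        raise ValueError("no products in common between the two days")
    mean_log = sum(log(current[name] / base[name]) for name in common) / len(common)
    return 100 * exp(mean_log)


# Made-up stand-ins for the same web page downloaded on two different days.
DAY_1 = (
    '<ul><li class="product" data-name="milk" data-price="2.00"></li>'
    '<li class="product" data-name="bread" data-price="4.00"></li></ul>'
)
DAY_2 = (
    '<ul><li class="product" data-name="milk" data-price="2.20"></li>'
    '<li class="product" data-name="bread" data-price="4.40"></li>'
    '<li class="product" data-name="eggs" data-price="3.00"></li></ul>'
)

index = price_index(scrape_prices(DAY_1), scrape_prices(DAY_2))
print(round(index, 1))  # milk and bread both rose 10%, so this prints 110.0
```

Notice that eggs appear only on the second day and so are left out of the index. Deciding how to handle products that appear and disappear is one of the judgment calls any project of this kind has to make.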

Finally, many more economists are creating their own data by carrying out their own social experiments and surveys.

The new sources of data are changing the emphasis of published economic research. If you go back about 30 years, the majority of papers appearing in top research journals were theoretical--that is, they contained either no data or a few bits of illustrative data. Now, about 70% of the papers in top economics journals are primarily empirical and data-based. Einav and Levin offer evidence that for empirical papers (not including experiments designed by the researcher), only about 5-10% of the papers used administrative or private data back in 2006-2008, but by 2013 and 2014, the share of empirical papers using administrative and private data was nearly half. The tools for collecting and using this administrative and private-sector data are different in many ways from what economists have traditionally done. Careers, reputations, and eventually even Nobel prizes will be built on this body of work.