Thursday, January 21, 2021

The Reproducibility Challenge with Economic Data

One basic standard of economic research is surely that someone else should be able to reproduce what you have done. They don't have to agree with what you've done. They may think your data is terrible and your methodology is worse. But as a minimal standard, they should be able to reproduce your result, so that the follow-up research can then be in a position to think about what might have been done differently or better.  This standard may seem obvious, but during the last 30 years or so, the methods for reproducibility have been transformed. 

Lars Vilhuber describes the shift in "Reproducibility and Replicability in Economics" in the Harvard Data Science Review (Fall 2020 issue, published December 21, 2020). Vilhuber is the Data Editor for the journals published by the American Economic Association (including the Journal of Economic Perspectives where I work as Managing Editor). Thus, he heads the group which oversees posting of data and code for new empirical results in AEA journals--including making sure that an outsider can use the data and code to reproduce the actual results reported in the paper. 

To jump to the bottom line, Vilhuber writes: "Still, after 30 years, the results of reproducibility studies consistently show problems with about a third of reproduction attempts, and the increasing share of restricted-access data in economic research requires new tools, procedures, and methods to enable greater visibility into the reproducibility of such studies."

It's worth noting that reproducibility has come a long way. Back in the 1980s and earlier, researchers who had completed a published empirical research paper, but then moved on to other topics, often did not keep their data or code--or if they did keep them, the data and code were often full of idiosyncratic formats and labelling that worked fine for the original researcher (or perhaps for the research assistants of the original researcher who did a lot of the spadework), but could be impenetrable to a would-be outside replicator. By contrast, a fair share of modern economics research can post the actual data, computer code, documentation for what was done, and so on. In this situation, you may disagree with how the researcher chose to proceed, but you can at least reproduce their result easily.

However, here I want to emphasize that a lot of the difficulties with reproducibility arise because finding the actual data used in an economic study is not as easy as one might think. Non-economists often think of economic data in terms of publicly available data series like GDP, inflation, or unemployment, which anyone can look up on the internet. But economic research often goes well beyond these extremely well-known data sources. One big shift has been to the use of "administrative" data, which is a catch-all term to describe data that was not collected for research purposes, but instead developed for administrative reasons. Examples would include tax data from the Internal Revenue Service, data on earnings from the Social Security Administration, data on details of health care spending from Medicare and Medicaid, and education data on teachers and students collected by school districts. There is also private-sector administrative data about issues from financial markets to cell-phone data, credit card data, and "scanner" data generated by cash registers when you, say, buy groceries. 

Vilhuber writes: "In 1960, 76% of empirical AER [American Economic Review] articles used public-use data. By 2010, 60% used administrative data, presumably none of which is public use ..."

You can't just write to, say, the Internal Revenue Service and ask to see all the detailed data from tax returns. Nor can you directly access detailed data from Social Security or Medicare or a school district, or from what people reported in the US Census. There are obvious privacy concerns here. 

Thus, one change in recent years is what are called "restricted access data environments," where accredited researchers can get access to detailed data, but in ways that protect individual privacy. For example, there are now 30 Federal Statistical Research Data Centers around the country, mostly located close to big universities. Vilhuber writes (citations omitted):

It is worth pointing out the increase in the past 2 decades of formal restricted-access data environments (RADEs), sponsored or funded by national statistical offices and funding agencies. RADE networks, with formal, nondiscriminatory, albeit often lengthy access protocols, have been set up in the United States (FSRDC), France, and many other countries. Often, these networks have been initiated by economists, though widespread use is made by other social scientists and in some cases health researchers. RADE are less common for private-sector data, although several initiatives have made progress and are frequently used by researchers: Institute for Research on Innovation and Science, Health Care Cost Institute, Private Capital Research Institute (PCRI). When such nondiscriminatory agreements are implemented at scale, a significant number of researchers can obtain access to these data under strict security protocols. As of 2018, the FSRDC hosted more than 750 researchers on over 300 projects, of which 140 had started within the last 12 months. The IAB FDZ [a source of German employment data] lists over 500 projects active as of September 2019, most with multiple authors. In these and other networks, many researchers share access to the same data sets, and could potentially conduct reproducibility studies. Typically, access is via a network of secure rooms (FSRDC, Canada, Germany), but in some cases, remote access via ‘thin clients’ (France) or virtual desktop infrastructure (some Scandinavian countries, data from the Economic Research Service of the United States Department of Agriculture [USDA] via NORC) is allowed.

A common situation is that this kind of data cannot be put into the public domain; instead, you would need to apply for and gain access to the "restricted access data environment," and work with the data there.

Another issue is that in some of these data sources, researchers are not given access to all of the data; instead, to protect privacy, they are given an extract of the overall data. As a result, two researchers who go to the data center and make the same data request will not get the same data. The overall patterns in the data should be pretty close, if random samples are used, but they won't be the same. Vilhuber writes: 

Some widely used data sets are accessible by any researcher, but the license they are subject to prevents their redistribution and thus their inclusion as part of data deposits. This includes nonconfidential data sets from the Health and Retirement Study (HRS) and the Panel Study of Income Dynamics (PSID) at the University of Michigan and data provided by IPUMS at the Minnesota Population Center. All of these data can be freely downloaded, subject to agreement to a license. IPUMS lists 963 publications for 2015 alone that use one of its data sources. The typical user will create a custom extract of the PSID and IPUMS databases through a data query system, not download specific data sets. Thus, each extract is essentially unique. Yet that same extract cannot be redistributed, or deposited at a journal or any other archive. In 2018, the PSID, in collaboration with ICPSR, has addressed this issue with the PSID Repository, which allows researchers to deposit their custom extracts in full compliance with the PSID Conditions of Use.
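To make the point about extracts concrete, here is a minimal Python sketch of two researchers making the same request and receiving different random extracts. Everything in it is an assumption for illustration: the "confidential file" is made-up earnings data, and the function name `draw_extract`, the sample sizes, and the seeds are hypothetical, not how any actual data center works.

```python
import random
import statistics


def draw_extract(population, size, seed):
    """Return a simple random sample of `size` records, drawn with its own seed."""
    rng = random.Random(seed)
    return rng.sample(population, size)


# Hypothetical "full" confidential file: 100,000 made-up annual earnings figures.
population_rng = random.Random(0)
population = [population_rng.lognormvariate(10.5, 0.6) for _ in range(100_000)]

# Two researchers make the same request but receive independently drawn extracts.
extract_a = draw_extract(population, 10_000, seed=1)
extract_b = draw_extract(population, 10_000, seed=2)

mean_a = statistics.fmean(extract_a)
mean_b = statistics.fmean(extract_b)

# Gap between the two extract means, as a share of the full-file mean.
relative_gap = abs(mean_a - mean_b) / statistics.fmean(population)

print(f"mean earnings, extract A: {mean_a:,.0f}")
print(f"mean earnings, extract B: {mean_b:,.0f}")
print(f"relative gap between the two: {relative_gap:.2%}")
print(f"extracts identical? {extract_a == extract_b}")
```

In this sketch the two averages land close together, but the underlying records differ, so a replicator holding extract B cannot reproduce the exact numbers computed from extract A. Only redrawing with the same seed, or depositing the extract itself, gets back the identical file.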

Yet another issue arises with data from commercial sources, which often require a fee to access: 

Commercial (‘proprietary’) data is typically subject to licenses that also prohibit redistribution. Larger companies may have data provision as part of their service, but providing it to academic researchers is only a small part of the overall business. Dun and Bradstreet’s Compustat, Bureau van Dijk’s Orbis, Nielsen Scanner data via the Kilts Center at Chicago Booth (Kilts Center, n.d.), or Twitter data are all used frequently by economists and other social scientists. But providing robust and curated archives of data as used by clients over 5 or more years is typically not part of their service.

Research using social media data can pose particular problems for someone who wants to reproduce the study using the same data:

Difficulties when citing data are compounded when the data is either changing, or is a potentially ill-defined subset of a larger static or dynamic databases. ‘Big data’ have always posed challenges—see the earlier discussion of the 1950s–1960s demand for access to government databases. By nature, they most often fall into the ‘proprietary’ and ‘commercial’ category, with the problems that entails for reproducibility. However, beyond the (solvable) problem of providing replicators with authorized access and enough computing resources to replicate original research, even defining or acquiring the original data inputs may be hard. Big data may be ephemerous by nature, too big to retain for significant duration (sometimes referred to as ‘velocity’), temporally or cross-sectionally inconsistent (variable specifications change, sometimes referred to as ‘variety’). This may make computational reproducibility impossible. ... For instance, a study that uses data from an ephemerous social media platform where posts last no more than 24 hours (‘velocity’) and where the data schema may mutate over time (‘variety’) may not be computationally reproducible, because the posts will have been deleted (and terms of use may prohibit redistribution of any scraped data). But the same data collection (scraping or data extraction) can be repeated, albeit with some complexity in reprogramming to address the variety problem, leading to a replication study.

Finally, there is a problem of "cleaning" data. "Raw" data always has errors. Sometimes data isn't filled in. Other times it may show a nonsensical finding, like someone having a negative level of income in a year, or an entry where it looks as if several zeros were added to a number by accident. Thus, the data needs to be "cleaned" before it's used. For well-known data, there are archives of documentation for how data has been cleaned, and why. But for lots of data, the documentation for how it has been cleaned isn't available. Vilhuber writes:

While in theory, researchers are able to at least informally describe the data extraction and cleaning processes when run on third-party–controlled systems that are typical of big data, in practice, this does not happen. An informal analysis of various Twitter-related economics articles shows very little or no description of the data extraction and cleaning process. The problem, however, is not unique to big-data articles—most articles provide little if any input data cleaning code in reproducibility archives, in large part because provision of the code that manipulates the input data is only suggested, but not required by most data deposit policies.
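As a rough illustration of what documented cleaning can look like, here is a minimal Python sketch that applies the kinds of checks mentioned above (missing entries, negative income, a number with too many zeros) and keeps a log of every record it drops and why. The function name `clean_incomes`, the record layout, and the `max_plausible` cutoff are all hypothetical choices made for this example; real cleaning rules depend on the data set and would need their own justification.

```python
def clean_incomes(records, max_plausible=10_000_000):
    """Split raw records into kept rows and a log of dropped rows with reasons.

    `max_plausible` is an illustrative cutoff, not a recommended value.
    """
    cleaned, log = [], []
    for rec in records:
        income = rec.get("income")
        if income is None:
            log.append((rec["id"], "missing income"))
        elif income < 0:
            log.append((rec["id"], "negative income"))
        elif income > max_plausible:
            log.append((rec["id"], "implausibly large income"))
        else:
            cleaned.append(rec)
    return cleaned, log


# Made-up raw records, including the kinds of errors described in the text.
raw = [
    {"id": 1, "income": 52_000},
    {"id": 2, "income": None},
    {"id": 3, "income": -4_000},
    {"id": 4, "income": 61_000_000_000},
    {"id": 5, "income": 38_500},
]

cleaned, log = clean_incomes(raw)

print("kept ids:", [rec["id"] for rec in cleaned])
for rec_id, reason in log:
    print(f"dropped id {rec_id}: {reason}")
```

The cleaning rules live in code and every dropped record is accounted for, so depositing a script like this alongside the analysis code would let an outsider see exactly how the raw data became the data used in the paper.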

As a final thought, I'll point out that academic researchers have mixed incentives when it comes to data. They always want access to new data, because new data is often a reliable pathway to published papers that can build a reputation and a paycheck. They often want access to the data used by rival researchers, to understand and to critique their results. But making access available to details of their own data doesn't necessarily help them much. 

For example, imagine that you write a prominent academic paper, and all the data is widely available. The chances are good that for years to come, your paper will become target practice for economics students and younger faculty members, who want to critique you and to make you justify all the choices you made in the research. However, you may have a reasonable dislike of spending large chunks of the rest of your career going over the same ground, again and again.

From this standpoint, it's perhaps not surprising that while many leading journals of economics now do require that authors publish their computer code and as much of their data as they are allowed to share, the number of papers that get "exceptions" from publishing their data is rising. Moreover, supplying data and computer code is not required for submitting a paper, nor for the decision about whether to publish it (although other professors refereeing the paper can make a request to see the data and code, if they wish).

It's also maybe not a surprise that a study of one prominent journal looked at papers published from 2009 to 2013 and found that, among the papers where data was not posted online, only about one-third used data that others could obtain in a reasonably straightforward way.

And it's also maybe not a surprise that more and more papers are published with data that you have to be an official researcher to access, through a restricted access data center, which presents some hurdles to those not well-connected in the research community. 

Access to data and computer code behind economic research has improved, and improved a lot, since the pre-internet age. But in many cases, it's still far from easy.