Friday, August 2, 2019

Administrative Data Moves Toward Center Stage

For the later decades of the 20th century, the most common source of data for economic studies was surveys conducted by government statisticians. There were household surveys like the Current Population Survey, the Survey of Income and Program Participation, the Consumer Expenditure Survey, the National Health Interview Survey, and the General Social Survey. There were government workers collecting data on prices at stores as input to measures of inflation like the Consumer Price Index. There were business surveys, like the Economic Census, the Retail Trade Survey, the Annual Survey of Manufactures, the Residential Finance Survey, and others. Branches of government like the Department of Energy, the Department of Agriculture, and the Health Care Financing Administration collected data on specific industries. There was also a Census of Governments to get data on state and local government. The Bureau of Economic Analysis would pull together data from these sources and others to estimate GDP.

But over time, a split developed. On one side was this body of data created by government for use by business and policy-makers, as well as researchers. On the other side was a vast amount of data collected in the process of administering government programs. Often this administrative data was not formatted or organized in a way that researchers could use. Moreover, the administrative data was often siloed in one government agency; for example, student grades and their academic progress were traditionally kept inside school districts or in some cases state departments of education, and not easily connected to other data that might explain patterns of school grades, either within a school year or over time.

There were gaps between the government-collected survey data and the administrative data. As one example, the administrative data told how much the government had paid out in welfare benefits and food stamps, but households responding to surveys reported receiving only about half of that total.

There has been a considerable movement toward the use of administrative data. The Russell Sage Foundation Journal of the Social Sciences put out a double issue in March 2019 on the theme of "Using Administrative Data for Science and Policy" with a range of examples. From the introductory essay by Andrew M. Penner and Kenneth A. Dodge:
Research using administrative data has much in common with history and archeology, insofar as it observes the tracks that individuals leave as they move through society and draws lessons from these glimpses into their lives. ...

Given their origin in a particular institutional context, administrative records are typically fragmented, and these data are often not linked to other data that would be useful for research and policy. Hospitals, for example, collect detailed information about patients’ health, schools regularly collect information about student development, and employers often keep records not only about the performance of employees, but also about applicants who were ultimately not offered positions. Although various combinations of these data can provide important insights, they are typically compartmentalized. Likewise, given their origin, administrative records often lack certain kinds of information that are less likely to be collected in these records. For example, information about attitudes, affinities, and motives are not often collected in administrative records. Combining administrative data with records from other sources—either by linking administrative records across sources or by making administrative records available to be linked to data collected via other means—is thus central to building administrative data infrastructure. ...

By virtue of how they come into existence, administrative data are typically focused on one facet of an individual’s life, and data and insights are often siloed. ... Administrative records from birth, education, criminal justice, labor market, and mortality often capture different points in an individual’s life; combining data across these stages allows us to understand how inequalities unfold over the arc of an individual’s life. ... 
One clear example is in education. Despite their focus on preparing students who are “college- and career-ready,” schools have historically struggled to obtain data on the practices that will prepare their students to be successful because widespread links between students in K–12 educational systems and higher education outcomes have become available only recently, and links between K–12 data systems and the labor market remain relatively rare. These data linkages are important to understand the efficacy of school-based vocational programs, dropout recovery interventions, college readiness programs, and advancement placement course policies. But schools, like other organizations, typically lack the capacity and expertise to build this infrastructure and analyze the resulting data.
At a time when we are all sensitized to how big tech companies are gathering, combining, and marketing our personal data, the rise of administrative data clearly has a concerning side. Consider the footprints that many of us have left in administrative data over the years, about our education, physical and mental health, finances, how much we were paid, tax filings, car and real estate ownership, Social Security contributions, benefits from government programs, and many more--right down to the books checked out of the public library.

The obvious challenge is to find ways to use administrative data where protection of personal privacy is built in from the start. For example, when the unemployment rate is calculated from the Current Population Survey, no one is concerned that unemployment at specific identified households will become public as a result. In this issue, a paper by David B. Grusky, Michael Hout, Timothy M. Smeeding, and C. Matthew Snipp describes "The American Opportunity Study: A New Infrastructure for Monitoring Outcomes, Evaluating Policy, and Advancing Basic Science." They write:

The American Opportunity Study (AOS) ...  is an ongoing effort to link the censuses of 1960 through 2010 and the American Community Surveys (ACS) and thereby convert cross-sectional decennial census data into a bona fide panel that will represent the full U.S. population over the last seventy years. Because this panel will be continuously refreshed as additional census and ACS data become available, it can serve as a population-level scaffolding on which other administrative data (such as tax records, earnings reports, program data) are then hung. ... In other countries that have linked data, such as Wales and New Zealand, a well-developed infrastructure allows access to carefully vetted scholars, with the result that high-quality evidence is more frequently brought to bear on policy decisions.
I can ramble on a bit about the merits of administrative data for research. It covers everyone, and thus allows detailed analysis of various subgroups, tracking people over time, and even looking across generations. It describes what government programs have actually done, which can then be compared and combined with surveys of households or businesses. But rather than talking in generalities, let me just mention some of the studies from this double issue. Notice in particular how the studies often use administrative data, sometimes from separate government agencies, in a way that addresses a worthwhile question.

From Sean F. Reardon, "Educational Opportunity in Early and Middle Childhood: Using Full Population Administrative Data to Study Variation by Place and Age":
I use standardized test scores from roughly forty-five million students to describe the temporal structure of educational opportunity in more than eleven thousand school districts in the United States. Variation among school districts is considerable in both average third-grade scores and test score growth rates. The two measures are uncorrelated, indicating that the characteristics of communities that provide high levels of early childhood educational opportunity are not the same as those that provide high opportunities for growth from third to eighth grade.
From Janelle Downing and Tim Bruckner, "Subprime Babies: The Foreclosure Crisis and Initial Health Endowments":
This research uses a probabilistic matching strategy to link foreclosure records with birth certificate records from 2006 to 2010 in California to identify birth parents who experienced a foreclosure. ... [We] find that infants in gestation during or after the foreclosure had a lower birth weight for gestational age than those born earlier, suggesting that the foreclosure crisis was a plausible contributor to disparities in initial health endowments.
From Agustina Laurito, Johanna Lacoe, Amy Ellen Schwartz, Patrick Sharkey, and Ingrid Gould Ellen, "School Climate and the Impact of Neighborhood Crime on Test Scores":
Using administrative data from the New York City Department of Education and the New York City Police Department, we find that exposure to violence in the residential neighborhood and an unsafe climate at school lead to substantial test score losses in English language arts (ELA).
From Roberto M. Fernandez and Brian Rubineau, "Network Recruitment and the Glass Ceiling: Evidence from Two Firms":
Does network recruitment contribute to the glass ceiling? We use administrative data from two companies to answer the question. In the presence of gender homophily, recruitment through employee referrals can disadvantage women when an old boys’ network is in place. We calculate the segregating effects of network recruitment across multiple job levels in the two firms. If network recruitment is a factor, the segregating impact should disadvantage women more at higher levels. We find this pattern, but also find that network recruitment is a desegregating force overall. It promotes women’s representation strongly at all levels, but less so at higher levels.
One final thought: Using administrative data often requires academic researchers to become entrepreneurial about seeking out such data, working with the government agencies or private firms that hold the original data, finding ways to offer cast-iron reassurances about personal privacy, and only then being able to actually work with the data and see if something interesting emerges. For modern economists, this process is quite different from the old days of digging through data collected and made public in a way already prepared for their use by government agencies.