Friday, November 25, 2011

Genetic Data and Economics: Problems in Drawing Inferences

A number of datasets that have economic and demographic information are also starting to have genetic information about the participants: in the U.S., some examples include the National Longitudinal Study of Adolescent Health, the Wisconsin Longitudinal Study, and the Health and Retirement Study. It is becoming possible, in other words, to look for connections between a person's genes and their education, income, and other economic outcomes. In the Fall 2011 issue of my own Journal of Economic Perspectives, Jonathan Beauchamp, David Cesarini, and a host of co-authors tackle the issue of drawing inferences from this data in "Molecular Genetics and Economics."

The fundamental problem in these studies is that humans have a lot of genes. To be more specific, each person has about 3 billion "base pairs" of DNA material, and "genes" are combinations of these base pairs. However, the human genome includes more than just genes and DNA; there is also RNA and all sorts of other stuff. Figuring out the interactions between DNA, RNA, various proteins, and other ingredients is exciting and cutting-edge work in the life sciences.

For social scientists, working with this data is tricky. Current technologies create data on about 500,000 possible individual differences at the base-pair level in genes; before long, it will be a million and more. To those marinated in a bit of statistics, the problem can be phrased this way: If you have 500,000 independent variables in a least-squares regression, a whole lot of them will be statistically "significant" at conventional levels just by chance. For those to whom that statement carried no particular meaning, think of it this way:

When social scientists look at data, they are always trying to distinguish a real pattern from a pattern that could have happened by chance. To understand the difference, imagine watching a person flip a coin 10 times, and get "heads" every time. The probability of getting "heads" 10 times in a row with a fair coin is .5 raised to the power of 10, or .0009766--which is roughly one in a thousand. If you see a pattern that happens by chance only one time in a thousand, you would strongly suspect something is going on. Maybe it's a two-headed coin? But now imagine that you start off with 500,000 people each flipping a coin. After they have all flipped a coin 10 times, on average 488 of them will have gotten 10 straight heads. In this context, observing 10 straight heads is just what happens a certain amount of the time because of random chance when you start with very large numbers of people.
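The arithmetic in the coin example can be checked in a few lines. This is only a sketch of the calculation described above; the variable names are my own.

```python
# Probability that one fair coin lands heads 10 times in a row.
p_ten_heads = 0.5 ** 10

# Expected number of people, out of 500,000, who flip 10 straight heads.
n_people = 500_000
expected_streaks = n_people * p_ten_heads

print(f"P(10 heads in a row) = {p_ten_heads:.7f}")                       # 0.0009766
print(f"Expected people with 10 straight heads = {expected_streaks:.1f}")  # 488.3
```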

Bottom line: When you observe a particular event in a fairly small group, you can have some confidence (never complete certainty!) as to whether it occurred by chance. But if you see the same event happen for a small proportion of those in a really big group, then it certainly could have happened by chance. When you have 500,000 pieces of genetic data, it's like having a really big group: some apparent connections will show up by chance alone.
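For those who like to see the statistical version of the same point, here is a rough simulation. It assumes, purely for illustration, that none of the 500,000 genetic variables has any real effect and that each is tested separately at the conventional 5 percent level. Under those assumptions each test's p-value is uniformly distributed between 0 and 1, so a random draw stands in for each test. The seed and variable names are my own.

```python
import random

rng = random.Random(2011)  # fixed seed so the run is repeatable

n_variables = 500_000  # number of genetic variables, as in the text
alpha = 0.05           # a conventional significance level

# Assumption: every variable is truly unrelated to the outcome, so each
# p-value behaves like a uniform random number on [0, 1].
false_positives = sum(1 for _ in range(n_variables) if rng.random() < alpha)

print(f"Expected false positives:  {n_variables * alpha:,.0f}")
print(f"Simulated false positives: {false_positives:,}")
```

Even with no real connections anywhere in the data, roughly 25,000 variables come out "significant."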

What's to be done? Beauchamp, Cesarini, and their co-authors suggest three steps.

First, a researcher who is working with 500,000 variables needs to demand a much more extreme event before concluding that a connection is real. If I'm starting with 500,000 people flipping coins, I want to see someone flip heads maybe 100 times in a row before I conclude that something other than random chance is happening here. There are statistical methods for making this kind of correction, but they are still a work in progress. Research has found 180 different base-pairs that seem to be associated with height, but perhaps many more need to be considered as well, and perhaps considered all at once, not one at a time.

Second, it becomes extremely important to do the same calculation with multiple different datasets, to see whether the results are replicated. In their JEP article, they look at genetic determinants of education in two different datasets--and fail to replicate the results.

Third, if you're going to have really large numbers of variables, it's useful to have really large populations in your data, which isn't yet true of most of the datasets in this area.
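To give a feel for the first step, here is a sketch using the Bonferroni correction, one simple and well-known way of raising the bar when many tests are run at once. This is my choice for illustration, not necessarily the method the authors use. The idea is to divide the usual 5 percent threshold by the number of tests, and then ask how long a streak of heads would be needed to clear that stricter bar.

```python
import math

n_tests = 500_000  # number of variables tested, as in the text
alpha = 0.05       # overall error rate one is willing to tolerate

# Bonferroni: each individual test must clear alpha divided by the number of tests.
per_test_threshold = alpha / n_tests

# Smallest streak length k for which 0.5 ** k is at or below that threshold.
streak_needed = math.ceil(math.log2(1 / per_test_threshold))

print(f"Per-test threshold: {per_test_threshold:.1e}")
print(f"Heads in a row needed under this rule: {streak_needed}")
```

The per-test threshold works out to about one in ten million, compared with one in twenty for a single test.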

In the same issue, Charles Manski also offers a comment on this research in "Genes, Eyeglasses, and Social Policy." Manski offers several useful insights on this research. For example:

First, a finding that genes cause an effect is totally different from deciding about appropriate social policy. It seems likely that genes are highly correlated with poor eyesight, for example, but that genetic condition is easily and cheaply remedied with corrective lenses. Social policy should be about costs and benefits, not about whether something is "caused" by genes.

Second, it's important to be cautious about interactions of genes, environment and outcomes. If one looked at genetic patterns and the propensity to eat with chopsticks, for example, one might find a statistical correlation. But the obvious reason is that many of those with the common genetic pattern are also living in a common society, and it's society rather than genes which is causing the correlation with chopsticks. In addition, certain traits like height are definitely highly heritable, but they can still shift substantially over time as the environment alters--as in the way that average human heights have increased in the last century.

Third, Manski expresses some doubt that brute-force statistical calculations with hundreds of thousands of possible explanatory variables will ever yield solid inferences about causality. Instead, he suggests that over time, biologists, medical researchers and social scientists will develop better insights about how genes and all the rest of the activity in the human genome affect various traits. It will then be somewhat easier--if never actually easy--to understand cause and effect.
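Manski's chopsticks example can be made concrete with a toy simulation. Everything here is hypothetical: two made-up societies, a made-up genetic variant that is common in one and rare in the other, and made-up rates of chopstick use. Within each society, the variant has no connection at all to chopstick use. Pooled together, though, carriers of the variant look far more likely to use chopsticks.

```python
import random

rng = random.Random(0)  # fixed seed so the run is repeatable

# Hypothetical societies: (name, share carrying the variant, share using chopsticks).
societies = [("A", 0.8, 0.9), ("B", 0.2, 0.1)]

people = []
for name, variant_share, chopstick_share in societies:
    for _ in range(20_000):
        has_variant = rng.random() < variant_share
        # Chopstick use depends only on the society, never on the variant.
        uses_chopsticks = rng.random() < chopstick_share
        people.append((name, has_variant, uses_chopsticks))

def chopstick_gap(group):
    """Chopstick rate among carriers minus the rate among non-carriers."""
    carriers = [uses for _, has, uses in group if has]
    others = [uses for _, has, uses in group if not has]
    return sum(carriers) / len(carriers) - sum(others) / len(others)

pooled_gap = chopstick_gap(people)
gap_in_a = chopstick_gap([p for p in people if p[0] == "A"])
gap_in_b = chopstick_gap([p for p in people if p[0] == "B"])

print(f"Pooled gap:        {pooled_gap:+.3f}")
print(f"Gap within soc. A: {gap_in_a:+.3f}")
print(f"Gap within soc. B: {gap_in_b:+.3f}")
```

The pooled gap is large while the gaps within each society hover near zero, which is the pattern one would expect when society, not the gene, is doing the work.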