Monday, March 25, 2019

Time to Abolish "Statistical Significance"?

The idea of "statistical significance" has been a basic concept in introductory statistics courses for decades. If you spend any time looking at quantitative research, you will often see in tables of results that certain numbers are marked with an asterisk or some other symbol to show that they are "statistically significant."

For the uninitiated, "statistical significance" is a way of summarizing whether a certain statistical result is likely to have happened by chance, or not. For example, if I flip a coin 10 times and get six heads and four tails, this could easily happen by chance even with a fair and evenly balanced coin. But if I flip a coin 10 times and get 10 heads, this is extremely unlikely to happen by chance. Or if I flip a coin 10,000 times, with a result of 6,000 heads and 4,000 tails (essentially, repeating the 10-flip coin experiment 1,000 times), I can be quite confident that the coin is not a fair one. A common rule of thumb has been that if the probability of an outcome occurring by chance is 5% or less--in the jargon, has a p-value of 5% or less--then the result is statistically significant. However, it's also pretty common to see studies that report a range of other p-values like 1% or 10%.
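
To put rough numbers on the coin examples above, here is a minimal Python sketch. It assumes a perfectly fair coin and computes a one-sided tail probability, meaning the chance of seeing at least that many heads. The function name `upper_tail` is made up for illustration, and a two-sided test would give roughly double these values.

```python
from math import comb

def upper_tail(heads, flips):
    """Chance of at least `heads` heads in `flips` tosses of a fair coin."""
    # Count the equally likely head/tail sequences with enough heads,
    # then divide by the total number of sequences, 2**flips.
    favorable = sum(comb(flips, k) for k in range(heads, flips + 1))
    return favorable / 2 ** flips

print(upper_tail(6, 10))        # about 0.377: could easily happen by chance
print(upper_tail(10, 10))       # about 0.001: very unlikely by chance
print(upper_tail(6000, 10000))  # far below 1e-80
```

The proportion of heads is the same 60% in the first and third cases; only the number of flips changes, and that alone moves the result from unremarkable to essentially impossible for a fair coin.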

Given the omnipresence of "statistical significance" in pedagogy and the research literature, it was interesting in 2016 when the American Statistical Association made an official statement "ASA Statement on Statistical Significance and P-Values" (discussed here) which includes comments like: "Scientific conclusions and business or policy decisions should not be based only on whether a p-value passes a specific threshold. ... A p-value, or statistical significance, does not measure the size of an effect or the importance of a result. ... By itself, a p-value does not provide a good measure of evidence regarding a model or hypothesis."

Now, the ASA has followed up with a special supplemental issue of its journal The American Statistician on the theme "Statistical Inference in the 21st Century: A World Beyond p < 0.05" (January 2019). The issue has a useful overview essay, "Moving to a World Beyond “p < 0.05”," by Ronald L. Wasserstein, Allen L. Schirm, and Nicole A. Lazar. They write:
We conclude, based on our review of the articles in this special issue and the broader literature, that it is time to stop using the term “statistically significant” entirely. Nor should variants such as “significantly different,” “p < 0.05,” and “nonsignificant” survive, whether expressed in words, by asterisks in a table, or in some other way. Regardless of whether it was ever useful, a declaration of “statistical significance” has today become meaningless. ... In sum, “statistically significant”—don’t say it and don’t use it.
The special issue is then packed with 43 essays from a wide array of experts and fields on the general theme of "if we eliminate the language of statistical significance, what comes next?"

To understand the arguments here, it's perhaps useful to have a brief and partial review of some main reasons why the emphasis on "statistical significance" can be so misleading: namely, it can lead one to dismiss useful and true connections; it can lead one to draw false implications; and it can cause researchers to play around with their results. A few words on each of these.

The question of whether a result is "statistically significant" is related to the size of the sample. As noted above, 6 out of 10 heads can easily happen by chance, but 6,000 out of 10,000 heads is extraordinarily unlikely to happen by chance. So say that you do a study which finds an effect which is fairly large in size, but where the sample size isn't large enough for it to be statistically significant by a standard test. In practical terms, it would be foolish to ignore this large result; instead, you should presumably start trying to find ways to run the test with a much larger sample size. But in academic terms, the study you just did may be unpublishable: after all, a lot of journals will tend to decide against publishing a study with negative results--a study that doesn't find a statistically significant effect.

Knowing that journals are looking to publish "statistically significant" results, researchers will be tempted to look for ways to jigger their results. Studies in economics, for example, aren't about simple probability examples like flipping coins. Instead, one might be looking at Census data on households that can be divided up in roughly a jillion ways: not just the basic categories like age, income, wealth, education, health, occupation, ethnicity, geography, urban/rural, during recession or not, and others, but also various interactions of these factors looking at two or three or more at a time. Then, researchers make choices about whether to assume that connections between these variables should be thought of as linear relationships, curved relationships (curving up or down), relationships that are U-shaped or inverted-U, and others. Now add in all the different time periods and events and places and before-and-after legislation that can be considered. For this fairly basic data, one is quickly looking at thousands or tens of thousands of possible relationships.

Remember that the idea of statistical significance relates to whether something has a 5% probability or less of happening by chance. To put that another way, it's whether something would have happened only one time out of 20 by chance. So if a researcher takes the same basic data and looks at thousands of possible equations, there will be dozens of equations that look like they had a 5% probability or less of happening by chance. When there are thousands of researchers acting in this way, there will be a steady stream of hundreds of results every month that appear to be "statistically significant," but are just a result of the general situation that if you try enough equations, some of them will look significant purely by chance.
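
The arithmetic behind this can be sketched in a few lines of Python. As a stand-in for "equations," the simulation below tests 1,000 coins that are all fair by construction, flipping each one 100 times, and counts how many nevertheless clear the 5% bar. The coin setup, the seed, and the names `two_sided_p` and `count_false_alarms` are illustrative assumptions, not anything taken from the studies discussed here, and the first printed figure assumes the 20 tests are independent.

```python
import random
from math import comb

def two_sided_p(heads, flips):
    """Exact two-sided p-value for a fair coin: the chance of a head
    count at least as far from flips/2 as the one observed."""
    distance = abs(heads - flips / 2)
    extreme = sum(comb(flips, k) for k in range(flips + 1)
                  if abs(k - flips / 2) >= distance)
    return extreme / 2 ** flips

def count_false_alarms(tests=1000, flips=100, alpha=0.05, seed=0):
    """Flip `tests` fair coins and count how many look 'significant'."""
    rng = random.Random(seed)
    # Precompute the p-value for every possible head count.
    p_values = [two_sided_p(h, flips) for h in range(flips + 1)]
    alarms = 0
    for _ in range(tests):
        heads = sum(rng.random() < 0.5 for _ in range(flips))
        if p_values[heads] <= alpha:
            alarms += 1
    return alarms

# With 20 independent tests at the 5% level, the chance that at least
# one looks "significant" by luck alone is about 64%.
print(1 - (1 - 0.05) ** 20)
print(count_false_alarms(tests=1000))
```

Because head counts come in whole numbers, this exact test is slightly conservative, so the count typically lands somewhat below the nominal 50 per 1,000. Every one of those "findings" is a false alarm, since none of the coins is biased.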

A classic statement of this issue arises in Edward Leamer's 1983 article, "Taking the Con out of Econometrics" (American Economic Review, March 1983, pp. 31-43). Leamer wrote:
The econometric art as it is practiced at the computer terminal involves fitting many, perhaps thousands, of statistical models. One or several that the researcher finds pleasing are selected for reporting purposes. This searching for a model is often well intentioned, but there can be no doubt that such a specification search invalidates the traditional theories of inference. ... [I]n fact, all the concepts of traditional theory, utterly lose their meaning by the time an applied researcher pulls from the bramble of computer output the one thorn of a model he likes best, the one he chooses to portray as a rose. The consuming public is hardly fooled by this chicanery. The econometrician's shabby art is humorously and disparagingly labelled "data mining," "fishing," "grubbing," "number crunching." A joke evokes the Inquisition: "If you torture the data long enough, Nature will confess" ... This is a sad and decidedly unscientific state of affairs we find ourselves in. Hardly anyone takes data analyses seriously. Or perhaps more accurately, hardly anyone takes anyone else's data analyses seriously.
Economists and other social scientists have become much more aware of these issues over the decades, but Leamer was still writing in 2010 ("Tantalus on the Road to Asymptopia," Journal of Economic Perspectives, 24: 2, pp. 31-46):
Since I wrote my “con in econometrics” challenge much progress has been made in economic theory and in econometric theory and in experimental design, but there has been little progress technically or procedurally on this subject of sensitivity analyses in econometrics. Most authors still support their conclusions with the results implied by several models, and they leave the rest of us wondering how hard they had to work to find their favorite outcomes ... It’s like a court of law in which we hear only the experts on the plaintiff’s side, but are wise enough to know that there are abundant for the defense. 
Taken together, these issues suggest that a lot of the findings in social science research shouldn't be believed with too much firmness. The results might be true. They might be a result of a researcher pulling out "from the bramble of computer output the one thorn of a model he likes best, the one he chooses to portray as a rose." And given the realities of real-world research, it seems goofy to say that a result with, say, only a 4.8% probability of happening by chance is "significant," while if the result had a 5.2% probability of happening by chance it is "not significant." Uncertainty is a continuum, not a black-and-white difference.
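
One way to see how little separates those two results is to convert each p-value back into the test statistic behind it. The sketch below assumes a two-sided test with a normally distributed statistic, which is an assumption made for illustration and not something the discussion above specifies; `z_from_p` is a hypothetical helper name.

```python
from statistics import NormalDist

def z_from_p(p):
    """Size of the z-statistic that yields a two-sided p-value of `p`,
    assuming the test statistic is standard normal under the null."""
    return NormalDist().inv_cdf(1 - p / 2)

for p in (0.048, 0.050, 0.052):
    label = "significant" if p <= 0.05 else "not significant"
    print(f"p = {p:.3f}  z = {z_from_p(p):.3f}  {label}")
```

Under these assumptions the underlying z-statistics differ by only a few hundredths, yet the conventional label flips from one side of the line to the other.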


So let's accept that the "statistical significance" label has some severe problems, as Wasserstein, Schirm, and Lazar write:
[A] label of statistical significance does not mean or imply that an association or effect is highly probable, real, true, or important. Nor does a label of statistical nonsignificance lead to the association or effect being improbable, absent, false, or unimportant. Yet the dichotomization into “significant” and “not significant” is taken as an imprimatur of authority on these characteristics. In a world without bright lines, on the other hand, it becomes untenable to assert dramatic differences in interpretation from inconsequential differences in estimates. As Gelman and Stern (2006) famously observed, the difference between “significant” and “not significant” is not itself statistically significant.
But as they recognize, criticizing is the easy part. What is to be done instead? And here, the argument fragments substantially. Did I mention that there were 43 different responses in this issue of the American Statistician?

Some of the recommendations are more a matter of temperament than of specific statistical tests. As Wasserstein, Schirm, and Lazar emphasize, many of the authors offer advice that can be summarized in about seven words: “Accept uncertainty. Be thoughtful, open, and modest.” This is good advice! But a researcher struggling to get a paper published might be forgiven for feeling that it lacks specificity.

Other recommendations focus on the editorial process used by academic journals, which establish some of the incentives here. One interesting suggestion is that when a research journal is deciding whether to publish a paper, the reviewer should only see a description of what the researcher did--without seeing the actual empirical findings. After all, if the study was worth doing, then it's worthy of being published, right? Such an approach would mean that authors had no incentive to tweak their results. A method already used by some journals is "pre-publication registration," where the researcher lays out beforehand, in a published paper, exactly what is going to be done. Then afterwards, no one can accuse that researcher of tweaking the methods to obtain specific results.

Other authors agree with turning away from "statistical significance," but in favor of their own preferred tools for analysis: Bayesian approaches, "second-generation p-values," "false positive risk," "statistical decision theory," "confidence index," and many more. With many alternative examples along these lines, the researcher trying to figure out how to proceed can again be forgiven for desiring a little more definitive guidance.

Wasserstein, Schirm, and Lazar also asked some of the authors whether there might be specific situations where a p-value threshold made sense. They write:
"Authors identified four general instances. Some allowed that, while p-value thresholds should not be used for inference, they might still be useful for applications such as industrial quality control, in which a highly automated decision rule is needed and the costs of erroneous decisions can be carefully weighed when specifying the threshold. Other authors suggested that such dichotomized use of p-values was acceptable in model-fitting and variable selection strategies, again as automated tools, this time for sorting through large numbers of potential models or variables. Still others pointed out that p-values with very low thresholds are used in fields such as physics, genomics, and imaging as a filter for massive numbers of tests. The fourth instance can be described as “confirmatory setting[s] where the study design and statistical analysis plan are specified prior to data collection, and then adhered to during and after it” ...  Wellek (2017) says at present it is essential in these settings. “[B]inary decision making is indispensable in medicine and related fields,” he says. “[A] radical rejection of the classical principles of statistical inference…is of virtually no help as long as no conclusively substantiated alternative can be offered.”
The deeper point here is that there are situations where a researcher or a policy-maker or an economic actor needs to make a yes-or-no decision. When doing quality control, is it meeting the standard or not? When the Food and Drug Administration is evaluating a new drug, does it approve the drug or not? When a researcher in genetics is dealing with a database that has thousands of genes, there's a need to focus on a subset of those genes, which means making yes-or-no decisions on which genes to include in a certain analysis.

Yes, the scientific spirit should “Accept uncertainty. Be thoughtful, open, and modest.” But real life isn't a philosophy contest. Sometimes, decisions need to be made. If you don't have a statistical rule, then the alternative decision rule becomes human judgment--which has plenty of cognitive, group-based, and political biases of its own.

My own sense is that "statistical significance" would be a very poor master, but that doesn't mean it's a useless servant. Yes, it would be foolish and potentially counterproductive to give excessive weight to "statistical significance." But the clarity of conventions and rules, when their limitations are recognized and acknowledged, can still be useful. I was struck by a comment in the essay by Steven N. Goodman:
P-values are part of a rule-based structure that serves as a bulwark against claims of expertise untethered from empirical support. It can be changed, but we must respect the reason why the statistical procedures are there in the first place ... So what is it that we really want? The ASA statement says it; we want good scientific practice. We want to measure not just the signal properly but its uncertainty, the twin goals of statistics. We want to make knowledge claims that match the strength of the evidence. Will we get that by getting rid of P-values? Will eliminating P-values improve experimental design? Would it improve measurement? Would it help align the scientific question with those analyses? Will it eliminate bright line thinking? If we were able to get rid of P-values, are we sure that unintended consequences wouldn’t make things worse? In my idealized world, the answer is yes, and many statisticians believe that. But in the real world, I am less sure.