Thursday, December 31, 2020

Lessons from World War II Statisticians: Survivorship Bias and Sequential Analysis

During World War II, a Statistical Research Group was formed to assist the war effort. W. Allen Wallis, who was Director of Research, tells the story in "The Statistical Research Group, 1942-1945" (Journal of the American Statistical Association, 75:370, June 1980, pp. 320-330, available via JSTOR): "The Statistical Research Group (SRG) was based at Columbia University during the Second World War and supported by the Applied Mathematics Panel (AMP) of the National Defense Research Committee (NDRC), which was part of the Office of Scientific Research and Development (OSRD)." Some prominent members of the group included Milton Friedman, Harold Hotelling, Leonard Savage, and Abraham Wald. Indeed, Wallis writes: "SRG was composed of what surely must be the most extraordinary group of statisticians ever organized, taking into account both number and quality."

The backstory goes like this. On behalf of himself and some other Stanford statistics professors, Wallis wrote to the US government in 1942, offering to help in some way with the war effort. He got a letter back from W. Edwards Deming, the engineer who later became a guru of industrial quality control, but who at this time was working in the US Census Bureau. Deming replied "with four single-spaced pages on the letterhead of the Chief of Ordnance, War Department," and suggested that the statisticians prepare a short course for engineers and firms on how statistical methods could be used for quality control. As Wallis dryly noted in 1980: "The program that resulted from Deming's suggestion eventually made a major contribution to the war effort. Its aftermath, in fact, continues to make major contributions not only to the American economy but also to the Japanese economy."

By mid-1942, Wallis had moved to Columbia to run the Statistical Research Group. One bit of backstory is that, in those pre-computer days, "the computing ... was done by about 30 young women, mostly mathematics graduates of Hunter or Vassar. Some of the basic statistical tables published in Techniques of Statistical Analysis (SRG 1948) were computed as backlog projects when there was slack in the computing load."

The SRG carried out literally hundreds of analyses: how the ammunition in aircraft machine guns should be mixed; quality examination methods for rocket fuel; "the best settings on proximity fuzes for air bursts of artillery shells against ground troops"; "to evaluate the comparative effectiveness of four 20 millimeter guns on the one hand and eight 50 caliber guns on the other as the armament of a fighter aircraft"; calculating "pursuit curves" for targeting missiles and torpedoes. "Statistical studies were also made of stereoscopic range finders, food storage data, high temperature alloys, the diffusion of mustard gas, and clothing tests."

Several of the insights from the SRG have had a lasting effect in terms of statistical analysis. Here, I'll focus on two of them: survivorship bias and sequential analysis.

"Survivorship bias" refers to a problem that emerges when you look at a set of data without realizing that some data points have dropped out over time. For example, suppose you look at the average rate of return from stock market mutual funds. If you just look at the universe of current funds, you will be leaving out funds that did badly and were closed or merged for lack of interest. Or suppose you argue in favor of borrowing money to attend a four-year college by citing evidence about higher salaries earned by college graduates, but you leave out the experience of those who borrowed money and did not end up graduating. In health care, the issue of survivorship bias can come up quite literally in studies of trauma care: before drawing conclusions, such studies must of course beware of the fact that data on those who suffered an injury but did not end up in the trauma care unit, or those who died of the injury before arriving at the trauma care unit, will not be included in the study.

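As a rough illustration of the mutual fund example, here is a small simulation sketch in Python. Every number in it (the average return, the volatility, the rule that a fund closes after a year worse than -20 percent) is invented for illustration, and the function name is hypothetical. The point is only that averaging over surviving funds gives a rosier answer than averaging over every fund-year that actually happened.

```python
import random
import statistics


def simulate_fund_returns(n_funds=1000, n_years=10, mean=0.05, sd=0.15,
                          close_below=-0.20, seed=0):
    """Compare average returns over all funds with those of survivors only.

    Assumption (illustrative, not from any real data): annual returns are
    independent normal draws, and a fund closes after any year in which
    its return falls below `close_below`.
    Returns (avg_all, avg_survivors).
    """
    rng = random.Random(seed)
    all_returns = []       # every fund-year that occurred, closed funds included
    survivor_returns = []  # fund-years of funds still open at the end
    for _ in range(n_funds):
        history = []
        survived = True
        for _ in range(n_years):
            r = rng.gauss(mean, sd)
            history.append(r)
            if r < close_below:
                survived = False  # fund closes and drops out of "current funds"
                break
        all_returns.extend(history)
        if survived:
            survivor_returns.extend(history)
    return statistics.mean(all_returns), statistics.mean(survivor_returns)


avg_all, avg_survivors = simulate_fund_returns()
print(f"all funds: {avg_all:.3f}  survivors only: {avg_survivors:.3f}")
```

Under these assumptions the survivors-only average comes out higher, simply because the bad years that closed funds have been filtered out of the sample.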
In a follow-up comment on the main article, appearing in the same issue, Wallis describes the origins of the idea of survivorship bias: 
In the course of reviewing the history of SRG, I was reminded of some ingenious work by Wald that has never seen the light of day. Arrangements have now been made for its publication, although the form and place are yet undecided. Wald wrote a series of memoranda on estimating the vulnerability of various parts of an airplane from data showing the number of hits on the respective parts of planes returning from combat. The vulnerability of a part (engine, aileron, pilot, stabilizer, elevator, etc.) is defined as the probability that a hit on that part will result in destruction of the plane (fire, explosion, loss of power, loss of control, etc.). The military was inclined to provide protection for those parts that on returning planes showed the most hits. Wald assumed, on good evidence, that hits in combat were uniformly distributed over the planes. It follows that hits on the more vulnerable parts were less likely to be found on returning planes than hits on the less vulnerable parts, since planes receiving hits on the more vulnerable parts were less likely to return to provide data. From these premises, he devised methods for estimating the vulnerability of various parts.
In other words, just looking at damage on the planes that return would not be useful, but when adjusting for the fact that the returning planes are the ones that survived, it can offer real insights. Wald's 1943 manuscript "A Method of Estimating Plane Vulnerability Based on Damage of Survivors" was published in 1980 by the Defense Technical Information Center.
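To make the logic concrete, here is a deliberately simplified sketch in Python. It is not Wald's actual method, which is worked out in his memoranda and deals with complications this sketch ignores. The sketch assumes that every plane that was hit took exactly one hit, that every lost plane was lost to that hit, and that hits land on each part in proportion to its share of the plane's surface area. The function name, the part names, the area shares, and the counts are all hypothetical.

```python
def estimate_vulnerability(area_share, returning_hits, planes_lost):
    """Estimate P(plane lost | hit on part) under a one-hit simplification.

    area_share:     part -> fraction of the plane's area (sums to 1)
    returning_hits: part -> hits on that part counted on returning planes
    planes_lost:    planes that did not return (assumed: one hit each)
    """
    # With one hit per plane, total hits = returning planes + lost planes.
    total_hits = sum(returning_hits.values()) + planes_lost
    vulnerability = {}
    for part, share in area_share.items():
        expected_hits = total_hits * share   # hits the part took, seen or unseen
        observed_hits = returning_hits[part]  # hits that made it home
        vulnerability[part] = 1 - observed_hits / expected_hits
    return vulnerability


# Made-up data: 77 planes returned with one hit each, 23 were lost.
area_share = {"engine": 0.2, "fuselage": 0.5, "wings": 0.3}
returning_hits = {"engine": 8, "fuselage": 45, "wings": 24}
for part, v in estimate_vulnerability(area_share, returning_hits, 23).items():
    print(f"{part}: {v:.2f}")
```

In this made-up data the engine shows the fewest hits on returning planes, and that is exactly why its estimated vulnerability is the highest: the engine hits that are missing from the sample are the ones that brought planes down.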

But clearly the most prominent statistical insight from the SRG was the idea of sequential analysis, which Wallis calls "one of the most powerful and seminal statistical ideas of the past third of a century." In his 1980 article, he reproduces a long letter that he wrote in 1950 on the subject. Doing quality control testing on potential new kinds of ordnance required firing thousands of rounds. Apparently, a general observed to Wallis that if someone "wise and experienced" was on hand, that person could tell within a few thousand or even a few hundred rounds if the new ordnance was either much worse or much better than hoped. The general asked if there was some mechanical rule that could be devised for when the testing could be ended earlier than the full sample. Wallis noodled around with this idea, and expressed it this way in his 1950 letter: 
The fact that a test designed for its optimum properties with a sample of predetermined size could be still better if that sample size were made variable naturally suggested that it might pay to design a test in order to capitalize on this sequential feature; that is, it might pay to use a test which would not be as efficient as the classical tests if a sample of exactly N were to be taken, but which would more than offset this disadvantage by providing a good chance of terminating early when used sequentially. 

Wallis remembers a series of conversations with Milton Friedman on the subject, after Friedman joined the SRG in 1943. They made some progress in thinking about tradeoffs between sample size and statistical power and what is learned along the way. But they also ended up feeling that the discovery was potentially important to the war effort and that they weren't well-equipped to solve it expeditiously. Wallis remembers a momentous walk:  

We finally decided to bring in someone more expert in mathematical statistics than we. This decision was made after rather careful consideration. I recall talking it over with Milton walking down Morningside Drive from the office to our apartment. He said that it was not unlikely, in his opinion, that the idea would prove a bigger one than either of us would hit on again in a lifetime. We also discussed our prospects for being able to work it out ourselves. Milton was pretty confident of our (his?) ability to squeeze the juice out of the idea, but I had doubts and felt that it might go beyond our (my!) depth mathematically. We also discussed the fact that if we gave the idea away, we could never expect much credit, and would have to take our chances on receiving any at all. We definitely decided that even if the credit situation turned out in a way that disappointed us, there would be nothing to do about it, 
They ended up getting permission to talk with Abraham Wald on the subject, which wasn't easy, because Wald's time was "too valuable to be wasted." 
At this first meeting Wald was not enthusiastic and was completely noncommittal. ... The next day Wald phoned that he had thought some about our idea and was prepared to admit that there was sense in it. That is, he admitted that our idea was logical and worth investigating. He added, however, that he thought nothing would come of it; his hunch was that tests of a sequential nature might exist but would be found less powerful than existing tests. On the second day, however, he phoned that he had found that such tests do exist and are more powerful, and furthermore he could tell us how to make them.
It took a few more years for the underlying theory to be worked out, and Wald's book Sequential Analysis was published in 1947. But the roots of the idea go back to an army general noting that someone with expert and informed judgment could sometimes make a faster decision than the existing quality control algorithms.
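For readers who want to see what a "mechanical rule" for stopping early can look like, here is a minimal sketch in Python of the textbook sequential probability ratio test for a defect rate. Wallis's article does not spell out a formula, so the setup is an assumption: each round is scored 1 if defective and 0 if not, the two hypothesized defect rates p0 and p1 and the error rates alpha and beta are illustrative values, and the stopping thresholds use Wald's standard approximations. The function name is hypothetical.

```python
import math


def sprt_bernoulli(observations, p0, p1, alpha=0.05, beta=0.05):
    """Sequential probability ratio test of H0: p = p0 vs H1: p = p1 (p1 > p0).

    observations: iterable of 0/1 outcomes (1 = defective round).
    Returns (decision, n_used), where decision is "accept H0",
    "accept H1", or "continue" if the data ran out before a decision.
    """
    upper = math.log((1 - beta) / alpha)  # at or above this: accept H1
    lower = math.log(beta / (1 - alpha))  # at or below this: accept H0
    llr = 0.0                             # running log-likelihood ratio
    n = 0
    for n, x in enumerate(observations, start=1):
        if x:
            llr += math.log(p1 / p0)
        else:
            llr += math.log((1 - p1) / (1 - p0))
        if llr >= upper:
            return "accept H1", n
        if llr <= lower:
            return "accept H0", n
    return "continue", n


# Three defective rounds in a row are enough to stop and reject the lot.
print(sprt_bernoulli([1, 1, 1, 0, 0], p0=0.1, p1=0.3))
# A clean run stops after 12 rounds instead of testing the full sample.
print(sprt_bernoulli([0] * 100, p0=0.1, p1=0.3))
```

The test checks the accumulated evidence after every round and stops as soon as it crosses either threshold, which is the formal counterpart of the general's "wise and experienced" observer who can tell early that the ordnance is clearly good or clearly bad.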

The SRG is an example of how ideas and statistical methods invented out of immediate practical necessity--like new methods of quality control--had longer-run powerful results. As this year draws to a close, I find myself wondering if some of the ideas and methods that have been used to create vaccines and to push back against COVID-19 will find broader applicability in the years ahead, perhaps in areas reaching well beyond health care.