Is a P-value of 0.05 sacrosanct?
As I asked trusted sources about statistical significance, one road led clearly to Dr. Donald Berry of MD Anderson Cancer Center. He is a fellow of the American Statistical Association. If you’ve heard his name, it may have been in the context of breast cancer. Since 1990 he has served as a faculty statistician on the Breast Cancer Committee of the Cancer and Leukemia Group B (CALGB), a national oncology group. In this role he has designed and supervised the conduct of many large U.S. intergroup trials in breast cancer.
When I first approached him, Berry wrote:
“Interesting that you should ask. The American Statistical Association has recently formed a committee to address this and other questions related to p-values. One of the motivations of forming the committee was to help journalists. Another was to help scientists.”
Since that report won’t be finished for a while, he referred me to an article he had published in the Journal of the National Cancer Institute, “Multiplicities in Cancer Research: Ubiquitous and Necessary Evils.” It might be difficult to digest all of it in one sitting, so I’ll excerpt from the section on statistical significance:
“…statistical significance is an arcane concept. Few researchers can even repeat the definition of P value. People usually convert it to something they do understand, but the conversion—almost always an inversion—is essentially always wrong. For example: ‘The P value is the probability that the results could have occurred by chance alone.’ This interpretation is ambiguous at best. When pressed for the meaning of ‘chance’ and ‘could have occurred,’ the response is usually circular or otherwise incoherent. Such incoherence is more than academic. Much of the world acts as though statistical significance implies truth, which is not even approximately correct.
Statistical significance is widely regarded to be difficult to understand, perhaps even impossible to understand. Some educators go so far as to recommend not teaching it at all.”
Still, as Berry wrote to me, “the cutpoint of 0.05 for statistical significance has become standard in many fields, including medicine.”
“For example, many highly capable and highly intelligent MDs regard p > 0.05 versus p < 0.05 as defining truth. This attitude is sacrosanct in a sense but at the same time it is preposterous. As you say, the cutpoint is arbitrary. Moreover, essentially no one knows what a p-value means. And the rare scientist who can give the correct mathematical interpretation can’t put it into a non-mathematical language that someone else can understand. That’s because p-value is fundamentally a perversion of common logic. For example, if you read what I’ve written about p-values in the attached and you come away being able to repeat what you read (whether you understand it or not), you would be a rare bird indeed!”
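For reference, here is the standard textbook definition that Berry says so few can repeat. The gloss below is mine, not his:

```latex
% Textbook definition (my gloss, not Berry's wording): a p-value is the
% probability, computed under the assumption that the null hypothesis H_0
% is true, of observing a test statistic T at least as extreme as the
% value t actually observed.
p = \Pr\bigl(T \ge t_{\mathrm{obs}} \,\bigm|\, H_0 \text{ is true}\bigr)
% The common misreading inverts the conditioning, treating p as
% \Pr(H_0 \text{ is true} \mid \text{data}), which it is not.
```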
However, because the study that touched off the reporter’s question to me (back in part one of this series) – “Outcomes of Pregnancy after Bariatric Surgery,” in the New England Journal of Medicine – was an observational study, Berry says there are bigger fish to fry than statistical significance. He wrote to me:
“My assessment of the landscape of observational studies, including much of epidemiology, ranges from bleak to parched earth. There are some researchers (and some journalists) who feel that epidemiology studies should be banned entirely. And with reason. I’ve thought about carrying out an epidemiology study of my own, one that addresses the question of whether epidemiology studies have done more harm than good!
(One social psychology journal has a different—and they think better—solution. They’ve banned all p-values, statements of statistical significance, etc.)
In the context of most observational studies, worrying about whether p < 0.05 or > 0.05 is like worrying about whether you made your bed when your house is burning.
There are researchers who make their careers by searching the literature and databases for drugs that have a ‘statistically higher rate’ of some important serious adverse effect, such as a particular type of cancer. Sometimes the observation is real and can be confirmed but many times it is a statistical fluke. In the age of Big Data, ‘look enough and ye shall find!’”
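To make “look enough and ye shall find” concrete, here is a minimal simulation of my own (the drug counts and event rates are invented for illustration, not taken from Berry): screen 100 drugs that truly have no effect on an adverse-event rate, and a few will look “statistically” riskier anyway.

```python
import numpy as np
from scipy.stats import fisher_exact

rng = np.random.default_rng(42)
n_drugs = 100      # hypothetical drugs, none with any real effect
n_per_arm = 1000   # hypothetical patients per arm
base_rate = 0.05   # true adverse-event rate in BOTH arms

false_alarms = 0
for _ in range(n_drugs):
    drug_events = rng.binomial(n_per_arm, base_rate)
    control_events = rng.binomial(n_per_arm, base_rate)
    # 2x2 table: events vs. non-events in the drug and control arms
    table = [[drug_events, n_per_arm - drug_events],
             [control_events, n_per_arm - control_events]]
    _, p = fisher_exact(table)
    # Count drugs that look riskier at the conventional cutpoint
    if p < 0.05 and drug_events > control_events:
        false_alarms += 1

print(f"{false_alarms} of {n_drugs} no-effect drugs cleared p < 0.05")
```

Every “hit” this prints is a pure statistical fluke, because the simulation builds in no real difference between the arms. Search enough drugs and some will always turn up.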
If there is a conclusion to this discussion, it may be Berry’s line:
“One thing is clear: there is no one-size-fits-all approach.”
A good online resource is Delfini.org, run by Sheri Strite and Michael Stuart, MD. You can read up on their discussion of p-values and the problems with them, and look up other important topics that I have glossed over in this series on statistical significance.
Sheri Strite wrote to me:
“After we were made aware of the tremendous complexity and limitations of p-values, which confidence intervals also share, we had to land somewhere to give people some practical approach. So we say it is an indicator of chance effects. We point out that less than .05 is roughly an indicator (given some complexities and limitations) of a 1 in 20 chance effect (caveats re: above). And if you had a patient for which you had a good study with a p-value slightly higher, you may have a higher chance of a random effect, yet you and your patient might be willing to accept that. We encourage them to look at confidence intervals, with same caveats.
So we’ve tried to find a way to help people operate in an imperfect and confusing situation, even though not exact.”
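The “1 in 20” heuristic Strite describes rests on a basic property: when the null hypothesis is true and the test’s assumptions hold, p-values are uniformly distributed, so p < 0.05 turns up about 5% of the time. A quick simulation of my own (hypothetical numbers, not from Delfini) shows this:

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(0)
n_studies = 10_000
hits = 0
for _ in range(n_studies):
    # Both groups drawn from the SAME distribution: the null is true.
    a = rng.normal(0.0, 1.0, size=50)
    b = rng.normal(0.0, 1.0, size=50)
    _, p = ttest_ind(a, b)
    if p < 0.05:
        hits += 1

print(f"p < 0.05 in {hits / n_studies:.1%} of true-null studies")  # ~5%, i.e. 1 in 20
```

Note what this does and does not say: among studies where nothing is going on, about 1 in 20 will cross the cutpoint. It does not mean a study with p < 0.05 has a 1-in-20 chance of being wrong, which is exactly the inversion Berry warns about.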
Hilda Bastian has also written about statistical significance, addressing “over-simplified ways of explaining this.”
By the way, the reporter who first brought this up with me (back in part one of this series) told me my answers were helpful. He may have just wanted to shut me up – having heard enough!
Link to Part 2 of the series: “Are P-values above .05 really just statistical noise?”