Mixed messages about statistical significance


It’s very difficult to capture the essence of statistical significance in a brief primer. (Journalist Christie Aschwanden published a thoughtful piece, “Not even scientists can easily explain P values.”)

First, let’s be at least as concerned about clinical significance as we are about statistical significance. In other words, did the result make a meaningful difference in people’s lives? Journalists should keep that distinction in mind as they write stories about studies, and it’s a good idea for the general public to understand it as well.

From a Duke website:

Clinical versus Statistical Significance

“Although it is tempting to equate statistical significance with clinical importance, critical readers should avoid this temptation. To be clinically important requires a substantial change in an outcome that matters. Statistically significant changes, however, can be observed with trivial outcomes. And because statistical significance is powerfully influenced by the number of observations, statistically significant changes can be observed with trivial (small) changes in important outcomes. Large studies can be significant without being clinically important and small studies may be important without being significant.” (Effective Clinical Practice, July/August 2001, ACP)

Clinical significance has little to do with statistics; it is a matter of judgment, and it often depends on the magnitude of the effect being studied. It answers the question, “Is the difference between groups large enough to be worth achieving?” Studies can be statistically significant yet clinically insignificant.

For example, a large study might find that a new antihypertensive drug lowered blood pressure, on average, 1 mm Hg more than conventional treatments. The results were statistically significant, with a p-value of less than .05, because the study was large enough to detect a very small difference. However, most clinicians would not find a 1 mm Hg difference in blood pressure large enough to justify switching to a new drug. This is a case where the results were statistically significant (p < .05) but clinically insignificant.

Source:
Guyatt G, Rennie D, Meade MO, Cook DJ. Users’ Guides to the Medical Literature: A Manual for Evidence-Based Clinical Practice, 2nd edition, 2008.
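To see how sample size alone can drive that split between statistical and clinical significance, here is a minimal simulation in Python. The numbers are entirely made up (20,000 patients per arm, a true 1 mm Hg difference, a 15 mm Hg standard deviation); it sketches the general phenomenon rather than any particular trial.

```python
# Simulated blood-pressure trial (hypothetical numbers): with a large enough
# sample, a trivial 1 mm Hg average difference becomes "statistically significant".
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
n = 20_000                                          # patients per arm (assumed)
control = rng.normal(loc=140, scale=15, size=n)     # conventional treatment, mm Hg
new_drug = rng.normal(loc=139, scale=15, size=n)    # new drug: 1 mm Hg lower on average

t_stat, p_value = stats.ttest_ind(new_drug, control)
print(f"mean difference: {new_drug.mean() - control.mean():.2f} mm Hg")
print(f"p-value: {p_value:.2g}")                    # typically far below .05
```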

The somewhat arbitrary choice to set the threshold for statistical significance at a p-value of less than 5% (p < .05) was made nearly 100 years ago. There’s nothing magical or sacrosanct about it; it has simply become a time-honored norm. In reality, the difference between “not quite statistically significant” and statistically significant at the .05 level can be minuscule.
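To make that concrete, here is a small sketch in Python with made-up numbers: two hypothetical studies report the same effect estimate with almost the same precision, yet one lands just under the .05 line and the other just over it.

```python
# Two hypothetical studies (invented effect estimates and standard errors)
# that straddle the .05 cutoff despite being nearly identical.
from scipy import stats

studies = {"Study A": (2.0, 1.015), "Study B": (2.0, 1.035)}   # (effect, standard error)

for name, (effect, se) in studies.items():
    z = effect / se
    p = 2 * stats.norm.sf(abs(z))          # two-sided p-value from a z statistic
    verdict = "statistically significant" if p < 0.05 else "not significant"
    print(f"{name}: effect = {effect}, p = {p:.3f} -> {verdict}")
```

Study A comes out at roughly p = 0.049 and Study B at roughly p = 0.053; calling one a success and the other a failure rests on a hair’s-breadth difference.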

In a Scientific American blog post, “Statistical significance and its part in science downfalls,” Hilda Bastian wrote (excerpt):

Get a “value” over or under 0.05 and you can be 95% certain it’s either a fluke or it isn’t. You can eliminate the play of chance! You can separate the signal from the noise!

Except that you can’t. That’s not really what testing for statistical significance does. And therein lies the rub.

Testing for statistical significance estimates the probability of getting roughly that result if the study hypothesis is assumed to be true. It can’t on its own tell you whether this assumption was right, or whether the results would hold true in different circumstances. It provides a limited picture of probability, taking limited information about the data into account and giving only “yes” or “no” as options.

Years ago, Dartmouth’s Lisa Schwartz, Steve Woloshin and Gil Welch touched on some of these issues in a Washington Post column, “Fat or Fiction? Is There a Link Between Dietary Fat and Cancer Risk? Why Two Big Studies Reached Different Conclusions.”

  • The column reflected on an “apparent flip-flop” in news coverage of low-fat diets and breast cancer. One month, a front-page Post headline read, “Low-Fat Diet’s Benefit Rejected: Study Finds No Drop in Risk for Disease.” But less than a year before, a headline had sent a different message: “Study of Breast Cancer Patients Finds Benefit in Low-Fat Diet.”
  • They wrote: “The p values for the effect of low-fat diet on breast cancer in the two studies were quite similar. For women with breast cancer, the p value was 3 percent. For women without breast cancer, the p value was 7 percent. So even though, by convention, one finding is called “statistically significant” and the other “not-significant,” we would say that the statistics of the two studies are not that different: Both are close to the conventional cutoff point of 5 percent…. if you believe one is real, you should probably believe the other is real.”
  • Read the “Research Basics: Accounting for Chance” sidebar in their Post column for an explanation of how close the two can be.

Dr. Donald Berry of MD Anderson Cancer Center is a fellow of the American Statistical Association.  He wrote an article in the Journal of the National Cancer Institute, “Multiplicities in Cancer Research: Ubiquitous and Necessary Evils.” It might be difficult to digest all of this in one sitting, but I’ll excerpt from the section on statistical significance:

“…statistical significance is an arcane concept. Few researchers can even repeat the definition of P value. People usually convert it to something they do understand, but the conversion—almost always an inversion—is essentially always wrong. For example: “The P value is the probability that the results could have occurred by chance alone.” This interpretation is ambiguous at best. When pressed for the meaning of “chance” and “could have occurred,” the response is usually circular or otherwise incoherent. Such incoherence is more than academic. Much of the world acts as though statistical significance implies truth, which is not even approximately correct.

Statistical significance is widely regarded to be difficult to understand, perhaps even impossible to understand. Some educators go so far as to recommend not teaching it at all.”

Still, as Berry wrote to me, “the cutpoint of 0.05 for statistical significance has become standard in many fields, including medicine.” He continued:

For example, many highly capable and highly intelligent MDs regard p > 0.05 versus p < 0.05 as defining truth. This attitude is sacrosanct in a sense but at the same time it is preposterous. As you say, the cutpoint is arbitrary. Moreover, essentially no one knows what a p-value means. And the rare scientist who can give the correct mathematical interpretation can’t put it into a non-mathematical language that someone else can understand. That’s because p-value is fundamentally a perversion of common logic. For example, if you read what I’ve written about p-values in the attached and you come away being able to repeat what you read (whether you understand it or not), you would be a rare bird indeed!

If there is a conclusion to this discussion, it may be Berry’s line:

One thing is clear: there is no one-size-fits-all approach.

 


Comments (3)


Paul Alper

August 3, 2015 at 11:08 am

The correct but still foggy definition of p-value is:
Probability of obtaining data at least this extreme given that the null hypothesis is true.
Typically in the health/medical field, the null hypothesis is that the treatment and the placebo produce the same effect. Notice that the above definition is often confused with what we are really interested in:
Probability that the null hypothesis is true given the data obtained.
Unfortunately, many researchers commit this fallacy. A famous way of clarifying things when it comes to conditional probability is the following example:
Probability of being dead given that you are lynched is 100 percent.
Probability of being lynched given that you are dead is zero.
If dead and lynching are too stark, replace dead with “brown eyes” and lynched with “Dominican Republic.”
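A short simulation can illustrate why the direction of the conditional matters. Everything in the sketch below is assumed for illustration (only 10% of tested hypotheses describe real effects, a modest true effect, 30 subjects per group); it illustrates the logic of the comment above rather than any particular field.

```python
# Simulated experiments (all numbers assumed) showing that
# P(significant result | null true) is not P(null true | significant result).
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_experiments = 20_000
n_per_group = 30
true_effect = 0.5          # effect size (in SD units) when a real effect exists
share_real = 0.10          # assume only 10% of tested hypotheses are actually true

null_is_true = rng.random(n_experiments) > share_real
p_values = np.empty(n_experiments)
for i in range(n_experiments):
    effect = 0.0 if null_is_true[i] else true_effect
    a = rng.normal(0.0, 1.0, n_per_group)        # placebo group
    b = rng.normal(effect, 1.0, n_per_group)     # treatment group
    p_values[i] = stats.ttest_ind(b, a).pvalue

significant = p_values < 0.05
# Close to 0.05 by construction:
print(f"P(p < .05 | null true) = {significant[null_is_true].mean():.3f}")
# Far from 0.05 under these assumptions (close to one half here):
print(f"P(null true | p < .05) = {null_is_true[significant].mean():.3f}")
```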

Paul Alper

August 3, 2015 at 12:47 pm

Don Berry is quoted above as saying: “Moreover, essentially [virtually] no one knows what a p-value means…p-value is fundamentally a perversion of common logic.” But then the astute reader should ask: why is the p-value such a ubiquitous metric for discerning success? The simple answer is that the p-value is much easier to calculate than more meaningful measures such as the posterior probability or the Bayes factor, which require a good deal more computing power and mathematical knowledge.
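As a rough illustration of that trade-off (a toy coin-flip example, not drawn from the article above), the sketch below computes both quantities. The exact p-value needs no modelling choices, while even the simplest Bayes factor requires choosing a prior for the alternative; here a Uniform(0,1) prior on the coin’s bias is assumed.

```python
# Toy example (assumed data: 61 heads in 100 flips) contrasting a p-value with a
# Bayes factor for H0: theta = 0.5 versus H1: theta ~ Uniform(0,1).
from math import comb

from scipy import stats

n, k = 100, 61

# Frequentist side: exact two-sided p-value under H0: theta = 0.5
p_value = stats.binomtest(k, n, p=0.5).pvalue

# Bayesian side: BF01 = P(data | H0) / P(data | H1)
# P(data | H0) = C(n, k) * 0.5**n; with a Uniform(0,1) prior, P(data | H1) = 1 / (n + 1)
bf01 = comb(n, k) * 0.5**n * (n + 1)

print(f"p-value = {p_value:.3f}")   # about 0.035, conventionally "significant"
print(f"BF01    = {bf01:.2f}")      # about 0.7, i.e., almost no evidence either way
```

With these assumed numbers, a result that clears the .05 bar still yields a Bayes factor close to 1, which is one reason the two measures can tell different stories.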

Keith O'Rourke

July 26, 2016 at 10:39 am

Paul, as helpful as a correct nominal definition of p-values and an ability to discern what is and is not a p-value may be, arguably the critical conceptualization involves what to make of a p-value, and thereby, in turn, of the study under consideration.
Furthermore, the American Statistical Association’s recent statement on p-values had 20 published comments. Some of them supported your version of “the null hypothesis is that the treatment and the placebo produce the same effect,” but more argued for what I believe is a more purposeful version: one that also includes a myriad of background assumptions (e.g., random assignment, lack of informative dropout, pre-specification rather than cherry-picking of outcomes), essentially everything that is required so that the distribution of p-values, when the “treatment and the placebo produce the same effect,” is (technically) known to equal the Uniform(0,1) distribution. Without that last bit of knowledge, no one could sensibly know what to make of a p-value. Even with it, it is very hard, if not debatable.
So for readers of this blog, perhaps the best sense of what to make of a p-value (and thereby, in turn, of the study under consideration) would be simply this: if it’s less than 0.05, folks are likely somewhat over-excited about the study, and if it’s greater than 0.05, likely somewhat overly dismissive of it. Then perhaps readers can better focus on all the other tips for understanding studies, or at least not overlook them.
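O’Rourke’s Uniform(0,1) point can be checked directly with a small simulation. The sketch below assumes everything he lists holds (normally distributed data, random assignment, one pre-specified outcome, 50 subjects per group, truly no effect); under those assumptions the p-values spread evenly between 0 and 1, so about 5% fall below .05 by chance alone.

```python
# A quick check (simulated data, assumptions as described above) that p-values
# are Uniform(0,1) when treatment and placebo truly have the same effect.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
p_values = []
for _ in range(10_000):                         # 10,000 simulated null trials
    treatment = rng.normal(0.0, 1.0, 50)        # no real effect in either group
    placebo = rng.normal(0.0, 1.0, 50)
    p_values.append(stats.ttest_ind(treatment, placebo).pvalue)

p_values = np.array(p_values)
# Under the null, roughly 5% of p-values fall below .05, 10% below .10, and so on.
for cut in (0.05, 0.10, 0.50):
    print(f"share of p-values below {cut:.2f}: {(p_values < cut).mean():.3f}")
```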