March 12, 2015

Is a P-value of 0.05 sacrosanct?

As I asked trusted sources about statistical significance, one road led clearly to Dr. Donald Berry of MD Anderson Cancer Center. He is a fellow of the American Statistical Association. If you’ve heard his name, it may have been in the context of breast cancer. Since 1990 he has served as a faculty statistician on the Breast Cancer Committee of the Cancer and Leukemia Group B (CALGB), a national oncology group. In this role he has designed and supervised the conduct of many large U.S. intergroup trials in breast cancer.

When I first approached him, Berry wrote:

“Interesting that you should ask. The American Statistical Association has recently formed a committee to address this and other questions related to p-values. One of the motivations of forming the committee was to help journalists. Another was to help scientists.”

Since that report won’t be finished for a while, he referred me to an article he had published in the *Journal of the National Cancer Institute*, “Multiplicities in Cancer Research: Ubiquitous and Necessary Evils.” It might be difficult to digest all of this in one sitting, but I’ll excerpt from the section on statistical significance:

“…statistical significance is an arcane concept. Few researchers can even repeat the definition of P value. People usually convert it to something they do understand, but the conversion—almost always an inversion—is essentially always wrong. For example: “The P value is the probability that the results could have occurred by chance alone.” This interpretation is ambiguous at best. When pressed for the meaning of “chance” and “could have occurred,” the response is usually circular or otherwise incoherent. Such incoherence is more than academic. Much of the world acts as though statistical significance implies truth, which is not even approximately correct. Statistical significance is widely regarded to be difficult to understand, perhaps even impossible to understand. Some educators go so far as to recommend not teaching it at all.”

Still, as Berry wrote to me, “the cutpoint of 0.05 for statistical significance has become standard in many fields, including medicine.”

For example, many highly capable and highly intelligent MDs regard p > 0.05 versus p < 0.05 as defining truth. This attitude is sacrosanct in a sense but at the same time it is preposterous. As you say, the cutpoint is arbitrary. Moreover, essentially no one knows what a p-value means. And the rare scientist who can give the correct mathematical interpretation can’t put it into a non-mathematical language that someone else can understand. That’s because p-value is fundamentally a perversion of common logic. For example, if you read what I’ve written about p-values in the attached and you come away being able to repeat what you read (whether you understand it or not), you would be a rare bird indeed!

However, because the original study that touched off the original reporter’s question to me (back in part one of this series) – “Outcomes of Pregnancy after Bariatric Surgery,” in the New England Journal of Medicine – was an observational study, Berry says there are bigger fish to fry than statistical significance. He wrote to me:

“My assessment of the landscape of observational studies, including much of epidemiology, ranges from bleak to parched earth. There are some researchers (and some journalists) who feel that epidemiology studies should be banned entirely. And with reason. I’ve thought about carrying out an epidemiology study of my own, one that addresses the question of whether epidemiology studies have done more harm than good!

(One social psychology journal has a different—and they think better—solution. They’ve banned all p-values, statements of statistical significance, etc.)

In the context of most observational studies, worrying about whether p < 0.05 or > 0.05 is like worrying about whether you made your bed when your house is burning.

There are researchers who make their careers by searching the literature and databases for drugs that have a “statistically higher rate” of some important serious adverse effect, such as a particular type of cancer. Sometimes the observation is real and can be confirmed but many times it is a statistical fluke. In the age of Big Data, “look enough and ye shall find!”
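Berry's “look enough and ye shall find” warning can be put into rough numbers. The sketch below is my own illustration, not anything from Berry's paper: it assumes every comparison is independent and that no real effect exists in any of them, and the function name is made up for this example. Under those assumptions, it computes the chance that at least one comparison dips below the 0.05 cutpoint by luck alone.

```python
def prob_at_least_one_false_positive(num_tests: int, alpha: float = 0.05) -> float:
    """Chance that at least one of `num_tests` independent comparisons,
    none of which reflects a real effect, falls below `alpha` by luck.

    Assumption (for illustration only): the tests are independent.
    """
    if num_tests < 0:
        raise ValueError("num_tests must be non-negative")
    # Each test stays above alpha with probability (1 - alpha);
    # all of them do so with probability (1 - alpha) ** num_tests.
    return 1.0 - (1.0 - alpha) ** num_tests


if __name__ == "__main__":
    for k in (1, 5, 20, 100):
        chance = prob_at_least_one_false_positive(k)
        print(f"{k:>3} comparisons: {chance:.1%} chance of at least one p < 0.05")
```

With 20 such comparisons the chance of at least one “significant” fluke is already well over half, which is the sense in which searching a large database long enough will nearly always turn something up.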

If there is a conclusion to this discussion, it may be Berry’s line:

One thing is clear: there is no one-size-fits-all approach.

A good online resource is Delfini.org, run by Sheri Strite and Michael Stuart, MD. You can read up on their discussion of P-values and the problems therewith. And you can look up other important topics that I have glossed over in this series on statistical significance.

Sheri Strite wrote to me:

“After we were made aware of the tremendous complexity and limitations of p-values, which confidence intervals also share, we had to land somewhere to give people some practical approach. So we say it is an indicator of chance effects. We point out that less than .05 is roughly an indicator (given some complexities and limitations) of a 1 in 20 chance effect (caveats re: above). And if you had a patient for which you had a good study with a p-value slightly higher, you may have a higher chance of a random effect, yet you and your patient might be willing to accept that. We encourage them to look at confidence intervals, with same caveats.

So we’ve tried to find a way to help people operate in an imperfect and confusing situation, even though not exact.”

Hilda Bastian has also written about statistical significance, addressing “over-simplified ways of explaining this.”

By the way, the reporter who first brought this up with me (back in part one of this series) told me my answers were helpful. He may have just wanted to shut me up – having heard enough!

Link to Part 2 of the series: “Are P-values above .05 really just statistical noise?”


## Comments (3)

We Welcome Comments. But please note: We will delete comments left by anyone who doesn’t leave an actual first and last name and an actual email address. We will delete comments that include personal attacks, unfounded allegations, unverified facts, product pitches, or profanity. We will also end any thread of repetitive comments. Comments should primarily discuss the quality (or lack thereof) in journalism or other media messages about health and medicine. This is not intended to be a forum for definitive discussions about medicine or science. Nor is it a forum to share your personal story about a disease or treatment -- your comment must relate to media messages about health care. If your comment doesn't adhere to these policies, we won't post it. Questions? Please see more on our comments policy.

## Matthew

March 12, 2015 at 12:19 pm

“less than .05 is roughly an indicator (given some complexities and limitations) of a 1 in 20 chance effect” still isn’t right.

Here’s an example. Let event A be that we observe a mean in our dataset as a result of an experiment. We’ll denote B as being the situation where treatment has no effect (on the sample mean). The p-value is the conditional probability of observing A if B is true, so p-value=pr(A|B).

By definition, p-value=pr(A|B)=pr(A∩B)/pr(B), which implies that the probability that B is true is pr(B)=pr(A∩B)/pr(A|B). Note that the denominator in that last equation is the p-value.

When we say that a particular result is “due to chance,” we mean that the null hypothesis–that there was no effect–is true. Hence, pr(B) is the probability that the result was “due to chance.” From that last equation above, we can say that a higher p-value means it is more likely that results were “due to chance.” That is in fact all we can say based on the p-value. It’s the absolute strongest claim we can make with the available data.

To see why, suppose that, by some divine act of inspiration, we know that the unconditional probability of both observing A and B being true is 2.5%–that is, pr(A∩B)=0.025. So, it’s fairly unlikely that both A and B are true at the same time. In that case, a p-value of 0.05 implies that the probability that the treatment had no effect was actually pr(B)=0.025/0.05=0.5–a whopping 50 percent! That’s an order of magnitude larger than 1 in 20!

I have no particular objections to using frequentist statistics, or even to using p-values to assess statistical significance, but we do all need to be informal bayesians when we read research. What a p-value means depends in part on our prior beliefs.
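The arithmetic in this comment can be checked with a minimal Python sketch. It follows the comment's own simplification of treating the p-value as pr(A|B) (strictly, a p-value is the probability of results at least as extreme as those observed if the null is true), and the joint probability of 0.025 is the comment's hypothetical figure, not something a study would report. The function name is invented for this illustration.

```python
def prob_null_true(joint_prob: float, p_value: float) -> float:
    """Rearrange pr(A|B) = pr(A and B) / pr(B) to recover pr(B).

    joint_prob: pr(A and B), the hypothetical probability that we observe
                the result AND the treatment has no effect.
    p_value:    treated here as pr(A|B), following the comment above.
    """
    if not 0.0 < p_value <= 1.0:
        raise ValueError("p_value must be in (0, 1]")
    if not 0.0 <= joint_prob <= p_value:
        # pr(A and B) can never exceed pr(A|B), since pr(B) <= 1.
        raise ValueError("joint_prob must be between 0 and p_value")
    return joint_prob / p_value


if __name__ == "__main__":
    pr_b = prob_null_true(joint_prob=0.025, p_value=0.05)
    print(f"pr(B) = {pr_b:.2f}")  # the comment's 50 percent figure
```

The point of the sketch is only that the same p-value of 0.05 is compatible with very different values of pr(B), depending on a quantity the study itself does not supply.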

## Hilda Bastian

March 14, 2015 at 7:40 am

I agree with Matthew that this isn’t right. If significance testing could tell whether or not an effect was occurring by chance, then it would be what people mistakenly think it is! This is just another way of saying the same thing.

Here is one way that I try to get at least close to the technical description of what this testing does, and keep it simple.

If – and that’s a big if – your hypothesis is true, then significance testing will show whether the results are roughly what you would expect to see. It can’t tell you if the hypothesis is true. It can’t tell you if the results were because of chance.

I discuss it in more detail here: http://blogs.plos.org/absolutely-maybe/statistical-significance-and-its-part-in-science-downfalls/

## Sheri Strite

March 16, 2015 at 7:50 pm

To my partner and me, medical science is an attempt to address several questions: 1) Are the results likely to be true?; 2) If yes, are they useful?; 3) If yes, to whom?; 4) If yes, at what price; and, 5) Are they usable (this has to do with ability to understand, access, apply and act upon, etc.)

This post of Gary’s has to do with, “ARE THEY LIKELY TO BE TRUE?” of which statistical significance is a small part. While discussing statistical significance is important, I just want to ensure that readers do not lose sight of the fact that statistical significance is irrelevant if the study is at high or uncertain risk of bias. This is very important and often ignored or not understood. Another important point is that statistical significance should never be confused with clinical significance—or its potential impact on a patient.

My partner and I try to take a fairly rigorous approach to discovering the likelihood of a medical intervention being helpful or harmful for whom (e.g., are the results likely to be true and provide net benefit)? Academic approaches are vitally important. Sometimes exigencies of reality demand other approaches even if not “exact.”

We teach physicians and other health care professionals an applied and practical approach to evaluating medical science with a strong focus on attention to bias. Most health care professionals are not researchers, but people who have to make medical decisions and advise patients. These are also smart people who often have had no real understanding of the difference between an absolute and a relative number, for example. These are people who will be totally stymied by certain complex topics that are extremely hard to apply (if at all) to help inform a medical decision. And they must make some choices including what to say to patients.

Let me give an example. Many meta-analyses report results as odds ratios. Rather than experience helplessness and paralysis because odds are very difficult to deal with, many clinicians convert them into relative risk, which is a probability measure. We point out to them that when outcomes occur commonly (e.g., >5%), odds ratios may then overestimate the effect of a treatment. So should they not do this conversion? I think it is appropriate for them to do the conversion and then make a mental adjustment, knowing they are moving into an area of greater uncertainty because of reporting odds and not probability. To be able to move and act, sometimes one has to start somewhere, keeping in mind that they may be imprecise. And this uncertainty needs to be communicated to a patient.
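The gap between odds ratios and relative risk described above can be sketched in a few lines of Python. This uses the commonly cited approximation RR = OR / (1 − p0 + p0 × OR), where p0 is the outcome rate in the comparison group. The odds ratio of 2.0 and the baseline rates below are made-up values for illustration, and the function name is hypothetical; it is not taken from Delfini's materials.

```python
def odds_ratio_to_relative_risk(odds_ratio: float, baseline_risk: float) -> float:
    """Approximate relative risk from an odds ratio.

    baseline_risk is the outcome rate in the comparison (control) group,
    a number between 0 and 1. Illustrative sketch only: the conversion is
    an approximation and carries its own uncertainty.
    """
    if odds_ratio <= 0:
        raise ValueError("odds_ratio must be positive")
    if not 0.0 <= baseline_risk < 1.0:
        raise ValueError("baseline_risk must be in [0, 1)")
    return odds_ratio / (1.0 - baseline_risk + baseline_risk * odds_ratio)


if __name__ == "__main__":
    # Same hypothetical odds ratio, increasingly common outcomes.
    for p0 in (0.01, 0.05, 0.30):
        rr = odds_ratio_to_relative_risk(2.0, p0)
        print(f"baseline risk {p0:.0%}: OR 2.00 corresponds to RR of about {rr:.2f}")
```

When the outcome is rare, the odds ratio and the relative risk are nearly the same; as the outcome becomes common, the relative risk falls noticeably below the odds ratio, which is the overestimation the paragraph above warns about.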

Another example, in “Why Most Published Research Findings are False.” PLoS Med 2005; 2(8):696-701 PMID: 16060722, Dr. John P. A. Ioannidis calculates the Positive Predictive Value of a well-done epidemiological (observational) study as being 20 percent. Do I think this figure is exactly right? I have no idea—so I am willing to say, “What if he is wrong? If I double this, it is still sufficiently meaningful to me to help me with decision-making.” It still may be even more wrong than that, true. But making no decision is also a decision, and sometimes I am willing to let a potential indicator of direction guide me (considering a variety of issues).
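For readers who want to see where a figure like 20 percent can come from, here is a minimal sketch of the positive predictive value calculation as I understand it from the Ioannidis paper, including his bias term. The inputs are assumptions chosen for illustration: pre-study odds of 1:10 that a tested relationship is real, a 0.05 significance threshold, 80 percent power, and a bias fraction of 0.30. Readers should check the paper itself before relying on the formula or the inputs.

```python
def positive_predictive_value(
    prior_odds: float,    # R: pre-study odds that the relationship is real
    alpha: float = 0.05,  # significance threshold (type I error rate)
    beta: float = 0.20,   # type II error rate, i.e. 1 - power
    bias: float = 0.0,    # u: share of would-be null findings reported as positive
) -> float:
    """Probability that a 'significant' finding is true, with a bias term.

    Sketch of the formula as recalled from Ioannidis (2005); treat the
    exact form as an assumption to verify against the paper.
    """
    true_positives = (1.0 - beta) * prior_odds + bias * beta * prior_odds
    all_positives = (
        prior_odds + alpha - beta * prior_odds
        + bias - bias * alpha + bias * beta * prior_odds
    )
    return true_positives / all_positives


if __name__ == "__main__":
    ppv = positive_predictive_value(prior_odds=0.1, alpha=0.05, beta=0.20, bias=0.30)
    print(f"PPV with assumed inputs: {ppv:.2f}")
    print(f"Doubled, as in the what-if above: {2 * ppv:.2f}")
```

Under these assumed inputs the result lands close to 20 percent, and the doubling what-if in the paragraph above corresponds to roughly 40 percent.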

(Having said all of this, I will add that I do not “buy into” the concept of the “best available evidence” because it is too inexact—and a clinician’s opinion—in my opinion—may be the best choice of action over an observational study that sways one the wrong way.)

Many scientific “answers” in medical science are “inexact.” That is why good clinical trials provide confidence intervals (which are also inexact and built upon a lot of assumptions). We also caution clinicians to not be too married to any results. For example, study bias can distort results, and studies that appear to be at low risk of bias, upon a close review and critical appraisal of the published information, may actually be at great risk of undetected bias. (All of which argues for replicated studies with similar populations, interventions, comparators, outcomes, timing, settings, etc., done by differing interest groups.)

So back to statistical significance. I am not sure if the previous commentators actually read our post at—

http://www.delfini.org/page_Glossary.htm#p-value

Our goal is to help people get at whether they are going to use study results or not. There will usually (maybe always) be uncertainty attached with that—so that needs to be considered by a clinician and communicated to a patient.

The quoted comment in Gary’s blog is within the context of this post at the link above. When we discuss p-values, we provide learners with the information at the link above. We then shrug our shoulders and say that p-values are some indication of the operation of chance upon the results. There is no exactitude in any of this, usually.

The question Gary had raised to us concerned reporting for a potentially serious issue when the p-value was .06 and not .05. My answer quoted above is entirely within the context of his question. Should he or shouldn’t he? We voted, “Should, with caveats”—if this is an observational study, then it should be made clear that the association could be easily due to another cause. To this specific question, we provided him with our link above and tried to provide a practical way—though statedly inexact—to make a choice. I cannot evaluate the math or applicability of the information provided by commentator Matthew—e.g., did this convert to a relative number? Anyway, we have relied on multiple sources for our trying to make sense of p-values, and we think readers may be helped by the information provided by GraphPad:

http://www.graphpad.com/guides/prism/6/statistics/index.htm?what_is_a_p_value.htm

So our point is that sometimes a clinician (or a reporter) might wish to utilize (or report on) an intervention even if the study did not reach statistical significance. They, and patients, should be made aware of the fact they are dealing with greater uncertainty than numbers may “imply.”

If someone has a better approach or solution for us to apply and share with others, we would be delighted as we try to help others navigate this imperfect world.

Our bottom line in response to Gary’s original question to us and similar questions of others: Do you communicate/use study results or not? Statistical significance is one small, complicated piece of one’s decision.

