Understanding medical tests: sensitivity, specificity, and positive predictive value

Headlines touting new medical tests often include impressive-sounding claims of accuracy.

Consider this HealthDay story about an experimental breath test said to be “85% accurate” for the detection of stomach cancer.

To many people, an 85% accuracy rate probably sounds pretty good — and the story about the test seemed to encourage a perception of precision. It speculated that the test could lead to “earlier diagnosis and treatment, and better survival” for individuals with stomach cancer.

But HealthNewsReview.org’s experts were not nearly as optimistic.

In their review of the news release that was the basis for the story, they pointed out that the test, if widely adopted, could conceivably lead to hundreds of false-positive results for every person who is correctly identified with stomach cancer. Those false-positive results weren’t mentioned in either the news release or the story that rehashed it. Our reviewers thought they should have been:

If the release is going to discuss potential unproven benefits, it should also mention the potential harms of screening tests including false-positives, false-negatives leading to over- or under-diagnosis. Chief among these harms would be falsely labeling healthy people as possibly having cancer and then subjecting them to invasive testing or even treatments that turn out to be unnecessary.

Sensitivity and Specificity

What else could have been done differently?

Both the news release and the news story would have been improved with discussion of two important concepts in medical testing: sensitivity and specificity.

They are the yin and yang of the testing world and convey critical information about what a test can and cannot tell us. Both are needed to fully understand a test’s strengths as well as its shortcomings.

Sensitivity measures how often a test correctly generates a positive result for people who have the condition that’s being tested for (also known as the “true positive” rate). A test that’s highly sensitive will flag almost everyone who has the disease and not generate many false-negative results. (Example: a test with 90% sensitivity will correctly return a positive result for 90% of people who have the disease, but will return a negative result — a false-negative — for 10% of the people who have the disease and should have tested positive.)

Specificity measures a test’s ability to correctly generate a negative result for people who don’t have the condition that’s being tested for (also known as the “true negative” rate). A high-specificity test will correctly rule out almost everyone who doesn’t have the disease and won’t generate many false-positive results. (Example: a test with 90% specificity will correctly return a negative result for 90% of people who don’t have the disease, but will return a positive result — a false-positive — for 10% of the people who don’t have the disease and should have tested negative.)
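The two definitions above can be sketched in a few lines of code. This is a minimal, hypothetical illustration (the function names and counts are ours, not from any study), using the same 90% example as the text:

```python
# Illustrative sketch: sensitivity and specificity computed from the
# four possible outcomes of a hypothetical test.

def sensitivity(true_pos, false_neg):
    """True-positive rate: the share of diseased people the test flags."""
    return true_pos / (true_pos + false_neg)

def specificity(true_neg, false_pos):
    """True-negative rate: the share of healthy people the test clears."""
    return true_neg / (true_neg + false_pos)

# Hypothetical test applied to 100 people with the disease and 100 without:
# it catches 90 of the diseased (10 false-negatives) and correctly clears
# 90 of the healthy (10 false-positives).
print(sensitivity(true_pos=90, false_neg=10))  # 0.9 -> 90% sensitivity
print(specificity(true_neg=90, false_pos=10))  # 0.9 -> 90% specificity
```

Note that each statistic is computed over a different group of people: sensitivity only looks at those who have the disease, specificity only at those who don't.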

The following graphic shows how these terms apply to one of the most commonly used tests: a pregnancy test.

It’s important to recognize that sensitivity and specificity exist in a state of balance. Increased sensitivity (the ability to correctly identify people who have the disease) usually comes at the expense of reduced specificity, meaning more false-positives. Likewise, high specificity (when a test does a good job of ruling out people who don’t have the disease) usually means that the test has lower sensitivity, meaning more false-negatives.

Another everyday example

Airport security offers a good example of how these tradeoffs play out in practice. To ensure that truly dangerous items like weapons cannot be brought on board an aircraft, scanners at a security checkpoint may also alarm for harmless items like belt buckles, watches, and jewelry. The scanner prioritizes sensitivity and will flag almost anything that seems like it could be dangerous. But that means it also has low specificity and is prone to false alarms; a positive result is much more likely to be a shampoo bottle than an explosive device.

The same issues crop up when it comes to testing for deadly diseases like cancer. High sensitivity is desirable: missing cases of actual cancer could lead to delays in treatment that negatively affect outcomes. However, specificity is more important with cancer testing than it is at an airport checkpoint: false-positive results create anxiety and lead to unnecessary and invasive follow-up tests like biopsies. They raise costs for everyone involved and increase the likelihood of experiencing harm. Those harms can be significant enough to outweigh the potential benefits of the test. (Prostate specific antigen [PSA] testing is a good example of a low-specificity test that generates many false-positive results.)

Good and bad media models

Stories don’t have to get too technical to address the issues readers need to know about. This NPR story about a blood test for cancer never mentions sensitivity or specificity, and yet it effectively communicates the problems associated with false-positive and false-negative results. It quotes expert sources who warn that a negative result might be giving “false reassurance to the patient” and that a false-positive result might send patients on “needless and expensive medical odysseys.”

This CNN story about a test for Parkinson’s disease also drew kudos from our reviewers. They praised the story’s provision of sensitivity and specificity values, but noted that readers could have been given more context as to what these statistics mean.  “The story could have gone further to explain, for example, that a low specificity test means it will have a high false-positive rate (more people who don’t have the disease are erroneously told that they have it),” they said.

It’s much more common for stories and news releases to overstate the accuracy of tests and gloss over potential harms as we saw in these examples:

  • This Guardian story touted an experimental blood test that was said not only to “detect autism in children” but also “could lead to earlier diagnosis.” Our reviewers noted that the story, based on a study of just 38 children, offered no data to back up its claim; nor did it warn of the harm that a false-positive or false-negative result could inflict on children and their parents.
  • Similar concerns were raised about this USA Today story about a genetic test for breast cancer. Reviewers said the story “doesn’t offer much information that readers can use to make decisions about the use of the recently-approved 23andMe test. For example, what is the rate of false-positive results? Or false negatives? What does that mean for actual risk of developing breast cancer?”
  • A New York Presbyterian Hospital news release touted a test that “detects prostate cancer with 92 percent accuracy.” But as pointed out on our blog, the 92% figure represents the sensitivity of the test – not the accuracy – which is a very different concept.

“The problem with the accuracy statistic is that it’s meaningless,” said Richard Hoffman, MD, MPH, director of the Division of General Internal Medicine at the University of Iowa Carver College of Medicine and the Iowa City VA Medical Center. He noted that in the medical literature, “accuracy” is usually defined as the sum of the true positive and true negative results divided by the sum of all test results. (Read the linked post above for more detail on how this is calculated.)

“This means that even a completely worthless test—unable to detect any patients with disease—would have a high accuracy if most patients do not have the disease. For example, if 10 of 100 patients have the disease, the test detects none of them, the accuracy—based on the true negatives—would be 90%,” he explained.
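Hoffman's point can be checked with a few lines of arithmetic. This sketch (our code, using the round numbers from his example) shows how a test that detects no disease at all still scores 90% on the medical literature's accuracy formula:

```python
# Sketch of the "accuracy paradox": a worthless test that never detects
# disease can still look highly accurate if the disease is uncommon.

def accuracy(true_pos, true_neg, total):
    """Medical-literature definition: (TP + TN) / all test results."""
    return (true_pos + true_neg) / total

# 100 patients, 10 with the disease. The test misses all of them (TP = 0)
# but, by always saying "negative," correctly labels the 90 healthy people.
print(accuracy(true_pos=0, true_neg=90, total=100))  # 0.9 -> "90% accurate"
```

The 90% figure comes entirely from the true negatives; the test contributed nothing toward finding the 10 people who were actually sick.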

Many stories aren’t clear on whether they’re using the medical literature’s definition of accuracy or using the term accuracy more generally — which can further add to the confusion.

What’s a ‘good’ test? It depends

The ideal test is one that has both high sensitivity and high specificity, but the value of a test depends on the situation, says Hoffman.

Generally speaking, “a test with a sensitivity and specificity of around 90% would be considered to have good diagnostic performance—nuclear cardiac stress tests can perform at this level,” Hoffman said.

But just as important as the numbers, it’s crucial to consider what kind of patients the test is being applied to. Hoffman noted that even a good test won’t offer much useful information if you’re testing the wrong population.

“If you’re testing people who you know going in are very likely to have the disease, they’re still likely to have the disease even if the test comes up negative,” he said.

The same is true of positive tests in people who are very unlikely to have the disease: “Just because the test comes up positive, that won’t give you much confidence that they have the disease if the prevalence of disease is very low in the patients that you’re testing.”  As with an airport scanner looking for weapons, there’s a good chance any positive result is merely a false alarm.

The issue of false alarms is especially important when screening for diseases, such as cancer and HIV, in apparently healthy people who have a low likelihood of having the disease. In those cases the testing is done sequentially in a two-step process, Hoffman said.

“The initial tests are selected because they have high sensitivity (>99% in the case of HIV tests),” he said. “The expectation is that these tests do not miss patients with disease–and that all of those with positive tests (which could be a large proportion) will then undergo the highly-specific diagnostic gold standard test to confirm the diagnosis.”

The second step is meant to rule out the many false-positives resulting from the first test.

Diagnosis vs. screening – a critical distinction

This brings us back to the stomach cancer breath test discussed at the top of this post.

Researchers claimed that the test could identify stomach cancer in otherwise healthy-seeming people who showed no signs of disease. Again, this refers to screening – which is finding early, non-symptomatic cases of disease in the general population.  That’s different from diagnosis – which is when doctors try to find out exactly what’s wrong in people who are already complaining of symptoms.

Mammograms are used to screen for breast cancer; a positive result requires follow-up with an invasive breast biopsy to confirm the diagnosis.

Although the HealthDay story made claims about the test’s ability to screen for cancer, the underlying study didn’t look at healthy people. About half of the samples tested came from people who were already known to have cancer, and most of those cases were in the advanced stages. While the test seemed to perform reasonably well in this population where most of the people had cancer (about 80% sensitivity and 80% specificity), applying the test to a healthy population would likely generate a disastrous outcome.

Our reviewers ran some hypothetical numbers on a healthy population where the stomach cancer rate is lower – say 1 out of 1,000. (They used round numbers for the purposes of explanation.) They calculated that for a test with 80% specificity (which corresponds to a 20% false-positive rate), there would be 200 false-positive results for every cancer that is accurately identified! This means that 200 people would suffer the anxiety of being told they may have stomach cancer, and then be referred for additional invasive testing to confirm or rule out the possibility of cancer.
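Here is a rough sketch of that calculation (our code, using the reviewers' round numbers of a 1-in-1,000 cancer rate and roughly 80% sensitivity and specificity):

```python
# Hypothetical screening of 1,000 apparently healthy people,
# assuming a cancer rate of 1 in 1,000 and ~80% sensitivity/specificity.
population = 1_000
sick = 1
healthy = population - sick

sens = 0.8   # ~80% sensitivity: share of sick people correctly flagged
spec = 0.8   # ~80% specificity: 20% of healthy people get a false alarm

true_positives = sick * sens              # 0.8 -> roughly 1 cancer found
false_positives = healthy * (1 - spec)    # 999 * 0.2 = 199.8 -> ~200 people

print(round(false_positives))  # ~200 false alarms for each cancer found
```

The false alarms come from the sheer size of the healthy group: even a modest 20% error rate applied to 999 healthy people swamps the single true case.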

Positive predictive value

This example raises a question that is usually top of mind for readers, but often not addressed in news stories: If I test positive for a disease, what are the chances that I actually have the condition that I was tested for?

Positive predictive value (PPV) – a statistic that combines sensitivity and specificity with how common the condition is in the population being tested – offers an answer to that question.

In the breath test example, our reviewers calculated 200 false-positives for every person correctly diagnosed with disease. This means that the likelihood of a positive result correctly indicating disease is only 1 out of 201 or 0.5%. Not very good!
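The 0.5% figure falls out of a simple ratio. This is a minimal sketch (our code, reusing the 200-to-1 round numbers from the breath test example):

```python
# Hypothetical sketch: positive predictive value from the breath-test example.

def ppv(true_positives, false_positives):
    """Chance that a positive result actually indicates disease."""
    return true_positives / (true_positives + false_positives)

# 1 true positive for every 200 false positives, as the reviewers calculated:
print(ppv(true_positives=1, false_positives=200))  # 1/201, about 0.5%
```

Notice that sensitivity and specificity never appear directly here; they (together with prevalence) determine the true- and false-positive counts, and the PPV is simply what fraction of all positives are real.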

How could a test that’s “85% accurate” in a study turn out to be so dismally inaccurate in practice?

“The problem is that tests are often evaluated in populations with a high prevalence of disease which inflates the positive predictive value,” says Hoffman. “When applied to a lower risk population the predictive value drops” because there are more false-positives. “This is particularly a problem when you are talking about screening, where the prevalence of disease in the population is usually quite low.  This has important public health implications because the number of false-positive tests can be in the hundreds of thousands or even millions—and each of those patients will be advised to get the gold standard test.”

That’s one of the reasons why screening for disease in healthy people is so fraught; such tests inevitably flag many people who have nothing to worry about and turn them into patients.

(For reference, the PPV of a PSA test for prostate cancer is about 30%, whereas the PPV of mammograms for breast cancer is said to range from 4.3% to 52.4% depending on the expertise of the radiologist interpreting the image.)

The bottom line on medical tests

While many journalists are enamored with the thought of “simple” blood tests, medical testing is complicated. Readers aren’t well served by stories that blithely tout accuracy figures that don’t reflect reality.

Consumers and journalists can better inform themselves by always considering the sensitivity and specificity of the test, as well as the flip side of those statistics – the false-negative and false-positive rates. 

Another useful barometer is the positive predictive value, which reflects the likelihood that a positive test result correctly indicates the presence of the disease.

Finally, journalists should always ask questions about the population that was studied, and whether those people are comparable to the people who would be tested in the real world. And remember that a test that’s reasonable to use in people who already have symptoms of disease (i.e. for diagnosis) may not be useful in people who seem healthy (i.e. for screening). News stories about medical tests must be mindful of the distinction.

Go back to our toolkit for understanding medical evidence