# Bayes Theorem applied to biomarker design

In an **article we posted on “The Digital Biologist”**, we gave a very brief and simple introduction to Bayes’ Theorem, using cancer biomarkers as an example of one of the many ways in which the theorem can be applied to the evaluation of data and evidence in research. Bayesian analysis can be a genuinely useful and practical computational tool in life science R&D, and in this brief article we will develop and use a tiny snippet of **Python** code to demonstrate its application to biomarkers.

Let’s suppose that we are a company determined to develop a more reliable biomarker than **CA-125** for the early detection of **ovarian cancer**. One direction we might pursue is to identify a biomarker that predicts disease in actual sufferers with a higher frequency, i.e. a biomarker with a better true positive hit rate. We saw in the **introductory article** that CA-125 only predicts the disease in about 50% of sufferers for stage I ovarian cancer and in about 80% of sufferers for stage II and beyond. One of the dilemmas faced by physicians working in the oncology field is that biomarkers like CA-125 can be poorly predictive of the disease in the early stages, when the prognosis and options for treatment are better. It’s disheartening for both the patient and the physician to be able to get a reliable diagnosis only when the disease has already progressed to the point at which there are fewer good options for treatment.

Ovarian cancer affects about 1.5% of the population and the CA-125 test is predictive of the disease in about 80% of sufferers, which sounds pretty good, right? Most people would probably look at those numbers and conclude that a positive test for the biomarker was associated with something like an 80% chance of having the disease.

Not so fast!

The CA-125 test also produces false positives in about 4% of patients without the disease (i.e. the test result indicates disease where none is actually present). Armed with these numbers, we might feel that CA-125 has a pretty bright future in the clinic based upon the following reasoning – it will only fail to detect the disease in about 2 out of 10 sufferers and it will only produce a misdiagnosis of the disease in about 4 out of 100 healthy patients.

Once again – not so fast.

Let's develop a very simple Bayesian model with just a few lines of **Python**, and then we can plug in these numbers and see what we get.

We'll write a very simple Python function that takes as input the rate of occurrence of the disease in the population, the rate of true positive test results, and the rate of false positive test results. The function returns the actual probability of having the disease given a positive test result. Using the discrete form of Bayes’ Theorem, **p(B|A) = p(A|B) × p(B) / p(A)**, we will be formulating the problem like this:

- p(B) = probability of having the disease
- p(A) = probability of a positive test result
- p(B|A) = probability of having the disease given a positive test result
- p(A|B) = probability of getting a positive test result given that you have the disease

Here's the Python function:

```python
def biomarker(p_disease, p_true_positive, p_false_positive):
    # probability of not having the disease
    p_healthy = 1.0 - p_disease
    # p(A): total probability of a positive test result
    p_positive = p_true_positive * p_disease + p_false_positive * p_healthy
    # p(B|A): probability of disease given a positive test result
    return (p_true_positive * p_disease) / p_positive
```

Firstly, you will notice that in order to calculate the denominator **p(A)** in the discrete Bayesian equation, which corresponds to the probability of a positive test result, we need to calculate the cumulative probability of getting *any* positive test result - in other words, the *total probability* of a positive test result across healthy *and* diseased patients. The probability of being healthy (i.e. not having the disease) is simply **1.0 - p(B)**. Once we have this value (**p_healthy** in the Python function), we can calculate the cumulative probability of any kind of positive test result. We already know **p(A|B)**, the probability of getting a positive test result when you have the disease, because we passed this to the function as **p_true_positive**. Similarly, we also know **p(B)**, the probability of having the disease, since this is also passed to the function as **p_disease**. Once we have compiled the cumulative probability of getting any kind of positive test result (**p_positive**), we are ready to deliver the result.

You can test the numbers for yourself by calling this function: just pass in the fractional occurrence rate of your disease in the population, and the fractional rates of true and false positives, to calculate the probability that a patient has the disease given a positive result from your test.

Let's plug in **the real numbers for the CA-125 test for ovarian cancer** ...

- p_disease (p(B)) = 0.015
- p_true_positive = 0.8
- p_false_positive = 0.04

and when we run the code we get ...

Probability of disease given a positive test result = 0.23

Wow!

There is only a probability of about 23% of actually having ovarian cancer given a positive CA-125 test - in other words, a patient with a positive test result has only about 1 chance in 4 of actually having the disease.
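To make the arithmetic concrete, here is the calculation step by step with the CA-125 numbers (the function from above is restated so the snippet runs on its own):

```python
def biomarker(p_disease, p_true_positive, p_false_positive):
    p_healthy = 1.0 - p_disease
    p_positive = p_true_positive * p_disease + p_false_positive * p_healthy
    return (p_true_positive * p_disease) / p_positive

# The CA-125 numbers from above
p_disease = 0.015        # 1.5% of the population has the disease
p_true_positive = 0.8    # 80% of sufferers test positive
p_false_positive = 0.04  # 4% of healthy patients test positive

# p(A): total probability of any positive test result
p_positive = p_true_positive * p_disease + p_false_positive * (1.0 - p_disease)
print(round(p_positive, 4))  # 0.0514

# p(B|A): probability of disease given a positive result
print(round(biomarker(p_disease, p_true_positive, p_false_positive), 2))  # 0.23
```

Notice that most of the denominator (0.0394 of the 0.0514) comes from false positives among the healthy population, which is why the final probability is so much lower than intuition suggests.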

Let's come back to our hypothetical diagnostic company seeking to develop a better ovarian cancer test, and let's say that the company's scientists decide that they have two potential approaches: **increase** the true positive rate of the test, or **decrease** the false positive rate.

So imagine that our hypothetical diagnostic company has made a heavy R&D investment in identifying a biomarker with a much better rate of true positives. If our new biomarker has a true positive rate of 95% (a fairly significant improvement on our previous value of about 80%) and the same roughly 4% false positive rate as previously, how much better off are we?

If we plug the numbers into our Bayesian model, the answer is “not much”:

Probability of disease given a positive test result = 0.27

Even with such a dramatic improvement in the rate of true positives, the new prediction is barely any more accurate than the old one.

**What if instead of pursuing a better true positive hit rate, our company had invested in reducing the false positive test rate?**

Without altering the true positive rate of about 80%, let's imagine that they are able to reduce the biomarker’s false positive rate from about 4% to 1%.

Probability of disease given a positive test result = 0.55

Much better but still not great. What about a false positive rate of 0.5%?

Probability of disease given a positive test result = 0.71
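The scenarios above can be compared in one short sweep (again restating the function so the snippet runs on its own):

```python
def biomarker(p_disease, p_true_positive, p_false_positive):
    p_healthy = 1.0 - p_disease
    p_positive = p_true_positive * p_disease + p_false_positive * p_healthy
    return (p_true_positive * p_disease) / p_positive

# (true positive rate, false positive rate) for each scenario
scenarios = [
    (0.80, 0.04),   # the original CA-125 numbers
    (0.95, 0.04),   # improved true positive rate
    (0.80, 0.01),   # reduced false positive rate
    (0.80, 0.005),  # further reduced false positive rate
]

for tp, fp in scenarios:
    p = biomarker(0.015, tp, fp)
    print(f"TP {tp:.0%}, FP {fp:.1%}: p(disease | positive) = {p:.2f}")
    # prints 0.23, 0.27, 0.55 and 0.71 in turn
```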

The Bayesian model clearly tells us that, in the case of ovarian cancer, our hypothetical company is much better off investing its R&D dollars in the pursuit of lower false positive test rates rather than higher true positive test rates. Even a 99% true positive test rate barely shifts the probabilities associated with a positive test result, whereas getting the false positive test rate down to 1% improves the probability of a true diagnosis from less than 1 in 4 to better than even. Even this scenario, however, is far from ideal.

If you look at the actual numbers in the model with regard to the populations of tested patients with and without the disease, there is another valuable lesson to be learned, and it is one that illuminates the reason why improving the true positive test rate while ignoring the false positive test rate is less than optimal.

**It is the overwhelmingly larger population of healthy patients versus those with the disease, that is skewing the probability numbers against us - and the lower the incidence of the disease, the worse this problem will be.**

If ovarian cancer had a higher incidence of say, 1 in 10 women instead of 1 in 72 as is actually the case, a positive test result with CA-125 would correspond to an almost 70% probability of the patient actually having the disease. By contrast, if the ovarian cancer incidence was 1 in 1000 women, a positive test result with CA-125 would still correspond to less than 1 chance in 50 of the patient actually having the disease.
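The effect of disease incidence can be checked with the same function, holding the CA-125 test characteristics fixed and sweeping over the three incidence rates discussed above (the function is restated so the snippet runs on its own):

```python
def biomarker(p_disease, p_true_positive, p_false_positive):
    p_healthy = 1.0 - p_disease
    p_positive = p_true_positive * p_disease + p_false_positive * p_healthy
    return (p_true_positive * p_disease) / p_positive

# The CA-125 test characteristics, held fixed
TP, FP = 0.8, 0.04

for incidence in (1 / 10, 1 / 72, 1 / 1000):
    p = biomarker(incidence, TP, FP)
    print(f"incidence 1 in {round(1 / incidence)}: p(disease | positive) = {p:.2f}")
    # prints 0.69, 0.22 and 0.02 in turn
```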

**The lower the incidence of the disease you want to diagnose, the correspondingly lower your false positive test rate needs to be.**