 Conditional Probabilities and Bayes Theorem

I’ve been getting a lot of questions from friends lately about what Bayes Theorem means. The confusion is understandable because it appears in a few models that seem to be completely unrelated to each other. For example, Naive Bayes Classifiers and Bayesian Networks operate on quite different principles. The confusion is compounded by a lack of understanding of unconditional and conditional probabilities.

In this article I offer a tutorial to help bring the lay person up to speed with some basic understanding on these concepts and how Bayes Theorem can be applied.

Probability Space

We have to start by explaining a few terms and how they are used.

A Random Trial is a trial where we perform some random experiment. For example it might be flipping a coin or rolling dice.

The Sample Space of a Random Trial, typically denoted by $\Omega$, represents all possible outcomes for the Random Trial being performed. So for flipping a coin the outcome can be either heads or tails, so the Sample Space would be a set containing only these two values.

$$\Omega = \{ Heads, Tails \}$$

For rolling dice the Sample Space would be the set of all the various faces for the dice being rolled. When rolling only one standard six-sided die the Sample Space would be as follows.

$$\Omega = \{1, 2, 3, 4, 5, 6\}$$

In both of these examples the Random Trial being performed will select an outcome from its Sample Space at random. In these trials each outcome has an equal chance of being selected, though that does not necessarily need to be the case.
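As a rough illustration, a Random Trial over a Sample Space with equally likely outcomes can be sketched in Python. The function name `random_trial` and the fixed seed are assumptions made for this example only.

```python
import random

# Sample Spaces for the two examples above, stored as lists so that
# random.Random.choice can index into them.
coin = ["Heads", "Tails"]
die = [1, 2, 3, 4, 5, 6]

def random_trial(sample_space, rng):
    """Select one outcome from the Sample Space, each with equal chance."""
    return rng.choice(sample_space)

# A seeded generator keeps the example reproducible; the seed is arbitrary.
rng = random.Random(0)
flip = random_trial(coin, rng)
roll = random_trial(die, rng)
print(flip, roll)
```

Whatever the outcome is, it is always a member of the Sample Space it was drawn from.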

For our purposes here I want to formulate a Random Trial thought experiment that simulates a medical trial consisting of 10 patients. We will be using this example throughout much of this tutorial. Our Sample Space will therefore be a set consisting of 10 elements, where each element represents a single unique patient in the trial. Patients are represented with the variable x with a subscript from 1 to 10 that uniquely identifies each patient.

$$\Omega = \{x_1, x_2, x_3, x_4, x_5, x_6, x_7, x_8, x_9, x_{10}\}$$

An Event in the context of probabilities is a set of outcomes that can be satisfied. Typically these sets are represented using lowercase Greek letters such as $\alpha$ or $\beta$. For example, if we were rolling a single die and wanted it to land on an odd number then the Event representing an odd outcome would be represented as the following set.

$$\alpha = \{1, 3, 5\}$$

Similarly if we simply wanted to roll the number 6 then the Event would be a set containing only that one number.

$$\alpha = \{6\}$$

The Event Space, often denoted by $\mathcal{F}$, is the set of all Events to be observed. It is a set of sets whose members are subsets of the Sample Space: either every possible subset or some part thereof. Not all Events in the Event Space need to be possible, however. For example, if we are talking about flipping a coin then the Event Space would have, at most, 4 members representing outcomes for Heads, Tails, either Heads or Tails, and neither Heads nor Tails. We could represent this with the following set notation.

$$\mathcal{F} = \{\{\}, \{Heads\}, \{Tails\}, \{Heads, Tails\}\}$$

The empty set is usually represented with the $\emptyset$ symbol. So the previous set can be rewritten using this shorthand as follows.

$$\mathcal{F} = \{\emptyset, \{Heads\}, \{Tails\}, \{Heads, Tails\}\}$$

Notice that one of the members of the Event Space is equivalent to the Sample Space. The fact that the Event Space contains both the empty set and the Sample Space as members is more a matter of mathematical completeness and plays a role in making some mathematical proofs easier to carry out. For our purposes here they will largely be ignored.

At this point I want to go over a little bit of mathematical notation that may help when reading other texts on the subject. The first is the concept of a Power Set. The Power Set of a set is simply the set of all of its subsets, including the empty set and the set itself. In the coin toss example above, the Event Space specified is the Power Set of the Sample Space. The notation for the Power Set is the number 2 with the set as its exponent. For example, shorthand for the above Event Space definition could have been the following.

$$\mathcal{F} = 2^{\Omega}$$

Every Event Space must be either equal to, or a subset of, the Power Set of the Sample Space. We can represent that with the following set notation.

$$\mathcal{F} \subseteq 2^{\Omega}$$
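To make the Power Set concrete, here is a minimal Python sketch that enumerates it for the coin toss Sample Space. The helper name `power_set` is an assumption of this example, and `frozenset` is used only because Python sets cannot contain ordinary mutable sets.

```python
from itertools import chain, combinations

def power_set(sample_space):
    """Return every subset of sample_space as a set of frozensets."""
    items = list(sample_space)
    # Build subsets of size 0, 1, ..., len(items) and chain them together.
    subsets = chain.from_iterable(
        combinations(items, size) for size in range(len(items) + 1)
    )
    return {frozenset(subset) for subset in subsets}

omega = {"Heads", "Tails"}
event_space = power_set(omega)

# A set with n members has 2^n subsets, which is where the 2^Omega notation comes from.
print(len(event_space) == 2 ** len(omega))  # True
print(frozenset() in event_space)           # True: the empty set is a member
print(frozenset(omega) in event_space)      # True: so is the Sample Space itself
```

For the six-sided die the same function would return 64 Events, which is why the Event Space is usually described rather than written out in full.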

Going back to our example of patients in a clinical trial, we might want to know what the chance is of selecting a patient at random that has a fever. In that case the Event would be the set of all patients that have a fever and the outcome would be a single patient selected at random. Each Event is an element of the Event Space, so we will denote it as $\mathcal{F}$ with a subscript. This is easier to read than the arbitrary lowercase Greek letters that are the usual convention. If 3 of the 10 patients in our Sample Space have a fever we can represent the fever Event as follows.

$$\mathcal{F}_{fever} = \{x_1, x_6, x_8\}$$

This means that if we select a patient at random and that patient is a member of the $\mathcal{F}_{fever}$ set then that patient has a fever and the outcome has satisfied the event. Similarly we can define the event representing patients that have the flu with the following notation.

$$\mathcal{F}_{flu} = \{x_2, x_4, x_6, x_8\}$$

As stated earlier Events are simply members of the Event Space. This can be indicated using the following set notation which simply states that the flu Event is a member of the Event Space and the fever Event is also a member of the Event Space.

$$\mathcal{F}_{fever} \in \mathcal{F}$$

$$\mathcal{F}_{flu} \in \mathcal{F}$$

Similarly, if we wish to indicate that the fever and flu Events are subsets of the Sample Space we can do so using the following notation.

$$\mathcal{F}_{fever} \subset \Omega$$

$$\mathcal{F}_{flu} \subset \Omega$$
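The clinical trial can be written down directly with Python sets. This is only a sketch: the string labels `"x1"` to `"x10"` and the helper `satisfies` are stand-ins chosen for this example.

```python
# Sample Space: ten patients, labelled with hypothetical strings "x1".."x10".
omega = {f"x{i}" for i in range(1, 11)}

# Events: the patients with a fever and the patients with the flu.
fever = {"x1", "x6", "x8"}
flu = {"x2", "x4", "x6", "x8"}

# Each Event is a subset of the Sample Space.
print(fever <= omega, flu <= omega)  # True True

def satisfies(outcome, event):
    """An outcome satisfies an Event when it is a member of that Event."""
    return outcome in event

print(satisfies("x6", fever))  # True: patient x6 has a fever
print(satisfies("x2", fever))  # False: patient x2 does not
```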

The only term left to define is the Probability Space. This is just the combination of the Event Space, the Sample Space, as well as the probability of each of the Events taking place. It represents all the information we need to determine the chance of any possible outcome occurring. It is denoted as a 3-tuple containing these three things. The probability, P, represents a function that maps Events in $\mathcal{F}$ to probabilities.

$$(\Omega, \mathcal{F}, P)$$

Unconditional Probability

This is where things get interesting. Since we have all the important terms defined we can start talking about actual probabilities. We start with the simplest type of probability, the Unconditional Probability. This is the sort of probability most people are familiar with. It is the chance that an Event will be satisfied without taking any other Events into account. For example, if I flip a coin I can say the probability of it landing on Heads is 50%; this would be an Unconditional Probability.

If all the outcomes in our Sample Space have the same chance of being selected by a Random Trial then calculating the Unconditional Probability is rather easy. The Event would represent all desired outcomes from our Sample Space. So if we wanted to flip a coin and get heads then our Event is a set with a single member and the Sample Space consists of only two members, the possible outcomes. We can write this as the following.

$$\Omega = \{Heads, Tails\}$$
$$\mathcal{F}_{Heads} = \{Heads\}$$

If the above Event is satisfied by the flip of a coin it means the outcome of the coin toss was heads. To calculate the probability for this Event we simply count the number of members in the Event and divide it by the number of members in the Sample Space. In this case the result is 50%, which we can write as follows.

$$P(\mathcal{F}_{Heads}) = \frac{1}{2}$$

The number of members in a set is called its Cardinality. It is written by placing vertical bars, the same symbol used for absolute value, around the set. Therefore we can represent the previous equation using the following notation.

$$P(\mathcal{F} _{Heads}) = \frac{\mid \mathcal{F} _{Heads} \mid}{\mid \Omega \mid} = \frac{1}{2}$$

We can generalize this for any Event represented as $\mathcal{F}_{i}$ with the following definition for calculating an Unconditional Probability.

$$P(\mathcal{F} _{i}) = \frac{\mid \mathcal{F} _{i} \mid}{\mid \Omega \mid}$$

Now let’s apply this to our clinical trial example from earlier. Say we wanted to calculate the chance of selecting someone from the 10 patients in the trial, at random, such that the person selected has a fever. We can calculate that with the following.

$$P(\mathcal{F} _{fever}) = \frac{\mid \mathcal{F} _{fever} \mid}{\mid \Omega \mid} = \frac{3}{10}$$

We can also do the same for calculating the chance of randomly selecting a patient that has the flu.

$$P(\mathcal{F} _{flu}) = \frac{\mid \mathcal{F} _{flu} \mid}{\mid \Omega \mid} = \frac{4}{10} = \frac{2}{5}$$
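The counting rule above translates almost word for word into code. The following sketch assumes every outcome is equally likely, as in the text; the function name `probability` and the patient labels are choices made for this example. `fractions.Fraction` is used so the results stay exact rather than becoming rounded decimals.

```python
from fractions import Fraction

omega = {f"x{i}" for i in range(1, 11)}   # the 10 patients
fever = {"x1", "x6", "x8"}
flu = {"x2", "x4", "x6", "x8"}

def probability(event, sample_space):
    """Unconditional Probability: |event| / |sample_space|."""
    return Fraction(len(event), len(sample_space))

print(probability(fever, omega))  # 3/10
print(probability(flu, omega))    # 2/5
```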

Conditional Probability

A Conditional Probability takes this idea one step further. A Conditional Probability specifies the probability of an event being satisfied if it is known that another event was also satisfied. For example using our clinical trial thought experiment one might ask what is the probability of someone having the flu if we know that person has a fever. This would be represented with the following notation.

$$P(\mathcal{F} _{flu} \mid \mathcal{F} _{fever})$$

Assuming that having a fever has some effect on the likelihood of having the flu, this probability would be different from the chance of just any randomly selected patient having the flu. After all, people with fevers are more likely to have the flu than people without a fever.

Since we already know which patients have the flu and which have a fever it is easy to determine an answer to this question. To calculate the probability we can look at how many patients in our Sample Space have a fever and what percentage of those patients with fever also have the flu. By looking at the data we can see that there are 3 patients with fevers and of those patients only 2 of them have the flu. So the answer is $\frac{2}{3}$.

$$\mathcal{F} _{fever} = \{x_1, x_6, x_8\}$$

$$\mathcal{F} _{flu} = \{x_2, x_4, x_6, x_8\}$$

$$P(\mathcal{F} _{flu} \mid \mathcal{F} _{fever}) = \frac{2}{3}$$

We can generalize this statement by saying that we take the intersection of the sets that represent the Event for patients with the flu and patients with a fever. The intersection is just the set of all the elements that those two sets have in common.

The symbol for intersection is $\cap$, therefore we can show the intersection of these two sets as follows.

$$\mathcal{F} _{flu} \cap \mathcal{F} _{fever} = \{x_6, x_8\}$$

Another way to look at calculating the Conditional Probability would be to take the Cardinality of the intersection of these two Events and divide it by the cardinality of the conditional Event that has been satisfied. So now we have the following.

$$P(\mathcal{F} _{flu} \mid \mathcal{F} _{fever}) = \frac{\mid \mathcal{F} _{flu} \cap \mathcal{F} _{fever} \mid}{\mid \mathcal{F} _{fever} \mid} = \frac{2}{3}$$

We can also ask a similar, but markedly different, question. If we know a patient has the flu, what is the chance that the same patient has a fever? For this we can use the same logic as above and come up with the following.

$$P(\mathcal{F} _{fever} \mid \mathcal{F} _{flu}) = \frac{\mid \mathcal{F} _{flu} \cap \mathcal{F} _{fever} \mid}{\mid \mathcal{F} _{flu} \mid} = \frac{2}{4} = \frac{1}{2}$$

As you can see the only thing that changed is the denominator which is now the Cardinality of the flu Event rather than the fever Event. We can generalize the equation for calculating a Conditional Probability as follows.

$$P(\mathcal{F} _{i} \mid \mathcal{F} _{j}) = \frac{\mid \mathcal{F} _{i} \cap \mathcal{F} _{j} \mid}{\mid \mathcal{F} _{j} \mid}$$
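The generalized equation can be sketched in Python using set intersection, again assuming equally likely outcomes. The name `conditional_probability` and the patient labels are assumptions of this example.

```python
from fractions import Fraction

fever = {"x1", "x6", "x8"}
flu = {"x2", "x4", "x6", "x8"}

def conditional_probability(event_i, event_j):
    """P(event_i | event_j) = |event_i ∩ event_j| / |event_j|."""
    if not event_j:
        # Conditioning on an Event that can never be satisfied is undefined.
        raise ValueError("cannot condition on an empty Event")
    return Fraction(len(event_i & event_j), len(event_j))

print(conditional_probability(flu, fever))  # 2/3
print(conditional_probability(fever, flu))  # 1/2
```

Swapping the two arguments changes only the denominator, which is why the two questions have different answers.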

Bayes Theorem

Bayes Theorem itself is remarkably simple on the surface yet immensely useful in practice. In its simplest form it lets us calculate a Conditional Probability when we have limited information to work with. If we only knew, for example, the probabilities $P(\mathcal{F} _{j} \mid \mathcal{F} _{i})$, $P(\mathcal{F} _{i})$, and $P(\mathcal{F} _{j})$, then using Bayes Theorem we could calculate $P(\mathcal{F} _{i} \mid \mathcal{F} _{j})$. The precise equation for Bayes Theorem is as follows.

$$P(\mathcal{F} _{i} \mid \mathcal{F} _{j}) = \frac{ P(\mathcal{F} _{j} \mid \mathcal{F} _{i}) \cdot P(\mathcal{F} _{i}) }{ P(\mathcal{F} _{j}) }$$
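One way to see where this equation comes from, assuming the equally likely outcomes used throughout this tutorial, is to start from the counting definition of Conditional Probability and divide both the numerator and the denominator by $\mid \Omega \mid$.

$$P(\mathcal{F} _{i} \mid \mathcal{F} _{j}) = \frac{\mid \mathcal{F} _{i} \cap \mathcal{F} _{j} \mid}{\mid \mathcal{F} _{j} \mid} = \frac{\mid \mathcal{F} _{i} \cap \mathcal{F} _{j} \mid / \mid \Omega \mid}{\mid \mathcal{F} _{j} \mid / \mid \Omega \mid} = \frac{P(\mathcal{F} _{i} \cap \mathcal{F} _{j})}{P(\mathcal{F} _{j})}$$

Applying the same step with the roles of the two Events swapped, and then multiplying both sides by $P(\mathcal{F} _{i})$, gives an expression for the probability of the intersection.

$$P(\mathcal{F} _{i} \cap \mathcal{F} _{j}) = P(\mathcal{F} _{j} \mid \mathcal{F} _{i}) \cdot P(\mathcal{F} _{i})$$

Substituting this into the numerator of the first line yields Bayes Theorem.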

Let’s say we didn’t know all the details of the clinical trial from earlier; we have no idea what the Sample Space is or what members belong to each Event set. All we know is the probability that someone will have a fever at any given time, the probability they will have the flu, and the probability that someone with the flu has a fever. From this limited information, using Bayes Theorem, it is possible to infer the probability of having the flu if you have a fever. First let’s copy the probabilities we know to match what we previously calculated manually.

$$P(\mathcal{F} _{fever}) = \frac{3}{10}$$

$$P(\mathcal{F} _{flu}) = \frac{2}{5}$$

$$P(\mathcal{F} _{fever} \mid \mathcal{F} _{flu}) = \frac{1}{2}$$

Using only this information, along with Bayes Theorem, we can calculate the probability of someone having the flu if they have a fever as follows.

$$P(\mathcal{F} _{flu} \mid \mathcal{F} _{fever}) = \frac{ P(\mathcal{F} _{fever} \mid \mathcal{F} _{flu}) \cdot P(\mathcal{F} _{flu}) }{ P(\mathcal{F} _{fever}) }$$

$$P(\mathcal{F} _{flu} \mid \mathcal{F} _{fever}) = \frac{ \frac{1}{2} \cdot \frac{2}{5} }{ \frac{3}{10} }$$

$$P(\mathcal{F} _{flu} \mid \mathcal{F} _{fever}) = \frac{2}{3}$$

This solution of course agrees with our earlier results when we were able to calculate the answer by manually counting the data. However, this time we did not have to use the data directly.
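The same calculation can be sketched in Python using only the three known probabilities. The function name `bayes` and its parameter names are assumptions of this example, and `Fraction` keeps the arithmetic exact.

```python
from fractions import Fraction

def bayes(p_j_given_i, p_i, p_j):
    """Bayes Theorem: P(i | j) = P(j | i) * P(i) / P(j)."""
    return p_j_given_i * p_i / p_j

# The only inputs are the three probabilities listed in the text.
p_fever = Fraction(3, 10)
p_flu = Fraction(2, 5)
p_fever_given_flu = Fraction(1, 2)

p_flu_given_fever = bayes(p_fever_given_flu, p_flu, p_fever)
print(p_flu_given_fever)  # 2/3
```

Note that no patient data appears anywhere in the code, only the probabilities.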

Let’s do one more example to drive the point home. Say we have a test for Tuberculosis, TB, that is 95% accurate. That is to say that if you have TB then 95% of the time the test will give you a positive result. Similarly, if you do not have TB then 95% of the time you will get a negative result. We can represent the first of these statements as follows.

$$P(\mathcal{F} _{positive} \mid \mathcal{F} _{infected}) = \frac{19}{20}$$

Furthermore let’s say we know that only one in a thousand members of the population are infected with TB at any one time. We can demonstrate this as follows.

$$P(\mathcal{F} _{infected}) = \frac{1}{1000}$$

Finally let’s say when tested on the general population that 509 out of every 10,000 people received a positive result. We can represent that with the following.

$$P(\mathcal{F} _{positive}) = \frac{509}{10000}$$

With this information it is possible to calculate the probability someone will have TB if they receive a positive test result. Using Bayes Theorem we can solve for the probability as follows.

$$P(\mathcal{F} _{infected} \mid \mathcal{F} _{positive}) = \frac{ P(\mathcal{F} _{positive} \mid \mathcal{F} _{infected}) \cdot P(\mathcal{F} _{infected}) }{ P(\mathcal{F} _{positive}) }$$

$$P(\mathcal{F} _{infected} \mid \mathcal{F} _{positive}) = \frac{ \frac{19}{20} \cdot \frac{1}{1000} }{ \frac{509}{10000} }$$

$$P(\mathcal{F} _{infected} \mid \mathcal{F} _{positive}) = \frac{ 19 }{ 1018 } \approx 0.018664 = 1.8664\%$$

This gives us a very surprising result. It says that of the people who take the TB test and show up positive less than 2% of them actually have TB. This demonstrates the importance of using very accurate clinical tests when testing for diseases that have a low occurrence in the population. Even a small error in the test can give false positives at an alarmingly high rate.
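The TB numbers can be checked with a short Python sketch. The variable names `sensitivity`, `specificity`, and `prevalence` are labels chosen for this example rather than terms used in the article. The sketch also shows that the 509 in 10,000 positive rate is what the two 95% figures and the 1 in 1,000 infection rate imply, by adding up true positives and false positives.

```python
from fractions import Fraction

sensitivity = Fraction(95, 100)  # P(positive | infected)
specificity = Fraction(95, 100)  # P(negative | not infected)
prevalence = Fraction(1, 1000)   # P(infected)

# Everyone who tests positive is either infected (true positive)
# or not infected (false positive).
true_positives = sensitivity * prevalence
false_positives = (1 - specificity) * (1 - prevalence)
p_positive = true_positives + false_positives
print(p_positive)  # 509/10000

# Bayes Theorem: P(infected | positive).
p_infected_given_positive = sensitivity * prevalence / p_positive
print(p_infected_given_positive)  # 19/1018
print(f"{float(p_infected_given_positive):.4%}")
```

Comparing `true_positives` with `false_positives` shows why the result is so low: false positives from the large uninfected group outnumber true positives by more than 50 to 1.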

Bayes theorem is unbelievably pervasive… it’s in every field and discipline.
I just started using it often for phylogenetic trees and molecular clock dating

It really is. I have yet to find a field that doesn’t use it somewhere. Yet so few people really understand it. Part of why I thought writing this might be good.

So, I was looking at BIC before (Bayesian Information Criterion) but I have a very silly, math-dumb doubt: the likelihood is higher when the BIC value is low, right?

Likelihood is higher only when the complexity is the same. In simplest terms (not really, but it’s the easiest simplification that works) it is the ratio between likelihood and complexity. There is a penalty for high complexity and a positive effect for better likelihood.

So you could have two models where the one with the lower BIC actually has the worse likelihood; because it is significantly less complex, it still ends up with the lower BIC.