10: Probability

    Probability theory is the branch of mathematics that deals with chance and uncertainty. It forms an important part of the foundation for statistics, because it provides us with the mathematical tools to describe uncertain events. The study of probability arose in part due to interest in understanding games of chance, like cards or dice. These games provide useful examples of many statistical concepts, because when we repeat these games the likelihood of different outcomes remains (mostly) the same. However, there are deep questions about the meaning of probability that we will not address here; see Suggested Readings at the end if you are interested in learning more about this fascinating topic and its history.

    10.1 What is probability?

    Informally, we usually think of probability as a number that describes the likelihood of some event occurring, which ranges from zero (impossibility) to one (certainty). Sometimes probabilities will instead be expressed in percentages, which range from zero to one hundred, as when the weather forecast predicts a twenty percent chance of rain today. In each case, these numbers are expressing how likely that particular event is, ranging from absolutely impossible to absolutely certain.

    To formalize probability theory, we first need to define a few terms:

    • An experiment is any activity that produces or observes an outcome. Examples are flipping a coin, rolling a 6-sided die, or trying a new route to work to see if it’s faster than the old route.
    • The sample space is the set of possible outcomes for an experiment. We represent these by listing them within a set of squiggly brackets. For a coin flip, the sample space is {heads, tails}. For a six-sided die, the sample space is each of the possible numbers that can appear: {1,2,3,4,5,6}. For the amount of time it takes to get to work, the sample space is all possible real numbers greater than zero (since it can’t take a negative amount of time to get somewhere, at least not yet). We won’t bother trying to write out all of those numbers within the brackets.
    • An event is a subset of the sample space. In principle it could be one or more of the possible outcomes in the sample space, but here we will focus primarily on elementary events, which consist of exactly one possible outcome. For example, this could be obtaining heads in a single coin flip, rolling a 4 on a throw of the die, or taking 21 minutes to get home by the new route.

    Now that we have those definitions, we can outline the formal features of a probability, which were first defined by the Russian mathematician Andrei Kolmogorov. These are the features that a value has to have if it is going to be a probability. If $P(X_i)$ is the probability of event $X_i$:

    • Probability cannot be negative: $P(X_i) \ge 0$
    • The total probability of all outcomes in the sample space is 1; that is, if we take the probability of each element in the sample space and add them up, they must sum to 1. We can express this using the summation symbol $\sum$:

      $$\sum_{i=1}^N{P(X_i)} = P(X_1) + P(X_2) + ... + P(X_N) = 1$$

      This is interpreted as saying “Take all of the N elementary events, which we have labeled from 1 to N, and add up their probabilities. These must sum to one.”
    • The probability of any individual event cannot be greater than one: $P(X_i) \le 1$. This is implied by the previous point; since they must sum to one, and they can’t be negative, then any particular probability must be less than or equal to one.
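
    The three features above can be checked mechanically for any proposed set of elementary-event probabilities. The following is a minimal Python sketch, not part of the original text; the function name satisfies_axioms and the fair-die example are assumptions chosen for illustration.

```python
from fractions import Fraction

# Hypothetical example: a fair six-sided die, each outcome with probability 1/6.
die = {face: Fraction(1, 6) for face in range(1, 7)}

def satisfies_axioms(probabilities):
    """Return True if the values are non-negative, at most one, and sum to one."""
    values = list(probabilities.values())
    non_negative = all(p >= 0 for p in values)
    at_most_one = all(p <= 1 for p in values)
    sums_to_one = sum(values) == 1
    return non_negative and at_most_one and sums_to_one

print(satisfies_axioms(die))  # True
```

    Exact fractions are used here so that the sum-to-one check is not affected by floating-point rounding.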

    10.2 How do we determine probabilities?

    Now that we know what a probability is, how do we actually figure out what the probability is for any particular event?

    10.2.1 Personal belief

    Let’s say that I asked you what the probability was that the Beatles would have been equally successful if they had not replaced their original drummer Pete Best with Ringo Starr in 1962. We will define “success” in terms of the number of number-one hits on the Billboard Hot 100 (which we refer to as $N_{hits}$); the Beatles had 20 such number-one hits, so the sample space is {$N_{hits} < 20$, $N_{hits} \ge 20$}. We can’t actually do the experiment to find the outcome. However, most people with knowledge of the Beatles would be willing to at least offer a guess at the probability of this event. In many cases personal knowledge and/or opinion is the only guide we have for determining the probability of an event, but this is not very scientifically satisfying.

    10.2.2 Empirical frequency

    Another way to determine the probability of an event is to do the experiment many times and count how often each event happens. From the relative frequency of the different outcomes, we can compute the probability of each. For example, let’s say that we are interested in knowing the probability of rain in San Francisco. We first have to define the experiment — let’s say that we will look at the National Weather Service data for each day in 2017 and determine whether there was any rain at the downtown San Francisco weather station.

    Number of rainy days    Number of days measured    P(rain)
    73                      365                        0.2

    According to these data, in 2017 there were 73 rainy days. To compute the probability of rain in San Francisco, we simply divide the number of rainy days by the number of days counted (365), giving P(rain in SF in 2017) = 73/365 = 0.2.

    How do we know that empirical probability gives us the right number? The answer to this question comes from the law of large numbers, which shows that the empirical probability will approach the true probability as the sample size increases. We can see this by simulating a large number of coin flips, and looking at our estimate of the probability of heads after each flip. We will spend more time discussing simulation in a later chapter; for now, just assume that we have a computational way to generate a random outcome for each coin flip.
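
    As a rough illustration of the simulation just described, here is a minimal Python sketch (not the code used to produce the chapter’s figure). The function name, the seed, and the checkpoints printed are assumptions chosen for illustration; the exact estimates depend on the random seed.

```python
import random

def running_proportion_of_heads(n_flips, seed=0):
    """Flip a simulated fair coin n_flips times; return the estimate of P(heads) after each flip."""
    rng = random.Random(seed)  # seeded so the run is repeatable; the seed value is arbitrary
    heads = 0
    estimates = []
    for flip_number in range(1, n_flips + 1):
        heads += rng.random() < 0.5  # a True comparison counts as 1
        estimates.append(heads / flip_number)
    return estimates

estimates = running_proportion_of_heads(30_000)
for n in (10, 100, 1_000, 30_000):
    print(n, round(estimates[n - 1], 3))
```

    The early estimates can wander well away from 0.5, while the later ones should sit close to it.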

    The left panel of Figure 10.1 shows that as the number of samples (i.e., coin flip trials) increases, the estimated probability of heads converges onto the true value of 0.5. However, note that the estimates can be very far off from the true value when the sample sizes are small. A real-world example of this was seen in the 2017 special election for the US Senate in Alabama, which pitted the Republican Roy Moore against Democrat Doug Jones. The right panel of Figure 10.1 shows the relative amount of the vote reported for each of the candidates over the course of the evening, as an increasing number of ballots were counted. Early in the evening the vote counts were especially volatile, swinging from a large initial lead for Jones to a long period where Moore had the lead, until finally Jones took the lead to win the race.

    Figure 10.1: Left: A demonstration of the law of large numbers. A coin was flipped 30,000 times, and after each flip the probability of heads was computed based on the number of heads and tails collected up to that point. It takes about 15,000 flips for the probability to settle at the true probability of 0.5. Right: Relative proportion of the vote in the Dec 12, 2017 special election for the US Senate seat in Alabama, as a function of the percentage of precincts reporting. These data were transcribed from https://www.ajc.com/news/national/alabama-senate-race-live-updates-roy-moore-doug-jones/KPRfkdaweoiXICW3FHjXqI/

    These two examples show that while large samples will ultimately converge on the true probability, the results with small samples can be far off. Unfortunately, many people forget this and overinterpret results from small samples. This was referred to as the law of small numbers by the psychologists Danny Kahneman and Amos Tversky, who showed that people (even trained researchers) often behave as if the law of large numbers applies even to small samples, giving too much credence to results from small datasets. We will see examples throughout the course of just how unstable statistical results can be when they are generated on the basis of small samples.

    10.2.3 Classical probability

    It’s unlikely that any of us has ever flipped a coin tens of thousands of times, but we are nonetheless willing to believe that the probability of flipping heads is 0.5. This reflects the use of yet another approach to computing probabilities, which we refer to as classical probability. In this approach, we compute the probability directly based on our knowledge of the situation.

    Classical probability arose from the study of games of chance such as dice and cards. A famous example arose from a problem encountered by a French gambler who went by the name of Chevalier de Méré. de Méré played two different dice games: In the first he bet on the chance of at least one six on four rolls of a six-sided die, while in the second he bet on the chance of at least one double-six on 24 rolls of two dice. He expected to win money on both of these gambles, but he found that while on average he won money on the first gamble, he actually lost money on average when he played the second gamble many times. To understand this he turned to his friend, the mathematician Blaise Pascal, who is now recognized as one of the founders of probability theory.

    How can we understand this question using probability theory? In classical probability, we start with the assumption that all of the elementary events in the sample space are equally likely; that is, when you roll a die, each of the possible outcomes ({1,2,3,4,5,6}) is equally likely to occur. (No loaded dice allowed!) Given this, we can compute the probability of any individual outcome as one divided by the number of possible outcomes:

    $$P(outcome_i) = \frac{1}{\text{number of possible outcomes}}$$

    For the six-sided die, the probability of each individual outcome is 1/6.

    This is nice, but de Méré was interested in more complex events, like what happens on multiple dice throws. How do we compute the probability of a complex event (which is a union of single events), like rolling a one on the first or the second throw?
    We represent the union of events mathematically using the $\cup$ symbol: for example, if the probability of rolling a one on the first throw is referred to as $P(Roll1_{throw1})$ and the probability of rolling a one on the second throw is $P(Roll1_{throw2})$, then the probability of the union is referred to as $P(Roll1_{throw1} \cup Roll1_{throw2})$.

    de Méré thought (incorrectly, as we will see below) that he could simply add together the probabilities of the individual events to compute the probability of the combined event, meaning that the probability of rolling a one on the first or second roll would be computed as follows:

    $$P(Roll1_{throw1}) = 1/6$$

    $$P(Roll1_{throw2}) = 1/6$$

    $$\text{de Méré's error: } P(Roll1_{throw1} \cup Roll1_{throw2}) = P(Roll1_{throw1}) + P(Roll1_{throw2}) = 1/6 + 1/6 = 1/3$$

    De Méré reasoned based on this that the probability of at least one six in four rolls was the sum of the probabilities on each of the individual throws: $4*\frac{1}{6}=\frac{2}{3}$. Similarly, he reasoned that since the probability of a double-six in a throw of two dice is 1/36, then the probability of at least one double-six on 24 rolls of two dice would be $24*\frac{1}{36}=\frac{2}{3}$. Yet, while he consistently won money on the first bet, he lost money on the second bet. What gives?

    To understand de Méré’s error, we need to introduce some of the rules of probability theory. The first is the rule of subtraction, which says that the probability of some event A not happening is one minus the probability of the event happening:

    $$P(\neg A) = 1 - P(A)$$

    where $\neg A$ means “not A”. This rule derives directly from the axioms that we discussed above; because A and $\neg A$ are the only possible outcomes, their total probability must sum to 1. For example, if the probability of rolling a one in a single throw is $\frac{1}{6}$, then the probability of rolling anything other than a one is $\frac{5}{6}$.

    A second rule tells us how to compute the probability of a conjoint event, that is, the probability that both of two events will occur. We refer to this as an intersection, which is signified by the $\cap$ symbol; thus, $P(A \cap B)$ means the probability that both A and B will occur.

    This version of the rule tells us how to compute this quantity in the special case when the two events are independent from one another; we will learn later exactly what the concept of independence means, but for now we can just take it for granted that the two die throws are independent events. We compute the probability of the intersection of two independent events by simply multiplying the probabilities of the individual events:

    $$P(A \cap B) = P(A) * P(B)\ \text{if and only if A and B are independent}$$

    Thus, the probability of throwing a six on both of two rolls is $\frac{1}{6}*\frac{1}{6}=\frac{1}{36}$.

    The third rule tells us how to add together probabilities - and it is here that we see the source of de Méré’s error. The addition rule tells us that to obtain the probability of either of two events occurring, we add together the individual probabilities, but then subtract the likelihood of both occurring together:

    $$P(A \cup B) = P(A) + P(B) - P(A \cap B)$$

    In a sense, this prevents us from counting those instances twice, and that’s what distinguishes the rule from de Méré’s incorrect computation. Let’s say that we want to find the probability of rolling a one on either of two throws. According to our rules:

    $$P(Roll1_{throw1} \cup Roll1_{throw2}) = P(Roll1_{throw1}) + P(Roll1_{throw2}) - P(Roll1_{throw1} \cap Roll1_{throw2})$$

    $$= \frac{1}{6} + \frac{1}{6} - \frac{1}{36} = \frac{11}{36}$$

    Figure 10.2: Each cell in this matrix represents one outcome of two throws of a die, with the columns representing the first throw and the rows representing the second throw. Cells shown in red represent the cells with a one in either the first or second throw; the rest are shown in blue.

    Let’s use a graphical depiction to get a different view of this rule. Figure 10.2 shows a matrix representing all possible combinations of results across two throws, and highlights the cells that involve a one on either the first or second throw. If you count up the cells in red you will see that there are 11 such cells. This shows why the addition rule gives a different answer from de Méré’s; if we were to simply add together the probabilities for the two throws as he did, then we would count (1,1) towards both, when it should really only be counted once.
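
    The same count can be done by brute force. The following minimal Python sketch (variable names are assumptions for illustration) lists all 36 outcomes of two throws, counts those with a one on either throw, and compares the result with the addition rule.

```python
from fractions import Fraction
from itertools import product

# All 36 equally likely (first throw, second throw) pairs for a fair six-sided die.
outcomes = list(product(range(1, 7), repeat=2))

one_on_either = [pair for pair in outcomes if 1 in pair]
one_on_both = [pair for pair in outcomes if pair == (1, 1)]

p_union_by_counting = Fraction(len(one_on_either), len(outcomes))
p_union_by_rule = Fraction(1, 6) + Fraction(1, 6) - Fraction(len(one_on_both), len(outcomes))

print(len(one_on_either), p_union_by_counting, p_union_by_rule)  # 11 11/36 11/36
```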

    10.2.4 Solving de Méré’s problem

    Blaise Pascal used the rules of probability to come up with a solution to de Méré’s problem. First, he realized that computing the probability of at least one event out of a combination was tricky, whereas computing the probability that something does not occur across several events is relatively easy – it’s just the product of the probabilities of the individual events. Thus, rather than computing the probability of at least one six in four rolls, he instead computed the probability of no sixes across all rolls:

    $$P(\text{no sixes in four rolls}) = \frac{5}{6}*\frac{5}{6}*\frac{5}{6}*\frac{5}{6}=\bigg(\frac{5}{6}\bigg)^4=0.482$$

    He then used the fact that the probability of no sixes in four rolls is the complement of at least one six in four rolls (thus they must sum to one), and used the rule of subtraction to compute the probability of interest:

    $$P(\text{at least one six in four rolls}) = 1 - \bigg(\frac{5}{6}\bigg)^4=0.518$$

    de Méré’s gamble that he would throw at least one six in four rolls has a probability of greater than 0.5, explaining why de Méré made money on this bet on average.

    But what about de Méré’s second bet? Pascal used the same trick:

    $$P(\text{no double six in 24 rolls}) = \bigg(\frac{35}{36}\bigg)^{24}=0.509$$

    $$P(\text{at least one double six in 24 rolls}) = 1 - \bigg(\frac{35}{36}\bigg)^{24}=0.491$$

    The probability of this outcome was slightly below 0.5, showing why de Méré lost money on average on this bet.
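
    Pascal’s approach amounts to a one-line calculation. Here is a minimal Python sketch of it; the function name p_at_least_one is an assumption for illustration, and the calculation assumes the rolls are independent, as in the text.

```python
def p_at_least_one(p_single, n_rolls):
    """Probability that an event with per-roll probability p_single occurs at least once in n_rolls independent rolls."""
    return 1 - (1 - p_single) ** n_rolls

first_bet = p_at_least_one(1 / 6, 4)     # at least one six in four rolls of one die
second_bet = p_at_least_one(1 / 36, 24)  # at least one double-six in 24 rolls of two dice

print(round(first_bet, 3), round(second_bet, 3))  # 0.518 0.491
```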

    10.3 Probability distributions

    A probability distribution describes the probability of all of the possible outcomes in an experiment. For example, on Jan 20 2018, the basketball player Steph Curry hit only 2 out of 4 free throws in a game against the Houston Rockets. We know that Curry’s overall probability of hitting free throws across the entire season was 0.91, so it seems pretty unlikely that he would hit only 50% of his free throws in a game, but exactly how unlikely is it? We can determine this using a theoretical probability distribution; during this course we will encounter a number of these probability distributions, each of which is appropriate to describe different types of data. In this case, we use the binomial distribution, which provides a way to compute the probability of some number of successes out of a number of trials on which there is either success or failure and nothing in between (known as “Bernoulli trials”) given some known probability of success on each trial. This distribution is defined as:

    $$P(k; n,p) = P(X=k) = \binom{n}{k} p^k(1-p)^{n-k}$$

    This refers to the probability of k successes on n trials when the probability of success is p. You may not be familiar with $\binom{n}{k}$, which is referred to as the binomial coefficient. The binomial coefficient is also referred to as “n-choose-k” because it describes the number of different ways that one can choose k items out of n total items. The binomial coefficient is computed as:

    $$\binom{n}{k} = \frac{n!}{k!(n-k)!}$$

    where the exclamation point (!) refers to the factorial of the number:

    $$n! = \prod_{i=1}^n i = n*(n-1)*...*2*1$$

    In the example of Steph Curry’s free throws:

    $$P(2;4,0.91) = \binom{4}{2} 0.91^2(1-0.91)^{4-2} = 0.040$$

    This shows that given Curry’s overall free throw percentage, it is very unlikely that he would hit only 2 out of 4 free throws. Which just goes to show that unlikely things do actually happen in the real world.
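
    The binomial formula translates directly into code. The following is a minimal Python sketch using the standard library’s math.comb for the binomial coefficient; the function name binomial_pmf is an assumption for illustration.

```python
from math import comb

def binomial_pmf(k, n, p):
    """P(X = k): probability of exactly k successes in n Bernoulli trials with success probability p."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

print(round(binomial_pmf(2, 4, 0.91), 3))  # 0.04
```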

    10.3.1 Cumulative probability distributions

    Often we want to know not just how likely a specific value is, but how likely it is to find a value that is as extreme or more extreme than a particular value; this will become very important when we discuss hypothesis testing in a later chapter. To answer this question, we can use a cumulative probability distribution; whereas a standard probability distribution tells us the probability of some specific value, the cumulative distribution tells us the probability of a value as large or larger (or as small or smaller) than some specific value.

    In the free throw example, we might want to know: What is the probability that Steph Curry hits 2 or fewer free throws out of four, given his overall free throw probability of 0.91? To determine this, we could simply use the binomial probability equation, plug in all of the possible values of k, and add the results together:

    $$P(k \le 2) = P(k=2) + P(k=1) + P(k=0) = 0.04025 + 0.00265 + 0.00007 = 0.043$$

    In many cases the number of possible outcomes would be too large for us to compute the cumulative probability by enumerating all possible values; fortunately, it can be computed directly. For the binomial, we can do this in R using the pbinom() function:

    Table 10.1: Cumulative probability distribution for number of successful free throws by Steph Curry in 4 attempts.
    numSuccesses    probability
    0               0.00
    1               0.00
    2               0.04
    3               0.31
    4               1.00

    From the table we can see that the probability of Curry landing 2 or fewer free throws out of 4 attempts is 0.043.
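
    The text computes these values with R’s pbinom() function. As a rough equivalent, here is a minimal Python sketch that sums the binomial probabilities directly; the function name binomial_cdf is an assumption for illustration, and this is not the code that generated Table 10.1.

```python
from math import comb

def binomial_cdf(k, n, p):
    """P(X <= k) for a binomial distribution, summing the probability of 0 through k successes."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))

for successes in range(5):
    print(successes, round(binomial_cdf(successes, 4, 0.91), 2))
```

    Rounded to two decimal places, the printed values should agree with the cumulative probabilities in Table 10.1.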

    10.4 Conditional probability

    So far we have limited ourselves to simple probabilities, that is, the probability of a single event or combination of events. However, we often wish to determine the probability of some event given that some other event has occurred; this is known as a conditional probability.

    Let’s take the 2016 US Presidential election as an example. There are two simple probabilities that we could use to describe the electorate. First, we know the probability that a voter in the US is affiliated with the Republican party: $p(\text{Republican}) = 0.44$. We also know the probability that a voter cast their vote in favor of Donald Trump: $p(\text{Trump voter}) = 0.46$. However, let’s say that we want to know the following: What is the probability that a person cast their vote for Donald Trump, given that they are a Republican?

    To compute the conditional probability of A given B (which we write as $P(A|B)$, “probability of A, given B”), we need to know the joint probability (that is, the probability of both A and B occurring) as well as the overall probability of B:

    $$P(A|B) = \frac{P(A \cap B)}{P(B)}$$

    That is, we want to know the probability that both things are true, given that the one being conditioned upon is true.

    Figure 10.3: A graphical depiction of conditional probability, showing how the conditional probability limits our analysis to a subset of the data.

    It can be useful to think of this graphically. Figure 10.3 shows a flow chart depicting how the full population of voters breaks down into Republicans and Democrats, and how the conditional probability (conditioning on party) further breaks down the members of each party according to their vote.

    10.5 Computing conditional probabilities from data

    For many examples in this course we will use data obtained from the National Health and Nutrition Examination Survey (NHANES). NHANES is a large ongoing study organized by the US Centers for Disease Control that is designed to provide an overall picture of the health and nutritional status of both adults and children in the US. Every year, the survey examines a sample of about 5000 people across the US using both interviews and physical and medical tests. The NHANES dataset is included as a package in R, making it easy to access and work with. It also provides us with a large, realistic dataset that will serve as an example for many different statistical tools.

    Let’s say that we are interested in the following question: What is the probability that someone has diabetes, given that they are not physically active? That is, what is $P(diabetes|inactive)$? NHANES records two variables that address the two parts of this question. The first (Diabetes) asks whether the person has ever been told that they have diabetes, and the second (PhysActive) records whether the person engages in sports, fitness, or recreational activities that are at least of moderate intensity. Let’s first compute the simple probabilities.

    Table 10.2: Summary data for diabetes and physical activity
    Answer    N_diabetes    P_diabetes    N_PhysActive    P_PhysActive
    No        4893          0.9           2472            0.45
    Yes       550           0.1           2971            0.55

    The table shows that the probability that someone in the NHANES dataset has diabetes is .1, and the probability that someone is inactive is .45.

    Table 10.3: Joint probabilities for Diabetes and PhysActive variables.
    Diabetes    PhysActive    n       prob
    No          No            2123    0.39
    No          Yes           2770    0.51
    Yes         No            349     0.06
    Yes         Yes           201     0.04

    To compute $P(diabetes|inactive)$ we would also need to know the joint probability of being diabetic and inactive, in addition to the simple probabilities of each.

    Based on these joint probabilities, we can compute $P(diabetes|inactive)$. To do this, we can first determine the truth value of whether the PhysActive variable was equal to “No” for each individual, and then take the mean of those truth values. Since TRUE/FALSE values are treated as 1/0 respectively by most programming languages (including R), this allows us to easily identify the probability of a simple event by simply taking the mean of a logical variable representing its truth value. We then use that value to compute the conditional probability, where we find that the probability of someone having diabetes given that they are physically inactive is 0.141.
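
    The same calculation can be sketched from the counts in Table 10.3 alone. The Python below is a minimal illustration that works from those four counts rather than from the individual-level NHANES records described in the text; the variable names are assumptions.

```python
# Counts copied from Table 10.3, keyed by (Diabetes, PhysActive).
counts = {
    ("No", "No"): 2123,
    ("No", "Yes"): 2770,
    ("Yes", "No"): 349,
    ("Yes", "Yes"): 201,
}
total = sum(counts.values())

# P(inactive): add up every cell where PhysActive is "No".
p_inactive = sum(n for (_, active), n in counts.items() if active == "No") / total
# P(diabetes and inactive): the single joint cell.
p_diabetes_and_inactive = counts[("Yes", "No")] / total
# P(diabetes | inactive) = P(diabetes and inactive) / P(inactive)
p_diabetes_given_inactive = p_diabetes_and_inactive / p_inactive

print(round(p_inactive, 2), round(p_diabetes_and_inactive, 2), round(p_diabetes_given_inactive, 3))
# 0.45 0.06 0.141
```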

    10.6 Independence

    The term “independent” has a very specific meaning in statistics, which is somewhat different from the common usage of the term. Statistical independence between two variables means that knowing the value of one variable doesn’t tell us anything about the value of the other. This can be expressed as:

    $$P(A|B) = P(A)$$

    That is, the probability of A given some value of B is just the same as the overall probability of A. Looking at it this way, we see that many cases of what we would call “independence” in the world are not actually statistically independent. For example, there is currently a move by a small group of California citizens to declare a new independent state called Jefferson, which would comprise a number of counties in northern California and Oregon. If this were to happen, then the probability that a current California resident would now live in the state of Jefferson would be $P(\text{Jefferson})=0.014$, whereas the probability that they would remain a California resident would be $P(\text{California})=0.986$. The new states might be politically independent, but they would not be statistically independent, because $P(\text{California|Jefferson}) = 0$! That is, while independence in common language often refers to sets that are exclusive, statistical independence refers to the case where one cannot predict anything about one variable from the value of another variable. For example, knowing a person’s hair color is unlikely to tell you whether they prefer chocolate or strawberry ice cream.
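
    The contrast between independent and exclusive events can be checked on the two die throws from earlier in the chapter. The following minimal Python sketch is an illustration under the assumption of a fair die; the helper names prob and conditional are hypothetical.

```python
from fractions import Fraction
from itertools import product

outcomes = list(product(range(1, 7), repeat=2))  # (first throw, second throw)

def prob(event):
    """Probability of an event, given as a true/false function of an outcome."""
    return Fraction(sum(1 for o in outcomes if event(o)), len(outcomes))

def conditional(event_a, event_b):
    """P(A|B) = P(A and B) / P(B)."""
    return prob(lambda o: event_a(o) and event_b(o)) / prob(event_b)

first_is_one = lambda o: o[0] == 1
second_is_one = lambda o: o[1] == 1
first_is_two = lambda o: o[0] == 2

# Separate throws: knowing the second throw tells us nothing about the first.
print(conditional(first_is_one, second_is_one) == prob(first_is_one))  # True
# Exclusive events on the same throw: knowing one rules out the other.
print(conditional(first_is_one, first_is_two), prob(first_is_one))  # 0 1/6
```

    The second case mirrors the Jefferson example: the two events can never occur together, so learning that one occurred changes the probability of the other from 1/6 to zero.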

    Let’s look at another example, using the NHANES data: Are physical health and mental health independent of one another? NHANES includes two relevant questions: PhysActive, which asks whether the individual is physically active, and DaysMentHlthBad, which asks how many days out of the last 30 that the individual experienced bad mental health. Let’s consider anyone who had more than 7 days of bad mental health in the last month to be in bad mental health. Based on this, we can define a new variable called badMentalHealth as a logical variable telling whether each person had more than 7 days of bad mental health or not. Using this new variable, we can then determine whether mental health and physical activity are independent by asking whether the simple probability of bad mental health is different from the conditional probability of bad mental health given that one is physically active.

    PhysActive    badMentalHealth
    No            0.20
    Yes           0.13

    The overall probability of bad mental health $P(\text{bad mental health})$ is 0.16 while the conditional probability $P(\text{bad mental health|physically active})$ is 0.13. Thus, it seems that the conditional probability is somewhat smaller than the overall probability, suggesting that they are not independent, though we can’t know for sure just by looking at the numbers, since these numbers might be different due to sampling variability. Later in the course we will encounter tools that will let us more directly test whether two variables are independent.

    10.7 Reversing a conditional probability: Bayes’ rule

    In many cases, we know $P(A|B)$ but we really want to know $P(B|A)$. This commonly occurs in medical screening, where we know $P(\text{positive test result|disease})$ but what we want to know is $P(\text{disease|positive test result})$. For example, some doctors recommend that men over the age of 50 undergo screening using a test called prostate specific antigen (PSA) to screen for possible prostate cancer. Before a test is approved for use in medical practice, the manufacturer needs to test two aspects of the test’s performance. First, they need to show how sensitive it is, that is, how likely it is to find the disease when it is present: $\text{sensitivity} = P(\text{positive test|disease})$. They also need to show how specific it is, that is, how likely it is to give a negative result when there is no disease present: $\text{specificity} = P(\text{negative test|no disease})$. For the PSA test, we know that sensitivity is about 80% and specificity is about 70%. However, these don’t answer the question that the physician wants to answer for any particular patient: what is the likelihood that they actually have cancer, given that the test comes back positive? This requires that we reverse the conditional probability that defines sensitivity: instead of $P(\text{positive test|disease})$ we want to know $P(\text{disease|positive test})$.

    In order to reverse a conditional probability, we can use Bayes’ rule:

    $$P(B|A) = \frac{P(A|B)*P(B)}{P(A)}$$

    Bayes’ rule is fairly easy to derive, based on the rules of probability that we learned earlier in the chapter (see the Appendix for this derivation).

    If we have only two outcomes, we can express Bayes’ rule in a somewhat clearer way, using the sum rule to redefine $P(A)$:

    $$P(A) = P(A|B)*P(B) + P(A|\neg B)*P(\neg B)$$

    Using this, we can redefine Bayes’ rule:

    $$P(B|A) = \frac{P(A|B)*P(B)}{P(A|B)*P(B) + P(A|\neg B)*P(\neg B)}$$

    We can plug the relevant numbers into this equation to determine the likelihood that an individual with a positive PSA result actually has cancer, but note that in order to do this, we also need to know the overall probability of cancer in the person, which we often refer to as the base rate. Let’s take a 60 year old man, for whom the probability of prostate cancer in the next 10 years is $P(cancer)=0.058$. Using the sensitivity and specificity values that we outlined above, we can compute the individual’s likelihood of having cancer given a positive test:

    $$P(\text{cancer|test}) = \frac{P(\text{test|cancer})*P(\text{cancer})}{P(\text{test|cancer})*P(\text{cancer}) + P(\text{test|}\neg\text{cancer})*P(\neg\text{cancer})}$$

    $$= \frac{0.8*0.058}{0.8*0.058 +0.3*0.942 } = 0.14$$

    That’s pretty small; do you find that surprising? Many people do, and in fact there is a substantial psychological literature showing that people systematically neglect base rates (i.e. overall prevalence) in their judgments.
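
    The PSA calculation can be written as a small function. This is a minimal Python sketch using the sensitivity, specificity, and base rate quoted in the text; the function and parameter names are assumptions for illustration.

```python
def posterior(prior, sensitivity, specificity):
    """P(disease | positive test) from the base rate, sensitivity, and specificity."""
    false_positive_rate = 1 - specificity    # P(positive test | no disease)
    true_positives = sensitivity * prior     # P(positive test | disease) * P(disease)
    false_positives = false_positive_rate * (1 - prior)
    return true_positives / (true_positives + false_positives)

print(round(posterior(prior=0.058, sensitivity=0.8, specificity=0.7), 2))  # 0.14
```

    Trying a larger or smaller prior in this function is a quick way to see how strongly the answer depends on the base rate.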

    10.8 Learning from data

    Another way to think of Bayes’ rule is as a way to update our beliefs on the basis of data – that is, learning about the world using data. Let’s look at Bayes’ rule again:

    \[ P(B|A) = \frac{P(A|B)*P(B)}{P(A)} \]

    The different parts of Bayes’ rule have specific names that relate to their role in using Bayes’ rule to update our beliefs. We start out with an initial guess about the probability of B (\(P(B)\)), which we refer to as the prior probability. In the PSA example we used the base rate for the prior, since it was our best guess as to the individual’s chance of cancer before we knew the test result. We then collect some data, which in our example was the test result. The degree to which the data A are consistent with outcome B is given by \(P(A|B)\), which we refer to as the likelihood. You can think of this as how likely the data are, given the particular hypothesis being tested. In our example, the hypothesis being tested was whether the individual had cancer, and the likelihood was based on our knowledge about the sensitivity of the test (that is, the probability of a positive test outcome given cancer). The denominator (\(P(A)\)) is referred to as the marginal likelihood, because it expresses the overall likelihood of the data, averaged across all of the possible values of B (which in our example were having cancer and not having cancer). The outcome to the left (\(P(B|A)\)) is referred to as the posterior, because it’s what comes out the back end of the computation.

    There is another way of writing Bayes’ rule that makes this a bit clearer:

    \[ P(B|A) = \frac{P(A|B)}{P(A)}*P(B) \]

    The part on the left (\(\frac{P(A|B)}{P(A)}\)) tells us how much more or less likely the data A are given B, relative to the overall (marginal) likelihood of the data, while the part on the right side (\(P(B)\)) tells us how likely we thought B was before we knew anything about the data. This makes it clearer that the role of Bayes’ theorem is to update our prior knowledge based on the degree to which the data are more likely given B than they would be overall. If the data are more likely under the hypothesis than they would be in general, then we increase our belief in the hypothesis; if they are less likely under the hypothesis, then we decrease our belief.
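    To make the "update factor" reading concrete, here is a minimal sketch that applies it to the PSA numbers used earlier in the chapter. The variable names are our own, and the false positive rate of 0.3 is assumed to be one minus the stated specificity of 0.7.

```python
prior = 0.058                 # P(cancer), the base rate
likelihood = 0.8              # P(positive test | cancer), the sensitivity
false_positive_rate = 0.3     # P(positive test | no cancer) = 1 - specificity

# Marginal likelihood: overall probability of a positive test
marginal = likelihood * prior + false_positive_rate * (1 - prior)

# How much more likely a positive test is under "cancer" than overall
update_factor = likelihood / marginal

posterior = update_factor * prior

print(round(marginal, 3))       # 0.329
print(round(update_factor, 2))  # 2.43
print(round(posterior, 2))      # 0.14
```

    Because the update factor is greater than one, the positive test raises our belief in cancer above the prior; a factor below one would lower it.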

    10.9 Odds and odds ratios

    The result in the last section showed that the likelihood that the individual has cancer based on a positive PSA test result is still fairly low, even though it’s more than twice as big as it was before we knew the test result. We would often like to quantify the relation between probabilities more directly, which we can do by converting them into odds, which express the relative likelihood of something happening or not:

    \[ \text{odds of A} = \frac{P(A)}{P(\neg A)} \]

    In our PSA example, the odds of having cancer (given the positive test) are:

    \[ \text{odds of cancer} = \frac{P(\text{cancer})}{P(\neg \text{cancer})} = \frac{0.14}{1 - 0.14} = 0.16 \]

    This tells us that the odds of having cancer are fairly low, even though the test was positive. For comparison, the odds of rolling a 6 in a single throw of a die are:

    \[ \text{odds of 6} = \frac{1}{5} = 0.2 \]

    As an aside, this is a reason why many medical researchers have become increasingly wary of the use of widespread screening tests for relatively uncommon conditions; most positive results will turn out to be false positives.

    We can also use odds to compare different probabilities, by computing what is called an odds ratio, which is exactly what it sounds like. For example, let’s say that we want to know how much the positive test increases the individual’s odds of having cancer. We can first compute the prior odds, that is, the odds before we knew that the person had tested positive. These are computed using the base rate:

    \[ \text{prior odds} = \frac{P(\text{cancer})}{P(\neg \text{cancer})} = \frac{0.058}{1 - 0.058} = 0.061 \]

    We can then compare these with the posterior odds, which are computed using the posterior probability:

    \[ \text{odds ratio} = \frac{\text{posterior odds}}{\text{prior odds}} = \frac{0.16}{0.061} = 2.62 \]

    This tells us that the odds of having cancer are increased by 2.62 times given the positive test result. (This figure uses the rounded odds shown above; with unrounded values the ratio is about 2.67.) An odds ratio is an example of what we will later call an effect size, which is a way of quantifying how relatively large any particular statistical effect is.
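    The odds calculations in this section can be reproduced with a short sketch. The helper name `odds` is our own. Note that the code works with unrounded probabilities, so its odds ratio (about 2.67, which equals 0.8/0.3) differs slightly from the 2.62 obtained in the text from rounded intermediate values.

```python
def odds(p):
    """Convert a probability into odds: P(A) / P(not A)."""
    return p / (1 - p)


prior = 0.058  # base rate from the PSA example
# Posterior from Bayes' rule with sensitivity 0.8 and false positive rate 0.3
posterior = 0.8 * prior / (0.8 * prior + 0.3 * (1 - prior))

prior_odds = odds(prior)
posterior_odds = odds(posterior)
odds_ratio = posterior_odds / prior_odds

print(round(posterior_odds, 2))  # 0.16
print(round(odds_ratio, 2))      # 2.67
print(round(odds(1 / 6), 2))     # 0.2, the odds of rolling a 6
```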

    10.10 What do probabilities mean?

    It might strike you that it is a bit odd to talk about the probability of a person having cancer depending on a test result; after all, the person either has cancer or they don’t. Historically, there have been two different ways that probabilities have been interpreted. The first (known as the frequentist interpretation) interprets probabilities in terms of long-run frequencies. For example, in the case of a coin flip, it would reflect the relative frequencies of heads in the long run after a large number of flips. While this interpretation might make sense for events that can be repeated many times like a coin flip, it makes less sense for events that will only happen once, like an individual person’s life or a particular presidential election; and as the economist John Maynard Keynes famously said, “In the long run, we are all dead.”

    The other interpretation of probabilities (known as the Bayesian interpretation) is as a degree of belief in a particular proposition. If I were to ask you “How likely is it that the US will return to the moon by 2026”, you can provide an answer to this question based on your knowledge and beliefs, even though there are no relevant frequencies to compute a frequentist probability. One way that we often frame subjective probabilities is in terms of one’s willingness to accept a particular gamble. For example, if you think that the probability of the US landing on the moon by 2026 is 0.1 (i.e. odds of 9 to 1 against), then that means that you should be willing to accept a gamble that would pay off with anything more than 9 to 1 odds if the event occurs.
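    One way to see why 9 to 1 is the break-even point is to compute the expected profit of the gamble. The sketch below is an illustration only: it assumes a bet in which you stake one unit, win `payout_odds` units if the event occurs, and lose the stake otherwise. The function name `expected_profit` is hypothetical.

```python
def expected_profit(p_win, payout_odds, stake=1.0):
    """Expected profit of a bet paying payout_odds-to-1 on the given stake."""
    win_amount = payout_odds * stake
    return p_win * win_amount - (1 - p_win) * stake


# Subjective probability of the event is 0.1, as in the example above
for payout in (8, 9, 10):
    print(payout, round(expected_profit(0.1, payout), 2))
```

    With a believed probability of 0.1, the expected profit is negative at 8 to 1, zero at 9 to 1, and positive at 10 to 1, which is why only payouts better than 9 to 1 make the gamble worth accepting.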

    As we will see, these two different definitions of probability are very relevant to the two different ways that statisticians think about testing statistical hypotheses, which we will encounter in later chapters.

    10.11 Learning objectives

    Having read this chapter, you should be able to:

    • Describe the sample space for a selected random experiment.
    • Compute relative frequency and empirical probability for a given set of events.
    • Compute probabilities of single events, complementary events, and the unions and intersections of collections of events.
    • Describe the law of large numbers.
    • Describe the difference between a probability and a conditional probability.
    • Describe the concept of statistical independence.
    • Use Bayes’ theorem to compute the inverse conditional probability.

    10.12 Suggested readings

    • The Drunkard’s Walk: How Randomness Rules Our Lives, by Leonard Mlodinow

    10.13 Appendix

    Proof (Derivation of Bayes’ rule). First, remember the rule for computing a conditional probability:

    \[ P(A|B) = \frac{P(A \cap B)}{P(B)} \]

    We can rearrange this to get the formula to compute the joint probability using the conditional:

    \[ P(A \cap B) = P(A|B) * P(B) \]

    Using this we can compute the inverse probability:

    \[ P(B|A) = \frac{P(A \cap B)}{P(A)} = \frac{P(A|B)*P(B)}{P(A)} \]
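    As a numerical sanity check on this derivation, the sketch below computes \(P(B|A)\) two ways from a small joint probability table: directly from the definition of conditional probability, and via Bayes’ rule. The joint probabilities are made-up values chosen only for illustration.

```python
import math

# Hypothetical joint probabilities for two binary events, keyed by (A, B)
joint = {
    (True, True): 0.2,    # P(A and B)
    (True, False): 0.1,   # P(A and not B)
    (False, True): 0.3,   # P(not A and B)
    (False, False): 0.4,  # P(not A and not B)
}

# Marginal probabilities obtained by summing over the other event
p_a = sum(p for (a, _), p in joint.items() if a)
p_b = sum(p for (_, b), p in joint.items() if b)

# Definition of conditional probability
p_a_given_b = joint[(True, True)] / p_b
p_b_given_a_direct = joint[(True, True)] / p_a

# Bayes' rule: reverse the conditional using P(A|B), P(B), and P(A)
p_b_given_a_bayes = p_a_given_b * p_b / p_a

print(math.isclose(p_b_given_a_direct, p_b_given_a_bayes))  # True
```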