# 2.4: Conditional Probability


The purpose of this section is to study how probabilities are updated in light of new information, clearly an absolutely essential topic. If you are a new student of probability, you may want to skip the technical details.

## Definitions and Interpretations

### The Basic Definition

As usual, we start with a random experiment modeled by a probability space \((S, \mathscr S, \P)\). Thus, \( S \) is the set of outcomes, \( \mathscr S \) the collection of events, and \( \P \) the probability measure on the sample space \( (S, \mathscr S) \). Suppose now that we know that an event \(B\) has occurred. In general, this information should clearly alter the probabilities that we assign to other events. In particular, if \(A\) is another event then \(A\) occurs if and only if \(A\) and \(B\) occur; effectively, the sample space has been reduced to \(B\). Thus, the probability of \(A\), given that we know \(B\) has occurred, should be proportional to \(\P(A \cap B)\).

However, conditional probability, given that \(B\) has occurred, should still be a probability measure, that is, it must satisfy the axioms of probability. This forces the proportionality constant to be \(1 \big/ \P(B)\). Thus, we are led inexorably to the following definition:

Let \(A\) and \(B\) be events with \(\P(B) \gt 0\). The conditional probability of \(A\) given \(B\) is defined to be \[\P(A \mid B) = \frac{\P(A \cap B)}{\P(B)}\]

### The Law of Large Numbers

The definition above was based on the axiomatic definition of probability. Let's explore the idea of conditional probability from the less formal and more intuitive notion of relative frequency (the law of large numbers). Thus, suppose that we run the experiment repeatedly. For \( n \in \N_+ \) and an event \(E\), let \(N_n(E)\) denote the number of times \(E\) occurs (the frequency of \( E \)) in the first \(n\) runs. Note that \(N_n(E)\) is a random variable in the compound experiment that consists of replicating the original experiment. In particular, its value is unknown until we actually run the experiment \( n \) times.

If \(N_n(B)\) is large, the conditional probability that \(A\) has occurred, given that \(B\) has occurred, should be close to the conditional relative frequency of \(A\) given \(B\), namely the relative frequency of \(A\) for the runs on which \(B\) occurred: \(N_n(A \cap B) / N_n(B)\). But note that \[\frac{N_n(A \cap B)}{N_n(B)} = \frac{N_n(A \cap B) / n}{N_n(B) / n}\] The numerator and denominator of the main fraction on the right are the relative frequencies of \( A \cap B \) and \( B \), respectively. So by the law of large numbers again, \( N_n(A \cap B) / n \to \P(A \cap B) \) as \( n \to \infty \) and \( N_n(B) / n \to \P(B) \) as \( n \to \infty \). Hence \[\frac{N_n(A \cap B)}{N_n(B)} \to \frac{\P(A \cap B)}{\P(B)} \text{ as } n \to \infty\] and we are led again to the definition above.
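The convergence of the conditional relative frequency can be illustrated with a small simulation. The following is a minimal sketch, not part of the original text: the function name and the choice of events (rolling two fair dice, with \(A\) the event that the first die shows 3 and \(B\) the event that the sum is 5, so that \(\P(A \mid B) = \frac{1}{4}\)) are assumptions made only for illustration.

```python
import random

def conditional_relative_frequency(n, seed=0):
    """Estimate P(A | B) by N_n(A and B) / N_n(B) over n simulated runs.

    Illustrative events (an assumption): A = {first die is 3}, B = {sum is 5}.
    """
    rng = random.Random(seed)
    count_b = 0        # N_n(B)
    count_a_and_b = 0  # N_n(A and B)
    for _ in range(n):
        x1, x2 = rng.randint(1, 6), rng.randint(1, 6)
        if x1 + x2 == 5:
            count_b += 1
            if x1 == 3:
                count_a_and_b += 1
    # The ratio is undefined if B never occurred in the n runs.
    return count_a_and_b / count_b if count_b > 0 else float("nan")

for n in (100, 10_000, 100_000):
    print(n, conditional_relative_frequency(n))
```

For small \(n\) the estimate can be far from \(\frac{1}{4}\) because \(N_n(B)\) is small; it should settle near \(\frac{1}{4}\) as \(n\) grows.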

In some cases, conditional probabilities can be computed *directly*, by effectively reducing the sample space to the given event. In other cases, the formula in the mathematical definition is better. In some cases, conditional probabilities are known from modeling assumptions, and then are used to compute other probabilities. We will see examples of all of these situations in the computational exercises below.

It's very important that you not confuse \(\P(A \mid B)\), the probability of \(A\) given \(B\), with \(\P(B \mid A)\), the probability of \(B\) given \(A\). Making that mistake is known as the fallacy of the transposed conditional. (How embarrassing!)

### Conditional Distributions

Suppose that \(X\) is a random variable for the experiment with values in \(T\). Mathematically, \( X \) is a function from \( S \) into \( T \), and \( \{X \in A\} \) denotes the event \( \{s \in S: X(s) \in A\} \) for \(A \subseteq T \). Intuitively, \( X \) is a variable of interest in the experiment, and every meaningful statement about \( X \) defines an event. Recall that the probability distribution of \(X\) is the probability measure on \(T\) given by \[A \mapsto \P(X \in A), \quad A \subseteq T\] This has a natural extension to a conditional distribution, given an event.

If \(B\) is an event with \( \P(B) \gt 0 \), then the conditional distribution of \(X\) given \(B\) is the probability measure on \(T\) given by \[A \mapsto \P(X \in A \mid B), \quad A \subseteq T\]

## Details

Recall that \( T \) will come with a \( \sigma \)-algebra of admissible subsets so that \( (T, \mathscr T) \) is a measurable space, just like the sample space \( (S, \mathscr S) \). Random variable \( X \) is required to be measurable as a function from \( S \) into \( T \). This ensures that \( \{X \in A\} \) is a valid event for each \( A \in \mathscr T \), so that the definition makes sense.

## Basic Theory

### Preliminary Results

Our first result is of fundamental importance, and indeed was a crucial part of the argument for the definition of conditional probability.

Suppose again that \( B \) is an event with \( \P(B) \gt 0 \). Then \(A \mapsto \P(A \mid B)\) is a probability measure on \( S \).

## Proof

Clearly \( \P(A \mid B) \ge 0 \) for every event \( A \), and \( \P(S \mid B) = 1 \). Thus, suppose that \( \{A_i: i \in I\} \) is a countable collection of pairwise disjoint events. Then \[ \P\left(\bigcup_{i \in I} A_i \biggm| B\right) = \frac{1}{\P(B)} \P\left[\left(\bigcup_{i \in I} A_i\right) \cap B\right] = \frac{1}{\P(B)} \P\left(\bigcup_{i \in I} (A_i \cap B)\right) \] But the collection of events \( \{A_i \cap B: i \in I\} \) is also pairwise disjoint, so \[ \P\left(\bigcup_{i \in I} A_i \biggm| B\right) = \frac{1}{\P(B)} \sum_{i \in I} \P(A_i \cap B) = \sum_{i \in I} \frac{\P(A_i \cap B)}{\P(B)} = \sum_{i \in I} \P(A_i \mid B) \]

It's hard to overstate the importance of the last result because this theorem means that any result that holds for probability measures in general holds for conditional probability, as long as the conditioning event remains fixed. In particular the basic probability rules in the section on Probability Measure have analogs for conditional probability. To give two examples, \begin{align} \P\left(A^c \mid B\right) & = 1 - \P(A \mid B) \\ \P\left(A_1 \cup A_2 \mid B\right) & = \P\left(A_1 \mid B\right) + \P\left(A_2 \mid B\right) - \P\left(A_1 \cap A_2 \mid B\right) \end{align} By the same token, it follows that the conditional distribution of a random variable with values in \( T \), given above, really does define a probability distribution on \( T \). No further proof is necessary. Our next results are very simple.

Suppose that \(A\) and \(B\) are events with \( \P(B) \gt 0 \).

- If \(B \subseteq A\) then \(\P(A \mid B) = 1\).
- If \(A \subseteq B\) then \(\P(A \mid B) = \P(A) / \P(B)\).
- If \(A\) and \(B\) are disjoint then \(\P(A \mid B) = 0\).

## Proof

These results follow directly from the definition of conditional probability. In part (a), note that \( A \cap B = B \). In part (b) note that \( A \cap B = A \). In part (c) note that \( A \cap B = \emptyset \).

Parts (a) and (c) certainly make sense. Suppose that we know that event \( B \) has occurred. If \( B \subseteq A \) then \( A \) becomes a certain event. If \( A \cap B = \emptyset \) then \( A \) becomes an impossible event. A conditional probability can be computed relative to a probability measure that is itself a conditional probability measure. The following result is a consistency condition.

Suppose that \(A\), \(B\), and \(C\) are events with \( \P(B \cap C) \gt 0 \). The probability of \(A\) given \(B\), relative to \(\P(\cdot \mid C)\), is the same as the probability of \(A\) given \(B\) and \(C\) (relative to \(\P\)). That is, \[\frac{\P(A \cap B \mid C)}{\P(B \mid C)} = \P(A \mid B \cap C)\]

## Proof

From the definition, \[\frac{\P(A \cap B \mid C)}{\P(B \mid C)} = \frac{\P(A \cap B \cap C) \big/ \P(C)}{\P(B \cap C) \big/ \P(C)} = \frac{\P(A \cap B \cap C)}{\P(B \cap C)} = \P(A \mid B \cap C)\]

### Correlation

Our next discussion concerns an important concept that deals with how two events are related, in a probabilistic sense.

Suppose that \(A\) and \(B\) are events with \( \P(A) \gt 0 \) and \( \P(B) \gt 0 \).

- \(\P(A \mid B) \gt \P(A)\) if and only if \(\P(B \mid A) \gt \P(B)\) if and only if \(\P(A \cap B) \gt \P(A) \P(B)\). In this case, \( A \) and \( B \) are positively correlated.
- \(\P(A \mid B) \lt \P(A)\) if and only if \(\P(B \mid A) \lt \P(B)\) if and only if \(\P(A \cap B) \lt \P(A) \P(B)\). In this case, \( A \) and \( B \) are negatively correlated.
- \(\P(A \mid B) = \P(A)\) if and only if \(\P(B \mid A) = \P(B)\) if and only if \(\P(A \cap B) = \P(A) \P(B)\). In this case, \( A \) and \( B \) are uncorrelated or independent.

## Proof

These properties follow directly from the definition of conditional probability and simple algebra. Recall that multiplying or dividing an inequality by a positive number preserves the inequality.

Intuitively, if \( A \) and \( B \) are positively correlated, then the occurrence of either event means that the other event is more likely. If \(A\) and \(B\) are negatively correlated, then the occurrence of either event means that the other event is less likely. If \(A\) and \(B\) are uncorrelated, then the occurrence of either event does not change the probability of the other event. Independence is a fundamental concept that can be extended to more than two events and to random variables; these generalizations are studied in the next section on Independence. A much more general version of correlation, for random variables, is explored in the section on Covariance and Correlation in the chapter on Expected Value.

Suppose that \( A \) and \( B \) are events with probabilities strictly between 0 and 1. Note from the result on subsets and disjoint events above that if \( A \subseteq B \) or \( B \subseteq A \) then \( A \) and \( B \) are positively correlated. If \( A \) and \( B \) are disjoint then \( A \) and \( B \) are negatively correlated.

Suppose that \(A\) and \(B\) are events in a random experiment.

- \(A\) and \(B\) have the same correlation (positive, negative, or zero) as \(A^c\) and \(B^c\).
- \(A\) and \(B\) have the opposite correlation as \(A\) and \(B^c\) (that is, positive-negative, negative-positive, or 0-0).

## Proof

- Using De Morgan's law and the complement law, \[ \P(A^c \cap B^c) - \P(A^c) \P(B^c) = \P\left[(A \cup B)^c\right] - \P(A^c) \P(B^c) = \left[1 - \P(A \cup B)\right] - \left[1 - \P(A)\right]\left[1 - \P(B)\right] \] Using the inclusion-exclusion law and algebra, \[ \P(A^c \cap B^c) - \P(A^c) \P(B^c) = \P(A \cap B) - \P(A) \P(B) \]
- Using the difference rule and the complement law: \[ \P(A \cap B^c) - \P(A) \P(B^c) = \P(A) - \P(A \cap B) - \P(A) \left[1 - \P(B)\right] = -\left[\P(A \cap B) - \P(A) \P(B)\right]\]
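The algebra in part (a) can be spelled out by substituting \( \P(A \cup B) = \P(A) + \P(B) - \P(A \cap B) \) and expanding the product: \begin{align} \left[1 - \P(A \cup B)\right] - \left[1 - \P(A)\right]\left[1 - \P(B)\right] & = \left[1 - \P(A) - \P(B) + \P(A \cap B)\right] - \left[1 - \P(A) - \P(B) + \P(A) \P(B)\right] \\ & = \P(A \cap B) - \P(A) \P(B) \end{align} Thus the difference whose sign determines the correlation is the same for \( A^c \) and \( B^c \) as for \( A \) and \( B \).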

### The Multiplication Rule

Sometimes conditional probabilities are known and can be used to find the probabilities of other events. Note first that if \( A \) and \( B \) are events with positive probability, then by the very definition of conditional probability, \[ \P(A \cap B) = \P(A) \P(B \mid A) = \P(B) \P(A \mid B) \] The following generalization is known as the multiplication rule of probability. As usual, we assume that any event conditioned on has positive probability.

Suppose that \((A_1, A_2, \ldots, A_n)\) is a sequence of events. Then \[\P\left(A_1 \cap A_2 \cap \cdots \cap A_n\right) = \P\left(A_1\right) \P\left(A_2 \mid A_1\right) \P\left(A_3 \mid A_1 \cap A_2\right) \cdots \P\left(A_n \mid A_1 \cap A_2 \cap \cdots \cap A_{n-1}\right)\]

## Proof

The product on the right is a collapsing product in which only the probability of the intersection of all \(n\) events survives. The product of the first two factors is \( \P\left(A_1 \cap A_2\right) \), and hence the product of the first three factors is \( \P\left(A_1 \cap A_2 \cap A_3\right) \), and so forth. The proof can be made more rigorous by induction on \( n \).

The multiplication rule is particularly useful for experiments that consist of dependent stages, where \(A_i\) is an event in stage \(i\). Compare the multiplication rule of probability with the multiplication rule of combinatorics.
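As a minimal sketch of the multiplication rule for an experiment with dependent stages, the code below multiplies the stage-by-stage conditional probabilities using exact fractions. The function name `chain_probability` is hypothetical, and the example (dealing three hearts in a row from a standard 52-card deck) is chosen only for illustration.

```python
from fractions import Fraction

def chain_probability(conditional_probs):
    """Multiply P(A_1), P(A_2 | A_1), ..., P(A_n | A_1 and ... and A_{n-1})."""
    result = Fraction(1)
    for p in conditional_probs:
        result *= p
    return result

# Stage i: the i-th card dealt is a heart, given all earlier cards were hearts.
three_hearts = chain_probability([
    Fraction(13, 52),  # P(A_1)
    Fraction(12, 51),  # P(A_2 | A_1)
    Fraction(11, 50),  # P(A_3 | A_1 and A_2)
])
print(three_hearts)  # 11/850
```

Each factor is computed on the reduced deck left by the earlier stages, which is exactly the "reduce the sample space" reading of conditional probability.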

As with any other result, the multiplication rule can be applied to a conditional probability measure. In the context above, if \(E\) is another event, then \[\P\left(A_1 \cap A_2 \cap \cdots \cap A_n \mid E\right) = \P\left(A_1 \mid E\right) \P\left(A_2 \mid A_1 \cap E\right) \P\left(A_3 \mid A_1 \cap A_2 \cap E\right) \cdots \P\left(A_n \mid A_1 \cap A_2 \cap \cdots \cap A_{n-1} \cap E\right)\]

### Conditioning and Bayes' Theorem

Suppose that \(\mathscr{A} = \{A_i: i \in I\}\) is a countable collection of events that partition the sample space \(S\), and that \( \P(A_i) \gt 0 \) for each \( i \in I \).

The following theorem is known as the law of total probability.

If \( B \) is an event then \[\P(B) = \sum_{i \in I} \P(A_i) \P(B \mid A_i)\]

## Proof

Recall that \(\{A_i \cap B: i \in I\}\) is a partition of \(B\). Hence \[ \P(B) = \sum_{i \in I} \P(A_i \cap B) = \sum_{i \in I} \P(A_i) \P(B \mid A_i) \]

The following theorem is known as Bayes' Theorem, named after Thomas Bayes:

If \( B \) is an event then \[\P(A_j \mid B) = \frac{\P(A_j) \P(B \mid A_j)}{\sum_{i \in I}\P(A_i) \P(B \mid A_i)}, \quad j \in I\]

## Proof

Again the numerator is \(\P(A_j \cap B)\) while the denominator is \(\P(B)\) by the law of total probability.

These two theorems are most useful, of course, when we know \(\P(A_i)\) and \(\P(B \mid A_i)\) for each \(i \in I\). When we compute \(\P(B)\) by the law of total probability, we say that we are conditioning on the partition \(\mathscr{A}\). Note that we can think of the sum as a weighted average of the conditional probabilities \(\P(B \mid A_i)\) over \(i \in I\), where \(\P(A_i)\), \(i \in I\) are the weight factors. In the context of Bayes' theorem, \(\P(A_j)\) is the prior probability of \(A_j\) and \(\P(A_j \mid B)\) is the posterior probability of \(A_j\) for \( j \in I \). We will study more general versions of conditioning and Bayes' theorem in the section on Discrete Distributions in the chapter on Distributions, and again in the section on Conditional Expected Value in the chapter on Expected Value.
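For a finite partition, the two theorems translate directly into a few lines of code. The sketch below is illustrative only: the function names `total_probability` and `bayes` are hypothetical, and the priors and conditional probabilities are made-up numbers, not taken from any exercise.

```python
from fractions import Fraction

def total_probability(priors, likelihoods):
    """P(B) = sum_i P(A_i) P(B | A_i): a weighted average of the P(B | A_i)."""
    return sum(p * q for p, q in zip(priors, likelihoods))

def bayes(priors, likelihoods):
    """Posterior probabilities P(A_j | B) for each j; assumes P(B) > 0."""
    p_b = total_probability(priors, likelihoods)
    return [p * q / p_b for p, q in zip(priors, likelihoods)]

# Hypothetical numbers, chosen only for illustration.
priors = [Fraction(1, 2), Fraction(1, 3), Fraction(1, 6)]        # P(A_i)
likelihoods = [Fraction(1, 10), Fraction(1, 5), Fraction(1, 2)]  # P(B | A_i)

print(total_probability(priors, likelihoods))  # 1/5
print(bayes(priors, likelihoods))
```

With these numbers the posterior probabilities are \(\frac{1}{4}\), \(\frac{1}{3}\), and \(\frac{5}{12}\). The event with the smallest prior ends up with the largest posterior because \(B\) is much more likely under it.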

Once again, the law of total probability and Bayes' theorem can be applied to a conditional probability measure. So, if \(E\) is another event with \( \P(A_i \cap E) \gt 0 \) for \( i \in I \) then \begin{align} \P(B \mid E) & = \sum_{i \in I} \P(A_i \mid E) \P(B \mid A_i \cap E) \\ \P(A_j \mid B \cap E) & = \frac{\P(A_j \mid E) \P(B \mid A_j \cap E)}{\sum_{i \in I}\P(A_i \mid E) \P(B \mid A_i \cap E)}, \quad j \in I \end{align}

## Examples and Applications

### Basic Rules

Suppose that \(A\) and \(B\) are events in an experiment with \(\P(A) = \frac{1}{3}\), \(\P(B) = \frac{1}{4}\), \(\P(A \cap B) = \frac{1}{10}\). Find each of the following:

- \(\P(A \mid B)\)
- \(\P(B \mid A)\)
- \(\P(A^c \mid B)\)
- \(\P(B^c \mid A)\)
- \(\P(A^c \mid B^c)\)

## Answer

- \(\frac{2}{5}\)
- \(\frac{3}{10}\)
- \(\frac{3}{5}\)
- \(\frac{7}{10}\)
- \(\frac{31}{45}\)
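The answers above can be checked with exact arithmetic. This is only a sketch of the computation; the variable names are illustrative.

```python
from fractions import Fraction

p_a, p_b, p_ab = Fraction(1, 3), Fraction(1, 4), Fraction(1, 10)

p_a_given_b = p_ab / p_b                    # definition of conditional probability
p_b_given_a = p_ab / p_a
p_ac_given_b = 1 - p_a_given_b              # complement rule for P(. | B)
p_bc_given_a = 1 - p_b_given_a              # complement rule for P(. | A)
p_union = p_a + p_b - p_ab                  # inclusion-exclusion
p_ac_given_bc = (1 - p_union) / (1 - p_b)   # De Morgan: A^c and B^c = (A or B)^c

print(p_a_given_b, p_b_given_a, p_ac_given_b, p_bc_given_a, p_ac_given_bc)
# 2/5 3/10 3/5 7/10 31/45
```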

Suppose that \(A\), \(B\), and \(C\) are events in a random experiment with \(\P(A \mid C) = \frac{1}{2}\), \(\P(B \mid C) = \frac{1}{3}\), and \(\P(A \cap B \mid C) = \frac{1}{4}\). Find each of the following:

- \(\P(B \setminus A \mid C)\)
- \(\P(A \cup B \mid C)\)
- \(\P(A^c \cap B^c \mid C)\)
- \(\P(A^c \cup B^c \mid C)\)
- \(\P(A^c \cup B \mid C)\)
- \(\P(A \mid B \cap C)\)

## Answer

- \(\frac{1}{12}\)
- \(\frac{7}{12}\)
- \(\frac{5}{12}\)
- \(\frac{3}{4}\)
- \(\frac{3}{4}\)
- \(\frac{3}{4}\)

Suppose that \(A\) and \(B\) are events in a random experiment with \(\P(A) = \frac{1}{2}\), \(\P(B) = \frac{1}{3}\), and \(\P(A \mid B) =\frac{3}{4}\).

- Find \(\P(A \cap B)\)
- Find \(\P(A \cup B)\)
- Find \(\P(B \cup A^c)\)
- Find \(\P(B \mid A)\)
- Are \(A\) and \(B\) positively correlated, negatively correlated, or independent?

## Answer

- \(\frac{1}{4}\)
- \(\frac{7}{12}\)
- \(\frac{3}{4}\)
- \(\frac{1}{2}\)
- positively correlated.

Open the conditional probability experiment.

- Given \( \P(A) \), \( \P(B) \), and \( \P(A \cap B) \), in the table, verify all of the other probabilities in the table.
- Run the experiment 1000 times and compare the probabilities with the relative frequencies.

### Simple Populations

In a certain population, 30% of the persons smoke cigarettes and 8% have COPD (Chronic Obstructive Pulmonary Disease). Moreover, 12% of the persons who smoke have COPD.

- What percentage of the population smoke and have COPD?
- What percentage of the population with COPD also smoke?
- Are smoking and COPD positively correlated, negatively correlated, or independent?

## Answer

- 3.6%
- 45%
- positively correlated.

A company has 200 employees: 120 are women and 80 are men. Of the 120 female employees, 30 are classified as managers, while 20 of the 80 male employees are managers. Suppose that an employee is chosen at random.

- Find the probability that the employee is female.
- Find the probability that the employee is a manager.
- Find the conditional probability that the employee is a manager given that the employee is female.
- Find the conditional probability that the employee is female given that the employee is a manager.
- Are the events *female* and *manager* positively correlated, negatively correlated, or independent?

## Answer

- \(\frac{120}{200}\)
- \(\frac{50}{200}\)
- \(\frac{30}{120}\)
- \(\frac{30}{50}\)
- independent

### Dice and Coins

Consider the experiment that consists of rolling 2 standard, fair dice and recording the sequence of scores \(\bs{X} = (X_1, X_2)\). Let \(Y\) denote the sum of the scores. For each of the following pairs of events, find the probability of each event and the conditional probability of each event given the other. Determine whether the events are positively correlated, negatively correlated, or independent.

- \(\{X_1 = 3\}\), \(\{Y = 5\}\)
- \(\{X_1 = 3\}\), \(\{Y = 7\}\)
- \(\{X_1 = 2\}\), \(\{Y = 5\}\)
- \(\{X_1 = 3\}\), \(\{X_1 = 2\}\)

## Answer

In each case below, the answers are for \( \P(A) \), \( \P(B) \), \( \P(A \mid B) \), and \( \P(B \mid A) \), in that order.

- \(\frac{1}{6}\), \(\frac{1}{9}\), \(\frac{1}{4}\), \(\frac{1}{6}\). Positively correlated.
- \(\frac{1}{6}\), \(\frac{1}{6}\), \(\frac{1}{6}\), \(\frac{1}{6}\). Independent.
- \(\frac{1}{6}\), \(\frac{1}{9}\), \(\frac{1}{4}\), \(\frac{1}{6}\). Positively correlated.
- \(\frac{1}{6}\), \(\frac{1}{6}\), \(0\), \(0\). Negatively correlated.
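Because the sample space has only 36 equally likely outcomes, these answers can be verified by brute-force enumeration. The following is a minimal sketch; the helper names (`prob`, `cond_prob`, `correlation`, `first_is`, `sum_is`) are hypothetical and events are represented as predicates on an outcome \((x_1, x_2)\).

```python
from fractions import Fraction
from itertools import product

# Sample space: 36 equally likely ordered pairs (x1, x2).
OUTCOMES = list(product(range(1, 7), repeat=2))

def prob(event):
    """P(event), where event is a predicate on an outcome (x1, x2)."""
    return Fraction(sum(1 for s in OUTCOMES if event(s)), len(OUTCOMES))

def cond_prob(a, b):
    """P(a | b) = P(a and b) / P(b); assumes P(b) > 0."""
    return prob(lambda s: a(s) and b(s)) / prob(b)

def correlation(a, b):
    """Classify two events by comparing P(a and b) with P(a) P(b)."""
    diff = prob(lambda s: a(s) and b(s)) - prob(a) * prob(b)
    if diff > 0:
        return "positive"
    if diff < 0:
        return "negative"
    return "independent"

def first_is(k):
    """Event {X_1 = k}."""
    return lambda s: s[0] == k

def sum_is(k):
    """Event {Y = k}."""
    return lambda s: s[0] + s[1] == k

pairs = [(first_is(3), sum_is(5)), (first_is(3), sum_is(7)), (first_is(3), first_is(2))]
for a, b in pairs:
    print(prob(a), prob(b), cond_prob(a, b), cond_prob(b, a), correlation(a, b))
```

The three printed lines correspond to parts (a), (b), and (d) of the exercise.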

Note that positive correlation is not a transitive relation. From the previous exercise, for example, note that \(\{X_1 = 3\}\) and \(\{Y = 5\}\) are positively correlated, \(\{Y = 5\}\) and \(\{X_1 = 2\}\) are positively correlated, but \(\{X_1 = 3\}\) and \(\{X_1 = 2\}\) are negatively correlated (in fact, disjoint).

In the dice experiment, set \(n = 2\). Run the experiment 1000 times. Compute the empirical conditional probabilities corresponding to the conditional probabilities in the last exercise.

Consider again the experiment that consists of rolling 2 standard, fair dice and recording the sequence of scores \(\bs{X} = (X_1, X_2)\). Let \(Y\) denote the sum of the scores, \(U\) the minimum score, and \(V\) the maximum score.

- Find \(\P(U = u \mid V = 4)\) for the appropriate values of \(u\).
- Find \(\P(Y = y \mid V = 4)\) for the appropriate values of \(y\).
- Find \(\P(V = v \mid Y = 8)\) for appropriate values of \(v\).
- Find \(\P(U = u \mid Y = 8)\) for the appropriate values of \(u\).
- Find \(\P[(X_1, X_2) = (x_1, x_2) \mid Y = 8]\) for the appropriate values of \((x_1, x_2)\).

## Answer

- \(\frac{2}{7}\) for \(u \in \{1, 2, 3\}\), \(\frac{1}{7}\) for \( u = 4 \)
- \(\frac{2}{7}\) for \(y \in \{5, 6, 7\}\), \(\frac{1}{7}\) for \( y = 8 \)
- \(\frac{1}{5}\) for \( v = 4 \), \(\frac{2}{5}\) for \(v \in \{5, 6\}\)
- \(\frac{2}{5}\) for \(u \in \{2, 3\}\), \(\frac{1}{5}\) for \( u = 4 \)
- \(\frac{1}{5}\) for \((x_1, x_2) \in \{(2,6), (6,2), (3,5), (5,3), (4,4)\}\)
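These conditional distributions are a good example of computing directly by reducing the sample space to the conditioning event. The sketch below keeps only the outcomes in the given event and counts values of the variable of interest; the function name `conditional_distribution` is hypothetical.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

# Sample space: 36 equally likely ordered pairs (x1, x2).
OUTCOMES = list(product(range(1, 7), repeat=2))

def conditional_distribution(variable, condition):
    """Return {value: P(variable = value | condition)} for two fair dice.

    Works by restricting to the outcomes where the condition holds, which is
    valid here because all outcomes are equally likely.
    """
    given = [s for s in OUTCOMES if condition(s)]
    counts = Counter(variable(s) for s in given)
    return {v: Fraction(c, len(given)) for v, c in sorted(counts.items())}

u_given_v4 = conditional_distribution(min, lambda s: max(s) == 4)  # part (a)
v_given_y8 = conditional_distribution(max, lambda s: sum(s) == 8)  # part (c)

print(u_given_v4)
print(v_given_y8)
```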

In the die-coin experiment, a standard, fair die is rolled and then a fair coin is tossed the number of times showing on the die. Let \(N\) denote the die score and \(H\) the event that all coin tosses result in heads.

- Find \(\P(H)\).
- Find \(\P(N = n \mid H)\) for \(n \in \{1, 2, 3, 4, 5, 6\}\).
- Compare the results in (b) with \(\P(N = n)\) for \(n \in \{1, 2, 3, 4, 5, 6\}\). In each case, note whether the events \(H\) and \(\{N = n\}\) are positively correlated, negatively correlated, or independent.

## Answer

- \(\frac{21}{128}\)
- \(\frac{64}{63} \frac{1}{2^n}\) for \(n \in \{1, 2, 3, 4, 5, 6\}\)
- positively correlated for \(n \in \{1, 2\}\) and negatively correlated for \(n \in \{3, 4, 5, 6\}\)
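The computation behind these answers conditions on the die score and then applies Bayes' theorem. A short sketch with exact fractions follows; the variable names are illustrative, and the correlation is read off by comparing each posterior probability with its prior.

```python
from fractions import Fraction

faces = range(1, 7)
prior = {n: Fraction(1, 6) for n in faces}            # P(N = n)
likelihood = {n: Fraction(1, 2) ** n for n in faces}  # P(H | N = n)

# Law of total probability, conditioning on the die score.
p_heads = sum(prior[n] * likelihood[n] for n in faces)

# Bayes' theorem: P(N = n | H).
posterior = {n: prior[n] * likelihood[n] / p_heads for n in faces}

def sign(n):
    """Correlation of H and {N = n}, from comparing posterior with prior."""
    if posterior[n] > prior[n]:
        return "positive"
    if posterior[n] < prior[n]:
        return "negative"
    return "independent"

print(p_heads)  # 21/128
for n in faces:
    print(n, posterior[n], sign(n))
```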

Run the die-coin experiment 1000 times. Let \(H\) and \(N\) be as defined in the previous exercise.

- Compute the empirical probability of \(H\). Compare with the true probability in the previous exercise.
- Compute the empirical probability of \(\{N = n\}\) given \(H\), for \(n \in \{1, 2, 3, 4, 5, 6\}\). Compare with the true probabilities in the previous exercise.

Suppose that a bag contains 12 coins: 5 are fair, 4 are biased with probability of heads \(\frac{1}{3}\); and 3 are two-headed. A coin is chosen at random from the bag and tossed.

- Find the probability that the coin is heads.
- Given that the coin is heads, find the conditional probability of each coin type.

## Answer

- \(\frac{41}{72}\)
- \( \frac{15}{41} \) that the coin is fair, \( \frac{8}{41} \) that the coin is biased, \( \frac{18}{41} \) that the coin is two-headed

Compare the die-coin experiment and the bag of coins experiment. In the die-coin experiment, we toss a coin with a *fixed* probability of heads a *random* number of times. In the bag of coins experiment, we effectively toss a coin with a *random* probability of heads a *fixed* number of times. The random experiment of tossing a coin with a fixed probability of heads \(p\) a fixed number of times \(n\) is known as the binomial experiment with parameters \(n\) and \(p\). This is a very basic and important experiment that is studied in more detail in the section on the binomial distribution in the chapter on Bernoulli Trials. Thus, the die-coin and bag of coins experiments can be thought of as modifications of the binomial experiment in which a parameter has been randomized. In general, interesting new random experiments can often be constructed by randomizing one or more parameters in another random experiment.

In the coin-die experiment, a fair coin is tossed. If the coin lands tails, a fair die is rolled. If the coin lands heads, an ace-six flat die is tossed (faces 1 and 6 have probability \(\frac{1}{4}\) each, while faces 2, 3, 4, and 5 have probability \(\frac{1}{8}\) each). Let \(H\) denote the event that the coin lands heads, and let \(Y\) denote the score when the chosen die is tossed.

- Find \(\P(Y = y)\) for \(y \in \{1, 2, 3, 4, 5, 6\}\).
- Find \(\P(H \mid Y = y)\) for \(y \in \{1, 2, 3, 4, 5, 6\}\).
- Compare each probability in part (b) with \(\P(H)\). In each case, note whether the events \(H\) and \(\{Y = y\}\) are positively correlated, negatively correlated, or independent.

## Answer

- \(\frac{5}{24}\) for \(y \in \{1, 6\}\), \(\frac{7}{48}\) for \(y \in \{2, 3, 4, 5\}\)
- \(\frac{3}{5}\) for \(y \in \{1, 6\}\), \(\frac{3}{7}\) for \(y \in \{2, 3, 4, 5\}\)
- Positively correlated for \( y \in \{1, 6\} \), negatively correlated for \( y \in \{2, 3, 4, 5\} \)

Run the coin-die experiment 1000 times. Let \(H\) and \(Y\) be as defined in the previous exercise.

- Compute the empirical probability of \(\{Y = y\}\), for each \(y\), and compare with the true probability in the previous exercise.
- Compute the empirical probability of \(H\) given \(\{Y = y\}\) for each \(y\), and compare with the true probability in the previous exercise.

### Cards

Consider the card experiment that consists of dealing 2 cards from a standard deck and recording the sequence of cards dealt. For \(i \in \{1, 2\}\), let \(Q_i\) be the event that card \(i\) is a queen and \(H_i\) the event that card \(i\) is a heart. For each of the following pairs of events, compute the probability of each event, and the conditional probability of each event given the other. Determine whether the events are positively correlated, negatively correlated, or independent.

- \(Q_1\), \(H_1\)
- \(Q_1\), \(Q_2\)
- \(Q_2\), \(H_2\)
- \(Q_1\), \(H_2\)

## Answer

The answers below are for \( \P(A) \), \( \P(B) \), \( \P(A \mid B) \), and \( \P(B \mid A) \), where \( A \) and \( B \) are the given events.

- \(\frac{1}{13}\), \(\frac{1}{4}\), \(\frac{1}{13}\), \(\frac{1}{4}\), independent.
- \(\frac{1}{13}\), \(\frac{1}{13}\), \(\frac{3}{51}\), \(\frac{3}{51}\), negatively correlated.
- \(\frac{1}{13}\), \(\frac{1}{4}\), \(\frac{1}{13}\), \(\frac{1}{4}\), independent.
- \(\frac{1}{13}\), \(\frac{1}{4}\), \(\frac{1}{13}\), \(\frac{1}{4}\), independent.

In the card experiment, set \(n = 2\). Run the experiment 500 times. Compute the conditional relative frequencies corresponding to the conditional probabilities in the last exercise.

Consider the card experiment that consists of dealing 3 cards from a standard deck and recording the sequence of cards dealt. Find the probability of the following events:

- All three cards are hearts.
- The first two cards are hearts and the third is a spade.
- The first and third cards are hearts and the second is a spade.

## Answer

- \(\frac{11}{850}\)
- \(\frac{13}{850}\)
- \(\frac{13}{850}\)

In the card experiment, set \(n = 3\) and run the simulation 1000 times. Compute the empirical probability of each event in the previous exercise and compare with the true probability.

### Bivariate Uniform Distributions

Recall that Buffon's coin experiment consists of tossing a coin with radius \(r \le \frac{1}{2}\) randomly on a floor covered with square tiles of side length 1. The coordinates \((X, Y)\) of the center of the coin are recorded relative to axes through the center of the square, parallel to the sides. Since the coin is tossed randomly, the basic modeling assumption is that \( (X, Y) \) is uniformly distributed on the square \( [-1/2, 1/2]^2 \).

In Buffon's coin experiment,

- Find \(\P(Y \gt 0 \mid X \lt Y)\)
- Find the conditional distribution of \((X, Y)\) given that the coin does not touch the sides of the square.

## Answer

- \(\frac{3}{4}\)
- Given \((X, Y) \in [r - \frac{1}{2}, \frac{1}{2} - r]^2\), \((X, Y)\) is uniformly distributed on this set.
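As a sketch of the computation in part (a): since \( (X, Y) \) is uniformly distributed on a square of area 1, the probability of an event is simply the area of the corresponding region. By symmetry, \( \P(X \lt Y) = \frac{1}{2} \). For the intersection, for each \( y \in (0, \frac{1}{2}) \) the admissible values of \( x \) form the interval \( [-\frac{1}{2}, y) \) of length \( y + \frac{1}{2} \), so \begin{align} \P(Y \gt 0, X \lt Y) & = \int_0^{1/2} \left(y + \frac{1}{2}\right) dy = \frac{1}{8} + \frac{1}{4} = \frac{3}{8} \\ \P(Y \gt 0 \mid X \lt Y) & = \frac{3/8}{1/2} = \frac{3}{4} \end{align}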

Run Buffon's coin experiment 500 times. Compute the empirical probability that \(Y \gt 0\) given that \(X \lt Y\) and compare with the probability in the last exercise.

In the conditional probability experiment, the random points are uniformly distributed on the rectangle \( S \). Move and resize events \( A \) and \( B \) and note how the probabilities change. For each of the following configurations, run the experiment 1000 times and compare the relative frequencies with the true probabilities.

- \( A \) and \( B \) in general position
- \( A \) and \( B \) disjoint
- \( A \subseteq B \)
- \( B \subseteq A \)

### Reliability

A plant has 3 assembly lines that produce memory chips. Line 1 produces 50% of the chips and has a defective rate of 4%; line 2 produces 30% of the chips and has a defective rate of 5%; line 3 produces 20% of the chips and has a defective rate of 1%. A chip is chosen at random from the plant.

- Find the probability that the chip is defective.
- Given that the chip is defective, find the conditional probability for each line.

## Answer

- 0.037
- 0.541 for line 1, 0.405 for line 2, 0.054 for line 3

Suppose that a bit (0 or 1) is sent through a noisy communications channel. Because of the noise, the bit sent may be received incorrectly as the complementary bit. Specifically, suppose that if 0 is sent, then the probability that 0 is received is 0.9 and the probability that 1 is received is 0.1. If 1 is sent, then the probability that 1 is received is 0.8 and the probability that 0 is received is 0.2. Finally, suppose that 1 is sent with probability 0.6 and 0 is sent with probability 0.4. Find the probability that

- 1 was sent given that 1 was received
- 0 was sent given that 0 was received

## Answer

- \(12/13\)
- \(3/4\)

Suppose that \(T\) denotes the lifetime of a light bulb (in 1000 hour units), and that \(T\) has the following exponential distribution, defined for measurable \(A \subseteq [0, \infty)\):

\[\P(T \in A) = \int_A e^{-t} dt\]

- Find \(\P(T \gt 3)\)
- Find \(\P(T \gt 5 \mid T \gt 2)\)

## Answer

- \(e^{-3}\)
- \(e^{-3}\)
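The two answers agree, and a short computation shows why. For \( t \ge 0 \), \( \P(T \gt t) = \int_t^\infty e^{-s} ds = e^{-t} \). Since \( \{T \gt 5\} \subseteq \{T \gt 2\} \), \[ \P(T \gt 5 \mid T \gt 2) = \frac{\P(T \gt 5)}{\P(T \gt 2)} = \frac{e^{-5}}{e^{-2}} = e^{-3} = \P(T \gt 3) \] In words, under this model a bulb that has already lasted 2000 hours has the same chance of lasting 3000 more hours as a new bulb has of lasting 3000 hours.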

Suppose again that \(T\) denotes the lifetime of a light bulb (in 1000 hour units), but that \(T\) is uniformly distributed on the interval \([0, 10]\).

- Find \(\P(T \gt 3)\)
- Find \(\P(T \gt 5 \mid T \gt 2)\)

## Answer

- \(\frac{7}{10}\)
- \(\frac{5}{8}\)

### Genetics

Please refer to the discussion of genetics in the section on random experiments if you need to review some of the definitions in this section.

Recall first that the ABO blood type in humans is determined by three alleles: \(a\), \(b\), and \(o\). Furthermore, \(a\) and \(b\) are co-dominant and \(o\) is recessive. Suppose that the probability distribution for the set of blood genotypes in a certain population is given in the following table:

Genotype | \(aa\) | \(ab\) | \(ao\) | \(bb\) | \(bo\) | \(oo\) |
---|---|---|---|---|---|---|
Probability | 0.050 | 0.038 | 0.310 | 0.007 | 0.116 | 0.479 |

Suppose that a person is chosen at random from the population. Let \(A\), \(B\), \(AB\), and \(O\) be the events that the person is type \(A\), type \(B\), type \(AB\), and type \(O\) respectively. Let \(H\) be the event that the person is homozygous, and let \(D\) denote the event that the person has an \(o\) allele. Find each of the following:

- \(\P(A)\), \(\P(B)\), \(\P(AB)\), \(\P(O)\), \(\P(H)\), \(\P(D)\)
- \(\P(A \cap H)\), \(\P(A \mid H)\), \(\P(H \mid A)\). Are the events \(A\) and \(H\) positively correlated, negatively correlated, or independent?
- \(\P(B \cap H)\), \(\P(B \mid H)\), \(\P(H \mid B)\). Are the events \(B\) and \(H\) positively correlated, negatively correlated, or independent?
- \(\P(A \cap D)\), \(\P(A \mid D)\), \(\P(D \mid A)\). Are the events \(A\) and \(D\) positively correlated, negatively correlated, or independent?
- \(\P(B \cap D)\), \(\P(B \mid D)\), \(\P(D \mid B)\). Are the events \(B\) and \(D\) positively correlated, negatively correlated, or independent?
- \(\P(H \cap D)\), \(\P(H \mid D)\), \(\P(D \mid H)\). Are the events \(H\) and \(D\) positively correlated, negatively correlated, or independent?

## Answer

- 0.360, 0.123, 0.038, 0.479, 0.536, 0.905
- 0.050, 0.093, 0.139. \(A\) and \(H\) are negatively correlated.
- 0.007, 0.013, 0.057. \(B\) and \(H\) are negatively correlated.
- 0.310, 0.343, 0.861. \(A\) and \(D\) are negatively correlated.
- 0.116, 0.128, 0.943. \(B\) and \(D\) are positively correlated.
- 0.479, 0.529, 0.894. \(H\) and \(D\) are negatively correlated.
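Each event in this exercise is a set of genotypes, so the answers can be reproduced by summing table entries. The following sketch stores the table in exact thousandths and rounds only for display; the names `GENOTYPE`, `prob`, and `summarize` are hypothetical.

```python
from fractions import Fraction

# Genotype probabilities from the table, stored exactly in thousandths.
GENOTYPE = {g: Fraction(k, 1000) for g, k in
            {"aa": 50, "ab": 38, "ao": 310, "bb": 7, "bo": 116, "oo": 479}.items()}

def prob(event):
    """P(event), where event is a set of genotypes."""
    return sum(GENOTYPE[g] for g in event)

def summarize(e, f):
    """Return P(e and f), P(e | f), P(f | e), and the sign of the correlation."""
    p_ef = prob(e & f)
    diff = p_ef - prob(e) * prob(f)
    sign = "positive" if diff > 0 else "negative" if diff < 0 else "independent"
    return p_ef, p_ef / prob(f), p_ef / prob(e), sign

TYPE_A, TYPE_B = {"aa", "ao"}, {"bb", "bo"}
HOMOZYGOUS, HAS_O = {"aa", "bb", "oo"}, {"ao", "bo", "oo"}

for name, (e, f) in {"A, H": (TYPE_A, HOMOZYGOUS), "B, D": (TYPE_B, HAS_O)}.items():
    p_ef, p_e_given_f, p_f_given_e, sign = summarize(e, f)
    print(name, float(p_ef), round(float(p_e_given_f), 3), round(float(p_f_given_e), 3), sign)
```

The two printed lines correspond to parts (b) and (e); the other parts follow by passing the appropriate pair of sets to `summarize`.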

Suppose next that pod color in a certain type of pea plant is determined by a gene with two alleles: \(g\) for green and \(y\) for yellow, and that \(g\) is dominant and \(y\) recessive.

Suppose that a green-pod plant and a yellow-pod plant are bred together. Suppose further that the green-pod plant has a \(\frac{1}{4}\) chance of carrying the recessive yellow-pod allele.

- Find the probability that a child plant will have green pods.
- Given that a child plant has green pods, find the updated probability that the green-pod parent has the recessive allele.

## Answer

- \(\frac{7}{8}\)
- \(\frac{1}{7}\)

Suppose that two green-pod plants are bred together. Suppose further that with probability \(\frac{1}{3}\) neither plant has the recessive allele, with probability \(\frac{1}{2}\) one plant has the recessive allele, and with probability \(\frac{1}{6}\) both plants have the recessive allele.

- Find the probability that a child plant has green pods.
- Given that a child plant has green pods, find the updated probability that both parents have the recessive allele.

## Answer

- \(\frac{23}{24}\)
- \(\frac{3}{23}\)
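
Both answers come from the law of total probability followed by Bayes' theorem. The following is a minimal sketch in Python with exact fractions. The function name `posterior` and the scenario labels are illustrative choices, not notation from the text. The sketch assumes that each parent passes on either of its two alleles with probability \(\frac{1}{2}\), so that a child of two carriers has green pods with probability \(\frac{3}{4}\).

```python
from fractions import Fraction as F

def posterior(priors, likelihoods):
    """Return (total probability of the evidence, posterior of each scenario).

    priors[k]      = probability of scenario k
    likelihoods[k] = conditional probability of the evidence given scenario k
    """
    # Law of total probability: sum over the scenarios.
    total = sum(priors[k] * likelihoods[k] for k in priors)
    # Bayes' theorem: renormalize each term by the total.
    post = {k: priors[k] * likelihoods[k] / total for k in priors}
    return total, post

# Scenarios: how many of the two green-pod parents carry the recessive allele y.
priors = {"neither": F(1, 3), "one": F(1, 2), "both": F(1, 6)}
# A child has yellow pods only if it receives y from both parents, so only the
# "both" scenario can produce a yellow child, with probability 1/2 * 1/2 = 1/4.
green_given = {"neither": F(1), "one": F(1), "both": F(3, 4)}

p_green, post = posterior(priors, green_given)
print(p_green)        # 23/24
print(post["both"])   # 3/23
```

The same helper applies to the previous exercise by using two scenarios (the green-pod parent is or is not a carrier) with likelihoods \(\frac{1}{2}\) and 1.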

Next consider a sex-linked hereditary disorder in humans (such as colorblindness or hemophilia). Let \(h\) denote the healthy allele and \(d\) the defective allele for the gene linked to the disorder. Recall that \(h\) is dominant and \(d\) recessive for women.

Suppose that in a certain population, 50% are male and 50% are female. Moreover, suppose that 10% of males are color blind but only 1% of females are color blind.

- Find the percentage of color blind persons in the population.
- Find the percentage of color blind persons that are male.

## Answer

- 5.5%
- 90.9%

Since color blindness is a sex-linked hereditary disorder, note that it's reasonable in the previous exercise that the probability that a female is color blind is the square of the probability that a male is color blind. If \(p\) is the probability of the defective allele on the \(X\) chromosome, then \(p\) is also the probability that a male will be color blind. But since the defective allele is recessive, a woman would need two copies of the defective allele to be color blind, and assuming independence, the probability of this event is \(p^2\).

A man and a woman do not have a certain sex-linked hereditary disorder, but the woman has a \(\frac{1}{3}\) chance of being a carrier.

- Find the probability that a son born to the couple will be normal.
- Find the probability that a daughter born to the couple will be a carrier.
- Given that a son born to the couple is normal, find the updated probability that the mother is a carrier.

## Answer

- \(\frac{5}{6}\)
- \(\frac{1}{6}\)
- \(\frac{1}{5}\)
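
One way to organize the computation, sketched under the usual assumption that a carrier mother passes the defective allele to a child with probability \(\frac{1}{2}\): let \(C\) denote the event that the mother is a carrier, \(N\) the event that a son is normal, and \(K\) the event that a daughter is a carrier (these symbols are introduced only for this sketch). Since the father is normal, a daughter always receives \(h\) from him, so she is a carrier only if the mother is a carrier and passes on \(d\).

\[\P(N) = \P(C^c) \P(N \mid C^c) + \P(C) \P(N \mid C) = \frac{2}{3} \cdot 1 + \frac{1}{3} \cdot \frac{1}{2} = \frac{5}{6}\]

\[\P(K) = \P(C) \P(K \mid C) = \frac{1}{3} \cdot \frac{1}{2} = \frac{1}{6}\]

\[\P(C \mid N) = \frac{\P(C) \P(N \mid C)}{\P(N)} = \frac{1/6}{5/6} = \frac{1}{5}\]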

### Urn Models

Urn 1 contains 4 red and 6 green balls while urn 2 contains 7 red and 3 green balls. An urn is chosen at random and then a ball is chosen at random from the selected urn.

- Find the probability that the ball is green.
- Given that the ball is green, find the conditional probability that urn 1 was selected.

## Answer

- \(\frac{9}{20}\)
- \(\frac{2}{3}\)

Urn 1 contains 4 red and 6 green balls while urn 2 contains 6 red and 3 green balls. A ball is selected at random from urn 1 and transferred to urn 2. Then a ball is selected at random from urn 2.

- Find the probability that the ball from urn 2 is green.
- Given that the ball from urn 2 is green, find the conditional probability that the ball from urn 1 was green.

## Answer

- \(\frac{9}{25}\)
- \(\frac{2}{3}\)
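
The second answer is Bayes' theorem applied to the color of the transferred ball. As a sketch, write \(G_1\) for the event that the ball from urn 1 is green and \(G_2\) for the event that the ball from urn 2 is green (notation introduced here for illustration). After the transfer, urn 2 holds 10 balls, of which 4 are green if \(G_1\) occurred and 3 are green otherwise.

\[\P(G_2) = \P(G_1) \P(G_2 \mid G_1) + \P(G_1^c) \P(G_2 \mid G_1^c) = \frac{6}{10} \cdot \frac{4}{10} + \frac{4}{10} \cdot \frac{3}{10} = \frac{9}{25}\]

\[\P(G_1 \mid G_2) = \frac{\P(G_1) \P(G_2 \mid G_1)}{\P(G_2)} = \frac{6/25}{9/25} = \frac{2}{3}\]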

An urn initially contains 6 red and 4 green balls. A ball is chosen at random from the urn and its color is recorded. It is then replaced in the urn and 2 new balls of the same color are added to the urn. The process is repeated. Find the probability of each of the following events:

- Balls 1 and 2 are red and ball 3 is green.
- Balls 1 and 3 are red and ball 2 is green.
- Ball 1 is green and balls 2 and 3 are red.
- Ball 2 is red.
- Ball 1 is red given that ball 2 is red.

## Answer

- \(\frac{4}{35}\)
- \(\frac{4}{35}\)
- \(\frac{4}{35}\)
- \(\frac{3}{5}\)
- \(\frac{2}{3}\)

Think about the results in the previous exercise. Note in particular that the answers to parts (a), (b), and (c) are the same, and that the probability that the second ball is red in part (d) is the same as the probability that the first ball is red. More generally, the probabilities of events do not depend on the order of the draws. For example, the probability of an event involving the first, second, and third draws is the same as the probability of the corresponding event involving the seventh, tenth, and fifth draws. Technically, the sequence of events \((R_1, R_2, \ldots)\) is exchangeable. The random process described in this exercise is a special case of Pólya's urn scheme, named after George Pólya. We will study Pólya's urn in more detail in the chapter on Finite Sampling Models.
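
To see why parts (a), (b), and (c) agree, it may help to write out the products given by the multiplication rule, with \(R_i\) denoting the event that ball \(i\) is red:

\[\P(R_1 \cap R_2 \cap R_3^c) = \frac{6}{10} \cdot \frac{8}{12} \cdot \frac{4}{14}, \quad \P(R_1 \cap R_2^c \cap R_3) = \frac{6}{10} \cdot \frac{4}{12} \cdot \frac{8}{14}, \quad \P(R_1^c \cap R_2 \cap R_3) = \frac{4}{10} \cdot \frac{6}{12} \cdot \frac{8}{14}\]

The denominators 10, 12, 14 do not depend on the colors drawn, and the numerators are the same three numbers 6, 8, 4 in a different order, so each product equals \(\frac{4}{35}\).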

An urn initially contains 6 red and 4 green balls. A ball is chosen at random from the urn and its color is recorded. It is then replaced in the urn and two new balls of the *other* color are added to the urn. The process is repeated. Find the probability of each of the following events:

- Balls 1 and 2 are red and ball 3 is green.
- Balls 1 and 3 are red and ball 2 is green.
- Ball 1 is green and balls 2 and 3 are red.
- Ball 2 is red.
- Ball 1 is red given that ball 2 is red.

## Answer

- \(\frac{6}{35}\)
- \(\frac{6}{35}\)
- \(\frac{16}{105}\)
- \(\frac{17}{30}\)
- \(\frac{9}{17}\)

Think about the results in the previous exercise, and compare with Pólya's urn. Note that the answers to parts (a), (b), and (c) are not all the same, and that the probability that the second ball is red in part (d) is not the same as the probability that the first ball is red. In short, the sequence of events \((R_1, R_2, \ldots)\) is *not* exchangeable.
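
The contrast between the two urn schemes can be checked with exact arithmetic. The following Python sketch applies the multiplication rule draw by draw. The function name `sequence_probability` and its parameters are hypothetical names chosen for this illustration. The flag `same_color` selects Pólya's rule (add balls of the color drawn) or the rule of the last exercise (add balls of the other color).

```python
from fractions import Fraction as F

def sequence_probability(colors, red=6, green=4, added=2, same_color=True):
    """Exact probability of drawing the given color sequence, e.g. "RRG".

    After each draw the ball is replaced and `added` new balls go into the urn:
    of the drawn color if same_color is True (Polya's urn), otherwise of the
    other color.
    """
    prob = F(1)
    for c in colors:
        total = red + green
        # Conditional probability of this color given the urn's current contents.
        prob *= F(red if c == "R" else green, total)
        # Red gains balls if red was drawn under the same-color rule,
        # or if green was drawn under the other-color rule.
        if (c == "R") == same_color:
            red += added
        else:
            green += added
    return prob

for same in (True, False):
    abc = [sequence_probability(s, same_color=same) for s in ("RRG", "RGR", "GRR")]
    # P(R2) by conditioning on the color of the first ball.
    p_r2 = sum(sequence_probability(first + "R", same_color=same) for first in "RG")
    p_r1_given_r2 = sequence_probability("RR", same_color=same) / p_r2
    print(same, [str(x) for x in abc], p_r2, p_r1_given_r2)
```

With `same_color=True` the three sequence probabilities coincide, as exchangeability requires. With `same_color=False` the third differs from the first two.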

### Diagnostic Testing

Suppose that we have a random experiment with an event \(A\) of interest. When we run the experiment, of course, event \(A\) will either occur or not occur. However, suppose that we are not able to observe the occurrence or non-occurrence of \(A\) directly. Instead we have a diagnostic test designed to indicate the occurrence of event \(A\); thus the test can be either positive for \(A\) or negative for \(A\). The test also has an element of randomness, and in particular can be in error. Here are some typical examples of the type of situation we have in mind:

- The event is that a person has a certain disease and the test is a blood test for the disease.
- The event is that a woman is pregnant and the test is a home pregnancy test.
- The event is that a person is lying and the test is a lie-detector test.
- The event is that a device is defective and the test consists of a sensor reading.
- The event is that a missile is in a certain region of airspace and the test consists of radar signals.
- The event is that a person has committed a crime, and the test is a jury trial with evidence presented for and against the event.

Let \(T\) be the event that the test is positive for the occurrence of \(A\). The conditional probability \(\P(T \mid A)\) is called the sensitivity of the test. The complementary probability \[\P(T^c \mid A) = 1 - \P(T \mid A)\] is the false negative probability. The conditional probability \(\P(T^c \mid A^c)\) is called the specificity of the test. The complementary probability \[\P(T \mid A^c) = 1 - \P(T^c \mid A^c)\] is the false positive probability. In many cases, the sensitivity and specificity of the test are known, as a result of the development of the test. However, the *user* of the test is interested in the opposite conditional probabilities, namely \(\P(A \mid T)\), the probability of the event of interest, given a positive test, and \(\P(A^c \mid T^c)\), the probability of the complementary event, given a negative test. Of course, if we know \( \P(A \mid T) \) then we also have \( \P(A^c \mid T) = 1 - \P(A \mid T) \), the probability of the complementary event given a positive test. Similarly, if we know \( \P(A^c \mid T^c) \) then we also have \( \P(A \mid T^c) \), the probability of the event given a negative test. Computing the probabilities of interest is simply a special case of Bayes' theorem.

The probability that the event occurs, given a positive test is \[\P(A \mid T) = \frac{\P(A) \P(T \mid A)}{\P(A) \P(T \mid A) + \P(A^c) \P(T \mid A^c)}\] The probability that the event does not occur, given a negative test is \[\P(A^c \mid T^c) = \frac{\P(A^c) \P(T^c \mid A^c)}{\P(A) \P(T^c \mid A) + \P(A^c) \P(T^c \mid A^c)}\]
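
The two formulas translate directly into code. Below is a minimal Python sketch. The function names are hypothetical, and the numerical values are those of the home pregnancy test exercise in this section (prior 0.5, sensitivity 0.95, specificity 0.90).

```python
def posterior_positive(prior, sensitivity, specificity):
    """P(A | T): probability of the event given a positive test."""
    true_pos = prior * sensitivity                 # P(A) P(T | A)
    false_pos = (1 - prior) * (1 - specificity)    # P(A^c) P(T | A^c)
    return true_pos / (true_pos + false_pos)

def posterior_negative(prior, sensitivity, specificity):
    """P(A^c | T^c): probability of the complementary event given a negative test."""
    true_neg = (1 - prior) * specificity           # P(A^c) P(T^c | A^c)
    false_neg = prior * (1 - sensitivity)          # P(A) P(T^c | A)
    return true_neg / (true_neg + false_neg)

p_pos = posterior_positive(0.5, 0.95, 0.90)
p_neg = posterior_negative(0.5, 0.95, 0.90)
print(round(p_pos, 3))       # 0.905
print(round(1 - p_neg, 3))   # 0.053, which is P(A | T^c)
```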

There is often a trade-off between sensitivity and specificity. An attempt to make a test more sensitive may result in the test being less specific, and an attempt to make a test more specific may result in the test being less sensitive. As an extreme example, consider the worthless test that always returns positive, no matter what the evidence. Then \( T = S \) so the test has sensitivity 1, but specificity 0. At the opposite extreme is the worthless test that always returns negative, no matter what the evidence. Then \( T = \emptyset \) so the test has specificity 1 but sensitivity 0. In between these extremes are helpful tests that are actually based on evidence of some sort.

Suppose that the sensitivity \( a = \P(T \mid A) \in (0, 1)\) and the specificity \( b = \P(T^c \mid A^c) \in (0, 1) \) are fixed. Let \( p = \P(A) \) denote the *prior* probability of the event \( A \) and \( P = \P(A \mid T) \) the *posterior* probability of \( A \) given a positive test.

\( P \) as a function of \( p \) is given by \[ P = \frac{a p}{(a + b - 1) p + (1 - b)}, \quad p \in [0, 1] \]

- \( P \) increases continuously from 0 to 1 as \( p \) increases from 0 to 1.
- \( P \) is concave downward if \( a + b \gt 1 \). In this case \( A \) and \( T \) are positively correlated.
- \( P \) is concave upward if \( a + b \lt 1 \). In this case \( A \) and \( T \) are negatively correlated.
- \( P = p \) if \( a + b = 1 \). In this case, \( A \) and \( T \) are uncorrelated (independent).

## Proof

The formula for \( P \) in terms of \( p \) follows from the formula for \( \P(A \mid T) \) above and algebra. For part (a), note that \[ \frac{dP}{dp} = \frac{a (1 - b)}{[(a + b - 1) p + (1 - b)]^2} \gt 0\] For parts (b)-(d), note that \[ \frac{d^2 P}{dp^2} = \frac{-2 a (1 - b)(a + b - 1)}{[(a + b - 1) p + (1 - b)]^3} \] If \( a + b \gt 1 \), \( d^2P/dp^2 \lt 0 \) so \( P \) is concave downward on \( [0, 1] \) and hence, since \( P = 0 \) at \( p = 0 \) and \( P = 1 \) at \( p = 1 \), \( P \gt p \) for \( 0 \lt p \lt 1 \). If \( a + b \lt 1 \), \( d^2P/dp^2 \gt 0 \) so \( P \) is concave upward on \( [0, 1] \) and hence \( P \lt p \) for \( 0 \lt p \lt 1 \). Trivially if \( a + b = 1 \), \( P = p \) for \( 0 \le p \le 1 \).
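
For completeness, the algebra behind the formula for \( P \) can be sketched as follows. Substitute \( \P(A) = p \), \( \P(A^c) = 1 - p \), \( \P(T \mid A) = a \), and \( \P(T \mid A^c) = 1 - b \) into the expression for \( \P(A \mid T) \), then expand the denominator:

\[ P = \frac{p a}{p a + (1 - p)(1 - b)} = \frac{a p}{a p + (1 - b) - p + b p} = \frac{a p}{(a + b - 1) p + (1 - b)} \]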

Of course, part (b) is the typical case, where the test is useful. In fact, we would hope that the sensitivity and specificity are close to 1. In case (c), the test is worse than useless since it gives the wrong information about \( A \). But this case could be turned into a useful test by simply reversing the roles of *positive* and *negative*. In case (d), the test is worthless and gives no information about \( A \). It's interesting that the broad classification above depends only on the *sum* of the sensitivity and specificity.

Suppose that a diagnostic test has sensitivity 0.99 and specificity 0.95. Find \( \P(A \mid T) \) for each of the following values of \( \P(A) \):

- 0.001
- 0.01
- 0.2
- 0.5
- 0.7
- 0.9

## Answer

- 0.0194
- 0.1667
- 0.8319
- 0.9519
- 0.9788
- 0.9944

With sensitivity 0.99 and specificity 0.95, the test in the last exercise superficially looks good. However, the small value of \(\P(A \mid T)\) for small values of \(\P(A)\) is striking (but inevitable given the properties above). The moral, of course, is that \(\P(A \mid T)\) depends critically on \(\P(A)\), not just on the sensitivity and specificity of the test. Moreover, the correct comparison is \(\P(A \mid T)\) with \(\P(A)\), as in the exercise, not \(\P(A \mid T)\) with \(\P(T \mid A)\). Beware of the fallacy of the transposed conditional! In terms of the correct comparison, the test does indeed work well; \(\P(A \mid T)\) is significantly larger than \(\P(A)\) in all cases.
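
The comparison of \(\P(A \mid T)\) with \(\P(A)\) can be tabulated from the closed form for the posterior given earlier. This is a short Python sketch; the function name `posterior` is chosen here for illustration, with \(a\) the sensitivity and \(b\) the specificity.

```python
def posterior(p, a, b):
    """Posterior P(A | T) from prior p, sensitivity a, and specificity b."""
    return a * p / ((a + b - 1) * p + (1 - b))

a, b = 0.99, 0.95
priors = [0.001, 0.01, 0.2, 0.5, 0.7, 0.9]
table = [(p, posterior(p, a, b)) for p in priors]

for p, post in table:
    # Prior, posterior, and the factor by which the positive test raises the probability.
    print(f"{p:<6} {post:.4f} {post / p:8.2f}")
```

The second column reproduces the answers to the exercise. The third column shows that a positive test multiplies a small prior by a large factor, even though the resulting posterior is still small.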

A woman initially believes that there is an even chance that she is or is not pregnant. She takes a home pregnancy test with sensitivity 0.95 and specificity 0.90 (which are reasonable values for a home pregnancy test). Find the updated probability that the woman is pregnant in each of the following cases.

- The test is positive.
- The test is negative.

## Answer

- 0.905
- 0.053

Suppose that 70% of defendants brought to trial for a certain type of crime are guilty. Moreover, historical data show that juries convict guilty persons 80% of the time and convict innocent persons 10% of the time. Suppose that a person is tried for a crime of this type. Find the updated probability that the person is guilty in each of the following cases:

- The person is convicted.
- The person is acquitted.

## Answer

- 0.949
- 0.341

The *Check Engine* light on your car has turned on. Without the information from the light, you believe that there is a 10% chance that your car has a serious engine problem. You learn that if the car has such a problem, the light will come on with probability 0.99, but if the car does not have a serious problem, the light will still come on, under circumstances similar to yours, with probability 0.3. Find the updated probability that you have an engine problem.

## Answer

0.268

The standard test for HIV is the ELISA (Enzyme-Linked Immunosorbent Assay) test. It has sensitivity and specificity of 0.999. Suppose that a person is selected at random from a population in which 1% are infected with HIV, and given the ELISA test. Find the probability that the person has HIV in each of the following cases:

- The test is positive.
- The test is negative.

## Answer

- 0.9098
- 0.00001

The ELISA test for HIV is a very good one. Let's look at another test, this one for prostate cancer, that's rather bad.

The PSA test for prostate cancer is based on a blood marker known as the Prostate Specific Antigen. An elevated level of PSA is evidence for prostate cancer. To have a diagnostic test, in the sense that we are discussing here, we must decide on a definite level of PSA, above which we declare the test to be positive. A positive test would typically lead to other more invasive tests (such as biopsy) which, of course, carry risks and cost. The PSA test with cutoff 2.6 ng/ml has sensitivity 0.40 and specificity 0.81. The overall incidence of prostate cancer among males is 156 per 100000. Suppose that a man, with no particular risk factors, has the PSA test. Find the probability that the man has prostate cancer in each of the following cases:

- The test is positive.
- The test is negative.

## Answer

- 0.00328
- 0.00116

Diagnostic testing is closely related to a general statistical procedure known as hypothesis testing. A separate chapter on hypothesis testing explores this procedure in detail.

### Data Analysis Exercises

For the M&M data set, find the empirical probability that a bag has at least 10 reds, given that the weight of the bag is at least 48 grams.

## Answer

\(\frac{10}{23}\)

Consider the Cicada data.

- Find the empirical probability that a cicada weighs at least 0.25 grams given that the cicada is male.
- Find the empirical probability that a cicada weighs at least 0.25 grams given that the cicada is the tredecula species.

## Answer

- \(\frac{2}{45}\)
- \(\frac{7}{44}\)