# 11.4: The Negative Binomial Distribution

- Page ID
- 10236

\( \newcommand{\vecs}[1]{\overset { \scriptstyle \rightharpoonup} {\mathbf{#1}} } \)

\( \newcommand{\vecd}[1]{\overset{-\!-\!\rightharpoonup}{\vphantom{a}\smash {#1}}} \)

\( \newcommand{\id}{\mathrm{id}}\) \( \newcommand{\Span}{\mathrm{span}}\)

( \newcommand{\kernel}{\mathrm{null}\,}\) \( \newcommand{\range}{\mathrm{range}\,}\)

\( \newcommand{\RealPart}{\mathrm{Re}}\) \( \newcommand{\ImaginaryPart}{\mathrm{Im}}\)

\( \newcommand{\Argument}{\mathrm{Arg}}\) \( \newcommand{\norm}[1]{\| #1 \|}\)

\( \newcommand{\inner}[2]{\langle #1, #2 \rangle}\)

\( \newcommand{\Span}{\mathrm{span}}\)

\( \newcommand{\id}{\mathrm{id}}\)

\( \newcommand{\Span}{\mathrm{span}}\)

\( \newcommand{\kernel}{\mathrm{null}\,}\)

\( \newcommand{\range}{\mathrm{range}\,}\)

\( \newcommand{\RealPart}{\mathrm{Re}}\)

\( \newcommand{\ImaginaryPart}{\mathrm{Im}}\)

\( \newcommand{\Argument}{\mathrm{Arg}}\)

\( \newcommand{\norm}[1]{\| #1 \|}\)

\( \newcommand{\inner}[2]{\langle #1, #2 \rangle}\)

\( \newcommand{\Span}{\mathrm{span}}\) \( \newcommand{\AA}{\unicode[.8,0]{x212B}}\)

\( \newcommand{\vectorA}[1]{\vec{#1}} % arrow\)

\( \newcommand{\vectorAt}[1]{\vec{\text{#1}}} % arrow\)

\( \newcommand{\vectorB}[1]{\overset { \scriptstyle \rightharpoonup} {\mathbf{#1}} } \)

\( \newcommand{\vectorC}[1]{\textbf{#1}} \)

\( \newcommand{\vectorD}[1]{\overrightarrow{#1}} \)

\( \newcommand{\vectorDt}[1]{\overrightarrow{\text{#1}}} \)

\( \newcommand{\vectE}[1]{\overset{-\!-\!\rightharpoonup}{\vphantom{a}\smash{\mathbf {#1}}}} \)

\( \newcommand{\vecs}[1]{\overset { \scriptstyle \rightharpoonup} {\mathbf{#1}} } \)

\( \newcommand{\vecd}[1]{\overset{-\!-\!\rightharpoonup}{\vphantom{a}\smash {#1}}} \)

\(\newcommand{\avec}{\mathbf a}\) \(\newcommand{\bvec}{\mathbf b}\) \(\newcommand{\cvec}{\mathbf c}\) \(\newcommand{\dvec}{\mathbf d}\) \(\newcommand{\dtil}{\widetilde{\mathbf d}}\) \(\newcommand{\evec}{\mathbf e}\) \(\newcommand{\fvec}{\mathbf f}\) \(\newcommand{\nvec}{\mathbf n}\) \(\newcommand{\pvec}{\mathbf p}\) \(\newcommand{\qvec}{\mathbf q}\) \(\newcommand{\svec}{\mathbf s}\) \(\newcommand{\tvec}{\mathbf t}\) \(\newcommand{\uvec}{\mathbf u}\) \(\newcommand{\vvec}{\mathbf v}\) \(\newcommand{\wvec}{\mathbf w}\) \(\newcommand{\xvec}{\mathbf x}\) \(\newcommand{\yvec}{\mathbf y}\) \(\newcommand{\zvec}{\mathbf z}\) \(\newcommand{\rvec}{\mathbf r}\) \(\newcommand{\mvec}{\mathbf m}\) \(\newcommand{\zerovec}{\mathbf 0}\) \(\newcommand{\onevec}{\mathbf 1}\) \(\newcommand{\real}{\mathbb R}\) \(\newcommand{\twovec}[2]{\left[\begin{array}{r}#1 \\ #2 \end{array}\right]}\) \(\newcommand{\ctwovec}[2]{\left[\begin{array}{c}#1 \\ #2 \end{array}\right]}\) \(\newcommand{\threevec}[3]{\left[\begin{array}{r}#1 \\ #2 \\ #3 \end{array}\right]}\) \(\newcommand{\cthreevec}[3]{\left[\begin{array}{c}#1 \\ #2 \\ #3 \end{array}\right]}\) \(\newcommand{\fourvec}[4]{\left[\begin{array}{r}#1 \\ #2 \\ #3 \\ #4 \end{array}\right]}\) \(\newcommand{\cfourvec}[4]{\left[\begin{array}{c}#1 \\ #2 \\ #3 \\ #4 \end{array}\right]}\) \(\newcommand{\fivevec}[5]{\left[\begin{array}{r}#1 \\ #2 \\ #3 \\ #4 \\ #5 \\ \end{array}\right]}\) \(\newcommand{\cfivevec}[5]{\left[\begin{array}{c}#1 \\ #2 \\ #3 \\ #4 \\ #5 \\ \end{array}\right]}\) \(\newcommand{\mattwo}[4]{\left[\begin{array}{rr}#1 \amp #2 \\ #3 \amp #4 \\ \end{array}\right]}\) \(\newcommand{\laspan}[1]{\text{Span}\{#1\}}\) \(\newcommand{\bcal}{\cal B}\) \(\newcommand{\ccal}{\cal C}\) \(\newcommand{\scal}{\cal S}\) \(\newcommand{\wcal}{\cal W}\) \(\newcommand{\ecal}{\cal E}\) \(\newcommand{\coords}[2]{\left\{#1\right\}_{#2}}\) \(\newcommand{\gray}[1]{\color{gray}{#1}}\) \(\newcommand{\lgray}[1]{\color{lightgray}{#1}}\) \(\newcommand{\rank}{\operatorname{rank}}\) \(\newcommand{\row}{\text{Row}}\) \(\newcommand{\col}{\text{Col}}\) \(\renewcommand{\row}{\text{Row}}\) \(\newcommand{\nul}{\text{Nul}}\) \(\newcommand{\var}{\text{Var}}\) \(\newcommand{\corr}{\text{corr}}\) \(\newcommand{\len}[1]{\left|#1\right|}\) \(\newcommand{\bbar}{\overline{\bvec}}\) \(\newcommand{\bhat}{\widehat{\bvec}}\) \(\newcommand{\bperp}{\bvec^\perp}\) \(\newcommand{\xhat}{\widehat{\xvec}}\) \(\newcommand{\vhat}{\widehat{\vvec}}\) \(\newcommand{\uhat}{\widehat{\uvec}}\) \(\newcommand{\what}{\widehat{\wvec}}\) \(\newcommand{\Sighat}{\widehat{\Sigma}}\) \(\newcommand{\lt}{<}\) \(\newcommand{\gt}{>}\) \(\newcommand{\amp}{&}\) \(\definecolor{fillinmathshade}{gray}{0.9}\)## Basic Theory

Suppose again that our random experiment is to perform a sequence of Bernoulli trials \(\bs{X} = (X_1, X_2, \ldots)\) with success parameter \(p \in (0, 1]\). Recall that the number of successes in the first \(n\) trials \[ Y_n = \sum_{i=1}^n X_i \] has the binomial distribution with parameters \(n\) and \(p\). In this section we will study the random variable that gives the trial number of the \(k\)th success: \[ V_k = \min\left\{n \in \N_+: Y_n = k\right\} \] Note that \(V_1\) is the number of trials needed to get the first success, which we now know has the geometric distribution on \(\N_+\) with parameter \(p\).

### The Probability Density Function

The probability distribution of \(V_k\) is given by \[ \P(V_k = n) = \binom{n - 1}{k - 1} p^k (1 - p)^{n - k}, \quad n \in \{k, k + 1, k + 2, \ldots\}\]

## Proof

Note that \(V_k = n\) if and only if \(X_n = 1\) and \(Y_{n-1} = k - 1\). Hence, from independence and the binomial distribution, \[ \P(V_k = n) = \P(Y_{n-1} = k - 1) \P(X_n = 1) = \binom{n - 1}{k - 1} p^{k - 1}(1 - p)^{(n - 1) - (k - 1)} \, p = \binom{n - 1}{k - 1} p^k (1 - p)^{n - k} \]

The distribution defined by the density function in (1) is known as the negative binomial distribution; it has two parameters, the stopping parameter \(k\) and the success probability \(p\).

In the negative binomial experiment, vary \(k\) and \(p\) with the scroll bars and note the shape of the density function. For selected values of \(k\) and \(p\), run the experiment 1000 times and compare the relative frequency function to the probability density function.

The binomial and negative binomial sequences are inverse to each other in a certain sense.

For \( n \in \N_+ \) and \( k \in \{0, 1, \ldots, n\} \),

- \( Y_n \ge k \iff V_k \le n \) and hence \( \P(Y_n \ge k) = \P(V_k \le n) \)
- \(k \P\left(Y_n = k\right) = n \P\left(V_k = n\right)\)

## Proof

- The events \( \left\{Y_n \ge k\right\} \) and \( \left\{V_k \le n\right\} \) both mean that there are at least \( k \) successes in the first \( n \) Bernoulli trials.
- From the formulas for the binomial and negative binomial PDFs, \( k \P(Y_n = k) \) and \( n \P(V_k = n) \) both simplify to \( \frac{n!}{(k - 1)! (n - k)!} p^k (1 - p)^{n - k} \).

In particular, it follows from part (a) that any event that can be expressed in terms of the negative binomial variables can also be expressed in terms of the binomial variables.

The negative binomial distribution is unimodal. Let \(t = 1 + \frac{k - 1}{p}\). Then

- \(\P(V_k = n) \gt \P(V_k = n - 1)\) if and only if \(n \lt t\).
- The probability density function at first increases and then decreases, reaching its maximum value at \(\lfloor t \rfloor\).
- There is a single mode at \( \lfloor t \rfloor\) if \(t\) is not an integers, and two consecutive modes at \(t - 1\) and \(t\) if \(t\) is an integer.

### Times Between Successes

Next we will define the random variables that give the number of trials between successive successes. Let \(U_1 = V_1\) and \(U_k = V_k - V_{k-1}\) for \(k \in \{2, 3, \ldots\}\)

\(\bs{U} = (U_1, U_2, \ldots)\) is a sequence of independent random variables, each having the geometric distribution on \(\N_+\) with parameter \(p\). Moreover, \[ V_k = \sum_{i=1}^k U_i \]

In statistical terms, \(\bs{U}\) corresponds to sampling from the geometric distribution with parameter \(p\), so that for each \(k\), \((U_1, U_2, \ldots, U_k)\) is a random sample of size \(k\) from this distribution. The sample mean corresponding to this sample is \(V_k / k\); this random variable gives the average number of trials between the first \(k\) successes. In probability terms, the sequence of negative binomial variables \(\bs{V}\) is the partial sum process corresponding to the sequence \(\bs{U}\). Partial sum processes are studied in more generality in the chapter on Random Samples.

The random process \(\bs{V} = (V_1, V_2, \ldots)\) has stationary, independent increments:

- If \(j \lt k\) then \(V_k - V_j\) has the same distribution as \(V_{k - j}\), namely negative binomial with parameters \(k - j\) and \(p\).
- If \(k_1 \lt k_2 \lt k_3 \lt \cdots\) then \((V_{k_1}, V_{k_2} - V_{k_1}, V_{k_3} - V_{k_2}, \ldots)\) is a sequence of independent random variables.

Actually, any partial sum process corresponding to an independent, identically distributed sequence will have stationary, independent increments.

### Basic Properties

The mean, variance and probability generating function of \(V_k\) can be computed in several ways. The method using the representation as a sum of independent, identically distributed geometrically distributed variables is the easiest.

\(V_k\) has probability generating function \(P\) given by \[ P(t) = \left( \frac{p \, t}{1 - (1 - p) t} \right)^k, \quad \left|t\right| \lt \frac{1}{1 - p} \]

## Proof

Recall that the probability generating function of a sum of independent variables is the product of the probability generating functions of the variables. Recall also, the probability generating function of the geometric distribution with parameter \(p\) is \(t \mapsto p \, t \big/ \left[1 - (1 - p) t\right]\). Thus, the result follows immediately from the sum representation above. A derivation can also be given directly from the probability density function.

The mean and variance of \( V_k \) are

- \(\E(V_k) = k \frac{1}{p}\).
- \(\var(V_k) = k \frac{1 - p}{p^2}\)

## Proof

The geometric distribution with parameter \(p\) has mean \(1 / p\) and variance \((1 - p) \big/ p^2\), so the results follows immediately from the sum representation above. Recall that the mean of a sum is the sum of the means, and the variance of the sum of *independent* variables is the sum of the variances. These results can also be proven directly from the probability density function or from the probability generating function.

In the negative binomial experiment, vary \(k\) and \(p\) with the scroll bars and note the location and size of the mean/standard deviation bar. For selected values of the parameters, run the experiment 1000 times and compare the sample mean and standard deviation to the distribution mean and standard deviation.

Suppose that \(V\) and \(W\) are independent random variables for an experiment, and that \(V\) has the negative binomial distribution with parameters \(j\) and \(p\), and \(W\) has the negative binomial distribution with parameters \(k\) and \(p\). Then \(V + W\) has the negative binomial distribution with parameters \(k + j\) and \(p\).

## Proof

Once again, the simplest proof is based on the representation as a sum of independent geometric variables. In the context of the sum representation above, we can take \(V = V_j\) and \(W = V_{k+j} - V_j\), so that \(V + W = V_{k + j}\). Another simple proof uses probability generating functions. Recall again that the PGF of the sum of independent variables is the product of the PGFs. Finally, a difficult proof can be constructed using probability density functions. Recall that the PDF of a sum of independent variables is the convolution of the PDFs.

### Normal Approximation

In the negative binomial experiment, start with various values of \(p\) and \(k = 1\). Successively increase \(k\) by 1, noting the shape of the probability density function each time.

Even though you are limited to \(k = 5\) in the app, you can still see the characteristic bell shape. This is a consequence of the central limit theorem because the negative binomial variable can be written as a sum of \(k\) independent, identically distributed (geometric) random variables.

The standard score of \(V_k\) is \[ Z_k = \frac{p \, V_k - k}{\sqrt{k (1 - p)}} \] The distribution of \(Z_k\) converges to the standard normal distribution as \(k \to \infty\).

From a practical point of view, this result means that if \(k\) is large

, the distribution of \(V_k\) is approximately normal with mean \(k \frac{1}{p}\) and variance \(k \frac{1 - p}{p^2}\). Just how large \(k\) needs to be for the approximation to work well depends on \(p\). Also, when using the normal approximation, we should remember to use the continuity correction, since the negative binomial is a discrete distribution.

### Relation to Order Statistics

Suppose that \(n \in \N_+\) and \(k \in \{1, 2, \ldots, n\}\), and let \(L = \left\{ (n_1, n_2, \ldots, n_k) \in \{1, 2, \ldots, n\}^k: n_1 \lt n_2 \lt \cdots \lt n_k \right\}\). Then \[ \P\left(V_1 = n_1, V_2 = n_2, \ldots, V_k = n_k \mid Y_n = k\right) = \frac{1}{\binom{n}{k}}, \quad (n_1, n_2, \ldots, n_k) \in L \]

## Proof

\[ \P\left(V_1 = n_1, V_2 = n_2, \ldots, V_k = n_k \mid Y_n = k\right) = \frac{\P\left(V_1 = n_1, V_2 = n_2, \ldots, V_k = n_k, Y_n = k\right)}{\P(Y_n = k)} = \frac{p^k (1 - p)^{n - k}}{\binom{n}{k} p^k (1 - p)^{n - k}} = \frac{1}{\binom{n}{k}}\] Note that the event in the numerator of the first fraction means that in the first \( n \) trials, successes occurred at trials \( n_1, n_2, \ldots, n_k \) and failures occurred at all other trials.

Thus, given exactly \(k\) successes in the first \(n\) trials, the vector of success trial numbers is uniformly distributed on the set of possibilities \(L\), regardless of the value of the success parameter \( p \). Equivalently, the vector of success trial numbers is distributed as the vector of order statistics corresponding to a sample of size \(k\) chosen at random and without replacement from \(\{1, 2, \ldots, n\}\).

Suppose that \(n \in \N_+\), \(k \in \{1, 2, \ldots, n\}\), and \(j \in \{1, 2, \ldots, k\}\). Then \[ \P\left(V_j = m \mid Y_n = k\right) = \frac{\binom{m - 1}{j - 1} \binom{n - m}{k - j}}{\binom{n}{k}}, \quad m \in \{j, j + 1, \ldots, n + k - j\} \]

## Proof

This follows immediately from the previous result and a theorem in the section on order statistics. However, a direct proof is also easy. Note that the event \( \{V_j = m, Y_n = k\} \) means that there were \( j - 1 \) successes in the first \( m - 1 \) trials, a success on trial \( m \) and \( k - j \) success in trials \( m + 1 \) to \( n \). Hence using the binomial distribution and independence, \begin{align} \P\left(V_j = m \mid Y_n = k\right) & = \frac{\P(V_j = m, Y_n = k)}{\P(Y_n = k)} = \frac{\binom{m - 1}{j - 1} p^{j - 1}(1 - p)^{(m - 1) - (j - 1)} p \binom{n - m}{k - j} p^{k - j}(1 - p)^{(n - m) - (k - j)}}{\binom{n}{k} p^k (1 - p)^{n - k}} \\ & = \frac{\binom{m - 1}{j - 1} \binom{n - m}{k - j} p^k(1 - p)^{n - k}}{\binom{n}{k} p^k (1 - p)^{n - k}} = \frac{\binom{m - 1}{j - 1} \binom{n - m}{k - j}}{\binom{n}{k}}, \end{align}

Thus, given exactly \(k\) successes in the first \(n\) trials, the trial number of the \(j\)th success has the same distribution as the \(j\)th order statistic when a sample of size \(k\) is selected at random and without replacement from the population \(\{1, 2, \ldots, n\}\). Again, this result does not depend on the value of the success parameter \( p \). The following theorem gives the mean and variance of the conditional distribution.

Suppose again that \(n \in \N_+\), \(k \in \{1, 2, \ldots, n\}\), and \(j \in \{1, 2, \ldots, k\}\). Then

- \( \E\left(V_j \mid Y_n = k\right) = j \frac{n + 1}{k + 1} \)
- \( \var\left(V_j \mid Y_n = k\right) = j (k - j + 1) \frac{(n + 1)(n - k)}{(k + 1)^2 (k + 2)}\)

## Proof

These moment results follow immediately from the previous theorem and a theorem in the section on order statistics. However, there is also a nice heuristic argument for (a) using indicator variables. Given \( Y_n = k \), the \( k \) successes divide the set of indices where the failures occur into \( k + 1 \) disjoint sets (some may be empty, of course, if there are adjacent successes).

Let \( I_i \) take the value 1 if the \( i \)th failure occurs before the \( j \)th success, and 0 otherwise, for \( i \in \{1, 2, \ldots, n - k\} \). Then given \( Y_n = k \), \[ V_j = j + \sum_{i=1}^{n - k} I_i \] Given \( Y_n = k \), we know that the \( k \) successes and \( n - k \) failures are randomly placed in \( \{1, 2, \ldots, n\} \), with each possible configuration having the same probability. Thus, \[ \E(I_i \mid Y_n = k) = \P(I_i = 1 \mid Y_n = k) = \frac{j}{k + 1}, \quad i \in \{1, 2, \ldots n - k\} \] Hence \[ \E(V_j \mid Y_n = k) = j + (n - k) \frac{j}{k + 1} = j \frac{n + 1}{k + 1} \]

## Examples and Applications

### Coins, Dice and Other Gadgets

A standard, fair die is thrown until 3 aces occur. Let \(V\) denote the number of throws. Find each of the following:

- The probability density function of \(V\).
- The mean of \(V\).
- The variance of \(V\).
- The probability that at least 20 throws will needed.

## Answer

- \(\P(V = n) = \binom{n - 1}{2} \left(\frac{1}{6}\right)^3 \left(\frac{5}{6}\right)^{n-3}, \quad n \in \{3, 4, \ldots\}\)
- \(\E(V) = 18\)
- \(\var(V) = 90\)
- \(\P(V \ge 20) = 0.3643\)

A coin is tossed repeatedly. The 10th head occurs on the 25th toss. Find each of the following:

- The probability density function of the trial number of the 5th head.
- The mean of the distribution in (a).
- The variance of the distribution in (a).

## Answer

- \(\P(V_5 = m \mid V_{10} = 25) = \frac{\binom{m - 1}{4} \binom{24 - m}{4}}{\binom{24}{9}}, \quad m \in \{5, 6, \ldots, 2\}\)
- \(\E(V_5 \mid V_{10} = 25) = \frac{25}{2}\)
- \(\var(V_5 \mid V_{10} = 25) = \frac{375}{44}\)

A certain type of missile has failure probability 0.02. Let \(N\) denote the launch number of the fourth failure. Find each of the following:

- The probability density function of \(N\).
- The mean of \(N\).
- The variance of \(N\).
- The probability that there will be at least 4 failures in the first 200 launches.

## Answer

- \(\P(N = n) = \frac{n - 1}{2} \left(\frac{1}{50}\right)^4 \left(\frac{49}{50}\right)^{n-4}, \quad n \in \{4, 5, \ldots\}\)
- \(\E(N) = 200\)
- \(\var(N) = 9800\)
- \(\P(N \le 200) = 0.5685\)

In the negative binomial experiment, set \(p = 0.5\) and \(k = 5\). Run the experiment 1000 times. Compute and compare each of the following:

- \(\P(8 \le V_5 \le 15)\)
- The relative frequency of the event \(\{8 \le V_5 \le 15\}\) in the simulation
- The normal approximation to \(\P(8 \le V_5 \le 15)\)

## Answer

- \(\P(8 \le V_5 \le 15) = 0.7142\)
- \(\P(8 \le V_5 \le 15) \approx 0.7445\)

A coin is tossed until the 50th head occurs.

- Assuming that the coin is fair, find the normal approximation of the probability that the coin is tossed at least 125 times.
- Suppose that you perform this experiment, and 125 tosses are required. Do you believe that the coin is fair?

## Answer

- 0.0072
- No.

### The Banach Match Problem

Suppose that an absent-minded professor (is there any other kind?) has \(m\) matches in his right pocket and \(m\) matches in his left pocket. When he needs a match to light his pipe, he is equally likely to choose a match from either pocket. We want to compute the probability density function of the random variable \(W\) that gives the number of matches remaining when the professor first discovers that one of the pockets is empty. This is known as the Banach match problem, named for the mathematician Stefan Banach, who evidently behaved in the manner described.

We can recast the problem in terms of the negative binomial distribution. Clearly the match choices form a sequence of Bernoulli trials with parameter \(p = \frac{1}{2}\). Specifically, we can consider a match from the right pocket as a win for player \(R\), and a match from the left pocket as a win for player \(L\). In a hypothetical infinite sequence of trials, let \(U\) denote the number of trials necessary for \(R\) to win \(m + 1\) trials, and \(V\) the number of trials necessary for \(L\) to win \(m + 1\) trials. Note that \(U\) and \(V\) each have the negative binomial distribution with parameters \(m + 1\) and \(p\).

For \(k \in \{0, 1, \ldots, m\}\),

- \(L\) has \(m - k\) wins at the moment when \(R\) wins \(m + 1\) games if and only if \(U = 2 m - k + 1\).
- \(\{U = 2 m - k + 1\}\) is equivalent to the event that the professor first discovers that the right pocket is empty and that the left pocket has \(k\) matches
- \(\P(U = 2 m - k + 1) = \binom{2 m - k}{m} \left(\frac{1}{2}\right)^{2 m - k + 1}\)

For \(k \in \{0, 1, \ldots, m\}\),

- \(R\) has \(m - k\) wins at the moment when \(L\) wins \(m + 1\) games if and only if \(V = 2 m - k + 1\).
- \(\{V = 2 m - k + 1\}\) is equivalent to the event that the professor first discovers that the right pocket is empty and that the left pocket has \(k\) matches
- \(\P(V = 2 m - k + 1) = \binom{2 \, m - k}{m} \left(\frac{1}{2}\right)^{2 m - k + 1}\).

\(W\) has probability density function \[ \P(W = k) = \binom{2 m - k}{m} \left(\frac{1}{2}\right)^{2 m - k}, \quad k \in \{0, 1, \ldots, m\} \]

## Proof

This result follows from the previous two exercises, since \(\P(W = k) = \P(U = 2 m - k + 1) + \P(V = 2 m - k + 1)\).

We can also solve the non-symmetric Banach match problem, using the same methods as above. Thus, suppose that the professor reaches for a match in his right pocket with probability \(p\) and in his left pocket with probability \(1 - p\), where \(0 \lt p \lt 1\). The essential change in the analysis is that \(U\) has the negative binomial distribution with parameters \(m + 1\) and \(p\), while \(V\) has the negative binomial distribution with parameters \(m + 1\) and \(1 - p\).

For the Banach match problem with parameter \(p\), \(W\) has probability density function \[ \P(W = k) = \binom{2 m - k}{m} \left[ p^{m+1} (1 - p)^{m-k} + (1 - p)^{m+1} p^{m-k} \right], \quad k \in \{0, 1, \ldots, m\} \]

### The Problem of Points

Suppose that two teams (or individuals) \(A\) and \(B\) play a sequence of Bernoulli trials (which we will also call points), where \(p \in (0, 1)\) is the probability that player \(A\) wins a point. For nonnegative integers \(n\) and \(m\), let \(A_{n,m}(p)\) denote the probability that \(A\) wins \(n\) points before \(B\) wins \(m\) points. Computing \(A_{n,m}(p)\) is an historically famous problem, known as the problem of points, that was solved by Pierre de Fermat and by Blaise Pascal.

Comment on the validity of the Bernoulli trial assumptions (independence of trials and constant probability of success) for games of sport that have a *skill* component as well as a *random* component.

There is an easy solution to the problem of points using the binomial distribution; this was essentially Pascal's solution. There is also an easy solution to the problem of points using the negative binomial distribution In a sense, this has to be the case, given the equivalence between the binomial and negative binomial processes in (3). First, let us pretend that the trials go on forever, regardless of the outcomes. Let \(Y_{n+m-1}\) denote the number of wins by player \(A\) in the first \( n + m - 1 \) points, and let \(V_n\) denote the number of trials needed for \(A\) to win \(n\) points. By definition, \( Y_{n+m-1} \) has the binomial distribution with parameters \(n + m - 1\) and \(p\), and \( V_n \) has the negative binomial distribution with parameters \( n \) and \( p \).

Player \(A\) wins \(n\) points before \(B\) wins \(m\) points if and only if \(Y_{n + m - 1} \ge n\) if and only if \(V_n \le m + n - 1\). Hence \[ A_{n,m}(p) = \sum_{k=n}^{n + m - 1} \binom{n + m - 1}{k} p^k (1 - p)^{n + m - 1 - k} = \sum_{j=n}^{n+m-1} \binom{j - 1}{n - 1} p^n (1 - p)^{j - n} \]

\(A_{n,m}(p)\) satisfies the following properties:

- \(A_{n,m}(p)\) increases from 0 to 1 as \(p\) increases from 0 to 1 for fixed \(n\) and \(m\).
- \(A_{n,m}(p)\) decreases as \(n\) increases for fixed \(m\) and \(p\).
- \(A_{n,m}(p)\) increases as \(m\) increases for fixed \(n\) and \(p\).

\(1 - A_{n,m}(p) = A_{m,n}(1 - p)\) for any \(m, \; n \in \N_+\) and \(p \in (0, 1)\).

## Proof

A simple probabilistic proof is to note that both sides can be interpreted as the probability that a player with point probability \(1 - p\) wins \(m\) points before the opponent wins \(n\) points. An analytic proof can also be constructed using the formulas above for \( A_{n,m}(p) \)

In the problem of points experiments, vary the parameters \(n\), \(m\), and \(p\), and note how the probability changes. For selected values of the parameters, run the simulation 1000 times and note the apparent convergence of the relative frequency to the probability.

The win probability function for player \(A\) satisfies the following recurrence relation and boundary conditions (this was essentially Fermat's solution):

- \(A_{m,n}(p) = p \, A_{n-1,m}(p) + (1 - p) \, A_{n,m-1}(p), \quad n, \; m \in \N_+\)
- \(A_{n,0}(p) = 0\), \(A_{0,m}(p) = 1\)

## Proof

Condition on the outcome of the first trial.

Next let \(N_{n,m}\) denote the number of trials needed until either \(A\) wins \(n\) points or \(B\) wins \(m\) points, whichever occurs first—the length of the problem of points experiment. The following result gives the distribution of \( N_{n,m} \)

For \(k \in \left\{\min\{m, n\}, \ldots, n + m - 1\right\}\) \[ \P\left(N_{n,m} = k\right) = \binom{k - 1}{n - 1} p^n (1 - p)^{k - n} + \binom{k - 1}{m - 1} (1 - p)^m p^{k - m} \]

## Proof

Again, imagine that we continue the trials indefinitely. Let \(V_n\) denote the number of trials needed for \(A\) to win \(n\) points, and let \(W\) denote the number of trials needed for \(B\) to win \(m\) points. Then \(\P\left(N_{n,m} = k\right) = \P(V_n = k) + \P(W_m = k)\) for \( k \) in the indicated range.

### Series of Games

The special case of the problem of points experiment with \(m = n\) is important, because it corresponds to \(A\) and \(B\) playing a best of \(2 n - 1\) game series. That is, the first player to win \(n\) games wins the series. Such series, especially when \(n \in \{2, 3, 4\}\), are frequently used in championship tournaments.

Let \(A_n(p)\) denote the probability that player \(A\) wins the series. Then \[ A_n(p) = \sum_{k=n}^{2 n - 1} \binom{2 n - 1}{k} p^k (1 - p)^{2 n - 1 - k} = \sum_{j=n}^{2 n - 1} \binom{j - 1}{n - 1} p^n (1 - p)^{j - n} \]

## Proof

This follows directly from the problem of points probability above, since \(A_n(p) = A_{n,n}(p)\).

Suppose that \(p = 0.6\). Explicitly find the probability that team \(A\) wins in each of the following cases:

- A best of 5 game series.
- A best of 7 game series.

## Answer

- 0.6825.
- 0.7102

In the problem of points experiments, vary the parameters \(n\), \(m\), and \(p\) (keeping \(n = m\)), and note how the probability changes. Now simulate a best of 5 series by selecting \(n = m = 3\), \(p = 0.6\). Run the experiment 1000 times and compare the relative frequency to the true probability.

\(A_n(1 - p) = 1 - A_n(p)\) for any \(n \in \N_+\) and \(p \in [0, 1]\). Therefore

- The graph of \(A_n\) is symmetric with respect to \(p = \frac{1}{2}\).
- \(A_n\left(\frac{1}{2}\right) = \frac{1}{2}\).

## Proof

Again, there is a simple probabilistic argument for the equation: both sides represent the probabiltiy that a player with game probability \(1 - p\) will win the series.

In the problem of points experiments, vary the parameters \(n\), \(m\), and \(p\) (keeping \(n = m\)), and note how the probability changes. Now simulate a best 7 series by selecting \(n = m = 4\), \(p = 0.45\). Run the experiment 1000 times and compare the relative frequency to the true probability.

If \(n \gt m\) then

- \(A_n(p) \lt A_m(p)\) if \(0 \lt p \lt \frac{1}{2}\)
- \(A_n(p) \gt A_m(p)\) if \(\frac{1}{2} \lt p \lt 1\)

## Proof

The greater the number of games in the series, the more the series favors the stronger player (the one with the larger game probability).

Let \(N_n\) denote the number of trials in the series. Then \(N_n\) has probability density function \[ \P(N_n = k) = \binom{k - 1}{n - 1} \left[p^n (1 - p)^{k-n} + (1 - p)^n p^{k-n} \right], \quad k \in \{n, n + 1, \ldots, 2 \, n - 1\} \]

## Proof

This result follows directly from the corresponding problem of points result above with \(n = m\).

Explicitly compute the probability density function, expected value, and standard deviation for the number of games in a best of 7 series with the following values of \(p\):

- 0.5
- 0.7
- 0.9

## Answer

- \(f(k) = \binom{k - 1}{3} \left(\frac{1}{2}\right)^{k-1}, \quad k \in \{4, 5, 6, 7\}\), \(\E(N) = 5.8125\), \(\sd(N) = 1.0136\)
- \(f(k) = \binom{k - 1}{3} \left[(0.7)^4 (0.3)^{k-4} + (0.3)^4 (0.7)^{k-4}\right], \quad k \in \{4, 5, 6, 7\}\), \(\E(N) = 5.3780\), \(\sd(N) = 1.0497\)
- \(f(k) = \binom{k - 1}{3} \left[(0.9)^4 (0.1)^{k-4} + (0.1)^4 (0.9)^{k-4}\right], \quad k \in \{4, 5, 6, 7\}\), \(\E(N) = 4.4394\), \(\sd(N) = 0.6831\)

### Division of Stakes

The problem of points originated from a question posed by Chevalier de Mere, who was interested in the fair division of stakes when a game is interrupted. Specifically, suppose that players \(A\) and \(B\) each put up \(c\) monetary units, and then play Bernoulli trials until one of them wins a specified number of trials. The winner then takes the entire \(2 c\) fortune.

If the game is interrupted when \(A\) needs to win \(n\) more trials and \(B\) needs to win \(m\) more trials, then the fortune should be divided between \(A\) and \(B\), respectively, as follows:

- \(2 c A_{n,m}(p)\) for \(A\)
- \(2 c \left[1 - A_{n,m}(p)\right] = 2 c A_{m,n}(1 - p)\) for \(B\).

Suppose that players \(A\) and \(B\) bet $50 each. The players toss a fair coin until one of them has 10 wins; the winner takes the entire fortune. Suppose that the game is interrupted by the gambling police when \(A\) has 5 wins and \(B\) has 3 wins. How should the stakes be divided?

## Answer

\(A\) gets $72.56, \(B\) gets $27.44

## Alternate and General Versions

Let's return to the formulation at the beginning of this section. Thus, suppose that we have a sequence of Bernoulli trials \(\bs{X}\) with success parameter \(p \in (0, 1]\), and for \(k \in \N_+\), we let \(V_k\) denote the trial number of the \(k\)th success. Thus, \(V_k\) has the negative binomial distribution with parameters \(k\) and \(p\) as we studied above. The random variable \(W_k = V_k - k\) is the number of failures before the \(k\)th success. Let \(N_1 = W_1\), the number of failures before the first success, and let \(N_k = W_k - W_{k-1}\), the number of failures between the \((k - 1)\)st success and the \(k\)th success, for \(k \in \{2, 3, \ldots\}\).

\(\bs{N} = (N_1, N_2, \ldots)\) is a sequence of independent random variables, each having the geometric distribution on \(\N\) with parameter \(p\). Moreover,

\[ W_k = \sum_{i=1}^k N_i \]Thus, \(\bs{W} = (W_1, W_2, \ldots)\) is the partial sum process associated with \(\bs{N}\). In particular, \(\bs{W}\) has stationary, independent increments.

### Probability Density Functions

The probability density function of \(W_k\) is given by

\[ \P(W_k = n) = \binom{n + k - 1}{k - 1} p^k (1 - p)^n = \binom{n + k - 1}{n} p^k (1 - p)^n, \quad n \in \N \]## Proof

This result follows directly from the PDF of \(V_k\), since \(\P(W_k = n) = \P(V_k = k + n)\) for \(n \in \N\).

The distribution of \(W_k\) is also referred to as the negative binomial distribution with parameters \(k\) and \(p\). Thus, the term *negative binomial distribution* can refer either to the distribution of the trial number of the \(k\)th success or the distribution of the number of failures before the \(k\)th success, depending on the author and the context. The two random variables differ by a constant, so it's not a particularly important issue as long as we know which version is intended. In this text, we will refer to the alternate version as the negative binomial distribution on \( \N \), to distinguish it from the original version, which has support set \( \{k, k + 1, \ldots\} \)

More interestingly, however, the probability density function in the last result makes sense for any \(k \in (0, \infty)\), not just integers. To see this, first recall the definition of the general binomial coefficient: if \( a \in \R \) and \( n \in \N \), we define \[ \binom{a}{n} = \frac{a^{(n)}}{n!} = \frac{a (a - 1) \cdots (a - n + 1)}{n!} \]

The function \(f\) given below defines a probability density function for every \(p \in (0, 1)\) and \(k \in (0, \infty)\):

\[ f(n) = \binom{n + k - 1}{n} p^k (1 - p)^n, \quad n \in \N \]## Proof

Recall from the section on Combinatorial Structures that \(\binom{n + k - 1}{n} = (-1)^n \binom{-k}{n}\). From the general binomial theorem,

\[ \sum_{n=0}^\infty f(n) = p^k \sum_{n=0}^\infty \binom{-k}{n} (-1)^n (1 - p)^n = p^k \left[1 - (1 - p)\right]^{-k} = 1\]Once again, the distribution defined by the probability density function in the last theorem is the negative binomial distribution on \( \N \), with parameters \(k\) and \(p\). The special case when \(k\) is a positive integer is sometimes referred to as the Pascal distribution, in honor of Blaise Pascal.

The distribution is unimodal. Let \(t = \left|k - 1\right| \frac{1 - p}{p}\).

- \(f(n - 1) \lt f(n)\) if and only if \(n \lt t\).
- The distribution has a single mode at \(\lfloor t \rfloor\) if \(t\) is not an integer.
- The distribution has two consecutive modes at \(t - 1\) and \(t\) if \(t\) is a positive integer.

### Basic Properties

Suppose that \(W\) has the negative binomial distribution on \( \N \) with parameters \(k \in (0, \infty)\) and \(p \in (0, 1)\). To establish basic properties, we can no longer use the decomposition of \(W\) as a sum of independent geometric variables. Instead, the best approach is to derive the probability generating function and then use the generating function to obtain other basic properties.

\(W\) has probability generating function \(P\) given by

\[ P(t) = \E\left(t^W\right) = \left( \frac{p}{1 - (1 - p) \, t} \right)^k, \quad \left|t\right| \lt \frac{1}{1 - p} \]## Proof

This follows from the general binomial theorem: for \(\left|t\right| \lt 1 / (1 - p)\), \[ \E\left(t^W\right) = \sum_{n=0}^\infty f(n) t^n = p^k \sum_{n=0}^\infty \binom{-k}{n} (-1)^n (1 - p)^n t^n = p^k \left[1 - (1 - p) t\right]^{-k}\ \]

The moments of \(W\) can be obtained from the derivatives of the probability generating funciton.

\(W\) has the following moments:

- \(\E(W) = k \frac{1 - p}{p}\)
- \(\var(W) = k \frac{1 - p}{p^2}\)
- \(\skw(W) = \frac{2 - p}{\sqrt{k (1 - p)}}\)
- \(\kur(W) = \frac{3 \, (k + 2) (1 - p) + p^2}{k \, (1 - p)}\)

## Proof

Recall that the factorial moments of \(W\) can be obtained from the derivatives of the probability generating function: \(\E\left[W^{(k)}\right] = P^{(k)}(1)\). Then the various moments above can be obtained from standard formulas.

The negative binomial distribution on \( \N \) is preserved under sums of independent variables.

Suppose that \(V\) has the negative binomial distribution on \( \N \) with parameters \(a \in(0, \infty)\) and \(p \in (0, 1)\), and that \(W\) has the negative binomial distribution on \( \N \) with parameters \(b \in (0, \infty)\) and \(p \in (0, 1)\), and that \(V\) and \(W\) are independent. Then \(V + W\) has the negative binomial on \( \N \) distribution with parameters \(a + b\) and \(p\).

## Proof

This result follows from the probability generating functions. Recall that the PGF of \(V + W\) is the product of the PGFs of \(V\) and \(W\).

In the last result, note that the success parameter \(p\) must be the same for both variables.

### Normal Approximation

Because of the decomposition of \(W\) when the parameter \(k\) is a positive integer, it's not surprising that a central limit theorm holds for the general negative binomial distribution.

Suppose that \(W\) has the negative binomial distibtion with parameters \(k \in (0, \infty)\) and \(p \in (0, 1)\). The standard score of \(W\) is \[ Z = \frac{p \, W - k \, (1 - p)}{\sqrt{k \, (1 - p)}} \] The distribution of \(Z\) converges to the standard normal distribution as \(k \to \infty\).

Thus, if \(k\) is large (and not necessarily an integer), then the distribution of \(W\) is approximately normal with mean \(k \frac{1 - p}{p}\) and variance \(k \frac{1 - p}{p^2}\).

### Special Families

The negative binomial distribution on \( \N \) belongs to several special families of distributions. First, It follows from the result above on sums that we can decompose a negative binomial variable on \( \N \) into the sum of an arbitrary number of independent, identically distributed variables. This special property is known as infinite divisibility, and is studied in more detail in the chapter on Special Distributions.

The negative binomial distribution on \( \N \) is infinitely divisible.

## Proof

Suppose that \( V \) has the negative binomial distribution on \( \N \) with parameters \( k \in (0, \infty) \) and \( p \in (0, 1) \). It follows from the previous result that for any \( n \in \N_+ \), \( V \) can be represented as \( V = \sum_{i=1}^n V_i \) where \( (V_1, V_2, \ldots, V_n) \) are independent, and each has the negative binomial distribution on \( \N \) with parameters \( k/n \) and \( p \).

A Poisson-distributed random sum of independent, identically distributed random variables is said to have a compound Poisson distributions; these distributions are studied in more detail in the chapter on the Poisson Process. A theorem of William Feller states that an infinite divisible distribution on \( \N \) must be compound Poisson. Hence it follows from the previous result that the negative binomial distribution on \( \N \) belongs to this family. Here is the explicit result:

Let \( p, \, k \in (0, \infty) \). Suppose that \( \bs{X} = (X_1, X_2, \ldots) \) is a sequence of independent variables, each having the logarithmic series distribution with shape parameter \( 1 - p \). Suppose also that \( N \) is independent of \( \bs{X} \) and has the Poisson distribution with parameter \( - k \ln(p) \). Then \( W = \sum_{i=1}^N X_i \) has the negative binomial distribution on \( \N \) with parameters \( k \) and \( p \).

## Proof

From the general theory of compound Poisson distributions, the probability generating function of \( W \) is \( P(t) = \exp\left( \lambda [Q(t) - 1]\right) \) where \( \lambda \) is the parameter of the Poisson variable \( N \) and \( Q(t) \) is the common PGF of the the terms in the sum. Using the PGF of the logarithmic series distribution, and the particular values of the parameters, we have \[ P(t) = \exp \left[-k \ln(p) \left(\frac{\ln[1 - (1 - p)t]}{\ln(p)} - 1\right)\right], \quad \left|t\right| \lt \frac{1}{1 - p} \] Using properties of logarithms and simple algebra, this reduces to \[ P(t) = \left(\frac{p}{1 - (1 - p)t}\right)^k, \quad \left|t\right| \lt \frac{1}{1 - p} \] which is the PGF of the negative binomial distribution with parameters \( k \) and \( p \).

As a special case (\( k = 1 \)), it follows that the geometric distribution on \( \N \) is infinitely divisible and compound Poisson.

Next, the negative binomial distribution on \( \N \) belongs to the general exponential family. This family is important in inferential statistics and is studied in more detail in the chapter on Special Distributions.

Suppose that \( W \) has the negative binomial distribution on \( \N \) with parameters \( k \in (0, \infty) \) and \( p \in (0, 1) \). For fixed \( k \), \( W \) has a one-parameter exponential distribution with natural statistic \( W \) and natural parameter \( \ln(1 - p) \).

## Proof

The PDF of \( W \) can be written as \[ f(n) = \binom{n + k - 1}{n} p^k \exp\left[n \ln(1 - p)\right], \quad n \in \N \] so the result follows from the definition of the general exponential family.

Finally, the negative binomial distribution on \( \N \) is a power series distribution. Many special discrete distribution belong to this family, which is studied in more detail in the chapter on Special Distributions.

For fixed \( k \in (0, \infty) \), the negative binomial distribution on \( \N \) with parameters \( k \) and \( p \in (0, 1) \) is a power series distribution corresponding to the function \( g(\theta) = 1 \big/ (1 - \theta)^k \) for \( \theta \in (0, 1) \), where \( \theta = 1 - p \).

## Proof

In terms of the new parameter \( \theta \), the negative binomial pdf has the form \( f(n) = \frac{1}{g(\theta)} \binom{n + k - 1}{n} \theta^n\) for \( n \in \N \), and \( \sum_{n=0}^\infty \binom{n + k - 1}{n} \theta^n = g(\theta) \).

### Computational Exercises

Suppose that \(W\) has the negative binomial distribution with parameters \(k = \frac{15}{2}\) and \(p = \frac{3}{4}\). Compute each of the following:

- \(\P(W = 3)\)
- \(\E(W)\)
- \(\var(W)\)

## Answer

- \(\P(W = 3) = 0.1823\)
- \(\E(W) = \frac{5}{2}\)
- \(\var(W) = \frac{10}{3}\)

Suppose that \(W\) has the negative binomial distribution with parameters \(k = \frac{1}{3}\) and \(p = \frac{1}{4}\). Compute each of the following:

- \(\P(W \le 2)\)
- \(\E(W)\)
- \(\var(W)\)

## Answer

- \(\P(W \le 2) = \frac{11}{8 \sqrt[3]{4}}\)
- \(\E(W) = 1\)
- \(\var(W) = 4\)

Suppose that \(W\) has the negative binomial distribution with parameters \(k = 10 \, \pi\) and \(p = \frac{1}{3}\). Compute each of the following:

- \(\E(W)\)
- \(\var(W)\)
- The normal approximation to \(\P(50 \le W \le 70)\)

## Answer

- \(\E(W) = 20 \, \pi\)
- \(\var(W) = 60 \, \pi\)
- \(\P(50 \le W \le 70) \approx 0.5461\)