6.4: The Central Limit Theorem



\(\newcommand{\R}{\mathbb{R}}\) \(\newcommand{\N}{\mathbb{N}}\) \(\newcommand{\Z}{\mathbb{Z}}\) \(\newcommand{\E}{\mathbb{E}}\) \(\newcommand{\P}{\mathbb{P}}\) \(\newcommand{\var}{\text{var}}\) \(\newcommand{\sd}{\text{sd}}\) \(\newcommand{\cov}{\text{cov}}\) \(\newcommand{\cor}{\text{cor}}\) \(\newcommand{\bs}{\boldsymbol}\)

The central limit theorem and the law of large numbers are the two fundamental theorems of probability. Roughly, the central limit theorem states that the distribution of the sum (or average) of a large number of independent, identically distributed variables will be approximately normal, regardless of the underlying distribution. The importance of the central limit theorem is hard to overstate; indeed it is the reason that many statistical procedures work.

Partial Sum Processes

Definitions

Suppose that \(\bs{X} = (X_1, X_2, \ldots)\) is a sequence of independent, identically distributed, real-valued random variables with common probability density function \(f\), mean \(\mu\), and variance \(\sigma^2\). We assume that \(0 \lt \sigma \lt \infty\), so that in particular, the random variables really are random and not constants. Let \[ Y_n = \sum_{i=1}^n X_i, \quad n \in \N \] Note that by convention, \(Y_0 = 0\), since the sum is over an empty index set. The random process \(\bs{Y} = (Y_0, Y_1, Y_2, \ldots)\) is called the partial sum process associated with \(\bs{X}\). Special types of partial sum processes have been studied in many places in this text; in particular see

the binomial distribution in the setting of Bernoulli trials
the negative binomial distribution in the setting of Bernoulli trials
the gamma distribution in the Poisson process
the arrival times in a general renewal process

Recall that in statistical terms, the sequence \(\bs{X}\) corresponds to sampling from the underlying distribution. In particular, \((X_1, X_2, \ldots, X_n)\) is a random sample of size \(n\) from the distribution, and the corresponding sample mean is \[ M_n = \frac{Y_n}{n} = \frac{1}{n} \sum_{i=1}^n X_i \] By the law of large numbers, \(M_n \to \mu\) as \(n \to \infty\) with probability 1.
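As a quick illustration of the partial sum process and the sample mean, here is a minimal Python sketch. The uniform distribution on \([0, 1]\) (so that \(\mu = 1/2\)), the seed, and the function name `partial_sums` are assumptions made only for this example.

```python
import random
from itertools import accumulate

def partial_sums(n, seed=0):
    """Return [Y_0, Y_1, ..., Y_n] for X_i uniform on [0, 1] (an illustrative choice)."""
    rng = random.Random(seed)
    xs = [rng.random() for _ in range(n)]
    # Y_0 = 0 by convention (empty sum); accumulate gives the running totals.
    return [0.0] + list(accumulate(xs))

ys = partial_sums(10_000)
for n in (10, 100, 1_000, 10_000):
    # The sample mean M_n = Y_n / n should settle near mu = 0.5 as n grows.
    print(n, ys[n] / n)
```

The printed sample means depend on the seed, but they should drift toward 0.5 as \(n\) increases, in line with the law of large numbers.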

Stationary, Independent Increments

The partial sum process corresponding to a sequence of independent, identically distributed variables has two important properties, and these properties essentially characterize such processes.

If \(m \le n\) then \(Y_n - Y_m\) has the same distribution as \(Y_{n-m}\). Thus the process \(\bs{Y}\) has stationary increments.

Proof

Note that \(Y_n - Y_m = \sum_{i=m+1}^n X_i\) and is the sum of \(n - m\) independent variables, each with the common distribution. Of course, \(Y_{n-m}\) is also the sum of \(n - m\) independent variables, each with the common distribution.

Note however that \(Y_n - Y_m\) and \(Y_{n-m}\) are very different random variables; the theorem simply states that they have the same distribution.

If \(n_1 \le n_2 \le n_3 \le \cdots\) then \(\left(Y_{n_1}, Y_{n_2} - Y_{n_1}, Y_{n_3} - Y_{n_2}, \ldots\right)\) is a sequence of independent random variables. Thus the process \(\bs{Y}\) has independent increments.

Proof

The terms in the sequence of increments \(\left(Y_{n_1}, Y_{n_2} - Y_{n_1}, Y_{n_3} - Y_{n_2}, \ldots\right)\) are sums over disjoint collections of terms in the sequence \(\bs{X}\). Since the sequence \(\bs{X}\) is independent, so is the sequence of increments.

Conversely, suppose that \(\bs{V} = (V_0, V_1, V_2, \ldots)\) is a random process with \(V_0 = 0\) that has stationary, independent increments. Define \(U_i = V_i - V_{i-1}\) for \(i \in \N_+\). Then \(\bs{U} = (U_1, U_2, \ldots)\) is a sequence of independent, identically distributed variables and \(\bs{V}\) is the partial sum process associated with \(\bs{U}\).

Thus, partial sum processes are the only discrete-time random processes that have stationary, independent increments. An interesting, and much harder, problem is to characterize the continuous-time processes that have stationary, independent increments. The Poisson counting process has stationary, independent increments, as does the Brownian motion process.

Moments

If \(n \in \N\) then

\(\E(Y_n) = n \mu\)
\(\var(Y_n) = n \sigma^2\)

Proof

The results follow from basic properties of expected value and variance. Expected value is a linear operation so \( \E(Y_n) = \sum_{i=1}^n \E(X_i) = n \mu \). By independence, \(\var(Y_n) = \sum_{i=1}^n \var(X_i) = n \sigma^2\).

If \(n \in \N_+\) and \(m \in \N\) with \(m \le n\) then

\(\cov(Y_m, Y_n) = m \sigma^2\)
\(\cor(Y_m, Y_n) = \sqrt{\frac{m}{n}}\)
\(\E(Y_m Y_n) = m \sigma^2 + m n \mu^2\)

Proof

Note that \(Y_n = Y_m + (Y_n - Y_m)\). The covariance result follows from basic properties of covariance, the independent increments property, and the variance of \(Y_m\) given above: \[ \cov(Y_m, Y_n) = \cov(Y_m, Y_m) + \cov(Y_m, Y_n - Y_m) = \var(Y_m) + 0 = m \sigma^2 \]
The correlation result follows from the covariance result and the variances of \(Y_m\) and \(Y_n\) given above: \[ \cor(Y_m, Y_n) = \frac{\cov(Y_m, Y_n)}{\sd(Y_m) \sd(Y_n)} = \frac{m \sigma^2}{\sqrt{m \sigma^2} \sqrt{n \sigma^2}} = \sqrt{\frac{m}{n}} \]
The last result also follows from the covariance result and the means of \(Y_m\) and \(Y_n\) given above: \(\E(Y_m Y_n) = \cov(Y_m, Y_n) + \E(Y_m) \E(Y_n) = m \sigma^2 + m \mu n \mu\)
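The correlation formula can be checked empirically. The following is a rough simulation sketch, not a proof. The fair die as sampling distribution, the values \(m = 5\) and \(n = 20\), the seed, and the helper names are all assumptions chosen for illustration.

```python
import math
import random

def simulate_pair(m, n, rng):
    """Return (Y_m, Y_n) for one run of fair-die rolls (an illustrative choice)."""
    total, y_m = 0, 0
    for i in range(1, n + 1):
        total += rng.randint(1, 6)
        if i == m:
            y_m = total
    return y_m, total

def sample_correlation(pairs):
    """Plain sample correlation of a list of (a, b) pairs."""
    k = len(pairs)
    mean_a = sum(a for a, _ in pairs) / k
    mean_b = sum(b for _, b in pairs) / k
    cov = sum((a - mean_a) * (b - mean_b) for a, b in pairs)
    var_a = sum((a - mean_a) ** 2 for a, _ in pairs)
    var_b = sum((b - mean_b) ** 2 for _, b in pairs)
    return cov / math.sqrt(var_a * var_b)

rng = random.Random(1)
m, n = 5, 20
pairs = [simulate_pair(m, n, rng) for _ in range(10_000)]
# Empirical correlation next to the theoretical value sqrt(m / n).
print(sample_correlation(pairs), math.sqrt(m / n))
```

With these settings the two printed numbers should be close, since \(\sqrt{5/20} = 0.5\) and the sampling error of the empirical correlation is small for 10,000 runs.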

If \(X\) has moment generating function \(G\) then \(Y_n\) has moment generating function \(G^n\).

Proof

This follows from a basic property of generating functions: the generating function of a sum of independent variables is the product of the generating functions of the terms.

Distributions

Suppose that \(X\) has either a discrete distribution or a continuous distribution with probability density function \(f\). Then the probability density function of \(Y_n\) is \(f^{*n} = f * f * \cdots * f\), the convolution power of \(f\) of order \(n\).

Proof

This follows from a basic property of probability density functions: the probability density function of a sum of independent variables is the convolution of the probability density functions of the terms.
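For a discrete distribution on a finite set of nonnegative integers, the convolution power can be computed exactly. The sketch below does this for a fair die (an assumed example) with exact rational arithmetic, and checks the moment formulas \(\E(Y_n) = n \mu\) and \(\var(Y_n) = n \sigma^2\). The function names are hypothetical.

```python
from fractions import Fraction

def convolve(f, g):
    """Convolution of two PDFs on {0, 1, 2, ...}, given as lists of probabilities."""
    h = [Fraction(0)] * (len(f) + len(g) - 1)
    for i, fi in enumerate(f):
        for j, gj in enumerate(g):
            h[i + j] += fi * gj
    return h

def convolution_power(f, n):
    """PDF of Y_n = X_1 + ... + X_n; the empty sum Y_0 is the point mass at 0."""
    result = [Fraction(1)]
    for _ in range(n):
        result = convolve(result, f)
    return result

def mean_and_variance(f):
    mean = sum(k * p for k, p in enumerate(f))
    second = sum(k * k * p for k, p in enumerate(f))
    return mean, second - mean ** 2

# Fair die: index k holds P(X = k), so index 0 has probability 0.
die = [Fraction(0)] + [Fraction(1, 6)] * 6
mu, sigma2 = mean_and_variance(die)          # 7/2 and 35/12
mean_5, var_5 = mean_and_variance(convolution_power(die, 5))
print(mean_5 == 5 * mu, var_5 == 5 * sigma2)
```

Using `Fraction` keeps the comparison exact, so the check does not depend on floating-point tolerance.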

More generally, we can use the stationary and independence properties to find the joint distributions of the partial sum process:

If \(n_1 \lt n_2 \lt \cdots \lt n_k\) then \((Y_{n_1}, Y_{n_2}, \ldots, Y_{n_k})\) has joint probability density function \[ f_{n_1, n_2, \ldots, n_k}(y_1, y_2, \ldots, y_k) = f^{*n_1}(y_1) f^{*(n_2 - n_1)}(y_2 - y_1) \cdots f^{*(n_k - n_{k-1})}(y_k - y_{k-1}), \quad (y_1, y_2, \ldots, y_k) \in \R^k \]

Proof

This follows from the multivariate change of variables theorem.

The Central Limit Theorem

First, let's make the central limit theorem more precise. From the moment results above, we cannot expect \(Y_n\) itself to have a limiting distribution. Note that \(\var(Y_n) \to \infty\) as \(n \to \infty\) since \(\sigma \gt 0\), and \(\E(Y_n) \to \infty\) as \(n \to \infty\) if \(\mu \gt 0\) while \(\E(Y_n) \to -\infty\) as \(n \to \infty\) if \(\mu \lt 0\). Similarly, we know that \(M_n \to \mu\) as \(n \to \infty\) with probability 1, so the limiting distribution of the sample mean is degenerate. Thus, to obtain a limiting distribution of \(Y_n\) or \(M_n\) that is not degenerate, we need to consider, not these variables themselves, but rather the common standard score. Thus, let \[ Z_n = \frac{Y_n - n \mu}{\sqrt{n} \sigma} = \frac{M_n - \mu}{\sigma \big/ \sqrt{n}} \]

\(Z_n\) has mean 0 and variance 1.

\(\E(Z_n) = 0\)
\(\var(Z_n) = 1\)

Proof

These results follow from basic properties of expected value and variance, and are true for the standard score associated with any random variable. Recall also that the standard score of a variable is invariant under linear transformations with positive slope. The fact that the standard score of \(Y_n\) and the standard score of \(M_n\) are the same is a special case of this.

The precise statement of the central limit theorem is that the distribution of the standard score \(Z_n\) converges to the standard normal distribution as \(n \to \infty\). Recall that the standard normal distribution has probability density function \[ \phi(z) = \frac{1}{\sqrt{2 \pi}} e^{-\frac{1}{2} z^2}, \quad z \in \R \] and is studied in more detail in the chapter on special distributions. A special case of the central limit theorem (for Bernoulli trials) dates to Abraham De Moivre. The term central limit theorem was coined by George Pólya in 1920. By definition of convergence in distribution, the central limit theorem states that \(F_n(z) \to \Phi(z)\) as \(n \to \infty\) for each \(z \in \R\), where \(F_n\) is the distribution function of \(Z_n\) and \(\Phi\) is the standard normal distribution function:

\[ \Phi(z) = \int_{-\infty}^z \phi(x) \, dx = \int_{-\infty}^z \frac{1}{\sqrt{2 \pi}} e^{-\frac{1}{2} x^2} \, dx, \quad z \in \R \]
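The convergence \(F_n(z) \to \Phi(z)\) can be explored by simulation. The sketch below estimates \(F_n\) by an empirical distribution function and compares it with \(\Phi\) at a few points. The exponential sampling distribution with \(\mu = \sigma = 1\), the values of \(n\) and the number of replications, and the function names are assumptions for illustration only.

```python
import math
import random
from statistics import NormalDist

def standard_score_sample(n, reps, seed=0):
    """Simulate reps values of Z_n for exponential X_i with mu = sigma = 1."""
    rng = random.Random(seed)
    mu, sigma = 1.0, 1.0
    zs = []
    for _ in range(reps):
        y_n = sum(rng.expovariate(1.0) for _ in range(n))
        zs.append((y_n - n * mu) / (math.sqrt(n) * sigma))
    return zs

def empirical_cdf(sample, z):
    """Fraction of sample values that are <= z."""
    return sum(1 for v in sample if v <= z) / len(sample)

zs = standard_score_sample(n=100, reps=4000)
for z in (-1.0, 0.0, 1.0):
    # Empirical estimate of F_n(z) next to the standard normal value Phi(z).
    print(z, empirical_cdf(zs, z), NormalDist().cdf(z))
```

The agreement is only approximate: part of the gap is simulation noise, and part is the genuine difference between \(F_n\) and \(\Phi\) for a skewed sampling distribution at finite \(n\).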

An equivalent statement of the central limit theorem involves convergence of the corresponding characteristic functions. This is the version that we will give and prove, but first we need a generalization of a famous limit from calculus.

Suppose that \((a_1, a_2, \ldots)\) is a sequence of real numbers and that \(a_n \to a \in \R\) as \(n \to \infty\). Then \[ \left( 1 + \frac{a_n}{n} \right)^n \to e^a \text{ as } n \to \infty \]

Now let \(\chi\) denote the characteristic function of the standard score of the sample variable \(X\), and let \(\chi_n\) denote the characteristic function of the standard score \(Z_n\): \[ \chi(t) = \E \left[ \exp\left( i t \frac{X - \mu}{\sigma} \right) \right], \; \chi_n(t) = \E[\exp(i t Z_n)]; \quad t \in \R \] Recall that \(t \mapsto e^{-\frac{1}{2}t^2}\) is the characteristic function of the standard normal distribution. We can now give a proof.

The central limit theorem. The distribution of \(Z_n\) converges to the standard normal distribution as \(n \to \infty\). That is, \(\chi_n(t) \to e^{-\frac{1}{2}t^2}\) as \(n \to \infty\) for each \(t \in \R\).

Proof

Note that \(\chi(0) = 1\), \(\chi^\prime(0) = 0\), \(\chi^{\prime \prime}(0) = -1\). Next \[ Z_n = \frac{1}{\sqrt{n}} \sum_{i=1}^n \frac{X_i - \mu}{\sigma} \] From properties of characteristic functions, \(\chi_n(t) = \chi^n (t / \sqrt{n})\) for \(t \in \R\). By Taylor's theorem (named after Brook Taylor), and since \(\chi(0) = 1\) and \(\chi^\prime(0) = 0\), \[ \chi\left(\frac{t}{\sqrt{n}}\right) = 1 + \frac{1}{2} \chi^{\prime\prime}(s_n) \frac{t^2}{n} \text{ where } \left|s_n\right| \le \frac{\left|t\right|}{\sqrt{n}} \] But \(s_n \to 0\) and hence \(\chi^{\prime\prime}(s_n) \to -1\) as \(n \to \infty\). Finally, by the limit result above, \[ \chi_n(t) = \left[1 + \frac{1}{2} \chi^{\prime\prime}(s_n) \frac{t^2}{n} \right]^n \to e^{-\frac{1}{2} t^2} \text{ as } n \to \infty \]
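The convergence of characteristic functions can also be observed numerically. The sketch below evaluates \(\chi_n(t) = \chi^n(t / \sqrt{n})\) for a fair die and compares it with \(e^{-t^2/2}\). The choice of sampling distribution, the value \(t = 1.5\), and the helper names are assumptions made for the example.

```python
import cmath
import math

def standardized_cf(values, probs):
    """Characteristic function of (X - mu) / sigma for a finite discrete X."""
    mu = sum(x * p for x, p in zip(values, probs))
    sigma = math.sqrt(sum((x - mu) ** 2 * p for x, p in zip(values, probs)))

    def chi(t):
        return sum(p * cmath.exp(1j * t * (x - mu) / sigma)
                   for x, p in zip(values, probs))

    return chi

def chi_n(chi, t, n):
    """Characteristic function of Z_n, using chi_n(t) = chi(t / sqrt(n)) ** n."""
    return chi(t / math.sqrt(n)) ** n

# Fair die (an illustrative choice of sampling distribution).
chi = standardized_cf([1, 2, 3, 4, 5, 6], [1 / 6] * 6)
t = 1.5
for n in (1, 10, 100, 10_000):
    # Distance between chi_n(t) and the standard normal characteristic function.
    print(n, abs(chi_n(chi, t, n) - math.exp(-t * t / 2)))
```

The printed distances should shrink as \(n\) grows, which is the content of the theorem at this particular \(t\).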

Normal Approximations

The central limit theorem implies that if the sample size \(n\) is large then the distribution of the partial sum \(Y_n\) is approximately normal with mean \(n \mu\) and variance \(n \sigma^2\). Equivalently the sample mean \(M_n\) is approximately normal with mean \(\mu\) and variance \(\sigma^2 / n\). The central limit theorem is of fundamental importance, because it means that we can approximate the distribution of certain statistics, even if we know very little about the underlying sampling distribution.

Of course, the term large is relative. Roughly, the more abnormal the basic distribution, the larger \(n\) must be for normal approximations to work well. The rule of thumb is that a sample size \(n\) of at least 30 will usually suffice if the basic distribution is not too weird; although for many distributions smaller \(n\) will do.

Let \(Y\) denote the sum of the variables in a random sample of size 30 from the uniform distribution on \([0, 1]\). Find normal approximations to each of the following:

\(\P(13 \lt Y \lt 18)\)
The 90th percentile of \(Y\)

Answer

0.8682
17.03
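The computation behind these answers can be sketched with Python's standard library. The variable names are hypothetical. The only facts used are that the uniform distribution on \([0, 1]\) has mean \(1/2\) and variance \(1/12\), so \(Y\) is approximately normal with mean \(15\) and variance \(30/12\).

```python
import math
from statistics import NormalDist

n = 30
mu, sigma2 = 1 / 2, 1 / 12          # mean and variance of the uniform distribution on [0, 1]
approx = NormalDist(mu=n * mu, sigma=math.sqrt(n * sigma2))

prob = approx.cdf(18) - approx.cdf(13)   # P(13 < Y < 18); Y is continuous, so no correction
pct_90 = approx.inv_cdf(0.90)            # 90th percentile of Y
print(round(prob, 4), round(pct_90, 2))
```

The printed values should agree with the answers above up to rounding.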

Random variable \(Y\) in the previous exercise has the Irwin-Hall distribution of order 30. The Irwin-Hall distributions are studied in more detail in the chapter on Special Distributions and are named for Joseph Irwin and Philip Hall.

In the special distribution simulator, select the Irwin-Hall distribution. Vary \(n\) from 1 to 10 and note the shape of the probability density function. With \(n = 10\) run the experiment 1000 times and compare the empirical density function to the true probability density function.

Let \(M\) denote the sample mean of a random sample of size 50 from the distribution with probability density function \(f(x) = \frac{3}{x^4}\) for \(1 \le x \lt \infty\). This is a Pareto distribution, named for Vilfredo Pareto. Find normal approximations to each of the following:

\(\P(M \gt 1.6)\)
The 60th percentile of \(M\)

Answer

0.2071
1.531
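The normal approximation in this exercise needs the mean and variance of the sampling distribution. A sketch of that computation, using only the density given in the exercise: \[ \mu = \int_1^\infty x \, \frac{3}{x^4} \, dx = \frac{3}{2}, \quad \E(X^2) = \int_1^\infty x^2 \, \frac{3}{x^4} \, dx = 3, \quad \sigma^2 = 3 - \left(\frac{3}{2}\right)^2 = \frac{3}{4} \] Hence \(M\) is approximately normal with mean \(\frac{3}{2}\) and standard deviation \(\sqrt{\frac{3/4}{50}} \approx 0.1225\), so that \[ \P(M \gt 1.6) \approx 1 - \Phi\left(\frac{1.6 - 1.5}{0.1225}\right) \approx 1 - \Phi(0.8165) \]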

The Continuity Correction

A slight technical problem arises when the sampling distribution is discrete. In this case, the partial sum also has a discrete distribution, and hence we are approximating a discrete distribution with a continuous one. Suppose that \(X\) takes integer values (the most common case) and hence so does the partial sum \(Y_n\). For any \(k \in \Z\) and \(h \in [0, 1)\), note that the event \(\{k - h \le Y_n \le k + h\}\) is equivalent to the event \(\{Y_n = k\}\). Different values of \(h\) lead to different normal approximations, even though the events are equivalent. The smallest approximation would be 0 when \(h = 0\), and the approximations increase as \(h\) increases. It is customary to split the difference by using \(h = \frac{1}{2}\) for the normal approximation. This is sometimes called the half-unit continuity correction or the histogram correction. The continuity correction is extended to other events in the natural way, using the additivity of probability.

Suppose that \(j, k \in \Z\) with \(j \le k\).

For the event \(\{j \le Y_n \le k\} = \{j - 1 \lt Y_n \lt k + 1\}\), use \(\{j - \frac{1}{2} \le Y_n \le k + \frac{1}{2}\}\) in the normal approximation.
For the event \(\{j \le Y_n\} = \{j - 1 \lt Y_n\}\), use \(\{j - \frac{1}{2} \le Y_n\}\) in the normal approximation.
For the event \(\{Y_n \le k\} = \{Y_n \lt k + 1\}\), use \(\{Y_n \le k + \frac{1}{2}\}\) in the normal approximation.
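To see the effect of the correction, the following sketch compares the normal approximation with and without the half-unit adjustment against an exact probability. The binomial example with \(n = 20\) and \(p = \frac{1}{2}\), the event \(\{8 \le Y \le 12\}\), and the function name `normal_approx` are assumptions chosen for illustration.

```python
import math
from statistics import NormalDist

def normal_approx(mean, sd, j, k, h=0.5):
    """Normal approximation to P(j <= Y <= k) for integer-valued Y.

    h = 0.5 is the half-unit continuity correction; h = 0 applies no correction.
    """
    dist = NormalDist(mean, sd)
    return dist.cdf(k + h) - dist.cdf(j - h)

# Illustrative case: Y binomial with n = 20, p = 1/2, and the event {8 <= Y <= 12}.
n, p = 20, 0.5
exact = sum(math.comb(n, i) * p**i * (1 - p)**(n - i) for i in range(8, 13))
mean, sd = n * p, math.sqrt(n * p * (1 - p))
print(exact, normal_approx(mean, sd, 8, 12), normal_approx(mean, sd, 8, 12, h=0))
```

In this example the corrected approximation is much closer to the exact value than the uncorrected one, which is the usual pattern for moderate \(n\).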

Let \(Y\) denote the sum of the scores of 20 fair dice. Compute the normal approximation to \(\P(60 \le Y \le 75)\).

Answer

0.6741

In the dice experiment, set the die distribution to fair, select the sum random variable \(Y\), and set \(n = 20\). Run the simulation 1000 times and find each of the following. Compare with the result in the previous exercise:

\(\P(60 \le Y \le 75)\)
The relative frequency of the event \(\{60 \le Y \le 75\}\) (from the simulation)

Normal Approximation to the Gamma Distribution

Recall that the gamma distribution with shape parameter \(k \in (0, \infty)\) and scale parameter \(b \in (0, \infty)\) is a continuous distribution on \( (0, \infty) \) with probability density function \( f \) given by \[ f(x) = \frac{1}{\Gamma(k) b^k} x^{k-1} e^{-x/b}, \quad x \in (0, \infty) \] The mean is \( k b \) and the variance is \( k b ^2 \). The gamma distribution is widely used to model random times (particularly in the context of the Poisson model) and other positive random variables. The general gamma distribution is studied in more detail in the chapter on Special Distributions. In the context of the Poisson model (where \(k \in \N_+\)), the gamma distribution is also known as the Erlang distribution, named for Agner Erlang; it is studied in more detail in the chapter on the Poisson Process. Suppose now that \(Y_k\) has the gamma (Erlang) distribution with shape parameter \(k \in \N_+\) and scale parameter \(b \gt 0\). Then \[ Y_k = \sum_{i=1}^k X_i \] where \((X_1, X_2, \ldots)\) is a sequence of independent variables, each having the exponential distribution with scale parameter \(b\). (The exponential distribution is a special case of the gamma distribution with shape parameter 1.) It follows that if \(k\) is large, the gamma distribution can be approximated by the normal distribution with mean \(k b\) and variance \(k b^2\). The same statement actually holds when \(k\) is not an integer. Here is the precise statement:

Suppose that \( Y_k \) has the gamma distribution with scale parameter \( b \in (0, \infty) \) and shape parameter \( k \in (0, \infty) \). Then the distribution of the standardized variable \( Z_k \) below converges to the standard normal distribution as \(k \to \infty\): \[ Z_k = \frac{Y_k - k b}{\sqrt{k} b} \]

In the special distribution simulator, select the gamma distribution. Vary \(k\) and \(b\) and note the shape of the probability density function. With \(k = 10\) and various values of \(b\), run the experiment 1000 times and compare the empirical density function to the true probability density function.

Suppose that \(Y\) has the gamma distribution with shape parameter \(k = 10\) and scale parameter \(b = 2\). Find normal approximations to each of the following:

\( \P(18 \le Y \le 23) \)
The 80th percentile of \(Y\)

Answer

0.3063
25.32
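A sketch of the computation for this exercise follows, together with a comparison against the exact probability. The closed-form distribution function used here is the standard one for integer shape parameter (the Erlang case), and the function and variable names are hypothetical.

```python
import math
from statistics import NormalDist

def erlang_cdf(x, k, b):
    """Distribution function of the gamma distribution with integer shape k and scale b."""
    if x <= 0:
        return 0.0
    r = x / b
    # P(Y <= x) = 1 - sum_{j < k} e^{-r} r^j / j!  (valid for integer k only)
    return 1.0 - sum(math.exp(-r) * r**j / math.factorial(j) for j in range(k))

k, b = 10, 2
approx = NormalDist(mu=k * b, sigma=math.sqrt(k) * b)

exact_prob = erlang_cdf(23, k, b) - erlang_cdf(18, k, b)
approx_prob = approx.cdf(23) - approx.cdf(18)     # continuous, so no correction
pct_80 = approx.inv_cdf(0.80)
print(exact_prob, approx_prob, pct_80)
```

With \(k = 10\) the normal approximation is only moderately accurate, since the gamma distribution is still noticeably skewed. The gap between the first two printed values gives a sense of the error.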

Normal Approximation to the Chi-Square Distribution

Recall that the chi-square distribution with \(n \in (0, \infty)\) degrees of freedom is a special case of the gamma distribution, with shape parameter \(k = n / 2\) and scale parameter \(b = 2\). Thus, the chi-square distribution with \(n\) degrees of freedom has probability density function \[ f(x) = \frac{1}{\Gamma(n/2) 2^{n/2}} x^{n/2 - 1}e^{-x/2}, \quad 0 \lt x \lt \infty \] When \( n \) is a positive integer, the chi-square distribution governs the sum of the squares of \( n \) independent, standard normal variables. For this reason, it is one of the most important distributions in statistics. The chi-square distribution is studied in more detail in the chapter on Special Distributions. From the previous discussion, it follows that if \(n\) is large, the chi-square distribution can be approximated by the normal distribution with mean \(n\) and variance \(2 n\). Here is the precise statement:

Suppose that \(Y_n\) has the chi-square distribution with \(n \in (0, \infty) \) degrees of freedom. Then the distribution of the standardized variable \( Z_n \) below converges to the standard normal distribution as \(n \to \infty\): \[ Z_n = \frac{Y_n - n}{\sqrt{2 n}} \]
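For a positive integer \(n\), a chi-square variable with \(n\) degrees of freedom can be simulated as a sum of squares of \(n\) independent standard normal variables. The sketch below uses that representation to check the mean \(n\) and variance \(2 n\) used in the standardization. The values \(n = 20\), the number of replications, the seed, and the function name are assumptions.

```python
import random
from statistics import fmean, pvariance

def chi_square_sample(n, reps, seed=0):
    """Simulate reps sums of squares of n independent standard normal variables."""
    rng = random.Random(seed)
    return [sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(n)) for _ in range(reps)]

n = 20
ys = chi_square_sample(n, reps=10_000)
# Empirical mean and variance, to compare with n and 2n.
print(fmean(ys), pvariance(ys))
```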

In the special distribution simulator, select the chi-square distribution. Vary \(n\) and note the shape of the probability density function. With \(n = 20\), run the experiment 1000 times and compare the empirical density function to the probability density function.

Suppose that \(Y\) has the chi-square distribution with \(n = 20\) degrees of freedom. Find normal approximations to each of the following:

\(\P(18 \lt Y \lt 25)\)
The 75th percentile of \(Y\)

Answer

0.4107
24.3

Normal Approximation to the Binomial Distribution

Recall that a Bernoulli trials sequence, named for Jacob Bernoulli, is a sequence \( (X_1, X_2, \ldots) \) of independent, identically distributed indicator variables with \(\P(X_i = 1) = p\) for each \(i\), where \(p \in (0, 1)\) is the parameter. In the usual language of reliability, \(X_i\) is the outcome of trial \(i\), where 1 means success and 0 means failure. The common mean is \(p\) and the common variance is \(p (1 - p)\).

Let \(Y_n = \sum_{i=1}^n X_i\), so that \(Y_n\) is the number of successes in the first \(n\) trials. Recall that \(Y_n\) has the binomial distribution with parameters \(n\) and \(p\), and has probability density function \[ f(k) = \binom{n}{k} p^k (1 - p)^{n-k}, \quad k \in \{0, 1, \ldots, n\} \] The binomial distribution is studied in more detail in the chapter on Bernoulli trials.

It follows from the central limit theorem that if \(n\) is large, the binomial distribution with parameters \(n\) and \(p\) can be approximated by the normal distribution with mean \(n p\) and variance \(n p (1 - p)\). The rule of thumb is that \(n\) should be large enough for \(n p \ge 5\) and \(n (1 - p) \ge 5\). (The first condition is the important one when \(p \lt \frac{1}{2}\) and the second condition is the important one when \(p \gt \frac{1}{2}\).) Here is the precise statement:

Suppose that \( Y_n \) has the binomial distribution with trial parameter \( n \in \N_+ \) and success parameter \( p \in (0, 1) \). Then the distribution of the standardized variable \(Z_n\) given below converges to the standard normal distribution as \(n \to \infty\): \[ Z_n = \frac{Y_n - n p}{\sqrt{n p (1 - p)}} \]

In the binomial timeline experiment, vary \(n\) and \(p\) and note the shape of the probability density function. With \(n = 50\) and \(p = 0.3\), run the simulation 1000 times and compute the following:

\(\P(12 \le Y \le 16)\)
The relative frequency of the event \(\{12 \le Y \le 16\}\) (from the simulation)

Answer

0.5448

Suppose that \(Y\) has the binomial distribution with parameters \(n = 50\) and \(p = 0.3\). Compute the normal approximation to \( \P(12 \le Y \le 16) \) (don't forget the continuity correction) and compare with the results of the previous exercise.

Answer

0.5383
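Both the exact probability and its normal approximation can be computed directly. The following is a minimal sketch with hypothetical names, using the binomial density and the half-unit continuity correction described earlier.

```python
import math
from statistics import NormalDist

n, p = 50, 0.3

def binomial_pdf(k):
    """P(Y = k) for the binomial distribution with parameters n and p."""
    return math.comb(n, k) * p**k * (1 - p)**(n - k)

exact = sum(binomial_pdf(k) for k in range(12, 17))      # P(12 <= Y <= 16)

approx_dist = NormalDist(mu=n * p, sigma=math.sqrt(n * p * (1 - p)))
# Half-unit continuity correction: use {11.5 <= Y <= 16.5}.
approx = approx_dist.cdf(16.5) - approx_dist.cdf(11.5)
print(round(exact, 4), round(approx, 4))
```

Here \(n p = 15\) and \(n (1 - p) = 35\), so the rule of thumb is comfortably satisfied and the two values should be close.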

Normal Approximation to the Poisson Distribution

Recall that the Poisson distribution, named for Simeon Poisson, is a discrete distribution on \( \N \) with probability density function \( f \) given by \[ f(x) = e^{-\theta} \frac{\theta^x}{x!}, \quad x \in \N \] where \(\theta \gt 0\) is a parameter. The parameter is both the mean and the variance of the distribution. The Poisson distribution is widely used to model the number of random points in a region of time or space, and is studied in more detail in the chapter on the Poisson Process. In this context, the parameter is proportional to the size of the region.

Suppose now that \(Y_n\) has the Poisson distribution with parameter \(n \in \N_+\). Then \[ Y_n = \sum_{i=1}^n X_i \] where \((X_1, X_2, \ldots, X_n)\) is a sequence of independent variables, each with the Poisson distribution with parameter 1. It follows from the central limit theorem that if \(n\) is large, the Poisson distribution with parameter \(n\) can be approximated by the normal distribution with mean \(n\) and variance \(n\). The same statement holds when the parameter \(n\) is not an integer. Here is the precise statement:

Suppose that \( Y_\theta \) has the Poisson distribution with parameter \( \theta \in (0, \infty) \). Then the distribution of the standardized variable \( Z_\theta \) below converges to the standard normal distribution as \(\theta \to \infty\):

\[ Z_\theta = \frac{Y_\theta - \theta}{\sqrt{\theta}} \]

Suppose that \(Y\) has the Poisson distribution with mean 20.

Compute the true value of \(\P(16 \le Y \le 23)\).
Compute the normal approximation to \(\P(16 \le Y \le 23)\).

Answer

0.6310
0.6259
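A sketch of both computations follows, with hypothetical names. The exact value sums the Poisson density over \(\{16, 17, \ldots, 23\}\), and the approximation applies the half-unit continuity correction to a normal distribution with mean and variance 20.

```python
import math
from statistics import NormalDist

theta = 20

def poisson_pdf(x):
    """P(Y = x) for the Poisson distribution with parameter theta."""
    return math.exp(-theta) * theta**x / math.factorial(x)

exact = sum(poisson_pdf(x) for x in range(16, 24))        # P(16 <= Y <= 23)

approx_dist = NormalDist(mu=theta, sigma=math.sqrt(theta))
# Continuity correction: use {15.5 <= Y <= 23.5}.
approx = approx_dist.cdf(23.5) - approx_dist.cdf(15.5)
print(round(exact, 4), round(approx, 4))
```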

In the Poisson experiment, vary the time and rate parameters \(t\) and \(r\) (the parameter of the Poisson distribution in the experiment is the product \(r t\)). Note the shape of the probability density function. With \(r = 5\) and \(t = 4\), run the experiment 1000 times and compare the empirical density function to the true probability density function.

Normal Approximation to the Negative Binomial Distribution

The general version of the negative binomial distribution is a discrete distribution on \( \N \), with shape parameter \( k \in (0, \infty) \) and success parameter \( p \in (0, 1) \). The probability density function \( f \) is given by \[ f(n) = \binom{n + k - 1}{n} p^k (1 - p)^n, \quad n \in \N \] The mean is \( k (1 - p) / p \) and the variance is \( k (1 - p) / p^2 \). The negative binomial distribution is studied in more detail in the chapter on Bernoulli trials. If \( k \in \N_+ \), the distribution governs the number of failures \( Y_k \) before success number \( k \) in a sequence of Bernoulli trials with success parameter \( p \). Thus in this case, \[ Y_k = \sum_{i=1}^k X_i \] where \((X_1, X_2, \ldots, X_k)\) is a sequence of independent variables, each having the geometric distribution on \(\N\) with parameter \(p\). (The geometric distribution is a special case of the negative binomial, with parameters 1 and \(p\).) In the context of the Bernoulli trials, \( X_1 \) is the number of failures before the first success, and for \( i \in \{2, 3, \ldots\} \), \(X_i\) is the number of failures between success number \( i - 1 \) and success number \( i \). It follows that if \(k\) is large, the negative binomial distribution can be approximated by the normal distribution. The same statement holds if \( k \) is not an integer. Here is the precise statement:

Suppose that \( Y_k \) has the negative binomial distribution with shape parameter \( k \in (0, \infty) \) and success parameter \( p \in (0, 1) \). Then the distribution of the standardized variable \( Z_k \) below converges to the standard normal distribution as \(k \to \infty\): \[ Z_k = \frac{p Y_k - k(1 - p)}{\sqrt{k (1 - p)}} \]

Another version of the negative binomial distribution is the distribution of the trial number \( V_k \) of success number \( k \in \N_+ \). So \( V_k = k + Y_k \) and \( V_k \) has mean \( k / p \) and variance \( k (1 - p) / p^2 \). The normal approximation applies to the distribution of \( V_k \) as well, if \( k \) is large, and since the distributions are related by a location transformation, the standard scores are the same. That is \[ \frac{p V_k - k}{\sqrt{k (1 - p)}} = \frac{p Y_k - k(1 - p)}{\sqrt{k ( 1 - p)}} \]
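The two versions of the negative binomial distribution, and the moments used in the standard score, can be checked by simulating Bernoulli trials directly. This is a rough sketch. The values \(k = 10\) and \(p = 0.4\), the seed, and the function name are assumptions for illustration.

```python
import random
from statistics import fmean, pvariance

def failures_before_kth_success(k, p, rng):
    """Run Bernoulli trials until success number k; return the number of failures Y_k."""
    successes = failures = 0
    while successes < k:
        if rng.random() < p:
            successes += 1
        else:
            failures += 1
    return failures

k, p = 10, 0.4
rng = random.Random(0)
ys = [failures_before_kth_success(k, p, rng) for _ in range(10_000)]
vs = [k + y for y in ys]                        # trial number V_k of success number k
print(fmean(ys), k * (1 - p) / p)               # empirical and theoretical mean of Y_k
print(fmean(vs), k / p)                         # empirical and theoretical mean of V_k
print(pvariance(ys), k * (1 - p) / p**2)        # variance, shared by Y_k and V_k
```

Since \(V_k\) is \(Y_k\) shifted by the constant \(k\), the two samples have the same variance, which is why a single variance line suffices.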

In the negative binomial experiment, vary \(k\) and \(p\) and note the shape of the probability density function. With \(k = 5\) and \(p = 0.4\), run the experiment 1000 times and compare the empirical density function to the true probability density function.

Suppose that \(Y\) has the negative binomial distribution with trial parameter \(k = 10\) and success parameter \(p = 0.4\). Find normal approximations to each of the following:

\(\P(20 \lt Y \lt 30)\)
The 80th percentile of \(Y\)

Answer

0.6318
30.1

Partial Sums with a Random Number of Terms

Our last topic is a bit more esoteric, but still fits with the general setting of this section. Recall that \(\bs{X} = (X_1, X_2, \ldots)\) is a sequence of independent, identically distributed real-valued random variables with common mean \(\mu\) and variance \(\sigma^2\). Suppose now that \(N\) is a random variable (on the same probability space) taking values in \(\N\), also with finite mean and variance. Then \[ Y_N = \sum_{i=1}^N X_i \] is a random sum of the independent, identically distributed variables. That is, the terms are random of course, but so also is the number of terms \(N\). We are primarily interested in the moments of \(Y_N\).

Independent Number of Terms

Suppose first that \(N\), the number of terms, is independent of \(\bs{X}\), the sequence of terms. Computing the moments of \(Y_N\) is a good exercise in conditional expectation.

The conditional expected value of \(Y_N\) given \(N\), and the expected value of \(Y_N\) are

\(\E(Y_N \mid N) = N \mu\)
\(\E(Y_N) = \E(N) \mu\)

The conditional variance of \(Y_N\) given \(N\) and the variance of \(Y_N\) are

\(\var(Y_N \mid N) = N \sigma^2\)
\(\var(Y_N) = \E(N) \sigma^2 + \var(N) \mu^2\)

Let \(H\) denote the probability generating function of \(N\). Show that the moment generating function of \(Y_N\) is \(H \circ G\).

\(\E(e^{t Y_N} \mid N) = [G(t)]^N\)
\(\E(e^{t Y_N}) = H(G(t))\)

Wald's Equation

The result \(\E(Y_N) = \E(N) \mu\) above generalizes to the case where the random number of terms \(N\) is a stopping time for the sequence \(\bs{X}\). This means that the event \(\{N = n\}\) depends only on (technically, is measurable with respect to) \((X_1, X_2, \ldots, X_n)\) for each \(n \in \N\). The generalization is known as Wald's equation, and is named for Abraham Wald. Stopping times are studied in much more technical detail in the section on Filtrations and Stopping Times.

If \(N\) is a stopping time for \(\bs{X}\) then \(\E(Y_N) = \E(N) \mu\).

Proof

First note that \(Y_N = \sum_{i=1}^\infty X_i \bs{1}(i \le N)\). But \(\{i \le N\} = \{N \lt i\}^c\) depends only on \(\{X_1, \ldots, X_{i-1}\}\) and hence is independent of \(X_i\). Thus \(\E[X_i \bs{1}(i \le N)] = \mu \P(N \ge i)\). Suppose that \(X_i \ge 0\) for each \(i\). Taking expected values term by term gives Wald's equation in this special case. The interchange of sum and expected value is justified by the monotone convergence theorem. Now Wald's equation can be established in general by using the dominated convergence theorem.

An elegant proof of Wald's equation is given in the chapter on Martingales.
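Wald's equation can be illustrated with a stopping time that is clearly not independent of the terms. In the sketch below, a fair die is rolled until the first 6 appears, so \(N\) is the number of rolls and \(Y_N\) is the sum of the scores. The example, the seed, and the function name are assumptions. The only facts used are \(\mu = 3.5\) and \(\E(N) = 6\) for this setup.

```python
import random
from statistics import fmean

def roll_until_six(rng):
    """Roll a fair die until the first 6; return (N, Y_N)."""
    n = total = 0
    while True:
        x = rng.randint(1, 6)
        n += 1
        total += x
        if x == 6:
            return n, total

rng = random.Random(0)
runs = [roll_until_six(rng) for _ in range(20_000)]
mean_n = fmean(n for n, _ in runs)
mean_y = fmean(y for _, y in runs)
# Wald's equation predicts E(Y_N) = E(N) * mu = 6 * 3.5 = 21.
print(mean_n, mean_y, mean_n * 3.5)
```

Here the event \(\{N = n\}\) is determined by the first \(n\) rolls, so \(N\) is a stopping time, even though the last term \(X_N\) is always 6.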

Suppose that the number of customers arriving at a store during a given day has the Poisson distribution with parameter 50. Each customer, independently of the others (and independently of the number of customers), spends an amount of money that is uniformly distributed on the interval \([0, 20]\). Find the mean and standard deviation of the amount of money that the store takes in during a day.

Answer

500, 81.65
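The answer follows from the formulas for \(\E(Y_N)\) and \(\var(Y_N)\) when \(N\) is independent of the terms. The sketch below evaluates those formulas and adds a small simulation as a sanity check. The simulation generates the Poisson number of customers by counting arrivals of a rate-50 Poisson process in one unit of time. That device, the seed, and the names are assumptions of this example.

```python
import math
import random
from statistics import fmean

mean_n, var_n = 50, 50                  # Poisson: mean and variance both equal the parameter
mu, sigma2 = 10, 20**2 / 12             # uniform distribution on [0, 20]

mean_total = mean_n * mu
sd_total = math.sqrt(mean_n * sigma2 + var_n * mu**2)
print(mean_total, round(sd_total, 2))

def daily_revenue(rng):
    """One simulated day: count rate-50 Poisson arrivals in unit time, adding each spend."""
    clock, total = rng.expovariate(50), 0.0
    while clock <= 1.0:
        total += rng.uniform(0, 20)
        clock += rng.expovariate(50)
    return total

rng = random.Random(0)
sims = [daily_revenue(rng) for _ in range(5_000)]
print(fmean(sims))                      # should be near mean_total
```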

When a certain critical component in a system fails, it is immediately replaced by a new, statistically identical component. The components are independent, and the lifetime of each (in hours) is exponentially distributed with scale parameter \(b\). During the life of the system, the number of critical components used has a geometric distribution on \(\N_+\) with parameter \(p\). For the total life of the critical component,

Find the mean.
Find the standard deviation.
Find the moment generating function.
Identify the distribution by name.

Answer

\(b / p\)
\(b / p\)
\(t \mapsto \frac{1}{1 - (b/p)t}\)
Exponential distribution with scale parameter \(b / p\)
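The moment generating function in this answer can be obtained from the composition result \(H \circ G\) above. A sketch of the computation, using the standard generating functions of the geometric distribution on \(\N_+\) with parameter \(p\) and of the exponential distribution with scale parameter \(b\): \[ H(s) = \frac{p s}{1 - (1 - p) s}, \quad G(t) = \frac{1}{1 - b t}, \quad H(G(t)) = \frac{p}{(1 - b t) - (1 - p)} = \frac{p}{p - b t} = \frac{1}{1 - (b / p) t}, \quad t \lt \frac{p}{b} \] This is the moment generating function of the exponential distribution with scale parameter \(b / p\), which identifies the distribution and gives the mean and standard deviation \(b / p\).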