3.8: Convergence in Distribution


\( \newcommand{\vecs}[1]{\overset { \scriptstyle \rightharpoonup} {\mathbf{#1}} } \) \( \newcommand{\vecd}[1]{\overset{-\!-\!\rightharpoonup}{\vphantom{a}\smash {#1}}} \)\(\newcommand{\id}{\mathrm{id}}\) \( \newcommand{\Span}{\mathrm{span}}\) \( \newcommand{\kernel}{\mathrm{null}\,}\) \( \newcommand{\range}{\mathrm{range}\,}\) \( \newcommand{\RealPart}{\mathrm{Re}}\) \( \newcommand{\ImaginaryPart}{\mathrm{Im}}\) \( \newcommand{\Argument}{\mathrm{Arg}}\) \( \newcommand{\norm}[1]{\| #1 \|}\) \( \newcommand{\inner}[2]{\langle #1, #2 \rangle}\) \(\newcommand{\AA}{\unicode[.8,0]{x212B}}\)

\(\renewcommand{\P}{\mathbb{P}}\) \(\newcommand{\R}{\mathbb{R}}\) \(\newcommand{\N}{\mathbb{N}}\) \(\newcommand{\Q}{\mathbb{Q}}\) \( \newcommand{\E}{\mathbb{E}} \) \(\newcommand{\cl}{\text{cl}}\) \(\newcommand{\interior}{\text{int}}\) \(\newcommand{\bs}{\boldsymbol}\)

This section is concerned with the convergence of probability distributions, a topic of basic importance in probability theory. Since we will be almost exclusively concerned with the convergence of sequences of various kinds, it's helpful to introduce the notation \(\N_+^* = \N_+ \cup \{\infty\} = \{1, 2, \ldots\} \cup \{\infty\}\).

Distributions on \((\R, \mathscr R)\)

Definition

We start with the most important and basic setting, the measurable space \((\R, \mathscr R)\), where \(\R\) is the set of real numbers of course, and \(\mathscr R\) is the Borel \(\sigma\)-algebra of subsets of \(\R\). Recall that if \(P\) is a probability measure on \((\R, \mathscr R)\), then the function \(F: \R \to [0, 1]\) defined by \(F(x) = P(-\infty, x]\) for \(x \in \R\) is the (cumulative) distribution function of \(P\). Recall also that \(F\) completely determines \(P\). Here is the definition for convergence of probability measures in this setting:

Suppose \(P_n\) is a probability measure on \((\R, \mathscr R)\) with distribution function \(F_n\) for each \(n \in \N_+^*\). Then \(P_n\) converges (weakly) to \(P_\infty\) as \(n \to \infty\) if \(F_n(x) \to F_\infty(x)\) as \(n \to \infty\) for every \(x \in \R\) where \(F_\infty\) is continuous. We write \(P_n \Rightarrow P_\infty\) as \(n \to \infty\).

Recall that a distribution function \(F\) is continuous at \(x \in \R\) if and only if \(P\{x\} = 0\), so that \(x\) is not an atom of the distribution (a point of positive probability). We will see shortly why this condition on \(F_\infty\) is appropriate. Of course, a probability measure on \((\R, \mathscr R)\) is usually associated with a real-valued random variable for some random experiment that is modeled by a probability space \((\Omega, \mathscr F, \P)\). So to review, \(\Omega\) is the set of outcomes, \(\mathscr F\) is the \(\sigma\)-algebra of events, and \(\P\) is the probability measure on the sample space \((\Omega, \mathscr F)\). If \(X\) is a real-valued random variable defined on the probability space, then the distribution of \(X\) is the probability measure \(P\) on \((\R, \mathscr R)\) defined by \(P(A) = \P(X \in A)\) for \(A \in \mathscr R\), and then of course, the distribution function of \(X\) is the function \(F\) defined by \(F(x) = \P(X \le x)\) for \(x \in \R\). Here is the convergence terminology used in this setting:

Suppose that \(X_n\) is a real-valued random variable with distribution \(P_n\) for each \(n \in \N_+^*\). If \(P_n \Rightarrow P_\infty\) as \(n \to \infty\) then we say that \(X_n\) converges in distribution to \(X_\infty\) as \(n \to \infty\). We write \(X_n \to X_\infty\) as \(n \to \infty\) in distribution.

So if \(F_n\) is the distribution function of \(X_n\) for \(n \in \N_+^*\), then \(X_n \to X_\infty\) as \(n \to \infty\) in distribution if \(F_n(x) \to F_\infty(x)\) at every point \(x \in \R\) where \(F_\infty\) is continuous. On the one hand, the terminology and notation are helpful, since again most probability measures are associated with random variables (and every probability measure can be). On the other hand, the terminology and notation can be a bit misleading since the random variables, as functions, do not converge in any sense, and indeed the random variables need not be defined on the same probability spaces. It is only the distributions that converge. However, often the random variables are defined on the same probability space \((\Omega, \mathscr F, \P)\), in which case we can compare convergence in distribution with the other modes of convergence we have or will study:

Convergence with probability 1
Convergence in probability
Convergence in mean

We will show, in fact, that convergence in distribution is the weakest of all of these modes of convergence. However, strength of convergence should not be confused with importance. Convergence in distribution is one of the most important modes of convergence; the central limit theorem, one of the two fundamental theorems of probability, is a theorem about convergence in distribution.

Preliminary Examples

The examples below show why the definition is given in terms of distribution functions, rather than probability density functions, and why convergence is only required at the points of continuity of the limiting distribution function. Note that the distributions considered are probability measures on \((\R, \mathscr R)\), even though the support of the distribution may be a much smaller subset. For the first example, note that if a deterministic sequence converges in the ordinary calculus sense, then naturally we want the sequence (thought of as random variables) to converge in distribution. Read the proof to understand the example fully.

Suppose that \(x_n \in \R\) for \(n \in \N_+^*\). Define random variable \(X_n = x_n\) with probability 1 for each \(n \in \N_+^*\). Then \(x_n \to x_\infty\) as \(n \to \infty\) if and only if \(X_n \to X_\infty\) as \(n \to \infty\) in distribution.

Proof

For \(n \in \N_+^*\), the CDF \(F_n\) of \(X_n\) is given by \(F_n(x) = 0\) for \(x \lt x_n\) and \(F_n(x) = 1\) for \(x \ge x_n\).

Suppose that \(x_n \to x_\infty\) as \(n \to \infty\). If \(x \lt x_\infty\) then \(x \lt x_n\), and hence \(F_n(x) = 0\), for all but finitely many \(n \in \N_+\), and so \(F_n(x) \to 0\) as \(n \to \infty\). If \(x \gt x_\infty\) then \(x \gt x_n\), and hence \(F_n(x) = 1\), for all but finitely many \(n \in \N_+\), and so \(F_n(x) \to 1\) as \(n \to \infty\). Nothing can be said about the limiting behavior of \(F_n(x_\infty)\) as \(n \to \infty\) without more information. For example, if \(x_n \le x_\infty\) for all but finitely many \(n \in \N_+\) then \(F_n(x_\infty) \to 1\) as \(n \to \infty\). If \(x_n \gt x_\infty\) for all but finitely many \(n \in \N_+\) then \(F_n(x_\infty) \to 0\) as \(n \to \infty\). If \(x_n \lt x_\infty\) for infinitely many \(n \in \N_+\) and \(x_n \gt x_\infty\) for infinitely many \(n \in \N_+\) then \(F_n(x_\infty)\) does not have a limit as \(n \to \infty\). But regardless, we have \(F_n(x) \to F_\infty(x)\) as \(n \to \infty\) for every \(x \in \R\) except perhaps \(x_\infty\), the one point of discontinuity of \(F_\infty\). Hence \(X_n \to X_\infty\) as \(n \to \infty\) in distribution.
Conversely, suppose that \(X_n \to X_\infty\) as \(n \to \infty\) in distribution. If \(x \lt x_\infty\) then \(F_n(x) \to 0\) as \(n \to \infty\) and hence \(x \lt x_n\) for all but finitely many \(n \in \N_+\). If \(x \gt x_\infty\) then \(F_n(x) \to 1\) as \(n \to \infty\) and hence \(x \ge x_n\) for all but finitely many \(n \in \N_+\). So, for every \(\epsilon \gt 0\), \(x_n \in (x_\infty - \epsilon, x_\infty + \epsilon)\) for all but finitely many \(n \in \N_+\), and hence \(x_n \to x_\infty\) as \(n \to \infty\).

The proof is finished, but let's look at the probability density functions to see that these are not the proper objects of study. For \(n \in \N_+^*\), the PDF \(f_n\) of \(X_n\) is given by \(f_n(x_n) = 1\) and \(f_n(x) = 0\) for \(x \in \R \setminus \{x_n\}\). Only when \(x_n = x_\infty\) for all but finitely many \(n \in \N_+\) do we have \(f_n(x) \to f_\infty(x)\) for \(x \in \R\).
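The behavior at the point of discontinuity can be checked numerically. The following is a minimal sketch, assuming the illustrative sequence \(x_n = x_\infty + (-1)^n / n\) with \(x_\infty = 0\); the sequence and the function names are choices made here for illustration and are not part of the text above.

```python
def point_mass_cdf(x_n: float, x: float) -> float:
    """CDF of the point mass at x_n, evaluated at x."""
    return 1.0 if x >= x_n else 0.0

x_inf = 0.0

def x_seq(n: int) -> float:
    # Assumed example sequence: alternates sides of x_inf and converges to it.
    return x_inf + (-1) ** n / n

for x in (-0.1, 0.0, 0.1):
    tail = [point_mass_cdf(x_seq(n), x) for n in range(1000, 1006)]
    print("x =", x, "F_n(x) for n = 1000..1005:", tail,
          "limit CDF:", point_mass_cdf(x_inf, x))
```

Away from \(x_\infty\) the values settle to those of the limiting CDF, while at \(x_\infty\) itself they keep alternating between 0 and 1, which is why the definition excludes points of discontinuity.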

For the example below, recall that \( \Q \) denotes the set of rational numbers. Once again, read the proof to understand the example fully.

For \(n \in \N_+\), let \(P_n\) denote the discrete uniform distribution on \(\left\{\frac{1}{n}, \frac{2}{n}, \ldots, \frac{n-1}{n}, 1\right\}\) and let \(P_\infty\) denote the continuous uniform distribution on the interval \([0, 1]\). Then

\(P_n \Rightarrow P_\infty\) as \(n \to \infty\)
\(P_n(\Q) = 1\) for each \(n \in \N_+\) but \(P_\infty(\Q) = 0\).

Proof

As usual, let \(F_n\) denote the CDF of \(P_n\) for \(n \in \N_+^*\).

For \(n \in \N_+\) note that \( F_n \) is given by \( F_n(x) = \lfloor n \, x \rfloor / n \) for \( x \in [0, 1] \). But \( n \, x - 1 \le \lfloor n \, x \rfloor \le n \, x \) so \( \lfloor n \, x \rfloor / n \to x \) as \( n \to \infty \) for \(x \in [0, 1]\). Of course, \(F_n(x) = 0\) for \(x \lt 0\) and \(F_n(x) = 1\) for \(x \gt 1\). So \(F_n(x) \to F_\infty(x)\) as \(n \to \infty\) for all \(x \in \R\).
Note that by definition, \(P_n\) is supported on a finite set of rational numbers, so \(P_n(\Q) = 1\) for \(n \in \N_+\). On the other hand, \( P_\infty \) is a continuous distribution and \( \Q \) is countable, so \(P_\infty(\Q) = 0\).

The proof is finished, but let's look at the probability density functions. For \(n \in \N_+\), the PDF \(f_n\) of \(P_n\) is given by \(f_n(x) = \frac{1}{n}\) for \(x \in \left\{\frac{1}{n}, \frac{2}{n}, \ldots, \frac{n-1}{n}, 1\right\}\) and \(f_n(x) = 0\) otherwise. Hence \( 0 \le f_n(x) \le \frac{1}{n} \) for \(n \in \N_+\) and \(x \in \R\), so \(f_n(x) \to 0\) as \(n \to \infty\) for every \( x \in \R \).

The point of the example is that it's reasonable for the discrete uniform distribution on \(\left\{\frac{1}{n}, \frac{2}{n}, \ldots, \frac{n-1}{n}, 1\right\}\) to converge to the continuous uniform distribution on \([0, 1]\), but once again, the probability density functions are evidently not the correct objects of study.
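The contrast between distribution functions and density functions in this example can be seen numerically. The sketch below is illustrative only: the function names and the evaluation grid are assumptions made here. It compares \(F_n(x) = \lfloor n x \rfloor / n\) with the uniform CDF on a grid in \([0, 1]\), and prints the common PDF value \(1 / n\) alongside.

```python
import math

def discrete_uniform_cdf(n: int, x: float) -> float:
    """CDF F_n(x) = floor(n x) / n on [0, 1), with 0 below 0 and 1 from 1 on."""
    if x < 0:
        return 0.0
    if x >= 1:
        return 1.0
    return math.floor(n * x) / n

def uniform_cdf(x: float) -> float:
    """CDF of the continuous uniform distribution on [0, 1]."""
    return min(max(x, 0.0), 1.0)

def max_gap(n: int, grid_size: int = 1001) -> float:
    """Largest |F_n(x) - F_inf(x)| over an evenly spaced grid in [0, 1]."""
    grid = [i / (grid_size - 1) for i in range(grid_size)]
    return max(abs(discrete_uniform_cdf(n, x) - uniform_cdf(x)) for x in grid)

for n in (10, 100, 1000):
    print(n, "max CDF gap on grid:", max_gap(n), "PDF value on support:", 1 / n)
```

The CDF gap is bounded by \(1 / n\), in line with the inequality \(n x - 1 \le \lfloor n x \rfloor \le n x\) used in the proof, while the discrete PDF values shrink to 0 and carry no information about the limiting uniform density.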

Probability Density Functions

As the previous example shows, it is quite possible to have a sequence of discrete distributions converge to a continuous distribution (or the other way around). Recall that probability density functions have very different meanings in the discrete and continuous cases: density with respect to counting measure in the first case, and density with respect to Lebesgue measure in the second case. This is another indication that distribution functions, rather than density functions, are the correct objects of study. However, if probability density functions of a fixed type converge then the distributions converge. Recall again that we are thinking of our probability distributions as measures on \((\R, \mathscr R)\) even when supported on a smaller subset.

Convergence in distribution in terms of probability density functions.

Suppose that \(f_n\) is a probability density function for a discrete distribution \(P_n\) on a countable set \(S \subseteq \R\) for each \(n \in \N_+^*\). If \(f_n(x) \to f_\infty(x)\) as \(n \to \infty\) for each \(x \in S\) then \(P_n \Rightarrow P_\infty\) as \(n \to \infty\).
Suppose that \(f_n\) is a probability density function for a continuous distribution \(P_n\) on \(\R\) for each \(n \in \N_+^*\). If \(f_n(x) \to f_\infty(x)\) as \(n \to \infty\) for all \(x \in \R\) (except perhaps on a set with Lebesgue measure 0) then \(P_n \Rightarrow P_\infty\) as \(n \to \infty\).

Proof

Fix \(x \in \R\). Then \(P_n(-\infty, x] = \sum_{y \in S, \, y \le x} f_n(y)\) for \(n \in \N_+\) and \(P_\infty(-\infty, x] = \sum_{y \in S, \, y \le x} f_\infty(y)\). It follows from Scheffé's theorem with the measure space \((S, \mathscr P(S), \#)\) that \(P_n(-\infty, x] \to P_\infty(-\infty, x]\) as \(n \to \infty\).
Fix \(x \in \R\). Then \(P_n(-\infty, x] = \int_{-\infty}^x f_n(y) \, dy\) for \(n \in \N_+\) and \(P_\infty(-\infty, x] = \int_{-\infty}^x f_\infty(y) \, dy\). It follows from Scheffé's theorem with the measure space \((\R, \mathscr R, \lambda)\) that \(P_n(-\infty, x] \to P_\infty(-\infty, x]\) as \(n \to \infty\).

Convergence in Probability

Naturally, we would like to compare convergence in distribution with other modes of convergence we have studied.

Suppose that \(X_n\) is a real-valued random variable for each \(n \in \N_+^*\), all defined on the same probability space. If \(X_n \to X_\infty\) as \(n \to \infty\) in probability then \(X_n \to X_\infty\) as \(n \to \infty\) in distribution.

Proof

Let \(F_n\) denote the distribution function of \(X_n\) for \(n \in \N_+^*\). Fix \(\epsilon \gt 0\). Note first that \(\P(X_n \le x) = \P(X_n \le x, X_\infty \le x + \epsilon) + \P(X_n \le x, X_\infty \gt x + \epsilon) \). Hence \(F_n(x) \le F_\infty(x + \epsilon) + \P\left(\left|X_n - X_\infty\right| \gt \epsilon\right)\). Next, note that \(\P(X_\infty \le x - \epsilon) = \P(X_\infty \le x - \epsilon, X_n \le x) + \P(X_\infty \le x - \epsilon, X_n \gt x)\). Hence \(F_\infty(x - \epsilon) \le F_n(x) + \P\left(\left|X_n - X_\infty\right| \gt \epsilon\right)\). From the last two results it follows that \[ F_\infty(x - \epsilon) - \P\left(\left|X_n - X_\infty\right| \gt \epsilon\right) \le F_n(x) \le F_\infty(x + \epsilon) + \P\left(\left|X_n - X_\infty\right| \gt \epsilon\right) \] Letting \(n \to \infty\) and using convergence in probability gives \[ F_\infty(x - \epsilon) \le \liminf_{n \to \infty} F_n(x) \le \limsup_{n \to \infty} F_n(x) \le F_\infty(x + \epsilon) \] Finally, letting \(\epsilon \downarrow 0\) we see that if \(F_\infty\) is continuous at \(x\) then \(F_n(x) \to F_\infty(x)\) as \(n \to \infty\).

Our next example shows that even when the variables are defined on the same probability space, a sequence can converge in distribution, but not in any other way.

Let \(X\) be an indicator variable with \(\P(X = 0) = \P(X = 1) = \frac{1}{2}\), so that \(X\) is the result of tossing a fair coin. Let \(X_n = 1 - X \) for \(n \in \N_+\). Then

\(X_n \to X\) as \(n \to \infty\) in distribution.
\(\P(X_n \text{ does not converge to } X \text{ as } n \to \infty) = 1\).
\(X_n \) does not converge to \( X \) as \(n \to \infty\) in probability.
\(X_n\) does not converge to \(X\) as \(n \to \infty\) in mean.

Proof

This trivially holds since \(1 - X\) has the same distribution as \(X\).
This follows since \(\left|X_n - X\right| = 1\) for every \(n \in \N_+\).
This follows since \(\P\left(\left|X_n - X\right| \gt \frac{1}{2}\right) = 1\) for each \(n \in \N_+\).
This follows since \(\E\left(\left|X_n - X\right|\right) = 1\) for each \(n \in \N_+\).

The critical fact that makes this counterexample work is that \(1 - X\) has the same distribution as \(X\). Any random variable with this property would work just as well, so if you prefer a counterexample with continuous distributions, let \(X\) have probability density function \(f\) given by \(f(x) = 6 x (1 - x)\) for \(0 \le x \le 1\). The distribution of \(X\) is an example of a beta distribution.
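The coin-toss counterexample is small enough to verify by exact enumeration. The following is a minimal sketch using exact rational arithmetic; the helper name `pmf_of` and the variable names are assumptions introduced here for illustration.

```python
from fractions import Fraction

# Distribution of the indicator variable X (a fair coin).
outcomes = {0: Fraction(1, 2), 1: Fraction(1, 2)}

def pmf_of(transform):
    """Probability mass function of transform(X), by enumerating outcomes of X."""
    pmf = {}
    for x, p in outcomes.items():
        y = transform(x)
        pmf[y] = pmf.get(y, 0) + p
    return pmf

pmf_x = pmf_of(lambda x: x)        # distribution of X
pmf_xn = pmf_of(lambda x: 1 - x)   # distribution of X_n = 1 - X

# P(|X_n - X| > 1/2) and E|X_n - X|, computed on the common probability space.
prob_far = sum(p for x, p in outcomes.items() if abs((1 - x) - x) > Fraction(1, 2))
mean_abs_diff = sum(p * abs((1 - x) - x) for x, p in outcomes.items())

print("same distribution:", pmf_x == pmf_xn)
print("P(|X_n - X| > 1/2) =", prob_far)
print("E|X_n - X| =", mean_abs_diff)
```

The two distributions coincide, so convergence in distribution is trivial, while the probability and the expected absolute difference both equal 1 for every \(n\), so the sequence cannot converge in probability or in mean.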

The following summary gives the implications for the various modes of convergence; no other implications hold in general.

Suppose that \(X_n\) is a real-valued random variable for each \(n \in \N_+^*\), all defined on a common probability space.

If \(X_n \to X_\infty\) as \(n \to \infty\) with probability 1 then \(X_n \to X_\infty\) as \(n \to \infty\) in probability.
If \(X_n \to X_\infty\) as \(n \to \infty\) in mean then \(X_n \to X_\infty\) as \(n \to \infty\) in probability.
If \(X_n \to X_\infty\) as \(n \to \infty\) in probability then \(X_n \to X_\infty\) as \(n \to \infty\) in distribution.

It follows that convergence with probability 1, convergence in probability, and convergence in mean all imply convergence in distribution, so the latter mode of convergence is indeed the weakest. However, our next theorem gives an important converse to the last implication in the summary above, when the limiting variable is a constant. Of course, a constant can be viewed as a random variable defined on any probability space.

Suppose that \(X_n\) is a real-valued random variable for each \(n \in \N_+\), defined on the same probability space, and that \(c \in \R\). If \(X_n \to c\) as \(n \to \infty\) in distribution then \(X_n \to c\) as \(n \to \infty\) in probability.

Proof

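Assume that the probability space is \((\Omega, \mathscr F, \P)\). Note first that \(\P(X_n \le x) \to 0\) as \(n \to \infty\) if \(x \lt c\) and \(\P(X_n \le x) \to 1\) as \(n \to \infty\) if \(x \gt c\), since the distribution function of the constant \(c\) is continuous except at \(c\). Hence for every \(\epsilon \gt 0\), \[ \P\left(\left|X_n - c\right| \le \epsilon\right) \ge \P(X_n \le c + \epsilon) - \P(X_n \le c - \epsilon) \to 1 - 0 = 1 \text{ as } n \to \infty \] so \(X_n \to c\) as \(n \to \infty\) in probability.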

The Skorohod Representation

As noted in the summary above, convergence in distribution does not imply convergence with probability 1, even when the random variables are defined on the same probability space. However, the next theorem, known as the Skorohod representation theorem, gives an important partial result in this direction.

Suppose that \(P_n\) is a probability measure on \((\R, \mathscr R)\) for each \(n \in \N_+^*\) and that \(P_n \Rightarrow P_\infty\) as \(n \to \infty\). Then there exist real-valued random variables \(X_n\) for \(n \in \N_+^*\), defined on the same probability space, such that

\(X_n\) has distribution \(P_n\) for \(n \in \N_+^*\).
\(X_n \to X_\infty\) as \(n \to \infty\) with probability 1.

Proof

Let \((\Omega, \mathscr F, \P)\) be a probability space and \(U\) a random variable defined on this space that is uniformly distributed on the interval \((0, 1)\). For a specific construction, we could take \(\Omega = (0, 1)\), \(\mathscr F\) the \(\sigma\)-algebra of Borel measurable subsets of \((0, 1)\), and \(\P\) Lebesgue measure on \((\Omega, \mathscr F)\) (the uniform distribution on \((0, 1)\)). Then let \(U\) be the identity function on \(\Omega\) so that \(U(\omega) = \omega\) for \(\omega \in \Omega\), so that \(U\) has probability distribution \(\P\). We have seen this construction many times before.

For \(n \in \N_+^*\), let \(F_n\) denote the distribution function of \(P_n\) and define \(X_n = F_n^{-1}(U)\) where \(F_n^{-1}\) is the quantile function of \(F_n\). Recall that \(X_n\) has distribution function \(F_n\) and therefore \(X_n\) has distribution \(P_n\) for \(n \in \N_+^*\). Of course, these random variables are also defined on \((\Omega, \mathscr F, \P)\).
Let \(\epsilon \gt 0\) and let \(u \in (0, 1)\). Pick a continuity point \(x\) of \(F_\infty\) such that \(F_\infty^{-1}(u) - \epsilon \lt x \lt F_\infty^{-1}(u)\). Then \(F_\infty(x) \lt u\) and hence \(F_n(x) \lt u\) for all but finitely many \(n \in \N_+\). It follows that \(F_\infty^{-1}(u) - \epsilon \lt x \lt F_n^{-1}(u)\) for all but finitely many \(n \in \N_+\). Let \(n \to \infty\) and \(\epsilon \downarrow 0\) to conclude that \(F_\infty^{-1}(u) \le \liminf_{n \to \infty} F_n^{-1}(u)\). Next, let \(v\) satisfy \(0 \lt u \lt v \lt 1\) and let \(\epsilon \gt 0\). Pick a continuity point \(x\) of \(F_\infty\) such that \(F_\infty^{-1}(v) \lt x \lt F_\infty^{-1}(v) + \epsilon\). Then \(u \lt v \lt F_\infty(x)\) and hence \(u \lt F_n(x)\) for all but finitely many \(n \in \N_+\). It follows that \(F_n^{-1}(u) \le x \lt F_\infty^{-1}(v) + \epsilon\) for all but finitely many \(n \in \N_+\). Let \(n \to \infty\) and \(\epsilon \downarrow 0\) to conclude that \(\limsup_{n \to \infty} F_n^{-1}(u) \le F_\infty^{-1}(v)\). Letting \(v \downarrow u\) it follows that \(\limsup_{n \to \infty} F_n^{-1}(u) \le F_\infty^{-1}(u)\) if \(u\) is a point of continuity of \(F_\infty^{-1}\). Therefore \(F_n^{-1}(u) \to F_\infty^{-1}(u)\) as \(n \to \infty\) if \(u\) is a point of continuity of \(F_\infty^{-1}\). Recall from analysis that since \(F_\infty^{-1}\) is increasing, the set \(D \subseteq (0, 1)\) of discontinuities of \(F_\infty^{-1}\) is countable. Since \( U \) has a continuous distribution, \(\P(U \in D) = 0\). Finally, it follows that \(\P(X_n \to X_\infty \text{ as } n \to \infty) = 1\).
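The quantile construction in the proof can be tried out on the discrete uniform example from earlier in this section. The sketch below is illustrative only: it assumes the closed forms \(F_n^{-1}(u) = \lceil n u \rceil / n\) for the discrete uniform distribution on \(\left\{\frac{1}{n}, \ldots, 1\right\}\) and \(F_\infty^{-1}(u) = u\) for the uniform distribution on \([0, 1]\), and the function names are chosen here.

```python
import math
import random

def quantile_discrete_uniform(n: int, u: float) -> float:
    """Quantile function of the discrete uniform distribution on {1/n, ..., 1}."""
    return math.ceil(n * u) / n

def quantile_uniform(u: float) -> float:
    """Quantile function of the continuous uniform distribution on [0, 1]."""
    return u

rng = random.Random(0)  # fixed seed so the run is reproducible
for u in [rng.random() for _ in range(3)]:
    x_inf = quantile_uniform(u)
    # X_n(u) = F_n^{-1}(u) for a single outcome u of the common uniform variable U.
    gaps = [abs(quantile_discrete_uniform(n, u) - x_inf) for n in (10, 100, 1000, 10000)]
    print("u =", round(u, 4), "|X_n - X_inf| for n = 10, 100, 1000, 10000:", gaps)
```

For each fixed outcome \(u\), the coupled values \(X_n(u)\) differ from \(X_\infty(u)\) by at most \(1 / n\), so in this example the convergence with probability 1 promised by the theorem holds at every outcome.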

The following theorem illustrates the value of the Skorohod representation and the usefulness of random variable notation for convergence in distribution. The theorem is also quite intuitive, since a basic idea is that continuity should preserve convergence.

Suppose that \(X_n\) is a real-valued random variable for each \(n \in \N_+^*\) (not necessarily defined on the same probability space). Suppose also that \(g: \R \to \R\) is measurable, and let \(D_g\) denote the set of discontinuities of \(g\), and \(P_\infty\) the distribution of \(X_\infty\). If \(X_n \to X_\infty\) as \(n \to \infty\) in distribution and \(P_\infty(D_g) = 0\), then \(g(X_n) \to g(X_\infty)\) as \(n \to \infty\) in distribution.

Proof

By Skorohod's theorem, there exist random variables \(Y_n\) for \(n \in \N_+^*\), defined on the same probability space \((\Omega, \mathscr F, \P)\), such that \(Y_n\) has the same distribution as \(X_n\) for \(n \in \N_+^*\), and \(Y_n \to Y_\infty\) as \(n \to \infty\) with probability 1. Since \(\P(Y_\infty \in D_g) = P_\infty(D_g) = 0\) it follows that \(g(Y_n) \to g(Y_\infty)\) as \(n \to \infty\) with probability 1. Hence by the theorem above, \(g(Y_n) \to g(Y_\infty)\) as \(n \to \infty\) in distribution. But \(g(Y_n)\) has the same distribution as \(g(X_n)\) for each \(n \in \N_+^*\).

As a simple corollary, if \(X_n\) converges to \(X_\infty\) as \(n \to \infty\) in distribution, and if \(a, \, b \in \R\) then \(a + b X_n\) converges to \(a + b X_\infty\) as \(n \to \infty\) in distribution. But we can do a little better:

Suppose that \(X_n\) is a real-valued random variable and that \(a_n, \, b_n \in \R\) for each \(n \in \N_+^*\). If \(X_n \to X_\infty\) as \(n \to \infty\) in distribution and if \(a_n \to a_\infty\) and \(b_n \to b_\infty\) as \(n \to \infty\), then \(a_n + b_n X_n \to a_\infty + b_\infty X_\infty\) as \(n \to \infty\) in distribution.

Proof

Again by Skorohod's theorem, there exist random variables \(Y_n\) for \(n \in \N_+^*\), defined on the same probability space \((\Omega, \mathscr F, \P)\) such that \(Y_n\) has the same distribution as \(X_n\) for \(n \in \N_+^*\) and \(Y_n \to Y_\infty\) as \(n \to \infty\) with probability 1. Hence also \(a_n + b_n Y_n \to a_\infty + b_\infty Y_\infty\) as \(n \to \infty\) with probability 1. By the result above, \(a_n + b_n Y_n \to a_\infty + b_\infty Y_\infty\) as \(n \to \infty\) in distribution. But \(a_n + b_n Y_n\) has the same distribution as \(a_n + b_n X_n\) for \(n \in \N_+^*\).

The definition of convergence in distribution requires that the sequence of probability measures converge on sets of the form \((-\infty, x]\) for \(x \in \R\) when the limiting distribution has probability 0 at \(x\). It turns out that the probability measures will converge on lots of other sets as well, and this result points the way to extending convergence in distribution to more general spaces. To state the result, recall that if \(A\) is a subset of a topological space, then the boundary of \(A\) is \(\partial A = \cl(A) \setminus \interior(A)\) where \(\cl(A)\) is the closure of \(A\) (the smallest closed set that contains \(A\)) and \(\interior(A)\) is the interior of \(A\) (the largest open set contained in \(A\)).

Suppose that \(P_n\) is a probability measure on \((\R, \mathscr R)\) for \(n \in \N_+^*\). Then \(P_n \Rightarrow P_\infty\) as \(n \to \infty\) if and only if \(P_n(A) \to P_\infty(A)\) as \(n \to \infty\) for every \(A \in \mathscr R\) with \(P_\infty(\partial A) = 0\).

Proof

Suppose that \(P_n \Rightarrow P_\infty\) as \(n \to \infty\). Let \(X_n\) be a random variable with distribution \(P_n\) for \(n \in \N_+^*\). (We don't care about the underlying probability spaces.) If \(A \in \mathscr R\) then the set of discontinuities of \(\bs 1_A\), the indicator function of \(A\), is \(\partial A\). So, suppose \(P_\infty(\partial A) = 0\). By the continuity theorem above, \(\bs 1_A(X_n) \to \bs 1_A(X_\infty)\) as \(n \to \infty\) in distribution. Let \(G_n\) denote the CDF of \(\bs 1_A(X_n)\) for \(n \in \N_+^*\). The only possible points of discontinuity of \(G_\infty\) are 0 and 1. Hence \(G_n\left(\frac 1 2\right) \to G_\infty\left(\frac 1 2\right) \) as \(n \to \infty\). But \(G_n\left(\frac 1 2\right) = P_n(A^c)\) for \(n \in \N_+^*\). Hence \(P_n(A^c) \to P_\infty(A^c)\) and so also \(P_n(A) \to P_\infty(A)\) as \(n \to \infty\).

Conversely, suppose that the condition in the theorem holds. If \(x \in \R\), then the boundary of \((-\infty, x]\) is \(\{x\}\), so if \(P_\infty\{x\} = 0\) then \(P_n(-\infty, x] \to P_\infty(-\infty, x]\) as \(n \to \infty\). So by definition, \(P_n \Rightarrow P_\infty\) as \(n \to \infty\).

In the context of this result, suppose that \(a, \, b \in \R\) with \(a \lt b\). If \(P_\infty\{a\} = P_\infty\{b\} = 0\), then as \(n \to \infty\) we have \(P_n(a, b) \to P_\infty(a, b)\), \(P_n[a, b) \to P_\infty[a, b)\), \(P_n(a, b] \to P_\infty(a, b]\), and \(P_n[a, b] \to P_\infty[a, b]\). Of course, the limiting values are all the same.

Examples and Applications

Next we will explore several interesting examples of the convergence of distributions on \((\R, \mathscr R)\). There are several important cases where a special distribution converges to another special distribution as a parameter approaches a limiting value. Indeed, such convergence results are part of the reason why such distributions are special in the first place.

The Hypergeometric Distribution

Recall that the hypergeometric distribution with parameters \(m\), \(r\), and \(n\) is the distribution that governs the number of type 1 objects in a sample of size \(n\), drawn without replacement from a population of \(m\) objects with \(r\) objects of type 1. It has discrete probability density function \(f\) given by \[ f(k) = \frac{\binom{r}{k} \binom{m - r}{n - k}}{\binom{m}{n}}, \quad k \in \{0, 1, \ldots, n\} \] The parameters \(m\), \(r\), and \(n\) are positive integers with \(n \le m\) and \(r \le m\). The hypergeometric distribution is studied in more detail in the chapter on Finite Sampling Models.

Recall next that Bernoulli trials are independent trials, each with two possible outcomes, generically called success and failure. The probability of success \(p \in [0, 1]\) is the same for each trial. The binomial distribution with parameters \(n \in \N_+\) and \(p\) is the distribution of the number of successes in \(n\) Bernoulli trials. This distribution has probability density function \(g\) given by \[ g(k) = \binom{n}{k} p^k (1 - p)^{n - k}, \quad k \in \{0, 1, \ldots, n\} \] The binomial distribution is studied in more detail in the chapter on Bernoulli Trials. Note that the binomial distribution with parameters \(n\) and \(p = r / m\) is the distribution that governs the number of type 1 objects in a sample of size \(n\), drawn with replacement from a population of \(m\) objects with \(r\) objects of type 1. This fact is motivation for the following result:

Suppose that \(r_m \in \{0, 1, \ldots, m\}\) for each \(m \in \N_+\) and that \(r_m / m \to p\) as \(m \to \infty\). For fixed \(n \in \N_+\), the hypergeometric distribution with parameters \(m\), \(r_m\), and \(n\) converges to the binomial distribution with parameters \(n\) and \(p\) as \(m \to \infty\).

Proof

Recall that for \( a \in \R \) and \( j \in \N \), we let \( a^{(j)} = a \, (a - 1) \cdots [a - (j - 1)] \) denote the falling power of \( a \) of order \( j \). The hypergeometric PDF can be written as \[ f_m(k) = \binom{n}{k} \frac{r_m^{(k)} (m - r_m)^{(n - k)}}{m^{(n)}}, \quad k \in \{0, 1, \ldots, n\} \] In the fraction above, the numerator and denominator both have \( n \) factors. Suppose that we group the \( k \) factors in \( r_m^{(k)} \) with the first \( k \) factors of \( m^{(n)} \) and the \( n - k \) factors of \( (m - r_m)^{(n-k)} \) with the last \( n - k \) factors of \( m^{(n)} \) to form a product of \( n \) fractions. The first \( k \) fractions have the form \( (r_m - j) \big/ (m - j) \) for some \( j \) that does not depend on \( m \). Each of these converges to \( p \) as \( m \to \infty \). The last \( n - k \) fractions have the form \( (m - r_m - j) \big/ (m - k - j) \) for some \( j \) that does not depend on \( m \). Each of these converges to \( 1 - p \) as \( m \to \infty \). Hence \[f_m(k) \to \binom{n}{k} p^k (1 - p)^{n-k} \text{ as } m \to \infty \text{ for each } k \in \{0, 1, \ldots, n\}\] The result now follows from the theorem above on density functions.

From a practical point of view, the last result means that if the population size \(m\) is large compared to sample size \(n\), then the hypergeometric distribution with parameters \(m\), \(r\), and \(n\) (which corresponds to sampling without replacement) is well approximated by the binomial distribution with parameters \(n\) and \(p = r / m\) (which corresponds to sampling with replacement). This is often a useful result, not computationally, but rather because the binomial distribution has fewer parameters than the hypergeometric distribution (and often in real problems, the parameters may only be known approximately). Specifically, in the limiting binomial distribution, we do not need to know the population size \(m\) and the number of type 1 objects \(r\) individually, but only in the ratio \(r / m\).
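The quality of the binomial approximation can be checked directly from the two density functions. The following is a minimal sketch, assuming the illustrative values \(n = 5\) and \(p = 0.3\) with \(r_m\) taken as \(p m\) rounded to the nearest integer; these values and the function names are assumptions made here, not part of the result above.

```python
from math import comb

def hypergeom_pmf(m: int, r: int, n: int, k: int) -> float:
    """Hypergeometric PDF: k type 1 objects in a sample of n from m objects, r of type 1."""
    return comb(r, k) * comb(m - r, n - k) / comb(m, n)

def binom_pmf(n: int, p: float, k: int) -> float:
    """Binomial PDF with parameters n and p."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

def max_pmf_gap(m: int, n: int = 5, p: float = 0.3) -> float:
    """Largest pointwise gap between the two PDFs, with r_m = round(p * m)."""
    r = round(p * m)  # assumed choice of r_m, so that r_m / m -> p
    return max(abs(hypergeom_pmf(m, r, n, k) - binom_pmf(n, p, k)) for k in range(n + 1))

for m in (10, 100, 1000, 10000):
    print("m =", m, "max |f_m(k) - g(k)| =", max_pmf_gap(m))
```

With the sample size held fixed, the gap shrinks as the population size \(m\) grows, which is the practical content of the limit theorem.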

In the ball and urn experiment, set \(m = 100\) and \(r = 30\). For each of the following values of \(n\) (the sample size), switch between sampling without replacement (the hypergeometric distribution) and sampling with replacement (the binomial distribution). Note the difference in the probability density functions. Run the simulation 1000 times for each sampling mode and compare the relative frequency function to the probability density function.

The Binomial Distribution

Recall again that the binomial distribution with parameters \(n \in \N_+\) and \(p \in [0, 1]\) is the distribution of the number of successes in \(n\) Bernoulli trials, when \(p\) is the probability of success on a trial. This distribution has probability density function \(f\) given by \[ f(k) = \binom{n}{k} p^k (1 - p)^{n - k}, \quad k \in \{0, 1, \ldots, n\} \] Recall also that the Poisson distribution with parameter \(r \in (0, \infty)\) has probability density function \(g\) given by \[g(k) = e^{-r} \frac{r^k}{k!}, \quad k \in \N\] The distribution is named for Simeon Poisson and governs the number of random points in a region of time or space, under certain ideal conditions. The parameter \(r\) is proportional to the size of the region of time or space. The Poisson distribution is studied in more detail in the chapter on the Poisson Process.

Suppose that \(p_n \in [0, 1]\) for \(n \in \N_+\) and that \(n p_n \to r \in (0, \infty)\) as \(n \to \infty\). Then the binomial distribution with parameters \(n\) and \(p_n\) converges to the Poisson distribution with parameter \(r\) as \(n \to \infty\).

Proof

For \( k, \, n \in \N \) with \( k \le n \), the binomial PDF can be written as \[ f_n(k) = \frac{n^{(k)}}{k!} p_n^k (1 - p_n)^{n - k} = \frac{1}{k!} (n p_n) \left[(n - 1) p_n\right] \cdots \left[(n - k + 1) p_n\right] (1 - p_n)^{n - k} \] First, \(p_n \to 0\) as \(n \to \infty\) since \(n p_n \to r\), and hence \( (n - j) p_n \to r \) as \(n \to \infty\) for fixed \(j \in \N\). Next, by a famous limit from calculus, \( (1 - p_n)^n = (1 - n p_n / n)^n \to e^{-r} \) as \( n \to \infty \). Since \((1 - p_n)^{-k} \to 1\), it follows that \((1 - p_n)^{n-k} \to e^{-r}\) as \(n \to \infty\) for fixed \(k \in \N\). Therefore \(f_n(k) \to e^{-r} r^k / k!\) as \(n \to \infty\) for each \(k \in \N\). The result now follows from the theorem above on density functions.

From a practical point of view, the convergence of the binomial distribution to the Poisson means that if the number of trials \(n\) is large and the probability of success \(p\) small, so that \(n p^2\) is small, then the binomial distribution with parameters \(n\) and \(p\) is well approximated by the Poisson distribution with parameter \(r = n p\). This is often a useful result, again not computationally, but rather because the Poisson distribution has fewer parameters than the binomial distribution (and often in real problems, the parameters may only be known approximately). Specifically, in the approximating Poisson distribution, we do not need to know the number of trials \(n\) and the probability of success \(p\) individually, but only in the product \(n p\). As we will see in the next chapter, the condition that \(n p^2\) be small means that the variance of the binomial distribution, namely \(n p (1 - p) = n p - n p^2\) is approximately \(r = n p\), which is the variance of the approximating Poisson distribution.
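The Poisson approximation can be checked numerically by comparing the two density functions with \(n p\) held fixed. The sketch below is illustrative only: it assumes \(n p = 5\) and a comparison range of \(k \in \{0, 1, \ldots, 50\}\), and the function names are chosen here.

```python
from math import comb, exp, factorial

def binom_pmf(n: int, p: float, k: int) -> float:
    """Binomial PDF with parameters n and p (zero for k > n)."""
    if k > n:
        return 0.0
    return comb(n, k) * p**k * (1 - p) ** (n - k)

def poisson_pmf(r: float, k: int) -> float:
    """Poisson PDF with parameter r."""
    return exp(-r) * r**k / factorial(k)

def max_pmf_gap(n: int, p: float, k_max: int = 50) -> float:
    """Largest pointwise gap between binomial(n, p) and Poisson(n p) PDFs for k <= k_max."""
    r = n * p
    return max(abs(binom_pmf(n, p, k) - poisson_pmf(r, k)) for k in range(k_max + 1))

for n, p in ((10, 0.5), (20, 0.25), (100, 0.05), (1000, 0.005)):
    print("n =", n, "p =", p, "n p^2 =", n * p * p, "max gap =", max_pmf_gap(n, p))
```

The gap shrinks together with \(n p^2\), consistent with the rule of thumb that the approximation is good when \(n\) is large and \(p\) is small.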

In the binomial timeline experiment, set the parameter values as follows, and observe the graph of the probability density function. (Note that \(n p = 5\) in each case.) Run the experiment 1000 times in each case and compare the relative frequency function and the probability density function. Note also the successes represented as random points in discrete time.

\(n = 10\), \(p = 0.5\)
\(n = 20\), \(p = 0.25\)
\(n = 100\), \(p = 0.05\)

In the Poisson experiment, set \(r = 5\) and \(t = 1\), to get the Poisson distribution with parameter 5. Note the shape of the probability density function. Run the experiment 1000 times and compare the relative frequency function to the probability density function. Note the similarity between this experiment and the one in the previous exercise.

The Geometric Distribution

Recall that the geometric distribution on \(\N_+\) with success parameter \(p \in (0, 1]\) has probability density function \(f\) given by \[ f(k) = p (1 - p)^{k-1}, \quad k \in \N_+\] The geometric distribution governs the trial number of the first success in a sequence of Bernoulli trials.

Suppose that \(U\) has the geometric distribution on \(\N_+\) with success parameter \(p \in (0, 1]\). For \( n \in \N_+ \), the conditional distribution of \( U \) given \( U \le n \) converges to the uniform distribution on \(\{1, 2, \ldots, n\}\) as \(p \downarrow 0\).

Proof

The CDF \(F\) of \( U \) is given by \( F(k) = 1 - (1 - p)^k \) for \(k \in \N_+\). Hence for \(n \in \N_+\), the conditional CDF of \( U \) given \( U \le n \) is \[ F_n(k) = \P(U \le k \mid U \le n) = \frac{\P(U \le k)}{\P(U \le n)} = \frac{1 - (1 - p)^k}{1 - (1 - p)^n}, \quad k \in \{1, 2, \ldots, n\} \] L'Hospital's rule gives \( F_n(k) \to k / n \) as \( p \downarrow 0 \) for \(k \in \{1, 2, \ldots, n\}\). As a function of \(k\) this is the CDF of the uniform distribution on \( \{1, 2, \ldots, n\} \).
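The limit in the proof can be checked numerically. The sketch below evaluates the conditional CDF \(F_n(k)\) for decreasing \(p\) and reports its largest distance from \(k / n\). The name `conditional_geometric_cdf` is a hypothetical helper, and the code assumes \(0 \lt p \lt 1\) so that the logarithm is defined.

```python
from math import expm1, log1p

def conditional_geometric_cdf(p, n, k):
    """F_n(k) = [1 - (1 - p)^k] / [1 - (1 - p)^n] for k in {1, ..., n}, with 0 < p < 1.

    Uses 1 - (1 - p)^k = -expm1(k * log1p(-p)) to limit cancellation when p is tiny.
    """
    return expm1(k * log1p(-p)) / expm1(n * log1p(-p))

n = 6
for p in (0.5, 0.1, 0.001, 1e-6):
    gap = max(abs(conditional_geometric_cdf(p, n, k) - k / n) for k in range(1, n + 1))
    print(p, gap)
```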

Next, recall that the exponential distribution with rate parameter \(r \in (0, \infty)\) has distribution function \(G\) given by \[ G(t) = 1 - e^{-r t}, \quad 0 \le t \lt \infty \] The exponential distribution governs the time between arrivals in the Poisson model of random points in time.

Suppose that \(U_n\) has the geometric distribution on \(\N_+\) with success parameter \(p_n \in (0, 1]\) for \(n \in \N_+\), and that \(n p_n \to r \in (0, \infty)\) as \(n \to \infty\). The distribution of \(U_n / n\) converges to the exponential distribution with parameter \(r\) as \(n \to \infty\).

Proof

Let \( F_n \) denote the CDF of \( U_n / n \). Then for \( x \in [0, \infty) \) \[ F_n(x) = \P\left(\frac{U_n}{n} \le x\right) = \P(U_n \le n x) = \P\left(U_n \le \lfloor n x \rfloor\right) = 1 - \left(1 - p_n\right)^{\lfloor n x \rfloor} \] We showed in the proof of the convergence of the binomial distribution that \( (1 - p_n)^n \to e^{-r} \) as \( n \to \infty \), and hence \( \left(1 - p_n\right)^{n x} \to e^{-r x} \) as \( n \to \infty \). But by definition, \( \lfloor n x \rfloor \le n x \lt \lfloor n x \rfloor + 1\) or equivalently, \( n x - 1 \lt \lfloor n x \rfloor \le n x \) so it follows from the squeeze theorem that \( \left(1 - p_n \right)^{\lfloor n x \rfloor} \to e^{- r x} \) as \( n \to \infty \). Hence \( F_n(x) \to 1 - e^{-r x} \) as \( n \to \infty \). As a function of \(x \in [0, \infty)\), this is the CDF of the exponential distribution with parameter \(r\).
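A small numerical illustration of this proof, written as a sketch rather than a definitive implementation: the code evaluates the CDF of \(U_n / n\) with \(p_n = r / n\) on a grid and compares it with the exponential CDF. The grid `xs`, the choice \(r = 2\), and the function names are assumptions made for the example.

```python
from math import exp, floor

def scaled_geometric_cdf(n, p, x):
    """CDF of U_n / n at x >= 0, namely 1 - (1 - p)^floor(n x).

    floor(n * x) is subject to ordinary floating-point rounding when n * x
    is within rounding error of an integer.
    """
    return 1 - (1 - p)**floor(n * x)

def exponential_cdf(r, x):
    """G(x) = 1 - e^(-r x) for x >= 0."""
    return 1 - exp(-r * x)

def max_cdf_gap(n, r, xs):
    """Largest |F_n(x) - G(x)| over the grid xs, with p_n = r / n."""
    p = r / n
    return max(abs(scaled_geometric_cdf(n, p, x) - exponential_cdf(r, x)) for x in xs)

xs = [0.1 * i for i in range(1, 31)]   # grid on (0, 3]
for n in (10, 100, 10_000):
    print(n, max_cdf_gap(n, 2.0, xs))
```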

Note that the limiting condition on \(n\) and \(p\) in the last result is precisely the same as the condition for the convergence of the binomial distribution to the Poisson distribution. For a deeper interpretation of both of these results, see the section on the Poisson distribution.

In the negative binomial experiment, set \(k = 1\) to get the geometric distribution. Then decrease the value of \(p\) and note the shape of the probability density function. With \(p = 0.5\) run the experiment 1000 times and compare the relative frequency function to the probability density function.

In the gamma experiment, set \(k = 1\) to get the exponential distribution, and set \(r = 5\). Note the shape of the probability density function. Run the experiment 1000 times and compare the empirical density function and the probability density function. Compare this experiment with the one in the previous exercise, and note the similarity, up to a change in scale.

The Matching Distribution

For \(n \in \N_+\), consider a random permutation \((X_1, X_2, \ldots, X_n)\) of the elements in the set \(\{1, 2, \ldots, n\}\). We say that a match occurs at position \(i\) if \(X_i = i\).

\(\P\left(X_i = i\right) = \frac{1}{n}\) for each \(i \in \{1, 2, \ldots, n\}\).

Proof

The number of permutations of \(\{1, 2, \ldots, n\}\) is \(n!\). For \(i \in \{1, 2, \ldots, n\}\), the number of such permutations with \(i\) in position \(i\) is \((n - 1)!\). Hence \(\P(X_i = i) = (n - 1)! / n! = 1 / n\). A more direct argument is that \(i\) is no more or less likely to end up in position \(i\) than any other number.

So the matching events all have the same probability, which varies inversely with the number of trials.

\(\P\left(X_i = i, X_j = j\right) = \frac{1}{n (n - 1)}\) for \(i, \, j \in \{1, 2, \ldots, n\}\) with \(i \ne j\).

Proof

Again, the number of permutations of \(\{1, 2, \ldots, n\}\) is \(n!\). For distinct \(i, \, j \in \{1, 2, \ldots, n\}\), the number of such permutations with \(i\) in position \(i\) and \(j\) in position \(j\) is \((n - 2)!\). Hence \(\P(X_i = i, X_j = j) = (n - 2)! / n! = 1 / [n (n - 1)]\).

So the matching events are dependent, and in fact are positively correlated. In particular, the matching events do not form a sequence of Bernoulli trials. The matching problem is studied in detail in the chapter on Finite Sampling Models. In that section we show that the number of matches \(N_n\) has probability density function \(f_n\) given by: \[ f_n(k) = \frac{1}{k!} \sum_{j=0}^{n-k} \frac{(-1)^j}{j!}, \quad k \in \{0, 1, \ldots, n\} \]

The distribution of \(N_n\) converges to the Poisson distribution with parameter 1 as \(n \to \infty\).

Proof

For \( k \in \N \), \[ f_n(k) = \frac{1}{k!} \sum_{j=0}^{n-k} \frac{(-1)^j}{j!} \to \frac{1}{k!} \sum_{j=0}^\infty \frac{(-1)^j}{j!} = \frac{1}{k!} e^{-1} \] As a function of \(k \in \N\), this is the PDF of the Poisson distribution with parameter 1. So the result follows from the theorem above on density functions.
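The convergence is very fast, as a direct computation suggests. The following sketch evaluates the matching density \(f_n\) from the formula above and compares it with the Poisson density with parameter 1. The function names are illustrative and not taken from the text.

```python
from math import exp, factorial

def matching_pdf(n, k):
    """f_n(k) = (1 / k!) * sum_{j=0}^{n-k} (-1)^j / j! for k in {0, ..., n}."""
    return sum((-1)**j / factorial(j) for j in range(n - k + 1)) / factorial(k)

def poisson1_pdf(k):
    """PDF of the Poisson distribution with parameter 1: e^(-1) / k!."""
    return exp(-1) / factorial(k)

def max_matching_gap(n):
    """Largest |f_n(k) - e^(-1) / k!| over k = 0..n."""
    return max(abs(matching_pdf(n, k) - poisson1_pdf(k)) for k in range(n + 1))

for n in (3, 5, 10, 20):
    print(n, max_matching_gap(n))
```

Note that `matching_pdf(n, n - 1)` is 0 for every \(n \ge 2\), reflecting the fact that a permutation cannot have exactly \(n - 1\) matches.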

In the matching experiment, increase \(n\) and note the apparent convergence of the probability density function for the number of matches. With selected values of \(n\), run the experiment 1000 times and compare the relative frequency function and the probability density function.

The Extreme Value Distribution

Suppose that \((X_1, X_2, \ldots)\) is a sequence of independent random variables, each with the standard exponential distribution (parameter 1). Thus, recall that the common distribution function \(G\) is given by \[ G(x) = 1 - e^{-x}, \quad 0 \le x \lt \infty \]

As \(n \to \infty\), the distribution of \(Y_n = \max\{X_1, X_2, \ldots, X_n\} - \ln n \) converges to the distribution with distribution function \(F\) given by \[ F(x) = e^{-e^{-x}}, \quad x \in \R\]

Proof

Let \( X_{(n)} = \max\{X_1, X_2, \ldots, X_n\} \) and recall that \( X_{(n)} \) has CDF \( G^n \). Let \( F_n \) denote the CDF of \( Y_n \). For \( x \in \R \) and \( n \) large enough that \( x + \ln n \ge 0 \), \[ F_n(x) = \P(Y_n \le x) = \P\left(X_{(n)} \le x + \ln n \right) = G^n(x + \ln n) = \left[1 - e^{-(x + \ln n) }\right]^n = \left(1 - \frac{e^{-x}}{n} \right)^n \] By our famous limit from calculus again, \( F_n(x) \to e^{-e^{-x}} \) as \( n \to \infty \).
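To see the limit numerically, the sketch below evaluates \(F_n\) from the proof and compares it with the limiting distribution function. The helper names are hypothetical. The code returns 0 when \(x + \ln n \lt 0\), since \(G\) vanishes on the negative half-line.

```python
from math import exp, log

def shifted_max_cdf(n, x):
    """F_n(x) = P(max{X_1, ..., X_n} - ln n <= x) for standard exponential X_i."""
    if x < -log(n):
        return 0.0                      # G(x + ln n) = 0 when x + ln n < 0
    return max(0.0, 1 - exp(-x) / n)**n

def limit_cdf(x):
    """F(x) = exp(-exp(-x)), the limiting distribution function."""
    return exp(-exp(-x))

xs = [-2.0, -1.0, 0.0, 1.0, 2.0, 4.0]
for n in (5, 50, 5000):
    print(n, max(abs(shifted_max_cdf(n, x) - limit_cdf(x)) for x in xs))
```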

The limiting distribution in the previous result is the standard extreme value distribution, also known as the standard Gumbel distribution in honor of Emil Gumbel. Extreme value distributions are studied in detail in the chapter on Special Distributions.

The Pareto Distribution

Recall that the Pareto distribution with shape parameter \(a \in (0, \infty)\) has distribution function \(F\) given by \[F(x) = 1 - \frac{1}{x^a}, \quad 1 \le x \lt \infty\] The Pareto distribution, named for Vilfredo Pareto, is a heavy-tailed distribution sometimes used to model financial variables. It is studied in more detail in the chapter on Special Distributions.

Suppose that \(X_n\) has the Pareto distribution with parameter \(n\) for each \(n \in \N_+\). Then

\(X_n \to 1\) as \(n \to \infty\) in distribution (and hence also in probability).
The distribution of \(Y_n = nX_n - n\) converges to the standard exponential distribution as \(n \to \infty\).

Proof

The CDF of \( X_n \) is \( F_n(x) = 1 - 1 / x^n \) for \( x \ge 1 \). Hence \( F_n(x) = 0 \) for \( n \in \N_+ \) and \( x \le 1 \) while \( F_n(x) \to 1 \) as \( n \to \infty \) for \( x \gt 1 \). Thus the limit of \( F_n \) agrees with the CDF of the constant 1, except at \(x = 1\), the point of discontinuity.
Let \( G_n \) denote the CDF of \( Y_n \). For \( x \ge 0 \), \[ G_n(x) = \P(Y_n \le x) = \P(X_n \le 1 + x / n) = 1 - \frac{1}{(1 + x / n)^n} \] By our famous theorem from calculus again, it follows that \( G_n(x) \to 1 - 1 / e^x = 1 - e^{-x} \) as \( n \to \infty \). As a function of \(x \in [0, \infty)\), this is the CDF of the standard exponential distribution.
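Both parts of the result can be illustrated with a few lines of Python. This is a sketch under the stated formulas only, and the function names `pareto_cdf` and `scaled_pareto_cdf` are assumptions for the example.

```python
from math import exp

def pareto_cdf(a, x):
    """F(x) = 1 - 1 / x^a for x >= 1, and 0 for x < 1."""
    return 1 - x**(-a) if x >= 1 else 0.0

def scaled_pareto_cdf(n, x):
    """G_n(x) = P(n X_n - n <= x) = 1 - (1 + x / n)^(-n) for x >= 0."""
    return pareto_cdf(n, 1 + x / n) if x >= 0 else 0.0

for n in (1, 10, 100, 5000):
    # Part (a): F_n(1) stays at 0 while F_n(1.01) climbs toward 1.
    # Part (b): G_n(1) approaches 1 - e^(-1).
    print(n, pareto_cdf(n, 1.0), pareto_cdf(n, 1.01), scaled_pareto_cdf(n, 1.0), 1 - exp(-1))
```

The column for \(F_n(1)\) shows why convergence is only required at continuity points of the limiting distribution function: the limit of \(F_n(1)\) is 0, but the CDF of the constant 1 equals 1 at \(x = 1\).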

Fundamental Theorems

The two fundamental theorems of basic probability theory, the law of large numbers and the central limit theorem, are studied in detail in the chapter on Random Samples. For this reason we will simply state the results in this section. So suppose that \((X_1, X_2, \ldots)\) is a sequence of independent, identically distributed, real-valued random variables (defined on the same probability space) with mean \(\mu \in (-\infty, \infty)\) and standard deviation \(\sigma \in (0, \infty)\). For \(n \in \N_+\), let \( Y_n = \sum_{i=1}^n X_i \) denote the sum of the first \(n\) variables, \( M_n = Y_n \big/n \) the average of the first \( n \) variables, and \( Z_n = (Y_n - n \mu) \big/ (\sqrt{n} \sigma) \) the standard score of \( Y_n \).

The fundamental theorems of probability

\( M_n \to \mu \) as \( n \to \infty \) with probability 1 (and hence also in probability and in distribution). This is the law of large numbers.
The distribution of \( Z_n \) converges to the standard normal distribution as \( n \to \infty \). This is the central limit theorem.

In part (a), convergence with probability 1 is the strong law of large numbers while convergence in probability and in distribution are the weak laws of large numbers.
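Both theorems can be illustrated exactly, without simulation, in one simple special case. The sketch below assumes that the \(X_i\) are fair Bernoulli variables, so that \(\mu = \sigma = 1/2\) and \(Y_n\) is binomial. This choice and the function names are assumptions made for the example, not part of the text. The code illustrates only the weak law (convergence in probability) and the central limit theorem.

```python
from fractions import Fraction
from math import comb, erf, sqrt

def standard_normal_cdf(z):
    """Phi(z), computed from the error function."""
    return 0.5 * (1 + erf(z / sqrt(2)))

def lln_tail(n, eps):
    """P(|M_n - 1/2| > eps) for n fair Bernoulli trials, using exact counting."""
    eps = Fraction(eps)
    bad = sum(comb(n, k) for k in range(n + 1) if abs(Fraction(k, n) - Fraction(1, 2)) > eps)
    return bad / 2**n

def clt_sup_gap(n):
    """sup over z of |P(Z_n <= z) - Phi(z)| for n fair Bernoulli trials.

    The CDF of Z_n is a step function, so the supremum is attained at a jump,
    either at the value there or at the left limit.
    """
    cum, gap = 0, 0.0                       # cum counts outcomes with Y_n <= k
    for k in range(n + 1):
        z = (k - n * 0.5) / (sqrt(n) * 0.5)
        phi = standard_normal_cdf(z)
        left = cum / 2**n                   # P(Z_n < z)
        cum += comb(n, k)
        right = cum / 2**n                  # P(Z_n <= z)
        gap = max(gap, abs(left - phi), abs(right - phi))
    return gap

for n in (10, 100, 1000):
    print(n, lln_tail(n, Fraction(1, 20)), clt_sup_gap(n))
```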

General Spaces

Our next goal is to define convergence of probability distributions on more general measurable spaces. For this discussion, you may need to refer to other sections in this chapter: the integral with respect to a positive measure, properties of the integral, and density functions. In turn, these sections depend on measure theory developed in the chapters on Foundations and Probability Measures.

Definition and Basic Properties

First we need to define the type of measurable spaces that we will use in this subsection.

We assume that \((S, d)\) is a complete, separable metric space and let \(\mathscr S\) denote the Borel \(\sigma\)-algebra of subsets of \(S\), that is, the \(\sigma\)-algebra generated by the topology. The standard spaces that we often use are special cases of the measurable space \((S, \mathscr S)\):

Discrete: \(S\) is countable and is given the discrete metric so \(\mathscr S\) is the collection of all subsets of \(S\).
Euclidean: \(\R^n\) is given the standard Euclidean metric so \(\mathscr R_n\) is the usual \(\sigma\)-algebra of Borel measurable subsets of \(\R^n\).

Additional details

Recall that the metric space \((S, d)\) is complete if every Cauchy sequence in \(S\) converges to a point in \(S\). The space is separable if there exists a countable subset that is dense. A complete, separable metric space is sometimes called a Polish space because such spaces were extensively studied by a group of Polish mathematicians in the 1930s, including Kazimierz Kuratowski.

As suggested by our setup, the definition for convergence in distribution involves both measure theory and topology. The motivation is the theorem above for the one-dimensional Euclidean space \((\R, \mathscr R)\).

Convergence in distribution:

Suppose that \(P_n\) is a probability measure on \((S, \mathscr S)\) for each \(n \in \N_+^*\). Then \(P_n\) converges (weakly) to \(P_\infty\) as \(n \to \infty\) if \(P_n(A) \to P_\infty(A)\) as \(n \to \infty\) for every \(A \in \mathscr S\) with \(P_\infty(\partial A) = 0\). We write \(P_n \Rightarrow P_\infty\) as \(n \to \infty\).
Suppose that \(X_n\) is a random variable with distribution \(P_n\) on \((S, \mathscr S)\) for each \(n \in \N_+^*\). Then \(X_n\) converges in distribution to \(X_\infty\) as \(n \to \infty\) if \(P_n \Rightarrow P_\infty\) as \(n \to \infty\). We write \(X_n \to X_\infty\) as \(n \to \infty\) in distribution.

Notes

The definition makes sense since \(A \in \mathscr S\) implies \(\partial A \in \mathscr S\). Specifically, \(\cl(A) \in \mathscr S\) because \(\cl(A)\) is closed, and \(\interior(A) \in \mathscr S\) because \(\interior(A)\) is open.
The random variables need not be defined on the same probability space.

Let's consider our two special cases. In the discrete case, as usual, the measure theory and topology are not really necessary.

Suppose that \(P_n\) is a probability measure on a discrete space \((S, \mathscr S)\) for each \(n \in \N_+^*\). Then \(P_n \Rightarrow P_\infty\) as \(n \to \infty\) if and only if \(P_n(A) \to P_\infty(A)\) as \(n \to \infty\) for every \(A \subseteq S\).

Proof

This follows from the definition. Every subset is both open and closed so \(\partial A = \emptyset\) for every \(A \subseteq S\).

In the Euclidean case, it suffices to consider distribution functions, as in the one-dimensional case. If \(P\) is a probability measure on \((\R^n, \mathscr R_n)\), recall that the distribution function \(F\) of \(P\) is given by \[F(x_1, x_2, \ldots, x_n) = P\left((-\infty, x_1] \times (-\infty, x_2] \times \cdots \times (-\infty, x_n]\right), \quad (x_1, x_2, \ldots, x_n) \in \R^n\]

Suppose that \(P_n\) is a probability measure on \((\R^n, \mathscr R_n)\) with distribution function \(F_n\) for each \(n \in \N_+^*\). Then \(P_n \Rightarrow P_\infty\) as \(n \to \infty\) if and only if \(F_n(\bs x) \to F_\infty(\bs x)\) as \(n \to \infty\) for every \(\bs x \in \R^n\) where \(F_\infty\) is continuous.

Convergence in Probability

As in the case of \((\R, \mathscr R)\), convergence in probability implies convergence in distribution.

Suppose that \(X_n\) is a random variable with values in \(S\) for each \(n \in \N_+^*\), all defined on the same probability space. If \(X_n \to X_\infty\) as \(n \to \infty\) in probability then \(X_n \to X_\infty\) as \(n \to \infty\) in distribution.

Notes

Assume that the common probability space is \((\Omega, \mathscr F, \P)\). Recall that convergence in probability means that \(\P[d(X_n, X_\infty) \gt \epsilon] \to 0\) as \(n \to \infty\) for every \(\epsilon \gt 0\).

So as before, convergence with probability 1 implies convergence in probability which in turn implies convergence in distribution.

Skorohod's Representation Theorem

As you might guess, Skorohod's theorem for the one-dimensional Euclidean space \((\R, \mathscr R)\) can be extended to the more general spaces. However the proof is not nearly as straightforward, because we no longer have the quantile function for constructing random variables on a common probability space.

Suppose that \(P_n\) is a probability measure on \((S, \mathscr S)\) for each \(n \in \N_+^*\) and that \(P_n \Rightarrow P_\infty\) as \(n \to \infty\). Then there exists a random variable \(X_n\) with values in \(S\) for each \(n \in \N_+^*\), defined on a common probability space, such that

\(X_n\) has distribution \(P_n\) for \(n \in \N_+^*\)
\(X_n \to X_\infty\) as \(n \to \infty\) with probability 1.

One of the main consequences of Skorohod's representation, the preservation of convergence in distribution under continuous functions, is still true and has essentially the same proof. For the general setup, suppose that \((S, d, \mathscr S)\) and \((T, e, \mathscr T)\) are spaces of the type described above.

Suppose that \(X_n\) is a random variable with values in \(S\) for each \(n \in \N_+^*\) (not necessarily defined on the same probability space). Suppose also that \(g: S \to T\) is measurable, and let \(D_g\) denote the set of discontinuities of \(g\), and \(P_\infty\) the distribution of \(X_\infty\). If \(X_n \to X_\infty\) as \(n \to \infty\) in distribution and \(P_\infty(D_g) = 0\), then \(g(X_n) \to g(X_\infty)\) as \(n \to \infty\) in distribution.

Proof

By Skorohod's theorem, there exist random variables \(Y_n\) with values in \(S\) for \(n \in \N_+^*\), defined on the same probability space \((\Omega, \mathscr F, \P)\), such that \(Y_n\) has the same distribution as \(X_n\) for \(n \in \N_+^*\), and \(Y_n \to Y_\infty\) as \(n \to \infty\) with probability 1. Since \(\P(Y_\infty \in D_g) = P_\infty(D_g) = 0\) it follows that \(g(Y_n) \to g(Y_\infty)\) as \(n \to \infty\) with probability 1. Hence \(g(Y_n) \to g(Y_\infty)\) as \(n \to \infty\) in distribution. But \(g(Y_n)\) has the same distribution as \(g(X_n)\) for each \(n \in \N_+^*\).

A simple consequence of the continuity theorem is that if a sequence of random vectors in \(\R^n\) converges in distribution, then the sequence of each coordinate also converges in distribution. Let's just consider the two-dimensional case to keep the notation simple.

Suppose that \((X_n, Y_n)\) is a random variable with values in \(\R^2\) for \(n \in \N_+^*\) and that \((X_n, Y_n) \to (X_\infty, Y_\infty)\) as \(n \to \infty\) in distribution. Then

\(X_n \to X_\infty\) as \(n \to \infty\) in distribution.
\(Y_n \to Y_\infty\) as \(n \to \infty\) in distribution.

Scheffé's Theorem

Our next discussion concerns an important result known as Scheffé's theorem, named after Henry Scheffé. To state our theorem, suppose that \( (S, \mathscr S, \mu) \) is a measure space, so that \( S \) is a set, \( \mathscr S \) is a \( \sigma \)-algebra of subsets of \( S \), and \( \mu \) is a positive measure on \( (S, \mathscr S) \). Further, suppose that \( P_n \) is a probability measure on \( (S, \mathscr S) \) that has density function \( f_n \) with respect to \( \mu \) for each \( n \in \N_+ \), and that \( P \) is a probability measure on \( (S, \mathscr S) \) that has density function \( f \) with respect to \( \mu \).

If \(f_n(x) \to f(x)\) as \(n \to \infty\) for almost all \( x \in S \) (with respect to \( \mu \)) then \(P_n(A) \to P(A)\) as \(n \to \infty\) uniformly in \(A \in \mathscr S\).

Proof

From basic properties of the integral it follows that for \( A \in \mathscr S \), \[\left|P(A) - P_n(A)\right| = \left|\int_A f \, d\mu - \int_A f_n \, d\mu \right| = \left| \int_A (f - f_n) \, d\mu\right| \le \int_A \left|f - f_n\right| \, d\mu \le \int_S \left|f - f_n\right| \, d\mu\] Let \(g_n = f - f_n\), and let \(g_n^+\) denote the positive part of \(g_n\) and \(g_n^-\) the negative part of \(g_n\). Note that \(g_n^+ \le f\) and \(g_n^+ \to 0\) as \(n \to \infty\) almost everywhere on \( S \). Since \( f \) is a probability density function, it is trivially integrable, so by the dominated convergence theorem, \(\int_S g_n^+ \, d\mu \to 0\) as \(n \to \infty\). But \(\int_S g_n \, d\mu = 0\) so \(\int_S g_n^+ \, d\mu = \int_S g_n^- \, d\mu\). Therefore \(\int_S \left|g_n\right| \, d\mu = 2 \int_S g_n^+ d\mu \to 0\) as \(n \to \infty\). Hence \(P_n(A) \to P(A)\) as \(n \to \infty\) uniformly in \(A \in \mathscr S\).

Of course, the most important special cases of Scheffé's theorem are to discrete distributions and to continuous distributions on a subset of \( \R^n \), as in the theorem above on density functions.
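For a discrete illustration of the uniform bound in the proof, take \(\mu\) to be counting measure on \(\N\), \(f_n\) the binomial density with parameters \(n\) and \(r / n\), and \(f\) the Poisson density with parameter \(r\). The sketch below approximates \(\int_S |f - f_n| \, d\mu\) by a finite sum and checks that \(|P(A) - P_n(A)|\) stays below it for a few events \(A\). The cutoff `k_max`, the choice \(r = 5\), the particular events, and all function names are assumptions made for the example.

```python
from math import comb, exp, factorial

def binomial_pdf(n, p, k):
    """f_n(k) = C(n, k) p^k (1 - p)^(n - k) for k in {0, ..., n}, and 0 otherwise."""
    return comb(n, k) * p**k * (1 - p)**(n - k) if 0 <= k <= n else 0.0

def poisson_pdf(r, k):
    """f(k) = e^(-r) r^k / k! for k in {0, 1, ...}."""
    return exp(-r) * r**k / factorial(k)

def l1_distance(n, r, k_max=80):
    """Sum of |f(k) - f_n(k)| over k = 0..k_max (the mass beyond k_max is negligible for r = 5)."""
    return sum(abs(poisson_pdf(r, k) - binomial_pdf(n, r / n, k)) for k in range(k_max + 1))

def event_gap(n, r, event, k_max=80):
    """|P(A) - P_n(A)| for A = {k <= k_max : event(k) is true}."""
    diff = sum(poisson_pdf(r, k) - binomial_pdf(n, r / n, k)
               for k in range(k_max + 1) if event(k))
    return abs(diff)

events = {
    "even": lambda k: k % 2 == 0,
    "at least 8": lambda k: k >= 8,
    "exactly 5": lambda k: k == 5,
}
for n in (10, 100, 1000):
    bound = l1_distance(n, 5.0)
    print(n, bound, {name: event_gap(n, 5.0, ev) for name, ev in events.items()})
```

The single number printed as `bound` controls every event at once, which is the sense in which the convergence is uniform in \(A\).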

Expected Value

Generating functions are studied in the chapter on Expected Value. In part, the importance of generating functions stems from the fact that ordinary (pointwise) convergence of a sequence of generating functions corresponds to the convergence of the distributions in the sense of this section. Often it is easier to show convergence in distribution using generating functions than directly from the definition.

In addition, convergence in distribution has elegant characterizations in terms of the convergence of the expected values of certain types of functions of the underlying random variables.