4.7: Conditional Expected Value
As usual, our starting point is a random experiment modeled by a probability space \((\Omega, \mathscr F, \P)\). To review, \( \Omega \) is the set of outcomes, \( \mathscr F \) is the collection of events, and \( \P \) is the probability measure on the sample space \((\Omega, \mathscr F)\). Suppose next that \(X\) is a random variable taking values in a set \(S\) and that \(Y\) is a random variable taking values in \(T \subseteq \R\). We assume that either \(Y\) has a discrete distribution, so that \(T\) is countable, or that \(Y\) has a continuous distribution so that \(T\) is an interval (or perhaps a union of intervals). In this section, we will study the conditional expected value of \(Y\) given \(X\), a concept of fundamental importance in probability. As we will see, the expected value of \(Y\) given \(X\) is the function of \(X\) that best approximates \(Y\) in the mean square sense. Note that \(X\) is a general random variable, not necessarily real-valued, but as usual, we will assume that either \(X\) has a discrete distribution, so that \(S\) is countable, or that \(X\) has a continuous distribution on \(S \subseteq \R^n\) for some \(n \in \N_+\). In the latter case, \(S\) is typically a region defined by inequalities involving elementary functions. We will also assume that all expected values that are mentioned exist (as real numbers).
Basic Theory
Definitions
Note that we can think of \((X, Y)\) as a random variable that takes values in the Cartesian product set \(S \times T\). We need to recall some basic facts from our work with joint distributions and conditional distributions.
We assume that \( (X, Y) \) has joint probability density function \( f \) and we let \(g\) denote the (marginal) probability density function of \( X \). Recall that if \(Y\) has a discrete distribution then \[ g(x) = \sum_{y \in T} f(x, y), \quad x \in S \] and if \(Y\) has a continuous distribution then \[ g(x) = \int_T f(x, y) \, dy, \quad x \in S \] In either case, for \( x \in S \), the conditional probability density function of \( Y \) given \( X = x \) is defined by \[ h(y \mid x) = \frac{f(x, y)}{g(x)}, \quad y \in T \]
We are now ready for the basic definitions:
For \( x \in S \), the conditional expected value of \(Y\) given \(X = x\) is simply the mean computed relative to the conditional distribution. So if \(Y\) has a discrete distribution then \[\E(Y \mid X = x) = \sum_{y \in T} y h(y \mid x), \quad x \in S\] and if \(Y\) has a continuous distribution then \[ \E(Y \mid X = x) = \int_T y h(y \mid x) \, dy, \quad x \in S \]
- The function \(v: S \to \R\) defined by \( v(x) = \E(Y \mid X = x)\) for \( x \in S \) is the regression function of \(Y\) based on \(X\).
- The random variable \(v(X)\) is called the conditional expected value of \(Y\) given \(X\) and is denoted \(\E(Y \mid X)\).
Intuitively, we treat \(X\) as known, and therefore not random, and we then average \(Y\) with respect to the probability distribution that remains. The advanced section on conditional expected value gives a much more general definition that unifies the definitions given here for the various distribution types.
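As a concrete illustration of the discrete case, the following Python sketch computes the marginal density \(g\) and the regression function \(v\) by direct summation. The joint density `f` is a made-up example on \(S = \{0, 1\}\) and \(T = \{0, 1, 2\}\), and the function names are chosen here for illustration only.

```python
from fractions import Fraction

# Hypothetical joint density f(x, y) of (X, Y), chosen only for illustration.
f = {
    (0, 0): Fraction(1, 8), (0, 1): Fraction(1, 8), (0, 2): Fraction(1, 4),
    (1, 0): Fraction(1, 4), (1, 1): Fraction(1, 8), (1, 2): Fraction(1, 8),
}

def marginal_x(f):
    """g(x) = sum over y of f(x, y)."""
    g = {}
    for (x, y), p in f.items():
        g[x] = g.get(x, 0) + p
    return g

def regression_function(f):
    """v(x) = E(Y | X = x) = sum over y of y * h(y | x), with h(y | x) = f(x, y) / g(x)."""
    g = marginal_x(f)
    v = {}
    for (x, y), p in f.items():
        v[x] = v.get(x, 0) + y * p / g[x]
    return v

g = marginal_x(f)            # g(0) = g(1) = 1/2
v = regression_function(f)   # v(0) = 5/4, v(1) = 3/4
print(g)
print(v)
```

The random variable \(\E(Y \mid X)\) is then obtained by evaluating `v` at the observed value of \(X\).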
Properties
The most important property of the random variable \(\E(Y \mid X)\) is given in the following theorem. In a sense, this result states that \( \E(Y \mid X) \) behaves just like \( Y \) in terms of other functions of \( X \), and is essentially the only function of \( X \) with this property.
The fundamental property
- \( \E\left[r(X) \E(Y \mid X)\right] = \E\left[r(X) Y\right] \) for every function \( r: S \to \R \).
- If \(u: S \to \R\) satisfies \(\E[r(X) u(X)] = \E[r(X) Y]\) for every \(r: S \to \R\) then \( \P\left[u(X) = \E(Y \mid X)\right] = 1 \).
Proof
We give the proof in the continuous case. The discrete case is analogous, with sums replacing integrals.
- From the change of variables theorem for expected value, \begin{align} \E\left[r(X) \E(Y \mid X)\right] & = \int_S r(x) \E(Y \mid X = x) g(x) \, dx = \int_S r(x) \left(\int_T y h(y \mid x) \, dy \right) g(x) \, dx\\ & = \int_S \int_T r(x) y h(y \mid x) g(x) \, dy \, dx = \int_{S \times T} r(x) y f(x, y) \, d(x, y) = \E[r(X) Y] \end{align}
- Suppose that \( u_1: S \to \R \) and \( u_2: S \to \R \) satisfy the condition in (b). Define \(r: S \to \R\) by \(r(x) = \bs 1[u_1(x) \gt u_2(x)]\). Then by assumption, \(\E\left[r(X) u_1(X)\right] = \E\left[r(X) Y\right] = \E\left[r(X) u_2(X)\right]\). But if \( \P\left[u_1(X) \gt u_2(X)\right] \gt 0 \) then \( \E\left[r(X) u_1(X)\right] \gt \E\left[r(X) u_2(X)\right] \), a contradiction. Hence we must have \( \P\left[u_1(X) \gt u_2(X)\right] = 0 \) and by a symmetric argument, \( \P[u_1(X) \lt u_2(X)] = 0 \).
Two random variables that are equal with probability 1 are said to be equivalent. We often think of equivalent random variables as being essentially the same object, so the fundamental property above essentially characterizes \( \E(Y \mid X) \). That is, we can think of \( \E(Y \mid X) \) as any random variable that is a function of \( X \) and satisfies this property. Moreover the fundamental property can be used as a definition of conditional expected value, regardless of the type of the distribution of \((X, Y)\). If you are interested, read the more advanced treatment of conditional expected value.
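When \(S\) is finite, part (a) of the fundamental property can be checked by brute force. The sketch below does this with exact rational arithmetic for a made-up joint density `f`; the helper names `expect` and `property_holds` are illustrative assumptions, not notation from the text.

```python
from fractions import Fraction

# Hypothetical joint density of (X, Y), chosen only for illustration.
f = {
    (0, 0): Fraction(1, 8), (0, 1): Fraction(1, 8), (0, 2): Fraction(1, 4),
    (1, 0): Fraction(1, 4), (1, 1): Fraction(1, 8), (1, 2): Fraction(1, 8),
}

# Marginal density g of X and regression function v(x) = E(Y | X = x).
g = {}
for (x, y), p in f.items():
    g[x] = g.get(x, 0) + p
v = {a: sum(y * p for (x, y), p in f.items() if x == a) / g[a] for a in g}

def expect(fn):
    """E[fn(X, Y)] under the joint density f."""
    return sum(fn(x, y) * p for (x, y), p in f.items())

def property_holds(r):
    """Check E[r(X) E(Y | X)] = E[r(X) Y] for one test function r."""
    return expect(lambda x, y: r(x) * v[x]) == expect(lambda x, y: r(x) * y)

# Every function on a finite S is a linear combination of point indicators,
# so by linearity it suffices to test those; two other functions are added anyway.
test_functions = [lambda x, a=a: 1 if x == a else 0 for a in g]
test_functions += [lambda x: x ** 2 + 3, lambda x: (-1) ** x]
all_hold = all(property_holds(r) for r in test_functions)
print(all_hold)
```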
Suppose that \( X \) is also real-valued. Recall that the best linear predictor of \( Y \) based on \( X \) was characterized by property (a), but with just two functions: \( r(x) = 1 \) and \( r(x) = x \). Thus the characterization in the fundamental property is certainly reasonable, since (as we show below) \( \E(Y \mid X) \) is the best predictor of \( Y \) among all functions of \( X \), not just linear functions.
The basic property is also very useful for establishing other properties of conditional expected value. Our first consequence is the fact that \( Y \) and \( \E(Y \mid X) \) have the same mean.
\(\E\left[\E(Y \mid X)\right] = \E(Y)\).
Proof
Let \(r\) be the constant function 1 in the basic property.
Aside from the theoretical interest, this theorem is often a good way to compute \(\E(Y)\) when we know the conditional distribution of \(Y\) given \(X\). We say that we are computing the expected value of \(Y\) by conditioning on \(X\).
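Here is a small sketch of conditioning in action, under an assumed two-stage experiment that is not part of the text: \(X\) is the score of a fair die, and given \(X = x\), \(Y\) is the number of heads in \(x\) tosses of a fair coin. The code computes \(\E(Y)\) both by conditioning on \(X\) and directly from the joint density.

```python
from fractions import Fraction
from math import comb

# Assumed experiment: X is a fair die score; given X = x, Y is binomial(x, 1/2).
g = {x: Fraction(1, 6) for x in range(1, 7)}

def h(y, x):
    """Conditional density of Y given X = x."""
    return Fraction(comb(x, y), 2 ** x)

# Regression function v(x) = E(Y | X = x), which works out to x / 2.
v = {x: sum(y * h(y, x) for y in range(x + 1)) for x in g}

# E(Y) by conditioning on X: E[v(X)].
mean_by_conditioning = sum(v[x] * g[x] for x in g)

# E(Y) directly from the joint density f(x, y) = g(x) h(y | x).
mean_direct = sum(y * g[x] * h(y, x) for x in g for y in range(x + 1))

print(mean_by_conditioning, mean_direct)  # both equal 7/4
```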
For many basic properties of ordinary expected value, there are analogous results for conditional expected value. We start with the two most important, which every type of expected value must satisfy: linearity and monotonicity. In the following two theorems, the random variables \( Y \) and \( Z \) are real-valued, and as before, \( X \) is a general random variable.
Linear Properties
- \(\E(Y + Z \mid X) = \E(Y \mid X) + \E(Z \mid X)\).
- \(\E(c \, Y \mid X) = c \, \E(Y \mid X)\) for every constant \(c \in \R\).
Proof
- Note that \( \E(Y \mid X) + \E(Z \mid X) \) is a function of \( X \). If \( r: S \to \R \) then \[ \E\left(r(X) \left[\E(Y \mid X) + \E(Z \mid X)\right]\right) = \E\left[r(X) \E(Y \mid X)\right] + \E\left[r(X) \E(Z \mid X)\right] = \E\left[r(X) Y\right] + \E\left[r(X) Z\right] = \E\left[r(X) (Y + Z)\right] \] Hence the result follows from the basic property.
- Note that \( c \E(Y \mid X) \) is a function of \( X \). If \( r: S \to \R \) then \[ \E\left[r(X) c \E(Y \mid X)\right] = c \E\left[r(X) \E(Y \mid X)\right] = c \E\left[r(X) Y\right] = \E\left[r(X) (c Y)\right] \] Hence the result follows from the basic property.
Part (a) is the additive property and part (b) is the scaling property. The scaling property will be significantly generalized below in (8).
Positive and Increasing Properties
- If \(Y \ge 0\) then \(\E(Y \mid X) \ge 0\).
- If \(Y \le Z\) then \(\E(Y \mid X) \le \E(Z \mid X)\).
- \( \left|\E(Y \mid X)\right| \le \E\left(\left|Y\right| \mid X\right)\)
Proof
- This follows directly from the definition.
- Note that if \( Y \le Z \) then \( Z - Y \ge 0 \) so by (a) and linearity, \[ \E(Z - Y \mid X) = \E(Z \mid X) - \E(Y \mid X) \ge 0 \]
- Note that \( -\left|Y\right| \le Y \le \left|Y\right| \) and hence by (b) and linearity, \(-\E\left(\left|Y\right| \mid X \right) \le \E(Y \mid X) \le \E\left(\left|Y\right| \mid X\right)\).
Our next few properties relate to the idea that \( \E(Y \mid X) \) is the expected value of \( Y \) given \( X \). The first property is essentially a restatement of the fundamental property.
If \(r: S \to \R\), then \(Y - \E(Y \mid X)\) and \(r(X)\) are uncorrelated.
Proof
Note that \( Y - \E(Y \mid X) \) has mean 0 by the mean property. Hence, by the basic property, \[ \cov\left[Y - \E(Y \mid X), r(X)\right] = \E\left\{\left[Y - \E(Y \mid X)\right] r(X)\right\} = \E\left[Y r(X)\right] - \E\left[\E(Y \mid X) r(X)\right] = 0 \]
The next result states that any (deterministic) function of \(X\) acts like a constant in terms of the conditional expected value with respect to \(X\).
If \(s: S \to \R\) then \[ \E\left[s(X)\,Y \mid X\right] = s(X)\,\E(Y \mid X) \]
Proof
Note that \( s(X) \E(Y \mid X) \) is a function of \( X \). If \( r: S \to \R \) then, applying the basic property to the product function \( r s \), \[ \E\left[r(X) s(X) \E(Y \mid X)\right] = \E\left[r(X) s(X) Y\right] \] So the result now follows from the basic property.
The following rule generalizes theorem (8) and is sometimes referred to as the substitution rule for conditional expected value.
If \(s: S \times T \to \R\) then \[ \E\left[s(X, Y) \mid X = x\right] = \E\left[s(x, Y) \mid X = x\right] \]
In particular, it follows from (8) that \(\E[s(X) \mid X] = s(X)\). At the opposite extreme, we have the next result: If \(X\) and \(Y\) are independent, then knowledge of \(X\) gives no information about \(Y\) and so the conditional expected value with respect to \(X\) reduces to the ordinary (unconditional) expected value of \(Y\).
If \(X\) and \(Y\) are independent then \[ \E(Y \mid X) = \E(Y) \]
Proof
Trivially, \( \E(Y) \) is a (constant) function of \( X \). If \( r: S \to \R \) then \( \E\left[\E(Y) r(X)\right] = \E(Y) \E\left[r(X)\right] = \E\left[Y r(X)\right] \), the last equality by independence. Hence the result follows from the basic property.
Suppose now that \(Z\) is real-valued and that \(X\) and \(Y\) are random variables (all defined on the same probability space, of course). The following theorem gives a consistency condition of sorts. Iterated conditional expected values reduce to a single conditional expected value with respect to the minimum amount of information. For simplicity, we write \( \E(Z \mid X, Y) \) rather than \( \E\left[Z \mid (X, Y)\right] \).
Consistency
- \(\E\left[\E(Z \mid X, Y) \mid X\right] = \E(Z \mid X)\)
- \(\E\left[\E(Z \mid X) \mid X, Y\right] = \E(Z \mid X)\)
Proof
- Suppose that \( X \) takes values in \( S \) and \( Y \) takes values in \( T \), so that \( (X, Y) \) takes values in \( S \times T \). By definition, \( \E(Z \mid X) \) is a function of \( X \). If \( r: S \to \R \) then trivially \( r \) can be thought of as a function on \( S \times T \) as well. Hence \[ \E\left[r(X) \E(Z \mid X)\right] = \E\left[r(X) Z\right] = \E\left[r(X) \E(Z \mid X, Y)\right] \] It follows from the basic property that \(\E\left[\E(Z \mid X, Y) \mid X\right] = \E(Z \mid X) \).
- Note that since \( \E(Z \mid X) \) is a function of \( X \), it is trivially a function of \( (X, Y) \). Hence from (8), \( \E\left[\E(Z \mid X) \mid X, Y\right] = \E(Z \mid X) \).
Finally we show that \( \E(Y \mid X) \) has the same covariance with \( X \) as does \( Y \), not surprising since again, \( \E(Y \mid X) \) behaves just like \( Y \) in its relations with \( X \).
\(\cov\left[X, \E(Y \mid X)\right] = \cov(X, Y)\).
Proof
\( \cov\left[X, \E(Y \mid X)\right] = \E\left[X \E(Y \mid X)\right] - \E(X) \E\left[\E(Y \mid X)\right] \). But \( \E\left[X \E(Y \mid X)\right] = \E(X Y) \) by the basic property, and \( \E\left[\E(Y \mid X)\right] = \E(Y) \) by the mean property. Hence \( \cov\left[X, \E(Y \mid X)\right] = \E(X Y) - \E(X) \E(Y) = \cov(X, Y) \).
Conditional Probability
The conditional probability of an event \(A\), given random variable \(X\) (as above), can be defined as a special case of the conditional expected value. As usual, let \(\bs 1_A\) denote the indicator random variable of \(A\).
If \(A\) is an event, define \[ \P(A \mid X) = \E\left(\bs{1}_A \mid X\right) \]
Here is the fundamental property for conditional probability:
The fundamental property
- \( \E\left[r(X) \P(A \mid X)\right] = \E\left[r(X) \bs{1}_A\right] \) for every function \( r: S \to \R \).
- If \( u: S \to \R \) and \( u(X) \) satisfies \( \E[r(X) u(X)] = \E\left[r(X) \bs 1_A\right] \) for every function \( r: S \to \R \), then \( \P\left[u(X) = \P(A \mid X)\right] = 1 \).
For example, suppose that \( X \) has a discrete distribution on a countable set \( S \) with probability density function \( g \). Then (a) becomes \[ \sum_{x \in S} r(x) \P(A \mid X = x) g(x) = \sum_{x \in S} r(x) \P(A, X = x) \] But this is obvious since \( \P(A \mid X = x) = \P(A, X = x) \big/ \P(X = x) \) and \( g(x) = \P(X = x) \). Similarly, if \( X \) has a continuous distribution on \( S \subseteq \R^n \) then (a) states that \[ \E\left[r(X) \bs{1}_A\right] = \int_S r(x) \P(A \mid X = x) g(x) \, dx \]
The properties above for conditional expected value, of course, have special cases for conditional probability.
\(\P(A) = \E\left[\P(A \mid X)\right]\).
Proof
This is a direct result of the mean property, since \( \E(\bs{1}_A) = \P(A) \).
Again, the previous result is often a good way to compute \(\P(A)\) when we know the conditional probability of \(A\) given \(X\). We say that we are computing the probability of \(A\) by conditioning on \(X\). This is a very compact and elegant version of the conditioning result given first in the section on Conditional Probability in the chapter on Probability Spaces and later in the section on Discrete Distributions in the chapter on Distributions.
The following result gives the conditional version of the axioms of probability.
Axioms of probability
- \( \P(A \mid X) \ge 0 \) for every event \( A \).
- \( \P(\Omega \mid X) = 1 \)
- If \( \{A_i: i \in I\} \) is a countable collection of disjoint events then \( \P\left(\bigcup_{i \in I} A_i \bigm| X\right) = \sum_{i \in I} \P(A_i \mid X)\).
Details
There are some technical issues involving the countable additivity property (c). The conditional probabilities are random variables, and so for a given collection \(\{A_i: i \in I\}\), the left and right sides are the same with probability 1. We will return to this point in the more advanced section on conditional expected value.
From the last result, it follows that other standard probability rules hold for conditional probability given \( X \). These results include
- the complement rule
- the increasing property
- Boole's inequality
- Bonferroni's inequality
- the inclusion-exclusion laws
The Best Predictor
The next result shows that, of all functions of \(X\), \(\E(Y \mid X)\) is closest to \(Y\), in the sense of mean square error. This is fundamentally important in statistical problems where the predictor vector \(X\) can be observed but not the response variable \(Y\). In this subsection and the next, we assume that the real-valued random variables have finite variance.
If \(u: S \to \R\), then
- \(\E\left(\left[\E(Y \mid X) - Y\right]^2\right) \le \E\left(\left[u(X) - Y\right]^2\right)\)
- Equality holds in (a) if and only if \(u(X) = \E(Y \mid X)\) with probability 1.
Proof
- Note that \begin{align} \E\left(\left[Y - u(X)\right]^2\right) & = \E\left(\left[Y - \E(Y \mid X) + \E(Y \mid X) - u(X)\right]^2\right) \\ & = \E\left(\left[Y - \E(Y \mid X)\right]^2 \right) + 2 \E\left(\left[Y - \E(Y \mid X)\right] \left[\E(Y \mid X) - u(X)\right]\right) + \E\left(\left[\E(Y \mid X) - u(X)\right]^2\right) \end{align} But \( Y - \E(Y \mid X) \) has mean 0, so the middle term on the right is \( 2 \cov\left[Y - \E(Y \mid X), \E(Y \mid X) - u(X)\right] \). Moreover, \( \E(Y \mid X) - u(X) \) is a function of \( X \) and hence is uncorrelated with \( Y - \E(Y \mid X) \) by the general uncorrelated property. Hence the middle term is 0, so \[ \E\left(\left[Y - u(X)\right]^2\right) = \E\left(\left[Y - \E(Y \mid X)\right]^2 \right) + \E\left(\left[\E(Y \mid X) - u(X)\right]^2\right) \] and therefore \( \E\left(\left[Y - \E(Y \mid X)\right]^2 \right) \le \E\left(\left[Y - u(X)\right]^2\right) \).
- Equality holds if and only if \( \E\left(\left[\E(Y \mid X) - u(X)\right]^2\right) = 0 \), if and only if \( \P\left[u(X) = \E(Y \mid X)\right] = 1 \).
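The decomposition used in the proof can be checked exactly on a small discrete example. In the sketch below, the joint density `f` and the competing predictor `u` are arbitrary choices made for illustration; the code compares the two mean square errors and confirms that they differ by \(\E\left(\left[\E(Y \mid X) - u(X)\right]^2\right)\).

```python
from fractions import Fraction

# Hypothetical joint density of (X, Y), chosen only for illustration.
f = {
    (0, 0): Fraction(1, 8), (0, 1): Fraction(1, 8), (0, 2): Fraction(1, 4),
    (1, 0): Fraction(1, 4), (1, 1): Fraction(1, 8), (1, 2): Fraction(1, 8),
}

# Marginal density g of X and regression function v(x) = E(Y | X = x).
g = {}
for (x, y), p in f.items():
    g[x] = g.get(x, 0) + p
v = {a: sum(y * p for (x, y), p in f.items() if x == a) / g[a] for a in g}

def mse(u):
    """Mean square error E([u(X) - Y]^2) of a predictor given as a dict x -> u(x)."""
    return sum((u[x] - y) ** 2 * p for (x, y), p in f.items())

# An arbitrary competing predictor: the constant function 1.
u = {0: Fraction(1), 1: Fraction(1)}

# E([E(Y | X) - u(X)]^2), the extra error incurred by using u instead of v.
gap = sum((v[x] - u[x]) ** 2 * g[x] for x in g)

print(mse(v), mse(u), gap)  # 11/16, 3/4, 1/16
```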
Suppose now that \(X\) is real-valued. In the section on covariance and correlation, we found that the best linear predictor of \(Y\) given \(X\) is
\[ L(Y \mid X) = \E(Y) + \frac{\cov(X,Y)}{\var(X)} \left[X - \E(X)\right] \]
On the other hand, \(\E(Y \mid X)\) is the best predictor of \(Y\) among all functions of \(X\). It follows that if \(\E(Y \mid X)\) happens to be a linear function of \(X\) then it must be the case that \(\E(Y \mid X) = L(Y \mid X)\). However, we will give a direct proof also:
If \(\E(Y \mid X) = a + b X\) for constants \(a\) and \(b\) then \( \E(Y \mid X) = L(Y \mid X) \); that is,
- \(b = \cov(X,Y) \big/ \var(X) \)
- \(a = \E(Y) - \E(X) \cov(X,Y) \big/ \var(X) \)
Proof
First, \( \E(Y) = \E\left[\E(Y \mid X)\right] = a + b \E(X) \), so \( a = \E(Y) - b \E(X) \). Next, \( \cov(X, Y) = \cov[X, \E(Y \mid X)] = \cov(X, a + b X) = b \var(X) \) and therefore \( b = \cov(X, Y) \big/ \var(X) \).
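For a quick check of this result, consider an assumed example that is not from the text: \(X\) is the score of a fair die and, given \(X = x\), \(Y\) is the number of heads in \(x\) fair coin tosses, so that \(\E(Y \mid X) = X / 2\) is linear in \(X\). The sketch computes the coefficients of \(L(Y \mid X)\) from the moments and compares them with the regression function.

```python
from fractions import Fraction
from math import comb

# Assumed joint density f(x, y) = g(x) h(y | x): fair die, then binomial(x, 1/2).
f = {(x, y): Fraction(1, 6) * Fraction(comb(x, y), 2 ** x)
     for x in range(1, 7) for y in range(x + 1)}

def expect(fn):
    """E[fn(X, Y)] under the joint density f."""
    return sum(fn(x, y) * p for (x, y), p in f.items())

mean_x = expect(lambda x, y: x)
mean_y = expect(lambda x, y: y)
var_x = expect(lambda x, y: x * x) - mean_x ** 2
cov_xy = expect(lambda x, y: x * y) - mean_x * mean_y

# Coefficients of the best linear predictor L(Y | X) = a + b X.
b = cov_xy / var_x
a = mean_y - b * mean_x

# Regression function v(x) = E(Y | X = x), computed from the joint density.
g = {x: Fraction(1, 6) for x in range(1, 7)}
v = {x0: sum(y * p for (x, y), p in f.items() if x == x0) / g[x0] for x0 in g}

print(a, b)  # 0 and 1/2, matching E(Y | X) = X / 2
```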
Conditional Variance
The conditional variance of \( Y \) given \( X \) is defined like the ordinary variance, but with all expected values conditioned on \( X \).
The conditional variance of \(Y\) given \(X\) is defined as \[ \var(Y \mid X) = \E\left(\left[Y - \E(Y \mid X)\right]^2 \biggm| X \right) \]
Thus, \( \var(Y \mid X) \) is a function of \( X \), and in particular, is a random variable. Our first result is a computational formula that is analogous to the one for standard variance—the variance is the mean of the square minus the square of the mean, but now with all expected values conditioned on \( X \):
\(\var(Y \mid X) = \E\left(Y^2 \mid X\right) - \left[\E(Y \mid X)\right]^2\).
Proof
Expanding the square in the definition and using basic properties of conditional expectation, we have
\begin{align} \var(Y \mid X) & = \E\left(Y^2 - 2 Y \E(Y \mid X) + \left[\E(Y \mid X)\right]^2 \biggm| X \right) = \E(Y^2 \mid X) - 2 \E\left[Y \E(Y \mid X) \mid X\right] + \E\left(\left[\E(Y \mid X)\right]^2 \mid X\right) \\ & = \E\left(Y^2 \mid X\right) - 2 \E(Y \mid X) \E(Y \mid X) + \left[\E(Y \mid X)\right]^2 = \E\left(Y^2 \mid X\right) - \left[\E(Y \mid X)\right]^2 \end{align}
Our next result shows how to compute the ordinary variance of \( Y \) by conditioning on \( X \).
\(\var(Y) = \E\left[\var(Y \mid X)\right] + \var\left[\E(Y \mid X)\right]\).
Proof
From the previous theorem and properties of conditional expected value we have \( \E\left[\var(Y \mid X)\right] = \E\left(Y^2\right) - \E\left(\left[\E(Y \mid X)\right]^2\right) \). But \( \E\left(Y^2\right) = \var(Y) + \left[\E(Y)\right]^2 \) and similarly, \(\E\left(\left[\E(Y \mid X)\right]^2\right) = \var\left[\E(Y \mid X)\right] + \left(\E\left[\E(Y \mid X)\right]\right)^2 \). But also, \( \E\left[\E(Y \mid X)\right] = \E(Y) \) so substituting we get \( \E\left[\var(Y \mid X)\right] = \var(Y) - \var\left[\E(Y \mid X)\right] \).
Thus, the variance of \( Y \) is the expected conditional variance plus the variance of the conditional expected value. This result is often a good way to compute \(\var(Y)\) when we know the conditional distribution of \(Y\) given \(X\). With the help of (21) we can give a formula for the mean square error when \(\E(Y \mid X)\) is used as a predictor of \(Y\).
Mean square error \[ \E\left(\left[Y - \E(Y \mid X)\right]^2\right) = \var(Y) - \var\left[\E(Y \mid X)\right] \]
Proof
From the definition of conditional variance, and using the mean property and the variance formula, we have \[ \E\left(\left[Y - \E(Y \mid X)\right]^2\right) = \E\left[\var(Y \mid X)\right] = \var(Y) - \var\left[\E(Y \mid X)\right] \]
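The variance decomposition and the mean square error formula can both be verified exactly on a discrete example. The sketch below again uses an assumed experiment (a fair die score \(X\), then \(Y\) binomial with parameters \(X\) and \(\frac{1}{2}\)); the variable names are illustrative.

```python
from fractions import Fraction
from math import comb

# Assumed joint density f(x, y) = g(x) h(y | x): fair die, then binomial(x, 1/2).
g = {x: Fraction(1, 6) for x in range(1, 7)}
f = {(x, y): g[x] * Fraction(comb(x, y), 2 ** x)
     for x in g for y in range(x + 1)}

# Conditional mean and conditional variance of Y given X = x.
cond_mean = {a: sum(y * p for (x, y), p in f.items() if x == a) / g[a] for a in g}
cond_var = {a: sum(y * y * p for (x, y), p in f.items() if x == a) / g[a]
            - cond_mean[a] ** 2 for a in g}

# The two terms on the right side of the variance decomposition.
expected_cond_var = sum(cond_var[a] * g[a] for a in g)
var_cond_mean = (sum(cond_mean[a] ** 2 * g[a] for a in g)
                 - sum(cond_mean[a] * g[a] for a in g) ** 2)

# Ordinary variance of Y, computed directly from the joint density.
mean_y = sum(y * p for (x, y), p in f.items())
var_y = sum(y * y * p for (x, y), p in f.items()) - mean_y ** 2

# Mean square error when E(Y | X) is used as a predictor of Y.
mse = sum((y - cond_mean[x]) ** 2 * p for (x, y), p in f.items())

print(var_y, expected_cond_var, var_cond_mean, mse)  # 77/48, 7/8, 35/48, 7/8
```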
Let us return to the study of predictors of the real-valued random variable \(Y\), and compare the three predictors we have studied in terms of mean square error.
Suppose that \( Y \) is a real-valued random variable.
- The best constant predictor of \(Y\) is \(\E(Y)\) with mean square error \(\var(Y)\).
- If \(X\) is another real-valued random variable, then the best linear predictor of \(Y\) given \(X\) is \[ L(Y \mid X) = \E(Y) + \frac{\cov(X,Y)}{\var(X)}\left[X - \E(X)\right] \] with mean square error \(\var(Y)\left[1 - \cor^2(X,Y)\right]\).
- If \(X\) is a general random variable, then the best overall predictor of \(Y\) given \(X\) is \(\E(Y \mid X)\) with mean square error \(\var(Y) - \var\left[\E(Y \mid X)\right]\).
Conditional Covariance
Suppose that \( Y \) and \( Z \) are real-valued random variables, and that \( X \) is a general random variable, all defined on our underlying probability space. Analogous to variance, the conditional covariance of \( Y \) and \( Z \) given \( X \) is defined like the ordinary covariance, but with all expected values conditioned on \( X \).
The conditional covariance of \(Y\) and \( Z \) given \(X\) is defined as \[ \cov(Y, Z \mid X) = \E\left([Y - \E(Y \mid X)] [Z - \E(Z \mid X)] \biggm| X \right) \]
Thus, \( \cov(Y, Z \mid X) \) is a function of \( X \), and in particular, is a random variable. Our first result is a computational formula that is analogous to the one for standard covariance—the covariance is the mean of the product minus the product of the means, but now with all expected values conditioned on \( X \):
\(\cov(Y, Z \mid X) = \E\left(Y Z \mid X\right) - \E(Y \mid X) \E(Z \mid X)\).
Proof
Expanding the product in the definition and using basic properties of conditional expectation, we have
\begin{align} \cov(Y, Z \mid X) & = \E\left(Y Z - Y \E(Z \mid X) - Z \E(Y \mid X) + \E(Y \mid X) \E(Z \mid X) \biggm| X \right) = \E(Y Z \mid X) - \E\left[Y \E(Z \mid X) \mid X\right] - \E\left[Z \E(Y \mid X) \mid X\right] + \E\left[\E(Y \mid X) \E(Z \mid X) \mid X\right] \\ & = \E\left(Y Z \mid X\right) - \E(Y \mid X) \E(Z \mid X) - \E(Y \mid X) \E(Z \mid X) + \E(Y \mid X) \E(Z \mid X) = \E\left(Y Z \mid X\right) - \E(Y \mid X) \E(Z \mid X) \end{align}
Our next result shows how to compute the ordinary covariance of \( Y \) and \( Z \) by conditioning on \( X \).
\(\cov(Y, Z) = \E\left[\cov(Y, Z \mid X)\right] + \cov\left[\E(Y \mid X), \E(Z \mid X) \right]\).
Proof
From (25) and properties of conditional expected value we have \[ \E\left[\cov(Y, Z \mid X)\right] = \E(Y Z) - \E\left[\E(Y\mid X) \E(Z \mid X) \right] \] But \( \E(Y Z) = \cov(Y, Z) + \E(Y) \E(Z)\) and similarly, \[\E\left[\E(Y \mid X) \E(Z \mid X)\right] = \cov[\E(Y \mid X), \E(Z \mid X)] + \E[\E(Y\mid X)] \E[\E(Z \mid X)]\] But also, \( \E[\E(Y \mid X)] = \E(Y) \) and \( \E[\E(Z \mid X)] = \E(Z) \) so substituting we get \[ \E\left[\cov(Y, Z \mid X)\right] = \cov(Y, Z) - \cov\left[\E(Y \mid X), \E(Z \mid X)\right] \]
Thus, the covariance of \( Y \) and \( Z \) is the expected conditional covariance plus the covariance of the conditional expected values. This result is often a good way to compute \(\cov(Y, Z)\) when we know the conditional distribution of \((Y, Z)\) given \(X\).
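The covariance decomposition can be checked in the same way. In the sketch below, the construction of \((X, Y, Z)\) is an assumption made purely for illustration: \(X\), \(U\), and \(W\) are independent fair bits, \(Y = X + U\), and \(Z = X + U W\).

```python
from fractions import Fraction
from itertools import product

# Assumed model: X, U, W are independent fair bits, Y = X + U, Z = X + U * W.
pmf = {}
for x, u, w in product((0, 1), repeat=3):
    key = (x, x + u, x + u * w)
    pmf[key] = pmf.get(key, 0) + Fraction(1, 8)

def expect(fn, given_x=None):
    """E[fn(X, Y, Z)], or E[fn(X, Y, Z) | X = given_x] when given_x is supplied."""
    items = [(k, p) for k, p in pmf.items() if given_x is None or k[0] == given_x]
    total = sum(p for _, p in items)
    return sum(fn(*k) * p for k, p in items) / total

xs = sorted({k[0] for k in pmf})
p_x = {a: sum(p for k, p in pmf.items() if k[0] == a) for a in xs}

# Conditional means and conditional covariance given X = a.
ey = {a: expect(lambda x, y, z: y, a) for a in xs}
ez = {a: expect(lambda x, y, z: z, a) for a in xs}
cond_cov = {a: expect(lambda x, y, z: y * z, a) - ey[a] * ez[a] for a in xs}

# The two terms on the right side of the covariance decomposition.
expected_cond_cov = sum(cond_cov[a] * p_x[a] for a in xs)
cov_of_means = (sum(ey[a] * ez[a] * p_x[a] for a in xs)
                - sum(ey[a] * p_x[a] for a in xs) * sum(ez[a] * p_x[a] for a in xs))

# Ordinary covariance of Y and Z.
cov_yz = (expect(lambda x, y, z: y * z)
          - expect(lambda x, y, z: y) * expect(lambda x, y, z: z))

print(cov_yz, expected_cond_cov, cov_of_means)  # 3/8, 1/8, 1/4
```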
Examples and Applications
As always, be sure to try the proofs and computations yourself before reading the ones in the text.
Simple Continuous Distributions
Suppose that \((X,Y)\) has probability density function \(f\) defined by \(f(x,y) = x + y\) for \(0 \le x \le 1\), \(0 \le y \le 1\).
- Find \(L(Y \mid X)\).
- Find \(\E(Y \mid X)\).
- Graph \(L(Y \mid X = x)\) and \(\E(Y \mid X = x)\) as functions of \(x\), on the same axes.
- Find \(\var(Y)\).
- Find \(\var(Y)\left[1 - \cor^2(X, Y)\right]\).
- Find \(\var(Y) - \var\left[\E(Y \mid X)\right]\).
Answer
- \(\frac{7}{11} - \frac{1}{11} X\)
- \(\frac{3 X + 2}{6 X + 3}\)
- \(\frac{11}{144} = 0.0764\)
- \(\frac{5}{66} = 0.0758\)
- \(\frac{1}{12} - \frac{1}{144} \ln 3 = 0.0757\)
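The three mean square errors in this exercise can be confirmed numerically. The sketch below uses a simple midpoint rule (the helper `integrate` and the grid size are our choices, not part of the text) together with the marginal density \(x + \frac{1}{2}\) and the regression function from the answer.

```python
from math import log

def integrate(fn, n=2000):
    """Midpoint rule for the integral of fn over [0, 1]."""
    h = 1.0 / n
    return h * sum(fn((i + 0.5) * h) for i in range(n))

def g(x):
    """Marginal density of X (and, by symmetry, of Y): x + 1/2."""
    return x + 0.5

def v(x):
    """Regression function E(Y | X = x) = (3x + 2) / (6x + 3)."""
    return (3 * x + 2) / (6 * x + 3)

mean = integrate(lambda t: t * g(t))                  # E(X) = E(Y)
var = integrate(lambda t: t * t * g(t)) - mean ** 2   # var(X) = var(Y)

# E(XY): the inner integral of y (x + y) over y in [0, 1] is x/2 + 1/3.
e_xy = integrate(lambda x: x * (x / 2 + 1 / 3))
cov = e_xy - mean ** 2

mse_constant = var                                    # var(Y)
mse_linear = var - cov ** 2 / var                     # var(Y) [1 - cor^2(X, Y)]
var_v = (integrate(lambda x: v(x) ** 2 * g(x))
         - integrate(lambda x: v(x) * g(x)) ** 2)     # var[E(Y | X)]
mse_best = var - var_v

print(round(mse_constant, 4), round(mse_linear, 4), round(mse_best, 4))
```

As the theory predicts, the error shrinks as we move from the best constant predictor to the best linear predictor to \(\E(Y \mid X)\), although the improvement here is slight.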
Suppose that \((X,Y)\) has probability density function \(f\) defined by \(f(x,y) = 2 (x + y)\) for \(0 \le x \le y \le 1\).
- Find \(L(Y \mid X)\).
- Find \(\E(Y \mid X)\).
- Graph \(L(Y \mid X = x)\) and \(\E(Y \mid X = x)\) as functions of \(x\), on the same axes.
- Find \(\var(Y)\).
- Find \(\var(Y)\left[1 - \cor^2(X, Y)\right]\).
- Find \(\var(Y) - \var\left[\E(Y \mid X)\right]\).
Answer
- \(\frac{26}{43} + \frac{15}{43} X\)
- \(\frac{5 X^2 + 5 X + 2}{9 X + 3}\)
- \(\frac{3}{80} = 0.0375\)
- \(\frac{13}{430} = 0.0302\)
- \(\frac{1837}{21\;870} - \frac{512}{6561} \ln(2) = 0.0299\)
Suppose that \((X,Y)\) has probability density function \(f\) defined by \(f(x,y) = 6 x^2 y\) for \(0 \le x \le 1\), \(0 \le y \le 1\).
- Find \(L(Y \mid X)\).
- Find \(\E(Y \mid X)\).
- Graph \(L(Y \mid X = x)\) and \(\E(Y \mid X = x)\) as functions of \(x\), on the same axes.
- Find \(\var(Y)\).
- Find \(\var(Y)\left[1 - \cor^2(X, Y)\right]\).
- Find \(\var(Y) - \var\left[\E(Y \mid X)\right]\).
Answer
Note that \(X\) and \(Y\) are independent.
- \(\frac{2}{3}\)
- \(\frac{2}{3}\)
- \(\frac{1}{18}\)
- \(\frac{1}{18}\)
- \(\frac{1}{18}\)
Suppose that \((X,Y)\) has probability density function \(f\) defined by \(f(x,y) = 15 x^2 y\) for \(0 \le x \le y \le 1\).
- Find \(L(Y \mid X)\).
- Find \(\E(Y \mid X)\).
- Graph \(L(Y \mid X = x)\) and \(\E(Y \mid X = x)\) as functions of \(x\), on the same axes.
- Find \(\var(Y)\).
- Find \(\var(Y)\left[1 - \cor^2(X, Y)\right]\).
- Find \(\var(Y) - \var\left[\E(Y \mid X)\right]\).
Answer
- \(\frac{30}{51} + \frac{20}{51}X\)
- \(\frac{2(X^2 + X + 1)}{3(X + 1)}\)
- \(\frac{5}{252} = 0.0198\)
- \(\frac{5}{357} = 0.0140\)
- \(\frac{292}{63} - \frac{20}{3} \ln(2) = 0.0139\)
Exercises on Basic Properties
Suppose that \(X\), \(Y\), and \(Z\) are real-valued random variables with \(\E(Y \mid X) = X^3\) and \(\E(Z \mid X) = \frac{1}{1 + X^2}\). Find \(\E\left(Y\,e^X - Z\,\sin X \mid X\right)\).
Answer
\(X^3 e^X - \frac{\sin X}{1 + X^2}\)
Uniform Distributions
As usual, continuous uniform distributions can give us some geometric insight.
Recall first that for \( n \in \N_+ \), the standard measure on \(\R^n\) is \[\lambda_n(A) = \int_A 1 dx, \quad A \subseteq \R^n\] In particular, \(\lambda_1(A)\) is the length of \(A \subseteq \R\), \(\lambda_2(A)\) is the area of \(A \subseteq \R^2\), and \(\lambda_3(A)\) is the volume of \(A \subseteq \R^3\).
Details
Technically \(\lambda_n\) is Lebesgue measure on the measurable subsets of \(\R^n\). The integral representation is valid for the types of sets that occur in applications. In the discussion below, all subsets are assumed to be measurable.
With our usual setup, suppose that \(X\) takes values in \(S \subseteq \R^n\), \(Y\) takes values in \(T \subseteq \R\), and that \((X, Y)\) is uniformly distributed on \(R \subseteq S \times T \subseteq \R^{n+1}\). So \(0 \lt \lambda_{n+1}(R) \lt \infty\), and the joint probability density function \(f\) of \((X, Y)\) is given by \(f(x, y) = 1 / \lambda_{n+1}(R)\) for \((x, y) \in R\). Recall that uniform distributions, whether discrete or continuous, always have constant densities. Finally, recall that the cross section of \(R\) at \(x \in S\) is \(T_x = \{y \in T: (x, y) \in R\}\).
In the setting above, suppose that \( T_x \) is a bounded interval with midpoint \( m(x) \) and length \( l(x) \) for each \( x \in S \). Then
- \( \E(Y \mid X) = m(X) \)
- \( \var(Y \mid X) = \frac{1}{12}l^2(X) \)
Proof
This follows immediately from the fact that the conditional distribution of \( Y \) given \( X = x \) is uniformly distributed on \( T_x \) for each \( x \in S \).
So in particular, the regression curve \(x \mapsto \E(Y \mid X = x)\) follows the midpoints of the cross-sectional intervals.
In each case below, suppose that \( (X,Y) \) is uniformly distributed on the given region. Find \(\E(Y \mid X)\) and \( \var(Y \mid X) \).
- The rectangular region \(R = [a, b] \times [c, d]\) where \(a \lt b\) and \(c \lt d\).
- The triangular region \(T = \left\{(x,y) \in \R^2: -a \le x \le y \le a\right\}\) where \(a \gt 0\).
- The circular region \( C = \left\{(x, y) \in \R^2: x^2 + y^2 \le r^2\right\} \) where \( r \gt 0 \).
Answer
- \(\E(Y \mid X) = \frac{1}{2}(c + d)\), \( \var(Y \mid X) = \frac{1}{12}(d - c)^2 \). Note that \( X \) and \( Y \) are independent.
- \(\E(Y \mid X) = \frac{1}{2}(a + X)\), \( \var(Y \mid X) = \frac{1}{12}(a - X)^2 \)
- \( \E(Y \mid X) = 0 \), \( \var(Y \mid X) = \frac{1}{3} (r^2 - X^2) \)
In the bivariate uniform experiment, select each of the following regions. In each case, run the simulation 2000 times and note the relationship between the cloud of points and the graph of the regression function.
- square
- triangle
- circle
Suppose that \(X\) is uniformly distributed on the interval \((0, 1)\), and that given \(X\), random variable \(Y\) is uniformly distributed on \((0, X)\). Find each of the following:
- \(\E(Y \mid X)\)
- \(\E(Y)\)
- \(\var(Y \mid X)\)
- \(\var(Y)\)
Answer
- \(\frac{1}{2} X\)
- \(\frac{1}{4}\)
- \(\frac{1}{12} X^2\)
- \(\frac{7}{144}\)
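Parts (b) and (d) can be obtained by conditioning on \(X\), using \(\E(X) = \frac{1}{2}\), \(\E(X^2) = \frac{1}{3}\), and \(\var(X) = \frac{1}{12}\) for the uniform distribution on \((0, 1)\). One possible route is \begin{align} \E(Y) & = \E[\E(Y \mid X)] = \frac{1}{2} \E(X) = \frac{1}{4} \\ \var(Y) & = \E[\var(Y \mid X)] + \var[\E(Y \mid X)] = \frac{1}{12} \E(X^2) + \frac{1}{4} \var(X) = \frac{1}{36} + \frac{1}{48} = \frac{7}{144} \end{align}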
The Hypergeometric Distribution
Suppose that a population consists of \(m\) objects, and that each object is one of three types. There are \(a\) objects of type 1, \(b\) objects of type 2, and \(m - a - b\) objects of type 0. The parameters \(a\) and \(b\) are positive integers with \(a + b \lt m\). We sample \(n\) objects from the population at random, and without replacement, where \( n \in \{0, 1, \ldots, m\} \). Denote the number of type 1 and 2 objects in the sample by \(X\) and \(Y\), so that the number of type 0 objects in the sample is \(n - X - Y\). In the chapter on Distributions, we showed that the joint, marginal, and conditional distributions of \( X \) and \( Y \) are all hypergeometric; only the parameters change. Here is the relevant result for this section:
In the setting above,
- \( \E(Y \mid X) = \frac{b}{m - a}(n - X) \)
- \( \var(Y \mid X) = \frac{b (m - a - b)}{(m - a)^2 (m - a - 1)} (n - X) (m - a - n + X)\)
- \( \E\left([Y - \E(Y \mid X)]^2\right) = \frac{n(m - n)b(m - a - b)}{m (m - 1)(m - a)} \)
Proof
Recall that \( (X, Y) \) has the (multivariate) hypergeometric distribution with parameters \( m \), \( a \), \( b \), and \( n \). Marginally, \( X \) has the hypergeometric distribution with parameters \( m \), \( a \), and \( n \), and \( Y \) has the hypergeometric distribution with parameters \( m \), \( b \), and \( n \). Given \( X = x \in \{0, 1, \ldots, n\} \), the remaining \( n - x \) objects are chosen at random from a population of \( m - a \) objects, of which \( b \) are type 2 and \( m - a - b \) are type 0. Hence, the conditional distribution of \( Y \) given \( X = x \) is hypergeometric with parameters \( m - a \), \( b \), and \( n - x \). Parts (a) and (b) then follow from the standard formulas for the mean and variance of the hypergeometric distribution, as functions of the parameters. Part (c) is the mean square error, and in this case can be computed most easily as \[ \var(Y) - \var[\E(Y \mid X)] = \var(Y) - \left(\frac{b}{m - a}\right)^2 \var(X) = n \frac{b}{m} \frac{m - b}{m} \frac{m - n}{m - 1} - \left(\frac{b}{m - a}\right)^2 n \frac{a}{m} \frac{m - a}{m} \frac{m - n}{m - 1} \] Simplifying gives the result.
Note that \( \E(Y \mid X) \) is a linear function of \( X \) and hence \( \E(Y \mid X) = L(Y \mid X) \).
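Parts (a) and (b) of the result can be checked by brute force for a small case. The sketch below builds the conditional distribution of \(Y\) given \(X = x\) directly from the joint hypergeometric probabilities, using exact rational arithmetic, and compares its mean and variance with the formulas. The parameter values and the function names are hypothetical, chosen only for illustration.

```python
from fractions import Fraction
from math import comb

def conditional_pmf(m, a, b, n, x):
    """P(Y = y | X = x) for y = 0, ..., n - x, computed from the joint
    (multivariate hypergeometric) probabilities."""
    joint = [Fraction(comb(a, x) * comb(b, y) * comb(m - a - b, n - x - y), comb(m, n))
             for y in range(n - x + 1)]
    total = sum(joint)                 # this is P(X = x)
    return [p / total for p in joint]

def conditional_mean_var(m, a, b, n, x):
    pmf = conditional_pmf(m, a, b, n, x)
    mean = sum(y * p for y, p in enumerate(pmf))
    var = sum((y - mean) ** 2 * p for y, p in enumerate(pmf))
    return mean, var

# Hypothetical small population: 12 objects, 4 of type 1, 3 of type 2, sample size 5.
m, a, b, n = 12, 4, 3, 5
for x in range(min(a, n) + 1):
    mean, var = conditional_mean_var(m, a, b, n, x)
    formula_mean = Fraction(b, m - a) * (n - x)
    formula_var = (Fraction(b * (m - a - b), (m - a) ** 2 * (m - a - 1))
                   * (n - x) * (m - a - n + x))
    assert mean == formula_mean and var == formula_var
```

The loop only covers values of \(x\) with \(\P(X = x) \gt 0\) for these parameters; other parameter choices may need a narrower range.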
In a collection of 120 objects, 50 are classified as good, 40 as fair and 30 as poor. A sample of 20 objects is selected at random and without replacement. Let \( X \) denote the number of good objects in the sample and \( Y \) the number of poor objects in the sample. Find each of the following:
- \( \E(Y \mid X) \)
- \( \var(Y \mid X) \)
- The predicted value of \( Y \) given \( X = 8 \)
Answer
- \( \E(Y \mid X) = \frac{60}{7} - \frac{3}{7} X \)
- \( \var(Y \mid X) = \frac{4}{1127}(20 - X)(50 + X) \)
- \( \frac{36}{7} \)
The Multinomial Trials Model
Suppose that we have a sequence of \( n \) independent trials, and that each trial results in one of three outcomes, denoted 0, 1, and 2. On each trial, the probability of outcome 1 is \( p \), the probability of outcome 2 is \( q \), so that the probability of outcome 0 is \( 1 - p - q \). The parameters \( p, \, q \in (0, 1) \) with \( p + q \lt 1 \), and of course \( n \in \N_+ \). Let \( X \) denote the number of trials that resulted in outcome 1, \( Y \) the number of trials that resulted in outcome 2, so that \( n - X - Y \) is the number of trials that resulted in outcome 0. In the chapter on Distributions, we showed that the joint, marginal, and conditional distributions of \( X \) and \( Y \) are all multinomial; only the parameters change. Here is the relevant result for this section:
In the setting above,
- \( \E(Y \mid X) = \frac{q}{1 - p}(n - X) \)
- \( \var(Y \mid X) = \frac{q (1 - p - q)}{(1 - p)^2}(n - X)\)
- \( \E\left([Y - \E(Y \mid X)]^2\right) = \frac{q (1 - p - q)}{1 - p} n \)
Proof
Recall that \( (X, Y) \) has the multinomial distribution with parameters \( n \), \( p \), and \( q \). Marginally, \( X \) has the binomial distribution with parameters \( n \) and \( p \), and \( Y \) has the binomial distribution with parameters \( n \) and \( q \). Given \( X = x \in \{0, 1, \ldots, n\} \), the remaining \( n - x \) trials are independent, but with just two outcomes: outcome 2 occurs with probability \( q / (1 - p) \) and outcome 0 occurs with probability \( 1 - q / (1 - p) \). (These are the conditional probabilities of outcomes 2 and 0, respectively, given that outcome 1 did not occur.) Hence the conditional distribution of \( Y \) given \( X = x \) is binomial with parameters \( n - x \) and \( q / (1 - p) \). Parts (a) and (b) then follow from the standard formulas for the mean and variance of the binomial distribution, as functions of the parameters. Part (c) is the mean square error and in this case can be computed most easily from \[ \E[\var(Y \mid X)] = \frac{q (1 - p - q)}{(1 - p)^2} [n - \E(X)] = \frac{q (1 - p - q)}{(1 - p)^2} (n - n p) = \frac{q (1 - p - q)}{1 - p} n\]
Note again that \( \E(Y \mid X) \) is a linear function of \( X \) and hence \( \E(Y \mid X) = L(Y \mid X) \).
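The mean square error in part (c) can also be checked by summing over the joint distribution. The following sketch does this with exact rational arithmetic for hypothetical parameter values (\(n = 6\), \(p = \frac{1}{3}\), \(q = \frac{1}{4}\)); the function names are made up for the example.

```python
from fractions import Fraction
from math import comb

def trinomial_pmf(n, p, q, x, y):
    """P(X = x, Y = y) for n trials with outcome probabilities p, q, 1 - p - q."""
    return comb(n, x) * comb(n - x, y) * p**x * q**y * (1 - p - q)**(n - x - y)

def mean_square_error(n, p, q):
    """E([Y - E(Y | X)]^2), using E(Y | X) = q (n - X) / (1 - p)."""
    total = Fraction(0)
    for x in range(n + 1):
        predicted = q / (1 - p) * (n - x)
        for y in range(n - x + 1):
            total += trinomial_pmf(n, p, q, x, y) * (y - predicted) ** 2
    return total

# Hypothetical parameters, chosen only for illustration.
n, p, q = 6, Fraction(1, 3), Fraction(1, 4)
mse = mean_square_error(n, p, q)
formula = q * (1 - p - q) / (1 - p) * n
print(mse, formula)   # the two values should agree exactly
```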
Suppose that a fair, 12-sided die is thrown 50 times. Let \( X \) denote the number of throws that resulted in a number from 1 to 5, and \( Y \) the number of throws that resulted in a number from 6 to 9. Find each of the following:
- \( \E(Y \mid X) \)
- \( \var(Y \mid X) \)
- The predicted value of \( Y \) given \( X = 20 \)
Answer
- \( \E(Y \mid X) = \frac{4}{7}(50 - X) \)
- \( \var(Y \mid X) = \frac{12}{49}(50 - X) \)
- \( \frac{120}{7} \)
The Poisson Distribution
Recall that the Poisson distribution, named for Simeon Poisson, is widely used to model the number of random points in a region of time or space, under certain ideal conditions. The Poisson distribution is studied in more detail in the chapter on the Poisson Process. The Poisson distribution with parameter \( r \in (0, \infty) \) has probability density function \(f\) defined by \[ f(x) = e^{-r} \frac{r^x}{x!}, \quad x \in \N \] The parameter \( r \) is the mean and variance of the distribution.
Suppose that \( X \) and \( Y \) are independent random variables, and that \( X \) has the Poisson distribution with parameter \( a \in (0, \infty) \) and \( Y \) has the Poisson distribution with parameter \( b \in (0, \infty) \). Let \( N = X + Y \). Then
- \( \E(X \mid N) = \frac{a}{a + b}N\)
- \( \var(X \mid N) = \frac{a b}{(a + b)^2} N \)
- \( \E\left([X - \E(X \mid N)]^2\right) = \frac{a b}{a + b} \)
Proof
We have shown before that the distribution of \( N \) is also Poisson, with parameter \( a + b \), and that the conditional distribution of \( X \) given \( N = n \in \N \) is binomial with parameters \( n \) and \( a / (a + b) \). Hence parts (a) and (b) follow from the standard formulas for the mean and variance of the binomial distribution, as functions of the parameters. Part (c) is the mean square error, and in this case can be computed most easily as \[ \E[\var(X \mid N)] = \frac{a b}{(a + b)^2} \E(N) = \frac{ab}{(a + b)^2} (a + b) = \frac{a b}{a + b} \]
Once again, \( \E(X \mid N) \) is a linear function of \( N \) and so \( \E(X \mid N) = L(X \mid N) \). If we reverse the roles of the variables, the conditional expected value is trivial from our basic properties: \[ \E(N \mid X) = \E(X + Y \mid X) = X + b \]
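For reference, the binomial conditional distribution used in the proof follows from the independence of \(X\) and \(Y\) and the fact that \(N\) has the Poisson distribution with parameter \(a + b\). For \(n \in \N\) and \(x \in \{0, 1, \ldots, n\}\), \[ \P(X = x \mid N = n) = \frac{\P(X = x) \P(Y = n - x)}{\P(N = n)} = \frac{e^{-a} \frac{a^x}{x!} \, e^{-b} \frac{b^{n-x}}{(n - x)!}}{e^{-(a+b)} \frac{(a + b)^n}{n!}} = \binom{n}{x} \left(\frac{a}{a + b}\right)^x \left(\frac{b}{a + b}\right)^{n - x} \]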
Coins and Dice
A pair of fair dice are thrown, and the scores \((X_1, X_2)\) recorded. Let \(Y = X_1 + X_2\) denote the sum of the scores and \(U = \min\left\{X_1, X_2\right\}\) the minimum score. Find each of the following:
- \(\E\left(Y \mid X_1\right)\)
- \(\E\left(U \mid X_1\right)\)
- \(\E\left(Y \mid U\right)\)
- \(\E\left(X_2 \mid X_1\right)\)
Answer
- \(\frac{7}{2} + X_1\)
- \( \begin{array}{c|cccccc} x & 1 & 2 & 3 & 4 & 5 & 6 \\ \hline \E(U \mid X_1 = x) & 1 & \frac{11}{6} & \frac{5}{2} & 3 & \frac{10}{3} & \frac{7}{2} \end{array} \)
- \( \begin{array}{c|cccccc} u & 1 & 2 & 3 & 4 & 5 & 6 \\ \hline \E(Y \mid U = u) & \frac{52}{11} & \frac{56}{9} & \frac{54}{7} & \frac{46}{5} & \frac{32}{3} & 12 \end{array} \)
- \(\frac{7}{2}\)
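Since the sample space has only 36 equally likely outcomes, the conditional expected values in parts (b) and (c) can be computed by direct enumeration. The sketch below is one way to do this in Python with exact fractions; the helper `conditional_mean` is a name introduced here for illustration.

```python
from fractions import Fraction
from itertools import product

outcomes = list(product(range(1, 7), repeat=2))   # 36 equally likely pairs (x1, x2)

def conditional_mean(value, given, g):
    """E(value(X1, X2) | given(X1, X2) = g) under the uniform distribution."""
    matching = [value(x1, x2) for x1, x2 in outcomes if given(x1, x2) == g]
    return Fraction(sum(matching), len(matching))

def total(x1, x2):      # Y, the sum of the scores
    return x1 + x2

def minimum(x1, x2):    # U, the minimum score
    return min(x1, x2)

def first(x1, x2):      # X_1, the first score
    return x1

e_u_given_x1 = [conditional_mean(minimum, first, x) for x in range(1, 7)]
e_y_given_u = [conditional_mean(total, minimum, u) for u in range(1, 7)]
print(e_u_given_x1)
print(e_y_given_u)
```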
A box contains 10 coins, labeled 0 to 9. The probability of heads for coin \(i\) is \(\frac{i}{9}\). A coin is chosen at random from the box and tossed. Find the probability of heads.
Answer
\(\frac{1}{2}\)
This problem is an example of Laplace's rule of succession, named for Pierre Simon Laplace.
Random Sums of Random Variables
Suppose that \(\bs{X} = (X_1, X_2, \ldots)\) is a sequence of independent and identically distributed real-valued random variables. We will denote the common mean, variance, and moment generating function, respectively, by \(\mu = \E(X_i)\), \(\sigma^2 = \var(X_i)\), and \(G(t) = \E\left(e^{t\,X_i}\right)\). Let \[ Y_n = \sum_{i=1}^n X_i, \quad n \in \N \] so that \((Y_0, Y_1, \ldots)\) is the partial sum process associated with \(\bs{X}\). Suppose now that \(N\) is a random variable taking values in \(\N\), independent of \(\bs{X}\). Then \[ Y_N = \sum_{i=1}^N X_i \] is a random sum of random variables; the terms in the sum are random, and the number of terms is random. This type of variable occurs in many different contexts. For example, \(N\) might represent the number of customers who enter a store in a given period of time, and \(X_i\) the amount spent by the customer \(i\), so that \( Y_N \) is the total revenue of the store during the period.
The conditional and ordinary expected value of \(Y_N\) are
- \(\E\left(Y_N \mid N\right) = N \mu\)
- \(\E\left(Y_N\right) = \E(N) \mu\)
Proof
- Using the substitution rule and the independence of \( N \) and \( \bs{X} \) we have \[ \E\left(Y_N \mid N = n\right) = \E\left(Y_n \mid N = n\right) = \E(Y_n) = \sum_{i=1}^n \E(X_i) = n \mu \] so \(\E\left(Y_N \mid N\right) = N \mu\).
- From (a) and conditioning, \( \E\left(Y_N\right) = \E\left[\E\left(Y_N \mid N\right)\right] = \E(N \mu) = \E(N) \mu \).
Wald's equation, named for Abraham Wald, is a generalization of the previous result to the case where \(N\) is not necessarily independent of \(\bs{X}\), but rather is a stopping time for \(\bs{X}\). Roughly, this means that the event \( N = n \) depends only on \( (X_1, X_2, \ldots, X_n) \). Wald's equation is discussed in the chapter on Random Samples. An elegant proof of Wald's equation is given in the chapter on Martingales. The advanced section on stopping times is in the chapter on Probability Measures.
The conditional and ordinary variance of \(Y_N\) are
- \(\var\left(Y_N \mid N\right) = N \sigma^2\)
- \(\var\left(Y_N\right) = \E(N) \sigma^2 + \var(N) \mu^2\)
Proof
- Using the substitution rule, the independence of \( N \) and \( \bs{X} \), and the fact that \( \bs{X} \) is an IID sequence, we have \[ \var\left(Y_N \mid N = n\right) = \var\left(Y_n \mid N = n\right) = \var\left(Y_n\right) = \sum_{i=1}^n \var(X_i) = n \sigma^2 \] so \( \var\left(Y_N \mid N\right) = N \sigma^2 \).
- From (a) and the previous result, \[ \var\left(Y_N\right) = \E\left[\var\left(Y_N \mid N\right)\right] + \var\left[\E(Y_N \mid N)\right] = \E(\sigma^2 N) + \var(\mu N) = \E(N) \sigma^2 + \mu^2 \var(N)\]
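The formulas for the mean and variance of \(Y_N\) can be checked exactly when \(N\) and the \(X_i\) take only finitely many values. The sketch below builds the distribution of \(Y_N\) by mixing the distributions of the partial sums \(Y_n\), and compares its mean and variance with \(\E(N) \mu\) and \(\E(N) \sigma^2 + \var(N) \mu^2\). The distributions of \(N\) and \(X_i\) used here are hypothetical, and the helper names are not from the text.

```python
from fractions import Fraction

def convolve(p, q):
    """pmf of the sum of two independent variables, each given as {value: probability}."""
    out = {}
    for u, pu in p.items():
        for v, pv in q.items():
            out[u + v] = out.get(u + v, 0) + pu * pv
    return out

def random_sum_pmf(n_pmf, x_pmf):
    """pmf of Y_N = X_1 + ... + X_N when N (pmf n_pmf) is independent of the IID X_i."""
    result = {}
    partial = {0: Fraction(1)}               # pmf of Y_0 = 0
    for n in range(max(n_pmf) + 1):
        weight = n_pmf.get(n, 0)             # P(N = n)
        for value, prob in partial.items():
            result[value] = result.get(value, 0) + weight * prob
        partial = convolve(partial, x_pmf)   # pmf of Y_{n+1}
    return result

def mean_var(pmf):
    mean = sum(v * p for v, p in pmf.items())
    var = sum((v - mean) ** 2 * p for v, p in pmf.items())
    return mean, var

# Hypothetical example: N uniform on {0, 1, 2, 3}, X_i uniform on {1, 2}.
n_pmf = {n: Fraction(1, 4) for n in range(4)}
x_pmf = {1: Fraction(1, 2), 2: Fraction(1, 2)}

mean_n, var_n = mean_var(n_pmf)
mu, sigma2 = mean_var(x_pmf)
direct_mean, direct_var = mean_var(random_sum_pmf(n_pmf, x_pmf))

print(direct_mean, mean_n * mu)                          # should agree
print(direct_var, mean_n * sigma2 + var_n * mu ** 2)     # should agree
```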
Let \(H\) denote the probability generating function of \(N\). The conditional and ordinary moment generating function of \(Y_N\) are
- \(\E\left(e^{t Y_N} \mid N\right) = \left[G(t)\right]^N\)
- \(\E\left(e^{t Y_N}\right) = H\left(G(t)\right)\)
Proof
- Using the substitution rule, the independence of \( N \) and \( \bs{X} \), and the fact that \( \bs{X} \) is an IID sequence, we have \[ \E\left(e^{t Y_N} \mid N = n\right) = \E\left(e^{t Y_n} \mid N = n\right) = \E\left(e^{t Y_n}\right) = \left[G(t)\right]^n \] (Recall that the MGF of the sum of independent variables is the product of the individual MGFs.)
- From (a) and conditioning, \( \E\left(e^{t Y_N}\right) = \E\left[\E\left(e^{t Y_N} \mid N\right)\right] = \E\left(G(t)^N\right) = H(G(t)) \).
Thus the moment generating function of \( Y_N \) is \( H \circ G \), the composition of the probability generating function of \( N \) with the common moment generating function of \( \bs{X} \), a simple and elegant result.
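As an illustration (an example not taken from the text), suppose that \(N\) has the Poisson distribution with parameter \(a \in (0, \infty)\) and that each \(X_i\) is an indicator variable with \(\P(X_i = 1) = p\). Then \(H(s) = e^{a(s - 1)}\) and \(G(t) = 1 - p + p e^t\), so \[ \E\left(e^{t Y_N}\right) = H(G(t)) = \exp\left[a \left(1 - p + p e^t - 1\right)\right] = \exp\left[a p \left(e^t - 1\right)\right] \] which is the moment generating function of the Poisson distribution with parameter \(a p\).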
In the die-coin experiment, a fair die is rolled and then a fair coin is tossed the number of times showing on the die. Let \(N\) denote the die score and \(Y\) the number of heads. Find each of the following:
- The conditional distribution of \(Y\) given \(N\).
- \(\E\left(Y \mid N\right)\)
- \(\var\left(Y \mid N\right)\)
- \(\E(Y)\)
- \(\var(Y)\)
Answer
- Binomial with parameters \(N\) and \(p = \frac{1}{2}\)
- \(\frac{1}{2} N\)
- \(\frac{1}{4} N\)
- \(\frac{7}{4}\)
- \(\frac{77}{48}\)
Run the die-coin experiment 1000 times and compare the empirical mean and standard deviation to the distribution mean and standard deviation.
The number of customers entering a store in a given hour is a random variable with mean 20 and standard deviation 3. Each customer, independently of the others, spends a random amount of money with mean $50 and standard deviation $5. Find the mean and standard deviation of the amount of money spent during the hour.
Answer
- \($1000\)
- \($151.66\)
A coin has a random probability of heads \(V\) and is tossed a random number of times \(N\). Suppose that \(V\) is uniformly distributed on \([0, 1]\); \(N\) has the Poisson distribution with parameter \(a \gt 0\); and \(V\) and \(N\) are independent. Let \(Y\) denote the number of heads. Compute the following:
- \(\E(Y \mid N, V)\)
- \(\E(Y \mid N)\)
- \(\E(Y \mid V)\)
- \(\E(Y)\)
- \(\var(Y \mid N, V)\)
- \(\var(Y)\)
Answer
- \(N V\)
- \(\frac{1}{2} N\)
- \(a V\)
- \(\frac{1}{2} a\)
- \(N V (1 - V)\)
- \(\frac{1}{12} a^2 + \frac{1}{2} a\)
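One way to obtain part (f) is to condition on \((N, V)\), using the independence of \(N\) and \(V\) together with \(\E(N) = \var(N) = a\), \(\E(V) = \frac{1}{2}\), and \(\E(V^2) = \frac{1}{3}\): \begin{align} \E[\var(Y \mid N, V)] & = \E(N) \E[V (1 - V)] = a \left(\frac{1}{2} - \frac{1}{3}\right) = \frac{1}{6} a \\ \var[\E(Y \mid N, V)] & = \var(N V) = \E(N^2) \E(V^2) - [\E(N) \E(V)]^2 = \frac{1}{3}(a + a^2) - \frac{1}{4} a^2 = \frac{1}{3} a + \frac{1}{12} a^2 \end{align} Adding the two terms gives \(\var(Y) = \frac{1}{12} a^2 + \frac{1}{2} a\).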
Mixtures of Distributions
Suppose that \(\bs{X} = (X_1, X_2, \ldots)\) is a sequence of real-valued random variables. Denote the mean, variance, and moment generating function of \( X_i \) by \(\mu_i = \E(X_i)\), \(\sigma_i^2 = \var(X_i)\), and \(M_i(t) = \E\left(e^{t\,X_i}\right)\), for \(i \in \N_+\). Suppose also that \(N\) is a random variable taking values in \(\N_+\), independent of \(\bs{X}\). Denote the probability density function of \(N\) by \(p_n = \P(N = n)\) for \(n \in \N_+\). The distribution of the random variable \(X_N\) is a mixture of the distributions of \(\bs{X} = (X_1, X_2, \ldots)\), with the distribution of \(N\) as the mixing distribution.
The conditional and ordinary expected value of \( X_N \) are
- \(\E(X_N \mid N) = \mu_N\)
- \(\E(X_N) = \sum_{n=1}^\infty p_n\,\mu_n\)
Proof
- Using the substitution rule and the independence of \( N \) and \( \bs{X} \), we have \( \E(X_N \mid N = n) = \E(X_n \mid N = n) = \E(X_n) = \mu_n \)
- From (a) and the conditioning rule, \[ \E\left(X_N\right) = \E\left[\E\left(X_N \mid N\right)\right] = \E\left(\mu_N\right) = \sum_{n=1}^\infty p_n \mu_n\]
The conditional and ordinary variance of \( X_N \) are
- \(\var\left(X_N \mid N\right) = \sigma_N^2\)
- \(\var(X_N) = \sum_{n=1}^\infty p_n (\sigma_n^2 + \mu_n^2) - \left(\sum_{n=1}^\infty p_n\,\mu_n\right)^2\).
Proof
- Using the substitution rule and the independence of \( N \) and \(\bs{X}\), we have \( \var\left(X_N \mid N = n\right) = \var\left(X_n \mid N = n\right) = \var\left(X_n\right) = \sigma_n^2 \)
- From (a) we have \begin{align} \var\left(X_N\right) & = \E\left[\var\left(X_N \mid N\right)\right] + \var\left[\E\left(X_N \mid N\right)\right] = \E\left(\sigma_N^2\right) + \var\left(\mu_N\right) = \E\left(\sigma_N^2\right) + \E\left(\mu_N^2\right) - \left[\E\left(\mu_N\right)\right]^2 \\ & = \sum_{n=1}^\infty p_n \sigma_n^2 + \sum_{n=1}^\infty p_n \mu_n^2 - \left(\sum_{n=1}^\infty p_n \mu_n\right)^2 \end{align}
The conditional and ordinary moment generating function of \( X_N \) are
- \( \E\left(e^{t X_N} \mid N\right) = M_N(t) \)
- \(\E\left(e^{tX_N}\right) = \sum_{i=1}^\infty p_i M_i(t)\).
Proof
- Using the substitution rule and the independence of \( N \) and \( \bs{X} \), we have \( \E\left(e^{t X_N} \mid N = n\right) = \E\left(e^{t X_n} \mid N = n\right) = \E\left(e^{t X_n}\right) = M_n(t) \)
- From (a) and the conditioning rule, \( \E\left(e^{t X_N}\right) = \E\left[\E\left(e^{t X_N} \mid N\right)\right] = \E\left[M_N(t)\right] = \sum_{n=1}^\infty p_n M_n(t)\)
In the coin-die experiment, a biased coin is tossed with probability of heads \(\frac{1}{3}\). If the coin lands tails, a fair die is rolled; if the coin lands heads, an ace-six flat die is rolled (faces 1 and 6 have probability \(\frac{1}{4}\) each, and faces 2, 3, 4, 5 have probability \(\frac{1}{8}\) each). Find the mean and standard deviation of the die score.
Answer
- \(\frac{7}{2}\)
- \(1.7873\)
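The mixture formulas can be evaluated numerically for the coin-die experiment as stated in the exercise. The following sketch computes the mean and variance of the die score in two ways: from the component means and variances via the mixture formulas, and directly from the probability density function of the score. The variable and function names are introduced here for illustration only.

```python
from fractions import Fraction
from math import sqrt

faces = range(1, 7)
fair = {k: Fraction(1, 6) for k in faces}
ace_six_flat = {k: Fraction(1, 4) if k in (1, 6) else Fraction(1, 8) for k in faces}

def mean_var(pmf):
    mean = sum(k * p for k, p in pmf.items())
    var = sum((k - mean) ** 2 * p for k, p in pmf.items())
    return mean, var

# Mixing distribution: heads (probability 1/3) selects the ace-six flat die,
# tails (probability 2/3) selects the fair die.
components = [(Fraction(1, 3), ace_six_flat), (Fraction(2, 3), fair)]

# Route 1: mixture formulas, from the component means and variances.
stats = [(w, *mean_var(pmf)) for w, pmf in components]
mix_mean = sum(w * mu for w, mu, _ in stats)
mix_var = sum(w * (s2 + mu ** 2) for w, mu, s2 in stats) - mix_mean ** 2

# Route 2: build the pmf of the die score directly and compute its moments.
score_pmf = {k: sum(w * pmf[k] for w, pmf in components) for k in faces}
direct_mean, direct_var = mean_var(score_pmf)

print(mix_mean, mix_var, sqrt(mix_var))
print(direct_mean, direct_var)   # should match the first two values above
```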
Run the coin-die experiment 1000 times and note the apparent convergence of the empirical mean and standard deviation to the distribution mean and standard deviation.