# 11.2: Mathematical Expectation and General Random Variables

In this unit, we extend the definition and properties of mathematical expectation to the general case. In the process, we note the relationship of mathematical expectation to the Lebesgue integral, which is developed in abstract measure theory. Although we do not develop this theory, which lies beyond the scope of this study, identification of this relationship provides access to a rich and powerful set of properties which have far-reaching consequences in both application and theory.

## Extension to the General Case

In the unit on Distribution Approximations, we show that a bounded random variable \(X\) can be represented as the limit of a nondecreasing sequence of simple random variables. Also, a real random variable can be expressed as the difference \(X = X^{+} - X^{-}\) of two nonnegative random variables. The extension of mathematical expectation to the general case is based on these facts and certain basic properties of simple random variables, some of which are established in the unit on expectation for simple random variables. We list these properties and sketch how the extension is accomplished.

Definition: almost surely

A condition on a random variable or on a relationship between random variables is said to hold *almost surely*, abbreviated “a.s.”, iff the condition or relationship holds for all \(\omega\) except possibly a set with probability zero.

**Basic properties of simple random variables**

(E0): If \(X = Y\) a.s., then \(E[X] = E[Y]\).

(E1): \(E[aI_E] = aP(E)\).

(E2): *Linearity*. \(X = \sum_{i = 1}^{n} a_i X_i\) implies \(E[X] = \sum_{i = 1}^{n} a_i E[X_i]\).

(E3): *Positivity; monotonicity*.

a. If \(X \ge 0\) a.s., then \(E[X] \ge 0\), with equality iff \(X = 0\) a.s.

b. If \(X \ge Y\) a.s., then \(E[X] \ge E[Y]\), with equality iff \(X = Y\) a.s.

(E4): *Fundamental lemma*. If \(X \ge 0\) is bounded and \(\{X_n: 1 \le n\}\) is an a.s. nonnegative, nondecreasing sequence with \(\lim_{n} X_n(\omega) \ge X(\omega)\) for almost every \(\omega\), then \(\lim_{n} E[X_n] \ge E[X]\).

(E4a): If for all \(n\), \(0 \le X_n \le X_{n + 1}\) a.s. and \(X_n \to X\) a.s., then \(E[X_n] \to E[X]\) (i.e., the expectation of the limit is the limit of the expectations).

**Ideas of the proofs of the fundamental properties**

- (E0): Modifying the random variable \(X\) on a set of probability zero simply modifies one or more of the \(A_i\) without changing \(P(A_i)\), so the expectation is unchanged.
- Properties (E1) and (E2) are established in the unit on expectation of simple random variables.
- Positivity (E3a) is a simple property of sums of real numbers. Modification of sets of probability zero cannot affect the expectation.
- Monotonicity (E3b) is a consequence of positivity and linearity:

\(X \ge Y\) a.s. iff \(X - Y \ge 0\) a.s., and \(E[X] \ge E[Y]\) iff \(E[X] - E[Y] = E[X - Y] \ge 0\)

- The fundamental lemma (E4) plays an essential role in extending the concept of expectation. It involves elementary, but somewhat sophisticated, use of linearity and monotonicity, limited to nonnegative random variables and positive coefficients. We forgo a proof.
- Monotonicity and the fundamental lemma provide a very simple proof of the monotone convergence theorem (E4a), often designated MC. Its role is essential in the extension.

**Nonnegative random variables**

For a nonnegative random variable \(X\), there is a nondecreasing sequence of nonnegative simple random variables converging to \(X\). Monotonicity implies that the expectations of the terms of this sequence form a nondecreasing sequence of real numbers, which must have a limit or increase without bound (in which case we say the limit is infinite). We define \(E[X] = \lim_{n} E[X_n]\).

Two questions arise.

Is the limit unique? The approximating sequences for a random variable are not unique, although their limit is the same. The question is whether every such sequence gives the same limit of expectations.

Is the definition consistent? If the limit random variable \(X\) is simple, does the new definition coincide with the old?

The fundamental lemma and monotone convergence may be used to show that the answer to both questions is affirmative, so that the definition is reasonable. Also, the six fundamental properties survive the passage to the limit.

As a simple application of these ideas, consider discrete random variables such as the geometric (\(p\)) or Poisson (\(\mu\)), which are integer-valued but unbounded.

Example 11.2.1: Unbounded, nonnegative, integer-valued random variables

The random variable \(X\) may be expressed

\(X = \sum_{k = 0}^{\infty} k I_{E_k}\), where \(E_k = \{X = k\}\) with \(P(E_k) = p_k\)

Let

\(X_n = \sum_{k = 0}^{n - 1} kI_{E_k} + n I_{B_n}\), where \(B_n = \{X \ge n\}\)

Then each \(X_n\) is a simple random variable with \(X_n \le X_{n + 1}\). If \(X(\omega) = k\), then \(X_n(\omega) = k = X(\omega)\) for all \(n \ge k + 1\). Hence, \(X_{n} (\omega) \to X(\omega)\) for all \(\omega\). By monotone convergence, \(E[X_n] \to E[X]\). Now

\(E[X_n] = \sum_{k = 1}^{n - 1} k P(E_k) + nP(B_n)\)

If \(\sum_{k = 0}^{\infty} kP(E_k) < \infty\), then

\(0 \le nP(B_n) = n \sum_{k = n}^{\infty} P(E_k) \le \sum_{k = n}^{\infty} kP(E_k) \to 0\) as \(n \to \infty\)

Hence

\(E[X] = \lim_{n} E[X_n] = \sum_{k = 0}^{\infty} k P(E_k)\)

We may use this result to establish the expectation for the geometric and Poisson distributions.

Example 11.2.2: X~geometric (\(p\))

We have \(p_k = P(X = k) = q^k p\), \(0 \le k\). By the result of Example 11.2.1,

\(E[X] = \sum_{k = 0}^{\infty} kpq^k = pq \sum_{k = 1}^{\infty} kq^{k - 1} = \dfrac{pq}{(1 - q)^2} = q/p\)

For \(Y - 1\) ~ geometric (\(p\)), \(P(Y = k) = pq^{k - 1}\), \(1 \le k\), so that \(E[Y] = \dfrac{1}{q} E[X] = 1/p\).

Example 11.2.3: X ~ Poisson (\(\mu\))

We have \(p_k = e^{-\mu} \dfrac{\mu^{k}}{k!}\). By the result of Example 11.2.1,

\(E[X] = e^{-\mu} \sum_{k = 0}^{\infty} k \dfrac{\mu^k}{k!} = \mu e^{-\mu} \sum_{k = 1}^{\infty} \dfrac{\mu^{k - 1}}{(k - 1)!} = \mu e^{-\mu} e^{\mu} = \mu\)
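
The truncation argument of Example 11.2.1 can be checked numerically. The following is a minimal Python sketch (Python is used here only for illustration, and the helper names `geometric_pmf`, `poisson_pmf`, and `truncated_mean` are assumptions, not part of the text). It evaluates \(E[X_n] = \sum_{k < n} k P(E_k) + nP(B_n)\) and shows the values increasing toward \(q/p\) and \(\mu\).

```python
import math

def geometric_pmf(p):
    """Return k -> P(X = k) = q^k p for X ~ geometric(p), k = 0, 1, 2, ..."""
    q = 1.0 - p
    return lambda k: (q ** k) * p

def poisson_pmf(mu):
    """Return k -> P(X = k) = e^{-mu} mu^k / k! for X ~ Poisson(mu)."""
    return lambda k: math.exp(-mu + k * math.log(mu) - math.lgamma(k + 1))

def truncated_mean(pmf, n):
    """E[X_n] for X_n = sum_{k<n} k I_{E_k} + n I_{B_n}, where B_n = {X >= n}."""
    head = [pmf(k) for k in range(n)]        # P(E_k) for k = 0, ..., n-1
    tail = max(0.0, 1.0 - sum(head))         # P(B_n) = P(X >= n)
    return sum(k * pk for k, pk in enumerate(head)) + n * tail

p, mu = 0.25, 3.0                            # illustrative parameter choices
for n in (1, 5, 20, 80):
    print(n, truncated_mean(geometric_pmf(p), n), truncated_mean(poisson_pmf(mu), n))
print("limits:", (1 - p) / p, mu)
```

The printed rows are nondecreasing in \(n\), as monotone convergence requires.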

**The general case**

We make use of the fact that \(X = X^{+} - X^{-}\) , where both \(X^{+}\) and \(X^{-}\) are nonnegative. Then

\(E[X] = E[X^{+}] - E[X^{-}]\) provided at least one of \(E[X^{+}]\), \(E[X^{-}]\) is finite

**Definition**. If both \(E[X^{+}]\) and \(E[X^{-}]\) are finite, \(X\) is said to be *integrable*.

The term integrable comes from the relation of expectation to the abstract Lebesgue integral of measure theory.

Again, the basic properties survive the extension. The property (E0) is subsumed in a more general uniqueness property noted in the list of properties discussed below.

*Theoretical note*

The development of expectation sketched above is exactly the development of the Lebesgue integral of the random variable \(X\) as a measurable function on the basic probability space (\(\Omega\), **\(F\)**, \(P\)), so that

\[E[X] = \int_{\Omega} X\ dP\]

As a consequence, we may utilize the properties of the general Lebesgue integral. In its abstract form, it is not particularly useful for actual calculations. A careful use of the mapping of probability mass to the real line by random variable \(X\) produces a corresponding mapping of the integral on the basic space to an integral on the real line. Although this integral is also a Lebesgue integral, it agrees with the ordinary Riemann integral of calculus when the latter exists, so that ordinary integrals may be used to compute expectations.

*Additional properties*

The fundamental properties of simple random variables which survive the extension serve as the basis of an extensive and powerful list of properties of expectation of real random variables and real functions of random vectors. Some of the more important of these are listed in the table in __Appendix E__. We often refer to these properties by the numbers used in that table.

*Some basic forms*

The mapping theorems provide a number of basic integral (or summation) forms for computation.

In general, if \(Z = g(X)\) with distribution functions \(F_X\) and \(F_Z\), we have the expectation as a Stieltjes integral.

\(E[Z] = E[g(X)] = \int g(t) F_X (dt) = \int u F_Z (du)\)

If \(X\) and \(g(X)\) are absolutely continuous, the Stieltjes integrals are replaced by

\(E[Z] = \int g(t) f_X (t)\ dt = \int u f_Z (u)\ du\)

where limits of integration are determined by \(f_X\) or \(f_Z\). Justification for use of the density function is provided by the Radon-Nikodym theorem, property (E19).

If \(X\) is simple, in a primitive form (including canonical form), then

\(E[Z] = E[g(X)] = \sum_{j = 1}^{m} g(c_j) P(C_j)\)

If the distribution for \(Z = g(X)\) is determined by a csort operation, then

\(E[Z] = \sum_{k = 1}^{n} v_k P(Z = v_k)\)

The extension to unbounded, nonnegative, integer-valued random variables is shown in Example 11.2.1, above. The finite sums are replaced by infinite series (provided they converge).
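
The two summation forms for a simple random variable can be compared directly. The sketch below, in Python, uses a hypothetical helper `distribution_of` as a stand-in for the consolidation step that the text attributes to csort; it is not the toolbox function. The values and probabilities are made up for illustration.

```python
from collections import defaultdict

def expect_g(values, probs, g):
    """E[g(X)] = sum_j g(c_j) P(C_j) for simple X in a primitive form."""
    return sum(g(c) * pc for c, pc in zip(values, probs))

def distribution_of(values, probs, g):
    """Consolidate equal values of Z = g(X); returns sorted pairs (v_k, P(Z = v_k))."""
    mass = defaultdict(float)
    for c, pc in zip(values, probs):
        mass[g(c)] += pc
    return sorted(mass.items())

X = [-2, -1, 0, 1, 2]                          # illustrative values c_j
PX = [0.1, 0.2, 0.4, 0.2, 0.1]                 # illustrative probabilities P(C_j)
g = lambda t: t * t

direct = expect_g(X, PX, g)                    # sum of g(c_j) P(C_j)
dist = distribution_of(X, PX, g)               # distribution of Z on the values 0, 1, 4
from_dist = sum(v * pv for v, pv in dist)      # sum of v_k P(Z = v_k)
print(direct, from_dist)
```

Both sums give the same number, since consolidation only regroups terms with equal values of \(g(c_j)\).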

For \(Z = g(X, Y)\),

\(E[Z] = E[g(X, Y)] = \int \int g(t, u) F_{XY} (dtdu) = \int v F_Z (dv)\)

In the absolutely continuous case

\(E[Z] = E[g(X,Y)] = \int \int g(t,u) f_{XY} (t, u) dudt = \int v f_Z (v) dv\)

For joint simple \(X,Y\) (Section on Expectation for Simple Random Variables)

\(E[Z] = E[g(X, Y)] = \sum_{i = 1}^{n} \sum_{j = 1}^{m} g(t_i, u_j) P(X = t_i, Y = u_j)\)

## Mechanical interpretation and approximation procedures

In elementary mechanics, since the total mass is one, the quantity \(E[X] = \int t f_X (t)\ dt\) is the location of the center of mass. This theoretically rigorous fact may be derived heuristically from an examination of the expectation for a simple approximating random variable. Recall the discussion of the m-procedure for discrete approximation in the unit on Distribution Approximations. The range of \(X\) is divided into equal subintervals. The values of the approximating random variable are at the midpoints of the subintervals. The associated probability is the probability mass in the subinterval, which is approximately \(f_X (t_i) dx\), where \(dx\) is the length of the subinterval. This approximation improves with an increasing number of subdivisions, with corresponding decrease in \(dx\). The expectation of the approximating simple random variable \(X_s\) is

\(E[X_s] = \sum_{i} t_i f_X(t_i) dx \approx \int tf_X(t)\ dt\)

The approximation improves with increasingly fine subdivisions. The center of mass of the approximating distribution approaches the center of mass of the smooth distribution.

It should be clear that a similar argument for \(g(X)\) leads to the integral expression

\(E[g(X)] = \int g(t) f_X (t)\ dt\)

This argument shows that we should be able to use tappr to set up for approximating the expectation \(E[g(X)]\) as well as for approximating \(P(g(X) \in M)\), etc. We return to this in Section.
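
The midpoint construction described above can be sketched in a few lines of Python. This is an illustration in the spirit of the m-procedure, not the tappr function itself; the helper names and the normalization of the masses to total one are assumptions made here.

```python
def midpoint_approx(density, a, b, n):
    """Simple approximation: n midpoints t_i on [a, b] with masses f(t_i) dx,
    normalized so that the masses sum to one."""
    dx = (b - a) / n
    t = [a + (i + 0.5) * dx for i in range(n)]
    w = [density(ti) * dx for ti in t]
    total = sum(w)
    return t, [wi / total for wi in w]

def expect(t, pt, g=lambda x: x):
    """E[g(X_s)] = sum_i g(t_i) P(X_s = t_i)."""
    return sum(g(ti) * pi for ti, pi in zip(t, pt))

# Symmetric triangular density on [0, 2]; the center of mass is at 1.
tri = lambda x: x if x <= 1 else 2 - x
t, pt = midpoint_approx(tri, 0.0, 2.0, 200)
mean = expect(t, pt)                           # close to (a + b)/2 = 1
second = expect(t, pt, lambda x: x * x)        # close to E[X^2] = 7/6
print(mean, second)
```

The same pair `t`, `pt` serves for any \(g\), which is the point of setting up the approximating distribution once.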

**Mean values for some absolutely continuous distributions**

**Uniform** on \([a, b]\). \(f_X (t) = \dfrac{1}{b-a}\), \(a \le t \le b\). The center of mass is at \((a + b)/2\). To calculate the value formally, we write

\(E[X] = \int tf_X (t) dt = \dfrac{1}{b - a} \int_{a}^{b} t dt = \dfrac{b^2 - a^2}{2(b - a)} = \dfrac{b + a}{2}\)

**Symmetric triangular** on \([a, b]\). The graph of the density is an isosceles triangle with base on the interval \([a, b]\). By symmetry, the center of mass, hence the expectation, is at the midpoint \((a + b)/2\).

**Exponential**(\(\lambda\)). \(f_X (t) = \lambda e^{-\lambda t}\), \(0 \le t\). Using a well-known definite integral (see __Appendix B__), we have

\(E[X] = \int tf_X(t)\ dt = \int_{0}^{\infty} \lambda te^{-\lambda t} dt = 1/\lambda\)

**Gamma**(\(\alpha, \lambda\)). \(f_X (t) = \dfrac{1}{\Gamma (\alpha)} t^{\alpha - 1} \lambda^{\alpha} e^{-\lambda t}\), \(0 \le t\). Again we use one of the integrals in __Appendix B__ to obtain

\(E[X] = \int tf_X (t)\ dt = \dfrac{1}{\Gamma (\alpha)} \int_{0}^{\infty} \lambda^{\alpha} t^{\alpha} e^{-\lambda t} dt = \dfrac{\Gamma(\alpha + 1)}{\lambda \Gamma (\alpha)} = \alpha/\lambda\)

The last equality comes from the fact that \(\Gamma (\alpha + 1) = \alpha \Gamma (\alpha)\).

**Beta**(\(r, s\)). \(f_X (t) = \dfrac{\Gamma (r + s)}{\Gamma (r) \Gamma (s)} t^{r - 1} (1 - t)^{s - 1}\), \(0 < t < 1\). We use the fact that

\(\int_{0}^{1} u^{r - 1} (1 - u)^{s - 1} \ du = \dfrac{\Gamma (r) \Gamma (s)}{\Gamma (r + s)}\), \(r > 0\), \(s > 0\).

\(E[X] = \int tf_X (t)\ dt = \dfrac{\Gamma (r + s)}{\Gamma (r) \Gamma (s)} \int_{0}^{1} t^r (1 - t)^{s - 1} dt = \dfrac{\Gamma (r + s)}{\Gamma (r) \Gamma (s)} \cdot \dfrac{\Gamma (r + 1) \Gamma (s)}{\Gamma (r + s + 1)} = \dfrac{r}{r + s}\)

**Weibull**(\(\alpha, \lambda, v\)). \(F_X (t) = 1 - e^{-\lambda (t - v)^{\alpha}}\), \(\alpha > 0\), \(\lambda > 0\), \(v \ge 0\), \(t \ge v\). Differentiation shows

\(f_X (t) = \alpha \lambda (t - v)^{\alpha - 1} e^{-\lambda (t -v)^{\alpha}}\), \(t \ge v\)

First, consider \(Y\) ~ exponential \((\lambda)\). For this random variable

\(E[Y^r] = \int_{0}^{\infty} t^r \lambda e^{-\lambda t}\ dt = \dfrac{\Gamma (r + 1)}{\lambda^r}\)

If \(Y\) is exponential (1), then techniques for functions of random variables show that \(\left[\dfrac{1}{\lambda} Y\right]^{1/\alpha} + v\) ~ Weibull (\(\alpha, \lambda, v\)). Hence,

\(E[X] = \dfrac{1}{\lambda ^{1/\alpha}} E[Y^{1/\alpha}] + v = \dfrac{1}{\lambda ^{1/\alpha}} \Gamma (\dfrac{1}{\alpha} + 1) + v\)

**Normal**(\(\mu, \sigma^2\)) The symmetry of the distribution about \(t = \mu\) shows that \(E[X] = \mu\). This, of course, may be verified by integration. A standard trick simplifies the work.

\(E[X] = \int_{-\infty}^{\infty} t f_X (t) \ dt = \int_{-\infty}^{\infty} (t - \mu) f_X (t) \ dt + \mu\)

We have used the fact that \(\int_{-\infty}^{\infty} f_X (t) \ dt = 1\). If we make the change of variable \(x = t-\mu\) in the last integral, the integrand becomes an odd function, so that the integral is zero. Thus, \(E[X] = \mu\).

## Properties and Computation

The properties in the table in __Appendix E__ constitute a powerful and convenient resource for the use of mathematical expectation. These are properties of the abstract Lebesgue integral, expressed in the notation for mathematical expectation.

\[E[g(X)] = \int g(X)\ dP\]

In the development of additional properties, the four basic properties: (E1) Expectation of indicator functions, (E2) Linearity, (E3) Positivity; monotonicity, and (E4a) Monotone convergence play a foundational role. We utilize the properties in the table, as needed, often referring to them by the numbers assigned in the table.

In this section, we include a number of examples which illustrate the use of various properties. Some are theoretical examples, deriving additional properties or displaying the basis and structure of some in the table. Others apply these properties to facilitate computation.

Example 11.2.4: Probability as expectation

Probability may be expressed entirely in terms of expectation.

- By property (E1) and positivity (E3a), \(P(A) = E[I_A] \ge 0\).
- As a special case of (E1), we have \(P(\Omega) = E[I_{\Omega}] = 1\).
- By the countable sums property (E8),

\(A = \bigvee_i A_i\) implies \(P(A) = E[I_A] = E[ \sum_{i} I_{A_i}] = \sum_i E[I_{A_i}] = \sum_i P(A_i)\)

Thus, the three defining properties for a probability measure are satisfied.

*Remark*. There are treatments of probability which characterize mathematical expectation with properties (E0) through (E4a), then define \(P(A) = E[I_A]\). Although such a development is quite feasible, it has not been widely adopted.

Example 11.2.5: An indicator function pattern

Suppose \(X\) is a real random variable and \(E = X^{-1} (M) =\{\omega: X(\omega) \in M\}\). Then

\(I_E = I_M (X)\)

To see this, note that \(X(\omega) \in M\) iff \(\omega \in E\), so that \(I_E(\omega) = 1\) iff \(I_M(X(\omega)) = 1\).

Similarly, if \(E = X^{-1} (M) \cap Y^{-1} (N)\), then \(I_E = I_M (X) I_N (Y)\). We thus have, by (E1),

\(P(X \in M) = E[I_M(X)]\) and \(P(X \in M, Y \in N) = E[I_M(X) I_N (Y)]\)

Example 11.2.6: Alternate interpretation of the mean value

\(E[(X - c)^2]\) is a minimum iff \(c = E[X]\), in which case \(E[(X - E[X])^2] = E[X^2] - E^2[X]\)

INTERPRETATION. If we approximate the random variable \(X\) by a constant \(c\), then for any \(\omega\) the error of approximation is \(X(\omega) - c\). The probability weighted average of the square of the error (often called the *mean squared error*) is \(E[(X - c)^2]\). This average squared error is smallest iff the approximating constant \(c\) is the mean value.

**verification**

We expand \((X - c)^2\) and apply linearity to obtain

\(E[(X - c)^2] = E[X^2 - 2cX + c^2] = E[X^2] - 2E[X] c + c^2\)

The last expression is a quadratic in \(c\) (since \(E[X^2]\) and \(E[X]\) are constants). The usual calculus treatment shows the expression has a minimum for \(c = E[X]\). Substitution of this value for \(c\) shows the expression reduces to \(E[X^2] - E^2[X]\).
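
A small numerical check of this minimizing property, written as a Python sketch. The distribution is the one given for \(X\) in Example 11.2.11; the grid of candidate constants and the helper name `mse` are assumptions made for illustration.

```python
def mse(values, probs, c):
    """Mean squared error E[(X - c)^2] for a simple random variable X."""
    return sum((x - c) ** 2 * px for x, px in zip(values, probs))

X = [0, 1, 2, 3, 4]
PX = [0.1, 0.3, 0.2, 0.3, 0.1]
EX = sum(x * px for x, px in zip(X, PX))           # E[X]
EX2 = sum(x * x * px for x, px in zip(X, PX))      # E[X^2]

grid = [i / 100 for i in range(0, 401)]            # candidate constants c in [0, 4]
best_c = min(grid, key=lambda c: mse(X, PX, c))    # grid point with the smallest MSE
print(best_c, EX, mse(X, PX, EX), EX2 - EX ** 2)
```

The grid minimizer coincides with \(E[X]\), and the minimum value matches \(E[X^2] - E^2[X]\).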

A number of inequalities are listed among the properties in the table. The basis for these inequalities is usually some standard analytical inequality on random variables to which the monotonicity property is applied. We illustrate with a derivation of the important Jensen's inequality.

Example 11.2.7: Jensen's inequality

If \(X\) is a real random variable and \(g\) is a convex function on an interval \(I\) which includes the range of \(X\), then

\(g(E[X]) \le E[g(X)]\)

**verification**

The function \(g\) is convex on \(I = [a, b]\) iff for each \(t_0 \in I\) there is a number \(\lambda (t_0)\) such that, for all \(t \in I\),

\(g(t) \ge g(t_0) + \lambda (t_0) (t - t_0)\)

This means there is a line through (\(t_0, g(t_0)\)) such that the graph of \(g\) lies on or above it. If \(a \le X \le b\), then by monotonicity \(E[a] = a \le E[X] \le E[b] = b\) (this is the mean value property (E11)). We may choose \(t_0 = E[X] \in I\). If we designate the constant \(\lambda (E[X])\) by \(c\), we have

\(g(X) \ge g(E[X]) + c(X - E[X])\)

Recalling that \(E[X]\) is a constant, we take expectation of both sides, using linearity and monotonicity, to get

\(E[g(X)] \ge g(E[X]) + c(E[X] - E[X]) = g(E[X])\)

*Remark*. It is easy to show that the function \(\lambda (\cdot)\) is nondecreasing. This fact is used in establishing Jensen's inequality for conditional expectation.
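
Jensen's inequality is easy to observe numerically for a simple random variable. The Python sketch below reuses the distribution given for \(X\) in Example 11.2.11; the three convex functions are chosen here only as illustrations.

```python
import math

def expect(values, probs, g=lambda t: t):
    """E[g(X)] for a simple random variable X."""
    return sum(g(x) * px for x, px in zip(values, probs))

X = [0, 1, 2, 3, 4]
PX = [0.1, 0.3, 0.2, 0.3, 0.1]
EX = expect(X, PX)

convex = {
    "t^2": lambda t: t * t,
    "exp(t)": math.exp,
    "|t - 1|": lambda t: abs(t - 1),
}
# Gap E[g(X)] - g(E[X]) for each convex g; Jensen's inequality says it is nonnegative.
gaps = {name: expect(X, PX, g) - g(EX) for name, g in convex.items()}
for name, gap in gaps.items():
    print(name, gap)
```

For \(g(t) = t^2\) the gap is the variance \(E[X^2] - E^2[X]\).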

*The product rule for expectations of independent random variables*

Example 11.2.8: product rule for simple random variables

Consider an independent pair \(\{X, Y\}\) of simple random variables

\(X = \sum_{i = 1}^{n} t_i I_{A_i}\) \(Y = \sum_{j = 1}^{m} u_j I_{B_j}\) (both in canonical form)

We know that each pair \(\{A_i, B_j\}\) is independent, so that \(P(A_i B_j) = P(A_i) P(B_j)\). Consider the product \(XY\). According to the pattern described after Example 9 from "Mathematical Expectation: Simple Random Variables,"

\(XY = \sum_{i = 1}^{n} t_i I_{A_i} \sum_{j = 1}^{m} u_j I_{B_j} = \sum_{i = 1}^{n} \sum_{j = 1}^{m} t_i u_j I_{A_i B_j}\)

The latter double sum is a primitive form, so that

\(E[XY] = \sum_{i = 1}^{n} \sum_{j = 1}^{m} t_i u_j P(A_i B_j) = \sum_{i = 1}^{n} \sum_{j = 1}^{m} t_i u_j P(A_i) P(B_j) = (\sum_{i = 1}^{n} t_i P(A_i)) (\sum_{j = 1}^{m} u_j P(B_j)) = E[X]E[Y]\)

Thus the product rule holds for independent simple random variables.

Example 11.2.9: approximating simple functions for an independent pair

Suppose \(\{X, Y\}\) is an independent pair, with an approximating simple pair \(\{X_s, Y_s\}\). As functions of \(X\) and \(Y\), respectively, the pair \(\{X_s, Y_s\}\) is independent. According to Example, above, the product rule \(E[X_s Y_s] = E[X_s] E[Y_s]\) must hold.

Example 11.2.10. product rule for an independent pair

For \(X \ge 0\), \(Y \ge 0\), there exist nondecreasing sequences \(\{X_n: 1 \le n\}\) and \(\{Y_n: 1 \le n\}\) of simple random variables increasing to \(X\) and \(Y\), respectively. The sequence \(\{X_n Y_n: 1 \le n\}\) is also a nondecreasing sequence of simple random variables, increasing to \(XY\). By the monotone convergence theorem (MC)

\(E[X_n] \nearrow E[X]\), \(E[Y_n] \nearrow E[Y]\), and \(E[X_n Y_n] \nearrow E[XY]\)

Since \(E[X_n Y_n] = E[X_n] E[Y_n]\) for each \(n\), we conclude \(E[XY] = E[X] E[Y]\)

In the general case,

\(XY = (X^{+} - X^{-}) (Y^{+} - Y^{-}) = X^{+}Y^{+} - X^{+} Y^{-} - X^{-} Y^{+} + X^{-} Y^{-}\)

Application of the product rule to each nonnegative pair and the use of linearity gives the product rule for the pair \(\{X, Y\}\)

*Remark*. It should be apparent that the product rule can be extended to any finite independent class.

Example 11.2.11: the joint distribution of three random variables

The class \(\{X, Y, Z\}\) is independent, with the marginal distributions shown below. Let

\(W = g(X, Y, Z) = 3X^2 + 2XY - 3XYZ\). Determine \(E[W]\).

    X = 0:4;
    Y = 1:2:7;
    Z = 0:3:12;
    PX = 0.1*[1 3 2 3 1];
    PY = 0.1*[2 2 3 3];
    PZ = 0.1*[2 2 1 3 2];

    icalc3                                % Setup for joint dbn for {X,Y,Z}
    Enter row matrix of X-values  X
    Enter row matrix of Y-values  Y
    Enter row matrix of Z-values  Z
    Enter X probabilities  PX
    Enter Y probabilities  PY
    Enter Z probabilities  PZ
    Use array operations on matrices X, Y, Z,
    PX, PY, PZ, t, u, v, and P
    EX = X*PX'                            % E[X]
    EX =  2
    EX2 = (X.^2)*PX'                      % E[X^2]
    EX2 = 5.4000
    EY = Y*PY'                            % E[Y]
    EY =  4.4000
    EZ = Z*PZ'                            % E[Z]
    EZ =  6.3000
    G = 3*t.^2 + 2*t.*u - 3*t.*u.*v;      % W = g(X,Y,Z) = 3X^2 + 2XY - 3XYZ
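
The expectation \(E[W]\) can be obtained in two ways from the data above: by summing over the joint distribution, and by linearity together with the product rule for the independent class. The following Python sketch does both; it is an independent illustration and does not use icalc3.

```python
from itertools import product

X, PX = [0, 1, 2, 3, 4], [0.1, 0.3, 0.2, 0.3, 0.1]
Y, PY = [1, 3, 5, 7], [0.2, 0.2, 0.3, 0.3]
Z, PZ = [0, 3, 6, 9, 12], [0.2, 0.2, 0.1, 0.3, 0.2]

def g(t, u, v):
    """W = g(X, Y, Z) = 3X^2 + 2XY - 3XYZ."""
    return 3 * t**2 + 2 * t * u - 3 * t * u * v

# Joint enumeration: by independence P(X=t, Y=u, Z=v) = P(X=t) P(Y=u) P(Z=v).
EW_joint = sum(
    g(t, u, v) * pt * pu * pv
    for (t, pt), (u, pu), (v, pv) in product(zip(X, PX), zip(Y, PY), zip(Z, PZ))
)

# Linearity plus the product rule: E[W] = 3E[X^2] + 2E[X]E[Y] - 3E[X]E[Y]E[Z].
mean = lambda vals, probs: sum(a * b for a, b in zip(vals, probs))
EX, EY, EZ = mean(X, PX), mean(Y, PY), mean(Z, PZ)
EX2 = mean([t * t for t in X], PX)
EW_rule = 3 * EX2 + 2 * EX * EY - 3 * EX * EY * EZ
print(EW_joint, EW_rule)
```

With the marginal values \(E[X] = 2\), \(E[X^2] = 5.4\), \(E[Y] = 4.4\), \(E[Z] = 6.3\) shown in the transcript, the product-rule expression is \(3(5.4) + 2(2)(4.4) - 3(2)(4.4)(6.3) = -132.52\).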

Example 11.2.12. a function with a compound definition: truncated exponential

Suppose \(X\) ~ exponential (0.3). Let

\(Z = \begin{cases} X^2 & \text{for } X \le 4 \\ 16 & \text{for } X > 4 \end{cases} = I_{[0, 4]} (X) X^2 + I_{(4, \infty)} (X) 16\)

Determine \(E[Z]\).

**Analytic Solution**

\(E[g(X)] = \int g(t) f_X (t) \ dt = \int_{0}^{\infty} I_{[0, 4]} (t) t^2 0.3 e^{-0.3t}\ dt + 16 E[I_{(4, \infty)} (X)]\)

\(= \int_{0}^{4} t^2 0.3 e^{-0.3t}\ dt + 16 P(X > 4) \approx 7.4972\) (by Maple)

APPROXIMATION

To obtain a simple approximation, we must approximate the exponential by a bounded random variable. Since \(P(X > 50) = e^{-15} \approx 3 \cdot 10^{-7}\), we may safely truncate \(X\) at 50.

    tappr
    Enter matrix [a b] of x-range endpoints  [0 50]
    Enter number of x approximation points  1000
    Enter density as a function of t  0.3*exp(-0.3*t)
    Use row matrices X and PX as in the simple case
    M = X <= 4;
    G = M.*X.^2 + 16*(1 - M);    % g(X)
    EG = G*PX'                   % E[g(X)]
    EG =  7.4972
    [Z,PZ] = csort(G,PX);        % Distribution for Z = g(X)
    EZ = Z*PZ'                   % E[Z] from distribution
    EZ =  7.4972

Because of the large number of approximation points, the results agree quite closely with the theoretical value.
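
The same comparison can be reproduced without the toolbox. The Python sketch below evaluates the analytic expression in closed form (using integration by parts for \(\int_0^4 t^2 \lambda e^{-\lambda t}\ dt\)) and a midpoint approximation on \([0, 50]\) with 1000 points. The normalization of the approximating masses is an assumption made here and may differ in detail from tappr.

```python
import math

lam, cut, cap = 0.3, 4.0, 16.0

def g(t):
    """Z = g(X): X^2 for X <= 4, and 16 for X > 4."""
    return t * t if t <= cut else cap

# Closed form: int_0^4 t^2 lam e^{-lam t} dt + 16 P(X > 4), with P(X > 4) = e^{-lam*4}.
e = math.exp(-lam * cut)
integral = 2 / lam**2 - e * (cut**2 + 2 * cut / lam + 2 / lam**2)
exact = integral + cap * e

# Midpoint approximation on [0, 50] with 1000 points, masses normalized to one.
a, b, n = 0.0, 50.0, 1000
dx = (b - a) / n
t = [a + (i + 0.5) * dx for i in range(n)]
w = [lam * math.exp(-lam * ti) * dx for ti in t]
approx = sum(g(ti) * wi for ti, wi in zip(t, w)) / sum(w)
print(exact, approx)
```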

Example 11.2.13. stocking for random demand (see exercise 4 from "Problems on functions of random variables")

The manager of a department store is planning for the holiday season. A certain item costs \(c\) dollars per unit and sells for \(p\) dollars per unit. If the demand exceeds the amount \(m\) ordered, additional units can be special ordered for \(s\) dollars per unit \((s > c)\). If demand is less than the amount ordered, the remaining stock can be returned (or otherwise disposed of) at \(r\) dollars per unit (\(r < c\)). Demand \(D\) for the season is assumed to be a random variable with Poisson (\(\mu\)) distribution. Suppose \(\mu = 50\), \(c = 30\), \(p = 50\), \(s = 40\), \(r = 20\). What amount \(m\) should the manager order to maximize the expected profit?

PROBLEM FORMULATION

Suppose \(D\) is the demand and \(X\) is the profit. Then

For \(D \le m\), \(X = D(p - c) - (m - D) (c - r) = D(p - r) + m (r - c)\)

For \(D > m\), \(X = m(p - c) + (D - m) (p - s) = D(p - s) + m(s - c)\)

It is convenient to write the expression for \(X\) in terms of \(I_M\), where \(M = (-\infty, m]\). Thus

\(X = I_M (D) [D (p - r) + m(r - c)] + [1 - I_M(D)] [D(p - s) + m (s - c)]\)

\(= D(p - s) + m(s - c) + I_M(D) [D(p - r) + m(r - c) - D(p - s) - m(s - c)]\)

\(= D(p - s) + m(s - c) + I_M(D) (s - r) (D - m)\)

Then \(E[X] = (p - s) E[D] + m(s - c) + (s - r) E[I_M(D) D] - (s - r) m E[I_M (D)]\).

**Analytic Solution**

For \(D\) ~ Poisson (\(\mu\)), \(E[D] = \mu\) and \(E[I_M(D)] = P(D \le m)\)

\(E[I_M(D) D] = e^{-\mu} \sum_{k = 1}^{m} k \dfrac{\mu^k}{k!} = \mu e^{-\mu} \sum_{k = 1}^{m} \dfrac{\mu^{k - 1}}{(k - 1)!} = \mu P(D \le m - 1)\)

Hence,

\(E[X] = (p - s) E[D] + m(s - c) + (s - r) E[I_M (D) D] - (s - r) m E[I_M(D)]\)

\(= (p - s)\mu + m(s - c) + (s - r) \mu P(D \le m - 1) - (s - r) m P(D \le m)\)

Because of the discrete nature of the problem, we cannot solve for the optimum \(m\) by ordinary calculus. We may solve for various \(m\) about \(m = \mu\) and determine the optimum. We do so with the aid of MATLAB and the m-function cpoisson.

    mu = 50; c = 30; p = 50; s = 40; r = 20;
    m = 45:55;
    EX = (p - s)*mu + m*(s - c) + (s - r)*mu*(1 - cpoisson(mu,m)) ...
         - (s - r)*m.*(1 - cpoisson(mu,m+1));
    disp([m;EX]')
       45.0000  930.8604
       46.0000  935.5231
       47.0000  939.1895
       48.0000  941.7962
       49.0000  943.2988
       50.0000  943.6750     % Optimum m = 50
       51.0000  942.9247
       52.0000  941.0699
       53.0000  938.1532
       54.0000  934.2347
       55.0000  929.3886

A direct solution may be obtained by MATLAB, using finite approximation for the Poisson distribution.

APPROXIMATION

    ptest = cpoisson(mu,100)           % Check for suitable value of n
    ptest =  3.2001e-10
    n = 100;
    t = 0:n;
    pD = ipoisson(mu,t);
    for i = 1:length(m)                % Step by step calculation for various m
      M = t > m(i);
      G(i,:) = t*(p - r) - M.*(t - m(i))*(s - r) - m(i)*(c - r);
    end
    EG = G*pD';                        % Values agree with theoretical to four decimals

An advantage of the second solution, based on simple approximation to \(D\), is that the distribution of gain for each \(m\) could be studied, e.g., the maximum and minimum gains.
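
The analytic expression for \(E[X]\) can also be evaluated with a few lines of Python. In this sketch the hypothetical helper `poisson_cdf` plays the role of \(1 -\) cpoisson in the MATLAB session: it returns \(P(D \le m)\) by direct summation. It is an illustration, not the toolbox function.

```python
import math

mu, c, p, s, r = 50, 30, 50, 40, 20

def poisson_cdf(mu, m):
    """P(D <= m) for D ~ Poisson(mu), by direct summation of the pmf."""
    term, total = math.exp(-mu), 0.0
    for k in range(m + 1):
        total += term               # add P(D = k)
        term *= mu / (k + 1)        # P(D = k+1) = P(D = k) mu / (k+1)
    return total

def expected_profit(m):
    """E[X] = (p-s) mu + m(s-c) + (s-r) mu P(D <= m-1) - (s-r) m P(D <= m)."""
    return ((p - s) * mu + m * (s - c)
            + (s - r) * mu * poisson_cdf(mu, m - 1)
            - (s - r) * m * poisson_cdf(mu, m))

profits = {m: expected_profit(m) for m in range(45, 56)}
best_m = max(profits, key=profits.get)
for m, ex in profits.items():
    print(m, round(ex, 4))
print("optimum m:", best_m)
```

A useful side observation: from the formula, \(E[X]\) at \(m + 1\) minus \(E[X]\) at \(m\) equals \((s - c) - (s - r) P(D \le m)\), so the expected profit increases as long as \(P(D \le m) < (s - c)/(s - r) = 1/2\) for these parameters.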

— □

Example 11.2.14. a jointly distributed pair

Suppose the pair \(\{X, Y\}\) has joint density \(f_{XY} (t, u) = 3u\) on the triangular region bounded by \(u = 0\), \(u = 1 + t\), \(u = 1 - t\) (see Figure 11.2.1). Let \(Z = g(X, Y) = X^2 + 2XY\). Determine \(E[Z]\).

**Figure 11.2.1**. The density for Example 11.2.14.

**Analytic Solution**

\(E[Z] = \int \int (t^2 + 2tu) f_{XY} (t, u) \ dudt\)

\(= 3 \int_{-1}^{0} \int_{0}^{1 + t} (t^2 u + 2tu^2) \ dudt + 3 \int_{0}^{1} \int_{0}^{1 - t} (t^2 u + 2tu^2)\ dudt = 1/10\)

APPROXIMATION

    tuappr
    Enter matrix [a b] of X-range endpoints  [-1 1]
    Enter matrix [c d] of Y-range endpoints  [0 1]
    Enter number of X approximation points  400
    Enter number of Y approximation points  200
    Enter expression for joint density  3*u.*(u<=min(1+t,1-t))
    Use array operations on X, Y, PX, PY, t, u, and P
    G = t.^2 + 2*t.*u;         % g(X,Y) = X^2 + 2XY
    EG = total(G.*P)           % E[g(X,Y)]
    EG =  0.1006               % Theoretical value = 1/10
    [Z,PZ] = csort(G,P);       % Distribution for Z
    EZ = Z*PZ'                 % E[Z] from distribution
    EZ =  0.1006
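
A grid approximation of this kind can be sketched in plain Python. The function `grid_expectation` below is a hypothetical stand-in written for illustration; it places mass proportional to the density at cell midpoints and normalizes, which may differ in detail from what tuappr does.

```python
def grid_expectation(density, g, a, b, c, d, nx, ny):
    """Approximate E[g(X, Y)] on the midpoints of an nx-by-ny grid over
    [a, b] x [c, d], with point masses proportional to the density."""
    dx, dy = (b - a) / nx, (d - c) / ny
    num = den = 0.0
    for i in range(nx):
        t = a + (i + 0.5) * dx
        for j in range(ny):
            u = c + (j + 0.5) * dy
            w = density(t, u)
            num += g(t, u) * w      # accumulates g times mass
            den += w                # accumulates total mass for normalization
    return num / den

# Density 3u on the triangle bounded by u = 0, u = 1 + t, u = 1 - t.
f = lambda t, u: 3 * u if u <= min(1 + t, 1 - t) else 0.0
g = lambda t, u: t**2 + 2 * t * u
EZ = grid_expectation(f, g, -1.0, 1.0, 0.0, 1.0, 400, 200)
print(EZ)                           # close to the theoretical value 1/10
```

The small discrepancy from \(1/10\) comes mainly from the grid cells cut by the slanted boundaries of the triangle.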

Example 11.2.15. A function with a compound definition

The pair \(\{X, Y\}\) has joint density \(f_{XY} (t, u) = 1/2\) on the square region bounded by \(u = 1 + t\), \(u = 1 - t\), \(u = 3 - t\), and \(u = t - 1\) (see Figure 11.2.2).

\(W = \begin{cases} X & \text{for max } \{X, Y\} \le 1 \\ 2Y & \text{for max } \{X, Y\} > 1 \end{cases} = I_Q (X, Y) X + I_{Q^c} (X,Y) 2Y\)

where \(Q = \{(t, u): \text{max } \{t, u\} \le 1\} = \{(t, u): t \le 1, u \le 1\}\). Determine \(E[W]\).

**Figure 11.2.2**. The density for Example 11.2.15

**Analytic Solution**

The intersection of the region \(Q\) and the square is the set for which \(0 \le t \le 1\) and \(1 - t \le u \le 1\). Reference to the figure shows three regions of integration.

\(E[W] = \dfrac{1}{2} \int_0^1 \int_{1 - t}^{1} t\ dudt + \dfrac{1}{2} \int_{0}^{1} \int_{1}^{1 + t} 2u\ dudt + \dfrac{1}{2} \int_{1}^{2} \int_{t - 1}^{3 - t} 2u \ dudt = 11/6 \approx 1.8333\)

APPROXIMATION

    tuappr
    Enter matrix [a b] of X-range endpoints  [0 2]
    Enter matrix [c d] of Y-range endpoints  [0 2]
    Enter number of X approximation points  200
    Enter number of Y approximation points  200
    Enter expression for joint density  ((u<=min(t+1,3-t))& ...
        (u>=max(1-t,t-1)))/2
    Use array operations on X, Y, PX, PY, t, u, and P
    M = max(t,u)<=1;
    G = t.*M + 2*u.*(1 - M);   % Z = g(X,Y)
    EG = total(G.*P)           % E[g(X,Y)]
    EG =  1.8340               % Theoretical 11/6 = 1.8333
    [Z,PZ] = csort(G,P);       % Distribution for Z
    EZ = dot(Z,PZ)             % E[Z] from distribution
    EZ =  1.8340

**Special forms for expectation**

The various special forms related to property __(E20a)__ are often useful. The general result, which we do not need, is usually derived by an argument which employs a general form of what is known as Fubini's theorem. The special form __(E20b)__

\(E[X] = \int_{-\infty}^{\infty} [u(t) - F_X (t)]\ dt\)

may be derived from __(E20a)__ by use of integration by parts for Stieltjes integrals. However, we use the relationship between the graph of the distribution function and the graph of the quantile function to show the equivalence of __(E20b)__ and __(E20f)__. The latter property is readily established by elementary arguments.

Example 11.2.16. The property (E20f)

If \(Q\) is the quantile function for the distribution function \(F_X\), then

\(E[g(X)] = \int_{0}^{1} g[Q(u)]\ du\)

**VERIFICATION**

If \(Y = Q(U)\), where \(U\) ~ uniform on (0, 1), then \(Y\) has the same distribution as \(X\). Hence,

\(E[g(X)] = E[g(Q(U))] = \int g(Q(u)) f_U (u)\ du = \int_{0}^{1} g(Q(u))\ du\)
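
A numerical check may make the argument concrete. The sketch below is our own illustration under assumed choices: \(X\) ~ exponential (\(\lambda = 2\)), whose quantile function is \(Q(u) = -\ln (1 - u)/\lambda\), and \(g(x) = x^2\), for which \(E[g(X)] = 2/\lambda^2\). It compares a simulation of \(g(Q(U))\) and a midpoint-rule evaluation of the integral with that exact value.

```python
import math
import random

def quantile_exponential(u, lam):
    """Quantile function Q(u) = -ln(1 - u) / lam of the exponential(lam) distribution."""
    return -math.log(1.0 - u) / lam

def g(x):
    return x * x

lam = 2.0               # assumed rate parameter
rng = random.Random(0)  # fixed seed so the run is repeatable

# Simulation: Y = Q(U), with U uniform on (0, 1), has the distribution of X.
n = 200_000
simulated = sum(g(quantile_exponential(rng.random(), lam)) for _ in range(n)) / n

# Quadrature: midpoint rule for the integral of g(Q(u)) over (0, 1).
m = 200_000
quadrature = sum(g(quantile_exponential((k + 0.5) / m, lam)) for k in range(m)) / m

exact = 2.0 / lam ** 2  # E[X^2] for the exponential(lam) distribution
print(simulated, quadrature, exact)
```

The simulated value carries sampling error, so it agrees with the exact value only to within a few multiples of its standard error; the quadrature value is deterministic.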

Example 11.2.17. Reliability and expectation

In reliability, if \(X\) is the life duration (time to failure) for a device, the reliability function gives the probability that the device is still operative at time \(t\). Thus

\(R(t) = P(X > t) = 1 - F_X(t)\)

Since \(X \ge 0\), so that \(F_X (t) = 0\) for \(t < 0\), property (E20b) gives

\(E[X] = \int_{0}^{\infty} R(t) \ dt\)
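
As a brief illustration (the distribution is our own choice, not part of the example), suppose \(X\) ~ exponential (\(\lambda\)), so that \(F_X (t) = 1 - e^{-\lambda t}\) for \(t \ge 0\). Then \(R(t) = e^{-\lambda t}\) and

\(E[X] = \int_{0}^{\infty} e^{-\lambda t}\ dt = \dfrac{1}{\lambda}\)

which agrees with the familiar mean of the exponential distribution.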

Example 11.2.18. Use of the quantile function

Suppose \(F_X (t) = t^a\), \(a > 0\), \(0 \le t \le 1\). Then \(Q(u) = u^{1/a}\), \(0 \le u \le 1\).

\(E[X] = \int_{0}^{1} u^{1/a} \ du = \dfrac{1}{1 + 1/a} = \dfrac{a}{a + 1}\)

The same result could be obtained by using \(f_X(t) = F_{X}^{'} (t)\) and evaluating \(\int t f_X (t)\ dt\).
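
Spelled out, that alternative computation runs as follows. Differentiation gives \(f_X (t) = a t^{a - 1}\) for \(0 \le t \le 1\) (and zero elsewhere), so that

\(E[X] = \int_{0}^{1} t \cdot a t^{a - 1}\ dt = a \int_{0}^{1} t^{a}\ dt = \dfrac{a}{a + 1}\)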

Example 11.2.19. Equivalence of (E20b) and (E20f)

For the special case \(g(X) = X\), Figure 11.2.3(a) shows that \(\int_{0}^{1} Q(u) \ du\) is the difference in the shaded areas

\(\int_{0}^{1} Q(u)\ du = \text{Area } A - \text{Area } B\)

The corresponding graph of the distribution function *F* is shown in Figure 11.2.3(b). Because of the construction, the areas of the regions marked \(A\) and \(B\) are the same in the two figures. As may be seen,

\(\text{Area } A = \int_{0}^{\infty} [1 - F(t)]\ dt\) and \(\text{Area } B = \int_{-\infty}^{0} F(t)\ dt\)

Use of the unit step function \(u(t) = 1\) for \(t > 0\) and 0 for \(t < 0\) (defined arbitrarily at \(t = 0\)) enables us to combine the two expressions to get

\(\int_{0}^{1} Q(u)\ du = \text{Area } A - \text{Area } B = \int_{-\infty}^{\infty} [u(t) - F(t)]\ dt\)

Figure 11.2.3. Equivalence of properties (E20b) and (E20f).
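
The area argument can be checked numerically for a distribution that takes both negative and positive values. The sketch below is our own illustration, assuming \(X\) ~ uniform on \((-1, 2)\), for which \(F(t) = (t + 1)/3\) on that interval, \(Q(u) = 3u - 1\), and \(E[X] = 1/2\). The helper `midpoint_integral` is a hypothetical name.

```python
def midpoint_integral(f, a, b, n=100_000):
    """Midpoint-rule approximation of the integral of f over [a, b]."""
    h = (b - a) / n
    return h * sum(f(a + (k + 0.5) * h) for k in range(n))

# Assumed test distribution: X uniform on (-1, 2).
def F(t):
    """Distribution function of the uniform (-1, 2) distribution."""
    return min(1.0, max(0.0, (t + 1.0) / 3.0))

def Q(u):
    """Quantile function: the inverse of F on (0, 1)."""
    return 3.0 * u - 1.0

# Integral of Q over (0, 1), as in property (E20f).
quantile_side = midpoint_integral(Q, 0.0, 1.0)

# Area A: 1 - F vanishes for t >= 2, so integrating over [0, 2] suffices.
area_a = midpoint_integral(lambda t: 1.0 - F(t), 0.0, 2.0)

# Area B: F vanishes for t <= -1, so integrating over [-1, 0] suffices.
area_b = midpoint_integral(F, -1.0, 0.0)

print(quantile_side, area_a - area_b)  # both should be close to E[X] = 1/2
```

Here Area \(A = 2/3\) and Area \(B = 1/6\), so the difference reproduces the mean \(1/2\) obtained from the quantile integral.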

Property (E20c) is a direct result of linearity and (E20b), with the unit step functions cancelling out.

Example 11.2.20. Property (E20d): useful inequalities

Suppose \(X \ge 0\). Then

\(\sum_{n = 0}^{\infty} P(X \ge n + 1) \le E[X] \le \sum_{n = 0}^{\infty} P(X \ge n) \le N \sum_{k = 0}^{\infty} P(X \ge kN)\), for all \(N \ge 1\)

**VERIFICATION**

For \(X \ge 0\), by (E20b)

\(E[X] = \int_{0}^{\infty} [1 - F(t)]\ dt = \int_{0}^{\infty} P(X > t)\ dt\)

Since \(F\) can have only a countable number of jumps on any interval and \(P(X > t)\) and \(P(X \ge t)\) differ only at jump points, we may assert

\(\int_{a}^{b} P(X > t)\ dt = \int_{a}^{b} P(X \ge t)\ dt\)

For each nonnegative integer \(n\), let \(E_n = [n, n + 1)\). Since these intervals are disjoint and cover \([0, \infty)\), countable additivity of the integral gives

\(E[X] = \int_{0}^{\infty} P(X \ge t)\ dt = \sum_{n = 0}^{\infty} \int_{E_n} P(X \ge t) \ dt \)

Since \(P(X \ge t)\) is decreasing with \(t\) and each \(E_n\) has unit length, the integral over \(E_n\) lies between the values of the integrand at the endpoints:

\(P(X \ge n + 1) \le \int_{E_n} P(X \ge t) \ dt \le P(X \ge n)\)

Summing over \(n\) gives the first two inequalities.

The third inequality follows from the fact that

\(\int_{kN}^{(k + 1)N} P(X \ge t) \ dt \le N \int_{E_{kN}} P(X \ge t) \ dt \le NP(X \ge kN)\)

*Remark*. Property __(E20d)__ is used primarily for theoretical purposes. The special case __(E20e)__ is more frequently used.
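
The chain of inequalities can be checked numerically for a particular case. The sketch below is our own illustration, assuming \(X\) ~ exponential (\(\lambda = 0.5\)), so that \(P(X \ge t) = e^{-\lambda t}\) and \(E[X] = 2\), with \(N = 3\). The infinite sums are truncated at a point beyond which the omitted terms are negligible.

```python
import math

lam = 0.5    # assumed rate; E[X] = 1 / lam = 2
N = 3        # any integer N >= 1 may be used
terms = 400  # truncation point; the omitted tail is negligible here

def tail(t):
    """P(X >= t) for X exponential(lam), t >= 0."""
    return math.exp(-lam * t)

lower = sum(tail(n + 1) for n in range(terms))       # sum of P(X >= n + 1)
mean = 1.0 / lam                                     # E[X]
upper = sum(tail(n) for n in range(terms))           # sum of P(X >= n)
coarse = N * sum(tail(k * N) for k in range(terms))  # N * sum of P(X >= kN)

print(lower, mean, upper, coarse)
```

The four printed values should appear in increasing order, matching the order of the terms in (E20d).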

Example 11.2.21. Property (E20e)

If \(X\) is nonnegative and integer valued, then

\(E[X] = \sum_{k = 1}^{\infty} P(X \ge k) = \sum_{k = 0}^{\infty} P(X > k)\)

**VERIFICATION**

The result follows as a special case of __(E20d)__. For integer valued random variables,

\(P(X \ge t) = P(X \ge n + 1) = P(X > n)\) for \(n < t \le n + 1\), so that \(\int_{n}^{n + 1} P(X \ge t)\ dt = P(X \ge n + 1)\) for each \(n \ge 0\)

An elementary derivation of __(E20e)__ can be constructed as follows.

Example 11.2.22. (E20e) for integer-valued random variables

By definition

\(E[X] = \sum_{k = 1}^{\infty} kP(X = k) = \lim_{n \to \infty} \sum_{k = 1}^{n} kP(X =k)\)

Now for each finite \(n\),

\(\sum_{k = 1}^{n} kP(X = k) = \sum_{k = 1}^{n} \sum_{j = 1}^{k} P(X = k) = \sum_{j = 1}^{n} \sum_{k = j}^{n} P(X = k) = \sum_{j = 1}^{n} P(j \le X \le n)\)

Since \(P(j \le X \le n)\) increases to \(P(X \ge j)\) as \(n \to \infty\), taking limits yields the desired result.
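
The interchange of the order of summation can be checked directly on a distribution with finitely many values. The sketch below is our own illustration, assuming \(X\) ~ binomial (10, 0.3), for which \(E[X] = 3\).

```python
from math import comb

n, p = 10, 0.3  # assumed example: X ~ binomial(10, 0.3), so E[X] = n * p = 3
pmf = [comb(n, k) * p**k * (1 - p)**(n - k) for k in range(n + 1)]

# Definition: sum over k of k * P(X = k).
by_definition = sum(k * pmf[k] for k in range(n + 1))

# Tail form: sum over j >= 1 of P(X >= j), each tail being a partial sum of the pmf.
by_tails = sum(sum(pmf[j:]) for j in range(1, n + 1))

print(by_definition, by_tails)
```

Both sums add up the same terms \(P(X = k)\), each counted \(k\) times, so they agree up to floating-point rounding.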

Example 11.2.23. The geometric distribution

Suppose \(X\) ~ geometric (\(p\)). Then \(P(X \ge k) = q^k\). Use of (E20e) gives

\(E[X] = \sum_{k = 1}^{\infty} q^k = q \sum_{k = 0}^{\infty} q^k = \dfrac{q}{1 - q} = q/p\)
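
As a check, the tail sum can be compared with the sum from the definition. The relation \(P(X \ge k) = q^k\) implies \(P(X = k) = q^k - q^{k + 1} = pq^k\) for \(k \ge 0\). The sketch below is our own illustration, assuming \(p = 0.25\) and truncating both series where the remaining terms are negligible.

```python
p = 0.25      # assumed success probability
q = 1.0 - p
terms = 2000  # truncation point; the omitted terms are negligible

# Tail form (E20e): sum over k >= 1 of P(X >= k) = q**k.
by_tails = sum(q**k for k in range(1, terms))

# Definition: sum over k >= 0 of k * P(X = k), with P(X = k) = p * q**k.
by_definition = sum(k * p * q**k for k in range(terms))

print(by_tails, by_definition, q / p)
```

All three printed values should agree with \(q/p = 3\) up to rounding.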