4.8: Expected Value and Covariance Matrices
The main purpose of this section is a discussion of expected value and covariance for random matrices and vectors. These topics are somewhat specialized, but are particularly important in multivariate statistical models and for the multivariate normal distribution. This section requires some prerequisite knowledge of linear algebra.
We assume that the various indices \( m, \, n, p, k \) that occur in this section are positive integers. Also we assume that expected values of real-valued random variables that we reference exist as real numbers, although extensions to cases where expected values are \(\infty\) or \(-\infty\) are straightforward, as long as we avoid the dreaded indeterminate form \(\infty - \infty\).
Basic Theory
Linear Algebra
We will follow our usual convention of denoting random variables by upper case letters and nonrandom variables and constants by lower case letters. In this section, that convention leads to notation that is a bit nonstandard, since the objects that we will be dealing with are vectors and matrices. On the other hand, the notation we will use works well for illustrating the similarities between results for random matrices and the corresponding results in the one-dimensional case. Also, we will try to be careful to explicitly point out the underlying spaces where various objects live.
Let \(\R^{m \times n}\) denote the space of all \(m \times n\) matrices of real numbers. The \( (i, j) \) entry of \( \bs{a} \in \R^{m \times n} \) is denoted \( a_{i j} \) for \( i \in \{1, 2, \ldots, m\} \) and \( j \in \{1, 2, \ldots, n\} \). We will identify \(\R^n\) with \(\R^{n \times 1}\), so that an ordered \(n\)-tuple can also be thought of as an \(n \times 1\) column vector. The transpose of a matrix \(\bs{a} \in \R^{m \times n}\) is denoted \(\bs{a}^T\); this is the \( n \times m \) matrix whose \( (i, j) \) entry is the \( (j, i) \) entry of \( \bs{a} \). Recall the definitions of matrix addition, scalar multiplication, and matrix multiplication. Recall also the standard inner product (or dot product) of \( \bs{x}, \, \bs{y} \in \R^n \): \[ \langle \bs{x}, \bs{y} \rangle = \bs{x} \cdot \bs{y} = \bs{x}^T \bs{y} = \sum_{i=1}^n x_i y_i \] The outer product of \( \bs{x} \) and \(\bs{y}\) is \( \bs{x} \bs{y}^T \), the \( n \times n \) matrix whose \( (i, j) \) entry is \( x_i y_j \). Note that the inner product is the trace (sum of the diagonal entries) of the outer product. Finally, recall the standard norm on \( \R^n \), given by \[ \|\bs{x}\| = \sqrt{\langle \bs{x}, \bs{x}\rangle} = \sqrt{x_1^2 + x_2^2 + \cdots + x_n^2}\] Recall that the inner product is bilinear, that is, linear (preserving addition and scalar multiplication) in each argument separately. As a consequence, for \( \bs{x}, \, \bs{y} \in \R^n \), \[ \|\bs{x} + \bs{y}\|^2 = \|\bs{x}\|^2 + \|\bs{y}\|^2 + 2 \langle \bs{x}, \bs{y} \rangle \]
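The identities recalled above can be checked on a small example. The following is a minimal sketch in plain Python; the vectors `x` and `y` are arbitrary values chosen for illustration and are not taken from the text.

```python
# Arbitrary illustrative vectors in R^3 (an assumption, not from the text).
x = [1, 2, 3]
y = [4, -5, 6]

# Inner product x^T y.
inner = sum(xi * yi for xi, yi in zip(x, y))

# Outer product x y^T: the 3 x 3 matrix with (i, j) entry x_i * y_j.
outer = [[xi * yj for yj in y] for xi in x]

# Trace of the outer product (sum of its diagonal entries).
trace = sum(outer[i][i] for i in range(len(x)))


def norm_sq(v):
    """Squared standard norm of a vector."""
    return sum(vi * vi for vi in v)


# Compare ||x + y||^2 with ||x||^2 + ||y||^2 + 2 <x, y>.
total = [xi + yi for xi, yi in zip(x, y)]
lhs = norm_sq(total)
rhs = norm_sq(x) + norm_sq(y) + 2 * inner

print(inner, trace)  # 12 12
print(lhs, rhs)      # 115 115
```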
Expected Value of a Random Matrix
As usual, our starting point is a random experiment modeled by a probability space \((\Omega, \mathscr F, \P)\). So to review, \( \Omega \) is the set of outcomes, \( \mathscr F \) the collection of events, and \( \P \) the probability measure on the sample space \( (\Omega, \mathscr F) \). It's natural to define the expected value of a random matrix in a component-wise manner.
Suppose that \(\bs{X}\) is an \(m \times n\) matrix of real-valued random variables, whose \((i, j)\) entry is denoted \(X_{i j}\). Equivalently, \(\bs{X}\) is a random \(m \times n\) matrix, that is, a random variable with values in \( \R^{m \times n} \). The expected value \(\E(\bs{X})\) is defined to be the \(m \times n\) matrix whose \((i, j)\) entry is \(\E\left(X_{i j}\right)\), the expected value of \(X_{i j}\).
Many of the basic properties of expected value of random variables have analogous results for expected value of random matrices, with matrix operations replacing the ordinary ones. Our first two properties are the critically important linearity properties. The first part is the additive property: the expected value of a sum is the sum of the expected values.
\(\E(\bs{X} + \bs{Y}) = \E(\bs{X}) + \E(\bs{Y})\) if \(\bs{X}\) and \(\bs{Y}\) are random \(m \times n\) matrices.
Proof
This is true by definition of the matrix expected value and the ordinary additive property. Note that \( \E\left(X_{i j} + Y_{i j}\right) = \E\left(X_{i j}\right) + \E\left(Y_{i j}\right) \). The left side is the \( (i, j) \) entry of \( \E(\bs{X} + \bs{Y}) \) and the right side is the \( (i, j) \) entry of \( \E(\bs{X}) + \E(\bs{Y}) \).
The next part of the linearity properties is the scaling property: a nonrandom matrix factor can be pulled out of the expected value.
Suppose that \(\bs{X}\) is a random \(n \times p\) matrix.
- \(\E(\bs{a} \bs{X}) = \bs{a} \E(\bs{X})\) if \(\bs{a} \in \R^{m \times n}\).
- \( \E(\bs{X} \bs{a}) = \E(\bs{X}) \bs{a}\) if \( \bs{a} \in \R^{p \times n} \).
Proof
- By the ordinary linearity and scaling properties, \( \E\left(\sum_{j=1}^n a_{i j} X_{j k}\right) = \sum_{j=1}^n a_{i j} \E\left(X_{j k}\right) \). The left side is the \( (i, k) \) entry of \( \E(\bs{a} \bs{X}) \) and the right side is the \( (i, k) \) entry of \( \bs{a} \E(\bs{X}) \).
- The proof is similar to (a).
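The scaling property can be verified exactly when the random matrix takes only finitely many values. The sketch below assumes a hypothetical three-point distribution for a random \(2 \times 2\) matrix and a hypothetical nonrandom \(3 \times 2\) matrix `a`; the helper names `matmul` and `expect` are illustrative only. Exact rational arithmetic avoids rounding questions.

```python
from fractions import Fraction


def matmul(a, b):
    """Ordinary matrix product of two matrices stored as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def expect(dist):
    """Entrywise expected value of a finite distribution [(prob, matrix), ...]."""
    rows, cols = len(dist[0][1]), len(dist[0][1][0])
    return [[sum(p * m[i][j] for p, m in dist) for j in range(cols)]
            for i in range(rows)]


# Hypothetical distribution of a random 2 x 2 matrix X.
dist_X = [
    (Fraction(1, 2), [[1, 0], [2, 1]]),
    (Fraction(1, 3), [[0, 3], [1, 1]]),
    (Fraction(1, 6), [[6, 0], [0, 6]]),
]
a = [[1, 2], [0, 1], [3, -1]]  # hypothetical nonrandom 3 x 2 matrix

E_X = expect(dist_X)
# Distribution of a X: same probabilities, transformed values.
E_aX = expect([(p, matmul(a, m)) for p, m in dist_X])

print(E_aX == matmul(a, E_X))  # True
```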
Recall that for independent, real-valued variables, the expected value of the product is the product of the expected values. Here is the analogous result for random matrices.
\(\E(\bs{X} \bs{Y}) = \E(\bs{X}) \E(\bs{Y})\) if \(\bs{X}\) is a random \(m \times n\) matrix, \(\bs{Y}\) is a random \(n \times p\) matrix, and \(\bs{X}\) and \(\bs{Y}\) are independent.
Proof
By the ordinary linearity properties and by the independence assumption, \[ \E\left(\sum_{j=1}^n X_{i j} Y_{j k}\right) = \sum_{j=1}^n \E\left(X_{i j} Y_{j k}\right) = \sum_{j=1}^n \E\left(X_{i j}\right) \E\left(Y_{j k}\right)\] The left side is the \( (i, k) \) entry of \( \E(\bs{X} \bs{Y}) \) and the right side is the \( (i, k) \) entry of \( \E(\bs{X}) \E(\bs{Y}) \).
Actually the previous result holds if \( \bs{X} \) and \( \bs{Y} \) are simply uncorrelated in the sense that \( X_{i j} \) and \( Y_{j k} \) are uncorrelated for each \( i \in \{1, 2, \ldots, m\} \), \( j \in \{1, 2, \ldots, n\} \), and \( k \in \{1, 2, \ldots, p\} \). We will study covariance of random vectors in the next subsection.
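To see the role of the independence assumption, here is a small sketch with hypothetical finite distributions: a random \(2 \times 2\) matrix and an independent random \(2 \times 1\) matrix. The product rule holds for the independent pair and fails when the second factor is replaced by the first one (a fully dependent pair). All distributions and helper names are assumptions made for illustration.

```python
from fractions import Fraction


def matmul(a, b):
    """Ordinary matrix product of two matrices stored as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def expect(dist):
    """Entrywise expected value of a finite distribution [(prob, matrix), ...]."""
    rows, cols = len(dist[0][1]), len(dist[0][1][0])
    return [[sum(p * m[i][j] for p, m in dist) for j in range(cols)]
            for i in range(rows)]


half, third = Fraction(1, 2), Fraction(1, 3)

# Hypothetical distributions: X is a random 2 x 2 matrix, Y a random 2 x 1 matrix.
dist_X = [(half, [[1, 0], [0, 1]]), (half, [[0, 1], [1, 0]])]
dist_Y = [(third, [[1], [2]]), (1 - third, [[4], [-1]])]

# Under independence, the joint probability of (x, y) is the product p * q.
dist_XY = [(p * q, matmul(x, y)) for p, x in dist_X for q, y in dist_Y]

E_XY = expect(dist_XY)
E_X_E_Y = matmul(expect(dist_X), expect(dist_Y))
print(E_XY == E_X_E_Y)  # True

# A dependent pair: take Y = X, so the product is X X.
E_XX = expect([(p, matmul(x, x)) for p, x in dist_X])
E_X_E_X = matmul(expect(dist_X), expect(dist_X))
print(E_XX == E_X_E_X)  # False
```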
Covariance Matrices
Our next goal is to define and study the covariance of two random vectors.
Suppose that \(\bs{X}\) is a random vector in \(\R^m\) and \(\bs{Y}\) is a random vector in \(\R^n\).
- The covariance matrix of \(\bs{X}\) and \(\bs{Y}\) is the \(m \times n\) matrix \(\cov(\bs{X}, \bs{Y})\) whose \((i,j)\) entry is \(\cov\left(X_i, Y_j\right)\), the ordinary covariance of \(X_i\) and \(Y_j\).
- Assuming that the coordinates of \( \bs{X} \) and \(\bs{Y}\) have positive variance, the correlation matrix of \( \bs{X} \) and \( \bs{Y} \) is the \( m \times n \) matrix \( \cor(\bs{X}, \bs{Y}) \) whose \( (i, j) \) entry is \( \cor\left(X_i, Y_j\right)\), the ordinary correlation of \( X_i \) and \( Y_j \).
Many of the standard properties of covariance and correlation for real-valued random variables have extensions to random vectors. For the following three results, \( \bs X \) is a random vector in \( \R^m \) and \( \bs Y \) is a random vector in \( \R^n \).
\(\cov(\bs{X}, \bs{Y}) = \E\left(\left[\bs{X} - \E(\bs{X})\right]\left[\bs{Y} - \E(\bs{Y})\right]^T\right)\)
Proof
By the definition of the expected value of a random vector and by the definition of matrix multiplication, the \( (i, j) \) entry of \( \left[\bs{X} - \E(\bs{X})\right]\left[\bs{Y} - \E(\bs{Y})\right]^T \) is simply \( \left[X_i - \E\left(X_i\right)\right] \left[Y_j - \E\left(Y_j\right)\right] \). The expected value of this entry is \( \cov\left(X_i, Y_j\right) \), which in turn is the \( (i, j) \) entry of \( \cov(\bs{X}, \bs{Y}) \).
Thus, the covariance of \( \bs{X} \) and \( \bs{Y} \) is the expected value of the outer product of \( \bs{X} - \E(\bs{X}) \) and \( \bs{Y} - \E(\bs{Y}) \). Our next result is the computational formula for covariance: the expected value of the outer product of \( \bs{X} \) and \( \bs{Y} \) minus the outer product of the expected values.
\(\cov(\bs{X},\bs{Y}) = \E\left(\bs{X} \bs{Y}^T\right) - \E(\bs{X}) \left[\E(\bs{Y})\right]^T\).
Proof
The \( (i, j) \) entry of \( \E\left(\bs{X} \bs{Y}^T\right) - \E(\bs{X}) \left[\E(\bs{Y})\right]^T\) is \( \E\left(X_i Y_j\right) - \E\left(X_i\right) \E\left(Y_j\right) \), which by the standard computational formula, is \( \cov\left(X_i, Y_j\right) \), which in turn is the \( (i, j) \) entry of \( \cov(\bs{X}, \bs{Y}) \).
The next result is the matrix version of the symmetry property.
\(\cov(\bs{Y}, \bs{X}) = \left[\cov(\bs{X}, \bs{Y})\right]^T\).
Proof
The \( (i, j) \) entry of \( \cov(\bs{X}, \bs{Y}) \) is \( \cov\left(X_i, Y_j\right) \), which is the \((j, i) \) entry of \( \cov(\bs{Y}, \bs{X}) \).
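The three results above (the outer product form, the computational formula, and the symmetry property) can be checked side by side on a finite joint distribution. The sketch below assumes a hypothetical three-point joint distribution with \(\bs{X}\) in \(\R^2\) and \(\bs{Y}\) in \(\R^3\); the function names are illustrative.

```python
from fractions import Fraction


def mean_vec(joint, idx):
    """Expected value of the vector in slot idx of each (prob, x, y) record."""
    n = len(joint[0][idx])
    return [sum(rec[0] * rec[idx][i] for rec in joint) for i in range(n)]


def cov_definition(joint):
    """cov(X, Y) as E([X - E(X)][Y - E(Y)]^T), entry by entry."""
    ex, ey = mean_vec(joint, 1), mean_vec(joint, 2)
    return [[sum(p * (x[i] - ex[i]) * (y[j] - ey[j]) for p, x, y in joint)
             for j in range(len(ey))] for i in range(len(ex))]


def cov_computational(joint):
    """cov(X, Y) as E(X Y^T) - E(X) [E(Y)]^T, entry by entry."""
    ex, ey = mean_vec(joint, 1), mean_vec(joint, 2)
    return [[sum(p * x[i] * y[j] for p, x, y in joint) - ex[i] * ey[j]
             for j in range(len(ey))] for i in range(len(ex))]


# Hypothetical joint distribution of X in R^2 and Y in R^3: (prob, x, y).
joint = [
    (Fraction(1, 4), [0, 0], [1, 0, 1]),
    (Fraction(1, 4), [1, 0], [2, 1, 0]),
    (Fraction(1, 2), [1, 1], [4, 0, 2]),
]

c_def = cov_definition(joint)    # 2 x 3 matrix
c_comp = cov_computational(joint)
print(c_def == c_comp)  # True

# Symmetry: swapping the roles of X and Y transposes the matrix.
swapped = [(p, y, x) for p, x, y in joint]
c_yx = cov_definition(swapped)   # 3 x 2 matrix
c_xy_T = [list(col) for col in zip(*c_def)]
print(c_yx == c_xy_T)  # True
```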
In the following result, \( \bs{0} \) denotes the \( m \times n \) zero matrix.
\(\cov(\bs{X}, \bs{Y}) = \bs{0}\) if and only if \(\cov\left(X_i, Y_j\right) = 0\) for each \(i\) and \(j\), so that each coordinate of \(\bs{X}\) is uncorrelated with each coordinate of \(\bs{Y}\).
Proof
This follows immediately from the definition of \( \cov(\bs{X}, \bs{Y}) \).
Naturally, when \( \cov(\bs{X}, \bs{Y}) = \bs{0} \), we say that the random vectors \( \bs{X} \) and \(\bs{Y}\) are uncorrelated. In particular, if the random vectors are independent, then they are uncorrelated. The following results establish the bilinear properties of covariance.
The additive properties.
- \(\cov(\bs{X} + \bs{Y}, \bs{Z}) = \cov(\bs{X}, \bs{Z}) + \cov(\bs{Y}, \bs{Z})\) if \(\bs{X}\) and \(\bs{Y}\) are random vectors in \(\R^m\) and \(\bs{Z}\) is a random vector in \(\R^n\).
- \(\cov(\bs{X}, \bs{Y} + \bs{Z}) = \cov(\bs{X}, \bs{Y}) + \cov(\bs{X}, \bs{Z})\) if \(\bs{X}\) is a random vector in \(\R^m\), and \(\bs{Y}\) and \(\bs{Z}\) are random vectors in \(\R^n\).
Proof
- From the ordinary additive property of covariance, \( \cov\left(X_i + Y_i, Z_j\right) = \cov\left(X_i, Z_j\right) + \cov\left(Y_i, Z_j\right) \). The left side is the \( (i, j) \) entry of \( \cov(\bs{X} + \bs{Y}, \bs{Z}) \) and the right side is the \( (i, j) \) entry of \( \cov(\bs{X}, \bs{Z}) + \cov(\bs{Y}, \bs{Z}) \).
- The proof is similar to (a), using the additivity of covariance in the second argument.
The scaling properties.
- \(\cov(\bs{a} \bs{X}, \bs{Y}) = \bs{a} \cov(\bs{X}, \bs{Y})\) if \(\bs{X}\) is a random vector in \(\R^n\), \(\bs{Y}\) is a random vector in \(\R^p\), and \(\bs{a} \in \R^{m \times n}\).
- \(\cov(\bs{X}, \bs{a} \bs{Y}) = \cov(\bs{X}, \bs{Y}) \bs{a}^T\) if \(\bs{X}\) is a random vector in \(\R^m\), \(\bs{Y}\) is a random vector in \(\R^n\), and \(\bs{a} \in \R^{k \times n}\).
Proof
- Using the ordinary linearity properties of covariance in the first argument, we have \[ \cov\left(\sum_{j=1}^n a_{i j} X_j, Y_k\right) = \sum_{j=1}^n a_{i j} \cov\left(X_j, Y_k\right) \] The left side is the \( (i, k) \) entry of \( \cov(\bs{a} \bs{X}, \bs{Y}) \) and the right side is the \( (i, k) \) entry of \( \bs{a} \cov(\bs{X}, \bs{Y}) \).
- The proof is similar to (a), using the linearity of covariance in the second argument.
Variance-Covariance Matrices
Suppose that \(\bs{X}\) is a random vector in \(\R^n\). The covariance matrix of \(\bs{X}\) with itself is called the variance-covariance matrix of \(\bs{X}\): \[ \vc(\bs{X}) = \cov(\bs{X}, \bs{X}) = \E\left(\left[\bs{X} - \E(\bs{X})\right]\left[\bs{X} - \E(\bs{X})\right]^T\right)\]
Recall that for an ordinary real-valued random variable \( X \), \( \var(X) = \cov(X, X) \). Thus the variance-covariance matrix of a random vector in some sense plays the same role that variance does for a random variable.
\(\vc(\bs{X})\) is a symmetric \(n \times n\) matrix with \(\left(\var(X_1), \var(X_2), \ldots, \var(X_n)\right)\) on the diagonal.
Proof
Recall that \( \cov\left(X_i, X_j\right) = \cov\left(X_j, X_i\right) \). Also, the \( (i, i) \) entry of \( \vc(\bs{X}) \) is \( \cov\left(X_i, X_i\right) = \var\left(X_i\right) \).
The following result is the formula for the variance-covariance matrix of a sum, analogous to the formula for the variance of a sum of real-valued variables.
\(\vc(\bs{X} + \bs{Y}) = \vc(\bs{X}) + \cov(\bs{X}, \bs{Y}) + \cov(\bs{Y}, \bs{X}) + \vc(\bs{Y})\) if \(\bs{X}\) and \(\bs{Y}\) are random vectors in \(\R^n\).
Proof
This follows from the additive property of covariance: \[ \vc(\bs{X} + \bs{Y}) = \cov(\bs{X} + \bs{Y}, \bs{X} + \bs{Y}) = \cov(\bs{X}, \bs{X}) + \cov(\bs{X}, \bs{Y}) + \cov(\bs{Y}, \bs{X}) + \cov(\bs{Y}, \bs{Y}) \]
Recall that \( \var(a X) = a^2 \var(X) \) if \( X \) is a real-valued random variable and \( a \in \R \). Here is the analogous result for the variance-covariance matrix of a random vector.
\(\vc(\bs{a} \bs{X}) = \bs{a} \vc(\bs{X}) \bs{a}^T\) if \(\bs{X}\) is a random vector in \(\R^n\) and \(\bs{a} \in \R^{m \times n}\).
Proof
This follows from the scaling property of covariance: \[ \vc(\bs{a} \bs{X}) = \cov(\bs{a} \bs{X}, \bs{a} \bs{X}) = \bs{a} \cov(\bs{X}, \bs{X}) \bs{a}^T \]
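As a concrete check of \(\vc(\bs{a} \bs{X}) = \bs{a} \vc(\bs{X}) \bs{a}^T\), the sketch below computes both sides exactly for a hypothetical three-point distribution on \(\R^2\) and a hypothetical \(3 \times 2\) matrix `a`. The helper names are assumptions for illustration.

```python
from fractions import Fraction


def matmul(a, b):
    """Ordinary matrix product of two matrices stored as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def transpose(a):
    """Transpose of a matrix stored as a list of rows."""
    return [list(col) for col in zip(*a)]


def apply(a, x):
    """Matrix-vector product a x."""
    return [sum(row[k] * x[k] for k in range(len(x))) for row in a]


def vc(dist):
    """Variance-covariance matrix of a finite distribution [(prob, vector), ...]."""
    n = len(dist[0][1])
    mu = [sum(p * x[i] for p, x in dist) for i in range(n)]
    return [[sum(p * (x[i] - mu[i]) * (x[j] - mu[j]) for p, x in dist)
             for j in range(n)] for i in range(n)]


# Hypothetical distribution of a random vector X in R^2.
dist_X = [
    (Fraction(1, 4), [0, 0]),
    (Fraction(1, 4), [1, 0]),
    (Fraction(1, 2), [1, 1]),
]
a = [[1, 1], [1, -1], [2, 0]]  # hypothetical nonrandom 3 x 2 matrix

# Left side: variance-covariance matrix of the transformed vector a X.
lhs = vc([(p, apply(a, x)) for p, x in dist_X])
# Right side: a vc(X) a^T.
rhs = matmul(matmul(a, vc(dist_X)), transpose(a))

print(lhs == rhs)  # True
```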
Recall that if \( X \) is a random variable, then \( \var(X) \ge 0 \), and \( \var(X) = 0 \) if and only if \( X \) is a constant (with probability 1). Here is the analogous result for a random vector:
Suppose that \( \bs{X} \) is a random vector in \( \R^n \).
- \( \vc(\bs{X}) \) is either positive semi-definite or positive definite.
- \(\vc(\bs{X})\) is positive semi-definite but not positive definite if and only if there exist a nonzero \(\bs{a} \in \R^n\) and \(c \in \R\) such that, with probability 1, \(\bs{a}^T \bs{X} = \sum_{i=1}^n a_i X_i = c\).
Proof
- From the previous result, \(0 \le \var\left(\bs{a}^T \bs{X}\right) = \vc\left(\bs{a}^T \bs{X}\right) = \bs{a}^T \vc(\bs{X}) \bs{a} \) for every \( \bs{a} \in \R^n \). Thus, by definition, \( \vc(\bs{X}) \) is either positive semi-definite or positive definite.
- In light of (a), \( \vc(\bs{X}) \) is positive semi-definite but not positive definite if and only if there exists a nonzero \( \bs{a} \in \R^n \) such that \( \bs{a}^T \vc(\bs{X}) \bs{a} = \var\left(\bs{a}^T \bs{X}\right) = 0 \). But in turn, this is true if and only if \( \bs{a}^T \bs{X} \) is constant with probability 1.
Recall that since \(\vc(\bs{X})\) is either positive semi-definite or positive definite, the eigenvalues and the determinant of \(\vc(\bs{X})\) are nonnegative. Moreover, if \(\vc(\bs{X})\) is positive semi-definite but not positive definite, then one of the coordinates of \(\bs{X}\) can be written as an affine function of the other coordinates (and hence can usually be eliminated in the underlying model). By contrast, if \(\vc(\bs{X})\) is positive definite, then this cannot happen; \(\vc(\bs{X})\) has positive eigenvalues and determinant and is invertible.
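A degenerate case is easy to exhibit. The sketch below assumes a hypothetical random vector \((X_1, X_2, X_3)\) with \(X_3 = X_1 + X_2\), where \(\var(X_1) = 3/16\), \(\var(X_2) = 1/4\), and \(\cov(X_1, X_2) = 1/8\) are illustrative values; the remaining entries of the matrix follow from bilinearity. The quadratic form vanishes in the direction \((1, 1, -1)\) and the determinant is 0.

```python
from fractions import Fraction

F = Fraction


def quad_form(c, v):
    """Quadratic form c^T v c for a vector c and a square matrix v."""
    n = len(c)
    return sum(c[i] * v[i][j] * c[j] for i in range(n) for j in range(n))


def det3(v):
    """Determinant of a 3 x 3 matrix by cofactor expansion along the first row."""
    return (v[0][0] * (v[1][1] * v[2][2] - v[1][2] * v[2][1])
            - v[0][1] * (v[1][0] * v[2][2] - v[1][2] * v[2][0])
            + v[0][2] * (v[1][0] * v[2][1] - v[1][1] * v[2][0]))


# Hypothetical variance-covariance matrix of (X1, X2, X3) with X3 = X1 + X2.
# For example, cov(X1, X3) = var(X1) + cov(X1, X2) = 3/16 + 1/8 = 5/16.
v = [
    [F(3, 16), F(1, 8), F(5, 16)],
    [F(1, 8), F(1, 4), F(3, 8)],
    [F(5, 16), F(3, 8), F(11, 16)],
]

# a = (1, 1, -1) encodes the constant combination X1 + X2 - X3 = 0.
a = [1, 1, -1]
print(quad_form(a, v))  # 0, so v is not positive definite
print(det3(v))          # 0, so v is not invertible

# Other directions still give nonnegative values, for instance var(X1 + X3).
print(quad_form([1, 0, 1], v))  # 3/2
```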
Best Linear Predictor
Suppose that \(\bs{X}\) is a random vector in \(\R^m\) and that \(\bs{Y}\) is a random vector in \(\R^n\). We are interested in finding the function of \(\bs{X}\) of the form \(\bs{a} + \bs{b} \bs{X}\), where \(\bs{a} \in \R^n\) and \(\bs{b} \in \R^{n \times m}\), that is closest to \(\bs{Y}\) in the mean square sense. Functions of this form are analogous to linear functions in the single variable case. However, unless \( \bs{a} = \bs{0} \), such functions are not linear transformations in the sense of linear algebra, so the correct term is affine function of \( \bs{X} \). This problem is of fundamental importance in statistics when the random vector \(\bs{X}\), the predictor vector, is observable, but the random vector \(\bs{Y}\), the response vector, is not. Our discussion here generalizes the one-dimensional case, when \(X\) and \(Y\) are random variables. That problem was solved in the section on Covariance and Correlation. We will assume that \(\vc(\bs{X})\) is positive definite, so that \( \vc(\bs{X}) \) is invertible, and none of the coordinates of \(\bs{X}\) can be written as an affine function of the other coordinates. We write \( \vc^{-1}(\bs{X}) \) for the inverse instead of the clunkier \( \left[\vc(\bs{X})\right]^{-1} \).
As with the single variable case, the solution turns out to be the affine function that has the same expected value as \( \bs{Y} \), and whose covariance with \( \bs{X} \) is the same as that of \( \bs{Y} \).
Define \( L(\bs{Y} \mid \bs{X}) = \E(\bs{Y}) + \cov(\bs{Y},\bs{X}) \vc^{-1}(\bs{X}) \left[\bs{X} - \E(\bs{X})\right] \). Then \( L(\bs{Y} \mid \bs{X}) \) is the only affine function of \( \bs{X} \) in \( \R^n \) satisfying
- \( \E\left[L(\bs{Y} \mid \bs{X})\right] = \E(\bs{Y}) \)
- \( \cov\left[L(\bs{Y} \mid \bs{X}), \bs{X}\right] = \cov(\bs{Y}, \bs{X}) \)
Proof
From linearity, \[ \E\left[L(\bs{Y} \mid \bs{X})\right] = \E(\bs{Y}) + \cov(\bs{Y}, \bs{X}) \vc^{-1}(\bs{X})\left[\E(\bs{X}) - \E(\bs{X})\right] = \E(\bs{Y}) \] From linearity and the fact that a constant vector is independent of (and hence uncorrelated with) any random vector, \[ \cov\left[L(\bs{Y} \mid \bs{X}), \bs{X}\right] = \cov(\bs{Y}, \bs{X}) \vc^{-1}(\bs{X}) \cov(\bs{X}, \bs{X}) = \cov(\bs{Y}, \bs{X}) \vc^{-1}(\bs{X}) \vc(\bs{X}) = \cov(\bs{Y}, \bs{X}) \] Conversely, suppose that \( \bs{U} = \bs{a} + \bs{b} \bs{X} \) for some \( \bs{a} \in \R^n \) and \( \bs{b} \in \R^{n \times m} \), and that \( \E(\bs{U}) = \E(\bs{Y}) \) and \( \cov(\bs{U}, \bs{X}) = \cov(\bs{Y}, \bs{X}) \). From the second equation, again using linearity and the uncorrelated property of constant vectors, we get \( \bs{b} \cov(\bs{X}, \bs{X}) = \cov(\bs{Y}, \bs{X}) \) and therefore \( \bs{b} = \cov(\bs{Y}, \bs{X}) \vc^{-1}(\bs{X}) \). Then from the first equation, \( \bs{a} + \bs{b} \E(\bs{X}) = \E(\bs{Y}) \) so \( \bs{a} = \E(\bs{Y}) - \bs{b} \E(\bs{X}) \).
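The formulas \( \bs{b} = \cov(\bs{Y}, \bs{X}) \vc^{-1}(\bs{X}) \) and \( \bs{a} = \E(\bs{Y}) - \bs{b} \E(\bs{X}) \) translate directly into a computation. The sketch below uses the means and covariances listed in the answers to the exercise in this section on the uniform distribution on \(0 \le x \le y \le z \le 1\), with \((X, Y)\) as the predictor and \(Z\) as the response. The helper names `inv2` and `matmul`, and the variable names, are assumptions for illustration.

```python
from fractions import Fraction

F = Fraction


def matmul(a, b):
    """Ordinary matrix product of two matrices stored as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def inv2(v):
    """Inverse of a 2 x 2 matrix with nonzero determinant."""
    det = v[0][0] * v[1][1] - v[0][1] * v[1][0]
    return [[v[1][1] / det, -v[0][1] / det],
            [-v[1][0] / det, v[0][0] / det]]


# Moments from the exercise answers: predictor (X, Y), response Z.
mean_pred = [[F(1, 4)], [F(1, 2)]]                       # E(X, Y) as a column
mean_resp = F(3, 4)                                      # E(Z)
vc_pred = [[F(3, 80), F(1, 40)], [F(1, 40), F(1, 20)]]   # vc(X, Y)
cov_resp_pred = [[F(1, 80), F(1, 40)]]                   # cov(Z, (X, Y)), 1 x 2

# Slope matrix b = cov(response, predictor) vc^{-1}(predictor).
b = matmul(cov_resp_pred, inv2(vc_pred))
# Intercept a = E(response) - b E(predictor).
a = mean_resp - matmul(b, mean_pred)[0][0]

print(a, b[0][0], b[0][1])  # 1/2 0 1/2, that is, L = 1/2 + Y / 2

# Property (b): cov(L, predictor) = b vc(predictor) equals cov(response, predictor).
print(matmul(b, vc_pred) == cov_resp_pred)  # True
```

The result agrees with the answer \(\frac{1}{2} + \frac{1}{2} Y\) given for that exercise.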
A simple corollary is that \( \bs{Y} - L(\bs{Y} \mid \bs{X}) \) is uncorrelated with any affine function of \( \bs{X} \):
If \( \bs{U} \) is an affine function of \( \bs{X} \) then
- \( \cov\left[\bs{Y} - L(\bs{Y} \mid \bs{X}), \bs{U}\right] = \bs{0} \)
- \( \E\left(\langle \bs{Y} - L(\bs{Y} \mid \bs{X}), \bs{U}\rangle\right) = 0\)
Proof
Suppose that \( \bs{U} = \bs{a} + \bs{b} \bs{X} \) where \( \bs{a} \in \R^n \) and \( \bs{b} \in \R^{n \times m} \). For simplicity, let \( \bs{L} = L(\bs{Y} \mid \bs{X}) \).
- From the previous result, \( \cov(\bs{Y}, \bs{X}) = \cov(\bs{L}, \bs{X}) \). Hence using linearity, \[ \cov\left(\bs{Y} - \bs{L}, \bs{U}\right) = \cov(\bs{Y} - \bs{L}, \bs{a}) + \cov(\bs{Y} - \bs{L}, \bs{X}) \bs{b}^T = \bs{0} + \left[\cov(\bs{Y}, \bs{X}) - \cov(\bs{L}, \bs{X})\right] \bs{b}^T = \bs{0} \]
- Recall that \(\langle \bs{Y} - \bs{L}, \bs{U}\rangle\) is the trace of the outer product \( (\bs{Y} - \bs{L}) \bs{U}^T \). Since \( \E(\bs{Y} - \bs{L}) = \bs{0} \), the expected value of this outer product is \( \cov(\bs{Y} - \bs{L}, \bs{U}) \) by the computational formula, and this matrix is \( \bs{0} \) by part (a). Hence \( \E\left(\langle \bs{Y} - \bs{L}, \bs{U}\rangle\right) \), the trace of the expected outer product, is 0.
The variance-covariance matrix of \( L(\bs{Y} \mid \bs{X}) \), and its covariance matrix with \( \bs{Y} \) turn out to be the same, again analogous to the single variable case.
Additional properties of \( L(\bs{Y} \mid \bs{X}) \):
- \( \cov\left[\bs{Y}, L(\bs{Y} \mid \bs{X})\right] = \cov(\bs{Y}, \bs{X}) \vc^{-1}(\bs{X}) \cov(\bs{X}, \bs{Y}) \)
- \( \vc\left[L(\bs{Y} \mid \bs{X})\right] = \cov(\bs{Y}, \bs{X}) \vc^{-1}(\bs{X}) \cov(\bs{X}, \bs{Y}) \)
Proof
Recall that \( L(\bs{Y} \mid \bs{X}) = \E(\bs{Y}) + \cov(\bs{Y},\bs{X}) \vc^{-1}(\bs{X}) \left[\bs{X} - \E(\bs{X})\right] \)
- Using basic properties of covariance, \[ \cov\left[\bs{Y}, L(\bs{Y} \mid \bs{X})\right] = \cov\left[\bs{Y}, \bs{X} - \E(\bs{X})\right] \left[\cov(\bs{Y}, \bs{X}) \vc^{-1}(\bs{X})\right]^T = \cov(\bs{Y}, \bs{X}) \vc^{-1}(\bs{X}) \cov(\bs{X}, \bs{Y}) \]
- Using basic properties of variance-covariance, \[ \vc\left[L(\bs{Y} \mid \bs{X})\right] = \vc\left[\cov(\bs{Y}, \bs{X}) \vc^{-1}(\bs{X}) \bs{X} \right] = \cov(\bs{Y}, \bs{X}) \vc^{-1}(\bs{X}) \vc(\bs{X}) \left[\cov(\bs{Y}, \bs{X}) \vc^{-1}(\bs{X})\right]^T = \cov(\bs{Y}, \bs{X}) \vc^{-1}(\bs{X}) \cov(\bs{X}, \bs{Y})\]
Next is the fundamental result that \( L(\bs{Y} \mid \bs{X}) \) is the affine function of \( \bs{X} \) that is closest to \( \bs{Y} \) in the mean square sense.
Suppose that \( \bs{U} \in \R^n \) is an affine function of \( \bs{X} \). Then
- \( \E\left(\|\bs{Y} - L(\bs{Y} \mid \bs{X})\|^2\right) \le \E\left(\|\bs{Y} - \bs{U}\|^2\right) \)
- Equality holds in (a) if and only if \( \bs{U} = L(\bs{Y} \mid \bs{X}) \) with probability 1.
Proof
Again, let \( \bs{L} = L(\bs{Y} \mid \bs{X}) \) for simplicity and let \( \bs{U} \in \R^n \) be an affine function of \( \bs{X}\).
- Using the linearity of expected value, note that \[ \E\left(\|\bs{Y} - \bs{U}\|^2\right) = \E\left[\|(\bs{Y} - \bs{L}) + (\bs{L} - \bs{U})\|^2\right] = \E\left(\|\bs{Y} - \bs{L}\|^2\right) + 2 \E(\langle \bs{Y} - \bs{L}, \bs{L} - \bs{U}\rangle) + \E\left(\|\bs{L} - \bs{U}\|^2\right) \] But \( \bs{L} - \bs{U} \) is an affine function of \( \bs{X} \) and hence the middle term is 0 by our previous corollary. Hence \( \E\left(\|\bs{Y} - \bs{U}\|^2\right) = \E\left(\|\bs{Y} - \bs{L}\|^2\right) + \E\left(\|\bs{L} - \bs{U}\|^2\right) \ge \E\left(\|\bs{Y} - \bs{L}\|^2\right) \).
- From (a), equality holds in the inequality if and only if \( \E\left(\|\bs{L} - \bs{U}\|^2\right) = 0 \) if and only if \( \P(\bs{L} = \bs{U}) = 1 \).
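The decomposition used in the proof, \( \E\left(\|\bs{Y} - \bs{U}\|^2\right) = \E\left(\|\bs{Y} - \bs{L}\|^2\right) + \E\left(\|\bs{L} - \bs{U}\|^2\right) \), can be checked exactly in the one-dimensional case. The sketch below assumes a hypothetical three-point joint distribution of real-valued \(X\) and \(Y\) and an arbitrary competing affine function \(U = 1 + X / 2\); none of these values come from the text.

```python
from fractions import Fraction

F = Fraction

# Hypothetical joint distribution of real-valued X and Y: (prob, x, y).
joint = [(F(1, 4), 0, 1), (F(1, 4), 1, 2), (F(1, 2), 2, 2)]


def expect(g):
    """Expected value of g(X, Y) under the joint distribution."""
    return sum(p * g(x, y) for p, x, y in joint)


ex, ey = expect(lambda x, y: x), expect(lambda x, y: y)
var_x = expect(lambda x, y: (x - ex) ** 2)
cov_yx = expect(lambda x, y: (x - ex) * (y - ey))

# Best affine predictor L = a + b X.
b = cov_yx / var_x
a = ey - b * ex


def best(x):
    """L(Y | X = x)."""
    return a + b * x


def other(x):
    """A competing affine function U = 1 + x / 2 (arbitrary choice)."""
    return 1 + F(1, 2) * x


mse_L = expect(lambda x, y: (y - best(x)) ** 2)
mse_U = expect(lambda x, y: (y - other(x)) ** 2)
gap = expect(lambda x, y: (best(x) - other(x)) ** 2)

print(mse_L, mse_U, gap)     # 1/22 1/16 3/176
print(mse_U == mse_L + gap)  # True
```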
The variance-covariance matrix of the difference between \( \bs{Y} \) and the best affine approximation is given in the next theorem.
\( \vc\left[\bs{Y} - L(\bs{Y} \mid \bs{X})\right] = \vc(\bs{Y}) - \cov(\bs{Y}, \bs{X}) \vc^{-1}(\bs{X}) \cov(\bs{X}, \bs{Y}) \)
Proof
Again, we abbreviate \( L(\bs{Y} \mid \bs{X}) \) by \(\bs{L}\). Using basic properties of variance-covariance matrices, \[ \vc(\bs{Y} - \bs{L}) = \vc(\bs{Y}) - \cov(\bs{Y}, \bs{L}) - \cov(\bs{L}, \bs{Y}) + \vc(\bs{L}) \] But \( \cov(\bs{Y}, \bs{L}) = \cov(\bs{L}, \bs{Y}) = \vc(\bs{L}) = \cov(\bs{Y}, \bs{X}) \vc^{-1}(\bs{X}) \cov(\bs{X}, \bs{Y}) \). Substituting gives the result.
The actual mean square error when we use \( L(\bs{Y} \mid \bs{X}) \) to approximate \( \bs{Y} \), namely \( \E\left(\left\|\bs{Y} - L(\bs{Y} \mid \bs{X})\right\|^2\right) \), is the trace (sum of the diagonal entries) of the variance-covariance matrix above. The function of \(\bs{x}\) given by \[ L(\bs{Y} \mid \bs{X} = \bs{x}) = \E(\bs{Y}) + \cov(\bs{Y},\bs{X}) \vc^{-1}(\bs{X}) \left[\bs{x} - \E(\bs{X})\right] \] is known as the (distribution) linear regression function. If we observe \(\bs{x}\) then \(L(\bs{Y} \mid \bs{X} = \bs{x})\) is our best affine prediction of \(\bs{Y}\).
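As an illustration of the mean square error as a trace, the sketch below computes \( \vc(\bs{Y}) - \cov(\bs{Y}, \bs{X}) \vc^{-1}(\bs{X}) \cov(\bs{X}, \bs{Y}) \) with a one-dimensional predictor and a two-dimensional response. The moments are those listed in the answers to the exercise in this section on the uniform distribution on \(0 \le x \le y \le z \le 1\), with predictor \(X\) and response \((Y, Z)\); the variable names are assumptions for illustration.

```python
from fractions import Fraction

F = Fraction

# Moments from the exercise answers: predictor X, response (Y, Z).
var_x = F(3, 80)                                         # vc(X), a 1 x 1 matrix
vc_resp = [[F(1, 20), F(1, 40)], [F(1, 40), F(3, 80)]]   # vc(Y, Z)
cov_resp_x = [F(1, 40), F(1, 80)]                        # cov((Y, Z), X) as a column

# cov(resp, X) vc^{-1}(X) cov(X, resp); the inverse of vc(X) is 1 / var(X) here.
explained = [[ci * cj / var_x for cj in cov_resp_x] for ci in cov_resp_x]

# Variance-covariance matrix of the prediction error (Y, Z) - L[(Y, Z) | X].
residual = [[vc_resp[i][j] - explained[i][j] for j in range(2)]
            for i in range(2)]

# Mean square error: the trace of the residual matrix.
mse = residual[0][0] + residual[1][1]
print(mse)  # 1/15
```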
Multiple linear regression is more powerful than it may at first appear, because it can be applied to non-linear transformations of the random vectors. That is, if \( g: \R^m \to \R^j \) and \( h: \R^n \to \R^k \) then \( L\left[h(\bs{Y}) \mid g(\bs{X})\right] \) is the affine function of \( g(\bs{X}) \) that is closest to \( h(\bs{Y}) \) in the mean square sense. Of course, we must be able to compute the appropriate means, variances, and covariances.
Moreover, non-linear regression with a single, real-valued predictor variable can be thought of as a special case of multiple linear regression. Thus, suppose that \(X\) is the predictor variable, \(Y\) is the response variable, and that \((g_1, g_2, \ldots, g_n)\) is a sequence of real-valued functions. We can apply the results of this section to find the affine function of \(\left(g_1(X), g_2(X), \ldots, g_n(X)\right)\) that is closest to \(Y\) in the mean square sense. We just replace \(X_i\) with \(g_i(X)\) for each \(i\). Again, we must be able to compute the appropriate means, variances, and covariances to do this.
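For instance, take \(g_1(x) = x\) and \(g_2(x) = x^2\) with the density \(f(x, y) = 15 x^2 y\) on \(0 \le x \le y \le 1\), which appears in the exercises of this section. The sketch below is one possible way to carry out the computation exactly. The closed form in `moment` is derived here by direct integration and is not stated in the text, and the function and variable names are illustrative.

```python
from fractions import Fraction

F = Fraction


def moment(i, j):
    """E(X^i Y^j) for the density 15 x^2 y on 0 <= x <= y <= 1.

    Integrating 15 x^(i+2) y^(j+1) over x in [0, y] and then over
    y in [0, 1] gives 15 / ((i + 3) (i + j + 5)).
    """
    return F(15, (i + 3) * (i + j + 5))


# Predictor vector (g1(X), g2(X)) = (X, X^2); response Y.
mean_g = [moment(1, 0), moment(2, 0)]
mean_y = moment(0, 1)

# vc(X, X^2): entry (r, s) is E(X^(r+1) X^(s+1)) - E(X^(r+1)) E(X^(s+1)).
vc_g = [[moment(r + s + 2, 0) - mean_g[r] * mean_g[s] for s in range(2)]
        for r in range(2)]
# cov(Y, (X, X^2)) as a row vector.
cov_yg = [moment(r + 1, 1) - mean_y * mean_g[r] for r in range(2)]

# Solve b vc_g = cov_yg for the row vector b by Cramer's rule (vc_g is symmetric).
det = vc_g[0][0] * vc_g[1][1] - vc_g[0][1] * vc_g[1][0]
b1 = (cov_yg[0] * vc_g[1][1] - cov_yg[1] * vc_g[0][1]) / det
b2 = (cov_yg[1] * vc_g[0][0] - cov_yg[0] * vc_g[0][1]) / det
# Intercept: a = E(Y) - b E(X, X^2).
a = mean_y - b1 * mean_g[0] - b2 * mean_g[1]

print(a, b1, b2)  # 49/76 10/57 7/38
```

This agrees with the answer \(\frac{49}{76} + \frac{10}{57} X + \frac{7}{38} X^2\) given for \(L\left[Y \mid \left(X, X^2\right)\right]\) in that exercise.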
Examples and Applications
Suppose that \((X, Y)\) has probability density function \(f\) defined by \(f(x, y) = x + y\) for \(0 \le x \le 1\), \(0 \le y \le 1\). Find each of the following:
- \(\E(X, Y)\)
- \(\vc(X, Y)\)
Answer
- \(\left(\frac{7}{12}, \frac{7}{12}\right)\)
- \(\left[\begin{matrix} \frac{11}{144} & -\frac{1}{144} \\ -\frac{1}{144} & \frac{11}{144}\end{matrix}\right]\)
Suppose that \((X, Y)\) has probability density function \(f\) defined by \(f(x, y) = 2 (x + y)\) for \(0 \le x \le y \le 1\). Find each of the following:
- \(\E(X, Y)\)
- \(\vc(X, Y)\)
Answer
- \(\left(\frac{5}{12}, \frac{3}{4}\right)\)
- \(\left[\begin{matrix} \frac{43}{720} & \frac{1}{48} \\ \frac{1}{48} & \frac{3}{80} \end{matrix} \right]\)
Suppose that \((X, Y)\) has probability density function \(f\) defined by \(f(x, y) = 6 x^2 y\) for \(0 \le x \le 1\), \(0 \le y \le 1\). Find each of the following:
- \(\E(X, Y)\)
- \(\vc(X, Y)\)
Answer
Note that \(X\) and \(Y\) are independent.
- \(\left(\frac{3}{4}, \frac{2}{3}\right)\)
- \(\left[\begin{matrix} \frac{3}{80} & 0 \\ 0 & \frac{1}{18} \end{matrix} \right]\)
Suppose that \((X, Y)\) has probability density function \(f\) defined by \(f(x, y) = 15 x^2 y\) for \(0 \le x \le y \le 1\). Find each of the following:
- \(\E(X, Y)\)
- \(\vc(X, Y)\)
- \(L(Y \mid X)\)
- \(L\left[Y \mid \left(X, X^2\right)\right]\)
- Sketch the regression curves on the same set of axes.
Answer
- \(\left( \frac{5}{8}, \frac{5}{6} \right)\)
- \(\left[ \begin{matrix} \frac{17}{448} & \frac{5}{336} \\ \frac{5}{336} & \frac{5}{252} \end{matrix} \right]\)
- \(\frac{10}{17} + \frac{20}{51} X\)
- \(\frac{49}{76} + \frac{10}{57} X + \frac{7}{38} X^2\)
Suppose that \((X, Y, Z)\) is uniformly distributed on the region \(\left\{(x, y, z) \in \R^3: 0 \le x \le y \le z \le 1\right\}\). Find each of the following:
- \(\E(X, Y, Z)\)
- \(\vc(X, Y, Z)\)
- \(L\left[Z \mid (X, Y)\right]\)
- \(L\left[Y \mid (X, Z)\right]\)
- \(L\left[X \mid (Y, Z)\right]\)
- \( L\left[(Y, Z) \mid X\right] \)
Answer
- \(\left(\frac{1}{4}, \frac{1}{2}, \frac{3}{4}\right)\)
- \(\left[\begin{matrix} \frac{3}{80} & \frac{1}{40} & \frac{1}{80} \\ \frac{1}{40} & \frac{1}{20} & \frac{1}{40} \\ \frac{1}{80} & \frac{1}{40} & \frac{3}{80} \end{matrix}\right]\)
- \(\frac{1}{2} + \frac{1}{2} Y\). Note that there is no \(X\) term.
- \(\frac{1}{2} X + \frac{1}{2} Z\). Note that this is the midpoint of the interval \([X, Z]\).
- \(\frac{1}{2} Y\). Note that there is no \(Z\) term.
- \( \left[\begin{matrix} \frac{1}{3} + \frac{2}{3} X \\ \frac{2}{3} + \frac{1}{3} X \end{matrix}\right] \)
Suppose that \(X\) is uniformly distributed on \((0, 1)\), and that given \(X\), random variable \(Y\) is uniformly distributed on \((0, X)\). Find each of the following:
- \(\E(X, Y)\)
- \(\vc(X, Y)\)
Answer
- \(\left(\frac{1}{2}, \frac{1}{4}\right)\)
- \(\left[\begin{matrix} \frac{1}{12} & \frac{1}{24} \\ \frac{1}{24} & \frac{7}{144} \end{matrix} \right]\)