# 5.7: The Multivariate Normal Distribution

- Page ID
- 10173

\( \newcommand{\vecs}[1]{\overset { \scriptstyle \rightharpoonup} {\mathbf{#1}} } \)

\( \newcommand{\vecd}[1]{\overset{-\!-\!\rightharpoonup}{\vphantom{a}\smash {#1}}} \)

\( \newcommand{\id}{\mathrm{id}}\) \( \newcommand{\Span}{\mathrm{span}}\)

( \newcommand{\kernel}{\mathrm{null}\,}\) \( \newcommand{\range}{\mathrm{range}\,}\)

\( \newcommand{\RealPart}{\mathrm{Re}}\) \( \newcommand{\ImaginaryPart}{\mathrm{Im}}\)

\( \newcommand{\Argument}{\mathrm{Arg}}\) \( \newcommand{\norm}[1]{\| #1 \|}\)

\( \newcommand{\inner}[2]{\langle #1, #2 \rangle}\)

\( \newcommand{\Span}{\mathrm{span}}\)

\( \newcommand{\id}{\mathrm{id}}\)

\( \newcommand{\Span}{\mathrm{span}}\)

\( \newcommand{\kernel}{\mathrm{null}\,}\)

\( \newcommand{\range}{\mathrm{range}\,}\)

\( \newcommand{\RealPart}{\mathrm{Re}}\)

\( \newcommand{\ImaginaryPart}{\mathrm{Im}}\)

\( \newcommand{\Argument}{\mathrm{Arg}}\)

\( \newcommand{\norm}[1]{\| #1 \|}\)

\( \newcommand{\inner}[2]{\langle #1, #2 \rangle}\)

\( \newcommand{\Span}{\mathrm{span}}\) \( \newcommand{\AA}{\unicode[.8,0]{x212B}}\)

\( \newcommand{\vectorA}[1]{\vec{#1}} % arrow\)

\( \newcommand{\vectorAt}[1]{\vec{\text{#1}}} % arrow\)

\( \newcommand{\vectorB}[1]{\overset { \scriptstyle \rightharpoonup} {\mathbf{#1}} } \)

\( \newcommand{\vectorC}[1]{\textbf{#1}} \)

\( \newcommand{\vectorD}[1]{\overrightarrow{#1}} \)

\( \newcommand{\vectorDt}[1]{\overrightarrow{\text{#1}}} \)

\( \newcommand{\vectE}[1]{\overset{-\!-\!\rightharpoonup}{\vphantom{a}\smash{\mathbf {#1}}}} \)

\( \newcommand{\vecs}[1]{\overset { \scriptstyle \rightharpoonup} {\mathbf{#1}} } \)

\( \newcommand{\vecd}[1]{\overset{-\!-\!\rightharpoonup}{\vphantom{a}\smash {#1}}} \)

\(\newcommand{\avec}{\mathbf a}\) \(\newcommand{\bvec}{\mathbf b}\) \(\newcommand{\cvec}{\mathbf c}\) \(\newcommand{\dvec}{\mathbf d}\) \(\newcommand{\dtil}{\widetilde{\mathbf d}}\) \(\newcommand{\evec}{\mathbf e}\) \(\newcommand{\fvec}{\mathbf f}\) \(\newcommand{\nvec}{\mathbf n}\) \(\newcommand{\pvec}{\mathbf p}\) \(\newcommand{\qvec}{\mathbf q}\) \(\newcommand{\svec}{\mathbf s}\) \(\newcommand{\tvec}{\mathbf t}\) \(\newcommand{\uvec}{\mathbf u}\) \(\newcommand{\vvec}{\mathbf v}\) \(\newcommand{\wvec}{\mathbf w}\) \(\newcommand{\xvec}{\mathbf x}\) \(\newcommand{\yvec}{\mathbf y}\) \(\newcommand{\zvec}{\mathbf z}\) \(\newcommand{\rvec}{\mathbf r}\) \(\newcommand{\mvec}{\mathbf m}\) \(\newcommand{\zerovec}{\mathbf 0}\) \(\newcommand{\onevec}{\mathbf 1}\) \(\newcommand{\real}{\mathbb R}\) \(\newcommand{\twovec}[2]{\left[\begin{array}{r}#1 \\ #2 \end{array}\right]}\) \(\newcommand{\ctwovec}[2]{\left[\begin{array}{c}#1 \\ #2 \end{array}\right]}\) \(\newcommand{\threevec}[3]{\left[\begin{array}{r}#1 \\ #2 \\ #3 \end{array}\right]}\) \(\newcommand{\cthreevec}[3]{\left[\begin{array}{c}#1 \\ #2 \\ #3 \end{array}\right]}\) \(\newcommand{\fourvec}[4]{\left[\begin{array}{r}#1 \\ #2 \\ #3 \\ #4 \end{array}\right]}\) \(\newcommand{\cfourvec}[4]{\left[\begin{array}{c}#1 \\ #2 \\ #3 \\ #4 \end{array}\right]}\) \(\newcommand{\fivevec}[5]{\left[\begin{array}{r}#1 \\ #2 \\ #3 \\ #4 \\ #5 \\ \end{array}\right]}\) \(\newcommand{\cfivevec}[5]{\left[\begin{array}{c}#1 \\ #2 \\ #3 \\ #4 \\ #5 \\ \end{array}\right]}\) \(\newcommand{\mattwo}[4]{\left[\begin{array}{rr}#1 \amp #2 \\ #3 \amp #4 \\ \end{array}\right]}\) \(\newcommand{\laspan}[1]{\text{Span}\{#1\}}\) \(\newcommand{\bcal}{\cal B}\) \(\newcommand{\ccal}{\cal C}\) \(\newcommand{\scal}{\cal S}\) \(\newcommand{\wcal}{\cal W}\) \(\newcommand{\ecal}{\cal E}\) \(\newcommand{\coords}[2]{\left\{#1\right\}_{#2}}\) \(\newcommand{\gray}[1]{\color{gray}{#1}}\) \(\newcommand{\lgray}[1]{\color{lightgray}{#1}}\) \(\newcommand{\rank}{\operatorname{rank}}\) \(\newcommand{\row}{\text{Row}}\) \(\newcommand{\col}{\text{Col}}\) \(\renewcommand{\row}{\text{Row}}\) \(\newcommand{\nul}{\text{Nul}}\) \(\newcommand{\var}{\text{Var}}\) \(\newcommand{\corr}{\text{corr}}\) \(\newcommand{\len}[1]{\left|#1\right|}\) \(\newcommand{\bbar}{\overline{\bvec}}\) \(\newcommand{\bhat}{\widehat{\bvec}}\) \(\newcommand{\bperp}{\bvec^\perp}\) \(\newcommand{\xhat}{\widehat{\xvec}}\) \(\newcommand{\vhat}{\widehat{\vvec}}\) \(\newcommand{\uhat}{\widehat{\uvec}}\) \(\newcommand{\what}{\widehat{\wvec}}\) \(\newcommand{\Sighat}{\widehat{\Sigma}}\) \(\newcommand{\lt}{<}\) \(\newcommand{\gt}{>}\) \(\newcommand{\amp}{&}\) \(\definecolor{fillinmathshade}{gray}{0.9}\)The multivariate normal distribution is among the most important of multivariate distributions, particularly in statistical inference and the study of Gaussian processes such as Brownian motion. The distribution arises naturally from linear transformations of independent normal variables. In this section, we consider the bivariate normal distribution first, because explicit results can be given and because graphical interpretations are possible. Then, with the aid of matrix notation, we discuss the general multivariate distribution.

## The Bivariate Normal Distribution

### The Standard Distribution

Recall that the probability density function \( \phi \) of the standard normal distribution is given by \[ \phi(z) = \frac{1}{\sqrt{2 \pi}} e^{-z^2/2}, \quad z \in \R\] The corresponding distribution function is denoted \( \Phi \) and is considered a special function in mathematics: \[ \Phi(z) = \int_{-\infty}^z \phi(x) \, dx = \int_{-\infty}^z \frac{1}{\sqrt{2 \pi}} e^{-x^2/2} \, dx, \quad z \in \R \] Finally, the moment generating function \( m \) is given by \[ m(t) = \E\left(e^{t Z}\right) = \exp\left[\frac{1}{2} \var(t Z)\right] = e^{t^2/2}, \quad t \in \R \]

Suppose that \( Z \) and \( W \) are independent random variables, each with the standard normal distribution. The distribution of \( (Z, W) \) is known as the standard bivariate normal distribution.

The basic properties of the standard bivariate normal distribution follow easily from independence and properties of the (univariate) normal distribution. Recall first that the graph of a function \( f: \R^2 \to \R \) is a surface. For \( c \in \R \), the set of points \( \left\{(x, y) \in \R^2: f(x, y) = c\right\} \) is the level curve of \( f \) at level \( c \). The graph of \( f \) can be understood by means of the level curves.

The probability density function \( \phi_2 \) of the standard bivariate normal distribution is given by \[ \phi_2(z, w) = \frac{1}{2 \pi} e^{-\frac{1}{2}\left(z^2 + w^2\right)}, \quad (z, w) \in \R^2 \]

- The level curves of \( \phi_2 \) are circles centered at the origin.
- The mode of the distribution is \( (0, 0) \).
- \( \phi_2 \) is concave downward on \( \left\{(z, w) \in \R^2: z^2 + w^2 \lt 1\right\} \)

## Proof

By independence, \( \phi_2(z, w) = \phi(z) \phi(w) \) for \( (z, w) \in \R^2 \). Parts (a) and (b) are clear. For part (c), the second derivative matrix of \( \phi_2 \) is \[ \left[\begin{matrix} \phi_2(z, w)\left(z^2 - 1\right) & \phi_2(z, w) z w \\ \phi_2(z, w) z w & \phi_2(z, w)\left(w^2 - 1\right) \end{matrix}\right]\] with determinant \( \phi_2^2(z, w) \left(1 - z^2 - w^2\right) \). The determinant is positive and the diagonal entries negative on the circular region \( \left\{(z, w) \in \R^2: z^2 + w^2 \lt 1\right\} \), so the matrix is negative definite on this region.

Clearly \( \phi \) has a number of symmetry properties as well: \( \phi_2(z, w) \) is symmetric in \( z \) about 0 so that \(\phi_2(-z, w) = \phi_2(z, w)\); \( \phi_2(z, w) \) is symmetric in \( w \) about 0 so that \( \phi_2(z, -w) = \phi_2(z, w) \); \( \phi_2(z, w) \) is symmetric in \( (z, w) \) so that \( \phi_2(z, w) = \phi_2(w, z) \). In short, \( \phi_2 \) has the classical bell shape

that we associate with normal distributions.

Open the bivariate normal experiment, keep the default settings to get the standard bivariate normal distribution. Run the experiment 1000 times. Observe the cloud of points in the scatterplot, and compare the empirical density functions to the probability density functions.

Suppose that \( (Z, W) \) has the standard bivariate normal distribution. The moment generating function \( m_2 \) of \( (Z, W) \) is given by \[ m_2(s, t) = \E\left[\exp(s Z + t W)\right] = \exp\left[\frac{1}{2} \var(s Z + t W)\right] = \exp\left[\frac{1}{2}\left(s^2 + t^2\right)\right], \quad (s, t) \in \R^2 \]

## Proof

By independence, \( m_2(s, t) = m(s) m(t) \) for \( (s, t) \in \R^2 \) where \( m \) is the standard normal MGF.

### The General Distribution

The general bivariate normal distribution can be constructed by means of an affine transformation on a standard bivariate normal vector. The distribution has 5 parameters. As we will see, two are location parameters, two are scale parameters, and one is a correlation parameter.

Suppose that \( (Z, W) \) has the standard bivariate normal distribution. Let \(\mu, \, \nu \in \R\); \(\sigma, \, \tau \in (0, \infty)\); and \(\rho \in (-1, 1)\), and let \(X\) and \(Y\) be new random variables defined by \begin{align} X & = \mu + \sigma \, Z \\ Y & = \nu + \tau \rho Z + \tau \sqrt{1 - \rho^2} W \end{align} The joint distribution of \((X, Y)\) is called the bivariate normal distribution with parameters \((\mu, \nu, \sigma, \tau, \rho)\).

We can use the change of variables formula to find the joint probability density function.

Suppose that \( (X, Y) \) has the bivariate normal distribution with the parameters \( (\mu, \nu, \sigma, \tau, \rho) \) as specified above. The joint probability density function \( f \) of \((X, Y)\) is given by \[ f(x, y) = \frac{1}{2 \pi \sigma \tau \sqrt{1 - \rho^2}} \exp \left\{ -\frac{1}{2 (1 - \rho^2)} \left[ \frac{(x - \mu)^2}{\sigma^2} - 2 \rho \frac{(x - \mu)(y - \nu)}{\sigma \tau} + \frac{(y - \nu)^2}{\tau^2} \right] \right\}, \quad (x, y) \in \R^2 \]

- The level curves of \( f \) are ellipses centered at \((\mu, \nu)\).
- The mode of the distribution is \( (\mu, \nu) \).

## Proof

Consider the transformation that defines \((x, y)\) from \((z, w)\) in the definition. The inverse transformation is given by \begin{align} z & = \frac{x - \mu}{\sigma} \\ w & = \frac{x - \mu}{\sigma \, \sqrt{1 - \rho^2}} - \rho \frac{y - \nu}{\tau \, \sqrt{1 - \rho^2}} \end{align} The Jacobian of the inverse transformation is \[ \frac{\partial(z, w)}{\partial(x, y)} = \frac{1}{\sigma \tau \sqrt{1 - \rho^2}} \] Note that the Jacobian is a constant, because the transformation is affine. The result now follows from the independence of \(Z\) and \(W\), and the change of variables formula

- Note that \( f \) has the form \( f(x, y) = a \exp\left[- b g(x, y)\right] \) where \( a \) and \( b \) are positive constants and \[ g(x, y) = \frac{(x - \mu)^2}{\sigma^2} - 2 \rho \frac{(x - \mu)(y - \nu)}{\sigma \tau} + \frac{(y - \nu)^2}{\tau^2}, \quad (x, y) \in \R^2 \] The graph of \( g \) is a paraboloid opening upward. The level curves of \( f \) are the same as the level curves of \( g \) (but at different levels of course).
- The maximum of \( f \) occurs at the minimum of \( g \), at the point \( (\mu, \nu) \).

The following theorem gives fundamental properties of the bivariate normal distribution.

Suppose that \((X, Y)\) has the bivariate normal distribution with parameters \((\mu, \nu, \sigma, \tau, \rho)\) as specified above. Then

- \(X\) is normally distributed with mean \(\mu\) and standard deviation \(\sigma\).
- \(Y\) is normally distributed with mean \(\nu\) and standard deviation \(\tau\).
- \(\cor(X, Y) = \rho\).
- \( X \) and \( Y \) are independent if and only if \( \rho = 0 \).

## Proof

These result can be proved from the probability density function, but it's easier and more helpful to use the transformation definition. So, assume that \( (X, Y) \) is defined in terms of the standard bivariate normal pair \( (Z, W) \) as in the definition.

- \( X = \mu + \sigma Z \) so \( X \) has the normal distribution with mean \( \mu \) and standard deviation \( \sigma \). This is a basic property of the normal distribution, and indeed is the way that the general normal variable is constructed from a standard normal variable.
- Since \( Z \) and \( W \) are independent and each has the standard normal distribution, \( Y = \nu + \tau \rho Z + \tau \sqrt{1 - \rho^2} W \) is normally distributed by another basic property. Because \( Z \) and \( W \) have mean 0, it follows from the linear property of expected value that \( \E(Y) = \nu \). Similarly, since \( Z \) and \( W \) have variance 1, it follows from basic properties of variance that \( \var(Y) = \tau^2 \rho^2 + \tau^2 (1 - \rho^2) = \tau^2\).
- Using the bi-linear property of covariance and independence we have \( \cov(X, Y) = \rho \tau \sigma \, \cov(Z, Z) = \rho \tau \sigma \), and hence from (a) and (b), \( \cor(X, Y) = \rho \).
- As a general property, recall that if \( X \) and \( Y \) are independent then \( \cor(X, Y) = 0 \). Conversely, if \( \rho = 0 \) then \( X = \mu + \sigma Z \) and \( Y = \nu + \tau W \). Since \( Z \) and \( W \) are independent, so are \( X \) and \( Y \).

Thus, two random variables with a joint normal distribution are independent if and only if they are uncorrelated.

In the bivariate normal experiment, change the standard deviations of \(X\) and \(Y\) with the scroll bars. Watch the change in the shape of the probability density functions. Now change the correlation with the scroll bar and note that the probability density functions do *not* change. For various values of the parameters, run the experiment 1000 times. Observe the cloud of points in the scatterplot, and compare the empirical density functions to the probability density functions.

In the case of perfect correlation (\( \rho = 1 \) or \( \rho = -1 \)), the distribution of \( (X, Y) \) is also said to be bivariate normal, but degenerate. In this case, we know from our study of covariance and correlation that \( (X, Y) \) takes values on the regression line \( \left\{(x, y) \in \R^2: y = \nu + \rho \frac{\tau}{\sigma} (x - \mu)\right\} \), and hence does not have a probability density function (with respect to Lebesgue measure on \( \R^2 \)). Degenerate normal distributions will be discussed in more detail below.

In the bivariate normal experiment, run the experiment 1000 times with the values of \( \rho \) given below and selected values of \( \sigma \) and \( \tau \). Observe the cloud of points in the scatterplot and compare the empirical density functions to the probability density functions.

- \( \rho \in \{0, 0.3, 0.5, 0.7, 1\} \)
- \( \rho \in \{-0.3, -0.5, -0.7, -1\} \)

The conditional distributions are also normal.

Suppose that \( (X, Y) \) has the bivariate normal distribution with parameters \( (\mu, \nu, \sigma, \tau, \rho) \) as specified above.

- For \( x \in \R \), the conditional distribution of \(Y\) given \(X = x\) is normal with mean \( \E(Y \mid X = x) = \nu + \rho \frac{\tau}{\sigma} (x - \mu) \) and variance \(\var(Y \mid X = x) = \tau^2 \left(1 - \rho^2\right)\).
- For \( y \in \R \), the conditional distribution of \(X\) given \(Y = y\) is normal with mean \( \E(X \mid Y = y) = \mu + \rho \frac{\sigma}{\tau} (y - \nu) \) and variance \(\var(X \mid Y = y) = \sigma^2 \left(1 - \rho^2\right)\).

## Proof from density functions

By symmetry, we need only prove (a). The conditional PDF of \( Y \) given \( X = x \) is \( y \mapsto f(x, y) \big/ g(x) \) where \( f \) is the joint PDF, and where \( g \) is the PDF of \( X \), namely the normal PDF with mean \( \mu \) and standard deviation \( \sigma \). The result then follows after some algebra.

## Proof from random variables

Again, we only need to prove (a). We can assume that \( (X, Y) \) is defined in terms of a standard normal pair \( (Z, W) \) as in the definition. Hence \[ Y = \nu + \rho \tau \frac{X - \mu}{\sigma} + \tau \sqrt{1 - \rho^2} W \] Since that \(X\) and \(W\) are independent, the conditional distribution of \( Y \) given \( X = x \) is the distribution of \( \nu + \rho \tau \frac{x - \mu}{\sigma} + \tau \sqrt{1 - \rho^2} W \). The latter distribution is normal, with mean and variance specified in the theorem.

Note that the conditional variances do not depend on the value of the given variable.

In the bivariate normal experiment, set the standard deviation of \(X\) to 1.5, the standard deviation of \(Y\) to 0.5, and the correlation to 0.7.

- Run the experiment 100 times.
- For each run, compute \(\E(Y \mid X = x)\) the predicted value of \(Y\) for the given the value of \(X\).
- Over all 100 runs, compute the square root of the average of the squared errors between the predicted value of \(Y\) and the true value of \(Y\).

You may be perplexed by the lack of symmetry in how \( (X, Y) \) is defined in terms of \( (Z, W) \) in the original definition. Note however that the distribution is completely determined by the 5 parameters. If we define \( X^\prime = \mu + \sigma \rho Z + \sigma \sqrt{1 - \rho^2} W \) and \( Y^\prime = \nu + \tau Z \) then \( (X^\prime, Y^\prime) \) has the same distribution as \( (X, Y) \), namely the bivariate normal distribution with parameters \( (\mu, \nu, \sigma, \tau, \rho) \) (although, of course \( (X^\prime, Y^\prime) \) and \( (X, Y) \) are different random vectors). There are other ways to define the same distribution as an affine transformation of \( (Z, W) \)—the situation will be clarified in the next subsection.

Suppose that \( (X, Y) \) has the bivariate normal distribution with parameters \( (\mu, \nu, \sigma, \tau, \rho) \). Then \( (X, Y) \) has moment generating function \( M \) given by \[ M(s, t) = \E\left[\exp\left(sX + tY\right)\right] = \exp\left[\E(s X + t Y) + \frac{1}{2} \var(s X + t Y)\right] = \exp\left[\mu s + \nu t + \frac{1}{2} \left( \sigma^2 s^2 + 2 \rho \sigma \tau s t + \tau^2 t^2\right) \right], \quad (s, t) \in \R^2 \]

## Proof

Using the representation of \( (X, Y) \) in terms of the standard bivariate normal vector \( (Z, W) \) in the definition and collecting terms gives \[ M(s, t) = \E\left(\exp\left[(\mu s + \nu t) + (\sigma s + \rho \tau t) Z + \tau \sqrt{1 - \rho^2} t W\right] \right)\] Hence from independence we have \[ M(s, t) = \exp(\mu s + \nu t) m(\sigma s + \rho \tau t) m\left(\tau \sqrt{1 - \rho^2} t\right) \] where \( m \) is the standard normal MGF. Substituting and simplifying gives the result.

We showed above that if \( (X, Y) \) has a bivariate normal distribution then the marginal distributions of \( X \) and \( Y \) are also normal. The converse is not true.

Suppose that \( (X, Y) \) has probability density function \( f \) given by \[ f(x, y) = \frac{1}{2 \pi} e^{-(x^2 + y^2)/2} \left[1 + x y e^{-(x^2 + y^2 - 2)} / 2\right], \quad (x, y) \in \R^2 \]

- \( X \) and \( Y \) each have standard normal distributions.
- \( (X, Y) \) does not have a bivariate normal distribution.

## Proof

Note that \( f(x, y) = \phi_2(x, y) [1 + u(x) u(y)] \) for \( (x, y) \in \R^2 \), where \( \phi_2 \) is the bivariate standard normal PDF and where \( u \) is given by \( u(t) = t e^{-(t^2 - 1) / 2} \) for \( t \in \R \). From simple calculus, \( u \) is symmetric about 0, has a local maximum at \( t = 1 \), and \( u(t) \to 0 \) as \( t \to \infty \). In particular, \( |u(t)| \le 1 \) for \( t \in \R \) and hence \( f(x, y) \ge 0 \) for \( (x, y) \in \R^2 \). Next, a helpful trick is that we can write integrals of \( f \) as expected values of functions of a standard normal pair \( (Z, W) \). In particular, \[ \int_{\R^2} f(x, y) d(x, y) = \E[1 + u(Z) u(W)] = 1 + \E[u(Z)] \E[u(W)] = 1 \] since \( \E[u(Z)] = \E[u(W)] = 0 \) by the symmetry of the standard normal distribution and the symmetry of \( u \) about 0. Hence \( f \) is a valid PDF on \( \R^2 \). Suppose now that \( (X, Y) \) has PDF \( f \).

- The PDF of \( X \) at \( x \in \R \) is \[\int_\R f(x, y) dy = \int_\R \phi_2(x, y) dy + u(x) \E[u(W)] = \phi(x)\] where as usual, \( \phi \) is the standard normal PDF on \( \R \). By symmetry, \( Y \) also has the standard normal distribution.
- \( f \) does not have the form of a bivariate normal PDF and hence \( (X, Y) \) does not have a bivariate normal distribution.

### Transformations

Like its univariate counterpart, the family of bivariate normal distributions is preserved under two types of transformations on the underlying random vector: affine transformations and sums of independent vectors. We start with a preliminary result on affine transformations that should help clarify the original definition. Throughout this discussion, we assume that the parameter vector \( (\mu, \nu, \sigma, \tau, \rho) \) satisfies the usual conditions: \( \mu, \, \nu \in \R \), and \( \sigma, \, \tau \in (0, \infty) \), and \( \rho \in (-1, 1) \).

Suppose that \( (Z, W) \) has the standard bivariate normal distribution. Let \( X = a_1 + b_1 Z + c_1 W \) and \( Y = a_2 + b_2 Z + c_2 W \) where the coefficients are in \( \R \) and \( b_1 c_2 - c_1 b_2 \ne 0 \). Then \( (X, Y) \) has a bivariate normal distribution with parameters given by

- \( \E(X) = a_1 \)
- \( \E(Y) = a_2 \)
- \( \var(X) = b_1^2 + c_1^2 \)
- \( \var(Y) = b_2^2 + c_2^2 \)
- \( \cov(X, Y) = b_1 b_2 + c_1 c_2 \)

## Proof

A direct proof using the change of variables formula is possible, but our goal is to show that \( (X, Y) \) can be written in the form given above in the definition. First, parts (a)–(e) follow from basic properties of expected value, variance, and covariance. So, in the notation used in the definition, we have \( \mu = a_1 \), \( \nu = a_2 \), \( \sigma = \sqrt{b_1^2 + c_1^2} \), \( \tau = \sqrt{b_2^2 + c_2^2} \), and \[ \rho = \frac{b_1 b_2 + c_1 c_2}{\sqrt{b_1^2 + c_1^2} \sqrt{b_2^2 + c_2^2}} \] (Note from the assumption on the coefficients that \( b_1^2 + c_1^2 \gt 0 \) and \( b_2^2 + c_2^2 \gt 0 \)). Simple algebra shows that \[ \sqrt{1 - \rho^2} = \frac{b_1 c_2 - c_1 b_2}{\sqrt{b_1^2 + c_1^2} \sqrt{b_2^2 + c_2^2}} \] Next we define \begin{align} U & = \frac{b_1 Z + c_1 W}{\sqrt{b_1^2 + c_1^2}} \\ V & = \frac{c_1 Z - b_1 W}{\sqrt{b_1^2 + c_1^2}} \end{align} The transformation that defines \( (u, v) \) from \( (z, w) \) is its own inverse, and has Jacobian 1. Hence it follows that \( (U, V) \) has the same joint distribution as \( (Z, W) \), namely the standard bivariate normal distribution. Simple algebra shows that \begin{align} X & = a_1 + \sqrt{b_1^2 + c_1^2} U = \mu + \sigma U \\ Y & = a_2 + \frac{b_1 b_2 + c_1 c_2}{\sqrt{b_1^2 + c_1^2}} U + \frac{c_1 b_2 - b_1 c_2}{\sqrt{b_1^2 + c_1^2}} V = \nu + \tau \rho U + \tau \sqrt{1 - \rho^2} V \end{align} This is the form given in the definition, so it follows that \( (X, Y) \) has a bivariate normal distribution.

Now it is easy to show more generally that the bivariate normal distribution is closed with respect to affine transformations.

Suppose that \( (X, Y) \) has the bivariate normal distribution with parameters \( (\mu, \nu, \sigma, \tau, \rho) \). Define \(U = a_1 + b_1 X + c_1 Y\) and \(V = a_2 + b_2 X + c_2 Y\), where the coefficients are in \(\R\) and \(b_1 c_2 - c_1 b_2 \ne 0\). Then \((U, V)\) has a bivariate normal distribution with parameters as follows:

- \(\E(U) = a_1 + b_1 \mu + c_1 \nu\)
- \(\E(V) = a_2 + b_2 \mu + c_2 \nu\)
- \(\var(U) = b_1^2 \sigma^2 + c_1^2 \tau^2 + 2 b_1 c_1 \rho \sigma \tau\)
- \(\var(V) = b_2^2 \sigma^2 + c_2^2 \tau^2 + 2 b_2 c_2 \rho \sigma \tau\)
- \(\cov(U, V) = b_1 b_2 \sigma^2 + c_1 c_2 \tau^2 + (b_1 c_2 + b_2 c_1) \rho \sigma \tau \)

## Proof

From our original construction, we can write \( X = \mu + \sigma Z \) and \( Y = \nu + \tau \rho Z + \tau \sqrt{1 - \rho^2} W \) where \( (Z, W) \) has the standard bivariate normal distribution. Then by simple substitution, \( U = A_1 + B_1 Z + C_1 W \) and \( V = A_2 + B_2 Z + C_2 W \) where \(A_i = a_i + b_i \mu + c_i\nu \), \(B_i = b_i \sigma + c_i \tau \rho \), \(C_i = c_i \tau \sqrt{1 - \rho^2} \) for \( i \in \{1, 2\} \). By simple algebra, \[ B_1 C_2 - C_1 B_2 = \sigma \tau \sqrt{1 - \rho^2}(b_1 c_2 - c_1 b_2) \ne 0 \] Hence \( (U, V) \) has a bivariate normal distribution from the previous theorem. Parts (a)–(e) follow from basic properties of expected value, variance, and covariance.

The bivariate normal distribution is preserved with respect to sums of independent variables.

Suppose that \( (X_i, Y_i) \) has the bivariate normal distribution with parameters \( (\mu_i, \nu_i, \sigma_i, \tau_i, \rho_i) \) for \( i \in \{1, 2\} \), and that \( (X_1, Y_1) \) and \( (X_2, Y_2) \) are independent. Then \( (X_1 + X_2, Y_1 + Y_2) \) has the bivariate normal distribution with parameters given by

- \( \E(X_1 + X_2) = \mu_1 + \mu_2 \)
- \( \E(Y_1 + Y_2) = \nu_1 + \nu_2 \)
- \( \var(X_1 + X_2) = \sigma_1^2 + \sigma_2^2 \)
- \( \var(Y_1 + Y_2) = \tau_1^2 + \tau_2^2 \)
- \( \cov(X_1 + X_2, Y_1 + Y_2) \) = \( \rho_1 \sigma_1 \tau_1 + \rho_2 \sigma_2 \tau_2 \)

## Proof

Let \( M_i \) denote the MGF of \( (X_i, Y_i) \) for \( i \in \{1, 2\} \) and let \( M \) denote the MGF of \( (X_1 + X_2, Y_1 + Y_2) \). By independence, \( M(s, t) = M_1(s, t) M_2(s, t) \) for \( (s, t) \in \R^2 \). Using the bivariate normal MGF, and basic properties of the exponential function, \[ M(s, t) = \exp\left(\E\left[s(X_1 + X_2) + t(Y_1 + Y_2)\right] + \frac{1}{2} \var\left[s(X_1 + X_2) + t(Y_1 + Y_2)\right]\right), \quad (s, t) \in \R^2 \] Of course from basic properties of expected value, variance, and covariance, \begin{align} \E\left[s(X_1 + X_2) + t(Y_1 + Y_2)\right] & = s(\mu_1 + \mu_2) + t(\nu_1 + \nu_2) \\ \var\left[s(X_1 + X_2) + t(Y_1 + Y_2)\right] & = s(\sigma_1^2 + \sigma_2^2) + t(\tau_1^2 + \tau_2^2) + 2 s t (\rho_1 \sigma_1 \tau_1 + \rho_2 \sigma_2 \tau_2) \end{align} Substituting gives the result.

The following result is important in the simulation of normal variables.

Suppose that \( (Z, W) \) has the standard bivariate normal distribution. Define the polar coordinates \((R, \Theta)\) of \((Z, W)\) by the equations \(Z = R \, \cos \Theta\), \(W = R \, \sin \Theta\) where \(R \ge 0\) and \(0 \le \Theta \lt 2 \, \pi\). Then

- \(R\) has probability density function \( g \) given by \(g(r) = r \, e^{-\frac{1}{2} r^2}\) for \(r \in [0, \infty)\).
- \(\Theta\) is uniformly distributed on \([0, 2 \pi)\).
- \(R\) and \(\Theta\) are independent.

## Proof

The Jacobian of the polar coordinate transformation that gives \( (z, w) \) from \( (r, \theta) \) is \( r \), as we all remember from calculus. Hence by the change of variables theorem, the PDF \( g \) of \( (R, \Theta) \) in terms of the from standard normal PDF \( \phi_2 \) is given by \[ g(r, \theta) = \phi_2(r \cos \theta, r \sin \theta) r = \frac{1}{2 \pi} r e^{-r^2 /2}, \quad (r, \theta) \in [0, \infty) \times [0, 2 \pi) \] The result then follows from the factorization theorem for independent random variables.

The distribution of \(R\) is known as the standard Rayleigh distribution, named for William Strutt, Lord Rayleigh. The Rayleigh distribution studied in more detail in a separate section.

Since the quantile function \( \Phi^{-1} \) of the normal distribution cannot be given in a simple, closed form, we cannot use the usual random quantile method of simulating a normal random variable. However, the quantile method works quite well to simulate a Rayleigh variable, and of course simulating uniform variables is trivial. Hence we have a way of simulating a standard bivariate normal vector with a pair of random numbers (which, you will recall are independent random variables, each with the standard uniform distribution, that is, the uniform distribution on \( [0, 1) \)).

Suppose that \( U \) and \( V \) are independent random variables, each with the standard uniform distribution. Let \( R = \sqrt{-2 \ln U} \) and \( \Theta = 2 \pi V \). Define \( Z = R \cos \Theta \) and \( W = R \sin \Theta \). Then \( (Z, W) \) has the standard bivariate normal distribution.

## Proof

The Rayleigh distribution function \( F \) is given by \( F(r) = 1 - e^{-r^2 / 2} \) for \( r \in [0, \infty) \) and hence the quantile function is given by \( F^{-1}(p) = \sqrt{-2 \ln(1 - p)} \) for \( p \in [0, 1) \). Hence if \( U \) has the standard uniform distribution, then \( \sqrt{-2 \ln (1 - U)} \) has the Rayleigh distribution. But \( 1 - U \) also has the standard uniform distribution so \( R = \sqrt{-2 \ln U } \) also has the Rayleigh distribution. If \( V \) has the standard uniform distribution then of course \( 2 \pi V \) is uniformly distributed on \( [0, 2 \pi) \). If \( U \) and \( V \) are independent, then so are \( R \) and \( \Theta \). By the previous theorem, if \( Z = R \cos \Theta \) and \( W = R \sin \Theta \), then \( (Z, W) \) has the standard bivariate normal distribution.

Of course, if we can simulate \( (Z, W) \) with a standard bivariate normal distribution, then we can simulate \( (X, Y) \) with the general bivariate normal distribution, with parameter \( (\mu, \nu, \sigma, \tau, \rho) \) by definition (5), namely \( X = \mu + \sigma Z \), \( Y = \nu + \tau \rho Z + \tau \sqrt{1 - \rho^2} W \).

## The General Multivariate Normal Distribution

The general multivariate normal distribution is a natural generalization of the bivariate normal distribution studied above. The exposition is very compact and elegant using expected value and covariance matrices, and would be horribly complex without these tools. Thus, this section requires some prerequisite knowledge of linear algebra. In particular, recall that \(\boldsymbol{A}^T\) denotes the transpose of a matrix \(\boldsymbol{A}\) and that we identify a vector in \(\R^n\) with the corresponding \(n \times 1\) column vector.

### The Standard Distribution

Suppose that \(\bs Z = (Z_1, Z_2, \ldots, Z_n)\) is a vector of independent random variables, each with the standard normal distribution. Then \(\bs Z\) is said to have the \(n\)-dimensional standard normal distribution.

- \(\E(\bs Z) = \bs{0}\) (the zero vector in \(\R^n\)).
- \(\vc(\bs Z) = I\) (the \(n \times n\) identity matrix).

\(\bs Z\) has probability density function \( \phi_n \) given by \[ \phi_n(\bs z) = \frac{1}{(2 \pi)^{n/2}} \exp \left( -\frac{1}{2} \bs z \cdot \bs z \right) = \frac{1}{(2 \pi)^{n/2}} \exp\left(-\frac{1}{2} \sum_{i=1}^n z_i^2\right), \quad \bs z = (z_1, z_2, \ldots, z_n) \in \R^n \] where as usual, \( \phi \) is the standard normal PDF.

## Proof

By independence, \( \phi_n(\bs z) = \phi(z_1) \phi(z_2) \cdots \phi(z_n) \).

\(\bs Z\) has moment generating function \( m_n \) given by \[ m_n(\bs t) = \E\left[\exp(\bs t \cdot \bs Z)\right] = \exp\left[\frac{1}{2} \var(\bs t \cdot \bs Z)\right] = \exp \left( \frac{1}{2} \bs t \cdot \bs t \right) = \exp\left(\frac{1}{2} \sum_{i=1}^n t_i^2\right), \quad \bs t = (t_1, t_2, \ldots, t_n) \in \R^n \]

## Proof

By independence, \( \E\left[\exp(\bs t \cdot \bs Z)\right] = m(t_1) m(t_2) \cdots m(t_n) \) where \( m \) is the standard normal MGF.

### The General Distribution

Suppose that \(\bs Z\) has the \(n\)-dimensional standard normal distribution. Suppose also that \(\bs \mu \in \R^n\) and that \(\bs A \in \R^{n \times n}\) is invertible. The random vector \(\bs X = \bs \mu + \bs A \bs Z\) is said to have an \(n\)-dimensional normal distribution.

- \(\E(\bs X) = \bs \mu\).
- \(\vc(\bs X) = \bs A \, \bs A^T\).

## Proof

- From the linear property of expected value, \( \E(\bs X) = \bs \mu + \bs A \E(\bs Z) = \bs \mu \).
- From basic properties of the variance-covariance matrix, \( \vc(\bs X) = \bs A \bs A^T \vc(\bs Z) = \bs A \bs A^T \).

In the context of this result, recall that the variance-covariance matrix \(\vc(\bs X) = \bs A \bs A^T\) is symmetric and positive definite (and hence also invertible). We will now see that the multivariate normal distribution is completely determined by the expected value vector \( \bs \mu \) and the variance-covariance matrix \( \bs V \), and hence these give the basic parameters of the distribution.

Suppose that \(\bs X\) has an \( n \)-dimensional normal distribution with expected value vector \( \bs \mu \) and variance-covariance matrix \(\bs V \). The probability density function \( f \) of \( \bs X \) is given by

\[ f(\bs x) = \frac{1}{(2 \pi)^{n/2} \sqrt{\det(\bs V)}} \exp \left[ -\frac{1}{2} (\bs x - \bs \mu) \cdot \bs V^{-1} (\bs x - \bs \mu) \right], \quad \bs x \in \R^n \]## Proof

From the definition can assume that \( \bs X = \bs \mu + \bs A \bs Z \) where \( \bs A \in \R^{n \times n}\) is invertible and \( \bs Z \) has the \( n \)-dimensional standard normal distribution, so that \( \bs V = \bs A \bs A^T \) The inverse of the transformation \( \bs x = \bs \mu + \bs A \bs z \) is \( \bs z = \bs A^{-1}(\bs x - \bs \mu) \) and hence the Jacobian of the inverse transformation is \( \det\left(\bs A^{-1}\right) = 1 \big/ \det(\bs A) \). Using the multivariate change of variables theorem, \[ f(\bs x) = \frac{1}{\left|\det(\bs A)\right|} \phi_n\left[\bs A^{-1}(\bs x - \bs \mu)\right] = \frac{1}{(2 \pi)^{n/2} \left|\det(\bs A)\right|} \exp\left[-\frac{1}{2} \bs A^{-1}(\bs x - \bs \mu) \cdot \bs A^{-1}(\bs x - \bs \mu)\right], \quad \bs x \in \R^n \] But \( \det(\bs V) = \det\left(\bs A \bs A^T\right) = \det(\bs A) \det\left(\bs A^T\right) = \left[\det(\bs A)\right]^2 \) and hence \( \left|\det(\bs A)\right| = \sqrt{\det(\bs V)} \). Also, \begin{align} \bs A^{-1}(\bs x - \bs \mu) \cdot \bs A^{-1}(\bs x - \bs \mu) & = \left[\bs A^{-1}(\bs x - \bs \mu)\right]^T \bs A^{-1}(\bs x - \bs \mu) = (\bs x - \bs \mu)^T \left(\bs A^{-1}\right)^T \bs A^{-1} (\bs x - \bs \mu) \\ & = (\bs x - \bs \mu)^T \left(\bs A^T\right)^{-1} \bs A^{-1} (\bs x - \bs \mu) = (\bs x - \bs \mu)^T \left(\bs A \bs A^T\right)^{-1} (\bs x - \bs \mu)\\ & = (\bs x - \bs \mu)^T \bs V^{-1} (\bs x - \bs \mu) = (\bs x - \bs \mu) \cdot \bs V^{-1} (\bs x - \bs \mu) \end{align}

Suppose again that \(\bs X\) has an \( n \)-dimensional normal distribution with expected value vector \( \bs \mu \) and variance-covariance matrix \(\bs V \). The moment generating function \( M \) of \( \bs X \) is given by \[ M(\bs t) = \E\left[\exp(\bs t \cdot \bs X)\right] = \exp\left[\E(\bs t \cdot \bs X) + \frac{1}{2} \var(\bs t \cdot \bs X)\right] = \exp \left( \bs t \cdot \bs \mu + \frac{1}{2} \bs t \cdot \bs V \bs t \right), \quad \bs t \in \R^n \]

## Proof

Once again we start with the definition and assume that \( \bs X = \bs \mu + \bs A \bs X \) where \( \bs A \in \R^{n \times n} \) is invertible. we have \(\E\left[\exp(\bs t \cdot \bs X\right] = \exp(\bs t \cdot \bs \mu) \E\left[\exp(\bs t \cdot \bs A \bs Z)\right] \). But \( \bs t \cdot \bs A \bs Z = \left(\bs A^T \bs t\right) \cdot \bs Z \) so using the MGF of \( \bs Z \) we have \[ \E\left[\exp(\bs t \cdot \bs A \bs Z)\right] = \exp\left[\frac{1}{2} \left(\bs A^T \bs t\right) \cdot \left(\bs A^T \bs t\right)\right] = \exp\left[\frac{1}{2} \bs t^T \bs A \bs A^T \bs t\right] = \exp\left[\frac{1}{2} \bs t \cdot \bs V \bs t\right] \]

Of course, the moment generating function completely determines the distribution. Thus, if a random vector \( \bs X \) in \( \R^n \) has a moment generating function of the form given above, for some \( \bs \mu \in \R^n \) and symmetric, positive definite \( \bs V \in \R^{n \times n} \), then \( \bs X \) has the \( n \)-dimensional normal distribution with mean \( \bs \mu \) and variance-covariance matrix \( \bs V \).

Note again that in the representation \( \bs X = \bs \mu + \bs A \bs Z \), the distribution of \( \bs X \) is uniquely determined by the expected value vector \( \bs \mu \) and the variance-covariance matrix \(\bs V = \bs A \bs A^T\), but not by \( \bs \mu \) and \( \bs A \). In general, for a given positive definite matrix \(\bs V\), there are many invertible matrices \(\bs A\) such that \( \bs V = \bs A \bs A^T\) (the matrix \( \bs A \) is a bit like a square root of \( \bs V \)). A theorem in matrix theory states that there is a unique lower triangular matrix \(\bs L\) with this property. The representation \( \bs X = \bs \mu + \bs L \bs Z \) is known as the canonical representation of \( \bs X \).

If \( \bs X = (X, Y) \) has bivariate normal distribution with parameters \( (\mu, \nu, \sigma, \tau, \rho) \), then the lower triangular matrix \( \bs L \) such that \( \bs L \bs L^T = \vc(\bs X) \) is \[ \bs L = \left[\begin{matrix} \sigma & 0 \\ \tau \rho & \tau \sqrt{1 - \rho^2} \end{matrix} \right] \]

## Proof

Note that \[ \bs L \bs L^T = \left[\begin{matrix} \sigma^2 & \sigma \tau \rho \\ \sigma \tau \rho & \tau^2 \end{matrix}\right] = \vc(X, Y) \]

Note that the matrix \( \bs L \) above gives the canonical representation of \( (X, Y) \) in terms of the standard normal vector \( (Z, W) \) in the original definition, namely \( X = \mu + \sigma Z \), \( Y = \nu + \tau \rho Z + \tau \sqrt{1 - \rho^2} W \).

If the matrix \( \bs A \in \R^{n \times n} \) in the definition is not invertible, then the variance-covariance matrix \( \bs V = \bs A \bs A^T\) is symmetric, but only positive semi-definite. The random vector \( \bs X = \bs \mu + \bs A \bs Z \) takes values in a lower dimensional affine subspace of \( \R^n \) that has measure 0 relative to \( n \)-dimensional Lebesgue measure \( \lambda_n \). Thus, \( \bs X \) does not have a probability density function relative to \( \lambda_n \), and so the distribution is degenerate. However, the formula for the moment generating function still holds. Degenerate normal distributions are discussed in more detail below.

### Transformations

The multivariate normal distribution is invariant under two basic types of transformations on the underlying random vectors: affine transformations (with linearly independent rows), and concatenation of independent vectors. As simple corollaries of these two results, the normal distribution is also invariant with respect to subsequences of the random vector, re-orderings of the terms in the random vector, and sums of independent random vectors. The main tool that we will use is the moment generating function. We start with the first main result on affine transformations.

Suppose that \(\bs X\) has the \(n\)-dimensional normal distribution with mean vector \(\bs \mu\) and variance-covariance matrix \(\bs V\). Suppose also that \(\bs{a} \in \R^m\) and that \(\bs A \in \R^{m \times n}\) has linearly independent rows (thus, \(m \le n\)). Then \(\bs{Y} = \bs{a} + \bs A \bs X\) has an \(m\)-dimensional normal distribution, with

- \(\E(\bs{Y}) = \bs{a} + \bs A \bs \mu\)
- \(\vc(\bs{Y}) = \bs A \bs V \bs A^T\)

## Proof

For \( \bs t \in \R^m \), \( \E\left[\exp(\bs t \cdot \bs{Y})\right] = \exp(\bs t \cdot \bs{a}) \E\left[\bs t \cdot \bs A \bs X\right] \). but \( \bs t \cdot \bs A \bs X = \left(\bs A^T \bs t\right) \cdot \bs X \), so using the MGF of \( \bs X \) we have \[ \E\left[\exp(\bs t \cdot \bs A \bs X)\right] = \exp\left[\left(\bs A^T \bs t\right) \cdot \bs \mu + \frac{1}{2}\left(\bs A^T \bs t\right) \cdot \bs V \left(\bs A^T \bs t\right)\right] \] But \( \left(\bs A^T \bs t\right) \cdot \bs \mu = \bs t \cdot \bs A \bs \mu \) and \(\left(\bs A^T \bs t\right) \cdot \bs V \left(\bs A^T \bs t\right) = \bs t \cdot \left(\bs A \bs V \bs A^T\right) \bs t\), so letting \( \bs{b} = \bs{a} + \bs A \bs \mu \) and \( \bs{U} = \bs A \bs V \bs A^T\) and putting the pieces together, we have \( \E\left[\exp( \bs t \cdot \bs{Y})\right] = \exp\left[ \bs{b} \cdot \bs t + \frac{1}{2} \bs t \cdot \bs{U} \bs t \right] \).

A clearly important special case is \( m = n \), which generalizes the definition. Thus, if \( \bs{a} \in \R^n \) and \( \bs A \in \R^{n \times n}\) is invertible, then \( \bs{Y} = \bs{a} + \bs A \bs X\) has an \( n \)-dimensional normal distribution. Here are some other corollaries:

Suppose that \(\bs X = (X_1, X_2, \ldots, X_n)\) has an \(n\)-dimensional normal distribution. If \( \{i_1, i_2, \ldots, i_m\} \) is a set of distinct indices, then \( \bs{Y} = \left(X_{i_1}, X_{i_2}, \ldots, X_{i_m}\right) \) has an \( m \)-dimensional normal distribution.

## Proof

Let \( A \in \R^{m \times n}\) be the matrix defined by the condition that for \( j \in \{1, 2, \ldots, m\} \), row \( j \) has 1 in position \( i_j \) and has 0 in all other positions. Then \( \bs A \) has linearly independent rows (since the \( i_j \) are distinct in \( j \)) and \( \bs{Y} = \bs A \bs X \). Thus the result follows from the general theorem on affine transformations.

In the context of the previous result, if \( \bs X \) has mean vector \( \bs \mu \) and variance-covariance matrix \( \bs V \), then \( \bs{Y} \) has mean vector \( \bs A \bs \mu \) and variance-covariance matrix \( \bs A \bs V \bs A^T \), where \( \bs A \) is the 0-1 matrix defined in the proof. As simple corollaries, note that if \( \bs X = (X_1, X_2, \ldots, X_n) \) has an \( n \)-dimensional normal distribution, then any permutation of the coordinates of \( \bs X \) also has an \( n \)-dimensional normal distribution, and \( (X_1, X_2, \ldots, X_m) \) has an \( m \)-dimensional normal distribution for any \( m \le n \). Here is a slight extension of the last statement.

Suppose that \(\bs X\) is a random vector in \(\R^m\), \(\bs{Y}\) is a random vector in \(\R^n\), and that \((\bs X, \bs{Y})\) has an \((m + n)\)-dimensional normal distribution. Then

- \( \bs X \) has an \( m \)-dimensional normal distribution.
- \( \bs{Y} \) has an \( n \)-dimensional normal distribution.
- \(\bs X\) and \(\bs{Y}\) are independent if and only if \(\cov(\bs X, \bs{Y}) = \bs{0}\) (the \(m \times n\) zero matrix).

## Proof

As we already noted, parts (a) and (b) are a simple consequence of the previous theorem. Thus, we just need to verify (c). In block form, note that \[ \vc(\bs X, \bs{Y}) = \left[\begin{matrix} \vc(\bs X) & \cov(\bs X, \bs{Y}) \\ \cov(\bs{Y}, \bs X) & \vc(\bs{Y})\end{matrix} \right] \] Now let \( M \) denote the moment generating function of \( (\bs X, \bs{Y}) \), \( M_1 \) the MGF of \( \bs X \), and \( M_2 \) the MGF of \( \bs{Y} \). From the form of the MGF, note that \( M(\bs{s}, \bs t) = M_1(\bs{s}) M_2(\bs t) \) for all \( \bs{s} \in \R^m \), \( \bs t \in \R^n \) if and only if \( \cov(\bs X, \bs{Y}) = \bs{0} \), the \( m \times n \) zero matrix.

Next is the converse to part (c) of the previous result: concatenating independent normally distributed vectors produces another normally distributed vector.

Suppose that \( \bs X \) has the \( m \)-dimensional normal distribution with mean vector \( \bs \mu \) and variance-covariance matrix \( \bs{U} \), \( \bs{Y} \) has the \( n \)-dimensional normal distribution with mean vector \( \bs{\nu} \) and variance-covariance matrix \( \bs V \), and that \( \bs X \) and \( \bs{Y} \) are independent. Then \( \bs Z = (\bs X, \bs{Y})\) has the \(m + n\)-dimensional normal distribution with

- \( \E(\bs X, \bs{Y}) = (\bs \mu, \bs{\nu}) \)
- \( \vc(\bs X, \bs{Y}) = \left[\begin{matrix} \vc(\bs X) & \bs{0} \\ \bs{0}^T & \vc(\bs{Y})\end{matrix}\right]\) where \( \bs{0} \) is the \( m \times n \) zero matrix.

## Proof

For \( \bs t \in \R^{m + n} \), write \( \bs t \) in block form as \( \bs t = (\bs{r}, \bs{s}) \) where \( \bs{r} \in \R^m \) and \(\bs{s} \in \R^n\). By independence, the MGF of \( (\bs X, \bs{Y}) \) is \[ \E\left(\exp\left[\bs t \cdot (\bs X, \bs{Y})\right]\right) = \E\left[\bs{r} \cdot \bs X + \bs{s} \cdot \bs{Y}\right] = \E\left[\exp(\bs{r} \cdot \bs X)\right] \E\left[\exp(\bs{s} \cdot \bs{Y})\right]\] Using the formula for the normal MGF we have \[ \E\left(\exp\left[\bs t \cdot (\bs X, \bs{Y})\right]\right) = \exp \left( \bs{r} \cdot \bs \mu + \frac{1}{2} \bs{r} \cdot \bs{U} \, \bs{r} \right) \exp \left( \bs{s} \cdot \bs{\nu} + \frac{1}{2} \bs{s} \cdot \bs V \, \bs{s} \right) = \exp\left[(\bs{r} \cdot \bs \mu + \bs{s} \cdot \bs{\nu}) + \frac{1}{2} (\bs{r} \cdot \bs{U} \bs{r} + \bs{s} \cdot \bs V \bs{s})\right] \] But \( \bs{r} \cdot \bs \mu + \bs{s} \cdot \bs{\nu} = \bs t \cdot (\bs \mu, \bs{\nu}) \) and \( \bs{r} \cdot \bs{U} \bs{r} + \bs{s} \cdot \bs V \bs{s} = \bs t \cdot \left[\begin{matrix} \vc(\bs X) & \bs{0} \\ \bs{0}^T & \vc(\bs{Y})\end{matrix}\right] \bs t \) so the proof is complet

Just as in the univariate case, the normal family of distributions is closed with respect to sums of independent variables. The proof follows easily from the previous result.

Suppose that \( \bs X \) has the \( n \)-dimensional normal distribution with mean vector \( \bs \mu \) and variance-covariance matrix \( \bs{U} \), \( \bs{Y} \) has the \( n \)-dimensional normal distribution with mean vector \( \bs{\nu} \) and variance-covariance matrix \( \bs V \), and that \( \bs X \) and \( \bs{Y} \) are independent. Then \( \bs X + \bs{Y}\) has the \( n \)-dimensional normal distribution with

- \( \E(\bs X + \bs{Y}) = \bs \mu + \bs{\nu}\)
- \( \vc(\bs X + \bs{Y}) = \bs{U} + \bs V \)

## Proof

From the previous result \( (\bs X, \bs{Y}) \) has a \( 2 n \)-dimensional normal distribution. Moreover, \( \bs X + \bs{Y} = \bs A(\bs X, \bs{Y}) \) where \( \bs A \) is the \( n \times 2 n \) matrix defined by the condition that for \( i \in \{1, 2, \ldots, n\} \), row \( i \) has 1 in positions \( i \) and \( n + i \) and \( 0 \) in all other positions. The matrix \( A \) has linearly independent rows and thus the result follows from the general theorem on affine transformations. Parts (a) and (b) are standard results for sums of independent random vectors.

We close with a trivial corollary to the general result on affine transformation, but this corollary points the way to a further generalization of the multivariate normal distribution that includes the degenerate distributions.

Suppose that \(\bs X\) has an \(n\)-dimensional normal distribution with mean vector \(\bs \mu\) and variance-covariance matrix \(\bs V\), and that \(\bs{a} \in \R^n\) with \( \bs{a} \ne \bs{0} \). Then \(Y = \bs{a} \cdot \bs X \) has a (univariate) normal distribution with

- \(\E(Y) = \bs{a} \cdot \bs \mu\)
- \(\var(Y) = \bs{a} \cdot \bs V \bs{a}\)

## Proof

Note again that \( \bs{a} \cdot \bs X = \bs{a}^T \bs X\). Since \( \bs{a} \ne \bs{0} \), the single row of \( \bs{a}^T \) is linearly independent and hence the result follows from the general theorem on affine transformations.

## A Further Generalization

The last result can be used to give a simple, elegant definition of the multivariate normal distribution that includes the degenerate distributions as well as the ones we have considered so far. First we will adopt our general definition of the univariate normal distribution that includes constant random variables.

A random variable \( \bs X \) that takes values in \( \R^n \) has an \( n \)-dimensional normal distribution if and only if \( \bs{a} \cdot \bs X \) has a univariate normal distribution for every \( \bs{a} \in \R^n \).

Although an \( n \)-dimensional normal distribution may not have a probability density function with respect to \( n \)-dimensional Lebesgue measure \( \lambda_n \), the form of the moment generating function is unchanged.

Suppose that \( \bs X \) has mean vector \( \bs \mu \) and variance-covariance matrix \( \bs V \), and that \( \bs X \) has an \( n \)-dimensional normal distribution. The moment generating function of \(\bs X\) is given by \[ \E\left[\exp(\bs t \cdot \bs X)\right] = \exp\left[\E(\bs t \cdot \bs X) + \frac{1}{2} \var(\bs t \cdot \bs X)\right] = \exp \left( \bs t \cdot \bs \mu + \frac{1}{2} \bs t \cdot \bs V \, \bs t \right), \quad \bs t \in \R^n \]

## Proof

If \( \bs t \in \R^n \), then by definition, \( \bs t \cdot \bs X \) has a univariate normal distribution. Thus \( \E\left[\exp(\bs t \cdot \bs X)\right] \) is simply the moment generating function of \( \bs t \cdot \bs X \), evaluated at the argument 1. The results then follow from the univariate MGF.

Our new general definition really is a generalization.

Suppose that \( \bs X \) has an \( n \)-dimensional normal distribution in the sense of the general definition, and that the distribution of \( \bs X \) has a probability density function on \( \R^n \) with respect to Lebesgue measure \( \lambda_n \). Then \( \bs X \) has an \( n \)-dimensional normal distribution in the sense of our original definition.

## Proof

This follows from our previous results, since both the MGF and the PDF completely determine the distribution of \( \bs X \).