4.2: Predictions and the Hat Matrix
In this section, we continue to explore the results of our decision to define "best" in the manner we did... and the additional assumptions we made in 4.1: Matrix Representation. This is how mathematics progresses. Assumptions are made and mathematicians explore the consequences of those assumptions. Here, we will just focus on the hat matrix and see what it can tell us about our choices.
Beyond modeling the relationship, one may also want to estimate or predict values of \(\mathbf{Y}\) for a given value of \(\mathbf{X}\). In matrix terms, this means computing \(\mathbf{\hat{Y}} = \mathbf{X}\mathbf{b}\), where \(\mathbf{b}\) is the OLS estimator. But note the following:
\begin{align}
\mathbf{\hat{Y}} &= \mathbf{X}\mathbf{b} \\[1em]
&= \mathbf{X}\ \left( \mathbf{X}^\prime \mathbf{X}\right)^{-1} \mathbf{X}^\prime \mathbf{Y} \\[1em]
&= \left( \mathbf{X} \left( \mathbf{X}^\prime \mathbf{X}\right)^{-1} \mathbf{X}^\prime\right)\ \mathbf{Y}
\end{align}
Note that the matrix \(\mathbf{X} \left( \mathbf{X}^\prime \mathbf{X}\right)^{-1} \mathbf{X}^\prime\) "puts a hat" on the \(\mathbf{Y}\) matrix. As such, it is called the "hat matrix," \(\mathbf{H}\).
\begin{equation}
\mathbf{H} = \mathbf{X} \left( \mathbf{X}^\prime \mathbf{X}\right)^{-1} \mathbf{X}^\prime
\end{equation}
Thus, we have simple matrix equations for the estimators and the residuals:
\begin{align}
\mathbf{\hat{Y}} &= \mathbf{H} \mathbf{Y}\\[2em]
\mathbf{E} &= \mathbf{Y} - \mathbf{\hat{Y}} = \left( \mathbf{I}-\mathbf{H}\right) \mathbf{Y}
\end{align}
Why is this important? As we will prove below, these representations imply that the predictions and the residuals are orthogonal.
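These formulas are easy to check numerically. Below is a minimal sketch in Python with NumPy; the data are invented purely for illustration:

```python
import numpy as np

# Invented data: n = 5 observations of a single predictor
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
Y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

# Design matrix with a leading column of 1s (intercept)
X = np.column_stack([np.ones_like(x), x])

# Hat matrix: H = X (X'X)^{-1} X'
H = X @ np.linalg.inv(X.T @ X) @ X.T

Y_hat = H @ Y                    # predictions, Y-hat = H Y
E = (np.eye(len(Y)) - H) @ Y     # residuals,   E = (I - H) Y

# The same predictions via the OLS coefficients b = (X'X)^{-1} X' Y
b = np.linalg.inv(X.T @ X) @ X.T @ Y
print(np.allclose(Y_hat, X @ b))   # True
```

In practice one would use `np.linalg.solve` or `np.linalg.lstsq` rather than forming the inverse explicitly; the explicit inverse is used here only to mirror the algebra.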
Exploring the \(\mathbf{H}\) Matrix
There are some surprising results from the above observations. Let us spend the rest of this section exploring the hat matrix, \(\mathbf{H}\).
First, let us show that the OLS estimator is optimal in the sense that its predictions are the closest, among all vectors in the column space of \(\mathbf{X}\), to the observed data.
Theorem: The matrices \(\mathbf{H}\) and \(\mathbf{I}-\mathbf{H}\) are orthogonal.
Proof:
To show orthogonality, we need to show that the inner product is zero:
\begin{align}
\mathbf{H}^\prime \left( \mathbf{I}-\mathbf{H} \right) &= \mathbf{H} \left( \mathbf{I}-\mathbf{H} \right) \\[1em]
&= \mathbf{H} - \mathbf{H}\mathbf{H} \\[1em]
&= \mathbf{H} - \mathbf{H} \\[1em]
&= \mathbf{0}
\end{align}
In the proof, we used the fact that the hat matrix is symmetric and idempotent. The next theorem proves this to be the case.
Theorem: The matrix \(\mathbf{H}\) is symmetric and idempotent.
Proof:
Let us start by showing \(\mathbf{H}\) is symmetric.
\begin{align}
\mathbf{H}^\prime &= \left( \mathbf{X}(\mathbf{X}^\prime\mathbf{X})^{-1}\mathbf{X}^\prime \right)^\prime \\[1em]
\end{align}
Recall from 20.4: Other Terms and Operations that \((\mathbf{A}\mathbf{B})^\prime = \mathbf{B}^\prime \mathbf{A}^\prime\). Thus:
\begin{align}
\left(\mathbf{X} \left(\mathbf{X}^\prime\mathbf{X} \right)^{-1} \mathbf{X}^\prime\right)^\prime &= \mathbf{X}^{\prime\prime} \left( (\mathbf{X}^\prime\mathbf{X})^{-1} \right)^\prime \mathbf{X}^\prime \\[1em]
&= \mathbf{X} \left( (\mathbf{X}^\prime\mathbf{X})^{-1} \right)^\prime \mathbf{X}^\prime
\end{align}
I leave it as an exercise to show that \(\mathbf{X}^\prime\mathbf{X}\) is symmetric, and so is its inverse. Thus
\begin{align}
\left(\mathbf{X} \left(\mathbf{X}^\prime\mathbf{X} \right)^{-1} \mathbf{X}^\prime\right)^\prime &= \mathbf{X} (\mathbf{X}^\prime\mathbf{X})^{-1} \mathbf{X}^\prime \\[1em]
&= \mathbf{H}
\end{align}
Next, let us show that \(\mathbf{H}\) is idempotent.
\begin{align}
\mathbf{H}\mathbf{H} &= \mathbf{X} \left( \mathbf{X}^\prime \mathbf{X}\right)^{-1} \mathbf{X}^\prime\ \mathbf{X} \left( \mathbf{X}^\prime \mathbf{X}\right)^{-1} \mathbf{X}^\prime \\[1em]
&= \mathbf{X} \left[ \left( \mathbf{X}^\prime \mathbf{X}\right)^{-1} \mathbf{X}^\prime\ \mathbf{X}\right] \left( \mathbf{X}^\prime \mathbf{X}\right)^{-1} \mathbf{X}^\prime \\[1em]
&= \mathbf{X}\ \mathbf{I}\ \left( \mathbf{X}^\prime \mathbf{X}\right)^{-1} \mathbf{X}^\prime \\[1em]
&= \mathbf{X} \left( \mathbf{X}^\prime \mathbf{X}\right)^{-1} \mathbf{X}^\prime \\[1em]
&= \mathbf{H}
\end{align}
\(\blacksquare\)
Since \(\mathbf{H}\) is symmetric and idempotent, it is an orthogonal projection matrix.
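All three facts — symmetry, idempotence, and the orthogonality of \(\mathbf{H}\) and \(\mathbf{I}-\mathbf{H}\) — can be verified numerically for any full-column-rank design matrix. A small sketch in Python with NumPy, using randomly generated data purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
# Random full-column-rank design: n = 6 rows, intercept plus 2 predictors
X = np.column_stack([np.ones(6), rng.normal(size=(6, 2))])

H = X @ np.linalg.inv(X.T @ X) @ X.T
I = np.eye(6)

print(np.allclose(H, H.T))           # symmetric
print(np.allclose(H @ H, H))         # idempotent
print(np.allclose(H @ (I - H), 0))   # H and I - H are orthogonal
```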
Theorem: The vectors \(\mathbf{\hat{Y}}\) and \(\mathbf{E}\) are orthogonal.
Proof:
I leave this as an exercise.
Since the predictions and residuals are orthogonal, we know the following is true by the Pythagorean Theorem:
\begin{equation}
\mathbf{Y}^\prime\mathbf{Y} = \mathbf{\hat{Y}}^\prime \mathbf{\hat{Y}} + \mathbf{E}^\prime \mathbf{E}
\end{equation}
Let us also prove this using matrices.
\begin{equation}
\mathbf{Y}^\prime\mathbf{Y} = \mathbf{\hat{Y}}^\prime \mathbf{\hat{Y}} + \mathbf{E}^\prime \mathbf{E}
\end{equation}
Proof.
Let us prove this without resorting to the Pythagorean Theorem. We know \(\mathbf{Y} = \mathbf{\hat{Y}} + \mathbf{E}\). Thus,
\begin{align}
\mathbf{Y}^\prime\mathbf{Y} &= \left( \mathbf{\hat{Y}} + \mathbf{E} \right)^\prime \left( \mathbf{\hat{Y}} + \mathbf{E} \right) \\[1em]
&= \mathbf{\hat{Y}}^\prime\mathbf{\hat{Y}} + \mathbf{E}^\prime\mathbf{E} + \mathbf{\hat{Y}}^\prime \mathbf{E} + \mathbf{E}^\prime \mathbf{\hat{Y}} \\[1em]
&= \mathbf{\hat{Y}}^\prime\mathbf{\hat{Y}} + \mathbf{E}^\prime\mathbf{E} + \left(\mathbf{HY}\right)^\prime (\mathbf{I}-\mathbf{H})\mathbf{Y} + \left( (\mathbf{I}-\mathbf{H})\mathbf{Y}\right)^\prime \mathbf{H}\mathbf{Y} \\[1em]
&= \mathbf{\hat{Y}}^\prime\mathbf{\hat{Y}} + \mathbf{E}^\prime\mathbf{E} + \mathbf{Y}^\prime\mathbf{H}^\prime (\mathbf{I}-\mathbf{H})\mathbf{Y} + \mathbf{Y}^\prime(\mathbf{I}-\mathbf{H})^\prime \mathbf{H}\mathbf{Y}
\end{align}
Remember that \(\mathbf{H}\) and \(\mathbf{I}-\mathbf{H}\) are symmetric. That gives us
\begin{align}
\mathbf{Y}^\prime\mathbf{Y} &= \mathbf{\hat{Y}}^\prime\mathbf{\hat{Y}} + \mathbf{E}^\prime\mathbf{E} + \mathbf{Y}^\prime\mathbf{H}(\mathbf{I}-\mathbf{H})\mathbf{Y} + \mathbf{Y}^\prime(\mathbf{I}-\mathbf{H})\mathbf{H}\mathbf{Y}
\end{align}
Finally, since \(\mathbf{H}(\mathbf{I}-\mathbf{H})=(\mathbf{I}-\mathbf{H})\mathbf{H}=\mathbf{0}\), we have
\begin{align}
\mathbf{Y}^\prime\mathbf{Y} &= \mathbf{\hat{Y}}^\prime\mathbf{\hat{Y}} + \mathbf{E}^\prime\mathbf{E} + \mathbf{Y}^\prime \mathbf{0} \mathbf{Y} + \mathbf{Y}^\prime \mathbf{0} \mathbf{Y} \\[1em]
\mathbf{Y}^\prime\mathbf{Y} &= \mathbf{\hat{Y}}^\prime\mathbf{\hat{Y}} + \mathbf{E}^\prime\mathbf{E}
\end{align}
\(\blacksquare\)
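As a quick numerical sanity check of the decomposition, here is a sketch in Python with NumPy on invented data:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
Y = np.array([2.0, 4.1, 5.9, 8.2, 9.7])
X = np.column_stack([np.ones_like(x), x])

H = X @ np.linalg.inv(X.T @ X) @ X.T
Y_hat = H @ Y
E = Y - Y_hat

lhs = Y @ Y                     # Y'Y
rhs = Y_hat @ Y_hat + E @ E     # Y-hat'Y-hat + E'E
print(np.isclose(lhs, rhs))     # True
```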
This will come in handy when we add probability distributions to our mathematics, thus creating statistics. Independence is important in determining the distributions of test statistics.
By the way, we also can show that the residuals and predicted values are uncorrelated by showing their covariance is zero.
\begin{equation}
Cov[\mathbf{\hat{Y}},\mathbf{E}] = 0
\end{equation}
Proof.
I will only give the first step to this proof. The rest will be up to you to figure out.
\begin{align}
Cov[\mathbf{\hat{Y}}, \mathbf{E}] &= Cov[\mathbf{H}\mathbf{Y}, (\mathbf{I}-\mathbf{H})\mathbf{Y}]
\end{align}
By the way, this result should not be surprising given that the prediction and residual vectors are orthogonal.
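A numerical illustration of this zero covariance, again in Python with NumPy (`np.cov` computes the sample covariance matrix; the data are invented):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
Y = np.array([1.8, 4.2, 6.1, 7.9, 10.1])
X = np.column_stack([np.ones_like(x), x])

H = X @ np.linalg.inv(X.T @ X) @ X.T
Y_hat = H @ Y
E = Y - Y_hat

# Sample covariance between predictions and residuals is zero
print(np.isclose(np.cov(Y_hat, E)[0, 1], 0.0))   # True
```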
Consequences
In this section, we started with the matrix equation \(\mathbf{Y} = \mathbf{X}\mathbf{B} + \mathbf{E}\) and obtained the OLS estimator of \(\mathbf{B}\). With that solution (and the requirement that \(\mathbf{X}\) be full column rank), we have another result.
\begin{equation}
\mathbf{X}^\prime \mathbf{E} = \mathbf{0}
\end{equation}
Proof.
Again, I will just start you off with this proof. Completing it is up to you.
\begin{align}
\mathbf{Y} &= \mathbf{X}\mathbf{B} + \mathbf{E}
\end{align}
Where from here?
What does this theorem mean? Recall that \(\mathbf{X}^\prime \mathbf{E}\) is a \(p \times 1\) matrix. The first column of \(\mathbf{X}\) is a column of 1s. Thus, the first element of \(\mathbf{X}^\prime \mathbf{E}\) is just the sum of the residuals.
- That means the residuals must sum to 0 when we use the OLS estimator.
The other elements in the \(\mathbf{X}^\prime \mathbf{E}\) matrix consist of the sum of the residuals times the values of each independent variable.
- This means that, under OLS, the residuals are necessarily uncorrelated with each of the independent variables. It is a consequence of the mathematics used.
To see this in simple linear regression:
\begin{align}
\mathbf{X}^\prime \mathbf{E} &= \left[\begin{array}{ccccc}
1 & 1 & 1 & \cdots & 1 \\
x_1 & x_2 & x_3 & \cdots & x_n \\
\end{array} \right] \left[\begin{array}{c}
e_1 \\
e_2 \\
e_3 \\
\vdots \\
e_n \\
\end{array} \right] \\
&= \left[\begin{array}{c}
\sum e_i \\[1ex]
\sum x_i e_i \\
\end{array} \right]
\end{align}
This matrix is \(\mathbf{0}\) only when all of its elements are also 0. Thus, we have \(\sum e_i = 0\); the sum of the residuals in OLS is mathematically guaranteed to be zero.
We also have \(\sum x_i e_i = 0\). Since \(\overline{e} = 0\), this equals \(\sum x_i e_i - n\overline{x}\,\overline{e} = \sum (x_i - \overline{x})(e_i - \overline{e}) = (n-1)\, Cov[x, e]\). Thus the sample covariance between \(x\) and \(e\) is zero: the residuals arising from OLS estimation are uncorrelated with the predictor variables.
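Both identities can be confirmed numerically; a sketch in Python with NumPy on invented data:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
Y = np.array([2.3, 3.8, 6.0, 8.3, 9.6])
X = np.column_stack([np.ones_like(x), x])

b = np.linalg.solve(X.T @ X, X.T @ Y)   # OLS coefficients
E = Y - X @ b                           # residuals

print(np.isclose(E.sum(), 0.0))         # sum of residuals is 0
print(np.isclose(x @ E, 0.0))           # sum of x_i * e_i is 0
print(np.allclose(X.T @ E, 0.0))        # X'E = 0, both rows at once
```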
Again, these are mathematical results from applying ordinary least squares. They are guaranteed simply because of the estimation method we selected. Had we chosen a different definition of "best fit," then this section may not hold.
Everything follows from our chosen definition of "best fit."


