4.2: Predictions and the Hat Matrix


    In this section, we continue to explore the results of our decision to define "best" in the manner we did... and the additional assumptions we made in 4.1: Matrix Representation. This is how mathematics progresses. Assumptions are made and mathematicians explore the consequences of those assumptions. Here, we will just focus on the hat matrix and see what it can tell us about our choices.


Beyond modeling the relationship, one may also want to estimate or predict values of \(\mathbf{Y}\) for a given value of \(\mathbf{X}\). In matrix terms, this requires calculating \(\mathbf{\hat{Y}} = \mathbf{X}\mathbf{b}\). But note the following:

    \begin{align}
    \mathbf{\hat{Y}} &= \mathbf{X}\mathbf{b} \\[1em]
    &= \mathbf{X}\ \left( \mathbf{X}^\prime \mathbf{X}\right)^{-1} \mathbf{X}^\prime \mathbf{Y} \\[1em]
    &= \left( \mathbf{X} \left( \mathbf{X}^\prime \mathbf{X}\right)^{-1} \mathbf{X}^\prime\right)\ \mathbf{Y}
    \end{align}

    Note that the matrix \(\mathbf{X} \left( \mathbf{X}^\prime \mathbf{X}\right)^{-1} \mathbf{X}^\prime\) "puts a hat" on the \(\mathbf{Y}\) matrix. As such, it is called the "hat matrix," \(\mathbf{H}\).

    \begin{equation}
    \mathbf{H} = \mathbf{X} \left( \mathbf{X}^\prime \mathbf{X}\right)^{-1} \mathbf{X}^\prime
    \end{equation}

Thus, we have simple matrix equations for the predicted values and the residuals:

    \begin{align}
    \mathbf{\hat{Y}} &= \mathbf{H} \mathbf{Y}\\[2em]
    \mathbf{E} &= \mathbf{Y} - \mathbf{\hat{Y}} = \left( \mathbf{I}-\mathbf{H}\right) \mathbf{Y}
    \end{align}

Why is this important? It shows that both the predictions and the residuals are linear functions of \(\mathbf{Y}\), and, as we will prove below, that the predictions and the residuals are orthogonal.
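
To make these equations concrete, here is a minimal numerical sketch in Python with NumPy. The simulated data, the true coefficients (intercept 3, slope 2), and the random seed are purely illustrative assumptions, not anything dictated by the mathematics above.

```python
# A minimal sketch of the hat matrix on simulated simple-linear-regression data.
import numpy as np

rng = np.random.default_rng(314)
n = 20
x = rng.uniform(0, 10, size=n)
y = 3.0 + 2.0 * x + rng.normal(0.0, 1.0, size=n)   # simulated Y = XB + E

X = np.column_stack([np.ones(n), x])               # design matrix: 1s, then x
b = np.linalg.solve(X.T @ X, X.T @ y)              # b = (X'X)^{-1} X'Y
H = X @ np.linalg.solve(X.T @ X, X.T)              # H = X (X'X)^{-1} X'
y_hat = H @ y                                      # Yhat = H Y
e = (np.eye(n) - H) @ y                            # E = (I - H) Y

print(np.allclose(y_hat, X @ b))                   # True: H Y equals X b
```

The later numerical checks in this section reuse `X`, `x`, `y`, `H`, `y_hat`, and `e` from this sketch.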

    Exploring the \(\mathbf{H}\) Matrix

    There are some surprising results from the above observations. Let us spend the rest of this section exploring the hat matrix, \(\mathbf{H}\).

First, let us work toward showing that the OLS solution is optimal in the sense that its predictions are as close as possible to the observed values.

    Theorem \(\PageIndex{1}\)

    The matrices \(\mathbf{H}\) and \(\mathbf{I-H}\) are orthogonal.

    Proof:
    To show orthogonality, we need to show that the inner product is zero:

    \begin{align}
    \mathbf{H}^\prime \left( \mathbf{I}-\mathbf{H} \right) &= \mathbf{H} \left( \mathbf{I}-\mathbf{H} \right) \\[1em]
    &= \mathbf{H} - \mathbf{H}\mathbf{H} \\[1em]
    &= \mathbf{H} - \mathbf{H} \\[1em]
    &= \mathbf{0}
    \end{align}

    In the proof, we used the fact that the hat matrix is symmetric idempotent. The next theorem proves this to be the case.

    Theorem \(\PageIndex{2}\)

    The matrix \(\mathbf{H}\) is symmetric idempotent.

    Proof:
    Let us start by showing \(\mathbf{H}\) is symmetric.

\begin{align}
\mathbf{H}^\prime &= \left( \mathbf{X}(\mathbf{X}^\prime\mathbf{X})^{-1}\mathbf{X}^\prime \right)^\prime
\end{align}

    Recall from 20.4: Other Terms and Operations that \((\mathbf{A}\mathbf{B})^\prime = \mathbf{B}^\prime \mathbf{A}^\prime\). Thus:

    \begin{align}
    \left(\mathbf{X} \left(\mathbf{X}^\prime\mathbf{X} \right)^{-1} \mathbf{X}^\prime\right)^\prime &= \mathbf{X}^{\prime\prime} \left( (\mathbf{X}^\prime\mathbf{X})^{-1} \right)^\prime \mathbf{X}^\prime \\[1em]
    &= \mathbf{X} \left( (\mathbf{X}^\prime\mathbf{X})^{-1} \right)^\prime \mathbf{X}^\prime
    \end{align}

    I leave it as an exercise to show that \(\mathbf{X}^\prime\mathbf{X}\) is symmetric, and so is its inverse. Thus

    \begin{align}
    \left(\mathbf{X} \left(\mathbf{X}^\prime\mathbf{X} \right)^{-1} \mathbf{X}^\prime\right)^\prime &= \mathbf{X} (\mathbf{X}^\prime\mathbf{X})^{-1} \mathbf{X}^\prime \\[1em]
    &= \mathbf{H}
    \end{align}

    Next, let us show that \(\mathbf{H}\) is idempotent.

    \begin{align}
    \mathbf{H}\mathbf{H} &= \mathbf{X} \left( \mathbf{X}^\prime \mathbf{X}\right)^{-1} \mathbf{X}^\prime\ \mathbf{X} \left( \mathbf{X}^\prime \mathbf{X}\right)^{-1} \mathbf{X}^\prime \\[1em]
    &= \mathbf{X} \left[ \left( \mathbf{X}^\prime \mathbf{X}\right)^{-1} \mathbf{X}^\prime\ \mathbf{X}\right] \left( \mathbf{X}^\prime \mathbf{X}\right)^{-1} \mathbf{X}^\prime \\[1em]
    &= \mathbf{X}\ \mathbf{I}\ \left( \mathbf{X}^\prime \mathbf{X}\right)^{-1} \mathbf{X}^\prime \\[1em]
    &= \mathbf{X} \left( \mathbf{X}^\prime \mathbf{X}\right)^{-1} \mathbf{X}^\prime \\[1em]
    &= \mathbf{H}
    \end{align}

    \(\blacksquare\)
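
Both properties, and the orthogonality claimed in Theorem \(\PageIndex{1}\), are easy to confirm numerically (up to floating-point error) with the sketch above:

```python
print(np.allclose(H, H.T))                         # H is symmetric
print(np.allclose(H @ H, H))                       # H is idempotent
print(np.allclose(H.T @ (np.eye(n) - H), 0.0))     # H'(I - H) = 0
```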

Since \(\mathbf{H}\) is symmetric and idempotent, it is a projection matrix that projects \(\mathbf{Y}\)-space onto the restricted (smaller) \(\mathbf{\hat{Y}}\)-space. Because it is an orthogonal projection matrix, \(\mathbf{\hat{Y}}\) is as close to \(\mathbf{Y}\) as possible while remaining in that subspace. That is, the errors are minimized. Figure \(\PageIndex{1}\), below, illustrates this.

[Figure: a model of a triangle in 3-space.]
    Figure \(\PageIndex{1}\): A schematic illustrating that \(\mathbf{\hat{Y}}\) is as close to \(\mathbf{Y}\) as possible, while remaining in its subspace (represented by the plane). In other words, the \(\mathbf{Y}\) matrix exists in an \(n\)-dimensional space. The solution, \(\mathbf{\hat{Y}}\), is in a \(p\)-dimensional space, with \(n > p\). Under the assumptions of ordinary least squares, the distance between \(\mathbf{Y}\) and \(\mathbf{\hat{Y}}\) (represented as the residuals, \(\mathbf{E}\)) is as small as possible if you define "distance" in terms of the Euclidean distance, \(L_2\).
    Theorem \(\PageIndex{3}\)

    The vectors \(\mathbf{\hat{Y}}\) and \(\mathbf{E}\) are orthogonal.

    Proof:
    I leave this as an exercise.
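
While the proof is left to you, the claim itself is easy to check numerically with the sketch above: the inner product of the realized prediction and residual vectors is zero up to floating-point error.

```python
print(np.isclose(y_hat @ e, 0.0))                  # Yhat'E = 0
```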

    Since the predictions and residuals are orthogonal, we know the following is true by the Pythagorean Theorem:

    \begin{equation}
    \mathbf{Y}^\prime\mathbf{Y} = \mathbf{\hat{Y}}^\prime \mathbf{\hat{Y}} + \mathbf{E}^\prime \mathbf{E}
    \end{equation}

    Let us also prove this using matrices.

    Theorem \(\PageIndex{4}\)

    \begin{equation}
    \mathbf{Y}^\prime\mathbf{Y} = \mathbf{\hat{Y}}^\prime \mathbf{\hat{Y}} + \mathbf{E}^\prime \mathbf{E}
    \end{equation}

    Proof.
    Let us prove this without resorting to the Pythagorean Theorem. We know \(\mathbf{Y} = \mathbf{\hat{Y}} + \mathbf{E}\). Thus,

\begin{align}
\mathbf{Y}^\prime\mathbf{Y} &= \left( \mathbf{\hat{Y}} + \mathbf{E} \right)^\prime \left( \mathbf{\hat{Y}} + \mathbf{E} \right) \\[1em]
&= \mathbf{\hat{Y}}^\prime\mathbf{\hat{Y}} + \mathbf{E}^\prime\mathbf{E} + \mathbf{\hat{Y}}^\prime \mathbf{E} + \mathbf{E}^\prime \mathbf{\hat{Y}} \\[1em]
&= \mathbf{\hat{Y}}^\prime\mathbf{\hat{Y}} + \mathbf{E}^\prime\mathbf{E} + \left(\mathbf{HY}\right)^\prime (\mathbf{I}-\mathbf{H})\mathbf{Y} + \left( (\mathbf{I}-\mathbf{H})\mathbf{Y}\right)^\prime \mathbf{H}\mathbf{Y} \\[1em]
&= \mathbf{\hat{Y}}^\prime\mathbf{\hat{Y}} + \mathbf{E}^\prime\mathbf{E} + \mathbf{Y}^\prime\mathbf{H}^\prime (\mathbf{I}-\mathbf{H})\mathbf{Y} + \mathbf{Y}^\prime(\mathbf{I}-\mathbf{H})^\prime \mathbf{H}\mathbf{Y}
\end{align}

Remember that \(\mathbf{H}\) and \(\mathbf{I}-\mathbf{H}\) are symmetric. That gives us

\begin{align}
\mathbf{Y}^\prime\mathbf{Y} &= \mathbf{\hat{Y}}^\prime\mathbf{\hat{Y}} + \mathbf{E}^\prime\mathbf{E} + \mathbf{Y}^\prime\mathbf{H}(\mathbf{I}-\mathbf{H})\mathbf{Y} + \mathbf{Y}^\prime(\mathbf{I}-\mathbf{H})\mathbf{H}\mathbf{Y}
\end{align}

    Finally, since \(\mathbf{H}(\mathbf{I}-\mathbf{H})=(\mathbf{I}-\mathbf{H})\mathbf{H}=\mathbf{0}\), we have

    \begin{align}
    \mathbf{Y}^\prime\mathbf{Y} &= \mathbf{\hat{Y}}^\prime\mathbf{\hat{Y}} + \mathbf{E}^\prime\mathbf{E} + \mathbf{Y}^\prime \mathbf{0} \mathbf{Y} + \mathbf{Y}^\prime \mathbf{0} \mathbf{Y} \\[1em]
    \mathbf{Y}^\prime\mathbf{Y} &= \mathbf{\hat{Y}}^\prime\mathbf{\hat{Y}} + \mathbf{E}^\prime\mathbf{E}
    \end{align}

    \(\blacksquare\)
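
The decomposition is straightforward to confirm numerically with the sketch above:

```python
print(np.isclose(y @ y, y_hat @ y_hat + e @ e))    # Y'Y = Yhat'Yhat + E'E
```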

    This will come in handy when we add probability distributions to our mathematics, thus creating statistics. Independence is important in determining the distributions of test statistics.

By the way, we can also show that the residuals and predicted values are uncorrelated by showing that their covariance is zero.

    Theorem \(\PageIndex{5}\)

    \begin{equation}
Cov[\mathbf{\hat{Y}},\mathbf{E}] = \mathbf{0}
    \end{equation}

    Proof.
    I will only give the first step to this proof. The rest will be up to you to figure out.

    \begin{align}
    Cov[\mathbf{\hat{Y}}, \mathbf{E}] &= Cov[\mathbf{H}\mathbf{Y}, (\mathbf{I}-\mathbf{H})\mathbf{Y}]
    \end{align}

    So, where to go from this?

    By the way, this result should not be surprising given that the prediction and residual vectors are orthogonal.
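
As a numerical illustration with the sketch above, the sample covariance of the realized prediction and residual vectors is zero. Note that this checks the sample statistic for one simulated data set, not the theoretical covariance in the theorem.

```python
print(np.isclose(np.cov(y_hat, e)[0, 1], 0.0))     # sample covariance is 0
```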

    Consequences

In this section, we started with the matrix equation \(\mathbf{Y} = \mathbf{X}\mathbf{B} + \mathbf{E}\) and obtained the OLS estimator of \(\mathbf{B}\). With that solution (and the requirement that \(\mathbf{X}\) have full column rank), we have another result.

    Theorem \(\PageIndex{6}\)

    \begin{equation}
    \mathbf{X}^\prime \mathbf{E} = \mathbf{0}
    \end{equation}

    Proof.
    Again, I will just start you off with this proof. Completing it is up to you.

    \begin{align}
    \mathbf{Y} &= \mathbf{X}\mathbf{B} + \mathbf{E}
    \end{align}

    Where from here?

    What does this theorem mean? Recall that \(\mathbf{X}^\prime \mathbf{E}\) is a \(p \times 1\) matrix. The first column of \(\mathbf{X}\) is a column of 1s. Thus, the first element of \(\mathbf{X}^\prime \mathbf{E}\) is just the sum of the residuals.

    • That means the residuals must sum to 0 when we use the OLS estimator.

    The other elements in the \(\mathbf{X}^\prime \mathbf{E}\) matrix consist of the sum of the residuals times the values of each independent variable.

• This means that, under OLS, the residuals are necessarily orthogonal to, and thus uncorrelated with, each of the independent variables. It is a consequence of the mathematics used.

    To see this in simple linear regression:

    \begin{align}
    \mathbf{X}^\prime \mathbf{E} &= \left[\begin{array}{ccccc}
    1 & 1 & 1 & \cdots & 1 \\
    x_1 & x_2 & x_3 & \cdots & x_n \\
    \end{array} \right] \left[\begin{array}{c}
    e_1 \\
    e_2 \\
    e_3 \\
    \vdots \\
    e_n \\
    \end{array} \right] \\
    &= \left[\begin{array}{c}
    \sum e_i \\[1ex]
    \sum x_i e_i \\
    \end{array} \right]
    \end{align}

    This matrix is \(\mathbf{0}\) only when all of its elements are also 0. Thus, we have \(\sum e_i = 0\); the sum of the residuals in OLS is mathematically guaranteed to be zero.

We also have \(\sum x_i e_i = 0\). Because \(\overline{e} = 0\), this is equivalent to \(\sum x_i e_i - n\overline{x}\,\overline{e} = 0\), and that left-hand side is \((n-1)\, Cov[x, e]\). In other words, the sample covariance between \(x\) and \(e\) is zero: the residuals arising from OLS estimation are necessarily uncorrelated with the predictor variable.
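
All of these consequences can be confirmed numerically with the sketch above:

```python
print(np.isclose(e.sum(), 0.0))                    # residuals sum to zero
print(np.allclose(X.T @ e, 0.0))                   # X'E = 0 (Theorem 6)
print(np.isclose(np.sum(x * e), 0.0))              # sum of x_i e_i is zero
```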

    Be Aware!

Again, these are mathematical results from applying ordinary least squares. They are guaranteed simply because of the estimation method we selected. Had we chosen a different definition of "best fit," then the results of this section might not hold.

    Everything follows from our chosen definition of "best fit."


    This page titled 4.2: Predictions and the Hat Matrix is shared under a CC BY-NC-SA 4.0 license and was authored, remixed, and/or curated by Ole Forsberg.
