4.1: Matrix Representation
In the previous chapter, we were introduced to the classical linear model and to estimating its parameters using ordinary least squares regression. All of that work was done using a scalar representation of the data. When moving beyond simple linear regression, the estimators are more difficult to determine. The calculus remains almost as simple, but solving the resulting system of equations becomes prohibitive.
The usual solution to solving a complicated system of equations is to use a matrix representation of the problem. That is what this chapter does. Along the way, we will discover more about linear models than we expected.
As in the previous chapter, let \(x\) and \(y\) be numeric variables. The linear relationship between \(x\) and \(y\) can be summarized by a line that "best" fits the observed data. That is, we can (and will) summarize the relationship between \(x\) and \(y\) using a linear equation:
\begin{equation}
y = \beta_0 + \beta_1 x
\end{equation}
The above holds in the case of simple linear regression (SLR). So, what do we do when there are more independent variables? Here is the representation with \(k\) independent variables:
\begin{equation}
y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \cdots + \beta_k x_k \label{eq:lm2b-lobf}
\end{equation}
Here, \(\beta_0\) is the y-intercept (still). And, \(\beta_i\) is the effect of variable \(x_i\) on the dependent variable, assuming all the other variables remain constant.
This is called the ceteris paribus assumption. If the independent variables are independent of each other, then this requirement is met. However, there is frequently some correlation among the independent variables. Read on to see what to do about this.
We say that the "line" given by equation \ref{eq:lm2b-lobf} best fits the observed data. However, when dealing with two independent variables, it is not a line but a plane; with three, a space; with four, a hyperplane; etc. Clearly, meaningfully representing an entire four-variable model is quite difficult.
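Before moving to matrices, here is a minimal sketch in R of what Equation \ref{eq:lm2b-lobf} computes for a single observation. The coefficient and predictor values below are made up purely for illustration.
beta = c(2.0, 0.5, -1.3, 0.8)         # hypothetical beta_0, beta_1, beta_2, beta_3
x    = c(1.0, 3.0, 2.0)               # hypothetical values of x_1, x_2, x_3
y    = beta[1] + sum( beta[-1] * x )  # y = beta_0 + beta_1 x_1 + beta_2 x_2 + beta_3 x_3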
Matrix Representation
We learned a lot about our solution by exploring the scalar representation of the system of equations in the previous chapter. We may be able to gain some additional insights by exploring its matrix representation.
It may be helpful to re-familiarize yourself with Appendix M: The Appendix of Matrices at this point.
And so, let us begin with our matrix model.
\begin{equation}
\mathbf{Y} = \mathbf{X}\mathbf{B} + \mathbf{E}
\end{equation}
In this model, \(\mathbf{Y}\) represents the response variable; \(\mathbf{X}\), the predictor variable(s) prepended with a column of 1s; \(\mathbf{B}\), the coefficient vector; and \(\mathbf{E}\), the residuals. The dimensions are \(n \times 1\) for \(\mathbf{Y}\), \(n \times p\) for \(\mathbf{X}\), \(p \times 1\) for \(\mathbf{B}\), and \(n \times 1\) for \(\mathbf{E}\).
Thus, in the case of simple linear regression, the matrices are
\begin{equation*}
\mathbf{Y} = \left[ \begin{array}{c} y_1 \\ y_2 \\ y_3 \\ \vdots \\ y_n \end{array} \right] \qquad
\mathbf{X} = \left[ \begin{array}{cc} 1 & x_1 \\ 1 & x_2 \\ 1 & x_3 \\ \vdots & \vdots \\ 1 & x_n \end{array} \right] \qquad
\mathbf{B} = \left[ \begin{array}{c} \beta_0 \\ \beta_1 \end{array} \right] \qquad
\mathbf{E} = \left[ \begin{array}{c} \varepsilon_1 \\ \varepsilon_2 \\ \varepsilon_3 \\ \vdots \\ \varepsilon_n \end{array} \right]
\end{equation*}
In this formulation, \(n\) is the sample size and \(p=2\) is the number of parameters that need to be estimated. In general, \(p\) is one more than the number of independent variables, \(k\); here, \(k=1\).
Note that \(\mathbf{X}\) is a matrix of "the predictor variable(s) prepended with a column of 1s."
Question:
What does this mean? And why are those 1s needed? (Both questions are answered below, when we revisit the Toy Example.)
Let our independent variable be the same as in the Toy Example in Section 3.1. Since \(x = \{ -2, 0, 0, 2\}\), the corresponding \(\mathbf{X}\) matrix is
\begin{equation}
\mathbf{X} = \left[ \begin{array}{lr} 1 & -2 \\ 1 & 0 \\ 1 & 0 \\ 1 & 2 \end{array} \right]
\end{equation}
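As a quick sketch in base R, this matrix can be built by binding the column of 1s to the observed \(x\) values:
x = c(-2, 0, 0, 2)   # the independent variable from the Toy Example
X = cbind( 1, x )    # prepend the column of 1s; X is 4 x 2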
Again, we want to minimize the sum of squared errors. Again, we will create the objective function Q, take its derivative with respect to the parameter vector, \(\mathbf{B}\), and solve:
\begin{align}
Q &= \mathbf{E}^\prime \mathbf{E} \\[1em]
&= \left(\mathbf{Y} - \mathbf{X}\mathbf{B}\right)^\prime \left(\mathbf{Y}-\mathbf{X}\mathbf{B}\right) \\[1em]
&= \mathbf{Y}^\prime \mathbf{Y} - \mathbf{B}^\prime \mathbf{X}^\prime \mathbf{Y} - \mathbf{Y}^\prime \mathbf{X}\mathbf{B} + \mathbf{B}^\prime \mathbf{X}^\prime \mathbf{X}\mathbf{B}
\end{align}
Note that each of these terms is a \(1 \times 1\) matrix, thus each is equal to its transpose. Using that on the third term and gathering the two like terms together gives our objective function.
\begin{align}
Q &= \mathbf{Y}^\prime \mathbf{Y} - 2\mathbf{B}^\prime \mathbf{X}^\prime \mathbf{Y} + \mathbf{B}^\prime \mathbf{X}^\prime \mathbf{X}\mathbf{B}
\end{align}
Now, taking the derivative with respect to \(\mathbf{B}\) and simplifying gives
\begin{align}
\frac{d}{d\mathbf{B}} Q &= -2 \mathbf{X}^\prime \mathbf{Y} + 2\mathbf{X}^\prime \mathbf{X}\mathbf{B} \hspace{4em} \\[1em]
\mathbf{0} &\stackrel{\text{set}}{=} -\mathbf{X}^\prime \mathbf{Y} + \mathbf{X}^\prime \mathbf{X}\mathbf{b} \\[1em]
\mathbf{X}^\prime \mathbf{Y} &= \mathbf{X}^\prime \mathbf{X}\mathbf{b} \\[1em]
\left( \mathbf{X}^\prime \mathbf{X}\right)^{-1} \mathbf{X}^\prime \mathbf{Y} &= \mathbf{b}
\end{align}
This result is so important that I will repeat it here:
\begin{equation}
\mathbf{b} = \left( \mathbf{X}^\prime \mathbf{X}\right)^{-1} \mathbf{X}^\prime \mathbf{Y} \label{eq:lm2b-olsMatrixEquation}
\end{equation}
Note the switch between \(\mathbf{B}\) and \(\mathbf{b}\). The former concerns the population. It is a population parameter that we are trying to estimate.
\begin{equation}
\mathbf{B} = \left[ \begin{array}{l} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_{p-1} \end{array} \right]
\end{equation}
The latter concerns the sample. It is the estimator we are using to estimate the population parameter.
\begin{equation}
\mathbf{b} = \left[ \begin{array}{l} b_0 \\ b_1 \\ \vdots \\ b_{p-1} \end{array} \right]
\end{equation}
With this, the equation for our OLS regression line (plane, space, hyperplane, etc.) is
\begin{equation}
\hat{\mathbf{Y}} = \mathbf{X}\mathbf{b}
\end{equation}
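As a sketch, Formula \(\ref{eq:lm2b-olsMatrixEquation}\) and the fitted values translate almost symbol-for-symbol into R, assuming numeric matrices X and Y have already been built (as we do in the Toy Example below). The helper name ols is mine, not a standard function.
ols = function(X, Y) solve( t(X) %*% X ) %*% t(X) %*% Y  # b = (X'X)^{-1} X'Y
b    = ols(X, Y)   # the estimated coefficient vector
Yhat = X %*% b     # the fitted values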
Requirement
In performing these calculations, we made one assumption: \(\left( \mathbf{X}^\prime \mathbf{X}\right)^{-1}\) exists. If it does not exist, then the last step in the process cannot be done. So, the first question to ask is:
When does \(\left( \mathbf{X}^\prime \mathbf{X}\right)^{-1}\) not exist?
Ans. It does not exist when \(\det\left(\mathbf{X}^\prime \mathbf{X}\right) = 0\).
So, when does \(\det\left(\mathbf{X}^\prime \mathbf{X}\right) = 0\)?
Ans. From linear algebra (and The Appendix of Matrices), we know that this determinant is zero when the \(\mathbf{X}\) matrix is not of full (column) rank; that is, when \(\operatorname{rank}(\mathbf{X}) \ne p\). This happens when at least one column of \(\mathbf{X}\) is a linear combination of the other columns. A statistician would say that there is redundant information in \(\mathbf{X}\): one variable can be determined from the others.
When in the realm of multiple regression (more than one independent variable), this happens when one variable is a linear combination of the others. This condition is called "perfect multicollinearity" or simply "multicollinearity," depending on the source.
Why didn't we have to worry about multicollinearity in the SLR case?
Ans. We did!
When in the realm of simple linear regression (SLR), multicollinearity happens when there is no variation in the \(x\) variable (i.e., it is a constant multiple of the column of 1s).
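Here is a quick numerical sketch of this failure in R; the constant value 5 is arbitrary, as any constant \(x\) produces the same problem.
x = c(5, 5, 5, 5)    # no variation: x is 5 times the column of 1s
X = cbind( 1, x )
qr(X)$rank           # 1, not p = 2, so X is not of full column rank
det( t(X) %*% X )    # 0, so the inverse of X'X does not exist
# solve( t(X) %*% X )  would stop with a singularity error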
Assumption/s
Before we continue, as before, let us make the three assumptions about our residuals. These are just the same non-parametric assumptions we made back in Section 3.3, but in matrix form.
- The first is that they are realizations of a random variable (\(\mathbf{E}\) has a distribution).
- The second is that the expected value of the residuals is zero, \(E[\mathbf{E}]=\mathbf{0}\) (the measurements are not systematically biased).
- The third is that the residuals are independent and have a finite constant variance, \(V[\mathbf{E}] = \sigma^2\mathbf{I}\), with \(\sigma^2 < \infty\).
In fact, let us go one step further and make the slightly stronger assumption that the residuals are normally distributed:
\begin{equation}
\mathbf{E} \sim N\left(\mathbf{0},\ \sigma^2\mathbf{I}\right)
\end{equation}
Here, \(\sigma^2 < \infty\).
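Because \(\sigma^2\mathbf{I}\) is diagonal, this assumption says the residuals are independent draws with a common variance. One realization of \(\mathbf{E}\) can be sketched in R as follows; the values of n and sigma are arbitrary choices.
n = 4; sigma = 0.5                 # arbitrary choices for illustration
E = rnorm( n, mean=0, sd=sigma )   # E ~ N(0, sigma^2 I): n independent normal draws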
Results
Again, we have several results from this simple assumption.
\(E[\mathbf{Y}] = \mathbf{X}\mathbf{B}\)
Proof.
The proof of this proceeds from simple algebra.
\begin{align}
E[\mathbf{Y}] &= E[ \mathbf{X}\mathbf{B} + \mathbf{E} ] \\[1em]
&= E[ \mathbf{X}\mathbf{B} ] + E[\mathbf{E} ]
\end{align}
One pervasive requirement is that the values of \(\mathbf{X}\) are not random variables. That is, the researcher selected those particular \(x\) values. Since this is true,
\begin{align}
E[\mathbf{Y}] &= \mathbf{X}\ E[ \mathbf{B} ] + E[\mathbf{E}]
\end{align}
Also, the values in the \(\mathbf{B}\) matrix are population parameters. They, too, are not random variables. In fact, the only random variable on the right-hand side of that matrix equation is the zero-mean \(\mathbf{E}\) matrix. Thus, we have
\begin{align}
E[\mathbf{Y}] &= \mathbf{X} \mathbf{B} + E[\mathbf{E}] \\[1em]
&= \mathbf{X} \mathbf{B}
\end{align}
\( \blacksquare \)
The requirement that the independent variables are not random allows us to easily calculate expected values, variances, and covariances. When designing experiments, this assumption is not problematic.
When working with observational data, this becomes troublesome in terms of the mathematics. It also becomes troublesome in terms of the variances of \(\mathbf{Y}\) and of \(\mathbf{b}\): the confidence intervals for the estimates are wider than those calculated here. Also, if the variability in the \(\mathbf{X}\) values is not independent, even more difficulties arise.
If any of this interests you, please look into errors-in-variables models (among other topics).
Similarly, it is quite easy to prove \(V[\mathbf{Y}\ |\ \mathbf{XB}] = \sigma^2 \mathbf{I}\). I leave that to you as an exercise.
Another result is that the OLS estimators are unbiased (i.e., their expected values equal the population parameters):
The OLS estimator \(\mathbf{b}\) is unbiased for \(\mathbf{B}\).
Proof.
An estimator is unbiased for the parameter if its expected value equals the parameter. Thus, we need only show \(E[\mathbf{b}] = \mathbf{B}\).
\begin{align}
E[\mathbf{b}] &= E[ \left( \mathbf{X}^\prime \mathbf{X}\right)^{-1} \mathbf{X}^\prime \mathbf{Y}] \\[1em]
&= \left( \mathbf{X}^\prime \mathbf{X}\right)^{-1} \mathbf{X}^\prime E[\mathbf{Y}] \\[1em]
&= \left( \mathbf{X}^\prime \mathbf{X}\right)^{-1} \mathbf{X}^\prime \mathbf{X}\mathbf{B} \\[1em]
&= \mathbf{B}
\end{align}
\( \blacksquare \)
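The proof is exact, but a small simulation can make it tangible. The following sketch repeatedly generates \(\mathbf{Y}\) from the Toy Example design and averages the resulting estimates; the seed, the "true" \(\mathbf{B}\), and \(\sigma\) are all arbitrary choices.
set.seed(42)                         # arbitrary seed, for reproducibility
X = cbind( 1, c(-2, 0, 0, 2) )       # the Toy Example design matrix
B = c(1, -1); sigma = 0.5            # hypothetical population values
sims = replicate( 10000, {
  Y = X %*% B + rnorm( 4, 0, sigma ) # one sample from the model
  as.vector( solve( t(X) %*% X ) %*% t(X) %*% Y )
} )
rowMeans( sims )                     # approximately (1, -1), i.e., B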
A third result is that the two estimators are not necessarily independent.
The OLS estimators \(b_0\) and \(b_1\) are not necessarily independent.
Proof.
To see this, we calculate the covariance matrix of \(\mathbf{b}\) and look at the value corresponding to the covariance between \(b_0\) and \(b_1\).
\begin{align}
V[\mathbf{b}] &= V\left[ \left( \mathbf{X}^\prime \mathbf{X}\right)^{-1} \mathbf{X}^\prime \mathbf{Y}\right] \\[1em]
&= \big(\left( \mathbf{X}^\prime \mathbf{X}\right)^{-1} \mathbf{X}^\prime\big)\ V[\mathbf{Y}] \left( \left( \mathbf{X}^\prime \mathbf{X}\right)^{-1} \mathbf{X}^\prime \right) ^\prime \\[1em]
&= \big( \left( \mathbf{X}^\prime \mathbf{X}\right)^{-1} \mathbf{X}^\prime\big)\ \sigma^2 \mathbf{I}\ \left( \left( \mathbf{X}^\prime \mathbf{X}\right)^{-1} \mathbf{X}^\prime \right) ^\prime \\[1em]
&= \big( \left( \mathbf{X}^\prime \mathbf{X}\right)^{-1} \mathbf{X}^\prime\big)\ \sigma^2\ \left( \left( \mathbf{X}^\prime \mathbf{X}\right)^{-1} \mathbf{X}^\prime \right) ^\prime \\[1em]
&= \sigma^2\ \big( \left( \mathbf{X}^\prime \mathbf{X}\right)^{-1} \mathbf{X}^\prime \big) \left( \left( \mathbf{X}^\prime \mathbf{X}\right)^{-1} \mathbf{X}^\prime \right) ^\prime \\[1em]
&= \sigma^2\ \left( \mathbf{X}^\prime \mathbf{X}\right)^{-1} \mathbf{X}^\prime\ \mathbf{X} \left( \mathbf{X}^\prime \mathbf{X}\right)^{-1} \\[1em]
&= \sigma^2\ \left( \mathbf{X}^\prime \mathbf{X}\right)^{-1}
\end{align}
If this matrix is diagonal, then the estimators are independent.
To see that the two estimators are linearly correlated (i.e., not independent), we just need to calculate the matrix \(\left( \mathbf{X}^\prime \mathbf{X}\right)^{-1}\).
Note that, in general, this is difficult to do by hand. However, if we restrict ourselves to simple linear regression, that inverse is rather straightforward to calculate. So, let's see the correlation in the SLR case:
\begin{equation}
\mathbf{X} = \left[\begin{array}{cc}
1 & x_1 \\
1 & x_2 \\
1 & x_3 \\
\vdots & \vdots \\
1 & x_n \\
\end{array} \right]
\end{equation}
With that, we have
\begin{align}
\mathbf{X}^\prime \mathbf{X} &= \left[
\begin{array}{cc}
n & n\bar{x} \\[1ex]
n \bar{x} & \sum_{i=1}^n x_i^2
\end{array}
\right]
\end{align}
The determinant of this \(\mathbf{X}^\prime \mathbf{X}\) is
\begin{align}
\det{ \mathbf{X}^\prime \mathbf{X} } &= n \sum_{i=1}^n x_i^2 - n^2 \bar{x}^2 = n\, S_{xx}\\
\end{align}
Recall that \(S_{xx} = \sum_{i=1}^n \left(x_i - \bar{x}\right)^2 = \sum_{i=1}^n x_i^2 - n\bar{x}^2\). Thus, its inverse is
\begin{align}
\left( \mathbf{X}^\prime \mathbf{X}\right)^{-1} &= \frac{1}{ n\, S_{xx} }\ \left[
\begin{array}{cc}
\sum_{i=1}^n x_i^2 & -n\bar{x} \\[1ex]
-n \bar{x} & n
\end{array}
\right]
\end{align}
Finally, we have the covariance matrix:
\begin{align}
V[\mathbf{b}] &= \frac{\sigma^2}{ n\, S_{xx} }\ \left[
\begin{array}{cc}
\sum_{i=1}^n x_i^2 & -n\bar{x} \\[1ex]
-n \bar{x} & n
\end{array}
\right]
\end{align}
From this matrix, we see that the covariance between \(b_0\) and \(b_1\) is
\begin{equation}
Cov[b_0,b_1] = -n\bar{x}\ \frac{\sigma^2}{ n\, S_{xx} } = -\sigma^2\frac{\bar{x}}{S_{xx}}
\end{equation}
Thus, the OLS estimators are independent if and only if \(\bar{x}=0\).
\( \blacksquare \)
As an extension, note that the sign of the covariance is the opposite of the sign of \(\bar{x}\).
Finally, while this last result may seem only slightly interesting, it is the basis of the Working-Hotelling (1929) procedure, which we will see later in Section 5.4.
This last result also suggests why many disciplines tend to center their x-values (subtract off \(\bar{x}\) from all of the \(x\) values) before doing regression. It ensures that the two estimators are independent.
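Here is a short sketch of that effect in R; the uncentered \(x\) values are arbitrary.
x  = c(1, 2, 4, 9)                   # arbitrary, uncentered x values
xc = x - mean(x)                     # the centered version
t( cbind(1, x)  ) %*% cbind(1, x)    # off-diagonals are n*xbar = 16: correlated estimators
t( cbind(1, xc) ) %*% cbind(1, xc)   # off-diagonals are 0: uncorrelated estimators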
Let us revisit the Toy Example and show how to use the matrix representation to solve the same problem.
Solution.
The first step is to create the two matrices. The dependent variable matrix is
\begin{equation}
\mathbf{Y} = \left[\begin{matrix} \phantom{-}3 \\ \phantom{-}0 \\ \phantom{-}2 \\ -1 \\ \end{matrix}\right]
\end{equation}
The independent variable matrix, also called the "data matrix" and the "design matrix," is
\begin{equation}
\mathbf{X} = \left[\begin{matrix}
1 & -2 \\ 1 & \phantom{-}0 \\ 1 & \phantom{-}0 \\ 1 & \phantom{-}2 \\
\end{matrix}\right]
\end{equation}
Where did the column of 1s come from in \(\mathbf{X}\)? Remember that the matrix equation is
\begin{equation}
\mathbf{Y} = \mathbf{XB} + \mathbf{E}
\end{equation}
and that this is equivalent (in simple linear regression) to
\begin{equation}
y_i = \beta_0\ 1 + \beta_1\ x_i + \varepsilon_i
\end{equation}
The 1s column in \(\mathbf{X}\) is the multiplier of the \(\beta_0\) in the \(\mathbf{B}\) matrix. As long as you have a \(\beta_0\) in your model, you need that column of 1s.
Now that we have the two matrices, we can calculate \(\mathbf{b}\).
\begin{align}
\mathbf{b} &= \left( \mathbf{X}^\prime \mathbf{X}\right)^{-1} \mathbf{X}^\prime \mathbf{Y} \\[1em]
&= \left( \left[\begin{matrix}
1 & -2 \\ 1 & \phantom{-}0 \\ 1 & \phantom{-}0 \\ 1 & \phantom{-}2 \\
\end{matrix}\right]^\prime \left[\begin{matrix}
1 & -2 \\ 1 & \phantom{-}0 \\ 1 & \phantom{-}0 \\ 1 & \phantom{-}2 \\
\end{matrix}\right]\right)^{-1} \left[\begin{matrix}
1 & -2 \\ 1 & \phantom{-}0 \\ 1 & \phantom{-}0 \\ 1 & \phantom{-}2 \\
\end{matrix}\right]^\prime\left[\begin{matrix} \phantom{-}3 \\ \phantom{-}0 \\ \phantom{-}2 \\ -1 \\ \end{matrix}\right] \\
\left[\begin{matrix}
1 & 1 & 1 & 1 \\ -2 & 0 & 0 & 2 \\ \end{matrix}\right] \left[\begin{matrix}
1 & -2 \\ 1 & \phantom{-}0 \\ 1 & \phantom{-}0 \\ 1 & \phantom{-}2 \\
\end{matrix}\right] &= \left[\begin{matrix} 4 & 0 \\ 0 & 8 \end{matrix}\right] \\
\Rightarrow \qquad \qquad \left( \mathbf{X}^\prime \mathbf{X}\right)^{-1} &= \frac{1}{32} \left[\begin{matrix} 8 & 0 \\ 0 & 4 \end{matrix}\right] \\
%
%
\mathbf{X}^\prime \mathbf{Y} &= \left[\begin{matrix}
\phantom{-}1 & 1 & 1 & 1 \\ -2 & 0 & 0 & 2 \\ \end{matrix}\right] \left[\begin{matrix} \phantom{-}3 \\ \phantom{-}0 \\ \phantom{-}2 \\ -1 \\ \end{matrix}\right] \\
&= \left[\begin{matrix} \phantom{-}4 \\ -8 \end{matrix}\right]
\end{align}
Thus, we have
\begin{align}
\mathbf{b} &= \left( \mathbf{X}^\prime \mathbf{X}\right)^{-1} \mathbf{X}^\prime \mathbf{Y} \\[1em]
&= \frac{1}{32} \left[\begin{matrix} 8 & 0 \\ 0 & 4 \end{matrix}\right] \left[\begin{matrix} \phantom{-}4 \\ -8 \end{matrix}\right] \\[1em]
&= \frac{1}{32} \left[\begin{matrix} \phantom{-}32 \\ -32 \end{matrix}\right] \\[1em]
\text{and finally,} \qquad \mathbf{b}\ =\ \left[\begin{matrix} b_0 \\ b_1 \end{matrix} \right] &= \left[\begin{matrix} \phantom{-}1 \\ -1 \end{matrix} \right]
\end{align}
From all of this, we have \(b_0 = 1\) and \(b_1 = -1\).
The conclusion is exactly the same, \(\hat{y} = 1 - x\); only the process is different. This process is much easier for computers to perform, since they can do matrix multiplication (and inversion) with little problem, while we have to spend a lot of extra effort to perform those operations by hand. Here it is in R:
y = matrix( c(3, 0, 2, -1), ncol=1 )        # the response vector, Y
x = matrix( c(1,1,1,1, -2,0,0,2), ncol=2 )  # the design matrix, X
solve( t(x) %*% x ) %*% t(x) %*% y          # b = (X'X)^{-1} X'Y
Also, if we had more than one independent variable, we would need to derive the OLS estimator equations all over again; the scalar ones from the last chapter hold only for one independent variable. Using matrices, however, Formula \(\ref{eq:lm2b-olsMatrixEquation}\) holds for any number of independent variables.
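As a closing sketch, here is Formula \(\ref{eq:lm2b-olsMatrixEquation}\) applied to two independent variables. The data are simulated purely for illustration, and R's built-in lm() is shown only as a check on the matrix arithmetic.
set.seed(1)                          # arbitrary seed
n  = 20
x1 = rnorm(n); x2 = rnorm(n)         # two simulated predictors
y  = 2 + 3*x1 - 1*x2 + rnorm(n)      # simulated responses
X  = cbind( 1, x1, x2 )              # design matrix with the column of 1s
solve( t(X) %*% X ) %*% t(X) %*% y   # b from the matrix formula
coef( lm(y ~ x1 + x2) )              # the same estimates from lm()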


