5.1: Probability Distributions
Alright, we've set the stage with our model. We made a few assumptions. Now, let's see what we can actually say about the numbers we calculate from our data — the estimators. This section is all about understanding where these numbers "live" and how they behave. Since our model's errors (residuals) are Normally distributed, it turns out most of our key estimators — like the slope \(b_1\), the intercept \(b_0\), and our predictions \(\hat{Y}\) — also follow Normal distributions. We'll walk through why that's true and pin down exactly what their means and variances are.
We'll also look at the difference between predicting an average outcome versus a brand new single observation. Finally, we'll check out the distribution of our estimate for the error variance, the MSE. These results might seem a bit math-heavy now, but they're the secret sauce that lets us build confidence intervals and run hypothesis tests later on. So let's dive in and get to know our estimators a little better. King Rudolph would want us to.
The Point Estimators
From this one assumption, and the math from the previous chapter, we have many consequences. This section explores the model and provides some results regarding the distribution of our estimators. The next sections build on this.
The distribution of \(Y\), conditional on the value of \(x\), is
\begin{equation}
Y\ |\ x \stackrel{\text{ind}}{\sim} N \left( \beta_0 + \beta_1 x,\ \sigma^2\right)
\end{equation}
Proof.
We are given \(Y = \beta_0 + \beta_1 x + \varepsilon\), with the only random variable on the RHS being \(\varepsilon\). Since \(\varepsilon\) follows a Normal distribution, so too does \(Y\) (see Theorem: The Sum of Normals).
Next, since the Normal distribution has two parameters, the mean and the variance, we need to determine those two values:
The expected value of \(Y\ |\ x\) is
\begin{align}
E[Y\ |\ x] &= E[\beta_0 + \beta_1 x + \varepsilon\ |\ x] \\[1em]
&= E[\beta_0\ |\ x] + E[\beta_1 x\ |\ x] + E[\varepsilon\ |\ x] \\[1em]
&= \beta_0 + \beta_1 x + 0 \\[1em]
&= \beta_0 + \beta_1 x
\end{align}
The variance of \(Y\), conditional on \(x\), is
\begin{align}
V[Y\ |\ x] &= V[\beta_0 + \beta_1 x + \varepsilon\ |\ x] \\[1em]
&= V[\beta_0\ |\ x] + V[\beta_1 x\ |\ x] + V[\varepsilon\ |\ x] \\[1em]
&= 0 + 0 + V[\varepsilon] \\[1em]
&= \sigma^2
\end{align}
Thus, putting this together, we have our final result
\begin{equation}
Y\ |\ x \stackrel{\text{ind}}{\sim} N \left( \beta_0 + \beta_1 x,\ \sigma^2\right)
\end{equation}
\(\blacksquare\)
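To make this concrete, here is a minimal NumPy sketch that simulates \(Y\) at a single fixed \(x\) and checks the sample mean and variance against \(\beta_0 + \beta_1 x\) and \(\sigma^2\). The parameter values (\(\beta_0 = 2\), \(\beta_1 = 0.5\), \(\sigma = 1.5\), \(x = 4\)) are made up purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative (made-up) parameter values
beta0, beta1, sigma = 2.0, 0.5, 1.5
x = 4.0            # a fixed value of the predictor
reps = 100_000     # number of simulated observations of Y | x

# Generate Y = beta0 + beta1*x + epsilon, with epsilon ~ N(0, sigma^2)
eps = rng.normal(0.0, sigma, size=reps)
y = beta0 + beta1 * x + eps

# The sample mean and variance should be close to beta0 + beta1*x and sigma^2
print(y.mean(), beta0 + beta1 * x)   # ~ 4.0
print(y.var(ddof=1), sigma**2)       # ~ 2.25
```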
Note that the \(Y\) are only "independently distributed" and not "independent and identically distributed." This is because the expected value of \(Y\) depends on the value of \(x\). Since the \(Y\) do not all have the same (identical) distribution, they are only "independently distributed."
The result of the theorem above may not, on its own, seem too interesting. However, since our estimators depend on the \(Y_i\), their distributions follow from it. And that is where the interest arises.
We see this in the next theorem.
The distribution of \(b_1\) is \(N \left( \beta_1,\ \sigma^2 \frac{\displaystyle 1}{\displaystyle S_{xx}} \right)\).
Proof.
Before we start, we need to note that \(b_1\) can be written as a linear combination of the \(Y_i\):
\begin{equation}
b_1 = \frac{\displaystyle \sum_{i=1}^n (x_i - \bar{x})\ Y_i }{\displaystyle \sum_{i=1}^n (x_i-\bar{x})^2}
\end{equation}
I leave the proof of this as an exercise.
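If you would like a quick numerical sanity check of that identity before proving it, the following sketch (with made-up data) computes the slope both ways — the usual \(S_{xy}/S_{xx}\) formula and the linear-combination form — and the two agree to floating-point precision.

```python
import numpy as np

rng = np.random.default_rng(1)

# Made-up data purely for illustration
x = rng.uniform(0, 10, size=25)
y = 2.0 + 0.5 * x + rng.normal(0, 1.5, size=25)

xbar = x.mean()
Sxx = np.sum((x - xbar) ** 2)

# Usual OLS slope: Sxy / Sxx
b1_usual = np.sum((x - xbar) * (y - y.mean())) / Sxx

# Slope written as a linear combination of the Y_i
b1_linear_comb = np.sum((x - xbar) * y) / Sxx

print(b1_usual, b1_linear_comb)   # identical up to floating-point error
```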
Now, since our \(b_1\) is a linear combination of the \(Y_i\), and since the \(Y_i\) come from independent Normal distributions, we have that \(b_1\) also follows a Normal distribution (again, see Theorem: The Sum of Normals).
Again, since the Normal distribution has two parameters, the mean and the variance, we need to find those two values, as we do next.
The expected value of \(b_1\) is
\begin{align}
E[b_1] &= E\left[\frac{\displaystyle \sum_{i=1}^n (x_i - \bar{x})\ Y_i }{\displaystyle \sum_{i=1}^n (x_i-\bar{x})^2}\right] \\[2em]
&= \frac{\sum_{i=1}^n (x_i - \bar{x})\ E[Y_i] }{\sum_{i=1}^n (x_i-\bar{x})^2} \\[1em]
&= \frac{\sum_{i=1}^n (x_i - \bar{x})\ (\beta_0 + \beta_1 x_i) }{\sum_{i=1}^n (x_i-\bar{x})^2} \\[1em]
&= \frac{\sum_{i=1}^n (x_i - \bar{x})\ \beta_0 }{\sum_{i=1}^n (x_i-\bar{x})^2} + \frac{\sum_{i=1}^n (x_i - \bar{x})\ \beta_1 x_i }{\sum_{i=1}^n (x_i-\bar{x})^2} \\[1em]
&= \beta_0 \frac{\sum_{i=1}^n (x_i - \bar{x}) }{\sum_{i=1}^n (x_i-\bar{x})^2} + \beta_1 \frac{\sum_{i=1}^n (x_i - \bar{x})\ x_i }{\sum_{i=1}^n (x_i-\bar{x})^2} \\[1em]
&= \beta_0 \frac{ 0 }{\sum_{i=1}^n (x_i-\bar{x})^2} + \beta_1 \frac{\sum_{i=1}^n (x_i - \bar{x})(x_i-\bar{x}) }{\sum_{i=1}^n (x_i-\bar{x})^2} \\[1em]
&= 0 + \beta_1 \frac{\sum_{i=1}^n (x_i-\bar{x})^2}{\sum_{i=1}^n (x_i-\bar{x})^2} \\[1em]
&= \beta_1
\end{align}
In this sequence, note that (and be able to prove that):
\begin{equation}
\sum_{i=1}^n (x_i - \bar{x}) = 0
\end{equation}
and that
\begin{equation}
\sum_{i=1}^n (x_i - \bar{x})^2 = \sum_{i=1}^n (x_i - \bar{x})\ x_i
\end{equation}
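For reference, both identities follow from a line or two of algebra (one possible route):
\begin{align}
\sum_{i=1}^n (x_i - \bar{x}) &= \sum_{i=1}^n x_i - n\bar{x} = n\bar{x} - n\bar{x} = 0 \\[1em]
\sum_{i=1}^n (x_i - \bar{x})\ x_i &= \sum_{i=1}^n (x_i - \bar{x})\ x_i - \bar{x}\sum_{i=1}^n (x_i - \bar{x}) = \sum_{i=1}^n (x_i - \bar{x})(x_i - \bar{x}) = \sum_{i=1}^n (x_i - \bar{x})^2
\end{align}
(The second line subtracts \(\bar{x}\sum_{i=1}^n (x_i - \bar{x})\), which is zero by the first line.)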
Thus, we know \(E[b_1] = \beta_1\); that is, our estimator is unbiased.
The final part is to determine the variance of \(b_1\):
\begin{align}
V[b_1] &= V\left[\frac{\displaystyle \sum_{i=1}^n (x_i - \bar{x})\ Y_i }{\displaystyle \sum_{i=1}^n (x_i-\bar{x})^2}\right] \\[2em]
&= \frac{\sum_{i=1}^n (x_i - \bar{x})^2\ V[Y_i] }{\left(\sum_{i=1}^n (x_i-\bar{x})^2\right)^2} \\[1em]
&= \frac{\sum_{i=1}^n (x_i - \bar{x})^2 }{\left(\sum_{i=1}^n (x_i-\bar{x})^2\right)^2}\ \sigma^2 \\[1em]
&= \frac{1 }{ \sum_{i=1}^n (x_i-\bar{x})^2}\ \sigma^2 \\[1em]
&= \sigma^2 \frac{1}{S_{xx}}
\end{align}
Recall that since we will be coming across the quantity \(\sum_{i=1}^n (x_i-\bar{x})^2\) many, many, many times, we denote it by \(S_{xx}\). And so, putting all of these parts together gives us
\begin{equation}
b_1 \sim N\left( \beta_1,\ \sigma^2 \frac{\displaystyle 1}{\displaystyle S_{xx}} \right)
\end{equation}
\(\blacksquare\)
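A Monte Carlo check of this result may help it feel less abstract. The sketch below repeatedly regenerates the \(Y_i\) from a fixed design and recomputes \(b_1\); the empirical mean and variance of the \(b_1\) draws should land close to \(\beta_1\) and \(\sigma^2 / S_{xx}\). All parameter values are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)

# Illustrative (made-up) values
beta0, beta1, sigma, n = 2.0, 0.5, 1.5, 30
x = np.linspace(0, 10, n)          # fixed design, reused in every replicate
Sxx = np.sum((x - x.mean()) ** 2)

reps = 20_000
b1_draws = np.empty(reps)
for r in range(reps):
    y = beta0 + beta1 * x + rng.normal(0, sigma, size=n)
    b1_draws[r] = np.sum((x - x.mean()) * y) / Sxx

print(b1_draws.mean(), beta1)                # mean ~ beta1
print(b1_draws.var(ddof=1), sigma**2 / Sxx)  # variance ~ sigma^2 / Sxx
```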
The covariance between our \(b_1\) estimator and \(\overline{Y}\) is 0.
Proof.
I leave this as an exercise.
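(If you want to check your answer, one route writes \(b_1\) as the linear combination \(\sum_i k_i Y_i\) with \(k_i = (x_i - \bar{x})/S_{xx}\), and then uses the independence of the \(Y_i\):)
\begin{align}
Cov\left[\overline{Y},\ b_1\right] &= Cov\left[\frac{1}{n}\sum_{i=1}^n Y_i,\ \sum_{j=1}^n k_j Y_j\right] = \sum_{i=1}^n \frac{1}{n}\ k_i\ V[Y_i] = \frac{\sigma^2}{n\ S_{xx}} \sum_{i=1}^n (x_i - \bar{x}) = 0
\end{align}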
The distribution of \(b_0\) is \(N \Bigg( \beta_0,\ \sigma^2\left(\frac{\displaystyle 1}{\displaystyle n} + \frac{\displaystyle \bar{x}^2}{\displaystyle S_{xx}}\right) \Bigg)\).
Proof.
Remember that our estimator is
\begin{equation}
b_0 = \overline{Y} - b_1 \bar{x}
\end{equation}
Since we have previously shown \(Cov[\overline{Y},\ b_1]=0\), the proof is straightforward.
First, we note that \(b_0\) is a linear combination of the \(Y_i\). Thus, it follows a Normal distribution. (Again, see Theorem: The Sum of Normals for a proof of this.) Because the Normal distribution has two parameters, we must find formulas for each:
Expected value:
\begin{align}
E[b_0] &= E[\overline{Y} - b_1 \bar{x}] \\[1em]
&= E[\overline{Y}] - E[b_1 \bar{x}] \\[1em]
&= \left( \beta_0 + \beta_1 \bar{x} \right) - \beta_1 \bar{x} \\[1em]
&= \beta_0
\end{align}
Variance:
\begin{align}
V[b_0] &= V[\overline{Y} - b_1 \bar{x}] \\[1em]
&= V[\overline{Y}] + V[b_1 \bar{x}] - 2\; Cov[\overline{Y},\ b_1\bar{x}] \\[1em]
&= V[\overline{Y}] + V[b_1]\ \bar{x}^2 - 2\ \bar{x}\; Cov[\overline{Y},\ b_1] \\[1em]
&= \frac{\sigma^2}{n} + \frac{\sigma^2}{S_{xx}}\bar{x}^2 - 2\ \bar{x}\cdot 0 \\[2em]
V[b_0] &= \sigma^2 \left(\frac{\displaystyle 1}{\displaystyle n} + \frac{\displaystyle \bar{x}^2}{\displaystyle S_{xx}}\right)
\end{align}
Finally, putting these three parts together gives us what we want:
\begin{equation}
b_0 \sim N \Bigg( \beta_0,\ \sigma^2\left(\frac{\displaystyle 1}{\displaystyle n} + \frac{\displaystyle \bar{x}^2}{\displaystyle S_{xx}}\right) \Bigg)
\end{equation}
\(\blacksquare\)
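As with the slope, a short simulation can confirm both the mean and the variance formula for \(b_0\). Notice how the \(\bar{x}^2 / S_{xx}\) term enters: designs with a large \(\bar{x}\) give a noisy intercept estimate, while centering the \(x_i\) (so that \(\bar{x} = 0\)) would shrink \(V[b_0]\) down to \(\sigma^2/n\). Again, all parameter values below are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(3)

beta0, beta1, sigma, n = 2.0, 0.5, 1.5, 30   # made-up values
x = np.linspace(5, 15, n)                    # a design with a large xbar
xbar = x.mean()
Sxx = np.sum((x - xbar) ** 2)

reps = 20_000
b0_draws = np.empty(reps)
for r in range(reps):
    y = beta0 + beta1 * x + rng.normal(0, sigma, size=n)
    b1 = np.sum((x - xbar) * y) / Sxx
    b0_draws[r] = y.mean() - b1 * xbar

print(b0_draws.mean(), beta0)                                   # ~ beta0
print(b0_draws.var(ddof=1), sigma**2 * (1/n + xbar**2 / Sxx))   # ~ theoretical variance
```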
The Estimates and Predictions
The above looked at distributions of the usual parameters of interest. Here, let us look at the distribution of estimates and predictions... hoping to tease out the difference between an estimate and a prediction.
The distribution of the estimated mean of \(Y\) at an observed value \(x_i\), which we will term \(\hat{Y}_i\), is
\begin{equation}
\hat{Y}_i \sim N\Bigg( \beta_0 + \beta_1x_i,\ \sigma^2\left(\frac{\displaystyle 1}{\displaystyle n} + \frac{\displaystyle (x_i-\bar{x})^2}{\displaystyle S_{xx}} \right) \Bigg)
\end{equation}
What does this actually mean?
If we repeat this experiment (of collecting a sample of size \(n\)) an infinite number of times and estimate \(\hat{Y}_i\) for each of those experiments using our formulas, then those many \(\hat{Y}_i\) would follow the specified distribution.
Proof.
Remember that \(\hat{Y}_i = b_0 + b_1 x_i\) and that \(x\) is non-stochastic (it is not a random variable). With this, we have that \(\hat{Y}_i\) is a linear combination of Normally distributed random variables (\(b_0\) and \(b_1\)). As such, the name of the distribution of \(\hat{Y}_i\) is "Normal." What remains is to calculate the expected value and variance.
\begin{align}
E[\hat{Y}_i] &= E[b_0 + b_1 x_i] \\[1em]
&= E[b_0] + E[b_1 x_i] \\[1em]
&= \beta_0 + \beta_1 x_i
\end{align}
As expected, the estimator is unbiased.
What about the variance? That is a bit more difficult, because we must deal with the covariance between \(b_0\) and \(b_1\), which turns out to be \(Cov[b_0,\ b_1] = -\bar{x}\,\sigma^2 / S_{xx}\) (another result whose proof I leave as an exercise).
\begin{align}
V[\hat{Y}_i] &= V[b_0 + b_1 x_i] \\[1em]
&= V[b_0] + V[b_1x_i] + 2 Cov[b_0,\ b_1x_i] \\[1em]
&= V[b_0] + V[b_1]x_i^2 + 2 Cov[b_0,\ b_1]\ x_i \\[1em]
&= \sigma^2 \left( \frac{\displaystyle 1}{\displaystyle n} + \frac{\displaystyle \bar{x}^2}{\displaystyle S_{xx}}\right) + \sigma^2\left( \frac{\displaystyle 1}{\displaystyle S_{xx}} \right)x_i^2 + 2\frac{\displaystyle -\bar{x} \sigma^2}{\displaystyle S_{xx}}x_i \\[1em]
&= \frac{\displaystyle \sigma^2}{\displaystyle n} + \frac{\displaystyle \sigma^2}{\displaystyle S_{xx}} \left( \bar{x}^2 + x_i^2 -2 \bar{x} x_i \right) \\[1em]
&= \frac{\displaystyle \sigma^2}{\displaystyle n} + \frac{\displaystyle \sigma^2}{\displaystyle S_{xx}} \left( \bar{x} - x_i \right)^2 \\[2em]
&= \sigma^2 \left( \frac{\displaystyle 1}{\displaystyle n} + \frac{\displaystyle \left( \bar{x} - x_i \right)^2}{\displaystyle S_{xx}} \right)
\end{align}
And so, putting these three things together gives us our hoped-for result
\begin{equation}
\hat{Y}_i \sim N\Bigg( \beta_0 + \beta_1x_i,\ \sigma^2\left(\frac{\displaystyle 1}{\displaystyle n} + \frac{ \displaystyle (x_i-\bar{x})^2}{\displaystyle S_{xx}} \right) \Bigg)
\end{equation}
... as we expected.
\(\blacksquare\)
There are a couple of things interesting about this result.
First, the uncertainty in \(\hat{Y}_i\) is a function of \(n\), \(S_{xx}\), and \(\bar{x} - x_i\). Larger sample sizes (larger \(n\)) produce a more precise estimate.
Second, samples with larger values of \(S_{xx}\) also produce more precise estimates. To maximize \(S_{xx}\), the researcher would place half of the \(x_i\) values at the minimum and half at the maximum. Unfortunately, the drawback to doing this is that one is then unable to detect curvature in the expected value of \(Y\). Thus, we again see that there is a trade-off in statistics. The important part is to be clear about what you are trying to understand... and to use your statistical understanding to get there.
Finally, the precision of the estimate also depends on how far that \(x\) value is from the center of gravity, \((\bar{x},\ \bar{y})\). Note that the uncertainty in \(\hat{Y}_i\) when \(x=\bar{x}\) only comes from the uncertainty in the value of \(\bar{Y}\).
Convince yourself that all of this makes sense (non-mathematically). Draw graphics to illustrate these results. Trust me, it will help you better connect the mathematics with the reality.
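One way to start building that intuition is to tabulate the standard error \(\sqrt{V[\hat{Y}_i]}\) across a grid of \(x\) values for a fixed, made-up design; you will see it bottom out at \(\bar{x}\) and grow as \(x\) moves away from the center.

```python
import numpy as np

# Made-up design and error standard deviation, for illustration only
sigma, n = 1.5, 30
x = np.linspace(0, 10, n)
xbar = x.mean()
Sxx = np.sum((x - xbar) ** 2)

# Standard error of Y-hat over a grid of x values
grid = np.linspace(0, 10, 11)
se_yhat = sigma * np.sqrt(1/n + (grid - xbar) ** 2 / Sxx)

for g, se in zip(grid, se_yhat):
    print(f"x = {g:5.2f}   SE[Y-hat] = {se:.3f}")
# The SE is smallest at x = xbar and grows as x moves away from xbar.
```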
The distribution of \(Y_{new}\), a new observation, for a new value of \(x\), is
\begin{equation}
Y_{new} \sim N\Bigg( \beta_0 + \beta_1 x_{new},\ \sigma^2\left(\displaystyle 1 + \frac{\displaystyle 1}{\displaystyle n} + \frac{\displaystyle (x_{new}-\bar{x})^2}{\displaystyle S_{xx}} \right)\Bigg)
\end{equation}
Before we begin this proof, remember that
\begin{equation}
Y_{new} = b_0 + b_1 x_{new} + \varepsilon = \hat{Y}_{new} + \varepsilon
\end{equation}
Since we are predicting a new observation (as opposed to just estimating an expected value), we need to include \(\varepsilon\) in our calculations. This is subtle but very important: it is the error term that separates a prediction from an estimate.
Also, before we start the proof, compare and contrast this distribution with the distribution of \(\hat{Y}_i\). What is the difference? Where does that difference come from?
Proof.
And now for the expected proof. See that \(Y_{new}\) is a linear combination of Normally distributed random variables (\(b_0\), \(b_1\), and \(\varepsilon\)). Thus, \(Y_{new}\) follows a Normal distribution. All that remains is to calculate its expected value and its variance. To do so, we rely on the previous theorem.
\begin{align}
E[Y_{new}] &= E[b_0 + b_1 x_{new} + \varepsilon] \\[1em]
&= E[b_0 + b_1 x_{new}] + E[\varepsilon] \\[1em]
&= \beta_0 + \beta_1 x_{new} + 0 \\[1em]
&= \beta_0 + \beta_1 x_{new}
\end{align}
Next, for the variance:
\begin{align}
V[Y_{new}] &= V[b_0 + b_1 x_{new} + \varepsilon] \\[1em]
&= V[\hat{Y}_{new} + \varepsilon] \\[1em]
&= V[\hat{Y}_{new}] + V[\varepsilon] + 2\ Cov[\hat{Y}_{new},\ \varepsilon] \\[1em]
&= \sigma^2 \left( \frac{1}{n} + \frac{\left( \bar{x} - x_{new} \right)^2}{S_{xx}} \right) + \sigma^2 + 0 \\[1.5em]
&= \sigma^2 \left( \displaystyle 1 + \frac{\displaystyle 1}{\displaystyle n} + \frac{\displaystyle \left( \bar{x} - x_{new} \right)^2}{\displaystyle S_{xx}} \right)
\end{align}
Putting these parts together gives us the distribution of a new observation (a prediction):
\begin{equation}
Y_{new} \sim N\Bigg( \beta_0 + \beta_1 x_{new},\ \sigma^2\left(\displaystyle 1 + \frac{\displaystyle 1}{\displaystyle n} + \frac{\displaystyle (x_{new}-\bar{x})^2}{\displaystyle S_{xx}} \right)\Bigg)
\end{equation}
\(\blacksquare\)
Note that the only difference in the uncertainties between \(Y_{new}\) and \(\hat{Y}\) is an additional term of \(\sigma^2\), due to the inclusion of the error term \(\varepsilon\). Thus, all of the things that affect the variance of \(\hat{Y}\) also affect the variance of \(Y_{new}\), and in the same way.
Also note that the uncertainty in an observation is higher than the uncertainty in the expected value (see Figure \(\PageIndex{1}\), below).
The important difference between this theorem and the previous is that this theorem models a new observation, while the previous models the expected value of an observation. The difference is important.
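The following short sketch tabulates the two standard errors side by side for the same made-up design used above; note that the prediction standard error never falls below \(\sigma\), no matter how large the sample.

```python
import numpy as np

# Same made-up design as before, for illustration only
sigma, n = 1.5, 30
x = np.linspace(0, 10, n)
xbar = x.mean()
Sxx = np.sum((x - xbar) ** 2)

grid = np.linspace(0, 10, 11)
se_mean = sigma * np.sqrt(1/n + (grid - xbar) ** 2 / Sxx)       # estimated mean of Y
se_new  = sigma * np.sqrt(1 + 1/n + (grid - xbar) ** 2 / Sxx)   # new observation

for g, a, b in zip(grid, se_mean, se_new):
    print(f"x = {g:5.2f}   SE[Y-hat] = {a:.3f}   SE[Y_new] = {b:.3f}")
# SE[Y_new] is always the larger of the two and never falls below sigma.
```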
The Mean Square Error
There is another parameter in our model that we may like to estimate. That is the variance of \(\varepsilon\). The ordinary least squares estimator of \(\sigma^2\) is called the mean square error. It is defined as
\begin{equation}
\mathrm{MSE} = \frac{1}{n-p} \sum_{i=1}^n \varepsilon_i^2
\end{equation}
Here, \(p\) is the number of parameters in the regression. So far, we have dealt with estimating \(\beta_0\) and \(\beta_1\). Thus, \(p=2\) in simple linear regression.
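As a quick illustration with made-up data, the sketch below fits a simple linear regression by ordinary least squares and computes the MSE from the residuals with \(p = 2\):

```python
import numpy as np

rng = np.random.default_rng(4)

# Made-up data, for illustration
n, p = 30, 2
x = rng.uniform(0, 10, size=n)
y = 2.0 + 0.5 * x + rng.normal(0, 1.5, size=n)

# Fit by ordinary least squares
xbar, ybar = x.mean(), y.mean()
b1 = np.sum((x - xbar) * (y - ybar)) / np.sum((x - xbar) ** 2)
b0 = ybar - b1 * xbar

# Residuals and the mean square error
resid = y - (b0 + b1 * x)
mse = np.sum(resid ** 2) / (n - p)
print(mse)   # an estimate of sigma^2 = 1.5^2 = 2.25
```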
The distribution of the mean square error, \(\mathrm{MSE}\), can be written as
\begin{equation}
\frac{\displaystyle (n-p)\ \mathrm{MSE}}{\displaystyle \sigma^2}\ \sim\ \chi^2_{n-p}
\end{equation}
Proof.
The first thing to do is remind ourselves of the definition of a \(\chi^2\) random variable. From the definition of the Chi-square distribution, we have that if \(Z_i \sim N(0,\ 1)\), then \(\sum Z_i^2 \sim \chi^2_{\nu}\), where \(\nu\) is the number of those \(Z_i\) that are independent (the degrees of freedom).
With this definition, we just need to find a random variable with a Normal distribution and transform it into the proper form. To that end, here is some algebra:
\begin{align}
\varepsilon_i &\sim N\left(0,\ \sigma^2\right) \\[1em]
\frac{\varepsilon_i}{\sigma} &\sim N(0,\ 1) \\[1em]
\frac{\varepsilon_i^2}{\sigma^2} &\sim \chi^2_{\nu=1}\\[1em]
\frac{ \sum \varepsilon_i^2}{\sigma^2} &\sim \chi^2_{\nu=n-p} \label{eq:lm3-chichi} \\[1em]
\frac{ (n-p)\ \frac{1}{n-p} \sum \varepsilon_i^2}{\sigma^2} &\sim \chi^2_{n-p} \\[1em]
\frac{ (n-p)\ \mathrm{MSE} }{\sigma^2} &\sim \chi^2_{n-p}
\end{align}
And this is what we were to prove.
\(\blacksquare\)
As usual, knowing the distribution of a sample statistic like the \(\mathrm{MSE}\) allows us to create confidence intervals and perform hypothesis testing about the variance of the residuals, \(\sigma^2\).
With that said, the importance of the previous theorem lies more in how we can use it to obtain confidence intervals and test hypotheses about the OLS estimators of the intercept and slope parameters.
By the way, the reason that equation \(\ref{eq:lm3-chichi}\) has \(n-p\) degrees of freedom is that only \(n-p\) of the terms are independent: the fitted residuals satisfy \(p\) linear constraints (one for each estimated parameter), so the remaining \(p\) terms can be determined from the other \(n-p\).
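If you would like to see this distributional result in action, the following Monte Carlo sketch (with made-up parameter values) repeatedly refits the regression and computes \((n-p)\,\mathrm{MSE}/\sigma^2\); the empirical mean and variance should be close to \(n-p\) and \(2(n-p)\), the mean and variance of a \(\chi^2_{n-p}\) random variable.

```python
import numpy as np

rng = np.random.default_rng(5)

beta0, beta1, sigma, n, p = 2.0, 0.5, 1.5, 30, 2   # made-up values
x = np.linspace(0, 10, n)
xbar = x.mean()
Sxx = np.sum((x - xbar) ** 2)

reps = 20_000
stat = np.empty(reps)
for r in range(reps):
    y = beta0 + beta1 * x + rng.normal(0, sigma, size=n)
    b1 = np.sum((x - xbar) * y) / Sxx
    b0 = y.mean() - b1 * xbar
    mse = np.sum((y - (b0 + b1 * x)) ** 2) / (n - p)
    stat[r] = (n - p) * mse / sigma**2

# A chi-square with n - p degrees of freedom has mean n - p and variance 2(n - p)
print(stat.mean(), n - p)             # ~ 28
print(stat.var(ddof=1), 2 * (n - p))  # ~ 56
```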


