12.2: The MLE and the CLM
Recall that the classical linear model (CLM) assumes
\begin{equation}
\mathbf{Y} = \mathbf{XB} + \mathbf{E},
\end{equation}
with \(\mathbf{E} \sim N_n\left(\mathbf{0};\ \sigma^2 \mathbf{I}\right)\). When we fit this model using ordinary least squares (OLS), we obtained the following estimators:
\begin{equation}
\left\{
\begin{array}{ll}
\quad b_0 = \bar{y} - b_1 \bar{x} \\[1em]
\quad b_1 = \frac{\displaystyle \sum (x_i-\bar{x})(y_i-\bar{y})}{\displaystyle \sum (x_i-\bar{x})^2} \\
\end{array}
\right.
\end{equation}
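Before moving to maximum likelihood, here is a minimal numerical sketch of those OLS formulas. It is not part of the original development; the data values are made up purely for illustration, and NumPy's polyfit is used only as a cross-check.

```python
# A minimal sketch of the OLS formulas above, using made-up data.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])   # hypothetical predictor values
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])   # hypothetical responses

b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean())**2)
b0 = y.mean() - b1 * x.mean()

# np.polyfit(x, y, 1) returns [slope, intercept]; it should agree with (b1, b0).
slope, intercept = np.polyfit(x, y, 1)
print(b0, b1)
print(intercept, slope)
```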
Let us see what we get when we fit this model using maximum likelihood methods.
The maximum likelihood estimator of \(\beta_0\) is
\begin{equation}
\hat{\beta}_0 = \bar{y} - \hat{\beta}_1\bar{x}
\end{equation}
This is equivalent to the OLS estimator of the y-intercept.
Proof.
The first step is to determine the likelihood function. The second step is to maximize that likelihood with respect to the parameter. As usual, one maximizes the logarithm of the likelihood rather than the likelihood itself, because doing so is generally easier.
Remember the conditional distribution of \(y\). With that in mind, here is the likelihood for \emph{one} observation:
\begin{align}
\mathcal{L}(\mu, \sigma^2;\ x,y) &= \frac{1}{\sqrt{2\pi\sigma^2}}\ \exp \left[ -\frac{1}{2}\ \frac{(y - \mu)^2}{\sigma^2} \right] \\[2em]
&= \frac{1}{\sqrt{2\pi\sigma^2}}\ \exp \left[ -\frac{1}{2}\ \frac{\left(y - \hat{y}\right)^2}{\sigma^2} \right] \\[2em]
&= \frac{1}{\sqrt{2\pi\sigma^2}}\ \exp \left[ -\frac{1}{2}\ \frac{\big(y - (\beta_0 + \beta_1 x)\big)^2}{\sigma^2} \right]
\end{align}
Jeeee-willikers! That is just the probability density function for the normal distribution, where \(\mu\) (as always) represents an expected value.
That was for a single observation. However, we rarely deal with just one data point. We deal with \(n\) of them. We remember from our introductory statistics course that if the data are independent, then \(P[A \cap B] = P[A]P[B]\). That means that the likelihood of observing all of our data is just the product of the individual likelihoods (also see Lemma: Independent Events).
With that, we have
\begin{equation}
\mathcal{L}\left(\beta_0,\beta_1, \sigma^2;\ \mathbf{x},\mathbf{Y}\right) = \prod_{i=1}^n\ \frac{1}{\sqrt{2\pi\sigma^2}}\exp \left[ -\frac{1}{2} \frac{\left(y_i - (\beta_0 + \beta_1 x_i)\right)^2}{\sigma^2} \right]
\end{equation}
Since you may not have seen the notation before, \(\prod\) is the product symbol just like \(\sum\) is the summation symbol.
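To make the product concrete, here is a small sketch (with made-up data and candidate parameter values, not from the text) that evaluates this likelihood using SciPy's normal density. Note that a product of many densities can underflow to zero for large \(n\), which is one practical reason for the logarithm introduced below.

```python
# A sketch of the likelihood as a product of normal densities.
import numpy as np
from scipy.stats import norm

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])    # hypothetical data
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

def likelihood(beta0, beta1, sigma2):
    mu = beta0 + beta1 * x                 # conditional mean of each y_i
    return np.prod(norm.pdf(y, loc=mu, scale=np.sqrt(sigma2)))

print(likelihood(0.1, 2.0, 0.05))          # likelihood at one candidate (beta0, beta1, sigma^2)
```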
The next step is to maximize this likelihood. From calculus, we recall the product rule for derivatives. Just try applying the product rule here; you will shortly go bald from pulling out your hair. There is no easy way to maximize this likelihood function directly. What a pity!
However, if we apply a strictly increasing function to this likelihood, then maximizing the transformed function is equivalent to maximizing the original likelihood, in the sense that both are maximized at the same parameter values (the "argmax").
Because the likelihood has a lot of products, and because it is easier to maximize a sum, we use the logarithm function. The log-likelihood function of the above function is just
\begin{equation}
\ell\left(\beta_0,\beta_1, \sigma^2;\ \mathbf{x},\mathbf{Y}\right) = \sum_{i=1}^n\ \left( -\frac{1}{2} \log\left(2\pi\sigma^2\right)\ -\frac{1}{2} \frac{\left(y_i - (\beta_0 + \beta_1 x_i)\right)^2}{\sigma^2} \right)
\end{equation}
Taking the derivative of a summation is so much easier than taking the derivative of a product... so much easier!
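As a quick sanity check, the summand above should agree with SciPy's built-in normal log-density. The following sketch (made-up numbers, not from the text) compares the two for a single observation.

```python
# Checking one term of the log-likelihood against scipy.stats.norm.logpdf.
import numpy as np
from scipy.stats import norm

beta0, beta1, sigma2 = 0.1, 2.0, 0.05      # hypothetical parameter values
xi, yi = 3.0, 6.2                          # one made-up observation

by_hand = -0.5 * np.log(2 * np.pi * sigma2) \
          - 0.5 * (yi - (beta0 + beta1 * xi))**2 / sigma2
built_in = norm.logpdf(yi, loc=beta0 + beta1 * xi, scale=np.sqrt(sigma2))
print(by_hand, built_in)                   # the two values should match
```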
Now that we have a function that is practical to differentiate, we use calculus to maximize it with respect to \(\beta_0\):
\begin{align}
\frac{\partial}{\partial\beta_0} \ell(\beta_0,\beta_1, \sigma^2;\ \mathbf{x},\mathbf{Y}) &=
\frac{\partial}{\partial\beta_0} \sum_{i=1}^n\ \left( -\frac{1}{2} \log\left(2\pi\sigma^2\right)\ -\frac{1}{2} \frac{\left(y_i - (\beta_0 + \beta_1 x_i)\right)^2}{\sigma^2} \right) \nonumber\\[2em]
&= \sum_{i=1}^n\ -\frac{1}{2}\ \frac{2\left(y_i - (\beta_0 + \beta_1 x_i)\right)(-1)}{\sigma^2} \\[2em]
&= \sum_{i=1}^n\ \frac{y_i - \beta_0 - \beta_1 x_i}{\sigma^2} \\[1em]
0 &\stackrel{\text{set}}{=} \sum_{i=1}^n\ \frac{y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i}{\sigma^2} \\[1em]
&= \sum_{i=1}^n\ y_i - \sum_{i=1}^n\ \hat{\beta}_0 - \sum_{i=1}^n\ \hat{\beta}_1 x_i \\[1em]
\sum_{i=1}^n\ \hat{\beta}_0 &= \sum_{i=1}^n\ y_i - \sum_{i=1}^n\ \hat{\beta}_1 x_i \\[1em]
n\hat{\beta}_0 &= n\bar{y} - n\hat{\beta}_1\bar{x} \\[1em]
\hat{\beta}_0 &= \bar{y} - \hat{\beta}_1\bar{x}
\end{align}
Thus, we have shown that \(\hat{\beta}_0 = \bar{y} - \hat{\beta}_1\bar{x}\), as we desired. Note that this is also the OLS estimator of the y-intercept. Very interesting!
\(\blacksquare\)
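Since calculus gave us a closed form, a numerical check is easy. The sketch below (made-up data; SciPy's Nelder-Mead optimizer is an arbitrary choice) maximizes the log-likelihood numerically and confirms that the fitted intercept satisfies \(\hat{\beta}_0 = \bar{y} - \hat{\beta}_1\bar{x}\).

```python
# Numerically maximizing the log-likelihood and checking beta0_hat = ybar - beta1_hat * xbar.
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])    # hypothetical data
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

def neg_log_likelihood(theta):
    beta0, beta1, log_sigma2 = theta       # parameterize by log(sigma^2) to keep it positive
    sigma = np.sqrt(np.exp(log_sigma2))
    return -np.sum(norm.logpdf(y, loc=beta0 + beta1 * x, scale=sigma))

fit = minimize(neg_log_likelihood, x0=[0.0, 1.0, 0.0], method="Nelder-Mead")
beta0_hat, beta1_hat, _ = fit.x
print(beta0_hat, y.mean() - beta1_hat * x.mean())   # these should agree closely
```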
The maximum likelihood estimator of \(\beta_1\) is
\begin{equation}
\hat{\beta}_1 = \frac{\sum (x_i-\bar{x})(y_i-\bar{y})}{\sum (x_i-\bar{x})^2}
\end{equation}
Proof.
From our proof of the estimator of \(\beta_0\), we have the following as our log-likelihood function:
\begin{equation}
\ell\left(\beta_0,\beta_1, \sigma^2;\ \mathbf{x},\mathbf{Y}\right) = \sum_{i=1}^n\ \left( -\frac{1}{2} \log\left(2\pi\sigma^2\right)\ -\frac{1}{2} \frac{\left(y_i - (\beta_0 + \beta_1 x_i)\right)^2}{\sigma^2} \right)
\end{equation}
And so, the proof proceeds by taking the derivative with respect to \(\beta_1\) and solving for \(\hat{\beta}_1\).
\begin{align}
\frac{\partial}{\partial\beta_1} \ell(\beta_0,\beta_1, \sigma^2;\ \mathbf{x},\mathbf{Y}) &=
\frac{\partial}{\partial\beta_1} \sum_{i=1}^n\ \left( -\frac{1}{2} \log\left(2\pi\sigma^2\right)\ -\frac{1}{2} \frac{\left(y_i - (\beta_0 + \beta_1 x_i)\right)^2}{\sigma^2} \right) \nonumber \\[2em]
&= -\frac{1}{2} \sum_{i=1}^n\ \frac{2\left(y_i - (\beta_0 + \beta_1 x_i)\right)(-x_i)}{\sigma^2} \\[1em]
&= \sum_{i=1}^n\ \frac{x_iy_i - \beta_0x_i - \beta_1 x_i^2}{\sigma^2} \\[1em]
0 & \stackrel{\text{set}}{=} \sum_{i=1}^n\ \frac{x_iy_i - \hat{\beta}_0x_i - \hat{\beta}_1 x_i^2}{\sigma^2} \\[1em]
&= \sum_{i=1}^n x_iy_i - \sum_{i=1}^n\hat{\beta}_0x_i - \sum_{i=1}^n\hat{\beta}_1 x_i^2 \\[1em]
&= \sum_{i=1}^n x_iy_i - n\bar{x}\hat{\beta}_0 - \sum_{i=1}^n\hat{\beta}_1 x_i^2 \\[1em]
&= \sum_{i=1}^n x_iy_i - n\bar{x}\left( \bar{y}-\hat{\beta}_1\bar{x} \right) - \sum_{i=1}^n\hat{\beta}_1 x_i^2 \\[1em]
&= \sum_{i=1}^n x_iy_i - n\bar{x}\bar{y} + n\hat{\beta}_1\bar{x}^2 - \sum_{i=1}^n\hat{\beta}_1 x_i^2 \\[1em]
\sum_{i=1}^n\hat{\beta}_1 x_i^2 - n\hat{\beta}_1\bar{x}^2 &= \sum_{i=1}^n x_iy_i - n\bar{x}\bar{y} \\[1em]
\hat{\beta}_1 \left( \sum_{i=1}^n x_i^2 - n\bar{x}^2 \right) &= \sum_{i=1}^n x_iy_i - n\bar{x}\bar{y} \\[2em]
\hat{\beta}_1 &= \frac{ \sum_{i=1}^n x_iy_i - n\bar{x}\bar{y} }{\sum_{i=1}^n x_i^2 - n\bar{x}^2}
\end{align}
We have seen this before. It is the OLS estimator of the slope parameter. No surprise.
To finish the proof, use algebra to show that the final equation above is equivalent to
\begin{equation}
\hat{\beta}_1 = \frac{ \sum (x_i-\bar{x})(y_i-\bar{y}) }{ \sum (x_i-\bar{x})^2 }
\end{equation}
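For completeness, here is the algebra in question. Expanding both the numerator and the denominator shows the two forms agree:
\begin{align}
\sum_{i=1}^n (x_i-\bar{x})(y_i-\bar{y}) &= \sum_{i=1}^n x_iy_i - \bar{y}\sum_{i=1}^n x_i - \bar{x}\sum_{i=1}^n y_i + n\bar{x}\bar{y} = \sum_{i=1}^n x_iy_i - n\bar{x}\bar{y} \\[1em]
\sum_{i=1}^n (x_i-\bar{x})^2 &= \sum_{i=1}^n x_i^2 - 2\bar{x}\sum_{i=1}^n x_i + n\bar{x}^2 = \sum_{i=1}^n x_i^2 - n\bar{x}^2
\end{align}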
\(\blacksquare\)
The maximum likelihood estimator of \(\sigma^2\) is
\begin{equation}\label{eq:lm8-mlesigma}
\hat{\sigma}^2 = \frac{1}{n} \sum_{i=1}^n \left( y_i - \hat{\beta}_0 - \hat{\beta}_1x_i \right)^2
\end{equation}
Proof.
I will leave this one as an exercise for you to prove. I have already shown the log-likelihood function. All you have to do is differentiate it with respect to \(\sigma^2\), set the derivative equal to zero, solve for \(\hat{\sigma}^2\), and use algebra to move things into the form above.
\(\blacksquare\)
Note that the above formula (Eqn \(\ref{eq:lm8-mlesigma}\)) is equivalent to
\begin{equation}
\hat{\sigma}^2 = \frac{1}{n} \sum_{i=1}^n e_i^2
\end{equation}
where \(e_i = y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i\) is the \(i\)th residual. This should raise a red (maybe only light yellow or a nice fuchsia) flag, as this is a biased estimator of \(\sigma^2\).
Note, however, that asymptotically (as \(n \to \infty\)), the estimator becomes unbiased. This can be proven — more easily than you may imagine.
Actually, this is a guarantee of all maximum likelihood estimators (under the usual regularity conditions): they may not be unbiased, but they are asymptotically unbiased.
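A small simulation makes both claims concrete: the bias of \(\hat{\sigma}^2\) and its disappearance as \(n\) grows. The sketch below uses made-up true parameter values; nothing about it comes from the original text.

```python
# Monte Carlo sketch: the MLE sigma^2_hat = SSE/n is biased downward,
# but the bias shrinks as n grows (E[SSE/n] = sigma^2 * (n - 2)/n).
import numpy as np

rng = np.random.default_rng(0)
beta0, beta1, sigma2 = 1.0, 2.0, 4.0               # hypothetical true values

def average_sigma2_hat(n, reps=5000):
    estimates = []
    for _ in range(reps):
        x = rng.uniform(0, 10, size=n)
        y = beta0 + beta1 * x + rng.normal(0, np.sqrt(sigma2), size=n)
        b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean())**2)
        b0 = y.mean() - b1 * x.mean()
        sse = np.sum((y - b0 - b1 * x)**2)
        estimates.append(sse / n)                  # the MLE divides by n, not n - 2
    return np.mean(estimates)

for n in (10, 50, 500):
    print(n, average_sigma2_hat(n))                # approaches sigma2 = 4 as n grows
```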
Consequences
I leave it as an exercise to see the following consequences:
- \(\hat{\beta}_0\) is unbiased for \(\beta_0\).
- \(\hat{\beta}_1\) is unbiased for \(\beta_1\).
- \(\hat{\sigma}^2\) is biased for \(\sigma^2\).
Because the maximum likelihood estimators are identical to the ordinary least squares estimators, and because we have not altered the Normality assumption of the classical linear model, everything from Chapters 3 through 5 holds.
Well, that is not entirely true. Remember that the maximum likelihood estimator of \(\sigma^2\) is not the same as the OLS estimator. Thus, the test statistic and confidence interval will need to be altered a bit. However, the differences are minor for large samples.
Multivariate Distributions*
There is one prerequisite to this textbook that would make things a little easier: an introduction to multivariate distributions. Thus far, I have "hand-waved" over the topic. Here, I will discuss it briefly. Feel free to treat this subsection as being of only passing interest.
The following is a univariate distribution:
\begin{equation}
f(x;\ \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}}\ \exp\left[ -\frac{1}{2}\ \frac{(x-\mu)^2}{\sigma^2} \right]
\end{equation}
This is the (in)famous probability density function for the normal distribution. It is a function of just one variable (value), \(x\). This is what makes it univariate. The prefix uni stands for neither the University of Northern Iowa nor edible sea urchin gonads. It is a Latin combining form for "one." Thus, "univariate" indicates "one variable."
The following is one example of a bivariate distribution:
\begin{equation}
f(x,y)=\frac{\exp \left\{ -\frac 1{2(1-\rho ^2)}\left[ \left( \frac{x-\mu _x}{\sigma _x}\right) ^2-2\rho \left( \frac{x-\mu _x}{\sigma _x}\right) \left( \frac{y-\mu _y}{\sigma _y}\right) +\left( \frac{y-\mu _y}{\sigma _y}\right)^2\right] \right\} }{2\pi \sigma _x\sigma _y\sqrt{1-\rho ^2}}
\end{equation}
This is a distribution where \(X\) and \(Y\) are distributed jointly normal, and where they have correlation \(\rho\) between them. There are a lot of symbols there because it is written in scalar form. Were we to write it in matrix form, we could generalize all of these "—variate" distributions into one form.
If the random vector \(\mathbf{Y}\) follows a multivariate normal distribution such that \(\mathbf{Y} \sim N_n\left( \boldsymbol{\mu};\ \Sigma \right)\), then
\begin{equation}
f\left(\mathbf{y};\ \boldsymbol{\mu}, \Sigma\right) = \left(2\pi\right)^{-n/2}\ \left|\Sigma\right|^{-1/2}\ \exp\left[ -\frac{1}{2}\ (\mathbf{y}-\boldsymbol{\mu})^\prime \Sigma^{-1} (\mathbf{y}-\boldsymbol{\mu}) \right]
\end{equation}
Here, \(n\) is the number of random variables that are jointly normal. Joint normality means that each \(Y_i\) is normally distributed, as is every linear combination of the \(Y_i\). The vector \(\boldsymbol{\mu}\) is a column vector of the expected values of the \(Y_i\). Finally, \(\Sigma\) is the covariance matrix of the \(n\) random variables. If the \(Y_i\) are independent, then \(\Sigma \in \mathcal{D}_{n}\) (diagonal). If the \(Y_i\) are independent and identically distributed, then \(\Sigma = \sigma^2\mathbf{I}_n\).
If \(n=1\), then the multivariate normal reduces to the univariate normal. If \(n=2\), then it reduces to the bivariate normal, where the off-diagonal entries in \(\Sigma\) are equal to \(\rho\sigma_1\sigma_2\) and the diagonal entries are \(\sigma_1^2\) and \(\sigma_2^2\).
That is, if \(n=2\), then
\begin{equation}
\Sigma = \left[ \begin{array}{cc} \sigma_1^2 & \rho\sigma_1\sigma_2 \\ \rho\sigma_1\sigma_2 & \sigma_2^2 \\ \end{array} \right]
\end{equation}
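As a quick check of the scalar formula above, the following sketch (made-up parameter values) evaluates it directly and compares the result with SciPy's multivariate normal density built from this \(\Sigma\).

```python
# Comparing the scalar bivariate-normal formula with scipy's multivariate_normal.
import numpy as np
from scipy.stats import multivariate_normal

mu_x, mu_y = 1.0, -0.5                      # hypothetical means
sigma_x, sigma_y, rho = 2.0, 1.5, 0.6       # hypothetical sds and correlation

Sigma = np.array([[sigma_x**2,              rho * sigma_x * sigma_y],
                  [rho * sigma_x * sigma_y, sigma_y**2             ]])

x, y = 0.3, 0.2                             # point at which to evaluate the density
zx, zy = (x - mu_x) / sigma_x, (y - mu_y) / sigma_y
scalar = np.exp(-(zx**2 - 2 * rho * zx * zy + zy**2) / (2 * (1 - rho**2))) \
         / (2 * np.pi * sigma_x * sigma_y * np.sqrt(1 - rho**2))

print(scalar)
print(multivariate_normal.pdf([x, y], mean=[mu_x, mu_y], cov=Sigma))  # should match
```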
Still, not much in this subsection is important if the observations are independent: in that case, the \(n\)-variate normal is just the product of the \(n\) univariate normals.
HOWEVER, if the observations are not independent, then the joint distribution (the actual distribution we care about) is not so simple. In many cases, it has not been entirely formulated. For instance, what is the multivariate Binomial distribution? That is, what is the distribution of \(\{y_1, y_2, y_3, \ldots, y_n\}\), given correlation amongst those \(n\) measurements?
Even better: How could we measure such correlation?
In such cases, Dai, Ding, and Wahba (2013) may give you some insight into the difficulty of these questions — and their answers!
This really makes you appreciate random sampling, where both independence and identical distribution hold.


