10.2: Weighted Least Squares
It may be that the residuals are independent, but that their variance is known to not be constant. That is, we may have a model that leads to this assumption:
\begin{equation}
\varepsilon_i \stackrel{\text{ind}}{\sim} N\left(0;\, \sigma^2_i \right)
\end{equation}
or, in matrix form, as
\begin{equation}
\mathbf{E} \sim N_n\left(\mathbf{0};\, \sigma^2\mathbf{D}\right)
\end{equation}
Here, \(\mathbf{D}\) is a diagonal matrix. Again, the two statements are different ways of saying the same thing.
Note that the covariance matrix of this \(\mathbf{E}\) is
\begin{equation}
V[\mathbf{E}]\ =\ \sigma^2 \left[
\begin{matrix}
d_1 & 0 & 0 & \cdots & 0 \\
0 & d_2 & 0 & \cdots & 0 \\
0 & 0 & d_3 & & 0 \\
\vdots & \vdots & & \ddots & \vdots \\
0 & 0 & 0 & \cdots & d_n \\
\end{matrix}
\right]\ =\ \left[
\begin{matrix}
\sigma_1^2 & 0 & 0 & \cdots & 0 \\
0 & \sigma_2^2 & 0 & \cdots & 0 \\
0 & 0 & \sigma_3^2 & & 0 \\
\vdots & \vdots & & \ddots & \vdots \\
0 & 0 & 0 & \cdots & \sigma_n^2 \\
\end{matrix}
\right]
\end{equation}
The values along the diagonal represent the variances of each residual (in the population). That they are not necessarily the same value indicates that the variance of the residuals can differ from observation to observation. The values off the diagonal represent the covariance between the values of the residuals. So, the value at position 1,2 is the covariance between \(\varepsilon_1\) and \(\varepsilon_2\), which we symbolized as \(\sigma_{1,2}\) in The Appendix of Statistics (note that these are population values). Since \(\sigma_{1,2} = 0\), we are specifying that the two residuals are linearly uncorrelated in the population.
Remember that Greek letters refer to the population, while Latin letters (usually) refer to the sample.
Thus, the covariance matrix above specifies that the variances of the residuals are allowed to be different and that the residuals are independent of each other.
This assumption is the only difference between weighted least squares and ordinary least squares. But, it is a rather significant difference.
Remember that the value of \(\sigma^2\) can represent either the variance of the population residuals or our uncertainty about the value of a given residual.
To use weighted least squares (WLS), we need to know the structure of the \(\mathbf{D}\) matrix. We do not need to know the exact values, but we do need to know them up to a constant multiplier. That is, we need to know the structure of the heteroskedasticity. This usually comes from understanding the data-generating process. Frequently, this structure is not known (and thus WLS should probably not be used). However, there are some cases in which we would know this structure.
For instance, if we are working with a response variable that is a proportion arising from a binomially-distributed variable, we know that the variance is
\begin{equation}
\sigma_i^2\ =\ \frac{\pi(1-\pi)}{n_i}\ =\ \pi(1-\pi)\frac{1}{n_i}
\end{equation}
Thus, the diagonal elements will be \(d_i = 1/n_i\) and the multiplier (constant part) will be \(\pi(1-\pi)\).
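To make this concrete, here is a tiny sketch in R with made-up numbers (the group sizes \(n_i\) and the proportion \(\pi\) below are purely hypothetical): the structural part of the variance, \(d_i = 1/n_i\), goes into \(\mathbf{D}\), while \(\pi(1-\pi)\) plays the role of the constant multiplier \(\sigma^2\).

n_i  <- c(10, 25, 50, 100)    # hypothetical group sizes
pi0  <- 0.30                  # hypothetical (constant) population proportion
d    <- 1 / n_i               # diagonal elements of D, known without knowing pi0
sig2 <- pi0 * (1 - pi0)       # the constant multiplier, sigma^2
V    <- sig2 * diag(d)        # covariance matrix of the residuals, sigma^2 D
round(diag(V), 5)             # the individual variances sigma_i^2

Note that we can write down the \(d_i\) without knowing \(\pi\) at all; that is exactly what "known up to a constant multiplier" buys us.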
Fitting WLS: The Mathematics
Assuming we know the structure of the \(\mathbf{D}\) matrix, we can determine all we need to about the WLS estimators and estimates. We just reduce this problem to a previous problem. To clarify the similarities and differences between ordinary and weighted least squares, here is the classical linear model (CLM) for ordinary least squares:
\begin{equation}
\mathbf{Y} = \mathbf{X}\mathbf{B} + \mathbf{E}
\end{equation}
and for weighted least squares:
\begin{equation}
\mathbf{Y} = \mathbf{X}\mathbf{B} + \mathbf{E}
\end{equation}
Those are the same, whether one does OLS or WLS, because both come from the fact you are using the classical linear model. The difference comes in the assumption. Here is the assumption for OLS:
\begin{equation}
\mathbf{E} \sim N_n\left( 0;\, \sigma^2\mathbf{I}\right)
\end{equation}
Here is the assumption for WLS:
\begin{equation}
\mathbf{E} \sim N_n\left( 0;\, \sigma^2\mathbf{D}\right)
\end{equation}
Remember that \(\mathbf{D}\) is a diagonal matrix.
The All-Important Transformation
There is a joke about how a mathematician solved the problem of a hose connected to a fire hydrant:
A mathematician and a physicist were asked the following question:
Suppose you walked by a burning house and saw a hydrant and a hose not connected to the hydrant. What would you do?
P: I would attach the hose to the hydrant, turn on the water, and put out the fire.
M: I would attach the hose to the hydrant, turn on the water, and put out the fire.
Then they were asked this question:
Suppose you walked by a house and saw a hose connected to a hydrant. What would you do?
P: I would keep walking, as there is no problem to solve.
M: I would disconnect the hose from the hydrant and set the house on fire, reducing the problem to a previously solved form.
And so, in the spirit of mathematicians, let us reduce the weighted least squares problem to that of ordinary least squares. If we can do this via a bijective transformation, then we have our confidence intervals and test statistics.
If we define our weighting matrix \(\mathbf{W} = \mathbf{D}^{-1/2}\), then our problem is solved, sans burning down the house.
Let \(\mathbf{E} \sim N_n\left( 0,\, \sigma^2\mathbf{D}\right)\). If we define \(\mathbf{W} = \mathbf{D}^{-1/2}\), then
\begin{equation}
\mathbf{W}\mathbf{E} \sim N_n\left( 0,\ \sigma^2\mathbf{I}\right)
\end{equation}
Proof.
Since \(\mathbf{W}\) is a known (non-random) matrix and \(\mathbf{E}\) has a multivariate Normal distribution, the linear transformation \(\mathbf{W}\mathbf{E}\) will also follow a multivariate Normal distribution. Thus, we only need to calculate \(E[\mathbf{W}\mathbf{E}]\) and \(V[\mathbf{W}\mathbf{E}]\).
The expected value of \(\mathbf{W}\mathbf{E}\) is
\begin{align}
E[\mathbf{W}\mathbf{E}] &= \mathbf{W}\ E[\mathbf{E}] \\[1ex]
&= \mathbf{W}\ \mathbf{0} \\[1ex]
&= \mathbf{0}
\end{align}
For the variance we have
\begin{align}
V[\mathbf{W}\mathbf{E}] &= \mathbf{W}\ V[\mathbf{E}]\ \mathbf{W}^\prime \\[1ex]
&= \mathbf{W}\ \sigma^2\mathbf{D}\ \mathbf{W}^\prime \\[1ex]
&= \sigma^2\ \mathbf{D}\ \mathbf{W}\ \mathbf{W}^\prime \\[1ex]
&= \sigma^2\ \mathbf{D}\ \mathbf{D}^{-1/2}\ \left(\mathbf{D}^{-1/2}\right)^\prime \\[1ex]
&= \sigma^2\ \mathbf{D}\ \mathbf{D}^{-1/2}\ \mathbf{D}^{-1/2} \\[1ex]
&= \sigma^2\ \mathbf{D}\ \mathbf{D}^{-1} \\[1ex]
&= \sigma^2\ \mathbf{I}
\end{align}
In these steps, remember that matrix multiplication is commutative if the matrices are diagonal.
Thus, putting these three parts together gives our conclusion.
\(\blacksquare\)
How do we use this theorem? We pre-multiply the model equation by the matrix \(\mathbf{W}\) to obtain the following:
\begin{align}
\mathbf{Y} &= \mathbf{X}\mathbf{B} + \mathbf{E} \\[1em]
\mathbf{W}\mathbf{Y} &= \mathbf{W}\mathbf{X}\mathbf{B} + \mathbf{W}\mathbf{E} \\[1em]
\mathbf{Y}^* &= \mathbf{X}^*\mathbf{B} + \mathbf{E}^*
\end{align}
with \(\mathbf{E}^* \sim N_n\left(\mathbf{0},\, \sigma^2\mathbf{I}\right)\).
Thus, we can apply all of our OLS results to WLS, as long as we work with the transformed response variable \(\mathbf{W}\mathbf{Y}\) and the transformed independent variable(s) \(\mathbf{W}\mathbf{X}\).
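Here is a small simulated sketch of that reduction in R. Everything about the data below is assumed (the design, the variance structure \(d_i = x_i^2\), and \(\sigma^2 = 2\) are made up for illustration); the point is only that pre-multiplying by \(\mathbf{W} = \mathbf{D}^{-1/2}\) turns a heteroskedastic problem into one we can hand to ordinary least squares.

set.seed(1)
n <- 100
x <- runif(n, 0, 10)
d <- x^2                                    # assumed known structure: Var(eps_i) proportional to x_i^2
y <- 1 + 3 * x + rnorm(n, 0, sqrt(2 * d))   # true model with B0 = 1, B1 = 3, sigma^2 = 2

W     <- diag(1 / sqrt(d))          # W = D^(-1/2)
X     <- cbind(1, x)                # design matrix, including the intercept column
ystar <- as.vector(W %*% y)         # Y* = W Y
Xstar <- W %*% X                    # X* = W X

# Ordinary least squares on the transformed variables; the "- 1" is there
# because Xstar already carries its own (transformed) intercept column.
fit_star <- lm(ystar ~ Xstar - 1)
coef(fit_star)

The residuals of this transformed fit are (up to estimation error) homoskedastic, which is the whole point of the transformation.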
This quickly leads to our weighted least squares estimators of \(\mathbf{B}\).
To prove this, we could proceed as we did back in Section 4.1: Matrix Representation. Or, since we have reduced the WLS problem to an OLS problem, we can just write out the results and simplify:
\begin{align}
\mathbf{b}_{WLS} &= \left( \mathbf{X}^{*\prime}\mathbf{X}^*\right)^{-1}\mathbf{X}^{*\prime}\mathbf{Y}^* \\[1em]
&= \left( (\mathbf{WX})^{\prime}\,\mathbf{WX} \right)^{-1}(\mathbf{WX})^{\prime}\,\mathbf{WY} \\[1em]
&= \left( \mathbf{X}^\prime \mathbf{W}^{\prime}\mathbf{W}\mathbf{X} \right)^{-1}\mathbf{X}^{\prime}\mathbf{W}^\prime \mathbf{W}\mathbf{Y} \\[1em]
&= \left( \mathbf{X}^\prime \mathbf{D}^{-1}\mathbf{X} \right)^{-1}\mathbf{X}^{\prime}\mathbf{D}^{-1}\mathbf{Y}\label{eq:lm6-WLS}
\end{align}
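As a check on that formula, here is the same simulated data as in the sketch above, with \(\mathbf{b}_{WLS}\) computed directly from \(\left(\mathbf{X}^\prime\mathbf{D}^{-1}\mathbf{X}\right)^{-1}\mathbf{X}^\prime\mathbf{D}^{-1}\mathbf{Y}\); the numbers agree with the transformed-OLS fit.

set.seed(1)                          # same made-up data as before
n <- 100
x <- runif(n, 0, 10)
d <- x^2
y <- 1 + 3 * x + rnorm(n, 0, sqrt(2 * d))

X     <- cbind(1, x)
Dinv  <- diag(1 / d)                 # D^(-1)
b_wls <- solve(t(X) %*% Dinv %*% X) %*% t(X) %*% Dinv %*% y
b_wls                                # matches coef(fit_star) from the previous sketch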
It also quickly leads to showing that the WLS estimator is unbiased for \(\mathbf{B}\):
Under the assumptions of weighted least squares, the WLS estimator for \(\mathbf{B}\) is unbiased.
Proof.
I am tempted to give this to you as an exercise, but let's see how to prove it.
\begin{align}
E\left[\mathbf{b}_{WLS}\right] &= E\left[ \left( \mathbf{X}^\prime \mathbf{D}^{-1}\mathbf{X} \right)^{-1}\mathbf{X}^{\prime}\mathbf{D}^{-1}\mathbf{Y} \right]
\end{align}
Remember that the \(\mathbf{D}\) matrix is known; it is not a random variable.
\begin{align}
E\left[\mathbf{b}_{WLS}\right] &= \left( \mathbf{X}^\prime \mathbf{D}^{-1}\mathbf{X} \right)^{-1}\mathbf{X}^{\prime}\mathbf{D}^{-1} E[\mathbf{Y} ]
\end{align}
Since \(\mathbf{W}\mathbf{Y} = \mathbf{W}\mathbf{X}\mathbf{B} + \mathbf{W}\mathbf{E}\), and since \(\mathbf{W}\) and \(\mathbf{D}\) are invertible (why?), we have \(\mathbf{Y} = \mathbf{X}\mathbf{B} + \mathbf{E}\), and so \(E[\mathbf{Y}] = \mathbf{X}\mathbf{B}\). Substituting this in gives
\begin{align}
E\left[\mathbf{b}_{WLS}\right] &= \left( \mathbf{X}^\prime \mathbf{D}^{-1}\mathbf{X} \right)^{-1}\mathbf{X}^{\prime}\mathbf{D}^{-1} \mathbf{X}\, \mathbf{B} \\[1em]
&= \mathbf{B}
\end{align}
Thus, the WLS estimator is unbiased if \(\mathbf{D}\) is invertible and if \(\mathbf{D}\) is known (non-stochastic). I will leave it as an exercise for you to prove this theorem if \(\mathbf{D}\) is a random variable independent of \(\mathbf{X}\).
\(\blacksquare\)
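If you prefer a numerical sanity check to the algebra, here is a small Monte Carlo sketch (all settings are made up): the design matrix and the \(\mathbf{D}\) matrix are held fixed, new errors are drawn repeatedly, and the average of the WLS estimates lands near the true \(\mathbf{B}\).

set.seed(42)
n <- 50
x <- runif(n, 1, 10)
d <- x^2                              # assumed known variance structure
X <- cbind(1, x)                      # fixed design
B <- c(1, 3)                          # true coefficients
Dinv <- diag(1 / d)
A <- solve(t(X) %*% Dinv %*% X) %*% t(X) %*% Dinv   # fixed, non-random matrix

est <- replicate(5000, {
  y <- X %*% B + rnorm(n, 0, sqrt(2 * d))           # new errors each time, sigma^2 = 2
  as.vector(A %*% y)                                # b_WLS for this data set
})
rowMeans(est)                         # should be close to (1, 3)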
Under the assumptions of weighted least squares, the variance of the WLS estimator for \(\mathbf{B}\) is
\begin{equation}
V\left[\mathbf{b}_{WLS}\right] = \sigma^2 \left(\mathbf{X}^\prime\mathbf{D}^{-1}\mathbf{X}\right)^{-1}
\end{equation}
Proof.
There should be no surprises with this proof. All you have to do is figure out what is a random variable and what is not. As such, I leave it as an exercise for you.
So very generous of me. =)
\(\blacksquare\)
Note that the WLS estimator of \(\mathbf{B}\) is a linear combination of independent Normal random variables. With that final observation, we have the distribution of the WLS estimator of \(\mathbf{B}\):
The distribution of the WLS estimator of \(\mathbf{B}\) is
\begin{equation}
\mathbf{b}_{WLS} \sim N \left( \mathbf{B},\ \sigma^2\left(\mathbf{X}^\prime\mathbf{D}^{-1}\mathbf{X}\right)^{-1} \right)
\end{equation}
We again note that the individual estimators are not independent of each other under typical circumstances. We also note that the confidence intervals for the estimators, for estimates of \(Y\), etc., can easily be determined in the WLS realm. Nothing new is here; only the mathematics is a bit more involved.
By the way, the lm function in R actually performs weighted least squares estimation. If you do not specify the weights, then they all default to 1, which is just ordinary least squares. If you do specify the weights, they need to be entered as a vector of values proportional to the inverse variances, that is, \(w_i = 1/d_i\).
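Here is what that looks like on the same simulated data used earlier; the weights argument gets \(1/d_i\), and lm does the rest. The variance structure \(d_i = x_i^2\) is, again, just an assumption made for the illustration.

set.seed(1)
n <- 100
x <- runif(n, 0, 10)
d <- x^2
y <- 1 + 3 * x + rnorm(n, 0, sqrt(2 * d))

fit_wls <- lm(y ~ x, weights = 1 / d)   # weights proportional to the inverse variances
coef(fit_wls)                           # agrees with the closed-form b_WLS above
vcov(fit_wls)                           # estimates sigma^2 (X' D^-1 X)^-1
confint(fit_wls)                        # the usual intervals, now in the WLS realm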
The Real Question
Weighted least squares takes care of the problem of heteroskedasticity in our data without introducing any major change in our modeling process or understanding. It just requires that we determine \(\mathbf{W}\) and transform our dependent and independent variables by pre-multiplying by that weighting matrix.
Question:
How do we obtain that weighting matrix?
The best way of obtaining it is through your substantive, scientific theory. The second best way is to utilize the hat matrix, \(\mathbf{H}\).
...from Theory
Sometimes, knowledge of the problem suggests the weighting matrix. Recall that the covariance matrix \(V[\mathbf{E}]\) measures our uncertainty in the residuals. If that uncertainty is known from the way the experiment is constructed, then \(\mathbf{W}\) can be determined. For instance, if the dependent variable is the proportion of successes in a Binomial experiment, where the number of trials may change from observation to observation, then the weighting matrix is just a diagonal matrix of the square roots of the trial sizes.
Why? Recall that the variance of a binomially-distributed proportion is \(\sigma_i^2 = \frac{\pi(1-\pi)}{N_i}\). Here, \(\pi\) is the unknown population proportion, assumed constant across groups, and \(N_i\) is the (known) number of trials in group \(i\). The population parameter is not measurable, but the sample sizes are. This leads to the \(\mathbf{D}\) matrix being of the form
\begin{equation}
\mathbf{D} = \left[
\begin{matrix}
1/N_1 & 0 & 0 & \cdots & 0 \\
0 & 1/N_2 & 0 & \cdots & 0 \\
0 & 0 & 1/N_3 & & 0 \\
\vdots & \vdots & & \ddots & \vdots \\
0 & 0 & 0 & \cdots & 1/N_n \\
\end{matrix}
\right]
\end{equation}
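A hedged sketch of how this plays out in practice (the data below are simulated, and the true proportion is allowed to drift with \(x\), so \(\pi(1-\pi)\) is only approximately constant): since \(d_i = 1/N_i\), the lm weights are simply the group sizes.

set.seed(11)
g    <- 30                                  # number of groups
N    <- sample(20:200, g, replace = TRUE)   # known trial sizes
x    <- runif(g, 0, 1)
p    <- plogis(-1 + 2 * x)                  # hypothetical true proportions
prop <- rbinom(g, N, p) / N                 # observed proportions

fit <- lm(prop ~ x, weights = N)            # weights = 1 / d_i = N_i
summary(fit)$coefficients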
...from the Hat Matrix
When we do not have the theory to know the structure of the \(\mathbf{D}\) matrix, we may want to use the hat matrix to give us a hint about its structure.
Make no mistake. This process is not mathematically exact... but what statistics procedure is? Statistics stands astride the real and the ideal, trying to get as much information about the real while acknowledging its limitations.
Remember from Chapter 6: Dood! Check the Requirements that not all violations affect inferences the same way. Perhaps a good exercise for you is to use the processes of that chapter to see how much using the hat matrix, in lieu of a theoretically-driven \(\mathbf{D}\) matrix, affects the estimates, confidence intervals, and p-values.
Let us use the symbol \(\mathbf{e}\) to represent the observed residuals. Until this point, we have only been working with the theoretical residuals, \(\mathbf{E}\). The conceptual difference between the two is really just the difference between the population (theoretical) and the sample (observations). In effect, the difference is in terms of the variances.
Question:
What is the variance of \(\mathbf{e}\)?
The variance of \(\mathbf{e}\) is \(V[\mathbf{e}] = \sigma^2(\mathbf{I}-\mathbf{H})\).
Proof.
\begin{align}
V[\mathbf{e}] &= V[(\mathbf{I}-\mathbf{H})\mathbf{Y}] \\[1em]
&= (\mathbf{I}-\mathbf{H})\ V[\mathbf{Y}]\ (\mathbf{I}-\mathbf{H})^\prime \\[1em]
&= (\mathbf{I}-\mathbf{H})\ \sigma^2\mathbf{I}\ (\mathbf{I}-\mathbf{H}) \\[1em]
&= (\mathbf{I}-\mathbf{H})\ \sigma^2\ (\mathbf{I}-\mathbf{H}) \\[1em]
&= \sigma^2(\mathbf{I}-\mathbf{H})
\end{align}
Here we used the fact that \(\mathbf{I}-\mathbf{H}\) is symmetric and idempotent, so \((\mathbf{I}-\mathbf{H})(\mathbf{I}-\mathbf{H})^\prime = \mathbf{I}-\mathbf{H}\).
\(\blacksquare\)
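Here is a quick simulation sketch of this result (a made-up design with \(n = 8\) and \(\sigma^2 = 4\)): drawing many homoskedastic data sets from the same design and averaging \(\mathbf{e}\mathbf{e}^\prime\) reproduces \(\sigma^2(\mathbf{I}-\mathbf{H})\) up to Monte Carlo error.

set.seed(3)
n <- 8
x <- 1:n
X <- cbind(1, x)                              # fixed design matrix
H <- X %*% solve(t(X) %*% X) %*% t(X)         # the hat matrix
sigma2 <- 4                                   # assumed true error variance

res <- replicate(20000, {
  y <- 5 + 2 * x + rnorm(n, 0, sqrt(sigma2))  # homoskedastic data
  lm.fit(X, y)$residuals                      # observed residuals e
})
round(res %*% t(res) / ncol(res), 1)          # empirical covariance of e (roughly)
round(sigma2 * (diag(n) - H), 1)              # theoretical sigma^2 (I - H)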
So, that was totes cool. What was its purpose? What does it mean?
Remember how we can interpret that variance. It is either the variance of a gazillion observed residuals, or we can see it as the uncertainty inherent in the measured residual. For example, the inherent uncertainty in the first residual can be estimated as
\begin{equation}
s_{1,1} = \mathrm{MSE} \left(1 - h_{1,1}\right)
\end{equation}
Here, \(h_{1,1}\) is the first element of the diagonal of the hat matrix.
That means the diagonal elements of \(\mathbf{I}-\mathbf{H}\) indicate, up to the constant multiplier \(\sigma^2\), the variance of the observed residual at each value of \(x\). Thus, an estimate of the structure of the \(\mathbf{D}\) matrix is just the diagonal of the \(\mathbf{I}-\mathbf{H}\) matrix.
The problem is that weighted least squares requires us to know the structure of the \(\mathbf{D}\) matrix, not that we estimate it from the data. This explains why the hat matrix technique is used only until something better comes along.
It does work nicely, but we statisticians like to "see the math" — sometimes. Also, if we are trying to draw important conclusions, using approximate methods tends to undercut the conclusions for many, especially for those who do not really understand statistics.
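Putting the pieces together, here is a hedged sketch of the hat-matrix approach on simulated data: fit OLS, take \(1 - h_{i,i}\) as the assumed structure of the \(\mathbf{D}\) diagonal, and refit with weights \(1/(1 - h_{i,i})\). Nothing about this guarantees an improvement; it merely shows the mechanics.

set.seed(5)
n <- 100
x <- runif(n, 0, 10)
y <- 1 + 3 * x + rnorm(n, 0, 2)       # simulated data (actually homoskedastic here)

fit_ols <- lm(y ~ x)                        # step 1: the ordinary fit
h       <- hatvalues(fit_ols)               # step 2: the diagonal of H
d_hat   <- 1 - h                             # step 3: assumed structure of the D diagonal
fit_wls <- lm(y ~ x, weights = 1 / d_hat)    # step 4: refit with those weights

cbind(OLS = coef(fit_ols), WLS = coef(fit_wls))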
Question:
What do the off-diagonal elements of \(\mathbf{I}-\mathbf{H}\) estimate?


