11.1: Parameter Estimation
The purpose of this section is to give some insight into the difficulties created by minimizing the sum of the absolute errors. Remember that the absolute value function has no derivative at its minimum and that its derivative has the same magnitude (\(\pm 1\)) everywhere else. As a result, setting derivatives equal to zero does not produce a formula for the minimum, so an iterative procedure must be used to obtain the parameter estimates.
This section looks at one possible way to estimate those parameters using a rather rudimentary iterative procedure. It is not the method actually used; it simply illustrates some of the inherent issues with an iterative technique, including the fact that there are several algorithms to choose from, perhaps even one that you will create as your PhD dissertation.
A First Try at Estimating the Parameters
In least squares, we used calculus to obtain closed-form equations for the estimators. Here, such solutions do not exist, so we must use an iterative technique; working through a toy example should help build intuition for how that works. Let us start with the data in the left two columns of the table below.
Remember that we want to minimize the sum of the absolute values (the \(L_1\) Norm) of the residuals. In other words, we want to minimize the \(L_1\) distance between the \(n\)-dimensional data vector and the \(p\)-dimensional parameter space. Recall the figure in Section 4.3 where we illustrated this with least squares.
Thus, the first step is to obtain residuals. This means we need to somehow obtain our first estimated regression line. Any line will work as a starting point. So, let us say our first estimate of the line-of-best-fit is \(\ell_1 : y = 3\), which is just the horizontal line at the median.
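Why the median? For a horizontal line \(y = c\), the target function reduces to \(\sum_i |y_i - c|\), and wherever the derivative exists,
\begin{equation}
\frac{d}{dc} \sum_{i=1}^n |y_i - c| = \sum_{i=1}^n \operatorname{sign}(c - y_i),
\end{equation}
which is negative while \(c\) sits below more than half of the \(y_i\) and positive once it sits above more than half of them. The sign change occurs at the sample median, so \(\tilde{y} = 3\) is the best possible horizontal line for these data.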
The next step is to calculate the residuals. These are in the \(e_1\) column of the table below. The target function is
\begin{align}
Q_1 &= \sum_{i=1}^n\ |e_i| \\[1em]
&= \sum_{i=1}^n\ |y_i - \hat{y}_i| \\[1em]
&= \sum_{i=1}^n\ |y_i - 3| \\[1em]
&= |y_1 - 3| + |y_2 - 3| + |y_3 - 3| + |y_4 - 3| \\[1em]
&= |1 - 3| + |3 - 3| + |3 - 3| + |9 - 3| \\[1em]
&= 2 + 0 + 0 + 6
\end{align}
Thus, for line \(\ell_1\), the value of the target function is 8.
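As a quick check, the same value can be computed in R with the toy data (the object names here are purely illustrative):
x <- c(0, 1, 2, 3)
y <- c(1, 3, 3, 9)
yhat1 <- rep(3, length(y))    # predictions from ell_1, the horizontal line at the median
Q1 <- sum(abs(y - yhat1))     # sum of the absolute residuals
Q1                            # 8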
The next step is to change the regression line. How? Well, that is the important question. Different methods may ultimately lead to slightly different answers. As this section only seeks to illustrate a method (and not even a good one), let us use logic to see what comes next. Note that the estimate at the lowest \(x\)-value is too high (a negative residual), while the estimate at the highest \(x\)-value is too low (a large positive residual). So, it makes sense to increase the slope. Let us increase the slope from 0 to 1. If we force the line to pass through the dimension-wise median \((\tilde{x},\ \tilde{y}) = (1.5,\ 3.0)\), the linear equation will be \(\ell_2 : y = 1.5 + x\). This produces the estimates and residuals in the next two columns of the table below.
x     y     y-hat_1   e_1     y-hat_2   e_2     y-hat_3   e_3
0     1     3         -2.0    1.5       -0.5    0.0        1.0
1     3     3          0.0    2.5        0.5    2.0        1.0
2     3     3          0.0    3.5       -0.5    4.0       -1.0
3     9     3          6.0    4.5        4.5    6.0        3.0
The value of the target function is
\begin{align}
Q_2 &= \sum_{i=1}^n\ |e_i| = \sum_{i=1}^n\ |y_i - \hat{y}_i| = \sum_{i=1}^n\ |y_i - \left( 1.5 + x_i \right)| \\[1em]
&= |1 - 1.5| + |3 - 2.5| + |3 - 3.5| + |9 - 4.5| \\[1em]
&= 0.5 + 0.5 + 0.5 + 4.5
\end{align}
Note that this value is 6. As this is lower than the previous value of 8, we headed in the right direction: we have reduced the sum of the absolute errors, so we are closer to the \(L_1\)-optimal estimates.
We got closer. Note, however, that the residual at the highest \(x\)-value is still large and positive, while the others are small. This suggests we should increase the slope yet again.
So, let us select our third line as \(\ell_3 : y = 0 + 2x\). Again, we are forcing the line to pass through the dimension-wise median. Do we need to do this? No; there are algorithms that do not impose this restriction. Again, the actual mathematics cannot reasonably be done by hand; I include this part so that you can get a feel for what the computer is doing. The last two columns in the table above provide the predictions and residuals for this third line.
The value of the target function for this third line is \(Q_3 = 6\). This value is not lower than \(Q_2\). Thus, line \(\ell_3\) is no better a fit than line \(\ell_2\). The next line, \(\ell_4\), needs to take this into consideration.
This process would continue until the change in target function values is "small enough." Usually, we define "small enough" as being less than some tolerance, like \(\tau = 0.000\, 001\).
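To make the iteration concrete, here is a minimal sketch in R of the kind of crude search described above; it is emphatically not the algorithm any package actually uses. It keeps the line pinned to the dimension-wise median and simply tries a grid of slopes, keeping the one with the smallest target-function value (all object names are illustrative):
x <- c(0, 1, 2, 3)
y <- c(1, 3, 3, 9)
medx <- median(x)
medy <- median(y)

# L1 target function for a line with slope b1 forced through (medx, medy)
Q <- function(b1) sum(abs(y - (medy + b1 * (x - medx))))

slopes <- seq(0, 3, by = 0.1)    # a crude grid of candidate slopes
Qvals  <- sapply(slopes, Q)      # target-function value for each candidate
slopes[which.min(Qvals)]         # the best slope on this grid
min(Qvals)                       # the corresponding target-function value
On this grid, the search settles near a slope of about 1.33 with a target value of about 5.33, the best this median-pinned family of lines can do. A real algorithm would also let the intercept move rather than forcing the line through the median.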
The Big Question
The big question is how we get from one "line-of-best-fit" to the next, from \(\ell_{i}\) to \(\ell_{i+1}\). Unfortunately, there is no single "best method" for minimizing the \(L_1\) norm when there are more points than dimensions. It is even worse: we were able to find closed-form solutions for the unique estimators under the \(L_2\) norm (squaring); that cannot be done under the \(L_1\) norm (absolute values). There are multiple appropriate algorithms, and the estimators may not be unique. Those are just a few of the problems that arise when working with the \(L_1\) norm.
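The non-uniqueness is easy to see in the simplest possible case: an intercept-only model with an even number of observations, where any value between the two middle order statistics gives the same sum of absolute errors. A two-observation check in R:
y <- c(1, 3)
sum(abs(y - 1.5))   # 2
sum(abs(y - 2.0))   # 2
sum(abs(y - 2.5))   # 2; every value between 1 and 3 minimizes the L1 error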
For those interested, here are some methods:
- Barrodale and Roberts (1973),
- Koenker and Bassett (1978),
- Koenker and d'Orey (1987, 1994),
- Li and Arce (2004), and
- Shu-guang and Jian-wen (1992).
These algorithms make use of different paradigms, different ways of seeing the problem. That is what makes studying statistics fun and interesting: looking at a problem differently may be the key to its solution.
The R Function
In R, a function to perform median regression is rq in the quantreg package, which does not come with the default R installation; thus, you will need to install it from CRAN. Its use is very similar to what we are used to. While the rq function allows you to select among several optimization methods, the default is the Barrodale and Roberts (1974) method. In my experience, the choice of optimization algorithm matters little for real data. If the data are all integers, however, there may be issues with non-unique solutions or non-convergent algorithms; the cause in those cases is the non-uniqueness of the median.
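For instance, the one-time installation and a call that makes the defaults explicit might look like the following (the formula here is just a placeholder):
# install.packages("quantreg")                 # one-time installation from CRAN
library(quantreg)
# mod <- rq(y ~ x, tau = 0.5, method = "br")   # "br" selects the Barrodale and Roberts method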
Using median regression, what is the relationship between the violent crime rates in 2000 and 1990 in the crime data?
Solution.
The following code estimates the median regression line for the relationship between the violent crime rates in 2000 and 1990 in the crime data:
library(quantreg)                                         # provides the rq function
dt = read.csv("http://rfs.kvasaheim.com/data/crime.csv")  # load the crime data
attach(dt)                                                # make the columns directly available
mod1 = rq(vcrime00~vcrime90)                              # median (tau = 0.5) regression
summary(mod1)
The following is the output:
Call: rq(formula = vcrime00 ~ vcrime90)
tau: [1] 0.50
Coefficients:
coefficients lower bd upper bd
(Intercept) 93.24525 72.83955 102.31731
vcrime90 0.57764 0.57518 0.62676
The output is the usual output. The value of tau is 0.50 because we are examining the regression line for the median, the 50th percentile.
The coefficients are the estimates of the intercept and slope. The lower and upper bounds form a 95% confidence interval for each parameter. There are no p-values because the distribution of the estimators does not follow a convenient test distribution. However, because we have an approximate confidence interval, we have even more information than a simple p-value would give. We are 95% confident that the slope relating the 2000 violent crime rate to the 1990 rate is between 0.575 and 0.627. Since this interval does not include 0, we can conclude that there is a significant relationship between the two variables.
\(\blacksquare\)
I leave it as an exercise for you to see that the OLS estimator for that effect is 0.581, with a 95% confidence interval from 0.518 to 0.643.
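One way to carry out that exercise, reusing the columns attached above (the object name mod2 is arbitrary):
mod2 = lm(vcrime00~vcrime90)   # the ordinary least squares fit
summary(mod2)                  # point estimates and standard errors
confint(mod2)                  # 95% confidence intervals for the coefficients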
Clearly, there is a difference between the two estimation methods. That difference is in how the method is affected by influential points like the District of Columbia. Median regression reduces the influence of DC, while ordinary least squares does not. The absolute value function increases linearly as the residual increases. The squaring function increases quadratically as the residual increases. Thus, ordinary least squares will work harder to avoid making the DC residual too big. Median regression will not weight it as heavily.
Also, ordinary least squares models the conditional mean (the expected value of \(Y\) given the value of \(x\)). Median regression models the conditional median (the median of \(Y\) given the value of \(x\)).
Here is another example of using median regression. What is the relationship between the property crime rates in 1990 and 2000?
Solution.
The code is quite similar to that above:
mod3 = rq(pcrime00~pcrime90)
summary(mod3)
The following is the partial output:
Coefficients:
coefficients lower bd upper bd
(Intercept) 730.46936 349.56585 1093.31979
pcrime90 0.60584 0.50893 0.77457
Again, the relationship is positive. A point estimate for that relationship is \(\tilde{\beta}_1 = 0.606\), with a 95% confidence interval from 0.509 to 0.775. I again leave it as an exercise for you to show that the OLS estimator is 0.582 with a 95% confidence interval from 0.458 to 0.707.
\(\blacksquare\)
When the data are "well behaved," without influential points, there tends to be little difference between the two sets of estimates. Figure \(\PageIndex{1}\), below, illustrates this.
What does the above graphic tell us?
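One way to draw such a comparison yourself, though not necessarily how Figure \(\PageIndex{1}\) was produced, is to plot the data and overlay both fitted lines; this sketch reuses the attached crime data and mod1 from above:
plot(vcrime90, vcrime00,
     xlab = "Violent crime rate, 1990", ylab = "Violent crime rate, 2000")
abline(lm(vcrime00~vcrime90), lty = 2)   # ordinary least squares line (dashed)
abline(coef = coef(mod1))                # median regression line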