8.2: Least squares regression
Fitting linear models by eye is open to criticism since it is based on an individual’s preference. In this section, we use least squares regression as a more rigorous approach.
Gift aid for freshmen at Elmhurst College
This section considers family income and gift aid data from a random sample of fifty students in the freshman class of Elmhurst College in Illinois. Gift aid is financial aid that does not need to be paid back, as opposed to a loan. A scatterplot of the data is shown in Figure [elmhurstScatterW2Lines] along with two linear fits. The lines follow a negative trend in the data; students who had higher family incomes tended to receive less gift aid from the university.
Is the correlation positive or negative in Figure [elmhurstScatterW2Lines]?
An objective measure for finding the best line
We begin by thinking about what we mean by “best”. Mathematically, we want a line that has small residuals. The first option that may come to mind is to minimize the sum of the residual magnitudes:
\[\begin{aligned} |e_1| + |e_2| + \dots + |e_n|\end{aligned}\]
which we could accomplish with a computer program. The resulting dashed line shown in Figure [elmhurstScatterW2Lines] demonstrates this fit can be quite reasonable. However, a more common practice is to choose the line that minimizes the sum of the squared residuals:
\[\begin{aligned} e_{1}^2 + e_{2}^2 + \dots + e_{n}^2\end{aligned}\]
The line that minimizes this is represented as the solid line in Figure [elmhurstScatterW2Lines]. This is commonly called the least squares line. The following are three possible reasons to choose this option instead of trying to minimize the sum of residual magnitudes without any squaring:
- It is the most commonly used method.
- Computing the least squares line is widely supported in statistical software.
- In many applications, a residual twice as large as another residual is more than twice as bad. For example, being off by 4 is usually more than twice as bad as being off by 2. Squaring the residuals accounts for this discrepancy.
The first two reasons are largely for tradition and convenience; the last reason explains why the least squares criterion is typically most helpful.
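To make the least squares criterion concrete, the sketch below uses a small made-up data set: it computes the coefficients from the closed-form summary-statistic formulas and then checks that perturbing them in any direction only increases the sum of squared residuals. (The data and the perturbations are purely illustrative.)

```python
import numpy as np

# Hypothetical small dataset (x, y) for illustration.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

def sum_sq_resid(b0, b1):
    """Sum of squared residuals e_1^2 + ... + e_n^2 for the line b0 + b1*x."""
    residuals = y - (b0 + b1 * x)
    return np.sum(residuals ** 2)

# Closed-form least squares estimates: b1 = (s_y / s_x) * R, b0 = ybar - b1*xbar.
R = np.corrcoef(x, y)[0, 1]
b1 = (y.std(ddof=1) / x.std(ddof=1)) * R
b0 = y.mean() - b1 * x.mean()

best = sum_sq_resid(b0, b1)

# Any perturbation of the coefficients should increase the criterion.
for db0, db1 in [(0.5, 0), (-0.5, 0), (0, 0.1), (0, -0.1), (0.3, -0.05)]:
    assert sum_sq_resid(b0 + db0, b1 + db1) > best
print(round(b1, 3), round(b0, 3))
```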
Conditions for the least squares line
When fitting a least squares line, we generally require
- Linearity.
-
The data should show a linear trend. If there is a nonlinear trend (e.g. left panel of Figure [whatCanGoWrongWithLinearModel]), an advanced regression method from another book or later course should be applied.
- Nearly normal residuals.
-
Generally, the residuals must be nearly normal. When this condition is found to be unreasonable, it is usually because of outliers or concerns about influential points, which we’ll talk about more in later sections. An example of a residual that would be a potential concern is shown in the second panel of Figure [whatCanGoWrongWithLinearModel], where one observation is clearly much further from the regression line than the others.
- Constant variability.
-
The variability of points around the least squares line remains roughly constant. An example of non-constant variability is shown in the third panel of Figure [whatCanGoWrongWithLinearModel], which represents the most common pattern observed when this condition fails: the variability of \(y\) is larger when \(x\) is larger.
- Independent observations.
-
Be cautious about applying regression to time series data, that is, sequential observations in time such as the price of a stock each day. Such data may have an underlying structure that should be accounted for in the model and analysis. An example of a data set where successive observations are not independent is shown in the fourth panel of Figure [whatCanGoWrongWithLinearModel]. There are also other instances where correlations within the data are important, which is discussed further in Chapter [ch_regr_mult_and_log].
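Checking these conditions starts from the residuals themselves. The sketch below (with made-up data) fits a line and computes the residuals \(e_i = y_i - \hat{y}_i\); in practice one would plot them against \(x\) to look for curvature, outliers, or changing spread. Here we only verify the algebraic fact that least squares residuals sum to zero whenever an intercept is included in the model.

```python
import numpy as np

# Hypothetical data; in practice these would be the observed (x, y) pairs.
x = np.array([2.0, 4.0, 5.0, 7.0, 9.0, 11.0])
y = np.array([5.1, 8.8, 10.2, 14.1, 18.3, 21.6])

# Fit the least squares line; np.polyfit returns (slope, intercept) for degree 1.
b1, b0 = np.polyfit(x, y, 1)

# Residuals e_i = y_i - yhat_i; their pattern against x is what we inspect
# when checking linearity, normality, and constant variability.
residuals = y - (b0 + b1 * x)

print(np.allclose(residuals.sum(), 0))
```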
Should we have concerns about applying least squares regression to the Elmhurst data in Figure [elmhurstScatterW2Lines]?
Finding the least squares line
For the Elmhurst data, we could write the equation of the least squares regression line as
\[\begin{aligned} \widehat{aid} = \beta_0 + \beta_{1}\times \textit{family\_income}\end{aligned}\]
Here the equation is set up to predict gift aid based on a student’s family income, which would be useful to students considering Elmhurst. These two values, \(\beta_0\) and \(\beta_1\), are the parameters of the regression line.
As in Chapters [ch_foundations_for_inf], [ch_inference_for_props], and [ch_inference_for_means], the parameters are estimated using observed data. In practice, this estimation is done using a computer in the same way that other estimates, like a sample mean, can be estimated using a computer or calculator. However, we can also find the parameter estimates by applying two properties of the least squares line:
- The slope of the least squares line can be estimated by
\[\begin{aligned} b_1 = \frac{s_y}{s_x} R \end{aligned}\]
where \(R\) is the correlation between the two variables, and \(s_x\) and \(s_y\) are the sample standard deviations of the explanatory variable and response, respectively.
- If \(\bar{x}\) is the sample mean of the explanatory variable and \(\bar{y}\) is the sample mean of the vertical variable, then the point \((\bar{x}, \bar{y})\) is on the least squares line.
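These two properties can be checked numerically. The sketch below uses a small made-up data set and compares the property-based slope and the mean point against a direct least squares fit (NumPy's `polyfit` stands in here for any least squares routine):

```python
import numpy as np

# Hypothetical data used only to check the two properties against a direct fit.
x = np.array([1.0, 3.0, 4.0, 6.0, 8.0])
y = np.array([4.5, 8.1, 9.4, 14.2, 17.8])

slope, intercept = np.polyfit(x, y, 1)   # direct least squares fit

# Property 1: b1 = (s_y / s_x) * R, using sample standard deviations (ddof=1).
R = np.corrcoef(x, y)[0, 1]
assert np.isclose(slope, (y.std(ddof=1) / x.std(ddof=1)) * R)

# Property 2: the point (xbar, ybar) lies on the fitted line.
assert np.isclose(y.mean(), intercept + slope * x.mean())
print("both properties hold")
```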
Figure [summaryStatsElmhurstRegr] shows the sample means for family income and gift aid as $101,780 and $19,940, respectively. We could plot the point \((101.8, 19.94)\), with both values in thousands of dollars, on Figure [elmhurstScatterW2Lines] to verify it falls on the least squares line (the solid line).
Next, we formally find the point estimates \(b_0\) and \(b_1\) of the parameters \(\beta_0\) and \(\beta_1\).
| | Family Income (\(x\)) | Gift Aid (\(y\)) |
| mean | \(\bar{x} = \text{\$101,780}\) | \(\bar{y} = \text{\$19,940}\) |
| sd | \(s_x = \text{\$63,200}\) | \(s_y = \text{\$5,460}\) |
[findingTheSlopeOfTheLSRLineForIncomeAndAid] Using the summary statistics in Figure [summaryStatsElmhurstRegr], compute the slope for the regression line of gift aid against family income.
You might recall the form of a line from math class, which we can use to find the model fit, including the estimate of \(b_0\). Given the slope of a line and a point on the line, \((x_0, y_0)\), the equation for the line can be written as
\[\begin{aligned} y - y_0 = slope\times (x - x_0)\end{aligned}\]
Identifying the least squares line from summary statistics To identify the least squares line from summary statistics:
- Estimate the slope parameter, \(b_1 = (s_y / s_x) R\).
- Noting that the point \((\bar{x}, \bar{y})\) is on the least squares line, use \(x_0 = \bar{x}\) and \(y_0 = \bar{y}\) with the point-slope equation: \(y - \bar{y} = b_1 (x - \bar{x})\).
- Simplify the equation, which would reveal that \(b_0 = \bar{y} - b_1 \bar{x}\).
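These three steps translate directly into a few lines of code. A quick sketch using the Elmhurst summary statistics, where the slope is rounded to four decimal places before computing the intercept (the same convention used in the worked example):

```python
# Summary statistics for the Elmhurst data (from the table above).
x_bar, s_x = 101_780, 63_200   # family income: mean and sd, in dollars
y_bar, s_y = 19_940, 5_460     # gift aid: mean and sd, in dollars
R = -0.499                     # correlation between family income and aid

# Step 1: slope estimate b1 = (s_y / s_x) * R.
b1 = (s_y / s_x) * R           # about -0.0431

# Steps 2-3: the line passes through (x_bar, y_bar), so b0 = y_bar - b1 * x_bar.
# Rounding the slope first gives the intercept quoted in the text, about 24,327.
b0 = y_bar - round(b1, 4) * x_bar

print(round(b1, 4), round(b0))   # -0.0431 24327
```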
Using the point \((\text{101,780}, \text{19,940})\) from the sample means and the slope estimate \(b_1 = -0.0431\) from Guided Practice [findingTheSlopeOfTheLSRLineForIncomeAndAid], find the least-squares line for predicting aid based on family income. [exampleToFindLSRLineOfElmhurstData] Apply the point-slope equation using \((\text{101,780}, \text{19,940})\) and the slope \(b_1 = -0.0431\):
\[\begin{aligned} y - y_0 &= b_1 (x - x_0) \\ y - \text{19,940} &= -0.0431(x - \text{101,780}) \end{aligned}\]
Expanding the right side and then adding 19,940 to each side, the equation simplifies:
\[\begin{aligned} \widehat{aid} = \text{24,327} - 0.0431 \times \textit{family\_income} \end{aligned}\]
Here we have replaced \(y\) with \(\widehat{aid}\) and \(x\) with \(\textit{family\_income}\) to put the equation in context. The final equation should always include a “hat” on the variable being predicted, whether it is a generic “\(y\)” or a named variable like “\(aid\)”.
A computer is usually used to compute the least squares line, and a summary table generated using software for the Elmhurst regression line is shown in Figure [rOutputForIncomeAidLSRLine]. The first column of numbers provides estimates for \({b}_0\) and \({b}_1\), respectively. These results match those from Example [exampleToFindLSRLineOfElmhurstData] (with some minor rounding error).
| | Estimate | Std. Error | t value | Pr(\(>\)\(|\)t\(|\)) |
| (Intercept) | 24319.3 | 1291.5 | 18.83 | \(<\)0.0001 |
| family_income | -0.0431 | 0.0108 | -3.98 | 0.0002 |
Examine the second, third, and fourth columns in Figure [rOutputForIncomeAidLSRLine]. Can you guess what they represent? (If you have not reviewed any inference chapter yet, skip this example.) We’ll describe the meaning of the columns using the second row, which corresponds to \(\beta_1\). The first column provides the point estimate for \(\beta_1\), as we calculated in an earlier example: \(b_1 = -0.0431\). The second column is a standard error for this point estimate: \(SE_{b_1} = 0.0108\). The third column is a \(t\)-test statistic for the null hypothesis that \(\beta_1 = 0\): \(T = -3.98\). The last column is the p-value for the \(t\)-test statistic for the null hypothesis \(\beta_1 = 0\) and a two-sided alternative hypothesis: 0.0002. We will get into more of these details in Section 4.
Suppose a high school senior is considering Elmhurst College. Can she simply use the linear equation that we have estimated to calculate her financial aid from the university? She may use it as an estimate, though some qualifiers on this approach are important. First, the data all come from one freshman class, and the way aid is determined by the university may change from year to year. Second, the equation will provide an imperfect estimate. While the linear equation is good at capturing the trend in the data, no individual student’s aid will be perfectly predicted.
Interpreting regression model parameter estimates
Interpreting parameters in a regression model is often one of the most important steps in the analysis.
The intercept and slope estimates for the Elmhurst data are \(b_0 = \text{24,319}\) and \(b_1 = -0.0431\). What do these numbers really mean? Interpreting the slope parameter is helpful in almost any application. For each additional $1,000 of family income, we would expect a student to receive a net difference of \(\$\text{1,000}\times (-0.0431) = -\$43.10\) in aid on average, i.e. $43.10 less. Note that a higher family income corresponds to less aid because the coefficient of family income is negative in the model. We must be cautious in this interpretation: while there is a real association, we cannot interpret a causal connection between the variables because these data are observational. That is, increasing a student’s family income may not cause the student’s aid to drop. (It would be reasonable to contact the college and ask if the relationship is causal, i.e. if Elmhurst College’s aid decisions are partially based on students’ family income.)
The estimated intercept \(b_0 = \text{24,319}\) describes the average aid if a student’s family had no income. The meaning of the intercept is relevant to this application since the family income for some students at Elmhurst is $0. In other applications, the intercept may have little or no practical value if there are no observations where \(x\) is near zero.
Interpreting parameters estimated by least squares The slope describes the estimated difference in the \(y\) variable if the explanatory variable \(x\) for a case happened to be one unit larger. The intercept describes the average outcome of \(y\) if \(x=0\) and the linear model is valid all the way to \(x=0\), which in many applications is not the case.
Extrapolation is treacherous
When those blizzards hit the East Coast this winter, it proved to my satisfaction that global warming was a fraud. That snow was freezing cold. But in an alarming trend, temperatures this spring have risen. Consider this: On February \(6^{th}\) it was 10 degrees. Today it hit almost 80. At this rate, by August it will be 220 degrees. So clearly folks the climate debate rages on.
Stephen Colbert
April 6th, 2010
Linear models can be used to approximate the relationship between two variables. However, these models have real limitations. Linear regression is simply a modeling framework. The truth is almost always much more complex than our simple line. For example, we do not know how the data outside of our limited window will behave.
Use the model \(\widehat{aid} = \text{24,319} - 0.0431 \times \textit{family\_income}\) to estimate the aid of another freshman student whose family had income of $1 million. We want to calculate the aid for \(\textit{family\_income} = \text{1,000,000}\):
\[\begin{aligned} \text{24,319} - 0.0431\times \textit{family\_income} = \text{24,319} - 0.0431\times \text{1,000,000} = -\text{18,781} \end{aligned}\]
The model predicts this student will have -$18,781 in aid (!). However, Elmhurst College does not offer negative aid, i.e. it does not select some students to pay extra on top of tuition to attend.
Applying a model estimate to values outside of the realm of the original data is called extrapolation. Generally, a linear model is only an approximation of the real relationship between two variables. If we extrapolate, we are making an unreliable bet that the approximate linear relationship will be valid in places where it has not been analyzed.
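One defensive practice is to flag any prediction made outside the range of the observed data. A hedged sketch using the Elmhurst fit; the income range used here is illustrative, not the exact range of the sample:

```python
# Assumed observed income range for illustration only (the Elmhurst sample
# incomes span very roughly $0 to $275,000; the exact endpoints are a guess).
X_MIN, X_MAX = 0, 275_000

def predict_aid(family_income):
    """Predicted gift aid in dollars, plus a flag for extrapolation."""
    aid = 24_319 - 0.0431 * family_income
    extrapolating = not (X_MIN <= family_income <= X_MAX)
    return aid, extrapolating

# The $1 million input lies far outside the data, so the flag is raised.
aid, flag = predict_aid(1_000_000)
print(round(aid), flag)   # -18781 True
```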
Using \(R^2\) to describe the strength of a fit
We evaluated the strength of the linear relationship between two variables earlier using the correlation, \(R\). However, it is more common to explain the strength of a linear fit using \(R^2\), called R-squared. If provided with a linear model, we might like to describe how closely the data cluster around the linear fit.
The \(R^2\) of a linear model describes the amount of variation in the response that is explained by the least squares line. For example, consider the Elmhurst data, shown in Figure [elmhurstScatterWLSROnly]. The variance of the response variable, aid received, is about \(s_{aid}^2 \approx 29.8\) million. However, if we apply our least squares line, then this model reduces our uncertainty in predicting aid using a student’s family income. The variability in the residuals describes how much variation remains after using the model: \(s_{_{RES}}^2 \approx 22.4\) million. In short, there was a reduction of
\[\begin{aligned} \frac{s_{aid}^2 - s_{_{RES}}^2}{s_{aid}^2} = \frac{\text{29,800,000} - \text{22,400,000}} {\text{29,800,000}} = \frac{\text{7,400,000}}{\text{29,800,000}} \approx 0.25\end{aligned}\]
or about 25% in the data’s variation by using information about family income for predicting aid using a linear model. This corresponds exactly to the R-squared value:
\[\begin{aligned} R &= -0.499 &R^2 &= 0.25\end{aligned}\]
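The variance-reduction arithmetic is easy to verify directly from the two variances quoted above:

```python
# Variances for the Elmhurst fit, as given in the text.
var_aid = 29_800_000     # variance of the response (gift aid)
var_resid = 22_400_000   # variance of the residuals after fitting the line

# Fraction of the response variation explained by the least squares line.
reduction = (var_aid - var_resid) / var_aid
print(round(reduction, 2))   # 0.25

# This matches R-squared computed from the correlation.
R = -0.499
print(round(R ** 2, 2))      # 0.25
```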
If a linear model has a very strong negative relationship with a correlation of -0.97, how much of the variation in the response is explained by the explanatory variable?
Categorical predictors with two levels
Categorical variables are also useful in predicting outcomes. Here we consider a categorical predictor with two levels (recall that a level is the same as a category). We’ll consider Ebay auctions for a video game, Mario Kart for the Nintendo Wii, where both the total price of the auction and the condition of the game were recorded. Here we want to predict total price based on game condition, which takes values \(\textit{new}\) and \(\textit{used}\). A plot of the auction data is shown in Figure [marioKartNewUsed].
To incorporate the game condition variable into a regression equation, we must convert the categories into a numerical form. We will do so using an indicator variable called \(\textit{cond\_new}\), which takes value 1 when the game is new and 0 when the game is used. Using this indicator variable, the linear model may be written as
\[\begin{aligned} \widehat{price} = \beta_0 + \beta_1 \times \textit{cond\_new}\end{aligned}\]
The parameter estimates are given in Figure [marioKartNewUsedRegrSummary], and the model equation can be summarized as
\[\begin{aligned} \widehat{price} = 42.87 + 10.90 \times \textit{cond\_new}\end{aligned}\]
For categorical predictors with just two levels, the linearity assumption will always be satisfied. However, we must evaluate whether the residuals in each group are approximately normal and have approximately equal variance. As can be seen in Figure [marioKartNewUsed], both of these conditions are reasonably satisfied by the auction data.
| | Estimate | Std. Error | t value | Pr(\(>\)\(|\)t\(|\)) |
| (Intercept) | 42.87 | 0.81 | 52.67 | \(<\)0.0001 |
| cond_new | 10.90 | 1.26 | 8.66 | \(<\)0.0001 |
Interpret the two parameters estimated in the model for the price of Mario Kart in eBay auctions. The intercept is the estimated price when \(\textit{cond\_new}\) takes value 0, i.e. when the game is in used condition. That is, the average selling price of a used version of the game is $42.87.
The slope indicates that, on average, new games sell for about $10.90 more than used games.
Interpreting model estimates for categorical predictors The estimated intercept is the average value of the response variable for the first category (i.e. the category corresponding to an indicator value of 0). The estimated slope is the average difference in the response variable between the two categories.
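The fact that the least squares fit with a two-level indicator reproduces the group means can be checked directly. A sketch with made-up auction prices (not the real Mario Kart data): the intercept equals the mean price of used copies, and the slope equals the new-minus-used difference in means.

```python
import numpy as np

# Hypothetical auction prices to illustrate the indicator-variable encoding;
# cond_new is 1 for new copies and 0 for used copies.
prices   = np.array([41.0, 44.5, 43.0, 53.0, 55.5, 52.0])
cond_new = np.array([0,    0,    0,    1,    1,    1])

# np.polyfit returns (slope, intercept) for a degree-1 fit.
b1, b0 = np.polyfit(cond_new, prices, 1)

# With only two distinct x values, the least squares line passes through
# both group means: intercept = used mean, slope = new mean - used mean.
used_mean = prices[cond_new == 0].mean()
new_mean  = prices[cond_new == 1].mean()
print(np.isclose(b0, used_mean), np.isclose(b1, new_mean - used_mean))
```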
We’ll elaborate further on this topic in Chapter [ch_regr_mult_and_log], where we examine the influence of many predictor variables simultaneously using multiple regression.


