8.1: Fitting a line, residuals, and correlation
It’s helpful to think deeply about the line fitting process. In this section, we define the form of a linear model, explore criteria for what makes a good fit, and introduce a new statistic called correlation.
Fitting a line to data
Figure [perfLinearModel] shows two variables whose relationship can be modeled perfectly with a straight line. The equation for the line is
\[\begin{aligned} y = 5 + 64.96 x\end{aligned}\]
Consider what a perfect linear relationship means: we know the exact value of \(y\) just by knowing the value of \(x\). This is unrealistic in almost any natural process. For example, if we took family income (\(x\)), this value would provide some useful information about how much financial support a college may offer a prospective student (\(y\)). However, the prediction would be far from perfect, since other factors play a role in financial support beyond a family’s finances.
Linear regression is the statistical method for fitting a line to data where the relationship between two variables, \(x\) and \(y\), can be modeled by a straight line with some error:
\[\begin{aligned} y = \beta_0 + \beta_1x + \varepsilon\end{aligned}\]
The values \(\beta_0\) and \(\beta_1\) represent the model’s parameters (\(\beta\) is the Greek letter beta), and the error is represented by \(\varepsilon\) (the Greek letter epsilon). The parameters are estimated using data, and we write their point estimates as \(b_0\) and \(b_1\). When we use \(x\) to predict \(y\), we usually call \(x\) the explanatory or predictor variable, and we call \(y\) the response. We also often drop the \(\varepsilon\) term when writing down the model, since our main focus is often on the prediction of the average outcome.
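To make the role of the error term concrete, here is a minimal sketch that simulates observations from \(y = \beta_0 + \beta_1x + \varepsilon\). The parameter values, the function name `simulate_linear_model`, and the choice of normally distributed errors are all assumptions made for illustration; none of them come from the text.

```python
import random


def simulate_linear_model(beta0, beta1, sigma, xs, seed=0):
    """Draw one y for each x from y = beta0 + beta1 * x + epsilon."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    ys = []
    for x in xs:
        # Error term; drawing it from a normal distribution is an assumption.
        epsilon = rng.gauss(0.0, sigma)
        ys.append(beta0 + beta1 * x + epsilon)
    return ys


xs = [float(x) for x in range(10)]

# Hypothetical parameter values, chosen only for illustration.
noisy = simulate_linear_model(beta0=2.0, beta1=0.5, sigma=1.0, xs=xs)
exact = simulate_linear_model(beta0=2.0, beta1=0.5, sigma=0.0, xs=xs)

for x, y_exact, y_noisy in zip(xs, exact, noisy):
    print(f"x={x:4.1f}  on the line: {y_exact:5.2f}  with error: {y_noisy:5.2f}")
```

With `sigma=0.0` every point falls exactly on the line, which is the perfect linear relationship described earlier. Any positive `sigma` produces a cloud of points around the line.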
It is rare for all of the data to fall perfectly on a straight line. Instead, it’s more common for data to appear as a cloud of points, such as those examples shown in Figure [imperfLinearModel]. In each case, the data fall around a straight line, even if none of the observations fall exactly on the line. The first plot shows a relatively strong downward linear trend, where the remaining variability in the data around the line is minor relative to the strength of the relationship between \(x\) and \(y\). The second plot shows an upward trend that, while evident, is not as strong as the first. The last plot shows a very weak downward trend in the data, so slight we can hardly notice it. In each of these examples, we will have some uncertainty regarding our estimates of the model parameters, \(\beta_0\) and \(\beta_1\). For instance, we might wonder, should we move the line up or down a little, or should we tilt it more or less? As we move forward in this chapter, we will learn about criteria for line-fitting, and we will also learn about the uncertainty associated with estimates of model parameters.
There are also cases where fitting a straight line to the data, even if there is a clear relationship between the variables, is not helpful. One such case is shown in Figure [notGoodAtAllForALinearModel] where there is a very clear relationship between the variables even though the trend is not linear. We discuss nonlinear trends in this chapter and the next, but details of fitting nonlinear models are saved for a later course.
Using linear regression to predict possum head lengths
Brushtail possums are marsupials that live in Australia, and a photo of one is shown in Figure [brushtail_possum]. Researchers captured 104 of these animals and took body measurements before releasing the animals back into the wild. We consider two of these measurements: the total length of each possum, from head to tail, and the length of each possum’s head.
[brushtail_possum]
Figure [scattHeadLTotalL] shows a scatterplot for the head length and total length of the possums. Each point represents a single possum from the data. The head and total length variables are associated: possums with an above average total length also tend to have above average head lengths. While the relationship is not perfectly linear, it could be helpful to partially explain the connection between these variables with a straight line.
We want to describe the relationship between the head length and total length variables in the possum data set using a line. In this example, we will use the total length as the predictor variable, \(x\), to predict a possum’s head length, \(y\). We could fit the linear relationship by eye, as in Figure [scattHeadLTotalLLine]. The equation for this line is
\[\begin{aligned} \hat{y} = 41 + 0.59x\end{aligned}\]
A “hat” on \(y\) is used to signify that this is an estimate. We can use this line to discuss properties of possums. For instance, the equation predicts a possum with a total length of 80 cm will have a head length of
\[\begin{aligned} \hat{y} &= 41 + 0.59\times 80 \\ &= 88.2\end{aligned}\]
The estimate may be viewed as an average: the equation predicts that possums with a total length of 80 cm will have an average head length of 88.2 mm. Absent further information about an 80 cm possum, the prediction for head length that uses the average is a reasonable estimate.
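The prediction above can be reproduced with a few lines of code. This is only a sketch: the function name `predict_head_length_mm` is made up for illustration, and the coefficients are those of the by-eye line \(\hat{y} = 41 + 0.59x\) from the text.

```python
def predict_head_length_mm(total_length_cm):
    """By-eye line from the text: y_hat = 41 + 0.59 * x."""
    return 41 + 0.59 * total_length_cm


prediction = predict_head_length_mm(80)
print(f"Predicted head length: {prediction:.1f} mm")  # prints 88.2 mm
```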
What other variables might help us predict the head length of a possum besides its length? Perhaps the relationship would be a little different for male possums than female possums, or perhaps it would differ for possums from one region of Australia versus another region. In Chapter [ch_regr_mult_and_log], we’ll learn about how we can include more than one predictor. Before we get there, we first need to better understand how to best build a simple linear model with one predictor.
Residuals
Residuals are the leftover variation in the data after accounting for the model fit:
\[\begin{aligned} \text{Data} = \text{Fit} + \text{Residual}\end{aligned}\]
Each observation will have a residual, and three of the residuals for the linear model we fit to the data are shown in Figure [scattHeadLTotalLLine]. If an observation is above the regression line, then its residual, the vertical distance from the observation to the line, is positive. Observations below the line have negative residuals. One goal in picking the right linear model is for these residuals to be as small as possible.
Let’s look closer at the three residuals featured in Figure [scattHeadLTotalLLine]. The observation marked by an “\(\times\)” has a small, negative residual of about -1; the observation marked by “\(+\)” has a large residual of about +7; and the observation marked by “\(\triangle\)” has a moderate residual of about -4. The size of a residual is usually discussed in terms of its absolute value. For example, the residual for “\(\triangle\)” is larger than that of “\(\times\)” because \(|-4|\) is larger than \(|-1|\).
Residual: difference between observed and expected
The residual of the \(i^{th}\) observation \((x_i, y_i)\) is the difference of the observed response (\(y_i\)) and the response we would predict based on the model fit (\(\hat{y}_i\)):
\[\begin{aligned} e_i = y_i - \hat{y}_i\end{aligned}\]
We typically identify \(\hat{y}_i\) by plugging \(x_i\) into the model.
The linear fit shown in Figure [scattHeadLTotalLLine] is given as \(\hat{y} = 41 + 0.59x\). Based on this line, formally compute the residual of the observation \((77.0, 85.3)\). This observation is denoted by “\(\times\)” in Figure [scattHeadLTotalLLine]. Check it against the earlier visual estimate, -1.
We first compute the predicted value of point “\(\times\)” based on the model:
\[\begin{aligned} \hat{y}_{\times} = 41+0.59x_{\times} = 41+0.59\times 77.0 = 86.4\end{aligned}\]
Next we compute the difference of the actual head length and the predicted head length:
\[\begin{aligned} e_{\times} = y_{\times} - \hat{y}_{\times} = 85.3 - 86.4 = -1.1\end{aligned}\]
The model’s error is \(e_{\times} = -1.1\) mm, which is very close to the visual estimate of -1 mm. The negative residual indicates that the linear model overpredicted head length for this particular possum.
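The same calculation can be sketched in code. The helper names `predict` and `residual` are hypothetical; the line and the observation \((77.0, 85.3)\) are taken from the text. Because the code does not round the predicted value to 86.4 before subtracting, it gives about -1.13 rather than -1.1.

```python
def predict(x):
    """By-eye line from the text: y_hat = 41 + 0.59 * x."""
    return 41 + 0.59 * x


def residual(x, y):
    """Residual e = y - y_hat: observed response minus predicted response."""
    return y - predict(x)


# Observation marked by the cross: total length 77.0 cm, head length 85.3 mm.
e_cross = residual(77.0, 85.3)
print(f"Residual: {e_cross:.2f} mm")  # prints -1.13 mm
```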
If a model underestimates an observation, will the residual be positive or negative? What about if it overestimates the observation?
Compute the residuals for the “\(+\)” observation \((85.0, 98.6)\) and the “\(\triangle\)” observation \((95.5, 94.0)\) in the figure using the linear relationship \(\hat{y} = 41 + 0.59x\).
Residuals are helpful in evaluating how well a linear model fits a data set. We often display them in a residual plot such as the one shown in Figure [scattHeadLTotalLResidualPlot] for the regression line in Figure [scattHeadLTotalLLine]. The residuals are plotted at their original horizontal locations but with the vertical coordinate as the residual. For instance, the point \((85.0,98.6)_{+}\) had a residual of 7.45, so in the residual plot it is placed at \((85.0, 7.45)\). Creating a residual plot is sort of like tipping the scatterplot over so the regression line is horizontal.
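As a rough sketch of how residual-plot coordinates are obtained, the code below maps each of the three marked observations \((x, y)\) to \((x, e)\) using the by-eye line \(\hat{y} = 41 + 0.59x\). The dictionary keys and variable names are invented for illustration, and no plotting library is assumed. The code only computes the coordinates, then orders the points by the absolute size of their residuals.

```python
# The three marked observations from the text: (total length in cm, head length in mm).
observations = {
    "cross": (77.0, 85.3),
    "plus": (85.0, 98.6),
    "triangle": (95.5, 94.0),
}


def predict(x):
    """By-eye line from the text: y_hat = 41 + 0.59 * x."""
    return 41 + 0.59 * x


# Keep the horizontal coordinate; replace the vertical one with the residual.
residual_plot_points = {
    label: (x, y - predict(x)) for label, (x, y) in observations.items()
}

# Residual size is compared in absolute value, largest first.
by_size = sorted(
    residual_plot_points,
    key=lambda label: abs(residual_plot_points[label][1]),
    reverse=True,
)

for label in by_size:
    x, e = residual_plot_points[label]
    print(f"{label:>8}: x = {x:.1f}, residual = {e:.3f}")
```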
One purpose of residual plots is to identify characteristics or patterns still apparent in data after fitting a model. Figure [sampleLinesAndResPlots] shows three scatterplots with linear models in the first row and residual plots in the second row. Can you identify any patterns remaining in the residuals?
In the first data set (first column), the residuals show no obvious patterns. The residuals appear to be scattered randomly around the dashed line that represents 0.
The second data set shows a pattern in the residuals. There is some curvature in the scatterplot, which is more obvious in the residual plot. We should not use a straight line to model these data. Instead, a more advanced technique should be used.
The last plot shows very little upwards trend, and the residuals also show no obvious patterns. It is reasonable to try to fit a linear model to the data. However, it is unclear whether there is statistically significant evidence that the slope parameter is different from zero. The point estimate of the slope parameter, labeled \(b_1\), is not zero, but we might wonder if this could just be due to chance. We will address this sort of scenario in Section 4.
Describing linear relationships with correlation
We’ve seen plots with strong linear relationships and others with very weak linear relationships. It would be useful if we could quantify the strength of these linear relationships with a statistic.
Correlation: strength of a linear relationship
Correlation, which always takes values between -1 and 1, describes the strength of the linear relationship between two variables. We denote the correlation by \(R\).
We can compute the correlation using a formula, just as we did with the sample mean and standard deviation. This formula is rather complex, and as with other statistics, we generally perform the calculations on a computer or calculator. Figure [posNegCorPlots] shows eight plots and their corresponding correlations. Only when the relationship is perfectly linear is the correlation either -1 or 1. If the relationship is strong and positive, the correlation will be near +1. If it is strong and negative, it will be near -1. If there is no apparent linear relationship between the variables, then the correlation will be near zero.
The correlation is intended to quantify the strength of a linear trend. Nonlinear trends, even when strong, sometimes produce correlations that do not reflect the strength of the relationship; see three such examples in Figure [corForNonLinearPlots].
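A small made-up example illustrates the point. Below, \(y\) is completely determined by \(x\) through \(y = x^2\), yet the correlation is essentially zero because the relationship is symmetric rather than linear. The data and the helper `correlation` are assumptions for illustration; the helper uses the standard sample correlation formula.

```python
from statistics import mean, stdev


def correlation(xs, ys):
    """Sample correlation: average product of standardized x and y values."""
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)
    n = len(xs)
    return sum(
        ((x - x_bar) / s_x) * ((y - y_bar) / s_y) for x, y in zip(xs, ys)
    ) / (n - 1)


# Made-up data: a perfect but nonlinear (quadratic) relationship.
xs = [-3, -2, -1, 0, 1, 2, 3]
ys = [x ** 2 for x in xs]

r_quadratic = correlation(xs, ys)
print(f"R for y = x^2 on symmetric x values: {r_quadratic:.3f}")
```

Knowing \(x\) tells us \(y\) exactly here, so the relationship is as strong as it can be. The near-zero correlation only says that no straight line captures it.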
No straight line is a good fit for the data sets represented in Figure [corForNonLinearPlots]. Try drawing nonlinear curves on each plot. Once you create a curve for each, describe what is important in your fit.


