15.4: Quantifying the Fit of the Regression Model

Last updated
Save as PDF

Page ID: 4038

\( \newcommand{\vecs}[1]{\overset { \scriptstyle \rightharpoonup} {\mathbf{#1}} } \) \( \newcommand{\vecd}[1]{\overset{-\!-\!\rightharpoonup}{\vphantom{a}\smash {#1}}} \)\(\newcommand{\id}{\mathrm{id}}\) \( \newcommand{\Span}{\mathrm{span}}\) \( \newcommand{\kernel}{\mathrm{null}\,}\) \( \newcommand{\range}{\mathrm{range}\,}\) \( \newcommand{\RealPart}{\mathrm{Re}}\) \( \newcommand{\ImaginaryPart}{\mathrm{Im}}\) \( \newcommand{\Argument}{\mathrm{Arg}}\) \( \newcommand{\norm}[1]{\| #1 \|}\) \( \newcommand{\inner}[2]{\langle #1, #2 \rangle}\) \( \newcommand{\Span}{\mathrm{span}}\) \(\newcommand{\id}{\mathrm{id}}\) \( \newcommand{\Span}{\mathrm{span}}\) \( \newcommand{\kernel}{\mathrm{null}\,}\) \( \newcommand{\range}{\mathrm{range}\,}\) \( \newcommand{\RealPart}{\mathrm{Re}}\) \( \newcommand{\ImaginaryPart}{\mathrm{Im}}\) \( \newcommand{\Argument}{\mathrm{Arg}}\) \( \newcommand{\norm}[1]{\| #1 \|}\) \( \newcommand{\inner}[2]{\langle #1, #2 \rangle}\) \( \newcommand{\Span}{\mathrm{span}}\)\(\newcommand{\AA}{\unicode[.8,0]{x212B}}\)

So we now know how to estimate the coefficients of a linear regression model. The problem is, we don’t yet know if this regression model is any good. For example, the regression.1 model claims that every hour of sleep will improve my mood by quite a lot, but it might just be rubbish. Remember, the regression model only produces a prediction \(\ \hat{Y_i}\) about what my mood is like: my actual mood is Y_i. If these two are very close, then the regression model has done a good job. If they are very different, then it has done a bad job.

R² value

Once again, let’s wrap a little bit of mathematics around this. Firstly, we’ve got the sum of the squared residuals:

\(\ SS_{res} = \sum_i(Y_i - \hat{Y_i})^2\)

which we would hope to be pretty small. Specifically, what we’d like is for it to be very small in comparison to the total variability in the outcome variable,

\(\ SS_{tot} = \sum_i(Y_i - \bar{Y})^2\)

While we’re here, let’s calculate these values in R. Firstly, in order to make my R commands look a bit more similar to the mathematical equations, I’ll create variables X and Y:

X <- parenthood$dan.sleep  # the predictor
Y <- parenthood$dan.grump  # the outcome

Now that we’ve done this, let’s calculate the \(\ \hat{Y}\) values and store them in a variable called Y.pred. For the simple model that uses only a single predictor, regression.1, we would do the following:

Y.pred <- -8.94 * X  +  125.97

Okay, now that we’ve got a variable which stores the regression model predictions for how grumpy I will be on any given day, let’s calculate our sum of squared residuals. We would do that using the following command:

SS.resid <- sum( (Y - Y.pred)^2 )
print( SS.resid )

## [1] 1838.722

Wonderful. A big number that doesn’t mean very much. Still, let’s forge boldly onwards anyway, and calculate the total sum of squares as well. That’s also pretty simple:

SS.tot <- sum( (Y - mean(Y))^2 )
print( SS.tot )

## [1] 9998.59

Hm. Well, it’s a much bigger number than the last one, so this does suggest that our regression model was making good predictions. But it’s not very interpretable.

Perhaps we can fix this. What we’d like to do is to convert these two fairly meaningless numbers into one number. A nice, interpretable number, which for no particular reason we’ll call R². What we would like is for the value of R² to be equal to 1 if the regression model makes no errors in predicting the data. In other words, if it turns out that the residual errors are zero – that is, if SS_res=0 – then we expect R²=1. Similarly, if the model is completely useless, we would like R² to be equal to 0. What do I mean by “useless”? Tempting as it is demand that the regression model move out of the house, cut its hair and get a real job, I’m probably going to have to pick a more practical definition: in this case, all I mean is that the residual sum of squares is no smaller than the total sum of squares, SS_res=SS_tot. Wait, why don’t we do exactly that? The formula that provides us with out R² value is pretty simple to write down,

\(\ R^2 ={ 1 -{ SS_{res} \over SS_{tot} }}\)

and equally simple to calculate in R:

R.squared <- 1 - (SS.resid / SS.tot)
print( R.squared )

## [1] 0.8161018

The R² value, sometimes called the coefficient of determination²¹⁶ has a simple interpretation: it is the proportion of the variance in the outcome variable that can be accounted for by the predictor. So in this case, the fact that we have obtained R²=.816 means that the predictor (my.sleep) explains 81.6% of the variance in the outcome (my.grump).

Naturally, you don’t actually need to type in all these commands yourself if you want to obtain the R² value for your regression model. As we’ll see later on in Section 15.5.3, all you need to do is use the summary() function. However, let’s put that to one side for the moment. There’s another property of R² that I want to point out.

relationship between regression and correlation

At this point we can revisit my earlier claim that regression, in this very simple form that I’ve discussed so far, is basically the same thing as a correlation. Previously, we used the symbol r to denote a Pearson correlation. Might there be some relationship between the value of the correlation coefficient r and the R² value from linear regression? Of course there is: the squared correlation r² is identical to the R² value for a linear regression with only a single predictor. To illustrate this, here’s the squared correlation:

r <- cor(X, Y)  # calculate the correlation
print( r^2 )    # print the squared correlation

## [1] 0.8161027

Yep, same number. In other words, running a Pearson correlation is more or less equivalent to running a linear regression model that uses only one predictor variable.

adjusted R² value

One final thing to point out before moving on. It’s quite common for people to report a slightly different measure of model performance, known as “adjusted R²”. The motivation behind calculating the adjusted R² value is the observation that adding more predictors into the model will always call the R² value to increase (or at least not decrease). The adjusted R² value introduces a slight change to the calculation, as follows. For a regression model with K predictors, fit to a data set containing N observations, the adjusted R² is:

\(\operatorname{adj} . R^{2}=1-\left(\dfrac{\mathrm{SS}_{r e s}}{\mathrm{SS}_{t o t}} \times \dfrac{N-1}{N-K-1}\right)\)

This adjustment is an attempt to take the degrees of freedom into account. The big advantage of the adjusted R² value is that when you add more predictors to the model, the adjusted R² value will only increase if the new variables improve the model performance more than you’d expect by chance. The big disadvantage is that the adjusted R² value can’t be interpreted in the elegant way that R² can. R² has a simple interpretation as the proportion of variance in the outcome variable that is explained by the regression model; to my knowledge, no equivalent interpretation exists for adjusted R².

An obvious question then, is whether you should report R² or adjusted R². This is probably a matter of personal preference. If you care more about interpretability, then R² is better. If you care more about correcting for bias, then adjusted R² is probably better. Speaking just for myself, I prefer R²: my feeling is that it’s more important to be able to interpret your measure of model performance. Besides, as we’ll see in Section 15.5, if you’re worried that the improvement in R² that you get by adding a predictor is just due to chance and not because it’s a better model, well, we’ve got hypothesis tests for that.

R2 value

relationship between regression and correlation

adjusted R2 value

R² value

adjusted R² value