15.4: Quantifying the Fit of the Regression Model
So we now know how to estimate the coefficients of a linear regression model. The problem is, we don’t yet know if this regression model is any good. For example, the
regression.1
model
claims
that every hour of sleep will improve my mood by quite a lot, but it might just be rubbish. Remember, the regression model only produces a prediction \(\ \hat{Y_i}\) about what my mood is like: my actual mood is Y
i
. If these two are very close, then the regression model has done a good job. If they are very different, then it has done a bad job.
R 2 value
Once again, let’s wrap a little bit of mathematics around this. Firstly, we’ve got the sum of the squared residuals:
\(\ SS_{res} = \sum_i(Y_i - \hat{Y_i})^2\)
which we would hope to be pretty small. Specifically, what we’d like is for it to be very small in comparison to the total variability in the outcome variable,
\(\ SS_{tot} = \sum_i(Y_i - \bar{Y})^2\)
While we’re here, let’s calculate these values in R. Firstly, in order to make my R commands look a bit more similar to the mathematical equations, I’ll create variables
X
and
Y
:
X <- parenthood$dan.sleep # the predictor
Y <- parenthood$dan.grump # the outcome
Now that we’ve done this, let’s calculate the \(\ \hat{Y}\) values and store them in a variable called
Y.pred
. For the simple model that uses only a single predictor,
regression.1
, we would do the following:
Y.pred <- -8.94 * X + 125.97
Okay, now that we’ve got a variable which stores the regression model predictions for how grumpy I will be on any given day, let’s calculate our sum of squared residuals. We would do that using the following command:
SS.resid <- sum( (Y - Y.pred)^2 )
print( SS.resid )
## [1] 1838.722
Wonderful. A big number that doesn’t mean very much. Still, let’s forge boldly onwards anyway, and calculate the total sum of squares as well. That’s also pretty simple:
SS.tot <- sum( (Y - mean(Y))^2 )
print( SS.tot )
## [1] 9998.59
Hm. Well, it’s a much bigger number than the last one, so this does suggest that our regression model was making good predictions. But it’s not very interpretable.
Perhaps we can fix this. What we’d like to do is to convert these two fairly meaningless numbers into one number. A nice, interpretable number, which for no particular reason we’ll call R 2 . What we would like is for the value of R 2 to be equal to 1 if the regression model makes no errors in predicting the data. In other words, if it turns out that the residual errors are zero – that is, if SS res =0 – then we expect R 2 =1. Similarly, if the model is completely useless, we would like R 2 to be equal to 0. What do I mean by “useless”? Tempting as it is demand that the regression model move out of the house, cut its hair and get a real job, I’m probably going to have to pick a more practical definition: in this case, all I mean is that the residual sum of squares is no smaller than the total sum of squares, SS res =SS tot . Wait, why don’t we do exactly that? The formula that provides us with out R 2 value is pretty simple to write down,
\(\ R^2 ={ 1 -{ SS_{res} \over SS_{tot} }}\)
and equally simple to calculate in R:
R.squared <- 1 - (SS.resid / SS.tot)
print( R.squared )
## [1] 0.8161018
The R
2
value, sometimes called the
coefficient of
determination
216
has a simple interpretation: it is the
proportion
of the variance in the outcome variable that can be accounted for by the predictor. So in this case, the fact that we have obtained R
2
=.816 means that the predictor (
my.sleep
) explains 81.6% of the variance in the outcome (
my.grump
).
Naturally, you don’t actually need to type in all these commands yourself if you want to obtain the R
2
value for your regression model. As we’ll see later on in Section 15.5.3, all you need to do is use the
summary()
function. However, let’s put that to one side for the moment. There’s another property of R
2
that I want to point out.
relationship between regression and correlation
At this point we can revisit my earlier claim that regression, in this very simple form that I’ve discussed so far, is basically the same thing as a correlation. Previously, we used the symbol r to denote a Pearson correlation. Might there be some relationship between the value of the correlation coefficient r and the R 2 value from linear regression? Of course there is: the squared correlation r 2 is identical to the R 2 value for a linear regression with only a single predictor. To illustrate this, here’s the squared correlation:
r <- cor(X, Y) # calculate the correlation
print( r^2 ) # print the squared correlation
## [1] 0.8161027
Yep, same number. In other words, running a Pearson correlation is more or less equivalent to running a linear regression model that uses only one predictor variable.
adjusted R 2 value
One final thing to point out before moving on. It’s quite common for people to report a slightly different measure of model performance, known as “adjusted R 2 ”. The motivation behind calculating the adjusted R 2 value is the observation that adding more predictors into the model will always call the R 2 value to increase (or at least not decrease). The adjusted R 2 value introduces a slight change to the calculation, as follows. For a regression model with K predictors, fit to a data set containing N observations, the adjusted R 2 is:
\(\operatorname{adj} . R^{2}=1-\left(\dfrac{\mathrm{SS}_{r e s}}{\mathrm{SS}_{t o t}} \times \dfrac{N-1}{N-K-1}\right)\)
This adjustment is an attempt to take the degrees of freedom into account. The big advantage of the adjusted R 2 value is that when you add more predictors to the model, the adjusted R 2 value will only increase if the new variables improve the model performance more than you’d expect by chance. The big disadvantage is that the adjusted R 2 value can’t be interpreted in the elegant way that R 2 can. R 2 has a simple interpretation as the proportion of variance in the outcome variable that is explained by the regression model; to my knowledge, no equivalent interpretation exists for adjusted R 2 .
An obvious question then, is whether you should report R 2 or adjusted R 2 . This is probably a matter of personal preference. If you care more about interpretability, then R 2 is better. If you care more about correcting for bias, then adjusted R 2 is probably better. Speaking just for myself, I prefer R 2 : my feeling is that it’s more important to be able to interpret your measure of model performance. Besides, as we’ll see in Section 15.5, if you’re worried that the improvement in R 2 that you get by adding a predictor is just due to chance and not because it’s a better model, well, we’ve got hypothesis tests for that.