# 15.4: Quantifying the Fit of the Regression Model

So we now know how to estimate the coefficients of a linear regression model. The problem is, we don’t yet know if this regression model is any good. For example, the regression.1 model claims that every hour of sleep will improve my mood by quite a lot, but it might just be rubbish. Remember, the regression model only produces a prediction $$\ \hat{Y_i}$$ about what my mood is like: my actual mood is Yi. If these two are very close, then the regression model has done a good job. If they are very different, then it has done a bad job.

## 15.4.1 The R2 value

Once again, let’s wrap a little bit of mathematics around this. Firstly, we’ve got the sum of the squared residuals:

$$\ SS_{res} = \sum_i(Y_i - \hat{Y_i})^2$$

which we would hope to be pretty small. Specifically, what we’d like is for it to be very small in comparison to the total variability in the outcome variable,

$$\ SS_{tot} = \sum_i(Y_i - \bar{Y})^2$$

While we’re here, let’s calculate these values in R. Firstly, in order to make my R commands look a bit more similar to the mathematical equations, I’ll create variables X and Y:

X <- parenthood$dan.sleep # the predictor Y <- parenthood$dan.grump  # the outcome

Now that we’ve done this, let’s calculate the $$\ \hat{Y}$$ values and store them in a variable called Y.pred. For the simple model that uses only a single predictor, regression.1, we would do the following:

Y.pred <- -8.94 * X  +  125.97

Okay, now that we’ve got a variable which stores the regression model predictions for how grumpy I will be on any given day, let’s calculate our sum of squared residuals. We would do that using the following command:

SS.resid <- sum( (Y - Y.pred)^2 )
print( SS.resid )
##  1838.722

Wonderful. A big number that doesn’t mean very much. Still, let’s forge boldly onwards anyway, and calculate the total sum of squares as well. That’s also pretty simple:

SS.tot <- sum( (Y - mean(Y))^2 )
print( SS.tot )

##  9998.59

Hm. Well, it’s a much bigger number than the last one, so this does suggest that our regression model was making good predictions. But it’s not very interpretable.

Perhaps we can fix this. What we’d like to do is to convert these two fairly meaningless numbers into one number. A nice, interpretable number, which for no particular reason we’ll call R2. What we would like is for the value of R2 to be equal to 1 if the regression model makes no errors in predicting the data. In other words, if it turns out that the residual errors are zero – that is, if SSres=0 – then we expect R2=1. Similarly, if the model is completely useless, we would like R2 to be equal to 0. What do I mean by “useless”? Tempting as it is demand that the regression model move out of the house, cut its hair and get a real job, I’m probably going to have to pick a more practical definition: in this case, all I mean is that the residual sum of squares is no smaller than the total sum of squares, SSres=SStot. Wait, why don’t we do exactly that? The formula that provides us with out R2 value is pretty simple to write down,

$$\ R^2 ={ 1 -{ SS_{res} \over SS_{tot} }}$$

and equally simple to calculate in R:

R.squared <- 1 - (SS.resid / SS.tot)
print( R.squared )
##  0.8161018

The R2 value, sometimes called the coefficient of determination216 has a simple interpretation: it is the proportion of the variance in the outcome variable that can be accounted for by the predictor. So in this case, the fact that we have obtained R2=.816 means that the predictor (my.sleep) explains 81.6% of the variance in the outcome (my.grump).

Naturally, you don’t actually need to type in all these commands yourself if you want to obtain the R2 value for your regression model. As we’ll see later on in Section 15.5.3, all you need to do is use the summary() function. However, let’s put that to one side for the moment. There’s another property of R2 that I want to point out.

## 15.4.2 The relationship between regression and correlation

At this point we can revisit my earlier claim that regression, in this very simple form that I’ve discussed so far, is basically the same thing as a correlation. Previously, we used the symbol r to denote a Pearson correlation. Might there be some relationship between the value of the correlation coefficient r and the R2 value from linear regression? Of course there is: the squared correlation r2 is identical to the R2 value for a linear regression with only a single predictor. To illustrate this, here’s the squared correlation:

r <- cor(X, Y)  # calculate the correlation
print( r^2 )    # print the squared correlation
##  0.8161027

Yep, same number. In other words, running a Pearson correlation is more or less equivalent to running a linear regression model that uses only one predictor variable.

## 15.4.3 The adjusted R2 value

One final thing to point out before moving on. It’s quite common for people to report a slightly different measure of model performance, known as “adjusted R2”. The motivation behind calculating the adjusted R2 value is the observation that adding more predictors into the model will always call the R2 value to increase (or at least not decrease). The adjusted R2 value introduces a slight change to the calculation, as follows. For a regression model with K predictors, fit to a data set containing N observations, the adjusted R2 is:

$$\operatorname{adj} . R^{2}=1-\left(\frac{\mathrm{SS}_{r e s}}{\mathrm{SS}_{t o t}} \times \frac{N-1}{N-K-1}\right)$$

This adjustment is an attempt to take the degrees of freedom into account. The big advantage of the adjusted R2 value is that when you add more predictors to the model, the adjusted R2 value will only increase if the new variables improve the model performance more than you’d expect by chance. The big disadvantage is that the adjusted R2 value can’t be interpreted in the elegant way that R2 can. R2 has a simple interpretation as the proportion of variance in the outcome variable that is explained by the regression model; to my knowledge, no equivalent interpretation exists for adjusted R2.

An obvious question then, is whether you should report R2 or adjusted R2. This is probably a matter of personal preference. If you care more about interpretability, then R2 is better. If you care more about correcting for bias, then adjusted R2 is probably better. Speaking just for myself, I prefer R2: my feeling is that it’s more important to be able to interpret your measure of model performance. Besides, as we’ll see in Section 15.5, if you’re worried that the improvement in R2 that you get by adding a predictor is just due to chance and not because it’s a better model, well, we’ve got hypothesis tests for that.