# 15.4: Quantifying the Fit of the Regression Model

$$\newcommand{\vecs}[1]{\overset { \scriptstyle \rightharpoonup} {\mathbf{#1}} }$$ $$\newcommand{\vecd}[1]{\overset{-\!-\!\rightharpoonup}{\vphantom{a}\smash {#1}}}$$$$\newcommand{\id}{\mathrm{id}}$$ $$\newcommand{\Span}{\mathrm{span}}$$ $$\newcommand{\kernel}{\mathrm{null}\,}$$ $$\newcommand{\range}{\mathrm{range}\,}$$ $$\newcommand{\RealPart}{\mathrm{Re}}$$ $$\newcommand{\ImaginaryPart}{\mathrm{Im}}$$ $$\newcommand{\Argument}{\mathrm{Arg}}$$ $$\newcommand{\norm}[1]{\| #1 \|}$$ $$\newcommand{\inner}[2]{\langle #1, #2 \rangle}$$ $$\newcommand{\Span}{\mathrm{span}}$$ $$\newcommand{\id}{\mathrm{id}}$$ $$\newcommand{\Span}{\mathrm{span}}$$ $$\newcommand{\kernel}{\mathrm{null}\,}$$ $$\newcommand{\range}{\mathrm{range}\,}$$ $$\newcommand{\RealPart}{\mathrm{Re}}$$ $$\newcommand{\ImaginaryPart}{\mathrm{Im}}$$ $$\newcommand{\Argument}{\mathrm{Arg}}$$ $$\newcommand{\norm}[1]{\| #1 \|}$$ $$\newcommand{\inner}[2]{\langle #1, #2 \rangle}$$ $$\newcommand{\Span}{\mathrm{span}}$$

So we now know how to estimate the coefficients of a linear regression model. The problem is, we don’t yet know if this regression model is any good. For example, the regression.1 model claims that every hour of sleep will improve my mood by quite a lot, but it might just be rubbish. Remember, the regression model only produces a prediction $$\ \hat{Y_i}$$ about what my mood is like: my actual mood is Yi. If these two are very close, then the regression model has done a good job. If they are very different, then it has done a bad job.

## R2 value

Once again, let’s wrap a little bit of mathematics around this. Firstly, we’ve got the sum of the squared residuals:

$$\ SS_{res} = \sum_i(Y_i - \hat{Y_i})^2$$

which we would hope to be pretty small. Specifically, what we’d like is for it to be very small in comparison to the total variability in the outcome variable,

$$\ SS_{tot} = \sum_i(Y_i - \bar{Y})^2$$

While we’re here, let’s calculate these values in R. Firstly, in order to make my R commands look a bit more similar to the mathematical equations, I’ll create variables X and Y:

X <- parenthood$dan.sleep # the predictor Y <- parenthood$dan.grump  # the outcome

Now that we’ve done this, let’s calculate the $$\ \hat{Y}$$ values and store them in a variable called Y.pred. For the simple model that uses only a single predictor, regression.1, we would do the following:

Y.pred <- -8.94 * X  +  125.97

Okay, now that we’ve got a variable which stores the regression model predictions for how grumpy I will be on any given day, let’s calculate our sum of squared residuals. We would do that using the following command:

SS.resid <- sum( (Y - Y.pred)^2 )
print( SS.resid )
## [1] 1838.722

Wonderful. A big number that doesn’t mean very much. Still, let’s forge boldly onwards anyway, and calculate the total sum of squares as well. That’s also pretty simple:

SS.tot <- sum( (Y - mean(Y))^2 )
print( SS.tot )

## [1] 9998.59

Hm. Well, it’s a much bigger number than the last one, so this does suggest that our regression model was making good predictions. But it’s not very interpretable.

Perhaps we can fix this. What we’d like to do is to convert these two fairly meaningless numbers into one number. A nice, interpretable number, which for no particular reason we’ll call R2. What we would like is for the value of R2 to be equal to 1 if the regression model makes no errors in predicting the data. In other words, if it turns out that the residual errors are zero – that is, if SSres=0 – then we expect R2=1. Similarly, if the model is completely useless, we would like R2 to be equal to 0. What do I mean by “useless”? Tempting as it is demand that the regression model move out of the house, cut its hair and get a real job, I’m probably going to have to pick a more practical definition: in this case, all I mean is that the residual sum of squares is no smaller than the total sum of squares, SSres=SStot. Wait, why don’t we do exactly that? The formula that provides us with out R2 value is pretty simple to write down,

$$\ R^2 ={ 1 -{ SS_{res} \over SS_{tot} }}$$

and equally simple to calculate in R:

R.squared <- 1 - (SS.resid / SS.tot)
print( R.squared )
## [1] 0.8161018

The R2 value, sometimes called the coefficient of determination216 has a simple interpretation: it is the proportion of the variance in the outcome variable that can be accounted for by the predictor. So in this case, the fact that we have obtained R2=.816 means that the predictor (my.sleep) explains 81.6% of the variance in the outcome (my.grump).

Naturally, you don’t actually need to type in all these commands yourself if you want to obtain the R2 value for your regression model. As we’ll see later on in Section 15.5.3, all you need to do is use the summary() function. However, let’s put that to one side for the moment. There’s another property of R2 that I want to point out.

## relationship between regression and correlation

At this point we can revisit my earlier claim that regression, in this very simple form that I’ve discussed so far, is basically the same thing as a correlation. Previously, we used the symbol r to denote a Pearson correlation. Might there be some relationship between the value of the correlation coefficient r and the R2 value from linear regression? Of course there is: the squared correlation r2 is identical to the R2 value for a linear regression with only a single predictor. To illustrate this, here’s the squared correlation:

r <- cor(X, Y)  # calculate the correlation
print( r^2 )    # print the squared correlation
## [1] 0.8161027

Yep, same number. In other words, running a Pearson correlation is more or less equivalent to running a linear regression model that uses only one predictor variable.

$$\operatorname{adj} . R^{2}=1-\left(\dfrac{\mathrm{SS}_{r e s}}{\mathrm{SS}_{t o t}} \times \dfrac{N-1}{N-K-1}\right)$$