Now that we have a mathematical model (the least-squares regression line) that we can use to make predictions, we want to know: How good are these predictions, and how can we measure the error in a prediction?
Example
Highway Sign Visibility
Let’s begin our investigation by predicting the maximum distance at which an 18-year-old driver can read a highway sign and then determining the error in our prediction.
We use the regression line equation:
Distance = 576 + (–3 * Age)
To predict the distance for an 18-year-old driver, we plug Age = 18 into the equation.
Predicted distance = 576 + (–3 * 18) = 522
Our prediction is that 522 feet is the maximum distance at which an 18-year-old driver can read a highway sign. Now let’s compare our prediction to the actual data for the 18-year-old driver: (18, 510).
The error in our prediction is 510 – 522 = –12.
This tells us that the actual distance for the 18-year-old driver is 12 feet shorter than the prediction. In other words, our prediction is too large: it overestimates the actual distance by 12 feet.
So in general, we have Observed data value – Predicted value = Error.
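The highway sign calculation can be sketched in a few lines of Python. This is only an illustration of the arithmetic above; the function names predict_distance and prediction_error are made up for this sketch and are not part of the original example.

```python
def predict_distance(age):
    """Predicted maximum reading distance in feet, using the regression line from the text."""
    return 576 + (-3 * age)


def prediction_error(observed, predicted):
    """Error = observed data value - predicted value."""
    return observed - predicted


predicted = predict_distance(18)            # prediction for an 18-year-old driver
error = prediction_error(510, predicted)    # observed data point is (18, 510)

print(predicted)  # 522
print(error)      # -12
```

A negative result means the observed distance is below the prediction, which matches the overestimate described above.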
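Now let’s look at the error from a different perspective. We can think of the error as a way to adjust the prediction to match the data value. Adding the predicted value to both sides of the error equation gives:
Observed data value = Predicted value + Error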
This last equation says that the observed value is the predicted value plus the error. In other words, we can think of the error as the amount that we have to add to the prediction to get the observed value. From this point of view, the error can be thought of as a correction term. If the error is positive, it means the prediction is too small (the prediction underestimates the actual y-value). If the error is negative, it means the prediction is too large (the prediction overestimates the actual y-value).
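As a rough sketch of the correction-term view, the short Python function below computes the error and reports what its sign means. The name interpret_error and the wording of the messages are assumptions made for illustration; the numbers are the ones from the highway sign example.

```python
def interpret_error(observed, predicted):
    """Return the error and a plain-language reading of its sign."""
    error = observed - predicted
    if error > 0:
        meaning = "prediction too small (underestimate)"
    elif error < 0:
        meaning = "prediction too large (overestimate)"
    else:
        meaning = "prediction matches the observed value"
    return error, meaning


error, meaning = interpret_error(observed=510, predicted=522)
print(error, meaning)

# Adding the error to the prediction recovers the observed value.
print(522 + error)  # 510
```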
In our next example, we look at prediction error from this point of view.
Example
Biology Courses
A biology department tracks the progress of students in its program. Grades in the introductory biology course have a strong linear relationship with grades in the upper-level biology courses (r = 0.91).
The least-squares regression equation is
Upper course grade = −8.9 + (1.05 * Intro course grade)
Let’s look at the predicted upper course grade for a student who makes a 75% in the introductory biology course.
Upper course grade = −8.9 + (1.05 * 75) = 69.85 ≈ 70
The regression line predicts that this student will make a 70% in the upper-level biology course.
The actual grade in the upper-level course for this student is 63%. The prediction is too high: it overestimates the data. To match the data value, we would need to subtract 7 from the prediction, so the error is 63 − 70 = −7.
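The biology calculation can be checked the same way. The sketch below follows the text in rounding the prediction to the nearest whole percent before computing the error; without rounding, the error would be about −6.85 rather than −7. The function name predict_upper_grade is hypothetical.

```python
def predict_upper_grade(intro_grade):
    """Predicted upper-level grade (percent), using the regression line from the text."""
    return -8.9 + (1.05 * intro_grade)


raw_prediction = predict_upper_grade(75)       # approximately 69.85
rounded_prediction = round(raw_prediction)     # rounded to 70, as in the text

observed = 63
error = observed - rounded_prediction          # the correction needed to match the data

print(rounded_prediction)  # 70
print(error)               # -7
```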
In the scatterplot, notice that the regression line lies above the point (75, 63). Visually, we can see that the prediction is too high. This reinforces our previous observation that the prediction overestimates the data value. We would have to adjust the prediction downward to match the data value. Viewing the error as a correction term, we see the correction has to be negative.
Notice that when a point is close to the regression line, the prediction is close to the actual upper course grade, so the error is small. The prediction error is also called a residual, so another way to say this is that points close to the regression line have a small residual.