The distance of an observation is based on its error of prediction: the greater the error of prediction, the greater the distance. The most commonly used measure of distance is the \(\textit{studentized residual}\). The \(\textit{studentized residual}\) for an observation is closely related to the error of prediction for that observation divided by the standard deviation of the errors of prediction. The difference is that the predicted score is derived from a regression equation fitted without the observation in question. The details of the computation of a \(\textit{studentized residual}\) are a bit complex and are beyond the scope of this work.
Even an observation with a large distance will not have much influence if its leverage is low. It is the combination of an observation's leverage and distance that determines its influence.
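One common way to make this combination explicit is Cook's D, the influence measure used in the example that follows. The text does not give a formula, so the form below is offered as a sketch under stated assumptions: \(r_i\) is the studentized residual computed with the full-sample error variance, \(h_i\) is the leverage taken as the \(i\)-th diagonal element of the hat matrix, and \(p\) is the number of fitted coefficients (2 for an intercept and one slope).

\[D_i = \frac{r_i^2}{p} \cdot \frac{h_i}{1 - h_i}\]

Because the factor \(h_i/(1 - h_i)\) is small when leverage is low, a large residual alone yields only a modest \(D_i\). Leverage is sometimes reported on a different scale, in which case reported values will not plug into this formula directly.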
Example \(\PageIndex{1}\)
Table \(\PageIndex{1}\) shows the leverage, \(\textit{studentized residual}\), and influence for each of the five observations in a small dataset.
Table \(\PageIndex{1}\): Example Data

| ID | X | Y | h | R | D |
|---|---|---|---|---|---|
| A | 1 | 2 | 0.39 | -1.02 | 0.40 |
| B | 2 | 3 | 0.27 | -0.56 | 0.06 |
| C | 3 | 5 | 0.21 | 0.89 | 0.11 |
| D | 4 | 6 | 0.20 | 1.22 | 0.19 |
| E | 8 | 7 | 0.73 | -1.68 | 8.86 |
In the above table, \(h\) is the leverage, \(R\) is the \(\textit{studentized residual}\), and \(D\) is Cook's measure of influence.
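The quantities in Table \(\PageIndex{1}\) can be approximated from the X and Y values alone. The following is a minimal sketch in plain Python, not the procedure the text used, and it rests on two assumptions that the text does not state: leverage is taken as the hat-matrix diagonal for a single predictor, and the studentized residual is the internally studentized form, which uses the full-sample error variance. The function name `influence_measures` is hypothetical.

```python
from math import sqrt

# Data from the example table (observations A to E).
ids = ["A", "B", "C", "D", "E"]
x = [1, 2, 3, 4, 8]
y = [2, 3, 5, 6, 7]


def influence_measures(x, y):
    """Return (leverage, studentized residual, Cook's D) lists
    for a simple linear regression of y on x."""
    n = len(x)
    p = 2  # fitted coefficients: intercept and slope
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))

    # Least-squares line fitted to the whole dataset.
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    resid = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
    mse = sum(e ** 2 for e in resid) / (n - p)

    # Assumption: leverage is the hat-matrix diagonal (one predictor).
    h = [1 / n + (xi - mean_x) ** 2 / sxx for xi in x]
    # Assumption: internally studentized residual.
    r = [e / sqrt(mse * (1 - hi)) for e, hi in zip(resid, h)]
    # Cook's D combines the residual with the leverage.
    d = [(ri ** 2 / p) * hi / (1 - hi) for ri, hi in zip(r, h)]
    return h, r, d


h, r, d = influence_measures(x, y)
for name, hi, ri, di in zip(ids, h, r, d):
    print(f"{name}: h={hi:.2f}  R={ri:.2f}  D={di:.2f}")
```

Under these assumptions, the R and D values should agree with the table to about two decimal places. The computed h values come out somewhat larger than the tabled ones (roughly 0.86 rather than 0.73 for Observation E), which suggests the table reports leverage on a different scale. The ordering of the observations by leverage is the same either way.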
\(\text{Observation A}\) has fairly high leverage, a relatively high residual, and moderately high influence.
\(\text{Observation B}\) has small leverage and a relatively small residual. It has very little influence.
\(\text{Observation C}\) has small leverage and a relatively high residual. The influence is relatively low.
\(\text{Observation D}\) has the lowest leverage and the second highest residual. Although its residual is larger in absolute value than that of \(\text{Observation A}\) (1.22 versus 1.02), its influence is considerably lower (0.19 versus 0.40) because of its low leverage.
\(\text{Observation E}\) has by far the largest leverage and the largest residual in absolute value. This combination of high leverage and high residual makes this observation extremely influential.
For each observation, Figure \(\PageIndex{1}\) shows the regression line for the whole dataset (blue) and the regression line fitted when that observation is not included (red). The observation in question is circled. Naturally, the regression line for the whole dataset is the same in all panels. The residual is calculated relative to the line for which the observation in question is not included in the analysis. The most influential observation is \(\text{Observation E}\), for which the two regression lines are very different: removing it changes the fitted line far more than removing any other observation.
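The comparison drawn in the figure can also be made numerically. The sketch below, in plain Python, fits the line to the whole dataset and then refits it with each observation left out in turn, so the two sets of coefficients can be compared directly. The helper name `fit_line` and the dictionary `loo` are hypothetical names introduced for illustration.

```python
# Data from the example table (observations A to E).
ids = ["A", "B", "C", "D", "E"]
x = [1, 2, 3, 4, 8]
y = [2, 3, 5, 6, 7]


def fit_line(x, y):
    """Return (intercept, slope) of the least-squares line of y on x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


# Line for the whole dataset (the blue line in the figure).
full = fit_line(x, y)
print(f"all data: intercept={full[0]:.3f}  slope={full[1]:.3f}")

# Line with each observation left out in turn (the red lines).
loo = {}
for i, name in enumerate(ids):
    x_rest = x[:i] + x[i + 1:]
    y_rest = y[:i] + y[i + 1:]
    loo[name] = fit_line(x_rest, y_rest)
    b0, b1 = loo[name]
    print(f"without {name}: intercept={b0:.3f}  slope={b1:.3f}  "
          f"slope change={b1 - full[1]:+.3f}")
```

Leaving out Observation E roughly doubles the slope, while leaving out any of the other observations changes it only slightly. This matches the visual impression of two very different lines in the panel for Observation E.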