12.2.5: Outliers


A scatter plot should be checked for outliers. An outlier is a point that seems out of place when compared with the other points. Some of these points can affect the equation of the regression line.

Should linear regression be used with this data set?

\[
\begin{array}{c|cccccccccc}
x & 1 & 3 & 8 & 2 & 1 & 3 & 2 & 2 & 3 & 1 \\ \hline
y & 2 & 3 & 8 & 2 & 3 & 1 & 3 & 1 & 2 & 1
\end{array}
\]

Solution

A regression analysis for the data set was run in Excel.

Excel-generated table of regression statistics for the given data.

If we test for a significant correlation:

\(H_{0}: \rho = 0\)
\(H_{1}: \rho \neq 0\)

The correlation is \(r = 0.844\) and the p-value is 0.002, which is less than \(\alpha = 0.05\), so we would reject \(H_{0}\) and conclude there is a significant linear relationship between \(x\) and \(y\).
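The test above can be reproduced by hand. The following is a minimal sketch in plain Python, assuming the usual test statistic \(t = r\sqrt{(n-2)/(1-r^{2})}\) with \(n - 2\) degrees of freedom. It compares \(t\) with the two-tailed critical value 2.306 (taken from a standard \(t\) table for \(\alpha = 0.05\), 8 degrees of freedom) rather than computing the p-value that Excel reports. The variable names are illustrative.

```python
import math

# Data from the example above.
x = [1, 3, 8, 2, 1, 3, 2, 2, 3, 1]
y = [2, 3, 8, 2, 3, 1, 3, 1, 2, 1]
n = len(x)

x_bar = sum(x) / n
y_bar = sum(y) / n

# Sums of squares and cross-products about the means.
s_xx = sum((xi - x_bar) ** 2 for xi in x)
s_yy = sum((yi - y_bar) ** 2 for yi in y)
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))

# Sample correlation coefficient.
r = s_xy / math.sqrt(s_xx * s_yy)

# Test statistic for H0: rho = 0, with n - 2 degrees of freedom.
t_stat = r * math.sqrt((n - 2) / (1 - r ** 2))

# Assumption: critical value read from a t table (two-tailed, alpha = 0.05, df = 8).
T_CRITICAL = 2.306
reject_h0 = abs(t_stat) > T_CRITICAL

print(f"r = {r:.3f}, t = {t_stat:.3f}, reject H0: {reject_h0}")
```

Because \(|t|\) exceeds the critical value, the sketch reaches the same decision as the p-value comparison.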

However, if we look at the scatterplot with the regression line in Figure 12-18, we can clearly see that the point \((8,8)\) is an outlier. The outlier is pulling the slope up towards the point \((8,8)\).

Scatterplot of the given data with a linear regression equation of y = 0.8438x + 0.4063, an R-squared value of 0.7119, and an outlier at point (8, 8). — Figure 12-18: Scatterplot of data with an outlier.

If we remove the outlier \((8,8)\) and rerun the regression analysis on the modified data set, we get the following Excel output.

Excel-generated regression statistics table of the given data with the (8, 8) outlier removed.

See Figure 12-19. The correlation is now 0 and the p-value is 1, so there is no linear relationship at all between \(x\) and \(y\).

Scatterplot of the given data with the outlier point (8, 8) removed. Plot now takes the form of a 3-by-3 grid of points bounded by x and y values of 1 and 3, with the linear regression line equation now being y=2 and the R-squared value being 0. — Figure 12-19: Scatterplot of the same data as above with outlier removed.
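The effect of the single point \((8,8)\) can be checked directly. The sketch below fits the least-squares line with and without that point; `fit_line` is a hypothetical helper written for this illustration, not part of Excel or any library.

```python
import math

def fit_line(x, y):
    """Return (slope, intercept, r) of the least-squares line through the points."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    s_yy = sum((yi - y_bar) ** 2 for yi in y)
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = s_xy / s_xx
    intercept = y_bar - slope * x_bar
    r = s_xy / math.sqrt(s_xx * s_yy)
    return slope, intercept, r

x = [1, 3, 8, 2, 1, 3, 2, 2, 3, 1]
y = [2, 3, 8, 2, 3, 1, 3, 1, 2, 1]

# Fit using all ten points.
full = fit_line(x, y)

# Drop the suspected outlier (8, 8) and refit.
kept = [(xi, yi) for xi, yi in zip(x, y) if (xi, yi) != (8, 8)]
x_trim = [p[0] for p in kept]
y_trim = [p[1] for p in kept]
trimmed = fit_line(x_trim, y_trim)

print("with (8, 8):    slope=%.4f intercept=%.4f r=%.4f" % full)
print("without (8, 8): slope=%.4f intercept=%.4f r=%.4f" % trimmed)
```

With all ten points the fit matches the equation in Figure 12-18; with the point removed, the slope and correlation both drop to 0 and the fitted line is \(y = 2\).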

This type of outlier is called a leverage point. Leverage points lie far from the main cluster of data points in the \(x\)-direction. Here, \(x = 8\) is well beyond the other \(x\)-values, which all fall between 1 and 3.
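One common way to quantify how far a point sits from the others in the \(x\)-direction is its leverage value, which for simple linear regression is \(h_i = \frac{1}{n} + \frac{(x_i - \bar{x})^2}{\sum_j (x_j - \bar{x})^2}\). The sketch below computes it for the first data set. The cutoff of \(2 \cdot 2 / n\) is a widely used rule of thumb and an assumption here, not something taken from the Excel output.

```python
# x-values from the first example; leverage depends only on x.
x = [1, 3, 8, 2, 1, 3, 2, 2, 3, 1]
n = len(x)

x_bar = sum(x) / n
s_xx = sum((xi - x_bar) ** 2 for xi in x)

# Leverage (hat value) of each observation in simple linear regression.
leverage = [1 / n + (xi - x_bar) ** 2 / s_xx for xi in x]

# Assumption: rule-of-thumb cutoff 2p/n with p = 2 coefficients (slope, intercept).
cutoff = 2 * 2 / n
high_leverage = [(xi, round(h, 4)) for xi, h in zip(x, leverage) if h > cutoff]

print(high_leverage)
```

Only the observation with \(x = 8\) exceeds the cutoff; the leverage values always sum to the number of fitted coefficients, which is 2 here.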

There is another type of outlier called an influential point. Influential points lie far from the main cluster of data points in the \(y\)-direction. Most software packages have an option to report “standardized” residuals, which are z-scores of the residuals. Any standardized residual that is not between \(-2\) and \(2\) may be an outlier. If it is not between \(-3\) and \(3\), the point is an outlier. Such points are called influential points or influential observations.

Use technology to compute the standardized residuals. Should linear regression be used with this data set?

\[
\begin{array}{c|cccccccccc}
x & 1 & 3 & 2 & 2 & 4 & 5 & 7 & 9 & 6 & 8 \\ \hline
y & 1 & 3 & 10 & 2 & 4 & 5 & 7 & 9 & 6 & 8
\end{array}
\]

Solution

A regression analysis for the given data set was run in Excel, producing the following results:

Regression analysis table for the given data. Observation 3, the data point (2, 10), has a standard residual of 2.671, which is highlighted. All other observations have standard residual values between -1 and 1.
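The standardized residuals can also be computed without Excel. The sketch below assumes one particular convention: each residual is divided by the sample standard deviation of the residuals (with \(n - 1\) in the denominator), which reproduces the 2.671 reported for observation 3. Other packages may scale residuals differently, so their values need not match exactly.

```python
import math

# Data from the second example.
x = [1, 3, 2, 2, 4, 5, 7, 9, 6, 8]
y = [1, 3, 10, 2, 4, 5, 7, 9, 6, 8]
n = len(x)

x_bar = sum(x) / n
y_bar = sum(y) / n
s_xx = sum((xi - x_bar) ** 2 for xi in x)
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))

# Least-squares line.
slope = s_xy / s_xx
intercept = y_bar - slope * x_bar

# Residual = observed y minus predicted y.
residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]

# Assumed convention: scale by the sample standard deviation of the residuals.
# The residuals of a least-squares fit with an intercept have mean zero.
resid_sd = math.sqrt(sum(e ** 2 for e in residuals) / (n - 1))
std_resid = [e / resid_sd for e in residuals]

# Flag any observation whose standardized residual falls outside [-2, 2].
flagged = [(xi, yi, round(z, 3)) for xi, yi, z in zip(x, y, std_resid) if abs(z) > 2]

print(flagged)
```

Only the point \((2, 10)\) is flagged. Its standardized residual lies between 2 and 3, so by the rule above it may be an outlier.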

The point \((2, 10)\) shown in Figure 12-20 is pulling the left side of the line up and away from the points that form a line. This influential point changes the \(y\)-intercept and slope.

Scatterplot of the given data points. All points except for (2, 10) appear to line up; the regression line does not pass evenly through these points, but is tilted towards the (2, 10) point. — Figure 12-20:
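To see how much the point \((2, 10)\) changes the line, the fit can be repeated without it. This is a minimal sketch; `slope_intercept` is a hypothetical helper written for this illustration.

```python
def slope_intercept(x, y):
    """Return (slope, intercept) of the least-squares line through the points."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = s_xy / s_xx
    return slope, y_bar - slope * x_bar

x = [1, 3, 2, 2, 4, 5, 7, 9, 6, 8]
y = [1, 3, 10, 2, 4, 5, 7, 9, 6, 8]

# Fit with every observation.
with_point = slope_intercept(x, y)

# Refit after removing the suspected influential point (2, 10).
kept = [(xi, yi) for xi, yi in zip(x, y) if (xi, yi) != (2, 10)]
without_point = slope_intercept([p[0] for p in kept], [p[1] for p in kept])

print("with (2, 10):    slope=%.4f intercept=%.4f" % with_point)
print("without (2, 10): slope=%.4f intercept=%.4f" % without_point)
```

The remaining nine points all satisfy \(y = x\), so the refit has slope 1 and intercept 0, while the full data set gives a flatter line with a larger intercept.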