# Extra Sum of Squares

When there are many predictors, it is often of interest to see whether one or a few of them can estimate the mean response and predict new observations well enough. This can be framed as a comparison between a *reduced regression model*, involving a subset of the variables \(X^{(1)},...,X^{(p-1)}\), and the *full regression model*, involving all of the variables.

Example 1

Housing data: \(Y\) = selling price, \(X^{(1)}\) = house size, \(X^{(2)}\) = assessed value. The fitted models with various predictors are:

\(Y\) vs. \(X^{(1)}: \widehat{Y} = 28.981 + 2.941X^{(1)}, SSE = 203.17, R^2 = 0.635\)

\(Y\) vs. \(X^{(2)}: \widehat{Y} = 39.440 + 0.8585X^{(2)}, SSE = 338.46, R^2 = 0.391\)

\(Y\) vs. \(X^{(1)}\) & \(X^{(2)}:\) \(\widehat{Y} = 27.995 + 2.746X^{(1)} + 0.097X^{(2)}, SSE = 201.9277, R^2 = 0.6369\)

Decrease in \(SSE\) from \(Y\) vs. \(X^{(1)}\) to \(Y\) vs. \(X^{(1)}\) & \(X^{(2)}\): \(203.17 - 201.93 = 1.24.\)

Decrease in \(SSE\) from \(Y\) vs. \(X^{(2)}\) to \(Y\) vs. \(X^{(1)}\) & \(X^{(2)}\): \(338.46 - 201.93 = 136.53.\)

Let us denote by \(SSE(X^{(1)},X^{(2)})\) the \(SSE\) when we include both \(X^{(1)}\) and \(X^{(2)}\) in the model. Analogous definitions apply to \(SSE(X^{(1)})\) and \(SSE(X^{(2)})\). Then define

\(SSR(X^{(2)}|X^{(1)}) = SSE(X^{(1)}) - SSE(X^{(1)},X^{(2)})\)

\(SSR(X^{(1)}|X^{(2)}) = SSE(X^{(2)}) - SSE(X^{(1)},X^{(2)})\).

In this example,

\(SSR(X^{(1)}|X^{(2)}) = 136.53\) and \(SSR(X^{(2)}|X^{(1)}) = 1.24\).

In general, if we are comparing a regression model that involves variables \(X^{(1)},...,X^{(q-1)}\) (with \(q < p\)) against the full model, we define the **extra sum of squares** due to inclusion of variables \(X^{(q)},...,X^{(p-1)}\) to be

\(SSR(X^{(q)},...,X^{(p-1)}|X^{(1)},...,X^{(q-1)}) = SSE(X^{(1)},...,X^{(q-1)}) - SSE(X^{(1)},...,X^{(p-1)})\).
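The bookkeeping in Example 1 can be reproduced in a few lines of Python. This is only a sketch: the helper name `extra_ss` is hypothetical, and the SSE values are the rounded figures quoted above rather than results recomputed from the raw housing data.

```python
def extra_ss(sse_reduced, sse_full):
    """Extra sum of squares: the drop in SSE when predictors are added."""
    return sse_reduced - sse_full


# SSE values quoted in Example 1 (housing data).
sse_x1 = 203.17      # Y vs. X1
sse_x2 = 338.46      # Y vs. X2
sse_x1_x2 = 201.93   # Y vs. X1 and X2 (201.9277, rounded)

ssr_x2_given_x1 = extra_ss(sse_x1, sse_x1_x2)  # SSR(X2 | X1)
ssr_x1_given_x2 = extra_ss(sse_x2, sse_x1_x2)  # SSR(X1 | X2)

print(f"SSR(X2|X1) = {ssr_x2_given_x1:.2f}")
print(f"SSR(X1|X2) = {ssr_x1_given_x2:.2f}")
```

The same helper applies to the general definition: pass the SSE of the model with \(X^{(1)},...,X^{(q-1)}\) as `sse_reduced` and the SSE of the full model as `sse_full`.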

## ANOVA decomposition in terms of extra sum of squares

Suppose we have two variables \(X^{(1)}\) and \(X^{(2)}\) in the full model (\(p = 3\)). Then the following ANOVA decompositions are possible:

\(SSTO = SSR(X^{(1)}) + SSR(X^{(2)}|X^{(1)}) + SSE(X^{(1)},X^{(2)})\)

\(SSTO = SSR(X^{(2)}) + SSR(X^{(1)}|X^{(2)}) + SSE(X^{(1)},X^{(2)})\)

Observe that

\(SSR(X^{(2)}|X^{(1)}) + SSE(X^{(1)},X^{(2)}) = SSE(X^{(1)})\) and \(SSR(X^{(1)}|X^{(2)}) + SSE(X^{(1)},X^{(2)}) = SSE(X^{(2)})\).
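Both decompositions can be checked numerically. The sketch below uses a small made-up data set (not the housing data) and a hand-written least-squares fit based on the normal equations; the function names `solve`, `fitted_values`, `sse` and `ssr` are hypothetical helpers introduced only for this illustration.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def fitted_values(y, predictors):
    """Least-squares fitted values from regressing y on an intercept plus predictors."""
    rows = [[1.0] + [col[i] for col in predictors] for i in range(len(y))]
    k = len(rows[0])
    xtx = [[sum(r[a] * r[b] for r in rows) for b in range(k)] for a in range(k)]
    xty = [sum(r[a] * yi for r, yi in zip(rows, y)) for a in range(k)]
    beta = solve(xtx, xty)
    return [sum(bj * rj for bj, rj in zip(beta, r)) for r in rows]


def sse(y, predictors):
    """Error sum of squares: squared distances from y to the fitted values."""
    fit = fitted_values(y, predictors)
    return sum((yi - fi) ** 2 for yi, fi in zip(y, fit))


def ssr(y, predictors):
    """Regression sum of squares: squared distances from fitted values to the mean of y."""
    fit = fitted_values(y, predictors)
    y_bar = sum(y) / len(y)
    return sum((fi - y_bar) ** 2 for fi in fit)


# Hypothetical data, chosen only so that x1 and x2 are not perfectly collinear.
x1 = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
x2 = [2.0, 1.0, 4.0, 3.0, 6.0, 5.0, 8.0, 9.0]
y = [3.1, 3.9, 7.2, 7.8, 11.1, 11.9, 15.2, 17.1]

y_bar = sum(y) / len(y)
ssto = sum((yi - y_bar) ** 2 for yi in y)
sse_full = sse(y, [x1, x2])

# SSTO = SSR(X1) + SSR(X2|X1) + SSE(X1,X2)
first = ssr(y, [x1]) + (sse(y, [x1]) - sse_full) + sse_full
# SSTO = SSR(X2) + SSR(X1|X2) + SSE(X1,X2)
second = ssr(y, [x2]) + (sse(y, [x2]) - sse_full) + sse_full

print(f"SSTO = {ssto:.6f}, first ordering = {first:.6f}, second ordering = {second:.6f}")
```

Up to floating-point error, the three printed numbers agree, even though the individual terms \(SSR(X^{(1)})\) and \(SSR(X^{(2)})\) generally differ between the two orderings.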

## Use of extra sum of squares

Consider the standard multiple regression model:

\[Y_i = \beta_0 + \beta_1X_i^{(1)} + ... + \beta_{p-1}X_i^{(p-1)} + \epsilon_i, \quad i = 1,...,n, \tag{1}\]

where the \(\epsilon_i\) have mean zero and variance \(\sigma^2\), and are uncorrelated.

Earlier we defined the extra sum of squares due to inclusion of variables \(X^{(q)},...,X^{(p-1)}\), after variables \(X^{(1)},...,X^{(q-1)}\) have been included in the model, as

\[SSR(X^{(q)},...,X^{(p-1)}|X^{(1)},...,X^{(q-1)}) = SSE(X^{(1)},...,X^{(q-1)}) - SSE(X^{(1)},...,X^{(p-1)}) .\]

Also, we have the ANOVA decomposition of the total variability in the response \(Y\) as

\[SSTO = SSR(X^{(1)},...,X^{(q-1)}) + SSR(X^{(q)},...,X^{(p-1)}|X^{(1)},...,X^{(q-1)}) + SSE(X^{(1)},...,X^{(p-1)}) .\]

We can utilize the extra sum of squares to test hypotheses about the parameters.

### Test for a single parameter \(\beta_k\)

Suppose we want to test \(H_0: \beta_k = 0\) against \(H_1: \beta_k \neq 0\), where \(k\) is an integer between 1 and \(p-1\). This problem can be formulated as a comparison between the full model given by \((1)\) and the reduced model

\[Y_i = \beta_0 + \beta_1X_i^{(1)} + ... + \beta_{k-1}X_i^{(k-1)} + \beta_{k+1}X_i^{(k+1)} + ... + \beta_{p-1}X_i^{(p-1)} + \epsilon_i, \quad i = 1,...,n. \tag{2}\]

Note that, by definition of \(SSE\), \(SSE_{full} = SSE(X^{(1)},...,X^{(p-1)})\) with d.f. \(n-p\), and \(SSE_{red} = SSE(X^{(1)},...,X^{(k-1)},X^{(k+1)},...,X^{(p-1)})\) with d.f. \(n - p + 1\). The test statistic is

\[F^{*} = \dfrac{\dfrac{SSE_{red}-SSE_{full}}{d.f.(SSE_{red}) - d.f.(SSE_{full})}}{\dfrac{SSE_{full}}{d.f.(SSE_{full})}}\]

\[ =\dfrac{SSR(X^{(k)}|X^{(1)},...,X^{(k-1)},X^{(k+1)},...,X^{(p-1)})/1}{SSE(X^{(1)},...,X^{(p-1)})/(n-p)}.\]

Under \(H_0\) and the assumption of normality of errors, \(F^{*}\) has an \(F_{1,n-p}\) distribution. So \(H_0 : \beta_k = 0\) is rejected at level \(\alpha\) if \(F^{*} > F(1 - \alpha; 1, n - p).\)

**Connection with \(t\)-test**: Note that the \(t\)-test of \(H_0: \beta_k = 0\) against \(H_1: \beta_k \neq 0\) uses the test statistic \(t^{*} = \frac{b_k}{s(b_k)}.\) It can be checked that \(F^{*} = (t^{*})^2.\) Thus, for this combination of null and alternative hypotheses, the two tests are equivalent.

Example 2

Consider the housing price data example. Suppose that we want to test \(H_0 : \beta_2 = 0\) against \(H_1 : \beta_2 \neq 0.\) We have \(SSE(X^{(1)},X^{(2)}) = 201.9277\) with d.f. \(n - 3 = 16\), and \(SSE(X^{(1)}) = 203.17\) with d.f. \(n - 2 = 17\). Therefore,

\[F^{*} = \frac{(203.17-201.9277)/1}{201.9277/16} = \frac{1.2423}{12.6205} = 0.0984.\]

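\(F^{*} = 0.0984 < 4.4940 = F(0.95;1,16).\) Hence we cannot reject \(H_0 : \beta_2 = 0\) at the 5% level of significance (one can check that the \(p\)-value for this test, the upper-tail probability of 0.0984 under the \(F_{1,16}\) distribution, is about 0.758).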
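The arithmetic of Example 2 can be sketched in Python as follows. The function name `partial_f` is a hypothetical helper, and the critical value is copied from the text rather than computed, so the snippet needs only the standard library.

```python
import math


def partial_f(sse_reduced, df_reduced, sse_full, df_full):
    """F statistic comparing a reduced model with the full model."""
    numerator = (sse_reduced - sse_full) / (df_reduced - df_full)
    denominator = sse_full / df_full
    return numerator / denominator


# Figures quoted in Example 2 (n - 3 = 16, so n = 19 observations).
f_star = partial_f(sse_reduced=203.17, df_reduced=17, sse_full=201.9277, df_full=16)
f_crit = 4.4940  # F(0.95; 1, 16), taken from the text

print(f"F* = {f_star:.4f}")
# For a single coefficient, F* = (t*)^2, so |t*| is the square root of F*.
print(f"|t*| = {math.sqrt(f_star):.4f}")
print("reject H0" if f_star > f_crit else "cannot reject H0")
```

Because `partial_f` takes the degrees of freedom of both models as arguments, the same call pattern covers tests of several coefficients at once, where the numerator d.f. exceeds 1.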

### Test for multiple parameters

Suppose we are testing \(H_0: \beta_q = ... = \beta_{p-1} = 0\) (where \(1 \leq q < p\)) against \(H_1\): *for at least one \(k\) in* \(\{q,...,p - 1\}\), \(\beta_k \neq 0.\) Using the comparison procedure between the reduced and full models, the test statistic is

\[F^{*} =\dfrac{\dfrac{SSR(X^{(q)},...,X^{(p-1)}|X^{(1)},...,X^{(q-1)})}{p-q}} {\dfrac{SSE(X^{(1)},...,X^{(p-1)})}{n-p}} \]

and under \(H_0\) and the normality assumption, \(F^*\) has an \(F_{p-q,n-p}\) distribution. Reject \(H_0\) at level \(\alpha\) if \(F^* > F(1-\alpha; p - q, n - p).\)
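Two special cases may help place this statistic, under the same model and normality assumptions. Taking \(q = p - 1\) tests the single coefficient \(\beta_{p-1}\), so the numerator has 1 d.f. and the statistic has the same form as the single-parameter test above. Taking \(q = 1\) tests all slope coefficients at once; the reduced model then contains only the intercept, so \(SSE_{red} = SSTO\) and the statistic becomes the overall \(F\) test. The abbreviations \(MSR\) and \(MSE\) below are introduced here for the two mean squares and are not used elsewhere in this section:

\[F^{*} = \dfrac{\dfrac{SSTO - SSE(X^{(1)},...,X^{(p-1)})}{p-1}}{\dfrac{SSE(X^{(1)},...,X^{(p-1)})}{n-p}} = \dfrac{SSR(X^{(1)},...,X^{(p-1)})/(p-1)}{SSE(X^{(1)},...,X^{(p-1)})/(n-p)} = \dfrac{MSR}{MSE}.\]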

### Another Interpretation of extra sum of squares

It can be checked that the extra sum of squares \(SSR(X^{(k)}|X^{(1)},...,X^{(k-1)},X^{(k+1)},...,X^{(p-1)})\) is the sum of squares due to regression of \(Y\) on \(X^{(k)}\) after accounting for the linear regression (of both \(Y\) and \(X^{(k)}\)) on \(\{X^{(1)},...,X^{(k-1)},X^{(k+1)},...,X^{(p-1)}\}\). Similarly, for arbitrary \(q < p\), the extra sum of squares \(SSR(X^{(q)},...,X^{(p-1)}|X^{(1)},...,X^{(q-1)})\) is the sum of squares due to regression of \(Y\) on \(\{X^{(q)},...,X^{(p-1)}\}\) after accounting for the linear regression on \(\{X^{(1)},...,X^{(q-1)}\}\).

### Coefficient of partial determination

The fraction

\[R^{2}_{Y k|1,...,k-1,k+1,...,p-1} := \frac{SSR(X^{(k)}|X^{(1)},...,X^{(k-1)},X^{(k+1)},...,X^{(p-1)})}{SSE(X^{(1)},...,X^{(k-1)},X^{(k+1)},...,X^{(p-1)})},\]

represents the proportional reduction in the error sum of squares due to inclusion of variable \(X^{(k)}\) in the model given by (2). Observe that \(R^{2}_{Y k|1,...,k-1,k+1,...,p-1}\) is the *squared correlation coefficient* between the residuals obtained by regressing \(Y\) and \(X^{(k)}\) (separately) on \(X^{(1)},...,X^{(k-1)},X^{(k+1)},...,X^{(p-1)}\). The latter correlation, denoted by \(r_{Y k|1,...,k-1,k+1,...,p-1}\), is called the **partial correlation coefficient** between \(Y\) and \(X^{(k)}\) given \(X^{(1)},...,X^{(k-1)},X^{(k+1)},...,X^{(p-1)}\).

Example 3

Consider the housing price data. We have \(r_{Y X_2} = r_{Y2} = 0.625\) so that \(R^2_{Y2} = (r_{Y2})^2 = 0.391.\) However, \(R^2_{Y 2|1} = SSR(X^{(2)}|X^{(1)})/SSE(X^{(1)}) = 1.2423/203.17 = 0.00611458.\)

This means that there is only a \(0.61\%\) reduction in error sum of squares due to adding the variable \(X^{(2)}\) (assessed value of house) to the model containing \(X^{(1)}\) (house size) as the predictor.
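As a rough check of Example 3, the sketch below recomputes the coefficient of partial determination from the SSE values quoted for the housing data. The helper name `partial_r2` is hypothetical. The value of \(R^2_{Y 1|2}\) is not stated in the text; it is derived here from the same quoted figures. The last lines use the identity \(F^{*} = (n-p)\,R^2/(1-R^2)\), which follows from the definitions when a single predictor is added.

```python
def partial_r2(sse_reduced, sse_full):
    """Coefficient of partial determination: proportional drop in SSE."""
    return (sse_reduced - sse_full) / sse_reduced


# SSE values quoted for the housing data.
sse_x1, sse_x2, sse_x1_x2 = 203.17, 338.46, 201.9277

r2_y2_given_1 = partial_r2(sse_x1, sse_x1_x2)  # adding X2 to a model with X1
r2_y1_given_2 = partial_r2(sse_x2, sse_x1_x2)  # adding X1 to a model with X2

# Link to the partial F test of Example 2, where n - p = 16.
f_star = 16 * r2_y2_given_1 / (1 - r2_y2_given_1)

print(f"R^2_(Y2|1) = {r2_y2_given_1:.5f}")
print(f"R^2_(Y1|2) = {r2_y1_given_2:.5f}")
print(f"F* recovered from R^2_(Y2|1) = {f_star:.4f}")
```

The asymmetry between the two coefficients mirrors the two extra sums of squares in Example 1: adding house size to a model with assessed value removes far more error than adding assessed value to a model with house size.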

## Multicollinearity

Suppose that two or more predictor variables are *perfectly linearly related*. Then *infinitely many sets of coefficients* describe the same regression model, so the individual coefficients are not uniquely determined. In this case one can discard a few variables to remove the redundancy.

In reality, such exact relationships among variables are rare. However, a *near perfect linear relationship among some of the predictors* is quite possible. In the regression model \(Y = \beta_0 + \beta_1X^{(1)} + \beta_2X^{(2)} + \epsilon,\) if \(X^{(1)}\) and \(X^{(2)}\) are strongly related, then the \(X^TX\) matrix will be *nearly singular*, and its inverse may not exist or may have rather large diagonal elements. Since \(s^2(b_j)\) is proportional to the \((j+1)\)-th diagonal element of \((X^TX)^{-1}\), the **collinearity** between \(X^{(1)}\) and \(X^{(2)}\) leads to unstable estimates of the \(\beta_j\)'s. If there are many predictor variables, the problem can be quite severe.
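To see where the large diagonal elements come from, consider a simplified version of the two-predictor model. Assume (this step is not taken in the text above) that both predictors have been centered and scaled so that \(\sum_i X_i^{(1)} = \sum_i X_i^{(2)} = 0\) and \(\sum_i (X_i^{(1)})^2 = \sum_i (X_i^{(2)})^2 = 1\), and write \(r_{12}\) for their sample correlation. The intercept column is then orthogonal to both predictors, and the block of \(X^TX\) belonging to the two predictors, together with its inverse, is

\[\begin{pmatrix} 1 & r_{12} \\ r_{12} & 1 \end{pmatrix}, \qquad \begin{pmatrix} 1 & r_{12} \\ r_{12} & 1 \end{pmatrix}^{-1} = \frac{1}{1-r_{12}^2}\begin{pmatrix} 1 & -r_{12} \\ -r_{12} & 1 \end{pmatrix}.\]

Under these assumptions the diagonal entries of the inverse equal \(1/(1-r_{12}^2)\), which grows without bound as \(|r_{12}| \to 1\); for instance, \(r_{12} = 0.99\) gives a value of about 50.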

**Coefficient of partial determination** helps in detecting the presence of multicollinearity. For example, if there is collinearity between the predictor variables \(X^{(1)}\) and \(X^{(2)}\), and \(Y\) has a fairly strong linear relationship with both \(X^{(1)}\) and \(X^{(2)}\), then \(r^2_{Y1}\) and \(r^2_{Y2}\) (squared correlation coefficients between \(Y\) and \(X^{(1)}\), and \(Y\) and \(X^{(2)}\), respectively) will be large, but \(R^2_{Y 1|2}\) and \(R^2_{Y 2|1}\) will tend to have small values.
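The pattern described above can be illustrated on a small made-up data set in which \(X^{(2)}\) is \(X^{(1)}\) plus a small perturbation. The data, the variable names and the helper functions are all hypothetical. For the two-predictor case the sketch uses the standard first-order partial correlation identity \(r_{Y2|1} = (r_{Y2} - r_{Y1}r_{12})/\sqrt{(1-r_{Y1}^2)(1-r_{12}^2)}\), which is not derived in this section.

```python
import math


def pearson(u, v):
    """Sample Pearson correlation between two equal-length sequences."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    suv = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    suu = sum((a - mu) ** 2 for a in u)
    svv = sum((b - mv) ** 2 for b in v)
    return suv / math.sqrt(suu * svv)


def partial_r2(r_yk, r_yj, r_kj):
    """Squared partial correlation of Y and X_k given X_j (two-predictor case)."""
    r = (r_yk - r_yj * r_kj) / math.sqrt((1 - r_yj ** 2) * (1 - r_kj ** 2))
    return r ** 2


# Hypothetical data: x2 is x1 plus a small wiggle, so the two are nearly collinear.
x1 = [float(i) for i in range(1, 11)]
wiggle = [0.1 if i % 2 == 0 else -0.1 for i in range(10)]
noise = [0.5, -0.3, -0.4, 0.6, 0.2, -0.5, 0.3, -0.6, -0.2, 0.4]
x2 = [a + w for a, w in zip(x1, wiggle)]
y = [a + b + e for a, b, e in zip(x1, x2, noise)]

r_y1, r_y2, r_12 = pearson(y, x1), pearson(y, x2), pearson(x1, x2)
r2_y1_given_2 = partial_r2(r_y1, r_y2, r_12)
r2_y2_given_1 = partial_r2(r_y2, r_y1, r_12)

print(f"r^2_Y1 = {r_y1 ** 2:.4f}, r^2_Y2 = {r_y2 ** 2:.4f}, r_12 = {r_12:.4f}")
print(f"R^2_(Y1|2) = {r2_y1_given_2:.4f}, R^2_(Y2|1) = {r2_y2_given_1:.4f}")
```

On this data the two marginal squared correlations come out close to 1, while both coefficients of partial determination are much smaller, which is the signature of collinearity described above.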

## Contributors:

- Scott Brunstein
- Debashis Paul