Extra Sum of Squares
When there are many predictors, it is often of interest to see whether one or a few of the predictors can estimate the mean response and predict new observations well enough on their own. This can be put in the framework of a comparison between a reduced regression model involving a subset of the variables \(X^{(1)}, \ldots, X^{(p-1)}\) and the full regression model involving all the variables.
Example 1
Housing data: Y = selling price, X(1) = house size, X(2) = assessed value. The fitted models with various predictors are:
\(Y\) vs. \(X^{(1)}\): \(\hat{Y} = 28.981 + 2.941 X^{(1)}\), SSE = 203.17, \(R^2 = 0.635\)
\(Y\) vs. \(X^{(2)}\): \(\hat{Y} = 39.440 + 0.8585 X^{(2)}\), SSE = 338.46, \(R^2 = 0.391\)
\(Y\) vs. \(X^{(1)}\) & \(X^{(2)}\): \(\hat{Y} = 27.995 + 2.746 X^{(1)} + 0.097 X^{(2)}\), SSE = 201.9277, \(R^2 = 0.6369\)
Decrease in SSE from \(Y\) vs. \(X^{(1)}\) to \(Y\) vs. \(X^{(1)}\) & \(X^{(2)}\): \(203.17 - 201.93 = 1.24\).
Decrease in SSE from \(Y\) vs. \(X^{(2)}\) to \(Y\) vs. \(X^{(1)}\) & \(X^{(2)}\): \(338.46 - 201.93 = 136.53\).
Let us denote by \(SSE(X^{(1)}, X^{(2)})\) the SSE when we include both \(X^{(1)}\) and \(X^{(2)}\) in the model. Analogous definitions apply to \(SSE(X^{(1)})\) and \(SSE(X^{(2)})\). Then define
\[SSR(X^{(2)} \mid X^{(1)}) = SSE(X^{(1)}) - SSE(X^{(1)}, X^{(2)})\]
\[SSR(X^{(1)} \mid X^{(2)}) = SSE(X^{(2)}) - SSE(X^{(1)}, X^{(2)}).\]
In this example,
\(SSR(X^{(1)} \mid X^{(2)}) = 136.53\) and \(SSR(X^{(2)} \mid X^{(1)}) = 1.24\).
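To make these computations concrete, here is a minimal Python sketch using simulated data in place of the housing data (the sample size, the variable names x1 and x2, and the coefficient values are all hypothetical); it computes the three SSEs and the two extra sums of squares:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 19
x1 = rng.normal(10, 2, n)                         # hypothetical "house size"
x2 = 5 + 2.5 * x1 + rng.normal(0, 4, n)           # hypothetical "assessed value"
y = 30 + 3 * x1 + 0.1 * x2 + rng.normal(0, 3, n)

def sse(y, *predictors):
    """SSE from the least-squares fit of y on an intercept plus the given predictors."""
    X = np.column_stack([np.ones(len(y)), *predictors])
    resid = y - X @ np.linalg.lstsq(X, y, rcond=None)[0]
    return float(resid @ resid)

sse_1 = sse(y, x1)           # SSE(X^(1))
sse_2 = sse(y, x2)           # SSE(X^(2))
sse_12 = sse(y, x1, x2)      # SSE(X^(1), X^(2))

ssr_2_given_1 = sse_1 - sse_12   # SSR(X^(2) | X^(1))
ssr_1_given_2 = sse_2 - sse_12   # SSR(X^(1) | X^(2))
print(sse_1, sse_2, sse_12, ssr_2_given_1, ssr_1_given_2)
```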
In general, if we are comparing a regression model that involves the variables \(X^{(1)}, \ldots, X^{(q-1)}\) (with \(q < p\)) against the full model, then the extra sum of squares due to inclusion of the variables \(X^{(q)}, \ldots, X^{(p-1)}\) is defined to be
\[SSR(X^{(q)}, \ldots, X^{(p-1)} \mid X^{(1)}, \ldots, X^{(q-1)}) = SSE(X^{(1)}, \ldots, X^{(q-1)}) - SSE(X^{(1)}, \ldots, X^{(p-1)}).\]
ANOVA decomposition in terms of extra sum of squares
Suppose we have two variables \(X^{(1)}\) and \(X^{(2)}\) (\(p = 3\)) in the full model. Then the following ANOVA decompositions are possible:
\[SSTO = SSR(X^{(1)}) + SSR(X^{(2)} \mid X^{(1)}) + SSE(X^{(1)}, X^{(2)})\]
\[SSTO = SSR(X^{(2)}) + SSR(X^{(1)} \mid X^{(2)}) + SSE(X^{(1)}, X^{(2)})\]
Observe that
\[SSR(X^{(2)} \mid X^{(1)}) + SSE(X^{(1)}, X^{(2)}) = SSE(X^{(1)}) \quad \text{and} \quad SSR(X^{(1)} \mid X^{(2)}) + SSE(X^{(1)}, X^{(2)}) = SSE(X^{(2)}).\]
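These decompositions can be checked numerically; the sketch below (again with simulated, hypothetical data) verifies both identities for \(p = 3\):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 50
x1 = rng.normal(size=n)
x2 = 0.6 * x1 + rng.normal(size=n)
y = 1 + 2 * x1 - x2 + rng.normal(size=n)

def sse(v, *predictors):
    X = np.column_stack([np.ones(len(v)), *predictors])
    r = v - X @ np.linalg.lstsq(X, v, rcond=None)[0]
    return float(r @ r)

ssto = float(((y - y.mean()) ** 2).sum())
sse_12 = sse(y, x1, x2)

# First decomposition: SSTO = SSR(X^(1)) + SSR(X^(2)|X^(1)) + SSE(X^(1),X^(2))
ssr_1 = ssto - sse(y, x1)
ssr_2_given_1 = sse(y, x1) - sse_12
print(np.isclose(ssto, ssr_1 + ssr_2_given_1 + sse_12))   # True

# Second decomposition: SSTO = SSR(X^(2)) + SSR(X^(1)|X^(2)) + SSE(X^(1),X^(2))
ssr_2 = ssto - sse(y, x2)
ssr_1_given_2 = sse(y, x2) - sse_12
print(np.isclose(ssto, ssr_2 + ssr_1_given_2 + sse_12))   # True
```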
Use of extra sum of squares
Recall the standard multiple regression model:
\[Y_i = \beta_0 + \beta_1 X^{(1)}_i + \cdots + \beta_{p-1} X^{(p-1)}_i + \epsilon_i, \qquad i = 1, \ldots, n, \tag{1}\]
where the \(\epsilon_i\) have mean zero and variance \(\sigma^2\), and are uncorrelated.
Earlier we defined the extra sum of squares due to inclusion of the variables \(X^{(q)}, \ldots, X^{(p-1)}\), after the variables \(X^{(1)}, \ldots, X^{(q-1)}\) have been included in the model, as
\[SSR(X^{(q)}, \ldots, X^{(p-1)} \mid X^{(1)}, \ldots, X^{(q-1)}) = SSE(X^{(1)}, \ldots, X^{(q-1)}) - SSE(X^{(1)}, \ldots, X^{(p-1)}).\]
Also, we have the ANOVA decomposition of the total variability in the response Y as
\[SSTO = SSR(X^{(1)}, \ldots, X^{(q-1)}) + SSR(X^{(q)}, \ldots, X^{(p-1)} \mid X^{(1)}, \ldots, X^{(q-1)}) + SSE(X^{(1)}, \ldots, X^{(p-1)}).\]
We can utilize the extra sum of squares to test hypotheses about the parameters.
Test for a single parameter \(\beta_k\)
Suppose we want to test \(H_0: \beta_k = 0\) against \(H_1: \beta_k \neq 0\), where \(k\) is an integer between 1 and \(p-1\). This problem can be formulated as a comparison between the full model given by (1) and the reduced model
\[Y_i = \beta_0 + \beta_1 X^{(1)}_i + \cdots + \beta_{k-1} X^{(k-1)}_i + \beta_{k+1} X^{(k+1)}_i + \cdots + \beta_{p-1} X^{(p-1)}_i + \epsilon_i, \qquad i = 1, \ldots, n. \tag{2}\]
Note that, by definition of SSE, \(SSE_{full} = SSE(X^{(1)}, \ldots, X^{(p-1)})\) with d.f. \(n-p\), and \(SSE_{red} = SSE(X^{(1)}, \ldots, X^{(k-1)}, X^{(k+1)}, \ldots, X^{(p-1)})\) with d.f. \(n-p+1\). The test statistic is
\[F^* = \frac{\bigl(SSE_{red} - SSE_{full}\bigr)/\bigl(\mathrm{d.f.}(SSE_{red}) - \mathrm{d.f.}(SSE_{full})\bigr)}{SSE_{full}/\mathrm{d.f.}(SSE_{full})} = \frac{SSR(X^{(k)} \mid X^{(1)}, \ldots, X^{(k-1)}, X^{(k+1)}, \ldots, X^{(p-1)})/1}{SSE(X^{(1)}, \ldots, X^{(p-1)})/(n-p)}.\]
Under \(H_0\) and the assumption of normality of the errors, \(F^*\) has the \(F_{1,n-p}\) distribution. So \(H_0: \beta_k = 0\) is rejected at level \(\alpha\) if \(F^* > F(1-\alpha; 1, n-p)\).
Connection with the t-test: Note that the t-test for testing \(H_0: \beta_k = 0\) against \(H_1: \beta_k \neq 0\) uses the test statistic \(t^* = b_k / s(b_k)\). It can be checked that \(F^* = (t^*)^2\). Thus, for this combination of null and alternative hypotheses, the two tests are equivalent.
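The following sketch illustrates the test on simulated data (all names and parameter values are hypothetical) and confirms numerically that \(F^* = (t^*)^2\):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
n, p = 40, 3                                       # intercept plus two predictors
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
y = 1 + 2 * x1 + rng.normal(size=n)                # beta_2 = 0, so H0 holds here

X_full = np.column_stack([np.ones(n), x1, x2])
X_red = np.column_stack([np.ones(n), x1])          # reduced model drops X^(2)

def fit_sse(X, y):
    b = np.linalg.lstsq(X, y, rcond=None)[0]
    r = y - X @ b
    return b, float(r @ r)

b_full, sse_full = fit_sse(X_full, y)
_, sse_red = fit_sse(X_red, y)

# Extra-sum-of-squares F statistic: 1 numerator d.f., n - p denominator d.f.
F = ((sse_red - sse_full) / 1) / (sse_full / (n - p))
p_value = stats.f.sf(F, 1, n - p)

# Equivalent t statistic: t* = b_k / s(b_k), with s^2(b_k) = MSE * [(X'X)^{-1}]_kk
mse = sse_full / (n - p)
s_b2 = np.sqrt(mse * np.linalg.inv(X_full.T @ X_full)[2, 2])
t = b_full[2] / s_b2

print(F, p_value, np.isclose(F, t ** 2))           # last entry is True: F* = (t*)^2
```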
Example 2
Consider the housing price data example. Suppose that you want to test \(H_0: \beta_2 = 0\) against \(H_1: \beta_2 \neq 0\). We have \(SSE(X^{(1)}, X^{(2)}) = 201.9277\) with d.f. \(n-3 = 16\), and \(SSE(X^{(1)}) = 203.17\) with d.f. \(n-2 = 17\). Therefore,
\[F^* = \frac{(203.17 - 201.9277)/1}{201.93/16} = \frac{1.2423}{12.6205} = 0.0984.\]
Since \(F^* = 0.0984 < 4.4940 = F(0.95; 1, 16)\), we cannot reject \(H_0: \beta_2 = 0\) at the 5% level of significance (check that the p-value for this test is 0.664).
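The arithmetic can be reproduced directly from the quoted SSE values; a short sketch using scipy for the critical value:

```python
from scipy import stats

sse_red, sse_full, df_full = 203.17, 201.9277, 16
F = ((sse_red - sse_full) / 1) / (sse_full / df_full)   # about 0.0984
crit = stats.f.ppf(0.95, 1, df_full)                    # about 4.494
print(F, crit, F > crit)                                # F* < critical value: do not reject H0
```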
Test for multiple parameters
Suppose we are testing \(H_0: \beta_q = \cdots = \beta_{p-1} = 0\) (where \(1 \le q < p\)) against \(H_1\): \(\beta_k \neq 0\) for at least one \(k \in \{q, \ldots, p-1\}\). Using the comparison procedure between the reduced and full models, the test statistic is
\[F^* = \frac{SSR(X^{(q)}, \ldots, X^{(p-1)} \mid X^{(1)}, \ldots, X^{(q-1)})/(p-q)}{SSE(X^{(1)}, \ldots, X^{(p-1)})/(n-p)}\]
and under \(H_0\) and the normality assumption, \(F^*\) has the \(F_{p-q, n-p}\) distribution. Reject \(H_0\) at level \(\alpha\) if \(F^* > F(1-\alpha; p-q, n-p)\).
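A sketch of a generic reduced-versus-full comparison is given below; the function name extra_ss_f_test and its interface are illustrative, not from any particular library:

```python
import numpy as np
from scipy import stats

def extra_ss_f_test(y, X_full, X_red, alpha=0.05):
    """Extra-sum-of-squares F test comparing a reduced design matrix with the full one.

    Both design matrices are assumed to contain the intercept column, and the
    reduced model's columns are assumed to be a subset of the full model's.
    """
    def sse(X):
        r = y - X @ np.linalg.lstsq(X, y, rcond=None)[0]
        return float(r @ r)

    n = len(y)
    df_full = n - X_full.shape[1]          # n - p
    df_red = n - X_red.shape[1]            # n - q
    F = ((sse(X_red) - sse(X_full)) / (df_red - df_full)) / (sse(X_full) / df_full)
    p_value = stats.f.sf(F, df_red - df_full, df_full)
    return F, p_value, p_value < alpha     # last entry: reject H0 at level alpha?
```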
Another interpretation of the extra sum of squares
It can be checked that the extra sum of squares \(SSR(X^{(k)} \mid X^{(1)}, \ldots, X^{(k-1)}, X^{(k+1)}, \ldots, X^{(p-1)})\) is the sum of squares due to regression of \(Y\) on \(X^{(k)}\) after accounting for the linear regression (of both \(Y\) and \(X^{(k)}\)) on \(\{X^{(1)}, \ldots, X^{(k-1)}, X^{(k+1)}, \ldots, X^{(p-1)}\}\). Similarly, for arbitrary \(q < p\), the extra sum of squares \(SSR(X^{(q)}, \ldots, X^{(p-1)} \mid X^{(1)}, \ldots, X^{(q-1)})\) is the sum of squares due to regression of \(Y\) on \(\{X^{(q)}, \ldots, X^{(p-1)}\}\) after accounting for the linear regression on \(\{X^{(1)}, \ldots, X^{(q-1)}\}\).
Coefficient of partial determination
The fraction
\[R^2_{Yk \mid 1, \ldots, k-1, k+1, \ldots, p-1} := \frac{SSR(X^{(k)} \mid X^{(1)}, \ldots, X^{(k-1)}, X^{(k+1)}, \ldots, X^{(p-1)})}{SSE(X^{(1)}, \ldots, X^{(k-1)}, X^{(k+1)}, \ldots, X^{(p-1)})},\]
represents the proportional reduction in the error sum of squares due to inclusion of the variable \(X^{(k)}\) in the model given by (2). Observe that \(R^2_{Yk \mid 1, \ldots, k-1, k+1, \ldots, p-1}\) is the squared correlation coefficient between the residuals obtained by regressing \(Y\) and \(X^{(k)}\) (separately) on \(X^{(1)}, \ldots, X^{(k-1)}, X^{(k+1)}, \ldots, X^{(p-1)}\). The latter correlation, denoted by \(r_{Yk \mid 1, \ldots, k-1, k+1, \ldots, p-1}\), is called the partial correlation coefficient between \(Y\) and \(X^{(k)}\) given \(X^{(1)}, \ldots, X^{(k-1)}, X^{(k+1)}, \ldots, X^{(p-1)}\).
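The residual-based characterization can be verified numerically. In the sketch below (simulated data with \(p = 3\), all names hypothetical), the partial correlation between \(Y\) and \(X^{(2)}\) given \(X^{(1)}\) is computed both from residuals and from the SSE ratio:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 60
x1 = rng.normal(size=n)
x2 = 0.5 * x1 + rng.normal(size=n)
y = 1 + x1 + 0.3 * x2 + rng.normal(size=n)

def residuals(v, *predictors):
    """Residuals from regressing v on an intercept plus the given predictors."""
    X = np.column_stack([np.ones(len(v)), *predictors])
    return v - X @ np.linalg.lstsq(X, v, rcond=None)[0]

# Route 1: partial correlation as the correlation of two sets of residuals
e_y = residuals(y, x1)     # residuals of Y on X^(1)
e_x2 = residuals(x2, x1)   # residuals of X^(2) on X^(1)
r_partial = np.corrcoef(e_y, e_x2)[0, 1]

# Route 2: proportional reduction in SSE, i.e. SSR(X^(2)|X^(1)) / SSE(X^(1))
sse_1 = float(e_y @ e_y)
e_full = residuals(y, x1, x2)
sse_12 = float(e_full @ e_full)
R2_partial = (sse_1 - sse_12) / sse_1

print(np.isclose(r_partial ** 2, R2_partial))   # True
```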
Example 3
Consider the housing price data. We have \(r_{YX^{(2)}} = r_{Y2} = 0.625\), so that \(R^2_{Y2} = (r_{Y2})^2 = 0.391\). However, \(R^2_{Y2 \mid 1} = SSR(X^{(2)} \mid X^{(1)})/SSE(X^{(1)}) = 1.2423/203.17 = 0.00611458\).
This means that there is only a 0.61% reduction in the error sum of squares due to adding the variable \(X^{(2)}\) (assessed value of house) to the model containing \(X^{(1)}\) (house size) as the predictor.
Multicollinearity
Suppose that two or more predictor variables are perfectly linearly related. Then there are infinitely many sets of regression coefficients describing the same regression model. In this case one can discard a few variables to avoid the redundancy.
In reality, such exact relationships among variables are rare. However, a near-perfect linear relationship among some of the predictors is quite possible. In the regression model \(Y = \beta_0 + \beta_1 X^{(1)} + \beta_2 X^{(2)} + \epsilon\), if \(X^{(1)}\) and \(X^{(2)}\) are strongly related, then the \(X^T X\) matrix will be nearly singular, and its inverse may not exist or may have rather large diagonal elements. Since \(s^2(b_j)\) is proportional to the \((j+1)\)-th diagonal element of \((X^T X)^{-1}\), this shows that the collinearity between \(X^{(1)}\) and \(X^{(2)}\) leads to unstable estimates of the \(\beta_j\)'s. If there are many predictor variables, the problem can be quite severe.
The coefficient of partial determination helps in detecting the presence of multicollinearity. For example, if there is collinearity between the predictor variables \(X^{(1)}\) and \(X^{(2)}\), and \(Y\) has a fairly strong linear relationship with both \(X^{(1)}\) and \(X^{(2)}\), then \(r^2_{Y1}\) and \(r^2_{Y2}\) (the squared correlation coefficients between \(Y\) and \(X^{(1)}\), and between \(Y\) and \(X^{(2)}\), respectively) will be large, but \(R^2_{Y1 \mid 2}\) and \(R^2_{Y2 \mid 1}\) will tend to have small values.
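The sketch below (with deliberately collinear simulated predictors) illustrates both symptoms: large diagonal entries of \((X^T X)^{-1}\), and large marginal \(r^2\) values alongside small coefficients of partial determination:

```python
import numpy as np

rng = np.random.default_rng(4)
n = 100
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.01, size=n)   # nearly collinear with x1
y = 1 + x1 + x2 + rng.normal(size=n)

X = np.column_stack([np.ones(n), x1, x2])
print(np.diag(np.linalg.inv(X.T @ X)))     # large entries for the x1 and x2 coefficients

def sse(v, *predictors):
    M = np.column_stack([np.ones(len(v)), *predictors])
    r = v - M @ np.linalg.lstsq(M, v, rcond=None)[0]
    return float(r @ r)

ssto = float(((y - y.mean()) ** 2).sum())
r2_y1 = 1 - sse(y, x1) / ssto                                 # large
r2_y2 = 1 - sse(y, x2) / ssto                                 # large
R2_y1_given_2 = (sse(y, x2) - sse(y, x1, x2)) / sse(y, x2)    # small
R2_y2_given_1 = (sse(y, x1) - sse(y, x1, x2)) / sse(y, x1)    # small
print(r2_y1, r2_y2, R2_y1_given_2, R2_y2_given_1)
```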
Contributors:
- Scott Brunstein
- Debashis Paul