Analysis of variance approach to regression
We divide the total variability in the observed data into two parts: one coming from the errors, the other coming from the predictor.
ANOVA Decomposition
The following decomposition
$$Y_i - \bar{Y} = (\hat{Y}_i - \bar{Y}) + (Y_i - \hat{Y}_i), \qquad i = 1, 2, \ldots, n,$$
expresses the deviation of the observed response from the mean response as the sum of the deviation of the fitted value from the mean and the residual.
Taking sums of squares on both sides (the cross-product term vanishes), we obtain
$$\sum_{i=1}^{n}(Y_i - \bar{Y})^2 = \sum_{i=1}^{n}(\hat{Y}_i - \bar{Y})^2 + \sum_{i=1}^{n}(Y_i - \hat{Y}_i)^2,$$
or
$$\mathrm{SSTO} = \mathrm{SSR} + \mathrm{SSE},$$
where
$$\mathrm{SSTO} = \sum_{i=1}^{n}(Y_i - \bar{Y})^2, \qquad \mathrm{SSR} = \sum_{i=1}^{n}(\hat{Y}_i - \bar{Y})^2, \qquad \mathrm{SSE} = \sum_{i=1}^{n}(Y_i - \hat{Y}_i)^2.$$
This identity is referred to as the ANOVA decomposition of the variation in the response. Note that
$$\mathrm{SSR} = b_1^2 \sum_{i=1}^{n}(X_i - \bar{X})^2.$$
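As a numerical sanity check, here is a minimal sketch (assuming Python with NumPy, on a small made-up data set) that fits a least-squares line, forms SSTO, SSR, and SSE, and verifies both the decomposition and the identity for SSR above.

```python
import numpy as np

# Small made-up data set, purely for illustration
X = np.array([2.0, 3.0, 5.0, 7.0, 9.0, 11.0])
Y = np.array([4.1, 5.8, 9.7, 13.2, 17.9, 21.4])

# Least-squares estimates
Sxx = np.sum((X - X.mean()) ** 2)
Sxy = np.sum((X - X.mean()) * (Y - Y.mean()))
b1 = Sxy / Sxx
b0 = Y.mean() - b1 * X.mean()
Y_hat = b0 + b1 * X

# ANOVA decomposition
SSTO = np.sum((Y - Y.mean()) ** 2)
SSR = np.sum((Y_hat - Y.mean()) ** 2)
SSE = np.sum((Y - Y_hat) ** 2)

print(SSTO, SSR + SSE)        # equal up to floating-point rounding
print(SSR, b1 ** 2 * Sxx)     # SSR = b1^2 * sum_i (X_i - Xbar)^2
```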
Degrees of freedom
The degrees of freedom of the different terms in the ANOVA decomposition above are
$$\mathrm{d.f.}(\mathrm{SSTO}) = n - 1, \qquad \mathrm{d.f.}(\mathrm{SSR}) = 1, \qquad \mathrm{d.f.}(\mathrm{SSE}) = n - 2.$$
So,
$$\mathrm{d.f.}(\mathrm{SSTO}) = \mathrm{d.f.}(\mathrm{SSR}) + \mathrm{d.f.}(\mathrm{SSE}).$$
Expected value and distribution
$$E(\mathrm{SSE}) = (n-2)\sigma^2, \qquad E(\mathrm{SSR}) = \sigma^2 + \beta_1^2 \sum_{i=1}^{n}(X_i - \bar{X})^2.$$
Also, under the normal regression model and under $H_0 : \beta_1 = 0$,
$$\mathrm{SSR} \sim \sigma^2 \chi^2_1, \qquad \mathrm{SSE} \sim \sigma^2 \chi^2_{n-2},$$
and these two are independent.
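To make the distributional claim concrete, the following simulation sketch (assuming Python with NumPy; the sample size, design points, and $\sigma$ are arbitrary illustrative choices) generates data under $H_0 : \beta_1 = 0$ and checks that the simulated means of SSR and SSE are close to $\sigma^2$ and $(n-2)\sigma^2$, as the $\sigma^2\chi^2_1$ and $\sigma^2\chi^2_{n-2}$ distributions imply.

```python
import numpy as np

rng = np.random.default_rng(0)
n, sigma, beta0 = 20, 2.0, 1.0        # illustrative values; beta1 = 0 under H0
X = np.linspace(0.0, 10.0, n)
Sxx = np.sum((X - X.mean()) ** 2)

SSR_sims, SSE_sims = [], []
for _ in range(20000):
    Y = beta0 + rng.normal(scale=sigma, size=n)   # model with beta1 = 0
    b1 = np.sum((X - X.mean()) * (Y - Y.mean())) / Sxx
    Y_hat = (Y.mean() - b1 * X.mean()) + b1 * X
    SSR_sims.append(np.sum((Y_hat - Y.mean()) ** 2))
    SSE_sims.append(np.sum((Y - Y_hat) ** 2))

print(np.mean(SSR_sims), sigma ** 2)            # ~ sigma^2, the mean of sigma^2 * chi^2_1
print(np.mean(SSE_sims), (n - 2) * sigma ** 2)  # ~ (n - 2) * sigma^2
```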
Mean squares
$$\mathrm{MSE} = \frac{\mathrm{SSE}}{\mathrm{d.f.}(\mathrm{SSE})} = \frac{\mathrm{SSE}}{n-2}, \qquad \mathrm{MSR} = \frac{\mathrm{SSR}}{\mathrm{d.f.}(\mathrm{SSR})} = \frac{\mathrm{SSR}}{1}.$$
Also, $E(\mathrm{MSE}) = \sigma^2$ and $E(\mathrm{MSR}) = \sigma^2 + \beta_1^2 \sum_{i=1}^{n}(X_i - \bar{X})^2$.
F ratio
For testing $H_0 : \beta_1 = 0$ versus $H_1 : \beta_1 \neq 0$, the following test statistic, called the F ratio, can be used:
$$F^* = \frac{\mathrm{MSR}}{\mathrm{MSE}}.$$
The reason is that $\mathrm{MSR}/\mathrm{MSE}$ fluctuates around $1 + \beta_1^2 \sum_{i=1}^{n}(X_i - \bar{X})^2 / \sigma^2$, so a significantly large value of $F^*$ provides evidence against $H_0$ and in favor of $H_1$.
Under $H_0$, $F^*$ has the F distribution with degrees of freedom $(\mathrm{d.f.}(\mathrm{SSR}), \mathrm{d.f.}(\mathrm{SSE})) = (1, n-2)$, written $F^* \sim F_{1, n-2}$. Thus, the test rejects $H_0$ at level of significance $\alpha$ if $F^* > F(1-\alpha; 1, n-2)$, where $F(1-\alpha; 1, n-2)$ is the $(1-\alpha)$ quantile of the $F_{1, n-2}$ distribution.
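In software the comparison with the critical value is routine. Below is a minimal sketch assuming Python with SciPy; the helper name f_test_slope and its arguments are hypothetical, not from the text.

```python
from scipy import stats

def f_test_slope(MSR, MSE, n, alpha=0.05):
    """F test of H0: beta1 = 0 in simple linear regression."""
    F_star = MSR / MSE
    F_crit = stats.f.ppf(1 - alpha, dfn=1, dfd=n - 2)   # F(1 - alpha; 1, n - 2)
    p_value = stats.f.sf(F_star, dfn=1, dfd=n - 2)      # P(F_{1, n-2} > F*)
    return F_star, F_crit, p_value, F_star > F_crit
```

The last returned value is True exactly when $F^* > F(1-\alpha; 1, n-2)$, i.e. when $H_0$ is rejected at level $\alpha$.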
Relation between F-test and t-test
Check that $F^* = (t^*)^2$, where $t^* = b_1 / s(b_1)$ is the test statistic for testing $H_0 : \beta_1 = 0$ versus $H_1 : \beta_1 \neq 0$. So the F-test is equivalent to the t-test in this case.
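The equivalence can also be seen in the critical values, since $F(1-\alpha; 1, \nu) = t(1-\alpha/2; \nu)^2$. A one-line check, assuming Python with SciPy ($\nu = 17$ is just an illustrative error d.f.):

```python
from scipy import stats

nu, alpha = 17, 0.05
print(stats.f.ppf(1 - alpha, 1, nu), stats.t.ppf(1 - alpha / 2, nu) ** 2)  # equal
```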
ANOVA table
The ANOVA table summarizes the various quantities used in testing $H_0 : \beta_1 = 0$ against $H_1 : \beta_1 \neq 0$. It is of the form:
| Source | df | SS | MS | F* |
|---|---|---|---|---|
| Regression | d.f.(SSR) = 1 | SSR | MSR | MSR/MSE |
| Error | d.f.(SSE) = n − 2 | SSE | MSE | |
| Total | d.f.(SSTO) = n − 1 | SSTO | | |
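Statistical software produces such a table directly. As one possibility (not the method used in the text), here is a sketch assuming Python with pandas and statsmodels; the data frame df with columns x and y is a placeholder for your data:

```python
import numpy as np
import pandas as pd
from statsmodels.formula.api import ols
from statsmodels.stats.anova import anova_lm

# Placeholder data; replace with the actual predictor and response
rng = np.random.default_rng(1)
df = pd.DataFrame({"x": np.arange(10.0)})
df["y"] = 2.0 + 3.0 * df["x"] + rng.normal(size=10)

fit = ols("y ~ x", data=df).fit()
print(anova_lm(fit))   # rows for x (regression) and Residual (error): df, SS, MS, F, p-value
```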
Example 1: housing price data
We consider a data set on housing prices. Here $Y$ = selling price of a house (in thousands of dollars) and $X$ = size of the house (in hundreds of square feet). The summary statistics are given below:
$$n = 19, \qquad \bar{X} = 15.719, \qquad \bar{Y} = 75.211,$$
$$\sum_i (X_i - \bar{X})^2 = 40.805, \qquad \sum_i (Y_i - \bar{Y})^2 = 556.078, \qquad \sum_i (X_i - \bar{X})(Y_i - \bar{Y}) = 120.001.$$
(Example) - Estimates of $\beta_1$ and $\beta_0$
$$b_1 = \frac{\sum_i (X_i - \bar{X})(Y_i - \bar{Y})}{\sum_i (X_i - \bar{X})^2} = \frac{120.001}{40.805} = 2.941,$$
and
$$b_0 = \bar{Y} - b_1 \bar{X} = 75.211 - (2.941)(15.719) = 28.981.$$
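These two estimates can be reproduced directly from the summary statistics; a minimal sketch in Python (numbers copied from the text):

```python
n, X_bar, Y_bar = 19, 15.719, 75.211
Sxx, Sxy = 40.805, 120.001       # sum (Xi - Xbar)^2 and sum (Xi - Xbar)(Yi - Ybar)

b1 = Sxy / Sxx                   # 120.001 / 40.805, ~ 2.941
b0 = Y_bar - b1 * X_bar          # 75.211 - 2.941 * 15.719, ~ 28.981
print(b1, b0)
```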
(Example) - MSE
The degrees of freedom are $n - 2 = 17$, and
$$\mathrm{SSE} = \sum_i (Y_i - \bar{Y})^2 - b_1^2 \sum_i (X_i - \bar{X})^2 = 203.17.$$
So,
$$\mathrm{MSE} = \frac{\mathrm{SSE}}{n-2} = \frac{203.17}{17} = 11.95.$$
Also, $\mathrm{SSTO} = 556.08$, $\mathrm{SSR} = \mathrm{SSTO} - \mathrm{SSE} = 352.91$, and $\mathrm{MSR} = \mathrm{SSR}/1 = 352.91$.
$$F^* = \frac{\mathrm{MSR}}{\mathrm{MSE}} = 29.529 = (t^*)^2, \qquad \text{where } t^* = \frac{b_1}{s(b_1)} = \frac{2.941}{0.5412} = 5.434.$$
Also, $F(0.95; 1, 17) = 4.45$ and $t(0.975; 17) = 2.11$. Since $F^* > F(0.95; 1, 17)$, we reject $H_0 : \beta_1 = 0$. The ANOVA table is given below.
| Source | df | SS | MS | F* |
|---|---|---|---|---|
| Regression | 1 | 352.91 | 352.91 | 29.529 |
| Error | 17 | 203.17 | 11.95 | |
| Total | 18 | 556.08 | | |
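The remaining entries of the table and the comparison with the critical values can be reproduced from the same summary statistics. A sketch assuming Python with SciPy, using the standard formula $s(b_1) = \sqrt{\mathrm{MSE}/\sum_i (X_i - \bar{X})^2}$ (not restated in this section) for the standard error of $b_1$:

```python
from scipy import stats

n, Sxx, Syy, Sxy = 19, 40.805, 556.078, 120.001   # summary statistics from the text

b1 = Sxy / Sxx
SSTO = Syy
SSE = Syy - b1 ** 2 * Sxx        # ~ 203.17
SSR = SSTO - SSE                 # ~ 352.91
MSE = SSE / (n - 2)              # ~ 11.95
MSR = SSR / 1.0                  # ~ 352.91
F_star = MSR / MSE               # ~ 29.5

s_b1 = (MSE / Sxx) ** 0.5        # standard error of b1, ~ 0.5412
t_star = b1 / s_b1               # ~ 5.434; t_star ** 2 equals F_star

F_crit = stats.f.ppf(0.95, 1, n - 2)    # F(0.95; 1, 17) ~ 4.45
t_crit = stats.t.ppf(0.975, n - 2)      # t(0.975; 17) ~ 2.11
print(F_star, F_crit, F_star > F_crit)  # True: reject H0 at the 5% level
```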
Contributors
- Valerie Regalia
- Debashis Paul