# Analysis of variance approach to regression

We divide the *total variability* in the observed data into two parts: one coming from the errors, the other coming from the predictor.

### ANOVA Decomposition

For \( i=1,2,\ldots,n \), the decomposition

\[ Y_i - \overline{Y} = (\widehat{Y_i} - \overline{Y}) + (Y_i - \widehat{Y_i} )\]

expresses the *deviation of the observed response from the mean response* as the sum of the *deviation of the fitted value* from the mean and the *residual*.

Squaring both sides and summing over \(i\), after some algebra we have:

\[ \sum_{i=1}^n (Y_i - \overline{Y})^2 = \sum_{i=1}^n (\widehat{Y_i} -\overline{Y})^2 + \sum_{i=1}^n (Y_i - \widehat{Y_i})^2. \label{1}\]

or

\[ SSTO = SSR +SSE\]

where

\[SSTO = \sum_{i=1}^n (Y_i - \overline{Y})^2, \]

\[SSR = \sum_{i=1}^n (\widehat{Y_i} -\overline{Y})^2 \label{2}\]

and

\[SSE = \sum_{i=1}^n (Y_i - \widehat{Y_i})^2. \]

This is referred to as the *ANOVA decomposition* of the variation in the response. Note that

\[ SSR = b_1^2 \sum_{i=1}^n (X_i - \overline{X})^2 .\]
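The algebra behind the decomposition can be sketched briefly. The steps below assume the least squares fit \( \widehat{Y_i} = b_0 + b_1 X_i \) and write \( e_i = Y_i - \widehat{Y_i} \) for the residual (a symbol introduced here for convenience). The least squares normal equations give \( \sum_{i=1}^n e_i = 0 \) and \( \sum_{i=1}^n X_i e_i = 0 \). Squaring the decomposition and summing produces a cross term, which vanishes:

\[ \sum_{i=1}^n (\widehat{Y_i} - \overline{Y}) e_i = (b_0 - \overline{Y}) \sum_{i=1}^n e_i + b_1 \sum_{i=1}^n X_i e_i = 0. \]

The expression for SSR follows because \( b_0 = \overline{Y} - b_1 \overline{X} \) implies

\[ \widehat{Y_i} - \overline{Y} = b_1 (X_i - \overline{X}). \]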

### Degrees of freedom

The degrees of freedom of the terms in the decomposition (Equation \ref{1}) are

\[df( SSTO ) = n - 1\]

\[df( SSR ) = 1\]

\[df( SSE ) = n - 2. \]

So,

\[df( SSTO ) = df( SSR ) + df( SSE ). \]
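The decomposition and its degrees of freedom can be checked numerically. The following is a minimal sketch in plain Python; the function name `anova_decomposition` and the five data points are made up for illustration and are not part of the original text.

```python
def anova_decomposition(x, y):
    """Return (b0, b1, SSTO, SSR, SSE) for the least squares line of y on x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = sxy / sxx                # least squares slope
    b0 = y_bar - b1 * x_bar       # least squares intercept
    fitted = [b0 + b1 * xi for xi in x]
    ssto = sum((yi - y_bar) ** 2 for yi in y)
    ssr = sum((fi - y_bar) ** 2 for fi in fitted)
    sse = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    return b0, b1, ssto, ssr, sse


# Hypothetical illustrative data (not the housing data).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

b0, b1, ssto, ssr, sse = anova_decomposition(x, y)

n = len(x)
df_total, df_reg, df_err = n - 1, 1, n - 2

print("SSTO =", ssto, " SSR + SSE =", ssr + sse)
print("df:", df_total, "=", df_reg, "+", df_err)
```

Up to floating point rounding, the two printed sums of squares should agree, and the degrees of freedom add up in the same way.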

### Expected value and distribution

The expected values are \( E ( SSE ) = ( n - 2) \sigma^2 \) and \( E ( SSR ) = \sigma^2 + \beta_1^2 \sum_{i=1}^n (X_i - \overline{X})^2. \) Also, under the *normal regression model* and under \( H_0 : \beta_1 = 0, \)

\[ SSR \sim \sigma^2 \chi_1^2, \quad SSE \sim \sigma^2 \chi_{n-2}^2, \]

and these two are independent.

### Mean squares

\[ MSE = \dfrac{SSE}{d.f.(SSE)} = \dfrac{SSE}{n-2}, \quad MSR = \dfrac{SSR}{d.f.(SSR)} = \dfrac{SSR}{1}. \]

Also, \( E ( MSE ) = \sigma^2 , E ( MSR ) = \sigma^2 + \beta_1^2 \sum_{i=1}^n (X_i - \overline{X})^2. \)
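These expectations can be illustrated by simulation. The sketch below is only a rough Monte Carlo check using the Python standard library; the true parameters, the design points, the seed, and the number of replications are all hypothetical choices made for illustration.

```python
import random


def mean_squares(x, y):
    """Return (MSR, MSE) for the least squares line of y on x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    ssto = sum((yi - y_bar) ** 2 for yi in y)
    ssr = b1 ** 2 * sxx           # SSR = b1^2 * sum (X_i - X_bar)^2
    sse = ssto - ssr              # from SSTO = SSR + SSE
    return ssr / 1, sse / (n - 2)


rng = random.Random(0)                   # fixed seed for reproducibility
beta0, beta1, sigma = 1.0, 0.5, 2.0      # hypothetical true parameters
x = [float(i) for i in range(1, 11)]     # fixed design with n = 10
x_bar = sum(x) / len(x)
sxx = sum((xi - x_bar) ** 2 for xi in x)

reps = 20000
msr_sum = 0.0
mse_sum = 0.0
for _ in range(reps):
    # Normal regression model: Y_i = beta0 + beta1 * X_i + N(0, sigma^2) error
    y = [beta0 + beta1 * xi + rng.gauss(0.0, sigma) for xi in x]
    msr, mse = mean_squares(x, y)
    msr_sum += msr
    mse_sum += mse

avg_msr = msr_sum / reps
avg_mse = mse_sum / reps
expected_mse = sigma ** 2
expected_msr = sigma ** 2 + beta1 ** 2 * sxx

print("average MSE:", avg_mse, " theory:", expected_mse)
print("average MSR:", avg_msr, " theory:", expected_msr)
```

The simulated averages should land close to the theoretical values, with the gap shrinking as the number of replications grows.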

### *F* ratio

For testing \( H_0 : \beta_1 = 0 \) versus \( H_1 : \beta_1 \neq 0, \) the following test statistic, called the *F ratio*, can be used:

\[ F^* = \dfrac{MSR}{MSE}. \]

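The reason is that the ratio of the expected mean squares is \( \dfrac{E(MSR)}{E(MSE)} = 1 + \dfrac{ \beta_1^2 \sum_{i=1}^n (X_i - \overline{X})^2 }{\sigma^2}, \) which equals 1 under \(H_0\) and exceeds 1 under \(H_1,\) so \(F^*\) tends to be larger when \(\beta_1 \neq 0.\) A significantly large value of \(F^*\) therefore provides evidence against \(H_0\) and for \(H_1.\)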

Under \(H_0\), \(F^*\) has the \(F\) distribution with *paired degrees of freedom* (d.f.( SSR ), d.f.( SSE )) = \((1, n - 2)\), written \(F^* \sim F_{1, n - 2}. \) Thus, the test rejects \(H_0\) at level of significance \(\alpha\) if \(F^* > F( 1 - \alpha; 1, n - 2 ), \) where \(F( 1 - \alpha; 1, n - 2 ) \) is the \( (1 - \alpha ) \) quantile of the \(F_{1, n - 2} \) distribution.
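The level of the test can be illustrated by simulating data with \(\beta_1 = 0\) and counting how often \(F^*\) exceeds the critical value. The sketch below assumes \(n = 19\) and uses the tabulated value \(F(0.95; 1, 17) = 4.45\) quoted in the housing example of this section; the design points, seed, and number of replications are hypothetical.

```python
import random


def f_ratio(x, y):
    """Return F* = MSR / MSE for the least squares line of y on x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    ssto = sum((yi - y_bar) ** 2 for yi in y)
    ssr = b1 ** 2 * sxx
    sse = ssto - ssr
    return (ssr / 1) / (sse / (n - 2))


rng = random.Random(1)                # fixed seed for reproducibility
n = 19
x = [float(i) for i in range(n)]      # hypothetical fixed design
critical = 4.45                       # F(0.95; 1, 17), taken from a table

reps = 4000
rejections = 0
for _ in range(reps):
    # Under H0 the response does not depend on X: pure N(0, 1) noise.
    y = [rng.gauss(0.0, 1.0) for _ in x]
    if f_ratio(x, y) > critical:
        rejections += 1

rate = rejections / reps
print("rejection rate under H0:", rate)
```

With \(\alpha = 0.05\), the printed rejection rate should be near 0.05, up to simulation error.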

### Relation between *F*-test and *t*-test

Check that \( F^* = ( t^* )^2, \) where \( t^* = \dfrac{b_1}{s ( b_1 )} \) is the test statistic for testing \(H_0 : \beta_1 = 0 \) versus \(H_1 : \beta_1 \neq 0. \) So, the *F*-test is equivalent to the *t*-test in this case.
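One way to carry out the check assumes the usual estimated variance of the slope, \( s^2 ( b_1 ) = \dfrac{MSE}{\sum_{i=1}^n (X_i - \overline{X})^2} \) (not stated in this section), together with \( SSR = b_1^2 \sum_{i=1}^n (X_i - \overline{X})^2 \):

\[ ( t^* )^2 = \dfrac{b_1^2}{s^2 ( b_1 )} = \dfrac{b_1^2 \sum_{i=1}^n (X_i - \overline{X})^2}{MSE} = \dfrac{SSR}{MSE} = \dfrac{MSR}{MSE} = F^*. \]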

### ANOVA table

The ANOVA table summarizes the quantities used in testing \(H_0 : \beta_1 = 0 \) against \(H_1 : \beta_1 \neq 0.\) It is of the form:

Source | df | SS | MS | F* |
---|---|---|---|---|
Regression | d.f.(SSR) = 1 | SSR | MSR | \(\dfrac{MSR}{MSE} \) |
Error | d.f.(SSE) = n - 2 | SSE | MSE | |
Total | d.f.(SSTO) = n - 1 | SSTO | | |

### Example 1: housing price data

We consider a data set on housing prices. Here Y = selling price of houses (in thousands of dollars), and X = size of houses (in units of 100 square feet). The summary statistics are given below:

\[ n = 19, \quad \overline{X} = 15.719, \quad \overline{Y} = 75.211, \]

\( \sum_i ( X_i - \overline{X} )^2 = 40.805, \sum_i ( Y_i - \overline{Y} )^2 = 556.078, \sum_i ( X_i - \overline{X} ) ( Y_i - \overline{Y} ) = 120.001. \)

#### (Example) - Estimates of \(\beta_1 \) and \(\beta_0\)

\( b_1 = \dfrac{\sum_i ( X_i - \overline{X} ) ( Y_i - \overline{Y} ) }{\sum_i ( X_i - \overline{X} )^2} = \dfrac{120.001}{40.805} = 2.941. \)

and

\( b_0 = \overline{Y} - b_1 \overline{X} = 75.211 - (2.941)(15.719) = 28.981. \)

#### (Example) - MSE

The error degrees of freedom are \( n - 2 = 17, \) and \( SSE = \sum_i (Y_i - \overline{Y} )^2 - b_1^2 \sum_i ( X_i - \overline{X} )^2 = 203.17.\) So,

\[ MSE = \dfrac{SSE}{n - 2} = \dfrac{203.17}{17} = 11.95. \]

Also, SSTO = 556.08 and SSR = SSTO - SSE = 352.91, MSR = SSR/1 = 352.91.

\(F^* = \dfrac{MSR}{MSE} = 29.529 = (t^* )^2,\) where \(t^* = \dfrac{b_1}{s ( b_1 )} = \dfrac{2.941}{0.5412} = 5.434.\) Also, *F*( 0.95; 1, 17 ) = 4.45 and *t*( 0.975; 17 ) = 2.11. Since \(F^* = 29.529 > 4.45\) (equivalently, \(|t^*| = 5.434 > 2.11\)), we reject \(H_0 : \beta_1 = 0 \) at the 5% level of significance. The ANOVA table is given below.

Source | df | SS | MS | F* |
---|---|---|---|---|
Regression | 1 | 352.91 | 352.91 | 29.529 |
Error | 17 | 203.17 | 11.95 | |
Total | 18 | 556.08 | | |
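The entries of this table can be reproduced from the summary statistics alone. The sketch below is one way to do so in plain Python; the variable names are illustrative, and the critical value 4.45 is the tabulated \(F( 0.95; 1, 17 )\) quoted above rather than something the code computes. Because the code carries the unrounded slope through every step, some results may differ from the hand calculation in the last digit.

```python
import math

# Summary statistics from the housing price example.
n = 19
x_bar, y_bar = 15.719, 75.211
sxx = 40.805     # sum of (X_i - X_bar)^2
syy = 556.078    # sum of (Y_i - Y_bar)^2, which is SSTO
sxy = 120.001    # sum of (X_i - X_bar)(Y_i - Y_bar)

# Least squares estimates.
b1 = sxy / sxx
b0 = y_bar - b1 * x_bar

# ANOVA decomposition.
ssto = syy
ssr = b1 ** 2 * sxx
sse = ssto - ssr

# Mean squares and test statistics.
msr = ssr / 1
mse = sse / (n - 2)
f_star = msr / mse
s_b1 = math.sqrt(mse / sxx)   # estimated standard error of b1
t_star = b1 / s_b1

f_crit = 4.45                 # F(0.95; 1, 17), taken from a table
reject = f_star > f_crit

print("b1 =", round(b1, 3), " b0 =", round(b0, 3))
print("SSR =", round(ssr, 2), " SSE =", round(sse, 2), " SSTO =", round(ssto, 2))
print("MSR =", round(msr, 2), " MSE =", round(mse, 2))
print("F* =", round(f_star, 3), " t* =", round(t_star, 3), " reject H0:", reject)
```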

### Contributors

- Valerie Regalia
- Debashis Paul