Analysis of variance approach to regression
We divide the total variability in the observed data into two parts: one coming from the errors, the other coming from the predictor.
ANOVA Decomposition
The following decomposition
$$Y_i - \bar{Y} = (\hat{Y}_i - \bar{Y}) + (Y_i - \hat{Y}_i), \qquad i = 1, 2, \ldots, n,$$
expresses the deviation of the observed response from the mean response as the sum of the deviation of the fitted value from the mean and the residual.
Taking sums of squares on both sides (the cross-product term vanishes), we obtain
$$\sum_{i=1}^{n}(Y_i - \bar{Y})^2 = \sum_{i=1}^{n}(\hat{Y}_i - \bar{Y})^2 + \sum_{i=1}^{n}(Y_i - \hat{Y}_i)^2,$$
or
$$\mathrm{SSTO} = \mathrm{SSR} + \mathrm{SSE},$$
where
$$\mathrm{SSTO} = \sum_{i=1}^{n}(Y_i - \bar{Y})^2, \qquad \mathrm{SSR} = \sum_{i=1}^{n}(\hat{Y}_i - \bar{Y})^2, \qquad \mathrm{SSE} = \sum_{i=1}^{n}(Y_i - \hat{Y}_i)^2.$$
This identity is referred to as the ANOVA decomposition of the variation in the response. Note that
$$\mathrm{SSR} = b_1^2 \sum_{i=1}^{n}(X_i - \bar{X})^2.$$
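As a numerical sanity check, here is a minimal sketch (assuming Python with NumPy, on a small made-up data set) that fits a least-squares line, forms SSTO, SSR, and SSE, and verifies both the decomposition and the identity for SSR above.

```python
import numpy as np

# Small made-up data set, purely for illustration
X = np.array([2.0, 3.0, 5.0, 7.0, 9.0, 11.0])
Y = np.array([4.1, 5.8, 9.7, 13.2, 17.9, 21.4])

# Least-squares estimates
Sxx = np.sum((X - X.mean()) ** 2)
Sxy = np.sum((X - X.mean()) * (Y - Y.mean()))
b1 = Sxy / Sxx
b0 = Y.mean() - b1 * X.mean()
Y_hat = b0 + b1 * X

# ANOVA decomposition
SSTO = np.sum((Y - Y.mean()) ** 2)
SSR = np.sum((Y_hat - Y.mean()) ** 2)
SSE = np.sum((Y - Y_hat) ** 2)

print(SSTO, SSR + SSE)        # equal up to floating-point rounding
print(SSR, b1 ** 2 * Sxx)     # SSR = b1^2 * sum_i (X_i - Xbar)^2
```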
Degrees of freedom
The degrees of freedom of the different terms in the ANOVA decomposition above are
$$\mathrm{d.f.}(\mathrm{SSTO}) = n - 1, \qquad \mathrm{d.f.}(\mathrm{SSR}) = 1, \qquad \mathrm{d.f.}(\mathrm{SSE}) = n - 2.$$
So,
$$\mathrm{d.f.}(\mathrm{SSTO}) = \mathrm{d.f.}(\mathrm{SSR}) + \mathrm{d.f.}(\mathrm{SSE}).$$
Expected value and distribution
$$E(\mathrm{SSE}) = (n-2)\sigma^2, \qquad E(\mathrm{SSR}) = \sigma^2 + \beta_1^2 \sum_{i=1}^{n}(X_i - \bar{X})^2.$$
Also, under the normal regression model and under $H_0 : \beta_1 = 0$,
$$\mathrm{SSR} \sim \sigma^2 \chi^2_1, \qquad \mathrm{SSE} \sim \sigma^2 \chi^2_{n-2},$$
and these two are independent.
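To make the distributional claim concrete, the following simulation sketch (assuming Python with NumPy; the sample size, design points, and $\sigma$ are arbitrary illustrative choices) generates data under $H_0 : \beta_1 = 0$ and checks that the simulated means of SSR and SSE are close to $\sigma^2$ and $(n-2)\sigma^2$, as the $\sigma^2\chi^2_1$ and $\sigma^2\chi^2_{n-2}$ distributions imply.

```python
import numpy as np

rng = np.random.default_rng(0)
n, sigma, beta0 = 20, 2.0, 1.0        # illustrative values; beta1 = 0 under H0
X = np.linspace(0.0, 10.0, n)
Sxx = np.sum((X - X.mean()) ** 2)

SSR_sims, SSE_sims = [], []
for _ in range(20000):
    Y = beta0 + rng.normal(scale=sigma, size=n)   # model with beta1 = 0
    b1 = np.sum((X - X.mean()) * (Y - Y.mean())) / Sxx
    Y_hat = (Y.mean() - b1 * X.mean()) + b1 * X
    SSR_sims.append(np.sum((Y_hat - Y.mean()) ** 2))
    SSE_sims.append(np.sum((Y - Y_hat) ** 2))

print(np.mean(SSR_sims), sigma ** 2)            # ~ sigma^2, the mean of sigma^2 * chi^2_1
print(np.mean(SSE_sims), (n - 2) * sigma ** 2)  # ~ (n - 2) * sigma^2
```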
Mean squares
$$\mathrm{MSE} = \frac{\mathrm{SSE}}{\mathrm{d.f.}(\mathrm{SSE})} = \frac{\mathrm{SSE}}{n-2}, \qquad \mathrm{MSR} = \frac{\mathrm{SSR}}{\mathrm{d.f.}(\mathrm{SSR})} = \frac{\mathrm{SSR}}{1}.$$
Also, $E(\mathrm{MSE}) = \sigma^2$ and $E(\mathrm{MSR}) = \sigma^2 + \beta_1^2 \sum_{i=1}^{n}(X_i - \bar{X})^2$.
F ratio
For testing $H_0 : \beta_1 = 0$ versus $H_1 : \beta_1 \neq 0$, the following test statistic, called the F ratio, can be used:
$$F^* = \frac{\mathrm{MSR}}{\mathrm{MSE}}.$$
The reason is that $\mathrm{MSR}/\mathrm{MSE}$ fluctuates around $1 + \beta_1^2 \sum_{i=1}^{n}(X_i - \bar{X})^2 / \sigma^2$, so a significantly large value of $F^*$ provides evidence against $H_0$ and in favor of $H_1$.
Under $H_0$, $F^*$ has the F distribution with degrees of freedom $(\mathrm{d.f.}(\mathrm{SSR}), \mathrm{d.f.}(\mathrm{SSE})) = (1, n-2)$, written $F^* \sim F_{1, n-2}$. Thus, the test rejects $H_0$ at level of significance $\alpha$ if $F^* > F(1-\alpha; 1, n-2)$, where $F(1-\alpha; 1, n-2)$ is the $(1-\alpha)$ quantile of the $F_{1, n-2}$ distribution.
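In software the comparison with the critical value is routine. Below is a minimal sketch assuming Python with SciPy; the helper name f_test_slope and its arguments are hypothetical, not from the text.

```python
from scipy import stats

def f_test_slope(MSR, MSE, n, alpha=0.05):
    """F test of H0: beta1 = 0 in simple linear regression."""
    F_star = MSR / MSE
    F_crit = stats.f.ppf(1 - alpha, dfn=1, dfd=n - 2)   # F(1 - alpha; 1, n - 2)
    p_value = stats.f.sf(F_star, dfn=1, dfd=n - 2)      # P(F_{1, n-2} > F*)
    return F_star, F_crit, p_value, F_star > F_crit
```

The last returned value is True exactly when $F^* > F(1-\alpha; 1, n-2)$, i.e. when $H_0$ is rejected at level $\alpha$.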
Relation between F-test and t-test
Check that $F^* = (t^*)^2$, where $t^* = b_1 / s(b_1)$ is the test statistic for testing $H_0 : \beta_1 = 0$ versus $H_1 : \beta_1 \neq 0$. So the F-test is equivalent to the t-test in this case.
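The equivalence can also be seen in the critical values, since $F(1-\alpha; 1, \nu) = t(1-\alpha/2; \nu)^2$. A one-line check, assuming Python with SciPy ($\nu = 17$ is just an illustrative error d.f.):

```python
from scipy import stats

nu, alpha = 17, 0.05
print(stats.f.ppf(1 - alpha, 1, nu), stats.t.ppf(1 - alpha / 2, nu) ** 2)  # equal
```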
ANOVA table
The ANOVA table summarizes the various quantities used in testing $H_0 : \beta_1 = 0$ against $H_1 : \beta_1 \neq 0$. It is of the form:
| Source | df | SS | MS | F* |
|---|---|---|---|---|
| Regression | d.f.(SSR) = 1 | SSR | MSR | MSR/MSE |
| Error | d.f.(SSE) = n − 2 | SSE | MSE | |
| Total | d.f.(SSTO) = n − 1 | SSTO | | |
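Statistical software produces such a table directly. As one possibility (not the method used in the text), here is a sketch assuming Python with pandas and statsmodels; the data frame df with columns x and y is a placeholder for your data:

```python
import numpy as np
import pandas as pd
from statsmodels.formula.api import ols
from statsmodels.stats.anova import anova_lm

# Placeholder data; replace with the actual predictor and response
rng = np.random.default_rng(1)
df = pd.DataFrame({"x": np.arange(10.0)})
df["y"] = 2.0 + 3.0 * df["x"] + rng.normal(size=10)

fit = ols("y ~ x", data=df).fit()
print(anova_lm(fit))   # rows for x (regression) and Residual (error): df, SS, MS, F, p-value
```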
Example 1: housing price data
We consider a data set on housing prices. Here $Y$ = selling price of a house (in thousands of dollars) and $X$ = size of the house (in hundreds of square feet). The summary statistics are given below:
$$n = 19, \qquad \bar{X} = 15.719, \qquad \bar{Y} = 75.211,$$
$$\sum_i (X_i - \bar{X})^2 = 40.805, \qquad \sum_i (Y_i - \bar{Y})^2 = 556.078, \qquad \sum_i (X_i - \bar{X})(Y_i - \bar{Y}) = 120.001.$$
(Example) - Estimates of $\beta_1$ and $\beta_0$
$$b_1 = \frac{\sum_i (X_i - \bar{X})(Y_i - \bar{Y})}{\sum_i (X_i - \bar{X})^2} = \frac{120.001}{40.805} = 2.941,$$
and
$$b_0 = \bar{Y} - b_1 \bar{X} = 75.211 - (2.941)(15.719) = 28.981.$$
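These two estimates can be reproduced directly from the summary statistics; a minimal sketch in Python (numbers copied from the text):

```python
n, X_bar, Y_bar = 19, 15.719, 75.211
Sxx, Sxy = 40.805, 120.001       # sum (Xi - Xbar)^2 and sum (Xi - Xbar)(Yi - Ybar)

b1 = Sxy / Sxx                   # 120.001 / 40.805, ~ 2.941
b0 = Y_bar - b1 * X_bar          # 75.211 - 2.941 * 15.719, ~ 28.981
print(b1, b0)
```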
(Example) - MSE
The degrees of freedom are $n - 2 = 17$, and
$$\mathrm{SSE} = \sum_i (Y_i - \bar{Y})^2 - b_1^2 \sum_i (X_i - \bar{X})^2 = 203.17.$$
So,
$$\mathrm{MSE} = \frac{\mathrm{SSE}}{n-2} = \frac{203.17}{17} = 11.95.$$
Also, $\mathrm{SSTO} = 556.08$, $\mathrm{SSR} = \mathrm{SSTO} - \mathrm{SSE} = 352.91$, and $\mathrm{MSR} = \mathrm{SSR}/1 = 352.91$.
$$F^* = \frac{\mathrm{MSR}}{\mathrm{MSE}} = 29.529 = (t^*)^2, \qquad \text{where } t^* = \frac{b_1}{s(b_1)} = \frac{2.941}{0.5412} = 5.434.$$
Also, $F(0.95; 1, 17) = 4.45$ and $t(0.975; 17) = 2.11$. Since $F^* > F(0.95; 1, 17)$, we reject $H_0 : \beta_1 = 0$. The ANOVA table is given below.
| Source | df | SS | MS | F* |
|---|---|---|---|---|
| Regression | 1 | 352.91 | 352.91 | 29.529 |
| Error | 17 | 203.17 | 11.95 | |
| Total | 18 | 556.08 | | |
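The remaining entries of the table and the comparison with the critical values can be reproduced from the same summary statistics. A sketch assuming Python with SciPy, using the standard formula $s(b_1) = \sqrt{\mathrm{MSE}/\sum_i (X_i - \bar{X})^2}$ (not restated in this section) for the standard error of $b_1$:

```python
from scipy import stats

n, Sxx, Syy, Sxy = 19, 40.805, 556.078, 120.001   # summary statistics from the text

b1 = Sxy / Sxx
SSTO = Syy
SSE = Syy - b1 ** 2 * Sxx        # ~ 203.17
SSR = SSTO - SSE                 # ~ 352.91
MSE = SSE / (n - 2)              # ~ 11.95
MSR = SSR / 1.0                  # ~ 352.91
F_star = MSR / MSE               # ~ 29.5

s_b1 = (MSE / Sxx) ** 0.5        # standard error of b1, ~ 0.5412
t_star = b1 / s_b1               # ~ 5.434; t_star ** 2 equals F_star

F_crit = stats.f.ppf(0.95, 1, n - 2)    # F(0.95; 1, 17) ~ 4.45
t_crit = stats.t.ppf(0.975, n - 2)      # t(0.975; 17) ~ 2.11
print(F_star, F_crit, F_star > F_crit)  # True: reject H0 at the 5% level
```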
Contributors
- Valerie Regalia
- Debashis Paul