Analysis of Variance

1. A single factor study (continued)

A food company wanted to test four different package designs for a new breakfast cereal. Twenty stores with approximately the same sales conditions (such as sales volume, price, etc.) were selected as experimental units. Five stores were randomly assigned to each of the four package designs:

A balanced completely randomized design.
A single, 4-level, qualitative factor: package design.
A quantitative response variable: sales - number of packets of cereal sold during the period of study.
Goal: explore the relationship between package design and sales.

1.1 ANOVA for single factor study

A simple statistical model for the data is as follows:

\[Y_{ij} = \mu_i + \epsilon_{ij}, \quad j = 1,\ldots,n_i; \quad i = 1,\ldots,r,\]

where

\(r\) is the number of factor levels (treatments) and \(n_i\) is the number of experimental units corresponding to the \(i\)-th factor level;
\(Y_{ij}\) is the measurement for the \(j\)-th experimental unit corresponding to the \(i\)-th factor level;
\(\mu_i\) is the mean of all the measurements corresponding to the \(i\)-th factor level (unknown);
\(\epsilon_{ij}\)'s are random errors (unobserved).

1.2 Model assumptions

The following assumptions are made about the above model:

\(\epsilon_{ij}\) are independently and identically distributed as \(N(0, \sigma^2)\).
\(\mu_i\)'s are unknown fixed parameters (so-called fixed effects), so that \(E(Y_{ij}) = \mu_i\) and \(Var(Y_{ij}) = \sigma^2\). Together with the first assumption, this is equivalent to assuming that the \(Y_{ij}\) are independently distributed as \(N(\mu_i, \sigma^2)\).

1.3 Estimation of \(\mu_i\)

Define the sample mean for the \(i\)-th factor level:

\[\overline{Y}_{i.} = \frac{1}{n_i}\sum_{j = 1}^{n_i}Y_{ij} = \frac{1}{n_i}Y_{i.}\]

where \(Y_{i.} = \sum_{j=1}^{n_i}Y_{ij}\) is the sum of responses for the \(i\)-th treatment group, for \(i = 1,...,r\); and the overall sample mean:

\[\overline{Y}_{..} = \frac{1}{n_T}\sum_{i=1}^{r}\sum_{j=1}^{n_i}Y_{ij} = \frac{1}{n_T}\sum_{i=1}^{r}n_{i}\overline{Y}_{i.} = \frac{Y_{..}}{n_T},\]

where \(n_T = \sum_{i=1}^{r}n_i\) and \(Y_{..} = \sum_{i=1}^{r}Y_{i.}\) is the sum of all responses. Then \(\overline{Y}_{i.}\) is an estimate of \(\mu_i\) for each \(i = 1,...,r\). Under the assumptions, \(\overline{Y}_{i.}\) is an unbiased estimator of \(\mu_i\) since

\[E(\overline{Y}_{i.}) = \frac{1}{n_i}\sum_{j=1}^{n_i}E(Y_{ij}) = \frac{1}{n_i}\sum_{j=1}^{n_i}\mu_i = \mu_i.\]
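The unbiasedness argument can also be checked numerically. The following is a minimal simulation sketch, not part of the original study: the values of \(\mu_i\), \(\sigma\) and the number of replications are hypothetical and chosen only for illustration. It draws many data sets from the model \(Y_{ij} = \mu_i + \epsilon_{ij}\) and averages the resulting sample means, which should land close to the \(\mu_i\) used to generate them.

```python
import random

random.seed(0)  # fixed seed so the sketch is reproducible

# Hypothetical parameters, chosen only for illustration (not taken from the study).
mu = [15.0, 13.0, 20.0, 27.0]  # factor level means mu_i
sigma = 3.0                    # common error standard deviation
n = [5, 5, 4, 5]               # sample sizes n_i
reps = 20000                   # number of simulated data sets


def level_means(mu, sigma, n):
    """Simulate one data set Y_ij = mu_i + eps_ij and return the sample means."""
    return [
        sum(random.gauss(m, sigma) for _ in range(n_i)) / n_i
        for m, n_i in zip(mu, n)
    ]


# Average each sample mean over many simulated data sets.
sums = [0.0] * len(mu)
for _ in range(reps):
    for i, ybar in enumerate(level_means(mu, sigma, n)):
        sums[i] += ybar
avg_ybar = [s / reps for s in sums]

for m, a in zip(mu, avg_ybar):
    print(f"mu_i = {m:5.1f}   average of sample means = {a:6.3f}")
```

The averages differ from the \(\mu_i\) only by simulation noise, which shrinks as the number of replications grows.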

Table 1: Data summary: packaging of breakfast cereals
		S1	S2	S3	S4	S5	\(Y_{i.}\)	\(\overline{Y}_{i.}\)	\(n_i\)
Packaging Design	D1	11	17	16	14	15	73	14.6	5
Packaging Design	D2	12	10	15	19	11	67	13.4	5
Packaging Design	D3	23	20	18	17	(missing)	78	19.5	4
Packaging Design	D4	27	33	22	26	28	136	27.2	5
Total							\(Y_{..}\) = 354	\(\overline{Y}_{..}\) = 18.63	19
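The summary columns of Table 1 can be reproduced with a few lines of Python. This is a small sketch using only the sales figures listed in the table; the variable names are illustrative, and the missing observation for D3 is simply left out of its list.

```python
# Sales data from Table 1; design D3 has four stores because one observation is missing.
sales = {
    "D1": [11, 17, 16, 14, 15],
    "D2": [12, 10, 15, 19, 11],
    "D3": [23, 20, 18, 17],
    "D4": [27, 33, 22, 26, 28],
}

# Per-design sample size n_i, total Y_i. and sample mean.
sizes = {d: len(ys) for d, ys in sales.items()}
totals = {d: sum(ys) for d, ys in sales.items()}
means = {d: totals[d] / sizes[d] for d in sales}

# Overall sample size n_T, grand total Y.. and overall sample mean.
n_total = sum(sizes.values())
grand_total = sum(totals.values())
grand_mean = grand_total / n_total

for d in sales:
    print(d, totals[d], round(means[d], 2), sizes[d])
print("Total", grand_total, round(grand_mean, 2), n_total)
```

Note that the overall mean is the total of all 19 responses divided by 19, which is a weighted average of the four design means rather than their simple average.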

1.4 Comparison of factor level means

We want to check for deviations from the null hypothesis \(H_0 : \mu_1 = ... = \mu_r\); that is, the alternative hypothesis is \(H_a:\) not all \(\mu_i\)'s are equal.

Idea 1: A baseline value for comparison is the overall mean:

\[\mu_{.} = \frac{\sum_{i=1}^{r}n_i\mu_i}{n_T}.\]

Idea 2: Calculate the squared deviations from the overall mean for each factor level:

\[(\mu_1 - \mu_.)^2,...,(\mu_r - \mu_.)^2.\]

Under \(H_0 : \mu_1 = ... = \mu_r,\) these deviations are all zero.

Idea 3: Use the weighted sum of the above deviations as an overall measurement of the deviation from \(H_0: \mu_1 = ... = \mu_r:\)

\[\sum_{i=1}^{r}n_i(\mu_i - \mu_.)^2\]

The weight of the \(i\)-th treatment group is its sample size \(n_i\), i.e., the more data, the more importance.

Estimators

Estimate the population means by their sample counterparts:

\[\overline{Y}_{1.} \rightarrow \mu_1,...,\overline{Y}_{r.} \rightarrow \mu_r\]

and

\[\overline{Y}_{..} = \frac{1}{n_T}\sum_{i=1}^{r}n_i\overline{Y}_{i.} \rightarrow \mu_.\]

Thus,

\[\sum_{i=1}^{r}n_i(\overline{Y}_{i.} - \overline{Y}_{..})^2\]

is a statistic to measure the deviation from \(H_0 : \mu_1 = ... = \mu_r\). However, \(\sum_{i=1}^{r}n_i(\overline{Y}_{i.} - \overline{Y}_{..})^2\) is not an unbiased estimator of \(\sum_{i=1}^{r}n_i(\mu_i - \mu_.)^2\). In fact

\[E[\sum_{i=1}^{r}n_i(\overline{Y}_{i.} - \overline{Y}_{..})^2] = (r - 1)\sigma^2 + \sum_{i=1}^{r}n_i(\mu_i - \mu_.)^2.\]

Nevertheless, we can compare the magnitude of \(\sum_{i=1}^{r}n_i(\overline{Y}_{i.} - \overline{Y}_{..})^2\) to that of \(\sigma^2\) to decide whether the deviation is large or not.
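A sketch of where this expectation comes from, using only the model and its assumptions. The symbols \(\overline{\epsilon}_{i.} = \frac{1}{n_i}\sum_{j=1}^{n_i}\epsilon_{ij}\) and \(\overline{\epsilon}_{..} = \frac{1}{n_T}\sum_{i=1}^{r}n_i\overline{\epsilon}_{i.}\) are introduced here only for this illustration. Since \(\overline{Y}_{i.} = \mu_i + \overline{\epsilon}_{i.}\) and \(\overline{Y}_{..} = \mu_. + \overline{\epsilon}_{..}\),

\[\overline{Y}_{i.} - \overline{Y}_{..} = (\mu_i - \mu_.) + (\overline{\epsilon}_{i.} - \overline{\epsilon}_{..}).\]

After squaring, the cross term has expectation zero because \(E(\overline{\epsilon}_{i.} - \overline{\epsilon}_{..}) = 0\), so

\[E\left[\sum_{i=1}^{r}n_i(\overline{Y}_{i.} - \overline{Y}_{..})^2\right] = \sum_{i=1}^{r}n_i(\mu_i - \mu_.)^2 + E\left[\sum_{i=1}^{r}n_i(\overline{\epsilon}_{i.} - \overline{\epsilon}_{..})^2\right].\]

Using \(\sum_{i=1}^{r}n_i\overline{\epsilon}_{i.} = n_T\overline{\epsilon}_{..}\), \(Var(\overline{\epsilon}_{i.}) = \sigma^2/n_i\) and \(Var(\overline{\epsilon}_{..}) = \sigma^2/n_T\), the last term is

\[E\left[\sum_{i=1}^{r}n_i\overline{\epsilon}_{i.}^2 - n_T\overline{\epsilon}_{..}^2\right] = \sum_{i=1}^{r}n_i\frac{\sigma^2}{n_i} - n_T\frac{\sigma^2}{n_T} = (r - 1)\sigma^2.\]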

Decomposition of Total Sum of Squares

Write

\[Y_{ij} - \overline{Y}_{..} = (Y_{ij} - \overline{Y}_{i.}) + (\overline{Y}_{i.} - \overline{Y}_{..})\]

\(Y_{ij} - \overline{Y}_{..}\) : deviation of the response from the overall mean;
\(\overline{Y}_{i.} - \overline{Y}_{..}\) : deviation of the \(i\)-th factor level mean from the overall mean;
\(Y_{ij} - \overline{Y}_{i.}\) : deviation of the response from the corresponding factor level mean (residual).

Then the ANOVA decomposition of the sum of squares:

\[\sum_{i=1}^{r}\sum_{j=1}^{n_i}(Y_{ij} - \overline{Y}_{..})^2 = \sum_{i=1}^{r}\sum_{j=1}^{n_i}(Y_{ij} - \overline{Y}_{i.})^2 + \sum_{i=1}^{r}n_i(\overline{Y}_{i.} - \overline{Y}_{..})^2.\]

This can be expressed as

\[SSTO = SSE + SSTR\]

where \(SSTO = \sum_{i=1}^{r}\sum_{j=1}^{n_i} (Y_{ij} - \overline{Y}_{..})^2\) is the Total Sum of Squares, \(SSE = \sum_{i=1}^{r}\sum_{j=1}^{n_i} (Y_{ij} - \overline{Y}_{i.})^2\) is the Error Sum of Squares, and \(SSTR = \sum_{i=1}^{r}n_i(\overline{Y}_{i.} - \overline{Y}_{..})^2\) is the Treatment Sum of Squares.
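A brief sketch of why no cross-product term appears in the decomposition: when \(Y_{ij} - \overline{Y}_{..} = (Y_{ij} - \overline{Y}_{i.}) + (\overline{Y}_{i.} - \overline{Y}_{..})\) is squared and summed over \(j\) within the \(i\)-th factor level, the cross term is

\[2(\overline{Y}_{i.} - \overline{Y}_{..})\sum_{j=1}^{n_i}(Y_{ij} - \overline{Y}_{i.}) = 2(\overline{Y}_{i.} - \overline{Y}_{..})(Y_{i.} - n_i\overline{Y}_{i.}) = 0,\]

since \(Y_{i.} = n_i\overline{Y}_{i.}\). Only the two squared terms remain, and \(\sum_{j=1}^{n_i}(\overline{Y}_{i.} - \overline{Y}_{..})^2 = n_i(\overline{Y}_{i.} - \overline{Y}_{..})^2\) because the summand does not depend on \(j\).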

Interpretation of the decomposition \(SSTO = SSE + SSTR\)

SSTO: A measure of the overall variability among the responses.
SSTR: A measure of the variability among the factor level means. The more similar the factor level means are, the smaller is the SSTR.
SSE: A measure of the random variation of the responses around their corresponding factor level means. The smaller the error variance is, the smaller the SSE tends to be.
Overall variability is the sum of the variability due to difference in treatments and that due to random fluctuations.

For the study on the effect of package design on sales volume

Refer to Table 1. Based on the information there:

\(SSTO = (11 - 18.63)^2 + (17 - 18.63)^2 + ... + (28 - 18.63)^2 = 746.42\)

\(SSTR = 5(14.6 - 18.63)^2 + 5(13.4 - 18.63)^2 + 4(19.5 - 18.63)^2 + 5(27.2 - 18.63)^2 = 588.22\)

\(SSE = [(11 - 14.6)^2 + ... + (15 - 14.6)^2] + ... + [(27 - 27.2)^2 + ... + (28 - 27.2)^2] = 158.20.\)
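These three sums of squares can be recomputed directly from the Table 1 data. The sketch below uses illustrative variable names and the unrounded overall mean \(354/19\) rather than 18.63, so intermediate terms may differ slightly from a hand calculation while the totals agree to two decimal places.

```python
# Sales data from Table 1 (the missing D3 observation is omitted).
sales = {
    "D1": [11, 17, 16, 14, 15],
    "D2": [12, 10, 15, 19, 11],
    "D3": [23, 20, 18, 17],
    "D4": [27, 33, 22, 26, 28],
}

all_obs = [y for ys in sales.values() for y in ys]
n_total = len(all_obs)
grand_mean = sum(all_obs) / n_total
group_means = {d: sum(ys) / len(ys) for d, ys in sales.items()}

# Total sum of squares: every response against the overall mean.
ssto = sum((y - grand_mean) ** 2 for y in all_obs)

# Treatment sum of squares: group means against the overall mean, weighted by n_i.
sstr = sum(len(ys) * (group_means[d] - grand_mean) ** 2 for d, ys in sales.items())

# Error sum of squares: every response against its own group mean.
sse = sum((y - group_means[d]) ** 2 for d, ys in sales.items() for y in ys)

print(f"SSTO = {ssto:.2f}, SSTR = {sstr:.2f}, SSE = {sse:.2f}")
print(f"SSE + SSTR = {sse + sstr:.2f}")
```

The last line confirms numerically that the error and treatment sums of squares add up to the total sum of squares.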

Contributors

Scott Brunstein (UCD)
Debashis Paul (UCD)
