# Analysis of Variance

## 1. A single factor study (continued)

A food company wanted to test four different package designs for a new breakfast cereal. Twenty stores with approximately the same sales conditions (such as sales volume, price, etc.) were selected as experimental units. Five stores were randomly assigned to each of the 4 package designs:

- A balanced complete randomized design.
- A single, 4-level, qualitative factor: package design.
- A quantitative response variable: sales - number of packets of cereal sold during the period of study.
- Goal: explore the relationship between package design and sales.

### 1.1 ANOVA for single factor study

A simple statistical model for the data is as follows:

\[Y_{ij} = \mu_i + \epsilon_{ij}, j = 1,...,n_i; i = 1,...,r;\]

where

- \(r\) is the number of factor levels (treatments) and \(n_i\) is the number of experimental units corresponding to the \(i\)-th factor level;
- \(Y_{ij}\) is the measurement for the \(j\)-th experimental unit corresponding to the \(i\)-th factor level;
- \(\mu_i\) is the mean of all the measurements corresponding to the \(i\)-th factor level (unknown);
- \(\epsilon_{ij}\)'s are random errors (unobserved).

### 1.2 Model assumptions

The following assumptions are made about the above model:

- \(\epsilon_{ij}\) are independently and identically distributed as \(N(0, \sigma^2)\).
- \(\mu_i\)'s are unknown fixed parameters (so-called *fixed effects*), so that \(E(Y_{ij}) = \mu_i\) and \(Var(Y_{ij}) = \sigma^2\). The above assumption is thus equivalent to assuming that \(Y_{ij}\) are independently distributed as \(N(\mu_i, \sigma^2)\).

### 1.3 Estimation of \(\mu_i\)

Define the sample mean for the \(i\)-th factor level:

\[\overline{Y}_{i.} = \frac{1}{n_i}\sum_{j = 1}^{n_i}Y_{ij} = \frac{1}{n_i}Y_{i.}\]

where \(Y_{i.} = \sum_{j=1}^{n_i}Y_{ij}\) is the sum of responses for the \(i\)-th treatment group, for \(i = 1,...,r\); and the overall sample mean:

\[\overline{Y}_{..} = \frac{1}{n_T}\sum_{i=1}^{r}\sum_{j=1}^{n_i}Y_{ij} = \frac{1}{n_T}\sum_{i=1}^{r}n_{i}\overline{Y}_{i.} = \frac{Y_{..}}{n_T},\]

where \(n_T = \sum_{i=1}^{r}n_i\) is the total number of observations and \(Y_{..} = \sum_{i=1}^{r}Y_{i.}\) is the sum of all responses. Then \(\overline{Y}_{i.}\) is an estimate of \(\mu_i\) for each \(i = 1,...,r\). Under the assumptions, \(\overline{Y}_{i.}\) is an *unbiased estimator* of \(\mu_i\) since

\[E(\overline{Y}_{i.}) = \frac{1}{n_i}\sum_{j=1}^{n_i}E(Y_{ij}) = \frac{1}{n_i}\sum_{j=1}^{n_i}\mu_i = \mu_i.\]

| Packaging design | S1 | S2 | S3 | S4 | S5 | \(Y_{i.}\) | \(\overline{Y}_{i.}\) | \(n_i\) |
|---|---|---|---|---|---|---|---|---|
| D1 | 11 | 17 | 16 | 14 | 15 | 73 | 14.6 | 5 |
| D2 | 12 | 10 | 15 | 19 | 11 | 67 | 13.4 | 5 |
| D3 | 23 | 20 | 18 | 17 | missing | 78 | 19.5 | 4 |
| D4 | 27 | 33 | 22 | 26 | 28 | 136 | 27.2 | 5 |
| Total | | | | | | \(Y_{..} = 354\) | \(\overline{Y}_{..} = 18.63\) | \(n_T = 19\) |
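The summary columns of the table can be reproduced with a few lines of Python. This is only a minimal sketch: the dictionary `sales` and the other variable names are illustrative choices, not part of the original notes, and the missing store for design D3 is simply left out of its list.

```python
# Sales data from the table above; design D3 has only four stores
# because one observation is missing.
sales = {
    "D1": [11, 17, 16, 14, 15],
    "D2": [12, 10, 15, 19, 11],
    "D3": [23, 20, 18, 17],
    "D4": [27, 33, 22, 26, 28],
}

# n_i, Y_i. and Ybar_i. for each factor level
group_sizes = {d: len(ys) for d, ys in sales.items()}
group_totals = {d: sum(ys) for d, ys in sales.items()}
group_means = {d: group_totals[d] / group_sizes[d] for d in sales}

# n_T, Y.. and Ybar..
n_total = sum(group_sizes.values())
grand_total = sum(group_totals.values())
grand_mean = grand_total / n_total

for d in sales:
    print(f"{d}: total={group_totals[d]}, mean={group_means[d]:.2f}, n={group_sizes[d]}")
print(f"Overall: total={grand_total}, mean={grand_mean:.2f}, n={n_total}")
```

Note that the overall mean is the \(n_i\)-weighted average of the group means, not their simple average, because the group sizes differ.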

### 1.4 Comparison of factor level means

We want to check for deviations from the null hypothesis \(H_0 : \mu_1 = ... = \mu_r,\) i.e., the alternative hypothesis is \(H_a:\) not all \(\mu_i\)'s are equal.

**Idea 1:** A baseline value for comparison is the overall (weighted) mean:

\[\mu_{.} = \frac{\sum_{i=1}^{r}n_i\mu_i}{n_T}.\]

**Idea 2:** Calculate squared deviations from the overall mean for each factor level:

\[(\mu_1 - \mu_.)^2,...,(\mu_r - \mu_.)^2.\]

Under \(H_0 : \mu_1 = ... = \mu_r,\) these deviations are all zero.

**Idea 3:** Use the weighted sum of the above deviations as an overall measurement of the deviation from \(H_0: \mu_1 = ... = \mu_r:\)

\[\sum_{i=1}^{r}n_i(\mu_i - \mu_.)^2\]

The weight of the *i*-th treatment group is its sample size \(n_i\), i.e., the more data, the more importance.

## Estimators

Estimate the population means by their sample counterparts:

\[\overline{Y}_{1.} \rightarrow \mu_1,...,\overline{Y}_{r.} \rightarrow \mu_r\]

and

\[\overline{Y}_{..} = \frac{1}{n_T}\sum_{i=1}^{r}n_i\overline{Y}_{i.} \rightarrow \mu_.\]

Thus,

\[\sum_{i=1}^{r}n_i(\overline{Y}_{i.} - \overline{Y}_{..})^2\]

is a statistic to measure the deviation from \(H_0 : \mu_1 = ... = \mu_r\). However, \(\sum_{i=1}^{r}n_i(\overline{Y}_{i.} - \overline{Y}_{..})^2\) is *not* an unbiased estimator of \(\sum_{i=1}^{r}n_i(\mu_i - \mu_.)^2\). In fact

\[E[\sum_{i=1}^{r}n_i(\overline{Y}_{i.} - \overline{Y}_{..})^2] = (r - 1)\sigma^2 + \sum_{i=1}^{r}n_i(\mu_i - \mu_.)^2.\]

Nevertheless, we can compare the magnitude of \(\sum_{i=1}^{r}n_i(\overline{Y}_{i.} - \overline{Y}_{..})^2\) to that of \(\sigma^2\) to decide whether the deviation is large or not.
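The expectation formula can be checked with a short derivation. The sketch below relies only on the model assumptions, which give \(E(\overline{Y}_{i.}) = \mu_i\), \(Var(\overline{Y}_{i.}) = \sigma^2/n_i\), \(E(\overline{Y}_{..}) = \mu_.\) and \(Var(\overline{Y}_{..}) = \sigma^2/n_T\), together with the identity \(E(X^2) = Var(X) + [E(X)]^2\):

\[\begin{aligned}
\sum_{i=1}^{r}n_i(\overline{Y}_{i.} - \overline{Y}_{..})^2 &= \sum_{i=1}^{r}n_i\overline{Y}_{i.}^2 - n_T\overline{Y}_{..}^2, \\
E\left[\sum_{i=1}^{r}n_i\overline{Y}_{i.}^2\right] &= \sum_{i=1}^{r}n_i\left(\frac{\sigma^2}{n_i} + \mu_i^2\right) = r\sigma^2 + \sum_{i=1}^{r}n_i\mu_i^2, \\
E\left[n_T\overline{Y}_{..}^2\right] &= n_T\left(\frac{\sigma^2}{n_T} + \mu_.^2\right) = \sigma^2 + n_T\mu_.^2, \\
E\left[\sum_{i=1}^{r}n_i(\overline{Y}_{i.} - \overline{Y}_{..})^2\right] &= (r - 1)\sigma^2 + \sum_{i=1}^{r}n_i\mu_i^2 - n_T\mu_.^2 = (r - 1)\sigma^2 + \sum_{i=1}^{r}n_i(\mu_i - \mu_.)^2.
\end{aligned}\]

The last equality uses \(\sum_{i=1}^{r}n_i\mu_i = n_T\mu_.\). Under \(H_0\) the second term vanishes, so the statistic has expectation \((r - 1)\sigma^2\).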

## Decomposition of Total Sum of Squares

Write

\[Y_{ij} - \overline{Y}_{..} = (Y_{ij} - \overline{Y}_{i.}) + (\overline{Y}_{i.} - \overline{Y}_{..})\]

- \(Y_{ij} - \overline{Y}_{..}\) : deviation of the response from the overall mean;
- \(\overline{Y}_{i.} - \overline{Y}_{..}\) : deviation of the *i*-th factor level mean from the overall mean;
- \(Y_{ij} - \overline{Y}_{i.}\) : deviation of the response from the corresponding factor level mean (**residual**).

Then the ANOVA decomposition of the sum of squares:

\[\sum_{i=1}^{r}\sum_{j=1}^{n_i}(Y_{ij} - \overline{Y}_{..})^2 = \sum_{i=1}^{r}\sum_{j=1}^{n_i}(Y_{ij} - \overline{Y}_{i.})^2 + \sum_{i=1}^{r}n_i(\overline{Y}_{i.} - \overline{Y}_{..})^2.\]

This can be expressed as

\[SSTO = SSE + SSTR\]

where \(SSTO = \sum_{i=1}^{r}\sum_{j=1}^{n_i} (y_{ij} - \overline{y}_{..})^2\) is the *Total Sum of Squares;* \(SSE = \sum_{i=1}^{r}\sum_{j=1}^{n_i} (y_{ij} - \overline{y}_{i.})^2\) is the *Error Sum of Squares* and \(SSTR = \sum_{i=1}^{r}n_i(\overline{y}_{i.} - \overline{y}_{..})^2\) is the *Treatment Sum of Squares*.
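The reason no cross term appears in the decomposition can be sketched as follows. Squaring \(Y_{ij} - \overline{Y}_{..} = (Y_{ij} - \overline{Y}_{i.}) + (\overline{Y}_{i.} - \overline{Y}_{..})\) and summing over \(i\) and \(j\) produces the two sums of squares plus a cross term, and that cross term is zero because the residuals within each factor level sum to zero:

\[\begin{aligned}
2\sum_{i=1}^{r}\sum_{j=1}^{n_i}(Y_{ij} - \overline{Y}_{i.})(\overline{Y}_{i.} - \overline{Y}_{..}) &= 2\sum_{i=1}^{r}(\overline{Y}_{i.} - \overline{Y}_{..})\sum_{j=1}^{n_i}(Y_{ij} - \overline{Y}_{i.}) \\
&= 2\sum_{i=1}^{r}(\overline{Y}_{i.} - \overline{Y}_{..})(Y_{i.} - n_i\overline{Y}_{i.}) = 0.
\end{aligned}\]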

**Interpretation of the decomposition \(SSTO = SSE + SSTR\)**

- SSTO: A measure of the overall variability among the responses.
- SSTR: A measure of the variability among the factor level means. The more similar the factor level means are, the smaller is the SSTR.
- SSE: A measure of the random variation of the responses around their corresponding factor level means. The smaller the error variance is, the smaller the SSE tends to be.
- Overall variability is the sum of the variability due to difference in treatments and that due to random fluctuations.

**For the study on the effect of package design on sales volume**

Refer to the data table in Section 1.3. Based on the information there:

\(SSTO = (11 - 18.63)^2 + (17 - 18.63)^2 + ... + (28 - 18.63)^2 = 746.42\)

\(SSTR = 5(14.6 - 18.63)^2 + 5(13.4 - 18.63)^2 + 4(19.5 - 18.63)^2 + 5(27.2 - 18.63)^2 = 588.22\)

\(SSE = [(11 - 14.6)^2 + ... + (15 - 14.6)^2] + ... + [(27 - 27.2)^2 + ... + (28 - 27.2)^2] = 158.20.\)
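The three sums of squares can also be computed directly from the raw data, which gives a numerical check of the decomposition. The following is a minimal sketch in plain Python; the names `sales`, `ssto`, `sstr` and `sse` are illustrative and not part of the original notes, and the unrounded overall mean is used rather than 18.63.

```python
# Sales data by package design; D3 has four stores (one observation missing).
sales = {
    "D1": [11, 17, 16, 14, 15],
    "D2": [12, 10, 15, 19, 11],
    "D3": [23, 20, 18, 17],
    "D4": [27, 33, 22, 26, 28],
}

all_obs = [y for ys in sales.values() for y in ys]
grand_mean = sum(all_obs) / len(all_obs)
group_means = {d: sum(ys) / len(ys) for d, ys in sales.items()}

# Total sum of squares: responses around the overall mean
ssto = sum((y - grand_mean) ** 2 for y in all_obs)

# Treatment sum of squares: group means around the overall mean, weighted by n_i
sstr = sum(len(ys) * (group_means[d] - grand_mean) ** 2 for d, ys in sales.items())

# Error sum of squares: responses around their own group mean
sse = sum((y - group_means[d]) ** 2 for d, ys in sales.items() for y in ys)

print(f"SSTO = {ssto:.2f}, SSTR = {sstr:.2f}, SSE = {sse:.2f}")
print(f"SSTR + SSE = {sstr + sse:.2f}")
```

Up to floating-point rounding, `sstr + sse` agrees with `ssto`, as the decomposition requires.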

## Contributors

- Scott Brunstein (UCD)
- Debashis Paul (UCD)