Analysis of Variance
1. A single factor study (continued)
A food company wanted to test four different package designs for a new breakfast cereal. Twenty stores with approximately the same sales conditions (such as sales volume, price, etc.) were selected as experimental units. Five stores were randomly assigned to each of the four package designs:
- A balanced, completely randomized design.
- A single, 4-level, qualitative factor: package design.
- A quantitative response variable: sales, the number of packets of cereal sold during the period of study.
- Goal: explore the relationship between package design and sales.
1.1 ANOVA for single factor study
A simple statistical model for the data is as follows:
\[ Y_{ij} = \mu_i + \epsilon_{ij}, \qquad j = 1, \ldots, n_i; \quad i = 1, \ldots, r, \]
where
- \(r\) is the number of factor levels (treatments) and \(n_i\) is the number of experimental units corresponding to the \(i\)-th factor level;
- \(Y_{ij}\) is the measurement for the \(j\)-th experimental unit corresponding to the \(i\)-th factor level;
- \(\mu_i\) is the mean of all the measurements corresponding to the \(i\)-th factor level (unknown);
- the \(\epsilon_{ij}\)'s are random errors (unobserved).
1.2 Model assumptions
The following assumptions are made about the above model:
- The \(\epsilon_{ij}\) are independently and identically distributed as \(N(0, \sigma^2)\).
- The \(\mu_i\)'s are unknown fixed parameters (so-called fixed effects), so that \(E(Y_{ij}) = \mu_i\) and \(\mathrm{Var}(Y_{ij}) = \sigma^2\). The first assumption is thus equivalent to assuming that the \(Y_{ij}\) are independently distributed as \(N(\mu_i, \sigma^2)\).
1.3 Estimation of μi
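Define the sample mean for the \(i\)-th factor level:
\[ \bar{Y}_{i\cdot} = \frac{1}{n_i} \sum_{j=1}^{n_i} Y_{ij} = \frac{1}{n_i} Y_{i\cdot}, \]
where \(Y_{i\cdot} = \sum_{j=1}^{n_i} Y_{ij}\) is the sum of responses for the \(i\)-th treatment group, for \(i = 1, \ldots, r\); and the overall sample mean:
\[ \bar{Y}_{\cdot\cdot} = \frac{1}{n_T} \sum_{i=1}^{r} \sum_{j=1}^{n_i} Y_{ij} = \frac{1}{n_T} \sum_{i=1}^{r} n_i \bar{Y}_{i\cdot} = \frac{Y_{\cdot\cdot}}{n_T}, \]
where \(n_T = \sum_{i=1}^{r} n_i\) is the total number of observations and \(Y_{\cdot\cdot} = \sum_{i=1}^{r} \sum_{j=1}^{n_i} Y_{ij}\) is the sum of all responses. Then \(\bar{Y}_{i\cdot}\) is an estimate of \(\mu_i\) for each \(i = 1, \ldots, r\). Under the assumptions, \(\bar{Y}_{i\cdot}\) is an unbiased estimator of \(\mu_i\) since
\[ E(\bar{Y}_{i\cdot}) = \frac{1}{n_i} \sum_{j=1}^{n_i} E(Y_{ij}) = \frac{1}{n_i} \sum_{j=1}^{n_i} \mu_i = \mu_i. \]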
| Packaging design | S1 | S2 | S3 | S4 | S5 | Total \(Y_{i\cdot}\) | Mean \(\bar{Y}_{i\cdot}\) | \(n_i\) |
|---|---|---|---|---|---|---|---|---|
| D1 | 11 | 17 | 16 | 14 | 15 | 73 | 14.6 | 5 |
| D2 | 12 | 10 | 15 | 19 | 11 | 67 | 13.4 | 5 |
| D3 | 23 | 20 | 18 | 17 | Missing | 78 | 19.5 | 4 |
| D4 | 27 | 33 | 22 | 26 | 28 | 136 | 27.2 | 5 |
| Total | | | | | | \(Y_{\cdot\cdot} = 354\) | \(\bar{Y}_{\cdot\cdot} = 18.63\) | \(n_T = 19\) |
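The group totals, group means and overall mean in the table can be reproduced with a few lines of code. The following is a minimal Python sketch; the dictionary `sales` and the helper names `group_summary` and `overall_mean` are illustrative choices, not part of the original notes.

```python
# Sales data from the table; design D3 has one missing store.
sales = {
    "D1": [11, 17, 16, 14, 15],
    "D2": [12, 10, 15, 19, 11],
    "D3": [23, 20, 18, 17],
    "D4": [27, 33, 22, 26, 28],
}


def group_summary(data):
    """Return {level: (total, mean, n)} for each factor level."""
    return {
        level: (sum(obs), sum(obs) / len(obs), len(obs))
        for level, obs in data.items()
    }


def overall_mean(data):
    """Overall sample mean: the sum of all responses divided by n_T."""
    n_total = sum(len(obs) for obs in data.values())
    grand_total = sum(sum(obs) for obs in data.values())
    return grand_total / n_total


summary = group_summary(sales)
for level, (total, mean, n) in summary.items():
    print(f"{level}: total={total}, mean={mean:.1f}, n={n}")
print(f"overall mean = {overall_mean(sales):.2f}")
```

Because the group sizes differ, the overall mean is the mean of all 19 responses, which is not the same as the simple average of the four group means.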
1.4 Comparison of factor level means
We want to check for deviations from the null hypothesis \(H_0: \mu_1 = \cdots = \mu_r\); the alternative hypothesis is \(H_a\): not all \(\mu_i\)'s are equal.
- Idea 1: A baseline value for comparison is the overall (weighted) mean:
\[ \mu_\cdot = \frac{\sum_{i=1}^{r} n_i \mu_i}{n_T}. \]
- Idea 2: Calculate squared deviations from the overall mean for each factor level:
\[ (\mu_1 - \mu_\cdot)^2, \ldots, (\mu_r - \mu_\cdot)^2. \]
Under \(H_0: \mu_1 = \cdots = \mu_r\), these deviations are all zero.
- Idea 3: Use the weighted sum of the above deviations as an overall measure of the deviation from \(H_0: \mu_1 = \cdots = \mu_r\):
\[ \sum_{i=1}^{r} n_i (\mu_i - \mu_\cdot)^2. \]
The weight of the \(i\)-th treatment group is its sample size \(n_i\); that is, the more data a group has, the more it counts.
Estimators
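Estimate the population means by their sample counterparts:
\[ \bar{Y}_{1\cdot} \to \mu_1, \; \ldots, \; \bar{Y}_{r\cdot} \to \mu_r \]
and
\[ \bar{Y}_{\cdot\cdot} = \frac{1}{n_T} \sum_{i=1}^{r} n_i \bar{Y}_{i\cdot} \to \mu_\cdot. \]
Thus,
\[ \sum_{i=1}^{r} n_i (\bar{Y}_{i\cdot} - \bar{Y}_{\cdot\cdot})^2 \]
is a statistic that measures the deviation from \(H_0: \mu_1 = \cdots = \mu_r\). However, \(\sum_{i=1}^{r} n_i (\bar{Y}_{i\cdot} - \bar{Y}_{\cdot\cdot})^2\) is not an unbiased estimator of \(\sum_{i=1}^{r} n_i (\mu_i - \mu_\cdot)^2\). In fact,
\[ E\left[ \sum_{i=1}^{r} n_i (\bar{Y}_{i\cdot} - \bar{Y}_{\cdot\cdot})^2 \right] = (r-1)\sigma^2 + \sum_{i=1}^{r} n_i (\mu_i - \mu_\cdot)^2. \]
Nevertheless, we can compare the magnitude of \(\sum_{i=1}^{r} n_i (\bar{Y}_{i\cdot} - \bar{Y}_{\cdot\cdot})^2\) to that of \(\sigma^2\) to decide whether the deviation is large or not.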
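The expectation stated above is given without proof. One way to see where the extra \((r-1)\sigma^2\) term comes from is the short derivation below. It is a sketch that relies only on the model assumptions of Section 1.2 (independent errors with common variance \(\sigma^2\)) and on the identities \(\sum_i n_i \bar{Y}_{i\cdot} = n_T \bar{Y}_{\cdot\cdot}\) and \(\sum_i n_i \mu_i = n_T \mu_\cdot\).

```latex
\begin{align*}
\sum_{i=1}^{r} n_i (\bar{Y}_{i\cdot} - \bar{Y}_{\cdot\cdot})^2
  &= \sum_{i=1}^{r} n_i \bar{Y}_{i\cdot}^2 - n_T \bar{Y}_{\cdot\cdot}^2, \\
E(\bar{Y}_{i\cdot}^2)
  &= \mathrm{Var}(\bar{Y}_{i\cdot}) + \mu_i^2 = \frac{\sigma^2}{n_i} + \mu_i^2, \\
E(\bar{Y}_{\cdot\cdot}^2)
  &= \mathrm{Var}(\bar{Y}_{\cdot\cdot}) + \mu_\cdot^2 = \frac{\sigma^2}{n_T} + \mu_\cdot^2, \\
E\left[ \sum_{i=1}^{r} n_i (\bar{Y}_{i\cdot} - \bar{Y}_{\cdot\cdot})^2 \right]
  &= r\sigma^2 + \sum_{i=1}^{r} n_i \mu_i^2 - \sigma^2 - n_T \mu_\cdot^2
   = (r-1)\sigma^2 + \sum_{i=1}^{r} n_i (\mu_i - \mu_\cdot)^2.
\end{align*}
```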
Decomposition of Total Sum of Squares
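Write
\[ Y_{ij} - \bar{Y}_{\cdot\cdot} = (Y_{ij} - \bar{Y}_{i\cdot}) + (\bar{Y}_{i\cdot} - \bar{Y}_{\cdot\cdot}), \]
where
- \(Y_{ij} - \bar{Y}_{\cdot\cdot}\): deviation of the response from the overall mean;
- \(\bar{Y}_{i\cdot} - \bar{Y}_{\cdot\cdot}\): deviation of the \(i\)-th factor level mean from the overall mean;
- \(Y_{ij} - \bar{Y}_{i\cdot}\): deviation of the response from the corresponding factor level mean (residual).
Then the ANOVA decomposition of the sum of squares is
\[ \sum_{i=1}^{r} \sum_{j=1}^{n_i} (Y_{ij} - \bar{Y}_{\cdot\cdot})^2 = \sum_{i=1}^{r} \sum_{j=1}^{n_i} (Y_{ij} - \bar{Y}_{i\cdot})^2 + \sum_{i=1}^{r} n_i (\bar{Y}_{i\cdot} - \bar{Y}_{\cdot\cdot})^2. \]
This can be expressed as
\[ SSTO = SSE + SSTR, \]
where \(SSTO = \sum_{i=1}^{r} \sum_{j=1}^{n_i} (Y_{ij} - \bar{Y}_{\cdot\cdot})^2\) is the Total Sum of Squares, \(SSE = \sum_{i=1}^{r} \sum_{j=1}^{n_i} (Y_{ij} - \bar{Y}_{i\cdot})^2\) is the Error Sum of Squares, and \(SSTR = \sum_{i=1}^{r} n_i (\bar{Y}_{i\cdot} - \bar{Y}_{\cdot\cdot})^2\) is the Treatment Sum of Squares.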
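The decomposition holds because the cross term vanishes when the pointwise identity is squared and summed. The step is not spelled out in the notes; the following sketch fills it in using only the definition of \(\bar{Y}_{i\cdot}\).

```latex
\begin{align*}
\sum_{i=1}^{r} \sum_{j=1}^{n_i} (Y_{ij} - \bar{Y}_{\cdot\cdot})^2
  &= \sum_{i=1}^{r} \sum_{j=1}^{n_i} (Y_{ij} - \bar{Y}_{i\cdot})^2
   + \sum_{i=1}^{r} n_i (\bar{Y}_{i\cdot} - \bar{Y}_{\cdot\cdot})^2
   + 2 \sum_{i=1}^{r} (\bar{Y}_{i\cdot} - \bar{Y}_{\cdot\cdot})
       \sum_{j=1}^{n_i} (Y_{ij} - \bar{Y}_{i\cdot}), \\
\sum_{j=1}^{n_i} (Y_{ij} - \bar{Y}_{i\cdot})
  &= Y_{i\cdot} - n_i \bar{Y}_{i\cdot} = 0
  \quad \text{for each } i = 1, \ldots, r.
\end{align*}
```

Since the inner sum of residuals is zero within every factor level, the last term on the first line drops out.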
Interpretation of the decomposition
- SSTO: A measure of the overall variability among the responses.
- SSTR: A measure of the variability among the factor level means. The more similar the factor level means are, the smaller is the SSTR.
- SSE: A measure of the random variation of the responses around their corresponding factor level means. The smaller the error variance is, the smaller the SSE tends to be.
- Overall variability is the sum of the variability due to difference in treatments and that due to random fluctuations.
For the study on the effect of package design on sales volume
Refer to the table in Section 1.3. Based on the information there:
\[ SSTO = (11 - 18.63)^2 + (17 - 18.63)^2 + \cdots + (28 - 18.63)^2 = 746.42, \]
\[ SSTR = 5(14.6 - 18.63)^2 + 5(13.4 - 18.63)^2 + 4(19.5 - 18.63)^2 + 5(27.2 - 18.63)^2 = 588.22, \]
\[ SSE = (11 - 14.6)^2 + \cdots + (15 - 14.6)^2 + \cdots + (27 - 27.2)^2 + \cdots + (28 - 27.2)^2 = 158.20. \]
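These three sums of squares can be checked numerically. The sketch below recomputes them from the raw sales figures and confirms that SSTO = SSE + SSTR up to floating-point error. The names `sales` and `anova_sums_of_squares` are hypothetical and chosen for illustration only.

```python
# Sales data from the package design study; D3 has one missing store.
sales = {
    "D1": [11, 17, 16, 14, 15],
    "D2": [12, 10, 15, 19, 11],
    "D3": [23, 20, 18, 17],
    "D4": [27, 33, 22, 26, 28],
}


def anova_sums_of_squares(data):
    """Return (SSTO, SSTR, SSE) for a single-factor layout."""
    all_obs = [y for obs in data.values() for y in obs]
    grand_mean = sum(all_obs) / len(all_obs)

    # Total variability of the responses around the overall mean.
    ssto = sum((y - grand_mean) ** 2 for y in all_obs)

    sstr = 0.0
    sse = 0.0
    for obs in data.values():
        level_mean = sum(obs) / len(obs)
        # Between-level variability, weighted by the group size n_i.
        sstr += len(obs) * (level_mean - grand_mean) ** 2
        # Within-level variability around the factor level mean.
        sse += sum((y - level_mean) ** 2 for y in obs)
    return ssto, sstr, sse


ssto, sstr, sse = anova_sums_of_squares(sales)
print(f"SSTO = {ssto:.2f}, SSTR = {sstr:.2f}, SSE = {sse:.2f}")
print(f"SSE + SSTR = {sse + sstr:.2f}")
```

The script uses the unrounded overall mean 354/19 rather than 18.63, so intermediate terms differ slightly from the hand calculation, but the totals agree to two decimal places.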
Contributors
- Scott Brunstein (UCD)
- Debashis Paul (UCD)