
Analysis of Variance


    1. A single factor study (continued)

    A food company wanted to test four different package designs for a new breakfast cereal. Twenty stores with approximately the same sales conditions (such as sales volume and price) were selected as experimental units. Five stores were randomly assigned to each of the four package designs:

    • A balanced, completely randomized design.
    • A single, 4-level, qualitative factor: package design.
    • A quantitative response variable: sales, i.e., the number of packets of cereal sold during the period of study.
    • Goal: exploring the relationship between package design and sales.

    1.1 ANOVA for single factor study

    A simple statistical model for the data is as follows:

    \[Y_{ij} = \mu_i + \epsilon_{ij}, \quad j = 1,\ldots,n_i; \quad i = 1,\ldots,r,\]

    where

    • \(r\) is the number of factor levels (treatments) and \(n_i\) is the number of experimental units corresponding to the \(i\)-th factor level;
    • \(Y_{ij}\) is the measurement for the \(j\)-th experimental unit corresponding to the \(i\)-th factor level;
    • \(\mu_i\) is the (unknown) population mean of the response at the \(i\)-th factor level;
    • \(\epsilon_{ij}\)'s are random errors (unobserved).

    1.2 Model assumptions

    The following assumptions are made about the above model:

    • \(\epsilon_{ij}\) are independently and identically distributed as \(N(0, \sigma^2)\).
    • \(\mu_i\)'s are unknown fixed parameters (so-called fixed effects), so that \(E(Y_{ij}) = \mu_i\) and \(Var(Y_{ij}) = \sigma^2\). Together, these assumptions are equivalent to assuming that the \(Y_{ij}\) are independently distributed as \(N(\mu_i, \sigma^2)\).
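    To make the assumptions concrete, the following minimal sketch draws one data set from the model \(Y_{ij} = \mu_i + \epsilon_{ij}\) with independent \(N(0, \sigma^2)\) errors. The function name `simulate_one_way` and the values of `mu`, `n` and `sigma` are hypothetical choices for illustration only; they are not estimates from the study.

```python
import random


def simulate_one_way(mu, n, sigma, seed=0):
    """Draw Y_ij = mu_i + eps_ij with independent eps_ij ~ N(0, sigma^2).

    mu    : list of factor level means mu_1, ..., mu_r
    n     : list of group sizes n_1, ..., n_r
    sigma : common error standard deviation
    """
    rng = random.Random(seed)  # seeded generator for reproducibility
    return [[m + rng.gauss(0.0, sigma) for _ in range(n_i)]
            for m, n_i in zip(mu, n)]


# Hypothetical parameter values (assumptions, not taken from the data)
mu = [15.0, 13.0, 19.0, 27.0]
n = [5, 5, 4, 5]
sigma = 3.0

data = simulate_one_way(mu, n, sigma, seed=1)
for i, group in enumerate(data, start=1):
    print(f"level {i}: n_i = {len(group)}, "
          f"values = {[round(y, 1) for y in group]}")
```

    Every group shares the same error standard deviation `sigma`; only the mean `mu[i]` changes from one factor level to the next, which is exactly what the fixed effects model asserts.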

    1.3 Estimation of \(\mu_i\)

    Define the sample mean for the \(i\)-th factor level:

    \[\overline{Y}_{i.} = \frac{1}{n_i}\sum_{j = 1}^{n_i}Y_{ij} = \frac{1}{n_i}Y_{i.}\]

    where \(Y_{i.} = \sum_{j=1}^{n_i}Y_{ij}\) is the sum of responses for the \(i\)-th treatment group, for \(i = 1,...,r\); and the overall sample mean:

    \[\overline{Y}_{..} = \frac{1}{n_T}\sum_{i=1}^{r}\sum_{j=1}^{n_i}Y_{ij} = \frac{1}{n_T}\sum_{i=1}^{r}n_{i}\overline{Y}_{i.} = \frac{Y_{..}}{n_T},\]

    where \(n_T = \sum_{i=1}^{r}n_i\). Then \(\overline{Y}_{i.}\) is an estimate of \(\mu_i\) for each \(i = 1,...,r\). Under the assumptions, \(\overline{Y}_{i.}\) is an unbiased estimator of \(\mu_i\) since

    \[E(\overline{Y}_{i.}) = \frac{1}{n_i}\sum_{j=1}^{n_i}E(Y_{ij}) = \frac{1}{n_i}\sum_{j=1}^{n_i}\mu_i = \mu_i.\]

    Table 1: Data summary: packaging of breakfast cereals

    Packaging design   S1   S2   S3   S4   S5     Total \(Y_{i.}\)   Mean \(\overline{Y}_{i.}\)      \(n_i\)
    D1                 11   17   16   14   15     73                 14.6                            5
    D2                 12   10   15   19   11     67                 13.4                            5
    D3                 23   20   18   17   Miss   78                 19.5                            4
    D4                 27   33   22   26   28     136                27.2                            5
    Overall                                       \(Y_{..} = 354\)   \(\overline{Y}_{..} = 18.63\)   \(n_T = 19\)

    (S1 to S5 denote the stores; "Miss" marks a missing observation.)
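    The summary columns of Table 1 can be reproduced with a few lines of Python. This is a minimal sketch using only built-in functions; the variable names (`sales`, `totals`, `means`, and so on) are illustrative choices, not part of the original notes.

```python
# Sales data from Table 1 (the S5 observation for design D3 is missing)
sales = {
    "D1": [11, 17, 16, 14, 15],
    "D2": [12, 10, 15, 19, 11],
    "D3": [23, 20, 18, 17],
    "D4": [27, 33, 22, 26, 28],
}

totals = {d: sum(y) for d, y in sales.items()}      # Y_i.
sizes = {d: len(y) for d, y in sales.items()}       # n_i
means = {d: totals[d] / sizes[d] for d in sales}    # Ybar_i.

n_T = sum(sizes.values())                           # total sample size
grand_total = sum(totals.values())                  # Y_..
grand_mean = grand_total / n_T                      # Ybar_..

# The overall mean is the n_i-weighted average of the group means
weighted_mean = sum(sizes[d] * means[d] for d in sales) / n_T

for d in sales:
    print(f"{d}: total = {totals[d]}, mean = {means[d]:.1f}, n = {sizes[d]}")
print(f"overall: total = {grand_total}, mean = {grand_mean:.2f}, n_T = {n_T}")
```

    Because D3 has only four observations, the overall mean (about 18.63) differs slightly from the unweighted average of the four group means (18.675).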

    1.4 Comparison of factor level means

    We want to check for deviations from the null hypothesis \(H_0 : \mu_1 = ... = \mu_r\); the alternative hypothesis is \(H_a:\) not all \(\mu_i\)'s are equal.

    • Idea 1: A baseline value for comparison is the overall mean:

    \[\mu_{.} = \frac{\sum_{i=1}^{r}n_i\mu_i}{n_T}.\]

    • Idea 2: Calculate the squared deviation from the overall mean for each factor level:

    \[(\mu_1 - \mu_.)^2,...,(\mu_r - \mu_.)^2\].

    Under \(H_0 : \mu_1 = ... = \mu_r,\) these deviations are all zero.

    • Idea 3: Use the weighted sum of the above deviations as an overall measurement of the deviation from \(H_0: \mu_1 = ... = \mu_r:\)

    \[\sum_{i=1}^{r}n_i(\mu_i - \mu_.)^2\]

    The weight of the \(i\)-th treatment group is its sample size \(n_i\), so groups with more observations contribute more to the overall measure.

    Estimators

    Estimate the population means by their sample counterparts:

    \[\overline{Y}_{1.} \rightarrow \mu_1,...,\overline{Y}_{r.} \rightarrow \mu_r\]

    and

    \[\overline{Y}_{..} = \frac{1}{n_T}\sum_{i=1}^{r}n_i\overline{Y}_{i.} \rightarrow \mu_.\]

    Thus,

    \[\sum_{i=1}^{r}n_i(\overline{Y}_{i.} - \overline{Y}_{..})^2\]

    is a statistic to measure the deviation from \(H_0 : \mu_1 = ... = \mu_r\). However, \(\sum_{i=1}^{r}n_i(\overline{Y}_{i.} - \overline{Y}_{..})^2\) is not an unbiased estimator of \(\sum_{i=1}^{r}n_i(\mu_i - \mu_.)^2\). In fact

    \[E[\sum_{i=1}^{r}n_i(\overline{Y}_{i.} - \overline{Y}_{..})^2] = (r - 1)\sigma^2 + \sum_{i=1}^{r}n_i(\mu_i - \mu_.)^2.\]

    Nevertheless, we can compare the magnitude of \(\sum_{i=1}^{r}n_i(\overline{Y}_{i.} - \overline{Y}_{..})^2\) to that of \(\sigma^2\) to decide whether the deviation is large or not.
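    The expectation formula above can be checked with a short calculation, sketched here under the model assumptions of Section 1.2. Since \(\sum_{i=1}^{r}n_i\overline{Y}_{i.} = n_T\overline{Y}_{..}\), the statistic can be rewritten as

    \[\sum_{i=1}^{r}n_i(\overline{Y}_{i.} - \overline{Y}_{..})^2 = \sum_{i=1}^{r}n_i\overline{Y}_{i.}^2 - n_T\overline{Y}_{..}^2.\]

    Using \(E(X^2) = Var(X) + [E(X)]^2\) together with \(E(\overline{Y}_{i.}) = \mu_i\), \(Var(\overline{Y}_{i.}) = \sigma^2/n_i\), \(E(\overline{Y}_{..}) = \mu_.\) and \(Var(\overline{Y}_{..}) = \sigma^2/n_T\), we get

    \[E\left[\sum_{i=1}^{r}n_i(\overline{Y}_{i.} - \overline{Y}_{..})^2\right] = \sum_{i=1}^{r}n_i\left(\frac{\sigma^2}{n_i} + \mu_i^2\right) - n_T\left(\frac{\sigma^2}{n_T} + \mu_.^2\right) = (r - 1)\sigma^2 + \sum_{i=1}^{r}n_i\mu_i^2 - n_T\mu_.^2,\]

    and the last two terms equal \(\sum_{i=1}^{r}n_i(\mu_i - \mu_.)^2\) because \(\sum_{i=1}^{r}n_i\mu_i = n_T\mu_.\).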

    Decomposition of Total Sum of Squares

    Write

    \[Y_{ij} - \overline{Y}_{..} = (Y_{ij} - \overline{Y}_{i.}) + (\overline{Y}_{i.} - \overline{Y}_{..})\]

    • \(Y_{ij} - \overline{Y}_{..}\) : deviation of the response from the overall mean;
    • \(\overline{Y}_{i.} - \overline{Y}_{..}\) : deviation of the i-th factor level mean from the overall mean;
    • \(Y_{ij} - \overline{Y}_{i.}\) : deviation of the response from the corresponding factor level mean (residual).

    Squaring both sides and summing over all observations, the cross-product terms sum to zero, which gives the ANOVA decomposition of the sum of squares:

    \[\sum_{i=1}^{r}\sum_{j=1}^{n_i}(Y_{ij} - \overline{Y}_{..})^2 = \sum_{i=1}^{r}\sum_{j=1}^{n_i}(Y_{ij} - \overline{Y}_{i.})^2 + \sum_{i=1}^{r}n_i(\overline{Y}_{i.} - \overline{Y}_{..})^2.\]

    This can be expressed as

    \[SSTO = SSE + SSTR\]

    where \(SSTO = \sum_{i=1}^{r}\sum_{j=1}^{n_i} (Y_{ij} - \overline{Y}_{..})^2\) is the Total Sum of Squares; \(SSE = \sum_{i=1}^{r}\sum_{j=1}^{n_i} (Y_{ij} - \overline{Y}_{i.})^2\) is the Error Sum of Squares; and \(SSTR = \sum_{i=1}^{r}n_i(\overline{Y}_{i.} - \overline{Y}_{..})^2\) is the Treatment Sum of Squares.
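    One way to see why the decomposition contains no cross-product term: squaring \(Y_{ij} - \overline{Y}_{..} = (Y_{ij} - \overline{Y}_{i.}) + (\overline{Y}_{i.} - \overline{Y}_{..})\) and summing over \(i\) and \(j\) produces, besides the two sums of squares, the term

    \[2\sum_{i=1}^{r}(\overline{Y}_{i.} - \overline{Y}_{..})\sum_{j=1}^{n_i}(Y_{ij} - \overline{Y}_{i.}) = 0,\]

    which vanishes because the residuals within each group sum to zero: \(\sum_{j=1}^{n_i}(Y_{ij} - \overline{Y}_{i.}) = Y_{i.} - n_i\overline{Y}_{i.} = 0\).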

    Interpretation of the decomposition

    • SSTO: A measure of the overall variability among the responses.
    • SSTR: A measure of the variability among the factor level means. The more similar the factor level means are, the smaller the SSTR is.
    • SSE: A measure of the random variation of the responses around their corresponding factor level means. The smaller the error variance is, the smaller the SSE tends to be.
    • Overall variability is the sum of the variability due to differences between treatments and the variability due to random fluctuations.

    For the study on the effect of package design on sales volume

    Refer to Table 1 (Section 1.3). Based on the information there:

    \(SSTO = (11 - 18.63)^2 + (17 - 18.63)^2 + ... + (28 - 18.63)^2 = 746.42\)

    \(SSTR = 5(14.6 - 18.63)^2 + 5(13.4 - 18.63)^2 + 4(19.5 - 18.63)^2 + 5(27.2 - 18.63)^2 = 588.22\)

    \(SSE = \left[(11 - 14.6)^2 + ... + (15 - 14.6)^2\right] + ... + \left[(27 - 27.2)^2 + ... + (28 - 27.2)^2\right] = 158.20.\)
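    The three sums of squares can be recomputed directly from the raw data. The sketch below uses only built-in Python; the helper name `anova_sums_of_squares` is a hypothetical choice for illustration. It works with the unrounded overall mean \(354/19\) rather than 18.63.

```python
# Sales by package design from Table 1 (D3 has one missing observation)
groups = [
    [11, 17, 16, 14, 15],   # D1
    [12, 10, 15, 19, 11],   # D2
    [23, 20, 18, 17],       # D3
    [27, 33, 22, 26, 28],   # D4
]


def anova_sums_of_squares(groups):
    """Return (SSTO, SSTR, SSE) for a single factor study."""
    n_T = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_T
    means = [sum(g) / len(g) for g in groups]

    # Total: squared deviations of every response from the overall mean
    ssto = sum((y - grand_mean) ** 2 for g in groups for y in g)
    # Treatment: n_i-weighted squared deviations of the group means
    sstr = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    # Error: squared deviations of each response from its own group mean
    sse = sum((y - m) ** 2 for g, m in zip(groups, means) for y in g)
    return ssto, sstr, sse


ssto, sstr, sse = anova_sums_of_squares(groups)
print(f"SSTO = {ssto:.2f}, SSTR = {sstr:.2f}, SSE = {sse:.2f}")
print(f"SSTR + SSE = {sstr + sse:.2f}")
```

    Up to floating-point rounding, `sstr + sse` equals `ssto`, which is the decomposition \(SSTO = SSE + SSTR\) evaluated on the package design data.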

    Contributors

    • Scott Brunstein (UCD)
    • Debashis Paul (UCD)

    This page titled Analysis of Variance is shared under a not declared license and was authored, remixed, and/or curated by Debashis Paul.
