
7.5: Comparing many means with ANOVA


Sometimes we want to compare means across many groups. We might initially think to do pairwise comparisons. For example, if there were three groups, we might be tempted to compare the first mean with the second, then with the third, and then finally compare the second and third means for a total of three comparisons. However, this strategy can be treacherous. If we have many groups and do many comparisons, it is likely that we will eventually find a difference just by chance, even if there is no difference in the populations. Instead, we should apply a holistic test to check whether there is evidence that at least one pair of groups is in fact different, and this is where ANOVA saves the day.

    Core ideas of ANOVA

In this section, we will learn a new method called analysis of variance (ANOVA) and a new test statistic called \(F\). ANOVA uses a single hypothesis test to check whether the means across many groups are equal:

    • \(H_0\): The mean outcome is the same across all groups. In statistical notation, \(\mu_1 = \mu_2 = \cdots = \mu_k\) where \(\mu_i\) represents the mean of the outcome for observations in category \(i\).
    • \(H_A\): At least one mean is different.

    Generally we must check three conditions on the data before performing ANOVA:

    • the observations are independent within and across groups,
    • the data within each group are nearly normal, and
    • the variability across the groups is about equal.

    When these three conditions are met, we may perform an ANOVA to determine whether the data provide strong evidence against the null hypothesis that all the \(\mu_i\) are equal.
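When the conditions are met, the entire test can be carried out in a single step with statistical software. As a minimal sketch (using SciPy, and three small hypothetical groups rather than data from this chapter), the one-way ANOVA below returns the \(F\) statistic and p-value directly:

```python
# One-way ANOVA on three small hypothetical groups.
# f_oneway computes F = MSG / MSE and the upper-tail p-value.
from scipy.stats import f_oneway

group1 = [1, 2, 3]
group2 = [2, 3, 4]
group3 = [3, 4, 5]

f_stat, p_value = f_oneway(group1, group2, group3)
print(f"F = {f_stat:.3f}, p-value = {p_value:.3f}")  # F = 3.000, p-value = 0.125
```

With these toy numbers the group means are 2, 3, and 4, giving \(MSG = 3\) and \(MSE = 1\), so \(F = 3.0\) on \(df_1 = 2\) and \(df_2 = 6\); the large p-value reflects how little data each group contains.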

    College departments commonly run multiple lectures of the same introductory course each semester because of high demand. Consider a statistics department that runs three lectures of an introductory statistics course. We might like to determine whether there are statistically significant differences in first exam scores in these three classes (\(A\), \(B\), and \(C\)). Describe appropriate hypotheses to determine whether there are any differences between the three classes. [firstExampleForThreeStatisticsClassesAndANOVA] The hypotheses may be written in the following form:

    • \(H_0\): The average score is identical in all lectures. Any observed difference is due to chance. Notationally, we write \(\mu_A=\mu_B=\mu_C\).
    • \(H_A\): The average score varies by class. We would reject the null hypothesis in favor of the alternative hypothesis if there were larger differences among the class averages than what we might expect from chance alone.

    Strong evidence favoring the alternative hypothesis in ANOVA is described by unusually large differences among the group means. We will soon learn that assessing the variability of the group means relative to the variability among individual observations within each group is key to ANOVA’s success.

    Examine Figure [toyANOVA]. Compare groups I, II, and III. Can you visually determine whether the differences in the group centers are due to chance? Now compare groups IV, V, and VI. Do these differences appear to be due to chance? Any real difference in the means of groups I, II, and III is difficult to discern, because the data within each group are very volatile relative to any differences in the average outcome. On the other hand, it appears there are differences in the centers of groups IV, V, and VI. For instance, group V appears to have a higher mean than the other two groups. Investigating groups IV, V, and VI, we see the differences in the groups’ centers are noticeable because those differences are large relative to the variability in the individual observations within each group.

    We would like to discern whether there are real differences between the batting performance of baseball players according to their position: outfielder (OF), infielder (IF), and catcher (C). We will use a data set called mlb_players_18, which includes batting records of 429 Major League Baseball (MLB) players from the 2018 season who had at least 100 at bats. Six of the 429 cases in the data set are shown in Figure [mlbBat18DataMatrix], and descriptions for each variable are provided in Figure [mlbBat18Variables]. The measure we will use for player batting performance (the outcome variable) is on-base percentage (OBP). The on-base percentage roughly represents the fraction of the time a player successfully gets on base or hits a home run.

    name team position AB H HR RBI AVG OBP
    1 Abreu, J CWS IF 499 132 22 78 0.265 0.325
    2 Acuna Jr., R ATL OF 433 127 26 64 0.293 0.366
    3 Adames, W TB IF 288 80 10 34 0.278 0.348
    \(\vdots\) \(\vdots\) \(\vdots\) \(\vdots\) \(\vdots\) \(\vdots\) \(\vdots\) \(\vdots\)    
    427 Zimmerman, R WSH IF 288 76 13 51 0.264 0.337
    428 Zobrist, B CHC IF 455 139 9 58 0.305 0.378
    429 Zunino, M SEA C 373 75 20 44 0.201 0.259
    variable description
    name Player name
    team The abbreviated name of the player’s team
    position The player’s primary field position (OF, IF, C)
    AB Number of opportunities at bat
    H Number of hits
    HR Number of home runs
    RBI Number of runs batted in
    AVG Batting average, which is equal to \(\text{H}/\text{AB}\)
    OBP On-base percentage, which is roughly equal to the fraction of times a player gets on base or hits a home run

    [nullHypForOBPAgainstPosition] The null hypothesis under consideration is the following: \(\mu_{\text{OF}} = \mu_{\text{IF}} = \mu_{\text{C}}\). Write the null and corresponding alternative hypotheses in plain language.

    The player positions have been divided into three groups: outfield (OF), infield (IF), and catcher (C). What would be an appropriate point estimate of the on-base percentage by outfielders, \(\mu_{\text{OF}}\)? A good estimate of the on-base percentage by outfielders would be the sample average of OBP for just those players whose position is outfield: \(\bar{x}_{\text{OF}} = 0.320\).

    Figure [mlbHRPerABSummaryTable] provides summary statistics for each group. A side-by-side box plot for the on-base percentage is shown in Figure [mlbANOVABoxPlot]. Notice that the variability appears to be approximately constant across groups; nearly constant variance across groups is an important assumption that must be satisfied before we consider the ANOVA approach.

    Position OF IF C
    Sample size (\(n_i\)) 160 205 64
    Sample mean (\(\bar{x}_i\)) 0.320 0.318 0.302
    Sample SD (\(s_i\)) 0.043 0.038 0.038

    The largest difference between the sample means is between the catcher and the outfielder positions. Consider again the original hypotheses:

    • \(H_0\): \(\mu_{\text{OF}} = \mu_{\text{IF}} = \mu_{\text{C}}\)
    • \(H_A\): The average on-base percentage (\(\mu_i\)) varies across some (or all) groups.

    Why might it be inappropriate to run the test by simply estimating whether the difference of \(\mu_{\text{C}}\) and \(\mu_{\text{OF}}\) is statistically significant at a 0.05 significance level?

    [multCompExIncDiscOfClassrooms] The primary issue here is that we are inspecting the data before picking the groups that will be compared. It is inappropriate to examine all data by eye (informal testing) and only afterwards decide which parts to formally test. This is called data snooping or data fishing. Naturally, we would pick the groups with the largest differences for the formal test, and this would lead to an inflation in the Type 1 Error rate. To understand this better, let’s consider a slightly different problem.

    Suppose we were to measure the aptitude of students in 20 classes in a large elementary school at the beginning of the year. In this school, all students are randomly assigned to classrooms, so any differences we observe between the classes at the start of the year are entirely due to chance. However, with so many groups, we will probably observe a few groups that look rather different from each other. If we select only the classes that look so different and then perform a formal test, we will probably make the wrong conclusion that the assignment wasn’t random. While we might only formally test differences for a few pairs of classes, we informally evaluated the other classes by eye before choosing the most extreme cases for comparison.

    For additional information on the ideas expressed in Example [multCompExIncDiscOfClassrooms], we recommend reading about the prosecutor’s fallacy.4

    In the next section we will learn how to use the \(F\) statistic and ANOVA to test whether observed differences in sample means could have happened just by chance even if there was no difference in the respective population means.

    Analysis of variance (ANOVA) and the \(\pmb{F}\)-test

    The method of analysis of variance in this context focuses on answering one question: is the variability in the sample means so large that it seems unlikely to be from chance alone? This question is different from earlier testing procedures since we will simultaneously consider many groups, and evaluate whether their sample means differ more than we would expect from natural variation. We call this variability the mean square between groups (\(MSG\)), and it has an associated degrees of freedom, \(df_{G} = k - 1\) when there are \(k\) groups. The \(MSG\) can be thought of as a scaled variance formula for means. If the null hypothesis is true, any variation in the sample means is due to chance and shouldn’t be too large. Details of \(MSG\) calculations are provided in the footnote.5 However, we typically use software for these computations.

    The mean square between the groups is, on its own, quite useless in a hypothesis test. We need a benchmark value for how much variability should be expected among the sample means if the null hypothesis is true. To this end, we compute a pooled variance estimate, often abbreviated as the mean square error (\(MSE\)), which has an associated degrees of freedom value \(df_E = n - k\). It is helpful to think of \(MSE\) as a measure of the variability within the groups. Details of the computations of the \(MSE\) and a link to an extra online section for ANOVA calculations are provided in the footnote6 for interested readers.

    When the null hypothesis is true, any differences among the sample means are due only to chance, and the \(MSG\) and \(MSE\) should be about equal. As a test statistic for ANOVA, we examine the ratio of \(MSG\) to \(MSE\):

    \[\begin{aligned} F = \frac{MSG}{MSE}\end{aligned}\]

    The \(MSG\) represents a measure of the between-group variability, and \(MSE\) measures the variability within each of the groups.

    For the baseball data, \(MSG = 0.00803\) and \(MSE=0.00158\). Identify the degrees of freedom associated with MSG and MSE and verify the \(F\) statistic is approximately 5.077.
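One way to check this calculation is to reconstruct \(MSG\) and \(MSE\) from the group summaries in Figure [mlbHRPerABSummaryTable]. The sketch below does exactly that; because the table's sample means and standard deviations are rounded to three decimals, the resulting \(F\) lands near, but not exactly on, the more precise value of 5.077 computed from the full data:

```python
# Reconstruct MSG, MSE, and F from the (rounded) OF/IF/C summaries.
n    = [160, 205, 64]          # group sample sizes
xbar = [0.320, 0.318, 0.302]   # group sample means (OBP)
s    = [0.043, 0.038, 0.038]   # group sample SDs
k    = len(n)
N    = sum(n)

# Grand mean, weighted by group sizes.
grand_mean = sum(ni * xi for ni, xi in zip(n, xbar)) / N

# Sum of squares between groups and sum of squared errors.
ssg = sum(ni * (xi - grand_mean) ** 2 for ni, xi in zip(n, xbar))
sse = sum((ni - 1) * si ** 2 for ni, si in zip(n, s))

msg = ssg / (k - 1)   # df_G = k - 1 = 2
mse = sse / (N - k)   # df_E = n - k = 426
f_stat = msg / mse
print(f"df_G = {k - 1}, df_E = {N - k}, F = {f_stat:.2f}")
```

The rounded inputs give \(F \approx 4.97\), consistent with 5.077 up to the rounding in the summary table.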

    We can use the \(F\) statistic to evaluate the hypotheses in what is called an \(F\)-test. A p-value can be computed from the \(F\) statistic using an \(F\) distribution, which has two associated parameters: \(df_{1}\) and \(df_{2}\). For the \(F\) statistic in ANOVA, \(df_{1} = df_{G}\) and \(df_{2} = df_{E}\). An \(F\) distribution with 2 and 426 degrees of freedom, corresponding to the \(F\) statistic for the baseball hypothesis test, is shown in Figure [fDist2And423Shaded].

    The larger the observed variability in the sample means (\(MSG\)) relative to the within-group observations (\(MSE\)), the larger \(F\) will be and the stronger the evidence against the null hypothesis. Because larger values of \(F\) represent stronger evidence against the null hypothesis, we use the upper tail of the distribution to compute a p-value.

    The \(\pmb{F}\) statistic and the \(\pmb{F}\)-test Analysis of variance (ANOVA) is used to test whether the mean outcome differs across 2 or more groups. ANOVA uses a test statistic \(F\), which represents a standardized ratio of variability in the sample means relative to the variability within the groups. If \(H_0\) is true and the model conditions are satisfied, the statistic \(F\) follows an \(F\) distribution with parameters \(df_{1} = k - 1\) and \(df_{2} = n - k\). The upper tail of the \(F\) distribution is used to represent the p-value.

    The p-value corresponding to the shaded area in Figure [fDist2And423Shaded] is equal to about 0.0066. Does this provide strong evidence against the null hypothesis? The p-value is smaller than 0.05, indicating the evidence is strong enough to reject the null hypothesis at a significance level of 0.05. That is, the data provide strong evidence that the average on-base percentage varies by player’s primary field position.
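This upper-tail area can be computed directly from the \(F\) distribution. A minimal sketch with SciPy, using the survival function (the upper tail) on \(df_1 = 2\) and \(df_2 = 426\):

```python
# Upper-tail p-value for F = 5.077 with df1 = 2 and df2 = 426.
from scipy.stats import f

p_value = f.sf(5.077, 2, 426)  # sf = survival function = P(F > 5.077)
print(f"p-value = {p_value:.4f}")  # ~0.0066
```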

    Reading an ANOVA table from software

    The calculations required to perform an ANOVA by hand are tedious and prone to human error. For these reasons, it is common to use statistical software to calculate the \(F\) statistic and p-value.

    An ANOVA can be summarized in a table very similar to that of a regression summary, which we will see in Chapters [linRegrForTwoVar] and [multipleAndLogisticRegression]. Figure [anovaSummaryTableForOBPAgainstPosition] shows an ANOVA summary to test whether the mean of on-base percentage varies by player positions in the MLB. Many of these values should look familiar; in particular, the \(F\)-test statistic and p-value can be retrieved from the last two columns.

    Df Sum Sq Mean Sq F value Pr(\(>\)F)
    position 2 0.0161 0.0080 5.0766 0.0066
    Residuals 426 0.6740 0.0016

    Graphical diagnostics for an ANOVA analysis

    There are three conditions we must check for an ANOVA analysis: all observations must be independent, the data in each group must be nearly normal, and the variance within each group must be approximately equal.

    Independence.

    If the data are a simple random sample, this condition is satisfied. For processes and experiments, carefully consider whether the data may be independent (e.g. no pairing). The MLB data, for example, were not sampled at random from a population; however, there is no obvious reason why independence would not hold for most or all observations.

    Approximately normal.

    As with one- and two-sample testing for means, the normality assumption is especially important when the sample size is quite small, which is, ironically, when it is most difficult to check for non-normality. A histogram of the observations from each group is shown in Figure [mlbANOVADiagNormalityGroups]. Since each of the groups we’re considering has a relatively large sample size, what we’re looking for are major outliers. None are apparent, so this condition is reasonably met.

    Constant variance.

    The last assumption is that the variance in the groups is about equal from one group to the next. This assumption can be checked by examining a side-by-side box plot of the outcomes across the groups, as in Figure [mlbANOVABoxPlot]. In this case, the variability is similar in the three groups but not identical. We see in Table [mlbHRPerABSummaryTable] that the standard deviation doesn’t vary much from one group to the next.

    Diagnostics for an ANOVA analysis Independence is always important to an ANOVA analysis. The normality condition is very important when the sample sizes for each group are relatively small. The constant variance condition is especially important when the sample sizes differ between groups.

    Multiple comparisons and controlling Type 1 Error rate

    When we reject the null hypothesis in an ANOVA analysis, we might wonder, which of these groups have different means? To answer this question, we compare the means of each possible pair of groups. For instance, if there are three groups and there is strong evidence that there are some differences in the group means, there are three comparisons to make: group 1 to group 2, group 1 to group 3, and group 2 to group 3. These comparisons can be accomplished using a two-sample \(t\)-test, but we use a modified significance level and a pooled estimate of the standard deviation across groups. Usually this pooled standard deviation can be found in the ANOVA table, e.g. along the bottom of Figure [anovaSummaryTableForOBPAgainstPosition].

    Example [firstExampleForThreeStatisticsClassesAndANOVA] discussed three statistics lectures, all taught during the same semester. Figure [summaryStatisticsForClassTestData] shows summary statistics for these three courses, and a side-by-side box plot of the data is shown in Figure [classDataSBSBoxPlot]. We would like to conduct an ANOVA for these data. Do you see any deviations from the three conditions for ANOVA? In this case (like many others) it is difficult to check independence in a rigorous way. Instead, the best we can do is use common sense to consider reasons the assumption of independence may not hold. For instance, the independence assumption may not be reasonable if there is a star teaching assistant that only half of the students may access; such a scenario would divide a class into two subgroups. No such situations were evident for these particular data, and we believe that independence is acceptable.

    The distributions in the side-by-side box plot appear to be roughly symmetric and show no noticeable outliers.

    The box plots show approximately equal variability, which can be verified in Figure [summaryStatisticsForClassTestData], supporting the constant variance assumption.

    Class \(i\) A B C
    \(n_i\) 58 55 51
    \(\bar{x}_i\) 75.1 72.0 78.9
    \(s_i\) 13.9 13.8 13.1

    [exerExaminingAnovaSummaryTableForMidtermData] ANOVA was conducted for the midterm data, and summary results are shown in Figure [anovaSummaryTableForMidtermData]. What should we conclude?

    Df Sum Sq Mean Sq F value Pr(\(>\)F)
    lecture 2 1290.11 645.06 3.48 0.0330
    Residuals 161 29810.13 185.16

    The p-value of the test is 0.0330, less than 0.05, so there is strong evidence that the differences in the three class means are not simply due to chance. We might wonder, which of the classes are actually different? As discussed in earlier chapters, a two-sample \(t\)-test could be used to test for differences in each possible pair of groups. However, one pitfall was discussed in Example [multCompExIncDiscOfClassrooms]: when we run so many tests, the Type 1 Error rate increases. This issue is resolved by using a modified significance level.

    Multiple comparisons and the Bonferroni correction for \(\pmb{\alpha}\) The scenario of testing many pairs of groups is called multiple comparisons. The Bonferroni correction suggests that a more stringent significance level is more appropriate for these tests:

    \[\begin{aligned} \alpha^{\star} = \alpha / K \end{aligned}\]

    where \(K\) is the number of comparisons being considered (formally or informally). If there are \(k\) groups, then usually all possible pairs are compared and \(K=\frac{k(k-1)}{2}\).
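The correction is simple enough to express as a one-line helper. A sketch (the function name is ours, not from the text):

```python
# Bonferroni-corrected significance level for all pairwise
# comparisons among k groups at overall level alpha.
def bonferroni_alpha(alpha: float, k: int) -> float:
    K = k * (k - 1) // 2   # number of pairwise comparisons
    return alpha / K

print(round(bonferroni_alpha(0.05, 3), 4))  # 0.0167
```

For \(k = 3\) groups there are \(K = 3\) comparisons, so \(\alpha^{\star} = 0.05/3 \approx 0.0167\), matching the value used in the lecture comparisons below.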

    In Guided Practice [exerExaminingAnovaSummaryTableForMidtermData], you found strong evidence of differences in the average midterm grades between the three lectures. Complete the three possible pairwise comparisons using the Bonferroni correction and report any differences. [multipleComparisonsOfThreeStatClasses] We use a modified significance level of \(\alpha^{\star} = 0.05 / 3 = 0.0167\). Additionally, we use the pooled estimate of the standard deviation: \(s_{pooled}=13.61\) on \(df=161\), which is provided in the ANOVA summary table.

    Lecture A versus Lecture B: The estimated difference and standard error are, respectively,

    \[\begin{aligned} \bar{x}_A - \bar{x}_{B} &= 75.1 - 72 = 3.1 &&SE = \sqrt{\frac{13.61^2}{58} + \frac{13.61^2}{55}} = 2.56 \end{aligned}\]

    (See Section [pooledStandardDeviations] for additional details.) This results in a T-score of 1.21 on \(df = 161\) (we use the \(df\) associated with \(s_{pooled}\)). Statistical software was used to precisely identify the two-sided p-value since the modified significance level of 0.0167 is not found in the \(t\)-table. The p-value (0.228) is larger than \(\alpha^*=0.0167\), so there is not strong evidence of a difference in the means of lectures A and B.

    Lecture A versus Lecture C: The estimated difference and standard error are 3.8 and 2.61, respectively. This results in a \(T\) score of 1.46 on \(df = 161\) and a two-sided p-value of 0.1462. This p-value is larger than \(\alpha^*\), so there is not strong evidence of a difference in the means of lectures A and C.

    Lecture B versus Lecture C: The estimated difference and standard error are 6.9 and 2.65, respectively. This results in a \(T\) score of 2.60 on \(df = 161\) and a two-sided p-value of 0.0102. This p-value is smaller than \(\alpha^*\). Here we find strong evidence of a difference in the means of lectures B and C.
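The three pairwise comparisons above can be reproduced with a short script. This is a sketch assuming SciPy is available; it reuses the pooled SD and degrees of freedom from the ANOVA table, so expect small discrepancies from the text's hand-rounded T scores:

```python
# Bonferroni-adjusted pairwise comparisons for the three lectures,
# using the pooled SD (s_pooled = 13.61 on df = 161) from the ANOVA table.
from math import sqrt
from scipy.stats import t

s_pooled, df = 13.61, 161
groups = {"A": (58, 75.1), "B": (55, 72.0), "C": (51, 78.9)}  # n_i, xbar_i
alpha_star = 0.05 / 3  # Bonferroni: alpha / K with K = 3 comparisons

results = {}
for g1, g2 in [("A", "B"), ("A", "C"), ("B", "C")]:
    (n1, m1), (n2, m2) = groups[g1], groups[g2]
    se = s_pooled * sqrt(1 / n1 + 1 / n2)
    t_score = (m1 - m2) / se
    p = 2 * t.sf(abs(t_score), df)  # two-sided p-value
    results[(g1, g2)] = p
    verdict = "difference" if p < alpha_star else "no strong evidence"
    print(f"{g1} vs {g2}: T = {t_score:.2f}, p = {p:.4f} ({verdict})")
```

Only the B-versus-C comparison has a p-value below \(\alpha^{\star} = 0.0167\), in agreement with the conclusions above.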

    We might summarize the findings of the analysis from Example [multipleComparisonsOfThreeStatClasses] using the following notation:

    \[\begin{aligned} \mu_A &\stackrel{?}{=} \mu_B &\mu_A &\stackrel{?}{=} \mu_C &\mu_B &\neq \mu_C\end{aligned}\]

    The midterm mean in lecture A is not statistically distinguishable from those of lectures B or C. However, there is strong evidence that lectures B and C are different. In the first two pairwise comparisons, we did not have sufficient evidence to reject the null hypothesis. Recall that failing to reject \(H_0\) does not imply \(H_0\) is true.

    Reject \(\pmb{H_0}\) with ANOVA but find no differences in group means It is possible to reject the null hypothesis using ANOVA and then to not subsequently identify differences in the pairwise comparisons. However, this does not invalidate the ANOVA conclusion. It only means we have not been able to successfully identify which specific groups differ in their means.

    The ANOVA procedure examines the big picture: it considers all groups simultaneously to decipher whether there is evidence that some difference exists. Even if the test indicates that there is strong evidence of differences in group means, identifying with high confidence a specific difference as statistically significant is more difficult.

    Consider the following analogy: we observe a Wall Street firm that makes large quantities of money based on predicting mergers. Mergers are generally difficult to predict, and if the prediction success rate is extremely high, that may be considered sufficiently strong evidence to warrant investigation by the Securities and Exchange Commission (SEC). While the SEC may be quite certain that there is insider trading taking place at the firm, the evidence against any single trader may not be very strong. It is only when the SEC considers all the data that they identify the pattern. This is effectively the strategy of ANOVA: stand back and consider all the groups simultaneously.


    4. See, for example, .
    5. Let \(\bar{x}\) represent the mean of outcomes across all groups. Then the mean square between groups is computed as

      \[\begin{aligned} MSG = \frac{1}{df_{G}}SSG = \frac{1}{k-1}\sum_{i=1}^{k} n_{i} \left(\bar{x}_{i} - \bar{x}\right)^2 \end{aligned}\]

      where \(SSG\) is called the sum of squares between groups and \(n_{i}\) is the sample size of group \(i\).

    6. Let \(\bar{x}\) represent the mean of outcomes across all groups. Then the sum of squares total (\(SST\)) is computed as

      \[\begin{aligned} SST = \sum_{i=1}^{n} \left(x_{i} - \bar{x}\right)^2 \end{aligned}\]

      where the sum is over all observations in the data set. Then we compute the sum of squared errors (\(SSE\)) in one of two equivalent ways:

      \[\begin{aligned} SSE &= SST - SSG \\ &= (n_1-1)s_1^2 + (n_2-1)s_2^2 + \cdots + (n_k-1)s_k^2 \end{aligned}\]

      where \(s_i^2\) is the sample variance (square of the standard deviation) of the residuals in group \(i\). Then the \(MSE\) is the standardized form of \(SSE\): \(MSE = \frac{1}{df_{E}}SSE\).
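The equivalence of the two routes to \(SSE\) can be verified numerically on a toy data set (hypothetical numbers, chosen only for easy arithmetic):

```python
# Verify SSE = SST - SSG = sum over groups of (n_i - 1) * s_i^2.
from statistics import mean, variance  # variance uses the n-1 denominator

groups = [[1, 2, 3], [2, 3, 4], [3, 4, 5]]  # hypothetical data
all_obs = [x for g in groups for x in g]
grand = mean(all_obs)

sst = sum((x - grand) ** 2 for x in all_obs)                      # total
ssg = sum(len(g) * (mean(g) - grand) ** 2 for g in groups)        # between
sse_via_sst = sst - ssg                                           # route 1
sse_via_s2  = sum((len(g) - 1) * variance(g) for g in groups)     # route 2

print(sst, ssg, sse_via_sst, sse_via_s2)
```

Both routes give \(SSE = 6\) here (with \(SST = 12\) and \(SSG = 6\)), as the identity requires.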

      For additional details on ANOVA calculations, see


    This page titled 7.5: Comparing many means with ANOVA is shared under a CC BY-SA 3.0 license and was authored, remixed, and/or curated by David Diez, Christopher Barr, & Mine Çetinkaya-Rundel via source content that was edited to the style and standards of the LibreTexts platform.
