# 11.2: Introduction to ANOVA's Sum of Squares

• Michelle Oja
• Taft College

ANOVA is all about looking at the different sources of variance (i.e. the reasons that scores differ from one another) in a dataset. Fortunately, the way we calculate these sources of variance takes a very familiar form: the Sum of Squares.

## Sums of Squares

### Between Groups Sum of Squares

One source of variability that we identified in the previous example (11.1.3) was the differences, or variability, between the groups. That is, the groups seemed to have different average levels. The variability arising from these differences is known as the between groups variability, and it is quantified using the Between Groups Sum of Squares.

Our calculations for sums of squares in ANOVA will take on the same form as they did for regular calculations of variance. Each observation, in this case the group means, is compared to the overall mean, in this case the grand mean, to calculate a deviation score. These deviation scores are squared so that they do not cancel each other out and sum to zero. The squared deviations are then added up, or summed. There is, however, one small difference. Because each group mean represents a group composed of multiple people, before we sum the squared deviation scores we must multiply them by the number of people within that group. Incorporating this, we find our equation for Between Groups Sum of Squares to be:

$S S_{B}=\sum_{EachGroup} \left[ \left(\overline{X}_{group}-\overline{X}_{T}\right)^{2} * (n_{group}) \right] \nonumber$

1. Subtract
2. Square
3. Multiply
4. Sum

I know, this looks a little extreme, but it really is what it says it is: subtract the mean of all participants ($$\overline{X}_{T}$$) from the mean of one of the groups ($$\overline{X}_{group}$$), then square that difference. That gives you the squared deviation score for that group. You then multiply that by the sample size for that group ($$n_{group}$$). You do that for each group and add the results together ($$\sum_{EachGroup}$$). For example, if you had an IV with three levels (k = 3) that were High, Medium, and Low, then you'd do the parts in the brackets for each group, then add all three groups together for one final Between Groups Sum of Squares.

The only difference between this equation and the familiar sum of squares for variance is that we are multiplying by the sample size of each group. Everything else logically fits together in the same way.
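To make the four steps concrete, here is a minimal Python sketch of the Between Groups Sum of Squares. The scores for the High, Medium, and Low groups are made up purely for illustration, and the variable names (`groups`, `grand_mean`, `ss_between`) are our own, not part of any statistics package.

```python
# Hypothetical scores for three groups (invented for illustration only).
groups = {
    "High": [8, 9, 10],
    "Medium": [5, 6, 7],
    "Low": [2, 3, 4],
}

def mean(values):
    return sum(values) / len(values)

# Grand mean: the mean of every score, ignoring group membership.
all_scores = [x for scores in groups.values() for x in scores]
grand_mean = mean(all_scores)

ss_between = 0.0
for name, scores in groups.items():
    deviation = mean(scores) - grand_mean  # 1. Subtract
    squared = deviation ** 2               # 2. Square
    weighted = squared * len(scores)       # 3. Multiply by n for the group
    ss_between += weighted                 # 4. Sum

print(ss_between)  # 54.0
```

With these made-up scores, the group means are 9, 6, and 3, and the grand mean is 6, so the three bracketed pieces are 27, 0, and 27.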

### Within Groups Sum of Squares (Error)

The formula for the within groups sum of squares is again going to take on the same form and logic. What we are looking for is the distance between each individual person and the mean of the group to which they belong. We calculate each of these deviation scores, square them so that they can be added together, then sum all of them into one overall value:

$S S_{W}=\sum_{EachGroup} \left[ \sum \left(\left(X-\overline{X}_{group}\right)^{2}\right) \right] \nonumber$

1. Subtract
2. Square
3. Sum

In this instance, because we are calculating this deviation score for each individual person, there is no need to multiply by how many people we have. It is important to remember that the deviation score for each person is only calculated relative to their group mean; this is what ($$X - \overline{X}_{group}$$) is telling you to do: subtract the mean of a group from each score from that same group. You then square each of those subtractions. The sum ($$\sum$$) is to sum all of the individual squared scores of all of the groups.
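As a rough illustration, the Within Groups Sum of Squares can be sketched in Python as follows. The scores are hypothetical, chosen only so the arithmetic is easy to follow, and the names `groups` and `ss_within` are our own.

```python
# Hypothetical scores for three groups (invented for illustration only).
groups = {
    "High": [8, 9, 10],
    "Medium": [5, 6, 7],
    "Low": [2, 3, 4],
}

def mean(values):
    return sum(values) / len(values)

ss_within = 0.0
for name, scores in groups.items():
    group_mean = mean(scores)
    for x in scores:
        deviation = x - group_mean  # 1. Subtract the person's OWN group mean
        squared = deviation ** 2    # 2. Square
        ss_within += squared        # 3. Sum across everyone in every group

print(ss_within)  # 6.0
```

Notice that the grand mean never appears here: each score is only compared to the mean of its own group.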

### Total Sum of Squares

The calculation for this score is exactly the same as it would be if we were calculating the overall variance in the dataset (because that’s what we are interested in explaining) without worrying about or even knowing about the groups into which our scores fall:

$S S_{T}=\sum \left[ \left(X - \overline{X}_{T}\right)^{2} \right] \nonumber$

1. Subtract
2. Square
3. Sum

We can see that our Total Sum of Squares is just each individual score minus the grand mean (the mean of the totality of scores, $$\overline{X}_{T}$$), squared and then summed. As with our Within Groups Sum of Squares, we are calculating a deviation score for each individual person, so we do not need to multiply anything by the sample size; that is only done for Between Groups Sum of Squares.
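A short Python sketch of the Total Sum of Squares, again using made-up scores, shows that the groups can be ignored entirely. The list `all_scores` is a hypothetical dataset; it happens to be nine scores that could have come from three groups, but the calculation never needs to know that.

```python
# Hypothetical scores, pooled together with no group labels at all.
all_scores = [8, 9, 10, 5, 6, 7, 2, 3, 4]

grand_mean = sum(all_scores) / len(all_scores)

ss_total = 0.0
for x in all_scores:
    deviation = x - grand_mean  # 1. Subtract the grand mean
    squared = deviation ** 2    # 2. Square
    ss_total += squared         # 3. Sum

print(ss_total)  # 60.0
```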

### Computation Check!

An important feature of these calculations in ANOVA is that they all fit together. We could work through the algebra to demonstrate that if we added together the formulas for $$SS_B$$ and $$SS_W$$, we would end up with the formula for $$SS_T$$. That is:

$S S_{T}=S S_{B}+S S_{W} \nonumber$

This will prove to be a very convenient way to check your work! If you calculate each $$SS$$ by hand, you can make sure that they all fit together as shown above, and if not, you know that you made a math mistake somewhere.
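The same check can be sketched in Python. The function `sums_of_squares` below is a hypothetical helper written for this illustration (it is not from any statistics library), and the scores are invented. It computes all three values from the formulas in this section and confirms that they fit together.

```python
import math

def mean(values):
    return sum(values) / len(values)

def sums_of_squares(groups):
    """Return (SS_B, SS_W, SS_T) for a list of groups of scores."""
    all_scores = [x for scores in groups for x in scores]
    grand_mean = mean(all_scores)
    # Between: squared deviation of each group mean, times the group size.
    ss_b = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Within: squared deviation of each score from its own group mean.
    ss_w = sum((x - mean(g)) ** 2 for g in groups for x in g)
    # Total: squared deviation of each score from the grand mean.
    ss_t = sum((x - grand_mean) ** 2 for x in all_scores)
    return ss_b, ss_w, ss_t

# Hypothetical scores for three groups (invented for illustration only).
groups = [[8, 9, 10], [5, 6, 7], [2, 3, 4]]
ss_b, ss_w, ss_t = sums_of_squares(groups)
print(ss_b, ss_w, ss_t)  # 54.0 6.0 60.0

# Computation check: the pieces should add up to the total.
assert math.isclose(ss_t, ss_b + ss_w)
```

The comparison uses `math.isclose` rather than `==` because, with less tidy data, tiny rounding differences in decimal arithmetic could make two values that are equal on paper differ slightly on a computer.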

## By Hand?

We can see from the above formulas that calculating an ANOVA by hand from raw data can take a very, very long time. For this reason, you will rarely be required to calculate the SS values by hand. Many professors will have you work out one problem on your own by hand, then either show you how to use statistical software or provide the Sums of Squares for you. However, you should still take the time to understand how they fit together and what each one represents to help understand the analysis itself; this will make it easier to interpret the results and make predictions (research hypotheses).