4.2: Measures of Variability

    The statistics that we’ve discussed so far all relate to central tendency. That is, they all talk about which values are “in the middle” or “popular” in the data. However, central tendency is not the only type of summary statistic that we want to calculate. The second thing that we really want is a measure of the variability of the data. That is, how “spread out” are the data? How “far” away from the mean or median do the observed values tend to be? For now, let’s assume that the data are interval or ratio scale, so we’ll continue to use the WinMargin data. We’ll use this data to discuss several different measures of spread, each with different strengths and weaknesses.

    Range

The range of a variable is very simple: it’s the biggest value minus the smallest value. For the MLB_GL2021.sav winning margins data, the maximum value is 21 and the minimum value is 1. While we could use SPSS to calculate this difficult math problem, we could just as easily compute 21 − 1 in our heads (or on a calculator) and get 20. The range of WinMargin is 20.
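If you ever want to double-check a range outside of SPSS, here is a minimal sketch in Python; the win_margins list is just a handful of illustrative values standing in for the exported WinMargin column, not the full data set.

```python
# A minimal sketch of the range calculation; win_margins stands in for the
# WinMargin column exported from SPSS (illustrative values only).
win_margins = [2, 6, 14, 7, 1]

data_range = max(win_margins) - min(win_margins)
print(data_range)  # 13 for these illustrative values; 21 - 1 = 20 for the full data
```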

    Although the range is the simplest way to quantify the notion of “variability”, it’s one of the worst. Recall from our discussion of the mean that we want our summary measure to be robust. If the data set has one or two extremely bad values in it, we’d like our statistics not to be unduly influenced by these cases. If we look once again at our toy example of a data set containing very extreme outliers…

−100, 2, 3, 4, 5, 6, 7, 8, 9, 10

    … it is clear that the range is not robust, since this has a range of 110, but if the outlier were removed we would have a range of only 8.
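To see just how sensitive the range is, here is the toy data set run through the same one-line calculation, with and without the outlier:

```python
# The toy data set with a single extreme outlier.
toy = [-100, 2, 3, 4, 5, 6, 7, 8, 9, 10]

print(max(toy) - min(toy))          # 110, dominated entirely by the outlier
print(max(toy[1:]) - min(toy[1:]))  # 8, once the outlier is removed
```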

    Interquartile range

The interquartile range (IQR) is like the range, but instead of calculating the difference between the biggest and smallest value, it calculates the difference between the 75th quantile and the 25th quantile. Probably you already know what a quantile is (they’re more commonly called percentiles), but if not: the 10th percentile of a data set is the smallest number x such that 10% of the data is less than x. In fact, we’ve already come across the idea: the median of a data set is its 50th quantile / percentile! SPSS can calculate the hell out of quantiles, which I'll show you shortly. In the meantime, let's look at the output for the WinMargin data:

[SPSS percentile output for WinMargin: 25th percentile = 1, 75th percentile = 5]

While it’s obvious how to interpret the range, it’s a little less obvious how to interpret the IQR. The simplest way to think about it is like this: the interquartile range is the range spanned by the “middle half” of the data. That is, one-quarter of the data falls below the 25th percentile and one-quarter of the data sits above the 75th percentile, leaving the “middle half” of the data lying in between the two. The IQR is the range covered by that middle half. In this case, the 75th percentile is 5 and the 25th percentile is 1, so the IQR is 5 − 1 = 4.
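For readers who like to check things in code, here is a rough sketch using NumPy's percentile function. Note that different software packages (including SPSS) use slightly different rules for interpolating percentiles, so small discrepancies are normal; the win_margins array is again just illustrative.

```python
# A sketch of the IQR calculation; NumPy's default interpolation rule may
# differ slightly from SPSS's, so the numbers need not match exactly.
import numpy as np

win_margins = np.array([2, 6, 14, 7, 1])  # illustrative values only

q25, q75 = np.percentile(win_margins, [25, 75])
print(q25, q75, q75 - q25)  # 2.0 7.0 5.0 for these five values
```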

    SPSS can also calculate other percentiles besides the quartiles. Here is every 10%:

[SPSS percentile output for WinMargin at every 10th percentile (10th through 90th)]

    Deviation Scores

While the range and IQR tell us something about the dispersion of the data, they say little about the "average" distance of the observations from the center. To get at that, we start by obtaining deviation scores, which tell us how far each observation in the data set is from the mean (or median, if that's your thing). Here are the first five data points for the WinMargin data along with their deviation scores:

Winning Margin    Mean \(\bar{X}\)    Deviation Score \(X_i - \bar{X}\)
\(X_1 = 2\)       6                   -4
\(X_2 = 6\)       6                   0
\(X_3 = 14\)      6                   8
\(X_4 = 7\)       6                   1
\(X_5 = 1\)       6                   -5

I mentioned earlier that we are interested in the average deviation from the mean. If you recall, obtaining the mean has us adding up all the scores and dividing by the number of scores. So, let's add up the deviation scores: (-4) + 0 + 8 + 1 + (-5) = 0. Hmmm... it appears that the total of the deviation scores is equal to 0. That's not at all helpful. I wonder if there is any way around that?

There are a few things we could do. First, we could take the absolute value of the deviation scores and then compute the average deviation. That would look like 4 + 0 + 8 + 1 + 5 = 18. The mean absolute deviation, then, would be 18/5 = 3.6. Now we have the mean absolute deviation of the scores for these five data points. In case you're interested, the formula for computing the mean absolute deviation is:

\[
\begin{aligned}
&\operatorname{Mean\ Absolute\ Deviation}(X)=\frac{\sum_{i=1}^{N}\left|X_{i}-\bar{X}\right|}{N}
\end{aligned}
\nonumber\]
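As a quick sanity check, here is a short Python sketch that reproduces both the zero-sum property of the deviation scores and the mean absolute deviation of 3.6 for the five games above:

```python
# Deviation scores and mean absolute deviation for the five winning margins.
import numpy as np

x = np.array([2, 6, 14, 7, 1])
deviations = x - x.mean()            # [-4., 0., 8., 1., -5.]

print(deviations.sum())              # 0.0: raw deviations always cancel out
print(np.abs(deviations).mean())     # 3.6: the mean absolute deviation
```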

    Variance

Although the mean absolute deviation measure has its uses, it’s not the best measure of variability to use. From a purely mathematical perspective, there are some solid reasons to prefer squared deviations over absolute deviations. If we do that, we obtain a measure called the variance. The variance of a data set X is sometimes written as \(\operatorname{Var}(X)\), but it’s more commonly denoted \(s^2\) (the reason for this will become clearer shortly). The formula that we use to calculate the variance of a set of observations is as follows:

    \[
    \begin{aligned}
&\operatorname{Var}(X)=\frac{\sum_{i=1}^{N}\left(X_{i}-\bar{X}\right)^{2}}{N}
    \end{aligned}
    \nonumber\]

    As you can see, it’s basically the same formula that we used to calculate the mean absolute deviation, except that instead of using “absolute deviations” we use “squared deviations”. It is for this reason that the variance is sometimes referred to as the “mean square deviation”.

    Now that we’ve got the basic idea, let’s have a look at a concrete example. Once again, let’s use the first five MLB games as our data. If we follow the same approach that we took last time, we end up with the following table:

Table: Deviations and squared deviations from the mean for the first five winning margins.

Notation [English]           i [which game]    \(X_i\) [value]    \(X_i - \bar{X}\) [deviation from mean]    \((X_i - \bar{X})^2\) [squared deviation]
Winning Margin for Game 1    1                 2                  -4                                         16
Winning Margin for Game 2    2                 6                  0                                          0
Winning Margin for Game 3    3                 14                 8                                          64
Winning Margin for Game 4    4                 7                  1                                          1
Winning Margin for Game 5    5                 1                  -5                                         25

That last column contains all of our squared deviations, so all we have to do is average them. So, 16 + 0 + 64 + 1 + 25 = 106, which we then divide by 5 to get the variance \(s^2\) = 21.2. If we use SPSS to calculate the variance, we get a weird result: 26.5. Now, I'm no math genius (which should scare you), but I'm almost certain that 21.2 is not the same as 26.5. Correct me if I'm wrong. So, what's the deal? Why is this high-powered, expensive software giving us an answer that is different from what we can do with a simple calculator? Is it broken?

As it happens, the answer is no. It’s not a typo, and SPSS is not making a mistake. To get a feel for what’s happening, let’s stop using the tiny data set containing only 5 data points, and switch to the full set of 2,429 games that we’ve got stored in our MLB_GL2021.sav file. First, let’s calculate the variance by using the formula described above:

    \[
    \begin{aligned}
    &\operatorname{Var}(X)=\frac{\sum_{i=1}^{N}\left(X_{i}-\bar{X}\right)^{2}}{N}
    \end{aligned}
    \nonumber\]

If you were to add up all of the squared deviations by hand, you'd get a sum equal to 18640.73. Let's then compute the mean of the squared deviations:

\[
\text{Mean Squared Deviation} = \frac{18640.73}{2429}=7.67
\nonumber\]

    According to our formula, this is the variance, right?

    Now let's ask SPSS to give us the variance of the WinMargin data:

[SPSS output: variance of WinMargin = 7.68]

I hear you. You are about to say something like, "Dude, 7.68 is barely different from 7.67! What is it you're trying to prove?" Well, the fact that these two numbers ARE different shows that SPSS is doing something different from what we did by hand. So what is that difference, you ask?

It’s very simple to explain what SPSS is doing here, but slightly trickier to explain why. So let’s start with the “what”. SPSS is evaluating a slightly different formula from the one shown above. Instead of averaging the squared deviations, which requires dividing by the number of data points N, SPSS divides by N−1. In other words, the formula that SPSS is using is this one:

    \[
    \begin{aligned}
    &\operatorname{Var}(X)=\frac{\sum_{i=1}^{N}\left(X_{i}-\bar{X}\right)^{2}}{N-1}
    \end{aligned}
    \nonumber\]
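If you want to see both divisors in action, here is a short sketch; NumPy's ddof argument controls whether the sum of squared deviations is divided by N or by N − 1.

```python
# Reproducing both answers for the five-game example and the full data set.
import numpy as np

x = np.array([2, 6, 14, 7, 1])
print(np.var(x))           # 21.2: divides by N = 5 (our hand calculation)
print(np.var(x, ddof=1))   # 26.5: divides by N - 1 = 4 (what SPSS reports)

# The same story for the full data set, using the sum of squared deviations
# quoted above.
print(18640.73 / 2429)     # roughly 7.67
print(18640.73 / 2428)     # roughly 7.68
```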

    So that’s the what. The real question is why SPSS (or any statistical program for that matter) is dividing by N−1 and not by N. After all, the variance is supposed to be the mean squared deviation, right? So shouldn’t we be dividing by N, the actual number of observations in the sample? Well, yes, we should. However, as we’ll discuss later in the book, there’s a subtle distinction between “describing a sample” and “making guesses about the population from which the sample came”. Up to this point, it’s been a distinction without a difference. Regardless of whether you’re describing a sample or drawing inferences about the population, the mean is calculated exactly the same way. Not so for the variance, or the standard deviation, or for many other measures besides. What I outlined to you initially (i.e., take the actual average, and thus divide by N) assumes that you literally intend to calculate the variance of the sample. Most of the time, however, you’re not terribly interested in the sample in and of itself. Rather, the sample exists to tell you something about the world. If so, you’re actually starting to move away from calculating a “sample statistic”, and towards the idea of estimating a “population parameter”. However, I’m getting ahead of myself. For now, let’s just take it on faith that SPSS knows what it’s doing, and we’ll revisit the question later on when we talk about estimation.

Okay, one last thing. This section so far has read a bit like a mystery novel. I’ve shown you how to calculate the variance, described the weird “N−1” thing that SPSS does and hinted at the reason why it’s there, but I haven’t mentioned the single most important thing… how do you interpret the variance? Descriptive statistics are supposed to describe things, after all, and right now the variance is really just a gibberish number. Unfortunately, the reason why I haven’t given you the human-friendly interpretation of the variance is that there really isn’t one. This is the most serious problem with the variance. Although it has some elegant mathematical properties that suggest that it really is a fundamental quantity for expressing variation, it’s completely useless if you want to communicate with an actual human… variances are completely uninterpretable in terms of the original variable! All the numbers have been squared, and they don’t mean anything anymore. This is a huge issue. For instance, according to the table I presented earlier, the margin in game 1 differed from the average margin by “16 runs-squared”. This is exactly as stupid as it sounds; and so when we calculate a variance of 7.68, we’re in the same situation. I’ve watched a lot of baseball games, and never has anyone referred to “runs squared”. It’s not a real unit of measurement, and since the variance is expressed in terms of this gibberish unit, it is totally meaningless to a human.

    Standard deviation

    Okay, suppose that you like the idea of using the variance because of those nice mathematical properties that I haven’t talked about, but – since you’re a human and not a robot – you’d like to have a measure that is expressed in the same units as the data itself (i.e., runs, not runs-squared). What should you do? The solution to the problem is obvious: take the square root of the variance, known as the standard deviation, also called the “root mean squared deviation”, or RMSD. This solves our problem fairly neatly: while nobody has a clue what “a variance of 7.68 runs-squared” really means, it’s much easier to understand “a standard deviation of 2.77 runs”, since it’s expressed in the original units. It is traditional to refer to the standard deviation of a sample of data as s, though “sd” and “std dev.” are also used at times. Because the standard deviation is equal to the square root of the variance, you probably won’t be surprised to see that the formula is:

\[
s=\sqrt{\frac{\sum_{i=1}^{N}\left(X_{i}-\bar{X}\right)^{2}}{N-1}}
\nonumber\]

For reasons that will make sense when we return to this topic in the estimation chapter, this quantity, calculated with N−1 in the denominator, is really an estimate of the population standard deviation rather than a simple description of the sample. When we want to emphasize that, we will refer to it as \(\hat{\sigma}\) (read as: “sigma hat”), and the formula is the same:

    \[
    \hat{\sigma}=\sqrt{\frac{\sum_{i=1}^{N}\left(X_{i}-\bar{X}\right)^{2}}{N-1}}
    \nonumber\]
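Here is the corresponding one-line check, again using the five-game example, so the result is the square root of 26.5 rather than the full-data value of 2.77:

```python
# Standard deviation as the square root of the (N - 1) variance.
import numpy as np

x = np.array([2, 6, 14, 7, 1])

print(np.sqrt(np.var(x, ddof=1)))  # about 5.15
print(np.std(x, ddof=1))           # same thing, computed directly
```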

    Which measure to use?

    We’ve discussed quite a few measures of spread (range, IQR, MAD, variance, and standard deviation), and hinted at their strengths and weaknesses. Here’s a quick summary:

    • Range. Gives you the full spread of the data. It’s very vulnerable to outliers, and as a consequence, it isn’t often used unless you have good reasons to care about the extremes in the data.
    • Interquartile range. Tells you where the “middle half” of the data sits. It’s pretty robust and complements the median nicely. This is used a lot.
    • Mean absolute deviation. Tells you how far “on average” the observations are from the mean. It’s very interpretable but has a few minor issues (not discussed here) that make it less attractive to statisticians than the standard deviation. Used sometimes, but not often.
    • Variance. Tells you the average squared deviation from the mean. It’s mathematically elegant and is probably the “right” way to describe variation around the mean, but it’s completely uninterpretable because it doesn’t use the same units as the data. Almost never used except as a mathematical tool; but it’s buried “under the hood” of a very large number of statistical tools.
    • Standard deviation. This is the square root of the variance. It’s fairly elegant mathematically, and it’s expressed in the same units as the data so it can be interpreted pretty well. In situations where the mean is the measure of central tendency, this is the default. This is by far the most popular measure of variation.

    In short, the IQR and the standard deviation are easily the two most common measures used to report the variability of the data; but there are situations in which the others are used. I’ve described all of them in this book because there’s a fair chance you’ll run into most of these somewhere.
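To tie the section together, here is a sketch that computes every measure discussed above for a single, illustrative set of winning margins; as before, ddof=1 gives the N − 1 versions that SPSS reports.

```python
# All of the measures of variability from this section in one place.
import numpy as np

x = np.array([2, 6, 14, 7, 1])   # illustrative values, not the full WinMargin column

data_range = x.max() - x.min()
q25, q75 = np.percentile(x, [25, 75])
iqr = q75 - q25
mad = np.abs(x - x.mean()).mean()
variance = np.var(x, ddof=1)     # N - 1 divisor, as reported by SPSS
std_dev = np.std(x, ddof=1)

print(data_range, iqr, mad, variance, std_dev)
```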


    This page titled 4.2: Measures of Variability is shared under a CC BY-SA 4.0 license and was authored, remixed, and/or curated by Danielle Navarro.