3.3: Measures of the Variation of the Data
Introduction
For Descriptive Statistics, in addition to graphical summaries of our data, there are also calculated summaries that we can use to identify meaningful information about our population of study.
A Population is the collection of all of the persons, things, or objects under study. The calculations that result from using a data set from the population are called Parameters.
A Sample is a portion, or subset, of the population that we collect data from for the study. The calculations that result from using a data set from the sample are called Statistics.
Variance and Standard Deviation
An important characteristic of any set of data is the variation in the data. In some data sets, the data values are concentrated closely near the mean; in other data sets, the data values are more widely spread out from the mean. The most common measure of variation, or spread, is the standard deviation. The standard deviation is a number that measures how far data values are from their mean.
If \(x\) is a data value, then the difference "\(x\) – mean" is called its deviation. In a data set, there are as many deviations as there are items in the data set. The deviations are used to calculate the variance and standard deviation. If the numbers belong to a population, in symbols a deviation is \(x - \mu\). For sample data, in symbols a deviation is \(x - \bar{x}\).
Variance is the average of the squares of the deviations (the \(x - \bar{x}\) values for a sample, or the \(x - \mu\) values for a population).
Sample Variance: If \(x\) is a data value then
\(s^{2}=\dfrac{\sum(x-\bar{x})^{2}}{n-1}\)
Population Variance: If \(x\) is a data value then
\(\sigma^{2}=\dfrac{\sum(x-\mu)^{2}}{N} \)
Round your final answer to two more decimal places than the original data values. The units are the square of the units of the data values.
Standard Deviation provides a numerical measure of the overall amount of variation in a data set, and can be used to determine whether a particular data value is close to or far from the mean.
Sample Standard Deviation: If \(x\) is a data value then
\(s = \sqrt{\dfrac{\sum(x-\bar{x})^{2}}{n-1}}\) or \(s = \sqrt{\dfrac{\sum f (x-\bar{x})^{2}}{n-1}}\) or \(s = \sqrt{\dfrac{(\sum_{i=1}^{n} x_i^{2}) - n\bar{x}^{2}}{n-1}}\)
For the sample standard deviation, the denominator is \(n - 1\), that is the sample size minus 1.
Population Standard Deviation: If \(x\) is a data value then
\(\sigma = \sqrt{\dfrac{\sum(x-\mu)^{2}}{N}} \) or \(\sigma = \sqrt{\dfrac{\sum f (x-\mu)^{2}}{N}} \) or \(\sigma = \sqrt{\dfrac{(\sum_{i=1}^{N} x_i^{2}) - N\mu^{2}}{N}} \)
For the population standard deviation, the denominator is \(N\), the number of items in the population.
In these formulas, \(f\) represents the frequency with which a value appears. For example, if a value appears once, \(f\) is one. If a value appears three times in the data set or population, \(f\) is three.
Round your final answer to one more decimal place than the original data values. The units are the same as the units of the data values.
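As a rough illustration of the sample and population formulas above, the following Python sketch uses the standard library's `statistics` module. The data values are made up for illustration and do not come from this section.

```python
import statistics

# Hypothetical wait times in minutes (illustrative values only).
data = [2, 4, 4, 5, 7, 8]

# Treating the data as a sample: the sum of squared deviations is divided by n - 1.
sample_variance = statistics.variance(data)
sample_sd = statistics.stdev(data)

# Treating the same data as an entire population: divide by N instead.
population_variance = statistics.pvariance(data)
population_sd = statistics.pstdev(data)

print(f"s^2 = {sample_variance:.2f}, s = {sample_sd:.2f}")
print(f"sigma^2 = {population_variance:.2f}, sigma = {population_sd:.2f}")
```

For the same data, the sample results are slightly larger than the population results because the divisor is smaller.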
Typically, standard deviation is found to be the more useful measure of variation of these two calculations. Variance is calculated by squaring the deviations from the mean, causing its units to be squared as well. For example, if we have a data set of the number of minutes spent waiting in line, then the units of the variance of the data set would be \(\text{minutes}^{2}\). Squared units are not useful in data set comparisons since they don't typically make sense in this context. However, variance is a useful calculation when finding other statistics and parameters, such as standard deviation.
Calculating the Standard Deviation
The procedure to calculate the standard deviation depends on whether the numbers are the entire population or are data from a sample. The calculations are similar, but not identical. Therefore the symbol used to represent the standard deviation depends on whether it is calculated from a population or a sample. The lower case letter \(s\) represents the sample standard deviation and the Greek letter \(\sigma\) (sigma, lower case) represents the population standard deviation. If the sample has the same characteristics as the population, then \(s\) should be a good estimate of \(\sigma\). However, averaging the squared deviations over the sample size tends to underestimate the variation in the population, so the sample formulas include an adjustment.
If the numbers come from a census of the entire population and not a sample, when we calculate the average of the squared deviations to find the variance, we divide by \(N\), the number of items in the population. If the data are from a sample rather than a population, when we calculate the average of the squared deviations, we divide by n – 1, one less than the number of items in the sample.
The standard deviation provides a measure of the overall variation in a data set
Since the standard deviation is a distance from the mean, it is always positive or zero. The standard deviation is small when the data are all concentrated close to the mean, exhibiting little variation or spread. The standard deviation is larger when the data values are more spread out from the mean, exhibiting more variation.
Suppose that we are studying the amount of time customers wait in line at the checkout at supermarket A and supermarket B. The average wait time at both supermarkets is five minutes. At supermarket A, the standard deviation for the wait time is two minutes; at supermarket B the standard deviation for the wait time is four minutes.
Because supermarket B has a higher standard deviation, we know that there is more variation in the wait times at supermarket B. Overall, wait times at supermarket B are more spread out from the average; wait times at supermarket A are more concentrated near the average.
The standard deviation can be used to determine whether a data value is close to or far from the mean.
Suppose that Rosa and Binh both shop at supermarket A. Rosa waits at the checkout counter for seven minutes and Binh waits for one minute. At supermarket A, the mean waiting time is five minutes and the standard deviation is two minutes.
What can we say about Rosa's wait of seven minutes?
- Seven is two minutes longer than the mean of five; two minutes is equal to one standard deviation.
- Rosa's wait time of seven minutes is two minutes longer than the mean of five minutes.
- Rosa's wait time of seven minutes is one standard deviation above the mean of five minutes.
- Seven is one standard deviation to the right of five because \(5 + (1)(2) = 7\).
What can we say about Binh's wait of one minute?
- One is four minutes less than the mean of five; four minutes is equal to two standard deviations.
- Binh's wait time of one minute is four minutes less than the mean of five minutes.
- Binh's wait time of one minute is two standard deviations below the mean of five minutes.
- One is two standard deviations to the left of five because \(5 + (-2)(2) = 1\).
A data value that is two standard deviations from the average is just on the borderline for what many statisticians would consider to be far from the mean. Considering data to be far from the mean if it is more than two standard deviations away is more of an approximate "rule of thumb" than a rigid rule. In general, the shape of the distribution of the data affects how much of the data is further away than two standard deviations. (You will learn more about this in later chapters.) Data values that are far from the mean are considered to be unusual, so Binh's wait time of one minute is more unusual than Rosa's wait time of seven minutes.
In general, a data value = mean + (Number of Standard deviations)(standard deviation). The number of standard deviations does not need to be an integer.
The equation data value = mean + (Number of Standard deviations)(standard deviation) can be expressed for a sample and for a population using symbolic notation where k is used to represent the number of standard deviations.
- Sample: \[x = \bar{x} + k(s)\]
- Population: \[x = \mu + k(\sigma)\]
The lower case letter s represents the sample standard deviation and the Greek letter \(\sigma\) (sigma, lower case) represents the population standard deviation.
The symbol \(\bar{x}\) is the sample mean and the Greek symbol \(\mu\) is the population mean.
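Solving the equation above for \(k\) gives \(k = (x - \text{mean}) / (\text{standard deviation})\). The short Python sketch below applies this to Rosa's and Binh's wait times at supermarket A; the function name `num_std_devs` is a hypothetical helper chosen for illustration.

```python
def num_std_devs(x, mean, sd):
    """Return k such that x = mean + k * sd."""
    return (x - mean) / sd

mean_wait, sd_wait = 5, 2  # supermarket A, in minutes

k_rosa = num_std_devs(7, mean_wait, sd_wait)  # positive: above the mean
k_binh = num_std_devs(1, mean_wait, sd_wait)  # negative: below the mean

print(k_rosa, k_binh)
```

A positive \(k\) places the value to the right of the mean and a negative \(k\) places it to the left, matching the statements about Rosa (one standard deviation above) and Binh (two standard deviations below).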
Sampling Variability of a Statistic
How much the statistic varies from one sample to another is known as the sampling variability of a statistic. You typically measure the sampling variability of a statistic by its standard error.
The standard error of the mean is an example of a standard error. It is a special standard deviation and is known as the standard deviation of the sampling distribution of the mean. You will cover the standard error of the mean in a later chapter. The notation for the standard error of the mean is \(\dfrac{\sigma}{\sqrt{n}}\) where \(\sigma\) is the standard deviation of the population and \(n\) is the size of the sample.
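As a small numerical sketch of this notation, suppose the two-minute standard deviation at supermarket A is treated as a population value and a sample of 25 customers is taken. Both the sample size and that treatment are assumptions made only for illustration.

```python
import math

sigma = 2  # standard deviation of wait times in minutes, treated as a population value (assumption)
n = 25     # hypothetical sample size

# Standard error of the mean: sigma divided by the square root of n.
standard_error = sigma / math.sqrt(n)
print(standard_error)
```

Under these assumptions the standard error is smaller than \(\sigma\), and it shrinks further as the sample size grows.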
Range
One final measure of variation is the Range of the data set. Like the Midrange, a measure of center that finds the midpoint between the minimum and maximum data values, the range is based only on the minimum and maximum data values.
The Range of a data set is the difference between the maximum data value and the minimum data value.
- Range = Maximum value – Minimum Value
- Since it only uses maximum and minimum values in its calculation, it is very sensitive to extreme values.
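The sensitivity of the range to extreme values can be seen in a brief Python sketch. The wait times below are made-up values, and `data_range` is a hypothetical helper name.

```python
def data_range(values):
    """Range = maximum value - minimum value."""
    return max(values) - min(values)

# Hypothetical wait times in minutes (illustrative values only).
wait_times = [3, 4, 5, 5, 6, 7]
with_extreme_value = wait_times + [30]  # one unusually long wait added

range_original = data_range(wait_times)
range_with_extreme = data_range(with_extreme_value)
print(range_original, range_with_extreme)
```

Adding a single extreme value changes the range from 4 to 27 even though the other six values are unchanged.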
Technology
Calculating the standard deviation using the technique described in the next example helps with understanding what standard deviation is with respect to the mean, so you are encouraged to try it with smaller data sets. However, unless told otherwise, it is best to use a calculator or computer software to calculate the standard deviation. Not only is the process tedious to do by hand, but if you are not careful, then you run the risk of including round off errors in your calculations.
If you are using a TI-83, 83+, 84+ calculator, you need to select the appropriate standard deviation \(\sigma_{x}\) or \(s_{x}\) from the summary statistics. If you are using a spreadsheet (Microsoft Excel or Google Sheets), you should use the appropriate formula =stdev.p( or =stdev.s( . We will concentrate on using and interpreting the information that the standard deviation gives us. The technology instructions appear at the end of this example.
Example \(\PageIndex{1}\)
The following data are the ages for a SAMPLE of n = 20 fifth grade students. The ages are rounded to the nearest half year:
9; 9.5; 9.5; 10; 10; 10; 10; 10.5; 10.5; 10.5; 10.5; 11; 11; 11; 11; 11; 11; 11.5; 11.5; 11.5
- The teacher was interested in the average age and the sample standard deviation of the ages of her students.
- Find the value that is one standard deviation above the mean. Find (\(\bar{x}\) + 1s).
- Find the value that is two standard deviations below the mean. Find (\(\bar{x}\) – 2s).
- Find the values that are 1.5 standard deviations from (below and above) the mean.
Answers
1. To find the mean, sum the data values and divide by 20, the number of data values.
\[\bar{x} = \dfrac{9+9.5(2)+10(4)+10.5(4)+11(6)+11.5(3)}{20} = 10.525 \nonumber\]
The average age is 10.53 years, rounded to two places since our data values are out to one decimal place.
To find the standard deviation, start by calculating the Variance using a table to keep the information organized (explanation of the table values is given below this example box). Notice that the table uses the frequency of each data value to cut down on the number of deviations that need to be calculated.
Then the standard deviation is calculated by taking the square root of the variance.
| Data | Freq. | Deviations | Deviations² | (Freq.)(Deviations²) |
|---|---|---|---|---|
| x | f | \(x - \bar{x}\) | \((x - \bar{x})^{2}\) | \((f)(x - \bar{x})^{2}\) |
| 9 | 1 | 9 – 10.525 = –1.525 | (–1.525)² = 2.325625 | 1 × 2.325625 = 2.325625 |
| 9.5 | 2 | 9.5 – 10.525 = –1.025 | (–1.025)² = 1.050625 | 2 × 1.050625 = 2.101250 |
| 10 | 4 | 10 – 10.525 = –0.525 | (–0.525)² = 0.275625 | 4 × 0.275625 = 1.1025 |
| 10.5 | 4 | 10.5 – 10.525 = –0.025 | (–0.025)² = 0.000625 | 4 × 0.000625 = 0.0025 |
| 11 | 6 | 11 – 10.525 = 0.475 | (0.475)² = 0.225625 | 6 × 0.225625 = 1.35375 |
| 11.5 | 3 | 11.5 – 10.525 = 0.975 | (0.975)² = 0.950625 | 3 × 0.950625 = 2.851875 |
| Total | | | | 9.7375 |
The sample variance, \(s^{2}\), is equal to the sum of the last column (9.7375) divided by the total number of data values minus one (20 – 1):
\[s^{2} = \dfrac{9.7375}{20-1} = 0.5125 \text{ years}^{2}\]
The sample standard deviation \(s\) is equal to the square root of the sample variance:
\[s = \sqrt{0.5125} = 0.715891 \text{ years}\]
and this is rounded to two decimal places, \(s = 0.72\) years.
Typically, you do the calculation for the standard deviation on your calculator or computer. The intermediate results are not rounded. This is done for accuracy.
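The table calculation above can also be reproduced in a few lines of Python. This is only a sketch of the same frequency-weighted steps, with variable names chosen for illustration; intermediate results are left unrounded.

```python
import math

# Data values and their frequencies from the ages of the 20 fifth grade students.
values = [9, 9.5, 10, 10.5, 11, 11.5]
freqs = [1, 2, 4, 4, 6, 3]

n = sum(freqs)                                            # number of data values
mean = sum(f * x for x, f in zip(values, freqs)) / n      # frequency-weighted mean
sum_sq = sum(f * (x - mean) ** 2 for x, f in zip(values, freqs))  # last table column, summed
variance = sum_sq / (n - 1)                               # sample variance: divide by n - 1
sd = math.sqrt(variance)                                  # sample standard deviation

print(n, mean, round(sum_sq, 4), round(variance, 4), round(sd, 6))
```

Rounding only at the end, as the text recommends, avoids carrying round-off error through the intermediate steps.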
Solution: Spreadsheet (MS Excel/Google Sheets) (Part a only)
- Using raw data is easier for spreadsheets, because we can just use the standard deviation formulas =stdev.s( or =stdev.p( , depending on our data.
- This example can help us get ready for finding standard deviations of frequency distributions, so we'll emulate what was done above in the spreadsheet. Using the table above instead of the raw data, put the data values (9, 9.5, 10, 10.5, 11, 11.5) into the first column and the frequencies (1, 2, 4, 4, 6, 3) into the second column.
- We can take advantage of cell references to avoid typing repeated numbers and possibly making mistakes. We'll essentially copy the table above in the spreadsheet, but select the cells instead of typing them in. We can make the Spreadsheet do the calculations for us.
- For a number we don't want to change (the mean in this case), we can "lock" the cell reference using dollar signs around the letter. In this example, the mean is located in cell A9.
Formulas for use in Spreadsheets
| Data (Column A) | Frequency (Column B) | Deviations (Column C) | Deviations^2 (Column D) | Freq*(Deviations)^2 (Column E) |
|---|---|---|---|---|
| 9 | 1 | =A2-$A$9 | =C2^2 | =B2*D2 |
| 9.5 | 2 | =A3-$A$9 | =C3^2 | =B3*D3 |
| 10 | 4 | =A4-$A$9 | =C4^2 | =B4*D4 |
| 10.5 | 4 | =A5-$A$9 | =C5^2 | =B5*D5 |
| 11 | 6 | =A6-$A$9 | =C6^2 | =B6*D6 |
| 11.5 | 3 | =A7-$A$9 | =C7^2 | =B7*D7 |
| | =sum(B2:B7) | | | =sum(E2:E7) |
Then, just as above, divide the sum of Column E, 9.7375, by (20-1): 9.7375/19=0.5125.
Solution: TI Graphing Calculator
- Clear lists L1 and L2. Press STAT 4:ClrList. Enter 2nd 1 for L1, the comma (,), and 2nd 2 for L2.
- Enter data into the list editor. Press STAT 1:EDIT. If necessary, clear the lists by arrowing up into the name. Press CLEAR and arrow down.
- Put the data values (9, 9.5, 10, 10.5, 11, 11.5) into list L1 and the frequencies (1, 2, 4, 4, 6, 3) into list L2. Use the arrow keys to move around.
- Press STAT and arrow to CALC. Press 1:1-VarStats and enter L1 (2nd 1), L2 (2nd 2). Do not forget the comma. Press ENTER.
- \(\bar{x}\) = 10.525
- Use Sx because this is sample data (not a population): Sx=0.715891
2. \((\bar{x} + 1s) = 10.53 + (1)(0.72) = 11.25\)
3. \((\bar{x} - 2s) = 10.53 - (2)(0.72) = 9.09\)
4. \((\bar{x} - 1.5s) = 10.53 - (1.5)(0.72) = 9.45\) and \((\bar{x} + 1.5s) = 10.53 + (1.5)(0.72) = 11.61\)
Exercise \(\PageIndex{1}\)
On a baseball team, the ages of each of the players are as follows:
21; 21; 22; 23; 24; 24; 25; 25; 28; 29; 29; 31; 32; 33; 33; 34; 35; 36; 36; 36; 36; 38; 38; 38; 40
Use your calculator or computer to find the mean and standard deviation. Then find the value that is two standard deviations above the mean.
- Answer
\(\bar{x} = 30.68\) years old
\(s = 6.09\) years old
\(\bar{x} + 2s = 30.68 + (2)(6.09) = 42.86\) years old
Explanation of the standard deviation calculation shown in the table
The deviations show how spread out the data are about the mean. The data value 11.5 is farther from the mean than is the data value 11, which is indicated by the deviations 0.975 and 0.475. A positive deviation occurs when the data value is greater than the mean, whereas a negative deviation occurs when the data value is less than the mean. The deviation is –1.525 for the data value nine. If you add the deviations, the sum is always zero. (For Example \(\PageIndex{1}\), there are \(n = 20\) deviations.) So you cannot simply add the deviations to get the spread of the data. By squaring the deviations, you make them positive numbers, and the sum will also be positive. The variance, then, is the average squared deviation.
The variance is a squared measure and does not have the same units as the data. Taking the square root solves the problem. The standard deviation measures the spread in the same units as the data.
Notice that instead of dividing by \(n = 20\), the calculation divided by \(n - 1 = 20 - 1 = 19\) because the data is a sample. For the sample variance, we divide by the sample size minus one (\(n - 1\)). Why not divide by \(n\)? The answer has to do with the population variance. The sample variance is an estimate of the population variance. Based on the theoretical mathematics that lies behind these calculations, dividing by (\(n - 1\)) gives a better estimate of the population variance.
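One way to see why \(n - 1\) gives a better estimate is to list every possible sample from a very small population and average the results. The Python sketch below does this for a made-up population of three values, using samples of size two drawn with replacement. The population, the sample size, and the sampling scheme are all assumptions chosen to keep the enumeration short.

```python
from fractions import Fraction
from itertools import product

population = [1, 2, 3]  # hypothetical tiny population
N = len(population)
mu = Fraction(sum(population), N)
sigma2 = sum((x - mu) ** 2 for x in population) / N  # population variance

n = 2
# Every ordered sample of size n drawn with replacement (9 samples in total).
samples = list(product(population, repeat=n))

def sum_sq_dev(sample):
    """Sum of squared deviations of a sample from its own mean."""
    xbar = Fraction(sum(sample), len(sample))
    return sum((x - xbar) ** 2 for x in sample)

# Average of the variance estimates over all samples, for each divisor.
avg_div_n = sum(sum_sq_dev(s) / n for s in samples) / len(samples)
avg_div_n_minus_1 = sum(sum_sq_dev(s) / (n - 1) for s in samples) / len(samples)

print("population variance:", sigma2)
print("average estimate dividing by n:", avg_div_n)
print("average estimate dividing by n - 1:", avg_div_n_minus_1)
```

In this small case, dividing by \(n - 1\) averages out to exactly the population variance, while dividing by \(n\) averages out to a smaller number.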
Your concentration should be on what the standard deviation tells us about the data. The standard deviation is a number which measures how far the data are spread from the mean. Let a calculator or computer do the arithmetic.
The standard deviation, \(s\) or \(\sigma\), is either zero or larger than zero. When the standard deviation is zero, there is no spread; that is, all the data values are equal to each other. The standard deviation is small when the data are all concentrated close to the mean, and is larger when the data values show more variation from the mean. When the standard deviation is a lot larger than zero, the data values are very spread out about the mean; outliers can make \(s\) or \(\sigma\) very large.
The standard deviation, when first presented, can seem unclear. By graphing your data, you can get a better "feel" for the deviations and the standard deviation. You will find that in symmetrical distributions, the standard deviation can be very helpful but in skewed distributions, the standard deviation may not be much help. The reason is that the two sides of a skewed distribution have different spreads.
Exercise \(\PageIndex{2}\)
The following data show the number of different types of pet food that stores in the area carry.
6; 6; 6; 6; 7; 7; 7; 7; 7; 8; 9; 9; 9; 9; 10; 10; 10; 10; 10; 11; 11; 11; 11; 12; 12; 12; 12; 12; 12
Calculate the sample mean and the sample standard deviation to one decimal place using technology.
- Answer
\(\bar{x} = 9.3\) and \(s = 2.2\)
Standard deviation of Grouped Frequency Tables
Recall that for grouped data we do not know individual data values, so we cannot describe the typical value of the data with precision. In other words, we cannot find the exact mean, median, or mode. We can, however, determine the best estimate of the measures of center by finding the mean of the grouped data with the formula:
\[\text{Mean of Frequency Table} = \dfrac{\sum fm}{\sum f}\]
where \(f =\) interval frequencies and \(m =\) interval midpoints.
Just as we could not find the exact mean, neither can we find the exact standard deviation. Remember that standard deviation describes numerically the expected deviation a data value has from the mean. In simple English, the standard deviation allows us to compare how “unusual” individual data is compared to the mean.
Example \(\PageIndex{2}\)
Find the standard deviation for the sample data in Table \(\PageIndex{3}\).
| Class | Frequency, f | Midpoint, m | \(fm\) | \(\bar{x}\) | \(m - \bar{x}\) | \((m - \bar{x})^{2}\) | \(f (m - \bar{x})^{2}\) |
|---|---|---|---|---|---|---|---|
| 0–2 | 1 | 1 | 1 | 7.58 | -6.58 | 43.2964 | 43.2964 |
| 3–5 | 6 | 4 | 24 | 7.58 | -3.58 | 12.8164 | 76.8984 |
| 6–8 | 10 | 7 | 70 | 7.58 | -0.58 | 0.3364 | 3.364 |
| 9–11 | 7 | 10 | 70 | 7.58 | 2.42 | 5.8564 | 40.9948 |
| 12–14 | 0 | 13 | 0 | 7.58 | 5.42 | 29.3764 | 0 |
| 15–17 | 2 | 16 | 32 | 7.58 | 8.42 | 70.8964 | 141.7928 |
| SUM | 26 | | 197 | | | | 306.3464 |
The values in the second, third, and fourth columns of Table \(\PageIndex{3}\) are used to calculate the mean of the grouped frequency table, the value in the fifth column.
\(\bar{x}=\dfrac{\sum fm}{\sum f}=\dfrac{197}{26} = 7.58\)
After calculating \(\bar{x}\), find the difference, \(m - \bar{x}\), for each midpoint, \(m\). Next, square each difference. In the final column, calculate the product of the frequency and the squared difference for each class. The sum of the final column is \(43.2964 + 76.8984 + 3.364 + 40.9948 + 0 + 141.7928 = 306.3464\).
\(s = \sqrt{\dfrac{\sum f (m-\bar{x})^{2}}{n-1}}=\sqrt{\dfrac{306.3464}{26-1}} \approx 3.5\)
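The grouped calculation can be sketched in Python as follows. The class limits and frequencies are taken from the table; the variable names are illustrative. Because the code keeps the unrounded mean rather than 7.58, its sum of squares differs slightly in the later decimal places from one computed with the rounded mean.

```python
import math

# (lower class limit, upper class limit, frequency) for each class in the table.
classes = [(0, 2, 1), (3, 5, 6), (6, 8, 10), (9, 11, 7), (12, 14, 0), (15, 17, 2)]

midpoints = [(lo + hi) / 2 for lo, hi, _ in classes]  # m for each class
freqs = [f for _, _, f in classes]                    # f for each class

n = sum(freqs)
mean = sum(f * m for f, m in zip(freqs, midpoints)) / n               # sum(fm) / sum(f)
sum_sq = sum(f * (m - mean) ** 2 for f, m in zip(freqs, midpoints))   # sum of f(m - mean)^2
sd = math.sqrt(sum_sq / (n - 1))                                      # sample standard deviation

print(n, round(mean, 2), round(sum_sq, 4), round(sd, 1))
```

Since each data value is replaced by its class midpoint, the result is an estimate of the standard deviation rather than an exact value.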
Comparing Values from Different Data Sets
The standard deviation is useful when comparing data values that come from different data sets as long as the units are the same and the means are not that different. However, if the data sets have different means and units, then comparing the data values directly can be misleading. There are a couple of techniques we can use to compare the variation in different data sets.
- Coefficient of Variation
- Z - scores.
Z - scores are considered a measure of position within a data set, so we will discuss these more in depth in the next section.
Coefficient of Variation
Coefficient of variation is the standard deviation divided by the mean; it summarizes the amount of variation as a percentage or proportion of the total. It is useful when comparing the amount of variation for one variable among groups with different means, or among different measurement variables.
For example, the United States military measured foot length and foot width in 1,774 American men. The standard deviation of foot length was \(13.1 \text{ mm}\) and the standard deviation of foot width was \(5.26 \text{ mm}\), which makes it seem as if foot length is more variable than foot width. However, feet are longer than they are wide. Dividing by the means (\(269.7 \text{ mm}\) for length, \(100.6 \text{ mm}\) for width), the coefficient of variation is actually slightly smaller for length (\(4.9\%\)) than for width (\(5.2\%\)), which for most purposes would be a more useful measure of variation.
The coefficient of variation, denoted by CVar or CV, is used to compare standard deviations from different populations.
For samples:
\[ CV = \frac{s}{\bar{x}}\cdot 100\%\]
For populations:
\[ CV = \frac{\sigma}{\mu}\cdot 100\%\]
Example \(\PageIndex{3}\)
According to FuelEconomy.gov, for the year 2014, automatic Sport-Utility Vehicles with 4-wheel drive have an average fuel economy of 21 miles per gallon (mpg), with a standard deviation of 2.3 mpg. Standard trucks with 4-wheel drive and automatic transmission have an average fuel economy of 17 mpg and standard deviation of 2.0 mpg.
Compare the variations of the two.
Solution:
SUVs: 2.3/21*100% = 11.0%
Trucks: 2.0/17*100% = 11.8%
Comparing the coefficients of variation for the SUVs and the Trucks, the truck fuel economy is more variable than the SUVs.
Source: FuelEconomy.gov
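The comparison in this example can be checked with a short Python sketch; `coefficient_of_variation` is a hypothetical helper name used only for illustration.

```python
def coefficient_of_variation(sd, mean):
    """Standard deviation as a percentage of the mean."""
    return sd / mean * 100

cv_suv = coefficient_of_variation(2.3, 21)    # SUVs: sd 2.3 mpg, mean 21 mpg
cv_truck = coefficient_of_variation(2.0, 17)  # trucks: sd 2.0 mpg, mean 17 mpg

print(f"SUVs: {cv_suv:.1f}%  Trucks: {cv_truck:.1f}%")
```

Although the trucks have the smaller standard deviation, their coefficient of variation is larger because their mean is also smaller.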
Standard Deviation and Distribution
The following lists give a few facts that provide a little more insight into what the standard deviation tells us about the distribution of the data.
For ANY data set, no matter what the distribution of the data is:
- At least 75% of the data is within two standard deviations of the mean.
- At least 89% of the data is within three standard deviations of the mean.
- At least 95% of the data is within 4.5 standard deviations of the mean.
- This is known as Chebyshev's Rule.
For data having a distribution that is BELL-SHAPED and SYMMETRIC:
- Approximately 68% of the data is within one standard deviation of the mean.
- Approximately 95% of the data is within two standard deviations of the mean.
- More than 99% of the data is within three standard deviations of the mean.
- This is known as the Empirical Rule.
- It is important to note that this rule only applies when the shape of the distribution of the data is bell-shaped and symmetric. We will learn more about this when studying the "Normal" or "Gaussian" probability distribution in later chapters.
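The percentages in both lists can be reproduced numerically. The sketch below assumes the usual statement of Chebyshev's Rule, that at least \(1 - 1/k^{2}\) of the data lies within \(k\) standard deviations of the mean (for \(k > 1\)); this formula is not given in this section. For the bell-shaped case it uses the normal distribution from Python's `statistics` module as the model.

```python
from statistics import NormalDist

def chebyshev_lower_bound(k):
    """Minimum proportion of data within k standard deviations of the mean (k > 1)."""
    return 1 - 1 / k ** 2

# Chebyshev's Rule: holds for any data set.
for k in (2, 3, 4.5):
    print(f"k = {k}: at least {chebyshev_lower_bound(k):.1%}")

# Empirical Rule: proportions for a normal (bell-shaped, symmetric) distribution.
standard_normal = NormalDist()  # mean 0, standard deviation 1
for k in (1, 2, 3):
    within = standard_normal.cdf(k) - standard_normal.cdf(-k)
    print(f"k = {k}: approximately {within:.1%}")
```

Chebyshev's percentages are guaranteed minimums, which is why they are lower than the Empirical Rule's percentages for the same number of standard deviations.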
More information and examples are provided in the next section.
Review
The standard deviation can help you measure the spread of data. There are different equations to use depending on whether you are calculating the standard deviation of a sample or of a population.
- The Standard Deviation allows us to compare individual data or classes to the data set mean numerically.
- \(s = \sqrt{\dfrac{\sum(x-\bar{x})^{2}}{n-1}}\) or \(s = \sqrt{\dfrac{\sum f (x-\bar{x})^{2}}{n-1}}\) is the formula for calculating the standard deviation of a sample. To calculate the standard deviation of a population, we would use the population mean, \(\mu\), and the formula \(\sigma = \sqrt{\dfrac{\sum(x-\mu)^{2}}{N}}\) or \(\sigma = \sqrt{\dfrac{\sum f (x-\mu)^{2}}{N}}\).
Formula Review
\[s_{x} = \sqrt{\dfrac{\sum fm^{2} - n\bar{x}^{2}}{n-1}}\]
where \(s_{x} = \text{sample standard deviation}\), \(\bar{x} = \text{sample mean}\), \(f =\) interval frequencies, \(m =\) interval midpoints, and \(n = \sum f\)


