2.7: Distributions - Using Centrality and Variability Together
Learning Objectives
- State and apply Chebyshev's Inequality
- Define unusual observations
- Define normal distributions
- State properties of normal distributions
- Discuss distributions and curves
- Define the standard normal distribution
- State and apply the Empirical Rule
- Define \(z\)-score
- Define outliers
Section \(2.7\) Excel File (contains all of the data sets for this section)
Connecting Measures of Central Tendency & Measures of Dispersion
In the previous two sections, we developed two significant classes of descriptive statistics: measures of central tendency and measures of dispersion. In this section, we begin to consider the power of these measures together. We have shown that the mean of a data set is the balancing point of its frequency distribution and that it minimizes the sum of the squared deviations. In the last section, we defined the variance of a data set to be the mean of these squared deviations (with appropriate modifications for sample data), and we set the standard deviation to be the square root of the variance. The coupling of these measures of centrality and dispersion tells us a lot about the distribution of our data.
Think about what "standard deviation" means; it represents a measure of "typical distance" from the mean. If the mean of some data set were \(100\) and the standard deviation were \(10,\) then we would expect a good chunk of our data points to be between \(90\) and \(110;\) that is, much of the data does not deviate from the mean by more than the standard deviation. We would expect most of the data to fall between \(80\) and \(120;\) that is, most of the data would be within two standard deviations of the mean. Think about it: if all of our data points were less than \(80\) or more than \(120,\) then all of them would deviate from \(100\) by at least \(20.\) How could a "typical" deviation from the mean be \(10\) if most of the points are off by at least \(20?\) Taking this further, we should find it incredibly rare for a data point to be more than \(7\) standard deviations away from the mean.
Consider the data set \(\{2, 3, 4, 14, 15, 16, 16, 17, 27, 36\}.\)
1. Give the population mean and population standard deviation.
- Answer
-
Using the formulae from previous sections, the mean is \(\mu\) \(= 15 \) and the standard deviation is \(\sigma\) \(\approx 10.13.\)
2. How many data points are within \(1\) standard deviation of the mean?
- Answer
-
For a data point to be within \(1\) standard deviation of \(15\) means that its distance to \(15\) is no more than \(10.13.\) Thus, any data point between \(15-10.13 = 4.87\) and \(15+10.13=25.13\) would be within \(1\) standard deviation of the mean. We can see that \(5\) data points (\(14,\) \(15,\) \(16,\) \(16,\) and \(17\)) fall in this range, meaning \(50\%\) of the data points are within \(1\) standard deviation of the mean.
3. What proportion of data points are within \(1.5\) standard deviations of the mean?
- Answer
-
\(1.5\) standard deviations would be \(1.5\sigma\) \(\approx 1.5\cdot10.13\) \(\approx 15.2.\) Thus, any number whose distance to the mean is less than \(15.2\) is within \(1.5\) standard deviations of the mean. If we go \(15.2\) below the mean, that would be \(15-15.2\) \(=-0.2.\) If we go \(15.2\) above the mean, that would be \(15+15.2\) \(=30.2.\) Notice that all but \(1\) of our data points (the value \(36\)) fall in this range. Since we have \(10\) data points, this means \(90\%\) of our data is within \(1.5\) standard deviations of the mean.
4. What proportion of data points are within \(3\) standard deviations of the mean?
- Answer
-
\(3 \sigma\) \(\approx 3\cdot10.13\) \(\approx 30.39.\) Notice that none of the data points differ from the mean by more than \(30.39.\) This means all of our data is within \(3\) standard deviations of the mean. The proportion is \(100\%.\)
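These computations are easy to verify with software. The following Python sketch (our own illustration, not part of the original example; the helper name is ours) checks the answers above using the standard library.

```python
# A minimal sketch verifying the worked example above with Python's
# standard library; the function proportion_within is our own helper.
import statistics

data = [2, 3, 4, 14, 15, 16, 16, 17, 27, 36]

mu = statistics.mean(data)       # population mean: 15
sigma = statistics.pstdev(data)  # population standard deviation: ~10.13

def proportion_within(values, mu, sigma, k):
    """Proportion of values within k standard deviations of the mean."""
    return sum(abs(x - mu) <= k * sigma for x in values) / len(values)

print(round(sigma, 2))                          # 10.13
print(proportion_within(data, mu, sigma, 1))    # 0.5
print(proportion_within(data, mu, sigma, 1.5))  # 0.9
print(proportion_within(data, mu, sigma, 3))    # 1.0
```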
We note that the behavior in the preceding example does not characterize all data sets. In general, a data set can contain points that are \(5,\) \(10,\) or any number of standard deviations above or below the mean; such examples typically involve a large number of data points. However, the basic intuition still stands: only a tiny proportion of points can be far away from the mean. Let us be more precise with this statement.
Chebyshev's Inequality
Around the middle of the nineteenth century, mathematicians (Pafnuty Chebyshev, in particular) discovered an explicit connection between a data set's mean and standard deviation and its distribution. The derivation of this result is beyond the scope of an elementary statistics course, so we shall present the result and begin to digest its implications.
Given any data set with (population) mean, \(\mu,\) and (population) standard deviation, \(\sigma,\) and any real number \(k>1,\) the proportion of observations that lie in the interval \([\mu-k\sigma,\mu+k\sigma]\) is at least \(1-\frac{1}{k^2}.\)
Using Chebyshev's Inequality, we can guarantee a minimum percentage of observations falling in an interval symmetric about the mean. By starting at the mean and going a specified number of standard deviations above the mean and then below the mean, we are guaranteed to catch at least a certain percentage of the observations in the data set. This result's great power and beauty come from this inequality being valid for all data sets. That is important to remember. Consider the following basic applications of the result:
If \(k=2,\) we have that at least \( 1-\frac{1}{2^2}\) \(=\frac{3}{4}\) \(=75\%\) of the observations fall between \(\mu-2\sigma\) and \(\mu+2\sigma.\) Another way to say this is that, for any data set, at least \(75\%\) of the data falls within two standard deviations of the mean.
If \(k=3,\) we have that at least \(\frac{8}{9}\) \(=88.\bar{8}\%\) of the observations fall between \(\mu-3\sigma\) and \(\mu+3\sigma.\)
The implication of this is that for any data set, at most \(25\%\) of the observations fall more than \(2\sigma\) away from the mean \(\mu,\) and at most \(11.\bar{1}\%\) of the observations fall more than \(3\sigma\) away from the mean \(\mu.\) Most observations fall within \(2\) or \(3\) standard deviations of the mean. When we have an observational value that falls away from the bulk of the observations, we consider it unusual. We say that an observation is unusual by the \(2\) standard deviation rule if it is more than \(2\) standard deviations away from the mean; likewise, an observation is unusual by the \(3\) standard deviation rule if it is more than \(3\) standard deviations away from the mean.
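Since the bound \(1-\frac{1}{k^2}\) depends only on \(k,\) it is trivial to compute. The following sketch (our own helper, with a hypothetical name) packages it:

```python
# Chebyshev's lower bound on the proportion of observations within
# k standard deviations of the mean; valid for any data set and any k > 1.
def chebyshev_bound(k):
    if k <= 1:
        raise ValueError("Chebyshev's Inequality requires k > 1")
    return 1 - 1 / k**2

print(chebyshev_bound(2))  # 0.75       -> at least 75% within 2 sigma
print(chebyshev_bound(3))  # 0.8888...  -> at least 88.9% within 3 sigma
```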
Suppose a sales department of some corporation is supposed to acquire a minimum of \(\$200,000\) in revenue each week. Glancing at a long-term report over the last \(25\) years, we see that, on average, the department made \(\$245,000\) each week with a population standard deviation of \(\$20,000.\) What can be said about how often the department did not meet the quota?
- Answer
-
Notice that the quota is \(\$45,000\) below the average revenue generated. To apply Chebyshev's Inequality, we need to know how many standard deviations equal \(\$45,000.\) Since the standard deviation is \(\$20,000,\) we can divide to obtain this.\[ \frac{45000}{20000}=\frac{45}{20}=2+\frac{1}{4}=2.25 \nonumber\]Each week the department did not meet the quota was at least \(2.25\) standard deviations below the mean. Chebyshev's Inequality guarantees, on any data set, that the proportion of data points within \(2.25\) standard deviations of the mean is at least \(1-1/(2.25)^2\) \(\approx 0.80.\) We can be confident that at least \(80\%\) of the time, the department met the quota. It is possible that they met the quota far more than \(80\%\) of the time (they could have met it \(100\%\) of the time); we would need more information to obtain a more precise estimate. Regardless, we can be sure that the department did not miss the quota in more than \(20\%\) of all weeks in the last \(25\) years.
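The arithmetic in this answer fits in a few lines; the snippet below is a self-contained sketch using the figures from the example.

```python
# The quota sits k standard deviations below the mean weekly revenue.
k = (245_000 - 200_000) / 20_000  # 2.25 standard deviations
bound = 1 - 1 / k**2              # Chebyshev's guaranteed proportion
print(k, round(bound, 4))         # 2.25 0.8025 -> quota met in at least ~80% of weeks
```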
Using Chebyshev's Inequality, determine the number of standard deviations from the mean \(k\) to guarantee at least \(50\%\) of the observations to be in the interval \([\mu-k\sigma,\mu+k\sigma].\)
- Answer
-
We first note that \(50\%\) \(=\frac{1}{2}.\) Our problem reduces to solving the following equation:\[\frac{1}{2}=1-\frac{1}{k^2}\nonumber\]so that\[\frac{1}{k^2}=\frac{1}{2}\nonumber\]which yields the solution\[k=\pm\sqrt{2}\nonumber\]Remember, \(k\) must be a real number greater than \(1.\) Hence \(k=\sqrt{2}\approx1.414.\)
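The same algebra inverts the bound for any target proportion; the following sketch (our own helper name) returns the required \(k\):

```python
# Invert Chebyshev's bound: solve 1 - 1/k^2 = p for k > 1 (requires 0 < p < 1).
import math

def chebyshev_k(p):
    return math.sqrt(1 / (1 - p))

print(chebyshev_k(0.5))  # 1.4142... = sqrt(2)
```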
Can we use the results of the previous exercise to compute the first and third quartiles for any data set? Explain.
- Answer
-
There are at least two reasons why we cannot do this. While \(50\%\) of the observations do fall between the first and third quartiles, it is also true that \(25\%\) fall below the first quartile and \(25\%\) fall above the third quartile. Chebyshev's Inequality does not guarantee that the percentage of observations in each tail is \(25\%.\)
Chebyshev's Inequality asserts a minimum percentage of observations in the interval. It does not claim that there are exactly \(50\%\) of the observations; it argues that there are at least \(50\%.\)
It is worth noting that Chebyshev's Inequality does tell us that the first quartile cannot be more than \(2\) standard deviations below the mean, as this would imply that more than \(75\%\) of the data is larger than the first quartile. Similarly, the third quartile cannot be more than \(2\) standard deviations above the mean.
Normal Distributions and Curve Fitting
While Chebyshev's Inequality is powerful because it applies to all distributions, more precise connections can be made when we restrict our interest to particular classes of distributions. We shall encounter several classes throughout our course of study, but at this point, we shall limit ourselves to normal distributions. Normal distributions are common in data from everyday life; heights, IQ scores, and the "bell curve" of class grades are familiar examples. Normal distributions are symmetric and unimodal, with the mean, median, and mode all equal.
To uniquely express the shape of a normal distribution, we must discuss modeling distributions with mathematical functions or curves. We can use continuous functions to model both discrete and continuous data. Consider the following figure.
Figure \(\PageIndex{1}\): Continuous function fit to a histogram
The curve highlights the shape of the histogram reasonably well, and the fit could improve if the histogram had more classes. Increasing the number of classes is not always possible with discrete variables and finite data sets, but it can be done with continuous variables provided enough data is available with sufficient measurement precision. Recall that frequency and relative frequency distributions have similar graphical representations; the only differences are in the vertical scales. As such, we could develop functions using either distribution. We consider relative frequency distributions; these curves will play an important role throughout this course.
At each point along the horizontal axis, we have two values to compare vertically: the height of the bar versus the function value. The height of the bar represents the percentage of observations that fall in that class. We want to be able to retain this information with our model (function). As we can see, the value of the function changes within classes, making retaining this information difficult. Our solution is to construct curves that closely resemble common classes of histograms so that the area underneath the curve over a given interval corresponds to the relative frequency of the class(es) in that interval.
Consider this process visually using a data set from the Statistics Online Computational Resource containing \(25,000\) height values accurate to \(5\) decimal places. We construct relative frequency distributions with class widths of \(1,\) \(0.1,\) and \(0.01\) and portray them in two ways graphically. The histograms on the left represent the relative frequency of a class using the height of its bar. In contrast, the graphical representations on the right represent the relative frequency of a class using the area of its bar. Note that the vertical scales remain the same across all \(6\) graphs.
Figure \(\PageIndex{2}\): Graphical representations of relative frequency by height (left) and area (right)
Since we have a finite data set, the relative frequencies of each class become extremely small, around \(\frac{1}{25000},\) as the class widths become smaller. We see each class's height get smaller until it is difficult to see (bottom left graph). We see a different story on the right. Since our relative frequencies are represented by the area of the bars and the class widths are getting smaller, the shape of the distribution seems to solidify as our class widths decrease. In taking smaller and smaller class widths, our graphical representation becomes "smoother" in the shape of a continuous function, and the area underneath the function over an interval corresponds to the relative frequency of the observations in that interval.
A significant component of statistical research is checking how closely any particular model fits our actual data set. Continuous models allow us to build our statistical framework around these functions using the power of mathematics without needing to construct something new for every data set we study.
Our chosen models preserve relative frequency through area. The relative frequency of observations over a given interval is the area under the curve over that same interval. Recall that the sum of all the relative frequencies of a distribution is always equal to \(1;\) this means that the area underneath the entirety of these curves will also be \(1.\) We will name these curves and continue to deepen our understanding in the coming chapters.
With all of this build-up, we are now ready to define the class of normal distributions; the curve that defines them depends on two parameters: the mean and the standard deviation. While knowledge of the particular function bears little utility in this course, we provide it here, with the general normal distribution graphed below.\[f(x)=\frac{1}{\sigma\sqrt{2\pi}}e^{-\frac{(x-\mu)^2}{2\sigma^2}}\nonumber\]
Figure \(\PageIndex{3}\): The normal distribution centered at \(\mu\) with standard deviation \(\sigma\)
Use the figure to answer the following:
- The means of these normal distributions are \(-3,\) \(0,\) and \(4.\) Determine to which distribution each value belongs.
- Answer
-
Since the mean, median, and mode of a normal distribution are all equal, the mean occurs at the peak of the distribution. Thus \(\mu_{\text{blue}}\) \(=-3,\) \(\mu_{\text{black}}\) \(=0,\) and \(\mu_{\text{red}}\) \(=4.\)
- The standard deviations of these normal distributions are \(0.5,\) \(1,\) and \(2.\) Determine to which distribution each value belongs.
- Answer
-
The standard deviation is a measure of dispersion. The smaller the spread, the smaller the standard deviation. Since the blue distribution is spread out the most, the blue distribution has the largest standard deviation. We can say \(\sigma_{\text{blue}}\) \(=2,\) \(\sigma_{\text{black}}\) \(=1,\) and \(\sigma_{\text{red}}\) \(=0.5.\)
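Because the curve is determined entirely by \(\mu\) and \(\sigma,\) evaluating it takes only a line of code. Here is a sketch of the density formula above in Python (the function name is ours):

```python
# The normal density f(x) from the formula above; mu is the mean and
# sigma is the standard deviation.
import math

def normal_pdf(x, mu=0.0, sigma=1.0):
    coefficient = 1 / (sigma * math.sqrt(2 * math.pi))
    return coefficient * math.exp(-((x - mu) ** 2) / (2 * sigma**2))

# The peak sits at the mean, and a smaller sigma makes a taller, narrower peak:
print(normal_pdf(0))                   # ~0.3989 (standard normal at its mean)
print(normal_pdf(4, mu=4, sigma=0.5))  # ~0.7979 (the red curve at its mean)
```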
Figure \(\PageIndex{4}\): Standard Normal Distribution \(\mu=0\) and \(\sigma=1\)
The normal distribution shown above is called the standard normal distribution or the \(z\)-distribution. The standard normal distribution is the normal distribution with \(\mu=0\) and \(\sigma=1.\) We can quickly tell that the mean of the distribution is \(0.\) There is also a way to determine the standard deviation; the reasoning behind it is less apparent, but a first-semester calculus student should be able to arrive at the conclusion. There are two inflection points on normal curves, and they happen at exactly one standard deviation below and one standard deviation above the mean. All we need to do is identify an inflection point and determine the distance to the mean. For those who do not know what inflection points are, look around the points \(-1\) and \(1\) in the figure above to see what is happening; note that these values are one standard deviation away from the mean. We notice that between \(-1\) and \(1\), the curve seems to open downward, and along the tails, the curve appears to open upward. At some point, the function switches from opening upward to downward, and then at another point, the function switches from opening downward to upward; these points are called inflection points. In a normal distribution, the inflection points always occur at one standard deviation above and below the mean.
The Empirical Rule
We began our discussion about normal distributions by saying claims stronger than Chebyshev's Inequality can be made when we restrict our distributions to particular classes. We now formulate such a result for normal distributions, which we call the Empirical Rule.
Given a normal distribution with mean \(\mu\) and standard deviation \(\sigma,\) the percentage of observations within \(1,\) \(2,\) and \(3\) standard deviations of the mean is known:\[\% \text{ in } [\mu-\sigma, \mu+\sigma]\approx 68\% \\ \\ \% \text{ in } [\mu-2\sigma, \mu+2\sigma]\approx 95\% \\ \\ \% \text{ in } [\mu-3\sigma, \mu+3\sigma]\approx 99.7\% \nonumber\]
Figure \(\PageIndex{5}\): The Empirical Rule
Figure \(\PageIndex{6}\): Empirical Rule with two normal distributions with different means and standard deviations on the same set of axes
Note: the approximation signs in the statement of the Empirical Rule are used because the actual areas underneath the curve over these intervals are not exactly \(68\%,\) \(95\%,\) and \(99.7\%;\) these values are convenient roundings. We do not expect that sort of precision at this stage; in future chapters, we will use technology for greater accuracy.
Notice the difference between the claims of Chebyshev's Inequality and the Empirical Rule. Chebyshev's Inequality provides a lower bound for the percentage of observations within \(k\) standard deviations of the mean for any data; meanwhile, the Empirical Rule asserts what those percentages are for \(1,\) \(2,\) and \(3\) standard deviations, but only for data that is normally distributed.
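To see the difference concretely, the sketch below (our own illustration) prints both sets of numbers side by side; the Empirical Rule values apply only to normal data, while the Chebyshev bounds apply to any data.

```python
# Empirical Rule percentages (normal data only) versus Chebyshev's
# guaranteed lower bounds (any data); a side-by-side sketch.
empirical = {1: 0.68, 2: 0.95, 3: 0.997}

for k, pct in empirical.items():
    # Chebyshev gives no information at k = 1 (the bound is 0).
    chebyshev = 1 - 1 / k**2 if k > 1 else 0.0
    print(f"k={k}: Empirical Rule ~{pct:.1%}, Chebyshev at least {chebyshev:.1%}")
# k=1: Empirical Rule ~68.0%, Chebyshev at least 0.0%
# k=2: Empirical Rule ~95.0%, Chebyshev at least 75.0%
# k=3: Empirical Rule ~99.7%, Chebyshev at least 88.9%
```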
Let us revisit a previous example. Suppose a sales department of some corporation is supposed to acquire a minimum of \(\$200,000\) in revenue each week. Glancing at a long-term report over the last \(25\) years, we see that, on average, the department made \(\$245,000\) each week with a population standard deviation of \(\$20,000.\) Suppose that we also know that the data is normally distributed. What can be said about how often the department did not meet the quota?
- Answer
-
Last time, we noticed that the quota was \(2.25\) standard deviations below the mean and that Chebyshev's Inequality guaranteed at least \(80\%\) of the data points were above \(\$200,000.\) Now we have more information: we are given that the data is normally distributed; therefore, we can be more precise. The Empirical Rule tells us that \(95\%\) of the data is no more than \(2\) standard deviations away from the mean. Two standard deviations is \(\$40,000,\) so we are saying \(95\%\) of the data falls between \(245,000-40,000\) \(=205,000\) and \(245,000+40,000\) \(=285,000.\) Therefore, at least \(95\%\) of our data is above quota. We can use the symmetry of the normal distribution to say even more.
The Empirical Rule tells us only \(5\%\) of the data is more than \(2\) standard deviations away from the mean. Because the curve is symmetric, half lies above the mean and half below the mean. Anything above the mean was above the quota; the \(2.5\%\) of the data points more than \(2\) standard deviations above the mean can be added to the values above the quota. We conclude that at least \(97.5\%\) of the data was above the quota. The department missed the quota no more than \(2.5\%\) of the time. Later, we will develop tools that will allow us to be even more precise.
We can also return to our ideas regarding unusual observations. For normal distributions, only about \(5\%\) of observations lie more than \(2\) standard deviations from the mean, and only about \(0.3\%\) lie more than \(3\) standard deviations away. Here, the label "unusual" rings a little stronger.
Use symmetry and the Empirical Rule to find the percentage of observations in each of the following intervals for a normal distribution with mean \(\mu\) and standard deviation \(\sigma.\)
- \((-\infty, \mu-3\sigma]\)
- Answer
-
\((-\infty, \mu-3\sigma]\approx0.15\%\)
- \([\mu-3\sigma, \mu-2\sigma]\)
- Answer
-
\([\mu-3\sigma, \mu-2\sigma]\approx2.35\%\)
- \([\mu-2\sigma, \mu-\sigma]\)
- Answer
-
\([\mu-2\sigma, \mu-\sigma]\approx13.5\%\)
- \([\mu-\sigma, \mu]\)
- Answer
-
\([\mu-\sigma, \mu]\approx34\%\)
- \([\mu, \mu+\sigma]\)
- Answer
-
\([\mu, \mu+\sigma]\approx34\%\)
- \([\mu+\sigma, \mu+2\sigma]\)
- Answer
-
\([\mu+\sigma, \mu+2\sigma]\approx13.5\%\)
- \([\mu+2\sigma, \mu+3\sigma]\)
- Answer
-
\([\mu+2\sigma, \mu+3\sigma]\approx2.35\%\)
- \([\mu+3\sigma, \infty)\)
- Answer
-
\([\mu+3\sigma, \infty)\approx0.15\%\)
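Each of these band percentages follows mechanically from the Empirical Rule and symmetry: take the difference of two cumulative percentages and halve it. The sketch below (our own) reproduces the right-hand bands; each left-hand band mirrors its right-hand twin.

```python
# Deriving the band percentages above from the rounded Empirical Rule
# values by symmetry; each left-hand band mirrors its right-hand twin.
within = {0: 0.0, 1: 0.68, 2: 0.95, 3: 0.997}  # within k standard deviations

for k in (1, 2, 3):
    band = (within[k] - within[k - 1]) / 2     # one side of a symmetric pair
    print(f"[mu+{k - 1}s, mu+{k}s]: {band:.2%}")
print(f"[mu+3s, infinity): {(1 - within[3]) / 2:.2%}")
# [mu+0s, mu+1s]: 34.00%
# [mu+1s, mu+2s]: 13.50%
# [mu+2s, mu+3s]: 2.35%
# [mu+3s, infinity): 0.15%
```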
IQ scores are generally thought to be normally distributed with a mean of \(100\) and standard deviation of \(16.\) Determine the percentage of the population with IQ scores in the given ranges.
- Between \(84\) and \(116\)
- Answer
-
Since \(84\) is \(16\) less than \(100,\) \(84\) corresponds to \(1\) standard deviation below the mean. Likewise, \(116\) is \(1\) standard deviation above the mean. A direct application of the Empirical Rule tells us that \(68\%\) of the population is within this range.
- Between \(84\) and \(132\)
- Answer
-
Since \(132\) is \(32\) greater than \(100\) and \(\frac{32}{16}\) \(=2,\) \(132\) lies \(2\) standard deviations from the mean. We are looking at the interval from \(1\) standard deviation below the mean to \(2\) standard deviations above the mean. The previous exercise shows that this range contains \(68\%+13.5\%\) \(=81.5\%\) of the population.
- Greater than \(148\)
- Answer
-
Since \(148\) is \(48\) greater than \(100\) and \(\frac{48}{16}\) \(=3,\) \(148\) lies \(3\) standard deviations from the mean. The percentage of the population that lies beyond that is \(0.15\%.\)
- Between \(52\) and \(68\)
- Answer
-
Since \(52\) is \(48\) less than \(100,\) \(52\) is \(3\) standard deviations below the mean. Likewise, \(68\) is \(2\) standard deviations below the mean. The percentage of the population that lies between \(52\) and \(68\) is \(2.35\%.\)
- Explain why we cannot determine the percentage of the population between \(100\) and \(108\) using the Empirical Rule and symmetry.
- Answer
-
It might be tempting to say that the percentage of the population between \(100\) and \(108\) is \(17\%\) because we have often split the percentages evenly across our known intervals. We cannot do this because we do not have symmetry over the interval \([\mu,\mu+\sigma].\) The area under the curve from \(100\) to \(108\) is larger than the area under the curve from \(108\) to \(116.\) In future chapters, we will use technology to compute the area and thus deduce the percentage of the population within such intervals.
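As a preview of the technology mentioned in this answer, the area under a normal curve can be computed with the error function from Python's math module; this is a sketch of one possible approach, not the method the text will develop.

```python
# Exact areas under a normal curve via the error function; a preview
# of the technology referenced above, using the IQ parameters.
import math

def normal_cdf(x, mu, sigma):
    """P(X <= x) for a normal distribution with mean mu and std dev sigma."""
    return 0.5 * (1 + math.erf((x - mu) / (sigma * math.sqrt(2))))

mu, sigma = 100, 16
print(normal_cdf(108, mu, sigma) - normal_cdf(100, mu, sigma))  # ~0.1915
print(normal_cdf(116, mu, sigma) - normal_cdf(108, mu, sigma))  # ~0.1499
# About 19.15% of the population lies between 100 and 108, versus about
# 14.99% between 108 and 116: the split across [100, 116] is indeed uneven.
```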
\(z\)-scores
Notice that in both the Empirical Rule and Chebyshev's Inequality, we are interested in how many standard deviations an observation is from the mean. In the previous exercise, we repeatedly determined how far away an observation was from the mean. Then, we divided that difference by the standard deviation to determine the number of standard deviations the observation was from the mean. This computation is commonly called a "standardization of the data" and is known as an observation's \(z\)-score.\[z=\frac{x-\mu}{\sigma}\nonumber\]Our \(z\)-scores do more than facilitate Empirical Rule calculations; as "standardized" measures, they enable us to compare observational values across different populations.
As a married couple prepared to send their daughter to college in \(2017,\) they wanted to compare relative high school academic prowess. The daughter only took the SAT. The mom and dad took the ACT, but there was an age gap of several years. The dad took the ACT in \(1995\) while the mom took the ACT in \(1999.\) After doing a little research, they found out that the average score on the ACT in \(1999\) was \(21\) with a standard deviation of \(4.7\) and in \(1995\) the average score was \(20.8\) with a standard deviation of \(4.7.\) The average score on the SAT in \(2017\) was \(1060\) with a standard deviation of \(195.\) Determine who achieved the highest relative academic prowess on standardized tests if the dad earned a \(29,\) the mom earned a \(28,\) and the daughter earned a \(1395\) on their respective exams.
- Answer
-
In comparing their values, we can see the dad barely outscored the mom. However, we cannot directly compare the daughter's score, as the scale for the SAT is entirely different from the scale for the ACT. One way to compare the observed values in these three separate populations is to compute and then compare each observation's \(z\)-score.\[z_\text{dad}=\frac{29-20.8}{4.7}\approx1.745 \\ \\ z_\text{mom}=\frac{28-21}{4.7}\approx1.489 \\ \\ z_\text{daughter}=\frac{1395-1060}{195}\approx1.718 \nonumber\] Based on the \(z\)-scores, the dad performed the best, followed closely by his daughter. We might also consider whether these values are significantly different from each other. That is, the dad's \(z\)-score was \(0.027\) larger than the daughter's \(z\)-score; is such a difference meaningful? We will answer these questions in future work once more measures are developed.
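The standardization itself is one line of arithmetic; the sketch below (our own) reproduces the three \(z\)-scores from the example.

```python
# z-scores for the family's exam results, from z = (x - mu) / sigma.
def z_score(x, mu, sigma):
    return (x - mu) / sigma

print(round(z_score(29, 20.8, 4.7), 3))    # dad, ACT 1995:      1.745
print(round(z_score(28, 21.0, 4.7), 3))    # mom, ACT 1999:      1.489
print(round(z_score(1395, 1060, 195), 3))  # daughter, SAT 2017: 1.718
```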
Unusual Observations and Outliers
As we have progressed through this section, we have referenced the idea of unusual observations twice and mentioned that there is no standard definition agreed upon by all professionals. Chebyshev's Inequality allows us to estimate the minimum percentage of observations within a certain number of standard deviations of the mean. The Empirical Rule only makes assertions about the percentage of observations in normal distributions. From these two results, we know that only a very small percentage of observations lie many standard deviations away from the mean; we classify such observations as unusual. Sometimes, we have a few isolated observations positioned far from the rest of our data, called outliers. Outliers can point to rare or unique occurrences or possibly measurement errors. When an outlier is present, we want to check the validity of the measurement. If protocols were violated or an error occurred in the measurement, we will likely remove the observation from our data analysis.
If an observation is considered unusual by the \(2\) standard deviation rule, what can we say about its \(z\)-score?
- Answer
-
Since the observation is considered unusual by the \(2\) standard deviation rule, we know it lies at least \(2\) standard deviations away from the mean. The \(z\)-score is the number of standard deviations an observation is from the mean. We know the magnitude of the \(z\)-score is at least \(2.\) It could be negative or positive.
One way to classify outliers is using box plots. The box contains the middle \(50\%\) of the observations. How far outside this box must an observation be to be classified as an outlier? Recall the interquartile range \(\text{IQR},\) a measure of dispersion giving the range of the middle \(50\%\) of our ordered data. The box represents our central data region, and the \(\text{IQR}\) is the length of the box. It is common practice to say any observation beyond the box by more than \(1.5\cdot \text{IQR}\) is an outlier.
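The fences implied by this rule are quick to compute once the quartiles are known. Here is a sketch (the function name and the example quartiles are ours; note that software packages compute quartiles with varying conventions, so fences may differ slightly from hand methods):

```python
# Outlier fences from the 1.5*IQR rule; observations outside the
# returned interval are classified as outliers.
def iqr_fences(q1, q3):
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr

print(iqr_fences(6, 9))  # (1.5, 13.5) for example quartiles Q1 = 6, Q3 = 9
```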
Using the \(30\) scores from the \(10\) point assignment in section \(2.1,\) determine if there are any unusual observations or outliers. Use both of the rules for determining unusual observations.\[\{3,4,5,5,5,6,6,6,6,6,7,7,7,7,7,8,8,8,8,8,8,8,9,9,9,9,9,10,10,10\}\nonumber\]
- Answer
-
The rules about unusual observations depend on the mean and the standard deviation. We are studying this data as population data. A quick computation gives us the following values: \(\mu\) \(=7\frac{4}{15}\) \(\approx7.267\) and \(\sigma\) \(\approx1.769.\) Since these are intermediary steps to our conclusion, we do not want to use the rounded values in future computations. Reference them exactly when using technology.
The first unusual observation rule is if the observation is beyond \(2\) standard deviations from the mean. The bounds for this are \(\text{lower bound}\) \(= \mu - 2\cdot \sigma\) \(\approx 7.267 - 2 \cdot 1.769\) \(\approx3.729\) and \(\text{upper bound}\) \(=\mu + 2\cdot \sigma\) \(\approx 7.267 + 2 \cdot 1.769\) \(\approx 10.804.\) By this standard, the single data value \(3\) is considered unusual.
The second unusual observation rule is if the observation is beyond \(3\) standard deviations from the mean. The bounds for this standard are \(\text{lower bound}\) \(\approx1.96\) and \(\text{upper bound}\) \(\approx12.573.\) By this standard, there are no unusual observations.
The outlier rule depends on \(Q_1\) and \(Q_3.\) \(Q_1\) \(=6\) and \(Q_3\) \(=9.\) The \(\text{IQR}\) \(=9-6\) \(=3\) and \(1.5\cdot \text{IQR}\) \(=4.5.\) The bounds for this standard are \(\text{lower bound}\) \(=Q_1-1.5\cdot \text{IQR}\) \(=6-4.5\) \(=1.5\) and \(\text{upper bound}\) \(=Q_3+1.5\cdot \text{IQR}\) \(=9+4.5\) \(=13.5.\) Since no observations fall outside of this interval, there are no outliers.
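All three checks in this answer can be automated; the following sketch (our own) treats the scores as a population and reproduces the conclusions.

```python
# Checking the 2- and 3-standard-deviation rules and the 1.5*IQR rule
# on the 30 assignment scores, treated as a population.
import statistics

scores = [3, 4, 5, 5, 5, 6, 6, 6, 6, 6, 7, 7, 7, 7, 7,
          8, 8, 8, 8, 8, 8, 8, 9, 9, 9, 9, 9, 10, 10, 10]
mu = statistics.mean(scores)       # 7.2666...
sigma = statistics.pstdev(scores)  # ~1.769

for k in (2, 3):
    low, high = mu - k * sigma, mu + k * sigma
    unusual = [x for x in scores if x < low or x > high]
    print(f"{k}-standard-deviation rule: ({low:.3f}, {high:.3f}), unusual: {unusual}")
# 2-standard-deviation rule: (3.729, 10.804), unusual: [3]
# 3-standard-deviation rule: (1.960, 12.573), unusual: []

q1, q3 = 6, 9                      # quartiles found by hand in the text
low, high = q1 - 1.5 * (q3 - q1), q3 + 1.5 * (q3 - q1)
print("outliers:", [x for x in scores if x < low or x > high])  # []
```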