6.5: The Sample Variance
Descriptive Theory
Recall the basic model of statistics: we have a population of objects of interest, and we have various measurements (variables) that we make on these objects. We select objects from the population and record the variables for the objects in the sample; these become our data. Once again, our first discussion is from a descriptive point of view. That is, we do not assume that the data are generated by an underlying probability distribution. Remember however, that the data themselves form a probability distribution.
Variance and Standard Deviation
Suppose that \(\bs{x} = (x_1, x_2, \ldots, x_n)\) is a sample of size \(n\) from a real-valued variable \(x\). Recall that the sample mean is \[ m = \frac{1}{n} \sum_{i=1}^n x_i \] and is the most important measure of the center of the data set. The sample variance is defined to be \[ s^2 = \frac{1}{n - 1} \sum_{i=1}^n (x_i - m)^2 \] If we need to indicate the dependence on the data vector \(\bs{x}\), we write \(s^2(\bs{x})\). The difference \(x_i - m\) is the deviation of \(x_i\) from the mean \(m\) of the data set. Thus, the variance is the mean square deviation and is a measure of the spread of the data set with respect to the mean. The reason for dividing by \(n - 1\) rather than \(n\) is best understood in terms of the inferential point of view that we discuss in the next section; this definition makes the sample variance an unbiased estimator of the distribution variance. However, the reason for the averaging can also be understood in terms of a related concept.
\(\sum_{i=1}^n (x_i - m) = 0\).
Proof
\(\sum_{i=1}^n (x_i - m) = \sum_{i=1}^n x_i - \sum_{i=1}^n m = n m - n m = 0\).
Thus, if we know \(n - 1\) of the deviations, we can compute the last one. This means that there are only \(n - 1\) freely varying deviations, that is to say, \(n - 1\) degrees of freedom in the set of deviations. In the definition of sample variance, we average the squared deviations, not by dividing by the number of terms, but rather by dividing by the number of degrees of freedom in those terms. However, this argument notwithstanding, it would be reasonable, from a purely descriptive point of view, to divide by \(n\) in the definition of the sample variance. Moreover, when \(n\) is sufficiently large, it hardly matters whether we divide by \(n\) or by \(n - 1\).
In any event, the square root \(s\) of the sample variance \(s^2\) is the sample standard deviation. It is the root mean square deviation and is also a measure of the spread of the data with respect to the mean. Both measures of spread are important. Variance has nicer mathematical properties, but its physical unit is the square of the unit of \(x\). For example, if the underlying variable \(x\) is the height of a person in inches, the variance is in square inches. On the other hand, the standard deviation has the same physical unit as the original variable, but its mathematical properties are not as nice.
Recall that the data set \(\bs{x}\) naturally gives rise to a probability distribution, namely the empirical distribution that places probability \(\frac{1}{n}\) at \(x_i\) for each \(i\). Thus, if the data are distinct, this is the uniform distribution on \(\{x_1, x_2, \ldots, x_n\}\). The sample mean \(m\) is simply the expected value of the empirical distribution. Similarly, if we were to divide by \(n\) rather than \(n - 1\), the sample variance would be the variance of the empirical distribution. Most of the properties and results in this section follow from much more general properties and results for the variance of a probability distribution (although for the most part, we give independent proofs).
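The definitions above can be tried out on a small data set. The following is a minimal sketch, not a definitive implementation; the data values and the function names `sample_variance` and `empirical_variance` are illustrative assumptions, not part of the text.

```python
import math
import statistics

def sample_variance(xs):
    """Sample variance s^2: squared deviations averaged with divisor n - 1."""
    n = len(xs)
    m = sum(xs) / n
    return sum((x - m) ** 2 for x in xs) / (n - 1)

def empirical_variance(xs):
    """Variance of the empirical distribution: same sum, divisor n."""
    n = len(xs)
    m = sum(xs) / n
    return sum((x - m) ** 2 for x in xs) / n

data = [2, 4, 4, 4, 5, 5, 7, 9]            # hypothetical data with mean 5
m = sum(data) / len(data)
deviation_sum = sum(x - m for x in data)   # the deviations sum to 0

s2 = sample_variance(data)      # 32 / 7
v = empirical_variance(data)    # 32 / 8
s = math.sqrt(s2)               # sample standard deviation
```

In the Python standard library, `statistics.variance` uses the divisor \(n - 1\) and `statistics.pvariance` uses the divisor \(n\), so they should agree with the two functions above.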
Measures of Center and Spread
Measures of center and measures of spread are best thought of together, in the context of an error function. The error function measures how well a single number \(a\) represents the entire data set \(\bs{x}\). The values of \(a\) (if they exist) that minimize the error functions are our measures of center; the minimum value of the error function is the corresponding measure of spread. Of course, we hope for a single value of \(a\) that minimizes the error function, so that we have a unique measure of center.
Let's apply this procedure to the mean square error function defined by \[ \mse(a) = \frac{1}{n - 1} \sum_{i=1}^n (x_i - a)^2, \quad a \in \R \] Minimizing \(\mse\) is a standard problem in calculus.
The graph of \(\mse\) is a parabola opening upward.
 \(\mse\) is minimized when \(a = m\), the sample mean.
 The minimum value of \(\mse\) is \(s^2\), the sample variance.
Proof
We can tell from the form of \(\mse\) that the graph is a parabola opening upward. Taking the derivative gives \[ \frac{d}{da} \mse(a) = -\frac{2}{n - 1}\sum_{i=1}^n (x_i - a) = -\frac{2}{n - 1}(n m - n a) \] The derivative is 0 only at \(a = m\). Hence \(a = m\) is the unique value that minimizes \(\mse\). Of course, \(\mse(m) = s^2\).
Trivially, if we defined the mean square error function by dividing by \(n\) rather than \(n - 1\), then the minimum value would still occur at \(m\), the sample mean, but the minimum value would be the alternate version of the sample variance in which we divide by \(n\). On the other hand, if we were to use the root mean square deviation function \(\text{rmse}(a) = \sqrt{\mse(a)}\), then because the square root function is strictly increasing on \([0, \infty)\), the minimum value would again occur at \(m\), the sample mean, but the minimum value would be \(s\), the sample standard deviation. The important point is that with all of these error functions, the unique measure of center is the sample mean, and the corresponding measures of spread are the various ones that we are studying.
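A short numerical check of the minimization result, under the assumption of a small hypothetical data vector. The closed form \(\mse(a) = s^2 + \frac{n}{n - 1}(a - m)^2\) used below is not stated in the text; it follows from expanding \((x_i - a)^2\) around \(m\) and is used here only as a cross-check.

```python
def mse(a, xs):
    """Mean square error of a as a representative of xs (divisor n - 1)."""
    n = len(xs)
    return sum((x - a) ** 2 for x in xs) / (n - 1)

data = [2.0, 1.0, 5.0, 7.0]   # hypothetical data
n = len(data)
m = sum(data) / n             # sample mean
s2 = mse(m, data)             # minimum value, the sample variance

# Evaluate mse on a grid centered at m and locate the smallest value.
grid = [m + k / 10 for k in range(-30, 31)]
best = min(grid, key=lambda a: mse(a, data))

# Largest discrepancy between mse and the parabola s^2 + n (a - m)^2 / (n - 1).
max_gap = max(abs(mse(a, data) - (s2 + n * (a - m) ** 2 / (n - 1))) for a in grid)
```

The grid point that minimizes `mse` should be `m` itself, and `max_gap` should be zero up to rounding error.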
Next, let's apply our procedure to the mean absolute error function defined by \[ \mae(a) = \frac{1}{n - 1} \sum_{i=1}^n \left|x_i - a\right|, \quad a \in \R \]
The mean absolute error function satisfies the following properties:
 \(\mae\) is a continuous function.
 The graph of \(\mae\) consists of lines.
 The slope of the line at \(a\) depends on where \(a\) is in the data set \(\bs{x}\).
Proof
For parts (a) and (b), note that for each \(i\), \(\left|x_i - a\right|\) is a continuous function of \(a\) with the graph consisting of two lines (of slopes \(\pm 1\)) meeting at \(x_i\).
Mathematically, \(\mae\) has some problems as an error function. First, the function will not be smooth (differentiable) at points where two lines of different slopes meet. More importantly, the values that minimize \(\mae\) may occupy an entire interval, thus leaving us without a unique measure of center. The error function exercises below will show you that these pathologies can really happen. It turns out that \(\mae\) is minimized at any point in the median interval of the data set \(\bs{x}\). The proof of this result follows from a much more general result for probability distributions. Thus, the medians are the natural measures of center associated with \(\mae\) as a measure of error, in the same way that the sample mean is the measure of center associated with the \(\mse\) as a measure of error.
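The behavior of \(\mae\) on the median interval can be seen numerically. This is a small sketch using the two data vectors from the error function exercises; the evaluation points are arbitrary choices made for illustration.

```python
def mae(a, xs):
    """Mean absolute error of a as a representative of xs (divisor n - 1)."""
    n = len(xs)
    return sum(abs(x - a) for x in xs) / (n - 1)

even = [2, 1, 5, 7]   # even sample size: median interval [2, 5]
odd = [3, 5, 1]       # odd sample size: single median 3

# Inside the median interval the error is constant ...
flat = [mae(a, even) for a in (2, 3, 3.5, 5)]
# ... and it is strictly larger outside the interval.
outside = [mae(a, even) for a in (1, 1.9, 5.1, 7)]

# With an odd number of distinct points the minimizer is unique.
odd_values = {a: mae(a, odd) for a in (2.5, 3, 3.5)}
```

For the first vector every point of \([2, 5]\) gives the same error, so there is no unique measure of center; for the second vector the minimum occurs only at the median.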
Properties
In this section, we establish some essential properties of the sample variance and standard deviation. First, the following alternate formula for the sample variance is better for computational purposes, and for certain theoretical purposes as well.
The sample variance can be computed as \[ s^2 = \frac{1}{n - 1} \sum_{i=1}^n x_i^2 - \frac{n}{n - 1} m^2 \]
Proof
Note that \begin{align} \sum_{i=1}^n (x_i - m)^2 & = \sum_{i=1}^n \left(x_i^2 - 2 m x_i + m^2\right) = \sum_{i=1}^n x_i^2 - 2 m \sum_{i=1}^n x_i + \sum_{i=1}^n m^2\\ & = \sum_{i=1}^n x_i^2 - 2 n m^2 + n m^2 = \sum_{i=1}^n x_i^2 - n m^2 \end{align} Dividing by \(n - 1\) gives the result.
If we let \(\bs{x}^2 = (x_1^2, x_2^2, \ldots, x_n^2)\) denote the sample from the variable \(x^2\), then the computational formula in the previous result can be written succinctly as \[ s^2(\bs{x}) = \frac{n}{n - 1} \left[m(\bs{x}^2) - m^2(\bs{x})\right] \] The following theorem gives another computational formula for the sample variance, directly in terms of the variables and thus without the computation of an intermediate statistic.
The sample variance can be computed as \[ s^2 = \frac{1}{2 n (n - 1)} \sum_{i=1}^n \sum_{j=1}^n (x_i - x_j)^2 \]
Proof
Note that \begin{align} \frac{1}{2 n} \sum_{i=1}^n \sum_{j=1}^n (x_i - x_j)^2 & = \frac{1}{2 n} \sum_{i=1}^n \sum_{j=1}^n (x_i - m + m - x_j)^2 \\ & = \frac{1}{2 n} \sum_{i=1}^n \sum_{j=1}^n \left[(x_i - m)^2 + 2 (x_i - m)(m - x_j) + (m - x_j)^2\right] \\ & = \frac{1}{2 n} \sum_{i=1}^n \sum_{j=1}^n (x_i - m)^2 + \frac{1}{n} \sum_{i=1}^n \sum_{j=1}^n (x_i - m)(m - x_j) + \frac{1}{2 n} \sum_{i=1}^n \sum_{j=1}^n (m - x_j)^2 \\ & = \frac{1}{2} \sum_{i=1}^n (x_i - m)^2 + 0 + \frac{1}{2} \sum_{j=1}^n (m - x_j)^2 \\ & = \sum_{i=1}^n (x_i - m)^2 \end{align} The middle term vanishes because the deviations sum to 0. Dividing by \(n - 1\) gives the result.
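The three expressions for the sample variance (the definition, the formula in terms of \(\sum x_i^2\) and \(m^2\), and the pairwise-difference formula) can be compared directly. The sketch below uses exact rational arithmetic so that equality is not blurred by rounding; the function names and the data vector are illustrative assumptions.

```python
from fractions import Fraction

def var_definition(xs):
    """s^2 from the definition: sum of squared deviations over n - 1."""
    n = len(xs)
    m = Fraction(sum(xs), n)
    return sum((x - m) ** 2 for x in xs) / (n - 1)

def var_computational(xs):
    """s^2 = (1/(n-1)) sum x_i^2 - (n/(n-1)) m^2."""
    n = len(xs)
    m = Fraction(sum(xs), n)
    return Fraction(sum(x * x for x in xs), n - 1) - Fraction(n, n - 1) * m ** 2

def var_pairwise(xs):
    """s^2 = (1/(2 n (n-1))) sum over all i, j of (x_i - x_j)^2."""
    n = len(xs)
    total = sum((xi - xj) ** 2 for xi in xs for xj in xs)
    return Fraction(total, 2 * n * (n - 1))

data = [2, 1, 5, 7]   # hypothetical integer data
results = [var_definition(data), var_computational(data), var_pairwise(data)]
```

All three entries of `results` should be the same fraction, here \(91/12\).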
The sample variance is nonnegative:
 \(s^2 \ge 0\)
 \(s^2 = 0\) if and only if \(x_i = x_j\) for each \(i, \; j \in \{1, 2, \ldots, n\}\).
Proof
Part (a) is obvious. For part (b) note that if \(s^2 = 0\) then \(x_i = m\) for each \(i\). Conversely, if \(\bs{x}\) is a constant vector, then \(m\) is that same constant.
Thus, \(s^2 = 0\) if and only if the data set is constant (and then, of course, the mean is the common value).
If \(c\) is a constant then
 \(s^2(c \, \bs{x}) = c^2 \, s^2(\bs{x})\)
 \(s(c \, \bs{x}) = \left|c\right| \, s(\bs{x})\)
Proof
For part (a), recall that \(m(c \bs{x}) = c m(\bs{x})\). Hence \[ s^2(c \bs{x}) = \frac{1}{n - 1}\sum_{i=1}^n \left[c x_i - c m(\bs{x})\right]^2 = \frac{1}{n - 1} \sum_{i=1}^n c^2 \left[x_i - m(\bs{x})\right]^2 = c^2 s^2(\bs{x}) \] Part (b) follows by taking square roots, since \(\sqrt{c^2} = \left|c\right|\).
If \(\bs{c}\) is a sample of size \(n\) from a constant \(c\) then
 \(s^2(\bs{x} + \bs{c}) = s^2(\bs{x})\).
 \(s(\bs{x} + \bs{c}) = s(\bs{x})\)
Proof
Recall that \(m(\bs{x} + \bs{c}) = m(\bs{x}) + c\). Hence \[ s^2(\bs{x} + \bs{c}) = \frac{1}{n - 1} \sum_{i=1}^n \left\{(x_i + c) - \left[m(\bs{x}) + c\right]\right\}^2 = \frac{1}{n - 1} \sum_{i=1}^n \left[x_i - m(\bs{x})\right]^2 = s^2(\bs{x})\]
As a special case of these results, suppose that \(\bs{x} = (x_1, x_2, \ldots, x_n)\) is a sample of size \(n\) corresponding to a real variable \(x\), and that \(a\) and \(b\) are constants. The sample corresponding to the variable \(y = a + b x\), in our vector notation, is \(\bs{a} + b \bs{x}\). Then \(m(\bs{a} + b \bs{x}) = a + b m(\bs{x})\) and \(s(\bs{a} + b \bs{x}) = \left|b\right| s(\bs{x})\). Linear transformations of this type, when \(b \gt 0\), arise frequently when physical units are changed. In this case, the transformation is often called a location-scale transformation; \(a\) is the location parameter and \(b\) is the scale parameter. For example, if \(x\) is the length of an object in inches, then \(y = 2.54 x\) is the length of the object in centimeters. If \(x\) is the temperature of an object in degrees Fahrenheit, then \(y = \frac{5}{9}(x - 32)\) is the temperature of the object in degrees Celsius.
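As an illustration of the location-scale rules, the following sketch converts a few hypothetical Fahrenheit readings to Celsius and compares the transformed statistics with \(a + b \, m(\bs{x})\) and \(\left|b\right| s(\bs{x})\). The readings are invented for the example.

```python
import math
import statistics

fahrenheit = [95.0, 104.0, 113.0, 122.0, 131.0]   # hypothetical readings
a, b = -160 / 9, 5 / 9                            # y = a + b x = (5/9)(x - 32)
celsius = [a + b * x for x in fahrenheit]

m_x, s_x = statistics.mean(fahrenheit), statistics.stdev(fahrenheit)
m_y, s_y = statistics.mean(celsius), statistics.stdev(celsius)

# A negative scale factor reverses the data, but the spread scales by |b|.
flipped = [-2 * x for x in fahrenheit]
s_flipped = statistics.stdev(flipped)
```

Up to rounding, `m_y` should equal `a + b * m_x`, `s_y` should equal `abs(b) * s_x`, and `s_flipped` should equal `2 * s_x`.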
Now, for \(i \in \{1, 2, \ldots, n\}\), let \( z_i = (x_i - m) / s\). The number \(z_i\) is the standard score associated with \(x_i\). Note that since \(x_i\), \(m\), and \(s\) have the same physical units, the standard score \(z_i\) is dimensionless (that is, has no physical units); it measures the directed distance from the mean \(m\) to the data value \(x_i\) in standard deviations.
The sample of standard scores \(\bs{z} = (z_1, z_2, \ldots, z_n)\) has mean 0 and variance 1. That is,
 \(m(\bs{z}) = 0\)
 \(s^2(\bs{z}) = 1\)
Proof
These results follow from the scaling and translation results above. In vector notation, note that \(\bs{z} = (\bs{x} - \bs{m})/s\). Hence \(m(\bs{z}) = (m - m) / s = 0\) and \(s(\bs{z}) = s / s = 1\).
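A quick numerical sketch of the standard score result, assuming a small hypothetical data set and the standard library `statistics` module (whose `stdev` and `variance` use the divisor \(n - 1\)):

```python
import statistics

data = [2.0, 1.0, 5.0, 7.0]           # hypothetical data
m = statistics.mean(data)
s = statistics.stdev(data)            # sample standard deviation

z = [(x - m) / s for x in data]       # standard scores
z_mean = statistics.mean(z)           # should be 0 up to rounding
z_var = statistics.variance(z)        # should be 1 up to rounding
```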
Approximating the Variance
Suppose that instead of the actual data \(\bs{x}\), we have a frequency distribution corresponding to a partition with classes (intervals) \((A_1, A_2, \ldots, A_k)\), class marks (midpoints of the intervals) \((t_1, t_2, \ldots, t_k)\), and frequencies \((n_1, n_2, \ldots, n_k)\). Recall that the relative frequency of class \(A_j\) is \(p_j = n_j / n\). In this case, approximate values of the sample mean and variance are, respectively,
\begin{align} m & = \frac{1}{n} \sum_{j=1}^k n_j \, t_j = \sum_{j = 1}^k p_j \, t_j \\ s^2 & = \frac{1}{n - 1} \sum_{j=1}^k n_j (t_j - m)^2 = \frac{n}{n - 1} \sum_{j=1}^k p_j (t_j - m)^2 \end{align}
These approximations are based on the hope that the data values in each class are well represented by the class mark. In fact, these are the standard definitions of sample mean and variance for the data set in which \(t_j\) occurs \(n_j\) times for each \(j\).
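The grouped-data formulas can be sketched as follows. The class marks and frequencies are taken from the commuting distance exercise later in this section; the variable names are illustrative.

```python
import math
import statistics

marks = [1, 4, 8, 15]      # class marks t_j (midpoints of the intervals)
freqs = [6, 16, 18, 10]    # frequencies n_j

n = sum(freqs)
m = sum(nj * tj for nj, tj in zip(freqs, marks)) / n
s2 = sum(nj * (tj - m) ** 2 for nj, tj in zip(freqs, marks)) / (n - 1)
s = math.sqrt(s2)

# Equivalent view: the data set in which t_j occurs n_j times.
expanded = [tj for nj, tj in zip(freqs, marks) for _ in range(nj)]
s_expanded = statistics.stdev(expanded)
```

Here `s` and `s_expanded` should agree, which reflects the remark that the approximations are the ordinary sample statistics of the data set built from the class marks.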
Inferential Statistics
We continue our discussion of the sample variance, but now we assume that the variables are random. Thus, suppose that we have a basic random experiment, and that \(X\) is a real-valued random variable for the experiment with mean \(\mu\) and standard deviation \(\sigma\). We will need some higher order moments as well. Let \(\sigma_3 = \E\left[(X - \mu)^3\right]\) and \(\sigma_4 = \E\left[(X - \mu)^4\right]\) denote the 3rd and 4th moments about the mean. Recall that \(\sigma_3 \big/ \sigma^3 = \skw(X)\), the skewness of \(X\), and \(\sigma_4 \big/ \sigma^4 = \kur(X)\), the kurtosis of \(X\). We assume that \(\sigma_4 \lt \infty\).
We repeat the basic experiment \(n\) times to form a new, compound experiment, with a sequence of independent random variables \(\bs{X} = (X_1, X_2, \ldots, X_n)\), each with the same distribution as \(X\). In statistical terms, \(\bs{X}\) is a random sample of size \(n\) from the distribution of \(X\). All of the statistics above make sense for \(\bs{X}\), of course, but now these statistics are random variables. We will use the same notation, except for the usual convention of denoting random variables by capital letters. Finally, note that the deterministic properties and relations established above still hold.
In addition to being a measure of the center of the data \(\bs{X}\), the sample mean \[ M = \frac{1}{n} \sum_{i=1}^n X_i \] is a natural estimator of the distribution mean \(\mu\). In this section, we will derive statistics that are natural estimators of the distribution variance \(\sigma^2\). The statistics that we will derive are different, depending on whether \(\mu\) is known or unknown; for this reason, \(\mu\) is referred to as a nuisance parameter for the problem of estimating \(\sigma^2\).
A Special Sample Variance
First we will assume that \(\mu\) is known. Although this is almost always an artificial assumption, it is a nice place to start because the analysis is relatively easy and will give us insight for the standard case. A natural estimator of \(\sigma^2\) is the following statistic, which we will refer to as the special sample variance. \[ W^2 = \frac{1}{n} \sum_{i=1}^n (X_i - \mu)^2 \]
\(W^2\) is the sample mean for a random sample of size \(n\) from the distribution of \((X - \mu)^2\), and satisfies the following properties:
 \(\E\left(W^2\right) = \sigma^2\)
 \(\var\left(W^2\right) = \frac{1}{n}\left(\sigma_4 - \sigma^4\right)\)
 \(W^2 \to \sigma^2\) as \(n \to \infty\) with probability 1
 The distribution of \(\sqrt{n}\left(W^2 - \sigma^2\right) \big/ \sqrt{\sigma_4 - \sigma^4}\) converges to the standard normal distribution as \(n \to \infty\).
Proof
These results follow immediately from standard results in the section on the Law of Large Numbers and the section on the Central Limit Theorem. For part (b), note that \[\var\left[(X - \mu)^2\right] = \E\left[(X - \mu)^4\right] - \left(\E\left[(X - \mu)^2\right]\right)^2 = \sigma_4 - \sigma^4\]
In particular part (a) means that \(W^2\) is an unbiased estimator of \(\sigma^2\). From part (b), note that \(\var(W^2) \to 0\) as \(n \to \infty\); this means that \(W^2\) is a consistent estimator of \(\sigma^2\). The square root of the special sample variance is a special version of the sample standard deviation, denoted \(W\).
\(\E(W) \le \sigma\). Thus, \(W\) is a negatively biased estimator that tends to underestimate \(\sigma\).
Proof
This follows from the unbiased property and Jensen's inequality. Since \(w \mapsto \sqrt{w}\) is concave downward on \([0, \infty)\), we have \(\E(W) = \E\left(\sqrt{W^2}\right) \le \sqrt{\E\left(W^2\right)} = \sqrt{\sigma^2} = \sigma\).
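For a small discrete distribution, the properties of \(W^2\) and \(W\) can be checked exactly by enumerating every possible sample. The following is a sketch under stated assumptions: the three-point distribution, the sample size \(n = 3\), and the helper `expect` are all hypothetical choices made for illustration.

```python
import math
from fractions import Fraction
from itertools import product

# Hypothetical skewed distribution for X (not from the text).
values = [0, 1, 2]
probs = [Fraction(1, 2), Fraction(1, 4), Fraction(1, 4)]
n = 3   # sample size

mu = sum(p * v for p, v in zip(probs, values))
sigma2 = sum(p * (v - mu) ** 2 for p, v in zip(probs, values))
sigma4 = sum(p * (v - mu) ** 4 for p, v in zip(probs, values))

def expect(stat):
    """Expected value of stat(sample) over all i.i.d. samples of size n."""
    total = 0
    for idx in product(range(len(values)), repeat=n):
        p = Fraction(1)
        for i in idx:
            p *= probs[i]          # probability of this particular sample
        total += p * stat([values[i] for i in idx])
    return total

def W2(sample):
    """Special sample variance, computed with the known mean mu."""
    return sum((x - mu) ** 2 for x in sample) / n

mean_W2 = expect(W2)                                  # exact fraction
var_W2 = expect(lambda xs: W2(xs) ** 2) - mean_W2 ** 2
mean_W = expect(lambda xs: math.sqrt(W2(xs)))         # float, via square roots
```

With these definitions, `mean_W2` should equal `sigma2`, `var_W2` should equal `(sigma4 - sigma2 ** 2) / n`, and `mean_W` should fall below `math.sqrt(sigma2)`, in line with Jensen's inequality.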
Next we compute the covariance and correlation between the sample mean and the special sample variance.
The covariance and correlation of \(M\) and \(W^2\) are
 \(\cov\left(M, W^2\right) = \sigma_3 / n\).
 \(\cor\left(M, W^2\right) = \sigma_3 \big/ \sqrt{\sigma^2 (\sigma_4 - \sigma^4)}\)
Proof
 From the bilinearity of the covariance operator and by independence, \[ \cov\left(M, W^2\right) = \cov\left[\frac{1}{n}\sum_{i=1}^n X_i, \frac{1}{n} \sum_{j=1}^n (X_j - \mu)^2\right] = \frac{1}{n^2} \sum_{i=1}^n \cov\left[X_i, (X_i - \mu)^2\right] \] But \(\cov\left[X_i, (X_i - \mu)^2\right] = \cov\left[X_i - \mu, (X_i - \mu)^2\right] = \E\left[(X_i - \mu)^3\right] - \E(X_i - \mu) \E\left[(X_i - \mu)^2\right] = \sigma_3\). Substituting gives the result.
 This follows from part (a), the formula above for \(\var\left(W^2\right)\), and our previous result that \(\var(M) = \sigma^2 / n\).
Note that the correlation does not depend on the sample size, and that the sample mean and the special sample variance are uncorrelated if \(\sigma_3 = 0\) (equivalently \(\skw(X) = 0\)).
The Standard Sample Variance
Consider now the more realistic case in which \(\mu\) is unknown. In this case, a natural approach is to average, in some sense, the squared deviations \((X_i - M)^2\) over \(i \in \{1, 2, \ldots, n\}\). It might seem that we should average by dividing by \(n\). However, another approach is to divide by whatever constant would give us an unbiased estimator of \(\sigma^2\). This constant turns out to be \(n - 1\), leading to the standard sample variance: \[ S^2 = \frac{1}{n - 1} \sum_{i=1}^n (X_i - M)^2 \]
\(\E\left(S^2\right) = \sigma^2\).
Proof
By expanding (as was shown in the last section), \[ \sum_{i=1}^n (X_i - M)^2 = \sum_{i=1}^n X_i^2 - n M^2 \] Recall that \(\E(M) = \mu\) and \(\var(M) = \sigma^2 / n\). Taking expected values in the displayed equation gives \[ \E\left(\sum_{i=1}^n (X_i - M)^2\right) = \sum_{i=1}^n (\sigma^2 + \mu^2) - n \left(\frac{\sigma^2}{n} + \mu^2\right) = n (\sigma^2 + \mu^2) - n \left(\frac{\sigma^2}{n} + \mu^2\right) = (n - 1) \sigma^2 \] Dividing by \(n - 1\) gives the result.
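The unbiased property can also be illustrated by simulation. The sketch below assumes a normal population with hypothetical parameters and compares the average of \(S^2\) over many samples with the average of the alternative statistic that divides by \(n\); the seed, the parameters, and the number of replications are arbitrary choices.

```python
import random

random.seed(0)                  # fixed seed so that the run is repeatable
mu, sigma, n = 10.0, 2.0, 5     # hypothetical population and sample size
reps = 20000                    # number of simulated samples

sum_s2 = 0.0
sum_biased = 0.0
for _ in range(reps):
    xs = [random.gauss(mu, sigma) for _ in range(n)]
    m = sum(xs) / n
    ss = sum((x - m) ** 2 for x in xs)   # sum of squared deviations
    sum_s2 += ss / (n - 1)               # standard sample variance
    sum_biased += ss / n                 # divisor n instead of n - 1

mean_s2 = sum_s2 / reps          # should be near sigma^2 = 4
mean_biased = sum_biased / reps  # should be near (n - 1) / n * sigma^2 = 3.2
```

A simulation only suggests the result, of course; the proof above is what establishes that \(\E\left(S^2\right) = \sigma^2\) exactly.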
Of course, the square root of the sample variance is the sample standard deviation, denoted \(S\).
\(\E(S) \le \sigma\). Thus, \(S\) is a negatively biased estimator that tends to underestimate \(\sigma\).
Proof
The proof is exactly the same as for the special sample standard deviation \(W\).
\(S^2 \to \sigma^2\) as \(n \to \infty\) with probability 1.
Proof
This follows from the strong law of large numbers. Recall again that \[ S^2 = \frac{1}{n - 1} \sum_{i=1}^n X_i^2 - \frac{n}{n - 1} M^2 = \frac{n}{n - 1}[M(\bs{X}^2) - M^2(\bs{X})] \] But with probability 1, \(M(\bs{X}^2) \to \sigma^2 + \mu^2\) as \(n \to \infty\) and \(M^2(\bs{X}) \to \mu^2\) as \(n \to \infty\). Since \(n / (n - 1) \to 1\), the result follows.
Since \(S^2\) is an unbiased estimator of \(\sigma^2\), the variance of \(S^2\) is the mean square error, a measure of the quality of the estimator.
\(\var\left(S^2\right) = \frac{1}{n} \left( \sigma_4 - \frac{n - 3}{n - 1} \sigma^4 \right)\).
Proof
Recall from the result above that \[ S^2 = \frac{1}{2 n (n - 1)} \sum_{i=1}^n \sum_{j=1}^n (X_i - X_j)^2 \] Hence, using the bilinear property of covariance we have \[ \var(S^2) = \cov(S^2, S^2) = \frac{1}{4 n^2 (n - 1)^2} \sum_{i=1}^n \sum_{j=1}^n \sum_{k=1}^n \sum_{l=1}^n \cov[(X_i - X_j)^2, (X_k - X_l)^2] \] We compute the covariances in this sum by considering disjoint cases:
 \(\cov\left[(X_i - X_j)^2, (X_k - X_l)^2\right] = 0\) if \(i = j\) or \(k = l\), and there are \(2 n^3 - n^2\) such terms.
 \(\cov\left[(X_i - X_j)^2, (X_k - X_l)^2\right] = 0\) if \(i, j, k, l\) are distinct, and there are \(n (n - 1)(n - 2) (n - 3)\) such terms.
 \(\cov\left[(X_i - X_j)^2, (X_k - X_l)^2\right] = 2 \sigma_4 + 2 \sigma^4\) if \(i \ne j\) and \(\{k, l\} = \{i, j\}\), and there are \(2 n (n - 1)\) such terms.
 \(\cov\left[(X_i - X_j)^2, (X_k - X_l)^2\right] = \sigma_4 - \sigma^4\) if \(i \ne j\), \(k \ne l\) and \(\#(\{i, j\} \cap \{k, l\}) = 1\), and there are \(4 n (n - 1)(n - 2)\) such terms.
Substituting gives the result.
Note that \(\var(S^2) \to 0\) as \(n \to \infty\), and hence \(S^2\) is a consistent estimator of \(\sigma^2\). On the other hand, it's not surprising that the variance of the standard sample variance (where we assume that \(\mu\) is unknown) is greater than the variance of the special sample variance (in which we assume \(\mu\) is known).
\(\var\left(S^2\right) \gt \var\left(W^2\right)\).
Proof
From the formula above for the variance of \( W^2 \), the previous result for the variance of \( S^2 \), and simple algebra, \[ \var\left(S^2\right) - \var\left(W^2\right) = \frac{2}{n (n - 1)} \sigma^4 \] Note however that the difference goes to 0 as \(n \to \infty\).
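The variance formula for \(S^2\) and the gap between \(\var\left(S^2\right)\) and \(\var\left(W^2\right)\) can be checked exactly for a small discrete distribution by enumerating all samples. This is only a sketch: the distribution, the sample size \(n = 4\), and the helper names are hypothetical.

```python
from fractions import Fraction
from itertools import product

# Hypothetical three-point distribution for X (not from the text).
values = [0, 1, 2]
probs = [Fraction(1, 2), Fraction(1, 4), Fraction(1, 4)]
n = 4   # sample size

mu = sum(p * v for p, v in zip(probs, values))
sigma2 = sum(p * (v - mu) ** 2 for p, v in zip(probs, values))
sigma4 = sum(p * (v - mu) ** 4 for p, v in zip(probs, values))

def expect(stat):
    """Exact expected value of stat(sample) over all samples of size n."""
    total = Fraction(0)
    for idx in product(range(len(values)), repeat=n):
        p = Fraction(1)
        for i in idx:
            p *= probs[i]
        total += p * stat([values[i] for i in idx])
    return total

def S2(xs):
    """Standard sample variance (mean estimated by the sample mean)."""
    m = Fraction(sum(xs), n)
    return sum((x - m) ** 2 for x in xs) / (n - 1)

def W2(xs):
    """Special sample variance (mean mu assumed known)."""
    return sum((x - mu) ** 2 for x in xs) / n

var_S2 = expect(lambda xs: S2(xs) ** 2) - expect(S2) ** 2
var_W2 = expect(lambda xs: W2(xs) ** 2) - expect(W2) ** 2
formula_S2 = (sigma4 - Fraction(n - 3, n - 1) * sigma2 ** 2) / n
gap = var_S2 - var_W2   # compare with 2 sigma^4 / (n (n - 1))
```

Because the arithmetic is exact, `var_S2` should match `formula_S2` and `gap` should match `2 * sigma2 ** 2 / (n * (n - 1))` as fractions, not merely approximately.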
Next we compute the covariance between the sample mean and the sample variance.
The covariance and correlation between the sample mean and sample variance are
 \(\cov\left(M, S^2\right) = \sigma_3 / n\)
 \(\cor\left(M, S^2\right) = \frac{\sigma_3}{\sigma \sqrt{\sigma_4 - \sigma^4 (n - 3) / (n - 1)}}\)
Proof
 Recall again that \[ M = \frac{1}{n} \sum_{i=1}^n X_i, \quad S^2 = \frac{1}{2 n (n - 1)} \sum_{j=1}^n \sum_{k=1}^n (X_j - X_k)^2 \] Hence, using the bilinear property of covariance we have \[ \cov(M, S^2) = \frac{1}{2 n^2 (n - 1)} \sum_{i=1}^n \sum_{j=1}^n \sum_{k=1}^n \cov[X_i, (X_j - X_k)^2] \] We compute the covariances in this sum by considering disjoint cases:
 \(\cov\left[X_i, (X_j - X_k)^2\right] = 0\) if \(j = k\), and there are \(n^2\) such terms.
 \(\cov\left[X_i, (X_j - X_k)^2\right] = 0\) if \(i, j, k\) are distinct, and there are \(n (n - 1)(n - 2)\) such terms.
 \(\cov\left[X_i, (X_j - X_k)^2\right] = \sigma_3\) if \(j \ne k\) and \(i \in \{j, k\}\), and there are \(2 n (n - 1)\) such terms.
Substituting gives the result.
 This follows from part (a), the result above on the variance of \( S^2 \), and \(\var(M) = \sigma^2 / n\).
In particular, note that \(\cov(M, S^2) = \cov(M, W^2)\). Again, the sample mean and variance are uncorrelated if \(\sigma_3 = 0\) so that \(\skw(X) = 0\). Our last result gives the covariance and correlation between the special sample variance and the standard one. Curiously, the covariance is the same as the variance of the special sample variance.
The covariance and correlation between \(W^2\) and \(S^2\) are
 \(\cov\left(W^2, S^2\right) = (\sigma_4 - \sigma^4) / n\)
 \(\cor\left(W^2, S^2\right) = \sqrt{\frac{\sigma_4 - \sigma^4}{\sigma_4 - \sigma^4 (n - 3) / (n - 1)}}\)
Proof
 Recall again that \[ W^2 = \frac{1}{n} \sum_{i=1}^n (X_i - \mu)^2, \quad S^2 = \frac{1}{2 n (n - 1)} \sum_{j=1}^n \sum_{k=1}^n (X_j - X_k)^2\] so by the bilinear property of covariance we have \[ \cov(W^2, S^2) = \frac{1}{2 n^2 (n - 1)} \sum_{i=1}^n \sum_{j=1}^n \sum_{k=1}^n \cov[(X_i - \mu)^2, (X_j - X_k)^2] \] Once again, we compute the covariances in this sum by considering disjoint cases:
 \(\cov[(X_i - \mu)^2, (X_j - X_k)^2] = 0\) if \(j = k\), and there are \(n^2\) such terms.
 \(\cov[(X_i - \mu)^2, (X_j - X_k)^2] = 0\) if \(i, j, k\) are distinct, and there are \(n (n - 1)(n - 2)\) such terms.
 \(\cov[(X_i - \mu)^2, (X_j - X_k)^2] = \sigma_4 - \sigma^4\) if \(j \ne k\) and \(i \in \{j, k\}\), and there are \(2 n (n - 1)\) such terms.
Substituting gives the result.
 This follows from part (a) and the formulas above for the variance of \( W^2 \) and the variance of \( S^2 \).
Note that \(\cor\left(W^2, S^2\right) \to 1\) as \(n \to \infty\), not surprising since with probability 1, \(S^2 \to \sigma^2\) and \(W^2 \to \sigma^2\) as \(n \to \infty\).
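The two covariance results, \(\cov\left(M, S^2\right) = \sigma_3 / n\) and \(\cov\left(W^2, S^2\right) = (\sigma_4 - \sigma^4) / n\), can be verified exactly for a small discrete distribution. As before, this is a sketch under assumptions: the skewed three-point distribution, the sample size, and the helpers `expect` and `cov` are illustrative choices.

```python
from fractions import Fraction
from itertools import product

# Hypothetical skewed distribution for X (not from the text).
values = [0, 1, 2]
probs = [Fraction(1, 2), Fraction(1, 4), Fraction(1, 4)]
n = 3   # sample size

mu = sum(p * v for p, v in zip(probs, values))
sigma2, sigma3, sigma4 = (
    sum(p * (v - mu) ** k for p, v in zip(probs, values)) for k in (2, 3, 4)
)

def expect(stat):
    """Exact expected value of stat(sample) over all samples of size n."""
    total = Fraction(0)
    for idx in product(range(len(values)), repeat=n):
        p = Fraction(1)
        for i in idx:
            p *= probs[i]
        total += p * stat([values[i] for i in idx])
    return total

def cov(f, g):
    """Exact covariance of the statistics f and g."""
    return expect(lambda xs: f(xs) * g(xs)) - expect(f) * expect(g)

def M(xs):
    return Fraction(sum(xs), n)

def S2(xs):
    m = M(xs)
    return sum((x - m) ** 2 for x in xs) / (n - 1)

def W2(xs):
    return sum((x - mu) ** 2 for x in xs) / n

cov_M_S2 = cov(M, S2)     # compare with sigma_3 / n
cov_M_W2 = cov(M, W2)     # the text notes this equals cov(M, S^2)
cov_W2_S2 = cov(W2, S2)   # compare with (sigma_4 - sigma^4) / n
```

Choosing a skewed distribution matters here: for a symmetric one, \(\sigma_3 = 0\) and the first check would be trivially true.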
A particularly important special case occurs when the sampling distribution is normal. This case is explored in the section on Special Properties of Normal Samples.
Exercises
Basic Properties
Suppose that \(x\) is the temperature (in degrees Fahrenheit) for a certain type of electronic component after 10 hours of operation. A sample of 30 components has mean 113° and standard deviation \(18°\).
 Classify \(x\) by type and level of measurement.
 Find the sample mean and standard deviation if the temperature is converted to degrees Celsius. The transformation is \(y = \frac{5}{9}(x - 32)\).
Answer
 continuous, interval
 \(m = 45°\), \(s = 10°\)
Suppose that \(x\) is the length (in inches) of a machined part in a manufacturing process. A sample of 50 parts has mean 10.0 and standard deviation 2.0.
 Classify \(x\) by type and level of measurement.
 Find the sample mean and standard deviation if length is measured in centimeters. The transformation is \(y = 2.54 x\).
Answer
 continuous, ratio
 \(m = 25.4\), \(s = 5.08\)
Professor Moriarity has a class of 25 students in her section of Stat 101 at Enormous State University (ESU). The mean grade on the first midterm exam was 64 (out of a possible 100 points) and the standard deviation was 16. Professor Moriarity thinks the grades are a bit low and is considering various transformations for increasing the grades. In each case below give the mean and standard deviation of the transformed grades, or state that there is not enough information.
 Add 10 points to each grade, so the transformation is \(y = x + 10\).
 Multiply each grade by 1.2, so the transformation is \(z = 1.2 x\)
 Use the transformation \(w = 10 \sqrt{x}\). Note that this is a nonlinear transformation that curves the grades greatly at the low end and very little at the high end. For example, a grade of 100 is still 100, but a grade of 36 is transformed to 60.
One of the students did not study at all, and received a 10 on the midterm. Professor Moriarity considers this score to be an outlier.
 Find the mean and standard deviation if this score is omitted.
Answer
 \(m = 74\), \(s = 16\)
 \(m = 76.8\), \(s = 19.2\)
 Not enough information
 \(m = 66.25\), \(s = 11.62\)
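The last answer can be reproduced from the summary statistics alone, using the computational formula \((n - 1) s^2 = \sum_{i=1}^n x_i^2 - n m^2\) to recover the sum of squares. The function `drop_value` below is a hypothetical helper written for this check, not part of any library.

```python
import math

def drop_value(n, m, s, x):
    """Mean and standard deviation after removing one value x from a sample
    summarized by its size n, mean m, and standard deviation s (divisor n - 1)."""
    total = n * m                              # sum of the data
    sum_sq = (n - 1) * s ** 2 + n * m ** 2     # sum of the squared data
    n_new = n - 1
    m_new = (total - x) / n_new
    ss_new = sum_sq - x ** 2 - n_new * m_new ** 2   # new sum of squared deviations
    return m_new, math.sqrt(ss_new / (n_new - 1))

# 25 grades with mean 64 and standard deviation 16; remove the grade of 10.
m_new, s_new = drop_value(25, 64, 16, 10)
```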
Computational Exercises
All statistical software packages will compute means, variances and standard deviations, draw dotplots and histograms, and in general perform the numerical and graphical procedures discussed in this section. For real statistical experiments, particularly those with large data sets, the use of statistical software is essential. On the other hand, there is some value in performing the computations by hand, with small, artificial data sets, in order to master the concepts and definitions. In this subsection, do the computations and draw the graphs with minimal technological aids.
Suppose that \(x\) is the number of math courses completed by an ESU student. A sample of 10 ESU students gives the data \(\bs{x} = (3, 1, 2, 0, 2, 4, 3, 2, 1, 2)\).
 Classify \(x\) by type and level of measurement.
 Sketch the dotplot.
 Construct a table with rows corresponding to cases and columns corresponding to \(i\), \(x_i\), \(x_i - m\), and \((x_i - m)^2\). Add rows at the bottom in the \(i\) column for totals and means.
Answer
 discrete, ratio

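| \(i\) | \(x_i\) | \(x_i - m\) | \((x_i - m)^2\) |
| --- | --- | --- | --- |
| \(1\) | \(3\) | \(1\) | \(1\) |
| \(2\) | \(1\) | \(-1\) | \(1\) |
| \(3\) | \(2\) | \(0\) | \(0\) |
| \(4\) | \(0\) | \(-2\) | \(4\) |
| \(5\) | \(2\) | \(0\) | \(0\) |
| \(6\) | \(4\) | \(2\) | \(4\) |
| \(7\) | \(3\) | \(1\) | \(1\) |
| \(8\) | \(2\) | \(0\) | \(0\) |
| \(9\) | \(1\) | \(-1\) | \(1\) |
| \(10\) | \(2\) | \(0\) | \(0\) |
| Total | 20 | 0 | 12 |
| Mean | 2 | 0 | \(12/9\) |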
Suppose that a sample of size 12 from a discrete variable \(x\) has empirical density function given by \(f(-2) = 1/12\), \(f(-1) = 1/4\), \(f(0) = 1/3\), \(f(1) = 1/6\), \(f(2) = 1/6\).
 Sketch the graph of \(f\).
 Compute the sample mean and variance.
 Give the sample values, ordered from smallest to largest.
Answer
 \( m = 1/12\), \(s^2 = 203/132\)
 \((-2, -1, -1, -1, 0, 0, 0, 0, 1, 1, 2, 2)\)
The following table gives a frequency distribution for the commuting distance to the math/stat building (in miles) for a sample of ESU students.
Class  Freq  Rel Freq  Density  Cum Freq  Cum Rel Freq  Midpoint 

\((0, 2]\)  6  
\((2, 6]\)  16  
\((6, 10]\)  18  
\((10, 20]\)  10  
Total 
 Complete the table.
 Sketch the density histogram.
 Sketch the cumulative relative frequency ogive.
 Compute an approximation to the mean and standard deviation.
Answer

Class  Freq  Rel Freq  Density  Cum Freq  Cum Rel Freq  Midpoint
\((0, 2]\)  6  0.12  0.06  6  0.12  1
\((2, 6]\)  16  0.32  0.08  22  0.44  4
\((6, 10]\)  18  0.36  0.09  40  0.80  8
\((10, 20]\)  10  0.20  0.02  50  1  15
Total  50  1
 \(m = 7.28\), \(s = 4.549\)
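The table and the approximations can be reproduced as follows. This is a sketch under the usual grouped-data assumption that every observation in a class is replaced by the class midpoint; the list names are illustrative.

```python
from math import sqrt

classes = [(0, 2), (2, 6), (6, 10), (10, 20)]   # class intervals (a, b]
freq = [6, 16, 18, 10]
n = sum(freq)

rel = [k / n for k in freq]                                  # relative frequency
density = [r / (b - a) for r, (a, b) in zip(rel, classes)]   # rel freq / class width
cum = [sum(freq[: j + 1]) for j in range(len(freq))]         # cumulative frequency
cum_rel = [c / n for c in cum]                               # cumulative rel frequency
mid = [(a + b) / 2 for a, b in classes]                      # class midpoints

# Approximate each observation by its class midpoint
m = sum(k * t for k, t in zip(freq, mid)) / n
s = sqrt(sum(k * (t - m) ** 2 for k, t in zip(freq, mid)) / (n - 1))

for row in zip(classes, freq, rel, density, cum, cum_rel, mid):
    print(*row)
print(round(m, 2), round(s, 3))
```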
Error Function Exercises
In the error function app, select root mean square error. As you add points, note the shape of the graph of the error function, the value that minimizes the function, and the minimum value of the function.
In the error function app, select mean absolute error. As you add points, note the shape of the graph of the error function, the values that minimize the function, and the minimum value of the function.
Suppose that our data vector is \((2, 1, 5, 7)\). Explicitly give \(\mae\) as a piecewise function and sketch its graph. Note that
 All values of \(a \in [2, 5]\) minimize \(\mae\).
 \(\mae\) is not differentiable at \(a \in \{1, 2, 5, 7\}\).
Suppose that our data vector is \((3, 5, 1)\). Explicitly give \(\mae\) as a piecewise function and sketch its graph. Note that
 \(\mae\) is minimized at \(a = 3\).
 \(\mae\) is not differentiable at \(a \in \{1, 3, 5\}\).
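The two exercises above can be explored numerically before writing out the piecewise formulas. The sketch below assumes the usual definition \(\mae(a) = \frac{1}{n} \sum_{i=1}^n |x_i - a|\); the function and variable names are illustrative.

```python
def mae(a, data):
    """Mean absolute error of the constant a as a summary of data."""
    return sum(abs(x - a) for x in data) / len(data)

even = (2, 1, 5, 7)   # n even: every a between the two middle values minimizes
odd = (3, 5, 1)       # n odd: the unique minimizer is the middle value

for a in (1, 2, 3.5, 5, 7):
    print(a, mae(a, even))
for a in (1, 3, 5):
    print(a, mae(a, odd))
```

Evaluating on a fine grid of values of \(a\) and plotting gives the piecewise linear graph, with corners at the data points.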
Simulation Exercises
Many of the apps in this project are simulations of experiments with a basic random variable of interest. When you run the simulation, you are performing independent replications of the experiment. In most cases, the app displays the standard deviation of the distribution, both numerically in a table and graphically as the radius of the blue, horizontal bar in the graph box. When you run the simulation, the sample standard deviation is also displayed numerically in the table and graphically as the radius of the red horizontal bar in the graph box.
In the binomial coin experiment, the random variable is the number of heads. For various values of the parameters \(n\) (the number of coins) and \(p\) (the probability of heads), run the simulation 1000 times and compare the sample standard deviation to the distribution standard deviation.
In the simulation of the matching experiment, the random variable is the number of matches. For selected values of \(n\) (the number of balls), run the simulation 1000 times and compare the sample standard deviation to the distribution standard deviation.
Run the simulation of the gamma experiment 1000 times for various values of the rate parameter \(r\) and the shape parameter \(k\). Compare the sample standard deviation to the distribution standard deviation.
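For readers without access to the apps, the first of these comparisons can be imitated in a few lines. This is only a rough stand-in for the binomial coin experiment, not the app itself; the function name, the default seed, and the parameter values are assumptions made for illustration.

```python
import random
from math import sqrt
from statistics import stdev

def binomial_sd_comparison(n, p, runs=1000, seed=0):
    """Return (sample SD, distribution SD) for the number of heads in n tosses."""
    rng = random.Random(seed)
    # Each run tosses n coins and records the number of heads
    heads = [sum(rng.random() < p for _ in range(n)) for _ in range(runs)]
    return stdev(heads), sqrt(n * p * (1 - p))

sample_sd, dist_sd = binomial_sd_comparison(n=10, p=0.5)
print(sample_sd, dist_sd)
```

Here `stdev` uses the divisor \(n - 1\), so it computes the sample standard deviation \(s\) as defined in this section.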
Probability Exercises
Suppose that \(X\) has probability density function \(f(x) = 12 \, x^2 \, (1 - x)\) for \(0 \le x \le 1\). The distribution of \(X\) is a member of the beta family. Compute each of the following:
 \(\mu = \E(X)\)
 \(\sigma^2 = \var(X)\)
 \(d_3 = \E\left[(X - \mu)^3\right]\)
 \(d_4 = \E\left[(X - \mu)^4\right]\)
Answer
 \(3/5\)
 \(1/25\)
 \(-2/875\)
 \(33/8750\)
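These moments can be verified exactly. Integrating the density term by term gives \(\E(X^k) = 12 / [(k + 3)(k + 4)]\), and the central moments follow from the binomial expansion of \((X - \mu)^k\). The sketch below uses that approach; the helper names are illustrative.

```python
from fractions import Fraction
from math import comb

def raw_moment(k):
    # E(X^k) = integral of 12 x^(k+2) (1 - x) over [0, 1] = 12 / ((k+3)(k+4))
    return Fraction(12, (k + 3) * (k + 4))

def central_moment(k):
    # Binomial expansion of E[(X - mu)^k] in terms of raw moments
    mu = raw_moment(1)
    return sum(comb(k, j) * raw_moment(j) * (-mu) ** (k - j) for j in range(k + 1))

mu = raw_moment(1)
sigma2 = central_moment(2)
d3 = central_moment(3)
d4 = central_moment(4)
print(mu, sigma2, d3, d4)
```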
Suppose now that \((X_1, X_2, \ldots, X_{10})\) is a random sample of size 10 from the beta distribution in the previous problem. Find each of the following:
 \(\E(M)\)
 \(\var(M)\)
 \(\E\left(W^2\right)\)
 \(\var\left(W^2\right)\)
 \(\E\left(S^2\right)\)
 \(\var\left(S^2\right)\)
 \(\cov\left(M, W^2\right)\)
 \(\cov\left(M, S^2\right)\)
 \(\cov\left(W^2, S^2\right)\)
Answer
 \(3/5\)
 \(1/250\)
 \(1/25\)
 \(19/87\,500\)
 \(1/25\)
 \(199/787\,500\)
 \(-2/8750\)
 \(-2/8750\)
 \(19/87\,500\)
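All nine answers depend only on \(n\), \(\mu\), \(\sigma^2\), \(d_3\), and \(d_4\). The sketch below collects the standard identities, namely \(\var(M) = \sigma^2 / n\), \(\var(W^2) = (d_4 - \sigma^4) / n\), \(\var(S^2) = \frac{1}{n}\left(d_4 - \frac{n - 3}{n - 1} \sigma^4\right)\), \(\cov(M, W^2) = \cov(M, S^2) = d_3 / n\), and \(\cov(W^2, S^2) = (d_4 - \sigma^4) / n\). The function name and dictionary keys are hypothetical, chosen only for this illustration.

```python
from fractions import Fraction

def sample_statistic_moments(n, mu, sigma2, d3, d4):
    """Means, variances and covariances of M, W^2, S^2 for sample size n >= 2."""
    var_w2 = (d4 - sigma2 ** 2) / n
    var_s2 = (d4 - Fraction(n - 3, n - 1) * sigma2 ** 2) / n
    return {
        "E(M)": mu,
        "var(M)": sigma2 / n,
        "E(W^2)": sigma2,
        "var(W^2)": var_w2,
        "E(S^2)": sigma2,
        "var(S^2)": var_s2,
        "cov(M, W^2)": d3 / n,
        "cov(M, S^2)": d3 / n,
        "cov(W^2, S^2)": var_w2,
    }

# Distribution constants for the density 12 x^2 (1 - x) on [0, 1]
beta = sample_statistic_moments(
    n=10,
    mu=Fraction(3, 5),
    sigma2=Fraction(1, 25),
    d3=Fraction(-2, 875),
    d4=Fraction(33, 8750),
)
for name, value in beta.items():
    print(name, "=", value)
```

Fractions are printed in lowest terms, so for example \(-2/8750\) appears as \(-1/4375\).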
Suppose that \(X\) has probability density function \(f(x) = \lambda e^{-\lambda x}\) for \(0 \le x \lt \infty\), where \(\lambda \gt 0\) is a parameter. Thus \(X\) has the exponential distribution with rate parameter \(\lambda\). Compute each of the following:
 \(\mu = \E(X)\)
 \(\sigma^2 = \var(X)\)
 \(d_3 = \E\left[(X - \mu)^3\right]\)
 \(d_4 = \E\left[(X - \mu)^4\right]\)
Answer
 \(1/\lambda\)
 \(1/\lambda^2\)
 \(2/\lambda^3\)
 \(9/\lambda^4\)
Suppose now that \((X_1, X_2, \ldots, X_5)\) is a random sample of size 5 from the exponential distribution in the previous problem. Find each of the following:
 \(\E(M)\)
 \(\var(M)\)
 \(\E\left(W^2\right)\)
 \(\var\left(W^2\right)\)
 \(\E\left(S^2\right)\)
 \(\var\left(S^2\right)\)
 \(\cov\left(M, W^2\right)\)
 \(\cov\left(M, S^2\right)\)
 \(\cov\left(W^2, S^2\right)\)
Answer
 \(1/\lambda\)
 \(1/(5 \lambda^2)\)
 \(1/\lambda^2\)
 \(8/(5 \lambda^4)\)
 \(1/\lambda^2\)
 \(17/(10 \lambda^4)\)
 \(2/(5 \lambda^3)\)
 \(2/(5 \lambda^3)\)
 \(8/(5 \lambda^4)\)
Recall that for an ace-six flat die, faces 1 and 6 have probability \(\frac{1}{4}\) each, while faces 2, 3, 4, and 5 have probability \(\frac{1}{8}\) each. Let \(X\) denote the score when an ace-six flat die is thrown. Compute each of the following:
 \(\mu = \E(X)\)
 \(\sigma^2 = \var(X)\)
 \(d_3 = \E\left[(X - \mu)^3\right]\)
 \(d_4 = \E\left[(X - \mu)^4\right]\)
Answer
 \(7/2\)
 \(15/4\)
 \(0\)
 \(333/16\)
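For a discrete distribution such as this one, the central moments are finite sums over the probability density function, so they can be checked directly. A minimal sketch, with illustrative names:

```python
from fractions import Fraction

# Probability density function of the ace-six flat die
pmf = {1: Fraction(1, 4), 6: Fraction(1, 4)}
pmf.update({x: Fraction(1, 8) for x in (2, 3, 4, 5)})

mu = sum(x * p for x, p in pmf.items())   # mean score

def central_moment(k):
    """E[(X - mu)^k] as a sum over the six faces."""
    return sum((x - mu) ** k * p for x, p in pmf.items())

print(mu, central_moment(2), central_moment(3), central_moment(4))
```

The third central moment vanishes because the distribution is symmetric about \(7/2\).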
Suppose now that an ace-six flat die is tossed 8 times. Find each of the following:
 \(\E(M)\)
 \(\var(M)\)
 \(\E\left(W^2\right)\)
 \(\var\left(W^2\right)\)
 \(\E\left(S^2\right)\)
 \(\var\left(S^2\right)\)
 \(\cov\left(M, W^2\right)\)
 \(\cov\left(M, S^2\right)\)
 \(\cov\left(W^2, S^2\right)\)
Answer
 \(7/2\)
 \(15/32\)
 \(15/4\)
 \(27/32\)
 \(15/4\)
 \(603/448\)
 \(0\)
 \(0\)
 \(27/32\)
Data Analysis Exercises
Statistical software should be used for the problems in this subsection.
Consider the petal length and species variables in Fisher's iris data.
 Classify the variables by type and level of measurement.
 Compute the sample mean and standard deviation, and plot a density histogram for petal length.
 Compute the sample mean and standard deviation, and plot a density histogram for petal length by species.
Answer
 petal length: continuous, ratio. species: discrete, nominal
 \(m = 37.8\), \(s = 17.8\)
 \(m(0) = 14.6\), \(s(0) = 1.7\); \(m(1) = 55.5\), \(s(1) = 30.5\); \(m(2) = 43.2\), \(s(2) = 28.7\)
Consider the erosion variable in the Challenger data set.
 Classify the variable by type and level of measurement.
 Compute the mean and standard deviation
 Plot a density histogram with the classes \([0, 5)\), \([5, 40)\), \([40, 50)\), \([50, 60)\).
Answer
 continuous, ratio
 \(m = 7.7\), \(s = 17.2\)
Consider Michelson's velocity of light data.
 Classify the variable by type and level of measurement.
 Plot a density histogram.
 Compute the sample mean and standard deviation.
 Find the sample mean and standard deviation if the variable is converted to \(\text{km}/\text{sec}\). The transformation is \(y = x + 299\,000\).
Answer
 continuous, interval
 \(m = 852.4\), \(s = 79.0\)
 \(m = 299\,852.4\), \(s = 79.0\)
Consider Short's parallax of the sun data.
 Classify the variable by type and level of measurement.
 Plot a density histogram.
 Compute the sample mean and standard deviation.
 Find the sample mean and standard deviation if the variable is converted to degrees. There are 3600 seconds in a degree.
 Find the sample mean and standard deviation if the variable is converted to radians. There are \(\pi/180\) radians in a degree.
Answer
 continuous, ratio
 \(m = 8.616\), \(s = 0.749\)
 \(m = 0.00239\), \(s = 0.000208\)
 \(m = 0.0000418\), \(s = 0.00000363\)
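The conversions in the last two exercises are both of the form \(y = a + b x\), for which the sample mean transforms to \(a + b m\) and the sample standard deviation to \(|b| s\). The sketch below illustrates this rule on a small made-up sample; the values in `x` are hypothetical and are not Short's or Michelson's data.

```python
from statistics import mean, stdev

x = [8.5, 9.1, 7.8, 8.9, 8.3]   # hypothetical values, NOT from either data set

def transform(data, a, b):
    """Apply the location-scale transformation y = a + b x to each value."""
    return [a + b * v for v in data]

y = transform(x, a=0.0, b=1 / 3600)   # scale change, e.g. arc seconds to degrees
z = transform(x, a=299_000, b=1)      # pure location shift

print(mean(y), mean(x) / 3600)        # the mean scales by b
print(stdev(y), stdev(x) / 3600)      # the standard deviation scales by |b|
print(mean(z), mean(x) + 299_000)     # the mean shifts by a
print(stdev(z), stdev(x))             # the standard deviation is unchanged
```

So once \(m\) and \(s\) are known in one unit, the converted values follow without recomputing from the raw data.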
Consider Cavendish's density of the earth data.
 Classify the variable by type and level of measurement.
 Compute the sample mean and standard deviation.
 Plot a density histogram.
Answer
 continuous, ratio
 \(m = 5.448\), \(s = 0.221\)
Consider the M&M data.
 Classify the variables by type and level of measurement.
 Compute the sample mean and standard deviation for each color count variable.
 Compute the sample mean and standard deviation for the total number of candies.
 Plot a relative frequency histogram for the total number of candies.
 Compute the sample mean and standard deviation, and plot a density histogram for the net weight.
Answer
 color counts: discrete, ratio. net weight: continuous, ratio.
 \(m(r) = 9.60\), \(s(r) = 4.12\); \(m(g) = 7.40\), \(s(g) = 0.57\); \(m(bl) = 7.23\), \(s(bl) = 4.35\); \(m(o) = 6.63\), \(s(o) = 3.69\); \(m(y) = 13.77\), \(s(y) = 6.06\); \(m(br) = 12.47\), \(s(br) = 5.13\)
 \(m(n) = 57.10\), \(s(n) = 2.4\)
 \(m(w) = 49.215\), \(s(w) = 1.522\)
Consider the body weight, species, and gender variables in the Cicada data.
 Classify the variables by type and level of measurement.
 Compute the relative frequency function for species and plot the graph.
 Compute the relative frequency function for gender and plot the graph.
 Compute the sample mean and standard deviation, and plot a density histogram for body weight.
 Compute the sample mean and standard deviation, and plot a density histogram for body weight by species.
 Compute the sample mean and standard deviation, and plot a density histogram for body weight by gender.
Answer
 body weight: continuous, ratio. species: discrete, nominal. gender: discrete, nominal.
 \(f(0) = 0.423\), \(f(1) = 0.519\), \(f(2) = 0.058\)
 \(f(0) = 0.567\), \(f(1) = 0.433\)
 \(m = 0.180\), \(s = 0.059\)
 \(m(0) = 0.168\), \(s(0) = 0.054\); \(m(1) = 0.185\), \(s(1) = 0.185\); \(m(2) = 0.225\), \(s(2) = 0.107\)
 \(m(0) = 0.206\), \(s(0) = 0.052\); \(m(1) = 0.145\), \(s(1) = 0.051\)
Consider Pearson's height data.
 Classify the variables by type and level of measurement.
 Compute the sample mean and standard deviation, and plot a density histogram for the height of the father.
 Compute the sample mean and standard deviation, and plot a density histogram for the height of the son.
Answer
 continuous, ratio
 \(m(x) = 67.69\), \(s(x) = 2.75\)
 \(m(y) = 68.68\), \(s(y) = 2.82\)