# 7.1: Estimators

## The Basic Statistical Model

As usual, our starting point is a random experiment with an underlying sample space and a probability measure \(\P\). In the basic statistical model, we have an observable random variable \(\bs{X}\) taking values in a set \(S\). Recall that in general, this variable can have quite a complicated structure. For example, if the experiment is to sample \(n\) objects from a population and record various measurements of interest, then the data vector has the form \[ \bs{X} = (X_1, X_2, \ldots, X_n) \] where \(X_i\) is the vector of measurements for the \(i\)th object. The most important special case is when \((X_1, X_2, \ldots, X_n)\) are independent and identically distributed (IID). In this case \(\bs{X}\) is a random sample of size \(n\) from the distribution of an underlying measurement variable \(X\).

### Statistics

Recall also that a statistic is an observable function of the outcome variable of the random experiment: \(\bs{U} = \bs{u}(\bs{X})\) where \( \bs{u} \) is a known function from \( S \) into another set \( T \). Thus, a statistic is simply a random variable derived from the observation variable \(\bs{X}\), with the assumption that \(\bs{U}\) is also observable. As the notation indicates, \(\bs{U}\) is typically also vector-valued. Note that the original data vector \(\bs{X}\) is itself a statistic, but usually we are interested in statistics derived from \(\bs{X}\). A statistic \(\bs{U}\) may be computed to answer an inferential question. In this context, if the dimension of \(\bs{U}\) (as a vector) is smaller than the dimension of \(\bs{X}\) (as is usually the case), then we have achieved data reduction. Ideally, we would like to achieve significant data reduction with no loss of information about the inferential question at hand.

### Parameters

In the technical sense, a parameter \(\bs{\theta}\) is a function of the *distribution* of \(\bs{X}\), taking values in a parameter space \(T\). Typically, the distribution of \(\bs{X}\) will have \(k \in \N_+\) real parameters of interest, so that \(\bs{\theta}\) has the form \(\bs{\theta} = (\theta_1, \theta_2, \ldots, \theta_k)\) and thus \(T \subseteq \R^k\). In many cases, one or more of the parameters are unknown, and must be estimated from the data variable \(\bs{X}\). This is one of the most important and basic of all statistical problems, and is the subject of this chapter. If \( \bs{U} \) is a statistic, then the distribution of \( \bs{U} \) will depend on the parameters of \( \bs{X} \), and thus so will distributional constructs such as means, variances, covariances, probability density functions and so forth. We usually suppress this dependence notationally to keep our mathematical expressions from becoming too unwieldy, but it's very important to realize that the underlying dependence is present. Remember that the critical idea is that by observing a value \( \bs{u} \) of a statistic \( \bs{U} \) we (hopefully) gain information about the unknown parameters.

### Estimators

Suppose now that we have an unknown real parameter \(\theta\) taking values in a parameter space \(T \subseteq \R\). A real-valued statistic \(U = u(\bs{X})\) that is used to estimate \(\theta\) is called, appropriately enough, an estimator of \(\theta\). Thus, the estimator is a random variable and hence has a distribution, a mean, a variance, and so on (all of which, as noted above, will generally depend on \( \theta \)). When we actually run the experiment and observe the data \(\bs{x}\), the observed value \(u = u(\bs{x})\) (a single number) is the estimate of the parameter \(\theta\). The following definitions are basic.

Suppose that \( U \) is a statistic used as an estimator of a parameter \( \theta \) with values in \( T \subseteq \R \). For \( \theta \in T \),

- \( U - \theta \) is the error.
- \(\bias(U) = \E(U - \theta) = \E(U) - \theta \) is the bias of \( U \).
- \(\mse(U) = \E\left[(U - \theta)^2\right] \) is the mean square error of \( U \).

Thus the error is the difference between the estimator and the parameter being estimated, so of course the error is a random variable. The bias of \( U \) is simply the expected error, and the mean square error (the name says it all) is the expected square of the error. Note that bias and mean square error are functions of \( \theta \in T \). The following definitions are a natural complement to the definition of bias.

Suppose again that \( U \) is a statistic used as an estimator of a parameter \( \theta \) with values in \( T \subseteq \R \).

- \(U\) is unbiased if \(\bias(U) = 0\), or equivalently \(\E(U) = \theta\), for all \(\theta \in T\).
- \(U\) is negatively biased if \(\bias(U) \le 0\), or equivalently \(\E(U) \le \theta\), for all \(\theta \in T\).
- \(U\) is positively biased if \(\bias(U) \ge 0\), or equivalently \(\E(U) \ge \theta\), for all \(\theta \in T\).

Thus, for an unbiased estimator, the expected value of the estimator is the parameter being estimated, clearly a desirable property. On the other hand, a positively biased estimator overestimates the parameter, on average, while a negatively biased estimator underestimates the parameter on average. Our definitions of negative and positive bias are *weak* in the sense that the weak inequalities \(\le\) and \(\ge\) are used. There are corresponding strong definitions, of course, using the strong inequalities \(\lt\) and \(\gt\). Note, however, that none of these definitions may apply. For example, it might be the case that \(\bias(U) \lt 0\) for some \(\theta \in T\), \(\bias(U) = 0\) for other \(\theta \in T\), and \(\bias(U) \gt 0\) for yet other \(\theta \in T\).

\(\mse(U) = \var(U) + \bias^2(U)\)

## Proof

This follows from basic properties of expected value and variance: \[ \E[(U - \theta)^2] = \var(U - \theta) + [\E(U - \theta)]^2 = \var(U) + \bias^2(U) \]

In particular, if the estimator is unbiased, then the mean square error of \(U\) is simply the variance of \(U\).
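The decomposition can also be checked numerically. The following is a minimal sketch, not part of the original text: the exponential sampling distribution, the parameter value, the sample size, and the shrunk estimator \(V = \frac{n}{n+1} U\) are all assumptions chosen for illustration. The empirical variance uses divisor equal to the number of replications, so that the identity holds exactly for the simulated values, up to rounding.

```python
import random

def bias_var_mse(estimates, theta):
    """Empirical bias, variance, and mean square error of simulated estimates."""
    k = len(estimates)
    mean = sum(estimates) / k
    bias = mean - theta
    var = sum((u - mean) ** 2 for u in estimates) / k   # divisor k, not k - 1
    mse = sum((u - theta) ** 2 for u in estimates) / k
    return bias, var, mse

rng = random.Random(0)           # fixed seed, illustrative only
theta, n, reps = 2.0, 5, 20000   # hypothetical settings

# U is the sample mean of n exponential observations with mean theta (unbiased).
# V = n / (n + 1) * U is a deliberately shrunk, biased multiple of U.
u_vals = [sum(rng.expovariate(1 / theta) for _ in range(n)) / n
          for _ in range(reps)]
v_vals = [n / (n + 1) * u for u in u_vals]

for name, vals in (("U", u_vals), ("V", v_vals)):
    bias, var, mse = bias_var_mse(vals, theta)
    print(f"{name}: bias={bias:.4f} var={var:.4f} "
          f"mse={mse:.4f} var+bias^2={var + bias ** 2:.4f}")
```

Under these assumptions the biased estimator \(V\) should show the smaller empirical mean square error, even though \(U\) is the unbiased one.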

Ideally, we would like to have unbiased estimators with small mean square error. However, this is not always possible, and the decomposition \(\mse(U) = \var(U) + \bias^2(U)\) above shows the delicate relationship between bias and mean square error. In the next section we will see an example with two estimators of a parameter that are multiples of each other; one is unbiased, but the other has smaller mean square error. However, if we have two unbiased estimators of \(\theta\), we naturally prefer the one with the smaller variance (mean square error).

Suppose that \( U \) and \( V \) are unbiased estimators of a parameter \( \theta \) with values in \( T \subseteq \R \).

- \( U \) is more efficient than \( V \) if \( \var(U) \le \var(V) \).
- The relative efficiency of \(U\) with respect to \(V\) is \[ \eff(U, V) = \frac{\var(V)}{\var(U)} \]

### Asymptotic Properties

Suppose again that we have a real parameter \( \theta \) with possible values in a parameter space \( T \). Often in a statistical experiment, we observe an infinite sequence of random variables over time, \(\bs{X} = (X_1, X_2, \ldots)\), so that at time \( n \) we have observed \( \bs{X}_n = (X_1, X_2, \ldots, X_n) \). In this setting we often have a general formula that defines an estimator of \(\theta\) for each sample size \(n\). Technically, this gives a *sequence* of real-valued estimators of \(\theta\): \( \bs{U} = (U_1, U_2, \ldots) \) where \( U_n \) is a real-valued function of \( \bs{X}_n \) for each \( n \in \N_+ \). In this case, we can discuss the asymptotic properties of the estimators as \(n \to \infty\). Most of the definitions are natural generalizations of the ones above.

The sequence of estimators \(\bs{U} = (U_1, U_2, \ldots)\) is asymptotically unbiased if \( \bias(U_n) \to 0\) as \(n \to \infty\) for every \(\theta \in T \), or equivalently, \(\E(U_n) \to \theta\) as \(n \to \infty\) for every \(\theta \in T\).

Suppose that \(\bs{U} = (U_1, U_2, \ldots)\) and \(\bs{V} = (V_1, V_2, \ldots)\) are two sequences of estimators that are asymptotically unbiased. The asymptotic relative efficiency of \(\bs{U}\) to \(\bs{V}\) is \[ \lim_{n \to \infty} \eff(U_n, V_n) = \lim_{n \to \infty} \frac{\var(V_n)}{\var(U_n)} \] assuming that the limit exists.

Naturally, we expect our estimators to improve as the sample size \(n\) increases, and in some sense to converge to the parameter as \( n \to \infty \). This general idea is known as *consistency*. Once again, for the remainder of this discussion, we assume that \(\bs{U} = (U_1, U_2, \ldots)\) is a sequence of estimators for a real-valued parameter \( \theta \), with values in the parameter space \( T \).

Consistency

- \( \bs{U} \) is consistent if \(U_n \to \theta\) as \(n \to \infty\) in probability for each \(\theta \in T\). That is, \( \P\left(\left|U_n - \theta\right| \gt \epsilon\right) \to 0\) as \(n \to \infty\) for every \(\epsilon \gt 0\) and \(\theta \in T\).
- \( \bs{U} \) is mean-square consistent if \( \mse(U_n) = \E[(U_n - \theta)^2] \to 0 \) as \( n \to \infty \) for each \( \theta \in T \).

Here is the connection between the two definitions:

If \( \bs{U} \) is mean-square consistent then \(\bs{U}\) is consistent.

## Proof

From Markov's inequality, \[ \P\left(\left|U_n - \theta\right| \gt \epsilon\right) = \P\left[(U_n - \theta)^2 \gt \epsilon^2\right] \le \frac{\E\left[(U_n - \theta)^2\right]}{\epsilon^2} \to 0 \text{ as } n \to \infty \]

That mean-square consistency implies simple consistency is simply a statistical version of the theorem that states that mean-square convergence implies convergence in probability. Here is another nice consequence of mean-square consistency.

If \( \bs{U} \) is mean-square consistent then \( \bs{U} \) is asymptotically unbiased.

## Proof

This result follows from the fact that mean absolute error is no larger than root mean square error, which in turn is a special case of a general result for norms. See the advanced section on vector spaces for more details. So, using this result and the ordinary triangle inequality for expected value we have \[ |\E(U_n - \theta)| \le \E(|U_n - \theta|) \le \sqrt{\E[(U_n - \theta)^2]} \to 0 \text{ as } n \to \infty \] Hence \( \E(U_n) \to \theta \) as \( n \to \infty \) for \( \theta \in T \).

In the next several subsections, we will review several basic estimation problems that were studied in the chapter on Random Samples.

## Estimation in the Single Variable Model

Suppose that \( X \) is a basic real-valued random variable for an experiment, with mean \( \mu \in \R\) and variance \( \sigma^2 \in (0, \infty) \). We sample from the distribution of \( X \) to produce a sequence \(\bs{X} = (X_1, X_2, \ldots)\) of independent variables, each with the distribution of \( X \). For each \( n \in \N_+ \), \( \bs{X}_n = (X_1, X_2, \ldots, X_n) \) is a random sample of size \(n\) from the distribution of \(X\).

### Estimating the Mean

This subsection is a review of some results obtained in the section on the Law of Large Numbers in the chapter on Random Samples. Recall that a natural estimator of the distribution mean \(\mu\) is the sample mean, defined by \[ M_n = \frac{1}{n} \sum_{i=1}^n X_i, \quad n \in \N_+ \]

Properties of \( \bs M = (M_1, M_2, \ldots) \) as a sequence of estimators of \( \mu \).

- \(\E(M_n) = \mu\) so \(M_n\) is unbiased for \( n \in \N_+ \)
- \(\var(M_n) = \sigma^2 / n\) for \( n \in \N_+ \) so \( \bs M \) is consistent.
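Since \(M_n\) is unbiased, \(\mse(M_n) = \sigma^2 / n\), and Markov's inequality applied to the squared error gives \(\P(|M_n - \mu| \gt \epsilon) \le \sigma^2 / (n \epsilon^2)\). The sketch below compares this bound with simulated tail probabilities. The normal sampling distribution, the value of \(\epsilon\), the sample sizes, and the helper name `tail_probability` are assumptions made for illustration only.

```python
import random

def tail_probability(n, eps, reps, rng, mu=0.0, sigma=1.0):
    """Estimate P(|M_n - mu| > eps) for normal samples by simulation."""
    exceed = 0
    for _ in range(reps):
        m = sum(rng.gauss(mu, sigma) for _ in range(n)) / n
        if abs(m - mu) > eps:
            exceed += 1
    return exceed / reps

rng = random.Random(1)   # fixed seed, illustrative only
eps, sigma = 0.5, 1.0    # hypothetical settings
for n in (5, 20, 80):
    p_hat = tail_probability(n, eps, reps=5000, rng=rng, sigma=sigma)
    bound = sigma ** 2 / (n * eps ** 2)   # mse(M_n) / eps^2
    print(f"n={n:3d}  empirical={p_hat:.4f}  bound={min(bound, 1.0):.4f}")
```

The bound is usually far from tight, but it is enough to force the tail probability to 0 as \(n \to \infty\).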

The consistency of \(\bs M\) is simply the weak law of large numbers. Moreover, there are a number of important special cases of the properties of the sample mean above. See the section on Sample Mean for the details.

Special cases of the sample mean

- Suppose that \(X = \bs{1}_A\), the indicator variable for an event \(A\) that has probability \(\P(A)\). Then the sample mean for a random sample of size \( n \in \N_+ \) from the distribution of \( X \) is the relative frequency or empirical probability of \(A\), denoted \(P_n(A)\). Hence \(P_n(A)\) is an unbiased estimator of \( \P(A) \) for \( n \in \N_+ \) and \( (P_n(A): n \in \N_+) \) is consistent.
- Suppose that \(F\) denotes the distribution function of a real-valued random variable \(Y\). Then for fixed \(y \in \R\), the empirical distribution function \(F_n(y)\) is simply the sample mean for a random sample of size \(n \in \N_+\) from the distribution of the indicator variable \(X = \bs{1}(Y \le y)\). Hence \(F_n(y)\) is an unbiased estimator of \( F(y) \) for \( n \in \N_+ \) and \( (F_n(y): n \in \N_+) \) is consistent.
- Suppose that \(U\) is a random variable with a discrete distribution on a countable set \(S\) and \(f\) denotes the probability density function of \(U\). Then for fixed \(u \in S\), the empirical probability density function \(f_n(u)\) is simply the sample mean for a random sample of size \(n \in \N_+\) from the distribution of the indicator variable \(X = \bs{1}(U = u)\). Hence \(f_n(u)\) is an unbiased estimator of \( f(u) \) for \( n \in \N_+ \) and \( (f_n(u): n \in \N_+) \) is consistent.
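Each of these special cases is just an average of indicator values, as the following small sketch shows. The function names and the data are made up for illustration.

```python
def empirical_cdf(sample, y):
    """F_n(y): the sample mean of the indicators 1(Y_i <= y)."""
    return sum(1 for v in sample if v <= y) / len(sample)

def empirical_pdf(sample, u):
    """f_n(u): the sample mean of the indicators 1(U_i = u), for discrete data."""
    return sum(1 for v in sample if v == u) / len(sample)

data = [3, 1, 4, 1, 5, 9, 2, 6]   # made-up observations
print(empirical_cdf(data, 4))      # fraction of observations <= 4
print(empirical_pdf(data, 1))      # fraction of observations equal to 1
```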

### Estimating the Variance

This subsection is a review of some results obtained in the section on the Sample Variance in the chapter on Random Samples. We also assume that the fourth central moment \(\sigma_4 = \E\left[(X - \mu)^4\right]\) is finite. Recall that \(\sigma_4 / \sigma^4\) is the kurtosis of \(X\). Recall first that if \(\mu\) is known (almost always an artificial assumption), then a natural estimator of \(\sigma^2\) is a special version of the sample variance, defined by \[ W_n^2 = \frac{1}{n} \sum_{i=1}^n (X_i - \mu)^2, \quad n \in \N_+ \]

Properties of \( \bs W^2 = (W_1^2, W_2^2, \ldots) \) as a sequence of estimators of \( \sigma^2 \).

- \(\E\left(W_n^2\right) = \sigma^2\) so \(W_n^2\) is unbiased for \( n \in \N_+ \)
- \(\var\left(W_n^2\right) = \frac{1}{n}(\sigma_4 - \sigma^4)\) for \( n \in \N_+ \) so \(\bs W^2\) is consistent.

## Proof

\( \bs W^2 \) corresponds to sampling from the distribution of \( (X - \mu)^2 \). This distribution has mean \( \sigma^2 \) and variance \( \sigma_4 - \sigma^4 \), so the results follow immediately from the properties of the sample mean above.

If \(\mu\) is unknown (the more reasonable assumption), then a natural estimator of the distribution variance is the standard version of the sample variance, defined by \[ S_n^2 = \frac{1}{n - 1} \sum_{i=1}^n (X_i - M_n)^2, \quad n \in \{2, 3, \ldots\} \]

Properties of \( \bs S^2 = (S_2^2, S_3^2, \ldots) \) as a sequence of estimators of \( \sigma^2 \)

- \(\E\left(S_n^2\right) = \sigma^2\) so \(S_n^2\) is unbiased for \( n \in \{2, 3, \ldots\} \)
- \(\var\left(S_n^2\right) = \frac{1}{n} \left(\sigma_4 - \frac{n - 3}{n - 1} \sigma^4 \right)\) for \( n \in \{2, 3, \ldots\} \) so \(\bs S^2\) is consistent.

Naturally, we would like to compare the sequences \( \bs W^2 \) and \( \bs S^2 \) as estimators of \( \sigma^2 \). But again remember that \( \bs W^2 \) only makes sense if \( \mu \) is known.

Comparison of \( \bs W^2 \) and \( \bs S^2 \)

- \(\var\left(W_n^2\right) \lt \var(S_n^2)\) for \( n \in \{2, 3, \ldots\} \).
- The asymptotic relative efficiency of \(\bs W^2\) to \(\bs S^2\) is 1.
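Both parts can be read off the two variance formulas above. As a quick check, subtracting gives \[ \var\left(S_n^2\right) - \var\left(W_n^2\right) = \frac{1}{n}\left(1 - \frac{n - 3}{n - 1}\right) \sigma^4 = \frac{2 \sigma^4}{n (n - 1)} \gt 0 \] and, assuming \(\sigma_4 \gt \sigma^4\) so that the ratio is well defined, \[ \frac{\var\left(S_n^2\right)}{\var\left(W_n^2\right)} = \frac{\sigma_4 - \frac{n - 3}{n - 1} \sigma^4}{\sigma_4 - \sigma^4} \to 1 \text{ as } n \to \infty \]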

So by (a) \(W_n^2\) is better than \(S_n^2\) for \( n \in \{2, 3, \ldots\} \), assuming that \(\mu\) is known so that we can actually *use* \(W_n^2\). This is perhaps not surprising, but by (b) \(S_n^2\) works just about as well as \(W_n^2\) for a large sample size \( n \). Of course, the sample standard deviation \(S_n\) is a natural estimator of the distribution standard deviation \(\sigma\). Unfortunately, this estimator is biased. Here is a more general result:

Suppose that \( \theta \) is a parameter with possible values in \(T \subseteq (0, \infty) \) (with at least two points) and that \( U \) is a statistic with values in \( T \). If \( U^2 \) is an unbiased estimator of \( \theta^2 \) then \( U \) is a negatively biased estimator of \( \theta \).

## Proof

Note that \[ \var(U) = \E(U^2) - [\E(U)]^2 = \theta^2 - [\E(U)]^2, \quad \theta \in T \] Since \( T \) has at least two points, \( U \) cannot be deterministic so \( \var(U) \gt 0 \). It follows that \( [\E(U)]^2 \lt \theta^2 \) so \( \E(U) \lt \theta \) for \( \theta \in T \).

Thus, we should not be too obsessed with the unbiased property. For most sampling distributions, there will be no statistic \(U\) with the property that \(U\) is an unbiased estimator of \(\sigma\) and \(U^2\) is an unbiased estimator of \(\sigma^2\).
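A small simulation makes the negative bias of \(S_n\) visible. This is a sketch only: the normal sampling distribution, the value of \(\sigma\), the sample size, and the number of replications are assumptions chosen for illustration.

```python
import math
import random

def sample_variance(xs):
    """Standard sample variance S_n^2 with divisor n - 1."""
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)

rng = random.Random(3)           # fixed seed, illustrative only
sigma, n, reps = 2.0, 5, 20000   # hypothetical settings

s2_vals = [sample_variance([rng.gauss(0.0, sigma) for _ in range(n)])
           for _ in range(reps)]
mean_s2 = sum(s2_vals) / reps
mean_s = sum(math.sqrt(v) for v in s2_vals) / reps

print(f"average S^2 = {mean_s2:.3f}  (sigma^2 = {sigma ** 2})")
print(f"average S   = {mean_s:.3f}  (sigma   = {sigma})")
```

The average of \(S_n^2\) should land close to \(\sigma^2\), while the average of \(S_n\) should fall below \(\sigma\).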

## Estimation in the Bivariate Model

In this subsection we review some of the results obtained in the section on Correlation and Regression in the chapter on Random Samples.

Suppose that \( X \) and \( Y \) are real-valued random variables for an experiment, so that \( (X, Y) \) has a bivariate distribution in \( \R^2 \). Let \( \mu = \E(X)\) and \( \sigma^2 = \var(X) \) denote the mean and variance of \( X \), and let \( \nu = \E(Y) \) and \( \tau^2 = \var(Y) \) denote the mean and variance of \( Y \). For the bivariate parameters, let \( \delta = \cov(X, Y) \) denote the distribution covariance and \( \rho = \cor(X, Y) \) the distribution correlation. We need one higher-order moment as well: let \( \delta_2 = \E\left[(X - \mu)^2 (Y - \nu)^2\right] \), and as usual, we assume that all of the parameters exist. So the general parameter spaces are \( \mu, \, \nu \in \R \), \( \sigma^2, \, \tau^2 \in (0, \infty) \), \( \delta \in \R \), and \( \rho \in [-1, 1] \). Suppose now that we sample from the distribution of \( (X, Y) \) to generate a sequence of independent variables \(\left((X_1, Y_1), (X_2, Y_2), \ldots\right)\), each with the distribution of \( (X, Y) \). As usual, we will let \(\bs{X}_n = (X_1, X_2, \ldots, X_n)\) and \(\bs{Y}_n = (Y_1, Y_2, \ldots, Y_n)\); these are random samples of size \(n\) from the distributions of \(X\) and \(Y\), respectively.

Since we now have two underlying variables, we need to enhance our notation somewhat. It will help to define the deterministic versions of our statistics. So if \( \bs x = (x_1, x_2, \ldots) \) and \( \bs y = (y_1, y_2, \ldots) \) are sequences of real numbers and \( n \in \N_+ \), we define the mean and special covariance functions by \begin{align*} m_n(\bs x) & = \frac{1}{n} \sum_{i=1}^n x_i \\ w_n(\bs x, \bs y) & = \frac{1}{n} \sum_{i=1}^n (x_i - \mu)(y_i - \nu) \end{align*} If \( n \in \{2, 3, \ldots\} \) we define the variance and standard covariance functions by \begin{align*} s_n^2(\bs x) & = \frac{1}{n - 1} \sum_{i=1}^n [x_i - m_n(\bs x)]^2 \\ s_n(\bs x, \bs y) & = \frac{1}{n - 1} \sum_{i=1}^n [x_i - m_n(\bs x)][y_i - m_n(\bs y)] \end{align*} It should be clear from context whether we are using the one argument or two argument version of \( s_n \). On this point, note that \( s_n(\bs x, \bs x) = s_n^2(\bs x)\).

### Estimating the Covariance

If \(\mu\) and \(\nu\) are known (almost always an artificial assumption), then a natural estimator of the distribution covariance \(\delta\) is a special version of the sample covariance, defined by \[ W_n = w_n\left(\bs{X}, \bs{Y}\right) = \frac{1}{n} \sum_{i=1}^n (X_i - \mu)(Y_i - \nu), \quad n \in \N_+\]

Properties of \( \bs W = (W_1, W_2, \ldots) \) as a sequence of estimators of \( \delta \).

- \(\E\left(W_n\right) = \delta\) so \(W_n\) is unbiased for \( n \in \N_+ \).
- \( \var\left(W_n\right) = \frac{1}{n}(\delta_2 - \delta^2) \) for \( n \in \N_+ \) so \(\bs W\) is consistent.

## Proof

We've done this proof before, but it's so basic that it's worth repeating. Note that \( \bs W \) corresponds to sampling from the distribution of \( (X - \mu) (Y - \nu) \). This distribution has mean \( \delta \) and variance \( \delta_2 - \delta^2 \), so the results follow immediately from the properties of the sample mean above.

If \(\mu\) and \(\nu\) are unknown (usually the more reasonable assumption), then a natural estimator of the distribution covariance \(\delta\) is the standard version of the sample covariance, defined by \[ S_n = s_n(\bs X , \bs Y) = \frac{1}{n - 1} \sum_{i=1}^n [X_i - m_n(\bs X)][Y_i - m_n(\bs Y)], \quad n \in \{2, 3, \ldots\}\]

Properties of \( \bs S = (S_2, S_3, \ldots) \) as a sequence of estimators of \( \delta \).

- \(\E\left(S_n\right) = \delta\) so \( S_n \) is unbiased for \( n \in \{2, 3, \ldots\} \).
- \( \var\left(S_n\right) = \frac{1}{n}\left(\delta_2 + \frac{1}{n - 1} \sigma^2 \tau^2 - \frac{n - 2}{n - 1} \delta^2\right) \) for \( n \in \{2, 3, \ldots\} \) so \(\bs S\) is consistent.

Once again, since we have two competing sequences of estimators of \( \delta \), we would like to compare them.

Comparison of \(\bs W\) and \(\bs S\) as estimators of \(\delta\):

- \(\var\left(W_n\right) \lt \var\left(S_n\right)\) for \( n \in \{2, 3, \ldots\} \).
- The asymptotic relative efficiency of \(\bs W\) to \(\bs S\) is 1.

Thus, \(W_n\) is better than \(S_n\) for \( n \in \{2, 3, \ldots\} \), assuming that \(\mu\) and \( \nu \) are known so that we can actually *use* \(W_n\). But for large \( n \), \(S_n\) works just about as well as \(W_n\).
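Both parts of the comparison follow from the two variance formulas above. As a quick check, subtracting gives \[ \var\left(S_n\right) - \var\left(W_n\right) = \frac{1}{n}\left(\frac{\sigma^2 \tau^2}{n - 1} + \frac{\delta^2}{n - 1}\right) = \frac{\sigma^2 \tau^2 + \delta^2}{n (n - 1)} \gt 0 \] and, assuming \(\delta_2 \gt \delta^2\) so that the ratio is well defined, \[ \frac{\var\left(S_n\right)}{\var\left(W_n\right)} = \frac{\delta_2 + \frac{1}{n - 1} \sigma^2 \tau^2 - \frac{n - 2}{n - 1} \delta^2}{\delta_2 - \delta^2} \to 1 \text{ as } n \to \infty \]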

### Estimating the Correlation

A natural estimator of the distribution correlation \(\rho\) is the sample correlation \[ R_n = \frac{s_n (\bs X, \bs Y)}{s_n(\bs X) s_n(\bs Y)}, \quad n \in \{2, 3, \ldots\} \] Note that this statistic is a nonlinear function of the sample covariance and the two sample standard deviations. For most distributions of \((X, Y)\), we have no hope of computing the bias or mean square error of this estimator. If we *could* compute the expected value, we would probably find that the estimator is biased. On the other hand, even though we cannot compute the mean square error, a simple application of the law of large numbers shows that \(R_n \to \rho\) as \(n \to \infty\) with probability 1. Thus, \( \bs R = (R_2, R_3, \ldots) \) is at least consistent.

### Estimating the Regression Coefficients

Recall that the distribution regression line, with \(X\) as the predictor variable and \(Y\) as the response variable, is \(y = a + b \, x\) where \[ a = \E(Y) - \frac{\cov(X, Y)}{\var(X)} \E(X), \quad b = \frac{\cov(X, Y)}{\var(X)} \] On the other hand, the sample regression line, based on the sample of size \( n \in \{2, 3, \ldots\} \), is \(y = A_n + B_n x\) where \[ A_n = m_n(\bs Y) - \frac{s_n(\bs X, \bs Y)}{s_n^2(\bs X )} m_n(\bs X), \quad B_n = \frac{s_n(\bs X, \bs Y)}{s_n^2(\bs X)} \] Of course, the statistics \(A_n\) and \(B_n\) are natural estimators of the parameters \(a\) and \(b\), respectively, and in a sense are derived from our previous estimators of the distribution mean, variance, and covariance. Once again, for most distributions of \((X, Y)\), it would be difficult to compute the bias and mean square errors of these estimators. But applications of the law of large numbers show that with probability 1, \( A_n \to a \) and \( B_n \to b \) as \( n \to \infty \), so at least \( \bs A = (A_2, A_3, \ldots) \) and \( \bs B = (B_2, B_3, \ldots) \) are consistent.
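The bivariate statistics of this section translate directly into code. The sketch below follows the deterministic functions \(m_n\), \(s_n\), and the formulas for \(R_n\), \(A_n\), and \(B_n\) given above. The Python function names and the data are made up for illustration.

```python
import math

def mean(xs):
    """m_n(x): the sample mean."""
    return sum(xs) / len(xs)

def sample_cov(xs, ys):
    """s_n(x, y) with divisor n - 1; sample_cov(xs, xs) is the sample variance."""
    mx, my = mean(xs), mean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (len(xs) - 1)

def sample_corr(xs, ys):
    """R_n: sample covariance divided by the two sample standard deviations."""
    return sample_cov(xs, ys) / math.sqrt(sample_cov(xs, xs) * sample_cov(ys, ys))

def regression_line(xs, ys):
    """Return (A_n, B_n), the intercept and slope of the sample regression line."""
    b = sample_cov(xs, ys) / sample_cov(xs, xs)
    a = mean(ys) - b * mean(xs)
    return a, b

xs = [1.0, 2.0, 3.0, 4.0]   # made-up data
ys = [2.0, 4.0, 5.0, 8.0]
a, b = regression_line(xs, ys)
print(f"covariance={sample_cov(xs, ys):.4f} correlation={sample_corr(xs, ys):.4f}")
print(f"intercept={a:.4f} slope={b:.4f}")
```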

## Exercises and Special Cases

### The Poisson Distribution

Let's consider a simple example that illustrates some of the ideas above. Recall that the Poisson distribution with parameter \(\lambda \in (0, \infty)\) has probability density function \(g\) given by \[ g(x) = e^{-\lambda} \frac{\lambda^x}{x!}, \quad x \in \N \] The Poisson distribution is often used to model the number of random points in a region of time or space, and is studied in more detail in the chapter on the Poisson process. The parameter \(\lambda\) is proportional to the size of the region of time or space; the proportionality constant is the average rate of the random points. The distribution is named for Simeon Poisson.

Suppose that \(X\) has the Poisson distribution with parameter \(\lambda\). Then

- \(\mu = \E(X) = \lambda\)
- \(\sigma^2 = \var(X) = \lambda\)
- \(\sigma_4 = \E\left[(X - \lambda)^4\right] = 3 \lambda^2 + \lambda\)

## Proof

Recall the permutation notation \( x^{(n)} = x (x - 1) \cdots (x - n + 1) \) for \( x \in \R \) and \( n \in \N \). The expected value \( \E[X^{(n)}] \) is the factorial moment of \( X \) of order \( n \). It's easy to see that the factorial moments are \( \E\left[X^{(n)}\right] = \lambda^n \) for \( n \in \N \). The results follow from this.
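One way to fill in the last step is to write ordinary powers in terms of permutation powers: \[ x^2 = x^{(2)} + x^{(1)}, \quad x^3 = x^{(3)} + 3 x^{(2)} + x^{(1)}, \quad x^4 = x^{(4)} + 6 x^{(3)} + 7 x^{(2)} + x^{(1)} \] Taking expected values gives \( \E(X) = \lambda \), \( \E(X^2) = \lambda^2 + \lambda \), \( \E(X^3) = \lambda^3 + 3 \lambda^2 + \lambda \), and \( \E(X^4) = \lambda^4 + 6 \lambda^3 + 7 \lambda^2 + \lambda \). Hence \( \var(X) = \E(X^2) - \lambda^2 = \lambda \), and expanding the fourth central moment, \[ \E\left[(X - \lambda)^4\right] = \E(X^4) - 4 \lambda \E(X^3) + 6 \lambda^2 \E(X^2) - 3 \lambda^4 = 3 \lambda^2 + \lambda \]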

Suppose now that we sample from the distribution of \( X \) to produce a sequence of independent random variables \( \bs{X} = (X_1, X_2, \ldots) \), each having the Poisson distribution with unknown parameter \( \lambda \in (0, \infty) \). Again, \(\bs{X}_n = (X_1, X_2, \ldots, X_n)\) is a random sample of size \(n\) from the distribution for each \( n \in \N_+ \). From the previous exercise, \(\lambda\) is both the mean and the variance of the distribution, so that we could use either the sample mean \(M_n\) or the sample variance \(S_n^2\) as an estimator of \(\lambda\). Both are unbiased, so which is better? Naturally, we use mean square error as our criterion.

Comparison of \(\bs M\) to \(\bs S^2\) as estimators of \(\lambda\).

- \(\var\left(M_n\right) = \frac{\lambda}{n}\) for \( n \in \N_+ \).
- \(\var\left(S_n^2\right) = \frac{\lambda}{n} \left(1 + 2 \lambda \frac{n}{n - 1} \right)\) for \( n \in \{2, 3, \ldots\} \).
- \(\var\left(M_n\right) \lt \var\left(S_n^2\right)\) for \( n \in \{2, 3, \ldots\} \), so \( M_n \) is more efficient than \( S_n^2 \).
- The asymptotic relative efficiency of \(\bs M\) to \(\bs S^2\) is \(1 + 2 \lambda\).

So our conclusion is that the sample mean \(M_n\) is a better estimator of the parameter \(\lambda\) than the sample variance \(S_n^2\) for \( n \in \{2, 3, \ldots\} \), and the difference in quality increases with \( \lambda \).
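The comparison can be explored by simulation. The following is a rough sketch under stated assumptions: the parameter value, sample size, and number of replications are arbitrary, and `poisson_draw` is a hypothetical helper that uses Knuth's multiplication method (adequate for small \(\lambda\)) because the Python standard library has no built-in Poisson sampler.

```python
import math
import random

def poisson_draw(lam, rng):
    """Knuth's multiplication method for one Poisson(lam) value (small lam only)."""
    limit, k, p = math.exp(-lam), 0, rng.random()
    while p > limit:
        k += 1
        p *= rng.random()
    return k

def variance(vals):
    """Sample variance with divisor len(vals) - 1."""
    m = sum(vals) / len(vals)
    return sum((v - m) ** 2 for v in vals) / (len(vals) - 1)

rng = random.Random(4)         # fixed seed, illustrative only
lam, n, reps = 4.0, 20, 5000   # hypothetical settings

means, variances = [], []
for _ in range(reps):
    xs = [poisson_draw(lam, rng) for _ in range(n)]
    means.append(sum(xs) / n)      # M_n
    variances.append(variance(xs)) # S_n^2

var_m_theory = lam / n
var_s2_theory = lam / n * (1 + 2 * lam * n / (n - 1))
print(f"var(M):   empirical={variance(means):.4f}  theory={var_m_theory:.4f}")
print(f"var(S^2): empirical={variance(variances):.4f}  theory={var_s2_theory:.4f}")
```

With these settings the empirical variance of \(S_n^2\) should be several times that of \(M_n\), in line with the formulas above.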

Run the Poisson experiment 100 times for several values of the parameter. In each case, compute the estimators \(M\) and \(S^2\). Which estimator seems to work better?

The emission of elementary particles from a sample of radioactive material in a time interval is often assumed to follow the Poisson distribution. Thus, suppose that the alpha emissions data set is a sample from a Poisson distribution. Estimate the rate parameter \(\lambda\).

- using the sample mean
- using the sample variance

## Answer

- 8.367
- 8.649

### Simulation Exercises

In the sample mean experiment, set the sampling distribution to gamma. Increase the sample size with the scroll bar and note graphically and numerically the unbiased and consistent properties. Run the experiment 1000 times and compare the sample mean to the distribution mean.

Run the normal estimation experiment 1000 times for several values of the parameters.

- Compare the empirical bias and mean square error of \(M\) with the theoretical values.
- Compare the empirical bias and mean square error of \(S^2\) and of \(W^2\) to their theoretical values. Which estimator seems to work better?

In the matching experiment, the random variable is the number of matches. Run the simulation 1000 times and compare

- the sample mean to the distribution mean.
- the empirical density function to the probability density function.

Run the exponential experiment 1000 times and compare the sample standard deviation to the distribution standard deviation.

### Data Analysis Exercises

For Michelson's velocity of light data, compute the sample mean and sample variance.

## Answer

852.4, 6242.67

For Cavendish's density of the earth data, compute the sample mean and sample variance.

## Answer

5.448, 0.048817

For Short's parallax of the sun data, compute the sample mean and sample variance.

## Answer

8.616, 0.561032

Consider the Cicada data.

- Compute the sample mean and sample variance of the body length variable.
- Compute the sample mean and sample variance of the body weight variable.
- Compute the sample covariance and sample correlation between the body length and body weight variables.

## Answer

- 24.0, 3.92
- 0.180, 0.003512
- 0.0471, 0.4012

Consider the M&M data.

- Compute the sample mean and sample variance of the net weight variable.
- Compute the sample mean and sample variance of the total number of candies.
- Compute the sample covariance and sample correlation between the number of candies and the net weight.

## Answer

- 57.1, 5.68
- 49.215, 2.3163
- 2.878, 0.794

Consider the Pearson data.

- Compute the sample mean and sample variance of the height of the father.
- Compute the sample mean and sample variance of the height of the son.
- Compute the sample covariance and sample correlation between the height of the father and height of the son.

## Answer

- 67.69, 7.5396
- 68.68, 7.9309
- 3.875, 0.501

The estimators of the mean, variance, and covariance that we have considered in this section have been natural in a sense. However, for other parameters, it is not clear how to even find a reasonable estimator in the first place. In the next several sections, we will consider the problem of constructing estimators. Then we return to the study of the mathematical properties of estimators, and consider the question of when we can know that an estimator is the best possible, given the data.