
22.2: Sample Statistics


A statistic is simply a function of your data. Since data are typically a random sample, any statistic you calculate — such as a mean, a variance, or a skewness — is itself a random variable. This means your sample mean has its own distribution and will vary from one sample to the next. This section reviews key sample statistics essential for linear models, framing them not just as static summaries but as the fundamental, random building blocks of statistical inference. Understanding this duality — between a calculated value and a random quantity — is crucial for moving from simple description to meaningful modeling.


    Sample Mean (Arithmetic Mean)

The sample mean is the usual "average" that we have calculated many times in the past. If we want to use just one number to summarize the data, the mean is the usual choice.

    Definition: Sample Mean

Let \(Y_i\) be a random sample of size \(n\) from a distribution. The sample mean is defined as

    \begin{equation}
    \overline{Y} = \frac{1}{n} \sum_{i=1}^n\ Y_i
    \end{equation}

    Note that this definition is equivalent to

    \begin{equation}
    \sum_{i=1}^n\ Y_i = n\overline{Y}
    \end{equation}

This form is useful in many proofs, and you will see one later. Before we get to that, however, here is an elementary property of the sample mean that we will need.
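As a quick numerical check of the definition and the equivalent form above, here is a minimal NumPy sketch; the data values are made up purely for illustration.

```python
import numpy as np

# A small, made-up sample purely for illustration.
y = np.array([3.1, 4.7, 2.9, 5.5, 4.0])
n = len(y)

# The sample mean from the definition: (1/n) times the sum.
y_bar = y.sum() / n
print(np.isclose(y_bar, np.mean(y)))   # True

# The equivalent form: the observations sum to n times the mean.
print(np.isclose(y.sum(), n * y_bar))  # True
```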

    Lemma \(\PageIndex{1}\): Linear Functional

    The sample mean is a linear functional.

    Proof.
This means we have to show that the sample mean of \((aY_i+b)\) is \(a\overline{Y} + b\) for non-stochastic scalars \(a\) and \(b\). Thankfully, this proof is very straightforward. We just find the mean of the transformed sample using the definition.

    \begin{align}
    \overline{aY+b} &= \frac{1}{n} \sum_{i=1}^n\ (aY_i + b) \\[1ex]
    &= \frac{1}{n} \sum_{i=1}^n\ aY_i + \frac{1}{n} \sum_{i=1}^n\ b \\[1ex]
    &= a \frac{1}{n} \sum_{i=1}^n\ Y_i + \frac{1}{n} n b \\[1em]
    &= a\overline{Y} + b
    \end{align}

Thus, through the transitive property, the sample mean of a linear transformation of the variable is just that same linear transformation of the sample mean.

    \(\blacksquare\)

You should go through each of the steps in the previous proof and give a reason each step is mathematically correct. This will give you some practice in the mathematics underlying linear models.
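Here is a quick numerical check of the lemma; this is a minimal NumPy sketch, and the sample, seed, and scalars \(a\) and \(b\) are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
y = rng.normal(size=100)   # an arbitrary simulated sample
a, b = 2.5, -1.0           # non-stochastic scalars

# The mean of the transformed sample equals the transformed mean.
print(np.isclose(np.mean(a * y + b), a * np.mean(y) + b))  # True
```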

Now, we show that the deviations from the mean sum to zero.

    Lemma \(\PageIndex{2}\): Mean Deviation

    Let \(Y_i\) be a sample of size \(n\) from a random variable.

    \begin{equation}
    \sum_{i=1}^n\ \left( Y_i - \overline{Y} \right) = 0
    \end{equation}

    Proof.
    I leave the proof as an exercise.

    While it seems as though this lemma is not important, you will see its application over and over again. It will serve you well to learn it by sight as quickly as possible.
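While the proof is left to you, a quick numerical check can build intuition; this is a minimal NumPy sketch with an arbitrary simulated sample, and floating-point arithmetic makes the sum only approximately zero.

```python
import numpy as np

rng = np.random.default_rng(1)
y = rng.normal(size=50)    # an arbitrary simulated sample

# The deviations from the mean sum to zero (up to rounding error).
print(np.isclose(np.sum(y - y.mean()), 0.0))  # True
```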

    Sample Variance

While the mean summarizes the entire data set with a single number, a measure of spread tells us how well that number represents each element of the data. There are many measures of spread that you probably have come across. These include the variance, standard deviation, and interquartile range. Here we will focus on the variance.

    Definition: Sample Variance

    Let \(Y_i\) be a random sample of size \(n\) from a distribution. The sample variance is defined as

    \begin{equation}
    S^2_y = \frac{1}{n-1}\ \sum_{i=1}^n\ \left(Y_i - \overline{Y}\right)^2
    \end{equation}
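As a sanity check of this definition, here is a minimal NumPy sketch with made-up data; note that NumPy's np.var divides by \(n\) unless you pass ddof=1.

```python
import numpy as np

y = np.array([3.1, 4.7, 2.9, 5.5, 4.0])   # made-up data

# The sample variance from the definition, dividing by n - 1.
s2 = np.sum((y - y.mean())**2) / (len(y) - 1)

# np.var divides by n by default; ddof=1 gives the n - 1 divisor.
print(np.isclose(s2, np.var(y, ddof=1)))  # True
```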

Here are some elementary properties of the sample variance that we will need.

    Lemma \(\PageIndex{3}\): Quadratic Functional

    The sample variance is a quadratic functional.

    Proof.
This means we need to show that the variance of \((aY + b)\) is \(a^2 S^2_y\) for non-stochastic scalars \(a\) and \(b\). This is also a relatively straightforward application of the definition.

Let \(Y_i\) be a random sample of size \(n\) from a distribution. With this, we have the following

    \begin{align}
    S^2_{aY+b} &= \frac{1}{n-1} \sum_{i=1}^n\ \left((aY_i+b) - (\overline{aY+b}) \right)^2 \\[1ex]
    &= \frac{1}{n-1} \sum_{i=1}^n\ \left(aY_i+b - a\overline{Y}-b \right)^2 \\[1ex]
    &= \frac{1}{n-1} \sum_{i=1}^n\ \left(aY_i - a\overline{Y} \right)^2 \\[1ex]
    &= \frac{1}{n-1} \sum_{i=1}^n\ \left(a(Y_i - \overline{Y}) \right)^2 \\[1ex]
    &= \frac{1}{n-1} \sum_{i=1}^n\ a^2\left(Y_i - \overline{Y} \right)^2 \\[1ex]
    &= a^2\ \frac{1}{n-1} \sum_{i=1}^n\ \left(Y_i - \overline{Y} \right)^2 \\[1em]
    &= a^2\ S^2_y
    \end{align}

    Thus, since \( S^2_{aY+b}= a^2\ S^2_y\), we know that the variance is a quadratic functional.

    \(\blacksquare\)

The proofs of Lemmas \(\PageIndex{1}\) and \(\PageIndex{3}\) both exemplify one important method for proving statements: substitute and simplify. They also provide an important result. Go over both proofs a few times. Note the effect of a scalar multiple on the variance, as well as the lack of effect of a scalar addend.

    Also, think about why it makes sense for a translation to have no effect on the variance, but a scale change to have a significant effect.

    What is so important about the previous lemma? Think about the variance of heights of students in this room. First, measure in feet, then in inches, then in centimeters. Why does the variance change in the three measurements?

Now, have everyone stand on the same chair while measuring them in inches. Why does the variability, measured to the floor, not depend on whether they are standing on the chair?
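A short simulation illustrates both points; this is a minimal NumPy sketch, and the simulated heights and the conversion factors are only for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
heights_ft = rng.normal(5.6, 0.3, size=30)   # simulated heights, in feet

s2_ft = np.var(heights_ft, ddof=1)
s2_in = np.var(heights_ft * 12, ddof=1)      # feet -> inches

# A scale change multiplies the variance by the square of the factor...
print(np.isclose(s2_in, 12**2 * s2_ft))                          # True

# ...but a translation (standing on an 18-inch chair) has no effect.
print(np.isclose(np.var(heights_ft * 12 + 18, ddof=1), s2_in))   # True
```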

    Sample Covariance

The variance measures the variability of a single variable. The covariance measures how two numeric variables vary together.

    Definition: Sample Covariance

Let \(X_i\) be a random sample of size \(n\) from a distribution. Let \(Y_i\) be a random sample from a distribution (the same as \(X\) or different), paired with the \(X_i\). The sample covariance is defined as

    \begin{equation}
    S_{X,Y} = \frac{1}{n-1} \sum_{i=1}^n\ \left(X_i - \overline{X}\right)\left(Y_i-\overline{Y}\right)
    \end{equation}

    The covariance is also denoted by \(Cov[X,Y]\). However, using this symbol may lead to confusion about whether you are talking about a sample or a population. It is safer to stay with \(S_{X,Y}\) as the symbol for the sample covariance and with \(\sigma_{X,Y}\) as the symbol for the population covariance.
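Here is a quick check of the definition against NumPy's built-in np.cov; this is a minimal sketch, and the simulated data are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.normal(size=40)
y = 0.5 * x + rng.normal(size=40)   # paired with x so the two covary

# The sample covariance from the definition.
s_xy = np.sum((x - x.mean()) * (y - y.mean())) / (len(x) - 1)

# np.cov returns the 2x2 sample covariance matrix (with the n - 1
# divisor by default); the off-diagonal entry is S_{X,Y}.
print(np.isclose(s_xy, np.cov(x, y)[0, 1]))  # True
```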

    Here are some elementary properties of the sample covariance.

    Lemma \(\PageIndex{4}\)

    \(S_{X+Y,Z} = S_{X,Z} + S_{Y,Z}\)

    Proof.
This proof is also a direct application of the definition and of the proof style of Lemma \(\PageIndex{3}\), together with the fact that \(\overline{X+Y} = \overline{X} + \overline{Y}\), which follows from Lemma \(\PageIndex{1}\).

    \begin{align}
S_{X+Y,Z} &= \frac{1}{n-1} \sum_{i=1}^n\ \left( (X_i+Y_i) - (\overline{X}+\overline{Y})\right) \left(Z_i-\overline{Z}\right) \\[1em]
    &= \frac{1}{n-1} \sum_{i=1}^n\ \left(X_i-\overline{X} +Y_i-\overline{Y}\right)\left(Z_i-\overline{Z}\right) \\[2em]
    &= \frac{1}{n-1} \sum_{i=1}^n\ \left(X_i-\overline{X}\right)\left(Z_i-\overline{Z}\right)\ + \nonumber \\
    & \qquad\qquad \frac{1}{n-1} \sum_{i=1}^n\ \left(Y_i-\overline{Y}\right)\left(Z_i-\overline{Z}\right) \\[3ex]
    &= S_{X,Z} + S_{Y,Z}
    \end{align}

    Thus, by the transitive property, \(S_{X+Y,Z} = S_{X,Z} + S_{Y,Z}\).

    \(\blacksquare\)
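A quick empirical check of this lemma follows; this is a minimal NumPy sketch, and the helper s_cov is a hypothetical function written only for this illustration.

```python
import numpy as np

def s_cov(u, v):
    """Sample covariance with the n - 1 divisor."""
    return np.sum((u - u.mean()) * (v - v.mean())) / (len(u) - 1)

rng = np.random.default_rng(4)
x, y, z = rng.normal(size=(3, 60))   # three arbitrary simulated samples

# The covariance is additive in its first argument.
print(np.isclose(s_cov(x + y, z), s_cov(x, z) + s_cov(y, z)))  # True
```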

    Note

Remember that proving things is important. As you work through this material, you need to be able to prove statements mathematically. Not everything in statistics can be proven this way; some results require simulation, which gives only provisional evidence. When a mathematical proof is available, it is preferred.

    I leave the following three lemmas as exercises for you.

    Lemma \(\PageIndex{5}\)

    \(S_{X,Y} = S_{Y,X}\)

    Lemma \(\PageIndex{6}\)

    \(S_{aX,bY} = ab\ S_{X,Y}\)

    Lemma \(\PageIndex{7}\)

    \( S_{X,X} = S^2_{X} \)
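An empirical check is no substitute for your proofs, but a sketch like the following (the helper s_cov is again hypothetical, and the data are arbitrary) can convince you the three lemmas are plausible before you prove them.

```python
import numpy as np

def s_cov(u, v):
    """Sample covariance with the n - 1 divisor."""
    return np.sum((u - u.mean()) * (v - v.mean())) / (len(u) - 1)

rng = np.random.default_rng(5)
x, y = rng.normal(size=(2, 60))
a, b = 3.0, -2.0

print(np.isclose(s_cov(x, y), s_cov(y, x)))                  # Lemma 5
print(np.isclose(s_cov(a * x, b * y), a * b * s_cov(x, y)))  # Lemma 6
print(np.isclose(s_cov(x, x), np.var(x, ddof=1)))            # Lemma 7
```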

    Sample Correlation

Like the covariance, the correlation measures the strength of the linear relationship between two numeric variables. However, where the covariance cannot be meaningfully compared across variable pairs, the correlation can. A correlation of 0.25 indicates exactly the same thing whether you are looking at the relationship between height and weight or between population and GDP. A covariance of 100, on the other hand, means something different in each case. This is because the correlation scales the covariance so that it always lies between -1 and +1.

    Definition: Sample Correlation

    Let \(X_i\) be a random sample of size \(n\) from a distribution with finite variance. Let \(Y_i\) be a random sample from a distribution, also with finite variance. The sample correlation is defined as

    \begin{equation}
    R_{X,Y} = \frac{\sum_{i=1}^n\ \left(X_i-\overline{X}\right)\left(Y_i-\overline{Y}\right)}{\quad \sqrt{\sum_{i=1}^n\ \left(X_i-\overline{X}\right)^2\ \sum_{i=1}^n\ \left(Y_i-\overline{Y}\right)^2}\quad }
    \end{equation}

The correlation is also symbolized using \(Cor[X,Y]\). However, this may be confusing as this symbol is also used for the population correlation. It is better to use \(R_{X,Y}\) and \(r_{X,Y}\) for the sample correlation and \(\rho_{X,Y}\) for the population correlation.

    Alternate formulas for the correlation include:

    \begin{align}
    R_{X,Y} &= \frac{S_{X,Y}}{\sqrt{\ S^2_{X}\ S^2_{Y}\ }} \\[1em]
    R_{X,Y} &= \frac{S_{X,Y}}{S_X\ S_Y}
    \end{align}
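As a numerical check, both forms agree with NumPy's built-in np.corrcoef; this is a minimal sketch with arbitrary simulated data.

```python
import numpy as np

rng = np.random.default_rng(6)
x = rng.normal(size=50)
y = 0.7 * x + rng.normal(size=50)

# The correlation is the covariance scaled by the two standard deviations.
r = np.cov(x, y)[0, 1] / (np.std(x, ddof=1) * np.std(y, ddof=1))

# np.corrcoef returns the 2x2 correlation matrix.
print(np.isclose(r, np.corrcoef(x, y)[0, 1]))  # True
```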

    I leave the following lemmas as exercises for you.

    Lemma \(\PageIndex{8}\)

    \(R_{X,Y} = R_{Y,X}\)

    Lemma \(\PageIndex{9}\)

    \(R_{aX,Y} = R_{X,Y}\)

    Lemma \(\PageIndex{10}\)

    \(R_{aX+b,Y} = R_{X,Y}\)

    Lemma \(\PageIndex{11}\)

    \(-1 \leq R_{X,Y} \leq 1\)

Lemmas \(\PageIndex{9}\) and \(\PageIndex{10}\) are very important. Do not just think about them in terms of the mathematics. Think about what they mean in terms of the reality: If we apply a linear transformation to the two variables, how does that affect the correlation? What does that really mean?
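A short sketch makes the invariance concrete; this is a minimal NumPy example with arbitrary data. Note that it uses a positive scale factor \(a\), as the lemmas assume; a negative factor would flip the sign of the correlation.

```python
import numpy as np

rng = np.random.default_rng(7)
x = rng.normal(size=50)
y = 0.4 * x + rng.normal(size=50)
a, b = 12.0, 18.0   # e.g., a change of units plus a constant shift

r = np.corrcoef(x, y)[0, 1]

# A positive linear transformation of X leaves the correlation unchanged.
print(np.isclose(np.corrcoef(a * x + b, y)[0, 1], r))  # True
```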

    Caution

In the definitions of the sample variance and sample covariance above, the divisor \(n-1\) is chosen to ensure that the sample statistic is an unbiased estimator of the corresponding population parameter. In other words, we divide by \(n-1\) so that this equation is true:

    \begin{equation*}
    E[\text{statistic}] = \text{parameter}
    \end{equation*}

    It is interesting that those denominators are the "degrees of freedom" for that statistic. This gives one helpful "definition" for the term degrees of freedom.
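A small simulation illustrates the unbiasedness of the \(n-1\) divisor; this is a minimal NumPy sketch, and the true variance and sample size are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(8)
sigma2 = 4.0   # the (arbitrary) true population variance

# Draw many samples of size 10 and average the two variance estimates.
samples = rng.normal(0.0, np.sqrt(sigma2), size=(100_000, 10))
print(np.var(samples, axis=1, ddof=1).mean())  # approx 4.0 (unbiased)
print(np.var(samples, axis=1, ddof=0).mean())  # approx 3.6 = (9/10) * 4.0
```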


    This page titled 22.2: Sample Statistics is shared under a CC BY-NC-SA 4.0 license and was authored, remixed, and/or curated by Ole Forsberg.
