21.4.8: Distribution of Sample Sums
The Central Limit Theorem is a cornerstone of statistical inference because it guarantees that the sampling distribution of the sum will be approximately Normal, regardless of the original population's distribution, provided the population has a finite variance and the sample size is large enough. This is crucial because it allows us to make probabilistic claims about sample estimates using the Normal distribution even when we know little about the underlying population.
In practice, this means we can construct confidence intervals, conduct hypothesis tests, and quantify uncertainty without needing to derive exact distributions every time. This set of simulation experiments visually confirms this powerful result: By summing random variables from any finite-variance distribution, you can see the Normal distribution emerge from chaos.
Problem:
The data are not generated from a Normal process, but classical OLS inference assumes Normality. How bad is it if I ignore this violation?
Solution:
Let us generate observations (data) under several different distributions, calculate the sums of those samples, and repeat many, many, many times. We will both look at the distribution of the sums and test whether that distribution is distinguishable from Normal.
Normal (Gaussian) Data:
First, let us see the process with the data being generated from a Normal distribution. The first step is to generate the data.
set.seed(370)
y = rnorm(100*1000)
Now, to make things faster, let's put this in a matrix and calculate the row sums
s = matrix(y, ncol=100)
t = rowSums(s)
The variable t now contains 1000 values, with each being the sum of n = 100 random values from a Normal distribution.
Unsurprisingly, the Shapiro-Wilk Normality test does not conclude that the sums are not Normally distributed (p-value = 0.710). The histogram shows why:
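The test and the histogram come from the base R functions shapiro.test and hist; the generation lines repeat the code above so this block is self-contained:

```r
set.seed(370)
y = rnorm(100*1000)          # 100,000 standard Normal draws
s = matrix(y, ncol=100)      # 1000 rows, each a sample of n = 100
t = rowSums(s)               # 1000 sample sums

shapiro.test(t)              # tests the null hypothesis that t is Normal
hist(t, breaks=30, main="Sums of 100 Normal draws")
```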
Student's t Data:
The previous example was boring. We completely expected the sums of Normally-distributed values to also be Normally distributed. Let us run the same analysis, but with data generated from a Student's t distribution. The only thing that changes in the code is the distribution:
set.seed(370)
y = rt(100*1000, df=10)
s = matrix(y, ncol=100)
t = rowSums(s)
The Shapiro-Wilk test is, again, unable to conclude that the sums are not Normally distributed (p-value = 0.889).
Chi-Square Data:
The t distribution is symmetric. What happens if we use an asymmetric distribution? Here, we use a Chi-Squared distribution.
set.seed(370)
y = rchisq(100*1000, df=10)
s = matrix(y, ncol=100)
t = rowSums(s)
Again, the Shapiro-Wilk test fails to detect non-Normality (p-value = 0.104).
Cauchy Data:
Well, all of the above data-generating distributions have one thing in common: their variances are finite. In other words, the Central Limit Theorem applies to them. As the sample size grows, the distribution of sample sums (and means) converges to a Normal distribution. What about a distribution with a non-finite variance, like the Cauchy?
set.seed(370)
y = rcauchy(100*1000)
s = matrix(y, ncol=100)
t = rowSums(s)
Note that the Shapiro-Wilk test does detect the non-Normality in the sample sums (p-value << 0.0001). In fact, the sum of independent Cauchy random variables is itself Cauchy-distributed, so no amount of summing will produce Normality. Here is the histogram of the sums of Cauchy random variables.
Concluding Thoughts:
In this series of examples, we have seen that the distribution of the sample sums is less dependent on the distribution of the original data than we may think. As long as the sample size is large enough (and the variance of the distribution is finite), the sample sums are sufficiently close to Normal.
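The four experiments above can be condensed into a single loop. This is just a sketch; the generator list and the variable names are choices made here, not part of the original code:

```r
set.seed(370)

# One data-generating function per experiment above
generators = list(
  Normal     = function(n) rnorm(n),
  Student_t  = function(n) rt(n, df=10),
  Chi_square = function(n) rchisq(n, df=10),
  Cauchy     = function(n) rcauchy(n)
)

# For each distribution: 1000 sums of n = 100 draws, then the Shapiro-Wilk p-value
pvals = sapply(generators, function(g) {
  sums = rowSums(matrix(g(100*1000), ncol=100))
  shapiro.test(sums)$p.value
})
print(round(pvals, 4))   # only the Cauchy p-value should be near zero
```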
Extensions:
1. Repeat the above experiments, but change the sample size from 100 to 10. Which sample sums remain sufficiently Normal?
2. Repeat the Cauchy experiment above, but change the sample size from 100 to 1000. Are the sample sums sufficiently Normal now? Are they sufficiently Normal if the sample size is 10,000? ... what about 100,000? Explain what your conclusions mean.
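A sketch for setting up the first extension; the helper function sum_pvalue is an invention here, and only the sample size changes relative to the experiments above:

```r
# p-value of the Shapiro-Wilk test on 'reps' sums of n draws from generator 'gen'
sum_pvalue = function(gen, n, reps = 1000) {
  sums = rowSums(matrix(gen(n*reps), ncol=n))
  shapiro.test(sums)$p.value
}

set.seed(370)
sum_pvalue(function(n) rnorm(n),         n = 10)  # sums of Normals are exactly Normal at any n
sum_pvalue(function(n) rt(n, df=10),     n = 10)
sum_pvalue(function(n) rchisq(n, df=10), n = 10)  # skewness is harder to wash out at n = 10
```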

