22.6: Other Topics
This section holds some topics that I did not know where to put otherwise. First, we discuss the effects of multiple comparisons on the legitimacy of statistical conclusions; this is why you had to use analysis of variance, instead of multiple t-tests, when comparing more than two population means.
Next, I introduce the runs test. The runs test is useful for testing the claim that the model fits the observed data without systematic error. The logic behind it is very straightforward, being based on a Binomial random variable.
Finally, I provide a proof of the Central Limit Theorem. You will most certainly have heard of the Central Limit Theorem in your earlier statistics course, but you likely have not seen its proof. The work below with moment generating functions is provided solely for those who are interested in how the Central Limit Theorem is proved.
A Check for Skewness
There are several tests for skewness and for estimating the level of skew in a distribution. However, most are beyond the scope of this book. In lieu of these, let us use the Hildebrand Rule as a good Rule of Thumb for determining if the data are "sufficiently" skewed (Hildebrand 1986, p. 43).
For the sample, the ratio is defined as
\begin{equation}
H = \frac{\displaystyle \bar{x} - \widetilde{x}}{\displaystyle s}
\end{equation}
where \(\bar{x}\) is the sample mean, \(\widetilde{x}\) is the sample median, and \(s\) is the sample standard deviation. According to Hildebrand, if this ratio is greater than 0.20, then the data (or population) are positively skewed. If this ratio is less than -0.20, then the data (or population) are negatively skewed. Otherwise, there is no significant skew.
Of course, this is just a Rule of Thumb that completely ignores the sample size and precision of the estimate. However, it is not overly terrible. If, however, the skew of the data is important, one should use a genuine statistical test.
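To make the rule concrete, here is a minimal R sketch (the function name hildebrand is my own, not from any package):
hildebrand = function(x) {
    ## H = (sample mean - sample median) / sample standard deviation
    (mean(x) - median(x)) / sd(x)
}
set.seed(370)
hildebrand(rexp(100))     ## right-skewed data, so expect H > 0.20
hildebrand(rnorm(100))    ## symmetric data, so expect |H| near 0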
Multiple Comparisons
Remember from your previous statistics class the problem of multiple comparisons. When performing multiple tests, one needs to adjust the significance level (or, equivalently, the p-values) to ensure that true null hypotheses are not rejected at too high a rate (\(\alpha\)).
The first part illustrates why. The second part introduces the Bonferroni adjustment. Note that the Bonferroni adjustment always "works," but it tends to be rather conservative. That is, it controls the Type I error rate, but at the expense of reducing the power of the test (a higher Type II error rate, \(\beta\)).
Multiple Testing
To start this section, recall the meaning of \(\alpha\). It is the proportion of the time we wrongly reject a true null hypothesis (i.e., commit a Type I error). Since we are basing our statistical inference on our selected value of \(\alpha\), we want to ensure that we are rejecting at the right frequency (or with the right probability). At the very least, we want to ensure that we do not reject too frequently. If we reject too frequently, we are biasing our conclusions towards rejecting null hypotheses when we should not.
If we reject too infrequently, then we will fail to reject the null hypothesis too often. This could also be a bad thing. It really depends on the penalty for being each type of wrong. In the framework of the traditional hypothesis testing, we wish to avoid rejecting the null hypothesis too frequently. This leads to the focus on ensuring we reject the null hypothesis at the \(\alpha\) rate or less. Chapter 15: Binary Dependent Variables explores consequences of only controlling \(\alpha\).
To see the effect of multiple tests on the probability of rejecting a true null hypothesis (committing a Type I error), let us look at the following two examples.
Let the null hypothesis be true. Let us also collect a sample and test the null hypothesis using that sample. What is the probability we reject the true null hypothesis if we state our accepted Type I error rate is \(\alpha\)?
Solution.
By definition, as long as the test is appropriate for the null hypothesis, the probability of wrongly rejecting the null hypothesis is \(\alpha\).
Let the null hypothesis be true. Let us collect a sample and test the null hypothesis using that sample. Let us now collect a second, independent, sample and test the null hypothesis. What is the probability we reject the true null hypothesis in either sample if we state our Type I error rate is \(\alpha\)?
Solution.
By definition, for an appropriate test of the null hypothesis, the probability of wrongly rejecting the null hypothesis in each test is \(\alpha\).
Let us define \(Y\) as the number of samples for which the test rejects the null hypothesis. Clearly, if the tests are independent (if the samples are independent), then \(Y \sim Bin(2, \alpha)\), and we need to calculate \(P[Y \ge 1]\).
As such, the probability is
\begin{align}
P[Y \ge 1] &= 1 - P[Y = 0] \\
&= 1 - \binom{2}{0}\ \alpha^0(1-\alpha)^2 \\
&= 1 - (1-\alpha)^2
\end{align}
If \(\alpha=0.05\), then \(P[Y \ge 1] = 0.0975\). In other words, we claim the Type I error rate is 0.05, but it is really 0.0975, almost twice as much!
\(\blacksquare\)
Allow me to repeat that result: If we test the null hypothesis twice, then the actual Type I error rate is almost twice what we claim. This is very problematic, especially for hypotheses that require several sub-tests to fully test, or for times we test multiple hypotheses using the same data.
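To see this inflation numerically, here is a small simulation sketch of my own construction: two independent one-sample t-tests of a true null hypothesis, repeated many times.
set.seed(370)
alpha = 0.05
## Each repetition: test H0: mu = 0 (which is true) on two independent samples
reject = replicate(1e4, {
    p1 = t.test(rnorm(25))$p.value
    p2 = t.test(rnorm(25))$p.value
    (p1 <= alpha) || (p2 <= alpha)    ## does either test reject?
})
mean(reject)    ## close to 1 - (1 - 0.05)^2 = 0.0975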
The Bonferroni Adjustment
One of the first methods for ensuring that the real Type I error rate is no larger than the claimed rate is the Bonferroni adjustment (Bonferroni 1936; Dunn 1958; Dunn 1961). While the Bonferroni adjustment always controls the Type I error rate, it is only the best option when nothing else is available; it reduces the power of the test (i.e., it increases the probability of a Type II error).
Let the target Type I error rate be \(\alpha\), with \(k\) tests being performed. To ensure that the experiment-wise Type I error rate is no greater than \(\alpha\), each test should be rejected at the \(\alpha/k\) level.
Proof.
If we reject each test at the \(\alpha/k\) level, the experiment-wise error rate for the \(k\) tests is
\begin{align}
\text{EWER} &= P\left[\bigcup_{i=1}^k\ \mathbb{1}_{ P_i \le \frac{\alpha}{k} } \right] \\[2em]
&\le \sum_{i=1}^k\ P\left[P_i \le \frac{\alpha}{k} \right] \\[2em]
&= \sum_{i=1}^k\ \frac{\alpha}{k} \\[1em]
&= k\ \frac{\alpha}{k} = \alpha
\end{align}
Note that \(\mathbb{1}_{(P_i \le \frac{\alpha}{k})}\) equals 1 when \(P_i \le \frac{\alpha}{k}\) and 0 otherwise. The function \(\mathbb{1}_{(\cdot)}\) is called the "indicator function" for this very reason. It "indicates" whether the condition in the subscript is true. (What is the distribution of the indicator function?)
Also note that \(P_i\) is the p-value for the \(i^\text{th}\) test. If the test is appropriate, then the p-value follows a standard uniform distribution, \(Unif(0,1)\), under the null hypothesis.
With those reminders, the rest is just algebra.
\(\blacksquare\)
Note the inequality. Under what circumstances is it an equality?
The Bonferroni adjustment always "works." In other words, applying the Bonferroni adjustment always ensures that the Type I error rate for the entire battery of tests is at most \(\alpha\). However, it is usually a poor adjustment in that it over-corrects for multiple comparisons.
To see this, let us make the assumption that the tests are independent. This additional information allows us to create a tighter adjustment.
Let the target Type I error rate be \(\alpha\), with \(k\) independent tests being performed. To ensure that the experiment-wise Type I error rate is no greater than \(\alpha\), each test should be rejected at the \(1-(1-\alpha)^{1/k}\) level.
Proof.
If we reject each test at the \(1-(1-\alpha)^{1/k}\) level, the experiment-wise error rate for the \(k\) tests is
\begin{align}
\text{EWER} &= P\left[\bigcup_{i=1}^k\ \mathbb{1}_{ P_i \le 1-(1-\alpha)^{1/k}} \right]
\end{align}
By De Morgan's Laws, this is
\begin{align}
&= 1 - P\left[\bigcap_{i=1}^k\ \mathbb{1}_{P_i > 1-(1-\alpha)^{1/k}} \right]
\end{align}
By independence, this is
\begin{align}
&= 1- \prod_{i=1}^k\ P\left[P_i > 1-(1-\alpha)^{1/k}\right] \\
&= 1- \prod_{i=1}^k\ \Big( 1- P\left[P_i \le 1-(1-\alpha)^{1/k}\right] \Big) \\
&= 1- \prod_{i=1}^k\ \Big( 1 - \left( 1-(1-\alpha)^{1/k} \right) \Big)
\end{align}
Finishing with algebra
\begin{align}
\text{EWER} &= 1- \prod_{i=1}^k\ (1-\alpha)^{1/k} \\
&= 1- \left((1-\alpha)^{1/k}\right)^k \\
&= 1 - (1-\alpha) \\
&= \alpha
\end{align}
Thus, if we know that the tests are independent, we can use a larger (less severe) per-test threshold. This still protects the Type I error rate, while not reducing the power as much.
\(\blacksquare\)
The original Bonferroni adjustment produces a smaller per-test threshold (and therefore lower power) than the adjustment that assumes independence among the tests. In other words,
\begin{equation}
\frac{\alpha}{k} \le 1-(1-\alpha)^{1/k}
\end{equation}
I leave this proof as an exercise.
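For a quick numerical comparison of the two per-test thresholds, here is a short sketch (the independence-based threshold is often called the Šidák correction):
alpha = 0.05
k = c(1, 2, 5, 10, 20)
cbind(k,
      bonferroni   = alpha / k,
      independence = 1 - (1 - alpha)^(1/k))
Every row has the Bonferroni threshold at or below the independence-based threshold, which is the inequality above.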
Theorem \(\PageIndex{1}\) (far above) shows that we can adjust for any case of multiple testing. However, Theorem \(\PageIndex{2}\) and Theorem \(\PageIndex{3}\) (just above) show that additional information (or assumptions) can produce adjustments that affect the test's power less.
In other words, additional information can help. We should not throw anything out.
This section explored the need for adjusting the \(\alpha\) to ensure that the claimed Type I error rate is no larger than the true Type I error rate. This is extremely important in inferential statistics. It is not, however, the only thing that needs to be examined.
Power (the probability of rejecting a false null hypothesis) is also quite important. Because of this, while the Bonferroni adjustment always works, it may reduce the power of the test to unacceptable levels. As such, understanding the relationships between and among the various tests is important in creating a superior adjustment.
The Runs Test
This section introduces the runs test (Bradley 1968; Mood 1940; Wald and Wolfowitz 1940). While you will not have seen it in your introductory statistics course, it is based on elementary concepts.
To begin, let us define a couple of terms.
A run is a maximal sequence of consecutive values that are all on the same side of zero.
The length of a run is the number of elements in that run.
Let the following be residuals from a model.
\[
1.2, 3.5, -0.5, 8.9, -1.2, -1.6, -4.5, 0.6, 1.5, -7.9
\]
Let us calculate the number of runs.
Solution.
The first run is of length 2: \(\{1.2, 3.5\}\). The second run is of length 1: \(\{-0.5\}\). The third run is also of length 1: \(\{8.9\}\). The fourth through sixth runs are \(\{-1.2, -1.6, -4.5\}\), \(\{0.6, 1.5\}\), and \(\{-7.9\}\).
Thus, there are six runs in those 10 residuals.
\(\blacksquare\)
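In R, counting runs is a one-liner, since a run ends exactly where the sign changes (a quick sketch):
r = c(1.2, 3.5, -0.5, 8.9, -1.2, -1.6, -4.5, 0.6, 1.5, -7.9)
1 + sum(diff(sign(r)) != 0)    ## 1 + number of sign changes = 6 runs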
Before continuing, remind yourself of Figure 6.2.1 and Figure 6.2.2. In both graphics, the residual is colored blue if it is positive and pink otherwise. In Figure 6.2.1, there are long unbroken streaks (runs) of blue and pink. There are just three runs.
In Figure 6.2.2, the length of those runs is much reduced and the number is increased. There are now 12 runs in the 20 residuals.
Since the distribution of the residuals is assumed/required to be Normal with mean zero, the probability of each residual being above the line \(e=0\) is \(\pi=0.500\)... a Binomial random variable! Because we know the distribution of the number of residuals above the \(e=0\) line, we can derive the distribution of the number of runs. Too few runs indicates that one positive residual tends to be followed by another, which usually suggests the model is misspecified. In Figure 6.2.1, the number of runs was just three. In the properly specified model of Figure 6.2.2, the number of runs is 12.
A statistician will pay attention to the number of runs as well as the distribution of the number of runs under the null hypothesis. In other words, what is the distribution of the number of runs?
The following estimates that distribution. Working through the code may help you better understand the runs test, the test statistic, and its distribution.
resids = 20    ## number of residuals in each simulated sample

## Count the number of runs in a sequence of values
calculateRuns = function(r) {
    n = length(r)
    x = sign(r)          ## +1 for positive values, -1 for negative
    y = x[-n] - x[-1]    ## nonzero wherever consecutive signs differ
    1 + sum(y != 0)      ## runs = 1 + number of sign changes
}

## Estimate the distribution of the number of runs under the null
numRuns = numeric()
for(i in 1:1e5) {
    x = rnorm(resids)
    numRuns[i] = calculateRuns(x)
}

## Plot the estimated distribution
par(mar = c(2, 1, 1, 1))
par(yaxt = "n")
par(family = "serif")
barplot(table(numRuns), col = rgb(0.85, 0.37, 0.01))
abline(h = 0)
In the code above, the first line specifies the number of residuals. Since I will be using this distribution to analyze the residuals from the earlier section on the constant expected value requirement, I set the number to 20. The calculateRuns function calculates the number of runs in a sequence of values. The for-loop repeatedly generates values under the null hypothesis and calculates the number of runs.
The last chunk of code plots the histogram, which is shown in Figure \(\PageIndex{1}\). Note that it is bell-shaped. This should give you an insight into how we can create an approximate test statistic based on the Normal distribution, which is what the functions do.
From this simulation, the probability of having 7 or more runs is about 95%. So, we would claim there are too few runs if we observed 6 or fewer runs in our 20 residuals. The p-value for observing 3 runs is 0.0003. Since this p-value is less than \(\alpha = 0.05\), we reject the null hypothesis that the residuals are randomly distributed about the zero line. Thus, we would conclude that the relationship is not linear.
The following lines calculate the critical value and the p-value for the runs distribution.
quantile(numRuns, 0.05)    ## critical value
mean(numRuns <= 3)         ## p-value
Another Normal Approximation*
In addition to the simulation experiment above, Wald and Wolfowitz (1940) provided a Normal approximation that does not require simulation. They concluded that the number of runs, \(R\), approximately follows the distribution
\begin{equation}
R\ \stackrel{\cdot}{\sim}\ N\left( \mu_R,\ \frac{(\mu_R-1)(\mu_R-2)}{N-1} \right), \qquad \mu_R = \frac{2N^+N^-}{N}+1
\end{equation}
In these, \(N^+\) is the number of positive values, \(N^-\) is the number of negative values, \(N = N^+ + N^-\) is the total number of values, and \(\mu_R\) is the mean of the approximating distribution. As with any "Normal approximation," the approximation improves as the sample size increases.
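Here is a sketch of that approximation in R (the function name runsApproxTest is mine); it returns the lower-tail p-value, corresponding to "too few runs":
runsApproxTest = function(r) {
    Np = sum(r > 0)                        ## number of positive values
    Nm = sum(r < 0)                        ## number of negative values
    N  = Np + Nm
    mu = 2 * Np * Nm / N + 1               ## approximate mean of R
    v  = (mu - 1) * (mu - 2) / (N - 1)     ## approximate variance of R
    obs = 1 + sum(diff(sign(r)) != 0)      ## observed number of runs
    pnorm(obs, mean = mu, sd = sqrt(v))    ## lower-tail p-value
}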
The Exact Distribution*
While Wald and Wolfowitz (1940) created an approximate distribution for the number of runs, the exact distribution is also available. All that we need is to remember combinatorics and the "choose" (or nCr, or binomial coefficient) function.
Under the null hypothesis, the probability of having exactly \(r\) runs, given a sample size of \(N\), of which \(N^+\) are positive values and \(N^-\) are negative, is
\begin{equation}
P[R = r] = 2\ \frac{\displaystyle\binom{N^+ -1}{m-1}\ \binom{N^- -1}{m-1}}{\displaystyle\binom{N^{\phantom{+}}}{N^+}}
\end{equation}
if \(r\) is even and \(r = 2m\). However, if \(r\) is odd and \(r = 2m+1\), then
\begin{equation}
P[R = r] = \frac{ \displaystyle\binom{N^+-1}{m}\ \binom{N^--1}{m-1} + \displaystyle\binom{N^--1}{m}\ \binom{N^+-1}{m-1} }{\displaystyle\binom{N^{\phantom{+}}}{N^+}}
\end{equation}
Those look rather "simple" to calculate. Remember, however, that p-values are cumulative probabilities. Thus, to calculate p-values, one would need to calculate \(P[R \le r]\), which means a lot more calculation.
Computers are awesome!
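They are, indeed. Here is a sketch of the exact pmf in R (the function name is mine). Note that the exact distribution conditions on the observed split into \(N^+\) positive and \(N^-\) negative values, so its p-values will differ slightly from the unconditional simulation above.
runsExactPMF = function(r, Np, Nm) {
    N = Np + Nm
    if (r %% 2 == 0) {    ## even number of runs: r = 2m
        m = r / 2
        2 * choose(Np - 1, m - 1) * choose(Nm - 1, m - 1) / choose(N, Np)
    } else {              ## odd number of runs: r = 2m + 1
        m = (r - 1) / 2
        (choose(Np - 1, m) * choose(Nm - 1, m - 1) +
         choose(Nm - 1, m) * choose(Np - 1, m - 1)) / choose(N, Np)
    }
}
## Exact P[R <= 3] for 20 residuals split 10 positive / 10 negative
sum(sapply(2:3, runsExactPMF, Np = 10, Nm = 10))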
The R Functions
Instead of going through the above steps every time to obtain p-values and critical values, you can use one of the R packages that implement the test. The lawstat package provides the function runs.test, which takes just one piece of information: the residuals, in the order of the independent variable. The test is also implemented in the randtests and snpar packages, as well as in KnoxStats.
Note that order matters. Thus, you will need to order the residuals based on each independent variable separately. That is, if you have three independent variables, you will need to run the runs test three times, once for each independent variable.
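Assuming the lawstat package is installed, the usage is short (see the package documentation for the full argument list):
## install.packages("lawstat")    ## once, if needed
library(lawstat)
e = c(1.2, 3.5, -0.5, 8.9, -1.2, -1.6, -4.5, 0.6, 1.5, -7.9)
runs.test(e)    ## residuals, already ordered by the independent variable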
The Central Limit Theorem
The Central Limit Theorem (CLT) is one of the most important theorems in statistics. It explains why we can rely so heavily on the Normal distribution for inferential statistics, even if the data do not arise from a Normal process.
Let \(X\) be a random variable following a distribution with finite variance \(\sigma^2\). Let \(\{X_1, X_2, \ldots, X_n\}\) be a random sample of size \(n\) from this distribution. Finally, define \(T_n\) as
\begin{equation}
T_n \stackrel{\text{def}}{=} \sum_{i=1}^n\ X_i
\end{equation}
As \(n \to \infty\), the distribution of \(T_n\) converges to the Normal distribution.
Note that this theorem requires that the random variable has a finite variance and that the sample is a random sample (independent and identically distributed). If either of these two conditions is not met, then this version of the Central Limit Theorem does not hold.
Other versions exist for some different types of dependent sampling. However, even those require the variance be finite.
In your previous statistics course, you may remember the Central Limit Theorem as saying something about the distribution of the sample mean. Well, that is actually a consequence of the CLT.
Given the conditions of the Central Limit Theorem, define \(\overline{X}_n\) as
\begin{equation}
\overline{X}_n \stackrel{\text{def}}{=} \frac{1}{n}\ \sum_{i=1}^n\ X_i
\end{equation}
The distribution of the sample mean, \(\overline{X}_n\), converges to the Normal distribution.
To use the Central Limit Theorem, we need to ensure the data are randomly generated from a distribution with a finite variance. We also need to ensure that we are trying to describe the sample total or sample mean. If those conditions are met, then we can use the CLT.
How would we prove the Central Limit Theorem? To do that, we need to introduce something called "moment generating functions." This we do in the next section. Continue, if you wish. Proving the Central Limit Theorem is usually beyond the scope of this type of course.
Moment Generating Functions*
This section introduces moment generating functions. You most certainly did not experience these in your introductory statistics course. For this course, they are used solely to prove the Central Limit Theorem and to prove that a linear combination of Normally distributed random variables is also a Normally distributed random variable (Theorem: The Sum of Normals).
A moment generating function is a function that can be used to generate all of the moments of a distribution. We have already experienced moments. The expected value is the first moment, for instance. Here are the definitions of three types of moments: raw, central, and standardized:
The \(k^\text{th}\) (raw) moment is \(E[X^k]\).
The \(k^\text{th}\) central moment is \(E[\left(X-\mu\right)^k]\), where \(\mu = E[X]\).
The \(k^\text{th}\) standardized moment is \(E\left[\left(\frac{X-\mu}{\sigma}\right)^k\right]\), where \(\mu = E[X]\) and \(\sigma = \sqrt{V[X]}\).
Thus, by definition, the first raw moment is \(E[X]\). This is just the mean. The second central moment is \(E[(X-\mu)^2]\). This is just the variance. Similarly, the third standardized moment is the skew, and the fourth standardized moment is the kurtosis.
Prove to yourself that the first central moment is always \(0\) as long as \(E[X]\) exists.
Moments are interesting in their own right. However, the following theorem makes them important in the realm of probability theory.
If two distributions have all moments the same, then the two distributions are equal, with probability 1.
The "with probability 1" clause is added to take into consideration the inherent issues with continuous distributions. [It also clearly specifies the difference between probability (and statistics) and mathematics that some mathematicians may not appreciate.]
Here is an example of that inherent issue. Let us define two random variables, \(X\) and \(Y\), through their density functions:
\begin{equation}
f(x) = \left\{ \begin{array}{ll} 1 \qquad & 0 < x < 1 \\ 0 & \text{otherwise} \\ \end{array} \right.
\end{equation}
\begin{equation}
g(y) = \left\{ \begin{array}{ll} 1 \qquad & 0 \le y \le 1 \\ 0 & \text{otherwise} \\ \end{array} \right.
\end{equation}
The random variables \(X\) and \(Y\) are not identical. The support for \(Y\) includes the entire support of \(X\) as well as the values \(0\) and \(1\). Thus, they are mathematically not the same distribution.
However, they are the same with probability 1. In other words, it does not matter which of the distributions you use, the calculated probabilities will be the same. This is because \(P[X=0] = P[Y=0] = 0\) and \(P[X=1] = P[Y=1] = 0\). Thus, for all intents and purposes, \(X\) and \(Y\) are the same. In probability, we would say \(X = Y\) with probability 1, symbolized as \(X \stackrel{wp1}{=\!\!=\!\!=} Y\).
With Theorem \(\PageIndex{6}\), it becomes clear that if two distributions have the same moment generating function, then those distributions are the same (with probability 1).
And so, at this point, we know what a moment is and what moments can be used for. We know that if two distributions have all the same moments, then the two distributions are identical (with probability 1). It follows that if two distributions have the same moment generating function, then the two distributions are the same. All that remains is to define a moment generating function (MGF).
There is a little more to the theory, but the details are not important at this point in your statistical life.
Let \(X\) be a random variable. Define the function \(M_x(t)\) as
\begin{equation}
M_x(t) \stackrel{\text{def}}{=} E\left[e^{Xt}\right]
\end{equation}
Then \(M_x(t)\) is called the moment generating function of \(X\).
Why is it called a "moment generating function"? It generates the raw moments. The \(k^{\text{th}}\) raw moment of \(X\) is calculated from
\begin{equation}
E[X^k] = \left.\frac{d^k}{dt^k}M_x(t) \right|_{t=0}
\end{equation}
Note that the moment generating function must be defined in a small neighborhood of 0 for us to apply this formula.
Let \(X\) be a Bernoulli random variable with success probability \(\pi\). What is its moment generating function?
Before starting the solution, please review the information on the Bernoulli distribution.
Solution.
The moment generating function is defined as \(M_x(t) =E\left[e^{Xt}\right]\). Since \(X\) is a discrete random variable, the MGF is a sum weighted by its probability mass function (pmf).
\begin{align}
E\left[e^{Xt}\right] &= \sum_i\ e^{x_i t}\ P[X = x_i] \\[1em]
&= \sum_{x=0}^1\ e^{xt}\ \pi^x (1-\pi)^{1-x} \\[1em]
&= e^{0t}\ \pi^0 (1-\pi)^{1-0} + e^{1t}\ \pi^1 (1-\pi)^{1-1} \\[1em]
&= 1\cdot 1\cdot (1-\pi) + e^{t}\cdot \pi^1\cdot 1
\end{align}
And, simplification gives us the MGF for a Bernoulli random variable:
\begin{equation}
E[e^{Xt}] = (1-\pi) + \pi\ e^{t}
\end{equation}
Now, with that, we can calculate the mean of \(X\):
\begin{align}
E[X^1] &= \left.\frac{d^1}{dt^1}M_x(t) \right|_{t=0} \\
&= \left.\frac{d}{dt}\left( (1-\pi) + \pi\ e^{t} \right)\ \right|_{t=0} \\
&= \left. \pi\ e^{t}\ \right|_{t=0} \\
&= \pi\ e^{0}\\
&= \pi
\end{align}
It should come as no surprise that the expected value of a Bernoulli random variable is the success probability. The second moment is
\begin{align}
E[X^2] &= \left.\frac{d^2}{dt^2}M_x(t) \right|_{t=0} \\
&= \left.\frac{d^2}{dt^2} \left( (1-\pi) + \pi\ e^{t} \right)\ \right|_{t=0} \\
&= \left.\frac{d}{dt}\left( \pi\ e^{t} \right)\ \right|_{t=0} \\
&= \left. \pi\ e^{t}\ \right|_{t=0} \\
&= \pi
\end{align}
This is the second raw moment. Recall that \(E[X^2] = \sigma^2 + \mu^2\). Thus, we can calculate the variance of \(X\) as
\begin{align}
V[X] &= E[X^2] - \mu^2 \\
&= \pi - \pi^2 \\
&= \pi(1-\pi)
\end{align}
Thus, we have obtained our usual variance formula for a Bernoulli random variable... but in a different way.
\(\blacksquare\)
Determine the moment generating function for a Binomial random variable with parameters \(n\) and \(\pi\).
Solution.
By our definition of the Binomial random variable, we know that it is just the sum of \(n\) independent and identically distributed Bernoulli random variables. Let us use that to make the calculations easier.
Thus, let \(X_1, \ldots, X_n\) be independent Bernoulli(\(\pi\)) random variables, and let \(Y = \sum_{i=1}^n X_i\) be the resulting Binomial random variable. The moment generating function of \(Y\) is
\begin{align}
M_y(t) &= E\left[\exp \left(Yt\right) \right] \\
&= E\left[\exp \left[ \sum_{i=1}^n\ X_i\ t \right]\right] \\
&= E\left[\prod_{i=1}^n\ \exp \left[ X_i\ t \right]\right] \\
&= \prod_{i=1}^n\ E\left[\exp \left[ X_i\ t \right]\right] \\
&= \prod_{i=1}^n\ \left( (1-\pi) + \pi e^t\right)
\end{align}
Thus, the MGF for a Binomial random variable is
\begin{equation}
M_y(t) = \left( (1-\pi) + \pi e^t\right)^n
\end{equation}
With that, we can calculate the various moments, should we desire.
\(\blacksquare\)
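As a quick check, base R's D() function can differentiate this MGF symbolically; evaluating the derivative at \(t=0\) should return the Binomial mean, \(n\pi\) (the variable names below are my own):
mgf = expression(((1 - p) + p * exp(t))^n)
dMdt = D(mgf, "t")                          ## first derivative of the MGF
eval(dMdt, list(t = 0, p = 0.3, n = 10))    ## n * pi = 10 * 0.3 = 3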
Let \(X\) and \(Y\) be two independent random variables. Define \(Z = X + Y\). The moment generating function of \(Z\) is the product of the moment generating functions of \(X\) and \(Y\):
\begin{equation}
M_z(t) = M_x(t)\ M_y(t)
\end{equation}
Proof.
The proof is simply an exercise in definitions and algebra:
\begin{align}
M_z(t) &\stackrel{\text{def}}{=} E[e^{Zt}] \\
&= E[e^{(X+Y)t}] \\
&= E[e^{Xt}\ e^{Yt}] \\
&= E[e^{Xt}]\ E[e^{Yt}] \label{eq:appS-weoriuwoeir} \\
&= M_x(t)\ M_y(t)
\end{align}
Note that we needed the assumption of independence in the fourth line (\(\ref{eq:appS-weoriuwoeir}\)), where we split the expected value.
\(\blacksquare\)
The moment generating function for the Normal distribution is
\begin{equation}
M_x(t) = \exp\left[ \mu t + \frac{1}{2} \sigma^2 t^2 \right]
\end{equation}
I leave this proof as an exercise. It is a simple application of the definition of moment generating function and the probability density function of the Normal distribution.
With that last theorem, we are able to prove that a linear combination of independent Normal random variables also is a Normal random variable.
Let \(X \sim N\left( \mu_x,\ \sigma_x^2\right)\) and \(Y \sim N\left( \mu_y,\ \sigma_y^2\right)\). Let \(X\) and \(Y\) be independent; that is, \(X \perp Y\). If \(a\), \(b\), and \(c\) are scalars, then
\begin{equation}
aX + bY + c \sim N\left( a\mu_x + b\mu_y + c,\ a^2\sigma_x^2 + b^2\sigma_y^2\right)
\end{equation}
Proof.
As expected, the proof relies on moment generating functions. First, the moment generating function of \(aX + bY + c\) is the product of the MGFs of \(aX\), \(bY\), and \(c\) because they are independent. So, let us examine the MGFs of each.
\begin{align}
M_{ax}(t) &= M_x(at) \\
&= \exp\left[ \mu_x (at) + \frac{1}{2} \sigma_x^2 (at)^2 \right] \\
M_{by}(t) &= M_y(bt) \\
&= \exp\left[ \mu_y (bt) + \frac{1}{2} \sigma_y^2 (bt)^2 \right] \\
M_c(t) &= E\left[ e^{ct} \right] \\
&= \exp \left[ ct \right]
\end{align}
Putting these together gives
\begin{align}
M_{ax+by+c}(t) &= \exp\left[ \mu_x (at) + \frac{1}{2} \sigma_x^2 (at)^2 \right] \cdot \exp\left[ \mu_y (bt) + \frac{1}{2} \sigma_y^2 (bt)^2 \right] \cdot \exp \left[ ct \right] \\
&= \exp\left[ \mu_x (at) + \frac{1}{2} \sigma_x^2 (at)^2 + \mu_y (bt) + \frac{1}{2} \sigma_y^2 (bt)^2 + ct \right] \\
&= \exp\left[ \left( a\mu_x + b\mu_y + c \right) t + \frac{1}{2} \left( a^2\sigma_x^2 + b^2 \sigma_y^2 \right) t^2 \right]
\end{align}
Note that this is the moment generating function of a Normal distribution with mean \(a\mu_x + b\mu_y + c\) and variance \(a^2\sigma_x^2 + b^2 \sigma_y^2\). Thus, we have proven this theorem.
\(\blacksquare\)
We are not quite done; there is one missing piece.
How can we have a moment generating function for the scalar value \(c\)? Aren't MGFs only for random variables? Well, there is a distribution called the "degenerate distribution" that takes on a single value with probability 1. Think of a two-headed coin. The probability mass function for a degenerate distribution is \(\delta(X-c)\), where \(\delta(\cdot)\) is the Dirac delta function (Calculus I), \(X\) is the random variable, and \(c\) is the only possible value of \(X\).
If you have taken a course in differential equations, you will certainly have come across the Dirac delta function. If not, then you may not have. In my experience, I saw \(\delta(\cdot)\) in Calculus I... and II and III and a few physics courses.
The take-away is that there is a lot of statistics that you have not even seen. You have just caught little glimpses. Sometimes, you have seen probability without actually noticing it. For instance, the probability density functions you have seen are just solutions to certain differential equations.
Astounding!
Alright, I'm not done with that. If you do not like thinking about the degenerate distribution as an actual distribution, think of it as a limiting distribution. For instance, define
\begin{equation}
f_s(x;\ \mu) \stackrel{\text{def}}{=} \frac{1}{\sqrt{2\pi s^2}}\ \exp\left[ -\frac{1}{2}\ \frac{(x-\mu)^2}{s^2} \right]
\end{equation}
The limit of this (Normal) distribution as \(s \to 0\) is the degenerate distribution at the point \(x=\mu\).
Neat-o skeet-o!
Note that the previous theorem showed that the sum of two independent Normally-distributed random variables also has a Normal distribution. The converse, that a Normally-distributed random variable can only be decomposed into the sum of Normally-distributed random variables, is much more difficult to prove. It is the purpose of Cramér's Theorem.
In other words, Cramér's Theorem states that if \(Z = X + Y\) and if \(Z\) follows a Normal distribution (and a few other requirements), then both \(X\) and \(Y\) must also follow Normal distributions. The proof can be found in Cramér (1936).
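Before moving on, a small simulation sketch agrees with the Sum of Normals theorem (the particular numbers are arbitrary choices of mine):
set.seed(370)
x = rnorm(1e5, mean = 1, sd = 2)    ## X ~ N(1, 4)
y = rnorm(1e5, mean = 3, sd = 1)    ## Y ~ N(3, 1), independent of X
z = 2 * x - 1 * y + 5               ## a = 2, b = -1, c = 5
mean(z)    ## should be near a*mu_x + b*mu_y + c = 4
var(z)     ## should be near a^2*sigma_x^2 + b^2*sigma_y^2 = 17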
A Proof of the Central Limit Theorem*
And so, with all of this preparation, we are now ready to prove the Central Limit Theorem.
Let \(X\) be a random variable with expected value \(\mu\) and finite variance \(\sigma^2\). Let \(\{X_1, X_2, \ldots, X_n\}\) be a random sample of size \(n\) from this distribution. Finally, define \(T_n\) as
\begin{equation}
T_n \stackrel{\text{def}}{=} \sum_{i=1}^n\ X_i
\end{equation}
Then,
\begin{equation}
T_n \stackrel{d}{\to} N\left(n\mu,\ n\sigma^2\right)
\end{equation}
as \(n \to \infty\).
Here is an overview of this proof: This proof of the Central Limit Theorem relies on moment generating functions and Taylor's theorem. It starts with the distribution of \(T_n\), which we need to determine. It then creates two helper random variables, \(Y_n\) and \(Z_i\), to simplify the calculations. The moment generating function for \(Z_i\) is approximated using Taylor's theorem. From that, the moment generating function for \(Y_n\) is determined. That provides us the approximate distribution of \(Y_n\), which is \(N(0,\ 1)\). With that, we determine the approximate distribution of \(T_n\), as required.
With that overview, let us start the actual proof.
Proof.
Let \(X\) be a random variable with mean \(\mu\) and finite variance \(\sigma^2\). Let \(\{X_1, X_2, \ldots, X_n\}\) be a random sample from \(X\). The sum \(T_n \stackrel{\text{def}}{=} \sum X_i\) has mean \(n\mu\) and variance \(n\sigma^2\).
Define the random variable \(Y_n\) as
\begin{equation}
Y_n \stackrel{\text{def}}{=} \frac{T_n - n\mu}{\sqrt{n\sigma^2}}
\end{equation}
In other words, the \(Y_n\) random variable is just the \(T_n\) random variable standardized.
Note that this definition is equivalent to
\begin{equation}
Y_n = \sum_{i=1}^n\ \frac{X_i - \mu}{\sqrt{n\sigma^2}} \label{eq:CLTy}
\end{equation}
Eventually, we will show that the distribution of \(Y_n\) converges to the standard Normal as \(n \to \infty\). For now, we use it as an intermediate between \(T_n\) and \(Z_i\). Define \(Z_i\) as
\begin{equation}
Z_i \stackrel{\text{def}}{=} \frac{X_i - \mu}{\sigma}
\end{equation}
which is the z-score for the \(X_i\) random variable. It is useful, because we can now see
\begin{equation}
Y_n = \sum_{i=1}^n \frac{1}{\sqrt{n}} Z_i
\end{equation}
This link between \(Z_i\) and \(Y_n\) will be exploited later, when we finally approximate the distribution of \(Z_i\) and use it to approximate the distribution of \(Y_n\).
With that literary foreshadowing, note that the moment generating function of \(Y_n\), in terms of the \(Z_i\), is
\begin{equation}
M_y(t) = \left( M_z\left(\frac{t}{\sqrt{n}}\right) \right)^n
\end{equation}
This result relies on the \(Z_i\) being independent and identically distributed (i.e., a random sample). The moment generating function for \(Z_1\) (or any of the \(Z_i\)) can be approximated by using Taylor's theorem (expanded around \(t=0\)):
\begin{align}
M_z\left(\frac{t}{\sqrt{n}}\right) &= \sum_{i=0}^{\infty}\ \left(\frac{t}{\sqrt{n}}\right)^i \frac{M_z^{(i)}(0)}{i!} \\
&= \frac{M_z^{(0)}(0)}{0!} + \frac{t}{\sqrt{n}}\ \frac{M_z^{(1)}(0)}{1!} + \left(\frac{t}{\sqrt{n}}\right)^2 \frac{M_z^{(2)}(0)}{2!} + \cdots \\
&= 1 + \frac{t}{\sqrt{n}}\ E[Z_i] + \frac{t^2}{2n}\ E[Z_i^2] + \cdots \\
&= 1 + 0 + \frac{t^2}{2n} + \cdots
\end{align}
The last line follows because the \(Z_i\) are standardized: \(E[Z_i] = 0\) and \(E[Z_i^2] = V[Z_i] = 1\).
If we include the remainder term from Taylor's theorem (the \(\cdots\) above), then we have
\begin{equation}
M_z\left(\frac{t}{\sqrt{n}}\right) \approx 1 + \frac{t^2}{2n} + \xi\frac{t^3}{6n^{3/2}} + o\left(\frac{t^3}{n^{3/2}}\right)
\end{equation}
The last two terms are the remainder. Here, \(\xi\) is a constant, and \(o\left(\frac{t^3}{n^{3/2}}\right)\) is "little-o" notation indicating that the remainder (whatever it is) goes to 0 faster than \(\frac{t^3}{n^{3/2}}\) does (as \(n \to \infty\)).
Now, we return to \(Y_n\). It was defined as \(\sum_{i=1}^n \frac{1}{\sqrt{n}} Z_i\), where the \(Z_i\) were independent and identically distributed. This means the moment generating function of \(Y_n\) is
\begin{equation}
M_y(t) \approx \left( 1 + \frac{t^2}{2n} + \xi\frac{t^3}{6n^{3/2}} + o\left(\frac{t^3}{n^{3/2}}\right) \right)^n
\end{equation}
The third term goes to 0 as \(n \to \infty\) quickly enough that it does not affect the limit, as does the fourth term. The remaining two terms give
\begin{equation}
M_y(t) \approx \left( 1 + \frac{t^2/2}{n} \right)^n; \qquad n \to \infty
\end{equation}
Note that \(\left( 1 + \frac{t^2/2}{n} \right)^n \to e^{\frac{1}{2} t^2}\) as \(n \to \infty\); this is just the limit definition of the exponential function. This means we have the limiting MGF for \(Y_n\):
\begin{equation}
M_y(t) \to e^{\frac{1}{2} t^2}
\end{equation}
\ldots\ and this is just the moment generating function for a standard Normal distribution, \(N(0,\ 1)\). Thus, \(Y_n\) will approach \(N(0,\ 1)\) as \(n \to \infty\).
Finally, using our definition of \(Y_n\) (Eqn \(\ref{eq:CLTy}\)), we can conclude that the distribution of \(T_n\) converges to \(N(n\mu,\ n\sigma^2)\) as \(n \to \infty\).
\(\blacksquare\)
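To see the theorem in action, here is a simulation sketch that starts with a decidedly non-Normal distribution, the Exponential (my own choice of example):
set.seed(370)
n = 50
## T_n = sum of n Exp(1) values; for Exp(1), mu = 1 and sigma^2 = 1
totals = replicate(1e4, sum(rexp(n, rate = 1)))
hist(totals, breaks = 50)    ## bell-shaped, centered near n*mu = 50
mean(totals)                 ## near n*mu = 50
var(totals)                  ## near n*sigma^2 = 50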
And so, in this final section of the appendix, we were finally able to prove the Central Limit Theorem. To do so, we needed to learn about moment generating functions... and remember our Taylor Series expansion.
Again, the only purpose of moment generating functions for this course of study is to be able to prove the Central Limit Theorem. I included the proof of the Central Limit Theorem simply because many have asked for it in the past.
Et voilà!


