7.3: Difference of two means
In this section we consider a difference in two population means, \(\mu_1 - \mu_2\), under the condition that the data are not paired. Just as with a single sample, we identify conditions to ensure we can use the \(t\)-distribution with a point estimate of the difference, \(\bar{x}_1 - \bar{x}_2\), and a new standard error formula. Other than these two differences, the details are almost identical to the one-mean procedures.
We apply these methods in three contexts: determining whether stem cells can improve heart function, examining the relationship between pregnant women's smoking habits and the birth weights of their newborns, and evaluating whether there is statistically significant evidence that one variation of an exam is harder than another. This section is motivated by questions like "Is there convincing evidence that newborns from mothers who smoke have a different average birth weight than newborns from mothers who don't smoke?"
Confidence interval for a difference of means
Does treatment using embryonic stem cells (ESCs) help improve heart function following a heart attack? Figure [statsSheepEscStudy] contains summary statistics for an experiment to test ESCs in sheep that had a heart attack. Each of these sheep was randomly assigned to the ESC or control group, and the change in their hearts’ pumping capacity was measured in the study. Figure [stemCellTherapyForHearts] provides histograms of the two data sets. A positive value corresponds to increased pumping capacity, which generally suggests a stronger recovery. Our goal will be to identify a 95% confidence interval for the effect of ESCs on the change in heart pumping capacity relative to the control group.
| | \(n\) | \(\bar{x}\) | \(s\) |
|---|---|---|---|
| ESCs | 9 | 3.50 | 5.17 |
| control | 9 | -4.33 | 2.76 |
The point estimate of the difference in the heart pumping variable is straightforward to find: it is the difference in the sample means.
\[\begin{aligned} \bar{x}_{esc} - \bar{x}_{control}\ =\ 3.50 - (-4.33)\ =\ 7.83\end{aligned}\]
For the question of whether we can model this difference using a \(t\)-distribution, we'll need to check new conditions. As in the two-proportion case, we require a more robust version of independence so we are confident the two groups are also independent of each other. Secondly, we check for normality in each group separately, which in practice is a check for outliers.
Using the \(t\)-distribution for a difference in means [ConditionsForTwoSampleTDist] The \(t\)-distribution can be used for inference when working with the standardized difference of two means if
- Independence, extended. The data are independent within and between the two groups, e.g. the data come from independent random samples or from a randomized experiment.
- Normality. We check the outlier rules of thumb for each group separately.
The standard error may be computed as
\[\begin{aligned} SE_{\bar{x}_{1} - \bar{x}_{2}} = \sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}} \approx \sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}} \end{aligned}\]
The official formula for the degrees of freedom is quite complex and is generally computed using software, so instead you may use the smaller of \(n_1 - 1\) and \(n_2 - 1\) for the degrees of freedom if software isn’t readily available.
Can the \(t\)-distribution be used to make inference using the point estimate, \(\bar{x}_{esc} - \bar{x}_{control} = 7.83\)? First, we check for independence. Because the sheep were randomized into the groups, independence within and between groups is satisfied.
Figure [stemCellTherapyForHearts] does not reveal any clear outliers in either group. (The ESC group does look a bit more variable, but this is not the same as having clear outliers.)
With both conditions met, we can use the \(t\)-distribution to model the difference of sample means.
As in the one-sample case, the population standard deviations are unknown, so we compute the standard error using the sample standard deviations in their place:
\[\begin{aligned} SE_{\bar{x}_{esc} - \bar{x}_{control}} = \sqrt{\frac{s_{esc}^2}{n_{esc}} + \frac{s_{control}^2}{n_{control}}} = \sqrt{\frac{5.17^2}{9} + \frac{2.76^2}{9}} = 1.95\end{aligned}\]
Generally, we use statistical software to find the appropriate degrees of freedom, or if software isn’t available, we can use the smaller of \(n_1 - 1\) and \(n_2 - 1\) for the degrees of freedom, e.g. if using a \(t\)-table to find tail areas. For transparency in the Examples and Guided Practice, we’ll use the latter approach for finding \(df\); in the case of the ESC example, this means we’ll use \(df = 8\).
Calculate a 95% confidence interval for the effect of ESCs on the change in heart pumping capacity of sheep after they've suffered a heart attack. We will use the sample difference and the standard error computed earlier:
\[\begin{aligned} \bar{x}_{esc} - \bar{x}_{control} = 7.83 && SE = \sqrt{\frac{5.17^2}{9} + \frac{2.76^2}{9}} = 1.95 \end{aligned}\]
Using \(df = 8\), we can identify the critical value of \(t^{\star}_{8} = 2.31\) for a 95% confidence interval. Finally, we can enter the values into the confidence interval formula:
\[\begin{aligned} \text{point estimate} \ \pm\ t^{\star} \times SE \quad\rightarrow\quad 7.83 \ \pm\ 2.31\times 1.95 \quad\rightarrow\quad (3.32, 12.34) \end{aligned}\]
We are 95% confident that embryonic stem cells improve the heart’s pumping function in sheep that have suffered a heart attack by 3.32% to 12.34%.
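The interval above can be reproduced with a short script. Below is a minimal sketch in Python, assuming SciPy is available; the variable names are ours, not from the text.

```python
import math
from scipy import stats

# Summary statistics from the ESC sheep study
n_esc, xbar_esc, s_esc = 9, 3.50, 5.17
n_ctrl, xbar_ctrl, s_ctrl = 9, -4.33, 2.76

point_est = xbar_esc - xbar_ctrl                       # 7.83
se = math.sqrt(s_esc**2 / n_esc + s_ctrl**2 / n_ctrl)  # about 1.95
df = min(n_esc - 1, n_ctrl - 1)                        # conservative choice: df = 8

t_star = stats.t.ppf(0.975, df)                        # about 2.31 for 95% confidence
lower = point_est - t_star * se
upper = point_est + t_star * se                        # interval roughly (3.32, 12.34)
```

Using the unrounded \(t^{\star}_{8} = 2.306\) instead of 2.31 changes the endpoints only in the second decimal place.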
As with past statistical inference applications, there is a well-trodden procedure.
- Prepare. Retrieve critical contextual information, and if appropriate, set up hypotheses.
- Check. Ensure the required conditions are reasonably satisfied.
- Calculate. Find the standard error, and then construct a confidence interval, or if conducting a hypothesis test, find a test statistic and p-value.
- Conclude. Interpret the results in the context of the application.
The details change a little from one setting to the next, but this general approach remains the same.
Hypothesis tests for the difference of two means
A data set representing a random sample of 150 cases of mothers and their newborns in North Carolina over a year is summarized below; four cases from this data set are shown in Figure [babySmokeDF]. We are particularly interested in two variables: weight, the weight of the newborn, and smoke, which records whether the mother smoked during pregnancy. We would like to know, is there convincing evidence that newborns from mothers who smoke have a different average birth weight than newborns from mothers who don't smoke? We will use the North Carolina sample to try to answer this question. The smoking group includes 50 cases and the nonsmoking group contains 100 cases.
| | fage | mage | weeks | weight | sex | smoke |
|---|---|---|---|---|---|---|
| 1 | NA | 13 | 37 | 5.00 | female | nonsmoker |
| 2 | NA | 14 | 36 | 5.88 | female | nonsmoker |
| 3 | 19 | 15 | 41 | 8.13 | male | smoker |
| \(\vdots\) | \(\vdots\) | \(\vdots\) | \(\vdots\) | \(\vdots\) | \(\vdots\) | |
| 150 | 45 | 50 | 36 | 9.25 | female | nonsmoker |
Set up appropriate hypotheses to evaluate whether there is a relationship between a mother smoking and average birth weight. [babySmokeHTForWeight] The null hypothesis represents the case of no difference between the groups.
- \(H_0\): There is no difference in average birth weight for newborns from mothers who did and did not smoke. In statistical notation: \(\mu_{n} - \mu_{s} = 0\), where \(\mu_{n}\) represents non-smoking mothers and \(\mu_s\) represents mothers who smoked.
- \(H_A\): There is some difference in average newborn weights from mothers who did and did not smoke (\(\mu_{n} - \mu_{s} \neq 0\)).
We check the two conditions necessary to model the difference in sample means using the \(t\)-distribution.
- Because the data come from a simple random sample, the observations are independent, both within and between samples.
- With both data sets over 30 observations, we inspect the data in Figure [babySmokePlotOfTwoGroupsToExamineSkew] for any particularly extreme outliers and find none.
Since both conditions are satisfied, the difference in sample means may be modeled using a \(t\)-distribution.
[babySmokeCalcForWeight] The summary statistics in Figure [SumStatsBirthWeightNewbornsSmoke] may be useful for this Guided Practice.
- What is the point estimate of the population difference, \(\mu_{n} - \mu_{s}\)?
- Compute the standard error of the point estimate from part (a).
| | smoker | nonsmoker |
|---|---|---|
| mean | 6.78 | 7.18 |
| st. dev. | 1.43 | 1.60 |
| samp. size | 50 | 100 |
Complete the hypothesis test started in Example [babySmokeHTForWeight] and Guided Practice [babySmokeCalcForWeight]. Use a significance level of \(\alpha=0.05\). For reference, \(\bar{x}_{n} - \bar{x}_{s} = 0.40\), \(SE = 0.26\), and the sample sizes were \(n_n = 100\) and \(n_s = 50\). [babySmokeHTForWeightComputePValueAndEvalHT] We can find the test statistic for this test using the values from Guided Practice [babySmokeCalcForWeight]:
\[\begin{aligned} T = \frac{\ 0.40 - 0\ }{0.26} = 1.54 \end{aligned}\]
The p-value is represented by the two tail areas of the \(t\)-distribution beyond \(\pm 1.54\).
We find the single tail area using software (or the \(t\)-table in Appendix [tDistributionTable]). We’ll use the smaller of \(n_n - 1 = 99\) and \(n_s - 1 = 49\) as the degrees of freedom: \(df = 49\). The one tail area is 0.065; doubling this value gives the two-tail area and p-value, 0.135.
The p-value is larger than the significance value, 0.05, so we do not reject the null hypothesis. There is insufficient evidence to say there is a difference in average birth weight of newborns from North Carolina mothers who did smoke during pregnancy and newborns from North Carolina mothers who did not smoke during pregnancy.
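The test statistic and p-value above can be checked numerically. A sketch in Python using SciPy follows; the computed values differ slightly from the text because the text rounds the standard error to 0.26.

```python
import math
from scipy import stats

# Summary statistics for birth weight by smoking status
n_s, xbar_s, s_s = 50, 6.78, 1.43     # smokers
n_n, xbar_n, s_n = 100, 7.18, 1.60    # nonsmokers

point_est = xbar_n - xbar_s                    # 0.40
se = math.sqrt(s_n**2 / n_n + s_s**2 / n_s)    # about 0.26
t_stat = (point_est - 0) / se                  # null value is 0
df = min(n_n - 1, n_s - 1)                     # conservative choice: df = 49
p_value = 2 * stats.t.sf(abs(t_stat), df)      # two-sided p-value, about 0.13
```

Since the p-value exceeds 0.05, the code agrees with the conclusion in the text: we do not reject the null hypothesis.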
We’ve seen much research suggesting smoking is harmful during pregnancy, so how could we fail to reject the null hypothesis in Example [babySmokeHTForWeightComputePValueAndEvalHT]?
[babySmokeHTIDingHowToDetectDifferences] If we made a Type 2 Error and there is a difference, what could we have done differently in data collection to be more likely to detect the difference?
Public service announcement: while we have used this relatively small data set as an example, larger data sets show that women who smoke tend to have smaller newborns. In fact, some in the tobacco industry actually had the audacity to tout that as a benefit of smoking:
It’s true. The babies born from women who smoke are smaller, but they’re just as healthy as the babies born from women who do not smoke. And some women would prefer having smaller babies.
- Joseph Cullman, Philip Morris’ Chairman of the Board
...on CBS’ Face the Nation, Jan 3, 1971
Fact check: the babies from women who smoke are not actually as healthy as the babies from women who do not smoke.
Case study: two versions of a course exam
An instructor decided to run two slight variations of the same exam. Prior to passing out the exams, she shuffled the exams together to ensure each student received a random version. Summary statistics for how students performed on these two exams are shown in Figure [summaryStatsForTwoVersionsOfExams]. Anticipating complaints from students who took Version B, she would like to evaluate whether the difference observed in the groups is so large that it provides convincing evidence that Version B was more difficult (on average) than Version A.
| Version | \(n\) | \(\bar{x}\) | \(s\) | min | max |
|---|---|---|---|---|---|
| A | 30 | 79.4 | 14 | 45 | 100 |
| B | 27 | 74.1 | 20 | 32 | 100 |
[htSetupForEvaluatingTwoExamVersions] Construct hypotheses to evaluate whether the observed difference in sample means, \(\bar{x}_A - \bar{x}_B=5.3\), is due to chance. We will later evaluate these hypotheses using \(\alpha = 0.01\).
[conditionsForTDistForEvaluatingTwoExamVersions] To evaluate the hypotheses in Guided Practice [htSetupForEvaluatingTwoExamVersions] using the \(t\)-distribution, we must first verify conditions.
- Does it seem reasonable that the scores are independent?
- Any concerns about outliers?
After verifying the conditions for each sample and confirming the samples are independent of each other, we are ready to conduct the test using the \(t\)-distribution. In this case, we are estimating the true difference in average test scores using the sample data, so the point estimate is \(\bar{x}_A - \bar{x}_B = 5.3\). The standard error of the estimate can be calculated as
\[\begin{aligned} SE = \sqrt{\frac{s_A^2}{n_A} + \frac{s_B^2}{n_B}} = \sqrt{\frac{14^2}{30} + \frac{20^2}{27}} = 4.62\end{aligned}\]
Finally, we construct the test statistic:
\[\begin{aligned} T = \frac{\text{point estimate} - \text{null value}}{SE} = \frac{(79.4-74.1) - 0}{4.62} = 1.15\end{aligned}\]
If we have a computer handy, we can identify the degrees of freedom as 45.97. Otherwise we use the smaller of \(n_1-1\) and \(n_2-1\): \(df=26\).
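The "official formula" software uses is the Welch-Satterthwaite approximation. A sketch in Python follows; the function name is ours.

```python
def welch_df(s1, n1, s2, n2):
    """Welch-Satterthwaite approximation to the degrees of freedom."""
    v1, v2 = s1**2 / n1, s2**2 / n2          # per-group variance of the sample mean
    return (v1 + v2)**2 / (v1**2 / (n1 - 1) + v2**2 / (n2 - 1))

df = welch_df(14, 30, 20, 27)                # exam data: about 45.97
```

Note how the approximation (45.97) sits between the conservative choice, \(\min(n_1-1, n_2-1) = 26\), and the pooled value, \(n_1 + n_2 - 2 = 55\).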
Identify the p-value depicted in Figure [pValueOfTwoTailAreaOfExamVersionsWhereDFIs26] using \(df = 26\), and provide a conclusion in the context of the case study. Using software, we can find the one-tail area (0.13) and then double this value to get the two-tail area, which is the p-value: 0.26. (Alternatively, we could use the \(t\)-table in Appendix [tDistributionTable].)
In Guided Practice [htSetupForEvaluatingTwoExamVersions], we specified that we would use \(\alpha = 0.01\). Since the p-value is larger than \(\alpha\), we do not reject the null hypothesis. That is, the data do not convincingly show that one exam version is more difficult than the other, and the teacher should not be convinced that she should add points to the Version B exam scores.
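SciPy can run the whole test directly from the summary statistics. The minimal sketch below uses Welch's \(t\)-test, which reproduces the software-based degrees of freedom of about 45.97 rather than the conservative \(df = 26\), so its p-value matches the text's 0.26 only approximately.

```python
from scipy import stats

# Welch's t-test from summary statistics (does not assume equal variances)
t_stat, p_value = stats.ttest_ind_from_stats(
    mean1=79.4, std1=14, nobs1=30,    # Version A
    mean2=74.1, std2=20, nobs2=27,    # Version B
    equal_var=False)
# p_value is around 0.26, well above alpha = 0.01
```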
Pooled standard deviation estimate (special topic)
Occasionally, two populations will have standard deviations that are so similar that they can be treated as identical. For example, historical data or a well-understood biological mechanism may justify this strong assumption. In such cases, we can make the \(t\)-distribution approach slightly more precise by using a pooled standard deviation.
The pooled standard deviation of two groups is a way to use data from both samples to better estimate the standard deviation and standard error. If \(s_1^{}\) and \(s_2^{}\) are the standard deviations of groups 1 and 2 and there are very good reasons to believe that the population standard deviations are equal, then we can obtain an improved estimate of the group variances by pooling their data:
\[\begin{aligned} s_{pooled}^2 = \frac{s_1^2\times (n_1-1) + s_2^2\times (n_2-1)}{n_1 + n_2 - 2}\end{aligned}\]
where \(n_1\) and \(n_2\) are the sample sizes, as before. To use this new statistic, we substitute \(s_{pooled}^2\) in place of \(s_1^2\) and \(s_2^2\) in the standard error formula, and we use an updated formula for the degrees of freedom:
\[\begin{aligned} df = n_1 + n_2 - 2\end{aligned}\]
The benefits of pooling the standard deviation are realized through obtaining a better estimate of the standard deviation for each group and using a larger degrees of freedom parameter for the \(t\)-distribution. Both of these changes may permit a more accurate model of the sampling distribution of \(\bar{x}_1 - \bar{x}_2\), if the standard deviations of the two groups are indeed equal.
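The pooled estimate and the resulting standard error can be sketched as follows. The helper names are ours, and the numbers are the exam summary statistics used purely as a numerical illustration; recall that pooling is only appropriate when the population standard deviations are believed equal.

```python
import math

def pooled_sd(s1, n1, s2, n2):
    """Pooled standard deviation: a weighted average of the two sample variances."""
    sp2 = (s1**2 * (n1 - 1) + s2**2 * (n2 - 1)) / (n1 + n2 - 2)
    return math.sqrt(sp2)

def pooled_se(s1, n1, s2, n2):
    """SE of xbar1 - xbar2 with the pooled SD substituted for both s1 and s2."""
    sp = pooled_sd(s1, n1, s2, n2)
    return sp * math.sqrt(1 / n1 + 1 / n2)

# Illustration only (exam data); with pooling, df = n1 + n2 - 2 = 55
sp = pooled_sd(14, 30, 20, 27)
se = pooled_se(14, 30, 20, 27)
```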
Pool standard deviations only after careful consideration A pooled standard deviation is only appropriate when background research indicates the population standard deviations are nearly equal. When the sample sizes are large and the condition can be adequately checked with data, the benefits of pooling the standard deviations greatly diminish.


