Skip to main content
Statistics LibreTexts

15: Comparing means

  • Page ID
    6313
  • 15 Comparing means

    One of the most common questions that we want to ask in statistics is whether there is a difference between the means of two different groups. Let's say that we would like to know whether regular marijuana smokers watch more television. We can ask this question using the NHANES dataset; let's take a sample of 200 individuals from the dataset and test whether the number of hours of television watching per day is related to regular marijuana use. Figure 15.1 shows these data using a violin plot.

    # create sample with tv watching and marijuana use
    NHANES_sample <-
      NHANES_adult %>%
      drop_na(TVHrsDay, RegularMarij) %>%
      mutate(
        TVHrsNum = recode( #recode character values into numerical values
          TVHrsDay,
          "More_4_hr" = 5,
          "4_hr" = 4, 
          "2_hr" = 2,
          "1_hr" = 1, 
          "3_hr" = 3, 
          "0_to_1_hr" = 0.5,
          "0_hrs" = 0
        )
      ) %>%
      sample_n(200)
    Violin plot showing distributions of TV watching separated by regular marijuana use.

    Figure 15.1 Violin plot showing distributions of TV watching separated by regular marijuana use.

    15.1 Student's T test

    We have already encountered Student's t statistic in the previous chapter on hypothesis testing. This statistic provides us with a way to test for differences between two groups of independent observations; we will turn later in the chapter to cases where the observations are not independent. As a reminder, the t-statistic for comparison of two independent groups is computed as:

    t=X1X2S12n1+S22n2 t = \frac{\bar{X_1} - \bar{X_2}}{\sqrt{\frac{S_1^2}{n_1} + \frac{S_2^2}{n_2}}}

    where X1\bar{X}_1 and X2\bar{X}_2 are the means of the two groups, S12S^2_1 and S22S^2_2 are the variances for each of the groups, and n1n_1 and n2n_2 are the sizes of the two groups. Under the null hypothesis of no difference between means, this statistic is distributed according to a t distribution with n-2 degrees of freedom (since we have computed two parameter estimates, namely the means of the two groups). We can compute the t-test in R using the t.test() function. In this case, we started with the specific hypothesis that smoking marijuana is associated with greater TV watching, so we will use a one-tailed test. Since the t.test function orders the conditions alphabetically, the "No" group comes first, and thus we need to test the alternative hypothesis of whether the first group is less than the second ("Yes") group; for this reason, we specify 'less' as our alternative.

    # compute t test for tv watching as function of marijuana use
    t.test(
      TVHrsNum ~ RegularMarij,
      data = NHANES_sample,
      var.equal = TRUE,
      alternative = 'less'
    )
    ## 
    ##  Two Sample t-test
    ## 
    ## data:  TVHrsNum by RegularMarij
    ## t = -3, df = 200, p-value = 0.001
    ## alternative hypothesis: true difference in means is less than 0
    ## 95 percent confidence interval:
    ##   -Inf -0.29
    ## sample estimates:
    ##  mean in group No mean in group Yes 
    ##               2.2               2.8

    In this case we see that there is a statistically significant difference between groups, in the expected direction - regular pot smokers watch more TV.

    15.2 The t-test as a linear model

    The t-test is often presented as a specialized tool for comparing means, but it can also be viewed as an application of the general linear model. In this case, the model would look like this:

    BP^=β1^*Smoking+β0^ \hat{BP} = \hat{\beta_1}*Smoking + \hat{\beta_0} However, smoking is a binary variable, so we treat it as a dummy variable like we discussed in the previous chapter, setting it to a value of 1 for smokers and zero for nonsmokers. In that case, β1^\hat{\beta_1} is simply the difference in means between the two groups, and β0^\hat{\beta_0} is the mean for the group that was coded as zero. We can fit this model using the lm() function, and see that it gives the same t statistic as the t-test above:

    # print summary of linear regression to perform t-test
    s <- summary(lm(TVHrsNum ~ RegularMarij, data = NHANES_sample))
    s
    ## 
    ## Call:
    ## lm(formula = TVHrsNum ~ RegularMarij, data = NHANES_sample)
    ## 
    ## Residuals:
    ##    Min     1Q Median     3Q    Max 
    ## -2.810 -1.165 -0.166  0.835  2.834 
    ## 
    ## Coefficients:
    ##                 Estimate Std. Error t value Pr(>|t|)    
    ## (Intercept)        2.165      0.115   18.86   <2e-16 ***
    ## RegularMarijYes    0.645      0.213    3.02   0.0028 ** 
    ## ---
    ## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
    ## 
    ## Residual standard error: 1.4 on 198 degrees of freedom
    ## Multiple R-squared:  0.0441, Adjusted R-squared:  0.0393 
    ## F-statistic: 9.14 on 1 and 198 DF,  p-value: 0.00282

    We can also view the lm() results graphically (see Figure 15.2):

    Violin plots showing data for each group, with a blue line connecting the predicted values for each group, computed on the basis of the results of the linear model.

    Figure 15.2 Violin plots showing data for each group, with a blue line connecting the predicted values for each group, computed on the basis of the results of the linear model.

    In this case, the predicted value for nonsmokers is β0^\hat{\beta_0} (2.17) and the predicted value for smokers is β0^+β1^\hat{\beta_0} +\hat{\beta_1} (2.81).

    To compute the standard errors for this analysis, we can use exactly the same equations that we used for linear regression -- since this really is just another example of linear regression. In fact, if you compare the p-value from the t-test above with the p-value in the linear regression analysis for the marijuana use variable, you will see that the one from the linear regression analysis is exactly twice the one from the t-test, because the linear regression analysis is performing a two-tailed test.

    15.2.1 Effect sizes for comparing two means

    The most commonly used effect size for a comparison between two means is Cohen's d, which (as you may remember from Chapter 10) is an expression of the effect in terms of standard error units. For the t-test estimated using the general linear model outlined above (i.e. with a single dummy-coded variable), this is expressed as:

    d=beta1^SEresidual d = \frac{\hat{beta_1}}{SE_{residual}} We can obtain these values from the analysis output above, giving us a d = 0.47, which we would generally interpret as a medium sized effect.

    We can also compute R2R^2 for this analysis, which tells us how much variance in TV watching is accounted for. This value (which is reported in the summary of the lm() analysis) is 0.04, which tells us that while the effect may be statistically significant, it accounts for relatively little of the variance in TV watching.

    15.3 Bayes factor for mean differences

    As we discussed in the chapter on Bayesian analysis, Bayes factors provide a way to better quantify evidence in favor or against the null hypothesis of no difference. In this case, we want to specifically test against the null hypothesis that the difference is less than zero - because the difference is computed by the function between the first group ('No') and the second group ('Yes'). Thus, we specify a "null interval" going from zero to infinity, which means that the alternative is less than zero.

    # compute bayes factor for group comparison
    bf <- ttestBF(
      formula = TVHrsNum ~ RegularMarij, 
      data = NHANES_sample, 
      nullInterval = c(0, Inf)
    )
    bf
    ## Bayes factor analysis
    ## --------------
    ## [1] Alt., r=0.707 0<d<Inf    : 0.043 ±0%
    ## [2] Alt., r=0.707 !(0<d<Inf) : 22    ±0%
    ## 
    ## Against denominator:
    ##   Null, mu1-mu2 = 0 
    ## ---
    ## Bayes factor type: BFindepSample, JZS

    This shows us that the evidence against the null hypothesis is reasonably strong.

    15.4 Paired t-tests

    In experimental research, we often use within-subjects designs, in which we compare the same person on multiple measurements. For example, in the NHANES dataset blood pressure was measured three times. Let's say that we are interested in testing whether there is a difference in mean blood pressure between the first and second measurement (Figure 15.3).

    Violin plot of systolic blood pressure on first and second recording, from NHANES.

    Figure 15.3 Violin plot of systolic blood pressure on first and second recording, from NHANES.

    We see that there does not seem to be much of a difference in mean blood pressure between time points (about one point). First let's test for a difference using an independent samples t-test, which ignores the fact that pairs of data points come from the the same individuals.

    t.test(
      BPsys ~ timepoint,
      data = NHANES_sample_tidy,
      paired = FALSE, 
      var.equal = TRUE
    )
    ## 
    ##  Two Sample t-test
    ## 
    ## data:  BPsys by timepoint
    ## t = 0.5, df = 400, p-value = 0.6
    ## alternative hypothesis: true difference in means is not equal to 0
    ## 95 percent confidence interval:
    ##  -2.8  4.6
    ## sample estimates:
    ## mean in group BPSys1 mean in group BPSys2 
    ##                  122                  121

    This analysis shows no significant difference. However, this analysis is inappropriate since it assumes that the two samples are independent, when in fact they are not, since the data come from the same individuals. We can plot the data with a line for each individual to show this (see Figure 15.4).

    Violin plot of systolic BP on each recording, with lines connecting the two data points for each individual.

    Figure 15.4 Violin plot of systolic BP on each recording, with lines connecting the two data points for each individual.

    In this analysis, what we really care about is whether the blood pressure for each person changed in a systematic way between the two measurements, so another way to represent the data is to compute the difference between the two timepoints for each individual, and then analyze these difference scores rather than analyzing the individual measurements. In Figure 15.5, we show a histogram of these difference scores, with a blue line denoting the mean difference.

    Histogram of difference scores between first and second BP measurement.

    Figure 15.5 Histogram of difference scores between first and second BP measurement.

    15.4.1 Sign test

    One simple way to test for differences is using a test called the sign test, which asks whether the proportion of positive differences (ignoring their magnitude) is different than what we would expect by chance. To do this, we take the differences and compute their sign, and then we use a binomial test to ask whether the proportion of positive signs differs from 0.5.

    # compute sign test for differences between first and second measurement
    npos <- sum(NHANES_sample$diffPos)
    bt <- binom.test(npos, nrow(NHANES_sample))
    bt
    ## 
    ##  Exact binomial test
    ## 
    ## data:  npos and nrow(NHANES_sample)
    ## number of successes = 100, number of trials = 200, p-value = 0.4
    ## alternative hypothesis: true probability of success is not equal to 0.5
    ## 95 percent confidence interval:
    ##  0.46 0.60
    ## sample estimates:
    ## probability of success 
    ##                   0.53

    Here we see that the proportion of individuals with positive signs (0.53) is not large enough to be surprising under the null hypothesis of p=0.5p=0.5. However, one problem with the sign test is that it is throwing away information about the magnitude of the differences, and thus might be missing something.

    15.4.2 Paired t-test

    A more common strategy is to use a paired t-test, which is equivalent to a one-sample t-test for whether the mean difference between the measurements is zero. We can compute this using the t.test() function in R and setting paired=TRUE.

    # compute paired t-test
    t.test(BPsys ~ timepoint, data = NHANES_sample_tidy, paired = TRUE)
    ## 
    ##  Paired t-test
    ## 
    ## data:  BPsys by timepoint
    ## t = 2, df = 200, p-value = 0.02
    ## alternative hypothesis: true difference in means is not equal to 0
    ## 95 percent confidence interval:
    ##  0.17 1.69
    ## sample estimates:
    ## mean of the differences 
    ##                    0.93

    With this analyses we see that there is in fact a significant difference between the two measurements. Let's compute the Bayes factor to see how strong this effect is:

    # compute Bayes factor for paired t-test
    ttestBF(x = NHANES_sample$BPSys1, y = NHANES_sample$BPSys2, paired = TRUE)
    ## Bayes factor analysis
    ## --------------
    ## [1] Alt., r=0.707 : 1.3 ±0%
    ## 
    ## Against denominator:
    ##   Null, mu = 0 
    ## ---
    ## Bayes factor type: BFoneSample, JZS

    This shows us that although the effect was significant in a paired t-test, it actually provides very little evidence in favor of the alternative hypothesis.

    15.4.3 The paired t-test as a linear model

    We can also define the paired t-test in terms of a general linear model. To do this, we include all of the measurements for each subject as data points (within a tidy data frame). We then include in the model a variable that codes for the identity of each individual (in this case, the ID variable that contains a subject ID for each person). This is known as a mixed model, since it includes effects of independent variables as well as effects of individuals. The standard model fitting procedure lm() can't do this, but we can do it using the lmer() function from a popular R package called lme4, which is specialized for estimating mixed models. The (1|ID) in the formula tells lmer() to estimate a separate intercept (which is what the 1 refers to) for each value of the ID variable (i.e. for each individual in the dataset), and then estimate a common slope relating timepoint to BP.

    # compute mixed model for paired test
    
    lmrResult <- lmer(BPsys ~ timepoint + (1 | ID), data = NHANES_sample_tidy)
    summary(lmrResult)
    ## Linear mixed model fit by REML. t-tests use Satterthwaite's method [
    ## lmerModLmerTest]
    ## Formula: BPsys ~ timepoint + (1 | ID)
    ##    Data: NHANES_sample_tidy
    ## 
    ## REML criterion at convergence: 2982
    ## 
    ## Scaled residuals: 
    ##     Min      1Q  Median      3Q     Max 
    ## -2.6985 -0.4478  0.0058  0.3996  2.7395 
    ## 
    ## Random effects:
    ##  Groups   Name        Variance Std.Dev.
    ##  ID       (Intercept) 342      18.49   
    ##  Residual              15       3.87   
    ## Number of obs: 400, groups:  ID, 200
    ## 
    ## Fixed effects:
    ##                 Estimate Std. Error      df t value Pr(>|t|)    
    ## (Intercept)      121.770      1.336 207.545    91.2   <2e-16 ***
    ## timepointBPSys2   -0.930      0.387 199.000    -2.4    0.017 *  
    ## ---
    ## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
    ## 
    ## Correlation of Fixed Effects:
    ##             (Intr)
    ## tmpntBPSys2 -0.145

    You can see that this shows us a p-value that is very close to the result from the paired t-test computed using the t.test() function.

    15.5 Comparing more than two means

    Often we want to compare more than two means to determine whether any of them differ from one another. Let's say that we are analyzing data from a clinical trial for the treatment of high blood pressure. In the study, volunteers are randomized to one of three conditions: Drug 1, Drug 2 or placebo. Let's generate some data and plot them (see Figure 15.6)

    Box plots showing blood pressure for three different groups in our clinical trial.

    Figure 15.6 Box plots showing blood pressure for three different groups in our clinical trial.

    15.5.1 Analysis of variance

    We would first like to test the null hypothesis that the means of all of the groups are equal -- that is, neither of the treatments had any effect. We can do this using a method called analysis of variance (ANOVA). This is one of the most commonly used methods in psychological statistics, and we will only scratch the surface here. The basic idea behind ANOVA is one that we already discussed in the chapter on the general linear model, and in fact ANOVA is just a name for a specific implementation of such a model.

    Remember from the last chapter that we can partition the total variance in the data (SStotalSS_{total}) into the variance that is explained by the model (SSmodelSS_{model}) and the variance that is not (SSerrorSS_{error}). We can then compute a mean square for each of these by dividing them by their degrees of freedom; for the error this is NpN - p (where pp is the number of means that we have computed), and for the model this is p1p - 1:

    MSmodel=SSmodeldfmodel=SSmodelp1 MS_{model} =\frac{SS_{model}}{df_{model}}= \frac{SS_{model}}{p-1}

    MSerror=SSerrordferror=SSerrorNp MS_{error} = \frac{SS_{error}}{df_{error}} = \frac{SS_{error}}{N - p}

    With ANOVA, we want to test whether the variance accounted for by the model is greater than what we would expect by chance, under the null hypothesis of no differences between means. Whereas for the t distribution the expected value is zero under the null hypothesis, that's not the case here, since sums of squares are always positive numbers. Fortunately, there is another standard distribution that describes how ratios of sums of squares are distributed under the null hypothesis: The F distribution (see figure 15.7). This distribution has two degrees of freedom, which correspond to the degrees of freedom for the numerator (which in this case is the model), and the denominator (which in this case is the error).

    F distributions under the null hypothesis, for different values of degrees of freedom.

    Figure 15.7 F distributions under the null hypothesis, for different values of degrees of freedom.

    To create an ANOVA model, we extend the idea of dummy coding that you encountered in the last chapter. Remember that for the t-test comparing two means, we created a single dummy variable that took the value of 1 for one of the conditions and zero for the others. Here we extend that idea by creating two dummy variables, one that codes for the Drug 1 condition and the other that codes for the Drug 2 condition. Just as in the t-test, we will have one condition (in this case, placebo) that doesn't have a dummy variable, and thus represents the baseline against which the others are compared; its mean defines the intercept of the model. Let's create the dummy coding for drugs 1 and 2.

    # create dummy variables for drug1 and drug2
    df <-
      df %>%
      mutate(
        d1 = as.integer(group == "drug1"), # 1s for drug1, 0s for all other drugs
        d2 = as.integer(group == "drug2")  # 1s for drug2, 0s for all other drugs
      )

    Now we can fit a model using the same approach that we used in the previous chapter:

    # fit ANOVA model
    lmResultANOVA <- lm(sysBP ~ d1 + d2, data = df)
    summary(lmResultANOVA)
    ## 
    ## Call:
    ## lm(formula = sysBP ~ d1 + d2, data = df)
    ## 
    ## Residuals:
    ##     Min      1Q  Median      3Q     Max 
    ## -29.084  -7.745  -0.098   7.687  23.431 
    ## 
    ## Coefficients:
    ##             Estimate Std. Error t value Pr(>|t|)    
    ## (Intercept)   141.60       1.66   85.50  < 2e-16 ***
    ## d1            -10.24       2.34   -4.37  2.9e-05 ***
    ## d2             -2.03       2.34   -0.87     0.39    
    ## ---
    ## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
    ## 
    ## Residual standard error: 9.9 on 105 degrees of freedom
    ## Multiple R-squared:  0.169,  Adjusted R-squared:  0.154 
    ## F-statistic: 10.7 on 2 and 105 DF,  p-value: 5.83e-05

    The output from this command provides us with two things. First, it shows us the result of a t-test for each of the dummy variables, which basically tell us whether each of the conditions separately differs from placebo; it appears that Drug 1 does whereas Drug 2 does not. However, keep in mind that if we wanted to interpret these tests, we would need to correct the p-values to account for the fact that we have done multiple hypothesis tests; we will see an example of how to do this in the next chapter.

    Remember that the hypothesis that we started out wanting to test was whether there was any difference between any of the conditions; we refer to this as an omnibus hypothesis test, and it is the test that is provided by the F statistic. The F statistic basically tells us whether our model is better than a simple model that just includes an intercept. In this case we see that the F test is highly significant, consistent with our impression that there did seem to be differences between the groups (which in fact we know there were, because we created the data).