13.8: Effect Size
The most commonly used measure of effect size for a t-test is Cohen’s d (Cohen 1988). It’s a very simple measure in principle, with quite a few wrinkles when you start digging into the details. Cohen himself defined it primarily in the context of an independent samples t-test, specifically the Student test. In that context, a natural way of defining the effect size is to divide the difference between the means by an estimate of the standard deviation. In other words, we’re looking to calculate something along the lines of this:
\(d=\dfrac{(\text { mean } 1)-(\text { mean } 2)}{\text { std dev }}\)
and he suggested a rough guide for interpreting d, shown in the table below. You’d think that this would be pretty unambiguous, but it’s not, largely because Cohen wasn’t too specific about what he thought should be used as the measure of the standard deviation (in his defence, he was trying to make a broader point in his book, not nitpick about tiny details). As discussed by McGrath and Meyer (2006), there are several different versions in common usage, and each author tends to adopt slightly different notation. For the sake of simplicity (as opposed to accuracy) I’ll use d to refer to any statistic that you calculate from the sample, and use δ to refer to a theoretical population effect. Obviously, that does mean that there are several different things all called d. The `cohensD()` function in the `lsr` package uses the `method` argument to distinguish between them, so that’s what I’ll do in the text.
My suspicion is that the only time that you would want Cohen’s d is when you’re running a t-test, and if you’re using the `oneSampleTTest()`, `independentSamplesTTest()` and `pairedSamplesTTest()` functions to run your t-tests, then you don’t need to learn any new commands, because they automatically produce an estimate of Cohen’s d as part of the output. However, if you’re using `t.test()` then you’ll need to use the `cohensD()` function (also in the `lsr` package) to do the calculations.
| d-value | rough interpretation |
|---|---|
| about 0.2 | small effect |
| about 0.5 | moderate effect |
| about 0.8 | large effect |
Cohen’s d from one sample
The simplest situation to consider is the one corresponding to a one-sample t-test. In this case, there is one sample mean \(\bar{X}\) and one (hypothesised) population mean \(\mu_{0}\) to compare it to. Not only that, there’s really only one sensible way to estimate the population standard deviation: we just use our usual estimate \(\hat{\sigma}\). Therefore, we end up with the following as the only way to calculate d,
\(d=\dfrac{\bar{X}-\mu_{0}}{\hat{\sigma}}\)
When writing the `cohensD()` function, I’ve made some attempt to make it work in a similar way to `t.test()`. As a consequence, `cohensD()` can calculate your effect size regardless of which type of t-test you performed. If what you want is a measure of Cohen’s d to accompany a one-sample t-test, there are only two arguments that you need to care about. These are:
- `x`. A numeric vector containing the sample data.
- `mu`. The mean against which the mean of `x` is compared (default value is `mu = 0`).
We don’t need to specify what `method` to use, because there’s only one version of d that makes sense in this context. So, in order to compute an effect size for the data from Dr Zeppo’s class (Section 13.2), we’d type something like this:
cohensD( x = grades, # data are stored in the grades vector
mu = 67.5 # compare students to a mean of 67.5
)
## [1] 0.5041691
and, just so that you can see that there’s nothing fancy going on, the command below shows you how to calculate it if there were no fancypants `cohensD()` function available:
( mean(grades) - 67.5 ) / sd(grades)
## [1] 0.5041691
Yep, same number. Overall, then, the psychology students in Dr Zeppo’s class are achieving grades (mean = 72.3%) that are about .5 standard deviations higher than the level that you’d expect (67.5%) if they were performing at the same level as other students. Judged against Cohen’s rough guide, this is a moderate effect size.
Cohen’s d from a Student t test
The majority of discussions of Cohen’s d focus on a situation that is analogous to Student’s independent samples t-test, and it’s in this context that the story becomes messier, since there are several different versions of d that you might want to use in this situation. You can use the `method` argument to the `cohensD()` function to pick the one you want. To understand why there are multiple versions of d, it helps to take the time to write down a formula that corresponds to the true population effect size δ. It’s pretty straightforward,
\(\delta=\dfrac{\mu_{1}-\mu_{2}}{\sigma}\)
where, as usual, μ1 and μ2 are the population means corresponding to group 1 and group 2 respectively, and σ is the standard deviation (the same for both populations). The obvious way to estimate δ is to do exactly the same thing that we did in the t-test itself: use the sample means as the top line, and a pooled standard deviation estimate for the bottom line:
\(d=\dfrac{\bar{X}_{1}-\bar{X}_{2}}{\hat{\sigma}_{p}}\)
where \(\hat{\sigma}_{p}\) is the exact same pooled standard deviation measure that appears in the t-test. This is the most commonly used version of Cohen’s d when applied to the outcome of a Student t-test, and is sometimes referred to as Hedges’ g statistic (Hedges 1981). It corresponds to `method = "pooled"` in the `cohensD()` function, and it’s the default.
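To make the pooled formula concrete, here is a minimal sketch in base R that computes this version of d by hand. The two vectors `group1` and `group2` are made-up illustrative scores, not data from any of the classes discussed in this chapter, and the sketch keeps the sign of the mean difference.

```r
# Hypothetical scores for two independent groups (made-up numbers)
group1 <- c(72, 68, 75, 80, 66, 71)
group2 <- c(65, 70, 62, 68, 64, 67)

n1 <- length(group1)
n2 <- length(group2)

# Pooled SD: weight each sample variance by its degrees of freedom,
# then divide by N - 2, where N = n1 + n2
sd_pooled <- sqrt(((n1 - 1) * var(group1) + (n2 - 1) * var(group2)) / (n1 + n2 - 2))

# Difference between the sample means, in pooled-SD units
d_pooled <- (mean(group1) - mean(group2)) / sd_pooled
d_pooled
```

Because the same pooled standard deviation appears in the Student t-statistic, multiplying that t-statistic by sqrt(1/n1 + 1/n2) should recover the same value of d.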
However, there are other possibilities, which I’ll briefly describe. Firstly, you may have reason to want to use only one of the two groups as the basis for calculating the standard deviation. This approach (often called Glass’ Δ) makes most sense when you have good reason to treat one of the two groups as a purer reflection of “natural variation” than the other. This can happen if, for instance, one of the two groups is a control group. If that’s what you want, then use `method = "x.sd"` or `method = "y.sd"` when using `cohensD()`. Secondly, recall that in the usual calculation of the pooled standard deviation we divide by N−2 to correct for the bias in the sample variance; in one version of Cohen’s d this correction is omitted. Instead, we divide by N. This version (`method = "raw"`) makes sense primarily when you’re trying to calculate the effect size in the sample, rather than estimating an effect size in the population. Finally, there is a version based on Hedges and Olkin (1985), who point out there is a small bias in the usual (pooled) estimation for Cohen’s d. Thus they introduce a small correction (`method = "corrected"`), by multiplying the usual value of d by (N−3)/(N−2.25).
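The variants described above differ only in what goes on the bottom line, or in a final multiplier. The following base R sketch computes each one by hand, following the verbal descriptions in the text. The vectors `x` and `y` are hypothetical data, and the variable names are chosen here purely for illustration.

```r
# Hypothetical data: x could be a treatment group, y a control group
x <- c(72, 68, 75, 80, 66, 71)
y <- c(65, 70, 62, 68, 64, 67)

n_total   <- length(x) + length(y)   # N in the text
mean_diff <- mean(x) - mean(y)

# Within-group sums of squared deviations, added across the two groups
ss_within <- (length(x) - 1) * var(x) + (length(y) - 1) * var(y)

d_pooled    <- mean_diff / sqrt(ss_within / (n_total - 2))  # divide by N - 2
d_raw       <- mean_diff / sqrt(ss_within / n_total)        # divide by N instead
d_x_sd      <- mean_diff / sd(x)                            # SD of the x group only
d_y_sd      <- mean_diff / sd(y)                            # SD of the y group only
d_corrected <- d_pooled * (n_total - 3) / (n_total - 2.25)  # small-sample correction

round(c(pooled = d_pooled, raw = d_raw, x.sd = d_x_sd,
        y.sd = d_y_sd, corrected = d_corrected), 3)
```

Two patterns are worth noticing in the result: the raw version is always somewhat larger in magnitude than the pooled one (its denominator is smaller), and the corrected version is always somewhat smaller.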
In any case, ignoring all those variations that you could make use of if you wanted, let’s have a look at how to calculate the default version. In particular, suppose we look at the data from Dr Harpo’s class (the `harpo` data frame). The command that we want to use is very similar to the relevant `t.test()` command, but also specifies a `method`:
cohensD( formula = grade ~ tutor, # outcome ~ group
data = harpo, # data frame
method = "pooled" # which version to calculate?
)
## [1] 0.7395614
This is the version of Cohen’s d that gets reported by the `independentSamplesTTest()` function whenever it runs a Student t-test.
Cohen’s d from a Welch test
Suppose the situation you’re in is more like the Welch test: you still have two independent samples, but you no longer believe that the corresponding populations have equal variances. When this happens, we have to redefine what we mean by the population effect size. I’ll refer to this new measure as δ′, so as to keep it distinct from the measure δ which we defined previously. What Cohen (1988) suggests is that we could define our new population effect size by averaging the two population variances. What this means is that we get:
\(\delta^{\prime}=\dfrac{\mu_{1}-\mu_{2}}{\sigma^{\prime}}\)
where
\(\sigma^{\prime}=\sqrt{\dfrac{\sigma_{1}^{2}+\sigma_{2}^{2}}{2}}\)
This seems quite reasonable, but notice that none of the measures that we’ve discussed so far are attempting to estimate this new quantity. It might just be my own ignorance of the topic, but I’m only aware of one version of Cohen’s d that actually estimates the unequal-variance effect size δ′ rather than the equal-variance effect size δ. All we do to calculate d for this version (`method = "unequal"`) is substitute the sample means \(\bar{X}_{1}\) and \(\bar{X}_{2}\) and the corrected sample standard deviations \(\hat{\sigma}_{1}\) and \(\hat{\sigma}_{2}\) into the equation for δ′. This gives us the following equation for d,
\(d=\dfrac{\bar{X}_{1}-\bar{X}_{2}}{\sqrt{\dfrac{\hat{\sigma}_{1}^{2}+\hat{\sigma}_{2}^{2}}{2}}}\)
as our estimate of the effect size. There’s nothing particularly difficult about calculating this version in R, since all we have to do is change the `method` argument:
cohensD( formula = grade ~ tutor,
data = harpo,
method = "unequal"
)
## [1] 0.7244995
This is the version of Cohen’s d that gets reported by the `independentSamplesTTest()` function whenever it runs a Welch t-test.
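As a rough illustration of the unequal-variance formula, the base R sketch below computes d by averaging the two sample variances, and compares it with the pooled version. The data are hypothetical, and the group sizes are deliberately different: when the two groups are the same size, the pooled variance is just the simple average of the two variances, so the two versions of d coincide.

```r
# Hypothetical data with unequal group sizes (made-up numbers)
x <- c(72, 68, 75, 80, 66, 71)
y <- c(65, 70, 62, 67)

mean_diff <- mean(x) - mean(y)

# Unequal-variance version: simple average of the two sample variances
sd_average <- sqrt((var(x) + var(y)) / 2)
d_unequal  <- mean_diff / sd_average

# Pooled version for comparison: variances weighted by degrees of freedom
sd_pooled <- sqrt(((length(x) - 1) * var(x) + (length(y) - 1) * var(y)) /
                    (length(x) + length(y) - 2))
d_pooled  <- mean_diff / sd_pooled

c(unequal = d_unequal, pooled = d_pooled)
```

The gap between the two numbers comes entirely from how the variances are weighted, which is also why the `harpo` results differ between the two methods.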
Cohen’s d from a paired-samples test
Finally, what should we do for a paired samples t-test? In this case, the answer depends on what it is you’re trying to do.
If you want to measure your effect sizes relative to the distribution of difference scores, the measure of d that you calculate is just (`method = "paired"`)
\(d=\dfrac{\bar{D}}{\hat{\sigma}_{D}}\)
where \(\hat{\sigma}_{D}\) is the estimate of the standard deviation of the differences. The calculation here is pretty straightforward:
cohensD( x = chico$grade_test2,
y = chico$grade_test1,
method = "paired"
)
## [1] 1.447952
This is the version of Cohen’s d that gets reported by the `pairedSamplesTTest()` function. The only wrinkle is figuring out whether this is the measure you want or not. To the extent that you care about the practical consequences of your research, you often want to measure the effect size relative to the original variables, not the difference scores (e.g., the 1% improvement in Dr Chico’s class is pretty small when measured against the amount of between-student variation in grades), in which case you use the same versions of Cohen’s d that you would use for a Student or Welch test. For instance, when we do that for Dr Chico’s class,
cohensD( x = chico$grade_test2,
y = chico$grade_test1,
method = "pooled"
)
## [1] 0.2157646
what we see is that the overall effect size is quite small, when assessed on the scale of the original variables.
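The contrast between the two scales can be reproduced by hand. The base R sketch below uses made-up test scores for five hypothetical students (not the `chico` data), in which everyone improves by a small but consistent amount.

```r
# Hypothetical scores for the same five students on two tests
test1 <- c(60, 72, 55, 81, 68)
test2 <- c(62, 73, 57, 82, 71)

# Paired version: mean difference relative to the SD of the differences
diffs    <- test2 - test1
d_paired <- mean(diffs) / sd(diffs)

# Pooled version: same mean difference, relative to the SD of the
# original scores
n <- length(test1)
sd_pooled  <- sqrt(((n - 1) * var(test1) + (n - 1) * var(test2)) / (2 * n - 2))
d_original <- (mean(test2) - mean(test1)) / sd_pooled

c(paired = d_paired, pooled = d_original)
```

Because the improvements are consistent across students, the differences have a small standard deviation and the paired d is large. The spread between students is much wider, so the same improvement looks small on the original scale.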