
7.4: Power calculations for a difference of means


    When planning an experiment, there are often two competing considerations:

    • We want to collect enough data that we can detect important effects.
    • Collecting data can be expensive, and in experiments involving people, there may be some risk to patients.

    In this section, we focus on the context of a clinical trial, which is a health-related experiment where the subjects are people. We will determine an appropriate sample size where we can be 80% sure that we would detect any practically important effects.

    Going through the motions of a test

    We’re going to go through the motions of a hypothesis test. This will help us frame our calculations for determining an appropriate sample size for the study.

    Suppose a pharmaceutical company has developed a new drug for lowering blood pressure, and they are preparing a clinical trial (experiment) to test the drug’s effectiveness. They recruit people who are taking a particular standard blood pressure medication. People in the control group will continue to take their current medication through generic-looking pills to ensure blinding. Write down the hypotheses for a two-sided hypothesis test in this context. Generally, clinical trials use a two-sided alternative hypothesis, so below are suitable hypotheses for this context:

    \(H_0\):

    The new drug performs exactly as well as the standard medication.
    \(\mu_{trmt} - \mu_{ctrl} = 0\).

    \(H_A\):

    The new drug’s performance differs from the standard medication.
    \(\mu_{trmt} - \mu_{ctrl} \neq 0\).

    The researchers would like to run the clinical trial on patients with systolic blood pressures between 140 and 180 mmHg. Suppose previously published studies suggest that the standard deviation of the patients’ blood pressures will be about 12 mmHg and the distribution of patient blood pressures will be approximately symmetric. If we had 100 patients per group, what would be the approximate standard error for \(\bar{x}_{trmt} - \bar{x}_{ctrl}\)? The standard error is calculated as follows:

    \[\begin{aligned} SE_{\bar{x}_{trmt} - \bar{x}_{ctrl}} = \sqrt{\frac{s_{trmt}^2}{n_{trmt}} + \frac{s_{ctrl}^2}{n_{ctrl}}} = \sqrt{\frac{12^2}{100} + \frac{12^2}{100}} = 1.70 \end{aligned}\]

    This may be an imperfect estimate of \(SE_{\bar{x}_{trmt} - \bar{x}_{ctrl}}\), since the standard deviation estimate we used may not be perfectly correct for this group of patients. However, it is sufficient for our purposes.
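    The standard error arithmetic can be checked with a few lines of Python (a sketch; the group sizes and standard deviations are the values assumed in the text):

```python
from math import sqrt

# Standard error for a difference of two sample means, assuming
# (as in the text) s = 12 mmHg and n = 100 patients in each group.
s_trmt, s_ctrl = 12, 12
n_trmt, n_ctrl = 100, 100

se = sqrt(s_trmt**2 / n_trmt + s_ctrl**2 / n_ctrl)
print(round(se, 2))  # 1.7
```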

    What does the null distribution of \(\bar{x}_{trmt} - \bar{x}_{ctrl}\) look like? The degrees of freedom are greater than 30, so the distribution of \(\bar{x}_{trmt} - \bar{x}_{ctrl}\) will be approximately normal. The standard deviation of this distribution (the standard error) would be about 1.70, and under the null hypothesis, its mean would be 0.

    For what values of \(\bar{x}_{trmt} - \bar{x}_{ctrl}\) would we reject the null hypothesis? For \(\alpha = 0.05\), we would reject \(H_0\) if the difference is in the lower 2.5% or upper 2.5% tail:

    Lower 2.5%:

    For the normal model, this is 1.96 standard errors below 0, so any difference smaller than \(-1.96 \times 1.70 = -3.332\) mmHg.

    Upper 2.5%:

    For the normal model, this is 1.96 standard errors above 0, so any difference larger than \(1.96 \times 1.70 = 3.332\) mmHg.

    These rejection boundaries fall at \(-3.332\) and \(+3.332\) mmHg on the null distribution.
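    These cutoffs can be reproduced with a short computation (a sketch in Python using the standard library's NormalDist; the small difference from \(\pm 3.332\) comes from rounding the SE to 1.70 in the hand calculation):

```python
from math import sqrt
from statistics import NormalDist

# SE for 100 patients per group with s = 12 mmHg (as in the text)
se = sqrt(12**2 / 100 + 12**2 / 100)          # about 1.697
z_crit = NormalDist().inv_cdf(1 - 0.05 / 2)   # about 1.96 for alpha = 0.05

lower, upper = -z_crit * se, z_crit * se
print(round(lower, 2), round(upper, 2))       # -3.33 3.33
```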

    Next, we’ll perform some hypothetical calculations to determine the probability we reject the null hypothesis, if the alternative hypothesis were actually true.

    Computing the power for a 2-sample test

    When planning a study, we want to know how likely we are to detect an effect we care about. In other words, if there is a real effect, and that effect is large enough that it has practical value, then what’s the probability that we detect that effect? This probability is called the power of the test, and we can compute it for different sample sizes or for different effect sizes.

    We first determine what constitutes a practically significant result. Suppose that the company researchers care about finding any effect on blood pressure that is 3 mmHg or larger versus the standard medication. Here, 3 mmHg is the minimum effect size of interest, and we want to know how likely we are to detect an effect of at least this size in the study.

    Suppose we decided to move forward with 100 patients per treatment group and the new drug reduces blood pressure by an additional 3 mmHg relative to the standard medication. What is the probability that we detect a drop? [PowerFor100AtNeg3] Before we even do any calculations, notice that if \(\bar{x}_{trmt} - \bar{x}_{ctrl} = -3\) mmHg, there wouldn’t even be sufficient evidence to reject \(H_0\). That’s not a good sign.

    To calculate the probability that we will reject \(H_0\), we need to determine a few things:

    • The sampling distribution for \(\bar{x}_{trmt} - \bar{x}_{ctrl}\) when the true difference is -3 mmHg. This is the same as the null distribution, except it is shifted to the left by 3 mmHg.
    • The rejection regions, which lie beyond the boundaries identified earlier (\(\pm 3.332\) mmHg).
    • The fraction of the distribution that falls in the rejection region.

    In short, we need to calculate the probability that \(\bar{x}_{trmt} - \bar{x}_{ctrl} < -3.332\) for a normal distribution with mean -3 and standard deviation 1.7.

    We’ll use a normal approximation, which is a good approximation when the degrees of freedom are about 30 or more. We’ll start by calculating the Z-score and then find the tail area using either statistical software or a probability table:

    \[\begin{aligned} Z = \frac{-3.332 - (-3)}{1.7} = -0.20 \qquad \to \qquad 0.42 \end{aligned}\]

    The power for the test is about 42% when \(\mu_{trmt} - \mu_{ctrl} = -3\) and each group has a sample size of 100.

    In Example [PowerFor100AtNeg3], we ignored the upper rejection region in the calculation, which was in the opposite direction of the hypothetical truth, i.e. -3. The reasoning? There wouldn’t be any value in rejecting the null hypothesis and concluding there was an increase when in fact there was a decrease.

    We’ve also used a normal distribution instead of the \(t\)-distribution. This is a convenience, and if the sample size were too small, we’d need to revert to the \(t\)-distribution. We’ll discuss this a bit further at the end of this section.

    Determining a proper sample size

    In the last example, we found that if we have a sample size of 100 in each group, we can only detect an effect size of 3 mmHg with a probability of about 0.42. Suppose the researchers moved forward and only used 100 patients per group, and the data did not support the alternative hypothesis, i.e. the researchers did not reject \(H_0\). This is a very bad situation to be in for a few reasons:

    • In the back of the researchers’ minds, they’d all be wondering, maybe there is a real and meaningful difference, but we weren’t able to detect it with such a small sample.
    • The company probably invested hundreds of millions of dollars in developing the new drug, so now they are left with great uncertainty about its potential since the experiment didn’t have a great shot at detecting effects that could still be important.
    • Patients were subjected to the drug, and we can’t even say with much certainty that the drug doesn’t help (or harm) patients.
    • Another clinical trial may need to be run to get a more conclusive answer as to whether the drug does hold any practical value, and conducting a second clinical trial may take years and many millions of dollars.

    We want to avoid this situation, so we need to determine an appropriate sample size to ensure we can be pretty confident that we’ll detect any effects that are practically important. As mentioned earlier, a change of 3 mmHg was deemed to be the minimum difference that was practically important. As a first step, we could calculate power for several different sample sizes. For instance, let’s try 500 patients per group.

    Calculate the power to detect a change of -3 mmHg when using a sample size of 500 per group.

    1. Determine the standard error (recall that the standard deviation for patients was expected to be about 12 mmHg).
    2. Identify the null distribution and rejection regions.
    3. Identify the alternative distribution when \(\mu_{trmt} - \mu_{ctrl} = -3\).
    4. Compute the probability we reject the null hypothesis.
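    The four steps above can be carried out numerically (a sketch in Python, under the same assumptions as before: s = 12 mmHg per group and a true difference of -3 mmHg):

```python
from math import sqrt
from statistics import NormalDist

n = 500
se = sqrt(12**2 / n + 12**2 / n)   # step 1: SE, about 0.76
lower = -1.96 * se                 # step 2: lower rejection boundary
# step 3: the alternative distribution is centered at -3 with the same SE
# step 4: probability of landing in the lower rejection region
power = NormalDist(mu=-3, sigma=se).cdf(lower)
print(round(se, 2), round(power, 3))  # 0.76 0.977
```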

    The researchers decided 3 mmHg was the minimum difference that was practically important, and with a sample size of 500, we can be very certain (97.7% or better) that we will detect any such difference. We now have moved to another extreme where we are exposing an unnecessary number of patients to the new drug in the clinical trial. Not only is this ethically questionable, but it would also cost a lot more money than is necessary to be quite sure we’d detect any important effects.

    The most common practice is to identify the sample size where the power is around 80%, and sometimes 90%. Other values may be reasonable for a specific context, but 80% and 90% are most commonly targeted as a good balance between high power and not exposing too many patients to a new treatment (or wasting too much money).

    We could compute the power of the test at several other possible sample sizes until we find one that’s close to 80%, but there’s a better way. We should solve the problem backwards.

    What sample size will lead to a power of 80%? Use \(\alpha = 0.05\). [sample_size_for_80_percent_power] We’ll assume we have a large enough sample that the normal distribution is a good approximation for the test statistic, since the normal distribution and the \(t\)-distribution look almost identical when the degrees of freedom are moderately large (e.g. \(df \geq 30\)). If that doesn’t turn out to be true, then we’d need to make a correction.

    We start by identifying the Z-score that would give us a lower tail of 80%. For a moderately large sample size per group, the Z-score for a lower tail of 80% would be about \(Z = 0.84\).

    Additionally, the rejection region extends \(1.96\times SE\) from the center of the null distribution for \(\alpha = 0.05\). This allows us to calculate the target distance between the center of the null and alternative distributions in terms of the standard error:

    \[\begin{aligned} 0.84 \times SE + 1.96 \times SE = 2.8 \times SE \end{aligned}\]

    In our example, we want the distance between the null and alternative distributions’ centers to equal the minimum effect size of interest, 3 mmHg, which allows us to set up an equation between this difference and the standard error:

    \[\begin{aligned} 3 &= 2.8 \times SE \\ 3 &= 2.8 \times \sqrt{\frac{12^2}{n} + \frac{12^2}{n}} \\ n &= \frac{2.8^2}{3^2} \times \left( 12^2 + 12^2 \right) = 250.88 \\ \end{aligned}\]

    We should target 251 patients per group in order to achieve 80% power at the 0.05 significance level for this context.
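    The closing algebra translates directly into code (a sketch; 0.84 and 1.96 are the rounded Z-scores from the text):

```python
import math

effect = 3                 # minimum effect size of interest (mmHg)
multiplier = 0.84 + 1.96   # 2.8 SEs: 80% power at alpha = 0.05
s = 12                     # assumed SD of blood pressures in each group

# Solve  effect = multiplier * sqrt(s^2/n + s^2/n)  for n
n = (multiplier / effect) ** 2 * (s**2 + s**2)
print(round(n, 2), math.ceil(n))   # 250.88 251
```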

    The standard error difference of \(2.8 \times SE\) is specific to a context where the targeted power is 80% and the significance level is \(\alpha = 0.05\). If the targeted power is 90% or if we use a different significance level, then we’ll use something a little different than \(2.8 \times SE\).

    Had the suggested sample size been relatively small – roughly 30 or smaller – it would have been a good idea to rework the calculations using the \(t\)-distribution with the degrees of freedom implied by that initial sample size. That is, we would have revised the 0.84 and 1.96 values accordingly, and the revised sample size target would generally have been a little larger.

    Suppose the targeted power was 90% and we were using \(\alpha = 0.01\). How many standard errors should separate the centers of the null and alternative distribution, where the alternative distribution is centered at the minimum effect size of interest?
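    One way to check an answer to this kind of question is to compute the two Z-scores directly (a sketch in Python; `se_multiplier` is a hypothetical helper name, not from the text):

```python
from statistics import NormalDist

def se_multiplier(power, alpha):
    """Number of SEs that must separate the null and alternative
    centers to achieve the given power at significance level alpha
    (two-sided test, normal approximation)."""
    z_power = NormalDist().inv_cdf(power)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    return z_power + z_alpha

print(round(se_multiplier(0.80, 0.05), 2))  # 2.8, as in the text
print(round(se_multiplier(0.90, 0.01), 2))
```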

    What are some considerations that are important in determining what the power should be for an experiment?

    Figure [power_curve_neg-3] shows the power for sample sizes from 20 patients to 5,000 patients when \(\alpha = 0.05\) and the true difference is -3. This curve was constructed by writing a program to compute the power for many different sample sizes.
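    A program like the one used to build that curve might look as follows (a sketch; `power_at` is a hypothetical helper name):

```python
from math import sqrt
from statistics import NormalDist

def power_at(n, effect=-3, s=12, alpha=0.05):
    """Power to detect `effect` with n patients per group,
    using the normal approximation throughout."""
    se = sqrt(s**2 / n + s**2 / n)
    lower = NormalDist().inv_cdf(alpha / 2) * se   # lower rejection boundary
    return NormalDist(mu=effect, sigma=se).cdf(lower)

for n in (20, 100, 251, 500, 1000):
    print(n, round(power_at(n), 3))
```

Note that `power_at(251)` comes out to about 0.80, consistent with the sample size derived above.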

    [power_curve_neg-3]

    Power calculations for expensive or risky experiments are critical. However, what about experiments that are inexpensive and where the ethical considerations are minimal? For example, if we are doing final testing on a new feature on a popular website, how would our sample size considerations change? As before, we’d want to make sure the sample is big enough. However, suppose the feature has undergone some testing and is known to perform well (e.g. the website’s users seem to enjoy the feature). Then it may be reasonable to run a larger experiment if there’s value from having a more precise estimate of the feature’s effect, such as helping guide the development of the next useful feature.


    This page titled 7.4: Power calculations for a difference of means is shared under a CC BY-SA 3.0 license and was authored, remixed, and/or curated by David Diez, Christopher Barr, & Mine Çetinkaya-Rundel via source content that was edited to the style and standards of the LibreTexts platform.
