
9.3: Statistical Power


    How do we know whether an effect is there to detect? That question involves the concept of statistical power. The power of a statistical test is the probability that the test will detect an effect, given that the effect actually exists.

    Power is yet another reason why we are so fixated on getting a large sample size. When we say that we need to collect more data or get a larger sample, we mean: get enough power to detect the effect if it does exist. In this context, power is closely tied to sample size. Think of power as fuel: do we have enough fuel, that is, enough observations, to keep sampling until we find the effect we believe is there?

    You might ask, “Why not just keep sampling, observing, and collecting data until you get what you want? What do we need a power analysis for if we are going to keep collecting data anyway?” We keep saying that we need to collect more data, but how much data should you collect before saying, “Stop, I’ve had enough”? The purpose of the power analysis is to address this issue: what is the minimum sample size needed to detect an effect if the effect does, in fact, exist?
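
    As a rough illustration, the number a power analysis produces can be sketched with the standard normal approximation for a one-sample two-sided z-test. This is a minimal sketch using only Python's standard library; the function name and defaults are illustrative, not a standard API, and dedicated power software handles many more designs.

```python
from math import ceil
from statistics import NormalDist

def min_sample_size(effect_size, alpha=0.05, power=0.80):
    """Approximate minimum n for a two-sided one-sample z-test.

    Uses the normal approximation n = ((z_{1-alpha/2} + z_{power}) / d)^2,
    where d is the standardized effect size (Cohen's d).
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)   # critical value
    z_beta = NormalDist().inv_cdf(power)            # quantile for desired power
    return ceil(((z_alpha + z_beta) / effect_size) ** 2)

# A medium effect (d = 0.5) at the conventional alpha = .05 and 80% power:
print(min_sample_size(0.5))  # → 32
```

    This is the kind of answer a power analysis gives: before collecting any data, you know roughly when to say “Stop, I’ve had enough.”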

    9.3.1: What are the Values of Power?

    Power ranges from 0 to 100%. Zero is the obvious floor: your sample size is so small that there is no chance of finding the effect. A power of 100% would mean your sample size is so large that you are guaranteed to find the effect, assuming there is one to find. In practice, 100% is never achieved; there is no situation where you have all the power you need to detect everything. That would be like saying your car has unlimited fuel, so you can keep going and going until you get to where you want to go.

    A power value of .50 means you have a 50% chance of finding an effect that is really there. That is akin to flipping a coin: some days you find the effect, some days you do not. So 50% is not a great number. If 100% is unrealistic, what is a good target? The convention is 80%, which gives you a solid chance of detecting the effect without requiring an impractically large sample size. Why very large sample sizes are problematic is discussed in later sections.
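
    The coin-flip analogy can be made concrete. Below is a hedged sketch, using the same one-sample two-sided z-test approximation (upper tail only, function name mine), of how power grows with sample size for a medium-sized effect: around n = 16 you are near a coin flip, and around n = 32 you reach the conventional 80%.

```python
from math import sqrt
from statistics import NormalDist

def power_one_sample_z(n, effect_size, alpha=0.05):
    """Approximate power of a two-sided one-sample z-test (upper tail only)."""
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    return NormalDist().cdf(effect_size * sqrt(n) - z_crit)

# With a medium effect (d = 0.5):
print(round(power_one_sample_z(16, 0.5), 2))  # roughly a coin flip
print(round(power_one_sample_z(32, 0.5), 2))  # near the conventional 80%
```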

    9.3.2: Power and Type II Error

    Power addresses the problem of Type II error. Recall that a Type II error occurs when you get a non-significant result even though the effect actually exists in the population. Usually, a Type II error is due to having a small sample size: you did not look hard enough to find the effect. If you do not find something, how do you know that you did everything you could, and sampled as much as you could, before concluding that there is nothing to find?

    Consider this example. You lost your car keys. You looked in your car and did not find them. Conclusion: your car keys are gone. But you may have made a Type II error: you concluded there was nothing to find, when in reality your keys were still out there. What does your roommate, bestie, or family member say? “Did you look hard enough? Did you look everywhere?” You keep looking, and eventually you find your keys in a pants pocket in your bedroom. Your original search was underpowered: you concluded you found nothing only because you did not keep looking, that is, you did not sample enough places.

    Or consider this example. You come home from one bad date and conclude that you will be alone for the rest of your life. You might have made a Type II error. What does your roommate, bestie, or family member say? “Keep going on dates, because eventually you will find someone.” You keep going, and after the third try (because the third time is the charm), you find your soulmate. You originally concluded there was no one out there, and by default, you made a Type II error. The reason you found no one is that you did not keep going on more dates. The more dates you go on, the more likely you are to find your soulmate.

    In psychology, we investigate treatment effects. Sometimes our treatment does not show an effect, but perhaps with more participants the effect would emerge. A power analysis tells us how many participants we need to recruit into the study to detect the effect we hope to find.

    In this sense, the power analysis “rules out” a Type II error. “Rule out” is in quotes because the power analysis does not literally eliminate the possibility of a Type II error; it tells you the sample size you need to find the effect. If you sampled enough participants and still did not find a significant effect, you can reasonably conclude that there is no effect to find, and the likelihood that you made a Type II error goes down. In the car keys example: if you looked everywhere, your bedroom, your desk, every clothes pocket, then yes, your keys really are missing. In the dating example: if you go on, say, 300 dates and still cannot find your soulmate, then perhaps it is time to work on yourself, lower your expectations, or re-examine your life. In psychology, if you sample 300 participants and still cannot find your treatment effect, you can be reasonably confident that the treatment does not work.
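
    A small simulation can show how an underpowered study produces Type II errors. The sketch below (illustrative function, known population standard deviation of 1 so a simple z-test applies) repeatedly draws samples from a population where a real but modest effect exists, and counts how often the test misses it. With a small sample the test misses most of the time; with a larger sample it rarely does.

```python
import random
from math import sqrt
from statistics import fmean

def miss_rate(n, true_mean, sims=5000, z_crit=1.96, seed=1):
    """Estimate the Type II error rate: how often a two-sided z-test on n
    observations from Normal(true_mean, 1) fails to reject H0: mu = 0."""
    rng = random.Random(seed)
    misses = 0
    for _ in range(sims):
        sample = [rng.gauss(true_mean, 1) for _ in range(n)]
        z = fmean(sample) * sqrt(n)  # z = mean / (sigma / sqrt(n)), sigma = 1
        if abs(z) < z_crit:
            misses += 1          # non-significant despite a real effect
    return misses / sims

# A real but modest effect (d = 0.3):
print(miss_rate(n=20, true_mean=0.3))   # underpowered: misses often
print(miss_rate(n=100, true_mean=0.3))  # better powered: misses rarely
```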

    9.3.3: Why not Get Large Sample Sizes Anyway?

    The power analysis tells you the minimum sample size you need to find the effect if, in fact, the effect does exist. Well, we are going to sample a lot anyway, so why not just keep sampling as much as you can? Why not always get large sample sizes, like N = 1000, so you have all the chances to find an effect?

    There are a couple of reasons against very large samples. Recruiting many participants costs a lot of resources: recruiting and collecting participant data requires time, personnel, money, and sleep (we all need rest). And if you are a graduate student, you are already short on four things in life: time, friends, money, and sleep.

    Most clinical populations are not that large. Clinical populations have a low base rate; people with rare medical diseases are one example. In clinical psychology, the problem is often not the base rate itself but stigma or awareness, which makes some clinical populations hard to find. Middle Eastern women who experience domestic violence in their marriages are hard to recruit because of the cultural stigma around seeking help. Adult males who engage in self-harm behaviors might not readily participate in research because of the stigma attached to being an adult male who self-harms. Cubs fans who are intelligent are hard to find, because what intelligent person becomes a Cubs fan? Intelligent people become White Sox fans. The lesson is that it is sometimes difficult to get the sample size we want, given the characteristics of the issue and the population at hand.

    Very large sample sizes also raise the risk of Type I errors. If you keep sampling until you find something, you must ask whether you found a real effect or simply got “lucky” because you kept looking; with enough data, even trivially small differences become statistically significant.

    For these reasons, getting a very large sample size can be prohibitive. A power analysis is meant to guide you toward an efficient sampling strategy.

    Even when the power analysis gives you a number for your sample size, it behooves you to sample some extra participants. Every research study has missing data: people skip survey items, drop out midway through a battery of surveys, or leave during the experimental treatment. Having more participants and more observations allows you to make up for the missing data. How that works is discussed in later chapters.

    9.3.4: What Helps Get More Power?

    Power seems synonymous with sample size. But there are other ways to get more power or find the effect you want to see. Put differently, how else can we give ourselves the best chance to find the effect?

    The first way is to increase the sample size. I have already discussed this at length.

    The second way is to increase the alpha level, say from .05 to .10. By increasing the alpha level, you give yourself more chances to find something; it widens the region of rejection under the normal distribution. It is akin to lowering your standards: lower the bar, and more results will clear it. But that does not mean you found something real; you just declared that weaker evidence counts as an effect. I am not a fan of this option, because increasing the alpha level increases power but also increases the chance of making a Type I error. Essentially, you are saying, “anything can be significant.”
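
    The trade-off is easy to see numerically. Using the same one-sample two-sided z-test approximation as before (an illustrative sketch, not a standard API): for a fixed study, raising alpha from .05 to .10 does raise power, but it also doubles the false-positive rate when the null is true, since that rate equals alpha by construction.

```python
from math import sqrt
from statistics import NormalDist

def power_z(n, d, alpha):
    """Approximate power of a two-sided one-sample z-test (upper tail only)."""
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    return NormalDist().cdf(d * sqrt(n) - z_crit)

# Same study (n = 50, d = 0.3), two alpha levels:
for alpha in (0.05, 0.10):
    print(alpha, round(power_z(50, 0.3, alpha), 2))
# Power rises with alpha -- but so does the Type I error rate,
# which equals alpha whenever the null hypothesis is actually true.
```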

    The third way is to study a larger effect size. If the true population means are very different, the difference is apparent, and the statistical test will have high power: it will not take many observations to find the effect. Say you want to find out whether height is related to making basketball free throws. Are taller players more likely to make free throws? You sample 100 6-foot players, compare their free-throw percentage with that of 100 6-foot-5-inch players, and find that height is not related to free-throw percentage. A difference of 5 inches, especially at that height, probably does not matter much. What can you do? Compare the 6-foot players to, say, 4-foot preschoolers. Lo and behold, the 6-foot players make more free throws than the 4-foot preschoolers, and you only needed 10 players in each group. You found your effect: height is related to free-throw shooting. But the only reason you found it is that you stacked the deck in your favor by artificially inflating the effect, 4-foot preschoolers versus 6-foot players. So, I am not a fan of this method because it seems manipulative.

    I am also not a fan of banking on a large effect size because it is hard to know the effect size in advance. What kind of effect will mindfulness have on alcohol abuse? What kind of effect will social support have on the likelihood of coming out as LGBTQ? These effects are hard to determine ahead of time. The best way to estimate an effect size is to do a literature review and catalog the effects past studies have found. That gives you a guide, but it is difficult to say whether the effect sizes in the literature will apply to the variables you are interested in.

    So, of the three options for increasing power, a) increase the sample size, b) increase the alpha level, c) study a larger effect size, the best option is to increase the sample size. By the way, notice that all the options for increasing power involve increasing something. If you see this question on the EPPP and one of the answers involves an increase, that may be a good clue.

    9.3.5: Small Effect Sizes are Hard to Detect

    If the true difference is very small, it will take a lot (a large sample size, more observations) to find it. This is yet another reason why we need a large sample size: small effects are difficult to detect in small samples, so you need more power, that is, larger samples, to detect them.
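
    The same normal-approximation sketch from earlier (illustrative function, one-sample two-sided z-test, 80% power at alpha = .05) shows how steeply the required sample size grows as the effect shrinks, using Cohen's conventional small, medium, and large values of d.

```python
from math import ceil
from statistics import NormalDist

def n_for_power(d, alpha=0.05, power=0.80):
    """Approximate n for a two-sided one-sample z-test at a given power."""
    z = NormalDist().inv_cdf
    return ceil(((z(1 - alpha / 2) + z(power)) / d) ** 2)

# Cohen's conventional small, medium, and large effect sizes:
for label, d in [("small", 0.2), ("medium", 0.5), ("large", 0.8)]:
    print(label, n_for_power(d))  # small effects demand far larger samples
```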

    Consider these examples. Usain Bolt blew away the field in the 100-meter race at the 2008 Olympics. The gap between him and the second-place finisher was so large that it was convincing that Bolt was the fastest. There was no need to make him redo the race to prove himself again as the fastest runner. He did it anyway at the next Olympics, with the same result; case closed. No need for a larger sample.

    But Michael Phelps won his 100-meter butterfly race at those same Olympics by .01 of a second. That difference was so small that it is difficult to tell whether he just got lucky or whether something real happened. It could have been random luck, or there could be something about Phelps’ swimming technique that put him in a position to outstretch his opponent and win by a fingertip. The only way to tell luck from a real effect is to run the race again. You need a larger sample to determine whether an outcome was luck, hence unstable, because no one is that lucky twice, or whether something real is going on, such as great swimming technique. His 23 career Olympic gold medals settle the debate: he has the skill and the speed.

    9.3.6: Power Does not Always Mean Obtaining a Large Sample Size

    We keep talking about power in terms of sample size; in fact, we seem obsessed with sample size and getting more of it. However, statistical tests do not rise and fall with the number of participants alone. There are other ways to add observations.

    One way is to observe over time. You do not always need more participants; sometimes you can observe the same people repeatedly. We will discuss the benefits of this approach when we discuss longitudinal analysis. The point is that power can increase not just with a larger sample but also with more observations per person over time.

    Another way to increase power is to use reliable measures. If a measure has little measurement error, you have better precision, and hence more confidence that the number you record reflects the true value. Error kills off statistical significance: any measurement error enlarges the overall error term, and as we learned, the larger the error in the denominator of a statistical test, the less likely the test is to be significant. Better precision means higher reliability of measurement and a better chance of detecting an effect.
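
    This can be sketched with classical test theory: if observed scores are true scores plus independent error, the observed standard deviation inflates and the standardized effect shrinks, dragging power down. The function below is an illustrative sketch (not a standard API), reusing the one-sample z-test approximation; the error-variance value is an assumption chosen so that reliability works out to .50.

```python
from math import sqrt
from statistics import NormalDist

def power_with_noise(n, d, error_var, alpha=0.05):
    """Power of a two-sided one-sample z-test when observed scores are
    true scores (variance 1) plus independent error with variance error_var.

    The observed standard deviation grows to sqrt(1 + error_var),
    shrinking the standardized effect to d / sqrt(1 + error_var)."""
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    d_attenuated = d / sqrt(1 + error_var)
    return NormalDist().cdf(d_attenuated * sqrt(n) - z_crit)

# Same study (n = 32, d = 0.5), perfectly reliable vs. noisy measure:
print(round(power_with_noise(32, 0.5, error_var=0.0), 2))  # clean measure
print(round(power_with_noise(32, 0.5, error_var=1.0), 2))  # reliability = .50
```

    With the noisy measure, a study designed for roughly 80% power is reduced to little better than a coin flip, without a single participant being lost.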


    This page titled 9.3: Statistical Power is shared under a CC BY 4.0 license and was authored, remixed, and/or curated by Peter Ji.