8.7: Interpreting p-values and Significance


    Once we've completed a hypothesis test and found a p-value, what does that number really tell us?

    A p-value is one of the most widely reported — and most widely misunderstood — numbers in all of statistics. In this section we'll build a careful, honest understanding of what a p-value means, what it doesn't mean, and how to use it responsibly.


    What Is a p-value, Really?

    Let's review our working definition:

    Definition: p-value

    The p-value is the probability of observing a sample statistic as extreme as, or more extreme than, what we saw, assuming the null hypothesis is true.

    It reflects how surprising our result would be if the null hypothesis were actually correct. It is not the probability that the null hypothesis is true.

    A small p-value means our observed result would be very unlikely under the null, surprising enough that we may want to reject the null. A large p-value means the result is quite consistent with the null hypothesis.


    Visualizing the p-value

    The diagram below illustrates where the p-value lives on the sampling distribution. When we calculate a test statistic, we are asking: if \( H_0 \) were true, how far out in the tail is our result?


    The shaded area is the p-value — the probability of getting a result at least as extreme as ours, assuming \( H_0 \) is true.
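    As a minimal sketch of how that shaded area can be computed, the function below assumes the test statistic is a z-score whose null distribution is standard normal. The name `p_value_from_z` and its `tail` argument are illustrative choices, not part of the text.

```python
from statistics import NormalDist


def p_value_from_z(z, tail="upper"):
    """Tail area under a standard normal curve for an observed z statistic."""
    std_normal = NormalDist()  # mean 0, standard deviation 1
    if tail == "upper":        # H_A says the parameter is larger
        return 1 - std_normal.cdf(z)
    if tail == "lower":        # H_A says the parameter is smaller
        return std_normal.cdf(z)
    if tail == "two-sided":    # H_A says the parameter is different
        return 2 * (1 - std_normal.cdf(abs(z)))
    raise ValueError("tail must be 'upper', 'lower', or 'two-sided'")


print(round(p_value_from_z(1.5), 4))               # 0.0668
print(round(p_value_from_z(1.5, "two-sided"), 4))  # 0.1336
```

    The further out in the tail the test statistic falls, the smaller the shaded area, and so the smaller the p-value.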

    Drawing Conclusions from Your p-value

    We compare the p-value to our significance level \( \alpha \) to make a formal decision:

    • If \( p \leq \alpha \): reject \( H_0 \) → there is statistically significant evidence for \( H_A \)
    • If \( p > \alpha \): fail to reject \( H_0 \) → there is not enough evidence against it

    For example, if \( \alpha = 0.05 \) and our p-value is 0.03, we reject the null hypothesis.
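    The decision rule above is simple enough to write as a short function. This is only a sketch; the name `decide` and the returned strings are illustrative.

```python
def decide(p_value, alpha=0.05):
    """Apply the rule: reject H0 when p <= alpha, otherwise fail to reject."""
    if not 0 <= p_value <= 1:
        raise ValueError("a p-value must lie between 0 and 1")
    if p_value <= alpha:
        return "reject H0"
    return "fail to reject H0"


print(decide(0.03))  # reject H0
print(decide(0.05))  # reject H0 (the boundary case p = alpha counts as significant)
print(decide(0.06))  # fail to reject H0
```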

    Statistical Significance

    If the p-value is less than or equal to our \( \alpha \) level, we say the result is statistically significant.

    This means that a result at least this extreme would be unlikely if \( H_0 \) were true and only chance variation were at work, according to the assumptions of our model. It does not mean the result is important, large, or proven true.

    Important — "Fail to reject" is not the same as "accept": When \( p > \alpha \), we say we fail to reject \( H_0 \) — we do not say we accept it or that the null is true. The data simply didn't give us enough evidence to conclude otherwise. Absence of evidence is not evidence of absence.


    A Worked Interpretation: Commute Times Revisited

    Recall the commute time example from Section 8.3. We tested:

    • \( H_0: \mu = 30 \) minutes   (no change)
    • \( H_A: \mu > 30 \) minutes   (commutes are longer)

    We found \( z = 2.40 \) and \( p = 0.0082 \).
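    These two numbers can be reproduced with a few lines of arithmetic. The sketch below uses the sample mean of 32 minutes and \( n = 36 \) quoted in this section. The population standard deviation is not restated here, so \( \sigma = 5 \) minutes is an assumption, chosen because it reproduces \( z = 2.40 \).

```python
from math import sqrt
from statistics import NormalDist

mu_0 = 30   # hypothesized mean commute in minutes, from H0
x_bar = 32  # observed sample mean in minutes
n = 36      # sample size
sigma = 5   # ASSUMED population standard deviation in minutes (not stated in this section)

standard_error = sigma / sqrt(n)
z = (x_bar - mu_0) / standard_error
p_value = 1 - NormalDist().cdf(z)  # upper tail, because H_A is mu > 30

print(round(z, 2))        # 2.4
print(round(p_value, 4))  # 0.0082
```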

    Here are four ways to say what that p-value means — and two ways that are wrong:

    Correct vs. incorrect ways to interpret \( p = 0.0082 \).
    Status | Statement
    ✓ Correct | If the true average commute time were still 30 minutes, there would be only a 0.82% chance of observing a sample mean of 32 minutes or more (with \( n = 36 \)).
    ✓ Correct | The data are very inconsistent with the null hypothesis: our result is in the most extreme 0.82% of outcomes we would expect if \( H_0 \) were true.
    ✓ Correct | Since \( p = 0.0082 < \alpha = 0.05 \), we reject \( H_0 \). There is statistically significant evidence that commutes are longer.
    ✓ Correct | The probability of getting our observed result or one more extreme, assuming the null is true, is 0.0082.
    ✗ Wrong | "There is a 0.82% probability that the null hypothesis is true." The p-value is not the probability that \( H_0 \) is true.
    ✗ Wrong | "There is a 99.18% probability that the alternative hypothesis is true." A p-value cannot be converted into the probability that \( H_A \) is true either.

    Common Misinterpretations of p-values

    The table above showed two specific errors. Here is a broader summary of the most common mistakes — knowing these will sharpen your statistical thinking:

    Common p-value misinterpretations and what to say instead.
    What people say | Why it's wrong | What to say instead
    "p = 0.03 means there's a 3% chance the null is true." | The p-value assumes the null is true. It can't also be the probability the null is true; that would be circular. | "If the null were true, there'd be a 3% chance of seeing data this extreme."
    "p < 0.05 proves my hypothesis." | Statistical significance is not proof. The test could still have produced a Type I error, or the model's assumptions could be violated. | "There is statistically significant evidence consistent with \( H_A \)."
    "p = 0.06 means there's no effect." | Failing to reject \( H_0 \) is not the same as proving \( H_0 \). The sample may simply have been too small to detect a real effect. | "We did not find sufficient evidence against \( H_0 \) at the \( \alpha = 0.05 \) level."
    "p = 0.001 means the effect is large." | A very small p-value only means the result is unlikely under \( H_0 \). With a large enough sample, a tiny, practically meaningless difference can produce \( p < 0.001 \). | "The result is highly statistically significant; we should also examine the effect size to assess practical importance."

    Statistical Significance vs. Practical Significance

    One of the most important distinctions in applied statistics is between a result being statistically significant and being practically significant.

    • Statistical significance answers: Would a result this extreme be unlikely if only chance were at work?
    • Practical significance answers: Is this result meaningful or important in the real world?

    These can come apart in both directions:

    • Statistically significant but not practically significant: A study of 50,000 people finds that a new diet reduces body weight by an average of 0.3 kg (\( p < 0.001 \)). The effect is very likely real but almost certainly too small to matter clinically.
    • Practically significant but not statistically significant: A small pilot study (\( n = 12 \)) of a new cancer treatment shows a 20% improvement in survival rate, but \( p = 0.11 \). The effect may be real and important — the study simply lacked the power to detect it.
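    To see how sample size alone can push a tiny effect below any significance threshold, the sketch below holds the 0.3 kg effect from the diet example fixed and varies only \( n \). It assumes a one-sample z-test of \( H_0 \): mean weight change = 0 and a standard deviation of 10 kg. Both are hypothetical choices for illustration and do not come from the study described above.

```python
from math import sqrt
from statistics import NormalDist


def one_sided_p(effect, sd, n):
    """Upper-tail p-value for a one-sample z test of H0: mean change = 0."""
    z = effect / (sd / sqrt(n))
    return 1 - NormalDist().cdf(z)


effect_kg = 0.3  # average weight loss from the diet example
sd_kg = 10       # ASSUMED standard deviation of weight change (hypothetical)

for n in (100, 1_000, 50_000):
    print(n, one_sided_p(effect_kg, sd_kg, n))
```

    Under these assumptions the same 0.3 kg effect gives a p-value of about 0.38 with 100 people and a p-value far below 0.001 with 50,000 people. The effect did not get bigger; only the sample did.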

    Effect Size

    To quantify practical significance we often report an effect size — a standardized measure of how large the difference actually is, independent of sample size. Common effect sizes include:

    • Cohen's d (for means): \( d = \dfrac{\bar{x}_1 - \bar{x}_2}{s_{pooled}} \) — roughly, 0.2 is small, 0.5 is medium, 0.8 is large.
    • \( r^2 \) or \( \eta^2 \): the proportion of variance in the outcome explained by the predictor (\( r^2 \) is covered in Chapter 9).

    A complete results report includes both the p-value (for significance) and an effect size (for magnitude).
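    As a rough sketch of the Cohen's d formula above, the function below computes the pooled standard deviation in the usual way, weighting each group's sample variance by its degrees of freedom. The data values are made up for illustration.

```python
from math import sqrt
from statistics import mean, variance


def cohens_d(group_1, group_2):
    """Cohen's d: difference in sample means divided by the pooled standard deviation."""
    n1, n2 = len(group_1), len(group_2)
    # Pooled variance: sample variances weighted by their degrees of freedom.
    pooled_var = ((n1 - 1) * variance(group_1) + (n2 - 1) * variance(group_2)) / (n1 + n2 - 2)
    return (mean(group_1) - mean(group_2)) / sqrt(pooled_var)


treatment = [6, 7, 8, 9, 10]  # hypothetical scores
control = [5, 6, 7, 8, 9]     # hypothetical scores

print(round(cohens_d(treatment, control), 2))  # 0.63
```

    By the rough guidelines above, a value of 0.63 would fall between a medium and a large effect.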


    p-hacking and the Misuse of Significance

    P-values are useful — but not magical. Let's talk about the risks of misusing them.

    What Is p-hacking?

    P-hacking (also called data dredging) is the practice of trying many comparisons, subgroups, or statistical tweaks and reporting only the results that achieve \( p < 0.05 \), even when those results occurred by chance.

    • Running dozens of statistical tests until "something sticks"
    • Selectively reporting only significant results and filing the rest away
    • Stopping data collection early because a p-value just dipped below 0.05
    • Trying different variable transformations, subgroups, or outlier exclusions until \( p < 0.05 \)

    To see why this matters: if you run 20 independent tests at \( \alpha = 0.05 \) and every null hypothesis is true, you would still expect about one false positive on average (\( 20 \times 0.05 = 1 \)), even though nothing is truly going on. P-hacking exploits this, and it is one reason many published findings have failed to replicate.
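    The arithmetic behind the 20-test example can be checked directly. The sketch below assumes the tests are independent and that every null hypothesis is true; the function names are illustrative.

```python
def expected_false_positives(num_tests, alpha=0.05):
    """Average number of false positives when every null hypothesis is true."""
    return num_tests * alpha


def prob_at_least_one_false_positive(num_tests, alpha=0.05):
    """Chance that at least one of the independent tests rejects a true null."""
    return 1 - (1 - alpha) ** num_tests


print(expected_false_positives(20))                     # 1.0
print(round(prob_at_least_one_false_positive(20), 2))   # 0.64
```

    So with 20 tests there is roughly a 64% chance of at least one "significant" result even when no real effect exists anywhere.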

    What Can We Do?

    • Pre-register your hypotheses and analysis plan before collecting data
    • Report all results — not just the significant ones
    • Use replication — a single significant result is a starting point, not a conclusion
    • Adjust for multiple comparisons when running many tests simultaneously (e.g., Bonferroni correction)
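    The Bonferroni correction mentioned in the last bullet can be sketched in a few lines: each p-value is compared with \( \alpha \) divided by the number of tests rather than with \( \alpha \) itself. The p-values below are hypothetical.

```python
def bonferroni_significant(p_values, alpha=0.05):
    """Flag each test as significant only if p <= alpha / (number of tests)."""
    if not p_values:
        raise ValueError("need at least one p-value")
    threshold = alpha / len(p_values)
    return [p <= threshold for p in p_values]


p_values = [0.001, 0.02, 0.04, 0.30]  # hypothetical results from four tests

print(bonferroni_significant(p_values))  # [True, False, False, False]
```

    Three of these p-values are below 0.05 on their own, but only one survives the adjusted threshold of \( 0.05 / 4 = 0.0125 \).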

    Remember: A p-value does not tell you the probability that your hypothesis is true. It tells you the probability of seeing your data — or data more extreme — if the null hypothesis were true. These are fundamentally different questions.


    Thinking Beyond the p-value

    While significance testing is a powerful tool, it should never be your only consideration. Always ask:

    • Is this result practically significant? Does the effect matter in the real world?
    • What is the effect size? Is the difference meaningful beyond just being nonzero?
    • What are the risks of a Type I or Type II error in this context?
    • Would this result replicate with a larger or different sample?
    • Are the assumptions of the test actually satisfied?

    P-values are just one tool. Use them with care, context, and critical thinking.


    Think About It:
    • You find a statistically significant result (\( p = 0.04 \)), but the difference is only 0.2 points on a 100-point scale. What should you conclude? What other information would you want before acting on the result?
    • A researcher runs 40 separate hypothesis tests on one large dataset, each at \( \alpha = 0.05 \), and finds 3 significant results. Should they be excited? What is the expected number of false positives in 40 tests at this level?
    • Two studies test the same drug. Study A has \( n = 30 \) and finds \( p = 0.06 \). Study B has \( n = 3000 \) and finds \( p = 0.04 \). Both find approximately the same effect size. What do these two results together suggest? Which p-value is more informative on its own?

    This page titled 8.7: Interpreting p-values and Significance is shared under a CC BY 4.0 license and was authored, remixed, and/or curated by Mathematics Department.
