8.10: The Definition of Type I and Type II Errors
In a Type I error, you reject the null when it is true: "I think my finding was true and significant, but really it was due to chance." If the alpha level α is set at 5%, then 5% of the time I reject the null. I declared a significant effect, but actually I should have retained the null because there was nothing there. We state, "My statistical result supports my hypothesis, but it is possible that my results were due to random chance, and I made a Type I error." A Type I error is the same as a false positive: you think you found something, but in reality, you did not.
In a Type II error, you accept the null when it is false: "I think my findings were not true and not significant, but really, something did happen." If the beta level β is at 95%, then 95% of the time I fail to reject the null. I declared no significant effect, but actually I should have rejected the null because there was something there. We state, "My statistical result did not support my hypothesis, but it is possible that my results indicated something did occur, and I made a Type II error." A Type II error is the same as a false negative: you think you did not find something, but in reality, you did.
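To make the two definitions concrete, here is a minimal simulation sketch in Python (the group sizes, effect size, and number of trials are assumptions chosen for illustration, not values from this section). With the null true, roughly 5% of tests come out "significant" at α = .05 (false positives); with a real but modest effect present, some tests still miss it (false negatives).

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
alpha, trials, n = 0.05, 5_000, 50          # assumed illustrative values

# Null is true: both groups come from the same population.
type_1 = sum(
    stats.ttest_ind(rng.normal(0, 1, n), rng.normal(0, 1, n)).pvalue < alpha
    for _ in range(trials)
)

# Null is false: the treatment group really is 0.3 SD higher.
type_2 = sum(
    stats.ttest_ind(rng.normal(0.3, 1, n), rng.normal(0, 1, n)).pvalue >= alpha
    for _ in range(trials)
)

print(f"Type I (false positive) rate  ~ {type_1 / trials:.3f}")  # near .05
print(f"Type II (false negative) rate ~ {type_2 / trials:.3f}")  # depends on n and effect
```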
8.10.1: Examples of Type I and Type II Errors
A classic medical example is a diagnostic test that says you have a condition you really do not have. Take a pregnancy test: a Type I error means the test says you are pregnant when you really are not, and a Type II error means the test says you are not pregnant when you really are. Back in 2020, during the COVID-19 pandemic, tests were developed quickly, and there were concerns about Type I and Type II errors. A Type I error means the test says you have COVID-19 when you do not; a Type II error means the test says you do not have COVID-19 when you do.
A psychology/forensic example is predicting the risk that an offender will re-offend. If you release an offender and state that the offender will not re-offend, you are making a Type II error, because you say the offender will not re-offend, but there is a chance the offender will. A psychological diagnostic example of a Type I error is assessing a child as having ADHD when, in fact, the child does not have ADHD.
Switching to humor, if you break up with your dating partner, you can use this break-up line – “I’m breaking up with you because I made a type I error. I thought you were significant, but in reality, you are not significant; you’re just an error. I made the mistake of thinking you were significant.” If someone breaks up with you, you can use this comeback line – “You are breaking up with me? You just made a Type II error. You think I am not significant, but in reality, I am significant. You just lost out on dating me because I am a significant finding.” Those two comeback lines should address all of your problems about what to say when you experience a dating breakup.
8.10.2: The Likelihood of Making a Type I or Type II Error
How likely are you to make a Type I or Type II error? The question usually comes up when your statistical test has an accompanying p value that sits right at the .05 alpha level. What do you do in that situation? You have to examine other statistical information, such as the sample size and the research design.
Let us walk through a Type I error. Situation A: p = .04, n = 300, with 150 in the treatment group and 150 in the control group. Walking through the steps: first, we say our test is significant because the p value is less than .05. Second, we automatically say we made a Type I error, because technically we found something, but there is always a chance we are wrong and, in reality, we did not find anything.
In this situation, the outcome is right near the .05 alpha level. So technically, we say we found something, but we made a mistake and made a Type I error. But did we really? One hint is the sample size. With an overall sample size of 300 evenly divided between the two groups, both groups are amply sized, and because samples near 300 give a stable estimate of the population, we can be confident, although never entirely certain, that our result is a real result.
Continuing with Type I. Situation B: p = .04, n = 30, with 15 in the treatment group and 15 in the control group. Walking through the steps: first, we say our test is significant because the p value is less than .05. Second, we automatically say we made a Type I error, because technically we found something, but there is always a chance we are wrong and, in reality, we did not find anything.
In this situation, we have an outcome that is right near the .05 alpha level. So technically, we say we found something, but we made a mistake and made a Type I error. But did we really?
One hint is the sample size. With an overall sample size of 30, split into groups of 15, and because samples below 30 do not give a stable estimate of the population, we cannot be confident that our result is a real result. We did find something, but it is more likely a fluke or a random-error result. In reality, we probably did not find anything, because with a larger sample size we might get more observations that contradict the significant finding.
Let us walk through a Type II error. Situation C: p = .06, n = 300, with 150 in the treatment group and 150 in the control group. First, we say our test is not significant because the p value is greater than .05. Second, we automatically say we made a Type II error, because technically we did not find something, but there is always a chance we are wrong and, in reality, we did find something.
In this situation, the outcome is right above the .05 alpha level. So technically, we say we did not find something, but we made a mistake and made a Type II error. But did we really? One hint is the sample size. With an overall sample size of 300 evenly divided between the two groups, both groups are amply sized, and because samples near 300 give a stable estimate of the population, we can be confident, although never entirely certain, that our non-significant result is a real result.
Continuing with Type II. Situation D: p = .06, n = 30, with 15 in the treatment group and 15 in the control group. Walking through the steps: first, we say our test is not significant because the p value is greater than .05. Second, we automatically say we made a Type II error, because technically we did not find something, but there is always a chance we are wrong and, in reality, we did find something.
In this situation, the outcome is right above the .05 alpha level. So technically, we say we did not find something, but we made a mistake and made a Type II error. But did we really? One hint is the sample size. With an overall sample size of 30 (15 per group), and because samples below 30 do not give a stable estimate of the population, we cannot be confident that our result is a real result. We said we did not find something, but in reality we might have; with a larger sample size, we might collect more observations that eventually confirm a significant finding.
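The four situations above hinge on sample size, so here is a rough simulation sketch (the underlying group means, standard deviations, and effect size are assumptions for illustration; the text specifies only n and p). With 150 per group, repeated studies of the same effect produce fairly consistent p values; with 15 per group, the p values scatter wildly, which is why a borderline result from a small sample deserves less confidence.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)

def run_study(n_per_group, effect=0.25):
    """One simulated two-group study; effect is an assumed 0.25 SD benefit."""
    treatment = rng.normal(effect, 1, n_per_group)
    control = rng.normal(0.0, 1, n_per_group)
    return stats.ttest_ind(treatment, control).pvalue

# Situations A/C: n = 300 (150 per group) -> p values cluster from run to run.
large = [round(run_study(150), 3) for _ in range(10)]
# Situations B/D: n = 30 (15 per group) -> p values bounce all over the place.
small = [round(run_study(15), 3) for _ in range(10)]

print("n = 300 per study:", large)
print("n =  30 per study:", small)
```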
8.10.3: Hallmarks of Type I and II Errors
By definition, you will always make an error when you estimate. Type I and Type II errors occur every time. Whenever you say you found a significant result, you make a Type I error. Whenever you say you did not find a significant result, you make a Type II error.
You cannot make a Type I and Type II error simultaneously. You cannot make a Type II error when you have a significant result. You cannot make a Type I error when you have a non-significant result.
Type I errors often occur. We want to find and publish significant results, and when we do, we always make a Type I error, by definition. When I say we often see Type I errors, it is only because we publish significant results.
You do not see Type II errors that often because we do not publish non-significant results. The profession keeps advocating that we should also publish non-significant results, and hence see more Type II errors, because null findings are just as important as significant findings. Until our profession stops favoring significant findings for publication, you will see more Type I errors than Type II errors.
If a statistical test falls right on .05, what do you decide? Think of the .05 as the border on a sports field. In baseball, soccer, and tennis, if the ball lands right on the line, the ball is considered "fair" or "in play," and we keep playing. So, if a result falls right on the line, it is considered significant. What would it mean to have a p value of .049? Is it significant? The answer is yes, because even rounding .049 up to .05 still puts the result right on the significance line. Side note: it is curious that in other sports, like basketball and football, if the player or the ball lands on the line, it is considered "out of bounds," and the play stops. Why the line means two different outcomes across sports is a mystery to me.
You could take this problem of "on the line" even further. What happens if the p value is .051? Do you round down and say the result is at .05? Or are you honest and say that the p value is technically just above the .05 line, so technically it is not significant? Or suppose the result is .054. Like most of us who learned rounding in school, any digit of 4 or below rounds down, and any digit of 5 or above rounds up. Should you round .054 down to .05 and call the result significant? Or do you say, honestly, that this value is above the .05 line, so technically it is not significant? And what if we extend the decimal places to .0501? Does that 1 in the fourth decimal place mean the result is not significant?
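The rounding worry can be made concrete in a few lines (a minimal sketch using the p values mentioned above): rounded to two decimals, all four values display as .05, yet only one of them is actually below the cutoff, which is why the decision should use the unrounded p value.

```python
# The p values discussed above; only .049 actually clears the .05 cutoff.
p_values = [0.049, 0.051, 0.054, 0.0501]
alpha = 0.05

for p in p_values:
    shown = round(p, 2)                       # what a rounded table entry shows
    print(f"raw p = {p:.4f}  shown as {shown:.2f}  significant? {p < alpha}")
```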
If you think, “This is like hair-splitting, and it is giving me a headache,” you would be correct. Notice how we place WAY TOO MUCH weight on the p < .05 as the “make or break it” rule for statistical significance. It is not good to make major decisions on a .05 line. Later in this chapter, we will discuss the futility of making decisions based on p < .05.
Is there any way to fix this error? The error is simply a wrong guess: I made a call about the significance of the study, and it was wrong. Conceptually, this error can occur every time you make a guess about the significance of a statistical test. The only way to guard against Type I and Type II errors is to prevent a wrong guess by replicating the study.
You will never know the truth about an outcome. All you can do is be confident that you think you did everything you could to get to the truth. Because the definition of confidence is minimizing the chance that you are incorrect in your conclusion, the only way to maximize your confidence is to continue collecting observations and replicating the study until you find the same result repeatedly. If something happens over and over again, if you get the same result, no matter what you do, then you have confidence. Recall the definition of insanity is doing the same thing over and over again and expecting a different result. The definition of confidence is doing something in different ways over and over again and still getting the same result no matter what you do. The definition of confidence is the reason why we can confidently claim that the Chicago Cubs stink. No matter the circumstances, no matter who is on the team, no matter who manages the team, no matter where they play, no matter the changes in baseball rules, no matter the time period, they still stink. On the other hand, the Chicago White Sox are superior. No matter who is on the team, no matter the circumstances, no matter who manages the team, no matter where they play, no matter the time period, they still are winners. That is the definition of confidence (I am taking artistic writer’s liberty here. My book. My examples).
The concept of the Type I and Type II errors is another reason you need a large sample size AND replication. You can't base your conclusion on one statistical test or one study. We have to replicate to see if we get the same result no matter what we do. Soapbox moment – our field and profession seem to frown upon replicating results. We seem to value original research and new contributions. There is much value in replicating results, especially results that are suspect or could use updating. Never let anyone tell you that replicating a study is not valuable. It is valuable.
It is possible to manipulate the study design to obtain a significant p value. All these scenarios lead to a Type I error: you say you found an effect, but you really did not, and you did not because you manipulated the study design to stack the chances in your favor of finding a significant effect. One method is to keep increasing the sample size. Among statisticians this is called "fishing." The more you sample, the more likely you are to find an outcome. The more you cast your fishing line into the lake, the more likely you are to catch a fish eventually. Does catching a fish after, say, 300 tries make you a fisherman? Nope. It means that if you keep looking, eventually you will find something. Interestingly, if you run 100 tests at an alpha level of .05, you expect about five of them to come out significant simply due to random luck. But it does not mean you found something.
A second method is to conduct more statistical tests with lots of outcome variables. The statistical term for this is "increasing the Type I error rate" or "increasing the experiment-wise error rate." This method is like fishing: if you have 100 dependent variables and run 100 statistical tests, on average about five of those tests will come out significant due to chance alone, as the sketch below shows. The scenario is mostly conceptual: it is hard to believe any study would have 100 separate dependent variables, and running a statistical test 100 times on the same sample does not change the significance of any single result; the problem is the growing chance that at least some of the "findings" are flukes.
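Here is a minimal sketch of that multiple-testing problem (the group size, number of outcomes, and seed are assumptions for illustration): 100 t-tests are run on data where no effect exists anywhere, and about five of them still come out significant at α = .05.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
n_per_group, n_outcomes, alpha = 50, 100, 0.05   # assumed illustrative values

significant = 0
for _ in range(n_outcomes):
    treatment = rng.normal(0, 1, n_per_group)    # no true effect on any outcome
    control = rng.normal(0, 1, n_per_group)
    if stats.ttest_ind(treatment, control).pvalue < alpha:
        significant += 1

# Expected count is about alpha * n_outcomes = 5, by chance alone.
print(f"{significant} of {n_outcomes} tests came out significant")
```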
A third method is to keep sampling, which usually widens the sampling range, that is, you get more varied participants in the sample. The more variation, the more likely you are to find a pattern in that variation. If a variable ranges from 1 to 10 and the sample you obtain sits on the high end, say from 6 to 9, your range is truncated, and you likely will not get a significant result because there is not much of a range in which to establish a pattern. If you keep sampling, you will eventually get observations that fall in the lower and upper regions of the variable: observations below 6 and more observations at 10. The more range in the sampling, the more chances there are to find an effect, as the sketch below illustrates.
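A small sketch of that range-restriction point (the variable, the strength of the underlying relationship, and the truncation window are assumptions for illustration): the same underlying relationship looks much weaker when the sample is truncated to scores of 6 to 9 on a 1-to-10 scale.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 2_000

predictor = rng.uniform(1, 10, n)                 # full 1-10 range
outcome = 0.5 * predictor + rng.normal(0, 2, n)   # assumed true relationship plus noise

full_r = np.corrcoef(predictor, outcome)[0, 1]

mask = (predictor >= 6) & (predictor <= 9)        # truncated, high-end sample
restricted_r = np.corrcoef(predictor[mask], outcome[mask])[0, 1]

print(f"correlation over the full range: {full_r:.2f}")
print(f"correlation over the 6-9 range:  {restricted_r:.2f}")   # noticeably smaller
```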
The lesson in all of this? The research design, in this case, the sampling frame and the number of dependent variables, can be engineered to find a significant effect. The lesson is to always conceptualize. By conceptualizing the relationships between your variables, you can determine if the effect you found tracks or aligns with your expectations of how those variables should be associated with each other.
Which error is worse, Type I or Type II? It depends on whether you want to find an effect or not. When you are testing treatments, you want to be confident that you have found a real effect, so you do not want to make a Type I error; you only accept results that are very unlikely to occur by chance so that you can be confident you have a real treatment effect. When making diagnoses, you want to err on the side of caution and not make a Type II error. If you assess a child and say they do not have autism when, in fact, they do, that is a Type II error. You do not want to miss the diagnosis, so you err on the side of caution to avoid making a Type II error. If you tested a treatment for migraine headaches and some patients did get better, but overall your treatment showed no effect, you made a Type II error. But you might still offer the treatment, because even if it did not work for everyone, it might work for you, and if you suffer from migraines, you will take any chance at treatment, particularly if the risks are minimal. So, there is no inherent answer to whether Type I or Type II errors are worse; the answer is based on your conceptualization of the research issue. If we had to default to something, though, Type I errors would be more problematic. Everyone wants to find results because results lead to publication and funding for more research. We tout our results, but we are probably making Type I errors every time we think we found something significant. If anything, we are probably more prone to making Type I errors.
If you are wondering what to do to minimize Type I and Type II errors, the answer is always to replicate the results to confirm the answer. If you are on the fence about whether your result is a Type I or Type II error, then a consultation is needed. The problem with the p value is that EVERYTHING is based on the p value. We will return to this issue at the end of this section, but it is not good to base our decisions on one piece of information that is fickle. Certainly, you do not date a person simply because they make over $80,000 per year. What if someone makes $78,000 per year? Do we dismiss that person??? The answer is no, because there are other qualities we desire in a person, such as personality, kindness, and sense of humor. If a person has other good qualities, the $78,000 per year is still good, even if their salary does not make the cutoff. Yes, give a person a chance to date you, even if they do not meet all of your qualification thresholds.
Turning back to psychology statistics, we have to look at information other than the p value: the sample size, the means, the standard deviations, the goal of the study, and the sampling method. All of those are considerations; how they are put together to make a decision is discussed later. For now, keep in mind that when you are fussing over whether a result is significant at p < .05, disengage from the p value and look at other factors to help with your decision. And always consult. It is not good practice to declare significance, and hence the value of the study results, based on one piece of information, the p value. More later.
8.10.4: Type I or II Errors and Why We Set the P value at .05
Who died and said for all time, “The p value is at .05?” What happens if you move this sacred special p value of .05 by increasing or decreasing it? Or why not just set the p value at .05 for all time and be done with it? Why tinker with it? And do you even give a rat’s ass about this?
The answers to these questions are: a) increasing or decreasing the p value cutoff increases or decreases the chances of making a Type I or Type II error; b) we set the p value at .05 because it is the best compromise between making a Type I and a Type II error; c) you care because there are times when, to guard against consequences or to widen the range of possible positive outcomes, you would rather risk one type of error than the other.
BTW, there is no easy way for anyone to explain this concept. It takes a lot of thought, and you have to study the material; reading about the issue once will not be enough. You must come up with your own way of thinking about it.
Setting the p value at .05 means that you are only interested in results that would occur by chance 5% of the time or less.
You would be right to ask this question:
Q: “Why not set the p value at .01? That way, you only keep the most clearly significant results.”
A: “Because with such a strict and narrow range, you might miss out on a significant result.”
Conversely, you would be right to ask this question:
Q: “Why not set the p value at .10? That way, you can get a lot of significant results.”
A: “Because with such a loose and wide range, you will get results that are actually not significant.”
These situations illustrate what happens when you increase or decrease the p value; doing so changes how you decide if a result is significant or not significant.
Keep in mind that Type I and Type II errors are inversely related: as the chance of one error increases, the chance of the other decreases (see the sketch after the list below).
To illustrate this point, suppose alpha is set at 5%:
- If you decrease the alpha level (i.e., decrease alpha to 1%), you lower the possibility of committing a Type I error but increase the likelihood of committing a Type II error.
- If you increase the alpha level (i.e., increase alpha to 10%), you lower the possibility of committing a Type II error but increase the likelihood of committing a Type I error.
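Before the worked score example that follows, here is a hedged numeric sketch of that trade-off (the effect size and group size are assumptions for illustration, and the calculation uses a simple normal approximation to a two-sided two-sample test): as alpha shrinks from .10 to .01, beta, the Type II risk, grows.

```python
from scipy import stats

effect_size, n_per_group = 0.4, 50            # assumed illustrative values
se = (2 / n_per_group) ** 0.5                 # SE of the mean difference, in SD units

for alpha in (0.01, 0.05, 0.10):
    z_crit = stats.norm.ppf(1 - alpha / 2)    # two-sided critical value
    power = 1 - stats.norm.cdf(z_crit - effect_size / se)
    beta = 1 - power                          # probability of a Type II error
    print(f"alpha = {alpha:.2f}  ->  beta = {beta:.2f}")
```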
To demonstrate this, let’s use a score result as our example:
- Coping skills: The range is 1 to 10; a score of 10 indicates good coping skills.
- The TX group’s mean score is 8. The control group’s mean score is 5.
- Set p = .05.
- The difference between the TX group and the control group is 3 pts.
- Suppose we test if that difference is significant. Our t-test value = 3.50, p = .03.
- That p value of .03 is lower than .05; therefore, the result is significant.
- Automatically, I made a Type I error. I found a significant result, but I could be wrong; the result really is not significant.
What happens when I decrease the p value?
- Decrease p value from .05 to .01.
- Let’s use the same t-test value = 3.50, p = .03.
- Now, with p = .01, that same result becomes not significant.
- I have now made a Type II error. I did not find a significant result, but I could be wrong; the result is significant.
- So, decreasing the p value increases the chance of making a Type II error.
Let’s take another score result.
- Coping skills: The range is 1 to 10; a score of 10 indicates good coping skills.
- The TX group’s mean score is 7. The control group’s mean score is 5.
- Set p = .05.
- The difference between the TX group and the control group is 2 pts.
- Suppose we test if that difference is significant. Our t-test value = 1.70, p = .07.
- That p value of .07 is higher than .05; therefore, the result is not significant.
- Automatically – I made a Type II error. I did not find a significant result, but I could be wrong; the result is really significant.
What happens when I increase the p value?
- Increase the p value from .05 to .10.
- Let’s use the same t-test value = 1.70, p = .07.
- Now, with p = .10, that same result becomes significant.
- I have now made a Type I error. I found a significant result, but I could be wrong; the result really is not significant.
- So, increasing the p value increases the chance of making a Type I error.
So, we decrease the p value (.05 to .01), and we increase the Type II error. We increase the p value (.05 to .10), and we increase the Type I error.
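The two walkthroughs reduce to a simple decision rule, sketched below with the two p values from the examples: nothing about the data changes, yet each result flips between significant and not significant as the cutoff moves.

```python
def decide(p_value: float, alpha: float) -> str:
    """Return the significance call for a p value at a given alpha cutoff."""
    return "significant" if p_value < alpha else "not significant"

for p in (0.03, 0.07):                 # the p values from the two worked examples
    for alpha in (0.01, 0.05, 0.10):
        print(f"p = {p:.2f} at alpha = {alpha:.2f}: {decide(p, alpha)}")
```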
There is no way around committing one of these errors. There is no situation where we are certain that we are not committing a Type I or a Type II error. No matter what we do to the p value, there is always uncertainty.
And that is why we set the p value at .05: it is the best of both worlds or, put differently, the best value we can choose to guard against making either a Type I or a Type II error.
8.10.5: Why Increase or Decrease the P value from .05?
If p < .05 is the best of both worlds, why bother adjusting it?
Q: “Why fuss with this? Just keep the p value at .05 and be done!!”
A: “Because there are situations when you want to increase or decrease the p value.”
BTW, how do you increase or decrease the p value? It’s very easy. You just declare it. You say, “instead of p < .05, it is now p < .01” or “instead of p < .05, it is now p < .10.” There is no mathematical formula, nothing for statistical programs to do; you enter the desired p value in an “options” section of the statistical program. And why .01 or .10? Conventional guidelines. It is much easier to understand .01 or .10. You could set it to anything you want, but nothing important is added by setting it to .03 or .12.
In what scenario would you want to decrease the p value from .05 to .01? When you want a strict cutoff and only results with a very low probability of occurring by chance. Remember, the .01 is 1%, which means you have very high standards. Conceptually, you only want the best of the best when you set your alpha level at .01.
Let’s take a humorous example. Your friends say that their criterion for selecting a potential date is someone who makes $80,000 per year, but you set your salary criterion at $100,000 per year. You lowered your p value from .05 to .01. You are raising your standards and restricting the acceptable salary range for a potential date. There is nothing wrong with having high standards, because you only want the best of the best. However, the trade-off is this: you might miss out on a great partner simply because they make $90,000 per year. If you are okay with that trade-off, then so be it.
Let’s take a medical example. You are testing medications for cancer that really need to work. You set p = .01 because you only want those medications that have a high probability of having a treatment effect, and you do not want any medication that does not have a treatment effect. You do not want a Type I error. You don't want false positives. You do not want medication on the market that you say works but really does not. No false hopes…. However, the tradeoff is this: you might be missing out on effective medications simply because they did not quite get the effect you wanted to see across everyone who received the medication. If you are okay with that trade-off, then so be it.
In what scenario would you want to increase the p value from .05 to .10? When you want a loose cutoff and more results. Remember, the .10 is 10%, which means you are lowering your standards. Conceptually, you accept a wider range of results when you set your alpha level at .10.
Let’s take a humorous example. Your friends say that their criterion for selecting a potential date is someone who makes $80,000 per year, but you set your salary criterion at $50,000 per year. You raised your p value from .05 to .10. You are lowering your standards and widening the range of acceptable salaries for your potential date. There is nothing wrong with lowering standards, because you want a wider variety of potential partners to choose from. However, the trade-off is this: you might end up dating a partner who is not that great for you because they make $50,000 per year. In other words, you are settling. If you are okay with that trade-off, then so be it.
Let’s take a medical example. You are testing medications for migraines. You set p = .10 because if the medication has any chance of working, you will take it, assuming there is minimal risk in taking it. If you have a migraine, you will try anything that works. It may not have worked for others, but if it worked for you, then great. If it did not work for you, if it was a false positive, a Type I error, then fine. As a migraine sufferer, you move on and try the next medication, because you will try anything that might work for you.
At the end of the day, unless you have a conceptual reason for shifting the p value because there is something of value regarding the need to be comfortable with making a Type I or Type II error, you are better off leaving it alone or leaving it at p < .05. Never shift the p value to make your results and yourself look better.
8.10.6: Here is the Kicker: Abandon P values
A valid response to all this discussion about p values is “what a lot of fuss!”
A takeaway message from the discussion about p values is that significance values can be easily manipulated. You simply increase or decrease them. You can make anything significant by increasing the p value; you can make anything not significant by decreasing the p value. It is a fickle piece of statistical information.
When you read these phrases, the messiness of making conclusions based on the p value is compounded.
- p < .001
- “Highly significant”
- “Trending towards significance” or “approaching significance.”
Never use any of the above phrases. The problem with all of them is that people think there is some value in the number itself when interpreting a significance value. The p value is simply compared to a cutoff. The only question you can ask when reading the p value is "Is the statistical test significant?" and the only two answers are yes or no. There is no such thing as highly significant. Just because a p value is .0001 does not mean the test result is better than one whose p value is .01. When someone says that a result is highly significant, what does that mean? Why is one statistical test with a .0001 significance level better than one with a .01 level? A smaller p value means only that the result is less likely to occur by chance, and anything can have a low probability of occurring by chance. There is no need to compare the importance of results by comparing their significance values; p values are not intended for that purpose.
Listing p values such as * p < .05, ** p < .01, *** p < .001 is a practice with no meaning. Results are significant or not significant. There is no such thing as more significance based on the value of the significance test.
The phrases “trending towards significance” or “approaching significance” are misleading. Researchers think that p = .06 is “close enough,” so they might as well call the result significant. Essentially, this practice is akin to asserting a Type II error: you are claiming that results labeled not significant really are significant, so you might as well call them significant. There simply is no basis, evidence, criterion, or consensus that a p value of .06 should be considered significant when the cutoff is p < .05. For that matter, what makes a .07 not significant but a .06 significant? Is a difference of 1% enough to make a study’s results not significant? The reasoning is a stretch.
Everything discussed here may lead you to say that, from this point forward, you hate the significance test and the p value. We simply cannot put so much emphasis on one piece of statistical information, the p value, to make or break a study. As of this writing, there is a growing movement among statisticians to abandon the p value. If statisticians say to abandon the p value, then you ought to listen. The movement does not mean ignoring the p value altogether. Statisticians are saying never to decide the value of a study and its statistical results based exclusively on the p value. Keep in mind that they do not want us to rely solely on the p value; they want us to use it as just one piece of statistical information, alongside other statistical information and the research design, to inform our evaluation of the validity of a research study.


