11.10: Summary
 Page ID
 8247
Null hypothesis testing is one of the most ubiquitous elements to statistical theory. The vast majority of scientific papers report the results of some hypothesis test or another. As a consequence it is almost impossible to get by in science without having at least a cursory understanding of what a pvalue means, making this one of the most important chapters in the book. As usual, I’ll end the chapter with a quick recap of the key ideas that we’ve talked about:
 Research hypotheses and statistical hypotheses. Null and alternative hypotheses. (Section 11.1).
 Type 1 and Type 2 errors (Section 11.2)
 Test statistics and sampling distributions (Section 11.3)
 Hypothesis testing as a decision making process (Section 11.4)
 pvalues as “soft” decisions (Section 11.5)
 Writing up the results of a hypothesis test (Section 11.6)
 Effect size and power (Section 11.8)
 A few issues to consider regarding hypothesis testing (Section 11.9)
Later in the book, in Chapter 17, I’ll revisit the theory of null hypothesis tests from a Bayesian perspective, and introduce a number of new tools that you can use if you aren’t particularly fond of the orthodox approach. But for now, though, we’re done with the abstract statistical theory, and we can start discussing specific data analysis tools.
References
Cohen, J. 1988. Statistical Power Analysis for the Behavioral Sciences. 2nd ed. Lawrence Erlbaum.
Ellis, P. D. 2010. The Essential Guide to Effect Sizes: Statistical Power, MetaAnalysis, and the Interpretation of Research Results. Cambridge, UK: Cambridge University Press.
Lehmann, Erich L. 2011. Fisher, Neyman, and the Creation of Classical Statistics. Springer.
Gelman, A., and H. Stern. 2006. “The Difference Between ‘Significant’ and ‘Not Significant’ Is Not Itself Statistically Significant.” The American Statistician 60: 328–31.

The quote comes from Wittgenstein’s (1922) text, Tractatus LogicoPhilosphicus.

A technical note. The description below differs subtly from the standard description given in a lot of introductory texts. The orthodox theory of null hypothesis testing emerged from the work of Sir Ronald Fisher and Jerzy Neyman in the early 20th century; but Fisher and Neyman actually had very different views about how it should work. The standard treatment of hypothesis testing that most texts use is a hybrid of the two approaches. The treatment here is a little more Neymanstyle than the orthodox view, especially as regards the meaning of the p value.

My apologies to anyone who actually believes in this stuff, but on my reading of the literature on ESP, it’s just not reasonable to think this is real. To be fair, though, some of the studies are rigorously designed; so it’s actually an interesting area for thinking about psychological research design. And of course it’s a free country, so you can spend your own time and effort proving me wrong if you like, but I wouldn’t think that’s a terribly practical use of your intellect.

This analogy only works if you’re from an adversarial legal system like UK/US/Australia. As I understand these things, the French inquisitorial system is quite different.

An aside regarding the language you use to talk about hypothesis testing. Firstly, one thing you really want to avoid is the word “prove”: a statistical test really doesn’t prove that a hypothesis is true or false. Proof implies certainty, and as the saying goes, statistics means never having to say you’re certain. On that point almost everyone would agree. However, beyond that there’s a fair amount of confusion. Some people argue that you’re only allowed to make statements like “rejected the null”, “failed to reject the null”, or possibly “retained the null”. According to this line of thinking, you can’t say things like “accept the alternative” or “accept the null”. Personally I think this is too strong: in my opinion, this conflates null hypothesis testing with Karl Popper’s falsificationist view of the scientific process. While there are similarities between falsificationism and null hypothesis testing, they aren’t equivalent. However, while I personally think it’s fine to talk about accepting a hypothesis (on the proviso that “acceptance” doesn’t actually mean that it’s necessarily true, especially in the case of the null hypothesis), many people will disagree. And more to the point, you should be aware that this particular weirdness exists, so that you’re not caught unawares by it when writing up your own results.

Strictly speaking, the test I just constructed has α=.057, which is a bit too generous. However, if I’d chosen 39 and 61 to be the boundaries for the critical region, then the critical region only covers 3.5% of the distribution. I figured that it makes more sense to use 40 and 60 as my critical values, and be willing to tolerate a 5.7% type I error rate, since that’s as close as I can get to a value of α=.05.

The internet seems fairly convinced that Ashley said this, though I can’t for the life of me find anyone willing to give a source for the claim.

That’s p=.000000000000000000000000136 for folks that don’t like scientific notation!

Note that the
p
here has nothing to do with a p value. Thep
argument in thebinom.test()
function corresponds to the probability of making a correct response, according to the null hypothesis. In other words, it’s the θ value. 
There’s an R package called
compute.es
that can be used for calculating a very broad range of effect size measures; but for the purposes of the current book we won’t need it: all of the effect size measures that I’ll talk about here have functions in thelsr
package 
Although in practice a very small effect size is worrying, because even very minor methodological flaws might be responsible for the effect; and in practice no experiment is perfect, so there are always methodological issues to worry about.

Notice that the true population parameter θ doesn’t necessarily correspond to an immutable fact of nature. In this context θ is just the true probability that people would correctly guess the colour of the card in the other room. As such the population parameter can be influenced by all sorts of things. Of course, this is all on the assumption that ESP actually exists!

Although this book describes both Neyman’s and Fisher’s definition of the p value, most don’t. Most introductory textbooks will only give you the Fisher version.

In this case, the Pearson chisquare test of independence (Chapter 12;
chisq.test()
in R) is what we use; see also theprop.test()
function.