12.6: The Most Typical Way to Do Chi-square Tests in R
When discussing how to do a chi-square goodness of fit test (Section 12.1.7) and the chi-square test of independence (Section 12.2.2), I introduced you to two separate functions in the lsr package. We ran our goodness of fit tests using the goodnessOfFitTest() function, and our tests of independence (or association) using the associationTest() function. Both of those functions produce quite detailed output, showing you the relevant descriptive statistics, printing out explicit reminders of what the hypotheses are, and so on. When you’re first starting out, it can be very handy to be given this sort of guidance. However, once you become a bit more proficient in statistics and in R, it can start to get very tiresome. A real statistician hardly needs to be told what the null and alternative hypotheses for a chi-square test are, and if an advanced R user wants the descriptive statistics to be printed out, they know how to produce them!
For this reason, the basic chisq.test() function in R is a lot more terse in its output, and because the mathematics that underpins the goodness of fit test and the test of independence is basically the same in each case, it can run either test depending on what kind of input it is given. First, here’s the goodness of fit test. Suppose you have the frequency table observed that we used earlier,
observed
##
## clubs diamonds hearts spades
## 35 51 64 50
If you want to run the goodness of fit test against the hypothesis that all four suits are equally likely to appear, then all you need to do is input this frequency table to the chisq.test() function:
chisq.test( x = observed )
##
## Chi-squared test for given probabilities
##
## data: observed
## X-squared = 8.44, df = 3, p-value = 0.03774
Notice that the output is very compressed in comparison to the goodnessOfFitTest() function. It doesn’t bother to give you any descriptive statistics, it doesn’t tell you what null hypothesis is being tested, and so on. As long as you already understand the test, that’s not a problem. Once you start getting familiar with R and with statistics, you’ll probably find that you prefer this simple output to the lengthy output that goodnessOfFitTest() produces. Anyway, if you want to change the null hypothesis, it’s exactly the same as before: just specify the probabilities using the p argument. For instance:
chisq.test( x = observed, p = c(.2, .3, .3, .2) )
##
## Chi-squared test for given probabilities
##
## data: observed
## X-squared = 4.7417, df = 3, p-value = 0.1917
Again, these are the same numbers that the goodnessOfFitTest() function reports at the end of the output. It just hasn’t included any of the other details.
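To see that nothing is hidden behind the terse output, the numbers can be reproduced by hand. What follows is only a minimal sketch: it assumes the observed table is rebuilt from the counts printed above rather than computed from the original data, and the names nullProbs, expected, X2, degFree, pValue and result are hypothetical ones introduced for illustration.

```r
# Rebuild the frequency table from the counts printed above
# (an assumption for illustration; the chapter derives it from the raw data)
observed <- c(clubs = 35, diamonds = 51, hearts = 64, spades = 50)

# Null hypothesis probabilities from the second example
nullProbs <- c(.2, .3, .3, .2)

# Expected frequencies under the null, and the Pearson chi-square statistic
expected <- sum(observed) * nullProbs
X2 <- sum((observed - expected)^2 / expected)

# Degrees of freedom: number of categories minus one
degFree <- length(observed) - 1

# Upper tail probability of the chi-square distribution
pValue <- pchisq(X2, df = degFree, lower.tail = FALSE)

# chisq.test() returns an "htest" object that stores the same quantities,
# including details that the printed output leaves out
result <- chisq.test(x = observed, p = nullProbs)
result$statistic   # the X-squared value
result$parameter   # the degrees of freedom
result$p.value     # the p-value
result$expected    # expected frequencies under the null
```

Storing the result in a variable is often the more convenient habit, since it lets you pull out the expected frequencies or the p-value later without rerunning the test.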
What about a test of independence? As it turns out, the chisq.test() function is pretty clever.
If you input a cross-tabulation rather than a simple frequency table, it realises that you’re asking for a test of independence and not a goodness of fit test. Recall that we already have this cross-tabulation stored as the chapekFrequencies variable:
chapekFrequencies
## species
## choice robot human
## puppy 13 15
## flower 30 13
## data 44 65
To get the test of independence, all we have to do is feed this frequency table into the chisq.test() function like so:
chisq.test( chapekFrequencies )
##
## Pearson's Chi-squared test
##
## data: chapekFrequencies
## X-squared = 10.722, df = 2, p-value = 0.004697
Again, the numbers are the same as last time. It’s just that the output is very terse and doesn’t really explain what’s going on in the rather tedious way that associationTest() does. As before, my intuition is that when you’re just getting started it’s easier to use something like associationTest() because it shows you more detail about what’s going on, but later on you’ll probably find that chisq.test() is more convenient.
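The test of independence can be checked by hand in the same spirit. This is a rough sketch under stated assumptions: the cross-tabulation is rebuilt with matrix() from the counts printed above instead of being computed from the original data, and the names expected, X2, degFree and result are hypothetical.

```r
# Rebuild the cross-tabulation from the printed counts
# (an assumption for illustration; matrix() fills column by column)
chapekFrequencies <- matrix(
  c(13, 30, 44,    # robot column
    15, 13, 65),   # human column
  nrow = 3,
  dimnames = list(choice  = c("puppy", "flower", "data"),
                  species = c("robot", "human"))
)

# Expected counts under independence: (row total * column total) / N
expected <- outer(rowSums(chapekFrequencies),
                  colSums(chapekFrequencies)) / sum(chapekFrequencies)

# Pearson chi-square statistic and its degrees of freedom
X2 <- sum((chapekFrequencies - expected)^2 / expected)
degFree <- (nrow(chapekFrequencies) - 1) * (ncol(chapekFrequencies) - 1)
pValue <- pchisq(X2, df = degFree, lower.tail = FALSE)

# The same quantities as stored by chisq.test()
result <- chisq.test(chapekFrequencies)
result$statistic
result$parameter
result$expected
```

One thing to keep in mind if you try this with a 2 x 2 table: by default chisq.test() applies a continuity correction in that case (the correct argument defaults to TRUE), so the hand calculation above would only match if you set correct = FALSE. The table here is 3 x 2, so no correction is applied.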