12.6: The Most Typical Way to Do Chi-square Tests in R

Last updated
Save as PDF

Page ID: 4019

\( \newcommand{\vecs}[1]{\overset { \scriptstyle \rightharpoonup} {\mathbf{#1}} } \) \( \newcommand{\vecd}[1]{\overset{-\!-\!\rightharpoonup}{\vphantom{a}\smash {#1}}} \)\(\newcommand{\id}{\mathrm{id}}\) \( \newcommand{\Span}{\mathrm{span}}\) \( \newcommand{\kernel}{\mathrm{null}\,}\) \( \newcommand{\range}{\mathrm{range}\,}\) \( \newcommand{\RealPart}{\mathrm{Re}}\) \( \newcommand{\ImaginaryPart}{\mathrm{Im}}\) \( \newcommand{\Argument}{\mathrm{Arg}}\) \( \newcommand{\norm}[1]{\| #1 \|}\) \( \newcommand{\inner}[2]{\langle #1, #2 \rangle}\) \( \newcommand{\Span}{\mathrm{span}}\) \(\newcommand{\id}{\mathrm{id}}\) \( \newcommand{\Span}{\mathrm{span}}\) \( \newcommand{\kernel}{\mathrm{null}\,}\) \( \newcommand{\range}{\mathrm{range}\,}\) \( \newcommand{\RealPart}{\mathrm{Re}}\) \( \newcommand{\ImaginaryPart}{\mathrm{Im}}\) \( \newcommand{\Argument}{\mathrm{Arg}}\) \( \newcommand{\norm}[1]{\| #1 \|}\) \( \newcommand{\inner}[2]{\langle #1, #2 \rangle}\) \( \newcommand{\Span}{\mathrm{span}}\)\(\newcommand{\AA}{\unicode[.8,0]{x212B}}\)

When discussing how to do a chi-square goodness of fit test (Section 12.1.7) and the chi-square test of independence (Section 12.2.2), I introduced you to two separate functions in the lsr package. We ran our goodness of fit tests using the goodnessOfFitTest() function, and our tests of independence (or association) using the associationTest() function. And both of those functions produced quite detailed output, showing you the relevant descriptive statistics, printing out explicit reminders of what the hypotheses are, and so on. When you’re first starting out, it can be very handy to be given this sort of guidance. However, once you start becoming a bit more proficient in statistics and in R it can start to get very tiresome. A real statistician hardly needs to be told what the null and alternative hypotheses for a chi-square test are, and if an advanced R user wants the descriptive statistics to be printed out, they know how to produce them!

For this reason, the basic chisq.test() function in R is a lot more terse in its output, and because the mathematics that underpin the goodness of fit test and the test of independence is basically the same in each case, it can run either test depending on what kind of input it is given. First, here’s the goodness of fit test. Suppose you have the frequency table observed that we used earlier,

observed

## 
##    clubs diamonds   hearts   spades 
##       35       51       64       50

If you want to run the goodness of fit test against the hypothesis that all four suits are equally likely to appear, then all you need to do is input this frequenct table to the chisq.test() function:

chisq.test( x = observed )

## 
##  Chi-squared test for given probabilities
## 
## data:  observed
## X-squared = 8.44, df = 3, p-value = 0.03774

Notice that the output is very compressed in comparison to the goodnessOfFitTest() function. It doesn’t bother to give you any descriptive statistics, it doesn’t tell you what null hypothesis is being tested, and so on. And as long as you already understand the test, that’s not a problem. Once you start getting familiar with R and with statistics, you’ll probably find that you prefer this simple output rather than the rather lengthy output that goodnessOfFitTest() produces. Anyway, if you want to change the null hypothesis, it’s exactly the same as before, just specify the probabilities using the p argument. For instance:

chisq.test( x = observed, p = c(.2, .3, .3, .2) )

## 
##  Chi-squared test for given probabilities
## 
## data:  observed
## X-squared = 4.7417, df = 3, p-value = 0.1917

Again, these are the same numbers that the goodnessOfFitTest() function reports at the end of the output. It just hasn’t included any of the other details.

What about a test of independence? As it turns out, the chisq.test() function is pretty clever.¹⁸⁰ If you input a cross-tabulation rather than a simple frequency table, it realises that you’re asking for a test of independence and not a goodness of fit test. Recall that we already have this cross-tabulation stored as the chapekFrequencies variable:

chapekFrequencies

##         species
## choice   robot human
##   puppy     13    15
##   flower    30    13
##   data      44    65

To get the test of independence, all we have to do is feed this frequency table into the chisq.test() function like so:

chisq.test( chapekFrequencies )

## 
##  Pearson's Chi-squared test
## 
## data:  chapekFrequencies
## X-squared = 10.722, df = 2, p-value = 0.004697

Again, the numbers are the same as last time, it’s just that the output is very terse and doesn’t really explain what’s going on in the rather tedious way that associationTest() does. As before, my intuition is that when you’re just getting started it’s easier to use something like associationTest() because it shows you more detail about what’s going on, but later on you’ll probably find that chisq.test() is more convenient.