Lab 9: Categorical Data
Objectives:
1. Understand how to analyze categorical data
2. Understand how to perform chi-square tests in R.
Definitions:
- categorical (qualitative) data
- chi-square distribution
- observed vs. expected counts
- goodness-of-fit test
- contingency table
- test of homogeneity
- test of independence
Introduction:
Recall that categorical data is data based on some attribute or characteristic. The observations fall into categories. Up to this point, we have performed hypothesis tests primarily about population means. But if we are interested in testing claims about categorical data, then we need a new approach, since we cannot compute means for categorical variables. Instead we focus on proportions, and we have only developed tests for comparing two proportions at a time. In this lab, we will look at methods to analyze relationships between categorical variables and to check how well a probability model fits a single categorical variable.
Activities:
Getting Organized: If you are already organized, and remember the basic protocol from previous labs, you can skip this section.
Navigate to your class folder structure. Within your "Labs" folder make a subfolder called "Lab9". Next, download the lab notebook .Rmd file for this lab from Blackboard and save it in your "Lab9" folder. There are no datasets used in this lab.
Within RStudio, navigate to your "Lab9" folder via the file browser in the lower right pane and then click "More > Set as working directory". Get set to write your observations and R commands in an R Markdown file by opening the "lab9_notebook.Rmd" file in RStudio. Remember to add your name as the author in line 3 of the document. For this lab, enter all of your commands into code chunks in the lab notebook. You can still experiment with code in an R script, if you want. To set up an R Script in RStudio, in the upper left corner click “File > New File > R script”. A new tab should open up in the upper left pane of RStudio.
Goodness-of-Fit Tests: In class on Tuesday, we considered whether any one day of the week is more or less likely to be a person’s birthday than any other day of the week. Let \(p_{\text{M}}\) denote the proportion of all people who were born on a Monday, or equivalently, the probability that a randomly selected person was born on a Monday. Similarly, define \(p_{\text{Tu}}\), \(p_{\text{W}}\), \(p_{\text{Th}}\), \(p_{\text{F}}\), \(p_{\text{Sa}}\), and \(p_{\text{Su}}\). We are testing the following hypotheses:
\(H_0: p_{\text{M}} = p_{\text{Tu}} = p_{\text{W}} = p_{\text{Th}} = p_{\text{F}} = p_{\text{Sa}} = p_{\text{Su}} = 1/7\)
\(H_A: p_i \neq 1/7\) for at least one day of the week
In other words, we are testing whether the probability model stated in the null hypothesis fits the data well.
To test these hypotheses, you created a version of the following table:
| Days of the Week | Mon | Tues | Wed | Thu | Fri | Sat | Sun | Total: \(n\) |
|---|---|---|---|---|---|---|---|---|
| Observed counts: \(O_i\) | 17 | 26 | 22 | 23 | 19 | 15 | 25 | 147 |
| Expected counts: \(E_i = np_i\) | 21 | 21 | 21 | 21 | 21 | 21 | 21 | 147 |
| \((O_i - E_i)^2 / E_i\) | 0.76 | 1.19 | 0.05 | 0.19 | 0.19 | 1.71 | 0.76 | 4.86 |
Under the null hypothesis, the test statistic (4.86 in this case) approximately follows a chi-square distribution, with degrees of freedom equal to the number of categories (i.e., days of the week) minus one. The \(P\)-value is the area to the right of the observed statistic, which is calculated in R as follows:
pchisq(4.86, df = 6, lower.tail = FALSE)
## [1] 0.5618907
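The whole calculation in the table can also be reproduced step by step in R. The following is a minimal sketch using the observed counts from the table; the object names (observed, expected, contributions, x2, p_value) are chosen here for illustration and are not part of the lab notebook.

```r
# Observed birthday counts, Monday through Sunday (from the table above)
observed = c(17, 26, 22, 23, 19, 15, 25)

# Under the null hypothesis every day has probability 1/7
n = sum(observed)
expected = n * rep(1/7, 7)

# Contribution of each day, and their sum (the chi-square test statistic)
contributions = (observed - expected)^2 / expected
x2 = sum(contributions)

# P-value: upper-tail area with (number of categories - 1) degrees of freedom
p_value = pchisq(x2, df = length(observed) - 1, lower.tail = FALSE)

round(contributions, 2)
x2
p_value
```

Because this sketch keeps the unrounded test statistic (about 4.857) rather than 4.86, the resulting P-value is about 0.562 and differs very slightly from the value computed with the rounded statistic.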
Pause for Reflection #1:
Suppose we suspect that weekend days are less likely to be a birthday, perhaps because doctors want the weekend off and so do not schedule Caesarean deliveries for weekends. Let’s test whether the data provide evidence against the hypothesis that weekend days are half as likely as other days to be someone’s birthday and that all weekdays are equally likely.
- State the hypotheses being tested in this case. The null hypothesis should give the proposed probability model for the data. Note that not all the days of the week will have the same probabilities, but we will still need the probabilities to add up to 1.
- Redo the table above to calculate the test statistic in this case. Note that we are using the same data, so the observed counts stay the same, but the expected counts will change.
- Alter the R code above to calculate the corresponding \(P\)-value and state the conclusion of the test.
- Which category (day) has the largest contribution to the test statistic? Explain what this reveals.
_____________________________________________________
Chi-Square Test in R: As you may have already guessed, there is a function in R, chisq.test(), that performs the calculations you just did. To use this function, store the observed counts in a vector:
birthdays = c(17, 26, 22, 23, 19, 15, 25)
For the test that each day of the week is equally likely, all we have to do is call the chisq.test() function on the object containing the observed counts as follows, since by default R tests the data against the null hypothesis that all probabilities are equal:
chisq.test(birthdays)
##
##  Chi-squared test for given probabilities
##
## data:  birthdays
## X-squared = 4.8571, df = 6, p-value = 0.5623
The output above gives the value of the observed test statistic X-squared and the degrees of freedom df for the chi-square distribution used to calculate the corresponding p-value.
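The individual pieces of this output can also be pulled out of the object that chisq.test() returns, which is handy when you want to reuse a value in a later calculation. The sketch below assumes the result is stored in an object called result (a name chosen here for illustration).

```r
# Observed birthday counts, Monday through Sunday
birthdays = c(17, 26, 22, 23, 19, 15, 25)

# Store the test result instead of only printing it
result = chisq.test(birthdays)

result$statistic   # observed test statistic (X-squared)
result$parameter   # degrees of freedom (df)
result$p.value     # P-value
result$expected    # expected counts under the null hypothesis
```

The components used here (statistic, parameter, p.value, expected) are standard elements of the object returned by chisq.test().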
For the test that weekend days are half as likely as other days, we need to specify the probabilities stated in the null hypothesis in the chisq.test() function as follows:
probs = c(rep(1/6, 5), 1/12, 1/12)
chisq.test(birthdays, p = probs)
Pause for Reflection #2:
Explain the code above, specifically the line defining the object probs. Does the output of the chisq.test match the results you found in Reflection #1?
_____________________________________________________
Newspaper Reading: Are Americans today less likely to read a newspaper every day than in previous years? The General Social Survey (GSS) interviews a random sample of adult Americans every two years, and one of the questions asks respondents, "How often do you read the newspaper?" Sample results for the years 1978, 1988, 1998, 2008, and 2018 are given in the contingency table below.
| | 1978 | 1988 | 1998 | 2008 | 2018 | total |
|---|---|---|---|---|---|---|
| Every day | 874 | 500 | 805 | 431 | 312 | 2922 |
| Not every day | 654 | 488 | 1065 | 898 | 1247 | 4352 |
| total | 1528 | 988 | 1870 | 1329 | 1559 | 7274 |
In asking whether or not these sample data provide evidence that the proportion of Americans who read the newspaper every day differed among the five populations for these years, we have to ask how likely it is to have observed such sample data if, in fact, the "every day" proportions were the same for all five populations (years). However, it’s a little harder to quantify this now that we are comparing more than two groups.
We adopt a strategy similar to the goodness-of-fit test: Compare the observed counts in the table with the counts expected under the null hypothesis of equal population proportions/distributions. The farther the observed counts are from the expected counts, the more extreme we will consider the data to be.
Pause for Reflection #3:
Use appropriate symbols to state the null hypothesis that the population proportion of adult Americans who read the newspaper every day was the same for these five years: 1978, 1988, 1998, 2008, and 2018.
_____________________________________________________
Pause for Reflection #4:
For the five years combined, what proportion of respondents read the newspaper every day? If this same proportion of the 1528 respondents in the year 1978 had read the newspaper every day, how many people would this represent? Record your answer with two decimal places, and repeat for the other four years.
_____________________________________________________
We have now calculated the expected counts under the null hypothesis that the population proportion of adult Americans who read the paper every day was the same for these five years (and consequently also the population proportions who did not read the paper every day). A more general technique for calculating the expected count of cell \(i\) is to multiply the marginal total for that cell's row by the marginal total for its column, and then divide by the grand total (the sample size of the study, \(n\)):
\[E_i = \displaystyle \frac{\text{row total} \times \text{column total}}{\text{grand total}} \tag{9.1}\]
Pause for Reflection #5:
Use the general formula in Equation (9.1) to calculate the expected count of "not every day" people in the year 1988 and complete the following table:
| | 1978 | 1988 | 1998 | 2008 | 2018 | total |
|---|---|---|---|---|---|---|
| Every day | 874 (613.80) | 500 (396.88) | 805 (751.19) | 431 (533.87) | 312 (626.26) | 2922 |
| Not every day | 654 (914.20) | 488 (      ) | 1065 (1118.81) | 898 (795.13) | 1247 (932.74) | 4352 |
| total | 1528 | 988 | 1870 | 1329 | 1559 | 7274 |
_____________________________________________________
Now that we have both the observed counts and the expected counts, we need a test statistic that measures how far the observed counts deviate from the expected counts. To do this, we do the same calculation as with the goodness-of-fit test:
\(X^2 = \displaystyle \sum_{\text{all cells}\ i} \frac{(O_i - E_i)^2}{E_i}\)
Pause for Reflection #6:
Calculate the value of \((O_i-E_i)^2/E_i\) for the "not every day" people in 1988 (i.e., for the second cell in the second row of the table). Add this value to other contributions to the test statistic calculation provided below and compute the test statistic:
\(X^2\) = 110.30 + 26.79 + 3.85 + 19.82 + 157.70
+ 74.06 + ?? + 2.59 + 13.31 + 105.88 = ??
What kind of values (e.g., large or small) of the test statistic provide evidence against the null hypothesis that the five populations (years) have the same proportion of Americans reading the newspaper every day? Explain.
_____________________________________________________
Again in this case, the test statistic follows a chi-square distribution. However, in this case, the degrees of freedom are equal to \((r-1)(c-1)\), where \(r\) is the number of rows and \(c\) is the number of columns in the contingency table.
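As a quick illustration of the degrees-of-freedom formula, the sketch below uses a small table of counts that is made up purely for illustration (it is not data from this lab); the names tbl, n_rows, n_cols, and deg_free are likewise chosen here.

```r
# Hypothetical 3 x 4 table of counts, invented only to illustrate the formula
tbl = rbind(c(10, 20, 30, 40),
            c(15, 25, 35, 45),
            c(12, 22, 32, 42))

n_rows = nrow(tbl)   # r
n_cols = ncol(tbl)   # c

# Degrees of freedom: (r - 1)(c - 1)
deg_free = (n_rows - 1) * (n_cols - 1)
deg_free
```

For this 3 x 4 example the formula gives (3 - 1)(4 - 1) = 6 degrees of freedom. Note that the marginal totals are not counted as rows or columns.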
Pause for Reflection #7:
Calculate the degrees of freedom for the test statistic found in Reflection #6 and then use the pchisq() function to find the corresponding \(P\)-value. Based on the \(P\)-value, state your conclusion.
_____________________________________________________
Tests of Homogeneity: The test we just performed is called a chi-square test of equal proportions (homogeneity). It is used to test whether the proportions for independent samples from three or more populations are the same. These calculations can also be done in R with the chisq.test() function. First, we need to store the observed counts as a table in R, which can be done by binding the rows together with the rbind() command:
years = rbind(c(874, 500, 805, 431, 312), c(654, 488, 1065, 898, 1247))
years
##      [,1] [,2] [,3] [,4] [,5]
## [1,]  874  500  805  431  312
## [2,]  654  488 1065  898 1247
Then, we simply call the chisq.test() function on the table of observed counts years:
chisq.test(years)
##
##  Pearson's Chi-squared test
##
## data:  years
## X-squared = 532.28, df = 4, p-value < 2.2e-16
We can see the expected counts in R with the following code:
chisq.test(years)$expected
##          [,1]     [,2]      [,3]     [,4]     [,5]
## [1,] 613.8048 396.8842  751.1878 533.8655 626.2576
## [2,] 914.1952 591.1158 1118.8122 795.1345 932.7424
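These expected counts come from the general formula (row total times column total, divided by the grand total), and they can be reproduced directly. The following is a minimal sketch; the object names expected, contributions, and x2 are chosen here for illustration.

```r
# Observed counts: rows are "every day" / "not every day", columns are the five years
years = rbind(c(874, 500, 805, 431, 312),
              c(654, 488, 1065, 898, 1247))

# Expected count for each cell: (row total * column total) / grand total
expected = outer(rowSums(years), colSums(years)) / sum(years)

# Contribution of each cell and the resulting test statistic
contributions = (years - expected)^2 / expected
x2 = sum(contributions)

round(expected, 2)
x2
```

The outer() call builds the full table of row-total-by-column-total products in one step, so the sketch works unchanged for a table with any number of rows and columns. The value of x2 should agree with the X-squared value reported by chisq.test(years).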
Tests of Independence: We continue to consider the GSS survey. But this time, we use only the year 2018 with another variable: the respondent’s political inclination, classified as liberal, moderate, or conservative. The sample results are summarized in the table:
| | Liberal | Moderate | Conservative |
|---|---|---|---|
| Every day | 109 | 153 | 160 |
| Few times a week | 85 | 109 | 95 |
| Once a week | 52 | 82 | 63 |
| Less than once a week | 56 | 68 | 64 |
| Never | 52 | 65 | 63 |
Notice how this data is different from the data used in the previous example regarding newspaper reading. In this case, we have one random sample of individuals (2018 respondents) that are classified according to two variables (political inclination and how often they read the newspaper). Previously, we had five separate random samples (for the five years) that were classified on just one variable.
It turns out that the same chi-square test applies to two-way tables where the data are one random sample from a population classified on two variables. The calculations are identical; what changes is the hypotheses. The null hypothesis now states that, in the population, the two variables are independent, and the alternative hypothesis states that there is a relationship between the variables.
For the above data, we perform a chi-square test of independence for the following hypotheses:
\(H_0\) : political inclination and how often someone reads the paper are independent
\(H_A\) : political inclination is related to how often someone reads the paper
Pause for Reflection #8:
Format the data in R using the rbind() function. Then call the chisq.test() function on the data to perform the calculations for the test of independence. Record your conclusion in your lab notebook. If the test indicates strong evidence of a relationship between the variables, examine the table cells that contribute most to the value of the test statistic in order to describe the relationship.
_____________________________________________________
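One way to see which cells contribute most to a test statistic is to square the Pearson residuals that chisq.test() returns, since each residual is \((O_i - E_i)/\sqrt{E_i}\). The sketch below illustrates the idea on the newspaper-by-year table from the homogeneity example (not on the political inclination data); the row and column labels and the names result, contributions, and largest are added here for readability.

```r
# Newspaper-by-year table from the homogeneity example, with labels added
years = rbind(c(874, 500, 805, 431, 312),
              c(654, 488, 1065, 898, 1247))
rownames(years) = c("Every day", "Not every day")
colnames(years) = c("1978", "1988", "1998", "2008", "2018")

result = chisq.test(years)

# Squared Pearson residuals are the per-cell contributions (O - E)^2 / E
contributions = result$residuals^2
round(contributions, 2)

# Row and column position of the cell with the largest contribution
largest = which(contributions == max(contributions), arr.ind = TRUE)
largest
```

The sign of each entry in result$residuals shows the direction of the deviation: a positive residual means the observed count is larger than expected, and a negative residual means it is smaller.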