11.4: Test of Independence
Tests of independence involve using a contingency table of observed (data) values.
The test statistic for a test of independence is similar to that of a goodness-of-fit test:
\[\sum_{(i \cdot j)} \frac{(O-E)^{2}}{E}\]
where:
- \(O =\) observed values
- \(E =\) expected values
- \(i =\) the number of rows in the table
- \(j =\) the number of columns in the table
There are \(i \cdot j\) terms of the form \(\frac{(O-E)^{2}}{E}\).
The expected value for each cell needs to be at least five in order for you to use this test.
A test of independence determines whether two factors are independent or not. You first encountered the term independence in Probability Topics. As a review, consider the following example.
Suppose \(A =\) the event that a driver received a speeding violation in the last year and \(B =\) the event that a driver used a cell phone while driving. If \(A\) and \(B\) are independent, then \(P(A \text{ AND } B) = P(A)P(B)\). \(A \text{ AND } B\) is the event that a driver received a speeding violation last year and also used a cell phone while driving. Suppose that 755 drivers were surveyed about speeding violations in the last year and cell phone use while driving. Out of the 755, 70 had a speeding violation and 685 did not; 305 used cell phones while driving and 450 did not.
Let \(y =\) expected number of drivers who used a cell phone while driving and received speeding violations.
If \(A\) and \(B\) are independent, then \(P(A \text{ AND } B) = P(A)P(B)\). By substitution,
\[\frac{y}{755} = \left(\frac{70}{755}\right)\left(\frac{305}{755}\right) \nonumber\]
Solve for \(y\):
\[y = \frac{(70)(305)}{755} = 28.3 \nonumber\]
About 28 people from the sample are expected to use cell phones while driving and to receive speeding violations.
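The substitution above can be checked with a few lines of Python. This is only a sketch of the arithmetic; the variable names are illustrative and not part of the original example.

```python
# Counts taken from the driver survey described above.
n_total = 755
n_speeding = 70      # drivers with a speeding violation
n_cell_phone = 305   # drivers who used a cell phone while driving

# Under independence, P(A AND B) = P(A) * P(B).
p_both = (n_speeding / n_total) * (n_cell_phone / n_total)

# Expected number of drivers in both groups: probability times sample size.
expected_both = p_both * n_total

print(round(expected_both, 1))
```

Multiplying the joint probability by the sample size is the same as computing (70)(305)/755 directly, which is the row-total-times-column-total form used later in this section.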
In a test of independence, we state the null and alternative hypotheses in words. Since the contingency table consists of two factors, the null hypothesis states that the factors are independent and the alternative hypothesis states that they are not independent (dependent). If we do a test of independence using the example, then the null hypothesis is:
\(H_{0}\): Being a cell phone user while driving and receiving a speeding violation are independent events.
If the null hypothesis were true, we would expect about 28 people to use cell phones while driving and to receive a speeding violation.
The test of independence is always right-tailed because of the calculation of the test statistic. If the expected and observed values are not close together, then the test statistic is large and falls far out in the right tail of the chi-square curve, as it does in a goodness-of-fit test.
The number of degrees of freedom for the test of independence is:
\[df = (\text{number of columns} - 1)(\text{number of rows} - 1) \nonumber\]
The following formula calculates the expected number (\(E\)):
\[E = \frac{\text{(row total)(column total)}}{\text{total number surveyed}} \nonumber\]
A sample of 300 students is taken. Of the students surveyed, 50 were music students, while 250 were not. Ninety-seven were on the honor roll, while 203 were not. If we assume being a music student and being on the honor roll are independent events, what is the expected number of music students who are also on the honor roll?
Answer
About 16 students are expected to be music students and on the honor roll.
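As a check, the expected-number formula applied to the row total for music students (50), the column total for honor roll students (97), and the sample size (300) gives:

\[E = \frac{(50)(97)}{300} = \frac{4850}{300} \approx 16.17 \nonumber\]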
In a volunteer group, adults 21 and older volunteer from one to nine hours each week to spend time with a disabled senior citizen. The program recruits among community college students, four-year college students, and nonstudents. Table \(\PageIndex{1}\) shows a sample of the adult volunteers and the number of hours they volunteer per week.
Type of Volunteer | 1–3 Hours | 4–6 Hours | 7–9 Hours | Row Total |
---|---|---|---|---|
Community College Students | 111 | 96 | 48 | 255 |
Four-Year College Students | 96 | 133 | 61 | 290 |
Nonstudents | 91 | 150 | 53 | 294 |
Column Total | 298 | 379 | 162 | 839 |
Is the number of hours volunteered independent of the type of volunteer? Use the \(p\)-value method with \(\alpha=0.02\).
Answer
Determine the hypothesis:
The observed table and the question at the end of the problem, "Is the number of hours volunteered independent of the type of volunteer?" tell you this is a test of independence. The two factors are number of hours volunteered and type of volunteer. This test is always right-tailed.
- \(H_{0}\): The number of hours volunteered is independent of the type of volunteer.
- \(H_{a}\): The number of hours volunteered is dependent on the type of volunteer.
Calculate the evidence:
The expected results are in Table \(\PageIndex{2}\).
Type of Volunteer | 1-3 Hours | 4-6 Hours | 7-9 Hours |
---|---|---|---|
Community College Students | 90.57 | 115.19 | 49.24 |
Four-Year College Students | 103.00 | 131.00 | 56.00 |
Nonstudents | 104.42 | 132.81 | 56.77 |
For example, the calculation for the expected frequency for the top left cell is
\[E = \frac{(\text{row total})(\text{column total})}{\text{total number surveyed}} = \frac{(255)(298)}{839} = 90.57 \nonumber\]
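The same formula can be applied to every cell at once. The following Python sketch rebuilds the expected results from the observed volunteer counts; the function name `expected_counts` is an illustrative choice, not something defined in this text.

```python
def expected_counts(observed):
    """Return expected counts under independence for a table of observed counts."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    grand_total = sum(row_totals)
    # E = (row total)(column total) / (total number surveyed), cell by cell.
    return [[r * c / grand_total for c in col_totals] for r in row_totals]


# Observed counts from the volunteer table (rows: community college,
# four-year college, nonstudents; columns: 1-3, 4-6, 7-9 hours).
observed = [
    [111, 96, 48],
    [96, 133, 61],
    [91, 150, 53],
]

expected = expected_counts(observed)
for row in expected:
    print([round(e, 2) for e in row])
```

Computing the row and column totals from the observed counts, rather than typing them in, avoids transcription errors in the totals.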
Next, calculate the test statistic. For each cell, calculate \(\frac{(O-E)^{2}}{E}\).
For example, the calculation for the top left cell is
\[\frac{(O-E)^{2}}{E}=\frac{(111-90.57)^{2}}{90.57}=\frac{20.43^{2}}{90.57}=\frac{417.3849}{90.57}=4.61 \nonumber\]
Type of Volunteer | 1-3 Hours | 4-6 Hours | 7-9 Hours |
---|---|---|---|
Community College Students | 4.61 | 3.20 | 0.03 |
Four-Year College Students | 0.48 | 0.03 | 0.45 |
Nonstudents | 1.72 | 2.22 | 0.25 |
Add up all the cells in the previous table to find the test statistic, \(\chi^{2} = 12.99\).
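The sum can also be computed directly from the observed counts. The sketch below uses unrounded expected values, so individual cell contributions may differ slightly in the last digit from the rounded table entries; the variable names are illustrative.

```python
# Observed counts from the volunteer table.
observed = [
    [111, 96, 48],
    [96, 133, 61],
    [91, 150, 53],
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand_total = sum(row_totals)

# Add (O - E)^2 / E over every cell of the table.
chi_square = 0.0
for i, row in enumerate(observed):
    for j, o in enumerate(row):
        e = row_totals[i] * col_totals[j] / grand_total
        chi_square += (o - e) ** 2 / e

print(round(chi_square, 2))
```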
Next, find the \(p\)-value. The test of independence is right-tailed, so use the Excel formula \(=\text{CHISQ.DIST.RT}(x,df)\).
The test statistic, \(\chi^{2} = 12.99\), is the \(x\) and the degrees of freedom are
\[df = (3 \text{ columns} - 1)(3 \text{ rows} - 1) = (2)(2) = 4 \nonumber\]
Enter \(=\text{CHISQ.DIST.RT}(12.99,4)=0.0113\)
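For readers without Excel, the right-tail area can be sketched in Python. The function below is a hypothetical helper, not a library call, and it relies on a closed form that holds only when the degrees of freedom are even (as they are here, with \(df = 4\)). For general \(df\), a library routine such as `scipy.stats.chi2.sf` would be the usual choice.

```python
import math


def chi_square_right_tail(x, df):
    """Right-tail area of the chi-square distribution; valid for EVEN df only."""
    if df <= 0 or df % 2 != 0:
        raise ValueError("this closed form requires a positive even df")
    half = x / 2
    term = 1.0
    total = 1.0
    # Sum (x/2)^k / k! for k = 0 .. df/2 - 1, building each term from the last.
    for k in range(1, df // 2):
        term *= half / k
        total += term
    return math.exp(-half) * total


p_value = chi_square_right_tail(12.99, 4)
print(round(p_value, 4))
```

The result should agree with the Excel value above to four decimal places.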

Make a Decision:
Compare \(\alpha\) and the \(p\text{-value}\): \(\alpha=0.02\) is given. \(p\text{-value} = 0.0113\). \(\alpha > p\text{-value}\).
Since \(\alpha > p\text{-value}\), reject \(H_{0}\). This means that the factors are not independent.
Conclusion: At a 2% level of significance, from the data, there is sufficient evidence to conclude that the number of hours volunteered and the type of volunteer are dependent on one another.
The Bureau of Labor Statistics gathers data about employment in the United States. A sample is taken to calculate the number of U.S. citizens working in one of several industry sectors over time. Table \(\PageIndex{3}\) shows the results:
Industry Sector | 2000 | 2010 | 2020 | Total |
---|---|---|---|---|
Nonagriculture wage and salary | 13,243 | 13,044 | 15,018 | 41,305 |
Goods-producing, excluding agriculture | 2,457 | 1,771 | 1,950 | 6,178 |
Services-providing | 10,786 | 11,273 | 13,068 | 35,127 |
Agriculture, forestry, fishing, and hunting | 240 | 214 | 201 | 655 |
Nonagriculture self-employed and unpaid family worker | 931 | 894 | 972 | 2,797 |
Secondary wage and salary jobs in agriculture and private household industries | 14 | 11 | 11 | 36 |
Secondary jobs as a self-employed or unpaid family worker | 196 | 144 | 152 | 492 |
Total | 27,867 | 27,351 | 31,372 | 86,590 |
We want to know if the change in the number of jobs is independent of the change in years. State the null and alternative hypotheses and the degrees of freedom.
Answer
- \(H_{0}\): The number of jobs is independent of the year.
- \(H_{a}\): The number of jobs is dependent on the year.
- \(df = (3 \text{ columns} - 1)(7 \text{ rows} - 1) = (2)(6) = 12\)

De Anza College is interested in the relationship between anxiety level and the need to succeed in school. A random sample of 400 students took a test that measured anxiety level and need to succeed in school. The table below shows the results. De Anza College wants to know if anxiety level and need to succeed in school are independent events. Use the critical value method with \(\alpha=0.10\).
Need to Succeed in School | High Anxiety | Med-high Anxiety | Medium Anxiety | Med-low Anxiety | Low Anxiety | Row Total |
---|---|---|---|---|---|---|
High Need | 35 | 42 | 53 | 15 | 10 | 155 |
Medium Need | 18 | 48 | 63 | 33 | 31 | 193 |
Low Need | 4 | 5 | 11 | 15 | 17 | 52 |
Column Total | 57 | 95 | 127 | 63 | 58 | 400 |
Solution
Determine the hypothesis:
\(H_{0}\): The anxiety level is independent of the need to succeed in school.
\(H_{a}\): The anxiety level is dependent on the need to succeed in school.
This test is always right-tailed.
Calculate the evidence:
Use the Excel formula \(=\text{CHISQ.INV.RT}(\alpha,df)\) to find the critical value.
\(\alpha=0.10\) as given in the problem statement. The degrees of freedom are
\[df = (5 \text{ columns} - 1)(3 \text{ rows} - 1) = (4)(2) = 8 \nonumber\]
So enter into Excel the formula \(=\text{CHISQ.INV.RT}(0.10,8)=13.3616\)
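The critical value can also be approximated without Excel. The sketch below searches for the point whose right-tail area equals \(\alpha\) by bisection. Both function names are hypothetical, and the tail-area formula is the closed form that holds only for even degrees of freedom (here \(df = 8\)).

```python
import math


def chi_square_right_tail(x, df):
    """Right-tail area of the chi-square distribution; valid for EVEN df only."""
    if df <= 0 or df % 2 != 0:
        raise ValueError("this closed form requires a positive even df")
    half = x / 2
    term = 1.0
    total = 1.0
    for k in range(1, df // 2):
        term *= half / k
        total += term
    return math.exp(-half) * total


def critical_value(alpha, df, lo=0.0, hi=1000.0, tol=1e-9):
    """Find x with right-tail area alpha by bisection on [lo, hi]."""
    # The right-tail area shrinks as x grows, so move the bracket accordingly.
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if chi_square_right_tail(mid, df) > alpha:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


cv = critical_value(0.10, 8)
print(round(cv, 4))
```

Bisection is slow but dependable here because the right-tail area decreases steadily as \(x\) increases, so there is exactly one crossing point to find.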
Next work on calculating the test statistic. First, find the expected values for each cell. The formula for each cell will be \(E = \frac{(\text{row total})(\text{column total})}{\text{total number surveyed}}\).
Need to Succeed in School | High Anxiety | Med-high Anxiety | Medium Anxiety | Med-low Anxiety | Low Anxiety |
---|---|---|---|---|---|
High Need | 22.09 | 36.81 | 49.21 | 24.41 | 22.48 |
Medium Need | 27.50 | 45.84 | 61.28 | 30.40 | 27.99 |
Low Need | 7.41 | 12.35 | 16.51 | 8.19 | 7.54 |
Next, calculate \(\frac{(O-E)^{2}}{E}\) for each cell.
Need to Succeed in School | High Anxiety | Med-high Anxiety | Medium Anxiety | Med-low Anxiety | Low Anxiety |
---|---|---|---|---|---|
High Need | 7.54 | 0.73 | 0.29 | 3.63 | 6.93 |
Medium Need | 3.28 | 0.10 | 0.05 | 0.22 | 0.32 |
Low Need | 1.57 | 4.37 | 1.84 | 5.66 | 11.87 |
Next add up all the cells from the last table to find \(\sum \frac{(O-E)^{2}}{E}\). This gives the test statistic as \(\chi^{2} = 48.4\)
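The whole calculation for this example can be sketched in Python from the observed counts alone. The code also records the smallest expected value, since the test requires every expected value to be at least five. Variable names are illustrative.

```python
# Observed counts (rows: High, Medium, Low Need; columns: High, Med-high,
# Medium, Med-low, Low Anxiety).
observed = [
    [35, 42, 53, 15, 10],
    [18, 48, 63, 33, 31],
    [4, 5, 11, 15, 17],
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand_total = sum(row_totals)

chi_square = 0.0
smallest_expected = float("inf")
for i, row in enumerate(observed):
    for j, o in enumerate(row):
        e = row_totals[i] * col_totals[j] / grand_total
        smallest_expected = min(smallest_expected, e)
        chi_square += (o - e) ** 2 / e

# df = (number of columns - 1)(number of rows - 1)
df = (len(col_totals) - 1) * (len(row_totals) - 1)

print(round(chi_square, 1), df, round(smallest_expected, 2))
```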
Make a Decision:
Since this is a right-tailed test, the rejection region consists of all values larger than the critical value, \(13.3616\). The test statistic, \(\chi^{2} = 48.4\), is larger than the critical value, so it falls in the rejection region. Therefore, we reject the null hypothesis.
Determine the conclusion:
At a 10% level of significance, from the data, there is sufficient evidence to conclude that the anxiety level and the need to succeed in school are dependent on one another.
Refer back to the information in Table 11.4.3. How many service providing jobs are there expected to be in 2020? How many nonagriculture wage and salary jobs are there expected to be in 2020?
Answer
12,727, 14,965
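These two answers follow from the expected-number formula, using the row totals for each sector, the 2020 column total, and the grand total from the table. A short Python sketch of that arithmetic, with illustrative variable names:

```python
# Totals read from the industry sector table.
column_total_2020 = 31372
grand_total = 86590
row_totals = {
    "Services-providing": 35127,
    "Nonagriculture wage and salary": 41305,
}

# E = (row total)(column total) / (total number surveyed)
expected_2020 = {
    sector: total * column_total_2020 / grand_total
    for sector, total in row_totals.items()
}

for sector, value in expected_2020.items():
    print(sector, round(value))
```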
Review
To assess whether two factors are independent or not, you can apply the test of independence that uses the chi-square distribution. The null hypothesis for this test states that the two factors are independent. The test compares observed values to expected values. The test is right-tailed. Each observation or cell category must have an expected value of at least 5.
Formula Review
Test of Independence
- The number of degrees of freedom is equal to \((\text{number of columns} - 1)(\text{number of rows} - 1)\).
- The test statistic is \(\sum_{(i \cdot j)} \frac{(O-E)^{2}}{E}\) where \(O =\) observed values, \(E =\) expected values, \(i =\) the number of rows in the table, and \(j =\) the number of columns in the table.
- If the null hypothesis is true, the expected number \(E = \frac{(\text{row total})(\text{column total})}{\text{total surveyed}}\).
Determine the appropriate test to be used in the next three exercises.
A pharmaceutical company is interested in the relationship between age and presentation of symptoms for a common viral infection. A random sample is taken of 500 people with the infection across different age groups.
Answer
a test of independence
The owner of a baseball team is interested in the relationship between player salaries and team winning percentage. He takes a random sample of 100 players from different organizations.
A marathon runner is interested in the relationship between the brand of shoes runners wear and their run times. She takes a random sample of 50 runners and records their run times as well as the brand of shoes they were wearing.
Answer
a test of independence
Use the following information to answer the next seven exercises: Transit Railroads is interested in the relationship between travel distance and the ticket class purchased. A random sample of 200 passengers is taken. Table \(\PageIndex{4}\) shows the results. The railroad wants to know if a passenger’s choice in ticket class is independent of the distance they must travel.
Traveling Distance | Third class | Second class | First class | Total |
---|---|---|---|---|
1–100 miles | 21 | 14 | 6 | 41 |
101–200 miles | 18 | 16 | 8 | 42 |
201–300 miles | 16 | 17 | 15 | 48 |
301–400 miles | 12 | 14 | 21 | 47 |
401–500 miles | 6 | 6 | 10 | 22 |
Total | 73 | 67 | 60 | 200 |
State the hypotheses.
- \(H_{0}\): _______
- \(H_{a}\): _______
\(df =\) _______
Answer
8
How many passengers are expected to travel between 201 and 300 miles and purchase second-class tickets?
How many passengers are expected to travel between 401 and 500 miles and purchase first-class tickets?
Answer
6.6
What is the test statistic?
What is the \(p\text{-value}\)?
Answer
0.0435
What can you conclude at the 5% level of significance?
Use the following information to answer the next eight exercises: An article in the New England Journal of Medicine discussed a study on smokers in California and Hawaii. In one part of the report, the self-reported ethnicity and smoking levels per day were given. Of the people smoking at most ten cigarettes per day, there were 9,886 African Americans, 2,745 Native Hawaiians, 12,831 Latinos, 8,378 Japanese Americans, and 7,650 whites. Of the people smoking 11 to 20 cigarettes per day, there were 6,514 African Americans, 3,062 Native Hawaiians, 4,932 Latinos, 10,680 Japanese Americans, and 9,877 whites. Of the people smoking 21 to 30 cigarettes per day, there were 1,671 African Americans, 1,419 Native Hawaiians, 1,406 Latinos, 4,715 Japanese Americans, and 6,062 whites. Of the people smoking at least 31 cigarettes per day, there were 759 African Americans, 788 Native Hawaiians, 800 Latinos, 2,305 Japanese Americans, and 3,970 whites.
Complete the table.
Smoking Level Per Day | African American | Native Hawaiian | Latino | Japanese Americans | White | TOTALS |
---|---|---|---|---|---|---|
1-10 | ||||||
11-20 | ||||||
21-30 | ||||||
31+ | ||||||
TOTALS |
Answer
Smoking Level Per Day | African American | Native Hawaiian | Latino | Japanese Americans | White | Totals |
---|---|---|---|---|---|---|
1-10 | 9,886 | 2,745 | 12,831 | 8,378 | 7,650 | 41,490 |
11-20 | 6,514 | 3,062 | 4,932 | 10,680 | 9,877 | 35,065 |
21-30 | 1,671 | 1,419 | 1,406 | 4,715 | 6,062 | 15,273 |
31+ | 759 | 788 | 800 | 2,305 | 3,970 | 8,622 |
Totals | 18,830 | 8,014 | 19,969 | 26,078 | 27,559 | 100,450 |
State the hypotheses.
- \(H_{0}\): _______
- \(H_{a}\): _______
Enter the expected values in a table. Round to two decimal places.
Answer
Smoking Level Per Day | African American | Native Hawaiian | Latino | Japanese Americans | White |
---|---|---|---|---|---|
1-10 | 7777.57 | 3310.11 | 8248.02 | 10771.29 | 11383.01 |
11-20 | 6573.16 | 2797.52 | 6970.76 | 9103.29 | 9620.27 |
21-30 | 2863.02 | 1218.49 | 3036.20 | 3965.05 | 4190.23 |
31+ | 1616.25 | 687.87 | 1714.01 | 2238.37 | 2365.49 |
Calculate the following values:
\(df =\) _______
\(\chi^{2} \text{ test statistic} =\) ______
Answer
10,301.8
\(p\text{-value} =\) ______
Is this a right-tailed, left-tailed, or two-tailed test? Explain why.
Answer
right
Graph the situation. Label and scale the horizontal axis. Mark the mean and test statistic. Shade in the region corresponding to the \(p\text{-value}\).

State the decision and conclusion (in a complete sentence) for the following preconceived levels of \(\alpha\).
\(\alpha = 0.05\)
- Decision: ___________________
- Reason for the decision: ___________________
- Conclusion (write out in a complete sentence): ___________________
Answer
- Reject the null hypothesis.
- \(p\text{-value} < \alpha\)
- There is sufficient evidence to conclude that smoking level is dependent on ethnic group.
\(\alpha = 0.01\)
- Decision: ___________________
- Reason for the decision: ___________________
- Conclusion (write out in a complete sentence): ___________________
Glossary
- Contingency Table
- a table that displays sample values for two different factors that may be dependent or contingent on one another; it facilitates determining conditional probabilities.