10.2: Qualitative Data
Suppose that we have observed data from a large survey-based study of alumni from the past decade of a large university. Among the variables observed in the study are the debt at graduation and a self-reported race and ethnicity classification with the categories African American, Asian, Hispanic, and white. Sampling one hundred of these observations at random, we obtain the data given in Table 10.1. These data are hypothetical but were simulated to have characteristics similar to the summary data reported by Espinosa et al. (2019).
Table 10.1 Data from a hypothetical alumni study. Debt indicates the debt load (in thousands of dollars) of the student at graduation with an undergraduate degree. Race is coded as AF (African American), AS (Asian), HI (Hispanic), and WH (White).
| Observation | Race | Debt | Observation | Race | Debt | Observation | Race | Debt |
|---|---|---|---|---|---|---|---|---|
| 1 | WH | 15 | 34 | WH | 26 | 67 | HI | 25 |
| 2 | WH | 13 | 35 | WH | 0 | 68 | AF | 54 |
| 3 | HI | 0 | 36 | AF | 44 | 69 | WH | 14 |
| 4 | WH | 48 | 37 | AF | 4 | 70 | HI | 22 |
| 5 | HI | 23 | 38 | WH | 35 | 71 | WH | 0 |
| 6 | WH | 36 | 39 | HI | 5 | 72 | WH | 27 |
| 7 | WH | 32 | 40 | AF | 57 | 73 | AF | 6 |
| 8 | HI | 34 | 41 | WH | 32 | 74 | WH | 22 |
| 9 | WH | 40 | 42 | WH | 5 | 75 | AF | 0 |
| 10 | WH | 23 | 43 | WH | 0 | 76 | WH | 34 |
| 11 | WH | 16 | 44 | WH | 24 | 77 | WH | 0 |
| 12 | HI | 24 | 45 | WH | 0 | 78 | WH | 48 |
| 13 | HI | 28 | 46 | WH | 0 | 79 | WH | 31 |
| 14 | AF | 36 | 47 | WH | 0 | 80 | WH | 0 |
| 15 | WH | 37 | 48 | WH | 0 | 81 | WH | 42 |
| 16 | AF | 0 | 49 | WH | 3 | 82 | HI | 0 |
| 17 | WH | 0 | 50 | HI | 13 | 83 | HI | 0 |
| 18 | WH | 0 | 51 | WH | 0 | 84 | WH | 0 |
| 19 | AS | 38 | 52 | HI | 18 | 85 | WH | 47 |
| 20 | WH | 0 | 53 | WH | 21 | 86 | WH | 0 |
| 21 | HI | 0 | 54 | AS | 0 | 87 | AS | 0 |
| 22 | WH | 0 | 55 | HI | 25 | 88 | AF | 14 |
| 23 | WH | 50 | 56 | AS | 0 | 89 | WH | 47 |
| 24 | WH | 24 | 57 | HI | 0 | 90 | WH | 21 |
| 25 | WH | 59 | 58 | WH | 57 | 91 | WH | 0 |
| 26 | WH | 38 | 59 | WH | 0 | 92 | HI | 12 |
| 27 | HI | 22 | 60 | WH | 23 | 93 | WH | 34 |
| 28 | HI | 43 | 61 | WH | 42 | 94 | WH | 23 |
| 29 | WH | 35 | 62 | WH | 7 | 95 | WH | 45 |
| 30 | HI | 4 | 63 | WH | 11 | 96 | WH | 36 |
| 31 | WH | 0 | 64 | HI | 22 | 97 | WH | 26 |
| 32 | AS | 0 | 65 | AF | 54 | 98 | WH | 15 |
| 33 | AF | 27 | 66 | HI | 0 | 99 | WH | 37 |
| 100 | WH | 26 | | | | | | |
Because debt is a quantitative variable, it can be summarized in many ways, as shown in Table 10.2. The table reports two measures of location, the mean and the median, as well as a measure of variation, the standard deviation. Comparing the means and the medians, we can conclude that African American students have the largest amount of debt at graduation, followed by white, Hispanic, and then Asian American students. The standard deviations follow a broadly similar pattern, though they are much closer together and the Asian American group has a larger standard deviation than the Hispanic group. We will not attempt to explain why these differences occur; at this point we are only interested in summarizing the observed data.
Table 10.2 Summary Statistics of Debt by Race. Debt indicates the debt load (in thousands of dollars) of the student at graduation with an undergraduate degree. Race is coded as AF (African American), AS (Asian), HI (Hispanic), and WH (White).
| Race | Mean | Median | Standard Deviation |
|---|---|---|---|
| AF | 26.9 | 27.0 | 23.1 |
| AS | 7.6 | 0.0 | 17.0 |
| HI | 14.5 | 15.5 | 13.1 |
| WH | 21.9 | 23.0 | 18.8 |
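As a rough illustration of how such summaries can be computed, the sketch below recalculates the AF and AS rows of Table 10.2 from the debt values listed for those groups in Table 10.1. It uses Python's standard `statistics` module and assumes the reported standard deviation is the sample standard deviation (divisor n - 1); the text does not state this, but that choice reproduces the tabulated values. The names `debt_by_race` and `summarize` are illustrative only.

```python
from statistics import mean, median, stdev

# Debt values (thousands of dollars) copied from Table 10.1
# for the African American (AF) and Asian (AS) observations.
debt_by_race = {
    "AF": [36, 0, 27, 44, 4, 57, 54, 54, 6, 0, 14],
    "AS": [38, 0, 0, 0, 0],
}


def summarize(values):
    """Return (mean, median, sample standard deviation), rounded to one decimal."""
    # Assumption: stdev() uses the n - 1 divisor (sample standard deviation).
    return (
        round(mean(values), 1),
        round(median(values), 1),
        round(stdev(values), 1),
    )


summary = {race: summarize(values) for race, values in debt_by_race.items()}

for race, (avg, med, sd) in summary.items():
    print(f"{race}: mean={avg}, median={med}, sd={sd}")
```

The same function could be applied to the remaining groups by adding their debt values to the dictionary.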
The next question is how to summarize the data on race. Because these data are qualitative and measured on a nominal scale, arithmetic operations on them are not meaningful. Hence, we cannot compute an average because we cannot add the data, and we cannot compute a median because the categories have no natural order by which to sort the data. The only comparison that can be performed is to determine whether two observed individuals have the same race, and hence the only valid summary is to count how many observations fall in each category. When this information is presented for each of the categories in a tabular format, the resulting table is called a frequency distribution and the individual counts for each category are called the frequencies.
The frequency distribution of a set of qualitative data is a table that contains the number of times each category of the data is observed. The individual counts for each category are called the frequencies of the categories.
The frequency distribution for the race data observed in Table 10.1 is given in Table 10.3. The first row of this table corresponds to the data for African American alumni. The frequency reported in the table indicates that of 100 observations, 11 correspond to African American alumni: 14, 16, 33, 36, 37, 40, 65, 68, 73, 75, and 88. Similarly, the frequency reported in the table indicates that 5 observations correspond to Asian American alumni: 19, 32, 54, 56, and 87. The reported frequencies of 22 for Hispanic and 62 for white alumni are interpreted in similar ways. Note that these frequencies sum to 100, which is the total number of observations in the set of data.
Table 10.3 Frequencies, relative frequencies (proportions), and percentages for each of the categories of the race data contained in Table 10.1.
| Race | Frequency | Relative Frequency | Percent |
|---|---|---|---|
| AF | 11 | 0.11 | 11% |
| AS | 5 | 0.05 | 5% |
| HI | 22 | 0.22 | 22% |
| WH | 62 | 0.62 | 62% |
| Total | 100 | 1.00 | 100% |
A frequency distribution is useful in determining whether there are categories of a variable that correspond to a large part of the data, and in turn whether there are categories that are quite rare. For some sets of data, a frequency distribution may indicate that all the categories occur roughly equally often. The frequency distribution for the race data in Table 10.3 indicates that most of the alumni observed in the alumni data were white, followed by Hispanic, African American, and Asian American. It is noteworthy in this case that Asian American alumni were quite rare in these data. Whether this reflects a similar trend among all the alumni from the university or is an artifact of this set of observed data would have to be researched further. Note that if the frequency distribution shows that the data do not generally reflect the characteristics of the population of alumni, this may be an indication of a nonrepresentative sample.
Comparing frequencies directly may not be as convenient as looking at what proportion or percentage of the data is in each category. This can be particularly true when there are more than a few categories, and when the frequencies vary over a large range. If a frequency is divided by the total number of observations in the set of data, the result is the relative frequency, which measures the proportion of the data that is in the category. Each proportion can be converted to a percentage by multiplying it by 100%. A table of either relative frequencies or percentages is usually still called a frequency distribution.
The frequency distribution of a set of qualitative data is a table that contains the number of times each category of the data is observed, the proportion of data that was observed in each category, or the percentage of data that was observed in each category. The individual counts for each category are called the frequencies of the categories. The relative frequencies are computed by dividing each frequency by the total number of observations in the set of data. The percentages are computed by multiplying the relative frequencies by 100%.
The relative frequencies and percentages for the race variable in the alumni study are given in the third and fourth columns of Table 10.3. Because there were exactly one hundred observations in these data, each percentage is numerically equal to the corresponding frequency and each relative frequency is simply the frequency divided by 100, so the conclusions for this table are the same as before.
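The calculation of relative frequencies and percentages can be sketched in a few lines of Python. The example below starts from the frequencies reported in Table 10.3 and derives the other two columns; the variable names are illustrative and not part of the original study.

```python
# Frequencies copied from Table 10.3.
frequencies = {"AF": 11, "AS": 5, "HI": 22, "WH": 62}

# Total number of observations in the set of data.
total = sum(frequencies.values())

# Relative frequency = frequency / total number of observations.
relative_frequencies = {race: count / total for race, count in frequencies.items()}

# Percentage = relative frequency * 100, computed here as 100 * count / total.
percentages = {race: 100 * count / total for race, count in frequencies.items()}

for race, count in frequencies.items():
    print(f"{race}: {count:3d}  {relative_frequencies[race]:.2f}  {percentages[race]:.0f}%")
```

The relative frequencies should sum to 1 and the percentages to 100%, up to floating-point rounding, which gives a quick check on the table.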
As another example, consider a research study where twenty-five students who identified as female were randomly sampled from the graduating class at a small midwestern college. Of concern to the researcher is the level of gender discrimination the students experienced during their time at the college. To measure discrimination, the study used eight questions about perceptions of gender discrimination, taken from a standard scale developed and validated by other researchers. Each question was scored on a five-point Likert scale indicating how strongly the respondent agreed or disagreed with a statement. The eight responses were then added to get a score on how much gender discrimination was experienced. The observed data are given in Table 10.4.
Table 10.4 Observed responses to questions on gender discrimination of 25 randomly sampled students who identify as female at a small midwestern college.
| Individual | Q1 | Q2 | Q3 | Q4 | Q5 | Q6 | Q7 | Q8 | Sum |
|---|---|---|---|---|---|---|---|---|---|
| 1 | 4 | 3 | 4 | 5 | 4 | 3 | 2 | 3 | 28 |
| 2 | 5 | 5 | 5 | 4 | 3 | 2 | 3 | 2 | 29 |
| 3 | 3 | 2 | 1 | 2 | 1 | 2 | 3 | 5 | 19 |
| 4 | 3 | 4 | 4 | 5 | 5 | 4 | 5 | 3 | 33 |
| 5 | 5 | 5 | 4 | 3 | 1 | 1 | 1 | 1 | 21 |
| 6 | 1 | 1 | 1 | 3 | 2 | 4 | 3 | 4 | 19 |
| 7 | 1 | 1 | 2 | 3 | 4 | 5 | 4 | 5 | 25 |
| 8 | 2 | 3 | 5 | 4 | 3 | 1 | 1 | 1 | 20 |
| 9 | 3 | 1 | 2 | 3 | 2 | 2 | 4 | 5 | 22 |
| 10 | 5 | 4 | 5 | 5 | 3 | 3 | 3 | 4 | 32 |
| 11 | 4 | 5 | 5 | 5 | 4 | 5 | 4 | 4 | 36 |
| 12 | 1 | 3 | 1 | 3 | 2 | 4 | 4 | 3 | 21 |
| 13 | 3 | 4 | 3 | 2 | 3 | 4 | 3 | 3 | 25 |
| 14 | 5 | 5 | 5 | 5 | 5 | 5 | 5 | 4 | 39 |
| 15 | 4 | 4 | 4 | 5 | 4 | 5 | 5 | 5 | 36 |
| 16 | 1 | 1 | 1 | 1 | 2 | 3 | 2 | 4 | 15 |
| 17 | 3 | 1 | 2 | 2 | 2 | 3 | 3 | 4 | 20 |
| 18 | 5 | 5 | 4 | 5 | 5 | 5 | 5 | 4 | 38 |
| 19 | 2 | 2 | 1 | 1 | 1 | 1 | 1 | 1 | 10 |
| 20 | 4 | 4 | 3 | 2 | 3 | 3 | 4 | 2 | 25 |
| 21 | 2 | 2 | 2 | 2 | 4 | 3 | 4 | 4 | 23 |
| 22 | 1 | 1 | 1 | 1 | 2 | 3 | 5 | 4 | 18 |
| 23 | 5 | 4 | 2 | 3 | 2 | 4 | 4 | 3 | 27 |
| 24 | 5 | 5 | 5 | 4 | 5 | 5 | 3 | 3 | 35 |
| 25 | 2 | 1 | 1 | 1 | 1 | 1 | 1 | 2 | 10 |
The responses to each question in the survey are qualitative on an ordinal measurement scale. When the measurement scale is ordinal, it is important to list the categories in the table in the order implied by the ordinal structure of the scale. Otherwise, the calculations are the same as in the previous example. For example, in constructing a frequency table for the responses to the first question, note that 5 individuals scored the statement as 1, so the frequency for that category is 5. To get the relative frequency, divide the frequency by the number of observations. Hence the relative frequency is \(5\div 25=0.20\). The corresponding percentage is computed by multiplying the relative frequency by 100%, and hence the percentage for that category is \(0.20\times 100\%=20\%\). The remaining calculations follow in a similar manner and are collected in Table 10.5.
Table 10.5 Frequency distributions for the eight questions in the survey on gender discrimination from Table 10.4. For each question, the three rows give the frequencies, relative frequencies, and percentages.
| Question | Response 1 | Response 2 | Response 3 | Response 4 | Response 5 | Total |
|---|---|---|---|---|---|---|
| 1 | 5 | 4 | 5 | 4 | 7 | 25 |
| | 0.20 | 0.16 | 0.20 | 0.16 | 0.28 | 1.00 |
| | 20% | 16% | 20% | 16% | 28% | 100% |
| 2 | 7 | 3 | 3 | 6 | 6 | 25 |
| | 0.28 | 0.12 | 0.12 | 0.24 | 0.24 | 1.00 |
| | 28% | 12% | 12% | 24% | 24% | 100% |
| 3 | 7 | 5 | 2 | 5 | 6 | 25 |
| | 0.28 | 0.20 | 0.08 | 0.20 | 0.24 | 1.00 |
| | 28% | 20% | 8% | 20% | 24% | 100% |
| 4 | 4 | 5 | 6 | 3 | 7 | 25 |
| | 0.16 | 0.20 | 0.24 | 0.12 | 0.28 | 1.00 |
| | 16% | 20% | 24% | 12% | 28% | 100% |
| 5 | 4 | 7 | 5 | 5 | 4 | 25 |
| | 0.16 | 0.28 | 0.20 | 0.20 | 0.16 | 1.00 |
| | 16% | 28% | 20% | 20% | 16% | 100% |
| 6 | 4 | 3 | 7 | 5 | 6 | 25 |
| | 0.16 | 0.12 | 0.28 | 0.20 | 0.24 | 1.00 |
| | 16% | 12% | 28% | 20% | 24% | 100% |
| 7 | 4 | 2 | 7 | 7 | 5 | 25 |
| | 0.16 | 0.08 | 0.28 | 0.28 | 0.20 | 1.00 |
| | 16% | 8% | 28% | 28% | 20% | 100% |
| 8 | 3 | 3 | 6 | 9 | 4 | 25 |
| | 0.12 | 0.12 | 0.24 | 0.36 | 0.16 | 1.00 |
| | 12% | 12% | 24% | 36% | 16% | 100% |
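As a check on the first question's entries in Table 10.5, the following sketch tallies the question 1 responses from Table 10.4 and derives the relative frequencies and percentages. It uses `collections.Counter` from the Python standard library; the names `question_1`, `categories`, and `distribution` are illustrative. Iterating over an explicit category list, rather than over the keys of the counter, keeps the output in the order of the ordinal scale.

```python
from collections import Counter

# Responses to question 1 for individuals 1 to 25, copied from Table 10.4.
question_1 = [4, 5, 3, 3, 5, 1, 1, 2, 3, 5, 4, 1, 3,
              5, 4, 1, 3, 5, 2, 4, 2, 1, 5, 5, 2]

# The scale is ordinal, so the categories are listed in scale order.
categories = [1, 2, 3, 4, 5]

counts = Counter(question_1)   # frequency of each observed response
n = len(question_1)            # number of observations

# One tuple per category: (category, frequency, relative frequency, percent).
distribution = [
    (c, counts[c], round(counts[c] / n, 2), round(100 * counts[c] / n))
    for c in categories
]

for category, freq, rel, pct in distribution:
    print(f"{category}: {freq}  {rel:.2f}  {pct}%")
```

A category that no one selected would still appear with a frequency of 0, because `Counter` returns 0 for missing keys.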
Note that the rows of Table 10.5 for each question form a frequency distribution of the responses to that question. For the first question the responses look somewhat even, indicating that for that question there did not seem to be a preference for higher or lower responses. For the second and third questions, there is more division in that the higher frequencies are for the lowest and highest responses. This indicates that there was little middle ground for these questions and that individuals taking part in the study either felt no discrimination in those instances or felt a great deal of discrimination. For questions 7 and 8, the opposite trend is observed: the responses concentrate at 3 and 4, suggesting that many individuals had some experience with the situation described in the questions.

