10.3: Quantitative Data
Frequency distributions can also be computed on quantitative data, though there is sometimes an extra step required before this can be done. To motivate our discussion, we will consider research on the number of water quality violations per year in a fictional state with 100 counties. Of interest in this research is comparing the number of violations with the median income of the counties to see if lower-income counties are associated with more violations. This example is motivated by similar research from a nationwide study, but the data has been made more manageable for the purpose of this example (Bae et al. 2021). The data are given in Table 10.6.
Table 10.6 Simulated data on the number of water quality violations of public drinking sources and the median income (in thousands of dollars) of 100 counties of a hypothetical state over a year.
| County | Violations | Income | County | Violations | Income | County | Violations | Income |
|---|---|---|---|---|---|---|---|---|
| 1 | 2 | 31 | 34 | 0 | 54 | 67 | 2 | 63 |
| 2 | 2 | 39 | 35 | 1 | 64 | 68 | 1 | 22 |
| 3 | 1 | 30 | 36 | 4 | 41 | 69 | 1 | 17 |
| 4 | 0 | 57 | 37 | 2 | 34 | 70 | 1 | 49 |
| 5 | 3 | 24 | 38 | 1 | 25 | 71 | 1 | 23 |
| 6 | 8 | 20 | 39 | 2 | 28 | 72 | 0 | 55 |
| 7 | 2 | 56 | 40 | 2 | 19 | 73 | 2 | 18 |
| 8 | 1 | 41 | 41 | 1 | 55 | 74 | 1 | 18 |
| 9 | 1 | 81 | 42 | 0 | 80 | 75 | 1 | 54 |
| 10 | 2 | 45 | 43 | 1 | 10 | 76 | 1 | 13 |
| 11 | 2 | 23 | 44 | 1 | 18 | 77 | 2 | 47 |
| 12 | 1 | 32 | 45 | 2 | 20 | 78 | 6 | 37 |
| 13 | 2 | 25 | 46 | 0 | 64 | 79 | 2 | 44 |
| 14 | 9 | 22 | 47 | 1 | 12 | 80 | 2 | 45 |
| 15 | 2 | 26 | 48 | 2 | 14 | 81 | 1 | 27 |
| 16 | 2 | 27 | 49 | 5 | 24 | 82 | 0 | 56 |
| 17 | 2 | 28 | 50 | 1 | 17 | 83 | 2 | 48 |
| 18 | 1 | 47 | 51 | 1 | 35 | 84 | 2 | 49 |
| 19 | 1 | 19 | 52 | 1 | 39 | 85 | 1 | 32 |
| 20 | 1 | 20 | 53 | 1 | 18 | 86 | 2 | 24 |
| 21 | 1 | 42 | 54 | 1 | 41 | 87 | 3 | 25 |
| 22 | 2 | 38 | 55 | 2 | 40 | 88 | 1 | 41 |
| 23 | 7 | 19 | 56 | 2 | 17 | 89 | 2 | 34 |
| 24 | 1 | 16 | 57 | 1 | 22 | 90 | 2 | 42 |
| 25 | 1 | 33 | 58 | 2 | 30 | 91 | 1 | 49 |
| 26 | 1 | 41 | 59 | 2 | 26 | 92 | 1 | 27 |
| 27 | 2 | 36 | 60 | 1 | 18 | 93 | 1 | 18 |
| 28 | 2 | 50 | 61 | 0 | 76 | 94 | 2 | 17 |
| 29 | 2 | 29 | 62 | 2 | 55 | 95 | 1 | 34 |
| 30 | 0 | 63 | 63 | 0 | 56 | 96 | 2 | 33 |
| 31 | 1 | 24 | 64 | 1 | 26 | 97 | 2 | 18 |
| 32 | 2 | 17 | 65 | 1 | 58 | 98 | 2 | 19 |
| 33 | 1 | 37 | 66 | 1 | 56 | 99 | 1 | 28 |
| 100 | 9 | 20 | | | | | | |
We first consider constructing a frequency distribution for the number of violations over the 100 counties in the state. Note that this is quantitative data, but because the variable takes only a few distinct values, it suffices to treat each possible value of the variable as its own category and then follow the same calculations shown earlier. The resulting frequency distribution, complete with relative frequencies and percentages, is given in Table 10.7. This frequency distribution is interpreted just as any other frequency distribution. For example, one can observe that most of the counties have one or two violations. Of the remaining numbers of violations, having no violations happens the most. While there are several counties with more than two violations, they are relatively rare.
Table 10.7 Frequency distribution for the number of water quality violations for the 100 counties.
| Number | Frequency | Relative Frequency | Percent |
|---|---|---|---|
| 0 | 9 | 0.09 | 9% |
| 1 | 44 | 0.44 | 44% |
| 2 | 38 | 0.38 | 38% |
| 3 | 2 | 0.02 | 2% |
| 4 | 1 | 0.01 | 1% |
| 5 | 1 | 0.01 | 1% |
| 6 | 1 | 0.01 | 1% |
| 7 | 1 | 0.01 | 1% |
| 8 | 1 | 0.01 | 1% |
| 9 | 2 | 0.02 | 2% |
| Total | 100 | 1.00 | 100% |
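The calculation behind Table 10.7 can be sketched in a few lines of Python. This is a minimal illustration, not the authors' procedure: the function name `frequency_table` is hypothetical, and the `violations` list is rebuilt from the frequencies in Table 10.7 rather than read directly from Table 10.6.

```python
from collections import Counter

def frequency_table(values):
    """Return (value, frequency, relative frequency, percent) rows, one per distinct value."""
    counts = Counter(values)
    n = len(values)
    rows = []
    for value in sorted(counts):
        freq = counts[value]
        rel = freq / n
        rows.append((value, freq, round(rel, 2), round(100 * rel)))
    return rows

# Violation counts rebuilt from the frequencies in Table 10.7 (an assumption made for brevity).
table_10_7 = {0: 9, 1: 44, 2: 38, 3: 2, 4: 1, 5: 1, 6: 1, 7: 1, 8: 1, 9: 2}
violations = [value for value, freq in table_10_7.items() for _ in range(freq)]

for value, freq, rel, pct in frequency_table(violations):
    print(f"{value:>6} {freq:>9} {rel:>8.2f} {pct:>6}%")
```

Treating each distinct value as its own category works here only because the variable takes ten distinct values; the same function applied to a variable with many distinct values would produce a long table of small counts.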
With the success of computing the frequency distribution of the data on the number of water violations, we can move to constructing a frequency distribution on the median incomes of the counties. If we consider taking the same approach, which uses each possible value of the variable as a category for the frequency table, we begin to see a potential problem. Looking through the data, we find that this variable takes on many possible values. In fact, there are 46 different values of the income variable, and each of these values would need a line in the frequency table for this data. The table would also have many frequencies below 3 because most of the values occur only a few times. For these two reasons this table would not be very useful: it would be difficult to read because it is so large, and it would not show where the data are concentrated because most of the frequencies would be small.
For quantitative data with many possible values, a frequency distribution can be constructed by creating classes, which are ranges of possible values for the variable. The frequencies of the classes are then computed as the number of observed values of the variable that fall within each specified range. Relative frequencies and percentages are then computed as detailed earlier. When constructing such a table, researchers need to be careful in defining these classes. In particular, each data value of the variable should fall in exactly one class. That means that the classes should not overlap, and that they should cover the entire range of the variable.
A set of classes for a quantitative variable is a set of non-overlapping ranges that cover all the values of the variable so that each value of the variable falls in exactly one class.
The classes should not overlap because a data value should not be counted twice when computing the frequencies. Similarly, the range of the classes should cover the range of values of the variable so that each data value gets included in the frequency.
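The two requirements in the definition can be checked mechanically. The sketch below assumes integer-valued data and classes written as inclusive `(low, high)` pairs; the function name `check_classes` and the example classes are hypothetical and chosen only for illustration.

```python
def check_classes(classes, values):
    """List the ways integer classes (low, high), inclusive at both ends, fail the definition."""
    problems = []
    highest_so_far = None
    for low, high in sorted(classes):
        # A class that starts at or below an earlier upper limit overlaps that class.
        if highest_so_far is not None and low <= highest_so_far:
            problems.append(f"class {low} to {high} overlaps an earlier class")
        highest_so_far = high if highest_so_far is None else max(highest_so_far, high)
    for v in values:
        # Every observed value must be covered by at least one class.
        if not any(low <= v <= high for low, high in classes):
            problems.append(f"value {v} falls in no class")
    return problems

print(check_classes([(10, 19), (20, 29), (30, 39)], [10, 25, 39]))  # no problems
print(check_classes([(10, 20), (20, 29), (40, 49)], [20, 35]))      # overlap and a gap
```

In the second call the value 20 would be counted in two classes and the value 35 in none, which are exactly the two failures the definition rules out.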
Let us consider constructing classes for the median income variable for the data in Table 10.6. Recalling that the median incomes are reported in thousands of dollars, the smallest median income is 10 and the largest median income is 81. Therefore, the set of classes should cover this range. Next, the researcher must decide how to divide up this range into classes. The simplest approach is to create classes of equal length, dividing up the range of the data into equal parts. There should generally be at least four or five classes, but no more than ten. Classes that are not of equal size should generally be avoided because they make the frequencies difficult to interpret and leave the table open to manipulation that can be used to hide some aspects of the data. For this data it is convenient to use eight classes, starting with the class 10 to 19 and ending with the class 80 to 89, with equally sized classes in between. The frequency table corresponding to these classes is given in Table 10.8.
Table 10.8 Frequency distribution of the median incomes for the 100 counties.
| Class | Frequency | Relative Frequency | Percent |
|---|---|---|---|
| 10 to 19 | 21 | 0.21 | 21% |
| 20 to 29 | 26 | 0.26 | 26% |
| 30 to 39 | 17 | 0.17 | 17% |
| 40 to 49 | 17 | 0.17 | 17% |
| 50 to 59 | 12 | 0.12 | 12% |
| 60 to 69 | 4 | 0.04 | 4% |
| 70 to 79 | 1 | 0.01 | 1% |
| 80 to 89 | 2 | 0.02 | 2% |
| Total | 100 | 1.00 | 100% |
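As a rough check on Table 10.8, the class frequencies can be recomputed from the income column of Table 10.6. The sketch below is one possible way to do this, assuming integer incomes and equal-width classes; the function name `class_frequencies` and its parameters are hypothetical, and the `incomes` list is transcribed from Table 10.6.

```python
def class_frequencies(values, start, width, n_classes):
    """Count integer values in n_classes equal-width classes beginning at start."""
    counts = [0] * n_classes
    for v in values:
        index = (v - start) // width          # which class the value falls in
        if not 0 <= index < n_classes:
            raise ValueError(f"value {v} is outside the classes")
        counts[index] += 1
    labels = [f"{start + i * width} to {start + (i + 1) * width - 1}"
              for i in range(n_classes)]
    return list(zip(labels, counts))

# Median incomes (thousands of dollars) transcribed from Table 10.6.
incomes = (
    [31, 39, 30, 57, 24, 20, 56, 41, 81, 45, 23, 32, 25, 22, 26, 27, 28,
     47, 19, 20, 42, 38, 19, 16, 33, 41, 36, 50, 29, 63, 24, 17, 37]    # counties 1-33
    + [54, 64, 41, 34, 25, 28, 19, 55, 80, 10, 18, 20, 64, 12, 14, 24, 17,
       35, 39, 18, 41, 40, 17, 22, 30, 26, 18, 76, 55, 56, 26, 58, 56]  # counties 34-66
    + [63, 22, 17, 49, 23, 55, 18, 18, 54, 13, 47, 37, 44, 45, 27, 56, 48,
       49, 32, 24, 25, 41, 34, 42, 49, 27, 18, 17, 34, 33, 18, 19, 28]  # counties 67-99
    + [20]                                                              # county 100
)

print(len(set(incomes)), "distinct income values")
for label, count in class_frequencies(incomes, start=10, width=10, n_classes=8):
    print(f"{label:>9} {count:>4} {count / len(incomes):>6.2f}")
```

Using integer division to locate the class guarantees that each value is counted exactly once, so the non-overlap and coverage requirements hold by construction as long as no value falls outside the eight classes.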
The frequency distribution given in Table 10.8 can be interpreted just as any other frequency distribution. From the table we can observe that the median income for most of the counties is between 10 thousand and 60 thousand dollars, with a slightly higher concentration on the lower end of the scale. It is most common for the counties to have a median income between 20 thousand and 29 thousand dollars. The counties in the lower portion of the table, corresponding to counties that have higher incomes, do seem unusual and extreme in that their incomes seem much larger than those of most of the counties.
A similar type of calculation can be done with the debt variable from the study of race and student debt. For the data shown in Table 10.1 the variable corresponding to student debt is quantitative, with a range running from 0 (no debt) to 68 thousand dollars. As with the median county income data we have constructed classes of equal length, starting with the class 0 to 9 and ending with the class 60 to 69. The corresponding frequency distribution for the data using these classes is given in Table 10.9. These data are interesting in that an unusual pattern is observed in the frequency distribution. There are two classes with relatively large frequencies: the class 0 to 9, with the largest frequency of 38, and the class 20 to 29, with the next largest frequency of 21. What is interesting is that in many frequency distributions the classes with large frequencies tend to be grouped together with no classes with a small frequency between them. In this case, these two classes are separated by the class 10 to 19, which has a frequency of 10, less than half that of the second largest frequency. For this data it means that there seem to be two sub-populations in the data, one with very low debt and one with debt between 20 thousand and 29 thousand dollars. Such an observation would generally warrant further investigation in a research project to see if this behavior could be explained by any of the other observed variables.
Table 10.9 Frequency distribution for the debt at graduation for the sample of 100 alumni.
| Class | Frequency | Relative Frequency | Percent |
|---|---|---|---|
| 0 to 9 | 38 | 0.38 | 38% |
| 10 to 19 | 10 | 0.10 | 10% |
| 20 to 29 | 21 | 0.21 | 21% |
| 30 to 39 | 14 | 0.14 | 14% |
| 40 to 49 | 10 | 0.10 | 10% |
| 50 to 59 | 6 | 0.06 | 6% |
| 60 to 69 | 1 | 0.01 | 1% |
| Total | 100 | 1.00 | 100% |
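One crude way to spot the two-peak pattern in Table 10.9 is to look for classes whose frequency is larger than that of each neighboring class. The sketch below is only a heuristic under that assumption, since small random fluctuations can also create such peaks; the function name `peak_classes` is hypothetical.

```python
def peak_classes(labels, freqs):
    """Return labels of classes whose frequency exceeds that of every adjacent class."""
    peaks = []
    for i, f in enumerate(freqs):
        left = freqs[i - 1] if i > 0 else -1                 # no neighbor on the left
        right = freqs[i + 1] if i < len(freqs) - 1 else -1   # no neighbor on the right
        if f > left and f > right:
            peaks.append(labels[i])
    return peaks

# Classes and frequencies from Table 10.9.
labels = ["0 to 9", "10 to 19", "20 to 29", "30 to 39", "40 to 49", "50 to 59", "60 to 69"]
freqs = [38, 10, 21, 14, 10, 6, 1]

print(peak_classes(labels, freqs))  # ['0 to 9', '20 to 29']
```

Two separated peaks are a prompt for further investigation, not evidence of two sub-populations on their own.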
The choice of the classes can have a great effect on how the frequency table is interpreted. To demonstrate how this can occur, we will consider several hypothetical frequency distributions and show how they can be manipulated so that readers of the research draw different conclusions. For the first case, consider a study that looks at the number of police calls per day in an urban area for a 100-day period. The original frequency distribution is given in Table 10.10. For brevity we have only included the frequencies in this table as they are sufficient for our purpose here. The number of police calls per day ranges from 0 to 8. Most days get a few calls, but there are a few days that get many more. Now consider what would happen if we collapsed the last four classes in the frequency distribution (see Table 10.11). In this case the last class becomes one of the classes with the largest frequencies, resulting in an unnecessary emphasis on days with a relatively large number of police calls. What makes the problem worse is that the last class does not have an upper bound, so the reader cannot tell what the true upper bound may be. Ten calls? Twenty calls? There is no way to know from the second table that the upper bound on the range was eight calls. Someone might use a tactic like this to make the neighborhood seem more “dangerous” than it really is.
Table 10.10 Frequency distribution of the number of police calls received per day over a 100-day period in an urban area.
| Calls | Frequency |
|---|---|
| 0 | 10 |
| 1 | 20 |
| 2 | 27 |
| 3 | 13 |
| 4 | 10 |
| 5 | 9 |
| 6 | 7 |
| 7 | 3 |
| 8 | 1 |
Table 10.11 Frequency distribution of the number of police calls received per day over a 100-day period in an urban area, with the last four classes collapsed.
| Calls | Frequency |
|---|---|
| 0 | 10 |
| 1 | 20 |
| 2 | 27 |
| 3 | 13 |
| 4 | 10 |
| 5 or more | 20 |
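The step from Table 10.10 to Table 10.11 amounts to summing the frequencies of the collapsed classes. A minimal sketch of that step follows, with the hypothetical function name `collapse_tail` and the frequencies taken from Table 10.10.

```python
def collapse_tail(freqs, cutoff):
    """Merge every class with value >= cutoff into one open-ended class."""
    collapsed = {str(v): f for v, f in freqs.items() if v < cutoff}
    collapsed[f"{cutoff} or more"] = sum(f for v, f in freqs.items() if v >= cutoff)
    return collapsed

# Frequencies from Table 10.10 (calls per day -> number of days).
calls = {0: 10, 1: 20, 2: 27, 3: 13, 4: 10, 5: 9, 6: 7, 7: 3, 8: 1}

for label, freq in collapse_tail(calls, cutoff=5).items():
    print(f"{label:>10} {freq:>4}")
```

The collapsed class holds 9 + 7 + 3 + 1 = 20 days, which ties it with the class for one call per day even though no single original class above four calls had more than 9 days.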
As a second example of how frequency distributions can be manipulated, consider a study that randomly sampled individuals from a large population. The authors of the study are attempting to argue that the individuals who took part in the study came from all age groups between 20 and 60 years old. A frequency distribution for the observed sample is given in Table 10.12.
Table 10.12 Frequency distribution of the ages of participants in a social justice study.
| Age | Frequency |
|---|---|
| 20 to 24 | 71 |
| 25 to 29 | 8 |
| 30 to 34 | 5 |
| 35 to 39 | 86 |
| 40 to 44 | 82 |
| 45 to 49 | 65 |
| 50 to 54 | 76 |
| 55 to 59 | 6 |
What is immediately obvious in this frequency distribution is that although there are many age ranges with a substantial number of observations, the study found relatively few individuals willing to participate who were aged between 25 and 34, or who were older than 54. This could indicate that the conclusions from the study may not be relevant for those individuals who are in the age ranges with fewer observed individuals. Now consider combining the adjacent classes to get the frequency distribution given in Table 10.13. In this second table the gap in the age coverage has been completely obscured by combining the adjacent classes.
Table 10.13 Frequency distribution of the ages of participants in a social justice study with half as many classes.
| Age | Frequency |
|---|---|
| 20 to 29 | 79 |
| 30 to 39 | 91 |
| 40 to 49 | 147 |
| 50 to 59 | 82 |
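Combining adjacent classes, as was done to obtain Table 10.13 from Table 10.12, can be sketched as follows. The function name `merge_pairs` and the `(low, high, frequency)` representation are assumptions made for this illustration.

```python
def merge_pairs(classes):
    """Combine each pair of adjacent (low, high, frequency) classes into one wider class."""
    merged = []
    for i in range(0, len(classes) - 1, 2):
        (low, _, f1), (_, high, f2) = classes[i], classes[i + 1]
        merged.append((low, high, f1 + f2))
    if len(classes) % 2:              # an unpaired last class is kept unchanged
        merged.append(classes[-1])
    return merged

# Classes and frequencies from Table 10.12.
ages = [(20, 24, 71), (25, 29, 8), (30, 34, 5), (35, 39, 86),
        (40, 44, 82), (45, 49, 65), (50, 54, 76), (55, 59, 6)]

for low, high, freq in merge_pairs(ages):
    print(f"{low} to {high}: {freq}")
```

Each sparsely sampled class (8, 5, and 6 participants) is absorbed by a well-sampled neighbor, so none of the merged frequencies is below 79.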
It can be very difficult to determine if a researcher is purposely being misleading with the way they constructed a frequency distribution. When assessing frequency distributions, it is usually enough to ask yourself if the classes look strange. Usually, classes should follow some logical pattern. It seems natural to break up data into classes based on multiples of ten, for example. If you observe classes broken up in awkward ways, or in a way such that some classes are much wider than others, you should look in the research article to see if the researchers justify the way they constructed the classes. If no such justification is given, the conclusions based on these frequency tables might be suspect.
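Two of the warning signs discussed in this section, classes of unequal width and classes with no upper bound, are simple enough to check automatically. The sketch below assumes integer classes written as `(low, high)` pairs, with `high=None` standing for an open-ended class; the function name `class_warnings` and the example classes are hypothetical. A warning only means the reader should look for the researchers' justification, not that the table is misleading.

```python
def class_warnings(classes):
    """Flag open-ended classes and unequal widths among (low, high) integer classes."""
    warnings = []
    widths = set()
    for low, high in classes:
        if high is None:
            warnings.append(f"class '{low} or more' has no upper bound")
        else:
            widths.add(high - low + 1)   # inclusive width, e.g. 10 to 19 has width 10
    if len(widths) > 1:
        warnings.append(f"classes have unequal widths: {sorted(widths)}")
    return warnings

print(class_warnings([(10, 19), (20, 29), (30, 39)]))
print(class_warnings([(0, 0), (1, 1), (2, 2), (5, None)]))
print(class_warnings([(0, 9), (10, 19), (20, 49)]))
```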

