5.3: The Structure of Data
Numerical data, whether it originates from quantitative observations or is used to represent qualitative observations, is further classified by the amount of inherent structure in the data. Some data, such as physical measurements like length and mass, have a great deal of structure. Salaries are another example. If we compare the salaries of two individuals, we can not only determine who has the higher salary, but also compute by how much one salary exceeds the other. Furthermore, we can make comparisons based on fractions; for example, that one salary is twice the other.
Other measurements may have a little less structure. Temperature is a good example. We can certainly compare two temperatures and determine which is hotter or colder, and we can determine how much one temperature is greater or less than another. But we cannot compare temperatures using fractions. For example, it is not correct to say that 100 degrees Fahrenheit is twice the temperature of 50 degrees Fahrenheit. To see why, convert both temperatures to degrees Celsius: 100 degrees Fahrenheit is about 38 degrees Celsius, while 50 degrees Fahrenheit is 10 degrees Celsius. But \( 100\div 50 = 2 \) and \( 38\div 10 = 3.8 \). It follows that when temperature is measured in Celsius, the same higher temperature is about 3.8 times the same lower temperature. Hence, it does not make sense to say that “today is twice as hot as yesterday” because the comparison would be different if someone measured the temperature using different units.
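The unit-dependence of temperature ratios can be checked with a short Python sketch. The function name `fahrenheit_to_celsius` and the variable names are illustrative choices, not part of the text. Note that the unrounded Celsius ratio is about 3.78; the value 3.8 above comes from rounding 37.78 to 38 before dividing.

```python
def fahrenheit_to_celsius(deg_f: float) -> float:
    """Convert a temperature from degrees Fahrenheit to degrees Celsius."""
    return (deg_f - 32.0) * 5.0 / 9.0


hot_f, cool_f = 100.0, 50.0
hot_c = fahrenheit_to_celsius(hot_f)
cool_c = fahrenheit_to_celsius(cool_f)

# The same two temperatures give different ratios in different units.
ratio_f = hot_f / cool_f
ratio_c = hot_c / cool_c

print(f"Fahrenheit: {hot_f} / {cool_f} = {ratio_f:.2f}")
print(f"Celsius:    {hot_c:.2f} / {cool_c:.2f} = {ratio_c:.2f}")

# The ordering, by contrast, is the same in both units.
print(hot_f > cool_f, hot_c > cool_c)
```

The ordering of the two temperatures agrees in both units, while the ratio changes, which is the sense in which temperature supports comparisons of order and difference but not of ratio.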
Still other numerical data have even less structure. We might decide to code data about age based on age ranges, as in Table 5.3. With this coding we can compare ages in that we can determine whether one person is in an older age range than another. However, it does not make sense to say that an individual whose age is coded as 3 is one unit older than an individual whose age is coded as 2. Further, if two individuals have the same coded age, then they could be the same age, or either one could be older than the other.
Table 5.3. An example of coding data for age ranges in years.
| Range | Code |
|---|---|
| 0-18 | 1 |
| 19-21 | 2 |
| 22-65 | 3 |
| 65+ | 4 |
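As a rough illustration of what the age coding keeps and what it discards, the following Python sketch maps ages to the codes in Table 5.3. The function `code_age` and the three individuals are hypothetical. Because the table lists both "22-65" and "65+", the sketch assumes that age 65 falls in code 3 and that code 4 starts at 66.

```python
def code_age(age_years: int) -> int:
    """Return the code for an age, following the ranges in Table 5.3.

    Assumption: age 65 is placed in code 3 (the table lists it in both
    "22-65" and "65+"), so code 4 means 66 and older.
    """
    if age_years < 0:
        raise ValueError("age must be non-negative")
    if age_years <= 18:
        return 1
    if age_years <= 21:
        return 2
    if age_years <= 65:
        return 3
    return 4


# Hypothetical individuals and their ages in years.
ages = {"A": 20, "B": 23, "C": 64}
codes = {name: code_age(age) for name, age in ages.items()}
print(codes)

# Order is meaningful: A is in a younger age range than B.
print(codes["A"] < codes["B"])

# Differences are not: the codes of A and B differ by 1, the ages by 3 years.
print(codes["B"] - codes["A"], ages["B"] - ages["A"])

# Equal codes can hide a large gap: B and C share a code but differ by 41 years.
print(codes["B"] == codes["C"], ages["C"] - ages["B"])
```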
We get numerical data with even less structure if we consider the codes for relationship status given in Table 5.1. In this case we can only determine if two individuals have the same relationship status. It does not make sense to say that cohabitation is greater than single because the corresponding code for cohabitation is greater than the code for being single. We could have just as easily reversed these codes.
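The point about reversing the codes can be sketched in Python. The codings below are hypothetical, since the values in Table 5.1 are not reproduced in this section; the category "married" and the individuals `P1` to `P3` are likewise made up for illustration.

```python
# Two hypothetical codings of the same categories; the second reverses the first.
coding_a = {"single": 1, "cohabiting": 2, "married": 3}
coding_b = {"single": 3, "cohabiting": 2, "married": 1}

# Hypothetical individuals and their relationship status.
people = {"P1": "single", "P2": "cohabiting", "P3": "single"}


def same_status(coding: dict, x: str, y: str) -> bool:
    """Equality comparison on the coded values."""
    return coding[people[x]] == coding[people[y]]


def greater(coding: dict, x: str, y: str) -> bool:
    """Greater-than comparison on the coded values."""
    return coding[people[x]] > coding[people[y]]


# Equality gives the same answer under either coding.
print(same_status(coding_a, "P1", "P3"), same_status(coding_b, "P1", "P3"))
print(same_status(coding_a, "P1", "P2"), same_status(coding_b, "P1", "P2"))

# Greater-than depends entirely on the arbitrary choice of codes.
print(greater(coding_a, "P2", "P1"), greater(coding_b, "P2", "P1"))
```

Only the equality comparison survives the recoding, which is why it is the only comparison treated as valid for this kind of data.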
The amount of structure in numerical data is classified by which types of numerical calculations make sense for the data. The data with the least amount of structure are called nominal data.
A variable has a nominal measurement scale if the only possible valid mathematical comparison of two data points is whether they are equal.
Nominal data usually arises from numerically coding qualitative data that is categorized, classified, named, or labeled. For nominal data the underlying categories do not have any mathematical structure, as with the example given by the coding in Table 5.1. The key idea is that the categories cannot even be rank ordered by which is greater. Hence, variables with numeric coding for marital status, gender identity, hair color, political affiliation, or religious affiliation are all examples of variables with a nominal scale of measurement. Nominal variables may also refer to non-numeric data represented by categories or names that cannot be ranked or sorted.
If data can be at most ranked or sorted, then the data has an ordinal measurement scale.
A variable has an ordinal measurement scale if the only valid mathematical comparisons of two data points are whether they are equal and whether one is greater than the other.
While observations from a variable with an ordinal measurement scale can be ranked or sorted, it is not valid to look at the differences between the numerical codes for the categories. The age range coding given in Table 5.3 is an example of a variable with an ordinal measurement scale. From the coding given in this table we can decide whether two individuals are in the same age range or whether one individual is in an older age range than the other. We cannot, however, determine how many years older one person may be than another by looking at the coded data. Social class is another example of a variable that can have an ordinal measurement scale. For example, social class may be coded as shown in Table 5.4. The ranking of these classes may reflect the perceived prestige of the corresponding classes by individuals in a population. Looking at the data, we can conclude whether two individuals are in the same social class, or whether one is in a social class that is perceived as higher than the other. But it is not valid to take the numerical difference between the class codes. As with nominal data, ordinal variables may also refer to non-numeric data represented by categories or names that can be ranked and sorted.
Table 5.4. An example of coding social classes.
| Class | Code |
|---|---|
| Working Class | 1 |
| Middle Class | 2 |
| Upper Class | 3 |
Ordinal data is very common in surveys. For example, a study that sought to determine whether racial microaggressions were influencing the health of Asian Americans asked everyone in the study: “How would you rate your overall health in the past-year?” The responses were reported using the categories excellent, very good, good, fair, and poor. This is qualitative categorical data on an ordinal scale (Nicholson and Mei, 2020). Another very common type of survey question gives a statement and asks the respondent to gauge how much they agree with it, usually on a scale with categories “Strongly Agree”, “Agree”, “Neither Agree nor Disagree”, “Disagree”, and “Strongly Disagree”, which is also qualitative categorical data on an ordinal scale. This latter example is often called a Likert scale (Joshi et al., 2015).
Ordinal data is often coded using a numerical system. For example, one might assign the value 5 to “Strongly Agree”, 4 to “Agree”, 3 to “Neither Agree nor Disagree”, 2 to “Disagree”, and 1 to “Strongly Disagree”. These values are then averaged over the questions on a survey to get an average level of agreement with the statements in the survey. An example of this type of method will be studied later in this chapter. One should be cautious with studies that do these types of calculations with ordinal data: while this may allow researchers to compare the level of agreement between two individuals, or the average level of agreement over two groups of individuals, the actual values themselves have little meaning. Further, the use of these types of calculations must be justified by the researchers before the results can be trusted. These types of justifications are addressed later in the book.
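The coding and averaging step described above might look like the following Python sketch. The two respondents and their answers are invented for illustration, and `average_agreement` is a hypothetical helper; the sketch shows only the mechanics of the calculation and does not address whether averaging ordinal codes is justified in a given study.

```python
from statistics import mean

# Numerical coding of the agreement categories, as described in the text.
LIKERT_CODES = {
    "Strongly Agree": 5,
    "Agree": 4,
    "Neither Agree nor Disagree": 3,
    "Disagree": 2,
    "Strongly Disagree": 1,
}


def average_agreement(responses: list) -> float:
    """Average the coded responses of one respondent over all questions."""
    return mean(LIKERT_CODES[response] for response in responses)


# Hypothetical answers from two respondents to a three-question survey.
respondent_1 = ["Agree", "Strongly Agree", "Neither Agree nor Disagree"]
respondent_2 = ["Disagree", "Agree", "Disagree"]

avg_1 = average_agreement(respondent_1)
avg_2 = average_agreement(respondent_2)

# The averages can be used to rank the respondents by level of agreement,
# but a value such as 2.67 has no direct interpretation on its own.
print(avg_1, round(avg_2, 2), avg_1 > avg_2)
```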
If data can be sorted or ranked, and the differences between the data points provide a valid mathematical comparison, then the data is said to have an interval measurement scale.
A variable has an interval measurement scale if the valid mathematical comparisons of two data points are whether they are equal and whether one is greater than the other. Additionally, the difference between two data points is a valid comparison of how much one exceeds the other.
While observations from a variable with an interval measurement scale can be ranked or sorted, and the differences between the values provide a valid measurement, ratios are still not valid. Measurements that can be both positive and negative are commonly on the interval measurement scale, with the temperature measurement discussed in Section 5.2 providing an example. Another example of an interval measurement is the FICO credit score used in the United States, which has a range between 300 and 850. Creditworthiness, as measured by this score, can of course be ranked. That is, one individual is considered to have better credit than another if their score is higher. It is valid to look at differences between credit scores. For example, it is very common for so-called credit repair companies to advertise that they can increase your credit score by 100 points. However, it is not valid to state that a credit score can be doubled. For example, doubling a credit score of 500 would give a score of 1000, which is not even possible.
If data can be sorted or ranked, the differences between the data points provide a valid mathematical comparison, and ratios are also valid mathematical comparisons, then the data is said to have a ratio measurement scale.
A variable has a ratio measurement scale if the valid mathematical comparisons of two data points are whether they are equal and whether one is greater than the other. Additionally, differences and ratios between two data points are valid comparisons.
Observations from a variable with a ratio measurement scale can be ranked or sorted, the differences between the values provide a valid measurement, and ratios are valid as well. Hence, any of the usual arithmetic calculations that are valid for two numbers are also valid for ratio data. Many measurements are on the ratio scale, particularly if they are non-negative. For example, age is a ratio variable. Assuming that it is measured in years, a person who is 30 years old is twice the age of someone who is 15 years old. Other examples of variables that have a ratio measurement scale include height, weight, time to graduation, and income.
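In contrast to the temperature example earlier in this section, a ratio of two ages does not depend on the unit of measurement. The Python sketch below checks this for the ages 30 and 15 from the text, using a conversion from years to months as an illustrative change of units.

```python
MONTHS_PER_YEAR = 12

older_years, younger_years = 30, 15

# Ratio computed with ages measured in years.
ratio_years = older_years / younger_years

# Ratio computed with the same ages measured in months.
older_months = older_years * MONTHS_PER_YEAR
younger_months = younger_years * MONTHS_PER_YEAR
ratio_months = older_months / younger_months

# Both ratios say the older person is twice the age of the younger one.
print(ratio_years, ratio_months)
```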
As can be observed from the definitions given above, each successive definition adds more structure to the data over the last. This is visualized in Figure \(\PageIndex{1}\). As the definitions indicate, nominal data can only be compared to determine if two values are equal. Ordinal data has the structure of nominal data, but with ordinal data one can also determine if one value is larger than another. Interval data has the structure of ordinal data, but additionally allows one to determine how much one value exceeds another. Finally, ratio data has the structure of interval data, but additionally allows one to compare values through ratios.
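The nesting of the four scales can also be written down as a small Python sketch. The names `Scale`, `ADDED`, and `valid_comparisons` are hypothetical, and the integer values of the enumeration are used only to order the scales by how much structure they carry.

```python
from enum import IntEnum


class Scale(IntEnum):
    """Measurement scales, ordered from least to most structure."""
    NOMINAL = 1
    ORDINAL = 2
    INTERVAL = 3
    RATIO = 4


# The comparison that each scale adds on top of the previous one.
ADDED = {
    Scale.NOMINAL: "equality",
    Scale.ORDINAL: "order",
    Scale.INTERVAL: "difference",
    Scale.RATIO: "ratio",
}


def valid_comparisons(scale: Scale) -> list:
    """List every comparison that is valid for data on the given scale."""
    return [ADDED[s] for s in Scale if s <= scale]


for scale in Scale:
    print(scale.name, valid_comparisons(scale))
```

Each scale inherits the comparisons of the scales below it and adds exactly one more, which mirrors the way each definition above builds on the previous one.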

