3.3: Categorical Variables - Variables That Vary by Type
The variable type is “this” or “that.” The “it” can vary according to what type it is. Gender varies by type. (Assuming a binary perspective) people are male or female. Racial identity varies by type. (Assuming an erroneous polarization perspective) people are “labeled” according to the government as White/Caucasian, Black, Asian, Middle Eastern, Native American, Alaskan, Hawaiian, with Latino/Hispanic as an ethnic identity. Cancer varies by type (at least under current conventional wisdom), such as breast, prostate, skin, and lung cancer.
Type means sorting the data as being one or the other. There is no overlap between the types; you can belong to only one type, not another. So, people are sorted as male or female, not both; as White/Caucasian, Black, Asian, Middle Eastern, Native American, Alaskan, Hawaiian, or Latino/Hispanic, but not several of these; and as having breast, prostate, skin, or lung cancer, but only one of them. (And yes, these examples assume an erroneous binary, which is problematic from several perspectives.)
There are several general and statistical terms for variation by type. We think of type in terms of groups and categories. People are sorted into group A or group B. The terms type, group, and category are used interchangeably. We say, what are the “types” within the gender variable? Or how many groups are there within the race variable? How many types does the category of race consist of? Each of these questions refers to a variable and then sorts it into categories, types, or groups.
Statistically, we think of type as nominal variables or fixed effects. Nominal variables are simply variables that vary by type, and by nominal, we usually think of “name” and not a number. So, gender type varies by name, male or female. Race varies by name, such as White/Caucasian, Black, Asian, Latino/Hispanic. By fixed effect, we mean no variation within each group in a variable. Everybody in the male group is considered male, and everyone in the female group is considered female. There is no one more “male” or “female” than the others within that group (yes – the preceding sentence ignores social constructs of masculinity and femininity, but for now, the example is only used to illustrate a fixed effect).
How do you tell if something varies by type? Well, “it” varies by type if whatever it is can be sorted into categories or groups that are discrete and mutually exclusive. The categories and groups are different from each other, and “it” can only be in one category or another. What makes something a type is that the members of one group have some attribute in common AND the members of the other group do not have that attribute: no sharing, no blurring, no spillover of the characteristics.
Gender is a common example. Assuming binary for the moment, we sort people into males and females because we say males have something (yes, biological) that the females do not have.
As you can surmise, gender may not be the best example of type under a gender-affirming perspective that treats identity as a continuous social construct, because there is overlap among genders, and rightly so. A better example is treatment versus control groups. One group gets the treatment; the other gets the control condition. There is no overlap between the groups because the treatment group receives the treatment, and the control group gets nothing that resembles the treatment. Even here, there could be overlap. If the treatment group gets a mindfulness treatment protocol, who is to say that the people in the control group are NOT getting some type of mindfulness practice in their everyday lives? So, is treatment versus control a fitting example of variation by type? More on this later.
When we think that something varies by type, it is important to note that no one type is higher, lower, more, less, better, or worse than the other. There is no valence or value to the type. The order of the groups or categories within a type is arbitrary. Think horizontal, on an even plane. All groups within that type are on an equal plane. The order DOES NOT MATTER! Emphasis is necessary because order matters a lot when we discuss continuous variables later.
In our gender example, males and females are on an equal level. Males are not more, better, or less than females and vice versa (assuming a gender equality society, which, as we know, is not happening). This means the order can be interchanged: females can be listed first, then males, and it does not matter.
What exactly doesn’t matter? The numbers that represent each group or category within that type.
Numbers are codes that represent each type. The numbers don’t matter. The numbers themselves have no value; they are just codes, numbers arbitrarily assigned to each category. Statistics programs cannot process words. So, we need numbers for each category or group within that type for the statistical program to do its computations.
For gender, we cannot enter “male” and “female” into the statistics program. We code them with “male” = 1 and “female” = 2. The order does not matter. We could use “female” = 1 and “male” = 2. The number does not matter either. We could use “male” = 0 and “female” = 1, or “male” = 5 and “female” = 7.
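The point that the codes are arbitrary can be checked with a minimal Python sketch. The five sample records below are made up for illustration, and the helper name `counts_by_label` is hypothetical; the only claim is that the count per category comes out the same under each of the coding schemes mentioned above.

```python
from collections import Counter

# Hypothetical sample of recorded gender labels (illustrative data only).
labels = ["male", "female", "female", "male", "female"]

# The three coding schemes mentioned in the text.
schemes = [
    {"male": 1, "female": 2},
    {"male": 0, "female": 1},
    {"male": 5, "female": 7},
]


def counts_by_label(labels, scheme):
    """Encode labels as numbers, count the codes, then map codes back to labels."""
    codes = [scheme[label] for label in labels]
    code_counts = Counter(codes)
    decode = {code: label for label, code in scheme.items()}
    return {decode[code]: n for code, n in code_counts.items()}


results = [counts_by_label(labels, scheme) for scheme in schemes]
for result in results:
    print(result)
```

Each scheme yields the same tally per category, which is what it means for the numbers to be labels rather than quantities.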
Honestly, we don’t just assign numbers arbitrarily. Intuitively and for organizational purposes, it is best to assign numbers in conventional ways. So yes, a “5” and “7” seem random and don’t help our projects. It does matter for organizational purposes to assign numbers in an order that makes sense for comparison and advanced analyses, such as dummy coding in regression analyses. For example, assigning “1” for treatment and “0” for control makes sense in terms of low-to-high order. We want to see if treatment improves symptoms, so we anticipate that the treatment group will score higher. We give it a “1” rather than a “0” because we expect the treatment to produce more symptom relief than the control condition. So, sometimes, we want to intentionally assign numbers to code the distinct groups or categories. We’ll return to this issue.
How many types? That depends on what you know about the variable, how many types are possible, and what types you need based on your research question. Usually, we are concerned about whether the variable is dichotomous (just two groups) or consists of multiple groups. Gender is usually thought of as dichotomous: male or female. Race is usually thought of as multiple groups: White/Caucasian, Black, Latino/Hispanic, and Asian.
The best way to answer how many types are within a nominal, categorical variable is to use your conceptualization of your research question. For gender, you can ask if someone is male or female. But is that all you need for your research question? Does knowing whether someone is male or female explain the variation you are curious about in your outcome? Suppose the psychological outcome is willingness to seek treatment. Being male or female might be an issue because, in general, females tend to seek therapy more than males do. So, we need to ask about gender in the male vs. female category. But if you are interested in how cis-gender and transgender people seek treatment, now you are introducing another conceptualization. So, we need to ask about gender as male vs. female, but now also include transgender male-to-female and transgender female-to-male categories.
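A small Python sketch can show why the 0/1 convention is convenient. The scores and the variable names here are hypothetical, chosen only for illustration. With control coded 0 and treatment coded 1, the mean of the code column is the proportion of participants in the treatment group, and the difference in group means reads naturally as "treatment minus control."

```python
from statistics import mean

# Hypothetical records: (group label, symptom-relief score). Illustrative data only.
records = [
    ("treatment", 7), ("treatment", 9), ("treatment", 8),
    ("control", 5), ("control", 6),
]

# Dummy coding: the reference group (control) is 0, the group of interest is 1.
dummy = {"control": 0, "treatment": 1}
codes = [dummy[group] for group, _ in records]

# The mean of a 0/1 column is the proportion of cases coded 1.
proportion_treated = mean(codes)

# Group means, selected by dummy code rather than by label.
treated_mean = mean(score for group, score in records if dummy[group] == 1)
control_mean = mean(score for group, score in records if dummy[group] == 0)
difference = treated_mean - control_mean

print(proportion_treated, treated_mean, control_mean, difference)
```

In a simple regression with this single dummy as the predictor, the slope equals this same difference in group means, which is one reason the 0/1 convention is preferred over codes such as 5 and 7.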
At this point, the rabbit hole can really open up. You may be concerned that you did not include people who are transitioning genders and have yet to consider themselves as fully transitioned to male or female. So, you may include additional category types for gender.
What is correct?
We do have to be mindful about being appropriate and not merely appearing inclusive. If your goal is to create a census, you may need more categories. For example, if you want to create a census of the number of gender-affirming identities enrolling in a psychology graduate program, then yes, to be inclusive, more gender categories are needed to address all identities.
You need to let your conceptualization of your research question be your guide. Suppose your literature review and your concern point only to comparing males and females on their willingness to seek treatment. In that case, you may only need male and female categories. Suppose you have concerns that something is going on regarding transgender experiences compared to cis-gender experiences, and that something may be stigma or hostile gender climates within schools or religious groups. In that case, you may need to include transgender categories. You should not arbitrarily list every category. Including everything simply gets messy in organizing your variables and data.
What is incorrect is to have an “everything but the kitchen sink” mentality, trying to address all possibilities so that you do not miss anything. Including all possible gender categories is not good practice. Too many gender categories make it difficult to account for all of them in your analysis. It entails more work because you will have to collapse and re-code the categories if there are few or zero participants in some of them. Including a category to get one more participant does not dramatically change your results. If you find yourself in a position where you feel you are being forced to include more categories, types, or groups within a variable to be comprehensive, that is a sure sign to say “stop” and not go any further. Only include those categories, types, and groups that are conceptually useful in answering your research question.
Groups are typically collapsed into fewer groups after examining the data. Race is often collapsed because there are not enough participants per category. Suppose we examine race but have most participants in the White/Caucasian category and only a few in the Black, Latino/Hispanic, and Asian categories. In that case, someone might decide to collapse the Black, Latino/Hispanic, and Asian categories into a general “minority” or “non-White” category. Is that good practice? Always use your research question and your desired outcome as your guide. If you are doing a census of racial categories of students seeking counselling at a university counselling center, and the numbers for other racial categories are low, it is best not to collapse them. In this case, your interest is in encouraging students from racial groups to attend counselling, and it does matter which racial groups are more underrepresented than others. So, it is best not to collapse those groups. On the other hand, suppose you are examining something where race is not the primary concern in the outcome, such as risk factors for pediatric cancer (acknowledging that racial disparities are present in every issue). In that case, collapsing race into, say, a White versus non-White category might be sufficient to answer the research question.
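Collapsing amounts to a re-coding step, which can be sketched in Python as follows. The ten sample records and the name `collapse_map` are hypothetical; whether collapsing is appropriate still depends on the research question, as discussed above.

```python
from collections import Counter

# Hypothetical sample of race labels (illustrative data only).
race = ["White/Caucasian"] * 6 + ["Black", "Black", "Latino/Hispanic", "Asian"]

# Explicit re-coding map: every original category is assigned a collapsed category.
collapse_map = {
    "White/Caucasian": "White",
    "Black": "non-White",
    "Latino/Hispanic": "non-White",
    "Asian": "non-White",
}

before = Counter(race)                      # counts per original category
collapsed = [collapse_map[r] for r in race]
after = Counter(collapsed)                  # counts per collapsed category

print(dict(before))
print(dict(after))
```

Writing the map out explicitly keeps a record of which original categories went into each collapsed group. The original column should be kept as well, because the detail lost by collapsing cannot be recovered from the collapsed codes.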
What does nominal data look like? Put differently, what does the data show for each type within a categorical variable? The data are frequency counts, or a “yes” or a “no”: the person or observation either is or is not included in that category. In the dataset, we assign a code to each person, subject, or observation. So, we have a participant, and we want to show whether they are male or female in their data record. We enter “1” or “2” for the gender code. The statistical analysis output is a frequency count of how many per type are within that variable. If we have 50 males and 50 females, our statistical output is a total frequency count or a percentage of how many subjects are in each category.
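As a rough illustration of that output, the following Python sketch builds a frequency table from a column of gender codes. The dataset is hypothetical (50 records coded 1 and 50 coded 2, matching the example above), and `value_labels` is an assumed name for the code-to-label lookup.

```python
from collections import Counter

# Hypothetical dataset: one gender code per participant (1 = male, 2 = female).
gender_codes = [1] * 50 + [2] * 50
value_labels = {1: "male", 2: "female"}

counts = Counter(gender_codes)
total = sum(counts.values())

# For each category: (frequency count, percentage of all participants).
frequency_table = {
    value_labels[code]: (n, 100 * n / total)
    for code, n in sorted(counts.items())
}

for label, (n, pct) in frequency_table.items():
    print(f"{label:<8}{n:>5}{pct:>8.1f}%")
```

The codes 1 and 2 appear only as lookup keys; the numbers that carry meaning in the output are the counts and percentages.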
The numbers we assign to each type within that category are just codes; they do not matter and have no inherent meaning. The numbers are strictly just for labelling each of the types. However, the output numbers are total frequency counts representing the number of persons for each type within that categorical variable.
This distinction between numbers that mean something and numbers that do not becomes important when we turn to the second type of variable: continuous variables.


