
4.4: Multicollinearity and Categorical Independent Variables


So far, our independent variable has been either numeric (Toy Example) or dichotomous (Handedness Example). Let us now look at the interesting case of a discrete independent variable with three levels. This example raises questions about how to handle categorical independent variables in a manner that lends itself to using OLS and that eases interpretation.

Note that we are examining categorical independent variables in this chapter. Should we want to examine a categorical dependent variable, we would have to move away from OLS. To see why, you will need to continue reading the book.

    ✦•················• ✦ •··················•✦

    Example: Profiting Farmers

    His Majesty Rudolph II would like some input on his next five-year plan. The primary crop in Ruritania is corn. To help optimize the profits made by farmers, Rudolph wants to know if that crop should be changed to summer wheat or to soybeans.

    To help him, let us model the relationship between farmer profit and crop in Ruritania.

Table \(\PageIndex{1}\): Raw data for the example
    Crop Profit per Acre
    Wheat 722
    Wheat 965
    Wheat 940
    Wheat 756
    Corn 763
    Corn 765
    Corn 565
    Corn 621
    Soybean 566
    Soybean 658
    Soybean 540
    Soybean 485

    Solution:

Collecting the data is not as difficult as it may seem at first. All three crops are currently grown in Ruritania. All we had to do was obtain a list of all farms and their primary crop and randomly select records from that list. The table above provides our raw data. Note that the response variable is numeric, and the predictor variable is categorical. How do we code that variable so that we can use the methods of this chapter (and this class)?

In the Handedness Example, it was easy to change our dichotomous variable into a numeric variable by selecting one level as the base level and measuring the other level from there. In other words, one level was given the value \(0\) (absence) and the other was given the value \(1\) (presence). This is a nice feature of dichotomous variables. In this case, however, we have three levels in our independent variable. It does not make sense to select one level to represent with \(0\) (absence), one to represent with \(1\) (presence), and one to represent with \(2\) (what would "double presence" even mean?).

One method that always works is to create a series of dichotomous indicator variables from the one nominal variable. Thus, since there are three levels here, we would create three new dichotomous variables: corn, soybeans, and wheat. This change is presented in the table below. Note that each of the three dichotomous variables is now numeric. Each value indicates absence (0) or presence (1) of that trait (crop). With this change, we can use the methods of this chapter to calculate the OLS estimates \(b_0\), \(b_1\), \(b_2\), and \(b_3\) of the parameters \(\beta_0\), \(\beta_1\), \(\beta_2\), and \(\beta_3\)... or can we?

    To see why I ended that paragraph in an evil and foreboding voice, let us work through this using matrices.

    Table \(\PageIndex{2}\): Transformed data for the example
    Wheat Corn Soybean Profit per Acre
    1 0 0 722
    1 0 0 965
    1 0 0 940
    1 0 0 756
    0 1 0 763
    0 1 0 765
    0 1 0 565
    0 1 0 621
    0 0 1 566
    0 0 1 658
    0 0 1 540
    0 0 1 485
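If you would like to build these indicator columns in R rather than by hand, here is a minimal sketch (the vector names crop, wheat, corn, and soybean are mine, introduced just for illustration):

crop = c( rep("Wheat",4), rep("Corn",4), rep("Soybean",4) )

wheat   = as.numeric(crop=="Wheat")     # 1 if the farm grows wheat, 0 otherwise
corn    = as.numeric(crop=="Corn")
soybean = as.numeric(crop=="Soybean")

cbind(wheat, corn, soybean)             # reproduces the indicator columns of Table 2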

    Remember the formula to calculate the OLS estimators:
    \begin{equation}
    \mathbf{b} = \left(\mathbf{X}^{\prime} \mathbf{X}\right)^{-1} \mathbf{X}^{\prime} \mathbf{Y}
    \end{equation}

    Here, \(\mathbf{Y}\) is

    \begin{equation}
    \mathbf{Y} = \left[ \begin{matrix}
    722 \\ 965 \\ 940 \\ 756 \\ 763 \\ 765 \\ 565 \\ 621 \\ 566 \\ 658 \\ 540 \\ 485 \\
    \end{matrix} \right]
    \end{equation}

The design matrix, \(\mathbf{X}\), is
    \begin{equation}
    \mathbf{X}= \left[ \begin{matrix}
    1 & 1 & 0 & 0 \\ 1 & 1 & 0 & 0 \\ 1 & 1 & 0 & 0 \\ 1 & 1 & 0 & 0 \\
    1 & 0 & 1 & 0 \\ 1 & 0 & 1 & 0 \\ 1 & 0 & 1 & 0 \\ 1 & 0 & 1 & 0 \\
    1 & 0 & 0 & 1 \\ 1 & 0 & 0 & 1 \\ 1 & 0 & 0 & 1 \\ 1 & 0 & 0 & 1 \\
    \end{matrix} \right]
    \end{equation}

    So far, so good!

    Note

    At this point, can you see why this matrix is termed the "design" matrix? From it, one can deduce the experimental design that gave rise to the data.

    Next, let us calculate \(\mathbf{X}^\prime\mathbf{X}\):

    \begin{align}
    \mathbf{X}^\prime\mathbf{X} &= \left[ \begin{matrix}
    1 & 1 & 0 & 0 \\ 1 & 1 & 0 & 0 \\ 1 & 1 & 0 & 0 \\ 1 & 1 & 0 & 0 \\
    1 & 0 & 1 & 0 \\ 1 & 0 & 1 & 0 \\ 1 & 0 & 1 & 0 \\ 1 & 0 & 1 & 0 \\
    1 & 0 & 0 & 1 \\ 1 & 0 & 0 & 1 \\ 1 & 0 & 0 & 1 \\ 1 & 0 & 0 & 1 \\
    \end{matrix} \right]^\prime \left[ \begin{matrix}
    1 & 1 & 0 & 0 \\ 1 & 1 & 0 & 0 \\ 1 & 1 & 0 & 0 \\ 1 & 1 & 0 & 0 \\
    1 & 0 & 1 & 0 \\ 1 & 0 & 1 & 0 \\ 1 & 0 & 1 & 0 \\ 1 & 0 & 1 & 0 \\
    1 & 0 & 0 & 1 \\ 1 & 0 & 0 & 1 \\ 1 & 0 & 0 & 1 \\ 1 & 0 & 0 & 1 \\
    \end{matrix} \right] \\[1em]
    &= \left[ \begin{matrix}
    12 & 4 & 4 & 4 \\ 4 & 4 & 0 & 0 \\ 4 & 0 & 4 & 0 \\ 4 & 0 & 0 & 4 \\
    \end{matrix} \right]
    \end{align}

Fantastic! That is a rather interesting matrix. From it, you can pick out the sample size (\(n=12\), the upper-left entry) and the sample sizes in each of the three levels (\(n_i=4\), the remaining diagonal entries).
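If you are following along in R, here is a quick sketch that reproduces this matrix (assuming the wheat, corn, and soybean indicator vectors from the earlier sketch):

X = cbind(1, wheat, corn, soybean)   # intercept column plus the three indicators
t(X)%*%X                             # reproduces the 4x4 matrix above

The next step is to calculate the inverse of this matrix.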

At this point, it is so much easier to use technology to perform this calculation. However, when you do, you will get an error stating that the matrix is singular. This means two things:

    1. Its inverse does not exist.
    2. One column is a linear combination of the others.

Notice that the first column is the sum of the other three columns. Thus, there is redundant information, and the columns are not linearly independent. To prove this last point, use the coefficient vector \(a = \{1, -1, -1, -1\}\) in the definition of linear dependence, as the sketch below demonstrates.
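Here is a short R sketch of that check (again assuming the X matrix from the sketch above):

XtX = t(X)%*%X
XtX %*% c(1,-1,-1,-1)   # the zero vector: the columns are linearly dependent
det(XtX)                # 0 (up to floating-point rounding), so no inverse exists
# solve(XtX)            # uncommenting this line produces a singularity error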

    Note

From an information standpoint, if one column is a linear combination of the others, then that column is redundant. The model can be estimated without that information.

    This is one of the very few places in statistics where throwing away information helps. It is rather ironic that it helps solely in terms of the mathematics.

So, what do we do? We drop one of the redundant columns. The one we drop determines how we interpret the results. The next few sections look at dropping different columns of the \(\mathbf{X}\) matrix.

    Dropping the First Column

This is appropriate; the first column is the intercept column. Dropping it leads to this design matrix:

\begin{equation}
\mathbf{X} = \left[ \begin{matrix}
1 & 0 & 0 \\ 1 & 0 & 0 \\ 1 & 0 & 0 \\ 1 & 0 & 0 \\
0 & 1 & 0 \\ 0 & 1 & 0 \\ 0 & 1 & 0 \\ 0 & 1 & 0 \\
0 & 0 & 1 \\ 0 & 0 & 1 \\ 0 & 0 & 1 \\ 0 & 0 & 1 \\
\end{matrix} \right]
\end{equation}

This leads to

\begin{align}
\mathbf{X}^\prime\mathbf{X} &= \left[ \begin{matrix} 4&0&0 \\ 0&4&0\\ 0&0&4 \end{matrix}\right] \\[2em]
\left( \mathbf{X}^\prime\mathbf{X} \right)^{-1} &= \frac{1}{4}\left[ \begin{matrix} 1&0&0 \\ 0&1&0\\ 0&0&1 \end{matrix}\right] \\[2em]
\mathbf{X}^\prime\mathbf{Y} &= \left[\begin{matrix} 3383 \\ 2714 \\ 2249 \end{matrix}\right]
\end{align}

    Finally, this leads to

    \begin{align}
    \mathbf{b} &= \left( \mathbf{X}^\prime\mathbf{X} \right)^{-1} \mathbf{X}^\prime\mathbf{Y} \\[1em]
    &= \left[\begin{matrix} 845.75 \\ 678.50 \\ 562.25 \end{matrix}\right]
    \end{align}

We can interpret these results as follows: the average profit per acre is 845.75 for wheat; 678.50 for corn; and 562.25 for soybeans.

    This is called the "means model" because the returned values are the means in each group.
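As a quick sanity check, here is a sketch that computes the group means directly (the crop vector comes from the earlier sketch; the profit vector simply restates Table 1):

profit = c(722, 965, 940, 756, 763, 765,
    565, 621, 566, 658, 540, 485)

tapply(profit, crop, mean)   # Corn 678.50, Soybean 562.25, Wheat 845.75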

    Dropping the Second Column

This is also appropriate (the second column of the original design matrix is the wheat indicator, so wheat becomes the base category). When we drop it, the design matrix is

    \begin{equation}
    \mathbf{X} = \left[ \begin{matrix}
    1 & 0 & 0 \\ 1 & 0 & 0 \\ 1 & 0 & 0 \\ 1 & 0 & 0 \\
    1 & 1 & 0 \\ 1 & 1 & 0 \\ 1 & 1 & 0 \\ 1 & 1 & 0 \\
    1 & 0 & 1 \\ 1 & 0 & 1 \\ 1 & 0 & 1 \\ 1 & 0 & 1 \\
    \end{matrix} \right] \end{equation}

    Feel free to work through the calculation to obtain these estimates:

    \begin{align}
    \mathbf{b} &= \left( \mathbf{X}^\prime\mathbf{X} \right)^{-1} \mathbf{X}^\prime\mathbf{Y} \\[1em]
    &= \left[\begin{matrix} \phantom{-}845.75 \\ -167.25 \\ -283.50 \end{matrix}\right]
    \end{align}

The interpretation here is that the average profit for wheat (the base category, whose column was dropped) is \(845.75\). The effect of corn relative to wheat is \(-167.25\), and the effect of soybeans relative to wheat is \(-283.50\). In other words, the expected corn profit is \(167.25\) below the wheat profit, and the expected soybean profit is \(283.50\) below the wheat profit.
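As a quick check, adding each effect to the base mean recovers the group means from the means model:

\begin{align}
845.75 - 167.25 &= 678.50 \quad \text{(corn)} \\
845.75 - 283.50 &= 562.25 \quad \text{(soybeans)}
\end{align}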

Note that we dropped the wheat indicator, the first of the three data columns. Thus, the first estimate is the expected value for that dropped (base) level, and the other estimates are the effects of the remaining levels as compared to the base category (wheat). Because the estimates are the effects of the other levels as compared to the selected base level, this is called an "effects model" for wheat.

    Dropping the Third Column

    As you can probably guess, this is appropriate, as well. When doing so, the design matrix is

    \begin{equation}
    \mathbf{X} = \left[ \begin{matrix}
    1 & 1 & 0 \\ 1 & 1 & 0 \\ 1 & 1 & 0 \\ 1 & 1 & 0 \\
    1 & 0 & 0 \\ 1 & 0 & 0 \\ 1 & 0 & 0 \\ 1 & 0 & 0 \\
    1 & 0 & 1 \\ 1 & 0 & 1 \\ 1 & 0 & 1 \\ 1 & 0 & 1 \\
    \end{matrix} \right]
    \end{equation}

    Feel free to work through the calculation to obtain these estimates:

    \begin{equation}\mathbf{b} = \left[\begin{matrix} \phantom{-}678.50 \\ \phantom{-}167.25 \\ -116.25 \end{matrix}\right]\end{equation}

This interpretation is similar to the previous one. The mean of the base category (corn) is 678.50 (the first number). The effect of wheat relative to corn is 167.25, and the effect of soybeans relative to corn is \(-116.25\).

Corn is the base category because the dropped third column of the original design matrix was the corn indicator. Note that this is the effects model for corn: the estimates are the effects in relation to the base category.
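Again, the base mean plus each effect recovers the group means:

\begin{align}
678.50 + 167.25 &= 845.75 \quad \text{(wheat)} \\
678.50 - 116.25 &= 562.25 \quad \text{(soybeans)}
\end{align}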

    ───── ⋆⋅☆⋅⋆ ─────

If you look at these three sets of results, you will see a lot of commonalities. The one you choose depends on what you are trying to say about the relationship between the crop and the profit. Here, it is also very easy to move between the means model and the effects model.

    Note

We are only investigating expected values (averages) in this analysis. Should we also decide to include the uncertainties in our estimates (as we should), the two models are complementary: it is very difficult to move by hand between the standard errors in the means model and the standard errors in the effects model. It is so much easier to have the computer perform that computation for you.

    The Code

    For the record, here is the code I used for fitting the means model:

# Means model: one indicator column per crop, no intercept column
X = matrix( c(1,0,0, 1,0,0, 1,0,0, 1,0,0,
    0,1,0, 0,1,0, 0,1,0, 0,1,0,
    0,0,1, 0,0,1, 0,0,1, 0,0,1 ),
    ncol=3, byrow=TRUE)
Y = matrix( c(722, 965, 940, 756, 763, 765,
    565, 621, 566, 658, 540, 485) )

solve(t(X)%*%X)               # (X'X)^{-1}
t(X)%*%Y                      # X'Y

solve(t(X)%*%X) %*% t(X)%*%Y  # b = (X'X)^{-1} X'Y
    

    Here is the code I used for the first effects model, in which I dropped the second column:

# Effects model with wheat as the base: intercept, corn, and soybean columns
X = matrix( c(1,0,0, 1,0,0, 1,0,0, 1,0,0,
    1,1,0, 1,1,0, 1,1,0, 1,1,0,
    1,0,1, 1,0,1, 1,0,1, 1,0,1),
    ncol=3, byrow=TRUE)
Y = matrix( c(722, 965, 940, 756, 763, 765,
    565, 621, 566, 658, 540, 485) )

solve(t(X)%*%X)               # (X'X)^{-1}
t(X)%*%Y                      # X'Y

solve(t(X)%*%X) %*% t(X)%*%Y  # b = (X'X)^{-1} X'Y
    

    Note that the only change is in the line that defines the data matrix, \(\mathbf{X}\).

    Finally, here is the code I used when dropping the third column, the corn column.

# Effects model with corn as the base: intercept, wheat, and soybean columns
X = matrix( c(1,1,0, 1,1,0, 1,1,0, 1,1,0,
    1,0,0, 1,0,0, 1,0,0, 1,0,0,
    1,0,1, 1,0,1, 1,0,1, 1,0,1),
    ncol=3, byrow=TRUE)
Y = matrix( c(722, 965, 940, 756, 763, 765,
    565, 621, 566, 658, 540, 485) )

solve(t(X)%*%X)               # (X'X)^{-1}
t(X)%*%Y                      # X'Y

solve(t(X)%*%X) %*% t(X)%*%Y  # b = (X'X)^{-1} X'Y
    

    Again, the only change is in the line that defines the data matrix, \(\mathbf{X}\).

    Caution

While I did provide the code at the matrix-multiplication level, I did so only to show you the options when fitting data with a categorical independent variable. The goal of this section is to get you to think about what you can get out of the model.

A real statistician figures out the easiest way to get the computer to provide the information. We will see how in the future.
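For the impatient, here is a minimal preview sketch (mine, not the book's forthcoming treatment), using the crop and profit vectors from the earlier sketches. R's lm() function builds the indicator columns from a factor and drops one automatically:

d = data.frame(crop=factor(crop), profit=profit)

lm(profit ~ crop,     data=d)   # effects model; R picks the base level (alphabetically, corn)
lm(profit ~ crop - 1, data=d)   # means model: one coefficient per crop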


    This page titled 4.4: Multicollinearity and Categorical Independent Variables is shared under a CC BY-NC-SA 4.0 license and was authored, remixed, and/or curated by Ole Forsberg.
