
6.4: Multicollinearity


Recall that there was just one strict mathematical requirement for ordinary least squares estimation: no independent variable can be a linear combination of the others. If one is, a condition known as perfect (or super) multicollinearity, the underlying matrix calculations become impossible, and the OLS estimates simply cannot be computed.

    However, perfect multicollinearity is a relatively rare and extreme case. A far more common and insidious challenge arises when two or more independent variables are highly, but not perfectly, correlated. This phenomenon, often termed high multicollinearity or simply multicollinearity, allows the mathematics to proceed but introduces significant problems that can distort and undermine the regression analysis.

These problems manifest in two primary domains: the computational and the interpretative. From a computer science (numerical) perspective, a nearly singular \(\mathbf{X}^\prime\mathbf{X}\) matrix makes the calculations unstable and highly sensitive to minor changes in the model or data, to the point that the computer may refuse to perform them at all. From the standpoint of experimental logic, high multicollinearity inflates the standard errors of the coefficient estimates, making it difficult, if not impossible, to isolate the unique effect of each correlated variable on the dependent variable, thereby crippling the goal of causal inference. The following sections delve into these dual consequences, exploring both the diagnostic symptoms and the practical implications for researchers.


    The CS of Multicollinearity

Note that we are using a computer to perform our calculations. Because a computer cannot store most numbers in memory exactly, rounding errors creep into the calculations. This happens whenever a number lacks a finite binary representation, but it matters most when a value is close to zero, closer than the "machine epsilon." When this is the case, the number rounds to zero. In other words, if the value is between \(-\epsilon\) and \(+\epsilon\), the computer treats it as a zero.

On 64-bit computers, the value of epsilon is approximately \(2.220446 \times 10^{-16} = 0.000\ 000\ 000\ 000\ 000\ 222\ 044\ 6\). If the determinant of the matrix is smaller than this value in magnitude, the computer will claim the matrix is singular.
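One can check this threshold directly in R. The following quick sketch (assuming a standard 64-bit build of R, where numbers are stored as IEEE 754 doubles) shows that a difference smaller than epsilon is simply rounded away:

    .Machine$double.eps                  # 2.220446e-16, the machine epsilon
    (1 + .Machine$double.eps)   == 1     # FALSE: the difference survives
    (1 + .Machine$double.eps/2) == 1     # TRUE: the difference rounds away

Now consider asking R to invert a matrix whose determinant falls below this threshold: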

solve( matrix( c(1,0, 0,2.220446e-17), ncol=2) )   # determinant is 2.220446e-17
    

Mathematically, we can calculate the determinant to be \(2.220446 \times 10^{-17}\), which is not zero, so the matrix is actually not singular. However, asking the computer to calculate the inverse returns the following:

    Error in solve.default(matrix(c(1, 0, 0, 2.220446e-17), ncol = 2)) :
      system is computationally singular: reciprocal condition number = 2.22045e-17
    
    

    The lesson to take beyond this specific case is that the matrix does not have to be singular for the computer to tell you it is. All that is needed is that the determinant of the \(\mathbf{X}^\prime\mathbf{X}\) matrix be sufficiently close to zero.
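Strictly speaking, R's solve function bases its complaint on the reciprocal condition number reported in the error message, not on the determinant itself; for this particular diagonal matrix, the two happen to coincide. Both quantities can be inspected directly with base R's det and rcond functions:

    m = matrix( c(1,0, 0,2.220446e-17), ncol=2 )
    det(m)      # 2.220446e-17: tiny, but not zero
    rcond(m)    # also about 2.2e-17, below solve's default tolerance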

    I term this a Computer Science result because it is based on the vagaries of computers instead of the vagaries of mathematics. The next section looks at what multicollinearity means in terms of the logic of experimental design and interpretation.

    The Logic of Multicollinearity

The previous section examined the effects of multicollinearity on calculations, specifically on the inverse of the \(\mathbf{X}^{\prime}\mathbf{X}\) matrix. If its determinant is sufficiently close to zero, then the computer will treat it as a zero, meaning the matrix will be effectively singular. However, this is only a CS problem. A good statistician will pay attention to the conditions that cause multicollinearity... even minor amounts of it.

Recall that multicollinearity occurs when one column of the data matrix is a linear combination of the other columns; that is, it happens when one variable adds no new information beyond what the other variables already contain. For instance, if one variable is a person's height in inches and another is the person's height in centimeters, then multicollinearity exists. The first variable offers no information that is not already contained in the second.
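A short simulation makes this concrete. In the following sketch (the weight model and all numbers are invented purely for illustration), R detects the exact linear dependence between the two height columns and reports an NA coefficient for the redundant one:

    set.seed(1)
    height_in = rnorm(20, mean=66, sd=4)   # height in inches
    height_cm = 2.54 * height_in           # the same information, rescaled
    weight    = 30 + 2*height_in + rnorm(20, sd=5)
    
    coef( lm(weight ~ height_in + height_cm) )   # height_cm comes back NA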

    A statistician cares about the independent variables in the model. They are designed to explain the dependent variable. Each independent variable is supposed to be independent of the others, because each is designed to explain a different aspect of the response variable. If two explanatory variables are highly correlated with each other, then it will be logically impossible to determine which of the two is causing the change in the dependent variable:

    1. Is it the logarithm of a person's height in inches or the logarithm of the square of a person's height in inches that can be used to estimate weight?
    2. Is it average daily temperature or ice cream consumption that can be used to estimate the violent crime rate?
    3. Is it educational attainment or parental income that can be used to estimate a person's future income?

These three exemplify the issue with multicollinearity in practice. The first example produces mathematical (a.k.a. "super-") multicollinearity because the logarithm of the square of a variable is exactly twice the logarithm of the variable. The second column is exactly twice the first.
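A one-line check confirms the identity for any vector of positive heights:

    h = c(60, 64, 68, 72)             # heights in inches
    all.equal( log(h^2), 2*log(h) )   # TRUE: the columns are exact multiples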

The second example does not exemplify mathematical multicollinearity: no function of average daily temperature gives ice cream consumption exactly. However, there is a very strong linear relationship between the two. Because of this, one cannot statistically tell whether it is the temperature or the ice cream that is affecting violent crime. That being said, unless the dairy farmers are attacking the very foundation of Ruritanian society, substantive scientific theory suggests that the temperature, not the frigid dairy, is the factor affecting the crime rate.

The third example is more subtle. There is also a strong relationship between a person's educational attainment and the parents' income (at least in Ruritania). Because of this, we are unable to statistically determine whether it is the person's educational attainment or the income of the person's parents that affects the person's future income. Social science theories support both explanations, and the statistics for each explanatory variable do as well. What can we do in this case?

    Indications of Multicollinearity

    To see some statistical indications of multicollinearity, try the following code.

set.seed(30)

b0 = 3    # population intercept
b1 = 2    # population slope on x1
b2 = 3    # population slope on x2

x1 = seq(0,10,length=8)   # eight evenly spaced values from 0 to 10
x2 = c(1,2,3,4,6,7,8,9)   # deliberately close to a linear function of x1
e  = rnorm(8)             # random noise

y = b0 + b1*x1 + b2*x2 + e   # the dependent variable

mod1 = lm(y~x1)       # y explained by x1 alone
mod2 = lm(y~x2)       # y explained by x2 alone
modA = lm(y~x1+x2)    # y explained by both
    

    Clearly, from how this experiment is set up, we know the following:

    1. There is a strong relationship between x1 and y.
    2. There is a strong relationship between x2 and y.
    3. There is a strong relationship between x1 and x2.
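The third statement is easy to verify directly:

    cor(x1, x2)   # about 0.996, an extremely strong linear relationship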

    Running summary(mod1) shows us that the first statement is true, if we ignore the effect of x2. Similarly, running summary(mod2) shows us that the second statement is true, if we ignore the effect of x1. Combining the two explanatory variables in modA is confusing if we do not think about multicollinearity, because summary(modA) gives the following results:

    Coefficients:
                Estimate Std. Error t value Pr(>|t|)
    (Intercept)    2.814      1.898   1.483    0.198
    x1             2.145      1.681   1.276    0.258
    x2             2.861      2.009   1.424    0.214
    
    Residual standard error: 1.386 on 5 degrees of freedom
    Multiple R-squared:  0.9946,    Adjusted R-squared:  0.9924
    F-statistic: 458.7 on 2 and 5 DF,  p-value: 2.164e-06
    

Note that this model appears to show that neither independent variable is valuable in modeling the dependent variable, even though we know both are. Since a scientist will usually put all of the explanatory variables in the model, this is a lesson to pay attention to the relationships among the independent variables.

    Caution

It is also important to look at all of the regression output. Neither of the two independent variables has a small p-value, and yet the \(R^2\) value is very close to 1. This inconsistency also suggests that something is wrong.

    A Test of Multicollinearity

So, how do we statistically detect this type of multicollinearity? A simple correlation test will not suffice when we have more than two independent variables, because correlation measures the relationship between only two variables at a time.
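To see why, consider this small sketch (the variables are invented for illustration): two unrelated variables and their sum. None of the pairwise correlations looks alarming, yet the third column is an exact linear combination of the first two:

    set.seed(30)
    z1 = rnorm(100)
    z2 = rnorm(100)
    z3 = z1 + z2               # an exact linear combination of z1 and z2
    
    cor( cbind(z1, z2, z3) )   # no pairwise correlation exceeds about 0.7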

The answer comes from the cause of the multicollinearity: if multicollinearity is present, then one independent variable is (nearly) linearly related to the others, and a linear regression can detect this. Technically, one such regression is needed for each independent variable. To make this process easy, the car package provides the vif function, which calculates the "variance inflation factor" for each independent variable. The variance inflation factor for independent variable \(i\) is defined as

    \begin{equation}
    \mathrm{VIF}_i \stackrel{\text{def}}{=} \frac{1}{1-R_i^2}
    \end{equation}

Here, \(R_i^2\) is the R-squared value for the model regressing independent variable \(i\) on the other independent variables.

    In our example above, one can calculate the VIF by hand:

summary( lm(x1~x2) )   # reports a Multiple R-squared of 0.9921
1/(1-0.9921)           # so the VIF is about 126.6
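The vif function returns the same number directly, provided the car package is installed:

    library(car)   # install.packages("car") if it is not yet installed
    vif(modA)      # about 126.6 for both x1 and x2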
    

The higher the value of \(R_i^2\), the more independent variable \(i\) can be explained by the other independent variables. In other words, the higher the VIF, the less new information that variable adds to the model.

    As in much of the field, this description leads to the question

    How high is too high?

The "rule of thumb" depends on the discipline. Typical cut-offs are 5, 8, and 10; a cut-off of 10, for instance, corresponds to \(R_i^2 = 0.90\). If the VIF for any of the variables is greater than the cut-off, then there is "too much multicollinearity in the model."

    Fixing Multicollinearity

    So, let's say that you have detected multicollinearity in your model. What can you do?

    The presence of multicollinearity means that one of your independent variables is highly correlated with a linear function of the others. It adds little to the understanding of the response variable. However, is it variable \(i\) that should be examined or the others?

    From a statistical standpoint, the model is not too helpful. Multiple variables are trying to explain the same aspects of the dependent variable. In other words,

    1. Is it educational attainment or parent's income that affects the respondent's income?
    2. Is it race or poverty that affects violent crime?
    3. Is it intelligence or birth position (oldest, youngest, middle, etc. child) that affects success?
    4. Is it the ranch or the cattle feed that affects the weight?
    5. Is it race or income or religion or parental income or home state that affects voting behavior?
    6. Is it Nordic ancestry or blood type or Neanderthal genes that affect the severity of CoViD-19?
    Figure \(\PageIndex{1}\): Diagram to illustrate multicollinearity. The left circle is the effect of the first independent variable on the dependent variable; the right, the second. The purple overlap represents the similarity between the two variables. The rectangle represents everything that affects the dependent variable, which means the colored portion represents everything in the model that affects the dependent variable.

    In each of these, the explanatory variables are highly correlated and have been used to model the response variable. Because of the correlations, conclusions about what really affects the dependent variable are unclear. Statistically, the answer is "Yes, each does."

    However, since each of the independent variables above are correlated, their effects overlap. This is represented as the purple overlapping area in Figure \(\PageIndex{1}\). While each variable has an effect on the dependent variable (red and blue), that effect is also split with the other variable (or variables). As such, the key is trying to separate the three sections to determine whether it is the red, the blue, or the purple that is affecting the response variable.

Unfortunately, this is beyond the scope of this course. For those who are interested, you may want to investigate factor analysis (FA) and principal component analysis (PCA). These are two methods for dealing with that overlap (the purple area). The first focuses on estimating the purple area; the second, on creating two new variables that combine the two independent variables into their independent components (the parts that are purely red and blue). The advantage is that the independent variables become independent (\(\mathrm{VIF}=1\)). The disadvantage is that the newly created variables are only indirectly related to the original explanatory variables; thus, interpretation is made more complicated.
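As a small taste of the PCA approach, the following sketch (using base R's prcomp function on the x1 and x2 from the earlier example) replaces the two correlated variables with their principal components. Because the components are constructed to be uncorrelated, the multicollinearity vanishes, but the coefficients now describe components rather than the original variables:

    pc    = prcomp( cbind(x1, x2), scale.=TRUE )   # the principal components
    modPC = lm( y ~ pc$x[,1] + pc$x[,2] )          # regress on the components
    
    cor( pc$x[,1], pc$x[,2] )   # essentially zero, so each VIF equals 1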

    Note

    By the way, Figure \(\PageIndex{1}\) illustrates several things about the model. Everything that affects the dependent variable is represented by the rectangle, whereas the colored part represents only what the model explains. Thus, one may think of the ratio of the colored area to the total area as being the \(R^2\) value.

    Question:

    Given that the size of the circles is fixed, how would you arrange them to cover the most area of the rectangle? What would that mean in terms of the variables?


    This page titled 6.4: Multicollinearity is shared under a CC BY-NC-SA 4.0 license and was authored, remixed, and/or curated by Ole Forsberg.
