15.1: Binary Dependent Variables

A dichotomous variable is one that takes one of two values: 1 or 0, True or False, Yes or No, success or failure. In research, such variables include the occurrence of terrorism, the election of a specific party to power, the existence of a fire, and the failure of a plane. In each of these cases, there are only two possible outcomes, which we will refer to as success and failure; this is the hallmark of dichotomous variables. Before Nelder and Wedderburn (1972) introduced the GLM framework, statisticians devised special-purpose models for binary dependent variables, each built around a different transformation.

They did so because the classical linear model, when applied to a binary response, invariably makes predictions outside the logical range, exhibits heteroskedasticity, and produces residuals that are not Normally distributed, all in conflict with the assumptions underlying OLS. To illustrate these issues, let us model the decision to purchase life insurance as a linear function of age and income, fit using OLS. The next example walks through these problems.

    Example \(\PageIndex{1}\): Life Insurance

    In Ruritania, the decision to buy life insurance is related to several variables, including age and income. The table below includes records of several individuals. Fit this data with a linear model using OLS:

    \[ \texttt{insurance} = \beta_0 + \beta_1 \texttt{age} + \beta_2 \texttt{income} \]

    Next, predict whether Václav will buy life insurance, given that his age is 65 and his income is $125,000. Finally, determine if the assumptions of ordinary least squares are violated with this model and data.

Individual   Insurance   Age   Income ($000)
         1           0    25              20
         2           0    30              30
         3           0    21              30
         4           0    35              25
         5           0    28              27
         6           1    80              90
         7           1    55              25
         8           1    40              60
         9           1    40              65
        10           1    25             125
Table \(\PageIndex{1}\): Insurance pseudo-data for predicting whether a person purchases life insurance based on the person's age and income.

    Solution.

Using our statistical program to fit the model with OLS, we obtain the following linear regression equation:

    \begin{equation}
    \texttt{insurance} = -0.4277 + 0.0130 \times \texttt{age} + 0.0088 \times \texttt{income}
    \end{equation}
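For readers who want to reproduce this fit, here is a minimal sketch in Python using NumPy's least-squares solver. The text does not name a statistical program, so the library and variable names here are illustrative assumptions.

```python
# A minimal sketch of the OLS fit; NumPy is an assumed tool,
# and the variable names are illustrative, not from the text.
import numpy as np

age       = np.array([25, 30, 21, 35, 28, 80, 55, 40, 40, 25], dtype=float)
income    = np.array([20, 30, 30, 25, 27, 90, 25, 60, 65, 125], dtype=float)  # $000
insurance = np.array([ 0,  0,  0,  0,  0,  1,  1,  1,  1,   1], dtype=float)

# Design matrix with an intercept column
X = np.column_stack([np.ones_like(age), age, income])

# Least-squares estimates of (beta0, beta1, beta2)
beta, *_ = np.linalg.lstsq(X, insurance, rcond=None)
print(beta)                                  # approx. [-0.4277, 0.0130, 0.0088]

# Prediction for Vaclav (age 65, income $125,000)
print(np.array([1.0, 65.0, 125.0]) @ beta)   # approx. 1.512
```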

Substituting Václav's age and income into this equation, we obtain the prediction:

\begin{align*}
\texttt{insurance} &= -0.4277 + 0.0130 \times \texttt{age} + 0.0088 \times \texttt{income} \\
&= -0.4277 + 0.0130 \times 65 + 0.0088 \times 125 \\
&= +1.5121
\end{align*}

(The final value is calculated using the unrounded coefficient estimates; carrying the rounded coefficients shown above through the arithmetic gives 1.5173.)

    What does this value of 1.5121 actually mean?

    🤔 I don't know, either.

Figure \(\PageIndex{1}\): Scatter plot of the residuals against the values of the dependent variable. Note the different variances for the two groups; the linear model is therefore not appropriate in this case.

Next, to check the assumptions of OLS, let us examine just one of them: homoskedasticity (constant variance). To do this, we plot the residuals against the values of the dependent variable. Figure \(\PageIndex{1}\) above shows that the variation in the residuals differs markedly between the two groups in this model, a violation of our assumptions. In fact, the variance of the residuals for those who bought insurance is about 24 times that for those who did not (\(0.1325\) vs. \(0.0055\)). This is non-constant variance. Performing the usual F-test for comparing two variances, we also see that this difference is statistically significant (\(F=0.0416, \nu_n=4, \nu_d=4, p=0.0093\)). Therefore, we conclude that our model is not appropriate for this data.
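These diagnostics can be reproduced with a short continuation of the Python sketch above; scipy.stats is an assumed tool here. Note that the F statistic places the smaller variance in the numerator, so the two-sided p-value doubles the lower tail.

```python
# Residual diagnostics, reusing X, beta, and insurance from the sketch above.
from scipy import stats

residuals = insurance - X @ beta

# Sample variances of the residuals within each group
var_buy   = residuals[insurance == 1].var(ddof=1)   # approx. 0.1325
var_nobuy = residuals[insurance == 0].var(ddof=1)   # approx. 0.0055
print(var_buy / var_nobuy)                          # approx. 24

# F-test for equal variances (smaller variance in the numerator)
F = var_nobuy / var_buy                             # approx. 0.0416
p = 2 * stats.f.cdf(F, 4, 4)                        # two-sided: approx. 0.0093
print(F, p)
```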

There were two problems with this analysis. First, the model predicted an outcome that did not make sense. Second, the model violated at least one assumption of ordinary least squares (in fact, all three of the issues listed at the start of this section are present). To address the first problem, we could adopt a decision rule: any predicted value above the threshold \(\tau=0.500\) is treated as a "Buy" prediction, and any predicted value below \(\tau=0.500\) is treated as a "Not Buy" prediction.
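A tiny sketch of this decision rule, again reusing X and beta from the earlier snippets:

```python
# Classify each fitted value with the tau = 0.500 decision rule,
# reusing X and beta from the earlier snippets.
tau = 0.500
predicted = (X @ beta > tau).astype(int)
print(predicted)    # [0 0 0 0 0 1 1 1 1 1], matching the observed column here
```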

The second problem is more serious and not so easily solved, especially if we care about the uncertainty in our estimates (e.g., if we wish to construct confidence intervals). One may consider transforming the dependent variable to make it unbounded. The logit transformation, \( \log\left( \frac{y}{1-y} \right) \), would be a natural choice here; however, every observed value of the dependent variable is either 1 or 0, so the transformed values would all be either \(+\infty\) or \(-\infty\). Furthermore, this transformation would not remove the relationship between the residuals and the (transformed) dependent variable.
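A short, self-contained check makes the difficulty concrete (NumPy again being an assumed tool):

```python
# Logit of the raw 0/1 responses: the transform diverges at both endpoints.
import numpy as np

def logit(p):
    return np.log(p / (1 - p))   # log-odds

with np.errstate(divide="ignore"):       # silence the divide-by-zero warnings
    print(logit(np.array([0.0, 1.0])))   # [-inf  inf]
```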

    Caution

There is a tendency to feel disappointed when our model violates assumptions, as it does here. However, instead of seeing a relationship between the residuals and the dependent variable as a problem, we should recognize that such a relationship tells us there is more information in the data than we are currently modeling.

As interested researchers, we want to use that information to get more from our data. Thus, violations are not steps backwards; they are a path toward a deeper understanding of the data-generating process.


    This page titled 15.1: Binary Dependent Variables is shared under a CC BY-NC-SA 4.0 license and was authored, remixed, and/or curated by Ole Forsberg.
