9.5: Introduction to Logistic Regression


In this section we introduce logistic regression as a tool for building models when there is a categorical response variable with two levels, e.g. yes and no. Logistic regression is a type of generalized linear model (GLM) for response variables where regular multiple regression does not work very well.1 In particular, the response variable in these settings often takes a form where residuals look completely different from the normal distribution.

    GLMs can be thought of as a two-stage modeling approach. We first model the response variable using a probability distribution, such as the binomial or Poisson distribution. Second, we model the parameter of the distribution using a collection of predictors and a special form of multiple regression. Ultimately, the application of a GLM will feel very similar to multiple regression, even if some of the details are different.

    Resume data

    We will consider experiment data from a study that sought to understand the effect of race and sex on job application callback rates; details of the study and a link to the data set may be found in Appendix [ch_regr_mult_and_log_data]. To evaluate which factors were important, job postings were identified in Boston and Chicago for the study, and researchers created many fake resumes to send off to these jobs to see which would elicit a callback. The researchers enumerated important characteristics, such as years of experience and education details, and they used these characteristics to randomly generate the resumes. Finally, they randomly assigned a name to each resume, where the name would imply the applicant’s sex and race.

    The first names that were used and randomly assigned in this experiment were selected so that they would predominantly be recognized as belonging to Black or White individuals; other races were not considered in this study. While no name would definitively be inferred as pertaining to a Black individual or to a White individual, the researchers conducted a survey to check for racial association of the names; names that did not pass this survey check were excluded from usage in the experiment. You can find the full set of names that did pass the survey test and were ultimately used in the study in Figure [resumeFirstName]. For example, Lakisha was a name that their survey indicated would be interpreted as a Black woman, while Greg was a name that would generally be interpreted to be associated with a White male.

firstname  race   sex       firstname  race   sex       firstname  race   sex
Aisha      black  female    Hakim      black  male      Laurie     white  female
Allison    white  female    Jamal      black  male      Leroy      black  male
Anne       white  female    Jay        white  male      Matthew    white  male
Brad       white  male      Jermaine   black  male      Meredith   white  female
Brendan    white  male      Jill       white  female    Neil       white  male
Brett      white  male      Kareem     black  male      Rasheed    black  male
Carrie     white  female    Keisha     black  female    Sarah      white  female
Darnell    black  male      Kenya      black  female    Tamika     black  female
Ebony      black  female    Kristen    white  female    Tanisha    black  female
Emily      white  female    Lakisha    black  female    Todd       white  male
Geoffrey   white  male      Latonya    black  female    Tremayne   black  male
Greg       white  male      Latoya     black  female    Tyrone     black  male

The response variable of interest is whether or not there was a callback from the employer for the applicant, and there were 8 attributes that were randomly assigned that we’ll consider, with special interest in the race and sex variables. Race and sex are protected classes in the United States, meaning they are not legally permitted factors for hiring or employment decisions. The full set of attributes considered is provided in Figure [resumeVariables].

variable           description
received_callback  Specifies whether the employer called the applicant following submission of the application for the job.
job_city           City where the job was located: Boston or Chicago.
college_degree     An indicator for whether the resume listed a college degree.
years_experience   Number of years of experience listed on the resume.
honors             Indicator for the resume listing some sort of honors, e.g. employee of the month.
military           Indicator for if the resume listed any military experience.
email_address      Indicator for if the resume listed an email address for the applicant.
race               Race of the applicant, implied by their first name listed on the resume.
sex                Sex of the applicant (limited to male and female in this study), implied by the first name listed on the resume.

    All of the attributes listed on each resume were randomly assigned. This means that no attributes that might be favorable or detrimental to employment would favor one demographic over another on these resumes. Importantly, due to the experimental nature of this study, we can infer causation between these variables and the callback rate, if the variable is statistically significant. Our analysis will allow us to compare the practical importance of each of the variables relative to each other.

    Modeling the probability of an event

    Logistic regression is a generalized linear model where the outcome is a two-level categorical variable. The outcome, \(Y_i\), takes the value 1 (in our application, this represents a callback for the resume) with probability \(p_i\) and the value 0 with probability \(1 - p_i\). Because each observation has a slightly different context, e.g. different education level or a different number of years of experience, the probability \(p_i\) will differ for each observation. Ultimately, it is this probability that we model in relation to the predictor variables: we will examine which resume characteristics correspond to higher or lower callback rates.

Notation for a logistic regression model. The outcome variable for a GLM is denoted by \(Y_i\), where the index \(i\) is used to represent observation \(i\). In the resume application, \(Y_i\) will be used to represent whether resume \(i\) received a callback (\(Y_i=1\)) or not (\(Y_i=0\)).

    The predictor variables are represented as follows: \(x_{1,i}\) is the value of variable 1 for observation \(i\), \(x_{2,i}\) is the value of variable 2 for observation \(i\), and so on.

    The logistic regression model relates the probability a resume would receive a callback (\(p_i\)) to the predictors \(x_{1,i}\), \(x_{2,i}\), ..., \(x_{k,i}\) through a framework much like that of multiple regression:

\[\begin{aligned} \text{transformation}(p_{i}) = \beta_0 + \beta_1x_{1,i} + \beta_2 x_{2,i} + \cdots + \beta_k x_{k,i} \label{linkTransformationEquation}\end{aligned}\]

We want to choose a transformation in the equation that makes practical and mathematical sense. For example, we want a transformation that makes the range of possibilities on the left hand side of the equation equal to the range of possibilities for the right hand side; if there were no transformation for this equation, the left hand side could only take values between 0 and 1, but the right hand side could take values outside of this range. A common transformation for \(p_i\) is the logit transformation, which may be written as

\[\begin{aligned} \text{logit}(p_i) = \log_{e}\left( \frac{p_i}{1-p_i} \right)\end{aligned}\]

    The logit transformation is shown in Figure [logitTransformationFigureHoriz]. Below, we rewrite the equation relating \(Y_i\) to its predictors using the logit transformation of \(p_i\):

    \[\begin{aligned} \log_{e}\left( \frac{p_i}{1-p_i} \right) = \beta_0 + \beta_1 x_{1,i} + \beta_2 x_{2,i} + \cdots + \beta_k x_{k,i}\end{aligned}\]

In our resume example, there are 8 predictor variables, so \(k = 8\). While the precise choice of the logit function isn’t intuitive, it is based on theory that underpins generalized linear models, which is beyond the scope of this book. Fortunately, once we fit a model using software, it will start to feel like we’re back in the multiple regression context, even if the interpretation of the coefficients is more complex.
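To make the transformation concrete, here is a minimal Python sketch (the function name logit is our own helper, not a library function) showing how the logit maps probabilities in \((0, 1)\) onto the whole real line:

    import math

    def logit(p):
        """Map a probability in (0, 1) to the logit (log-odds) scale."""
        return math.log(p / (1 - p))

    for p in (0.05, 0.25, 0.50, 0.75, 0.95):
        print(p, round(logit(p), 2))
    # 0.05 -> -2.94, 0.25 -> -1.1, 0.5 -> 0.0, 0.75 -> 1.1, 0.95 -> 2.94

Note the symmetry in the output: \(\text{logit}(1-p) = -\text{logit}(p)\), with probability 0.5 mapping to 0.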

We start by fitting a model with a single predictor: honors. This variable indicates whether the applicant had any type of honors listed on their resume, such as employee of the month. The following logistic regression model was fit using statistical software:

\[\begin{aligned} \log_e \left( \frac{p_i}{1-p_i} \right) = -2.4998 + 0.8668 \times \text{honors} \end{aligned}\]

(a) If a resume is randomly selected from the study and it does not have any honors listed, what is the probability that it resulted in a callback?

    (b) What would the probability be if the resume did list some honors?

[logisticExampleWithHonors] (a) If a randomly chosen resume from those sent out is considered, and it does not list honors, then honors takes value 0 and the right side of the model equation equals -2.4998. Solving for \(p_i\): \(\frac{e^{-2.4998}}{1 + e^{-2.4998}} = 0.076\). Just as we labeled a fitted value of \(y_i\) with a “hat” in single-variable and multiple regression, we do the same for this probability: \(\hat{p}_i = 0.076\).

(b) If the resume had listed some honors, then the right side of the model equation is \(-2.4998 + 0.8668 \times 1 = -1.6330\), which corresponds to a probability \(\hat{p}_i = 0.163\).

    Notice that we could examine -2.4998 and -1.6330 in Figure [logitTransformationFigureHoriz] to estimate the probability before formally calculating the value.

To convert from values on the logistic regression scale to probabilities (e.g. -2.4998 and -1.6330 in Example [logisticExampleWithHonors]), use the following formula, which is the result of solving for \(p_i\) in the regression model:

\[\begin{aligned} p_i = \frac{e^{\beta_0 + \beta_1 x_{1,i} + \cdots + \beta_k x_{k,i}}}{1 + e^{\beta_0 + \beta_1 x_{1,i} + \cdots + \beta_k x_{k,i}}}\end{aligned}\]

    As with most applied data problems, we substitute the point estimates for the parameters (the \(\beta_i\)) so that we can make use of this formula. In Example [logisticExampleWithHonors], the probabilities were calculated as

\[\begin{aligned} &\frac{e^{-2.4998}}{1 + e^{-2.4998}} = 0.076 && \frac{e^{-2.4998 + 0.8668}}{1 + e^{-2.4998 + 0.8668}} = 0.163\end{aligned}\]
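As a quick check of this arithmetic, here is a minimal Python sketch (inv_logit is our own helper, not a library function):

    import math

    def inv_logit(eta):
        """Solve the model equation for the probability p_i."""
        return math.exp(eta) / (1 + math.exp(eta))

    b0, b1 = -2.4998, 0.8668      # intercept and honors coefficient
    print(inv_logit(b0))          # no honors listed: about 0.076
    print(inv_logit(b0 + b1))     # honors listed:    about 0.163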

    While knowing whether a resume listed honors provides some signal when predicting whether or not the employer would call, we would like to account for many different variables at once to understand how each of the different resume characteristics affected the chance of a callback.

    Building the logistic model with many variables

    We used statistical software to fit the logistic regression model with all 8 predictors described in Figure [resumeVariables]. Like multiple regression, the result may be presented in a summary table, which is shown in Figure [resumeLogisticModelResults]. The structure of this table is almost identical to that of multiple regression; the only notable difference is that the p-values are calculated using the normal distribution rather than the \(t\)-distribution.

             
                  Estimate  Std. Error  z value  Pr(>|z|)
(Intercept)       -2.6632   0.1820      -14.64   <0.0001
job_city          -0.4403   0.1142      -3.85    0.0001
college_degree    -0.0666   0.1211      -0.55    0.5821
years_experience  0.0200    0.0102      1.96     0.0503
honors            0.7694    0.1858      4.14     <0.0001
military          -0.3422   0.2157      -1.59    0.1127
email_address     0.2183    0.1133      1.93     0.0541
race              0.4424    0.1080      4.10     <0.0001
sex               -0.1818   0.1376      -1.32    0.1863

Just like multiple regression, we could trim some variables from the model. Here we’ll use a statistic called the Akaike information criterion (AIC), which is an analog to how we used adjusted R-squared in multiple regression, and we look for models with a lower AIC through a backward elimination strategy. After using this criterion, the college_degree variable is eliminated, giving the smaller model summarized in Figure [resumeLogisticReducedModel], which is what we’ll rely on for the remainder of this section.

             
                  Estimate  Std. Error  z value  Pr(>|z|)
(Intercept)       -2.7162   0.1551      -17.51   <0.0001
job_city          -0.4364   0.1141      -3.83    0.0001
years_experience  0.0206    0.0102      2.02     0.0430
honors            0.7634    0.1852      4.12     <0.0001
military          -0.3443   0.2157      -1.60    0.1105
email_address     0.2221    0.1130      1.97     0.0494
race              0.4429    0.1080      4.10     <0.0001
sex               -0.1959   0.1352      -1.45    0.1473
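For readers who want to reproduce this kind of fit, the sketch below shows one way to do it in Python with the statsmodels package. The file name resume.csv is a hypothetical placeholder, and the column names are assumptions based on Figure [resumeVariables]; in practice the backward elimination step would loop over candidate variables and compare AICs rather than dropping college_degree directly.

    import pandas as pd
    import statsmodels.formula.api as smf

    # Hypothetical file name; see the appendix for the actual data source.
    resume = pd.read_csv("resume.csv")

    # Full model with all 8 predictors. Two-level categorical predictors such
    # as race and sex are expanded into single indicator coefficients.
    full = smf.logit(
        "received_callback ~ job_city + college_degree + years_experience"
        " + honors + military + email_address + race + sex",
        data=resume,
    ).fit()
    print(full.summary())
    print("AIC (full):", full.aic)

    # Reduced model after backward elimination drops college_degree.
    reduced = smf.logit(
        "received_callback ~ job_city + years_experience + honors"
        " + military + email_address + race + sex",
        data=resume,
    ).fit()
    print("AIC (reduced):", reduced.aic)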

The race variable took only two levels: black and white. Based on the model results, was race a meaningful factor in whether a prospective employer would call back? We see that the p-value for this coefficient is very small (very nearly zero), which implies that race played a statistically significant role in whether a candidate received a callback. Additionally, we see that the coefficient shown corresponds to the white level of race, and it is positive. This positive coefficient reflects a positive gain in callback rate for resumes where the candidate’s first name implied they were White. The data provide very strong evidence of racism by prospective employers that favors resumes where the first name is typically interpreted to be White.

The coefficient of \(\text{race}_{\text{white}}\) in the full model in Figure [resumeLogisticModelResults] is nearly identical to that in the model shown in Figure [resumeLogisticReducedModel]. The predictors in this experiment were thoughtfully laid out so that the coefficient estimates would typically not be much influenced by which other predictors were in the model, which aligned with the motivation of the study to tease out which effects were important to getting a callback. In most observational data, it’s common for point estimates to change a little, and sometimes a lot, depending on which other variables are included in the model.

    Use the model summarized in Figure [resumeLogisticReducedModel] to estimate the probability of receiving a callback for a job in Chicago where the candidate lists 14 years experience, no honors, no military experience, includes an email address, and has a first name that implies they are a White male. [exampleForResumeAndWhiteQuantified] We can start by writing out the equation using the coefficients from the model, then we can add in the corresponding values of each variable for this individual:

\[\begin{aligned} &\log_e \left(\frac{p}{1 - p}\right) \\ &\quad= -2.7162 - 0.4364 \times \text{job\_city}_{\text{Chicago}} + 0.0206 \times \text{years\_experience} + 0.7634 \times \text{honors} \\ &\quad\qquad - 0.3443 \times \text{military} + 0.2221 \times \text{email\_address} + 0.4429 \times \text{race}_{\text{white}} - 0.1959 \times \text{sex}_{\text{male}} \\ &\quad= -2.7162 - 0.4364 \times 1 + 0.0206 \times 14 + 0.7634 \times 0 \\ &\quad\qquad - 0.3443 \times 0 + 0.2221 \times 1 + 0.4429 \times 1 - 0.1959 \times 1 \\ &\quad= -2.3955 \end{aligned}\]

    We can now back-solve for \(p\): the chance such an individual will receive a callback is about 8.35%.

Compute the probability of a callback for an individual with a name commonly inferred to be from a Black male but who otherwise has the same characteristics as the one described in Example [exampleForResumeAndWhiteQuantified]. We can complete the same steps for an individual with the same characteristics who is Black, where the only difference in the calculation is that the indicator variable \(\text{race}_{\text{white}}\) will take a value of 0. Doing so yields a probability of 0.0553. Let’s compare the results with those of Example [exampleForResumeAndWhiteQuantified].

    In practical terms, an individual perceived as White based on their first name would need to apply to \(\frac{1}{0.0835} \approx 12\) jobs on average to receive a callback, while an individual perceived as Black based on their first name would need to apply to \(\frac{1}{0.0553} \approx 18\) jobs on average to receive a callback. That is, applicants who are perceived as Black need to apply to 50% more employers to receive a callback than someone who is perceived as White based on their first name for jobs like those in the study.
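The two probabilities, and the expected number of applications per callback, can be verified with a few lines of Python (coefficient values taken from Figure [resumeLogisticReducedModel]; inv_logit is the same helper sketched earlier):

    import math

    def inv_logit(eta):
        return math.exp(eta) / (1 + math.exp(eta))

    eta_white = -2.3955              # linear predictor from the example above
    eta_black = eta_white - 0.4429   # same resume with race_white set to 0

    p_white, p_black = inv_logit(eta_white), inv_logit(eta_black)
    print(p_white, p_black)          # about 0.0835 and 0.0553

    # Expected number of applications per callback is 1/p.
    print(1 / p_white, 1 / p_black)  # about 12 and 18 applications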

    What we’ve quantified in this section is alarming and disturbing. However, one aspect that makes this racism so difficult to address is that the experiment, as well-designed as it is, cannot send us much signal about which employers are discriminating. It is only possible to say that discrimination is happening, even if we cannot say which particular callbacks – or non-callbacks – represent discrimination. Finding strong evidence of racism for individual cases is a persistent challenge in enforcing anti-discrimination laws.

    Diagnostics for the callback rate model

Logistic regression conditions. There are two key conditions for fitting a logistic regression model:

    1. Each outcome \(Y_i\) is independent of the other outcomes.
    2. Each predictor \(x_i\) is linearly related to logit\((p_i)\) if all other predictors are held constant.

    The first logistic regression model condition – independence of the outcomes – is reasonable for the experiment since characteristics of resumes were randomly assigned to the resumes that were sent out.

    The second condition of the logistic regression model is not easily checked without a fairly sizable amount of data. Luckily, we have 4870 resume submissions in the data set! Let’s first visualize these data by plotting the true classification of the resumes against the model’s fitted probabilities, as shown in Figure [logisticModelPredict].

    We’d like to assess the quality of the model. For example, we might ask: if we look at resumes that we modeled as having a 10% chance of getting a callback, do we find about 10% of them actually receive a callback? We can check this for groups of the data by constructing a plot as follows:

    1. Bucket the data into groups based on their predicted probabilities.
    2. Compute the average predicted probability for each group.
    3. Compute the observed probability for each group, along with a 95% confidence interval.
    4. Plot the observed probabilities (with 95% confidence intervals) against the average predicted probabilities for each group.

    The points plotted should fall close to the line \(y = x\), since the predicted probabilities should be similar to the observed probabilities. We can use the confidence intervals to roughly gauge whether anything might be amiss. Such a plot is shown in Figure [logisticModelBucketDiag].
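A sketch of steps 1-4 in Python follows, assuming arrays y (0/1 outcomes) and p_hat (fitted probabilities) are available from the fitted model; the normal-approximation interval used here is one simple choice for the 95% confidence interval in step 3:

    import numpy as np

    def calibration_table(y, p_hat, n_groups=10):
        """Bucket observations by fitted probability, then compare the average
        predicted probability with the observed callback rate per bucket."""
        order = np.argsort(p_hat)
        rows = []
        for idx in np.array_split(order, n_groups):   # step 1: bucket the data
            pred = p_hat[idx].mean()                  # step 2: avg predicted prob
            obs = y[idx].mean()                       # step 3: observed prob
            se = np.sqrt(obs * (1 - obs) / len(idx))  #         with a 95% CI
            rows.append((pred, obs, obs - 1.96 * se, obs + 1.96 * se))
        return rows  # step 4: plot obs (with CI) against pred; compare to y = x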

    Additional diagnostics may be created that are similar to those featured in Section 3. For instance, we could compute residuals as the observed outcome minus the expected outcome (\(e_i = Y_i - \hat{p}_i\)), and then we could create plots of these residuals against each predictor. We might also create a plot like that in Figure [logisticModelBucketDiag] to better understand the deviations.

    Exploring discrimination between groups of different sizes

    Any form of discrimination is concerning, and this is why we decided it was so important to discuss this topic using data. The resume study also only examined discrimination in a single aspect: whether a prospective employer would call a candidate who submitted their resume. There was a 50% higher barrier for resumes simply when the candidate had a first name that was perceived to be from a Black individual. It’s unlikely that discrimination would stop there.

    Let’s consider a sex-imbalanced company that consists of 20% women and 80% men, and we’ll suppose that the company is very large, consisting of perhaps 20,000 employees. Suppose when someone goes up for promotion at this company, 5 of their colleagues are randomly chosen to provide feedback on their work.

    Now let’s imagine that 10% of the people in the company are prejudiced against the other sex. That is, 10% of men are prejudiced against women, and similarly, 10% of women are prejudiced against men.

    Who is discriminated against more at the company, men or women?

    [sex_imbalance_leads_to_discrimination] Let’s suppose we took 100 men who have gone up for promotion in the past few years. For these men, \(5 \times 100 = 500\) random colleagues will be tapped for their feedback, of which about 20% will be women (100 women). Of these 100 women, 10 are expected to be biased against the man they are reviewing. Then, of the 500 colleagues reviewing them, men will experience discrimination by about 2% of their colleagues when they go up for promotion.

    Let’s do a similar calculation for 100 women who have gone up for promotion in the last few years. They will also have 500 random colleagues providing feedback, of which about 400 (80%) will be men. Of these 400 men, about 40 (10%) hold a bias against women. Of the 500 colleagues providing feedback on the promotion packet for these women, 8% of the colleagues hold a bias against the women.

Example [sex_imbalance_leads_to_discrimination] highlights something profound: even in a hypothetical setting where each demographic has the same degree of prejudice against the other demographic, the smaller group experiences the negative effects more frequently. Additionally, if we were to complete a handful of examples like the one above with different numbers, we’d learn that the greater the imbalance in the population groups, the more the smaller group is disproportionately impacted.2
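The example’s arithmetic generalizes to any group split. Under the same hypothetical assumptions as above (randomly chosen review panels, equal 10% prejudice rates), a candidate’s expected share of biased reviewers is simply the other group’s share of the company times the prejudice rate, as this short sketch shows:

    def biased_reviewer_share(own_group_share, prejudice_rate=0.10):
        """Expected fraction of a random review panel biased against a
        candidate whose group makes up own_group_share of the company."""
        return (1 - own_group_share) * prejudice_rate

    print(biased_reviewer_share(0.80))  # men at an 80% male company:  0.02
    print(biased_reviewer_share(0.20))  # women at the same company:   0.08
    # The ratio of the two rates is (1 - p) / p, matching footnote 2:
    # here 0.08 / 0.02 = 4 = (1 - 0.2) / 0.2.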

Of course, there are other considerable real-world omissions from the hypothetical example. For example, studies have found instances where people from an oppressed group also discriminate against others within their own oppressed group. As another example, there are also instances where a majority group can be oppressed, with apartheid in South Africa being one such historic example. Ultimately, discrimination is complex, and there are many factors at play beyond the mathematical property we observed in Example [sex_imbalance_leads_to_discrimination].

    We close this book on this serious topic, and we hope it inspires you to think about the power of reasoning with data. Whether it is with a formal statistical model or by using critical thinking skills to structure a problem, we hope the ideas you have learned will help you do more and do better in life.


    1. There are ways to make them work, but we’ll leave those options to a later course.
    2. If a proportion \(p\) of a company are women and the rest of the company consists of men, then under the hypothetical situation the ratio of rates of discrimination against women vs men would be given by \(\frac{1 - p}{p}\); this ratio is always greater than 1 when \(p < 0.5\).

    This page titled 9.5: Introduction to Logistic Regression is shared under a CC BY-SA 3.0 license and was authored, remixed, and/or curated by David Diez, Christopher Barr, & Mine Çetinkaya-Rundel via source content that was edited to the style and standards of the LibreTexts platform.