15.3: The Mathematics
This section develops the full mathematical specification of a logistic regression model for a binary dependent variable within the Generalized Linear Model (GLM) framework. It formally defines each of the three required components: the linear predictor, the conditional distribution of the response, and the link function. The section demonstrates that the Bernoulli distribution is the natural choice for binary data, derives its key property that the expected value \(E[Y]=\pi\) (the probability of success), and shows that its variance is inherently non-constant (\(V[Y]=\pi(1-\pi)\)), explaining the heteroskedasticity observed when such data are modeled with OLS. It then derives the logit function as the canonical link for the Bernoulli distribution by writing the probability mass function in the exponential family form.
Finally, it introduces alternative link functions (probit, cauchit, log-log, complementary log-log) that can also map the probability \(\pi\) to the unbounded linear predictor, noting that the choice among them is often guided by disciplinary tradition and can affect predictions, though appropriate models should yield broadly similar conclusions.
By the end of this section, you will be able to:
- Specify the three mathematical components of a logistic regression model for a binary outcome: the linear predictor \(\eta = \beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k\), the Bernoulli distribution for the response (conditional distribution), and the logit link function \(\mathrm{logit}(\pi) = \eta\).
- Derive the expected value and variance of a Bernoulli random variable, and explain why this variance structure inherently leads to heteroskedasticity when a binary outcome is modeled with ordinary least squares.
- Demonstrate that the logit function is the canonical link for the Bernoulli distribution by rewriting its probability mass function in the standard exponential family form.
- Identify several alternative link functions for binary data (e.g., probit, cauchit, log-log, complementary log-log) and recognize that while the choice of link may affect predictions, appropriate models should lead to broadly similar conclusions about the significance and direction of effects.
✦•················• ✦ •··················•✦
When we model using the Classical Linear Model, we actually model/predict the expected value of the dependent variable, the mean. In the previous insurance example, we modeled/predicted the probability of a person purchasing life insurance. What is the connection? It is that \(E[Y \mid x] = \pi\).
Remember from the chapter on GLMs that performing GLM estimation requires knowing three things about our data and our model:
- the linear predictor,
- the conditional distribution of the dependent variable, and
- the function that links the two domains.
The previous section discussed the linear predictor (\(\eta = \beta_0 + \beta_1 \text{age} + \beta_2 \text{income}\)) and a link function, \(\mathrm{logit}(\mu)\), for our example. That only leaves the conditional distribution of the dependent variable.
Conditional Distribution
What are the possible values of the dependent variable? They are \(\{0, 1\}\). What distribution has only these two outcomes? It is the Bernoulli distribution. For the Bernoulli distribution, the probability of getting a "1" (success) is \(\pi\) and the probability of getting a "0" (failure) is \(1-\pi\). Mathematically, this means the full probability mass function (pmf) is:
\begin{equation}
f(y) = \begin{cases} \pi^y (1-\pi)^{1-y} & y \in \{0, 1\} \\[1em] 0 & \text{otherwise} \end{cases}
\end{equation}
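To make this pmf concrete, here is a minimal R check (the success probability \(p = 0.3\) is an arbitrary illustration); in R, the Bernoulli pmf is the binomial pmf with size = 1:

```r
p <- 0.3   # an arbitrary success probability (named p to avoid masking R's built-in pi)

# The Bernoulli pmf via dbinom() with size = 1:
dbinom(0, size = 1, prob = p)   # 1 - p = 0.7
dbinom(1, size = 1, prob = p)   # p     = 0.3

# The same values from the formula p^y * (1 - p)^(1 - y):
y <- c(0, 1)
p^y * (1 - p)^(1 - y)           # 0.7 0.3
```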
Strictly speaking, the probability mass function is not as important as the expected value of this distribution. Why? Remember that the Generalized Linear Model paradigm models the expected value, \(\mathrm{E}[Y \mid x]\), of the distribution of the dependent variable... just like in ordinary least squares.
Calculating the expected value of the Bernoulli distribution is easy using the definition of expected value:
\begin{align}
E[Y] &\stackrel{\text{def}}{=} \sum_i\ y_i\ f(y_i) \\[1em]
&= 0 \cdot f(0) + 1 \cdot f(1) \\[1em]
&= 0 \cdot (1-\pi) + 1 \cdot (\pi) \\[1em]
&= \pi
\end{align}
Thus, the expected value of a Bernoulli random variable is \(\pi\), the success probability.
This fact makes the results of modeling more apparent: As the GLM paradigm models the expected value, when we use the Bernoulli distribution, we end up modeling the probability of a success, which is what we want.
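As a quick sanity check of \(E[Y]=\pi\), a short R simulation (again with an arbitrary \(\pi = 0.3\)) shows the sample mean of many Bernoulli draws settling near the success probability:

```r
# Simulate many Bernoulli trials and compare the sample mean to p.
set.seed(1)
p <- 0.3
y <- rbinom(100000, size = 1, prob = p)
mean(y)   # approximately 0.3, illustrating E[Y] = p
```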
Recall that one of the assumptions of Ordinary Least Squares is that the variance is constant with respect to the independent variable. When the outcomes are Bernoulli random variables, we can easily prove that the variance is not constant with respect to the expected values. If there is a relationship between the independent variable and the probabilities (which will be true if X affects Y), then a relationship between variance and expected value indicates heteroskedasticity.
To see this, let \(Y \sim \text{Bernoulli}(\pi)\). With this, and with the pmf above, we use the definition of variance to calculate \(V[Y]\):
\begin{align*}V[Y] &\stackrel{\text{def}}{=} \sum_i (y_i - \mu)^2 f(y_i) \\[1em]
&= (0 - \mu)^2 f(0) + (1 - \mu)^2 f(1) \\[1em]
&= (0 - \pi)^2 f(0) + (1 - \pi)^2 f(1) \\[1em]
&= \pi^2 (1-\pi) + (1-\pi)^2 \pi \\[1em]
&= \pi(1-\pi) \big[ \pi + (1-\pi) \big]
\end{align*}
This last line simplifies to \(V[Y] = \pi(1-\pi)\), as \(\pi + (1-\pi) = 1\), which means \(V[Y]\) is a function of \(\pi\), the expected value. It is not a constant with respect to the expected value, \(\pi\). Binary dependent variables violate the assumption of homoskedasticity — by definition.
As an aside: The variance is a quadratic function of the success probability, \(V[Y]=\pi(1-\pi)\). From this formula, we see that we are most unsure (the variance is highest) when the probability of a success is \(\pi=0.500\). Check that this makes sense: Which has a more uncertain outcome, a fair coin (\(\pi=0.500\)) or a two-headed coin (\(\pi=1.000\))?
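A short R sketch confirms this: tabulating \(\pi(1-\pi)\) over a grid of probabilities shows the variance peaking at \(\pi = 0.500\) and vanishing at the endpoints:

```r
# The Bernoulli variance p(1 - p) over a grid of success probabilities.
p <- seq(0, 1, by = 0.01)
v <- p * (1 - p)
p[which.max(v)]   # 0.5  (the most uncertain coin)
max(v)            # 0.25 (the largest possible Bernoulli variance)
# curve(x * (1 - x), from = 0, to = 1)   # optional: plot the parabola
```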
Now that we understand our choice of distribution a bit better, and the resulting expected value, let us examine the third facet: the link function. First, note that \(\pi\) is bounded: \(\pi \in (0,1)\). Thus we need a function that takes a doubly-bounded variable and transforms it into an unbounded variable. We have already met a link function that can handle this — the logit function.
Here, I must mention that the logit is not the only appropriate link function. Any monotonic function that maps \((0, 1) \to \mathbb{R}\) is appropriate. This includes the quantile functions of continuous distributions supported on the entire real line, of which the probit (the standard Normal quantile function) is a member.
The choice of the link function often reduces to tradition within your field. However, social science theory is becoming advanced enough to suggest that some link functions are more appropriate than others.
And so, we have the three necessary components to use generalized linear models in this example:
- the linear predictor, \[ \eta = \beta_0 + \beta_1 \text{age} + \beta_2 \text{income} \]
- the distribution of the dependent variable, \[ \text{insurance} \sim \text{Bernoulli}(\pi) \] with the formula for the expected value \(\mu = \pi\).
- and the link function, \[ \mathrm{logit}(\mu) = \eta \]
Here is what you need to take away from this section: The distribution must fit the possible outcomes. The link must translate the bounds on the parameter to the linear predictor. Both require you to know some distributions.
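For concreteness, here is a sketch of how the three components are expressed in a single R call. The data frame insurance_df and its columns are hypothetical stand-ins for the running example:

```r
# Hypothetical data frame: insurance (0/1), age, income.
# The family argument specifies the conditional distribution (Bernoulli,
# via binomial) and the link; the formula specifies the linear predictor.
fit <- glm(insurance ~ age + income,
           family = binomial(link = "logit"),
           data   = insurance_df)
summary(fit)        # coefficient estimates on the eta (logit) scale
head(fitted(fit))   # predicted probabilities, E[Y | x] = pi
```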
Deriving the Canonical Link
In the chapter on GLMs, we mentioned that each distribution has a canonical link. Let us derive the canonical link for the Bernoulli distribution. As a side note, one does not have to understand this section to use Generalized Linear Models.
The steps to determine the canonical link are the same for the Bernoulli as they were for the Gaussian:
- Write the probability mass function (pmf).
- Write the probability mass function in the required form.
- Read off the canonical link.
For this distribution, this results in:
\begin{align}
&\text{pmf}: \quad && \pi^y(1-\pi)^{1-y} \\[1em]
&&& = \exp \big[ \log\left( \pi^y(1-\pi)^{1-y} \right) \big] \\[1em]
&&& = \exp \big[ \log\left( \pi^y\right) + \log \left( (1-\pi)^{1-y} \right) \big] \\[1em]
&&& = \exp \big[ y\ \log\left(\pi\right) + (1-y) \log \left(1-\pi\right) \big] \\[1em]
&&& = \exp \big[ y\ \log\left(\pi\right) + \log \left(1-\pi\right) - y\log \left(1-\pi\right) \big] \\[1em]
&&& = \exp \big[ y\ \left( \log(\pi)-\log(1-\pi) \right) + \log(1-\pi) \big] \\[1em]
&&& = \exp \big[ y\ \log \left(\frac{\pi}{1-\pi}\right) + \log(1-\pi) \big] \\[1em]
&&& = \exp \big[ y\ \mathrm{logit}(\pi) + \log(1-\pi) \big] \\[1em]
&&& = \exp \left[ \frac{y\ \mathrm{logit}(\pi) + \log(1-\pi)}{1} + 0 \right]
\end{align}
This is in the required form: \( \exp \left[ \frac{\displaystyle y \theta - b(\theta)}{\displaystyle a(\phi)} + c(y, \phi) \right] \).
Thus, reading off the standard form, we have the following:
- \(y = y\)
- \(\theta = \mathrm{logit}(\pi)\)
- \(a(\phi)=1\)
- \(b(\theta) = -\log(1-\pi) = \log(1 + e^\theta)\)
- \(c(y, \phi)=0\)
As such, the canonical link is the logit function, \(g(\pi) = \mathrm{logit}(\pi) = \eta\).
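A numerical check of this rewrite is straightforward in R: the Bernoulli pmf and the exponential-family form \(\exp\big[y\,\mathrm{logit}(\pi) + \log(1-\pi)\big]\) give identical values (qlogis() is R's logit function):

```r
# Verify that the pmf equals exp(y * logit(p) + log(1 - p)).
p <- 0.3
y <- c(0, 1)
dbinom(y, size = 1, prob = p)     # 0.7 0.3
exp(y * qlogis(p) + log(1 - p))   # 0.7 0.3  (identical)
```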
Other Links
As mentioned in the GLM chapter, we do not have to use the canonical link. Any monotonically increasing function that maps the restricted domain onto the unrestricted domain works. Thus, there are several options for the link function. The table below gives some options.
The logit link is the canonical link. The probit link is frequently used in biostatistics. Its advantage is that it is based on the Normal distribution, with which we are intimately familiar. There is usually little difference between predictions made with the logit link and those made with the probit link. The logit coefficient estimates will usually exceed the probit estimates by a factor of roughly 1.6 to 1.8 (the logistic distribution's standard deviation is \(\pi/\sqrt{3} \approx 1.81\)), and the levels of significance will usually be close. The cauchit link is a symmetric link with heavy tails, as compared to the logit and the probit links (see the figure below).
The log-log link and the complementary-log-log link are asymmetric links. The log-log link has a heavy right tail; the complementary-log-log link, a heavy left tail (see the second figure below). Most science theory is only now beginning to be able to state which of the three types of link functions will be most appropriate for the given model (symmetric, heavy left, heavy right).
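The asymmetry can also be seen numerically. Evaluating each inverse link (taken from the table below) at \(\pm\eta\) shows that the logit is symmetric about 0.500 while the two asymmetric links are not:

```r
# Inverse links evaluated at -eta and +eta.
eta <- 1.5
plogis(c(-eta, eta))          # logit: 0.182 0.818, symmetric about 0.5
exp(-exp(-c(-eta, eta)))      # log-log: 0.011 0.800, slow approach to 1
1 - exp(-exp(c(-eta, eta)))   # cloglog: 0.200 0.989, slow approach to 0
```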
R has the built-in ability to model using the following link functions for the binomial (Bernoulli) distribution: logit, probit, cauchit, log, and complementary-log-log. The KnoxStats package adds the log-log link function:
```r
# The log-log link is not built into base R; the KnoxStats package supplies it.
library(KnoxStats)
glm(y ~ x, family = binomial(link = make.link("loglog")))
```
| Link | Inverse Link |
|---|---|
| Logit: \(\log \left( \mu/(1-\mu) \right)\) | Logistic: \((1+\exp(-\eta))^{-1}\) |
| Probit: \(\Phi^{-1}(\mu)\) | Normal CDF: \(\Phi(\eta)\) |
| Cauchit: \(\tan\big(\pi(\mu-\frac{1}{2})\big)\) | Cauchy CDF: \(\arctan(\eta)/\pi + \frac{1}{2}\) |
| Log-log: \(-\log(-\log(\mu))\) | \(\exp(-\exp(-\eta))\) |
| Complementary log-log: \(\log(-\log(1-\mu))\) | \(1-\exp(-\exp(\eta))\) |
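Most of the table's link/inverse-link pairs can be verified directly with R's make.link(); each linkinv() undoes its linkfun(), as this round-trip sketch shows:

```r
# Round-trip check of the built-in links: mu -> eta -> mu.
for (lk in c("logit", "probit", "cauchit", "cloglog")) {
  lnk <- make.link(lk)
  eta <- lnk$linkfun(0.25)                          # map mu = 0.25 to eta
  cat(lk, ":", eta, "->", lnk$linkinv(eta), "\n")   # recovers 0.25
}
```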
Graphics of Other Links
[Figure: the symmetric links compared (logit, probit, cauchit).]
[Figure: the asymmetric links compared (log-log, complementary log-log).]
So, ... when should we use an asymmetric link ... and which one?
Choosing between the two primary asymmetric links (and the symmetric links) is difficult. It depends on the asymmetry of the underlying probability process you are modeling. Both are used for binary or proportion data, but they handle asymmetry differently.
The complementary log-log link, \( g(\pi) = \log(-\log(1-\pi)) \), is appropriate when the probability of an event (a "success") is asymmetric, with the probability increasing slowly from zero but then approaching one sharply. This link is commonly used in areas like:
- Mortality studies: When death arises from the failure of any one of a number of independent organs or components.
- Modeling non-zero counts: When modeling the probability of a non-zero count in Poisson-distributed data.
- Constrained ordination: In ecology, it is sometimes preferred over the logit link for certain species distribution models.
The log-log link, \( g(\pi) = -\log(-\log(\pi)) \), is essentially the mirror image of the cloglog link. Fitting a model with a cloglog link to the probability of a failure (\(1-\pi\)) is equivalent to fitting a log-log link to the probability of a success (\(\pi\)). This means the log-log link is appropriate when the probability of success is asymmetric in the opposite direction: it increases sharply from zero but then approaches one slowly. It is sometimes used for parameters constrained to be greater than 1, though this application is more niche.
In summary, use a complementary log-log link when the increase in probability is slow at the beginning but accelerates near the end. Use a log-log link when the increase is rapid at the beginning but slows down near the end.
Again, the link choice is usually a matter of tradition, rarely of theory. Statistical significance of the variables should be similar across the several link functions. So, from a theory-testing standpoint, the link functions are rather interchangeable. With that said, predictions will vary depending on the link function chosen. Thus, if prediction is important then you will want to investigate the effect of different link functions on your predictions (and confidence bounds).
While the actual predictions will differ, they should only do so slightly. The rule is that all models that are "appropriate" should provide similar conclusions and predictions. If they do not, then your model is too fragile… a bad thing. Build for model robustness.
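As a closing sketch, fitting the same simulated data with two appropriate links (logit and probit here) illustrates the point: the coefficients differ by a scale factor, but the fitted probabilities, and thus the substantive conclusions, are nearly identical:

```r
# Simulate binary data from a logistic model, then fit with two links.
set.seed(42)
x <- rnorm(500)
y <- rbinom(500, size = 1, prob = plogis(-0.5 + 1.2 * x))

fit_logit  <- glm(y ~ x, family = binomial(link = "logit"))
fit_probit <- glm(y ~ x, family = binomial(link = "probit"))

coef(fit_logit) / coef(fit_probit)                 # roughly 1.6 to 1.8
max(abs(fitted(fit_logit) - fitted(fit_probit)))   # small: predictions agree
```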


