15.3: The Mathematics
When we model using the Classical Linear Model, we actually model (predict) the expected value of the dependent variable, the mean. In the previous insurance example, we modeled (predicted) the probability of a person purchasing life insurance. What is the connection? It is that \(E[Y \mid x] = \pi\).
Remember from the chapter on GLMs, performing GLM estimation requires that we know three things about our data and our model:
- the linear predictor,
- the conditional distribution of the dependent variable, and
- the function that links the two domains.
The previous section discussed the linear predictor (\(\eta = \beta_0 + \beta_1 \text{age} + \beta_2 \text{income}\)) and a link function, \(\mathrm{logit}(\mu)\), for our example. That only leaves the conditional distribution of the dependent variable.
Conditional Distribution
What are the possible values of the dependent variable? They are \(\{0, 1\}\). What distribution has only these two outcomes? It is the Bernoulli distribution. For the Bernoulli distribution, the probability of getting a "1" (success) is \(\pi\) and the probability of getting a "0" (failure) is \(1-\pi\). Mathematically, this means the full probability mass function (pmf) is:
\begin{equation}
f(y) = \begin{cases} \pi^y (1-\pi)^{1-y} & y \in \{0, 1\} \\[1em] 0 & \text{otherwise} \end{cases}
\end{equation}
Strictly speaking, the probability mass function is not as important as the expected value of this distribution. Why? Remember that the Generalized Linear Model paradigm models the expected value, \(\mathrm{E}[Y \mid x]\), of the distribution of the dependent variable... just like in ordinary least squares.
Calculating the expected value of the Bernoulli distribution is easy using the definition of expected value:
\begin{align}
E[Y] &\stackrel{\text{def}}{=} \sum_i\ y_i\ f(y_i) \\[1em]
&= 0 \cdot f(0) + 1 \cdot f(1) \\[1em]
&= 0 \cdot (1-\pi) + 1 \cdot (\pi) \\[1em]
&= \pi
\end{align}
Thus, the expected value of a Bernoulli random variable is \(\pi\), the success probability.
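These two facts are easy to verify numerically. The following sketch (my own illustration, not part of the text) encodes the pmf and checks that it sums to one over \(\{0,1\}\) and has mean \(\pi\):

```python
def bern_pmf(y, pi):
    """Bernoulli pmf: pi**y * (1 - pi)**(1 - y) for y in {0, 1}, else 0."""
    if y not in (0, 1):
        return 0.0
    return pi**y * (1 - pi)**(1 - y)

pi = 0.3                                           # an illustrative success probability
total = bern_pmf(0, pi) + bern_pmf(1, pi)          # the pmf sums to 1 over the support
mean = 0 * bern_pmf(0, pi) + 1 * bern_pmf(1, pi)   # E[Y] = 0*f(0) + 1*f(1) = pi
print(total, mean)
```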
This fact makes the results of modeling more apparent: As the GLM paradigm models the expected value, when we use the Bernoulli distribution, we end up modeling the probability of a success, which is what we want.
Recall that one of the assumptions of Ordinary Least Squares is that the variance is constant with respect to the independent variable. When the outcomes are Bernoulli random variables, we can easily prove that the variance is not constant with respect to the expected values. If there is a relationship between the independent variable and the probabilities (which will be true if X affects Y), then a relationship between variance and expected value indicates heteroskedasticity.
To see this, let \(Y \sim Bern(\pi)\). With this, and with the pmf above, we use the definition of variance to calculate \(V[Y]\):
\begin{align*}V[Y] &\stackrel{\text{def}}{=} \sum_i (y_i - \mu)^2 f(y_i) \\[1em]
&= (0 - \mu)^2 f(0) + (1 - \mu)^2 f(1) \\[1em]
&= (0 - \pi)^2 f(0) + (1 - \pi)^2 f(1) \\[1em]
&= \pi^2 (1-\pi) + (1-\pi)^2 \pi \\[1em]
&= \pi(1-\pi) \big[ \pi + (1-\pi) \big]
\end{align*}
This last line simplifies to \(V[Y] = \pi(1-\pi)\), as \(\pi + (1-\pi) = 1\), which means \(V[Y]\) is a function of \(\pi\), the expected value. It is not a constant with respect to the expected value, \(\pi\). Binary dependent variables violate the assumption of homoskedasticity — by definition.
As an aside: The variance is a quadratic function of the success probability, \(V[Y]=\pi(1-\pi)\). From this formula, we see that we are most unsure (the variance is highest) when the probability of a success is \(\pi=0.500\). Check that this makes sense: Which has a more uncertain outcome, a fair coin (\(\pi=0.500\)) or a two-headed coin (\(\pi=1.000\))?
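The quadratic shape of the variance can be checked directly. This small sketch (my own illustration) evaluates \(V[Y]=\pi(1-\pi)\) across several success probabilities:

```python
def bern_var(pi):
    """V[Y] = pi * (1 - pi) for a Bernoulli(pi) random variable."""
    return pi * (1 - pi)

# The variance peaks at pi = 0.5 (a fair coin) and shrinks to 0 as pi
# approaches 0 or 1 (a two-headed coin has no uncertainty at all).
for p in [0.0, 0.25, 0.5, 0.75, 1.0]:
    print(p, bern_var(p))
```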
Now that we understand our choice of distribution a bit better, and the resulting expected value, let us examine the third facet: the link function. First, note that \(\pi\) is bounded: \(\pi \in (0,1)\). Thus we need a function that takes a doubly-bounded variable and transforms it into an unbounded variable. We have already met a link function that can handle this — the logit function.
Here, I must mention that the logit is not the only appropriate link function. Any monotonic function that maps \((0, 1)\) onto \(\mathbb{R}\) is appropriate. This includes the entire class of quantile functions, of which the probit (the standard normal quantile function) is a member.
The choice of the link function often reduces to tradition within your field. However, social science theory is getting advanced enough to suggest link functions that are more appropriate than others.
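As a quick numerical illustration (a sketch of my own, not from the text), the logit and its inverse, the logistic function, carry probabilities to the real line and back:

```python
import math

def logit(mu):
    """Link: maps a probability mu in (0, 1) to the whole real line."""
    return math.log(mu / (1 - mu))

def inv_logit(eta):
    """Inverse link (logistic function): maps any real eta into (0, 1)."""
    return 1 / (1 + math.exp(-eta))

# Probabilities near the bounds map to large-magnitude linear predictors,
# and the inverse link recovers the original probability.
for mu in [0.001, 0.25, 0.5, 0.999]:
    print(mu, logit(mu), inv_logit(logit(mu)))
```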
And so, we have the three necessary components to use generalized linear models in this example:
- the linear predictor, \[ \eta = \beta_0 + \beta_1 \text{age} + \beta_2 \text{income} \]
- the distribution of the dependent variable, \[ \text{insurance} \sim \text{Bernoulli}(\pi) \] with the formula for the expected value \(\mu = \pi\).
- and the link function, \[ \mathrm{logit}(\mu) = \eta \]
Here is what you need to take away from this section: The distribution must fit the possible outcomes. The link must translate the bounds on the parameter to the linear predictor. Both require you to know some distributions.
Deriving the Canonical Link
In the chapter on GLMs, we mentioned that each distribution has a canonical link. Let us derive the canonical link for the Bernoulli distribution. As a side note, one does not have to understand this section to use Generalized Linear Models.
The steps to determine the canonical link are the same for the Bernoulli as they were for the Gaussian:
- Write the probability mass function (pmf).
- Write the probability mass function in the required form.
- Read off the canonical link.
For this distribution, this results in:
\begin{align}
&\text{pmf}: \quad && \pi^y(1-\pi)^{1-y} \\[1em]
&&& = \exp \big[ \log\left( \pi^y(1-\pi)^{1-y} \right) \big] \\[1em]
&&& = \exp \big[ \log\left( \pi^y\right) + \log \left( (1-\pi)^{1-y} \right) \big] \\[1em]
&&& = \exp \big[ y \log\left(\pi\right) + (1-y) \log \left(1-\pi\right) \big] \\[1em]
&&& = \exp \big[ y \log\left(\pi\right) + \log \left(1-\pi\right) - y\log \left(1-\pi\right) \big] \\[1em]
&&& = \exp \big[ y \left( \log(\pi)-\log(1-\pi) \right) + \log(1-\pi) \big] \\[1em]
&&& = \exp \big[ y \log \left(\frac{\pi}{1-\pi}\right) + \log(1-\pi) \big] \\[1em]
&&& = \exp \big[ y \mathrm{logit}(\pi) + \log(1-\pi) \big] \\[1em]
&&& = \exp \left[ \frac{y \mathrm{logit}(\pi) + \log(1-\pi)}{1} + 0 \right] \\[1em]
\end{align}
This is in the required form: \( \exp \left[ \frac{\displaystyle y \theta - b(\theta)}{\displaystyle a(\phi)} + c(y, \phi) \right] \).
Thus, reading off the standard form, we have the following:
- \(y = y\)
- \(\theta = \mathrm{logit}(\pi)\)
- \(a(\phi)=1\)
- \(b(\theta) = -\log(1-\pi) = \log(1 + e^\theta)\)
- \(c(y,\phi)=0\)
As such, the canonical link is the logit function, \(g(\pi) = \mathrm{logit}(\pi)\).
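As a sanity check, we can recover the mean and variance from this standard form. The identities \(E[Y] = b'(\theta)\) and \(V[Y] = b''(\theta)\, a(\phi)\) are standard exponential-family facts (not derived in this text). With \(b(\theta) = -\log(1-\pi) = \log(1+e^\theta)\) and \(a(\phi)=1\):
\begin{align*}
b'(\theta) &= \frac{e^\theta}{1+e^\theta} = \pi = E[Y] \\[1em]
b''(\theta) &= \frac{e^\theta}{(1+e^\theta)^2} = \pi(1-\pi) = V[Y]
\end{align*}
Both results match the mean and variance calculated earlier from the pmf.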
Other Links
As mentioned in the GLM chapter, we do not have to use the canonical link. Any monotone increasing function that maps the restricted domain to the unrestricted domain works. Thus, there are several options for the link function. The table below gives some options.
The logit link is the canonical link. The probit link is frequently used in biostatistics. Its advantage is that it is based on the Normal distribution, with which we are intimately familiar. There is usually little difference between predictions made with the logit link and those made with the probit link. The coefficient estimates will usually differ by a scale factor of roughly 1.6 to 1.8, reflecting the different scales of the logistic and normal distributions, and the levels of significance will usually be close. The cauchit link is a symmetric link with heavier tails than the logit and the probit links (see the figure below).
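The logit-probit scale relationship can be seen numerically. The standard logistic distribution has standard deviation \(\pi/\sqrt{3} \approx 1.81\), so rescaling the linear predictor by about that amount makes the logistic curve track the normal CDF closely. This sketch is my own illustration, not part of the text:

```python
import math
from statistics import NormalDist

Phi = NormalDist().cdf          # standard normal CDF (inverse of the probit link)

def inv_logit(eta):
    """Logistic function, the inverse of the logit link."""
    return 1 / (1 + math.exp(-eta))

# Rescale eta by the logistic standard deviation pi/sqrt(3) ~ 1.81 so the
# two inverse links are on comparable scales; the curves nearly coincide.
scale = math.pi / math.sqrt(3)
for eta in [-2, -1, 0, 1, 2]:
    print(eta, round(Phi(eta), 3), round(inv_logit(scale * eta), 3))
```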
The log-log link and the complementary log-log link are asymmetric links: the log-log link has a heavy right tail; the complementary log-log link, a heavy left tail (see the second figure below). Scientific theory is only now beginning to be able to state which of the three shapes of link function (symmetric, heavy left tail, heavy right tail) is most appropriate for a given model.
R has the built-in ability to model using the following link functions for the binomial (Bernoulli) distribution: logit, probit, cauchit, log, and complementary-log-log. The KnoxStats package adds the log-log link function:
glm( y ~ x, family=binomial(link=make.link("loglog")) )
| Link | Inverse Link |
|---|---|
| Logit: \(\log \left( \mu/(1-\mu) \right)\) | Logistic: \((1+\exp(-\eta))^{-1}\) |
| Probit: \(\Phi^{-1}(\mu)\) | Normal CDF: \(\Phi(\eta)\) |
| Cauchit: \(\tan\big(\pi(\mu-\frac{1}{2})\big)\) | Cauchy CDF: \(\arctan(\eta)/\pi + \frac{1}{2}\) |
| log-log: \(-\log( -\log(\mu))\) | \(\exp(-\exp(-\eta))\) |
| Complementary log-log: \(\log(-\log(1-\mu))\) | \(1-\exp(-\exp(\eta))\) |
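These pairs can be checked numerically. The following sketch (my own illustration; the names are arbitrary) implements each link from the table and confirms that its inverse undoes it:

```python
import math
from statistics import NormalDist

N = NormalDist()  # standard normal, supplying the probit and its inverse

# (link, inverse link) pairs matching the table above.
links = {
    "logit":   (lambda mu: math.log(mu / (1 - mu)),
                lambda eta: 1 / (1 + math.exp(-eta))),
    "probit":  (N.inv_cdf, N.cdf),
    "cauchit": (lambda mu: math.tan(math.pi * (mu - 0.5)),
                lambda eta: math.atan(eta) / math.pi + 0.5),
    "loglog":  (lambda mu: -math.log(-math.log(mu)),
                lambda eta: math.exp(-math.exp(-eta))),
    "cloglog": (lambda mu: math.log(-math.log(1 - mu)),
                lambda eta: 1 - math.exp(-math.exp(eta))),
}

# Each inverse link undoes its link: g_inv(g(mu)) recovers mu.
for name, (g, g_inv) in links.items():
    print(name, round(g_inv(g(0.3)), 6))
```

Any of these functions could serve as \(g\) in \(g(\mu)=\eta\); they differ mainly in how quickly probabilities saturate in the tails.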
Graphics of Other Links
Again, the link choice is usually a matter of tradition, rarely of theory. Statistical significance of the variables should be similar across the several link functions. So, from a theory-testing standpoint, the link functions are rather interchangeable. With that said, predictions will vary depending on the link function chosen. Thus, if prediction is important then you will want to investigate the effect of different link functions on your predictions (and confidence bounds).
While the actual predictions will differ, they should only do so slightly. The rule is that all models that are "appropriate" should provide similar conclusions and predictions. If they do not, then your model is too fragile… a bad thing. Build for model robustness.


