
14.5: Generalized Linear Models in R


    In previous chapters, we performed linear modeling using the lm function. To perform generalized linear modeling, we use the glm function. When one uses the Gaussian distribution and its canonical link, the results of the two methods are identical; that is, we could have replaced all of our previous uses of lm with glm without changing a thing.

    Be Aware!

    If one uses the Gaussian distribution and a non-canonical link, the predictions will be very close, but not identical. The reason is that the transformation is applied to different quantities in the two methods, as the following shows.

    To see this, we can look at two things. The first focuses on what the transformation applies to: does it apply to the residuals or not?

    When transforming the dependent variable:
    \begin{align}
    g^{-1}(\mathbf{Y}) &= \mathbf{X}\mathbf{B} + \mathbf{E} \\[1em]
    \Rightarrow \qquad \qquad \mathbf{Y} &= g(\mathbf{X}\mathbf{B} + \mathbf{E})
    \end{align}

    When using the link function:
    \begin{align}
    g^{-1}(E[\mathbf{Y} \ |\ \mathbf{X}]) &= \mathbf{X}\mathbf{B} \\[1em]
    E[\mathbf{Y} \ |\ \mathbf{X}] &= g(\mathbf{X}\mathbf{B}) \\[1em]
    \mathbf{Y} &= g(\mathbf{X}\mathbf{B}) + \mathbf{E}
    \end{align}

    So, the only difference is in whether the function applies to the residuals. In the CLM, it does; in the GLM, it does not. This is why there will be (usually slight) differences between the CLM transformed and the GLM with a link function.
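    A quick numeric sketch makes the difference concrete. Assuming a single (made-up) linear-predictor value and residual, and taking \(g = \exp\) for illustration, the two paradigms produce different responses:

    ```python
    import math

    # Hypothetical illustration: one linear-predictor value X*B and one residual E,
    # with g = exp (so g^{-1} = log). These numbers are invented for the sketch.
    xb = 1.2   # the linear predictor, X*B
    e = 0.3    # a residual

    clm = math.exp(xb + e)   # CLM with transformed Y: g applies to X*B + E
    glm = math.exp(xb) + e   # GLM with a link: g applies only to the mean, X*B

    print(round(clm, 4))  # 4.4817
    print(round(glm, 4))  # 3.6201
    ```

    The two values agree only when the residual is zero, which is why predictions from the two approaches are close but not identical.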

    Second, we can revisit an old example, use the GLM paradigm to find the answers, and see that the results differ slightly from those obtained when we fit the model by transforming the dependent variable.

    Example \(\PageIndex{1}\): Voting in Děčín

    Let us return to the cows data file. The voters of Děčín are being sent to the polls to vote on a constitutional referendum (ballot measure) that proposes to limit the number of cows that can be housed within the city limits. This was not the first time that Ruritanians were sent to the polls to vote on this or a closely related issue. Given the information from previous votes, what is the estimated proportion of voters who will vote in favor of the ballot measure in Děčín?

    Solution.
    The example asks us to estimate the proportion of voters who will vote in favor of the ballot measure in Děčín. The dependent variable will be propWin and the independent variables will be yearPassed, chickens, and religPct. For now, let us assume a linear relationship between the independent variables and the dependent variable; that is, the equation we will use to fit the data is
    \begin{equation}
    \texttt{propWin} = \beta_0 + \beta_1 (\texttt{yearPassed}) + \beta_2 (\texttt{chickens}) + \beta_3 (\texttt{religPct}) + \varepsilon
    \end{equation}

    This is equivalent to
    \begin{equation}
    E[\texttt{propWin}] = \beta_0 + \beta_1 (\texttt{yearPassed}) + \beta_2 (\texttt{chickens}) + \beta_3 (\texttt{religPct})
    \end{equation}
    which is more clearly connected to the GLM paradigm than before.

    Performing Generalized Linear Modeling in R is straightforward (as it is in all modern statistical packages). The function to use is glm (for "Generalized Linear Modeling"):

    glm(propWin ~ yearPassed + chickens + religPct)
    

    As glm returns a lot of information, we should store its results in a variable, which I will call mod1. Once the computer computes the regression (and all associated information), we can summarize the results in the standard results table (below) using the command

    summary(mod1)
    

    The resulting output is

                             Estimate   Std.Err   t-value    p-value
    Constant Term              0.1512    0.0659     2.293     0.0295
    Year Passed (post 2000)   -0.0201    0.0036    -5.618  << 0.0001
    Banned Chickens           -0.0373    0.0200    -1.868     0.0723
    Percent Religious          0.0095    0.0011     8.801  << 0.0001
    

    Notice that two of the three variables of interest are statistically significant at the \(\alpha = 0.05\) level; Banned Chickens falls just short (\(p = 0.0723\)). Additionally, the model has a residual deviance of \(0.063072\) (as compared to the null deviance of \(0.286802\)). This indicates that the model reduced the deviance by a factor of

    \begin{equation}
    1-\frac{0.063072}{0.286802} = 0.7801
    \end{equation}

    And this agrees with the \(R^2\) from Section 7.3: Cows in Děčín.

    Thus, the equation for the line of best fit (also known as the prediction line) is approximately

    \begin{equation}
    E[\texttt{propWin}] = 0.1512 - 0.0201 (\texttt{yearPassed}) - 0.0373 (\texttt{chickens}) + 0.0095 (\texttt{religPct})
    \end{equation}

    According to this model, what is the expected vote in Děčín? To answer this, we need this information about the Děčín ballot measure: yearPassed = 9, chickens = 0, religPct = 48. With this information, and under the assumption that the model is correct, we have our prediction that 42% of the Děčín voters will vote in favor of this ballot measure.
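    As a sanity check on the arithmetic, plugging the Děčín values into the fitted equation reproduces this prediction (a Python sketch of the calculation, using the coefficients from the results table above):

    ```python
    # Děčín inputs from the example: yearPassed = 9, chickens = 0, religPct = 48.
    # Coefficients are taken from the fitted equation above.
    b0, b1, b2, b3 = 0.1512, -0.0201, -0.0373, 0.0095
    prop_win = b0 + b1 * 9 + b2 * 0 + b3 * 48
    print(round(prop_win, 4))  # 0.4263, i.e. about 42% voting in favor
    ```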

    \(\blacksquare\)

    There is nothing in the previous paragraphs that differs from the results we obtained when we answered the same question using OLS. This is because the Generalized Linear Model paradigm extends the Classical Linear Model paradigm and is equivalent to it when the dependent variable is Gaussian distributed and the link is the identity function. We can even use something like the goodness-of-fit measure we developed in Section 3.4: PRE Measures, the \(R^2\) measure. Here, however, we calculate it based on the null and residual deviances. The null deviance is the deviance inherent in the data (akin to the variance of the data, TSS). The residual deviance is the deviance in the data unexplained by the model (akin to the SSE).
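    This deviance-based goodness-of-fit measure is simply the PRE formula with deviances in place of sums of squares. A Python sketch of the calculation, using the two deviances reported above:

    ```python
    # Deviance-based pseudo-R^2 for the Gaussian/identity model above.
    null_dev  = 0.286802   # deviance inherent in the data (akin to TSS)
    resid_dev = 0.063072   # deviance unexplained by the model (akin to SSE)

    pseudo_r2 = 1 - resid_dev / null_dev
    print(round(pseudo_r2, 4))  # 0.7801
    ```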

    If we wish to predict the results of a Venkovský ballot measure from 1994, which also restricted chickens, we would still get an impossible prediction --- one that is outside logical limits. In Section 8.1: The Issue of Boundedness, we corrected this error by transforming the data, modeling, then back-transforming the results. Instead of transforming the dependent variable, let us merely change the link function. Here is how that is done in R and with glm:

    The command to use is

    mod3 = glm(propWin ~ yearPassed + chickens + religPct,
               family=gaussian(link=make.link("logit")))
    

    The results of this new analysis are

                              Estimate   Std.Err   t-value     p-value
     Constant Term             -1.8909    0.2898     -6.53   << 0.0001
     Year Passed (post 2000)   -0.0886    0.0157     -5.63   << 0.0001
     Banned Chickens           -0.2318    0.0878     -2.64      0.0134
     Percent Religious          0.0475    0.0047     10.06   << 0.0001
    

    ... and the resulting graphic comparing the CLM and the GLM methods is

    Figure \(\PageIndex{1}\): A plot of the predictions across various values of religiosity comparing the two models: CLM and GLM. Note that while the two results tables provided different results, the prediction plots are quite close together. The curves would have been equal only if we were to use the canonical link and the Gaussian distribution. For the predictions, the year was 2010 and the ballot measure also banned chickens.

    Now, summary(mod3) provides many results. Note that all three independent variables are more statistically significant than in the non-transformed model. Also note that the effect directions are the same as before.

    Finally, note that these parameter estimates are not the same as those where we used the Classical Linear Model with a logit transformation to fit the data in Section 8.1: The Issue of Boundedness. However, if we make predictions, we see that the results are very close (Figure \(\PageIndex{1}\)). CLMs and GLMs give identical results only with the Gaussian distribution and its canonical link. Here, we used the logit link.
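    With a logit link, a prediction on the response scale requires back-transforming the linear predictor through the inverse logit (the logistic function). As an illustration, assuming the same Děčín inputs as in the earlier example (yearPassed = 9, chickens = 0, religPct = 48) and the coefficients from the logit-link table above:

    ```python
    import math

    # Děčín inputs (assumed from the earlier example): yearPassed = 9,
    # chickens = 0, religPct = 48. Coefficients from the logit-link table.
    eta = -1.8909 - 0.0886 * 9 - 0.2318 * 0 + 0.0475 * 48   # linear predictor
    prop_win = 1 / (1 + math.exp(-eta))                      # inverse logit
    print(round(prop_win, 3))  # about 0.424, close to the CLM's 42%
    ```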

    Let us now re-examine Example 8.1: The Issue of Boundedness from Chapter 8: Fixing the Violations. Recall that in that example, we were modeling a variable that was bounded below, but not above. This led us to transform the dependent variable using the logarithm function. Here, we fit the model with the Gaussian distribution and another non-canonical link.

    Example \(\PageIndex{2}\): Wealth and Corruption

    The gross domestic product (GDP) per capita is one of many measures of average wealth in countries. If extant theory is correct, then the wealth in the country is directly affected by the level of honesty in the government --- countries with high levels of honesty (low levels of corruption) should be wealthier than those with low levels of honesty (high levels of corruption).

    Furthermore, if theory is correct, the level of democracy in a country should also influence the country's level of wealth --- countries with higher levels of democracy should be wealthier than countries with low levels of democracy.

    Let us determine whether reality (using the data in the gdpcap data file) supports the current theory, or whether the theory needs revision to explain any severe discrepancies.

    Solution.
    The process of fitting this model with a GLM should be becoming routine by now, as it is so similar to fitting with a CLM. The R command is

    m2 = glm(gdpcap ~ democracy + hig,
             family=gaussian(link = make.link("log")))
    

    To see the results, we perform a summary call. The results of that call are provided in the regression table below. Note that both independent variables are highly significant at the usual level of significance. Furthermore, the effect directions are the same as in the CLM model. Here is the regression table.

                            Estimate    Std.Err    t-value     p-value
    Constant Term             8.1595     0.1546      52.77   << 0.0001
    Level of Democracy       -0.0452     0.0061      -7.44   << 0.0001
    Honesty in Government     0.3335     0.0219      15.20   << 0.0001
    

    The graphic is

    Figure \(\PageIndex{2}\): A plot of the two prediction curves, corresponding to the model fit using the Classical Linear Model and the Generalized Linear Model. Note that the two prediction curves are similar, but not really that close for large values of honesty in government. Estimates for Papua New Guinea are shown with the two circle-x symbols.

    For some link functions, R allows you to skip the make.link portion. The log link is one of those for the Gaussian. Thus, the following command would also work:

    glm(gdpcap ~ democracy + hig, family=gaussian(link="log"))
    

    I recommend writing it out in full; doing so helps those who follow you interpret what you are doing.

    To predict the GDP per capita for Papua New Guinea, we repeat the same steps as when we were fitting CLMs: predict, then back-transform. Thus, a prediction statement will be

    PNG = data.frame(hig=2.1, democracy=10)
    exp(predict(m2, newdata=PNG))
    

    The predicted GDP per capita for Papua New Guinea was $2678 when fitted with the CLM. For this GLM model, the prediction is $4481. Thus, the prediction for Papua New Guinea is higher using GLMs than when using CLMs. Looking at the prediction graph (Figure \(\PageIndex{2}\)), we see that GLM predictions are lower than CLM predictions for certain values of the dependent variable (and larger for others).
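    We can verify the back-transformed GLM prediction by hand. A Python sketch of the arithmetic, using the coefficients from the regression table and the Papua New Guinea values (hig = 2.1, democracy = 10):

    ```python
    import math

    # Papua New Guinea values from the prediction statement: hig = 2.1, democracy = 10.
    # Coefficients from the log-link regression table above.
    eta = 8.1595 - 0.0452 * 10 + 0.3335 * 2.1   # prediction on the link (log) scale
    gdp = math.exp(eta)                          # back-transform to the dollar scale
    print(round(gdp))  # about 4482; matches the reported $4481 up to coefficient rounding
    ```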


    This page titled 14.5: Generalized Linear Models in R is shared under a CC BY-NC-SA 4.0 license and was authored, remixed, and/or curated by Ole Forsberg.
