
17.1: Linear or Poisson Regression?


    This section addresses a fundamental question when the dependent variable is a count: should one use ordinary least squares (perhaps after a transformation) or a Poisson regression model? Using a simulated dataset with known true parameters, it demonstrates the practical consequences of this choice. The section walks through fitting three competing models — an untransformed OLS model, an OLS model with a log transformation of the response, and a Poisson GLM — and compares their performance. It highlights that while all three models may produce a reasonable fitted curve, only the Poisson model satisfies the key statistical assumptions necessary for valid inference, such as constant variance and appropriate handling of the count nature of the data. The section concludes by emphasizing that model comparison based on information criteria (like AIC) is invalid when the response variable has been transformed differently across models, and that checking assumptions is the primary way to select the appropriate approach.

     

    Learning Objectives

    By the end of this section, you will be able to:

    1. Fit three different models to count data in R: an ordinary least squares model on the raw counts, an ordinary least squares model on a log-transformed response (log(y+1) to handle zeros), and a Poisson regression model using the glm function with family=poisson.
    2. Compare the fitted values from these three models graphically and recognize that while they may produce similar curves, their underlying assumptions and therefore the validity of their inferential statistics (standard errors, confidence intervals, p-values) differ substantially.
    3. Explain why model comparison using information criteria such as AIC or BIC is not valid when the response variable has been transformed differently across models (e.g., raw counts vs. log(y+1)), as these criteria require the likelihood to be computed on exactly the same scale.
    4. Select the Poisson regression model as the most appropriate for count data when the key assumptions of linear regression (normality, constant variance) are violated, and when proper inference about uncertainty is a priority.

     

    ✦•················• ✦ •··················•✦

     

    To illustrate some of these observations, let us create a count dataset, fit it with a simple linear model (with and without a log transformation of the response), fit it with a Poisson model, and then compare the results. The data that we will use for this example, fakepoisson, was fabricated so that we know the parameters. As such, we can compare the estimates we get from the three modeling techniques to the true parameters.

    Here is the code I used to create the fakepoisson data set:

    set.seed(370)                      # make the simulation reproducible
    
    n = 75
    x = sort( runif(n, min=0, max=2) ) # sorted predictor values on [0, 2]
    beta0 = 0                          # true intercept (log scale)
    beta1 = 2                          # true slope (log scale)
    lambda = exp( beta0 + beta1*x )    # Poisson mean for each observation
    
    y = rpois(n, lambda)               # simulated counts
    

     

    By this point, you should be able to determine what each line of code does. You should also take note of how the parameter lambda is defined and keep this in mind as you read forward.

     

    Figure \(\PageIndex{1}\): Plot of the pseudo data with three regression curves overlaid. The linear regression is in red, the linear regression on the log-transformed data is in green, and the Poisson regression is in blue. The black curve is the "correct" curve.

    For this example, the true parameters are \(\tilde{\beta}_0 = 0\) and \(\tilde{\beta}_1 = 2\). Both are on the log scale (the tildes serve as a reminder of this). Because the three models estimate parameters on different scales, it is difficult to compare most of the estimates to the true values directly. It is much easier to compare the prediction curves.

     

    OLS Model (Untransformed)

    The OLS (untransformed) model is easy to fit. However, it does not fit the data well at all (see the figure above). If you perform the three usual numeric tests of the assumptions, you will find all three violated. Yikes!
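A minimal sketch of fitting the untransformed model is below; the model name `model.a` is my label, not one used in the text, and the diagnostic plots are one common way of checking the assumptions visually alongside the numeric tests.

```r
model.a <- lm(y ~ x)    # untransformed OLS fit on the raw counts
summary(model.a)

par(mfrow = c(2, 2))    # the four standard residual diagnostic plots
plot(model.a)
```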

     

    OLS Model (Log-Transformed)

    The transformed OLS model has its own problem. Logically, a log transform would be appropriate, since counts are bounded below (by zero) but not above. However, the dependent variable takes on the value \(0\), where the logarithm is undefined. This means you must either perform an additional adjustment (add 1 to each dependent value) or drop the records with \(y=0\).

    If we add 1 to each dependent value before taking logs (that is, we perform the transformation \(y^* = \log[y+1]\)), we see that there is a lingering issue with heteroskedasticity. Is it ignorable? Perhaps, but let us try Poisson regression.
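The log-transformed fit can be sketched as follows; again, the model name `model.b` is my label. Plotting residuals against fitted values is one simple way to see the lingering heteroskedasticity (a fan shape in the spread).

```r
model.b <- lm(log(y + 1) ~ x)   # OLS on the log(y + 1) scale
summary(model.b)

# residuals vs. fitted values: look for a fanning pattern
plot(fitted(model.b), resid(model.b))
```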

     

    Poisson Model

    Here, we easily perform Poisson regression and check the requirements. What was the code to run the analysis?

    model.c <- glm(y~x, family=poisson)
    summary(model.c)
    

    That's all.

    At this point, you should be able to guess what each line does, and what information is missing from the first line because it is the default: the link function, as we see shortly.
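The default link for `family=poisson` in R is the log link, so the call above is equivalent to writing the link explicitly:

```r
model.c <- glm(y ~ x, family = poisson(link = "log"))
```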

     

    Which is the best model of the three?

    It may not be immediately clear, but these three models cannot be directly compared. The untransformed linear model leaves the response on its original scale, while the log-transformed model and the Poisson model each work with the response on a different scale. Because the responses are not on the same scale, we cannot use information criteria to compare the models.

    So, how can we determine which of these models is best? A first step is to determine which is appropriate by checking the assumptions. The untransformed OLS model violates all three requirements. The transformed OLS model is heteroskedastic. The Poisson model, however, violates none of the requirements. Thus, on the basis of meeting assumptions, the third model is the best of these three.

    If all we care about are the estimates (and not confidence intervals), we could look at the graphic comparing the data and the model predictions (see the figure above). Numerically, we could also check how much the uncertainty in \(y\) has been reduced. The uncertainty using the null model (predicting \(y = \bar{y}\)) is \(184.96\). The uncertainty with the linear model is \(43.6\), a reduction of 76%. The log-transformed linear model has an uncertainty of \(8.1\), a reduction of 96%; this is quite different from the untransformed linear model. The Poisson model also has an uncertainty of \(8.1\), a total reduction of 96%.
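One way such figures could be computed is sketched below. The model names `model.a`, `model.b`, and `model.c` are my labels for the three fits; note that the log-model predictions must be back-transformed to the count scale so all residuals are comparable, and the exact definition of "uncertainty" used in the text may differ (e.g., a mean rather than a sum), so treat this as illustrative only.

```r
sse <- function(obs, pred) sum( (obs - pred)^2 )

sse(y, mean(y))                   # null model: predict the sample mean
sse(y, fitted(model.a))           # untransformed OLS
sse(y, exp(fitted(model.b)) - 1)  # log(y + 1) model, back-transformed
sse(y, fitted(model.c))           # Poisson (fitted values are on the count scale)
```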

    Thus, if all you care about is the point estimate (which scientists should not be), finding some adjustment so that the curve fits the data is enough. If you are a true scientist, then the confidence intervals (and p-values) are important. This means assumptions such as homoskedasticity matter where they apply. Some models require homoskedasticity; others do not, and Poisson regression does not (at least, not in the usual sense).

     

    Remember

    Remember that we can use AIC, BIC, and similar information criteria only when the response values are on the same scale across the models being compared. That is not the case here, as the responses are transformed differently.
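By contrast, when two models do share exactly the same response, information criteria are valid. For example, comparing the Poisson fit against an intercept-only Poisson model (the names here are my own labels) is a legitimate use of AIC:

```r
model.c0 <- glm(y ~ 1, family = poisson)  # intercept-only Poisson model
model.c  <- glm(y ~ x, family = poisson)  # the model from the text

AIC(model.c0, model.c)  # valid comparison: identical response, identical scale
```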

     

     

      


    This page titled 17.1: Linear or Poisson Regression? is shared under a CC BY-NC-SA 4.0 license and was authored, remixed, and/or curated by Ole Forsberg.
