16.3: O Canada!
Let our first full example concern finding Canada on a world map. The research question is:
What is the relationship between the type of course being taken, the average age of the students in the class, and the proportion of students who can locate Canada on a map?
According to the wording, the measurement unit is the classroom; that is, each record represents a classroom. For each classroom, four measurements are taken: number of students in the class (trials), number of students who could locate Canada on the map (successes), type of class (independent variable), and average age of students in the class (independent variable).
The pseudodata are located at: http://rur.kvasaheim.com/data/ocanada.csv
Modeling Steps
The first step is to determine the linear predictor. The two independent variables are the course type taught to the class, course, and the average age of the students in the class, averageAge.
Thus, the linear predictor is:
\begin{equation*}
\eta = \beta_0 + \beta_1 \texttt{course} + \beta_{2} \texttt{averageAge}
\end{equation*}
The second step is to determine the conditional distribution of the dependent variable. Here, note that the dependent variable is the number of students in the class who can locate Canada on the map. This random variable is the number of successes out of a total number of trials (the number of students in the course). Because of this, the dependent variable may follow a Binomial distribution.
\begin{equation*}
Y \mid x \sim \text{Binomial}\left(n,\ \pi\right)
\end{equation*}
Note that it is likely that the observations are not independent: students in the same class will tend to be more similar in terms of geographic knowledge. Thus, while the Binomial distribution is likely, we may need to use a different estimation method than maximum likelihood to take this dependence (a.k.a. clumping) into consideration.
The third, and final, aspect of a generalized linear model that needs to be specified is the link function. This is the function that links the ranges of the distribution's expected value and the unbounded linear predictor, \(\eta\).
Before we can move ahead, we need to think about this expected value. From elementary statistics, we know \(\mathrm{E}[Y \mid x] = n\pi\) for a Binomial distribution. However, because \(n\) varies from classroom to classroom, and because we really care about the proportion of students who can find Canada on the map, it is always preferable to "divide out by \(n\)" and focus on modeling \(\pi\), our parameter of interest. This is what actually happens in the estimation procedure.
Back in the section on the canonical link for the Binomial, we decided that the canonical link was the logit function. In fact, any function that maps \((0,1) \to \mathbb{R}\) would be appropriate. Using the logit function, we now have:
\begin{equation*} \mathrm{logit}(\pi) = \beta_0 + \beta_1 \texttt{course} + \beta_{2} \texttt{averageAge} + \varepsilon \end{equation*}
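To make the link concrete: the logit function is \(\mathrm{logit}(\pi) = \ln\big(\pi/(1-\pi)\big)\), and its inverse, the logistic function, maps any real \(\eta\) back into \((0,1)\). A quick numeric sketch (in Python rather than the chapter's R, purely to illustrate the mapping):

```python
import math

def logit(p):
    # maps a proportion in (0, 1) to the entire real line
    return math.log(p / (1 - p))

def logistic(eta):
    # inverse of the logit: maps any real number back into (0, 1)
    return 1 / (1 + math.exp(-eta))

print(logit(0.5))                      # 0.0 -- a 50/50 proportion maps to zero
print(round(logistic(logit(0.9)), 6))  # 0.9 -- the logistic undoes the logit
```

Any other strictly increasing function with the same domain and range (probit, cauchit) would serve the same structural role.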
Furthermore, to increase the evidence that our basic model is appropriate, we should fit with multiple link functions (see the sections on alternative link functions for logistic and binomial regression). Thus, we will also fit the model with the probit and cauchit link functions.
Putting it Together
Note that the three parts combine to make the following distribution statement:
\begin{equation*}
Y_i \sim \text{Binomial}\left( n_i,\ \underbrace{\text{logistic}\left(\beta_0 + \beta_1 \text{course}_i + \beta_{2} \text{averageAge}_i + \varepsilon_i\right)}_{ \pi_i} \right)
\end{equation*}
The importance of this equation is that it shows how the three parts fit together into a coherent whole.
The Models
Here is how to fit the model in R using these three link functions:
depVar = cbind(correct, classSize - correct)

modL = glm( depVar ~ course + averageAge,
            family=binomial(link=make.link("logit")) )
modP = glm( depVar ~ course + averageAge,
            family=binomial(link=make.link("probit")) )
modC = glm( depVar ~ course + averageAge,
            family=binomial(link=make.link("cauchit")) )
Note that the first thing we need to do is specify both the number of successes and the number of failures. This is the variable depVar, the successes (correct) stacked against the failures (classSize-correct) as a single dependent variable.
Closely study the three modeling statements. In them you will find the linear predictor, the conditional distribution, and the link function.
The Assumptions
Before reading this section, read back through the section on the mathematics of binomial regression and determine the assumptions. Once you have them, compare your list to the following:
The first assumption is that the conditional distribution of the dependent variable is the Binomial distribution. The section on the Binomial distribution provides the five requirements for a Binomial distribution. Usually, it will be quite easy to meet requirements 1, 2, and 5. The other two requirements may, or may not, be met by the data.
Other requirements/assumptions follow:
Variance (Dispersion)
At this point, the usual result of violating Requirements 3 and/or 4 is to have the variance of the variable not be what is expected. Recall the value of \(a(\phi)\) in the exponential form of the Binomial is just 1. That means that if the random variable exactly follows a Binomial distribution, then the data's dispersion is approximately \(1\). To calculate the observed dispersion parameter, divide the residual deviance by the residual degrees of freedom.
It can be shown that the residual deviance asymptotically follows a Chi-square distribution with \(n-p\) degrees of freedom:
\begin{equation*}
D \sim \chi^2_{n-p}
\end{equation*}
As such, we can create a test for overdispersion… or for "non-unit dispersion." Running summary(modL) gives the residual deviance as \(197.56\) and the residual degrees of freedom as \(\nu=15\). The ratio for this sample is \(197.56/15 = 13.17067\). Do we have sufficient evidence that the ratio differs from 1 for the population?
Well, the usual 95% interval for the residual deviance, given that the population dispersion is 1, runs from \(6.26\) to \(27.49\):
qchisq(c(0.025,0.975), df=15)
Since the observed value of 197.56 is above this interval, we can conclude that there is significant evidence of overdispersion (see Figure \(\PageIndex{2}\) above).
Note that we are working with both ratios (dispersions) and totals (deviances) in this discussion. Knowing the difference in terminology is important to understanding. The statement "the population dispersion is 1" means that the ratio is 1; that is, there is one unit of deviance for each degree of freedom. On the total-deviance scale, the null value is therefore the number of degrees of freedom. That is why \(n-p\) is within the interval while 1 is not.
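The dispersion arithmetic above is a single division. A quick check (in Python rather than the chapter's R, only to verify the numbers quoted above):

```python
# values quoted above from summary(modL)
residual_deviance = 197.56
residual_df = 15

# observed dispersion = residual deviance / residual degrees of freedom
dispersion_ratio = residual_deviance / residual_df
print(round(dispersion_ratio, 5))   # 13.17067, far above the null ratio of 1
```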
What do we do when there is evidence of overdispersion?
That depends on the effects of overdispersion. At this level, the usual effect of overdispersion is that the variance estimate is not correct. In maximum likelihood estimation, the usual method for estimating all parameters in generalized linear models, the dispersion is required to be 1 in the Binomial case because the value of \(a(\phi)=1\). The maximum quasi-likelihood method includes an additional variable in the denominator, thus changing it from \(a(\phi)\) to \(z\, a(\phi)\). In the case where \(a(\phi)\) is not a constant, such as the Gaussian, this addition offers absolutely nothing. Estimating \(a(\phi)\) from the data gives no information above estimating \(z\, a(\phi)\) from the data.
However, when \(a(\phi)\) is a constant, such as with the Binomial (and the Poisson), including \(z\) allows more flexibility in the model. Make no mistake, if \(z \ne 1\) then the actual distribution is not Binomial. The inclusion of \(z\) allows for distributions that are seemingly "the same" as the Binomial with the exception of the variance.
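Put differently, under the quasi-Binomial the mean is still \(n\pi\), but the variance is inflated from \(n\pi(1-\pi)\) to \(z\,n\pi(1-\pi)\). A small numeric illustration (Python; the class size and \(\pi\) here are made up for the example, while the \(z\) is the dispersion estimate reported for this model):

```python
def binomial_variance(n, p):
    # variance of a true Binomial(n, p) random variable
    return n * p * (1 - p)

def quasibinomial_variance(n, p, z):
    # quasi-likelihood inflates the Binomial variance by the factor z
    return z * binomial_variance(n, p)

n, p = 30, 0.6    # hypothetical classroom of 30 students with pi = 0.6
z = 11.95446      # dispersion estimate for the quasi-Binomial fit

print(binomial_variance(n, p))                    # 7.2 under a true Binomial
print(round(quasibinomial_variance(n, p, z), 2))  # about 86.07 with overdispersion
```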
Fitting the model using maximum quasi-likelihood is almost as easy as fitting it using maximum likelihood estimation. The only difference is adding quasi to the family name:
QmodL = glm( depVar ~ course + averageAge,
             family=quasibinomial(link=make.link("logit")) )
Compare the summary of this model, summary(QmodL), with the summary from the MLE model, summary(modL). Notice that the effect estimates are identical. The difference is in the estimate of the standard errors, … which affects the estimates of the test statistics, … which affects the estimates of the p-values.
When the dispersion is greater than one, the p-values estimated using maximum likelihood are too low because they assume \(a(\phi) = 1\). When the dispersion is less than one (a rare event), the p-values estimated using maximum likelihood are too high.
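Concretely, R scales the maximum-likelihood covariance matrix by the estimated dispersion, so each quasi-likelihood standard error is the corresponding ML standard error multiplied by \(\sqrt{\hat{\phi}}\). A quick sketch (Python; the standard error used is the one reported for averageAge in the quasi-Binomial summary):

```python
import math

dispersion = 11.95446   # estimated dispersion for the quasi-Binomial fit

# quasi-likelihood SE = ML SE * sqrt(dispersion), so the inflation factor is:
inflation = math.sqrt(dispersion)
print(round(inflation, 4))      # about 3.4575

quasi_se = 0.3002               # averageAge standard error under quasi-likelihood
ml_se = quasi_se / inflation    # recover the (smaller) ML standard error
print(round(ml_se, 4))          # about 0.0868
```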
The abbreviated output from the summary(QmodL) function is:
Coefficients:
                        Estimate Std. Error t value Pr(>|t|)
(Intercept)              -9.9030     6.2123  -1.594   0.1318
courseHumanities         -0.1049     1.0654  -0.098   0.9229
coursePhysical Science    1.2328     1.2901   0.956   0.3544
courseSocial Science     -0.9069     1.0270  -0.883   0.3911
averageAge                0.5342     0.3002   1.780   0.0954

(Dispersion parameter for quasibinomial family taken to be 11.95446)

    Null deviance: 276.86  on 19  degrees of freedom
Residual deviance: 197.56  on 15  degrees of freedom
AIC: NA

Number of Fisher Scoring iterations: 5
Most of the output table should be familiar by now. Note that the "Number of Fisher Scoring iterations" is not important at this point. This is the number of "loops" through the estimating procedure needed until the computer determined the estimates converged.
Since we are fitting this generalized linear model using maximum quasi-likelihood estimation, there is little useful information contained in the deviances. You could use them to calculate a measure similar to the infamous \(R^2\) value. Recall that the \(R^2\) value is a PRE (proportional reduction in error) measure. Understanding what a PRE measure is supposed to measure allows us to create a formula for one:
\[ \text{pseudo-}R^2 = 1 - \frac{\text{residual deviance}}{\text{null deviance}} \]
Does this equation make sense to you? The better you understand the meanings of the terms, the better your understanding of this equation. Make sure you realize that these equations are not unique. There are other ways of measuring both "before uncertainty" and "after uncertainty." Different measures of each will lead to different equations for the PRE.
This is why there are so many PRE measures. Typically, specific measures are traditional/expected for specific models. Usually, the only thing special about those PRE measures is that they came first. Regardless, including them in a report is expected.
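With the deviances reported in the summary above (null deviance 276.86, residual deviance 197.56), this particular pseudo-\(R^2\) takes one line to compute. A quick check (Python, just to verify the arithmetic):

```python
# deviances from summary(QmodL), quoted above
null_deviance = 276.86      # "before" uncertainty: intercept-only model
residual_deviance = 197.56  # "after" uncertainty: the fitted model

# PRE measure: proportional reduction in (deviance-measured) error
pseudo_r2 = 1 - residual_deviance / null_deviance
print(round(pseudo_r2, 4))   # 0.2864
```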
From the regression table, we can tell that there is no relationship detected between the average age in the class and the proportion of students correctly finding Canada on a map (\(p=0.0954\)). Thus, should we be in the model selection phase, we should drop this variable and fit the new model.
Is the course-type variable statistically significant? From this table, we cannot tell. Each of the three lines in the regression table devoted to that variable actually specifies the level's effect with respect to some base category ("Arts" here). That means the table can only tell us if one level's effect is statistically different from the effect of the Arts level. We cannot tell whether the variable is significant as a whole.
Think through that last paragraph. It gives you a hint on how to determine if a categorical variable significantly affects the dependent variable.
If the variable is significant in modeling the dependent variable, then removing it from the model will produce a significantly worse model. So… how do we measure "significantly worse"? There are many ways. One is to compare the AIC or BIC values between the two models. Note that both AIC and BIC depend on the value of the likelihood. When using maximum quasi-likelihood estimation, there is no likelihood (there is only quasi-likelihood).
We could also use the likelihood ratio test. This test relies on the asymptotic distribution of the difference in deviances. Thus, it is great for large-sample cases. How large? It depends on how closely the deviances follow the Chi-square distribution… which means (at its core) this test relies on the Central Limit Theorem. From experience, as long as no one will die from you being wrong, a sample size in excess of 50 will suffice in most cases. For a Binomial random variable, if \(n\pi\) and \(n(1-\pi)\) are both at least \(15\), things will tend to be acceptable. When in doubt, make sure you check this assumption for your specific case.
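That rule of thumb is simple enough to automate. A small helper (Python; the threshold of 15 is the guideline suggested above, not a universal constant):

```python
def lrt_rule_of_thumb_ok(n, pi, threshold=15):
    # The chi-square approximation behind the likelihood ratio test tends to
    # be acceptable when the expected successes (n * pi) and the expected
    # failures (n * (1 - pi)) both reach the threshold.
    return n * pi >= threshold and n * (1 - pi) >= threshold

print(lrt_rule_of_thumb_ok(100, 0.5))  # True:  50 expected successes, 50 failures
print(lrt_rule_of_thumb_ok(40, 0.9))   # False: only 4 expected failures
```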
Performing a likelihood ratio test is as easy as fitting the full model, fitting the reduced (constrained) model, and testing whether the deviances are significantly different:
QmodLreduced = glm( depVar ~ averageAge,
                    family=quasibinomial(link=make.link("logit")) )
anova(QmodLreduced, QmodL, test="LRT")
After running these lines, we have the following output:
Analysis of Deviance Table

Model 1: depVar ~ averageAge
Model 2: depVar ~ course + averageAge
  Resid. Df Resid. Dev Df Deviance Pr(>Chi)
1        18     246.20
2        15     197.56  3   48.635   0.2542
Since the p-value is greater than our \(\alpha=0.05\), there is no significant difference between the two models. This means that including course does not significantly improve our model (\(\mathrm{p-value}=0.2542\)). If we are trying to create our final model, we would drop course from consideration and deal with this reduced model (exploratory analysis). If we are testing whether the course matters in being able to locate Canada on the map, we would have to conclude that there is no significant evidence that it does (confirmatory analysis).
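Because the model was fit by quasi-likelihood, the statistic behind that p-value is the drop in deviance divided by the dispersion estimate, referred to a Chi-square distribution with 3 degrees of freedom (one per course dummy variable). The arithmetic can be verified with only the Python standard library (the closed-form survival function below is valid only for exactly 3 degrees of freedom):

```python
import math

def chi2_sf_df3(x):
    # Survival function of a chi-square with exactly 3 degrees of freedom:
    # Q(3/2, x/2) = erfc(sqrt(x/2)) + sqrt(2x/pi) * exp(-x/2)
    return math.erfc(math.sqrt(x / 2)) + math.sqrt(2 * x / math.pi) * math.exp(-x / 2)

drop_in_deviance = 48.635   # 246.20 - 197.56, from the anova output above
dispersion = 11.95446       # quasi-Binomial dispersion estimate

statistic = drop_in_deviance / dispersion
p_value = chi2_sf_df3(statistic)
print(round(statistic, 2))  # about 4.07
print(round(p_value, 3))    # about 0.254, matching the anova output
```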
There are other testing options in the anova function. There are also other ways of performing the likelihood ratio test. In very large sample sizes, the tests will return the same substantive conclusions. In smaller sample sizes, the conclusions may differ. The anova function uses an unbiased estimator of the variance, while others may use biased estimators. Thus, it may be better to use the anova function… in general.
Model Fit
The second assumption is that the “curve of best fit” consistently fits the data. This is the same assumption we met back in the section on the constant expected value assumption for linear models. Its tests are the same as always… no change, whatsoever. Thus, the code I would run to check that the link function is appropriate is:
eL = residuals(modL)
runs.test(eL, order=averageAge)
summary(aov(eL ~ course))
The first line calculates the residuals. The second tests the residuals against the numeric variable; the third, against the categorical variable. In neither case is the p-value less than our \(\alpha=0.05\). In fact, neither are even close to \(0.05\). Thus, there is no evidence that the link we used, the logit link, is incorrect.
Since the p-value for the tests of model fit are greater than \(\alpha=0.05\) and since we adjusted for overdispersion, this is an acceptable model. The results of fitting this model using the logit link are given by summary(QmodL). From that model and the above analysis, we can make the following conclusions:
- There is no significant evidence that the average age in the class has an impact on the class's ability to identify Canada on the map.
- There is also no significant evidence that the class type affects the ability to identify Canada.
In other words, we have no new information on what influences a class's ability to find Canada on the map. It could be that there is no relationship in the population, or it could be that our sample is too small to detect it. There is no way for us to know with this data.
Ethicality
Were we unethical, we could create a graphic to illustrate our non-results (see Figure \(\PageIndex{3}\) below). This graphic strongly indicates that those in physical science classes are much more adept at geography than those in social science classes. Note how much higher the blue curve is than the red curve.
However, the figure below is misleading — it is very misleading. The statistics give us no evidence that the abilities differ between those in those two types of classes. The graphic lies by suggesting such a difference exists.
As an ethical statistician, be aware that your graphics must illustrate what the statistics actually tell us. They should not suggest that which does not exist. Similarly, they should not minimize an effect that does exist. A statistician needs to use graphics to tell the story of the data — and nothing else. This is not as easy as it seems. Violations of ethics happen when you violate these precepts by design. Violations of these tenets may still happen by accident. To avoid claims that you are being unethical, make sure you are clear on your conclusions and why you are making those specific conclusions.
It is possible that two ethical researchers come to different conclusions. It is unlikely that those conclusions are substantively different. If this happens, then ethical researchers will find the results interesting and seek to understand why the models produced such different results. I would argue that this is the hallmark of scientists. When proven wrong, an ethical scientist will try to better understand the phenomenon to explain the differences in the conclusions.
Remember this: The choice is always between humility and humiliation.
We could actually obtain stronger results if we had data at the student level. This would allow us to forego aggregation and measure the relationship of interest (what allows students to find Canada on the map) directly.
The vast majority of the time, this is a true statement: We are interested in relationships at the individual level, but our data only exists at some level of aggregation.
If we are careful with our wording, this aggregation results in no more than a loss of power. If we are not careful, then we are drawing conclusions not warranted by the model. Strictly speaking, we can only draw clear conclusions on our measurement unit (classes in this example). However, we can say that our conclusions are consistent with hypotheses about the unit of interest (students in this example).
[For the following interpretation, let us pretend our variables were significant. I am doing this to illustrate how to write out a conclusion.]
We can conclude:
Physical Science classes with older students tend to do better at locating Canada on the map.
How boring and unconnected with a really interesting question about the students themselves.
We could also conclude:
The results are consistent with the hypothesis that Physical Science majors who are older tend to be better able to locate Canada on the map.
Much better, but less decisive. To be "consistent with" is logically quite weak. There are many results "consistent with" any hypothesis. If relying on consistency, your theory needs to be very strong… or your results very interesting.
The best of both worlds is having data at the individual level. Unfortunately, that is not always available. As such, we do need to understand the limitations on our conclusions due to the limitations of our data.
Again, know what you can conclude and conclude no farther.


