
15.7: Model Selection


Which of the models is best? That is a model selection question. Model selection procedures (not tests) attempt to balance the competing desires of accuracy and parsimony to create the "best" model, by some standard. For linear models, we discussed the \(R^2\) value as a measure of accuracy. We noted, however, that adding variables to the model can never decrease the \(R^2\) value and will usually increase it. Thus, there is pressure to increase the number of variables. Science, however, is guided by the philosophy of William of Occam and his Razor:

    Numquam ponenda est pluralitas sine necessitate.

    Plurality must never be posited without necessity. That is, models should be as simple as possible, but no simpler. In other words, as scientists, we should only include variables if the theory warrants it.

    Caution

    Make no mistake, all models are wrong. As scientists, we are merely searching for the useful ones.

In linear regression, we corrected for the pressure to keep adding variables by using the adjusted \(R^2\) as a guide. This value penalizes the model for the number of variables it includes. Thus, unless a variable is statistically significant, there is little benefit to adding it to the model. This is why many scientists use the adjusted \(R^2\) measure to help them select models.

There is neither a true \(R^2\) nor a true \(\bar{R}^2\) value for discrete dependent variable models. Thus, there has been much work on creating an appropriate measure to use for model selection. Three measures are frequently used in the literature: the Akaike Information Criterion (Akaike 1974), the Bayesian Information Criterion (Schwarz 1978), and the Likelihood Ratio Test (Wilks 1938). Each of the three penalizes additional variables in a different manner and to a different degree. Which one you use depends on what is available to you and on the relationship between the two models.

    Akaike Information Criterion

One of the first attempts to explicitly penalize additional parameters (variables) was made by Hirotugu Akaike (1974). In his paper, he developed (albeit without much mathematical rigor) a comparative measure of "model goodness" that can be used to select the better of two models based on the log-likelihood. The Akaike Information Criterion (AIC) score can be calculated whenever Maximum Likelihood Estimation is used to estimate the model parameters. The formula for the AIC is

    \[ \text{AIC} \stackrel{\text{def}}{=} - 2 \ln(\mathcal{L}) + 2k \]

Here, \(k\) is the number of parameters being estimated in the model and \(\mathcal{L}\) is the likelihood of the data under the model.

    The quantity \(-2 \ln(\mathcal{L})\) is often called the deviance of the model, which will be used in the section on the Likelihood Ratio Test.

The procedure to determine whether one model is better than the other is straightforward (a short R sketch follows the list):

    1. Calculate the AIC for Model A.
    2. Calculate the AIC for Model B.
    3. The model with the lower AIC score is the preferred model.
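
As a minimal sketch of this comparison, assuming simulated data and hypothetical variable names (the data frame d and the variables x1, x2, and y are invented purely for illustration), consider two binomial models that differ only in their link function:

    # Simulate a small binary data set (hypothetical illustration).
    set.seed(1)
    d <- data.frame(x1 = rnorm(100), x2 = rnorm(100))
    d$y <- rbinom(100, size = 1, prob = plogis(0.5 * d$x1 - d$x2))

    # Fit the same specification under two different link functions.
    modA <- glm(y ~ x1 + x2, data = d, family = binomial(link = "logit"))
    modB <- glm(y ~ x1 + x2, data = d, family = binomial(link = "cloglog"))

    # Steps 1 and 2: calculate the AIC for each model.
    AIC(modA)
    AIC(modB)

    # Step 3: the model with the lower AIC score is preferred.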

Its simplicity is its strength. Its weakness is that this procedure, called minimum AIC estimation (MAICE) in the paper, relies on a score with no known probability distribution. As such, there is no way to determine whether the model with the lower AIC is sufficiently better to justify eliminating the other from the discussion: if the AIC of Model 1 is 3 less than the AIC of Model 2, do we completely ignore Model 2?

    This question actually leads to several "rules of thumb" that determine when that difference is "large enough." The usual rules of thumb are to drop the model with the higher AIC if the difference is at least 5 (or 8 or 10).

That there is no a priori statistical distribution for the AIC score only means the procedure is not optimal. In his paper, Akaike concurs (1974: 722):

    Although the present author has no proof of optimality of MAICE it is at present the only procedure applicable to every situation where the likelihood can be properly defined and it is actually producing very reasonable results without very much amount of help of subjective judgment.

The R function that calculates the Akaike Information Criterion is AIC. Using this function, the AIC values for the three coin models are \(\text{AIC}_{\text{logit}} = 118.48\), \(\text{AIC}_{\text{cloglog}} = 119.74\), and \(\text{AIC}_{\text{loglog}} = 117.05\). Thus, while the log-log model is the "best" from the AIC standpoint, it is not sufficiently better to justify completely ignoring the other two (the AIC improvement is not greater than 5). As such, this procedure is inconclusive with respect to choosing a single model.

    Be Aware!

Please keep in mind that for the AIC to be valid in comparing models, the dependent variable values must be exactly the same across the models. If they are not, then neither this process nor any of the other methods in this section can be used. Thus, the AIC cannot be used if the dependent variable has been transformed in one model but not the other, nor if data points have been removed between the two models.

Keeping multiple appropriate models is a good idea. Since sufficient scientific theory rarely exists to determine the "right" link, we should keep as many plausible links as possible. This allows us to better understand how much our conclusions depend on our choice of the link function.

    Bayesian Information Criterion

Akaike's paper did not give a mathematically solid reason why there should be a 2-point penalty for each additional estimated parameter (the \(2k\) term). This created an opening for other researchers to improve upon Akaike's derivation and to propose different penalty factors. Schwarz (1978) took Akaike's idea and put it on a more solid foundation. He humbly called his measure the Bayesian Information Criterion (BIC); others refer to it as the Schwarz Information Criterion (SIC) or the Schwarz Bayesian Criterion (SBC).

    Its formula is quite similar to the AIC:

\[ \text{BIC} \stackrel{\text{def}}{=} - 2 \ln(\mathcal{L}) + k \ln(n) \]

Here, \(k\) is the number of parameters being estimated, \(n\) is the number of data points, and \(\mathcal{L}\) is the likelihood of the data under the model. Thus, the difference between the AIC and the BIC is the size of the penalty per parameter. In the AIC, each additional parameter penalizes the score by 2 points; in the BIC, by \(\ln(n)\) points, which is usually a much greater penalty (whenever \(n > e^2 \approx 7.4\)).

The process to select the better of two models is the same as for the AIC: select the model with the lower BIC score (the same rules of thumb apply). Furthermore, the requirement that the dependent variable values be identical across the models remains.
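
As a minimal sketch, the BIC can be computed in R either directly from the formula or with the built-in BIC function. Here, modA and modB are the hypothetical glm fits from the AIC sketch above; for a binomial glm, the parameter count equals the number of estimated coefficients:

    # BIC by hand from the formula (log() in R is the natural log).
    k <- length(coef(modA))
    n <- nobs(modA)
    -2 * as.numeric(logLik(modA)) + k * log(n)

    # The built-in function gives the same value.
    BIC(modA)
    BIC(modB)   # as with the AIC, the lower score is preferred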

    Likelihood Ratio Test

Frequently, we wish to determine whether a group of variables is jointly significant in the model. To do this, we compare two nested models. We say that Model B is nested in Model A if Model A contains all of the variables in Model B, plus at least one more. For instance, let Model A contain the variables \(X1\), \(X2\), \(X3\), \(X3^2\), and \(X4\), and let Model B contain the variables \(X1\), \(X2\), and \(X3\). Here, Model B is nested within Model A. Now, if we want to determine whether \(X3^2\) and \(X4\) are jointly significant, we merely compare Models A and B. To do this, we could use the AIC or the BIC, but the Likelihood Ratio test is statistically cleaner.

The Likelihood Ratio test, when it can be used, is superior to the AIC and the BIC because there is a known asymptotic probability distribution for its test statistic. As such, we can determine whether Model A is significantly better than Model B; that is, whether the variables \(X3^2\) and \(X4\) are jointly significant.

The procedure is also straightforward:

    1. Calculate the deviance for Model A.
    2. Calculate the deviance for Model B.
    3. The difference between the two deviances is distributed asymptotically as a chi-squared random variable with degrees of freedom equal to the parameter (variable) difference in the two models.

    The deviance of a model is defined as

    \[ D \stackrel{\text{def}}{=} - 2 \ln(\mathcal{L}) \]

    Thus, if Model B is nested in Model A, the test statistic is equal to

    \[ \text{TS} \stackrel{\text{def}}{=} D_B - D_A \sim \chi^2_{v_A - v_B} \]

    Here, \(v_A\) is the number of parameters in Model A; \(v_B\), in Model B.
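
As a sketch of this test in R, assuming two hypothetical nested binomial fits built from the earlier simulated data (fitB drops x2 from fitA), the deviance difference can be computed by hand or with anova, which performs the same chi-squared comparison:

    # Hypothetical nested fits: fitB is nested within fitA.
    fitA <- glm(y ~ x1 + x2, data = d, family = binomial)
    fitB <- glm(y ~ x1,      data = d, family = binomial)

    # glm's residual deviance differs from -2 ln(L) by a saturated-model
    # term, but that term cancels in the difference, so this is D_B - D_A.
    TS <- deviance(fitB) - deviance(fitA)
    df <- fitB$df.residual - fitA$df.residual
    pchisq(TS, df = df, lower.tail = FALSE)   # p-value of the test

    # Equivalently, in a single step:
    anova(fitB, fitA, test = "Chisq")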

    Example \(\PageIndex{1}\)

    Let us assume that Model A uses three variables, \(X1\), \(X2\), and \(X3\), and has a log-likelihood of -20, and Model B uses one variable, \(X1\), and has a log-likelihood of -22. Are variables \(X2\) and \(X3\) jointly significant?

    Solution.
    This is a direct application of the Likelihood Ratio test. The test statistic is

\begin{align}
\text{TS} &\stackrel{\text{def}}{=} D_B - D_A \\[1em]
&= \big(-2 \ln(\mathcal{L}_B)\big) - \big(-2 \ln(\mathcal{L}_A)\big) \\[1em]
&= \big(-2(-22)\big) - \big(-2(-20)\big) \\[1em]
&= 44 - 40 = 4
\end{align}

    This test statistic is approximately distributed as a chi-squared random variable with \(3-1=2\) degrees of freedom; that is, \(\text{TS} \stackrel{\text{a}}{\sim} \chi^2_2\).

    A chi-squared table gives us a p-value of approximately \(p=0.15\). This is close to what R gives us:

pchisq(4, df = 2, lower.tail = FALSE)   # 0.1353353
    

Thus, at the \(\alpha = 0.05\) level, we fail to reject the null hypothesis: the restricted model is not significantly different from the full model.

That is, we conclude that the two variables are not jointly significant and that we can use Model B in lieu of Model A with little loss of precision.
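
The arithmetic in this example is easily checked in R, plugging in the log-likelihood values from the problem statement:

    llA <- -20   # log-likelihood of Model A
    llB <- -22   # log-likelihood of Model B
    TS  <- (-2 * llB) - (-2 * llA)          # 44 - 40 = 4
    pchisq(TS, df = 2, lower.tail = FALSE)  # 0.1353353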


    This page titled 15.7: Model Selection is shared under a CC BY-NC-SA 4.0 license and was authored, remixed, and/or curated by Ole Forsberg.
