9.2: Model Selection
The best model is not always the most complicated. Sometimes including variables that are not evidently important can actually reduce the accuracy of predictions. In this section, we discuss model selection strategies, which will help us eliminate variables from the model that are found to be less important. It’s common (and hip, at least in the statistical world) to refer to models that have undergone such variable pruning as parsimonious.
In practice, the model that includes all available explanatory variables is often referred to as the full model. The full model may not be the best model, and if it isn’t, we want to identify a smaller model that is preferable.
Identifying variables in the model that may not be helpful
Adjusted \(R^2\) describes the strength of a model fit, and it is a useful tool for evaluating which predictors are adding value to the model, where adding value means they are (likely) improving the accuracy in predicting future outcomes.
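As a reminder, adjusted \(R^2\) applies a penalty for the number of predictors in the model. If \(n\) is the number of observations and \(k\) is the number of predictor coefficients, one common way to write it is

\[R^2_{adj} = 1 - \frac{(1 - R^2)(n - 1)}{n - k - 1},\]

so a new predictor raises \(R^2_{adj}\) only when it improves the fit enough to offset the penalty for the extra coefficient.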
Let’s consider two models, which are shown in Tables [loansFullModelModelSelectionSection] and [loansModelAllButIssued]. The first table summarizes the full model, since it includes all predictors, while the second does not include the issued variable, which records the month each loan was issued.
|  | Estimate | Std. Error | t value | Pr(\(>\)\(|\)t\(|\)) |
| --- | --- | --- | --- | --- |
| (Intercept) | 1.9251 | 0.2102 | 9.16 | \(<\)0.0001 |
| income_ver: source_only | 0.9750 | 0.0991 | 9.83 | \(<\)0.0001 |
| income_ver: verified | 2.5374 | 0.1172 | 21.65 | \(<\)0.0001 |
| debt_to_income | 0.0211 | 0.0029 | 7.18 | \(<\)0.0001 |
| credit_util | 4.8959 | 0.1619 | 30.24 | \(<\)0.0001 |
| bankruptcy | 0.3864 | 0.1324 | 2.92 | 0.0035 |
| term | 0.1537 | 0.0039 | 38.96 | \(<\)0.0001 |
| issued | 0.0276 | 0.1081 | 0.26 | 0.7981 |
| issued | -0.0397 | 0.1065 | -0.37 | 0.7093 |
| credit_checks | 0.2282 | 0.0182 | 12.51 | \(<\)0.0001 |
|  | Estimate | Std. Error | t value | Pr(\(>\)\(|\)t\(|\)) |
| --- | --- | --- | --- | --- |
| (Intercept) | 1.9213 | 0.1982 | 9.69 | \(<\)0.0001 |
| income_ver: source_only | 0.9740 | 0.0991 | 9.83 | \(<\)0.0001 |
| income_ver: verified | 2.5355 | 0.1172 | 21.64 | \(<\)0.0001 |
| debt_to_income | 0.0211 | 0.0029 | 7.19 | \(<\)0.0001 |
| credit_util | 4.8958 | 0.1619 | 30.25 | \(<\)0.0001 |
| bankruptcy | 0.3869 | 0.1324 | 2.92 | 0.0035 |
| term | 0.1537 | 0.0039 | 38.97 | \(<\)0.0001 |
| credit_checks | 0.2283 | 0.0182 | 12.51 | \(<\)0.0001 |
Which of the two models is better? We compare the adjusted \(R^2\) of each model to determine which to choose. The first model’s \(R^2_{adj} = 0.25843\) is smaller than the second model’s \(R^2_{adj} = 0.25854\), so we prefer the second model to the first.
Will the model without issued be better than the model with issued? We cannot know for sure, but based on the adjusted \(R^2\), this is our best assessment.
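Software makes this comparison straightforward. The sketch below is illustrative only: it assumes a pandas data frame named `loans` whose columns match the variable names in the tables above (those names and the file path are placeholders, not the exact names in the original data set), fits both models with statsmodels, and prints their adjusted \(R^2\) values.

```python
import pandas as pd
import statsmodels.formula.api as smf

# Placeholder: load the loan data into a data frame with the columns used below.
loans = pd.read_csv("loans.csv")

# Full model: every available predictor.
full = smf.ols(
    "rate ~ income_ver + debt_to_income + credit_util + bankruptcy"
    " + term + issued + credit_checks",
    data=loans,
).fit()

# Reduced model: the same predictors except `issued`.
reduced = smf.ols(
    "rate ~ income_ver + debt_to_income + credit_util + bankruptcy"
    " + term + credit_checks",
    data=loans,
).fit()

# Prefer whichever model has the larger adjusted R^2.
print("full:   ", round(full.rsquared_adj, 5))
print("reduced:", round(reduced.rsquared_adj, 5))
```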
Two model selection strategies
Two common strategies for adding or removing variables in a multiple regression model are called backward elimination and forward selection. These techniques are often referred to as stepwise model selection strategies, because they add or delete one variable at a time as they “step” through the candidate predictors.
The backward elimination strategy starts with the model that includes all potential predictor variables. Variables are eliminated one at a time from the model until we cannot improve the adjusted \(R^2\). The strategy within each elimination step is to eliminate the variable that leads to the largest improvement in adjusted \(R^2\).
Results corresponding to the full model for the loans data are shown in Table [loansFullModelModelSelectionSection]. How should we proceed under the backward elimination strategy? Our baseline adjusted \(R^2\) from the full model is \(R^2_{adj} = 0.25843\), and we need to determine whether dropping a predictor will improve the adjusted \(R^2\). To check, we fit models that each drop a different predictor, and we record the adjusted \(R^2\) for each:
| Exclude ... | \(R^2_{adj}\) |
| --- | --- |
| income_ver | 0.22380 |
| debt_to_income | 0.25468 |
| credit_util | 0.19063 |
| bankruptcy | 0.25787 |
| term | 0.14581 |
| issued | 0.25854 |
| credit_checks | 0.24689 |
The model without issued has the highest adjusted \(R^2\) of 0.25854, higher than the adjusted \(R^2\) of 0.25843 for the full model. Because eliminating issued leads to a model with a higher adjusted \(R^2\), we drop issued from the model.
Since we eliminated a predictor from the model in the first step, we see whether we should eliminate any additional predictors. Our baseline adjusted \(R^2\) is now \(R^2_{adj} = 0.25854\). We now fit new models, which consider eliminating each of the remaining predictors in addition to issued:
| Exclude issued and ... | \(R^2_{adj}\) |
| --- | --- |
| income_ver | 0.22395 |
| debt_to_income | 0.25479 |
| credit_util | 0.19074 |
| bankruptcy | 0.25798 |
| term | 0.14592 |
| credit_checks | 0.24701 |
None of these models leads to an improvement in adjusted \(R^2\), so we do not eliminate any of the remaining predictors. That is, after backward elimination, we are left with the model that keeps all predictors except issued, which we can summarize using the coefficients from Table [loansModelAllButIssued]:
\[\begin{aligned} \widehat{\text{rate}} &= 1.921 + 0.974 \times \text{income\_ver}_{\text{source\_only}} + 2.535 \times \text{income\_ver}_{\text{verified}} \\ &\qquad + 0.021 \times \text{debt\_to\_income} + 4.896 \times \text{credit\_util} + 0.387 \times \text{bankruptcy} \\ &\qquad + 0.154 \times \text{term} + 0.228 \times \text{credit\_checks} \end{aligned}\]
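The step-by-step search above can be automated. Below is a minimal Python sketch of backward elimination by adjusted \(R^2\), again assuming a `loans` data frame and the illustrative variable names used earlier; it repeatedly drops whichever predictor’s removal most improves the adjusted \(R^2\).

```python
import statsmodels.formula.api as smf

def backward_eliminate(data, response, predictors):
    """Backward elimination using adjusted R^2.

    Start from the model with every predictor and repeatedly drop the single
    predictor whose removal yields the largest adjusted R^2, stopping once no
    removal improves on the current model.
    """
    current = list(predictors)
    best_adj = smf.ols(f"{response} ~ {' + '.join(current)}", data=data).fit().rsquared_adj

    while len(current) > 1:
        # Adjusted R^2 for each model that drops one predictor from `current`.
        candidates = {
            p: smf.ols(
                f"{response} ~ {' + '.join(q for q in current if q != p)}",
                data=data,
            ).fit().rsquared_adj
            for p in current
        }
        drop = max(candidates, key=candidates.get)
        if candidates[drop] <= best_adj:
            break  # no single removal improves the fit
        best_adj = candidates[drop]
        current.remove(drop)
    return current, best_adj

# Hypothetical usage; `loans` and the variable names are placeholders.
# kept, adj_r2 = backward_eliminate(
#     loans, "rate",
#     ["income_ver", "debt_to_income", "credit_util", "bankruptcy",
#      "term", "issued", "credit_checks"])
```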
The forward selection strategy is the reverse of the backward elimination technique. Instead of eliminating variables one at a time, we add variables one at a time until we cannot find any variables that improve the model (as measured by adjusted \(R^2\)).
Construct a model for the loans data set using the forward selection strategy. We start with the model that includes no variables. Then we fit each of the possible models with just one variable. That is, we fit the model including just income_ver, then the model including just debt_to_income, then a model with just credit_util, and so on. Then we examine the adjusted \(R^2\) for each of these models:
| Add ... | \(R^2_{adj}\) |
| --- | --- |
| income_ver | 0.05926 |
| debt_to_income | 0.01946 |
| credit_util | 0.06452 |
| bankruptcy | 0.00222 |
| term | 0.12855 |
| issued | 0.00018 |
| credit_checks | 0.01711 |
In this first step, we compare the adjusted \(R^2\) against a baseline model that has no predictors; the no-predictors model always has \(R_{adj}^2 = 0\). The one-predictor model with the largest adjusted \(R^2\) is the model with the term predictor, and because its adjusted \(R^2\) of 0.12855 is larger than the adjusted \(R^2\) from the model with no predictors (\(R_{adj}^2 = 0\)), we add this variable to our model.
We repeat the process again, this time considering 2-predictor models where one of the predictors is term, with a new baseline of \(R^2_{adj} = 0.12855\):
| Add term and ... | \(R^2_{adj}\) |
| --- | --- |
| income_ver | 0.16851 |
| debt_to_income | 0.14368 |
| credit_util | 0.20046 |
| bankruptcy | 0.13070 |
| issued | 0.12840 |
| credit_checks | 0.14294 |
The best second predictor, credit_util, has a higher adjusted \(R^2\) (0.20046) than the baseline (0.12855), so we also add credit_util to the model.
Since we have again added a variable to the model, we continue and see whether it would be beneficial to add a third variable:
| Add term, credit_util, and ... | \(R^2_{adj}\) |
| --- | --- |
| income_ver | 0.24183 |
| debt_to_income | 0.20810 |
| bankruptcy | 0.20169 |
| issued | 0.20031 |
| credit_checks | 0.21629 |
The model adding income_ver improves the adjusted \(R^2\) (from 0.20046 to 0.24183), so we add income_ver to the model.
We continue on in this way, and the remaining useful predictors debt_to_income, bankruptcy, and credit_checks are each added in turn. At this point, we come again to the issued variable: adding it leads to \(R_{adj}^2 = 0.25843\), while keeping all the other variables but excluding issued leads to a higher \(R_{adj}^2 = 0.25854\). This means we do not add issued. In this example, we have arrived at the same model that we identified from backward elimination.
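For completeness, here is the matching forward selection sketch under the same assumptions (a `loans` data frame with illustrative column names); it starts from the intercept-only model and adds whichever candidate raises the adjusted \(R^2\) the most at each step.

```python
import statsmodels.formula.api as smf

def forward_select(data, response, candidates):
    """Forward selection using adjusted R^2.

    Start with no predictors (adjusted R^2 taken as 0) and repeatedly add the
    single candidate that raises adjusted R^2 the most, stopping once no
    addition improves on the current model.
    """
    selected = []
    best_adj = 0.0  # the intercept-only model has an adjusted R^2 of 0

    remaining = list(candidates)
    while remaining:
        # Adjusted R^2 for each model that adds one candidate to `selected`.
        scores = {
            p: smf.ols(
                f"{response} ~ {' + '.join(selected + [p])}", data=data
            ).fit().rsquared_adj
            for p in remaining
        }
        add = max(scores, key=scores.get)
        if scores[add] <= best_adj:
            break  # no single addition improves the fit
        best_adj = scores[add]
        selected.append(add)
        remaining.remove(add)
    return selected, best_adj
```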
Model selection strategies: Backward elimination begins with the model having the largest number of predictors and eliminates variables one-by-one until we are satisfied that all remaining variables are important to the model. Forward selection starts with no variables included in the model, then it adds in variables according to their importance until no other important variables are found.
Backward elimination and forward selection sometimes arrive at different final models. If we try both techniques and this happens, it’s common to choose the model with the larger \(R_{adj}^2\).
The p-value approach, an alternative to adjusted \(\pmb{R^2}\)
The p-value may be used as an alternative to \(R_{adj}^2\) for model selection:
- Backward elimination with the p-value approach. In backward elimination, we would identify the predictor corresponding to the largest p-value. If that p-value is above the significance level, usually \(\alpha = 0.05\), then we would drop that variable, refit the model, and repeat the process. If the largest p-value is less than \(\alpha = 0.05\), then we would not eliminate any predictors and the current model would be our best-fitting model. (A code sketch of this approach appears after this list.)
- Forward selection with the p-value approach. In forward selection with p-values, we reverse the process. We begin with a model that has no predictors, then we fit a model for each possible predictor, identifying the model where the corresponding predictor’s p-value is smallest. If that p-value is smaller than \(\alpha = 0.05\), we add it to the model and repeat the process, considering whether to add more variables one at a time. When none of the remaining predictors can be added to the model with a p-value less than 0.05, we stop adding variables and the current model is our best-fitting model.
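As referenced above, here is a minimal Python sketch of the p-value variant of backward elimination, assuming a `loans`-style data frame and the illustrative column names used earlier; it uses statsmodels’ formula interface and drops the predictor whose coefficient has the largest p-value until every remaining term is significant at level \(\alpha\).

```python
import statsmodels.formula.api as smf

def backward_eliminate_pvalues(data, response, predictors, alpha=0.05):
    """Backward elimination using p-values.

    Refit the model, find the coefficient with the largest p-value, and drop
    the predictor it belongs to whenever that p-value exceeds alpha. A
    categorical predictor contributes several coefficients; this sketch drops
    the whole predictor based on its worst coefficient, which is a
    simplification.
    """
    current = list(predictors)
    while current:
        fit = smf.ols(f"{response} ~ {' + '.join(current)}", data=data).fit()
        pvalues = fit.pvalues.drop("Intercept")
        worst_term = pvalues.idxmax()
        if pvalues[worst_term] <= alpha:
            break  # every remaining term is significant at level alpha
        # Map a term label such as "income_ver[T.verified]" back to "income_ver".
        worst = next(p for p in current if worst_term.startswith(p))
        current.remove(worst)
    return current
```

Real implementations usually test a grouped categorical term with a single F-test rather than coefficient-by-coefficient; forward selection with p-values would proceed analogously, adding the candidate with the smallest p-value whenever it falls below \(\alpha\).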
Examine Table [loansModelAllButIssued], which considers the model including all variables except the variable for the month the loan was issued. If we were using the p-value approach with backward elimination and we were considering this model, which of these variables would be up for elimination? Would we drop that variable, or would we keep it in the model?
While the adjusted \(R^2\) and p-value approaches are similar, they sometimes lead to different models, with the \(R_{adj}^2\) approach tending to include more predictors in the final model.
Adjusted \(\pmb{R^2}\) vs p-value approach: When the sole goal is to improve prediction accuracy, use \(R_{adj}^2\). This is commonly the case in machine learning applications.
When we care about understanding which variables are statistically significant predictors of the response, or if there is interest in producing a simpler model at the potential cost of a little prediction accuracy, then the p-value approach is preferred.
Regardless of whether we use the \(R_{adj}^2\) or the p-value approach, and whether we use backward elimination or forward selection, our job is not done after variable selection: we must still verify that the model conditions are reasonable.


