
9.3: Checking Model Assumptions using Graphs


    Multiple regression methods using the model

    \[\begin{aligned} \hat{y} &= \beta_0 + \beta_1x_1 + \beta_2x_2 + \cdots + \beta_kx_k\end{aligned}\]

    generally depend on the following four conditions:

    1. the residuals of the model are nearly normal (less important for larger data sets),
    2. the variability of the residuals is nearly constant,
    3. the residuals are independent, and
    4. each variable is linearly related to the outcome.

Diagnostic plots can be used to check each of these conditions. We will consider the model from the Lending Club loans data, and check whether there are any notable concerns:

\[\begin{aligned} \widehat{\text{rate}} &= 1.921 + 0.974 \times \text{income\_ver}_{\text{source\_only}} + 2.535 \times \text{income\_ver}_{\text{verified}} \\ &\qquad + 0.021 \times \text{debt\_to\_income} + 4.896 \times \text{credit\_util} + 0.387 \times \text{bankruptcy} \\ &\qquad + 0.154 \times \text{term} + 0.228 \times \text{credit\_check}\end{aligned}\]
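To make this concrete, here is a minimal sketch of how a model like this could be fit in Python with statsmodels. The file name and column names are assumptions chosen to mirror the variables above; they are not part of the original text.

```python
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical file holding the Lending Club loans data; the column
# names below mirror the variables in the fitted model (assumptions).
loans = pd.read_csv("loans.csv")

# C() treats income_ver as categorical, so statsmodels creates
# indicator (0/1) columns for the source_only and verified levels.
model = smf.ols(
    "rate ~ C(income_ver) + debt_to_income + credit_util"
    " + bankruptcy + term + credit_check",
    data=loans,
).fit()
print(model.params)  # estimated coefficients
```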

    Check for outliers.

In theory, the distribution of the residuals should be nearly normal; in practice, normality can be relaxed for most applications. Instead, we examine a histogram of the residuals to check if there are any outliers: Figure [loansDiagNormalHistogram] shows a histogram of these residuals. Since this is a very large data set, only particularly extreme observations would be a concern in this particular case. There are no extreme observations that might cause a concern.
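Continuing the sketch above, the histogram can be drawn with matplotlib; since we only need to scan for extreme outliers, no formal normality test is required.

```python
import matplotlib.pyplot as plt

# Histogram of the residuals from the fitted model; with a data set
# this large, we mainly scan for extreme outliers rather than
# insisting on perfect normality.
plt.hist(model.resid, bins=50, edgecolor="white")
plt.xlabel("Residual")
plt.ylabel("Count")
plt.show()
```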

If we intended to construct what are called prediction intervals for future observations, we would be more strict and require the residuals to be nearly normal. Prediction intervals are further discussed in an online extra on the OpenIntro website.

    Absolute values of residuals against fitted values.

    A plot of the absolute value of the residuals against their corresponding fitted values (\(\hat{y}_i\)) is shown in Figure [loansDiagEvsAbsF]. This plot is helpful to check the condition that the variance of the residuals is approximately constant, and a smoothed line has been added to represent the approximate trend in this plot. There is more evident variability for fitted values that are larger, which we’ll discuss further.
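A sketch of this plot, continuing from the model fit above; the lowess smoother from statsmodels stands in for the smoothed line in the figure.

```python
import numpy as np
import matplotlib.pyplot as plt
from statsmodels.nonparametric.smoothers_lowess import lowess

# Absolute residuals against fitted values; a roughly flat smoothed
# curve supports the constant-variability condition.
abs_resid = np.abs(model.resid)
smooth = lowess(abs_resid, model.fittedvalues)  # sorted (x, smoothed y)

plt.scatter(model.fittedvalues, abs_resid, s=5, alpha=0.3)
plt.plot(smooth[:, 0], smooth[:, 1], color="red")
plt.xlabel("Fitted value")
plt.ylabel("Absolute value of residual")
plt.show()
```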

    Residuals in order of their data collection.

This type of plot can be helpful when observations were collected in a sequence. Such a plot is helpful in identifying any connection between cases that are close to one another. The loans in this data set were issued over a 3-month period, and the month a loan was issued was not found to be important, suggesting this is not a concern for this data set. In cases where a data set does show some pattern for this check, time series methods may be useful.
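If the rows of the data set are stored in collection order, a simple index plot of the residuals can serve as this check; this sketch continues from the fit above.

```python
import matplotlib.pyplot as plt

# Residuals in row order; this assumes row order reflects the order of
# data collection. Drift or cycling in this plot would hint at
# dependence between nearby cases.
plt.scatter(range(len(model.resid)), model.resid, s=5, alpha=0.3)
plt.axhline(0, color="red", linewidth=1)
plt.xlabel("Order of data collection")
plt.ylabel("Residual")
plt.show()
```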

    Residuals against each predictor variable.

We consider a plot of the residuals against each of the predictors in Figure [loansDiagEvsVariables]. For those instances where there are only 2-3 groups, box plots are shown. For the numerical predictors, a smoothed line has been fit to the data to make it easier to review. Ultimately, we are looking for any notable change in variability between groups or pattern in the data.
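One way to generate the whole panel of plots, continuing from the earlier sketch. Which predictors get box plots versus scatterplots is an assumption here, based on the description above.

```python
import matplotlib.pyplot as plt
from statsmodels.nonparametric.smoothers_lowess import lowess

# Box plots for predictors with only a few groups, scatterplots with a
# lowess trend for the numerical predictors. The split below is an
# assumption about which variables have 2-3 levels.
grouped = ["income_ver", "bankruptcy"]
numeric = ["debt_to_income", "credit_util", "term", "credit_check"]

resid = model.resid  # aligned with the rows of loans (no rows dropped)

for col in grouped:
    loans.assign(resid=resid).boxplot(column="resid", by=col)
    plt.show()

for col in numeric:
    smooth = lowess(resid, loans[col])
    plt.scatter(loans[col], resid, s=5, alpha=0.3)
    plt.plot(smooth[:, 0], smooth[:, 1], color="red")
    plt.xlabel(col)
    plt.ylabel("Residual")
    plt.show()
```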

    Here are the things of importance from these plots:

• There are some minor differences in variability between the verified income groups.
    • There is a very clear pattern for the debt-to-income variable. What also stands out is that this variable is very strongly right skewed: there are few observations with very high debt-to-income ratios.
    • The downward curve on the right side of the credit utilization and credit check plots suggests some minor misfitting for those larger values.

Having reviewed the diagnostic plots, there are two options. The first option is to use this as the final model, if we're not concerned about the issues observed; going this route, it is important to still note any abnormalities observed in the diagnostics. The second option is to try to improve the model, which is what we'll try to do with this particular model fit.

    Options for improving the model fit

    There are several options for improvement of a model, including transforming variables, seeking out additional variables to fill model gaps, or using more advanced methods that would account for challenges around inconsistent variability or nonlinear relationships between predictors and the outcome.

The main concern for the initial model is the notable nonlinear relationship between the residuals and the debt-to-income variable observed in Figure [loansDiagEvsVariables]. To resolve this issue, we're going to consider a couple of strategies for adjusting the relationship between the predictor variable and the outcome.

Let's start by taking a look at a histogram of debt_to_income in Figure [loansDebtToIncomeHist]. The variable is extremely skewed, and the upper values will have a lot of leverage on the fit. Below are several options:

    • log transformation (\(\log{x}\)),
    • square root transformation (\(\sqrt{x}\)),
    • inverse transformation (\(1 / x\)),
    • truncation (cap the max value possible)

If we inspected the data more closely, we'd observe some instances where the variable takes a value of 0, and since \(\log(x)\) and \(1 / x\) are undefined when \(x = 0\), we'll exclude these transformations from further consideration. A square root transformation is valid for all values the variable takes, and truncating some of the larger observations is also a valid approach. We'll consider both of these approaches.

    To try transforming the variable, we make two new variables representing the transformed versions:

    Square root.

We create a new variable where each value is the square root of the corresponding value of debt_to_income, and then refit the model as before. The result is shown in the left panel of Figure [loansDiagEvsTransformDebtToIncome]. The square root pulled in the higher values a bit, but the fit still doesn't look great since the smoothed line is still wavy.

    Truncate at 50.

We create a new variable, debt_to_income_50, where any values of debt_to_income greater than 50 are shrunk to exactly 50. Refitting the model once more, the diagnostic plot for this new variable is shown in the right panel of Figure [loansDiagEvsTransformDebtToIncome]. Here the fit looks much more reasonable, so truncation appears to be a workable approach.
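A sketch of both transformations in the same setting as the earlier code. The debt_to_income_50 name matches the final model below; the name for the square-root version is an assumption.

```python
import numpy as np

# Square root transformation: valid here because debt_to_income is
# never negative (some values are exactly 0).
loans["debt_to_income_sqrt"] = np.sqrt(loans["debt_to_income"])

# Truncation: cap values at 50 so a handful of extreme ratios cannot
# exert undue leverage on the fit.
loans["debt_to_income_50"] = loans["debt_to_income"].clip(upper=50)

# Refit using the truncated version, then re-examine the diagnostics
# just as before (smf was imported in the first sketch).
model2 = smf.ols(
    "rate ~ C(income_ver) + debt_to_income_50 + credit_util"
    " + bankruptcy + term + credit_check",
    data=loans,
).fit()
print(model2.params)
```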

    The downside of using transformations is that it reduces the ease of interpreting the results. Fortunately, since the truncation transformation only affects a relatively small number of cases, the interpretation isn’t dramatically impacted.

As a next step, we'd evaluate the new model using the truncated version of debt_to_income, completing all the same procedures as before. The other two issues noted while inspecting the diagnostics earlier in this section are still present in the updated model. If we choose to report this model, we would want to also discuss these shortcomings to be transparent in our work. Depending on what the model will be used for, we could either try to bring those issues under control, or we could stop, since they aren't severe. Had the non-constant variance been a little more dramatic, it would be a higher priority. Ultimately we decided that the model was reasonable, and we report its final form here:

\[\begin{aligned} \widehat{\text{rate}} &= 1.562 + 1.002 \times \text{income\_ver}_{\text{source\_only}} + 2.436 \times \text{income\_ver}_{\text{verified}} \\ &\qquad + 0.048 \times \text{debt\_to\_income\_50} + 4.694 \times \text{credit\_util} + 0.394 \times \text{bankruptcy} \\ &\qquad + 0.153 \times \text{term} + 0.223 \times \text{credit\_check}\end{aligned}\]

A sharp eye would notice that the coefficient for debt_to_income_50 (0.048) is more than twice as large as the coefficient for debt_to_income in the earlier model (0.021). This suggests the larger values not only were points with high leverage, but they were influential points that were dramatically impacting the coefficient.

"All models are wrong, but some are useful." - George E.P. Box

The truth is that no model is perfect. However, even imperfect models can be useful. Reporting a flawed model can be reasonable so long as we are clear and report the model's shortcomings.

Don't report results when conditions are grossly violated. While there is a little leeway in model conditions, don't go too far. If model conditions are very clearly violated, consider a new model, even if it means learning more statistical methods or hiring someone who can help. To help you get started, we've developed a couple additional sections that you may find on OpenIntro's website. These sections provide a light introduction to what are called interaction terms and to fitting nonlinear curves to data, respectively.


    This page titled 9.3: Checking Model Assumptions using Graphs is shared under a CC BY-SA 3.0 license and was authored, remixed, and/or curated by David Diez, Christopher Barr, & Mine Çetinkaya-Rundel via source content that was edited to the style and standards of the LibreTexts platform.