
9.1: Introduction to Multiple Regression


    Multiple regression extends simple two-variable regression to the case that still has one response but many predictors (denoted \(x_1\), \(x_2\), \(x_3\), ...). The method is motivated by scenarios where many variables may be simultaneously connected to an output.

    We will consider data about loans from the peer-to-peer lender, Lending Club, which is a data set we first encountered in Chapters [ch_intro_to_data] and [ch_summarizing_data]. The loan data includes terms of the loan as well as information about the borrower. The outcome variable we would like to better understand is the interest rate assigned to the loan. For instance, all other characteristics held constant, does it matter how much debt someone already has? Does it matter if their income has been verified? Multiple regression will help us answer these and other questions.

    The data set includes results from 10,000 loans, and we’ll be looking at a subset of the available variables, some of which will be new from those we saw in earlier chapters. The first six observations in the data set are shown in Figure [loansDataMatrix], and descriptions for each variable are shown in Figure [loansVariables]. Notice that the past bankruptcy variable (bankruptcy) is an indicator variable: it takes the value 1 if the borrower had a past bankruptcy in their record and 0 if not. Using an indicator variable in place of a category name allows these variables to be used directly in regression. Two of the other variables are categorical (income_ver and issued), each of which can take one of a few different non-numerical values; we’ll discuss how these are handled in the model in Section 1.1.

       interest_rate  income_ver   debt_to_income  credit_util  bankruptcy  term  issued   credit_checks
    1          14.07  verified              18.01         0.55           0    60  Mar2018              6
    2          12.61  not                    5.04         0.15           1    36  Feb2018              1
    3          17.09  source_only           21.15         0.66           0    36  Feb2018              4
    4           6.72  not                   10.16         0.20           0    36  Jan2018              0
    5          14.07  verified              57.96         0.75           0    36  Mar2018              7
    6           6.72  not                    6.46         0.09           0    36  Jan2018              6
    \(\vdots\)

    variable         description
    interest_rate    Interest rate for the loan.
    income_ver       Categorical variable describing whether the borrower’s income source and amount have been verified, with levels not, source_only, and verified.
    debt_to_income   Debt-to-income ratio, which is the borrower’s total debt divided by their total income, expressed as a percentage.
    credit_util      Of all the credit available to the borrower, what fraction are they utilizing. For example, the credit utilization on a credit card would be the card’s balance divided by the card’s credit limit.
    bankruptcy       An indicator variable for whether the borrower has a past bankruptcy in her record. This variable takes a value of 1 if the answer is “yes” and 0 if the answer is “no”.
    term             The length of the loan, in months.
    issued           The month and year the loan was issued, which for these loans is always during the first quarter of 2018.
    credit_checks    Number of credit checks in the last 12 months. For example, when filing an application for a credit card, it is common for the company receiving the application to run a credit check.

    Indicator and categorical variables as predictors

    Let’s start by fitting a linear regression model for interest rate with a single predictor indicating whether or not a person has a bankruptcy in their record:

    \[\begin{aligned} \widehat{rate} &= 12.33 + 0.74 \times bankruptcy\end{aligned}\]

    Results of this model are shown in Figure [intRateVsPastBankrModel].

             
                              Estimate  Std. Error  t value  Pr(>|t|)
    (Intercept)                12.3380      0.0533   231.49   <0.0001
    bankruptcy                  0.7368      0.1529     4.82   <0.0001
             

    Interpret the coefficient for the past bankruptcy variable in the model. Is this coefficient significantly different from 0?

    The bankruptcy variable takes one of two values: 1 when the borrower has a bankruptcy in their history and 0 otherwise. A slope of 0.74 means that the model predicts an interest rate 0.74 percentage points higher for those borrowers with a bankruptcy in their record. (See Section [categoricalPredictorsWithTwoLevels] for a review of the interpretation for two-level categorical predictor variables.) Examining the regression output in Figure [intRateVsPastBankrModel], we can see that the p-value for bankruptcy is very close to zero, indicating there is strong evidence the coefficient is different from zero when using this simple one-predictor model.
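
    As a quick worked check of this interpretation (a sketch using the rounded coefficients from the fitted equation above), plugging the two possible values of the indicator into the model gives the predicted rate for each group:

    \[\begin{aligned} \widehat{rate} &= 12.33 + 0.74 \times 0 = 12.33 \quad \text{(no past bankruptcy)} \\ \widehat{rate} &= 12.33 + 0.74 \times 1 = 13.07 \quad \text{(past bankruptcy)} \end{aligned}\]

    The difference between the two predictions, \(13.07 - 12.33 = 0.74\), is exactly the slope. The t value in the output is the estimate divided by its standard error, \(0.7368 / 0.1529 \approx 4.82\).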

    Suppose we had fit a model using a 3-level categorical variable, such as income_ver. The output from software is shown in Figure [intRateVsVerIncomeModel]. This regression output provides multiple rows for the income_ver variable. Each row represents the relative difference for each level of income_ver. However, we are missing one of the levels: not (for not verified). The missing level is called the reference level, and it represents the default level that other levels are measured against.

             
                              Estimate  Std. Error  t value  Pr(>|t|)
    (Intercept)                11.0995      0.0809   137.18   <0.0001
    income_ver: source_only     1.4160      0.1107    12.79   <0.0001
    income_ver: verified        3.2543      0.1297    25.09   <0.0001
             

    How would we write an equation for this regression model? [verIncomeEquationExample] The equation for the regression model may be written as a model with two predictors:

    \[\begin{aligned} \widehat{rate} = 11.10 + 1.42 \times income\_ver_{source\_only} + 3.25 \times income\_ver_{verified} \end{aligned}\]

    We use the notation \(variable_{level}\) to represent indicator variables for when the categorical variable takes a particular value. For example, \(income\_ver_{source\_only}\) would take a value of 1 if income_ver was source_only for a loan, and it would take a value of 0 otherwise. Likewise, \(income\_ver_{verified}\) would take a value of 1 if income_ver took a value of verified and 0 if it took any other value.

    The notation used in Example [verIncomeEquationExample] may feel a bit confusing. Let’s figure out how to use the equation for each level of the income_ver variable.

    Using the model from Example [verIncomeEquationExample], compute the average interest rate for borrowers whose income source and amount are both unverified. When income_ver takes a value of not, then both indicator functions in the equation from Example [verIncomeEquationExample] are set to zero:

    \[\begin{aligned} \widehat{rate} &= 11.10 + 1.42 \times 0 + 3.25 \times 0 \\ &= 11.10 \end{aligned}\]

    The average interest rate for these borrowers is 11.1%. Because the not level does not have its own coefficient and it is the reference value, the indicators for the other levels for this variable all drop out.

    Using the model from Example [verIncomeEquationExample], compute the average interest rate for borrowers whose income source is verified but the amount is not. When income_ver takes a value of source_only, then the corresponding variable takes a value of 1 while the other (\(income\_ver_{verified}\)) is 0:

    \[\begin{aligned} \widehat{rate} &= 11.10 + 1.42 \times 1 + 3.25 \times 0 \\ &= 12.52 \end{aligned}\]

    The average interest rate for these borrowers is 12.52%.

    Compute the average interest rate for borrowers whose income source and amount are both verified.

    Predictors with several categories. When fitting a regression model with a categorical variable that has \(k\) levels where \(k > 2\), software will provide a coefficient for \(k - 1\) of those levels. The one level that does not receive a coefficient is the reference level, and the coefficients listed for the other levels are all considered relative to this reference level.
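
    The indicator coding described above can be sketched in a few lines of Python. This is only an illustration, not the software used to produce the output: the helper names (indicators, predict_rate) are hypothetical, and the coefficients are the rounded values from the income_ver equation in the text.

```python
# Rounded coefficients from the income_ver model in the text.
INTERCEPT = 11.10
COEFFICIENTS = {"source_only": 1.42, "verified": 3.25}
REFERENCE_LEVEL = "not"  # the level that receives no coefficient
LEVELS = [REFERENCE_LEVEL, *COEFFICIENTS]


def indicators(level):
    """Return the k - 1 indicator values (0 or 1) for one loan's income_ver level."""
    if level not in LEVELS:
        raise ValueError(f"unknown income_ver level: {level!r}")
    return {name: int(level == name) for name in COEFFICIENTS}


def predict_rate(level):
    """Predicted interest rate: intercept plus each coefficient times its indicator."""
    ind = indicators(level)
    return INTERCEPT + sum(COEFFICIENTS[name] * ind[name] for name in COEFFICIENTS)


for level in LEVELS:
    print(level, indicators(level), round(predict_rate(level), 2))
```

    For the reference level every indicator is 0, so the prediction reduces to the intercept; for any other level exactly one indicator is 1, so the prediction is the intercept plus that level's coefficient.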

    Interpret the coefficients in the income_ver model.

    The higher interest rate for borrowers who have verified their income source or amount is surprising. Intuitively, we’d think that a loan would look less risky if the borrower’s income has been verified. However, note that the situation may be more complex, and there may be confounding variables that we didn’t account for. For example, perhaps lenders require borrowers with poor credit to verify their income. That is, verifying income in our data set might be a signal of some concerns about the borrower rather than a reassurance that the borrower will pay back the loan. For this reason, the borrower could be deemed higher risk, resulting in a higher interest rate. (What other confounding variables might explain this counter-intuitive relationship suggested by the model?)

    How much larger of an interest rate would we expect for a borrower who has verified their income source and amount vs a borrower whose income source has only been verified?

    Including and assessing many variables in a model

    The world is complex, and it can be helpful to consider many factors at once in statistical modeling. For example, we might like to use the full context of a borrower to predict the interest rate they receive rather than using a single variable. This is the strategy used in multiple regression. While we remain cautious about making any causal interpretations using multiple regression on observational data, such models are a common first step in gaining insights or providing some evidence of a causal connection.

    We want to construct a model that accounts not only for any past bankruptcy or whether the borrower had their income source or amount verified, but simultaneously accounts for all the variables in the data set: income_ver, debt_to_income, credit_util, bankruptcy, term, issued, and credit_checks.

    \[\begin{aligned} \widehat{rate} &= \beta_0 + \beta_1 \times income\_ver_{source\_only} + \beta_2 \times income\_ver_{verified} + \beta_3 \times debt\_to\_income \\ &\qquad\ + \beta_4 \times credit\_util + \beta_5 \times bankruptcy + \beta_6 \times term \\ &\qquad\ + \beta_7 \times issued_{Jan2018} + \beta_8 \times issued_{Mar2018} + \beta_9 \times credit\_checks\end{aligned}\]

    This equation represents a holistic approach for modeling all of the variables simultaneously. Notice that there are two coefficients for income_ver and also two coefficients for issued, since both are 3-level categorical variables.

    We estimate the parameters \(\beta_0\), \(\beta_1\), \(\beta_2\), ..., \(\beta_9\) in the same way as we did in the case of a single predictor. We select \(b_0\), \(b_1\), \(b_2\), ..., \(b_9\) that minimize the sum of the squared residuals:

    \[\begin{aligned} SSE = e_1^2 + e_2^2 + \dots + e_{10000}^2 = \sum_{i=1}^{10000} e_i^2 = \sum_{i=1}^{10000} \left(y_i - \hat{y}_i\right)^2\end{aligned}\]

    where \(y_i\) and \(\hat{y}_i\) represent the observed interest rates and their estimated values according to the model, respectively. 10,000 residuals are calculated, one for each observation. We typically use a computer to minimize the sum of squares and compute point estimates, as shown in the sample output in Figure [loansFullModelOutput]. Using this output, we identify the point estimates \(b_i\) of each \(\beta_i\), just as we did in the one-predictor case.
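
    To make the least squares criterion concrete, here is a minimal sketch that fits a one-predictor model (interest rate on the bankruptcy indicator) to only the six loans shown in the data table, using the standard closed-form formulas for a single predictor. Because it uses six rows rather than all 10,000, its estimates are not expected to match the output reported in the text; the function names are hypothetical, and the full model with nine coefficients would be fit by statistical software using the same criterion.

```python
# First six loans from the data table (not the full data set of 10,000 loans).
bankruptcy = [0, 1, 0, 0, 0, 0]
interest_rate = [14.07, 12.61, 17.09, 6.72, 14.07, 6.72]


def sse(b0, b1, x, y):
    """Sum of squared residuals for the line y-hat = b0 + b1 * x."""
    return sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))


def least_squares(x, y):
    """Closed-form least squares estimates (b0, b1) for a single predictor."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = s_xy / s_xx
    b0 = y_bar - b1 * x_bar
    return b0, b1


b0, b1 = least_squares(bankruptcy, interest_rate)
best = sse(b0, b1, bankruptcy, interest_rate)

# SSE after nudging either estimate slightly away from the least squares solution.
nudged = [
    sse(b0 + d0, b1 + d1, bankruptcy, interest_rate)
    for d0, d1 in [(0.1, 0.0), (-0.1, 0.0), (0.0, 0.1), (0.0, -0.1)]
]

print("b0 =", round(b0, 3), "b1 =", round(b1, 3), "SSE =", round(best, 3))
print("SSE after nudging:", [round(value, 3) for value in nudged])
```

    With an indicator as the only predictor, the fitted intercept is the mean rate of the loans with bankruptcy equal to 0, and the intercept plus the slope is the mean rate of the loans with bankruptcy equal to 1. Every nudged pair of coefficients gives a larger SSE than the least squares pair.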

             
                              Estimate  Std. Error  t value  Pr(>|t|)
    (Intercept)                 1.9251      0.2102     9.16   <0.0001
    income_ver: source_only     0.9750      0.0991     9.83   <0.0001
    income_ver: verified        2.5374      0.1172    21.65   <0.0001
    debt_to_income              0.0211      0.0029     7.18   <0.0001
    credit_util                 4.8959      0.1619    30.24   <0.0001
    bankruptcy                  0.3864      0.1324     2.92    0.0035
    term                        0.1537      0.0039    38.96   <0.0001
    issued: Jan2018             0.0276      0.1081     0.26    0.7981
    issued: Mar2018            -0.0397      0.1065    -0.37    0.7093
    credit_checks               0.2282      0.0182    12.51   <0.0001
             

    Multiple regression model. A multiple regression model is a linear model with many predictors. In general, we write the model as

    \[\begin{aligned} \hat{y} = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_k x_k \end{aligned}\]

    when there are \(k\) predictors. We always estimate the \(\beta_i\) parameters using statistical software.

    Write out the regression model using the point estimates from Figure [loansFullModelOutput]. How many predictors are there in this model? [loansFullModelEqWCoef] The fitted model for the interest rate is given by:

    \[\begin{aligned} \widehat{rate} &= 1.925 + 0.975 \times income\_ver_{source\_only} + 2.537 \times income\_ver_{verified} + 0.021 \times debt\_to\_income \\ &\qquad\ + 4.896 \times credit\_util + 0.386 \times bankruptcy + 0.154 \times term \\ &\qquad\ + 0.028 \times issued_{Jan2018} - 0.040 \times issued_{Mar2018} + 0.228 \times credit\_checks \end{aligned}\]

    If we count up the number of predictor coefficients, we get the effective number of predictors in the model: \(k = 9\). Notice that each of the 3-level categorical predictors (income_ver and issued) counts as two, once for each of the two levels shown in the model. In general, a categorical predictor with \(p\) different levels will be represented by \(p - 1\) terms in a multiple regression model.

    What does \(\beta_4\), the coefficient of variable credit_util, represent? What is the point estimate of \(\beta_4\)?

    Compute the residual of the first observation in Figure [loansDataMatrix] using the equation identified in Example [loansFullModelEqWCoef]. To compute the residual, we first need the predicted value, which we compute by plugging values into the equation from Example [loansFullModelEqWCoef]. For example, \(income\_ver_{source\_only}\) takes a value of 0, \(income\_ver_{verified}\) takes a value of 1 (since the borrower’s income source and amount were verified), debt_to_income was 18.01, and so on. This leads to a prediction of \(\widehat{rate}_1 = 18.09\). The observed interest rate was 14.07%, which leads to a residual of \(e_1 = 14.07 - 18.09 = -4.02\).
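
    The plug-in calculation for the first observation can be written out as a short sketch. It uses the four-decimal estimates from the full-model output table and the first row of the data table; the dictionary keys and the predict_rate helper are illustrative names, not part of any particular software package.

```python
# Point estimates from the full-model regression output in the text.
coef = {
    "intercept": 1.9251,
    "income_ver_source_only": 0.9750,
    "income_ver_verified": 2.5374,
    "debt_to_income": 0.0211,
    "credit_util": 4.8959,
    "bankruptcy": 0.3864,
    "term": 0.1537,
    "issued_Jan2018": 0.0276,
    "issued_Mar2018": -0.0397,
    "credit_checks": 0.2282,
}

# First observation from the data table.
loan = {
    "interest_rate": 14.07,
    "income_ver": "verified",
    "debt_to_income": 18.01,
    "credit_util": 0.55,
    "bankruptcy": 0,
    "term": 60,
    "issued": "Mar2018",
    "credit_checks": 6,
}


def predict_rate(row):
    """Plug one loan into the fitted equation; categorical levels become 0/1 indicators."""
    return (
        coef["intercept"]
        + coef["income_ver_source_only"] * int(row["income_ver"] == "source_only")
        + coef["income_ver_verified"] * int(row["income_ver"] == "verified")
        + coef["debt_to_income"] * row["debt_to_income"]
        + coef["credit_util"] * row["credit_util"]
        + coef["bankruptcy"] * row["bankruptcy"]
        + coef["term"] * row["term"]
        + coef["issued_Jan2018"] * int(row["issued"] == "Jan2018")
        + coef["issued_Mar2018"] * int(row["issued"] == "Mar2018")
        + coef["credit_checks"] * row["credit_checks"]
    )


predicted = predict_rate(loan)
residual = loan["interest_rate"] - predicted  # observed minus predicted
print(round(predicted, 2), round(residual, 2))
```

    A negative residual means the model predicted a higher interest rate for this loan than the one actually assigned.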

    We estimated a coefficient for bankruptcy in Section 1.1 of \(b_1 = 0.74\) with a standard error of \(SE_{b_1} = 0.15\) when using simple linear regression. Why is there a difference between that estimate and the estimated coefficient of 0.39 in the multiple regression setting? [pastBankrCoefDiffExplained] If we examined the data carefully, we would see that some predictors are correlated. For instance, when we estimated the connection of the outcome interest_rate and predictor bankruptcy using simple linear regression, we were unable to control for other variables like whether the borrower had her income verified, the borrower’s debt-to-income ratio, and other variables. That original model was constructed in a vacuum and did not consider the full context. When we include all of the variables, underlying and unintentional bias that was missed by these other variables is reduced or eliminated. Of course, bias can still exist from other confounding variables.

    Example [pastBankrCoefDiffExplained] describes a common issue in multiple regression: correlation among predictor variables. We say the two predictor variables are collinear (pronounced as co-linear) when they are correlated, and this collinearity complicates model estimation. While it is impossible to prevent collinearity from arising in observational data, experiments are usually designed to prevent predictors from being collinear.

    The estimated value of the intercept is 1.925, and one might be tempted to make some interpretation of this coefficient, such as, it is the model’s predicted interest rate when each of the variables takes a value of zero: income source is not verified, the borrower has no debt (debt-to-income and credit utilization are zero), and so on. Is this reasonable? Is there any value gained by making this interpretation?

    Adjusted \(\pmb{R^2}\) as a better tool for multiple regression

    We first used \(R^2\) in Section [fittingALineByLSR] to determine the amount of variability in the response that was explained by the model:

    \[\begin{aligned} R^2 = 1 - \frac{\text{variability in residuals}} {\text{variability in the outcome}} = 1 - \frac{Var(e_i)}{Var(y_i)}\end{aligned}\]

    where \(e_i\) represents the residuals of the model and \(y_i\) the outcomes. This equation remains valid in the multiple regression framework, but a small enhancement can make it even more informative when comparing models.

    [computeUnadjR2ForFullLoansModel] The variance of the residuals for the model given in Example [loansFullModelEqWCoef] is 18.53, and the variance of the interest rate across all the loans is 25.01. Calculate \(R^2\) for this model.

    This strategy for estimating \(R^2\) is acceptable when there is just a single variable. However, it becomes less helpful when there are many variables. The regular \(R^2\) is a biased estimate of the amount of variability explained by the model when applied to a new sample of data. To get a better estimate, we use the adjusted \(R^2\).

    Adjusted \(\pmb{R^2}\) as a tool for model assessment. The adjusted \(R^2\) is computed as

    \[\begin{aligned} R_{adj}^{2} = 1 - \frac{s_{\text{residuals}}^2 / (n-k-1)} {s_{\text{outcome}}^2 / (n-1)} = 1 - \frac{s_{\text{residuals}}^2}{s_{\text{outcome}}^2} \times \frac{n-1}{n-k-1} \end{aligned}\]

    where \(n\) is the number of cases used to fit the model and \(k\) is the number of predictor variables in the model. Remember that a categorical predictor with \(p\) levels will contribute \(p - 1\) to the number of variables in the model.

    Because \(k\) is never negative, the adjusted \(R^2\) will be no larger than the unadjusted \(R^2\), and with at least one predictor it is smaller, often just a little smaller. The reasoning behind the adjusted \(R^2\) lies in the degrees of freedom associated with each variance, which is equal to \(n - k - 1\) for the residual variance in the multiple regression context. If we were to make predictions for new data using our current model, we would find that the unadjusted \(R^2\) would tend to be slightly overly optimistic, while the adjusted \(R^2\) formula helps correct this bias.

    There were \(n=10000\) loans in the data set and \(k=9\) predictor variables in the model. Use \(n\), \(k\), and the variances from Guided Practice [computeUnadjR2ForFullLoansModel] to calculate \(R_{adj}^2\) for the interest rate model.

    Suppose you added another predictor to the model, but the variance of the errors \(Var(e_i)\) didn’t go down. What would happen to the \(R^2\)? What would happen to the adjusted \(R^2\)?

    Adjusted \(R^2\) could have been used in Chapter [linRegrForTwoVar]. However, when there is only \(k = 1\) predictor, adjusted \(R^2\) is very close to regular \(R^2\), so this nuance isn’t typically important when the model has only one predictor.
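
    The two formulas in this section translate directly into code. The sketch below is one way to check such calculations, assuming the residual and outcome variances quoted in the text (18.53 and 25.01) together with \(n = 10000\) and \(k = 9\); the function names are hypothetical.

```python
def r_squared(var_residuals, var_outcome):
    """Unadjusted R^2: 1 minus the ratio of residual to outcome variance."""
    return 1 - var_residuals / var_outcome


def adjusted_r_squared(var_residuals, var_outcome, n, k):
    """Adjusted R^2: the variance ratio is scaled by (n - 1) / (n - k - 1)."""
    return 1 - (var_residuals / var_outcome) * (n - 1) / (n - k - 1)


var_e, var_y = 18.53, 25.01  # variances quoted in the text
n, k = 10000, 9              # number of loans and number of predictors

r2 = r_squared(var_e, var_y)
adj_r2 = adjusted_r_squared(var_e, var_y, n, k)

# Same variances with one extra predictor: R^2 is unchanged, adjusted R^2 drops.
adj_r2_extra = adjusted_r_squared(var_e, var_y, n, k + 1)

# With a single predictor and a large n, the adjustment is tiny.
adj_r2_single = adjusted_r_squared(var_e, var_y, n, 1)

print(round(r2, 4), round(adj_r2, 4), round(adj_r2_extra, 4), round(adj_r2_single, 4))
```

    With 10,000 loans the correction factor \((n-1)/(n-k-1)\) is very close to 1, so the two measures differ only in the third or fourth decimal place; the gap grows when \(n\) is small relative to \(k\).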


    This page titled 9.1: Introduction to Multiple Regression is shared under a CC BY-SA 3.0 license and was authored, remixed, and/or curated by David Diez, Christopher Barr, & Mine Çetinkaya-Rundel via source content that was edited to the style and standards of the LibreTexts platform.