9.4: Multiple regression case study - Mario Kart
We’ll consider eBay auctions of a video game called Mario Kart for the Nintendo Wii. The outcome variable of interest is the total price of an auction, which is the highest bid plus the shipping cost. We will try to determine how total price is related to each characteristic in an auction while simultaneously controlling for other variables. For instance, all other characteristics held constant, are longer auctions associated with higher or lower prices? And, on average, how much more do buyers tend to pay for additional Wii wheels (plastic steering wheels that attach to the Wii controller) in auctions? Multiple regression will help us answer these and other questions.
Data set and the full model
The data set includes results from 141 auctions. Four observations from this data set are shown in Figure [marioKartDataMatrix], and descriptions for each variable are shown in Figure [marioKartVariables]. Notice that the condition and stock photo variables are indicator variables, similar to the indicator variables used in earlier chapters.
| | price | condnew | stockphoto | duration | wheels |
|---|---|---|---|---|---|
| 1 | 51.55 | 1 | 1 | 3 | 1 |
| 2 | 37.04 | 0 | 1 | 7 | 1 |
| \(\vdots\) | \(\vdots\) | \(\vdots\) | \(\vdots\) | \(\vdots\) | \(\vdots\) |
| 140 | 38.76 | 0 | 0 | 7 | 0 |
| 141 | 54.51 | 1 | 1 | 1 | 2 |
| variable | description |
|---|---|
| price | Final auction price plus shipping costs, in US dollars. |
| condnew | Indicator variable for whether the game is new (1) or used (0). |
| stockphoto | Indicator variable for whether the auction’s main photo is a stock photo. |
| duration | The length of the auction, in days, taking values from 1 to 10. |
| wheels | The number of Wii wheels included with the auction. A Wii wheel is an optional steering wheel accessory that holds the Wii controller. |
[condNewVarForMarioKartOnly] We fit a linear regression model with the game’s condition as a predictor of auction price. Results of this model are summarized below:
| | Estimate | Std. Error | t value | Pr(\(>\)\(|\)t\(|\)) |
|---|---|---|---|---|
| (Intercept) | 42.8711 | 0.8140 | 52.67 | \(<\)0.0001 |
| condnew | 10.8996 | 1.2583 | 8.66 | \(<\)0.0001 |
Write down the equation for the model, note whether the slope is statistically different from zero, and interpret the coefficient.
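The fitted equation from the output above, \(\widehat{\text{price}} = 42.87 + 10.90 \times \text{condnew}\), can be evaluated numerically. The short Python sketch below uses the point estimates from the table; the function name `predict_price` is ours for illustration and not part of the original analysis.

```python
def predict_price(condnew):
    """Predicted total price (USD) from the condition-only model above."""
    b0, b1 = 42.8711, 10.8996  # point estimates from the regression output
    return b0 + b1 * condnew

used = predict_price(0)  # 42.8711: predicted price for a used game
new = predict_price(1)   # 53.7707: predicted price for a new game
print(f"used: {used:.2f}, new: {new:.2f}, difference: {new - used:.2f}")
```

The difference between the two predictions is exactly the condnew coefficient, 10.90, which is the slope being interpreted in the guided practice.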
Sometimes there are underlying structures or relationships between predictor variables. For instance, new games sold on Ebay tend to come with more Wii wheels, which may have led to higher prices for those auctions. We would like to fit a model that includes all potentially important variables simultaneously. This would help us evaluate the relationship between a predictor variable and the outcome while controlling for the potential influence of other variables.
We want to construct a model that accounts for not only the game condition, as in Guided Practice [condNewVarForMarioKartOnly], but simultaneously accounts for three other variables:
\[\begin{aligned} \widehat{\text{price}} &= \beta_0 + \beta_1\times \text{condnew} + \beta_2\times \text{stockphoto} \\ &\qquad\ + \beta_3 \times \text{duration} + \beta_4 \times \text{wheels}\end{aligned}\]
Figure [MarioKartFullModelOutput] summarizes the full model. Using this output, we identify the point estimates of each coefficient.
| | Estimate | Std. Error | t value | Pr(\(>\)\(|\)t\(|\)) |
|---|---|---|---|---|
| (Intercept) | 36.2110 | 1.5140 | 23.92 | \(<\)0.0001 |
| condnew | 5.1306 | 1.0511 | 4.88 | \(<\)0.0001 |
| stockphoto | 1.0803 | 1.0568 | 1.02 | 0.3085 |
| duration | -0.0268 | 0.1904 | -0.14 | 0.8882 |
| wheels | 7.2852 | 0.5547 | 13.13 | \(<\)0.0001 |
[eqForMultRegrOfTotalPrForAllPredWithCoef] Write out the model’s equation using the point estimates from Figure [MarioKartFullModelOutput]. How many predictors are there in this model?
What does \(\beta_4\), the coefficient of variable \(x_4\) (Wii wheels), represent? What is the point estimate of \(\beta_4\)?
[computeMultipleRegressionResidualForMarioKart] Compute the residual of the first observation in Figure [marioKartDataMatrix] using the equation identified in Guided Practice [eqForMultRegrOfTotalPrForAllPredWithCoef].
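One way to carry out this computation is sketched below in Python, using the full-model point estimates from Figure [MarioKartFullModelOutput] and the first row of Figure [marioKartDataMatrix]; the dictionary names `coefs` and `obs` are ours for illustration.

```python
# Full-model point estimates from the regression output.
coefs = {"intercept": 36.2110, "condnew": 5.1306,
         "stockphoto": 1.0803, "duration": -0.0268, "wheels": 7.2852}

# First observation: a new game, stock photo, 3-day auction, one wheel.
obs = {"price": 51.55, "condnew": 1, "stockphoto": 1, "duration": 3, "wheels": 1}

y_hat = coefs["intercept"] + sum(
    coefs[k] * obs[k] for k in ("condnew", "stockphoto", "duration", "wheels"))
residual = obs["price"] - y_hat  # e_i = y_i - y_hat_i
print(f"fitted: {y_hat:.2f}, residual: {residual:.2f}")  # fitted: 49.63, residual: 1.92
```

The auction sold for about $1.92 more than the model predicts for its characteristics.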
We estimated a coefficient for condnew in Guided Practice [condNewVarForMarioKartOnly] of \(b_1 = 10.90\) with a standard error of \(SE_{b_1} = 1.26\) when using simple linear regression. Why might there be a difference between that estimate and the one in the multiple regression setting?

[colinearityOfCondNewAndStockPhoto] If we examined the data carefully, we would see that there is collinearity among some predictors. For instance, when we estimated the connection between the outcome price and the predictor condnew using simple linear regression, we were unable to control for other variables like the number of Wii wheels included in the auction. That model was biased by the confounding variable wheels. When we use both variables, this particular underlying and unintentional bias is reduced or eliminated (though bias from other confounding variables may still remain).
Model selection
Let’s revisit the model for the Mario Kart auction data and complete model selection using backward elimination. Recall that the full model took the following form:
\[\widehat{\text{price}} = 36.21 + 5.13 \times \text{condnew} + 1.08 \times \text{stockphoto} - 0.03 \times \text{duration} + 7.29 \times \text{wheels}\]
Results corresponding to the full model for the data were shown in Figure [MarioKartFullModelOutput]. For this model, we consider what would happen if we dropped each of the variables in the model, one at a time:
| Exclude ... | \(R^2_{adj}\) |
|---|---|
| condnew | 0.6626 |
| stockphoto | 0.7107 |
| duration | 0.7128 |
| wheels | 0.3487 |
For the full model, \(R_{adj}^2 = 0.7108\). How should we proceed under the backward elimination strategy?
[backwardEliminationExampleWMarioKartData] The third model, which excludes duration, has the highest \(R_{adj}^2\) of 0.7128, so we compare it to \(R_{adj}^2\) for the full model. Because eliminating duration leads to a model with a higher \(R_{adj}^2\), we drop duration from the model.
In Example [backwardEliminationExampleWMarioKartData], we eliminated the duration variable, which resulted in a model with \(R_{adj}^2 = 0.7128\). Let’s see whether we should eliminate another variable from the model using backward elimination:
| Exclude duration and ... | \(R^2_{adj}\) |
|---|---|
| condnew | 0.6587 |
| stockphoto | 0.7124 |
| wheels | 0.3414 |
Should we eliminate any additional variable, and if so, which variable should we eliminate?
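The backward elimination decisions above reduce to comparing adjusted \(R^2\) values: drop the variable whose removal gives the largest \(R^2_{adj}\), provided that value beats the current model. A small Python sketch of one elimination step (the helper `backward_step` is our own name, and the \(R^2_{adj}\) values are hard-coded from the two tables above):

```python
def backward_step(current_r2adj, candidates):
    """Return the variable to drop, or None when no removal improves R^2_adj.

    candidates maps each variable to the adjusted R^2 of the model
    fit without that variable.
    """
    best_var = max(candidates, key=candidates.get)
    if candidates[best_var] > current_r2adj:
        return best_var
    return None

# Step 1: full model has R^2_adj = 0.7108.
step1 = backward_step(0.7108, {"condnew": 0.6626, "stockphoto": 0.7107,
                               "duration": 0.7128, "wheels": 0.3487})
# Step 2: model without duration has R^2_adj = 0.7128.
step2 = backward_step(0.7128, {"condnew": 0.6587, "stockphoto": 0.7124,
                               "wheels": 0.3414})
print(step1, step2)  # duration None
```

The second step returns `None` because every candidate \(R^2_{adj}\) is below 0.7128, so the procedure stops after dropping duration.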
[totPrPredictionUsedStockPhotoTwoWheels] After eliminating the auction’s duration from the model, we are left with the following reduced model:
\[\widehat{\text{price}} = 36.05 + 5.18 \times \text{condnew} + 1.12 \times \text{stockphoto} + 7.30 \times \text{wheels}\]
How much would you predict the total price to be for a Mario Kart game if it was used, the auction’s main photo was a stock photo, two Wii wheels were included, and the game was put up for auction during the time period when the Mario Kart data were collected?
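As an arithmetic check on this kind of prediction, a short Python sketch plugging the stated characteristics (condnew = 0, stockphoto = 1, wheels = 2) into the reduced-model estimates:

```python
# Reduced-model point estimates: intercept, condnew, stockphoto, wheels.
b0, b_cond, b_photo, b_wheels = 36.05, 5.18, 1.12, 7.30

# Used game (condnew = 0), stock photo (stockphoto = 1), two wheels.
price_hat = b0 + b_cond * 0 + b_photo * 1 + b_wheels * 2
print(f"predicted total price: ${price_hat:.2f}")  # predicted total price: $51.77
```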
Would you be surprised if the seller from Guided Practice [totPrPredictionUsedStockPhotoTwoWheels] didn’t get the exact price predicted?
Checking model conditions using graphs
Let’s take a closer look at the diagnostics for the Mario Kart model to check if the model we have identified is reasonable.
- Check for outliers. A histogram of the residuals is shown in Figure [mkDiagResHist]. With a data set of well over a hundred observations, we’re primarily looking for major outliers. While one minor outlier appears on the upper end, it is not a concern for this large a data set.
- Absolute values of residuals against fitted values. A plot of the absolute value of the residuals against their corresponding fitted values (\(\hat{y}_i\)) is shown in Figure [mkDiagnosticEvsAbsF]. We don’t see any obvious deviations from constant variance in this example.
- Residuals in order of their data collection. A plot of the residuals in the order their corresponding auctions were observed is shown in Figure [mkDiagnosticInOrder]. Here we see no structure that indicates a problem.
- Residuals against each predictor variable. We consider a plot of the residuals against the condnew variable, the residuals against the stockphoto variable, and the residuals against the wheels variable. These plots are shown in Figure [mkDiagnosticEvsVariables]. For the two-level condition variable, we are guaranteed not to see any remaining trend, and instead we are checking that the variability doesn’t fluctuate across groups, which it does not. However, looking at the stock photo variable, we find that there is some difference in the variability of the residuals in the two groups. Additionally, when we consider the residuals against the wheels variable, we see some possible structure. There appears to be curvature in the residuals, indicating the relationship is probably not linear.
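Every diagnostic in the list above is built from the same two ingredients: the fitted values and the residuals. A minimal NumPy sketch of how these could be computed is below; the data here are synthetic stand-ins (this section does not include the raw auction data), so only the mechanics, not the numbers, match the Mario Kart analysis.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in data: in practice X would hold the auction
# predictors (intercept column, condnew, stockphoto, wheels) and
# y the observed total prices.
n = 141
X = np.column_stack([np.ones(n),
                     rng.integers(0, 2, n),   # condnew indicator
                     rng.integers(0, 2, n),   # stockphoto indicator
                     rng.integers(0, 5, n)])  # number of wheels
y = X @ np.array([36.0, 5.2, 1.1, 7.3]) + rng.normal(0, 4.9, n)

# Least-squares fit, then the quantities used in each diagnostic plot:
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
fitted = X @ beta          # x-axis for the |residual| vs. fitted plot
residuals = y - fitted     # plotted against order, predictors, histogram

# The residuals average to (numerically) zero when an intercept is fit.
print(f"mean residual: {residuals.mean():.1e}")
```

From here, the four checks are just plots of `residuals` (or `np.abs(residuals)`) against a histogram, `fitted`, the collection order `np.arange(n)`, and each column of `X`.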
As with any regression analysis, we would summarize these diagnostics when reporting the model results. In the case of this auction data, we would report that there appears to be non-constant variance in the stock photo variable and that there may be a nonlinear relationship between the total price and the number of wheels included for an auction. This information would be important to buyers and sellers who may review the analysis, and omitting this information could be a setback to the very people who the model might assist.
Note: there are no exercises for this section.


