
8.17: Practice problems


    8.1. Treadmill data analysis The original research goal for the treadmill data set used in the practice problems of the last two chapters was to replace the costly treadmill oxygen test with a cheap-to-obtain running time measurement. Quite a few other variables were actually measured when the run time was recorded, so maybe we can replace the treadmill test result with a combined prediction built from a few of those variables using MLR techniques. The following code will get us re-started in this situation.

    library(tidyverse) # loads readr (read_csv), dplyr (mutate), and ggplot2 used below
    treadmill <- read_csv("http://www.math.montana.edu/courses/s217/documents/treadmill.csv")
    tm1 <- lm(TreadMillOx ~ RunTime, data = treadmill)

    8.1.1. Fit the MLR that also includes the running pulse (RunPulse), the resting pulse (RestPulse), body weight (BodyWeight), and age (Age) of the subjects. Report and interpret the \(R^2\) for this model.
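
    A minimal sketch of this fit, assuming the full model is saved under the hypothetical name m2:

    m2 <- lm(TreadMillOx ~ RunTime + RunPulse + RestPulse + BodyWeight + Age, 
             data = treadmill)
    summary(m2) # R-squared and adjusted R-squared appear near the bottom of the output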

    8.1.2. Compare the \(R^2\) and the adjusted \(R^2\) to the results for the SLR model that just had RunTime in the model. What do these results suggest?

    8.1.3. Interpret the estimated RunTime slope coefficients from the SLR model and this MLR model. Explain the differences in the estimates.

    8.1.4. Find the VIFs for this model and discuss whether there is an issue with multicollinearity noted in these results.
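
    One way to obtain the VIFs is the vif function from the car package; a sketch, again assuming the full model was named m2:

    library(car)
    vif(m2) # common rules of thumb flag values above 5 or 10 as multicollinearity concerns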

    8.1.5. Report the value for the overall \(F\)-test for the MLR model and interpret the result.
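
    The overall \(F\)-test is reported in the last line of the summary output; as a sketch (using the hypothetical m2 from above), it can also be reproduced by comparing the full model to the mean-only model:

    m0 <- lm(TreadMillOx ~ 1, data = treadmill)
    anova(m0, m2) # overall F-test for the MLR model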

    8.1.6. Drop the variable with the largest p-value in the MLR model and re-fit it. Compare the resulting \(R^2\) and adjusted \(R^2\) values to the others found previously.
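
    A sketch using update to drop a term; RestPulse is only a hypothetical choice here, so drop whichever variable actually has the largest p-value in your output:

    m3 <- update(m2, . ~ . - RestPulse) # replace RestPulse with your largest p-value variable
    summary(m3)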

    8.1.7. Use the dredge function as follows to consider some other potential reduced models and report the top two models according to adjusted \(R^2\) values. What model had the highest \(R^2\)? Also discuss and compare the model selection results provided by the delta AICs here.

    library(MuMIn)
    options(na.action = "na.fail") #Must run this code once to use dredge
    dredge(MODELNAMEFORFULLMODEL, rank = "AIC", 
           extra = c("R^2", adjRsq = function(x) summary(x)$adj.r.squared))
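
    For example, if the full model from 8.1.1 was stored under the hypothetical name m2, the call would look like:

    dredge(m2, rank = "AIC", 
           extra = c("R^2", adjRsq = function(x) summary(x)$adj.r.squared))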

    8.1.8. For one of the models, interpret the Age slope coefficient. Remember that only male subjects between 38 and 57 participated in this study. Discuss how this might have impacted the results found as compared to a more general population that could have been sampled from.

    8.1.9. The following code creates a new three-level variable grouping the ages into low, middle, and high for those observed. The scatterplot lets you explore whether the relationship between treadmill oxygen and run time might differ across the age groups.

    treadmill <- treadmill %>% mutate(Ageb = factor(cut(Age, breaks = c(37, 44.5, 50.5, 58))))
    summary(treadmill$Ageb)
    treadmill %>% ggplot(mapping = aes(x = RunTime, y = TreadMillOx, 
                                       color = Ageb, shape = Ageb)) + 
      geom_point(size = 1.5, alpha = 0.5) +
      geom_smooth(method = "lm") +
      theme_bw() +
      scale_color_viridis_d(end = 0.8) + 
      facet_grid(rows = vars(Ageb))

    Based on the plot, do the lines look approximately parallel or not?

    8.1.10. Fit the MLR that contains a RunTime by Ageb interaction – do not include any other variables. Compare the \(R^2\) and adjusted \(R^2\) results to previous models.
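
    A sketch of the interaction model fit, using a hypothetical model name m_int:

    m_int <- lm(TreadMillOx ~ RunTime * Ageb, data = treadmill) # RunTime by Ageb interaction
    summary(m_int)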

    8.1.11. Find and report the results for the \(F\)-test that assesses evidence relative to the need for different slope coefficients.
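
    One option, consistent with the use of the Anova function from the car package elsewhere in this material, assuming the m_int model from above:

    library(car)
    Anova(m_int) # the RunTime:Ageb row provides the F-test for different slopes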

    8.1.12. Write out the overall estimated model. What level was R using as baseline? Write out the simplified model for two of the age levels. Make an effects plot and discuss how it matches the simplified models you generated.
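
    A sketch of an effects plot using the effects package (Fox 2003), again assuming the interaction model is named m_int:

    library(effects)
    plot(allEffects(m_int)) # estimated RunTime relationship displayed for each Ageb level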

    8.1.13. Fit the additive model with RunTime and Ageb and predict the mean treadmill oxygen values for subjects with run times of 11 minutes in each of the three Ageb groups.
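
    A sketch of these predictions, assuming the additive model is saved under the hypothetical name m_add:

    m_add <- lm(TreadMillOx ~ RunTime + Ageb, data = treadmill)
    newd <- data.frame(RunTime = 11, Ageb = levels(treadmill$Ageb))
    predict(m_add, newdata = newd) # estimated mean TreadMillOx at RunTime = 11 for each Ageb group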

    8.1.14. Find the \(F\)-test results for the binned age variable in the additive model. Report and interpret those results.
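
    One option is again the Anova function from the car package, assuming the m_add model from above:

    library(car)
    Anova(m_add) # F-test for Ageb after accounting for RunTime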

    References

    Akaike, Hirotugu. 1974. “A New Look at the Statistical Model Identification.” IEEE Transactions on Automatic Control 19: 716–23.
    Bartoń, Kamil. 2022. MuMIn: Multi-Model Inference. https://CRAN.R-project.org/package=MuMIn.
    Burnham, Kenneth P., and David R. Anderson. 2002. Model Selection and Multimodel Inference. NY: Springer.
    Çetinkaya-Rundel, Mine, David Diez, Andrew Bray, Albert Y. Kim, Ben Baumer, Chester Ismay, Nick Paterno, and Christopher Barr. 2022. Openintro: Data Sets and Supplemental Functions from OpenIntro Textbooks and Labs. https://CRAN.R-project.org/package=openintro.
    De Veaux, Richard D., Paul F. Velleman, and David E. Bock. 2011. Stats: Data and Models, 3rd Edition. Pearson.
    Fox, John. 2003. “Effect Displays in R for Generalised Linear Models.” Journal of Statistical Software 8 (15): 1–27. http://www.jstatsoft.org/v08/i15/.
    Fox, John, and Michael Friendly. 2021. Heplots: Visualizing Hypothesis Tests in Multivariate Linear Models. http://friendly.github.io/heplots/.
    ———. 2022b. carData: Companion to Applied Regression Data Sets. https://CRAN.R-project.org/package=carData.
    Garnier, Simon. 2021. Viridis: Colorblind-Friendly Color Maps for R. https://CRAN.R-project.org/package=viridis.
    Liao, Xiyue, and Mary C. Meyer. 2014. “Coneproj: An R Package for the Primal or Dual Cone Projections with Routines for Constrained Regression.” Journal of Statistical Software 61 (12): 1–22. http://www.jstatsoft.org/v61/i12/.
    Merkle, Ed, and Michael Smithson. 2018. Smdata: Data to Accompany Smithson & Merkle, 2013. https://CRAN.R-project.org/package=smdata.
    Meyer, Mary C., and Xiyue Liao. 2021. Coneproj: Primal or Dual Cone Projections with Routines for Constrained Regression. https://CRAN.R-project.org/package=coneproj.
    Ramsey, Fred, and Daniel Schafer. 2012. The Statistical Sleuth: A Course in Methods of Data Analysis. Cengage Learning. https://books.google.com/books?id=eSlLjA9TwkUC.

    1. If you take advanced applied mathematics courses, you can learn more about the algorithms being used by lm. Everyone else only cares about the algorithms when they don’t work – which is usually due to the user’s inputs in these models, not the algorithm itself.↩︎
    2. Sometimes the effects plots ignore the edge explanatory observations with the default display. Always check the original variable summaries when considering the range of observed values. By turning on the “partial residuals” with SLR models, the plots show the original observations along with the fitted values and 95% confidence interval band. In more complex models, these displays with residuals are more complicated but can be used to assess linearity with each predictor in the model after accounting for other variables.↩︎
    3. We used this same notation in fitting the additive Two-Way ANOVA model, and this model is also additive in terms of these variables. Interaction models are discussed later in the chapter.↩︎
    4. I have not given you a formula for calculating partial residuals. We will leave that for more advanced material.↩︎
    5. Imagine showing up to a ski area expecting a 40 inch base and there only being 11 inches. I’m sure ski areas are always more accurate than this model in their reporting of amounts of snow on the ground…↩︎
    6. The site name is redacted to protect the innocence of the reader. More information on this site, located in Beaverhead County in Montana, is available at www.wcc.nrcs.usda.gov/nwcc/site?sitenum=355&state=mt.↩︎
    7. Term-plots with additive factor variables use the weighted (based on the percentage of the responses in each category) average of their predicted mean responses across their levels, but we don’t have any factor variables in the MLR models yet.↩︎
    8. This also applies to the additive two-way ANOVA model.↩︎
    9. The seq function has syntax of seq(from = startingpoint, to = endingpoint, length.out = #ofvalues_between_start_and_end) and the rep function has syntax of rep(numbertorepeat, #oftimes).↩︎
    10. Also see Section 8.13 for another method of picking among different models.↩︎
    11. This section was inspired by a similar section from De Veaux, Velleman, and Bock (2011).↩︎
    12. There are some social science models where the model is fit with the mean subtracted from each predictor so all have mean 0 and the precision of the \(y\)-intercept is interesting. In some cases both the response and predictor variables are “standardized” to have means of 0 and standard deviations of 1. The interpretations of coefficients then relates to changes in standard deviations around the means. These coefficients are called “standardized betas”. But even in these models where the \(x\)-values of 0 are of interest, the test for the \(y\)-intercept being 0 is rarely of interest.↩︎
    13. The variables were renamed to better interface with R code and our book formatting using the rename function.↩︎
    14. The answer is no – it should be converted to a factor variable prior to plotting so it can be displayed correctly by ggpairs, but was intentionally left this way so you could see what happens when numerically coded categorical variables are not carefully handled in R.↩︎
    15. Either someone had a weighted GPA with bonus points, or more likely here, there was a coding error in the data set since only one observation was over 4.0 in the GPA data. Either way, we could remove it and note that our inferences for HSGPA do not extend above 4.0.↩︎
    16. When there are just two predictors, the VIFs have to be the same since the proportion of information shared is the same in both directions. With more than two predictors, each variable can have a different VIF value.↩︎
    17. We are actually making an educated guess about what these codes mean. Other similar data sets used 1 for males but the documentation on these data is a bit sparse. We proceed with a small potential that the conclusions regarding differences in gender are in the wrong direction.↩︎
    18. Some people also call them dummy variables to reflect that they are stand-ins for dealing with the categorical information. But it seems like a harsh anthropomorphism so I prefer “indicators”.↩︎
    19. This is true for additive uses of indicator variables. In Section 8.11, we consider interactions between quantitative and categorical variables, which have the effect of changing slopes and intercepts. The simplification ideas to produce estimated equations for each group are used there but we have to account for changing slopes by group too.↩︎
    20. Models like this with a categorical variable and quantitative variable are often called ANCOVA or analysis of covariance models but really are just versions of our linear models we’ve been using throughout this material.↩︎
    21. The scale_color_viridis_d(end = 0.85, option = "inferno") code makes the plot in a suite of four colors from the viridis package (Garnier 2021) that attempt to be color-blind friendly.↩︎
    22. The strength of this recommendation drops when you have many predictors as you can’t do this for every variable, but the concern remains about an assumption of no interaction whenever you fit models without them. In more complex situations, think about variables that are most likely to interact in their impacts on the response based on the situation being studied and try to explore those.↩︎
    23. Standardizing quantitative predictor variables is popular in social sciences, often where the response variable is also standardized. In those situations, they generate what are called “standardized betas” (https://en.Wikipedia.org/wiki/Standardized_coefficient) that estimate the change in SDs in the response for a 1 SD increase in the explanatory variable.↩︎
    24. There is a way to test for a difference in the two lines at a particular \(x\) value but it is beyond the scope of this material.↩︎
    25. This is an example of what is called “step down” testing for model refinement which is a commonly used technique for arriving at a final model to describe response variables. Note that each step in the process should be reported, not just the final model that only has variables with small p-values remaining in it.↩︎
    26. We could also use the anova function to do this but using Anova throughout this material provides the answers we want in the additive model and it has no impact for the only test of interest in the interaction model since the interaction is the last component in the model.↩︎
    27. In most situations, it would be crazy to assume that the true model for a process has been obtained so we can never pick the “correct” model. In fact, we won’t even know if we are picking a “good” model, but just the best from a set of the candidate models on a criterion. But we can study the general performance of methods using simulations where we know the true model and the AIC has some useful properties in identifying the correct model when it is in the candidate set of models. No such similar theory exists for the adjusted \(R^2\).↩︎
    28. Most people now call this Akaike’s (pronounced ah-kah-ee-kay) Information Criterion, but he used the AIC nomenclature to mean An Information Criterion – he was not so vain as to name the method after himself in the original paper that proposed it. But it is now common to use “A” for his last name.↩︎
    29. More details on these components of the methods will be left for more advanced material – we will focus on an introduction to using the AIC measure here.↩︎
    30. Although sometimes excluded, the count of parameters should include counting the residual variance as a parameter.↩︎
    31. It makes it impossible to fit models with any missing values in the data set and this prevents you from making incorrect comparisons of AICs to models with different observations.↩︎
    32. We put quotes on “full” or sometimes call it the “fullish” model because we could always add more to the model, like interactions or other explanatory variables. So we rarely have a completely full model but we do have our “most complicated that we are considering” model.↩︎
    33. The options in extra = ... are to get extra information displayed that you do not necessarily need. You can simply run dredge(m6, rank = "AIC") to get just the AIC results.↩︎
