# 6.14: Practice problems


6.1. **Treadmill data analysis** These questions revisit the treadmill data set from Chapter 1. Researchers were interested in whether the run test variable could be used to replace the treadmill oxygen consumption variable, which is expensive to measure. The following code loads the data set and provides a scatterplot matrix using `ggpairs` for all variables except the subject identifier variable, which was in the first column and is removed by `select(-1)`.

```
# read_csv, select, and the %>% pipe come from the tidyverse;
# ggpairs is provided by the GGally package
library(tidyverse)
library(GGally)
treadmill <- read_csv("http://www.math.montana.edu/courses/s217/documents/treadmill.csv")
treadmill %>% select(-1) %>% ggpairs()
```

6.1.1. First, we should get a sense of the strength of the correlation between the variable of primary interest, `TreadMillOx`, and the other variables, and consider whether outliers or nonlinearity are going to be major issues here. Which variable is it most strongly correlated with? Which variables are next most strongly correlated with this variable?
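One way to extract the numbers behind the scatterplot matrix is to compute the correlation matrix directly. A minimal sketch, assuming the `treadmill` data frame has been loaded as above (with the tidyverse attached) and that column 1 is the subject identifier:

```
# Correlation matrix for all quantitative variables
# (assumes treadmill was loaded as above; column 1 is the subject ID)
treadmill %>%
  select(-1) %>%
  cor() %>%
  round(2)
```

Scanning the `TreadMillOx` row (or column) of this matrix shows the same correlations displayed in the upper panels of the `ggpairs` plot.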

6.1.2. Fit the SLR using `RunTime` as the explanatory variable for `TreadMillOx`. Report the estimated model.
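A minimal sketch of fitting and summarizing the SLR, assuming the `treadmill` data frame from the code above:

```
# Simple linear regression of TreadMillOx on RunTime
# (assumes the treadmill data frame loaded above)
m1 <- lm(TreadMillOx ~ RunTime, data = treadmill)
summary(m1)  # estimated coefficients, R^2, and more
```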

6.1.3. Predict the treadmill oxygen value for a subject with a run time of 14 minutes. Repeat for a subject with a run time of 16 minutes. Is there something different about these two predictions?
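These predictions can be obtained with `predict`; a sketch, assuming `m1` is the SLR fit from the previous question:

```
# Predict TreadMillOx at run times of 14 and 16 minutes
# (assumes m1 <- lm(TreadMillOx ~ RunTime, data = treadmill))
predict(m1, newdata = data.frame(RunTime = c(14, 16)))
# Checking the observed range of RunTime helps decide whether
# either prediction involves extrapolation:
range(treadmill$RunTime)
```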

6.1.4. Interpret the slope coefficient from the estimated model, remembering the units on the variables.

6.1.5. Report and interpret the \(y\)-intercept from the SLR.

6.1.6. Report and interpret the \(R^2\) value from the output. Show how you can find this value from the original correlation matrix result.
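For an SLR, \(R^2\) is the squared correlation between the response and the explanatory variable, which can be verified numerically (again assuming `m1` from above):

```
# R^2 equals the squared correlation in SLR
# (assumes m1 <- lm(TreadMillOx ~ RunTime, data = treadmill))
r <- cor(treadmill$TreadMillOx, treadmill$RunTime)
r^2
summary(m1)$r.squared  # should match r^2
```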

6.1.7. Produce the diagnostic plots and discuss any potential issues. What is the approximate leverage of the highest leverage observation and how large is its Cook’s D? What does that tell you about its potential influence in this model?
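A sketch of the diagnostics, once more assuming `m1` is the fitted SLR:

```
# Diagnostic plots plus leverage and Cook's D
# (assumes m1 <- lm(TreadMillOx ~ RunTime, data = treadmill))
par(mfrow = c(2, 2))
plot(m1)                 # the four default lm diagnostic plots
max(hatvalues(m1))       # largest leverage
max(cooks.distance(m1))  # largest Cook's D
```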

### References

*Manipulate: Interactive Plots for RStudio*. https://CRAN.R-project.org/package=manipulate.

*Applied Statistics* 39: 357–65.

*Biological Conservation* 121 (3): 453–64.

*spuRs: Functions and Datasets for "Introduction to Scientific Programming and Simulation Using R"*. https://CRAN.R-project.org/package=spuRs.

*MASS: Support Functions and Datasets for Venables and Ripley’s MASS*. http://www.stats.ox.ac.uk/pub/MASS4/.

*Tigerstats: R Functions for Elementary Statistics*. https://CRAN.R-project.org/package=tigerstats.

*GGally: Extension to Ggplot2*. https://CRAN.R-project.org/package=GGally.

*PLOS ONE* 9 (9): 13. http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0105074.

*Corrplot: Visualization of a Correlation Matrix*. https://github.com/taiyun/corrplot.

*Applied Linear Regression, Fourth Edition*. Hoboken, NJ: Wiley.

*Alr4: Data to Accompany Applied Linear Regression 4th Edition*. http://www.z.umn.edu/alr4ed.

*Mgcv: Mixed GAM Computation Vehicle with Automatic Smoothness Estimation*. https://CRAN.R-project.org/package=mgcv.

- There are measures of correlation between categorical variables but when statisticians say correlation they mean correlation of quantitative variables. If they are discussing correlations of other types, they will make that clear.↩︎
- Some of the details of this study have been lost, so we will assume that the subjects were randomly assigned and that a beer means a regular sized can of beer and that the beer was of regular strength. We don’t know if any of that is actually true. It would be nice to repeat this study to know more details and possibly have a larger sample size but I doubt if our institutional review board would allow students to drink as much as 9 beers.↩︎
- This interface with the `cor` function only works after you load the `mosaic` package.↩︎
- The natural log (\(\log_e\) or \(\ln\)) is used in statistics so much that the function `log` in R actually takes the natural log, and if you want a \(\log_{10}\) you have to use the function `log10`. When statisticians say log, we mean natural log.↩︎
- We will not use the “significance stars” in the plot that display with the estimated correlations. You can ignore them, but we will sometimes remove them from the plot by using the more complex code of `ggpairs(upper = list(continuous = GGally::wrap(ggally_cor, stars = F)))`.↩︎
- The `end = 0.7` is used to avoid the lightest yellow color in the gradient, which is often hard to see.↩︎
- This is related to what is called Simpson’s paradox, where the overall analysis (ignoring a grouping variable) leads to a conclusion of a relationship in one direction, but when the relationship is broken down into subgroups, it is in the opposite direction in each group. This emphasizes the importance of checking and accounting for differences in groups and points toward the more complex models we are setting the stage to consider in the coming chapters.↩︎
- The interval is “far” from the reference value under the null (0), so this provides at least strong evidence. When using confidence intervals for tests, we really don’t learn much about the strength of evidence against the null hypothesis, but the hypothesis test here is a bit more complicated to construct and understand, so we will have to tolerate having only crude information about the p-value to assess strength of evidence.↩︎
- Observations at the edge of the \(x\text{'s}\) will be called high leverage points in Section 6.9; this point is a low leverage point because it is close to the mean of the \(x\text{'s}\).↩︎
- Even with clear scientific logic, we sometimes make choices to flip the model directions to facilitate different types of analyses. In Vsevolozhskaya et al. (2014) we looked at genomic differences based on obesity groups, even though we were really interested in exploring how gene-level differences explained differences in obesity.↩︎
- The residuals from these methods and ANOVA are the same because they all come from linear models but are completely different from the standardized residuals used in the Chi-square material in Chapter 5.↩︎