Home
Bookshelves
Advanced Statistics
Intermediate Statistics with R (Greenwood)
7: Simple Linear Regression Inference
7.7: Confidence interval for the mean and prediction intervals for a new observation

7.7: Confidence interval for the mean and prediction intervals for a new observation

Last updated
Save as PDF

Page ID: 33286

$ \newcommand{\vecs}[1]{\overset { \scriptstyle \rightharpoonup} {\mathbf{#1}} } $ $ \newcommand{\vecd}[1]{\overset{-\!-\!\rightharpoonup}{\vphantom{a}\smash {#1}}} $$\newcommand{\id}{\mathrm{id}}$ $ \newcommand{\Span}{\mathrm{span}}$ $ \newcommand{\kernel}{\mathrm{null}\,}$ $ \newcommand{\range}{\mathrm{range}\,}$ $ \newcommand{\RealPart}{\mathrm{Re}}$ $ \newcommand{\ImaginaryPart}{\mathrm{Im}}$ $ \newcommand{\Argument}{\mathrm{Arg}}$ $ \newcommand{\norm}[1]{\| #1 \|}$ $ \newcommand{\inner}[2]{\langle #1, #2 \rangle}$ $ \newcommand{\Span}{\mathrm{span}}$ $\newcommand{\id}{\mathrm{id}}$ $ \newcommand{\Span}{\mathrm{span}}$ $ \newcommand{\kernel}{\mathrm{null}\,}$ $ \newcommand{\range}{\mathrm{range}\,}$ $ \newcommand{\RealPart}{\mathrm{Re}}$ $ \newcommand{\ImaginaryPart}{\mathrm{Im}}$ $ \newcommand{\Argument}{\mathrm{Arg}}$ $ \newcommand{\norm}[1]{\| #1 \|}$ $ \newcommand{\inner}[2]{\langle #1, #2 \rangle}$ $ \newcommand{\Span}{\mathrm{span}}$$\newcommand{\AA}{\unicode[.8,0]{x212B}}$

Figure 7.7 provided a term-plot of the estimated regression line and a shaded area surrounding the estimated regression equation. Those shaded areas are based on connecting the dots on 95% confidence intervals constructed for the true mean $y$ value across all the $x$-values. To formalize this idea, consider a specific value of $x$, and call it $\boldsymbol{x_{\nu}}$ (pronounced x-new¹²⁹). Then the true mean response for this subpopulation (a subpopulation is all observations we could obtain at $\boldsymbol{x = x_{\nu}}$) is given by $\boldsymbol{E(Y) = \mu_\nu = \beta_0+\beta_1x_{\nu}}$. To estimate the mean response at $\boldsymbol{x_{\nu}}$, we plug $\boldsymbol{x_{\nu}}$ into the estimated regression equation:

\[\hat{\mu}_{\nu} = b_0 + b_1x_{\nu}.\]

To form the confidence interval, we appeal to our standard formula of $\textbf{estimate} \boldsymbol{\mp t^*}\textbf{SE}_{\textbf{estimate}}$. The standard error for the estimated mean at any $x$-value, denoted $\text{SE}_{\hat{\mu}_{\nu}}$, can be calculated as

\[\text{SE}_{\hat{\mu}_{\nu}} = \sqrt{\text{SE}^2_{b_1}(x_{\nu}-\bar{x})^2 + \frac{\hat{\sigma}^2}{n}}\]

where $\hat{\sigma}^2$ is the squared residual standard error. This formula combines the variability in the slope estimate, $\text{SE}_{b_1}$, scaled based on the distance of $x_{\nu}$ from $\bar{x}$ and the variability around the regression line, $\hat{\sigma}^2$. Fortunately, R’s predict function can be used to provide these results for us and avoid doing this calculation by hand most of the time. The confidence interval for $\boldsymbol{\mu_{\nu}}$, the population mean response at $x_{\nu}$, is

\[\boldsymbol{\hat{\mu}_{\nu} \mp t^*_{n-2}}\textbf{SE}_{\boldsymbol{\hat{\mu}_{\nu}}}.\]

In application, these intervals get wider the further we go from the mean of the $x\text{'s}$. These have interpretations that are exactly like those for the y-intercept:

For an $x$-value of $\boldsymbol{x_{\nu}}$, we are __% confident that the true mean of y is between LL and UL [units of y].

It is also useful to remember that this interpretation applies individually to every $x$ displayed in term-plots.

A second type of interval in this situation takes on a more challenging task – to place an interval on where we think a new observation will fall, called a prediction interval (PI). This PI will need to be much wider than the CI for the mean since we need to account for both the uncertainty in the mean and the randomness in sampling a new observation from the normal distribution centered at the true mean for $x_{\nu}$. The interval is centered at the estimated regression line (where else could we center it?) with the estimate denoted as $\hat{y}_{\nu}$ to help us see that this interval is for a new $y$ at this $x$-value. The $\text{SE}_{\hat{y}_{\nu}}$ incorporates the core of the previous SE calculation and adds in the variability of a new observation in $\boldsymbol{\hat{\sigma}^2}$:

\[\text{SE}_{\hat{y}_{\nu}} = \sqrt{\text{SE}^2_{b_1}(x_{\nu}-\bar{x})^2 + \dfrac{\hat{\sigma}^2}{n} + \boldsymbol{\hat{\sigma}^2}} = \sqrt{\text{SE}_{\hat{\mu}_{\nu}}^2 + \boldsymbol{\hat{\sigma}^2}}\]

The __% PI is calculated as

\[\boldsymbol{\hat{y}_{\nu} \mp t^*_{n-2}}\textbf{SE}_{\boldsymbol{\hat{y}_{\nu}}}\]

and interpreted as:

We are __% sure that a new observation at $\boldsymbol{x_{\nu}}$ will be between LL and UL [units of y].

The formula also helps us to see that

since $\text{SE}_{\hat{y}_{\nu}} > \text{SE}_{\hat{\mu}_{\nu}}$, the PI will always be wider than the CI.

As in confidence intervals, we assume that a 95% PI “succeeds” – now when it succeeds it contains the new observation – in 95% of applications of the methods and fails the other 5% of the time. Remember that for any interval estimate, the true value is either in the interval or it is not and our confidence level essentially sets our failure rate! Because PIs push into the tails of the assumed distribution of the responses, these methods are very sensitive to violations of assumptions. We should not use these if there are any concerns about violations of assumptions as they will not work as advertised (at the nominal (specified) level).

There are two ways to explore CIs for the mean and PIs for a new observation. The first is to focus on a specific $x$-value of interest. The second is to plot the results for all $x\text{'s}$. To do both of these, but especially to make plots, we want to learn to use the predict function. It can either produce the estimate for a particular $x_{\nu}$ and the $\text{SE}_{\hat{\mu}_{\nu}}$ or we can get it to directly calculate the CI and PI. The first way to use it is predict(MODELNAME, se.fit = T) which will provide fitted values and $\text{SE}_{\hat{\mu}_{\nu}}$ for all observed $x\text{'s}$. We can then use the $\text{SE}_{\hat{\mu}_{\nu}}$ to calculate $\text{SE}_{\hat{y}_{\nu}}$ and form our own PIs. If you want CIs, run predict(MODELNAME, interval = "confidence"); if you want PIs, run predict(MODELNAME, interval = "prediction"). If you want to do prediction at an $x$-value that was not in the original observations, add the option newdata = tibble(XVARIABLENAME_FROM_ORIGINAL_MODEL = Xnu) to the predict function call.

Some examples of using the predict function follow. For example, it might be interesting to use the regression model to find a 95% CI and PI for the Beers vs BAC study for a student who would consume 8 beers. Four different applications of the predict function follow. Note that lwr and upr in the output depend on what we requested. The first use of predict just returns the estimated mean for 8 beers:

m1 <- lm(BAC ~ Beers, data = BB)
predict(m1, newdata = tibble(Beers = 8))

##         1 
## 0.1310095

By turning on the se.fit = T option, we also get the SE for the confidence interval and degrees of freedom. Note that elements returned are labeled as $fit, $se.fit, etc. and provide some of the information to calculate CIs or PIs “by hand”.

predict(m1, newdata = tibble(Beers = 8), se.fit = T)

## $fit
##         1 
## 0.1310095 
## 
## $se.fit
## [1] 0.009204354
## 
## $df
## [1] 14
## 
## $residual.scale
## [1] 0.02044095

Instead of using the components of the intervals to make them, we can also directly request the CI or PI using the interval = ... option, as in the following two lines of code.

predict(m1, newdata = tibble(Beers = 8), interval = "confidence")

##         fit       lwr       upr
## 1 0.1310095 0.1112681 0.1507509

predict(m1, newdata = tibble(Beers = 8), interval = "prediction")

##         fit        lwr       upr
## 1 0.1310095 0.08292834 0.1790906

Based on these results, we are 95% confident that the true mean BAC for 8 beers consumed is between 0.111 and 0.15 grams of alcohol per dL of blood. For a new student drinking 8 beers, we are 95% sure that the observed BAC will be between 0.083 and 0.179 g/dL. You can see from these results that the PI is much wider than the CI – it has to capture a new individual’s results 95% of the time which is much harder than trying to capture the true mean at 8 beers consumed. For completeness, we should do these same calculations “by hand”. The predict(..., se.fit = T) output provides almost all the pieces we need to calculate the CI and PI. The $fit is the $\text{estimate} = \hat{\mu}_{\nu} = 0.131$, the $se.fit is the SE for the estimate of the $\text{mean} = \text{SE}_{\hat{\mu}_{\nu}} = 0.0092$ , $df is $n-2 = 16-2 = 14$, and $residual.scale is $\hat{\sigma} = 0.02044$. So we just need the $t^*$ multiplier for 95% confidence and 14 df:

qt(0.975, df = 14) #t* multiplier for 95% CI or 95% PI

## [1] 2.144787

The 95% CI for the true mean at $\boldsymbol{x_{\nu} = 8}$ is then:

0.131 + c(-1,1)*2.1448*0.0092

## [1] 0.1112678 0.1507322

which matches the previous output quite well.

The 95% PI requires the calculation of $\sqrt{\text{SE}_{\hat{\mu}_{\nu}}^2 + \boldsymbol{\hat{\sigma}^2}} = \sqrt{(0.0092)^2+(0.02044)^2} = 0.0224$.

sqrt(0.0092^2 + 0.02044^2)

## [1] 0.02241503

The 95% PI at $\boldsymbol{x_{\nu} = 8}$ is

0.131 + c(-1,1)*2.1448*0.0224

## [1] 0.08295648 0.17904352

These calculations are “fun” and informative but displaying these results for all $x$-values is a bit more informative about the performance of the two types of intervals and for results we might expect in this application. The calculations we just performed provide endpoints of both intervals at Beers = 8. To make this plot, we need to create a sequence of Beers values to get other results for, say from 0 to 10 beers, using the seq function. The seq function requires three arguments, that the endpoints (from and to) are defined and the length.out, which defines the resolution of the grid of equally spaced points to create. Here, length.out = 30 provides 30 points evenly spaced between 0 and 10 and is more than enough to make the confidence and prediction intervals from 0 to 10 Beers.

# Creates vector of predictor values from 0 to 10
beerf <- seq(from = 0, to = 10, length.out = 30) 
head(beerf, 6)

## [1] 0.0000000 0.3448276 0.6896552 1.0344828 1.3793103 1.7241379

tail(beerf, 6)

## [1]  8.275862  8.620690  8.965517  9.310345  9.655172 10.000000

Now we can call the predict function at the values stored in beerf to get the CIs across that range of Beers values:

BBCI <- as_tibble(predict(m1, newdata = tibble(Beers = beerf), interval = "confidence"))
head(BBCI)

## # A tibble: 6 × 3
##         fit      lwr    upr
##       <dbl>    <dbl>  <dbl>
## 1 -0.0127   -0.0398  0.0144
## 2 -0.00651  -0.0320  0.0190
## 3 -0.000312 -0.0242  0.0236
## 4  0.00588  -0.0165  0.0282
## 5  0.0121   -0.00873 0.0329
## 6  0.0183   -0.00105 0.0376

And the PIs:

BBPI <- as_tibble(predict(m1, newdata = tibble(Beers = beerf), interval = "prediction"))
head(BBPI)

## # A tibble: 6 × 3
##         fit     lwr    upr
##       <dbl>   <dbl>  <dbl>
## 1 -0.0127   -0.0642 0.0388
## 2 -0.00651  -0.0572 0.0442
## 3 -0.000312 -0.0502 0.0496
## 4  0.00588  -0.0433 0.0551
## 5  0.0121   -0.0365 0.0606
## 6  0.0183   -0.0296 0.0662

To visualize these results as shown in Figure 7.23, we need to work to combine some of the previous results into a common tibble, called modelresB, using the bind_cols function that allows multiple columns to be put together. Because some of the names are the same in the BBCI and BBPI objects and were awkwardly given unique names, there is an additional step to rename the columns using the rename function. The rename function changes the name to what is provided before the = for the column identified after the = (think of it like mutate except that it does not create a new variable). The layers in the plot start with adding a line for the fitted values (solid, using geom_line) based on the information in modelresB. We also introduce the geom_ribbon explicitly for the first time¹³⁰ to plot our confidence and prediction intervals. It allows plotting of a region (and its edges) defined by ymin and ymax across the values provided to x. I also wanted to include the original observations, but they are stored in a different tibble (BB), so the geom_point needs to be explicitly told to use a different data set for its contribution to the plot with data = BB along with its own local aesthetic with x and y selections from the original variables.

Figure 7.23: Estimated SLR for BAC data with fitted values (solid line), 95% confidence (darker, dashed lines), and 95% prediction (lighter, dotted lines) intervals.

# Patch the beerf vector, fits (just one version), and intervals from BBCI and 
# BBPI together with bind_cols:
modelresB <- bind_cols(beerf = tibble(beerf), BBCI, BBPI %>% select(-fit))

# Rename CI and PI limits to have more explicit column names:
modelresB <- modelresB %>% rename(lwr_CI = lwr...3, upr_CI = upr...4, 
                                  lwr_PI = lwr...5, upr_PI = upr...6)

modelresB %>% ggplot() + 
  geom_line(aes(x = beerf, y = fit), lwd = 1) +
  geom_ribbon(aes(x = beerf, ymin = lwr_CI, ymax = upr_CI), alpha = .4, 
              fill = "beige", color = "darkred", lty = 2, lwd = 1) +
  geom_ribbon(aes(x = beerf, ymin = lwr_PI, ymax = upr_PI), alpha = .1, 
              fill = "gray80", color = "grey", lty = 3, lwd = 1.5) +
  geom_point(data = BB, mapping = aes(x = Beers, y = BAC)) + 
  labs(y = "BAC", x = "Beers", 
       title = "Scatterplot with estimated regression line and 95% CI and PI") + 
  theme_bw()

More importantly, note that the CI in Figure 7.23 clearly shows widening as we move further away from the mean of the $x\text{'s}$ to the edges of the observed $x$-values. This reflects a decrease in knowledge of the true mean as we move away from the mean of the $x\text{'s}$. The PI also is widening slightly but not as clearly in this situation. The difference in widths in the two types of intervals becomes extremely clear when they are displayed together, with the PI much wider than the CI for any $x$-value.

Similarly, the 95% CI and PIs for the Bozeman yearly average maximum temperatures in Figure 7.24 provide interesting information on the uncertainty in the estimated mean temperature over time. It is also interesting to explore how many of the observations fall within the 95% prediction intervals. The PIs are for new observations, but you can see how the PIs that were constructed to contain almost all the observations in the original data set but not all of them. In fact, only 2 of the 109 observations (1.8%) fall outside the 95% PIs. Since the PI needs to be concerned with unobserved new observations it makes sense that it might contain more than 95% of the observations used to make it.

temp1 <- lm(meanmax ~ Year, data = bozemantemps)
Yearf <- seq(from = 1901, to = 2014, length.out = 75)
 
TCI <- as_tibble(predict(temp1, newdata = tibble(Year = Yearf), interval = "confidence"))

TPI <- as_tibble(predict(temp1, newdata = tibble(Year = Yearf), interval = "prediction"))

# Patch the Yearf vector, fits (just one version), and intervals from TCI and 
# TPI together with bind_cols:
modelresT <- bind_cols(Yearf = tibble(Yearf), TCI, TPI %>% select(-fit))

# Rename CI and PI limits to have more explicit column names:
modelresT <- modelresT %>% rename(lwr_CI = lwr...3, upr_CI = upr...4, 
                                  lwr_PI = lwr...5, upr_PI = upr...6)

modelresT %>% ggplot() + 
  geom_line(aes(x = Yearf, y = fit), lwd = 1) +
  geom_ribbon(aes(x = Yearf, ymin = lwr_CI, ymax = upr_CI), alpha = .4, 
              fill = "beige", color = "darkred", lty = 2, lwd = 1) +
  geom_ribbon(aes(x = Yearf, ymin = lwr_PI, ymax = upr_PI), alpha = .1, 
              fill = "gray80", color = "grey", lty = 3, lwd = 1.5) +
  geom_point(data = bozemantemps, mapping = aes(x = Year, y = meanmax)) + 
  labs(y = "degrees F", x = "Year", 
       title = "Scatterplot with estimated regression line and 95% CI and PI") + 
  theme_bw()

Figure 7.24: Estimated SLR for Bozeman temperature data with 95% confidence (dashed lines) and 95% prediction (lighter, dotted lines) intervals.

We can also use these same methods to do a prediction for the year after the data set ended, 2015, and in 2050:

predict(temp1, newdata = tibble(Year = 2015), interval = "confidence")

##        fit     lwr      upr
## 1 58.31967 57.7019 58.93744

predict(temp1, newdata = tibble(Year = 2015), interval = "prediction")

##        fit      lwr      upr
## 1 58.31967 55.04146 61.59787

predict(temp1, newdata = tibble(Year = 2050), interval = "confidence")

##        fit      lwr      upr
## 1 60.15514 59.23631 61.07397

predict(temp1, newdata = tibble(Year = 2050), interval = "prediction")

##        fit      lwr      upr
## 1 60.15514 56.80712 63.50316

These results tell us that we are 95% confident that the true mean yearly average maximum temperature in 2015 is (I guess “was”) between 55.04$^\circ F$ and 61.6$^\circ F$. And we are 95% sure that the observed yearly average maximum temperature in 2015 will be (I guess “would have been”) between 59.2$^\circ F$ and 61.1$^\circ F$. Obviously, 2015 has occurred, but since the data were not published when the data set was downloaded in July 2016, we can probably best treat 2015 as a potential “future” observation. The results for 2050 are clearly for the future mean and a new observation¹³¹ in 2050. Note that up to 2014, no values of this response had been observed above 60$^\circ F$ and the predicted mean in 2050 is over 60$^\circ F$ if the trend persists. It is easy to criticize the use of this model for 2050 because of its extreme amount of extrapolation.