# 7.7: Confidence interval for the mean and prediction intervals for a new observation

- Page ID
- 33286

\( \newcommand{\vecs}[1]{\overset { \scriptstyle \rightharpoonup} {\mathbf{#1}} } \)

\( \newcommand{\vecd}[1]{\overset{-\!-\!\rightharpoonup}{\vphantom{a}\smash {#1}}} \)

\( \newcommand{\id}{\mathrm{id}}\) \( \newcommand{\Span}{\mathrm{span}}\)

( \newcommand{\kernel}{\mathrm{null}\,}\) \( \newcommand{\range}{\mathrm{range}\,}\)

\( \newcommand{\RealPart}{\mathrm{Re}}\) \( \newcommand{\ImaginaryPart}{\mathrm{Im}}\)

\( \newcommand{\Argument}{\mathrm{Arg}}\) \( \newcommand{\norm}[1]{\| #1 \|}\)

\( \newcommand{\inner}[2]{\langle #1, #2 \rangle}\)

\( \newcommand{\Span}{\mathrm{span}}\)

\( \newcommand{\id}{\mathrm{id}}\)

\( \newcommand{\Span}{\mathrm{span}}\)

\( \newcommand{\kernel}{\mathrm{null}\,}\)

\( \newcommand{\range}{\mathrm{range}\,}\)

\( \newcommand{\RealPart}{\mathrm{Re}}\)

\( \newcommand{\ImaginaryPart}{\mathrm{Im}}\)

\( \newcommand{\Argument}{\mathrm{Arg}}\)

\( \newcommand{\norm}[1]{\| #1 \|}\)

\( \newcommand{\inner}[2]{\langle #1, #2 \rangle}\)

\( \newcommand{\Span}{\mathrm{span}}\) \( \newcommand{\AA}{\unicode[.8,0]{x212B}}\)

\( \newcommand{\vectorA}[1]{\vec{#1}} % arrow\)

\( \newcommand{\vectorAt}[1]{\vec{\text{#1}}} % arrow\)

\( \newcommand{\vectorB}[1]{\overset { \scriptstyle \rightharpoonup} {\mathbf{#1}} } \)

\( \newcommand{\vectorC}[1]{\textbf{#1}} \)

\( \newcommand{\vectorD}[1]{\overrightarrow{#1}} \)

\( \newcommand{\vectorDt}[1]{\overrightarrow{\text{#1}}} \)

\( \newcommand{\vectE}[1]{\overset{-\!-\!\rightharpoonup}{\vphantom{a}\smash{\mathbf {#1}}}} \)

\( \newcommand{\vecs}[1]{\overset { \scriptstyle \rightharpoonup} {\mathbf{#1}} } \)

\( \newcommand{\vecd}[1]{\overset{-\!-\!\rightharpoonup}{\vphantom{a}\smash {#1}}} \)

Figure 7.7 provided a term-plot of the estimated regression line and a shaded area surrounding the estimated regression equation. Those shaded areas are based on connecting the dots on 95% confidence intervals constructed for the true mean \(y\) value across all the \(x\)-values. To formalize this idea, consider a specific value of \(x\), and call it \(\boldsymbol{x_{\nu}}\) (pronounced **x-new**^{129}). Then the true mean response for this ** subpopulation** (a subpopulation is all observations we could obtain at \(\boldsymbol{x = x_{\nu}}\)) is given by \(\boldsymbol{E(Y) = \mu_\nu = \beta_0+\beta_1x_{\nu}}\). To estimate the mean response at \(\boldsymbol{x_{\nu}}\), we plug \(\boldsymbol{x_{\nu}}\) into the estimated regression equation:

\[\hat{\mu}_{\nu} = b_0 + b_1x_{\nu}.\]

To form the confidence interval, we appeal to our standard formula of \(\textbf{estimate} \boldsymbol{\mp t^*}\textbf{SE}_{\textbf{estimate}}\). The ** standard error for the estimated mean at any \(x\)-value**, denoted \(\text{SE}_{\hat{\mu}_{\nu}}\), can be calculated as

\[\text{SE}_{\hat{\mu}_{\nu}} = \sqrt{\text{SE}^2_{b_1}(x_{\nu}-\bar{x})^2 + \frac{\hat{\sigma}^2}{n}}\]

where \(\hat{\sigma}^2\) is the squared residual standard error. This formula combines the variability in the slope estimate, \(\text{SE}_{b_1}\), scaled based on the distance of \(x_{\nu}\) from \(\bar{x}\) and the variability around the regression line, \(\hat{\sigma}^2\). Fortunately, R’s `predict`

function can be used to provide these results for us and avoid doing this calculation by hand most of the time. The ** confidence interval for** \(\boldsymbol{\mu_{\nu}}\), the population mean response at \(x_{\nu}\), is

\[\boldsymbol{\hat{\mu}_{\nu} \mp t^*_{n-2}}\textbf{SE}_{\boldsymbol{\hat{\mu}_{\nu}}}.\]

In application, these intervals get wider the further we go from the mean of the \(x\text{'s}\). These have interpretations that are exactly like those for the y-intercept:

For an \(x\)-value of \(\boldsymbol{x_{\nu}}\), we are __% confident that the true mean of

yis betweenLLandUL[.units of y]

It is also useful to remember that this interpretation applies individually to every \(x\) displayed in term-plots.

A second type of interval in this situation takes on a more challenging task – to place an interval on where we think a new observation will fall, called a ** prediction interval** (PI). This PI will need to be much wider than the CI for the mean since we need to account for both the uncertainty in the mean and the randomness in sampling a new observation from the normal distribution centered at the true mean for \(x_{\nu}\). The interval is centered at the estimated regression line (where else could we center it?) with the estimate denoted as \(\hat{y}_{\nu}\) to help us see that this interval is for a

**new**\(y\) at this \(x\)-value. The \(\text{SE}_{\hat{y}_{\nu}}\) incorporates the core of the previous SE calculation and adds in the variability of a new observation in \(\boldsymbol{\hat{\sigma}^2}\):

\[\text{SE}_{\hat{y}_{\nu}} = \sqrt{\text{SE}^2_{b_1}(x_{\nu}-\bar{x})^2 + \dfrac{\hat{\sigma}^2}{n} + \boldsymbol{\hat{\sigma}^2}} = \sqrt{\text{SE}_{\hat{\mu}_{\nu}}^2 + \boldsymbol{\hat{\sigma}^2}}\]

The __% PI is calculated as

\[\boldsymbol{\hat{y}_{\nu} \mp t^*_{n-2}}\textbf{SE}_{\boldsymbol{\hat{y}_{\nu}}}\]

and interpreted as:

We are __% sure that a new observation at \(\boldsymbol{x_{\nu}}\) will be between

LLandUL[.units of y]

The formula also helps us to see that

since \(\text{SE}_{\hat{y}_{\nu}} > \text{SE}_{\hat{\mu}_{\nu}}\), the

PI will always be wider than the CI.

As in confidence intervals, we assume that a 95% PI “succeeds” – now when it succeeds it contains the new observation – in 95% of applications of the methods and fails the other 5% of the time. Remember that for any interval estimate, the true value is either in the interval or it is not and our confidence level essentially sets our failure rate! Because PIs push into the tails of the assumed distribution of the responses, these methods are very sensitive to violations of assumptions. We should not use these if there are any concerns about violations of assumptions as they will not work as advertised (at the ** nominal** (specified) level).

There are two ways to explore CIs for the mean and PIs for a new observation. The first is to focus on a specific \(x\)-value of interest. The second is to plot the results for all \(x\text{'s}\). To do both of these, but especially to make plots, we want to learn to use the `predict`

function. It can either produce the estimate for a particular \(x_{\nu}\) and the \(\text{SE}_{\hat{\mu}_{\nu}}\) or we can get it to directly calculate the CI and PI. The first way to use it is `predict(MODELNAME, se.fit = T)`

which will provide fitted values and \(\text{SE}_{\hat{\mu}_{\nu}}\) for all observed \(x\text{'s}\). We can then use the \(\text{SE}_{\hat{\mu}_{\nu}}\) to calculate \(\text{SE}_{\hat{y}_{\nu}}\) and form our own PIs. If you want CIs, run `predict(MODELNAME, interval = "confidence")`

; if you want PIs, run `predict(MODELNAME, interval = "prediction")`

. If you want to do prediction at an \(x\)-value that was not in the original observations, add the option `newdata = tibble(XVARIABLENAME_FROM_ORIGINAL_MODEL = Xnu)`

to the `predict`

function call.

Some examples of using the predict function follow. For example, it might be interesting to use the regression model to find a 95% CI and PI for the *Beers* vs *BAC* study for a student who would consume 8 beers. Four different applications of the predict function follow. Note that `lwr`

and `upr`

in the output depend on what we requested. The first use of `predict`

just returns the estimated mean for 8 beers:

```
m1 <- lm(BAC ~ Beers, data = BB)
predict(m1, newdata = tibble(Beers = 8))
```

```
## 1
## 0.1310095
```

By turning on the `se.fit = T`

option, we also get the SE for the confidence interval and degrees of freedom. Note that elements returned are labeled as `$fit`

, `$se.fit`

, etc. and provide some of the information to calculate CIs or PIs “by hand”.

`predict(m1, newdata = tibble(Beers = 8), se.fit = T)`

```
## $fit
## 1
## 0.1310095
##
## $se.fit
## [1] 0.009204354
##
## $df
## [1] 14
##
## $residual.scale
## [1] 0.02044095
```

Instead of using the components of the intervals to make them, we can also directly request the CI or PI using the `interval = ...`

option, as in the following two lines of code.

`predict(m1, newdata = tibble(Beers = 8), interval = "confidence")`

```
## fit lwr upr
## 1 0.1310095 0.1112681 0.1507509
```

`predict(m1, newdata = tibble(Beers = 8), interval = "prediction")`

```
## fit lwr upr
## 1 0.1310095 0.08292834 0.1790906
```

Based on these results, we are 95% confident that the true mean *BAC* for 8 beers consumed is between 0.111 and 0.15 grams of alcohol per dL of blood. For a new student drinking 8 beers, we are 95% sure that the observed BAC will be between 0.083 and 0.179 g/dL. You can see from these results that the PI is much wider than the CI – it has to capture a new individual’s results 95% of the time which is much harder than trying to capture the true mean at 8 beers consumed. For completeness, we should do these same calculations “by hand”. The `predict(..., se.fit = T)`

output provides almost all the pieces we need to calculate the CI and PI. The `$fit`

is the \(\text{estimate} = \hat{\mu}_{\nu} = 0.131\), the `$se.fit`

is the SE for the estimate of the \(\text{mean} = \text{SE}_{\hat{\mu}_{\nu}} = 0.0092\) , `$df`

is \(n-2 = 16-2 = 14\), and `$residual.scale`

is \(\hat{\sigma} = 0.02044\). So we just need the \(t^*\) multiplier for 95% confidence and 14 *df*:

`qt(0.975, df = 14) #t* multiplier for 95% CI or 95% PI`

`## [1] 2.144787`

The 95% CI for the true mean at \(\boldsymbol{x_{\nu} = 8}\) is then:

`0.131 + c(-1,1)*2.1448*0.0092`

`## [1] 0.1112678 0.1507322`

which matches the previous output quite well.

The 95% PI requires the calculation of \(\sqrt{\text{SE}_{\hat{\mu}_{\nu}}^2 + \boldsymbol{\hat{\sigma}^2}} = \sqrt{(0.0092)^2+(0.02044)^2} = 0.0224\).

`sqrt(0.0092^2 + 0.02044^2)`

`## [1] 0.02241503`

The 95% PI at \(\boldsymbol{x_{\nu} = 8}\) is

`0.131 + c(-1,1)*2.1448*0.0224`

`## [1] 0.08295648 0.17904352`

These calculations are “fun” and informative but displaying these results for all \(x\)-values is a bit more informative about the performance of the two types of intervals and for results we might expect in this application. The calculations we just performed provide endpoints of both intervals at `Beers`

= 8. To make this plot, we need to create a sequence of *Beers* values to get other results for, say from 0 to 10 beers, using the `seq`

function. The `seq`

function requires three arguments, that the endpoints (`from`

and `to`

) are defined and the `length.out`

, which defines the resolution of the grid of equally spaced points to create. Here, `length.out = 30`

provides 30 points evenly spaced between 0 and 10 and is more than enough to make the confidence and prediction intervals from 0 to 10 *Beers*.

```
# Creates vector of predictor values from 0 to 10
beerf <- seq(from = 0, to = 10, length.out = 30)
head(beerf, 6)
```

`## [1] 0.0000000 0.3448276 0.6896552 1.0344828 1.3793103 1.7241379`

`tail(beerf, 6)`

`## [1] 8.275862 8.620690 8.965517 9.310345 9.655172 10.000000`

Now we can call the `predict`

function at the values stored in `beerf`

to get the CIs across that range of *Beers* values:

```
BBCI <- as_tibble(predict(m1, newdata = tibble(Beers = beerf), interval = "confidence"))
head(BBCI)
```

```
## # A tibble: 6 × 3
## fit lwr upr
## <dbl> <dbl> <dbl>
## 1 -0.0127 -0.0398 0.0144
## 2 -0.00651 -0.0320 0.0190
## 3 -0.000312 -0.0242 0.0236
## 4 0.00588 -0.0165 0.0282
## 5 0.0121 -0.00873 0.0329
## 6 0.0183 -0.00105 0.0376
```

And the PIs:

```
BBPI <- as_tibble(predict(m1, newdata = tibble(Beers = beerf), interval = "prediction"))
head(BBPI)
```

```
## # A tibble: 6 × 3
## fit lwr upr
## <dbl> <dbl> <dbl>
## 1 -0.0127 -0.0642 0.0388
## 2 -0.00651 -0.0572 0.0442
## 3 -0.000312 -0.0502 0.0496
## 4 0.00588 -0.0433 0.0551
## 5 0.0121 -0.0365 0.0606
## 6 0.0183 -0.0296 0.0662
```

To visualize these results as shown in Figure 7.23, we need to work to combine some of the previous results into a common tibble, called `modelresB`

, using the `bind_cols`

function that allows multiple columns to be put together. Because some of the names are the same in the `BBCI`

and `BBPI`

objects and were awkwardly given unique names, there is an additional step to rename the columns using the `rename`

function. The `rename`

function changes the name to what is provided before the `=`

for the column identified after the `=`

(think of it like `mutate`

except that it does not create a new variable). The layers in the plot start with adding a line for the fitted values (solid, using `geom_line`

) based on the information in `modelresB`

. We also introduce the `geom_ribbon`

explicitly for the first time^{130} to plot our confidence and prediction intervals. It allows plotting of a region (and its edges) defined by `ymin`

and `ymax`

across the values provided to `x`

. I also wanted to include the original observations, but they are stored in a different tibble (`BB`

), so the `geom_point`

needs to be explicitly told to use a different data set for its contribution to the plot with `data = BB`

along with its own local aesthetic with x and y selections from the original variables.

```
# Patch the beerf vector, fits (just one version), and intervals from BBCI and
# BBPI together with bind_cols:
modelresB <- bind_cols(beerf = tibble(beerf), BBCI, BBPI %>% select(-fit))
# Rename CI and PI limits to have more explicit column names:
modelresB <- modelresB %>% rename(lwr_CI = lwr...3, upr_CI = upr...4,
lwr_PI = lwr...5, upr_PI = upr...6)
```

```
modelresB %>% ggplot() +
geom_line(aes(x = beerf, y = fit), lwd = 1) +
geom_ribbon(aes(x = beerf, ymin = lwr_CI, ymax = upr_CI), alpha = .4,
fill = "beige", color = "darkred", lty = 2, lwd = 1) +
geom_ribbon(aes(x = beerf, ymin = lwr_PI, ymax = upr_PI), alpha = .1,
fill = "gray80", color = "grey", lty = 3, lwd = 1.5) +
geom_point(data = BB, mapping = aes(x = Beers, y = BAC)) +
labs(y = "BAC", x = "Beers",
title = "Scatterplot with estimated regression line and 95% CI and PI") +
theme_bw()
```

More importantly, note that the CI in Figure 7.23 clearly shows widening as we move further away from the mean of the \(x\text{'s}\) to the edges of the observed \(x\)-values. This reflects a decrease in knowledge of the true mean as we move away from the mean of the \(x\text{'s}\). The PI also is widening slightly but not as clearly in this situation. The difference in widths in the two types of intervals becomes extremely clear when they are displayed together, with the PI much wider than the CI for any \(x\)-value.

Similarly, the 95% CI and PIs for the Bozeman yearly average maximum temperatures in Figure 7.24 provide interesting information on the uncertainty in the estimated mean temperature over time. It is also interesting to explore how many of the observations fall within the 95% prediction intervals. The PIs are for new observations, but you can see how the PIs that were constructed to contain almost all the observations in the original data set but not all of them. In fact, only 2 of the 109 observations (1.8%) fall outside the 95% PIs. Since the PI needs to be concerned with unobserved new observations it makes sense that it might contain more than 95% of the observations used to make it.

```
temp1 <- lm(meanmax ~ Year, data = bozemantemps)
Yearf <- seq(from = 1901, to = 2014, length.out = 75)
TCI <- as_tibble(predict(temp1, newdata = tibble(Year = Yearf), interval = "confidence"))
TPI <- as_tibble(predict(temp1, newdata = tibble(Year = Yearf), interval = "prediction"))
# Patch the Yearf vector, fits (just one version), and intervals from TCI and
# TPI together with bind_cols:
modelresT <- bind_cols(Yearf = tibble(Yearf), TCI, TPI %>% select(-fit))
# Rename CI and PI limits to have more explicit column names:
modelresT <- modelresT %>% rename(lwr_CI = lwr...3, upr_CI = upr...4,
lwr_PI = lwr...5, upr_PI = upr...6)
modelresT %>% ggplot() +
geom_line(aes(x = Yearf, y = fit), lwd = 1) +
geom_ribbon(aes(x = Yearf, ymin = lwr_CI, ymax = upr_CI), alpha = .4,
fill = "beige", color = "darkred", lty = 2, lwd = 1) +
geom_ribbon(aes(x = Yearf, ymin = lwr_PI, ymax = upr_PI), alpha = .1,
fill = "gray80", color = "grey", lty = 3, lwd = 1.5) +
geom_point(data = bozemantemps, mapping = aes(x = Year, y = meanmax)) +
labs(y = "degrees F", x = "Year",
title = "Scatterplot with estimated regression line and 95% CI and PI") +
theme_bw()
```

We can also use these same methods to do a prediction for the year after the data set ended, 2015, and in 2050:

`predict(temp1, newdata = tibble(Year = 2015), interval = "confidence")`

```
## fit lwr upr
## 1 58.31967 57.7019 58.93744
```

`predict(temp1, newdata = tibble(Year = 2015), interval = "prediction")`

```
## fit lwr upr
## 1 58.31967 55.04146 61.59787
```

`predict(temp1, newdata = tibble(Year = 2050), interval = "confidence")`

```
## fit lwr upr
## 1 60.15514 59.23631 61.07397
```

`predict(temp1, newdata = tibble(Year = 2050), interval = "prediction")`

```
## fit lwr upr
## 1 60.15514 56.80712 63.50316
```

These results tell us that we are 95% confident that the true mean yearly average maximum temperature in 2015 is (I guess “was”) between 55.04\(^\circ F\) and 61.6\(^\circ F\). And we are 95% sure that the observed yearly average maximum temperature in 2015 will be (I guess “would have been”) between 59.2\(^\circ F\) and 61.1\(^\circ F\). Obviously, 2015 has occurred, but since the data were not published when the data set was downloaded in July 2016, we can probably best treat 2015 as a potential “future” observation. The results for 2050 are clearly for the future mean and a new observation^{131} in 2050. Note that up to 2014, no values of this response had been observed above 60\(^\circ F\) and the predicted mean in 2050 is over 60\(^\circ F\) if the trend persists. It is easy to criticize the use of this model for 2050 because of its extreme amount of extrapolation.