11 Bayesian statistics

In this chapter we will take up the approach to statistical modeling and inference that stands in contrast to the null hypothesis testing framework that you encountered in Chapter 9. This is known as "Bayesian statistics" after the Reverend Thomas Bayes, whose theorem you have already encountered in Chapter 3. In this chapter you will learn how Bayes' theorem provides a way of understanding data that solves many of the conceptual problems that we discussed regarding null hypothesis testing.

11.1 Generative models

Say you are walking down the street and a friend of yours walks right by but doesn't say hello. You would probably try to decide why this happened -- Did they not see you? Are they mad at you? Are you suddenly cloaked in a magic invisibility shield? One of the basic ideas behind Bayesian statistics is that we want to infer the details of how the data are being generated, based on the data themselves. In this case, you want to use the data (i.e. the fact that your friend did not say hello) to infer the process that generated the data (e.g. whether or not they actually saw you, how they feel about you, etc).

The idea behind a generative model is that we observe data that are generated by a latent (unseen) process, usually with some amount of randomness in the process. In fact, when we take a sample of data from a population and estimate a parameter from the sample, what we are doing in essence is trying to learn the value of a latent variable (the population mean) that gives rise through sampling to the observed data (the sample mean).

If we know the value of the latent variable, then it's easy to reconstruct what the observed data should look like. For example, let's say that we are flipping a coin that we know to be fair. We can describe the coin by a binomial distribution with a value of p=0.5, and then we could generate random samples from such a distribution in order to see what the observed data should look like. However, in general we are in the opposite situation: We don't know the value of the latent variable of interest, but we have some data that we would like to use to estimate it.

11.2 Bayes' theorem and inverse inference

The reason that Bayesian statistics has its name is because it takes advantage of Bayes' theorem to make inferences from data back to some features of the (latent) model that generated the data. Let's say that we want to know whether a coin is fair. To test this, we flip the coin 10 times and come up with 7 heads. Before this test we were pretty sure that the coin was fair (i.e., that $p_{heads}=0.5$), but these data certainly give us pause. We already know how to compute the conditional probability that we would flip 7 or more heads out of 10 if the coin is really fair ($P(n\ge7|p_{heads}=0.5)$), using the binomial distribution:

# compute the conditional probability of 7 or more heads when p(heads)=0.5
sprintf(
"p(7 or more heads | p(heads) = 0.5) = %.3f",
pbinom(7, 10, .5, lower.tail = FALSE)
)
##  "p(7 or more heads | p(heads) = 0.5) = 0.055"

That is a fairly small number, but this number doesn't really answer the question that we are asking -- it is telling us about the likelihood of 7 or more heads given some particular probability of heads, whereas what we really want to know is the probability of heads. This should sound familiar, as it's exactly the situation that we were in with null hypothesis testing, which told us about the likelihood of data rather than the likelihood of hypotheses.

Remember that Bayes' theorem provides us with the tool that we need to invert a conditional probability:

$P(H|D) = \frac{P(D|H)*P(H)}{P(D)}$

We can think of this theorem as having four parts:

• prior ($P(H)$): Our degree of belief about hypothesis H before seeing the data D
• likelihood ($P(D|H)$): How likely are the observed data D under hypothesis H?
• marginal likelihood ($P(D)$): How likely are the observed data, combining over all possible hypotheses?
• posterior ($P(H|D)$): Our updated belief about hypothesis H, given the data D

Here we see one of the primary differences between frequentist and Bayesian statsistics. Frequentists do not believe in the idea of a probability of a hypothesis (i.e., our degree of belief about a hypothesis) -- for them, a hypothesis is either true or it isn't. Another way to say this is that for the frequentist, the hypothesis is fixed and the data are random, which is why frequentist inference focuses on describing the probability of data given a hypothesis (i.e. the p-value). Bayesians, on the other hand, are comfortable making probability statements about both data and hypotheses.

11.3 Doing Bayesian estimation

We ultimately want to use Bayesian statistics to test hypotheses, but before we do that we need to estimate the parameters that are necessary to test the hypothesis. Here we will walk through the process of Bayesian estimation. Let's use another screening example: Airport security screening. If you fly a lot like I do, it's just a matter of time until one of the random explosive screenings comes back positive; I had the particularly unfortunate experience of this happening soon after September 11, 2001, when airport security staff were especially on edge.

What the security staff want to know is what is the likelihood that a person is carrying an explosive, given that the machine has given a positive test. Let's walk through how to calculate this value using Bayesian analysis.

11.3.1 Specifying the prior

To use Bayes' theorem, we first need to specify the prior probability for the hypothesis. In this case, we don't know the real number but we can assume that it's quite small. According to the FAA, there were 971,595,898 air passengers in the U.S. in 2017. For this example, let's say that one out of those travelers was carrying an explosive in their bag

prior <- 1/971595898

11.3.2 Collect some data

The data are composed of the results of the explosive screening test. Let's say that the security staff runs the bag through their testing apparatus 10 times, and it gives a positive reading on 9 of the 10 tests.

nTests <- 10
nPositives <- 9

11.3.3 Computing the likelihood

We want to compute the likelihood of the data under the hypothesis that there is an explosive in the bag. Let's say that we know that the sensitivity of the test is 0.99 -- that is, when a device is present, it will detect it 99% of the time. To determine the likelihood of our data under the hypothesis that a device is present, we can treat each test as a Bernoulli trial (that is, a trial with an outcome of true or false) with a probability of success of 0.99, which we can model using a binomial distribution.

likelihood <- dbinom(nPositives, nTests, 0.99)
likelihood
##  0.091

11.3.4 Computing the marginal likelihood

We also need to know the overall likelihood of the data -- that is, finding 9 positives out of 10 tests. Computing the marginal likelihood is often one of the most difficult aspects of Bayesian analysis, but for our example it's simple because we can take advantage of the specific form of Bayes' theorem that we introduced in Section 3.7:

$P(B|A) = \frac{P(A|B)*P(B)}{P(A|B)*P(B) + P(A|\neg B)*P(\neg B)}$

The marginal likelihood in this case is a weighted average of the likelihood of the data under either presence or absence of the explosive, multiplied by the probability of the explosive being present (i.e. the prior). In this case, let's say that we know that the specificity of the test is 0.9, such that the likelihood of a positive result when there is no explosive is 0.1.

We can compute this in R as follows:

marginal_likelihood <-
dbinom(
x = nPositives,
size = nTests,
prob = 0.99
) * prior +
dbinom(
x = nPositives,
size = nTests,
prob = .1
) *
(1 - prior)

sprintf("marginal likelihood = %.3e", marginal_likelihood)
##  "marginal likelihood = 9.094e-09"

11.3.5 Computing the posterior

We now have all of the parts that we need to compute the posterior probability of an explosive being present, given the observed 9 positive outcomes out of 10 tests.

posterior <- (likelihood * prior) / marginal_likelihood
posterior
##  0.01

This result shows us that the probability of an explosive in the bag is much higher than the prior, but nowhere near certainty, again highlighting the fact that testing for rare events is almost always liable to produce high numbers of false positives.

11.4 Estimating posterior distributions

In the previous example there were only two possible outcomes -- the explosive is either there or it's not -- and we wanted to know which outcome was most likely given the data. However, in other cases we want to use Bayesian estimation to estimate the numeric value of a parameter. Let's say that we want to know about the effectiveness of a new drug for pain; to test this, we can administer the drug to a group of patients and then ask them whether their pain improved or not after taking the drug. We can use Bayesian analysis to estimate the proportion of people for whom the drug will be effective using these data.

11.4.1 Specifying the prior

In this case, we don't have any prior information about the effectiveness of the drug, so we will use a uniform distribution as our prior, since all values are equally likely under a uniform distribution. In order to simplify the example, we will only look at a subset of 99 possible values of effectiveness (from .01 to .99, in steps of .01). Therefore, each possible value has a prior probability of 1/99.

11.4.2 Collect some data

We need some data in order to estimate the effect of the drug. Let's say that we administer the drug to 100 individuals, and get the following results:

# create a table with results
nResponders <- 64
nTested <- 100

drugDf <- tibble(
outcome = c("improved", "not improved"),
number = c(nResponders, nTested - nResponders)
)
pander(drugDf)
outcome number
improved 64
not improved 36

11.4.3 Computing the likelihood

We can compute the likelihood of the data under any particular value of the effectiveness parameter using the dbinom() function in R. In Figure 11.1 you can see the likelihood curves over numbers of responders for several different values of $p_{respond}$. Looking at this, it seems that our observed data are relatively more likely under the hypothesis of $p_{respond}=0.7$, somewhat less likely under the hypothesis of $p_{respond}=0.5$, and quite unlikely under the hypothesis of $p_{respond}=0.3$. One of the fundamental ideas of Bayesian inference is that we will try to find the value of our parameter of interest that makes the data most likely, while also taking into account our prior knowledge. Figure 11.1 Likelihood of each possible number of responders under several different hypotheses (p(respond)=0.5 (red), 0.7 (green), 0.3 (black). Observed value shown in blue.

11.4.4 Computing the marginal likelihood

In addition to the likelihood of the data under different hypotheses, we need to know the overall likelihood of the data, combining across all hypotheses (i.e., the marginal likelihood). This marginal likelihood is primarily important beacuse it helps to ensure that the posterior values are true probabilities. In this case, our use of a set of discrete possible parameter values makes it easy to compute the marginal likelihood, because we can just compute the likelihood of each parameter value under each hypothesis and add them up.

# compute marginal likelihood
likeDf <-
likeDf %>%
mutate(uniform_prior = array(1 / n()))

# multiply each likelihood by prior and add them up
marginal_likelihood <-
sum(
dbinom(
x = nResponders, # the number who responded to the drug
size = 100, # the number tested
likeDf$presp # the likelihood of each response ) * likeDf$uniform_prior
)

sprintf("marginal likelihood = %0.4f", marginal_likelihood)
##  "marginal likelihood = 0.0100"

11.4.5 Computing the posterior

We now have all of the parts that we need to compute the posterior probability distribution across all possible values of $p_{respond}$, as shown in Figure 11.2.

# Create data for use in figure
bayesDf <-
tibble(
steps = seq(from = 0.01, to = 0.99, by = 0.01)
) %>%
mutate(
likelihoods = dbinom(
x = nResponders,
size = 100,
prob = steps
),
priors = dunif(steps) / length(steps),
posteriors = (likelihoods * priors) / marginal_likelihood
) Figure 11.2 Posterior probability distribution plotted in blue against uniform prior distribution (dotted black line).

11.4.6 Maximum a posteriori (MAP) estimation

Given our data we would like to obtain an estimate of $p_{respond}$ for our sample. One way to do this is to find the value of $p_{respond}$ for which the posterior probability is the highest, which we refer to as the maximum a posteriori (MAP) estimate. We can find this from the data in 11.2:

# compute MAP estimate
MAP_estimate <-
bayesDf %>%
arrange(desc(posteriors)) %>%
slice(1) %>%
pull(steps)

sprintf("MAP estimate = %0.4f", MAP_estimate)
##  "MAP estimate = 0.6400"

Note that this is simply the proportion of responders from our sample -- this occurs because the prior was uniform and thus didn't bias our response.

11.4.7 Credible intervals

Often we would like to know not just a single estimate for the posterior, but an interval in which we are confident that the posterior falls. We previously discussed the concept of confidence intervals in the context of frequentist inference, and you may remember that the interpretation of confidence intervals was particularly convoluted. What we really want is an interval in which we are confident that the true parameter falls, and Bayesian statistics can give us such an interval, which we call a credible interval.

In some cases the credible interval can be computed numerically based on a known distribution, but it's more common to generate a credible interval by sampling from the posterior distribution and then to compute quantiles of the samples. This is particularly useful when we don't have an easy way to express the posterior distribution numerically, which is often the case in real Bayesian data analysis.

We will generate samples from our posterior distribution using a simple algorithm known as rejection sampling. The idea is that we choose a random value of x (in this case $p_{respond}$) and a random value of y (in this case, the posterior probability of $p_{respond}$) each from a uniform distribution. We then only accept the sample if $y < f(x)$ - in this case, if the randomly selected value of y is less than the actual posterior probability of y. Figure 11.3 shows an example of a histogram of samples using rejection sampling, along with the 95% credible intervals obtained using this method.

# Compute credible intervals for example

nsamples <- 100000

# create random uniform variates for x and y
x <- runif(nsamples)
y <- runif(nsamples)

# create f(x)
fx <- dbinom(x = nResponders, size = 100, prob = x)

# accept samples where y < f(x)
accept <- which(y < fx)
accepted_samples <- x[accept]

credible_interval <- quantile(x = accepted_samples, probs = c(0.025, 0.975))
pander(credible_interval)
2.5% 98%
0.54 0.72 Figure 11.3 Rejection sampling example.The black line shows the density of all possible values of p(respond); the blue lines show the 2.5th and 97.5th percentiles of the distribution, which represent the 95 percent credible interval for the estimate of p(respond).

The interpretation of this credible interval is much closer to what we had hoped we could get from a confidence interval (but could not): It tells us that there is a 95% probability that the value of $p_{respond}$ falls between these two values. Importantly, it shows that we have high confidence that $p_{respond} > 0.0$, meaning that the drug seems to have a positive effect.

11.4.8 Effects of different priors

In the previous example we used a flat prior, meaning that we didn't have any reason to believe that any particular value of $p_{respond}$ was more or less likely. However, let's say that we had instead started with some previous data: In a previous study, researchers had tested 20 people and found that 10 of them had responded positively. This would have lead us to start with a prior belief that the treatment has an effect in 50% of people. We can do the same computation as above, but using the information from our previous study to inform our prior (see Figure 11.4).

# compute likelihoods for data under all values of p(heads)
# using a flat or empirical prior.
# here we use the quantized values from .01 to .99 in steps of 0.01

df <-
tibble(
steps = seq(from = 0.01, to = 0.99, by = 0.01)
) %>%
mutate(
likelihoods = dbinom(nResponders, 100, steps),
priors_flat = dunif(steps) / sum(dunif(steps)),
priors_empirical = dbinom(10, 20, steps) / sum(dbinom(10, 20, steps))
)

marginal_likelihood_flat <-
sum(dbinom(nResponders, 100, df$steps) * df$priors_flat)

marginal_likelihood_empirical <-
sum(dbinom(nResponders, 100, df$steps) * df$priors_empirical)

df <-
df %>%
mutate(
posteriors_flat =
(likelihoods * priors_flat) / marginal_likelihood_flat,
posteriors_empirical =
(likelihoods * priors_empirical) / marginal_likelihood_empirical
) Figure 11.4 Effects of priors on the posterior distribution. The original posterior distribution based on a flat prior is plotted in blue. The prior based on the observation of 10 responders out of 20 people is plotted in the dotted black line, and the posterior using this prior is plotted in red.

Note that the likelihood and marginal likelihood did not change - only the prior changed. The effect of the change in prior to was to pull the posterior closer to the mass of the new prior, which is centered at 0.5.

Now let's see what happens if we come to the analysis with an even stronger prior belief. Let's say that instead of having previously observed 10 responders out of 20 people, the prior study had instead tested 500 people and found 250 responders. This should in principle give us a much stronger prior, and as we see in Figure 11.5, that's what happens: The prior is much more concentrated around 0.5, and the posterior is also much closer to the prior. The general idea is that Bayesian inference combines the information from the prior and the likelihood, weighting the relative strength of each.

# compute likelihoods for data under all values of p(heads) using strong prior.

df <-
df %>%
mutate(
priors_strong = dbinom(250, 500, steps) / sum(dbinom(250, 500, steps))
)

marginal_likelihood_strong <-
sum(dbinom(nResponders, 100, df$steps) * df$priors_strong)

df <-
df %>%
mutate(
posteriors_strongprior = (likelihoods * priors_strong) / marginal_likelihood_strong
) Figure 11.5 Effects of the strength of the prior on the posterior distribution. The blue line shows the posterior obtained using the prior based on 50 heads out of 100 people. The dotted black line shows the prior based on 250 heads out of 500 flips, and the red line shows the posterior based on that prior.

This example also highlights the sequential nature of Bayesian analysis -- the posterior from one analysis can become the prior for the next analysis.

Finally, it is important to realize that if the priors are strong enough, they can completely overwhelm the data. Let's say that you have an absolute prior that $p_{respond}$ is 0.8 or greater, such that you set the prior likelihood of all other values to zero. What happens if we then compute the posterior?

# compute likelihoods for data under all values of p(respond) using absolute prior.
df <-
df %>%
mutate(
priors_absolute = array(data = 0, dim = length(steps)),
priors_absolute = if_else(
steps >= 0.8,
1, priors_absolute
),
priors_absolute = priors_absolute / sum(priors_absolute)
)

marginal_likelihood_absolute <-
sum(dbinom(nResponders, 100, df$steps) * df$priors_absolute)

df <-
df %>%
mutate(
posteriors_absolute =
(likelihoods * priors_absolute) / marginal_likelihood_absolute
) Figure 11.6 Effects of the strength of the prior on the posterior distribution. The blue line shows the posterior obtained using an absolute prior which states that p(respond) is 0.8 or greater. The prior is shown in the dotted black line.

In Figure 11.6 we see that there is zero density in the posterior for any of the values where the prior was set to zero - the data are overwhelmed by the absolute prior.

11.5 Choosing a prior

The impact of priors on the resulting inferences are the most controversial aspect of Bayesian statistics. There are various ways to choose one's priors, which (as we saw above) can impact the resulting inferences. Uninformative priors attempt to bias the resulting posterior as little as possible, as we saw in the example of the uniform prior above. It's also common to use weakly informative priors (or default priors), which bias the result only very slightly. For example, if we had used a binomial distribution based on one heads out of two coin flips, the prior would have been centered around 0.5 but fairly flat, biasing the posterior only slightly.

It is also possible to use priors based on the scientific literature or pre-existing data, which we would call empirical priors. In general, however, we will stick to the use of uninformative/weakly informative priors, since they raise the least concern about biasing our results. In general it is a good idea to try any Bayesian analysis using multiple reasonable priors, and make sure that the results don't change in an important way based on the prior.

11.6 Bayesian hypothesis testing

Having learned how to perform Bayesian estimation, we now turn to the use of Bayesian methods for hypothesis testing. Let's say that there are two politicians who differ in their beliefs about whether the public supports the death penalty. Senator Smith thinks that only 40% of people support the death penalty, whereas Senator Jones thinks that 60% of people support the death penalty. They arrange to have a poll done to test this, which asks 1000 randomly selected people whether they support the death penalty. The results are that 490 of the people in the polled sample support the death penalty. Based on these data, we would like to know: Do the data support the claims of one senator over the other? We can test this using a concept known as the Bayes factor.

11.6.1 Bayes factors

The Bayes factor characterizes the relative likelihood of the data under two different hypotheses. It is defined as:

$BF = \frac{p(data|H_1)}{p(data|H_2)}$ for two hypotheses $H_1$ and $H_2$. In the case of our two senators, we know how to compute the likelihood of the data under each hypothesis using the binomial distribution. We will put Senator Smith in the numerator and Senator Jones in the denominator, so that a value greater than one will reflect greater evidence for Senator Smith, and a value less than one will reflect greater evidence for Senator Jones.

# compute Bayes factor for Smith vs. Jones

bf <-
dbinom(
x = 490,
size = 1000,
prob = 0.4 #Smith's hypothesis
) / dbinom(
x = 490,
size = 1000,
prob = 0.6 #Jones' hypothesis
)

sprintf("Bayes factor = %0.2f", bf)
##  "Bayes factor = 3325.26"

This number provides a measure of the evidence that the data provides regarding the two hypotheses - in this case, it tells us the data support Senator Smith more than 3000 times more strongly than they support Senator Jones.

11.6.2 Bayes factors for statistical hypotheses

In the previous example we had specific predictions from each senator, whose likelihood we could quantify using the binomial distribution. However, in real data analysis we generally must deal with uncertainty about our parameters, which complicates the Bayes factor. However, in exchange we gain the ability to quantify the relative amount of evidence in favor of the null versus alternative hypotheses.

Let's say that we are a medical researcher performing a clinical trial for the treatment of diabetes, and we wish to know whether a particular drug reduces blood glucose compared to placebo. We recruit a set of volunteers and randomly assign them to either drug or placebo group, and we measure the change in hemoglobin A1C (a marker for blood glucose levels) in each group over the period in which the drug or placebo was administered. What we want to know is: Is there a difference between the drug and placebo?

First, let's generate some data and analyze them using null hypothesis testing (see Figure 11.7).

# create simulated data for drug trial example

set.seed(123456)
nsubs <- 40
effect_size <- 0.1

# randomize indiviuals to drug (1) or placebo (0)
drugDf <-
tibble(
group = as.integer(runif(nsubs) > 0.5)
) %>%
mutate(
hbchange = rnorm(nsubs) - group * effect_size
) Figure 11.7 Box plots showing data for drug and placebo groups.

Let's perform an independent-samples t-test, which shows that there is a significant difference between the groups:

# compute t-test for drug example
drugTT <- t.test(hbchange ~ group, alternative = "greater", data = drugDf)
print(drugTT)
##
##  Welch Two Sample t-test
##
## data:  hbchange by group
## t = 2, df = 40, p-value = 0.03
## alternative hypothesis: true difference in means is greater than 0
## 95 percent confidence interval:
##  0.096   Inf
## sample estimates:
## mean in group 0 mean in group 1
##            0.12           -0.48

This test tells us that there is a significant difference between the groups, but it doesn't quantify how strongly the evidence supports the null versus alternative hypotheses. To measure that, we can compute a Bayes factor using ttestBF function from the BayesFactor package in R:

# compute Bayes factor for drug data
bf_drug <- ttestBF(
formula = hbchange ~ group, data = drugDf,
nullInterval = c(0, Inf)
)

bf_drug
## Bayes factor analysis
## --------------
##  Alt., r=0.707 0<d<Inf    : 2.4  ±0%
##  Alt., r=0.707 !(0<d<Inf) : 0.12 ±0%
##
## Against denominator:
##   Null, mu1-mu2 = 0
## ---
## Bayes factor type: BFindepSample, JZS

The Bayes factor here tells us that the alternative hypothesis (i.e. that the difference is greater than zero) is about 2.4 times more likely than the point null hypothesis (i.e. a mean difference of exactly zero) given the data.

11.6.2.1 One-sided tests

We generally are less interested in testing against the null hypothesis of a specific point value (e.g. mean difference = 0) than we are in testing against a directional null hypothesis (e.g. that the difference is less than or equal to zero). We can also perform a directional (or one-sided) test using the results from ttestBF analysis, since it provides two Bayes factors: one for the alternative hypothesis that the mean difference is greater than zero, and one for the alternative hypothesis that the mean difference is less than zero. If we want to assess the relative evidence for a positive effect, we can compute a Bayes factor comparing the relative evidence for a positive versus a negative effect by simply dividing the two Bayes factors returned by the function:

bf_drug/bf_drug
## Bayes factor analysis
## --------------
##  Alt., r=0.707 0<d<Inf : 20 ±0%
##
## Against denominator:
##   Alternative, r = 0.707106781186548, mu =/= 0 !(0<d<Inf)
## ---
## Bayes factor type: BFindepSample, JZS

Now we see that the Bayes factor for a positive effect versus a negative effect is substantially larger (almost 20).

11.6.2.2 Interpreting Bayes Factors

How do we know whether a Bayes factor of 2 or 20 is good or bad? There is a general guideline for interpretation of Bayes factors suggested by Kass & Rafferty (1995):

BF Strength of evidence
1 to 3 not worth more than a bare mention
3 to 20 positive
20 to 150 strong
>150 very strong

Based on this, even though the statisical result is significant, the amount of evidence in favor of the alternative vs. the point null hypothesis is weak enough that it's not worth even mentioning, whereas the evidence for the directional hypothesis is positive, but not quite strong.

11.6.3 Assessing evidence for the null hypothesis

Because the Bayes factor is comparing evidence for two hypotheses, it also allows us to assess whether there is evidence in favor of the null hypothesis, which we couldn't do with standard null hypothesis testing (because it starts with the assumption that the null is true). This can be very useful for determining whether a non-significant result really provides strong evidence that there is no effect, or instead just reflects weak evidence overall.