# 10.8: Computing Conditional Probabilities from Data

For many examples in this course we will use data obtained from the National Health and Nutrition Examination Survey (NHANES). NHANES is a large ongoing study organized by the US Centers for Disease Control that is designed to provide an overall picture of the health and nutritional status of both adults and children in the US. Every year, the survey examines a sample of about 5000 people across the US using both interviews and physical and medical tests. The NHANES dataset is available as an R package, making it easy to access and work with. It also provides us with a large, realistic dataset that will serve as an example for many different statistical tools.

Let’s say that we are interested in the following question: What is the probability that someone has diabetes, given that they are not physically active? – that is, $P(diabetes|inactive)$. NHANES records two variables that address the two parts of this question. The first (Diabetes) asks whether the person has ever been told that they have diabetes, and the second (PhysActive) records whether the person engages in sports, fitness, or recreational activities that are at least of moderate intensity. Let’s first compute the simple probabilities.

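Table 10.2: Summary data for diabetes and physical activity

| Answer | N_diabetes | P_diabetes | N_PhysActive | P_PhysActive |
|--------|------------|------------|--------------|--------------|
| No     | 4893       | 0.9        | 2472         | 0.45         |
| Yes    | 550        | 0.1        | 2971         | 0.55         |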

The table shows that the probability that someone in the NHANES dataset has diabetes is 0.1 (Diabetes = “Yes”), and the probability that someone is inactive is 0.45 (PhysActive = “No”).
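The simple probabilities in Table 10.2 are just counts divided by the total number of respondents. The following is a minimal sketch in Python rather than R, assuming the counts are typed in by hand from the table instead of being loaded from the NHANES package; the names `diabetes_counts`, `phys_active_counts`, and `to_probabilities` are illustrative only.

```python
# Counts copied by hand from Table 10.2 (an assumption of this sketch;
# the original analysis reads them from the NHANES data).
diabetes_counts = {"No": 4893, "Yes": 550}
phys_active_counts = {"No": 2472, "Yes": 2971}


def to_probabilities(counts):
    """Divide each count by the total to get a simple probability."""
    total = sum(counts.values())
    return {answer: n / total for answer, n in counts.items()}


p_diabetes = to_probabilities(diabetes_counts)
p_phys_active = to_probabilities(phys_active_counts)

# "Inactive" corresponds to PhysActive == "No".
print(f"P(diabetes) = {p_diabetes['Yes']:.2f}")     # P(diabetes) = 0.10
print(f"P(inactive) = {p_phys_active['No']:.2f}")   # P(inactive) = 0.45
```

Both variables have the same total (5443 respondents), which is why the two sets of probabilities can later be combined with the joint counts.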

Table 10.3: Joint probabilities for Diabetes and PhysActive variables.

| Diabetes | PhysActive | n    | prob |
|----------|------------|------|------|
| No       | No         | 2123 | 0.39 |
| No       | Yes        | 2770 | 0.51 |
| Yes      | No         | 349  | 0.06 |
| Yes      | Yes        | 201  | 0.04 |

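To compute $P(diabetes|inactive)$ we need the joint probability of being diabetic and inactive, in addition to the simple probabilities of each. Table 10.3 lists the joint probability for every combination of the two variables; the row with Diabetes = “Yes” and PhysActive = “No” is the one we need.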
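As a rough illustration of how the joint probabilities in Table 10.3 relate to the simple probabilities, the sketch below (Python, with counts hardcoded from the table; `joint_counts` and the other names are hypothetical) divides each cell count by the total and then sums over diabetes status to recover $P(inactive)$.

```python
# Joint counts copied by hand from Table 10.3, keyed by (Diabetes, PhysActive).
joint_counts = {
    ("No", "No"): 2123,
    ("No", "Yes"): 2770,
    ("Yes", "No"): 349,
    ("Yes", "Yes"): 201,
}
total = sum(joint_counts.values())

# Each joint probability is the cell count divided by the total.
joint_prob = {cell: n / total for cell, n in joint_counts.items()}

# The cell we need: Diabetes == "Yes" and PhysActive == "No".
p_diabetes_and_inactive = joint_prob[("Yes", "No")]

# Summing the joint probabilities over both diabetes answers
# recovers the simple probability of being inactive.
p_inactive = joint_prob[("No", "No")] + joint_prob[("Yes", "No")]

print(f"P(diabetes and inactive) = {p_diabetes_and_inactive:.2f}")  # 0.06
print(f"P(inactive) = {p_inactive:.2f}")                            # 0.45
```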

Based on these joint probabilities, we can compute $P(diabetes|inactive)$. To do this, we first determine, for each individual, the truth value of whether the PhysActive variable is equal to “No”, and then take the mean of those truth values. Since TRUE/FALSE values are treated as 1/0 respectively by most programming languages (including R), the mean of a logical variable is the proportion of cases in which it is true, which here gives us $P(inactive)$. We then divide the joint probability of being diabetic and inactive by that value to obtain the conditional probability: $P(diabetes|inactive) = P(diabetes \cap inactive) / P(inactive) = (349/5443) / (2472/5443) \approx 0.141$. That is, the probability of someone having diabetes given that they are physically inactive is 0.141.
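The mean-of-truth-values idea can be sketched as follows. This is an illustration in Python rather than the R code behind the reported result, and it assumes one record per respondent rebuilt from the Table 10.3 counts instead of the actual NHANES rows; all variable names are hypothetical.

```python
# Rebuild one (has_diabetes, is_inactive) record per respondent
# from the Table 10.3 counts (an assumption of this sketch).
records = (
    [(False, True)] * 2123     # no diabetes, inactive
    + [(False, False)] * 2770  # no diabetes, active
    + [(True, True)] * 349     # diabetes, inactive
    + [(True, False)] * 201    # diabetes, active
)

# Logical variables: one truth value per respondent.
inactive = [is_inactive for _, is_inactive in records]
diabetes_and_inactive = [d and i for d, i in records]

# True counts as 1 and False as 0, so the mean of a list of
# truth values is the proportion of cases where it is true.
p_inactive = sum(inactive) / len(inactive)
p_joint = sum(diabetes_and_inactive) / len(diabetes_and_inactive)

# Conditional probability: joint probability divided by P(inactive).
p_diabetes_given_inactive = p_joint / p_inactive
print(f"P(diabetes|inactive) = {p_diabetes_given_inactive:.3f}")  # 0.141

# Equivalent check: the proportion with diabetes among the inactive only.
inactive_only = [d for d, i in records if i]
p_check = sum(inactive_only) / len(inactive_only)
```

The final check makes the interpretation concrete: conditioning on inactivity amounts to restricting attention to the 2472 inactive respondents and asking what fraction of them have diabetes.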