Statistics LibreTexts

10.9: Independence

    The term “independent” has a very specific meaning in statistics, which is somewhat different from the common usage of the term. Statistical independence between two variables means that knowing the value of one variable doesn’t tell us anything about the value of the other. This can be expressed as:

    \(P(A|B) = P(A)\)

    That is, the probability of A given some value of B is just the same as the overall probability of A. Looking at it this way, we see that many cases of what we would call “independence” in the world are not actually statistically independent. For example, there is currently a move by a small group of California citizens to declare a new independent state called Jefferson, which would comprise a number of counties in northern California and Oregon. If this were to happen, then the probability that a current California resident would now live in the state of Jefferson would be \(P(\text{Jefferson}) = 0.014\), whereas the probability that they would remain a California resident would be \(P(\text{California}) = 0.986\). The new states might be politically independent, but they would not be statistically independent, because \(P(\text{California}|\text{Jefferson}) = 0\). That is, while independence in common language often refers to sets that are exclusive, statistical independence refers to the case where one cannot predict anything about one variable from the value of another variable. For example, knowing a person’s hair color is unlikely to tell you whether they prefer chocolate or strawberry ice cream.
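
The Jefferson example can be written out with the standard definition of conditional probability, \(P(A|B) = P(A \cap B) / P(B)\). The worked lines below are a sketch that assumes, as the example implies, that nobody can be a resident of both states at once, so the joint probability is zero.

```latex
\begin{align*}
P(\text{California} \mid \text{Jefferson})
  &= \frac{P(\text{California} \cap \text{Jefferson})}{P(\text{Jefferson})}
   = \frac{0}{0.014} = 0 \\
P(\text{California}) &= 0.986
\end{align*}
```

Because the conditional probability (0) differs from the overall probability (0.986), knowing that someone lives in Jefferson tells us a great deal about whether they live in California, which is the opposite of statistical independence.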

    Let’s look at another example, using the NHANES data: Are physical health and mental health independent of one another? NHANES includes two relevant questions: PhysActive, which asks whether the individual is physically active, and DaysMentHlthBad, which asks on how many of the last 30 days the individual experienced bad mental health. Let’s consider anyone who had more than 7 days of bad mental health in the last month to be in bad mental health. Based on this, we can define a new variable called badMentalHealth as a logical variable indicating whether or not each person had more than 7 days of bad mental health. Using this new variable, we can then determine whether mental health and physical activity are independent by asking whether the simple probability of bad mental health is different from the conditional probability of bad mental health given that one is physically active.

    PhysActive    Proportion with badMentalHealth
    No            0.20
    Yes           0.13
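
A table like the one above could be computed along the following lines. This is a minimal Python sketch, not the code that produced the reported values: the `records` list is a tiny invented stand-in for the NHANES data, and the helper name `proportion_bad` is hypothetical. Only the variable names PhysActive and DaysMentHlthBad and the 7-day threshold come from the text.

```python
# Toy stand-in for NHANES rows: (PhysActive, DaysMentHlthBad).
# These values are invented for illustration and are NOT real NHANES data.
records = [
    ("Yes", 0), ("Yes", 2), ("Yes", 10), ("Yes", 0), ("Yes", 1),
    ("No", 0), ("No", 15), ("No", 3), ("No", 30), ("No", 0),
]

THRESHOLD_DAYS = 7  # more than 7 bad days counts as bad mental health


def proportion_bad(rows):
    """Return the fraction of rows with more than THRESHOLD_DAYS bad days."""
    flags = [days > THRESHOLD_DAYS for _, days in rows]
    return sum(flags) / len(flags)


# Overall probability: P(bad mental health)
p_overall = proportion_bad(records)

# Conditional probabilities: P(bad mental health | PhysActive = group)
p_by_group = {
    group: proportion_bad([r for r in records if r[0] == group])
    for group in ("No", "Yes")
}

print(f"P(bad mental health) = {p_overall:.2f}")
for group, p in p_by_group.items():
    print(f"P(bad mental health | PhysActive = {group}) = {p:.2f}")
```

Comparing `p_overall` with `p_by_group["Yes"]` is the same comparison of simple and conditional probability that the text describes.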

    The overall probability of bad mental health, \(P(\text{bad mental health})\), is 0.16, while the conditional probability \(P(\text{bad mental health}|\text{physically active})\) is 0.13. Thus, the conditional probability appears somewhat smaller than the overall probability, suggesting that the two variables are not independent. We can’t know for sure just by looking at the numbers, however, since they might differ simply because of sampling variability. Later in the course we will encounter tools that will let us more directly test whether two variables are independent.
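
To get a feel for the role of sampling variability, one can simulate samples in which the two variables are independent by construction and look at how far the conditional and overall proportions drift apart anyway. The Python sketch below is only an illustration: the function name `simulate_difference`, the sample size of 500, and the 50% share of physically active people are assumptions made for the example, and only the 0.16 probability is taken from the text.

```python
import random


def simulate_difference(n, p_bad, p_active, rng):
    """Draw n people whose two traits are INDEPENDENT and return the
    sample estimate of P(bad | active) minus the estimate of P(bad)."""
    active = [rng.random() < p_active for _ in range(n)]
    bad = [rng.random() < p_bad for _ in range(n)]

    p_overall = sum(bad) / n
    n_active = sum(active)
    if n_active == 0:  # avoid dividing by zero in a degenerate sample
        return 0.0
    p_given_active = sum(b for a, b in zip(active, bad) if a) / n_active
    return p_given_active - p_overall


rng = random.Random(1)  # fixed seed so the run is repeatable

# Illustrative settings: n and p_active are assumptions, not NHANES values.
diffs = [
    simulate_difference(n=500, p_bad=0.16, p_active=0.5, rng=rng)
    for _ in range(20)
]

for d in diffs:
    print(f"{d:+.3f}")
```

Even though the simulated variables are truly independent, the printed differences will typically not be exactly zero, which is why a gap such as 0.16 versus 0.13 cannot be judged by eye alone.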