Statistics LibreTexts

11: Probability in R (with Lucy King)

    In this chapter we will go over probability computations in R.

    11.1 Basic probability calculations

    Let’s create a vector of outcomes from 1 to 6, using the seq() function to create such a sequence:
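The code for this step does not appear in the text. A minimal sketch that yields the output shown next, assuming the vector is called `outcomes` (a name chosen here for illustration):

```r
# Build the six possible outcomes of a die roll.
# The name `outcomes` is an assumption, not taken from the text.
outcomes <- seq(1, 6)
print(outcomes)
```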

    ## [1] 1 2 3 4 5 6

    Now let’s create a vector of logical values based on whether the outcome in each position is equal to 1. Remember that == tests for equality of each element in a vector:
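A sketch of the comparison described above. The names `outcomes` and `outcome1isTrue` are assumptions made for this example:

```r
# Possible outcomes of a die roll (illustrative name).
outcomes <- seq(1, 6)

# Element-wise comparison: TRUE only in positions where the outcome equals 1.
outcome1isTrue <- outcomes == 1
print(outcome1isTrue)
```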

    ## [1]  TRUE FALSE FALSE FALSE FALSE FALSE

    Remember that the simple probability of an outcome is the number of occurrences of the outcome divided by the total number of events. To compute a probability, we can take advantage of the fact that R treats TRUE/FALSE as 1/0 in arithmetic. The formula for the mean (sum of values divided by the number of values) is thus exactly the same as the formula for the simple probability! So, we can compute the probability of the event by simply taking the mean of the logical vector.
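A sketch of this calculation, with illustrative variable names. It computes the probability both with `mean()` and directly from the definition, to show that the two agree. The exact value is 1/6, so the output shown next appears to be rounded to two digits:

```r
# Possible outcomes and the logical vector marking outcome 1 (illustrative names).
outcomes <- seq(1, 6)
outcome1isTrue <- outcomes == 1

# TRUE counts as 1 and FALSE as 0, so the mean is the proportion of TRUE values.
p1 <- mean(outcome1isTrue)

# The same value from the definition: occurrences divided by total events.
p1_by_hand <- sum(outcome1isTrue) / length(outcome1isTrue)

print(round(p1, 2))
```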

    ## [1] 0.17

    11.2 Empirical frequency (Section 10.2.2)

    Let’s walk through how we computed the empirical frequency of rain in San Francisco.

    First we load the data:

    ## Observations: 365
    ## Variables: 2
    ## $ DATE <date> 2017-01-01, 2017-01-02, 2017-01-03, 2017-01…
    ## $ PRCP <dbl> 0.05, 0.10, 0.40, 0.89, 0.01, 0.00, 0.82, 1.…

    We see that the data frame contains a variable called PRCP which denotes the amount of rain each day. Let’s create a new variable called rainToday that denotes whether the amount of precipitation was above zero:
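The code that adds this variable is not shown. The sketch below uses a toy stand-in rather than the real file: it contains only the first seven PRCP values visible in the output above, the data frame name `SFrain` is hypothetical, and the dates assume consecutive daily records starting on 2017-01-01.

```r
# Toy stand-in for the rainfall data (NOT the full 365-day data set).
# `SFrain` is a hypothetical name; PRCP holds the first seven values shown above.
SFrain <- data.frame(
  DATE = seq(as.Date("2017-01-01"), by = "day", length.out = 7),
  PRCP = c(0.05, 0.10, 0.40, 0.89, 0.01, 0.00, 0.82)
)

# 1 if any precipitation was recorded that day, 0 otherwise.
SFrain$rainToday <- as.integer(SFrain$PRCP > 0)
print(SFrain)
```

Storing the indicator as an integer matches the `<int>` type reported for rainToday in the output that follows. A logical column would work equally well for computing a mean.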

    ## Observations: 365
    ## Variables: 3
    ## $ DATE      <date> 2017-01-01, 2017-01-02, 2017-01-03, 20…
    ## $ PRCP      <dbl> 0.05, 0.10, 0.40, 0.89, 0.01, 0.00, 0.8…
    ## $ rainToday <int> 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 0, …

    Now we will summarize the data to compute the probability of rain:
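A sketch of the summary step on a toy subset. The vector holds only seven precipitation values taken from the output shown earlier, and the names are illustrative, so the result here (6 of 7 days) will not match the value for the full year:

```r
# First seven daily precipitation values from the output above (toy subset).
PRCP <- c(0.05, 0.10, 0.40, 0.89, 0.01, 0.00, 0.82)

# Indicator of rain: 1 if precipitation was above zero, else 0.
rainToday <- as.integer(PRCP > 0)

# The mean of a 0/1 indicator is the empirical probability of the event.
pRainToday <- mean(rainToday)
print(pRainToday)
```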

    ## [1] 0.2

    11.3 Conditional probability (Section 10.4)

    Let’s determine the conditional probability of someone being unhealthy, given that they are over 70 years of age, using the NHANES dataset. Let’s create a new data frame that contains two logical variables, Unhealthy and Over70:
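The code that builds this data frame is not shown, and neither is the rule used to classify someone as unhealthy. The sketch below therefore uses hypothetical toy data for ten people rather than NHANES: the ages and health flags are invented for illustration, and `healthDataFrame` is an assumed name. Only the derivation of Over70 from an age column follows directly from the description.

```r
# Hypothetical toy data: ten people with an age and a health flag.
# The text does not show how Unhealthy is derived from NHANES,
# so it is supplied directly as a logical vector here.
toy <- data.frame(
  Age = c(75, 82, 34, 51, 23, 66, 45, 19, 58, 40),
  Unhealthy = c(TRUE, FALSE, TRUE, TRUE, FALSE, TRUE, FALSE, FALSE, FALSE, FALSE)
)

# Logical indicator of being over 70 years of age.
toy$Over70 <- toy$Age > 70

# Keep only the two logical columns of interest.
healthDataFrame <- toy[, c("Unhealthy", "Over70")]
str(healthDataFrame)
```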

    ## Observations: 4,891
    ## Variables: 2
    ## $ Unhealthy <lgl> FALSE, FALSE, FALSE, TRUE, FALSE, TRUE,…
    ## $ Over70    <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALS…

    First, what’s the probability of being over 70?
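A sketch of this marginal probability on hypothetical toy data (ten invented cases, not NHANES), so the number differs from the output below:

```r
# Hypothetical indicator for ten people: TRUE if over 70 years of age.
Over70 <- c(TRUE, TRUE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE)

# Proportion of TRUE values gives the marginal probability of being over 70.
pOver70 <- mean(Over70)
print(pOver70)
```

The same call applied to a logical Unhealthy column gives the marginal probability of being unhealthy.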

    ## [1] 0.11

    Second, what’s the probability of being unhealthy?

    ## [1] 0.36

    What’s the probability for each combination of unhealthy/healthy and over 70/not? We can create a new variable that finds the joint probability by multiplying the two individual binary variables together; since anything times zero is zero, this will only have the value 1 for any case where both are true.
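A sketch of the multiplication approach on hypothetical toy data (ten invented cases, illustrative names). As an addition that is not part of the original text, it also cross-tabulates the two variables with base R's `table()` and `prop.table()`, which gives the proportion for all four combinations at once:

```r
# Hypothetical indicators for ten people (not NHANES data).
Unhealthy <- c(TRUE, FALSE, TRUE, TRUE, FALSE, TRUE, FALSE, FALSE, FALSE, FALSE)
Over70    <- c(TRUE, TRUE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE)

# The product is 1 only when both indicators are TRUE, and 0 otherwise.
UnhealthyAndOver70 <- Unhealthy * Over70
pBoth <- mean(UnhealthyAndOver70)
print(pBoth)

# Proportions for every combination of the two variables.
jointTable <- prop.table(table(Unhealthy, Over70))
print(jointTable)
```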

    ## [1] 0.043

    Finally, what’s the probability of someone being unhealthy, given that they are over 70 years of age?
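The conditional probability follows from the definition P(A|B) = P(A and B) / P(B). The sketch below applies it to hypothetical toy data (ten invented cases, illustrative names), checks the result against the proportion of unhealthy people within the over-70 subset, and also computes the reverse conditional, the probability of being over 70 given that one is unhealthy.

```r
# Hypothetical indicators for ten people (not NHANES data).
Unhealthy <- c(TRUE, FALSE, TRUE, TRUE, FALSE, TRUE, FALSE, FALSE, FALSE, FALSE)
Over70    <- c(TRUE, TRUE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE)

# Joint probability of being both unhealthy and over 70.
pBoth <- mean(Unhealthy & Over70)

# P(Unhealthy | Over70) = P(Unhealthy and Over70) / P(Over70)
pUnhealthyGivenOver70 <- pBoth / mean(Over70)

# Equivalent check: proportion unhealthy among those over 70 only.
pDirect <- mean(Unhealthy[Over70])

# Reverse conditional: P(Over70 | Unhealthy) = P(both) / P(Unhealthy)
pOver70GivenUnhealthy <- pBoth / mean(Unhealthy)

print(pUnhealthyGivenOver70)
print(pOver70GivenUnhealthy)
```

The two values in the output that follows appear to correspond to these two directions: dividing the joint probability 0.043 by 0.11 gives roughly 0.39, and dividing it by 0.36 gives roughly 0.12, which agrees with the printed values once rounding of the inputs is allowed for.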

    ## [1] 0.38
    ## [1] 0.12