Statistics LibreTexts

5.5: Computing a Frequency Distribution (Section 4.2.1)

    We would like to compute a frequency distribution showing how many people report being either active or inactive. The following statement is fairly complex, so we will step through it one bit at a time.

    PhysActive_table <- NHANES_unique %>%
      # convert the implicit missing values to explicit
      mutate(PhysActive = fct_explicit_na(PhysActive)) %>%
      # select the variable of interest
      dplyr::select(PhysActive) %>% 
      # group by values of the variable
      group_by(PhysActive) %>% 
      # count the values
      summarize(AbsoluteFrequency = n()) 
    
    # kable() prints out the table in a prettier way.
    kable(PhysActive_table)
    PhysActive   AbsoluteFrequency
    No                        2473
    Yes                       2972
    (Missing)                 1334

    The first step should be familiar from the previous section (here we add head() to keep only the first ten rows and glimpse() to list every variable with its type and first few values):

    NHANES_unique %>%
      mutate(PhysActive = fct_explicit_na(PhysActive)) %>%
      head(10) %>% 
      glimpse()
    ## Observations: 10
    ## Variables: 77
    ## $ ID               <int> 51624, 51625, 51630, 51638, 5164…
    ## $ SurveyYr         <fct> 2009_10, 2009_10, 2009_10, 2009_…
    ## $ Gender           <fct> male, male, female, male, male, …
    ## $ Age              <int> 34, 4, 49, 9, 8, 45, 66, 58, 54,…
    ## $ AgeDecade        <fct>  30-39,  0-9,  40-49,  0-9,  0-9…
    ## $ AgeMonths        <int> 409, 49, 596, 115, 101, 541, 795…
    ## $ Race1            <fct> White, Other, White, White, Whit…
    ## $ Race3            <fct> NA, NA, NA, NA, NA, NA, NA, NA, …
    ## $ Education        <fct> High School, NA, Some College, N…
    ## $ MaritalStatus    <fct> Married, NA, LivePartner, NA, NA…
    ## $ HHIncome         <fct> 25000-34999, 20000-24999, 35000-…
    ## $ HHIncomeMid      <int> 30000, 22500, 40000, 87500, 6000…
    ## $ Poverty          <dbl> 1.4, 1.1, 1.9, 1.8, 2.3, 5.0, 2.…
    ## $ HomeRooms        <int> 6, 9, 5, 6, 7, 6, 5, 10, 6, 10
    ## $ HomeOwn          <fct> Own, Own, Rent, Rent, Own, Own, …
    ## $ Work             <fct> NotWorking, NA, NotWorking, NA, …
    ## $ Weight           <dbl> 87, 17, 87, 30, 35, 76, 68, 78, …
    ## $ Length           <dbl> NA, NA, NA, NA, NA, NA, NA, NA, …
    ## $ HeadCirc         <dbl> NA, NA, NA, NA, NA, NA, NA, NA, …
    ## $ Height           <dbl> 165, 105, 168, 133, 131, 167, 17…
    ## $ BMI              <dbl> 32, 15, 31, 17, 21, 27, 24, 24, …
    ## $ BMICatUnder20yrs <fct> NA, NA, NA, NA, NA, NA, NA, NA, …
    ## $ BMI_WHO          <fct> 30.0_plus, 12.0_18.5, 30.0_plus,…
    ## $ Pulse            <int> 70, NA, 86, 82, 72, 62, 60, 62, …
    ## $ BPSysAve         <int> 113, NA, 112, 86, 107, 118, 111,…
    ## $ BPDiaAve         <int> 85, NA, 75, 47, 37, 64, 63, 74, …
    ## $ BPSys1           <int> 114, NA, 118, 84, 114, 106, 124,…
    ## $ BPDia1           <int> 88, NA, 82, 50, 46, 62, 64, 76, …
    ## $ BPSys2           <int> 114, NA, 108, 84, 108, 118, 108,…
    ## $ BPDia2           <int> 88, NA, 74, 50, 36, 68, 62, 72, …
    ## $ BPSys3           <int> 112, NA, 116, 88, 106, 118, 114,…
    ## $ BPDia3           <int> 82, NA, 76, 44, 38, 60, 64, 76, …
    ## $ Testosterone     <dbl> NA, NA, NA, NA, NA, NA, NA, NA, …
    ## $ DirectChol       <dbl> 1.29, NA, 1.16, 1.34, 1.55, 2.12…
    ## $ TotChol          <dbl> 3.5, NA, 6.7, 4.9, 4.1, 5.8, 5.0…
    ## $ UrineVol1        <int> 352, NA, 77, 123, 238, 106, 113,…
    ## $ UrineFlow1       <dbl> NA, NA, 0.094, 1.538, 1.322, 1.1…
    ## $ UrineVol2        <int> NA, NA, NA, NA, NA, NA, NA, NA, …
    ## $ UrineFlow2       <dbl> NA, NA, NA, NA, NA, NA, NA, NA, …
    ## $ Diabetes         <fct> No, No, No, No, No, No, No, No, …
    ## $ DiabetesAge      <int> NA, NA, NA, NA, NA, NA, NA, NA, …
    ## $ HealthGen        <fct> Good, NA, Good, NA, NA, Vgood, V…
    ## $ DaysPhysHlthBad  <int> 0, NA, 0, NA, NA, 0, 10, 0, 4, NA
    ## $ DaysMentHlthBad  <int> 15, NA, 10, NA, NA, 3, 0, 0, 0, …
    ## $ LittleInterest   <fct> Most, NA, Several, NA, NA, None,…
    ## $ Depressed        <fct> Several, NA, Several, NA, NA, No…
    ## $ nPregnancies     <int> NA, NA, 2, NA, NA, 1, NA, NA, NA…
    ## $ nBabies          <int> NA, NA, 2, NA, NA, NA, NA, NA, N…
    ## $ Age1stBaby       <int> NA, NA, 27, NA, NA, NA, NA, NA, …
    ## $ SleepHrsNight    <int> 4, NA, 8, NA, NA, 8, 7, 5, 4, NA
    ## $ SleepTrouble     <fct> Yes, NA, Yes, NA, NA, No, No, No…
    ## $ PhysActive       <fct> No, (Missing), No, (Missing), (M…
    ## $ PhysActiveDays   <int> NA, NA, NA, NA, NA, 5, 7, 5, 1, …
    ## $ TVHrsDay         <fct> NA, NA, NA, NA, NA, NA, NA, NA, …
    ## $ CompHrsDay       <fct> NA, NA, NA, NA, NA, NA, NA, NA, …
    ## $ TVHrsDayChild    <int> NA, 4, NA, 5, 1, NA, NA, NA, NA,…
    ## $ CompHrsDayChild  <int> NA, 1, NA, 0, 6, NA, NA, NA, NA,…
    ## $ Alcohol12PlusYr  <fct> Yes, NA, Yes, NA, NA, Yes, Yes, …
    ## $ AlcoholDay       <int> NA, NA, 2, NA, NA, 3, 1, 2, 6, NA
    ## $ AlcoholYear      <int> 0, NA, 20, NA, NA, 52, 100, 104,…
    ## $ SmokeNow         <fct> No, NA, Yes, NA, NA, NA, No, NA,…
    ## $ Smoke100         <fct> Yes, NA, Yes, NA, NA, No, Yes, N…
    ## $ Smoke100n        <fct> Smoker, NA, Smoker, NA, NA, Non-…
    ## $ SmokeAge         <int> 18, NA, 38, NA, NA, NA, 13, NA, …
    ## $ Marijuana        <fct> Yes, NA, Yes, NA, NA, Yes, NA, Y…
    ## $ AgeFirstMarij    <int> 17, NA, 18, NA, NA, 13, NA, 19, …
    ## $ RegularMarij     <fct> No, NA, No, NA, NA, No, NA, Yes,…
    ## $ AgeRegMarij      <int> NA, NA, NA, NA, NA, NA, NA, 20, …
    ## $ HardDrugs        <fct> Yes, NA, Yes, NA, NA, No, No, Ye…
    ## $ SexEver          <fct> Yes, NA, Yes, NA, NA, Yes, Yes, …
    ## $ SexAge           <int> 16, NA, 12, NA, NA, 13, 17, 22, …
    ## $ SexNumPartnLife  <int> 8, NA, 10, NA, NA, 20, 15, 7, 10…
    ## $ SexNumPartYear   <int> 1, NA, 1, NA, NA, 0, NA, 1, 1, NA
    ## $ SameSex          <fct> No, NA, Yes, NA, NA, Yes, No, No…
    ## $ SexOrientation   <fct> Heterosexual, NA, Heterosexual, …
    ## $ PregnantNow      <fct> NA, NA, NA, NA, NA, NA, NA, NA, …
    ## $ isChild          <lgl> FALSE, TRUE, FALSE, TRUE, TRUE, …
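    To see why the missing values are made explicit before counting, here is a minimal base R sketch. The factor phys_active is a made-up toy vector, not the NHANES data, and the manual relabelling only approximates what fct_explicit_na() does in one step. Note also that forcats 1.0.0 and later mark fct_explicit_na() as deprecated and suggest fct_na_value_to_level() instead.

```r
# Toy factor (hypothetical data): missing responses are stored as NA,
# i.e. they are "implicit" missing values.
phys_active <- factor(
  c("No", NA, "No", NA, NA, "Yes", "Yes", "Yes", "Yes", NA),
  levels = c("No", "Yes")
)

# table() drops NA by default, so the missing responses vanish from the tally.
implicit_counts <- table(phys_active)

# Make the missing values explicit: add a "(Missing)" level, then assign it
# to every NA entry.
explicit <- phys_active
levels(explicit) <- c(levels(explicit), "(Missing)")
explicit[is.na(explicit)] <- "(Missing)"
explicit_counts <- table(explicit)

print(implicit_counts)
print(explicit_counts)
```

    Once "(Missing)" is an ordinary factor level, any later grouping or counting step treats it like the other responses instead of silently dropping it.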

    You can see that this data frame contains all of the original variables. Since we are only interested in the PhysActive variable, let’s extract that one and get rid of the rest. We can do this using the select() function from the dplyr package. Because other loaded packages can define their own select() function and mask the dplyr version, we refer to the dplyr one explicitly by writing the package name followed by two colons: dplyr::select().

    NHANES_unique %>%
      # convert the implicit missing values to explicit
      mutate(PhysActive = fct_explicit_na(PhysActive)) %>%
      # select the variable of interest
      dplyr::select(PhysActive) %>% 
      head(10) %>%
      kable()
    PhysActive
    No
    (Missing)
    No
    (Missing)
    (Missing)
    Yes
    Yes
    Yes
    Yes
    (Missing)
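    The package::function() notation used above works for any package, not just dplyr. The following sketch illustrates the idea with base R only. The local median() function is a hypothetical stand-in for a function from another package that happens to share a name.

```r
# Hypothetical local function that reuses the name of stats::median()
median <- function(x) "local median"

# A plain call finds the local definition first ...
local_result <- median(c(1, 5, 9))

# ... while the package prefix always reaches the stats version.
package_result <- stats::median(c(1, 5, 9))

print(local_result)
print(package_result)

# Remove the local definition so that median() refers to stats::median() again.
rm(median)
```

    Writing the prefix costs a few extra characters but makes the call independent of the order in which packages were loaded.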

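    The next function, group_by(), tells R that we want to analyze the data separately for each level of the PhysActive variable. Grouping only marks the rows as belonging to groups for later steps and does not change or reorder them, which is why the first ten rows look the same as before: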

    NHANES_unique %>%
      # convert the implicit missing values to explicit
      mutate(PhysActive = fct_explicit_na(PhysActive)) %>%
      # select the variable of interest
      dplyr::select(PhysActive) %>% 
      group_by(PhysActive) %>%
      head(10) %>%
      kable()
    PhysActive
    No
    (Missing)
    No
    (Missing)
    (Missing)
    Yes
    Yes
    Yes
    Yes
    (Missing)
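    Because the printed rows do not reveal the grouping, it can help to inspect it directly. The sketch below assumes dplyr is installed and uses a toy data frame (toy) built from the ten values shown above rather than the full NHANES data.

```r
library(dplyr)

# Toy data frame holding the ten PhysActive values printed above
toy <- data.frame(
  PhysActive = factor(
    c("No", "(Missing)", "No", "(Missing)", "(Missing)",
      "Yes", "Yes", "Yes", "Yes", "(Missing)"),
    levels = c("No", "Yes", "(Missing)")
  )
)

grouped <- toy %>% group_by(PhysActive)

# The values and their order are untouched ...
same_values <- identical(grouped$PhysActive, toy$PhysActive)

# ... but the grouping is recorded as metadata on the data frame.
grouping_column <- group_vars(grouped)
number_of_groups <- n_groups(grouped)

print(same_values)
print(grouping_column)
print(number_of_groups)
```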

    The final command tells R to create a new data frame by summarizing the data that we are passing in (which in this case is the PhysActive variable, grouped by its different levels). We tell the summarize() function to create a new variable (called AbsoluteFrequency) that contains the number of observations in each group, which is generated by the n() function.

    NHANES_unique %>%
      # convert the implicit missing values to explicit
      mutate(PhysActive = fct_explicit_na(PhysActive)) %>%
      # select the variable of interest
      dplyr::select(PhysActive) %>% 
      group_by(PhysActive) %>%
      summarize(AbsoluteFrequency = n()) %>%
      kable()
    PhysActive   AbsoluteFrequency
    No                        2473
    Yes                       2972
    (Missing)                 1334
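    As a side note, dplyr also offers count(), which combines the group_by() and summarize(n()) steps. The sketch below is not the approach used in this section. It assumes dplyr is installed and uses a toy data frame (toy) with ten made-up rows to show that the two approaches give the same counts.

```r
library(dplyr)

# Toy data frame (hypothetical values, not the NHANES data)
toy <- data.frame(
  PhysActive = factor(
    c("No", "(Missing)", "No", "(Missing)", "(Missing)",
      "Yes", "Yes", "Yes", "Yes", "(Missing)"),
    levels = c("No", "Yes", "(Missing)")
  )
)

# Two-step version, as in the text
two_step <- toy %>%
  group_by(PhysActive) %>%
  summarize(AbsoluteFrequency = n())

# One-step shorthand: count() groups by PhysActive and tallies the rows
one_step <- toy %>%
  count(PhysActive, name = "AbsoluteFrequency")

print(two_step)
print(one_step)
```

    The explicit two-step form is longer, but it generalizes more easily when you want to compute other summaries per group alongside the count.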

    Now let’s say we want to add another column with the percentage of observations in each group. We compute the percentage by dividing the absolute frequency for each group by the total number of observations (here including the missing responses) and multiplying by 100. We can use the table that we already generated and add a new variable, again using mutate():

    PhysActive_table <- PhysActive_table %>%
      mutate(
        Percentage = AbsoluteFrequency / sum(AbsoluteFrequency) * 100
      )
    
    kable(PhysActive_table, digits=2)
    PhysActive   AbsoluteFrequency   Percentage
    No                        2473           36
    Yes                       2972           44
    (Missing)                 1334           20
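    The percentage calculation can be checked by hand. The following base R sketch copies the three absolute frequencies from the table above into a named vector (the vector name absolute_frequency is chosen here for illustration) and repeats the arithmetic. The unrounded percentages come out at roughly 36.48, 43.84, and 19.68. The table shows them as whole numbers, which presumably reflects the display settings of the session that produced it.

```r
# Absolute frequencies copied from the table above
absolute_frequency <- c(No = 2473, Yes = 2972, "(Missing)" = 1334)

# Total number of observations, including the missing responses
total <- sum(absolute_frequency)

# Same formula as in the mutate() call
percentage <- absolute_frequency / total * 100

print(total)
print(round(percentage, 2))
```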