# 5.5: Computing a Frequency Distribution (Section 4.2.1)

We would like to compute a frequency distribution showing how many people report being either active or inactive. The following statement is fairly complex, so we will step through it one piece at a time.

```r
PhysActive_table <- NHANES_unique %>%
  # convert the implicit missing values to explicit
  mutate(PhysActive = fct_explicit_na(PhysActive)) %>%
  # select the variable of interest
  dplyr::select(PhysActive) %>%
  # group by values of the variable
  group_by(PhysActive) %>%
  # count the values
  summarize(AbsoluteFrequency = n())

# kable() prints out the table in a prettier way.
kable(PhysActive_table)
```

| PhysActive | AbsoluteFrequency |
|:-----------|------------------:|
| No         |              2473 |
| Yes        |              2972 |
| (Missing)  |              1334 |

The first step should be familiar from the previous section. Here we also add head() to keep only the first few rows of the data frame, and glimpse() to print them compactly:

```r
NHANES_unique %>%
  mutate(PhysActive = fct_explicit_na(PhysActive)) %>%
  head(10) %>%
  glimpse()
```

```
## Observations: 10
## Variables: 77
## $ ID               <int> 51624, 51625, 51630, 51638, 5164…
## $ SurveyYr         <fct> 2009_10, 2009_10, 2009_10, 2009_…
## $ Gender           <fct> male, male, female, male, male, …
## $ Age              <int> 34, 4, 49, 9, 8, 45, 66, 58, 54,…
## $ AgeDecade        <fct> 30-39, 0-9, 40-49, 0-9, 0-9…
## $ AgeMonths        <int> 409, 49, 596, 115, 101, 541, 795…
## $ Race1            <fct> White, Other, White, White, Whit…
## $ Race3            <fct> NA, NA, NA, NA, NA, NA, NA, NA, …
## $ Education        <fct> High School, NA, Some College, N…
## $ MaritalStatus    <fct> Married, NA, LivePartner, NA, NA…
## $ HHIncome         <fct> 25000-34999, 20000-24999, 35000-…
## $ HHIncomeMid      <int> 30000, 22500, 40000, 87500, 6000…
## $ Poverty          <dbl> 1.4, 1.1, 1.9, 1.8, 2.3, 5.0, 2.…
## $ HomeRooms        <int> 6, 9, 5, 6, 7, 6, 5, 10, 6, 10
## $ HomeOwn          <fct> Own, Own, Rent, Rent, Own, Own, …
## $ Work             <fct> NotWorking, NA, NotWorking, NA, …
## $ Weight           <dbl> 87, 17, 87, 30, 35, 76, 68, 78, …
## $ Length           <dbl> NA, NA, NA, NA, NA, NA, NA, NA, …
## $ HeadCirc         <dbl> NA, NA, NA, NA, NA, NA, NA, NA, …
## $ Height           <dbl> 165, 105, 168, 133, 131, 167, 17…
## $ BMI              <dbl> 32, 15, 31, 17, 21, 27, 24, 24, …
## $ BMICatUnder20yrs <fct> NA, NA, NA, NA, NA, NA, NA, NA, …
## $ BMI_WHO          <fct> 30.0_plus, 12.0_18.5, 30.0_plus,…
## $ Pulse            <int> 70, NA, 86, 82, 72, 62, 60, 62, …
## $ BPSysAve         <int> 113, NA, 112, 86, 107, 118, 111,…
## $ BPDiaAve         <int> 85, NA, 75, 47, 37, 64, 63, 74, …
## $ BPSys1           <int> 114, NA, 118, 84, 114, 106, 124,…
## $ BPDia1           <int> 88, NA, 82, 50, 46, 62, 64, 76, …
## $ BPSys2           <int> 114, NA, 108, 84, 108, 118, 108,…
## $ BPDia2           <int> 88, NA, 74, 50, 36, 68, 62, 72, …
## $ BPSys3           <int> 112, NA, 116, 88, 106, 118, 114,…
## $ BPDia3           <int> 82, NA, 76, 44, 38, 60, 64, 76, …
## $ Testosterone     <dbl> NA, NA, NA, NA, NA, NA, NA, NA, …
## $ DirectChol       <dbl> 1.29, NA, 1.16, 1.34, 1.55, 2.12…
## $ TotChol          <dbl> 3.5, NA, 6.7, 4.9, 4.1, 5.8, 5.0…
## $ UrineVol1        <int> 352, NA, 77, 123, 238, 106, 113,…
## $ UrineFlow1       <dbl> NA, NA, 0.094, 1.538, 1.322, 1.1…
## $ UrineVol2        <int> NA, NA, NA, NA, NA, NA, NA, NA, …
## $ UrineFlow2       <dbl> NA, NA, NA, NA, NA, NA, NA, NA, …
## $ Diabetes         <fct> No, No, No, No, No, No, No, No, …
## $ DiabetesAge      <int> NA, NA, NA, NA, NA, NA, NA, NA, …
## $ HealthGen        <fct> Good, NA, Good, NA, NA, Vgood, V…
## $ DaysPhysHlthBad  <int> 0, NA, 0, NA, NA, 0, 10, 0, 4, NA
## $ DaysMentHlthBad  <int> 15, NA, 10, NA, NA, 3, 0, 0, 0, …
## $ LittleInterest   <fct> Most, NA, Several, NA, NA, None,…
## $ Depressed        <fct> Several, NA, Several, NA, NA, No…
## $ nPregnancies     <int> NA, NA, 2, NA, NA, 1, NA, NA, NA…
## $ nBabies          <int> NA, NA, 2, NA, NA, NA, NA, NA, N…
## $ Age1stBaby       <int> NA, NA, 27, NA, NA, NA, NA, NA, …
## $ SleepHrsNight    <int> 4, NA, 8, NA, NA, 8, 7, 5, 4, NA
## $ SleepTrouble     <fct> Yes, NA, Yes, NA, NA, No, No, No…
## $ PhysActive       <fct> No, (Missing), No, (Missing), (M…
## $ PhysActiveDays   <int> NA, NA, NA, NA, NA, 5, 7, 5, 1, …
## $ TVHrsDay         <fct> NA, NA, NA, NA, NA, NA, NA, NA, …
## $ CompHrsDay       <fct> NA, NA, NA, NA, NA, NA, NA, NA, …
## $ TVHrsDayChild    <int> NA, 4, NA, 5, 1, NA, NA, NA, NA,…
## $ CompHrsDayChild  <int> NA, 1, NA, 0, 6, NA, NA, NA, NA,…
## $ Alcohol12PlusYr  <fct> Yes, NA, Yes, NA, NA, Yes, Yes, …
## $ AlcoholDay       <int> NA, NA, 2, NA, NA, 3, 1, 2, 6, NA
## $ AlcoholYear      <int> 0, NA, 20, NA, NA, 52, 100, 104,…
## $ SmokeNow         <fct> No, NA, Yes, NA, NA, NA, No, NA,…
## $ Smoke100         <fct> Yes, NA, Yes, NA, NA, No, Yes, N…
## $ Smoke100n        <fct> Smoker, NA, Smoker, NA, NA, Non-…
## $ SmokeAge         <int> 18, NA, 38, NA, NA, NA, 13, NA, …
## $ Marijuana        <fct> Yes, NA, Yes, NA, NA, Yes, NA, Y…
## $ AgeFirstMarij    <int> 17, NA, 18, NA, NA, 13, NA, 19, …
## $ RegularMarij     <fct> No, NA, No, NA, NA, No, NA, Yes,…
## $ AgeRegMarij      <int> NA, NA, NA, NA, NA, NA, NA, 20, …
## $ HardDrugs        <fct> Yes, NA, Yes, NA, NA, No, No, Ye…
## $ SexEver          <fct> Yes, NA, Yes, NA, NA, Yes, Yes, …
## $ SexAge           <int> 16, NA, 12, NA, NA, 13, 17, 22, …
## $ SexNumPartnLife  <int> 8, NA, 10, NA, NA, 20, 15, 7, 10…
## $ SexNumPartYear   <int> 1, NA, 1, NA, NA, 0, NA, 1, 1, NA
## $ SameSex          <fct> No, NA, Yes, NA, NA, Yes, No, No…
## $ SexOrientation   <fct> Heterosexual, NA, Heterosexual, …
## $ PregnantNow      <fct> NA, NA, NA, NA, NA, NA, NA, NA, …
## $ isChild          <lgl> FALSE, TRUE, FALSE, TRUE, TRUE, …
```
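To see what the conversion in this step does, here is a minimal sketch on a small made-up factor (the vector answers is hypothetical and not part of NHANES). It assumes forcats 1.0.0 or later, where fct_explicit_na() is deprecated in favor of fct_na_value_to_level(); on older versions, fct_explicit_na(answers) gives the same result.

```r
library(forcats)

# Hypothetical toy factor with two implicit missing values (NA)
answers <- factor(c("No", NA, "Yes", "Yes", NA))

# NA is not a level, so table() silently drops the missing values
table(answers)

# Turn NA into an explicit level named "(Missing)"
# (assumes forcats >= 1.0.0; on older versions use fct_explicit_na(answers))
answers_explicit <- fct_na_value_to_level(answers, level = "(Missing)")

# Now the missing values are counted like any other level
table(answers_explicit)
```

This is why the missing responses show up as their own row in the frequency table instead of being dropped.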

You can see that this data frame contains all of the original variables. Since we are only interested in the PhysActive variable, let’s extract that one and get rid of the rest. We can do this using the select() command from the dplyr package. Because another function named select() can also be available in R (from a different package), we explicitly refer to the one from dplyr by writing the package name followed by two colons: dplyr::select().

```r
NHANES_unique %>%
  # convert the implicit missing values to explicit
  mutate(PhysActive = fct_explicit_na(PhysActive)) %>%
  # select the variable of interest
  dplyr::select(PhysActive) %>%
  head(10) %>%
  kable()
```

| PhysActive |
|:-----------|
| No         |
| (Missing)  |
| No         |
| (Missing)  |
| (Missing)  |
| Yes        |
| Yes        |
| Yes        |
| Yes        |
| (Missing)  |
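The package::function() form also works when the package has not been attached with library(), which makes the origin of the function unambiguous. A small sketch with a made-up data frame (toy is hypothetical, and dplyr is assumed to be installed):

```r
# Hypothetical toy data frame; dplyr is installed but not attached here
toy <- data.frame(PhysActive = c("No", "Yes", NA), Age = c(34, 45, 9))

# The dplyr:: prefix picks dplyr's select(), no matter which other
# packages have defined a function with the same name
only_active <- dplyr::select(toy, PhysActive)

# Only the selected column remains
names(only_active)
```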

The next function, group_by(), tells R that we want to analyze the data separately for each level of the PhysActive variable. Grouping does not change the rows themselves, so the printed output looks the same as before:

```r
NHANES_unique %>%
  # convert the implicit missing values to explicit
  mutate(PhysActive = fct_explicit_na(PhysActive)) %>%
  # select the variable of interest
  dplyr::select(PhysActive) %>%
  group_by(PhysActive) %>%
  head(10) %>%
  kable()
```

| PhysActive |
|:-----------|
| No         |
| (Missing)  |
| No         |
| (Missing)  |
| (Missing)  |
| Yes        |
| Yes        |
| Yes        |
| Yes        |
| (Missing)  |
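One way to confirm that group_by() only records the grouping, without altering the values, is to inspect the grouping metadata. The sketch below uses a small made-up data frame (toy is hypothetical) together with the dplyr helpers group_vars() and n_groups():

```r
library(dplyr)

# Hypothetical toy data with two levels of PhysActive
toy <- data.frame(PhysActive = factor(c("No", "Yes", "Yes", "No", "Yes")))

grouped <- toy %>% group_by(PhysActive)

# The values themselves are unchanged by grouping
identical(toy$PhysActive, grouped$PhysActive)

# What changes is the metadata: the grouping variable and the number of groups
group_vars(grouped)
n_groups(grouped)
```

Later steps such as summarize() use this metadata to compute one result per group.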

The final command tells R to create a new data frame by summarizing the data that we are passing in (which in this case is the PhysActive variable, grouped by its different levels). We tell the summarize() function to create a new variable (called AbsoluteFrequency) that will contain a count of the number of observations in each group, which is generated by the n() function.

```r
NHANES_unique %>%
  # convert the implicit missing values to explicit
  mutate(PhysActive = fct_explicit_na(PhysActive)) %>%
  # select the variable of interest
  dplyr::select(PhysActive) %>%
  group_by(PhysActive) %>%
  summarize(AbsoluteFrequency = n()) %>%
  kable()
```

| PhysActive | AbsoluteFrequency |
|:-----------|------------------:|
| No         |              2473 |
| Yes        |              2972 |
| (Missing)  |              1334 |
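For reference, dplyr also provides count(), which combines the group_by() and summarize() steps in a single call. The sketch below checks on a made-up data frame (toy is hypothetical) that both routes give the same counts; the name argument sets the name of the count column.

```r
library(dplyr)

# Hypothetical toy data: 2 "No" and 4 "Yes" responses
toy <- data.frame(
  PhysActive = factor(c("No", "Yes", "Yes", "No", "Yes", "Yes"))
)

# Two-step version, as in the text
by_summarize <- toy %>%
  group_by(PhysActive) %>%
  summarize(AbsoluteFrequency = n())

# One-step version using count()
by_count <- toy %>%
  count(PhysActive, name = "AbsoluteFrequency")

# Both contain the same counts, in the order of the factor levels
by_summarize$AbsoluteFrequency
by_count$AbsoluteFrequency
```

The longer form is still worth knowing, because summarize() can compute other per-group summaries in the same step, not only counts.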

Now let’s say we want to add another column with the percentage of observations in each group. We compute the percentage by dividing the absolute frequency for each group by the total number of observations and multiplying by 100. We can use the table that we already generated, and add a new variable, again using mutate():

```r
PhysActive_table <- PhysActive_table %>%
  mutate(
    Percentage = AbsoluteFrequency / sum(AbsoluteFrequency) * 100
  )

kable(PhysActive_table, digits = 2)
```

| PhysActive | AbsoluteFrequency | Percentage |
|:-----------|------------------:|-----------:|
| No         |              2473 |         36 |
| Yes        |              2972 |         44 |
| (Missing)  |              1334 |         20 |
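As a check on the arithmetic, the percentages can be recomputed by hand from the three counts in the frequency table. This sketch uses only base R; the object names are chosen here for illustration.

```r
# Absolute frequencies copied from the frequency table
counts <- c(No = 2473, Yes = 2972, Missing = 1334)

# Total number of observations
total <- sum(counts)

# Percentage = group count / total * 100
percentages <- counts / total * 100

total
round(percentages, 2)
round(percentages)
```

The total is 6779, and rounded to whole numbers the percentages are the 36, 44, and 20 shown in the Percentage column.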