2.1.1: Exercises (Examining Numerical Data)
\exercisesheader{}
% 1
\eoce{\qt{Mammal life spans\label{mammal_life_spans}} Data were collected on life spans (in
years) and gestation lengths (in days) for 62 mammals. A scatterplot of life span versus
length of gestation is shown below. \footfullcite{Allison+Cicchetti:1975}
\noindent\begin{minipage}[c]{0.44\textwidth}
\begin{parts}
\item What type of an association is apparent between life span and length of gestation?
\item What type of an association would you expect to see if the axes of the plot were reversed, i.e. if we plotted length of gestation versus life span?
\item Are life span and length of gestation independent? Explain your reasoning.
\end{parts}
\end{minipage}
\begin{minipage}[c]{0.55\textwidth}
\begin{center}
\Figures[A scatterplot of 62 points is shown. The variable "Gestation" is shown along the horizontal axis with a range of 0 days to about 650 days. The variable "Life Span" is shown along the vertical axis with a range of 0 years to 100 years. A large cluster of points is shown between 0 and 250 gestational days and 0 and 30 years. Outside of this cluster, there is one point at approximately (10, 50). There is another cluster of points between 250 and 450 gestational days and 25 and 50 years. Beyond the points so far described are three points located at (250 days, 100 years), (640 days, 70 years), and (650 days, 45 years).]{0.86}{eoce/mammal_life_spans}{mammal_life_spans_scatterplot}
\end{center}
\end{minipage}
}{}
% 2
\eoce{\qt{Associations\label{association_plots}}
Indicate which of the plots show
(a)~a positive association,
(b)~a negative association, or
(c)~no~association.
Also determine if the positive and negative associations
are linear or nonlinear.
Each part may refer to more than one plot.
\begin{center}
\Figures[Four scatterplots are shown and are labeled 1, 2, 3, and 4. There are no axis labels on these plots, as only the patterns of the points in the plots are important for this exercise. In plot 1, the points are moderately clustered in the lower left corner of the plot and remain clustered further right in the plot, where the points trend steadily upwards to the top-right corner. In plot 2, the points appear to be scattered almost randomly all around the rectangular plotting region. Plot 3 shows points clustered tightly in the lower left corner, and the data points remain clustered moving right, with the data trending upwards gradually and then more steeply as they reach the right side of the plot. Plot 4 shows data moderately clustered in the upper-left corner, which then steadily trend downward to the lower-right corner of the plot.]{0.95}{eoce/association_plots}{association_plots}
\end{center}
}{}
% 3
\eoce{\qt{Reproducing bacteria\label{reproducing_bacteria}} Suppose that there is only
sufficient space and nutrients to support one million bacterial cells in a petri dish.
You place a few bacterial cells in this petri dish, allow them to reproduce freely, and
record the number of bacterial cells in the dish over time. Sketch a plot representing
the relationship between number of bacterial cells and time.
% first exponential
}{}
% 4
\eoce{\qt{Office productivity\label{office_productivity}} Office productivity is relatively low
when the employees feel no stress about their work or job security. However, high levels
of stress can also lead to reduced employee productivity. Sketch a plot to represent the
relationship between stress and productivity.
}{}
% 5
\eoce{\qt{Parameters and statistics\label{parameters_stats}} Identify which value represents
the sample mean and which value represents the claimed population mean.
\begin{parts}
\item American households spent an average of about \$52 in 2007 on Halloween
merchandise such as costumes, decorations and candy. To see if this number had changed,
researchers conducted a new survey in 2008 before industry numbers were reported. The
survey included 1,500 households and found that average Halloween spending was \$58 per
household.
\item The average GPA of students in 2001 at a private university was 3.37. A survey
on a sample of 203 students from this university yielded an average GPA of 3.59
a decade later.
\end{parts}
}{}
% 6
\eoce{\qt{Sleeping in college\label{college_sleeping}} A recent article in a college newspaper
stated that college students get an average of 5.5 hrs of sleep each night. A student who
was skeptical about this value decided to conduct a survey by randomly sampling 25
students. On average, the sampled students slept 6.25 hours per night. Identify which
value represents the sample mean and which value represents the claimed population mean.
}{}
\D{\newpage}
% 7
\eoce{\qt{Days off at a mining plant\label{days_off_mining}} Workers at a particular mining
site receive an average of 35 days paid vacation, which is lower than the national
average. The manager of this plant is under pressure from a local union to increase the
amount of paid time off. However, he does not want to give more days off to the workers
because that would be costly. Instead he decides he should fire 10 employees in such a
way as to raise the average number of days off that are reported by his employees. In
order to achieve this goal, should he fire employees who have the most days
off, the fewest days off, or about the average number of days off?
}{}
% 8
\eoce{\qt{Medians and IQRs} For each part, compare distributions (1) and (2) based on their medians and IQRs. You do not need to calculate these statistics; simply state how the medians and IQRs compare. Make sure to explain your reasoning.
\begin{multicols}{2}
\begin{parts}
\item (1) 3, 5, 6, 7, 9 \\
(2) 3, 5, 6, 7, 20
\item (1) 3, 5, 6, 7, 9 \\
(2) 3, 5, 7, 8, 9
\item (1) 1, 2, 3, 4, 5 \\
(2) 6, 7, 8, 9, 10
\item (1) 0, 10, 50, 60, 100 \\
(2) 0, 100, 500, 600, 1000
\end{parts}
\end{multicols}
}{}
% 9
\eoce{\qt{Means and SDs} For each part, compare distributions (1) and (2) based on their means and standard deviations. You do not need to calculate these statistics; simply state how the means and the standard deviations compare. Make sure to explain your reasoning. \textit{Hint:} It may be useful to sketch dot plots of the distributions.
\begin{multicols}{2}
\begin{parts}
\item (1) 3, 5, 5, 5, 8, 11, 11, 11, 13 \\
(2) 3, 5, 5, 5, 8, 11, 11, 11, 20
\item (1) -20, 0, 0, 0, 15, 25, 30, 30 \\
(2) -40, 0, 0, 0, 15, 25, 30, 30
\item (1) 0, 2, 4, 6, 8, 10 \\
(2) 20, 22, 24, 26, 28, 30
\item (1) 100, 200, 300, 400, 500 \\
(2) 0, 50, 300, 550, 600
\end{parts}
\end{multicols}
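\par\noindent\textit{Note:} As a reminder, and using notation that is assumed here rather than taken from the exercise, for observations $x_1, \dots, x_n$ the sample mean and sample standard deviation are commonly written as
\[
\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i,
\qquad
s = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n} \left(x_i - \bar{x}\right)^2}.
\]
Reading each pair of data sets against these two formulas may help with the comparison, without computing exact values.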
}{}
% 10
\eoce{\qt{Mix-and-match} Describe the distribution in the histograms below and match them to the box plots. \\
\begin{center}
\Figures[Six plots are shown, three histograms labeled a, b, and c, and 3 box plots labeled 1, 2, and 3.
Plot (a) shows a histogram with horizontal range for the data of 50 to 70. The data are bell-shaped and centered in the plot, with only a little data reaching close to the lower end of 50 and the upper end of 70.
Plot (b) shows another histogram, where the horizontal axis extends from 0 to 100, and the histogram bins are relatively steady in height from the first bin near zero to the last bin near 100.
Plot (c) is a histogram with a horizontal axis running from 0 to about 7. The first few bins rise quickly to a peak at the horizontal location of 1 and then fall until reaching 2 and then decline much more gradually until about 4, where the bins are near zero and stay near zero for larger values.
Plot (1) is a box plot. The vertical axis for the box plot spans from 0 to about 7. The lower whisker is at 0, the box spans about 1 to 2, with the center line for the box plot at about 1.4. The upper whisker extends up to about 3.5, and then there are several points marked individually extending further upwards to about 7.
Plot (2) is a box plot with a vertical axis spanning about 50 to 70. The box for the plot is centered at 60 and runs from about 58 to 62. The whiskers span about 52 to 68. There are 2 individual points shown below 52 and about 4 points shown above 68.
Plot (3) is a box plot spanning from 0 to 100. The box is centered at about 50, and the box spans about 25 to 75. The whiskers extend down to 0 and up to 100.]{}{eoce/hist_box_match}{hist_box_match}
\end{center}
}{}
\D{\newpage}
% 11
\eoce{\qt{Air quality\label{air_quality_durham}} Daily air quality is measured by the air
quality index (AQI) reported by the Environmental Protection Agency. This index reports
the pollution level and what associated health effects might be a concern. The index is
calculated for five major air pollutants regulated by the Clean Air Act and takes values
from 0 to 300, where a higher value indicates lower air quality. AQI was reported for a
sample of 91 days in 2011 in Durham, NC. The relative frequency histogram below shows
the distribution of the AQI values on these days. \footfullcite{data:durhamAQI:2011} \\
\begin{minipage}[c]{0.55\textwidth}
\begin{parts}
\item Estimate the median AQI value of this sample.
\item Would you expect the mean AQI value of this sample to be higher or lower than the
median? Explain your reasoning.
\item Estimate Q1, Q3, and IQR for the distribution.
\item Would any of the days in this sample be considered to have an unusually low or
high AQI? Explain your reasoning.
\end{parts}
\end{minipage}
\begin{minipage}[c]{0.45\textwidth}
\begin{center}
\Figures[A histogram of "Daily AQI", where the horizontal axis for the data runs from about 5 to 65. The bin width is 5, there are 12 bins from 5 to 65, and the vertical axis shows proportions. The heights of the 12 bins, in order from left to right, are about 0.02 (for the bin 5 to 10), 0.06, 0.20, 0.06, 0.20, 0.15, 0.07, 0.04, 0.07, 0.08, 0.03, and 0.02 for the last bin from 60 to 65.]{}{eoce/air_quality_durham}{air_quality_durham_rel_freq_hist}
\end{center}
\end{minipage}
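\par\noindent\textit{Note:} The exercise does not say how to judge ``unusually low or high.'' One common convention, assumed here only as a sketch, is the box plot rule: with
\[
\mathit{IQR} = Q_3 - Q_1,
\]
observations outside the interval
\[
\left[\, Q_1 - 1.5 \times \mathit{IQR},\;\; Q_3 + 1.5 \times \mathit{IQR} \,\right]
\]
are flagged as potentially unusual. Other reasonable criteria exist, so state which one you use.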
}{}
% 12
\eoce{\qt{Median vs. mean\label{estimate_mean_median_simple}} Estimate the median for the
400 observations shown in the histogram, and note whether you expect the mean
to be higher or lower than the median.
\begin{center}
\Figures[A histogram is shown, where the horizontal axis for the data runs from 40 to 100, with a bin width of 5. The frequencies for the bins are as follows, where counts are approximate: 2 (for bin 40 to 45), 4, 2, 10, 20, 25, 50, 75, 70, 85, 45, and 10 for the last bin from 95 to 100.
]{0.6}{eoce/estimate_mean_median_simple}{estimate_mean_median_simple}
\end{center}
}{}
% 13
\eoce{\qt{Histograms vs. box plots\label{hist_vs_box}} Compare the two plots below. What
characteristics of the distribution are apparent in the histogram and not in the box
plot? What characteristics are apparent in the box plot but not in the histogram?
\begin{center}
\Figures[Two plots are shown, first a histogram and second a box plot. The data for each plot runs from about 0 to 30.
The histogram has bins of width 2. The bins, starting at the lower values, show an initial peak at about the horizontal location of 5, then decline to near the horizontal axis at 10, before rising again between 10 and 14, and then fall to lower values again for bins between 15 and 30.
The box plot has its box centered at 10 and runs from about 5 to 12. The whiskers reach out to about 2 and up to about 22. There are a few points above the upper whisker.
]{0.6}{eoce/hist_vs_box}{hist_vs_box}
\end{center}
}{}
% 14
\eoce{\qt{Facebook friends\label{dist_shape_fb_friends}} Facebook data indicate that
50\% of Facebook users have 100 or more friends, and that the average friend
count of users is 190. What do these findings suggest about the shape of the
distribution of number of friends of Facebook users? \footfullcite{Backstrom:2011}
}{}
% 15
\eoce{\qt{Distributions and appropriate statistics, Part I\label{dist_shape_pets_dist_height}}
For each of the following, state whether you expect the distribution to be
symmetric, right skewed, or left skewed. Also specify whether the mean or
median would best represent a typical observation in the data, and whether
the variability of observations would be best represented using the
standard deviation or IQR. Explain your reasoning.
\begin{parts}
\item Number of pets per household.
\item Distance to work, i.e. number of miles between work and home.
\item Heights of adult males.
\end{parts}
}{}
\D{\newpage}
% 16
\eoce{\qt{Distributions and appropriate statistics, Part II\label{dist_shape_housing_alcohol_salary}}
For each of the following, state whether you expect the distribution to be symmetric,
right skewed, or left skewed. Also specify whether the mean or median would best
represent a typical observation in the data, and whether the variability of observations
would be best represented using the standard deviation or IQR. Explain your reasoning.
\begin{parts}
\item Housing prices in a country where 25\% of the houses cost below \$350,000,
50\% of the houses cost below \$450,000, 75\% of the houses cost below \$1,000,000
and there are a meaningful number of houses that cost more than \$6,000,000.
\item Housing prices in a country where 25\% of the houses cost below \$300,000,
50\% of the houses cost below \$600,000, 75\% of the houses cost below \$900,000
and very few houses cost more than \$1,200,000.
\item Number of alcoholic drinks consumed by college students in a given week.
Assume that most of these students don't drink since they are under 21 years old,
and only a few drink excessively.
\item Annual salaries of the employees at a Fortune 500 company where only a few
high level executives earn much higher salaries than all the other employees.
\end{parts}
}{}
% 17
\eoce{\qt{Income at the coffee shop\label{income_coffee_shop}} The first histogram
below shows the distribution of the yearly incomes of 40 patrons at a college
coffee shop. Suppose two new people walk into the coffee shop: one making
\$225,000 and the other \$250,000. The second histogram shows the new income
distribution. Summary statistics are also provided. \\
\begin{minipage}[c]{0.57\textwidth}
\Figures[Two histograms are shown and are labeled 1 and 2. Plot (1) has a horizontal axis from \$60,000 to \$70,000. The bins, from left to right, generally rise steadily from frequencies of 2 to 3 at \$60,000 to \$62,000 and up to a peak of about 7 to 8 between \$64,000 and \$66,000. From here, the bin counts steadily decline down to about 2 for the last bin, \$69,000 to \$70,000. Plot (2) shows a histogram, with the horizontal axis running from about \$60,000 to \$260,000. The bin width is \$1,000, as in the first plot, and the first 10 bins reflect those described in Plot (1). Two additional bins are shown at about \$225,000 and \$250,000, each with a bin height of 1.]{}{eoce/income_coffee_shop}{income_coffee_shop}
\end{minipage}
\begin{minipage}[c]{0.4\textwidth}
\begin{center}
\begin{tabular}{rrr}
\hline
& (1) & (2) \\
\hline
n & 40 & 42 \\
Min. & 60,680 & 60,680 \\
1st Qu. & 63,620 & 63,710 \\
Median & 65,240 & 65,350 \\
Mean & 65,090 & 73,300 \\
3rd Qu. & 66,160 & 66,540 \\
Max. & 69,890 & 250,000 \\
SD & 2,122 & 37,321 \\
\hline
\end{tabular}
\end{center}
\end{minipage}
\begin{parts}
\item Would the mean or the median best represent what we might think of as a
typical income for the 42 patrons at this coffee shop? What does this say about
the robustness of the two measures?
\item Would the standard deviation or the IQR best represent the amount of
variability in the incomes of the 42 patrons at this coffee shop? What does
this say about the robustness of the two measures?
\end{parts}
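\par\noindent\textit{Note:} The two reported means are consistent with each other. As a quick check, assuming the 40 original incomes average exactly the rounded value of 65{,}090 shown in the table, the mean after the two new patrons arrive is
\[
\frac{40 \times 65{,}090 + 225{,}000 + 250{,}000}{42}
= \frac{3{,}078{,}600}{42}
= 73{,}300,
\]
which matches the mean reported for distribution (2).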
}{}
% 18
\eoce{\qt{Midrange\label{midrange}} The \textit{midrange} of a distribution is defined as
the average of the maximum and the minimum of that distribution. Is this statistic
robust to outliers and extreme skew? Explain your reasoning.
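\par\noindent\textit{Note:} In symbols, with $x_1, \dots, x_n$ denoting the observations (notation assumed here for illustration), the definition above reads
\[
\mathrm{midrange} = \frac{\min_i x_i + \max_i x_i}{2}.
\]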
}{}
\D{\newpage}
% 19
\eoce{\qt{Commute times\label{county_commute_times}} The US census collects data on
the time it takes Americans to commute to work, among many other variables. The
histogram below shows the distribution of average commute times in 3,142 US
counties in 2010. Also shown below is a spatial intensity map of the same data.
\begin{center}
\Figures[A histogram is shown, where the horizontal axis for the variable "Mean work travel in minutes" spans approximately 0 to 50, and the vertical axis represents frequency with a peak value of about 200. The bins start with small bin heights on the left side, and the bin heights start increasing at about 10 and then rapidly ascend by 15 before leveling off and reaching a peak at about 22. The bins begin declining gradually at about 24 and then more rapidly around 26 to 29. At 30, the bins continue declining, but at a slower pace, before they level off near a height of 0 at about 35.]{0.48}{eoce/county_commute_times}{county_commute_times_hist}
\Figures[A spatial intensity map is shown of the United States. The legend for the shading runs from values of 4 to about 33. The shading for the eastern half of the country suggests slightly higher values, while the western portion of the upper midwest (North Dakota, South Dakota, and Nebraska) shows lower values. Other specific regions that show patterns of higher values than surrounding areas are in lower Florida and northern California.]{0.48}{eoce/county_commute_times}{county_commute_times_map}
\end{center}
\begin{parts}
\item Describe the numerical distribution and comment on whether or not a log
transformation may be advisable for these data.
\item Describe the spatial distribution of commuting times using the map above.
\end{parts}
}{}
% 20
\eoce{\qt{Hispanic population\label{county_hispanic_pop}} The US census collects
data on race and ethnicity of Americans, among many other variables. The
histogram below shows the distribution of the percentage of the population
that is Hispanic in 3,142 counties in the US in 2010. Also shown is a
histogram of logs of these values.
\begin{center}
\Figures[A histogram is shown for the variable "Percent Hispanic", where the horizontal axis runs from 0 to 100. The first bin, from 0 to 5, is dramatically higher than all other bins at about 2000. From here, the bins descend rapidly: about 500 between 5 and 10, 200 between 10 and 15, 100 between 15 and 20, and then trail off, with the bins being nearly indistinguishable from a height of 0 for bins above 50\%.]{0.48}{eoce/county_hispanic_pop}{county_hispanic_pop_hist}
\Figures[A histogram is shown for the transformed variable, "log-base-e of Percent Hispanic", where the horizontal axis runs from about -2.5 to 4.5. The bins are very close to 0 in frequency until -1, then they rise slightly to about -0.5, before sharply rising to a peak at about 0.5. From here, the bins steadily decline towards a frequency of 0 at the horizontal location of 4.5.]{0.48}{eoce/county_hispanic_pop}{county_hispanic_pop_log_hist}
\Figures[A spatial intensity map is shown of the United States. The legend for the shading runs from values of 0\% to a peak of "greater than 40\%". A large part of the eastern and central portions of the country -- east of Texas, east of Colorado, east of Utah, and east of Idaho -- is shaded mostly with values below 10\%. Florida is an exception to this rule, where a handful of counties show higher values. Higher values are particularly prominent in Texas, New Mexico, Arizona, and California, which mostly show shading for values of at least 20\%. Nevada, Idaho, Oregon, and Washington show values averaging around 10-20\%.]{0.48}{eoce/county_hispanic_pop}{county_hispanic_pop_map}
\end{center}
\begin{parts}
\item Describe the numerical distribution and comment on why we might want
to use log-transformed values in analyzing or modeling these data.
\item What features of the distribution of the Hispanic population in US
counties are apparent in the map but not in the histogram? What features are
apparent in the histogram but not the map?
\item Is one visualization more appropriate or helpful than the other? Explain
your reasoning.
\end{parts}
}{}