1.2.1: Exercises (Data Basics)

    1.3: Air pollution and birth outcomes, study components

    Researchers collected data to examine the relationship between air pollutants and preterm births in Southern California. During the study, air pollution levels were measured by air quality monitoring stations. Specifically, levels of carbon monoxide (CO) were recorded in parts per million, nitrogen dioxide and ozone in parts per hundred million, and coarse particulate matter (\(PM_{10}\)) in \(\mu g/m^3\). Length of gestation data were collected on 143,196 births between the years 1989 and 1993, and air pollution exposure during gestation was calculated for each birth. The analysis suggested that increased ambient \(PM_{10}\) and, to a lesser degree, CO concentrations may be associated with the occurrence of preterm births.

    1. Identify the main research question of the study.
    2. Who are the subjects in this study, and how many are included?
    3. What are the variables in the study? Identify each variable as numerical or categorical. If numerical, state whether the variable is discrete or continuous. If categorical, state whether the variable is ordinal.
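
    The pollutant levels above are reported in different units, so readings may need to be put on a common scale before they are compared. A minimal sketch of the conversion from parts per hundred million to parts per million follows; the function name and the sample reading are illustrative assumptions, not values from the study.

```python
def pphm_to_ppm(value_pphm: float) -> float:
    """Convert parts per hundred million to parts per million."""
    # 1 part per hundred million is 1e-8 and 1 part per million is 1e-6,
    # so 100 pphm correspond to 1 ppm.
    return value_pphm / 100.0


# Hypothetical nitrogen dioxide reading, not a value from the study.
no2_pphm = 4.0
no2_ppm = pphm_to_ppm(no2_pphm)
print(f"{no2_pphm} pphm = {no2_ppm} ppm")
```

    The conversion changes only the scale of the measurement, not the kind of variable being recorded.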

    1.4: Buteyko method, study components

    The Buteyko method is a shallow breathing technique developed by Konstantin Buteyko, a Russian doctor, in 1952. Anecdotal evidence suggests that the Buteyko method can reduce asthma symptoms and improve quality of life. In a scientific study to determine the effectiveness of this method, researchers recruited 600 asthma patients aged 18-69 who relied on medication for asthma treatment. These patients were randomly split into two research groups: one practiced the Buteyko method and the other did not. Patients were scored on quality of life, activity, asthma symptoms, and medication reduction on a scale from 0 to 10. On average, the participants in the Buteyko group experienced a significant reduction in asthma symptoms and an improvement in quality of life.

    1. Identify the main research question of the study.
    2. Who are the subjects in this study, and how many are included?
    3. What are the variables in the study? Identify each variable as numerical or categorical. If numerical, state whether the variable is discrete or continuous. If categorical, state whether the variable is ordinal.
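
    The random split described above can be sketched in a few lines. This is only an illustration under stated assumptions: the exercise does not say the two groups were of equal size, and the patient IDs, the seed, and the function name `random_split` are hypothetical.

```python
import random


def random_split(patient_ids, seed=None):
    """Randomly split patient IDs into two groups of (nearly) equal size."""
    rng = random.Random(seed)     # seeded generator so the split is reproducible
    shuffled = list(patient_ids)  # copy, so the caller's sequence is not modified
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]


# 600 hypothetical patient IDs; the seed value is arbitrary.
buteyko_group, control_group = random_split(range(1, 601), seed=2024)
print(len(buteyko_group), len(control_group))
```

    Because group membership is decided by the shuffle rather than by the patients or the researchers, neither group is systematically favored at the start of the study.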

    1.5: Cheaters, study components

    Researchers studying the relationship between honesty, age, and self-control conducted an experiment on 160 children between the ages of 5 and 15. Participants reported their age, sex, and whether they were an only child or not. The researchers asked each child to toss a fair coin in private and to record the outcome (white or black) on a paper sheet, and said they would only reward children who reported white. The study's findings can be summarized as follows: "Half the students were explicitly told not to cheat and the others were not given any explicit instructions. In the no instruction group probability of cheating was found to be uniform across groups based on child's characteristics. In the group that was explicitly told to not cheat, girls were less likely to cheat, and while rate of cheating didn't vary by age for boys, it decreased with age for girls."

    1. Identify the main research question of the study.
    2. Who are the subjects in this study, and how many are included?
    3. How many variables were recorded for each subject in the study in order to conclude these findings? State the variables and their types.

    1.6: Stealers, study components

    In a study of the relationship between socio-economic class and unethical behavior, 129 undergraduates at the University of California, Berkeley were asked to identify themselves as having low or high social class by comparing themselves to others with the most (least) money, most (least) education, and most (least) respected jobs. They were also presented with a jar of individually wrapped candies and informed that the candies were for children in a nearby laboratory, but that they could take some if they wanted. After completing some unrelated tasks, participants reported the number of candies they had taken (Piff, 2012).

    1. Identify the main research question of the study.
    2. Who are the subjects in this study, and how many are included?
    3. The study found that students who were identified as upper-class took more candy than others. How many variables were recorded for each subject in the study in order to conclude these findings? State the variables and their types.

    1.7: Migraine and acupuncture, Part II

    An earlier exercise (Migraine and acupuncture, Part I) introduced a study exploring whether acupuncture had any effect on migraines. Researchers conducted a randomized controlled study where patients were randomly assigned to one of two groups: treatment or control. The patients in the treatment group received acupuncture that was specifically designed to treat migraines. The patients in the control group received placebo acupuncture (needle insertion at non-acupoint locations). Twenty-four hours after patients received acupuncture, they were asked if they were pain free. What are the explanatory and response variables in this study?

    1.8: Sinusitis and antibiotics, Part II

    An earlier exercise (Sinusitis and antibiotics, Part I) introduced a study exploring the effect of antibiotic treatment for acute sinusitis. Study participants received either a 10-day course of an antibiotic (treatment) or a placebo similar in appearance and taste (control). At the end of the 10-day period, patients were asked if they experienced improvement in symptoms. What are the explanatory and response variables in this study?

    1.9: Fisher's irises

    Sir Ronald Aylmer Fisher was an English statistician, evolutionary biologist, and geneticist who worked on a data set that contained sepal length and width, and petal length and width from three species of iris flowers (setosa, versicolor and virginica). There were 50 flowers from each species in the data set.

    1. How many cases were included in the data?
    2. How many numerical variables are included in the data? Indicate what they are, and if they are continuous or discrete.
    3. How many categorical variables are included in the data, and what are they? List the corresponding levels (categories).

    1.10: Smoking habits of UK residents

    A survey was conducted to study the smoking habits of UK residents. Below is a data matrix displaying a portion of the data collected in this survey. Note that "£" stands for British Pounds Sterling, "cig" stands for cigarettes, and "N/A" refers to a missing component of the data.

              sex     age  marital  grossIncome         smoke  amtWeekends  amtWeekdays
        1     Female  42   Single   Under £2,600        Yes    12 cig/day   12 cig/day
        2     Male    44   Single   £10,400 to £15,600  No     N/A          N/A
        3     Male    53   Married  Above £36,400       Yes    6 cig/day    6 cig/day
        ...   ...     ...  ...      ...                 ...    ...          ...
        1691  Male    40   Single   £2,600 to £5,200    Yes    8 cig/day    8 cig/day

    1. What does each row of the data matrix represent?
    2. How many participants were included in the survey?
    3. Indicate whether each variable in the study is numerical or categorical. If numerical, identify as continuous or discrete. If categorical, indicate if the variable is ordinal.
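
    As a rough illustration of how a data matrix like this is laid out in software, the sketch below stores the four rows shown in the excerpt as a list of dictionaries. The `id` key for the row label, the use of `None` for "N/A", and storing the amounts as plain numbers of cigarettes per day are all choices made for this example, not part of the survey data.

```python
# The four rows shown in the excerpt; None stands in for "N/A".
# Amounts are stored as cigarettes per day.
rows = [
    {"id": 1, "sex": "Female", "age": 42, "marital": "Single",
     "grossIncome": "Under £2,600", "smoke": "Yes",
     "amtWeekends": 12, "amtWeekdays": 12},
    {"id": 2, "sex": "Male", "age": 44, "marital": "Single",
     "grossIncome": "£10,400 to £15,600", "smoke": "No",
     "amtWeekends": None, "amtWeekdays": None},
    {"id": 3, "sex": "Male", "age": 53, "marital": "Married",
     "grossIncome": "Above £36,400", "smoke": "Yes",
     "amtWeekends": 6, "amtWeekdays": 6},
    {"id": 1691, "sex": "Male", "age": 40, "marital": "Single",
     "grossIncome": "£2,600 to £5,200", "smoke": "Yes",
     "amtWeekends": 8, "amtWeekdays": 8},
]

# Column names other than the row label are the recorded variables.
variables = [name for name in rows[0] if name != "id"]
print(len(rows), "rows shown;", len(variables), "variables")

# Rows of the excerpt where the weekend amount is missing.
missing = [row["id"] for row in rows if row["amtWeekends"] is None]
print("Rows with missing amounts:", missing)
```

    Keeping missing values as `None` rather than as the text "N/A" lets the amount columns hold numbers only, which makes later summaries simpler.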

    1.11: US Airports

    The visualization below shows the geographical distribution of airports in the contiguous United States and Washington, DC. This visualization was constructed based on a dataset where each observation is an airport.

    Figure description: Four copies of a map of the United States are shown in a 2-by-2 grid. For each map, the axis labels are longitude (130 degrees west to 60 degrees west) and latitude (20 degrees north to 50 degrees north). The first column of plots is labeled "private use" and the second column "public use". The first row of plots is labeled "privately owned" and the second is labeled "publicly owned". Points are shown on each of the four plots, where each point represents an airport. There appear to be many thousands of points in the upper-left map (private use, privately owned) and the lower-right map (public use, publicly owned), while there are relatively fewer points, even if still numbering in the hundreds or low thousands, in the other two plots. In all plots, there is a greater density of points in the Middle and Eastern portions of the United States, with sparser points over the mountain and desert areas, and then a higher concentration of points again around the states bordered by the Pacific Ocean, especially near large cities.

    1. List the variables used in creating this visualization.
    2. Indicate whether each variable in the study is numerical or categorical. If numerical, identify as continuous or discrete. If categorical, indicate if the variable is ordinal.

    1.12: UN Votes

    The visualization below shows voting patterns in the United States, Canada, and Mexico in the United Nations General Assembly on a variety of issues. Specifically, for a given year between 1946 and 2015, it displays the percentage of roll calls in which the country voted yes for each issue. This visualization was constructed based on a dataset where each observation is a country/year pair.

    Figure description: A grid of scatter plots with overlaid trend lines for each of three groups of points (colored green, blue, and red) per plot is shown. The grid has 2 rows and 3 columns, and the plots are referenced by number, running from 1 to 3 in the first row and 4 to 6 in the second row. For all plots, the horizontal axis is "year" (about 1945 to about 2018) and the vertical axis is "percent yes", with values ranging from 0% to 100%. Each of the six plots summarizes voting patterns on a different topic at the UN General Assembly for Canada (blue), Mexico (green), and the United States (red). Each plot has points and flexible (nonlinear) trend lines fit to those points. In all cases except Plot 2, "Colonialism", the points (data) are sparse in 1940 to 1960 relative to later years.

    Plot 1 represents "Arms control and disarmament", which for all countries starts out low, between 0% and 25%, and then quickly rises by 1960 to between 25% and 95%, where the US remains the lowest (hovering around 25% to 40%), Canada a bit higher, between 50% and 70%, and Mexico the highest, typically between 85% and 100%.

    Plot 2 is labeled "Colonialism", and the trend lines start out between 50% and 80%, with the US then descending close to 0% by 1980, while Canada fluctuates between 25% and 60% over the duration, and Mexico rises to close to 100% by 1980.

    Plot 3 represents "Economic development", where the three countries all start near 25% to 40%, with the US declining to about 5% by 1990 before rising to 20%, Canada descending to about 25% by 1985, rising to 50% by 2000, and then descending again to 25%, and Mexico rising to about 100% by 1980 before descending to about 85%.

    Plot 4 represents "Human rights", with all countries clustered near 65% in 1945; the US then descends to 25% by 1975 and fluctuates between 10% and 30% for the rest of the time, Canada slowly descends over time to about 15%, and Mexico rises to close to 100% by 1985 and then descends slowly to about 80%.

    Plot 5 represents "Nuclear weapons and materials", with all countries starting near 0% in 1945; the US then rises a bit but generally fluctuates between 15% and 40%, Canada rises to about 60% by 1965 before descending to and fluctuating around 40% to 50%, and Mexico rises rapidly to about 90% by 1970 and then approaches 100% over time.

    Plot 6 represents the "Palestinian conflict", where the countries all start between 50% and 75%; the US declines steadily to about 10% by 1985 and then approaches 5% after that, Canada declines a bit to about 35% in 1970 before rising to about 70% in 2000 and then descending rapidly to close to 0%, and Mexico gradually increases to about 95% in 1985 and then holds roughly steady.

    1. List the variables used in creating this visualization.
    2. Indicate whether each variable in the study is numerical or categorical. If numerical, identify as continuous or discrete. If categorical, indicate if the variable is ordinal.
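
    The "percentage of roll calls in which the country voted yes" can be computed from individual vote records along the lines of the sketch below. The records are made up for illustration, and the sketch assumes every recorded vote counts toward the total; how the original dataset treats abstentions or absences is not stated in the exercise.

```python
from collections import defaultdict

# Hypothetical roll-call records: (country, year, issue, vote).
roll_calls = [
    ("Canada", 1990, "Human rights", "yes"),
    ("Canada", 1990, "Human rights", "no"),
    ("Canada", 1990, "Human rights", "yes"),
    ("Canada", 1990, "Human rights", "yes"),
    ("Mexico", 1990, "Human rights", "yes"),
    ("Mexico", 1990, "Human rights", "yes"),
    ("United States", 1990, "Human rights", "no"),
    ("United States", 1990, "Human rights", "yes"),
]


def percent_yes(records):
    """Return {(country, year, issue): percentage of votes that were 'yes'}."""
    counts = defaultdict(lambda: [0, 0])  # key -> [yes votes, total votes]
    for country, year, issue, vote in records:
        tally = counts[(country, year, issue)]
        if vote == "yes":
            tally[0] += 1
        tally[1] += 1
    return {key: 100.0 * yes / total for key, (yes, total) in counts.items()}


summary = percent_yes(roll_calls)
for (country, year, issue), pct in sorted(summary.items()):
    print(f"{country}, {year}, {issue}: {pct:.1f}% yes")
```

    Each key of `summary` combines a country, a year, and an issue, which mirrors how each point in the figure is positioned and colored.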

    This page titled 1.2.1: Exercises (Data Basics) is shared under a CC BY-SA 3.0 license and was authored, remixed, and/or curated by David Diez, Christopher Barr, & Mine Çetinkaya-Rundel.
