1: Introduction to Data


Page ID: 55020


A Data Study Example – Tracking Tree Health

Before we define what data is or what statistics means, let’s look at a real-world example of how people use data to answer important questions. To introduce the kinds of decisions and steps that are involved in statistics, let's imagine a research project that tracks the health of trees over time in a large urban park.

Asking a Question

Suppose city ecologists want to understand whether the health of old-growth trees in Central City Park has changed over the past 10 years. They want to find out:

How has tree health changed in Central City Park from 2013 to 2023, and what factors might be related to that change?

This question is important for managing urban green spaces and planning tree preservation strategies. It’s specific and measurable and can be answered by collecting and analyzing data.

Collecting Accurate and Relevant Data

To answer this question, the researchers need a dataset that includes information about individual trees over time. They might decide to select a sample of 200 trees within the park and assess each one once a year from 2013 to 2023.

For each tree, they could record:

Tree ID: A unique number or code for each tree
Species: Common or scientific name (e.g., Oak, Maple)
Location: GPS coordinates or zone within the park
Year: Year of observation
Health Score: A rating from 0 to 10 based on visible signs (e.g., leaf color, bark condition, dead branches)
Comments: Notes about the health score (this could be useful for tracking specific diseases or blights)
Canopy Size (ft²): Estimated tree cover area
Trunk Diameter (ft): Calculated from the measured circumference and used to quantify tree growth
Presence of Disease: Yes or No
Nearby Construction: Yes or No

This data can be collected using field surveys, aerial imagery, and environmental sensors.
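As a minimal sketch, one observation (one tree in one year) could be represented in Python as shown here. The class name, the field names, the `diameter_from_circumference` helper, and all example values other than the tree ID, species, year, and health score are assumptions made for illustration; they are not part of the study description.

```python
import math
from dataclasses import dataclass


@dataclass
class TreeObservation:
    """One tree assessed in one year (hypothetical field names)."""
    tree_id: int
    species: str
    location: str              # park zone or GPS coordinates
    year: int
    health_score: int          # rating from 0 to 10
    comments: str
    canopy_size_sqft: float
    trunk_diameter_ft: float
    disease_present: bool
    nearby_construction: bool

    def __post_init__(self):
        # Reject scores outside the 0-to-10 rating scale.
        if not 0 <= self.health_score <= 10:
            raise ValueError("health_score must be between 0 and 10")


def diameter_from_circumference(circumference_ft: float) -> float:
    """Diameter of a circle with the given circumference: d = C / pi."""
    return circumference_ft / math.pi


# Illustrative values only, not real survey data.
obs = TreeObservation(
    tree_id=45,
    species="Common Hackberry (Celtis occidentalis)",
    location="Zone A",
    year=2013,
    health_score=10,
    comments="No visible damage",
    canopy_size_sqft=850.0,
    trunk_diameter_ft=diameter_from_circumference(6.0),
    disease_present=False,
    nearby_construction=False,
)
print(obs.tree_id, obs.year, round(obs.trunk_diameter_ft, 2))
```

Checking the score range when a record is created is one simple way to catch data-entry mistakes early, which supports the goal of collecting accurate data.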

Organizing and Summarizing the Data

Once the data is collected, it must be organized into a consistent format. Below is an example of two records, one from each of two trees, showing four of the ten variables. This is a small excerpt of a much larger dataset.

Each row in the dataset represents one assessment of one tree in one year. Each column represents one type of recorded information, called a variable. Since we record 10 variables, we will have 10 columns. Since we assess 200 trees once a year from 2013 through 2023, which is 11 annual assessments, we expect 200 × 11 = 2,200 rows. Why might we end up with fewer rows than that?

Table: Excerpt from the tree health dataset.
Tree ID	Species	Year	Health Score
45	Common Hackberry (Celtis occidentalis)	2013	10
119	Montana Oak (Quercus montana)	2017	6
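To make the row-and-column structure concrete, the excerpt above can be loaded as tab-separated text. This is only a sketch using Python's standard `csv` module; storing the data as tab-separated text is an assumption, not something the study specifies.

```python
import csv
import io

# The two rows from the excerpt above, written as tab-separated text.
excerpt = (
    "Tree ID\tSpecies\tYear\tHealth Score\n"
    "45\tCommon Hackberry (Celtis occidentalis)\t2013\t10\n"
    "119\tMontana Oak (Quercus montana)\t2017\t6\n"
)

reader = csv.DictReader(io.StringIO(excerpt), delimiter="\t")
rows = list(reader)            # each row is one tree in one year
columns = reader.fieldnames    # each column is one variable

print(len(rows), "rows;", len(columns), "columns")
for row in rows:
    # Values are read as text, so numeric variables must be converted.
    print(row["Tree ID"], row["Year"], int(row["Health Score"]))
```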

Researchers could start by investigating and exploring the data by summarizing each variable. For example:

What is the average health score across all trees each year?
How many trees show signs of disease each year?
How many different tree species are represented?

They might create graphs and charts, for example showing average health score by year, or the number of diseased trees per species.
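The three summary questions above can be sketched in a few lines of Python. The records below are invented for illustration (they are not survey results), and the tuple layout is an assumption.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical records: (tree_id, species, year, health_score, diseased)
records = [
    (45, "Hackberry", 2013, 10, False),
    (119, "Oak", 2013, 8, False),
    (7, "Maple", 2013, 6, True),
    (45, "Hackberry", 2014, 9, False),
    (119, "Oak", 2014, 7, True),
    (7, "Maple", 2014, 5, True),
]

scores_by_year = defaultdict(list)
diseased_by_year = defaultdict(int)
for tree_id, species, year, score, diseased in records:
    scores_by_year[year].append(score)
    diseased_by_year[year] += int(diseased)  # True counts as 1, False as 0

# Average health score across all trees, for each year.
mean_score_by_year = {
    year: mean(scores) for year, scores in sorted(scores_by_year.items())
}

# Number of distinct species in the dataset.
species_count = len({species for _, species, _, _, _ in records})

print("Mean health score by year:", mean_score_by_year)
print("Diseased trees by year:", dict(diseased_by_year))
print("Number of species:", species_count)
```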

Analyzing the Data to Find Patterns

The researchers now look for trends and relationships. Some possible questions to investigate are:

Is there a downward trend in tree health over the decade?
Are trees near construction sites more likely to have lower health scores?
Are there areas in the park whose trees exhibit slower growth or higher rates of disease?

They will use the statistical methods we explore in this course to make sure their conclusions are mathematically rigorous.
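As a rough first look at the first two questions, one could compute the slope of a least-squares line through the yearly average scores and compare average scores between two groups of trees. The numbers below are invented for illustration, and this sketch is purely descriptive; it is not the formal analysis the researchers would rely on.

```python
from statistics import mean


def least_squares_slope(xs, ys):
    """Slope of the least-squares line through the points (x, y)."""
    x_bar, y_bar = mean(xs), mean(ys)
    numerator = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    denominator = sum((x - x_bar) ** 2 for x in xs)
    return numerator / denominator


# Hypothetical yearly average health scores (not real results).
years = [2013, 2014, 2015, 2016, 2017]
avg_scores = [8.0, 7.8, 7.9, 7.5, 7.3]
slope = least_squares_slope(years, avg_scores)
print("Change in average score per year:", round(slope, 2))

# Hypothetical health scores for trees near and away from construction.
near_construction = [5, 6, 4, 7]
away_from_construction = [8, 7, 9, 8]
gap = mean(near_construction) - mean(away_from_construction)
print("Difference in average score (near minus away):", gap)
```

A negative slope would suggest a downward trend, and a negative gap would suggest lower scores near construction, but neither number alone shows whether the pattern is larger than what chance variation could produce.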

Interpreting the Results

Based on the analysis, the research team might discover, for instance, that tree health scores have declined slightly over the decade, and that this decline is strongly associated with nearby construction. They interpret this to mean that urban development is likely affecting tree health.

Understanding the why behind the numbers is a key part of every data investigation.

Communicating the Findings

The final step is to share results in a clear and useful way. The team might prepare:

A report with graphs, maps, and summary statistics for city officials
A public presentation with visuals and plain-language findings
A public release of the data and calculations so that other scientists, students, or reporters can continue the investigation

Clear communication helps the data influence policy, for example by protecting key areas, planting new trees, increasing tree care, or limiting construction in sensitive locations.

What's Next?

Now that you've seen how statistics is used in action, we'll step back and define some core ideas. In the next section, we’ll look more closely at what data is, and how we describe the different kinds of information that can be collected.
