Statistics LibreTexts

1: Introduction to Data


    A Data Study Example – Tracking Tree Health

    Before we define what data is or what statistics means, let’s look at a real-world example of how people use data to answer important questions. To introduce the kinds of decisions and steps that are involved in statistics, let's imagine a research project that tracks the health of trees over time in a large urban park.

    Asking a Question

    Suppose city ecologists want to understand whether the health of old-growth trees in Central City Park has changed over the past 10 years. They want to find out:

    How has tree health changed in Central City Park from 2013 to 2023, and what factors might be related to that change?

    This question is important for managing urban green spaces and planning tree preservation strategies. It’s specific and measurable and can be answered by collecting and analyzing data.

    Collecting Accurate and Relevant Data

    To answer this question, the researchers need a dataset that includes information about individual trees over time. They might decide to select a sample of 200 trees within the park and assess each one once a year from 2013 to 2023.

    For each tree, they could record:

    • Tree ID: A unique number or code for each tree
    • Species: Common or scientific name (e.g., Oak, Maple)
    • Location: GPS coordinates or zone within the park
    • Year: Year of observation
    • Health Score: A rating from 0 to 10 based on visible signs (e.g., leaf color, bark condition, dead branches)
    • Comments: Notes about the health score (this could be useful for tracking specific diseases or blights)
    • Canopy Size (ft²): Estimated tree cover area
    • Trunk Diameter (ft): Calculated from the measured circumference and used to quantify tree growth
    • Presence of Disease: Yes or No
    • Nearby Construction: Yes or No
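One way to picture a single observation is as a small record with one field per variable. The sketch below is illustrative only: the class name `TreeObservation`, its field names, and the sample values are assumptions, not part of the study. It also shows the trunk diameter calculation, assuming a roughly circular trunk so that diameter = circumference / π.

```python
import math
from dataclasses import dataclass


def diameter_from_circumference(circumference_ft: float) -> float:
    """Diameter of a circle with the given circumference (d = C / pi)."""
    return circumference_ft / math.pi


@dataclass
class TreeObservation:
    # One row of the dataset: one tree assessed in one year.
    # All field names here are hypothetical.
    tree_id: int
    species: str
    location: str
    year: int
    health_score: int          # rating from 0 to 10
    comments: str
    canopy_size_sqft: float
    trunk_diameter_ft: float
    disease_present: bool
    nearby_construction: bool


# Made-up example values, for illustration only.
obs = TreeObservation(
    tree_id=1, species="Oak", location="Zone A", year=2013,
    health_score=8, comments="", canopy_size_sqft=300.0,
    trunk_diameter_ft=diameter_from_circumference(6.0),
    disease_present=False, nearby_construction=False,
)
print(round(obs.trunk_diameter_ft, 2))
```

The record has ten fields, one for each variable in the list above.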

    This data can be collected using field surveys, aerial imagery, and environmental sensors.

    Organizing and Summarizing the Data

    Once the data is collected, it must be organized into a consistent format. Below is an excerpt of two records from two different trees, showing four of the variables. This is a small picture of a much larger dataset.

    Each row in the dataset represents a measurement of one tree in one year. Each column represents one type of recording, called a variable. Since we have 10 variables, we will have 10 columns. If all 200 sampled trees are assessed once a year for 10 years, we expect 200 × 10 = 2000 rows. Why might we end up with fewer rows than that?

    Table: Excerpt from the tree dataset.

    Tree ID | Species                                | Year | Health Score
    45      | Common Hackberry (Celtis occidentalis) | 2013 | 10
    119     | Montana Oak (Quercus montana)          | 2017 | 6

    Researchers could start exploring the data by summarizing each variable. For example:

    • What is the average health score across all trees each year?
    • How many trees show signs of disease each year?
    • How many different tree species are represented?

    They might also create graphs and charts showing, for example, the average health score by year or the number of diseased trees per species.
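The summary questions above can be sketched in a few lines of code. This is a minimal illustration, assuming the records are stored as a list of dictionaries. The key names and all of the values are made up for the example and do not come from the study.

```python
from collections import defaultdict
from statistics import mean

# Made-up rows for illustration; the real dataset would have about 2000.
rows = [
    {"tree_id": 1, "species": "Oak", "year": 2013, "health_score": 9, "disease": False},
    {"tree_id": 2, "species": "Maple", "year": 2013, "health_score": 7, "disease": True},
    {"tree_id": 1, "species": "Oak", "year": 2014, "health_score": 8, "disease": False},
    {"tree_id": 2, "species": "Maple", "year": 2014, "health_score": 5, "disease": True},
]

# Group the health scores and count diseased trees by year.
scores_by_year = defaultdict(list)
diseased_by_year = defaultdict(int)
for row in rows:
    scores_by_year[row["year"]].append(row["health_score"])
    diseased_by_year[row["year"]] += int(row["disease"])  # adds 0 when healthy

# Average health score across all trees each year.
avg_score_by_year = {year: mean(scores) for year, scores in scores_by_year.items()}

# Number of different species represented.
species_count = len({row["species"] for row in rows})

print(avg_score_by_year)
print(dict(diseased_by_year))
print(species_count)
```

Each printed value answers one of the three questions: the average score per year, the number of diseased trees per year, and the number of distinct species.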

    Analyzing the Data to Find Patterns

    The researchers now look for trends and relationships. Some possible questions to investigate are:

    • Is there a downward trend in tree health over the decade?
    • Are trees near construction sites more likely to have lower health scores?
    • Are there areas in the park whose trees exhibit slower growth or higher rates of disease?

    They will use the statistical methods we explore later in this course to make sure their conclusions are rigorously supported by the data.
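As a rough first look at two of these questions, one could compute the slope of a best-fit line through the yearly averages and compare mean scores between trees near construction and trees that are not. The sketch below uses made-up numbers and a hypothetical helper, `least_squares_slope`. It is not the researchers' method, and it only describes the sample without testing whether a pattern is larger than chance variation.

```python
from statistics import mean

# Made-up yearly average health scores, for illustration only.
years = [2013, 2014, 2015, 2016, 2017]
avg_scores = [8.0, 7.8, 7.9, 7.5, 7.3]


def least_squares_slope(xs, ys):
    """Slope of the least-squares line through the points (xs[i], ys[i])."""
    x_bar, y_bar = mean(xs), mean(ys)
    numerator = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    denominator = sum((x - x_bar) ** 2 for x in xs)
    return numerator / denominator


# A negative slope suggests a downward trend in this sample.
slope = least_squares_slope(years, avg_scores)
print(f"Average change in health score per year: {slope:.2f}")

# Made-up health scores for two groups of trees.
near_construction = [6, 5, 7, 4]
no_construction = [8, 7, 9, 8]
gap = mean(no_construction) - mean(near_construction)
print(f"Difference in mean health score: {gap:.2f}")
```

A negative slope or a positive gap in a sample is only a starting point. Deciding whether such a difference is meaningful is what the formal methods later in the course are for.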

    Interpreting the Results

    Based on the analysis, the research team might discover, for instance, that tree health scores have declined slightly over the decade, and that this decline is strongly associated with nearby construction. They interpret this to mean that urban development is likely affecting tree health.

    Understanding the why behind the numbers is a key part of every data investigation.

    Communicating the Findings

    The final step is to share results in a clear and useful way. The team might prepare:

    • A report with graphs, maps, and summary statistics for city officials
    • A public presentation with visuals and plain-language findings
    • Publicly released data and calculations, so that other scientists, students, or reporters can continue the investigation

    Clear communication helps the data influence policy, such as protecting key areas, planting new trees, increasing tree care, or limiting construction in sensitive locations.

    What's Next?

    Now that you've seen how statistics is used in action, we'll step back and define some core ideas. In the next section, we’ll look more closely at what data is, and how we describe the different kinds of information that can be collected.


    This page titled 1: Introduction to Data is shared under a CC BY 4.0 license and was authored, remixed, and/or curated by Mathematics Department.
