
2.1: Examining Numerical Data


    In this section we will explore techniques for summarizing numerical variables. For example, consider the loan_amount variable from the loan50 data set, which represents the loan size for all 50 loans in the data set. This variable is numerical since we can sensibly discuss the numerical difference of the size of two loans. On the other hand, area codes and zip codes are not numerical, but rather they are categorical variables.

    Throughout this section and the next, we will apply these methods using the loan50 and county data sets, which were introduced in Section 1.2. If you’d like to review the variables from either data set, see Figures 1.3 and 1.5.

    Scatterplots for paired data

    A scatterplot provides a case-by-case view of data for two numerical variables. In Figure 1.8, a scatterplot was used to examine the homeownership rate against the fraction of housing units that were part of multi-unit properties (e.g. apartments) in the county data set. Another scatterplot is shown in Figure 2.1, comparing the total income of a borrower (total_income) and the amount they borrowed (loan_amount) for the loan50 data set. In any scatterplot, each point represents a single case. Since there are 50 cases in loan50, there are 50 points in Figure 2.1.

    A scatterplot is shown with "Total Income" along the horizontal axis (range from $0 to $325,000) and "Loan Amount" along the vertical axis (range from $0 to $40,000). The points lie in a range from $2,000 to $33,000 in loan amount when total income is smaller than $150,000 (representing most of the points). The range of loan amounts is higher when total income is greater than $175,000, with the range of observations being about $15,000 to $40,000.
    Figure 2.1: A scatterplot of total income versus loan amount for the loan50 data set.

    Looking at Figure 2.1, we see that there are many borrowers with an income below $100,000 on the left side of the graph, while there are a handful of borrowers with income above $250,000.

    Example \(\PageIndex{1}\)

    Figure 2.2 shows a plot of median household income against the poverty rate for 3,142 counties. What can be said about the relationship between these variables?

    Solution

    The relationship is evidently nonlinear, as highlighted by the dashed line. This is different from previous scatterplots we’ve seen, which show little, if any, curvature in their trends.

    Exercise \(\PageIndex{1}\)

    What do scatterplots reveal about the data, and how are they useful?

    Answer

    Answers may vary. Scatterplots are helpful in quickly spotting associations relating variables, whether those associations come in the form of simple trends or whether those relationships are more complex.

    Exercise \(\PageIndex{2}\)

    Describe two variables that would have a horseshoe-shaped association in a scatterplot (\(\cap\) or \(\frown\)).

    Answer

    Consider the case where your vertical axis represents something “good” and your horizontal axis represents something that is only good in moderation. Health and water consumption fit this description: we require some water to survive, but consume too much and it becomes toxic and can kill a person.

    Dot Plots and the Mean

    Sometimes two variables are one too many: only one variable may be of interest. In these cases, a dot plot provides the most basic of displays. A dot plot is a one-variable scatterplot; an example using the interest rate of 50 loans is shown in Figure 2.3. A stacked version of this dot plot is shown in Figure 2.4.

    A dot plot is shown for the variable "Interest Rate". There is a horizontal axis ranging from about 5% to a bit over 25%, and then several points are shown horizontally above the axis, scattered over the range. There is a higher density of points between 5% to 11%, with a moderate density of points from 12% to about 20%, and then a few more observations at about 22%, 25%, and 26%. A red triangle is also shown at approximately 12%.
    Figure 2.3: A dot plot of interest rate for the loan50 data set. The distribution’s mean is shown as a red triangle.
    A stacked dot plot is shown for the variable "Interest Rate". There is a horizontal axis ranging from about 5% to a bit over 25%, and several stacks of points are shown at values 5%, 6%, 7%, and so on. There are 3 points stacked at 5%, 3 at 6%, 5 at 7%, 4 at 8%, 5 at 9%, 8 at 10%, 5 at 11%, 3 at 12%, 1 at 13%, then 1 each at 14%, 15%, and 16%, 3 at 17%, 2 at 18%, and then 1 each at 19%, 20%, 21%, 25%, and 26%. A red triangle is also shown at approximately 12%.
    Figure 2.4: A stacked dot plot of interest rate for the loan50 data set. The distribution’s mean is shown as a red triangle.

    The mean, often called the average, is a common way to measure the center of a distribution of data. To compute the mean interest rate, we add up all the interest rates and divide by the number of observations:

    \[\bar{x}=\frac{10.90 \%+9.92 \%+26.30 \%+\cdots+6.08 \%}{50}=11.57 \% \nonumber\]

    The sample mean is often labeled \(\bar{x}\). The letter \(x\) is being used as a generic placeholder for the variable of interest, interest_rate, and the bar over the \(x\) communicates we’re looking at the average interest rate, which for these 50 loans was 11.57%. It is useful to think of the mean as the balancing point of the distribution, and it’s shown as a triangle in Figures 2.3 and 2.4.

    Mean

    The sample mean can be computed as the sum of the observed values divided by the number of observations:

    \[\begin{aligned} \bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n}\end{aligned}\]

    where \(x_1\), \(x_2\), \(\dots\), \(x_n\) represent the \(n\) observed values.
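    As a quick sketch of this formula in Python (using only the four rates quoted above, so the result will not match the full-sample mean of 11.57%):

```python
from statistics import mean

# Four of the 50 interest rates quoted in the text (percent); the
# remaining 46 values are elided, so this is an illustration only.
rates = [10.90, 9.92, 26.30, 6.08]

x_bar = sum(rates) / len(rates)        # the formula above: sum over n
assert abs(x_bar - mean(rates)) < 1e-9 # agrees with the stdlib implementation
print(round(x_bar, 2))                 # 13.3 for these four values
```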

    Exercise \(\PageIndex{3}\)

    Examine the equation for the mean. What does \(x_1\) correspond to? And \(x_2\)? Can you infer a general meaning to what \(x_i\) might represent?

    Answer

    \(x_1\) corresponds to the interest rate for the first loan in the sample (10.90%), \(x_2\) to the second loan’s interest rate (9.92%), and \(x_i\) corresponds to the interest rate for the \(i^{th}\) loan in the data set. For example, if \(i = 4\), then we’re examining \(x_4\), which refers to the fourth observation in the data set.

    Exercise \(\PageIndex{4}\)

    What was \(n\) in this sample of loans?

    Answer

    The sample size was \(n = 50\).

    The loan50 data set represents a sample from a larger population of loans made through Lending Club. We could compute a mean for this population in the same way as the sample mean. However, the population mean has a special label: \(\mu\). The symbol \(\mu\) is the Greek letter mu and represents the average of all observations in the population. Sometimes a subscript, such as \(_x\), is used to represent which variable the population mean refers to, e.g. \(\mu_x\). It is often too expensive to measure the population mean precisely, so we estimate \(\mu\) using the sample mean, \(\bar{x}\).

    Example \(\PageIndex{2}\)

    The average interest rate across all loans in the population can be estimated using the sample data. Based on the sample of 50 loans, what would be a reasonable estimate of \(\mu_x\), the mean interest rate for all loans in the full data set?

    Solution

    The sample mean, 11.57%, provides a rough estimate of \(\mu_x\). While it’s not perfect, this is our single best guess of the average interest rate of all the loans in the population under study.

    In Chapter 5 and beyond, we will develop tools to characterize the accuracy of point estimates like the sample mean. As you might have guessed, point estimates based on larger samples tend to be more accurate than those based on smaller samples.

    Example \(\PageIndex{3}\)

    The mean is useful because it allows us to rescale or standardize a metric into something more easily interpretable and comparable. Provide two examples where the mean is useful for making comparisons.

    Solution

    1. We would like to understand if a new drug is more effective at treating asthma attacks than the standard drug. A trial of 1500 adults is set up, where 500 receive the new drug, and 1000 receive a standard drug in the control group:

                             New drug   Standard drug
    Number of patients       500        1000
    Total asthma attacks     200        300

    Comparing the raw counts of 200 to 300 asthma attacks would make it appear that the new drug is better, but this is an artifact of the imbalanced group sizes. Instead, we should look at the average number of asthma attacks per patient in each group:

    • \(\text { New drug: } 200 / 500=0.4\)
    • \(\text { Standard drug: } 300 / 1000=0.3\)

    The standard drug group actually has a lower average number of asthma attacks per patient (0.3 versus 0.4) than the new drug group.

    2. Emilio opened a food truck last year where he sells burritos, and his business has stabilized over the last 3 months. Over that 3 month period, he has made $11,000 while working 625 hours. Emilio’s average hourly earnings provides a useful statistic for evaluating whether his venture is, at least from a financial perspective, worth it:

    \[\begin{aligned} \frac{\$11000}{625\text{ hours}} = \$17.60\text{ per hour} \end{aligned}\]

    By knowing his average hourly wage, Emilio now has put his earnings into a standard unit that is easier to compare with many other jobs that he might consider.
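    Both computations above are plain ratios; a short Python check (numbers taken directly from the two examples):

```python
# Average asthma attacks per patient in each group (from the table above)
new_drug_rate = 200 / 500        # 0.4 attacks per patient
standard_rate = 300 / 1000       # 0.3 attacks per patient

# Emilio's average hourly earnings over the 3-month period
hourly_wage = 11000 / 625        # $17.60 per hour

# Raw counts (200 vs 300) mislead; the per-patient rates do not.
assert new_drug_rate > standard_rate
```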

    Example \(\PageIndex{4}\)

    Suppose we want to compute the average income per person in the US. To do so, we might first think to take the mean of the per capita incomes across the 3,142 counties in the data set. What would be a better approach?

    Solution

    The county data set is special in that each county actually represents many individual people. If we were to simply average across the income variable, we would be treating counties with 5,000 and 5,000,000 residents equally in the calculations. Instead, we should compute the total income for each county, add up all the counties’ totals, and then divide by the number of people in all the counties. If we completed these steps with the county data, we would find that the per capita income for the US is $30,861. Had we computed the simple mean of per capita income across counties, the result would have been just $26,093!

    This example used what is called a weighted mean. For more information on this topic, see the online supplement on weighted means.
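    The county calculation can be sketched as a weighted mean. The two counties below are made up for illustration; the real data set has 3,142 rows:

```python
# A sketch of a population-weighted mean with two hypothetical counties.
per_capita_income = [26_000, 31_000]     # dollars (made-up values)
population        = [5_000, 5_000_000]   # residents (made-up values)

# Simple mean treats a tiny county and a huge county equally
simple_mean = sum(per_capita_income) / len(per_capita_income)

# Weighted mean: total income across all residents / total residents
total_income  = sum(inc * pop for inc, pop in zip(per_capita_income, population))
weighted_mean = total_income / sum(population)

# The tiny county barely moves the weighted mean, unlike the simple mean.
assert weighted_mean > simple_mean
```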

    Histograms and shape

    Dot plots show the exact value for each observation. This is useful for small data sets, but they can become hard to read with larger samples. Rather than showing the value of each observation, we prefer to think of the value as belonging to a bin. For example, in the loan50 data set, we created a table of counts for the number of loans with interest rates between 5.0% and 7.5%, then the number of loans with rates between 7.5% and 10.0%, and so on. Observations that fall on the boundary of a bin (e.g. 10.00%) are allocated to the lower bin. This tabulation is shown in Figure 2.5. These binned counts are plotted as bars in Figure 2.6 into what is called a histogram, which resembles a more heavily binned version of the stacked dot plot shown in Figure 2.4.

    Figure 2.5: Counts for the binned interest_rate data.
    Interest Rate   5.0% - 7.5%   7.5% - 10.0%   10.0% - 12.5%   12.5% - 15.0%   \(\cdots\)   25.0% - 27.5%
    Count           11            15             8               4               \(\cdots\)   1

    Histograms provide a view of the data density. Higher bars represent where the data are relatively more common. For instance, there are many more loans with rates between 5% and 10% than loans with rates between 20% and 25% in the data set. The bars make it easy to see how the density of the data changes relative to the interest rate.
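    The bin-allocation rule described above (boundary values go to the lower bin) can be sketched as follows; the `bin_lower_edge` helper and the six rates are hypothetical:

```python
from collections import Counter

def bin_lower_edge(rate, start=5.0, width=2.5):
    """Lower edge of rate's bin; values exactly on a boundary go to the lower bin."""
    k = (rate - start) / width
    idx = int(k)
    if k == idx and idx > 0:   # exactly on a boundary, e.g. 10.0 -> the 7.5-10.0 bin
        idx -= 1
    return start + idx * width

rates = [5.31, 7.50, 9.93, 10.00, 12.61, 26.30]   # hypothetical values
counts = Counter(bin_lower_edge(r) for r in rates)
print(counts[5.0], counts[7.5])   # 2 2
```

Note that 7.50 and 10.00 both land in the bin below them, matching the boundary rule in the text.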

    A histogram with a horizontal axis of "Interest Rate" and a vertical axis showing the frequency of occurrence of different bins of interest rate. The first bin is from 5%-7.5% with a frequency (count) of 11 observations, 7.5%-10% has a frequency of 15, 10%-12.5% has 8, 12.5%-15% has 4, 15%-17.5% has 5, 17.5%-20% has 4, and then the 20%-22.5%, 22.5%-25%, and 25%-27.5% bins each have a frequency of 1.
    Figure 2.6: A histogram of interest_rate. This distribution is strongly skewed to the right.

    Histograms are especially convenient for understanding the shape of the data distribution. Figure 2.6 suggests that most loans have rates under 15%, while only a handful of loans have rates above 20%. When data trail off to the right in this way and have a longer right tail, the shape is said to be right skewed. (Other ways to describe data that are right skewed: skewed to the right, skewed to the high end, or skewed to the positive end.)

    Data sets with the reverse characteristic – a long, thinner tail to the left – are said to be left skewed. We also say that such a distribution has a long left tail. Data sets that show roughly equal trailing off in both directions are called symmetric.

    Long tails to identify skew When data trail off in one direction, the distribution has a long tail. If a distribution has a long left tail, it is left skewed. If a distribution has a long right tail, it is right skewed.

    Exercise \(\PageIndex{5}\)

    Take a look at the dot plots in Figures 2.3 and 2.4. Can you see the skew in the data? Is it easier to see the skew in the histogram in Figure 2.6 or in the dot plots?

    Answer

    Answers may vary. The skew is visible in all three plots, though it is often easiest to see in the histogram, where the long right tail stands out clearly from the bulk of the data.

    Exercise \(\PageIndex{6}\)

    Besides the mean (since it was labeled), what can you see in the dot plots that you cannot see in the histogram?

    Answer

    Answers may vary. The dot plots show the exact value of each observation, while the histogram only reports how many observations fall in each bin.

    In addition to looking at whether a distribution is skewed or symmetric, histograms can be used to identify modes. A mode is represented by a prominent peak in the distribution. There is only one prominent peak in the histogram of loan_amount.

    A definition of mode sometimes taught in math classes is the value with the most occurrences in the data set. However, for many real-world data sets, it is common to have no observations with the same value in a data set, making this definition impractical in data analysis.

    Figure 2.7 shows histograms that have one, two, or three prominent peaks. Such distributions are called unimodal, bimodal, and multimodal, respectively. Any distribution with more than two prominent peaks is called multimodal. Notice that there was one prominent peak in the unimodal distribution with a second less prominent peak that was not counted since it only differs from its neighboring bins by a few observations.

    Three histograms are shown. The first histogram shows bins of width 2 between 0 to 18 (this is along the horizontal axis), and the frequencies are 3, 16, 16, 7, 11, 6, 4, 1, and 1. The second histogram, representing a different data set, shows bins of width 2 with values ranging from 0 to 20, where the bin counts in order are 2, 9, 5, 2, 2, 2, 2, 10, 19, and 9. The third histogram, representing yet another data set, shows bins of width 2 with values ranging from 0 to 22, where the bin counts in order are 10, 8, 4, 3, 1, 20, 15, 3, 15, 18, and 5.
    Figure 2.7: Counting only prominent peaks, the distributions are (left to right) unimodal, bimodal, and multimodal. Note that we’ve said the left plot is unimodal intentionally. This is because we are counting prominent peaks, not just any peak.
    Example \(\PageIndex{5}\)

    Figure 2.6 reveals only one prominent mode in the interest rate. Is the distribution unimodal, bimodal, or multimodal?

    Solution

    Unimodal. Remember that uni stands for 1 (think unicycles). Similarly, bi stands for 2 (think bicycles). We’re hoping a multicycle will be invented to complete this analogy.

    Exercise \(\PageIndex{7}\)

    Height measurements of young students and adult teachers at a K-3 elementary school were taken. How many modes would you expect in this height data set?

    Answer

    There might be two height groups visible in the data set: one of the students and one of the adults. That is, the data are probably bimodal.

    Looking for modes isn’t about finding a clear and correct answer about the number of modes in a distribution, which is why prominent is not rigorously defined in this book. The most important part of this examination is to better understand your data.

    Variance and standard deviation

    The mean was introduced as a method to describe the center of a data set, but the variability in the data is also important. Here, we introduce two measures of variability: the variance and the standard deviation. Both of these are very useful in data analysis, even though their formulas are a bit tedious to calculate by hand. The standard deviation is the easier of the two to comprehend, and it roughly describes how far away the typical observation is from the mean.

    We call the distance of an observation from its mean its deviation. Below are the deviations for the \(1^{st}\), \(2^{nd}\), \(3^{rd}\), and \(50^{th}\) observations in the interest_rate variable:

    \[\begin{aligned} x_1-\bar{x} &= 10.90 - 11.57 = -0.67 \\ x_2-\bar{x} &= 9.92 - 11.57 = -1.65 \\ x_3-\bar{x} &= 26.30 - 11.57 = 14.73 \\ &\ \vdots \\ x_{50}-\bar{x} &= 6.08 - 11.57 = -5.49 \end{aligned}\]

    If we square these deviations and then take an average, the result is equal to the sample variance, denoted by \(s^2\):

    \[\begin{aligned} s^2 &= \frac{(-0.67)^2 + (-1.65)^2 + (14.73)^2 + \cdots + (-5.49)^2}{50-1} \\ &= \frac{0.45 + 2.72 + 216.97 + \cdots + 30.14}{49} \\ &= 25.52 \end{aligned}\]

    We divide by \(n - 1\), rather than dividing by \(n\), when computing a sample’s variance; there’s some mathematical nuance here, but the end result is that doing this makes this statistic slightly more reliable and useful.

    Notice that squaring the deviations does two things. First, it makes large values relatively much larger, seen by comparing \((-0.67)^2\), \((-1.65)^2\), \((14.73)^2\), and \((-5.49)^2\). Second, it gets rid of any negative signs.

    The standard deviation is defined as the square root of the variance:

    \[\begin{aligned} s = \sqrt{25.52} = 5.05 \end{aligned}\]

    While often omitted, a subscript of \(_x\) may be added to the variance and standard deviation, i.e. \(s_x^2\) and \(s_x\), if it is useful as a reminder that these are the variance and standard deviation of the observations represented by \(x_1\), \(x_2\), ..., \(x_n\).
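    The variance and standard deviation calculations can be sketched with the four rates quoted earlier; this illustrates the \(n-1\) formula rather than reproducing the full 50-observation result:

```python
from statistics import stdev, variance

x = [10.90, 9.92, 26.30, 6.08]   # four of the 50 rates (illustrative subset)
x_bar = sum(x) / len(x)

# Sample variance: squared deviations averaged with n - 1 in the denominator
s2 = sum((xi - x_bar) ** 2 for xi in x) / (len(x) - 1)
s = s2 ** 0.5                    # standard deviation is the square root

# Matches the stdlib's sample variance and standard deviation
assert abs(s2 - variance(x)) < 1e-9 and abs(s - stdev(x)) < 1e-9
```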

    Variance and standard deviation The variance is the average squared distance from the mean. The standard deviation is the square root of the variance. The standard deviation is useful when considering how far the data are distributed from the mean.

    The standard deviation represents the typical deviation of observations from the mean. Usually about 70% of the data will be within one standard deviation of the mean and about 95% will be within two standard deviations. However, as seen in the two figures below, these percentages are not strict rules.
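    A small helper makes the guideline easy to check on any sample; the `frac_within` name and the five data values are illustrative, not from the text:

```python
from statistics import mean, stdev

def frac_within(data, k):
    """Fraction of observations within k sample standard deviations of the mean."""
    m, s = mean(data), stdev(data)
    return sum(abs(x - m) <= k * s for x in data) / len(data)

data = [1, 2, 3, 4, 5]        # arbitrary illustrative sample
print(frac_within(data, 1))   # 0.6
print(frac_within(data, 2))   # 1.0
```

Small samples like this one show why the 70%/95% figures are only rough guidelines.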

    Like the mean, the population values for variance and standard deviation have special symbols: \(\sigma_{}^2\) for the variance and \(\sigma\) for the standard deviation. The symbol \(\sigma\) is the Greek letter sigma.

    [A dot plot of 50 observations is shown with values ranging from about 5% to 26%. The data set is the same as that shown in the dot plot in Figure 2.3, where the data is more dense from 5% to about 11%, has medium density from about 12% to 20%, and then there are a few more values scattered in the 20% to 27% range. Shading is shown to represent the regions within 1, 2, and 3 standard deviations. The region within 1 standard deviation is from 6.5% to 16.7%, representing 34 of the 50 data points. The region within 2 standard deviations runs left off of the chart (but would be from about 1.4%) to 21.8% and contains 48 of the 50 data points. The third standard deviation is shown to extend out to 26.9%, and all 50 observations are contained within 3 standard deviations.]

    [Three histograms are shown (upper, middle, lower). Each distribution also shows shading – dark gray between -1 and 1, lighter gray between -2 and 2, light gray between -3 and 3, and then very light gray further out. The upper plot shows only two bins with non-zero values, of equal height, at -1 and 1. The middle plot shows a bell-shaped curve, where most of the higher bin values are between -1 and 1, middling heights are between -2 to -1 and 1 to 2, and the data trail off in each direction with ever-smaller values further out. The lower histogram shows no data below about -1.6, a quick increase to a peak at about -0.7, and then a slow decline to about half the maximum height at 1, further trailing off to ever-smaller values out to 3 and beyond.]

    Earlier, the concept of the shape of a distribution was introduced. A good description of the shape of a distribution should include modality and whether the distribution is symmetric or skewed to one side. Using the three histograms above as an example, explain why such a description is important.

    Describe the distribution of the interest_rate variable using the histogram in Figure 2.6. The description should incorporate the center, variability, and shape of the distribution, and it should also be placed in context. Also note any especially unusual cases.

    The distribution of interest rates is unimodal and skewed to the high end. Many of the rates fall near the mean at 11.57%, and most fall within one standard deviation (5.05%) of the mean. There are a few exceptionally large interest rates in the sample that are above 20%.

    In practice, the variance and standard deviation are sometimes used as a means to an end, where the “end” is being able to accurately estimate the uncertainty associated with a sample statistic. For example, in Chapter 5 the standard deviation is used in calculations that help us understand how much a sample mean varies from one sample to the next.

    Box plots, quartiles, and the median

    A box plot summarizes a data set using five statistics while also plotting unusual observations. The figure below provides a vertical dot plot alongside a box plot of the interest_rate variable from the loan50 data set.

    [A dot plot is shown adjacent to what is called a "box plot". The data values are the same ones used in past dot plots, where the data show greatest density from 5% to 11%, moderate density from 12% to 20%, and then a few more values at about 22%, 25%, and 26%. The box plot adjacent to the data shows a box that encapsulates the middle 50% of the data, from about 8% to 13%. The median is also annotated with a line through the center of the box. From here, the data extend out with "whiskers" up to a distance of \(1.5 \times IQR\) below and above the box to capture as much data as possible. There are two observations that extend beyond this range, at 25% and 26%.]


    The first step in building a box plot is drawing a dark line denoting the median, which splits the data in half. The figure above shows 50% of the data falling below the median and the other 50% falling above the median. There are 50 loans in the data set (an even number), so the data are perfectly split into two groups of 25. We take the median in this case to be the average of the two observations closest to the \(50^{th}\) percentile, which happen to be the same value in this data set: \((9.93\% + 9.93\%) / 2 = 9.93\%\). When there is an odd number of observations, there will be exactly one observation that splits the data into two halves, and in such a case that observation is the median (no average needed).

    Median: the number in the middle If the data are ordered from smallest to largest, the median is the observation right in the middle. If there is an even number of observations, there will be two values in the middle, and the median is taken as their average.

    The second step in building a box plot is drawing a rectangle to represent the middle 50% of the data. The total length of the box, shown vertically in the figure above, is called the interquartile range (IQR, for short). It, like the standard deviation, is a measure of variability in data. The more variable the data, the larger the standard deviation and IQR tend to be. The two boundaries of the box are called the first quartile (the \(25^{th}\) percentile, i.e. 25% of the data fall below this value) and the third quartile (the \(75^{th}\) percentile), and these are often labeled \(Q_1\) and \(Q_3\), respectively.

    Interquartile range (IQR) The IQR is the length of the box in a box plot. It is computed as

    \[\begin{aligned} IQR = Q_3 - Q_1 \end{aligned}\]

    where \(Q_1\) and \(Q_3\) are the \(25^{th}\) and \(75^{th}\) percentiles.
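    Python's standard library can sketch these summaries. Note that `quantiles(..., method="inclusive")` is one of several common quartile conventions, so results may differ slightly from hand rules; the eight values below are hypothetical:

```python
from statistics import median, quantiles

x = sorted([5.31, 7.50, 9.92, 9.93, 9.93, 12.61, 15.00, 26.30])  # hypothetical

med = median(x)      # even n: average of the two middle values
q1, q2, q3 = quantiles(x, n=4, method="inclusive")
iqr = q3 - q1        # length of the box in a box plot

assert med == q2     # the second quartile is the median
print(med)           # 9.93
```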

    What percent of the data fall between \(Q_1\) and the median? What percent is between the median and \(Q_3\)?

    Extending out from the box, the whiskers attempt to capture the data outside of the box. However, their reach is never allowed to be more than \(1.5\times IQR\). They capture everything within this reach. In the figure above, the upper whisker does not extend to the last two points, which are beyond \(Q_3 + 1.5\times IQR\), and so it extends only to the last point below this limit. The lower whisker stops at the lowest value, 5.31%, since there is no additional data to reach; the lower whisker’s limit is not shown in the figure because the plot does not extend down to \(Q_1 - 1.5\times IQR\). In a sense, the box is like the body of the box plot and the whiskers are like its arms trying to reach the rest of the data.

    Any observation lying beyond the whiskers is labeled with a dot. The purpose of labeling these points – instead of extending the whiskers to the minimum and maximum observed values – is to help identify any observations that appear to be unusually distant from the rest of the data. Unusually distant observations are called outliers. In this case, it would be reasonable to classify the interest rates of 24.85% and 26.30% as outliers since they are numerically distant from most of the data.

    Outliers are extreme An outlier is an observation that appears extreme relative to the rest of the data.

    Examining data for outliers serves many useful purposes, including

    1. Identifying strong skew in the distribution.
    2. Identifying possible data collection or data entry errors.
    3. Providing insight into interesting properties of the data.
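    The whisker-and-outlier rule can be sketched in a few lines. The quartile values passed in below (8% and 13%) are rough readings from the box plot, not exact statistics:

```python
def outliers(data, q1, q3):
    """Values beyond the whiskers' maximum reach of 1.5 * IQR from the box."""
    reach = 1.5 * (q3 - q1)
    return [x for x in data if x < q1 - reach or x > q3 + reach]

# Approximate quartiles read off the box plot above (hypothetical precision)
flagged = outliers([5.31, 9.93, 24.85, 26.30], q1=8.0, q3=13.0)
print(flagged)   # [24.85, 26.3]
```

With these rough quartiles, the two largest rates are flagged, matching the two dots beyond the upper whisker in the figure.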

    Using the figure above, estimate the following values for the interest_rate variable in the loan50 data set:
    (a) \(Q_1\), (b) \(Q_3\), and (c) IQR.

    Robust statistics

    How are the sample statistics of the interest_rate data set affected by the observation at 26.3%? What would have happened if this loan had instead been only 15%? What would happen to these summary statistics if the observation at 26.3% had been even larger, say 35%? These scenarios are plotted alongside the original data in the figure below, and sample statistics are computed under each scenario in the table that follows.

    [Three dot plots are shown in the same plot. The largest observation from the original data set (discussed in previous dot plots) at about 26% is moved to 15% in the second dot plot and instead to 35% in the third dot plot.]

    [loan_int_rate_robust_ex]

    scenario                    median   IQR      \(\bar{x}\)   \(s\)
    original data               9.93%    5.76%    11.57%        5.05%
    move 26.3% \(\to\) 15%      9.93%    5.76%    11.34%        4.61%
    move 26.3% \(\to\) 35%      9.93%    5.76%    11.74%        5.68%

    [robustOrNotTable]

    [interestRateWhichIsMoreRobust] (a) Which is more affected by extreme observations, the mean or median? Figure [robustOrNotTable] may be helpful. (b) Is the standard deviation or IQR more affected by extreme observations?

    The median and IQR are called robust statistics because extreme observations have little effect on their values: moving the most extreme value generally has little influence on these statistics. On the other hand, the mean and standard deviation are more heavily influenced by changes in extreme observations, which can be important in some situations.

    The median and IQR did not change under the three scenarios in Figure [robustOrNotTable]. Why might this be the case? The median and IQR are only sensitive to numbers near \(Q_1\), the median, and \(Q_3\). Since values in these regions are stable in the three data sets, the median and IQR estimates are also stable.
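    The robustness comparison can be reproduced on a small hypothetical data set (not the loans data from the text): move only the largest observation, as in the three scenarios above, and watch which statistics change.

    ```python
    import statistics

    def summarize(values):
        """Median, IQR, mean, and standard deviation of a sample."""
        q1, med, q3 = statistics.quantiles(values, n=4)
        return {"median": med, "IQR": q3 - q1,
                "mean": statistics.mean(values),
                "sd": statistics.stdev(values)}

    # hypothetical data; only the largest observation is moved
    base     = [5.3, 7.1, 8.2, 9.9, 10.4, 11.8, 13.0, 26.3]
    moved_15 = base[:-1] + [15.0]
    moved_35 = base[:-1] + [35.0]

    for label, data in [("original", base),
                        ("move to 15", moved_15),
                        ("move to 35", moved_35)]:
        s = summarize(data)
        print(label, round(s["median"], 2), round(s["IQR"], 2),
              round(s["mean"], 2), round(s["sd"], 2))
    ```

    Because the moved value stays the maximum in every scenario, the observations near \(Q_1\), the median, and \(Q_3\) never change, so the median and IQR are identical across all three rows while the mean and standard deviation shift.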

    The distribution of loan amounts in the data set is right skewed, with a few large loans lingering out into the right tail. If you wanted to understand the typical loan size, should you be more interested in the mean or median?

    Transforming data (special topic)

    When data are very strongly skewed, we sometimes transform them so they are easier to model.

    [county_pop_transformed]

    Consider the histogram of county populations shown in Figure [county_pop_transformed_i], which shows extreme skew. What isn’t useful about this plot? Nearly all of the data fall into the left-most bin, and the extreme skew obscures many of the potentially interesting details in the data.

    There are some standard transformations that may be useful for strongly right skewed data where much of the data is positive but clustered near zero. A transformation is a rescaling of the data using a function. For instance, a plot of the logarithm (base 10) of county populations results in the new histogram in Figure [county_pop_transformed_log]. The transformed data are symmetric, and any potential outliers appear much less extreme than in the original data set. By reining in the outliers and extreme skew, transformations like this often make it easier to build statistical models for the data.
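    The effect of a log\(_{10}\) transformation can be sketched with hypothetical population figures (a few huge values among many small ones, mimicking the county data's right skew). As a crude skew indicator, we can look at how far the mean sits above the median: in right skewed data the mean is dragged well above the median, and the log transform largely removes this gap.

    ```python
    import math
    import statistics

    # hypothetical county populations: mostly small, a few huge (right skew)
    pops = [1_200, 3_400, 5_100, 8_000, 12_000, 25_000,
            60_000, 150_000, 900_000, 4_000_000]

    log_pops = [math.log10(p) for p in pops]

    # crude skew indicator: distance of the mean above the median
    skew_raw = statistics.mean(pops) - statistics.median(pops)
    skew_log = statistics.mean(log_pops) - statistics.median(log_pops)
    print(skew_raw, skew_log)
    ```

    On the raw scale the gap is enormous; on the log scale it nearly vanishes, reflecting the much more symmetric shape of the transformed histogram.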

    Transformations can also be applied to one or both variables in a scatterplot. A scatterplot of the population change from 2010 to 2017 against the population in 2010 is shown in Figure [county_pop_change_v_pop_transform_i]. In this first scatterplot, it’s hard to decipher any interesting patterns because the population variable is so strongly skewed. However, if we apply a log\(_{10}\) transformation to the population variable, as shown in Figure [county_pop_change_v_pop_transform_log], a positive association between the variables is revealed. In fact, we may be interested in fitting a trend line to the data when we explore methods around fitting regression lines in Chapter [linRegrForTwoVar].

    [county_pop_change_v_pop_transform_main]

    Transformations other than the logarithm can be useful, too. For instance, the square root (\(\sqrt{\text{original observation}}\)) and inverse (\(\frac{1}{\text{original observation}}\)) are commonly used by data scientists. Common goals in transforming data are to see the data structure differently, reduce skew, assist in modeling, or straighten a nonlinear relationship in a scatterplot.
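    The alternatives mentioned above are easy to apply directly, as this small sketch on made-up positive observations shows. One design point worth noting: the inverse transformation reverses the order of positive data (the largest observation becomes the smallest), which matters when interpreting a transformed plot.

    ```python
    import math

    obs = [0.5, 2.0, 8.0, 32.0]  # hypothetical right skewed observations

    sqrt_t = [math.sqrt(x) for x in obs]  # square root transformation
    inv_t = [1 / x for x in obs]          # inverse; undefined for zeros

    print(sqrt_t)
    print(inv_t)  # note: order is reversed relative to obs
    ```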

    Mapping data (special topic)

    The county data set offers many numerical variables that we could plot using dot plots, scatterplots, or box plots, but these miss the true nature of the data. Rather, when we encounter geographic data, we should create an intensity map, where colors are used to show higher and lower values of a variable. Figures [countyIntensityMaps1] and [countyIntensityMaps2] show intensity maps for poverty rate in percent, unemployment rate, homeownership rate in percent, and median household income. The color key indicates which colors correspond to which values. The intensity maps are not generally very helpful for getting precise values in any given county, but they are very helpful for seeing geographic trends and generating interesting research questions or hypotheses.
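    The color key of an intensity map is essentially a binning of the variable into intensity levels. A minimal sketch of that idea, using hypothetical poverty rates and quantile cut points (real mapping software would also handle the geography and color rendering):

    ```python
    import statistics

    def intensity_bins(values, n_bins=4):
        """Assign each value a color-intensity level: 0 = lightest,
        n_bins - 1 = darkest, with cut points at sample quantiles."""
        cuts = statistics.quantiles(values, n=n_bins)
        return [sum(v > c for c in cuts) for v in values]

    # hypothetical county poverty rates (percent)
    rates = [4.1, 7.8, 9.5, 11.2, 14.0, 16.3, 19.9, 25.4]
    print(intensity_bins(rates))
    ```

    Counties in the top bin would be shaded darkest on the map; a linear (equal-width) binning is a common alternative to the quantile cut points assumed here.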

    What interesting features are evident in the poverty rate and unemployment rate intensity maps?[map_example_poverty_and_unemployment] Poverty rates are evidently higher in a few locations. Notably, the deep south shows higher poverty rates, as does much of Arizona and New Mexico. High poverty rates are evident in the Mississippi flood plains a little north of New Orleans and also in a large section of Kentucky.

    The unemployment rate follows similar trends, and we can see correspondence between the two variables. In fact, it makes sense for higher rates of unemployment to be closely related to poverty rates. One observation stands out when comparing the two maps: the poverty rate is much higher than the unemployment rate, meaning that while many people may be working, they are not making enough to break out of poverty.

    What interesting features are evident in the median household income intensity map in Figure [countyMedIncomeMap]?


    This page titled 2.1: Examining Numerical Data is shared under a CC BY-SA 3.0 license and was authored, remixed, and/or curated by David Diez, Christopher Barr, & Mine Çetinkaya-Rundel.
