Statistics LibreTexts

2.2: Display Data


    Once you have a set of data, you will need to organize it so that you can analyze how frequently each datum occurs in the set. However, when calculating the frequency, you may need to round your answers so that they are as precise as possible.

    Frequency

    Twenty students were asked how many hours they worked per day. Their responses, in hours, are as follows: 5; 6; 3; 3; 2; 4; 7; 5; 2; 3; 5; 6; 5; 4; 4; 3; 5; 2; 5; 3.

    Table \(\PageIndex{1}\) lists the different data values in ascending order and their frequencies.

    Data value Frequency
    2 3
    3 5
    4 3
    5 6
    6 2
    7 1

    Table \(\PageIndex{1}\) Frequency Table of Student Work Hours

    A frequency is the number of times a value of the data occurs. According to Table \(\PageIndex{1}\), there are three students who work 2 hours, five students who work 3 hours, and so on. The sum of the values in the frequency column, 20, represents the total number of students included in the sample.

    A relative frequency is the ratio (fraction or proportion) of the number of times a value of the data occurs in the set of all outcomes to the total number of outcomes. To find the relative frequencies, divide each frequency by the total number of students in the sample–in this case, 20. Relative frequencies can be written as fractions, percents, or decimals.

    Data value Frequency Relative frequency
    2 3 \(\frac{3}{20}\) or 0.15
    3 5 \(\frac{5}{20}\) or 0.25
    4 3 \(\frac{3}{20}\) or 0.15
    5 6 \(\frac{6}{20}\) or 0.30
    6 2 \(\frac{2}{20}\) or 0.10
    7 1 \(\frac{1}{20}\) or 0.05

    Table \(\PageIndex{2}\) Frequency Table of Student Work Hours with Relative Frequencies

    The sum of the values in the relative frequency column of Table \(\PageIndex{2}\) is \(\frac{20}{20}\), or 1.
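
The counting and dividing described above can be sketched in a few lines of Python. This is only an illustrative sketch: the variable names are made up for this example, and the data are the twenty responses listed earlier.

```python
from collections import Counter
from fractions import Fraction

# Hours worked per day reported by the 20 students in the example.
hours = [5, 6, 3, 3, 2, 4, 7, 5, 2, 3, 5, 6, 5, 4, 4, 3, 5, 2, 5, 3]

# Frequency: the number of times each data value occurs, in ascending order.
frequency = dict(sorted(Counter(hours).items()))

# Relative frequency: each frequency divided by the total number of data values.
n = len(hours)
relative_frequency = {value: Fraction(count, n) for value, count in frequency.items()}

for value, count in frequency.items():
    print(value, count, float(relative_frequency[value]))
```

Using exact fractions keeps the relative frequencies summing to exactly one; converting to decimals is done only for display.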

    Cumulative relative frequency is the accumulation of the previous relative frequencies. To find the cumulative relative frequencies, add all the previous relative frequencies to the relative frequency for the current row, as shown in Table \(\PageIndex{3}\).

    Data value Frequency Relative frequency Cumulative relative frequency
    2 3 \(\frac{3}{20}\) or 0.15 0.15
    3 5 \(\frac{5}{20}\) or 0.25 0.15 + 0.25 = 0.40
    4 3 \(\frac{3}{20}\) or 0.15 0.40 + 0.15 = 0.55
    5 6 \(\frac{6}{20}\) or 0.30 0.55 + 0.30 = 0.85
    6 2 \(\frac{2}{20}\) or 0.10 0.85 + 0.10 = 0.95
    7 1 \(\frac{1}{20}\) or 0.05 0.95 + 0.05 = 1.00

    Table \(\PageIndex{3}\) Frequency Table of Student Work Hours with Relative and Cumulative Relative Frequencies

    The last entry of the cumulative relative frequency column is one, indicating that one hundred percent of the data has been accumulated.
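
A running total is one way to build the cumulative relative frequency column. The sketch below assumes the frequencies from the student work hours table and uses Python's `itertools.accumulate`; the variable names are illustrative.

```python
from fractions import Fraction
from itertools import accumulate

# Frequencies from the student work hours table (data value -> frequency).
frequencies = {2: 3, 3: 5, 4: 3, 5: 6, 6: 2, 7: 1}
n = sum(frequencies.values())

# Relative frequency for each row, kept as exact fractions.
relative = [Fraction(f, n) for f in frequencies.values()]

# Cumulative relative frequency: running sum of the relative frequencies.
cumulative = list(accumulate(relative))

for value, rf, crf in zip(frequencies, relative, cumulative):
    print(f"{value}  {float(rf):.2f}  {float(crf):.2f}")
```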

    NOTE

    Because of rounding, the relative frequency column may not always sum to one, and the last entry in the cumulative relative frequency column may not be one. However, they each should be close to one.

    Table \(\PageIndex{4}\) represents the heights, in inches, of a sample of 100 male semiprofessional soccer players.

    Heights (inches) Frequency Relative frequency Cumulative relative frequency
    59.95–61.94 5 \(\frac{5}{100}\) = 0.05 0.05
    61.95–63.94 3 \(\frac{3}{100}\) = 0.03 0.05 + 0.03 = 0.08
    63.95–65.94 15 \(\frac{15}{100}\) = 0.15 0.08 + 0.15 = 0.23
    65.95–67.94 40 \(\frac{40}{100}\) = 0.40 0.23 + 0.40 = 0.63
    67.95–69.94 17 \(\frac{17}{100}\) = 0.17 0.63 + 0.17 = 0.80
    69.95–71.94 12 \(\frac{12}{100}\) = 0.12 0.80 + 0.12 = 0.92
    71.95–73.94 7 \(\frac{7}{100}\) = 0.07 0.92 + 0.07 = 0.99
    73.95–75.94 1 \(\frac{1}{100}\) = 0.01 0.99 + 0.01 = 1.00
      Total = 100 Total = 1.00  

    Table \(\PageIndex{4}\) Frequency Table of Soccer Player Height

    The data in this table have been grouped into the following intervals:

    • 59.95 to 61.94 inches
    • 61.95 to 63.94 inches
    • 63.95 to 65.94 inches
    • 65.95 to 67.94 inches
    • 67.95 to 69.94 inches
    • 69.95 to 71.94 inches
    • 71.95 to 73.94 inches
    • 73.95 to 75.94 inches

    In this sample, there are five players whose heights fall within the interval 59.95–61.94 inches, three players whose heights fall within the interval 61.95–63.94 inches, 15 players whose heights fall within the interval 63.95–65.94 inches, 40 players whose heights fall within the interval 65.95–67.94 inches, 17 players whose heights fall within the interval 67.95–69.94 inches, 12 players whose heights fall within the interval 69.95–71.94 inches, seven players whose heights fall within the interval 71.95–73.94 inches, and one player whose height falls within the interval 73.95–75.94 inches. All heights fall between the endpoints of an interval and not at the endpoints.

    Exercise \(\PageIndex{1}\)

    From Table \(\PageIndex{4}\), find the percentage of heights that are less than 65.95 inches.

    Example \(\PageIndex{1}\)

    From Table \(\PageIndex{4}\), find the percentage of heights that fall between 61.95 and 65.95 inches.

    Answer

    Add the relative frequencies in the second and third rows: \(0.03 + 0.15 = 0.18\) or 18%.

    Example \(\PageIndex{2}\)

    Use the heights of the 100 male semiprofessional soccer players in Table \(\PageIndex{4}\). Fill in the blanks and check your answers.

    1. The percentage of heights that are from 67.95 to 71.95 inches is: ____.
    2. The percentage of heights that are from 67.95 to 73.95 inches is: ____.
    3. The percentage of heights that are more than 65.95 inches is: ____.
    4. The number of players in the sample who are between 61.95 and 71.95 inches tall is: ____.
    5. What kind of data are the heights?
    6. Describe how you could gather this data (the heights) so that the data are characteristic of all male semiprofessional soccer players.

    Remember, you count frequencies. To find the relative frequency, divide the frequency by the total number of data values. To find the cumulative relative frequency, add all of the previous relative frequencies to the relative frequency for the current row.

    Answer
    1. 29%
    2. 36%
    3. 77%
    4. 87
    5. quantitative continuous
    6. get rosters from each team and choose a simple random sample from each
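
The first four answers can be checked by adding class frequencies from the height table. The following is a minimal sketch, assuming each class is identified by its lower boundary; the helper names `count_between` and `percent_between` are hypothetical.

```python
# Lower class boundaries and frequencies from the soccer-player height table.
lower_bounds = [59.95, 61.95, 63.95, 65.95, 67.95, 69.95, 71.95, 73.95]
frequencies = [5, 3, 15, 40, 17, 12, 7, 1]
total = sum(frequencies)

def count_between(low, high):
    """Number of players in classes whose lower boundary lies in [low, high)."""
    return sum(f for lb, f in zip(lower_bounds, frequencies) if low <= lb < high)

def percent_between(low, high):
    """Percentage of the sample in classes whose lower boundary lies in [low, high)."""
    return 100 * count_between(low, high) / total

print(percent_between(67.95, 71.95))         # heights from 67.95 to 71.95 inches
print(percent_between(67.95, 73.95))         # heights from 67.95 to 73.95 inches
print(percent_between(65.95, float("inf")))  # heights of more than 65.95 inches
print(count_between(61.95, 71.95))           # players between 61.95 and 71.95 inches
```
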
    Exercise \(\PageIndex{2}\)

    Table \(\PageIndex{5}\) shows the amount, in inches, of annual rainfall in a sample of towns.

    Rainfall (inches) Frequency Relative frequency Cumulative relative frequency
    2.95–4.96 6 \(\frac{6}{50}\) = 0.12 0.12
    4.97–6.98 7 \(\frac{7}{50}\) = 0.14 0.12 + 0.14 = 0.26
    6.99–9.00 15 \(\frac{15}{50}\) = 0.30 0.26 + 0.30 = 0.56
    9.01–11.02 8 \(\frac{8}{50}\) = 0.16 0.56 + 0.16 = 0.72
    11.03–13.04 9 \(\frac{9}{50}\) = 0.18 0.72 + 0.18 = 0.90
    13.05–15.07 5 \(\frac{5}{50}\) = 0.10 0.90 + 0.10 = 1.00
      Total = 50 Total = 1.00  
    Table \(\PageIndex{5}\)

    From Table \(\PageIndex{5}\), find the percentage of rainfall that is less than 9.01 inches.

    Exercise \(\PageIndex{3}\)

    From Table \(\PageIndex{5}\), find the percentage of rainfall that is between 6.99 and 13.05 inches.

    Exercise \(\PageIndex{4}\)

    Table \(\PageIndex{5}\) represents the amount, in inches, of annual rainfall in a sample of towns. What fraction of towns surveyed get between 11.03 and 13.05 inches of rainfall each year?

    Example \(\PageIndex{3}\)

    Nineteen people were asked how many miles, to the nearest mile, they commute to work each day. The data are as follows: 2; 5; 7; 3; 2; 10; 18; 15; 20; 7; 10; 18; 5; 12; 13; 12; 4; 5; 10. Table \(\PageIndex{6}\) was produced:

    Data Frequency Relative frequency Cumulative relative frequency
    3 3 \(\frac{3}{19}\) 0.1579
    4 1 \(\frac{1}{19}\) 0.2105
    5 3 \(\frac{3}{19}\) 0.1579
    7 2 \(\frac{2}{19}\) 0.2632
    10 3 \(\frac{4}{19}\) 0.4737
    12 2 \(\frac{2}{19}\) 0.7895
    13 1 \(\frac{1}{19}\) 0.8421
    15 1 \(\frac{1}{19}\) 0.8948
    18 1 \(\frac{1}{19}\) 0.9474
    20 1 \(\frac{1}{19}\) 1.0000
    Table \(\PageIndex{6}\) Frequency of Commuting Distances
    1. Is the table correct? If it is not correct, what is wrong?
    2. True or False: Three percent of the people surveyed commute three miles. If the statement is not correct, what should it be? If the table is incorrect, make the corrections.
    3. What fraction of the people surveyed commute five or seven miles?
    4. What fraction of the people surveyed commute 12 miles or more? Less than 12 miles? Between five and 13 miles (not including five and 13 miles)?
    Answer
    1. No. The frequency column sums to 18, not 19. Not all cumulative relative frequencies are correct.
    2. False. The frequency for three miles should be one; for two miles (left out), two; and for 18 miles, two. The cumulative relative frequency column should read: 0.1053, 0.1579, 0.2105, 0.3684, 0.4737, 0.6316, 0.7368, 0.7895, 0.8421, 0.9474, 1.0000.
    3. \(\frac{5}{19}\)
    4. \(\frac{7}{19}, \frac{12}{19}, \frac{7}{19}\)
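
One way to check a frequency table like this one is to rebuild it directly from the raw data. The sketch below does that for the 19 commuting distances; the variable names are illustrative.

```python
from collections import Counter
from itertools import accumulate

# Commuting distances (miles) reported by the 19 people surveyed.
miles = [2, 5, 7, 3, 2, 10, 18, 15, 20, 7, 10, 18, 5, 12, 13, 12, 4, 5, 10]

# Frequency of each distinct distance, in ascending order.
frequency = dict(sorted(Counter(miles).items()))
n = len(miles)

# Running totals of the frequencies; dividing by n gives the cumulative relative frequency.
cumulative_counts = list(accumulate(frequency.values()))

for value, count, running in zip(frequency, frequency.values(), cumulative_counts):
    print(f"{value:>2}  {count}  {count}/{n}  {running / n:.4f}")
```
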
    Example \(\PageIndex{4}\)

    Table \(\PageIndex{7}\) contains the total number of deaths worldwide as a result of earthquakes for the period from 2000 to 2012.

    Year Total number of deaths
    2000 231
    2001 21,357
    2002 11,685
    2003 33,819
    2004 228,802
    2005 88,003
    2006 6,605
    2007 712
    2008 88,011
    2009 1,790
    2010 320,120
    2011 21,953
    2012 768
    Total 823,856

    Table \(\PageIndex{7}\)

    Answer the following questions.

    1. What is the frequency of deaths measured from 2006 through 2009?
    2. What percentage of deaths occurred after 2009?
    3. What is the relative frequency of deaths that occurred in 2003 or earlier?
    4. What is the percentage of deaths that occurred in 2004?
    5. What kind of data are the numbers of deaths?
    6. The Richter scale is used to quantify the energy produced by an earthquake. Examples of Richter scale numbers are 2.3, 4.0, 6.1, and 7.0. What kind of data are these numbers?
    Answer
    1. 97,118 (11.8%)
    2. 41.6%
    3. 67,092/823,856 or 0.081 or 8.1%
    4. 27.8%
    5. Quantitative discrete
    6. Quantitative continuous
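
The numerical answers above come from summing the relevant rows and dividing by the table total. A short sketch, with a hypothetical helper `deaths_in` for summing a range of years:

```python
# Total earthquake deaths worldwide by year, from the table above.
deaths = {
    2000: 231, 2001: 21357, 2002: 11685, 2003: 33819, 2004: 228802,
    2005: 88003, 2006: 6605, 2007: 712, 2008: 88011, 2009: 1790,
    2010: 320120, 2011: 21953, 2012: 768,
}
total = sum(deaths.values())

def deaths_in(first, last):
    """Sum of deaths for the years first through last, inclusive."""
    return sum(count for year, count in deaths.items() if first <= year <= last)

def percent(count):
    """Share of the overall total, as a percentage rounded to one decimal place."""
    return round(100 * count / total, 1)

print(deaths_in(2006, 2009), percent(deaths_in(2006, 2009)))
print(percent(deaths_in(2010, 2012)))
print(percent(deaths_in(2000, 2003)))
print(percent(deaths[2004]))
```
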
    Exercise \(\PageIndex{5}\)

    Table \(\PageIndex{8}\) contains the total number of fatal motor vehicle traffic crashes in the United States for the period from 1994 to 2011.

    Year Total number of crashes Year Total number of crashes
    1994 36,254 2004 38,444
    1995 37,241 2005 39,252
    1996 37,494 2006 38,648
    1997 37,324 2007 37,435
    1998 37,107 2008 34,172
    1999 37,140 2009 30,862
    2000 37,526 2010 30,296
    2001 37,862 2011 29,757
    2002 38,491 Total 653,782
    2003 38,477    

    Table \(\PageIndex{8}\)

    Answer the following questions.

    1. What is the frequency of crashes measured from 2000 through 2004?
    2. What percentage of crashes occurred after 2006?
    3. What is the relative frequency of crashes that occurred in 2000 or before?
    4. What is the percentage of crashes that occurred in 2011?
    5. What is the cumulative relative frequency for 2006? Explain what this number tells you about the data.

    Stem-and-Leaf Graphs (Stemplots), Line Graphs, and Bar Graphs

    One simple graph, the stem-and-leaf graph or stemplot, comes from the field of exploratory data analysis. It is a good choice when the data sets are small. To create the plot, divide each observation of data into a stem and a leaf. The leaf consists of a final significant digit. For example, 23 has stem 2 and leaf 3. The number 432 has stem 43 and leaf 2. Likewise, the number 5,432 has stem 543 and leaf 2. The decimal 9.3 has stem 9 and leaf 3. Write the stems in a vertical line from smallest to largest. Draw a vertical line to the right of the stems. Then write the leaves in increasing order next to their corresponding stem.

    Example \(\PageIndex{5}\)

    For Susan Dean's spring pre-calculus class, scores for the first exam were as follows (smallest to largest):

    33; 42; 49; 49; 53; 55; 55; 61; 63; 67; 68; 68; 69; 69; 72; 73; 74; 78; 80; 83; 88; 88; 88; 90; 92; 94; 94; 94; 94; 96; 100

    Stem Leaf
    3 3
    4 2 9 9
    5 3 5 5
    6 1 3 7 8 8 9 9
    7 2 3 4 8
    8 0 3 8 8 8
    9 0 2 4 4 4 4 6
    10 0
    Table \(\PageIndex{9}\) Stem-and-Leaf Graph

    The stemplot shows that most scores fell in the 60s, 70s, 80s, and 90s. Eight out of the 31 scores or approximately 26% (8/31) were in the 90s or 100, a fairly high number of As.
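
The stem-and-leaf construction can be sketched in code as follows. This is a minimal sketch for non-negative whole numbers, where the leaf is the final digit and the stem is everything before it; the function name `stemplot` is hypothetical.

```python
from collections import defaultdict

def stemplot(values):
    """Return stem-and-leaf rows for non-negative integers (leaf = last digit)."""
    leaves = defaultdict(list)
    for v in sorted(values):
        stem, leaf = divmod(v, 10)  # e.g. 94 -> stem 9, leaf 4
        leaves[stem].append(leaf)
    rows = []
    # Include every stem between the smallest and largest, even if it has no leaves.
    for stem in range(min(leaves), max(leaves) + 1):
        row = f"{stem:>2} | " + " ".join(str(d) for d in leaves.get(stem, []))
        rows.append(row.rstrip())
    return rows

# First-exam scores from the example.
scores = [33, 42, 49, 49, 53, 55, 55, 61, 63, 67, 68, 68, 69, 69, 72, 73,
          74, 78, 80, 83, 88, 88, 88, 90, 92, 94, 94, 94, 94, 96, 100]

rows = stemplot(scores)
print("\n".join(rows))
```

Listing empty stems keeps the vertical scale even, which makes gaps and possible outliers easier to see.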

    Exercise \(\PageIndex{6}\)

    For the Park City basketball team, scores for the last 30 games were as follows (smallest to largest):

    32; 32; 33; 34; 38; 40; 42; 42; 43; 44; 46; 47; 47; 48; 48; 48; 49; 50; 50; 51; 52; 52; 52; 53; 54; 56; 57; 57; 60; 61

    Construct a stem plot for the data.

    The stemplot is a quick way to graph data and gives an exact picture of the data. You want to look for an overall pattern and any outliers. An outlier is an observation of data that does not fit the rest of the data. It is sometimes called an extreme value. When you graph an outlier, it will appear not to fit the pattern of the graph. Some outliers are due to mistakes (for example, writing down 50 instead of 500) while others may indicate that something unusual is happening. It takes some background information to explain outliers, so we will cover them in more detail later.

    Example \(\PageIndex{6}\)

    The data are the distances (in kilometers) from a home to local supermarkets. Create a stemplot using the data:

    1.1; 1.5; 2.3; 2.5; 2.7; 3.2; 3.3; 3.3; 3.5; 3.8; 4.0; 4.2; 4.5; 4.5; 4.7; 4.8; 5.5; 5.6; 6.5; 6.7; 12.3

    Do the data seem to have any concentration of values?

    NOTE

    The leaves are to the right of the decimal.

    Answer

    The value 12.3 may be an outlier. Values appear to concentrate at 3 and 4 kilometers.

    Stem Leaf
    1 1 5
    2 3 5 7
    3 2 3 3 5 8
    4 0 2 5 5 7 8
    5 5 6
    6 5 7
    7  
    8  
    9  
    10  
    11  
    12 3
    Table \(\PageIndex{10}\)
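
For data recorded to one decimal place, the same idea applies with the tenths digit as the leaf. The sketch below assumes every value has exactly one decimal place; the function name `decimal_stemplot` is hypothetical.

```python
from collections import defaultdict

def decimal_stemplot(values):
    """Stem-and-leaf rows for values with one decimal place (leaf = tenths digit)."""
    leaves = defaultdict(list)
    for v in sorted(values):
        # Work in tenths so that 12.3 becomes 123 -> stem 12, leaf 3.
        stem, leaf = divmod(round(v * 10), 10)
        leaves[stem].append(leaf)
    rows = []
    # Empty stems are kept so that a gap before an outlier stays visible.
    for stem in range(min(leaves), max(leaves) + 1):
        row = f"{stem:>2} | " + " ".join(str(d) for d in leaves.get(stem, []))
        rows.append(row.rstrip())
    return rows

# Distances (in kilometers) from a home to local supermarkets.
distances = [1.1, 1.5, 2.3, 2.5, 2.7, 3.2, 3.3, 3.3, 3.5, 3.8, 4.0, 4.2,
             4.5, 4.5, 4.7, 4.8, 5.5, 5.6, 6.5, 6.7, 12.3]

rows = decimal_stemplot(distances)
print("\n".join(rows))
```
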
    Exercise \(\PageIndex{7}\)

    The following data show the distances (in miles) from the homes of off-campus statistics students to the college. Create a stem plot using the data and identify any outliers:

    0.5; 0.7; 1.1; 1.2; 1.2; 1.3; 1.3; 1.5; 1.5; 1.7; 1.7; 1.8; 1.9; 2.0; 2.2; 2.5; 2.6; 2.8; 2.8; 2.8; 3.5; 3.8; 4.4; 4.8; 4.9; 5.2; 5.5; 5.7; 5.8; 8.0

    Example \(\PageIndex{7}\)

    A side-by-side stem-and-leaf plot allows a comparison of the two data sets in two columns. In a side-by-side stem-and-leaf plot, two sets of leaves share the same stem. The leaves are to the left and the right of the stems. Table \(\PageIndex{11}\) and Table \(\PageIndex{12}\) show the ages of presidents at their inauguration and at their death. Construct a side-by-side stem-and-leaf plot using this data.

    Answer
    Ages at Inauguration | Stem | Ages at Death
    9 9 8 7 7 7 6 3 2 | 4 | 6 9
    8 7 7 7 7 6 6 6 5 5 5 5 4 4 4 4 4 2 2 1 1 1 1 1 0 | 5 | 3 6 6 7 7 8
    9 8 5 4 4 2 1 1 1 0 | 6 | 0 0 3 3 4 4 5 6 7 7 7 8
      | 7 | 0 0 1 1 1 4 7 8 8 9
      | 8 | 0 1 3 5 8
      | 9 | 0 0 3 3
    Table \(\PageIndex{13}\)
    President Age President Age President Age
    Washington 57 Lincoln 52 Hoover 54
    J. Adams 61 A. Johnson 56 F. Roosevelt 51
    Jefferson 57 Grant 46 Truman 60
    Madison 57 Hayes 54 Eisenhower 62
    Monroe 58 Garfield 49 Kennedy 43
    J. Q. Adams 57 Arthur 51 L. Johnson 55
    Jackson 61 Cleveland 47 Nixon 56
    Van Buren 54 B. Harrison 55 Ford 61
    W. H. Harrison 68 Cleveland 55 Carter 52
    Tyler 51 McKinley 54 Reagan 69
    Polk 49 T. Roosevelt 42 G.H.W. Bush 64
    Taylor 64 Taft 51 Clinton 47
    Fillmore 50 Wilson 56 G. W. Bush 54
    Pierce 48 Harding 55 Obama 47
    Buchanan 65 Coolidge 51 Trump 70
    Table \(\PageIndex{11}\) Presidential Ages at Inauguration
    President Age President Age President Age
    Washington 67 Lincoln 56 Hoover 90
    J. Adams 90 A. Johnson 66 F. Roosevelt 63
    Jefferson 83 Grant 63 Truman 88
    Madison 85 Hayes 70 Eisenhower 78
    Monroe 73 Garfield 49 Kennedy 46
    J. Q. Adams 80 Arthur 56 L. Johnson 64
    Jackson 78 Cleveland 71 Nixon 81
    Van Buren 79 B. Harrison 67 Ford 93
    W. H. Harrison 68 Cleveland 71 Reagan 93
    Tyler 71 McKinley 58    
    Polk 53 T. Roosevelt 60    
    Taylor 65 Taft 72    
    Fillmore 74 Wilson 67    
    Pierce 64 Harding 57    
    Buchanan 77 Coolidge 60    
    Table \(\PageIndex{12}\) Presidential Age at Death

    Another type of graph that is useful for specific data values is a line graph. In the particular line graph shown in Example \(\PageIndex{8}\), the x-axis (horizontal axis) consists of data values and the y-axis (vertical axis) consists of frequency points. The frequency points are connected using line segments.

    Example \(\PageIndex{8}\)

    In a survey, 40 mothers were asked how many times per week a teenager must be reminded to do his or her chores. The results are shown in Table \(\PageIndex{14}\) and in Figure \(\PageIndex{1}\).

    Number of times teenager is reminded Frequency
    0 2
    1 5
    2 8
    3 14
    4 7
    5 4

    Table \(\PageIndex{14}\)

    A line graph showing the number of times a teenager needs to be reminded to do chores on the x-axis and  frequency on the y-axis.

    Figure \(\PageIndex{1}\)

    Exercise \(\PageIndex{8}\)

    In a survey, 40 people were asked how many times per year they had their car in the shop for repairs. The results are shown in Table \(\PageIndex{15}\). Construct a line graph.

    Number of times in shop Frequency
    0 7
    1 10
    2 14
    3 9

    Table \(\PageIndex{15}\)

    Bar graphs consist of bars that are separated from each other. The bars can be rectangles or they can be rectangular boxes (used in three-dimensional plots), and they can be vertical or horizontal. The bar graph shown in Example \(\PageIndex{9}\) has age groups represented on the x-axis and proportions on the y-axis.

    Example \(\PageIndex{9}\)

    By the end of 2011, Facebook had over 146 million users in the United States. Table \(\PageIndex{16}\) shows three age groups, the number of users in each age group, and the proportion (%) of users in each age group. Construct a bar graph using this data.

    Age groups Number of Facebook users Proportion (%) of Facebook users
    13–25 65,082,280 45%
    26–44 53,300,200 36%
    45–64 27,885,100 19%

    Table \(\PageIndex{16}\)

    This is a bar graph that matches the supplied data. The x-axis shows age groups,  and the y-axis shows the percentages of Facebook users.
    Figure \(\PageIndex{2}\)
    Exercise \(\PageIndex{9}\)

    The population in Park City is made up of children, working-age adults, and retirees. Table \(\PageIndex{17}\) shows the three age groups, the number of people in the town from each age group, and the proportion (%) of people in each age group. Construct a bar graph showing the proportions.

    Age groups Number of people Proportion of population
    Children 67,059 19%
    Working-age adults 152,198 43%
    Retirees 131,662 38%

    Table \(\PageIndex{17}\)

    Example \(\PageIndex{10}\)

    The columns in Table \(\PageIndex{18}\) contain: the race or ethnicity of students in U.S. Public Schools for the class of 2011, percentages for the Advanced Placement examinee population for that class, and percentages for the overall student population. Create a bar graph with the student race or ethnicity (qualitative data) on the x-axis, and the Advanced Placement examinee population percentages on the y-axis.

    Race/ethnicity AP examinee population Overall student population
    1 = Asian, Asian American or Pacific Islander 10.3% 5.7%
    2 = Black or African American 9.0% 14.7%
    3 = Hispanic or Latino 17.0% 17.6%
    4 = American Indian or Alaska Native 0.6% 1.1%
    5 = White 57.1% 59.2%
    6 = Not reported/other 6.0% 1.7%

    Table \(\PageIndex{18}\)

    This is a bar graph that matches the supplied data. The x-axis shows race and ethnicity, and the y-axis shows the percentages of AP examinees.
    Figure \(\PageIndex{3}\)
    Exercise \(\PageIndex{10}\)

    Park City is broken down into six voting districts. Table \(\PageIndex{19}\) shows the percent of the total registered voter population that lives in each district as well as the percent of the entire population that lives in each district. Construct a bar graph that shows the registered voter population by district.

    District Registered voter population Overall city population
    1 15.5% 19.4%
    2 12.2% 15.6%
    3 9.8% 9.0%
    4 17.4% 18.5%
    5 22.8% 20.7%
    6 22.3% 16.8%
    Table \(\PageIndex{19}\)
    Example \(\PageIndex{11}\)

    Below is a two-way table showing the types of pets owned by men and women:

    Dogs Cats Fish Total
    Men 4 2 2 8
    Women 4 6 2 12
    Total 8 8 4 20
    Table \(\PageIndex{20}\)

    Given these data, calculate the conditional distributions for the subpopulation of men who own each pet type.

    Answer
    • Men who own dogs = 4/8 = 0.5
    • Men who own cats = 2/8 = 0.25
    • Men who own fish = 2/8 = 0.25

    Note: The sum of all of the conditional distributions must equal one. In this case, 0.5 + 0.25 + 0.25 = 1; therefore, the solution "checks".
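
A conditional distribution divides each count in a row by that row's total. The sketch below applies this to the two-way pet table; the function name `conditional_distribution` and the dictionary layout are assumptions made for illustration.

```python
from fractions import Fraction

# Two-way table of pet ownership: row (group) -> column (pet type) -> count.
pets = {
    "Men": {"Dogs": 4, "Cats": 2, "Fish": 2},
    "Women": {"Dogs": 4, "Cats": 6, "Fish": 2},
}

def conditional_distribution(table, row):
    """Proportion of each category within one row of a two-way table."""
    row_total = sum(table[row].values())
    return {category: Fraction(count, row_total) for category, count in table[row].items()}

men = conditional_distribution(pets, "Men")
for pet, share in men.items():
    print(pet, float(share))
```

The same function can be called with `"Women"` to condition on the other row.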

    Qualitative Data Discussion

    Below are tables comparing the number of part-time and full-time students at De Anza College and Foothill College enrolled for the spring 2010 term. The tables display counts (frequencies) and percentages or proportions (relative frequencies). The percent columns make comparing the same categories in the colleges easier. Displaying percentages along with the numbers is often helpful, but it is particularly important when comparing sets of data that do not have the same totals, such as the total enrollments for both colleges in this example. Notice how much larger the percentage for part-time students at Foothill College is compared to De Anza College.

    Table \(\PageIndex{21}\): Fall Term 2007 (Census day)
    De Anza College Foothill College
      Number Percent     Number Percent
    Full-time 9,200 40.9%   Full-time 4,059 28.6%
    Part-time 13,296 59.1%   Part-time 10,124 71.4%
    Total 22,496 100%   Total 14,183 100%
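
The percent columns can be reproduced by dividing each count by its college's total. A brief sketch, using a hypothetical helper `percentages` and the counts from the table:

```python
# Enrollment counts from the table above.
enrollment = {
    "De Anza College": {"Full-time": 9200, "Part-time": 13296},
    "Foothill College": {"Full-time": 4059, "Part-time": 10124},
}

def percentages(counts):
    """Convert category counts to percentages of their total, to one decimal place."""
    total = sum(counts.values())
    return {status: round(100 * count / total, 1) for status, count in counts.items()}

for college, counts in enrollment.items():
    print(college, percentages(counts))
```

Because the two colleges have different totals, comparing these percentages is more informative than comparing the raw counts.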

    Tables are a good way of organizing and displaying data. But graphs can be even more helpful in understanding the data. There are no strict rules concerning which graphs to use. Two graphs that are used to display qualitative (categorical) data are pie charts and bar graphs.

    • In a pie chart, categories of data are represented by wedges in a circle and are proportional in size to the percent of individuals in each category.
    • In a bar graph, the length of the bar for each category is proportional to the number or percent of individuals in each category. Bars may be vertical or horizontal.
    • A Pareto chart consists of bars that are sorted into order by category size (largest to smallest).

    Look at Figures \(\PageIndex{4}\) and \(\PageIndex{5}\) and determine which graph (pie or bar) you think displays the comparisons better.

    It is a good idea to look at a variety of graphs to see which is the most helpful in displaying the data. We might make different choices of what we think is the “best” graph depending on the data and the context. Our choice also depends on what we are using the data for.

    Figure \(\PageIndex{4}\)A
    Figure \(\PageIndex{4}\)B

    Figure \(\PageIndex{5}\)

    Percentages That Add to More (or Less) Than 100%

    Sometimes percentages add up to be more than 100% (or less than 100%). In the graph, the percentages add to more than 100% because students can be in more than one category. A bar graph is appropriate to compare the relative size of the categories. A pie chart cannot be used. It also could not be used if the percentages added to less than 100%.

    Table \(\PageIndex{22}\): De Anza College Spring 2010
    Characteristic/category Percent
    Full-time students 40.9%
    Students who intend to transfer to a 4-year educational institution 48.6%
    Students under age 25 61.0%
    TOTAL 150.5%
    Figure \(\PageIndex{6}\)
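
Whether a pie chart is appropriate can be checked by totaling the percentages first. The sketch below is one possible check; the function name `suits_pie_chart` and the half-point rounding tolerance are assumptions, not rules from the text.

```python
def suits_pie_chart(percentages, tolerance=0.5):
    """True when the percentages total 100% within an assumed rounding tolerance."""
    return abs(sum(percentages) - 100) <= tolerance

# Overlapping categories from the De Anza College Spring 2010 table.
de_anza = {
    "Full-time students": 40.9,
    "Students who intend to transfer to a 4-year educational institution": 48.6,
    "Students under age 25": 61.0,
}

print(round(sum(de_anza.values()), 1))     # total of the percent column
print(suits_pie_chart(de_anza.values()))   # overlapping categories: use a bar graph
print(suits_pie_chart([40.9, 59.1]))       # full-time vs. part-time split
```
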

    Omitting Categories/Missing Data

    The table displays Ethnicity of Students but is missing the "Other/Unknown" category. This category contains people who did not feel they fit into any of the ethnicity categories or declined to respond. Notice that the frequencies do not add up to the total number of students. In this situation, create a bar graph and not a pie chart.

    Table \(\PageIndex{23}\): Ethnicity of Students at De Anza College Fall Term 2007 (Census Day)
    Frequency Percent
    Asian 8,794 36.1%
    Black 1,412 5.8%
    Filipino 1,298 5.3%
    Hispanic 4,180 17.1%
    Native American 146 0.6%
    Pacific Islander 236 1.0%
    White 5,978 24.5%
    TOTAL 22,044 out of 24,382 90.4% out of 100%
    Figure \(\PageIndex{7}\)

    The following graph is the same as the previous graph but the “Other/Unknown” percent (9.6%) has been included. The “Other/Unknown” category is large compared to some of the other categories (Native American, 0.6%, Pacific Islander 1.0%). This is important to know when we think about what the data are telling us.

    This particular bar graph in Figure \(\PageIndex{9}\) is a Pareto chart. The Pareto chart has the bars sorted from largest to smallest and is easier to read and interpret.

    Figure \(\PageIndex{8}\): Bar Graph with Other/Unknown Category
    Figure \(\PageIndex{9}\): Pareto Chart With Bars Sorted by Size
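
The ordering used in a Pareto chart amounts to sorting the categories by size, largest first. The sketch below sorts the ethnicity percentages, including the 9.6% "Other/Unknown" category, and prints a rough text bar for each; the text rendering is only an illustration, not how the figures were produced.

```python
# Percentages from the ethnicity table, with the Other/Unknown category added.
percent = {
    "Asian": 36.1, "Black": 5.8, "Filipino": 5.3, "Hispanic": 17.1,
    "Native American": 0.6, "Pacific Islander": 1.0, "White": 24.5,
    "Other/Unknown": 9.6,
}

# Pareto ordering: categories sorted from largest to smallest.
pareto = sorted(percent.items(), key=lambda item: item[1], reverse=True)

for category, value in pareto:
    # One '#' per percentage point, rounded, as a crude horizontal bar.
    print(f"{category:<17}{value:>5.1f}%  " + "#" * round(value))
```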

    Pie Charts: No Missing Data

    The following pie charts, shown in Figure \(\PageIndex{10}\), have the “Other/Unknown” category included (since the percentages must add to 100%).

    Figure \(\PageIndex{10}\): Pie Charts with no missing data

    Histograms, Frequency Polygons, and Time Series Graphs

    For most of the work you do in this book, you will use a histogram to display the data. One advantage of a histogram is that it can readily display large data sets. A rule of thumb is to use a histogram when the data set consists of 100 values or more.

    A histogram consists of contiguous (adjoining) boxes. It has both a horizontal axis and a vertical axis. The horizontal axis is labeled with what the data represents (for instance, distance from your home to school). The vertical axis is labeled either frequency or relative frequency (or percent frequency or probability). The graph will have the same shape with either label. The histogram (like the stemplot) can give you the shape of the data, the center, and the spread of the data.

    The relative frequency is equal to the frequency for an observed value of the data divided by the total number of data values in the sample. (Remember, frequency is defined as the number of times a value occurs.) If:

    • \(f\) = frequency
    • \(n\) = total number of data values (or the sum of the individual frequencies), and
    • \(RF\) = relative frequency,

    then:

    \[RF=\frac{f}{n}\nonumber\]

    For example, if three students in Mr. Ahab's English class of 40 students received from 90% to 100%, then, \(f = 3\), \(n = 40\), and \(RF = \frac{f}{n} = \frac{3}{40} = 0.075\). In other words, 7.5% of the students received 90–100%, and 90–100% are quantitative measures.

    To construct a histogram, first decide how many bars or intervals, also called classes, represent the data. Many histograms consist of five to 15 bars or classes for clarity. The number of bars needs to be chosen. Choose a starting point for the first interval to be less than the smallest data value. A convenient starting point is a lower value carried out to one more decimal place than the value with the most decimal places. For example, if the value with the most decimal places is 6.1 and this is the smallest value, a convenient starting point is 6.05 (6.1 – 0.05 = 6.05). We say that 6.05 has more precision. If the value with the most decimal places is 2.23 and the lowest value is 1.5, a convenient starting point is 1.495 (1.5 – 0.005 = 1.495). If the value with the most decimal places is 3.234 and the lowest value is 1.0, a convenient starting point is 0.9995 (1.0 – 0.0005 = 0.9995). If all the data happen to be integers and the smallest value is 2, then a convenient starting point is 1.5 (2 – 0.5 = 1.5). Also, when the starting point and other boundaries are carried to one additional decimal place, no data value will fall on a boundary. The next two examples go into detail about how to construct a histogram using continuous data and how to create a histogram using discrete data.
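
    The rule for the convenient starting point can be written as a small function. This is a sketch under our own assumptions: the name convenient_start is hypothetical, and the values are passed as strings so that the number of decimal places is preserved exactly as written.

```python
from decimal import Decimal


def convenient_start(smallest, most_precise):
    """Subtract half a unit in the last decimal place used by the data.

    smallest:     the smallest data value, as a string (e.g. "1.5")
    most_precise: the value with the most decimal places, as a string
    """
    # Number of decimal places in the most precise value, e.g. 2 for "2.23".
    places = max(-Decimal(most_precise).as_tuple().exponent, 0)
    # 0.5 for integers, 0.05 for one decimal place, 0.005 for two, ...
    half_unit = Decimal(5).scaleb(-(places + 1))
    return Decimal(smallest) - half_unit


print(convenient_start("6.1", "6.1"))    # 6.05
print(convenient_start("1.5", "2.23"))   # 1.495
print(convenient_start("1.0", "3.234"))  # 0.9995
print(convenient_start("2", "2"))        # 1.5
```

    The four calls reproduce the four cases described in the paragraph above.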

    Example \(\PageIndex{12}\)

    The following data are the heights (in inches to the nearest half inch) of 100 male semiprofessional soccer players. The heights are continuous data, since height is measured.

    60; 60.5; 61; 61; 61.5
    63.5; 63.5; 63.5
    64; 64; 64; 64; 64; 64; 64; 64.5; 64.5; 64.5; 64.5; 64.5; 64.5; 64.5; 64.5
    66; 66; 66; 66; 66; 66; 66; 66; 66; 66; 66.5; 66.5; 66.5; 66.5; 66.5; 66.5; 66.5; 66.5; 66.5; 66.5; 66.5; 67; 67; 67; 67; 67; 67; 67; 67; 67; 67; 67; 67; 67.5; 67.5; 67.5; 67.5; 67.5; 67.5; 67.5
    68; 68; 69; 69; 69; 69; 69; 69; 69; 69; 69; 69; 69.5; 69.5; 69.5; 69.5; 69.5
    70; 70; 70; 70; 70; 70; 70.5; 70.5; 70.5; 71; 71; 71
    72; 72; 72; 72.5; 72.5; 73; 73.5
    74

    The smallest data value is 60. Since the data with the most decimal places has one decimal (for instance, 61.5), we want our starting point to have two decimal places. Since the numbers 0.5, 0.05, 0.005, etc. are convenient numbers, use 0.05 and subtract it from 60, the smallest value, for the convenient starting point.

    60 – 0.05 = 59.95 which is more precise than, say, 61.5 by one decimal place. The starting point is, then, 59.95.

    The largest value is 74, so 74 + 0.05 = 74.05 is the ending value.

    Next, calculate the width of each bar or class interval. To calculate this width, subtract the starting point from the ending value and divide by the number of bars (you must choose the number of bars you desire). Suppose you choose eight bars.

    \[\frac{74.05-59.95}{8}\approx 1.76\nonumber\]

    NOTE

    We will round up to two and make each bar or class interval two units wide. Rounding up to two is one way to prevent a value from falling on a boundary. Rounding to the next number is often necessary even if it goes against the standard rules of rounding. For this example, using 1.76 as the width would also work. A guideline that some follow for the number of bars or class intervals is to take the square root of the number of data values and then round to the nearest whole number, if necessary. For example, if there are 150 values of data, take the square root of 150 and round to 12 bars or intervals.
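
    The arithmetic in this note can be sketched as follows. The variable names are our own, and the square-root guideline is read here as a suggestion for the number of bars, which matches the 150-value example.

```python
import math

start, end, bars = 59.95, 74.05, 8

raw_width = (end - start) / bars   # about 1.76
width = math.ceil(raw_width)       # round up to the next whole number

# Square-root guideline, applied to a data set of 150 values.
suggested_bars = round(math.sqrt(150))

print(round(raw_width, 2), width, suggested_bars)
```

    Here math.ceil always rounds up, which is the "round to the next number" behavior the note describes, while round applies the standard rounding rule.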

    The boundaries are:

    • 59.95
    • 59.95 + 2 = 61.95
    • 61.95 + 2 = 63.95
    • 63.95 + 2 = 65.95
    • 65.95 + 2 = 67.95
    • 67.95 + 2 = 69.95
    • 69.95 + 2 = 71.95
    • 71.95 + 2 = 73.95
    • 73.95 + 2 = 75.95

    The heights 60 through 61.5 inches are in the interval 59.95–61.94. The heights that are 63.5 are in the interval 61.95–63.94. The heights that are 64 through 64.5 are in the interval 63.95–65.94. The heights 66 through 67.5 are in the interval 65.95–67.94. The heights 68 through 69.5 are in the interval 67.95–69.94. The heights 70 through 71 are in the interval 69.95–71.94. The heights 72 through 73.5 are in the interval 71.95–73.94. The height 74 is in the interval 73.95–75.94.
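
    As a check on the intervals above, the boundaries and the count in each interval can be computed directly. This is a minimal sketch: the heights are entered as (value, number of occurrences) pairs to keep the listing short, and Decimal is used so that boundaries such as 59.95 are represented exactly.

```python
from decimal import Decimal

# Heights from this example, written as (value, how many times it occurs).
heights = [
    ("60", 1), ("60.5", 1), ("61", 2), ("61.5", 1),
    ("63.5", 3),
    ("64", 7), ("64.5", 8),
    ("66", 10), ("66.5", 11), ("67", 12), ("67.5", 7),
    ("68", 2), ("69", 10), ("69.5", 5),
    ("70", 6), ("70.5", 3), ("71", 3),
    ("72", 3), ("72.5", 2), ("73", 1), ("73.5", 1),
    ("74", 1),
]

start, width, bars = Decimal("59.95"), Decimal("2"), 8
boundaries = [start + width * k for k in range(bars + 1)]

counts = [0] * bars
for value, times in heights:
    # Floor division gives the index of the interval that holds the value.
    index = int((Decimal(value) - start) // width)
    counts[index] += times

n = sum(counts)
relative = [c / n for c in counts]

for left, right, c, rf in zip(boundaries, boundaries[1:], counts, relative):
    print(f"{left} to {right}: frequency {c}, relative frequency {rf}")
```

    Because every boundary carries one more decimal place than the data, no height can land exactly on a boundary.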

    The following histogram displays the heights on the x-axis and relative frequency on the y-axis.

    Histogram consists of 8 bars with the y-axis in increments of 0.05 from 0-0.4 and the x-axis in intervals of 2 from 59.95-75.95.
    Figure \(\PageIndex{11}\)
    Example \(\PageIndex{13}\)

    Create a histogram for the following data: the number of books bought by 50 part-time college students at ABC College. The number of books is discrete data, since books are counted.

    1; 1; 1; 1; 1; 1; 1; 1; 1; 1; 1
    2; 2; 2; 2; 2; 2; 2; 2; 2; 2
    3; 3; 3; 3; 3; 3; 3; 3; 3; 3; 3; 3; 3; 3; 3; 3
    4; 4; 4; 4; 4; 4
    5; 5; 5; 5; 5
    6; 6

    Eleven students buy 1 book. Ten students buy 2 books. Sixteen students buy 3 books. Six students buy 4 books. Five students buy 5 books. Two students buy 6 books.

    Because the data are integers, subtract 0.5 from 1, the smallest data value and add 0.5 to 6, the largest data value. Then the starting point is 0.5 and the ending value is 6.5.

    Next, calculate the width of each bar or class interval. If the data are discrete and there are not too many different values, a width that places the data values in the middle of the bar or class interval is the most convenient. Since the data consist of the numbers 1, 2, 3, 4, 5, 6, and the starting point is 0.5, a width of one places the 1 in the middle of the interval from 0.5 to 1.5, the 2 in the middle of the interval from 1.5 to 2.5, the 3 in the middle of the interval from 2.5 to 3.5, the 4 in the middle of the interval from _______ to _______, the 5 in the middle of the interval from _______ to _______, and the _______ in the middle of the interval from _______ to _______ .

    Solution

    Calculate the number of bars as follows:

    \[\frac{6.5-0.5}{\text{number of bars}}=1\nonumber\]

    where 1 is the width of a bar. Therefore, the number of bars is 6.
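
    The same steps can be sketched in code. The list below rebuilds the 50 data values from the counts stated in this example; the variable names are our own.

```python
from collections import Counter

# 11 students bought 1 book, 10 bought 2, 16 bought 3, and so on.
books = [1] * 11 + [2] * 10 + [3] * 16 + [4] * 6 + [5] * 5 + [6] * 2

start, end, width = 0.5, 6.5, 1
bars = int((end - start) / width)
edges = [start + width * k for k in range(bars + 1)]

frequency = Counter(books)
for value in sorted(frequency):
    # Each integer value sits in the middle of its interval.
    print(f"{value - 0.5} to {value + 0.5}: {frequency[value]} students")
```

    With a width of one, each bar is centered on a single integer value, so the bar heights are simply the counts for 1 through 6 books.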

    The following histogram displays the number of books on the x-axis and the frequency on the y-axis.

    Histogram consists of 6 bars with the y-axis in increments of 2 from 0-16 and the x-axis in intervals of 1 from 0.5-6.5.
    Figure \(\PageIndex{12}\)
    Example \(\PageIndex{14}\)

    Using this data set, construct a histogram.

    Number of hours my classmates spent playing video games on weekends
    9.95 10 2.25 16.75 0
    19.5 22.5 7.5 15 12.75
    5.5 11 10 20.75 17.5
    23 21.9 24 23.75 18
    20 15 22.9 18.8 20.5
    Table \(\PageIndex{24}\)
    Answer
    This is a histogram that matches the supplied data. The x-axis consists of 5 bars in intervals of 5 from 0 to 25. The y-axis is marked in increments of 1 from 0 to 10. The x-axis shows the number of hours spent playing video games on the weekends, and the y-axis shows the number of students.
    Figure \(\PageIndex{13}\)

    Some values in this data set fall on boundaries for the class intervals. A value is counted in a class interval if it falls on the left boundary, but not if it falls on the right boundary. Different researchers may set up histograms for the same data in different ways. There is more than one correct way to set up a histogram.
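
    The left-boundary convention can be sketched as follows, assuming five class intervals of width 5 from 0 to 25 as in the figure description. A value such as 10 is counted in the interval from 10 to 15, not in the interval from 5 to 10.

```python
hours = [
    9.95, 10, 2.25, 16.75, 0,
    19.5, 22.5, 7.5, 15, 12.75,
    5.5, 11, 10, 20.75, 17.5,
    23, 21.9, 24, 23.75, 18,
    20, 15, 22.9, 18.8, 20.5,
]

width, bars = 5, 5
counts = [0] * bars
for h in hours:
    # Floor division includes the left boundary and excludes the right one.
    counts[int(h // width)] += 1

print(counts)  # [2, 3, 4, 7, 9]
```

    A different convention (counting boundary values in the interval to the left) would shift 10, 15, and 20 into the neighboring bars and give a slightly different histogram.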

    Frequency Polygons

    Frequency polygons are analogous to line graphs, and just as line graphs make continuous data visually easy to interpret, so too do frequency polygons.

    To construct a frequency polygon, first examine the data and decide on the number of intervals, or class intervals, to use on the x-axis and y-axis. After choosing the appropriate ranges, begin plotting the data points. After all the points are plotted, draw line segments to connect them.

    Example \(\PageIndex{15}\)

    A frequency polygon was constructed from the frequency table below.

    Lower bound Upper bound Frequency Cumulative frequency
    49.5 59.5 5 5
    59.5 69.5 10 15
    69.5 79.5 30 45
    79.5 89.5 40 85
    89.5 99.5 15 100
    Table \(\PageIndex{25}\): Frequency distribution for calculus final test scores
    Figure \(\PageIndex{7}\)

    The first label on the x-axis is 44.5. This represents an interval extending from 39.5 to 49.5. Since the lowest test score is 54.5, this interval is used only to allow the graph to touch the x-axis. The point labeled 54.5 represents the next interval, or the first “real” interval from the table, and contains five scores. This reasoning is followed for each of the remaining intervals with the point 104.5 representing the interval from 99.5 to 109.5. Again, this interval contains no data and is only used so that the graph will touch the x-axis. Looking at the graph, we say that this distribution is skewed because one side of the graph does not mirror the other side.
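
    The points of this frequency polygon can be computed from the table. This is a sketch with our own variable names: each point is the midpoint of an interval paired with its frequency, plus one empty interval on each side.

```python
# (lower bound, upper bound, frequency) from the frequency table.
table = [
    (49.5, 59.5, 5),
    (59.5, 69.5, 10),
    (69.5, 79.5, 30),
    (79.5, 89.5, 40),
    (89.5, 99.5, 15),
]

width = table[0][1] - table[0][0]
points = [((lo + hi) / 2, f) for lo, hi, f in table]

# Add one empty interval on each side so the polygon touches the x-axis.
points.insert(0, (points[0][0] - width, 0))
points.append((points[-1][0] + width, 0))

for x, f in points:
    print(x, f)
```

    Connecting these points in order with line segments gives the polygon, starting at 44.5 and ending at 104.5.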

    Exercise \(\PageIndex{11}\)

    Construct a frequency polygon of U.S. Presidents’ ages at inauguration shown in Table \(\PageIndex{26}\).

    Age at inauguration Frequency
    41.5–46.5 4
    46.5–51.5 11
    51.5–56.5 14
    56.5–61.5 9
    61.5–66.5 4
    66.5–71.5 2

    Table \(\PageIndex{26}\)

    Frequency polygons are useful for comparing distributions. This is achieved by overlaying the frequency polygons drawn for different data sets.

    Example \(\PageIndex{16}\)

    We will construct an overlay frequency polygon comparing the scores from Example \(\PageIndex{15}\) with the students’ final numeric grade.

    Lower bound Upper bound Frequency Cumulative frequency
    49.5 59.5 5 5
    59.5 69.5 10 15
    69.5 79.5 30 45
    79.5 89.5 40 85
    89.5 99.5 15 100
    Table \(\PageIndex{27}\): Frequency distribution for calculus final test scores
    Lower bound Upper bound Frequency Cumulative frequency
    49.5 59.5 10 10
    59.5 69.5 10 20
    69.5 79.5 30 50
    79.5 89.5 45 95
    89.5 99.5 5 100
    Table \(\PageIndex{28}\): Frequency distribution for calculus final grades
    This is an overlay frequency polygon that matches the supplied data. The x-axis shows the grades, and the y-axis shows the frequency.
    Figure \(\PageIndex{8}\)
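
    Before overlaying the two polygons, it can help to line the two frequency columns up against the same midpoints and confirm the cumulative frequencies. The sketch below uses our own variable names.

```python
from itertools import accumulate

midpoints = [54.5, 64.5, 74.5, 84.5, 94.5]
test_scores = [5, 10, 30, 40, 15]
final_grades = [10, 10, 30, 45, 5]

# Running totals should match the cumulative frequency columns.
cumulative_scores = list(accumulate(test_scores))    # [5, 15, 45, 85, 100]
cumulative_grades = list(accumulate(final_grades))   # [10, 20, 50, 95, 100]

for m, a, b in zip(midpoints, test_scores, final_grades):
    print(f"{m}: test scores {a}, final grades {b}, difference {b - a:+d}")
```

    Both data sets total 100, so the two polygons can be compared on the same frequency scale without converting to relative frequencies.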

    Constructing a Time Series Graph

    Suppose that we want to study the temperature range of a region for an entire month. Every day at noon we note the temperature and write this down in a log. A variety of statistical studies could be done with these data. We could find the mean or the median temperature for the month. We could construct a histogram displaying the number of days that temperatures reach a certain range of values. However, all of these methods ignore a portion of the data that we have collected.

    One feature of the data that we may want to consider is that of time. Since each date is paired with the temperature reading for the day, we do not have to think of the data as being random. We can instead use the times given to impose a chronological order on the data. A graph that recognizes this ordering and displays the changing temperature as the month progresses is called a time series graph.

    To construct a time series graph, we must look at both pieces of our paired data set. We start with a standard Cartesian coordinate system. The horizontal axis is used to plot the date or time increments, and the vertical axis is used to plot the values of the variable that we are measuring. By doing this, we make each point on the graph correspond to a date and a measured quantity. The points on the graph are typically connected by straight lines in the order in which they occur.

    Example \(\PageIndex{17}\)

    The following data show the Consumer Price Index for each month, together with the annual value, for ten years. Construct a time series graph for the Annual Consumer Price Index data only.

    Year Jan Feb Mar Apr May Jun Jul
    2003 181.7 183.1 184.2 183.8 183.5 183.7 183.9
    2004 185.2 186.2 187.4 188.0 189.1 189.7 189.4
    2005 190.7 191.8 193.3 194.6 194.4 194.5 195.4
    2006 198.3 198.7 199.8 201.5 202.5 202.9 203.5
    2007 202.416 203.499 205.352 206.686 207.949 208.352 208.299
    2008 211.080 211.693 213.528 214.823 216.632 218.815 219.964
    2009 211.143 212.193 212.709 213.240 213.856 215.693 215.351
    2010 216.687 216.741 217.631 218.009 218.178 217.965 218.011
    2011 220.223 221.309 223.467 224.906 225.964 225.722 225.922
    2012 226.665 227.663 229.392 230.085 229.815 229.478 229.104
    Table \(\PageIndex{29}\)
    Year Aug Sep Oct Nov Dec Annual
    2003 184.6 185.2 185.0 184.5 184.3 184.0
    2004 189.5 189.9 190.9 191.0 190.3 188.9
    2005 196.4 198.8 199.2 197.6 196.8 195.3
    2006 203.9 202.9 201.8 201.5 201.8 201.6
    2007 207.917 208.490 208.936 210.177 210.036 207.342
    2008 219.086 218.783 216.573 212.425 210.228 215.303
    2009 215.834 215.969 216.177 216.330 215.949 214.537
    2010 218.312 218.439 218.711 218.803 219.179 218.056
    2011 226.545 226.889 226.421 226.230 225.672 224.939
    2012 230.379 231.407 231.317 230.221 229.601 229.594
    Table \(\PageIndex{30}\)
    Answer
    This is a time series graph that matches the supplied data. The x-axis shows years from 2003 to 2012, and the y-axis shows the annual CPI.
    Figure \(\PageIndex{9}\)
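
    The points of this time series graph can be prepared as (year, value) pairs in chronological order. The sketch below uses only the Annual column from the table; the variable names are our own, and the plotting step itself is left out.

```python
annual_cpi = {
    2003: 184.0, 2004: 188.9, 2005: 195.3, 2006: 201.6, 2007: 207.342,
    2008: 215.303, 2009: 214.537, 2010: 218.056, 2011: 224.939, 2012: 229.594,
}

# Time goes on the horizontal axis, so sort the points by year.
points = sorted(annual_cpi.items())

# Years in which the annual value fell compared with the year before.
declines = [
    year
    for (_, previous), (year, current) in zip(points, points[1:])
    if current < previous
]
print(declines)  # [2009]
```

    Connecting the points in order shows a steady rise with a single dip in 2009, a feature that is hard to see in the table alone.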
    Exercise \(\PageIndex{18}\)

    The following table is a portion of a data set from www.worldbank.org. Use the table to construct a time series graph for CO2 emissions for the United States.

    Year Ukraine United Kingdom United States
    2003 352,259 540,640 5,681,664
    2004 343,121 540,409 5,790,761
    2005 339,029 541,990 5,826,394
    2006 327,797 542,045 5,737,615
    2007 328,357 528,631 5,828,697
    2008 323,657 522,247 5,656,839
    2009 272,176 474,579 5,299,563
    Table \(\PageIndex{20}\): CO2 emissions

    Uses of a Time Series Graph

    Time series graphs are important tools in various applications of statistics. When recording values of the same variable over an extended period of time, sometimes it is difficult to discern any trend or pattern. However, once the same data points are displayed graphically, some features jump out. Time series graphs make trends easy to spot.

    How NOT to Lie with Statistics

    It is important to remember that the very reason we develop a variety of methods to present data is to develop insights into the subject of what the observations represent. We want to get a "sense" of the data. Are the observations all very much alike, or are they spread across a wide range of values? Are they bunched at one end of the spectrum, or are they distributed evenly? We are trying to get a visual picture of the numerical data. Shortly we will develop formal mathematical measures of the data, but our visual graphical presentation can say much. It can, unfortunately, also say much that is distracting, confusing, and simply wrong in terms of the impression the visual leaves. Many years ago Darrell Huff wrote the book How to Lie with Statistics. It has been through more than 25 printings and sold more than one and one-half million copies. His perspective was a harsh one, and he used many actual examples that were designed to mislead. He wanted to make people aware of such deception but, perhaps more importantly, to educate so that others do not make the same errors inadvertently.

    Again, the goal is to enlighten with visuals that tell the story of the data. Pie charts have a number of common problems when used to convey the message of the data. Too many pieces of the pie overwhelm the reader. No more than perhaps five or six categories ought to be used to give an idea of the relative importance of each piece. This is, after all, the goal of a pie chart: to show which subset matters most relative to the others. If there are more components than this, then perhaps an alternative approach would be better, or perhaps some can be consolidated into an "other" category. Pie charts cannot show changes over time, although we see this attempted all too often. In federal, state, and city finance documents, pie charts are often presented to show the components of revenue available to the governing body for appropriation: income tax, sales tax, motor vehicle taxes, and so on. In and of itself this is interesting information and can be nicely done with a pie chart. The error occurs when two years are set side by side. Because the total revenues change year to year, but the size of the pie is fixed, no real information is provided and the relative size of each piece of the pie cannot be meaningfully compared.

    Histograms can be very helpful in understanding the data. Properly presented, they can be a quick visual way to present probabilities of different categories by the simple visual of comparing relative areas in each category. Here the error, purposeful or not, is to vary the width of the categories. This of course makes comparison to the other categories impossible. It does embellish the importance of the category with the expanded width because it has a greater area, inappropriately, and thus visually "says" that that category has a higher probability of occurrence.
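
    The effect of unequal widths can be illustrated with a few numbers. The classes below are hypothetical values chosen for illustration only: each class holds the same number of observations, but the last class is drawn twice as wide.

```python
# Hypothetical classes: (left, right, frequency). The last class is twice as wide.
classes = [(0, 10, 20), (10, 20, 20), (20, 40, 20)]

n = sum(f for _, _, f in classes)
# Area of each bar when the bar height is the raw frequency.
areas = [f * (right - left) for left, right, f in classes]
total_area = sum(areas)

for (left, right, f), area in zip(classes, areas):
    print(f"{left} to {right}: share of data {f / n:.2f}, share of area {area / total_area:.2f}")
```

    In this sketch each class holds one third of the data, yet the widened bar takes up half of the total area, which is the visual exaggeration described above.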

    Time series graphs perhaps are the most abused. A plot of some variable across time should never be presented on axes that change part way across the page either in the vertical or horizontal dimension. Perhaps the time frame is changed from years to months. Perhaps this is to save space or because monthly data was not available for early years. In either case this confounds the presentation and destroys any value of the graph. If this is not done to purposefully confuse the reader, then it certainly is either lazy or sloppy work.

    Changing the units of measurement of the axis can smooth out a drop or accentuate one. If you want to show large changes, then measure the variable in small units, pennies rather than thousands of dollars. And of course, to continue the fraud, be sure that the axis does not begin at zero, zero. If it begins at zero, zero, then it becomes apparent that the axis has been manipulated.
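
    A small calculation shows how much a shifted axis can exaggerate a difference. The two values and the function name drawn_ratio below are hypothetical and serve only as an illustration.

```python
def drawn_ratio(a, b, axis_start):
    """Ratio of the plotted heights of b and a when the axis starts at axis_start."""
    return (b - axis_start) / (a - axis_start)


a, b = 98.0, 100.0   # hypothetical values that differ by about 2%

honest = drawn_ratio(a, b, 0)       # axis starts at zero: about 1.02
truncated = drawn_ratio(a, b, 95)   # axis starts at 95: about 1.67

print(round(honest, 2), round(truncated, 2))
```

    With the axis starting at 95, the second value appears roughly two-thirds taller than the first, even though the underlying difference is only about 2%.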

    Perhaps you have a client who is concerned with the volatility of the portfolio you manage. An easy way to present the data is to use long time periods on the time series graph. Use months or, better, quarters rather than daily or weekly data. If that doesn't get the volatility down, then spread the time axis relative to the rate of return or portfolio valuation axis. If you want to show "quick" dramatic growth, then shrink the time axis. Any positive growth will show visually "high" growth rates. Do note that if the growth is negative, then this trick will show the portfolio collapsing at a dramatic rate.

    Again, the goal of descriptive statistics is to convey meaningful visuals that tell the story of the data. Purposeful manipulation is fraud and unethical at the worst, but even at its best, making these types of errors will lead to confusion on the part of the reader.


    This page titled 2.2: Display Data is shared under a CC BY 4.0 license and was authored, remixed, and/or curated by OpenStax via source content that was edited to the style and standards of the LibreTexts platform.