2.3: Histograms, Ogives, Frequency Polygons, and Time Series
Histograms
For most of the work you do in this book, you will use a histogram to display the data. One advantage of a histogram is that it can readily display large data sets. A rule of thumb is to use a histogram when the data set consists of 100 values or more, but we may work with smaller data sets from time to time for demonstration and practice purposes.
Histogram: a graph of the frequencies on the vertical axis and the class boundaries on the horizontal axis. Rectangles where the height is the frequency and the width is the class width are drawn for each class. The boxes are contiguous (adjoining). The horizontal axis is labeled with what the data represents (for instance, distance from your home to school). The vertical axis is labeled either frequency or relative frequency (or percent frequency or probability).
Characteristics of a histogram
- The graph will have the same shape if either frequency or relative frequency is used.
- Quantitative data are in specific orders since you are dealing with numbers. Distribution or the shape of the graph only changes a little bit depending on how many categories you set up.
- The graph can give you the shape of the data, the center, and the spread of the data.
- Quantitative data creates numerical categories, and the numbers are determined by how many categories (or what are called classes) you choose.
- If two people have the same number of categories, then they will have the same frequency distribution.
- The graph for quantitative data looks similar to a bar graph, which is primarily used for qualitative data (seen in the next section), but there are some major differences.
- In a bar graph the categories can be put in any order on the horizontal axis.
- There is no set order for these data values.
- You can’t say how the data is distributed based on the shape, since the shape can change just by putting the categories in different orders.
- In a bar graph, the categories that you made in the frequency table were determined by you.
- There can be many different categories depending on the point of view of the table's creator.
- Bar graphs typically have gaps between the bars to show that the categories do not continue on like they do in quantitative data.
Since the graph for quantitative data is different from qualitative data, it is given a new name. The name of the graph is a histogram.
To create a histogram, you must first create the frequency distribution. The idea of a frequency distribution is to take the interval that the data spans and divide it up into equal subintervals called classes.
Frequency distribution: the organization of raw data into a table format using categories, or classes, and frequencies.
Summary: Create a frequency distribution:
- Find the range = largest value – smallest value
- Pick the number of classes to use. Usually the number of classes is between five and twenty. Five classes are used if there are a small number of data points and twenty classes if there are a large number of data points (over 1000 data points). (Note: categories will be called classes from now on.)
- Class width = \(\frac{\text { range }}{\# \text { classes }}\) Always round up to the next integer (even if the answer is already a whole number go to the next integer). If you don’t do this, your last class will not contain your largest data value, and you would have to add another class just for it. If you round up, then your largest data value will fall in the last class, and there are no issues.
- Create the classes. Each class has limits that determine which values fall in each class. To find the class limits, set the smallest value as the lower class limit for the first class. Then add the class width to the lower class limit to get the next lower class limit. Repeat until you get all the classes. The upper class limit for a class is one less than the lower limit for the next class.
- For the classes to actually touch, one class needs to start where the previous one ends. This shared value is known as the class boundary. To find the class boundaries, subtract 0.5 from the lower class limit and add 0.5 to the upper class limit.
- Sometimes it is useful to find the class midpoint. The process is
Midpoint \(=\frac{\text { lower limit }+\text { upper limit }}{2}\)
- To figure out the number of data points that fall in each class, go through each data value and see which class boundaries it is between. Utilizing tally marks may be helpful in counting the data values. The frequency for a class is the number of data values that fall in the class.
The above description is for data values that are whole numbers. If your data values have decimal places, then your class width should be rounded up to the nearest value with the same number of decimal places as the original data.
Ex: Given data values of 4.5, 6.3, 5.6, 1.2, etc., if we calculated a class width of 1.235, then we would round our class width to 1.3, not 2.
In addition, your class boundaries should have one more decimal place than the original data.
Ex: For the above data set, instead of adding and subtracting 0.5, we need one more decimal place than the data, so we subtract 0.05 from each lower class limit and add 0.05 to each upper class limit.
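The rounding rule for the class width can be sketched in Python. This is only an illustration: the function name `class_width` and its `decimals` parameter are hypothetical, and exact fractions are used so that floating-point error does not affect the rounding.

```python
from fractions import Fraction
from math import floor

def class_width(data_range, n_classes, decimals=0):
    """Range / number of classes, rounded UP to `decimals` places.

    Following the rule in the text, an exact result is still bumped
    to the next value (for example, 5 becomes 6 when decimals=0).
    """
    scale = 10 ** decimals
    exact = Fraction(str(data_range)) / n_classes  # exact arithmetic
    units = floor(exact * scale) + 1               # always move up
    return float(Fraction(units, scale))

print(class_width(2200, 7))      # whole-number data
print(class_width(1.235, 1, 1))  # the 1.235 example, one decimal place
print(class_width(10, 2))        # exact quotient still moves up
```

With whole-number data the default `decimals=0` applies; for data with one decimal place, pass `decimals=1`.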
Table 2.2.1 contains the amount of rent paid every month for 24 students from a statistics course. Make a relative frequency distribution using 7 classes.
| 1500 | 1350 | 350 | 1200 | 850 | 900 |
| 1500 | 1150 | 1500 | 900 | 1400 | 1100 |
| 1250 | 600 | 610 | 960 | 890 | 1325 |
| 900 | 800 | 2550 | 495 | 1200 | 690 |
Table 2.2.1: Data of Monthly Rent
Solution:
- Find the range:
largest value - smallest value \(= 2550-350=2200\)
- Pick the number of classes:
The directions say to use 7 classes.
- Find the class width:
width \(=\frac{\text { range }}{7}=\frac{2200}{7} \approx 314.286\)
Round up to 315.
Always round up to the next integer even if the width is already an integer.
- Find the class limits:
Start at the smallest value. This is the lower class limit for the first class. Add the width to get the lower limit of the next class. Keep adding the width to get all the lower limits.
\(350+315=665, \quad 665+315=980, \quad 980+315=1295, \ldots\)
The upper limit is one less than the next lower limit, so for the first class the upper class limit would be \(665-1=664\).
When you have all 7 classes, make sure the last number, in this case 2554, is at least as large as the largest value in the data. If not, you made a mistake somewhere.
- Find the class boundaries:
Subtract 0.5 from the lower class limit to get the class boundaries. Add 0.5 to the upper class limit for the last class's boundary.
\(350-0.5=349.5, \quad 665-0.5=664.5, \quad 980-0.5=979.5, \quad 1295-0.5=1294.5, \ldots\)
Every value in the data should fall into exactly one of the classes. No data values should fall right on the boundary of two classes.
- Find the class midpoints:
midpoint \(=\frac{\text { lower limit }+\text { upper limit }}{2}\)
\(\frac{350+664}{2}=507, \quad \frac{665+979}{2}=822, \ldots\)
- Tally and find the frequency of the data:
Go through the data and put a tally mark in the appropriate class for each piece of data by looking to see which class boundaries the data value is between. Fill in the frequency by changing each of the tallies into a number.
| Class Limits | Class Boundaries | Class Midpoint | Tally | Frequency |
|---|---|---|---|---|
| 350-664 | 349.5-664.5 | 507 | \( |||| \) | 4 |
| 665-979 | 664.5-979.5 | 822 | \(\cancel{||||} |||\) | 8 |
| 980-1294 | 979.5-1294.5 | 1137 | \(\cancel{||||}\) | 5 |
| 1295-1609 | 1294.5-1609.5 | 1452 | \(\cancel{||||} |\) | 6 |
| 1610-1924 | 1609.5-1924.5 | 1767 | | 0 |
| 1925-2239 | 1924.5-2239.5 | 2082 | | 0 |
| 2240-2554 | 2239.5-2554.5 | 2397 | \( | \) | 1 |
Table 2.2.2: Frequency Distribution for Monthly Rent
Make sure the total of the frequencies is the same as the number of data points.
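As a check on the hand computation, the steps of this example can be sketched in Python. The function name `frequency_table` and the dictionary keys are illustrative choices, and the sketch assumes whole-number data (so upper limits are one less than the next lower limit and boundaries are offset by 0.5).

```python
rents = [1500, 1350, 350, 1200, 850, 900,
         1500, 1150, 1500, 900, 1400, 1100,
         1250, 600, 610, 960, 890, 1325,
         900, 800, 2550, 495, 1200, 690]

def frequency_table(data, n_classes, width):
    """Build class limits, boundaries, midpoints, and frequencies."""
    lowest = min(data)
    rows = []
    for k in range(n_classes):
        lower = lowest + k * width   # lower class limit
        upper = lower + width - 1    # one less than the next lower limit
        rows.append({
            "limits": (lower, upper),
            "boundaries": (lower - 0.5, upper + 0.5),
            "midpoint": (lower + upper) / 2,
            "frequency": sum(lower <= x <= upper for x in data),
        })
    return rows

table = frequency_table(rents, n_classes=7, width=315)
for row in table:
    print(row["limits"], row["boundaries"], row["midpoint"], row["frequency"])

# The frequencies should account for every data point.
print(sum(row["frequency"] for row in table) == len(rents))
```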
Now that we have a frequency distribution, it's time to draw a frequency histogram, or just histogram for short. If sketching by hand, start with a number line that runs from your lowest class boundary to your largest class boundary. Then divide the axis into equal intervals, labeling the sides of the rectangles with your remaining class boundaries as you sketch from left to right. If your first class boundary is close to 0, then sketch a proportionate distance to the left where zero should be and draw your vertical axis. Label it with scaled values that fit your frequencies. That means you might choose to count by threes, fives, or higher instead of ones. If your first class boundary is not close to 0, then still draw a little more line to its left, but this time use // to indicate that you have truncated your axis. Then draw your vertical axis.
With your axes in place, now sketch the rectangles for each class where their height is based on the frequency of each class. Remember, since we are using class boundaries for the horizontal axes, the bars should share a common side where the boundaries overlap.
Draw a histogram for the distribution from Example 2.2.1.
Solution:
The class boundaries are plotted on the horizontal axis and the frequencies are plotted on the vertical axis. You can plot the midpoints of the classes instead of the class boundaries. Graph 2.2.1 was created using the midpoints because it was easier to do with the software that created the graph.
Graph 2.2.1: Histogram for Monthly Rent
Notice the graph has the axes labeled, the tick marks representing the class midpoints are labeled on each axis, and there is a title. It is important that your graphs (all graphs) are clearly labeled.
Reviewing the graph you can see that most of the students pay around $750 per month for rent, with about $1500 being the other common value. You can see from the graph, that most students pay between $600 and $1600 per month for rent. Of course, these values are just estimates from the graph. There is a large gap between the $1500 class and the highest data value. This seems to say that one student is paying a great deal more than everyone else. This value could be considered an outlier. An outlier is a data value that is far from the rest of the values. It may be an unusual value or a mistake. It is a data value that should be investigated. In this case, the student lives in a very expensive part of town, thus the value is not a mistake, and is just very unusual. There are other aspects that can be discussed, but first some other concepts need to be introduced.
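For a quick look at the shape without graphing software, a rough text version of the histogram can be printed. This is a minimal sketch, not the method used to produce Graph 2.2.1: the bars run horizontally, the helper name `text_histogram` is hypothetical, and the midpoints and frequencies are typed in from the monthly rent frequency distribution.

```python
midpoints = [507, 822, 1137, 1452, 1767, 2082, 2397]
frequencies = [4, 8, 5, 6, 0, 0, 1]

def text_histogram(labels, counts, symbol="#"):
    """Return one line per class: label, a bar of symbols, and the count."""
    return [f"{label:>5} | {symbol * count} ({count})"
            for label, count in zip(labels, counts)]

bars = text_histogram(midpoints, frequencies)
print("\n".join(bars))
```

The two empty rows before the last class make the gap, and the isolated high value, easy to spot.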
Frequencies are helpful, but understanding the relative size each class is to the total is also useful. To find this you can divide the frequency by the total to create a relative frequency. If you have the relative frequencies for all of the classes, then you have a relative frequency distribution.
Relative Frequency Distribution
A variation on a frequency distribution is a relative frequency distribution. Instead of giving the frequencies for each class, the relative frequencies are calculated.
Relative frequency \(=\frac{\text { frequency }}{\# \text { of data points }}\)
This gives you percentages of data that fall in each class.
Find the relative frequency for the monthly rent data.
Solution:
From Example 2.2.1, the frequency distribution is reproduced in Table 2.2.2.
| Class Limits | Class Boundaries | Class Midpoint | Frequency |
|---|---|---|---|
| 350-664 | 349.5-664.5 | 507 | 4 |
| 665-979 | 664.5-979.5 | 822 | 8 |
| 980-1294 | 979.5-1294.5 | 1137 | 5 |
| 1295-1609 | 1294.5-1609.5 | 1452 | 6 |
| 1610-1924 | 1609.5-1924.5 | 1767 | 0 |
| 1925-2239 | 1924.5-2239.5 | 2082 | 0 |
| 2240-2554 | 2239.5-2554.5 | 2397 | 1 |
Table 2.2.2: Frequency Distribution for Monthly Rent
Divide each frequency by the number of data points.
\(\frac{4}{24}=0.17, \quad \frac{8}{24}=0.33, \quad \frac{5}{24}=0.21, \ldots\)
| Class Limits | Class Boundaries | Class Midpoint | Frequency | Relative Frequency |
|---|---|---|---|---|
| 350-664 | 349.5-664.5 | 507 | 4 | 0.17 |
| 665-979 | 664.5-979.5 | 822 | 8 | 0.33 |
| 980-1294 | 979.5-1294.5 | 1137 | 5 | 0.21 |
| 1295-1609 | 1294.5-1609.5 | 1452 | 6 | 0.25 |
| 1610-1924 | 1609.5-1924.5 | 1767 | 0 | 0 |
| 1925-2239 | 1924.5-2239.5 | 2082 | 0 | 0 |
| 2240-2554 | 2239.5-2554.5 | 2397 | 1 | 0.04 |
| Total | | | 24 | 1 |
Table 2.2.3: Relative Frequency Distribution for Monthly Rent
The relative frequencies should add up to 1 or 100%. (This might be off a little due to rounding errors.)
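The division step can be sketched in a few lines of Python. The variable names are illustrative, and rounding to two decimal places matches the table.

```python
frequencies = [4, 8, 5, 6, 0, 0, 1]
total = sum(frequencies)  # number of data points

# Relative frequency = frequency / number of data points
relative = [round(f / total, 2) for f in frequencies]

print(relative)
print(round(sum(relative), 2))  # should be 1, up to rounding error
```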
The graph of the relative frequency is known as a relative frequency histogram. It looks identical to the frequency histogram, but the vertical axis is relative frequency instead of just frequencies.
Draw a relative frequency histogram for the monthly rent distribution from Example 2.2.1.
Solution:
The class boundaries are plotted on the horizontal axis and the relative frequencies are plotted on the vertical axis. The labels on the horizontal axis are once again displayed as the midpoints due to software limitations. (This is not easy to do in R, so use another technology to graph a relative frequency histogram.)
Graph 2.2.2: Relative Frequency Histogram for Monthly Rent
Notice the shape is the same as the frequency distribution.
Ogives
Cumulative frequency distribution: a distribution that shows the number of data values that fall below the upper class boundary of each class.
Another useful piece of information is how many data points fall below a particular class boundary. As an example, a teacher may want to know how many students received below an 80%, a doctor may want to know how many adults have cholesterol below 160, or a manager may want to know how many stores gross less than $2000 per day. This is known as a cumulative frequency. If you want to know what percent of the data falls below a certain class boundary, then this would be a cumulative relative frequency.
To create a cumulative frequency distribution, count the number of data points that are below the upper class boundary, starting with the first class and working up to the top class. The last upper class boundary should have all of the data points below it. Also include the number of data points below the lowest class boundary, which is zero.
Create a cumulative frequency distribution for the data in Example 2.2.1.
Solution:
The frequency distribution for the data is in Table 2.2.2.
| Class Limits | Class Boundaries | Class Midpoint | Frequency |
|---|---|---|---|
| 350-664 | 349.5-664.5 | 507 | 4 |
| 665-979 | 664.5-979.5 | 822 | 8 |
| 980-1294 | 979.5-1294.5 | 1137 | 5 |
| 1295-1609 | 1294.5-1609.5 | 1452 | 6 |
| 1610-1924 | 1609.5-1924.5 | 1767 | 0 |
| 1925-2239 | 1924.5-2239.5 | 2082 | 0 |
| 2240-2554 | 2239.5-2554.5 | 2397 | 1 |
Table 2.2.2: Frequency Distribution for Monthly Rent
Now ask yourself how many data points fall below each class boundary. Below 349.5 there are 0 data points, below 664.5 there are 4 data points, below 979.5 there are 4 + 8 = 12 data points, and below 1294.5 there are 4 + 8 + 5 = 17 data points. Continue this process until you reach the last upper class boundary. This is summarized in Table 2.2.4a and Table 2.2.4b.
| Class Limits | Class Boundaries | Class Midpoint | Frequency | Cumulative Frequency |
|---|---|---|---|---|
| 350-664 | 349.5-664.5 | 507 | 4 | 4 |
| 665-979 | 664.5-979.5 | 822 | 8 | 12 |
| 980-1294 | 979.5-1294.5 | 1137 | 5 | 17 |
| 1295-1609 | 1294.5-1609.5 | 1452 | 6 | 23 |
| 1610-1924 | 1609.5-1924.5 | 1767 | 0 | 23 |
| 1925-2239 | 1924.5-2239.5 | 2082 | 0 | 23 |
| 2240-2554 | 2239.5-2554.5 | 2397 | 1 | 24 |
Table 2.2.4a: Cumulative Distribution for Monthly Rent
| Rent Paid ($) | Cumulative Frequency |
|---|---|
| Less than 349.5 | 0 |
| Less than 664.5 | 4 |
| Less than 979.5 | 12 |
| Less than 1294.5 | 17 |
| Less than 1609.5 | 23 |
| Less than 1924.5 | 23 |
| Less than 2239.5 | 23 |
| Less than 2554.5 | 24 |
Table 2.2.4b: Cumulative Distribution for Monthly Rent
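The running totals can also be produced programmatically. The sketch below uses `itertools.accumulate` from the Python standard library; the list names are illustrative, and a leading 0 is added for the lowest class boundary.

```python
from itertools import accumulate

boundaries = [349.5, 664.5, 979.5, 1294.5, 1609.5, 1924.5, 2239.5, 2554.5]
frequencies = [4, 8, 5, 6, 0, 0, 1]

# Zero data points fall below the lowest boundary; after that,
# each entry is the running total of the class frequencies.
cumulative = [0] + list(accumulate(frequencies))

for boundary, count in zip(boundaries, cumulative):
    print(f"Less than {boundary}: {count}")
```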
Ogive: a graph that represents the cumulative frequencies for the classes in a frequency distribution.
To create an ogive, first create a scale on both the horizontal and vertical axes that will fit the data as described for a histogram.
Then plot points that have an x-coordinate of the upper class boundary and a y-coordinate of the cumulative frequency. Make sure you include the point with the lowest class boundary and a cumulative frequency of 0. The last point of the graph should always be the final class's upper class boundary and the total number of data points.
Then just connect the dots.
Draw an ogive for the data in Example 2.2.1.
Solution:
Using the upper class boundary and its corresponding cumulative frequency, plot the points as ordered pairs on the axes. Then connect the dots. You should have a line graph that rises as you move from left to right.
Graph 2.2.3: Ogive for Monthly Rent
An ogive allows the reader to find out how many students pay less than a certain value, and also what amount of monthly rent is paid by a certain number of students. As an example, suppose you want to know how many students pay less than $1500 a month in rent. Go up from $1500 until you hit the graph, and then go over to the cumulative frequency axis to see what value corresponds to it. It appears that around 20 students pay less than $1500. (See Graph 2.2.4.)
Graph 2.2.4: Ogive for Monthly Rent with Example
Also, if you want to know the amount that 15 students pay less than, start at 15 on the vertical axis, go over to the graph, and then go down to the horizontal axis. You can see that 15 students pay less than about $1200 a month. (See Graph 2.2.5.)
Graph 2.2.5: Ogive for Monthly Rent with Example
If you graph the cumulative relative frequency then you can find out what percentage is below a certain number instead of just the number of people below a certain value.
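Reading an ogive by eye amounts to linear interpolation along its line segments. The following sketch makes that explicit for the monthly rent ogive; the function names `count_below` and `value_for_count` are hypothetical, and the results are estimates in the same sense as the graph readings, since the actual data are not spread evenly within each class.

```python
boundaries = [349.5, 664.5, 979.5, 1294.5, 1609.5, 1924.5, 2239.5, 2554.5]
cumulative = [0, 4, 12, 17, 23, 23, 23, 24]
points = list(zip(boundaries, cumulative))

def count_below(value):
    """Go up from `value` to the ogive, then over to the vertical axis."""
    if value <= points[0][0]:
        return points[0][1]
    if value >= points[-1][0]:
        return points[-1][1]
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        if x0 <= value <= x1:
            return y0 + (y1 - y0) * (value - x0) / (x1 - x0)

def value_for_count(count):
    """Go over from `count` to the ogive, then down to the horizontal axis."""
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        if y1 > y0 and y0 <= count <= y1:
            return x0 + (x1 - x0) * (count - y0) / (y1 - y0)
    return None  # count is outside the range of the ogive

print(round(count_below(1500), 1))  # about 20.9 students
print(value_for_count(15))          # 1168.5 dollars
```

Both values are in line with the visual estimates of around 20 students and about $1200.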
Frequency Polygons
Frequency polygons are analogous to line graphs, and just as line graphs make continuous data visually easy to interpret, so too do frequency polygons.
Frequency polygon: a graph that displays the data by using lines to connect points created using the class midpoint for the x-coordinate and the frequency for the y-coordinate. The graph will start and end on the x-axis to create an enclosed shape, a polygon.
To construct a frequency polygon, first create a frequency distribution as described earlier.
Find the class midpoints. In order to create the enclosed shape of the polygon, find the midpoints of the class before the first class and the class after the last class. Another way of thinking about it is to count one class width to the left of the first midpoint and one class width to the right of the last midpoint. Since those values are outside of the range of the table, they will have frequencies of 0, allowing the graph to connect to the x-axis. After choosing the appropriate ranges, begin plotting the data points. After all the points are plotted, draw line segments to connect them.
A frequency polygon was constructed from the frequency table below.
| Lower Bound | Upper Bound | Midpoint | Frequency |
|---|---|---|---|
| 49.5 | 59.5 | 54.5 | 5 |
| 59.5 | 69.5 | 64.5 | 10 |
| 69.5 | 79.5 | 74.5 | 30 |
| 79.5 | 89.5 | 84.5 | 40 |
| 89.5 | 99.5 | 94.5 | 15 |
The first label on the x-axis is 44.5. This represents the midpoint of the class that would come before the first class: 39.5 to 49.5. Or, just subtract the class width from the first class's midpoint: 54.5 - 10 = 44.5. Since the lowest test score is 54.5, this interval is used only to allow the graph to touch the x-axis.
The point labeled 54.5 represents the next interval, or the first “real” interval from the table, and contains five scores.
This reasoning is followed for each of the remaining intervals, with the point 104.5 representing the interval from 99.5 to 109.5. Again, this interval contains no data and is only used so that the graph will touch the x-axis. Looking at the graph, we say that this distribution is skewed because one side of the graph does not mirror the other side. More on the shape of a graph later.
Construct a frequency polygon of U.S. Presidents’ ages at inauguration shown in the Table.
| Age at Inauguration | Frequency |
|---|---|
| 41.5–46.5 | 4 |
| 46.5–51.5 | 11 |
| 51.5–56.5 | 14 |
| 56.5–61.5 | 9 |
| 61.5–66.5 | 4 |
| 66.5–71.5 | 2 |
Answer:
The first label on the x-axis is 39. This represents an interval extending from 36.5 to 41.5. Since there are no ages less than 41.5, this interval is used only to allow the graph to touch the x-axis. The point labeled 44 represents the next interval, or the first “real” interval from the table, and contains four scores. This reasoning is followed for each of the remaining intervals with the point 74 representing the interval from 71.5 to 76.5. Again, this interval contains no data and is only used so that the graph will touch the x-axis. Looking at the graph, we say that this distribution is skewed because one side of the graph does not mirror the other side.
Figure \(\PageIndex{7}\): This figure shows a graph entitled, 'President's Age at Inauguration.' The x-axis is labeled 'Ages' and is marked off at 39, 44, 49, 54, 59, 64, 69 and 74. The y-axis is labeled, 'Frequency,' and is marked off in intervals of 1 from 0 to 15. The following points are plotted and a line connects one to the other to create the frequency polygon: (39, 0), (44, 4), (49, 11), (54, 14), (59, 9), (64, 4), (69, 2), (74, 0).
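The points of this frequency polygon can be generated from the table with a short Python sketch. The helper name `polygon_points` is hypothetical, and the sketch assumes equal class widths so that the two extra zero-frequency midpoints sit one class width outside the table.

```python
def polygon_points(midpoints, frequencies):
    """Midpoint/frequency pairs, closed off with a zero at each end."""
    width = midpoints[1] - midpoints[0]  # assumes equal class widths
    xs = [midpoints[0] - width] + list(midpoints) + [midpoints[-1] + width]
    ys = [0] + list(frequencies) + [0]
    return list(zip(xs, ys))

# Class boundaries and frequencies for presidents' ages at inauguration
boundaries = [41.5, 46.5, 51.5, 56.5, 61.5, 66.5, 71.5]
frequencies = [4, 11, 14, 9, 4, 2]

midpoints = [(lo + hi) / 2 for lo, hi in zip(boundaries, boundaries[1:])]
points = polygon_points(midpoints, frequencies)
print(points)
```

Connecting these points in order with line segments gives the polygon.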
Frequency polygons are useful for comparing distributions. This is achieved by overlaying the frequency polygons drawn for different data sets.
We will construct an overlay frequency polygon comparing the scores from Example with the students’ final numeric grade.
| Lower Bound | Upper Bound | Frequency | Cumulative Frequency |
|---|---|---|---|
| 49.5 | 59.5 | 5 | 5 |
| 59.5 | 69.5 | 10 | 15 |
| 69.5 | 79.5 | 30 | 45 |
| 79.5 | 89.5 | 40 | 85 |
| 89.5 | 99.5 | 15 | 100 |
| Lower Bound | Upper Bound | Frequency | Cumulative Frequency |
|---|---|---|---|
| 49.5 | 59.5 | 10 | 10 |
| 59.5 | 69.5 | 10 | 20 |
| 69.5 | 79.5 | 30 | 50 |
| 79.5 | 89.5 | 45 | 95 |
| 89.5 | 99.5 | 5 | 100 |
Shapes of the distribution:
When you look at a distribution, look at the basic shape. There are some basic shapes that are seen in histograms. Realize though that some distributions have no recognizable shape. The common shapes are symmetric, skewed, and uniform. Another feature of interest is how many peaks a graph has. This is known as its modality.
Symmetric means that you can fold the graph in half down the middle and the two sides will line up. You can think of the two sides as being mirror images of each other. Skewed means one “tail” of the graph is longer than the other. The graph is skewed in the direction of the longer tail (backwards from what you would expect). A uniform graph has all the bars the same height.
Modal refers to the number of peaks. Unimodal has one peak and bimodal has two peaks. Usually if a graph has more than two peaks, the modal information is no longer of interest.
Other important features to consider are gaps between bars, a repetitive pattern, how spread out is the data, and where the center of the graph is.
Examples of Graphs:
This graph is roughly symmetric and unimodal:
Graph 2.2.6: Symmetric, Unimodal Graph
This graph is symmetric and bimodal:
Graph 2.2.7: Symmetric, Bimodal Graph
This graph is skewed to the right:
Graph 2.2.8: Skewed Right Graph
This graph is skewed to the left and has a gap:
Graph 2.2.9: Skewed Left Graph
This graph is uniform since all the bars are the same height:
Graph 2.2.10: Uniform Graph
The following data represents the percent change in tuition levels at public, four-year colleges (inflation adjusted) from 2008 to 2013 (Weissmann, 2013). Create a frequency distribution, histogram, and ogive for the data.
| 19.5% | 40.8% | 57.0% | 15.1% | 17.4% | 5.2% | 13.0% |
| 15.6% | 51.5% | 15.6% | 14.5% | 22.4% | 19.5% | 31.3% |
| 21.7% | 27.0% | 13.1% | 26.8% | 24.3% | 38.0% | 21.1% |
| 9.3% | 46.7% | 14.5% | 78.4% | 67.3% | 21.1% | 22.4% |
| 5.3% | 17.3% | 17.5% | 36.6% | 72.0% | 63.2% | 15.1% |
| 2.2% | 17.5% | 36.7% | 2.8% | 16.2% | 20.5% | 17.8% |
| 30.1% | 63.6% | 17.8% | 23.2% | 25.3% | 21.4% | 28.5% |
| 9.4% | | | | | | |
Table 2.2.5: Data of Tuition Levels at Public, Four-Year Colleges
Solution:
- Find the range:
largest value - smallest value = \(78.4\)% \(-2.2\)% \(=76.2\)%
- Pick the number of classes:
Since there are 50 data points, around 6 to 8 classes should be used. Let's use 8.
- Find the class width:
width \(=\frac{\text { range }}{8}=\frac{76.2 \%}{8} \approx 9.525 \%\)
Since the data has one decimal place, the class width should be rounded to one decimal place. Make sure you round up.
width \(=9.6\)%
- Find the class limits:
\(2.2 \%+9.6 \%=11.8 \%, \quad 11.8 \%+9.6 \%=21.4 \%, \quad 21.4 \%+9.6 \%=31.0 \%, \ldots\)
- Find the class boundaries:
Since the data has one decimal place, the class boundaries should have two decimal places, so subtract 0.05 from the lower class limit to get the class boundaries. Add 0.05 to the upper class limit for the last class’s boundary.
\(2.2-0.05=2.15 \%, \quad 11.8-0.05=11.75 \%, \quad 21.4-0.05=21.35 \%, \ldots\)
Every value in the data should fall into exactly one of the classes. No data values should fall right on the boundary of two classes.
- Find the class midpoints:
midpoint \(=\frac{\text { lower limit }+\text { upper limit }}{2}\)
\(\frac{2.2+11.7}{2}=6.95 \%, \quad \frac{11.8+21.3}{2}=16.55 \%, \ldots\)
- Tally and find the frequency of the data:
| Class Limits | Class Boundaries | Class Midpoint | Tally | Frequency | Relative Frequency | Cumulative Frequency |
|---|---|---|---|---|---|---|
| 2.2-11.7 | 2.15-11.75 | 6.95 | \(\cancel{||||} |\) | 6 | 0.12 | 6 |
| 11.8-21.3 | 11.75-21.35 | 16.55 | \(\cancel{||||} \cancel{||||} \cancel{||||} \cancel{||||}\) | 20 | 0.40 | 26 |
| 21.4-30.9 | 21.35-30.95 | 26.15 | \(\cancel{||||} \cancel{||||} |\) | 11 | 0.22 | 37 |
| 31.0-40.5 | 30.95-40.55 | 35.75 | \( |||| \) | 4 | 0.08 | 41 |
| 40.6-50.1 | 40.55-50.15 | 45.35 | \( || \) | 2 | 0.04 | 43 |
| 50.2-59.7 | 50.15-59.75 | 54.95 | \( || \) | 2 | 0.04 | 45 |
| 59.8-69.3 | 59.75-69.35 | 64.55 | \( ||| \) | 3 | 0.06 | 48 |
| 69.4-78.9 | 69.35-78.95 | 74.15 | \( || \) | 2 | 0.04 | 50 |
Table 2.2.6: Frequency Distribution for Tuition Levels at Public, Four-Year Colleges
Make sure the total of the frequencies is the same as the number of data points.
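The same construction can be sketched in Python for data with one decimal place. This is an illustrative check, not part of the original solution: it uses the standard library `decimal` module so that sums such as 2.2 + 9.6 are exact, and the variable names are assumptions.

```python
from decimal import Decimal

tuition = """
19.5 40.8 57.0 15.1 17.4 5.2 13.0
15.6 51.5 15.6 14.5 22.4 19.5 31.3
21.7 27.0 13.1 26.8 24.3 38.0 21.1
9.3 46.7 14.5 78.4 67.3 21.1 22.4
5.3 17.3 17.5 36.6 72.0 63.2 15.1
2.2 17.5 36.7 2.8 16.2 20.5 17.8
30.1 63.6 17.8 23.2 25.3 21.4 28.5
9.4
""".split()
data = [Decimal(v) for v in tuition]

width = Decimal("9.6")  # class width, rounded up to one decimal place
step = Decimal("0.1")   # smallest increment in the data
lowest = min(data)

rows = []
running = 0
for k in range(8):
    lower = lowest + k * width
    upper = lower + width - step  # one step below the next lower limit
    freq = sum(lower <= x <= upper for x in data)
    running += freq               # cumulative frequency
    rows.append((lower, upper, (lower + upper) / 2, freq, running))

for lower, upper, mid, freq, cum in rows:
    print(f"{lower}-{upper}  midpoint {mid}  frequency {freq}  cumulative {cum}")
```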
Graph 2.2.11: Histogram for Tuition Levels at Public, Four-Year Colleges
This graph is skewed right, with no gaps. This says that most percent increases in tuition were around 16.55%, with very few states having a percent increase greater than 45.35%.
Graph 2.2.12: Ogive for Tuition Levels at Public, Four-Year Colleges
Looking at the ogive, you can see that 30 states had a percent change in tuition levels of about 25% or less.
There are occasions where the class limits in the frequency distribution are predetermined. Example 2.2.8 demonstrates this situation.
The following are the percentage grades of 25 students from a statistics course. Make a frequency distribution and histogram.
| 62 | 87 | 81 | 69 | 87 | 62 | 45 | 95 | 76 | 76 |
| 62 | 71 | 65 | 67 | 72 | 80 | 40 | 77 | 87 | 58 |
| 84 | 73 | 93 | 64 | 89 |
Table 2.2.7: Data of Test Grades
Solution:
Since these data are percent grades, it makes more sense to make the classes in multiples of 10, since grades are usually grouped as 90 to 100%, 80 to 90%, and so forth. It is easier not to use the class boundaries, but instead to use the class limits and think of the upper class limit as being up to but not including the next class's lower limit. For example, the class 80-90 means a grade of 80% up to but not including 90%. A student with an 89.9% would be in the 80-90 class.
| Class Limits | Class Midpoint | Tally | Frequency |
|---|---|---|---|
| 40-50 | 45 | \( || \) | 2 |
| 50-60 | 55 | \( | \) | 1 |
| 60-70 | 65 | \( \cancel{||||} || \) | 7 |
| 70-80 | 75 | \( \cancel{||||} | \) | 6 |
| 80-90 | 85 | \( \cancel{||||} || \) | 7 |
| 90-100 | 95 | \( || \) | 2 |
Table 2.2.8: Frequency Distribution for Test Grades
Graph 2.2.13: Histogram for Test Grades
It appears that most of the students scored between 60% and 90%. This graph looks somewhat symmetric and also bimodal: the same number of students earned between 60 and 70% as between 80 and 90%.
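The tally for the test grades can be checked with a short script. This is a sketch that follows the "up to but not including the next lower limit" rule from the solution; counting a grade of exactly 100 in the last class is an assumption added here, since the text describes that class as 90 to 100%.

```python
# The 25 percentage grades from the data table.
grades = [
    62, 87, 81, 69, 87, 62, 45, 95, 76, 76,
    62, 71, 65, 67, 72, 80, 40, 77, 87, 58,
    84, 73, 93, 64, 89,
]

lower_limits = [40, 50, 60, 70, 80, 90]
width = 10

frequencies = []
for i, lower in enumerate(lower_limits):
    upper = lower + width
    is_last = i == len(lower_limits) - 1
    # Each class includes its lower limit and excludes its upper limit.
    # Assumption: the last class also includes a grade of exactly 100.
    count = sum(
        1 for g in grades
        if lower <= g < upper or (is_last and g == upper)
    )
    frequencies.append(count)

for lower, count in zip(lower_limits, frequencies):
    print(f"{lower}-{lower + width}: {count}")

# Every grade should land in exactly one class.
print(sum(frequencies) == len(grades))
```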
Constructing a Time Series Graph
Suppose that we want to study the temperature range of a region for an entire month. Every day at noon we note the temperature and write this down in a log. A variety of statistical studies could be done with this data. We could find the mean or the median temperature for the month. We could construct a histogram displaying the number of days that temperatures reach a certain range of values. However, all of these methods ignore a portion of the data that we have collected.
One feature of the data that we may want to consider is that of time. Since each date is paired with the temperature reading for the day, we don't have to think of the data as being random. We can instead use the times given to impose a chronological order on the data. A graph that recognizes this ordering and displays the changing temperature as the month progresses is called a time series graph.
To construct a time series graph, we must look at both pieces of our paired data set. We start with a standard Cartesian coordinate system. The horizontal axis is used to plot the date or time increments, and the vertical axis is used to plot the values of the variable that we are measuring. By doing this, we make each point on the graph correspond to a date and a measured quantity. The points on the graph are typically connected by straight lines in the order in which they occur.
The following data shows the Annual Consumer Price Index, each month, for ten years. Construct a time series graph for the Annual Consumer Price Index data only.
| Year | Jan | Feb | Mar | Apr | May | Jun | Jul |
|---|---|---|---|---|---|---|---|
| 2003 | 181.7 | 183.1 | 184.2 | 183.8 | 183.5 | 183.7 | 183.9 |
| 2004 | 185.2 | 186.2 | 187.4 | 188.0 | 189.1 | 189.7 | 189.4 |
| 2005 | 190.7 | 191.8 | 193.3 | 194.6 | 194.4 | 194.5 | 195.4 |
| 2006 | 198.3 | 198.7 | 199.8 | 201.5 | 202.5 | 202.9 | 203.5 |
| 2007 | 202.416 | 203.499 | 205.352 | 206.686 | 207.949 | 208.352 | 208.299 |
| 2008 | 211.080 | 211.693 | 213.528 | 214.823 | 216.632 | 218.815 | 219.964 |
| 2009 | 211.143 | 212.193 | 212.709 | 213.240 | 213.856 | 215.693 | 215.351 |
| 2010 | 216.687 | 216.741 | 217.631 | 218.009 | 218.178 | 217.965 | 218.011 |
| 2011 | 220.223 | 221.309 | 223.467 | 224.906 | 225.964 | 225.722 | 225.922 |
| 2012 | 226.665 | 227.663 | 229.392 | 230.085 | 229.815 | 229.478 | 229.104 |
| Year | Aug | Sep | Oct | Nov | Dec | Annual |
|---|---|---|---|---|---|---|
| 2003 | 184.6 | 185.2 | 185.0 | 184.5 | 184.3 | 184.0 |
| 2004 | 189.5 | 189.9 | 190.9 | 191.0 | 190.3 | 188.9 |
| 2005 | 196.4 | 198.8 | 199.2 | 197.6 | 196.8 | 195.3 |
| 2006 | 203.9 | 202.9 | 201.8 | 201.5 | 201.8 | 201.6 |
| 2007 | 207.917 | 208.490 | 208.936 | 210.177 | 210.036 | 207.342 |
| 2008 | 219.086 | 218.783 | 216.573 | 212.425 | 210.228 | 215.303 |
| 2009 | 215.834 | 215.969 | 216.177 | 216.330 | 215.949 | 214.537 |
| 2010 | 218.312 | 218.439 | 218.711 | 218.803 | 219.179 | 218.056 |
| 2011 | 226.545 | 226.889 | 226.421 | 226.230 | 225.672 | 224.939 |
| 2012 | 230.379 | 231.407 | 231.317 | 230.221 | 229.601 | 229.594 |
- Answer
The following table is a portion of a data set from www.worldbank.org. Use the table to construct a time series graph for CO2 emissions for the United States.
| Year | Ukraine | United Kingdom | United States |
|---|---|---|---|
| 2003 | 352,259 | 540,640 | 5,681,664 |
| 2004 | 343,121 | 540,409 | 5,790,761 |
| 2005 | 339,029 | 541,990 | 5,826,394 |
| 2006 | 327,797 | 542,045 | 5,737,615 |
| 2007 | 328,357 | 528,631 | 5,828,697 |
| 2008 | 323,657 | 522,247 | 5,656,839 |
| 2009 | 272,176 | 474,579 | 5,299,563 |
- Answer
Figure \(\PageIndex{8}\): This is a time series graph that matches the supplied data. The x-axis shows years from 2003 to 2012, and the y-axis shows the annual CPI.
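As a rough illustration of the construction, the annual CPI values from the table above can be put in chronological order and paired with their years before plotting. The sketch below is one possible approach, not the method used to produce the figure; `plot_time_series` is a hypothetical helper that assumes the third-party matplotlib library is installed, and it is defined but not called here.

```python
# Annual CPI values copied from the "Annual" column of the CPI table.
annual_cpi = {
    2003: 184.0, 2004: 188.9, 2005: 195.3, 2006: 201.6, 2007: 207.342,
    2008: 215.303, 2009: 214.537, 2010: 218.056, 2011: 224.939, 2012: 229.594,
}

# Impose chronological order: time goes on the horizontal axis.
points = sorted(annual_cpi.items())
years = [year for year, _ in points]
values = [value for _, value in points]

# Years in which the annual CPI fell relative to the previous year.
drops = [years[i] for i in range(1, len(values)) if values[i] < values[i - 1]]
print(drops)  # prints [2009]


def plot_time_series(xs, ys):
    """Draw the points connected by straight lines (requires matplotlib)."""
    import matplotlib.pyplot as plt  # imported lazily; assumed to be installed

    fig, ax = plt.subplots()
    ax.plot(xs, ys, marker="o")
    ax.set_xlabel("Year")
    ax.set_ylabel("Annual CPI")
    ax.set_title("Annual Consumer Price Index, 2003-2012")
    plt.show()
```

Calling `plot_time_series(years, values)` would draw the connected points; the single year-over-year decline, in 2009, is hard to notice in the table but shows up at once as a dip in the line.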
Uses of a Time Series Graph
Time series graphs are important tools in various applications of statistics. When recording values of the same variable over an extended period of time, sometimes it is difficult to discern any trend or pattern. However, once the same data points are displayed graphically, some features jump out. Time series graphs make trends easy to spot.
There are other types of data summaries for quantitative data. They will be explored in the next section.
Review
A histogram is a graphic version of a frequency distribution. The graph consists of bars of equal width drawn adjacent to each other. The horizontal scale represents classes of quantitative data values and the vertical scale represents frequencies. The heights of the bars correspond to frequency values. Histograms are typically used for large, continuous, quantitative data sets. An ogive allows the reader to quickly determine how many data values fall below a specific threshold. A frequency polygon can also be used when graphing large data sets with data points that repeat. The data values usually go on the x-axis, with the frequency graphed on the y-axis. Time series graphs can be helpful when looking at large amounts of data for one variable over a period of time.
References
- Data on annual homicides in Detroit, 1961–73, from Gunst & Mason’s book ‘Regression Analysis and its Application’, Marcel Dekker
- “Timeline: Guide to the U.S. Presidents: Information on every president’s birthplace, political party, term of office, and more.” Scholastic, 2013. Available online at www.scholastic.com/teachers/a...-us-presidents (accessed April 3, 2013).
- “Presidents.” Fact Monster. Pearson Education, 2007. Available online at http://www.factmonster.com/ipka/A0194030.html (accessed April 3, 2013).
- “Food Security Statistics.” Food and Agriculture Organization of the United Nations. Available online at http://www.fao.org/economic/ess/ess-fs/en/ (accessed April 3, 2013).
- “Consumer Price Index.” United States Department of Labor: Bureau of Labor Statistics. Available online at http://data.bls.gov/pdq/SurveyOutputServlet (accessed April 3, 2013).
- “CO2 emissions (kt).” The World Bank, 2013. Available online at http://databank.worldbank.org/data/home.aspx (accessed April 3, 2013).
- “Births Time Series Data.” General Register Office For Scotland, 2013. Available online at www.gro-scotland.gov.uk/stati...me-series.html (accessed April 3, 2013).
- “Demographics: Children under the age of 5 years underweight.” Indexmundi. Available online at http://www.indexmundi.com/g/r.aspx?t=50&v=2224&aml=en (accessed April 3, 2013).
- Gunst, Richard, Robert Mason. Regression Analysis and Its Application: A Data-Oriented Approach. CRC Press: 1980.
- “Overweight and Obesity: Adult Obesity Facts.” Centers for Disease Control and Prevention. Available online at http://www.cdc.gov/obesity/data/adult.html (accessed September 13, 2013).
Glossary
- Frequency
- the number of times a value of the data occurs
- Histogram
- a graphical representation in \(x-y\) form of the distribution of data in a data set; \(x\) represents the data and \(y\) represents the frequency, or relative frequency. The graph consists of contiguous rectangles.
- Relative Frequency
- the ratio of the number of times a value of the data occurs in the set of all outcomes to the number of all outcomes


