9.3: Location
While there are many characteristics of data that can be considered, most studies concentrate on a few key characteristics that are easy for both experts and novices to interpret. One of the primary concerns with summarizing a set of data is to get some idea as to how big a typical value in the data may be. This is particularly relevant when comparing two or more sets of data. Earlier we considered two lists of salaries and wished to determine if the individuals in one of the groups tended to earn more than the individuals in the other group. Of course, some data points in a set of data may overlap with the other. That is, individuals in one group may make more than the other group and vice-versa, but we are interested in an overall summary of the size of the data in each table. Statisticians and data scientists answer these types of questions by summarizing the location or central tendency of the data.
The idea that a set of data has a location means that there is some mathematical structure to the data. At the very least we should be able to determine if one datum is greater than, equal to, or less than another datum. In this section we will insist on a little bit more structure as it is most common to measure the location of data that has been measured on an interval or ratio measurement scale. While there are some measures of location for ordinal data, they are much less common in practice.
A measure of location of a set of data is a measure that summarizes the set of data with a single value that represents the middle or center of the data.
This is an abstract concept, and there is not one unique method for finding the location of a set of data. Rather, there are several methods that attempt to measure this concept, all in slightly different ways. We will consider the two most popular methods for measuring location that are used in most simple statistical analyses.
The first method for summarizing the location of a set of data is one that most individuals are familiar with: an average of the numbers. In the field of statistics and data science, the average is usually called the mean.
To compute the mean of a set of data, add all the values in the data set and divide by the number of values in the data set.
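The same rule can be written as a formula. The notation here is an assumption introduced only for illustration and is not used elsewhere in this section: \(x_1, x_2, \ldots, x_n\) stand for the \(n\) data values and \(\bar{x}\) stands for their mean.

\[\bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n}.\]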
Suppose that we have five incomes (in thousands of dollars): 23, 54, 44, 77, and 36. Find the mean.
Solution
To find the mean income we first add all the data values to get
\[23 + 54 + 44 + 77 + 36 = 234.\]
Once the values have been added, we divide by the number of observations. In this case there are five, so the mean is equal to
\[234\div 5 = 46.8.\]
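The two steps of the solution, adding the values and dividing by their count, can be sketched in a few lines of Python. This is a minimal illustration rather than a prescribed method: the function name `mean` and the variable names are chosen here for the example, and the data are the five values that appear in the sum above.

```python
def mean(values):
    """Return the sum of the values divided by the number of values."""
    if not values:
        raise ValueError("the mean of an empty data set is undefined")
    return sum(values) / len(values)


# Incomes in thousands of dollars, as they appear in the sum above.
incomes = [23, 54, 44, 77, 36]

total = sum(incomes)          # step 1: add all the data values
average = mean(incomes)       # step 2: divide by the number of observations

print(total, average)         # 234 46.8
```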
The relevant question is how does the mean measure the location of a set of data? The idea behind the mean is based on the concept of center of mass used in physics. For the data given in the example, visualize a balance beam that has been marked like a ruler at every whole number. For this example, we will only include the part of this ruler needed to show the data. Now think about placing a small weight on the ruler at each point where there is a data point in the data set. For this to work, all the weights would need to have the same mass. Now the question is, if the ruler and the weights were placed on a point, where could we locate the point so that the ruler and weights would be perfectly balanced? As demonstrated in Figure \(\PageIndex{1}\), the answer to this turns out to be the mean!
In Figure \(\PageIndex{1}\) we can visually observe how this interpretation works. Note that the balance point is slightly to the right of the third largest value in the data set to make up for the fact that the largest datum is farther out to the right than the bulk of the data points. This interpretation of the mean can tell us a lot about its properties and can be useful when considering whether the mean is a good measure of location for a specific set of data when evaluating a research study. For example, it should be clear that if we take the largest data point in the previous example and add to it, essentially moving it farther right on the number line in Figure \(\PageIndex{1}\), then the mean will also have to move to the right for the line to keep balance.
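The balance-beam interpretation can also be checked numerically. In the sketch below, which assumes equal weights placed at the five values summed in the first example, the total distance of the points to the left of the mean is compared with the total distance of the points to the right. The variable names are illustrative only.

```python
import math

# Positions of the equal weights on the ruler (incomes in thousands of dollars).
incomes = [23, 54, 44, 77, 36]

balance_point = sum(incomes) / len(incomes)

# Signed distance of each weight from the proposed balance point.
deviations = [x - balance_point for x in incomes]

# Total "pull" on each side of the balance point.
left_pull = -sum(d for d in deviations if d < 0)
right_pull = sum(d for d in deviations if d > 0)

# The beam balances when the two sides pull equally (up to rounding error).
print(balance_point, math.isclose(left_pull, right_pull))
```

If the balance point were placed anywhere other than the mean, the two totals would differ and the beam would tip toward the heavier side.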
Suppose that we have five incomes (in thousands of dollars): 23, 54, 44, 97, and 36. This is the same data as was used in the previous example with 77 replaced by 97. Find the mean.
Solution
To find the mean income, we first add all the data values to get
\[23+54+44+97+36 = 254.\]
Once the values have been added, we divide by the number of observations. There are five observations, so the mean is equal to
\[254\div 5=50.8. \]
We can observe that the mean increased from 46.8 to 50.8 to keep balance when the datum 77 was moved to the right to 97 (Figure \(\PageIndex{2}\)). We can now imagine moving the largest data point even farther to the right, for example to 135.
Suppose that we have five incomes (in thousands of dollars): 23, 54, 44, 135, and 36. This is the same set of data used in the previous examples with 77 replaced by 135. Find the mean.
Solution
To find the mean income we first add all the data values to get
\[ 23+54+44+135+36=292.\]
Once the values have been added, we divide by the number of observations. In this case, there are five observations, so the mean is equal to
\[292\div 5=58.4.\]
From the previous example and Figure \(\PageIndex{3}\) we can observe that when the largest value is 135 the mean is greater than four of the five data points, or 80% of the data. In this case, many statisticians and data scientists question whether the mean is really providing a good measure of the location of the data. Indeed, the problem gets worse the farther the rightmost point is shifted. For example, if 77 is replaced by 1,000, the mean would be equal to 231.4, which is more than four times the second largest value in the set of data. Statisticians say that the mean is not robust to outliers because of this problem.
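The effect of moving the largest value can be reproduced with a short loop. The sketch below keeps the four smaller incomes fixed, using the values that appear in the sums above, and recomputes the mean for each choice of the largest value discussed in the text. It uses `statistics.mean` from the Python standard library, and the variable names are illustrative.

```python
from statistics import mean

# The four incomes that stay fixed, as they appear in the sums above.
fixed_incomes = [23, 54, 44, 36]

results = {}
for largest in (77, 97, 135, 1000):
    data = fixed_incomes + [largest]
    m = mean(data)
    # Count how many of the five values fall below the mean.
    below = sum(1 for x in data if x < m)
    results[largest] = (m, below)
    print(f"largest = {largest:>4}: mean = {m:.1f}, "
          f"values below the mean = {below} of {len(data)}")
```

Each increase in the largest value pulls the mean to the right, even though the other four incomes never change.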
Is the behavior of the mean a problem in practical applications? Consider the data given in Table 9.3, which corresponds to 100 observations of salaries from a large employer that have been simulated for the purpose of the example. One can observe from the table that the salaries have quite a range, starting at around $20,000 and going all the way up to $919,000. One can also observe, and we will see this better in subsequent chapters, that many salaries are relatively low, around $20,000 to $40,000, and only a few are higher. This type of behavior is quite typical of salary data at any company. There are usually many relatively low salaries with a few very large salaries.
Table 9.3 Simulated salaries of 100 employees in thousands of dollars.
| 20 | 44 | 23 | 23 | 27 | 30 | 24 | 110 | 28 | 141 |
| 51 | 23 | 24 | 31 | 29 | 26 | 21 | 325 | 97 | 42 |
| 21 | 33 | 24 | 19 | 21 | 27 | 31 | 24 | 58 | 21 |
| 20 | 36 | 55 | 34 | 30 | 27 | 58 | 47 | 24 | 39 |
| 28 | 42 | 162 | 29 | 46 | 21 | 99 | 89 | 44 | 40 |
| 32 | 41 | 69 | 29 | 25 | 31 | 43 | 23 | 37 | 44 |
| 27 | 34 | 40 | 24 | 32 | 29 | 19 | 919 | 41 | 44 |
| 22 | 25 | 35 | 27 | 26 | 29 | 34 | 27 | 35 | 27 |
| 20 | 78 | 26 | 36 | 22 | 24 | 34 | 46 | 20 | 70 |
| 21 | 26 | 24 | 29 | 31 | 34 | 114 | 39 | 26 | 29 |
If the mean is computed on the salary data, we get that the mean salary for the company is $50,070. That is, the president of the company can boast that the average worker at their company earns about $50,000. While there are certainly employees at the company who make that salary, and more, a close look at Table 9.3 reveals that 84% of the employees at the company make less than that salary. From this viewpoint the mean seems to imply that the salaries at the company are much better than they seem to most of the employees.
The mean is generally not considered a good measure of the location of data when there are values that are far away from the bulk of the data. These data values, called outliers, can either be very large (to the right of most of the data) or can be very small (to the left of most of the data), and the mean will be pulled in the direction of the outlying value. In these cases, a measure of location that divides the data in two halves, an upper half and a lower half, is often considered a better measure of location. This measure is called a median.
A median is any point which divides the lower half from the upper half of the data.
Before proceeding to some examples, there are some things that should be pointed out about a median. If there is an odd number of data values, then the median will be one of the data points. If there is an even number of data values, then the median is any value between the two halves of the data. Traditionally, for the case of an even number of data points, the average of the two data points nearest the middle of the data is taken as the median. Be aware that not all researchers or statistical computing software use this convention, but we will use it in this book.
Suppose that we have five incomes (in thousands of dollars): 23, 54, 44, 77, and 26. Find the median.
Solution
To find the median income we first need to sort the data values from smallest to largest: 23, 26, 44, 54, and 77. Because there is an odd number of data points, the median will be the data point that divides the upper half of the data from the lower half. This point is 44 since there are exactly two data points less than 44 and exactly two data points above 44.
Suppose that we have six incomes (in thousands of dollars): 23, 54, 44, 77, 34, and 26. Find the median.
Solution
To find the median income we first need to sort the data values from smallest to largest: 23, 26, 34, 44, 54, and 77. Because there is an even number of data points, the median will be any point that divides the upper half of the data from the lower half, that is, any point greater than 34 but less than 44. Following the usual convention, the two values closest to the middle are averaged together so that the median is
\[(34+44)\div 2=39.\]
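The procedure used in the two median examples (sort the data, then take the middle value, or average the two middle values when the count is even) can be sketched as follows. This is one possible implementation of the convention adopted in this book. The function name `median` is chosen for the example, and other software may handle the even case differently.

```python
def median(values):
    """Median under the convention used here: the middle value of the sorted
    data, or the average of the two middle values when the count is even."""
    if not values:
        raise ValueError("the median of an empty data set is undefined")
    ordered = sorted(values)          # sort from smallest to largest
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:                    # odd count: a single middle value
        return ordered[mid]
    # even count: average the two values nearest the middle
    return (ordered[mid - 1] + ordered[mid]) / 2


five_incomes = [23, 54, 44, 77, 26]
six_incomes = [23, 54, 44, 77, 34, 26]

print(median(five_incomes))   # 44
print(median(six_incomes))    # 39.0
```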
In the previous examples the interpretation of the median is very simple: the median is the point that divides the upper half of the data from the lower half. That is, half of the salaries are above the median and half are below.
Note that the median does not share the mean's sensitivity to outliers. In the earlier examples for the mean, the largest value in the data set was first changed from 77 to 97, and then from 77 to 135. In each case the mean became larger to balance the increasing value of the largest data point. The median does not change when only the largest value is increased.
Suppose that we have five incomes (in thousands of dollars): 23, 54, 44, 97, and 26. Find the median.
Solution
To find the median income we first need to sort the data values from smallest to largest: 23, 26, 44, 54, and 97. Because there is an odd number of data points, the median will be the data point that divides the upper half of the data from the lower half. This point is 44 since there are exactly two data points less than 44 and exactly two data points above 44. Note that the value of the median for this example matches the median for the same set of data where 77 was used in place of 97.
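The same check can be repeated for several choices of the largest value. The sketch below keeps the four smaller incomes from this example fixed and recomputes the median with `statistics.median` from the Python standard library. The replacement values 135 and 1,000 are borrowed from the earlier discussion of the mean, and the variable names are illustrative.

```python
from statistics import median

# The four incomes that stay fixed in the median examples.
fixed_incomes = [23, 54, 44, 26]

# Recompute the median for each choice of the largest value.
medians = {
    largest: median(fixed_incomes + [largest])
    for largest in (77, 97, 135, 1000)
}

for largest, m in medians.items():
    print(f"largest = {largest:>4}: median = {m}")
```

Because only the position of the middle value in the sorted list matters, the median stays at 44 no matter how far the largest value is moved to the right.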
Because the median is more resistant to outliers, it is often preferred to the mean when the data may have very large, or very small, data values compared to the bulk of the data. Salaries and income data are two instances where the median is often preferred over the mean because of the potential for such outliers. For the simulated salary data given in Table 9.3, the median is $30,000. This means that roughly half of the employees earn $30,000 or more and roughly half earn $30,000 or less. For this data, this value is a more realistic view of the typical salary at the company.
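The summary figures quoted for Table 9.3 can be recomputed directly from the table. The sketch below enters the 100 simulated salaries row by row and uses `statistics.mean` and `statistics.median` from the Python standard library. The variable names are illustrative, and the share of employees below the mean is computed with a simple count.

```python
from statistics import mean, median

# Table 9.3: simulated salaries in thousands of dollars, one table row per line.
salaries = [
    20, 44, 23, 23, 27, 30, 24, 110, 28, 141,
    51, 23, 24, 31, 29, 26, 21, 325, 97, 42,
    21, 33, 24, 19, 21, 27, 31, 24, 58, 21,
    20, 36, 55, 34, 30, 27, 58, 47, 24, 39,
    28, 42, 162, 29, 46, 21, 99, 89, 44, 40,
    32, 41, 69, 29, 25, 31, 43, 23, 37, 44,
    27, 34, 40, 24, 32, 29, 19, 919, 41, 44,
    22, 25, 35, 27, 26, 29, 34, 27, 35, 27,
    20, 78, 26, 36, 22, 24, 34, 46, 20, 70,
    21, 26, 24, 29, 31, 34, 114, 39, 26, 29,
]

mean_salary = mean(salaries)
median_salary = median(salaries)

# Number of employees whose salary is strictly below the mean.
below_mean = sum(1 for s in salaries if s < mean_salary)
share_below_mean = 100 * below_mean / len(salaries)

print(f"mean:   {mean_salary:.2f} thousand dollars")
print(f"median: {median_salary:.2f} thousand dollars")
print(f"below the mean: {share_below_mean:.0f}% of employees")
```

Run on the table as printed, this reproduces a mean of 50.07, a median of 30, and 84% of salaries below the mean, which matches the values quoted in the text.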
If the mean has problems with outliers, why not always use the median instead of the mean? In fact, the mean remains the most common measure of location, and the reason for this has to do with some theoretical properties of the mean that are beyond the scope of this book. What you need to know is that if a set of data is generally well behaved, the mean is a much better choice and has many theoretical properties that support its preference. In many of these cases the two measures will be close to one another. If there are outliers in the data, and under some additional conditions that will be discussed later, the median may be used in its place. You may find that some studies report both measures.
Are there other ways to measure the location of a set of data? Statisticians have developed many other methods for measuring the location of a set of data, and some of these have very impressive theoretical properties. The usual problem with these methods is that they can be complicated to compute and knowing when to use them is not always clear. For these reasons, the use of alternative measures is quite rare and familiarity with these methods is usually not required. Indeed, none of the research in any of the studies we use in this book relies on any measure of location other than the mean or median.

