
7.4: Sampling Bias or Error


    However random or convenient the sample is, we still have a problem: no matter what we do, we will never know whether we got a sample that represents the full range of the population. We call this problem sampling bias or error: inadvertently sampling just a portion of the population. That sample is not the best representation of the population. It means you got the "wrong" sample, and by wrong, we mean that your sample pertains to only one area of the population, too high or too low. If the sample is too high or too low, we have overshot or undershot the population mean.

    What’s the big deal? The consequence is that we draw conclusions about the population based on that sample, but the conclusion may not actually represent the population. Note that you will always make some sampling bias or error: every sample carries it, because we have no access to the full population to compare the sample against. With no population to serve as an answer key, we can never tell whether we are right or wrong.

    Back to our question, “How do you know your random sample is the right sample?” The answer is: you don’t. There is no way to tell, because of sampling bias or error. Sampling bias or error is not a bad thing. You will always make one, because you have no idea what the truth, the population mean, is.

    7.4.1: How to Address Sampling Bias or Error

    There is no way to remove sampling bias or error, but we can address it, and the solution is to keep sampling. A single sample of N = 1 will not work, so collect more data by taking many samples. The collection of means from those repeated samples is known as the sampling distribution: the distribution of sample means. If we constructed a frequency plot of the sample means, putting all the samples together, it should eventually center on the population mean.
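    The idea of repeated sampling can be sketched in a short simulation. Everything below is made up for illustration: the "population" is 10,000 invented commute times, and we draw many random samples of 25 from it, recording each sample's mean. The collection of those means is the sampling distribution, and its center sits close to the population mean.

```python
import random
import statistics

random.seed(42)

# Hypothetical population: 10,000 commute times (minutes), skewed toward shorter trips.
population = [random.expovariate(1 / 30) for _ in range(10_000)]
pop_mean = statistics.mean(population)

# Draw many random samples and record each sample's mean.
sample_means = [
    statistics.mean(random.sample(population, 25))  # one sample of n = 25
    for _ in range(1_000)
]

# The collection of sample means is the sampling distribution.
# Its center should sit close to the population mean, and its spread
# is much narrower than the spread of the raw population values.
print(f"population mean:        {pop_mean:.1f}")
print(f"mean of sample means:   {statistics.mean(sample_means):.1f}")
print(f"spread of sample means: {statistics.stdev(sample_means):.1f}")
```

    Note that any single sample mean can land too high or too low (the sampling error), but averaged across many samples the means converge on the population mean.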

    This is the second reason why we need a large sample size.

    7.4.2: Standard Error of the Mean

    We keep sampling. But we cannot keep sampling forever.

    How do we know when to stop sampling? We really do not know, but the standard error of the mean gives us an indicator of whether our sample mean is getting close to the population mean, the truth. The standard error of the mean is defined as the standard deviation of the sample mean: a single value that tells us how close our sample mean is likely to be to the true population mean. We need some gauge of when we can stop sampling, and the standard error is that gauge.

    Why is it called an error? Because we always make mistakes when we use the sample mean to estimate the population mean. The nature of the mistake is that the sample mean is too high or too low compared to the population mean.

    How do we calculate the standard error of the mean? Take the sample standard deviation and divide it by the square root of the sample size: \( SE_{\bar{x}} = s / \sqrt{n} \). As we collect more observations, if the sample mean stops shifting or changing very much, we are getting closer and closer to the population mean.
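    As a concrete sketch: the standard error of the mean is conventionally computed as the sample standard deviation divided by the square root of the sample size. The travel times below are made-up numbers, not data from the text.

```python
import math
import statistics

# Hypothetical sample: minutes to travel to a practicum site.
times = [25, 40, 35, 50, 30, 45, 20, 55, 38, 42]

n = len(times)
s = statistics.stdev(times)   # sample standard deviation
sem = s / math.sqrt(n)        # standard error of the mean: s / sqrt(n)

print(f"sample mean:    {statistics.mean(times):.1f} minutes")
print(f"standard error: {sem:.2f} minutes")
```

    Because \( \sqrt{n} \) sits in the denominator, the standard error shrinks as the sample grows, which is exactly the "less shifting with more observations" behavior described here.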

    What exactly is shifting in the standard error? Every time you add a new observation, the mean shifts. How much of a shift? The shift depends on the observations themselves. The mean shifts a lot if you get highly variable observations. The mean shifts a little bit if you get observations that are close to the mean.

    The shifts in the mean become minimal the more we keep sampling. With fewer observations, the mean shifts more, because each new, variable response carries more weight. With more observations, each new response shifts the mean less, because you are converging on it. This is the third reason we need a large sample size: more observations and more samples mean smaller shifts in the mean.

    Here is a demonstration: If you are a psychology graduate student, chances are you have to travel to practicum sites. Ask each student to state how long it takes them to travel to a practicum site, and write down each response. The first response almost always does not represent the classroom’s average travel time. Now ask another student, and take the average of the first two responses. Note the average. Then ask another student, and take the average of the first three responses. Notice the difference between the first average and the second. Keep collecting responses until you have asked the entire classroom. With each new average, you will notice the difference between consecutive averages decreasing to the point where the decrease is minimal. This minimal decrease is the behavior of the standard error as you keep collecting data.
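    The classroom demonstration can be mimicked in code. The "responses" here are invented travel times drawn from a made-up distribution; the point is only to watch the running average shift a lot early and very little late.

```python
import random
import statistics

random.seed(7)

# Hypothetical classroom: each student's travel time to a practicum site (minutes).
responses = [random.gauss(40, 12) for _ in range(30)]

# Recompute the running average after each new response and record the shift.
prev_avg = responses[0]
shifts = []
for k in range(2, len(responses) + 1):
    avg = statistics.mean(responses[:k])
    shifts.append(abs(avg - prev_avg))
    prev_avg = avg

# Early shifts are large; late shifts are small.
early = statistics.mean(shifts[:5])
late = statistics.mean(shifts[-5:])
print(f"average shift over the first five updates: {early:.2f} minutes")
print(f"average shift over the last five updates:  {late:.2f} minutes")
```

    Running this shows the same pattern as the classroom exercise: each additional response moves the average less than the one before it.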

    7.4.3: Interpreting Standard Error of the Mean

    We want the standard error to be small. Notice that it is called a standard error. In a sense, we always want to minimize errors, so we want these errors to be small.

    The typical range of a standard error in the examples here is 0 to roughly 1.5 (the scale of a standard error depends on the units of your data). You want the value to be as close to 0 as possible, because that means the sample mean is not that different from the population mean, or that the sample mean does not shift much each time you add a new observation.

    Note that the standard error of the mean will never be 0, because 0 would mean you got exactly the correct sample mean, with no error. That outcome is impossible to achieve: every time you sample, you add new information, and the mean shifts.

    What happens is that the standard error will get near zero the more we sample, but it will never actually be zero because, conceptually, we can never get the correct, or population, mean.

    What standard error values are good? A low standard error, say roughly .01 to .03 in these examples, is good. A value in this range means the sample mean is very close to the population mean: it does not shift much each time we add more observations.

    If the standard error is higher than .03, is that bad? Not necessarily, but if you have a small sample size, you might want to consider collecting more data, if you can. Sometimes we cannot collect more data, due to resource limitations or a population that is hard to sample. There are also other issues to consider regarding the accuracy of your sample beyond the standard error alone.

    By the way, despite its importance, you rarely see the standard error reported in journal articles. Statisticians have called the standard error of the mean the “root of all statistical evil,” because there are situations where it is hard to estimate. In the example above, the standard error predictably decreased as you kept sampling. However, that only works when the sampling occurs at random: each observation must be independent of the others, meaning that taking one observation has no bearing on how the next one is taken.

    There are quite a few situations where observations depend on other factors. For example, better access to mental health resources is associated with higher income, so those two factors are not independent of each other. The problem arises with nested data, when participants are more alike than independent. For example, residents of the same neighborhood block tend to have similar incomes rather than independently varying ones. So instead of sampling 10 residents, you have effectively sampled one block of 10 similar incomes, and it is difficult to tell whether you have 10 independent sources of income or one. When your observations are more alike than varied, the standard error is difficult to determine. Discussion of this issue is extensive, and you are encouraged to review resources on dependent samples.
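    A small simulation shows why nested data makes the standard error misleading. The block structure and all income numbers below are invented: residents of one block share a block-level income, so the 10 observations behave more like one observation, and the naive standard error is far too optimistic about how much the sample mean actually varies from block to block.

```python
import math
import random
import statistics

random.seed(1)

def block_sample(n_blocks=1, per_block=10):
    """Hypothetical incomes (in $1,000s): neighbors on a block are alike."""
    incomes = []
    for _ in range(n_blocks):
        block_level = random.gauss(60, 20)         # blocks differ a lot
        incomes += [random.gauss(block_level, 3)   # neighbors differ a little
                    for _ in range(per_block)]
    return incomes

# The naive standard error from one block of 10 residents looks small...
one_block = block_sample()
naive_sem = statistics.stdev(one_block) / math.sqrt(len(one_block))

# ...but across many blocks, the sample mean varies far more than that.
means = [statistics.mean(block_sample()) for _ in range(2_000)]
true_sd_of_mean = statistics.stdev(means)

print(f"naive SEM from one block:     {naive_sem:.1f}")
print(f"actual SD of the sample mean: {true_sd_of_mean:.1f}")
```

    The naive formula treats the 10 similar residents as 10 independent sources of information, which is exactly the mistake described above.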

    Suffice it to say, the standard error is rarely reported, but it is an important concept, because it is how we determine that we have the best possible sample mean.

    Back to the original question: how do you know you got the correct sample? Use the standard error of the mean. If it is near zero, you got the best sample estimate possible. Is it really the correct sample? No, because the population parameter (the truth) is always unknown. At the end of the day, the sample size, the standard error, and your recruitment method are all you have to determine whether you got the “right sample,” the one that best represents the population.

    The goal is to use a sample to estimate the population mean. But each time we take a random sample, it is possible that our sample mean is not a good estimate of the population mean. That is sampling bias/error. And since we always make some sampling bias/error, because a sample will never perfectly equate to the population, what do we do?

    Solution: keep sampling. Build a sampling distribution of means. Technically speaking, if we keep sampling, we should eventually get something close to the population mean. But how many samples should we get?

    The solution is to use the standard error of the mean. The standard error tells us how much the sample mean will shift with each new sample. With each new sample mean added to the previous ones, if the overall mean does not shift too much, we are probably converging on the true population parameter.
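    One simple way to operationalize "keep sampling until the mean stops shifting" is to stop once the standard error drops below a chosen threshold. The sketch below uses invented Gaussian data, an arbitrary threshold, and an arbitrary minimum sample size; real studies rarely stop this mechanically, but the convergence behavior is the point.

```python
import math
import random
import statistics

random.seed(3)

def draw():
    """One new hypothetical observation (e.g., a score with mean 100, SD 15)."""
    return random.gauss(100, 15)

threshold = 1.0          # illustrative: stop when the standard error drops below 1
data = [draw(), draw()]  # need at least two points for a standard deviation

# Keep sampling until we have a reasonable minimum n AND a small standard error.
while (len(data) < 30
       or statistics.stdev(data) / math.sqrt(len(data)) >= threshold):
    data.append(draw())

sem = statistics.stdev(data) / math.sqrt(len(data))
print(f"stopped after n = {len(data)} observations, SE = {sem:.2f}")
```

    Because the standard error shrinks roughly as \( s / \sqrt{n} \), the loop always terminates, and the final mean barely moves with each additional observation.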

    So that’s it? Is this the process for determining if a sample represents the population? Nope. We need one more concept.


    This page titled 7.4: Sampling Bias or Error is shared under a CC BY 4.0 license and was authored, remixed, and/or curated by Peter Ji.