Statistics LibreTexts

1.4: Sampling Methods

    Learning Objectives
    • Identify biased samples
    • Distinguish between methods of sampling
    • Distinguish between random sampling and random assignment

    Why We Sample

    Sampling plays a significant role in inferential statistics. Our goal is to use information from a smaller group (the sample), to learn about a larger group (the population). To make this work well, the sample should be large enough and chosen without bias so it fairly represents the population. There are many ways to choose a sample, and some methods work better than others.

    Simple Random Sampling

    Researchers use a variety of sampling methods, but the most straightforward is simple random sampling. In this method, every member of the population has an equal chance of being chosen, and choosing one person does not affect the chance of choosing anyone else (in other words, each selection is independent). Because of this, we can say that simple random sampling is based on pure chance.
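
    As a minimal sketch of the idea, the Python snippet below draws a simple random sample from a made-up list of ID numbers. The names `simple_random_sample` and `registry`, the sample size of 50, and the seed are assumptions chosen for illustration; the draw is made without replacement, which is the usual convention in practice.

```python
import random

def simple_random_sample(population, n, seed=None):
    """Return n distinct members, each member equally likely to be chosen."""
    rng = random.Random(seed)          # seed only makes the example repeatable
    return rng.sample(population, n)   # draws without replacement

# Hypothetical population: 1,000 ID numbers (illustrative data only).
registry = list(range(1, 1001))
sample = simple_random_sample(registry, 50, seed=1)
print(len(sample), len(set(sample)))   # 50 50
```

    Nothing about a member (name, position in the list, and so on) influences whether that member is chosen, which is what "based on pure chance" means here.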

    Text Exercise \(\PageIndex{1}\)

    Given the following situation, determine the population and the sample. Then determine if the sample was chosen using simple random sampling and if the sample is biased.

    A research scientist is interested in studying the experiences of twins raised together versus those raised apart. She uses a list of twins from the National Twin Registry and selects people for her study. First, she chooses all twins from the registry whose last name begins with \(Z.\) Then she chooses all twins from the registry whose last name begins with \(B.\) However, so many names start with \(B\) that the researcher decides to include only every other name in her sample. Finally, she mails out a survey and compares characteristics of twins raised apart versus together.

    Answer

    The population is all twins listed in the National Twin Registry. This means that the researcher should only make conclusions about twins in this registry, not all twins everywhere, because the registry may not represent all twins.

    The sample is all twins from the National Twin Registry with the last name starting with \(Z\) and every other set of twins from the Registry with the last name starting with \(B\).

    The sample was not chosen using simple random sampling since selecting only twins with last names starting with a certain letter does not give everyone an equal chance of being chosen. Choosing the specific letter, \(Z,\) may over-represent ethnic groups where it is common to have a last name starting with \(Z\). The same issue also applies to choosing twins with last names starting with \(B\).

    Another problem is choosing every other name from the \(B\) list. This "every-other-one" method (called systematic sampling) prevents adjacent names from being chosen together, which again shows that the sample was not selected using simple random sampling. Because of these issues, the sampling method is biased and the resulting sample may not fairly represent the population.

    Sample Size Matters

    Whether a sample is random depends on how it is chosen, not on the results. This implies that even when a sample is random, it may not represent the whole population, especially if the sample is small. For example, imagine a population with the same number of men and women. If we randomly chose only \(10\) people, there is a small chance (about \(4\%\)) that exactly \(80\%\) of those people are women. However, if we increase our randomly chosen sample to \(20\) people, there is an even smaller chance (about \(0.5\%\)) that exactly \(80\%\) of those people are women. This means that a randomly chosen sample is less likely to be badly unbalanced when it is larger. Since we know that a sample consisting of \(80\%\) women would not be a good representation of the population, even if it was chosen randomly, we can conclude that larger samples are more likely to represent the population accurately.
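
    The chances in this example can be worked out directly. The sketch below assumes the population is large enough that each person chosen is a woman with probability \(0.5,\) independently of the others (a binomial model, which is an assumption added here and not stated in the text). The function names are illustrative.

```python
from math import comb

def prob_exactly(n, k, p=0.5):
    """Binomial probability of exactly k women among n independent draws."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

def prob_at_least(n, k, p=0.5):
    """Probability of k or more women among n independent draws."""
    return sum(prob_exactly(n, j, p) for j in range(k, n + 1))

# Exactly 80% women: 8 of 10 versus 16 of 20.
print(f"{prob_exactly(10, 8):.4f}")    # 0.0439
print(f"{prob_exactly(20, 16):.4f}")   # 0.0046

# At least 80% women, for comparison.
print(f"{prob_at_least(10, 8):.4f}")   # 0.0547
print(f"{prob_at_least(20, 16):.4f}")  # 0.0059
```

    Under this model, doubling the sample size from \(10\) to \(20\) makes such a lopsided sample roughly nine times less likely.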

    Because of this, inferential statistics considers sample size when using sample data to make conclusions about a population. In later chapters, we will learn the math that helps adjust for sample size.

    Other Sampling Methods

    The goal of sampling is to choose a group that accurately represents the population so we can make sound conclusions. The easiest way to avoid creating a biased sample is to select your sample using simple random sampling. However, sometimes a population is already segregated naturally into clear groups that differ in important ways. In these cases, we want to make sure each group is represented in the sample.

    One method that helps with this is stratified random sampling. In this method, the population is naturally divided into smaller groups called strata. A random sample is then taken from each group. The size of each group in the sample must be proportional to its size in the population, in order to help ensure fair representation.

    Text Exercise \(\PageIndex{2}\)

    Suppose we want to study students' opinions about capital punishment at an urban university. We have the time and resources to interview \(200\) students. The student body is diverse in age: about \(30\%\) of students are older adults who work during the day and take night courses (average age of \(39),\) while the other \(70\%\) are younger students who usually take classes during the day (average age of \(19).\) It is possible that night students have different opinions about capital punishment than day students. How could we use stratified sampling to choose a fair, random sample?

    Answer

    Since \(70\%\) of the student body are day students, it makes sense that \(70\%\) of the sample should also be day students \((0.7\cdot 200 = 140).\) That means out of our sample of \(200\) students, \(140\) should be day students and \(60\) should be night students. This way, the sample matches the makeup of the entire student population, thus making our conclusions about all students at the university more reliable. However, we still need to ensure that the students within each group are chosen randomly.
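
    The allocation above can be sketched in Python. The enrollment figures (7,000 day students and 3,000 night students) and the function names are hypothetical, chosen only so that the strata match the \(70\%\) and \(30\%\) split in the exercise.

```python
import random

def proportional_allocation(strata_sizes, n):
    """Split a total sample size n across strata in proportion to their sizes.

    Rounding can make the quotas miss n by one or two in general; it is
    exact for the figures used below.
    """
    total = sum(strata_sizes.values())
    return {name: round(n * size / total) for name, size in strata_sizes.items()}

def stratified_sample(strata, n, seed=None):
    """Draw a simple random sample of the allotted size from every stratum."""
    rng = random.Random(seed)
    sizes = {name: len(members) for name, members in strata.items()}
    quotas = proportional_allocation(sizes, n)
    return {name: rng.sample(members, quotas[name])
            for name, members in strata.items()}

# Hypothetical enrollment lists (illustrative data only).
strata = {
    "day": [f"day-{i}" for i in range(7000)],
    "night": [f"night-{i}" for i in range(3000)],
}
sample = stratified_sample(strata, 200, seed=1)
print({name: len(chosen) for name, chosen in sample.items()})
# {'day': 140, 'night': 60}
```

    Every stratum contributes to the sample, and within each stratum the students are still chosen at random.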

    It is important to note that simple random sampling does not necessarily eliminate all bias. Rather, it makes sure that any possible bias happens only by chance. However, ensuring that all requirements of simple random sampling are met also means that we might end up with a sample that is difficult to collect. Collecting data using this method is sometimes expensive and time-consuming. It would be much easier if we could collect a random sample that is also easy to access. While this is not always possible, having more information about the population can help us choose a better sampling method than simple random sampling.

    For another method to work well, the population must be well-mixed. This means that we must be able to divide the population into smaller sections, where each section is similar to the others with respect to the topic being studied.

    One method that works in this situation is cluster random sampling. This method is used when a population is naturally divided into smaller groups, called clusters, and these groups are similar to each other. First, a random selection of clusters is chosen. Then, there are two ways to collect data. In single-stage cluster sampling, every member of the selected clusters is studied. In double-stage cluster sampling, a random sample is taken from each selected cluster. Double-stage sampling saves time and money, but single-stage sampling is usually better because it includes more people.

    Both cluster sampling and stratified sampling divide the population into groups. The difference is that in stratified sampling, members are chosen from every group, while in cluster sampling, members are only chosen from some of the groups.

    Text Exercise \(\PageIndex{3}\)

    Suppose we want to learn how voters feel about a school bond for their local high school (assume there is only one high school in the city). Would cluster sampling be a good way to choose a random sample? If so, how could single-stage cluster sampling be used?

    Answer

    The population includes all registered voters in the city and a natural way to divide this population is by voter precincts. Because there is only one high school in the city, families are not competing with other schools for funding. Opinions about the school bond may depend on whether people have children as well as the age of their children, but families at different life stages are likely spread across all precincts. This means there are no major differences between precincts when it comes to opinions about the school bond, so cluster sampling would be an appropriate method.

    To use single-stage cluster sampling, we could randomly choose \(10\) precincts and then survey every voter in those precincts. This method would be faster and more efficient than choosing voters from every precinct (stratified sampling) or selecting voters completely at random (simple random sampling).
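
    A minimal sketch of both forms of cluster sampling follows. The city layout (40 precincts of 250 voters each), the double-stage subsample size of 25, and all names are assumptions made up for illustration.

```python
import random

def cluster_sample(clusters, n_clusters, per_cluster=None, seed=None):
    """Randomly choose n_clusters clusters, then collect members from them.

    per_cluster=None  -> single-stage: every member of each chosen cluster.
    per_cluster=m     -> double-stage: a random sample of m members per cluster.
    """
    rng = random.Random(seed)
    chosen = rng.sample(sorted(clusters), n_clusters)
    if per_cluster is None:
        return {name: list(clusters[name]) for name in chosen}
    return {name: rng.sample(clusters[name], per_cluster) for name in chosen}

# Hypothetical city: 40 precincts with 250 registered voters each.
precincts = {
    f"precinct-{p:02d}": [f"voter-{p:02d}-{v}" for v in range(250)]
    for p in range(40)
}

single = cluster_sample(precincts, 10, seed=1)                  # survey everyone
double = cluster_sample(precincts, 10, per_cluster=25, seed=1)  # survey 25 each

print(sum(len(v) for v in single.values()))  # 2500
print(sum(len(v) for v in double.values()))  # 250
```

    In both cases only \(10\) of the \(40\) precincts are visited, which is where the savings in time and cost come from.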

    Text Exercise \(\PageIndex{4}\)

    Suppose we want to find the average height of students at our local high school. Even though it would be possible to measure every student, doing so would not be practical. Consider whether stratified sampling or cluster sampling would work better in this situation.

    Answer

    There is more than one possible answer. Two natural ways to group high school students are by grade level and gender. There are clear differences in height between males and females, and students usually grow as they move through high school. Because of these differences, using cluster sampling based on gender or grade level would not be a good choice.

    Instead, stratified sampling would work better. Each grade level could be divided by gender, creating eight groups in total. Taking a random sample from each group would give a more accurate estimate of the average height.

    Another option might be to use homerooms, which often include students of different ages and genders. If homerooms are mixed evenly, they could work well as clusters for sampling. However, this depends on whether the school assigns students to homerooms in this way.

    Even though our goal is to get a sample that truly represents the population, the best we can do is make sure that any bias that may happen comes only from random chance. Inferential statistics is based on this idea. We need to be careful to avoid sampling methods that create systematic bias.

    We have already seen several biased sampling methods in this course. These include voluntary response sampling, like when the coach studied only students who volunteered to do cartwheels; convenience sampling, like asking only the students sitting in the front row; and systematic sampling, such as choosing every other person whose last name starts with the letter \(B.\) Note that in systematic sampling, the selection does not have to be every other person; it can be every \(k^{th}\) person.
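
    To make the "every \(k^{th}\) person" rule concrete, here is a short sketch. The list of names is made up, and the random starting position is an assumption added for illustration; the text only describes taking every other name.

```python
import random

def systematic_sample(ordered_list, k, seed=None):
    """Take every k-th item, beginning at a random position among the first k."""
    rng = random.Random(seed)
    start = rng.randrange(k)      # 0, 1, ..., k-1
    return ordered_list[start::k]

# Hypothetical alphabetized list of 100 names (illustrative data only).
names = [f"B-name-{i:03d}" for i in range(100)]
every_other = systematic_sample(names, 2, seed=1)
every_fifth = systematic_sample(names, 5, seed=1)

print(len(every_other), len(every_fifth))  # 50 20
```

    With \(k=2,\) two names that sit next to each other on the list can never both be selected, so not every possible group of names is equally likely to be chosen. This is one way the method differs from simple random sampling.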

    Text Exercise \(\PageIndex{5}\)

    Construct definitions for voluntary response sampling and convenience sampling. Explain how they are related, how they are different, and why they often produce biased samples.

    Answer

    Voluntary Response: A sampling method where a request is sent out to a large group, asking people to participate. Anyone who chooses to respond is included in the sample. The key feature of this method is that people choose whether or not to take part.

    Convenience: A sampling method where people or items are chosen simply because they are easy to reach or available. Participants are selected based on convenience, not at random.

    Both methods are based on ease of access, but they are not the same. Voluntary response sampling always involves people choosing to participate on their own. Convenience sampling does not require volunteers; it just involves selecting whoever is easiest to study. For example, sending out a school-wide email asking students to take a survey is voluntary response sampling. Surveying only students in a psychology class is convenience sampling. Another example of convenience sampling would be standing at one busy intersection and counting how many drivers use turn signals. The drivers did not volunteer; they were simply convenient to observe.

    Both methods often lead to biased samples. Voluntary response samples are usually biased because people with strong opinions are more likely to respond. Convenience samples are often biased because the people or items selected tend to share certain traits, while others are left out.

    Sometimes these methods still produce useful information, especially when time, cost, or access is limited. However, because we cannot be sure that any bias is due only to chance, we should be careful when using results from voluntary response or convenience samples to make conclusions about a larger population.

    Random Assignment in Medical Trials

    In experimental research, the population being studied is often hypothetical. For example, in an experiment testing how well a new anti-depressant drug works compared to a placebo (a fake treatment), there is no actual population of people already taking the drug. Instead, researchers define a group of people who have some level of depression and then take a random sample from that group. The sample is then randomly divided into two groups. One group receives the treatment (the drug), and the other receives the control treatment (the placebo). This process is called random assignment. Random assignment is very important because it helps make sure the experiment is fair and valid.

    To see why this matters, imagine if the first \(20\) people who arrived were placed in the treatment group and the next \(20\) were placed in the control group. If people who arrive later tend to be more depressed, the control group would start off more depressed than the treatment group. This difference would exist before the treatment even began, and it would affect the results.
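
    Random assignment can be sketched in a few lines of Python. The list of \(40\) participants in arrival order is hypothetical, as are the function and variable names. Shuffling the list before splitting it means that arrival order has no influence on who ends up in which group.

```python
import random

def random_assignment(participants, seed=None):
    """Shuffle the participants, then split them into two equal-sized groups."""
    rng = random.Random(seed)
    shuffled = list(participants)   # copy so the original order is untouched
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]

# Hypothetical sample of 40 people, listed in the order they arrived.
arrivals = [f"participant-{i}" for i in range(1, 41)]
treatment, control = random_assignment(arrivals, seed=1)

print(len(treatment), len(control))  # 20 20
```

    Because chance alone decides the groups, any pre-existing difference between them (such as how depressed the participants are at the start) can arise only by chance and not from the way people were assigned.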

    In experiments like this, not using random assignment is a bigger problem than using a non-random sample. Without random assignment, the results of the experiment are not valid. A non-random sample mainly limits how widely the results can be applied, but the experiment itself can still show cause and effect.


    1.4: Sampling Methods is shared under a Public Domain license and was authored, remixed, and/or curated by The Math Department at Fort Hays State University.