Statistics LibreTexts

4.2: Defining a Population

    When researchers consider an issue they would like to study, there is a group of individuals who are the focus of the study. In the theory of statistics, these groups are called statistical populations, or simply populations. For example, if a researcher is interested in comparing the average income across self-identified genders for all adults in a large city, then the population consists of all adults in the city. Similarly, if a researcher is interested in the potential presidential voting profiles of all eligible voters for the next presidential election in the United States, then the population consists of every individual in the United States who has registered, or can register, to vote prior to the next presidential election.

    Populations can also consist of physical objects or of living beings that are not people. For example, a study might consider the number of grocery stores that offer fresh fruit and vegetables in each neighborhood of a city to determine if access to fresh produce is associated with the income level of the neighborhoods. In this case, the population of interest is all the grocery stores in the city.

    In other cases, the population may not consist of individuals or physical objects at all, but rather of an idealized set of observations that could be made under the right conditions. These hypothetical or theoretical populations may even have an infinite number of members. For example, it might be of interest to compare the waiting times of emergency room patients at urban, suburban, and rural hospitals. These waiting times do not physically exist; they are observations that will be made of events that will happen later. Nevertheless, this population may be of interest and could be theoretically infinite, and it relates to many important issues in our modern world.

    Definition: Population

    A population consists of all individuals or items that are of interest in a statistical study.

    It is crucial in the development of a statistical study that the population of interest be defined precisely. This precision matters because the population is the source of the observed data that will be used to learn about the issue of interest. The defined population should include all the individuals or items that are of interest, and only those. The definition of the population of interest is guided by the question or hypothesis that the researcher is interested in investigating. In most cases researchers will iterate between the research question and the definition of the population, refining each many times before settling on a precise statement of both.

    As an example, suppose that a researcher is interested in studying the association between immigration status and yearly income. Initially, the population would include everyone in the United States. This population is quite large and complex, and many questions arise as one considers the problem. What is meant by yearly income? Certainly, individual income could be observed, but how would that vary within family units? Some members of a family may be of an age that would make work outside the family possible, but they may be tasked with looking after children or siblings to support others in the family who work. As such, many in the population potentially have no income but support those who bring income into the family unit. It is also relevant that the income of the people who work outside the family would be shared between the family members, whereas a single person with the same income would support only themselves. At this point some refinement of the research question seems to be in order.

    The researcher may decide to focus on the income of individuals who work outside the home. With this refinement, the population becomes individuals in the United States who work outside the home, and the research question is narrowed to consider the individual income of workers employed outside the home. Now a time element becomes important. Some individuals may be working currently but may be unemployed later, and vice versa. In this case the researcher may refine the population to be those employed at the time of the study. The research question is then refined to study the relationship between immigration status and the yearly income of individuals who were employed at the time of the study. Further refinements of the population may be considered as the research question is refined further.

    The population in surveys usually refers to the actual physical individuals or items of interest. To learn from this population, characteristics and measurements are observed from these individuals. For example, in a study of consumer credit ratings, the population of interest may be all individuals between 18 and 65 years old living within a specified set of zip codes. When individuals from this population are observed for the study, we may be interested in several of their characteristics. For example, we might determine the credit rating of each individual, along with other demographic data like race, age, employment status, and gender identity. Collectively these measurements are known as data.

    Definition: Data

    Measurements taken on individuals or items in a population are called data. A single measurement is called a datum.

    Many measurements may be taken on each observed individual from the population. Each distinct measurement taken on the individuals in a population is called a variable.

    Definition: Variable

    A single measurement that is taken on everyone observed in a population is called a variable.

    Some practitioners simply define variables as quantities that vary, but this simplistic definition is unsatisfactory from a practical viewpoint, as not all variables vary between individuals and, more importantly, it is not a mathematically sound definition (Polansky 2011). In the previous example, the measurements obtained from the observed individuals from the population are the credit rating, race, age, employment status, and gender identity of the observed individuals. Taken collectively these observations and variables constitute the data observed for the study. The measurements of credit rating of the individuals in the population are referred to as observations of the credit rating variable. The other variables in the study correspond to the observations of race, age, employment status, and gender identity.

    Once a population and its variables are defined, the quantities that summarize those variables need to be determined. In the earlier immigration example we are interested in how income, as defined above, varies with immigration status. We might decide that we could answer this question definitively by comparing the average income of everyone in the population within each possible immigration status. These averages are examples of parameters, quantities that can be computed from the known values of the variables in a population.

    Definition: Parameter

    A parameter is a quantity or characteristic that can be computed when the whole population has been observed.

    The key idea here is that the value of a parameter is only known when the variables on every individual or item in a population are observed. In the example, if we observed everyone in the defined population along with their income, we could compute the average value of the income variable within each immigration status. We could not compute these values unless everyone in the population could be observed, and the income and immigration status of each individual could also be observed.
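
    As a sketch of the averages just described, in notation that is our own assumption rather than the text's: let \( G \) denote the set of individuals in the population with a given immigration status, \( N_G \) the number of individuals in that set, and \( x_i \) the income of individual \( i \). The parameter for that group is then the mean

    \[ \mu_G = \frac{1}{N_G} \sum_{i \in G} x_i . \]

    Every \( x_i \) with \( i \in G \) appears in the sum, which is why the value cannot be computed unless the income and immigration status of the whole population have been observed.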

    As another example, let us consider a small population consisting of two small classes of graduate students in advanced study at a university. There are seven students in the first class where a traditional instructional approach was taken, and ten students in a second class where the same material was taught but from a diverse perspective with respect to gender. The instructor of these classes is interested in the possible effect that the gender diverse instructional method may have on the final grades in the classes. The instructor is also interested in comparing the grades across gender identities in the classes. Hence, we have a population of seventeen students with two variables: final grade and gender identity. The population is reported in Table 4.1 below.

    Table 4.1. A population of seventeen students in two classes. The student names are censored with labels. The classes are identified as traditional (T) or gender diverse (D). All students in the study identified their genders as female (F), male (M), or nonbinary (N).

    Student   Class   Gender Identity   Grade
    A         T       M                 68
    B         T       N                 57
    C         T       F                 74
    D         T       M                 78
    E         T       F                 64
    F         T       F                 85
    G         T       M                 96
    H         D       M                 56
    I         D       N                 83
    J         D       F                 68
    K         D       F                 77
    L         D       M                 89
    M         D       N                 70
    N         D       F                 75
    O         D       N                 72
    P         D       M                 71
    Q         D       F                 87

    For this study we are interested in comparing the average grades across the classifications of gender identities and the two classroom experiences. Therefore, we can take these averages as our parameters. The population is small, and since we know the gender identity and final grade for everyone, we can compute the parameter values directly. Recall that an average grade is computed by adding up all the scores and then dividing by the number of scores. Therefore, as an example, the average final grade for the three students who identify as female in the traditional course is \( (74+64+85) \div 3 \approx 74.3 \approx 74 \), where we have rounded to the nearest whole number. The symbol \( \approx \) is used to indicate that the value following the symbol is an approximate value, in this case due to rounding. The remaining averages are given in Table 4.2.

    Table 4.2. The population parameters of seventeen students in two classes, rounded to the nearest whole number. The classes are identified as traditional (T) or gender diverse (D). All students in the study identified their genders as female (F), male (M), or nonbinary (N).

    Gender Identity   Class   Average Grade
    F                 T       74
    M                 T       81
    N                 T       57
    F                 D       77
    M                 D       72
    N                 D       75
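
    The averages in Table 4.2 can be reproduced from Table 4.1 with a short script. The following is a minimal Python sketch, not part of the original text; the names `population`, `group_means`, and `parameters` are hypothetical and chosen only for illustration. It groups the seventeen students by gender identity and class, then divides each group's total grade by the number of students in the group.

```python
from collections import defaultdict

# Population from Table 4.1: (student, class, gender identity, grade).
population = [
    ("A", "T", "M", 68), ("B", "T", "N", 57), ("C", "T", "F", 74),
    ("D", "T", "M", 78), ("E", "T", "F", 64), ("F", "T", "F", 85),
    ("G", "T", "M", 96), ("H", "D", "M", 56), ("I", "D", "N", 83),
    ("J", "D", "F", 68), ("K", "D", "F", 77), ("L", "D", "M", 89),
    ("M", "D", "N", 70), ("N", "D", "F", 75), ("O", "D", "N", 72),
    ("P", "D", "M", 71), ("Q", "D", "F", 87),
]


def group_means(records):
    """Return the mean grade for each (gender identity, class) group."""
    totals = defaultdict(lambda: [0, 0])  # group -> [sum of grades, count]
    for _student, cls, gender, grade in records:
        totals[(gender, cls)][0] += grade
        totals[(gender, cls)][1] += 1
    return {group: total / count for group, (total, count) in totals.items()}


# Because every member of the population is listed, these are parameters.
parameters = group_means(population)

for (gender, cls), mean in sorted(parameters.items()):
    # Show the unrounded mean next to the whole-number value.
    print(gender, cls, f"{mean:.2f}", round(mean))
```

    Rounding each computed mean to the nearest whole number gives the values reported in Table 4.2. The calculation is only possible because the grade, class, and gender identity of all seventeen students are known.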

    We will not attempt to draw any conclusions from this hypothetical study as this population is incredibly small, and any conclusions would probably not extend to larger populations. The important conceptual idea at this point is to make the connection between the members of the population, the variables, and the parameters.


    This page titled 4.2: Defining a Population is shared under a CC BY 4.0 license and was authored, remixed, and/or curated by .
