6.4: Validity and Reliability


    Once a measurement system has been developed, it is crucial that it be evaluated for validity and reliability. This is important because researchers need to know whether the measurement instrument works well for the concept and provides stable measurements. It is often helpful to think about abstract concepts in terms of something familiar, and hence we will discuss these two concepts in terms of a measurement that most people have done themselves.

    Consider the four types of rulers shown in Figure \(\PageIndex{1}\). All four of the rulers have been designed to measure one foot, or twelve inches, but they have been made in different ways. The first ruler is very carefully made of stainless steel and has been marked so that the distance between the first and last marks is exactly one foot. The second ruler is also made of stainless steel but was carelessly marked, so the distance between the first and last marks is one foot and two inches—even though it is labeled as being exactly one foot. The third ruler is made of a stretchable material. When the ruler was in a relaxed state, it was marked carefully so that the distance between the first and last marks is exactly one foot. The problem is that the material is very stretchable, so when someone uses the ruler, they will sometimes expand it so that the distance between the first and last marks is greater than one foot, while at other times the ruler will get compressed, and the distance between the first and last marks shrinks to less than one foot. The fourth ruler is also made of a stretchable material. When the ruler is in a relaxed state, it is evident that it was not marked carefully because the distance between the first and last marks is one foot and two inches, even though it is labeled as exactly one foot.

    Figure \(\PageIndex{1}\): Visualizations of the four rulers used in the discussion of validity and reliability (Public domain image created by Alan M. Polansky)

    Now let us consider how measurements from these rulers would behave in practice. Suppose that twenty people are given a wooden block that is exactly 12 inches long. Four groups of five people will each be given one of the rulers. Everyone in each group will measure the block. This experiment is done in such a way that nobody measures the block more than once. Starting with Ruler 1, we have an instrument that is correctly marked and made of a material that is very stable. When the five people measure the block, we would expect that each of them would get measurements very close to twelve inches, and the measurements might look something like what is given in the first row of Table 6.9. It should be noted that even with a very stable and correctly marked ruler, different people would get slightly different measurements when using the same ruler on the same block of wood.

    Table 6.9 Hypothetical measurements (in inches) of a twelve-inch block using the four rulers in Figure 6.3.

    Ruler    Measurements
    1        12.0    12.1    12.2    12.0    12.1
    2        10.1    10.4    10.0    10.2    10.1
    3        13.0    11.9    11.3    12.2    11.5
    4        10.1    10.1    10.1    10.6    10.3

    Next consider Ruler 2, an instrument that is incorrectly marked but made of a very stable material. When the five people measure the block, we would expect them all to get measurements very close to 10.25 inches, which is roughly the reading on Ruler 2 that corresponds to a true length of twelve inches. These measurements might look something like what is given in the second row of Table 6.9. Ruler 3 is correctly marked but made of a very unstable material. When that group measures the block, we would expect the measurements to vary around twelve inches, but they might be farther from twelve inches than the Ruler 1 group’s measurements, as some people might stretch or compress the ruler when they measure the block. These measurements might look something like what is given in the third row of Table 6.9. Finally, Ruler 4 is incorrectly marked and made of a very unstable material. When the five people measure the block, we would expect their measurements to vary around 10.25 inches, but they might be farther from 10.25 inches than the measurements from Ruler 2. These measurements might look something like what is given in the last row of Table 6.9.

    In terms of validity and reliability, whether the ruler is marked correctly determines the validity of the measurements from the ruler. That is, a valid ruler is marked so that the twelve-inch mark corresponds to a true length of twelve inches. The reliability of the ruler refers to whether the measurements are stable. The first two rulers, which are made of the stable material, are reliable. That is, repeated measurements with these rulers will give very similar results. Therefore, the first ruler is both valid and reliable, the second ruler is not valid but is reliable, the third ruler is valid but not reliable, and the fourth ruler is neither valid nor reliable. The best measurements will come from the ruler that is both valid and reliable, which corresponds to Ruler 1. That is, the ruler is marked correctly, and repeated measurements are very stable.
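The distinction can be made concrete with the numbers in Table 6.9. The sketch below is only an illustration, not a method given in the text: it assumes that validity can be summarized by the bias (mean measurement minus the true length) and reliability by the sample standard deviation of the measurements. The names `summarize` and `summary` are made up for this example.

```python
from statistics import mean, stdev

TRUE_LENGTH = 12.0  # the block is exactly twelve inches long

# Measurements copied from Table 6.9, one list per ruler.
measurements = {
    1: [12.0, 12.1, 12.2, 12.0, 12.1],
    2: [10.1, 10.4, 10.0, 10.2, 10.1],
    3: [13.0, 11.9, 11.3, 12.2, 11.5],
    4: [10.1, 10.1, 10.1, 10.6, 10.3],
}

def summarize(values, true_value=TRUE_LENGTH):
    """Return (bias, spread): mean error and sample standard deviation."""
    return mean(values) - true_value, stdev(values)

# Assumed summaries: bias stands in for validity, spread for reliability.
summary = {ruler: summarize(values) for ruler, values in measurements.items()}

for ruler, (bias, spread) in summary.items():
    print(f"Ruler {ruler}: bias = {bias:+.2f} in, spread = {spread:.2f} in")
```

With the Table 6.9 values, the bias is close to zero for Rulers 1 and 3, while Rulers 2 and 4 read nearly two inches short, and Ruler 3 shows by far the largest spread. With only five readings per ruler, these spread estimates are rough.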

    Researchers are often interested in measuring an abstract concept such as anxiety, security, satisfaction, or emotional stability. For these concepts the validity of a measurement instrument, as with the rulers, refers to the idea that the concept the system is measuring matches the concept the researchers are attempting to measure.

    Definition: Valid Measurement System

    A measurement system is valid if the system correctly measures the concept without any systematic bias.

    What makes thinking about validity difficult for abstract concepts, such as the ones that are common in research, is that we usually do not have a good concrete understanding of what is being measured. For our purposes it suffices to imagine the construct and to picture a valid measurement system acting much like Ruler 1 in the example. That is, while not every measurement would be exactly correct, the measurements would tend to cluster around the correct value.

    In practice the validation process is quite complicated and technical; we will cover some of the more pertinent issues below without the technical details. Validity has many facets, and we will discuss several here. The first is face validity, which essentially refers to whether the dimensions and the indicators are logical and are consistent with known research about the subject.

    Definition: Face Validity

    A measurement system has face validity if the dimensions and the indicators make sense and are consistent with known research about the subject.

    For example, it would be widely accepted to include blood pressure as an indicator of physical health. This is consistent with the known research that links conditions like hypertension to increased mortality rates. However, a measure like whether an individual prefers to listen to podcasts or music may have little to do with physical health, and there would probably be better indicators, in any case.

    The second type of validity is called content validity, which is the degree to which the measurement system covers the range of meanings within the concept.

    Definition: Content Validity

    A measurement system has content validity if the dimensions and indicators cover the range of meanings within the concept.

    As with face validity, content validity is largely based on the collective judgement and experience in the research community. In the previous discussion of general health, it was pointed out that such a measure should include components corresponding to both physical health and mental health.

    A stronger type of validity is called criterion validity, where the measurement system is statistically compared to another indicator or measure of the concept.

    Definition: Criterion Validity

    A measurement system has criterion validity if it coincides statistically with other measures of the same concept.

    For the self-reported health measure considered earlier, the researchers reported that the measure was compared to other measures of health, including mortality. The fact that the self-reported health measure was associated with mortality rates provides evidence of criterion validity for the measure.

    There are two types of criterion validity that can be considered. The first is predictive criterion validity, which implies that the measurement system does a good job of predicting future values of an established measure of the concept.

    Definition: Predictive Criterion Validity

    A measurement system has predictive criterion validity if it does a good job of predicting future values of an established measure of a concept.

    Note the time precedence inherent in the definition. The value of the measurement system must be observed first. The measurement system then predicts the value of the established measure that is observed later in time. In the case of the self-reported health assessment, this would mean that the self-reported measures of health would be observed, and then the mortality rate of those same individuals would be observed later. If the self-reported health assessment does a good job of predicting the mortality rate, then the measure would have predictive criterion validity.
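A minimal sketch of this check follows, assuming a small follow-up data set. The records below are hypothetical values invented for illustration, not results from the study mentioned in the text, and `mortality_by_rating` is a made-up helper name. The idea is to record the self-rating first and then compare the later mortality rate across rating groups.

```python
from collections import defaultdict

# Hypothetical follow-up records (invented for illustration):
# (self-rated health at baseline, whether the person died during follow-up).
records = [
    ("good", False), ("good", False), ("good", False), ("good", True),
    ("fair", False), ("fair", True), ("fair", False), ("fair", True),
    ("bad", True), ("bad", True), ("bad", False), ("bad", True),
]

def mortality_by_rating(rows):
    """Proportion who died during follow-up, for each baseline rating."""
    counts = defaultdict(lambda: [0, 0])  # rating -> [deaths, total]
    for rating, died in rows:
        counts[rating][0] += int(died)
        counts[rating][1] += 1
    return {rating: deaths / total for rating, (deaths, total) in counts.items()}

rates = mortality_by_rating(records)
for rating in ("good", "fair", "bad"):
    print(f"{rating}: mortality rate = {rates[rating]:.2f}")
```

If the later mortality rate rises steadily as the earlier self-rating gets worse, as it does in these made-up records, that pattern would count as evidence of predictive criterion validity. A real assessment would need far more individuals and a formal statistical test.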

    Predictive criterion validity is a very powerful concept but is not always possible to implement in practice. For example, in the case of validating the self-reported health measure with mortality rates, several years may pass after individuals report their health until enough mortality data is available to verify that the self-reported health predictions are accurate. Researchers may not have sufficient time to wait for these data before publishing their results. The predictive nature of the criterion also implies that the same individuals need to be tracked during this period, which may require substantial resources. An alternative method of validation is based on measurements taken at the same time.

    Definition: Concurrent Criterion Validity

    A measurement system has concurrent criterion validity if it is strongly associated with a value of an established measure of a concept that is observed at the same time.

    In the case of the self-reported health measure, one could validate this measure by having individuals come to a clinic, asking them about their health, and then assessing their general health medically during the same visit.

    A slightly more complex type of validity is construct validity, which assesses whether the measure behaves the way researchers would expect with respect to measures of other concepts.

    Definition: Construct Validity

    A measurement system has construct validity if it behaves the way researchers would expect in how it is related to measures of other concepts.

    For example, it may be known that individuals with lower overall general health have more doctor visits per year. If the self-reported measure of general health is valid, then lower values on that measure should also be associated with more doctor visits per year. A lack of a relationship could also be used for validation purposes. If it is known that general health is not associated with the amount of time one spends online, then one should expect that the self-reported measure of general health would also not be associated with the amount of time an individual spends online.
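Both expectations can be checked with a correlation coefficient. The sketch below uses the Pearson correlation as one possible measure of association, which is an assumption since the text does not name a statistic. All three data lists are hypothetical values invented for illustration.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

# Hypothetical data for eight individuals (invented for illustration).
health_score = [1, 2, 2, 3, 3, 4, 4, 5]   # self-reported, 1 = bad ... 5 = excellent
doctor_visits = [9, 7, 8, 5, 6, 3, 4, 1]  # visits per year
hours_online = [2, 4, 1, 3, 3, 1, 4, 2]   # hours per day

r_visits = pearson(health_score, doctor_visits)
r_online = pearson(health_score, hours_online)

print(f"health vs. doctor visits: r = {r_visits:.2f}")
print(f"health vs. hours online:  r = {r_online:.2f}")
```

In this made-up example the first correlation is strongly negative and the second is essentially zero, which is the pattern the paragraph above describes as supporting construct validity.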

    The final type of validity is called factor validity, or factor construct validity, and relies on a statistical technique called factor analysis. Factor validity determines whether the dimensions and indicators are consistent in practice with what the researcher intended. That is, the proposed measurement system is observed on a group of individuals. Factor analysis can then be used to determine whether the number of dimensions appears to be correct and whether the indicators are grouped under the correct dimensions.

    Definition: Factor Construct Validity

    A measurement system has factor construct validity if a statistical factor analysis indicates that the number of dimensions and the grouping of the indicators within the dimensions is correct.

    If a researcher contends that four dimensions should be used to measure general health where each dimension has a specified number of indicators, the factor analysis should give results consistent with the proposed structure.

    Definition: Reliable Measurement System

    A measurement system is reliable if repeated measurements on the same individual are nearly identical.

    Essentially, this is the difference between the stainless-steel ruler, which was very reliable as repeated measurements on the same object would be nearly the same, and the flexible ruler, which was not as reliable as repeated measurements on the same object would not be nearly the same.

    At first it may seem that a measurement system should always give the same result when it is applied to the same object. But imagine what might happen if a simple measurement scale like the self-reported health scale is given to the same person on two consecutive days. On one day the individual might be feeling well with a good outlook on life, and they rate their health as being good. On the next day they might not be feeling well and, in that case, they might rate their health as fair or even bad. Reliability then is an indication of how much a researcher can trust the measurement taken at a certain time. Reliability provides the researcher with the assurance that if the measure were taken again, a very similar result would be obtained.

    Researchers have several methods that can be used to assess the reliability of a measurement system. In the simplest case the spirit of the definition of reliability can be used directly and the same measurement can be taken two or more times on the same group of individuals to assess reliability. This is known as test-retest reliability.

    Definition: Test-Retest Reliability

    The reliability of a measurement system is assessed using test-retest reliability if the same measurement is taken two or more times on the same group of individuals.

    For some measurements, such as medical measurements like weight and blood pressure, this method can be a simple, direct method for assessing reliability. For measurement systems based on surveys or written tests, this method can give biased results and can often indicate that a measurement system is more reliable than it is in reality. For example, individuals probably remember most of their answers on a survey when they retake it a second time, and to save time might repeat those answers instead of indicating their true feelings.
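For a direct measurement such as weight, a test-retest check can be sketched by summarizing the differences between two occasions. This is one simple choice and not necessarily the statistic the author has in mind, as a correlation between the two occasions is also commonly used. The weights below are hypothetical values invented for illustration.

```python
from statistics import mean, stdev

# Hypothetical weights (kg) for the same six people on two occasions
# (values invented for illustration).
first_visit = [70.2, 81.5, 64.8, 92.1, 58.3, 76.0]
second_visit = [70.5, 81.1, 65.0, 92.4, 58.1, 76.3]

# Difference for each person: second measurement minus first.
differences = [b - a for a, b in zip(first_visit, second_visit)]

mean_difference = mean(differences)  # systematic shift between occasions
sd_difference = stdev(differences)   # typical size of the disagreement

print(f"mean difference = {mean_difference:+.2f} kg")
print(f"SD of differences = {sd_difference:.2f} kg")
```

Differences that are small compared with the variation between people suggest that a repeated measurement would give a very similar result, which is the sense of reliability used in the definition above.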

    A method for assessing the reliability of a measurement system that does not require more than one measurement on a group of individuals is called internal consistency reliability.

    Definition: Internal Consistency Reliability

    The reliability of a measurement system is assessed using internal consistency reliability if the dimensions, and the indicators within each dimension, are all strongly associated with one another.

    It should be apparent from the definition that the internal consistency of a measure can only be assessed if there are at least two dimensions or indicators. The idea behind internal consistency is that if all the indicators tend to provide very similar information, then the final measurement should be very stable. In contrast, when one indicator acts very differently from another, it is an indication that a small change in the indicators could result in a large change in the measurement. If all the indicators tend to act together, it would be very rare for a large change in one of the indicators to occur because this would imply that many indicators would also have to change. Internal consistency reliability is assessed using statistical techniques based on correlations, which will be discussed later in this book.
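The text does not name a specific statistic here. One widely used choice is Cronbach's alpha, which compares the variances of the individual indicators with the variance of their total. The sketch below assumes that statistic and uses hypothetical survey responses invented for illustration. The function name `cronbach_alpha` is made up for this example.

```python
from statistics import variance

# Hypothetical responses (invented for illustration): each row is one
# respondent, each column is one indicator scored on a 1-5 scale.
responses = [
    [4, 5, 4],
    [2, 2, 3],
    [5, 4, 5],
    [1, 2, 1],
    [3, 3, 4],
    [4, 4, 4],
]

def cronbach_alpha(rows):
    """Cronbach's alpha for a list of respondents' indicator scores."""
    k = len(rows[0])                       # number of indicators
    columns = list(zip(*rows))             # scores grouped by indicator
    item_variance_sum = sum(variance(col) for col in columns)
    total_variance = variance([sum(row) for row in rows])
    return k / (k - 1) * (1 - item_variance_sum / total_variance)

alpha = cronbach_alpha(responses)
print(f"Cronbach's alpha = {alpha:.2f}")
```

Values of alpha near 1 indicate that the indicators move together, as they do in these made-up responses, while values near 0 indicate that they carry largely unrelated information.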


    This page titled 6.4: Validity and Reliability is shared under a CC BY 4.0 license and was authored, remixed, and/or curated by .
