Statistics LibreTexts

3.5: Missing data

    There is no such thing as a perfect observation, much less a perfect experiment. The larger the data set, the higher the chance of irregularities. Missing data arise from almost every source: imperfect methods, accidents during data recording, faults in computer programs, and many other causes.

    Strictly speaking, there are several types of missing data. The easiest to understand is “unknown”, a datum that was either not recorded or was lost. Another type, “both”, is a case when a condition fits more than one level. Imagine that we observed the weather and registered sunny days as ones and overcast days as zeros. Intermittent clouds would, in this scheme, fit into both categories. As you see, the presence of “both” data usually indicates poorly constructed methods. Finally, “not applicable”, an impossible or forbidden value, arises when we meet something logically inconsistent with the study framework. Imagine that we study birdhouses and measure beak lengths in the birds found there, but suddenly find a squirrel in one of the boxes. No beak, therefore no beak length is possible. Beak length is “not applicable” for the squirrel.

    In R, all kinds of missing data are denoted with the two uppercase letters NA.

    Imagine, for example, that we asked seven employees about their typical sleeping hours. Four named the average number of hours they sleep, one person refused to answer, another replied “I do not know” and yet another was not at work at the time. As a result, three NAs appeared in the data:

    Code \(\PageIndex{1}\) (R):

    (hh <- c(8, 10, NA, NA, 8, NA, 8))
    

    We entered NA without quotation marks, and R correctly recognizes it among the numbers. Note that the three different kinds of missing data (refusal, “I do not know”, absence) were all labeled identically.
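
    Since every missing value carries the same label, the first practical step is usually to find out how many there are and where they sit. The following is a minimal sketch using base R only; the variable names missing_flags, n_missing and missing_positions are chosen here for illustration and do not come from the book.

```r
# The sleeping-hours vector from the text
hh <- c(8, 10, NA, NA, 8, NA, 8)

# is.na() returns TRUE for every missing element
missing_flags <- is.na(hh)

# TRUE counts as 1, so sum() gives the number of missing values
n_missing <- sum(missing_flags)

# which() turns the logical flags into positions within the vector
missing_positions <- which(missing_flags)

print(n_missing)
print(missing_positions)
```

    Here the three NAs are found at positions 3, 4 and 6. Nothing in the vector says which of them was a refusal and which an absence; if that distinction matters, it has to be recorded separately.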

    An attempt to simply calculate the average with the function mean() will lead to this:

    Code \(\PageIndex{2}\) (R):

    hh <- c(8, 10, NA, NA, 8, NA, 8)
    mean(hh) # returns NA because some of the values are missing
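
    The NA returned by mean() is one instance of a general rule in R: an operation whose answer depends on an unknown value is itself unknown. A short sketch of this behaviour, with variable names that are purely illustrative:

```r
# Arithmetic with a missing value stays missing
unknown_sum <- NA + 1

# Comparisons with NA are missing too, even NA == NA;
# this is why is.na() is used to test for missing values
unknown_comparison <- NA == NA

# Summaries inherit the NA unless told otherwise
total <- sum(c(8, 10, NA))

# A logical result is only known when it does not depend on the NA
known_anyway <- NA & FALSE

print(c(unknown_sum, unknown_comparison, total))
print(known_anyway)
```

    The first three results are all NA, while the last one is FALSE, since “anything AND FALSE” is FALSE whatever the unknown value is.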
    

    Philosophically, this is a correct result because it is unclear, without further instructions, how to calculate the average of seven values if three of them are missing. If we still need a numerical value, we can use one of the following:

    Code \(\PageIndex{3}\) (R):

    hh <- c(8, 10, NA, NA, 8, NA, 8)
    mean(hh, na.rm=TRUE) # skip the missing values inside mean()
    mean(na.omit(hh)) # drop the missing values first, then average
    

    The first one allows the function mean() to accept (and skip) missing values, while the second creates a temporary vector by throwing the NAs away from the original vector hh. The third way is to substitute (impute) the missing data, e.g. with the sample mean:

    Code \(\PageIndex{4}\) (R):

    hh <- c(8, 10, NA, NA, 8, NA, 8)
    hh.old <- hh # keep a copy of the original data
    hh.old
    hh[is.na(hh)] <- mean(hh, na.rm=TRUE) # replace each NA with the mean of the observed values
    hh
    

    Here we selected the values of hh that satisfy the condition is.na() and permanently replaced them with the sample mean. To keep the original data, we saved it in a vector with another name (hh.old). There are many other ways to impute missing data; the more complicated ones are based on bootstrap, regression and/or discriminant analysis. Some are implemented in the packages mice and cat.
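
    As one simple variation on mean imputation, the replacement value can be any summary of the observed data, for example the median. The sketch below assumes a small helper, impute_with(), which is hypothetical and written here only for illustration (it is not part of the book's code or of any package). It also returns a record of which values were imputed, so that information is not lost.

```r
# Hypothetical helper: fill NAs with a summary of the observed values
impute_with <- function(x, fun = median) {
  was_missing <- is.na(x)                  # remember where the gaps were
  x[was_missing] <- fun(x, na.rm = TRUE)   # summary computed on observed values only
  list(values = x, was_missing = was_missing)
}

hh <- c(8, 10, NA, NA, 8, NA, 8)

by_median <- impute_with(hh)               # default: median of 8, 10, 8, 8
by_mean   <- impute_with(hh, fun = mean)   # same result as the mean imputation in the text

print(by_median$values)
print(by_mean$values)
print(by_median$was_missing)
```

    With these data the median of the observed values is 8 and the mean is 8.5, so the two choices fill the gaps differently. Either way, imputed values are guesses and should not be mistaken for observations.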

    The collection asmisc.r supplied with this book has the Missing.map() function, which is useful for determining the “missingness” (volume and relative location of missing data) in big datasets.
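
    When asmisc.r is not at hand, a rough numerical summary of missingness can be obtained with base R alone. The following sketch is not how Missing.map() works; it is only an assumption-free alternative built from is.na(), colSums(), colMeans() and complete.cases(). The data frame toy and its age column are invented for illustration.

```r
# Invented toy data: sleeping hours from the text plus a made-up age column
toy <- data.frame(
  hours = c(8, 10, NA, NA, 8, NA, 8),
  age   = c(25, NA, 31, 40, NA, 28, 35)
)

# is.na() on a data frame gives a logical matrix of the same shape
na_matrix <- is.na(toy)

# Number and share of missing values in each column
na_per_column <- colSums(na_matrix)
na_share <- colMeans(na_matrix)

# Number of rows with no missing values at all
complete_rows <- sum(complete.cases(toy))

print(na_per_column)
print(round(na_share, 2))
print(complete_rows)
```

    The column counts show the volume of missing data, and printing na_matrix itself shows its location row by row, which is feasible for small tables but quickly becomes unreadable for large ones.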