2: Understand Your Data

  • Good data are the basis of any regression model, because we use the data to construct the model. If the data are flawed, the model will be flawed: garbage in, garbage out. Thus, the first step in regression modeling is to ensure that your data are reliable. Unfortunately, there is no universal approach to verifying data quality. If you collect the data yourself, you at least have the advantage of knowing their provenance. If you obtain your data from somewhere else, though, you depend on the source to ensure data quality. Your job then becomes verifying your source’s reliability and correctness as much as possible.

    • 2.1: Missing Values
      Any large collection of data is probably incomplete. That is, it is likely that there will be cells without values in your data table. These missing values may be the result of an error, such as the experimenter simply forgetting to fill in a particular entry. They also could be missing because that particular system configuration did not have that parameter available. Fortunately, R is designed to gracefully handle missing values.
    • 2.2: Sanity Checking and Data Cleaning
    • 2.3: The Example Data
    • 2.4: Data Frames
      The fundamental object used for storing tables of data in R is called a data frame. We can think of a data frame as a way of organizing data into a large table with a row for each system measured and a column for each parameter. An interesting and useful feature of R is that the columns in a data frame need not all be the same data type. Some columns may consist of numerical data, for instance, while other columns contain textual data.
    • 2.5: Accessing a Data Frame
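
The summaries of Sections 2.1 and 2.4 can be illustrated with a minimal R sketch. The data frame `systems` and its columns (`name`, `clock_mhz`, `cores`) are hypothetical, invented only for illustration; they are not the example data of Section 2.3. The sketch uses only base R functions.

```r
# Hypothetical data: one row per system measured, one column per parameter.
systems <- data.frame(
  name      = c("sysA", "sysB", "sysC"),  # textual column
  clock_mhz = c(2400, NA, 3100),          # numeric column; NA marks a missing value
  cores     = c(4, 8, NA),                # numeric column with a missing value
  stringsAsFactors = FALSE                # keep text as character in all R versions
)

# Columns of a single data frame may have different data types.
col_types <- sapply(systems, class)

# Count the missing values in each column.
missing_per_column <- colSums(is.na(systems))

# A summary such as mean() returns NA if any input is missing,
# unless it is told to drop missing values with na.rm = TRUE.
mean_with_na    <- mean(systems$clock_mhz)                # NA
mean_without_na <- mean(systems$clock_mhz, na.rm = TRUE)  # 2750

# Keep only the rows that contain no missing values.
complete_rows <- systems[complete.cases(systems), ]

print(col_types)
print(missing_per_column)
print(complete_rows)
```

Dropping incomplete rows with `complete.cases()` is only one possible policy. Whether to drop, keep, or otherwise treat missing entries depends on why they are missing, as Section 2.1 discusses.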