We are finally ready to develop the multi-factor linear regression model for the int00.dat data set. As mentioned in the previous section, we must strike the right balance in the number of predictors we use in our model. Too many predictors will train our model to follow the data’s random variations (noise) too closely. Too few predictors will produce a model that underfits the data, so it may not predict future values as accurately as a model with more predictors.
We will use a process called backward elimination [1] to help decide which predictors to keep in our model and which to exclude. In backward elimination, we start with all possible predictors and use lm() to compute the model. We then use the summary() function to find each predictor’s significance level. The least significant predictor is the one with the largest p-value. If this p-value is larger than our predetermined significance threshold, we remove that predictor from the model and fit the model again. A typical threshold for keeping predictors in a model is p = 0.05, meaning we retain a predictor only if there is less than a 5 percent chance that an association this strong would appear by chance when the predictor actually has no effect. A threshold of p = 0.10 is also not unusual. We repeat this process until the p-values of all of the predictors remaining in the model are below our threshold.
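The loop described above can be sketched in R as follows. This is a minimal illustration, not the exact code used for int00.dat: the small synthetic data frame `df`, its response column `y`, and its predictor names `x1` through `x3` are placeholders standing in for the real data set and its columns.

```r
# Backward elimination sketch. The synthetic data frame `df` (response `y`)
# stands in for int00.dat; substitute the real data and response column.
set.seed(42)
df <- data.frame(x1 = rnorm(50), x2 = rnorm(50), x3 = rnorm(50))
df$y <- 2 * df$x1 + 0.5 * df$x2 + rnorm(50)

threshold  <- 0.05                      # significance cutoff for keeping a predictor
predictors <- setdiff(names(df), "y")   # start with all possible predictors
repeat {
  form  <- as.formula(paste("y ~", paste(predictors, collapse = " + ")))
  model <- lm(form, data = df)
  # p-values are column 4 of the coefficient table; row 1 is the intercept
  pvals <- coef(summary(model))[-1, 4, drop = FALSE]
  if (max(pvals) <= threshold) break    # every remaining predictor is significant
  worst      <- rownames(pvals)[which.max(pvals)]  # least significant predictor
  predictors <- setdiff(predictors, worst)
  if (length(predictors) == 0) break    # no predictor survived the threshold
}
summary(model)
```

Each pass drops only the single worst predictor before refitting, because removing one predictor changes the p-values of all the others; eliminating several at once based on one fit could discard a predictor that would have become significant.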