Andy Wu

Last updated
Save as PDF

Page ID: 2492

\( \newcommand{\vecs}[1]{\overset { \scriptstyle \rightharpoonup} {\mathbf{#1}} } \) \( \newcommand{\vecd}[1]{\overset{-\!-\!\rightharpoonup}{\vphantom{a}\smash {#1}}} \)\(\newcommand{\id}{\mathrm{id}}\) \( \newcommand{\Span}{\mathrm{span}}\) \( \newcommand{\kernel}{\mathrm{null}\,}\) \( \newcommand{\range}{\mathrm{range}\,}\) \( \newcommand{\RealPart}{\mathrm{Re}}\) \( \newcommand{\ImaginaryPart}{\mathrm{Im}}\) \( \newcommand{\Argument}{\mathrm{Arg}}\) \( \newcommand{\norm}[1]{\| #1 \|}\) \( \newcommand{\inner}[2]{\langle #1, #2 \rangle}\) \( \newcommand{\Span}{\mathrm{span}}\) \(\newcommand{\id}{\mathrm{id}}\) \( \newcommand{\Span}{\mathrm{span}}\) \( \newcommand{\kernel}{\mathrm{null}\,}\) \( \newcommand{\range}{\mathrm{range}\,}\) \( \newcommand{\RealPart}{\mathrm{Re}}\) \( \newcommand{\ImaginaryPart}{\mathrm{Im}}\) \( \newcommand{\Argument}{\mathrm{Arg}}\) \( \newcommand{\norm}[1]{\| #1 \|}\) \( \newcommand{\inner}[2]{\langle #1, #2 \rangle}\) \( \newcommand{\Span}{\mathrm{span}}\)\(\newcommand{\AA}{\unicode[.8,0]{x212B}}\)

1. South African Heart Disease

Let’s look at applying these three methods to some empirical data. We’ll first start by using a data set from Rousseauw (1983) describing a retrospective sample of males in a heart-disease high-risk region of the Western Cape, South Africa. This data set has 462 observations on 9 predictors and 2 classes. We’ll be using the e1071 package for SVM, the klaR package for KDC, and the class package for k-NN.

1.1 Pre-processing

The data are a mix of categorical and numerical. The numeric features are systolic blood pressure (sbp), cumulative tobacco usage (tobacco), cholesterol (ldl), body adiposity index (adiposity), type-a behavior (typea), body mass index (obesity), alcohol consumption (alcohol), and age (age). The categorical feature is family history (famhist). Below is a listing of the summary statistics.

     sbp           tobacco             ldl           adiposity        
 Min.   :101.0   Min.   : 0.0000   Min.   : 0.980   Min.   : 6.74
 1st Qu.:124.0   1st Qu.: 0.0525   1st Qu.: 3.283   1st Qu.:19.77 
 Median :134.0   Median : 2.0000   Median : 4.340   Median :26.11  
 Mean   :138.3   Mean   : 3.6356   Mean   : 4.740   Mean   :25.41  
 3rd Qu.:148.0   3rd Qu.: 5.5000   3rd Qu.: 5.790   3rd Qu.:31.23 
 Max.   :218.0   Max.   :31.2000   Max.   :15.330   Max.   :42.49
 
 
    famhist        typea         obesity        alcohol
 Absent :270   Min.   :13.0   Min.   :14.70   Min.   :  0.00  
 Present:192   1st Qu.:47.0   1st Qu.:22.98   1st Qu.:  0.51
               Median :53.0   Median :25.80   Median :  7.51
               Mean   :53.1   Mean   :26.04   Mean   : 17.04
               3rd Qu.:60.0   3rd Qu.:28.50   3rd Qu.: 23.89 
               Max.   :78.0   Max.   :46.58   Max.   :147.19 
               
               
               
      age        chd    
 Min.   :15.00   0:302  
 1st Qu.:31.00   1:160  
 Median :45.00          
 Mean   :42.82          
 3rd Qu.:55.00          
 Max.   :64.00

We first want to separate the observations into a training set and a testing set. We’ll randomly select 362 observations as the training set, and the remaining 100 observations will serve as the testing set.

dat = read.csv("SAheart.csv")
ind.train = sample(462, 362)
dat.train = dat[ind.train,]
dat.test = dat[-ind.train,]

Since the data are over a range of different values we first need to pre-process the data. The data has no NA values but we are going to need to center and scale the data to mean 0 and standard deviation 1. We need to do this because KNN and SVM calculate the distances between points so features that are not on the same scale will skew the data. This can have the effect of overscaling or underscaling the effects of certain features when instead we want each feature to contribute to the classifier equally.

In order to do this, we want to first scale the training set by subtracting and dividing each feature by its respective mean and standard deviation. Then we’ll subtract and divide the features in the testing set by the same mean and standard deviation. It is important that we use the values from the training set to scale the testing set because in a real world situation we only have the training set to scale our new observations on.

means = colMeans(dat.train[ind])
sds = apply(dat.train[ind], 2, sd)
dat.train[ind] = sweep(dat.train[ind], 2, means, "-")
dat.train[ind] = sweep(dat.train[ind], 2, sds, "/")
dat.test[ind] = sweep(dat.test[ind], 2, means, "-")
dat.test[ind] = sweep(dat.test[ind], 2, sds, "/")

1.2 K-Nearest Neighbors

Let’s first try applying k-nearest neighbors with k = 5. Since KNN calculates distances between points we need to convert our categorical variable famhist to a numerical representation. We’ll just simply use the as.numeric() function to do this.

test.knn = dat.test
test.knn$famhist = as.numeric(test.knn$famhist) - 1
train.knn = dat.train
train.knn$famhist = as.numeric(train.knn$famhist) - 1

After converting, we can find a classifier using the knn() in the class package. This function both trains the classifier using the training set and makes class predictions on the testing set. We’ll also give a confusion matrix for the training set and testing set.

kn.train = knn(train.knn[,1:9], train.knn[,1:9], 
               train.knn[,10], k = 5)
kn.test = knn(train.knn[,1:9], test.knn[,1:9], 
              train.knn[,10], k = 5)
table(train.knn[,10], kn.train)
table(test.knn[,10], kn.test)

  kn.train      kn.test
      0   1       0  1
  0 197  34    0 62  9
  1  56  75    1 19 10

So by using a k-nearest neighbor classifier, we obtain a 75% training accuracy rate. More importantly however, is that the test accuracy rate is 72%. Unfortunately though, our false negative rate is rather high at 65%. In comparison the false positive rate is 13%. A high false negative rate is not good in this context because it means patients who have heart disease are not being diagnosed, which can be deadly.

1.3 Kernel Density Classification

Now let’s try applying a Naive Bayes classifier by generating a kernel density estimate. The NaiveBayes() function in the R package klaR makes this easy by including a usekernel argument that when set generates a kernel density estimate using density() and then computes the classifier based on that.

kdc = NaiveBayes(chd~., dat.train, usekernel = T)
kdc.train = predict(kdc, dat.train[,1:9])
kdc.preds = predict(kdc, dat.test[,1:9])
table(dat.train[,10], kdc.train$class)
table(dat.test[,10], kdc.preds$class)

   kdc.train$class   kdc.preds$class
      0   1             0  1
  0 183  48          0 56 15
  1  46  85          1 11 18

So with a kernel density estimate and a Naive Bayes classifier we achieve a 74% training accuracy rate, which is higher than our KNN training accuracy. But we also get a 74% test accuracy rate. Unfortunately though, our false positive rate has gone up to 21% while the false negative rate went down to 38%. Ideally, if the false negative rate were to lower, we would want the true positive or negative rates to get higher instead of the false positive. However, in this context, a higher false positive rate could be acceptable because it could just indicate that further tests would need to be performed to diagnose a patient.

1.4 Linear SVM

Let’s first start by applying a linear decision boundary on the training data. We’ll perform 10-fold cross validation on a range of cost parameters and make a prediction for the test data using the best model.

tuned.lin = tune(svm, chd~., data = dat.train, kernel = "linear", 
                 ranges = list(cost = 2^(-5:5)))
svm.train.lin = predict(tuned.lin$best.model, dat.train[,1:9])
svm.preds.lin = predict(tuned.lin$best.model, dat.test[,1:9])
table(dat.train[,10], svm.train.lin)
table(dat.test[,10], svm.preds.lin)

   svm.train.lin    svm.preds.lin
      0  1             0  1
  0 192 39          0 61 10
  1  61 70          1 11 18

So as we can see, this achieves an 72% accuracy rate on training data. This is not as good as the training accuracy rates from KNN and KDC, but we get a 79% accuracy rate on test data. This is a significant improvement from the KNN and KDC classifiers, and we expect that polynomial and radial kernels will perform better. Additionally, our false positive rate has gone down to 14% and our false negative rate remains at 38%. So far, this is our best model.

1.5 Polynomial SVM

The linear decision boundary is an improvement on the KNN and KDC classifiers so maybe using a polynomial kernel will be even better. As in the linear kernel, we’ll perform 10-fold cross validation on the cost parameter, but also the degree of kernel and gamma.

tuned.poly = tune(svm, chd~., data = dat.train, kernel = "polynomial", 
                  ranges = list(degree = (2:5), gamma = 2^(-1:1), 
                                cost = 2^(-6:2)))
svm.preds.poly = predict(tuned.poly$best.model, 
                         dat.test[,1:9])
table(dat.test[,10], svm.preds.poly)

   svm.train.poly    svm.preds.poly
      0  1              0  1
  0 211 20           0 64  7
  1  59 72           1 15 14

So with a polynomial kernel of degree 2, cost 0.015625, and gamma 2, the training data accuracy actually went up to 78% but the testing data accuracy rate went down to 78% compared to the linear kernel. This is interesting, as one would expect the test accuracy rate to go up with a polynomial kernel. Additionally, since the training rates and testing rates are the same, this indicates that the classifier did not overfit the data. Notably though, our false positive rate went down to 10% and the false negative rate went up to 52%. This is not good, so I would rule this out as a good classifier.

1.6 Radial SVM

Maybe a radial kernel will increase our test data accuracy rate. Again, we’ll perform 10-fold cross validation, but this time the parameters are cost and gamma.

tuned.rad = tune(svm, chd~., data = dat.train, kernel = "radial",  
                 ranges = list(gamma = 2^(-5:5), cost = 2^(-5:5)))
svm.preds.rad = predict(tuned.rad$best.model, dat.test[,1:9])
table(dat.test[,10], svm.preds.rad)

   svm.train.rad    svm.preds.rad
      0  1             0  1
  0 212 19          0 64  7
  1  52 79          1 12 17

So with a radial kernel, the training data accuracy rate is 80%. Additionally, the test data accuracy rate went up to 81%. This is the best accuracy rate we have achieved thus far. The false negative rate did go up slightly compared to KDC and Linear SVM, to 41%, but the false positive rate remained at 10%, which was the lowest false positive rate achieved.

2. Analysis

Classification Method	Training Rate	Testing Rate
KNN	75%	72%
KDC	74%	74%
Linear SVM	72%	79%
Polynomial SVM	78%	78%
Radial SVM	80%	81%

Classification Method	False Positive Rate	False Negative Rate
KNN	13%	65%
KDC	21%	38%
Linear SVM	14%	38%
Polynomial SVM	10%	52%
Radial SVM	10%	41%

We see that the highest possible test accuracy rate we get is 81% with an average of about 80% for SVMs and 74% and 72% for KNN and KDC. Ideally, we would like this to be higher, especially because the false negative rate is pretty high across all methods which could lead to disastrous consequences given the context of the classification.

What this indicates is that there is some class imbalance, and if we look at the full data set, we can confirm this, with only about 35% of observations belonging to the positive class. What happens in this situation for SVMs is that since our misclassification cost C is the same for both classes, a higher proportion of positive observations will be classified as negative, leading to a higher false negative rate. In order to rectify this, we can assign weights to each class, with the positive class having a higher weight. This doesn’t always work though. If we assign class weights on our best SVM model,