1. South African Heart Disease
Let’s look at applying these three methods to some empirical data. We’ll first start by using a data set from Rousseauw (1983) describing a retrospective sample of males in a heart-disease high-risk region of the Western Cape, South Africa. This data set has 462 observations on 9 predictors and 2 classes. We’ll be using the
e1071 package for SVM, the
klaR package for KDC, and the
class package for k-NN.
The data are a mix of categorical and numerical. The numeric features are systolic blood pressure (sbp), cumulative tobacco usage (tobacco), cholesterol (ldl), body adiposity index (adiposity), type-a behavior (typea), body mass index (obesity), alcohol consumption (alcohol), and age (age). The categorical feature is family history (famhist). Below is a listing of the summary statistics.
sbp tobacco ldl adiposity Min. :101.0 Min. : 0.0000 Min. : 0.980 Min. : 6.74 1st Qu.:124.0 1st Qu.: 0.0525 1st Qu.: 3.283 1st Qu.:19.77 Median :134.0 Median : 2.0000 Median : 4.340 Median :26.11 Mean :138.3 Mean : 3.6356 Mean : 4.740 Mean :25.41 3rd Qu.:148.0 3rd Qu.: 5.5000 3rd Qu.: 5.790 3rd Qu.:31.23 Max. :218.0 Max. :31.2000 Max. :15.330 Max. :42.49 famhist typea obesity alcohol Absent :270 Min. :13.0 Min. :14.70 Min. : 0.00 Present:192 1st Qu.:47.0 1st Qu.:22.98 1st Qu.: 0.51 Median :53.0 Median :25.80 Median : 7.51 Mean :53.1 Mean :26.04 Mean : 17.04 3rd Qu.:60.0 3rd Qu.:28.50 3rd Qu.: 23.89 Max. :78.0 Max. :46.58 Max. :147.19 age chd Min. :15.00 0:302 1st Qu.:31.00 1:160 Median :45.00 Mean :42.82 3rd Qu.:55.00 Max. :64.00
We first want to separate the observations into a training set and a testing set. We’ll randomly select 362 observations as the training set, and the remaining 100 observations will serve as the testing set.
dat = read.csv("SAheart.csv") ind.train = sample(462, 362) dat.train = dat[ind.train,] dat.test = dat[-ind.train,]
Since the data are over a range of different values we first need to pre-process the data. The data has no
NA values but we are going to need to center and scale the data to mean 0 and standard deviation 1. We need to do this because KNN and SVM calculate the distances between points so features that are not on the same scale will skew the data. This can have the effect of overscaling or underscaling the effects of certain features when instead we want each feature to contribute to the classifier equally.
In order to do this, we want to first scale the training set by subtracting and dividing each feature by its respective mean and standard deviation. Then we’ll subtract and divide the features in the testing set by the same mean and standard deviation. It is important that we use the values from the training set to scale the testing set because in a real world situation we only have the training set to scale our new observations on.
means = colMeans(dat.train[ind]) sds = apply(dat.train[ind], 2, sd) dat.train[ind] = sweep(dat.train[ind], 2, means, "-") dat.train[ind] = sweep(dat.train[ind], 2, sds, "/") dat.test[ind] = sweep(dat.test[ind], 2, means, "-") dat.test[ind] = sweep(dat.test[ind], 2, sds, "/")
1.2 K-Nearest Neighbors
Let’s first try applying k-nearest neighbors with k = 5. Since KNN calculates distances between points we need to convert our categorical variable
famhist to a numerical representation. We’ll just simply use the
as.numeric() function to do this.
test.knn = dat.test test.knn$famhist = as.numeric(test.knn$famhist) - 1 train.knn = dat.train train.knn$famhist = as.numeric(train.knn$famhist) - 1
After converting, we can find a classifier using the
knn() in the
class package. This function both trains the classifier using the training set and makes class predictions on the testing set. We’ll also give a confusion matrix for the training set and testing set.
kn.train = knn(train.knn[,1:9], train.knn[,1:9], train.knn[,10], k = 5) kn.test = knn(train.knn[,1:9], test.knn[,1:9], train.knn[,10], k = 5) table(train.knn[,10], kn.train) table(test.knn[,10], kn.test)
kn.train kn.test 0 1 0 1 0 197 34 0 62 9 1 56 75 1 19 10
So by using a k-nearest neighbor classifier, we obtain a 75% training accuracy rate. More importantly however, is that the test accuracy rate is 72%. Unfortunately though, our false negative rate is rather high at 65%. In comparison the false positive rate is 13%. A high false negative rate is not good in this context because it means patients who have heart disease are not being diagnosed, which can be deadly.
1.3 Kernel Density Classification
Now let’s try applying a Naive Bayes classifier by generating a kernel density estimate. The
NaiveBayes() function in the
klaR makes this easy by including a
usekernel argument that when set generates a kernel density estimate using
density() and then computes the classifier based on that.
kdc = NaiveBayes(chd~., dat.train, usekernel = T) kdc.train = predict(kdc, dat.train[,1:9]) kdc.preds = predict(kdc, dat.test[,1:9]) table(dat.train[,10], kdc.train$class) table(dat.test[,10], kdc.preds$class)
kdc.train$class kdc.preds$class 0 1 0 1 0 183 48 0 56 15 1 46 85 1 11 18
So with a kernel density estimate and a Naive Bayes classifier we achieve a 74% training accuracy rate, which is higher than our KNN training accuracy. But we also get a 74% test accuracy rate. Unfortunately though, our false positive rate has gone up to 21% while the false negative rate went down to 38%. Ideally, if the false negative rate were to lower, we would want the true positive or negative rates to get higher instead of the false positive. However, in this context, a higher false positive rate could be acceptable because it could just indicate that further tests would need to be performed to diagnose a patient.
1.4 Linear SVM
Let’s first start by applying a linear decision boundary on the training data. We’ll perform 10-fold cross validation on a range of cost parameters and make a prediction for the test data using the best model.
tuned.lin = tune(svm, chd~., data = dat.train, kernel = "linear", ranges = list(cost = 2^(-5:5))) svm.train.lin = predict(tuned.lin$best.model, dat.train[,1:9]) svm.preds.lin = predict(tuned.lin$best.model, dat.test[,1:9]) table(dat.train[,10], svm.train.lin) table(dat.test[,10], svm.preds.lin)
svm.train.lin svm.preds.lin 0 1 0 1 0 192 39 0 61 10 1 61 70 1 11 18
So as we can see, this achieves an 72% accuracy rate on training data. This is not as good as the training accuracy rates from KNN and KDC, but we get a 79% accuracy rate on test data. This is a significant improvement from the KNN and KDC classifiers, and we expect that polynomial and radial kernels will perform better. Additionally, our false positive rate has gone down to 14% and our false negative rate remains at 38%. So far, this is our best model.
1.5 Polynomial SVM
The linear decision boundary is an improvement on the KNN and KDC classifiers so maybe using a polynomial kernel will be even better. As in the linear kernel, we’ll perform 10-fold cross validation on the cost parameter, but also the degree of kernel and gamma.
tuned.poly = tune(svm, chd~., data = dat.train, kernel = "polynomial", ranges = list(degree = (2:5), gamma = 2^(-1:1), cost = 2^(-6:2))) svm.preds.poly = predict(tuned.poly$best.model, dat.test[,1:9]) table(dat.test[,10], svm.preds.poly)
svm.train.poly svm.preds.poly 0 1 0 1 0 211 20 0 64 7 1 59 72 1 15 14
So with a polynomial kernel of degree 2, cost 0.015625, and gamma 2, the training data accuracy actually went up to 78% but the testing data accuracy rate went down to 78% compared to the linear kernel. This is interesting, as one would expect the test accuracy rate to go up with a polynomial kernel. Additionally, since the training rates and testing rates are the same, this indicates that the classifier did not overfit the data. Notably though, our false positive rate went down to 10% and the false negative rate went up to 52%. This is not good, so I would rule this out as a good classifier.
1.6 Radial SVM
Maybe a radial kernel will increase our test data accuracy rate. Again, we’ll perform 10-fold cross validation, but this time the parameters are cost and gamma.
tuned.rad = tune(svm, chd~., data = dat.train, kernel = "radial", ranges = list(gamma = 2^(-5:5), cost = 2^(-5:5))) svm.preds.rad = predict(tuned.rad$best.model, dat.test[,1:9]) table(dat.test[,10], svm.preds.rad)
svm.train.rad svm.preds.rad 0 1 0 1 0 212 19 0 64 7 1 52 79 1 12 17
So with a radial kernel, the training data accuracy rate is 80%. Additionally, the test data accuracy rate went up to 81%. This is the best accuracy rate we have achieved thus far. The false negative rate did go up slightly compared to KDC and Linear SVM, to 41%, but the false positive rate remained at 10%, which was the lowest false positive rate achieved.
|Classification Method||Training Rate||Testing Rate|
|Classification Method||False Positive Rate||False Negative Rate|
We see that the highest possible test accuracy rate we get is 81% with an average of about 80% for SVMs and 74% and 72% for KNN and KDC. Ideally, we would like this to be higher, especially because the false negative rate is pretty high across all methods which could lead to disastrous consequences given the context of the classification.
What this indicates is that there is some class imbalance, and if we look at the full data set, we can confirm this, with only about 35% of observations belonging to the positive class. What happens in this situation for SVMs is that since our misclassification cost C is the same for both classes, a higher proportion of positive observations will be classified as negative, leading to a higher false negative rate. In order to rectify this, we can assign weights to each class, with the positive class having a higher weight. This doesn’t always work though. If we assign class weights on our best SVM model,