Jing Peng
SVM has been applied successfully in many areas, including handwriting recognition, text classification, and image retrieval. However, its performance decreases significantly on unbalanced data. Applications such as disease detection and credit card fraud detection have highly skewed datasets with very few minority class instances, which are hard to classify correctly even though the minority class carries the most important information. Default classifiers generally perform badly on imbalanced data because they implicitly assume the classes are balanced. In this section I explore the empirical behavior of linear SVM on unbalanced data. In particular, I introduce the confusion matrix and SVM with class weights, and illustrate both with a simulation and a real data analysis. For unbalanced data, our primal problem now becomes:
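One common way to write a class-weighted soft-margin primal is sketched below. The separate costs \(C_{+}\) and \(C_{-}\) are symbols introduced here for illustration and are not taken from the original text; they scale the slack penalties of the majority class (\(+1\)) and the minority class (\(-1\)).

```latex
\[
\min_{w,\,b,\,\xi}\ \frac{1}{2}\lVert w\rVert^{2}
  + C_{+}\sum_{i:\,y_i=+1}\xi_i
  + C_{-}\sum_{i:\,y_i=-1}\xi_i
\quad\text{subject to}\quad
y_i\,(w^{\top}x_i+b)\ \ge\ 1-\xi_i,\qquad \xi_i\ \ge\ 0 .
\]
```

Under this reading, choosing \(C_{-} > C_{+}\) penalizes margin violations by minority points more heavily; a class weight of 3 on the minority class would correspond to \(C_{-} = 3\,C_{+}\).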
1. Confusion Matrix
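This study focuses on binary class data. With imbalanced binary data, we need to assign different class weights, that is, different misclassification costs, to the two classes. To judge the result we need a confusion matrix and the performance metrics derived from it. We assign the positive class (\(+\)) to the majority class and the negative class (\(-\)) to the minority class. Let \(N_{ij}\) be the number of cases predicted as class \(i\) whose true class is \(j\), with \(N_{i.}\) and \(N_{.j}\) the row and column totals and \(N_{..}\) the grand total. The following is a confusion matrix:
\[\begin{array}{l|cc|c} & \text{True } + & \text{True } - & \text{Total} \\ \hline \text{Predicted } + & N_{11} & N_{12} & N_{1.} \\ \text{Predicted } - & N_{21} & N_{22} & N_{2.} \\ \hline \text{Total} & N_{.1} & N_{.2} & N_{..} \end{array}\]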
In the confusion matrix, the misclassification information is contained in the False Positives (\(N_{12}\)) and the False Negatives (\(N_{21}\)), while \(N_{11}\) and \(N_{22}\) count the correctly predicted cases. Several performance metrics for choosing tuning parameters can be derived from the confusion matrix. Two popular ones are:

\(\text{Accuracy} = \frac{1}{N_{..}}(N_{11} + N_{22})\)

\(\text{Kappa} = (\text{Observed Accuracy} - \text{Expected Accuracy})/(1 - \text{Expected Accuracy})\), where \(\text{Observed Accuracy} = \frac{1}{N_{..}}(N_{11} + N_{22})\) and \(\text{Expected Accuracy} = \frac{1}{N_{..}^{2}}(N_{1.}\times N_{.1} + N_{2.}\times N_{.2})\) (Wikipedia, Cohen's kappa, Calculation)

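\(\text{Kappa} = \dfrac{N_{..}(N_{11} + N_{22}) - (N_{1.} \times N_{.1} + N_{2.} \times N_{.2})}{N_{..}^{2} - (N_{1.} \times N_{.1} + N_{2.} \times N_{.2})}\)
This form follows from substituting the definitions of Observed Accuracy and Expected Accuracy and multiplying the numerator and denominator by \(N_{..}^{2}\). Kappa therefore measures how far the Observed Accuracy exceeds the Expected Accuracy (the agreement expected by chance from the row and column totals), relative to the largest possible excess.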
Most tuning functions, such as tune() in the R package e1071 and train() in the R package caret, use accuracy as the default performance metric. However, because accuracy does not consider the misclassifications of each class separately, errors on the minority class have little effect on it. Therefore, I use Kappa as the performance metric for choosing the optimal tuning parameters.
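The difference between the two metrics can be made concrete with a minimal Python sketch. The function name `confusion_metrics` and the degenerate "always predict the majority class" classifier are hypothetical, introduced only for illustration; the 900/100 class split mirrors the simulation in the next section.

```python
def confusion_metrics(n11, n12, n21, n22):
    """Return (accuracy, kappa) for a 2x2 confusion matrix.

    n_ij counts cases predicted as class i whose true class is j,
    with class 1 = majority (+) and class 2 = minority (-).
    """
    total = n11 + n12 + n21 + n22
    observed = (n11 + n22) / total
    # Row sums are predicted totals, column sums are true totals.
    expected = ((n11 + n12) * (n11 + n21)
                + (n21 + n22) * (n12 + n22)) / total**2
    if expected == 1.0:
        # Kappa is undefined in this case; returning 0.0 is a convention.
        return observed, 0.0
    kappa = (observed - expected) / (1 - expected)
    return observed, kappa


# Hypothetical classifier that labels all 1000 points as the majority class:
# 900 majority points are correct, 100 minority points are all wrong.
acc, kappa = confusion_metrics(n11=900, n12=100, n21=0, n22=0)
print(f"accuracy = {acc:.3f}, kappa = {kappa:.3f}")
# prints: accuracy = 0.900, kappa = 0.000
```

Accuracy rewards this classifier with 90% even though it never detects a minority case, while Kappa is zero because the agreement is no better than what the class totals produce by chance.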
2. Simulation
The following simulation shows how building an SVM model with class weights improves kappa. The simulated data has 1000 observations in two classes: the majority class (\(+1\)) has 900 observations, while the minority class (\(-1\)) has 100 observations.
I randomly choose 800 observations as training data, of which \(723\) belong to the majority class and 77 to the minority class. After 10-fold cross-validation, the optimal value of the tuning parameter cost is \(0.01\), which gives the largest kappa value.
cost   Accuracy   Kappa

1e-02  0.9524351  0.7277368
1e-01  0.9437002  0.6267951
5e-01  0.9462002  0.6482385
1e+00  0.9462002  0.6576304
2e+00  0.9462002  0.6482385
5e+00  0.9462002  0.6482385
1e+01  0.9462002  0.6482385
1e+02  0.9474502  0.6537532
After building the linear SVM model with \(\text{cost} = 0.01\), we get 151 support vectors: 76 from the majority class and 75 from the minority class. The training error is \(6\%\) by the accuracy criterion. The test accuracy is \(94\%\), but kappa is only \(61.87\%\); 12 minority class observations are predicted as majority class and no majority class observation is misclassified. Therefore, the information in the minority class has been largely ignored in the choice of hyperplane.
I then build a model that takes class weights into consideration. After cross-validation, the optimal cost is \(0.1\) and the optimal weight is \(3\), with the largest kappa value of \(74.26\%\). I then use those parameters to build the SVM model, which has 129 support vectors in total: 96 from the majority class and 33 from the minority class. When using this model to predict the test data, I get \(\text{kappa} = 79.08\%\); 3 observations of the minority class are predicted as majority and 6 observations of the majority class are predicted as minority. After adding the weight, kappa improves by \(17.21\) percentage points. The following two figures show the hyperplane of the SVM model without weight (left) and with weight (right). After adding the weight, the hyperplane clearly moves towards the majority class; therefore, fewer observations of the minority class are classified as majority class. This shows that linear SVM is very sensitive to unbalanced data.
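The reported test kappa values can be checked from the misclassification counts alone. The sketch below assumes the test set holds the remaining 200 observations, that is 177 majority (\(900 - 723\)) and 23 minority (\(100 - 77\)); these totals are derived from the numbers in the text rather than stated there, and the helper name `kappa_from_counts` is hypothetical.

```python
def kappa_from_counts(n11, n12, n21, n22):
    """Cohen's kappa for a 2x2 table; n_ij = predicted class i, true class j."""
    total = n11 + n12 + n21 + n22
    observed = (n11 + n22) / total
    expected = ((n11 + n12) * (n11 + n21)
                + (n21 + n22) * (n12 + n22)) / total**2
    return (observed - expected) / (1 - expected)


# Assumed test set: 177 majority (+) and 23 minority (-) observations.
# Unweighted model: 12 minority predicted as majority, no majority errors.
k_plain = kappa_from_counts(n11=177, n12=12, n21=0, n22=11)

# Weighted model: 3 minority predicted as majority, 6 majority as minority.
k_weighted = kappa_from_counts(n11=171, n12=3, n21=6, n22=20)

print(f"kappa without weight: {100 * k_plain:.2f}%")     # 61.87%
print(f"kappa with weight:    {100 * k_weighted:.2f}%")  # 79.08%
print(f"improvement:          {100 * (k_weighted - k_plain):.2f} points")  # 17.21
```

Both values agree with the figures quoted above, which supports reading the \(17.21\%\) improvement as a difference in percentage points. The accuracy changes much less: 188 of 200 correct without the weight against 191 of 200 with it.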
One interesting finding is that the number of support vectors is in a ratio of about \(3:1\) for the majority class to the minority class, while the weight is \(1:3\). In the model without weight, the ratio of support vectors is about \(1:1\). Therefore, the weight appears to control the complexity of the boundary defined by the support vectors.
In addition, the reason we choose kappa as our criterion instead of accuracy is that the trend of kappa follows the trend of the weight when the cost is fixed. I first fix \(\text{cost} = 0.01\), the optimal cost under cross-validation without weight. I then look at how the weight influences the classification of test data in terms of kappa, accuracy, the number of misclassified minority class observations (mis_minor), and the number of misclassified majority class observations (mis_major), as in the following table.
Weight  Kappa  Accuracy  mis_minor  mis_major 

1.0  0.659  0.948  31  11 
1.2  0.662  0.946  29  14 
1.4  0.680  0.948  26  16 
1.6  0.719  0.951  20  19 
1.8  0.705  0.948  19  23 
2.0  0.731  0.951  16  23 
2.2  0.739  0.953  15  23 
2.4  0.732  0.949  12  29 
2.6  0.743  0.950  10  30 
2.8  0.743  0.950  10  30 
3.0  0.743  0.950  10  30 
3.2  0.742  0.950  10  30 
3.4  0.733  0.946  8  35 
3.6  0.723  0.944  8  37 
3.8  0.726  0.944  7  38 
4.0  0.724  0.943  6  40 
4.2  0.719  0.941  6  41 
4.4  0.719  0.941  6  41 
4.6  0.719  0.941  6  41 
4.8  0.719  0.941  6  41 
5.0  0.719  0.941  6  41 
From the table, we see that as the weight increases, the accuracy does not vary much, but Kappa rises markedly until about \(\text{weight} = 3\) and then begins to decrease. The accuracy also decreases a little when the weight is greater than 3. Therefore, a larger weight does not mean a better fit. The number of misclassified minority class observations decreases, but the number of misclassified majority class observations increases.
Our goal when analyzing unbalanced data is to keep the majority class from dominating the choice of hyperplane and support vectors. When the classes are unbalanced, the density of majority class observations is higher than that of the minority class even around the hyperplane. As a consequence, in order to reduce the total number of misclassifications, the hyperplane is skewed towards the minority class, which lowers the model's performance on the minority class: the decision function becomes more likely to classify a point near the hyperplane as majority class. We cannot ignore the information from the minority class; therefore, we impose a larger penalty on the misclassification of minority class observations. Thus kappa is a more appropriate criterion than accuracy for choosing the tuning parameters cost and weight.
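The trade-off described above can be read directly from the table. The following Python sketch copies the table rows as printed and compares which weight each criterion would select; the variable names are hypothetical, and ties are resolved in favor of the smallest weight because `max` and `min` return the first extreme row.

```python
# (weight, kappa, accuracy, mis_minor, mis_major), copied from the table above.
rows = [
    (1.0, 0.659, 0.948, 31, 11), (1.2, 0.662, 0.946, 29, 14),
    (1.4, 0.680, 0.948, 26, 16), (1.6, 0.719, 0.951, 20, 19),
    (1.8, 0.705, 0.948, 19, 23), (2.0, 0.731, 0.951, 16, 23),
    (2.2, 0.739, 0.953, 15, 23), (2.4, 0.732, 0.949, 12, 29),
    (2.6, 0.743, 0.950, 10, 30), (2.8, 0.743, 0.950, 10, 30),
    (3.0, 0.743, 0.950, 10, 30), (3.2, 0.742, 0.950, 10, 30),
    (3.4, 0.733, 0.946, 8, 35), (3.6, 0.723, 0.944, 8, 37),
    (3.8, 0.726, 0.944, 7, 38), (4.0, 0.724, 0.943, 6, 40),
    (4.2, 0.719, 0.941, 6, 41), (4.4, 0.719, 0.941, 6, 41),
    (4.6, 0.719, 0.941, 6, 41), (4.8, 0.719, 0.941, 6, 41),
    (5.0, 0.719, 0.941, 6, 41),
]

best_by_kappa = max(rows, key=lambda r: r[1])
best_by_accuracy = max(rows, key=lambda r: r[2])
fewest_errors = min(rows, key=lambda r: r[3] + r[4])

# Minority errors never rise and majority errors never fall as weight grows.
minor_falls = all(a[3] >= b[3] for a, b in zip(rows, rows[1:]))
major_rises = all(a[4] <= b[4] for a, b in zip(rows, rows[1:]))

print("weight chosen by kappa:   ", best_by_kappa[0])     # 2.6
print("weight chosen by accuracy:", best_by_accuracy[0])  # 2.2
print("weight with fewest errors:", fewest_errors[0])     # 2.2
print("monotone trade-off:", minor_falls and major_rises)  # True
```

Accuracy and the raw error count both pick weight 2.2, where 15 minority observations are still misclassified, whereas kappa picks the plateau starting at 2.6, where only 10 are. Kappa is therefore the criterion that keeps responding to improvements on the minority class.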
3. Wisconsin Breast Cancer