Jing Peng




SVMs have been applied successfully in many areas, including handwriting recognition, text classification, and image retrieval. However, the performance of an SVM decreases significantly on unbalanced data. Applications such as disease detection and credit card fraud detection have highly skewed datasets with very few minority class instances, and those instances are hard to classify correctly even though the information they carry is very important. Default classifiers generally perform badly on imbalanced data because they implicitly assume the classes are balanced. In this section I explore the empirical behavior of the linear SVM on unbalanced data. In particular, I introduce the confusion matrix and the SVM with class weights, and illustrate these concepts with a simulation and a real data analysis. For unbalanced data, our primal problem now becomes:
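A common way to write the class-weighted soft-margin primal is given below. The notation may differ from the author's formulation: the symbols \(w\), \(b\), \(\xi_i\), \(C_{+}\), and \(C_{-}\) are assumptions introduced here, with \(C_{+} = \text{cost}\) for the majority class and \(C_{-} = \text{weight} \times \text{cost}\) for the minority class, which is how the cost and weight parameters of this section are read in this sketch.

\[
\min_{w,\,b,\,\xi}\ \frac{1}{2}\|w\|^{2} + C_{+}\sum_{i:\,y_i=+1}\xi_i + C_{-}\sum_{i:\,y_i=-1}\xi_i
\quad\text{subject to}\quad y_i\,(w^{\top}x_i + b) \ge 1-\xi_i,\quad \xi_i \ge 0 .
\]

Setting \(C_{+} = C_{-}\) recovers the usual soft-margin SVM, so the weight only changes how expensive a margin violation by a minority class observation is.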

1. Confusion Matrix

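This study focuses on binary class data. In imbalanced binary data, we need to assign different class weights, or equivalently different misclassification costs, to the two classes. We first discuss the confusion matrix and performance metrics. We take the majority class as the positive class (+) and the minority class as the negative class (-). Let \(N_{ij}\) be the total number of cases predicted as class \(i\) whose true class is \(j\), where class 1 is the positive class and class 2 is the negative class. The following is a confusion matrix:

	True + (class 1)	True - (class 2)	Total
Predicted + (class 1)	\(N_{11}\) (true positives)	\(N_{12}\) (false positives)	\(N_{1.}\)
Predicted - (class 2)	\(N_{21}\) (false negatives)	\(N_{22}\) (true negatives)	\(N_{2.}\)
Total	\(N_{.1}\)	\(N_{.2}\)	\(N_{..}\)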

In the confusion matrix, the misclassifications are the false positives (\(N_{12}\)) and the false negatives (\(N_{21}\)), while \(N_{11}\) and \(N_{22}\) count the cases predicted correctly. Several performance metrics for choosing tuning parameters can be derived from the confusion matrix. Two popular ones are:

\(\text{Accuracy} = \frac{1}{N_{..}}(N_{11} + N_{22})\)
\(\text{Kappa} = (\text{Observed Accuracy} - \text{Expected Accuracy})/(1 - \text{Expected Accuracy})\), where \(\text{Observed Accuracy} = \frac{1}{N_{..}}(N_{11} + N_{22})\) and \(\text{Expected Accuracy} = \frac{1}{N_{..}^2}(N_{1.}\times N_{.1} +N_{2.}\times N_{.2})\) (Wikipedia, "Cohen's kappa", Calculation). Here \(N_{i.}\) is the number of cases predicted as class \(i\), \(N_{.j}\) is the number of cases whose true class is \(j\), and \(N_{..}\) is the total number of cases.
Written in terms of counts, \(\text{Kappa} = (N_{..}(N_{11} + N_{22}) - (N_{1.} \times N_{.1} + N_{2.} \times N_{.2}))/ (N_{..}^{2} - (N_{1.} \times N_{.1} + N_{2.} \times N_{.2}))\).
Kappa measures how far the observed accuracy exceeds the accuracy expected from chance agreement, relative to the largest excess that is possible.

Most tuning functions, such as tune() in the R package e1071 and train() in the R package caret, use accuracy as the default performance metric. However, because accuracy does not consider the misclassifications of each class separately, errors on the minority class are largely masked. Therefore, I use kappa as the performance metric for choosing the optimal tuning parameters.
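To make the contrast between accuracy and kappa concrete, the sketch below computes both from the four cells of a 2x2 confusion matrix, following the definitions above. It is written in plain Python rather than the R workflow used in this section; the function name `accuracy_and_kappa` and the all-majority classifier in the example are hypothetical illustrations, not part of the original analysis.

```python
def accuracy_and_kappa(n11, n12, n21, n22):
    """Return (accuracy, kappa) for a 2x2 confusion matrix.

    n_ij counts the cases predicted as class i whose true class is j,
    with class 1 = majority (+) and class 2 = minority (-).
    """
    total = n11 + n12 + n21 + n22
    observed = (n11 + n22) / total            # observed accuracy
    row1, row2 = n11 + n12, n21 + n22         # predicted totals N_1., N_2.
    col1, col2 = n11 + n21, n12 + n22         # true totals N_.1, N_.2
    expected = (row1 * col1 + row2 * col2) / total ** 2
    if expected == 1.0:
        # Kappa is undefined when chance agreement is already perfect.
        return observed, float("nan")
    kappa = (observed - expected) / (1.0 - expected)
    return observed, kappa


# Hypothetical rule that labels all 1000 cases (900 majority, 100 minority)
# as the majority class.
acc, kappa = accuracy_and_kappa(900, 100, 0, 0)
print(f"accuracy = {acc:.3f}, kappa = {kappa:.3f}")
```

For this rule the accuracy is 0.9 while kappa is 0, because the observed agreement is no better than what the class proportions alone would produce. This is the behavior that motivates tuning on kappa.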

2. Simulation

The following simulation shows how building an SVM model with class weights improves kappa. The simulated data has 1000 observations in two classes: the majority class (\(+1\)) has 900 observations and the minority class (\(-1\)) has 100.

I randomly choose 800 observations as training data; \(723\) belong to the majority class and 77 to the minority class. With 10-fold cross-validation, the optimal value of the tuning parameter cost is \(0.01\), which gives the largest kappa.

cost	Accuracy	Kappa
1e-02	0.9524351	0.7277368
1e-01	0.9437002	0.6267951
5e-01	0.9462002	0.6482385
1e+00	0.9462002	0.6576304
2e+00	0.9462002	0.6482385
5e+00	0.9462002	0.6482385
1e+01	0.9462002	0.6482385
1e+02	0.9474502	0.6537532

After building the linear SVM model with \(\text{cost} = 0.01\), we get 151 support vectors: 76 from the majority class and 75 from the minority class. The training error is \(6\%\) by the accuracy criterion. The test accuracy is \(94\%\), but kappa is only \(61.87\%\): 12 minority class observations are predicted as majority class, and no majority class observation is predicted wrongly. The chosen hyperplane therefore largely ignores the information in the minority class.

I then build a model that takes the class weight into account. After cross-validation, the optimal cost is \(0.1\) and the optimal weight is \(3\), with the largest kappa value of \(74.26\%\). I then use those parameters to build the SVM model, which has 129 support vectors in total: 96 from the majority class and 33 from the minority class. When this model predicts the test data, I get \(\text{kappa} = 79.08\%\); 3 observations of the minority class are predicted as majority and 6 observations of the majority class are predicted as minority. Adding the weight improves kappa by \(17.21\) percentage points. The following two figures show the hyperplane of the SVM model without weight (left) and with weight (right). After adding the weight, the hyperplane clearly moves towards the majority class, so fewer minority class observations are classified as majority class. This shows that the linear SVM is very sensitive to unbalanced data.

One interesting finding is that the ratio of support vectors from the majority class to those from the minority class is about \(3:1\), while the weight ratio is \(1:3\). In the model without weight, the support vector ratio is about \(1:1\). This suggests that the weight controls how many observations of each class become support vectors, and hence the complexity of the boundary.
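The two test kappas reported above can be checked by hand. This assumes the 200 held-out observations consist of \(900 - 723 = 177\) majority and \(100 - 77 = 23\) minority cases, which follows from the counts given earlier but is not stated explicitly. Kappa is computed as \((\text{Observed Accuracy} - \text{Expected Accuracy})/(1 - \text{Expected Accuracy})\), with numerator and denominator multiplied by \(N_{..}^{2} = 200^{2}\).

Without weight, \(N_{11} = 177\), \(N_{12} = 12\), \(N_{21} = 0\), \(N_{22} = 11\), so \(N_{1.} = 189\) and \(N_{2.} = 11\):

\[
\text{Kappa} = \frac{200 \times 188 - (189 \times 177 + 11 \times 23)}{200^{2} - (189 \times 177 + 11 \times 23)} = \frac{37600 - 33706}{40000 - 33706} = \frac{3894}{6294} \approx 0.6187
\]

With weight, \(N_{11} = 171\), \(N_{12} = 3\), \(N_{21} = 6\), \(N_{22} = 20\), so \(N_{1.} = 174\) and \(N_{2.} = 26\):

\[
\text{Kappa} = \frac{200 \times 191 - (174 \times 177 + 26 \times 23)}{200^{2} - (174 \times 177 + 26 \times 23)} = \frac{38200 - 31396}{40000 - 31396} = \frac{6804}{8604} \approx 0.7908
\]

Both values agree with the reported \(61.87\%\) and \(79.08\%\), while the accuracy only moves from \(188/200 = 94\%\) to \(191/200 = 95.5\%\).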

In addition, the reason we choose kappa rather than accuracy as our criterion is that, for a fixed cost, kappa responds to changes in the weight while accuracy barely does. I first fix \(\text{cost} = 0.01\), the optimal cost under cross-validation without weight. I then examine how the weight influences the classification of the test data through kappa, accuracy, the number of misclassified minority class observations (mis_minor), and the number of misclassified majority class observations (mis_major), as shown in the following table.

Weight	Kappa	Accuracy	mis_minor	mis_major
1.0	0.659	0.948	31	11
1.2	0.662	0.946	29	14
1.4	0.680	0.948	26	16
1.6	0.719	0.951	20	19
1.8	0.705	0.948	19	23
2.0	0.731	0.951	16	23
2.2	0.739	0.953	15	23
2.4	0.732	0.949	12	29
2.6	0.743	0.950	10	30
2.8	0.743	0.950	10	30
3.0	0.743	0.950	10	30
3.2	0.742	0.950	10	30
3.4	0.733	0.946	8	35
3.6	0.723	0.944	8	37
3.8	0.726	0.944	7	38
4.0	0.724	0.943	6	40
4.2	0.719	0.941	6	41
4.4	0.719	0.941	6	41
4.6	0.719	0.941	6	41
4.8	0.719	0.941	6	41
5.0	0.719	0.941	6	41

From the table, as the weight increases the accuracy does not vary much, but kappa increases substantially until \(\text{weight} = 3\) and then begins to decrease. The accuracy also decreases a little when the weight is greater than 3. Therefore, a larger weight does not mean a better fit. As the weight grows, the number of misclassified minority class observations decreases, but the number of misclassified majority class observations increases.

Our goal when analyzing unbalanced data is to prevent the majority class from dominating the choice of hyperplane and support vectors. When the classes are unbalanced, the density of majority class observations is higher than that of the minority class, even around the hyperplane. As a consequence, in order to reduce the total number of misclassifications, the hyperplane is chosen skewed towards the minority class, which lowers the model's performance on the minority class: the decision function becomes more likely to classify a point near the hyperplane as majority class. We cannot ignore the information from the minority class, so we put a larger penalty on misclassifying minority class observations. Thus kappa is a more appropriate criterion than accuracy for choosing the tuning parameters cost and weight.
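The effect of a larger penalty on the minority class can be illustrated with a small sketch. It evaluates the class-weighted hinge penalty of two candidate boundaries on a made-up one-dimensional data set. The function `weighted_hinge_loss`, the toy points, and the two boundaries are all hypothetical, the margin term is left out because both candidates share the same \(w\), and this is not the optimizer used by e1071 or caret.

```python
def weighted_hinge_loss(points, labels, w, b, cost=1.0, minority_weight=1.0):
    """Sum of class-weighted hinge penalties for the linear rule sign(w.x + b).

    labels are +1 (majority) or -1 (minority). The slack of a minority point
    is multiplied by minority_weight, the slack of a majority point by 1.
    The 0.5 * ||w||^2 margin term is deliberately omitted.
    """
    total = 0.0
    for x, y in zip(points, labels):
        score = sum(wi * xi for wi, xi in zip(w, x)) + b
        slack = max(0.0, 1.0 - y * score)          # hinge penalty of one point
        class_weight = minority_weight if y == -1 else 1.0
        total += cost * class_weight * slack
    return total


# Toy 1-D data (made up): six majority points (+1) and two minority points (-1).
points = [[2.0], [3.0], [3.0], [4.0], [4.0], [5.0], [1.0], [2.5]]
labels = [+1, +1, +1, +1, +1, +1, -1, -1]

w = [1.0]
near_minority = -1.5   # boundary at x = 1.5, close to the minority class
near_majority = -3.0   # boundary at x = 3.0, shifted towards the majority class

for weight in (1.0, 3.0):
    loss_a = weighted_hinge_loss(points, labels, w, near_minority, minority_weight=weight)
    loss_b = weighted_hinge_loss(points, labels, w, near_majority, minority_weight=weight)
    print(f"weight={weight}: boundary x=1.5 -> {loss_a}, boundary x=3.0 -> {loss_b}")
```

With weight 1 the boundary at x = 1.5 has the smaller penalty (3.0 versus 4.5), although it puts the minority point at 2.5 on the majority side. With weight 3 the boundary at x = 3.0 has the smaller penalty (5.5 versus 8.0), so the preferred boundary moves towards the majority class, in line with the simulation.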

3. Wisconsin Breast Cancer
