1: Support Vector Machine (SVM)


The idea behind the Support Vector Machine is straightforward: given observations \(x_i \in \mathbb{R}^p\), \(i = 1, \ldots, n\), where each observation \(x_i\) has a class label \(y_i \in \{-1, 1\}\), we want to find a hyperplane that separates the observations by class and also maximizes the minimum distance of the observations to the hyperplane.

First, we denote the best hyperplane as \(w^T x + b = 0\). From linear algebra, we know that the distance from a point \(x_0\) to the plane \(w^T x + b = 0\) is

\[ \text{Distance} = \frac{|w^T x_0 + b|}{\|w\|}. \]
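As a quick numeric check of this formula, the snippet below (with a hyperplane and point chosen arbitrarily for illustration) computes the distance in NumPy.

```python
# Quick numeric check of the distance formula; w, b, and x0 are arbitrary
# illustrative values.
import numpy as np

w = np.array([3.0, 4.0])     # normal vector of the hyperplane w^T x + b = 0
b = -5.0
x0 = np.array([1.0, 2.0])

distance = abs(w @ x0 + b) / np.linalg.norm(w)
print(distance)              # |3*1 + 4*2 - 5| / 5 = 6/5 = 1.2
```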

We then want the hyperplane to correctly separate the two classes, which is equivalent to satisfying the following restrictions:

\[ \begin{cases} w^T x_i + b > 0, & \text{if } y_i = 1 \\ w^T x_i + b < 0, & \text{if } y_i = -1, \end{cases} \]

which is equivalent to \(y_i(w^T x_i + b) > 0\), \(i = 1, \ldots, n\).

Our goal is to maximize the minimum distance from the observations to the hyperplane, so we have our first optimization problem:

\[ \underset{w,\, b}{\text{maximize}} \;\; \min_i \left\{ \frac{|w^T x_i + b|}{\|w\|} \right\} \quad \text{s.t.} \;\; y_i(w^T x_i + b) > 0, \; i = 1, \ldots, n. \]

Then we define the margin \(M = \min_i \left\{ \frac{|w^T x_i + b|}{\|w\|} \right\}\), and for mathematical convenience we scale \(w\) so that \(\|w\| = 1\):

\[ \underset{w,\, \|w\| = 1,\, b}{\text{maximize}} \;\; M \quad \text{s.t.} \;\; y_i(w^T x_i + b) \geq M, \; i = 1, \ldots, n. \]

Then we define \(v = \frac{w}{M}\); since \(\|w\| = 1\), the norm of \(v\) is \(\|v\| = \frac{1}{M}\). We substitute \(v\) back into our optimization:

\[ \underset{v,\, b}{\text{maximize}} \;\; \frac{1}{\|v\|} \quad \text{s.t.} \;\; y_i\!\left( \frac{w^T}{M} x_i + \frac{b}{M} \right) \geq \frac{M}{M} = 1, \; i = 1, \ldots, n. \]

Then we rename the variable \(v\) back to \(w\), and note that maximizing \(\frac{1}{\|v\|}\) is equivalent to minimizing \(\frac{1}{2} w^T w\). So we get our final optimization problem:

\[ \underset{w,\, b}{\text{minimize}} \;\; \frac{1}{2} w^T w \quad \text{s.t.} \;\; y_i(w^T x_i + b) \geq 1, \; i = 1, \ldots, n. \]
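To make the problem concrete, here is a minimal sketch, assuming a tiny linearly separable toy data set and SciPy's general-purpose SLSQP solver (real SVM software uses specialized solvers instead), that solves this primal problem directly.

```python
# A minimal sketch of the hard-margin primal on a tiny toy set, solved with a
# general-purpose constrained optimizer purely for illustration.
import numpy as np
from scipy.optimize import minimize

X = np.array([[2.0, 2.0], [3.0, 3.0], [-1.0, -1.0], [-2.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
p = X.shape[1]

def objective(theta):
    w = theta[:p]                       # theta packs (w, b)
    return 0.5 * w @ w                  # (1/2) w^T w

constraints = [
    # y_i (w^T x_i + b) - 1 >= 0 for every training observation
    {"type": "ineq",
     "fun": lambda theta, i=i: y[i] * (X[i] @ theta[:p] + theta[p]) - 1.0}
    for i in range(len(y))
]

res = minimize(objective, x0=np.zeros(p + 1), method="SLSQP",
               constraints=constraints)
w, b = res.x[:p], res.x[p]
print("w =", w, "b =", b)
```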

We want to use the method of Lagrange multipliers to solve this optimization problem. We can write our constraints as \(g_i(w) = y_i(w^T x_i + b) - 1 \geq 0\). Let's define

\[ L(w, b, \alpha) = \frac{1}{2} w^T w - \sum_{i=1}^{n} \alpha_i \left[ y_i(w^T x_i + b) - 1 \right], \qquad \alpha_i \geq 0. \]

We observe that the maximum of \(L\) with respect to \(\alpha\) equals \(\frac{1}{2} w^T w\) whenever all constraints hold: if some constraint were violated, the corresponding \(\alpha_i\) could be made arbitrarily large and the maximum would be \(+\infty\), whereas if every constraint holds, the maximum is attained at \(\alpha_i = 0\). So we change our original optimization problem to the new one

\[ \min_{w,\, b} \; \max_{\alpha \geq 0} \; L(w, b, \alpha) = \frac{1}{2} w^T w - \sum_{i=1}^{n} \alpha_i \left[ y_i(w^T x_i + b) - 1 \right], \]

where the \(\alpha_i\) are the Lagrange multipliers.

We can verify that \(w\), \(b\), and \(\alpha\) satisfy the Karush-Kuhn-Tucker (KKT) conditions. Therefore, we can solve the primal problem by solving its dual problem:

\[ \max_{\alpha \geq 0} \; \min_{w,\, b} \; L(w, b, \alpha) = \frac{1}{2} w^T w - \sum_{i=1}^{n} \alpha_i \left[ y_i(w^T x_i + b) - 1 \right]. \]

To solve the dual problem, we first set the partial derivatives of \(L(w, b, \alpha)\) with respect to \(w\) and \(b\) to 0:

\[ \nabla_w L(w, b, \alpha) = w - \sum_{i=1}^{n} \alpha_i y_i x_i = 0, \qquad \frac{\partial}{\partial b} L(w, b, \alpha) = -\sum_{i=1}^{n} \alpha_i y_i = 0. \]

Then we substitute these back into the function \(L(w, b, \alpha)\):

\[
\begin{aligned}
L(w, b, \alpha) &= \frac{1}{2} w^T w - \sum_{i=1}^{n} \alpha_i \left( y_i(w^T x_i + b) - 1 \right) \\
&= \frac{1}{2} w^T w - \sum_{i=1}^{n} \alpha_i y_i w^T x_i - \sum_{i=1}^{n} \alpha_i y_i b + \sum_{i=1}^{n} \alpha_i \\
&= \frac{1}{2} w^T \sum_{i=1}^{n} \alpha_i y_i x_i - w^T \sum_{i=1}^{n} \alpha_i y_i x_i - b \sum_{i=1}^{n} \alpha_i y_i + \sum_{i=1}^{n} \alpha_i \\
&= -\frac{1}{2} w^T \sum_{i=1}^{n} \alpha_i y_i x_i - b \sum_{i=1}^{n} \alpha_i y_i + \sum_{i=1}^{n} \alpha_i \\
&= -\frac{1}{2} \left( \sum_{i=1}^{n} \alpha_i y_i x_i \right)^T \sum_{j=1}^{n} \alpha_j y_j x_j - b \sum_{i=1}^{n} \alpha_i y_i + \sum_{i=1}^{n} \alpha_i \\
&= -\frac{1}{2} \sum_{i,j=1}^{n} \alpha_i \alpha_j y_i y_j x_i^T x_j - b \sum_{i=1}^{n} \alpha_i y_i + \sum_{i=1}^{n} \alpha_i \\
&= -\frac{1}{2} \sum_{i,j=1}^{n} \alpha_i \alpha_j y_i y_j x_i^T x_j + \sum_{i=1}^{n} \alpha_i \qquad \left( \text{using } \sum_{i=1}^{n} \alpha_i y_i = 0 \right),
\end{aligned}
\]

and simplify the dual problem as:

\[ \underset{\alpha \geq 0}{\text{maximize}} \;\; W(\alpha) = \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{n} y_i y_j \alpha_i \alpha_j \langle x_i, x_j \rangle \quad \text{s.t.} \;\; \sum_{i=1}^{n} \alpha_i y_i = 0. \]

Then we notice that this is a convex optimization problem, so we can use the SMO algorithm to solve for the Lagrange multipliers \(\alpha\). Using the relation \(w = \sum_{i=1}^{n} \alpha_i y_i x_i\) derived above, we can then compute the parameter \(w\). Also, \(b\) is calculated as

\[ b = -\frac{\min_{i: y_i = 1} w^T x_i + \max_{i: y_i = -1} w^T x_i}{2}. \]

Therefore, after determining the parameters \(w\) and \(b\) of the best hyperplane, for a new observation \(x_i\) the decision rule is

\[ y_i = \begin{cases} 1, & \text{for } w^T x_i + b \geq 1 \\ -1, & \text{for } w^T x_i + b \leq -1. \end{cases} \]
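The sketch below illustrates this whole pipeline on a toy data set: it solves the dual problem numerically (SciPy's SLSQP solver stands in for SMO purely for illustration), recovers \(w\) and \(b\) from the formulas above, and applies the decision rule to a new point.

```python
# A minimal sketch: solve the dual on a toy set, recover w and b, and predict.
# A general-purpose solver replaces SMO here purely for illustration.
import numpy as np
from scipy.optimize import minimize

X = np.array([[2.0, 2.0], [3.0, 3.0], [-1.0, -1.0], [-2.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
n = len(y)
G = (y[:, None] * X) @ (y[:, None] * X).T      # G[i, j] = y_i y_j <x_i, x_j>

def neg_W(alpha):
    # negative of the dual objective W(alpha), since we minimize
    return 0.5 * alpha @ G @ alpha - alpha.sum()

res = minimize(neg_W, x0=np.zeros(n), method="SLSQP",
               bounds=[(0, None)] * n,
               constraints={"type": "eq", "fun": lambda a: a @ y})
alpha = res.x

w = (alpha * y) @ X                                            # w = sum_i alpha_i y_i x_i
b = -((X[y == 1] @ w).min() + (X[y == -1] @ w).max()) / 2      # b from the formula above

x_new = np.array([1.5, 2.5])
print("prediction:", np.sign(w @ x_new + b))
```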

1.2: Soft Margin

For most real-world data, a hyperplane cannot completely separate the two classes, so we tolerate some training observations lying on the wrong side of the margin or of the hyperplane. The margin in this situation is called the Soft Margin, and the resulting classifier the Support Vector Classifier. Here is the primal optimization problem of the soft margin:
\[ \underset{w,\, b,\, \varepsilon_i}{\text{minimize}} \;\; \frac{1}{2} w^T w + C \sum_{i=1}^{n} \varepsilon_i \quad \text{s.t.} \;\; y_i(w^T x_i + b) \geq 1 - \varepsilon_i \;\text{ and }\; \varepsilon_i \geq 0, \; i = 1, \ldots, n, \]

where the \(\varepsilon_i\) are slack variables that allow misclassification; the penalty term \(\sum_{i=1}^{n} \varepsilon_i\) is a measure of the total amount of misclassification in the model constructed from the training dataset; and \(C\), a tuning parameter, is the misclassification cost that controls the trade-off between maximizing the margin and minimizing the penalty term.
Following the same process we used in deriving the hard margin, we can solve the primal problem by working in its dual space. Here is the Lagrangian of the problem above, where the \(\alpha_i\) and \(\beta_i\) are Lagrange multipliers:
\[ \min_{w,\, b,\, \varepsilon_i} \; \max_{\alpha \geq 0,\, \beta \geq 0} \; L(w, b, \alpha, \beta, \varepsilon) = \frac{1}{2} \|w\|^2 + C \sum_{i=1}^{n} \varepsilon_i - \sum_{i=1}^{n} \alpha_i \left[ y_i(w^T x_i + b) - 1 + \varepsilon_i \right] - \sum_{i=1}^{n} \beta_i \varepsilon_i, \]

Due to the nonnegativity of the primal constraints and of the Lagrange multipliers, the part \(-\sum_{i=1}^{n} \alpha_i \left[ y_i(w^T x_i + b) - 1 + \varepsilon_i \right] - \sum_{i=1}^{n} \beta_i \varepsilon_i\) is nonpositive whenever the primal constraints hold, so the inner maximization pushes this part to zero and \(\max_{\alpha \geq 0,\, \beta \geq 0} L(w, b, \alpha, \beta, \varepsilon)\) recovers the primal objective \(\frac{1}{2} \|w\|^2 + C \sum_{i=1}^{n} \varepsilon_i\) with respect to \(w\), \(b\), and \(\varepsilon\). However, the constraints \(\alpha \geq 0\) and \(\beta \geq 0\) make this maximization difficult to work with directly. Therefore, we transform the primal problem into the following dual problem through the KKT conditions:

\[ \max_{\alpha \geq 0,\, \beta \geq 0} \; \min_{w,\, b,\, \varepsilon_i} \; L(w, b, \alpha, \beta, \varepsilon) = \frac{1}{2} \|w\|^2 + C \sum_{i=1}^{n} \varepsilon_i - \sum_{i=1}^{n} \alpha_i \left[ y_i(w^T x_i + b) - 1 + \varepsilon_i \right] - \sum_{i=1}^{n} \beta_i \varepsilon_i. \]

Because the inner subproblem minimizes the Lagrangian with respect to \(w\), \(b\), and \(\varepsilon\), on which there are no constraints, we can set the gradients to 0 as follows to find \(\min_{w,\, b,\, \varepsilon_i} L(w, b, \alpha, \beta, \varepsilon)\):
\[
\begin{aligned}
\nabla_w L &= w - \sum_{i=1}^{n} \alpha_i y_i x_i = 0 \quad &&\Longrightarrow \quad w = \sum_{i=1}^{n} \alpha_i y_i x_i, \\
\frac{\partial L}{\partial b} &= -\sum_{i=1}^{n} \alpha_i y_i = 0 \quad &&\Longrightarrow \quad \sum_{i=1}^{n} \alpha_i y_i = 0, \\
\frac{\partial L}{\partial \varepsilon_i} &= C - \alpha_i - \beta_i = 0 \quad &&\Longrightarrow \quad C = \alpha_i + \beta_i.
\end{aligned}
\]

When we plug these conditions back into the Lagrangian, we get the dual form of this optimization problem as follows:
\[ \max_{\alpha} \;\; \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j x_i^T x_j \quad \text{s.t.} \;\; \sum_{i=1}^{n} y_i \alpha_i = 0 \;\text{ and }\; 0 \leq \alpha_i \leq C, \;\text{for } i = 1, \ldots, n. \]
Finally, the decision rule is as follows:
(1) If \(\alpha_i = 0\) and \(\varepsilon_i = 0\), the training observation \(x_i\) is correctly classified and is not a support vector.
(2) If \(0 < \alpha_i < C\), we can conclude that \(\varepsilon_i = 0\) and \(w^T x_i + b = y_i\), which means the observation \(x_i\) is a support vector lying exactly on the margin.
(3) If \(\alpha_i = C\) and \(\varepsilon_i > 0\), \(x_i\) is a support vector that violates the margin; it is still correctly classified when \(0 < \varepsilon_i < 1\) and is misclassified when \(\varepsilon_i > 1\).
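As a usage note, the snippet below is a small illustration of the soft-margin classifier, assuming scikit-learn is available; \(C\) is the same tuning parameter as in the primal problem above, and smaller \(C\) tolerates more slack and typically yields more support vectors.

```python
# A minimal illustration (assuming scikit-learn is installed) of fitting the
# soft-margin classifier on simulated, overlapping classes for several C.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=[2, 2], scale=1.5, size=(50, 2)),
               rng.normal(loc=[-2, -2], scale=1.5, size=(50, 2))])
y = np.array([1] * 50 + [-1] * 50)

for C in (0.1, 1.0, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    # Observations with alpha_i > 0 are returned as support vectors.
    print(f"C = {C}: {len(clf.support_)} support vectors")
```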

1.3: Kernel Method

Since \(w = \sum_{i=1}^{n} \alpha_i y_i x_i\), our hyperplane can be rewritten as

\[ g(x) = \left( \sum_{i=1}^{n} \alpha_i y_i x_i \right)^T x + b = \sum_{i=1}^{n} \alpha_i y_i \langle x_i, x \rangle + b, \]

where \(\langle x_i, x \rangle\) represents the inner product of \(x_i\) and \(x\). What we discussed above assumes the data are linearly separable. What if our data set is not linearly separable? One common approach is to define a map

\[ \phi: \mathbb{R}^p \rightarrow \mathbb{R}^m. \]

A feature mapping \(\phi\), for instance, could be \(\phi(x) = \begin{bmatrix} x \\ x^2 \\ x^3 \end{bmatrix}\).

The idea is that if the observations are inseparable in the current feature space, we can use a feature mapping function \(\phi\) and try to separate the observations after they have been mapped.

Figure: a plot showing how the mapping function \(\phi\) works.

Hence, in order to find the hyperplane separating the observations after they have been mapped, we simply replace \(x\) with \(\phi(x)\) everywhere in the previous optimization problem and get:

\[ \underset{w,\, b}{\text{minimize}} \;\; \frac{1}{2} w^T w \quad \text{s.t.} \;\; y_i(w^T \phi(x_i) + b) \geq 1, \; i = 1, \ldots, n. \]

From the previous part, we know that we only need the pairwise inner products of the observations to solve the problem. But since computing inner products in a high-dimensional space is expensive or even impossible, we need another way to compute them. So, given a feature mapping \(\phi\), we define the corresponding kernel function \(K\) as:

\[ K(x_i, x_j) = \phi(x_i)^T \phi(x_j) = \langle \phi(x_i), \phi(x_j) \rangle. \]
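As a small numerical check, the snippet below uses the degree-2 polynomial kernel \(K(x, z) = (x^T z)^2\) as an example; its explicit feature map \(\phi(x)\) lists all pairwise products \(x_k x_l\), and the kernel value matches the inner product computed in the mapped space.

```python
# Numerical check that K(x, z) = (x^T z)^2 equals <phi(x), phi(z)>, where
# phi(x) lists all pairwise products x_k * x_l (an example feature map chosen
# for illustration).
import numpy as np

def phi(x):
    # Explicit feature map for the degree-2 polynomial kernel: R^p -> R^(p^2)
    return np.outer(x, x).ravel()

x = np.array([1.0, 2.0, 3.0])
z = np.array([0.5, -1.0, 2.0])

kernel_value = (x @ z) ** 2            # computed without leaving R^p
mapped_value = phi(x) @ phi(z)         # computed in the mapped space
print(kernel_value, mapped_value)      # both print 20.25
```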

So now, instead of explicitly computing the inner products of pairs of mapped observations, we use the kernel function to do the computation for us. Using the same method as before, we can get our hyperplane as:

\[ w = \sum_{i=1}^{n} \alpha_i y_i \phi(x_i), \]

and

\[ b = -\frac{\min_{i: y_i = 1} \sum_{j=1}^{n} \alpha_j y_j K(x_j, x_i) + \max_{i: y_i = -1} \sum_{j=1}^{n} \alpha_j y_j K(x_j, x_i)}{2}. \]

So after using the kernel, the decision rule for a new observation \(x\) becomes

\[ y = \begin{cases} 1, & \text{for } \sum_{i=1}^{n} \alpha_i y_i \langle \phi(x_i), \phi(x) \rangle + b \geq 1 \\ -1, & \text{for } \sum_{i=1}^{n} \alpha_i y_i \langle \phi(x_i), \phi(x) \rangle + b \leq -1, \end{cases} \]

where each \(\langle \phi(x_i), \phi(x) \rangle\) is computed as \(K(x_i, x)\).

There are some commonly used kernels: the polynomial kernel and the Gaussian kernel. The polynomial kernel with degree \(d\) is defined as \(K(x_i, x_j) = (x_i^T x_j)^d\), and the Gaussian kernel with parameter \(\sigma\) is defined as \(K(x_i, x_j) = \exp\!\left( -\frac{\|x_i - x_j\|^2}{2\sigma^2} \right)\).
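Both kernels are straightforward to compute directly; the sketch below (NumPy, with arbitrary example data and parameter values) builds the Gram matrix \(K_{ij} = K(x_i, x_j)\) that the dual problem works with.

```python
# Minimal NumPy implementations of the two kernels above, returning the
# Gram matrix K with K[i, j] = K(x_i, x_j). Parameter values are arbitrary.
import numpy as np

def polynomial_kernel(X, Z, d=3):
    # K(x_i, z_j) = (x_i^T z_j)^d
    return (X @ Z.T) ** d

def gaussian_kernel(X, Z, sigma=1.0):
    # K(x_i, z_j) = exp(-||x_i - z_j||^2 / (2 sigma^2))
    sq_dists = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-sq_dists / (2 * sigma ** 2))

X = np.random.default_rng(0).normal(size=(5, 2))
print(polynomial_kernel(X, X).shape)          # (5, 5) Gram matrix
print(gaussian_kernel(X, X, sigma=0.5).shape)
```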


1: Support Vector Machine (SVM) is shared under a not declared license and was authored, remixed, and/or curated by LibreTexts.
