1 Support Vector Machine (SVM)
1.1 Basic Background of SVM
The idea behind the Support Vector Machine is straightforward: given observations $x_i \in \mathbb{R}^p$, $i = 1, \dots, n$, where each observation $x_i$ has a class label $y_i \in \{-1, 1\}$, we want to find a hyperplane that separates the observations by class and also maximizes the minimum distance from the observations to the hyperplane.
First, we denote the best hyperplane as $w^T x + b = 0$. From linear algebra, we know the distance from a point $x_0$ to the plane $w^T x + b = 0$ is
$$\text{Distance} = \frac{|w^T x_0 + b|}{\|w\|}.$$
We also want the hyperplane to correctly separate the two classes, which is equivalent to satisfying the following restrictions:
$$w^T x_i + b > 0 \quad \text{if } y_i = 1, \qquad w^T x_i + b < 0 \quad \text{if } y_i = -1,$$
which is equivalent to $y_i(w^T x_i + b) > 0$, $i = 1, \dots, n$.
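As a quick numeric check of the distance formula and the sign condition, here is a minimal sketch in Python; NumPy and the particular $w$, $b$, $x_0$ are illustrative assumptions, not part of the derivation:

```python
import numpy as np

# Hypothetical hyperplane w^T x + b = 0 in R^2 (w, b, x0 chosen only for illustration).
w = np.array([3.0, 4.0])
b = -5.0
x0 = np.array([2.0, 1.0])

# Distance from x0 to the hyperplane: |w^T x0 + b| / ||w||.
distance = abs(w @ x0 + b) / np.linalg.norm(w)
print(distance)  # |3*2 + 4*1 - 5| / 5 = 1.0

# The sign of w^T x0 + b tells which side of the hyperplane x0 falls on,
# which is why y_i (w^T x_i + b) > 0 encodes "correctly separated".
print(np.sign(w @ x0 + b))  # 1.0
```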
Our goal is to maximize the minimum distance from the observations to the hyperplane, so we have our first optimization problem:
$$\max_{w, b} \; \min_i \frac{|w^T x_i + b|}{\|w\|} \quad \text{s.t.} \quad y_i(w^T x_i + b) > 0, \; i = 1, \dots, n.$$
Then we define the margin $M = \min_i \frac{|w^T x_i + b|}{\|w\|}$, and for mathematical convenience we scale $w$ so that $\|w\| = 1$:
$$\max_{w, \|w\| = 1, b} M \quad \text{s.t.} \quad y_i(w^T x_i + b) \ge M, \; i = 1, \dots, n.$$
Then we define $v = \frac{w}{M}$, so that $\|v\| = \frac{1}{M}$. Substituting $v$ back into the optimization gives:
$$\max_{v, b} \frac{1}{\|v\|} \quad \text{s.t.} \quad y_i\left(\frac{w^T}{M} x_i + \frac{b}{M}\right) \ge \frac{M}{M} = 1, \; i = 1, \dots, n.$$
Then we rename the variable $v$ as $w$, and note that maximizing $\frac{1}{\|v\|}$ is equivalent to minimizing $\frac{1}{2} w^T w$. So we get our final optimization problem:
$$\min_{w, b} \frac{1}{2} w^T w \quad \text{s.t.} \quad y_i(w^T x_i + b) \ge 1, \; i = 1, \dots, n.$$
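To make this concrete, here is a minimal sketch that solves the hard-margin primal directly with a generic convex solver; cvxpy and the tiny toy data set are assumptions for illustration, not part of the original derivation:

```python
import cvxpy as cp
import numpy as np

# Tiny linearly separable toy set (made up for illustration).
X = np.array([[2.0, 2.0], [3.0, 3.0], [-1.0, -1.0], [-2.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])

# Hard-margin primal: minimize (1/2) w^T w  s.t.  y_i (w^T x_i + b) >= 1.
w = cp.Variable(2)
b = cp.Variable()
constraints = [cp.multiply(y, X @ w + b) >= 1]
cp.Problem(cp.Minimize(0.5 * cp.sum_squares(w)), constraints).solve()

print(w.value, b.value)             # parameters of the maximum-margin hyperplane
print(2 / np.linalg.norm(w.value))  # width of the margin, 2 / ||w||
```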
We want to use the method of Lagrange multipliers to solve this optimization problem. We can write our constraints as $g_i(w) = y_i(w^T x_i + b) - 1 \ge 0$. Let us define
$$L(w, b, \alpha) = \frac{1}{2} w^T w - \sum_{i=1}^{n} \alpha_i \left[ y_i(w^T x_i + b) - 1 \right], \quad \alpha \ge 0.$$
We observe that when the constraints are satisfied, the maximum of $L$ with respect to $\alpha$ equals $\frac{1}{2} w^T w$. So we can rewrite our original optimization problem as:
$$\min_{w, b} \max_{\alpha \ge 0} L(w, b, \alpha) = \frac{1}{2} w^T w - \sum_{i=1}^{n} \alpha_i \left[ y_i(w^T x_i + b) - 1 \right],$$
where $\alpha$ is the vector of Lagrange multipliers.
We can verify that $w$, $b$, $\alpha$ satisfy the Karush-Kuhn-Tucker (KKT) conditions. Therefore, we can solve the primal problem by solving its dual problem:
$$\max_{\alpha \ge 0} \min_{w, b} L(w, b, \alpha) = \frac{1}{2} w^T w - \sum_{i=1}^{n} \alpha_i \left[ y_i(w^T x_i + b) - 1 \right].$$
To solve the dual problem, we first set the partial derivatives of $L(w, b, \alpha)$ with respect to $w$ and $b$ to 0:
$$\nabla_w L(w, b, \alpha) = w - \sum_{i=1}^{n} \alpha_i y_i x_i = 0, \qquad \frac{\partial}{\partial b} L(w, b, \alpha) = -\sum_{i=1}^{n} \alpha_i y_i = 0.$$
Then we substitute them back into the function $L(w, b, \alpha)$:
$$\begin{aligned}
L(w, b, \alpha) &= \frac{1}{2} w^T w - \sum_{i=1}^{n} \alpha_i \left( y_i(w^T x_i + b) - 1 \right) \\
&= \frac{1}{2} w^T w - \sum_{i=1}^{n} \alpha_i y_i w^T x_i - \sum_{i=1}^{n} \alpha_i y_i b + \sum_{i=1}^{n} \alpha_i \\
&= \frac{1}{2} w^T \sum_{i=1}^{n} \alpha_i y_i x_i - w^T \sum_{i=1}^{n} \alpha_i y_i x_i - b \sum_{i=1}^{n} \alpha_i y_i + \sum_{i=1}^{n} \alpha_i \\
&= -\frac{1}{2} w^T \sum_{i=1}^{n} \alpha_i y_i x_i - b \sum_{i=1}^{n} \alpha_i y_i + \sum_{i=1}^{n} \alpha_i \\
&= -\frac{1}{2} \left( \sum_{i=1}^{n} \alpha_i y_i x_i \right)^T \sum_{j=1}^{n} \alpha_j y_j x_j - b \sum_{i=1}^{n} \alpha_i y_i + \sum_{i=1}^{n} \alpha_i \\
&= -\frac{1}{2} \sum_{i,j=1}^{n} \alpha_i \alpha_j y_i y_j x_i^T x_j - b \sum_{i=1}^{n} \alpha_i y_i + \sum_{i=1}^{n} \alpha_i \\
&= -\frac{1}{2} \sum_{i,j=1}^{n} \alpha_i \alpha_j y_i y_j x_i^T x_j + \sum_{i=1}^{n} \alpha_i,
\end{aligned}$$
and simplify the dual problem as:
$$\max_{\alpha \ge 0} W(\alpha) = \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{n} y_i y_j \alpha_i \alpha_j \langle x_i, x_j \rangle \quad \text{s.t.} \quad \sum_{i=1}^{n} \alpha_i y_i = 0.$$
We notice that this is a convex optimization problem, so we can use the SMO algorithm to solve for the Lagrange multipliers $\alpha$. Using the stationarity condition above, we can then compute the parameter $w = \sum_{i=1}^{n} \alpha_i y_i x_i$. Also, $b$ is calculated as
$$b = -\frac{\min_{i: y_i = 1} w^T x_i + \max_{i: y_i = -1} w^T x_i}{2}.$$
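As a sketch of this recipe, the snippet below solves the dual with a generic convex solver (cvxpy) rather than SMO, then recovers $w$ and $b$ from the multipliers; the solver choice and toy data are illustrative assumptions:

```python
import cvxpy as cp
import numpy as np

# Same toy data as in the primal sketch above.
X = np.array([[2.0, 2.0], [3.0, 3.0], [-1.0, -1.0], [-2.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
n = len(y)

# Dual: maximize sum_i alpha_i - (1/2) sum_ij alpha_i alpha_j y_i y_j <x_i, x_j>
#       s.t. alpha_i >= 0 and sum_i alpha_i y_i = 0.
# The quadratic term is written as (1/2) || sum_i alpha_i y_i x_i ||^2.
alpha = cp.Variable(n)
objective = cp.Maximize(cp.sum(alpha) - 0.5 * cp.sum_squares(cp.multiply(alpha, y) @ X))
cp.Problem(objective, [alpha >= 0, y @ alpha == 0]).solve()

a = alpha.value
w = (a * y) @ X                                            # w = sum_i alpha_i y_i x_i
b = -(np.min(X[y == 1] @ w) + np.max(X[y == -1] @ w)) / 2  # midpoint between the classes
print(a, w, b)
```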
Therefore, after finding the parameters $w$ and $b$ of the best hyperplane, the decision rule for a new observation $x$ is
$$\hat{y} = \begin{cases} 1, & \text{if } w^T x + b \ge 0 \\ -1, & \text{if } w^T x + b < 0. \end{cases}$$
1.2 Soft Margin
In most real-world data, a hyperplane cannot completely separate the two classes, so we tolerate some training observations falling on the wrong side of the margin or the hyperplane. The margin in this situation is called the Soft Margin, and the resulting classifier a Support Vector Classifier. Here is the primal optimization problem of the Soft Margin:
$$\min_{w, b, \varepsilon_i} \frac{1}{2} w^T w + C \sum_{i=1}^{n} \varepsilon_i \quad \text{s.t.} \quad y_i(w^T x_i + b) \ge 1 - \varepsilon_i, \; \varepsilon_i \ge 0, \; i = 1, \dots, n,$$
where the $\varepsilon_i$ are slack variables that allow misclassification; the penalty term $\sum_{i=1}^{n} \varepsilon_i$ measures the total amount of margin violation in the model constructed from the training dataset; and $C$, a tuning parameter, is the misclassification cost that controls the trade-off between maximizing the margin and minimizing the penalty term.
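A minimal illustration of the role of $C$, using scikit-learn's SVC with a linear kernel; the library and the overlapping toy data are assumptions for illustration, not part of the notes:

```python
import numpy as np
from sklearn.svm import SVC

# Toy data whose classes overlap (made up for illustration), so no hyperplane
# can separate them perfectly and slack variables are required.
X = np.array([[2.0, 2.0], [3.0, 1.0], [-1.0, -1.0],
              [-2.0, -2.0], [-3.0, -1.0], [1.0, 1.0]])
y = np.array([1, 1, 1, -1, -1, -1])

# Larger C penalizes slack more heavily (narrower margin, fewer violations);
# smaller C tolerates more violations in exchange for a wider margin.
for C in (0.1, 1.0, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    print(C, clf.n_support_, clf.coef_.ravel(), clf.intercept_)
```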
Following the same process we used in deriving the hard margin, we can solve the primal problem by working in its dual space. Here is the Lagrangian of the soft-margin primal problem, where $\alpha_i$ and $\beta_i$ are the Lagrange multipliers:
$$\min_{w, b, \varepsilon_i} \max_{\alpha \ge 0, \beta \ge 0} L(w, b, \alpha, \beta, \varepsilon) = \frac{1}{2} \|w\|^2 + C \sum_{i=1}^{n} \varepsilon_i - \sum_{i=1}^{n} \alpha_i \left[ y_i(w^T x_i + b) - 1 + \varepsilon_i \right] - \sum_{i=1}^{n} \beta_i \varepsilon_i,$$
Due to the nonnegativity of the primal constraints and of the Lagrange multipliers, the part $-\sum_{i=1}^{n} \alpha_i \left[ y_i(w^T x_i + b) - 1 + \varepsilon_i \right] - \sum_{i=1}^{n} \beta_i \varepsilon_i$ is nonpositive whenever the constraints hold, and its maximum over $\alpha \ge 0$, $\beta \ge 0$ is zero. Thus $\max_{\alpha \ge 0, \beta \ge 0} L(w, b, \alpha, \beta, \varepsilon)$ recovers the original objective $\frac{1}{2} \|w\|^2 + C \sum_{i=1}^{n} \varepsilon_i$, which we then minimize with respect to $w$, $b$ and $\varepsilon$. However, the constraints $\alpha \ge 0$ and $\beta \ge 0$ make the inner maximization difficult. Therefore, we transform the primal problem into the following dual problem through the KKT conditions:
$$\max_{\alpha \ge 0, \beta \ge 0} \min_{w, b, \varepsilon_i} L(w, b, \alpha, \beta, \varepsilon) = \frac{1}{2} \|w\|^2 + C \sum_{i=1}^{n} \varepsilon_i - \sum_{i=1}^{n} \alpha_i \left[ y_i(w^T x_i + b) - 1 + \varepsilon_i \right] - \sum_{i=1}^{n} \beta_i \varepsilon_i.$$
Because the inner subproblem minimizes the Lagrangian with respect to $w$, $b$ and $\varepsilon$, on which there are no constraints, we can set the gradient to 0 as follows to find $\min_{w, b, \varepsilon_i} L(w, b, \alpha, \beta, \varepsilon)$:
$$\nabla_w L = w - \sum_{i=1}^{n} \alpha_i y_i x_i = 0 \;\Rightarrow\; w = \sum_{i=1}^{n} \alpha_i y_i x_i, \qquad \frac{\partial L}{\partial b} = -\sum_{i=1}^{n} \alpha_i y_i = 0 \;\Rightarrow\; \sum_{i=1}^{n} \alpha_i y_i = 0, \qquad \frac{\partial L}{\partial \varepsilon_i} = C - \alpha_i - \beta_i = 0 \;\Rightarrow\; C = \alpha_i + \beta_i.$$
When we plug these conditions back into the Lagrangian, we get the dual form of this optimization problem as follows:
$$\max_{\alpha_i \ge 0} \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j x_i^T x_j \quad \text{s.t.} \quad \sum_{i=1}^{n} y_i \alpha_i = 0 \; \text{ and } \; 0 \le \alpha_i \le C, \; i = 1, \dots, n.$$
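Continuing the earlier cvxpy sketch for the hard-margin dual, the only change needed here is the box constraint $0 \le \alpha_i \le C$; cvxpy and the toy data are again illustrative assumptions:

```python
import cvxpy as cp
import numpy as np

# Non-separable toy data from the soft-margin example above.
X = np.array([[2.0, 2.0], [3.0, 1.0], [-1.0, -1.0],
              [-2.0, -2.0], [-3.0, -1.0], [1.0, 1.0]])
y = np.array([1.0, 1.0, 1.0, -1.0, -1.0, -1.0])
n, C = len(y), 1.0

# Soft-margin dual: identical objective, plus the box constraint alpha_i <= C.
alpha = cp.Variable(n)
objective = cp.Maximize(cp.sum(alpha) - 0.5 * cp.sum_squares(cp.multiply(alpha, y) @ X))
cp.Problem(objective, [alpha >= 0, alpha <= C, y @ alpha == 0]).solve()

print(np.round(alpha.value, 3))  # each entry is 0, strictly between 0 and C, or equal to C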
Finally, the KKT conditions characterize the training observations as follows (a code sketch reading these cases off a fitted model appears after the list):
(1) If $\alpha_i = 0$ and $\varepsilon_i = 0$, the training observation $x_i$ is on the correct side of the margin.
(2) If $0 < \alpha_i < C$, we can conclude that $\varepsilon_i = 0$ and $y_i(w^T x_i + b) = 1$, which means the training observation $x_i$ lies exactly on the margin and is a support vector.
(3) If $\alpha_i = C$ and $\varepsilon_i > 0$, then $x_i$ is also a support vector; it is correctly classified (but inside the margin) when $0 < \varepsilon_i < 1$, and misclassified when $\varepsilon_i > 1$.
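These cases can be read off a fitted model. The sketch below uses scikit-learn's SVC, whose dual_coef_ attribute stores $\alpha_i y_i$ for the support vectors; the library and the toy data are illustrative assumptions, not part of the notes:

```python
import numpy as np
from sklearn.svm import SVC

X = np.array([[2.0, 2.0], [3.0, 1.0], [-1.0, -1.0],
              [-2.0, -2.0], [-3.0, -1.0], [1.0, 1.0]])
y = np.array([1, 1, 1, -1, -1, -1])
C = 1.0

clf = SVC(kernel="linear", C=C).fit(X, y)

# dual_coef_ holds alpha_i * y_i for the support vectors, so |dual_coef_| = alpha_i.
alphas = np.abs(clf.dual_coef_).ravel()
for idx, a in zip(clf.support_, alphas):
    margin = y[idx] * clf.decision_function(X[idx:idx + 1])[0]  # y_i (w^T x_i + b)
    if a < C - 1e-6:
        print(idx, "0 < alpha < C: on the margin, y_i f(x_i) =", round(margin, 3))
    else:
        print(idx, "alpha = C: inside the margin or misclassified, y_i f(x_i) =", round(margin, 3))
```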
1.3 Kernel Method
Since $w = \sum_{i=1}^{n} \alpha_i y_i x_i$, our hyperplane can be rewritten as
$$g(x) = \left( \sum_{i=1}^{n} \alpha_i y_i x_i \right)^T x + b = \sum_{i=1}^{n} \alpha_i y_i \langle x_i, x \rangle + b,$$
where $\langle x_i, x \rangle$ represents the inner product of $x_i$ and $x$. What we discussed above concerns linearly separable data. What if our data set is not linearly separable? One common approach is to define a map
$$\phi: \mathbb{R}^p \to \mathbb{R}^m.$$
A feature mapping $\phi$, for instance, could be $\phi(x) = \begin{bmatrix} x \\ x^2 \\ x^3 \end{bmatrix}$.
The idea is that if the observations are inseparable in the current feature space, we can use a feature mapping function $\phi$ and try to separate the observations after they are mapped.
Here is a plot that shows how the mapping function $\phi$ works.
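The same idea can also be illustrated numerically: one-dimensional data that no single threshold can separate becomes linearly separable after applying $\phi(x) = (x, x^2, x^3)$. scikit-learn and the toy data below are illustrative assumptions:

```python
import numpy as np
from sklearn.svm import SVC

# One-dimensional data that no single threshold on x can separate: the positive
# class sits on both sides of the negative class (made up for illustration).
x = np.array([-3.0, -2.5, 2.5, 3.0, -0.5, 0.0, 0.5, 1.0])
y = np.array([1, 1, 1, 1, -1, -1, -1, -1])

# Explicit feature map phi(x) = (x, x^2, x^3): after mapping, the x^2 coordinate
# alone separates the two classes, so a linear SVM succeeds in the mapped space.
Phi = np.column_stack([x, x ** 2, x ** 3])
clf = SVC(kernel="linear", C=1e3).fit(Phi, y)

print(clf.score(Phi, y))  # 1.0: perfectly separated after mapping
```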
Hence, in order to solve for the hyperplane that separates the observations after mapping, we simply replace $x$ with $\phi(x)$ everywhere in the previous optimization problem and get:
$$\min_{w, b} \frac{1}{2} w^T w \quad \text{s.t.} \quad y_i(w^T \phi(x_i) + b) \ge 1, \; i = 1, \dots, n.$$
From the previous part, we know that we only need the pairwise inner products of the observations to solve the problem. But since computing inner products in a high-dimensional space is expensive or even impossible, we need another way to compute them. So, given a feature mapping $\phi(x)$, we define the corresponding kernel function $K$ as:
$$K(x_i, x_j) = \phi(x_i)^T \phi(x_j) = \langle \phi(x_i), \phi(x_j) \rangle.$$
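As a small worked check of this identity, the degree-2 polynomial kernel $K(x, z) = (x^T z)^2$ on $\mathbb{R}^2$ corresponds to the explicit map $\phi(x) = (x_1^2, \sqrt{2}\, x_1 x_2, x_2^2)$; the NumPy sketch below, with made-up points, verifies that both sides agree:

```python
import numpy as np

def phi(x):
    # Explicit feature map for the degree-2 polynomial kernel on R^2.
    return np.array([x[0] ** 2, np.sqrt(2) * x[0] * x[1], x[1] ** 2])

def poly2_kernel(x, z):
    # K(x, z) = (x^T z)^2, computed without ever forming phi(x) or phi(z).
    return (x @ z) ** 2

x = np.array([1.0, 2.0])
z = np.array([3.0, 0.5])

print(phi(x) @ phi(z))     # inner product in the mapped space: 16.0
print(poly2_kernel(x, z))  # same value, computed in the original space: 16.0
```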
So now, instead of explicitly computing the pairwise inner products in the mapped space, we use the kernel function to do the computation. Using the same method as before, we can express our hyperplane via:
$$w = \sum_{i=1}^{n} \alpha_i y_i \phi(x_i),$$
and
$$b = -\frac{\min_{i: y_i = 1} \sum_{j=1}^{n} \alpha_j y_j K(x_j, x_i) + \max_{i: y_i = -1} \sum_{j=1}^{n} \alpha_j y_j K(x_j, x_i)}{2}.$$
So after applying the kernel, our decision rule becomes:
$$\hat{y} = \begin{cases} 1, & \text{if } \sum_{i=1}^{n} \alpha_i y_i K(x_i, x) + b \ge 0 \\ -1, & \text{if } \sum_{i=1}^{n} \alpha_i y_i K(x_i, x) + b < 0. \end{cases}$$
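Here is a minimal sketch of evaluating this kernelized decision function by hand, reusing a model fitted with scikit-learn's SVC; the library, the Gaussian kernel parameters, and the toy data are illustrative assumptions (dual_coef_ stores $\alpha_i y_i$ for the support vectors):

```python
import numpy as np
from sklearn.svm import SVC

X = np.array([[2.0, 2.0], [3.0, 1.0], [-1.0, -1.0],
              [-2.0, -2.0], [-3.0, -1.0], [1.0, 1.0]])
y = np.array([1, 1, 1, -1, -1, -1])

gamma = 0.5  # Gaussian kernel K(x, z) = exp(-gamma ||x - z||^2), gamma = 1 / (2 sigma^2)
clf = SVC(kernel="rbf", gamma=gamma, C=1.0).fit(X, y)

# Evaluate f(x) = sum_i alpha_i y_i K(x_i, x) + b by hand, using only the
# support vectors (every other alpha_i is zero).
x_new = np.array([0.5, 0.5])
K = np.exp(-gamma * np.sum((clf.support_vectors_ - x_new) ** 2, axis=1))
f = clf.dual_coef_.ravel() @ K + clf.intercept_[0]

print(np.sign(f))                            # predicted class for x_new
print(f, clf.decision_function([x_new])[0])  # the two evaluations agree
```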
There are some commonly used kernels: the polynomial kernel and the Gaussian kernel. The polynomial kernel with degree $d$ is defined as $K(x_i, x_j) = (x_i^T x_j)^d$, and the Gaussian kernel with parameter $\sigma$ as $K(x_i, x_j) = \exp\left( -\frac{\|x_i - x_j\|^2}{2\sigma^2} \right)$.
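As a closing sketch, both kernels can be tried on the one-dimensional example from before without ever forming $\phi$ explicitly. scikit-learn is an assumption here; note that its polynomial kernel is $(\gamma\, x_i^T x_j + \text{coef0})^d$, so gamma=1.0 and coef0=0.0 recover the homogeneous form defined above:

```python
import numpy as np
from sklearn.svm import SVC

# Same one-dimensional data as in the feature-map example above.
x = np.array([[-3.0], [-2.5], [2.5], [3.0], [-0.5], [0.0], [0.5], [1.0]])
y = np.array([1, 1, 1, 1, -1, -1, -1, -1])

# Polynomial kernel (x_i^T x_j)^2 and Gaussian kernel exp(-||x_i - x_j||^2 / (2 sigma^2)),
# with sklearn's gamma playing the role of 1 / (2 sigma^2).
poly = SVC(kernel="poly", degree=2, gamma=1.0, coef0=0.0, C=10.0).fit(x, y)
rbf = SVC(kernel="rbf", gamma=0.5, C=10.0).fit(x, y)

print(poly.score(x, y), rbf.score(x, y))  # both kernels separate this toy set (1.0 1.0)
```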