An Introduction to Support Vector Machines (SVM): Sequential Minimal Optimization (SMO)

An overview of Support Vector Machines (SVM): solving the dual problem with the SMO algorithm

Posted by Gu on June 28, 2019

Recall the Kernel SVM dual problem:

Dual Problem

$$ \begin{align} \max_{\lambda}\ L(\lambda) &= \sum_{i=1}^{n}\lambda_i - \frac{1}{2}\sum_{i,j}\lambda_i \lambda_j y_i y_j K_{i,j} \\ \text{s.t.}\ & 0 \leq \lambda_i \leq C,\ i=1,\dots,n \\ &\sum_{i=1}^{n} \lambda_i y_i =0 \end{align} $$
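For concreteness, here is a minimal sketch of how one might evaluate this dual objective and check feasibility numerically. The names `lam`, `y`, `K`, and `C` are my own, and the identity kernel matrix is only a placeholder:

```python
import numpy as np

def dual_objective(lam, y, K):
    """Dual objective L(lambda) = sum(lam) - 0.5 * sum_ij lam_i lam_j y_i y_j K_ij."""
    return lam.sum() - 0.5 * (lam * y) @ K @ (lam * y)

def is_feasible(lam, y, C, tol=1e-8):
    """Check the box constraint 0 <= lam_i <= C and the equality sum_i lam_i y_i = 0."""
    return (lam >= -tol).all() and (lam <= C + tol).all() and abs(lam @ y) < tol

# Example: lambda = 0 is always feasible and gives L(lambda) = 0.
y = np.array([1.0, -1.0, 1.0])
K = np.eye(3)                      # placeholder kernel matrix
lam = np.zeros(3)
print(dual_objective(lam, y, K), is_feasible(lam, y, C=1.0))
```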

We have already introduced the gradient descent algorithm for solving the dual problem. However, computing the full gradient has a high time complexity and a large memory footprint, especially when the training dataset is large. In this post, I introduce an efficient and lightweight algorithm for solving the dual problem: Sequential Minimal Optimization (SMO).

Sequential Minimal Optimization (SMO)

The algorithm of SMO is:

Initialization: let $\{\lambda_i\}_{i=1}^{n}$ be a set of multipliers that satisfies the dual constraints.
Repeat:
(1) heuristically select two multipliers $\lambda_a$ and $\lambda_b$, and keep all the other $\lambda_i$ ($i \neq a, b$) fixed;
(2) optimize $L(\lambda)$ with respect to $\lambda_a$ and $\lambda_b$;
Until: the KKT conditions are satisfied up to a certain accuracy.

First question about the initialization: how do we find a set of multipliers that satisfies the dual constraints?
The answer is simple: set $\lambda_i = 0$ for $i = 1, \dots, n$.

Suppose that we have finished the initialization and pick a pair $(\lambda_a, \lambda_b)$ to optimize while keeping all $\lambda_i$ with $i \neq a, b$ fixed. Then we have

$$ \begin{align} L(\lambda) =& \lambda_a + \lambda_b -\frac{1}{2} \lambda_a^2 K_{a,a} - \frac{1}{2} \lambda_b^2 K_{b,b} - \lambda_a \lambda_b y_a y_b K_{a,b} \\ & - \sum_{i\neq a,b} \lambda_a \lambda_i y_a y_i K_{a,i} - \sum_{i \neq a,b} \lambda_b \lambda_i y_b y_i K_{b,i} + Const \end{align} $$

Moreover, according to the dual constraints, we have

$$ \lambda_a y_a + \lambda_b y_b = -\sum_{i\neq a,b} \lambda_i y_i = - \xi\\ \lambda_b y_b = -\lambda_a y_a -\xi\\ \lambda_b = -\lambda_a y_a y_b -\xi y_b $$

So we have

$$ \begin{align} L(\lambda) =& \lambda_a -\lambda_a y_a y_b - \xi y_b - \frac{1}{2}\lambda_a^2 K_{a,a} -\frac{1}{2}(\lambda_a y_a + \xi)^2 K_{b,b} \\ & + \lambda_a y_a ( \lambda_a y_a + \xi ) K_{a,b} - \sum_{i\neq a,b} \lambda_a y_a \lambda_i y_i K_{a,i}\\ & + \sum_{i\neq a,b}(\lambda_a y_a + \xi)\lambda_i y_i K_{b,i} + Const \end{align} $$

$L(\lambda)$ is now a quadratic function of the single variable $\lambda_a$, and it is concave with respect to $\lambda_a$, since the coefficient of $\lambda_a^2$ is $-\frac{1}{2}(K_{a,a} + K_{b,b} - 2K_{a,b}) \leq 0$ due to the fact that the kernel matrix is nonnegative definite (see the last post, An Introduction to Support Vector Machines (SVM): kernel functions). Therefore, we can find the optimal value of $\lambda_a$ which maximizes $L(\lambda)$ by computing the derivative and setting it to 0.
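Before carrying out that computation, here is a quick numerical sanity check of the claim that $K_{a,a} + K_{b,b} - 2K_{a,b} \geq 0$ for a valid kernel. The RBF kernel below is my own choice, purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))

# RBF kernel matrix; any valid kernel gives a nonnegative definite K
sq_dist = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K = np.exp(-0.5 * sq_dist)

# K_aa + K_bb - 2*K_ab = (e_a - e_b)^T K (e_a - e_b) >= 0 for every pair (a, b)
eta = K.diagonal()[:, None] + K.diagonal()[None, :] - 2 * K
print(eta.min())   # nonnegative up to floating-point error
```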

$$ \begin{align} \frac{\partial{L(\lambda)}}{\partial{\lambda_a}} =& 1 - y_a y_b -\lambda_a K_{a,a} - (\lambda_a y_a +\xi)y_a K_{b,b} + 2\lambda_a K_{a,b} \\ &+ y_a \xi K_{a,b} - \sum_{i\neq a,b} y_a \lambda_i y_i K_{a,i} + \sum_{i \neq a,b}y_a \lambda_i y_i K_{b,i}\\ =& 0 \end{align} $$

By solving this equation, we obtain the solution for $\lambda_a$:

$$ \lambda_a^{\text{new}} = \frac{ 1-y_a y_b - \xi y_a K_{b,b} + y_a \xi K_{a,b} - \sum_{i \neq a,b} y_a \lambda_i y_i K_{a,i} +\sum_{i\neq a,b}y_a \lambda_i y_i K_{b,i} }{ K_{a,a} + K_{b,b} -2K_{a,b} } $$

The numerator is cumbersome to compute directly because it contains many terms. Next, we show that we can actually compute $\lambda_a^\text{new}$ from the old value $\lambda_a^\text{old}$ together with the current prediction errors.

Before updating the value of $\lambda_a$, we first use the old multipliers and the old bias $b^\text{old}$ to perform classification on the data points $x_a$ and $x_b$:

$$ \begin{align} \hat{y}_a &= \sum_{i\neq a,b}\lambda_i y_i K_{i,a} + \lambda_a^\text{old} y_a K_{a,a} + \lambda_b^\text{old} y_b K_{b,a} + b^\text{old}\\ \hat{y}_b &= \sum_{i\neq a,b}\lambda_i y_i K_{i,b} + \lambda_a^\text{old} y_a K_{a,b} + \lambda_b^\text{old} y_b K_{b,b} + b^\text{old} \end{align} $$

Based on the expressions of $\hat{y}_a$ and $\hat{y}_b$, we have the following equation:

$$ \begin{align} &y_a[ (\hat{y}_b - y_b) - (\hat{y}_a - y_a) ]\\ = & \sum_{i\neq a,b}y_a \lambda_i y_i K_{i,b} + \lambda_a^\text{old} K_{a,b} + \lambda_b^\text{old} y_a y_b K_{b,b} + y_a b^\text{old} - y_a y_b \\ \ & - \sum_{i \neq a,b}y_a \lambda_i y_i K_{i,a} - \lambda_a^\text{old}K_{a,a} - \lambda_b^\text{old} y_a y_b K_{b,a} - y_a b^\text{old} +1\\ =& \sum_{i\neq a,b} y_a \lambda_i y_i K_{i,b} + \lambda_a^\text{old} K_{a,b} - \xi y_a K_{b,b} - \lambda_a^\text{old} K_{b,b}- y_a y_b \\ \ & - \sum_{i \neq a,b}y_a \lambda_i y_i K_{i,a} - \lambda_a^\text{old}K_{a,a} + \lambda_a^\text{old} K_{a,b} + \xi y_a K_{a,b} +1 \\ =& 1- y_a y_b - \xi y_a K_{b,b} + \xi y_a K_{a,b} - \sum_{i \neq a,b}y_a \lambda_i y_i K_{i,a} +\sum_{i\neq a,b} y_a \lambda_i y_i K_{i,b}\\ \ & -\lambda_a^\text{old}( K_{a,a} + K_{b,b} - 2K_{a,b} )\\ = & \lambda_a^{\text{new}}(K_{a,a} + K_{b,b} -2K_{a,b})-\lambda_a^\text{old}( K_{a,a} + K_{b,b} - 2K_{a,b} ) \end{align} $$

We denote the prediction error by $E_i = \hat{y}_i - y_i$. Then we have the expression of $\lambda_a^\text{new}$:

$$ \lambda_a^\text{new} = \lambda_a^\text{old} + \frac{y_a(E_b - E_a)}{K_{a,a} +K_{b,b} - 2K_{a,b} } $$
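In code, this update is straightforward. The sketch below (variable names are my own: `lam` for the multipliers, `K` for the precomputed kernel matrix, `bias` for $b$) computes the two prediction errors and the unclipped update for a chosen pair of indices:

```python
import numpy as np

def unclipped_update(a, b_idx, lam, y, K, bias):
    """Return lambda_a^{new, unclipped} for the selected pair (a, b_idx)."""
    # Prediction errors E_i = y_hat_i - y_i, with y_hat_i = sum_j lam_j y_j K_{j,i} + bias
    E_a = (lam * y) @ K[:, a] + bias - y[a]
    E_b = (lam * y) @ K[:, b_idx] + bias - y[b_idx]
    eta = K[a, a] + K[b_idx, b_idx] - 2.0 * K[a, b_idx]
    if eta <= 0.0:
        return None        # degenerate pair: the caller should pick another pair
    return lam[a] + y[a] * (E_b - E_a) / eta
```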

Discussion: what if $K_{a,a} + K_{b,b} - 2K_{a,b} = 0$? In this case $L(\lambda)$ is a first-degree (linear) function of $\lambda_a$; it is still concave, but the expression for $\lambda_a^\text{new}$ above is no longer meaningful, so we simply select another pair $(\lambda_a, \lambda_b)$ and repeat the computation above.

Note that this expression of $\lambda_a^\text{new}$ is not yet clipped, so for clarity we call it $\lambda_a^\text{new, unclipped}$. Computing $\lambda_a^\text{new, unclipped}$ alone is not enough: we need to further clip it to the feasible domain determined by the dual constraints. According to the dual constraints, each $\lambda_i$ satisfies a box constraint, so we have:

$$ 0\leq \lambda_a \leq C\\ 0\leq \lambda_b \leq C\\ \lambda_b = -\lambda_a y_a y_b - \xi y_b $$

We know that $y_a y_b$ is either $+1$ or $-1$. Based on whether $y_a = y_b$ or not, we can draw the relationship between $\lambda_a$ and $\lambda_b$ under the box constraints, shown in the figure below.

*Figure: Relationship between $\lambda_a$ and $\lambda_b$ with box constraints.*

According to the figure, we can get the lower bound $L$ and upper bound $H$ for a meaningful new $\lambda_a$:

  1. if $y_a \neq y_b$:
    $$ L = \max(\xi y_b, 0)\\ H = \min(C+\xi y_b, C ) $$
  2. if $y_a = y_b$:
    $$ L = \max(0, -C-\xi y_b)\\ H = \min(C, -\xi y_b) $$
    Based on $L$ and $H$, we can get the clipped new $\lambda_a$:
$$ \lambda_a^\text{new, clipped} = \begin{cases} L, &\ \text{if}\ \lambda_a^\text{new, unclipped} < L \\ H, &\ \text{if}\ \lambda_a^\text{new, unclipped} > H \\ \lambda_a^\text{new, unclipped}, &\ \text{otherwise} \end{cases} $$

This is the final, feasible new value of $\lambda_a$. For simplicity, in the following we use $\lambda_a^\text{new}$ to refer to $\lambda_a^\text{new, clipped}$.
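The clipping step can be written compactly. A minimal sketch (function name and arguments are my own) that computes $\xi$, the bounds $L$ and $H$, and the clipped value:

```python
def clip_lambda_a(lam_a_unclipped, lam_a_old, lam_b_old, y_a, y_b, C):
    """Clip the unclipped lambda_a to the interval [L, H] implied by the dual constraints."""
    xi = -lam_a_old * y_a - lam_b_old * y_b            # xi = -(lambda_a y_a + lambda_b y_b)
    if y_a != y_b:                                     # y_a * y_b == -1
        L, H = max(0.0, xi * y_b), min(C, C + xi * y_b)
    else:                                              # y_a * y_b == +1
        L, H = max(0.0, -C - xi * y_b), min(C, -xi * y_b)
    return min(max(lam_a_unclipped, L), H)
```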

After getting $\lambda_a^\text{new}$, we need to compute $\lambda_b^\text{new}$:

$$ \lambda_b^\text{new} = -\lambda_a^\text{new} y_a y_b - \xi y_b $$

Now, we need to decide whether to update the value of the bias $b$. If $0 < \lambda_a^\text{new} < C$, then $x_a$ is a support vector located exactly on the margin, i.e. $y_a \big( \sum_i \lambda_i y_i K_{i,a} + b \big) = 1$. Therefore, we can update $b$ as:

$$ \begin{align} b^\text{new} &= y_a -\sum_{i\neq a,b} \lambda_i y_i K_{i,a} - \lambda_a^\text{new} y_a K_{a,a} - \lambda_b^\text{new} y_b K_{b,a}\\ &= b^\text{old} - ( \sum_{i}\lambda_i y_i K_{i,a} + b^\text{old} - y_a ) \\ &\ \ \ + (\lambda_a^\text{old}-\lambda_a^\text{new})y_a K_{a,a} +(\lambda_b^\text{old}-\lambda_b^\text{new}) y_b K_{b,a} \\ &= b^\text{old} - E_a + (\lambda_a^\text{old}-\lambda_a^\text{new})y_a K_{a,a} +(\lambda_b^\text{old}-\lambda_b^\text{new}) y_b K_{b,a} \end{align} $$

Otherwise, if $0 < \lambda_b^\text{new} < C$, we can update $b$ as:

$$ b^\text{new} = b^\text{old} - E_b + ( \lambda_a^\text{old} - \lambda_a^\text{new} )y_a K_{a,b} +( \lambda_b^\text{old} - \lambda_b^\text{new} ) y_b K_{b,b} $$

Note that if neither $0 < \lambda_a^\text{new} < C$ nor $0 < \lambda_b^\text{new} < C$ holds, we simply choose not to update $b$ in this iteration.
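The whole bias-update rule fits in one small function. Again, this is a sketch under the naming conventions used in the sketches above, not canonical code from the post:

```python
def update_bias(bias_old, E_a, E_b, lam_a_old, lam_a_new, lam_b_old, lam_b_new,
                y_a, y_b, K_aa, K_ab, K_bb, C):
    """Update b using whichever of the two new multipliers lies strictly inside (0, C)."""
    if 0.0 < lam_a_new < C:
        return (bias_old - E_a
                + (lam_a_old - lam_a_new) * y_a * K_aa
                + (lam_b_old - lam_b_new) * y_b * K_ab)
    if 0.0 < lam_b_new < C:
        return (bias_old - E_b
                + (lam_a_old - lam_a_new) * y_a * K_ab
                + (lam_b_old - lam_b_new) * y_b * K_bb)
    return bias_old    # neither multiplier is strictly inside the box: keep b unchanged
```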

Now we have finished a single iteration of SMO.

Before we summarize the SMO algorithm, here is an additional update that improves the computational efficiency.

  1. Computation of $\xi$: in the derivation above, $\xi$ is used when computing the clipping bounds $L$, $H$ and the new $\lambda_b$. Computing $\xi = \sum_{i \neq a,b} \lambda_i y_i$ directly would be time consuming. Instead, we can use the equation
    $$ \xi = -\lambda_a^\text{old} y_a - \lambda_b^\text{old} y_b $$
    By substituting this expression of $\xi$ into the expression of $\lambda_b^\text{new}$, we have:
$$ \lambda_b^\text{new} = \lambda_b^\text{old} + ( \lambda_a^\text{old} - \lambda_a^\text{new}) y_a y_b $$
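In code, this removes the $O(n)$ sum over the remaining multipliers; the update becomes a one-liner (a hypothetical helper, consistent with the sketches above):

```python
def new_lambda_b(lam_a_old, lam_a_new, lam_b_old, y_a, y_b):
    # Follows from the equality constraint; no need to recompute xi over all samples
    return lam_b_old + (lam_a_old - lam_a_new) * y_a * y_b
```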

Sequential Minimal Optimization Algorithm

Based on the derivation above, we can now write down the pseudocode of SMO.

Initialization: $\lambda_i = 0$ for $i = 1, \dots, n$, $b = 0$, and pre-calculation of the kernel matrix $K$
Repeat:
    heuristically (or randomly) select a pair $(\lambda_a, \lambda_b)$;
    compute $\eta = K_{a,a} + K_{b,b} - 2K_{a,b}$
    if $\eta = 0$:
        continue
    compute the prediction errors $E_a = \sum_{i}\lambda_i y_i K_{i,a} + b - y_a$ and $E_b = \sum_{i}\lambda_i y_i K_{i,b} + b - y_b$
    compute $\xi = -\lambda_a y_a - \lambda_b y_b$
    compute $\lambda_a^\text{new, unclipped} = \lambda_a + y_a (E_b - E_a) / \eta$
    if $y_a \neq y_b$:
        $L = \max(0, \xi y_b)$, $H = \min(C, C + \xi y_b)$
    else:
        $L = \max(0, -C - \xi y_b)$, $H = \min(C, -\xi y_b)$
    if $\lambda_a^\text{new, unclipped} < L$:
        $\lambda_a^\text{new} = L$
    else if $\lambda_a^\text{new, unclipped} > H$:
        $\lambda_a^\text{new} = H$
    else:
        $\lambda_a^\text{new} = \lambda_a^\text{new, unclipped}$
    compute $\lambda_b^\text{new} = \lambda_b + (\lambda_a - \lambda_a^\text{new}) y_a y_b$
    if $0 < \lambda_a^\text{new} < C$:
        $b \leftarrow b - E_a + (\lambda_a - \lambda_a^\text{new}) y_a K_{a,a} + (\lambda_b - \lambda_b^\text{new}) y_b K_{b,a}$
    else if $0 < \lambda_b^\text{new} < C$:
        $b \leftarrow b - E_b + (\lambda_a - \lambda_a^\text{new}) y_a K_{a,b} + (\lambda_b - \lambda_b^\text{new}) y_b K_{b,b}$
    update $\lambda_a \leftarrow \lambda_a^\text{new}$, $\lambda_b \leftarrow \lambda_b^\text{new}$
Until: maximum iteration reached, or the dual objective function cannot be further increased within a certain accuracy.
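Putting everything together, here is a self-contained sketch of the full loop in Python (a simplified SMO with random pair selection rather than a careful heuristic; all function and variable names are my own, and the toy data at the bottom is only for illustration):

```python
import numpy as np

def smo_train(K, y, C=1.0, max_iter=1000, tol=1e-5, seed=0):
    """Simplified SMO on a precomputed kernel matrix K (n x n) and labels y in {-1, +1}.

    Returns the dual variables lam and the bias term.
    """
    rng = np.random.default_rng(seed)
    n = len(y)
    lam = np.zeros(n)      # feasible initialization: all multipliers at zero
    bias = 0.0

    def error(i):
        # E_i = y_hat_i - y_i, with y_hat_i = sum_j lam_j y_j K_{j,i} + bias
        return (lam * y) @ K[:, i] + bias - y[i]

    for _ in range(max_iter):
        a, b = rng.choice(n, size=2, replace=False)    # random (not heuristic) pair selection
        eta = K[a, a] + K[b, b] - 2.0 * K[a, b]
        if eta <= 1e-12:
            continue                                   # degenerate pair, pick another one

        E_a, E_b = error(a), error(b)
        lam_a_old, lam_b_old = lam[a], lam[b]
        xi = -lam_a_old * y[a] - lam_b_old * y[b]

        # Unclipped update of lambda_a, then clip it to the box [L, H]
        lam_a_new = lam_a_old + y[a] * (E_b - E_a) / eta
        if y[a] != y[b]:
            L, H = max(0.0, xi * y[b]), min(C, C + xi * y[b])
        else:
            L, H = max(0.0, -C - xi * y[b]), min(C, -xi * y[b])
        lam_a_new = min(max(lam_a_new, L), H)
        if abs(lam_a_new - lam_a_old) < tol:
            continue                                   # negligible change, skip this pair
        lam_b_new = lam_b_old + (lam_a_old - lam_a_new) * y[a] * y[b]

        # Update the bias with whichever new multiplier lies strictly inside (0, C)
        if 0.0 < lam_a_new < C:
            bias += (-E_a + (lam_a_old - lam_a_new) * y[a] * K[a, a]
                     + (lam_b_old - lam_b_new) * y[b] * K[a, b])
        elif 0.0 < lam_b_new < C:
            bias += (-E_b + (lam_a_old - lam_a_new) * y[a] * K[a, b]
                     + (lam_b_old - lam_b_new) * y[b] * K[b, b])

        lam[a], lam[b] = lam_a_new, lam_b_new

    return lam, bias


# Toy usage: a linearly separable 2D problem with a linear kernel K = X X^T
X = np.array([[2.0, 2.0], [1.5, 2.5], [-2.0, -1.0], [-1.0, -2.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
lam, bias = smo_train(X @ X.T, y, C=1.0, max_iter=2000)
print(lam, bias)
```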

Cool, isn’t it? Now we are able to solve the dual problem using the SMO algorithm!
