An Introduction to Support Vector Machines (SVM): SVM with slack variables


Posted by Gu on June 7, 2019

Just to clarify, these contents are mainly summarized from the course I took: “Fundamentals of Big Data Analytics”, taught by Prof. Rudolf Mathar. For more information please visit: https://www.ti.rwth-aachen.de

Recall of the SVM primal problem:

$$ \min_{\mathbf{w},b}\ \frac{1}{2}\|\mathbf{w}\|^2\\ \begin{align} s.t.\ \ & y_i(\mathbf{w}^T\mathbf{x}_i+b)\geq 1,\ i=1,\dots,n \end{align} $$

This is the primal problem of the SVM in the case where points of two classes are linearly separable. Such a primal problem has two drawbacks:

  • The separating plane is sensitive to (easily influenced by) outliers.
  • Not suitable for the case where points of two classes are not linearly separable.
  1. The separating plane is sensitive to (easily influenced by) outliers.
    Hyperplane Influenced by Outliers
    Figure Hyperplane Influenced by Outliers shows how a single outlier greatly influences the final result of the hyperplane. This is because the constraints in the primal problem ensure that the minimum geodesic distance between the points and the separating hyperplane is $\frac{1}{\|\mathbf{w}\|}$. When there is an outlier, in order to satisfy the constraints, the model will choose a smaller $\frac{1}{\|\mathbf{w}\|}$ and also greatly change the rotation/position of the separating hyperplane. However, the separating hyperplane in Figure (b) is not a good choice: compared with (a), in (b) the points have a much smaller average geodesic distance to the separating hyperplane. Therefore, it is more likely that the SVM makes wrong decisions when classifying new points.
  2. Not suitable for the case where points of two classes are not linearly separable.
    If the points are not linearly separable, then the SVM primal problem doesn’t have an optimal solution, since there do not exist $\mathbf{w}$ and $b$ which satisfy the constraints $y_i(\mathbf{w}^T\mathbf{x}_i+b)\geq 1,\ i=1,\dots,n$.

SVM with Slack Variables

To solve the problems above, we need to introduce slack variables $\xi_i$ into the original SVM primal problem. This means that we allow certain (outlier) points to be within the margin or even cross the separating hyperplane, but such cases are penalized. Now the primal problem of the “Slack-SVM” is:

Primal Problem

$$ \min_{\mathbf{w},b}\ \frac{1}{2}\|\mathbf{w}\|^2 + C \sum_{i=1}^{n}\xi _i \\ \begin{align} s.t.\ \ y_i(\mathbf{w}^T\mathbf{x}_i+b) &\geq 1-\xi_i ,\ &i=1,\dots,n \\ \xi_i &\geq 0,\ &i=1,\dots,n \end{align} $$

Here $\xi_i$ is the slack variable, and the positive $C$ is the weight for the penalty term. Suppose that for some point $\mathbf{x}_i$, it holds $y_i(\mathbf{w}^T\mathbf{x}_i+b) = 1-\xi_i$:

  • if $\xi_i = 0$, then $\mathbf{x}_i$ is exactly at the marginal hyperplane (the margin for short).
  • if $0 < \xi_i < 1$, then $\mathbf{x}_i$ is located within the margin, but the label of $\mathbf{x}_i$ is correctly classified.
  • if $\xi_i > 1$, then $\mathbf{x}_i$ is located at the other side of the separating hyperplane, which means a misclassification.
    Different $\xi$ and Point Locations
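At the optimum the slack takes its smallest feasible value, $\xi_i=\max(0,\ 1-y_i(\mathbf{w}^T\mathbf{x}_i+b))$, so the three cases above can be checked numerically. A minimal sketch (the hyperplane and points here are made up for illustration, not from the figure):

```python
import numpy as np

# Hypothetical hyperplane: w = (1, 1), b = 0, i.e. the boundary x1 + x2 = 0.
w = np.array([1.0, 1.0])
b = 0.0

# Three illustrative points, all with label y = +1.
X = np.array([[1.0, 1.0],     # functional margin 2   -> xi = 0 (outside the margin)
              [0.3, 0.3],     # functional margin 0.6 -> 0 < xi < 1 (inside the margin)
              [-1.0, -1.0]])  # functional margin -2  -> xi > 1 (misclassified)
y = np.array([1.0, 1.0, 1.0])

# Slack: how much each point violates y_i (w^T x_i + b) >= 1.
xi = np.maximum(0.0, 1.0 - y * (X @ w + b))
print(xi)  # [0.  0.4 3. ]
```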

It is possible to use the gradient descent algorithm to solve the primal problem. However, due to the slack variables, the constraints are much more complex than in the case without slack variables, and it is more difficult to define a loss function for gradient descent. In contrast, the Lagrangian dual problem of this primal problem remains compact and solvable, and can easily be extended to the kernel SVM. Therefore, in the following we mainly discuss the derivation of the Lagrangian dual problem of the Slack-SVM primal problem.

Lagrangian Function

$$L( \mathbf{w}, b, \mathbf{\xi}, \lambda, \mu )= \frac{1}{2}\|\mathbf{w}\|^2 + C\sum_{i=1}^{n}\xi_i + \sum_{i=1}^{n}\lambda_i ( 1-\xi_i - y_i(\mathbf{w}^T\mathbf{x}_ i+b ) ) - \sum_{i=1}^{n}\mu_i \xi_i $$

Lagrangian Dual function

$$ g(\lambda, \mu) = \inf_{\mathbf{w}, b, \xi} L(\mathbf{w}, b, \xi, \lambda, \mu) $$

To get the dual function, we compute the derivatives of $L$ w.r.t. $\mathbf{w}$, $b$ and $\xi_i$ and set them to 0.

$$ \frac{\partial{L}}{\partial{\mathbf{w}}} = \mathbf{w} - \sum_{i=1}^{n}\lambda_i y_i \mathbf{x}_i = 0 \\ \frac{\partial{L}}{\partial{b}} = -\sum_{i=1}^{n}\lambda_i y_i = 0\\ \frac{\partial{L}}{\partial{\xi_i}} = C - \lambda_i - \mu_i = 0 $$

From these 3 equations we have

$$ \mathbf{w}^\star = \sum_{i=1}^{n} \lambda_i y_i \mathbf{x}_i\\ \sum_{i=1}^{n} \lambda_i y_i = 0\\ \mu_i = C-\lambda_i $$

Substituting them into the Lagrangian function, we get the Lagrangian dual function:

$$ g(\lambda, \mu) = \frac{1}{2}\sum_{i,j}\lambda_i \lambda_j y_i y_j \mathbf{x}_i^T\mathbf{x}_j + C\sum_{i=1}^{n}\xi_i + \sum_{i=1}^{n}\lambda_i -\sum_{i=1}^{n}\lambda_i \xi_i \\-\sum_{i,j}\lambda_i \lambda_j y_i y_j \mathbf{x}_i^T\mathbf{x}_j - \sum_{i=1}^{n}\lambda_i y_i b - C\sum_{i=1}^{n}\xi_i + \sum_{i=1}^{n}\lambda_i \xi_i\\ = \sum_{i=1}^{n}\lambda_i - \frac{1}{2}\sum_{i,j}\lambda_i \lambda_j y_i y_j \mathbf{x}_i^T\mathbf{x}_j $$

Therefore, the Lagrangian dual problem is:

$$ \max_{\lambda, \mu} \sum_{i=1}^{n}\lambda_i - \frac{1}{2}\sum_{i,j}\lambda_i \lambda_j y_i y_j \mathbf{x}_i^T\mathbf{x}_j\\ \begin{align} s.t.\ & \lambda_i \geq 0\\ &\mu_i \geq 0 \\ &\mu_i = C-\lambda_i \\ &\sum_{i=1}^{n} \lambda_i y_i =0 \end{align} $$

We can use $0 \leq \lambda_i \leq C$ to represent the three constraints $\lambda_i \geq 0$, $\mu_i \geq 0$ and $\mu_i = C-\lambda_i$, and finally get the dual problem:

Dual Problem

$$ \max_{\lambda, \mu} \sum_{i=1}^{n}\lambda_i - \frac{1}{2}\sum_{i,j}\lambda_i \lambda_j y_i y_j \mathbf{x}_i^T\mathbf{x}_j\\ \begin{align} s.t.\ & 0 \leq \lambda_i \leq C \\ &\sum_{i=1}^{n} \lambda_i y_i =0 \end{align} $$

Compared with the dual problem for the SVM without slack variables, the only difference is that here the constraint on $\lambda_i$ is $0 \leq \lambda_i \leq C$, instead of $\lambda_i \geq 0$.

Actually, in the primal problem of the SVM without slack variables, we can think of a hidden $C=+\infty$, which means that the penalty on slack variables is infinitely large, so all points need to satisfy $\xi_i = 0$, i.e., $y_i(\mathbf{w}^T\mathbf{x}_i+b)\geq 1$.

Solution of the Dual Problem

  1. Gradient Descent Algorithm. The objective function for gradient descent is:

    $$ \min_{\lambda} L(\lambda) = -\sum_{i=1}^{n} \lambda_i + \frac{1}{2}\sum_{i,j}\lambda_i \lambda_j y_i y_j \mathbf{x}_ i^T \mathbf{x}_ j + \frac{c}{2}(\sum_{i=1}^{n}\lambda_i y_i)^2 \\ s.t.\ 0\leq \lambda_i \leq C \ , \ i=1,\dots,n $$
    Compared with the post An Introduction to Support Vector Machines (SVM): Dual problem solution using GD, the objective function is the same. The only difference is that here the constraints are $0 \leq \lambda_i \leq C$. To satisfy these constraints, we can clip the value of $\lambda_i$ into the range $[0, C]$ after each gradient descent update. For the details of the gradient form, please have a look at that post.

  2. Sequential Minimal Optimization (SMO), which will be discussed in the following posts.
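The clipped gradient descent of option 1 can be sketched as follows. This is only a sketch, not the implementation from the earlier post; the toy data, learning rate, and quadratic penalty weight c are assumptions made here for illustration:

```python
import numpy as np

def slack_svm_dual_gd(X, y, C=1.0, c=10.0, lr=1e-3, n_iter=5000):
    """Minimize L(lam) = -sum(lam) + 0.5 * sum_ij lam_i lam_j y_i y_j x_i^T x_j
    + (c/2) * (sum_i lam_i y_i)^2, clipping lam into [0, C] after each step."""
    Yx = y[:, None] * X
    K = Yx @ Yx.T                                  # K_ij = y_i y_j x_i^T x_j
    lam = np.zeros(X.shape[0])
    for _ in range(n_iter):
        grad = -1.0 + K @ lam + c * (lam @ y) * y  # gradient of L w.r.t. lam
        lam = np.clip(lam - lr * grad, 0.0, C)     # projection onto the box [0, C]
    w = (lam * y) @ X                              # w* = sum_i lam_i y_i x_i
    sb = (lam > 1e-6) & (lam < C - 1e-6)           # margin support vectors: 0 < lam < C
    b = np.mean(y[sb] - X[sb] @ w) if sb.any() else 0.0
    return lam, w, b

# Toy linearly separable data.
X = np.array([[2.0, 2.0], [2.5, 1.5], [-2.0, -2.0], [-1.5, -2.5]])
y = np.array([1.0, 1.0, -1.0, -1.0])
lam, w, b = slack_svm_dual_gd(X, y, C=10.0)
print(np.sign(X @ w + b))  # should recover the labels
```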

Discussion on the Karush-Kuhn-Tucker (KKT) Conditions

The KKT conditions are now slightly different, since in the dual function there are actually two variables: $\lambda$ and $\mu$. For the primal optimum $(\mathbf{w}^\star, b^\star, \xi^\star)$ and the dual optimum $(\lambda^\star, \mu^\star)$, it holds:

  1. primal constraints
    $$ y_i({\mathbf{w}^\star}^T\mathbf{x}_ i +b^\star) \geq 1-\xi^\star_i \\ \xi^\star_i \geq 0 $$
  2. stationarity (the infimum of $L$ w.r.t. $\mathbf{w}$, $b$ and $\xi$ is attained)
    $$ \nabla_{\mathbf{w},b,\xi}L( \mathbf{w}^\star, b^\star, \xi^\star, \lambda^\star, \mu^\star)=0 $$
  3. dual constraints
    $$ \lambda_i^\star \geq 0\\ \mu_i^\star \geq 0\\ \sum_{i=1}^{n}\lambda_i^\star y_i =0\\ \mu_i^\star = C - \lambda_i^\star $$
  4. Complementary Slackness
    $$ \lambda_i^\star ( 1-\xi_i^\star - y_i({\mathbf{w}^\star}^T\mathbf{x}_ i +b^\star ) ) =0 \\ \mu_i^\star \xi_i^\star = 0 $$

The complementary slackness is interesting. Suppose that we have already found the primal optimum and dual optimum. We can analyze the location of the point $\mathbf{x}_i$ based on the value of $\lambda_i^\star$:


  1. $\lambda_i^\star = 0$: then $\mu_i^\star = C$, so $\xi_i^\star = 0$, and $y_i({\mathbf{w}^\star}^T\mathbf{x}_i+b^\star) \geq 1$. This means the distance from point $\mathbf{x}_i$ to the separating hyperplane is greater than or equal to $\frac{1}{\|\mathbf{w}^\star\|}$. This point is not a support vector.

  2. $0 < \lambda_i^\star < C$: then $\mu_i^\star > 0$, so $\xi_i^\star = 0$, and $y_i({\mathbf{w}^\star}^T\mathbf{x}_i+b^\star) = 1$. This means that $\mathbf{x}_i$ is exactly located at the margin hyperplane: the distance to the separating hyperplane is exactly $\frac{1}{\|\mathbf{w}^\star\|}$. The point is a support vector which is used to compute $\mathbf{w}^\star$ and $b^\star$.

  3. $\lambda_i^\star = C$: then $\mu_i^\star = 0$, so $\xi_i^\star$ can be positive, and $y_i({\mathbf{w}^\star}^T\mathbf{x}_i+b^\star) = 1-\xi_i^\star$. This means that $\mathbf{x}_i$ is within the margin, or even located on the other side of the separating hyperplane (misclassification). The point is also a support vector which is used to compute $\mathbf{w}^\star$, but not used to compute $b^\star$.

Suppose that we have solved the dual problem and obtained the dual optimum. Let $S_w = \{i \mid \lambda_i^\star > 0\}$ represent the support set related with $\mathbf{w}^\star$; $S_b = \{i \mid 0 < \lambda_i^\star < C\}$ represent the support set related with $b^\star$. Meanwhile, we define $S_b^+ = \{i \in S_b \mid y_i = +1\}$ and $S_b^- = \{i \in S_b \mid y_i = -1\}$. Then we can compute the primal optimum:

$$ \mathbf{w}^\star = \sum_{i\in S_w}\lambda_i^\star y_i \mathbf{x}_i $$

Multiple ways can be used to compute $b^\star$:

$$ b^\star= y_i - {\mathbf{w}^\star}^T\mathbf{x}_i \ , \ i\in S_b\\ b^\star= \frac{1}{\vert S_b \vert}\sum_{i\in S_b}({y_i - {\mathbf{w}^\star}^T\mathbf{x}_i})\\ b^\star = -\frac{1}{2}{\mathbf{w}^\star}^T(\mathbf{x}_i+\mathbf{x}_j)\ , \ i\in S_b^+, j \in S_b^- $$
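The recovery of $(\mathbf{w}^\star, b^\star)$ from a solved dual can be checked on a tiny hand-computable example. The two points and their $\lambda^\star$ below are constructed for illustration (for $\mathbf{x}_1=(1,0), y_1=+1$ and $\mathbf{x}_2=(-1,0), y_2=-1$, the dual optimum is $\lambda^\star=(0.5, 0.5)$), not taken from the experiments:

```python
import numpy as np

X = np.array([[1.0, 0.0], [-1.0, 0.0]])
y = np.array([1.0, -1.0])
lam = np.array([0.5, 0.5])   # dual optimum of this toy problem
C = 1.0

Sw = lam > 1e-9              # support set for w*: lambda_i > 0
Sb = Sw & (lam < C - 1e-9)   # support set for b*: 0 < lambda_i < C

w = (lam[Sw] * y[Sw]) @ X[Sw]   # w* = sum_{i in Sw} lambda_i y_i x_i

# Three equivalent ways to compute b*:
b1 = y[Sb][0] - X[Sb][0] @ w                 # a single point in S_b
b2 = np.mean(y[Sb] - X[Sb] @ w)              # average over S_b
x_pos = X[Sb & (y > 0)][0]                   # one point from S_b^+
x_neg = X[Sb & (y < 0)][0]                   # one point from S_b^-
b3 = -0.5 * w @ (x_pos + x_neg)
print(w, b1, b2, b3)  # w* = (1, 0), all three b* agree (here: 0)
```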

Experiment Results

We compare the separating hyperplane results between the SVM with slack variables (Slack-SVM for short) and the original SVM without slack variables (Original-SVM for short). The SVM models are trained by solving the Lagrangian dual problem with the gradient descent algorithm introduced in the last post.

For further discussion, we recall the primal/dual problem of the Original-SVM and the primal/dual problem of the Slack-SVM:

  1. Original-SVM
    Primal Problem
    $$ \min_{\mathbf{w},b}\ \frac{1}{2}\|\mathbf{w}\|^2\\ \begin{align} s.t.\ \ & y_i(\mathbf{w}^T\mathbf{x}_ i+b)\geq 1,\ i=1,\dots,n \end{align} $$
    Dual Problem
    $$ \max_{\lambda, \mu} \sum_{i=1}^{n}\lambda_i - \frac{1}{2}\sum_{i,j}\lambda_i \lambda_j y_i y_j \mathbf{x}_i^T\mathbf{x}_j\\ \begin{align} s.t.\ & \lambda_i \geq 0 \\ &\sum_{i=1}^{n} \lambda_i y_i =0 \end{align} $$
  2. Slack-SVM
    Primal Problem
    $$ \min_{\mathbf{w},b}\ \frac{1}{2}\|\mathbf{w}\|^2 + C \sum_{i=1}^{n}\xi _i \\ \begin{align} s.t.\ \ y_i(\mathbf{w}^T\mathbf{x}_ i+b) &\geq 1-\xi_i ,\ &i=1,\dots,n \\ \xi_i &\geq 0,\ &i=1,\dots,n \end{align} $$
    Dual Problem
    $$ \max_{\lambda, \mu} \sum_{i=1}^{n}\lambda_i - \frac{1}{2}\sum_{i,j}\lambda_i \lambda_j y_i y_j \mathbf{x}_i^T\mathbf{x}_j\\ \begin{align} s.t.\ & 0 \leq \lambda_i \leq C \\ &\sum_{i=1}^{n} \lambda_i y_i =0 \end{align} $$

Experiment 1.
Comparison of performance in the case where there are outliers but the points are still linearly separable, with a fixed Slack-SVM penalty term weight $C$.

Slack SVM vs Original SVM on separable outliers
This result fits well with the analysis in the Figure Hyperplane Influenced by Outliers! The Original-SVM tries hard to find a separating hyperplane regardless of the obvious outlier point. It takes many iterations (20 times longer than the Slack-SVM) to finally find the separating hyperplane with a tight margin. In contrast, the Slack-SVM simply chooses to ignore the outlier point. Its separating hyperplane is almost identical to the case without the outlier point.

Experiment 2.
Analyzing the influence of different Slack-SVM penalty term weights $C$.

Slack SVM over different penalty weight C

As we increase the value of $C$, the geodesic margin becomes wider. The outlier point gets geodesically closer to the margin hyperplane. More points become support vectors.

To explain this we need to refer to the form of the Slack-SVM primal problem. When we increase $C$, the penalty term $C\sum_{i=1}^{n}\xi_i$ weighs more heavily, so the model tends to reduce the values of $\xi_i$. So how to reduce $\xi_i$?

The answer is to reduce $\|\mathbf{w}\|$. This may sound a little bit bizarre, but we can tell that from the figure Slack SVM over different penalty weight C.

For different values of $C$, the location and rotation of the separating hyperplane remain similar, so the distances from the points to the separating hyperplane are similar. We know that for a point which is within the margin or is located on the other side of the separating hyperplane, its geodesic distance to the separating hyperplane is $\frac{|1-\xi_i|}{\|\mathbf{w}\|}$. For the outlier points which cross the separating hyperplane, like the solid blue circle in the top right corner, the geodesic distance is $\frac{\xi_i-1}{\|\mathbf{w}\|}$.

For a large $C$, we need to reduce the large $\xi_i$ of that outlier point, while its geodesic distance $\frac{\xi_i-1}{\|\mathbf{w}\|}$ remains unchanged. So the possible solution is to reduce $\|\mathbf{w}\|$. As a result, the geodesic margin $\frac{1}{\|\mathbf{w}\|}$ will be increased. Therefore, the larger $C$ is, the wider the margin area is.
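The argument above can be checked with a small numeric example (the numbers are made up): fix an outlier at geodesic distance $d$ on the wrong side of the hyperplane, so its slack is $\xi = 1 + d\,\|\mathbf{w}\|$, and shrink $\|\mathbf{w}\|$.

```python
d = 2.0  # fixed geodesic distance of the outlier on the wrong side

for w_norm in [2.0, 1.0, 0.5]:
    xi = 1.0 + d * w_norm  # slack of the outlier: xi = 1 + d * ||w||
    margin = 1.0 / w_norm  # geodesic margin: 1 / ||w||
    print(f"||w||={w_norm}: xi={xi}, margin={margin}")
# Shrinking ||w|| reduces the outlier's slack xi and widens the margin.
```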

Original SVM for linearly non-separable cases
We also notice that for $C=100$ and $C=10000$, the separating results are almost the same. This leads to another question: what if we set $C=+\infty$ and solve the dual problem of the Slack-SVM?

If we set $C=+\infty$, the primal/dual problem of the Slack-SVM is exactly the same as the primal/dual problem of the original SVM. Here is a short proof:

  1. For the dual problem, it obviously holds: when $C=+\infty$, the constraint $0 \leq \lambda_i \leq C$ becomes $\lambda_i \geq 0$.
  2. For the primal problem:
    $$ \min_{\mathbf{w},b}\ \frac{1}{2}\|\mathbf{w}\|^2 + C \sum_{i=1}^{n}\xi _i \\ \begin{align} s.t.\ \ y_i(\mathbf{w}^T\mathbf{x}_ i+b) &\geq 1-\xi_i ,\ &i=1,\dots,n \\ \xi_i &\geq 0,\ &i=1,\dots,n \end{align} $$
    When $C=+\infty$, to keep the objective function finite, it must hold $\sum_{i=1}^{n}\xi_i = 0$. Therefore, $\xi_i = 0$ for all $i$, and the Slack-SVM’s primal problem becomes:
    $$ \min_{\mathbf{w},b}\ \frac{1}{2}\|\mathbf{w}\|^2 \\ \begin{align} s.t.\ \ y_i(\mathbf{w}^T\mathbf{x}_ i+b) &\geq 1,\ &i=1,\dots,n \\ \end{align} $$
    This is exactly the Original SVM’s primal problem.

Therefore, the above question is equivalent to asking: what if we apply the Original-SVM to a linearly non-separable case?

The answer is that the separating results will be almost the same as in the figure Slack SVM over different penalty weight C. Why is the geodesic margin not enlarged further?

We showed that the Original-SVM is equivalent to setting $C=+\infty$ in the Slack-SVM. However, from the perspective of the dual problem, the effective value of $C$ is actually determined by the upper bound of the trained $\lambda_i$. For example, if we set a very large $C$, but the actual upper bound of the trained $\lambda_i$ is 10000, then the effective $C$ is actually 10000. Therefore, when applying the Original-SVM to a linearly non-separable case, the final separating result is identical to the large-$C$ Slack-SVM case.

Original SVM on linearly non-separable case
Here I also check the maximum of $\lambda_i$ after training for different $C$:

| $C$ | 10 | 100 | 10000 |
|---|---|---|---|
| $\max_i \lambda_i$ | 10 | 62.2 | 62.2 |

We can see that when $C$ reaches 100, the maximum of $\lambda_i$ reaches around 60. Therefore, increasing $C$ further does not influence the separating results. Note that as we continue training, the maximum $\lambda_i$ may rise further, but it can hardly reach the value of $C$ if $C$ is very large.