Just to clarify, these contents are mainly summarized from the course I took: “Fundamentals of Big Data Analytics”, taught by Prof. Mathar Rudolf. For more information please visit: https://www.ti.rwth-aachen.de
Recall of the SVM primal problem:

$$ \min_{\mathbf{w},b}\ \frac{1}{2}\|\mathbf{w}\|^2\\ \begin{align} s.t.\ \ & y_i(\mathbf{w}^T\mathbf{x}_ i+b)\geq 1,\ i=1,\dots,n \end{align} $$
This is the primal problem of the SVM in the case where points of two classes are linearly separable. Such a primal problem has two drawbacks:
 The separating plane is sensitive to (easily influenced by) outliers.
 Not suitable for the case where points of two classes are not linearly separable.
 The separating plane is sensitive to (easily influenced by) outliers.
Figure: Hyperplane Influenced by Outliers

Figure Hyperplane Influenced by Outliers shows how a single outlier greatly influences the final position of the hyperplane. This is because the constraints in the primal problem make sure that the minimum geodesic distance between the points and the separating hyperplane is $\frac{1}{\|\mathbf{w}\|}$. When there is an outlier, in order to satisfy the constraints, the model has to choose a smaller margin $\frac{1}{\|\mathbf{w}\|}$ and also greatly change the rotation/position of the separating hyperplane. However, using the separating hyperplane in Figure (b) is not a good choice: compared with (a), in (b) the points have a much smaller average geodesic distance to the separating hyperplane. Therefore, it is more likely that the SVM makes wrong decisions when classifying new points.

 Not suitable for the case where points of two classes are not linearly separable.
If the points are not linearly separable, then the SVM primal problem doesn’t have an optimal solution, since there don’t exist $\mathbf{w}$ and $b$ which satisfy the constraints $y_i(\mathbf{w}^T\mathbf{x}_ i+b)\geq 1$ for all $i$.
SVM with Slack Variables
To solve the problems above, we introduce slack variables $\xi_i$ into the original SVM primal problem. This means that we allow certain (outlier) points to be within the margin or even cross the separating hyperplane, but such cases are penalized. Now the primal problem of the “Slack-SVM” will be:
Primal Problem

$$ \min_{\mathbf{w},b,\xi}\ \frac{1}{2}\|\mathbf{w}\|^2 + C \sum_{i=1}^{n}\xi _i \\ \begin{align} s.t.\ \ y_i(\mathbf{w}^T\mathbf{x}_ i+b) &\geq 1-\xi_i ,\ &i=1,\dots,n \\ \xi_i &\geq 0,\ &i=1,\dots,n \end{align} $$
Here $\xi_i$ is the slack variable, and the positive $C$ is the weight for the penalty term. Suppose that for some point $\mathbf{x}_ i$, it holds $y_i(\mathbf{w}^T\mathbf{x}_ i+b) = 1-\xi_i$:
 if $\xi_i = 0$, then $\mathbf{x}_ i$ is exactly at the marginal hyperplane (the margin for short).
 if $0 < \xi_i < 1$, then $\mathbf{x}_ i$ is located within the margin, but the label of $\mathbf{x}_ i$ is correctly classified.
 if $\xi_i > 1$, then $\mathbf{x}_ i$ is located at the other side of the separating hyperplane, which means a misclassification (at $\xi_i = 1$ the point lies exactly on the separating hyperplane).
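As a quick numerical check, the three cases above can be reproduced by computing the slack $\xi_i = \max(0,\ 1 - y_i(\mathbf{w}^T\mathbf{x}_ i+b))$ for a hypothetical hyperplane. A minimal sketch, where the hyperplane $\mathbf{w}=[1,1]$, $b=0$ and the sample points are made-up illustration values:

```python
import numpy as np

# Hypothetical hyperplane w = [1, 1], b = 0 (illustration values only)
w, b = np.array([1.0, 1.0]), 0.0

# (description, point, label) triples chosen to land in the three cases above
points = [
    ("exactly at the margin", np.array([0.5, 0.5]), 1),
    ("within the margin, correctly classified", np.array([0.3, 0.3]), 1),
    ("misclassified", np.array([-0.2, -0.3]), 1),
]

for name, x, y in points:
    # slack needed so that y * (w^T x + b) >= 1 - xi holds
    xi = max(0.0, 1.0 - y * (w @ x + b))
    print(f"{name}: xi = {xi:.2f}")   # -> 0.00, 0.40, 1.50
```

The first point satisfies the hard constraint exactly ($\xi=0$), the second needs a slack in $(0,1)$, and the third needs $\xi>1$, i.e. it lies on the wrong side of the hyperplane.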
Figure: Different $\xi$ and Point Locations
It is possible to use the gradient descent algorithm to solve the primal problem. However, due to the slack variables, the constraints are much more complex than in the case without slack variables, and it is more difficult to define the loss function used for gradient descent. In contrast, the Lagrangian dual problem of this primal problem remains compact and solvable, and can be easily extended to the kernel SVM. Therefore, in the following we mainly discuss the derivation of the Lagrangian dual problem of the Slack-SVM primal problem.
Lagrangian Function

$$ L(\mathbf{w}, b, \xi, \lambda, \mu) = \frac{1}{2}\|\mathbf{w}\|^2 + C \sum_{i=1}^{n}\xi_i + \sum_{i=1}^{n}\lambda_i \left( 1-\xi_i - y_i(\mathbf{w}^T\mathbf{x}_ i +b) \right) - \sum_{i=1}^{n}\mu_i \xi_i \\ \lambda_i \geq 0 \ ,\ \mu_i \geq 0 \ ,\ i=1,\dots,n $$
Lagrangian Dual function
To get the dual function, we compute the derivatives of $L$ with respect to $\mathbf{w}$, $b$ and $\xi$ and set them to 0:

$$ \frac{\partial L}{\partial \mathbf{w}} = \mathbf{w} - \sum_{i=1}^{n}\lambda_i y_i \mathbf{x}_ i = 0 \\ \frac{\partial L}{\partial b} = -\sum_{i=1}^{n}\lambda_i y_i = 0 \\ \frac{\partial L}{\partial \xi_i} = C - \lambda_i - \mu_i = 0 $$

From these 3 equations we have

$$ \mathbf{w} = \sum_{i=1}^{n}\lambda_i y_i \mathbf{x}_ i \ , \quad \sum_{i=1}^{n}\lambda_i y_i = 0 \ , \quad \mu_i = C - \lambda_i $$
Substituting them into the Lagrangian function, we get the Lagrangian dual function:

$$ g(\lambda, \mu) = \sum_{i=1}^{n}\lambda_i - \frac{1}{2}\sum_{i,j}\lambda_i \lambda_j y_i y_j \mathbf{x}_ i^T\mathbf{x}_ j $$
Therefore, the Lagrangian dual problem is:

$$ \max_{\lambda, \mu} \sum_{i=1}^{n}\lambda_i - \frac{1}{2}\sum_{i,j}\lambda_i \lambda_j y_i y_j \mathbf{x}_ i^T\mathbf{x}_ j\\ \begin{align} s.t.\ & \lambda_i \geq 0 \\ & \mu_i \geq 0 \\ &\sum_{i=1}^{n} \lambda_i y_i =0 \\ & \mu_i = C - \lambda_i \end{align} $$
We can use $C - \lambda_i$ to represent $\mu_i$, which turns the constraint $\mu_i \geq 0$ into $\lambda_i \leq C$, and finally get the dual problem:
Dual Problem

$$ \max_{\lambda} \sum_{i=1}^{n}\lambda_i - \frac{1}{2}\sum_{i,j}\lambda_i \lambda_j y_i y_j \mathbf{x}_ i^T\mathbf{x}_ j\\ \begin{align} s.t.\ & 0 \leq \lambda_i \leq C \\ &\sum_{i=1}^{n} \lambda_i y_i =0 \end{align} $$
Compared with the dual problem for the SVM without slack variables, the only difference is that here the constraints on $\lambda_i$ are $0 \leq \lambda_i \leq C$, instead of $\lambda_i \geq 0$.
Actually, in the primal problem of the SVM without slack variables, we can think of a hidden $C = +\infty$, which means that the penalty on the slack variables is infinitely large, so all points need to satisfy $\xi_i = 0$, i.e. $y_i(\mathbf{w}^T\mathbf{x}_ i+b) \geq 1$.
Solution of the Dual Problem

Gradient Descent Algorithm

The objective function for gradient descent is:
$$ \min_{\lambda} L(\lambda) = -\sum_{i=1}^{n} \lambda_i + \frac{1}{2}\sum_{i,j}\lambda_i \lambda_j y_i y_j \mathbf{x}_ i^T \mathbf{x}_ j + \frac{c}{2}(\sum_{i=1}^{n}\lambda_i y_i)^2 \\ s.t.\ 0\leq \lambda_i \leq C \ , \ i=1,\dots,n $$ Compared with the post An Introduction to Support Vector Machines (SVM): Dual problem solution using GD, the objective function is the same. The only difference is that here the constraints are $0\leq \lambda_i \leq C$. To satisfy these constraints, we can clip the value of $\lambda_i$ into the range $[0, C]$ after each gradient descent update. For the detailed form of the gradient, please have a look at that post.
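A minimal sketch of this projected gradient descent. The toy dataset, learning rate, iteration count, and the quadratic-penalty weight `c_pen` (the $\frac{c}{2}(\sum_i \lambda_i y_i)^2$ term above) are illustrative assumptions, not tuned values:

```python
import numpy as np

def slack_svm_dual_gd(X, y, C=10.0, c_pen=10.0, lr=1e-3, n_iter=5000):
    """Minimize L(lam) = -sum(lam) + 0.5 lam^T K lam + (c/2)(lam^T y)^2
    subject to 0 <= lam_i <= C, by gradient steps followed by clipping."""
    n = X.shape[0]
    Yx = y[:, None] * X          # rows are y_i * x_i
    K = Yx @ Yx.T                # K[i, j] = y_i * y_j * x_i^T x_j
    lam = np.zeros(n)
    for _ in range(n_iter):
        grad = -np.ones(n) + K @ lam + c_pen * (lam @ y) * y
        lam -= lr * grad
        lam = np.clip(lam, 0.0, C)   # project back onto the box [0, C]
    return lam

# Tiny linearly separable toy set (illustrative values)
X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -3.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
lam = slack_svm_dual_gd(X, y)
w = ((lam * y)[:, None] * X).sum(axis=0)   # w = sum_i lam_i y_i x_i
```

On this symmetric toy set the bias is zero, and `np.sign(X @ w)` reproduces `y`; only the two points closest to the hyperplane end up with $\lambda_i > 0$.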
Another option is Sequential Minimal Optimization (SMO), which will be discussed in the following posts.
Discussion on the Karush-Kuhn-Tucker (KKT) conditions

The KKT conditions are now slightly different, since in the dual function there are actually two groups of variables: $\lambda$ and $\mu$. For the primal optimum $(\mathbf{w}^\star, b^\star, \xi^\star)$ and the dual optimum $(\lambda^\star, \mu^\star)$, it holds:
 primal constraints
$$ y_i({\mathbf{w}^\star}^T\mathbf{x}_ i +b^\star) \geq 1-\xi^\star_i \\ \xi^\star_i \geq 0 $$  compute the infimum of $L$ w.r.t. $\mathbf{w}$, $b$ and $\xi$
$$ \nabla_{\mathbf{w},b,\xi}L( \mathbf{w}^\star, b^\star, \xi^\star, \lambda^\star, \mu^\star)=0 $$  dual constraints
$$ \lambda_i^\star \geq 0\\ \mu_i^\star \geq 0\\ \sum_{i=1}^{n}\lambda_i^\star y_i =0\\ \mu_i^\star = C - \lambda_i^\star $$  Complementary Slackness
$$ \lambda_i^\star ( 1-\xi_i^\star - y_i({\mathbf{w}^\star}^T\mathbf{x}_ i +b^\star ) ) =0 \\ \mu_i^\star \xi_i^\star = 0 $$
The complementary slackness is interesting. Suppose that we have already found the primal optimum and the dual optimum. We can analyze the location of the point $\mathbf{x}_ i$ based on the value of $\lambda_i^\star$:
 If $\lambda_i^\star = 0$: then $\mu_i^\star = C > 0$, so $\xi_i^\star = 0$, and $y_i({\mathbf{w}^\star}^T\mathbf{x}_ i +b^\star) \geq 1$. This means the distance from point $\mathbf{x}_ i$ to the separating hyperplane is greater than or equal to $\frac{1}{\|\mathbf{w}^\star\|}$. This point is not a support vector.
 If $0 < \lambda_i^\star < C$: then $\mu_i^\star = C - \lambda_i^\star > 0$, so $\xi_i^\star = 0$, and $y_i({\mathbf{w}^\star}^T\mathbf{x}_ i +b^\star) = 1$. This means that $\mathbf{x}_ i$ is exactly located at the margin hyperplane: the distance to the separating hyperplane is exactly $\frac{1}{\|\mathbf{w}^\star\|}$. The point is a support vector which is used to compute $\mathbf{w}^\star$ and $b^\star$.
 If $\lambda_i^\star = C$: then $\mu_i^\star = 0$, so $\xi_i^\star$ can be positive, and $y_i({\mathbf{w}^\star}^T\mathbf{x}_ i +b^\star) = 1 - \xi_i^\star$. This means that $\mathbf{x}_ i$ is within the margin, or even located on the other side of the separating hyperplane (misclassification). The point is also a support vector which is used to compute $\mathbf{w}^\star$, but not used to compute $b^\star$.
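A sketch of how the three cases might be detected numerically from a trained $\lambda^\star$; the tolerance `tol` is an assumption to absorb floating-point error:

```python
import numpy as np

def sv_categories(lam, C, tol=1e-8):
    """Split point indices according to the three KKT cases above."""
    lam = np.asarray(lam)
    non_sv    = np.where(lam <= tol)[0]                      # lam* = 0
    margin_sv = np.where((lam > tol) & (lam < C - tol))[0]   # 0 < lam* < C
    bound_sv  = np.where(lam >= C - tol)[0]                  # lam* = C
    return non_sv, margin_sv, bound_sv

non_sv, margin_sv, bound_sv = sv_categories([0.0, 0.5, 1.0], C=1.0)
print(non_sv, margin_sv, bound_sv)   # -> [0] [1] [2]
```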
Suppose that we have solved the dual problem and obtained the dual optimum. Let $S_{\mathbf{w}} = \{i : \lambda_i^\star > 0\}$ represent the support set related with $\mathbf{w}^\star$; let $S_b = \{i : 0 < \lambda_i^\star < C\}$ represent the support set related with $b^\star$. Meanwhile, we define $n_{\mathbf{w}} = |S_{\mathbf{w}}|$ and $n_b = |S_b|$. Then we can compute the primal optimum:

$$ \mathbf{w}^\star = \sum_{i \in S_{\mathbf{w}}} \lambda_i^\star y_i \mathbf{x}_ i $$
Multiple ways can be used to compute $b^\star$: for any $j \in S_b$ we have $y_j({\mathbf{w}^\star}^T\mathbf{x}_ j + b^\star) = 1$, so we can take $b^\star = y_j - {\mathbf{w}^\star}^T\mathbf{x}_ j$ from a single support vector, or, more stably, average over all of $S_b$:

$$ b^\star = \frac{1}{n_b} \sum_{j \in S_b} \left( y_j - {\mathbf{w}^\star}^T\mathbf{x}_ j \right) $$
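Putting these formulas together, recovering the primal optimum from a dual solution might look like the following sketch. The tiny two-point dataset and its hand-computed dual optimum $\lambda^\star = (0.25, 0.25)$ are illustrative assumptions:

```python
import numpy as np

def recover_primal(X, y, lam, C, tol=1e-8):
    # w* = sum over the support set S_w = {i : lam_i > 0}
    # (terms with lam_i = 0 vanish, so summing over all i is equivalent)
    w = ((lam * y)[:, None] * X).sum(axis=0)
    # b* averaged over S_b = {i : 0 < lam_i < C}, the points exactly on the margin
    S_b = (lam > tol) & (lam < C - tol)
    b = np.mean(y[S_b] - X[S_b] @ w)
    return w, b

# Two points at [1, 1] and [-1, -1]; the dual optimum here is lam = (0.25, 0.25)
X = np.array([[1.0, 1.0], [-1.0, -1.0]])
y = np.array([1.0, -1.0])
lam = np.array([0.25, 0.25])
w, b = recover_primal(X, y, lam, C=1.0)
print(w, b)   # -> [0.5 0.5] 0.0
```

Both points lie exactly on the margin ($0 < \lambda_i^\star < C$), so both belong to $S_b$ and the averaged $b^\star$ is exact.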
Experiment Results
We compare the separating hyperplane results between the SVM with slack variables (Slack-SVM for short) and the original SVM without slack variables (Original-SVM for short). The SVM models are trained by solving the Lagrangian dual problem with the gradient descent algorithm introduced in the last post.
For further discussion, we recall the primal/dual problem of the OriginalSVM and the primal/dual problem of the SlackSVM:
 Original-SVM
Primal Problem$$ \min_{\mathbf{w},b}\ \frac{1}{2}\|\mathbf{w}\|^2\\ \begin{align} s.t.\ \ & y_i(\mathbf{w}^T\mathbf{x}_ i+b)\geq 1,\ i=1,\dots,n \end{align} $$ Dual Problem$$ \max_{\lambda} \sum_{i=1}^{n}\lambda_i - \frac{1}{2}\sum_{i,j}\lambda_i \lambda_j y_i y_j \mathbf{x}_ i^T\mathbf{x}_ j\\ \begin{align} s.t.\ & \lambda_i \geq 0 \\ &\sum_{i=1}^{n} \lambda_i y_i =0 \end{align} $$  Slack-SVM
Primal Problem$$ \min_{\mathbf{w},b,\xi}\ \frac{1}{2}\|\mathbf{w}\|^2 + C \sum_{i=1}^{n}\xi _i \\ \begin{align} s.t.\ \ y_i(\mathbf{w}^T\mathbf{x}_ i+b) &\geq 1-\xi_i ,\ &i=1,\dots,n \\ \xi_i &\geq 0,\ &i=1,\dots,n \end{align} $$ Dual Problem$$ \max_{\lambda} \sum_{i=1}^{n}\lambda_i - \frac{1}{2}\sum_{i,j}\lambda_i \lambda_j y_i y_j \mathbf{x}_ i^T\mathbf{x}_ j\\ \begin{align} s.t.\ & 0 \leq \lambda_i \leq C \\ &\sum_{i=1}^{n} \lambda_i y_i =0 \end{align} $$
Experiment 1.
Comparison of performance in the case where there are outliers but the points are still linearly separable. The Slack-SVM penalty term weight $C$ is fixed.
Experiment 2.
Analyzing the influence of different Slack-SVM penalty term weights $C$.
As we increase the value of $C$, the geodesic margin becomes wider, the outlier point gets geodesically closer to the margin hyperplane, and more points become support vectors.
To explain this we need to refer to the form of the Slack-SVM primal problem. When we increase $C$, the penalty term $C\sum_{i}\xi_i$ is more heavily weighted, so the model tends to reduce the values of $\xi_i$. But how can $\xi_i$ be reduced?
The answer is to reduce $\|\mathbf{w}\|$. This may sound a little bit bizarre, but we can tell that from the figure Slack SVM over different penalty weight C.
For different values of $C$, the location and rotation of the separating hyperplane remain similar, so the distances from the points to the separating hyperplane are similar. We know that for a point which is within the margin or is located on the other side of the separating hyperplane, it holds $y_i(\mathbf{w}^T\mathbf{x}_ i+b) = 1-\xi_i$, so its geodesic distance to the separating hyperplane is $\frac{|1-\xi_i|}{\|\mathbf{w}\|}$. For the outlier points which cross the separating hyperplane, like the solid blue circle in the top right corner, the geodesic distance is $\frac{\xi_i - 1}{\|\mathbf{w}\|}$.
For large $C$, we need to reduce the large $\xi_i$ of that outlier point, while its geodesic distance $d = \frac{\xi_i - 1}{\|\mathbf{w}\|}$ remains unchanged. Since $\xi_i = 1 + d\|\mathbf{w}\|$, the possible solution is to reduce $\|\mathbf{w}\|$. As a result, the geodesic margin $\frac{1}{\|\mathbf{w}\|}$ will be increased. Therefore, the larger $C$ is, the wider the margin area is.
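The relation $\xi_i = 1 + d\,\|\mathbf{w}\|$ for a misclassified point at fixed geodesic distance $d$ can be made concrete with a few made-up numbers (both $d$ and the $\|\mathbf{w}\|$ values below are purely illustrative):

```python
# A misclassified outlier at fixed geodesic distance d = 0.8 from the
# hyperplane needs slack xi = 1 + d * ||w||, so shrinking ||w|| shrinks xi
# while widening the margin 1 / ||w||.
d = 0.8
for w_norm in (4.0, 2.0, 1.0):
    xi = 1.0 + d * w_norm
    print(f"||w|| = {w_norm}: margin width = {1 / w_norm:.2f}, xi = {xi:.2f}")
# -> margin widths 0.25, 0.50, 1.00 and xi values 4.20, 2.60, 1.80
```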
Original SVM for linearly non-separable cases
We also notice that for $C=100$ and $C=10000$, the separating results are almost the same. This leads to another question: what if we set $C=+\infty$ and solve the dual problem of the Slack-SVM?
If we set $C=+\infty$, the primal/dual problem of the Slack-SVM is exactly the same as the primal/dual problem of the original SVM. Here is a short proof:
 for the dual problem, it obviously holds: the constraint $0 \leq \lambda_i \leq C$ simply becomes $\lambda_i \geq 0$.
 for the primal problem:
$$ \min_{\mathbf{w},b,\xi}\ \frac{1}{2}\|\mathbf{w}\|^2 + C \sum_{i=1}^{n}\xi _i \\ \begin{align} s.t.\ \ y_i(\mathbf{w}^T\mathbf{x}_ i+b) &\geq 1-\xi_i ,\ &i=1,\dots,n \\ \xi_i &\geq 0,\ &i=1,\dots,n \end{align} $$ When $C=+\infty$, to minimize the objective function to some finite value, it must hold $\sum_{i=1}^{n}\xi_i = 0$. Therefore $\xi_i = 0$ for all $i$, and the Slack-SVM’s primal problem becomes:$$ \min_{\mathbf{w},b}\ \frac{1}{2}\|\mathbf{w}\|^2 \\ \begin{align} s.t.\ \ y_i(\mathbf{w}^T\mathbf{x}_ i+b) &\geq 1,\ &i=1,\dots,n \\ \end{align} $$ This is exactly the Original-SVM’s primal problem.
Therefore, the above question is equivalent to asking: what if we apply the Original-SVM to the linearly non-separable case?
The answer is that the separating results will be almost the same as in the figure Slack SVM over different penalty weight C. But why is the geodesic margin not further enlarged?
We showed that the Original-SVM is equivalent to setting $C=+\infty$ in the Slack-SVM. However, from the perspective of the dual problem, the real effective value of $C$ is determined by the upper bound that the trained $\lambda_i$ actually reach. For example, if we set $C=+\infty$ but the trained $\lambda_i$ never exceed 10000, then the real effective $C$ is actually 10000. Therefore, when applying the Original-SVM to the linearly non-separable case, the final separating result is identical to the $C=10000$ case.
| $C$ | 10 | 100 | 10000 | $+\infty$ |
| :---: | :---: | :---: | :---: | :---: |
| $\max_i \lambda_i$ | 10 | 62.2 | 62.2 | 62.2 |
We can see that when $C$ reaches 100, the maximum of $\lambda_i$ usually reaches around 60. Therefore, further increasing $C$ does not influence the separating results. Note that as we continue training, the maximum $\lambda_i$ may rise further, but it can hardly reach the value of $C$ if $C$ is very large.
