Support Vector Machines (SVMs) classifier is a popular binary classifier based on maximum margin approach introduced by Vladimir Vapnik. It is one of the most popular and powerful "out-of-the-box" supervised classification techniques. The SVM classifier models the classification task by creating a feature space, which is a finite-dimensional vector space, and each dimension of which represents a "feature" of the sample.

Say, we have p training examples (x(i), y(i)), (1≤i≤p), where x(i) is a feature vector in d-dimensional feature space (x1, x2, .. xd). y(i) is the corresponding binary class labels {-1, 1}. The goal of SVM is to train a model that assigns new unseen objects into a particular class. It achieves this by creating a linear separation (also known as decision boundary, or hyperplane in higher dimensions) of the feature space that separates the two classes. Based on the features in the new unseen cases, it places a case "above" or "below" the hyperplane, leading to a binary classification. Contrasting to logistic regression classifiers, SVM is a non-probabilistic linear classifier because the features in the new objects fully determine their exact locations within the feature space and there is no stochastic element (such as random variable) involved.

The advantages of using SVM include its ability to handle extremely high dimensional data, memory efficiency due to the storage of only subset of data during training, and versatility by using techniques called kernel tricks to handle non-linearly separable data. However, SVMs may perform poorly when the number of features is greater than the number of training example (d > p). Other disadvantages include its lack of probabilistic interpretation and lack of interpretability on how classification decisions are made (black-box approach).

Hyperplane (or Decision Boundary)

Assuming the classes are linearly separable, SVMs would search for a hyperplane:

w1x1 + w2x2 + … + wdxd + b = 0

which can be rewritten as follows:


which can be expressed as the dot product (*) of vectors w and x:

w*x + b = wTx + b =0

x is the feature vector consisted of the training sample. w is the weight vector that is perpendicular to the hyperplane, thus it describes the spacial orientation of the hyperplane. The decision boundary will be a straight line if there are two features and a 2-D plane if there are three features, and so on. b specifies how far the hyperplane is away from the origin.


When data is linearly separable, there exists an infinite number of solutions that satisfies the equation w*x + b = 0. In other words, there are infinite hyperplanes existed that can classify the training data correctly. However, some hyperplanes handle random noise from the data better than the others, thus resulting in better generalization. In SVMs, the most preferred hyperplane is the one that maximizes the margin between the two classes. A margin is defined as the smallest perpendicular distance to a training observation from the hyperplane. Surrounding the hyperplane, there are two support hyperplanes (+ and -) that are parallel to the w*x + b = 0 classifier hyperplane, and the margin is the shortest (perpendicular) distance between these two support hyperplanes (+ and -). The + support hyperplane is shown (below) as the upper dotted line as w*x + b = 1, while the – support hyperplane is shown as the lower dotted line as w*x + b = -1. Data above the + support hyperplane is in the region w*x + b > 1 and is classified into one class, while data below the – support hyperplane is in the in the region w*x + b < -1 and is classified into the other class. The data points on top of the + and – support hyperplanes are called the support vectors.


Maximizing the Margin

The primary objective for SVM is to maximize the margin or distance (D). To find D, we arbitrarily pick a point (Point A) from one of the support hyperplane (for example a point from the – hyperplane):


The corresponding point (Point B) on the opposite support hyperplane (+) will be D distance away from Point A. This D distance is perpendicular to the classifier hyperplane, and in the same direction as the w vector.


combining the two equations from the two points gives:


It turns out that the maximization task of margin D is equivalent to the minimization task of:


which is more preferable since the latter can be solved for using relatively simple quadratic optimization.

The above function Φ(w) holds under the assumption that all training data is correctly classified. This assumption can be expressed as:


combining the two equations above gives:


The Primal Form

To recap, the objective for SVM is to find w and b such that:


is minimized; and for all {(x(i), y(i))} it is subject to:


In other words, it needs to optimize a quadratic function subject to linear constraints stating that each observation must be on the correct side of the hyperplane and at least a certain distance from it.

Lagrange Formulation

Lagrange multipliers are mathematical method used to solve constrained optimization problems of differentiable functions. For example, in order to find the maximum/minimum of the objective function:


which is subject to


The Lagrange formulation will form a new function with a Lagrange multiplier λ:


The next step is to take partial derivatives of L with respect to x, y, z, and λ and set them to equal to 0.


and then substitute the solutions back into f(x, y, z).

In terms of SVM, the objective function to be minimized is:


which is subject to


We can apply the Lagrange formulation and become


w.r.t. w and b, and maximize w.r.t. each λ(i)≥0.

Taking the gradient (∇) of L with respect to w and set it to 0, we get


and the partial derivative of L with respect to b is


The Dual Form

Once ∇wL and ∂L/∂b are obtained above, they can be substituted back to the original Lagrange formulation L(w,b,λ) to get rid of w and b.


which is to be maximized w.r.t. to λ subject to 1) λ(i)≥0 for i=1, 2,…, p and 2) Σλ(i)y(i) = 0.

Quadratic Programming

Quadratic programming is a typical approach to solve the maximization problem (subject to constraints) above. First, the maximization problem is converted into a minimization problem for more efficient computation:




which can be viewed as follows


where the [matrix] is read off from the training data which is passed to quadratic programming as a quadratic term, while the (-1T) is passed to quadratic programming as a linear term. This is subject to λ≥0 and yTλ=0. The above formula can be expressed more compactly as where Q represent the quadratic coefficient term:


Quadratic programming will compute the above equation using the training data and output solution: λ= λ(1), λ(2), …, λ(p). The λ can then be substituted back into


in order to solve for w.

Furthermore, the formula for any support vector is used


and substitute w obtained above to solve for b.

Non-linear Transform


To add a capability to SVM to handle non-linearly separable data, ξ(i) called a slack variable is introduced. This slack variable is the distance of misclassified samples from their correct point of the supposed support hyperplane. The optimal hyperplane with maximum margin can be obtained by solving the following quadratic programming:


is to be minimized and subject to 1) y(i)(w*x(i) + b) >= 1-ξ(i) for all i, and 2) ξ(i) >=0 for all i. C is a constant (can be tested) acting as a trade-off parameter which signals the SVM optimization how much misclassification is allowed for the margin. A large value of C results in optimization with a smaller margin hyperplane if such margin most correctly classifies the training data. A small value of C results in optimizer finding the larger-margin hyperplane, but at the expense of allowing more misclassified training data.


Source material

Please rate this