datarekha

Support Vector Machines

Find the separating line with the widest margin. The margin is 2/‖w‖, only the support vectors define it, and kernels bend it into curves — all GATE-frequent.

9 min read Advanced GATE DA Lesson 89 of 122

What you'll learn

  • SVM picks the max-margin hyperplane wᵀx + b = 0
  • The margin width equals 2/‖w‖, so maximising margin means minimising ‖w‖
  • Only the closest points — support vectors — determine the boundary
  • Soft margins handle overlapping classes; kernels produce curved boundaries (both at a conceptual level only)

Before you start

Many lines can separate two classes — logistic regression settles on one of them. A Support Vector Machine (SVM) is pickier: among all separating lines it chooses the one that sits as far as possible from both classes, leaving the widest margin. That single choice gives SVMs their robustness — one reason they were the go-to classifier for text and image tasks before deep learning, and still a strong baseline on small, high-dimensional datasets. It leads to two facts GATE leans on every couple of years: the margin equals 2/‖w‖, and only a handful of points actually matter.

The maximum-margin hyperplane

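The decision boundary is a hyperplane wᵀx + b = 0. The margin is the empty band around it, running out to the nearest point of each class. With w and b scaled so that those nearest points satisfy wᵀx + b = ±1 (the standard convention), each margin edge lies at distance 1/‖w‖ from the boundary, so the full width works out to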

margin = 2 / ‖w‖

SVM maximises that width. Because the 2 is fixed, maximising the margin is the same as minimising ‖w‖ — a small weight norm means a wide margin.
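The formula can be checked numerically. The sketch below is illustrative only: the weight vector w = (3, 4), the bias b = -1 and the helper names `margin_width` and `signed_distance` are assumptions, not part of the lesson. It assumes the usual scaling in which the margin edges are wᵀx + b = ±1.

```python
import math

def margin_width(w):
    """Full margin width 2 / ||w|| for a hyperplane scaled so the edges are w.x + b = +/-1."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    if norm == 0:
        raise ValueError("w must be non-zero")
    return 2.0 / norm

def signed_distance(w, b, x):
    """Signed geometric distance from point x to the hyperplane w.x + b = 0."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    return (sum(wi * xi for wi, xi in zip(w, x)) + b) / norm

# Hypothetical example: w = (3, 4) has norm 5.
w, b = (3.0, 4.0), -1.0
width = margin_width(w)

# x_edge lies on the positive margin edge: 3*0 + 4*0.5 - 1 = +1.
x_edge = (0.0, 0.5)
half = signed_distance(w, b, x_edge)

print(width, half)  # the edge point sits at half the margin width from the boundary
```

Doubling w to (6, 8) doubles the norm and halves the width, which is the "small norm, wide margin" relationship in code form.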

Figure: the boundary wᵀx + b = 0 (solid line), the margin edges (dashed lines), and the margin width 2 / ‖w‖. The circled points are the support vectors; they touch the margin edges, and only they set the position of the boundary.

Support vectors — only the closest points matter

The points sitting exactly on the margin edges are the support vectors. They alone pin down the boundary. Every other point lies safely outside the band and contributes nothing: delete a non-support-vector and the boundary and margin do not move at all. This is why SVMs are described by “support” vectors — the solution depends on a small subset of the data, not all of it.
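The invariance can be seen in a small special case. The sketch below assumes separable one-dimensional data with the negative class to the left of the positive class. In that case the hard-margin solution has a closed form: the boundary is the midpoint between the two closest opposite-class points. The function names and the data are hypothetical, and this is not a general SVM solver.

```python
def fit_1d_hard_margin(xs, ys):
    """Closed-form hard-margin SVM for separable 1-D data (negatives left of positives).

    Returns (w, b) scaled so the nearest points satisfy y * (w * x + b) = 1.
    """
    neg_max = max(x for x, y in zip(xs, ys) if y == -1)
    pos_min = min(x for x, y in zip(xs, ys) if y == +1)
    if neg_max >= pos_min:
        raise ValueError("data must be separable with negatives on the left")
    gap = pos_min - neg_max
    w = 2.0 / gap
    b = -(pos_min + neg_max) / gap
    return w, b

def support_vectors(xs, ys, w, b, tol=1e-9):
    """Points lying exactly on a margin edge, i.e. y * (w * x + b) = 1."""
    return [x for x, y in zip(xs, ys) if abs(y * (w * x + b) - 1.0) < tol]

xs = [0.0, 1.0, 2.0, 4.0, 5.0, 7.0]
ys = [-1, -1, -1, +1, +1, +1]

w1, b1 = fit_1d_hard_margin(xs, ys)
svs = support_vectors(xs, ys, w1, b1)  # the two closest opposite-class points

# Remove x = 7.0, which is not a support vector: the solution is identical.
w2, b2 = fit_1d_hard_margin(xs[:-1], ys[:-1])

# Remove x = 4.0, which is a support vector: the margin widens.
w3, b3 = fit_1d_hard_margin(xs[:3] + xs[4:], ys[:3] + ys[4:])

print(svs, (w1, b1) == (w2, b2), 2.0 / abs(w1), 2.0 / abs(w3))
```

Deleting the far point leaves (w, b) untouched, while deleting a support vector moves the boundary and changes the margin.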

Soft margins and kernels

Two extensions, conceptual only at GATE level:

  • Soft margin. Real data overlaps, so a hard margin (zero violations) may be impossible. The soft-margin SVM adds slack that lets a few points sit inside the margin or on the wrong side, trading a little misclassification for a wider, more robust boundary.
  • Kernels. A linear SVM can only draw straight boundaries. A kernel (e.g. polynomial or RBF) implicitly maps the data into a higher-dimensional space where a straight boundary there corresponds to a curved boundary back in the original space — all without ever computing the mapping explicitly (the “kernel trick”). This is how SVMs fit non-linear class shapes.
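The kernel trick can be made concrete with one small case. The sketch below uses the homogeneous degree-2 polynomial kernel k(x, z) = (x·z)² in two dimensions, chosen here only for illustration. Its explicit feature map is (x₁², √2·x₁x₂, x₂²), and the code checks that the kernel value equals the inner product of the mapped points. The function names are hypothetical.

```python
import math

def poly2_kernel(x, z):
    """Homogeneous degree-2 polynomial kernel k(x, z) = (x . z)^2 for 2-D inputs."""
    return (x[0] * z[0] + x[1] * z[1]) ** 2

def phi(x):
    """Explicit 3-D feature map matching poly2_kernel: (x1^2, sqrt(2)*x1*x2, x2^2)."""
    return (x[0] ** 2, math.sqrt(2.0) * x[0] * x[1], x[1] ** 2)

x, z = (1.0, 2.0), (3.0, 4.0)

# Kernel route: works in the original 2-D space only.
k_implicit = poly2_kernel(x, z)

# Explicit route: map both points to 3-D, then take the inner product there.
k_explicit = sum(a * b for a, b in zip(phi(x), phi(z)))

print(k_implicit, k_explicit)  # the two routes agree up to floating-point rounding
```

For the RBF kernel the corresponding feature space is infinite-dimensional, so only the kernel route is available in practice.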

How GATE asks this

Usually a conceptual MCQ/MSQ on the margin and support vectors, sometimes with a tiny hard-margin calculation. Two patterns recur: (1) “given ‖w‖, what is the margin?” — plug into 2/‖w‖; and (2) “what happens to the boundary if a non-support-vector is removed / a support vector is removed?” Both appeared in 2024 and 2025. No dual formulation or KKT conditions are required.

Worked example — support-vector invariance and the margin

SVM1 is trained on a dataset. SVM2 is trained on the same data but with one point — a non-support-vector — removed. Also, you are told the learned weight vector has ‖w‖ = 2. Compare the two margins, and give the margin width.

Reason it out in two steps:

  1. Does the margin change? The boundary and margin of an SVM are determined only by the support vectors. The removed point is not a support vector, so it never touched the margin and never constrained the solution. Removing it leaves the support vectors — and therefore w, b, the boundary, and the margin — unchanged. SVM2’s margin equals SVM1’s margin.

  2. What is the width? Plug the given norm into the formula:

    margin = 2 / ‖w‖ = 2 / 2 = 1.

So the margin is unchanged by deleting the non-support-vector, and its width is 1. (GATE DA tested support vectors directly in 2024 — “which set is a possible set of support vectors?” — and the margin again in 2025.) Sanity check: if instead ‖w‖ = 4, the margin would be 2/4 = 0.5 — a larger weight norm always means a narrower margin.

Quick check

Q1. A trained SVM has weight vector norm ‖w‖ = 2. What is the width of its margin? (numerical answer)
Q2. Another SVM has ‖w‖ = 4. What is its margin width? (numerical answer, 2 decimals)
Q3. SVM2 is trained on the same data as SVM1 but with one non-support-vector point removed. How do the two margins compare?
Q4. Which statements about SVMs are correct? (select all that apply)
Q5. Which statements about kernels and soft margins are correct? (select all that apply)
Q6. What does 'support vector' refer to in an SVM?

Practice this in an interview

How does an SVM work, and what is the kernel trick?

An SVM finds the hyperplane that maximises the margin, that is, the distance to the nearest points of each class (the support vectors). When data is not linearly separable, the kernel trick implicitly maps inputs to a high-dimensional feature space: a kernel function returns inner products in that space directly from the original inputs, without ever materialising the transformation. This enables non-linear decision boundaries while the computation stays in the original input space.

What does the C parameter control in a Support Vector Machine?

C is the regularisation parameter that trades margin width against training error tolerance. A small C allows many margin violations (wide margin, simpler boundary, higher bias) while a large C penalises violations heavily, forcing a narrow margin that fits the training data more tightly but risks overfitting.
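The trade-off can be read off the standard soft-margin objective, ½‖w‖² + C·Σξᵢ, where the slack ξᵢ = max(0, 1 − yᵢ(wᵀxᵢ + b)) measures how far point i intrudes past its margin edge. The sketch below evaluates that objective for one fixed, hypothetical (w, b) and dataset. It does not train anything, and the function names are illustrative.

```python
def slack(w, b, x, y):
    """Slack xi = max(0, 1 - y * (w.x + b)); zero when x is on or outside its margin edge."""
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    return max(0.0, 1.0 - y * score)

def soft_margin_objective(w, b, points, labels, C):
    """Standard soft-margin objective: 0.5 * ||w||^2 + C * (sum of slacks)."""
    reg = 0.5 * sum(wi * wi for wi in w)
    total_slack = sum(slack(w, b, x, y) for x, y in zip(points, labels))
    return reg + C * total_slack

# Hypothetical candidate boundary and data; the second point sits inside the margin.
w, b = (1.0, 0.0), 0.0
points = [(2.0, 0.0), (0.5, 0.0), (-2.0, 0.0)]
labels = [+1, +1, -1]

slacks = [slack(w, b, x, y) for x, y in zip(points, labels)]
low_c = soft_margin_objective(w, b, points, labels, C=0.1)
high_c = soft_margin_objective(w, b, points, labels, C=10.0)

print(slacks, low_c, high_c)
```

With a small C the single violation barely raises the objective, so a wide margin is cheap to keep. With a large C the same violation dominates, pushing the optimiser towards a narrower margin that removes it.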

How do you handle skewed features in a machine learning dataset, and why does skew matter?

Right-skewed features (long tail on the right) concentrate most values near zero while a few extreme values pull the mean up, which distorts distance-based models and linear regression. Common fixes are log, square-root, or Box-Cox transformations that compress the tail and make the distribution closer to normal, improving model convergence and reducing the undue influence of large values.

How does Ordinary Least Squares derive the coefficient vector, and what is the closed-form solution?

OLS minimizes the sum of squared residuals. Setting the gradient of the loss to zero yields the normal equations XᵀXβ = Xᵀy. When X has full column rank, XᵀX is invertible and the unique solution is β = (XᵀX)⁻¹Xᵀy. The fitted values ŷ = Xβ = Hy are the projection of y onto the column space of X, where H = X(XᵀX)⁻¹Xᵀ is the hat matrix.
