datarekha

Support vector machines

Find the boundary with the widest margin between classes — and bend it to any shape with the kernel trick. The max-margin idea, soft margins (C), and RBF kernels.

8 min read Intermediate Machine Learning Lesson 12 of 33

What you'll learn

  • The max-margin hyperplane and why only the support vectors matter
  • Soft margins and the C hyperparameter (regularization vs fit)
  • The kernel trick — separating non-linear data with RBF/polynomial kernels

Many lines can separate two classes. A support vector machine (SVM) asks a sharper question: which boundary leaves the widest gap between the classes? That single idea, the maximum margin, is the main reason SVMs tend to generalize well. The kernel trick that comes with it is one of the most elegant moves in ML, and a perennial interview question.

Max margin: the widest street

Picture the boundary as a street between the two classes. The SVM makes that street as wide as possible. The points that touch the curb, the closest ones on each side, are the support vectors, and they’re the only points that define the boundary. Move a far-away point (without pushing it into the street) and nothing changes; move a support vector and the boundary shifts. Because the boundary depends only on these hardest cases and keeps the widest possible buffer around them, SVMs tend to generalize well.
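
The street picture can be checked with a little arithmetic. The sketch below is only an illustration: the six points and the boundary `w`, `b` are hypothetical values chosen by hand, and `street` is a helper name invented here, not a library function. It measures each point's distance to the boundary, takes the smallest as the half-width of the street, and reports the points sitting on the curb. For this toy data the vertical line x1 = 2 is the widest possible street, because the closest opposite-class pair, (3, 1) and (1, 1), is exactly 2 apart.

```python
import math

def street(points, w, b):
    """Return (street width, support vectors) for the boundary w.x + b = 0."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    # Distance of every point from the boundary, positive when on its own side.
    dists = [y * (sum(wi * xi for wi, xi in zip(w, x)) + b) / norm
             for x, y in points]
    half_width = min(dists)  # distance from the boundary to the curb
    support = [x for (x, _), d in zip(points, dists)
               if math.isclose(d, half_width)]
    return 2 * half_width, support

# Hypothetical toy data: (point, label).
points = [((3.0, 1.0), +1), ((4.0, 3.0), +1), ((6.0, 2.0), +1),
          ((1.0, 1.0), -1), ((0.0, 3.0), -1), ((-2.0, 2.0), -1)]

# Assumed boundary: the vertical line x1 = 2.
w, b = (1.0, 0.0), -2.0

width, support = street(points, w, b)

# Move a far-away point even farther: the street does not change.
moved = [((9.0, 5.0), y) if x == (6.0, 2.0) else (x, y) for x, y in points]
width_after, support_after = street(moved, w, b)
```

Here `width` is 2.0 and `support` holds only (3, 1) and (1, 1); relocating the far point at (6, 2) leaves both values unchanged.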

Soft margins and C

Real data is rarely cleanly separable, so SVMs use a soft margin: they allow some points to sit inside the margin or on the wrong side, for a penalty. The C hyperparameter sets how heavily those violations are punished:

  • Low C → a wider, more tolerant margin (more regularization). Accepts some misclassifications for a smoother boundary. Higher bias, lower variance.
  • High C → a narrow margin that tries hard to classify every training point correctly. Lower bias, higher variance — risks overfitting.

C is the SVM’s bias–variance dial.
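
One way to see the dial at work is to score boundaries with the standard soft-margin objective, 0.5 * ||w||^2 + C * (sum of hinge losses). The sketch below is an illustration under stated assumptions: the data, the two candidate boundaries, and the helper names `soft_margin_objective` and `preferred` are all made up for this example, and the candidates are hand-picked rather than produced by an optimizer. One positive point sits close to the negatives; the "wide" candidate tolerates it as a violation, the "narrow" candidate squeezes the street to classify it correctly.

```python
def soft_margin_objective(w, b, data, C):
    """0.5 * ||w||^2 + C * sum of hinge losses max(0, 1 - y * (w.x + b))."""
    regularizer = 0.5 * sum(wi * wi for wi in w)
    hinge = sum(
        max(0.0, 1.0 - y * (sum(wi * xi for wi, xi in zip(w, x)) + b))
        for x, y in data
    )
    return regularizer + C * hinge

# Hypothetical toy data: the last positive point is an outlier near the negatives.
data = [((-3.0, 0.0), -1), ((-2.0, 1.0), -1),
        ((2.0, 0.0), +1), ((3.0, 1.0), +1),
        ((-0.5, 0.5), +1)]

# Two hand-picked candidate boundaries (not the output of an optimizer).
candidates = {
    "wide": ((0.5, 0.0), 0.0),        # street width 2/||w|| = 4, outlier violates
    "narrow": ((4 / 3, 0.0), 5 / 3),  # street width 1.5, no violations
}

def preferred(C):
    """Name of the candidate with the lower objective for this C."""
    scores = {name: soft_margin_objective(w, b, data, C)
              for name, (w, b) in candidates.items()}
    return min(scores, key=scores.get)
```

With these numbers, `preferred(0.1)` picks the wide street and `preferred(10.0)` picks the narrow one: raising C makes the single violation expensive enough that shrinking the margin becomes the cheaper option.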

The kernel trick — bending the boundary

A straight line can’t separate concentric circles. The SVM’s superpower is the kernel trick: it implicitly maps the data into a higher-dimensional space where a straight boundary does exist, without ever computing those coordinates — it only needs dot products, which the kernel computes directly.
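
A small numerical check makes the "without ever computing those coordinates" claim concrete. The sketch below assumes a degree-2 polynomial kernel k(x, z) = (x·z)^2 on 2-D inputs, whose explicit feature map is known in closed form; `phi`, `dot`, and `poly2_kernel` are names chosen for this illustration. It confirms that the kernel returns the same number as the dot product in the 3-D feature space, and that two concentric circles (hypothetical radii 1 and 3) are split by a flat boundary once mapped there.

```python
import math

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def phi(x):
    """Explicit degree-2 feature map for a 2-D input."""
    x1, x2 = x
    return (x1 * x1, math.sqrt(2) * x1 * x2, x2 * x2)

def poly2_kernel(x, z):
    """Same value as dot(phi(x), phi(z)), computed in the 2-D input space."""
    return dot(x, z) ** 2

x, z = (1.0, 2.0), (3.0, -1.0)
explicit = dot(phi(x), phi(z))   # goes through the 3-D coordinates
implicit = poly2_kernel(x, z)    # never builds them

# Concentric circles: not separable by a line in 2-D.
inner = [(math.cos(t), math.sin(t)) for t in (0.0, 1.0, 2.0, 4.0)]           # radius 1
outer = [(3 * math.cos(t), 3 * math.sin(t)) for t in (0.5, 1.5, 3.0, 5.0)]   # radius 3

# In feature space, w.phi(x) + b = x1^2 + x2^2 - 4 is a flat (linear) boundary.
w, b = (1.0, 0.0, 1.0), -4.0
scores_inner = [dot(w, phi(p)) + b for p in inner]
scores_outer = [dot(w, phi(p)) + b for p in outer]
```

Both `explicit` and `implicit` come out equal (up to floating-point rounding), every inner-circle score is negative, and every outer-circle score is positive.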

  • Linear kernel — a straight boundary. Fast; the right choice for high-dimensional data (text) that’s already roughly linearly separable.
  • RBF (Gaussian) kernel — the default for non-linear data; wraps curved, blobby boundaries. Tuned by gamma (how local each point’s influence is).
  • Polynomial — curved boundaries of a fixed degree.
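
To get a feel for what gamma does, it can help to evaluate the RBF kernel directly. The sketch below assumes the common parameterization k(x, z) = exp(-gamma * ||x - z||^2); the function name `rbf_kernel` and the two sample points are chosen for illustration only.

```python
import math

def rbf_kernel(x, z, gamma):
    """RBF (Gaussian) kernel: exp(-gamma * squared Euclidean distance)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * sq_dist)

x, z = (0.0, 0.0), (1.0, 0.0)  # two points exactly 1 unit apart

# Similarity between the same two points under three gamma values.
similarities = {gamma: rbf_kernel(x, z, gamma) for gamma in (0.1, 1.0, 10.0)}
```

At gamma = 0.1 the two points still look very similar (about 0.90), at gamma = 1 the similarity drops to about 0.37, and at gamma = 10 it is practically zero. A large gamma therefore means each point only influences its immediate neighborhood, which is what allows tighter, more local boundaries.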

Quick check

Q1. What are support vectors, and why do they matter?
Q2. What does the C hyperparameter control?
Q3. What is the kernel trick?

Next

You’ve now met the core supervised algorithms. Next, the simplest non-parametric one — k-nearest neighbors — and the probabilistic Naive Bayes, before the ensembles that usually win.

Practice this in an interview

How does an SVM work, and what is the kernel trick?

An SVM finds the hyperplane that maximises the margin, the distance from the boundary to the nearest training points of each class (the support vectors). When data is not linearly separable, the kernel trick implicitly maps inputs to a high-dimensional feature space, computing inner products there without ever materialising the transformation. This enables non-linear decision boundaries while each kernel evaluation costs about as much as a computation in the original input space.

What is the kernel trick in SVM, and why does it work?

The kernel trick lets an SVM find a nonlinear decision boundary by implicitly mapping data into a higher-dimensional space where it can become linearly separable, without ever computing that mapping explicitly. It works because the SVM's dual formulation depends only on dot products between points, and a kernel function returns the dot product in the high-dimensional space directly from the original inputs. Common kernels are linear, polynomial, and RBF.
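
For reference, the standard soft-margin dual can be written out as follows. The notation is introduced here and is not from the lesson: n training points x_i with labels y_i in {-1, +1}, one multiplier alpha_i per point, and a kernel k.

```latex
\[
\max_{\alpha}\; \sum_{i=1}^{n} \alpha_i
  - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n}
    \alpha_i \alpha_j \, y_i y_j \, k(x_i, x_j)
\quad \text{subject to} \quad
0 \le \alpha_i \le C, \qquad \sum_{i=1}^{n} \alpha_i y_i = 0
\]

\[
f(x) = \operatorname{sign}\!\Big( \sum_{i=1}^{n} \alpha_i \, y_i \, k(x_i, x) + b \Big)
\]
```

The training points appear only inside k(x_i, x_j), so swapping the plain dot product for another kernel changes the feature space without changing the algorithm. Points with alpha_i = 0 drop out of the prediction sum; the ones with alpha_i > 0 are the support vectors.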

What do the C and gamma hyperparameters control in an SVM, and how do they relate to overfitting?

C controls the soft-margin tradeoff: large C penalizes misclassifications heavily, producing a narrow margin that can overfit, while small C allows more slack for better generalization. Gamma (for RBF kernels) sets how far one training point's influence reaches: high gamma means a short reach and a wiggly boundary that can overfit, low gamma means a long reach and a smoother boundary. You tune both jointly via cross-validation after scaling features.
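
The "after scaling features" step matters because the RBF kernel depends on Euclidean distance, so a feature measured in large units can swamp the rest. The sketch below illustrates this with hypothetical (age, income) rows; `standardize` and `income_share` are helper names made up for this example, and in practice a library scaler would normally do this job.

```python
import statistics

def standardize(rows):
    """Rescale each column to mean 0 and standard deviation 1."""
    cols = list(zip(*rows))
    means = [statistics.mean(c) for c in cols]
    stds = [statistics.pstdev(c) for c in cols]
    return [tuple((v - m) / s for v, m, s in zip(row, means, stds))
            for row in rows]

def income_share(a, b):
    """Fraction of the squared distance between a and b due to income (column 1)."""
    sq = [(u - v) ** 2 for u, v in zip(a, b)]
    return sq[1] / sum(sq)

# Hypothetical rows: (age in years, income in dollars).
rows = [(25.0, 40_000.0), (60.0, 41_000.0), (26.0, 90_000.0), (58.0, 88_000.0)]

# Rows 0 and 1 differ a lot in age (35 years) and barely in income (1,000 dollars).
raw_share = income_share(rows[0], rows[1])
scaled = standardize(rows)
scaled_share = income_share(scaled[0], scaled[1])
```

On the raw values, income accounts for more than 99% of the squared distance between the first two rows even though their ages are what really differ. After standardizing, its share falls below 1%, so a single gamma value treats both features on comparable terms.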

What does the C parameter control in a Support Vector Machine?

C is the regularisation parameter that trades margin width against training error tolerance. A small C allows many margin violations (wide margin, simpler boundary, higher bias) while a large C penalises violations heavily, forcing a narrow margin that fits the training data more tightly but risks overfitting.
