Support vector machines
Find the boundary with the widest margin between classes — and bend it to any shape with the kernel trick. The max-margin idea, soft margins (C), and RBF kernels.
What you'll learn
- The max-margin hyperplane and why only the support vectors matter
- Soft margins and the C hyperparameter (regularization vs fit)
- The kernel trick — separating non-linear data with RBF/polynomial kernels
Before you start
Many lines can separate two classes. A support vector machine (SVM) asks a sharper question: which boundary leaves the widest gap between the classes? That single idea — the maximum margin — gives SVMs strong generalization, and the kernel trick that comes with it is one of the most elegant moves in all of ML (and a perennial interview question).
Max margin: the widest street
Picture the boundary as a street between the two classes. The SVM makes that street as wide as possible. The points that touch the curb (the closest ones on each side) are the support vectors, and they’re the only points that define the boundary. Move a far-away point and nothing changes, as long as it stays outside the street; move a support vector and the whole boundary shifts. That focus on the hardest cases is a large part of why SVMs tend to generalize well.
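To make the street picture concrete, here is a minimal sketch in plain Python. The eight points and the two candidate boundaries are made up for illustration; the code does not train an SVM, it only measures how far the closest points sit from each candidate line (half the street width) and reports which points touch the curb.

```python
import math

# Toy 2-D dataset (invented for this example): label -1 on the left, +1 on the right.
points = [(-1, 0), (-3, 1), (-2, -2), (-4, 0), (1, 0), (3, 1), (2, -2), (4, 0)]
labels = [-1, -1, -1, -1, 1, 1, 1, 1]


def signed_distances(w, b):
    """Distance of each point from the line w.x + b = 0 (positive if on its correct side)."""
    norm = math.hypot(*w)
    return [y * (w[0] * x[0] + w[1] * x[1] + b) / norm for x, y in zip(points, labels)]


def margin_and_support(w, b, tol=1e-9):
    """Return the smallest distance (half the street width) and the points that attain it."""
    dists = signed_distances(w, b)
    margin = min(dists)
    support = [p for p, d in zip(points, dists) if abs(d - margin) < tol]
    return margin, support


# Two hypothetical candidate boundaries; both separate the classes.
margin_a, support_a = margin_and_support(w=(1, 0), b=0)  # vertical line x1 = 0
margin_b, support_b = margin_and_support(w=(2, 1), b=0)  # tilted line 2*x1 + x2 = 0

print(f"candidate A: margin={margin_a:.3f}, closest points={support_a}")
print(f"candidate B: margin={margin_b:.3f}, closest points={support_b}")
```

Candidate A leaves a margin of 1.0 and candidate B about 0.894, so A is the wider street. The closest opposite-class pair, (-1, 0) and (1, 0), is 2 apart, so no boundary can do better than A on this data. Those two points are its support vectors; moving any of the other six slightly would leave A's margin unchanged.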
Soft margins and C
Real data is rarely cleanly separable, so SVMs use a soft margin: they allow some points to sit inside the margin or on the wrong side, for a penalty. The C hyperparameter sets how much you punish those violations:
- Low C → a wider, more tolerant margin (more regularization). Accepts some misclassifications for a smoother boundary. Higher bias, lower variance.
- High C → a narrow margin that tries hard to classify every training point correctly. Lower bias, higher variance — risks overfitting.
C is the SVM’s bias–variance dial.
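The trade-off can be sketched with the standard soft-margin objective, 0.5·||w||² + C·Σ max(0, 1 − y·f(x)), where f(x) = w·x + b. The 1-D dataset and the two candidate boundaries below are hypothetical and hand-picked. The code only scores them; it does not solve the optimization. It shows how the same pair of candidates ranks differently under a low and a high C.

```python
def decision(w, b, x):
    """Linear decision function f(x) = w.x + b."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def soft_margin_objective(w, b, X, y, C):
    """0.5 * ||w||^2 + C * (sum of hinge losses max(0, 1 - y * f(x)))."""
    reg = 0.5 * sum(wi * wi for wi in w)
    slack = sum(max(0.0, 1.0 - yi * decision(w, b, xi)) for xi, yi in zip(X, y))
    return reg + C * slack


# Invented 1-D data: the positive point at x = -1 sits close to the negatives.
X = [(-3.0,), (-2.0,), (-1.0,), (2.0,), (3.0,)]
y = [-1, -1, 1, 1, 1]

# Two hand-picked candidates (assumptions for illustration, not solver output).
candidates = {
    "wide": ((0.5,), 0.0),    # boundary at x = 0, street width 2/|w| = 4, x = -1 violates it
    "narrow": ((2.0,), 3.0),  # boundary at x = -1.5, street width 1, no violations
}

preferred = {}
for C in (0.1, 10.0):
    scores = {name: soft_margin_objective(w, b, X, y, C) for name, (w, b) in candidates.items()}
    preferred[C] = min(scores, key=scores.get)
    print(f"C={C}: {scores} -> prefers {preferred[C]}")
```

With C = 0.1 the wide street wins despite misclassifying one point, because the violation is cheap. With C = 10 the same violation dominates the objective and the narrow, violation-free street wins.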
The kernel trick — bending the boundary
A straight line can’t separate concentric circles. The SVM’s superpower is the kernel trick: it implicitly maps the data into a higher-dimensional space where a straight boundary does exist, without ever computing those coordinates — it only needs dot products, which the kernel computes directly.
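A small sketch of that idea, using a degree-2 polynomial kernel on 2-D inputs. The feature map `phi`, the sample points, and the weights `w_feat` and `b_feat` are chosen by hand for illustration rather than learned. The first part checks that the kernel returns the same number as a dot product in the mapped space; the second shows that two concentric rings become separable by a flat boundary once mapped.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def poly2_kernel(x, z):
    """Degree-2 polynomial kernel without a constant term: k(x, z) = (x . z)^2."""
    return dot(x, z) ** 2


def phi(x):
    """Explicit 3-D feature map whose dot product equals poly2_kernel for 2-D inputs."""
    x1, x2 = x
    return (x1 * x1, math.sqrt(2) * x1 * x2, x2 * x2)


# 1) The kernel gives the mapped-space dot product without building phi.
x, z = (1.0, 2.0), (3.0, 4.0)
k_implicit = poly2_kernel(x, z)
k_explicit = dot(phi(x), phi(z))
print(k_implicit, k_explicit)

# 2) Concentric rings (radius 1 and radius 3) are not linearly separable in 2-D,
#    but in phi-space the first and third coordinates sum to the squared radius,
#    so a hand-picked linear rule separates them.
inner = [(1, 0), (0, 1), (-1, 0), (0, -1)]
outer = [(3, 0), (0, 3), (-3, 0), (0, -3)]
w_feat, b_feat = (1.0, 0.0, 1.0), -5.0  # assumed weights: score = radius^2 - 5
inner_scores = [dot(w_feat, phi(p)) + b_feat for p in inner]
outer_scores = [dot(w_feat, phi(p)) + b_feat for p in outer]
print(inner_scores, outer_scores)
```

Every inner point scores below zero and every outer point above it, so the circular boundary in the original space is a flat one in the mapped space. A kernel SVM never builds `phi` explicitly; it only ever evaluates the kernel.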
- Linear kernel: a straight boundary. Fast; the right choice for high-dimensional data (text) that’s already roughly linearly separable.
- RBF (Gaussian) kernel: the default for non-linear data; wraps curved, blobby boundaries. Tuned by gamma (how local each point’s influence is).
- Polynomial: curved boundaries of a fixed degree.
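To see what "how local" means for gamma, the sketch below evaluates the RBF kernel, k(x, z) = exp(−gamma·||x − z||²), between one reference point and three neighbours at increasing distance. The points and the gamma values are arbitrary choices for illustration.

```python
import math


def rbf_kernel(x, z, gamma):
    """RBF (Gaussian) kernel: exp(-gamma * squared Euclidean distance)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * sq_dist)


anchor = (0.0, 0.0)
neighbours = {"near": (0.5, 0.0), "mid": (1.0, 0.0), "far": (3.0, 0.0)}

for gamma in (0.1, 1.0, 10.0):
    sims = {name: rbf_kernel(anchor, p, gamma) for name, p in neighbours.items()}
    formatted = ", ".join(f"{name}={value:.4f}" for name, value in sims.items())
    print(f"gamma={gamma}: {formatted}")
```

At low gamma even the far neighbour keeps a noticeable similarity, so each training point influences a wide region and the boundary comes out smooth. At high gamma the similarity collapses to nearly zero a short distance away, so each point only shapes the boundary in its immediate neighbourhood.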
Quick check
Next
You’ve now met the core supervised algorithms. Next, the simplest non-parametric one — k-nearest neighbors — and the probabilistic Naive Bayes, before the ensembles that usually win.
Practice this in an interview
An SVM finds the hyperplane that maximises the margin, the distance to the nearest training points of each class (the support vectors). When data is not linearly separable, the kernel trick implicitly maps inputs to a high-dimensional feature space, computing inner products there without ever materialising the transformation. This enables non-linear decision boundaries while all computation stays in the original input space.
The kernel trick lets an SVM find a nonlinear decision boundary by implicitly mapping data into a higher-dimensional space where it becomes linearly separable, without ever computing that mapping explicitly. It works because the SVM's dual formulation depends only on dot products between points, and a kernel function computes that dot product directly in the high-dimensional space. Common kernels are linear, polynomial, and RBF.
C controls the soft-margin tradeoff: large C penalizes misclassifications heavily, producing a narrow margin that can overfit, while small C allows more slack, giving a wider margin and a smoother boundary that can underfit if C is too small. Gamma (for RBF kernels) sets how far one training point's influence reaches: high gamma means a short reach and a wiggly boundary that overfits, while low gamma means a long reach and a smoother boundary. You tune both jointly via cross-validation after scaling features.
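One way this joint tuning might look in practice, assuming scikit-learn is available (the lesson itself does not name a library). The synthetic concentric-circles dataset and the grid values are arbitrary illustrative choices, not recommended defaults.

```python
from sklearn.datasets import make_circles
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic data: two noisy concentric rings, which a linear boundary cannot separate.
X, y = make_circles(n_samples=200, noise=0.1, factor=0.4, random_state=0)

# Scaling lives inside the pipeline so it is re-fit on each training fold only.
model = make_pipeline(StandardScaler(), SVC(kernel="rbf"))

# Search C and gamma together; make_pipeline names the SVC step "svc".
param_grid = {
    "svc__C": [0.1, 1, 10, 100],
    "svc__gamma": [0.01, 0.1, 1, 10],
}
search = GridSearchCV(model, param_grid, cv=5)
search.fit(X, y)

print("best params:", search.best_params_)
print("best CV accuracy:", round(search.best_score_, 3))
```

C and gamma are searched as a grid rather than one at a time because they interact: a higher gamma makes the boundary more flexible, which changes how much penalty C needs to apply.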
C is the regularisation parameter that trades margin width against training error tolerance. A small C allows many margin violations (wide margin, simpler boundary, higher bias) while a large C penalises violations heavily, forcing a narrow margin that fits the training data more tightly but risks overfitting.