How does an SVM work, and what is the kernel trick?
An SVM finds the hyperplane that maximises the margin between the two nearest points of each class (the support vectors). When data is not linearly separable, the kernel trick implicitly maps inputs to a high-dimensional feature space — computing inner products there without ever materialising the transformation — enabling non-linear decision boundaries at the cost of linear-space computation.
How to think about it
SVMs are a two-part idea: margin maximisation (hard or soft) and the kernel trick. Interviewers often ask for both together, so link them cleanly.
Margin maximisation
Given labelled points (x_i, y_i) with y ∈ {-1, +1}, SVM finds weights w and bias b such that:
- All points satisfy y_i(w · x_i + b) ≥ 1 (hard margin).
- The margin width is 2 / ‖w‖ — maximising it means minimising ‖w‖².
Points that lie exactly on the margin boundaries are support vectors; they are the only training examples that determine the decision boundary. This sparsity is a practical strength: SVM inference time depends on the number of support vectors, not n.
Soft-margin SVM (the standard, via parameter C) allows some misclassifications with a penalty, trading margin width for tolerance of outliers. Low C = wide margin, more violations; high C = narrow margin, fewer violations, risk of overfitting.
The kernel trick
For non-linearly separable data, you could map x → φ(x) into a higher-dimensional space where a linear separator exists. But computing φ(x) explicitly is expensive or infinite-dimensional.
The trick: the SVM dual objective depends only on dot products x_i · x_j. Replace every dot product with a kernel function K(x_i, x_j) = φ(x_i) · φ(x_j). You never compute φ — just evaluate K.
Common kernels:
| Kernel | Formula | Use case |
|---|---|---|
| Linear | x_i · x_j | High-dim, text |
| RBF (Gaussian) | exp(-γ ‖x_i - x_j‖²) | Most tabular problems |
| Polynomial | (x_i · x_j + r)^d | Image, NLP |
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
pipe = Pipeline([
("scale", StandardScaler()), # mandatory for SVM
("svm", SVC(kernel="rbf", C=1.0, gamma="scale")),
])
pipe.fit(X_train, y_train)
gamma="scale" sets γ = 1 / (n_features · X.var()), a sensible default.