Machine Learning Hard Asked at GoogleAsked at MicrosoftAsked at AmazonAsked at Bloomberg

How does an SVM work, and what is the kernel trick?

For Data Scientist ML Engineer AI / LLM Engineer

The short answer

An SVM finds the hyperplane that maximises the margin between the two nearest points of each class (the support vectors). When data is not linearly separable, the kernel trick implicitly maps inputs to a high-dimensional feature space — computing inner products there without ever materialising the transformation — enabling non-linear decision boundaries at the cost of linear-space computation.

How to think about it

SVMs are a two-part idea: margin maximisation (hard or soft) and the kernel trick. Interviewers often ask for both together, so link them cleanly.

Margin maximisation

Given labelled points (x_i, y_i) with y ∈ {-1, +1}, SVM finds weights w and bias b such that:

All points satisfy y_i(w · x_i + b) ≥ 1 (hard margin).
The margin width is 2 / ‖w‖ — maximising it means minimising ‖w‖².

Points that lie exactly on the margin boundaries are support vectors; they are the only training examples that determine the decision boundary. This sparsity is a practical strength: SVM inference time depends on the number of support vectors, not n.

Soft-margin SVM (the standard, via parameter C) allows some misclassifications with a penalty, trading margin width for tolerance of outliers. Low C = wide margin, more violations; high C = narrow margin, fewer violations, risk of overfitting.

The kernel trick

For non-linearly separable data, you could map x → φ(x) into a higher-dimensional space where a linear separator exists. But computing φ(x) explicitly is expensive or infinite-dimensional.

The trick: the SVM dual objective depends only on dot products x_i · x_j. Replace every dot product with a kernel function K(x_i, x_j) = φ(x_i) · φ(x_j). You never compute φ — just evaluate K.

Common kernels:

Kernel	Formula	Use case
Linear	x_i · x_j	High-dim, text
RBF (Gaussian)	exp(-γ ‖x_i - x_j‖²)	Most tabular problems
Polynomial	(x_i · x_j + r)^d	Image, NLP

from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline

pipe = Pipeline([
    ("scale", StandardScaler()),          # mandatory for SVM
    ("svm", SVC(kernel="rbf", C=1.0, gamma="scale")),
])
pipe.fit(X_train, y_train)

gamma="scale" sets γ = 1 / (n_features · X.var()), a sensible default.

How does an SVM work, and what is the kernel trick?

Margin maximisation

The kernel trick

Keep practising

Explore further