datarekha

What is the kernel trick in SVM, and why does it work?

The short answer

The kernel trick lets an SVM learn a nonlinear decision boundary by implicitly mapping data into a higher-dimensional space where the classes can become linearly separable, without ever computing that mapping explicitly. It works because the SVM's dual formulation depends on the data only through dot products between points, and a kernel function returns the dot product the mapped points would have, computed directly from the original inputs. Common kernels are linear, polynomial, and RBF.

How to think about it

The crisp answer

The kernel trick is a way to get a nonlinear SVM boundary by implicitly mapping inputs into a higher-dimensional feature space where the classes may be linearly separable, without ever materializing that space. In practice, you replace every dot product in the SVM with a kernel function.

Why it works

The SVM’s dual optimization, and the resulting decision function, depend on the data only through dot products between pairs of points. As the kernel trick explanation by Suraj Yadav describes, a kernel K(x, z) computes what the dot product φ(x)·φ(z) would be in the mapped space, directly from the original inputs. So you get the benefit of the high-dimensional mapping at the cost of evaluating a simple function, and you avoid the blow-up of computing φ explicitly (for the RBF kernel, φ maps into an infinite-dimensional space).
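A small numeric check can make this concrete. The sketch below assumes 2-D inputs and the homogeneous degree-2 polynomial kernel K(x, z) = (x·z)², for which one explicit feature map is φ(x) = (x₁², √2·x₁x₂, x₂²). The function names and sample points are illustrative choices, not part of any library.

```python
import math

def dot(u, v):
    """Plain dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def phi(x):
    """Explicit degree-2 feature map for a 2-D input (illustrative choice)."""
    x1, x2 = x
    return (x1 * x1, math.sqrt(2) * x1 * x2, x2 * x2)

def poly2_kernel(x, z):
    """Homogeneous degree-2 polynomial kernel: (x . z)^2."""
    return dot(x, z) ** 2

# Arbitrary sample points, chosen only for the demonstration.
x = (1.0, 2.0)
z = (3.0, 4.0)

explicit = dot(phi(x), phi(z))  # map first, then take the dot product
implicit = poly2_kernel(x, z)   # kernel evaluated on the original inputs

print(explicit, implicit)
```

Both values agree up to floating-point rounding (here x·z = 11, so the kernel value is 121), even though the kernel never builds the 3-D vectors. For higher degrees or the RBF kernel the explicit map grows quickly or becomes infinite, while the kernel evaluation stays cheap.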

The key idea in words

Instead of “map then take dot product,” you “take the kernel,” which equals the dot product in feature space. For a kernel to correspond to a valid inner product, it must be symmetric and positive semi-definite (Mercer’s condition): the matrix of kernel values over any finite set of points must have no negative eigenvalues.
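One way to build intuition for Mercer's condition is to spot-check it numerically. The sketch below builds the kernel (Gram) matrix on a handful of made-up points and evaluates the quadratic form cᵀKc for random vectors c. This is only a sanity check on one point set, not a proof. The function `neg_sq_dist` is a deliberately invalid "kernel" added here for contrast.

```python
import math
import random

def rbf(x, z, gamma=0.5):
    """RBF kernel exp(-gamma * ||x - z||^2); gamma=0.5 is an arbitrary choice."""
    sq = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * sq)

def neg_sq_dist(x, z):
    """Negative squared distance: symmetric, but NOT a valid kernel."""
    return -sum((a - b) ** 2 for a, b in zip(x, z))

def gram(kernel, points):
    """Matrix of kernel values between every pair of points."""
    return [[kernel(p, q) for q in points] for p in points]

def is_symmetric(K, tol=1e-12):
    n = len(K)
    return all(abs(K[i][j] - K[j][i]) <= tol for i in range(n) for j in range(n))

def quad_form(K, c):
    """c^T K c; must be >= 0 for every c if K is positive semi-definite."""
    n = len(K)
    return sum(c[i] * K[i][j] * c[j] for i in range(n) for j in range(n))

# Made-up sample points and a fixed seed so the check is repeatable.
points = [(0.0, 0.0), (1.0, 0.5), (-1.0, 2.0), (2.0, -1.0)]
rng = random.Random(0)
trials = [[rng.uniform(-1, 1) for _ in points] for _ in range(1000)]

K_rbf = gram(rbf, points)
rbf_min = min(quad_form(K_rbf, c) for c in trials)  # stays non-negative

K_bad = gram(neg_sq_dist, points)
bad_value = quad_form(K_bad, [1.0] * len(points))   # negative: PSD is violated

print(is_symmetric(K_rbf), rbf_min >= 0, bad_value < 0)
```

The invalid function fails immediately: its Gram matrix has a zero diagonal and negative off-diagonal entries, so the all-ones vector gives a negative quadratic form and no feature map φ can reproduce it as an inner product.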

Common kernels

  • Linear: no mapping, just the ordinary dot product; often the best choice for high-dimensional sparse data like text.
  • Polynomial: captures feature interactions up to a chosen degree.
  • RBF (Gaussian): corresponds to an infinite-dimensional feature space; a flexible default for nonlinear data, controlled by gamma.
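For reference, the three kernels listed above are commonly written as follows. The parameter names γ (scale), r (constant offset) and d (degree) follow one widespread convention and are an assumption here; other texts parameterize the polynomial and RBF kernels slightly differently.

```latex
\begin{aligned}
K_{\text{linear}}(x, z) &= x \cdot z \\
K_{\text{poly}}(x, z)   &= (\gamma \, x \cdot z + r)^{d} \\
K_{\text{RBF}}(x, z)    &= \exp\!\left(-\gamma \, \lVert x - z \rVert^{2}\right)
\end{aligned}
```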

The common trap

Reaching for RBF by default. On high-dimensional data a linear SVM is often the right starting point and sometimes the final model. You must also scale features, since both the margin and the kernel depend on distances. Expected follow-up: “What do C and gamma do?” C trades off margin width against training misclassification (high C = narrower margin, fewer training errors tolerated); gamma sets how far a single training point’s influence reaches (high gamma = short reach, so a wiggly, overfit boundary).
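The two distance-related points above (gamma and feature scaling) can be illustrated with a few lines of arithmetic. This is a minimal sketch assuming the RBF form exp(-gamma · ‖x − z‖²); the (age, income) records and the per-feature scales of 12 years and 20,000 dollars are hypothetical values invented for the example.

```python
import math

def rbf(x, z, gamma):
    """RBF kernel exp(-gamma * ||x - z||^2)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * sq_dist)

# 1) gamma sets the reach: similarity of two points at distance 1.
origin = (0.0, 0.0)
neighbour = (1.0, 0.0)
low_gamma_sim = rbf(origin, neighbour, gamma=0.1)    # still strongly similar
high_gamma_sim = rbf(origin, neighbour, gamma=10.0)  # almost no influence

# 2) scaling: which feature drives the squared distance?
def income_share(p, q, scales=(1.0, 1.0)):
    """Fraction of the squared distance between p and q that comes from income."""
    age_term = ((p[0] - q[0]) / scales[0]) ** 2
    income_term = ((p[1] - q[1]) / scales[1]) ** 2
    return income_term / (age_term + income_term)

# Hypothetical (age in years, income in dollars) records:
# very different ages, nearly identical incomes.
young = (25.0, 50_000.0)
old = (60.0, 50_500.0)

raw_share = income_share(young, old)
# Hypothetical per-feature standard deviations used as scales.
scaled_share = income_share(young, old, scales=(12.0, 20_000.0))

print(low_gamma_sim, high_gamma_sim)
print(raw_share, scaled_share)
```

With a high gamma, even a neighbour at distance 1 contributes almost nothing, which is why the boundary can wrap tightly around individual training points. Without scaling, the small income gap accounts for over 99% of the squared distance simply because income is measured in larger units; after scaling, the large age gap dominates as expected.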
