datarekha
Deep Learning · Easy · Asked at NVIDIA, Google, Meta, Amazon

Why are GPUs used for deep learning instead of CPUs?

The short answer

Neural network training is dominated by large matrix multiplications that are embarrassingly parallel. GPUs have thousands of small cores optimised for this exact operation, whereas CPUs have tens of powerful cores optimised for low-latency sequential logic. The throughput difference is 10–100x for typical DL workloads.

How to think about it

The core operation in a neural network layer is a matrix multiplication: Y = X @ W. Multiplying a batch of 256 examples (each with 1024 features) against a weight matrix of shape (1024, 2048) requires 256 × 1024 × 2048 ≈ 537 million floating-point multiply-accumulates. Every element of the output depends only on one row of X and one column of W, so all 256 × 2048 output elements can be computed independently. This is what makes the operation embarrassingly parallel.
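
To make the independence claim concrete, here is a minimal pure-Python sketch (not how a GPU kernel is written). It fills in the output of a tiny matrix product in a shuffled order and still gets the right answer, because no output element reads any other. The helper name `output_element` and the small matrices are illustrative assumptions; only the 256 × 1024 × 2048 shape comes from the text.

```python
import random

def output_element(X, W, i, j):
    """Dot product of row i of X with column j of W; reads no other output."""
    return sum(X[i][k] * W[k][j] for k in range(len(W)))

# Tiny illustrative matrices: 2 examples with 3 features, 3 inputs -> 2 outputs
X = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
W = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

# Visit the output positions in a random order, as parallel workers might
coords = [(i, j) for i in range(len(X)) for j in range(len(W[0]))]
random.shuffle(coords)

Y = [[0.0] * len(W[0]) for _ in X]
for i, j in coords:
    Y[i][j] = output_element(X, W, i, j)

print(Y)  # [[4.0, 5.0], [10.0, 11.0]] regardless of the visiting order

# Multiply-accumulate count for the shapes used in the text
macs = 256 * 1024 * 2048
print(macs)  # 536870912
```

On a GPU, each of those loop iterations can in principle be handed to a different thread, which is why thousands of simple cores are useful here.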

CPU vs GPU architecture

                    CPU (e.g. AMD EPYC)            GPU (e.g. NVIDIA A100)
Cores               64–192 powerful cores          6912 CUDA cores + 432 tensor cores
Optimised for       Low-latency sequential code    High-throughput parallel computation
Memory bandwidth    ~300 GB/s                      ~2000 GB/s
FP16 throughput     ~1 TFLOPS                      ~312 TFLOPS (tensor cores)
Cache hierarchy     Deep, multi-level L1/L2/L3     Shared memory per SM block
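
As a rough back-of-the-envelope check, the sketch below plugs the FP16 peak figures from the table into the (256, 1024) @ (1024, 2048) example from earlier. It assumes both chips run at their peak rate, which real workloads rarely reach, so treat the result as an upper bound on the gap and not as a benchmark.

```python
# Shapes from the matrix multiplication example: (256, 1024) @ (1024, 2048)
batch, d_in, d_out = 256, 1024, 2048
macs = batch * d_in * d_out
flops = 2 * macs  # each multiply-accumulate counts as one multiply plus one add

# Peak FP16 rates from the table, in FLOP/s (approximate vendor-style peaks)
peak_flops = {"CPU": 1e12, "GPU": 312e12}

# Ideal time if the chip sustained its peak rate for the whole operation
times = {name: flops / rate for name, rate in peak_flops.items()}
for name, seconds in times.items():
    print(f"{name}: {seconds * 1e6:.1f} microseconds at peak")

speedup = times["CPU"] / times["GPU"]
print(f"Peak-to-peak ratio: {speedup:.0f}x")
```

At peak this works out to roughly 1.07 ms on the CPU against roughly 3.4 µs on the GPU. Achieved throughput on real workloads is well below peak, which is consistent with the 10–100x range quoted in the short answer.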

Tensor cores

Modern NVIDIA GPUs include tensor cores: specialised functional units that perform a small matrix multiply-accumulate as a single hardware operation (a 4x4 tile per clock cycle on the original Volta design, exposed to CUDA programmers as larger tiles such as 16x16). They take float16 or bfloat16 inputs and accumulate in float32. This is why mixed precision training delivers such a large speedup: it unlocks tensor core throughput.
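
Why accumulate in float32 when the inputs are float16? The following is a standard-library simulation, not real tensor core code. It rounds the running sum of a dot product to either half or single precision after every addition. The helper names (`to_fp16`, `to_fp32`, `dot`) and the test vectors are assumptions made for illustration.

```python
import struct

def to_fp16(x: float) -> float:
    """Round to the nearest IEEE 754 half-precision value."""
    return struct.unpack("e", struct.pack("e", x))[0]

def to_fp32(x: float) -> float:
    """Round to the nearest IEEE 754 single-precision value."""
    return struct.unpack("f", struct.pack("f", x))[0]

def dot(a, b, round_accumulator):
    """Dot product with fp16 inputs; the running sum is rounded after every add."""
    acc = 0.0
    for x, y in zip(a, b):
        acc = round_accumulator(acc + to_fp16(x) * to_fp16(y))
    return acc

n = 4096
a = [0.01] * n  # many small contributions
b = [1.0] * n

fp16_sum = dot(a, b, to_fp16)  # accumulator kept in half precision
fp32_sum = dot(a, b, to_fp32)  # accumulator kept in single precision
print(fp16_sum, fp32_sum)
```

In this simulation the half-precision accumulator stalls at 32, because from that point each 0.01 increment is smaller than half the spacing between representable values, while the single-precision accumulator reaches about 40.97. Keeping the inputs narrow and the accumulator wide gives most of the speed without this loss.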

Memory bandwidth is the bottleneck

On modern hardware, training is often bandwidth-bound, not compute-bound: the GPU can perform FLOPs faster than it can fetch weights and activations from HBM (its high-bandwidth memory). This is why:

  • Larger batch sizes improve utilisation (amortise memory loads over more compute).
  • FlashAttention and similar fused kernels reduce the number of HBM round-trips.
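
The batch-size point can be sketched with a simple roofline-style estimate. The code below computes the arithmetic intensity (FLOPs per byte moved) of a (batch, 1024) @ (1024, 2048) float16 matmul and compares it with the ratio of the A100 figures in the table above. It assumes the idealised case where each matrix crosses HBM exactly once; the function name and that traffic model are illustrative assumptions.

```python
def arithmetic_intensity(batch, d_in=1024, d_out=2048, bytes_per_value=2):
    """FLOPs per byte for X @ W, assuming X and W are read once and Y written once."""
    flops = 2 * batch * d_in * d_out
    values_moved = batch * d_in + d_in * d_out + batch * d_out
    return flops / (values_moved * bytes_per_value)

# Table figures: ~312 TFLOPS of FP16 compute over ~2000 GB/s of bandwidth
machine_balance = 312e12 / 2000e9  # FLOPs the GPU can do per byte it can load

for batch in (1, 64, 256):
    ai = arithmetic_intensity(batch)
    regime = "compute-bound" if ai > machine_balance else "bandwidth-bound"
    print(f"batch={batch:>3}: {ai:6.1f} FLOPs/byte -> {regime}")
```

Under these assumptions a batch of 1 does about one FLOP per byte, far below the roughly 156 FLOPs per byte the GPU needs to stay busy, while a batch of 256 crosses that threshold. The weight matrix is loaded once either way, so a larger batch spreads that cost over more compute.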

When is a CPU better?

  • Inference on small models with tiny batch sizes (latency-sensitive edge deployments).
  • Gradient boosted trees and classical ML (irregular memory access patterns).
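  • Data preprocessing pipelines that feed the GPU.

In PyTorch, selecting the device and moving the model and its inputs onto it looks like this:

import torch
import torch.nn as nn

# Use the GPU when one is available, otherwise fall back to the CPU
device = "cuda" if torch.cuda.is_available() else "cpu"
model = nn.Linear(1024, 2048).to(device)  # stand-in for your own MyModel()

# Tensors must be on the same device as the model
x = torch.randn(256, 1024).to(device)
y = model(x)  # output shape: (256, 2048)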
