Why is NumPy significantly faster than Python for-loops for numerical computation, and what is vectorization?
NumPy operations execute compiled C code over contiguous memory blocks in a single call, while a Python loop incurs interpreter overhead and dynamic type checks on every element. Vectorization means expressing an operation over an entire array at once so the hot path never re-enters the Python interpreter.
How to think about it
The mental model to lead with
In CPython, every object, even a plain integer, is a heap-allocated struct holding a reference count, a type pointer, and then the actual value. A Python list is an array of pointers to those structs. A NumPy array is a raw block of C doubles (or ints) packed directly in memory, with no per-element Python overhead.
When NumPy does a + b, it checks types and shapes once, then dispatches to a compiled C kernel that walks through raw memory, often using SIMD instructions. When a Python loop does the same thing, it repeats the type lookup and operator dispatch for every single element.
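The layout difference can be made visible with a rough size comparison. This is a minimal sketch, assuming CPython on a 64-bit platform; the variable names and the element count are illustrative, and the exact per-object sizes vary by interpreter version.

```python
import sys

import numpy as np

n = 1_000  # illustrative element count, not from the text above

values = [float(i) for i in range(n)]     # list of boxed Python floats
arr = np.array(values, dtype=np.float64)  # one packed block of C doubles

# sys.getsizeof(list) covers the list header and its pointer array only,
# so the boxed float objects have to be counted separately.
list_bytes = sys.getsizeof(values)
boxed_bytes = sum(sys.getsizeof(v) for v in values)

print(f"list (pointers only): {list_bytes} bytes")
print(f"boxed float objects:  {boxed_bytes} bytes")
print(f"NumPy data buffer:    {arr.nbytes} bytes ({arr.itemsize} per element)")
print(f"C-contiguous:         {arr.flags['C_CONTIGUOUS']}")
```

The list pays for a pointer plus a full object per element, while the array pays only for the raw value.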
The cost of a Python loop
Every iteration:
- Fetches the next boxed Python object (reference count update, type lookup)
- Calls the operator through the interpreter bytecode
- Allocates a new Python object for the result
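For 10 million numbers, that is on the order of 10 million type checks and 10 million object allocations. CPython softens the allocation cost somewhat by caching small integers and reusing freed float objects, but the per-element interpreter work remains.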
See the speedup yourself
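A small benchmark along these lines should show the gap on most machines. It is a sketch rather than a rigorous measurement: the array size, the helper names `loop_add` and `vectorized_add`, and the repeat count are all assumptions, and the ratio you see depends on hardware and on the Python and NumPy versions.

```python
import timeit

import numpy as np

n = 1_000_000  # assumed size; large enough for the gap to be visible

rng = np.random.default_rng(0)
a = rng.random(n)
b = rng.random(n)
a_list = a.tolist()  # the same values as boxed Python floats
b_list = b.tolist()


def loop_add(xs, ys):
    """Element-wise add in a Python loop: one interpreter round trip per element."""
    out = []
    for x, y in zip(xs, ys):
        out.append(x + y)
    return out


def vectorized_add(xs, ys):
    """Element-wise add as a single NumPy call."""
    return xs + ys


# Take the best of a few runs to reduce noise from other processes.
loop_time = min(timeit.repeat(lambda: loop_add(a_list, b_list), number=1, repeat=3))
vec_time = min(timeit.repeat(lambda: vectorized_add(a, b), number=1, repeat=3))

print(f"Python loop: {loop_time:.4f} s")
print(f"NumPy:       {vec_time:.4f} s")
print(f"ratio:       about {loop_time / vec_time:.0f}x")
```

Both functions compute the same values, so the difference in time comes from where the per-element work happens, not from what is computed.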
Vectorization in practice
Instead of looping to apply an operation row by row, express it over the whole array:
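import numpy as np

matrix = np.arange(12).reshape(3, 4)  # example input; any 2-D array works

# Avoid: a Python-level loop over rows re-enters the interpreter on every iteration
result = []
for row in matrix:
    result.append(row.sum())

# Use: one call, and the per-row reduction stays in C
row_sums = matrix.sum(axis=1)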
Broadcasting rules let NumPy apply scalars and lower-dimensional arrays across higher-dimensional arrays without first materializing an expanded copy of the smaller operand, compounding the advantage.
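As an illustration of broadcasting, the sketch below centers each column of a small matrix by subtracting a 1-D array of column means. The matrix contents and variable names are made up for the example; `np.broadcast_to` is used only to show that the stretched operand is a view with a zero stride, not a copy.

```python
import numpy as np

matrix = np.arange(12, dtype=np.float64).reshape(3, 4)  # example data

col_means = matrix.mean(axis=0)  # shape (4,)
centered = matrix - col_means    # (3, 4) - (4,): col_means is broadcast across rows
doubled = matrix * 2.0           # a scalar broadcasts to every element

# Inspect what broadcasting does conceptually: a read-only view whose
# stride along the repeated axis is 0, so no data is duplicated.
stretched = np.broadcast_to(col_means, matrix.shape)

print(centered)
print("strides of broadcast view:", stretched.strides)
print("shares memory with col_means:", np.shares_memory(stretched, col_means))
```

The result `centered` is a new array, as with any arithmetic, but the smaller operand is never expanded to the full shape in memory.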