datarekha

Vectorization vs Loops

Why two O(n) algorithms can differ by 50-100x in wall time — and how to push loops into C where NumPy can run them in a single tight pass over typed memory.

7 min read Intermediate Data Structures & Algorithms Lesson 30 of 32

What you'll learn

  • Why O(n) Python loops and O(n) NumPy calls have the same complexity but wildly different constants
  • What vectorization means and how NumPy pushes loops into compiled C over contiguous typed memory
  • How broadcasting eliminates explicit loops without changing the algorithm's logical structure
  • Why df.apply and iterrows are slow: every row is a Python-level function call

Before you start

Big-O tells you how work grows as data grows. It does not tell you how fast a single step is. Two algorithms can be identically O(n) and differ by a factor of 100 in real time — and this gap is the dominant performance story in numeric Python.

The gap Big-O cannot see

Consider summing a list of one million numbers.

# Option A — Python loop
total = 0
for x in data:
    total += x

# Option B — NumPy
import numpy as np
total = np.sum(data)

Both visit every element once. Both are O(n). But on a modern laptop, option B finishes in roughly 1 ms while option A takes 50–100 ms.

The complexity class is the same. The constant factor is not.

What is actually happening

A Python for loop does far more than arithmetic per step:

  1. The interpreter fetches the next object reference from the list.
  2. It checks the object’s type (could be an int, a float, a string — Python does not know statically).
  3. It unboxes the numeric value from inside the Python object.
  4. It calls the __add__ method through a virtual dispatch table.
  5. It allocates a new Python int object to hold the result.

That is five or more interpreter operations per element, each involving pointer indirection into heap memory scattered across many cache lines.

NumPy avoids every one of those steps. An ndarray stores values as raw typed bytes — a block of 64-bit floats packed consecutively in memory, no boxing, no headers per element. np.sum hands that block to a single C function that iterates with a tight inner loop the CPU can pipeline and, on most hardware, execute via SIMD (Single Instruction, Multiple Data) — one instruction that adds four or eight floats simultaneously.

The difference is not algorithmic. It is architectural: one loop lives in the Python interpreter; the other lives in a compiled C extension with full hardware access.

Contiguous memory and the cache

Modern CPUs are fast at reading sequential memory and slow at chasing pointers. A Python list is an array of pointers to objects scattered across the heap. Summing it means the CPU must fetch each object from a potentially cold cache line.

A NumPy array is a single contiguous allocation. The hardware prefetcher recognizes the sequential access pattern and loads the next cache lines before they are needed. By the time the loop reaches element k+1, it is already in L1 cache. This is called cache-friendly access, and it amplifies the SIMD benefit further.

Broadcasting: eliminating loops in the calling code

Vectorization — expressing an operation over an entire array at once so a compiled C loop handles the iteration, not Python — is not just about replacing sum. It is a mindset: express operations on whole arrays instead of writing element-by-element logic.

import numpy as np

temps_c = np.array([0.0, 20.0, 37.0, 100.0])

# Slow: explicit Python loop
temps_f_loop = []
for t in temps_c:
    temps_f_loop.append(t * 9/5 + 32)

# Fast: vectorized, no Python loop at all
temps_f_vec = temps_c * 9/5 + 32

The second form uses broadcasting: NumPy applies the scalar 9/5 and 32 across the entire array inside C, with no Python loop in sight. The expression reads like scalar math but executes like a compiled batch operation.

Broadcasting generalizes to arrays of compatible shapes, letting you write A + B, A * B, boolean masks, and slicing without loops — each compiles down to the same tight C iteration.

The pandas trap: apply and iterrows

The same logic explains why df.apply(func, axis=1) and df.iterrows() are slow even on DataFrames backed by NumPy.

iterrows wraps each row in a new pd.Series object — Python allocation and boxing per row. apply calls your Python function once per row through the interpreter. A DataFrame with 500,000 rows means 500,000 Python function calls, 500,000 object constructions, 500,000 interpreter round-trips.

The fix is to express the operation on whole columns, which pandas delegates to NumPy vectorized kernels:

# Slow — Python function called once per row
df["result"] = df.apply(lambda row: row["a"] * 2 + row["b"], axis=1)

# Fast — vectorized over columns, no Python loop
df["result"] = df["a"] * 2 + df["b"]

Same O(n). The constants differ by the same 50–100x factor.

See the difference live

Run this. It builds a large array, times a Python loop sum against np.sum, then times an element-wise transform loop against its vectorized form, and prints the speedups.

Watch the speedup column. Both benchmarks are O(n). The gap is constant factors: Python interpreter overhead vs one C loop over contiguous typed memory.

Tying it back to Big-O

Big-O complexity tells you how an algorithm scales. Constants tell you how fast it runs at any given size. The two questions are independent:

  • An O(n log n) algorithm with small constants can outperform an O(n) algorithm with large constants up to very large n.
  • As n grows toward infinity, complexity wins — but in data science, n rarely reaches “infinity.” You are working with gigabytes, not petabytes, and constants dominate.

The practical workflow is:

  1. Start with correct complexity (avoid the O(n²) trap from the Big-O lesson).
  2. Profile to find the hot path.
  3. Replace Python loops with vectorized NumPy or pandas column operations.

Step 3 routinely delivers 50–100x improvements with no change to algorithmic complexity.

Quick check

Quick check

0/3
Q1A Python for-loop sum and np.sum over the same array are both O(n). Why is np.sum 50-100x faster in practice?
Q2Which of the following is the fastest way to compute a new column result = a * 2 + b for every row in a large pandas DataFrame?
Q3You profile a data pipeline and find that replacing a list comprehension with a vectorized NumPy expression cuts runtime from 8 s to 0.1 s. What did you change?

Practice this in an interview

All questions
Why is NumPy significantly faster than Python for-loops for numerical computation, and what is vectorization?

NumPy operations execute compiled C code over contiguous memory blocks in a single call, while a Python loop incurs interpreter overhead and dynamic type checks on every element. Vectorization means expressing an operation over an entire array at once so the hot path never re-enters the Python interpreter.

Why is pandas slow, and what are the main strategies to speed it up?

pandas is slow primarily because Python loops bypass NumPy's vectorized C kernels, object-dtype columns prevent SIMD optimizations, and keeping entire datasets in memory limits scalability. The fixes are vectorization, categorical encoding, eval/query for large frames, chunking for out-of-core data, and switching to Polars or DuckDB for compute-heavy pipelines.

When would you use a Python list versus a NumPy array, and what are the performance trade-offs?

Python lists are heterogeneous, pointer-based, and general-purpose. NumPy arrays are homogeneous, stored as contiguous typed memory, and support vectorised operations that run at C speed. For numerical work on more than a few hundred elements, NumPy is almost always faster and more memory-efficient.

When should you use apply, map, or applymap versus vectorized pandas operations, and what are the performance implications?

Vectorized pandas and NumPy operations operate on entire arrays in compiled C/Fortran code and should always be your first choice. apply runs a Python function row- or column-wise in a Python loop, map transforms a single Series element-by-element, and applymap (DataFrame.map in pandas 2.1+) applies a function to every scalar — all three are orders of magnitude slower than vectorized equivalents.

Sign in to track your progress

Completed lessons, your XP, level, and streak save to your account — it's free and takes a few seconds.

Explore further

Related lessons

Skip to content