datarekha
Patterns June 2, 2026

Why Python is slow — and the times it actually matters

A tight Python loop over a million numbers can be 100x slower than C, but that rarely matters — until it suddenly, catastrophically does.

9 min read · by datarekha · python, performance, numpy, concurrency

Somewhere around 2015, a benchmark showed that a tight Python loop computing the sum of a million integers took roughly 80 milliseconds. The equivalent C loop took under a millisecond. The slide went viral. A generation of programmers absorbed the lesson that Python is slow, and then went right back to using it for everything — because the benchmark was technically correct and practically irrelevant.

Here is the honest version of the story.

The three reasons Python is genuinely slower

Python’s speed problem is not one thing. It is three things stacked on top of each other, and understanding the distinction matters because the fixes are different.

The interpreter tax. CPython, the standard Python implementation, does not compile your source to native machine code. It compiles to bytecode (a set of higher-level virtual instructions) and then interprets that bytecode at runtime. Every time your loop increments a counter, the interpreter decodes an instruction, dispatches it, and pushes and pops values on a small operand stack. That dispatch loop has overhead measured in tens of nanoseconds. Multiply by a hundred million iterations and the tax becomes visible.
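To make the bytecode concrete, here is a minimal sketch that uses the standard library's dis module to list the instructions CPython generates for a simple accumulation loop. The function name total is made up for illustration, and the exact opcode names and counts vary between CPython versions.

```python
import dis

def total(values):
    # A plain accumulation loop: the kind of code the interpreter tax applies to.
    acc = 0
    for v in values:
        acc += v
    return acc

# Every instruction listed here is decoded and dispatched by the interpreter
# each time it executes; the ones inside the loop run once per element.
instructions = list(dis.get_instructions(total))
for ins in instructions:
    print(ins.offset, ins.opname, ins.argrepr)
```

The loop body is only a few instructions long, but each one goes through the dispatch machinery on every iteration, which is where the per-element cost comes from.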

Dynamic typing overhead. In C, the integer 42 is four bytes in a register. In Python, 42 is a PyObject — a heap-allocated struct carrying a reference count, a pointer to a type descriptor, and then the value. Arithmetic on two Python integers is not a single ADD instruction. It is: look up both objects, check their types, find the correct implementation of __add__, call it, allocate a new PyObject for the result, and return. Type dispatch on every operation.
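A small sketch of what that means in practice, using only the standard library. The object size printed depends on the CPython version and platform, so treat the number as indicative rather than fixed.

```python
import sys

x = 42

# On CPython the int is a full heap object, so it is much larger than the
# 4 bytes a C int needs. The exact size varies by version and platform.
print(sys.getsizeof(x), "bytes for the int object 42")

# The object carries a pointer to its type, and arithmetic is resolved
# through that type at runtime.
print(type(x))
result = type(x).__add__(x, 1)   # roughly what `x + 1` resolves to for two ints
print(result)

# The same `+` operator dispatches to a different implementation for str.
print(type("4").__add__("4", "2"))
```

The point is not the specific byte count but that the interpreter has to discover, on every operation, which implementation of the operator applies to the operands it was handed.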

The GIL. The Global Interpreter Lock (a mutex — a mutual exclusion lock — that allows only one thread to execute Python bytecode at a time) was designed to make CPython’s memory management safe without per-object locking. It works beautifully for that. It also means that spawning ten Python threads on a ten-core machine does not give you ten times the CPU throughput for CPU-bound work. You get one core’s worth of Python, with threading overhead on top.
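The effect is easy to observe. The sketch below runs the same pure-Python countdown four times sequentially and then in four threads. It assumes a standard GIL-enabled CPython build; the helper names and the workload size are arbitrary choices for illustration, and actual timings depend on the machine.

```python
import threading
import time

def count_down(n):
    # Pure-Python, CPU-bound work: no I/O, no calls into GIL-releasing C code.
    while n > 0:
        n -= 1

def run_sequential(n, jobs):
    start = time.perf_counter()
    for _ in range(jobs):
        count_down(n)
    return time.perf_counter() - start

def run_threaded(n, jobs):
    threads = [threading.Thread(target=count_down, args=(n,)) for _ in range(jobs)]
    start = time.perf_counter()
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return time.perf_counter() - start

N, JOBS = 500_000, 4
seq = run_sequential(N, JOBS)
thr = run_threaded(N, JOBS)
print(f"sequential: {seq:.3f}s   threaded: {thr:.3f}s")
```

On a GIL-enabled interpreter the two timings typically come out close to each other, because the threads take turns executing bytecode rather than running side by side.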

Put those three together and a tight numeric loop in pure Python can be 50 to 100 times slower than the same loop in C.

Why it almost never matters

The 100x number is real but it applies to a narrow case: tight, purely CPU-bound loops over many values, written entirely in Python, with no calls into compiled code.

In practice, most Python programs spend the overwhelming majority of their wall-clock time doing one of two things.

The first is I/O. Reading from disk, waiting for a network response, querying a database. While your process is blocked waiting for bytes to arrive, the GIL is released and the interpreter overhead is irrelevant. A web server that handles five hundred requests per second is not bottlenecked on Python’s bytecode interpreter. It is bottlenecked on database queries, serialization, and network latency — and Python handles all of those just fine.

The second is C extensions doing the real work. When you call np.dot(a, b) on two arrays of a million elements, Python executes only a handful of bytecode instructions: look up np, look up dot, fetch a and b, call. The actual dot product happens inside a compiled C (or Fortran) routine that can release the GIL, use SIMD vector instructions, and finish in a tiny fraction of the time an interpreted loop would need. Python here is nothing more than a thin scripting layer orchestrating compiled code. NumPy, Pandas, PyTorch, scikit-learn, and Polars all follow this pattern. The Python you write is the glue. The work happens in C, C++, or Rust.
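The same principle shows up without any third-party library. As a rough sketch, the snippet below compares a hand-written loop against the built-in sum(), which is implemented in C inside CPython. The helper name loop_sum is invented for this example. The gap is usually smaller than with NumPy, since sum() still works on boxed Python integers, but the direction is the same.

```python
import timeit

values = list(range(1_000_000))

def loop_sum(xs):
    # Every iteration runs through the bytecode interpreter.
    total = 0
    for x in xs:
        total += x
    return total

# sum() is implemented in C, so the looping itself runs as compiled code.
assert loop_sum(values) == sum(values)

loop_time = min(timeit.repeat(lambda: loop_sum(values), number=1, repeat=3))
builtin_time = min(timeit.repeat(lambda: sum(values), number=1, repeat=3))
print(f"Python loop: {loop_time * 1e3:.1f} ms   built-in sum: {builtin_time * 1e3:.1f} ms")
```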

This is why the language that hosts the two most computationally demanding workloads of the 2020s — large-scale data processing and deep learning — is Python. Not despite its slowness, but in complete indifference to it.

Figure: Summing 1 million integers, pure Python vs NumPy. Time in ms, log scale. Pure Python loop, sum(range(1_000_000)): 80 ms. NumPy vectorized, np.sum(arr): 0.8 ms (≈100× faster).
The NumPy bar is not missing — it is the 2 px sliver at the bottom right. Both operations on 1 million integers; NumPy calls into compiled C under the hood.

The four cases where it actually does matter

If you are not doing any of the four things below, close this tab and go build something.

1. Tight pure-Python loops over large data

You have a list of a million records. You wrote a for loop. The loop body involves conditionals, string operations, or custom class instances — none of which go through NumPy. This is the benchmark scenario made real, and it is where Python’s overhead compounds.

The tell-tale sign: your profiler shows time spent inside your own Python functions, not inside library calls. The fix is almost always vectorization — restructuring the computation so it operates on arrays through a C extension rather than on individual elements through the interpreter.
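As an illustration of what that restructuring can look like, here is a minimal sketch that applies a conditional rule first element by element and then as a single array expression. The discount rule, function names, and sample prices are invented for the example; it assumes NumPy is installed.

```python
import numpy as np

def apply_discount_loop(prices, threshold, rate):
    # Element by element: one interpreted iteration per record.
    out = []
    for p in prices:
        if p > threshold:
            out.append(p * (1.0 - rate))
        else:
            out.append(p)
    return out

def apply_discount_vectorized(prices, threshold, rate):
    # The same rule on the whole array; the loop runs inside NumPy's compiled code.
    prices = np.asarray(prices, dtype=np.float64)
    return np.where(prices > threshold, prices * (1.0 - rate), prices)

prices = [10.0, 250.0, 99.5, 1200.0]
looped = apply_discount_loop(prices, threshold=100.0, rate=0.1)
vectorized = apply_discount_vectorized(prices, threshold=100.0, rate=0.1)
print(looped)
print(vectorized.tolist())
```

The conditional moves from an if statement inside the loop to a boolean mask passed to np.where, which is the usual shape of this kind of rewrite. On four elements the difference is invisible; it only pays off at scale.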

2. Hard real-time constraints

A trading system needs to react within 200 microseconds. A game physics engine updates at 120 Hz and cannot skip. A sensor fusion loop on an embedded device processes IMU (inertial measurement unit) data with a hard deadline.

In these environments, even one garbage collection pause can blow your latency budget. CPython’s reference counting is mostly predictable, but the cyclic garbage collector also runs periodically, at moments you do not choose. Python’s interpreter jitter is incompatible with hard real-time. The solution is not “optimize Python harder.” It is “don’t use Python for the latency-critical path.”

3. Massive CPU-bound concurrency

You want to spin up a hundred threads to process incoming work in parallel, and that work is CPU-bound (compression, JSON parsing, heavy computation). The GIL means these threads time-share a single core’s worth of Python execution. You get parallelism theater.

The fix is multiprocessing (separate processes, each with its own interpreter and its own GIL) or offloading the CPU-bound work to a C extension that releases the GIL. NumPy, for instance, releases the GIL during many array operations, so threading over NumPy-heavy code can get you real parallelism. Pure-Python threading over pure-Python work does not.
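A minimal sketch of the process-based route, using concurrent.futures.ProcessPoolExecutor from the standard library. The prime-counting workload and the helper names are hypothetical stand-ins for whatever CPU-bound task you actually have.

```python
from concurrent.futures import ProcessPoolExecutor

def count_primes(bounds):
    # CPU-bound pure-Python work on the half-open range [lo, hi).
    lo, hi = bounds
    count = 0
    for n in range(max(lo, 2), hi):
        if all(n % d for d in range(2, int(n ** 0.5) + 1)):
            count += 1
    return count

def split_range(limit, parts):
    # Cut [0, limit) into contiguous chunks of near-equal size.
    step = -(-limit // parts)  # ceiling division
    return [(lo, min(lo + step, limit)) for lo in range(0, limit, step)]

def count_primes_parallel(limit, workers=4):
    # Each chunk runs in its own process, with its own interpreter and GIL.
    with ProcessPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(count_primes, split_range(limit, workers)))

if __name__ == "__main__":
    # The guard matters: with the spawn or forkserver start methods,
    # worker processes re-import this module.
    print(count_primes_parallel(50_000))
```

Arguments and results cross the process boundary through pickle, so this pays off when each chunk does substantial work relative to the data it ships back and forth.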

4. Cold-start-sensitive deployment

Picture a serverless function that does 50 milliseconds of real work but spends 800 milliseconds importing pandas, scipy, and sklearn before it starts. Or a CLI tool that feels sluggish because loading its own modules takes 300 milliseconds. Python’s import system (module discovery, bytecode compilation or loading, execution of top-level code) costs time roughly in proportion to the volume of code you load.

This matters more than people admit in 2026. Serverless is billed by duration. Users notice slow CLIs. The fix is lazy imports, lighter dependency trees, or choosing a runtime with faster startup characteristics.
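One way to apply the lazy-import idea is to move an import from module level into the function that needs it. The sketch below uses small standard-library modules as stand-ins for a heavy dependency; render_report and timed_import are hypothetical helpers, not part of any library.

```python
import importlib
import time

def timed_import(name):
    # Rough wall-clock cost of importing a module. If the module is already
    # in sys.modules, this is a cached lookup and the time is near zero.
    start = time.perf_counter()
    importlib.import_module(name)
    return time.perf_counter() - start

def render_report(rows, as_csv=False):
    if not as_csv:
        return "\n".join(str(r) for r in rows)
    # Deferred import: only callers that ask for CSV output pay to load it.
    import csv
    import io
    buf = io.StringIO()
    csv.writer(buf).writerows(rows)
    return buf.getvalue()

print(f"import of decimal, first call:  {timed_import('decimal') * 1e3:.2f} ms")
print(f"import of decimal, second call: {timed_import('decimal') * 1e3:.2f} ms")
print(render_report([(1, "a"), (2, "b")]))
```

To find out which imports are worth deferring in a real program, running the interpreter with python -X importtime prints a per-module breakdown of import cost.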

The hierarchy of fixes

When Python’s speed does bite you, here are the interventions, in the order you should reach for them:

Vectorize before anything else. The gap between a Python loop and a NumPy/Polars array operation is so large that this should always be your first move. It often closes the problem entirely without any further work. Write the computation in terms of array operations; let the library’s C implementation handle the looping.

Reach for C extensions that already exist. Before writing anything custom, check whether a library already does what you need in compiled code. scipy.ndimage, Pillow, faiss, pyarrow, orjson — a huge fraction of common computations have compiled implementations that you can call from Python with a single line.

multiprocessing for CPU-bound parallelism. It has overhead (serialization across process boundaries via pickle, process startup cost), but it sidesteps the GIL entirely. For embarrassingly parallel workloads — process this batch, that batch, and that batch independently — it scales close to linearly with core count.

Cython or ctypes for hot paths you must keep in Python. Cython (a superset of Python that compiles to C) lets you annotate variables with static types and compile the result, often achieving within 2x of C for numeric code. ctypes lets you call into existing C libraries directly. Both require more engineering investment than the options above, so they live lower on the hierarchy.

PyPy as a full-process alternative. PyPy is an alternative Python interpreter with a JIT (just-in-time) compiler — one that compiles hot bytecode paths to native machine code at runtime. It can achieve 4 to 10x speedups over CPython on pure-Python workloads without any code changes. The catch: PyPy’s compatibility with the C extension ecosystem (NumPy, Pandas, the entire PyData stack) has historically been uneven. For programs that do not depend heavily on those extensions, PyPy is worth serious consideration.

The intuition you should carry

Python’s execution model was designed for clarity, not throughput. The abstraction it provides — dynamic typing, automatic memory management, a clean object model — is not free. It is paid for in cycles.

But the most important thing to understand is that Python’s slowness is compositional. When you write np.dot(a, b), you are composing Python’s expressive surface syntax with C’s execution speed. The Python part costs you essentially nothing — a handful of bytecode instructions to make a function call. The C part runs at the speed of native machine code. Almost all of modern scientific Python is built on this compositionality. The language is slow; the ecosystem it orchestrates is fast.

The question is never “is Python fast enough?” That is unanswerable without context. The question is: where is your program actually spending time? Profile first. If the time is inside library calls, you are already running C — optimize the algorithm or the data structure, not the language. If the time is inside your own Python functions, you have a vectorization or architecture problem, not a language problem.
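"Profile first" can be as simple as the following sketch, which uses cProfile and pstats from the standard library. The two workload functions are invented for the example: one is a deliberate pure-Python hot spot, the other pushes the same amount of looping into the C-implemented sum().

```python
import cProfile
import io
import pstats

def slow_python_part(n):
    # Deliberately pure-Python hot spot.
    total = 0
    for i in range(n):
        total += i * i
    return total

def fast_builtin_part(n):
    # Same number of elements, but the loop runs inside the built-in sum().
    return sum(range(n))

def workload():
    return slow_python_part(300_000) + fast_builtin_part(300_000)

profiler = cProfile.Profile()
profiler.enable()
workload()
profiler.disable()

# Sort by cumulative time so the functions that dominate appear first.
report = io.StringIO()
pstats.Stats(profiler, stream=report).sort_stats("cumulative").print_stats(5)
print(report.getvalue())
```

Reading the report is the whole exercise: time attributed to your own function names points at interpreted code, while time attributed to built-ins or library calls means the work is already happening in compiled code.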

Python’s reputation for slowness survives because the benchmark is easy to run and hard to contextualize. The tight loop over a million integers is real. It is also not what most programs do. Know the difference, and you will stop both dismissing the performance concern entirely and panicking about it inappropriately.

The language runs at the speed of the code it calls. Write code that calls the fast stuff.
