Comprehensions, generators, and the art of not building the list

There is a file sitting on a data engineering server right now. It has 47 million rows of transaction records. Some analyst wrote a one-liner to load it:

amounts = [float(line.split(",")[3]) for line in open("transactions.csv")]
total = sum(amounts)

The job was OOM-killed (it ran out of memory and the OS terminated it) at 3 AM. The fix took thirty seconds. The post-mortem took two hours.

The fix was this:

amounts = (float(line.split(",")[3]) for line in open("transactions.csv"))
total = sum(amounts)

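Two characters changed: the square brackets became parentheses. The list of 47 million floats, on the order of 1.5 GB on 64-bit CPython (about 32 bytes per item: a 24-byte float object plus an 8-byte pointer in the list), was replaced by a generator object of a few hundred bytes. That is not an exaggeration. That is what lazy evaluation means in practice.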

What a list comprehension actually does

A list comprehension is syntactic sugar for a loop that appends to a list. When Python evaluates [x * 2 for x in range(10_000_000)], it runs through all ten million iterations immediately, allocates a list object in memory, and fills it with ten million Python integer objects before returning control to the next line. The entire result lives in RAM, addressable by index, waiting.
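To make the "syntactic sugar" claim concrete, here is a minimal sketch of the loop a list comprehension stands in for. The names doubled_loop and doubled_comp are illustrative only, and the range is kept small so the example runs instantly.

```python
# Explicit loop: start with an empty list, then append each result.
doubled_loop = []
for x in range(10):
    doubled_loop.append(x * 2)

# List comprehension: same result, built eagerly in one expression.
doubled_comp = [x * 2 for x in range(10)]

print(doubled_loop == doubled_comp)  # True
print(len(doubled_comp))             # 10: every element already exists
```

Either way, all ten results exist in memory by the time the next statement runs.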

This is not a design flaw. Sometimes that is exactly what you want. If you need to sort, slice, reverse, index by position, or pass the same sequence to three different functions, a list is the right structure. The problem is when you build a list and then immediately walk it once — sum(), max(), a for loop — and throw it away. You paid the full allocation cost for a benefit you never used.

A Python list of ten million floats costs far more than the 80 MB that 8 bytes per value would suggest. On 64-bit CPython, the list’s pointer array alone takes about 80 MB (8 bytes per slot), and each slot points to a separate float object of 24 bytes (a 16-byte object header plus the 8-byte value). That is about 32 bytes per item, or roughly 320 MB, and measured resident memory for [float(i) for i in range(10_000_000)] comes in around 350–400 MB depending on platform. The generator that produces the same values holds one frame with a handful of local variables: a few hundred bytes in total, regardless of how many items the sequence contains.
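One way to check the per-item arithmetic on your own interpreter is sys.getsizeof. The sketch below assumes CPython. Exact figures vary by version and platform, and the value of n is an arbitrary choice to keep the run short.

```python
import sys

n = 100_000  # small enough to run quickly; the per-item cost does not depend on n

as_list = [float(i) for i in range(n)]
as_gen = (float(i) for i in range(n))

# getsizeof(list) covers only the pointer array, so add each float object.
list_bytes = sys.getsizeof(as_list) + sum(sys.getsizeof(x) for x in as_list)
gen_bytes = sys.getsizeof(as_gen)

print(f"list: {list_bytes / n:.1f} bytes per item, {list_bytes:,} bytes total")
print(f"generator: {gen_bytes} bytes total, whatever n is")
```

Note that getsizeof reports object sizes, not resident memory, so it slightly understates what the operating system sees.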

Lazy evaluation: the core idea

A generator does not compute its values upfront. It returns a generator object immediately — a tiny state machine that knows how to produce the next value when asked, and no more. Each call to next() on it resumes execution until it hits a yield (or a return), hands back one value, and then suspends again, freezing its local state in place.

def integers_from(n):
    while True:
        yield n
        n += 1

counter = integers_from(0)
print(next(counter))  # 0
print(next(counter))  # 1
print(next(counter))  # 2

This sequence is infinite. You cannot build it as a list. There is no list large enough. But as a generator it is perfectly representable, because at any moment you only hold the current value of n. The generator object is not a snapshot of all future values — it is a recipe for producing the next one.

Generator expressions apply the same principle to the comprehension syntax. (x * 2 for x in iterable) creates a lazy pipeline: Python will pull items from iterable one at a time, multiply by two, and hand each result to whoever is pulling from the generator. No intermediate list is ever built.
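A small experiment makes the laziness visible. In this sketch, source and pulled are hypothetical names: source records every item it produces, so you can see exactly when the generator expression pulls from it.

```python
pulled = []  # records every item the source has actually produced

def source():
    for i in range(5):
        pulled.append(i)
        yield i

doubled = (x * 2 for x in source())
print(pulled)         # []: building the expression pulled nothing

print(next(doubled))  # 0
print(pulled)         # [0]: exactly one item has been produced

print(list(doubled))  # [2, 4, 6, 8]
print(pulled)         # [0, 1, 2, 3, 4]
```

Work happens only when a consumer asks for a value, and only as much as that request needs.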

List comprehension versus generator expression: the list fills memory before your next line runs; the generator holds only enough state to produce the next item.

Where this actually matters

The memory difference is dramatic, but memory is not the only axis. There are three situations where generators are the obviously correct choice.

Large files and streams. Any time your data source is bigger than memory — or simply bigger than you want to pin in memory while other processes compete for RAM — you should be streaming. A generator over open(filename) reads one line at a time. The OS handles buffering. You never hold more than a handful of lines at once.

Pipelines. Generators compose. You can stack them:

lines = (line.strip() for line in open("log.txt"))
non_empty = (line for line in lines if line)
fields = (line.split("\t") for line in non_empty)
amounts = (float(row[3]) for row in fields if row[3] != "NULL")
total = sum(amounts)

None of these intermediate steps allocates a full collection. The whole pipeline processes one line at a time, front to back, with constant memory. Each for in a generator expression is a link in a chain that only moves when the final consumer (here, sum) pulls on it.

This is the same model as Unix pipes. cat access.log | grep 404 | awk '{print $7}' | sort | uniq -c does not materialize a list of all 404 lines before passing them to awk. Each streaming stage is a cursor that feeds the next; sort is the exception, because it has to read all of its input before it can emit anything. Python generators give you the same composability inside a single process.

Infinite or unknown-length sequences. Paginators, event streams, sensor feeds, retry loops with early exit — anywhere you do not know when the data ends. A generator lets you express “keep producing until some condition” without pre-allocating anything.
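As an illustration of the unknown-length case, here is a minimal paginator sketch. The fetch_page callable and the fake data are assumptions made up for this example; a real API client would take their place.

```python
def paginate(fetch_page):
    """Yield items page by page until fetch_page returns an empty page."""
    page = 0
    while True:
        items = fetch_page(page)
        if not items:
            return  # ends the generator; consumers see StopIteration
        yield from items
        page += 1

# Stand-in for a real API call: three pages of two items each (hypothetical).
FAKE_PAGES = [["a", "b"], ["c", "d"], ["e", "f"]]
requested = []

def fake_fetch(page):
    requested.append(page)
    return FAKE_PAGES[page] if page < len(FAKE_PAGES) else []

# Early exit: stop at the first "c" and never request the third page.
for item in paginate(fake_fetch):
    if item == "c":
        break

print(requested)  # [0, 1]
```

Because the consumer drives the loop, breaking out early means the remaining pages are never fetched.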

The itertools toolkit

The standard library’s itertools module is the natural companion to generators. It provides lazy combinators — functions that take iterables and return iterables, never building intermediate lists.

itertools.islice(gen, n) takes the first n items from any iterable without consuming the rest. itertools.chain(a, b) concatenates two iterables lazily. itertools.takewhile(pred, gen) yields items until the predicate first fails (the failing item is consumed and dropped). itertools.groupby(gen, key) groups consecutive elements that share a key. Because it works in one pass, it starts a new group whenever the key changes, so the input must already be sorted on the key if you want exactly one group per key.

import itertools

with open("events.log") as log:  # the with block closes the file afterwards
    errors = (line for line in log if "ERROR" in line)
    first_ten = itertools.islice(errors, 10)

    for line in first_ten:
        print(line, end="")

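The file is opened, but it is read only as far as the tenth error line (plus whatever the I/O buffer fetched ahead). If the file has ten billion lines and the errors appear early, that is fine; if it contains fewer than ten errors, the whole file is scanned.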
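The combinators described above behave as follows on small in-memory inputs. This is only a sketch to show their semantics; the sample data is arbitrary.

```python
import itertools

# islice: first three items of an infinite counter.
print(list(itertools.islice(itertools.count(), 3)))              # [0, 1, 2]

# chain: two iterables walked back to back.
print(list(itertools.chain([1, 2], (3, 4))))                     # [1, 2, 3, 4]

# takewhile: stops at the first failure, even though 1 appears again later.
print(list(itertools.takewhile(lambda x: x < 3, [1, 2, 5, 1])))  # [1, 2]

# groupby on unsorted input: "a" shows up as two separate groups.
unsorted_keys = [key for key, _ in itertools.groupby("aabba")]
print(unsorted_keys)                                             # ['a', 'b', 'a']

# groupby on sorted input: one group per key.
sorted_counts = {key: len(list(group))
                 for key, group in itertools.groupby(sorted("aabba"))}
print(sorted_counts)                                             # {'a': 3, 'b': 2}
```

The groupby pair is the one worth remembering: sorting first requires materializing the data, which gives up the streaming benefit.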

The hidden cost: you can only walk it once

Generators are not free. They have a real and non-trivial limitation: a generator is exhausted after one pass. Once you have consumed all its items, it is empty. Calling list() on it again returns []. There is no rewind.

nums = (x ** 2 for x in range(5))
print(list(nums))   # [0, 1, 4, 9, 16]
print(list(nums))   # []  -- already exhausted

This trips people up. If you need to compute two different aggregates over the same large dataset, you have a few options: read the source twice (two generator passes over the file), compute both in a single pass yourself, or accept the memory cost of materializing the list once and reusing it.
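The "compute both in a single pass" option can be sketched as a plain loop that updates several accumulators at once. The function name total_and_max and the sample values are illustrative, not part of any library.

```python
def total_and_max(values):
    """Return (sum, maximum) of an iterable in a single pass."""
    total = 0.0
    largest = None
    for value in values:
        total += value
        if largest is None or value > largest:
            largest = value
    return total, largest

amounts = (float(s) for s in ["12.50", "3.00", "40.25"])
print(total_and_max(amounts))  # (55.75, 40.25)
```

The generator is still consumed exactly once, and memory stays constant however many aggregates you add to the loop.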

You also lose indexing. gen[4] is a TypeError. Generators do not support random access. If you need the fifth element, you walk to it, consuming everything up to and including it:

import itertools
fifth = next(itertools.islice(gen, 4, None))

And you lose len(). A generator object has no __len__. If you need the count, you either materialize it or count while consuming:

count = sum(1 for _ in gen)

These are real trade-offs. The decision is not “generators are always better.” It is: do you need multiple passes, indexing, or the length? Use a list. Do you process once, in order, and care about memory? Use a generator.

The readability question

There is a point where chained generator expressions become opaque. A pipeline of five nested generators with complex predicates is harder to read than five clearly named list comprehensions, even if the generator version is more efficient. Code is read far more often than it is run.

The practical convention at most data shops: use generators at the boundary of large I/O (file reads, database cursors, API paginators) and list comprehensions for small in-memory transformations where the sequence will be reused or passed to multiple things. The seam is usually obvious from context.

Named generator functions — def with yield — help when the logic is complex enough to deserve a name and docstring:

def valid_transactions(path):
    """Yield parsed transaction dicts, skipping malformed rows."""
    with open(path) as f:
        for line in f:
            parts = line.strip().split(",")
            if len(parts) < 5:
                continue
            try:
                yield {"id": parts[0], "amount": float(parts[3])}
            except ValueError:
                continue

This is more readable than a generator expression that tries to embed all that logic inline. Named generators also let you unit-test the parsing logic in isolation.
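One way to make that isolation easy, offered here as a variant and not as the function above, is to have the generator accept any iterable of lines instead of a path. The name parse_transactions and the five-column sample rows are assumptions for illustration.

```python
def parse_transactions(lines):
    """Yield parsed transaction dicts from any iterable of CSV lines."""
    for line in lines:
        parts = line.strip().split(",")
        if len(parts) < 5:
            continue  # too few columns
        try:
            yield {"id": parts[0], "amount": float(parts[3])}
        except ValueError:
            continue  # amount is not a number

# A test can feed in a plain list; production code can pass an open file.
sample = [
    "t1,2024-01-01,alice,19.99,USD\n",
    "broken row\n",
    "t2,2024-01-02,bob,not-a-number,USD\n",
]
print(list(parse_transactions(sample)))
# [{'id': 't1', 'amount': 19.99}]
```

Since an open file is itself an iterable of lines, the same function works unchanged on real data.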

How Python implements generators internally

When Python compiles a function containing yield, it produces a code object flagged as a generator. Calling the function does not execute any of its body — it allocates a frame object (the execution state: local variables, the instruction pointer, the value stack) and returns it wrapped as a generator object. The frame is suspended at the very first instruction.

Each call to next() resumes the frame from where it left off, runs until the next yield, suspends the frame again, and returns the yielded value to the caller. The frame’s local variables persist across suspensions, which is how the counter n in integers_from keeps incrementing.

When the function body returns (or falls off the end), Python raises StopIteration internally, which tells any for loop or sum() call to stop consuming. The frame is then garbage-collected.
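The lifecycle described above can be observed from the outside with inspect.getgeneratorstate from the standard library. This is a small sketch, and two_values is a throwaway example function.

```python
import inspect

def two_values():
    yield "a"
    yield "b"

gen = two_values()
print(inspect.getgeneratorstate(gen))  # GEN_CREATED: body has not started

print(next(gen))                       # a
print(inspect.getgeneratorstate(gen))  # GEN_SUSPENDED: paused at a yield

print(next(gen))                       # b
try:
    next(gen)                          # body falls off the end
except StopIteration:
    print("exhausted")
print(inspect.getgeneratorstate(gen))  # GEN_CLOSED
```

A for loop does the same thing implicitly: it calls next() repeatedly and treats StopIteration as the signal to stop.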

The memory cost of a live generator is roughly the size of one frame: a reference to the code object (shared, one copy per function), a fixed-size array of local variable slots, and the frame’s evaluation stack. For simple generators this is a few hundred bytes. Functions that the generator calls run and return between yields, so their frames are gone by the time it suspends. Delegating with yield from keeps one extra frame alive per nested generator, plus whatever objects the locals reference, but the cost is still bounded by the generator’s own state, not by the length of the sequence.

Each call to next() resumes the suspended frame, runs to the next yield, and returns one value. The frame cost is constant; the sequence length is irrelevant to memory.

The broader pattern: pull vs. push

Generators are a pull model. The consumer drives the pace. Nothing is computed until someone asks. This is the opposite of the push model — callbacks, event emitters, reactive streams — where the producer decides when to emit.
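The contrast can be sketched in a few lines. Both functions below are hypothetical examples: numbers_pull is a generator, while numbers_push takes a callback and decides for itself how much to emit.

```python
def numbers_pull(limit):
    """Pull model: produces a value only when the consumer asks."""
    for i in range(limit):
        yield i

def numbers_push(limit, on_item):
    """Push model: the producer calls the consumer for every value."""
    for i in range(limit):
        on_item(i)

# Pull: the consumer stops after two items, so the producer stops too.
pulled = []
for value in numbers_pull(1_000_000):
    pulled.append(value)
    if len(pulled) == 2:
        break

# Push: the callback receives everything the producer decides to emit.
pushed = []
numbers_push(5, pushed.append)

print(pulled)  # [0, 1]
print(pushed)  # [0, 1, 2, 3, 4]
```

In the pull version, the remaining 999,998 values are never computed; in the push version, stopping early would need extra signalling between callback and producer.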

Pull composability is why generators feel so clean for data pipelines. When you write:

import csv

total = sum(
    float(row[3])
    for row in csv.reader(open("big.csv", newline=""))
    if row[3] != ""
)

Python’s sum() is the ultimate consumer. It calls next() on the outer generator. That generator calls next() on csv.reader. That reads one line from the file. The filter runs. If the field is non-empty, the float conversion runs, and sum accumulates it. Then sum calls next() again. One line at a time, from disk to result, no intermediate list anywhere in the chain.

This is not a micro-optimization. For a 2 GB CSV on a machine with 4 GB of RAM running six other processes, the list version crashes and the generator version finishes. That is a correctness difference, not a performance difference.

When you actually want the list

There is a reflex among Python programmers who have learned about generators to convert everything. Resist it. Lists are not the enemy.

You want a list when you need to pass the same sequence to multiple consumers. You want a list when you need len() for a progress bar or a percentage. You want a list when you need to index by position — results[-1], results[::2]. You want a list when the sequence is small enough that memory is not a concern and you value the clarity of “this is already computed.”

The rule of thumb: if you only iterate once, forward, and you do not need the length, a generator is appropriate. Any other access pattern tilts toward a list. The generator’s benefit is purely at the allocation boundary.

The industrial version

In production data engineering, this principle extends beyond plain Python. Spark’s RDD and DataFrame APIs are lazy: transformations like filter and select do not execute until you call an action like collect or write. Polars offers a lazy API through LazyFrame, which builds a query plan and runs it only when you call collect. SQL query planners are fundamentally lazy, since the optimizer decides what to materialize. DuckDB streams results from queries on files that are larger than RAM.

The pattern is universal because it solves a universal problem: you rarely need all the data at once. You need it processed. Processing one item at a time in a pipeline is almost always enough, and it keeps your memory footprint predictable regardless of dataset size.

Python’s generator protocol — the simple yield keyword and the iterator interface — is the ground-floor version of this idea. Understanding it intuitively makes the higher-level versions (lazy DataFrames, streaming SQL, async generators) click into place immediately, because they are all the same abstraction at different scales.

The 3 AM OOM kill was not a performance problem. It was a conceptual one: the code assumed data could be held all at once. One pair of brackets changed that assumption. Thirty seconds of fix for two hours of post-mortem is a reasonable trade. The lesson is free.