What is the difference between a generator and a list, and when should you prefer a generator?
A list materialises all values in memory at once; a generator produces values one at a time on demand, using O(1) memory regardless of the sequence length. Prefer generators for large or infinite sequences, pipelines, and any situation where you do not need random access.
How to think about it
The core trade-off
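A list is a finished container: all values are computed and stored upfront. A generator is a recipe: it computes the next value only when asked. The memory difference is dramatic: the list object for 10 million items takes ~85 MB for its pointer array alone (sys.getsizeof does not count the integer objects the list refers to), while the equivalent generator object takes ~104 bytes. The exact generator size varies by Python version but stays at a few hundred bytes at most.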
The trade-off is that generators are single-pass: once exhausted, they yield nothing more. They also do not support indexing, slicing, or len(). If you need to iterate more than once, convert to a list first or recreate the generator.
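The single-pass behaviour is easy to miss, because a second pass fails silently rather than raising an error. The following is a minimal sketch of the trap and the two workarounds; the helper name make_squares is hypothetical and only used for illustration.

```python
squares = (x * x for x in range(5))

first_pass = list(squares)    # consumes the generator
second_pass = list(squares)   # nothing left to yield, and no error raised

print(first_pass)   # [0, 1, 4, 9, 16]
print(second_pass)  # []

# Workaround 1: materialise once, then reuse the list as often as needed.
cached = [x * x for x in range(5)]
assert sum(cached) == 30 and max(cached) == 16

# Workaround 2: wrap the expression in a function and call it once per pass.
def make_squares(n):  # hypothetical helper name
    return (x * x for x in range(n))

assert sum(make_squares(5)) == 30
assert max(make_squares(5)) == 16

# Indexing is not supported at all: generators have no __getitem__.
try:
    make_squares(5)[0]
except TypeError as exc:
    index_error = str(exc)
    print(index_error)
```

Workaround 1 costs memory, workaround 2 repeats the computation; which one fits depends on whether the values are cheap to recompute.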
Syntax comparison
A list comprehension builds everything immediately. Swap [] for () and you get a lazy generator expression instead:
import sys
big_list = [x * x for x in range(10_000_000)]
print(sys.getsizeof(big_list)) # ~85 MB (the list's pointer array only, not the int objects)
big_gen = (x * x for x in range(10_000_000))
print(sys.getsizeof(big_gen)) # ~104 bytes (varies by Python version)
Generator functions use yield to suspend and resume:
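def read_chunks(filepath, size=4096):
    with open(filepath, "rb") as f:
        # f.read() returns b"" at end of file, which ends the loop
        while chunk := f.read(size):
            yield chunk

# process() is a placeholder for whatever you do with each chunk
for chunk in read_chunks("dataset.bin"):
    process(chunk) # only one chunk in memory at a time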
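Because a generator function only runs as far as the next yield, it can also describe a sequence that never ends. The sketch below assumes a hypothetical generator named naturals and uses itertools.islice from the standard library to take a bounded window from it.

```python
from itertools import islice

def naturals(start=0):
    """Yield start, start + 1, start + 2, ... without ever finishing."""
    n = start
    while True:
        yield n   # execution pauses here until the next value is requested
        n += 1

# islice takes a bounded window from the unbounded stream.
first_five = list(islice(naturals(), 5))
print(first_five)  # [0, 1, 2, 3, 4]

# The generator keeps its local state (n) between calls to next().
gen = naturals(10)
a = next(gen)
b = next(gen)
print(a, b)  # 10 11
```

Calling list(naturals()) directly would never return, so an infinite generator always needs a consumer that knows when to stop.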
Generators compose into pipelines
The real power is composability — each stage of a pipeline is a generator, the whole thing runs in constant memory, and sum (or any consumer) drives all stages:
with open("log.txt") as f:
    lines = (line.strip() for line in f)
    records = (line.split(",") for line in lines if line)
    values = (float(r[2]) for r in records)
    # consume inside the with block: the stages read lazily, so the file must still be open
    total = sum(values) # entire pipeline runs in constant memory
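To see that the stages really interleave rather than running one after another, the same pipeline can be written as generator functions that record when they run. This is a sketch under stated assumptions: the sample data, the trace list, and the function names strip_lines, parse, and third_column are all hypothetical, and io.StringIO stands in for the log file.

```python
import io

# Hypothetical sample data standing in for log.txt (note the blank second line).
sample = io.StringIO("a,b,1.5\n\nc,d,2.5\ne,f,4.0\n")

trace = []  # records the order in which the stages run

def strip_lines(source):
    for line in source:
        trace.append("strip")
        yield line.strip()

def parse(lines):
    for line in lines:
        if line:  # skip blank lines
            trace.append("parse")
            yield line.split(",")

def third_column(records):
    for r in records:
        trace.append("value")
        yield float(r[2])

values = third_column(parse(strip_lines(sample)))
assert trace == []   # nothing has run yet: building the pipeline does no work

total = sum(values)  # the consumer pulls items through one at a time
print(total)         # 8.0
print(trace[:3])     # ['strip', 'parse', 'value']
```

The first record passes through all three stages before the second line is even read, which is why only one item is in flight at any moment.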
The key insight
Every for loop, sum, list, and join drives a generator by calling next() in a loop. A generator just decides how much work to do per call. That’s the entire model — the syntax (yield) is just Python’s way of letting a function pause and resume.
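The model described above can be made concrete by writing out roughly what a for loop does with a generator. The countdown function is a hypothetical example; iter(), next(), and StopIteration are the standard iterator protocol.

```python
def countdown(n):
    while n > 0:
        yield n
        n -= 1

# Roughly what `for value in countdown(3): collected.append(value)` does:
collected = []
iterator = iter(countdown(3))   # for a generator, iter() returns the generator itself
while True:
    try:
        value = next(iterator)  # run the function body up to the next yield
    except StopIteration:       # raised when the function body returns
        break
    collected.append(value)

print(collected)                         # [3, 2, 1]
print(collected == list(countdown(3)))   # True
```

Consumers such as sum and list follow the same protocol internally, so any of them can sit at the end of a generator pipeline.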