What problem does PagedAttention solve, and what is continuous batching?
PagedAttention stores the KV cache in non-contiguous fixed-size blocks like OS virtual-memory pages, eliminating the fragmentation and over-reservation of contiguous KV allocation and enabling sharing across sequences. Continuous batching dynamically adds and removes requests from a batch at the token level instead of waiting for the whole batch to finish, sharply improving GPU utilization and throughput.
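The block-based allocation described above can be illustrated with a toy allocator. This is a minimal sketch, not the code of any real inference engine: `BlockAllocator`, `Sequence`, `BLOCK_SIZE`, and the reference-counting and copy-on-write details are illustrative assumptions, and the blocks here are only integer ids with no tensors behind them.

```python
from dataclasses import dataclass, field

BLOCK_SIZE = 16  # tokens per KV block (illustrative value, an assumption)


class BlockAllocator:
    """Hands out physical block ids from a fixed pool and ref-counts them."""

    def __init__(self, num_blocks: int) -> None:
        self.free = list(range(num_blocks))
        self.refcount = [0] * num_blocks

    def allocate(self) -> int:
        if not self.free:
            raise MemoryError("KV cache pool exhausted")
        block = self.free.pop()
        self.refcount[block] = 1
        return block

    def share(self, block: int) -> int:
        # A second sequence points at the same physical block.
        self.refcount[block] += 1
        return block

    def release(self, block: int) -> None:
        self.refcount[block] -= 1
        if self.refcount[block] == 0:
            self.free.append(block)


@dataclass
class Sequence:
    """Holds a block table: logical block index -> physical block id."""
    allocator: BlockAllocator
    block_table: list = field(default_factory=list)
    num_tokens: int = 0

    def append_token(self) -> None:
        if self.num_tokens % BLOCK_SIZE == 0:
            # The last block is full (or none exists): grab any free block.
            self.block_table.append(self.allocator.allocate())
        else:
            last = self.block_table[-1]
            if self.allocator.refcount[last] > 1:
                # Copy-on-write: a shared, partially filled block must not be
                # written in place, so switch to a private block instead.
                self.allocator.release(last)
                self.block_table[-1] = self.allocator.allocate()
        self.num_tokens += 1

    def fork(self) -> "Sequence":
        # The child shares every existing block instead of copying the cache.
        shared = [self.allocator.share(b) for b in self.block_table]
        return Sequence(self.allocator, shared, self.num_tokens)

    def free(self) -> None:
        for block in self.block_table:
            self.allocator.release(block)
        self.block_table.clear()
        self.num_tokens = 0
```

In this sketch a 40-token sequence occupies three blocks rather than a reservation sized for the maximum length, and a forked sequence consumes no new blocks until it writes into a shared, partially filled block.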
How to think about it
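Think of PagedAttention as virtual memory for the KV cache. A contiguous allocator has to reserve room for each request's maximum possible length up front, so memory is wasted on tokens that are never generated and on gaps between reservations. With paging, each sequence keeps a block table that maps its logical token positions to fixed-size physical blocks located anywhere in GPU memory, and blocks are allocated only as tokens arrive. Waste is then bounded by the unfilled part of each sequence's last block, and sequences with a common prefix can point at the same physical blocks instead of storing copies.

Think of continuous batching as scheduling per decoding iteration rather than per batch. In a static batch, short requests finish early and their slots sit idle until the longest request completes. A continuous scheduler evicts a finished request after any step and admits a waiting one in its place, which keeps the GPU busy and raises throughput.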
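The scheduling difference can be illustrated by counting decoding iterations in a toy simulation. This is a sketch under simplifying assumptions: every iteration costs the same, each running request produces exactly one token per iteration, prefill is ignored, and every request needs at least one token. The function names and the example workload are hypothetical.

```python
from collections import deque


def static_batching_steps(lengths, max_batch):
    """Iterations when each batch runs until its longest request finishes."""
    steps = 0
    for i in range(0, len(lengths), max_batch):
        steps += max(lengths[i:i + max_batch])
    return steps


def continuous_batching_steps(lengths, max_batch):
    """Iterations when finished requests are replaced after every step."""
    waiting = deque(lengths)
    running = []  # tokens still to generate for each in-flight request
    steps = 0
    while waiting or running:
        # Admit waiting requests into any free slots before this iteration.
        while waiting and len(running) < max_batch:
            running.append(waiting.popleft())
        # One decoding iteration yields one token per running request.
        running = [remaining - 1 for remaining in running]
        # Finished requests leave the batch immediately.
        running = [remaining for remaining in running if remaining > 0]
        steps += 1
    return steps


output_lengths = [2, 8, 3, 8]  # tokens to generate per request
print(static_batching_steps(output_lengths, max_batch=2))      # 16
print(continuous_batching_steps(output_lengths, max_batch=2))  # 13
```

With two slots, the static schedule spends 16 iterations because each batch waits for its 8-token request, while the continuous schedule finishes in 13 by reusing the slots freed by the short requests.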