LLM inference optimization

Most of the cost and latency people experience when using large language models has nothing to do with the model's intelligence. It comes from how inference is organized: how memory is managed, how requests are scheduled, how much redundant work the system does or avoids. These are the mechanics that determine whether a system feels responsive or sluggish, and they tend to get far less attention than model architecture even though they often matter more in practice.

This post walks through the main optimizations with concrete examples. Each one follows the same basic pattern: a request arrives, the model processes the full input (prefill) to build internal state, then generates tokens one at a time (decode) by reusing that state. The optimizations below target either the prefill cost, the per-token decode cost, or how the server juggles overlapping requests.

Example 1: Prefix KV cache reuse, when the prompt barely changes

Imagine a chat workflow where the system message and the retrieved evidence stay the same across requests but the user question changes each time. In a transformer, attention produces key and value tensors for each token position at each layer, and with prefix caching the server can hold onto those tensors from one request and reuse them for any subsequent request that shares the same prefix. The new request only needs to compute KV for the tokens that actually differ, which in a RAG setup is often just the final question.

Prefix KV cache reuse (illustration)

Shared prefix (~2,000 tokens): "You are a helpful assistant. Use only the evidence provided." + 3 retrieved passages about global emissions data.
Request A question: "What was the global temperature increase in 2024?" (~120 ms prefill)
Request B question: "Which country had the highest CO₂ emissions?" (~120 ms prefill)

Without prefix reuse, both requests compute the full KV cache from scratch, including the identical prefix.

The user-facing effect is a shorter wait before the first token appears, because the server skipped the part of the computation that was identical to what it already did for the previous request.

The catch is that this only helps when your workload actually repeats prefixes. In many RAG systems the retrieved evidence varies substantially between queries, which reduces prefix overlap and limits the benefit. Whether prefix caching helps a lot or barely at all depends almost entirely on what the traffic looks like.
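The bookkeeping behind prefix reuse can be sketched in a few lines. The sketch below is illustrative only (real servers like vLLM manage this at the block level with far more machinery): it indexes cached KV by the full token prefix up to each fixed-size block, so a request pays compute only for blocks whose prefix has not been seen before. `PrefixCache`, `compute_kv`, and `BLOCK` are made-up names for the sketch.

```python
BLOCK = 16  # tokens per cache block (illustrative granularity)

def compute_kv(tokens):
    """Placeholder for the expensive per-block KV computation."""
    return f"kv({','.join(map(str, tokens))})"

class PrefixCache:
    def __init__(self):
        # A block's identity includes its entire prefix, not just its own
        # tokens: the same tokens after a different prefix produce different KV.
        self.blocks = {}  # tuple(prefix tokens up to block end) -> cached KV

    def prefill(self, tokens):
        kv, computed = [], 0
        # Only complete blocks are cached; a partial tail block would be
        # recomputed each time in a real system.
        for i in range(0, len(tokens) // BLOCK * BLOCK, BLOCK):
            key = tuple(tokens[:i + BLOCK])
            if key not in self.blocks:
                self.blocks[key] = compute_kv(tokens[i:i + BLOCK])
                computed += 1  # cache miss: this block cost real compute
            kv.append(self.blocks[key])
        return kv, computed
```

Two requests sharing a 64-token prefix then diverge: the first computes all five blocks, the second only the one block that differs, which is exactly the behavior the RAG example above relies on.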

Example 2: Speculative decoding, when a draft autocomplete is usually right

Think of generation as autocomplete. A smaller, faster model proposes the next few tokens, and then the full model checks all of those proposals in a single forward pass rather than one at a time. That single-pass verification is the whole trick: because the target model can process the proposed sequence in parallel, checking k proposed tokens costs roughly the same as generating one token normally.

When the draft happens to match what the target would have produced anyway, the output advances through several tokens at once without paying the usual per-token cost for each of them. When the draft is wrong, the target corrects at the first mismatch, discards everything proposed after that point, and generation resumes from the corrected position.

Speculative decoding: draft, verify, accept or correct (interactive demo: each step shows the draft/verify cycle, with proposals marked as accepted, rejected, or corrected)

Whether this actually helps depends entirely on how often the draft is right. When acceptance is high you save real compute; when acceptance is low, the overhead of proposing and then throwing away bad guesses can make the system slower than if you had just run the target model by itself.

Acceptance rates depend on the domain, the sampling temperature, and how well the draft model's distribution aligns with the target's. In some regimes the match is excellent and the speedup is substantial. In others it quietly underperforms, and the only way to find out which regime you are in is to measure it on your actual workload.
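The draft/verify loop itself is simple enough to sketch. The toy below uses deterministic stand-in functions for the two models (`draft_next` and `target_next` are hypothetical names), and verifies with a Python loop where a real system would use one batched forward pass over all k positions.

```python
def speculative_step(prompt, draft_next, target_next, k=4):
    """One round of speculative decoding: propose k tokens with the
    draft model, verify them against the target, return accepted tokens."""
    # Draft phase: k cheap sequential proposals.
    proposed = []
    ctx = list(prompt)
    for _ in range(k):
        t = draft_next(ctx)
        proposed.append(t)
        ctx.append(t)

    # Verify phase: the target scores every proposed position
    # (in parallel in a real system; a loop here for clarity).
    accepted = []
    ctx = list(prompt)
    for t in proposed:
        expected = target_next(ctx)
        if t == expected:
            accepted.append(t)         # draft agreed with target: keep it
            ctx.append(t)
        else:
            accepted.append(expected)  # first mismatch: target's token wins
            break                      # everything proposed after is discarded
    return accepted
```

With a draft that agrees with the target for the first three tokens and then diverges, one round still advances four tokens: three accepted plus the target's correction, for roughly the cost of one target forward pass.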

Example 3: Continuous batching, when multiple requests overlap naturally

Consider two requests arriving at roughly the same time, one with a long prompt and one with a short one. In a static batching setup, both get grouped into a fixed batch that runs to completion, which means the short request finishes early and then sits idle, wasting its slot, until the long request is done.

Continuous batching avoids this by keeping requests alive independently and re-forming the batch at each scheduling step. When a short request finishes, its slot is immediately available for a new incoming request. The server is always running a dynamic mix of whatever work is ready, which keeps the GPU occupied instead of leaving capacity on the table.

Static batching vs continuous batching (timeline illustration)

Static batching: A (long) and B (short) each run prefill then decode in a fixed batch. B finishes early but its slot stays empty until the batch ends; no new request can fill it. ~55% GPU utilization (wasted slot after B finishes).

Continuous batching: A (long), B (short), and C (new) each run prefill then decode. When B finishes, its slot is immediately filled by C, so the GPU never idles while A is still decoding. ~90% GPU utilization (freed slot reused by C).

In static batching, B's slot sits empty after it finishes. In continuous batching, that slot is immediately given to C. The key difference: the scheduler can swap requests in and out at every step, not just at batch boundaries.

The result is better GPU utilization and usually better responsiveness under load, though tail latency can still be unpredictable. Under some workloads the average wait time drops nicely while p95 barely moves or even gets slightly worse, so it is worth measuring under realistic concurrency rather than just testing with a single isolated request and calling it done.
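The scheduler logic is easy to sketch. In this toy version (all names are illustrative), a request is just a count of tokens left to decode; each step decodes one token for every active request and immediately backfills freed slots from the waiting queue, which is the behavior that distinguishes continuous from static batching.

```python
from collections import deque

def run_schedule(requests, max_batch=2):
    """requests: list of (name, tokens_to_generate).
    Returns the batch composition at each decode step."""
    waiting = deque(requests)
    active = {}        # name -> tokens remaining
    timeline = []
    while waiting or active:
        # Backfill: freed slots are filled before the next step, not at
        # batch boundaries. This is the continuous-batching difference.
        while waiting and len(active) < max_batch:
            name, tokens = waiting.popleft()
            active[name] = tokens
        timeline.append(sorted(active))
        for name in list(active):      # one decode step per active request
            active[name] -= 1
            if active[name] == 0:
                del active[name]       # slot freed mid-batch
    return timeline
```

With A needing 4 tokens, B needing 2, and C arriving behind them, C takes over B's slot the step after B finishes instead of waiting for A, so the batch is full at every step.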

Example 4: Reducing effective prompt work, when fewer chunks are actually enough

This one has nothing to do with the model or the serving infrastructure. It is entirely about what you put in the prompt.

Long-context RAG is the clearest illustration. A naive pipeline retrieves as many chunks as it can, concatenates them all into the prompt, and essentially hopes the model will locate the relevant part during attention. The prompt grows, prefill takes proportionally longer, and the time to first token climbs with it.

Prompt size vs. latency tradeoff (interactive illustration)

At 4 retrieved chunks: prompt size ~2,400 tokens, TTFT (prefill time) ~180 ms, relevance coverage 92%. Good coverage with moderate latency; adding more chunks increases TTFT with diminishing relevance gains.

A better pipeline retrieves fewer, more targeted chunks. If those chunks contain the evidence needed to answer the question, the model produces the same quality answer from a smaller input, and the first token arrives sooner because the prefill step had less to chew through.
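One way to sketch that discipline: cap the prompt at a token budget and spend it on the highest-scoring chunks instead of concatenating everything retrieved. The scores and token counts are assumed to come from the retrieval stage; `select_chunks` is an illustrative name, not a real API.

```python
def select_chunks(chunks, budget):
    """chunks: list of (score, n_tokens, text) from the retriever.
    Greedily keeps the highest-scoring chunks that fit in the token budget."""
    chosen, used = [], 0
    # Tuples sort by score first; reverse=True puts the best chunks up front.
    for score, n_tokens, text in sorted(chunks, reverse=True):
        if used + n_tokens <= budget:
            chosen.append(text)
            used += n_tokens
    return chosen, used
```

The budget directly bounds prefill work: halving it roughly halves the prompt the model must process, which is the TTFT lever the example above describes. The quality side of the tradeoff still has to be checked against real questions.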

What makes this worth thinking about carefully is that it interacts with almost every other optimization on this page. Shorter prompts reduce prefill work directly, but they also reduce KV cache memory pressure (freeing capacity for more concurrent requests), reduce the context that attention kernels must process, and reduce the prefix length that prefix caching would need to match across requests. It is one of those changes where the second-order effects stack up.

The tension, of course, is that retrieving fewer chunks risks missing the evidence the model actually needs, which degrades answer quality. The point is less about minimizing context than about matching context to what the question actually needs, which requires the retrieval stage to be genuinely good at selecting what matters. In practice that tradeoff between recall and latency is where most RAG systems end up spending the majority of their engineering effort, because getting retrieval right is harder than it initially looks.

Example 5: Paged KV cache, when memory needs to stay reusable

When many requests are active at once, each one maintains its own KV cache during decode, and those caches end up at different lengths because requests generate different numbers of tokens before they are done.

If KV memory is allocated as one big contiguous block per request, things start to fragment as requests finish at different times and leave holes in the memory layout. New requests that need contiguous space may have to wait for compaction or eviction, which introduces scheduling overhead that has nothing to do with the actual model computation.

Memory layout: contiguous vs paged allocation (illustration)

Contiguous allocation: each request's KV cache occupies one run of adjacent blocks. As requests finish, freed blocks leave gaps, and new requests may not fit contiguously.

Paged allocation: blocks belonging to A, B, and C are interleaved freely across the pool. Blocks are reusable, free blocks return to the pool, and there is no fragmentation.

Paged KV cache sidesteps this by managing memory in fixed-size blocks that can be scattered across the address space. When a request needs more cache it grabs a free block; when it finishes, those blocks go back to the pool. No contiguous allocation, no fragmentation.

The practical effect is fewer scheduling disruptions under high concurrency. It does not change the attention computation at all, only how the system manages the memory around it. Like most infrastructure-level work, the benefit is most visible when the system is under real pressure and almost invisible in single-request benchmarks where fragmentation never had a chance to develop in the first place.
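The core of a paged allocator fits in a few lines: a pool of fixed-size block ids, a free list, and a per-request block table mapping each request to whichever physical blocks it happens to hold. This is a sketch of the idea in the spirit of vLLM's PagedAttention, not any real implementation.

```python
class BlockPool:
    def __init__(self, num_blocks):
        self.free = list(range(num_blocks))  # physical block ids
        self.tables = {}                     # request id -> list of block ids

    def grow(self, req):
        """Give the request one more KV cache block, if any is free.
        No contiguity requirement: any free block will do."""
        if not self.free:
            return None                      # caller must wait or evict
        block = self.free.pop()
        self.tables.setdefault(req, []).append(block)
        return block

    def release(self, req):
        """Request finished: all its blocks return to the pool immediately."""
        self.free.extend(self.tables.pop(req, []))
```

Because a request's blocks need not be adjacent, blocks freed by a finishing request are usable by the very next arrival, which is exactly the fragmentation-free behavior the figure above shows.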

Example 6: Efficient attention kernels, when the math is the same but the hardware work changes

Even when the mathematical operation is identical, the actual wall-clock cost of computing attention varies a lot depending on how the computation is mapped to the hardware.

Standard attention materializes large intermediate matrices (the attention scores, the softmax output) and writes them to global memory before reading them back for the next step. On modern accelerators the arithmetic is fast but the memory round-trips are expensive, so the computation ends up bottlenecked by data movement rather than by the math itself.

Standard attention vs FlashAttention data flow (illustration)

Standard attention: compute S = QKᵀ (n×n) and store it in HBM, apply softmax to get P = softmax(S) and store that in HBM too, then multiply by V to produce the output O. The large intermediates S and P are written to and read from slow global memory, so memory bandwidth becomes the bottleneck.

FlashAttention (fused): Q, K, and V are loaded tile by tile into fast on-chip SRAM, where QKᵀ, softmax, and the multiply by V are fused and accumulated into O. No large intermediate matrix is materialized, and there are far fewer HBM round-trips.

Efficient attention kernels, FlashAttention being the most widely known, restructure the computation so that the intermediate results never leave fast on-chip SRAM. Instead of writing the full attention matrix to global memory and reading it back, the kernel processes the input in tiles, computing softmax and the value-weighted output within each tile and accumulating the result incrementally. The fused operation does the same math with far fewer memory round-trips.

The thing that makes this interesting as an optimization is that the output is bit-for-bit equivalent to standard attention. Nothing is being approximated or dropped. The speedup comes entirely from fitting the computation into the hardware's memory hierarchy more efficiently, and on long sequences (where the quadratic memory cost of naive attention really starts to bite) the difference in both speed and peak memory usage can be large.
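The accumulation scheme can be demonstrated outside a GPU kernel. The NumPy sketch below processes K and V in tiles with an online softmax, keeping only a running max, running denominator, and running output per query row, and its result matches the standard computation up to floating-point tolerance. It illustrates the algorithm, not actual kernel code.

```python
import numpy as np

def attention(Q, K, V):
    """Standard attention: materializes the full n×n score matrix."""
    S = Q @ K.T / np.sqrt(Q.shape[-1])
    P = np.exp(S - S.max(axis=-1, keepdims=True))
    return (P / P.sum(axis=-1, keepdims=True)) @ V

def tiled_attention(Q, K, V, tile=16):
    """Same math, processed over K/V tiles with an online softmax,
    so no full score matrix is ever built."""
    n_q, d = Q.shape
    m = np.full(n_q, -np.inf)          # running row max
    l = np.zeros(n_q)                  # running softmax denominator
    O = np.zeros((n_q, V.shape[-1]))   # unnormalized running output
    for j in range(0, K.shape[0], tile):
        S = Q @ K[j:j + tile].T / np.sqrt(d)
        m_new = np.maximum(m, S.max(axis=-1))
        scale = np.exp(m - m_new)      # rescale earlier accumulators
        P = np.exp(S - m_new[:, None])
        l = l * scale + P.sum(axis=-1)
        O = O * scale[:, None] + P @ V[j:j + tile]
        m = m_new
    return O / l[:, None]
```

Each tile only touches a `tile × d` slice of K and V, so the per-tile working set is small and constant in sequence length, which is what lets the fused kernel stay in SRAM.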

One subtlety: kernel performance depends on the specific hardware. An attention kernel that shows big gains on one accelerator may look unremarkable on another because the memory hierarchy and compute-to-bandwidth ratios differ. Published numbers from a different GPU are directional, not definitive.

Example 7: Quantization, when the weights are smaller but behavior stays acceptable

Quantization changes the numerical precision of model weights and sometimes activations. The simplest version replaces 16-bit floating-point weights with 8-bit or 4-bit representations.

Memory and quality at different precisions (interactive illustration comparing memory, throughput, and quality across precisions)

FP16 baseline: ~14 GB memory, full precision, no quality tradeoff, but the highest memory cost.

The immediate effect is a smaller memory footprint: a model that needs 14 GB in 16-bit precision might fit in 7 GB at 8-bit or under 4 GB at 4-bit. That matters because the model fits on cheaper hardware, and because memory bandwidth goes down when there is simply less data to move per matrix operation (bandwidth, not compute, is usually the bottleneck during decode).
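The arithmetic is easy to see in a round-trip sketch: symmetric per-tensor int8 quantization of a weight matrix. Real deployments typically use per-channel or group-wise scales and more careful calibration, so treat this as a minimal illustration of the storage and error tradeoff.

```python
import numpy as np

def quantize_int8(w):
    """Map the tensor's max magnitude to the int8 range [-127, 127]."""
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.normal(scale=0.02, size=(1024, 1024)).astype(np.float32)
q, scale = quantize_int8(w)

# int8 storage is half the fp16 footprint (and a quarter of fp32),
# and the worst-case rounding error is about half a quantization step.
print(w.astype(np.float16).nbytes // q.nbytes)  # 2
```

Scaled up, that factor of two is exactly the 14 GB to 7 GB drop described above, and since decode is usually bandwidth-bound, moving half as many bytes per matrix multiply tends to speed it up too.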

The real question is whether the output quality holds up, and for most tasks it does, at least at 8-bit. The degradation from well-calibrated quantization is often smaller than people expect. At 4-bit the picture gets more complicated: some tasks tolerate it without issue, while others show measurable regressions, especially tasks that require precise numerical reasoning or that are already near the boundary of what the model can do.

Quantization is almost always worth trying as a first step when deploying under resource constraints, but "it works fine" needs to be verified on your specific task rather than assumed from aggregate benchmark scores. A model that quantizes beautifully for summarization can fall apart for structured extraction, and the only reliable way to know which side you are on is to actually test it.

Takeaway

These optimizations fall roughly into three groups: prefill-focused (prompt reduction, prefix caching), decode-focused (speculative decoding, efficient kernels, quantization), and server-focused (continuous batching, paged memory management). They target different parts of the pipeline and none of them work unconditionally. Prefix caching needs repeated prefixes. Speculative decoding needs a draft model that actually agrees with the target often enough. Quantization needs to not break whatever task you care about.

The common thread is that inference performance depends as much on the workload, the serving infrastructure, and the engineering decisions around the model as on the model itself. Two deployments of the same model can have wildly different latency profiles depending on how these pieces are configured, and the only way to know what helps in your specific case is to measure it there.