LLM Inference Became A Systems Problem • Rutvik Acharya

Training gets the glamour. Inference gets the bill.

Once LLM applications moved beyond prototypes, serving became a serious engineering problem. Every request consumed tokens, memory bandwidth, GPU time, and latency budget. The model was no longer just an API call. It was a production system with throughput constraints.

Inference optimization became one of the highest-leverage areas in LLM engineering.

Why Inference Is Expensive

LLM inference has two main phases:

Prefill: process the input prompt
Decode: generate output tokens one at a time

Prefill is parallelizable across the prompt, so it tends to be compute-bound. Decode is harder to parallelize because each new token depends on all the tokens before it.

That autoregressive loop is why long outputs are expensive. Generating 1,000 tokens is not one operation. It is 1,000 sequential forward passes.

The system has to manage:

GPU memory for model weights
KV cache memory for active sequences
Batch scheduling
Variable prompt lengths
Variable output lengths
Streaming latency

This is why serving LLMs feels different from serving ordinary classifiers.

Prefill And Decode Have Different Bottlenecks

Prefill and decode stress the system differently.

Prefill processes the whole prompt in parallel. It is typically compute-bound and becomes expensive when prompts are long. Decode generates one token per step. It is typically limited by memory bandwidth and is sensitive to batch scheduling, KV cache memory, and output length.

This distinction matters because optimizations target different phases:

Problem	Likely lever
Huge prompts	retrieval, prompt trimming, prompt caching
Slow streaming	batching, model size, decode optimization
Low throughput	continuous batching, quantization
Memory pressure	KV cache management, shorter context, smaller model

If you do not separate prefill and decode metrics, you may optimize the wrong part of the system.

Batching Improves Throughput

GPUs like work. If you send one request at a time, you often waste capacity.

Batching groups requests so the GPU can process them together. This improves throughput, but it creates a latency tradeoff: wait too long to build a batch, and individual users wait.

Dynamic batching tries to balance this:

collect requests for a few milliseconds
batch compatible sequences
run model step
stream tokens back to each request

The best batching strategy depends on traffic shape. A chat app with interactive users has different constraints than an offline document processing job.
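
The collect-then-batch steps above can be sketched as a small simulation over arrival times. This is a minimal illustration, not a production scheduler: `Request`, `form_batches`, `max_wait_ms`, and `max_batch_size` are hypothetical names, and a real server would do this against a live queue rather than a sorted list.

```python
from dataclasses import dataclass


@dataclass
class Request:
    request_id: str
    arrival_ms: float


def form_batches(requests, max_wait_ms=5.0, max_batch_size=4):
    """Close a batch when it is full or when a new arrival would make
    the oldest request in it wait longer than max_wait_ms."""
    batches = []
    current = []
    for req in sorted(requests, key=lambda r: r.arrival_ms):
        batch_is_full = len(current) >= max_batch_size
        waited_too_long = bool(current) and (
            req.arrival_ms - current[0].arrival_ms > max_wait_ms
        )
        if current and (batch_is_full or waited_too_long):
            batches.append(current)
            current = []
        current.append(req)
    if current:
        batches.append(current)
    return batches


arrivals = [
    Request("a", 0.0),
    Request("b", 1.0),
    Request("c", 4.0),
    Request("d", 9.0),
    Request("e", 10.0),
]
batches = form_batches(arrivals)
print([[r.request_id for r in batch] for batch in batches])
```

Raising `max_wait_ms` builds larger batches at the cost of queue time for the first request in each batch, which is the tradeoff described above.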

Continuous Batching Fits Generation Better

Classic batching assumes every request starts and ends together. LLM generation is messier. One user may request a 20-token classification, while another asks for a 1,000-token report.

Continuous batching lets new requests join as others finish. Instead of waiting for the slowest sequence in a static batch, the server keeps the GPU busy by scheduling active decoding work dynamically.

This improves utilization, but it makes scheduling more complex. The server has to manage fairness, streaming, memory pressure, and requests with different maximum output lengths.

For high-throughput serving, the scheduler becomes as important as the model.
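
A toy step count shows why this matters. The sketch below assumes every decode step costs the same regardless of how many slots are busy, and that a request needs exactly as many steps as it has output tokens. Both are simplifications, and the function names are made up for illustration.

```python
from collections import deque


def static_batch_steps(output_lengths, slots):
    """Each batch runs until its longest sequence finishes."""
    total = 0
    for i in range(0, len(output_lengths), slots):
        total += max(output_lengths[i:i + slots])
    return total


def continuous_batch_steps(output_lengths, slots):
    """Free slots are refilled from the queue before every decode step."""
    waiting = deque(output_lengths)
    active = []  # remaining tokens for each running sequence
    steps = 0
    while waiting or active:
        while waiting and len(active) < slots:
            active.append(waiting.popleft())
        steps += 1
        # A sequence with one token left finishes on this step.
        active = [remaining - 1 for remaining in active if remaining > 1]
    return steps


lengths = [1000, 20, 1000, 20]  # two reports and two short classifications
print("static:", static_batch_steps(lengths, slots=2))
print("continuous:", continuous_batch_steps(lengths, slots=2))
```

With two slots, the static version spends 2,000 steps because each short request is stuck behind a long one. The continuous version finishes in 1,020 because the slot freed by a short request is reused immediately.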

The KV Cache Is The Hidden Memory Cost

During generation, the model stores attention keys and values for previous tokens. This KV cache avoids recomputing the whole sequence every time.

The cache is essential for speed, but its memory footprint grows with:

Number of active requests
Context length (prompt tokens)
Number of generated tokens
Model shape (layers, KV heads, head dimension) and cache precision

Long-context applications can run out of memory even when the model weights fit comfortably.
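
A back-of-the-envelope estimate makes this concrete. The sketch assumes a standard transformer that caches one key vector and one value vector per token, per layer, per KV head. The model shape below is hypothetical and chosen only to keep the arithmetic easy to follow.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   batch_size=1, bytes_per_value=2):
    """Approximate KV cache size in bytes.

    The leading 2 accounts for keys and values. bytes_per_value=2
    assumes the cache is stored in 16-bit precision.
    """
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_value
    return per_token * seq_len * batch_size


# Hypothetical shape: 32 layers, 32 KV heads, head dimension 128.
one_request = kv_cache_bytes(32, 32, 128, seq_len=4096)
sixteen_requests = kv_cache_bytes(32, 32, 128, seq_len=4096, batch_size=16)

print(f"{one_request / 2**30:.1f} GiB for one 4,096-token sequence")
print(f"{sixteen_requests / 2**30:.1f} GiB for 16 concurrent sequences")
```

Under these assumptions a single 4,096-token sequence needs 2 GiB of cache, and 16 of them need 32 GiB, which can exceed the memory taken by the weights themselves. Models with fewer KV heads than attention heads shrink this proportionally.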

KV cache pressure is also why one pathological request can hurt everyone else. A single long-context conversation with a long generation can occupy enough memory to reduce batch size for other users.

Serving systems often need admission control:

Maximum prompt tokens
Maximum output tokens
Per-user concurrency limits
Separate queues for long jobs
Fallback models for cheap requests

These are product constraints as much as infrastructure constraints. Unlimited context and unlimited output are not realistic defaults.
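
The first four limits can be sketched as a single admission check. Everything here is an assumption made for illustration: the `Limits` fields, the default thresholds, and the queue names are not taken from any particular serving framework.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Limits:
    max_prompt_tokens: int = 8000
    max_output_tokens: int = 1000
    max_concurrent_per_user: int = 4
    long_job_prompt_tokens: int = 4000  # above this, use the long-job queue


def admit(prompt_tokens, requested_output_tokens, user_active_requests,
          limits=Limits()):
    """Return (decision, output_token_cap) for an incoming request."""
    if prompt_tokens > limits.max_prompt_tokens:
        return "reject: prompt too long", 0
    if user_active_requests >= limits.max_concurrent_per_user:
        return "reject: too many concurrent requests", 0
    # Clamp the output budget instead of rejecting the request outright.
    cap = min(requested_output_tokens, limits.max_output_tokens)
    if prompt_tokens > limits.long_job_prompt_tokens:
        return "admit: long queue", cap
    return "admit: interactive queue", cap


print(admit(500, 5000, user_active_requests=0))
print(admit(5000, 200, user_active_requests=1))
print(admit(9000, 200, user_active_requests=0))
```

Clamping the output budget rather than rejecting is a product choice. Some applications would rather fail loudly than return a truncated answer.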

Quantization Reduces The Footprint

Quantization stores weights with fewer bits. This reduces memory use and can improve throughput, especially when memory bandwidth is the bottleneck.

Common tradeoff:

higher precision: better quality, more memory
lower precision: cheaper serving, possible quality loss

Quantization is workload-dependent. A model quantized to 4-bit may be fine for classification and extraction, but weaker for complex reasoning or long-form generation.

The correct test is not “Does the benchmark look okay?” It is “Does the quantized model pass our evals at the latency and cost we need?”
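
The footprint side of the tradeoff is simple arithmetic. The sketch below uses a hypothetical 7-billion-parameter model and counts weights only. It ignores the scales and zero points that real quantization formats store, as well as activations and the KV cache, so treat the numbers as a lower bound.

```python
def weight_bytes(num_params, bits_per_weight):
    """Memory needed for the weights alone at a given precision."""
    return num_params * bits_per_weight / 8


NUM_PARAMS = 7_000_000_000  # hypothetical model size

for bits in (16, 8, 4):
    gib = weight_bytes(NUM_PARAMS, bits) / 2**30
    print(f"{bits:>2}-bit weights: {gib:.1f} GiB")
```

Halving the bits halves the bytes that must be read per decode step, which is why quantization tends to help most when decode is bandwidth-bound. Whether quality holds up still has to come from your own evals.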

Prompt Caching Changes The Economics

Many applications reuse a large prefix:

System prompt
Tool descriptions
Policy text
Few-shot examples
Static documentation

Prompt caching avoids reprocessing the same prefix repeatedly. If the first 5,000 tokens are stable across requests, caching can significantly reduce latency and cost.

This changes prompt design. Stable prefixes become valuable. Randomly reordering tool descriptions or injecting dynamic content into the top of the prompt can reduce cache hits.

Good pattern:

stable system instructions
stable tool schemas
stable examples
dynamic user/context block

Caching rewards discipline.
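
One way to enforce that discipline is to assemble prompts through a single function that always puts stable content first and in a fixed order. This is a sketch with hypothetical names. Real prompt caches match on token prefixes and their rules vary by provider or serving engine, so the hash here only illustrates whether two requests share a byte-identical prefix.

```python
import hashlib


def build_prompt(system, tool_schemas, examples, dynamic_block):
    """Return (stable_prefix, full_prompt) with dynamic content last."""
    # Sorting makes the tool order deterministic, so an upstream change in
    # iteration order cannot silently change the prefix.
    stable_parts = [system, *sorted(tool_schemas), *examples]
    stable_prefix = "\n\n".join(stable_parts)
    return stable_prefix, stable_prefix + "\n\n" + dynamic_block


def prefix_key(stable_prefix):
    """Fingerprint of the stable prefix, useful for logging cache reuse."""
    return hashlib.sha256(stable_prefix.encode("utf-8")).hexdigest()


prefix_1, prompt_1 = build_prompt(
    "You are a support agent.", ["tool: refund", "tool: lookup"],
    ["example 1"], "User: where is my order?")
prefix_2, prompt_2 = build_prompt(
    "You are a support agent.", ["tool: lookup", "tool: refund"],
    ["example 1"], "User: can I get a refund?")

print(prefix_key(prefix_1) == prefix_key(prefix_2))
```

Logging the prefix fingerprint alongside each request makes it easy to spot when a deploy has accidentally changed the stable part of the prompt.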

Routing Reduces Average Cost

Not every request deserves the same model.

A practical inference stack can route by task:

Small model for classification
Quantized model for extraction
Frontier model for hard reasoning
Long-context model only when retrieval finds broad evidence

This is different from user-facing model selection. The user asks a question; the system chooses the cheapest path that passes quality requirements.

Routing needs evals. If the router is too aggressive, quality drops. If it is too conservative, cost savings disappear.
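
A minimal task-based router might look like the sketch below. The model names, the task labels, and the token threshold standing in for "retrieval found broad evidence" are all placeholders. A real router would take its thresholds from eval results rather than hard-coded guesses.

```python
ROUTES = {
    "classification": "small-model",
    "extraction": "quantized-model",
    "reasoning": "frontier-model",
}


def route(task_type, evidence_tokens=0, long_context_threshold=32_000):
    """Pick the cheapest model tier expected to pass quality requirements."""
    # Long-context model only when retrieval returns a lot of evidence.
    if evidence_tokens > long_context_threshold:
        return "long-context-model"
    # Unknown task types fall back to the strongest model. This is the
    # conservative choice: it protects quality and gives up savings.
    return ROUTES.get(task_type, "frontier-model")


print(route("classification"))
print(route("extraction", evidence_tokens=50_000))
print(route("something-new"))
```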

Speculative Decoding

Speculative decoding uses a smaller draft model to propose several tokens, then a larger model verifies them together. Where the draft matches what the large model would have produced, the system accepts multiple tokens for roughly the cost of one large-model step instead of generating them one by one.

The intuition:

small model guesses: "The refund policy allows"
large model verifies those tokens
accepted tokens are emitted faster

This works best when the draft model’s predictions often match the larger model. It is a serving optimization, not a quality improvement.

The tradeoff is complexity. You now manage two models, verification logic, and workload-specific speedups.
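
The accept-or-correct logic can be shown with a toy greedy version. The two "models" below are stand-in functions that read from fixed word lists, and verification calls the target once per position for readability. A real system would score all drafted positions in one forward pass and, when sampling, would use a probabilistic acceptance rule rather than exact match.

```python
TARGET = "the refund policy allows returns within thirty days".split()
DRAFT = "the refund policy allows returns within ten days".split()


def target_next(context):
    """Stand-in for the large model's greedy next token."""
    return TARGET[len(context)] if len(context) < len(TARGET) else "<eos>"


def draft_next(context):
    """Stand-in for the small draft model. It gets one word wrong."""
    return DRAFT[len(context)] if len(context) < len(DRAFT) else "<eos>"


def speculative_step(prefix, draft_next, target_next, k=4):
    """One round: draft k tokens, keep the verified prefix of them."""
    context = list(prefix)
    proposed = []
    for _ in range(k):
        token = draft_next(context)
        proposed.append(token)
        context.append(token)

    context = list(prefix)
    accepted = []
    for token in proposed:
        expected = target_next(context)
        if token != expected:
            accepted.append(expected)  # replace the first wrong guess
            return accepted
        accepted.append(token)
        context.append(token)
    accepted.append(target_next(context))  # bonus token when all k match
    return accepted


def generate(draft_next, target_next, max_tokens, k=4):
    output = []
    while len(output) < max_tokens:
        output.extend(speculative_step(output, draft_next, target_next, k))
    return output[:max_tokens]


print(speculative_step([], draft_next, target_next))
print(generate(draft_next, target_next, max_tokens=len(TARGET)))
```

The generated text is identical to what the target would have produced alone. The only thing that changes is how many target rounds it took, which is why the speedup depends on how often the draft agrees.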

Output Length Is A Product Decision

One of the simplest inference optimizations is asking the model to write less.

Long answers cost more, take longer, and can be harder to read. If the product only needs a classification, do not ask for an essay. If the UI shows three bullets, constrain the output to three bullets.

This is not just prompt thrift. It is product design.

Bad: Explain your reasoning in detail.
Better: Return the category and one sentence justification.
Best: Return JSON with category, confidence, and evidence_id.

Shorter outputs are easier to validate and cheaper to serve.
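
Validation for the "Best" variant fits in a few lines. The sketch below assumes the three fields named above, a caller-supplied set of allowed categories, and a confidence between 0 and 1. Those constraints are illustrative, not a fixed schema.

```python
import json

REQUIRED_KEYS = {"category", "confidence", "evidence_id"}


def parse_classification(raw, allowed_categories):
    """Return the parsed dict, or raise ValueError if it is unusable."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"not valid JSON: {exc}") from exc
    if not isinstance(data, dict) or set(data) != REQUIRED_KEYS:
        raise ValueError("expected exactly category, confidence, evidence_id")
    category = data["category"]
    if not isinstance(category, str) or category not in allowed_categories:
        raise ValueError("unknown category")
    confidence = data["confidence"]
    is_number = isinstance(confidence, (int, float)) and not isinstance(confidence, bool)
    if not is_number or not 0 <= confidence <= 1:
        raise ValueError("confidence must be a number between 0 and 1")
    return data


raw_output = '{"category": "billing", "confidence": 0.9, "evidence_id": "doc-12"}'
print(parse_classification(raw_output, {"billing", "shipping"}))
```

A free-text explanation has no equivalent check. That is part of why the structured form is cheaper to operate, not just cheaper to generate.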

Retries Are Hidden Inference Cost

A model call that fails validation and retries is not one call. It is two or three.

This matters for structured outputs, tool calls, and extraction systems. A cheaper model with a high retry rate may be more expensive than a stronger model that succeeds the first time.

Track cost per successful task, not just cost per token.

model A: cheap tokens, 18% retry rate
model B: expensive tokens, 2% retry rate
winner: depends on end-to-end cost and latency

This is where product evals and serving metrics meet. Quality failures become infrastructure cost.
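
The comparison can be made with a short calculation. It assumes the retry rate is the probability that any single attempt fails validation, that attempts are independent, and that the system retries until it succeeds. The per-attempt prices are invented for illustration.

```python
def cost_per_success(cost_per_attempt, retry_rate):
    """Expected cost of one successful task.

    With independent attempts, the expected number of attempts until
    the first success is 1 / (1 - retry_rate).
    """
    if not 0 <= retry_rate < 1:
        raise ValueError("retry_rate must be in [0, 1)")
    return cost_per_attempt / (1 - retry_rate)


model_a = cost_per_success(cost_per_attempt=0.002, retry_rate=0.18)
model_b = cost_per_success(cost_per_attempt=0.010, retry_rate=0.02)

print(f"model A: ${model_a:.5f} per successful task")
print(f"model B: ${model_b:.5f} per successful task")
```

With these made-up prices the cheap model still wins on cost, but each retry also adds a full round of latency. A retry cap changes the picture again, because some tasks then fail outright instead of costing more.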

Measuring The Right Metrics

Average latency is not enough.

Track:

Time to first token
Tokens per second
End-to-end latency
Prompt tokens per request
Completion tokens per request
Cache hit rate
GPU utilization
Queue time
Error and timeout rates
Cost per successful task

For user-facing chat, time to first token may matter more than total generation time. For offline extraction, throughput and cost may matter more than streaming latency.
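
The streaming metrics can all be derived from per-token arrival timestamps. The sketch below assumes you record the request start time and the time each output token arrived, in seconds. The function and key names are hypothetical.

```python
def stream_metrics(request_start, token_times):
    """Compute latency metrics for one streamed response."""
    if not token_times:
        raise ValueError("no tokens were received")
    time_to_first_token = token_times[0] - request_start
    end_to_end = token_times[-1] - request_start
    # Decode speed excludes the wait for the first token, so queueing
    # and prefill do not distort it.
    decode_time = token_times[-1] - token_times[0]
    if decode_time > 0:
        tokens_per_second = (len(token_times) - 1) / decode_time
    else:
        tokens_per_second = None  # single token: no decode rate to report
    return {
        "ttft_s": time_to_first_token,
        "e2e_s": end_to_end,
        "decode_tokens_per_s": tokens_per_second,
    }


metrics = stream_metrics(0.0, [0.5, 1.0, 1.5, 2.0, 2.5])
print(metrics)
```

Keeping time to first token separate from decode speed is what lets you tell a prefill or queueing problem apart from a slow decode loop.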

Capacity Planning Needs Real Traffic

Synthetic benchmarks are useful, but real traffic has ugly distributions.

Users do not send uniform prompts. Some requests are tiny. Some paste entire documents. Some ask for one label. Others ask for a long report. The average request is rarely the request that breaks capacity.

Plan from distributions:

p50, p95, and p99 prompt length
p50, p95, and p99 output length
concurrent active requests
burst size
cacheable prefix percentage
retry rate

The p95 case often determines whether the experience feels reliable. The p99 case often determines whether you need guardrails.

This is why logging token counts is not optional. Token distributions are infrastructure requirements hiding inside product behavior.
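
A small example shows how far the average can sit from the tail. The prompt lengths below are synthetic, and the nearest-rank percentile is just one of several common definitions.

```python
import math


def percentile(values, p):
    """Nearest-rank percentile for p in (0, 100]."""
    if not values:
        raise ValueError("no values")
    ordered = sorted(values)
    rank = math.ceil(p * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


# Synthetic traffic: mostly short prompts, a few pasted documents.
prompt_tokens = [200] * 89 + [4000] * 9 + [60000] * 2

mean = sum(prompt_tokens) / len(prompt_tokens)
print("mean:", mean)
for p in (50, 95, 99):
    print(f"p{p}:", percentile(prompt_tokens, p))
```

Here the mean is 1,738 tokens, a request size that never actually occurs. The p50 is 200, the p95 is 4,000, and the p99 is 60,000, so sizing for the mean would understate the requests that stress memory.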

A Practical Serving Playbook

A reasonable optimization path:

Measure prompt and output token distributions
Shorten prompts and outputs where the product allows
Add structured outputs to reduce retries
Use prompt caching for stable prefixes
Batch requests based on workload
Evaluate quantized models
Add routing for easy versus hard tasks
Consider speculative decoding when traffic justifies complexity

Start with measurement. Without traces, it is easy to optimize the wrong thing.

The first dashboard should be simple:

p50 and p95 latency
time to first token
input and output tokens
validation failure rate
retry rate
cache hit rate
cost per completed workflow

Once those are visible, optimization becomes less mystical. You can see whether the problem is long prompts, long outputs, queueing, cache misses, or bad model behavior.

The Takeaway

LLM inference became a systems problem because model quality was only one part of the product.

Latency, throughput, memory, caching, output length, and validation all shape whether an LLM feature is affordable and pleasant to use.

The best teams treat inference as part of application design. They do not just ask “Which model is smartest?” They ask “Which system delivers the right answer at the right cost and latency?”