Rutvik Acharya

Training gets the glamour. Inference gets the bill.

Once LLM applications moved beyond prototypes, serving became a serious engineering problem. Every request consumed tokens, memory bandwidth, GPU time, and latency budget. The model was no longer just an API call. It was a production system with throughput constraints.

Inference optimization became one of the highest-leverage areas in LLM engineering.

Why Inference Is Expensive

LLM inference has two main phases:

  • Prefill: process the input prompt
  • Decode: generate output tokens one at a time

Prefill parallelizes across the prompt because every input token is known up front. Decode is harder to parallelize because each new token depends on all the tokens before it.

That autoregressive loop is why long outputs are expensive. Generating 1,000 tokens is not one operation. It is 1,000 sequential decisions.
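
To make the sequential nature concrete, here is a minimal sketch of a greedy decode loop. The `toy_model` function is a hypothetical stand-in for a real model, not any library's API; it only records how often it is called.

```python
from typing import Callable, List

def generate(next_token: Callable[[List[int]], int],
             prompt: List[int],
             max_new_tokens: int,
             eos: int = 0) -> List[int]:
    """Greedy autoregressive loop: one model call per generated token."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        tok = next_token(tokens)  # depends on everything generated so far
        tokens.append(tok)
        if tok == eos:
            break
    return tokens[len(prompt):]

# Hypothetical toy "model": logs each call and returns last token + 1.
call_log: List[int] = []

def toy_model(tokens: List[int]) -> int:
    call_log.append(len(tokens))
    return tokens[-1] + 1

output = generate(toy_model, prompt=[1, 2, 3], max_new_tokens=5)
print(output, "model calls:", len(call_log))
```

Five new tokens require five model calls, and no call can start before the previous one has finished.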

The system has to manage:

  • GPU memory for model weights
  • KV cache memory for active sequences
  • Batch scheduling
  • Variable prompt lengths
  • Variable output lengths
  • Streaming latency

This is why serving LLMs feels different from serving ordinary classifiers.

Prefill And Decode Have Different Bottlenecks

Prefill and decode stress the system differently.

Prefill processes the whole prompt in parallel, so it is typically compute-bound and gets expensive as prompts grow. Decode generates one token at a time, so it is typically limited by memory bandwidth and is sensitive to batch scheduling, KV cache memory, and output length.

This distinction matters because optimizations target different phases:

| Problem | Likely lever |
| --- | --- |
| Huge prompts | retrieval, prompt trimming, prompt caching |
| Slow streaming | batching, model size, decode optimization |
| Low throughput | continuous batching, quantization |
| Memory pressure | KV cache management, shorter context, smaller model |

If you do not separate prefill and decode metrics, you may optimize the wrong part of the system.
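
One way to separate the two phases is to derive a metric for each from per-token timestamps. The sketch below assumes you log an arrival time and an emit time for every output token; `RequestTrace` and its field names are illustrative, not part of any serving framework.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class RequestTrace:
    arrival: float            # seconds, when the request was received
    token_times: List[float]  # seconds, when each output token was emitted

def time_to_first_token(trace: RequestTrace) -> float:
    """Queueing plus prefill: how long until the first token appears."""
    return trace.token_times[0] - trace.arrival

def mean_inter_token_latency(trace: RequestTrace) -> float:
    """Decode speed: average gap between consecutive output tokens."""
    if len(trace.token_times) < 2:
        return 0.0
    span = trace.token_times[-1] - trace.token_times[0]
    return span / (len(trace.token_times) - 1)

trace = RequestTrace(arrival=0.0, token_times=[0.5, 0.75, 1.0, 1.25])
ttft = time_to_first_token(trace)
itl = mean_inter_token_latency(trace)
print(f"TTFT={ttft:.2f}s, inter-token latency={itl:.2f}s")
```

A high time to first token with a low inter-token latency points at queueing or prefill; the reverse points at decode.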

Batching Improves Throughput

GPUs like work. If you send one request at a time, you often waste capacity.

Batching groups requests so the GPU can process them together. This improves throughput, but it creates a latency tradeoff: wait too long to build a batch, and individual users wait.

Dynamic batching tries to balance this:

collect requests for a few milliseconds
batch compatible sequences
run model step
stream tokens back to each request

The best batching strategy depends on traffic shape. A chat app with interactive users has different constraints than an offline document processing job.
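
The collection step of dynamic batching can be sketched as a pure function over arrival times. This is a simplified illustration, not a production scheduler: `max_batch` and `max_wait_ms` are assumed tuning knobs, and a batch is closed when it is full or when the next request arrives after the oldest request's wait window.

```python
from typing import List

def form_batches(arrivals_ms: List[float],
                 max_batch: int,
                 max_wait_ms: float) -> List[List[float]]:
    """Group sorted arrival times into batches.

    A batch closes when it is full, or when the next request arrives
    more than max_wait_ms after the first request in the batch.
    """
    batches: List[List[float]] = []
    current: List[float] = []
    for t in arrivals_ms:
        window_expired = bool(current) and t - current[0] > max_wait_ms
        if current and (len(current) == max_batch or window_expired):
            batches.append(current)
            current = []
        current.append(t)
    if current:
        batches.append(current)
    return batches

batches = form_batches([0, 1, 2, 9, 10, 30], max_batch=4, max_wait_ms=5)
print(batches)
```

A larger `max_wait_ms` produces fuller batches at the cost of added latency for the first request in each batch, which is the tradeoff an interactive chat app and an offline job would tune differently.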

Continuous Batching Fits Generation Better

Classic batching assumes every request starts and ends together. LLM generation is messier. One user may request a 20-token classification, while another asks for a 1,000-token report.

Continuous batching lets new requests join as others finish. Instead of waiting for the slowest sequence in a static batch, the server keeps the GPU busy by scheduling active decoding work dynamically.

This improves utilization, but it makes scheduling more complex. The server has to manage fairness, streaming, memory pressure, and requests with different maximum output lengths.

For high-throughput serving, the scheduler becomes as important as the model.
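
A toy simulation shows where the gain comes from. It assumes each decode step advances every active sequence by one token, and it ignores prefill cost and memory limits; `slots` is a hypothetical cap on concurrent sequences.

```python
from collections import deque
from typing import Deque, List

def static_batch_steps(output_lens: List[int], slots: int) -> int:
    """Each fixed batch runs until its longest sequence finishes."""
    total = 0
    for i in range(0, len(output_lens), slots):
        total += max(output_lens[i:i + slots])
    return total

def continuous_batch_steps(output_lens: List[int], slots: int) -> int:
    """Finished sequences free their slot for the next queued request."""
    queue: Deque[int] = deque(output_lens)
    active: List[int] = []  # remaining tokens per active sequence
    steps = 0
    while queue or active:
        while queue and len(active) < slots:
            active.append(queue.popleft())
        # One decode step: every active sequence emits one token.
        active = [r - 1 for r in active if r - 1 > 0]
        steps += 1
    return steps

lens = [20, 1000, 20, 1000]
static_steps = static_batch_steps(lens, slots=2)
continuous_steps = continuous_batch_steps(lens, slots=2)
print(static_steps, continuous_steps)
```

With two slots, the static schedule needs 2,000 steps because each short request waits for its long batch partner, while the continuous schedule finishes in 1,040.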

The KV Cache Is The Hidden Memory Cost

During generation, the model stores attention keys and values for previous tokens. This KV cache avoids recomputing the whole sequence every time.

The cache is essential for speed, but it consumes memory proportional to:

  • Number of active requests
  • Context length
  • Model size (number of layers, KV heads, and head dimension)
  • Number of generated tokens

Long-context applications can run out of memory even when the model weights fit comfortably.
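
A back-of-the-envelope estimate makes this concrete. The function below uses the standard accounting for a decoder-only transformer: one key and one value vector per layer, per KV head, per token. The model shape in the example is hypothetical and chosen only for round numbers.

```python
def kv_cache_bytes(num_layers: int,
                   num_kv_heads: int,
                   head_dim: int,
                   tokens: int,
                   batch_size: int = 1,
                   bytes_per_value: int = 2) -> int:
    """Estimate KV cache size; the leading 2 covers keys and values."""
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_value
    return per_token * tokens * batch_size

# Hypothetical model shape: 32 layers, 8 KV heads, head_dim 128, 16-bit cache.
per_token = kv_cache_bytes(32, 8, 128, tokens=1)
total = kv_cache_bytes(32, 8, 128, tokens=8192, batch_size=16)
print(f"{per_token // 1024} KiB per token, {total / 2**30:.0f} GiB for 16 x 8192 tokens")
```

At 128 KiB per token, sixteen concurrent 8,192-token sequences need 16 GiB of cache on top of the weights.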

KV cache pressure is also why one pathological request can hurt everyone else. A single long-context conversation with a long generation can occupy enough memory to reduce batch size for other users.

Serving systems often need admission control:

  • Maximum prompt tokens
  • Maximum output tokens
  • Per-user concurrency limits
  • Separate queues for long jobs
  • Fallback models for cheap requests

These are product constraints as much as infrastructure constraints. Unlimited context and unlimited output are not realistic defaults.
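
The limits above can be sketched as a single admission check. All thresholds, queue names, and the `admit` function itself are illustrative assumptions; real values should come from your own traffic and hardware.

```python
from dataclasses import dataclass
from typing import Tuple

@dataclass(frozen=True)
class Limits:
    # Illustrative defaults, not recommendations.
    max_prompt_tokens: int = 8000
    max_output_tokens: int = 1000
    max_concurrent_per_user: int = 4
    long_job_prompt_tokens: int = 4000

def admit(prompt_tokens: int,
          requested_output_tokens: int,
          user_active_requests: int,
          limits: Limits = Limits()) -> Tuple[str, int]:
    """Return (decision, output_token_cap) for an incoming request."""
    if prompt_tokens > limits.max_prompt_tokens:
        return ("reject_prompt_too_long", 0)
    if user_active_requests >= limits.max_concurrent_per_user:
        return ("reject_too_many_concurrent", 0)
    # Clamp the output budget instead of rejecting outright.
    cap = min(requested_output_tokens, limits.max_output_tokens)
    if prompt_tokens > limits.long_job_prompt_tokens:
        return ("long_queue", cap)
    return ("interactive_queue", cap)

print(admit(500, 5000, user_active_requests=0))
print(admit(6000, 200, user_active_requests=1))
```

Clamping the output budget rather than rejecting is one possible design choice; a stricter product might return an error so the caller knows the answer may be truncated.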

Quantization Reduces The Footprint

Quantization stores weights with fewer bits. This reduces memory use and can improve throughput, especially when memory bandwidth is the bottleneck.

Common tradeoff:

higher precision: better quality, more memory
lower precision: cheaper serving, possible quality loss

Quantization is workload-dependent. A model quantized to 4-bit may be fine for classification and extraction, but weaker for complex reasoning or long-form generation.
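
The memory side of the tradeoff is simple arithmetic. This sketch counts weight storage only and ignores quantization metadata (scales, zero points), activations, and the KV cache; the 7-billion-parameter figure is a hypothetical example.

```python
def weight_memory_gib(num_params: float, bits_per_weight: float) -> float:
    """Approximate weight storage in GiB at a given precision."""
    return num_params * bits_per_weight / 8 / 2**30

params = 7e9  # hypothetical 7B-parameter model
fp16_gib = weight_memory_gib(params, 16)
int8_gib = weight_memory_gib(params, 8)
int4_gib = weight_memory_gib(params, 4)
print(f"16-bit: {fp16_gib:.1f} GiB, 8-bit: {int8_gib:.1f} GiB, 4-bit: {int4_gib:.1f} GiB")
```

Going from 16-bit to 4-bit cuts the weights from about 13 GiB to about 3.3 GiB. Whether the quality holds is a separate question that only your evals can answer.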

The correct test is not “Does the benchmark look okay?” It is “Does the quantized model pass our evals at the latency and cost we need?”

Prompt Caching Changes The Economics

Many applications reuse a large prefix:

  • System prompt
  • Tool descriptions
  • Policy text
  • Few-shot examples
  • Static documentation

Prompt caching avoids reprocessing the same prefix repeatedly. If the first 5,000 tokens are stable across requests, caching can significantly reduce latency and cost.

This changes prompt design. Stable prefixes become valuable. Randomly reordering tool descriptions or injecting dynamic content into the top of the prompt can reduce cache hits.

Good pattern:

stable system instructions
stable tool schemas
stable examples
dynamic user/context block

Caching rewards discipline.
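
The pattern can be sketched as a prompt builder that keeps the stable parts first and in a deterministic order. `build_prompt` and `prefix_key` are hypothetical helpers; real caches usually match on token prefixes and their behavior is provider-specific, so the hash here only illustrates whether two requests share an identical prefix.

```python
import hashlib
import json
from typing import Dict, List, Tuple

def build_prompt(system: str,
                 tool_schemas: List[Dict[str, str]],
                 examples: List[str],
                 user_block: str) -> Tuple[str, str]:
    """Return (stable_prefix, full_prompt) with dynamic content last."""
    # Sort tools by name so the caller's ordering cannot change the prefix.
    tools = sorted(tool_schemas, key=lambda t: t["name"])
    prefix = "\n\n".join([
        system,
        json.dumps(tools, sort_keys=True),
        "\n".join(examples),
    ])
    return prefix, prefix + "\n\n" + user_block

def prefix_key(prefix: str) -> str:
    """Identical prefixes produce identical keys."""
    return hashlib.sha256(prefix.encode("utf-8")).hexdigest()

tools_a = [{"name": "search", "description": "Search docs"},
           {"name": "lookup", "description": "Look up an order"}]
tools_b = list(reversed(tools_a))  # same tools, different order

prefix_1, prompt_1 = build_prompt("You are a support agent.", tools_a, ["Q: hi\nA: hello"], "Where is my order?")
prefix_2, prompt_2 = build_prompt("You are a support agent.", tools_b, ["Q: hi\nA: hello"], "Can I get a refund?")
print(prefix_key(prefix_1) == prefix_key(prefix_2))
```

The two requests differ in tool order and user question, yet they share the same prefix key, which is the property a prefix cache needs.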

Routing Reduces Average Cost

Not every request deserves the same model.

A practical inference stack can route by task:

  • Small model for classification
  • Quantized model for extraction
  • Frontier model for hard reasoning
  • Long-context model only when retrieval finds broad evidence

This is different from user-facing model selection. The user asks a question; the system chooses the cheapest path that passes quality requirements.

Routing needs evals. If the router is too aggressive, quality drops. If it is too conservative, cost savings disappear.
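
A minimal rule-based router might look like the sketch below. The model names, task labels, and the token threshold used as a proxy for "broad evidence" are all placeholders; a production router would be tuned and validated against evals.

```python
from typing import Dict

# Placeholder model identifiers, not real endpoints.
ROUTES: Dict[str, str] = {
    "classification": "small-model",
    "extraction": "quantized-model",
    "reasoning": "frontier-model",
}

def route(task_type: str,
          retrieved_tokens: int,
          long_context_threshold: int = 20_000) -> str:
    """Pick the cheapest model expected to pass quality requirements."""
    # Assumed proxy: a large retrieved context needs the long-context model.
    if retrieved_tokens > long_context_threshold:
        return "long-context-model"
    # Unknown task types fall back to the strongest model.
    return ROUTES.get(task_type, "frontier-model")

print(route("classification", retrieved_tokens=300))
print(route("extraction", retrieved_tokens=50_000))
```

Defaulting unknown tasks to the strongest model is the conservative choice: it protects quality first and lets you move task types to cheaper routes as evals justify it.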

Speculative Decoding

Speculative decoding uses a smaller draft model to propose several tokens, which a larger model then verifies in a single parallel pass. When the drafts match what the larger model would have produced, the system emits multiple tokens for roughly the cost of one large-model step instead of generating them one by one.

The intuition:

small model guesses: "The refund policy allows"
large model verifies those tokens
accepted tokens are emitted faster

This works best when the draft model’s predictions often match the larger model. It is a serving optimization, not a quality improvement.

The tradeoff is complexity. You now manage two models, verification logic, and workload-specific speedups.
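
Here is a toy sketch of one greedy speculative step. Both "models" are hypothetical functions that map a context to their most likely next token. A real system would score all drafted positions in one batched forward pass of the large model; this version calls the target once per position purely for readability.

```python
from typing import Callable, List

Model = Callable[[List[str]], str]  # context -> greedy next token

def speculative_step(target: Model, draft: Model,
                     context: List[str], k: int) -> List[str]:
    """Return the tokens emitted by one draft-then-verify round."""
    # 1. The draft model proposes k tokens autoregressively.
    proposed: List[str] = []
    for _ in range(k):
        proposed.append(draft(context + proposed))
    # 2. The target checks each proposal in order.
    accepted: List[str] = []
    for tok in proposed:
        expected = target(context + accepted)
        if expected != tok:
            accepted.append(expected)  # first mismatch: use the target's token
            return accepted
        accepted.append(tok)
    # 3. Every draft matched, so the target contributes one extra token.
    accepted.append(target(context + accepted))
    return accepted

SENTENCE = "the refund policy allows returns within thirty days".split()

def target_model(ctx: List[str]) -> str:
    return SENTENCE[len(ctx)] if len(ctx) < len(SENTENCE) else "<eos>"

def draft_model(ctx: List[str]) -> str:
    tok = target_model(ctx)
    return "exchanges" if tok == "returns" else tok  # one deliberate mistake

emitted = speculative_step(target_model, draft_model, ["the"], k=4)
print(emitted)
```

The round emits four tokens, and the draft's wrong guess is replaced by the target's token. In this greedy variant the output is exactly what the target alone would have generated, which is why the technique changes speed and not quality.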

Output Length Is A Product Decision

One of the simplest inference optimizations is asking the model to write less.

Long answers cost more, take longer, and can be harder to read. If the product only needs a classification, do not ask for an essay. If the UI shows three bullets, constrain the output to three bullets.

This is not just prompt thrift. It is product design.

Bad: Explain your reasoning in detail.
Better: Return the category and one sentence justification.
Best: Return JSON with category, confidence, and evidence_id.

Shorter outputs are easier to validate and cheaper to serve.

Retries Are Hidden Inference Cost

A model call that fails validation and retries is not one call. It is two or three.

This matters for structured outputs, tool calls, and extraction systems. A cheaper model with a high retry rate may be more expensive than a stronger model that succeeds the first time.

Track cost per successful task, not just cost per token.

model A: cheap tokens, 18% retry rate
model B: expensive tokens, 2% retry rate
winner: depends on end-to-end cost and latency

This is where product evals and serving metrics meet. Quality failures become infrastructure cost.
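
The comparison can be worked through with a simple model. It assumes each attempt fails independently with the same probability, retries continue until success, and every attempt costs the same, so the expected number of attempts is 1 / (1 - retry_rate). The per-call prices are made up for illustration.

```python
def cost_per_successful_task(cost_per_call: float, retry_rate: float) -> float:
    """Expected cost per success under independent, unbounded retries."""
    if not 0 <= retry_rate < 1:
        raise ValueError("retry_rate must be in [0, 1)")
    expected_attempts = 1 / (1 - retry_rate)
    return cost_per_call * expected_attempts

# Hypothetical prices: model B costs 15% more per call.
cost_a = cost_per_successful_task(cost_per_call=0.0100, retry_rate=0.18)
cost_b = cost_per_successful_task(cost_per_call=0.0115, retry_rate=0.02)
print(f"model A: ${cost_a:.5f} per success, model B: ${cost_b:.5f} per success")
```

Under these assumed prices, model B is cheaper per successful task despite the higher per-call price. Raise B's price a little further and the result flips, which is why the winner depends on end-to-end numbers. Retries also add latency that this calculation does not capture.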

Measuring The Right Metrics

Average latency is not enough.

Track:

  • Time to first token
  • Tokens per second
  • End-to-end latency
  • Prompt tokens per request
  • Completion tokens per request
  • Cache hit rate
  • GPU utilization
  • Queue time
  • Error and timeout rates
  • Cost per successful task

For user-facing chat, time to first token may matter more than total generation time. For offline extraction, throughput and cost may matter more than streaming latency.

Capacity Planning Needs Real Traffic

Synthetic benchmarks are useful, but real traffic has ugly distributions.

Users do not send uniform prompts. Some requests are tiny. Some paste entire documents. Some ask for one label. Others ask for a long report. The average request is rarely the request that breaks capacity.

Plan from distributions:

  • p50, p95, and p99 prompt length
  • p50, p95, and p99 output length
  • concurrent active requests
  • burst size
  • cacheable prefix percentage
  • retry rate

The p95 case often determines whether the experience feels reliable. The p99 case often determines whether you need guardrails.

This is why logging token counts is not optional. Token distributions are infrastructure requirements hiding inside product behavior.
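
Turning logged token counts into percentiles takes only a few lines. The sketch uses the nearest-rank definition of a percentile, and the sample data is invented to show how one pasted document skews the tail.

```python
import math
from typing import List

def percentile(values: List[int], p: float) -> int:
    """Nearest-rank percentile for p in (0, 100]."""
    ordered = sorted(values)
    rank = math.ceil(p * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]

# Invented sample: nine ordinary prompts and one pasted document.
prompt_tokens = [120, 90, 300, 150, 8000, 200, 110, 95, 400, 130]

p50 = percentile(prompt_tokens, 50)
p95 = percentile(prompt_tokens, 95)
mean = sum(prompt_tokens) / len(prompt_tokens)
print(f"mean={mean:.1f}, p50={p50}, p95={p95}")
```

The mean of 959.5 tokens describes no real request here: the median is 130 and the tail is 8,000. Planning from the mean would both overestimate typical load and underestimate the request that breaks capacity.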

A Practical Serving Playbook

A reasonable optimization path:

  1. Measure prompt and output token distributions
  2. Shorten prompts and outputs where product allows
  3. Add structured outputs to reduce retries
  4. Use prompt caching for stable prefixes
  5. Batch requests based on workload
  6. Evaluate quantized models
  7. Add routing for easy versus hard tasks
  8. Consider speculative decoding when traffic justifies complexity

Start with measurement. Without traces, it is easy to optimize the wrong thing.

The first dashboard should be simple:

  • p50 and p95 latency
  • time to first token
  • input and output tokens
  • validation failure rate
  • retry rate
  • cache hit rate
  • cost per completed workflow

Once those are visible, optimization becomes less mystical. You can see whether the problem is long prompts, long outputs, queueing, cache misses, or bad model behavior.

The Takeaway

LLM inference became a systems problem because model quality was only one part of the product.

Latency, throughput, memory, caching, output length, and validation all shape whether an LLM feature is affordable and pleasant to use.

The best teams treat inference as part of application design. They do not just ask “Which model is smartest?” They ask “Which system delivers the right answer at the right cost and latency?”