Rutvik Acharya


Long context windows are seductive. If a model can accept hundreds of thousands of tokens, why build a retrieval system at all? Just paste the whole manual, contract set, codebase, or chat history into the prompt and ask the question.

Sometimes that works. It is also one of the easiest ways to build an expensive system that feels reliable in demos and brittle in production.

Long context is useful. It is not a replacement for retrieval.

More Context Means More Noise

The model can only answer from what it attends to. A larger context window increases the amount of information available, but it also increases the amount of irrelevant information competing for attention.

Imagine asking:

What is the termination notice period for enterprise customers?

If the prompt includes every contract template, support article, archived policy, and renewal email, the correct clause may be present. But it is surrounded by near matches:

  • Consumer cancellation policy
  • Legacy enterprise template
  • Draft contract language
  • Regional exception
  • Sales FAQ

The model now has to do its own retrieval inside a noisy prompt. You have moved retrieval out of your system and into the model’s attention mechanism, where you have less visibility and less control.

Lost In The Middle

Long-context models can still struggle with information buried in the middle of a prompt. They may overweight the beginning, the end, or the most semantically obvious passage rather than the exact passage that answers the question.

This creates a strange debugging experience. The answer is “in the context,” but the model behaves as if it did not see it.

That is not always hallucination. Sometimes it is context layout failure.

Retrieval Gives You Control

RAG is not only about fitting documents into a small context window. It is about deciding which evidence deserves to be there.

A retrieval system gives you levers:

  • Filter by permissions
  • Prefer current documents over archived ones
  • Boost exact keyword matches
  • Re-rank semantically similar chunks
  • Exclude drafts
  • Track which evidence was used
  • Evaluate recall independently

Long context gives you capacity. Retrieval gives you selection.

The two are complementary. A good system retrieves the most relevant material, then uses a larger context window to include enough surrounding context for synthesis.
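
A few of those levers can be sketched in code. This is a minimal illustration, not a production retriever: the `Chunk` fields, the status labels, and the `score` value (assumed to come from some upstream ranker) are all hypothetical names I am using for the example.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    text: str
    source: str
    status: str          # "current", "archived", or "draft" (assumed labels)
    allowed_groups: set  # groups permitted to read this chunk
    score: float         # relevance score from an upstream ranker (assumed)


def select_evidence(chunks, user_groups, top_k=3):
    """Apply permission and status filters, then keep the top_k chunks."""
    visible = [
        c for c in chunks
        if c.allowed_groups & user_groups   # filter by permissions
        and c.status != "draft"             # exclude drafts
    ]
    # Prefer current documents over archived ones, then higher scores.
    visible.sort(key=lambda c: (c.status == "current", c.score), reverse=True)
    return visible[:top_k]


chunks = [
    Chunk("Enterprise notice clause", "enterprise-policy.md", "current", {"legal"}, 0.82),
    Chunk("Legacy enterprise template", "legacy.md", "archived", {"legal"}, 0.90),
    Chunk("Draft contract language", "draft.md", "draft", {"legal"}, 0.95),
    Chunk("Consumer cancellation policy", "consumer.md", "current", {"support"}, 0.70),
]
selected = select_evidence(chunks, user_groups={"legal"})
print([c.source for c in selected])  # ['enterprise-policy.md', 'legacy.md']
```

The draft has the highest raw score and the archived template outscores the current policy, yet the current policy still comes first. That ordering is a business rule, and it lives in code you can read and test.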

The Cost Problem

Long prompts are not free. They cost money, latency, and sometimes quality.

If you send 100,000 tokens for every question, you pay for those tokens every time. You also increase time to first token and make caching harder unless the shared prefix is stable.

For internal tools, this may be acceptable. For high-volume user-facing systems, it becomes painful quickly.
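
The arithmetic is worth doing once. The sketch below uses made-up numbers: the price per million input tokens, the request volume, and the 4,000-token retrieved prompt are assumptions for illustration, not any provider's actual pricing.

```python
def monthly_input_cost(tokens_per_request, requests_per_day,
                       price_per_million_tokens, days=30):
    """Input-token spend for a month, in the same currency as the price."""
    total_tokens = tokens_per_request * requests_per_day * days
    return total_tokens / 1_000_000 * price_per_million_tokens


# Hypothetical inputs: 10,000 requests per day at 3.00 per million input tokens.
PRICE = 3.00
dumped = monthly_input_cost(100_000, 10_000, PRICE)   # whole corpus in every prompt
retrieved = monthly_input_cost(4_000, 10_000, PRICE)  # a few retrieved passages

print(f"context dumping: {dumped:,.0f}")              # 90,000
print(f"retrieval:       {retrieved:,.0f}")           # 3,600
print(f"ratio:           {dumped / retrieved:.0f}x")  # 25x
```

The ratio depends only on prompt size, so it holds whatever the real price is. Prompt caching can reduce the gap when the shared prefix is stable, which this sketch does not model.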

There is also an engineering cost: logs get larger, traces become harder to inspect, and prompt diffs become noisy. When something fails, you have to inspect a huge input to understand why.

Context Budgets Should Be Designed

A useful way to think about long context is as a budget, not a bucket.

Even if the model can accept a very large prompt, you still need to decide how much space each type of information deserves:

  • System instructions
  • Tool descriptions
  • Conversation history
  • Retrieved evidence
  • User-provided files
  • Intermediate notes
  • Output instructions

If conversation history grows without pruning, it can crowd out retrieved evidence. If tool descriptions are overly verbose, they can crowd out the document passages that actually answer the question.

A practical context budget might look like:

10% system and tool instructions
15% recent conversation
60% retrieved evidence
10% metadata and citations
5% output format instructions

The numbers are not universal. The point is to make the tradeoff explicit. Context allocation is product design.
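
One way to make the tradeoff explicit is to turn the percentages into token limits and check usage against them. This is a sketch under assumptions: the section names, the 32,000-token total, and the usage figures are hypothetical, and counting tokens is left to whatever tokenizer you already use.

```python
# Percentages from the example budget above; illustrative, not universal.
BUDGET_PERCENT = {
    "system_and_tools": 10,
    "recent_conversation": 15,
    "retrieved_evidence": 60,
    "metadata_and_citations": 10,
    "output_format": 5,
}


def allocate(total_tokens, percents=BUDGET_PERCENT):
    """Turn percentage shares into whole-token limits per section."""
    if sum(percents.values()) != 100:
        raise ValueError("budget percentages must sum to 100")
    # Integer math avoids floating-point rounding surprises.
    return {name: total_tokens * pct // 100 for name, pct in percents.items()}


def over_budget(usage, limits):
    """Return the sections whose token usage exceeds their limit."""
    return [name for name, used in usage.items() if used > limits.get(name, 0)]


limits = allocate(32_000)
print(limits["retrieved_evidence"])  # 19200

# Hypothetical usage: conversation history has grown without pruning.
usage = {"recent_conversation": 9_000, "retrieved_evidence": 12_000}
print(over_budget(usage, limits))    # ['recent_conversation']
```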

Context Design Beats Context Dumping

The better question is not “How much can I fit?” It is “What does the model need to answer this specific question?”

Good context design includes:

  • The user question
  • The top evidence passages
  • Source titles and dates
  • Relevant surrounding sections
  • Definitions for local acronyms
  • Tool results or structured facts
  • Clear instructions for handling missing evidence

Bad context design includes everything because it might be useful.

Bad: entire policy folder
Better: 6 retrieved passages + source metadata + current-policy filter
Best: 6 retrieved passages + parent sections + explicit conflict handling

Conflict Handling Is A First-Class Feature

Long context increases the chance of conflicts. Two documents may disagree because one is outdated, regional, draft, or scoped to a different customer segment.

Your prompt should tell the model how to handle this:

If sources conflict, prefer the newest non-draft policy.
If the conflict cannot be resolved from metadata, say the documents conflict and cite both.

Your retrieval layer should also help by attaching metadata:

{
  "source": "enterprise-policy.md",
  "updated_at": "recent",
  "status": "current",
  "audience": "enterprise"
}

Without metadata, the model may choose the answer that sounds most plausible rather than the answer your business rules require.
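
With metadata attached, part of the rule can run before the model sees anything. The sketch below assumes `updated_at` is stored as a real date (the dates and file names are invented for the example) and that `status` uses the labels "current", "archived", and "draft".

```python
from datetime import date


def resolve_conflict(passages):
    """Return the passages the answer may rely on.

    Drops drafts, then keeps the newest. If several passages share the
    newest date, all of them are returned so the answer can report the
    conflict and cite each one.
    """
    candidates = [p for p in passages if p["status"] != "draft"]
    if not candidates:
        return []
    newest = max(p["updated_at"] for p in candidates)
    return [p for p in candidates if p["updated_at"] == newest]


# Hypothetical sources and dates.
passages = [
    {"source": "enterprise-policy.md", "status": "current", "updated_at": date(2024, 3, 1)},
    {"source": "legacy-enterprise.md", "status": "archived", "updated_at": date(2021, 6, 15)},
    {"source": "enterprise-draft.md", "status": "draft", "updated_at": date(2024, 9, 9)},
]
winners = resolve_conflict(passages)
print([p["source"] for p in winners])  # ['enterprise-policy.md']
```

The draft is the newest document here, but it never reaches the comparison. When more than one passage comes back, that is the signal to tell the model the documents conflict rather than letting it pick.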

Summarization Is Not A Free Compression Layer

Another tempting shortcut is to summarize everything first, then ask questions over the summaries. This can work, but it changes the task.

A summary is lossy. It preserves what the summarizer thought was important at the time. Later, a user may ask about a detail the summary dropped.

This is especially risky for:

  • Contract exceptions
  • Numeric thresholds
  • Dates and deadlines
  • Security procedures
  • Edge-case policy language

A better pattern is hierarchical retrieval. Store summaries for broad navigation, but keep the original passages available for final grounding.

step 1: retrieve document or section summary
step 2: retrieve exact passages inside that section
step 3: answer from exact passages, not only the summary

Summaries are maps. They are not the territory.
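
Here is a toy version of those three steps. The word-overlap score is a stand-in I chose to keep the example self-contained; a real system would likely use embeddings or a keyword index, and the section data is invented.

```python
def overlap(query, text):
    """Toy relevance score: number of shared lowercase words."""
    return len(set(query.lower().split()) & set(text.lower().split()))


def hierarchical_retrieve(query, sections, top_passages=2):
    # Step 1: pick the section whose summary best matches the query.
    best = max(sections, key=lambda s: overlap(query, s["summary"]))
    # Step 2: rank the original passages inside that section.
    ranked = sorted(best["passages"], key=lambda p: overlap(query, p), reverse=True)
    # Step 3: return exact passages for grounding, not the summary.
    return best["title"], ranked[:top_passages]


# Hypothetical corpus: summaries for navigation, original passages kept alongside.
sections = [
    {
        "title": "Termination",
        "summary": "termination and notice rules for enterprise customers",
        "passages": [
            "Enterprise customers must give written termination notice.",
            "Notice must be sent to the account manager.",
            "Fees already paid are not refunded.",
        ],
    },
    {
        "title": "Billing",
        "summary": "invoicing schedule and payment terms",
        "passages": ["Invoices are issued monthly."],
    },
]
title, passages = hierarchical_retrieve("enterprise termination notice period", sections)
print(title)     # Termination
print(passages)  # the two passages that mention termination or notice
```

The summary is only used to choose where to look. What reaches the prompt is the original wording, so a detail the summarizer dropped is still available.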

When Long Context Helps

Long context is genuinely valuable when the task requires synthesis across many pieces of evidence.

Good uses:

  • Comparing multiple contract versions
  • Summarizing a long meeting transcript
  • Reviewing a full design document
  • Answering questions over a small set of known documents
  • Maintaining conversation state across a complex workflow

Even then, structure helps. Use section headings, document boundaries, timestamps, and clear source labels. The model should not have to infer where one document ends and another begins.

A Practical Architecture

For most applications, I would use a hybrid approach:

  1. Use retrieval to select candidate evidence
  2. Re-rank candidates for the specific query
  3. Expand selected chunks to parent sections
  4. Include metadata and citation IDs
  5. Put the highest-confidence evidence near the question
  6. Ask the model to cite sources and identify missing evidence

This uses long context as a workspace, not a landfill.
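
Steps 4 through 6 mostly come down to how the prompt is assembled. The sketch below is one possible layout, not the only one: the `[S1]` label format, the field names, and the choice to put the question last (so the best evidence sits right above it) are my assumptions.

```python
def build_prompt(question, evidence):
    """Assemble a prompt with labeled sources and the best evidence last."""
    # Ascending sort puts the highest-scoring passage right before the question.
    ordered = sorted(evidence, key=lambda e: e["score"])
    blocks = []
    for i, item in enumerate(ordered, start=1):
        # Explicit open and close labels mark where each document begins and ends.
        blocks.append(
            f"[S{i}] source={item['source']} status={item['status']}\n"
            f"{item['text']}\n"
            f"[/S{i}]"
        )
    instructions = (
        "Answer using only the sources above. Cite source IDs like [S1]. "
        "If the sources do not contain the answer, say what is missing."
    )
    return "\n\n".join(blocks + [instructions, f"Question: {question}"])


# Hypothetical re-ranked evidence.
evidence = [
    {"source": "enterprise-policy.md", "status": "current", "score": 0.91,
     "text": "Termination clause for enterprise customers."},
    {"source": "sales-faq.md", "status": "current", "score": 0.40,
     "text": "General cancellation FAQ."},
]
prompt = build_prompt("What is the termination notice period for enterprise customers?", evidence)
print(prompt)
```

Because the citation IDs are assigned at assembly time, you can log the mapping from `[S1]` back to the source file and check later which evidence the answer actually relied on.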

Conversation History Needs Compression Rules

Chat history is one of the easiest ways to waste a long context window.

Most old turns are not equally important. A user’s account ID, constraints, and unresolved decisions may matter. Polite greetings, abandoned branches, and earlier wrong answers usually do not.

Instead of sending the full transcript forever, keep a structured state object:

{
  "user_goal": "compare two vendor contracts",
  "confirmed_constraints": ["focus on termination", "ignore pricing"],
  "open_questions": ["which jurisdiction applies?"],
  "decisions": ["use vendor B template as baseline"]
}

Then include the recent conversation plus this state summary. This gives the model continuity without making it reread the entire session.

The danger is stale summaries. If the user changes their mind, the state must update. Otherwise the model will follow old constraints with confidence.
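
A small update function makes that rule concrete. This is a sketch that reuses the field names from the state object above; the `update_state` helper and its parameters are hypothetical, and deciding what the user changed is assumed to happen upstream.

```python
def update_state(state, *, add_constraints=(), drop_constraints=(),
                 resolved_questions=(), new_decisions=()):
    """Return a new state dict with the user's latest changes applied."""
    # Remove constraints the user has withdrawn, then add new ones once.
    constraints = [c for c in state["confirmed_constraints"] if c not in drop_constraints]
    constraints += [c for c in add_constraints if c not in constraints]
    return {
        "user_goal": state["user_goal"],
        "confirmed_constraints": constraints,
        "open_questions": [q for q in state["open_questions"] if q not in resolved_questions],
        "decisions": state["decisions"] + list(new_decisions),
    }


state = {
    "user_goal": "compare two vendor contracts",
    "confirmed_constraints": ["focus on termination", "ignore pricing"],
    "open_questions": ["which jurisdiction applies?"],
    "decisions": ["use vendor B template as baseline"],
}

# The user changes their mind about pricing.
state_v2 = update_state(
    state,
    drop_constraints=["ignore pricing"],
    add_constraints=["include pricing"],
)
print(state_v2["confirmed_constraints"])  # ['focus on termination', 'include pricing']
```

Returning a new dict instead of mutating the old one keeps each version of the state available for logging, which helps when you need to explain why the model followed a particular constraint.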

Testing Long-Context Behavior

Long-context systems need their own evals. A normal RAG eval may only ask whether the right chunk was retrieved. A long-context eval should also test whether the model uses the right passage when many plausible passages are present.

Good tests include:

  • Put the correct answer near the beginning, middle, and end
  • Include a stale policy that conflicts with the current one
  • Include two similar customer segments with different rules
  • Ask for exact numbers that appear in multiple places
  • Add irrelevant but semantically similar sections

Then inspect citations. If the model gives the right answer but cites the wrong source, the system is not trustworthy yet.

The best long-context evaluation is adversarial in a boring way: reproduce the kinds of clutter your real documents contain.
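
The position test and the citation check are easy to generate mechanically. The helpers below are a sketch: the function names, the passage labels, and the shape of the model result (an answer plus a cited source) are assumptions, and calling the model is left out.

```python
def place_needle(distractors, needle, position):
    """Insert the needle passage at the beginning, middle, or end."""
    index = {
        "beginning": 0,
        "middle": len(distractors) // 2,
        "end": len(distractors),
    }[position]
    return distractors[:index] + [needle] + distractors[index:]


def build_cases(distractors, needle):
    """One test context per needle position."""
    return {pos: place_needle(distractors, needle, pos)
            for pos in ("beginning", "middle", "end")}


def passes(result, expected_answer, expected_source):
    """A case passes only if both the answer and the cited source are right."""
    return result["answer"] == expected_answer and result["cited"] == expected_source


# Hypothetical clutter of the kind listed above.
distractors = ["stale policy", "consumer policy", "draft language", "sales FAQ"]
cases = build_cases(distractors, "current enterprise policy")
for position, context in cases.items():
    print(position, context.index("current enterprise policy"))
# beginning 0
# middle 2
# end 4
```

Running the same question over all three contexts shows whether accuracy depends on where the passage sits. The `passes` check fails a right answer with a wrong citation, which matches the trust standard described above.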

The Takeaway

Long context windows are a powerful capability, but they do not remove the need for information architecture.

The model still needs the right evidence, in the right order, with the right metadata, under the right instructions. Retrieval is how you control that. Long context is how you give the model enough room to use it.

The winning pattern is not context dumping. It is retrieval plus thoughtful context design.