Rutvik Acharya


If you’ve built a Retrieval-Augmented Generation (RAG) system in the last year, you’ve probably used cosine similarity. It’s the default setting in almost every vector database. You take your query, embed it, run a cosine similarity search against your document chunks, and return the top K results.

It feels intuitive. It’s cheap to compute. But honestly? Relying completely on vanilla cosine similarity out-of-the-box is one of the most common reasons RAG pipelines quietly fail in production.

Let’s look at why cosine similarity isn’t exactly the silver bullet we treat it as, and explore the deeper nuances of how high-dimensional geometry actively works against your retrieval accuracy.

## The problem with just measuring angles

To understand why cosine similarity breaks down, we have to look at what it actually measures.

Cosine similarity looks at the angle between two vectors in a high-dimensional space. If they point in the exact same direction, the score is 1. If they point in opposite directions, it’s -1.

Notice what’s missing? Magnitude.

Cosine similarity normalizes everything. It forces every single vector onto the surface of a hypersphere, treating the distance from the origin (the length of the vector) as completely irrelevant. It assumes that only the direction encodes semantic meaning. That holds reasonably well for models trained to output unit-length vectors, but it’s a massive oversimplification for complex, real-world text.
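To make that concrete, here’s a tiny NumPy sketch (toy 3-dimensional vectors standing in for real embeddings): scaling a vector changes its length dramatically, but its cosine score against anything doesn’t move at all.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """cos(theta) = (a . b) / (||a|| * ||b||) -- the norms cancel any scaling."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

doc = np.array([0.3, 0.8, 0.1])
scaled = 10.0 * doc  # same direction, 10x the length

# Identical scores: cosine literally cannot see the difference in magnitude.
print(cosine_similarity(doc, doc), cosine_similarity(doc, scaled))
```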

Here is where that assumption starts to fall apart.

### 1. The “Negation” Blindspot

Vector embeddings are generally great at capturing the “topic” of a sentence, but they often struggle with polarity or exact logic. This is because “topic” and “truth value” live on different geometric planes.

Imagine you’re building a legal RAG system, and the user asks: “Is the defendant permitted to contact the plaintiff?”

Now, look at these two document chunks in your database and their likely similarity scores:

| Document Chunk | Cosine Similarity Score |
| --- | --- |
| “The defendant is expressly permitted to contact the plaintiff.” | 0.965 |
| “The defendant is expressly NOT permitted to contact the plaintiff.” | 0.958 |

To a human, these mean the exact opposite thing. To a basic embedding model, these sentences are nearly identical. They share 90% of the same tokens and exist in the exact same semantic neighborhood (legal contact permissions).

Because cosine similarity only looks at the general direction of these topics, the 0.007 difference in score is essentially noise. Your RAG system retrieves the negation, feeds it to the LLM, and suddenly your chatbot gives the user disastrous legal advice because it was “the most relevant” document.
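You can reproduce the geometry of this failure with hand-built vectors (a toy sketch: a strong shared “topic” axis plus a weak “polarity” axis, standing in for what a real 1536-dim embedding model does). Flipping the polarity barely moves the vector, so both chunks score almost identically against the query:

```python
import numpy as np

def cos(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(0)
# A shared "topic" direction (legal contact permissions) dominates both chunks.
topic = rng.normal(size=256)
# Polarity lives on a different, much weaker axis.
polarity = rng.normal(size=256)

query = topic + 0.05 * rng.normal(size=256)
permitted = topic + 0.10 * polarity
not_permitted = topic - 0.10 * polarity  # opposite meaning, tiny geometric shift

# Both scores land in the same high band; the gap between them is noise.
print(cos(query, permitted), cos(query, not_permitted))
```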

### 2. The Anisotropy Problem (The “Everything is 0.8” issue)

One of the most documented but least understood issues with LLM embeddings is Anisotropy.

In an ideal world, embeddings would be spread out evenly across the entire vector space. In reality, they almost always cluster into a narrow “cone.” If you visualize 1536-dimensional space, all your documents are huddled together in a tiny corner of the room.

This results in “score squashing.” Instead of seeing a healthy range of scores from 0.0 to 1.0, you see this:

```text
Result 1: 0.8241
Result 2: 0.8238
Result 3: 0.8235
...
Result 100: 0.8190
```

This makes it virtually impossible to set a meaningful threshold for filtering out “bad” results. If you set your filter at 0.8, you might miss a great chunk that’s just slightly off-center. If you set it at 0.75, you’ll drown your LLM in noise because the model literally cannot push irrelevant documents far enough away in this crowded cone.
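One cheap mitigation worth knowing (a sketch, not a cure): mean-centering. Subtracting the corpus centroid removes the large component every embedding shares, which undoes much of the cone and spreads the scores back out. The simulation below fakes anisotropy with a big common component:

```python
import numpy as np

def cos(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(42)
dim, n = 256, 200
# Simulate anisotropy: every embedding shares one large common component.
common = rng.normal(size=dim) * 5.0
docs = common + rng.normal(size=(n, dim))
query = common + rng.normal(size=dim)

raw = [cos(query, d) for d in docs]
print(f"raw scores:      {min(raw):.3f} .. {max(raw):.3f}")  # squashed near the top

# Mean-center: subtract the corpus centroid before comparing.
centroid = docs.mean(axis=0)
centered = [cos(query - centroid, d - centroid) for d in docs]
print(f"centered scores: {min(centered):.3f} .. {max(centered):.3f}")  # wider spread
```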

### 3. The Curse of Hubness

In very high-dimensional spaces - like the 1536 dimensions used by OpenAI’s standard embeddings - distance metrics start acting weird. One of the most annoying phenomena is called “hubness.”

Basically, a small subset of your vectors will naturally settle into positions where they are suspiciously close to a lot of other vectors. These “hub” documents end up being retrieved for a disproportionately wide variety of queries, even when they aren’t actually relevant.

If you’ve ever had a RAG system that keeps inexplicably returning the same generic introductory paragraph or boilerplate disclaimer no matter what you search for, you’ve experienced the hubness problem. Cosine similarity suffers heavily from this because it ignores the structural density of the space.
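Hubness is measurable. The sketch below (pure NumPy, random toy data rather than real embeddings) counts how often each vector appears in the other vectors’ top-k neighbor lists; in a perfectly uniform space every point would appear about k times, but in high dimensions the counts skew, with a few “hubs” showing up far more often:

```python
import numpy as np

def knn_occurrence_counts(X: np.ndarray, k: int = 10) -> np.ndarray:
    """For each point: how many other points list it among their k nearest
    neighbors by cosine similarity (the 'k-occurrence' used to measure hubness)."""
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    sims = Xn @ Xn.T
    np.fill_diagonal(sims, -np.inf)          # a point is not its own neighbor
    topk = np.argsort(-sims, axis=1)[:, :k]  # each point's k nearest neighbors
    return np.bincount(topk.ravel(), minlength=len(X))

rng = np.random.default_rng(7)
counts = knn_occurrence_counts(rng.normal(size=(500, 1024)), k=10)
print("mean ~10, but the biggest hub appears in", counts.max(), "neighbor lists")
```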

### 4. Throwing away the “Confidence” of Magnitude

Some newer embedding models actually encode meaning into the length of the vector.

A very short, vague sentence like “The system failed” might point in a specific direction but have a small magnitude. A dense, highly specific paragraph like “The payment processing system failed violently at 3 AM due to a Redis connection timeout” might point in a similar direction but have a much larger magnitude because it contains more concentrated information.

When you use cosine similarity, you divide by the magnitude: similarity = (A · B) / (||A|| * ||B||)

You are explicitly taking that rich, dense paragraph and artificially scaling it down to be perfectly equal to the vague sentence. You throw away the model’s signal about how “dense” or useful the document actually is.


## Moving Beyond Basic Retrieval

If basic cosine similarity is the floor, where do we go next? You don’t have to rip out your vector database, but you do need to layer in some better logic.

### 1. Switch to Dot Product (The “Magnitude” Fix)

If you are using an embedding model that doesn’t strictly normalize its outputs to length 1 (some modern open-source models like BGE or Nomic don’t), try switching your index to Dot Product (Inner Product).

Dot product measures both the angle and the magnitude. It naturally gives a “boost” to documents that are more information-dense.
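A toy comparison (hand-built vectors, not real embeddings) of how the two metrics rank a vague chunk against an information-dense one pointing in the same direction:

```python
import numpy as np

query = np.array([1.0, 1.0, 0.0])
vague = np.array([0.5, 0.5, 0.0])  # "The system failed" -- short, low magnitude
dense = np.array([2.0, 2.0, 0.0])  # detailed incident report -- same direction, longer

def cos(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Cosine: a dead tie -- both point exactly at the query.
print(cos(query, vague), cos(query, dense))

# Dot product: the dense chunk wins, because its magnitude counts.
print(float(query @ vague), float(query @ dense))
```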

### 2. Learn to Love the Cross-Encoder

This is the gold standard fix for the “Negation” problem. Instead of relying solely on your vector database’s cosine similarity, use a two-stage retrieval pipeline.

Look how the scores change when we re-rank the “Not” problem examples from earlier using a Cross-Encoder:

| Method | Pair | Score |
| --- | --- | --- |
| Bi-Encoder (Cosine) | “Permitted” vs “Not Permitted” | 0.958 (Ambiguous) |
| Cross-Encoder | “Permitted” vs “Not Permitted” | 0.012 (Clear Rejection) |

A Cross-Encoder reads the query and the document together in a single pass. It doesn’t treat them as independent vectors; it uses attention across both texts simultaneously. It’s slower, but if you only run it on your top 50 results (fetched via cosine), the added latency is negligible.

```python
from sentence_transformers import CrossEncoder

model = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')
scores = model.predict([
    ('Is contact permitted?', 'The defendant is permitted...'),
    ('Is contact permitted?', 'The defendant is NOT permitted...')
])
# Result: [0.98, 0.02] -> The ambiguity is gone.
```

### 3. Hybrid Search: The Old-School Filter

Never underestimate the power of exact keyword matching. Semantic search is notoriously bad at exact part numbers (XJ-9000), specific names, or strict acronyms.

By combining your vector search with traditional keyword search (BM25) and blending the scores using Reciprocal Rank Fusion (RRF), you get the best of both worlds.

The formula for RRF is simple: `score = sum(1 / (k + rank_i))`, where k is a smoothing constant (60 is the common default), rank_i is the document’s rank in the i-th result list, and the sum runs over every list that returned the document.

This ensures that if a document is a perfect keyword match (Rank 1 in BM25) but only a “decent” semantic match (Rank 20 in Cosine), it still surfaces to the top.
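A minimal RRF sketch in plain Python (the document IDs and ranked lists below are made up for illustration):

```python
from collections import defaultdict

def reciprocal_rank_fusion(ranked_lists, k=60):
    """Blend ranked lists of doc IDs: score(doc) = sum over lists of 1 / (k + rank)."""
    scores = defaultdict(float)
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

bm25_results = ["doc_xj9000", "doc_b", "doc_c"]             # exact keyword match on top
cosine_results = ["doc_a", "doc_c", "doc_xj9000", "doc_b"]  # semantic ranking

fused = reciprocal_rank_fusion([bm25_results, cosine_results])
print(fused[0][0])  # doc_xj9000 surfaces first despite a middling semantic rank
```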

### 4. Late Interaction (ColBERT)

If you have a particularly difficult dataset where nuances are getting lost, look into ColBERT.

Instead of turning a whole paragraph into a single vector (compressing it into 1536 numbers), ColBERT turns every word into its own vector. When you search, it performs a “MaxSim” operation - it finds the best matching token in the document for every token in your query.

This preserves the fine-grained structure. If your query mentions “not,” that specific “not” vector will literally search for its counterpart in the document, making it much harder for negations to hide in the “average” of a single vector.
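The MaxSim operation itself is easy to sketch (the toy per-token vectors below stand in for ColBERT’s real learned token embeddings): for each query token, take its best cosine match among the document tokens, then sum those maxima.

```python
import numpy as np

def maxsim_score(query_tokens: np.ndarray, doc_tokens: np.ndarray) -> float:
    """ColBERT-style late interaction: for every query token, find its
    best-matching doc token by cosine, then sum those per-token maxima."""
    q = query_tokens / np.linalg.norm(query_tokens, axis=1, keepdims=True)
    d = doc_tokens / np.linalg.norm(doc_tokens, axis=1, keepdims=True)
    return float((q @ d.T).max(axis=1).sum())

rng = np.random.default_rng(3)
shared = rng.normal(size=(3, 64))   # tokens both documents share
not_vec = rng.normal(size=(1, 64))  # a distinctive "not" token

query = np.vstack([shared, not_vec])         # query contains "not"
doc_with_not = np.vstack([shared, not_vec])  # so does this document
doc_without = shared                         # same topic, no "not"

# The "not" token finds an exact counterpart only in the first document.
print(maxsim_score(query, doc_with_not), maxsim_score(query, doc_without))
```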

## The Takeaway: It’s just a starting point

Cosine similarity isn’t evil - it’s just a tool with very specific limitations. It assumes your vector space is isotropic (it isn’t), that magnitude doesn’t matter (it does), and that “topics” are enough for truth (they aren’t).

As your RAG system matures, stop trusting the raw scores your vector DB gives you. Start layering in re-rankers, hybrid search, and multi-vector logic. Your users - and your LLM - will thank you.