Retrieval-Augmented Generation (RAG) is the standard architecture for giving LLMs access to private, up-to-date data. However, a common failure mode in production RAG pipelines is poor retrieval accuracy: you construct a perfect prompt, but the system retrieves irrelevant documents, so the model has nothing useful to ground its answer in.

This often happens because off-the-shelf embedding models (like text-embedding-3 or bge-m3) are trained on general internet corpora. They understand general semantics but lack the nuance required for specialized domains like legal contracts, medical records, or proprietary codebases.

In this post, we’ll explore why generic embeddings fail and how to fine-tune them for your specific data distribution.

The Semantic Gap

Standard embedding models map text to a high-dimensional vector space where “semantically similar” texts are grouped together. For a general model, “Apple” might be strongly correlated with “Fruit” and “iPhone”.

However, in a corporate finance context, “Apple” should correlate strictly with “AAPL Quarterly Earnings”. The general model’s training objective conflicts with your retrieval objective.

Fine-tuning effectively warps this vector space. It pushes unrelated concepts (like “Fruit”) away and pulls domain-relevant concepts (like “Earnings Report”) closer to your query anchor.
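You can observe this gap on your own data before committing to fine-tuning. Below is a minimal sketch using sentence-transformers; the model name and example strings are placeholders, and the scores you get will depend on the model you test.

```python
from sentence_transformers import SentenceTransformer, util

# Any off-the-shelf generalist model works for this check (name is illustrative).
model = SentenceTransformer('all-MiniLM-L6-v2')

query, fruit_doc, finance_doc = model.encode([
    'Apple',
    'Fresh fruit and produce delivery',
    'AAPL Quarterly Earnings report',
])

# For a generalist model, the first score is often uncomfortably close to the second.
print(util.cos_sim(query, fruit_doc))
print(util.cos_sim(query, finance_doc))
```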

Methodology

There are several approaches to adapting embedding models. We will focus on the two highest-leverage techniques for modern applications.

1. Contrastive Learning

Contrastive learning is the foundation of most embedding training. It works on triplets:

  • Anchor: The query (e.g., “login failure error 500”).
  • Positive: The correct document (e.g., “System authentication timeout protocols”).
  • Negative: A hard negative document (e.g., “User password reset guide”).

The loss function minimizes the distance between the Anchor and Positive while maximizing the distance between the Anchor and Negative.

Why “Hard Negatives” matter: Using random documents as negatives is easy but inefficient. The model learns far more from distinguishing between similar-but-wrong documents (hard negatives) than from documents that are obviously unrelated.
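To make the triplet setup concrete, here is a minimal sketch using sentence-transformers' TripletLoss. The strings mirror the example above, and the base model name is only a placeholder for whatever you start from.

```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

model = SentenceTransformer('all-MiniLM-L6-v2')  # placeholder base model

# Each example packs (anchor, positive, hard negative) in that order.
triplets = [
    InputExample(texts=[
        'login failure error 500',                   # anchor: the user query
        'System authentication timeout protocols',   # positive: the correct document
        'User password reset guide',                 # hard negative: similar topic, wrong answer
    ]),
]

triplet_loader = DataLoader(triplets, shuffle=True, batch_size=16)
# Pulls anchor/positive together and pushes the negative at least `triplet_margin` further away.
triplet_loss = losses.TripletLoss(model=model, triplet_margin=0.5)
```

These objects plug into model.fit exactly like the pair-based objective in the implementation guide below. If you only have (query, positive) pairs, losses.MultipleNegativesRankingLoss is a popular alternative because it treats the other examples in each batch as negatives.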

2. Matryoshka Representation Learning (MRL)

Matryoshka Representation Learning is a newer technique that structures the information within the embedding vector itself. It forces important semantic information into the earlier dimensions of the vector.

In a standard 1536-dimensional vector, the information is diffused. In an MRL-trained vector, the first 64 or 128 dimensions capture the vast majority of the semantic meaning.

  • Benefit: You can truncate vectors during retrieval (e.g., using only the first 256 dimensions), reducing storage and latency by 5x-10x with less than 2% degradation in retrieval quality.
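As a rough sketch of what truncation looks like at query time (assuming a model trained with Matryoshka dimensions, such as the Nomic model used below), you slice the vector and re-normalize before computing cosine similarity:

```python
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('nomic-ai/nomic-embed-text-v1.5', trust_remote_code=True)

full = model.encode('AAPL quarterly earnings guidance')  # full-width vector (768 dims for this model)
short = full[:256]                                       # keep only the leading, information-dense dimensions
short = short / np.linalg.norm(short)                    # re-normalize so cosine similarity stays meaningful
```

Newer releases of sentence-transformers also expose a truncate_dim setting on the model itself, which handles this slicing for you.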

Implementation Guide

We will use the sentence-transformers library for this. Here is the complete workflow:

1. Initialize a Base Model

We start with a pre-trained “Generalist” model. We choose Nomic because it supports dynamic resizing (Matryoshka) out of the box.

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('nomic-ai/nomic-embed-text-v1.5', trust_remote_code=True)
```
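As a quick sanity check, you can confirm the base embedding width; it is the reason the Matryoshka dimensions in step 3 top out at 768.

```python
# Nomic's v1.5 model emits 768-dimensional vectors by default.
print(model.get_sentence_embedding_dimension())  # 768
```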

2. Prepare Training Data

Create pairs of positive and negative examples.

```python
from torch.utils.data import DataLoader
from sentence_transformers import InputExample

train_examples = [
    InputExample(
        texts=['deployment failed with error 503', 'Check load balancer health status'],
        label=1.0  # Positive match
    ),
    InputExample(
        texts=['deployment failed with error 503', 'Update frontend CSS styles'],
        label=0.0  # Negative match
    )
]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)
```
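If you don't already have good negatives, one common tactic is to mine them with the base model itself: retrieve the top hits for each query and keep the high-ranking wrong documents as hard negatives. Here is a minimal sketch; the corpus strings are illustrative.

```python
from sentence_transformers import util

# Embed a small corpus and a query with the (not yet fine-tuned) base model.
corpus = [
    'Check load balancer health status',   # known positive for this query
    'Update frontend CSS styles',
    'Rotate TLS certificates',
]
corpus_emb = model.encode(corpus)
query_emb = model.encode(['deployment failed with error 503'])

# Top-ranked documents that are NOT the known positive make good hard negatives.
hits = util.semantic_search(query_emb, corpus_emb, top_k=3)[0]
hard_negatives = [
    corpus[h['corpus_id']] for h in hits
    if corpus[h['corpus_id']] != 'Check load balancer health status'
]
```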

3. Define Matryoshka Loss

This is the magic step. We tell the model to optimize specifically for the first 64, 128, etc., dimensions.

```python
from sentence_transformers import losses

train_loss = losses.MatryoshkaLoss(
    model=model,
    loss=losses.CosineSimilarityLoss(model),
    matryoshka_dims=[768, 512, 256, 128, 64]
)
```

4. Run Training

Fit the model. This usually takes minutes to hours depending on dataset size.

```python
model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=10,         # full passes over the training pairs
    warmup_steps=100   # linear learning-rate warmup before the schedule decays
)
```

5. Save & Deploy

Save your new specialist model. It is now ready to replace your generic provider.

```python
model.save('financial-embeddings-v1')
```
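Once saved, the specialist loads exactly like any other sentence-transformers model; the query and document strings below are only illustrative.

```python
from sentence_transformers import SentenceTransformer

# Point at the directory we just saved.
specialist = SentenceTransformer('financial-embeddings-v1')

query_vec = specialist.encode('Q3 revenue guidance for AAPL')
doc_vecs = specialist.encode([
    'AAPL Quarterly Earnings call transcript',
    'Seasonal apple harvest outlook',
])
```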

Conclusion

Fine-tuning your embedding model is often a higher-ROI activity than prompt engineering or switching LLMs. By explicitly teaching the model the “jargon” and relationships of your dataset, you ensure that the Retrieval step of RAG provides high-quality context to the Generation step.

For most use cases, starting with a strong base model like Nomic or BGE and applying Contrastive Learning on your specific data pairs will yield significant improvements.

MRL for Embeddings
https://rutvikacharya.com/blog/fine-tuning-embeddings
Author: Rutvik Acharya
Published: March 12, 2025