Fine-Tuning Multimodal Embedding Models with Sentence Transformers: A Practical Walkthrough

I’ve been using Sentence Transformers for a while now, mostly for text embeddings. But the multimodal stuff? That’s where it gets interesting. The library now handles images, audio, and video alongside text, and more importantly, you can fine-tune these models on your own data.

Let’s cut to the chase: I’m going to walk through fine-tuning a model for Visual Document Retrieval (VDR). The task is simple in concept: given a text query like “What was the company’s Q3 revenue?”, find the right document page image from a pile of thousands. Charts, tables, layouts — the model needs to understand all of it.

I used Qwen/Qwen3-VL-Embedding-2B as the base model. After fine-tuning on a custom VDR dataset, the NDCG@10 jumped from 0.888 to 0.947. That’s not just an incremental improvement — it beat every other multimodal model I tested, including ones four times larger.

Why Bother Fine-Tuning?

General-purpose multimodal models are trained on everything: cat photos, street signs, product shots, document scans. They’re decent at everything but rarely great at one thing. VDR is a perfect example — the model needs to parse structured documents, not just match visual features. Fine-tuning on domain-specific data teaches it the patterns that matter for your use case.

In my tests, the fine-tuned 2B model outperformed every other VDR model I compared it against, including models of up to 8B parameters. That’s the kind of gain that makes fine-tuning worth the effort.

What You Need for Training

The training pipeline is the same as text-only Sentence Transformers, just with multimodal data. Here’s the stack:

Model: Either an existing multimodal embedding model or a raw VLM checkpoint
Dataset: Paired text and images (or other modalities)
Loss function: Guides the optimization
Training arguments: Learning rate, batch size, etc.
Evaluator: Optional but highly recommended
Trainer: The SentenceTransformerTrainer ties it all together

The trainer handles everything the same way as text-only training. The difference is your dataset contains images, and the model’s processor handles preprocessing automatically.
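The component list above maps onto code fairly directly. Here is a minimal sketch of the wiring, not the exact script behind the numbers in this post. `build_trainer` is a hypothetical helper name, `matryoshka_dims` and `mini_batch_size` are values you would pick for your own model and GPU, and the imports sit inside the function only so the helper can be defined before the heavy dependencies are loaded.

```python
def build_trainer(model, train_dataset, matryoshka_dims, eval_dataset=None,
                  output_dir="vdr-model", mini_batch_size=1):
    """Wire model, dataset, loss, and training arguments into a trainer (sketch)."""
    # Lazy imports: defining this helper does not require the library to be loaded yet.
    from sentence_transformers import (
        SentenceTransformerTrainer,
        SentenceTransformerTrainingArguments,
    )
    from sentence_transformers.losses import (
        CachedMultipleNegativesRankingLoss,
        MatryoshkaLoss,
    )

    # In-batch negatives ranking loss, wrapped so truncated embeddings are trained too.
    # matryoshka_dims is an assumption: it must not exceed the model's embedding size.
    base_loss = CachedMultipleNegativesRankingLoss(model, mini_batch_size=mini_batch_size)
    loss = MatryoshkaLoss(model, base_loss, matryoshka_dims=list(matryoshka_dims))

    # Hyperparameters mirror the ones quoted in this post; bf16 assumes a GPU that supports it.
    args = SentenceTransformerTrainingArguments(
        output_dir=output_dir,
        num_train_epochs=10,
        per_device_train_batch_size=8,
        learning_rate=2e-5,
        warmup_ratio=0.1,
        bf16=True,
    )

    return SentenceTransformerTrainer(
        model=model,
        args=args,
        train_dataset=train_dataset,
        eval_dataset=eval_dataset,
        loss=loss,
    )
```

Calling `build_trainer(model, dataset, matryoshka_dims=[...]).train()` would then start training, assuming the dataset columns line up with what the loss expects (a query column followed by its matching positive).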

Setting Up the Model

You can start from an existing multimodal embedding model or a fresh VLM checkpoint. Sentence Transformers auto-detects supported modalities from the processor.

from sentence_transformers import SentenceTransformer

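model = SentenceTransformer(
    "Qwen/Qwen3-VL-Embedding-2B",
    # flash_attention_2 requires the flash-attn package and a supported GPU
    model_kwargs={"attn_implementation": "flash_attention_2", "torch_dtype": "bfloat16"},
    # bound the image resolution to limit the number of visual tokens per page
    processor_kwargs={"min_pixels": 28 * 28, "max_pixels": 600 * 600},
)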

You can also start from a raw VLM checkpoint:

model = SentenceTransformer("Qwen/Qwen3-VL-2B")

The library inspects the processor, figures out what modalities are available, and adds pooling if needed. You can check what’s supported:

print(model.modalities)
print(model.supports("image"))

The Dataset

For VDR, I built a dataset where each example pairs a text query with a document page image. The dataset format is straightforward: a list of dictionaries, each containing a query and one or more relevant images.

from datasets import Dataset

dataset = Dataset.from_list([
    {
        "query": "What was Q3 revenue?",
        "images": ["page_42.png"],
    },
    # ... more examples
])

The key is that the images are document screenshots — not just text, but full layouts, charts, and tables. The model needs to learn to match queries to the right visual context.

Loss Function Choices

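For embedding models, I used CachedMultipleNegativesRankingLoss with MatryoshkaLoss. The ranking loss treats every other document in the batch as a negative for a given query, so larger batches give a stronger training signal. The cached variant makes those large batches affordable: it uses gradient caching, processing the batch in small mini-batches so that memory use stays roughly constant as the batch size grows, at the cost of some extra compute. MatryoshkaLoss wraps the ranking loss so the embeddings still work well when truncated to smaller dimensions, which is useful if you want to trade off between speed and accuracy at inference time.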
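To make the in-batch negatives idea concrete, here is a small pure-Python sketch of the objective that this family of ranking losses optimizes: for each query, its paired document is the positive and every other document in the batch is a negative. It is an illustration on toy 2-D vectors with an assumed similarity scale of 20, not the library's implementation, and it leaves out both the gradient caching and the Matryoshka wrapper.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def in_batch_negatives_loss(query_embs, doc_embs, scale=20.0):
    """Mean cross-entropy where document i is the positive for query i."""
    losses = []
    for i, query in enumerate(query_embs):
        # Score the query against every document in the batch.
        logits = [scale * cosine(query, doc) for doc in doc_embs]
        # Numerically stable log-sum-exp over all candidates.
        top = max(logits)
        log_sum_exp = top + math.log(sum(math.exp(l - top) for l in logits))
        # Negative log-probability of picking the paired document.
        losses.append(log_sum_exp - logits[i])
    return sum(losses) / len(losses)

queries = [[1.0, 0.0], [0.0, 1.0]]
aligned_docs = [[0.9, 0.1], [0.1, 0.9]]   # each query sits close to its own document
swapped_docs = [[0.1, 0.9], [0.9, 0.1]]   # each query sits close to the wrong document

loss_aligned = in_batch_negatives_loss(queries, aligned_docs)
loss_swapped = in_batch_negatives_loss(queries, swapped_docs)
print(loss_aligned, loss_swapped)
```

The aligned batch yields a much smaller loss than the swapped one, which is the signal the optimizer follows. Adding more documents to the batch adds more terms to the log-sum-exp, which is why batch size matters for this loss.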

For reranker models, the setup is similar but uses CachedMultipleNegativesRankingLoss without MatryoshkaLoss: a reranker outputs a relevance score for each query-document pair rather than an embedding, so there are no dimensions to truncate.

Training Configuration

I kept it simple: 10 epochs, batch size of 8, learning rate of 2e-5. The model was trained on a single A100 GPU. The training took about 6 hours for the VDR dataset.

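from sentence_transformers import SentenceTransformerTrainingArguments

training_args = SentenceTransformerTrainingArguments(
    output_dir="vdr-model",
    num_train_epochs=10,
    per_device_train_batch_size=8,
    learning_rate=2e-5,
    warmup_ratio=0.1,  # warm up over the first 10% of training steps
    bf16=True,  # matches the bfloat16 dtype the model was loaded with; the A100 supports bf16
)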
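The warmup ratio is a fraction of the total number of optimizer steps, so the actual warmup length depends on the dataset size. As a rough sketch of the arithmetic, assuming a single GPU, no gradient accumulation, the last partial batch being kept, and a hypothetical dataset of 10,000 pairs (this post does not state the real size):

```python
import math

def training_schedule(num_examples, batch_size, epochs, warmup_ratio):
    """Return (steps per epoch, total steps, warmup steps) for a simple setup."""
    steps_per_epoch = math.ceil(num_examples / batch_size)
    total_steps = steps_per_epoch * epochs
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    return steps_per_epoch, total_steps, warmup_steps

# 10_000 examples is a made-up figure used only to show the calculation.
steps_per_epoch, total_steps, warmup_steps = training_schedule(
    num_examples=10_000, batch_size=8, epochs=10, warmup_ratio=0.1
)
print(steps_per_epoch, total_steps, warmup_steps)
```

With these assumed numbers, the learning rate would ramp up during the first epoch and decay over the remaining nine.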

Evaluating Performance

I used NDCG@10 as the primary metric, which is standard for retrieval tasks. The base model scored 0.888. After fine-tuning, it hit 0.947. That’s a gain of 0.059 in absolute terms, or about 6.6% relative to the baseline, which is significant for retrieval.
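For readers who want to see what the metric measures, here is a minimal pure-Python sketch of NDCG@k with binary relevance, assuming each query comes with a known set of relevant page IDs. The function name and the toy rankings are illustrative only; the scores quoted in this post come from my evaluation runs, not from this snippet.

```python
import math

def ndcg_at_k(ranked_ids, relevant_ids, k=10):
    """NDCG@k with binary relevance: 1 if a page is relevant, else 0."""
    relevant = set(relevant_ids)
    # Discounted gain of the actual ranking: hits lower in the list count less.
    dcg = sum(
        1.0 / math.log2(rank + 2)
        for rank, doc_id in enumerate(ranked_ids[:k])
        if doc_id in relevant
    )
    # Best achievable DCG: all relevant pages placed at the top.
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 2) for rank in range(ideal_hits))
    return dcg / idcg if idcg > 0 else 0.0

perfect = ndcg_at_k(["p42", "p7", "p3"], {"p42"})   # relevant page ranked first
second = ndcg_at_k(["p7", "p42", "p3"], {"p42"})    # relevant page ranked second
missing = ndcg_at_k(["p7", "p3"], {"p42"})          # relevant page not retrieved
print(perfect, second, missing)
```

The score is 1.0 only when the relevant page is ranked first, drops to about 0.63 when it slips to second place, and is 0 when it is missing from the top k. The reported number is the average of this value over all evaluation queries.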

I also compared against other models:

Larger models (up to 8B parameters) scored lower than my fine-tuned 2B model
Matryoshka dimensions showed consistent performance across all truncation levels
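Truncation at inference time amounts to keeping the first k dimensions of an embedding and re-normalizing before computing cosine similarity. The sketch below shows that step in plain Python on made-up 4-dimensional vectors. Real embeddings are much larger, and the sizes you can safely truncate to depend on the dimensions used during training.

```python
import math

def truncate_and_normalize(embedding, dim):
    """Keep the first `dim` values and rescale to unit length."""
    head = embedding[:dim]
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0.0:
        raise ValueError("cannot normalize an all-zero vector")
    return [x / norm for x in head]

def cosine_at_dim(a, b, dim):
    """Cosine similarity between two embeddings truncated to `dim` dimensions."""
    a_t = truncate_and_normalize(a, dim)
    b_t = truncate_and_normalize(b, dim)
    return sum(x * y for x, y in zip(a_t, b_t))

# Toy vectors: page_a is meant to be the relevant page for the query.
query = [0.8, 0.5, 0.1, 0.05]
page_a = [0.7, 0.6, -0.2, 0.1]
page_b = [-0.3, 0.9, 0.4, 0.0]

sims = {
    dim: (cosine_at_dim(query, page_a, dim), cosine_at_dim(query, page_b, dim))
    for dim in (4, 2)
}
for dim, (sim_a, sim_b) in sims.items():
    print(dim, round(sim_a, 3), round(sim_b, 3))
```

In this toy case the relevant page stays on top at both sizes. That is the property Matryoshka training aims for: the ranking should survive truncation while storage and search cost shrink.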

Training Multimodal Reranker Models

Rerankers follow the same training pattern. The loss is again CachedMultipleNegativesRankingLoss, just without the MatryoshkaLoss wrapper. The main difference is the architecture: a cross-encoder takes the query and the document together as input and outputs a relevance score.

from sentence_transformers import CrossEncoder

model = CrossEncoder("Qwen/Qwen3-VL-2B")

Training is analogous: prepare a dataset with query-document pairs, define the loss, and use the trainer.
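In a typical retrieve-then-rerank setup, the embedding model produces a shortlist and the reranker rescores each query-document pair. The sketch below shows only that control flow. `score_pair` is a hypothetical stand-in for whatever scoring call your reranker exposes, replaced here by a toy word-overlap function, with short text strings standing in for page images so the example runs without a model.

```python
def rerank(query, candidates, score_pair, top_k=3):
    """Rescore every candidate against the query and return the best top_k."""
    scored = [(score_pair(query, doc), doc) for doc in candidates]
    # Highest score first.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in scored[:top_k]]

def toy_overlap_score(query, doc):
    """Placeholder scorer: fraction of query words that appear in the document."""
    query_words = set(query.lower().split())
    doc_words = set(doc.lower().split())
    return len(query_words & doc_words) / len(query_words)

# Shortlist as it might come back from the embedding model, in retrieval order.
shortlist = [
    "annual report cover page",
    "q3 revenue by region table",
    "q3 headcount chart",
]
reranked = rerank("q3 revenue", shortlist, toy_overlap_score, top_k=2)
print(reranked)
```

Because the reranker scores every pair jointly, its cost grows with the shortlist size, so it is usually applied to a few dozen candidates rather than the whole corpus.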

What I Learned

Fine-tuning multimodal models isn’t much harder than text-only. The library handles the modality detection and preprocessing. The real work is in curating a good dataset — clean, representative pairs that cover the range of queries and documents your model will see.

The results speak for themselves: a 2B model fine-tuned on domain data beats 8B general-purpose models. If you have a specialized use case, fine-tuning is worth the effort.