Sentence Transformers has been my go-to library for embedding and reranker models for years. It’s simple, well-maintained, and just works. The v5.4 update adds something I’ve been waiting for: multimodal support. You can now encode and compare texts, images, audio, and videos with the same API. No more cobbling together separate pipelines for each modality.
What’s New
Multimodal embedding models map inputs from different modalities into a shared embedding space. That means you can take a text query, compare it against image documents, and get meaningful similarity scores. Multimodal reranker models do the same for relevance scoring, handling mixed-modality pairs like text+image or image+audio.
This opens up obvious use cases: visual document retrieval, cross-modal search, multimodal RAG pipelines. But the real win is that you don’t need to learn a new API. If you’ve used Sentence Transformers before, you already know how to use this.
Installation
You’ll need extra dependencies depending on which modalities you want:
pip install -U "sentence-transformers[image]"
pip install -U "sentence-transformers[audio]"
pip install -U "sentence-transformers[video]"
pip install -U "sentence-transformers[image,video,train]"
A word of caution: VLM-based models like Qwen3-VL-2B need a GPU with at least 8 GB of VRAM. For the 8B variants, expect around 20 GB. On CPU, these models are painfully slow. If you don’t have a local GPU, use a cloud GPU service or Google Colab. For CPU inference, stick with text-only or CLIP models.
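Those VRAM figures are roughly what back-of-the-envelope arithmetic predicts. Here is a minimal sketch, assuming 16-bit weights (2 bytes per parameter) and counting the weights only. `weights_gb` is a hypothetical helper of mine, not part of the library, and activations, image tokens, and framework overhead all come on top:

```python
def weights_gb(params_billion: float, bytes_per_param: int = 2) -> float:
    """Approximate memory for model weights alone, in GB (10^9 bytes)."""
    # (params_billion * 1e9 parameters) * bytes_per_param / 1e9 bytes per GB
    return params_billion * bytes_per_param


# Assumption: weights stored at 16-bit precision (bf16/fp16).
for name, size in [("2B", 2.0), ("8B", 8.0)]:
    print(f"{name}: ~{weights_gb(size):.0f} GB of weights")
```

Under those assumptions the 2B model needs about 4 GB and the 8B model about 16 GB for weights alone, which is consistent with the 8 GB and 20 GB figures once activations and image inputs are added.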
Using Multimodal Embedding Models
Loading a model is identical to before:
from sentence_transformers import SentenceTransformer
model = SentenceTransformer("Qwen/Qwen3-VL-Embedding-2B")
The model auto-detects its supported modalities. No extra config needed unless you want to tweak image resolution or model precision.
Encoding Images
model.encode() now accepts images as URLs, local file paths, or PIL Image objects:
img_embeddings = model.encode([
    "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/tasks/car.jpg",
    "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/bee.jpg",
])
Cross-Modal Similarity
Since the model maps everything into the same space, you can compare text and image embeddings directly:
text_embeddings = model.encode([
    "A green car parked in front of a yellow building",
    "A red car driving on a highway",
    "A bee on a pink flower",
    "A wasp on a wooden table",
])
similarities = model.similarity(text_embeddings, img_embeddings)
Results behave as expected: “A green car…” matches the car image best (0.51), and “A bee…” matches the bee image best (0.67). Hard negatives get lower scores. But note that even the correct matches score well below 1.0. That’s the modality gap: embeddings from different modalities cluster in separate regions of the space. Cross-modal similarities are lower than within-modal ones, but the relative ordering holds, so retrieval still works.
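To make the "relative ordering" point concrete, here is a minimal pure-Python sketch of cosine similarity plus picking the best match. The tiny 3-dimensional vectors are made-up stand-ins, not real model outputs, and `cosine` and `best_match` are hypothetical helper names:

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def best_match(query_vec, doc_vecs):
    """Return (index, score) of the document vector most similar to the query."""
    scores = [cosine(query_vec, d) for d in doc_vecs]
    idx = max(range(len(scores)), key=scores.__getitem__)
    return idx, scores[idx]


# Made-up toy vectors (NOT real embeddings): one text, two images.
text_vec = [1.0, 0.2, 0.0]
image_vecs = [
    [0.5, 0.1, 0.8],  # stand-in for the "car" image
    [0.1, 0.9, 0.4],  # stand-in for the "bee" image
]

idx, score = best_match(text_vec, image_vecs)
print(f"best image index: {idx}, score: {score:.2f}")
```

The winning score here is nowhere near 1.0, yet the ranking is unambiguous. Retrieval only needs the right item to come out on top.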
Encoding Queries and Documents
For retrieval, use encode_query() and encode_document() instead of plain encode(). Many models prepend a different instruction prompt depending on whether the input is a query or a document, so using the right method matters:
query_emb = model.encode_query("Find images of cars")
doc_emb = model.encode_document(["image_url_1", "image_url_2"])
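Putting the two calls together, a small search helper might look like the sketch below. `search` is a hypothetical function of mine, and it assumes only that `model` exposes encode_query(), encode_document(), and similarity() the way the snippets in this post use them:

```python
def search(model, query, documents, top_k=3):
    """Rank documents against a single query, best match first."""
    query_emb = model.encode_query([query])         # one row for the query
    doc_embs = model.encode_document(documents)     # one row per document
    scores = model.similarity(query_emb, doc_embs)  # shape: 1 x len(documents)
    row = [float(s) for s in scores[0]]             # plain floats for sorting
    ranked = sorted(zip(documents, row), key=lambda pair: pair[1], reverse=True)
    return ranked[:top_k]


# Usage, assuming `model` was loaded as shown earlier:
# hits = search(model, "Find images of cars", ["image_url_1", "image_url_2"])
# for doc, score in hits:
#     print(f"{score:.2f}  {doc}")
```

Passing the model in as an argument keeps the helper independent of which checkpoint you load.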
Multimodal Reranker Models
Rerankers score relevance between pairs. Now they work across modalities:
from sentence_transformers import CrossEncoder
reranker = CrossEncoder("Qwen/Qwen3-VL-Reranker-2B")
scores = reranker.predict([
    ("A green car", "https://example.com/car.jpg"),
    ("A bee on a flower", "https://example.com/bee.jpg"),
])
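In a retrieve-then-rerank setup you typically score one query against many candidates and keep the best few. Here is a minimal sketch of that step. `rerank` is a hypothetical helper, and it assumes only that the reranker has a predict() method that takes a list of pairs and returns one score per pair, as in the snippet above:

```python
def rerank(cross_encoder, query, candidates, top_k=5):
    """Re-order retrieved candidates by cross-encoder relevance, best first."""
    pairs = [(query, candidate) for candidate in candidates]
    scores = cross_encoder.predict(pairs)  # one score per (query, candidate) pair
    scored = [(candidate, float(score)) for candidate, score in zip(candidates, scores)]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:top_k]


# Usage, assuming a loaded CrossEncoder and a list of retrieved image URLs:
# top = rerank(cross_encoder, "A green car", retrieved_urls, top_k=3)
```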
This is useful for reranking initial retrieval results where documents might be images or mixed-modality.
Supported Models
The update supports a range of models. The standout is Qwen3-VL for both embedding and reranking. CLIP models also work for image-text tasks. Audio and video support is newer, so expect more models to land over time.
What I Like and What Gives Me Pause
I like that Sentence Transformers stays simple. Multimodal could have been a separate library or a complex API, but it’s not. That’s good engineering.
What gives me pause is the modality gap. If your use case depends on absolute similarity scores for cross-modal comparisons, for example a fixed cutoff for "relevant", you’ll need to calibrate thresholds carefully rather than reuse values from text-only setups. Also, GPU requirements for VLMs are steep. This isn’t a “run on your laptop” situation unless you have a decent GPU.
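One way to calibrate such a threshold is to pick it from a small labeled validation set instead of guessing. The sketch below chooses the cutoff with the best F1. `pick_threshold` is a hypothetical helper, and the scores and labels are invented for illustration, not measured results:

```python
def pick_threshold(scores, labels):
    """Return (cutoff, f1) for the score cutoff with the best F1.

    scores: cross-modal similarity scores for validation pairs.
    labels: True where the pair is actually relevant.
    """
    best_cutoff, best_f1 = None, -1.0
    for cutoff in sorted(set(scores)):
        predicted = [s >= cutoff for s in scores]
        tp = sum(p and l for p, l in zip(predicted, labels))
        fp = sum(p and not l for p, l in zip(predicted, labels))
        fn = sum((not p) and l for p, l in zip(predicted, labels))
        f1 = 2 * tp / (2 * tp + fp + fn) if tp else 0.0
        if f1 > best_f1:
            best_cutoff, best_f1 = cutoff, f1
    return best_cutoff, best_f1


# Hypothetical validation data, for illustration only.
val_scores = [0.67, 0.51, 0.48, 0.30, 0.22, 0.15]
val_labels = [True, True, True, False, False, False]

cutoff, f1 = pick_threshold(val_scores, val_labels)
print(f"cutoff={cutoff}, f1={f1:.2f}")
```

With made-up data like this, the best cutoff sits just below 0.5. The point is that the threshold comes from your own cross-modal scores, not from intuition built on text-to-text similarity.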
Final Thoughts
This update makes multimodal search and RAG much more accessible. If you’ve been waiting for a clean way to handle mixed-modality retrieval, this is it. Just be realistic about hardware requirements and the modality gap. Start with CLIP-based models if you’re on limited hardware, then scale up to VLM models when you need better performance.