
Google's Model Lineup Expands: Introducing Gemini Embedding 2

Google introduces Gemini Embedding 2, its first natively multimodal embedding model. Discover how processing text, audio, images, and video in a single shared vector space drastically reduces costs and redefines semantic retrieval accuracy by eliminating the "loss in translation" of traditional pipelines.

What's the News

Last week, Google released a preview of its new embedding model, Gemini Embedding 2.

This is Google's first natively multimodal embedding model: text, audio files, images, and videos all share the same embedding space, bypassing the intermediate transformation steps that traditional ingestion pipelines require before vectorization.

This is undoubtedly a major breakthrough, not only for improved data accuracy and quality but also for cost optimization and greater precision in semantic information retrieval. The interleaved input modality also allows for pairing semantic information with different types of media files.

You can already test the new model via the Gemini API and Vertex AI within the Google ecosystem; it is also compatible with several existing technologies, including LangChain, LlamaIndex, Weaviate, Qdrant, and ChromaDB.

TL;DR

Google's new embedding model improves both the quality of the processed data and the ingestion pipeline itself, while adding native support for multiple media types.

The Ingestion Pipeline

Ingestion is the process by which an embedding model translates (vectorizes) a piece of data, from a simple text string to a more complex file, into a vector: a numerical representation that models can "understand" and use in search queries (semantic retrieval).

Thanks to its extended native support for multiple media file types, Gemini Embedding 2 optimizes the ingestion process by directing information into a single embedding space: the benefits lie not just in performance, but also in semantic precision.
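Once everything lives in one embedding space, retrieval reduces to comparing vectors, typically by cosine similarity. The toy example below uses hand-made 3-dimensional vectors standing in for real model output (real embeddings have thousands of dimensions), just to show the mechanics:

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two vectors."""
    a, b = np.asarray(a, float), np.asarray(b, float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy 3-dim "embeddings" standing in for real model output.
store = {
    "dog photo":   [0.9, 0.1, 0.0],
    "cat photo":   [0.1, 0.9, 0.0],
    "invoice pdf": [0.0, 0.1, 0.9],
}
query = [0.8, 0.2, 0.1]  # pretend this is the vector for the query "dog"

# Nearest neighbor = the stored item most similar to the query.
best = max(store, key=lambda k: cosine(query, store[k]))
print(best)  # dog photo
```

A vector database like Weaviate, Qdrant, or ChromaDB performs exactly this comparison, only at scale and with approximate-nearest-neighbor indexes.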

Shared Embedding Space

The real game-changer of this shared space is the ability to pair specific contextual information with different media files. For example, by associating the word "dog" with an image of a specific breed, both the image representation of the dog and the word "dog" will have closely aligned coordinates within the same shared vector space.

Furthermore, extended native support improves vector precision throughout the ingestion pipeline. Traditionally, complex media files had to pass through intermediate steps (such as OCR for images or speech-to-text for audio) to convert the original data into a format the vectorization process could handle before it was written to a database. Each of these steps introduced inaccuracies, a "loss in translation." Bypassing this intermediate textual conversion and embedding the data directly eliminates that semantic dispersion, reduces latency, and drastically lowers costs.

In addition to text, audio, images, and video, the model has been explicitly trained to map complex documents (like PDFs) within the same vector space, drastically improving support and accuracy for this file type.

Multilingual Semantic Training

The embedding space doesn't just align different modalities (text, audio, images, and video); it can also capture semantic intent across more than 100 languages.

For instance, if we vectorize a document in Italian and a search query is performed in Japanese, the model will still match the two vectors, because meaning, not surface language, determines position in the embedding space.

"Matryoshka" (MRL - Matryoshka Representation Learning)

By default, Gemini Embedding 2 generates a very long, 3072-dimensional vector. Thanks to the MRL technique, the most important information is "concentrated" in the vector's initial dimensions (just like smaller Matryoshka dolls are nested inside larger ones). This allows developers to "truncate" the vector (e.g., down to 768 dimensions) without having to recalculate it from scratch, keeping the semantic meaning almost completely intact.
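In practice, using an MRL vector at a lower dimension means keeping its first components and re-normalizing to unit length (a common recommendation when truncated vectors will be compared by dot product). A minimal sketch, using a random vector purely as a stand-in for model output:

```python
import numpy as np

def truncate_embedding(vec, dims):
    """Keep the first `dims` MRL dimensions and re-normalize to unit length."""
    head = np.asarray(vec, dtype=np.float64)[:dims]
    norm = np.linalg.norm(head)
    return head / norm if norm > 0 else head

# Stand-in for a full 3072-dim embedding returned by the model.
full = np.random.default_rng(0).normal(size=3072)
short = truncate_embedding(full, 768)
print(len(short))                               # 768
print(round(float(np.linalg.norm(short)), 6))   # 1.0
```

Because MRL front-loads the most important information, the 768-dimensional prefix retains most of the semantic signal while cutting storage and search cost to a quarter.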

Limits and Pricing

Gemini Embedding 2 has precise ingestion limits per API call:

  • Text: Up to a maximum context of 8,192 tokens.
  • Documents: 1 document per prompt, up to a maximum of 6 pages.
  • Images: Up to 6 images processed in a single request.
  • Video: Supports video files up to 120 seconds long if they don't have an audio track, or up to 80 seconds if they include audio.
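Since requests that exceed these caps will be rejected, it can be worth validating inputs client-side before calling the API. A hypothetical pre-flight check mirroring the limits listed above (the numbers come from this article; verify them against the official docs):

```python
# Limits as listed in this article; confirm against official documentation.
LIMITS = {
    "max_text_tokens": 8192,
    "max_doc_pages": 6,
    "max_images": 6,
    "max_video_s_silent": 120,  # video with no audio track
    "max_video_s_audio": 80,    # video with an audio track
}

def check_video(duration_s, has_audio):
    """Return True if a video fits within the per-request ingestion limits."""
    cap = LIMITS["max_video_s_audio"] if has_audio else LIMITS["max_video_s_silent"]
    return duration_s <= cap

print(check_video(100, has_audio=False))  # True  (within the 120 s cap)
print(check_video(100, has_audio=True))   # False (over the 80 s cap)
```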

Below are the API usage costs (in USD):

| Input Type | Price | Unit | Source |
| --- | --- | --- | --- |
| Text | $0.25 | per 1M tokens | cloud.google |
| Image | $0.25 | per 1M tokens | cloud.google |
| Video | $0.25 | per 1M tokens | cloud.google |
| Audio | $0.50 | per 1M tokens | cloud.google |

© 2026 TECH.md Blog - FakeJack - Powered by Notion.