Azure AI Foundry Blog

RAG Time Journey 4: Advanced Multimodal Indexing

gia_mondragon
Microsoft
Mar 26, 2025

Introduction

Welcome to RAG Time Journey 4, the next step in our deep dive into Retrieval-Augmented Generation (RAG). If you’ve been following along, you might remember that in Journey 2, we explored data ingestion, hybrid search, and semantic reranking—key concepts that laid the groundwork for effective search and retrieval. Now, we’re moving beyond text and into the multimodal world, where text, images, audio, and video coexist in search environments that demand more sophisticated retrieval capabilities.

Modern AI-powered applications require more than just keyword matching. They need to understand relationships across multiple data types, extract meaning from diverse formats, and provide accurate, context-rich results. That’s where multimodal indexing in Azure AI Search comes into play. This journey will explore how Azure AI Search, its AI enrichment and advanced query capabilities, and its interaction with other Azure services enable seamless multimodal search, ensuring every data type contributes to a more robust and intelligent retrieval experience.

The Evolution of Multimodal Indexing

Why Does Multimodal Data Matter?

In traditional search, keyword-based indexing worked well for text-heavy documents. But today’s AI-powered applications need more than that—they must retrieve insights from text, images, and even video content. Multimodal indexing bridges this gap by allowing search engines to interpret, relate, and retrieve diverse data formats within a single framework.

The Challenges of Indexing Multimodal Data

Handling multiple data types introduces unique complexities. How do you effectively process an image alongside text? How can a search system correlate images with written content? The answer lies in the ingestion, enrichment, and vectorization capabilities covered in the sections below.

Data Ingestion and Preprocessing for Multimodality: Laying the Groundwork

A successful RAG framework begins with efficient data ingestion, ensuring that all types of content—documents, images, and audio—are processed and indexed correctly for retrieval. Azure provides multiple pathways to achieve this, from pull-based indexers that crawl sources such as Azure Blob Storage to push-based indexing through the SDKs and REST APIs.

By leveraging these ingestion methods, organizations can build a scalable and flexible multimodal retrieval system, ensuring all content is structured, enriched, and searchable.
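
To make this concrete, here is a minimal sketch of the pull-based path using the azure-search-documents Python SDK: a blob data source plus an indexer that runs a skillset and writes into a search index. The endpoint, keys, and the multimodal-blobs / multimodal-index / multimodal-skillset names are illustrative placeholders, not a prescribed setup.

```python
# Minimal sketch: pulling blobs (PDFs, images) into an Azure AI Search indexing
# pipeline. Endpoint, keys, and resource names are placeholders; the skillset
# referenced here is defined separately.
from azure.core.credentials import AzureKeyCredential
from azure.search.documents.indexes import SearchIndexerClient
from azure.search.documents.indexes.models import (
    SearchIndexer,
    SearchIndexerDataContainer,
    SearchIndexerDataSourceConnection,
)

indexer_client = SearchIndexerClient(
    endpoint="https://<your-service>.search.windows.net",
    credential=AzureKeyCredential("<admin-key>"),
)

# Register a blob container holding mixed content (documents and images).
data_source = SearchIndexerDataSourceConnection(
    name="multimodal-blobs",
    type="azureblob",
    connection_string="<storage-connection-string>",
    container=SearchIndexerDataContainer(name="content"),
)
indexer_client.create_or_update_data_source_connection(data_source)

# The indexer crawls the data source, runs the skillset, and writes to the index.
indexer = SearchIndexer(
    name="multimodal-indexer",
    data_source_name="multimodal-blobs",
    target_index_name="multimodal-index",
    skillset_name="multimodal-skillset",
)
indexer_client.create_or_update_indexer(indexer)
```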

 

AI Enrichment: Transforming Data with AI Skills

Azure AI Search doesn’t just index raw data—it enriches it through AI-driven transformations. AI enrichment allows you to:

  • Call Azure AI Services directly within the indexing pipeline to extract deeper insights through built-in AI skills for text, image, and language processing.
  • Develop custom skills that apply your own logic and call any required services to transform, classify, or annotate data before it reaches the index.

For example, an AI enrichment pipeline could extract metadata from images, summarize key points in lengthy documents, or transcribe audio files for searchability. These transformations enhance retrieval accuracy and ensure multimodal content is indexed in a structured, searchable way.
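
As a rough sketch of what such a pipeline looks like in code (reusing the SearchIndexerClient from the ingestion example above), the skillset below pairs the built-in OCR skill with a custom Web API skill. The function-app URL and output field names are placeholder assumptions for illustration.

```python
# Hedged sketch: a skillset combining a built-in OCR skill with a custom
# Web API skill. indexer_client is the SearchIndexerClient created in the
# ingestion sketch above.
from azure.search.documents.indexes.models import (
    InputFieldMappingEntry,
    OcrSkill,
    OutputFieldMappingEntry,
    SearchIndexerSkillset,
    WebApiSkill,
)

# Built-in skill: extract printed text from images emitted by the indexer.
ocr_skill = OcrSkill(
    context="/document/normalized_images/*",
    inputs=[InputFieldMappingEntry(name="image", source="/document/normalized_images/*")],
    outputs=[OutputFieldMappingEntry(name="text", target_name="extractedText")],
)

# Custom skill: forward content to your own enrichment service (classification,
# annotation, Content Understanding, etc.) before it reaches the index.
custom_skill = WebApiSkill(
    uri="https://<your-function-app>.azurewebsites.net/api/enrich",  # placeholder endpoint
    context="/document",
    inputs=[InputFieldMappingEntry(name="content", source="/document/content")],
    outputs=[OutputFieldMappingEntry(name="labels", target_name="labels")],
)

skillset = SearchIndexerSkillset(
    name="multimodal-skillset",
    description="OCR plus custom enrichment for multimodal content",
    skills=[ocr_skill, custom_skill],
)
indexer_client.create_or_update_skillset(skillset)
```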

Pro tip: After your data is indexed, you can access and optimize its searchability via the Azure portal, Azure AI Foundry, or the SDKs/REST APIs, where query optimization helps refine your search strategy.

 

Integrated Vectorization: Making Data Searchable Across Modalities

The Role of Chunking

As discussed in Journey 2, long-form content—whether a text-heavy document or a video transcript—must be broken down into manageable chunks to ensure efficient indexing and retrieval. Chunking allows search engines to retrieve precise sections of documents rather than entire files, helping to improve relevance across a range of use cases.

However, not all chunking methods are the same. Azure AI Search supports multiple chunking strategies that preserve document structure while enabling efficient retrieval.

For multimodal search, chunking is particularly critical. Since text, images, and video transcripts need to be indexed in a way that retains their interrelationships, chunking ensures that:

  • Image captions remain linked to their corresponding visuals, enhancing retrieval accuracy for visual searches.
  • Video subtitles and transcripts are properly segmented, allowing users to search for specific spoken content within a long recording.
  • Structured tables are preserved as they appear in the original document, making data extraction and retrieval more meaningful when combined with textual content.

By using these chunking methods, multimodal search systems ensure that extracted content remains contextually relevant and logically organized, improving retrieval accuracy and providing a more holistic search experience across different data types.
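
One common way to implement chunking in an indexing pipeline is the built-in Text Split skill. The sketch below splits document text into overlapping chunks; the lengths shown are illustrative, and the page_overlap_length parameter assumes a recent API version of the service and SDK.

```python
# Hedged sketch: a Text Split skill that chunks long text into overlapping
# "pages" so each chunk can be embedded and retrieved on its own.
from azure.search.documents.indexes.models import (
    InputFieldMappingEntry,
    OutputFieldMappingEntry,
    SplitSkill,
)

split_skill = SplitSkill(
    context="/document",
    text_split_mode="pages",        # split into fixed-size "pages" of text
    maximum_page_length=2000,       # characters per chunk (illustrative value)
    page_overlap_length=200,        # overlap preserves context across chunk boundaries
    inputs=[InputFieldMappingEntry(name="text", source="/document/content")],
    outputs=[OutputFieldMappingEntry(name="textItems", target_name="chunks")],
)
```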

Embedding Models: The Heart of Multimodal Search

Once chunked, data is transformed into vector embeddings for retrieval. Azure AI Search natively integrates:

  • Text embeddings (e.g., Azure OpenAI’s text-embedding-3-large and text-embedding-3-small) for natural language understanding.
  • Multimodal embeddings, which capture vector representations of different data types, such as images and text, in a shared semantic space so they can be used interchangeably, ensuring richer retrieval. The AI Vision multimodal embedding skill, which leverages the AI Vision multimodal model, produces these representations.
  • AI Foundry-supported embedding models (e.g., Cohere-embed-v3-english, Cohere-embed-v3-multilingual, Facebook-DinoV2-Image-Embeddings-ViT-Base, Facebook-DinoV2-Image-Embeddings-ViT-Giant) where you can choose models available through the AI Foundry catalog.

With AI enrichment, images, text, and even spoken words from audio files can be converted into embeddings, making multimodal search truly seamless.
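
As an illustration of what embeddings across modalities look like in code, the hedged sketch below generates a text embedding with Azure OpenAI and an image embedding with the AI Vision image retrieval (vectorizeImage) API. Endpoints, keys, deployment names, and API versions are placeholders you should confirm against current documentation.

```python
# Hedged sketch: producing text and image embeddings at query (or ingestion) time.
import requests
from openai import AzureOpenAI

aoai = AzureOpenAI(
    azure_endpoint="https://<your-aoai>.openai.azure.com",
    api_key="<aoai-key>",
    api_version="2024-06-01",
)

# Text embedding via an Azure OpenAI embedding deployment.
text_vector = aoai.embeddings.create(
    model="text-embedding-3-large",   # your deployment name
    input="maintenance procedure for the pump assembly",
).data[0].embedding

# Image embedding via the AI Vision multimodal (image retrieval) API.
vision_endpoint = "https://<your-ai-vision>.cognitiveservices.azure.com"
resp = requests.post(
    f"{vision_endpoint}/computervision/retrieval:vectorizeImage",
    params={"api-version": "2024-02-01", "model-version": "2023-04-15"},
    headers={"Ocp-Apim-Subscription-Key": "<vision-key>"},
    json={"url": "https://example.com/diagram.png"},
)
image_vector = resp.json()["vector"]
```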

Integrated vectorization incorporating the multimodal capabilities above is readily available through the Azure portal via the Import and vectorize data wizard, and you can access a code sample available here.

Enhancing Retrieval with Azure AI Content Understanding

Using AI-Powered Content Understanding

Azure AI Content Understanding extends search beyond basic text recognition by enabling AI-driven interpretation of diagrams, flowcharts, and other complex visual elements in addition to text. This capability enhances multimodal search by ensuring contextually relevant indexing and retrieval of structured and unstructured content.

Through custom skills in Azure AI Search, Content Understanding can be leveraged to:

  • Extract entities and relationships from complex visual elements, such as annotated diagrams, flowcharts, and schematics, improving retrieval accuracy.
  • Automatically categorize content based on its format, structure, and visual cues, enhancing search across different data modalities.
  • Generate metadata and confidence scores to ensure reliable search results and assist in quality control.

For example, a technical document containing process flowcharts can be indexed in a way that allows users to retrieve specific steps or components within a visual structure rather than just the textual description. By integrating Azure AI Content Understanding through a custom skill, complex documents become fully searchable across modalities, significantly improving discoverability. You may find examples of how to integrate other services in the Azure AI Search indexing pipeline through custom skills in the Power Skills repo.
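
Custom skills follow a simple request/response contract: the indexer posts a batch of records under a values array, and the skill returns enriched data keyed by the same recordId. The sketch below shows that contract; analyze_with_content_understanding is a hypothetical placeholder for however you call Content Understanding from your service.

```python
# Hedged sketch: the shape of a custom Web API skill handler. The Content
# Understanding call is a placeholder, not an actual client API.

def analyze_with_content_understanding(content: str) -> dict:
    # Placeholder: call Azure AI Content Understanding here and return
    # extracted entities, relationships, and confidence scores.
    return {"entities": [], "confidence": 0.0}

def run_custom_skill(request_body: dict) -> dict:
    """Process a batch of enrichment records in the custom-skill contract."""
    results = []
    for record in request_body.get("values", []):
        content = record["data"].get("content", "")
        analysis = analyze_with_content_understanding(content)
        results.append({
            "recordId": record["recordId"],   # must echo the incoming recordId
            "data": {
                "entities": analysis["entities"],
                "confidence": analysis["confidence"],
            },
            "errors": None,
            "warnings": None,
        })
    return {"values": results}
```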

With these tools, multimodal indexing is no longer just about storing data—it’s about understanding, structuring, and retrieving it intelligently for an optimized search experience.

Optimizing Query Performance for Multimodal Search

Tuning Search with Vector Weighting

Multimodal search results should prioritize relevant modalities based on context. With vector weighting, the vector queries for different data types can be assigned relative weights—boosting text-based results for documentation-heavy queries, or prioritizing image-based results for visually driven searches.
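
A minimal sketch of a weighted hybrid query is shown below, reusing the text and image embeddings from the earlier embedding sketch. The weight parameter on vector queries assumes a recent azure-search-documents release and API version, and the index and field names are placeholders.

```python
# Hedged sketch: weighting a text vector query over an image vector query for a
# documentation-heavy question.
from azure.core.credentials import AzureKeyCredential
from azure.search.documents import SearchClient
from azure.search.documents.models import VectorizedQuery

search_client = SearchClient(
    endpoint="https://<your-service>.search.windows.net",
    index_name="multimodal-index",
    credential=AzureKeyCredential("<query-key>"),
)

results = search_client.search(
    search_text="how do I replace the filter cartridge",
    vector_queries=[
        VectorizedQuery(vector=text_vector, fields="text_vector",
                        k_nearest_neighbors=10, weight=2.0),   # favor text matches
        VectorizedQuery(vector=image_vector, fields="image_vector",
                        k_nearest_neighbors=10, weight=0.5),   # de-emphasize images
    ],
)
for doc in results:
    print(doc.get("title"), doc["@search.score"])
```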

Advanced Search Techniques: Multi-Vector & Cross-Field Search

Azure AI Search supports advanced retrieval strategies:

  • Multi-vector search: Assigns separate embeddings for different data aspects, enabling precise matching across multiple dimensions.
  • Cross-field search: Merges vectors from text, images, and metadata to surface the most contextually relevant results.

These approaches make search results more intuitive, accurate, and adaptable to real-world needs.
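
Building on the same client, the sketch below contrasts the two patterns: separate vector queries per modality (multi-vector) and one query vector matched against several vector fields at once (cross-field), which assumes those fields were produced by the same embedding model with the same dimensions. Field names are placeholders.

```python
# Hedged sketch: multi-vector and cross-field retrieval, reusing search_client
# and the embeddings from the previous sketches.
from azure.search.documents.models import VectorizedQuery

# Multi-vector: one embedding per aspect of the query, each aimed at its own field.
multi_vector = search_client.search(
    search_text=None,
    vector_queries=[
        VectorizedQuery(vector=text_vector, fields="text_vector", k_nearest_neighbors=10),
        VectorizedQuery(vector=image_vector, fields="image_vector", k_nearest_neighbors=10),
    ],
)

# Cross-field: a single embedding matched against several vector fields at once,
# assuming both fields were populated by the same text embedding model.
cross_field = search_client.search(
    search_text="wiring diagram for the control panel",   # hybrid: keyword + vector
    vector_queries=[
        VectorizedQuery(vector=text_vector, fields="title_vector,content_vector",
                        k_nearest_neighbors=10),
    ],
)
```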

The Future of Multimodal Indexing in RAG

Where We Are and Where We’re Headed

With multimodal indexing, RAG is evolving beyond text retrieval, enabling AI applications to understand and process diverse information formats. This shift opens the door to agent-driven frameworks, where AI assistants can retrieve, correlate, and synthesize insights from text, images, and audio in real time.

Key Takeaways

  • Multimodal indexing unifies text, images, and audio into a single search framework.
  • Azure AI Search, Azure AI Services and AI Content Understanding enable seamless multimodal retrieval.
  • Advanced search techniques like vector weighting and multi-vector search improve result precision.

 

Next Steps

Stay Engaged with RAG Time

Your multimodal retrieval journey is just beginning! To dive deeper, explore the rest of the RAG Time series and the code samples referenced above.

The future of search is multimodal—start optimizing your RAG-powered indexing strategy today!

Updated Mar 26, 2025
Version 1.0