Spend time on user value, not pipelines—multimodal indexing now comes standard via the Azure portal.
Introduction
We're thrilled to introduce a new suite of multimodal capabilities in Azure AI Search.
This set of features includes both new additions and incremental improvements that enable Azure AI Search to extract text from pages and inline images, generate image descriptions (verbalization), and create vision/text embeddings. It also supports storing cropped images in the knowledge store and returning text and image annotations to end users in RAG (Retrieval Augmented Generation) applications.
These features can be configured in our new Azure portal wizard with multimodal support or via the 2025-05-01-preview REST API version.
In addition, we're providing a new GitHub repo with sample code for a RAG app. This resource shows how to build on the multimodal search-ready index that Azure AI Search creates.
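If you take the REST route, here's a minimal sketch in Python of creating a trimmed-down skillset against the 2025-05-01-preview API version. It covers only extraction, chunking, and text embeddings; the service name, keys, deployment names, and skill parameters are placeholders, so treat this as an illustration of the shape rather than the exact payload the wizard generates.

```python
import requests

# Placeholders - substitute your own search service, admin key, and Azure OpenAI details.
SEARCH_SERVICE = "https://<your-search-service>.search.windows.net"
ADMIN_KEY = "<admin-key>"
API_VERSION = "2025-05-01-preview"

# A trimmed skillset sketch: extract text + images, chunk the text,
# and embed each chunk with an Azure OpenAI embedding deployment.
skillset = {
    "name": "multimodal-skillset-sketch",
    "skills": [
        {
            "@odata.type": "#Microsoft.Skills.Util.DocumentExtractionSkill",
            "context": "/document",
            "configuration": {"imageAction": "generateNormalizedImages"},
            "inputs": [{"name": "file_data", "source": "/document/file_data"}],
            "outputs": [
                {"name": "content", "targetName": "content"},
                {"name": "normalized_images", "targetName": "normalized_images"},
            ],
        },
        {
            "@odata.type": "#Microsoft.Skills.Text.SplitSkill",
            "context": "/document",
            "textSplitMode": "pages",
            "maximumPageLength": 2000,
            "inputs": [{"name": "text", "source": "/document/content"}],
            "outputs": [{"name": "textItems", "targetName": "pages"}],
        },
        {
            "@odata.type": "#Microsoft.Skills.Text.AzureOpenAIEmbeddingSkill",
            "context": "/document/pages/*",
            "resourceUri": "https://<your-aoai-resource>.openai.azure.com",
            "apiKey": "<aoai-key>",  # or rely on a managed identity instead
            "deploymentId": "text-embedding-3-large",
            "modelName": "text-embedding-3-large",
            "inputs": [{"name": "text", "source": "/document/pages/*"}],
            "outputs": [{"name": "embedding", "targetName": "text_vector"}],
        },
    ],
}

resp = requests.put(
    f"{SEARCH_SERVICE}/skillsets/{skillset['name']}",
    params={"api-version": API_VERSION},
    headers={"api-key": ADMIN_KEY, "Content-Type": "application/json"},
    json=skillset,
)
resp.raise_for_status()
print("Skillset created or updated:", resp.status_code)
```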
Why multimodality matters for GenAI Apps
In AI, 'multimodality' refers to a system's capacity to seamlessly manage and interpret diverse types of data, such as text and images. With this ability, the system can process and comprehend information from both textual descriptions and complex visual data.
Multimodality enables GenAI (Generative AI) apps to understand and extract information not just from text, but also from images like diagrams, charts, and infographics. This matters when critical answers—like how to request access to an HR system—are embedded in a visual workflow, not plain text.
Multimodal search unlocks that content, helping copilots and agents provide more complete, grounded answers—even when the information lives inside an image.
How it’s different from OCR
Traditional OCR (Optical Character Recognition) converts images to plain text, but it doesn’t understand structure or context—especially in complex visuals like flowcharts. Multimodal search goes further: it interprets relationships, context, and meaning within both text and images, enabling much richer and more relevant answers.
What about embeddings?
Image and multimodal embeddings help match visuals to queries based on content or meaning. But embeddings alone may miss structural logic—like the sequence in a diagram.
That’s where verbalization comes in: it extracts the underlying structure and relationships from images, complementing embeddings to deepen understanding. Used together, they power more accurate and helpful GenAI responses.
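As a rough illustration of how verbalization and embeddings work together, here's a sketch using the `openai` Python SDK against Azure OpenAI: a vision-capable chat completions deployment describes a diagram, and the resulting description is then embedded so it can be retrieved like any text chunk. The endpoint, deployment names, and image URL are placeholders; inside Azure AI Search this pairing is handled for you by the skillset the wizard creates.

```python
from openai import AzureOpenAI

# Placeholders: point these at your own Azure OpenAI resource and deployments.
client = AzureOpenAI(
    azure_endpoint="https://<your-aoai-resource>.openai.azure.com",
    api_key="<aoai-key>",
    api_version="2024-10-21",
)

image_url = "https://<storage-account>.blob.core.windows.net/docs/hr-access-flowchart.png"

# Step 1: verbalization - ask a chat model to describe the diagram, including the
# ordered steps and decision branches that a raw image embedding could miss.
verbalization = client.chat.completions.create(
    model="gpt-4o",  # any vision-capable chat deployment name
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe this diagram, including the ordered steps and decision branches."},
                {"type": "image_url", "image_url": {"url": image_url}},
            ],
        }
    ],
)
description = verbalization.choices[0].message.content

# Step 2: embed the verbalized description so it can be indexed and retrieved as text.
embedding = client.embeddings.create(
    model="text-embedding-3-large",
    input=[description],
).data[0].embedding

print(description[:200], "...")
print("vector dimensions:", len(embedding))
```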
Why it matters for developers
Until now, setting up multimodal search meant building and maintaining separate pipelines for images and text. It was complex and time-consuming.
The new multimodal wizard in Azure AI Search changes that. It simplifies setup with built-in support for:
- Image extraction and normalization
- Image verbalization (auto-captioning)
- Multimodal embeddings
- Cropped image storage in the knowledge store
- Indexing for search and RAG scenarios
There’s also a GitHub repo to help you get started with a full RAG app. It’s all about removing the plumbing—so you can focus on building great user experiences, faster.
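For context on what the wizard wires up behind the scenes, here's a sketch of the indexer that ties a data source, the skillset, and the target index together via the 2025-05-01-preview REST API. All names are placeholders, and the configuration flags shown are one plausible setup, not the wizard's exact output.

```python
import requests

# Placeholders; the portal wizard creates the equivalent objects for you.
SEARCH_SERVICE = "https://<your-search-service>.search.windows.net"
ADMIN_KEY = "<admin-key>"
API_VERSION = "2025-05-01-preview"

# The indexer is the glue: it reads from the data source, runs the skillset
# (extraction, verbalization, embeddings), and writes into the target index.
indexer = {
    "name": "multimodal-indexer-sketch",
    "dataSourceName": "<blob-data-source>",
    "skillsetName": "multimodal-skillset-sketch",  # from the earlier sketch
    "targetIndexName": "<wizard-created-index>",
    "parameters": {
        "configuration": {
            "dataToExtract": "contentAndMetadata",
            "imageAction": "generateNormalizedImages",  # emit normalized images for skills
            "allowSkillsetToReadFileData": True,         # pass file_data to the skillset
        }
    },
}

resp = requests.put(
    f"{SEARCH_SERVICE}/indexers/{indexer['name']}",
    params={"api-version": API_VERSION},
    headers={"api-key": ADMIN_KEY, "Content-Type": "application/json"},
    json=indexer,
)
resp.raise_for_status()
```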
"We’re exploring Azure AI Search’s multimodal capabilities to enhance MiM, our AI-powered knowledge assistant. We are enabling MiM to interpret and retrieve information from complex technical documentation and proprietary knowledge sources—including diagrams, structured tables, and embedded visuals. Early access and close collaboration with Azure AI Search have enabled us to test emerging capabilities in real-world scenarios. While it’s still early days, we’re encouraged by the progress and excited to continue this journey together. "
- Subhas Patel, Group Head of Technology, Spirax Group
End-to-end flow: from ingestion to answers
Here is a high-level, end-to-end diagram of the multimodal functionality:
Inside the multimodal functionality set
| Step | Functionality | Status | Description | Where it lights up |
| --- | --- | --- | --- | --- |
| 1.1 | Document Intelligence content layout skill | Updated | Now extracts images and text with page numbers, bounding boxes, and parallel plain text, plus page-level slices from multiple document types. | |
| 1.2 (alternative to 1.1) | Document extraction skill | Existing | Extracts images and text. Suited for RAG apps that don’t require text page-number location, text polygon extraction, or other capabilities provided by the layout skill. Extracts page numbers for PDFs only. | |
| 2 | Text split (chunking) | Updated | Now extracts offsets, lengths, and ordinal position per chunk. | |
| 3 | Image verbalization via chat completions models | New | Call any chat completions (inference) model deployed in Azure AI Foundry for extracted-image verbalization, summarization, classification, and more; use in combination with an Azure OpenAI embedding model afterwards. | |
| 4 | Embedding model support | Existing | Vision-text embeddings created at document ingestion and for every user query. You can use the Azure OpenAI embedding skill for verbalized images and text, or supported multimodal embedding models through the AML skill and AI Vision embeddings skill, with their respective vectorizers. | Portal wizard, AOAI embedding skill/vectorizer, AML skill/vectorizer, AI Vision multimodal embedding skill/vectorizer, REST (2025-05-01-preview), GitHub repo |
| 5 | Cropped image storage in the knowledge store | New | Extracted images automatically persisted in the knowledge store for direct reference from a RAG app. | |
| 1-5 | Azure portal wizard with multimodal support | New | Few-clicks integration tool to create a RAG-ready multimodal index, with no JSON editing. | No-code |
| 6 | Multimodal RAG sample app (GitHub) | New | Plug-and-play RAG app that runs multimodal search against the wizard-created index and fetches text and image annotations. | High-code |
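For step 5, the wizard configures image persistence for you. If you're assembling the skillset yourself, the cropped images land in the knowledge store through a file projection roughly along these lines; this is a sketch only, and the container name and enrichment path are assumptions to confirm against the knowledge store documentation.

```python
# Fragment to merge into the skillset dictionary from the earlier sketch.
# The storage connection string and container name are placeholders.
knowledge_store = {
    "storageConnectionString": "<azure-storage-connection-string>",
    "projections": [
        {
            "tables": [],
            "objects": [],
            # File projections write each normalized (cropped) image as a blob,
            # so the RAG app can link image citations straight to the source visual.
            "files": [
                {
                    "storageContainer": "extracted-images",
                    "source": "/document/normalized_images/*",
                }
            ],
        }
    ],
}

skillset["knowledgeStore"] = knowledge_store  # 'skillset' from the REST sketch above
```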
How to get started
- Go to the How-to Multimodal Search wizard documentation for a step-by-step process to create a RAG-ready multimodal index.
- Go to the Multimodal sample app code repo in GitHub to get a code-ready app you can plug the index from step #1 into. You can also take a code-only approach and use the sample app as the basis for your end-to-end multimodal RAG app.
This GitHub sample supports text and image citations, train-of-thought functionality, adjusting the number of top chunks retrieved for data grounding, and enabling/disabling the semantic ranker.
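Those retrieval knobs map to ordinary query parameters. Here's a sketch with the `azure-search-documents` Python SDK; the index name, field names (`content_text`, `content_embedding`, `content_path`), and the semantic configuration name are assumptions standing in for whatever your wizard-created index actually uses.

```python
from azure.core.credentials import AzureKeyCredential
from azure.search.documents import SearchClient
from azure.search.documents.models import VectorizableTextQuery

# Placeholders: your search service, query key, and wizard-created index name.
search_client = SearchClient(
    endpoint="https://<your-search-service>.search.windows.net",
    index_name="<wizard-created-index>",
    credential=AzureKeyCredential("<query-key>"),
)

query = "How do I request access to the HR system?"
top_chunks = 5               # "number of top chunks" knob
use_semantic_ranker = True   # semantic ranker toggle

results = search_client.search(
    search_text=query,
    vector_queries=[
        # The vectorizer attached to the index turns the query text into an embedding server-side.
        VectorizableTextQuery(text=query, k_nearest_neighbors=top_chunks, fields="content_embedding")
    ],
    query_type="semantic" if use_semantic_ranker else "simple",
    semantic_configuration_name="default" if use_semantic_ranker else None,
    top=top_chunks,
)

for doc in results:
    # content_text / content_path are assumed field names for the chunk text and stored image.
    print(doc.get("content_text", "")[:120], "|", doc.get("content_path"))
```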
Developer tips
- Try the sample app code for code-ready end-to-end functionality.
- The document extraction skill provides text and image extraction, but it lacks text-location metadata such as polygon boxes and page numbers, and its multimodal capabilities are available for PDF files only. For more advanced image and text extraction, use the Document Intelligence content layout skill.
- Adapt the code for customized or extra functionality with existing skills (including custom skills) and index projections; see the sketch after this list.
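If you do extend the pipeline in code, index projections are how per-chunk enrichments get written as their own search documents. Here's a sketch of the shape, continuing the skillset dictionary from the earlier REST example; the target index, key field, and mapping names are illustrative assumptions.

```python
# Fragment to merge into the skillset dictionary; names are illustrative only.
skillset["indexProjections"] = {
    "selectors": [
        {
            "targetIndexName": "<wizard-created-index>",
            "parentKeyFieldName": "parent_id",
            "sourceContext": "/document/pages/*",  # one search document per text chunk
            "mappings": [
                {"name": "content_text", "source": "/document/pages/*"},
                {"name": "content_embedding", "source": "/document/pages/*/text_vector"},
                {"name": "title", "source": "/document/metadata_storage_name"},
            ],
        }
    ],
    # Keep only the chunk-level documents; skip indexing the parent documents.
    "parameters": {"projectionMode": "skipIndexingParentDocuments"},
}
```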
What’s next?
- What’s new in Azure AI Search.
- Public preview documentation for the multimodal wizard and the GitHub multimodal sample app repo. Let users “see” the answers they’ve been missing with full multimodal annotations!
- Multimodal functionality documentation.
- Feedback: Share thoughts in the comment section of this post.
Updated May 20, 2025