Spend time on user value, not pipelines—multimodal indexing now comes standard via the Azure portal.
Introduction
We're thrilled to introduce a new suite of multimodal capabilities in Azure AI Search.
This set of features includes both new additions and incremental improvements that enable Azure AI Search to extract text from pages and inline images, generate image descriptions (verbalization), and create vision/text embeddings. It also supports storing cropped images in the knowledge store and returning text and image annotations to end users in RAG (Retrieval Augmented Generation) applications.
These features can be configured in our new Azure portal wizard with multimodal support or via the 2025-05-01-preview REST API version.
In addition, we're providing a new GitHub repo with sample code for a RAG app. This resource shows how to build on the multimodal search-ready index that Azure AI Search creates.
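If you take the REST route, here's a minimal sketch in Python of creating a trimmed-down skillset against the 2025-05-01-preview API version. It covers only extraction, chunking, and text embeddings; the service name, keys, deployment names, and skill parameters are placeholders, so treat this as an illustration of the shape rather than the exact payload the wizard generates.

```python
import requests

# Placeholders - substitute your own search service, admin key, and Azure OpenAI details.
SEARCH_SERVICE = "https://<your-search-service>.search.windows.net"
ADMIN_KEY = "<admin-key>"
API_VERSION = "2025-05-01-preview"

# A trimmed skillset sketch: extract text + images, chunk the text,
# and embed each chunk with an Azure OpenAI embedding deployment.
skillset = {
    "name": "multimodal-skillset-sketch",
    "skills": [
        {
            "@odata.type": "#Microsoft.Skills.Util.DocumentExtractionSkill",
            "context": "/document",
            "configuration": {"imageAction": "generateNormalizedImages"},
            "inputs": [{"name": "file_data", "source": "/document/file_data"}],
            "outputs": [
                {"name": "content", "targetName": "content"},
                {"name": "normalized_images", "targetName": "normalized_images"},
            ],
        },
        {
            "@odata.type": "#Microsoft.Skills.Text.SplitSkill",
            "context": "/document",
            "textSplitMode": "pages",
            "maximumPageLength": 2000,
            "inputs": [{"name": "text", "source": "/document/content"}],
            "outputs": [{"name": "textItems", "targetName": "pages"}],
        },
        {
            "@odata.type": "#Microsoft.Skills.Text.AzureOpenAIEmbeddingSkill",
            "context": "/document/pages/*",
            "resourceUri": "https://<your-aoai-resource>.openai.azure.com",
            "apiKey": "<aoai-key>",  # or rely on a managed identity instead
            "deploymentId": "text-embedding-3-large",
            "modelName": "text-embedding-3-large",
            "inputs": [{"name": "text", "source": "/document/pages/*"}],
            "outputs": [{"name": "embedding", "targetName": "text_vector"}],
        },
    ],
}

resp = requests.put(
    f"{SEARCH_SERVICE}/skillsets/{skillset['name']}",
    params={"api-version": API_VERSION},
    headers={"api-key": ADMIN_KEY, "Content-Type": "application/json"},
    json=skillset,
)
resp.raise_for_status()
print("Skillset created or updated:", resp.status_code)
```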
Why multimodality matters for GenAI Apps
In AI, 'multimodality' refers to a system's capacity to seamlessly manage and interpret diverse types of data, such as text and images. With this ability, the system can process and comprehend information from both textual descriptions and complex visual data.
Multimodality enables GenAI (Generative AI) apps to understand and extract information not just from text, but also from images like diagrams, charts, and infographics. This matters when critical answers—like how to request access to an HR system—are embedded in a visual workflow, not plain text.
Multimodal search unlocks that content, helping copilots and agents provide more complete, grounded answers—even when the information lives inside an image.
How it’s different from OCR
Traditional OCR (Optical Character Recognition) converts images to plain text, but it doesn’t understand structure or context—especially in complex visuals like flowcharts. Multimodal search goes further: it interprets relationships, context, and meaning within both text and images, enabling much richer and more relevant answers.
What about embeddings?
Image and multimodal embeddings help match visuals to queries based on content or meaning. But embeddings alone may miss structural logic—like the sequence in a diagram.
That’s where verbalization comes in: it extracts the underlying structure and relationships from images, complementing embeddings to deepen understanding. Used together, they power more accurate and helpful GenAI responses.
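As a rough illustration of how verbalization and embeddings work together, here's a sketch using the `openai` Python SDK against Azure OpenAI: a vision-capable chat completions deployment describes a diagram, and the resulting description is then embedded so it can be retrieved like any text chunk. The endpoint, deployment names, and image URL are placeholders; inside Azure AI Search this pairing is handled for you by the skillset the wizard creates.

```python
from openai import AzureOpenAI

# Placeholders: point these at your own Azure OpenAI resource and deployments.
client = AzureOpenAI(
    azure_endpoint="https://<your-aoai-resource>.openai.azure.com",
    api_key="<aoai-key>",
    api_version="2024-10-21",
)

image_url = "https://<storage-account>.blob.core.windows.net/docs/hr-access-flowchart.png"

# Step 1: verbalization - ask a chat model to describe the diagram, including the
# ordered steps and decision branches that a raw image embedding could miss.
verbalization = client.chat.completions.create(
    model="gpt-4o",  # any vision-capable chat deployment name
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe this diagram, including the ordered steps and decision branches."},
                {"type": "image_url", "image_url": {"url": image_url}},
            ],
        }
    ],
)
description = verbalization.choices[0].message.content

# Step 2: embed the verbalized description so it can be indexed and retrieved as text.
embedding = client.embeddings.create(
    model="text-embedding-3-large",
    input=[description],
).data[0].embedding

print(description[:200], "...")
print("vector dimensions:", len(embedding))
```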
Why it matters for developers
Until now, setting up multimodal search meant building and maintaining separate pipelines for images and text. It was complex and time-consuming.
The new multimodal wizard in Azure AI Search changes that. It simplifies setup with built-in support for:
- Image extraction and normalization
- Image verbalization (auto-captioning)
- Multimodal embeddings
- Cropped image storage in the knowledge store
- Indexing for search and RAG scenarios
There’s also a GitHub repo to help you get started with a full RAG app. It’s all about removing the plumbing—so you can focus on building great user experiences, faster.
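For context on what the wizard wires up behind the scenes, here's a sketch of the indexer that ties a data source, the skillset, and the target index together via the 2025-05-01-preview REST API. All names are placeholders, and the configuration flags shown are one plausible setup, not the wizard's exact output.

```python
import requests

# Placeholders; the portal wizard creates the equivalent objects for you.
SEARCH_SERVICE = "https://<your-search-service>.search.windows.net"
ADMIN_KEY = "<admin-key>"
API_VERSION = "2025-05-01-preview"

# The indexer is the glue: it reads from the data source, runs the skillset
# (extraction, verbalization, embeddings), and writes into the target index.
indexer = {
    "name": "multimodal-indexer-sketch",
    "dataSourceName": "<blob-data-source>",
    "skillsetName": "multimodal-skillset-sketch",  # from the earlier sketch
    "targetIndexName": "<wizard-created-index>",
    "parameters": {
        "configuration": {
            "dataToExtract": "contentAndMetadata",
            "imageAction": "generateNormalizedImages",  # emit normalized images for skills
            "allowSkillsetToReadFileData": True,         # pass file_data to the skillset
        }
    },
}

resp = requests.put(
    f"{SEARCH_SERVICE}/indexers/{indexer['name']}",
    params={"api-version": API_VERSION},
    headers={"api-key": ADMIN_KEY, "Content-Type": "application/json"},
    json=indexer,
)
resp.raise_for_status()
```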
"We’re exploring Azure AI Search’s multimodal capabilities to enhance MiM, our AI-powered knowledge assistant. We are enabling MiM to interpret and retrieve information from complex technical documentation and proprietary knowledge sources—including diagrams, structured tables, and embedded visuals. Early access and close collaboration with Azure AI Search have enabled us to test emerging capabilities in real-world scenarios. While it’s still early days, we’re encouraged by the progress and excited to continue this journey together. "
- Subhas Patel, Group Head of Technology, Spirax Group
End-to-end flow: from ingestion to answers
Here is a high-level, end-to-end diagram of the multimodal functionality:
Inside the multimodal functionality set
| Step | Functionality | Status | Description | Where it lights up |
| --- | --- | --- | --- | --- |
| 1.1 | Document Intelligence content layout skill | Updated | Now extracts images and text with page numbers, bounding boxes, and parallel plain text, plus page-level slices from multiple document types. | |
| 1.2 (alternative to 1.1) | Document extraction skill | Existing | Extracts images and text. Suited for RAG apps that don’t require text page-number location, text polygon extraction, or other capabilities provided by the layout skill. Extracts page numbers for PDFs only. | |
| 2 | Text split (chunking) | Updated | Now extracts offsets, lengths, and ordinal position per chunk. | |
| 3 | Image verbalization via chat completions models | New | Call any chat completions (inference) model deployed in Azure AI Foundry for extracted-image verbalization, summarization, classification, and more; use in combination with an Azure OpenAI embedding model afterwards. | |
| 4 | Embedding model support | Existing | Vision-text embeddings created at document ingestion and for every user query. You can use the Azure OpenAI embedding skill for verbalized images and text, or supported multimodal embedding models through the AML skill and AI Vision embeddings skill, with their respective vectorizers. | Portal wizard, AOAI embedding skill/vectorizer, AML skill/vectorizer, AI Vision multimodal embedding skill/vectorizer, REST (2025-05-01-preview), GitHub repo |
| 5 | Cropped image storage in the knowledge store | New | Extracted images automatically persisted in the knowledge store for direct reference from a RAG app. | |
| 1-5 | Azure portal wizard with multimodal support | New | Few-clicks integration tool to create a RAG-ready multimodal index, with no JSON editing. | No-code |
| 6 | Multimodal RAG sample app (GitHub) | New | Plug-and-play RAG app that runs multimodal search against the wizard-created index and fetches text and image annotations. | High-code |
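For step 5, the wizard configures image persistence for you. If you're assembling the skillset yourself, the cropped images land in the knowledge store through a file projection roughly along these lines; this is a sketch only, and the container name and enrichment path are assumptions to confirm against the knowledge store documentation.

```python
# Fragment to merge into the skillset dictionary from the earlier sketch.
# The storage connection string and container name are placeholders.
knowledge_store = {
    "storageConnectionString": "<azure-storage-connection-string>",
    "projections": [
        {
            "tables": [],
            "objects": [],
            # File projections write each normalized (cropped) image as a blob,
            # so the RAG app can link image citations straight to the source visual.
            "files": [
                {
                    "storageContainer": "extracted-images",
                    "source": "/document/normalized_images/*",
                }
            ],
        }
    ],
}

skillset["knowledgeStore"] = knowledge_store  # 'skillset' from the REST sketch above
```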
How to get started
- Go to the How-to Multimodal Search wizard documentation for a step-by-step process to create a RAG-ready multimodal index.
- Go to the Multimodal sample app code repo in GitHub to get a code-ready app you can plug the index from step #1 into. You can also take a code-only approach and use the sample app as the basis for your end-to-end multimodal RAG app.
This GitHub sample supports text and image citations, train-of-thought functionality, adjusting the number of top chunks retrieved for data grounding, and enabling/disabling the semantic ranker.
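Those retrieval knobs map to ordinary query parameters. Here's a sketch with the `azure-search-documents` Python SDK; the index name, field names (`content_text`, `content_embedding`, `content_path`), and the semantic configuration name are assumptions standing in for whatever your wizard-created index actually uses.

```python
from azure.core.credentials import AzureKeyCredential
from azure.search.documents import SearchClient
from azure.search.documents.models import VectorizableTextQuery

# Placeholders: your search service, query key, and wizard-created index name.
search_client = SearchClient(
    endpoint="https://<your-search-service>.search.windows.net",
    index_name="<wizard-created-index>",
    credential=AzureKeyCredential("<query-key>"),
)

query = "How do I request access to the HR system?"
top_chunks = 5               # "number of top chunks" knob
use_semantic_ranker = True   # semantic ranker toggle

results = search_client.search(
    search_text=query,
    vector_queries=[
        # The vectorizer attached to the index turns the query text into an embedding server-side.
        VectorizableTextQuery(text=query, k_nearest_neighbors=top_chunks, fields="content_embedding")
    ],
    query_type="semantic" if use_semantic_ranker else "simple",
    semantic_configuration_name="default" if use_semantic_ranker else None,
    top=top_chunks,
)

for doc in results:
    # content_text / content_path are assumed field names for the chunk text and stored image.
    print(doc.get("content_text", "")[:120], "|", doc.get("content_path"))
```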
Developer tips
- Try the sample app code for code-ready end-to-end functionality.
- The document extraction skill provides text and image extraction, but it lacks text-location metadata such as polygon boxes and page numbers, and its multimodal capabilities are available for PDF files only. For more advanced image and text extraction, use the Document Intelligence content layout skill.
- Adapt the code for customized or extra functionality with existing skills (including custom skills) and index projections; see the sketch after this list.
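If you do extend the pipeline in code, index projections are how per-chunk enrichments get written as their own search documents. Here's a sketch of the shape, continuing the skillset dictionary from the earlier REST example; the target index, key field, and mapping names are illustrative assumptions.

```python
# Fragment to merge into the skillset dictionary; names are illustrative only.
skillset["indexProjections"] = {
    "selectors": [
        {
            "targetIndexName": "<wizard-created-index>",
            "parentKeyFieldName": "parent_id",
            "sourceContext": "/document/pages/*",  # one search document per text chunk
            "mappings": [
                {"name": "content_text", "source": "/document/pages/*"},
                {"name": "content_embedding", "source": "/document/pages/*/text_vector"},
                {"name": "title", "source": "/document/metadata_storage_name"},
            ],
        }
    ],
    # Keep only the chunk-level documents; skip indexing the parent documents.
    "parameters": {"projectionMode": "skipIndexingParentDocuments"},
}
```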
What’s next?
- What’s new in Azure AI Search.
- Public preview documentation for the multimodal wizard and the GitHub multimodal sample app repo. Let users “see” the answers they’ve been missing with full multimodal annotations!
- Multimodal functionality documentation.
- Feedback: Share thoughts in the comment section of this post.
Updated May 20, 2025