AI - Azure AI services Blog

Prep your Data for RAG with Azure AI Search: Content Layout, Markdown Parsing & Improved Security

gia_mondragon

Microsoft

Nov 19, 2024

Unlock new chunking strategies, parsing modes and security in RAG applications with Azure AI Search’s latest preview features for the indexing pipeline.

Introduction

We’re excited to announce new preview features in Azure AI Search, specifically designed to enhance data preparation, enrichment, and indexing processes for Retrieval-Augmented Generation (RAG) applications. These updates include the document layout skill—a high-level parser powered by Azure AI Document Intelligence that adapts to scenarios requiring rich content extraction and indexing—and enhanced security with managed identities and private endpoints. Together, these features, along with markdown as a parsing mode, provide organizations with fine-grained control over data enrichment, security, and indexing, enabling smarter, more efficient, and more secure RAG solutions.

These capabilities are now available through the REST API version 2024-11-01-preview and can also be accessed via the newest beta SDKs in .NET, Python, Java, and JavaScript.

These new features are enabled by Azure AI Search’s built-in indexers, which allow users to automate data ingestion and transformation from various data sources. Built-in indexers can apply AI enrichment via skillsets—configurable pipelines that leverage AI skills like OCR, entity recognition, and the new document layout skill—to enrich and enhance data before indexing. These skillsets help customers extract meaningful information from their data, making it easier to search and retrieve relevant content. The new markdown parsing mode and document layout skill work within these built-in indexers to support advanced data preparation and indexing workflows.

With these functionalities, Azure AI Search now supports both fixed-size and structure-aware chunking natively during data enrichment. Fixed-size chunking with overlap—where documents are divided into equal-sized sections with slight overlaps—is effective for simple, uniform documents like articles. For more complex content, structure-aware chunking divides documents into sections aligned with their natural structure, improving retrieval accuracy and relevance.
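To make fixed-size chunking with overlap concrete, here is a minimal Python sketch. It is an illustration only, not the service's implementation: the function name `fixed_size_chunks` is hypothetical, and sizes are measured in characters for simplicity, which is an assumption.

```python
def fixed_size_chunks(text: str, chunk_size: int = 200, overlap: int = 50) -> list:
    """Split text into chunks of at most chunk_size characters.

    Each chunk starts `overlap` characters before the previous chunk ended,
    so neighbouring chunks share a small amount of content.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be at least 0 and smaller than chunk_size")

    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this chunk already reaches the end of the text
    return chunks
```

Note that the window advances without looking at headers, paragraphs, or sentences, which is why this approach suits uniform text better than richly structured documents.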

Why Structure-Aware Chunking Matters for RAG

Azure AI Search’s new native structure-aware chunking capabilities adapt to each document's unique structure, making RAG solutions more effective. Here’s why structure-aware chunking is key for high-quality retrieval:

1. Enhanced Contextual Understanding

- By breaking documents into context-rich sections based on logical markers (headers, subsections), RAG applications can retrieve content within its intended context. This allows Large Language Models (LLMs) to answer queries with more clarity, as each chunk retains its full meaning.

2. Enhanced Retrieval Focus

- Chunking documents based on their structure may assist in refining search results to more relevant sections, which could be particularly beneficial for complex texts like technical documentation or industry-specific guidelines. This approach helps direct users to specific segments of information, potentially minimizing the noise from unrelated content and enhancing the overall user experience.

3. Optimized LLM Performance

- LLMs perform best with targeted data chunks. Structure-aware chunking helps ensure that only relevant sections are processed, enhancing response accuracy.

New Data Ingestion and Enrichment Preview Features in Azure AI Search

Note: The new features mentioned in this article can be enabled through the "Import and vectorize data" wizard in the Azure portal.

Document Layout Skill: Parsing for Complex Documents

The document layout skill, part of Azure AI Search’s AI enrichment capabilities, is configured through skillsets within built-in indexers. It functions as a high-level parser, designed to adapt to scenarios requiring in-depth content extraction and indexing of richly structured documents. Powered by the Azure AI Document Intelligence layout model, the document layout skill enables advanced parsing for nuanced document layouts, making it ideal for complex data preparation.

Structure-Aware Chunking: As part of the AI enrichment process, the document layout skill organizes content into coherent markdown sections based on document structure, such as headers and subsections. This structured chunking enhances RAG applications by allowing each section to retain its contextual meaning.
Advanced Parsing for Key Document Elements: Beyond chunking, the document layout skill can extract and index structured elements like tables and lists, ensuring that critical content is available for precise retrieval. This is especially valuable for documents where specific data points must be indexed separately.
Hierarchical Relationship Mapping: The skill maintains relationships between content sections, preserving the document’s logical structure. For instance, technical manuals or regulatory documents can be indexed in a way that retains content hierarchies.
Adaptive Scenarios: As a high-level parser, the document layout skill is versatile, suitable for documents where rich content parsing is essential.
Benefits for RAG: By transforming complex documents into structured, layout-based chunks with clear hierarchical relationships, the document layout skill enables RAG applications to respond with greater accuracy and context. This feature is ideal when content structure is paramount.

Picture 1 - High-level document layout skill flow in AI Search built-in indexers pipeline. This doesn't show interaction with other Azure Services during the skillset execution.

Fixed-Chunking Limitations and AI Document Intelligence Solution

Why and When Fixed-Size Chunking Falls Short:

Fixed-size chunking can struggle with complex document layouts, as it divides content uniformly without respecting document structure. This may lead to disjointed sections and a loss of meaningful context in RAG applications.

Improved Results with AI Document Intelligence Layout Model:

The example below showcases a small sample document that has been divided into passages without considering the original document structure. This approach highlights the potential limitations of fixed-size chunking when extracting relevant data from documents with rich and complex layouts.

To make the concept visually clear and easy to grasp, this example uses approximate rather than exact fixed-size chunks and overlap, as the focus here is on illustrating the challenges rather than achieving technical precision. This simplified demonstration is designed to help readers quickly understand the impact of fixed-size chunking in contrast to structure-aware approaches.

Imagine this small text is divided into three distinct passages or chunks, with some overlapping content. The blue rectangle represents the first chunk (chunk #1), the green rectangle the second chunk (chunk #2), and the red rectangle the third chunk (chunk #3). This visualization helps illustrate how fixed-size chunking with overlap works.

Picture 2 - Text extracted from Introduction to Azure AI Search - Azure AI Search | Microsoft Learn with 3 chunks with overlap showing how fixed-size chunking works.

If you were to ask a question about the steps for an end-to-end exploration of core search features in the Azure portal, the answer would span two separate chunks: blue (chunk #1) and green (chunk #2). Since neither chunk contains the full context to answer the question with certainty, this illustrates a key limitation of fixed-size chunking in such scenarios.

Now, let’s examine two structure-aware chunks created from the same content.

These chunks not only preserve the headers but also replicate Header 1 (H1) in both chunks, as it is relevant to both sections that follow. Additionally, the content under Header 2 (H2) is kept intact, including the points necessary to fully answer the question posed earlier. With a single structure-aware chunk (chunk #1 below), the question about the four steps to explore core search features in the portal can now be answered with complete and accurate context.

Picture 3 - Text extracted from Introduction to Azure AI Search - Azure AI Search | Microsoft Learn with a chunk that would preserve the structure of the document in the same chunk.

Picture 4 - Text extracted from Introduction to Azure AI Search - Azure AI Search | Microsoft Learn with a chunk that would preserve the structure of the document in the same chunk.
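The idea of splitting at headers while repeating the parent header in every chunk can be sketched in a few lines of Python. This is a simplified illustration and not how the Document Intelligence layout model works internally: the function name `structure_aware_chunks` and the output shape are hypothetical, and the sketch ignores edge cases such as `#` characters inside code fences.

```python
import re

HEADER_RE = re.compile(r"^(#{1,6})\s+(.*)$")


def structure_aware_chunks(markdown: str, max_depth: int = 2) -> list:
    """Split markdown at headers up to max_depth.

    Each chunk records the full header path (for example H1 > H2) that was
    in effect for its content, so a parent header is repeated across chunks.
    """
    chunks = []
    path = {}  # header level -> header title currently in effect
    body = []  # body lines collected under the current header path

    def flush():
        text = "\n".join(body).strip()
        if text:
            headers = [path[level] for level in sorted(path)]
            chunks.append({"headers": headers, "content": text})
        body.clear()

    for line in markdown.splitlines():
        match = HEADER_RE.match(line)
        if match and len(match.group(1)) <= max_depth:
            flush()
            level = len(match.group(1))
            # A new header replaces any header at the same or a deeper level.
            for stale in [lvl for lvl in path if lvl >= level]:
                del path[stale]
            path[level] = match.group(2).strip()
        else:
            body.append(line)
    flush()
    return chunks
```

For a document with one H1 and two H2 sections, this yields two chunks whose `headers` lists both begin with the H1 title, mirroring the two structure-aware chunks shown in the pictures.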

Prerequisites for Starting this Integration:

Azure AI Search Service: Ensure you have an active Azure AI Search service set up for the integration.
AI Services Multi-Service Resource: Required for billing associated with the AI Document Intelligence Layout skill, this multi-service AI account covers costs specifically for document layout analysis services. Note that these charges are separate from Azure AI Search billing. Please review Azure AI Document Intelligence layout model pricing carefully to understand potential costs before using this integration. If you do not specify a multi-service resource when configuring the document layout skill, your search service defaults to the free AI enrichments available to your indexer each day. However, this limits execution to 20 transactions per indexer invocation, after which processing halts and a 'Time Out' message appears in the indexer's execution history. To process additional documents and ensure uninterrupted functionality, attach an AI Services multi-service resource to the skillset.
Your Azure AI Search service and your AI Services multi-service resource must be in one of the following regions: East US, West US 2, West Europe, or North Central US.
Azure OpenAI Embedding Model Deployment: Needed if you’re using integrated vectorization, which is highly recommended for RAG applications. Integrated vectorization allows for automatic vector creation from extracted content and at query time, optimizing retrieval quality and relevance in RAG implementations.

JSON Configuration Example:

Here’s a sample JSON configuration for setting up the document layout skill in a skillset.

{
  "skills": [
    {
      "description": "Analyze a document",
      "@odata.type": "#Microsoft.Skills.Util.DocumentIntelligenceLayoutSkill",
      "context": "/document",
      "outputMode": "oneToMany",
      "markdownHeaderDepth": "h3",
      "inputs": [
        {
          "name": "file_data",
          "source": "/document/file_data"
        }
      ],
      "outputs": [
        {
          "name": "markdown_document", 
          "targetName": "markdown_document" 
        }
      ]
    }
  ]
}

Keep in mind that even after parsing a document using the AI Document Intelligence layout model with markdown output, very long sections may still require additional fixed-size chunking. This is necessary because such sections might exceed the RAG-optimal input size for Large Language Models (like GPT-4o), which can impact the relevance in many scenarios. For detailed guidance on determining the optimal chunk size based on multiple use cases, refer to the article Azure AI Search: Outperforming vector search with hybrid retrieval and ranking capabilities | Microsoft Community Hub. If you're using the Import and vectorize data wizard in the Azure portal, this secondary fixed-size chunking option is automatically handled by default.
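As a rough illustration of that secondary pass, the Python sketch below keeps a section whole when it fits and otherwise cuts it into overlapping fixed-size pieces, repeating the section header on each piece. The function name `split_long_section` and the default sizes (2000 characters with a 200-character overlap) are assumptions chosen for the example, not values taken from the service or the wizard.

```python
def split_long_section(header: str, content: str,
                       max_chars: int = 2000, overlap: int = 200) -> list:
    """Return a section as one chunk if it fits within max_chars.

    Longer sections are cut into overlapping fixed-size pieces, and the
    section header is repeated on every piece so each one keeps its context.
    """
    if not 0 <= overlap < max_chars:
        raise ValueError("overlap must be at least 0 and smaller than max_chars")
    if len(content) <= max_chars:
        return [f"{header}\n{content}"]

    step = max_chars - overlap  # how far the window advances each time
    pieces = []
    for start in range(0, len(content), step):
        pieces.append(f"{header}\n{content[start:start + max_chars]}")
        if start + max_chars >= len(content):
            break  # this piece already reaches the end of the section
    return pieces
```

Repeating the header is a design choice in this sketch: it preserves some of the structural context that would otherwise be lost when a long section is split.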

Markdown Parsing Mode for Structured RAG Retrieval

Azure AI Search indexers offer multiple parsing modes: text, JSON, CSV, and now markdown. Markdown parsing, available as a new parsing mode in Azure AI Search’s built-in indexers, provides a structured way to process and index markdown files. By organizing content based on headers, markdown parsing enables more direct retrieval, making it ideal for content where each section can be accessed independently.

How It Works: Markdown parsing splits content by headers, creating searchable sections that make each content block accessible to RAG applications.
Value for RAG: Markdown parsing is useful for structured documents like FAQs or instructional content, where each question or topic can be indexed as an individual section for quick retrieval.

Configuration JSON Example:

Here’s a sample JSON configuration for enabling markdown parsing in Azure AI Search’s built-in indexers. For a detailed explanation of how to configure this in the end-to-end pipeline, review markdown parsing documentation.

POST https://[service name].search.windows.net/indexers?api-version=2024-11-01-preview
Content-Type: application/json
api-key: [admin key]

{
  "name": "my-markdown-indexer",
  "dataSourceName": "my-blob-datasource",
  "targetIndexName": "my-target-index",
  "parameters": {
    "configuration": {
      "parsingMode": "markdown",
      "markdownParsingSubmode": "oneToMany",
      "markdownHeaderDepth": "h6"
    }
  }
}

This capability is available in a few steps in the Azure portal through the Import and vectorize data wizard.
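If you prefer to script the call, the request can be assembled with the Python standard library. The sketch below only builds the request and does not send it. The helper name `build_markdown_indexer_request` and its parameters are hypothetical; the URL, headers, and body mirror the REST example in this section.

```python
import json
import urllib.request

VALID_HEADER_DEPTHS = {"h1", "h2", "h3", "h4", "h5", "h6"}


def build_markdown_indexer_request(service_name: str, admin_key: str,
                                   indexer_name: str, data_source_name: str,
                                   target_index_name: str,
                                   header_depth: str = "h6",
                                   submode: str = "oneToMany") -> urllib.request.Request:
    """Build (but do not send) the POST request that creates a markdown indexer."""
    if header_depth not in VALID_HEADER_DEPTHS:
        raise ValueError("header_depth must be one of h1 through h6")

    body = {
        "name": indexer_name,
        "dataSourceName": data_source_name,
        "targetIndexName": target_index_name,
        "parameters": {
            "configuration": {
                "parsingMode": "markdown",
                "markdownParsingSubmode": submode,
                "markdownHeaderDepth": header_depth,
            }
        },
    }
    url = (f"https://{service_name}.search.windows.net/indexers"
           "?api-version=2024-11-01-preview")
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json", "api-key": admin_key},
        method="POST",
    )
```

Passing the returned object to `urllib.request.urlopen` would send it. Serializing the body with `json.dumps` also avoids hand-written JSON mistakes such as trailing commas.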

Enhanced Security with Managed Identity and Private Endpoints for AI Services Integration

This preview also introduces new security features for the existing integration with AI Services for native AI skills that bring greater flexibility and control to Azure AI Search’s indexing pipeline:

Managed Identity for Keyless Connections

Newly supported managed identities allow Azure AI Search to connect securely to AI Services without relying on API keys. This approach simplifies security and supports cross-region connections for billing purposes, enabling seamless integration.

Private Endpoints for Multi-Service Accounts

Azure AI Search implements private endpoints through shared private links. Shared private links, now supporting AI Services, allow Azure AI Search to securely connect built-in indexers to an AI Services multi-service resource, ensuring that all billing-related calls remain private. For AI Services-dependent skills, Azure AI Search processes data through its own Azure AI Services resources, keeping data connections internal via the Azure backbone. However, some enterprises with strict security policies require that even billing calls are routed through a private endpoint, and this capability is now supported.

Impact on Secure RAG Applications

These security enhancements provide organizations with more robust data privacy controls, essential for building secure, scalable RAG deployments in regulated industries.

Prerequisites for Starting this Integration

Configure Azure AI Search to use a managed identity.
On your AI Services Multi-Service resource, assign the identity to the Cognitive Services User role.
Using the Azure portal, or the Skillset 2024-11-01-preview REST API, or an Azure SDK beta package that provides the syntax, configure a skillset to use an identity:
- The managed identity used on the connection belongs to the search service.
- The identity can be system-managed or user-assigned.
- The identity must have Cognitive Services User permissions on the Azure AI resource.
- @odata.type is always #Microsoft.Azure.Search.AIServicesByIdentity.
- subdomainUrl is the endpoint of your Azure AI multi-service resource. It can be in any region that's jointly supported by Azure AI Search and Azure AI services.

As with keys, the details you provide about the Azure AI Services resource are used for billing, not connections. All API requests made by Azure AI Search to Azure AI services for built-in skills processing continue to be internal and managed by Microsoft.

Configuration JSON Example:

System-managed identity

Below is an example JSON for configuring a system-managed identity with a skillset.

Identity is set to null.

PUT https://[service-name].search.windows.net/skillsets/[skillset-name]?api-version=2024-11-01-preview

{  
    "name": "my skillset name",  
    "skills":   
    [  
      // skills definition goes here 
    ],
    "cognitiveServices": {
        "@odata.type": "#Microsoft.Azure.Search.AIServicesByIdentity",
        "description": "",
        "subdomainUrl": "https://[subdomain-name].cognitiveservices.azure.com",
        "identity": null 
    }  
}

User-assigned managed identity

Below is an example JSON for configuring a user-assigned managed identity with a skillset.

Identity is set to the resource ID of the user-assigned managed identity. To find an existing user-assigned managed identity, see Manage user-assigned managed identities.

For a user-assigned managed identity, set the @odata.type and the userAssignedIdentity properties.

PUT https://[service-name].search.windows.net/skillsets/[skillset-name]?api-version=2024-11-01-preview

{  
    "name": "my skillset name",  
    "skills":   
    [  
      // skills definition goes here 
    ],
    "cognitiveServices": {
        "@odata.type": "#Microsoft.Azure.Search.AIServicesByIdentity",
        "description": "",
        "subdomainUrl": "https://[subdomain-name].cognitiveservices.azure.com",
        "identity": {
            "@odata.type": "#Microsoft.Azure.Search.DataUserAssignedIdentity",
            "userAssignedIdentity": "/subscriptions/{subscription-ID}/resourceGroups/{resource-group-name}/providers/Microsoft.ManagedIdentity/userAssignedIdentities/{user-assigned-managed-identity-name}"
        }
        }
    } 
}
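The two identity variants differ only in the value of the identity property, which the short Python sketch below makes explicit. The helper name `ai_services_by_identity` is hypothetical; the property names and @odata.type values are the ones used in the JSON examples in this section.

```python
import json
from typing import Optional


def ai_services_by_identity(subdomain_url: str,
                            user_assigned_identity_id: Optional[str] = None) -> dict:
    """Build the "cognitiveServices" section of a skillset definition.

    With no identity resource ID, "identity" is None (serialized as JSON null),
    which selects the search service's system-managed identity. With a resource
    ID, the section references that user-assigned managed identity instead.
    """
    if user_assigned_identity_id is None:
        identity = None
    else:
        identity = {
            "@odata.type": "#Microsoft.Azure.Search.DataUserAssignedIdentity",
            "userAssignedIdentity": user_assigned_identity_id,
        }
    return {
        "@odata.type": "#Microsoft.Azure.Search.AIServicesByIdentity",
        "description": "",
        "subdomainUrl": subdomain_url,
        "identity": identity,
    }
```

Calling `json.dumps` on the result produces the corresponding JSON fragment, with Python's `None` written as `null`.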

Exploring Additional Preview Features in Azure AI Search

Azure AI Search’s latest updates extend its support for RAG applications even further. Read our recently released blog posts, which cover additional features: