Microsoft Foundry Blog

7 MIN READ

Automate RAG Indexing: Azure Logic Apps & AI Search for Source Document Processing

allisonsparrow

Microsoft

Oct 10, 2024

Credits: Technical content by Azure AI Search Senior PM Gia Mondragon

Introduction

When working with RAG (Retrieval-Augmented Generation) applications, the retriever, such as Azure AI Search, plays a crucial role in obtaining the most relevant results for the language model to deliver a response to end users. To do this well, the index must store data representations, such as vectors, that capture the meaning of the content so it can be matched to semantically similar user searches; vectors are a key component of vector and hybrid search. The task of parsing, chunking and vectorizing data and storing it in an index is handled by an Azure AI Search feature known as integrated vectorization. For supported data sources, this functionality also enables automated data ingestion, enrichment and processing.

However, many data sources are not directly integrated with AI Search but are accessible through the variety of connectors available in Azure Logic Apps. Azure Logic Apps has introduced new functionality that facilitates every step required to process unstructured documents from those connectors. Pulling files, extracting and parsing data, chunking, vectorizing and indexing into Azure AI Search are now streamlined into one integrated flow. Additionally, Azure Logic Apps now offers templates for high-demand connectors, with predefined indexing workflows for RAG-ready AI Search indexes, simplifying the creation of these workflows. These templates include indexing pipelines for files located in SharePoint Online, Azure Files and SFTP, among others.

How to get started

Prerequisites:

A data source with unstructured data supported by Azure Logic Apps connectors
Azure Logic App (Workflow Service Plan)
Azure AI Search service
Azure OpenAI service with a deployed text embedding model
Azure Logic App built-in template, so you don’t have to create your own workflow. You can also build your own, but that is not covered in this blog post. This tutorial shows how to index files that you add to an Azure Files share after the workflow is created.

Azure AI Search index creation

At this time, this integration requires an index in Azure AI Search with at least the following schema. Later in this article we will explain how you can update the workflow to map more fields to each document chunk.

Azure AI Search index: Minimum schema needed for this integration

Note: The sample index definitions below include a vector field with 3072 dimensions, corresponding to the Azure OpenAI text-embedding-3-large model. If you use a different Azure OpenAI embedding model or a different dimensionality, you must adjust the index definition accordingly before index creation.

{
  "name": "chunked-index",
  "fields": [
    {
      "name": "id",
      "type": "Edm.String",
      "searchable": true,
      "retrievable": true,
      "key": true
    },
    {
      "name": "documentName",
      "type": "Edm.String",
      "searchable": true,
      "retrievable": true
    },
    {
      "name": "content",
      "type": "Edm.String",
      "searchable": true,
      "retrievable": true
    },
    {
      "name": "embeddings",
      "type": "Collection(Edm.Single)",
      "searchable": true,
      "filterable": false,
      "retrievable": true,
      "dimensions": 3072,
      "vectorSearchProfile": "vector-profile"
    }
  ],
  "vectorSearch": {
    "algorithms": [
      {
        "name": "vector-config",
        "kind": "hnsw",
        "hnswParameters": {
          "metric": "cosine",
          "m": 4,
          "efConstruction": 400,
          "efSearch": 500
        },
        "exhaustiveKnnParameters": null
      }
    ],
    "profiles": [
      {
        "name": "vector-profile",
        "algorithm": "vector-config"
      }
    ]
  }
}
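The note on dimensionality can be applied mechanically before the index is created. The following is a minimal sketch, assuming the index definition has been loaded into a Python dict; the helper name `set_embedding_dimensions` and the lookup table are illustrative and not part of the template. The table lists the default output sizes of three Azure OpenAI embedding models; if you request a reduced size from a text-embedding-3 model, use that value instead.

```python
import copy

# Default output dimensions of common Azure OpenAI embedding models.
DEFAULT_DIMENSIONS = {
    "text-embedding-3-large": 3072,
    "text-embedding-3-small": 1536,
    "text-embedding-ada-002": 1536,
}


def set_embedding_dimensions(index_definition, model_name, vector_field="embeddings"):
    """Return a copy of the index definition whose vector field matches the model."""
    dimensions = DEFAULT_DIMENSIONS[model_name]  # raises KeyError for unknown models
    updated = copy.deepcopy(index_definition)
    for field in updated["fields"]:
        if field["name"] == vector_field:
            field["dimensions"] = dimensions
            return updated
    raise ValueError(f"No field named {vector_field!r} in the index definition")


# Trimmed-down stand-in for the minimum schema above (illustration only).
minimal_index = {
    "name": "chunked-index",
    "fields": [
        {"name": "id", "type": "Edm.String", "key": True},
        {"name": "embeddings", "type": "Collection(Edm.Single)", "dimensions": 3072},
    ],
}

small_index = set_embedding_dimensions(minimal_index, "text-embedding-3-small")
print(small_index["fields"][1]["dimensions"])
```

The helper returns a copy, so the original definition stays available for comparison.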

Azure AI Search index: Vectorization at query time

If you need Azure AI Search to also vectorize your queries at query time, instead of performing this operation in the orchestrator of your RAG application, you can use the following JSON definition for your index. Make sure to replace the Azure OpenAI endpoint with your own. Also, create a service-managed identity for your AI Search service, and follow the instructions to assign it the Cognitive Services OpenAI User role on your Azure OpenAI service.

{
  "name": "chunked-index",
  "fields": [
    {
      "name": "id",
      "type": "Edm.String",
      "searchable": true,
      "retrievable": true,
      "key": true
    },
    {
      "name": "documentName",
      "type": "Edm.String",
      "searchable": true,
      "retrievable": true
    },
    {
      "name": "content",
      "type": "Edm.String",
      "searchable": true,
      "retrievable": true
    },
    {
      "name": "embeddings",
      "type": "Collection(Edm.Single)",
      "searchable": true,
      "filterable": false,
      "retrievable": true,
      "dimensions": 3072,
      "vectorSearchProfile": "vector-profile"
    }
  ],
  "vectorSearch": {
    "algorithms": [
      {
        "name": "vector-config",
        "kind": "hnsw",
        "hnswParameters": {
          "metric": "cosine",
          "m": 4,
          "efConstruction": 400,
          "efSearch": 500
        },
        "exhaustiveKnnParameters": null
      }
    ],
    "profiles": [
      {
        "name": "vector-profile",
        "algorithm": "vector-config",
        "vectorizer": "azureOpenAI-vectorizer"
      }
    ],
    "vectorizers": [
      {
        "name": "azureOpenAI-vectorizer",
        "kind": "azureOpenAI",
        "azureOpenAIParameters": {
          "resourceUri": "https://<yourAOAIendpoint>.openai.azure.com",
          "deploymentId": "text-embedding-3-large",
          "modelName": "text-embedding-3-large"
        }
      }
    ]
  }
}
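The only differences between the two index definitions are the vectorizers section and the vectorizer reference on the profile. As a rough illustration, the sketch below derives the second definition from the first; it assumes the definition is held in a Python dict, and the function name `add_azure_openai_vectorizer` and the sample resource URI are hypothetical.

```python
import copy


def add_azure_openai_vectorizer(index_definition, resource_uri, deployment_id,
                                model_name, vectorizer_name="azureOpenAI-vectorizer"):
    """Return a copy of the index definition with a query-time vectorizer attached."""
    updated = copy.deepcopy(index_definition)
    vector_search = updated["vectorSearch"]
    # Declare the vectorizer once at the vectorSearch level...
    vector_search["vectorizers"] = [
        {
            "name": vectorizer_name,
            "kind": "azureOpenAI",
            "azureOpenAIParameters": {
                "resourceUri": resource_uri,
                "deploymentId": deployment_id,
                "modelName": model_name,
            },
        }
    ]
    # ...and reference it by name from every vector profile.
    for profile in vector_search["profiles"]:
        profile["vectorizer"] = vectorizer_name
    return updated


# Trimmed-down stand-in for the minimum schema (illustration only).
base_index = {
    "name": "chunked-index",
    "fields": [],
    "vectorSearch": {
        "algorithms": [{"name": "vector-config", "kind": "hnsw"}],
        "profiles": [{"name": "vector-profile", "algorithm": "vector-config"}],
    },
}

with_vectorizer = add_azure_openai_vectorizer(
    base_index,
    resource_uri="https://my-aoai.openai.azure.com",  # hypothetical resource name
    deployment_id="text-embedding-3-large",
    model_name="text-embedding-3-large",
)
print(with_vectorizer["vectorSearch"]["profiles"][0]["vectorizer"])
```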

Create index from JSON in Azure portal

This is how you can create the index from the Azure portal using the JSON template above:

Go to your AI Search service, select Search Management > Indexes, click Add index, and select Add index (JSON) from the dropdown menu.

Figure 1 - Create Azure AI Search index from JSON using Azure portal

Delete the JSON structure that appears on the right, copy whichever JSON template above fits your needs, and paste it into the canvas on the right. Click Save.

Figure 2 - Copy the JSON template provided in this tutorial for index creation

The index that is created with the template is called chunked-index and we’ll use it as the target index in this example.
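If you prefer to script this step instead of using the portal, the same JSON can be sent to the Azure AI Search REST API. The sketch below only builds the request so it can be inspected before anything is sent. The service name and key are placeholders, the function name `build_create_index_request` is hypothetical, and the `2024-07-01` API version is an assumption; check which versions your service supports.

```python
import json
import urllib.request

API_VERSION = "2024-07-01"  # assumption: adjust to a version your service supports


def build_create_index_request(search_endpoint, admin_key, index_definition):
    """Build (but do not send) a PUT request that creates or updates the index."""
    index_name = index_definition["name"]
    url = f"{search_endpoint.rstrip('/')}/indexes/{index_name}?api-version={API_VERSION}"
    body = json.dumps(index_definition).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="PUT",
        headers={"Content-Type": "application/json", "api-key": admin_key},
    )


request = build_create_index_request(
    "https://my-search-service.search.windows.net",  # hypothetical service name
    "<your-admin-key>",                               # placeholder, never hard-code keys
    {"name": "chunked-index", "fields": []},          # paste the full definition here
)
print(request.get_method(), request.full_url)
# To send it for real: urllib.request.urlopen(request)
```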

Using Azure Logic App workflow templates to import data from your unstructured data source

Go to your Logic App resource, click on Workflows > Workflows and then click on +Add > Add from Template.

Figure 3 - Add workflow from Template in Azure Logic App

Type azure ai search in the search box and choose the template that aligns with your data source. You should be able to use any Azure Logic Apps connector that supports unstructured data to import data to AI Search with this same chunking and embedding pattern, but you will need to modify the workflow according to your needs.
In this case we will choose the “Azure Files: Ingest and index documents at a schedule using Azure OpenAI and Azure AI Search - RAG pattern” template.

Figure 4 - Choose Azure Files RAG template

Select Use this template

Figure 5 - Review the workflow and select it

Choose a workflow name and type it. Click on Next.

Figure 6 - Name your workflow

Click on Connect for each connection in the template configuration and add the existing endpoints for your data source (in this case Azure Files), your Azure AI Search service and your AOAI service.

Figure 7 - Configure connector connections

Examples of each connection configuration follow. Make sure that you have at least the Contributor role on the resources so that you can establish the connections.

For Azure Files connection: Your Azure Storage account URI is under your Azure Storage account Settings > Endpoints > File Service and the domain is .file.core.windows.net. You can find the connection string under the Storage Account Security + Network > Keys > Connection String.

Copy the URI into the Storage Account URI field and the connection string into its corresponding field.

Figure 8 - Azure Files connection configuration

For Azure AI Search connection: The Azure AI Search endpoint URL is under your AI Search service Overview > Essentials > URL and the domain is .search.windows.net.

If your setup uses an admin key, you’ll find it under AI Search service Settings > Keys > Primary Admin Key.

Figure 9 - AI Search connection

For Azure OpenAI connection: The Azure OpenAI endpoint URL is under your Azure OpenAI service Resource Management > Keys and Endpoint > Endpoint and the domain suffix is .openai.azure.com. For key-based setup, copy Key1 and paste it into the Authentication Key field.

Figure 10 - Azure OpenAI connection
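A common mistake at this step is pasting an endpoint into the wrong connection. A quick sanity check, sketched here with a hypothetical helper named `check_endpoint`, is to compare each URL's host against the domain suffixes named above. It only inspects the URL text and does not contact any service.

```python
from urllib.parse import urlparse

# Domain suffixes as described in the connection steps above.
EXPECTED_SUFFIXES = {
    "azure_files": ".file.core.windows.net",
    "ai_search": ".search.windows.net",
    "azure_openai": ".openai.azure.com",
}


def check_endpoint(service, url):
    """Return True if the URL uses https and the expected domain suffix."""
    parsed = urlparse(url)
    host = parsed.hostname or ""
    return parsed.scheme == "https" and host.endswith(EXPECTED_SUFFIXES[service])


# The resource names below are placeholders for illustration.
print(check_endpoint("azure_files", "https://mystorage.file.core.windows.net/"))
print(check_endpoint("azure_openai", "https://my-search.search.windows.net"))
```

The first call passes and the second fails, because a Search endpoint was supplied where an Azure OpenAI endpoint was expected.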

After configuring all services connections, click Next.

Figure 11 - Connections configuration complete

Fill out the following indexing configuration details. It assumes that:

You already created the index with one of the templates above.
You have an Azure OpenAI embedding model deployment called “text-embedding-3-large” in your AOAI instance.

Indexing Workflow configuration details:

AISearch index name: This is the name of the index that we’ve created as part of this tutorial.
OpenAI text embedding deployment identifier: text-embedding-3-large. This is the name of the Azure OpenAI embedding model deployment, not the model name (in this case the two happen to be the same).

Figure 12 - Azure OpenAI embedding model name

Azure Files storage Folder Name: This is the name of your Azure Files file share, where your files are located.

Figure 13 - Azure Files share name
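The distinction between deployment name and model name matters because Azure OpenAI addresses embeddings by deployment in the request URL. The sketch below builds such a request without sending it, to show where the deployment identifier ends up. The resource name, the function name `build_embeddings_request` and the `2024-02-01` API version are assumptions for illustration, and this is not necessarily how the template calls the service internally.

```python
import json
import urllib.request


def build_embeddings_request(aoai_endpoint, api_key, deployment_name, texts,
                             api_version="2024-02-01"):
    """Build (but do not send) an Azure OpenAI embeddings request."""
    url = (
        f"{aoai_endpoint.rstrip('/')}/openai/deployments/{deployment_name}"
        f"/embeddings?api-version={api_version}"
    )
    body = json.dumps({"input": texts}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={"Content-Type": "application/json", "api-key": api_key},
    )


request = build_embeddings_request(
    "https://my-aoai.openai.azure.com",   # hypothetical resource name
    "<your-key>",                         # placeholder
    "text-embedding-3-large",             # the deployment name, not the model name
    ["hello"],
)
print(request.full_url)
```

If the deployment had been named differently from the model, only the URL segment after `/deployments/` would change.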

Click Next and then Create.

Figure 14 - Review workflow details and create

Click on Go to My Workflow and wait until the initial run is completed. The workflow is triggered on a schedule to check for new files added to your Azure Files share. After you add new files to your configured file share, you should see them reflected in your AI Search index.
As soon as you have initial vectorized data in your index, you can use the index from this tutorial to chat with your data from your preferred RAG orchestrator, such as Azure AI Studio.
To use your Azure AI Search index in Azure AI Studio go to Project Playground > Chat > Add your data > Add a new data source and follow the instructions to set up your index.
Note: If you created an index with the minimal JSON configuration in this tutorial, follow the instructions in the Azure AI Studio documentation as they are. However, if you used the option of adding an index vectorizer, remove the vector option from the AI Studio configuration, since the index will contact your embedding model directly.

Figure 15 - Azure AI Studio Chat Playground "Add your data"

Additional considerations

For optimal AI search relevance, consider using a hybrid approach that combines vector and keyword search along with a semantic ranker. This method is generally more effective for many use cases. For more information, please visit: Azure AI Search Outperforming Vector Search with Hybrid.
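To make the hybrid idea concrete, here is a minimal sketch of a request body for the Search Documents REST operation (a POST to `/indexes/chunked-index/docs/search`) that sends the same text as both a keyword query and a vector query. It assumes the field names from the index definitions in this tutorial; the function name `build_hybrid_query` is hypothetical. The semantic ranker is left out because the sample indexes do not define a semantic configuration.

```python
import json


def build_hybrid_query(query_text, k=5, query_vector=None, vector_field="embeddings"):
    """Build a hybrid (keyword + vector) search request body."""
    if query_vector is None:
        # Index with a vectorizer: the service embeds the query text itself.
        vector_query = {"kind": "text", "text": query_text}
    else:
        # Minimal index: the caller supplies a precomputed query embedding.
        vector_query = {"kind": "vector", "vector": list(query_vector)}
    vector_query.update({"fields": vector_field, "k": k})
    return {
        "search": query_text,             # keyword part of the hybrid query
        "vectorQueries": [vector_query],  # vector part of the hybrid query
        "select": "documentName,content",
        "top": k,
    }


body = build_hybrid_query("How do I rotate storage keys?")
print(json.dumps(body, indent=2))
```

Passing `query_vector` switches the body to the form needed for the minimal index, where your application computes the embedding before calling the service.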

This case focuses specifically on fixed-chunking and text-only scenarios.
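For readers unfamiliar with fixed chunking, the sketch below shows the general technique: the text is cut into windows of a fixed size with an optional overlap, and each window becomes one index document. It is an illustration under stated assumptions, not the template's actual implementation: the character-based sizes, the overlap and the key scheme are all hypothetical, and the embeddings field is omitted because it would be filled by the embedding model.

```python
import base64


def chunk_text(text, chunk_size=1000, overlap=100):
    """Split text into fixed-size character windows with the given overlap."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("require chunk_size > 0 and 0 <= overlap < chunk_size")
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
        start += chunk_size - overlap
    return chunks


def chunks_to_documents(document_name, chunks):
    """Map chunks to documents shaped like the minimum index schema."""
    # Document keys may only contain letters, digits, '_', '-' and '=',
    # so the file name is URL-safe base64 encoded before use.
    key_prefix = base64.urlsafe_b64encode(document_name.encode("utf-8")).decode("ascii")
    return [
        {"id": f"{key_prefix}_{i}", "documentName": document_name, "content": chunk}
        for i, chunk in enumerate(chunks)
    ]


chunks = chunk_text("abcdefghij", chunk_size=4, overlap=2)
docs = chunks_to_documents("notes.txt", chunks)
print(chunks)
print(docs[0])
```

With a window of 4 and an overlap of 2, the ten-character sample yields four chunks, each sharing two characters with its neighbor.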

Get started building your RAG application with low code using Azure AI Search and Azure Logic Apps.

Azure AI Search and its latest features.   
Azure Logic Apps and Azure Logic App connector for AI Search index
Logic App connectors.
Create a search service in the Azure Portal, Azure CLI, the Management REST API, ARM template, or a Bicep file.   
Retrieval Augmented Generation in Azure AI Search.  
Pricing:

Updated Oct 10, 2024

Version 1.0
