AI - Azure AI services Blog

Azure AI Search integration with Fabric OneLake, now in preview

gia_mondragon
Microsoft
May 21, 2024

Introduction 

In today's data-driven business landscape, enterprises have widely adopted lakehouses as an essential component for managing unstructured data efficiently. Recognizing this need, we are thrilled to introduce the OneLake files indexer in Azure AI Search (preview). This connector makes it easy to index data directly from a OneLake lakehouse provisioned in Microsoft Fabric. It can index file content and metadata, and it automatically tracks changes so that your index stays up to date. By simplifying data integration and accelerating indexing for retrieval augmented generation (RAG) applications, the OneLake files indexer aims to make working with Azure AI Search more streamlined and effective.

 

Figure 1 – RAG Architecture with OneLake files indexer

 

Indexing Files from OneLake and using Shortcuts 

The OneLake files indexer is designed to connect to a Microsoft Fabric tenant's OneLake and index its files. It can index files both directly from a OneLake lakehouse and through OneLake shortcuts. This indexer is geared specifically towards unstructured and semi-structured content, such as PDFs, Microsoft Office documents, images, and JSON or CSV files. It does not apply to highly structured table content, such as Parquet files. As a pull indexer, it is also compatible with features like AI enrichment and integrated vectorization, which transform and refine your data before it is added to your Azure AI Search index.

This indexer is a valuable addition to the existing data sources supported by Azure AI Search, such as Azure Blob Storage, Azure SQL and Azure Cosmos DB. 

 

Indexing Lakehouse Files in OneLake 

The OneLake files indexer allows for direct indexing of lakehouse files in OneLake. It can index files and associated metadata from one or more data paths within a lakehouse, and it supports incremental indexing to identify and index new and updated files and metadata.
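
As a rough illustration of what the indexer connects to, the sketch below builds a data source definition for a single lakehouse path. The `onelake` type, the `ResourceId=` connection string carrying the workspace GUID, and the `container` fields are assumptions based on a reading of the preview REST documentation and should be verified there. The function name, GUIDs, and folder name are hypothetical placeholders.

```python
import json


def onelake_data_source(name, workspace_id, lakehouse_id, path=None):
    """Build a data source body for one lakehouse folder or shortcut.

    Assumed shape (verify against the preview REST reference):
    - type "onelake"
    - connection string "ResourceId=<Fabric workspace GUID>"
    - container.name is the lakehouse GUID
    - container.query is an optional subfolder or shortcut name
    """
    container = {"name": lakehouse_id}
    if path:
        # Only restrict indexing to a subfolder/shortcut when one is given.
        container["query"] = path
    return {
        "name": name,
        "type": "onelake",
        "credentials": {"connectionString": f"ResourceId={workspace_id}"},
        "container": container,
    }


if __name__ == "__main__":
    body = onelake_data_source(
        "onelake-docs",                             # hypothetical name
        "00000000-0000-0000-0000-000000000000",     # placeholder workspace GUID
        "11111111-1111-1111-1111-111111111111",     # placeholder lakehouse GUID
        path="contracts",                           # hypothetical subfolder
    )
    print(json.dumps(body, indent=2))
```

To cover more than one data path, one option under these assumptions is to build one definition per path, each with its own indexer.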

 

OneLake Shortcuts for data in Amazon S3 and Google Cloud Storage 

The OneLake files indexer also supports indexing files through OneLake shortcuts to sources like Azure Data Lake Storage Gen2 (ADLS Gen2), Amazon S3, and Google Cloud Storage (GCS). You can reach documents on these platforms and integrate their metadata and content into your Azure AI Search index. This feature broadens the range of supported data sources for Azure AI Search and taps into the growing ecosystem of shortcuts supported by OneLake.

 

REST API and SDK Support 

The OneLake files indexer is part of REST API version 2024-05-01-preview. This feature is also available through the Azure portal and our latest preview SDKs for Python, .NET, Java, and JavaScript, so you can choose the integration method that best fits your needs.
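
For readers calling the REST API directly, here is a minimal sketch of how a create-or-update request against the preview API version might be assembled with the Python standard library. The service name, object name, and key are placeholders, and the helper `build_put_request` is a hypothetical name. The request is only built, not sent.

```python
import json
import urllib.request

# Preview API version that includes the OneLake files indexer.
API_VERSION = "2024-05-01-preview"


def build_put_request(service_name, collection, name, body, api_key):
    """Build (but do not send) a create-or-update request.

    `collection` is the REST collection, e.g. "datasources" or "indexers".
    """
    url = (
        f"https://{service_name}.search.windows.net/"
        f"{collection}/{name}?api-version={API_VERSION}"
    )
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        method="PUT",
        headers={"Content-Type": "application/json", "api-key": api_key},
    )


if __name__ == "__main__":
    request = build_put_request(
        "my-search-service",            # placeholder service name
        "indexers",
        "onelake-indexer",              # hypothetical indexer name
        {"name": "onelake-indexer"},    # body trimmed for illustration
        "<admin-api-key>",              # placeholder, never hard-code real keys
    )
    print(request.get_method(), request.full_url)
    # To actually send it: urllib.request.urlopen(request)
```

Key-based authentication is used here only to keep the sketch short. Whether you use a key or role-based access depends on how your service is configured.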

 

Configuring OneLake Files Indexer 

Use the "Import and vectorize data" wizard

To try this out, go to your AI Search service in the Azure portal and use the Import and vectorize data wizard. It'll take you from zero to a working search index for use in RAG apps in just a few steps.

 

  1. Follow the guidelines in OneLake files indexer documentation to enable a managed identity in your AI Search service and provide the permissions required for the connector to access the data in your lakehouse.
  2. Go to your OneLake lakehouse and write down the URL and the subfolder or shortcut name that contains the data you'd like to index.

    Figure 2 - OneLake lakehouse and subfolder/shortcut details

     

  3. Go to the Azure portal and, in your AI Search service, select the Import and vectorize data wizard.

     

    Figure 3 - Import and vectorize data wizard

     

  4. Choose OneLake files as your data source, enter the URL and subfolder or shortcut name that you wrote down in step 2, and click on Next.

    Figure 4 - OneLake datasource configuration

 

  5. In the following two steps of the wizard, as part of integrated vectorization:

     - Under the Vectorize text tab, choose your text embeddings configuration.

     - Under the Vectorize/enrich your images tab, optionally choose your image processing settings (multimodal embeddings and/or enrichments).

  6. In Advanced settings, we recommend enabling semantic ranker for improved results and putting the indexer on a schedule.

    Figure 5 - Configure Semantic ranker and indexer schedule

  7. Review and click on Create.
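
If you script the same setup instead of using the wizard, two details from the steps above need handling: the lakehouse URL and the indexer schedule. The sketch below pulls the workspace and lakehouse GUIDs out of a lakehouse URL and formats a schedule interval as an ISO 8601 duration. It assumes the URL has the shape `.../groups/<workspace GUID>/lakehouses/<lakehouse GUID>`, which may not hold for every Fabric experience. Both function names are hypothetical.

```python
import re
from urllib.parse import urlparse

_GUID = r"[0-9a-fA-F]{8}(?:-[0-9a-fA-F]{4}){3}-[0-9a-fA-F]{12}"
# Assumed URL path shape; adjust if your lakehouse URL looks different.
_LAKEHOUSE_PATH = re.compile(
    rf"/groups/(?P<workspace>{_GUID})/lakehouses/(?P<lakehouse>{_GUID})"
)


def parse_lakehouse_url(url):
    """Return (workspace_guid, lakehouse_guid) from a lakehouse URL."""
    match = _LAKEHOUSE_PATH.search(urlparse(url).path)
    if match is None:
        raise ValueError(f"not a recognized lakehouse URL: {url!r}")
    return match["workspace"], match["lakehouse"]


def schedule_interval(minutes):
    """Format an indexer schedule interval as an ISO 8601 duration.

    Indexer schedules accept intervals from 5 minutes up to 1 day.
    """
    if not 5 <= minutes <= 1440:
        raise ValueError("interval must be between 5 and 1440 minutes")
    hours, mins = divmod(minutes, 60)
    return "PT" + (f"{hours}H" if hours else "") + (f"{mins}M" if mins else "")


if __name__ == "__main__":
    workspace, lakehouse = parse_lakehouse_url(
        "https://app.fabric.microsoft.com/groups/"
        "00000000-0000-0000-0000-000000000000/lakehouses/"
        "11111111-1111-1111-1111-111111111111?experience=power-bi"
    )
    print(workspace, lakehouse)
    # The "schedule" property of an indexer definition takes this interval.
    print({"schedule": {"interval": schedule_interval(120)}})
```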

 

Use other options 

There are other ways to use the OneLake files indexer depending on your scenario: 

  1. Via the Azure portal: 
  • Use the Import Data wizard for conventional search applications (like powering the search box for private data in a custom application). 
  • Build the individual AI Search components of the indexing pipeline independently for added customization. These include the data source, index, optional skillset, and finally the indexer. 

    Figure 6 - Individual object creation for customized indexing pipeline

  2. Through the supported SDKs: A Python Jupyter notebook is available that demonstrates how to configure your end-to-end indexing pipeline with several other recently released features.

  3. Direct REST API usage: Our OneLake files indexer documentation provides a comprehensive guide on how to use the REST API directly.
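
When you create the pipeline objects individually, the indexer refers to the others by name, so it is created last. The following sketch checks those cross-references and returns a creation order. The `creation_plan` helper and all object names are hypothetical, and the bodies are trimmed to the fields needed for the check.

```python
def creation_plan(data_source, index, indexer, skillset=None):
    """Return (collection, name) pairs in an order that satisfies references.

    The indexer points at the data source, index, and optional skillset by
    name, so it goes last. Mismatched names raise ValueError early.
    """
    if indexer["dataSourceName"] != data_source["name"]:
        raise ValueError("indexer.dataSourceName does not match the data source")
    if indexer["targetIndexName"] != index["name"]:
        raise ValueError("indexer.targetIndexName does not match the index")

    plan = [("datasources", data_source["name"]), ("indexes", index["name"])]
    if skillset is not None:
        if indexer.get("skillsetName") != skillset["name"]:
            raise ValueError("indexer.skillsetName does not match the skillset")
        plan.append(("skillsets", skillset["name"]))
    plan.append(("indexers", indexer["name"]))
    return plan


if __name__ == "__main__":
    plan = creation_plan(
        data_source={"name": "onelake-docs"},
        index={"name": "docs-index"},
        skillset={"name": "docs-skills"},
        indexer={
            "name": "onelake-indexer",
            "dataSourceName": "onelake-docs",
            "targetIndexName": "docs-index",
            "skillsetName": "docs-skills",
        },
    )
    for collection, name in plan:
        print(f"PUT /{collection}/{name}")
```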

 

What’s next? 

Stay tuned for more updates on Azure AI Search’s new features and how they’re making integration easier for RAG applications. We’re excited to see what you’ll build with them! 

 

More news 

We have announced more features and improvements in AI Search. Learn more in our blog posts: 

 

Getting started with Azure AI Search 

 

 

Updated May 21, 2024
Version 2.0
  • Hi Gia, 

    Thank you for the great article. I followed your steps for importing and vectorizing data flow, selecting "system-assigned" as the managed identity type. However, I encountered the following error at the end of the creation process.

  • Hi PeterTHLee, based on the error, it looks like you're missing one of the prerequisites, which is to assign the managed identity permissions first: 

    1. Follow the guidelines in OneLake files indexer documentation to enable a managed identity in your AI Search service and provide the permissions required for the connector to access the data in your lakehouse.
  • willhawkins2295
    Copper Contributor

    URGENT!!

    Thank you for the great article gia_mondragon !

     

    Hey PeterTHLee , did you ever resolve the error you received?

     

    I am getting the same error. I followed the steps in the documentation Gia recommended to you, but I'm not able to add the name of my Azure AI Search service to the Microsoft Fabric workspace. It doesn't recognize the system or user identities I add in here:


    Thanks in advance for any support anyone can provide:)

     

  • willhawkins2295 apologies for the late reply, but we don't monitor these blogs for urgent requests. I hope that you were able to solve this. For future reference, please raise any Q&A questions in the forum: Azure Community Support | Microsoft Azure. For completeness, you must first enable the system-assigned managed identity in AI Search so that it shows up. Allow up to 15 minutes for it to replicate.