Introduction
In today's data-driven business landscape, enterprises have widely adopted lakehouses as an essential component to manage unstructured data efficiently. Recognizing this need, we are thrilled to introduce the OneLake files indexer in Azure AI Search (preview). This connector makes it easy to directly index data from a OneLake lakehouse provisioned in Microsoft Fabric. It can index file content and metadata, and it automatically tracks changes so your index stays up to date. By simplifying data integration and accelerating indexing for retrieval augmented generation (RAG) applications, the OneLake files indexer aims to make working with Azure AI Search more streamlined.
Indexing Files from OneLake and using Shortcuts
The OneLake files indexer is designed to connect to a Microsoft Fabric tenant's OneLake and index its files. It can index files both directly from a OneLake lakehouse and through OneLake shortcuts. This indexer is geared towards unstructured and semi-structured content, such as PDFs, Microsoft Office documents, images, and JSON or CSV files. It is not intended for highly structured table content, such as Parquet files. As a pull indexer, it is also compatible with features like AI enrichment and integrated vectorization, which transform and refine your data before it is added to your Azure AI Search index.
This indexer is a valuable addition to the existing data sources supported by Azure AI Search, such as Azure Blob Storage, Azure SQL and Azure Cosmos DB.
Indexing Lakehouse Files in OneLake
The OneLake files indexer allows for direct indexing of lakehouse files in OneLake. It can index files and associated metadata from one or more data paths within a lakehouse, and it supports incremental indexing to identify and index new and updated files and metadata.
OneLake Shortcuts for data in Amazon S3 and Google Cloud Storage
The OneLake files indexer also supports indexing files through OneLake shortcuts to sources like Azure Data Lake Storage Gen2 (ADLS Gen2), Amazon S3, and Google Cloud Storage (GCS). The indexer can traverse these sources, access documents, and integrate their metadata and content into your Azure AI Search index. This feature broadens the range of supported data sources for Azure AI Search and taps into the growing ecosystem of shortcuts supported by OneLake.
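As a rough illustration of how a lakehouse folder or shortcut might be described to the indexer, the sketch below builds a data source definition as a plain Python dictionary. The field layout (a `onelake` type, a `ResourceId=` connection string holding the Fabric workspace GUID, and a `container` whose `name` is the lakehouse GUID and whose `query` is the folder or shortcut) is an assumption based on the preview documentation and should be verified there. The helper name, the data source name, and the placeholder values are hypothetical.

```python
import json
from typing import Optional


def build_onelake_data_source(
    name: str,
    workspace_id: str,
    lakehouse_id: str,
    folder_or_shortcut: Optional[str] = None,
) -> dict:
    """Return a data source definition for a OneLake lakehouse (hypothetical helper)."""
    definition = {
        "name": name,
        "type": "onelake",  # assumed type value; confirm in the preview docs
        # Assumed format: the Fabric workspace GUID identifies the resource.
        "credentials": {"connectionString": f"ResourceId={workspace_id}"},
        # Assumed format: the lakehouse GUID acts as the container name.
        "container": {"name": lakehouse_id},
    }
    if folder_or_shortcut:
        # Restrict indexing to one lakehouse folder or one shortcut.
        definition["container"]["query"] = folder_or_shortcut
    return definition


if __name__ == "__main__":
    data_source = build_onelake_data_source(
        name="my-onelake-ds",
        workspace_id="<fabric-workspace-guid>",
        lakehouse_id="<lakehouse-guid>",
        folder_or_shortcut="my-s3-shortcut",
    )
    print(json.dumps(data_source, indent=2))
```

Because a shortcut is addressed the same way as a lakehouse folder in this sketch, the same definition shape would apply whether the files live in OneLake itself or behind a shortcut.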
REST API and SDK Support
The OneLake files indexer is part of REST API version 2024-05-01-preview. This feature is also available through the Azure portal and our latest preview SDKs for Python, .NET, Java, and JavaScript, so you can choose the integration method that fits your needs.
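To make the API version concrete, here is a minimal sketch that prepares (but does not send) a create-or-update request against a search service using only the Python standard library. The service name, resource name, and key are placeholders, and the helper `build_put_request` is hypothetical. Key-based authentication is assumed purely for brevity.

```python
import json
import urllib.request

# Preview API version that includes the OneLake files indexer.
API_VERSION = "2024-05-01-preview"


def build_put_request(
    service_name: str, collection: str, name: str, body: dict, api_key: str
) -> urllib.request.Request:
    """Prepare a PUT request for a search service resource (hypothetical helper).

    `collection` is the REST collection, for example "datasources" or "indexers".
    """
    url = (
        f"https://{service_name}.search.windows.net/"
        f"{collection}/{name}?api-version={API_VERSION}"
    )
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json", "api-key": api_key},
        method="PUT",
    )


if __name__ == "__main__":
    request = build_put_request(
        service_name="<your-search-service>",
        collection="datasources",
        name="my-onelake-ds",
        body={"name": "my-onelake-ds"},
        api_key="<your-admin-key>",
    )
    print(request.get_method(), request.full_url)
```

Passing the prepared request to `urllib.request.urlopen` would send it. The preview SDKs wrap the same operations if you prefer not to build requests by hand.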
Configuring OneLake Files Indexer
Use the "Import and vectorize data" wizard
To try this out, go to your AI Search service in the Azure portal and use the Import and vectorize data wizard. It takes you from zero to a working search index for use in RAG apps in just a few steps:
a. Follow the guidelines in the OneLake files indexer documentation to enable a managed identity in your AI Search service and grant the permissions required for the connector to access the data in your lakehouse.
b. Go to your OneLake lakehouse and note the URL and the subfolder or shortcut name that contains the data you'd like to index.
c. Go to the Azure portal and, in your AI Search service, click on the Import and vectorize data wizard.
d. Choose OneLake files as your data source, enter the URL and the subfolder or shortcut name you noted in step b, and click on Next.
e. In the following two steps of the wizard, as part of integrated vectorization:
  - Under the Vectorize text tab, choose your text embeddings configuration.
  - Under the Vectorize/enrich your images tab, optionally choose your image processing settings (multimodal embeddings and/or enrichments).
f. In Advanced settings, we recommend that you enable semantic ranker for improved results and put the indexer on a schedule.
g. Review and click on Create.
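For readers who later move from the wizard to code, the schedule chosen in Advanced settings corresponds to a `schedule` block on the indexer definition. The sketch below shows one way to express it, assuming the interval is an ISO 8601 duration and that the allowed range is 5 minutes to 1 day (our understanding of the indexer scheduling limits, to be confirmed in the documentation). The helper names and resource names are hypothetical.

```python
def to_iso8601_interval(minutes: int) -> str:
    """Convert a number of minutes to an ISO 8601 duration such as PT2H."""
    # Assumed limits: no more often than every 5 minutes, at least once a day.
    if not 5 <= minutes <= 1440:
        raise ValueError("interval must be between 5 minutes and 1 day")
    if minutes == 1440:
        return "P1D"
    hours, mins = divmod(minutes, 60)
    return "PT" + (f"{hours}H" if hours else "") + (f"{mins}M" if mins else "")


def build_scheduled_indexer(
    name: str, data_source_name: str, index_name: str, every_minutes: int
) -> dict:
    """Return an indexer definition that runs on a schedule (hypothetical helper)."""
    return {
        "name": name,
        "dataSourceName": data_source_name,
        "targetIndexName": index_name,
        "schedule": {"interval": to_iso8601_interval(every_minutes)},
    }


if __name__ == "__main__":
    indexer = build_scheduled_indexer(
        "my-onelake-indexer", "my-onelake-ds", "my-index", every_minutes=120
    )
    print(indexer["schedule"])
```

A scheduled run is what lets incremental indexing pick up new and updated files without manual intervention.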
Use other options
There are other ways to use the OneLake files indexer depending on your scenario:
1. Via the Azure portal:
  - With the Import data wizard for conventional search applications (like powering the search box for private data in a custom application).
  - By building the individual AI Search components of the indexing pipeline independently for added customization. This includes the data source, index, optional skillset, and finally the indexer.
2. Through the supported SDKs: A Python Jupyter notebook is available that demonstrates how to configure your end-to-end indexing pipeline with several other recently released features.
3. Direct REST API usage: Our OneLake files indexer documentation provides a comprehensive guide on how to use the REST API directly.
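When the components are built individually, the order matters because the indexer refers to the other resources by name. The following sketch only plans the sequence of REST calls; it is an illustration rather than a complete client, and the function and resource names are hypothetical.

```python
from typing import List, Optional, Tuple


def plan_pipeline_requests(
    data_source: str,
    index: str,
    indexer: str,
    skillset: Optional[str] = None,
) -> List[Tuple[str, str]]:
    """List the create-or-update calls for an indexing pipeline, in order."""
    steps = [
        ("PUT", f"/datasources/{data_source}"),  # where the files live
        ("PUT", f"/indexes/{index}"),            # where the content lands
    ]
    if skillset:
        # Optional: AI enrichment or integrated vectorization.
        steps.append(("PUT", f"/skillsets/{skillset}"))
    # The indexer comes last because it references everything above by name.
    steps.append(("PUT", f"/indexers/{indexer}"))
    return steps


if __name__ == "__main__":
    for method, path in plan_pipeline_requests(
        "my-onelake-ds", "my-index", "my-onelake-indexer", skillset="my-skillset"
    ):
        print(method, path)
```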
What’s next?
Stay tuned for more updates on Azure AI Search’s new features and how they’re making integration easier for RAG applications. We’re excited to see what you’ll build with them!
More news
We have announced more features and improvements in AI Search. Learn more in our blog posts:
- Multimodal search and AI Studio model catalog embeddings support in Azure AI Search, now in preview.
- Announcing cost-effective RAG at scale with Azure AI Search: more storage, vector capacity and performance in new Azure AI S and L tiers.
- Azure AI Search’s new hybrid and vector search updates.
Getting started with Azure AI Search
- Learn more about Azure AI Search.
- Start creating a search service in the Azure portal, Azure CLI, the Management REST API, an ARM template, or a Bicep file.
- Review Azure AI Search pull-indexers and push data approaches.
- Learn about Retrieval Augmented Generation in Azure AI Search.
- Learn more about integrated vectorization and why chunking and vectors are important in your RAG solutions.