Educator Developer Blog

[pt2] Choosing the right Data Storage Source (Under Preview) for Azure AI Search

Iron Contributor

Feb 07, 2025

This guide introduces the data sources that are currently in preview for Azure AI Search. The available connectors fall into three categories, listed below; this article covers the preview connectors provided by Azure AI Search itself.

When integrating Azure AI Search into your applications, choosing the right preview data source is essential for optimizing indexing efficiency, query performance, and scalability. Azure AI Search allows you to pull data from various storage sources using indexers that automate both ingestion and enrichment, allowing for powerful search experiences.

This guide explores the preview data sources currently available in Azure AI Search, their best use cases, key benefits, and change detection mechanisms, to help you determine which is right for your application.

Generally Available Data Sources by Azure AI Search: These connectors are already fully supported and optimized for production workloads. They are outside the scope of this article, which focuses on preview features that are still being refined for broader adoption.
Preview Data Sources by Azure AI Search: The preview data sources featured here offer cutting-edge functionality that extends Azure AI Search’s capabilities. These are in the testing and feedback phase, with the potential for future enhancements based on customer feedback.
Data Sources from Our Partners: Some preview connectors are powered by Microsoft’s trusted partners, who provide specialized solutions for enterprise storage, graph databases, document management, and more.

Preview Data Sources by Azure AI Search

Choosing the Right Data Source for Your Use Case

The following table summarizes the available preview data sources, their best applications, benefits, change detection methods, and supported content.

Data Source	Best For	Key Benefits	Change Detection	Supported Content
Fabric OneLake Files	Data lakes with structured & unstructured data	Supports AI enrichment, metadata extraction, and indexing from hierarchical directories	Auto-detects changes via metadata	CSV, JSON, Office files, PDFs, XML, ZIP, Markdown
Azure Cosmos DB for Apache Gremlin	Graph-based NoSQL applications	Supports indexing of graph data (vertices, edges), queryable as JSON	_ts-based high-water mark tracking	JSON documents (serialized graph data)
Azure Cosmos DB for MongoDB	Document-based NoSQL applications	Real-time search, supports JSON documents with soft delete tracking	_ts-based high-water mark tracking	JSON, structured NoSQL data
SharePoint	Enterprise document management	Supports AI enrichment, indexing Office files, PDFs, HTML, etc.	Auto-detects file changes and deletions	PDFs, Office files, HTML, CSV, JSON, XML
Azure Files	SMB file shares with structured content	AI enrichment, metadata extraction, and incremental indexing	Auto-detects changes based on metadata	Office files, PDFs, JSON, XML, ZIP, CSV
Azure MySQL	Relational data in MySQL databases	Supports indexing data from MySQL tables and views, including change tracking and soft deletes	High-water mark change detection; soft delete detection	Structured relational data from MySQL tables/views

1. Fabric OneLake Files – For Large-Scale Data Lakes

When to Use Fabric OneLake Files

Your data resides in OneLake Lakehouse.
You need AI enrichment for text extraction from images, CSV, EML, JSON, ZIP, PDFs, or structured data.
Your data is organized in hierarchical directories or nested subdirectories.
You require incremental indexing with auto-detection of file changes.

Configuration Tips

Use skillsets for AI-powered text extraction, metadata enrichment, and vectorization.
Leverage JSON parsing modes to split JSON documents into separate search entries.
Include/exclude files based on format or metadata to optimize indexing.
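
The parsing-mode and include/exclude tips above can be sketched as an indexer definition. This is a minimal illustration that only builds the JSON payload as a Python dict and sends nothing. The property names (parsingMode, indexedFileNameExtensions, excludedFileNameExtensions) are the ones used by file-based indexers in Azure AI Search; that they apply unchanged to the OneLake preview connector is an assumption to verify against the current documentation. The resource names are hypothetical placeholders.

```python
import json


def build_onelake_indexer(name, data_source, index,
                          parsing_mode="jsonArray",
                          include_ext=(".json", ".pdf"),
                          exclude_ext=(".zip",)):
    """Return an indexer definition as a plain dict (no network call)."""
    return {
        "name": name,
        "dataSourceName": data_source,
        "targetIndexName": index,
        "parameters": {
            "configuration": {
                # "jsonArray" emits one search document per array element.
                "parsingMode": parsing_mode,
                # Comma-separated extension lists narrow what gets indexed.
                "indexedFileNameExtensions": ",".join(include_ext),
                "excludedFileNameExtensions": ",".join(exclude_ext),
            }
        },
    }


# Hypothetical names, for illustration only.
indexer = build_onelake_indexer("onelake-indexer", "onelake-ds", "lakehouse-index")
print(json.dumps(indexer, indent=2))
```

The resulting dict can be serialized and used as the body of a create-indexer request.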

Change Detection

Auto-detects changes based on metadata_storage_last_modified.
Soft delete detection requires adding metadata properties.

2. Azure Cosmos DB for Apache Gremlin – For Graph Data

When to Use Azure Cosmos DB for Gremlin

You have graph-based NoSQL data that needs full-text search.
Your data consists of vertices and edges, serialized as JSON documents.
You require incremental indexing based on changes in the graph database.
Your use case involves real-time search for relationships in graph data.

Configuration Tips

Use custom queries to extract vertices (g.V()) or edges (g.E()) separately.
Enable soft delete tracking via a Boolean flag (e.g., isDeleted).
Use field mappings to adapt Gremlin properties to search index schema.

Change Detection

Uses _ts (timestamp) field for high-water mark tracking.
Soft delete detection via a dedicated property (e.g., isDeleted = true).
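
As a rough sketch of how these settings fit together, the data source definition below combines a Gremlin query, the _ts high-water mark, and an isDeleted soft-delete flag. It only builds the payload as a Python dict. The account, key, database, and graph values are hypothetical placeholders, and the connection string layout (including ApiKind=Gremlin) should be checked against the preview documentation before use.

```python
def build_gremlin_data_source(name, account, key, database, graph, query="g.V()"):
    """Return a Cosmos DB (Gremlin) data source definition as a dict."""
    connection_string = (
        f"AccountEndpoint=https://{account}.documents.azure.com;"
        f"AccountKey={key};Database={database};ApiKind=Gremlin"
    )
    return {
        "name": name,
        "type": "cosmosdb",
        "credentials": {"connectionString": connection_string},
        # g.V() selects vertices; pass "g.E()" to index edges instead.
        "container": {"name": graph, "query": query},
        "dataChangeDetectionPolicy": {
            "@odata.type": "#Microsoft.Azure.Search.HighWaterMarkChangeDetectionPolicy",
            "highWaterMarkColumnName": "_ts",
        },
        "dataDeletionDetectionPolicy": {
            "@odata.type": "#Microsoft.Azure.Search.SoftDeleteColumnDeletionDetectionPolicy",
            "softDeleteColumnName": "isDeleted",
            "softDeleteMarkerValue": "true",
        },
    }


# Placeholder values, for illustration only.
vertices_ds = build_gremlin_data_source(
    "graph-vertices", "my-account", "<key>", "my-db", "my-graph")
edges_ds = build_gremlin_data_source(
    "graph-edges", "my-account", "<key>", "my-db", "my-graph", query="g.E()")
```

Using one data source per query keeps vertices and edges in separate indexing pipelines, which matches the tip about extracting them separately.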

3. Azure Cosmos DB for MongoDB – For NoSQL Document Data

When to Use Azure Cosmos DB for MongoDB

Your application stores NoSQL documents in MongoDB-compatible Cosmos DB.
You need low-latency, real-time search for high-velocity JSON data.
You want incremental indexing based on _ts timestamps.
Your data structure requires hierarchical or nested JSON queries.

Configuration Tips

Use custom queries to structure data before indexing.
Implement soft delete detection using a Boolean flag (isDeleted).
Enable vector indexing for RAG (Retrieval-Augmented Generation) use cases.

Change Detection

Auto-tracks changes using _ts field (timestamp-based).
Soft delete detection using an explicit flag (isDeleted = true).
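
To make the mechanism concrete, here is a small in-memory simulation of a timestamp-based high-water mark combined with a soft-delete flag. It is a conceptual sketch, not the service's implementation: the function name, the dict-based "index", and the sample documents are all invented for illustration.

```python
def incremental_sync(documents, index, high_water_mark):
    """Apply one indexer-style pass and return the new high-water mark.

    Only documents whose _ts is greater than the stored mark are processed.
    Documents flagged isDeleted are removed from the index.
    """
    changed = sorted(
        (d for d in documents if d["_ts"] > high_water_mark),
        key=lambda d: d["_ts"],
    )
    for doc in changed:
        if doc.get("isDeleted"):
            index.pop(doc["id"], None)
        else:
            index[doc["id"]] = {
                k: v for k, v in doc.items() if k not in ("_ts", "isDeleted")
            }
        high_water_mark = doc["_ts"]
    return high_water_mark


docs = [
    {"id": "a", "title": "first", "_ts": 100},
    {"id": "b", "title": "second", "_ts": 110},
]
index = {}
mark = incremental_sync(docs, index, high_water_mark=0)  # indexes a and b

# Later: b is soft-deleted and c is added; a is unchanged and is skipped.
docs[1] = {"id": "b", "title": "second", "_ts": 120, "isDeleted": True}
docs.append({"id": "c", "title": "third", "_ts": 125})
mark = incremental_sync(docs, index, mark)

print(sorted(index), mark)  # ['a', 'c'] 125
```

The second pass touches only the two documents with a newer _ts, which is why a soft-delete flag is needed: a hard-deleted document would simply disappear from the source and never show up as a change.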

4. SharePoint – For Document Library Indexing

When to Use SharePoint

Your content is stored in SharePoint Online document libraries.
You need to index structured and unstructured documents (Office files, PDFs, JSON, HTML, ZIP, XML, ODS).
Your use case requires full-text search for enterprise documents.
You want AI-powered skillsets for OCR, entity recognition, and translation.

Configuration Tips

Index specific document libraries instead of the entire SharePoint site.
Add AI enrichment skillsets for OCR, language translation, and metadata extraction.
Be aware of limitations (e.g., OneNote files, SharePoint Lists not supported).
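
Targeting a single document library, as suggested above, might look like the following data source definition. Treat it as a hedged sketch: the connection string keys and the useQuery / includeLibrary container settings reflect the preview documentation as best understood here and should be verified before use. The site URL, application ID, tenant ID, and library URL are hypothetical placeholders.

```python
def build_sharepoint_data_source(name, site_url, app_id, tenant_id, library_url):
    """Return a SharePoint data source definition scoped to one library."""
    connection_string = (
        f"SharePointOnlineEndpoint={site_url};"
        f"ApplicationId={app_id};TenantId={tenant_id}"
    )
    return {
        "name": name,
        "type": "sharepoint",
        "credentials": {"connectionString": connection_string},
        # "useQuery" limits indexing to the libraries named in "query"
        # rather than crawling every library in the site.
        "container": {
            "name": "useQuery",
            "query": f"includeLibrary={library_url}",
        },
    }


# Placeholder values, for illustration only.
sharepoint_ds = build_sharepoint_data_source(
    "sharepoint-ds",
    "https://contoso.sharepoint.com/sites/hr",
    "<application-id>",
    "<tenant-id>",
    "https://contoso.sharepoint.com/sites/hr/Policies",
)
```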

Change Detection

Auto-detects changes in documents and updates index accordingly.
Deletion detection is built-in (deleted SharePoint files are removed from the index).

5. Azure Files – For SMB File Shares

When to Use Azure Files

Your data resides in Azure Files (SMB file shares).
You need structured content indexing with metadata extraction.
Your organization requires secure, managed file storage with search capabilities.
You want incremental indexing with auto-detection of file changes.

Configuration Tips

Use metadata storage properties to optimize search and filtering.
Implement JSON parsing to index structured data files.
Configure AI enrichment skillsets for text extraction and vectorization.
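
One way to act on the metadata tip is to surface the storage metadata properties as filterable index fields and then filter on them at query time. The sketch below builds the relevant fragments as plain Python data. The index layout is an assumption made for illustration, and the helper names are hypothetical; metadata_storage_path is base64-encoded into the key field because raw paths contain characters that are not valid in document keys.

```python
from datetime import datetime, timezone


def metadata_index_fields():
    """Index fields that expose file metadata for filtering and sorting."""
    return [
        {"name": "id", "type": "Edm.String", "key": True},
        {"name": "content", "type": "Edm.String", "searchable": True},
        {"name": "metadata_storage_name", "type": "Edm.String",
         "filterable": True, "sortable": True},
        {"name": "metadata_storage_last_modified", "type": "Edm.DateTimeOffset",
         "filterable": True, "sortable": True},
    ]


def key_field_mappings():
    """Map the file path to the key field, encoded so it is a valid key."""
    return [{
        "sourceFieldName": "metadata_storage_path",
        "targetFieldName": "id",
        "mappingFunction": {"name": "base64Encode"},
    }]


def modified_since_filter(moment):
    """Build an OData filter for files modified at or after `moment`."""
    stamp = moment.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    return f"metadata_storage_last_modified ge {stamp}"


recent = modified_since_filter(datetime(2025, 1, 1, tzinfo=timezone.utc))
print(recent)  # metadata_storage_last_modified ge 2025-01-01T00:00:00Z
```

The filter string can be passed as the filter parameter of a search query to restrict results to recently changed files.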

Change Detection

Auto-detects changes via metadata_storage_last_modified.
Supports soft delete detection using metadata flags.

6. Azure MySQL – For Relational Data

When to Use Azure MySQL

You have structured relational data stored in Azure Database for MySQL Flexible Server.
You need to index content from MySQL tables or views and make them searchable through Azure AI Search.
You want to track incremental changes and deletions in your MySQL data, including full and soft deletes.

Configuration Tips

Use the high-water mark change detection policy for incremental indexing based on a timestamp column.
Soft deletes can be handled with a soft delete detection policy, ensuring that deleted rows in MySQL are also removed from the search index.
Creating a MySQL indexer is not currently supported in the Azure portal; configure it with the REST APIs or the Azure SDK for .NET instead.

Change Detection

High Water Mark: The MySQL indexer uses a timestamp-based high-water mark column to track changes and only index new or modified rows.
Soft Delete Detection: The MySQL connector also supports soft delete detection, where a special marker in the database can indicate deleted records, which will be reflected in the search index.
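
Since the portal path is not available, a REST call is the likely route. The sketch below prepares (but does not send) a request that creates a MySQL data source with both policies. Several things here are assumptions: the api-version string is a placeholder that must be replaced with a preview version that supports MySQL, and the service name, key, table, and column names (last_updated, is_deleted) are hypothetical.

```python
import json
import urllib.request

# Assumption: replace with a preview API version that supports the MySQL connector.
API_VERSION = "2020-06-30-Preview"


def mysql_data_source_request(service, admin_key, name, connection_string,
                              table, hwm_column, delete_column, delete_marker):
    """Build a PUT request for a MySQL data source. Nothing is sent."""
    body = {
        "name": name,
        "type": "mysql",
        "credentials": {"connectionString": connection_string},
        "container": {"name": table},
        "dataChangeDetectionPolicy": {
            "@odata.type": "#Microsoft.Azure.Search.HighWaterMarkChangeDetectionPolicy",
            "highWaterMarkColumnName": hwm_column,
        },
        "dataDeletionDetectionPolicy": {
            "@odata.type": "#Microsoft.Azure.Search.SoftDeleteColumnDeletionDetectionPolicy",
            "softDeleteColumnName": delete_column,
            "softDeleteMarkerValue": delete_marker,
        },
    }
    url = (f"https://{service}.search.windows.net/datasources/{name}"
           f"?api-version={API_VERSION}")
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json", "api-key": admin_key},
        method="PUT",
    )


# Placeholder values, for illustration only.
request = mysql_data_source_request(
    "my-search-service", "<admin-key>", "mysql-ds",
    "Server=<host>;Port=3306;Database=<db>;Uid=<user>;Pwd=<password>;",
    table="products", hwm_column="last_updated",
    delete_column="is_deleted", delete_marker="1",
)
print(request.get_method(), request.full_url)
```

Passing the prepared request to urllib.request.urlopen would submit it; keeping construction separate makes the payload easy to inspect first.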

Key Takeaways

For large-scale hierarchical data → Fabric OneLake Files
For NoSQL graph data → Azure Cosmos DB for Gremlin
For NoSQL document databases → Azure Cosmos DB for MongoDB
For enterprise document search → SharePoint
For SMB file shares → Azure Files
For MySQL databases → Azure MySQL (Preview)

Final Thoughts

Choosing the right preview data source in Azure AI Search is crucial for optimizing search performance, handling different data structures, and ensuring scalability. These preview features open up more possibilities for your search solutions, especially for applications requiring indexing of MySQL databases, NoSQL graph data, or enterprise document management systems.

By selecting the right data source for your use case, you can enhance search functionality, reduce operational complexity, and drive better insights from your data.

Need Help Deciding?

Need to search hierarchical data lakes? → Fabric OneLake Files
Need to search NoSQL graph data? → Azure Cosmos DB for Gremlin
Need real-time search on MongoDB documents? → Azure Cosmos DB for MongoDB
Need to index SharePoint enterprise content? → SharePoint
Need to index SMB file shares? → Azure Files
Need to index MySQL relational data? → Azure MySQL (Preview)

By selecting the right data source, you can enhance search performance, reduce costs, and maximize search accuracy in your AI-powered applications.

What’s Next?

Explore Azure AI Search Indexers
Try out the Import & Vectorize Data Wizard in the Azure portal
Set up AI skillsets to enhance your search solution with OCR, NLP, and embeddings

Updated Jan 29, 2025

Version 1.0
