Educator Developer Blog

[pt2] Choosing the right Data Storage Source (Under Preview) for Azure AI Search

kevin_comba
Feb 07, 2025

This guide introduces the data sources that Azure AI Search currently supports in preview. It belongs to a series that groups the available connectors into three categories:

  1. Generally Available Data Sources by Azure AI Search
  2. Preview Data Sources by Azure AI Search
  3. Data Sources from Our Partners

In This Article:

When you integrate Azure AI Search into your applications, the preview data source you choose affects indexing efficiency, query performance, and scalability. Azure AI Search pulls data from a range of storage sources through indexers, which automate both ingestion and enrichment.

This guide explores the preview data sources currently available in Azure AI Search, their best use cases, key benefits, and change detection mechanisms, to help you determine which is right for your application.

  • Generally Available Data Sources by Azure AI Search: These connectors are fully supported for production workloads. This article does not cover them; it focuses on the preview connectors that are still being refined.
  • Preview Data Sources by Azure AI Search: The preview data sources featured here extend the range of stores that Azure AI Search can index. They are in the testing and feedback phase and may change based on customer feedback.
  • Data Sources from Our Partners: Some preview connectors are powered by Microsoft’s trusted partners, who provide specialized solutions for enterprise storage, graph databases, document management, and more.

Preview Data Sources by Azure AI Search

The sections below take each preview data source in turn: when to use it, how to configure it, and how its indexer detects changes.

Choosing the Right Data Source for Your Use Case

The following table summarizes the available preview data sources, their best applications, benefits, change detection methods, and supported content.

| Data Source | Best For | Key Benefits | Change Detection | Supported Content |
|---|---|---|---|---|
| Fabric OneLake Files | Data lakes with structured & unstructured data | Supports AI enrichment, metadata extraction, and indexing from hierarchical directories | Auto-detects changes via metadata | CSV, JSON, Office files, PDFs, XML, ZIP, Markdown |
| Azure Cosmos DB for Apache Gremlin | Graph-based NoSQL applications | Supports indexing of graph data (vertices, edges), queryable as JSON | _ts-based high-water mark tracking | JSON documents (serialized graph data) |
| Azure Cosmos DB for MongoDB | Document-based NoSQL applications | Real-time search, supports JSON documents with soft delete tracking | _ts-based high-water mark tracking | JSON, structured NoSQL data |
| SharePoint | Enterprise document management | Supports AI enrichment, indexing Office files, PDFs, HTML, etc. | Auto-detects file changes and deletions | PDFs, Office files, HTML, CSV, JSON, XML |
| Azure Files | SMB file shares with structured content | AI enrichment, metadata extraction, and incremental indexing | Auto-detects changes based on metadata | Office files, PDFs, JSON, XML, ZIP, CSV |
| Azure MySQL | Relational data in MySQL databases | Supports indexing data from MySQL tables and views, including change tracking and soft deletes | High-water mark change detection; soft delete detection | Structured relational data from MySQL tables/views |

1. Fabric OneLake Files – For Large-Scale Data Lakes

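When to Use Fabric OneLake Files

  • Your data lives in a data lake that mixes structured and unstructured files (CSV, JSON, Office files, PDFs, XML, ZIP, Markdown).
  • You want AI enrichment, metadata extraction, and indexing from hierarchical directories.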

Configuration Tips

  • Use skillsets for AI-powered text extraction, metadata enrichment, and vectorization.
  • Leverage JSON parsing modes to split JSON documents into separate search entries.
  • Include/exclude files based on format or metadata to optimize indexing.
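
The tips above can be sketched as REST-style payloads. This is a minimal sketch, not a definitive configuration: the resource names (onelake-ds, docs-index, enrich-and-embed), the GUID placeholders, and the helper functions are hypothetical, and the property names (type "onelake", parsingMode, indexedFileNameExtensions, excludedFileNameExtensions) follow the preview indexer documentation as I understand it, so verify them against the API version you target.

```python
import json


def build_onelake_datasource(name, workspace_id, lakehouse_id, folder=None):
    """Build a data source definition that points at a Fabric lakehouse."""
    container = {"name": lakehouse_id}
    if folder:
        # Optional: restrict indexing to one folder inside the lakehouse.
        container["query"] = folder
    return {
        "name": name,
        "type": "onelake",
        "credentials": {"connectionString": f"ResourceId={workspace_id}"},
        "container": container,
    }


def build_file_indexer(name, datasource_name, index_name, skillset_name=None,
                       parsing_mode="default", include_ext=(), exclude_ext=()):
    """Build an indexer definition with parsing mode and file filters."""
    config = {"parsingMode": parsing_mode}
    if include_ext:
        config["indexedFileNameExtensions"] = ",".join(include_ext)
    if exclude_ext:
        config["excludedFileNameExtensions"] = ",".join(exclude_ext)
    indexer = {
        "name": name,
        "dataSourceName": datasource_name,
        "targetIndexName": index_name,
        "parameters": {"configuration": config},
    }
    if skillset_name:
        # Attach a skillset for text extraction, enrichment, or vectorization.
        indexer["skillsetName"] = skillset_name
    return indexer


# Placeholder GUIDs and names; replace with your own values.
datasource = build_onelake_datasource(
    "onelake-ds", "<workspace-guid>", "<lakehouse-guid>", folder="reports/2025")
indexer = build_file_indexer(
    "onelake-indexer", "onelake-ds", "docs-index",
    skillset_name="enrich-and-embed",
    parsing_mode="jsonArray",   # one search document per JSON array element
    include_ext=[".json"],      # index only JSON files in this run
)
print(json.dumps(indexer, indent=2))
```

Restricting the run to .json files keeps the jsonArray parsing mode from being applied to PDFs or Office files, which would normally be indexed by a second indexer using the default parsing mode.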

Change Detection

  • Auto-detects changes via file metadata.

 

2. Azure Cosmos DB for Apache Gremlin – For Graph Data

When to Use Azure Cosmos DB for Gremlin

  • You have graph-based NoSQL data that needs full-text search.
  • Your data consists of vertices and edges, serialized as JSON documents.
  • You require incremental indexing based on changes in the graph database.
  • Your use case involves real-time search for relationships in graph data.

Configuration Tips

  • Use custom queries to extract vertices (g.V()) or edges (g.E()) separately.
  • Enable soft delete tracking via a Boolean flag (e.g., isDeleted).
  • Use field mappings to adapt Gremlin properties to search index schema.

Change Detection

  • Uses _ts (timestamp) field for high-water mark tracking.
  • Soft delete detection via a dedicated property (e.g., isDeleted = true).
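
As a rough illustration of the configuration and change detection points above, the following sketch assembles a data source definition for a Gremlin graph. The helper function, resource names, and placeholder endpoint and key are hypothetical; the policy type names and the ApiKind=Gremlin connection string setting reflect the preview documentation as I understand it and should be checked before use.

```python
import json


def build_gremlin_datasource(name, endpoint, key, database, graph, query="g.V()"):
    """Build a Cosmos DB (Gremlin) data source with change and delete policies."""
    connection_string = (
        f"AccountEndpoint={endpoint};AccountKey={key};"
        f"Database={database};ApiKind=Gremlin;"
    )
    return {
        "name": name,
        "type": "cosmosdb",
        "credentials": {"connectionString": connection_string},
        # g.V() indexes vertices; pass "g.E()" to index edges instead.
        "container": {"name": graph, "query": query},
        # Incremental indexing: only items with a newer _ts are re-read.
        "dataChangeDetectionPolicy": {
            "@odata.type": "#Microsoft.Azure.Search.HighWaterMarkChangeDetectionPolicy",
            "highWaterMarkColumnName": "_ts",
        },
        # Soft delete: items with isDeleted = true are removed from the index.
        "dataDeletionDetectionPolicy": {
            "@odata.type": "#Microsoft.Azure.Search.SoftDeleteColumnDeletionDetectionPolicy",
            "softDeleteColumnName": "isDeleted",
            "softDeleteMarkerValue": "true",
        },
    }


# Two data sources over the same graph: one for vertices, one for edges.
vertices = build_gremlin_datasource(
    "graph-vertices", "https://<account>.documents.azure.com", "<key>",
    "<database>", "<graph>")
edges = build_gremlin_datasource(
    "graph-edges", "https://<account>.documents.azure.com", "<key>",
    "<database>", "<graph>", query="g.E()")
print(json.dumps(edges["container"], indent=2))
```

Keeping vertices and edges in separate data sources lets each one feed its own indexer and index schema.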

 

3. Azure Cosmos DB for MongoDB – For NoSQL Document Data

When to Use Azure Cosmos DB for MongoDB

  • Your application stores NoSQL documents in MongoDB-compatible Cosmos DB.
  • You need low-latency, real-time search for high-velocity JSON data.
  • You want incremental indexing based on _ts timestamps.
  • Your data structure requires hierarchical or nested JSON queries.

Configuration Tips

  • Custom queries are not supported for this connector in preview, so shape documents at the source or use field mappings to fit the index schema.
  • Implement soft delete detection using a Boolean flag (isDeleted).
  • Enable vector indexing for RAG (Retrieval-Augmented Generation) use cases.

Change Detection

  • Auto-tracks changes using _ts field (timestamp-based).
  • Soft delete detection using an explicit flag (isDeleted = true).
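
To make the high-water mark idea concrete, here is a small local simulation. It is not the service's implementation, only a sketch of the logic under stated assumptions: each document carries a numeric _ts and an optional isDeleted flag, and the function name plan_incremental_run and the sample documents are hypothetical.

```python
def plan_incremental_run(documents, high_water_mark):
    """Decide what one incremental run would upsert and delete.

    Only documents whose _ts is newer than the stored high-water mark
    are considered; those flagged isDeleted are treated as deletions.
    """
    changed = sorted(
        (d for d in documents if d["_ts"] > high_water_mark),
        key=lambda d: d["_ts"],
    )
    upserts = [d["id"] for d in changed if not d.get("isDeleted", False)]
    deletes = [d["id"] for d in changed if d.get("isDeleted", False)]
    # The mark advances to the newest _ts seen, or stays put if nothing changed.
    new_mark = changed[-1]["_ts"] if changed else high_water_mark
    return upserts, deletes, new_mark


docs = [
    {"id": "a", "_ts": 100},
    {"id": "b", "_ts": 205},
    {"id": "c", "_ts": 210, "isDeleted": True},
    {"id": "d", "_ts": 190},
]

upserts, deletes, mark = plan_incremental_run(docs, high_water_mark=200)
print(upserts, deletes, mark)  # ['b'] ['c'] 210
```

A second run with the new mark of 210 finds nothing to do, which is why timestamp-based tracking keeps repeat runs cheap.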

 

4. SharePoint – For Document Library Indexing

When to Use SharePoint

  • Your content is stored in SharePoint Online document libraries.
  • You need to index structured and unstructured documents (Office files, PDFs, HTML, JSON, ZIP, XML, ODS).
  • Your use case requires full-text search for enterprise documents.
  • You want AI-powered skillsets for OCR, entity recognition, and translation.

Configuration Tips

  • Index specific document libraries instead of the entire SharePoint site.
  • Add AI enrichment skillsets for OCR, language translation, and metadata extraction.
  • Be aware of limitations (e.g., OneNote files, SharePoint Lists not supported).
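
The first tip, indexing one document library rather than the whole site, might look like the sketch below. The helper function and placeholder values are hypothetical; the container names (defaultSiteLibrary, useQuery) and the includeLibrary query keyword are taken from the preview documentation as I recall it, so confirm them against the current reference.

```python
import json


def build_sharepoint_datasource(name, site_url, application_id, tenant_id,
                                library_url=None):
    """Build a SharePoint Online data source definition.

    With library_url set, only that document library is indexed;
    otherwise the site's default document library is used.
    """
    if library_url:
        container = {"name": "useQuery", "query": f"includeLibrary={library_url}"}
    else:
        container = {"name": "defaultSiteLibrary"}
    connection_string = (
        f"SharePointOnlineEndpoint={site_url};"
        f"ApplicationId={application_id};TenantId={tenant_id}"
    )
    return {
        "name": name,
        "type": "sharepoint",
        "credentials": {"connectionString": connection_string},
        "container": container,
    }


# Placeholder site, app registration, and library values.
sharepoint_ds = build_sharepoint_datasource(
    "contracts-ds",
    "https://<tenant>.sharepoint.com/sites/<site>",
    "<application-id>",
    "<tenant-id>",
    library_url="https://<tenant>.sharepoint.com/sites/<site>/Contracts",
)
print(json.dumps(sharepoint_ds["container"], indent=2))
```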

Change Detection

  • Auto-detects changes in documents and updates index accordingly.
  • Deletion detection is built-in (deleted SharePoint files are removed from the index).

 

5. Azure Files – For SMB File Shares

When to Use Azure Files

  • Your data resides in Azure Files (SMB file shares).
  • You need structured content indexing with metadata extraction.
  • Your organization requires secure, managed file storage with search capabilities.
  • You want incremental indexing with auto-detection of file changes.

Configuration Tips

  • Use metadata storage properties to optimize search and filtering.
  • Implement JSON parsing to index structured data files.
  • Configure AI enrichment skillsets for text extraction and vectorization.

Change Detection

  • Auto-detects changes via metadata_storage_last_modified.
  • Supports soft delete detection using metadata flags.
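
A minimal sketch of the points above, assuming a file share named team-share and index fields called id, file_name, and last_modified (all hypothetical). It pairs a data source definition of type "azurefile" with field mappings that carry storage metadata into the index; verify the property names against the preview reference before relying on them.

```python
import json


def build_azure_files_datasource(name, connection_string, share_name, directory=None):
    """Build an Azure Files data source definition for one file share."""
    container = {"name": share_name}
    if directory:
        # Optional: index only one directory within the share.
        container["query"] = directory
    return {
        "name": name,
        "type": "azurefile",
        "credentials": {"connectionString": connection_string},
        "container": container,
    }


# Map storage metadata onto index fields so it can be searched and filtered.
METADATA_FIELD_MAPPINGS = [
    {
        # File paths contain characters that are invalid in document keys,
        # so the path is base64-encoded before it becomes the key.
        "sourceFieldName": "metadata_storage_path",
        "targetFieldName": "id",
        "mappingFunction": {"name": "base64Encode"},
    },
    {"sourceFieldName": "metadata_storage_name", "targetFieldName": "file_name"},
    {"sourceFieldName": "metadata_storage_last_modified",
     "targetFieldName": "last_modified"},
]

files_ds = build_azure_files_datasource(
    "files-ds", "<storage-connection-string>", "team-share", directory="policies")
files_indexer = {
    "name": "files-indexer",
    "dataSourceName": files_ds["name"],
    "targetIndexName": "files-index",
    "fieldMappings": METADATA_FIELD_MAPPINGS,
}
print(json.dumps(files_indexer, indent=2))
```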

6. Azure MySQL – For Relational Data

When to Use Azure MySQL

  • You have structured relational data stored in Azure Database for MySQL Flexible Server.
  • You need to index content from MySQL tables or views and make them searchable through Azure AI Search.
  • You want to track incremental changes and deletions in your MySQL data, including full and soft deletes.

Configuration Tips

Change Detection

  • High Water Mark: The MySQL indexer uses a timestamp-based high-water mark column to track changes and only index new or modified rows.
  • Soft Delete Detection: The MySQL connector also supports soft delete detection: a designated column value marks a row as deleted, and the indexer removes the matching document from the search index.
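
Both policies can be expressed in one data source definition. The sketch below assumes a table named products with a last_updated timestamp column and an is_deleted flag column; these names, the helper function, and the placeholder connection string are hypothetical, and the "mysql" type and policy names should be checked against the preview reference.

```python
import json


def build_mysql_datasource(name, connection_string, table, high_water_mark_column,
                           soft_delete_column=None, soft_delete_marker="true"):
    """Build a MySQL data source with high-water mark and optional soft delete."""
    datasource = {
        "name": name,
        "type": "mysql",
        "credentials": {"connectionString": connection_string},
        "container": {"name": table},
        # Only rows whose timestamp column is newer than the last run are read.
        "dataChangeDetectionPolicy": {
            "@odata.type": "#Microsoft.Azure.Search.HighWaterMarkChangeDetectionPolicy",
            "highWaterMarkColumnName": high_water_mark_column,
        },
    }
    if soft_delete_column:
        # Rows whose column equals the marker value are removed from the index.
        datasource["dataDeletionDetectionPolicy"] = {
            "@odata.type": "#Microsoft.Azure.Search.SoftDeleteColumnDeletionDetectionPolicy",
            "softDeleteColumnName": soft_delete_column,
            "softDeleteMarkerValue": soft_delete_marker,
        }
    return datasource


mysql_ds = build_mysql_datasource(
    "mysql-ds",
    "Server=<server>.mysql.database.azure.com;Port=3306;"
    "Database=<database>;Uid=<user>;Pwd=<password>;SslMode=Preferred;",
    table="products",
    high_water_mark_column="last_updated",
    soft_delete_column="is_deleted",
)
print(json.dumps(mysql_ds["dataChangeDetectionPolicy"], indent=2))
```

For the high-water mark to work, the application has to update the timestamp column on every insert and update, including the update that sets the soft delete flag.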

Key Takeaways

  1. For large-scale hierarchical data → Fabric OneLake Files
  2. For NoSQL graph data → Azure Cosmos DB for Gremlin
  3. For NoSQL document databases → Azure Cosmos DB for MongoDB
  4. For enterprise document search → SharePoint
  5. For SMB file shares → Azure Files
  6. For MySQL databases → Azure MySQL (Preview)

 

Final Thoughts

Choosing the right preview data source in Azure AI Search is crucial for optimizing search performance, handling different data structures, and ensuring scalability. These preview features open up more possibilities for your search solutions, especially for applications requiring indexing of MySQL databases, NoSQL graph data, or enterprise document management systems.

By selecting the right data source for your use case, you can enhance search functionality, reduce operational complexity, and drive better insights from your data.

Need Help Deciding?

  • Need to search hierarchical data lakes? → Fabric OneLake Files
  • Need to search NoSQL graph data? → Azure Cosmos DB for Gremlin
  • Need real-time search on MongoDB documents? → Azure Cosmos DB for MongoDB
  • Need to index SharePoint enterprise content? → SharePoint
  • Need to index SMB file shares? → Azure Files
  • Need to index MySQL relational data? → Azure MySQL (Preview)

By selecting the right data source, you can enhance search performance, reduce costs, and maximize search accuracy in your AI-powered applications.

 

What’s Next?

Updated Jan 29, 2025
Version 1.0