Best Practices for Using Azure AI Search for Natural Language to SQL Generation with Generative AI

Microsoft

Nov 21, 2024

Introduction

Using Generative AI to convert natural language (NL) into SQL queries can simplify user interactions with complex databases. This technology can democratize data access by allowing non-technical business partners to obtain insights without needing to write SQL queries. It can streamline analysts' workflows by enabling them to focus on data interpretation and strategy rather than query formulation. Additionally, it enhances productivity by reducing the time and effort required to retrieve data and ensures more consistent and accurate query results through automated translation of natural language into SQL.
However, schema complexity, schema storage and retrieval, and contextual understanding often stand in the way. Azure AI Search, paired with generative AI models like GPT, can address these issues by indexing, storing, and retrieving schema details efficiently, and by giving the AI model the right context for accurate SQL generation. With this combination, users can query complex databases more easily and get more precise answers, which improves both usability and reliability.

Understanding the Challenges

When building NL to SQL solutions, here are the key issues to address:

Schema Complexity: Databases have intricate schemas that can make NL to SQL translation difficult.
Schema Storage & Planning: Efficiently storing schema details for quick access by the AI model.
Contextual Retrieval: The AI model requires an understanding of schema relationships to generate accurate queries.
Ranking and Optimization: Retrieving the most relevant schema details and prioritizing them for accuracy.
Natural Language Ambiguity: Human language is inherently ambiguous and context-dependent. Disambiguating user queries and understanding the intended meaning is necessary to generate accurate SQL statements.
Dynamic Schemas: The solution must adapt to evolving database schemas without extensive rework.

Best Practices for AI Search Indexing and Storing

1. Plan the Index Structure Based on Schema Elements

What to Index:

Table Names: Index all table names in the schema.
Column Names: Include column names with metadata (e.g., primary key, foreign key).
Data Types: Store column data types to help frame conditions.
Relationships: Capture foreign key relationships to support joins.
Sample Values: Store sample values or data patterns to provide context.

Code Example:

{
  "name": "database_schema",
  "fields": [
    { "name": "id", "type": "Edm.String", "key": true },
    { "name": "table_name", "type": "Edm.String", "searchable": true, "filterable": true, "facetable": true },
    { "name": "column_name", "type": "Edm.String", "searchable": true, "filterable": true },
    { "name": "data_type", "type": "Edm.String", "searchable": false },
    { "name": "column_description", "type": "Edm.String", "searchable": true },
    { "name": "table_relationships", "type": "Collection(Edm.String)", "searchable": true }
  ]
}
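As a rough illustration of how schema metadata could be turned into documents for an index like the one above, the sketch below flattens a table description into one document per column. The SCHEMA dictionary, the schema_to_documents helper, and the "id" key field are assumptions made for this example (an index needs a key field, but its name and format are up to you); uploading the documents is left out.

```python
import json

# Hypothetical schema metadata. In practice this might be read from
# INFORMATION_SCHEMA views or a data catalog.
SCHEMA = {
    "SalesData": {
        "columns": [
            {"name": "SaleId", "type": "int", "description": "Primary key of the sale"},
            {"name": "CustomerId", "type": "int", "description": "Foreign key to Customers"},
            {"name": "SalesAmount", "type": "decimal(18,2)", "description": "Total value of the sale"},
        ],
        "relationships": ["SalesData.CustomerId -> Customers.CustomerId"],
    },
}


def schema_to_documents(schema):
    """Flatten table metadata into one search document per column."""
    docs = []
    for table, meta in schema.items():
        for col in meta["columns"]:
            docs.append({
                # Assumed key field: table and column name joined by a dash.
                "id": f"{table}-{col['name']}",
                "table_name": table,
                "column_name": col["name"],
                "data_type": col["type"],
                "column_description": col["description"],
                "table_relationships": meta["relationships"],
            })
    return docs


documents = schema_to_documents(SCHEMA)
print(json.dumps(documents[0], indent=2))
```

Indexing one document per column keeps each search hit small and lets the retrieval step return only the columns relevant to a question, at the cost of repeating table-level details such as relationships.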

2. Use Semantic Search to Enhance Query Understanding

Feature: Semantic Search

Best Practice: Enable semantic search to allow the AI model to understand the meaning behind user queries, even if terminology doesn’t match the schema. For example, "total sales" can match "Sales Amount" or "Revenue."

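Code Example in Python:

# search_client is an existing azure.search.documents.SearchClient for the schema index.
# Assumes a semantic configuration named "default" is defined on that index.
search_results = search_client.search(
    search_text="list total sales",
    query_type="semantic",  # turns on semantic ranking for this query
    semantic_configuration_name="default",
)
for result in search_results:
    print(result["table_name"], result["column_name"])

3. Use Vector Indexing for Schema Embeddings

Feature: Vector Search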

Best Practice:

Convert schema descriptions and relationships into vector embeddings and store them in Azure AI Search. This allows semantic matching for terms that don’t directly align with schema elements.

Code Example:

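from azure.search.documents.models import VectorizedQuery  # azure-search-documents 11.4+

# generate_embedding is a placeholder for your embedding-model call;
# search_client is the SearchClient for the schema index.
query_vector = generate_embedding("list all clients from New York")
vector_query = VectorizedQuery(vector=query_vector, k_nearest_neighbors=5, fields="vector_embedding")
search_results = search_client.search(search_text=None, vector_queries=[vector_query], top=5)
for result in search_results:
    print(f"Table: {result['table_name']}, Column: {result['column_name']}")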
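To make the idea of vector matching concrete, here is a minimal, self-contained sketch that ranks schema elements by cosine similarity to a query vector. The three-dimensional vectors are hand-made stand-ins for real embeddings, and the brute-force comparison only illustrates the principle; the search service maintains its own vector index rather than scanning every document like this.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy vectors invented for illustration; real embeddings come from a model
# and typically have hundreds or thousands of dimensions.
schema_vectors = {
    ("Customers", "City"): [0.9, 0.1, 0.0],
    ("SalesData", "SalesAmount"): [0.1, 0.9, 0.1],
    ("Products", "ProductName"): [0.0, 0.2, 0.9],
}

# Stand-in for the embedding of "list all clients from New York".
query_vector = [0.8, 0.2, 0.1]


def top_matches(query, vectors, k=2):
    """Return the k schema elements whose vectors are closest to the query."""
    scored = [(cosine_similarity(query, vec), key) for key, vec in vectors.items()]
    scored.sort(reverse=True)
    return [key for _, key in scored[:k]]


print(top_matches(query_vector, schema_vectors))
```

With these toy values the Customers.City element ranks first, even though the word "clients" never appears in the schema, which is the behavior vector search is meant to provide.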

4. Enrich Index with Metadata and Descriptions

What to Store:

Column Descriptions: Describe each column's purpose.

Relationships Metadata: Include primary and foreign key relationships.

AI-Generated Metadata: Use AI enrichment to auto-generate metadata, enhancing SQL generation accuracy. Storing metadata helps the AI model understand schema relationships and context.

Code Example:

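{
  "name": "database_schema_index",
  "fields": [
    {"name": "id", "type": "Edm.String", "key": true},
    {"name": "table_name", "type": "Edm.String", "searchable": true},
    {"name": "column_name", "type": "Edm.String", "searchable": true},
    {"name": "description", "type": "Edm.String", "searchable": true},
    {"name": "vector_embedding", "type": "Collection(Edm.Single)", "searchable": true, "dimensions": 1536, "vectorSearchProfile": "schema-vector-profile"}
  ],
  "vectorSearch": {
    "algorithms": [{"name": "schema-hnsw", "kind": "hnsw"}],
    "profiles": [{"name": "schema-vector-profile", "algorithm": "schema-hnsw"}]
  }
}

The dimensions value (1536 here) is only an example; it must match the output size of the embedding model you use. The profile and algorithm names are placeholders.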
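Once the enriched metadata has been retrieved, it still has to be handed to the model as context. The sketch below shows one possible way to do that, assuming search hits shaped like the fields above (table_name, column_name, description). The build_prompt helper, the prompt wording, and the sample hits are illustrative assumptions, not a prescribed format.

```python
def build_prompt(question, schema_hits):
    """Combine retrieved schema metadata and the user question into one prompt."""
    lines = [
        "You are a SQL assistant. Use only the tables and columns listed below.",
        "",
        "Schema:",
    ]
    for hit in schema_hits:
        lines.append(f"- {hit['table_name']}.{hit['column_name']}: {hit['description']}")
    lines += ["", f"Question: {question}", "SQL:"]
    return "\n".join(lines)


# Hypothetical search hits; in practice these come from the search results.
hits = [
    {"table_name": "SalesData", "column_name": "SalesAmount", "description": "Total value of the sale"},
    {"table_name": "SalesData", "column_name": "Region", "description": "Sales region name"},
]

prompt = build_prompt("What are total sales by region?", hits)
print(prompt)
```

Keeping the prompt limited to the retrieved columns is what makes the descriptions worth storing: the model sees a short, relevant slice of the schema instead of the whole database.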

5. Prioritize Key Schema Elements Using Custom Scoring Profiles

Feature: Custom Scoring Profiles

Best Practice:

Create custom scoring profiles to prioritize schema elements based on usage frequency or role. This allows AI models to focus on important details.

Code Example:

{
  "scoringProfiles": [
    {
      "name": "importanceScoring",
      "text": {
        "weights": {
          "column_name": 1.5,
          "table_relationships": 2.0
        }
      }
    }
  ]
}
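To build intuition for what field weights do, the following toy model multiplies the number of query-term matches in each field by that field's weight. This is a deliberately simplified illustration and not the ranking algorithm the service actually uses; the toy_score function, the sample documents, and the default weight of 1.0 for unlisted fields are assumptions for the example.

```python
# Weights copied from the scoring profile above.
WEIGHTS = {"column_name": 1.5, "table_relationships": 2.0}


def toy_score(doc, terms, weights, default_weight=1.0):
    """Sum of (field weight x number of term matches) across all fields."""
    score = 0.0
    for field, value in doc.items():
        text = " ".join(value) if isinstance(value, list) else str(value)
        tokens = text.lower().split()
        matches = sum(tokens.count(term.lower()) for term in terms)
        score += weights.get(field, default_weight) * matches
    return score


# Hypothetical documents, each matching "customer" in a different field.
match_in_table = {"table_name": "customer", "column_name": "name", "table_relationships": []}
match_in_column = {"table_name": "orders", "column_name": "customer id", "table_relationships": []}
match_in_relationship = {
    "table_name": "orders",
    "column_name": "order id",
    "table_relationships": ["orders references customer"],
}

for doc in (match_in_table, match_in_column, match_in_relationship):
    print(doc["table_name"], doc["column_name"], toy_score(doc, ["customer"], WEIGHTS))
```

In this toy model the same single match counts for more when it occurs in a relationship or a column name than in an unweighted field, which is the effect the profile is intended to have on ranking.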

6. Use Filters and Facets for Contextual Retrieval

Feature: Filters and Facets

Best Practice:

Define filters to narrow schema retrieval based on context. For example, when a query is related to "sales," limit results to sales tables.

Use facets to categorize and narrow schema components.

Code Example:

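# table_name and column_name must be marked filterable, and table_name facetable,
# in the index definition for this query to work.
search_results = search_client.search(
    search_text="sales by region",
    filter="table_name eq 'SalesData' or column_name eq 'Region'",
    facets=["table_name"],
)
for result in search_results:
    print(result["table_name"], result["column_name"])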
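Filter strings like the one above are often assembled at run time from the detected topic of a question. The sketch below shows one way this might be done, assuming a hypothetical TOPIC_TABLES mapping from topics to table names. It escapes single quotes by doubling them, which is how OData string literals represent a quote character.

```python
# Hypothetical mapping from a detected topic to the tables worth searching.
TOPIC_TABLES = {
    "sales": ["SalesData", "Regions"],
    "customers": ["Customers"],
}


def odata_quote(value):
    """Wrap a value as an OData string literal, doubling embedded single quotes."""
    return "'" + value.replace("'", "''") + "'"


def table_filter(table_names):
    """Build an OData filter that restricts results to the given tables."""
    if not table_names:
        return None  # no filter: search the whole index
    clauses = [f"table_name eq {odata_quote(name)}" for name in table_names]
    return " or ".join(clauses)


print(table_filter(TOPIC_TABLES["sales"]))
```

The resulting string can be passed as the filter argument of the search call; returning None for an unknown topic leaves the search unfiltered rather than producing an empty, invalid expression.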

7. Store Synonyms and Related Terms to Enhance Retrieval

Feature: Synonym Maps

Best Practice:

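Use synonym maps to link alternative terms (e.g., "revenue" and "sales") for more accurate matching with schema components. A synonym map only takes effect once it is assigned to the searchable fields it should apply to in the index definition.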

Code Example:

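from azure.search.documents.indexes.models import SynonymMap

# index_client is an azure.search.documents.indexes.SearchIndexClient;
# synonym maps are managed there, not on SearchClient.
synonym_map = SynonymMap(
    name="schema-synonyms",  # lowercase letters, digits, and dashes
    synonyms=["revenue, sales", "client, customer", "product, item"],
)
index_client.create_synonym_map(synonym_map)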
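The effect of equivalence rules such as "revenue, sales" can be illustrated with a small, self-contained sketch that expands each query word into its synonym group. This mimics the idea only; the service applies synonyms inside its own query pipeline. The helper names are made up for the example, and only comma-separated equivalence rules are handled here.

```python
RULES = ["revenue, sales", "client, customer", "product, item"]


def parse_equivalence_rules(rules):
    """Map every term to the full set of terms it is equivalent to."""
    lookup = {}
    for rule in rules:
        group = {term.strip().lower() for term in rule.split(",")}
        for term in group:
            lookup.setdefault(term, set()).update(group)
    return lookup


def expand_query(query, lookup):
    """Replace each query word with the sorted list of its equivalents."""
    return [sorted(lookup.get(word, {word})) for word in query.lower().split()]


expanded = expand_query("revenue by client", parse_equivalence_rules(RULES))
print(expanded)
# [['revenue', 'sales'], ['by'], ['client', 'customer']]
```

A question that says "revenue" can then match a column documented as "sales", without the user needing to know which term the schema uses.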

Conclusion

Using Azure AI Search with Generative AI for NL2SQL solutions streamlines the translation from natural language to SQL by managing schema details and prioritizing relevant context. Features like vector indexing, semantic search, and custom scoring help the model generate accurate SQL queries efficiently.