
Azure AI Foundry Blog
7 MIN READ

Best Practices for Using Azure AI Search for Natural Language to SQL Generation with Generative AI

ogkranthi
Microsoft
Nov 21, 2024

Introduction

 

Using Generative AI to convert natural language (NL) into SQL queries can simplify user interactions with complex databases. This technology can democratize data access by allowing non-technical business partners to obtain insights without needing to write SQL queries. It can streamline analysts' workflows by enabling them to focus on data interpretation and strategy rather than query formulation. Additionally, it enhances productivity by reducing the time and effort required to retrieve data and ensures more consistent and accurate query results through automated translation of natural language into SQL.
However, schema complexity, schema storage and retrieval, and contextual understanding are common challenges. Azure AI Search, paired with generative AI models such as GPT, tackles these issues through efficient indexing, storage, and retrieval mechanisms, while providing the right context to the AI model for accurate SQL generation. This combination lets users query complex databases easily and get precise answers, improving both usability and reliability.

 

Understanding the Challenges

 

When building NL to SQL solutions, here are the key issues to address:

  • Schema Complexity: Databases have intricate schemas that can make NL to SQL translation difficult.
  • Schema Storage & Planning: Efficiently storing schema details for quick access by the AI model.
  • Contextual Retrieval: The AI model requires an understanding of schema relationships to generate accurate queries.
  • Ranking and Optimization: Retrieving the most relevant schema details and prioritizing them for accuracy.
  • Natural Language Ambiguity: Human language is inherently ambiguous and context-dependent. Disambiguating user queries and understanding the intended meaning is necessary to generate accurate SQL statements.
  • Dynamic Schemas: Adapting gracefully to evolving database schemas is crucial.

Best Practices for AI Search Indexing and Storing

1. Plan the Index Structure Based on Schema Elements - What to Index:

  • Table Names: Index all table names in the schema.
  • Column Names: Include column names with metadata (e.g., primary key, foreign key).
  • Data Types: Store column data types to help frame conditions.
  • Relationships: Capture foreign key relationships to support joins.
  • Sample Values: Store sample values or data patterns to provide context.

Code Example:

{
  "index": "database_schema",
  "fields": [
    { "name": "table_name", "type": "Edm.String", "searchable": true },
    { "name": "column_name", "type": "Edm.String", "searchable": true },
    { "name": "data_type", "type": "Edm.String", "searchable": false },
    { "name": "column_description", "type": "Edm.String", "searchable": true },
    { "name": "table_relationships", "type": "Collection(Edm.String)" }
  ]
}
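Once an index like this exists, schema details can be pushed into it as documents. The sketch below shows one way to flatten a schema into index documents, assuming a simple {table: {column: type}} mapping; the schema_to_documents helper and the endpoint/key placeholders are illustrative, not part of the original post (a real index also needs a key field):

```python
def schema_to_documents(schema):
    """Flatten a {table: {column: data_type}} mapping into search documents."""
    docs = []
    for table, columns in schema.items():
        for column, data_type in columns.items():
            docs.append({
                "table_name": table,
                "column_name": column,
                "data_type": data_type,
                "column_description": f"Column {column} of table {table}",
            })
    return docs

documents = schema_to_documents({
    "SalesData": {"Region": "varchar", "Sales": "decimal", "Year": "int"},
})

# Uploading requires a live service and the azure-search-documents package;
# the endpoint and key below are placeholders:
# from azure.core.credentials import AzureKeyCredential
# from azure.search.documents import SearchClient
# search_client = SearchClient(
#     endpoint="https://<service>.search.windows.net",
#     index_name="database_schema",
#     credential=AzureKeyCredential("<api-key>"),
# )
# search_client.upload_documents(documents=documents)
```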

 

2. Use Semantic Search to Enhance Query Understanding

    Feature: Semantic Search

    Best Practice: Enable semantic search to allow the AI model to understand the meaning behind user queries, even if terminology doesn’t match the schema. For example, "total sales" can match "Sales Amount" or "Revenue."

 

Code Example (Python):

 

search_results = search_client.search(
    search_text="list total sales",
    semantic_configuration_name="default"
)
for result in search_results:
    print(result["table_name"], result["column_name"])

3. Use Vector Indexing for Schema Embeddings

 

Feature: Vector Search

Best Practice:

Convert schema descriptions and relationships into vector embeddings and store them in Azure AI Search. This allows semantic matching for terms that don’t directly align with schema elements.

Code Example:

 

from azure.search.documents.models import VectorizedQuery

# generate_embedding is assumed to call an embedding model (e.g., Azure OpenAI)
query_vector = generate_embedding("list all clients from New York")

search_results = search_client.search(
    search_text=None,
    vector_queries=[
        VectorizedQuery(
            vector=query_vector,
            k_nearest_neighbors=5,
            fields="vector_embedding",
        )
    ],
)
for result in search_results:
    print(f"Table: {result['table_name']}, Column: {result['column_name']}")

 

4. Enrich Index with Metadata and Descriptions

What to Store:

Column Descriptions: Describe each column's purpose.

Relationships Metadata: Include primary and foreign key relationships.

AI-Generated Metadata: Use AI enrichment to auto-generate metadata, enhancing SQL generation accuracy.

Why It Helps:

Storing metadata helps the AI model understand schema relationships and context.

Code Example:

 

{
  "name": "database_schema_index",
  "fields": [
    { "name": "table_name", "type": "Edm.String", "searchable": true },
    { "name": "column_name", "type": "Edm.String", "searchable": true },
    { "name": "description", "type": "Edm.String", "searchable": true },
    {
      "name": "vector_embedding",
      "type": "Collection(Edm.Single)",
      "searchable": true,
      "dimensions": 1536,
      "vectorSearchProfile": "default-profile"
    }
  ]
}

 

5. Prioritize Key Schema Elements Using Custom Scoring Profiles

Feature: Custom Scoring Profiles

Best Practice:

Create custom scoring profiles to prioritize schema elements based on usage frequency or role. This allows AI models to focus on important details.

Code Example:

 

{
  "scoringProfiles": [
    {
      "name": "importanceScoring",
      "text": {
        "weights": {
          "column_name": 1.5,
          "table_relationships": 2.0
        }
      }
    }
  ]
}

 

6. Use Filters and Facets for Contextual Retrieval

Feature: Filters and Facets

Best Practice:

Define filters to narrow schema retrieval based on context. For example, when a query is related to "sales," limit results to sales tables.

Use facets to categorize and narrow schema components.

Code Example:

 

search_results = search_client.search(
    search_text="sales by region",
    filter="table_name eq 'SalesData' or column_name eq 'Region'",
    facets=["table_name"],
)
for result in search_results:
    print(result["table_name"], result["column_name"])

 

7. Store Synonyms and Related Terms to Enhance Retrieval

Feature: Synonym Maps

Best Practice:

Use synonym maps to link alternative terms (e.g., "revenue" and "sales") for more accurate matching with schema components.

Code Example:

from azure.search.documents.indexes import SearchIndexClient
from azure.search.documents.indexes.models import SynonymMap

synonym_map = SynonymMap(
    name="synonymMap",
    synonyms=["revenue, sales", "client, customer", "product, item"],
)
# Synonym maps are managed through the index client, not the search client
index_client.create_synonym_map(synonym_map)

Integrating with Generative AI Models

With retrieved schema details, integrate the context with generative AI for SQL generation.

 

Prompt Engineering:

Provide schema details and example queries to give context to the AI model.

Use structured prompts to define schema relationships.

 

Code Example:

prompt = f"""
Database Schema: {retrieved_schema_details}
Query: {user_natural_language_query}
Generate an optimized SQL query based on the schema details.
"""

 

Incorporate Few-Shot Prompting

In this approach, the LLM is provided with a few examples (prompts) and learns to generalize from them to unseen tasks.

The idea behind few-shot learning is to mimic human cognitive abilities. Just like humans, who often learn a new task from a few examples, AI models should be able to do the same. In the context of NL2SQL solutions, we can provide the AI model with a few examples of natural language queries and their corresponding SQL statements.

Here's how few-shot learning can be implemented in the prompt engineering phase:

Code Example :


prompt = f"""
Database Schema: {retrieved_schema_details}

Example 1:
Query: 'Show the total sales by region for 2020.'
SQL: 'SELECT Region, SUM(Sales) FROM SalesData WHERE Year=2020 GROUP BY Region'

Example 2:
Query: 'List all products sold in New York in 2020.'
SQL: "SELECT DISTINCT Product FROM SalesData WHERE Location='New York' AND Year=2020"

Query: {user_natural_language_query}
Generate the SQL query.
"""

Introduce a Multi-Step Self-Correction Loop

The multi-step component allows the LLM to correct the generated SQL query for accuracy. In this approach, the previous prompt and the generated SQL are passed as input along with the next prompt. The generated SQL is checked for syntax errors, and that feedback is used to enrich the prompt so the LLM can make more accurate and effective corrections. This enables the generative AI model to self-correct its output, ensuring that the resulting SQL query is both syntactically and contextually accurate.

For example, suppose we input a natural language query like 'Show the total sales by region for last year.' The AI model generates an SQL query, which is then checked for syntax errors. If an error is found, the model uses this feedback to enrich the prompt and correct the SQL output. This process can continue across multiple steps until the outputted SQL query is correct, reflecting the power and efficiency of a multi-step self-correction loop.

 

Example 1: Suppose your natural language query is "Show me the total sales by region for 2020", and the initial SQL generated by the AI is incorrect.

Initial SQL: SELECT Region, SUM(Sales) FROM SalesData WHERE Year='2020'

After checking the syntax, the SQL is found to be incorrect (the GROUP BY clause is missing and the year is compared as a string). We then re-prompt the AI model:

Prompt: "Based on the database schema, the correct SQL query should be structured differently. Correct the SQL query for 'Show me the total sales by region for 2020'."

Corrected SQL: SELECT Region, SUM(Sales) FROM SalesData WHERE Year=2020 GROUP BY Region

Example 2: For the query "List all products sold in New York in 2020", the initial SQL generated is:

Initial SQL: SELECT Product FROM SalesData WHERE Location='New York' AND Year='2020'

After checking, we find the SQL is incorrect (duplicates are not removed and the year is compared as a string). So we re-prompt the AI model:

Prompt: "Based on the database schema, the correct SQL query should be structured differently. Correct the SQL query for 'List all products sold in New York in 2020'."

Corrected SQL: SELECT DISTINCT Product FROM SalesData WHERE Location='New York' AND Year=2020
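The loop described above can be sketched in Python. Here SQLite's EXPLAIN stands in for syntax checking (any SQL parser or the target database's dry-run facility would do), and fake_llm stands in for the real LLM call; the function names and the retry limit are illustrative assumptions, not from the original post:

```python
import sqlite3
from typing import Optional


def check_syntax(sql: str, schema_ddl: str) -> Optional[str]:
    """Return an error message if the SQL fails to plan against the schema, else None."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.executescript(schema_ddl)   # create empty tables so names resolve
        conn.execute("EXPLAIN " + sql)   # plan the query without executing it
        return None
    except sqlite3.Error as exc:
        return str(exc)
    finally:
        conn.close()


def self_correcting_sql(nl_query, schema_ddl, generate_sql, max_steps=3):
    """Feed syntax errors back into the prompt until the SQL parses or steps run out."""
    prompt = f"Schema:\n{schema_ddl}\nQuery: {nl_query}\nGenerate SQL."
    sql = generate_sql(prompt)
    for _ in range(max_steps):
        error = check_syntax(sql, schema_ddl)
        if error is None:
            return sql
        # Enrich the prompt with the previous attempt and the error feedback
        prompt = f"{prompt}\nPrevious SQL: {sql}\nError: {error}\nCorrect the SQL."
        sql = generate_sql(prompt)
    return sql


# Stub in place of a real LLM call: broken query first, corrected on retry
def fake_llm(prompt):
    if "Error:" in prompt:
        return "SELECT Region, SUM(Sales) FROM SalesData GROUP BY Region"
    return "SELEC Region, SUM(Sales) FROM SalesData"  # deliberate typo


ddl = "CREATE TABLE SalesData (Region TEXT, Sales REAL, Year INT);"
print(self_correcting_sql("total sales by region", ddl, fake_llm))
```

With the stub, the first attempt fails the syntax check and the enriched prompt yields the corrected query on the second pass.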

 

Conclusion

Using Azure AI Search with generative AI for NL2SQL solutions streamlines the translation from natural language to SQL by managing schema details and prioritizing relevant context. Features like vector indexing, semantic search, and custom scoring support accurate and efficient SQL query generation.

 

A holistic strategy for NL2SQL solutions involves understanding and indexing the database schema using Azure AI Search, enhancing query comprehension via semantic search and vector indexing, and prioritizing schema elements using custom scoring profiles. Leveraging filters and facets for contextual retrieval, storing synonyms for accurate matching, and using few-shot prompting to guide the AI model with a handful of examples can significantly enhance the solution. Integrating retrieved context with generative AI models and implementing a multi-step self-correction loop ensures the accuracy of the generated SQL queries. Continual monitoring and updating of the solution ensures adaptability to evolving schemas and query patterns. This strategy provides a robust, efficient, and accurate NL2SQL solution.
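Stitched together, the retrieval and prompting steps above look roughly like the sketch below. Stub functions stand in for the Azure AI Search retrieval and the LLM call, and all names are illustrative assumptions:

```python
def retrieve_schema(nl_query):
    """Stub for the Azure AI Search retrieval step (semantic/vector search)."""
    return ["SalesData(Region, Sales, Year)"]


def generate_sql(prompt):
    """Stub for the generative AI call."""
    return "SELECT Region, SUM(Sales) FROM SalesData GROUP BY Region"


def nl_to_sql(nl_query):
    # 1) retrieve relevant schema context, 2) build the prompt, 3) generate SQL
    schema = "\n".join(retrieve_schema(nl_query))
    prompt = (
        f"Database Schema:\n{schema}\n"
        f"Query: {nl_query}\n"
        "Generate an optimized SQL query based on the schema details."
    )
    return generate_sql(prompt)


print(nl_to_sql("total sales by region"))
```

In a real deployment the stubs would be replaced by a SearchClient query and an LLM call, with the self-correction loop wrapped around the generation step.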

 

Key Takeaways:

  1. Effectively organize and index schema data for quick retrieval and accurate SQL generation.
  2. Store detailed schema metadata and synonyms for better AI context and query understanding.
  3. Utilize semantic search and filters to retrieve relevant schema details.
  4. Apply ranking algorithms to improve query generation accuracy.
  5. Use few-shot examples during prompt engineering.
  6. Integrate context with generative AI models for precise SQL generation.
  7. Implement a self-correction loop for contextual and syntactic accuracy.
  8. Regularly monitor and update the solution to adapt to evolving schemas and queries.
Updated Dec 19, 2024
Version 2.0