The recent advances in LLMs have created opportunities for businesses to gain new insight into their unstructured text data. A classic example of this has been using the popular Retrieval Augmented Generation pattern (e.g. https://github.com/Azure/GPT-RAG) in which enterprise documents are used as context to answer user questions. These implementations can reduce time spent sifting through lengthy documentation and promote awareness throughout an organization.
But what about using LLMs to extract insights from structured relational data such as SQL databases? Many enterprises possess enormous amounts of SQL data that currently requires technical expertise to query, or business dashboards built to make the data accessible to end users. However, these dashboards are costly to build and limited in their scope.
With LLMs, there is new potential to enable non-technical users to extract information from SQL databases: an LLM dynamically generates a SQL query, the query is executed, and the results are passed back to the LLM as context to produce a response for the end user. This use case has the potential to enhance understanding of business data and reduce the time and money spent extracting data from SQL databases.
This guide will show multiple approaches to building a chat application that leverages SQL data as context. We will start with the simplest architecture and work progressively toward more sophisticated methods to address limitations.
Note: Truncated prompts will be provided in the examples below, but more complete and comprehensive prompts will be included in the final section.
In this first implementation of an NL to SQL chat app, the LLM builds a SQL query based on the user’s question and a fixed SQL schema that is included in the system prompt.
Figure 1 – Few Shot SQL Generation
“You are an expert at providing facts from a SQL Database. Given the user question, produce a PostgreSQL query which, when executed, provides the correct answer to the question. Only use the tables and schema given below to answer the question. SQL Schema: {Schema}
User Question: {Question}”
“Given the original question, the corresponding PostgreSQL query, and the execution results, produce a truthful and accurate natural language response to the original question.
Question: {question}”
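These two prompts map onto two LLM calls with a SQL execution step in between. The snippet below is a minimal sketch of that flow, assuming an Azure OpenAI chat deployment and a PostgreSQL database reachable through psycopg2; the deployment name, environment variables, and abbreviated prompt constants are placeholders rather than part of any published sample.

```python
import os

import psycopg2
from openai import AzureOpenAI

# Abbreviated placeholders; the full prompts appear in the final section.
SQL_GENERATION_PROMPT = "You are an expert at providing facts from a SQL Database... SQL Schema: {schema}"
RESPONSE_PROMPT = "Given the original question, the corresponding PostgreSQL query, and the execution results..."

client = AzureOpenAI(
    azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],
    api_key=os.environ["AZURE_OPENAI_API_KEY"],
    api_version="2024-02-01",
)

def answer_question(question: str, schema: str) -> str:
    # LLM call 1: generate a PostgreSQL query from the question and the fixed schema.
    sql_query = client.chat.completions.create(
        model="gpt-4o",  # your chat deployment name
        messages=[
            {"role": "system", "content": SQL_GENERATION_PROMPT.format(schema=schema)},
            {"role": "user", "content": question},
        ],
        temperature=0,
    ).choices[0].message.content

    # Execute the generated query against the database.
    with psycopg2.connect(os.environ["POSTGRES_DSN"]) as conn, conn.cursor() as cur:
        cur.execute(sql_query)
        results = cur.fetchall()

    # LLM call 2: turn the execution results into a natural language answer.
    return client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": RESPONSE_PROMPT},
            {"role": "user",
             "content": f"Question: {question}\nSQL Query: {sql_query}\nSQL Query Results: {results}"},
        ],
        temperature=0,
    ).choices[0].message.content
```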
Benefits / Limitations: This architecture is quite simple and benefits in terms of latency by having only two LLM calls. However, this setup is limited because the LLM does not have any examples of valid SQL queries to reference.
This implementation is similar to the previous architecture, but rather than having a fixed set of few-shot examples, we create a data preparation step in which example queries are embedded into a vector database and then retrieved at run time.
Figure 2 – Few Shot SQL Generation with RAG
Below, we discuss the new steps that differentiate Architecture 2 from Architecture 1:
{
  "question": "Which product line had the highest sales in FY2022?",
  "sql_query": "SELECT value FROM table_1..."
}
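For the data preparation step, one straightforward approach is to embed the question field of each pair and retrieve the most similar pairs at run time. The sketch below is illustrative only: it uses an Azure OpenAI embedding deployment and an in-memory cosine similarity search in place of a production vector database (such as Azure AI Search), and function names like retrieve_examples are our own.

```python
import os

import numpy as np
from openai import AzureOpenAI

client = AzureOpenAI(
    azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],
    api_key=os.environ["AZURE_OPENAI_API_KEY"],
    api_version="2024-02-01",
)

def embed(text: str) -> np.ndarray:
    resp = client.embeddings.create(model="text-embedding-ada-002", input=text)
    return np.array(resp.data[0].embedding)

# Data preparation: embed the question of every curated pair once, offline.
examples = [
    {"question": "Which product line had the highest sales in FY2022?",
     "sql_query": "SELECT value FROM table_1..."},
    # ... more question / SQL pairs
]
example_vectors = np.vstack([embed(ex["question"]) for ex in examples])

def retrieve_examples(user_question: str, k: int = 3) -> list[dict]:
    # Run time: embed the user question and return the k most similar pairs,
    # which are then formatted into the {examples} slot of the system prompt.
    q = embed(user_question)
    scores = example_vectors @ q / (
        np.linalg.norm(example_vectors, axis=1) * np.linalg.norm(q)
    )
    return [examples[i] for i in np.argsort(scores)[::-1][:k]]
```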
Benefits / Limitations: Because relevant example queries are retrieved at run time, the LLM has concrete, valid SQL patterns to reference, which typically improves generation accuracy while keeping the chat flow to two LLM calls. However, this setup requires a curated set of question / SQL pairs to be embedded in advance, and the retrieval step adds some latency and operational complexity.
One of the central challenges in building an NL to SQL based chat application is handling latency. One method to reduce latency is to fine-tune a smaller model, such as GPT 3.5T, for the SQL generation step.
Figure 3 – Fine Tuning SQL Generation with GPT 3.5T
This architecture follows roughly the same pattern as Architecture 1, but with a preparatory fine-tuning step.
Note: If sufficient NL / SQL samples are available, fine-tuning can be combined with RAG so that few-shot examples are still retrieved in addition to using the fine-tuned model. Users should experiment with the exact number of samples to include to get the best results and be sure to test with unseen samples.
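As a rough illustration of what the fine-tuning data might look like, each NL / SQL sample can be written as one chat-format record in the JSONL file expected by the Azure OpenAI fine-tuning API. The table and column names below (sales, product_line, fiscal_year) are hypothetical, and the system message is an abbreviated placeholder.

```python
import json

# One hypothetical training sample; a fine-tuning file contains one such
# JSON object per line, typically several hundred or more.
sample = {
    "messages": [
        {"role": "system",
         "content": "You are an expert at providing facts from a SQL Database. "
                    "Produce a PostgreSQL query that answers the user question."},
        {"role": "user",
         "content": "Which product line had the highest sales in FY2022?"},
        {"role": "assistant",
         "content": "SELECT product_line, SUM(sales) AS total_sales "
                    "FROM sales WHERE fiscal_year = 2022 "
                    "GROUP BY product_line ORDER BY total_sales DESC LIMIT 1;"},
    ]
}

with open("nl_to_sql_train.jsonl", "a", encoding="utf-8") as f:
    f.write(json.dumps(sample) + "\n")
```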
To evaluate an NL to SQL project, you will need to assess the accuracy of the generated SQL query, the quality of the final response, and the latency of each step. Below, we outline some of the key metrics to capture during the evaluation stage.
Manual Metrics:
The simplest way of evaluating the application is to have a human evaluator look at the final response and determine whether it answers the user's original question correctly. An example of this is End-to-End Accuracy.
One of the limitations with this method is that it requires a human to review each response, which can be labor-intensive. Additionally, if the answer is marked incorrect, it may not be clear which part of the process failed. For this reason, we also introduce a set of automated metrics to evaluate each step of the process.
Automated Metrics:
Below, we introduce four automated metrics that can be computed without human reviewers in a scalable fashion.
Figure 4 – NL to SQL Evaluation Metrics
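As one example of an automated check of this kind, a generated query can be executed and compared against a hand-written reference query for the same test question. The helper below is a hedged sketch; the execution_accuracy name and the exact matching rule are our own choices, not a standard library function.

```python
import psycopg2

def execution_accuracy(generated_sql: str, reference_sql: str, dsn: str) -> float:
    """Return 1.0 if the generated query executes and its result set matches the
    reference query's result set (ignoring row order), otherwise 0.0."""
    with psycopg2.connect(dsn) as conn, conn.cursor() as cur:
        try:
            cur.execute(generated_sql)
            generated_rows = set(cur.fetchall())
        except psycopg2.Error:
            conn.rollback()
            return 0.0  # a query that errors out counts as a failure
        cur.execute(reference_sql)
        reference_rows = set(cur.fetchall())
    return 1.0 if generated_rows == reference_rows else 0.0
```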
LLMs introduce new risks to a web application. In this case, the LLM is generating SQL code that will be executed. This introduces a new attack surface, and appropriate boundaries should be put in place prior to releasing the application to the public.
NL to SQL systems require various safety precautions to ensure accurate, secure, and unbiased results. Here are some key aspects to consider.
Risks:
Mitigations:
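One possible mitigation, sketched below, is to validate the generated SQL before execution and to run it under a read-only database role with statement timeouts and row limits. The keyword check shown is illustrative and deliberately conservative, not a complete defense on its own.

```python
import re

# Reject anything that is not a single, read-only SELECT statement.
FORBIDDEN = re.compile(
    r"\b(INSERT|UPDATE|DELETE|DROP|ALTER|TRUNCATE|GRANT|REVOKE|CREATE|COPY)\b",
    re.IGNORECASE,
)

def is_safe_select(sql_query: str) -> bool:
    statements = [s for s in sql_query.split(";") if s.strip()]
    if len(statements) != 1:
        return False  # reject stacked statements such as "SELECT ...; DROP TABLE ..."
    # Drop full-line "--" reasoning comments that the generation prompt asks for.
    lines = [l for l in statements[0].splitlines() if not l.strip().startswith("--")]
    stmt = " ".join(lines).strip()
    return stmt.upper().startswith("SELECT") and not FORBIDDEN.search(stmt)
```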
Example 1: NL to SQL System Prompt
"You are an expert in providing facts from an SQL database. Given the user question, produce a PostgreSQL query which, when executed, provides the correct answer to the question. Only use the tables and schema given below to answer the questions. Learn query patterns from the similar question-answer pairs provided as examples below. Directly start your response with the executable query. Do not start with '''sql. Do not make up new table and column names. Only use the ones available in the schema below.
SQL Schema:
{schema}
Examples:
{examples}"
Example 2: SQL to NL System Prompt
"Given the original question, the corresponding PostgreSQL query, and the execution results, produce a truthful and accurate natural language response to the original question, that is only based on the execution results. Your answer should be detailed and accurate, for numerical responses make sure you infer the appropriate units. If the question is complex, you can provide a multi-sentence response. Make sure that you provide a response that is easy to understand."
Example 3: NL to SQL User Prompt
"Your response should be an executable query, do not start with '''sql Provide detailed reasoning behind the PostgreSQL query, please provide it as a comment using -- at the beginning of the line. Formatting example:
SELECT some_name, val FROM some_table WHERE a = number AND b = another_number -- This query selects some_name and val from some_table where a = number and b = another_number. Queries should return a minimal amount of rows, and should not return more rows than necessary.
User question: {query}
PostgreSQL query:"
Example 4: SQL to NL User Prompt
"Question: {question} SQL Query: {sql_query} SQL Query Results: {sql_query_results}"
Example 5: SQL Critic Score Prompt
"Your job is to evaluate the nl-to-sql engine.
The app goes through several steps: it generates a PostgreSQL query from the user question, executes the query, and produces a natural language response from the results.
Your job is to determine if the generated SQL correctly translates the original question.
User question: {question}
Generated PostgreSQL: {generated_query}
Executed PostgreSQL result: {query_execution_result}
Given the information above, give a numeric score of 0 to the Generated SQL if it doesn't correctly handle the User question, and give a numeric score of 1 if the Generated SQL query correctly handles the User question.
If the SQL query yields an error, give a numeric score of 0.
If the SQL query doesn't error out, but doesn't correctly handle the User question, give a numeric score of 0.
If the SQL execution results are empty, most likely it didn't handle the User's question. Think hard if you want to give it a 1 or a 0.
Score: "
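The critic prompt above can be run as an LLM-as-judge call and the resulting 0/1 score logged for each test question. The snippet below is a minimal sketch; CRITIC_PROMPT stands in for the full Example 5 text and the judge deployment name is a placeholder.

```python
from openai import AzureOpenAI

CRITIC_PROMPT = """Your job is to evaluate the nl-to-sql engine. ...
User question: {question}
Generated PostgreSQL: {generated_query}
Executed PostgreSQL result: {query_execution_result}
...
Score: """

def critic_score(client: AzureOpenAI, question: str, generated_query: str,
                 query_execution_result: str) -> int:
    # Ask the judge model for only the 0/1 score defined in the prompt above.
    response = client.chat.completions.create(
        model="gpt-4o",  # judge model deployment name
        messages=[{"role": "user",
                   "content": CRITIC_PROMPT.format(
                       question=question,
                       generated_query=generated_query,
                       query_execution_result=query_execution_result)}],
        temperature=0,
    )
    return 1 if response.choices[0].message.content.strip().startswith("1") else 0
```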
Example 6: End to End Groundedness
"You are an AI assistant. You will be given the definition of an evaluation metric for assessing the quality of an answer to a question-answering task. Your job is to compute an accurate evaluation score using the provided evaluation metric.
You will be presented with a CONTEXT and an ANSWER about that CONTEXT. You need to decide whether the ANSWER is entailed by the CONTEXT by choosing one of the following rating:
10: The ANSWER follows logically from the information contained in the CONTEXT.
0: The ANSWER is logically false from the information contained in the CONTEXT.
An integer score between 0 and 10 (if such an integer score does not exist, use 0): It is not possible to determine whether the ANSWER is true or false without further information.
Read the passage of information thoroughly and select the correct answer from the three answer labels. Read the CONTEXT thoroughly to ensure you know what the CONTEXT entails.
Note the ANSWER is generated by a computer system, it can contain certain symbols, which should not be a negative factor in the evaluation.
Independent Examples:
Example Task #1 Input:
"CONTEXT":
Query: {Sample SQL Query}
Query results: {Sample SQL Query Result}
"ANSWER": "{Valid Answer}”
Example Task #1 Output:
10
Example Task #2 Input:
"CONTEXT":
Query: "{SQL Query}”
Query results: {SQL Query Results}
"ANSWER": "{NL Response}"
Example Task #2 Output:
10
"""
Example 7: End to End Relevance
“Is the answer relevant and related to the question?
If the answer is yes, output 1.
Respond 0 otherwise.
Question: {question}
Answer: {answer}
Score:”