It took less than a week for OpenAI’s ChatGPT to reach a million users, and it crossed the 100 million user mark in under two months. The interest and excitement around this technology have been remarkable. Users around the world are seeing potential for applying these large language models to a broad range of scenarios.
In the context of enterprise applications, the question we hear most often is “how do I build something like ChatGPT that uses my own data as the basis for its responses?”
The combination of Azure Cognitive Search and Azure OpenAI Service yields an effective solution for this scenario. It integrates the enterprise-grade characteristics of Azure, the ability of Cognitive Search to index, understand and retrieve the right pieces of your own data across large knowledge bases, and ChatGPT’s impressive capability for interacting in natural language to answer questions or take turns in a conversation.
In this blog post we’ll describe the above solution pattern, from the internals of orchestrating conversation and knowledge bases to the considerations in user experience necessary to help end users judge responses and their supporting facts appropriately. Our goal is to give you the tools necessary to build ChatGPT-powered applications starting today, using the "gpt-35-turbo" model that's now in preview. We’re also releasing a GitHub repo with examples, including UX, orchestration, prompts, etc., that you can use to learn more or as a starting point for your own application.
Chatting with your own data
You interact with large language models like ChatGPT using natural language: you give the model a “prompt” and ask it to complete it. The prompt could be a question, a conversation turn, a pattern to extend, etc. When used this way, the responses you get are based on what the model learned during training, which can be useful for general knowledge questions or an informal chat, but is not what you want if you’re building an application where users should see responses based on your own data. In this example (asking about employee healthcare plans) using the base ChatGPT model, the response (in green) is not useful because it doesn’t account for the company’s own benefits policy:
<|im_start|>system
Assistant helps Contoso Inc employees with their internal healthcare plan questions.
<|im_end|>
<|im_start|>user
Does my plan cover annual eye exams?
<|im_end|>
<|im_start|>assistant
I'm sorry, I'm not able to access your specific healthcare plan information. However, many healthcare plans do cover annual eye exams. I recommend checking your plan's summary of benefits or contacting your plan administrator for more information.
One approach to have ChatGPT generate responses based on your own data is simple: inject this information into the prompt. ChatGPT can read the information along with any instructions, context or questions, and respond accordingly. This approach doesn’t need retraining or fine-tuning of the model, and the responses can reflect any changes in the underlying data immediately.
This presents a new challenge though: these models have a limit on the “context length” they support (the current ChatGPT model can take up to 4,000 tokens in a prompt), and even if they didn’t, it wouldn’t be practical to inject gigabytes’ worth of data into a text prompt on every interaction. The alternative is to keep the data in an external knowledge base that can quickly retrieve the right pieces with good relevance, which is exactly what Cognitive Search was designed for. Here’s what a starting point prompt and response could look like:
<|im_start|>system
Assistant helps Contoso Inc employees with their internal healthcare plan questions. Only answer questions based on the facts listed below. If the facts below don't answer the question, say you don't know. Each fact sentence is numbered. Always include a reference to the fact number in the form [number] when you answer.
1. Employees have a choice of two plans, PlanA and PlanB, and they have different benefits.
2. PlanA includes coverage for annual eye and hearing exams.
3. Those with PlanB have a co-payment for routine eye exams and vision-related purchases such as glasses.
4. Both PlanA and PlanB cover preventive care visits.
<|im_end|>
<|im_start|>user
Does my plan cover annual eye exams?
<|im_end|>
<|im_start|>assistant
PlanA includes coverage for annual eye and hearing exams. [2]
In the example above, the blue parts were retrieved from Cognitive Search based on the user’s question and any applicable accumulated context, the green part is the response generated by the model, and the rest is the prompt template we used to give the model its instructions.
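To make the flow concrete, here’s a minimal sketch of this pattern in Python, assuming the azure-search-documents and openai packages. The endpoints, keys, index name, "content" field, and deployment name are placeholders to replace with your own, and the prompt is simply the template shown above rather than the exact one used in our samples:

```python
# Minimal retrieve-then-read sketch (illustrative; names and keys are placeholders).
import openai
from azure.core.credentials import AzureKeyCredential
from azure.search.documents import SearchClient

# Client for the Cognitive Search index holding the knowledge base.
search_client = SearchClient(
    endpoint="https://<your-search-service>.search.windows.net",
    index_name="benefits-docs",
    credential=AzureKeyCredential("<search-query-key>"),
)

# Azure OpenAI configuration for the gpt-35-turbo deployment.
openai.api_type = "azure"
openai.api_base = "https://<your-openai-resource>.openai.azure.com"
openai.api_version = "2022-12-01"
openai.api_key = "<azure-openai-key>"

question = "Does my plan cover annual eye exams?"

# 1. Retrieve the top few candidate chunks for the question.
results = search_client.search(search_text=question, top=3)
facts = "\n".join(f"{i + 1}. {doc['content']}" for i, doc in enumerate(results))

# 2. Inject the retrieved facts into the prompt template shown above.
prompt = (
    "<|im_start|>system\n"
    "Assistant helps Contoso Inc employees with their internal healthcare plan "
    "questions. Only answer questions based on the facts listed below. If the facts "
    "below don't answer the question, say you don't know. Each fact sentence is "
    "numbered. Always include a reference to the fact number in the form [number] "
    "when you answer.\n"
    f"{facts}\n"
    "<|im_end|>\n"
    f"<|im_start|>user\n{question}\n<|im_end|>\n"
    "<|im_start|>assistant\n"
)

# 3. Ask the model to complete the prompt, stopping at the end-of-turn markers.
completion = openai.Completion.create(
    engine="gpt-35-turbo",  # your Azure OpenAI deployment name
    prompt=prompt,
    temperature=0.3,
    max_tokens=256,
    stop=["<|im_end|>", "<|im_start|>"],
)
print(completion.choices[0].text)
```

In practice you would also budget the number and length of retrieved chunks so the assembled prompt stays within the model’s context length.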
This retrieval-augmented generation approach opens the door for starting simple and getting more sophisticated as needed. There are many options for how to construct prompts, how to formulate queries for effective retrieval from the knowledge base, and how to orchestrate back-and-forth interaction between ChatGPT and the knowledge base. Before we dig into those, let’s talk about one more requirement: helping users validate that responses are trustworthy.
Generating trustworthy responses
We assume these large language models, prompts, and orchestration systems aren’t perfect, so we treat every generated response as a candidate that should come with the information an end user needs to validate it. As part of exploring this topic we implemented three simple experiences as starting points. These are by no means the only ones; we welcome ideas and feedback on the best ways to give users better tools to validate that results from the system are factually correct.
As you can see in the picture below, when we produce a response in our examples, we also offer the user three “drill down” tools:
- Citations: Each statement in the response includes a citation with a link to the source content. The citations appear in context (as superscript numbers) as well as links at the bottom of the response; clicking one displays the original content so the user can inspect it (see the sketch after this list for one way those markers can be mapped back to their sources).
- Supporting content: Each response or chat bubble generated by ChatGPT has an option (notebook icon) for displaying all the original content that was fed into the prompt as facts.
- Orchestration process: Each response or chat bubble also includes an option (lightbulb icon) to see the entire interaction process, including intermediate results and generated prompts.
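Since the prompt instructs the model to cite facts in the form [number], a simple post-processing step can recover those citations for the UI. Here’s a small, purely illustrative sketch; the fact list and answer text are made up for the example:

```python
# Hypothetical post-processing step: find "[number]" citation markers in the answer
# and map them back to the numbered facts that were injected into the prompt.
import re

facts = {
    1: "Employees have a choice of two plans, PlanA and PlanB, and they have different benefits.",
    2: "PlanA includes coverage for annual eye and hearing exams.",
}
answer = "PlanA includes coverage for annual eye and hearing exams. [2]"

cited = sorted({int(n) for n in re.findall(r"\[(\d+)\]", answer)})
for n in cited:
    # In a real UI each entry would link back to the original document or chunk.
    print(f"[{n}] {facts.get(n, 'unknown source')}")
```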
Each of these tools may or may not be useful depending on the audience, and there are other ways to offer transparency and validation so users can have confidence in responses. In particular, in this blog post and the initial version of the example code we don’t tackle the critical topic of methods the application itself can implement to evaluate the quality of responses and possibly reject or retry cases that don’t meet certain criteria. We encourage application developers to explicitly explore this topic in the context of each application experience.
Emerging interaction patterns
Approaches for more effective prompt design, retrieval query construction, and interaction models between components are emerging quickly. This is a nascent space where we expect to see lots of rapid progress. Here’s a small sampling of starting points for prompt and query generation, with references to literature for those interested in more detail:
- Retrieve-then-read: a simple starting point for single-shot Q&A scenarios, where the user question contains enough information to retrieve candidates from the search index. This approach uses the question as the search query, takes the top few candidates, and inlines them in a prompt along with instructions and the question itself.
- Read content and context before retrieving: in many cases a user question alone is not good enough for retrieval. For example, in conversational settings, the last user turn may be just a few words representing a follow-up point or question and cannot be used to retrieve related knowledge effectively. Even in single-shot interactions, context needs to be accounted for. In these cases, an interesting approach is to use ChatGPT for search query generation, asking the model to summarize the conversation for retrieval purposes and to account for any context you want to inject (see the sketch after this list).
- Actions, tools, and more: often a single interaction between the user input, the prompt instructions, and the knowledge base is not enough. For example, it’s been shown that asking large language models to decompose responses into small steps increases the quality of responses and avoids certain error patterns. Once a question is decomposed, smaller and more pointed questions can be asked of external sources, either as unstructured searches as we’ve done so far, as factoid question-answering steps (e.g. as supported by Semantic Answers in Cognitive Search), or as lookups in external systems (e.g. an employee table in an internal application, or an incident table in a customer support application). This is a broad space for exploration, and lots of early experimental work is being done. Some interesting reads include the introduction of CoT (chain of thought) prompting and subsequent work, the ReAct approach to combine CoT with tools, and the Toolformer approach to teach models how to use multiple tools to produce a response.
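Here’s a hedged sketch of the query-generation step described above, reusing the Azure OpenAI configuration and search_client from the earlier snippet; the prompt wording and the sample conversation are illustrative, not the exact prompts used in our samples:

```python
# Ask the model to turn the conversation plus the latest user turn into a standalone
# search query, then retrieve with that query instead of the raw turn.
query_prompt = (
    "<|im_start|>system\n"
    "Below is a conversation and a follow-up message from the user. Generate a short "
    "search query that captures what the user is asking for, taking the conversation "
    "into account. Return only the query text.\n"
    "<|im_end|>\n"
    "<|im_start|>user\n"
    "user: Does my plan cover annual eye exams?\n"
    "assistant: PlanA includes coverage for annual eye exams; PlanB has a co-payment. "
    "Which plan are you enrolled in? [2][3]\n"
    "user: I have PlanB\n"
    "<|im_end|>\n"
    "<|im_start|>assistant\n"
)
search_query = openai.Completion.create(
    engine="gpt-35-turbo",
    prompt=query_prompt,
    temperature=0.0,
    max_tokens=32,
    stop=["<|im_end|>", "<|im_start|>"],
).choices[0].text.strip()

# Retrieve with the generated query (e.g. "PlanB routine eye exam coverage"), then
# build the answer prompt as in the retrieve-then-read sketch, optionally including
# the conversation history as well.
results = search_client.search(search_text=search_query, top=3)
```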
The samples that accompany this blog post implement some of these, either directly or through open-source libraries such as LangChain. Just to cherry-pick a particular example, the user chat turn “I have the plus plan” in the screenshot below wouldn’t yield a good answer using a naïve retrieve-then-read approach, but works well with a slightly more sophisticated implementation that carries the context of the conversation.
Improving knowledge base retrieval
Since responses will ultimately be based on what we’re able to retrieve from the knowledge base, quality of retrieval becomes a significant aspect of these solutions. Here are a few considerations:
- Semantic ranking: by default, Cognitive Search will use keyword search combined with a simple probabilistic model for scoring. You can choose to enable Semantic Ranking, which will use a sophisticated deep learning secondary ranking layer for improved precision.
- Document chunking: when indexing content in Cognitive Search for the specific purpose of powering ChatGPT scenarios, you want content of the right length. If each document is too short, it will lack context. If it’s too long, it’s hard to locate the right parts for ChatGPT to “read.” We recommend targeting a few sentences (e.g. ~1/4 to 1/3 of a page) with a sliding window of text as a starting point if your data allows (a rough sketch of such a chunker follows this list). In some cases, such as parts catalogs, it’s reasonable not to chunk up the data and to have each document contain the full description of a single part.
- Summarization: even after chunking, sometimes you’ll want to fit more candidates in a prompt, by making each candidate shorter. You can achieve this by using a summarization step. A few options for this include using Semantic Captions (a query-contextualized summarization step supported directly in Cognitive Search), using hit highlighting (a more lexical, instead of semantic, mechanism to extract snippets), or post-processing the search results with an external summarization model.
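As a rough illustration of the chunking and query-time options above: the window sizes, field names, and semantic configuration name below are assumptions to adapt to your own index, the query parameters are those exposed by the azure-search-documents Python SDK, and search_client is the one created in the earlier snippet.

```python
# Illustrative sliding-window chunker for indexing time; the sizes are only rough
# starting points, not recommendations from Cognitive Search itself.
def chunk_text(text: str, max_chars: int = 1000, overlap: int = 200) -> list[str]:
    """Split text into overlapping chunks of roughly a few sentences each."""
    chunks, start = [], 0
    while start < len(text):
        end = min(start + max_chars, len(text))
        chunks.append(text[start:end])
        if end == len(text):
            break
        start = end - overlap  # slide the window back so consecutive chunks overlap
    return chunks

# At query time, semantic ranking and captions are opt-in parameters on the search
# call; "default" is an assumed semantic configuration name defined on the index.
results = search_client.search(
    search_text="annual eye exam coverage",
    top=3,
    query_type="semantic",
    query_language="en-us",
    semantic_configuration_name="default",
    query_caption="extractive|highlight-false",
)
for doc in results:
    captions = doc.get("@search.captions")
    snippet = captions[0].text if captions else doc["content"]
    print(snippet)
```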
The accompanying sample code includes functionality to easily experiment with some of the options above (click the settings icon at the top right of the window).
More scenarios
In this blog post we focused on conversation and question answering scenarios that combine ChatGPT from Azure OpenAI with Azure Cognitive Search as a knowledge base and retrieval system. There are other ways in which Azure OpenAI Service and Cognitive Search can be combined to improve existing scenarios or enable new ones. Examples include using natural language for query formulation, powering catalog browsing experiences, and using Azure OpenAI at indexing time to enrich data. We plan on continuing to publish guidance and examples to illustrate how to accomplish many of these.
Try this out today, on your own data or ours
We posted a few examples, including the complete UX shown in this blog post, in this GitHub repo. We plan on continuously expanding that repo with a focus on covering more scenarios.
You can clone this repo and either use the included sample data or adapt it to use your own. We encourage you to take an iterative approach: data preparation will take a few tries, so start by uploading what you have and trying out the experience.
We’re excited about the prospect of improved and brand-new scenarios powered by the availability of large language models combined with information retrieval technology. We look forward to seeing what you will build with Azure OpenAI and Azure Cognitive Search.