A deep dive into how answer synthesis works and performs across real-world datasets
By Alec Berntson, Alina Beck, Amaia Salvador Aguilera, Arnau Quindós Sánchez, Lihang Li, Tushar Khan
Introduction
Answer synthesis lets your applications receive grounded, cited answers directly from your retrieval layer, no additional orchestration required. Available in Foundry IQ and Azure AI Search, it's built into the agentic retrieval engine in knowledge bases, streamlining how you build end-to-end RAG solutions.
In this post, we'll explain how answer synthesis works, then share how it impacted answer quality across multiple experiments, datasets, and performance metrics.
How It Works
With answer synthesis enabled, the most relevant retrieved content is used to generate a natural language answer, complete with inline citations referencing the supporting documents. Answers are returned in a format that’s easy to render in applications, including metadata to reference the original sources. This simplifies adding cited answers to applications like internal copilots, customer support bots, and knowledge management tools.
Answers deliver the following capabilities:
- Inline citations: We annotate the answer with inline citations that link its factual statements to their supporting sources, making it easy for users to verify information.
- Steerability through natural language instructions: We adjust the answer’s format, style, and language according to instructions provided by developers in the knowledge base or by end users in the query.
- Partial answers: We generate answers even when we could not retrieve enough relevant content for some aspects of the query, providing the user with partial information that can still lead them to a successful search experience.
Answer synthesis is part of Foundry IQ's agentic retrieval engine in knowledge bases. When a query is received, the system retrieves the most relevant content. If the answer synthesis parameter is enabled, we synthesize a response using an LLM, inserting citations inline (e.g., [ref_id:4]), and then return both the answer and a references array. Answers are generated even if the retrieved content addresses only a subset of the query's aspects. We believe this is an important feature of Answers because it still provides the end user with useful data. Even partial information, when some aspects of the query did not get a satisfactory answer, is more useful and can lead to better follow-ups than a non-informative answer of the type “No relevant content was found for your query.” We only return such a fallback answer when we could not retrieve relevant data for any aspect of the query.
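For illustration, a synthesized response might look roughly like the dictionary below. The answer text and source metadata are invented, and every field name apart from the inline [ref_id:N] markers and the references array described above is an assumption rather than the exact API contract.

```python
# Illustrative shape of a synthesized answer with inline citations.
# Field names other than the [ref_id:N] markers and the references array
# are assumptions, not the exact API contract.
example_response = {
    "answer": (
        "Standard shipping takes 3-5 business days [ref_id:2]. "
        "Expedited options can be selected at checkout [ref_id:4]."
    ),
    "references": [
        {"ref_id": "2", "title": "Shipping policy"},      # hypothetical source metadata
        {"ref_id": "4", "title": "Checkout options FAQ"},  # hypothetical source metadata
    ],
}
```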
When generating the answer, we prompt the LLM to follow specific formatting guidelines (e.g., a concise, helpful, and professional tone). However, we also allow developers and end users to provide their own instructions and thus steer the answer's shape (e.g., make the answer shorter or longer, add bullet points, respond in a given language, etc.). If contradictory instructions are provided to the LLM, the instructions written by end users in their queries are given priority, followed by those written by developers in the knowledge base, and finally by those defined by us.
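To make that precedence concrete, here is a purely illustrative sketch; the function and its ordering logic are ours, not the product's internal implementation.

```python
# Illustrative only: arrange instructions so higher-priority sources win on conflict
# (end-user query > developer knowledge base settings > built-in defaults).
def order_instructions(defaults: list[str],
                       developer: list[str],
                       end_user: list[str]) -> list[str]:
    # Later entries take precedence, so the highest-priority source goes last.
    return defaults + developer + end_user

print(order_instructions(
    defaults=["Use a concise, helpful, professional tone."],
    developer=["Respond in English."],
    end_user=["Respond in Spanish."],  # on conflict, the end-user instruction wins
))
```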
How We Measure Quality
Metrics
We strive to generate answers that are informative, grounded in the retrieved content (instead of hallucinated by the LLM or extracted from its internal knowledge), steerable by provided instructions and annotated with citations. We developed metrics to measure all these aspects:
- Percentage of answered queries: shows the percentage of queries for which the generated answer is not “No relevant content was found for your query.”
- Answer relevance: measures how relevant the generated answer is to the user’s query.
- Groundedness: measures the extent to which the generated answer is supported by the retrieved content.
- Number of citations per answer: shows how many citations are included per answer on average.
- Citation quality: measures the extent to which the citation-delimited sections of the answer are supported by the cited documents.
- Steering instructions compliance: determines if the generated answer follows the provided instructions. In the case of contradictory instructions, it determines if the priority of instructions is followed.
We use LLMs as judges to label data, which allows us to compute the different metrics.
To measure the percentage of answered queries, we use a judge LLM to check whether each generated response attempts to address the query (even partially) or simply returns the rejection message: “No relevant content was found for your query.”
This lets us quantify the percentage of queries that are actually answered and compute the remaining metrics accurately. Because those metrics are only meaningful when the system produces an answer (not the rejection message), we exclude unanswered queries from their calculations.
For answer relevance, we provide the judge LLM with the user query and the generated answer and we ask it to determine how relevant the answer is, on a scale from 0 to 4 (which we rescale to 0 to 100, with higher values meaning more relevant).
For groundedness and citation quality, we take inspiration from an evaluation report for the TREC 2024 RAG Track and compute “nuggets of information”. Roughly, nuggets are atomic factual claims, which lets us determine precisely which parts of the generated answer are supported by the retrieved content and which are not. We use two LLM calls:
- The first LLM call extracts all nuggets from the generated answer (for groundedness) or from the citation-delimited sections of the answer (for citation quality).
- The second LLM call determines whether the extracted nuggets are supported by the retrieved content (for groundedness) or by the cited documents (for citation quality).
We aggregate the nugget-level scores and rescale the final scores to 0 to 100, with higher values meaning better quality.
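The sketch below illustrates this nugget-based scoring for groundedness. The judge prompts and the call_llm helper are placeholders we introduce here, not our production pipeline; citation quality is computed analogously over citation-delimited spans and their cited documents.

```python
from typing import Callable

# Minimal sketch of nugget-based groundedness scoring. The judge prompts and
# the call_llm helper are placeholders, not the production pipeline.
def groundedness_score(answer: str, retrieved_content: str,
                       call_llm: Callable[[str], str]) -> float:
    # First judge call: extract atomic factual claims ("nuggets") from the answer.
    nuggets = [n.strip() for n in call_llm(
        "List each atomic factual claim in the following answer, one per line:\n"
        + answer
    ).splitlines() if n.strip()]

    # Second judge call: one yes/no verdict per nugget, judged against the content.
    verdicts = call_llm(
        "For each claim below, answer yes or no on its own line: "
        "is the claim supported by the context?\n"
        f"Context:\n{retrieved_content}\nClaims:\n" + "\n".join(nuggets)
    ).splitlines()

    supported = sum(v.strip().lower().startswith("yes") for v in verdicts)
    # Aggregate and rescale to 0-100; higher means better grounded.
    return 100.0 * supported / max(len(nuggets), 1)
```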
To measure steering instructions compliance, we provide the answer generation LLM with one or more instructions written in natural language (both formal and colloquial, with and without misspellings). Examples of such instructions are “Use bullet points to organize the answer”, “Respond in Spanish”, or “Include at least three examples.” After the answer is generated, we ask the judge LLM to determine whether the generated answer indeed follows the provided instructions. Again, we aggregate the scores into a final score, which we rescale to 0 to 100.
Datasets
To evaluate the generated answers, we use several datasets designed to reflect real-world production scenarios (see our previous blog post [3] for a more detailed description):
- Customer: Contains multiple document sets shared by Azure customers with permission.
- Support: Drawn from hundreds of thousands of publicly available support and knowledge base articles in various languages, of which we used content in 8 languages.
- Multi-industry, Multi-language (MIML): Comprises 60 indexes built from publicly accessible documents. These represent typical materials from 10 customer segments and 6 languages.
We use more than 10,000 queries, both from real users and generated using multiple strategies, as described in a previous blog post. We use queries on their own and also pair each query with one or more steering instructions drawn from a pool of 70. To generate answers and compute metrics, we retrieve content for each query using the agentic retrieval API version 2025-08-01-preview.
One of our goals is to produce answers that are as grounded as possible in the retrieved content—providing useful responses when context is partial, while avoiding ungrounded outputs when context is irrelevant or missing. To evaluate this, we create three additional benchmarks based on the MIML dataset. For each query, we pair it with different contexts to simulate multiple levels of available information:
- Full information: we pair queries with manually selected documents that contain the necessary context to fully answer the queries.
- Partial information: we pair queries with documents that provide only partial context.
- No information: we pair queries with unrelated documents which provide no useful information.
This structure lets us measure whether answers remain grounded in the retrieved content: complete answers when full context is available, partial answers when only some information is present, and no ungrounded answers when context is irrelevant or missing.
Results
We used GPT-4.1-mini to generate the answers for all tables except for Table 8. We set the maximum length of the context string to 5000 tokens everywhere.
Overall performance metrics
Table 1 presents the overall performance metrics on different datasets, showing high scores across metrics and datasets.
| Dataset | % answered queries | Answer relevance | Groundedness | #citations per answer | Citation quality |
|---|---|---|---|---|---|
| MIML | 95.9% | 93.9 | 87.4 | 5.0 | 81.6 |
| Support | 97.8% | 94.5 | 92.0 | 5.2 | 88.7 |
| Customer | 93.4% | 89.5 | 95.6 | 3.7 | 94.9 |
Table 1: Metrics across multiple datasets
Tables 2 and 3 show the metrics broken down by language and content segment. Metrics are stable across these dimensions, although there is a regression in groundedness and citation quality for Japanese and Chinese. This could be explained by lower performance of the GPT-4.1-mini model on these languages, a hypothesis we plan to test in future work.
| Dataset | Language | % answered queries | Answer relevance | Groundedness | #citations per answer | Citation quality |
|---|---|---|---|---|---|---|
| MIML | German | 96.2% | 93.6 | 88.4 | 5.0 | 87.8 |
| | English | 94.1% | 89.7 | 87.9 | 4.6 | 87.5 |
| | Spanish | 95.1% | 91.8 | 87.4 | 4.3 | 87.1 |
| | French | 95.2% | 92.6 | 89.0 | 4.5 | 86.0 |
| | Japanese | 96.9% | 95.9 | 83.0 | 4.4 | 67.3 |
| | Chinese | 96.9% | 95.8 | 82.6 | 6.3 | 68.4 |
Table 2: Metrics across different languages
| Dataset | Content segment | % answered queries | Answer relevance | Groundedness | #citations per answer | Citation quality |
|---|---|---|---|---|---|---|
| MIML | Accounting & Tax Services | 94.1% | 92.3 | 87.0 | 4.5 | 79.9 |
| | Banking | 97.7% | 94.3 | 85.6 | 4.5 | 79.9 |
| | Government Administration | 95.8% | 93.5 | 86.2 | 5.4 | 81.7 |
| | Healthcare Administration | 96.1% | 94.7 | 82.8 | 5.3 | 77.3 |
| | Healthcare Research | 95.0% | 91.9 | 85.8 | 5.0 | 81.0 |
| | Human Resources | 97.8% | 95.2 | 88.4 | 4.8 | 79.4 |
| | Industrial and Manufacturing | 96.0% | 93.7 | 86.8 | 4.7 | 82.5 |
| | Insurance | 94.9% | 92.7 | 87.2 | 5.3 | 83.2 |
| | Legal | 93.9% | 90.7 | 87.4 | 4.6 | 80.0 |
| | Product Manuals | 96.2% | 94.0 | 86.5 | 4.8 | 81.7 |
Table 3: Metrics across different content segments
Metrics for different levels of information
We paid particular attention to answer generation when different levels of information are available. Table 4 shows the results for the three cases. As expected, answer relevance is highest when the generating model is presented with plenty of relevant information. When only some aspects of the query are addressed by the retrieved content, a high-quality answer is still generated in most cases, as the metrics show. We believe a partial answer is much more useful to the user than no answer (e.g., “No relevant content was found for your query.”), since they can use it to formulate new queries. When no relevant information is present in the retrieved content, an answer is generated only very rarely, thus preventing ungrounded answers.
| Dataset | Available context | % answered queries | Answer relevance | Groundedness | #citations per answer | Citation quality |
|---|---|---|---|---|---|---|
| MIML | full information | 100.0% | 99.4 | 90.6 | 4.3 | 86.3 |
| | partial information | 95.7% | 86.1 | 84.2 | 2.9 | 81.0 |
| | no information | 1.3% | - | - | - | - |
Table 4: Metrics on query-context pairs at three different levels of information
Metrics for steering instructions compliance
Table 5 focuses on compliance with a selected set of steering instructions. These instructions were designed to influence the shape of the response, including language, tone, and style, without altering the factual content. Examples of instructions are “respond in Spanish” or “answer in under 100 words”. We pair each query with 1 to 4 instructions. The results show high compliance even when multiple constraints are applied.
| Dataset | #steering instructions | Steering instructions compliance |
|---|---|---|
| MIML | 1 instruction | 97.6 |
| | 2 instructions | 96.2 |
| | 3 instructions | 95.3 |
| | 4 instructions | 89.6 |
Table 5: Steering instructions compliance for different number of instructions
Table 6 shows that, in case of conflicts, the instructions written by end users in their queries are given priority over those written by developers, causing only a small drop in instruction compliance.
| Dataset | Conflicts | Steering instructions | Steering instructions compliance |
|---|---|---|---|
| MIML | no conflict (baseline) | user-provided only | 98.2 |
| | conflicts | user-provided + conflicting dev-provided | 91.8 |
Table 6: Steering instructions compliance for conflicting and non-conflicting instructions
Since steering instructions are added to the prompt, we check whether their presence impacts the other metrics. Table 7 compares the overall metrics with and without steering instructions. The results confirm that introducing steering instructions does not negatively impact the core performance metrics.
| Dataset | Steering | % answered queries | Answer relevance | Groundedness | #citations per answer | Citation quality |
|---|---|---|---|---|---|---|
| MIML | without | 95.9% | 93.9 | 87.4 | 5.0 | 81.6 |
| | with | 97.9% | 95.9 | 86.2 | 5.5 | 79.6 |
| Support | without | 97.8% | 94.5 | 92.0 | 5.2 | 88.7 |
| | with | 98.0% | 93.2 | 90.9 | 5.6 | 87.1 |
| Customer | without | 93.4% | 89.5 | 95.6 | 3.7 | 94.9 |
| | with | 95.6% | 91.6 | 94.2 | 4.4 | 92.6 |
Table 7: Metrics with/without steering
Metrics for different GPT models
Table 8 shows the overall performance metrics when different GPT models are used for answer generation. While high-quality answers are generated with all models, some of the less powerful models (gpt-4o-mini, gpt-4.1-nano) show significant drops in performance compared to the more powerful ones.
| Model family | Model | % answered queries | Answer relevance | Groundedness | #citations per answer | Citation quality |
|---|---|---|---|---|---|---|
| gpt-4o | gpt-4o | 91.7% | 89.1 | 86.9 | 4.0 | 83.9 |
| | gpt-4o-mini | 79.9% | 77.8 | 80.9 | 2.3 | 69.5 |
| gpt-4.1 | gpt-4.1 | 96.7% | 92.6 | 86.4 | 4.1 | 84.0 |
| | gpt-4.1-mini | 96.3% | 93.9 | 87.1 | 4.9 | 81.3 |
| | gpt-4.1-nano | 73.3% | 73.4 | 78.6 | 2.3 | 59.5 |
| gpt-5 | gpt-5 | 96.2% | 88.9 | 89.3 | 7.2 | 88.3 |
| | gpt-5-mini | 96.4% | 90.0 | 89.3 | 8.0 | 84.4 |
| | gpt-5-nano | 90.2% | 83.2 | 88.3 | 4.8 | 75.3 |
Table 8: Metrics for different GPT models
Get started
To use Answers, set the generateAnswer parameter in your agentic retrieval API call. The response includes the synthesized answer, inline citations, and a references array. The output is designed to be straightforward to render in chatbots, web apps, or other user interfaces.
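As a rough sketch, a request with answer synthesis enabled could look like the following. The endpoint path, agent name, authentication, and all payload and response field names apart from generateAnswer, the inline citations, and the references array are assumptions for illustration; check the agentic retrieval API reference for the exact contract.

```python
import requests

# Rough sketch of an agentic retrieval call with answer synthesis enabled.
# Endpoint path, agent name, auth, and field names other than generateAnswer
# are assumptions for illustration; see the API reference for the exact shape.
SEARCH_ENDPOINT = "https://<your-search-service>.search.windows.net"  # placeholder
API_VERSION = "2025-08-01-preview"  # version used in the experiments above

response = requests.post(
    f"{SEARCH_ENDPOINT}/agents/<your-knowledge-agent>/retrieve",
    params={"api-version": API_VERSION},
    headers={"api-key": "<your-admin-key>", "Content-Type": "application/json"},
    json={
        "messages": [
            {"role": "user", "content": "How do I rotate my storage account keys?"}
        ],
        "generateAnswer": True,  # enable answer synthesis (exact value per the docs)
    },
    timeout=60,
)
result = response.json()
print(result.get("answer"))      # synthesized answer with inline [ref_id:N] citations
print(result.get("references"))  # references array for rendering sources
```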
Here are more resources to explore what's possible: