Our latest evaluations show knowledge bases achieve better answer scores than traditional RAG
By Alec Berntson, Alina Beck, Amaia Salvador Aguilera, Arnau Quindós Sánchez, Gunter Loch, Jieyu Cheng, Lihang Li, Mike Kim, Thibault Gisselbrecht, Tushar Khan
Foundry IQ by Azure AI Search is a unified knowledge layer for agents, designed to improve response performance, automate RAG workflows and enable enterprise-ready grounding. Foundry IQ and Azure AI Search are part of Microsoft Foundry.
Foundry IQ uses the agentic retrieval engine in Azure AI Search knowledge bases (KBs) to tackle several key challenges customers face when building agents: a single entry point to search across all your content, access to federated sources such as web grounding to supplement private data, self-reflective search, latency-quality trade-off controls, high-quality answer output, and steerability. Developers can configure and register a single knowledge base as a ‘super tool’ that gives agents access to many different knowledge sources. This greatly reduces agent development complexity and separates concerns between retrieving knowledge (the knowledge base) and using it (your agent).
In this post we detail the operations that take place when a knowledge base is called and showcase its capabilities through results from numerous experiments carried out with representative customer data.
The results demonstrate that knowledge bases:
- Integrate with your agents as a single tool, with a range of latency/quality setpoints, for up to 10 different knowledge sources, including Azure Blob Storage, Microsoft OneLake, SharePoint in Microsoft 365, and the web.
- Provide an average improvement of +20 points (36%) in end-to-end RAG answer scores compared to brute-force searching all sources at once.
- Can leverage source selection to bring together different sources of knowledge and fill gaps using web grounding.
- Use reflective search to review retrieved results and issue follow-up queries that improve relevance on hard queries, while providing a low-latency exit path for easier ones via semantic classifier, a newly trained small language model (SLM).
- Offer the option to return either the best retrieved content raw or a generated answer with strong grounding and steering.
Defining agentic retrieval in knowledge bases
The agentic retrieval process in knowledge bases automates and brings together many steps that enable your agent to effectively search across a wide range of sources. In this section we look in detail at the approach we follow for retrieval.
Since the parameter space for agentic retrieval is too large to surface as individual inputs, we introduce a single, higher-level retrieval reasoning effort parameter that gives developers control over the latency-quality trade-off without having to tune a large number of options.
Here are the steps for requests with retrieval reasoning effort set to medium. The differences for minimal and low are described later.
Figure 1: Overview of search execution flow in knowledge bases
Step 1: Configuration. Knowledge bases are configured with the target knowledge sources, an Azure OpenAI deployment to use for internal reasoning, retrieval instructions, retrieval reasoning effort, and other parameters.
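To make this concrete, here is a minimal sketch of what registering a knowledge base might look like over REST. The endpoint path, API version, and field names are assumptions for illustration only; consult the Azure AI Search documentation linked at the end of this post for the actual contract.

```python
import os

import requests

# Illustrative only: path, api-version, and field names are assumptions,
# not the published Azure AI Search knowledge base contract.
SEARCH_ENDPOINT = os.environ.get("SEARCH_ENDPOINT", "https://<service>.search.windows.net")
API_KEY = os.environ.get("SEARCH_API_KEY", "<api-key>")

knowledge_base = {
    "name": "contoso-kb",
    # Up to 10 knowledge sources of mixed types (local indexes, SharePoint, web, MCP).
    "knowledgeSources": [
        {"name": "hr_policies", "kind": "searchIndex",
         "description": "Employee handbook, benefits, and HR policies."},
        {"name": "product_manuals", "kind": "searchIndex",
         "description": "Hardware product manuals and troubleshooting guides."},
        {"name": "web", "kind": "web",
         "description": "Public web via Bing grounding, for fresh or public facts."},
    ],
    # Azure OpenAI deployment used for query planning, reflection, and answer generation.
    "reasoningModel": {"azureOpenAIDeployment": "gpt-4o-mini"},
    # Natural-language steering applied during source selection and query planning.
    "retrievalInstructions": "Prefer internal sources; use the web only for recent public events.",
    # Latency/quality trade-off: minimal | low | medium.
    "retrievalReasoningEffort": "medium",
}

resp = requests.put(
    f"{SEARCH_ENDPOINT}/knowledgeBases/{knowledge_base['name']}?api-version=<preview>",
    headers={"api-key": API_KEY, "Content-Type": "application/json"},
    json=knowledge_base,
    timeout=30,
)
print(resp.status_code, resp.text)
```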
Step 2: Source selection and query planning. When a request is made to a knowledge base, it is decomposed into one or more subqueries, and one or more knowledge sources are selected for retrieval. Retrieval instructions are followed at this stage to steer query execution (e.g. transform queries or promote certain sources).
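Conceptually, the planner gives the reasoning model the caller's request, the knowledge source names and descriptions, and the retrieval instructions, and asks for subqueries paired with selected sources. The sketch below shows that general pattern with a generic OpenAI-style chat call; the prompt, output schema, and model name are illustrative assumptions, not the service's internal implementation.

```python
import json

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY; the service itself uses your Azure OpenAI deployment

SOURCES = {
    "hr_policies": "Employee handbook, benefits, and HR policies.",
    "product_manuals": "Hardware product manuals and troubleshooting guides.",
    "web": "Public web via Bing grounding, for fresh or public facts.",
}

def plan(request: str, instructions: str) -> list[dict]:
    """Decompose a request into subqueries and select sources for each.
    The prompt and JSON shape here are illustrative only."""
    prompt = (
        "Decompose the user request into search subqueries and choose which knowledge "
        'sources to query for each. Return a JSON object {"plan": [{"query": str, "sources": [str]}]}.\n'
        f"Retrieval instructions: {instructions}\n"
        f"Knowledge sources: {json.dumps(SOURCES)}\n"
        f"User request: {request}"
    )
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        response_format={"type": "json_object"},
    )
    return json.loads(resp.choices[0].message.content)["plan"]

# e.g. plan("Compare our parental leave policy to the statutory minimum in France",
#           "Prefer internal sources; use the web only for public regulations.")
```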
Step 3: Federation. The subqueries are executed against the selected knowledge sources, which can include local searches against internally built vector and text indexes from Azure Blob Storage and Microsoft OneLake, and/or remote searches against targets such as SharePoint in Microsoft 365, web via Bing, and MCP servers. Up to the top 50 results for each (knowledge source, subquery) pair are fetched.
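This step is essentially a parallel fan-out over (knowledge source, subquery) pairs with a per-pair result cap. A rough sketch of that shape follows; the search_source stub stands in for whichever backend client each source kind needs.

```python
from concurrent.futures import ThreadPoolExecutor

TOP_K_PER_PAIR = 50  # up to the top 50 results per (knowledge source, subquery) pair

def search_source(source: str, query: str) -> list[dict]:
    """Stub for a single federated call (local index, SharePoint, web via Bing, MCP...).
    Each backend has its own client; only the fan-out shape matters here."""
    return []  # e.g. [{"source": source, "query": query, "content": "...", "l2_score": 0.0}]

def federate(plan: list[dict]) -> list[dict]:
    """Run every (source, subquery) pair in parallel and cap each at TOP_K_PER_PAIR."""
    pairs = [(source, step["query"]) for step in plan for source in step["sources"]]
    with ThreadPoolExecutor(max_workers=8) as pool:
        batches = pool.map(lambda pair: search_source(*pair)[:TOP_K_PER_PAIR], pairs)
    return [doc for batch in batches for doc in batch]
```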
Step 4: Ranking and filtering. The top results from all sources (including non-Azure AI Search sources) are scored and ranked using semantic ranker. Semantic ranker provides normalized and calibrated scores across content types, which places all the retrieved content on a consistent scale. It also produces extractive captions, which serve as query-dependent summaries of each content chunk. Documents from across knowledge sources are then grouped by subquery (i.e. all documents for the same subquery are sorted together by their semantic ranker score). The top 10 documents from each group are then scored again using our new semantic classifier, which provides a stronger score optimized for downstream RAG tasks. In the diagram we refer to semantic ranker as L2 (2nd layer of processing) and semantic classifier as L3 (3rd layer).
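The two-stage pattern here (group by subquery, sort by the L2 semantic ranker score, then re-score only the top 10 per group with the heavier L3 classifier) can be sketched as follows; the l3_score callable stands in for semantic classifier.

```python
from collections import defaultdict
from typing import Callable

def rank_and_filter(docs: list[dict], l3_score: Callable[[dict], float]) -> dict[str, list[dict]]:
    """Group retrieved documents by subquery, order each group by its L2 semantic
    ranker score, then re-score the top 10 of each group with the L3 classifier."""
    groups: dict[str, list[dict]] = defaultdict(list)
    for doc in docs:
        groups[doc["query"]].append(doc)
    for query, group in groups.items():
        group.sort(key=lambda d: d["l2_score"], reverse=True)   # L2: semantic ranker
        top = group[:10]
        for doc in top:
            doc["l3_score"] = l3_score(doc)                     # L3: semantic classifier
        groups[query] = sorted(top, key=lambda d: d["l3_score"], reverse=True)
    return dict(groups)
```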
Step 5: Reflective search. In another call, the captions from the top 10 documents are concatenated and inspected by semantic classifier to determine whether another round of search is necessary. If the information need of the subquery appears to be satisfied by what was returned, the retrieval phase is complete. If semantic classifier decides the query is complex or the information need is not satisfied, the collection of captions is sent to another LLM prompt, which ‘reflects’ on whether the subquery is satisfied. If it is, the retrieval phase is complete. If not, the best documents are pinned, and a new set of knowledge sources and queries is generated. The retrieval phase completes after this second retrieval. This feature is also referred to as iterative search.
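In code-sketch form, the exit decision is a cheap classifier check followed, only when needed, by an LLM reflection. The two callables below are stand-ins for semantic classifier and the reflection prompt, and the threshold is illustrative, not the production value.

```python
from typing import Callable

EXIT_THRESHOLD = 0.8  # illustrative threshold, not the production value

def retrieval_is_complete(groups: dict[str, list[dict]],
                          sc_exit_probability: Callable[[str], float],
                          llm_says_satisfied: Callable[[str], bool]) -> bool:
    """Return True if the first retrieval pass already satisfies every subquery."""
    for query, top_docs in groups.items():
        captions = " ".join(d.get("caption", "") for d in top_docs[:10])
        if sc_exit_probability(captions) >= EXIT_THRESHOLD:
            continue                          # SC is confident this subquery is satisfied
        if not llm_says_satisfied(captions):  # otherwise SC defers to the LLM reflection prompt
            return False                      # pin the best docs, replan, run one more retrieval
    return True
```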
Step 6: Results merging. The results are merged across iterations, knowledge sources, and queries using a multi-layer round-robin approach. A response is constructed from the best documents based on the token budget set by the retrieval reasoning effort (or a customer override).
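A naive version of a round-robin merge with a token budget looks like the sketch below; real token counting would use the model's tokenizer rather than whitespace splitting.

```python
from itertools import zip_longest

def merge_round_robin(ranked_lists: list[list[dict]], token_budget: int) -> list[dict]:
    """Interleave the per-(iteration, knowledge source, subquery) ranked lists,
    taking every list's rank-1 document before any rank-2 document, and stop
    once the token budget is exhausted."""
    merged: list[dict] = []
    used = 0
    for tier in zip_longest(*ranked_lists):            # tier = all docs at the same rank
        for doc in tier:
            if doc is None:
                continue
            tokens = len(doc["content"].split())       # crude stand-in for a real tokenizer
            if used + tokens > token_budget:
                return merged
            merged.append(doc)
            used += tokens
    return merged
```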
Step 7 (optional): Answer generation. Knowledge bases provide a fully featured answer generation stage. Answers are constructed from the content response with a very high grounding rate (low hallucination) and support several key features: inline citations, partial answers, tone and response steering (for example, ‘in bullets’ or ‘translate to French’), and source document organization (e.g. ‘according to the web’, ‘according to the abc knowledge source’).
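For intuition, a grounded-answer prompt with the features listed above (inline citations, partial answers, steering, per-source organization) could be assembled along these lines. This is an illustrative sketch, not the prompt the service uses.

```python
def build_answer_prompt(question: str, docs: list[dict], steering: str = "") -> str:
    """Number the retrieved chunks so the model can cite them inline as [n],
    group them by knowledge source, and pass through any caller steering
    (e.g. 'in bullets', 'translate to French')."""
    by_source: dict[str, list[str]] = {}
    for i, doc in enumerate(docs, start=1):
        by_source.setdefault(doc["source"], []).append(f"[{i}] {doc['content']}")
    context = "\n\n".join(
        f"According to {source}:\n" + "\n".join(chunks) for source, chunks in by_source.items()
    )
    return (
        "Answer strictly from the sources below and cite them inline as [n]. "
        "If the sources only partially cover the question, give a partial answer and say so.\n"
        f"{steering}\n\nSources:\n{context}\n\nQuestion: {question}"
    )
```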
Customers can control cost and latency by picking a retrieval reasoning effort level. Three levels are supported, with the following capabilities:
| Capability | Minimal | Low | Medium |
| --- | --- | --- | --- |
| Source selection and query planning | Not supported | Yes | Yes |
| Federated search (multiple knowledge sources searched in parallel) | All at once | Based on source selection | Based on source selection |
| Web grounding knowledge source | Not supported | Yes (with answer generation) | Yes (with answer generation) |
| Semantic ranking (SR) | Yes | Yes | Yes |
| Semantic classifier (SC) | No | No | Yes |
| Reflective search | No | No | Up to 1 additional retrieval step |
| Answer generation | No | Yes | Yes |
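As a rough sketch of how a caller might exercise these levels, a retrieval request against a knowledge base could look like the following. As with the configuration example above, the path, API version, and payload fields are assumptions rather than the published contract.

```python
import os

import requests

SEARCH_ENDPOINT = os.environ.get("SEARCH_ENDPOINT", "https://<service>.search.windows.net")
API_KEY = os.environ.get("SEARCH_API_KEY", "<api-key>")

resp = requests.post(
    f"{SEARCH_ENDPOINT}/knowledgeBases/contoso-kb/retrieve?api-version=<preview>",
    headers={"api-key": API_KEY, "Content-Type": "application/json"},
    json={
        "messages": [{"role": "user", "content": "Summarize our parental leave policy."}],
        # Latency/quality trade-off for this request (assumed overridable here).
        "retrievalReasoningEffort": "low",
        # Return raw ranked content or a generated, cited answer.
        "output": {"kind": "answer", "tokenBudget": 5000},
    },
    timeout=60,
)
print(resp.status_code, resp.json() if resp.ok else resp.text)
```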
Performance Benchmarks
We tested knowledge bases at the three retrieval efforts on several datasets and queries, spanning many different types and use cases. The retrieval reasoning effort that is most appropriate for your use case may vary, but in general:
- We observed improvements ranging from 16% to 60%, with an average of 36% when comparing medium to minimal retrieval reasoning effort.
- Most improvements are seen for queries that are particularly difficult and require gathering information from across knowledge sources.
- Performance on easier queries plateaus at lower retrieval reasoning effort; we did not observe any regressions when comparing medium to low or minimal, and the reflective-search early exit reduces latency on easy queries.
Metrics
We focused on the RAG triad of metrics (content relevance, answer relevance, groundedness) from our previous work [1], using the same scoring methodology. For this post, we report only the answer relevance score, which ranges from 0 to 100. In all cases the grounding score was very high (minimum 80%, averaging >95% over all experiments).
Evaluation Sets
In this test we focused heavily on the interplay of private and public web content. We used the following datasets:
| Dataset | Description |
| --- | --- |
| MIML-hardest-en | Multi-industry, multi-language ("MIML"). A collection of 10 indexes that we created from publicly available documents, representing typical documents from 10 customer segments. Each index has approximately 1,000 documents, on the order of 10M tokens. We used the hard query generation pipeline from our previous post and used sampling to produce 'hard' and 'hardest' variants. |
| Sec-hardest | We used the same set of documents as FinanceBench (GitHub - patronus-ai/financebench) but organized the files into nine knowledge sources based on their Global Industry Classification Standard (GICS) sector and generated hard queries using the MIML hard query generation pipeline. As with MIML, we created several different difficulty levels. |
| Sec-open | The original documents and queries from FinanceBench. |
| NQ | A fairly straightforward fact-search dataset largely based on Wikipedia. We used all the original queries. GitHub - google-research-datasets/natural-questions |
| Hotpot | A Wikipedia-based dataset that requires multiple connected pieces of information (multi-hop QA) to construct a complete answer. We subsampled 1,000 queries. HotpotQA Homepage |
| Frames | A challenging RAG evaluation dataset from google/frames-benchmark · Datasets at Hugging Face |
Results
Knowledge bases with query planning and reflective search give a 36% gain over directly searching all knowledge sources
When we set up a knowledge base with multiple (up to 10) knowledge sources and run challenging queries at each reasoning effort, the performance increases with each level.
- Minimal: runs the exact caller query on all knowledge sources in parallel. We ran the content output with a 5,000-token budget and added the same answer generation step as the other retrieval reasoning efforts. Note that minimal doesn’t currently support built-in answer generation.
- Low: runs the input query through query planning and source selection, federating out to multiple knowledge sources. Low uses a 5,000-token answer generation budget.
- Medium: runs the input query through query planning and source selection, federating out to multiple knowledge sources and allowing up to 1 reflective follow-up retrieval. Medium uses a 10,000-token answer generation budget.
Adding query planning from minimal to low helps significantly, and reflective search further improves performance from low to medium. These gains are evident across all datasets, with an average gain of 36%.
Chart 1: Answer score for 5 different datasets across retrieval reasoning effort levels. Thick green line is the average performance.
| Dataset | KS type | Answer score (minimal) | Answer score (low) | Answer score (medium) | Minimal to medium (absolute) | Minimal to medium (relative) |
| --- | --- | --- | --- | --- | --- | --- |
| sec-hardest | AZS | 49 | 50 | 62 | 13 | 26% |
| miml-hardest-en | AZS | 47 | 69 | 72 | 26 | 55% |
| sec-open | Web | 67 | 84 | 90 | 22 | 33% |
| hotpot | Web | 79 | 88 | 91 | 13 | 16% |
| frames | Web | 54 | 80 | 86 | 32 | 60% |
| Average | | 59 | 74 | 80 | 21 | 36% |
Table 1: Answer score for 5 different datasets across retrieval reasoning effort levels, with absolute and relative gains measured between minimal and medium. The AZS knowledge source type refers to Azure AI Search-backed indexes; the web knowledge sources use the Bing grounding API.
We also measured two of these datasets using the recently published BenchmarkQED library from Microsoft Research (MSR). We can see that the answers produced using the content retrieved at medium reasoning effort are preferred across all four metrics. As would be expected from more robust retrieval, comprehensiveness is the most improved quality dimension.
| Dataset | Comprehensiveness | Diversity | Empowerment | Relevance |
| --- | --- | --- | --- | --- |
| sec-hardest | 69% | 66% | 67% | 65% |
| miml-hardest | 79% | 77% | 78% | 75% |
Table 2: Pairwise preference scores based on the BenchmarkQED library for two datasets, comparing minimal to medium retrieval reasoning effort. A score of 79% means that the answer created at the medium level was preferred to the minimal one 79% of the time.
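For clarity on how to read these numbers, a pairwise preference score is simply a win rate over head-to-head judgments. The toy function below computes one; BenchmarkQED's own protocol, including how it handles ties, may differ.

```python
def preference_rate(judgments: list[str]) -> float:
    """Fraction of head-to-head comparisons in which the 'medium' answer was
    judged better than the 'minimal' one. Ties are simply excluded here."""
    decisive = [j for j in judgments if j in ("medium", "minimal")]
    return sum(j == "medium" for j in decisive) / len(decisive)

# preference_rate(["medium", "medium", "minimal", "medium"]) -> 0.75
```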
Web search in knowledge bases is effective as a standalone knowledge source
The web knowledge source can federate queries to the Bing grounding API. This is extremely useful for bringing in fresh data (e.g. content published after the model’s training cutoff, or frequently changing content). We ran four common open-source benchmarks to evaluate the web knowledge source. On an easy test like the NQ dataset, the knowledge base was able to saturate the benchmark, while increasing the retrieval reasoning effort improved quality substantially on more challenging ones.
| Dataset | KS type | Answer score (minimal) | Answer score (low) | Answer score (medium) | Minimal to medium (absolute) | Minimal to medium (relative) |
| --- | --- | --- | --- | --- | --- | --- |
| nq | Web | 97 | 98 | 98 | 1 | 1% |
| hotpot | Web | 79 | 88 | 91 | 13 | 16% |
| sec-open | Web | 67 | 84 | 90 | 22 | 33% |
| frames | Web | 54 | 80 | 86 | 32 | 60% |
Table 3: Answer scores for four web-based datasets of increasing difficulty, each with one knowledge source.
We can observe that knowledge bases issue more queries at higher reasoning effort levels and that easier datasets (NQ) require fewer queries on average than more challenging ones (Frames).
| Dataset | minimal | low | medium |
| --- | --- | --- | --- |
| nq | 1.0 | 2.1 | 3.1 |
| hotpot | 1.0 | 2.7 | 4.0 |
| sec-open | 1.0 | 3.1 | 4.6 |
| frames | 1.0 | 3.6 | 5.6 |
Table 4: Number of sub-queries issued at each retrieval reasoning effort across 4 web-based datasets.
Knowledge bases with a web knowledge source can fill knowledge gaps in private data
We used the sec-hardest dataset, which is organized into nine knowledge sources based on their Global Industry Classification Standard (GICS) sector. For this experiment, we removed three of the sectors, which were required to answer 50% (75/150) of the queries in the dataset.
| Sector | Redaction experiment |
| --- | --- |
| Consumer Discretionary | Removed |
| Consumer Staples | Removed |
| Information Technology | Removed |
| Industrials | |
| Health Care | |
| Financials | |
| Communication Services | |
| Materials | |
Table 5: The FinanceBench dataset was broken up into nine GICS sectors, and for the redaction experiment we removed the top three.
We collected a baseline at low and medium retrieval reasoning effort with the complete set of nine knowledge sources. This performed very well in both settings, with answer scores of 89 and 90 points, highlighting the knowledge base's ability to route queries and piece together information effectively across multiple knowledge sources.
We then ran again with the three GICS sectors removed. As expected, the score dropped to roughly 50 points. We then added the web knowledge source to fill in the missing pieces; the score recovered significantly (+24 points) at low and approached the original complete dataset (+34 points) at medium. This demonstrates that the knowledge base was effective at using the web to fill in the gaps.
Chart 2: Illustrating the gap-filling capabilities of the web knowledge source at low and medium retrieval reasoning effort.
| RE | Set up | Answer score |
| --- | --- | --- |
| Low | 9 KS | 89 |
| Low | 6 KS (3 removed) | 53 |
| Low | 6 KS + Web | 77 |
| Medium | 9 KS | 90 |
| Medium | 6 KS (3 removed) | 50 |
| Medium | 6 KS + Web | 84 |
Table 6: The answer score for each step in the redaction experiment, illustrating the gap-filling capabilities of the web knowledge source at low and medium retrieval reasoning effort.
Knowledge bases can bring content from multiple knowledge sources together
We also tested the ability of knowledge bases to piece together content spread across knowledge sources by segmenting the sec-hardest queries based on how many GICS sectors were required to produce a complete answer. Medium reasoning effort’s higher budget for queries and parallel knowledge sources led to much better results on queries that required multiple sources; in fact, it more than doubled the quality for 3+ source queries (+105%, from 46 to 95).
| Reasoning effort | 1 GICS sector (answer score) | 2 GICS sectors (answer score) | 3+ GICS sectors (answer score) |
| --- | --- | --- | --- |
| low | 55 | 46 | 46 |
| medium | 89 | 83 | 95 |
Table 7: The answer score at low and medium retrieval reasoning effort, split by the number of GICS sectors required to fully answer the question.
Reflective search in knowledge bases is effective at avoiding second searches for easier queries
Reflective search in knowledge bases uses semantic classifier, a newly trained SLM-powered early-exit model, to reduce LLM calls and latency on easy questions without sacrificing quality. For this experiment, we use the answer score at minimal reasoning effort as a proxy for the overall hardness of the queries. We also created different splits of the sec-hard query set with increasing baseline difficulty (sec-hard < sec-harder < sec-hardest) and included them alongside the other open-source datasets.
When sorted by semantic classifier (SC) exit rate, you can see that on easier datasets the semantic classifier and the LLM tend to agree on exiting; on more challenging datasets the semantic classifier tends to be more conservative and is often overruled by the LLM; and on the hardest datasets, both agree that another iteration would be valuable.
| Dataset | KS type | SC exit % | SC+LLM exit % | ‘Minimal’ answer score | Assessment |
| --- | --- | --- | --- | --- | --- |
| nq | Web | 89 | 89 | 97 | SC and LLM agree |
| hotpot | Web | 26 | 26 | 79 | |
| sec-open | AZS | 21 | 79 | 81 | SC defers to LLM |
| sec-hard | AZS | 13 | 77 | 83 | |
| sec-harder | AZS | 4 | 77 | 64 | |
| frames | Web | 4 | 4 | 54 | SC and LLM agree |
| sec-hardest | AZS | 3 | 3 | 59 | |
Table 8: How often semantic classifier (SC) alone vs. in tandem with an LLM triggers an exit from reflective search, across datasets of varying difficulty.
Knowledge bases use a powerful new classifier to consistently filter documents from across knowledge sources
Knowledge bases' medium reasoning effort uses semantic classifier, an L3 ranker, to make finer-grained decisions about the quality of documents retrieved from various knowledge sources. This model produces a normalized score that can be used reliably across content types and subqueries to filter low-quality content and to produce the final ranking for output to the caller or for answer generation.
It’s important that this model is well calibrated. The chart below shows the semantic classifier score distribution for each ground truth label. A perfect model would produce five point-like spikes, with all samples of a given ground truth label receiving the correct score. The model, while not perfect, does a very effective job of separating good content from bad. For example, purple (ground truth label 0) is the lowest quality content; semantic classifier predicts a score below 2 for the vast majority of label 0 samples. The converse is true of the best content (blue, label 4).
Chart 3: The SC score distribution for each ground truth label for a large pool of test data based on MIML.
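If you want to reproduce this kind of calibration check on your own relevance-labeled data, a per-label score histogram is enough. The sketch below assumes an array of classifier scores and a matching array of 0-4 ground truth labels.

```python
import matplotlib.pyplot as plt
import numpy as np

def plot_score_distributions(scores: np.ndarray, labels: np.ndarray) -> None:
    """One score histogram per ground-truth relevance label (0 = worst, 4 = best).
    A well-calibrated ranker keeps label-0 mass at low scores and label-4 mass at high scores."""
    for label in sorted(set(labels.tolist())):
        plt.hist(scores[labels == label], bins=50, density=True,
                 histtype="step", label=f"label {label}")
    plt.xlabel("classifier score")
    plt.ylabel("density")
    plt.legend()
    plt.show()
```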
The semantic classifier model is very fast at high quality ranking and exit decision making
We fine-tuned the semantic classifier model for two tasks: L3 ranking and the ‘exit decision’ for iterative (reflective) search. The semantic classifier runs on the top 10 documents from each subquery and once more on the concatenated captions from those 10 documents for the early exit decision, for 11 total inferences per subquery. We run the semantic classifier on modern H100 GPUs in production and achieve a latency of <150 ms for a batch at the largest supported chunk size (22k tokens total). This is critical for building agentic experiences where many steps contribute to the latency of each turn.
| Document length (tokens) | Latency (ms for 11 inferences) |
| --- | --- |
| 2048 | 148 |
| 1024 | 81 |
| 512 | 49 |
| 256 | 42 |
Table 9: Semantic classifier latency in milliseconds for a batch of 11 chunks at various token sizes.
Knowledge bases are effective in multi-language knowledge source scenarios
Many customers have a many-to-many mapping between content languages and query languages. We simulated this by setting up knowledge bases from MIML with the insurance industry indexes in three languages (EN, FR, ZH) and issuing queries in their original language.
| Dataset | KS type | Answer score (minimal) | Answer score (low) | Answer score (medium) | Minimal to medium (absolute) | Minimal to medium (relative) |
| --- | --- | --- | --- | --- | --- | --- |
| miml-3-insurance | AZS | 82.6 | 88.1 | 91.4 | 9 | 11% |
Table 10: Answer score improvements across retrieval reasoning efforts for a set of knowledge bases with 3 knowledge sources, each in a different language.
Titles and descriptions work best when they are concise and clear
We conducted inner-loop experiments to establish best practices for knowledge source names and descriptions. We used the MIML-hard-en dataset and varied these fields for each knowledge source using different strategies. We first ran studies to establish a strong name strategy, then tested various description types against both the reference name and a random string of characters. For each name and description strategy, we measured the change in the knowledge base’s ability to select the ground truth knowledge source for each query (i.e. pick the knowledge source where the necessary documents were found).
We found that:
- Being concise is important for both fields. Adding excessive detail can cause the source selection stage to over-index on details and potentially miss the desired knowledge source. This is an area of future research.
- You need either a strong name or a strong description; a gap in the quality of one can be compensated for by the other.
- Use structured, descriptive naming strategies.
- Do not use ultra-short or ambiguous names (e.g. avoid 3-letter acronyms).
- Be careful about blurring the lines between knowledge sources in the descriptions.
- The knowledge source name is most important: a good name with an adversarial description still scores well.
| Description strategy | Concise name | Scrambled name |
| --- | --- | --- |
| Grounded content summaries | 0% | -5% |
| Plain/Natural summaries | -5% | -9% |
| Technical/Formal | -4% | -9% |
| Structural/Lists/Taxonomy | -2% | -7% |
| Search/Intent phrasing | -8% | -19% |
| Label/Tag/Legacy/Project | -3% | -13% |
| Creative/Metaphor/Analogy | -4% | -13% |
| Regional/Temporal | -4% | -13% |
| Semantic overlap/Cross-domain | -8% | -43% |
| Adversarial | -4% | -25% |
| Minimalist | -4% | -7% |
Table 11: Change in recall from the best configuration (concise name, grounded summarized description) for the source selection stage in inner-loop experiments. The MIML-hard-en dataset was used.
As examples, here are the knowledge source names we used for “concise name”.
'accounting_tax_services', 'banking', 'government_administration', 'healthcare_administration', 'healthcare_research', 'human_resources', 'industrial_and_manufacturing', 'insurance', 'legal', 'product_manuals'
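Putting the naming guidance together, a knowledge source definition that follows it might look like the sketch below (field names are assumptions, as in the earlier configuration example).

```python
knowledge_source = {
    # Concise, structured, unambiguous: avoids 3-letter acronyms and overlap with other sources.
    "name": "healthcare_administration",
    "kind": "searchIndex",
    # Short, grounded content summary; excessive detail can cause source selection to over-index.
    "description": (
        "Hospital administration documents: billing procedures, patient intake forms, "
        "and compliance checklists."
    ),
}
```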
Getting started with knowledge bases
Read the Foundry IQ announcement: https://aka.ms/FoundryIQ
Try Foundry IQ in the Foundry portal today: https://ai.azure.com/nextgen
Learn more in the Azure AI Search documentation: https://learn.microsoft.com/en-us/azure/search/
Resources
[1] Up to 40% better relevance for complex queries with new agentic retrieval engine