Microsoft Foundry Blog

Foundry IQ: boost response relevance by 36% with agentic retrieval

Nov 18, 2025

Our latest evaluations show knowledge bases achieve better answer scores than traditional RAG

By Alec Berntson, Alina Beck, Amaia Salvador Aguilera, Arnau Quindós Sánchez, Gunter Loch, Jieyu Cheng, Lihang Li, Mike Kim, Thibault Gisselbrecht, Tushar Khan

Foundry IQ by Azure AI Search is a unified knowledge layer for agents, designed to improve response performance, automate RAG workflows and enable enterprise-ready grounding. Foundry IQ and Azure AI Search are part of Microsoft Foundry.

Foundry IQ uses the agentic retrieval engine in Azure AI Search knowledge bases (KBs) to tackle several key challenges customers face when building agents: a single entry point to search across all your content, access to federated sources such as web grounding to supplement private data, self-reflective search, latency-quality trade-off controls, high-quality answer output, and steerability. Developers can configure and register a single knowledge base as a ‘super tool’ that gives agents access to many different knowledge sources. This greatly reduces agent development complexity and separates concerns between retrieving knowledge (knowledge bases) and using it (your agent).

In this post we detail the operations that take place when a knowledge base is called and showcase its capabilities through results from numerous experiments carried out with representative customer data.

The results demonstrate that knowledge bases:

  • Integrate with your agents as a single tool, with a range of latency/quality setpoints, for up to 10 different knowledge sources, including Azure Blob Storage, Microsoft OneLake, SharePoint in Microsoft 365 and the web.
  • Provide an average of +20 points (36%) improvement in the quality of end-to-end RAG answer scores when using knowledge bases as opposed to brute-force searching all sources at once.
  • Can leverage source selection to bring together different sources of knowledge and fill gaps using web grounding.
  • Utilize reflective search to review retrieved results and issue follow-up queries to improve relevance for hard queries, and provide a low-latency exit path for easier ones using semantic classifier, a newly trained small language model (SLM).
  • Provide the option to pick between using the best retrieved content raw or as a generated answer with strong grounding and steering.

Defining agentic retrieval in knowledge bases

The agentic retrieval process in knowledge bases automates and brings together many steps that enable your agent to effectively search across a wide range of sources. In this section we look in detail at the approach we follow for retrieval.

Since the parameter space for agentic retrieval is too large to surface as individual inputs, we introduce a single, higher-level retrieval reasoning effort parameter that gives developers control over the latency-quality trade-off without having to tune a large number of options.

Here are the steps for requests with retrieval reasoning effort set to medium. The differences for minimal and low are described later.

Figure 1: Overview of search execution flow in knowledge bases

Step 1: Configuration. Knowledge bases are configured with the target knowledge sources, with an Azure OpenAI deployment to use for internal reasoning, retrieval instructions, retrieval reasoning effort and other parameters.
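As a rough sketch of what this configuration step brings together, the payload below shows the kinds of settings involved. The field names, values and shape are illustrative assumptions for this post, not the actual Foundry IQ or Azure AI Search API contract.

```python
# Illustrative sketch only: field names and values are assumptions,
# not the actual Foundry IQ / Azure AI Search API contract.
knowledge_base_config = {
    "name": "contoso-support-kb",
    "knowledgeSources": [
        {"name": "product_manuals", "kind": "azureBlob"},   # locally indexed content
        {"name": "hr_policies", "kind": "oneLake"},          # locally indexed content
        {"name": "web", "kind": "webGrounding"},             # remote, via Bing grounding
    ],
    "reasoningModel": {"kind": "azureOpenAI", "deployment": "<your-deployment>"},
    "retrievalInstructions": "Prefer product_manuals for troubleshooting questions.",
    "retrievalReasoningEffort": "medium",                    # minimal | low | medium
}
```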

Step 2: Source selection and query planning. When a request is made to a knowledge base, it is decomposed into one or more subqueries, and one or more knowledge sources are selected for retrieval. Retrieval instructions are applied at this point to steer query execution (e.g. transform queries or promote certain sources).

Step 3: Federation. The subqueries are executed against the selected knowledge sources, which can include local searches against internally built vector and text indexes from Azure Blob Storage and Microsoft OneLake, and/or remote searches against targets such as SharePoint in Microsoft 365, web via Bing, and MCP servers. Up to the top 50 results for each (knowledge source, sub-query) pair are fetched.

Step 4: Ranking and filtering. The top results from all sources (including non-Azure AI Search) are scored and ranked using semantic ranker. Semantic ranker provides normalized and calibrated scores across content types, which places all the retrieved content on a consistent scale. It also produces extractive captions which can serve as query-dependent summaries of each content chunk. The documents from across knowledge sources are grouped by subquery (i.e. all the documents for the same subquery are sorted together by their semantic ranker score). The top 10 documents from each group are then scored again using our new semantic classifier, which provides a stronger score optimized for downstream RAG tasks. In the diagram we refer to semantic ranker as L2 (2nd layer of processing) and semantic classifier as L3 (3rd layer).
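A minimal sketch of that grouping and two-stage re-scoring follows. The two scoring callables stand in for the service-side semantic ranker (L2) and semantic classifier (L3), which are not exposed as local functions; everything here is illustrative.

```python
from collections import defaultdict

def rank_and_filter(results, semantic_ranker_score, semantic_classifier_score, top_k=10):
    """Group results by sub-query, sort by the L2 (semantic ranker) score,
    then re-score the top-k of each group with the L3 (semantic classifier).
    `results` is an iterable of dicts with 'subquery' and 'content' keys;
    the two scoring callables are stand-ins for the service-side models."""
    groups = defaultdict(list)
    for doc in results:
        doc = dict(doc, l2_score=semantic_ranker_score(doc["subquery"], doc["content"]))
        groups[doc["subquery"]].append(doc)

    reranked = {}
    for subquery, docs in groups.items():
        docs.sort(key=lambda d: d["l2_score"], reverse=True)   # L2 ordering across sources
        top = docs[:top_k]
        for doc in top:
            doc["l3_score"] = semantic_classifier_score(subquery, doc["content"])  # L3 re-score
        top.sort(key=lambda d: d["l3_score"], reverse=True)
        reranked[subquery] = top
    return reranked
```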

Step 5: Reflective search. In another call, the captions from the top 10 documents are concatenated and inspected by semantic classifier to determine if another round of search is necessary. If it looks like the information need from the subquery is satisfied by what was returned, then the retrieval phase is complete. If semantic classifier decides the query is either complex or the information need is not satisfied, then the collection of captions is sent to another LLM prompt. This prompt ‘reflects’ on whether the sub-query is satisfied. If it is, the retrieval phase is complete. If not, the best documents are pinned so they are kept, and a new set of knowledge sources and queries is generated. The retrieval phase completes after this second retrieval. This feature is called iterative search.
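The decision flow reads roughly as follows in code. This is a simplified sketch that assumes stand-in callables for the semantic classifier exit check, the LLM reflection prompt and the second retrieval pass; none of them are real API calls.

```python
def reflective_search(subquery, ranked_docs, sc_should_exit, llm_reflect, run_retrieval):
    """Sketch of the early-exit flow: the semantic classifier (SC) inspects the
    concatenated captions first; only if it does not exit is the LLM asked to
    reflect; only if the LLM is also unsatisfied is one more retrieval issued."""
    captions = " ".join(d["caption"] for d in ranked_docs)

    # Fast path: SC believes the information need is already satisfied.
    if sc_should_exit(subquery, captions):
        return ranked_docs

    # Otherwise an LLM 'reflects' on whether the sub-query is satisfied.
    # Assumed shape: {"satisfied": bool, "pinned": [doc ids], "new_queries": [str]}
    reflection = llm_reflect(subquery, captions)
    if reflection["satisfied"]:
        return ranked_docs

    # Keep the best documents and run exactly one more round of retrieval.
    pinned = [d for d in ranked_docs if d["id"] in set(reflection["pinned"])]
    return pinned + run_retrieval(reflection["new_queries"])
```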

Step 6: Results merging. The results are merged across iterations, knowledge sources and queries using a multi-layer round-robin approach. A response is constructed from the best documents based on the token budget set by the retrieval reasoning effort (or a customer override).
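A minimal sketch of a round-robin merge under a token budget, assuming each ranked list corresponds to one (iteration, knowledge source, query) combination and a hypothetical `count_tokens` helper; the service's actual merge logic may differ.

```python
from itertools import zip_longest

def round_robin_merge(result_lists, token_budget, count_tokens):
    """Interleave several ranked result lists, taking one document from each list
    per pass, until the token budget is exhausted. `count_tokens` stands in for
    whatever tokenizer the service uses; duplicates are skipped by id."""
    merged, used, seen = [], 0, set()
    for layer in zip_longest(*result_lists):       # one pass = one doc from each list
        for doc in layer:
            if doc is None or doc["id"] in seen:
                continue
            cost = count_tokens(doc["content"])
            if used + cost > token_budget:
                return merged                      # stop at the first doc that would overflow
            merged.append(doc)
            seen.add(doc["id"])
            used += cost
    return merged
```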

Step 7 (optional): Answer generation. Knowledge bases provide a fully featured answer generation stage. Answers are constructed from the content response with a very high grounding (low hallucination) rate and support several key features like inline citations, partial answers, tone and response steering (for example, ‘in bullets’, or ‘translate to French’), and source document organization (e.g. ‘according to web, according to abc knowledge sources’).
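The answer generation stage is managed by the service, but the sketch below shows one way the described features (strict grounding, inline citations, partial answers, steering) could be combined into a prompt. It is an assumption for illustration, not the prompt the service actually uses.

```python
def build_answer_prompt(question, docs, steering="Answer in bullets."):
    """Illustrative grounded-answer prompt with numbered inline citations.
    `docs` is a list of dicts with 'source' and 'content' keys."""
    sources = "\n".join(f"[{i + 1}] ({d['source']}) {d['content']}" for i, d in enumerate(docs))
    return (
        "Answer strictly from the numbered sources below. "
        "Cite each claim inline like [1]. If the sources are insufficient, "
        "give a partial answer and say what is missing.\n"
        f"Steering: {steering}\n\nSources:\n{sources}\n\nQuestion: {question}"
    )
```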

Customers can control costs and latency by picking a retrieval reasoning effort level. Three levels are supported, with the following capabilities.

| Capability | Minimal | Low | Medium |
|---|---|---|---|
| Source selection and query planning | Not supported | Yes | Yes |
| Federated search (multiple knowledge sources searched in parallel) | All at once | Based on source selection | Based on source selection |
| Web grounding knowledge source | Not supported | Yes (with answer generation) | Yes (with answer generation) |
| Semantic ranking (SR) | Yes | Yes | Yes |
| Semantic classifier (SC) | No | No | Yes |
| Reflective search | No | No | Up to 1 additional retrieval step |
| Answer generation | No | Yes | Yes |

Performance Benchmarks

We tested knowledge bases at the three retrieval reasoning efforts on several datasets and query sets, spanning many different types and use cases. The retrieval reasoning effort that is most appropriate for your use case may vary, but in general:

  • We observed improvements ranging from 16% to 60%, with an average of 36% when comparing medium to minimal retrieval reasoning effort.
  • Most improvements are seen for queries that are particularly difficult and require gathering information from across knowledge sources.
  • Performance on easier queries plateaus at lower retrieval reasoning effort; we did not observe any regressions when comparing medium to low or minimal, and the iterative exit reduces latency on easy queries.

Metrics

We focused on the RAG triad of metrics (content relevance, answer relevance, groundedness) from our previous work [1] and used the same scoring methodology. For this post, we only report the answer relevance score, which ranges from 0 to 100. In all cases the grounding score was very high (min 80%, averaging >95% over all experiments).

Evaluation Sets

In this test we focused heavily on the interplay of private and public web content. We used the following datasets:

| Dataset | Description |
|---|---|
| MIML-hardest-en | Multi-industry, Multi-language ("MIML"). A collection of 10 indexes that we created from publicly available documents, representing typical documents from 10 customer segments. Each index has approximately 1,000 documents, on the order of 10M tokens. We used the hard query generation pipeline from our previous post and used sampling to produce 'hard' and 'hardest' variants. |
| Sec-hardest | We used the same set of documents as FinanceBench (GitHub - patronus-ai/financebench) but organized the files into 9 knowledge sources based on their Global Industry Classification Standard (GICS) sector and generated hard queries using the MIML hard query generation pipeline. As with MIML, we created several different difficulty levels. |
| Sec-open | The original documents and queries from FinanceBench. |
| NQ | A fairly straightforward fact-search dataset largely based on Wikipedia. We used all the original queries. GitHub - google-research-datasets/natural-questions |
| Hotpot | A Wikipedia-based dataset that requires multiple connected pieces of information (multi-hop QA) to construct a complete answer. We subsampled 1,000 queries. HotpotQA Homepage |
| Frames | A challenging RAG evaluation dataset from google/frames-benchmark · Datasets at Hugging Face |

Results

Knowledge bases with query planning and reflective search give a 36% gain over directly searching all knowledge sources

When we set up a knowledge base with multiple (up to 10) knowledge sources and run challenging queries at each reasoning effort, the performance increases with each level.

  • Minimal: runs the exact caller query on all knowledge sources in parallel. We ran the content output with a 5k token budget and added the same answer generation step as the other retrieval reasoning efforts. Note that minimal doesn’t currently support answer generation.
  • Low: runs the input query through query planning and source selection, federating out to multiple knowledge sources. Low uses a 5,000-token answer generation budget.
  • Medium: runs the input query through query planning and source selection, federating out to multiple knowledge sources and allowing up to 1 reflective follow-up retrieval. Medium uses a 10,000-token answer generation budget.

Adding query planning from minimal to low significantly helps, and reflective search further improves performance from low to medium. These gains are evident across all datasets, with an average gain of 36%.

 

Chart 1: Answer score for 5 different datasets across retrieval reasoning effort levels. Thick green line is the average performance.

| Dataset | KS type | Answer score (minimal) | Answer score (low) | Answer score (medium) | Minimal → medium (absolute) | Minimal → medium (relative) |
|---|---|---|---|---|---|---|
| sec-hardest | AZS | 49 | 50 | 62 | 13 | 26% |
| miml-hardest-en | AZS | 47 | 69 | 72 | 26 | 55% |
| sec-open | Web | 67 | 84 | 90 | 22 | 33% |
| hotpot | Web | 79 | 88 | 91 | 13 | 16% |
| frames | Web | 54 | 80 | 86 | 32 | 60% |
| Average | | 59 | 74 | 80 | 21 | 36% |

Table 1: Answer score for 5 different datasets across retrieval reasoning effort levels, with absolute and relative gains measured between minimal and medium. AZS knowledge sources are Azure AI Search backed indexes; the web knowledge sources use the Bing Grounding API.
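For reference, the headline figures are simple arithmetic on the average row of Table 1 (the per-dataset gains in the table were computed from unrounded scores, so recomputing them from the rounded integers can differ by a point or a percent):

```python
# Relative gain of medium over minimal, computed from the average answer scores in Table 1.
minimal_avg, medium_avg = 59, 80
absolute_gain = medium_avg - minimal_avg       # 21 points (rounded to ~20 in the introduction)
relative_gain = absolute_gain / minimal_avg    # 21 / 59 ≈ 0.356
print(f"+{absolute_gain} points, {relative_gain:.0%}")   # -> +21 points, 36%
```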

 

We also measured two of these datasets using the recently published BenchmarkQED library from Microsoft Research (MSR) [2]. We can see that the answers produced using the retrieved content from the medium reasoning effort are preferred over those from minimal across all four metrics. As would be expected from more robust retrieval, comprehensiveness is the most improved quality dimension.

| Dataset | Comprehensiveness | Diversity | Empowerment | Relevance |
|---|---|---|---|---|
| sec-hardest | 69% | 66% | 67% | 65% |
| miml-hardest | 79% | 77% | 78% | 75% |

Table 2: Pairwise preference scores based on the BenchmarkQED library for two datasets, comparing minimal to medium retrieval reasoning effort. A 79% means that the answer created at the medium level was preferred to the minimal one 79% of the time.
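To make the metric concrete, a pairwise preference score is just the win rate of one system's answers over the other's across judged pairs. A minimal sketch, ignoring ties (which BenchmarkQED may handle differently):

```python
def preference_rate(judgments):
    """Share of pairwise judgments in which the 'medium' answer was preferred.
    `judgments` is a list of strings, each either "medium" or "minimal"."""
    return sum(1 for j in judgments if j == "medium") / len(judgments)

# 79 wins for medium out of 100 judged pairs -> 0.79, i.e. the 79% in Table 2.
print(preference_rate(["medium"] * 79 + ["minimal"] * 21))
```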

Web search in knowledge bases is effective as a standalone knowledge source

The web knowledge source can federate queries to the Bing Grounding API. This is extremely useful for bringing in fresh data (e.g. content published after the model cutoff, or frequently changing content). We ran four common open-source benchmarks to evaluate the web knowledge source. On an easy test like the NQ dataset, the knowledge base was able to saturate the benchmark, and increasing the retrieval reasoning effort improved quality substantially for more challenging ones.

| Dataset | KS type | Answer score (minimal) | Answer score (low) | Answer score (medium) | Minimal → medium (absolute) | Minimal → medium (relative) |
|---|---|---|---|---|---|---|
| nq | Web | 97 | 98 | 98 | 1 | 1% |
| hotpot | Web | 79 | 88 | 91 | 13 | 16% |
| sec-open | Web | 67 | 84 | 90 | 22 | 33% |
| frames | Web | 54 | 80 | 86 | 32 | 60% |

Table 3: Answer score metrics for four web-based datasets in increasing difficulty level, each with one knowledge source.

We can observe that knowledge bases issue more queries at higher reasoning effort levels, and that easier datasets (NQ) require fewer queries on average than more challenging ones (Frames).

| Dataset | minimal | low | medium |
|---|---|---|---|
| nq | 1.0 | 2.1 | 3.1 |
| hotpot | 1.0 | 2.7 | 4.0 |
| sec-open | 1.0 | 3.1 | 4.6 |
| frames | 1.0 | 3.6 | 5.6 |

Table 4: Number of sub-queries issued at each retrieval reasoning effort across 4 web-based datasets.

Knowledge bases with a web knowledge source can fill knowledge gaps in private data

We used the sec-hardest dataset, which is organized into nine knowledge sources based on Global Industry Classification Standard (GICS) sector. For this experiment, we removed three of the sectors, which were required to answer 50% (75/150) of the queries in the dataset.

| Sector | Redaction experiment |
|---|---|
| Consumer Discretionary | Removed |
| Consumer Staples | Removed |
| Information Technology | Removed |
| Industrials | Kept |
| Health Care | Kept |
| Financials | Kept |
| Communication Services | Kept |
| Materials | Kept |

Table 5: The FinanceBench dataset was broken up into nine GICS sectors and, for the redaction experiment, we removed three of them.

We collected a baseline at low and medium retrieval reasoning effort with the complete set of nine knowledge sources. This performed very well in both settings, with answer scores of 89 and 90 points, highlighting the knowledge base's ability to piece together and route queries effectively across multiple knowledge sources.

We then ran again with the three GICS sectors removed. As expected, the answer score dropped sharply (to 53 at low and 50 at medium). We then added the web knowledge source to fill in the missing pieces; the score recovered significantly (+24 points) at low and approached the original complete-dataset score (+34 points) at medium. This demonstrates that the knowledge base was effective at using the web to fill in the gaps.

Chart 2: Illustrating the gap-filling capabilities of the web knowledge source at low and medium retrieval reasoning effort.

| Retrieval reasoning effort | Set up | Answer score |
|---|---|---|
| Low | 9 KS | 89 |
| Low | 6 KS (3 removed) | 53 |
| Low | 6 KS + Web | 77 |
| Medium | 9 KS | 90 |
| Medium | 6 KS (3 removed) | 50 |
| Medium | 6 KS + Web | 84 |

Table 5: The answer score for each step in the redaction experiment, illustrating the gap-filling capabilities of the web knowledge source at low and medium retrieval reasoning effort.

Knowledge bases can bring content from multiple knowledge sources together

We also tested the ability of knowledge bases to piece together content spread across knowledge sources by segmenting the sec-hardest dataset queries based on how many GICS sectors were required to produce a complete answer. You can see that medium reasoning effort’s higher budget for queries and parallel knowledge sources produced much better results on the queries that required multiple sources. In fact, it more than doubled the quality for 3+ source queries (+105%, 46 -> 95).

 

| Reasoning effort | Answer score (1 GICS required) | Answer score (2 GICS required) | Answer score (3+ GICS required) |
|---|---|---|---|
| low | 55 | 46 | 46 |
| medium | 89 | 83 | 95 |

Table 6: The answer score at low and medium retrieval reasoning effort, split by the number of GICS sectors required to fully answer the question.

Reflective search in knowledge bases is effective at avoiding second searches for easier queries

Reflective search in knowledge bases uses semantic classifier, a newly trained SLM, as an early-exit model to reduce LLM calls and latency on easy questions without sacrificing quality. For this experiment, we used the answer score at minimal reasoning effort as a proxy for the overall hardness of the queries. We also created different splits of the sec query set with increasing baseline difficulty (sec-hard < sec-harder < sec-hardest) and included them with the other open-source datasets.

When sorted by semantic classifier (SC) exit rate, you can see that on easier datasets the semantic classifier and the LLM tend to agree on exiting; on more challenging datasets the semantic classifier tends to be more conservative and is often overruled by the LLM; and on the hardest datasets, the two agree that another iteration would be valuable.

| Dataset | KS type | SC exit % | SC+LLM exit % | 'Minimal' answer score | Assessment |
|---|---|---|---|---|---|
| nq | Web | 89 | 89 | 97 | SC and LLM agree |
| hotpot | Web | 26 | 26 | 79 | SC and LLM agree |
| sec-open | AZS | 21 | 79 | 81 | SC defers to LLM |
| sec-hard | AZS | 13 | 77 | 83 | SC defers to LLM |
| sec-harder | AZS | 4 | 77 | 64 | SC defers to LLM |
| frames | Web | 4 | 4 | 54 | SC and LLM agree |
| sec-hardest | AZS | 3 | 3 | 59 | SC and LLM agree |

Table 7: How often semantic classifier (SC) alone, versus SC in tandem with an LLM, triggers an exit from reflective search across datasets of varying difficulty.

Knowledge bases use a powerful new classifier to consistently filter documents from across knowledge sources

Knowledge base’s medium reasoning effort uses semantic classifier, an L3 ranker, to make finer-grained decisions about the quality of documents retrieved from various knowledge sources. This model gives a normalized score that can be reliably used across content types and sub-queries for filtering out low-quality content and for final ranking, whether the output goes to the caller or to answer generation.

It’s important that this model is well calibrated. This chart shows the semantic classifier score distribution for each ground truth label. A perfect model would produce 5 point-like spikes where all the samples of a given ground truth label receive the correct score. The model, while not perfect, does a very effective job separating good content from bad. For example, purple (ground truth label 0) is the lowest quality content; semantic classifier predicts a score below 2 for the vast majority of label 0 samples. The converse is true of the best content (blue, label 4).

Chart 3: The SC score distribution for each ground truth label for a large pool of test data based on MIML.
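A minimal sketch of how such a per-label score distribution can be plotted, assuming you have pairs of ground-truth labels (0-4) and predicted classifier scores on a comparable scale; the evaluation data behind Chart 3 itself is internal.

```python
import numpy as np
import matplotlib.pyplot as plt

def plot_calibration(labels, scores, n_labels=5):
    """Overlay the predicted-score distribution for each ground-truth label.
    A well-calibrated ranker shows well-separated, roughly unimodal curves."""
    bins = np.linspace(0, n_labels - 1, 40)   # assumes scores share the 0-4 label scale
    for label in range(n_labels):
        subset = [s for l, s in zip(labels, scores) if l == label]
        if not subset:
            continue
        plt.hist(subset, bins=bins, density=True, histtype="step", label=f"ground truth {label}")
    plt.xlabel("semantic classifier score")
    plt.ylabel("density")
    plt.legend()
    plt.show()
```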

The semantic classifier model is very fast at high-quality ranking and exit decision making

We fine-tuned the semantic classifier model for two tasks: L3 ranking and the ‘exit decision’ for iterative search (reflective search). The semantic classifier model runs on the top 10 documents from each subquery, and once more on the concatenated captions of those 10 documents for the early exit decision, for 11 total inferences per sub-query. We run the semantic classifier on modern H100 GPUs in production and can achieve a very fast latency of <150ms for a batch at the largest supported chunk size (22k tokens in total). This is critical for building agentic experiences where many steps contribute to the latency of each turn.

| Document length (tokens) | Latency (ms for 11 inferences) |
|---|---|
| 2048 | 148 |
| 1024 | 81 |
| 512 | 49 |
| 256 | 42 |

Table 8: Semantic classifier latency in milliseconds for a batch of 11 chunks at various token sizes.

Knowledge bases are effective in multi-language knowledge source scenarios

Many customers have a many-to-many mapping between content and query languages. We simulated this by setting up knowledge bases from MIML using the insurance industry indexes in three languages (EN, FR, ZH) as knowledge sources, and issuing queries in their original languages.

| Dataset | KS type | Answer score (minimal) | Answer score (low) | Answer score (medium) | Minimal → medium (absolute) | Minimal → medium (relative) |
|---|---|---|---|---|---|---|
| miml-3-insurance | AZS | 82.6 | 88.1 | 91.4 | 9 | 11% |

Table 9: Answer score improvements across retrieval reasoning efforts for a set of knowledge bases with 3 knowledge sources, each in a different language.

Titles and descriptions work best when they are concise and clear

We conducted inner loop experiments to establish best practices for knowledge source names and descriptions. We used the MIML-hard-en dataset and varied these fields for each knowledge source using different strategies. We first ran studies to establish a strong name strategy, then tested various description types against both the reference name and a random string of characters. For each name and description strategy, we measured the change in the knowledge base’s ability to select the ground truth knowledge source for each query (i.e. pick the knowledge source where the necessary documents were found).

We found that:

  • Being concise is important for both fields. Adding excessive detail can cause the source selection stage to over-index on details and potentially miss the desired knowledge source. This is an area of future research.
  • You need either a strong name or a strong description: a gap in the quality of one can be compensated for by the other.
    • Use structured, descriptive naming strategies.
    • Do not use ultra-short or ambiguous names (e.g. avoid 3-letter acronyms).
    • Be careful about blurring the lines between knowledge sources in the descriptions.
  • Knowledge source name is most important: a good name with an adversarial description still scores well.

| Description strategy | Concise name | Scrambled name |
|---|---|---|
| Grounded content summaries | 0% | -5% |
| Plain/Natural summaries | -5% | -9% |
| Technical/Formal | -4% | -9% |
| Structural/Lists/Taxonomy | -2% | -7% |
| Search/Intent phrasing | -8% | -19% |
| Label/Tag/Legacy/Project | -3% | -13% |
| Creative/Metaphor/Analogy | -4% | -13% |
| Regional/Temporal | -4% | -13% |
| Semantic overlap/Cross-domain | -8% | -43% |
| Adversarial | -4% | -25% |
| Minimalist | -4% | -7% |
Table 10: Change in recall from the best configuration (concise name, grounded summarized description) for the source selection stage in inner loop experiments. The MIML-hard-en dataset was used.

As examples, here are the knowledge source names we used for “concise name”.

'accounting_tax_services', 'banking', 'government_administration', 'healthcare_administration', 'healthcare_research', 'human_resources', 'industrial_and_manufacturing', 'insurance', 'legal', 'product_manuals'
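Pulling the guidance together, a registration along the following lines is what we would aim for when naming and describing a knowledge source. The field names and description text are illustrative assumptions, not the actual API or the descriptions used in these experiments.

```python
# Illustrative only: field names and descriptions are assumptions, not the actual API contract.
good_knowledge_source = {
    "name": "healthcare_administration",   # concise, structured, unambiguous
    "description": "Hospital administration policies, patient intake procedures and billing guidelines.",
}

risky_knowledge_source = {
    "name": "hca",                          # ultra-short acronym: easy to confuse during source selection
    "description": "Assorted documents about various healthcare and administrative topics.",  # vague, blurs lines with other sources
}
```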

Getting started with knowledge bases

Read the Foundry IQ announcement: https://aka.ms/FoundryIQ

Try Foundry IQ in the Foundry portal today: https://ai.azure.com/nextgen

Learn more in the Azure AI Search documentation: https://learn.microsoft.com/en-us/azure/search/

Resources

[1] Up to 40% better relevance for complex queries with new agentic retrieval engine

[2] BenchmarkQED: Automated benchmarking of RAG systems
