Azure AI Foundry Blog

The Future of AI: Evaluating and optimizing custom RAG agents using Azure AI Foundry

Chang_Liu
Microsoft
Sep 19, 2025

The Future of AI blog series is an evolving collection of posts from the AI Futures team in collaboration with subject matter experts across Microsoft. In this series, we explore tools and technologies that will drive the next generation of AI. Explore more at: Collections | Microsoft Learn.

 

AI agents can be powerful productivity enablers. They can understand business context, plan, make decisions, execute actions, and interact with human stakeholders or other agents to create complex workflows for business needs. 

For example, AI agents can perform retrieval-augmented generation (RAG) based on enterprise documents to ground their responses for relevance. However, the “black-box” nature of agents presents significant challenges for developers who need to create and manage them effectively. Developers require tools to assess the quality of an agent’s workflow, and giving them observability into RAG quality is essential for building trustworthy AI agents.

At a high-level, a RAG system aims to generate the most relevant answer consistent with grounding documents in response to a user's query. When a query is submitted, the system retrieves relevant content from a corpus and uses that context to generate an informed response.
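To make that flow concrete, here is a minimal, illustrative sketch of a plain RAG loop: retrieve top chunks from an Azure AI Search index, then ask an Azure OpenAI chat model to answer from that context. The endpoint, index name, deployment name, and the `content` field are placeholders you would replace with your own.

```python
# pip install azure-search-documents openai azure-identity
import os
from azure.core.credentials import AzureKeyCredential
from azure.search.documents import SearchClient
from openai import AzureOpenAI

# Placeholders: point these at your own search service, index, and model deployment.
search_client = SearchClient(
    endpoint="https://<your-search-service>.search.windows.net",
    index_name="<your-index>",
    credential=AzureKeyCredential(os.environ["SEARCH_API_KEY"]),
)
aoai = AzureOpenAI(
    azure_endpoint="https://<your-aoai-resource>.openai.azure.com",
    api_key=os.environ["AOAI_API_KEY"],
    api_version="2024-06-01",
)

def answer(query: str) -> str:
    # 1) Retrieve: pull the top chunks that match the query.
    docs = search_client.search(search_text=query, top=3)
    context = "\n\n".join(d["content"] for d in docs)  # assumes a 'content' field in the index

    # 2) Generate: ask the model to answer using only the retrieved context.
    response = aoai.chat.completions.create(
        model="<your-chat-deployment>",
        messages=[
            {"role": "system", "content": "Answer using only the provided context."},
            {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {query}"},
        ],
    )
    return response.choices[0].message.content

print(answer("What is the standard warranty period for product X?"))
```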

 

To support RAG quality evaluation, it’s important to evaluate the following aspects using the RAG triad metrics (a minimal code sketch follows the list):

Figure: A typical RAG pattern in which quality can be evaluated with three distinct metrics: Retrieval, Groundedness, and Relevance.
  • Retrieval: Is the search output relevant and useful for resolving the user's query? Strong retrieval is critical for providing accurate context.
  • Groundedness: Is the generated response supported by the retrieved documents (e.g., output of a search tool)? The consistency of the response generated with respect to the grounding sources is important.
  • Relevance: After agentic retrieval and generation, does the response fully address the user’s query? This is key to delivering a satisfying experience for the end user.
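Here is a minimal sketch of running these three checks with the azure-ai-evaluation SDK. The judge model configuration, query, context, and response below are placeholders for illustration.

```python
# pip install azure-ai-evaluation
import os
from azure.ai.evaluation import RetrievalEvaluator, GroundednessEvaluator, RelevanceEvaluator

# Azure OpenAI judge model configuration (placeholder values).
model_config = {
    "azure_endpoint": os.environ["AZURE_OPENAI_ENDPOINT"],
    "api_key": os.environ["AZURE_OPENAI_API_KEY"],
    "azure_deployment": "<your-judge-deployment>",
}

query = "What is the standard warranty period for product X?"
context = "Product X ships with a 24-month limited warranty covering manufacturing defects."
response = "Product X comes with a 24-month limited warranty."

# Retrieval: is the retrieved context relevant and useful for the query?
retrieval = RetrievalEvaluator(model_config)(query=query, context=context)
# Groundedness: is the generated response supported by the retrieved context?
groundedness = GroundednessEvaluator(model_config)(query=query, context=context, response=response)
# Relevance: does the response address the user's query?
relevance = RelevanceEvaluator(model_config)(query=query, response=response)

print(retrieval, groundedness, relevance)
```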

Through this blog, you will learn two key best practices for evaluating and optimizing the quality of your custom retrieval-augmented generation (RAG) agent before deployment. The TL;DR:

  1. Evaluate and optimize the end-to-end response of your RAG agent using reference-free RAG triad evaluators, focusing on the Groundedness and Relevance evaluators.
  2. Optimize search parameters for advanced scenarios that require ground-truth data and precise retrieval quality by applying golden metrics such as XDCG and max relevance with the Document Retrieval evaluator.

Best Practice 1: Evaluate your RAG App

Complex queries are a common scenario for RAG-powered agents. In both principle and practice, agentic RAG is a more advanced pattern than traditional RAG for agentic scenarios. By using the Agentic Retrieval API (Public Preview) in Azure AI Search in Azure AI Foundry, we observe up to 40% better relevance for complex queries than our baselines. This video walks through the first best practice: use agentic retrieval, then evaluate and optimize the end-to-end quality of the retrieval parameters using the Groundedness and Relevance evaluators:

What Can Agentic Retrieval Do?

Agentic retrieval engines, like Azure AI Search in Azure AI Foundry, are designed to extract grounding information from your knowledge sources. Using conversation history and retrieval parameters, the agent performs the following steps:

  1. Analyzes the entire conversation to understand the user’s information need.
  2. Breaks down complex queries into smaller, focused subqueries.
  3. Executes subqueries concurrently across the configured knowledge sources.
  4. Applies semantic ranking to re-rank and filter retrieved results.
  5. Merges and synthesizes top results into a unified output.

Result synthesis supports more than returning search engine results; it also supports end-to-end question answering. Configured as an “answer synthesis” knowledge agent, the retrieval pipeline can handle complex, contextual queries within a conversation.
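For orientation, here is a minimal retrieval call sketched against the public-preview agentic retrieval SDK (the preview azure-search-documents package). This is a sketch only: the class and property names follow the public preview quickstart and may change between preview versions, and the endpoint, agent name, and index name are placeholders.

```python
# pip install --pre azure-search-documents azure-identity
# NOTE: agentic retrieval is in public preview; these names follow the preview
# quickstart and may change between preview SDK versions.
from azure.identity import DefaultAzureCredential
from azure.search.documents.agent import KnowledgeAgentRetrievalClient
from azure.search.documents.agent.models import (
    KnowledgeAgentRetrievalRequest,
    KnowledgeAgentMessage,
    KnowledgeAgentMessageTextContent,
    KnowledgeAgentIndexParams,
)

client = KnowledgeAgentRetrievalClient(
    endpoint="https://<your-search-service>.search.windows.net",  # placeholder
    agent_name="<your-knowledge-agent>",                          # placeholder
    credential=DefaultAzureCredential(),
)

result = client.retrieve(
    retrieval_request=KnowledgeAgentRetrievalRequest(
        messages=[
            KnowledgeAgentMessage(
                role="user",
                content=[KnowledgeAgentMessageTextContent(
                    text="Compare the warranty terms of product X and product Y."
                )],
            )
        ],
        target_index_params=[
            # The re-ranker threshold is one of the parameters worth sweeping (see below).
            KnowledgeAgentIndexParams(index_name="<your-index>", reranker_threshold=2.5)
        ],
    )
)

print(result.response[0].content[0].text)  # merged, synthesized grounding output
print(result.activity)                     # subquery plan executed by the agent
print(result.references)                   # source documents for building citations
```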

How to Evaluate and Optimize Agentic Retrieval?

After you have onboarded the agentic retrieval pipeline into your agentic workflow, the best practice is to measure its end-to-end response quality and fine-tune its parameters. This practice is also known as a "parameter sweep". To fine-tune these parameters, AI development teams should evaluate the quality of the parameters of interest using the end-to-end RAG evaluators for groundedness and relevance.

Common parameters to fine-tune include re-ranker thresholds, the target index, and knowledge source parameters. These parameters influence how aggressively the agent re-ranks results and which sources it queries. Teams should inspect the activity and references to validate grounding and build traceable citations.

Teams can use the Azure AI Foundry portal to visualize batch evaluation results and assess the answer quality of their knowledge agents. These evaluation results provide clear pass/fail indicators along with supporting reasoning for each response. After evaluating one set of parameters for the knowledge agent, simply repeat the exercise with another set of parameters of interest, such as adjusting the re-ranker threshold. By performing A/B testing across different parameter sets, development teams can optimize the knowledge agent for their enterprise data.
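For example, a simple sweep can be run as one batch evaluation per parameter setting using the azure-ai-evaluation `evaluate()` API. This sketch assumes you have already exported one JSONL file of (query, context, response) rows per re-ranker threshold; the file names, deployment, and project details are placeholders.

```python
# pip install azure-ai-evaluation
import os
from azure.ai.evaluation import evaluate, GroundednessEvaluator, RelevanceEvaluator

model_config = {
    "azure_endpoint": os.environ["AZURE_OPENAI_ENDPOINT"],
    "api_key": os.environ["AZURE_OPENAI_API_KEY"],
    "azure_deployment": "<your-judge-deployment>",
}

azure_ai_project = {  # placeholder Azure AI Foundry project used to log the runs
    "subscription_id": "<subscription-id>",
    "resource_group_name": "<resource-group>",
    "project_name": "<foundry-project>",
}

# One JSONL file of agent responses per parameter setting, e.g.
# {"query": ..., "context": ..., "response": ...} on each line.
runs = {
    "reranker-threshold-2.0": "responses_rr20.jsonl",
    "reranker-threshold-2.5": "responses_rr25.jsonl",
}

for name, data_file in runs.items():
    evaluate(
        data=data_file,
        evaluators={
            "groundedness": GroundednessEvaluator(model_config),
            "relevance": RelevanceEvaluator(model_config),
        },
        azure_ai_project=azure_ai_project,
        evaluation_name=name,  # makes the runs easy to compare in the portal
    )
```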

For a complete walkthrough, check out this end-to-end example notebook: https://aka.ms/knowledge-agent-eval-sample.

Best Practice 2: Optimize your RAG Search Parameters

Document retrieval quality is a common bottleneck in RAG workflows. To address this, one best practice is to optimize your RAG search parameters according to your enterprise data. For advanced scenarios where you can curate ground-truth relevance labels for document retrieval results (commonly called qrels), it’s a best practice to "sweep" and optimize the parameters by evaluating document retrieval quality using golden metrics such as XDCG and max relevance. This video covers the second best practice: curating ground truth for retrieval quality measurement and optimizing search parameters in advanced scenarios:

What are the Document Retrieval Metrics?

The Document Retrieval evaluator includes the following golden information-retrieval metrics, which specifically target retrieval quality measurement in a RAG system (a toy NDCG computation follows the table):

| Metric | Higher is better | Description |
| --- | --- | --- |
| Fidelity | Yes | Measures how well the top n retrieved chunks reflect the content for a given query; calculated as the number of good documents returned out of the total known good documents in the dataset. |
| NDCG | Yes | Evaluates how close the ranking is to an ideal order in which all relevant items appear at the top of the list. |
| XDCG | Yes | Assesses the quality of results within the top-k documents, regardless of how other documents in the index are scored. |
| Max Relevance N | Yes | Captures the maximum relevance score in the top-k chunks. |
| Holes | No | Counts the number of documents missing query relevance judgments (ground truth). |
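To make the ranking metrics concrete, here is a toy NDCG@3 computation using the standard exponential-gain formulation; the evaluator may use a slightly different variant, and the relevance labels below are made up.

```python
import math

def dcg(labels):
    # Standard exponential-gain DCG: sum((2^rel - 1) / log2(rank + 1)).
    return sum((2 ** rel - 1) / math.log2(i + 2) for i, rel in enumerate(labels))

# Ground-truth relevance labels (0-4) of the top-3 retrieved documents, in ranked order.
retrieved = [3, 0, 2]
# Ideal ordering of the same labels: most relevant first.
ideal = sorted(retrieved, reverse=True)

ndcg_at_3 = dcg(retrieved) / dcg(ideal)
print(f"ndcg@3 = {ndcg_at_3:.3f}")  # reaches 1.0 only when the ranking is already ideal
```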

Using Golden Metrics for Parameter Optimization

With these golden metrics in the Document Retrieval evaluator, you can measure retrieval quality more precisely and turbo-charge the parameter sweep scenario for any search engine that returns relevance scores.

For illustration purposes, we will use the Azure AI Search API as the search engine, but the same approach applies to other solutions, such as agentic retrieval in Azure AI Search and LlamaIndex.

  1. Prepare test queries and generate retrieval results with relevance scores using a retrieval engine that returns them, such as the Azure AI Search API (https://learn.microsoft.com/en-us/azure/ai-foundry/agents/how-to/tools/azure-ai-search?tabs=azurecli), agentic retrieval in Azure AI Search, or LlamaIndex.
  2. Label relevance for each result as ground truths:
    • Human judgment: Typically performed by a subject matter expert.
    • LLM-based judgment: Use an AI-assisted evaluator as an alternative. For example, you can reuse the Relevance evaluator mentioned earlier to score each text chunk, as sketched after this list.
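Here is a minimal sketch of the LLM-based option, reusing the Relevance evaluator to score each retrieved chunk against the query by treating the chunk text as the response to judge. The model configuration, query, and chunks are placeholders, and the mapping from the evaluator's 1–5 score to a 0–4 qrels scale is an assumption you can adapt.

```python
# pip install azure-ai-evaluation
import os
from azure.ai.evaluation import RelevanceEvaluator

model_config = {
    "azure_endpoint": os.environ["AZURE_OPENAI_ENDPOINT"],
    "api_key": os.environ["AZURE_OPENAI_API_KEY"],
    "azure_deployment": "<your-judge-deployment>",
}
relevance = RelevanceEvaluator(model_config)

query = "What is the standard warranty period for product X?"
retrieved_chunks = {  # document_id -> chunk text returned by your search engine
    "doc-001": "Product X ships with a 24-month limited warranty covering manufacturing defects.",
    "doc-002": "Our cafeteria is open from 8am to 6pm on weekdays.",
}

retrieval_ground_truth = []
for doc_id, chunk in retrieved_chunks.items():
    score = relevance(query=query, response=chunk)["relevance"]  # 1-5 Likert-style score
    # Map 1-5 to a 0-4 qrels label; choose whatever mapping fits your labeling scheme.
    retrieval_ground_truth.append({"document_id": doc_id, "query_relevance_label": int(score) - 1})

print(retrieval_ground_truth)
```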

By combining curated ground truth with automated evaluation, you can systematically sweep parameters (e.g., top-k values in search algorithms, chunk size, and overlap size used when creating the index) and identify the configuration that delivers the best retrieval quality for your enterprise data. As mentioned, other search engines expose their own parameter settings (as long as they return relevance scores), and your team should fine-tune them according to your enterprise dataset.
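For a single query, the Document Retrieval evaluator can then score the engine's output against your qrels. This sketch assumes the input shape described in the azure-ai-evaluation documentation (qrels entries with a `query_relevance_label` and retrieved documents with the engine's `relevance_score`); all IDs, labels, and scores below are made up.

```python
# pip install azure-ai-evaluation
from azure.ai.evaluation import DocumentRetrievalEvaluator

# Ground-truth relevance judgments (qrels) for this query, on a 0-4 scale.
retrieval_ground_truth = [
    {"document_id": "doc-001", "query_relevance_label": 4},
    {"document_id": "doc-002", "query_relevance_label": 2},
    {"document_id": "doc-003", "query_relevance_label": 0},
]

# What the search engine actually returned for the query, with its relevance scores.
retrieved_documents = [
    {"document_id": "doc-002", "relevance_score": 45.1},
    {"document_id": "doc-001", "relevance_score": 35.8},
    {"document_id": "doc-004", "relevance_score": 20.3},  # no judgment -> counts toward "holes"
]

doc_retrieval = DocumentRetrievalEvaluator(ground_truth_label_min=0, ground_truth_label_max=4)
result = doc_retrieval(
    retrieval_ground_truth=retrieval_ground_truth,
    retrieved_documents=retrieved_documents,
)
print(result)  # includes fidelity, NDCG, XDCG, max relevance, holes, ...
```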

After you have curated the ground truth, you can submit multiple evaluation runs on Azure AI Foundry, each based on a different search parameter setting (a sketch of submitting such runs appears after the steps below). For illustration purposes, we evaluated four different search algorithms (text search, vector search, semantic search, and hybrid search) for the Azure AI Search API on a demo dataset using the Document Retrieval evaluator. You can then use the document retrieval metric visualizations in Azure AI Foundry Observability to find the optimal search parameter as follows:

  1. Select the evaluation runs corresponding to multiple retrieval parameters for comparison in Azure AI Foundry and select "Compare":
  2. View the tabular results for all evaluation runs:
  3. Find the best parameter in the charts for each metric: as an illustration, the `xdcg@3` chart below suggests hybrid search is the best RAG parameter on this particular dataset.
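Here is a sketch of submitting those runs, assuming you have exported one JSONL file per search algorithm where each line holds the `retrieval_ground_truth` and `retrieved_documents` columns for a single query; the file names and project details are placeholders.

```python
# pip install azure-ai-evaluation
from azure.ai.evaluation import evaluate, DocumentRetrievalEvaluator

azure_ai_project = {  # placeholder Azure AI Foundry project for logging the runs
    "subscription_id": "<subscription-id>",
    "resource_group_name": "<resource-group>",
    "project_name": "<foundry-project>",
}

doc_retrieval = DocumentRetrievalEvaluator(ground_truth_label_min=0, ground_truth_label_max=4)

# One results file per search configuration; each JSONL line holds
# {"retrieval_ground_truth": [...], "retrieved_documents": [...]} for a single query.
runs = {
    "text-search": "results_text.jsonl",
    "vector-search": "results_vector.jsonl",
    "semantic-search": "results_semantic.jsonl",
    "hybrid-search": "results_hybrid.jsonl",
}

for name, data_file in runs.items():
    evaluate(
        data=data_file,
        evaluators={"document_retrieval": doc_retrieval},
        azure_ai_project=azure_ai_project,
        evaluation_name=name,  # select these runs and "Compare" in the portal
    )
```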

Applying the Optimal Parameters

Once you’ve identified the optimal parameters for your retrieval engine, you can confidently integrate them into your RAG agent, fine-tuned to your enterprise data. Keep in mind that this evaluator works with any search engine that returns relevance ranking scores, including agentic retrieval in Azure AI Search and LlamaIndex.

For a complete end-to-end example with Azure AI Search, check out this notebook: https://aka.ms/doc-retrieval-sample.

Read our recent blogs on this topic:

Now it's your turn to create with Azure AI Foundry

  1. Get started with Azure AI Foundry, and jump directly into Visual Studio Code
  2. Download the Azure AI Foundry SDK
  3. Join the Azure AI Foundry Learn Challenge
  4. Review the Azure AI Foundry documentation
  5. Keep the conversation going in GitHub and Discord