Microsoft Developer Community Blog

Building an Enterprise HR Chatbot with Multi-Strategy RAG and Live Agent Handoff on Azure

risthakur1213 (Microsoft)
Apr 19, 2026

HR teams deal with thousands of employee questions every day — policy lookups, leave balances, case escalations, and sensitive topics like harassment or misconduct. AI chatbots can handle the routine stuff and free up HR advisors for the hard cases. But most chatbot projects get stuck at basic Q&A. They can't handle multi-country policies, employee slang, or smooth handoffs to a real person.

This post covers how we built Eva, a production HR chatbot using Microsoft Bot Framework and Semantic Kernel on Azure. I'll focus on three problems and how we solved them:

  1. Getting accurate answers when employees and policy documents use different words
  2. Handing off to a live human advisor in real time
  3. Catching answer quality regressions automatically

Why basic RAG isn't enough for HR

Retrieval-Augmented Generation (RAG) — fetching relevant documents and feeding them to an LLM — is the standard approach. But plain RAG breaks down in HR for a few reasons:

  • Vocabulary mismatch. An employee asks "How does misconduct affect my ACB?" but the policy document says "Annual Cash Bonus eligibility criteria." The search doesn't connect the two.
  • Multi-country ambiguity. The same question can have different answers depending on the employee's country, grade, or role.
  • Sensitive topics. Questions about harassment, disability, or whistleblowing should go to a human, not get an AI-generated answer.
  • Ranking noise. Search results often include globally relevant but locally irrelevant documents.

Eva handles these with a layered pipeline: query augmentation → multi-index search → LLM reranking → answer generation → citation handling.
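As an illustrative sketch of how those stages chain together (function names and signatures here are assumptions, not Eva's actual API), each stage is a stub standing in for the component described in the sections below:

```python
# Minimal sketch of the layered pipeline; each stage is a stub for
# the real component described later in this post.

def augment_query(question: str) -> str:
    # HyDE / step-back / rewrite would go here.
    return question

def search_indexes(query: str) -> list[str]:
    # Country-specific + global hybrid search would go here.
    return [f"doc for: {query}"]

def rerank(docs: list[str]) -> list[str]:
    # Optional LLM reranking with a local-content bias.
    return docs

def generate_answer(question: str, docs: list[str]) -> dict:
    # Structured answer generation with citations.
    return {"answer": f"Answer to: {question}", "citations": docs}

def answer_pipeline(question: str) -> dict:
    query = augment_query(question)
    docs = rerank(search_indexes(query))
    return generate_answer(question, docs)
```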

Architecture at a glance

Layer               Technology
------------------  -------------------------------------------------------
Bot framework       Microsoft Agents SDK (aiohttp)
LLM orchestration   Semantic Kernel
Primary LLM         Azure OpenAI Service (GPT-4.1 / GPT-4o)
Knowledge search    Azure AI Search (hybrid + vector)
Live agent chat     Salesforce MIAW via server-sent events
Evaluation          Azure AI Evaluation SDK + custom LLM judge
Config              Pydantic-settings + Azure App Configuration + Key Vault

Four retrieval strategies, controlled by feature flags

Instead of one search approach, Eva supports four — toggled by feature flags so we can A/B test per country without code changes. They run in a priority cascade:

  1. HyDE (Hypothetical Document Embeddings)
    Instead of searching with the employee's question, the LLM first generates a hypothetical policy document that would answer it. We embed that synthetic document and use it as the search query. Since a hypothetical answer is closer in embedding space to the real answer than the original question is, this bridges vocabulary gaps effectively.
  2. Step-back prompting
    The LLM broadens the question. "How does misconduct affect my ACB?" becomes "What is the Annual Cash Bonus policy and what factors affect eligibility?" This works well when answers live in broader policy sections.
  3. Query rewrite
    The LLM expands abbreviations and adds HR domain context, then runs a hybrid (text + vector) search.
  4. Standard search (fallback)
    Basic intent classification with hybrid search. No augmentation.
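As a concrete sketch, HyDE fits in a few lines around any embedding search. Here `llm`, `embed`, and `vector_search` are stand-ins for the real service calls, and the prompt wording is illustrative:

```python
def hyde_search(question, llm, embed, vector_search):
    """HyDE sketch: search with the embedding of a hypothetical answer
    document rather than the question itself."""
    prompt = ("Write a short passage from an HR policy document that "
              f"would answer this question:\n{question}")
    hypothetical_doc = llm(prompt)
    # The synthetic passage, not the question, drives retrieval.
    return vector_search(embed(hypothetical_doc))
```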

All four strategies return the same Pydantic model, so the rest of the pipeline doesn't care which one ran. The team can enable HyDE globally, roll out step-back to specific countries, or revert instantly if something underperforms.
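A minimal sketch of the cascade might look like the following. The flag names and stub strategies are assumptions, and the shared result is modeled here with a dataclass for self-containment (the production system uses a Pydantic model):

```python
from dataclasses import dataclass, field

@dataclass
class RetrievalResult:
    """Shared result shape returned by every strategy."""
    strategy: str
    documents: list[str] = field(default_factory=list)

def hyde_search(q: str) -> RetrievalResult:
    return RetrievalResult("hyde", [f"hyde docs for {q}"])

def step_back_search(q: str) -> RetrievalResult:
    return RetrievalResult("step_back", [f"broadened docs for {q}"])

def rewrite_search(q: str) -> RetrievalResult:
    return RetrievalResult("query_rewrite", [f"expanded docs for {q}"])

def standard_search(q: str) -> RetrievalResult:
    return RetrievalResult("standard", [f"hybrid docs for {q}"])

# Priority cascade: the first enabled strategy wins; standard search is
# the always-on fallback. Flags would come from per-country configuration.
CASCADE = [
    ("hyde", hyde_search),
    ("step_back", step_back_search),
    ("query_rewrite", rewrite_search),
]

def retrieve(question: str, flags: dict[str, bool]) -> RetrievalResult:
    for name, strategy in CASCADE:
        if flags.get(name, False):
            return strategy(question)
    return standard_search(question)
```

Because every branch returns the same typed result, downstream stages stay strategy-agnostic, which is what makes per-country A/B toggles safe.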

LLM reranking

After pulling results from both a country-specific index and a global index, Eva optionally reranks them using a RankGPT-style approach — the LLM scores document relevance with a bias toward local content. If reranking fails for any reason, it falls back to the original ordering so the pipeline keeps moving.
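The fallback behavior is the important part, and it reduces to a guarded sort. In this sketch `score_fn` stands in for the LLM relevance scorer, and the local-content bias is a simple additive boost (the real scoring is more involved):

```python
def rerank_with_fallback(docs, score_fn, local_boost=0.2):
    """RankGPT-style rerank sketch: score each document, bias local
    content upward, and fall back to the original ordering if
    scoring fails for any reason."""
    try:
        scored = [
            (score_fn(d["text"]) + (local_boost if d.get("local") else 0.0), d)
            for d in docs
        ]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return [d for _, d in scored]
    except Exception:
        # Any scoring failure: keep the pipeline moving with the
        # original search ordering.
        return docs
```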

Answer generation with local vs. global context

The answer stage separates retrieved documents into local context (country-specific) and global context (company-wide), injected as distinct prompt sections. The LLM returns a structured response with reasoning, the actual answer, citations, and a coverage classification (full, partial, or denial).
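A structured response like this is a natural fit for a Pydantic model. The field names below are a hypothetical shape, not Eva's actual schema, but they show how a malformed LLM output fails at the validation boundary:

```python
from typing import Literal
from pydantic import BaseModel, ValidationError

class Citation(BaseModel):
    title: str
    document_id: str

class EvaAnswer(BaseModel):
    # Hypothetical shape of the structured response described above.
    reasoning: str
    answer: str
    citations: list[Citation]
    coverage: Literal["full", "partial", "denial"]

# An out-of-vocabulary coverage value is rejected at validation time:
raw = {"reasoning": "...", "answer": "...", "citations": [],
       "coverage": "complete"}
try:
    EvaAnswer(**raw)
except ValidationError:
    pass  # caught here instead of propagating downstream
```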

Prompts are stored as version-controlled .txt files with per-model variants (e.g., gpt-4o.txt, gpt-4.1.txt), resolved at runtime. This makes prompts reviewable in PRs and deployable without code changes.

Live agent handoff with Salesforce

When Eva determines a question needs a human — sensitive topic, complex case, or the employee simply asks — it hands off to a Salesforce advisor in real time.

  • SSE streaming. Eva keeps a persistent HTTP connection to Salesforce for real-time messages, typing indicators, and session end signals.
  • Session resilience. Session state persists across three layers — in-memory cache, Azure Cosmos DB, and Bot Framework turn state — to survive restarts and failovers.
  • Message delivery workers. Each session has a dedicated async worker with exponential backoff retry. Overflow messages go to a failed messages list rather than being silently dropped.
  • Queue position updates. While employees wait, Eva queries Salesforce for queue position and sends rate-limited updates.
  • Context handoff. On session start, Eva sends the full conversation transcript so advisors don't ask employees to repeat themselves.
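The delivery-worker pattern is worth sketching: one async task per session, exponential backoff per message, and a failed-messages list as the terminal state. The sentinel, backoff constants, and `send` callable are illustrative stand-ins for the real Salesforce MIAW client:

```python
import asyncio

async def delivery_worker(queue, send, failed, max_attempts=4):
    """Per-session delivery worker: retry each message with exponential
    backoff; exhausted messages land on `failed` rather than being
    silently dropped."""
    while True:
        message = await queue.get()
        if message is None:          # sentinel: session ended
            break
        for attempt in range(max_attempts):
            try:
                await send(message)
                break
            except Exception:
                # Backoff (delays shortened here for demonstration).
                await asyncio.sleep(min(2 ** attempt * 0.01, 1.0))
        else:
            failed.append(message)   # retries exhausted, keep the message
        queue.task_done()
```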

Automated evaluation

Eva includes an evaluation framework that runs as a separate process, testing against ground-truth Q&A pairs from CSV files.

Factual questions are scored using Azure AI's SimilarityEvaluator on a 1–5 scale, with optional relevance and groundedness checks.

Sensitive questions (harassment, disability, whistleblowing) use a custom LLM judge that checks whether the response acknowledges sensitivity and directs the employee to create a case or speak with an advisor.

A deviation detector flags score drops between runs. SQLite stores results for trending, and Application Insights powers dashboards. Long evaluation runs support resume — the framework skips already-completed test cases on restart.
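With results in SQLite, the deviation check reduces to a self-join between two runs. The `results` schema and threshold below are assumptions for illustration:

```python
import sqlite3

def detect_deviations(conn, run_a, run_b, threshold=0.5):
    """Flag test cases whose score dropped by more than `threshold`
    between evaluation runs run_a (before) and run_b (after)."""
    rows = conn.execute(
        """SELECT a.case_id, a.score, b.score
           FROM results a JOIN results b ON a.case_id = b.case_id
           WHERE a.run_id = ? AND b.run_id = ?
             AND a.score - b.score > ?""",
        (run_a, run_b, threshold),
    ).fetchall()
    return [{"case": c, "before": before, "after": after}
            for c, before, after in rows]
```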

Key takeaways

  • Make retrieval strategies swappable. Feature flags let you A/B test without redeploying.
  • Separate local and global knowledge explicitly. Don't rely on the LLM to figure out which country's policy applies.
  • Invest in evaluation early. Ground-truth datasets with factual and behavioral scoring catch regressions that manual testing misses.
  • Build resilience into live agent handoff. Multi-tier session recovery and retry logic prevent dropped conversations.
  • Treat prompts as code. File-based, model-variant-aware prompts are easier to maintain than inline strings.
  • Use Pydantic for structured LLM outputs. Typed models catch bad output at the validation boundary instead of letting it propagate.

Get started

Updated Apr 01, 2026
Version 1.0