HR teams deal with thousands of employee questions every day — policy lookups, leave balances, case escalations, and sensitive topics like harassment or misconduct. AI chatbots can handle the routine stuff and free up HR advisors for the hard cases. But most chatbot projects get stuck at basic Q&A. They can't handle multi-country policies, employee slang, or smooth handoffs to a real person.
This post covers how we built Eva, a production HR chatbot using Microsoft Bot Framework and Semantic Kernel on Azure. I'll focus on three problems and how we solved them:
- Getting accurate answers when employees and policy documents use different words
- Handing off to a live human advisor in real time
- Catching answer quality regressions automatically
## Why basic RAG isn't enough for HR
Retrieval-Augmented Generation (RAG) — fetching relevant documents and feeding them to an LLM — is the standard approach. But plain RAG breaks down in HR for a few reasons:
- Vocabulary mismatch. An employee asks "How does misconduct affect my ACB?" but the policy document says "Annual Cash Bonus eligibility criteria." The search doesn't connect the two.
- Multi-country ambiguity. The same question can have different answers depending on the employee's country, grade, or role.
- Sensitive topics. Questions about harassment, disability, or whistleblowing should go to a human, not get an AI-generated answer.
- Ranking noise. Search results often include globally relevant but locally irrelevant documents.
Eva handles these with a layered pipeline: query augmentation → multi-index search → LLM reranking → answer generation → citation handling.
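The shape of that pipeline can be sketched as a sequence of stages that each enrich a shared context object. Everything here is illustrative (the stage functions and `PipelineContext` fields are stand-ins, not Eva's actual code), but it shows the structural idea: each stage takes the context, adds its contribution, and passes it on.

```python
from dataclasses import dataclass, field

@dataclass
class PipelineContext:
    """State carried between pipeline stages (fields are illustrative)."""
    query: str
    augmented_query: str = ""
    documents: list = field(default_factory=list)
    answer: str = ""
    citations: list = field(default_factory=list)

def augment_query(ctx):
    # Stand-in for HyDE / step-back / query rewrite
    ctx.augmented_query = f"[expanded] {ctx.query}"
    return ctx

def search_indexes(ctx):
    # Stand-in for the country-specific + global multi-index search
    ctx.documents = [f"doc matching: {ctx.augmented_query}"]
    return ctx

def rerank(ctx):
    # Stand-in for LLM reranking; identity here
    return ctx

def generate_answer(ctx):
    ctx.answer = f"Answer grounded in {len(ctx.documents)} document(s)"
    ctx.citations = ctx.documents[:]
    return ctx

STAGES = [augment_query, search_indexes, rerank, generate_answer]

def run_pipeline(query):
    ctx = PipelineContext(query=query)
    for stage in STAGES:
        ctx = stage(ctx)
    return ctx
```

Keeping each stage as a plain function over a shared context makes individual stages easy to swap or feature-flag, which is exactly what the retrieval section below relies on.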
## Architecture at a glance
| Layer | Technology |
|---|---|
| Bot framework | Microsoft Agents SDK (aiohttp) |
| LLM orchestration | Semantic Kernel |
| Primary LLM | Azure OpenAI Service (GPT-4.1 / GPT-4o) |
| Knowledge search | Azure AI Search (hybrid + vector) |
| Live agent chat | Salesforce MIAW via server-sent events |
| Evaluation | Azure AI Evaluation SDK + custom LLM judge |
| Config | Pydantic-settings + Azure App Configuration + Key Vault |
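The config row is worth a word: layering pydantic-settings over Azure App Configuration and Key Vault means each setting is resolved from the highest-priority source that defines it. The real stack uses those libraries; this stdlib-only sketch (source dicts and setting names are made up) shows just the precedence idea.

```python
def resolve_setting(name, sources, default=None):
    """Return the first non-None value across ordered config sources.

    In Eva the sources are pydantic-settings (env vars), Azure App
    Configuration, and Key Vault; plain dicts stand in for them here.
    """
    for source in sources:
        value = source.get(name)
        if value is not None:
            return value
    return default

# Hypothetical sources: local env overrides shared App Configuration.
env = {"SEARCH_ENDPOINT": "https://local.example"}
app_config = {"SEARCH_ENDPOINT": "https://shared.example", "MODEL": "gpt-4o"}

endpoint = resolve_setting("SEARCH_ENDPOINT", [env, app_config])
model = resolve_setting("MODEL", [env, app_config])
timeout = resolve_setting("TIMEOUT", [env, app_config], default=30)
```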
## Four retrieval strategies, controlled by feature flags
Instead of one search approach, Eva supports four — toggled by feature flags so we can A/B test per country without code changes. They run in a priority cascade:
1. **HyDE (Hypothetical Document Embeddings).** Instead of searching with the employee's question, the LLM first generates a hypothetical policy document that would answer it. We embed that synthetic document and use it as the search query. Since a hypothetical answer is closer in embedding space to the real answer than the original question is, this bridges vocabulary gaps effectively.
2. **Step-back prompting.** The LLM broadens the question. "How does misconduct affect my ACB?" becomes "What is the Annual Cash Bonus policy and what factors affect eligibility?" This works well when answers live in broader policy sections.
3. **Query rewrite.** The LLM expands abbreviations and adds HR domain context, then runs a hybrid (text + vector) search.
4. **Standard search (fallback).** Basic intent classification with hybrid search. No augmentation.
All four strategies return the same Pydantic model, so the rest of the pipeline doesn't care which one ran. The team can enable HyDE globally, roll out step-back to specific countries, or revert instantly if something underperforms.
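One way that cascade might be wired up is shown below. The flag names, the per-strategy exception fall-through, and the use of a dataclass in place of Eva's Pydantic model are all assumptions for the sketch; what it demonstrates is the pattern: every strategy returns the same shape, flags pick which ones are eligible, and standard search is the guaranteed floor.

```python
from dataclasses import dataclass, field

@dataclass
class RetrievalResult:
    """Common return shape for every strategy (Eva uses a Pydantic model)."""
    strategy: str
    documents: list = field(default_factory=list)

# Stand-ins for the real strategies; each would call Azure AI Search.
def hyde_search(query):
    raise RuntimeError("simulated HyDE backend failure")

def step_back_search(query):
    return RetrievalResult("step_back", [f"broad policy docs for: {query}"])

def rewrite_search(query):
    return RetrievalResult("query_rewrite", [f"expanded docs for: {query}"])

def standard_search(query):
    return RetrievalResult("standard", [f"hybrid results for: {query}"])

def retrieve(query, flags):
    """Try flag-enabled strategies in priority order; standard search is the floor."""
    cascade = [
        ("hyde", hyde_search),
        ("step_back", step_back_search),
        ("query_rewrite", rewrite_search),
    ]
    for flag, strategy in cascade:
        if flags.get(flag):
            try:
                return strategy(query)
            except Exception:
                continue  # fall through to the next enabled strategy
    return standard_search(query)
```

Because the flags dict is just configuration, enabling HyDE for one country and step-back for another is a config change, not a deploy.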
## LLM reranking
After pulling results from both a country-specific index and a global index, Eva optionally reranks them using a RankGPT-style approach — the LLM scores document relevance with a bias toward local content. If reranking fails for any reason, it falls back to the original ordering so the pipeline keeps moving.
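The fail-open behavior is the important part: reranking is an optional quality boost, never a single point of failure. A minimal sketch of that contract (the document shape, scoring function, and `local_bias` weight are illustrative; the real scorer is an LLM call):

```python
def rerank_with_fallback(documents, score_fn, local_bias=0.5):
    """RankGPT-style rerank sketch: a scoring function orders documents with a
    bias toward local (country-specific) content. On any failure, the original
    search ordering is returned unchanged so the pipeline keeps moving."""
    try:
        scored = []
        for position, doc in enumerate(documents):
            score = score_fn(doc["text"]) + (local_bias if doc.get("is_local") else 0.0)
            scored.append((-score, position, doc))  # position keeps ties stable
        scored.sort()
        return [doc for _, _, doc in scored]
    except Exception:
        return documents  # fail open: fall back to the search engine's ordering

docs = [
    {"text": "global bonus policy", "is_local": False},
    {"text": "UK bonus policy", "is_local": True},
]
# Equal base scores: the local bias decides the order.
reranked = rerank_with_fallback(docs, score_fn=lambda text: 1.0)
# A broken scorer (None is not callable) triggers the fallback path.
fallback = rerank_with_fallback(docs, score_fn=None)
```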
## Answer generation with local vs. global context
The answer stage separates retrieved documents into local context (country-specific) and global context (company-wide), injected as distinct prompt sections. The LLM returns a structured response with reasoning, the actual answer, citations, and a coverage classification (full, partial, or denial).
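A structured response like that can be validated at the boundary as soon as the LLM's JSON comes back. Eva does this with Pydantic; the dataclass-plus-enum sketch below (field names match the post, everything else is illustrative) shows the same idea: an unknown coverage value fails loudly at parse time instead of propagating downstream.

```python
from dataclasses import dataclass
from enum import Enum

class Coverage(str, Enum):
    FULL = "full"
    PARTIAL = "partial"
    DENIAL = "denial"

@dataclass
class StructuredAnswer:
    reasoning: str
    answer: str
    citations: list
    coverage: Coverage

def parse_answer(payload):
    """Validate the LLM's JSON output at the boundary (Eva uses Pydantic)."""
    return StructuredAnswer(
        reasoning=payload["reasoning"],
        answer=payload["answer"],
        citations=list(payload.get("citations", [])),
        coverage=Coverage(payload["coverage"]),  # unknown value -> ValueError
    )

# Hypothetical LLM output for the running ACB example.
result = parse_answer({
    "reasoning": "UK policy section 3 covers bonus eligibility after misconduct.",
    "answer": "A substantiated misconduct finding can reduce or remove ACB eligibility.",
    "citations": ["uk-bonus-policy.pdf#s3"],
    "coverage": "full",
})
```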
Prompts are stored as version-controlled .txt files with per-model variants (e.g., gpt-4o.txt, gpt-4.1.txt), resolved at runtime. This makes prompts reviewable in PRs and deployable without code changes.
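Runtime resolution of a per-model prompt variant is a few lines. The directory layout and the `default.txt` fallback name here are assumptions (the post only says per-model `.txt` variants are resolved at runtime):

```python
from pathlib import Path
import tempfile

def resolve_prompt(prompt_dir: Path, prompt_name: str, model: str) -> Path:
    """Prefer a model-specific prompt file; otherwise fall back to a shared default."""
    candidate = prompt_dir / prompt_name / f"{model}.txt"
    return candidate if candidate.exists() else prompt_dir / prompt_name / "default.txt"

# Demo layout: the "answer" prompt has a gpt-4o variant but no gpt-4.1 variant.
root = Path(tempfile.mkdtemp())
(root / "answer").mkdir()
(root / "answer" / "gpt-4o.txt").write_text("answer prompt tuned for gpt-4o")
(root / "answer" / "default.txt").write_text("generic answer prompt")

specific = resolve_prompt(root, "answer", "gpt-4o")
fallback = resolve_prompt(root, "answer", "gpt-4.1")
```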
## Live agent handoff with Salesforce
When Eva determines a question needs a human — sensitive topic, complex case, or the employee simply asks — it hands off to a Salesforce advisor in real time.
- SSE streaming. Eva keeps a persistent HTTP connection to Salesforce for real-time messages, typing indicators, and session end signals.
- Session resilience. Session state persists across three layers — in-memory cache, Azure Cosmos DB, and Bot Framework turn state — to survive restarts and failovers.
- Message delivery workers. Each session has a dedicated async worker with exponential backoff retry. Overflow messages go to a failed messages list rather than being silently dropped.
- Queue position updates. While employees wait, Eva queries Salesforce for queue position and sends rate-limited updates.
- Context handoff. On session start, Eva sends the full conversation transcript so advisors don't ask employees to repeat themselves.
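The delivery-worker piece is the one most worth sketching. A minimal version of the pattern, with the retry counts, delays, and queue wiring invented for the example (only the shape — per-session async worker, exponential backoff, failed-messages list instead of silent drops — comes from the post):

```python
import asyncio

async def deliver(send, message, max_attempts=4, base_delay=0.01):
    """Try to send one message with exponential backoff; False if all attempts fail."""
    for attempt in range(max_attempts):
        try:
            await send(message)
            return True
        except ConnectionError:
            await asyncio.sleep(base_delay * (2 ** attempt))
    return False

async def delivery_worker(queue, send, failed):
    """Per-session worker: drain the queue, parking undeliverable messages
    on a failed-messages list rather than dropping them silently."""
    while not queue.empty():
        message = await queue.get()
        if not await deliver(send, message):
            failed.append(message)

async def main():
    queue = asyncio.Queue()
    for m in ("hello", "still there?"):
        queue.put_nowait(m)

    delivered, failed = [], []
    calls = {"n": 0}

    async def flaky_send(message):
        calls["n"] += 1
        if calls["n"] == 1:  # first attempt drops the connection
            raise ConnectionError
        delivered.append(message)

    await delivery_worker(queue, flaky_send, failed)
    return delivered, failed

delivered, failed = asyncio.run(main())
```

The first send fails, the backoff retry succeeds, and nothing lands on the failed list; only a message that exhausts all attempts would.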
## Automated evaluation
Eva includes an evaluation framework that runs as a separate process, testing against ground-truth Q&A pairs from CSV files.
Factual questions are scored using Azure AI's SimilarityEvaluator on a 1–5 scale, with optional relevance and groundedness checks.
Sensitive questions (harassment, disability, whistleblowing) use a custom LLM judge that checks whether the response acknowledges sensitivity and directs the employee to create a case or speak with an advisor.
A deviation detector flags score drops between runs. SQLite stores results for trending, and Application Insights powers dashboards. Long evaluation runs support resume — the framework skips already-completed test cases on restart.
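At its core, the deviation detector is a comparison of per-case scores across two runs. A minimal sketch (the threshold value and dict-based storage are assumptions; the real framework reads from SQLite):

```python
def detect_deviations(previous, current, threshold=0.5):
    """Flag test cases whose score dropped by more than `threshold` between runs.

    `previous` and `current` map test-case IDs to 1-5 similarity scores.
    Cases absent from the current run are skipped, not flagged.
    """
    flagged = []
    for case_id, prev_score in previous.items():
        cur_score = current.get(case_id)
        if cur_score is not None and prev_score - cur_score > threshold:
            flagged.append((case_id, prev_score, cur_score))
    return flagged

# q1 regressed by 1.5 points; q2's 0.2-point wobble stays under the threshold.
drops = detect_deviations({"q1": 4.5, "q2": 4.0}, {"q1": 3.0, "q2": 3.8})
```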
## Key takeaways
- Make retrieval strategies swappable. Feature flags let you A/B test without redeploying.
- Separate local and global knowledge explicitly. Don't rely on the LLM to figure out which country's policy applies.
- Invest in evaluation early. Ground-truth datasets with factual and behavioral scoring catch regressions that manual testing misses.
- Build resilience into live agent handoff. Multi-tier session recovery and retry logic prevent dropped conversations.
- Treat prompts as code. File-based, model-variant-aware prompts are easier to maintain than inline strings.
- Use Pydantic for structured LLM outputs. Typed models catch bad output at the validation boundary instead of letting it propagate.
## Get started
- Semantic Kernel documentation — LLM orchestration with plugins and structured outputs
- Azure OpenAI Service quickstart — Deploy GPT-4o or GPT-4.1
- Azure AI Search vector search tutorial — Hybrid and vector search indices
- Microsoft Bot Framework SDK — Build bots for Teams and web
- Azure AI Evaluation SDK — Score for similarity, relevance, and groundedness