Building a Scalable Contract Data Extraction Pipeline with Microsoft Foundry and Python
Architecture Overview

[Figure: architecture diagram showing Blob Storage triggering an Azure Function, which calls Document Intelligence, transforms the data, and stores it in Cosmos DB]

Flow:

1. Upload contract files (PDF or ZIP) to Azure Blob Storage
2. An Azure Function triggers automatically on file upload
3. Azure AI Document Intelligence extracts layout and tables
4. A transformation layer converts the output into a canonical JSON format
5. The data is stored in Azure Cosmos DB

Step 1: Trigger Processing with Azure Functions

An Azure Function with a Blob trigger enables automatic processing when a file is uploaded.

    import io
    import logging
    import zipfile

    import azure.functions as func


    def main(myblob: func.InputStream):
        logging.info(f"Processing blob: {myblob.name}")

        if myblob.name.lower().endswith(".zip"):
            # Unpack the archive in memory and handle each member as its own document.
            with zipfile.ZipFile(io.BytesIO(myblob.read())) as z:
                for file_name in z.namelist():
                    if file_name.endswith("/"):
                        continue  # skip directory entries
                    logging.info(f"Extracting {file_name}")
                    file_data = z.read(file_name)
                    # Pass file_data to extraction step
        else:
            # Single-file upload (for example, a PDF)
            file_data = myblob.read()
            # Pass file_data to extraction step

Best Practices

- Keep functions stateless and idempotent
- Handle retries for transient failures
- Store configuration in environment variables

Step 2: Extract Layout Using Document Intelligence

The prebuilt layout model helps extract tables, text, and structure from documents.

    from azure.ai.documentintelligence import DocumentIntelligenceClient
    from azure.ai.documentintelligence.models import AnalyzeDocumentRequest
    from azure.core.credentials import AzureKeyCredential

    client = DocumentIntelligenceClient(
        endpoint="<your-endpoint>",
        credential=AzureKeyCredential("<your-key>"),
    )

    # file_data holds the raw document bytes from Step 1.
    # Assumption: azure-ai-documentintelligence 1.x, where the document is passed
    # as the request body; other SDK versions name this argument differently.
    poller = client.begin_analyze_document(
        "prebuilt-layout",
        body=AnalyzeDocumentRequest(bytes_source=file_data),
    )
    result = poller.result()

Output Includes

- Structured tables
- Paragraphs and text blocks
- Bounding regions for layout context

Step 3: Handle Multi-Page Table Continuity

Contract documents often contain tables split across multiple pages. These need to be merged to preserve data integrity.
    # extract_rows is not shown in the original post; this version is an assumption.
    def extract_rows(table):
        """Group body cells into rows, skipping the header row (row_index 0)."""
        rows = {}
        for cell in sorted(table.cells, key=lambda c: (c.row_index, c.column_index)):
            if cell.row_index == 0:
                continue  # headers are stored once, so they are not duplicated on merge
            rows.setdefault(cell.row_index, []).append(cell.content)
        return [rows[i] for i in sorted(rows)]


    def merge_tables(tables):
        merged = []
        current = None
        for table in tables:
            headers = [cell.content for cell in table.cells if cell.row_index == 0]
            if current and headers == current["headers"]:
                # Same headers as the previous table: treat it as a continuation.
                current["rows"].extend(extract_rows(table))
            else:
                if current:
                    merged.append(current)
                current = {"headers": headers, "rows": extract_rows(table)}
        if current:
            merged.append(current)
        return merged

Key Considerations

- Match headers to detect continuation
- Preserve row order
- Avoid duplicate headers

Step 4: Transform to a Canonical JSON Schema

A consistent schema ensures compatibility across downstream systems.

    {
      "id": "contract_123",
      "documentType": "contract",
      "vendorName": "ABC Corp",
      "invoiceDate": "2023-05-05",
      "tables": [
        {
          "name": "Line Items",
          "headers": ["Item", "Qty", "Price"],
          "rows": [
            ["Service A", "2", "100"]
          ]
        }
      ],
      "metadata": {
        "sourceFile": "contract.pdf",
        "processedAt": "2026-04-22T10:00:00Z"
      }
    }

Design Tips

- Keep schema flexible and extensible
- Include metadata for traceability
- Avoid excessive nesting

Step 5: Persist Data in Cosmos DB

Store the transformed data in a scalable NoSQL database.
    from azure.cosmos import CosmosClient

    client = CosmosClient("<cosmos-uri>", "<key>")
    database = client.get_database_client("contracts-db")
    container = database.get_container_client("documents")

    # canonical_json is the document produced in Step 4.
    # Upserting by id means reprocessing the same file overwrites rather than duplicates.
    container.upsert_item(canonical_json)

Best Practices

- Choose an appropriate partition key (for example, documentType or vendorName)
- Optimize indexing policies
- Monitor request unit (RU) usage

Observability and Monitoring

To ensure reliability:

- Enable logging with Application Insights
- Track processing time and failures
- Monitor document extraction accuracy

Security Considerations

- Store secrets securely using Azure Key Vault
- Use Managed Identity for service authentication
- Apply role-based access control (RBAC) to storage resources

Conclusion

This approach provides a scalable and maintainable solution for contract data extraction:

- Event-driven processing with Azure Functions
- Accurate extraction using Document Intelligence
- Clean transformation into a reusable schema
- Efficient storage with Cosmos DB

This foundation can be extended with validation layers, review workflows, or analytics dashboards depending on your business requirements.

Resources

- Contract data extraction – Document Intelligence: Foundry Tools | Microsoft Learn
- microsoft/content-processing-solution-accelerator: Programmatically extract data and apply schemas to unstructured documents across text-based and multi-modal content using Azure AI Foundry, Azure OpenAI, Azure AI Content Understanding, and Cosmos DB.

Fixing Broken Markdown in AI Translation: Hardening a Production Pipeline
By Minseok Song and Hiroshi Yoshioka (Microsoft MVPs) TL;DR Recent community feedback, especially from Japanese translations, revealed that many translation failures were not semantic, but structural. Through detailed issue reports and discussions, we identified recurring patterns such as broken links, malformed code fences, inconsistent list structures, and CJK-specific formatting issues. In response, Co-op Translator has undergone a series of structural improvements across multiple releases, culminating in v0.18.1 with enhancements such as parser-based code fence handling, list-aware chunking, language-specific Markdown templates, safer CJK emphasis normalization, more robust image migration, and improved internal anchor consistency. These changes were directly informed by real-world community feedback. We would like to especially thank Hiroshi Yoshioka (Microsoft MVP), whose many detailed reports not only uncovered several of these systemic issues but also made this community report possible. The result is not just improved Japanese translations, but a more reliable and resilient translation pipeline for any repository that depends on Markdown fidelity. Introduction Most translation bugs are not actually translation bugs. They are structural failures. They show up as broken links, missing bold markers, unclosed code fences, skipped content, or images that quietly point to the wrong place. To a learner reading translated technical documentation, those issues can make a page feel untrustworthy. To a maintainer localizing documentation at scale, they reveal something deeper: the translation pipeline is not preserving structure as carefully as it preserves meaning. That insight became much clearer over the past several months through community feedback on Co-op Translator. Co-op Translator helps maintain educational GitHub content across many languages while keeping Markdown, images, and notebooks synchronized as the source evolves. 
As Hiroshi Yoshioka reported a series of Japanese translation issues across real Microsoft learning repositories, each issue looked narrow on the surface: a broken link here, a skipped line there, bold markers not surviving around linked text, HTML image tags not being rewritten, or code fences breaking after chunking. Example of a real community-reported issue where a code block was broken during translation, causing structural corruption in the output. But taken together, those reports exposed a broader pattern: The hardest problem was not “translate this sentence.” The hardest problem was “translate this document without damaging its structure.” This post is a community report on the hardening work that followed, especially in the recent run-up to v0.18.1, and what we learned from those real-world cases. Why these reports mattered One of the most useful things about community feedback is that it reveals failure modes that synthetic tests often miss. These were not edge cases found in toy Markdown samples. The reports came from real translated content in active educational repositories. That meant we were dealing with the kinds of files maintainers actually have to ship: nested lists fenced code blocks inline HTML relative links translated headings migrated image assets CJK punctuation and emphasis edge cases In other words, we were seeing the kinds of Markdown that break when a translation system is only mostly correct. 1) We stopped treating code fences like a regex problem Code fences are not a regex problem—they are a structural one. Left: Regex-based handling breaks code fences and list structure across chunks. Right: Parser-based processing preserves code blocks and their surrounding context as atomic units. One of the earliest recurring themes was code fence integrity. 
A report on incorrectly handled triple backticks highlighted a classic failure mode: if fenced blocks are detected or split incorrectly, placeholders can fall out of sync, chunk boundaries can be corrupted, and the translated file can come back structurally damaged. A later report showed a closely related issue: list items and indented code placeholders could be split into separate chunks, which then caused broken fences downstream. The right fix was not another regex patch. Instead, Co-op Translator moved to a parser-based approach using markdown-it-py for fenced code block detection. This made code block handling spec-aware and more resilient to cases like unmatched fences, variable fence lengths, and info strings. More importantly, it ensured code sections were treated as atomic units during chunking and placeholder restoration. This same principle was extended to list-aware chunking. Rather than splitting Markdown line by line and hoping the model would preserve structure, the pipeline now groups list items together with their continuation lines and indented placeholders such as @@CODE_BLOCK_X@@. This prevents bullets and their associated code content from being separated into different translation chunks. This was not just a better heuristic. It changed the unit of chunking itself. In practice, this required modifying the chunking pipeline to detect and preserve list-item blocks before token-based splitting. Instead of treating each line independently, we introduced a grouping step that keeps the entire list context intact, including nested indentation and code placeholders. The change was implemented directly in the chunking logic: lines = _group_lines_preserving_list_items(part_text) This helper ensures that list items and their associated code blocks are processed as a single unit, preventing structural corruption during translation. 
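To make the grouping idea concrete, here is a minimal sketch of list-aware line grouping. It is not the project's actual `_group_lines_preserving_list_items` implementation; the function name `group_list_items`, the regex, and the rule "an indented line after a list item belongs to that item" are all assumptions made for illustration.

```python
import re

# Hypothetical sketch, not Co-op Translator's real helper.
# A top-level list item starts at column 0 with a bullet or an ordered marker.
TOP_LEVEL_ITEM = re.compile(r"^(?:[-*+]|\d+[.)])\s+")


def group_list_items(text: str) -> list[str]:
    """Split text into blocks, keeping each list item with its indented lines."""
    blocks: list[str] = []
    pending_blanks: list[str] = []
    in_item = False
    for line in text.splitlines():
        if not line.strip():
            # Hold blank lines until we know whether the item continues.
            pending_blanks.append(line)
            continue
        indented = line[0] in " \t"
        if in_item and indented:
            # Continuation text, nested item, or indented placeholder:
            # attach it (and any blank lines before it) to the current item.
            blocks[-1] = "\n".join([blocks[-1], *pending_blanks, line])
        else:
            blocks.extend(pending_blanks)
            blocks.append(line)
            in_item = bool(TOP_LEVEL_ITEM.match(line))
        pending_blanks = []
    blocks.extend(pending_blanks)
    return blocks
```

Token-based splitting can then operate on these blocks instead of on raw lines, so a placeholder such as `@@CODE_BLOCK_1@@` indented under a bullet never ends up in a different chunk from the bullet itself.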
Why this mattered Technical documentation frequently embeds code examples directly under list items or step-by-step instructions. When these relationships are broken during translation, the issue is not just cosmetic. It results in structurally invalid Markdown and misplaced code blocks that can confuse readers and make examples unusable. These were not edge cases. They appeared in real production documentation where: Fenced code blocks became malformed after chunking List items and their associated code placeholders were separated into different segments Placeholder ordering drifted, breaking reconstruction of the original structure In practice, this meant that even when the translated text was correct, the document itself could no longer be trusted as a working technical resource. What changed in practice Before: Code samples could leak out of their list context List items and code blocks were split across chunks Placeholder ordering could drift, breaking reconstruction After: Code blocks are preserved as atomic units during chunking List-bound code samples remain intact Placeholder ordering is stable across the pipeline 2) We restored internal link consistency across translation chunks Even when each chunk appears locally correct, internal links can break at the document level. Left: Anchor links drift out of sync because headings and links are translated independently across chunks. Right: After document-level normalization, links correctly resolve to their corresponding translated headings. Another cluster of issues surfaced when translating longer Markdown documents: internal links would silently break once the content was processed in chunks. Co-op Translator splits large documents into multiple chunks to fit within model constraints. While this works well for translation itself, it introduces a structural problem. Internal links such as [Go to section](#section-name) depend on heading-derived anchor slugs, and those slugs can change during translation. 
When each chunk is translated independently, links and headings can drift out of sync. In practice, this meant that even when translated headings and links looked correct locally within a chunk, they no longer matched at the document level. Tables of contents, section jump links, and cross-references inside the same file could silently break. The right fix was not to rely on chunk-level correctness. Instead, Co-op Translator introduced a document-level normalization step for internal anchor links. The pipeline now parses both the source and translated Markdown using markdown-it, extracts headings, generates GitHub-style slugs from the translated headings, and then realigns internal anchor links so they correctly point to their corresponding translated sections. Rather than trusting fragment identifiers produced during chunk-level translation, links are reconciled against the final translated document structure. This was not just a small post-processing tweak. It changed where consistency is enforced. In practice, this required introducing a normalization step that runs after all chunks are merged back into a single document. Instead of assuming each chunk is self-consistent, the system now treats the entire document as the source of truth and rebinds internal links accordingly. The change was implemented as a dedicated normalization pass: normalize_internal_anchor_links(source_markdown, translated_markdown) This function aligns fragment identifiers with translated heading slugs, ensuring that internal navigation remains valid even when content has been translated in multiple independent chunks. Why this mattered Technical documentation relies heavily on internal navigation such as tables of contents, section links, and cross-references within the same file. When anchor links drift out of sync with translated headings, the document becomes difficult to navigate even if the translation itself is accurate. 
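The document-level realignment described above can be sketched as follows. This is a simplified illustration, not the project's `normalize_internal_anchor_links`: it finds headings with a regex rather than a Markdown parser, only approximates GitHub's slug rules, ignores duplicate headings, and assumes headings and internal links appear in the same order in the source and the translation. The names `github_slug` and `relink_anchors` are hypothetical.

```python
import re

# ATX headings only; a real implementation would use a parser so that
# "#" lines inside code fences are not mistaken for headings.
HEADING = re.compile(r"^#{1,6}[ \t]+(.*?)[ \t]*$", re.MULTILINE)
ANCHOR_LINK = re.compile(r"\]\(#([^)\s]+)\)")


def github_slug(heading: str) -> str:
    """Approximate a GitHub heading slug: lowercase, drop punctuation, hyphenate spaces."""
    slug = heading.strip().lower()
    slug = re.sub(r"[^\w\- ]", "", slug)  # \w is Unicode-aware, so CJK text survives
    return slug.replace(" ", "-")


def relink_anchors(source_md: str, translated_md: str) -> str:
    """Rebind each internal link to the slug of the corresponding translated heading."""
    src_slugs = [github_slug(h) for h in HEADING.findall(source_md)]
    dst_slugs = [github_slug(h) for h in HEADING.findall(translated_md)]
    slug_map = dict(zip(src_slugs, dst_slugs))
    # The n-th link in the translation is assumed to mirror the n-th source link.
    src_targets = iter(ANCHOR_LINK.findall(source_md))

    def fix(match: re.Match) -> str:
        new_slug = slug_map.get(next(src_targets, None))
        return f"](#{new_slug})" if new_slug else match.group(0)

    return ANCHOR_LINK.sub(fix, translated_md)
```

Because the mapping is built from the merged document rather than from individual chunks, it does not matter whether a chunk left the fragment untranslated or translated it differently from the heading it points to.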
Readers may click on links that lead to incorrect sections or nowhere at all, which significantly reduces trust in the content. These issues surfaced in real-world usage where: Internal links no longer matched translated heading slugs Tables of contents pointed to incorrect or missing sections Cross-references silently broke across chunk boundaries This highlighted that correctness at the chunk level was not enough. Consistency had to be enforced at the document level. What changed in practice Before: Internal links could drift out of sync with translated headings Tables of contents pointed to incorrect or missing sections Cross-references silently broke across chunk boundaries Long documents behaved like fragmented outputs rather than a single unit After: Internal links are realigned with translated heading slugs at the document level Tables of contents correctly resolve to translated sections Cross-references remain consistent across the entire document Long Markdown documents behave as a single coherent unit 3) We fixed CJK emphasis the safe way Bold and italic rendering around CJK text was a recurring and subtle failure point. Issues like “Markdown bold not handled correctly” may look minor, but they reveal a deeper compatibility problem: many Markdown renderers do not consistently apply emphasis when markers sit directly next to CJK characters. To address this, we introduced a dedicated normalization step for emphasis markers. Instead of relying on each renderer to interpret `*`, `**`, and `***` correctly in CJK-adjacent cases, Co-op Translator converts them into equivalent HTML tags such as `<em>` and `<strong>` when the target language is Japanese, Korean, or Chinese. This shifts emphasis rendering from renderer-dependent behavior to deterministic output. What mattered was not just fixing it, but fixing it safely. The normalization is strictly scoped to CJK languages and carefully designed to avoid overmatching. 
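As an illustration of the scoped emphasis normalization described above, here is a minimal sketch. It is not Co-op Translator's implementation: the function name `normalize_emphasis`, the language codes, and the regular expressions are assumptions, and the sketch expects to be applied only to lines outside fenced code blocks. Underscore emphasis is deliberately left alone so that identifiers such as `snake_case_name` are never touched.

```python
import re

CJK_LANGS = {"ja", "ko", "zh"}  # assumed language codes

# Capturing group so re.split keeps the inline code spans in the result.
CODE_SPAN = re.compile(r"(`+[^`]*`+)")
TRIPLE = re.compile(r"\*\*\*(?=\S)(.+?)(?<=\S)\*\*\*")
BOLD = re.compile(r"\*\*(?=\S)(.+?)(?<=\S)\*\*")
ITALIC = re.compile(r"(?<!\*)\*(?=[^\s*])(.+?)(?<=[^\s*])\*(?!\*)")


def normalize_emphasis(line: str, lang: str) -> str:
    """Rewrite *, ** and *** emphasis as HTML tags for CJK targets only."""
    if lang not in CJK_LANGS:
        return line
    parts = CODE_SPAN.split(line)
    for i in range(0, len(parts), 2):  # even indexes are outside code spans
        text = TRIPLE.sub(r"<strong><em>\1</em></strong>", parts[i])
        text = BOLD.sub(r"<strong>\1</strong>", text)
        parts[i] = ITALIC.sub(r"<em>\1</em>", text)
    return "".join(parts)
```

Splitting on code spans first is what keeps the fix from overmatching: the substitution never sees the contents of inline code, and the early return leaves non-CJK targets byte-for-byte unchanged.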
The normalization does not mutate inline code spans or unrelated fragments. This is critical, because overly aggressive formatting fixes can easily break code, identifiers, or underscore-heavy technical text.

Unlike whitespace-delimited languages, Japanese, Korean, and Chinese often place characters directly adjacent to emphasis markers without clear boundaries. For example, a phrase like `**example** is ...` may be translated into Japanese as `**例**は ...`. Here, the particle は is attached directly to the emphasized word. In some Markdown renderers, this breaks the expected boundary around `**...**`, causing the emphasis to render incorrectly or not at all. This pattern is not limited to Japanese. Similar boundary issues can appear across CJK languages due to the absence of whitespace between words.

Why this mattered

Formatting bugs around emphasis may look cosmetic, but they affect readability, hierarchy, and trust, especially in instructional documentation where emphasis often signals warnings, key concepts, or required steps.

What changed in practice

Before:

- Emphasis markers could render inconsistently when adjacent to CJK characters
- Bold and italic formatting could break depending on the Markdown renderer
- Fixes risked overmatching and corrupting code or inline technical content

After:

- Emphasis rendering is deterministic across CJK languages using HTML tags
- Bold and italic formatting remains consistent regardless of renderer behavior
- Normalization is safely scoped, avoiding unintended mutations in code and inline content

Next steps

With the recent release, Co-op Translator now exposes a programmatic API that allows the translation pipeline to be executed directly from Python, not only through the CLI. This is an important step, but it is not the end state. The immediate focus is improving adoption. Documentation and usage patterns are being developed so that the API can be reliably integrated across different environments and workflows. More fundamentally, the direction is shifting.
Co-op Translator is evolving from a repository-specific tool into a reusable translation engine that can operate as part of larger content pipelines. This enables broader use cases, including: Long-form content such as eBooks and technical blogs Developer documentation and static site projects (for example, Docusaurus or Astro) Continuous documentation pipelines that track and update translations as source content evolves Multilingual SDK, API documentation, and knowledge base systems The long-term goal is to treat translation as infrastructure rather than a one-time task. Instead of generating static outputs, the system is being designed to support continuous updates, structural guarantees, and seamless integration into real-world documentation workflows. Why community feedback mattered so much here One of the most encouraging parts of this work is that the most useful reports were not always long reports. Sometimes a single repository link, a screenshot, and one concrete example of broken output were enough to reveal a structural weakness in the translation engine. That feedback created a valuable loop between people reading translated docs and people maintaining the translation tooling. Hiroshi's reports did not just identify isolated defects. They helped surface recurring categories of failure: code fence integrity chunk boundary safety link preservation CJK emphasis compatibility image path migration anchor normalization Once those patterns became visible, the fixes could be implemented in the core and covered with tests so that the broader ecosystem not just one file or one repo would benefit. Why this matters for learners worldwide Co-op Translator is used in educational repositories where translated documentation can lower the barrier to learning for people around the world. That raises the quality bar. A learner should not have to wonder whether a missing bold marker changed the meaning of a sentence. 
A learner should not hit a broken anchor halfway through a tutorial. A learner should not lose trust in a translated page because a code block or image path was corrupted during processing. Improving those details is not cosmetic. It is part of making global technical education more reliable. Closing thoughts This community report comes down to a simple truth: Translation quality depends on structural quality. Community feedback helped Co-op Translator get better at preserving the things technical documents depend on most: code fences, lists, links, emphasis, images, and anchors. The result is a more dependable foundation for multilingual documentation not only for Japanese, but for any repository that needs translated content to behave like a maintained technical artifact rather than a plain text dump. To everyone who has opened an issue, shared a screenshot, submitted a PR, or stress-tested translated docs in the real world: thank you. That feedback is helping Co-op Translator become a stronger tool for maintainers and a more trustworthy experience for learners. If you are maintaining multilingual Markdown content, I hope these lessons are useful beyond this project too: use parsers where you can, make structure a first-class concern, and treat community bug reports as design input not just support tickets. 
If you are working on multilingual documentation, you can explore Co-op Translator here: https://github.com/Azure/co-op-translator

Selected GitHub references

- Repository: https://github.com/Azure/co-op-translator
- Issue #221: https://github.com/Azure/co-op-translator/issues/221
- PR #226: https://github.com/Azure/co-op-translator/pull/226
- Issue #234: https://github.com/Azure/co-op-translator/issues/234
- PR #237: https://github.com/Azure/co-op-translator/pull/237
- Issue #235: https://github.com/Azure/co-op-translator/issues/235
- Issue #239: https://github.com/Azure/co-op-translator/issues/239
- Issue #357: https://github.com/Azure/co-op-translator/issues/357
- Issue #362: https://github.com/Azure/co-op-translator/issues/362
- Issue #363: https://github.com/Azure/co-op-translator/issues/363
- PR #370: https://github.com/Azure/co-op-translator/pull/370
- PR #372: https://github.com/Azure/co-op-translator/pull/372
- PR #377: https://github.com/Azure/co-op-translator/pull/377
- PR #378: https://github.com/Azure/co-op-translator/pull/378
- PR #379: https://github.com/Azure/co-op-translator/pull/379
- PR #364: https://github.com/Azure/co-op-translator/pull/364

About the authors

Minseok Song (Microsoft MVP) is an OSS maintainer of Co-op Translator focusing on GitHub-native multilingual automation.

Hiroshi Yoshioka (Microsoft MVP) is a community contributor who has played a key role in improving translation quality through detailed real-world feedback.

Securing Your AI Agents Before They Ship: Red Teaming with Microsoft PyRIT
You wouldn't ship a web app without running OWASP ZAP or Snyk. So why are AI agents going to production without a single security scan? Prompt injection, data leakage, system prompt theft: the OWASP Top 10 for LLM Applications reads like a checklist of things most teams haven't tested for.

PyRIT is Microsoft's open-source answer: an automation framework battle-tested on 100+ products including Copilot. But here's the catch: PyRIT is a research library. To make it work in a real engineering workflow, you need to wrap it. This post shows you how.

In this post:

- Why AI red teaming is fundamentally different from traditional security testing
- What PyRIT gives you out of the box
- How to build a thin wrapper that turns PyRIT into a config-driven, pipeline-ready scanner
- When and how to plug it into your CI/CD workflow
- Customizing every step for your threat model

🛡️ Why AI Red Teaming Is Different

If you're building agentic AI (systems that reason, call tools, and take actions), you already know that traditional security testing doesn't cut it. Microsoft's AI Red Team learned this the hard way after red-teaming 100+ generative AI products. Three things make AI red teaming unique:

- You're testing two risk surfaces at once: security vulnerabilities (prompt injection, data exfiltration) *and* responsible AI harms (bias, toxicity, manipulation). Traditional pen testers focus on one.
- Outputs are probabilistic: the same prompt can produce different responses across runs. You can't just assert on a fixed output. You need automated scoring at scale.
- Every architecture is different: standalone chatbots, RAG pipelines, multi-agent workflows, tool-calling agents. A single test harness has to flex across all of them.
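The point about probabilistic outputs can be illustrated with a small sketch: instead of asserting on one response, send the same prompt repeatedly and measure how often the attack succeeds. This does not use PyRIT's API; `attack_success_rate`, `flaky_agent`, and `leaked_secret` are hypothetical stand-ins for a real target and a real scorer.

```python
import random
from typing import Callable


def attack_success_rate(
    target: Callable[[str], str],
    scorer: Callable[[str], bool],
    prompt: str,
    runs: int = 20,
) -> float:
    """Send one prompt several times and return the fraction of harmful responses."""
    hits = sum(scorer(target(prompt)) for _ in range(runs))
    return hits / runs


# Stand-ins for illustration only: an agent that leaks some of the time,
# and a keyword scorer where a real setup would use an LLM judge.
_rng = random.Random(0)


def flaky_agent(prompt: str) -> str:
    return "SECRET-123" if _rng.random() < 0.25 else "I can't help with that."


def leaked_secret(response: str) -> bool:
    return "SECRET" in response
```

A release gate can then compare this rate against a threshold rather than relying on a single pass/fail observation.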
The OWASP LLM Top 10 (2025) gives us the taxonomy — prompt injection, sensitive information disclosure, excessive agency, system prompt leakage, data poisoning, supply chain risks, improper output handling, embedding weaknesses, misinformation, and unbounded consumption. Every AI agent you deploy is exposed to all ten. The question is whether *you* discover the gaps or your users do. 🔧 What PyRIT Gives You PyRIT (Python Risk Identification Tool) started as internal scripts at Microsoft in 2022. Today it's a 3,800-star, MIT-licensed framework with 129 contributors and a published paper. "We were able to pick a harm category, generate several thousand malicious prompts, and use PyRIT's scoring engine to evaluate the output from the Copilot system — all in the matter of hours instead of weeks." — Microsoft Security Blog The building blocks: 53+ datasets — AIRT, HarmBench, AdvBench, XSTest, and more. Curated adversarial prompts covering content harms, jailbreaks, data exfiltration, and social bias. 70+ prompt converters — Base64, ROT13, Leetspeak, Unicode confusables, LLM-powered rephrasing, translation, multimodal injection. They stack — a prompt can be translated, then Base64-encoded, then embedded in an image. 6 attack strategies — from simple `PromptSendingAttack` (single-turn) to `CrescendoAttack` (gradual escalation), `TreeOfAttacksWithPruning` (TAP), and multi-turn dialogue attacks. 20+ scorers — LLM-as-judge, Azure AI Content Safety, true/false classifiers, Likert scales. 10+ targets — OpenAI, Azure, HuggingFace, HTTP endpoints, Playwright, WebSockets. This is powerful — PyRIT gives you the components — datasets, converters, attack strategies, scorers — but not the glue. You still need something that loads a config, wires the right components together, runs attacks, scores the results, and tells your pipeline pass or fail. That's what a wrapper does. 
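The converter stacking mentioned in the building blocks above can be pictured as plain function composition. The sketch below deliberately avoids PyRIT's converter classes and uses only the standard library; `rot13`, `to_base64`, and `stack` are hypothetical helpers meant to show the idea, not the framework's interface.

```python
import base64
import codecs
from functools import reduce
from typing import Callable

Converter = Callable[[str], str]


def rot13(text: str) -> str:
    return codecs.encode(text, "rot_13")


def to_base64(text: str) -> str:
    return base64.b64encode(text.encode("utf-8")).decode("ascii")


def stack(*converters: Converter) -> Converter:
    """Compose converters left to right into a single prompt transformation."""
    return lambda prompt: reduce(lambda acc, conv: conv(acc), converters, prompt)


# Each converter receives the previous converter's output.
obfuscate = stack(rot13, to_base64)
```

Ordering matters: `stack(rot13, to_base64)` and `stack(to_base64, rot13)` produce different payloads, which is why a wrapper typically lets the config list converters in an explicit sequence.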
🏗️ Building an Enterprise Wrapper

The idea is simple: take PyRIT's primitives and compose them into an opinionated, config-driven pipeline that any developer can run with a single command. What follows is one way to build such a wrapper for agentic AI security testing; the same approach is not limited to that use case.

The Flow

Everything starts with a YAML config and ends with a pass/fail exit code.

The key insight: every step in this pipeline is configurable through YAML, not code. Switching attack strategies, adding converters, adjusting thresholds: it's all one config edit away.

Project Structure

At POC level, the wrapper is lean: an orchestrator that stitches PyRIT's components together.

The heart of it is `runner.py`, a single orchestrator that:

1. Reads the YAML config
2. Dynamically imports your agent and wraps it as a PyRIT target
3. Loads datasets (built-in + your custom prompts)
4. Runs attacks in escalating phases: plain → encoded → semantic → multi-turn
5. Scores every response using LLM-as-judge
6. Maps findings to OWASP LLM Top 10 categories
7. Generates reports (JSON, Markdown)
8. Evaluates the release gate: pass or fail

That's it. One file doing the orchestration, a handful of helpers, and a YAML config. You can start here and grow the structure as your needs evolve.
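Two of the runner's steps, importing the agent dynamically and evaluating the release gate, need nothing beyond the standard library. The sketch below shows one way they might look; `load_target` and `evaluate_gate` are hypothetical names, wrapping the callable as a PyRIT target is left out, and the reading of each `fail_on` value as "the highest count still allowed" is an assumption.

```python
import importlib
from typing import Callable


def load_target(module: str, function: str) -> Callable[[str], str]:
    """Resolve a module path and function name from the config into a callable."""
    return getattr(importlib.import_module(module), function)


def evaluate_gate(findings: dict[str, float], fail_on: dict[str, float]) -> int:
    """Return a process exit code: 0 if every metric is within its limit, else 1.

    Assumption: each fail_on value is the highest value still allowed,
    so a limit of 0 means zero tolerance.
    """
    for metric, limit in fail_on.items():
        if findings.get(metric, 0) > limit:
            return 1
    return 0
```

A CLI entry point could pass the return value of `evaluate_gate` to `sys.exit`, which is what lets any CI system treat the scan as a pass/fail step.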
The Configuration

A single YAML file drives everything:

    scan:
      name: "my-agent-security-scan"

    # Point to your agent: any Python callable
    target:
      module: "my_agent.main"
      function: "handle_prompt"

    # Attack strategies to run
    strategies:
      - name: "prompt_sending"   # Single-turn baseline
      - name: "crescendo"        # Gradual escalation
        max_turns: 5

    # Prompt transformations: they stack automatically
    converters:
      - name: "Base64Converter"
      - name: "LeetspeakConverter"
      - name: "TranslationConverter"

    # Adversarial prompt sets
    datasets:
      - "airt_illegal"           # Built-in: Microsoft AI Red Team
      - "harmbench"              # Built-in: harm evaluation benchmark
      - path: "datasets/custom_prompts.yaml"

    # Release gate thresholds
    gate:
      fail_on:
        critical_findings: 0     # Zero tolerance for critical
        high_findings: 3
        max_score: 0.3           # Defense rate must exceed 70%

Point the target at your agent, pick your strategies and datasets, set your thresholds, and run. Teams can start scanning their agents in an afternoon, not weeks.

🔄 Plugging Into Your Pipeline

Since the wrapper is a pip-installable package (built with setuptools or Poetry, for example), integrating it into any CI/CD system is straightforward: `pip install`, then call the CLI. No custom actions or marketplace extensions needed.

The key decision is when to run scans. Not every merge needs a full red team pass. Here's what works in practice: developers can optionally run quick scans locally as a fast feedback loop, while full scans are manually triggered or approval-gated. The tech lead or architect decides when it's worth running a comprehensive assessment based on the nature of the changes.

Since it's just a CLI, integration is the same everywhere: GitHub Actions, Azure DevOps, Jenkins, or a shell script. Install the package, call `pyrit-scan run`, check the exit code.

⚙️ Customization Without Forking

The whole point of a wrapper is that teams customize behavior through configuration, not by modifying framework code.
| What to Customize | How | Example |
|---|---|---|
| Which agent to test | Point `target.module` + `target.function` in YAML to any Python callable | Your chatbot, RAG pipeline, or multi-agent workflow |
| Attack strategies | Add/remove entries under `strategies` in YAML | Start with `prompt_sending`, add `crescendo` when ready |
| Prompt transformations | List converters in YAML; they stack automatically | Base64 → Leetspeak → Translation = multi-phase evasion |
| Datasets | Use built-in (53+) or add custom YAML prompt files | HIPAA prompts, financial compliance scenarios |
| Scoring thresholds | Set per-OWASP-category thresholds in `gate.fail_on` | Zero tolerance for data leakage (LLM02), relaxed for misinformation (LLM09) |
| Report formats | List formats in `reporting.formats` | JSON for automation, PDF for compliance, JUnit for dashboards |
| New attack classes | Register via `custom_attacks` in YAML: module + class name | No framework code change, no PR needed |

🎯 Start Red Teaming Today

AI red teaming isn't a nice-to-have anymore. If you're shipping agentic AI (systems that call tools, access data, and take actions on behalf of users), you need automated security testing in your pipeline.

PyRIT gives you the primitives. A thin wrapper gives you the automation. Together, they turn AI security from a one-off exercise into a continuous, measurable practice.

The pattern: YAML config → wrap your agent → run attacks → score → map to OWASP → gate the release.

Build it once. Run it on every release. Sleep better.

Resources

- PyRIT on GitHub: source code, docs, and community
- PyRIT Documentation: getting started guides and API reference
- OWASP LLM Top 10 (2025): the industry standard risk taxonomy
- Microsoft AI Red Team Hub: threat models, bug bars, and best practices
- 3 Takeaways from Red Teaming 100 Products: lessons learned at scale
- PyRIT Launch Blog: origin story and key design decisions
- PyRIT Paper (arXiv): the academic paper

Making Sense of Azure AI Foundry IQ
As enterprise teams build AI agents, the hardest design decisions often have nothing to do with models. Instead, they revolve around a more fundamental question: How should an agent access organizational knowledge in a way that is accurate, secure, and sustainable over time? Azure AI Foundry IQ is designed to address a specific version of that problem. It is not a general‑purpose data access layer, and it is not a replacement for every retrieval pattern. Understanding where it fits and where it does not is key to using it effectively. This post explores those boundaries and grounds them in concrete, enterprise‑relevant scenarios, before showing how Foundry IQ can be implemented directly via Azure AI Search APIs and SDKs. What Azure AI Foundry IQ Is (and Is Not): Azure AI Foundry IQ is a managed knowledge layer built on Azure AI Search. It allows you to define a knowledge base that spans multiple content sources such as SharePoint, Azure Blob Storage, OneLake, existing Azure AI Search indexes, and selected external sources and expose them through a single, permission‑aware endpoint. When an agent queries a knowledge base, Foundry IQ: Plans how the query should be executed Selects relevant knowledge sources Runs retrieval (optionally in multiple steps) Enforces user permissions Returns grounded results with citations A single knowledge base can be reused across multiple agents or applications, avoiding duplicated indexing and inconsistent retrieval logic. What Foundry IQ is not: It does not execute SQL queries, perform aggregations, or provide real‑time numeric accuracy. Foundry IQ retrieves unstructured text, not transactional or analytical data. Where Foundry IQ Is a Good Fit 1. Multi‑Source, Distributed Knowledge Foundry IQ is most valuable when relevant knowledge is spread across multiple systems. It removes the need for each agent to manage source‑specific routing and retrieval logic. 
This benefit increases as the number of sources grows; with a single source, the overhead is rarely justified. 2. Complex or Multi‑Part Questions Foundry IQ’s agentic retrieval model is designed for questions that require: Decomposition into sub‑questions Retrieval from multiple documents Synthesis across sources Its multi‑step retrieval approach is especially effective when a single document cannot answer the question on its own. 3. Reduced Custom Retrieval Engineering Foundry IQ automates indexing, chunking, vectorization, and orchestration across sources. This makes it a strong choice for teams that want to focus on agent behavior rather than building and maintaining custom RAG pipelines. 4. Enterprise Security and Governance Foundry IQ integrates with Microsoft Entra ID and supports document‑level permissions and Purview sensitivity labels where the underlying source allows it. This makes it suitable for internal or regulated scenarios where permission trimming is a hard requirement. 5. Shared Knowledge Across Multiple Agents A single knowledge base can serve multiple agents or applications, reducing operational overhead and ensuring consistent retrieval behavior across experiences. 6. High Emphasis on Answer Quality and Trust For scenarios where correctness, grounding, and citations matter more than latency or cost, Foundry IQ’s multi‑step retrieval consistently outperforms basic RAG approaches. Example Scenarios Where Foundry IQ Works Well Scenario A: Internal Policy and Operations Assistant An enterprise builds an internal assistant for store managers. Relevant information lives in: • HR policies in SharePoint • Safety procedures in Blob Storage • Operations manuals in OneLake Questions often span multiple documents. A single Foundry IQ knowledge base unifies these sources and enforces permissions automatically. 
Scenario B: Compliance or Regulatory Knowledge Assistant A compliance team needs answers strictly grounded in approved documents, with citations and access control. Foundry IQ ensures only authorized content is retrieved, reducing the risk of accidental data exposure. Scenario C: Shared Knowledge Layer for Multiple Internal Agents Multiple internal agents like chat assistants, workflow helpers, embedded copilots rely on the same procedural content. A shared knowledge base avoids duplicate indexing and centralizes governance. Where Foundry IQ Is Not a Good Fit 1. Simple or Single‑Source Q&A For a single, well‑defined source, Foundry IQ’s orchestration adds complexity without proportional benefit. 2. Structured or Analytical Data Queries Foundry IQ does not execute live queries or calculations. It retrieves text, not metrics. 3. Ultra‑Low Latency or High‑Throughput Requirements Agentic retrieval introduces LLM‑in‑the‑loop latency and token costs. For sub‑second responses at scale, simpler retrieval pipelines are more appropriate. 4. Highly Customized Retrieval Logic Foundry IQ abstracts the retrieval pipeline. If you require fine‑grained control over scoring or transformations, a fully custom search pipeline may be preferable. Example Scenarios Where Foundry IQ Is the Wrong Tool Scenario D: Sales and Inventory Analytics Agent Questions like “What were Q4 sales by region?” require live data queries. Indexing reports leads to stale answers. A direct SQL or analytics tool is the correct solution. Scenario E: High‑Volume, Low‑Latency Assistant Voice‑based assistants requiring sub‑second responses cannot tolerate the latency of agentic retrieval. A Common Architecture Pattern Most successful implementations combine: Foundry IQ for unstructured documents and policies Structured data tools for analytics and live queries An application or agent layer that routes questions based on intent This avoids forcing a single tool to solve every problem. 
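The routing layer in this pattern can be sketched roughly as follows. This is a minimal illustration, not a reference design: it assumes a simple keyword heuristic for intent, and the names classify_intent, query_knowledge_base, query_analytics, and route_question are hypothetical placeholders rather than part of any Azure SDK. A production router would more likely use an LLM or a trained intent classifier.

```python
import re
from typing import Callable, Dict

# Hypothetical heuristic: words that suggest a live, numeric query.
ANALYTICS_PATTERN = re.compile(
    r"\b(sales|revenue|inventory|total|average|count|how many|q[1-4])\b",
    re.IGNORECASE,
)


def classify_intent(question: str) -> str:
    """Return 'analytics' for questions needing live numbers, else 'knowledge'."""
    return "analytics" if ANALYTICS_PATTERN.search(question) else "knowledge"


def query_knowledge_base(question: str) -> str:
    # Placeholder: a real implementation would call the knowledge base here.
    return f"[knowledge base] {question}"


def query_analytics(question: str) -> str:
    # Placeholder: a real implementation would run SQL or call an analytics tool.
    return f"[analytics tool] {question}"


HANDLERS: Dict[str, Callable[[str], str]] = {
    "knowledge": query_knowledge_base,
    "analytics": query_analytics,
}


def route_question(question: str) -> str:
    """Send the question to the tool that matches its intent."""
    return HANDLERS[classify_intent(question)](question)
```

Keeping the classifier separate from the handlers means the heuristic can later be swapped for a model-based classifier without touching either backend.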
Querying Foundry IQ Knowledge Bases Directly via Azure AI Search SDK
You can query Azure AI Foundry IQ knowledge bases directly using the azure-search-documents Python SDK without using Foundry Agent Service.

Your App → Azure AI Search SDK → Foundry IQ Knowledge Base → Grounded Results

This is ideal when you want full orchestration control while still benefiting from managed, agentic retrieval.

How this works
Note: this is a reference implementation.

Install

    pip install --pre azure-search-documents azure-identity

Setup (High Level)
Provision Azure AI Search (Basic or higher)
• Enable Azure AD and API key authentication
• Enable a system-assigned managed identity
Ingest Content via Knowledge Sources
• Blob Storage, SharePoint, or OneLake
• Index, indexer, data source, and skillset are created automatically
• Knowledge sources and KBs are created via REST API (2025-11-01-preview)
Create a Knowledge Base
• minimal reasoning → semantic retrieval only (no LLM)
• low / medium reasoning → requires Azure OpenAI model
• Search service MI needs Cognitive Services User

Querying the Knowledge Base (Python)
Initialize the Client

    from azure.identity import DefaultAzureCredential
    from azure.search.documents.knowledgebases import KnowledgeBaseRetrievalClient

    client = KnowledgeBaseRetrievalClient(
        endpoint="https://<search-service>.search.windows.net",
        knowledge_base_name="<kb-name>",
        credential=DefaultAzureCredential(),
    )

Minimal Reasoning (Fast, No LLM)

    from azure.search.documents.knowledgebases.models import (
        KnowledgeBaseRetrievalRequest,
        KnowledgeRetrievalSemanticIntent,
        KnowledgeRetrievalMinimalReasoningEffort,
        KnowledgeRetrievalOutputMode,
    )

    request = KnowledgeBaseRetrievalRequest(
        intents=[KnowledgeRetrievalSemanticIntent(search="your question here")],
        retrieval_reasoning_effort=KnowledgeRetrievalMinimalReasoningEffort(),
        output_mode=KnowledgeRetrievalOutputMode.EXTRACTIVE_DATA,
    )
    response = client.retrieve(retrieval_request=request)

Conversational Reasoning (LLM-Backed)

    from azure.search.documents.knowledgebases.models import (
        KnowledgeBaseRetrievalRequest,
        KnowledgeBaseMessage,
        KnowledgeBaseMessageTextContent,
        KnowledgeRetrievalLowReasoningEffort,
        KnowledgeRetrievalOutputMode,
    )

    request = KnowledgeBaseRetrievalRequest(
        messages=[
            KnowledgeBaseMessage(
                role="user",
                content=[KnowledgeBaseMessageTextContent(text="<first user question>")],
            ),
            KnowledgeBaseMessage(
                role="assistant",
                content=[KnowledgeBaseMessageTextContent(text="<assistant response>")],
            ),
            KnowledgeBaseMessage(
                role="user",
                content=[KnowledgeBaseMessageTextContent(text="<follow-up question>")],
            ),
        ],
        retrieval_reasoning_effort=KnowledgeRetrievalLowReasoningEffort(),
        output_mode=KnowledgeRetrievalOutputMode.EXTRACTIVE_DATA,
    )
    response = client.retrieve(retrieval_request=request)

Keep in mind:
• intents → minimal reasoning only
• messages → low / medium reasoning only
They are not interchangeable.

Processing the Response

    # Extracted content
    for msg in (response.response or []):
        for item in (msg.content or []):
            print(item.text)

    # Citations (handles blob, SharePoint, OneLake, and search index references)
    for ref in (response.references or []):
        ref_id = getattr(ref, "id", None)
        url = getattr(ref, "blob_url", None) or getattr(ref, "url", None)
        print(f"[{ref_id}] {url}")

    # Retrieval diagnostics
    for record in (response.activity or []):
        elapsed = getattr(record, "elapsed_ms", None) or ""
        print(f"{record.type}: {elapsed}ms")

Output Modes
Mode | When to Use
extractiveData | Feed grounded chunks into your own LLM
answerSynthesis | Return a ready-made answer with citations (LLM required)

Security & Permissions
• RBAC: Search Index Data Reader with DefaultAzureCredential
• Permission trimming: must be enabled at ingestion (ingestionPermissionOptions) and is enforced at query time by passing the user's bearer token

    response = client.retrieve(
        retrieval_request=request,
        x_ms_query_source_authorization="Bearer <user-token>",
    )

Foundry IQ won't solve every retrieval problem.
But when your agents need grounded, permission-aware answers from content scattered across SharePoint, Blob Storage, and OneLake, it handles the hard parts so you can focus on what your agent actually does.

The Future of Agentic AI: Inside Microsoft Agent Framework 1.0
Agentic AI is rapidly moving beyond demos and chatbots toward long-running, autonomous systems that reason, call tools, collaborate with other agents, and operate reliably in production. On April 3, 2026, Microsoft marked a major milestone with the General Availability (GA) release of Microsoft Agent Framework 1.0, a production-ready, open-source framework for building agents and multi-agent workflows in .NET and Python.

In this post, we'll deep-dive into:
• What Microsoft Agent Framework actually is
• Its core architecture and design principles
• What's new in version 1.0
• How it differs from other agent frameworks
• When and how to use it, with real code examples

What Is Microsoft Agent Framework?
According to the official announcement, Microsoft Agent Framework is an open-source SDK and runtime for building AI agents and multi-agent workflows with strong enterprise foundations. Agent Framework provides two primary capability categories:

1. Agents
Agents are long-lived runtime components that:
• Use LLMs to interpret inputs
• Call tools and MCP servers
• Maintain session state
• Generate responses
They are not just prompt wrappers, but stateful execution units.

2. Workflows
Workflows are graph-based orchestration engines that:
• Connect agents and functions
• Enforce execution order
• Support checkpointing and human-in-the-loop scenarios

This leads to a clean separation of responsibilities:

Concern | Handled By
Reasoning & interpretation | Agent
Execution policy & control flow | Workflow

This separation is a foundational design decision.
High‑Level Architecture From the official overview, Agent Framework is composed of several core building blocks: Model clients (chat completions & responses) Agent sessions (state & conversation management) Context providers (memory and retrieval) Middleware pipeline (interception, filtering, telemetry) MCP clients (tool discovery and invocation) Workflow engine (graph‑based orchestration) Conceptual Flow 🌟 What’s New in Version 1.0 Version 1.0 marks the transition from "Release Candidate" to "General Availability" (GA). Production-Ready Stability: Unlike the earlier experimental packages, 1.0 offers stable APIs, versioned releases, and a commitment to long-term support (LTS). A2A Protocol (Agent-to-Agent): A new structured messaging protocol that allows agents to communicate across different runtimes. For example, an agent built in Python can seamlessly coordinate with an agent running in a .NET environment. MCP (Model Context Protocol) Support: Full integration with the Model Context Protocol, enabling agents to dynamically discover and invoke external tools and data sources without manual integration code. Multi-Agent Orchestration Patterns: Stable implementations of complex patterns, including: Sequential: Linear handoffs between specialized agents. Group Chat: Collaborative reasoning where agents discuss and solve problems. Magentic-One: A sophisticated pattern for task-oriented reasoning and planning. Middleware Pipeline: The new middleware architecture lets you inject logic into the agent's execution loop without modifying the core prompts. This is essential for Responsible AI (RAI), allowing you to add content safety filters, logging, and compliance checks globally. DevUI Debugger: A browser-based local debugger that provides a real-time visual representation of agent message flows, tool calls, and state changes. 
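The middleware idea described above can be illustrated with a small, framework-independent sketch. This is not the Agent Framework middleware API: build_pipeline, logging_middleware, safety_middleware, and fake_agent are hypothetical names, and the blocked-term check is a stand-in assumption for a real content safety filter. It only shows the general shape of wrapping an agent call with logging and safety logic without touching the prompt.

```python
import asyncio
from typing import Awaitable, Callable, List

Handler = Callable[[str], Awaitable[str]]
Middleware = Callable[[str, Handler], Awaitable[str]]


def build_pipeline(middlewares: List[Middleware], core: Handler) -> Handler:
    """Wrap `core` so the middlewares run in list order, outermost first."""
    handler = core
    for mw in reversed(middlewares):
        # Default arguments bind the current values and avoid late-binding bugs.
        async def wrapped(message: str, _mw: Middleware = mw, _next: Handler = handler) -> str:
            return await _mw(message, _next)

        handler = wrapped
    return handler


audit_log: List[str] = []


async def logging_middleware(message: str, next_handler: Handler) -> str:
    """Record every input and output that passes through the pipeline."""
    audit_log.append(f"in: {message}")
    reply = await next_handler(message)
    audit_log.append(f"out: {reply}")
    return reply


BLOCKED_TERMS = {"password"}  # assumed stand-in for a real safety filter


async def safety_middleware(message: str, next_handler: Handler) -> str:
    """Short-circuit the call when the input contains a blocked term."""
    if any(term in message.lower() for term in BLOCKED_TERMS):
        return "Request blocked by content policy."
    return await next_handler(message)


async def fake_agent(message: str) -> str:
    # Placeholder for a real agent invocation.
    return f"echo: {message}"


pipeline = build_pipeline([logging_middleware, safety_middleware], fake_agent)
```

Because the logging middleware sits outermost, blocked requests are still recorded in the audit log even though they never reach the agent.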
Code Examples

Creating a Simple Agent (C#)
From Microsoft Learn:

    using Azure.AI.Projects;
    using Azure.Identity;
    using Microsoft.Agents.AI;

    AIAgent agent = new AIProjectClient(
            new Uri("https://your-foundry-service.services.ai.azure.com/api/projects/your-project"),
            new AzureCliCredential())
        .AsAIAgent(
            model: "gpt-5.4-mini",
            instructions: "You are a friendly assistant. Keep your answers brief.");

    Console.WriteLine(await agent.RunAsync("What is the largest city in France?"));

This shows:
• Provider-agnostic model access
• Session-aware agent execution
• Minimal setup for production agents

Creating a Simple Agent (Python)

    import asyncio

    from agent_framework.foundry import FoundryChatClient
    from azure.identity import AzureCliCredential


    async def main() -> None:
        client = FoundryChatClient(
            project_endpoint="https://your-foundry-service.services.ai.azure.com/api/projects/your-project",
            model="gpt-5.4-mini",
            credential=AzureCliCredential(),
        )
        agent = client.as_agent(
            name="HelloAgent",
            instructions="You are a friendly assistant. Keep your answers brief.",
        )
        # agent.run is awaited, so it has to be called from inside a coroutine.
        result = await agent.run("What is the largest city in France?")
        print(result)


    asyncio.run(main())

The same agent abstraction applies across languages.

When to Use Agents vs Workflows
Microsoft provides clear guidance:

Use an Agent when… | Use a Workflow when…
Task is open-ended | Steps are well-defined
Autonomous tool use is needed | Execution order matters
Single decision point | Multiple agents/functions collaborate

Key principle: If you can solve the task with deterministic code, do that instead of using an AI agent.

🔄 How It Differs from Other Frameworks
Microsoft Agent Framework 1.0 distinguishes itself by focusing on "Enterprise Readiness" and "Interoperability."

Feature | Microsoft Agent Framework 1.0 | Semantic Kernel / AutoGen | LangChain / CrewAI
Philosophy | Unified, production-ready SDK. | Research-focused or tool-specific. | High-level, developer-friendly abstractions.
Integration | Deeply integrated with Microsoft Foundry and Azure. | Varied; often requires more glue code. | Generally cloud-agnostic.
Interoperability | Native A2A and MCP for cross-framework tasks. | Limited to internal ecosystem. | Uses proprietary connectors.
Runtime | Identical API parity for .NET and Python. | Primarily Python-first (SK has C#). | Primarily Python.
Control | Graph-based deterministic workflows. | More non-deterministic/experimental. | Mixture of role-based and agentic.

🛠️ Key Technical Components
• Agent Harness: The execution layer that provides agents with controlled access to the shell, file system, and messaging loops.
• Agent Skills: A portable, file-based or code-defined format for packaging domain expertise.

Implementation Tip: If you are coming from Semantic Kernel, Microsoft provides migration assistants that analyze your existing code and generate step-by-step plans to upgrade to the new Agent Framework 1.0 standards.

• Microsoft Agent Framework Version 1.0 | Microsoft Agent Framework
• Agent Framework documentation

🎯 Summary
Microsoft Agent Framework 1.0 is the "grown-up" version of AI orchestration. By standardizing the way agents talk to each other (A2A), discover tools (MCP), and process information (Middleware), Microsoft has provided a clear path for taking AI experiments into production.

For more detailed guides, check out the official Microsoft Agent Framework Documentation.

Microsoft Agent Framework - .NET AI Community Standup

Stop Experimenting, Start Building: AI Apps & Agents Dev Days Has You Covered
The AI landscape has shifted. The question is no longer “Can we build AI applications?” it’s “Can we build AI applications that actually work in production?” Demos are easy. Reliable, scalable, resilient AI systems that handle real-world complexity? That’s where most teams struggle. If you’re an AI developer, software engineer, or solution architect who’s ready to move beyond prototypes and into production-grade AI, there’s a series built specifically for you. What Is AI Apps & Agents Dev Days? AI Apps & Agents Dev Days is a monthly technical series from Microsoft Reactor, delivered in partnership with Microsoft and NVIDIA. You can explore the full series at https://developer.microsoft.com/en-us/reactor/series/s-1590/ This isn’t a slide deck marathon. The series tagline says it best: “It’s not about slides, it’s about building.” Each session tackles real-world challenges, shares patterns that actually work, and digs into what’s next in AI-driven app and agent design. You bring your curiosity, your code, and your questions. You leave with something you can ship. The sessions are led by experienced engineers and advocates from both Microsoft and NVIDIA, people like Pamela Fox, Bruno Capuano, Anthony Shaw, Gwyneth Peña-Siguenza, and solutions architects from NVIDIA’s Cloud AI team. These aren’t theorists; they’re practitioners who build and ship the tools you use every day. What You’ll Learn The series covers the full spectrum of building AI applications and agent-based systems. Here are the key themes: Building AI Applications with Azure, GitHub, and Modern Tooling Sessions walk through how to wire up AI capabilities using Azure services, GitHub workflows, and the latest SDKs. The focus is always on code-first learning, you’ll see real implementations, not abstract architecture diagrams. Designing and Orchestrating AI Agents Agent development is one of the series’ strongest threads. 
Sessions cover how to build agents that orchestrate long-running workflows, persist state automatically, recover from failures, and pause for human-in-the-loop input, without losing progress. For example, the session “AI Agents That Don’t Break Under Pressure” demonstrates building durable, production-ready AI agents using the Microsoft Agent Framework, running on Azure Container Apps with NVIDIA serverless GPUs. Scaling LLM Inference and Deploying to Production Moving from a working prototype to a production deployment means grappling with inference performance, GPU infrastructure, and cost management. The series covers how to leverage NVIDIA GPU infrastructure alongside Azure services to scale inference effectively, including patterns for serverless GPU compute. Real-World Architecture Patterns Expect sessions on container-based deployments, distributed agent systems, and enterprise-grade architectures. You’ll learn how to use services like Azure Container Apps to host resilient AI workloads, how Foundry IQ fits into agent architectures as a trusted knowledge source, and how to make architectural decisions that balance performance, cost, and scalability. Why This Matters for Your Day Job There’s a critical gap between what most AI tutorials teach and what production systems actually require. This series bridges that gap: Production-ready patterns, not demos. Every session focuses on code and architecture you can take directly into your projects. You’ll learn patterns for state persistence, failure recovery, and durable execution — the things that break at 2 AM. Enterprise applicability. The scenarios covered — travel planning agents, multi-step workflows, GPU-accelerated inference — map directly to enterprise use cases. Whether you’re building internal tooling or customer-facing AI features, the patterns transfer. Honest trade-off discussions. The speakers don’t shy away from the hard questions: When do you need serverless GPUs versus dedicated compute? 
How do you handle agent failures gracefully? What does it actually cost to run these systems at scale?

Watch On-Demand, Build at Your Own Pace
Every session is available on-demand. You can watch, pause, and build along at your own pace; there is no need to rearrange your schedule. The full playlist is available at

This is particularly valuable for technical content. Pause a session while you replicate the architecture in your own environment. Rewind when you need to catch a configuration detail. Build alongside the presenters rather than just watching passively.

What You'll Walk Away With
After working through the series, you'll have:
• Practical agent development skills: how to design, orchestrate, and deploy AI agents that handle real-world complexity, including state management, failure recovery, and human-in-the-loop patterns
• Production architecture patterns: battle-tested approaches for deploying AI workloads on Azure Container Apps, leveraging NVIDIA GPU infrastructure, and building resilient distributed systems
• Infrastructure decision-making confidence: a clearer understanding of when to use serverless GPUs, how to optimise inference costs, and how to choose the right compute strategy for your workload
• Working code and reference implementations: the sessions are built around live coding and sample applications (like the Travel Planner agent demo), giving you starting points you can adapt immediately
• A framework for continuous learning: with new sessions each month, you'll stay current as the AI platform evolves and new capabilities emerge

Start Building
The AI applications that will matter most aren't the ones with the flashiest demos. They're the ones that work reliably, scale gracefully, and solve real problems. That's exactly what this series helps you build.
Whether you're designing your first AI agent system or hardening an existing one for production, the AI Apps & Agents Dev Days sessions give you the patterns, tools, and practical knowledge to move forward with confidence. Explore the series at https://developer.microsoft.com/en-us/reactor/series/s-1590/ and start watching the on-demand sessions at the link above. The best time to level up your AI engineering skills was yesterday. The second-best time is right now, and these sessions make it easy to start.

Join our free livestream series on hosting agents in Microsoft Foundry
Join us for a new 3-part livestream series where we deploy AI agents on Microsoft Foundry using Microsoft Agent Framework and LangChain/LangGraph, then level them up with tools, observability, and evals.

You'll learn how to:
• Deploy Python agents to Foundry Hosted agents using the Azure Developer CLI
• Build hosted agents with Microsoft Agent Framework, including Foundry IQ integration
• Build hosted agents with LangChain + LangGraph, including built-in tools like Web Search
• Run quality and safety evaluations: continuous evals, scheduled evals, guardrails, and red-teaming

Throughout the series, we'll use Python for all examples and share full code so you can run everything yourself in your own Foundry projects.

👉 Register for the full series.

Spanish speaker? ¡Tendremos una serie para hispanohablantes! Regístrese aquí

In addition to the live streams, you can also join the Microsoft Foundry Discord to ask follow-up questions after each stream.

If you are new to generative AI with Python, start with our 9-part Python + AI series, which covers topics such as LLMs, embeddings, RAG, tool calling, MCP, and agents. If you are new to Microsoft Agent Framework, watch our 6-part Python + Agent series, which dives deep into agents and workflows.

To learn more about each live stream or register for individual sessions, scroll down:

Host your agents on Foundry: Microsoft Agent Framework
27 April, 2026 | 5:00 PM - 6:00 PM UTC
Register for the stream on Reactor
In our first session, we'll deploy agents built with Microsoft Agent Framework (the successor to AutoGen and Semantic Kernel). Starting with a simple agent, we'll add Foundry tools like Code Interpreter, ground the agent in enterprise data with Foundry IQ, and finally deploy multi-agent workflows. Along the way, we'll use the Foundry UI to interact with the hosted agent, testing it out in the playground and observing the traces from the reasoning and tool calls.
Host your agents on Foundry: LangChain + LangGraph
29 April, 2026 | 5:00 PM - 6:00 PM UTC
Register for the stream on Reactor
In our second session, we'll deploy agents built with the popular open-source libraries LangChain and LangGraph. Starting with a simple agent, we'll add Foundry tools like Bing Web Search, ground the agent in Foundry IQ, then deploy more complex agents using the LangGraph orchestration framework. Along the way, we'll use the Foundry UI to interact with the hosted agent, testing it out in the playground and observing the traces from the reasoning and tool calls.

Host your agents on Foundry: Quality & safety evaluations
30 April, 2026 | 5:00 PM - 6:00 PM UTC
Register for the stream on Reactor
In our third session, we'll ensure that our AI agents are producing high-quality outputs and operating safely and responsibly. First we'll explore what it means for agent outputs to be high quality, using built-in evaluators to check overall task adherence and then building custom evaluators for domain-specific checks. With Foundry hosted agents, we can run bulk evaluations on demand, set up scheduled evaluations, and even enable continuous evaluation on a subset of live agent traces. Next we'll discuss safety systems that can be layered on top of agents and audit agents for potential safety risks. To improve compliance with an organization's goals, we can configure custom policies and guardrails that can be shared across agents. Finally, we can ensure that adversarial inputs can't produce unsafe outputs by running automated red-teaming scans on agents, and even schedule those to run regularly as well. With all of these evaluation and compliance features available in Foundry, you can have more confidence hosting your agents in production.

Microsoft 365 multi-agent workflow with Microsoft Agent Framework
Learn how to design and run a multi-agent workflow with Microsoft Agent Framework: from building a coordinated set of specialized agents and tools, to hosting and deploying them with Azure AI Foundry, and finally exposing the same workflow to users in Microsoft 365 (Teams or Copilot). This walkthrough demonstrates a practical end-to-end pattern for orchestrating agents, adding tools, and packaging the solution for real-world applications.

Claim your IQ Series: Foundry IQ badge
The IQ Series kicked off with three Foundry IQ episodes, each paired with a hands-on cookbook. If you've worked through all three or you're planning to, there's now a digital badge waiting for you to claim! What the badge represents The IQ Series: Foundry IQ badge recognizes developers who've completed the full Foundry IQ curriculum end-to-end: not just watched the episodes, but deployed the Azure resources, run every notebook, and built working knowledge bases against live data. Earners have: Deployed AI Search, Azure OpenAI, a Foundry project, and Azure Blob Storage with seeded sample data Connected structured and unstructured sources into Foundry IQ Built and queried multi-source AI knowledge bases Grounded agent responses in permission-aware enterprise knowledge Badges are issued by the Global AI Community, so you'll want an account there before you submit. What the three episodes cover Episode 1 — Unlocking Knowledge for Your Agents. Introduces Foundry IQ and the core ideas behind it. The episode explains how AI agents work with knowledge and walks through the main components of Foundry IQ that support knowledge-driven applications. Episode 2 — Building the Data Pipeline with Knowledge Sources. Focuses on Knowledge Sources and how different types of content flow into Foundry IQ across SharePoint, Fabric, OneLake, Azure Blob Storage, Azure AI Search, and the web. Episode 3 — Querying the Multi-Source AI Knowledge Bases. Dives into Knowledge Bases and how multiple knowledge sources can be organized behind a single endpoint. The episode demonstrates how AI systems query across these sources and synthesize information to answer complex questions. Each episode is paired with a cookbook for you to learn hands-on and each of them reuses the same Azure deployment, so you set up once and build across all three. How to claim the badge Four steps, in order: Fork the IQ Series repo and work through all three episode cookbooks in your fork. 
Commit your notebooks with cell outputs saved! That's the proof of completion. Capture a final output screenshot for each episode. Your GitHub username or Azure resource name needs to be visible in the screenshot. Submit a badge request issue. The template walks you through fork URLs, screenshots, and one brief technical takeaway per episode. Complete the badge form. This step is required. Without the form, we can't issue the badge.

Why this badge is worth your time
The IQ Series recognizes your hands-on learning with real infrastructure, real indexed data, real agents and queries. If you're working on enterprise AI (grounding, retrieval, knowledge-aware agents), this is a concrete artifact that says: I've built this, end to end, on the actual platform. Work IQ and Fabric IQ are coming next, and each phase will have its own badge. Foundry IQ is your head start on the full IQ Series.

👉 Start with Episodes or jump straight to the cookbooks if you prefer to learn by doing. Questions along the way? Create an issue in the repo or drop into our Discord. The Foundry IQ team and community are there to help.

Building an Enterprise HR Chatbot with Multi-Strategy RAG and Live Agent Handoff on Azure
HR teams deal with thousands of employee questions every day — policy lookups, leave balances, case escalations, and sensitive topics like harassment or misconduct. AI chatbots can handle the routine stuff and free up HR advisors for the hard cases. But most chatbot projects get stuck at basic Q&A. They can't handle multi-country policies, employee slang, or smooth handoffs to a real person. This post covers how we built Eva, a production HR chatbot using Microsoft Bot Framework and Semantic Kernel on Azure. I'll focus on three problems and how we solved them: Getting accurate answers when employees and policy documents use different words Handing off to a live human advisor in real time Catching answer quality regressions automatically Why basic RAG isn't enough for HR Retrieval-Augmented Generation (RAG) — fetching relevant documents and feeding them to an LLM — is the standard approach. But plain RAG breaks down in HR for a few reasons: Vocabulary mismatch. An employee asks "How does misconduct affect my ACB?" but the policy document says "Annual Cash Bonus eligibility criteria." The search doesn't connect the two. Multi-country ambiguity. The same question can have different answers depending on the employee's country, grade, or role. Sensitive topics. Questions about harassment, disability, or whistleblowing should go to a human, not get an AI-generated answer. Ranking noise. Search results often include globally relevant but locally irrelevant documents. Eva handles these with a layered pipeline: query augmentation → multi-index search → LLM reranking → answer generation → citation handling. 
Architecture at a glance

Layer | Technology
Bot framework | Microsoft Agents SDK (aiohttp)
LLM orchestration | Semantic Kernel
Primary LLM | Azure OpenAI Service (GPT-4.1 / GPT-4o)
Knowledge search | Azure AI Search (hybrid + vector)
Live agent chat | Salesforce MIAW via server-sent events
Evaluation | Azure AI Evaluation SDK + custom LLM judge
Config | Pydantic-settings + Azure App Configuration + Key Vault

Four retrieval strategies, controlled by feature flags
Instead of one search approach, Eva supports four, toggled by feature flags so we can A/B test per country without code changes. They run in a priority cascade:

HyDE (Hypothetical Document Embeddings)
Instead of searching with the employee's question, the LLM first generates a hypothetical policy document that would answer it. We embed that synthetic document and use it as the search query. Since a hypothetical answer is closer in embedding space to the real answer than the original question is, this bridges vocabulary gaps effectively.

Step-back prompting
The LLM broadens the question. "How does misconduct affect my ACB?" becomes "What is the Annual Cash Bonus policy and what factors affect eligibility?" This works well when answers live in broader policy sections.

Query rewrite
The LLM expands abbreviations and adds HR domain context, then runs a hybrid (text + vector) search.

Standard search (fallback)
Basic intent classification with hybrid search. No augmentation.

All four strategies return the same Pydantic model, so the rest of the pipeline doesn't care which one ran. The team can enable HyDE globally, roll out step-back to specific countries, or revert instantly if something underperforms.

LLM reranking
After pulling results from both a country-specific index and a global index, Eva optionally reranks them using a RankGPT-style approach: the LLM scores document relevance with a bias toward local content. If reranking fails for any reason, it falls back to the original ordering so the pipeline keeps moving.
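The fallback behaviour of the reranking step can be sketched as below. This is an illustrative sketch under stated assumptions, not Eva's actual code: rerank_with_fallback, keyword_overlap_scorer, the "scope" field, and the LOCAL_BONUS value are all hypothetical, and the keyword scorer stands in for the LLM relevance call.

```python
import logging
from typing import Callable, Dict, List, Sequence

logger = logging.getLogger(__name__)

Document = Dict[str, str]
Scorer = Callable[[str, Document], float]

LOCAL_BONUS = 0.5  # assumed bias toward country-specific documents


def keyword_overlap_scorer(query: str, doc: Document) -> float:
    """Toy stand-in for an LLM relevance score: count shared words."""
    query_words = set(query.lower().split())
    return float(sum(1 for word in doc["text"].lower().split() if word in query_words))


def rerank_with_fallback(
    query: str,
    docs: Sequence[Document],
    scorer: Scorer,
    local_bonus: float = LOCAL_BONUS,
) -> List[Document]:
    """Rerank docs by score, favouring local ones; keep the input order on failure."""
    try:
        scored = []
        for doc in docs:
            score = scorer(query, doc)
            if doc.get("scope") == "local":
                score += local_bonus
            scored.append((score, doc))
        # sorted() is stable, so tied documents keep their original relative order.
        ranked = sorted(scored, key=lambda pair: pair[0], reverse=True)
        return [doc for _, doc in ranked]
    except Exception:
        # Any scorer failure (timeout, bad output) falls back to the search order.
        logger.warning("Reranking failed; keeping original order", exc_info=True)
        return list(docs)
```

Catching the exception at the reranker boundary keeps a failed scoring call from blocking answer generation; the cost is that results are then only as good as the raw search order.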
Answer generation with local vs. global context

The answer stage separates retrieved documents into local context (country-specific) and global context (company-wide), injected as distinct prompt sections. The LLM returns a structured response with reasoning, the actual answer, citations, and a coverage classification (full, partial, or denial).

Prompts are stored as version-controlled .txt files with per-model variants (e.g., gpt-4o.txt, gpt-4.1.txt), resolved at runtime. This makes prompts reviewable in PRs and deployable without code changes.

Live agent handoff with Salesforce

When Eva determines a question needs a human (a sensitive topic, a complex case, or the employee simply asks), it hands off to a Salesforce advisor in real time.

- SSE streaming. Eva keeps a persistent HTTP connection to Salesforce for real-time messages, typing indicators, and session end signals.
- Session resilience. Session state persists across three layers (in-memory cache, Azure Cosmos DB, and Bot Framework turn state) to survive restarts and failovers.
- Message delivery workers. Each session has a dedicated async worker with exponential backoff retry. Overflow messages go to a failed messages list rather than being silently dropped.
- Queue position updates. While employees wait, Eva queries Salesforce for queue position and sends rate-limited updates.
- Context handoff. On session start, Eva sends the full conversation transcript so advisors don't ask employees to repeat themselves.

Automated evaluation

Eva includes an evaluation framework that runs as a separate process, testing against ground-truth Q&A pairs from CSV files.

Factual questions are scored using Azure AI's SimilarityEvaluator on a 1–5 scale, with optional relevance and groundedness checks.

Sensitive questions (harassment, disability, whistleblowing) use a custom LLM judge that checks whether the response acknowledges sensitivity and directs the employee to create a case or speak with an advisor.
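The split between factual and sensitive test cases can be sketched as a small routing loop over the ground-truth CSV. This is only an illustration under stated assumptions: the CSV column names, the `category` values, and both scoring functions are hypothetical. In particular, `score_factual` is a crude token-overlap stand-in and not the Azure AI Evaluation SDK's SimilarityEvaluator, and `judge_sensitive` is a keyword check standing in for the LLM judge.

```python
import csv
import io
from typing import Callable

# Hypothetical CSV layout: question, expected answer, and a category column.
GROUND_TRUTH_CSV = """question,ground_truth,category
How many days of annual leave do I get?,Employees receive 25 days of annual leave.,factual
I want to report harassment by my manager.,,sensitive
"""


def score_factual(response: str, ground_truth: str) -> int:
    """Stand-in 1-5 score based on token overlap (NOT the Azure evaluator)."""
    expected = set(ground_truth.lower().split())
    actual = set(response.lower().split())
    overlap = len(expected & actual) / max(len(expected), 1)
    return 1 + round(overlap * 4)


def judge_sensitive(response: str) -> bool:
    """Stand-in for the LLM judge: acknowledges sensitivity AND redirects to a human."""
    text = response.lower()
    acknowledges = any(word in text for word in ("sorry", "sensitive", "understand"))
    redirects = any(
        phrase in text for phrase in ("create a case", "speak with an advisor")
    )
    return acknowledges and redirects


def evaluate(csv_text: str, answer_fn: Callable[[str], str]) -> list[dict]:
    """Run every ground-truth row through the bot and the matching scorer."""
    results = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        response = answer_fn(row["question"])
        result = {"question": row["question"], "category": row["category"]}
        if row["category"] == "sensitive":
            # Behavioral check: pass/fail rather than a similarity score.
            result["passed"] = judge_sensitive(response)
        else:
            # Factual check: compare against the expected answer.
            result["score"] = score_factual(response, row["ground_truth"])
        results.append(result)
    return results
```

Here `answer_fn` is whatever callable returns the bot's reply for a question, so the same loop works against a stub in unit tests or the real bot in a full run.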
A deviation detector flags score drops between runs. SQLite stores results for trending, and Application Insights powers dashboards. Long evaluation runs support resume: the framework skips already-completed test cases on restart.

Key takeaways

- Make retrieval strategies swappable. Feature flags let you A/B test without redeploying.
- Separate local and global knowledge explicitly. Don't rely on the LLM to figure out which country's policy applies.
- Invest in evaluation early. Ground-truth datasets with factual and behavioral scoring catch regressions that manual testing misses.
- Build resilience into live agent handoff. Multi-tier session recovery and retry logic prevent dropped conversations.
- Treat prompts as code. File-based, model-variant-aware prompts are easier to maintain than inline strings.
- Use Pydantic for structured LLM outputs. Typed models catch bad output at the validation boundary instead of letting it propagate.

Get started

- Semantic Kernel documentation: LLM orchestration with plugins and structured outputs
- Azure OpenAI Service quickstart: Deploy GPT-4o or GPT-4.1
- Azure AI Search vector search tutorial: Hybrid and vector search indices
- Microsoft Bot Framework SDK: Build bots for Teams and web
- Azure AI Evaluation SDK: Score for similarity, relevance, and groundedness