Platform Improvements for Python AI Apps on Azure App Service
Overview

Azure App Service (Linux) is a fully managed PaaS offering that supports a broad range of languages, including Python, Node.js, .NET, PHP, and Java. Developers can push source code or deploy a pre-built artifact; the platform handles the rest, including dependency installation, application containerization, and running the application at cloud scale.

More customers are building intelligent applications using Azure AI Foundry and other AI services, and Python has become a language of choice for these workloads. The performance and reliability of the Python deployment pipeline directly shape the developer's experience on the platform, so we looked across the deployment path for opportunities to reduce latency and improve reliability. The first set of changes has reduced Python deployment latency on Azure App Service Linux by approximately 30%. This is the first step in a broader effort to make the platform better suited for AI application development, but the gains resulting from this effort will benefit all apps on the platform. Let's look at the details.

Where Deployment Time Was Going

Python web application deployments on Azure App Service Linux rely on Oryx, the platform's open-source build system, to produce runnable artifacts during remote builds. Platform telemetry showed that around 70% of Python app deployments use remote builds, and the majority of those resolve dependencies via requirements.txt using pip install. To understand where time was going, we profiled a stress workload: a 7.5 GB PyTorch application. Most production builds are smaller, but stress-testing a dependency-heavy application made the pipeline bottlenecks clear.

When a Python app is deployed via remote build, the build container in Kudu (the App Service deployment service) runs Oryx to:

1. Extract the uploaded source code.
2. Create a Python virtual environment.
3. Install dependencies via pip install; 4.35 min (~34% of build time).
4. Copy files to a staging directory; 0.98 min (~8%).
5. Compress via tar + gzip into an archive; 7.53 min (~58%).
6. Write the archive to /home (an Azure Storage SMB mount).

The app container then extracts this archive to the local disk on every cold start.

Why the Archive-Based Approach?

The /home directory is backed by an Azure Storage SMB mount, where small-file I/O is comparatively expensive. Python dependencies are file-heavy: virtual environments commonly contain tens of thousands of files, and dependency-heavy ML applications can exceed 200,000 files. Writing those files individually over SMB would be prohibitively slow. Instead, the pipeline builds on the container's local filesystem, writes a single compressed archive over SMB, and the app container extracts it locally on startup for efficient module loading.

Key insight: Compression was the single largest phase at 58% of build time, longer than installing the packages themselves.

What We Changed

Zstandard Compression (Replacing gzip)

Standard gzip compression is single-threaded. In our benchmark, compression accounted for 58% of total build time, making it the dominant bottleneck. Because the archive is also decompressed during container startup, decompression time affects runtime startup latency as well. We evaluated three compression algorithms: gzip, LZ4, and Zstandard (zstd).
The following results are averaged across multiple deployments of a 7.5 GB Python application with PyTorch and additional ML packages:

| Metric | gzip | LZ4 | zstd |
|---|---|---|---|
| Compression time | 7.53 min | 1.20 min | 1.18 min |
| Decompression time | 2.80 min | 1.18 min | 1.07 min |
| Archive size | 4.0 GB | 5.0 GB | 4.8 GB |

Both zstd and LZ4 were more than 6× faster than gzip for compression and more than 2× faster for decompression. We selected zstd for the following reasons:

- Comparable speed to LZ4, with smaller archive sizes (4.8 GB vs. 5.0 GB).
- Mature ecosystem: the zstd format is specified in RFC 8878, published in 2021, and zstd ships with many common Linux distributions.
- Native tar support: `tar -I zstd` works out of the box; no extra packages required.

Result: Compression time dropped from 7.53 min → 1.18 min (6.4× faster). Decompression improved from 2.80 min → 1.07 min (2.6× faster), directly reducing cold-start latency.
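The same mechanism is easy to reproduce outside the platform. Here is a minimal sketch of creating and extracting a zstd-compressed tar archive by shelling out to GNU tar's `-I` option from Python; the directory and archive names are illustrative, and `zstd -T0` asks zstd to use all available cores:

```python
# Sketch: create and extract a zstd-compressed tar archive via GNU tar.
# Paths here are examples, not the platform's actual locations.
import subprocess

# Compress the build directory into a single archive (multi-threaded zstd).
subprocess.run(
    ["tar", "-I", "zstd -T0", "-cf", "app.tar.zst", "-C", "build_dir", "."],
    check=True,
)

# Extract the archive, as the app container would on cold start.
subprocess.run(
    ["tar", "-I", "zstd", "-xf", "app.tar.zst", "-C", "app_dir"],
    check=True,
)
```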
Faster Package Installation with uv

pip is implemented in Python and has historically optimized for compatibility over maximum parallelism. In dependency-heavy workloads, package download, resolution, and installation can become a major part of deployment time. In our 7.5 GB PyTorch benchmark, package installation accounted for ~34% of total build time (4.35 min out of 12.86 min). We introduced uv, a Python package manager written in Rust, as the primary installer for compatible requirements.txt deployments. Its `uv pip install` interface works with standard pip workflows.

- Fallback strategy: Compatibility remains the priority. When uv cannot handle a deployment, the platform retries with pip, preserving the behavior customers already depend on.
- Cache behavior: Package caches remain local to the build container. When the same app is deployed again before the Kudu (build) container is recycled, both pip and uv can reuse cached packages and avoid repeated downloads.

Result: Package installation time dropped from 4.35 min → 1.50 min (3× faster).

Reducing File Copy Overhead

A file copy showed up in two places. First, before compression, the build process copied the entire build directory (application code plus Python packages) to a staging location. This existed historically as a safety measure: creating a clean snapshot before tar reads the file tree. But the cost was steep given the large number of files inherent in Python dependencies. The fix was straightforward: create the tar archive directly from the build directory, skipping the intermediate copy entirely.

Second, for pre-built deployment scenarios, we replaced the legacy Kudu sync path with Linux-native rsync. That gave us a better-optimized tool for large Linux file trees and reduced the overhead of moving files into the final deployment location. Because this path is used beyond Python, the improvement benefits pre-built apps across the broader App Service Linux ecosystem.

Result: Eliminated the 0.98-minute staging copy (8% of build time), reduced temporary disk usage, and improved the remaining file sync path.

Pre-Built Python Wheels Cache

We added a complementary optimization: a read-only cache of pre-built wheels for commonly used Python packages, selected using platform telemetry. The cache is mounted into the Kudu build container at runtime for Python workloads, allowing the installer to use local wheel artifacts before downloading packages externally. When a matching wheel is available, the installer uses it directly, avoiding a network fetch for that package. Cache misses fall back to the upstream registry (e.g., PyPI) as usual. The cache is managed by the platform and kept up to date, so supported Python builds can use it without any app change.

Combined Results

Controlled Benchmark (PyTorch 7.5 GB, P1mv3 App Service Tier)

The following benchmark was measured on the P1mv3 App Service tier. Values in the "After" column reflect the optimized pipeline with zstd compression, uv package installation, direct tar creation, and the pre-built wheels cache enabled together.

| Phase | Before | After | Improvement |
|---|---|---|---|
| Package installation | 4.35 min | 1.50 min | ~3× faster |
| File copy | 0.98 min | 0 min | Eliminated |
| Compression | 7.53 min | 1.18 min | ~6× faster |
| Total build time | 12.86 min | ~2.68 min | ~79% reduction |

Production Fleet (All Python Linux Web Apps)

Production telemetry across Python deployments shows the impact of these changes: deployment latency decreased by approximately 30% after the rollout. The controlled benchmark shows a larger improvement (~79%) because it exercises a dependency-heavy workload where package installation, file copy, and compression dominate total build time. Typical production apps are smaller and spend proportionally less time in those phases.

Beyond Faster Builds: Reliability and Runtime Performance

Faster builds only help when deployment requests reliably reach a worker that is ready to build. We updated the primary deployment clients (Azure CLI, GitHub Actions, and Azure DevOps Pipelines) to warm up Kudu before initiating deployments. Clients now issue a lightweight health-check request to the Kudu endpoint, helping ensure the deployment container is running and ready before the deployment begins. Clients also preserve affinity to the warmed-up worker using the ARR affinity cookie returned by the first request. This increases the chance that the deployment uses a worker with Kudu already running and local package caches already available from recent deployments. Together, these client-side changes reduced deployment failures from transient infrastructure issues and helped the pipeline optimizations reach the build phase reliably.

Result: Deployment failures caused by cold-start errors (HTTP 502, 503, 499) dropped by ~30%.

We also improved the default runtime configuration for Python apps using the platform-provided Gunicorn startup path. Previously, the platform defaulted to a single worker, leaving most CPU cores idle. Now, it follows Gunicorn's recommended worker formula, fully utilizing available cores on multi-core SKUs and delivering higher request throughput out of the box:

workers = (2 × NUM_CORES) + 1
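If you manage your own startup command rather than relying on the platform default, the same formula can be applied in a Gunicorn configuration file. A minimal sketch, following Gunicorn's standard `gunicorn.conf.py` convention:

```python
# gunicorn.conf.py: compute workers from available cores, per Gunicorn's
# recommended (2 x cores) + 1 formula.
import multiprocessing

workers = (2 * multiprocessing.cpu_count()) + 1
```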
Key Takeaways

- Measure before optimizing: Platform telemetry showed that remote builds and requirements.txt-based installs were the dominant Python deployment paths, which helped us focus on changes that would benefit the most customers.
- Compression was the biggest bottleneck: In the dependency-heavy benchmark, archive compression took longer than package installation. Replacing gzip with zstd reduced both build time and cold-start extraction time.
- File count matters: Python virtual environments can contain tens of thousands of files, and AI workloads can contain many more. Reducing unnecessary file copies and using Linux-native file sync helped lower overhead.
- Compatibility needs a fallback path: Introducing uv improved the common path, while falling back to pip preserved compatibility for apps that depend on existing Python packaging behavior.
- Deployment reliability is part of performance: Faster builds only help if deployment requests consistently reach a ready worker. Warm-up and worker affinity made the optimized path more reliable for customers.
- Beyond deployment: Runtime defaults, such as Gunicorn worker configuration, also affect how production apps perform once deployment is complete.

Together, these changes made Python deployments faster and more reliable while preserving compatibility through safe fallbacks. We will continue improving the platform to make Azure App Service faster, more reliable, and better suited for AI application development.

Give Your AI Agent Eyes: Browser-Harness Meets Playwright Workspaces Remote Browsers
What happens when you hand a coding agent a real browser — not a mock, not an API wrapper, but a full Chromium instance running in the cloud? It fills forms for you. It does research for you. It navigates JavaScript-heavy SPAs that would make any REST-based scraper weep. And it does it across 10+ parallel sessions without touching your local machine. This is the story of combining two tools that were built for different worlds — and discovering they're a perfect fit.

The Problem

Today's coding agents — Codex, Claude Code, Copilot — are extraordinary at reading and writing code. But ask one to check product availability on a website, and it hits a wall. Modern websites are JavaScript-rendered, authentication-gated, geolocation-aware, and hostile to simple HTTP requests. The agent needs a real browser. Not requests.get(). Not a headless puppeteer script you wrote last Tuesday. A browser that renders CSS, executes JavaScript, handles cookies, and lets the agent see what a human would see.

Enter Browser-Harness

Browser-harness is an open-source tool that gives AI agents direct control over a Chrome browser via the Chrome DevTools Protocol (CDP). It exposes a clean Python API:

```
● agent: wants to upload a file
│ ● agent-workspace/agent_helpers.py → helper missing
│ ● agent writes it agent_helpers.py
│   + custom helper
✓ file uploaded
```

One WebSocket to Chrome, nothing in between. The agent writes what's missing during execution. The harness improves itself every run. But there's a catch. Where does this browser run?

The Infrastructure Gap

If the browser runs locally, you've got problems:

- Your machine is busy. Running Chrome while the agent works eats RAM and CPU.
- No parallelism. One browser per machine. Want to scrape 10 sites simultaneously? Buy 10 machines.
- No consistency. Different OS, different Chrome versions, different results.
- No isolation. Letting the agent run amok on autopilot with your local browser is risky: it can reuse your credentials, stored cookies, and sessions.
- No observability. The agent is clicking around in a browser you can't see.

What you really want is a browser that runs somewhere else — managed, scalable, observable — and your agent just connects to it over a WebSocket.

Enter Playwright Workspaces

Playwright Workspaces provides exactly this: remote browser endpoints on Azure. You make an HTTP request, a Chromium instance spins up in the cloud, and you get back a WebSocket URL (wss://...) to connect via CDP. The key insight: browser-harness speaks CDP. Playwright Workspaces serves CDP. They snap together like LEGO.

Your Agent → browser-harness → CDP WebSocket → Playwright Workspaces → Cloud Chromium

No local Chrome needed. No browser installation. No display server. Just a WebSocket connection to a fully managed browser.

The Two-Step Connection Flow

Connecting them is surprisingly simple:

Step 1: Provision a remote browser

```python
import os
import uuid

def get_connect_options(os_name="linux", run_id=None) -> tuple[str, dict[str, str]]:
    service_url = os.getenv("PLAYWRIGHT_SERVICE_URL")
    service_access_token = os.getenv("PLAYWRIGHT_SERVICE_ACCESS_TOKEN")
    headers = {"Authorization": f"Bearer {service_access_token}"}
    # Use the environment-provided run ID, the caller's run_id, or a fresh one.
    service_run_id = os.getenv("PLAYWRIGHT_SERVICE_RUN_ID") or run_id or str(uuid.uuid4())
    ws_endpoint = f"{service_url}?os={os_name}&runId={service_run_id}&api-version=2025-09-01"
    return ws_endpoint, headers
```

Step 2: Point browser-harness at it

```
export BU_CDP_WS="${session_url}"
browser-harness -c "print(page_info())"
# → {'url': 'about:blank', 'title': '', 'w': 780, 'h': 441}
```

That's it. Your agent now controls a cloud browser.
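If you want to sanity-check the endpoint from plain Playwright before handing it to browser-harness, the same WebSocket URL can be exercised directly. A minimal sketch, assuming the `playwright` Python package, the `get_connect_options` helper above, and that the endpoint speaks CDP as described (Playwright's `connect_over_cdp` accepts custom headers):

```python
from playwright.sync_api import sync_playwright

ws_endpoint, headers = get_connect_options()  # helper from Step 1

with sync_playwright() as p:
    # Attach to the remote cloud Chromium over CDP.
    browser = p.chromium.connect_over_cdp(ws_endpoint, headers=headers)
    page = browser.new_page()
    page.goto("https://example.com")
    print(page.title())
    browser.close()
```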
What This Unlocks: A Real-World Demo

We gave a coding agent this prompt: "Go to Website1, search for gifts under ₹500 for 10-year-old kids. Must be useful, reusable (not single-use). Delivery in Bengaluru within 3 days. Must have 5 pieces available."

Here's what the agent did — autonomously, with no human intervention:

1. Provisioned a remote Chromium browser via Playwright Workspaces
2. Connected browser-harness to the cloud browser over WebSocket
3. Navigated to FirstCry.com
4. Set delivery location to Bengaluru (pincode 560001)
5. Searched for kids' gifts
6. Applied filters — price ₹0–250 and ₹250–500 via JavaScript DOM interaction
7. Browsed products, rejecting single-use items (greeting cards) in favor of reusable ones (stainless steel water bottles)
8. Checked delivery dates — rejected items with 6-day delivery, found ones with Next Day Delivery
9. Verified stock availability — confirmed ADD TO CART was active with no stock warnings
10. Took screenshots at every step for audit and debugging

Result: Found the Brand A 600 Stainless Steel Water Bottle at ₹444.69 with next-day delivery to Bengaluru. All criteria met. The entire workflow ran on a remote browser in Azure — the local machine never launched Chrome.

The Power of Remote Endpoints

Why does running browsers remotely change everything?

1. Massive Parallelism. Spin up multiple remote browsers and work in parallel. Each gets its own isolated Chromium instance. No resource contention, no port conflicts.

2. Zero Local Dependencies. No Chrome installation. No chromedriver version mismatches. No --no-sandbox hacks. The browser is a managed service — you just connect to it.

3. Geographic Flexibility. Remote browsers run in Azure data centers. Need to see what a website looks like from East US? Or Southeast Asia? Pick your region. The browser's IP and geolocation are in the cloud, not on your laptop.

4. Ephemeral & Secure. Each browser session is isolated and destroyed when the WebSocket closes. No leftover cookies, no persistent state leaking between runs. Every session starts clean.

The Bigger Picture

We're at an inflection point. AI agents are moving from code generation to code execution — and execution means interacting with the real world. Browsers are the universal interface to that world. The combination of browser-harness (agent-to-browser control) and Playwright Workspaces (managed remote browsers) creates a powerful primitive: give any AI agent a browser, anywhere, on demand.

Get Started

The full sample — including the playwright_service_client.py helper, setup prompts, and environment templates — is available here: 📦 playwright-workspaces/samples/browser-harness

Resources:

- Playwright Workspaces Documentation
- Browser-Harness GitHub
- Create a Playwright Workspace

How to Test AI Agents with LangSmith: A Complete Guide
Testing AI agents is crucial for ensuring reliability and accuracy in production, and evaluation is the core technique for doing so. Common evaluation types include:

| # | Evaluation Type |
|---|---|
| 1 | Task Success (Pass / Fail) |
| 2 | Instruction Adherence |
| 3 | Correctness / Accuracy |
| 4 | Relevance |
| 5 | Groundedness (Hallucination) |
| 6 | Coherence / Fluency |
| 7 | Tool-Use Accuracy |
| 8 | Safety / Harmfulness |

LangSmith provides powerful tools for creating datasets, running evaluations, and using LLM-as-judge techniques. This guide walks through the complete workflow using a practical example.

Prerequisites:

1. Create your account on LangSmith.
2. Generate a LangSmith API key and store it in a .env file, loading it whenever you create datasets or run evaluations. From a command prompt you can also use `set LANGCHAIN_API_KEY=<your_api_key_here>`.

Part 1: Creating Your Test Dataset

The foundation of any good evaluation is a quality dataset. LangSmith allows you to create datasets programmatically with input-output pairs that serve as ground truth.

```python
from langsmith import Client

def create_evaluation_dataset():
    client = Client()

    # Create a new dataset
    dataset = client.create_dataset(
        dataset_name="Sample dataset",
        description="A sample dataset in LangSmith.",
    )

    # Define your test examples
    examples = [
        {
            "inputs": {"question": "Which country is Mount Kilimanjaro located in?"},
            "outputs": {"answer": "Mount Kilimanjaro is located in Tanzania."},
        },
        {
            "inputs": {"question": "What is Earth's lowest point?"},
            "outputs": {"answer": "Earth's lowest point is The Dead Sea."},
        },
    ]

    # Add examples to the dataset
    client.create_examples(dataset_id=dataset.id, examples=examples)
    print(f"Created dataset: {dataset.name}")
    return dataset
```

Best Practices for Dataset Creation

- Diverse Examples: Include edge cases and various question types
- Clear Ground Truth: Ensure reference answers are accurate and complete
- Sufficient Volume: Create enough examples to get statistically meaningful results
- Consistent Format: Maintain consistent input/output structure
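After creating the dataset, it can be worth verifying the upload before running any evaluation. A minimal sketch, assuming the dataset name used above and the LangSmith SDK's `list_examples` helper:

```python
from langsmith import Client

client = Client()

# Print every example in the dataset to confirm it was created correctly.
for example in client.list_examples(dataset_name="Sample dataset"):
    print(example.inputs, "->", example.outputs)
```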
Part 2: Setting Up LLM-as-Judge Evaluation

LLM-as-judge is a powerful technique where you use a language model to evaluate the quality of another model's responses. This approach scales well and can assess subjective qualities like correctness and hallucinations.

```python
import os

from dotenv import load_dotenv
from langsmith import Client, wrappers
from openai import AzureOpenAI
from openevals.llm import create_llm_as_judge
from openevals.prompts import CORRECTNESS_PROMPT

load_dotenv()

# Wrap your AI client for LangSmith tracing
openai_client = wrappers.wrap_openai(AzureOpenAI(
    azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],
    api_key=os.environ["AZURE_OPENAI_API_KEY"],
    api_version="2025-04-01-preview",
))

DEPLOYMENT_NAME = os.environ.get("AZURE_OPENAI_DEPLOYMENT", "gpt-5-mini")
```

Defining Your Target Function

The target function represents the AI agent you want to test:

```python
def target(inputs: dict) -> dict:
    """The AI agent being evaluated."""
    response = openai_client.chat.completions.create(
        model=DEPLOYMENT_NAME,
        messages=[
            {"role": "system", "content": "Answer the following question accurately"},
            {"role": "user", "content": inputs["question"]},
        ],
    )
    return {"answer": response.choices[0].message.content.strip()}
```

Creating Custom Evaluators

1. Correctness Evaluator

```python
def correctness_evaluator(inputs: dict, outputs: dict, reference_outputs: dict):
    """Evaluates how correct the answer is compared to the reference."""
    evaluator = create_llm_as_judge(
        prompt=CORRECTNESS_PROMPT,  # Pre-built prompt for correctness
        model="azure_openai:" + DEPLOYMENT_NAME,
        feedback_key="correctness",
    )
    return evaluator(
        inputs=inputs,
        outputs=outputs,
        reference_outputs=reference_outputs,
    )
```

2. Hallucination Evaluator

```python
def hallucination_evaluator(inputs: dict, outputs: dict, reference_outputs: dict):
    """Detects if the answer contains unsupported claims."""
    evaluator = create_llm_as_judge(
        prompt="""You are an expert judge evaluating AI responses for hallucinations.
<question>
{inputs}
</question>
<answer>
{outputs}
</answer>
<reference_answer>
{reference_outputs}
</reference_answer>
Does the answer contain any claims or information that are not supported by the
question or reference answer? Respond with true if the answer is free of
hallucinations, false if it contains hallucinated information. You must also
provide a brief explanation of your reasoning.""",
        model="azure_openai:" + DEPLOYMENT_NAME,
        feedback_key="hallucination",
    )
    return evaluator(
        inputs=inputs,
        outputs=outputs,
        reference_outputs=reference_outputs,
    )
```

Part 3: Running the Evaluation

Execute the Complete Evaluation Pipeline

```python
def run_evaluation():
    client = Client()

    # Run the evaluation
    experiment_results = client.evaluate(
        target,                 # Function to test
        data="Sample dataset",  # Dataset name
        evaluators=[            # List of evaluators
            correctness_evaluator,
            hallucination_evaluator,
        ],
        experiment_prefix="first-eval-in-langsmith",
        max_concurrency=2,      # Control API rate limits
    )
    print("Evaluation Results:")
    print(experiment_results)
    return experiment_results

if __name__ == "__main__":
    run_evaluation()
```

Understanding Your Results

When the evaluation completes, you'll get detailed metrics including:

- Individual Scores: Per-example results for each evaluator
- Aggregate Metrics: Overall performance across the dataset
- Trace Links: Deep links to view exact model interactions
- Comparison Views: Side-by-side comparisons of outputs vs. references

Key Benefits of This Approach

- Automated Testing: Run comprehensive evaluations without manual review
- Scalable Assessment: Evaluate subjective qualities at scale
- Continuous Monitoring: Track performance changes over time
- Rich Analytics: Get detailed insights into failure modes
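Not every check needs an LLM judge. Deterministic evaluators can run alongside the ones above for cheap, repeatable signals. A minimal sketch (the function name and scoring convention here are illustrative; LangSmith custom evaluators can return a dict with a key and a score):

```python
def exact_match_evaluator(inputs: dict, outputs: dict, reference_outputs: dict):
    """Scores 1.0 when the answer exactly matches the reference, else 0.0."""
    matched = outputs["answer"].strip() == reference_outputs["answer"].strip()
    return {"key": "exact_match", "score": 1.0 if matched else 0.0}
```

It can be passed in the same `evaluators` list as the LLM-based evaluators.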
From Test Cases to Trusted Automation: Scaling Enterprise Quality with GitHub Copilot

Automation First, But Trust Is Earned

Enterprise QA teams today are automation-led by default. Regression suites run daily, API tests validate integrations, and UI automation protects critical workflows. Yet many teams still struggle with:

- Automation suites that lag behind changing requirements
- Brittle regression tests producing false failures
- High effort spent on maintaining, refactoring, and rewriting tests
- Limited time for testers to think deeply about risk and coverage

Automation creates speed, but trust is built only when automation stays relevant, maintainable, and aligned to business intent. That is where AI-assisted workflows started to play a role: not to replace automation engineers, but to remove friction from automation execution and evolution.

GitHub Copilot as an Automation Accelerator

GitHub Copilot proved most effective when used as a support system for automation teams, not a replacement for expertise.

Faster Automation Creation Without Losing Intent

Automation engineers often spend significant time writing boilerplate code: test scaffolding, assertions, setup, and repetitive patterns. Copilot helped accelerate this phase by:

- Generating consistent test skeletons
- Assisting with repetitive automation logic
- Suggesting assertions aligned to test intent

This allowed engineers to focus on what needed to be validated, not how fast they could type it.

Improving Maintainability of Automation Suites

At enterprise scale, the true cost of automation is maintenance. Copilot helped reduce this burden by:

- Accelerating refactoring of existing test code
- Making automation scripts more readable and standardized
- Supporting quicker updates when requirements changed

As a result, regression suites stayed healthier and more reliable, directly improving release confidence.

Strengthening Regression Confidence

Automation is valuable only when it can be trusted during regression cycles. By reducing effort spent on maintaining and updating tests, Copilot indirectly strengthened regression stability, ensuring automation remained aligned with evolving functionality. Importantly, every AI suggestion was reviewed, validated, and owned by humans. Automation logic remained intentional, deterministic, and compliant with enterprise standards.

Automation at Scale: Where Quality Is Really Won or Lost

As automation grows across releases and teams, quality risks move upstream. The questions stop being: Do we have automation? And become: Can we trust what automation is telling us? This is where quality engineering truly matters. By using Copilot to lower the mechanical overhead of automation, QA engineers could invest more time in:

- Identifying risk-based test coverage gaps
- Improving negative and edge-case scenarios
- Ensuring UI, API, and integration automation complemented each other
- Designing automation that reflected real business flows

Automation stopped being a maintenance burden and became a strategic quality asset.

The Real Mindset Shift for QA Teams

The biggest impact was not technical; it was cultural. Instead of spending the majority of time creating and fixing automation scripts, QA engineers could shift their focus toward:

- Test design strategy
- Regression optimization
- Failure analysis and pattern recognition
- Cross-team conversations on quality risks

AI didn't reduce QA effort. It redirected effort to higher-value quality ownership. This is what modern QA leadership looks like: not writing more tests, but ensuring the right tests exist, run reliably, and protect customer trust.
Responsible AI Was Non-Negotiable

In an enterprise context, automation quality is inseparable from governance and responsibility. Clear guardrails were essential:

- No blind acceptance of AI-generated automation
- Human review for every test case and assertion
- Awareness of security, data sensitivity, and compliance
- Using Copilot as an assistant, not an authority

This ensured automation quality improved without compromising trust or control.

Final Thoughts: Automation Builds Speed, Trust Builds Confidence

Automation enables scale. Test design ensures coverage. Trust is built when both evolve together. GitHub Copilot did not replace automation skills on our enterprise project; it amplified them. By removing friction from test creation and maintenance, it allowed automation to scale responsibly and enabled QA teams to focus on what truly matters: confidence in every release. The future of quality engineering is not manual vs. automation. It is automation-led, AI-assisted, and human-governed quality. That is how trust is built at enterprise scale.

Microsoft Learn References on Automation and Quality Engineering

The following Microsoft Learn resources provide authoritative guidance on automation-led quality engineering, test strategy, and building trust at enterprise scale:

- Architecture strategies for testing - Microsoft Azure Well-Architected Framework | Microsoft Learn
- Architecture strategies for designing a reliability testing strategy - Microsoft Azure Well-Architected Framework | Microsoft Learn
- What is Azure Test Plans? Manual, exploratory, and automated test tools. - Azure Test Plans | Microsoft Learn

Building a Scalable Contract Data Extraction Pipeline with Microsoft Foundry and Python
Architecture Overview

Alt text: Architecture diagram showing Blob Storage triggering Azure Function, calling Document Intelligence, transforming data, and storing in Cosmos DB

Flow:

1. Upload contract files (PDF or ZIP) to Azure Blob Storage
2. An Azure Function triggers automatically on file upload
3. Azure AI Document Intelligence extracts layout and tables
4. A transformation layer converts output into a canonical JSON format
5. Data is stored in Azure Cosmos DB

Step 1: Trigger Processing with Azure Functions

An Azure Function with a Blob trigger enables automatic processing when a file is uploaded.

```python
import io
import logging
import zipfile

import azure.functions as func

def main(myblob: func.InputStream):
    logging.info(f"Processing blob: {myblob.name}")
    if myblob.name.endswith(".zip"):
        with zipfile.ZipFile(io.BytesIO(myblob.read())) as z:
            for file_name in z.namelist():
                logging.info(f"Extracting {file_name}")
                file_data = z.read(file_name)
                # Pass file_data to extraction step
```

Best Practices

- Keep functions stateless and idempotent
- Handle retries for transient failures
- Store configuration in environment variables

Step 2: Extract Layout Using Document Intelligence

The prebuilt layout model helps extract tables, text, and structure from documents.

```python
from azure.ai.documentintelligence import DocumentIntelligenceClient
from azure.core.credentials import AzureKeyCredential

client = DocumentIntelligenceClient(
    endpoint="<your-endpoint>",
    credential=AzureKeyCredential("<your-key>"),
)

poller = client.begin_analyze_document(
    "prebuilt-layout",
    document=file_data,
)
result = poller.result()
```

Output Includes

- Structured tables
- Paragraphs and text blocks
- Bounding regions for layout context

Step 3: Handle Multi-Page Table Continuity

Contract documents often contain tables split across multiple pages. These need to be merged to preserve data integrity.

```python
def merge_tables(tables):
    merged = []
    current = None
    for table in tables:
        headers = [cell.content for cell in table.cells if cell.row_index == 0]
        if current and headers == current["headers"]:
            current["rows"].extend(extract_rows(table))
        else:
            if current:
                merged.append(current)
            current = {
                "headers": headers,
                "rows": extract_rows(table),
            }
    if current:
        merged.append(current)
    return merged
```

Key Considerations

- Match headers to detect continuation
- Preserve row order
- Avoid duplicate headers
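The snippet above assumes an `extract_rows` helper that is not shown. One possible sketch, grouping Document Intelligence table cells (which expose `row_index`, `column_index`, and `content`) into ordered row lists; treat this as illustrative rather than the original helper:

```python
def extract_rows(table):
    """Collect non-header cells into rows, ordered by row and column index."""
    rows = {}
    for cell in table.cells:
        if cell.row_index == 0:
            continue  # row 0 holds the headers handled by merge_tables
        rows.setdefault(cell.row_index, []).append((cell.column_index, cell.content))
    return [
        [content for _, content in sorted(cells)]
        for _, cells in sorted(rows.items())
    ]
```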
Step 4: Transform to a Canonical JSON Schema

A consistent schema ensures compatibility across downstream systems.

```json
{
  "id": "contract_123",
  "documentType": "contract",
  "vendorName": "ABC Corp",
  "invoiceDate": "2023-05-05",
  "tables": [
    {
      "name": "Line Items",
      "headers": ["Item", "Qty", "Price"],
      "rows": [
        ["Service A", "2", "100"]
      ]
    }
  ],
  "metadata": {
    "sourceFile": "contract.pdf",
    "processedAt": "2026-04-22T10:00:00Z"
  }
}
```

Design Tips

- Keep the schema flexible and extensible
- Include metadata for traceability
- Avoid excessive nesting

Step 5: Persist Data in Cosmos DB

Store the transformed data in a scalable NoSQL database.

```python
from azure.cosmos import CosmosClient

client = CosmosClient("<cosmos-uri>", "<key>")
database = client.get_database_client("contracts-db")
container = database.get_container_client("documents")

container.upsert_item(canonical_json)
```

Best Practices

- Choose an appropriate partition key (for example, documentType or vendorName)
- Optimize indexing policies
- Monitor request unit (RU) usage

Observability and Monitoring

To ensure reliability:

- Enable logging with Application Insights
- Track processing time and failures
- Monitor document extraction accuracy

Security Considerations

- Store secrets securely using Azure Key Vault
- Use Managed Identity for service authentication
- Apply role-based access control (RBAC) to storage resources

Conclusion

This approach provides a scalable and maintainable solution for contract data extraction:

- Event-driven processing with Azure Functions
- Accurate extraction using Document Intelligence
- Clean transformation into a reusable schema
- Efficient storage with Cosmos DB

This foundation can be extended with validation layers, review workflows, or analytics dashboards depending on your business requirements.

Resources

- Contract data extraction – Document Intelligence: Foundry Tools | Microsoft Learn
- microsoft/content-processing-solution-accelerator: Programmatically extract data and apply schemas to unstructured documents across text-based and multi-modal content using Azure AI Foundry, Azure OpenAI, Azure AI Content Understanding, and Cosmos DB.

Fixing Broken Markdown in AI Translation: Hardening a Production Pipeline
By Minseok Song and Hiroshi Yoshioka (Microsoft MVPs)

TL;DR

Recent community feedback, especially from Japanese translations, revealed that many translation failures were not semantic, but structural. Through detailed issue reports and discussions, we identified recurring patterns such as broken links, malformed code fences, inconsistent list structures, and CJK-specific formatting issues. In response, Co-op Translator has undergone a series of structural improvements across multiple releases, culminating in v0.18.1 with enhancements such as parser-based code fence handling, list-aware chunking, language-specific Markdown templates, safer CJK emphasis normalization, more robust image migration, and improved internal anchor consistency. These changes were directly informed by real-world community feedback. We would like to especially thank Hiroshi Yoshioka (Microsoft MVP), whose many detailed reports not only uncovered several of these systemic issues but also made this community report possible. The result is not just improved Japanese translations, but a more reliable and resilient translation pipeline for any repository that depends on Markdown fidelity.

Introduction

Most translation bugs are not actually translation bugs. They are structural failures. They show up as broken links, missing bold markers, unclosed code fences, skipped content, or images that quietly point to the wrong place. To a learner reading translated technical documentation, those issues can make a page feel untrustworthy. To a maintainer localizing documentation at scale, they reveal something deeper: the translation pipeline is not preserving structure as carefully as it preserves meaning.

That insight became much clearer over the past several months through community feedback on Co-op Translator. Co-op Translator helps maintain educational GitHub content across many languages while keeping Markdown, images, and notebooks synchronized as the source evolves. As Hiroshi Yoshioka reported a series of Japanese translation issues across real Microsoft learning repositories, each issue looked narrow on the surface: a broken link here, a skipped line there, bold markers not surviving around linked text, HTML image tags not being rewritten, or code fences breaking after chunking.

(Figure: Example of a real community-reported issue where a code block was broken during translation, causing structural corruption in the output.)

But taken together, those reports exposed a broader pattern: the hardest problem was not "translate this sentence." The hardest problem was "translate this document without damaging its structure." This post is a community report on the hardening work that followed, especially in the recent run-up to v0.18.1, and what we learned from those real-world cases.

Why these reports mattered

One of the most useful things about community feedback is that it reveals failure modes that synthetic tests often miss. These were not edge cases found in toy Markdown samples. The reports came from real translated content in active educational repositories. That meant we were dealing with the kinds of files maintainers actually have to ship:

- nested lists
- fenced code blocks
- inline HTML
- relative links
- translated headings
- migrated image assets
- CJK punctuation and emphasis edge cases

In other words, we were seeing the kinds of Markdown that break when a translation system is only mostly correct.

1) We stopped treating code fences like a regex problem

Code fences are not a regex problem—they are a structural one.
(Figure: Left: Regex-based handling breaks code fences and list structure across chunks. Right: Parser-based processing preserves code blocks and their surrounding context as atomic units.)

One of the earliest recurring themes was code fence integrity. A report on incorrectly handled triple backticks highlighted a classic failure mode: if fenced blocks are detected or split incorrectly, placeholders can fall out of sync, chunk boundaries can be corrupted, and the translated file can come back structurally damaged. A later report showed a closely related issue: list items and indented code placeholders could be split into separate chunks, which then caused broken fences downstream.

The right fix was not another regex patch. Instead, Co-op Translator moved to a parser-based approach using markdown-it-py for fenced code block detection. This made code block handling spec-aware and more resilient to cases like unmatched fences, variable fence lengths, and info strings. More importantly, it ensured code sections were treated as atomic units during chunking and placeholder restoration.

This same principle was extended to list-aware chunking. Rather than splitting Markdown line by line and hoping the model would preserve structure, the pipeline now groups list items together with their continuation lines and indented placeholders such as @@CODE_BLOCK_X@@. This prevents bullets and their associated code content from being separated into different translation chunks. This was not just a better heuristic. It changed the unit of chunking itself.

In practice, this required modifying the chunking pipeline to detect and preserve list-item blocks before token-based splitting. Instead of treating each line independently, we introduced a grouping step that keeps the entire list context intact, including nested indentation and code placeholders. The change was implemented directly in the chunking logic:

```python
lines = _group_lines_preserving_list_items(part_text)
```

This helper ensures that list items and their associated code blocks are processed as a single unit, preventing structural corruption during translation.

Why this mattered

Technical documentation frequently embeds code examples directly under list items or step-by-step instructions. When these relationships are broken during translation, the issue is not just cosmetic. It results in structurally invalid Markdown and misplaced code blocks that can confuse readers and make examples unusable. These were not edge cases. They appeared in real production documentation where:

- Fenced code blocks became malformed after chunking
- List items and their associated code placeholders were separated into different segments
- Placeholder ordering drifted, breaking reconstruction of the original structure

In practice, this meant that even when the translated text was correct, the document itself could no longer be trusted as a working technical resource.

What changed in practice

Before:
- Code samples could leak out of their list context
- List items and code blocks were split across chunks
- Placeholder ordering could drift, breaking reconstruction

After:
- Code blocks are preserved as atomic units during chunking
- List-bound code samples remain intact
- Placeholder ordering is stable across the pipeline
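To make the grouping idea concrete, here is a deliberately simplified illustration of keeping a list item together with its indented continuation lines (including placeholder lines). This is not the actual Co-op Translator implementation, just a sketch of the principle:

```python
def group_list_items(lines: list[str]) -> list[str]:
    """Keep each top-level block together with its indented continuation lines."""
    groups: list[str] = []
    current: list[str] = []
    for line in lines:
        # Indented lines and blank lines continue the current block; this keeps
        # indented code placeholders (e.g. @@CODE_BLOCK_X@@) with their bullet.
        is_continuation = line.startswith((" ", "\t")) or not line.strip()
        if current and is_continuation:
            current.append(line)
        else:
            if current:
                groups.append("\n".join(current))
            current = [line]
    if current:
        groups.append("\n".join(current))
    return groups
```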
2) We restored internal link consistency across translation chunks

Even when each chunk appears locally correct, internal links can break at the document level.

(Figure: Left: Anchor links drift out of sync because headings and links are translated independently across chunks. Right: After document-level normalization, links correctly resolve to their corresponding translated headings.)

Another cluster of issues surfaced when translating longer Markdown documents: internal links would silently break once the content was processed in chunks. Co-op Translator splits large documents into multiple chunks to fit within model constraints. While this works well for translation itself, it introduces a structural problem. Internal links such as [Go to section](#section-name) depend on heading-derived anchor slugs, and those slugs can change during translation. When each chunk is translated independently, links and headings can drift out of sync.

In practice, this meant that even when translated headings and links looked correct locally within a chunk, they no longer matched at the document level. Tables of contents, section jump links, and cross-references inside the same file could silently break.

The right fix was not to rely on chunk-level correctness. Instead, Co-op Translator introduced a document-level normalization step for internal anchor links. The pipeline now parses both the source and translated Markdown using markdown-it, extracts headings, generates GitHub-style slugs from the translated headings, and then realigns internal anchor links so they correctly point to their corresponding translated sections. Rather than trusting fragment identifiers produced during chunk-level translation, links are reconciled against the final translated document structure. This was not just a small post-processing tweak. It changed where consistency is enforced.

In practice, this required introducing a normalization step that runs after all chunks are merged back into a single document. Instead of assuming each chunk is self-consistent, the system now treats the entire document as the source of truth and rebinds internal links accordingly. The change was implemented as a dedicated normalization pass:

```python
normalize_internal_anchor_links(source_markdown, translated_markdown)
```

This function aligns fragment identifiers with translated heading slugs, ensuring that internal navigation remains valid even when content has been translated in multiple independent chunks.
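For intuition, GitHub-style slug generation can be approximated in a few lines. This simplified sketch ignores edge cases (duplicate headings, emoji handling, and more) that a production implementation must cover:

```python
import re

def github_slug(heading: str) -> str:
    """Approximate GitHub's heading-to-anchor slug rules (simplified)."""
    slug = heading.strip().lower()
    slug = re.sub(r"[^\w\- ]", "", slug)  # drop punctuation
    return re.sub(r"\s+", "-", slug)      # spaces become hyphens
```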
Why this mattered

Technical documentation relies heavily on internal navigation such as tables of contents, section links, and cross-references within the same file. When anchor links drift out of sync with translated headings, the document becomes difficult to navigate even if the translation itself is accurate. Readers may click on links that lead to incorrect sections or nowhere at all, which significantly reduces trust in the content. These issues surfaced in real-world usage where:

- Internal links no longer matched translated heading slugs
- Tables of contents pointed to incorrect or missing sections
- Cross-references silently broke across chunk boundaries

This highlighted that correctness at the chunk level was not enough. Consistency had to be enforced at the document level.

What changed in practice

Before:
- Internal links could drift out of sync with translated headings
- Tables of contents pointed to incorrect or missing sections
- Cross-references silently broke across chunk boundaries
- Long documents behaved like fragmented outputs rather than a single unit

After:
- Internal links are realigned with translated heading slugs at the document level
- Tables of contents correctly resolve to translated sections
- Cross-references remain consistent across the entire document
- Long Markdown documents behave as a single coherent unit

3) We fixed CJK emphasis the safe way

Bold and italic rendering around CJK text was a recurring and subtle failure point. Issues like "Markdown bold not handled correctly" may look minor, but they reveal a deeper compatibility problem: many Markdown renderers do not consistently apply emphasis when markers sit directly next to CJK characters.

To address this, we introduced a dedicated normalization step for emphasis markers. Instead of relying on each renderer to interpret `*`, `**`, and `***` correctly in CJK-adjacent cases, Co-op Translator converts them into equivalent HTML tags such as `<em>` and `<strong>` when the target language is Japanese, Korean, or Chinese. This shifts emphasis rendering from renderer-dependent behavior to deterministic output.

What mattered was not just fixing it, but fixing it safely. The normalization is strictly scoped to CJK languages and carefully designed to avoid overmatching. It does not mutate inline code spans or unrelated fragments. This is critical, because overly aggressive formatting fixes can easily break code, identifiers, or underscore-heavy technical text.

Unlike whitespace-delimited languages, Japanese, Korean, and Chinese often place characters directly adjacent to emphasis markers without clear boundaries. For example, a phrase like:

example is ...

may be translated into Japanese as:

例は ...

Here, the particle は is attached directly to the emphasized word. In some Markdown renderers, this breaks the expected boundary around ..., causing the emphasis to render incorrectly or not at all. This pattern is not limited to Japanese. Similar boundary issues can appear across CJK languages due to the absence of whitespace between words.

Why this mattered

Formatting bugs around emphasis may look cosmetic, but they affect readability, hierarchy, and trust, especially in instructional documentation where emphasis often signals warnings, key concepts, or required steps.

What changed in practice

Before:
- Emphasis markers could render inconsistently when adjacent to CJK characters
- Bold and italic formatting could break depending on the Markdown renderer
- Fixes risked overmatching and corrupting code or inline technical content

After:
- Emphasis rendering is deterministic across CJK languages using HTML tags
- Bold and italic formatting remains consistent regardless of renderer behavior
- Normalization is safely scoped, avoiding unintended mutations in code and inline content
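A deliberately simplified illustration of the idea, not the shipped implementation: convert emphasis markers to HTML tags while leaving inline code spans untouched. It skips nested and triple-asterisk emphasis, which the real normalization must handle:

```python
import re

def normalize_cjk_emphasis(text: str) -> str:
    """Convert **bold** and *italic* to HTML tags, skipping inline code spans."""
    # Splitting on a capture group keeps code spans at odd indices, untouched.
    parts = re.split(r"(`[^`]*`)", text)
    for i, part in enumerate(parts):
        if i % 2 == 0:  # even indices are non-code text
            part = re.sub(r"\*\*([^*\n]+)\*\*", r"<strong>\1</strong>", part)
            part = re.sub(r"\*([^*\n]+)\*", r"<em>\1</em>", part)
            parts[i] = part
    return "".join(parts)
```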
Next steps

With the recent release, Co-op Translator now exposes a programmatic API that allows the translation pipeline to be executed directly from Python, not only through the CLI. This is an important step, but it is not the end state. The immediate focus is improving adoption. Documentation and usage patterns are being developed so that the API can be reliably integrated across different environments and workflows. More fundamentally, the direction is shifting. Co-op Translator is evolving from a repository-specific tool into a reusable translation engine that can operate as part of larger content pipelines. This enables broader use cases, including:

- Long-form content such as eBooks and technical blogs
- Developer documentation and static site projects (for example, Docusaurus or Astro)
- Continuous documentation pipelines that track and update translations as source content evolves
- Multilingual SDK, API documentation, and knowledge base systems

The long-term goal is to treat translation as infrastructure rather than a one-time task. Instead of generating static outputs, the system is being designed to support continuous updates, structural guarantees, and seamless integration into real-world documentation workflows.

Why community feedback mattered so much here

One of the most encouraging parts of this work is that the most useful reports were not always long reports. Sometimes a single repository link, a screenshot, and one concrete example of broken output were enough to reveal a structural weakness in the translation engine. That feedback created a valuable loop between people reading translated docs and people maintaining the translation tooling. Hiroshi's reports did not just identify isolated defects. They helped surface recurring categories of failure:

- code fence integrity
- chunk boundary safety
- link preservation
- CJK emphasis compatibility
- image path migration
- anchor normalization

Once those patterns became visible, the fixes could be implemented in the core and covered with tests so that the broader ecosystem, not just one file or one repo, would benefit.

Why this matters for learners worldwide

Co-op Translator is used in educational repositories where translated documentation can lower the barrier to learning for people around the world. That raises the quality bar. A learner should not have to wonder whether a missing bold marker changed the meaning of a sentence. A learner should not hit a broken anchor halfway through a tutorial. A learner should not lose trust in a translated page because a code block or image path was corrupted during processing. Improving those details is not cosmetic. It is part of making global technical education more reliable.

Closing thoughts

This community report comes down to a simple truth: translation quality depends on structural quality. Community feedback helped Co-op Translator get better at preserving the things technical documents depend on most: code fences, lists, links, emphasis, images, and anchors. The result is a more dependable foundation for multilingual documentation, not only for Japanese, but for any repository that needs translated content to behave like a maintained technical artifact rather than a plain text dump.

To everyone who has opened an issue, shared a screenshot, submitted a PR, or stress-tested translated docs in the real world: thank you. That feedback is helping Co-op Translator become a stronger tool for maintainers and a more trustworthy experience for learners. If you are maintaining multilingual Markdown content, I hope these lessons are useful beyond this project too: use parsers where you can, make structure a first-class concern, and treat community bug reports as design input, not just support tickets.
If you are working on multilingual documentation, you can explore Co-op Translator here: https://github.com/Azure/co-op-translator

Selected GitHub references

- Repository: https://github.com/Azure/co-op-translator
- Issue #221: https://github.com/Azure/co-op-translator/issues/221
- PR #226: https://github.com/Azure/co-op-translator/pull/226
- Issue #234: https://github.com/Azure/co-op-translator/issues/234
- PR #237: https://github.com/Azure/co-op-translator/pull/237
- Issue #235: https://github.com/Azure/co-op-translator/issues/235
- Issue #239: https://github.com/Azure/co-op-translator/issues/239
- Issue #357: https://github.com/Azure/co-op-translator/issues/357
- Issue #362: https://github.com/Azure/co-op-translator/issues/362
- Issue #363: https://github.com/Azure/co-op-translator/issues/363
- PR #370: https://github.com/Azure/co-op-translator/pull/370
- PR #372: https://github.com/Azure/co-op-translator/pull/372
- PR #377: https://github.com/Azure/co-op-translator/pull/377
- PR #378: https://github.com/Azure/co-op-translator/pull/378
- PR #379: https://github.com/Azure/co-op-translator/pull/379
- PR #364: https://github.com/Azure/co-op-translator/pull/364

About the authors

Minseok Song (Microsoft MVP) is an OSS maintainer of Co-op Translator focusing on GitHub-native multilingual automation.

Hiroshi Yoshioka (Microsoft MVP) is a community contributor who has played a key role in improving translation quality through detailed real-world feedback.
Securing Your AI Agents Before They Ship: Red Teaming with Microsoft PyRIT

You wouldn't ship a web app without running OWASP ZAP or Snyk. So why are AI agents going to production without a single security scan? Prompt injection, data leakage, system prompt theft — the OWASP Top 10 for LLM Applications reads like a checklist of things most teams haven't tested for. PyRIT is Microsoft's open-source answer: an automation framework battle-tested on 100+ products including Copilot. But here's the catch — PyRIT is a research library. To make it work in a real engineering workflow, you need to wrap it. This post shows you how.

In this post:

- Why AI red teaming is fundamentally different from traditional security testing
- What PyRIT gives you out of the box
- How to build a thin wrapper that turns PyRIT into a config-driven, pipeline-ready scanner
- When and how to plug it into your CI/CD workflow
- Customizing every step for your threat model

🛡️ Why AI Red Teaming Is Different

If you're building agentic AI — systems that reason, call tools, and take actions — you already know that traditional security testing doesn't cut it. Microsoft's AI Red Team learned this the hard way after red-teaming 100+ generative AI products. Three things make AI red teaming unique:

1. You're testing two risk surfaces at once — security vulnerabilities (prompt injection, data exfiltration) *and* responsible AI harms (bias, toxicity, manipulation). Traditional pen testers focus on one.
2. Outputs are probabilistic — the same prompt can produce different responses across runs. You can't just assert on a fixed output. You need automated scoring at scale.
3. Every architecture is different — standalone chatbots, RAG pipelines, multi-agent workflows, tool-calling agents. A single test harness has to flex across all of them.

The OWASP LLM Top 10 (2025) gives us the taxonomy — prompt injection, sensitive information disclosure, excessive agency, system prompt leakage, data poisoning, supply chain risks, improper output handling, embedding weaknesses, misinformation, and unbounded consumption. Every AI agent you deploy is exposed to all ten. The question is whether *you* discover the gaps or your users do.

🔧 What PyRIT Gives You

PyRIT (Python Risk Identification Tool) started as internal scripts at Microsoft in 2022. Today it's a 3,800-star, MIT-licensed framework with 129 contributors and a published paper.

"We were able to pick a harm category, generate several thousand malicious prompts, and use PyRIT's scoring engine to evaluate the output from the Copilot system — all in the matter of hours instead of weeks." — Microsoft Security Blog

The building blocks:

- 53+ datasets — AIRT, HarmBench, AdvBench, XSTest, and more. Curated adversarial prompts covering content harms, jailbreaks, data exfiltration, and social bias.
- 70+ prompt converters — Base64, ROT13, Leetspeak, Unicode confusables, LLM-powered rephrasing, translation, multimodal injection. They stack — a prompt can be translated, then Base64-encoded, then embedded in an image.
- 6 attack strategies — from simple `PromptSendingAttack` (single-turn) to `CrescendoAttack` (gradual escalation), `TreeOfAttacksWithPruning` (TAP), and multi-turn dialogue attacks.
- 20+ scorers — LLM-as-judge, Azure AI Content Safety, true/false classifiers, Likert scales.
- 10+ targets — OpenAI, Azure, HuggingFace, HTTP endpoints, Playwright, WebSockets.

This is powerful — PyRIT gives you the components — datasets, converters, attack strategies, scorers — but not the glue.
You still need something that loads a config, wires the right components together, runs attacks, scores the results, and tells your pipeline pass or fail. That's what a wrapper does.

🏗️ Building an Enterprise Wrapper

The idea is simple: take PyRIT's primitives and compose them into an opinionated, config-driven pipeline that any developer can run with a single command. What follows is one way to build such a wrapper for agentic AI security testing — a starting point, not the only design.

The Flow

Everything starts with a YAML config and ends with a pass/fail exit code. The key insight: every step in this pipeline is configurable through YAML, not code. Switching attack strategies, adding converters, adjusting thresholds — it's all one config edit away.

Project Structure

At POC level, the wrapper is lean — an orchestrator that stitches PyRIT's components together. The heart of it is `runner.py` — a single orchestrator that:

1. Reads the YAML config
2. Dynamically imports your agent and wraps it as a PyRIT target
3. Loads datasets (built-in + your custom prompts)
4. Runs attacks in escalating phases — plain → encoded → semantic → multi-turn
5. Scores every response using LLM-as-judge
6. Maps findings to OWASP LLM Top 10 categories
7. Generates reports (JSON, Markdown)
8. Evaluates the release gate — pass or fail

That's it. One file doing the orchestration, a handful of helpers, and a YAML config. You can start here and grow the structure as your needs evolve.

The Configuration

A single YAML file drives everything:

```yaml
scan:
  name: "my-agent-security-scan"

  # Point to your agent — any Python callable
  target:
    module: "my_agent.main"
    function: "handle_prompt"

  # Attack strategies to run
  strategies:
    - name: "prompt_sending"   # Single-turn baseline
    - name: "crescendo"        # Gradual escalation
      max_turns: 5

  # Prompt transformations — they stack automatically
  converters:
    - name: "Base64Converter"
    - name: "LeetspeakConverter"
    - name: "TranslationConverter"

  # Adversarial prompt sets
  datasets:
    - "airt_illegal"   # Built-in: Microsoft AI Red Team
    - "harmbench"      # Built-in: harm evaluation benchmark
    - path: "datasets/custom_prompts.yaml"

  # Release gate thresholds
  gate:
    fail_on:
      critical_findings: 0   # Zero tolerance for critical
      high_findings: 3
      max_score: 0.3         # Defense rate must exceed 70%
```

Point the target at your agent, pick your strategies and datasets, set your thresholds — run. Teams can start scanning their agents in an afternoon, not weeks.
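To make the "dynamically imports your agent" step concrete, here is a minimal sketch of how the runner might resolve the target callable from that YAML. The config keys mirror the example above; the function itself is illustrative:

```python
import importlib

import yaml

def load_target(config_path: str):
    """Resolve the agent callable named in the scan config."""
    with open(config_path) as f:
        cfg = yaml.safe_load(f)
    target_cfg = cfg["scan"]["target"]
    module = importlib.import_module(target_cfg["module"])  # e.g. my_agent.main
    return getattr(module, target_cfg["function"])          # e.g. handle_prompt

# Usage sketch:
#   agent = load_target("scan.yaml")
#   response = agent("adversarial prompt here")
```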
⚙️ Customization Without Forking

The whole point of a wrapper is that teams customize behavior through configuration, not by modifying framework code.

| What to customize | How | Example |
|---|---|---|
| Which agent to test | Point `target.module` + `target.function` in YAML to any Python callable | Your chatbot, RAG pipeline, or multi-agent workflow |
| Attack strategies | Add/remove entries under `strategies` in YAML | Start with `prompt_sending`, add `crescendo` when ready |
| Prompt transformations | List converters in YAML; they stack automatically | Base64 → Leetspeak → Translation = multi-phase evasion |
| Datasets | Use built-in (53+) or add custom YAML prompt files | HIPAA prompts, financial compliance scenarios |
| Scoring thresholds | Set per-OWASP-category thresholds in `gate.fail_on` | Zero tolerance for data leakage (LLM02), relaxed for misinformation (LLM09) |
| Report formats | List formats in `reporting.formats` | JSON for automation, PDF for compliance, JUnit for dashboards |
| New attack classes | Register via `custom_attacks` in YAML (module + class name) | No framework code change, no PR needed |

🎯 Start Red Teaming Today

AI red teaming isn't a nice-to-have anymore. If you're shipping agentic AI (systems that call tools, access data, and take actions on behalf of users), you need automated security testing in your pipeline. PyRIT gives you the primitives. A thin wrapper gives you the automation. Together, they turn AI security from a one-off exercise into a continuous, measurable practice.

The pattern: YAML config → wrap your agent → run attacks → score → map to OWASP → gate the release. Build it once. Run it on every release. Sleep better.

Resources

- PyRIT on GitHub: source code, docs, and community
- PyRIT Documentation: getting started guides and API reference
- OWASP LLM Top 10 (2025): the industry standard risk taxonomy
- Microsoft AI Red Team Hub: threat models, bug bars, and best practices
- 3 Takeaways from Red Teaming 100 Products: lessons learned at scale
- PyRIT Launch Blog: origin story and key design decisions
- PyRIT Paper (arXiv): the academic paper
Making Sense of Azure AI Foundry IQ

As enterprise teams build AI agents, the hardest design decisions often have nothing to do with models. Instead, they revolve around a more fundamental question: how should an agent access organizational knowledge in a way that is accurate, secure, and sustainable over time? Azure AI Foundry IQ is designed to address a specific version of that problem. It is not a general-purpose data access layer, and it is not a replacement for every retrieval pattern. Understanding where it fits, and where it does not, is key to using it effectively. This post explores those boundaries and grounds them in concrete, enterprise-relevant scenarios, before showing how Foundry IQ can be used directly via Azure AI Search APIs and SDKs.

What Azure AI Foundry IQ Is (and Is Not)

Azure AI Foundry IQ is a managed knowledge layer built on Azure AI Search. It allows you to define a knowledge base that spans multiple content sources (SharePoint, Azure Blob Storage, OneLake, existing Azure AI Search indexes, and selected external sources) and expose them through a single, permission-aware endpoint. When an agent queries a knowledge base, Foundry IQ:

- Plans how the query should be executed
- Selects relevant knowledge sources
- Runs retrieval (optionally in multiple steps)
- Enforces user permissions
- Returns grounded results with citations

A single knowledge base can be reused across multiple agents or applications, avoiding duplicated indexing and inconsistent retrieval logic.

What Foundry IQ is not: it does not execute SQL queries, perform aggregations, or provide real-time numeric accuracy. Foundry IQ retrieves unstructured text, not transactional or analytical data.

Where Foundry IQ Is a Good Fit

1. Multi-Source, Distributed Knowledge

Foundry IQ is most valuable when relevant knowledge is spread across multiple systems. It removes the need for each agent to manage source-specific routing and retrieval logic. This benefit increases as the number of sources grows; with a single source, the overhead is rarely justified.

2. Complex or Multi-Part Questions

Foundry IQ's agentic retrieval model is designed for questions that require:

- Decomposition into sub-questions
- Retrieval from multiple documents
- Synthesis across sources

Its multi-step retrieval approach is especially effective when a single document cannot answer the question on its own.

3. Reduced Custom Retrieval Engineering

Foundry IQ automates indexing, chunking, vectorization, and orchestration across sources. This makes it a strong choice for teams that want to focus on agent behavior rather than building and maintaining custom RAG pipelines.

4. Enterprise Security and Governance

Foundry IQ integrates with Microsoft Entra ID and supports document-level permissions and Purview sensitivity labels where the underlying source allows it. This makes it suitable for internal or regulated scenarios where permission trimming is a hard requirement.

5. Shared Knowledge Across Multiple Agents

A single knowledge base can serve multiple agents or applications, reducing operational overhead and ensuring consistent retrieval behavior across experiences.

6. High Emphasis on Answer Quality and Trust

For scenarios where correctness, grounding, and citations matter more than latency or cost, Foundry IQ's multi-step retrieval consistently outperforms basic RAG approaches.

Example Scenarios Where Foundry IQ Works Well

Scenario A: Internal Policy and Operations Assistant

An enterprise builds an internal assistant for store managers.
Relevant information lives in:

- HR policies in SharePoint
- Safety procedures in Blob Storage
- Operations manuals in OneLake

Questions often span multiple documents. A single Foundry IQ knowledge base unifies these sources and enforces permissions automatically.

Scenario B: Compliance or Regulatory Knowledge Assistant

A compliance team needs answers strictly grounded in approved documents, with citations and access control. Foundry IQ ensures only authorized content is retrieved, reducing the risk of accidental data exposure.

Scenario C: Shared Knowledge Layer for Multiple Internal Agents

Multiple internal agents (chat assistants, workflow helpers, embedded copilots) rely on the same procedural content. A shared knowledge base avoids duplicate indexing and centralizes governance.

Where Foundry IQ Is Not a Good Fit

1. Simple or Single-Source Q&A

For a single, well-defined source, Foundry IQ's orchestration adds complexity without proportional benefit.

2. Structured or Analytical Data Queries

Foundry IQ does not execute live queries or calculations. It retrieves text, not metrics.

3. Ultra-Low Latency or High-Throughput Requirements

Agentic retrieval introduces LLM-in-the-loop latency and token costs. For sub-second responses at scale, simpler retrieval pipelines are more appropriate.

4. Highly Customized Retrieval Logic

Foundry IQ abstracts the retrieval pipeline. If you require fine-grained control over scoring or transformations, a fully custom search pipeline may be preferable.

Example Scenarios Where Foundry IQ Is the Wrong Tool

Scenario D: Sales and Inventory Analytics Agent

Questions like "What were Q4 sales by region?" require live data queries. Indexing reports leads to stale answers. A direct SQL or analytics tool is the correct solution.

Scenario E: High-Volume, Low-Latency Assistant

Voice-based assistants requiring sub-second responses cannot tolerate the latency of agentic retrieval.

A Common Architecture Pattern

Most successful implementations combine:

- Foundry IQ for unstructured documents and policies
- Structured data tools for analytics and live queries
- An application or agent layer that routes questions based on intent

This avoids forcing a single tool to solve every problem. A sketch of that routing layer follows.
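The routing layer can start out very simple. In the illustrative sketch below, `classify_intent` is a hypothetical stand-in for whatever intent model or rules engine you already use, and the two handler functions are stubs marking where the analytics tool and the Foundry IQ query (shown in the next section) would plug in.

```python
# Hypothetical application-layer router: analytical questions go to a
# structured-data tool, document questions go to a Foundry IQ knowledge
# base. classify_intent is a stand-in for your real intent classifier.

ANALYTICS_HINTS = ("sales", "revenue", "inventory", "by region", "q4")


def classify_intent(question: str) -> str:
    q = question.lower()
    return "analytics" if any(hint in q for hint in ANALYTICS_HINTS) else "knowledge"


def run_analytics_tool(question: str) -> str:
    # Stub: execute SQL / call your analytics API for live numbers.
    return f"[analytics tool handles: {question!r}]"


def query_knowledge_base(question: str) -> str:
    # Stub: wrap the azure-search-documents retrieval shown in the next section.
    return f"[Foundry IQ handles: {question!r}]"


def answer(question: str) -> str:
    if classify_intent(question) == "analytics":
        return run_analytics_tool(question)
    return query_knowledge_base(question)


print(answer("What were Q4 sales by region?"))             # routed to analytics
print(answer("What is the refund policy for managers?"))   # routed to Foundry IQ
```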
Querying Foundry IQ Knowledge Bases Directly via the Azure AI Search SDK

You can query Azure AI Foundry IQ knowledge bases directly using the azure-search-documents Python SDK, without using Foundry Agent Service:

Your App → Azure AI Search SDK → Foundry IQ Knowledge Base → Grounded Results

This is ideal when you want full orchestration control while still benefiting from managed, agentic retrieval.

How This Works

Note: what follows is a reference implementation.

Install:

```
pip install --pre azure-search-documents azure-identity
```

Setup (High Level)

- Provision Azure AI Search (Basic or higher)
- Enable Azure AD and API key authentication
- Enable a system-assigned managed identity

Ingest Content via Knowledge Sources

- Blob Storage, SharePoint, or OneLake
- Index, indexer, data source, and skillset are created automatically
- Knowledge sources and KBs are created via REST API (2025-11-01-preview)

Create a Knowledge Base

- minimal reasoning → semantic retrieval only (no LLM)
- low / medium reasoning → requires an Azure OpenAI model
- The search service's managed identity needs the Cognitive Services User role

Querying the Knowledge Base (Python)

Initialize the client:

```python
from azure.identity import DefaultAzureCredential
from azure.search.documents.knowledgebases import KnowledgeBaseRetrievalClient

client = KnowledgeBaseRetrievalClient(
    endpoint="https://<search-service>.search.windows.net",
    knowledge_base_name="<kb-name>",
    credential=DefaultAzureCredential(),
)
```

Minimal reasoning (fast, no LLM):

```python
from azure.search.documents.knowledgebases.models import (
    KnowledgeBaseRetrievalRequest,
    KnowledgeRetrievalSemanticIntent,
    KnowledgeRetrievalMinimalReasoningEffort,
    KnowledgeRetrievalOutputMode,
)

request = KnowledgeBaseRetrievalRequest(
    intents=[KnowledgeRetrievalSemanticIntent(search="your question here")],
    retrieval_reasoning_effort=KnowledgeRetrievalMinimalReasoningEffort(),
    output_mode=KnowledgeRetrievalOutputMode.EXTRACTIVE_DATA,
)

response = client.retrieve(retrieval_request=request)
```

Conversational reasoning (LLM-backed):

```python
from azure.search.documents.knowledgebases.models import (
    KnowledgeBaseRetrievalRequest,
    KnowledgeBaseMessage,
    KnowledgeBaseMessageTextContent,
    KnowledgeRetrievalLowReasoningEffort,
    KnowledgeRetrievalOutputMode,
)

request = KnowledgeBaseRetrievalRequest(
    messages=[
        KnowledgeBaseMessage(
            role="user",
            content=[KnowledgeBaseMessageTextContent(text="<first user question>")],
        ),
        KnowledgeBaseMessage(
            role="assistant",
            content=[KnowledgeBaseMessageTextContent(text="<assistant response>")],
        ),
        KnowledgeBaseMessage(
            role="user",
            content=[KnowledgeBaseMessageTextContent(text="<follow-up question>")],
        ),
    ],
    retrieval_reasoning_effort=KnowledgeRetrievalLowReasoningEffort(),
    output_mode=KnowledgeRetrievalOutputMode.EXTRACTIVE_DATA,
)

response = client.retrieve(retrieval_request=request)
```

Keep in mind: intents works with minimal reasoning only, and messages works with low/medium reasoning only. They are not interchangeable.

Processing the response:

```python
# Extracted content
for msg in (response.response or []):
    for item in (msg.content or []):
        print(item.text)

# Citations (handles blob, SharePoint, OneLake, and search index references)
for ref in (response.references or []):
    ref_id = getattr(ref, "id", None)
    url = getattr(ref, "blob_url", None) or getattr(ref, "url", None)
    print(f"[{ref_id}] {url}")

# Retrieval diagnostics
for record in (response.activity or []):
    elapsed = getattr(record, "elapsed_ms", None) or ""
    print(f"{record.type}: {elapsed}ms")
```

Output Modes

| Mode | When to use |
|---|---|
| extractiveData | Feed grounded chunks into your own LLM |
| answerSynthesis | Return a ready-made answer with citations (LLM required) |

Security & Permissions

- RBAC: Search Index Data Reader with DefaultAzureCredential
- Permission trimming: must be enabled at ingestion (ingestionPermissionOptions) and is enforced at query time by passing the user's bearer token

```python
response = client.retrieve(
    retrieval_request=request,
    x_ms_query_source_authorization="Bearer <user-token>",
)
```
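If you choose extractiveData, answer generation is still your job. Here's a minimal sketch of that last step, assuming you collect the chunks with the response loop shown above; `build_grounded_prompt` is a hypothetical helper, and the final chat-completion call is whatever LLM client you already use.

```python
# Illustrative: turn extractiveData chunks into a grounded prompt for
# your own LLM. build_grounded_prompt is a hypothetical helper, not an
# SDK API; the sample chunks stand in for real retrieval output.

def build_grounded_prompt(question: str, chunks: list[str]) -> str:
    sources = "\n\n".join(f"[{i + 1}] {chunk}" for i, chunk in enumerate(chunks))
    return (
        "Answer using ONLY the sources below, citing them as [n]. "
        "If the sources do not contain the answer, say so.\n\n"
        f"Sources:\n{sources}\n\nQuestion: {question}"
    )


# In practice: chunks = [item.text for msg in (response.response or [])
#                        for item in (msg.content or [])]
chunks = [
    "Store managers may approve refunds up to $500.",
    "Refunds above $500 require district approval.",
]

prompt = build_grounded_prompt("What refunds can a store manager approve?", chunks)
print(prompt)  # pass this to your own chat-completion client
```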
Foundry IQ won't solve every retrieval problem. But when your agents need grounded, permission-aware answers from content scattered across SharePoint, Blob Storage, and OneLake, it handles the hard parts so you can focus on what your agent actually does.
The Future of Agentic AI: Inside Microsoft Agent Framework 1.0

Agentic AI is rapidly moving beyond demos and chatbots toward long-running, autonomous systems that reason, call tools, collaborate with other agents, and operate reliably in production. On April 3, 2026, Microsoft marked a major milestone with the General Availability (GA) release of Microsoft Agent Framework 1.0, a production-ready, open-source framework for building agents and multi-agent workflows in .NET and Python. [techcommun...rosoft.com]

In this post, we'll deep-dive into:

- What Microsoft Agent Framework actually is
- Its core architecture and design principles
- What's new in version 1.0
- How it differs from other agent frameworks
- When and how to use it, with real code examples

What Is Microsoft Agent Framework?

According to the official announcement, Microsoft Agent Framework is an open-source SDK and runtime for building AI agents and multi-agent workflows with strong enterprise foundations. Agent Framework provides two primary capability categories:

1. Agents

Agents are long-lived runtime components that:

- Use LLMs to interpret inputs
- Call tools and MCP servers
- Maintain session state
- Generate responses

They are not just prompt wrappers, but stateful execution units.

2. Workflows

Workflows are graph-based orchestration engines that:

- Connect agents and functions
- Enforce execution order
- Support checkpointing and human-in-the-loop scenarios

This leads to a clean separation of responsibilities:

| Concern | Handled by |
|---|---|
| Reasoning & interpretation | Agent |
| Execution policy & control flow | Workflow |

This separation is a foundational design decision.

High-Level Architecture

From the official overview, Agent Framework is composed of several core building blocks:

- Model clients (chat completions & responses)
- Agent sessions (state & conversation management)
- Context providers (memory and retrieval)
- Middleware pipeline (interception, filtering, telemetry)
- MCP clients (tool discovery and invocation)
- Workflow engine (graph-based orchestration)

Conceptual Flow

🌟 What's New in Version 1.0

Version 1.0 marks the transition from Release Candidate to General Availability (GA).

- Production-ready stability: unlike the earlier experimental packages, 1.0 offers stable APIs, versioned releases, and a commitment to long-term support (LTS).
- A2A protocol (Agent-to-Agent): a new structured messaging protocol that allows agents to communicate across different runtimes. For example, an agent built in Python can seamlessly coordinate with an agent running in a .NET environment.
- MCP (Model Context Protocol) support: full integration with the Model Context Protocol, enabling agents to dynamically discover and invoke external tools and data sources without manual integration code.
- Multi-agent orchestration patterns: stable implementations of complex patterns, including Sequential (linear handoffs between specialized agents), Group Chat (collaborative reasoning where agents discuss and solve problems), and Magentic-One (a sophisticated pattern for task-oriented reasoning and planning).
- Middleware pipeline: the new middleware architecture lets you inject logic into the agent's execution loop without modifying the core prompts. This is essential for Responsible AI (RAI), allowing you to add content safety filters, logging, and compliance checks globally.
- DevUI debugger: a browser-based local debugger that provides a real-time visual representation of agent message flows, tool calls, and state changes.

Code Examples

Creating a simple agent (C#), from Microsoft Learn:

```csharp
using Azure.AI.Projects;
using Azure.Identity;
using Microsoft.Agents.AI;

AIAgent agent = new AIProjectClient(
        new Uri("https://your-foundry-service.services.ai.azure.com/api/projects/your-project"),
        new AzureCliCredential())
    .AsAIAgent(
        model: "gpt-5.4-mini",
        instructions: "You are a friendly assistant. Keep your answers brief.");

Console.WriteLine(await agent.RunAsync("What is the largest city in France?"));
```

This shows:

- Provider-agnostic model access
- Session-aware agent execution
- Minimal setup for production agents

Creating a simple agent (Python):

```python
from agent_framework.foundry import FoundryChatClient
from azure.identity import AzureCliCredential

client = FoundryChatClient(
    project_endpoint="https://your-foundry-service.services.ai.azure.com/api/projects/your-project",
    model="gpt-5.4-mini",
    credential=AzureCliCredential(),
)

agent = client.as_agent(
    name="HelloAgent",
    instructions="You are a friendly assistant. Keep your answers brief.",
)

result = await agent.run("What is the largest city in France?")
print(result)
```

The same agent abstraction applies across languages.
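Real agents usually need tools as well. Official Agent Framework samples pass plain Python functions to the agent factory; the sketch below assumes the `as_agent` call shown above also accepts a `tools=` list of callables (parameter names may differ between releases, so verify against the docs for your version), and it reuses `client` from the previous example.

```python
# Sketch: a plain Python function exposed as a tool. Assumes as_agent()
# accepts tools=[...] as in official Agent Framework samples; verify the
# exact parameter name for your release. `client` comes from the
# previous example.
from typing import Annotated


def get_weather(city: Annotated[str, "City name, e.g. 'Paris'"]) -> str:
    """Return the current weather for a city (stubbed for this sketch)."""
    return f"It is 18°C and cloudy in {city}."


weather_agent = client.as_agent(
    name="WeatherAgent",
    instructions="Answer weather questions using the get_weather tool.",
    tools=[get_weather],  # assumption: callables accepted directly as tools
)

result = await weather_agent.run("What's the weather in Paris?")
print(result)
```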
When to Use Agents vs Workflows

Microsoft provides clear guidance:

| Use an Agent when… | Use a Workflow when… |
|---|---|
| The task is open-ended | Steps are well-defined |
| Autonomous tool use is needed | Execution order matters |
| There is a single decision point | Multiple agents/functions collaborate |

Key principle: if you can solve the task with deterministic code, do that instead of using an AI agent.

🔄 How It Differs from Other Frameworks

Microsoft Agent Framework 1.0 distinguishes itself by focusing on enterprise readiness and interoperability.

| Feature | Microsoft Agent Framework 1.0 | Semantic Kernel / AutoGen | LangChain / CrewAI |
|---|---|---|---|
| Philosophy | Unified, production-ready SDK | Research-focused or tool-specific | High-level, developer-friendly abstractions |
| Integration | Deeply integrated with Microsoft Foundry and Azure | Varied; often requires more glue code | Generally cloud-agnostic |
| Interoperability | Native A2A and MCP for cross-framework tasks | Limited to internal ecosystem | Uses proprietary connectors |
| Runtime | Identical API parity for .NET and Python | Primarily Python-first (SK has C#) | Primarily Python |
| Control | Graph-based deterministic workflows | More non-deterministic/experimental | Mixture of role-based and agentic |

🛠️ Key Technical Components

- Agent Harness: the execution layer that provides agents with controlled access to the shell, file system, and messaging loops.
- Agent Skills: a portable, file-based or code-defined format for packaging domain expertise.

Implementation tip: if you are coming from Semantic Kernel, Microsoft provides migration assistants that analyze your existing code and generate step-by-step plans to upgrade to the new Agent Framework 1.0 standards.

Further reading:

- Microsoft Agent Framework Version 1.0 | Microsoft Agent Framework
- Agent Framework documentation

🎯 Summary

Microsoft Agent Framework 1.0 is the "grown-up" version of AI orchestration. By standardizing the way agents talk to each other (A2A), discover tools (MCP), and process information (middleware), Microsoft has provided a clear path for taking AI experiments into production. For more detailed guides, check out the official Microsoft Agent Framework documentation and the Microsoft Agent Framework .NET AI Community Standup.
Stop Experimenting, Start Building: AI Apps & Agents Dev Days Has You Covered

The AI landscape has shifted. The question is no longer "Can we build AI applications?" but "Can we build AI applications that actually work in production?" Demos are easy. Reliable, scalable, resilient AI systems that handle real-world complexity? That's where most teams struggle. If you're an AI developer, software engineer, or solution architect who's ready to move beyond prototypes and into production-grade AI, there's a series built specifically for you.

What Is AI Apps & Agents Dev Days?

AI Apps & Agents Dev Days is a monthly technical series from Microsoft Reactor, delivered in partnership with Microsoft and NVIDIA. You can explore the full series at https://developer.microsoft.com/en-us/reactor/series/s-1590/

This isn't a slide deck marathon. The series tagline says it best: "It's not about slides, it's about building." Each session tackles real-world challenges, shares patterns that actually work, and digs into what's next in AI-driven app and agent design. You bring your curiosity, your code, and your questions. You leave with something you can ship.

The sessions are led by experienced engineers and advocates from both Microsoft and NVIDIA: people like Pamela Fox, Bruno Capuano, Anthony Shaw, Gwyneth Peña-Siguenza, and solutions architects from NVIDIA's Cloud AI team. These aren't theorists; they're practitioners who build and ship the tools you use every day.

What You'll Learn

The series covers the full spectrum of building AI applications and agent-based systems. Here are the key themes:

Building AI Applications with Azure, GitHub, and Modern Tooling

Sessions walk through how to wire up AI capabilities using Azure services, GitHub workflows, and the latest SDKs. The focus is always on code-first learning: you'll see real implementations, not abstract architecture diagrams.

Designing and Orchestrating AI Agents

Agent development is one of the series' strongest threads. Sessions cover how to build agents that orchestrate long-running workflows, persist state automatically, recover from failures, and pause for human-in-the-loop input, all without losing progress. For example, the session "AI Agents That Don't Break Under Pressure" demonstrates building durable, production-ready AI agents using the Microsoft Agent Framework, running on Azure Container Apps with NVIDIA serverless GPUs.

Scaling LLM Inference and Deploying to Production

Moving from a working prototype to a production deployment means grappling with inference performance, GPU infrastructure, and cost management. The series covers how to leverage NVIDIA GPU infrastructure alongside Azure services to scale inference effectively, including patterns for serverless GPU compute.

Real-World Architecture Patterns

Expect sessions on container-based deployments, distributed agent systems, and enterprise-grade architectures. You'll learn how to use services like Azure Container Apps to host resilient AI workloads, how Foundry IQ fits into agent architectures as a trusted knowledge source, and how to make architectural decisions that balance performance, cost, and scalability.

Why This Matters for Your Day Job

There's a critical gap between what most AI tutorials teach and what production systems actually require. This series bridges that gap:

- Production-ready patterns, not demos. Every session focuses on code and architecture you can take directly into your projects. You'll learn patterns for state persistence, failure recovery, and durable execution: the things that break at 2 AM.
- Enterprise applicability. The scenarios covered (travel planning agents, multi-step workflows, GPU-accelerated inference) map directly to enterprise use cases. Whether you're building internal tooling or customer-facing AI features, the patterns transfer.
- Honest trade-off discussions. The speakers don't shy away from the hard questions: When do you need serverless GPUs versus dedicated compute? How do you handle agent failures gracefully? What does it actually cost to run these systems at scale?

Watch On-Demand, Build at Your Own Pace

Every session is available on-demand. You can watch, pause, and build along at your own pace, with no need to rearrange your schedule. The full playlist is available from the series page linked above.

This is particularly valuable for technical content. Pause a session while you replicate the architecture in your own environment. Rewind when you need to catch a configuration detail. Build alongside the presenters rather than just watching passively.

What You'll Walk Away With

After working through the series, you'll have:

- Practical agent development skills: how to design, orchestrate, and deploy AI agents that handle real-world complexity, including state management, failure recovery, and human-in-the-loop patterns
- Production architecture patterns: battle-tested approaches for deploying AI workloads on Azure Container Apps, leveraging NVIDIA GPU infrastructure, and building resilient distributed systems
- Infrastructure decision-making confidence: a clearer understanding of when to use serverless GPUs, how to optimise inference costs, and how to choose the right compute strategy for your workload
- Working code and reference implementations: the sessions are built around live coding and sample applications (like the Travel Planner agent demo), giving you starting points you can adapt immediately
- A framework for continuous learning: with new sessions each month, you'll stay current as the AI platform evolves and new capabilities emerge

Start Building

The AI applications that will matter most aren't the ones with the flashiest demos; they're the ones that work reliably, scale gracefully, and solve real problems. That's exactly what this series helps you build. Whether you're designing your first AI agent system or hardening an existing one for production, the AI Apps & Agents Dev Days sessions give you the patterns, tools, and practical knowledge to move forward with confidence.

Explore the series at https://developer.microsoft.com/en-us/reactor/series/s-1590/ and start watching the on-demand sessions at the link above. The best time to level up your AI engineering skills was yesterday. The second-best time is right now, and these sessions make it easy to start.