best practices
246 TopicsPlatform Improvements for Python AI Apps on Azure App Service
Overview Azure App Service (Linux) is a fully managed PaaS offering that supports a broad range of languages, including Python, Node.js, .NET, PHP, and Java. Developers can push source code or deploy a pre-built artifact; the platform handles the rest, including dependency installation, application containerization, and running the application at cloud scale. More customers are building intelligent applications using Azure AI Foundry and other AI services, and Python has become a language of choice for these workloads. The performance and reliability of the Python deployment pipeline directly shape the developer's experience on the platform, so we looked across the deployment path for opportunities to reduce latency and improve reliability. The first set of changes has reduced Python deployment latency on Azure App Service Linux by approximately 30%. This is the first step in a broader effort to make the platform better suited for AI application development, but the gains resulting from this effort will benefit all apps on the platform. Let's look at the details. Where Deployment Time Was Going Python web application deployments on Azure App Service Linux rely on Oryx, the platform's open-source build system, to produce runnable artifacts during remote builds. Platform telemetry showed that around 70% of Python app deployments use remote builds, and the majority of those resolve dependencies via requirements.txt using pip install. To understand where time was going, we profiled a stress workload: a 7.5 GB PyTorch application. Most production builds are smaller, but stress-testing a dependency-heavy application made the pipeline bottlenecks clear. When a Python app is deployed via remote build, the build container in Kudu (the App Service deployment service) runs Oryx to: Extract the uploaded source code. Create a Python virtual environment. Install dependencies via pip install; 4.35 min (~34% of build time). Copy files to a staging directory; 0.98 min (~8%). Compress via tar + gzip into an archive; 7.53 min (~58%). Write the archive to /home (Azure Storage SMB mount). The app container then extracts this archive to the local disk on every cold start. Why the Archive-Based Approach? The /home directory is backed by an Azure Storage SMB mount, where small-file I/O is comparatively expensive. Python dependencies are file-heavy: virtual environments commonly contain tens of thousands of files, and dependency-heavy ML applications can exceed 200,000 files. Writing those files individually over SMB would be prohibitively slow. Instead, the pipeline builds on the container's local filesystem, writes a single compressed archive over SMB, and the app container extracts it locally on startup for efficient module loading. Key insight: Compression was the single largest phase at 58% of build time, longer than installing the packages themselves. What We Changed Zstandard Compression (Replacing gzip) Standard gzip compression is single-threaded. In our benchmark, compression accounted for 58% of total build time, making it the dominant bottleneck. Because the archive is also decompressed during container startup, decompression time affects runtime startup latency as well. We evaluated three compression algorithms: gzip, LZ4, and Zstandard (zstd). The following results are averaged across multiple deployments of a 7.5 GB Python application with PyTorch and additional ML packages: Metric gzip LZ4 zstd Compression time 7.53 min 1.20 min 1.18 min Decompression time 2.80 min 1.18 min 1.07 min Archive size 4.0 GB 5.0 GB 4.8 GB Both zstd and LZ4 were more than 6× faster than gzip for compression and more than 2× faster for decompression. We selected zstd for the following reasons: Comparable speed to LZ4, with smaller archive sizes (4.8 GB vs. 5.0 GB). Mature ecosystem: zstd is based on RFC 8878 published in 2021 and ships with many common Linux distributions. Native tar support: tar –I zstd works out of the box; no extra packages required. Result: Compression time dropped from 7.53 min → 1.18 min (6.4× faster). Decompression improved from 2.80 min → 1.07 min (2.6× faster), directly reducing cold-start latency. Faster Package Installation with uv pip is implemented in Python and has historically optimized compatibility over maximum parallelism. In dependency-heavy workloads, package download, resolution, and installation can become a major part of deployment time. In our 7.5 GB PyTorch benchmark, package installation accounted for ~34% of total build time (4.35 min out of 12.86 min). We introduced uv, a Python package manager written in Rust, as the primary installer for compatible requirements.txt deployments. Its uv pip install interface works with standard pip workflows. Fallback strategy: Compatibility remains the priority. When uv cannot handle a deployment, the platform retries with pip, preserving the behavior customers already depend on. Cache behavior: Package caches remain local to the build container. When the same app is deployed again before the kudu (build) container is recycled, both pip and uv can reuse cached packages and avoid repeated downloads. Result: Package installation time dropped from 4.35 min → 1.50 min (3× faster). Reducing File Copy Overhead A file copy showed up in two places. First, before compression, the build process copied the entire build directory (application code plus Python packages) to a staging location. This existed historically as a safety measure; creating a clean snapshot before tar reads the file tree. But the cost was steep for the large number of files inherent in Python dependencies. The fix was straightforward: create the tar archive directly from the build directory, skipping the intermediate copy entirely. Second, for pre-built deployment scenarios, we replaced the legacy Kudu sync path with Linux-native rsync. That gave us a better optimized tool for large Linux file trees and reduced the overhead of moving files into the final deployment location. Because this path is used beyond Python, the improvement benefits pre-built apps across the broader App Service Linux ecosystem. Result: Eliminated the 0.98-minute staging copy (8% of build time), reduced temporary disk usage, and improved the remaining file sync path. Pre-Built Python Wheels Cache We added a complementary optimization: a read-only cache of pre-built wheels for commonly used Python packages, selected using platform telemetry. The cache is mounted into the Kudu build container at runtime for Python workloads, allowing the installer to use local wheel artifacts before downloading packages externally. When a matching wheel is available, the installer uses it directly, avoiding a network fetch for that package. Cache misses fall back to the upstream registry (e.g., PyPI) as usual. The cache is managed by the platform and kept up to date, so supported Python builds can use it without any app change. Combined Results Controlled Benchmark (PyTorch 7.5 GB, P1mv3 App Service Tier) The following benchmark was measured on the P1mv3 App Service tier. Values in the "After" column reflect the optimized pipeline with zstd compression, uv package installation, direct tar creation, and the pre-built wheels cache enabled together. Phase Before After Improvement Package installation 4.35 min 1.50 min ~3× faster File copy 0.98 min 0 min Eliminated Compression 7.53 min 1.18 min ~6× faster Total build time 12.86 min ~2.68 min ~79% reduction Production Fleet (All Python Linux Web Apps) Production telemetry across Python deployments shows the impact of these changes: deployment latency decreased by approximately 30% after the rollout. The controlled benchmark shows a larger improvement (~79%) because it exercises a dependency-heavy workload where package installation, file copy, and compression dominate total build time. Typical production apps are smaller and spend less time proportionally in those phases. Beyond Faster Builds: Reliability and Runtime Performance Faster builds only help when deployment requests reliably reach a worker that is ready to build. We updated the primary deployment clients Azure CLI, GitHub Actions, and Azure DevOps Pipelines to warm up Kudu before initiating deployments. Clients now issue a lightweight health-check request to the Kudu endpoint, helping ensure the deployment container is running and ready before the deployment begins. Clients also preserve affinity to the warmed-up worker using the ARR affinity cookie returned by the first request. This increases the chance that the deployment uses a worker with Kudu already running and local package caches already available from recent deployments. Together, these client-side changes reduced deployment failures from transient infrastructure issues and helped the pipeline optimizations reach the build phase reliably. Result: Deployment failures caused by cold-start errors (502, 503, 499) dropped by ~30%. We also improved the default runtime configuration for Python apps using the platform-provided Gunicorn startup path. Previously, the platform defaulted to a single worker, leaving most CPU cores idle. Now, it follows Gunicorn's recommended worker formula, fully utilizing available cores on multi-core SKUs and delivering higher request throughput out of the box. workers = (2 × NUM_CORES) + 1 Key Takeaways Measure before optimizing: Platform telemetry showed that remote builds and requirements.txt based installs were the dominant Python deployment paths, which helped us focus on changes that would benefit the most customers. Compression was the biggest bottleneck: In the dependency-heavy benchmark, archive compression took longer than package installation. Replacing gzip with zstd reduced both build time and cold-start extraction time. File count matters: Python virtual environments can contain tens of thousands of files, and AI workloads can contain many more. Reducing unnecessary file copies and using Linux-native file sync helped lower overhead. Compatibility needs a fallback path: Introducing uv improved the common path, while falling back to pip preserved compatibility for apps that depend on existing Python packaging behavior. Deployment reliability is part of performance: Faster builds only help if deployment requests consistently reach a ready worker. Warm-up and worker affinity made the optimized path more reliable for customers. Beyond deployment: Runtime defaults, such as Gunicorn worker configuration, also affect how production apps perform once deployment is complete. Together, these changes made Python deployments faster and more reliable while preserving compatibility through safe fallbacks. We will continue improving the platform to make Azure App Service faster, more reliable, and better suited for AI application development.142Views1like0CommentsFrom Prompt to Production: Open in VS Code for Terraform in Azure Copilot
We’re excited to introduce a new step in the Terraform on Azure experience: Open in VS Code, now available directly from Azure Copilot in the Azure Portal. This capability helps you move seamlessly from AI‑generated Terraform code to real Azure deployments - within a connected, guided workflow designed for enterprise scenarios. Why This Matters Infrastructure as Code with Terraform is powerful, but moving from generated configuration to a deployed environment typically involves multiple tools and handoffs. Teams need to understand Terraform state, work with remote backends, and integrate their code into version‑controlled CI/CD pipelines - often backed by Terraform Cloud or Azure‑native backends in enterprise environments. Open in VS Code brings these steps together. It bridges the gap between AI‑assisted authoring in the Azure Portal and the real‑world workflows required to validate, manage state, and deploy infrastructure with confidence. Continue Your Workflow in VS Code With Azure Copilot, you can describe your infrastructure in natural language and generate Terraform configurations in seconds. For example: “Create an Azure Container App using Terraform with a managed environment, Log Analytics enabled, and a system‑assigned managed identity to securely pull images from Azure Container Registry.” Copilot generates the Terraform configuration for you. From there, you can select Open full view to enter a full‑screen Terraform editor, and then choose Open in VS Code to launch the configuration in an Azure‑hosted VS Code environment. There’s no need to download files or set up a local development environment. VS Code for the web opens with Azure authentication already configured, along with commonly used extensions, so you can immediately focus on refining, validating, and preparing your infrastructure for deployment. Built‑in Guidance for Real Deployments Beyond editing, the VS Code experience includes built‑in, step‑by‑step guidance to help you deploy your Terraform configuration into your own Azure environment - whether you’re experimenting or preparing for production. Because Terraform relies on state management, the workflow starts by helping you choose and configure a backend. Backend Options Option 1: Azure Storage Account as a remote backend A natural fit for Azure‑native and enterprise environments. The experience guides you through creating or selecting a storage account and configuring Terraform to store state securely in Azure. Option 2: HCP Terraform (Terraform Cloud) as a remote backend Ideal for teams already using Terraform Cloud. The guided flow helps you authenticate, connect to an existing organization and workspace, and generate the required backend configuration directly into your Terraform files. Option 3: Temporary workspace for quick validation Designed for learning and experimentation. You can run terraform plan and terraform apply directly in the Azure workspace with temporary state, without committing to a long‑term backend - ideal for quick validation, but not intended for production use. Each option includes an end‑to‑end walkthrough, so you can complete backend setup and run Terraform commands without leaving the VS Code environment or searching through external documentation. Connecting Code, State, and Deployment This experience connects three essential parts of the Terraform workflow: AI‑assisted code generation in Azure Portal Copilot Interactive editing and guided execution in VS Code for the web Flexible backend options for managing Terraform state Together, these pieces make it easier to move from idea to infrastructure in a structured, supported way—whether you’re new to Terraform or managing production workloads with established CI/CD pipelines. Available Now - and What’s Next The Open in VS Code experience for Terraform is now public preview in Azure Portal Copilot. We’re continuing to invest in this workflow, including clearer deployment guidance, future integration with GitHub Actions and other CI/CD pipelines, and deeper enhancements to the full‑screen Terraform editor experience. If you haven’t tried it yet, generate a Terraform configuration with Azure Copilot and open it in VS Code to go from prompt to production end to end in one connected workflow.410Views0likes0CommentsRunning Foundry Agent Service on Azure Container Apps
Microsoft’s Customer Zero blog series gives an insider view of how Microsoft builds and operates Microsoft using our trusted, enterprise-grade agentic platform. Learn best practices from our engineering teams with real-world lessons, architectural patterns, and operational strategies for pressure-tested solutions in building, operating, and scaling AI apps and agent fleets across the organization. Challenge: Scaling agents to production changes the requirements As teams move from experimenting with AI agents to running them in production, the questions they ask begin to change. Early prototypes often focus on whether an agent can reason to generate useful output. But once agents are placed into real systems where they continuously need to serve users and respond to events, new concerns quickly take center stage: reliability, scale, observability, security, and long‑running operations. A common misconception at this stage is to think of an agent as a simple chatbot wrapped around an API. In practice, an AI agent is something very different. It is a service that listens, thinks, and acts, ingesting unstructured inputs, reasoning over context, and producing outputs that may span multiple phases. Treating agents as services means teams often need more than they initially expect: dependable compute, strong security, and real-time visibility to run agents safely and effectively at scale. When we kick off an agent loop, we provide input that informs the context it recalls for the task, the data it connects to, the tools it calls, and the reasoning steps it outlines for itself to generate an output. Agent needs are different from traditional services in hosting, scaling, identity, security, and observability; it’s a product with a probabilistic nature that requires secure, auditable access to many resources at the same lightspeed performance that users expect from any software. This isn’t the first time that the software industry needed to evolve its thinking around infrastructure. When modern application architectures began shifting from monolithic apps toward microservices, existing infrastructure wasn’t built with that model in mind. As systems were reconstructed into independent services, teams quickly discovered they needed new runtime architecture that properly accommodated microservice needs. The modern app era brought new levels of performance, reliability, and scalability of apps, but it also warranted that we rebuild app infrastructure with container orchestration and new operational patterns in mind. AI agents represent a similar inflection. Infrastructure designed for request‑response applications or stateless workloads wasn’t built with long‑running, tool‑calling, AI‑driven workflows in mind. As the builders of Foundry Agent Service, we were very aware that traditional architectures wouldn’t hold up to the bursty agentic workflows that needed to aggregate data across sources, connect to several simultaneous tools, and reason through execution plans for the output that we needed. Rather than building new infrastructure from scratch, the choice for building on Azure Container Apps was clear. With over a million Apps hosted on Azure Container Apps, it was the tried-and-true solution we needed to keep our team focused on building agent intelligence and behavior instead of the plumbing underneath. Solution: Building Foundry Agent Service on a resilient agent runtime foundation Foundry Agent Service is Microsoft’s fully managed platform for building, deploying, and scaling AI agents as production services. Builders start by choosing their preferred framework or immediately building an agent inside Foundry, while Foundry Agent Service handles the operational complexity required to run agents at scale. Let’s use the example of a sales agent in Foundry Agent Service. You might have a salesperson who prompts a sales agent with “Help me prepare for my upcoming meeting with customer Contoso.” The agent is going to kick off several processes across data and tools to generate the best answer: Work IQ to understand Teams conversations with Contoso, Fabric IQ for current product usage and forecast trends, Foundry IQ to do an AI search over internal sales materials, and even GitHub Copilot SDK to generate and execute code that can draft PowerPoint and Word artifacts for the meeting. And this is just one agent; more than 20,000 customers rely on Foundry Agent Service. At the core of Foundry Agent Service is a dedicated agent runtime through Azure Container Apps that explicitly meets our demands for production agents. Agent runtime through flexible cloud infrastructure allows builders to focus on making powerful agent experiences without worrying about under-the-hood compute and configurations. This runtime is built around five foundational pillars: Fast startup and resume. Agents are event‑driven and often bursty. Responsiveness depends on the ability to start or resume execution quickly when events arrive. Built‑in agent tool execution. Agents must securely execute tool calls like APIs, workflows, and services as part of their reasoning process, without fragile glue code or ad‑hoc orchestration. State persistence and restore. Many agent workflows are long‑running and multi‑phase. The runtime must allow agents to reason, pause, and resume with safely preserved state. Strong isolation per agent task. As agents execute code and tools dynamically, isolation is critical to prevent data leakage and contain blast radius. Secure by default. Identity, access, and execution controls are enforced at the runtime layer rather than bolted on after the fact. Together, these pillars define what it means to run AI agents as first‑class production services. Impact: How Azure Container Apps powers agent runtime Building and operating agent infrastructure from scratch introduces unnecessary complexity and risk. Azure Container Apps has been pressure‑tested at Microsoft scale, proving to be a powerful, serverless foundation for running AI workloads and aligns naturally with the needs of agent runtime. It provides serverless, event‑driven scaling with fast startup and scale‑to‑zero, which is critical for agents with unpredictable execution patterns. Execution is secure by default, with built‑in identity, isolation, and security boundaries enforced at the platform layer. Azure Container Apps natively supports running MCP servers and executing full agent workflows, while Container Apps jobs enable on‑demand tool execution for discrete units of work without custom orchestration. For scenarios involving AI‑generated or untrusted code, dynamic sessions allow execution in isolated sandboxes, keeping blast radius contained. Azure Container Apps also supports running model inference directly within the container boundary, helping preserve data residency and reduce unnecessary data movement. Learnings for your agent runtime foundation Make infrastructure flexible with serverless architecture. AI systems move too fast to create infrastructure from scratch. With bursty, unpredictable agent workloads, sub‑second startup times and serverless scaling are critical. Simplify heavy lifting. Developers should focus on agent behavior, tool invocation, and workflow design instead of infrastructure plumbing. Using trusted cloud infrastructure, pain points like making sure agents run in isolated sandboxes, properly applying security policy to agent IDs, and ensuring secure connections to virtual networks are already solved. When you simplify the operational overhead, you make it easier for developers to focus on meaningful innovation. Invest in visibility and monitoring. Strong observability enables faster iteration, safer evolution, and continuous self‑correction for both humans and agents as systems adapt over time. Want to learn more? Learn about building and hosting agents with Foundry Agent Service Discover agent runtime through Azure Container Apps Read about best practices for managing agents168Views0likes0CommentsMoving Beyond Prompts: A Practical Introduction to Spec-Driven Development
In the last year, many of us have started writing code differently. We describe what we want, let AI generate an answer, review it, tweak the prompt, and try again. This loop—prompt, retry, adjust—has quietly become part of our daily workflow. At first, it feels incredibly productive. But as the complexity of the task increases, something changes. The iteration cycle becomes longer, outputs become inconsistent, and the effort shifts from solving the problem to refining the prompt. This is where a subtle but important shift in approach can help: moving from prompt-driven development to spec-driven development. The Problem: Prompt → Retry → Guess Most AI-assisted workflows today look something like this: Write a prompt describing the task Review the generated output Adjust the prompt Repeat until it looks acceptable In practice, this often simplifies to: Prompt → Retry → Guess Figure: Prompt-driven vs spec-driven workflow comparison For simple tasks, this works well. But for anything involving multiple inputs, constraints, or edge cases, the process can become unpredictable. In my experience, the challenge is not the model—it is the lack of structure in how we describe the problem. A Shift in Thinking: From Prompts to Specifications Instead of asking AI to “figure it out,” spec-driven development introduces a simple idea: Define the problem clearly before asking for a solution. A specification (spec) is not a long document—it is a structured way of describing: Inputs Outputs Constraints Edge cases When this structure is provided upfront, the interaction changes significantly. Rather than iterating on vague prompts, you are guiding the system with a clear contract. What This Looks Like in Practice Let’s take a simple example: an order summary API (for example, a backend service hosted on Azure App Service). Without a Spec (Typical Prompt) “Write an API that returns order details for a user.” A model can generate something reasonable, but in practice, the responses often vary: Field names may be inconsistent Pagination may be missing Edge cases (no orders, large datasets) may not be handled Structure may change across iterations Example response (typical output): { "userId": 123, "orders": [ { "id": 1, "amount": 250 } ] } With a Spec (Structured Input) Now consider providing a simple specification: Specification: Input: userId page pageSize Output: userId orders[] orderId totalAmount orderDate pagination page pageSize totalRecords Constraints: Default pageSize = 10 Return empty list if no orders Handle large datasets efficiently Example response (based on the spec): { "userId": 123, "orders": [ { "orderId": 1, "totalAmount": 250, "orderDate": "2024-01-10" } ], "pagination": { "page": 1, "pageSize": 10, "totalRecords": 50 } } Why This Tends to Work The difference here is not just stylistic—it is structural. An unstructured prompt leaves room for interpretation. A spec reduces ambiguity by defining expectations explicitly. In practice, I have observed that providing structured inputs like this often leads to the following: More consistent field naming Better handling of edge cases Reduced need for repeated prompt refinement Rather than relying on trial-and-error, the interaction becomes more predictable and aligned with expectations. Applying This to Existing Code (Refactor Scenario) This approach becomes even more useful when applied to existing code. Instead of asking: “Fix the bug in the Auth controller” You can define expected behavior: Input validation rules Response formats Error handling Authorization behavior The task then becomes aligning the implementation with the defined spec. This shifts the interaction from guesswork to validation—comparing current behavior with intended behavior. Example Comparison (Auth Scenario) Without Spec (Typical Prompt) “Fix the login issue in Auth controller” Possible outcomes include: Partial validation added Inconsistent error responses No clear handling of repeated failed attempts With Spec (Defined Behavior) Spec defines: Validate username and password Return consistent error responses Lock account after 5 failed attempts Do not expose internal errors Resulting behavior: Input validation is consistently applied Error responses follow a defined structure Edge cases like account lockout are handled explicitly This mirrors the same pattern seen in the API example—moving from ambiguity to clearly defined behavior. A Practical Way to Start You do not need new tools or frameworks to try this. A simple workflow that has worked well in practice: Ask – Describe the problem (prompt, discussion, or notes) Write a spec – Define inputs, outputs, constraints Refine – Remove ambiguity Generate – Use the spec as input Validate – Compare output with the spec This adds a small upfront step, but it often reduces back-and-forth iterations later. The Practical Challenge One important point to note: Writing a good spec requires understanding the problem. Spec-driven development does not eliminate complexity—it surfaces it earlier. In many cases, the hardest part is not writing code, but clearly defining: What the system should do What it should not do How it should behave under edge conditions This is also why specs evolve over time. They do not need to be perfect upfront. They improve as your understanding improves. Where This Approach Helps From what I have seen, this approach is most useful in scenarios where the problem involves multiple inputs, defined contracts, or structured outputs such as APIs, schema-driven systems, or refactoring existing code where consistency matters. Where It May Not Be Necessary For simpler tasks such as small scripts, minor UI changes, or quick experiments, a detailed specification may not add much value. In those cases, a straightforward prompt is often sufficient. A Note on Tools Tools like GitHub Copilot, Azure AI Studio, and AI-assisted workflows in Visual Studio Code tend to be more effective when given clear, structured inputs. Spec-driven development is not tied to any specific tool. It is a way of thinking about how we interact with these systems more effectively. References https://github.com/features/copilot https://platform.openai.com/docs/guides/prompt-engineering https://github.com/github/spec-kit Amplifier - Modular AI Agent Framework - Amplifier Final Thoughts Many discussions around AI-assisted development focus on what tools can do. This approach focuses on something slightly different: How developers can structure problems more effectively before implementation. In my experience, moving from prompts to specs does not eliminate iteration, but it makes that iteration more predictable and purposeful.High Availability Testing for Azure Kubernetes Service in a Single Region with Availability Zones
Perform repeatable high availability testing for Azure Kubernetes Service workloads deployed in a single Azure region with Availability Zones. Validate pod recovery, node disruptions, and workload resiliency during infrastructure failures.Azure SRE Agent for Azure Monitor Alerts: Reduce Alert Fatigue, Investigate What Matters
The Alert Problem Organizations running Azure Monitor tend to land in one of two situations: Alert fatigue has set in. Alert rules tend to grow over time — a CPU threshold from two years ago, a health probe check from a migration, a disk alert from an outage that never got cleaned up. These rules fire regularly, most auto-resolve, and nobody investigates them. But buried in that noise are real incidents that go unnoticed until they escalate. Teams respond, but the effort is repetitive. Engineers triage the same alerts repeatedly — running the same diagnostic queries, confirming the same "transient spike, no action needed" conclusion. They know the rule is noisy, but fixing it in Azure Monitor requires data they don't have readily available: What should the threshold be? What's the auto-resolution rate? Is it safe to change? So the noisy rule stays, and the manual toil continues. Both situations share the same gap: there's no intelligent layer between Azure Monitor and the team. Azure SRE Agent fills that gap — it receives alert fires in real time, investigates them automatically, consolidates noisy ones, and surfaces the data your team needs to improve the rules at the source. Here's how to set it up. 1. Intelligent Alert Handling: Cooldown and Response Plan Configuration 1.a. Alert Reinvestigation Cooldown The most impactful configuration for Azure Monitor alerts is the new reinvestigation cooldown. This is a per-response-plan setting that controls how the agent handles repeated fires of the same alert rule. When an alert rule fires and the agent already has an active thread for that rule, it merges the new fire into the existing thread — no new investigation, no duplicate work. What makes this especially useful: if the previous thread was resolved or closed within the cooldown window, the agent reopens it and appends the new fire rather than starting a fresh investigation. This catches the common "it fired, we resolved it, it fired again 30 minutes later" pattern that generates the most duplicate effort. To configure it: Navigate to your AzMonitor response plan and look for the "Alert reinvestigation cooldown" section in the Save step. It's enabled by default with a 3-hour window — a default chosen because most noisy alert rules re-fire within a 1–3 hour cycle, making this window broad enough to catch recurring patterns while short enough that a genuinely new issue several hours later still gets a fresh investigation. To disable the cooldown entirely — for critical alerts where every fire demands a fresh investigation — uncheck the merge toggle: You can adjust the window between 1 and 24 hours depending on the alert pattern: Alert Pattern Recommended Window Frequent polling-based alerts (health probes, heartbeats) 1–2 hours Recurring issues tied to daily batch jobs or deploy cycles 6–12 hours Intermittent failures with unpredictable recurrence 12–24 hours Critical alerts where every fire demands a fresh look Disable the cooldown entirely 1.b. Segmenting Alerts with Response Plans The cooldown works best when paired with tiered response plans that route alerts by severity and title keyword. Rather than one catch-all plan for all alert types, create separate plans that match the right investigation depth to the right alerts. Critical alerts (Sev0–1, titles containing "failover", "security", "data loss") — disable cooldown. Every fire gets a fresh investigation because a repeat fire here likely means the first remediation didn't hold. Operational alerts (Sev2, titles containing "high CPU", "memory pressure", "latency") — set a 6-hour cooldown. These are real issues, but recurring fires within a few hours are almost always the same root cause. The agent consolidates them into one thread while still giving a genuinely new occurrence later in the day a fresh look. Low-priority alerts (Sev3–4, titles containing "health probe", "availability test") — set a short 1-hour cooldown. These rarely require deep investigation. The agent captures context without spending effort on redundant analysis. Informational alerts — don't create a response plan at all. These are telemetry, not incidents. This tiering works regardless of which agent mode (Autonomous or Review) your team uses. The value comes from the cooldown and severity segmentation — agent mode is a separate decision based on your team's comfort level with autonomous remediation. To see the difference this makes in practice: we deployed a web app with Azure Monitor alert rules and induced real failures. Azure Monitor fired 9 alerts across three rule types over a few hours. The agent consolidated them based on each response plan's cooldown: Alert Rule Response Plan Merge Setting AzMon Fires Agent Threads Total Alerts (in thread) What Happened High Response Time (Sev3) low-priority-alerts Merge ON, 4h cooldown 3 1 4 All 4 fires merged into a single thread — the agent investigated once and appended recurring fires HTTP 5xx Errors (Sev2) critical-alerts-no-merge Merge OFF 3 3 1 each Each fire created its own investigation — appropriate for critical alerts where every occurrence matters High CPU (Sev2) operational-alerts Merge ON, 1h cooldown 2 2 1 each Fires were >1 hour apart (resolved at 12:05, re-fired at 3:37) — outside the cooldown window, so the agent correctly treated them as separate incidents The key insight: the same 9 Azure Monitor alerts produced different agent behavior depending on the response plan configuration. The High Response Time rule demonstrates the merge path saving 3 redundant investigations. The HTTP 5xx rule shows merge disabled for critical alerts. And the High CPU rule shows what happens when the cooldown window is too short for the alert's recurrence pattern — a signal to increase the window. 2. Proactive Noise Monitoring: Let the Agent Analyze Its Own Patterns Handling alerts intelligently is the first step. The next is having the agent proactively surface insights about your alert landscape so your team can improve the rules at the source — which is the data that Category 2 teams in our intro are missing. 2.a. Weekly Alert Hygiene Report Create a weekly scheduled task with instructions like: Analyze all Azure Monitor alert threads from the past 7 days. For each alert rule that fired more than 3 times, produce a ranked report covering: High Auto-Resolution Rules: Rules with high auto-resolution rates. Recommend threshold changes or suppression windows. Rules with Recurring Root Causes: Rules where the same root cause recurs. Recommend permanent remediation actions. Miscategorized Severity: Rules where investigation concludes low impact but the alert is Sev1/Sev2. Recommend severity adjustment. Cost Summary: Estimated effort consumed per alert rule this week. This creates a compounding feedback loop. Week over week, your team has a concrete, data-backed list of which alert rules to adjust in Azure Monitor — complete with specific recommendations. The data that was too time-consuming to gather manually is now generated automatically. 2.b. Monthly Threshold Audit For a deeper analysis, schedule a monthly task: Audit Azure Monitor alert rules for this agent's subscriptions. For each rule: Query the rule's metric history over 30 days Compare current threshold vs. actual P50, P90, and P99 values Flag rules with threshold below P50 (always firing) or above P99 (never firing) For high-frequency rules with high auto-resolution, recommend a threshold at P95 to reduce fires while still catching genuine anomalies Produce: a threshold optimization table, dormant rules (no fires in 30+ days), and specific Azure CLI commands to update each rule. This is the highest-leverage outcome because it fixes noise at the source. A single threshold adjustment on one noisy rule can eliminate hundreds of alert fires per month — permanently. And the agent provides the data and specific commands to make it happen. What This Means for Agent Costs Each alert investigation consumes LLM tokens — for reasoning, querying, and building analysis. Without thoughtful configuration, a high-volume alert pipeline can lead to higher agent costs than expected. The setup described in this post naturally keeps token usage in check: the cooldown prevents redundant investigations, tiered response plans match effort to alert importance, and low-priority alerts get minimal attention. For additional control, you can optionally add a PostToolUse hook that nudges the agent to include time-range filters in Log Analytics queries — preventing large, unbounded result sets from inflating the conversation context. Since this hook uses a simple regex check on the query text rather than an LLM call, it adds zero token cost of its own. Getting Started Connect Azure Monitor as an incident source in your SRE Agent Enable the reinvestigation cooldown on your response plans (the 3-hour default is a sensible starting point) Create tiered response plans — at minimum, separate critical alerts (cooldown disabled) from operational alerts (cooldown 6h) and low-priority alerts (cooldown 1h) Set up a weekly alert hygiene report as a scheduled task to start building visibility into your alert patterns Add the monthly threshold audit once your weekly reports have a few weeks of data Start with the first three — they take a few minutes each and begin working immediately. Learn More Incident Response Overview — How SRE Agent handles incidents across platforms including Azure Monitor Incident Response Plans — Configuring response plans, filters, severity routing, and cooldown settings Setting Up a Response Plan — Step-by-step tutorial for creating your first response plan Scheduled Tasks — Creating weekly and monthly automated reports Agent Hooks — PostToolUse hooks, command hooks, and governance controls Monitor Agent Usage — Tracking token usage and agent activity Getting Started with Incident Response — Connecting Azure Monitor and configuring your first alert pipeline497Views0likes0CommentsSecuring Your AI Agents Before They Ship: Red Teaming with Microsoft PyRIT
Securing Your AI Agents Before They Ship: Red Teaming with Microsoft PyRIT You wouldn't ship a web app without running OWASP ZAP or Snyk. So why are AI agents going to production without a single security scan? Prompt injection, data leakage, system prompt theft — the OWASP Top 10 for LLM Applications reads like a checklist of things most teams haven't tested for. PyRIT is Microsoft's open-source answer: an automation framework battle-tested on 100+ products including Copilot. But here's the catch — PyRIT is a research library. To make it work in a real engineering workflow, you need to wrap it. This post shows you how. In this post: Why AI red teaming is fundamentally different from traditional security testing What PyRIT gives you out of the box How to build a thin wrapper that turns PyRIT into a config-driven, pipeline-ready scanner When and how to plug it into your CI/CD workflow Customizing every step for your threat model 🛡️ Why AI Red Teaming Is Different If you're building agentic AI — systems that reason, call tools, and take actions — you already know that traditional security testing doesn't cut it. Microsoft's AI Red Team learned this the hard way after red-teaming 100+ generative AI products. Three things make AI red teaming unique: You're testing two risk surfaces at once — security vulnerabilities (prompt injection, data exfiltration) *and* responsible AI harms (bias, toxicity, manipulation). Traditional pen testers focus on one. Outputs are probabilistic — the same prompt can produce different responses across runs. You can't just assert on a fixed output. You need automated scoring at scale. Every architecture is different — standalone chatbots, RAG pipelines, multi-agent workflows, tool-calling agents. A single test harness has to flex across all of them. The OWASP LLM Top 10 (2025) gives us the taxonomy — prompt injection, sensitive information disclosure, excessive agency, system prompt leakage, data poisoning, supply chain risks, improper output handling, embedding weaknesses, misinformation, and unbounded consumption. Every AI agent you deploy is exposed to all ten. The question is whether *you* discover the gaps or your users do. 🔧 What PyRIT Gives You PyRIT (Python Risk Identification Tool) started as internal scripts at Microsoft in 2022. Today it's a 3,800-star, MIT-licensed framework with 129 contributors and a published paper. "We were able to pick a harm category, generate several thousand malicious prompts, and use PyRIT's scoring engine to evaluate the output from the Copilot system — all in the matter of hours instead of weeks." — Microsoft Security Blog The building blocks: 53+ datasets — AIRT, HarmBench, AdvBench, XSTest, and more. Curated adversarial prompts covering content harms, jailbreaks, data exfiltration, and social bias. 70+ prompt converters — Base64, ROT13, Leetspeak, Unicode confusables, LLM-powered rephrasing, translation, multimodal injection. They stack — a prompt can be translated, then Base64-encoded, then embedded in an image. 6 attack strategies — from simple `PromptSendingAttack` (single-turn) to `CrescendoAttack` (gradual escalation), `TreeOfAttacksWithPruning` (TAP), and multi-turn dialogue attacks. 20+ scorers — LLM-as-judge, Azure AI Content Safety, true/false classifiers, Likert scales. 10+ targets — OpenAI, Azure, HuggingFace, HTTP endpoints, Playwright, WebSockets. This is powerful — PyRIT gives you the components — datasets, converters, attack strategies, scorers — but not the glue. You still need something that loads a config, wires the right components together, runs attacks, scores the results, and tells your pipeline pass or fail. That's what a wrapper does. 🏗️ Building an Enterprise Wrapper The idea is simple: take PyRIT's primitives and compose them into an opinionated, config-driven pipeline that any developer can run with a single command. Below is given the idea on how we can create the wrapper around PyRIT to make it useful for agentic ai security testing, but this is not limited. The Flow Everything starts with a YAML config and ends with a pass/fail exit code: The key insight: every step in this pipeline is configurable through YAML, not code. Switching attack strategies, adding converters, adjusting thresholds — it's all one config edit away. Project Structure At POC level, the wrapper is lean — an orchestrator that stitches PyRIT's components together: The heart of it is `runner.py` — a single orchestrator that: Reads the YAML config Dynamically imports your agent and wraps it as a PyRIT target Loads datasets (built-in + your custom prompts) Runs attacks in escalating phases — plain → encoded → semantic → multi-turn Scores every response using LLM-as-judge Maps findings to OWASP LLM Top 10 categories Generates reports (JSON, Markdown) Evaluates the release gate — pass or fail That's it. One file doing the orchestration, a handful of helpers, and a YAML config. You can start here and grow the structure as your needs evolve. The Configuration A single YAML file drives everything: scan: name: "my-agent-security-scan" # Point to your agent — any Python callable target: module: "my_agent.main" function: "handle_prompt" # Attack strategies to run strategies: - name: "prompt_sending" # Single-turn baseline - name: "crescendo" # Gradual escalation max_turns: 5 # Prompt transformations — they stack automatically converters: - name: "Base64Converter" - name: "LeetspeakConverter" - name: "TranslationConverter" # Adversarial prompt sets datasets: - "airt_illegal" # Built-in: Microsoft AI Red Team - "harmbench" # Built-in: harm evaluation benchmark - path: "datasets/custom_prompts.yaml" # Release gate thresholds gate: fail_on: critical_findings: 0 # Zero tolerance for critical high_findings: 3 max_score: 0.3 # Defense rate must exceed 70% Point the target at your agent, pick your strategies and datasets, set your thresholds — run. Teams can start scanning their agents in an afternoon, not weeks. 🔄 Plugging Into Your Pipeline Since the wrapper is a pip-installable package(we can use setup tools or poetry to build and make it pip installable), integrating it into any CI/CD system is straightforward — `pip install`, then call the CLI. No custom actions or marketplace extensions needed. The key decision is when to run scans. Not every merge needs a full red team pass. Here's what works in practice: The idea is that developers can optionally run quick scans locally as a fast feedback loop, while full scans are manually triggered or approval-gated — the tech lead or architect decides when it's worth running a comprehensive assessment based on the nature of the changes. Since it's just a CLI, integration is the same everywhere — GitHub Actions, Azure DevOps, Jenkins, or a shell script. Install the package, call `pyrit-scan run`, check the exit code. ⚙️ Customization Without Forking The whole point of a wrapper is that teams customize behavior through configuration — not by modifying framework code. What to Customize How Example Which agent to test Point target.module + target.function in YAML to any Python callable Your chatbot, RAG pipeline, or multi-agent workflow Attack strategies Add/remove entries under strategies in YAML Start with prompt_sending , add crescendo when ready Prompt transformations List converters in YAML — they stack automatically Base64 → Leetspeak → Translation = multi-phase evasion Datasets Use built-in (53+) or add custom YAML prompt files HIPAA prompts, financial compliance scenarios Scoring thresholds Set per-OWASP-category thresholds in gate.fail_on Zero tolerance for data leakage (LLM02), relaxed for misinformation (LLM09) Report formats List formats in reporting.formats JSON for automation, PDF for compliance, JUnit for dashboards New attack classes Register via custom_attacks in YAML — module + class name No framework code change, no PR needed 🎯 Start Red Teaming Today AI red teaming isn't a nice-to-have anymore. If you're shipping agentic AI — systems that call tools, access data, and take actions on behalf of users — you need automated security testing in your pipeline. PyRIT gives you the primitives. A thin wrapper gives you the automation. Together, they turn AI security from a one-off exercise into a continuous, measurable practice. The pattern: YAML config → wrap your agent → run attacks → score → map to OWASP → gate the release. Build it once. Run it on every release. Sleep better. Resources PyRIT on GitHub — source code, docs, and community PyRIT Documentation — getting started guides and API reference OWASP LLM Top 10 (2025) — the industry standard risk taxonomy Microsoft AI Red Team Hub — threat models, bug bars, and best practices 3 Takeaways from Red Teaming 100 Products — lessons learned at scale PyRIT Launch Blog — origin story and key design decisions PyRIT Paper (arXiv) — the academic paper722Views0likes0CommentsMicrosoft’s New In‑House AI Models (MAI‑Transcribe, MAI‑Voice, MAI‑Image)
What Are the New MAI Models? MAI‑Transcribe‑1 (Speech‑to‑Text) MAI‑Transcribe‑1 is Microsoft’s first‑generation in‑house speech recognition model. It supports 25 languages and is optimized for real‑world, noisy enterprise audio, such as meetings and call centers. Key highlights Enterprise‑grade transcription accuracy Designed for multilingual and accented speech Lower GPU cost compared to prior Azure speech offerings MAI‑Voice‑1 (Text‑to‑Speech) MAI‑Voice‑1 is a high‑fidelity voice generation model capable of producing natural, expressive speech while preserving speaker identity over long‑form audio. Key highlights Generates up to 60 seconds of audio in ~1 second Supports custom voice creation Optimized for voice agents and conversational systems MAI‑Image‑2 (Text‑to‑Image) MAI‑Image‑2 is Microsoft’s highest‑capability text‑to‑image model, already ranking among top image models used in production Copilot experiences. Key highlights High‑quality photorealistic image generation Accurate in‑image text rendering Production‑ready latency and cost profile Why This Matters for Azure Developers For Azure developers, this launch changes three things fundamentally: First‑party AI stack Developers can now build speech, voice, and image workloads without relying on external AI providers. Enterprise‑ready by default These models inherit Azure RBAC, Managed Identity, compliance, and governance through Microsoft Foundry. Agent‑first design MAI models are designed to be embedded inside AI agents, not just called as single APIs Below is a common enterprise architecture using MAI models. Sample Code Calling MAI‑Transcribe‑1: What Changed with MAI Models: Before vs After (Developer Perspective) Microsoft’s MAI models are not just new endpoints — they represent a fundamental shift in how Azure developers build multimodal and agent‑based AI solutions. High‑Level Comparison Aspect Before MAI (Azure & External Models) After MAI (MAI‑Transcribe, Voice, Image) Model Ownership Heavy dependency on third‑party models (OpenAI, external TTS/STT providers) First‑party Microsoft‑built models, operated and optimized by Microsoft Enterprise Integration AI models integrated into Azure AI models native to Microsoft Foundry Governance & Compliance Mixed controls depending on model provider Unified Azure RBAC, Entra ID, Purview, Managed Identity Agent Readiness Primarily single‑request / single‑response APIs Designed for agent‑oriented, long‑running workflows Cost Predictability Token‑based or mixed pricing models Enterprise‑optimized price‑to‑performance models Operational Consistency Different SDKs, APIs, quotas Single Foundry tooling and SDK surface927Views0likes0CommentsStop Experimenting, Start Building: AI Apps & Agents Dev Days Has You Covered
The AI landscape has shifted. The question is no longer “Can we build AI applications?” it’s “Can we build AI applications that actually work in production?” Demos are easy. Reliable, scalable, resilient AI systems that handle real-world complexity? That’s where most teams struggle. If you’re an AI developer, software engineer, or solution architect who’s ready to move beyond prototypes and into production-grade AI, there’s a series built specifically for you. What Is AI Apps & Agents Dev Days? AI Apps & Agents Dev Days is a monthly technical series from Microsoft Reactor, delivered in partnership with Microsoft and NVIDIA. You can explore the full series at https://developer.microsoft.com/en-us/reactor/series/s-1590/ This isn’t a slide deck marathon. The series tagline says it best: “It’s not about slides, it’s about building.” Each session tackles real-world challenges, shares patterns that actually work, and digs into what’s next in AI-driven app and agent design. You bring your curiosity, your code, and your questions. You leave with something you can ship. The sessions are led by experienced engineers and advocates from both Microsoft and NVIDIA, people like Pamela Fox, Bruno Capuano, Anthony Shaw, Gwyneth Peña-Siguenza, and solutions architects from NVIDIA’s Cloud AI team. These aren’t theorists; they’re practitioners who build and ship the tools you use every day. What You’ll Learn The series covers the full spectrum of building AI applications and agent-based systems. Here are the key themes: Building AI Applications with Azure, GitHub, and Modern Tooling Sessions walk through how to wire up AI capabilities using Azure services, GitHub workflows, and the latest SDKs. The focus is always on code-first learning, you’ll see real implementations, not abstract architecture diagrams. Designing and Orchestrating AI Agents Agent development is one of the series’ strongest threads. Sessions cover how to build agents that orchestrate long-running workflows, persist state automatically, recover from failures, and pause for human-in-the-loop input, without losing progress. For example, the session “AI Agents That Don’t Break Under Pressure” demonstrates building durable, production-ready AI agents using the Microsoft Agent Framework, running on Azure Container Apps with NVIDIA serverless GPUs. Scaling LLM Inference and Deploying to Production Moving from a working prototype to a production deployment means grappling with inference performance, GPU infrastructure, and cost management. The series covers how to leverage NVIDIA GPU infrastructure alongside Azure services to scale inference effectively, including patterns for serverless GPU compute. Real-World Architecture Patterns Expect sessions on container-based deployments, distributed agent systems, and enterprise-grade architectures. You’ll learn how to use services like Azure Container Apps to host resilient AI workloads, how Foundry IQ fits into agent architectures as a trusted knowledge source, and how to make architectural decisions that balance performance, cost, and scalability. Why This Matters for Your Day Job There’s a critical gap between what most AI tutorials teach and what production systems actually require. This series bridges that gap: Production-ready patterns, not demos. Every session focuses on code and architecture you can take directly into your projects. You’ll learn patterns for state persistence, failure recovery, and durable execution — the things that break at 2 AM. Enterprise applicability. The scenarios covered — travel planning agents, multi-step workflows, GPU-accelerated inference — map directly to enterprise use cases. Whether you’re building internal tooling or customer-facing AI features, the patterns transfer. Honest trade-off discussions. The speakers don’t shy away from the hard questions: When do you need serverless GPUs versus dedicated compute? How do you handle agent failures gracefully? What does it actually cost to run these systems at scale? Watch On-Demand, Build at Your Own Pace Every session is available on-demand. You can watch, pause, and build along at your own pace, no need to rearrange your schedule. The full playlist is available at This is particularly valuable for technical content. Pause a session while you replicate the architecture in your own environment. Rewind when you need to catch a configuration detail. Build alongside the presenters rather than just watching passively. What You’ll Walk Away Wit After working through the series, you’ll have: Practical agent development skills — how to design, orchestrate, and deploy AI agents that handle real-world complexity, including state management, failure recovery, and human-in-the-loop patterns Production architecture patterns — battle-tested approaches for deploying AI workloads on Azure Container Apps, leveraging NVIDIA GPU infrastructure, and building resilient distributed systems Infrastructure decision-making confidence — a clearer understanding of when to use serverless GPUs, how to optimise inference costs, and how to choose the right compute strategy for your workload Working code and reference implementations — the sessions are built around live coding and sample applications (like the Travel Planner agent demo), giving you starting points you can adapt immediately A framework for continuous learning — with new sessions each month, you’ll stay current as the AI platform evolves and new capabilities emerge Start Building The AI applications that will matter most aren’t the ones with the flashiest demos — they’re the ones that work reliably, scale gracefully, and solve real problems. That’s exactly what this series helps you build. Whether you’re designing your first AI agent system or hardening an existing one for production, the AI Apps & Agents Dev Days sessions give you the patterns, tools, and practical knowledge to move forward with confidence. Explore the series at https://developer.microsoft.com/en-us/reactor/series/s-1590/ and start watching the on-demand sessions at the link above. The best time to level up your AI engineering skills was yesterday. The second-best time is right now and these sessions make it easy to start.