# Platform Improvements for Python AI Apps on Azure App Service
## Overview

Azure App Service (Linux) is a fully managed PaaS offering that supports a broad range of languages, including Python, Node.js, .NET, PHP, and Java. Developers can push source code or deploy a pre-built artifact; the platform handles the rest, including dependency installation, application containerization, and running the application at cloud scale.

More customers are building intelligent applications using Azure AI Foundry and other AI services, and Python has become a language of choice for these workloads. The performance and reliability of the Python deployment pipeline directly shape the developer's experience on the platform, so we looked across the deployment path for opportunities to reduce latency and improve reliability. The first set of changes has reduced Python deployment latency on Azure App Service Linux by approximately 30%. This is the first step in a broader effort to make the platform better suited for AI application development, but the gains resulting from this effort will benefit all apps on the platform. Let's look at the details.

## Where Deployment Time Was Going

Python web application deployments on Azure App Service Linux rely on Oryx, the platform's open-source build system, to produce runnable artifacts during remote builds. Platform telemetry showed that around 70% of Python app deployments use remote builds, and the majority of those resolve dependencies via requirements.txt using `pip install`.

To understand where time was going, we profiled a stress workload: a 7.5 GB PyTorch application. Most production builds are smaller, but stress-testing a dependency-heavy application made the pipeline bottlenecks clear. When a Python app is deployed via remote build, the build container in Kudu (the App Service deployment service) runs Oryx to:

1. Extract the uploaded source code.
2. Create a Python virtual environment.
3. Install dependencies via `pip install`: 4.35 min (~34% of build time).
4. Copy files to a staging directory: 0.98 min (~8%).
5. Compress via tar + gzip into an archive: 7.53 min (~58%).
6. Write the archive to /home (an Azure Storage SMB mount).

The app container then extracts this archive to the local disk on every cold start.

### Why the Archive-Based Approach?

The /home directory is backed by an Azure Storage SMB mount, where small-file I/O is comparatively expensive. Python dependencies are file-heavy: virtual environments commonly contain tens of thousands of files, and dependency-heavy ML applications can exceed 200,000 files. Writing those files individually over SMB would be prohibitively slow. Instead, the pipeline builds on the container's local filesystem, writes a single compressed archive over SMB, and the app container extracts it locally on startup for efficient module loading.

**Key insight:** Compression was the single largest phase at 58% of build time, longer than installing the packages themselves.

## What We Changed

### Zstandard Compression (Replacing gzip)

Standard gzip compression is single-threaded. In our benchmark, compression accounted for 58% of total build time, making it the dominant bottleneck. Because the archive is also decompressed during container startup, decompression time affects runtime startup latency as well. We evaluated three compression algorithms: gzip, LZ4, and Zstandard (zstd).
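For intuition, here's a minimal sketch of how you could reproduce this kind of comparison locally on any large, file-heavy directory. The directory path and archive names are placeholders, and `zstd` must be installed:

```bash
SRC=./antenv   # placeholder: a virtual environment or build directory

# gzip is single-threaded; this is the legacy path
time tar -czf app.tar.gz "$SRC"

# zstd via GNU tar's -I option; this is the new path
time tar -I zstd -cf app.tar.zst "$SRC"

# compare the resulting archive sizes
ls -lh app.tar.gz app.tar.zst
```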
The following results are averaged across multiple deployments of a 7.5 GB Python application with PyTorch and additional ML packages:

| Metric | gzip | LZ4 | zstd |
| --- | --- | --- | --- |
| Compression time | 7.53 min | 1.20 min | 1.18 min |
| Decompression time | 2.80 min | 1.18 min | 1.07 min |
| Archive size | 4.0 GB | 5.0 GB | 4.8 GB |

Both zstd and LZ4 were more than 6× faster than gzip for compression and more than 2× faster for decompression. We selected zstd for the following reasons:

- Comparable speed to LZ4, with smaller archive sizes (4.8 GB vs. 5.0 GB).
- Mature ecosystem: the zstd format is specified in RFC 8878, published in 2021, and zstd ships with many common Linux distributions.
- Native tar support: `tar -I zstd` works out of the box; no extra packages required.

**Result:** Compression time dropped from 7.53 min → 1.18 min (6.4× faster). Decompression improved from 2.80 min → 1.07 min (2.6× faster), directly reducing cold-start latency.

### Faster Package Installation with uv

pip is implemented in Python and has historically optimized for compatibility over maximum parallelism. In dependency-heavy workloads, package download, resolution, and installation can become a major part of deployment time. In our 7.5 GB PyTorch benchmark, package installation accounted for ~34% of total build time (4.35 min out of 12.86 min).

We introduced uv, a Python package manager written in Rust, as the primary installer for compatible requirements.txt deployments. Its `uv pip install` interface works with standard pip workflows.

- **Fallback strategy:** Compatibility remains the priority. When uv cannot handle a deployment, the platform retries with pip, preserving the behavior customers already depend on.
- **Cache behavior:** Package caches remain local to the build container. When the same app is deployed again before the Kudu (build) container is recycled, both pip and uv can reuse cached packages and avoid repeated downloads.

**Result:** Package installation time dropped from 4.35 min → 1.50 min (3× faster).

### Reducing File Copy Overhead

A file copy showed up in two places. First, before compression, the build process copied the entire build directory (application code plus Python packages) to a staging location. This existed historically as a safety measure: creating a clean snapshot before tar reads the file tree. But the cost was steep given the large number of files inherent in Python dependencies. The fix was straightforward: create the tar archive directly from the build directory, skipping the intermediate copy entirely.

Second, for pre-built deployment scenarios, we replaced the legacy Kudu sync path with Linux-native rsync. That gave us a better-optimized tool for large Linux file trees and reduced the overhead of moving files into the final deployment location. Because this path is used beyond Python, the improvement benefits pre-built apps across the broader App Service Linux ecosystem.

**Result:** Eliminated the 0.98-minute staging copy (8% of build time), reduced temporary disk usage, and improved the remaining file sync path.

### Pre-Built Python Wheels Cache

We added a complementary optimization: a read-only cache of pre-built wheels for commonly used Python packages, selected using platform telemetry. The cache is mounted into the Kudu build container at runtime for Python workloads, allowing the installer to use local wheel artifacts before downloading packages externally. When a matching wheel is available, the installer uses it directly, avoiding a network fetch for that package. Cache misses fall back to the upstream registry (e.g., PyPI) as usual.
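To make the cache and fallback behavior concrete, here's a hedged sketch of the equivalent pattern expressed by hand. The `/opt/wheels` mount path is a stand-in (the platform's actual mount point isn't documented here), and the platform performs this logic internally rather than as a shell one-liner:

```bash
# Prefer wheels from a local, read-only directory; cache misses still
# fall back to PyPI. If uv cannot handle the install, retry with pip.
uv pip install --find-links /opt/wheels -r requirements.txt \
  || pip install --find-links /opt/wheels -r requirements.txt
```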
The cache is managed by the platform and kept up to date, so supported Python builds can use it without any app change.

## Combined Results

### Controlled Benchmark (PyTorch 7.5 GB, P1mv3 App Service Tier)

The following benchmark was measured on the P1mv3 App Service tier. Values in the "After" column reflect the optimized pipeline with zstd compression, uv package installation, direct tar creation, and the pre-built wheels cache enabled together.

| Phase | Before | After | Improvement |
| --- | --- | --- | --- |
| Package installation | 4.35 min | 1.50 min | ~3× faster |
| File copy | 0.98 min | 0 min | Eliminated |
| Compression | 7.53 min | 1.18 min | ~6× faster |
| Total build time | 12.86 min | ~2.68 min | ~79% reduction |

### Production Fleet (All Python Linux Web Apps)

Production telemetry across Python deployments shows the impact of these changes: deployment latency decreased by approximately 30% after the rollout. The controlled benchmark shows a larger improvement (~79%) because it exercises a dependency-heavy workload where package installation, file copy, and compression dominate total build time. Typical production apps are smaller and spend proportionally less time in those phases.

## Beyond Faster Builds: Reliability and Runtime Performance

Faster builds only help when deployment requests reliably reach a worker that is ready to build. We updated the primary deployment clients (Azure CLI, GitHub Actions, and Azure DevOps Pipelines) to warm up Kudu before initiating deployments. Clients now issue a lightweight health-check request to the Kudu endpoint, helping ensure the deployment container is running and ready before the deployment begins. Clients also preserve affinity to the warmed-up worker using the ARR affinity cookie returned by the first request. This increases the chance that the deployment uses a worker with Kudu already running and local package caches already available from recent deployments. Together, these client-side changes reduced deployment failures from transient infrastructure issues and helped the pipeline optimizations reach the build phase reliably.

**Result:** Deployment failures caused by cold-start errors (502, 503, 499) dropped by ~30%.

We also improved the default runtime configuration for Python apps using the platform-provided Gunicorn startup path. Previously, the platform defaulted to a single worker, leaving most CPU cores idle. Now it follows Gunicorn's recommended worker formula, fully utilizing available cores on multi-core SKUs and delivering higher request throughput out of the box:

```
workers = (2 × NUM_CORES) + 1
```

## Key Takeaways

- **Measure before optimizing:** Platform telemetry showed that remote builds and requirements.txt-based installs were the dominant Python deployment paths, which helped us focus on changes that would benefit the most customers.
- **Compression was the biggest bottleneck:** In the dependency-heavy benchmark, archive compression took longer than package installation. Replacing gzip with zstd reduced both build time and cold-start extraction time.
- **File count matters:** Python virtual environments can contain tens of thousands of files, and AI workloads can contain many more. Reducing unnecessary file copies and using Linux-native file sync helped lower overhead.
- **Compatibility needs a fallback path:** Introducing uv improved the common path, while falling back to pip preserved compatibility for apps that depend on existing Python packaging behavior.
- **Deployment reliability is part of performance:** Faster builds only help if deployment requests consistently reach a ready worker.
  Warm-up and worker affinity made the optimized path more reliable for customers.
- **Beyond deployment:** Runtime defaults, such as Gunicorn worker configuration, also affect how production apps perform once deployment is complete.

Together, these changes made Python deployments faster and more reliable while preserving compatibility through safe fallbacks. We will continue improving the platform to make Azure App Service faster, more reliable, and better suited for AI application development.
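As a footnote to the Gunicorn change described above, here's a minimal sketch of how that formula translates into a startup command. The `app:app` module path and port are placeholders, not the platform's actual startup configuration:

```bash
# Compute Gunicorn's recommended worker count from the core count
NUM_CORES=$(nproc)
WORKERS=$(( 2 * NUM_CORES + 1 ))
gunicorn --workers "$WORKERS" --bind 0.0.0.0:8000 app:app
```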
# Control runtime patch updates with Platform Release Channel on Azure App Service for Linux

Azure App Service for Linux is introducing Platform Release Channel, a new setting that gives you more control over when runtime patch updates are applied to your app. With this feature, you can choose how quickly your app moves to newly rolled-out runtime patches. This helps teams balance two common needs: staying current with the latest security and platform updates, while also having enough time to validate changes before adopting them.

## Why this matters

Runtime patch updates are important because they include fixes, security updates, and platform improvements. However, some production applications need time to validate these updates before moving to the newest available patch. Platform Release Channel gives you that flexibility. You can choose to stay close to the latest patch updates, use the default balanced option, or stay on an extended channel that gives you more time before adopting newer patches.

## How it works

You can configure the Platform release setting from the Stack settings section in the Azure portal. The setting supports three values:

| Channel | Behavior | Recommended for |
| --- | --- | --- |
| Latest | Updates are delivered as soon as they are available | Not intended for production workloads |
| Standard | Default setting. Receives updates at our standard release cadence | Recommended for most production apps |
| Extended | Typically stays one release behind Standard | Apps that need extra time before adopting newer patches |

By default, apps are set to Standard. This gives you additional time to test the latest patch before your app moves to it. Choose Latest when security and immediate access to the newest runtime patches are your priority. Choose Extended when your application needs more validation time before adopting newer patch versions.

## How channels move forward

When a new runtime patch is available, App Service first rolls it out through the Latest channel using a faster release cadence. The same patch then continues through the normal rollout process and becomes available in the Standard channel after it has progressed further through validation and rollout. Extended remains further behind Standard to provide additional validation time for apps that need it.

For example, with the current .NET 10 rollout, the channels look like this:

| Stack | Runtime version | Latest | Standard | Extended |
| --- | --- | --- | --- | --- |
| DOTNETCORE | 10 | 10.0.7 | 10.0.4 | 10.0.2 |

As the .NET 10 rollout progresses, the 10.0.7 patch will first be available through the Latest channel, then move to Standard through the normal rollout cadence. For some stacks, Standard and Extended may currently show the same patch version. This is expected while the release channels are still moving through their rollout cadence. As additional rollout waves progress, the channel versions will separate and reflect the intended behavior for each channel.

## Configure Platform Release Channel

You can configure the release channel in the Azure portal, or by using the Azure CLI.
To move to the latest available patch channel:

```bash
az webapp update \
  --resource-group <resource-group> \
  --name <site-name> \
  --platform-release-channel Latest
```

To use the default channel:

```bash
az webapp update \
  --resource-group <resource-group> \
  --name <site-name> \
  --platform-release-channel Standard
```

To give your app more time before adopting newer patches:

```bash
az webapp update \
  --resource-group <resource-group> \
  --name <site-name> \
  --platform-release-channel Extended
```

## Summary

Platform Release Channel gives you a simple way to control the pace at which runtime patch updates are applied to your Linux apps on Azure App Service. Use Latest when you want the newest available patches as soon as they are rolled out. Use Standard for the default balance between currency and stability. Use Extended when your app needs more validation time before moving to newer runtime patches.
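If you manage many apps, the same documented flag can be applied in bulk. A hedged sketch (the resource group name is a placeholder):

```bash
RG=<resource-group>
# Apply the Extended channel to every web app in the resource group
for app in $(az webapp list --resource-group "$RG" --query "[].name" -o tsv); do
  az webapp update --resource-group "$RG" --name "$app" \
    --platform-release-channel Extended
done
```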
# Calling APIs using private Certificate Authorities from Logic Apps

A colleague reached out to me to help with a customer's Logic App issue. They were trying to make what looked like a simple HTTP request to a Jira API but were getting back an SSL error. The root of the error was that the customer was securing the API endpoint with a certificate generated from a private Certificate Authority in their environment. Because this certificate was not signed by VeriSign, GoDaddy, or another public certificate authority, the Logic App HTTP connector did not trust the issuer and the request failed.

The fix for this is simple as it turns out, but required a bit of reading the manual. First, go to Certificates and load the root CA and any intermediate CAs you might be using to the store. Then, to get the Logic App to load these certs, go to Environment Variables and add a new setting, `WEBSITE_LOAD_ROOT_CERTIFICATES`, and place the thumbprints of the added root and intermediate CAs as a comma-delimited string (hat tip to Glen from our consulting team for figuring that one out). When you save the change, the service will restart and will now trust these private certificates.

It should be noted that Logic Apps Standard is a flavor of Azure App Service, so this fix would also work for a regular App Service or a Function App as well. I hope this helps you out!
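For repeatable deployments, the same setting can also be applied from the command line. A sketch using `az webapp config appsettings set` against the underlying App Service site; the thumbprints are placeholders for your own root and intermediate CAs:

```bash
az webapp config appsettings set \
  --resource-group <resource-group> \
  --name <logic-app-name> \
  --settings WEBSITE_LOAD_ROOT_CERTIFICATES="<rootCaThumbprint>,<intermediateCaThumbprint>"
```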
# Agentic IIS Migration to Managed Instance on Azure App Service

## Introduction

Enterprises running ASP.NET Framework workloads on Windows Server with IIS face a familiar dilemma: modernize or stay put. The applications work, the infrastructure is stable, and nobody wants to be the person who breaks production during a cloud migration. But the cost of maintaining aging on-premises servers, patching Windows, and managing IIS keeps climbing.

Azure App Service has long been the lift-and-shift destination for these workloads. But what about applications that depend on Windows registry keys, COM components, SMTP relay, MSMQ queues, local file system access, or custom fonts? These OS-level dependencies have historically been migration blockers, forcing teams into expensive re-architecture or keeping them anchored to VMs.

Managed Instance on Azure App Service changes this equation entirely. And the IIS Migration MCP Server makes migration guided, intelligent, and safe, with AI agents that know what to ask, what to check, and what to generate at every step.

## What Is Managed Instance on Azure App Service?

Managed Instance on App Service is Azure's answer to applications that need OS-level customization beyond what standard App Service provides. It runs on the PremiumV4 (PV4) SKU with IsCustomMode=true, giving your app access to:

| Capability | What It Enables |
| --- | --- |
| Registry Adapters | Redirect Windows Registry reads to Azure Key Vault secrets, with no code changes |
| Storage Adapters | Mount Azure Files, local SSD, or private VNET storage as drive letters (e.g., D:\, E:\) |
| install.ps1 Startup Script | Run PowerShell at instance startup to install Windows features (SMTP, MSMQ), register COM components, install MSI packages, deploy custom fonts |
| Custom Mode | Full access to the Windows instance for configuration beyond standard PaaS guardrails |

The key constraint: Managed Instance on App Service requires the PV4 SKU with IsCustomMode=true. No other SKU combination supports it.

## Why Managed Instance Matters for Legacy Apps

Consider a classic enterprise ASP.NET application that:

- Reads license keys from HKLM\SOFTWARE\MyApp in the Windows Registry
- Uses a COM component for PDF generation registered via regsvr32
- Sends email through a local SMTP relay
- Writes reports to D:\Reports\ on a local drive
- Uses a custom corporate font for PDF rendering

With standard App Service, you'd need to rewrite every one of these dependencies. With Managed Instance on App Service, you can:

- Map registry reads to Key Vault secrets via Registry Adapters
- Mount Azure Files as D:\ via Storage Adapters
- Enable the SMTP Server via install.ps1
- Register the COM DLL via install.ps1 (regsvr32)
- Install the custom font via install.ps1

Please note that when you are migrating your web applications to Managed Instance on Azure App Service, the majority of use cases require zero application code changes; however, depending on your specific web app, some code changes may be necessary.

## Microsoft Learn Resources

- Managed Instance on App Service Overview
- Azure App Service Documentation
- App Service Migration Assistant Tool
- Migrate to Azure App Service
- Azure App Service Plans Overview
- PremiumV4 Pricing Tier
- Azure Key Vault
- Azure Files
- AppCat (.NET): Azure Migrate Application and Code Assessment

## Why Agentic Migration? The Case for AI-Guided IIS Migration

### The Problem with Traditional Migration

Microsoft provides excellent PowerShell scripts for IIS migration: Get-SiteReadiness.ps1, Get-SitePackage.ps1, Generate-MigrationSettings.ps1, and Invoke-SiteMigration.ps1. They're free, well-tested, and reliable. So why wrap them in an AI-powered system?
Because the scripts are powerful but not intelligent. They execute what you tell them to. They don't tell you what to do. Here's what a traditional migration looks like:

1. Run readiness checks and get a wall of JSON with cryptic check IDs like ContentSizeCheck, ConfigErrorCheck, GACCheck
2. Manually interpret 15+ readiness checks per site across dozens of sites
3. Decide whether each site needs Managed Instance or standard App Service (how?)
4. Figure out which dependencies need registry adapters vs. storage adapters vs. install.ps1 (the "Managed Instance provisioning split")
5. Write the install.ps1 script by hand for each combination of OS features
6. Author ARM templates for adapter configurations (Key Vault references, storage mount specs, RBAC assignments)
7. Wire together PackageResults.json → MigrationSettings.json with correct Managed Instance fields (Tier=PremiumV4, IsCustomMode=true)
8. Hope you didn't misconfigure anything before deploying to Azure

Even experienced Azure engineers find this time-consuming, error-prone, and tedious, especially across a fleet of 20, 50, or 100+ IIS sites.

### What Agentic Migration Changes

The IIS Migration MCP Server introduces an AI orchestration layer that transforms this manual grind into a guided conversation:

| Traditional Approach | Agentic Approach |
| --- | --- |
| Read raw JSON output from scripts | AI summarizes readiness as tables with plain-English descriptions |
| Memorize 15 check types and their severity | AI enriches each check with title, description, recommendation, and documentation links |
| Manually decide Managed Instance vs App Service | recommend_target analyzes all signals and recommends with confidence + reasoning |
| Write install.ps1 from scratch | generate_install_script builds it from detected features |
| Author ARM templates manually | generate_adapter_arm_template generates full templates with RBAC guidance |
| Wire JSON artifacts between phases by hand | Agents pass readiness_results_path → package_results_path → migration_settings_path automatically |
| Pray you set PV4 + IsCustomMode correctly | Enforced automatically; every tool validates Managed Instance constraints |
| Deploy and find out what broke | confirm_migration presents a full cost/resource summary before touching Azure |

The core value proposition: the AI knows the Managed Instance provisioning split. It knows that registry access needs an ARM template with Key Vault-backed adapters, while SMTP needs an install.ps1 section enabling the Windows SMTP Server feature. You don't need to know this. The system detects it from your IIS configuration and AppCat analysis, then generates exactly the right artifacts.

### Human-in-the-Loop Safety

Agentic doesn't mean autonomous. The system has explicit gates:

- Phase 1 → Phase 2: "Do you want to assess these sites, or skip to packaging?"
- Phase 3: "Here's my recommendation: Managed Instance for Site A (COM + Registry), standard for Site B. Agree?"
- Phase 4: "Review MigrationSettings.json before proceeding"
- Phase 5: "This will create billable Azure resources. Type 'yes' to confirm"

The AI accelerates the workflow; the human retains control over every decision.
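To illustrate the Phase 5 gate in the plainest possible terms, here's a hedged shell sketch of the pattern the agents enforce. The summary text is invented, and invoking Microsoft's script directly like this (including any parameters it takes) is an assumption for illustration, not how the MCP server actually calls it:

```bash
echo "Plan: 2 App Service Plans, 2 sites, PremiumV4 with IsCustomMode=true"
read -r -p "This will create billable Azure resources. Type 'yes' to confirm: " answer
# Nothing runs until the operator explicitly types "yes"
[ "$answer" = "yes" ] || { echo "Aborted; nothing deployed."; exit 1; }
pwsh -File ./Invoke-SiteMigration.ps1
```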
## Quick Start

Clone and set up the MCP server:

```bash
git clone https://github.com/gsethdev/agenticmigration.git
cd iis-migration-mcp
python -m venv .venv
.venv\Scripts\activate
pip install -r requirements.txt

# Download Microsoft's migration scripts (NOT included in this repo)
# From: https://appmigration.microsoft.com/api/download/psscripts/AppServiceMigrationScripts.zip
# Unzip to C:\MigrationScripts (or your preferred path)

# Start using in VS Code with Copilot
# 1. Copy .vscode/mcp.json.example → .vscode/mcp.json
# 2. Open folder in VS Code
# 3. In Copilot Chat: "Configure scripts path to C:\MigrationScripts"
# 4. Then: @iis-migrate "Discover my IIS sites"
```

The server also works with any MCP-compatible client (Claude Desktop, Cursor, Copilot CLI, or custom integrations) via stdio transport.

## Architecture: How the MCP Server Works

The system is built on the Model Context Protocol (MCP), an open protocol that lets AI assistants like GitHub Copilot, Claude, or Cursor call external tools through a standardized interface.

```
┌──────────────────────────────────────────────────────────────────┐
│ VS Code + Copilot Chat                                           │
│   @iis-migrate orchestrator agent                                │
│   ├── iis-discover     (Phase 1)                                 │
│   ├── iis-assess       (Phase 2)                                 │
│   ├── iis-recommend    (Phase 3)                                 │
│   ├── iis-deploy-plan  (Phase 4)                                 │
│   └── iis-execute      (Phase 5)                                 │
└─────────────┬────────────────────────────────────────────────────┘
              │ stdio JSON-RPC (MCP Transport)
              ▼
┌──────────────────────────────────────────────────────────────────┐
│ FastMCP Server (server.py)                                       │
│   13 Python Tool Modules (tools/*.py)                            │
│   └── ps_runner.py (Python → PowerShell bridge)                  │
│       └── Downloaded PowerShell Scripts (user-configured)        │
│           ├── Local IIS (discovery, packaging)                   │
│           └── Azure ARM API (deployment)                         │
└──────────────────────────────────────────────────────────────────┘
```

The server exposes 13 MCP tools organized across 5 phases, orchestrated by 6 Copilot agents (1 orchestrator + 5 specialist subagents).

Important: The PowerShell migration scripts are not included in this repository. Users must download them from GitHub and configure the path using the configure_scripts_path tool. This ensures you always use the latest version of Microsoft's scripts, avoiding version mismatch issues.

## The 13 MCP Tools: Complete Reference

### Phase 0: Setup

**configure_scripts_path**

Purpose: Point the server to Microsoft's downloaded migration PowerShell scripts. Before any migration work, you need to download the scripts from GitHub, unzip them, and tell the server where they are.

> "Configure scripts path to C:\MigrationScripts"

### Phase 1: Discovery

**1. discover_iis_sites**

Purpose: Scan the local IIS server and run readiness checks on every web site. This is the entry point for every migration.
It calls Get-SiteReadiness.ps1 under the hood, which:

- Enumerates all IIS web sites, application pools, bindings, and virtual directories
- Runs 15 readiness checks per site (config errors, HTTPS bindings, non-HTTP protocols, TCP ports, location tags, app pool settings, app pool identity, virtual directories, content size, global modules, ISAPI filters, authentication, framework version, connection strings, and more)
- Detects source code artifacts (.sln, .csproj, .cs, .vb) near site physical paths

Output: ReadinessResults.json with per-site status:

| Status | Meaning |
| --- | --- |
| READY | No issues detected; clear for migration |
| READY_WITH_WARNINGS | Minor issues that won't block migration |
| READY_WITH_ISSUES | Non-fatal issues that need attention |
| BLOCKED | Fatal issues (e.g., content > 2GB); cannot migrate as-is |

Requires: Administrator privileges, IIS installed.

**2. choose_assessment_mode**

Purpose: Route each discovered site into the appropriate next step. After discovery, you decide the path for each site:

- assess_all: Run detailed assessment on all non-blocked sites
- package_and_migrate: Skip assessment, proceed directly to packaging (for sites you already know well)

The tool classifies each site into one of five actions:

- assess_config_only: IIS/web.config analysis
- assess_config_and_source: Config + AppCat source code analysis (when source is detected)
- package: Skip to packaging
- blocked: Fatal errors, cannot proceed
- skip: User chose to exclude

### Phase 2: Assessment

**3. assess_site_readiness**

Purpose: Get a detailed, human-readable readiness assessment for a specific site. Takes the raw readiness data from Phase 1 and enriches each check with:

- Title: Plain-English name (e.g., "Global Assembly Cache (GAC) Dependencies")
- Description: What the check found and why it matters
- Recommendation: Specific guidance on how to resolve the issue
- Category: Grouping (Configuration, Security, Compatibility)
- Documentation Link: Microsoft Learn URL for further reading

This enrichment comes from WebAppCheckResources.resx, an XML resource file that maps check IDs to detailed metadata. Without this tool, you'd see `GACCheck: FAIL`; with it, you see the full context.

Output: Overall status, enriched failed/warning checks, framework version, pipeline mode, binding details.

**4. assess_source_code**

Purpose: Analyze an Azure Migrate application and code assessment for .NET JSON report to identify Managed Instance-relevant source code dependencies. If your application has source code and you've run the assessment tool against it, this tool parses the results and maps findings to migration actions:

| Dependency Detected | Migration Action |
| --- | --- |
| Windows Registry access | Registry Adapter (ARM template) |
| Local file system I/O / hardcoded paths | Storage Adapter (ARM template) |
| SMTP usage | install.ps1 (SMTP Server feature) |
| COM Interop | install.ps1 (regsvr32/RegAsm) |
| Global Assembly Cache (GAC) | install.ps1 (GAC install) |
| Message Queuing (MSMQ) | install.ps1 (MSMQ feature) |
| Certificate access | Key Vault integration |

The tool matches rules from the assessment output against known Managed Instance-relevant patterns. For a complete list of rules and categories, see Interpret the analysis results.

Output: Issues categorized as mandatory/optional/potential, plus install_script_features and adapter_features lists that feed directly into Phase 3 tools.

### Phase 3: Recommendation & Provisioning

**5. suggest_migration_approach**

Purpose: Recommend the right migration tool/approach for the scenario. This is a routing tool that considers:

- Source code available?
  → Recommend the App Modernization MCP server for code-level changes
- No source code? → Recommend this IIS Migration MCP (lift-and-shift)
- OS customization needed? → Highlight Managed Instance on App Service as the target

**6. recommend_target**

Purpose: Recommend the Azure deployment target for each site based on all assessment data. This is the intelligence center of the system. It analyzes config assessments and source code findings to recommend:

| Target | When Recommended | SKU |
| --- | --- | --- |
| MI_AppService | Registry, COM, MSMQ, SMTP, local file I/O, GAC, or Windows Service dependencies detected | PremiumV4 (PV4) |
| AppService | Standard web app, no OS-level dependencies | PremiumV2 (PV2) |
| ContainerApps | Microservices architecture or container-first preference | N/A |

Each recommendation comes with:

- Confidence: high or medium
- Reasoning: Full explanation of why this target was chosen
- Managed Instance reasons: Specific dependencies that require Managed Instance
- Blockers: Issues that prevent migration entirely
- install_script_features: What the install.ps1 needs to enable
- adapter_features: What the ARM template needs to configure
- Provisioning guidance: Step-by-step instructions for what to do next

**7. generate_install_script**

Purpose: Generate an install.ps1 PowerShell script for OS-level feature enablement on Managed Instance. This handles the OS-level side of the Managed Instance provisioning split. It generates a startup script that includes sections for:

| Feature | What the Script Does |
| --- | --- |
| SMTP | Install-WindowsFeature SMTP-Server, configure smart host relay |
| MSMQ | Install MSMQ, create application queues |
| COM/MSI | Run msiexec for MSI installers, regsvr32/RegAsm for COM registration |
| Crystal Reports | Install SAP Crystal Reports runtime MSI |
| Custom Fonts | Copy .ttf/.otf to C:\Windows\Fonts, register in registry |

The script can auto-detect needed features from config and source assessments, or you can specify them manually.

**8. generate_adapter_arm_template**

Purpose: Generate an ARM template for Managed Instance registry and storage adapters. This handles the platform-level side of the Managed Instance provisioning split. It generates a deployable ARM template that configures:

Registry Adapters (Key Vault-backed):

- Map Windows Registry paths (e.g., HKLM\SOFTWARE\MyApp\LicenseKey) to Key Vault secrets
- Your application reads the registry as before; Managed Instance redirects the read to Key Vault transparently

Storage Adapters (three types):

| Type | Description | Credentials |
| --- | --- | --- |
| AzureFiles | Mount Azure Files SMB share as a drive letter | Storage account key in Key Vault |
| Custom | Mount storage over private endpoint via VNET | Requires VNET integration |
| LocalStorage | Allocate local SSD on the Managed Instance as a drive letter | None needed |

The template also includes:

- Managed Identity configuration
- RBAC role assignment guidance (Key Vault Secrets User, Storage File Data SMB Share Contributor, etc.)
- Deployment CLI commands ready to copy-paste

### Phase 4: Deployment Planning & Packaging

**9. plan_deployment**

Purpose: Plan the Azure App Service deployment: plans, SKUs, and site assignments. Collects your Azure details (subscription, resource group, region) and creates a validated deployment plan:

- Assigns sites to App Service Plans
- Enforces PV4 + IsCustomMode=true for Managed Instance; won't let you accidentally use the wrong SKU
- Supports single_plan (all sites on one plan) or multi_plan (separate plans)
- Optionally queries Azure for existing Managed Instance plans you can reuse

**10. package_site**

Purpose: Package IIS site content into ZIP files for deployment.
Calls Get-SitePackage.ps1 to:

- Compress site binaries + web.config into deployment-ready ZIPs
- Optionally inject install.ps1 into the package (so it deploys alongside the app)
- Handle sites with non-fatal issues (configurable)

Size limit: 2 GB per site (enforced by System.IO.Compression).

**11. generate_migration_settings**

Purpose: Create the MigrationSettings.json deployment configuration. This is the final configuration artifact. It calls Generate-MigrationSettings.ps1 and then post-processes the output to inject Managed Instance-specific fields.

Important: The Managed Instance on App Service Plan is not automatically created by the migration tools. You must pre-create the Managed Instance on App Service Plan (PV4 SKU with IsCustomMode=true) in the Azure portal or via CLI before generating migration settings. When running generate_migration_settings, provide the name of your existing Managed Instance plan so the settings file references it correctly.

```json
{
  "AppServicePlan": "mi-plan-eastus",
  "Tier": "PremiumV4",
  "IsCustomMode": true,
  "InstallScriptPath": "install.ps1",
  "Region": "eastus",
  "Sites": [
    {
      "IISSiteName": "MyLegacyApp",
      "AzureSiteName": "mylegacyapp-azure",
      "SitePackagePath": "packagedsites/MyLegacyApp_Content.zip"
    }
  ]
}
```

### Phase 5: Execution

**12. confirm_migration**

Purpose: Present a full migration summary and require explicit human confirmation. Before touching Azure, this tool displays:

- Total plans and sites to be created
- SKU and pricing tier per plan
- Whether Managed Instance is configured
- Cost warning for PV4 pricing
- Resource group, region, and subscription details

Nothing proceeds until the user explicitly confirms.

**13. migrate_sites**

Purpose: Deploy everything to Azure App Service. This creates billable resources. Calls Invoke-SiteMigration.ps1, which:

- Sets Azure subscription context
- Creates/validates resource groups
- Creates App Service Plans (PV4 with IsCustomMode for Managed Instance)
- Creates Web Apps
- Configures .NET version, 32-bit mode, and pipeline mode from the original IIS settings
- Sets up virtual directories and applications
- Disables basic authentication (FTP + SCM) for security
- Deploys ZIP packages via Azure REST API

Output: MigrationResults.json with per-site Azure URLs, Resource IDs, and deployment status.

## The 6 Copilot Agents

The MCP tools are orchestrated by a team of specialized Copilot agents, each responsible for a specific phase of the migration lifecycle.

### @iis-migrate: The Orchestrator

The root agent that guides the entire migration. It:

- Tracks progress across all 5 phases using a todo list
- Delegates work to specialist subagents
- Gates between phases; asks before transitioning
- Enforces the Managed Instance constraint (PV4 + IsCustomMode) at every decision point
- Never skips the Phase 5 confirmation gate

Usage: Open Copilot Chat and type `@iis-migrate I want to migrate my IIS applications to Azure`

### iis-discover: Discovery Specialist

Handles Phase 1. Runs discover_iis_sites, presents a summary table of all sites with their readiness status, and asks whether to assess or skip to packaging. Returns readiness_results_path and per-site routing plans.

### iis-assess: Assessment Specialist

Handles Phase 2. Runs assess_site_readiness for every site, and assess_source_code when AppCat results are available. Merges findings, highlights Managed Instance-relevant issues, and produces the adapter/install features lists that drive Phase 3.

### iis-recommend: Recommendation Specialist

Handles Phase 3.
Runs recommend_target for each site, then conditionally generates install.ps1 and ARM adapter templates. Presents all recommendations with confidence levels and reasoning, and allows you to edit generated artifacts.

### iis-deploy-plan: Deployment Planning Specialist

Handles Phase 4. Collects Azure details, runs plan_deployment, package_site, and generate_migration_settings. Validates Managed Instance configuration and allows review and editing of MigrationSettings.json. Does not execute migration.

### iis-execute: Execution Specialist

Handles Phase 5 only. Runs confirm_migration to present the final summary, then only proceeds with migrate_sites after receiving explicit "yes" confirmation. Reports results with Azure URLs and deployment status.

## The Managed Instance Provisioning Split: A Critical Concept

One of the most important ideas Managed Instance introduces is the provisioning split: the division of OS dependencies into two categories that are configured through different mechanisms.

```
┌──────────────────────────────────────────────────────────────┐
│              MANAGED INSTANCE PROVISIONING SPLIT             │
├─────────────────────────────┬────────────────────────────────┤
│ ARM Template                │ install.ps1                    │
│ (Platform-Level)            │ (OS-Level)                     │
├─────────────────────────────┼────────────────────────────────┤
│ Registry Adapters           │ COM/MSI Registration           │
│   → Key Vault secrets       │   → regsvr32, RegAsm, msiexec  │
│                             │                                │
│ Storage Mounts              │ SMTP Server Feature            │
│   → Azure Files             │   → Install-WindowsFeature     │
│   → Local SSD               │                                │
│   → VNET private storage    │ MSMQ                           │
│                             │   → Message queue setup        │
│                             │                                │
│                             │ Crystal Reports Runtime        │
│                             │   → SAP MSI installer          │
│                             │                                │
│                             │ Custom Fonts                   │
│                             │   → Copy to C:\Windows\Fonts   │
└─────────────────────────────┴────────────────────────────────┘
```

The MCP server handles this split automatically:

- assess_source_code detects which dependencies fall into which category
- recommend_target reports both adapter_features and install_script_features
- generate_adapter_arm_template builds the ARM template for platform features
- generate_install_script builds the PowerShell startup script for OS features

You don't need to remember which goes where; the system decides and generates the right artifacts.

## End-to-End Walkthrough: From Discovery to Running on Managed Instance

Here's what a complete migration conversation looks like:

You: "@iis-migrate I want to migrate my IIS applications to Azure"

**Phase 1 (Discovery):** Agent runs discover_iis_sites and presents a table:

| Site | Status | Framework | Source Code? |
| --- | --- | --- | --- |
| HRPortal | READY_WITH_ISSUES | v4.8 | Yes (.sln found) |
| PayrollAPI | READY | v4.8 | No |
| IntranetCMS | BLOCKED (>2GB) | v4.7.2 | No |

**Phase 2 (Assessment):** Agent runs assess_site_readiness for HRPortal and finds GACCheck and RegistryCheck failures. Runs assess_source_code using the AppCat report, confirming COM interop, registry access, and SMTP usage.

**Phase 3 (Recommendation):** Agent runs recommend_target:

- HRPortal → MI_AppService (high confidence): COM, registry, and SMTP dependencies
- PayrollAPI → AppService (high confidence): no OS dependencies

Generates install.ps1 for HRPortal (SMTP + COM sections). Generates an ARM template with a registry adapter (Key Vault-backed) for HRPortal.

**Phase 4 (Deployment Planning):** Agent collects subscription/RG/region and validates PV4 availability. Packages both sites. Generates MigrationSettings.json with two plans:

- mi-plan-hrportal (PremiumV4, IsCustomMode=true): HRPortal
- std-plan-payrollapi (PremiumV2): PayrollAPI

**Phase 5 (Execution):** Agent shows the full summary with cost projection. You type "yes". Sites deploy.
You get Azure URLs within minutes.

## Prerequisites & Setup

| Requirement | Purpose |
| --- | --- |
| Windows Server with IIS | Source server for discovery and packaging |
| PowerShell 5.1 | Runs migration scripts (ships with Windows) |
| Python 3.10+ | MCP server runtime |
| Administrator privileges | Required for IIS discovery, packaging, and migration |
| Azure subscription | Target for deployment (execution phase only) |
| Azure PowerShell (Az module) | Deploy to Azure (execution phase only) |
| Migration Scripts ZIP | Microsoft's PowerShell migration scripts |
| AppCat CLI | Source code analysis (optional) |
| FastMCP (mcp[cli]>=1.0.0) | MCP server framework |

## Data Flow & Artifacts

Every phase produces JSON artifacts that chain into the next phase:

```
Phase 1: discover_iis_sites ──→ ReadinessResults.json
                                       │
Phase 2: assess_site_readiness ◄───────┘
         assess_source_code ───→ Assessment JSONs
                                       │
Phase 3: recommend_target ◄────────────┘
         generate_install_script ──→ install.ps1
         generate_adapter_arm ─────→ mi-adapters-template.json
                                       │
Phase 4: package_site ─────────────→ PackageResults.json + site ZIPs
         generate_migration_settings → MigrationSettings.json
                                       │
Phase 5: confirm_migration ◄───────────┘
         migrate_sites ───────────→ MigrationResults.json
                                       │
                                       ▼
                     Apps live on Azure *.azurewebsites.net
```

Each artifact is inspectable, editable, and auditable, providing a complete record of what was assessed, recommended, and deployed.

## Error Handling

The MCP server classifies errors into actionable categories:

| Error | Cause | Resolution |
| --- | --- | --- |
| ELEVATION_REQUIRED | Not running as Administrator | Restart VS Code / terminal as Admin |
| IIS_NOT_FOUND | IIS or WebAdministration module missing | Install IIS role + WebAdministration |
| AZURE_NOT_AUTHENTICATED | Not logged into Azure PowerShell | Run Connect-AzAccount |
| SCRIPT_NOT_FOUND | Migration scripts path not configured | Run configure_scripts_path |
| SCRIPT_TIMEOUT | PowerShell script exceeded time limit | Check IIS server responsiveness |
| OUTPUT_NOT_FOUND | Expected JSON output wasn't created | Verify script execution succeeded |

## Conclusion

The IIS Migration MCP Server turns what used to be a multi-week, expert-driven project into a guided conversation. It combines Microsoft's battle-tested migration PowerShell scripts with AI orchestration that understands the nuances of Managed Instance on App Service: the provisioning split, the PV4 constraint, the adapter configurations, and the OS-level customizations. Whether you're migrating 1 site or 10, agentic migration reduces risk, eliminates guesswork, and produces auditable artifacts at every step. The human stays in control; the AI handles the complexity.

Get started: Download the migration scripts, set up the MCP server, and ask @iis-migrate to discover your IIS sites. The agents will take it from there. This project is compatible with any MCP-enabled client: VS Code GitHub Copilot, Claude Desktop, Cursor, and more. The intelligence travels with the server, not the client.
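As a pointer for the adapter step, the generated template can be deployed with the standard Azure CLI command. The file name below comes from the data-flow diagram above, and the resource group is a placeholder:

```bash
az deployment group create \
  --resource-group <resource-group> \
  --template-file mi-adapters-template.json
```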
# Why Does Azure App Service Return HTTP 404?

When an application deployed to Azure App Service suddenly starts returning HTTP 404 (Not Found), it can be confusing, especially when:

- The deployment completed successfully
- The App Service shows as Running
- No obvious errors appear in the portal

This behaviour is more common than it appears and is often linked to routing, configuration, or platform behaviour. In this article, I'll walk through real-world reasons why Azure App Service can return HTTP 404 errors. The goal is to help you systematically isolate the root cause, whether it's application-level, configuration-related, or platform-specific.

## What Does HTTP 404 Mean in Azure App Service?

An HTTP 404 response from Azure App Service means: the incoming request successfully reached Azure App Service, but neither the platform nor the application could locate the requested resource.

This distinction is important. Unlike connectivity or DNS issues, a 404 confirms that:

- DNS resolution worked
- The request hit the App Service front end
- The failure happened after request routing

## 1. Incorrect Application URL or Route

This is the most common cause of 404 errors.

Typical scenarios:

- Accessing the root URL (https://<app>.azurewebsites.net) for a Web API that exposes only API routes
- Missing route prefixes such as /api or /v1, or incorrect controller/action name segments
- Case sensitivity mismatches on Linux App Service

Example: https://myapp.azurewebsites.net returns 404, but https://myapp.azurewebsites.net/weatherforecast works as expected.

✅ Tip: Always validate your routing locally and confirm the exact same path is being accessed in Azure.

## 2. Application Appears Running, but Startup Failed Partially

It is possible for an App Service to show Running even when the application failed to initialize fully.

Common causes:

- Missing or incorrect environment variables
- Invalid connection strings
- Exceptions thrown during Program.cs / Startup.cs
- Dependency initialization failures at startup

In such scenarios, the app may start the host process but fail to register routes, resulting in 404 responses instead of 500 errors.

✅ Where to check: application logs, deployment logs, Kudu → LogFiles

## 3. Static Files Not Found or Not Being Served

For applications hosting static content (HTML, JavaScript, images, JSON files), a 404 can occur even when files exist.

Common reasons:

- Files not deployed to the expected directory (wwwroot, /home/site/wwwroot)
- Missing or unsupported MIME type configuration (commonly seen with .json)
- Static file middleware not enabled in ASP.NET Core applications

✅ Quick validation: Deploy a simple test.html to wwwroot and try accessing it directly.

## 4. Windows vs Linux App Service Differences

Behaviour can differ significantly between Windows App Service and Linux App Service.

Common pitfalls on Linux:

- Case-sensitive file paths (Index.html ≠ index.html)
- Missing or incorrect startup command
- Differences in request routing handled by Nginx

✅ Tip: If the app works on Windows App Service but fails on Linux, always recheck file casing and startup configuration first.

## 5. Custom Domain and Networking Configuration Issues

In some cases, requests reach the App Service but fail due to domain or network constraints. Possible causes include an incorrect custom domain binding.

✅ Isolation step: Always test using the default *.azurewebsites.net hostname. If that works, the issue is domain-specific.

## 6. Health Checks or Monitoring Probes Targeting Invalid Paths

Seeing periodic 404 entries in logs, every few minutes, is often a sign of misconfigured probes.
Typical scenarios:

- App Service Health Check configured with a non-existent endpoint
- External monitoring tools probing /health or other paths that do not exist

✅ Fix: Ensure the health check path maps to a valid endpoint implemented by the application.

## 7. Missing or Corrupted Deployment Artifacts

Even when deployments report success, application files may not be where the runtime expects them.

Commonly observed with:

- Zip deployments
- WEBSITE_RUN_FROM_PACKAGE misconfigurations
- Partial or interrupted deployments

✅ Verify using Kudu: Browse /home/site/wwwroot and check that files are present.

## 8. Application Gateway in Front of App Service

If you have an Application Gateway in front of the App Service, check the rewrite rules so that the request is being sent to the correct path.

## Quick Troubleshooting Checklist

If your Azure App Service is returning HTTP 404:

1. Verify the exact URL and route
2. Test a static file (for example, /hostingstart.html)
3. Review startup and application logs
4. Inspect deployed artifacts via Kudu
5. Validate Windows vs Linux behaviour differences
6. Review networking, authentication, and health check settings

## Final Thoughts

HTTP 404 errors on Azure App Service are rarely random. In most cases, they point to:

- Routing mismatches
- Startup or configuration failures
- Platform-specific behavior differences

By breaking the investigation into platform → configuration → application, you can systematically narrow down the root cause and resolve the issue.

Happy debugging 🚀
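One quick way to work through the checklist is to compare status codes across hostnames and routes with curl. The hostnames and paths below are placeholders:

```bash
# Default hostname, root vs. an exact route
curl -s -o /dev/null -w "%{http_code}\n" https://myapp.azurewebsites.net/
curl -s -o /dev/null -w "%{http_code}\n" https://myapp.azurewebsites.net/api/weatherforecast

# Same route through the custom domain; a difference points at domain or gateway config
curl -s -o /dev/null -w "%{http_code}\n" https://www.contoso.com/api/weatherforecast
```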
# App Service Easy MCP: Add AI Agent Capabilities to Your Existing Apps with Zero Code Changes

The age of AI agents is here. Tools like GitHub Copilot, Claude, and other AI assistants are no longer just answering questions; they're taking actions, calling APIs, and automating complex workflows. But how do you make your existing applications and APIs accessible to these intelligent agents?

At Microsoft Ignite, I teamed up to present session BRK116: Apps, agents, and MCP is the AI innovation recipe, where I demonstrated how you can add agentic capabilities to your existing applications with little to no code changes. Today, I'm excited to share a concrete example of that vision: Easy MCP, a way to expose any REST API to AI agents with absolutely zero code changes to your existing apps.

## The Challenge: Bridging REST APIs and AI Agents

Most organizations have invested years building REST APIs that power their applications. These APIs represent critical business logic, data access patterns, and integrations. But AI agents speak a different language: they use protocols like Model Context Protocol (MCP) to discover and invoke tools.

The traditional approach would require you to:

- Learn the MCP SDK
- Write new MCP server code
- Manually map each API endpoint to an MCP tool
- Deploy and maintain additional infrastructure

What if you could skip all of that?

## Introducing Easy MCP (a proof of concept not associated with the App Service platform)

Easy MCP is an OpenAPI-to-MCP translation layer that automatically generates MCP tools from your existing REST APIs. If your API has an OpenAPI (Swagger) specification, which most modern APIs do, you can make it accessible to AI agents in minutes. This means that if you have existing apps with OpenAPI specifications already running on App Service, or really any hosting platform, this tool makes enabling MCP seamless.

### How It Works

1. Point the gateway at your API's base URL
2. Detect your OpenAPI specification automatically
3. Connect, and the gateway generates MCP tools for every endpoint
4. Use the MCP endpoint URL with any MCP-compatible AI client

That's it. No code changes. No SDK integration. No manual tool definitions.

## See It in Action

Let's say you have a Todo API running on Azure App Service at `https://my-todo-app.azurewebsites.net`. In just a few clicks:

1. Open the Easy MCP web UI
2. Enter your API URL
3. Click "Detect" to find your OpenAPI spec
4. Click "Connect"

Now configure your AI client (like VS Code with GitHub Copilot) to use the gateway's MCP endpoint:

```json
{
  "servers": {
    "my-api": {
      "type": "http",
      "url": "https://my-gateway.azurewebsites.net/mcp"
    }
  }
}
```

Instantly, your AI assistant can:

- "What's on my todo list?"
- "Add 'Review PR #123' to my todos with high priority"
- "Mark all tasks as complete"

All powered by your existing REST API, with zero modifications.

## The Bigger Picture: Modernization Without Rewrites

This approach aligns perfectly with a broader modernization strategy we're enabling on Azure App Service.

### App Service Managed Instance: Move and Modernize Legacy Apps

For organizations with legacy applications, whether they're running on older Windows frameworks, custom configurations, or traditional hosting environments, Azure App Service Managed Instance provides a path to the cloud with minimal friction. You can migrate these applications to a fully managed platform without rewriting code.

### Easy MCP: Add AI Capabilities Post-Migration

Once your legacy applications are running on App Service, Easy MCP becomes the next step in your modernization journey. That 10-year-old internal API? It can now be accessed by AI agents. That legacy inventory system?
AI assistants can query and update it. No code changes needed.

The modernization path:

1. Migrate legacy apps to App Service with Managed Instance (no code changes)
2. Expose APIs to AI agents with Easy MCP Gateway (no code changes)
3. Empower your organization with AI-assisted workflows

## Deploy It Yourself

Easy MCP is open source and ready to deploy. If you already have an existing API to use with this tool, go for it. If you need an app to test with, check out this sample. Make sure you complete the "Add OpenAPI functionality to your web app" step. You don't need to go beyond that.

GitHub Repository: seligj95/app-service-easy-mcp

Deploy to Azure in minutes with the Azure Developer CLI:

```bash
azd auth login
azd init
azd up
```

Or run it locally for testing:

```bash
npm install
npm run dev
# Open http://localhost:3000
```

## What's Next: Native App Service Integration

Here's where it gets really exciting. We're exploring ways to build this capability directly into the Azure App Service platform so you won't have to deploy a second app or additional resources to get this capability. Azure API Management recently released a feature with functionality to expose a REST API, including an API on App Service, as an MCP server, which I highly recommend that you check out if you're familiar with Azure API Management. But in this case, imagine a future where adding AI agent capabilities to your App Service apps is as simple as flipping a switch in the Azure portal: no gateway or API Management deployment required, no additional infrastructure or services to manage, and built-in security, monitoring, scaling, and all of the features you're already using and are familiar with on App Service.

Stay tuned for updates as we continue to make Azure App Service the best platform for AI-powered applications. And please share your feedback on Easy MCP; we want to hear how you're using it and what features you'd like to see next as we consider this feature for native integration.
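If you want to poke at the gateway without an AI client, and assuming it implements standard MCP JSON-RPC over HTTP at /mcp (the URL is a placeholder, and a session handshake via `initialize` may be required first), a quick smoke test might look like:

```bash
curl -s https://my-gateway.azurewebsites.net/mcp \
  -H "Content-Type: application/json" \
  -H "Accept: application/json, text/event-stream" \
  -d '{"jsonrpc":"2.0","id":1,"method":"tools/list"}'
```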
# Announcing general availability for the Azure SRE Agent

Today, we're excited to announce the General Availability (GA) of Azure SRE Agent, your AI-powered operations teammate that helps organizations improve uptime, reduce incident impact, and cut operational toil by accelerating diagnosis and automating response workflows.
# NFS Permission Denied in Azure App Service on Linux: What It Means and What to Do

If your Azure App Service on Linux uses an Azure Files NFS share, you may sometimes see errors like Permission denied or Errno 13 when your app tries to write to the mounted path. Azure Files supports NFS for Linux and Unix workloads, and NFS uses Unix-style numeric ownership and permissions (UID/GID), which can behave differently from SMB-based file sharing.

## Overview

This post is for customers using Azure App Service on Linux together with an Azure Files NFS share for persistent storage. Azure Files NFS is designed for Linux and Unix-style workloads, supports POSIX-style permissions, and does not support Windows clients or NFS ACLs.

In this setup, a write failure does not always mean the file is corrupted. Sometimes it means the file ownership seen by the running app no longer matches the identity context currently used to access the NFS share. In containerized Linux environments, user IDs inside a container can be mapped differently outside the container, and Docker documents that this can affect access to host-mounted resources.

## Common signs

You may notice:

- Permission denied
- Errno 13
- Your app can read files but cannot update or overwrite them
- File ownership looks different than expected when you inspect the mounted path

These symptoms are consistent with how NFS handles Unix-style ownership and permissions. Azure documents that NFS permissions are enforced through the operating system and NFS model rather than SMB-style user authentication.

## Why this can happen

At a high level, NFS uses numeric ownership such as UID and GID. In container-based Linux environments, the identity that appears inside the container is not always the same as the identity seen outside the container. Docker's user namespace documentation explains that a container user such as root can be mapped to a less-privileged user on the host, and that mounted-resource access can become more complex because of that mapping.

That means a file created earlier under one effective identity context may later be accessed under a different one. When that happens, the app may no longer be able to write to the file even though the file itself is still present and intact.

## What to check first

Start by checking the mounted share from the app's runtime context:

```bash
ls -l /mount/path/file
ls -ln /mount/path/file
id -u
id -g
```

The `ls -ln` output is especially useful because it shows the numeric UID and GID directly. If you need shell access for investigation, App Service supports SSH into Linux containers, and Microsoft notes that Linux custom containers may need extra SSH configuration.

You should also review the NFS share's squash setting. Azure Files NFS supports No Root Squash, Root Squash, and All Squash. Microsoft documents these options in the root squash guidance.

## A practical mitigation

If the main issue is inconsistent ownership behavior, a practical mitigation is often to use All Squash on the NFS share. Azure documents All Squash as a supported NFS setting, and squash settings are specifically intended to control how client identities are handled when they access the share.

One important note: changing the squash setting does not automatically rewrite old files. If existing data was created under a different ownership context, you may still need to migrate that data to a new share configured the way you want.

## Recommended approach

A simple and cautious approach is:

1. Create a new Azure Files NFS share.
2. Configure it with All Squash if that matches your workload needs.
3. Mount both the old share and the new share on a Linux environment.
Azure Files supports NFS shares and squash configuration, and Azure also documents how to mount NFS shares on Linux if you need a separate environment for validation or migration.
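Before repointing production (step 5 in the approach above), it is worth validating from the app's own runtime context. A small sketch, run from an SSH session into the app container and assuming the share is mounted at /mount/path:

# Who is the app running as, numerically?
id -un; id -u; id -g

# What ownership and mode does the mounted path actually show?
stat -c '%u:%g %a %n' /mount/path

# Can this identity create and delete a file on the share?
touch /mount/path/.write-test && rm /mount/path/.write-test && echo "write OK"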
Final takeaway

If your App Service on Linux starts hitting NFS permission denied errors, focus first on ownership, UID/GID behavior, and squash settings before assuming the files are damaged. For many users, the most effective path is to validate the current ownership model, review the NFS squash setting, and, if needed, migrate data to a share configured with All Squash.

References

NFS file shares in Azure Files | Microsoft Learn
Configure Root Squash Settings for NFS Azure File Shares | Microsoft Learn
SSH Access for Linux and Windows Containers - Azure App Service | Microsoft Learn
Isolate containers with a user namespace | Docker Docs

Announcing the Public Preview of the New Hybrid Connection Manager (HCM)

Update May 28, 2025: The new Hybrid Connection Manager is now Generally Available. The download links shared in this post will give you the latest Generally Available version. Learn more

Key Features and Improvements

The new version of HCM introduces several enhancements aimed at improving usability, performance, and security:

- Cross-Platform Compatibility: The new HCM is now supported on both Windows and Linux clients, allowing for seamless management of hybrid connections across different platforms and giving users greater flexibility and control.
- Enhanced User Interface: We have redesigned the GUI to offer a more intuitive and efficient user experience. In addition to a new and more accessible GUI, we have also introduced a CLI that includes all the functionality needed to manage connections, especially for our Linux customers who may solely use a CLI to manage their workloads.
- Improved Visibility: The new version offers enhanced logging and connection testing, which provides greater insight into connections and simplifies debugging.

Getting Started

To get started with the new Hybrid Connection Manager, follow these steps.

Requirements:

- Windows clients must have ports 4999-5001 available
- Linux clients must have port 5001 available

Download and Install: The new HCM can be downloaded from the links below. Ensure you download the version that corresponds to your client. If you are new to the HCM, check out the existing documentation to learn more about the product and how to get started. If you are an existing Windows user, installing the new Windows version will automatically upgrade your existing installation, and all your existing connections will be automatically ported over. There is no automated migration path from the Windows to the Linux version at this time.

Windows download: Download the MSI package and follow the installation instructions.

Linux download: From a terminal running as an administrator, follow these steps:

sudo apt update
sudo apt install tar gzip build-essential
sudo wget "https://download.microsoft.com/download/HybridConnectionManager-Linux.tar.gz"
sudo tar -xf HybridConnectionManager-Linux.tar.gz
cd HybridConnectionManager/
sudo chmod 755 setup.sh
sudo ./setup.sh

Once that is finished, your HCM is ready to be used:

- Run `hcm help` to see the available commands.
- For interactive mode, you will need to install and log in to the Azure CLI; authentication from the HCM to Azure is done using this credential. Install the Azure CLI with `curl -sL https://aka.ms/InstallAzureCLIDeb | sudo bash`, then run `az login` and follow the prompts.
- Add your first connection by running `hcm add`.

Configure Your Connections: Use the GUI or the CLI to add hybrid connections to your local machine.

Manage Your Connections: Use the GUI or the CLI with the `hcm list` and `hcm remove` commands to manage your hybrid connections efficiently. Detailed help texts are available for each command to assist you.
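A typical first session with the CLI might look like the sketch below. It uses only the commands named in this post (`hcm help`, `hcm add`, `hcm list`, `hcm remove`); the assumption that `hcm add` and `hcm remove` prompt interactively for connection details reflects the interactive mode described above:

# Sign in so HCM can authenticate to Azure with your CLI credential.
az login

# See the available commands and their detailed help texts.
hcm help

# Add a hybrid connection (interactive mode walks you through
# selecting the subscription and hybrid connection; assumed behavior).
hcm add

# Confirm the connection is configured, and remove it when done.
hcm list
hcm remove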
Join the Preview

We invite you to join the public preview and provide your valuable feedback. Your insights will help us refine and improve the Hybrid Connection Manager to better meet your needs.

Feedback and Support

If you encounter any issues or have suggestions, please reach out to hcmsupport@service.microsoft.com or leave a comment on this post. We are committed to ensuring a smooth and productive experience with the new HCM. Detailed documentation and guidance will be available in the coming weeks as we get closer to General Availability (GA).

Thank you for your continued support and collaboration. We look forward to hearing your thoughts and feedback on this exciting new release.

Event-Driven IaC Operations with Azure SRE Agent: Terraform Drift Detection via HTTP Triggers
What Happens After terraform plan Finds Drift?

If your team is like most, the answer looks something like this:

1. A nightly terraform plan runs and finds 3 drifted resources
2. A notification lands in Slack or Teams
3. Someone files a ticket
4. During the next sprint, an engineer opens 4 browser tabs — Terraform state, Azure Portal, Activity Log, Application Insights — and spends 30 minutes piecing together what happened
5. They discover the drift was caused by an on-call engineer who scaled up the App Service during a latency incident at 2 AM
6. They revert the drift with terraform apply
7. The app goes down, because they just scaled it back down while the bug that caused the incident is still deployed

Step 7 is the one nobody talks about. Drift detection tooling has gotten remarkably good — scheduled plans, speculative runs, drift alerts — but the output is always the same: a list of differences. What changed. Not why. Not whether it's safe to fix. The gap isn't detection. It's everything that happens after detection.

HTTP Triggers in Azure SRE Agent close that gap. They turn the structured output that drift detection already produces — webhook payloads, plan summaries, run notifications — into the starting point of an autonomous investigation. Detection feeds the agent. The agent does the rest: correlates with incidents, reads source code, classifies severity, recommends context-aware remediation, notifies the team, and even ships a fix. Here's what that looks like end to end.

What you'll see in this blog:

- An agent that classifies drift as Benign, Risky, or Critical — not just "changed"
- Incident correlation that links a SKU change to a latency spike in Application Insights
- A remediation recommendation that says "Do NOT revert" — and why reverting would cause an outage
- A Teams notification with the full investigation summary
- An agent that reviews its own performance, finds gaps, and improves its own skill file
- A pull request the agent created on its own to fix the root cause

The Pipeline: Detection to Resolution in One Webhook

The architecture is straightforward. Terraform Cloud (or any drift detection tool) sends a webhook when it finds drift. An Azure Logic App adds authentication. The SRE Agent's HTTP Trigger receives it and starts an autonomous investigation.

The end-to-end pipeline: Terraform Cloud detects drift and sends a webhook. The Logic App adds Azure AD authentication via Managed Identity. The SRE Agent's HTTP Trigger fires, and the agent autonomously investigates across 7 dimensions.

Setting Up the Pipeline

Step 1: Deploy the Infrastructure with Terraform

We start with a simple Azure App Service running a Node.js application, deployed via Terraform. The Terraform configuration defines the desired state:

- App Service Plan: B1 (Basic) — single vCPU, ~$13/mo
- App Service: Node 20-lts with TLS 1.2
- Tags: environment: demo, managed_by: terraform, project: sre-agent-iac-blog

resource "azurerm_service_plan" "demo" {
  name                = "iacdemo-plan"
  resource_group_name = azurerm_resource_group.demo.name
  location            = azurerm_resource_group.demo.location
  os_type             = "Linux"
  sku_name            = "B1"
}

A Logic App is also deployed to act as the authentication bridge between Terraform Cloud webhooks and the SRE Agent's HTTP Trigger endpoint, using Managed Identity to acquire Azure AD tokens. Learn more about HTTP Triggers here.
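Once the apply finishes, it can help to snapshot the baseline that the agent will later compare against. A minimal az CLI sketch, assuming a hypothetical resource group name iacdemo-rg and app name iacdemo-app (only the plan name iacdemo-plan appears in the configuration above):

# Expected output: B1 (the SKU declared in Terraform).
az appservice plan show -g iacdemo-rg -n iacdemo-plan --query sku.name -o tsv

# Expected output: 1.2 (the minimum TLS version declared in Terraform).
az webapp config show -g iacdemo-rg -n iacdemo-app --query minTlsVersion -o tsv

# Expected output: only the three Terraform-managed tags.
az webapp show -g iacdemo-rg -n iacdemo-app --query tags -o json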
Step 2: Create the Drift Analysis Skill

Skills are domain knowledge files that teach the agent how to approach a problem. We create a terraform-drift-analysis skill with an 8-step workflow:

1. Identify Scope — Which resource group and resources to check
2. Detect Drift — Compare Terraform config against Azure reality
3. Correlate with Incidents — Check Activity Log and App Insights
4. Classify Severity — Benign, Risky, or Critical
5. Investigate Root Cause — Read source code from the connected repository
6. Generate Drift Report — Structured summary with severity-coded table
7. Recommend Smart Remediation — Context-aware: don't blindly revert
8. Notify Team — Post findings to Microsoft Teams

The key insight in the skill: "NEVER revert critical drift that is actively mitigating an incident." This teaches the agent to think like an experienced SRE, not just a diff tool.

Step 3: Create the HTTP Trigger

In the SRE Agent UI, we create an HTTP Trigger named tfc-drift-handler with a 7-step agent prompt:

A Terraform Cloud run has completed and detected infrastructure drift.
Workspace: {payload.workspace_name}
Organization: {payload.organization_name}
Run ID: {payload.run_id}
Run Message: {payload.run_message}

STEP 1 — DETECT DRIFT: Compare Terraform configuration against actual Azure state...
STEP 2 — CORRELATE WITH INCIDENTS: Check Azure Activity Log and App Insights...
STEP 3 — CLASSIFY SEVERITY: Rate each drift item as Benign, Risky, or Critical...
STEP 4 — INVESTIGATE ROOT CAUSE: Read the application source code...
STEP 5 — GENERATE DRIFT REPORT: Produce a structured summary...
STEP 6 — RECOMMEND SMART REMEDIATION: Context-aware recommendations...
STEP 7 — NOTIFY TEAM: Post a summary to Microsoft Teams...

Step 4: Connect GitHub and Teams

We connect two integrations in the SRE Agent Connectors settings:

- Code Repository: GitHub — so the agent can read application source code during investigations
- Notification: Microsoft Teams — so the agent can post drift reports to the team channel

The Incident Story

Act 1: The Latency Bug

Our demo app has a subtle but devastating bug. The /api/data endpoint calls processLargeDatasetSync() — a function that sorts an array on every iteration, creating an O(n² log n) blocking operation. On a B1 App Service Plan (single vCPU), this blocks the Node.js event loop entirely. Under load, response times spike from milliseconds to 25-58 seconds, with 502 Bad Gateway errors from the Azure load balancer.

Act 2: The On-Call Response

An on-call engineer sees the latency alerts and responds — not through Terraform, but directly through the Azure Portal and CLI (the equivalent CLI calls are sketched after this list). They:

1. Add diagnostic tags — manual_update=True, changed_by=portal_user (benign)
2. Downgrade TLS from 1.2 to 1.0 while troubleshooting (risky — security regression)
3. Scale the App Service Plan from B1 to S1 to throw more compute at the problem (critical — cost increase from ~$13/mo to ~$73/mo)

The incident is partially mitigated — S1 has more compute, so latency drops from catastrophic to merely bad. Everyone goes back to sleep. Nobody updates Terraform.
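For reference, here is roughly what those out-of-band changes look like as az CLI calls, reusing the same assumed resource names as before (iacdemo-rg, iacdemo-app). This is an illustrative reconstruction, not the exact commands from the incident:

# 1. Tag the app with diagnostic breadcrumbs (benign drift).
APP_ID=$(az webapp show -g iacdemo-rg -n iacdemo-app --query id -o tsv)
az tag update --resource-id "$APP_ID" --operation Merge \
  --tags manual_update=True changed_by=portal_user

# 2. Downgrade the minimum TLS version while troubleshooting (risky drift).
az webapp config set -g iacdemo-rg -n iacdemo-app --min-tls-version 1.0

# 3. Scale the plan from B1 to S1 to buy headroom (critical drift).
az appservice plan update -g iacdemo-rg -n iacdemo-plan --sku S1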
Act 3: The Drift Check Fires

The next morning, a nightly speculative Terraform plan runs and detects 3 drifted attributes. The notification webhook fires, flowing through the Logic App auth bridge to the SRE Agent HTTP Trigger. The agent wakes up and begins its investigation.

What the Agent Found

Layer 1: Drift Detection

The agent compares Terraform configuration against Azure reality and produces a severity-classified drift report. Three drift items detected:

- Critical: App Service Plan SKU changed from B1 (~$13/mo) to S1 (~$73/mo) — a +462% cost increase
- Risky: Minimum TLS version downgraded from 1.2 to 1.0 — a security regression vulnerable to BEAST and POODLE attacks
- Benign: Additional tags (changed_by: portal_user, manual_update: True) — cosmetic, no functional impact

Layer 2: Incident Correlation

Here's where the agent goes beyond simple drift detection. It queries Application Insights and discovers a performance incident correlated with the SKU change. Key findings from the incident correlation:

- 97.6% of requests (40 of 41) were impacted by high latency
- The /api/data endpoint does not exist in the repository source code — the deployed application has diverged from the codebase
- The endpoint likely contains a blocking synchronous pattern — Node.js runs on a single event loop, and any synchronous blocking call would explain 26-58s response times
- The SKU scale-up from B1 to S1 was an attempt to mitigate latency by adding more compute, but scaling cannot fix application-level blocking code on a single-threaded Node.js server

Layer 3: Smart Remediation

This is the insight that separates an autonomous agent from a reporting tool. Instead of blindly recommending "revert all drift," the agent produces context-aware remediation recommendations:

- Tags (Benign) → Safe to revert anytime via terraform apply -target
- TLS 1.0 (Risky) → Revert immediately — the TLS downgrade is a security risk unrelated to the incident
- SKU S1 (Critical) → DO NOT revert until the /api/data performance root cause is fixed

This is the logic an experienced SRE would apply. Blindly running terraform apply to revert all drift would scale the app back down to B1 while the blocking code is still deployed — turning a mitigated incident into an active outage.
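If the TLS setting and tags live on the web app resource while the SKU lives on the service plan (an assumption about this repo's layout; only azurerm_service_plan.demo appears in the post, and azurerm_linux_web_app.demo is a hypothetical address), the agent's recommendation translates into a targeted apply that reverts the benign and risky drift while leaving the mitigating SKU change in place:

# Revert tags and TLS on the web app only; the plan and its S1 SKU are untouched.
terraform plan  -target=azurerm_linux_web_app.demo
terraform apply -target=azurerm_linux_web_app.demo

# Later, once the /api/data fix has shipped, a full run reconciles the SKU:
# terraform plan   # should now show only the S1 -> B1 difference
# terraform apply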
Layer 4: Investigation Summary

The agent produces a complete summary tying everything together. Key findings in the summary:

- Actor: surivineela@microsoft.com made all changes via the Azure Portal at ~23:19 UTC
- Performance incident: /api/data averaging 25-57s latency, affecting 97.6% of requests
- Code-infrastructure mismatch: /api/data exists in production but not in the repository source code
- Root cause: the SKU scale-up was emergency incident response, not unauthorized drift

Layer 5: Teams Notification

The agent posts a structured drift report to the team's Microsoft Teams channel. The on-call engineer opens Teams in the morning and sees everything they need: what drifted, why it drifted, and exactly what to do about it — without logging into any dashboard.

The Payoff: A Self-Improving Agent

Here's where the demo surprised us. After completing the investigation, the agent did two things we didn't explicitly ask for.

The Agent Improved Its Own Skill

The agent performed an Execution Review — analyzing what worked and what didn't during its investigation — and found 5 gaps in its own terraform-drift-analysis.md skill file.

What worked well:

- Drift detection via az CLI comparison against Terraform HCL was straightforward
- Activity Log correlation identified the actor and timing
- Application Insights telemetry revealed the performance incident driving the SKU change

Gaps it found and fixed:

- No incident correlation guidance — the skill didn't instruct checking App Insights
- No code-infrastructure mismatch detection — no guidance to verify deployed code matches the repository
- No smart remediation logic — didn't warn against reverting critical drift during active incidents
- Report template missing an incident correlation column
- No Activity Log integration guidance — didn't instruct checking who made changes and when

The agent then edited its own skill file to incorporate these learnings. Next time it runs a drift analysis, it will include incident correlation, code-infra mismatch checks, and smart remediation logic by default. This is a learning loop — every investigation makes the agent better at future investigations.

The Agent Created a PR

Without being asked, the agent identified the root-cause code issue and proactively created a pull request to fix it. The PR includes:

- App safety fixes: adding MAX_DELAY_MS and SERVER_TIMEOUT_MS constants to prevent unbounded latency
- Skill improvements: incorporating incident correlation, code-infra mismatch detection, and smart remediation logic

From a single webhook: drift detected → incident correlated → root cause found → team notified → skill improved → fix shipped.

Key Takeaways

- Drift detection is not enough. Knowing that B1 changed to S1 is table stakes. Knowing it changed because of a latency incident, and that reverting it would cause an outage — that's the insight that matters.
- Context-aware remediation prevents outages. Blindly running terraform apply after drift would have scaled the app back to B1 while blocking code was still deployed. The agent's "DO NOT revert SKU" recommendation is the difference between fixing drift and causing a P1.
- Skills create a learning loop. The agent's self-review and skill improvement mean every investigation makes the next one better — without human intervention.
- HTTP Triggers connect any platform. The auth bridge pattern (Logic App + Managed Identity) works for Terraform Cloud, but the same architecture applies to any webhook source: GitHub Actions, Jenkins, Datadog, PagerDuty, custom internal tools.
- The agent acts, not just reports. From a single webhook: drift detected, incident correlated, root cause identified, team notified via Teams, skill improved, and PR created. End-to-end in one autonomous session.
Getting Started

HTTP Triggers are available now in Azure SRE Agent:

1. Create a Skill — Teach the agent your operational runbook (in this case, drift analysis with severity classification and smart remediation)
2. Create an HTTP Trigger — Define your agent prompt with {payload.X} placeholders and connect it to a skill
3. Set Up an Auth Bridge — Deploy a Logic App with Managed Identity to handle Azure AD token acquisition
4. Connect Your Source — Point Terraform Cloud (or any webhook-capable platform) at the Logic App URL
5. Connect GitHub + Teams — Give the agent access to source code and team notifications

Within minutes, you'll have an autonomous pipeline that turns infrastructure drift events into fully contextualized investigations — with incident correlation, root cause analysis, and smart remediation recommendations.

The full implementation guide, Terraform files, skill definitions, and demo scripts are available in this repository.
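One way to smoke-test the wiring without waiting for real drift is to post a hand-built payload at the Logic App URL yourself. The field names below come from the {payload.X} placeholders in the trigger prompt; the URL and values are placeholders, and this assumes your Logic App accepts a plain JSON POST:

LOGIC_APP_URL="<your Logic App callback URL>"   # placeholder

curl -X POST "$LOGIC_APP_URL" \
  -H "Content-Type: application/json" \
  -d '{
        "workspace_name":    "iacdemo",
        "organization_name": "contoso",
        "run_id":            "run-TEST123",
        "run_message":       "Manual smoke test of the tfc-drift-handler trigger"
      }'

If everything is connected, the tfc-drift-handler trigger should fire and an agent session should appear, investigating the (test) drift end to end.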