
Private subnets by default in Azure Virtual Networks: What changed and how to use NAT Gateway

aimeelittleton — Wed, 22 Apr 2026 19:09:37 GMT

Azure is evolving to better support secure‑by‑default cloud architectures.

Starting with API version 2025‑07‑01 (released after March 31, 2026), newly created virtual networks now default to using private subnets. This change removes the long‑standing platform behavior of automatically enabling outbound internet access through implicit public IPs, also known as default outbound access (DOA).

As a result, virtual machines deployed into these new virtual networks will not have public outbound connectivity unless it is explicitly configured.

What changed?

Previously, Azure automatically assigned a hidden Microsoft‑owned public IP to virtual machines deployed without an explicit outbound method (such as NAT Gateway, Load Balancer outbound rules, or instance‑level public IPs). This allowed public outbound connectivity without requiring customer configuration.

While convenient, this model introduced challenges:

Security – Implicit internet access conflicts with Zero Trust principles.

Reliability – Platform‑managed outbound IPs can change unexpectedly.

Operational consistency – VMSS instances or multi‑NIC VMs may egress using different default outbound IPs.

With API version 2025‑07‑01 and later:

Subnets in newly created VNets are private by default.

The subnet property `defaultOutboundAccess` is set to false.

Azure no longer assigns implicit outbound public IPs.

This applies across deployment methods including Portal, ARM/Bicep, CLI, and PowerShell. Portal has started using the new model as of April 1, 2026.

Note: This change has not yet been applied to Terraform.

Am I impacted by this change?

Deployment scenario	Behavior
Existing VNets or VMs using DOA	✅ Unchanged
New VMs in existing VNets	✅ Unchanged
Subnets already using explicit outbound	✅ Continue using configured outbound method
New VMs in new VNets (with subnets created using API version 2025-07-01 or later)	🔒 Subnets private by default
New VMs in private subnets without explicit outbound configured	❌ No public outbound connectivity

Existing workloads are not impacted.

If required, you can still create new subnets without the private setting by choosing the appropriate configuration option during creation. See the FAQ section of this blog for more information. However, we strongly recommend transitioning to an explicit outbound method so that:

Your workloads won’t be affected by public IP address changes.

You have greater control over how your VMs connect to public endpoints.

Your VMs use traceable IP resources that you own.

When is outbound connectivity required?

If your virtual network contains virtual machines, you must configure explicit outbound connectivity. Here are common scenarios that require it:

Virtual machine operating system activation and updates, such as Windows or Linux.

Pulling container images from public registries (Docker Hub or Microsoft Container Registry).

Accessing third-party SaaS or public APIs.

Virtual machine scale sets using flexible orchestration mode are always secure by default and therefore require an explicit outbound method.

Private subnets don’t apply to delegated or managed subnets that host PaaS services. In these cases, the service handles outbound connectivity—see the service-specific documentation for details.

Recommended outbound connectivity method: StandardV2 NAT Gateway

Azure now recommends using an explicit outbound connectivity method such as:

NAT Gateway

Load Balancer outbound rules

Public IP assigned to the VM

Network Virtual Appliance (NVA) / Firewall

Among these, Azure StandardV2 NAT Gateway is the recommended method for scalable and resilient outbound connectivity.

StandardV2 NAT Gateway:

Provides zone‑redundancy by default in supported regions

Supports up to 100 Gbps throughput

Provides dual-stack support with IPv4 and IPv6 public IPs

Uses customer‑owned static public IPs

Enables outbound connectivity without allowing inbound internet access

Requires no route table configuration when associated to a subnet

When configured, NAT Gateway automatically becomes the subnet’s default outbound path and takes precedence over:

Load Balancer outbound rules
VM instance‑level public IPs

Note: UDRs that direct 0.0.0.0/0 traffic to a virtual appliance or firewall take precedence over NAT gateway.

Flow chart showing priority order for different outbound methods
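The priority order described above can be sketched as a small shell function. This is only an illustration, not an Azure tool: the function name and its yes/no arguments are hypothetical, and ranking an instance-level public IP above Load Balancer outbound rules is an assumption based on Azure's published precedence order rather than something stated in this post.

```shell
# Sketch: which outbound method would carry a VM's internet-bound traffic?
# Arguments are yes/no flags (hypothetical interface, for illustration only):
#   $1 UDR for 0.0.0.0/0 pointing at an NVA/firewall
#   $2 NAT gateway attached to the subnet
#   $3 instance-level public IP on the VM
#   $4 Load Balancer outbound rule
#   $5 subnet is private (defaultOutboundAccess = false)
effective_outbound() {
    if [ "$1" = yes ]; then echo "udr-to-nva"
    elif [ "$2" = yes ]; then echo "nat-gateway"
    elif [ "$3" = yes ]; then echo "instance-public-ip"   # assumed to rank above LB rules
    elif [ "$4" = yes ]; then echo "lb-outbound-rule"
    elif [ "$5" = yes ]; then echo "none"                  # private subnet: no implicit IP
    else echo "default-outbound-access"
    fi
}

effective_outbound no yes yes no yes    # prints: nat-gateway
effective_outbound no no no no yes      # prints: none
```

The second call is the case this post warns about: a private subnet with no explicit method leaves the VM without public outbound connectivity.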

Migrate from Default Outbound Access to NAT Gateway

To transition from DOA to StandardV2 NAT Gateway, Azure’s recommended outbound method:

1. Go to your virtual network in the portal, and select the subnet you want to modify.
2. In the Edit subnet menu, select the ‘Enable private subnet’ checkbox under the Private subnet section.

Enabling private subnet can also be done through other supported clients. Below is a CLI example in which the default-outbound parameter is set to false:

az network vnet subnet update \
  --resource-group rgname \
  --name subnetname \
  --vnet-name vnetname \
  --default-outbound false

3. Deploy a StandardV2 NAT gateway resource.

4. Associate one or more StandardV2 public IP addresses or prefixes.

5. Attach the NAT gateway to the target subnet.

Once associated:

All new outbound traffic from that subnet uses NAT Gateway automatically

VM‑level public IPs are no longer required

Existing outbound connections are not interrupted

Note: Enabling private subnet on an existing subnet will not affect any VMs already using default outbound IPs. Private subnet ensures that only new VMs don’t receive a default outbound public IP.

For step-by-step guidance, see migrate default outbound access to NAT Gateway.
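As a rough sketch, steps 2 through 5 could be scripted with the Azure CLI along these lines. The resource names are placeholders, the `run` helper is a hypothetical dry-run wrapper that prints commands by default, and passing `StandardV2` as the `--sku` value for the public IP and the NAT gateway is an assumption to verify against your CLI version.

```shell
# Dry-run by default: commands are printed, not executed. Set DRY_RUN=0 to run them.
run() {
    if [ "${DRY_RUN:-1}" = 1 ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Placeholder names (hypothetical); replace with your own resources.
RG=rgname; VNET=vnetname; SUBNET=subnetname
PIP=natgw-pip; NATGW=natgw

migrate_to_nat_gateway() {
    # Step 2: make the subnet private (same command as shown in this post).
    run az network vnet subnet update --resource-group "$RG" --vnet-name "$VNET" \
        --name "$SUBNET" --default-outbound false
    # Steps 3-4: public IP and NAT gateway. The StandardV2 SKU value is an assumption.
    run az network public-ip create --resource-group "$RG" --name "$PIP" --sku StandardV2
    run az network nat gateway create --resource-group "$RG" --name "$NATGW" \
        --public-ip-addresses "$PIP" --sku StandardV2
    # Step 5: attach the NAT gateway to the target subnet.
    run az network vnet subnet update --resource-group "$RG" --vnet-name "$VNET" \
        --name "$SUBNET" --nat-gateway "$NATGW"
}

migrate_to_nat_gateway
```

Reviewing the printed commands before setting `DRY_RUN=0` is a cheap way to catch a wrong resource group or subnet name.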

FAQ

1. Will my existing workloads lose outbound connectivity?

No. Workloads currently using default outbound IPs are not impacted by this change. The private subnet by default update only affects:

Newly created VNets

New subnets created using the updated API, 2025-07-01

New virtual machines deployed into those subnets using the updated API

VMs and subnets using an explicit outbound connectivity method like a NAT gateway, NVA / Firewall, a VM instance-level public IP, or Load Balancer outbound rules are not impacted by this change.

2. Why can’t my new VM reach the internet or other public endpoints within Microsoft (e.g. VM activation, updates)?

New subnets are private by default. If your deployment does not include an explicit outbound method (such as a NAT Gateway, public IP, Load Balancer outbound rule, or NVA/Firewall), outbound connectivity is not automatically enabled.

3. My workload has a dependency on default outbound IPs and isn’t ready to move to private subnets. What should I do?

You can opt out of the default private subnet setting by disabling the private subnet feature. In the portal, clear the private subnet checkbox when you create or edit the subnet.

Disabling private subnet can also be done through other supported clients. Below is a CLI example in which the default-outbound parameter is set to true:

az network vnet subnet update \
  --resource-group rgname \
  --name subnetname \
  --vnet-name vnetname \
  --default-outbound true

4. Why do I see an alert showing that I have a default outbound IP on my VM?

A NIC-level parameter, `defaultOutboundConnectivityEnabled`, tracks whether a default outbound IP is allocated to a VM or Virtual Machine Scale Set instance. If one is detected, the Azure portal displays a notification banner and Azure Advisor recommendations are generated about disabling default outbound connectivity for your VMs and VMSS instances.

5. How do I clear this alert?

To remove the default outbound IP and clear the alert:

1. Configure a StandardV2 NAT gateway (or another explicit outbound method).
2. Set your subnet to private, either in the portal or by setting the subnet property defaultOutboundAccess to false using one of the supported clients.
3. Stop and deallocate any applicable virtual machines (this removes the default outbound IP currently associated with the VM).
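A minimal CLI sketch of the last two steps follows, assuming an explicit outbound method is already in place. The function name, the `run` dry-run helper, and the resource names are hypothetical, and starting each VM again after deallocation is an assumption about what you want, not something this post prescribes.

```shell
# Dry-run by default: commands are printed, not executed. Set DRY_RUN=0 to run them.
run() {
    if [ "${DRY_RUN:-1}" = 1 ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Usage: clear_default_outbound_ip <resource-group> <vnet> <subnet> <vm>...
clear_default_outbound_ip() {
    rg=$1; vnet=$2; subnet=$3
    shift 3
    # Make the subnet private so no default outbound IP is assigned again.
    run az network vnet subnet update --resource-group "$rg" --vnet-name "$vnet" \
        --name "$subnet" --default-outbound false
    for vm in "$@"; do
        # Deallocation releases the default outbound IP held by the VM.
        run az vm deallocate --resource-group "$rg" --name "$vm"
        # Assumption: the VM should come back up afterwards.
        run az vm start --resource-group "$rg" --name "$vm"
    done
}

clear_default_outbound_ip rgname vnetname subnetname vm1 vm2
```

Deallocating causes downtime for each VM, so in practice this would be run in a maintenance window or one instance at a time.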

6. I have a NAT gateway (or UDR pointing to an NVA) configured for my subnet. Why do I still see this alert?

In some cases, a default outbound IP is still assigned to virtual machines in a non-private subnet, even when an explicit outbound method—such as a NAT gateway or a UDR directing traffic to an NVA/firewall—is configured.

This does not mean that the default outbound IP is used for egress traffic.

To fully remove the assignment (and clear the alert):

Set the subnet to private

Stop and deallocate the affected virtual machines

Summary

The move to private subnets by default improves the security posture of Azure networking deployments by removing implicit outbound internet access.

Customers deploying new workloads must now explicitly configure outbound connectivity.
StandardV2 NAT Gateway provides a scalable, resilient method for enabling outbound internet access without exposing workloads to inbound connections or relying on platform‑managed IPs.

Learn more

Default Outbound Access

StandardV2 NAT Gateway

Migrate Default Outbound Access to StandardV2 NAT Gateway

AI-Powered Downtime Investigation for Azure VMs: Automating Root Cause Analysis

Jon_Andoni_Baranda — Wed, 22 Apr 2026 18:34:50 GMT

Co-authors: Jie Su, Abhinav Dua, Mukthar Ahmed, Dhruv Joshi

In a previous post, we shared how Azure Automated VM Recovery works to minimize virtual machine downtime through a three-stage approach: Detection, Diagnosis, and Mitigation. This post goes one layer deeper into how our team is using AI to transform incident investigation, one of the most time-consuming parts of that process.

When an alert fires for a recovery event taking longer than expected, a DRI is notified and a ticket is opened. From there, the DRI must manually dig through logs across multiple sources, build Kusto queries from scratch, and correlate timestamps across systems to identify where time was lost. Historically, this took 30 to 60 minutes per investigation. On top of that, an engineering manager or TPM had to review the incident, understand the failure, and route it to the right engineer, often resulting in multiple handoffs before the right owner was found. Across a platform the size of Microsoft Azure, that time adds up. That is the problem we set out to solve.

How do we use AI for long duration downtime investigation?

Model Context Protocol (MCP) is a standardized protocol that connects AI models to external tools; in our case, Kusto databases, log analyzers, and incident metadata extractors. Rather than generating text about what might be wrong, the AI actually runs real queries against live telemetry. Critically, this is not a chatbot. There is no interface for a DRI to interact with. When an incident fires, the system triggers automatically, runs the full investigation pipeline, and attaches a structured analysis report directly to the ticket. By the time a DRI opens the alert, the work is already done.

The real intelligence in this system goes beyond incident analysis. It comes from encoded domain knowledge about what "normal" looks like: expected recovery timelines for different error categories, log patterns that indicate specific failure modes, and the precise meaning of each phase in the healing workflow. The system knows, for example, how to distinguish a diagnostics bottleneck from a node isolation bottleneck, and what it signals when a particular isolation step runs longer than expected. This is knowledge that took our team years to accumulate, now automatically applied to every incident. Ultimately, the goal is not to replace the DRI but to eliminate the manual investigation work so they can focus on what matters most: making the right call. The system surfaces the analysis; a human always makes the final decision.

How the System Works

The investigation pipeline follows a six-step reasoning chain that mirrors how our best engineers approach manual triage.

Step 1 (Parse and Identify): The system extracts the key metadata from the ticket incident: the affected node identifier, container identifier, the timestamp when the VM went down, and the total duration of the outage. These parameters become the inputs for everything that follows.

Step 2 (Query VM Health Events): Using the extracted metadata, the AI runs AI-assisted triage queries against VM availability tables, retrieving the sequence of state transitions the virtual machine experienced during the incident window.

Step 3 (Check Host Health): The AI then queries host-level health event tables, examining node state changes to understand what the underlying host was doing during the same period. This establishes whether the issue originated at the VM level or at the node level.

Step 4 (Correlate Repair Service Logs): With both the VM and host picture in hand, the AI cross-references repair service logs to trace when our repair orchestration service was triggered, what actions it took, and how long each step took.

Step 5 (Build the Timeline): The AI assembles all of the retrieved data into a chronological, end-to-end timeline of the recovery event. This timeline maps directly to the three phases we track: Time to Detect (TTD), Time to Diagnose (TTDiag), and Time to Mitigate (TTM), as well as Time to Isolate (TTI) when service healing is involved.

Step 6 (Root Cause and Report): Finally, the AI analyzes the timeline, identifies which phase contained the largest gap, determines what operation caused the bottleneck, and generates a structured investigation report that is automatically attached to the ticket incident.
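To make the "largest gap" idea in Step 6 concrete, here is a toy sketch that picks the longest phase from a set of durations. It is not the team's pipeline, which queries live telemetry through MCP; the function name and the sample durations are hypothetical.

```shell
# Print the phase with the largest duration from "phase=seconds" arguments.
largest_phase() {
    best_name=""; best_secs=-1
    for pair in "$@"; do
        name=${pair%%=*}     # text before the first "="
        secs=${pair#*=}      # text after the first "="
        if [ "$secs" -gt "$best_secs" ]; then
            best_name=$name; best_secs=$secs
        fi
    done
    echo "$best_name $best_secs"
}

# Hypothetical durations in seconds for one recovery event.
largest_phase TTD=40 TTDiag=310 TTM=95 TTI=120   # prints: TTDiag 310
```

In this made-up example diagnosis dominates the timeline, so the report would point at the diagnostics step as the bottleneck to investigate.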

Results and conclusion

The results are measurable across three dimensions. On speed, the investigation pipeline now completes in under 5 minutes, down from 30 to 60 minutes manually, a roughly 90% reduction that shaves 50% off total triage time. On consistency, 100% of qualifying incidents receive the same thorough analysis regardless of who is on call, with the full phase breakdown (TTD, TTDiag, TTM, and TTI) applied every time. On ownership, the generated report gives managers and TPMs immediate context to assign the incident to the right engineer from the start, eliminating the back-and-forth handoffs that previously delayed remediation. This solution saves engineering managers and TPMs 10 to 20 minutes of manual work per incident.
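As a quick check on the speed figure, the arithmetic can be worked through with 5 minutes as the upper bound for the automated run. The helper name is hypothetical.

```shell
# Rounded percent reduction. $1 = minutes before, $2 = minutes after.
reduction_pct() {
    echo $(( (($1 - $2) * 100 + $1 / 2) / $1 ))
}

reduction_pct 30 5   # prints: 83
reduction_pct 60 5   # prints: 92
```

So the reduction falls between about 83% and 92% depending on the manual baseline, which brackets the rough 90% quoted above.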

By encoding our team's best practices into an automated pipeline, we turned a slow, inconsistent manual process into something fast, thorough, and always available. MCP offers a practical path for any engineering team to make the knowledge of their most experienced engineers universally accessible, not as documentation, but as an automated system that applies it to every incident, every time. We will continue to share updates as this evolves and would love to hear from teams working on similar problems.

Azure VNet Data Gateway for Secure Power BI & Power Platform Access in Enterprises

kirankumar_manchiwar04 — Wed, 22 Apr 2026 16:30:19 GMT

What Is a VNet data gateway?

The VNet data gateway is a Microsoft‑managed gateway service that runs inside a delegated subnet of an Azure Virtual Network. It allows supported Microsoft cloud services—such as Power BI, Power Platform dataflows, and Microsoft Fabric workloads—to securely connect to data sources that are protected using private networking.

Key characteristics:

No customer‑managed VM or container

No OS, patching, or gateway software upgrades

Gateway lifecycle fully managed by Microsoft

Traffic stays on the Azure backbone network

Works seamlessly with Private Endpoints

This makes it ideal for enterprise and regulated environments where security and operational efficiency are equally important.

Why Enterprises need VNet data gateway

Eliminates gateway infrastructure management

Traditional gateways require:

Virtual machines

High availability setup

OS patching and scaling

Monitoring and troubleshooting

With the VNet data gateway:

Microsoft manages compute lifecycle

No VM or gateway software to maintain

No HA or load balancer design needed

✅ Result: Significant reduction in operational and maintenance overhead for platform and infrastructure teams.

Secure access to private Azure resources

Most enterprise Azure environments use:

Private Endpoints

NSGs and route tables

Firewalls blocking public access

The VNet data gateway:

Is injected into a delegated subnet in your VNet

Uses private IP addressing

Enforces NSG and UDR rules

Communicates with Microsoft services over a Microsoft‑managed internal tunnel

✅ Result: Data sources remain fully private—no public endpoints or inbound ports required.

Designed for Power Platform & Power BI at Scale

The gateway supports secure access for:

Power BI semantic models

Power BI paginated reports

Microsoft Fabric Dataflow Gen2

Fabric pipelines and copy jobs

Because it’s cloud‑native and centrally managed, the VNet data gateway scales well in large enterprises standardizing on Power Platform and Fabric.

High‑level architecture overview

At runtime, the VNet data gateway works as follows:

A query is initiated from Power BI / Power Platform
Query details and credentials are sent to the Microsoft Power Platform VNet service
A containerized gateway instance is injected into the delegated subnet
The gateway connects to the private data source using private networking
Results are sent back to Power BI or Power Platform via a Microsoft‑managed internal tunnel

Key security highlights:

No inbound connectivity

No public IP exposure

Traffic remains on Azure backbone

Full enforcement of NSGs and routing rules

Key Enterprise benefits

Least management overhead – no gateway servers

Zero Trust aligned – private-only connectivity

Fully managed by Microsoft

Enterprise-grade security & governance

Works with Azure Private Endpoint architectures

When to Use VNet Data Gateway

Scenario	Recommendation
Azure private PaaS services	✅ VNet data gateway
Private Endpoint–only access	✅ VNet data gateway
Zero Trust network model	✅ VNet data gateway
Minimal ops & maintenance	✅ VNet data gateway
On‑prem only, no Azure	❌ Traditional gateway

Step‑by‑step configuration: VNet data gateway (Enterprise setup)

High‑level flow (What you will configure)

Register required Azure resource provider
Prepare Azure Virtual Network and subnet
Configure private connectivity to data source
Create the VNet data gateway
Create and bind data source connections
Validate with Power BI / Power Platform workloads

Step 1: Register Microsoft.PowerPlatform resource provider

Why this step is required

The VNet data gateway is a Microsoft‑managed service that is injected into your Azure VNet. Azure must explicitly allow Power Platform to deploy managed infrastructure into your subscription.

Configuration steps

Sign in to Azure portal
Navigate to Subscriptions
Select the subscription that hosts the target VNet
Go to Resource providers
Search for Microsoft.PowerPlatform
Click Register

✅ Status must show Registered

This step enables subnet delegation to Power Platform services.

Step 2: Prepare the Azure Virtual Network

Why this step is required

The gateway runs inside your VNet. It must be placed in a dedicated, delegated subnet to maintain isolation and security boundaries.

Requirements

VNet can be in any Azure region

Subnet must be exclusive to VNet data gateway

Subnet must have outbound connectivity to the data source

Configuration steps

Go to Azure portal → virtual networks
Select your existing VNet (or create one)
Navigate to Subnets → + Subnet
Configure:
- Subnet name: snet-vnet-datagateway
- Address range: /27 or larger (recommended)
- Subnet delegation: Microsoft.PowerPlatform/vnetaccesslinks
Save the subnet

⚠️ Do not place any VMs, app gateway, or other workloads in this subnet.
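For teams that script their landing zones, Steps 1 and 2 could look roughly like this with the Azure CLI. The `run` dry-run helper, the function name, and the 10.0.1.0/27 prefix are hypothetical; the subnet name and delegation value come from the steps above.

```shell
# Dry-run by default: commands are printed, not executed. Set DRY_RUN=0 to run them.
run() {
    if [ "${DRY_RUN:-1}" = 1 ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Usage: prepare_gateway_subnet <resource-group> <vnet> <address-prefix>
prepare_gateway_subnet() {
    rg=$1; vnet=$2; prefix=$3
    # Step 1: register the resource provider, then read back its state.
    run az provider register --namespace Microsoft.PowerPlatform
    run az provider show --namespace Microsoft.PowerPlatform \
        --query registrationState --output tsv
    # Step 2: create the dedicated subnet with the Power Platform delegation.
    run az network vnet subnet create --resource-group "$rg" --vnet-name "$vnet" \
        --name snet-vnet-datagateway --address-prefixes "$prefix" \
        --delegations Microsoft.PowerPlatform/vnetaccesslinks
}

prepare_gateway_subnet rgname vnetname 10.0.1.0/27
```

Registration can take a few minutes, so the `az provider show` call may need repeating until it reports Registered.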

Step 3: Configure private connectivity to the data source

Why this step is required

Enterprises typically block public access to PaaS services. The VNet data gateway is designed to work natively with private endpoints.

Example: Azure SQL / SQL Managed Instance

Create Private Endpoint for the data service
Attach it to the same VNet (can be different subnet)
Create or link a Private DNS Zone, for example:
privatelink.database.windows.net
Link the Private DNS Zone to the VNet
Ensure DNS resolution from the delegated subnet resolves to private IP

✅ This ensures all traffic remains private and internal.
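One way to spot-check the DNS step is to resolve the data source name from a machine in the same VNet and confirm that the answer is a private address. The sketch below assumes a Linux host with glibc `getent` and checks only the RFC 1918 IPv4 ranges; both function names are hypothetical.

```shell
# Succeeds if the IPv4 address falls in an RFC 1918 private range.
is_private_ipv4() {
    case "$1" in
        10.*|192.168.*) return 0 ;;
        172.1[6-9].*|172.2[0-9].*|172.3[01].*) return 0 ;;
        *) return 1 ;;
    esac
}

# Resolve a host name and report whether its first IPv4 address is private.
check_private_dns() {
    ip=$(getent ahostsv4 "$1" | awk '{print $1; exit}')
    if [ -z "$ip" ]; then
        echo "$1: no IPv4 address found"
        return 2
    fi
    if is_private_ipv4 "$ip"; then
        echo "$1 -> $ip (private)"
    else
        echo "$1 -> $ip (NOT private)"
        return 1
    fi
}
```

For example, `check_private_dns yourserver.database.windows.net` (a placeholder name) should report a private address once the Private DNS Zone is linked to the VNet.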

Step 4: Create the VNet data gateway

Why this step is required

This is where the actual Microsoft‑managed gateway is logically created and associated with your VNet.

Configuration steps

You can do this from either Power BI Service or Power Platform Admin Center.

Using Power Platform Admin Center

Go to https://admin.powerplatform.microsoft.com

Select Data → Gateways

Click + New → Virtual network data gateway

Provide:

- Gateway name

- Azure subscription

- Resource group

- Virtual network

- Delegated subnet

Click Create

📌 Notes:

Gateway metadata is stored in Power BI tenant home region

Gateway runtime executes in the VNet region

No VM or scale settings are required

Step 5: Create and configure data source connections

Why this step is required

The gateway exists, but Power BI / Power Platform must know which data sources can be accessed via it.

Configuration steps (Power BI example)

Go to Power BI Service
Navigate to Settings → Manage connections and gateways
Select the newly created VNet data gateway
Click + New connection
Provide:
- Data source type (Azure SQL, Storage, Databricks, etc.)
- Server / endpoint name (private DNS name)
- Authentication (SQL / Entra ID)
Save the connection
Assign users or security groups

✅ This step enables governance and access control.

Step 6: Use the gateway in Power BI / Power Platform

Power BI

Open dataset or semantic model settings
Under Gateway connection, select Use a data gateway
Choose the VNet data gateway
Apply changes
Refresh or run queries

Power Platform / Fabric

Select the same connection when configuring:

Dataflows Gen2

Fabric pipelines

Copy jobs

Step 7: Validate and test

Validation Checklist

✅ DNS resolves to private IP
✅ No public endpoint access enabled
✅ NSGs allow outbound traffic to data source
✅ Dataset refresh succeeds
✅ No gateway VM exists in subscription

Optional:

Enable logging and auditing from Power BI / Fabric

Monitor gateway health in Admin Center

Key Enterprise design guidance (Best practices)

Use one gateway per environment tier (Prod / Non‑Prod)

Use dedicated VNets for data access where possible

Use Private Endpoint only (avoid service endpoints)

Control access via Microsoft Entra ID groups, not individuals

Avoid mixing gateway subnet with other workloads

Conclusion: For enterprises looking to consume Power Platform, Power BI, and Microsoft Fabric securely while keeping operational overhead close to zero, the VNet data gateway is the recommended approach.

It removes gateway infrastructure complexity, strengthens security posture, and aligns perfectly with modern Azure landing zone and Zero Trust architectures.

NFS Permission Denied in Azure App Service on Linux: What It Means and What to Do

michelleyau — Wed, 22 Apr 2026 07:31:52 GMT

If your Azure App Service on Linux uses an Azure Files NFS share, you may sometimes see errors like Permission denied or Errno 13 when your app tries to write to the mounted path. Azure Files supports NFS for Linux and Unix workloads, and NFS uses Unix-style numeric ownership and permissions (UID/GID), which can behave differently from SMB-based file sharing.

Overview

This post is for customers using Azure App Service on Linux together with an Azure Files NFS share for persistent storage. Azure Files NFS is designed for Linux and Unix-style workloads, supports POSIX-style permissions, and does not support Windows clients or NFS ACLs.

In this setup, a write failure does not always mean the file is corrupted. Sometimes it means the file ownership seen by the running app no longer matches the identity context currently used to access the NFS share. In containerized Linux environments, user IDs inside a container can be mapped differently outside the container, and Docker documents that this can affect access to host-mounted resources.

Common signs

You may notice:

Permission denied
Errno 13
your app can read files but cannot update or overwrite them
file ownership looks different than expected when you inspect the mounted path

These symptoms are consistent with how NFS handles Unix-style ownership and permissions. Azure documents that NFS permissions are enforced through the operating system and NFS model rather than SMB-style user authentication.

Why this can happen

At a high level, NFS uses numeric ownership such as UID and GID. In container-based Linux environments, the identity that appears inside the container is not always the same as the identity seen outside the container. Docker’s user namespace documentation explains that a container user such as root can be mapped to a less-privileged user on the host, and that mounted-resource access can become more complex because of that mapping.

That means a file created earlier under one effective identity context may later be accessed under a different one. When that happens, the app may no longer be able to write to the file even though the file itself is still present and intact.

What to check first

Start by checking the mounted share from the app’s runtime context.

ls -l /mount/path/file
ls -ln /mount/path/file
id -u
id -g

The ls -ln output is especially useful because it shows the numeric UID and GID directly. If you need shell access for investigation, App Service supports SSH into Linux containers, and Microsoft notes that Linux custom containers may need extra SSH configuration.
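The comparison those commands support can be wrapped in a small helper that prints the file's numeric owner next to the identity of the current process. This is a sketch that assumes GNU `stat` (as found on typical Linux images); the function name is hypothetical, and a matching owner does not by itself prove write access, since group and mode bits also apply.

```shell
# Compare the numeric owner of a path with the identity of the current process.
check_ownership() {
    path=$1
    file_uid=$(stat -c %u "$path") || return 2
    file_gid=$(stat -c %g "$path") || return 2
    my_uid=$(id -u)
    my_gid=$(id -g)
    echo "file uid:gid = $file_uid:$file_gid, process uid:gid = $my_uid:$my_gid"
    if [ "$file_uid" = "$my_uid" ]; then
        echo "owner matches"
    else
        echo "owner mismatch"
        return 1
    fi
}
```

Running `check_ownership /mount/path/file` from the app's runtime context shows at a glance whether the file was created under a different UID than the one the app is using now.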

You should also review the NFS share’s squash setting. Azure Files NFS supports No Root Squash, Root Squash, and All Squash. Microsoft documents these options in the root squash guidance.

A practical mitigation

If the main issue is inconsistent ownership behavior, a practical mitigation is often to use All Squash on the NFS share. Azure documents All Squash as a supported NFS setting, and squash settings are specifically intended to control how client identities are handled when they access the share.

One important note: changing the squash setting does not automatically rewrite old files. If existing data was created under a different ownership context, you may still need to migrate that data to a new share configured the way you want.

Recommended approach

A simple and cautious approach is:

Create a new Azure Files NFS share.
Configure it with All Squash if that matches your workload needs.
Mount both the old share and the new share on a Linux environment.
Copy the data from old to new.
Validate that the app can read and write correctly.
Repoint production to the validated share.

Azure Files supports NFS shares and squash configuration, and Azure also documents how to mount NFS shares on Linux if you need a separate environment for validation or migration.
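The copy and validation steps might be sketched as follows, assuming both shares are already mounted on a Linux machine. The function name is hypothetical. Ownership is deliberately not preserved, so copied files take on whatever identity the new share assigns to the client.

```shell
# Copy the contents of the old share into the new one, then verify the copy.
# Usage: copy_share <old-mount-point> <new-mount-point>
copy_share() {
    src=$1; dst=$2
    # "src/." copies the directory contents, including hidden files.
    cp -R "$src/." "$dst/" || return 1
    # Compare file contents recursively; any difference fails the check.
    if ! diff -r "$src" "$dst" > /dev/null; then
        echo "verification failed"
        return 1
    fi
    echo "copied and verified"
}
```

With placeholder mount points this would be invoked as `copy_share /mnt/old-share /mnt/new-share`, followed by a write test from the app before repointing production.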

Final takeaway

If your App Service on Linux starts hitting NFS permission denied errors, focus first on ownership, UID/GID behavior, and squash settings before assuming the files are damaged. For many users, the most effective path is to validate the current ownership model, review the NFS squash setting, and, if needed, migrate data to a share configured with All Squash.


If You're Building AI on Azure, ECS 2026 is Where You Need to Be

Lee_Stott — Wed, 22 Apr 2026 09:17:46 GMT

Let me be direct: there's a lot of noise in the conference calendar. Generic cloud events. Vendor showcases dressed up as technical content. Sessions that look great on paper but leave you with nothing you can actually ship on Monday.

ECS 2026 isn't that.

As someone who will be on stage in Cologne this May, I can tell you that the European Collaboration Summit, combined with the European AI & Cloud Summit and the European BizApps Summit, is one of the few events I've seen where engineers leave with real, production-applicable knowledge.

Three days. Three summits. 3,000+ attendees. One of the largest Microsoft-focused events in Europe, and it keeps getting better.

If you're building AI systems on Azure, designing cloud-native architectures, or trying to figure out how to take your AI experiments to production — this is where the conversation is happening.

What ECS 2026 Actually Is

ECS 2026 runs May 5–7 at Confex in Cologne, Germany. It brings together three co-located summits under one roof:

European Collaboration Summit — Microsoft 365, Teams, Copilot, and governance
European AI & Cloud Summit — Azure architecture, AI agents, cloud security, responsible AI
European BizApps Summit — Power Platform, Microsoft Fabric, Dynamics

For Azure engineers and AI developers, the European AI & Cloud Summit is your primary destination. But don't ignore the overlap: some of the most interesting AI conversations happen at the intersection of collaboration tooling and cloud infrastructure.

The scale matters here: 3,000+ attendees, 100+ sessions, multiple deep-dive tracks, and a speaker lineup that includes Microsoft executives, Regional Directors, and MVPs who have built, broken, and rebuilt production systems.

The Azure + AI Track - What's Actually On the Agenda

The AI & Cloud Summit agenda is built around real technical depth. Not "intro to AI" content, but actual architecture decisions, patterns that work, and lessons from things that didn't.

Here's what you can expect:

AI Agents and Agentic Systems
This is where the energy is right now, and ECS is leaning in. Expect sessions covering how to design agent workflows, chain reasoning steps, handle memory and state, and integrate with Azure AI services. Marco Casalaina, VP of Products for Azure AI at Microsoft, is speaking. If you want to understand the direction of the Azure AI platform from the people building it, this is a direct line.

Azure Architecture at Scale
Cloud-native patterns, microservices, containers, and the architectural decisions that determine whether your system holds up under real load. These sessions go beyond theory: you'll hear from engineers who've shipped these designs at enterprise scale.

Observability, DevOps, and Production AI
Getting AI to production is harder than the demos suggest. Sessions here cover monitoring AI systems, integrating LLMs into CI/CD pipelines, and building the operational practices that keep AI in production reliable and governable.

Cloud Security and Compliance
Security isn't optional when you're putting AI in front of users or connecting it to enterprise data. Tracks cover identity, access patterns, responsible AI governance, and how to design systems that satisfy compliance requirements without becoming unmaintainable.

Pre-Conference Deep Dives

One underrated part of ECS: the pre-conference workshops. These are extended, hands-on sessions, typically 3–6 hours, that let you go deep on a single topic with an expert. Think of them as intensive short courses where you can actually work through the material, not just watch slides.

If you're newer to a particular area of Azure AI, or you want to build fluency in a specific pattern before the main conference sessions, these are worth the early travel.

The Speaker Quality Is Different Here

The ECS speaker roster includes Microsoft executives, Microsoft MVPs, and Regional Directors, people who have real accountability for the products and patterns they're presenting. You'll hear from over 20 Microsoft speakers:

Marco Casalaina — VP of Products, Azure AI at Microsoft
Adam Harmetz — VP of Product at Microsoft, Enterprise Agent

And dozens of MVPs and Regional Directors who are in the field every day, solving the same problems you are. These aren't keynote-only speakers — they're in the session rooms, at the hallway track, available for real conversations.

The Hallway Track Is Not a Cliché

I know "networking" sounds like a corporate afterthought. At ECS it genuinely isn't.

When you put 3,000 practitioners (engineers, architects, DevOps leads, security specialists) in one venue for three days, the conversations between sessions are often more valuable than the sessions themselves. You get candid answers to "how are you actually handling X in production?" that you won't find in documentation.

The European Microsoft community is tight-knit and collaborative. ECS is where that community concentrates.

Why This Matters Right Now

We're in a period where AI development is moving fast but the engineering discipline around it is still maturing. Most teams are figuring out:

How to move from AI prototype to production system
How to instrument and observe AI behaviour reliably
How to design agent systems that don't become unmaintainable
How to satisfy security and compliance requirements in AI-integrated architectures

ECS 2026 is one of the few places where you can get direct answers to these questions from people who've solved them — not theoretically, but in production, on Azure, in the last 12 months.

If you go, you'll come back with practical patterns you can apply immediately. That's the bar I hold events to. ECS consistently clears it.

Register and Explore the Agenda

Register for ECS 2026: ecs.events
Explore the AI & Cloud Summit agenda: cloudsummit.eu/en/agenda
Dates: May 5–7, 2026 | Location: Confex, Cologne, Germany

Early registration is worth it: the pre-conference workshops fill up.

And if you're coming, find me. I'll be the one talking too much about AI agents and Azure deployments.

See you in Cologne.

Getting Started with the SUSE Multi-Linux Manager MCP Server and GitHub Copilot

abbottkarl — Wed, 22 Apr 2026 07:00:00 GMT

Enterprise Linux environments are heterogeneous. That's not a problem statement - it's just the truth. SUSE, Ubuntu, RHEL, and their downstream variants coexist in every data center I've seen, and increasingly across Azure subscriptions too. AI assistants like GitHub Copilot can already connect to these machines, run commands, troubleshoot issues, and apply patches, one box at a time. But if you're managing a fleet of hundreds or thousands of systems across distributions, the gap isn't whether AI can touch your infrastructure. It's whether it can work through the centralized management tooling where your inventory, patch orchestration, RBAC, and audit trails actually live.

SUSE just took a meaningful step to close that gap. Their Multi-Linux Manager MCP Server, built on the open source Uyuni project, gives AI agents like GitHub Copilot a structured, authenticated interface to your existing management platform. Not the individual boxes. The management plane where your centralized inventory, CVE auditing, cross-distribution patch scheduling, and RBAC already live. Not a rip-and-replace. Not a new console to learn. A way to talk to the infrastructure management you've already built.

This post walks through what the MCP server does, why it matters in an Azure context, and how to get it wired up with GitHub Copilot so you can start working with it today.

The Model Context Protocol (MCP) is an open standard that defines how AI models connect to external tools and data sources. Think of it as the USB-C of AI integrations - a common interface so that different clients (GitHub Copilot, Claude Desktop, Gemini CLI) can talk to different servers (Azure, SUSE, databases, APIs) without bespoke glue code for every combination.

Why This Matters for Azure Customers

If you are running Linux workloads on Azure - whether for SAP, HPC, or traditional enterprise applications - the Multi-Linux Manager MCP server provides a conversational interface for your infrastructure without requiring you to change tools.

Management-plane depth, not just infrastructure inventory. Azure and Copilot already give you fleet-wide visibility into your VMs. The SUSE MCP server adds the layer underneath: patch scheduling state, erratum tracking, cross-distribution CVE audits, and system group management that lives in your Multi-Linux Manager instance.
A single pane of glass. Pair this with the Azure MCP Server and your AI assistant can move between Azure resource operations and OS-level fleet management in one conversation, across the distributions Multi-Linux Manager supports, without switching tools or contexts.

What You Can Actually Do With It

The MCP server exposes over 20 practical tools for day-to-day infrastructure operations. Instead of relying on a generic knowledge base, Copilot queries your actual infrastructure.

Inventory and Inspection: You can list active systems across your fleet or pull detailed event histories for specific machines.
Patch Management and CVE Response: Copilot can rapidly audit all systems for pending updates or identify specific machines vulnerable to a new CVE.
Operational Actions: You can list system groups, register new systems, or schedule server reboots.

The Security Model: Human-in-the-Loop

Letting an AI agent touch production infrastructure raises the obvious question: what keeps it from doing something destructive? SUSE has been deliberate about this by designing the MCP server with a default "human-in-the-loop" security model.

Read-Only by Default: The server ships with all write actions disabled (UYUNI_MCP_WRITE_TOOLS_ENABLED=false).
Explicit Confirmation: If you enable write tools, Copilot is required to ask for your explicit confirmation before executing state-changing actions like applying patches or scheduling reboots.
Enterprise Authentication: The server supports OAuth 2.0, ensuring the AI agent authenticates through your identity provider.
Layered Governance: Combined with Multi-Linux Manager’s role-based access control (RBAC) and the principle of least privilege for the service account, you get layered governance without bolting on a separate approval system.
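
To make the read-only default and the confirmation step concrete, here is a minimal Python sketch of the policy described above. It is not SUSE's implementation: the function names are hypothetical, and the way the real server parses UYUNI_MCP_WRITE_TOOLS_ENABLED is an assumption.

```python
import os

def write_tools_enabled(env=None):
    """True only when UYUNI_MCP_WRITE_TOOLS_ENABLED is explicitly 'true' (assumed parsing)."""
    env = os.environ if env is None else env
    return env.get("UYUNI_MCP_WRITE_TOOLS_ENABLED", "false").strip().lower() == "true"

def may_execute(is_write_action, confirmed_by_human, env=None):
    """Hypothetical gate: reads always pass; writes need the flag AND a human confirmation."""
    if not is_write_action:
        return True
    return write_tools_enabled(env) and confirmed_by_human

# Read-only query with default settings: allowed.
print(may_execute(False, False, env={}))
# Write action with default settings: blocked even if the user says yes.
print(may_execute(True, True, env={}))
```

The point of the two-condition check is that neither an administrator's flag nor a user's "yes" is enough on its own to change state.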

AI-assisted operations that bypass human judgment won't get adopted in enterprises. AI-assisted operations that make the human faster while keeping them in control, that's the model that actually ships.

Architecture on Azure

Here's the topology we're working with:

SUSE Multi-Linux Manager - Running on an Azure VM, managing your Linux fleet across distributions. This is the control plane for your systems - inventory, patching, configuration. Available on Azure Marketplace.
MCP Server - Runs as a container (Docker/Podman), either locally alongside your dev environment or as a standalone HTTP service. The MCP Server container is available in SUSE Registry and is backed by a secure, trusted software supply chain.
GitHub Copilot - In VS Code or the CLI. Configured to use the MCP server as a tool source. Sends natural language requests, receives structured responses from your infrastructure.
Your Linux fleet on Azure - Whatever Multi-Linux Manager manages for you. The MCP server doesn't care about the distribution mix; that's the whole point of Multi-Linux Manager.

Getting Started: Step by Step

Prerequisites

A running SUSE Multi-Linux Manager instance managing your Linux estate
Docker or Podman installed on your workstation (for local deployment) or network access to a remote MCP server instance
GitHub Copilot with agent mode enabled (VS Code or CLI)

Step 1: Stand up the MCP Server

For local deployment, pull the container and point it at your Multi-Linux Manager instance following the project documentation. For remote/team deployments, your administrator can run the server as a standalone HTTP service with OAuth 2.0.

Step 2: Configure GitHub Copilot

In VS Code, open the Command Palette and type GitHub Copilot: Configure MCP Servers. Add your server to the config:

{
  "mcpServers": {
    "suse-multi-linux-manager": {
      "type": "http",
      "url": "https://your-mcp-server.example.com/mcp"
    }
  }
}
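
Before pointing Copilot at the server, it can help to sanity-check the config. The Python sketch below is an optional helper, not part of the SUSE tooling: check_mcp_config is a hypothetical name, and the rules it enforces (type "http", absolute https URL) are assumptions drawn from the example above.

```python
import json
from urllib.parse import urlparse

CONFIG_TEXT = """
{
  "mcpServers": {
    "suse-multi-linux-manager": {
      "type": "http",
      "url": "https://your-mcp-server.example.com/mcp"
    }
  }
}
"""

def check_mcp_config(text):
    """Return a list of problems found in the config; an empty list means it looks OK."""
    problems = []
    servers = json.loads(text).get("mcpServers", {})
    if not servers:
        problems.append("no servers defined under 'mcpServers'")
    for name, entry in servers.items():
        if entry.get("type") != "http":
            problems.append(f"{name}: expected type 'http'")
        url = urlparse(entry.get("url", ""))
        if url.scheme != "https" or not url.netloc:
            problems.append(f"{name}: url should be an absolute https URL")
    return problems

print(check_mcp_config(CONFIG_TEXT))
```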

Step 3: Verify the Connection

Open GitHub Copilot and try a read-only query:

"List all active systems managed by my SUSE Multi-Linux Manager."

If your fleet inventory appears, you're connected.

Step 4: Start Operating

"Are any of my systems affected by CVE-2026-XXXX?"

"Show me all systems that have pending but unscheduled security patches."

"Which systems need a reboot?"

Getting Involved

The SUSE Multi-Linux Manager MCP server is open source under the Apache 2.0 license, built on the Uyuni project. The current v0.5 is a tech preview. Feedback goes to uyuni-project/uyuni#10562, bugs to GitHub Issues.

The gap in AI-assisted Linux operations was never whether AI could reach your infrastructure. It was whether it could work through the management tooling where your fleet-scale decisions actually get made. SUSE built the bridge to that layer. GitHub Copilot is the conversational interface. Your fleet is already there. Go connect them.

Dynamic hostpool sessions not updating

OnzenHans — Wed, 22 Apr 2026 07:28:04 GMT

We have created a dynamic host pool in a test environment. We see that new hosts are being created based on the scaling plan.

However, these are no longer being deleted. When we look at the status, we see that there are no active sessions, but when we zoom in on the session hosts, it shows that there is a session on two of the three hosts. The latter is incorrect, but it is likely the reason why scaling down is not taking place.

Does anyone recognize this? Is there possibly a solution for this?

Small addition: If I log in with a user and then log out properly, the current sessions in the host pool overview are updated quickly. However, if I then go to Manage, Session Hosts, the total sessions on that host remain at 1.

Only when I put the host in drain mode are the actual sessions updated.

Ingest at Scale, Securely — Azure Monitor pipeline Is Now Generally Available

XemaPathak — Tue, 21 Apr 2026 20:51:13 GMT

Today, we're thrilled to announce the general availability of Azure Monitor pipeline — a telemetry pipeline built for secure, high-scale ingestion across any environment. But the best way to understand what makes it powerful isn't to start with features. It's to start with the problems that kept showing up, over and over, in our conversations with customers. So, let's dig in...

Chances are, this sounds a lot like your environment

Imagine a large enterprise rolling out Microsoft Sentinel as their SIEM.

They have sites across regions, a mix of on‑premises and cloud environments, and security telemetry streaming in from firewalls, network devices, and Linux servers—100,000 to 1 million events per second in some locations. Traditional forwarders buckle under the load, drop events during network blips, and ship everything – signal and noise – straight into Sentinel. The result: skyrocketing ingestion costs, degraded detections, and a brittle forwarding infrastructure that demands constant babysitting.

If you're managing environments like these, these questions are probably top of mind:

How do I securely ingest telemetry—without opening hundreds of risky endpoints?
How do I reduce ingestion costs when telemetry spikes across thousands of sources simultaneously?
How do I centrally standardize logs across sites and device types before they ever reach Azure?
What happens to telemetry from an entire location when connectivity drops?
And how do I do all of this consistently, at massive scale, and centrally across environments instead of configuring each host individually?

These aren't edge cases. For many teams, getting data into the system itself is the hardest part of observability, and by the time telemetry reaches Azure Monitor or Sentinel, it's already too late to fix these problems.

Customers need control before the data hits the cloud.

What is Azure Monitor pipeline (and why it’s different)?

Azure Monitor pipeline provides a centralized control point for telemetry ingestion and transformation, designed specifically for secure, high‑throughput, enterprise‑scale scenarios. It's built on open-source technologies from the OpenTelemetry ecosystem and includes the components needed to receive telemetry from local clients, process that telemetry, and forward it to Azure Monitor.

It’s not another agent. And NO, you do not need to install it on all the resources…

Agents such as Azure Monitor agent are great for collecting telemetry from individual machines and services. Azure Monitor pipeline solves a different problem:

“How do I ingest telemetry from across my environment through a centralized pipeline – instead of configuring each host – while maintaining control over reliability, security, and ingestion cost?”

With Azure Monitor pipeline, you can:

Ensure logs land directly in Azure‑native schemas – automatic schematization into tables such as Syslog and CommonSecurityLog
Prevent data loss during intermittent connectivity across sites – local buffering in persistent storage with automated backfill
Reduce ingestion costs before data reaches the cloud – centralized filtering, aggregation, and transformation
Ingest telemetry at sustained high volumes in the range of hundreds of thousands of events per second – horizontally scalable pipeline architecture
Secure telemetry ingestion without managing certificates on each host individually – centralized TLS/mTLS with automated certificate provisioning and zero‑downtime rotation
Maintain visibility into ingestion infrastructure health – pipeline performance and health monitoring
Plan deployments confidently at large scale – infrastructure sizing guidance for expected telemetry volume

And all of this is fully supported and production‑ready in GA. Learn more.

So, let's talk a little bit about these in detail!

Tired of broken detections because logs don't match your table schema? - Automatic schematization (a customer favorite!)

A consistent theme from preview customers was how painful it is to deal with log formats.

Azure Monitor pipeline is the only solution that automatically shapes and schematizes data, so it lands directly in standard Azure tables such as Syslog and CommonSecurityLog. Learn more.

That means:

No custom parsing pipelines downstream
No broken detections due to schema drift
Faster time to value for security teams

This happens before data reaches the cloud – right where it matters most.
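
To give a feel for what "schematizing" a raw log line means, here is a small Python sketch that splits an RFC 3164-style syslog line into named fields. It only illustrates the idea and is not how Azure Monitor pipeline is implemented; the output field names are hypothetical and do not claim to match the Syslog table's columns.

```python
import re

# RFC 3164-style layout: "<PRI>TIMESTAMP HOST TAG: MESSAGE"
SYSLOG_RE = re.compile(
    r"^<(?P<pri>\d{1,3})>(?P<timestamp>\w{3}\s+\d{1,2} \d{2}:\d{2}:\d{2}) "
    r"(?P<host>\S+) (?P<tag>[^:\s]+): (?P<message>.*)$"
)

def parse_syslog(line):
    """Return a dict of named fields, or None when the line does not match the layout."""
    m = SYSLOG_RE.match(line)
    if m is None:
        return None
    pri = int(m.group("pri"))
    return {
        "facility": pri // 8,   # PRI encodes facility * 8 + severity
        "severity": pri % 8,
        "timestamp": m.group("timestamp"),
        "host": m.group("host"),
        "process": m.group("tag"),
        "message": m.group("message"),
    }

record = parse_syslog("<34>Oct 11 22:14:15 mymachine su: 'su root' failed for lonvick")
print(record)
```

Once every line is in a fixed shape like this, downstream detections can rely on field names instead of re-parsing free text.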

What happens to my telemetry when the network goes down? - Local buffering in persistent storage and automated backfill

Networks fail. Maintenance happens. Sites go offline.

Azure Monitor pipeline is built for this reality. It buffers telemetry locally in your configured persistent storage during network interruptions and automatically backfills data when connectivity is restored. Learn more.

The result:

No gaps in security visibility
No manual replays
Confidence that critical telemetry isn’t lost
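
The buffer-and-backfill behaviour can be pictured with a short Python sketch. This is a simplified illustration under stated assumptions, not the pipeline's actual mechanism: BufferedForwarder and send are hypothetical names, and an in-memory queue stands in for the persistent storage the service uses.

```python
from collections import deque

class BufferedForwarder:
    """Queues events while the uplink is down and replays them in order once it returns."""

    def __init__(self, send):
        self._send = send         # callable that raises ConnectionError when offline
        self._pending = deque()   # in-memory stand-in for persistent storage

    def forward(self, event):
        self._pending.append(event)
        self.flush()

    def flush(self):
        """Drain the backlog oldest-first; stop at the first failure and keep the rest."""
        while self._pending:
            try:
                self._send(self._pending[0])
            except ConnectionError:
                return False
            self._pending.popleft()
        return True

delivered = []
link = {"up": False}

def send(event):
    if not link["up"]:
        raise ConnectionError("uplink down")
    delivered.append(event)

forwarder = BufferedForwarder(send)
forwarder.forward("event-1")   # buffered, link is down
forwarder.forward("event-2")   # buffered
link["up"] = True
forwarder.forward("event-3")   # backlog drains in order, then the new event
print(delivered)
```

An event is removed from the queue only after a successful send, which is what prevents gaps when a send fails partway through a backlog.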

How do I reduce ingestion costs without sacrificing signal quality? - Filter and aggregate at the edge

Nobody likes to pay for the data that they do not need...

With Azure Monitor pipeline, customers can filter, aggregate, and shape the telemetry at the edge, sending only high‑value data to Azure. Learn more.

This helps teams:

Reduce ingestion costs
Improve detection quality
Keep cloud analytics focused on signal, not volume

Cost optimization and signal quality are no longer trade‑offs – you get both.
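
As a rough illustration of filtering and aggregating before ingestion, the Python sketch below drops low-severity events and collapses duplicates into a count. The function, field names, and threshold are hypothetical; in Azure Monitor pipeline itself this is done through the pipeline's transformation configuration, not custom code like this.

```python
from collections import Counter

def reduce_events(events, max_severity=4):
    """Keep events at or above a severity threshold and collapse duplicates into counts.

    Syslog severities run from 0 (emergency) to 7 (debug), so lower numbers are more severe.
    """
    kept = [e for e in events if e["severity"] <= max_severity]
    counts = Counter((e["host"], e["severity"], e["message"]) for e in kept)
    return [
        {"host": h, "severity": s, "message": m, "count": n}
        for (h, s, m), n in counts.items()
    ]

# Hypothetical sample: two identical errors and one debug line.
sample = [
    {"host": "fw-01", "severity": 3, "message": "link down"},
    {"host": "fw-01", "severity": 3, "message": "link down"},
    {"host": "fw-01", "severity": 7, "message": "heartbeat"},
]
reduced = reduce_events(sample)
print(reduced)
```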

How do I keep up when telemetry volumes spike to hundreds of thousands of events per second? - Scaling

One of the biggest pain points we hear is scale.

Azure Monitor pipeline is designed for sustained high throughput ingestion, scaling horizontally and vertically to handle hundreds of thousands to millions of events per second. Learn more.

This isn’t about theoretical limits; it’s about handling the real-world extremes that break traditional forwarders.

How do I send telemetry in a secure manner? - Secure ingestion with TLS and mTLS

Security teams consistently tell us that plain TCP ingestion just isn’t acceptable – especially in regulated environments.

Azure Monitor pipeline addresses this head‑on by providing TLS‑secured ingestion endpoints with mutual authentication, ensuring telemetry is encrypted in transit and accepted only from trusted sources. Learn more.

The result:

Secure ingestion at the boundary by encrypting data in transit using TLS with automated certificate provisioning and zero downtime rotation.
With mTLS enabled, clients and Azure Monitor pipeline endpoints validate each other before ingestion, and the default experience makes it easy to set up.
Do you have your own PKI and certificate management systems? - Feel free to bring your own certificates to enable secure ingestion.
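
If mutual TLS is new to you, this generic Python sketch shows what "both sides validate each other" means on the receiving end. It uses Python's standard ssl module purely as an illustration; it is not how Azure Monitor pipeline is configured, and the function name and file parameters are hypothetical.

```python
import ssl

def build_mtls_server_context(cert_file=None, key_file=None, client_ca_file=None):
    """Build a server-side TLS context that refuses clients without a trusted certificate."""
    ctx = ssl.create_default_context(ssl.Purpose.CLIENT_AUTH)
    ctx.minimum_version = ssl.TLSVersion.TLSv1_2   # refuse older protocol versions
    ctx.verify_mode = ssl.CERT_REQUIRED            # the "mutual" part: client must present a cert
    if cert_file:
        ctx.load_cert_chain(certfile=cert_file, keyfile=key_file)   # server identity
    if client_ca_file:
        ctx.load_verify_locations(cafile=client_ca_file)            # CA that signs client certs
    return ctx

ctx = build_mtls_server_context()
print(ctx.verify_mode, ctx.minimum_version)
```

With plain TLS only the server proves its identity; setting the verify mode to CERT_REQUIRED on the server side is what turns it into mTLS.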

If the pipeline is this critical — how do I know it's healthy?

One thing we heard loud and clear during preview:

“If this pipeline is critical, I need to see how it’s doing.”

Azure Monitor pipeline now exposes health and performance signals, so it’s no longer a black box. Learn more.

Customers can answer questions like:

Is my pipeline receiving, processing, and sending telemetry?
What’s the CPU and memory usage of each pipeline instance?
Why is a pipeline unhealthy—or down?

Observability for observability felt like the right bar to meet.

How do I plan infrastructure without over- or under-provisioning?

Planning pipeline infrastructure shouldn't be a guessing game – and we heard this loud and clear during preview.

GA includes clear sizing guidance to help you plan the right infrastructure based on your expected telemetry volume and workload characteristics. Not rigid formulas, but practical starting points that give you a confident baseline so you can design intentionally, deploy faster, and avoid costly over- or under-provisioning. Learn more.

Alright, these are a bunch of exciting features. How much do I need to pay for them?

Azure Monitor pipeline is included at no additional cost for ingesting telemetry into Azure Monitor and Microsoft Sentinel.

With general availability, Azure Monitor pipeline is production-ready so you can run the most demanding ingestion scenarios with confidence. If you’re already using it in preview, welcome to GA. If you’re just getting started, there’s never been a better time to dive in.

As always, your feedback is what drives this forward. Drop a comment below, reach out directly, or share what you're building. We'd love to hear from you.

Announcing public preview of redundant TCP support for RDP Multipath for Azure Virtual Desktop

Rinku_Dalwani — Tue, 21 Apr 2026 16:00:28 GMT

Reliable connectivity is essential for ensuring consistent productivity in Azure Virtual Desktop (AVD) environments. Network variability—whether due to packet loss, NAT misconfiguration, UDP‑restricted networks, or restrictive enterprise network policies—continues to be one of the most common causes of session interruptions across enterprise virtual desktop deployments.

To improve connection resiliency across a broader set of network conditions, we’re excited to introduce redundant TCP transport paths for RDP Multipath, now available in public preview for Azure Virtual Desktop.

This builds upon the existing RDP Multipath capability that continuously evaluates multiple network paths and dynamically switches to the most reliable path at runtime—without requiring changes from IT administrators or users.

How does this feature work

RDP Multipath establishes multiple network paths between the client and session host based on available network routes and real-time network conditions. This allows Azure Virtual Desktop to continuously evaluate path health and dynamically select the most reliable transport during a session.

In its initial phase, RDP Multipath focused on UDP-based RDP Shortpath connections using STUN (Simple Traversal Underneath NAT) and TURN (Traversal Using Relays around NAT). This enabled sessions to transition between redundant UDP paths if degradation or failure was detected, improving reliability and performance.

When UDP-based connectivity is available, it remains the preferred transport due to its performance and reliability advantages. Multipath continues to maintain multiple UDP paths as the primary active transport, enabling efficient failover across available routes.

Enhanced resiliency with redundant UDP and TCP paths

With this update, Azure Virtual Desktop expands Multipath capabilities by introducing support for redundant TCP standby transport paths alongside existing UDP paths.

For customers already using Multipath over UDP, this adds an additional layer of resiliency through alternate TCP paths. For environments that previously relied on a single TCP connection, this update enables multiple TCP paths—helping reduce the impact of transient network issues and path instability.

If the active transport path becomes unavailable or degraded, Azure Virtual Desktop automatically switches to the next best available UDP or TCP path. This helps maintain session continuity without requiring user reconnection.

If all transport paths are temporarily disrupted — such as during a local network failure or ISP outage—the session automatically reconnects once connectivity is restored.
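
The failover behaviour can be summarised in a few lines of Python. This is a simplified sketch, not the actual RDP Multipath algorithm, which the post does not describe in detail: the path records, the loss metric, and the "prefer UDP, then lowest loss" rule are assumptions made for illustration.

```python
def pick_path(paths):
    """Choose a healthy path, preferring UDP over TCP and then the lowest packet loss.

    Returns None when every path is down, which corresponds to the reconnect case.
    """
    healthy = [p for p in paths if p["healthy"]]
    if not healthy:
        return None
    return min(healthy, key=lambda p: (p["transport"] != "udp", p["loss"]))

# Hypothetical snapshot: the UDP path has failed, two TCP standby paths remain.
paths = [
    {"name": "udp-stun", "transport": "udp", "healthy": False, "loss": 0.0},
    {"name": "tcp-a", "transport": "tcp", "healthy": True, "loss": 0.02},
    {"name": "tcp-b", "transport": "tcp", "healthy": True, "loss": 0.01},
]
chosen = pick_path(paths)
print(chosen["name"])
```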

Improved reliability for TCP-only (UDP-restricted) environments

In environments where UDP connectivity is unavailable or restricted, sessions rely entirely on TCP-based Reverse Connect transport. Previously, these environments typically operated with a single active TCP connection between the client and session host, making them more susceptible to transient network degradation.

With this update, Azure Virtual Desktop can now establish multiple standby TCP transport paths—even in TCP-only scenarios. This allows sessions to dynamically transition between available TCP routes if the active path becomes degraded or fails.

As a result, customers benefit from improved session continuity and more consistent connectivity, even in environments with restrictive network configurations or where UDP traffic is blocked.

How to enable this feature

For public preview, you can test the feature by tagging your host pool to the validation ring. By default, this feature is enabled for everyone in the validation pool, providing seamless integration and enhanced connectivity without requiring any changes from IT departments or end users. Redundant TCP transport paths are currently supported only on Windows devices using Windows App on Windows client, version 2.0.1069.0 or later.

How to opt out of this feature

If you wish to disable the feature, you can opt out the host pool from the validation ring. This self-help option allows you to revert to the previous configuration if necessary.

Learn more

To learn more about the feature please check here 

Stay up to date! Bookmark the Azure Virtual Desktop Tech Community.

Claim your IQ Series: Foundry IQ badge

aycabas — Tue, 21 Apr 2026 12:51:52 GMT

The IQ Series kicked off with three Foundry IQ episodes, each paired with a hands-on cookbook. If you've worked through all three or you're planning to, there's now a digital badge waiting for you to claim!

What the badge represents

The IQ Series: Foundry IQ badge recognizes developers who've completed the full Foundry IQ curriculum end-to-end: not just watched the episodes, but deployed the Azure resources, run every notebook, and built working knowledge bases against live data.

Earners have:

Deployed AI Search, Azure OpenAI, a Foundry project, and Azure Blob Storage with seeded sample data
Connected structured and unstructured sources into Foundry IQ
Built and queried multi-source AI knowledge bases
Grounded agent responses in permission-aware enterprise knowledge

Badges are issued by the Global AI Community, so you'll want an account there before you submit.

What the three episodes cover

Episode 1 — Unlocking Knowledge for Your Agents. Introduces Foundry IQ and the core ideas behind it. The episode explains how AI agents work with knowledge and walks through the main components of Foundry IQ that support knowledge-driven applications.

Episode 2 — Building the Data Pipeline with Knowledge Sources. Focuses on Knowledge Sources and how different types of content flow into Foundry IQ across SharePoint, Fabric, OneLake, Azure Blob Storage, Azure AI Search, and the web.

Episode 3 — Querying the Multi-Source AI Knowledge Bases. Dives into Knowledge Bases and how multiple knowledge sources can be organized behind a single endpoint. The episode demonstrates how AI systems query across these sources and synthesize information to answer complex questions.

Each episode is paired with a hands-on cookbook, and all three cookbooks reuse the same Azure deployment, so you set up once and build across all three.

How to claim the badge

Four steps, in order:

Fork the IQ Series repo and work through all three episode cookbooks in your fork. Commit your notebooks with cell outputs saved! That's the proof of completion.
Capture a final output screenshot for each episode. Your GitHub username or Azure resource name needs to be visible in the screenshot.
Submit a badge request issue. The template walks you through fork URLs, screenshots, and one brief technical takeaway per episode.
Complete the badge form. This step is required. Without the form, we can't issue the badge.

Why this badge is worth your time

The IQ Series recognizes your hands-on learning with real infrastructure, real indexed data, real agents and queries. If you're working on enterprise AI (grounding, retrieval, knowledge-aware agents), this is a concrete artifact that says: I've built this, end to end, on the actual platform.

Work IQ and Fabric IQ are coming next, and each phase will have its own badge. Foundry IQ is your head start on the full IQ Series.

👉 Start with Episodes or jump straight to the cookbooks if you prefer to learn by doing.

Questions along the way? Create an issue in the repo or drop into our Discord. The Foundry IQ team and community are there to help.

Leveraging Azure Resource Graph Queries for Azure Redis Configuration

Soma_Sekhara_Raju — Tue, 21 Apr 2026 07:57:08 GMT

Scenario

Many times, we receive requests for a quick and reliable way to review Azure Redis configurations such as SKU tiers, Redis versions, TLS settings, Microsoft Entra authentication status, and public network exposure. Traditionally, these checks are performed using PowerShell, Azure CLI, or REST APIs. While effective, these methods can be time-consuming due to script development and module installation. Azure Resource Graph Explorer offers a faster and more scalable alternative by enabling customers to query Redis configurations directly using Kusto Query Language (KQL). This approach eliminates the need to create and maintain scripts while providing centralized visibility across multiple subscriptions.

Azure Resource Graph Explorer

Azure Resource Graph Explorer allows you to run KQL queries directly from the Azure portal to inspect Redis configurations across subscriptions at scale. All queries in this document use the Resources table, filter on Redis resource types, and retrieve configuration properties from the Redis resource schema.

The queries target the following resource types:

microsoft.cache/redis

microsoft.cache/redisenterprise

How to Open Azure Resource Graph Explorer (Quick Steps)

Sign in to the Azure Portal
In the global search bar, search for Resource Graph Explorer
Open Resource Graph Explorer
Paste the KQL query into the query window
Click Run query to view results

The following queries can be used to quickly analyse and validate Azure Cache for Redis configurations across subscriptions:

1. Redis SKU Information

Find all Redis instances and identify their SKU tier.

Resources
| where type in~ ("microsoft.cache/redis", "microsoft.cache/redisenterprise")
| extend SKU = coalesce(tostring(sku.name), tostring(properties.sku.name))
| project name, resourceGroup, location, SKU

Explanation

This query retrieves all Azure Cache for Redis instances and identifies their SKU tier (Basic, Standard, Premium, Enterprise and AMR).
The SKU information helps understand performance capabilities, high availability features, and scaling options configured for each Redis instance.

2. Redis Version Information (OSS Cache Only)

Identify the Redis version in use for Azure Cache for Redis (Basic, Standard, Premium).

Resources
| where type =~ "microsoft.cache/redis"
| project name, resourceGroup, location, SKU = coalesce(tostring(sku.name), tostring(properties.sku.name)), RedisVersion = properties.redisVersion

Explanation:
This query lists Redis instances along with their deployed Redis version.
Identifying older Redis versions helps prioritize upgrades, maintain supportability, and ensure compatibility with newer features and security enhancements.

Note: This query applies only to OSS Azure Cache for Redis (Basic, Standard, and Premium tiers). Azure Managed Redis (AMR) is not included because these properties are not exposed in Azure Resource Graph for AMR.

3. Minimum TLS Version for Redis

List Redis instances and configured minimum TLS version.

Resources
| where type in~ ("microsoft.cache/redis", "microsoft.cache/redisenterprise")
| project name, resourceGroup, location, MinimumTLS = properties.minimumTlsVersion

Explanation:
This query identifies the minimum TLS version configured for Redis cache.
Using TLS 1.2 or higher is recommended to meet modern security compliance and encryption standards.
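
If you export the results of the query above (for example as JSON), a few lines of Python can flag the caches that need attention. This is a minimal sketch: the function name and sample rows are hypothetical, and it assumes MinimumTLS arrives as a dotted string such as "1.2". Missing or unparseable values are flagged for manual review, not treated as compliant.

```python
def needs_tls_review(rows):
    """Return names of resources whose MinimumTLS is missing, unparseable, or below 1.2."""
    flagged = []
    for row in rows:
        value = str(row.get("MinimumTLS") or "")
        try:
            version = tuple(int(part) for part in value.split("."))
        except ValueError:
            version = ()   # empty tuple sorts below (1, 2), so the row is flagged
        if version < (1, 2):
            flagged.append(row["name"])
    return flagged

# Hypothetical rows shaped like the query's projected columns.
rows = [
    {"name": "cache-prod", "MinimumTLS": "1.2"},
    {"name": "cache-legacy", "MinimumTLS": "1.0"},
    {"name": "cache-unknown", "MinimumTLS": None},
]
flagged = needs_tls_review(rows)
print(flagged)
```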

4. Redis Instances with Public Network Access Enabled

Identify Redis instances that allow public network access.

Resources
| where type in~ ("microsoft.cache/redis", "microsoft.cache/redisenterprise")
| project name, resourceGroup, location, PublicNetworkAccess = properties.publicNetworkAccess

Explanation:
This query checks whether Redis instances are accessible over public internet.

Possible values include:

Enabled — Redis accessible via public endpoint
Disabled — Redis accessible only via private endpoint / virtual network

5. Microsoft Entra Authentication Enabled (OSS Cache Only)

Check Microsoft Entra ID authentication and key-based authentication for Azure Cache for Redis (Basic, Standard, Premium).

Resources
| where type =~ "microsoft.cache/redis"
| extend EntraAuthEnabled = tostring(properties.redisConfiguration["aad-enabled"])
| extend KeyBasedAuthDisabled = tostring(properties.disableAccessKeyAuthentication)
| project name, resourceGroup, location, EntraAuthEnabled, KeyBasedAuthDisabled

Explanation:
This query reviews authentication and access security settings for Azure Cache for Redis (OSS tiers).

Microsoft Entra Authentication – Shows whether Microsoft Entra ID authentication is enabled
- true — Enabled
- false — Disabled
Key-Based Authentication – Shows whether access keys are disabled
- true — Access keys disabled (Recommended)
- false — Access keys enabled
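
The true/false interpretation above can be applied to exported query results with a short Python sketch. The helper names and sample rows are hypothetical. Because the query converts both properties with tostring, the sketch assumes values arrive as strings (possibly empty or with mixed casing) and treats anything other than "true" as not enabled.

```python
def is_true(value):
    """Interpret Resource Graph output, where booleans may arrive as strings or be empty."""
    return str(value).strip().lower() == "true"

def auth_findings(row):
    """Return human-readable findings for one cache; an empty list means the recommended setup."""
    findings = []
    if not is_true(row.get("EntraAuthEnabled")):
        findings.append("Microsoft Entra authentication is not enabled")
    if not is_true(row.get("KeyBasedAuthDisabled")):
        findings.append("access keys are still enabled")
    return findings

# Hypothetical rows shaped like the query's projected columns.
hardened = {"name": "cache-a", "EntraAuthEnabled": "True", "KeyBasedAuthDisabled": "true"}
default = {"name": "cache-b", "EntraAuthEnabled": "", "KeyBasedAuthDisabled": "false"}
print(auth_findings(hardened))
print(auth_findings(default))
```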

Reference

Note that this blog focuses on Azure Cache for Redis configurations; the same approach can be used for other Azure resource types by querying their respective resource schemas in Azure Resource Graph.

Hope this helps!

Leveraging Azure Resource Graph Queries for Azure Storage Configuration

jainsourabh — Tue, 21 Apr 2026 07:46:00 GMT

Scenario

Many times, we receive requests for a quick and reliable way to check which Azure Storage features are enabled across subscriptions—such as SFTP, Hierarchical Namespace (HNS), or default access tiers. For such scenarios, customers can use PowerShell, Azure CLI, or REST APIs; however, these approaches can be time‑consuming due to module setup, frequent updates, and script maintenance. Azure Resource Graph Explorer provides a faster and simpler alternative by allowing customers to directly query storage account configurations at scale using Kusto Query Language (KQL), without the need to write or maintain scripts.

Azure Resource Graph Explorer

Azure Resource Graph Explorer enables you to run KQL queries directly from the Azure Portal to inspect resource configurations across subscriptions at scale. All queries in this blog use the Resources table, filter on the resource type microsoft.storage/storageaccounts, and retrieve specific configuration properties defined in the Microsoft.Storage/storageAccounts resource schema.

How to Open Azure Resource Graph Explorer (Quick Steps)

Sign in to the Azure Portal
In the global search bar, search for Resource Graph Explorer
Open Resource Graph Explorer
Paste the KQL query and click Run query

The following queries can be used to quickly analyse and validate Azure Storage account configurations across subscriptions:

1. Storage Accounts with SFTP Enabled

Find all storage accounts that have Secure File Transfer Protocol (SFTP) turned on.

Resources
| where type =~ "microsoft.storage/storageaccounts"
| where properties.isSftpEnabled == true
| project name, resourceGroup, location

Find all storage accounts that have Secure File Transfer Protocol (SFTP) turned on in a specific subscription.

Resources
| where type =~ "microsoft.storage/storageaccounts" and subscriptionId =~ "XXXXXXXXXXXXXXXXXXXX"
| where properties.isSftpEnabled == true
| project name, resourceGroup, location

Explanation: The isSftpEnabled property is a boolean under properties that, when set to true, enables Secure File Transfer Protocol on the storage account. This query filters for accounts where SFTP is active and returns the account name, resource group, and location.

2. Minimum TLS Version per Storage Account

List each storage account alongside its configured minimum TLS version.

Resources
| where type =~ "microsoft.storage/storageaccounts"
| project StorageAccount = name, resourceGroup, location,
    MinimumTLS = properties.minimumTlsVersion

Explanation: Every storage account exposes a minimumTlsVersion string property that specifies the minimum TLS protocol version permitted for incoming requests.

3. Storage Accounts with Hierarchical Namespace (HNS) Enabled

Find all storage accounts that have Hierarchical Namespace enabled (Azure Data Lake Storage Gen2).

Resources
| where type =~ "microsoft.storage/storageaccounts"
| where properties.isHnsEnabled == true
| project name, resourceGroup, location

Explanation: The isHnsEnabled boolean indicates whether the account has the Hierarchical Namespace feature turned on.

4. Storage Accounts That Do NOT Allow Public Blob Access

Identify storage accounts where anonymous public read access to blobs is disallowed.

Resources
| where type =~ "microsoft.storage/storageaccounts"
| where properties.allowBlobPublicAccess == false
| project name, resourceGroup, location

Explanation: The allowBlobPublicAccess boolean controls whether anonymous public read access to blob data is permitted at the account level.

5. Storage Accounts with NFS 3.0 Support Enabled

Find all storage accounts that have NFS 3.0 protocol support turned on.

Resources
| where type =~ "microsoft.storage/storageaccounts"
| where properties.isNfsV3Enabled == true
| project name, resourceGroup, location

Explanation: The isNfsV3Enabled property is a boolean described in the resource schema as: "NFS 3.0 protocol support enabled if set to true". NFS 3.0 support allows Linux clients to mount Azure Blob Storage using the NFS protocol, which is useful for high-performance computing and large-scale analytics workloads.

6. Storage Accounts with Default Access Tier

Find all storage accounts and check their default access tier (Hot / Cool).

Resources
| where type =~ "microsoft.storage/storageaccounts"
| extend defaultAccessTier = tostring(properties.accessTier)
| project name, resourceGroup, location, kind, skuName = tostring(sku.name), defaultAccessTier

Explanation:
The properties.accessTier property indicates the default access tier configured for the storage account (for supported account kinds).

7. Storage Accounts Open to All Network Traffic (No Firewall Restrictions)

Find storage accounts that are accessible from any network without virtual network or IP-based firewall rules.

Resources
| where type =~ "microsoft.storage/storageaccounts"
| where (properties.publicNetworkAccess == "Enabled"
    or isnull(properties.publicNetworkAccess))
    and properties.networkAcls.defaultAction == "Allow"
| project name, resourceGroup, location

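Explanation:
This query identifies storage accounts that are fully open to public network access: publicNetworkAccess is Enabled (the query also treats an unset value as enabled) and the network ACL defaultAction is Allow, so no virtual network or IP firewall rule restricts traffic. Such accounts may pose security risks and are worth reviewing during audits or compliance checks.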

Kindly note that although this blog focuses on Azure Storage, the same approach can be applied to other Azure resource types by querying their respective resource schemas using Azure Resource Graph.

Hope this helps!

How to Troubleshoot Azure Functions Host Startup Issue

vikasgupta5 — Tue, 21 Apr 2026 07:04:42 GMT

Overview

Azure Functions is a powerful serverless compute service that enables you to run event-driven code without managing infrastructure. When you deploy a Function App, the Azure Functions host is the runtime process responsible for discovering your functions, loading extensions and bindings, connecting to storage, and starting trigger listeners.

A host startup issue occurs when the Functions runtime fails to initialize and cannot reach a healthy Running state. When this happens, you may see one or more of these symptoms:

"Function host is not running" error in the Azure Portal
Functions are not visible in the Functions blade
Triggers stop firing — HTTP functions return 503, timer/queue functions are silent
The portal shows Error state or no response on the host status endpoint
Application Insights logs show repeated startup exceptions followed by restarts
Log Stream shows a restart loop or no output at all

This issue can be frustrating, especially when a deployment appeared to succeed and your code works correctly on your local machine. In this blog, we will explore how the host starts up, what can go wrong, and — most importantly — how to systematically diagnose and resolve startup failures.

Understanding How the Host Starts Up

Before diving into troubleshooting, it is important to understand the startup sequence. The Functions host executes the following steps each time the runtime initializes:

Host Startup Sequence

ASP.NET Core Startup
→ Register WebHost services (DI, secrets, diagnostics, middleware)
→ WebJobsScriptHostService.StartAsync()
→ Check file system (run-from-package validation)
→ Build inner ScriptHost
→ ScriptHost.InitializeAsync()
→ PreInitialize (validate settings, file system)
→ Load function metadata (function.json / decorators)
→ Load extensions and bindings (extension bundles / NuGet)
→ Create function descriptors and register triggers
→ Start trigger listeners
→ State = Running ✓

Complete Source Code: Azure/azure-functions-host

If any step in this sequence fails, the host enters an Error state and attempts to restart with exponential backoff (starting at 1 second, up to 2 minutes between attempts). After repeated failures, the platform may report an application-level failure.
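
To picture the restart cadence described above, here is a minimal sketch. It assumes the delay doubles on every attempt, which the description does not state; only the 1-second start and the 2-minute cap come from the text. `restart_delays` is a hypothetical helper and not part of the Functions host.

```python
from typing import List


def restart_delays(attempts: int, initial: float = 1.0, cap: float = 120.0,
                   factor: float = 2.0) -> List[float]:
    """Return the wait in seconds before each restart attempt.

    Assumption: the delay grows by `factor` each time (doubling here).
    Only the 1 s start and the 2 min cap are taken from the text.
    """
    delays = []
    delay = initial
    for _ in range(attempts):
        # Never wait longer than the cap, however many attempts have failed.
        delays.append(min(delay, cap))
        delay *= factor
    return delays


if __name__ == "__main__":
    print(restart_delays(10))
```

Under these assumptions the cap is reached on the eighth attempt, so a host that keeps failing settles into one restart roughly every two minutes. This is why the first exception after a restart is usually more informative than the repeated ones that follow.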

Host States

The Functions host can be in any of the following states:

State	Meaning
Default	Host has not yet been created
Starting	Host is in the process of starting
Initialized	Functions indexed, listeners not yet running
Running	Fully running — triggers active, functions discoverable
Error	Host encountered an error — will attempt restart
Stopping	Host is shutting down
Stopped	Host is stopped
Offline	Host is offline (app_offline.htm is present)

Only when the host reaches the Running state are functions visible in the portal and triggers active. The Error state triggers an automatic restart loop.

Key Settings That Affect Startup

Setting	Purpose	Impact If Wrong
FUNCTIONS_EXTENSION_VERSION	Specifies the runtime version (e.g., ~4)	Host throws startup error if missing or invalid
FUNCTIONS_WORKER_RUNTIME	Specifies the language runtime (e.g., dotnet-isolated, node, python)	Host cannot load the correct worker process
AzureWebJobsStorage	Connection string for the required storage account	Host cannot store keys, coordinate triggers, or maintain state
WEBSITE_RUN_FROM_PACKAGE	Controls how deployment packages are loaded	Host shuts down if package is inaccessible or corrupted
WEBSITE_CONTENTAZUREFILECONNECTIONSTRING	Storage connection for content share (Consumption/Premium)	Host cannot access function code
WEBSITE_CONTENTSHARE	File share name for function content	Host cannot locate function files

Startup Failure Categories

Category	Examples	Typical Symptom
Configuration	Missing/invalid app settings, bad host.json	Host enters Error state immediately
Storage	AzureWebJobsStorage unreachable, expired SAS token, firewall	Host fails repeatedly, storage-related exceptions
Extensions/Bindings	Missing extension bundle, version mismatch, load failure	Host errors during extension loading phase
Deployment/Packaging	Corrupted zip, wrong package structure, missing files	Host starts but finds no functions, or fails to load assemblies
Code/Startup	DI exception, external startup error, assembly conflict	Host errors during initialization with code-specific exception
Runtime/Worker	Wrong worker runtime, language mismatch, gRPC failure	Host cannot establish worker channel
Networking	VNet blocks outbound, DNS failure, private endpoint misconfigured	Host cannot reach storage/dependencies at startup
Platform	Resource exhaustion, app_offline.htm, platform issue	Host enters Offline state or is killed before startup completes

Common Causes and Solutions

1. Missing or Invalid FUNCTIONS_EXTENSION_VERSION

Symptoms:

Host immediately fails to start
Error message: "Invalid site extension configuration. Please update the App Setting 'FUNCTIONS_EXTENSION_VERSION' to a valid value (e.g. ~4)."
Repeated restart loops in Application Insights

Why This Happens:

The FUNCTIONS_EXTENSION_VERSION app setting tells the platform which version of the Functions runtime to load. When your app runs as a hosted site extension (the normal case in Azure), this setting is validated as one of the first steps in ScriptHost.PreInitialize(). If it is missing, empty, or set to an unrecognized value, the host throws a HostInitializationException and cannot proceed.

How to Verify:

Navigate to your Function App in the Azure Portal
Go to Settings → Configuration → Application settings
Look for FUNCTIONS_EXTENSION_VERSION
Confirm it is set to a valid value: ~4 (recommended), ~3 (legacy), or a specific version

Solution:

Set FUNCTIONS_EXTENSION_VERSION to ~4 (or the appropriate version for your app)
If the setting was recently changed or removed, restore it
Save and restart the Function App

Ref: FUNCTIONS_EXTENSION_VERSION

2. Missing or Mismatched FUNCTIONS_WORKER_RUNTIME

Symptoms:

Error: "The 'FUNCTIONS_WORKER_RUNTIME' setting is required..." (diagnostic code AZFD0011)
Error: "The 'FUNCTIONS_WORKER_RUNTIME' is set to 'X', which does not match the worker runtime metadata..." (diagnostic code AZFD0013)
Host enters Error state after loading function metadata

Why This Happens:

The FUNCTIONS_WORKER_RUNTIME setting controls which language worker process the host launches (e.g., dotnet-isolated, node, python, java, powershell). During initialization, the host validates that this setting matches the actual function metadata discovered in your deployment. A mismatch — for example, deploying a Python app but having FUNCTIONS_WORKER_RUNTIME=node — results in a HostInitializationException.

How to Verify:

Check the app setting value in Portal → Configuration
Compare against your actual project type:
- C# in-process: dotnet
- C# isolated: dotnet-isolated
- Node.js: node
- Python: python
- Java: java
- PowerShell: powershell
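
The comparison above can be expressed as a small lookup. This is only a sketch: the mapping is copied from the list above, while `worker_runtime_problem` and its messages are hypothetical and do not reproduce the host's own validation or its AZFD0011/AZFD0013 wording.

```python
from typing import Optional

# Expected FUNCTIONS_WORKER_RUNTIME value per project type, from the list above.
EXPECTED_WORKER_RUNTIME = {
    "C# in-process": "dotnet",
    "C# isolated": "dotnet-isolated",
    "Node.js": "node",
    "Python": "python",
    "Java": "java",
    "PowerShell": "powershell",
}


def worker_runtime_problem(project_type: str, configured: Optional[str]) -> Optional[str]:
    """Return a description of the problem, or None when the setting looks right."""
    expected = EXPECTED_WORKER_RUNTIME[project_type]
    if not configured:
        return f"FUNCTIONS_WORKER_RUNTIME is not set; expected '{expected}'"
    if configured.strip() != expected:
        return (f"FUNCTIONS_WORKER_RUNTIME is '{configured}' but a "
                f"{project_type} app expects '{expected}'")
    return None


if __name__ == "__main__":
    print(worker_runtime_problem("Python", "node"))
```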

Solution:

Set FUNCTIONS_WORKER_RUNTIME to the correct value matching your function code
If you recently migrated language models (e.g., in-process to isolated), update the setting accordingly
Save and restart

Ref: FUNCTIONS_WORKER_RUNTIME

3. Storage Account Connectivity Issues (AzureWebJobsStorage)

Symptoms:

Host fails to start and cannot recover
Errors related to Blob storage connectivity
"Unable to get function keys" or secret management errors
Health check returns Unhealthy

Why This Happens:

The Functions host requires a valid and reachable storage account for:

Storing function keys and secrets
Coordinating distributed triggers (e.g., timer triggers, queue listeners)
Maintaining internal state and lock management
Hosting the content share for Consumption and Premium plans

The host runs a background health check (WebJobsStorageHealthCheck) every 30 seconds that verifies Blob storage connectivity. If the storage account is unreachable — due to a wrong connection string, rotated keys, firewall restrictions, deleted account, or expired SAS token — the host will fail to initialize properly.

How to Verify:

Check your Application Settings for these storage-related values:

Setting	Required For
AzureWebJobsStorage	All plans — primary storage connection
WEBSITE_CONTENTAZUREFILECONNECTIONSTRING	Consumption and Premium plans — content share
WEBSITE_CONTENTSHARE	Consumption and Premium plans — file share name

You can also verify storage connectivity using the host status endpoint.
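
As a quick pre-check, the table above can be turned into a few lines of code. The sketch below assumes the app settings are available as a plain dictionary and that the plan is passed as an illustrative string such as "consumption" or "premium"; `missing_storage_settings` is a hypothetical helper.

```python
from typing import Dict, List

# Setting names come from the table above; the plan labels are illustrative.
REQUIRED_ON_ALL_PLANS = ["AzureWebJobsStorage"]
REQUIRED_ON_CONTENT_SHARE_PLANS = [
    "WEBSITE_CONTENTAZUREFILECONNECTIONSTRING",
    "WEBSITE_CONTENTSHARE",
]
CONTENT_SHARE_PLANS = {"consumption", "premium"}


def missing_storage_settings(app_settings: Dict[str, str], plan: str) -> List[str]:
    """List the storage-related settings that are absent or blank for this plan."""
    required = list(REQUIRED_ON_ALL_PLANS)
    if plan.lower() in CONTENT_SHARE_PLANS:
        required += REQUIRED_ON_CONTENT_SHARE_PLANS
    return [name for name in required if not app_settings.get(name, "").strip()]


if __name__ == "__main__":
    settings = {"AzureWebJobsStorage": "UseDevelopmentStorage=true"}
    print(missing_storage_settings(settings, "premium"))
```

The check only confirms that values are present; it says nothing about whether the connection string is valid or the account is reachable. Apps that use identity-based storage connections configure these settings differently, so the list would need adjusting for them.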

Solution:

Verify the storage account exists — check the Azure Portal to confirm it has not been deleted or disabled
Check for rotated keys — if storage keys were recently regenerated, update the connection string:
- Get the new connection string from the Storage Account → Access keys blade
- Update AzureWebJobsStorage in your Function App settings
Check storage firewall rules:
- Go to Storage Account → Networking
- Ensure the Function App has access (public endpoint, service endpoint, or private endpoint depending on your architecture)
For SAS-token-based connections — verify the token has not expired (diagnostic code AZFD0006)
For VNet-integrated apps:
- Ensure service endpoints or private endpoints are configured for the storage account
- Verify DNS resolution works for *.blob.core.windows.net, *.queue.core.windows.net, *.table.core.windows.net, and *.file.core.windows.net
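
The DNS step can be scripted. The sketch below builds the four storage hostnames for an account and tries to resolve each one with the standard library; the account name is a placeholder, and `storage_hostnames` and `check_dns` are hypothetical helpers. The result is only meaningful when the script runs inside the same network context as the Function App.

```python
import socket
from typing import Callable, Dict, List

STORAGE_SERVICES = ["blob", "queue", "table", "file"]


def storage_hostnames(account_name: str) -> List[str]:
    """Build the public endpoint hostnames for one storage account."""
    return [f"{account_name}.{svc}.core.windows.net" for svc in STORAGE_SERVICES]


def check_dns(hostnames: List[str],
              resolve: Callable[[str], str] = socket.gethostbyname) -> Dict[str, str]:
    """Map each hostname to its resolved IPv4 address or a failure note.

    `resolve` is injectable so the function can be exercised without a network.
    """
    results = {}
    for host in hostnames:
        try:
            results[host] = resolve(host)
        except OSError as exc:  # socket.gaierror is a subclass of OSError
            results[host] = f"FAILED: {exc}"
    return results


if __name__ == "__main__":
    # "mystorageaccount" is a placeholder; substitute your own account name.
    for host, outcome in check_dns(storage_hostnames("mystorageaccount")).items():
        print(f"{host} -> {outcome}")
```

With private endpoints, a hostname that resolves to a public address instead of a private one usually points to a missing private DNS zone link rather than a storage problem.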

For detailed guidance, see Storage considerations for Azure Functions.

4. Invalid host.json Configuration

Symptoms:

Error: "The host.json file is missing the required 'version' property." (diagnostic code AZFD0009)
Error: "'X' is an invalid value for host.json 'version' property."
JSON deserialization failures in logs
Host enters a special HandlingConfigurationParsingError mode

Why This Happens:

The host.json file is parsed early in the startup sequence. If it is missing the required "version": "2.0" property, contains invalid JSON syntax, or has unrecognized configuration values, the host throws a HostConfigurationException. The host then restarts in a degraded mode that skips host.json parsing — the admin APIs remain functional for diagnostics, but functions will not load.

How to Verify:

Check your host.json in the deployment:

Windows plans: Use Kudu → Debug Console → Navigate to site/wwwroot/host.json
Linux/Flex Consumption: Use SSH or Azure CLI

Validate that the file:

Is valid JSON (use a JSON validator)
Contains the required "version": "2.0" property
Does not have unrecognized or misspelled configuration keys

Minimal valid host.json:

{ "version": "2.0" }

Typical host.json with extension bundle:

{ "version": "2.0", "extensionBundle": { "id": "Microsoft.Azure.Functions.ExtensionBundle", "version": "[4.*, 5.0.0)" }, "logging": { "applicationInsights": { "samplingSettings": { "isEnabled": true, "excludedTypes": "Request" } } } }

Solution:

Fix any JSON syntax errors
Ensure "version": "2.0" is present
Remove or correct any unrecognized configuration keys
Redeploy or edit the file directly via Kudu (Windows plans)
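
The first two checks can be automated before deployment. This is a minimal sketch using only the standard library; it verifies JSON syntax and the version property but does not detect unrecognized configuration keys. `host_json_problems` is a hypothetical helper, and its messages do not reproduce the host's own error text.

```python
import json
from typing import List


def host_json_problems(text: str) -> List[str]:
    """Return a list of problems found in host.json content (empty if none)."""
    try:
        config = json.loads(text)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc.msg} (line {exc.lineno}, column {exc.colno})"]
    if not isinstance(config, dict):
        return ["top-level value must be a JSON object"]
    if "version" not in config:
        return ["missing required 'version' property"]
    if config["version"] != "2.0":
        return [f"'{config['version']}' is an invalid value for 'version'"]
    return []


if __name__ == "__main__":
    print(host_json_problems('{ "version": "2.0" }'))
    print(host_json_problems('{ "extensionBundle": {} }'))
```

Running a check like this in a build pipeline catches a broken host.json before it reaches the app, which is cheaper than diagnosing a restart loop afterwards.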

Ref: host.json

5. Extension Bundle or Binding Load Failures

Symptoms:

Host fails to start with extension-related errors in logs
Error: "Referenced bundle X of version Y does not meet the required minimum version..."
Error: "One or more loaded extensions do not meet the minimum requirements..."
Errors referencing ScriptStartUpErrorLoadingExtensionBundle or ScriptStartUpUnableToLoadExtension
Works locally but fails in Azure

Why This Happens:

Azure Functions uses extension bundles to provide trigger and binding implementations (Service Bus, Event Hubs, Cosmos DB, etc.). During startup, the ScriptStartupTypeLocator loads extension assemblies from either the bundle path or the bin folder. If the bundle is missing, the version is incompatible, an assembly fails to load, or the type does not implement the expected interfaces, the host throws a HostInitializationException.

How to Verify:

Check host.json for the extensionBundle configuration
Verify the version range is compatible with your runtime version
For compiled C# apps that don't use bundles, verify all required NuGet packages are present and compatible

Solution:

Ensure extensionBundle is configured in host.json:

{ "version": "2.0", "extensionBundle": { "id": "Microsoft.Azure.Functions.ExtensionBundle", "version": "[4.*, 5.0.0)" } }

Use the correct version range for your runtime:
- Functions v4: [4.*, 5.0.0)
For compiled .NET apps using explicit extensions:
- Verify all extension NuGet packages are up to date
- Ensure extensions.json is present in the bin folder after build
Check for assembly version conflicts in the build output

6. Deployment Package Issues (WEBSITE_RUN_FROM_PACKAGE)

Symptoms:

Host shuts down immediately after startup
Error: "Shutting down host due to presence of FAILED TO INITIALIZE RUN FROM PACKAGE.txt"
Functions were visible before but disappeared after deployment
"No functions found" in the portal
Read-only file system errors in logs

Why This Happens:

When WEBSITE_RUN_FROM_PACKAGE is configured, the Functions host runs directly from a deployment package (ZIP file). During startup, the host checks the file system for failure markers. If the file FAILED TO INITIALIZE RUN FROM PACKAGE.txt is found, the host immediately shuts down the application — this is a fatal, non-recoverable error that requires redeployment.

Other common package issues include an inaccessible URL, an expired SAS token, files nested in a subfolder instead of the ZIP root, or a corrupted package.

WEBSITE_RUN_FROM_PACKAGE Values:

Value	Behavior
1	Runs from a local package in d:\home\data\SitePackages (Windows) or /home/data/SitePackages (Linux)
<URL>	Runs from a remote package at the specified URL (required for Linux Consumption)
Not set	Traditional deployment — files extracted to wwwroot

How to Verify:

Check WEBSITE_RUN_FROM_PACKAGE in Application Settings
If value is 1:
- Go to Kudu → Debug Console
- Navigate to d:\home\data\SitePackages
- Verify a .zip file exists and packagename.txt points to it
If value is a URL:
- Try accessing the URL directly — it should download the ZIP
- Check for expired SAS tokens (403 response) or missing blobs (404 response)
Verify package contents:
- Download and extract the ZIP
- Confirm host.json and function files are at the root level, not in a nested subfolder
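
The package-structure check lends itself to automation. The sketch below lists the entries of a ZIP with the standard library and reports whether host.json sits at the root; `zip_entry_names` and `package_layout_problem` are hypothetical helpers, and the check does not validate the function files themselves.

```python
import zipfile
from typing import List


def zip_entry_names(package_path: str) -> List[str]:
    """Return every entry name stored in the deployment package."""
    with zipfile.ZipFile(package_path) as zf:
        return zf.namelist()


def package_layout_problem(entry_names: List[str]) -> str:
    """Return '' when host.json is at the ZIP root, else a short description."""
    # Normalise separators in case the archive was built with backslashes.
    names = [name.replace("\\", "/") for name in entry_names]
    if "host.json" in names:
        return ""
    nested = [name for name in names if name.endswith("/host.json")]
    if nested:
        return f"host.json is nested at '{nested[0]}'; it should be at the ZIP root"
    return "host.json not found in the package"


if __name__ == "__main__":
    print(package_layout_problem(["myapp/host.json", "myapp/HttpTrigger/function.json"]))
```

A nested host.json typically means the parent folder was zipped instead of its contents.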

Common Issues:

Problem	Symptom	Fix
Expired SAS token	Package URL returns 403	Generate new SAS with longer expiry
Package URL not accessible	Package URL returns 404	Verify blob exists and URL is correct
Wrong package structure	Files in subfolder	Ensure files are at ZIP root
Corrupted package	Host startup errors	Redeploy with a fresh package
Storage firewall blocking	Timeout errors	Allow Function App access to storage

Solution:

Redeploy your Function App using your preferred deployment method
If using URL-based packages, regenerate the SAS token or use managed identity-based access
If the failure marker file exists, redeployment will overwrite it
Restart the Function App after applying the fix

Ref: WEBSITE_RUN_FROM_PACKAGE

7. Code-Level Startup Exceptions (DI and External Startup)

Symptoms:

Host Error state with application-specific exception in logs
Error: "Error configuring services in an external startup class" (diagnostic code AZFD0005)
Dependency injection failures (InvalidOperationException, TypeLoadException)
Errors in Program.cs or Startup.cs of your application
Assembly binding or version conflict exceptions

Why This Happens:

For isolated worker (.NET) apps, your Program.cs runs custom startup code before the worker connects to the host. For in-process (.NET) apps, custom IWebJobsStartup implementations run during host initialization. If this code throws — for example, a missing dependency, a failed external service connection, or a type load error — the host catches the exception and enters an Error state with a HostInitializationException.

How to Verify:

Check Application Insights Exceptions table for the specific exception type and stack trace
Look for errors containing AZFD0005 (external startup error)
Review your Program.cs / Startup.cs for:
- Service registrations that depend on external resources (databases, APIs, Key Vault)
- Missing NuGet packages or assembly version mismatches
- Configuration values that may differ between local and Azure environments

Solution:

Fix the exception identified in logs — the stack trace usually points directly to the failing code
Ensure all required environment variables and connection strings are set in Application Settings
For assembly conflicts, check that all NuGet package versions are compatible and aligned
Consider making external-service connections resilient by deferring initialization or adding retry logic
Test startup locally with the same environment variables as Azure

8. Language Worker Channel Failure

Symptoms:

Error: "Failed to start Language Worker Channel for language: {runtime}"
Error: "Failed to start Rpc Server. Check if your app is hitting connection limits."
Host starts but cannot communicate with the language worker process
Timeout errors during worker initialization

Why This Happens:

For out-of-process languages (Node.js, Python, Java, PowerShell, .NET Isolated), the Functions host communicates with a separate worker process over gRPC. If the host cannot start the gRPC server, or the worker process fails to launch or connect, the host throws a HostInitializationException.

Common causes include:

Port conflicts
Missing language runtime or incorrect version
Worker process crashes on startup
Resource exhaustion (memory, file handles)

How to Verify:

Check Application Insights for gRPC or worker-related errors
Verify the correct language runtime version is installed:
- For Node.js: Check WEBSITE_NODE_DEFAULT_VERSION
- For Python: Check the Python version in Configuration → General settings
- For Java: Check FUNCTIONS_WORKER_JAVA_LOAD_APP_LIBS and Java version
- For .NET Isolated: Check target framework in the deployed assemblies
Check if the Function App is hitting plan resource limits

Solution:

Ensure the correct language runtime version is configured
For Linux Consumption, verify the correct runtime stack is selected in Configuration → General settings
If resource limits are suspected, consider scaling up to a higher plan tier
Restart the Function App to clear temporary port or resource issues

9. Networking Blocking Required Dependencies

Symptoms:

Host fails to start in VNet-integrated apps
Timeout errors connecting to storage or other Azure services
Works without VNet integration, fails with it enabled
DNS resolution failures in logs
NSG or firewall-related errors

Why This Happens:

During startup, the Functions host must reach several external endpoints:

Azure Storage (Blob, Queue, Table, File) — for keys, triggers, and state
Extension bundle CDN — to download extension bundles (first run or cold start)
Azure Key Vault — if Key Vault references are used in app settings
Application Insights — for telemetry (non-blocking, but can delay if timing out)

If VNet integration, NSG rules, forced tunneling, or a firewall blocks these outbound connections, the host cannot complete startup.

How to Verify:

Check if the Function App has VNet integration enabled (Networking blade)
Review NSG rules on the integrated subnet — ensure outbound to Azure services is allowed
For apps with forced tunneling, verify the firewall/NVA allows required endpoints
Check DNS resolution for storage endpoints from within the VNet context

Solution:

Add NSG rules or firewall rules to allow outbound traffic to the required endpoints
Configure service endpoints or private endpoints for storage on the integrated subnet
Ensure DNS resolution works for all required endpoints
For private DNS zones, ensure proper zone links and records exist for storage

See Azure Functions networking options for detailed configuration guidance.

10. app_offline.htm Causing Offline State

Symptoms:

Host status shows Offline
All requests return an offline page
Portal shows the app is running but functions return errors

Why This Happens:

If a file named app_offline.htm exists in the function app's script root directory, the host detects it during startup and enters the Offline state. Some deployment tools create this file during deployment to gracefully take the app offline, and it should be removed automatically when deployment completes. If it is left behind — for example, due to a failed deployment — the host remains offline.

How to Verify:

Windows plans: Go to Kudu → Debug Console → Navigate to site/wwwroot and look for app_offline.htm
Linux: Use SSH or Azure CLI to check for the file

Solution:

Delete app_offline.htm from the app's root directory
The host will automatically detect the deletion and restart into a normal state
If the file reappears after deletion, investigate your deployment pipeline — it may be creating the file but failing to remove it

Using Diagnose and Solve Problems

The Azure Portal provides built-in diagnostics specifically designed for Functions host startup issues.

How to Access:

Navigate to your Function App in the Azure Portal
Select Diagnose and solve problems from the left menu
Search for relevant detectors:

Detector	What It Checks
Function App Down or Reporting Errors	Overall app health, host status, crash history
Function App Startup Issue	Specific startup failure analysis, configuration validation
Functions Configurations Check	host.json and app settings validation
Functions Deployment	Recent deployment status and potential issues
Network Troubleshooter	VNet, private endpoint, and access restriction diagnostics

These detectors run automated checks against your Function App and provide targeted recommendations.

The detectors often identify the root cause faster than manual investigation.

Verifying Host Status via REST API

You can check the host status programmatically to determine the current state and any reported errors.

Get Host Status:

curl "https://<app>.azurewebsites.net/admin/host/status?code=<master-key>"

See Admin API for details.

The state field is the single most important indicator:

State	Action
Running	Host is healthy — investigate function-level issues
Error	Host startup failed — check the errors array for root cause
Offline	app_offline.htm present — check deployment state
No response / timeout	Host cannot serve requests — check platform health and networking
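
The table above maps directly onto a small decision function. The sketch below assumes the response body is JSON with a `state` field and an optional `errors` array, as described in this section; other fields are ignored, and `next_action` is a hypothetical helper.

```python
import json
from typing import Optional

# Suggested next steps, paraphrased from the table above.
ACTIONS = {
    "Running": "Host is healthy; investigate function-level issues",
    "Error": "Host startup failed; check the errors array for root cause",
    "Offline": "app_offline.htm present; check deployment state",
}


def next_action(response_body: Optional[str]) -> str:
    """Suggest the next troubleshooting step from a host status response."""
    if not response_body:
        return "No response; check platform health and networking"
    status = json.loads(response_body)
    state = status.get("state", "")
    action = ACTIONS.get(state, f"State '{state}'; host has not reached Running yet")
    errors = status.get("errors") or []
    if state == "Error" and errors:
        # Surface the first reported error, which is usually the root cause.
        action += f" (first error: {errors[0]})"
    return action


if __name__ == "__main__":
    print(next_action('{"state": "Error", "errors": ["Storage unreachable"]}'))
```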

List Functions (verify discovery):

curl "https://<app>.azurewebsites.net/admin/functions?code=<master-key>"

Quick Troubleshooting Checklist

Use this checklist to systematically diagnose host startup issues:

[ ] Host status: Check /admin/host/status — is the state Running, Error, or Offline?
[ ] First error: Check Application Insights Exceptions or Log Stream — what is the first exception after the latest restart?
[ ] FUNCTIONS_EXTENSION_VERSION: Is it set to a valid value (e.g., ~4)?
[ ] FUNCTIONS_WORKER_RUNTIME: Is it set correctly and does it match the deployed code?
[ ] AzureWebJobsStorage: Is the connection string valid? Is the storage account reachable from the app's network context?
[ ] host.json: Does it exist, contain valid JSON, and include "version": "2.0"?
[ ] Extension bundle: Is extensionBundle configured with a compatible version range?
[ ] Package deployment: If using WEBSITE_RUN_FROM_PACKAGE, is the package accessible and correctly structured?
[ ] Startup code: For .NET apps, does Program.cs / startup code throw during DI registration?
[ ] Networking: If VNet-integrated, can the app reach storage, Key Vault, and extension CDN endpoints?
[ ] Offline file: Is app_offline.htm present in the root directory?
[ ] Diagnose and Solve: Have you run the Function App Startup Issue detector in the Azure Portal?

Diagnostic Event Codes Reference

When reviewing logs, look for these Azure Functions diagnostic codes that are related to startup failures:

Code	Name	Meaning
AZFD0005	External Startup Error	Error in a custom IWebJobsStartup class
AZFD0006	SAS Token Expiring	AzureWebJobsStorage SAS token is expiring or expired
AZFD0009	Unable to Parse host.json	host.json file is missing or has invalid content
AZFD0011	Missing FUNCTIONS_WORKER_RUNTIME	The required worker runtime setting is not configured
AZFD0013	Worker Runtime Mismatch	FUNCTIONS_WORKER_RUNTIME does not match deployed function metadata

These codes appear in Application Insights traces and diagnostic event logs.
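
When working from exported logs, the codes in the table can be picked out with a regular expression. This is a sketch under the assumption that each code appears verbatim somewhere in a log line; the exact log format is not specified here, and `startup_codes_in` is a hypothetical helper.

```python
import re
from typing import Dict, List

# Codes and names copied from the table above.
DIAGNOSTIC_CODES = {
    "AZFD0005": "External Startup Error",
    "AZFD0006": "SAS Token Expiring",
    "AZFD0009": "Unable to Parse host.json",
    "AZFD0011": "Missing FUNCTIONS_WORKER_RUNTIME",
    "AZFD0013": "Worker Runtime Mismatch",
}

_CODE_PATTERN = re.compile(r"\bAZFD\d{4}\b")


def startup_codes_in(log_lines: List[str]) -> Dict[str, str]:
    """Return the known startup codes found, in order of first appearance."""
    found: Dict[str, str] = {}
    for line in log_lines:
        for code in _CODE_PATTERN.findall(line):
            if code in DIAGNOSTIC_CODES and code not in found:
                found[code] = DIAGNOSTIC_CODES[code]
    return found


if __name__ == "__main__":
    sample = ["Host error AZFD0011: worker runtime setting is required"]
    print(startup_codes_in(sample))
```

Because the result preserves the order of first appearance, the first key is a reasonable starting point when several codes show up in the same restart loop.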

Ref: Diagnostic Events

Conclusion

Azure Functions host startup failures can be caused by a wide range of issues — from a simple missing app setting to complex networking misconfigurations. The key to efficient troubleshooting is a systematic approach:

Key Takeaways:

Always check host status first — the /admin/host/status endpoint tells you the current state and any errors
Find the first error, not the cascade — look for the initial exception after the most recent restart
Validate configuration — FUNCTIONS_EXTENSION_VERSION, FUNCTIONS_WORKER_RUNTIME, and AzureWebJobsStorage are the three settings that cause the most startup failures
Check host.json — a missing version property or invalid JSON is a common and easily fixable cause
Verify deployment artifacts — ensure your package is complete, correctly structured, and accessible
Use built-in diagnostics — the Diagnose and Solve Problems detectors are purpose-built for these issues
Apply one fix at a time — change one setting, restart, and recheck. Avoid multiple simultaneous changes that obscure which fix resolved the issue

If you continue to experience startup issues after following these steps, consider opening a support ticket with Microsoft Azure Support, providing:

Function App name and resource group
Timestamp of when the issue started
Host status endpoint response (copy the full JSON)
The first exception from Application Insights or Log Stream
Recent deployment or configuration changes
Networking configuration details (if VNet-integrated)

Have questions or feedback? Leave a comment below.

From Playwright Automation to Agent Driven Testing (GHCP in Action)

syedarshad — Tue, 21 Apr 2026 05:00:00 GMT

What Is Agent-Driven Testing?

Agent-driven testing represents a revolutionary shift from traditional, hardcoded test automation to intelligent, adaptive testing powered by AI agents. Unlike conventional Playwright tests that rely on static selectors and predefined workflows, agent-driven testing leverages GitHub Copilot (GHCP) agents with Model Context Protocol (MCP) to dynamically analyze web pages, discover elements intelligently, and create self-healing tests that adapt to UI changes.

Traditional vs Agent-Driven Approach Comparison

Traditional Playwright	Agent-Driven (MCP-Enhanced)
Hardcoded selectors	AI-discovered elements
Static test scripts	Dynamic, adaptive tests
Breaks with UI changes	Self-healing automation
Manual element analysis	Intelligent page exploration
Rule-based logic	Context-aware decisions
Limited fallback options	Intelligent cascading strategies

❌ Traditional Approach - Brittle and static

const searchInput = page.locator('input[name="q"]');

✅ Agent-Driven Approach - Intelligent and adaptive

// Uses AI discovery
const searchResult = await this.mcpClient.callTool({
  name: 'playwright_find_element',
  arguments: {
    element_type: 'search_input',
    page_url: await this.page.url(),
    confidence_threshold: 0.8,
    generate_multiple_selectors: true
  }
});

The agent doesn't just execute tests—it thinks about them, analyzing page structure, scoring element reliability, and making intelligent decisions about the best interaction strategies.

How Does Agent-Driven Testing Work with MCP?

Agent-driven testing operates through a sophisticated Model Context Protocol (MCP) workflow that mimics human intelligence. Here's how the MCP server analyzes pages and makes intelligent decisions:

🔬 1. Intelligent Page Analysis

The agent first explores the target website like a human tester would:

MCP-Enhanced exploration from your implementation

🧠 2. Dynamic Element Discovery with Confidence Scoring

Your implementation shows how MCP uses confidence scoring to intelligently identify elements:

Intelligent scoring from your sample workflow.page.ts
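
As a rough illustration of confidence scoring, the TypeScript sketch below scores candidate elements by their attributes and keeps the best one. The `Candidate` shape, `scoreCandidate`, `pickBest`, and the `minScore` threshold are hypothetical names chosen for this example; they are not part of Playwright or of any MCP server. The weights are borrowed from the scoring snippet shown later in this post.

```typescript
// Hypothetical shape for a candidate element. In a real test these values
// would be read from the page; here they are plain data for illustration.
interface Candidate {
  selector: string;
  name?: string;
  role?: string;
  placeholder?: string;
}

// Weights mirror the post's snippet: name 'q' (+10), role 'combobox' (+7),
// placeholder containing 'search' (+5). Lower-casing the placeholder is an
// addition made for this sketch.
function scoreCandidate(c: Candidate): number {
  let score = 0;
  if (c.name === 'q') score += 10;
  if (c.role === 'combobox') score += 7;
  if (c.placeholder?.toLowerCase().includes('search')) score += 5;
  return score;
}

// Return the highest-scoring candidate, or undefined when no candidate
// reaches minScore (an assumed threshold).
function pickBest(candidates: Candidate[], minScore = 5): Candidate | undefined {
  let best: Candidate | undefined;
  let bestScore = -1;
  for (const c of candidates) {
    const s = scoreCandidate(c);
    if (s > bestScore) {
      best = c;
      bestScore = s;
    }
  }
  return bestScore >= minScore ? best : undefined;
}
```

Returning `undefined` below the threshold, rather than the least-bad match, lets the caller fall through to another strategy instead of interacting with the wrong element.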

🎯 3. MCP Server Page Analysis

The MCP server analyzes page content and provides intelligent insights:

MCP snapshot and analysis from your implementation

⚡ 4. Adaptive Fallback Strategies

When primary strategies fail, the agent intelligently cascades through alternatives:

Your implementation's intelligent fallback system

async getSearchInput() {
  console.log('🔎 MCP: Using intelligently discovered search input...');
  // 1. Try MCP-discovered element first (highest reliability)
  // 2. Real-time dynamic discovery (adaptive)
  // 3. Traditional selectors (fallback safety)
}
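
The three tiers listed above can be sketched as an ordered list of strategies that are tried until one yields a selector. This is a minimal sketch under stated assumptions: `Strategy`, `resolveWithFallback`, and the strategy labels are hypothetical names for illustration, and the stub resolvers stand in for whatever MCP or Playwright calls a real page object would make.

```typescript
// A strategy tries to produce a usable selector and returns null when it cannot.
type Strategy = {
  label: string;
  resolve: () => Promise<string | null>;
};

// Try each strategy in order. A thrown error counts as a miss, so the
// cascade continues to the next tier instead of failing the test outright.
async function resolveWithFallback(
  strategies: Strategy[],
): Promise<{ label: string; selector: string }> {
  for (const { label, resolve } of strategies) {
    try {
      const selector = await resolve();
      if (selector) return { label, selector };
    } catch {
      // Ignore the failure and move on to the next strategy.
    }
  }
  throw new Error('No strategy produced a selector');
}

// Example wiring with stubbed resolvers (all hypothetical):
const exampleStrategies: Strategy[] = [
  { label: 'mcp-discovered', resolve: async () => null },
  { label: 'dynamic-discovery', resolve: async () => { throw new Error('timeout'); } },
  { label: 'traditional', resolve: async () => 'input[name="q"]' },
];
```

Returning the label alongside the selector makes it easy to log which tier actually succeeded, which is useful when judging how often the fallbacks are being hit.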

How to Implement Agent-Driven Testing: Step-by-Step Guide

Step 1: Create GitHub Agent Configuration

Create the agent configuration that enables MCP capabilities:

In your project directory, run `mkdir -p .github/agents`.

Create .github/agents/playwright-agent.md with your exact configuration:

Step 2: Select Agent in GitHub Copilot Chat

Open GitHub Copilot Chat in VS Code
Click the agent selector at the top of the chat
Choose Playwright Tester Mode from the dropdown
The agent will now use MCP-enhanced capabilities

Step 3: Create Test Using Natural Language Prompts

Now you can create tests using natural language prompts to the agent:

Prompt to GHCP Agent:

> "Create a test case that navigates to www.google.com, searches for 'playwright tutorial', and navigates to the Playwright homepage. Use MCP analysis to discover elements intelligently."

The agent will generate a test like your implementation:

Step 4: Execute and Monitor Agent Intelligence

Run your MCP-enhanced tests:

npm install @playwright/test

npx playwright install

npx playwright test --headed

Watch the intelligent decision-making in action:

🚀 Starting MCP-Enhanced Test Journey...

🔬 Using MCP to explore and navigate to Google...

🧠 MCP: Analyzing target URL: https://www.google.com

📸 MCP: Taking page snapshot for element analysis...

🔍 MCP: Analyzing page elements dynamically...

🎯 MCP: Discovered search input: input[name="q"]

🔎 MCP: Using intelligently discovered search input...

✅ MCP: Using discovered selector: input[name="q"]

✅ MCP: Search executed using discovered elements

Sample Test Case Results - Google to Playwright Navigation

Based on your actual implementation, here's what the agent accomplishes:

Test Execution Flow:

🔬 MCP Page Analysis

MCP: Analyzing target URL: https://www.google.com

MCP: Taking page snapshot for element analysis

MCP: Analyzing page elements dynamically

🎯 Intelligent Element Discovery

MCP: Discovered search input: input[name="q"]

MCP: Discovered search button: input[value="Google Search"]

🔍 Confidence-Based Search Execution

MCP: Using intelligently discovered search input

MCP: Search executed using discovered elements

🧠 Adaptive Link Detection

// From your discoverPlaywrightLinks implementation

if (href.includes('playwright.dev')) confidence += 50;

if (fullText.includes('playwright')) confidence += 20;

Outcome Details

🎯 Performance Results

Based on your test execution summary:

Metric	Traditional Approach	MCP-Enhanced Approach
Element Discovery	Static, breaks easily	95% success with confidence scoring
Maintenance Effort	High (manual updates)	90% reduction (self-healing)
Bot Detection Handling	Basic fallback	Intelligent adaptive strategies
Test Reliability	60-70% (UI changes)	85-90% (AI adaptation)
Debugging Time	2-4 hours per failure	20-30 minutes (intelligent insights)

🚀 Key Benefits Achieved

Self-Healing Tests

- Tests adapt to UI changes automatically

- Confidence scoring prevents false positives

- Intelligent fallback strategies improve reliability

Intelligent Element Discovery

No more hardcoded selectors that break

Instead: AI-powered discovery with scoring:

if (name === 'q') score += 10;

if (role === 'combobox') score += 7;

if (placeholder?.includes('search')) score += 5;

Enhanced Debugging & Insights

✅ MCP: Using discovered selector: input[name="q"]

🧠 MCP: Found 18 potential Playwright links

Natural Language Test Creation

- Write tests using prompts instead of code

- Agent generates optimized, intelligent automation

- Built-in best practices and error handling

🔮 The Future of Testing is Intelligent

Agent-driven testing with GitHub Copilot and MCP represents the evolution from brittle, maintenance-heavy automation to intelligent, self-healing test suites. Your implementation demonstrates how AI can:

- Think about element discovery instead of hardcoding selectors

- Adapt to UI changes through confidence scoring

- Learn from page analysis to improve over time

- Heal automatically when traditional approaches fail

The result? Tests that improve themselves, dramatically reducing maintenance overhead while increasing reliability and providing intelligent insights into application behavior.

Start your journey from traditional Playwright automation to intelligent agent-driven testing today—your future self (and your QA team) will thank you! 🚀

Implementation Checklist

✅ Quick Start Checklist

- Create .github/agents/playwright-agent.md configuration file

- Select "Playwright Tester Mode" agent in GitHub Copilot Chat

- Install Playwright: `npm install @playwright/test`

- Create MCP-enhanced Page Object Model with confidence scoring

- Configure `playwright.config.ts` with proper reporting

- Write tests using natural language prompts to the agent

- Run tests and observe intelligent decision-making: `npx playwright test --headed`

- Review MCP insights in console output and test reports

🎯 Success Metrics

You'll know agent-driven testing is working when you see:

- Console logs showing MCP analysis: "MCP: Analyzing page elements dynamically..."

- Confidence scoring in action: "MCP: Found 18 potential Playwright links"

- Adaptive behavior: "MCP: Using discovered selector: input[name='q']"

- Self-healing: Tests passing even when UI changes occur

- Reduced maintenance: 90% fewer test fix cycles

Service Bus SBMP Retirement: What BizTalk Server 2020 Customers Need to Know

hcamposu — Tue, 21 Apr 2026 01:02:21 GMT

On September 30, 2026, the Azure Service Bus team will retire support for the Service Bus Messaging Protocol (SBMP). This is important for BizTalk Server 2020 customers who use the BizTalk Service Bus (SB-Messaging) adapter, as SBMP is the protocol that adapter relies on today.

To help customers maintain continuity while planning their transition to Azure Logic Apps, we’ve released a BizTalk Server 2020 hotfix that adds support for Advanced Message Queuing Protocol (AMQP) in the adapter.

What’s changing

SBMP support retires on September 30, 2026 in Azure Service Bus.
A hotfix enables AMQP for the BizTalk Service Bus (SB-Messaging) adapter (request KB5091375 by opening a support case).
AMQP becomes the default transport with the hotfix installed, while SBMP remains available as an opt-in fallback for backward compatibility.
The hotfix will be available for BizTalk Server 2020 CU6 and CU7.
The current hotfix is based on the current Service Bus SDK (scheduled for deprecation in September 2026), and we expect an updated version in June based on the new Service Bus SDK.

What you need to do

If you plan to continue using the BizTalk Server 2020 Service Bus adapter, you should:

Migrate your adapter configuration to AMQP.
Install the hotfix well before September 2026, and run validation in a non-production environment.
Validate your scenarios, including large message/file patterns and any operational fallback strategies you depend on.
Decide whether to test now or wait for the June update: use the current hotfix to validate large file scenarios and fallback approaches, or wait for the June SDK-based refresh if you don’t need to install immediately.

How to obtain the hotfix

You can obtain the hotfix by opening a support case (request KB5091375) or by contacting your Microsoft account team. The hotfix enables AMQP support for the BizTalk Service Bus (SB-Messaging) adapter. A new KB article will be issued for the June update.

Support and lifecycle context

Microsoft remains committed to supporting BizTalk Server 2020 and its features in accordance with the official product lifecycle. Extended paid support will be available after April 2028.

Closing thoughts

If you’re using the SB-Messaging adapter today, now is the right time to plan your move to AMQP and schedule validation in a non-production environment. This keeps you ahead of the September 2026 retirement date and helps ensure a smooth path as you modernize toward Azure Logic Apps.

Azure Incident Retrospective — Please register for one of the 2 sessions below!

SaiVai — Tue, 21 Apr 2026 20:33:41 GMT

Join our upcoming live webcast for a transparent discussion about this recent Azure service incident — led by our engineering teams.

Network degradation within East US AZ-02

Tracking ID: DG_Z-S08 | Impacted: 20 March 2026

What to expect

📚 Understand

What happened, how we responded, and what we learned

💬 Ask

Live Q&A with our engineering experts throughout the session

🛠 Learn

The fixes we've put in place and guidance for workload resiliency

Choose your session

Same content presented at both times — pick the one that works best for your timezone:

Session 1

17:30 UTC

Thursday, 23 April 2026

Register now →

Local times for Session 1:

9:30 AM US Pacific (PDT)
12:30 PM US Eastern (EDT)
5:30 PM London (BST)
1:30 AM +1 Beijing (CST)
4:30 AM +1 Sydney (AEDT)
6:30 AM +1 Auckland (NZDT)

Session 2

05:30 UTC

Friday, 24 April 2026

Register now →

Local times for Session 2:

9:30 PM -1 US Pacific (PDT)
12:30 AM US Eastern (EDT)
5:30 AM London (BST)
1:30 PM Beijing (CST)
4:30 PM Sydney (AEDT)
6:30 PM Auckland (NZDT)

Our engineering leaders

Newton Sanches

Partner, Engineering Manager

Azure Networking

Cloud+AI Engineering

Frank Rey

Partner, General Manager

Azure Networking

Cloud+AI Engineering

⚠️ Prepare before the livestream

Read the Post Incident Review (PIR) ahead of time so you can ask any follow-up questions during the live Q&A

Helpful resources

🔔 Azure Service Health Alerts

Get alerts for relevant incidents by setting up notifications via email, SMS, or webhook

🎥 Past Retrospective Recordings

Watch recordings of previous retrospective livestreams

📄 Azure Post Incident Reviews

Azure RBAC Custom Role Best Practices or Common Build Patterns

nicksal — Mon, 20 Apr 2026 18:40:54 GMT

As a platform admin, I want to grant application admins Contributor access while removing their ability to write or delete most Microsoft.Network resource types, with a few exceptions such as Private Endpoints, Network Interfaces, and Application Gateways.

Based on the effective control plane permissions logic, we designed two custom roles. The first role is a duplicate of the Contributor role, but with Microsoft.Network/*/Write and Microsoft.Network/*/Delete added to notActions. The second role adds back specific Microsoft.Network operations using wildcarded resource types, such as Microsoft.Network/networkInterfaces/*.

Application Admin Effective Permissions = Role 1 (Contributor - Microsoft.Network) + Role 2 (for example, Microsoft.Network/networkInterfaces/*, Microsoft.Network/networkSecurityGroups/*, Microsoft.Network/applicationGateways/write, etc.)
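
To make the two-role arithmetic concrete, here is a small TypeScript sketch that evaluates an operation against both roles. It relies on two documented behaviors: notActions only subtracts from the actions of the same role, and role assignments are additive, so anything Role 2 grants is effective even though Role 1 excludes it. The `RoleDefinition` shape, the helper names, and the wildcard-to-regex matching are simplifications written for this post, not Azure's evaluator, and the Role 1 notActions list is abbreviated to just the two network entries.

```typescript
// Simplified model of a role definition's control-plane permissions.
interface RoleDefinition {
  name: string;
  actions: string[];
  notActions: string[];
}

// Turn a pattern such as "Microsoft.Network/*/write" into a case-insensitive
// regular expression in which "*" matches any run of characters.
function toRegex(pattern: string): RegExp {
  const escaped = pattern
    .split('*')
    .map((part) => part.replace(/[.+?^${}()|[\]\\]/g, '\\$&'))
    .join('.*');
  return new RegExp('^' + escaped + '$', 'i');
}

function matchesAny(patterns: string[], operation: string): boolean {
  return patterns.some((p) => toRegex(p).test(operation));
}

// Within one role: allowed if it matches actions and no notActions entry.
function roleAllows(role: RoleDefinition, operation: string): boolean {
  return matchesAny(role.actions, operation) && !matchesAny(role.notActions, operation);
}

// Across roles: additive, so a single granting role is enough.
function isAllowed(roles: RoleDefinition[], operation: string): boolean {
  return roles.some((r) => roleAllows(r, operation));
}

// Role 1: Contributor-like, minus network writes and deletes (abbreviated).
const role1: RoleDefinition = {
  name: 'Contributor without Microsoft.Network write/delete',
  actions: ['*'],
  notActions: ['Microsoft.Network/*/Write', 'Microsoft.Network/*/Delete'],
};

// Role 2: adds back the exceptions named in the post.
const role2: RoleDefinition = {
  name: 'Network exceptions',
  actions: [
    'Microsoft.Network/networkInterfaces/*',
    'Microsoft.Network/networkSecurityGroups/*',
    'Microsoft.Network/applicationGateways/write',
  ],
  notActions: [],
};
```

With both roles assigned, `isAllowed([role1, role2], 'Microsoft.Network/virtualNetworks/write')` is false while `isAllowed([role1, role2], 'Microsoft.Network/networkInterfaces/write')` is true. A check like this can be run against a list of operations of interest before deploying the role definitions.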

I understand that Microsoft RBAC best practices recommend avoiding wildcard (*) operations. However, my team has found that building roles with individual operations is extremely tedious and time-consuming, especially when trying to understand the impact of each operation.

Does anyone have suggestions for a simpler or more maintainable pattern for implementing this type of custom RBAC design?

Troubleshoot with OpenTelemetry in Azure Monitor - Public Preview

KayodePrince — Mon, 20 Apr 2026 18:14:20 GMT

OpenTelemetry is fast becoming the industry standard for modern telemetry collection and ingestion pipelines. With Azure Monitor’s new OpenTelemetry Protocol (OTLP) support, you can ship logs, metrics, and traces from wherever you run workloads to analyze and act on your observability data in one place.

What’s in the preview

Direct OTLP ingestion into Azure Monitor for logs, metrics, and traces.
Automated onboarding for AKS workloads.
Application Insights on OTLP for distributed tracing, performance and troubleshooting experiences.
Pre-built Grafana dashboards to visualize signals quickly.
Prometheus for metric storage and query.
OpenTelemetry semantic conventions for logs and traces, so your data lands in a familiar standard-based schema.

How to send OTLP to Azure Monitor: pick your path

AKS: Auto-instrument Java and Node.js workloads using the Azure Monitor OpenTelemetry distro, or auto-configure any OpenTelemetry SDK-instrumented workload to export OTLP to Azure Monitor.
- Limited preview: Auto-instrumentation for .NET and Python is also available.
VMs/VM Scale Sets (and Azure Arc-enabled compute): Use the Azure Monitor Agent (AMA) to receive OTLP from your apps and export it to Azure Monitor.
Any environment: Use the OpenTelemetry Collector to receive OTLP signals and export directly to Azure Monitor cloud ingestion endpoints.
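
For readers who have not looked at OTLP on the wire, the TypeScript sketch below builds a single log record in the OTLP/HTTP JSON encoding, which is the kind of payload an application or SDK would send to a Collector. The field names follow the OTLP specification; `buildLogsPayload`, `toAttributes`, and the scope name `manual-example` are hypothetical names for this illustration. In practice an OpenTelemetry SDK exporter produces this for you, so treat it as a way to see the shape of the data rather than a recommended integration.

```typescript
// One OTLP attribute in JSON form (string values only, for brevity).
interface KeyValue {
  key: string;
  value: { stringValue: string };
}

function toAttributes(attrs: Record<string, string>): KeyValue[] {
  return Object.keys(attrs).map((key) => ({ key, value: { stringValue: attrs[key] } }));
}

// Build a logs payload containing one INFO record.
// nowMs is injectable so the output is deterministic in tests.
function buildLogsPayload(
  serviceName: string,
  message: string,
  attrs: Record<string, string> = {},
  nowMs: number = Date.now(),
) {
  return {
    resourceLogs: [
      {
        resource: { attributes: toAttributes({ 'service.name': serviceName }) },
        scopeLogs: [
          {
            scope: { name: 'manual-example' },
            logRecords: [
              {
                // Nanoseconds since the Unix epoch, encoded as a decimal string.
                timeUnixNano: String(Math.trunc(nowMs)) + '000000',
                severityNumber: 9, // 9 is INFO in the OTLP severity scale
                severityText: 'INFO',
                body: { stringValue: message },
                attributes: toAttributes(attrs),
              },
            ],
          },
        ],
      },
    ],
  };
}
```

A Collector with the OTLP receiver enabled listens for HTTP on port 4318 by default, so the serialized payload could be POSTed with `Content-Type: application/json` to a URL such as `http://localhost:4318/v1/logs`; the host here is an assumption about a locally running Collector.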

Diagram: Choose your ingestion path

Under the hood: where your telemetry lands

Metrics: Stored in an Azure Monitor Workspace, a Prometheus metrics store.
Logs + traces: Stored in a Log Analytics workspace using an OpenTelemetry semantic conventions–based schema.
Troubleshooting: Application Insights lights up distributed tracing and end-to-end performance investigations, backed by Azure Monitor.

Application Map on OpenTelemetry signals

Why it matters

Standardize once: Instrument with OpenTelemetry and keep your telemetry portable.
Reduce overhead: Fewer bespoke exporters and pipelines to maintain.
Debug faster: Correlate metrics, logs, and traces to get from alert to root cause with less guesswork.
Observe with confidence: Use dashboards and tracing views that are ready on day one.

Next step: Try the OTLP preview in your environment, then validate end-to-end signal flow with Application Insights and Grafana dashboards.

Secure, Keyless Application Access with Managed Identities - Now GA in Azure Files SMB

Priyanka-Gangal — Mon, 20 Apr 2026 17:57:56 GMT

As enterprises modernize applications and strengthen their security posture, identity is central to how applications access shared storage. Traditional identity models relying on account keys, stored credentials, or domain‑joined infrastructure add operational overhead and introduce security risks such as credential leakage, lack of identity attribution, and excessive privilege if shared keys are compromised. Today, we are excited to announce the General Availability (GA) of Managed Identity support for Azure Files over SMB, enabling applications and virtual machines to securely access Azure Files without secrets, passwords, or key distribution.

Managed Identity support enables customers to meet modern enterprise security standards without reliance on storage account keys, streamlining how organizations securely enable file‑based application access and reducing the operational overhead of filing internal exceptions. New storage accounts can support secure, identity‑based SMB access out of the box, while existing deployments can be secured by enabling Managed Identity authentication.

From web application workloads such as WordPress, to databases on Azure Kubernetes Service (AKS), to CI/CD pipelines, applications require secure access. In a world where security is foundational, continued reliance on key-based access conflicts with Zero Trust principles and least privilege access.

What’s New In GA

AKS Workload Identity Support

AKS Workload Identity (preview) extends the traditional managed identity model for Kubernetes by shifting the identity from the node to pods. Instead of inheriting the identity of the underlying cluster, each Kubernetes pod can use its own federated identity, mapped directly to a Microsoft Entra ID principal.

This feature enables:

Pod-level identity isolation, rather than cluster-level
Least-privilege access with secure RBAC
Seamless scaling and redeployment, without identity reconfiguration
No secrets, no key rotation, no credential injection

When combined with Azure Files over SMB, Workload Identity allows AKS workloads to access shared file storage securely and natively per pod, using the same identity-driven model as cluster-level managed identities. Now available with AKS 1.35, AKS Workload Identity enables per‑application, least‑privilege access to Azure Files without credentials, improving isolation and auditability. For customers in regulated industries such as financial services, this allows stateful workloads to run securely on AKS while meeting strict compliance and regulatory requirements.

Co-existence of Application Identities and end-user identity access

Azure Files now enables both Managed Identity and end‑user access on the same storage account, with users and applications independently authenticated via Entra ID and authorized through a shared permissions model.
This unified access model eliminates the need for duplicate storage or credentials, enabling secure collaboration, troubleshooting, and automation on shared data without compromising governance or compliance.

This supports scenarios such as:

Developers accessing the same file share as an application for debugging
Admins managing content used by automated workflows
Hybrid environments with user-driven and app-driven access

Simplified Storage Account enablement via the Azure portal

We have now added a dedicated Managed Identity property that makes enabling identity‑based SMB access simple and transparent via the Azure portal for new as well as existing storage accounts. With a single configuration at the storage account level, customers can allow applications to authenticate to Azure Files using Managed Identities. This portal experience supports incremental adoption, making it easy to modernize authentication while maintaining compatibility with existing user access and governance models.

Get Started with Managed Identities with SMB Azure Files

Start using Managed Identities with Azure Files today at no additional cost. This feature is supported on HDD and SSD SMB shares across all billing models. Refer to our documentation for complete set-up guidance.

Whether provisioning new storage or enhancing existing deployments, this capability provides secure, enterprise‑grade access with a streamlined configuration experience.

For any questions, reach out to the team at azurefiles@microsoft.com.

AKS on AzureLocal: KMSv1 -> KMSv2

the-capricorn — Mon, 20 Apr 2026 10:05:49 GMT

Hey, quick question on AKS Arc — we're running moc-kms-plugin:0.2.172-official on an Arc-enabled AKS cluster on Azure Local and currently have KMSv1=true as a feature gate to keep encryption at rest working.

KMSv1 is deprecated in 1.28+ and we want to migrate to KMSv2 before it gets removed. Since moc-kms-plugin is a Microsoft-managed component we can't just swap it out ourselves.

A few questions:

Does version 0.2.172 already support the KMSv2 gRPC API, or is that coming in a later release?
Is there a supported migration path for AKS Arc specifically, or does this come automatically through a platform update?
Any docs or internal guidance you can point us to?

Thanks!