azure functions
From Coding Agents to Cloud Automation: AI-Assisted Customer Related Incidents in Azure Functions
On the Azure Functions team, we have been exploring how AI can help with investigating customer-reported incidents, root-cause analysis, and incident mitigation. This post shares our journey from early RCA agents to coding-agent-assisted investigations and cloud-hosted automation, and the lessons we learned along the way.

Microsoft engineering teams like Azure Functions work on production live-site issues alongside customer-reported issues, and this work is one of the most important and rewarding parts of the job. On the Azure Functions team, complex customer incidents often require deep investigation. Engineers review Azure Data Explorer (Kusto) query results, source code, GitHub issues, previous incidents, public documentation, internal troubleshooting guides, and service-specific operational knowledge. The goal is always to mitigate customer impact quickly, identify the root cause, feed what we learn back into the platform, and log improvement work items where needed.

This work is valuable, but it is also time-consuming. As AI capabilities improved, we started asking a practical question: could AI help us reduce the operational burden of incident investigation while preserving the learning and engineering judgment that make those investigations useful?

This is the story of how our approach evolved: from early RCA agents, to coding-agent-based workflows, and finally to cloud-hosted automation.

Starting with AI-Assisted RCA

Around May 2024, we began experimenting with an internal RCA agent together with a colleague from Microsoft Research. The first version was an informal precursor to a formal service: a personal tool to help with our own investigations. The early experiments were very useful. We could give the agent an incident, let it run for several minutes, and then review the analysis.
It did not always produce a perfect root cause, but it could run multiple queries, explore different hypotheses, and narrow the solution space enough to save time.

Later, Azure SRE Agent emerged as a formal internal service. We contributed to it based on what we had learned from our earlier experiments. At that point, using AI to help resolve customer-reported incidents became a major focus for our team.

What We Learned from Agentic Workflows

The first generation of AI-assisted incident workflows was highly structured. With the models available at the time, the early experiments required careful design. For generating complex Kusto queries in particular, we often needed to expose fixed Kusto queries as tools and let the model call them through well-defined parameters.

Fig 1. Kusto query tool

This made execution more predictable and reproducible, but it also revealed limitations. Detailed agentic workflows could work well when the incident matched the predefined path. Outside those paths, they were less flexible. Engineers also found it expensive to define and maintain those workflows, especially when the output felt only modestly better than a dashboard.

Fig 2. Agentic workflow

That experience taught us an important lesson: for complex operational investigations, flexibility matters as much as structure.

The Shift to Coding Agents

Near the end of 2025, we started using an internal tool built on GitHub Copilot and skills, which made it possible to define and share VS Code workspaces. A workspace could include agent definitions, instructions, prompts, skills, MCP configuration, and repositories.

Fig 3. GitHub Copilot internal tool

The quality difference was significant. Combined with newer models, coding agents could investigate incidents much more flexibly. They could run Kusto queries, inspect code, use CLI and MCP tools, and iterate quickly by trying different paths. The team quickly adopted this model.
With earlier workflow-based systems, engineers were reluctant to onboard because defining detailed workflows took effort and the payoff was limited. With this internal tool, engineers started contributing agent definitions and skills because the system was easy to extend. Over time, the Azure Functions team accumulated a growing set of AI-ready materials: agent definitions, skills, MCP tools, instructions, and repositories. A number of lessons stood out.

Lessons from Building AI-Ready Materials

Prefer guidance over over-specification

Modern coding agents are capable enough that they do not need every step spelled out. In fact, too many instructions can make the system brittle or stale. We found it better to provide concise guidance and point the agent to maintained sources of truth rather than embedding large amounts of detail directly into prompts.

Manage context deliberately

Instructions, tool definitions, conversation history, and tool outputs all compete for model context. Irrelevant or contradictory information can reduce quality. Tool design matters too: if a tool returns a large payload directly to the model, it can consume many tokens and confuse the agent. For large outputs, writing results to files and returning concise pointers often works better.

Use files as durable memory

Long-running investigations benefit from a simple pattern: create a plan and checklist file, update it as work progresses, and let the agent re-read it when needed. This helps the agent recover from context compaction and gives the investigation a durable state inside the workspace.

Prefer references over inline knowledge

Agent definitions and skills encode domain knowledge: internal troubleshooting guides, product behavior, operational history, and expert judgment. Instead of placing all that information directly in prompts, we found it more effective to provide references to where the knowledge lives and guidance on when to use it.
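As a rough illustration of the file-based lessons above (return a concise pointer instead of a large payload, and keep durable state in the workspace), the sketch below wraps a tool result. The helper name `save_and_summarize`, the output layout, and the summary fields are assumptions made for this example; they are not taken from the internal tooling described in this post.

```python
import json
from pathlib import Path
from typing import Any, Dict, List


def save_and_summarize(
    rows: List[Dict[str, Any]],
    out_dir: str,
    name: str,
    preview_rows: int = 3,
) -> Dict[str, Any]:
    """Write a large tool result to a file and return a small pointer.

    Hypothetical helper: the agent receives only the path, the row count,
    the column names, and a short preview, and can open the file on demand.
    """
    directory = Path(out_dir)
    directory.mkdir(parents=True, exist_ok=True)

    # Persist the full payload in the workspace instead of the model context.
    path = directory / f"{name}.json"
    path.write_text(json.dumps(rows, indent=2), encoding="utf-8")

    # Return only a compact description of what was written.
    columns = sorted({key for row in rows for key in row})
    return {
        "path": str(path),
        "row_count": len(rows),
        "columns": columns,
        "preview": rows[:preview_rows],
    }
```

The same directory can hold the plan and checklist file, so the agent has one place to re-read after context compaction.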
Make the right repositories visible Coding agents are strong at reading code. For our scenarios, multi-repository workspaces were especially powerful. When the agent could see related repositories together, it could trace behavior across components, understand dependencies, and produce better analysis. Domain knowledge matters most The best agent assets were often created by engineers with deep product and operational experience, not necessarily by AI specialists. The key skill was turning expert knowledge into instructions, references, and repository layouts that an agent could use. Facilitate and streamline domain knowledge updates Every incident not fully handled by an agent is a learning opportunity. Feed the context engineering flywheel: investigate, find gaps, update agent guidance, then re-test. It's important to keep this cycle quick and easy. Why We Moved Toward Cloud Automation Coding agents were extremely helpful, but they were still interactive tools. An engineer had to start the investigation and often guide it. For incident response, we wanted to go further. If an incident entered a specific feature area, the system should be able to start the investigation automatically, run the relevant analysis, and post useful results back to the incident. Even if the analysis was not perfect, narrowing the problem space early could reduce mitigation time. Some scenarios could eventually support automatic mitigation or automatic transfer to the right team. A local coding-agent workflow has advantages, especially because it can authenticate as the user. But as a foundation for reliable automation, it also had important limitations. First, it still depended on human involvement. AI dramatically improved individual productivity, but in incident response the bottleneck is often human attention and time. Even when starting an agent is simple, requiring an engineer to initiate the run introduces a context switch and consumes a scarce resource. 
Second, it depended on user credentials. Coding agents run with the user’s permissions, which can be overly broad for automation, and they inherit human-oriented flows such as browser-based reauthentication. For durable automation, we wanted an identity model better suited to unattended execution, such as managed identity. Third, there were execution-environment and security concerns. A local environment is powerful, but it does not naturally provide the sandboxing we wanted for safe automation. Because it runs with user access, it may also reach a much wider set of files and resources than is desirable for an automated incident workflow. Local and dev-box environments also have operational drawbacks. They can require restarts, contend with other workloads, and are not ideal for durable execution, failure recovery, or failover. For automation, we wanted a dedicated execution environment rather than something tied to an engineer’s machine. Finally, token management became an operational concern. User-linked token consumption can create instability when limits are reached, and automation can skew usage patterns so that one user appears to consume a disproportionate share of AI capacity. That adds noise to operational analysis and makes governance harder. For all of these reasons, cloud execution looked like the right direction. We wanted managed identity, a secure sandbox, durable execution, and a system that would not depend on someone’s local machine. Requirements for Cloud Automation Many of us had become strong supporters of coding agents and wanted to keep using them. Just as importantly, we had already accumulated assets that had been proven to work well: agent definitions, instructions, prompts, skills, MCP configuration, and repository layouts that the team had gradually built up and refined. That meant our move toward cloud automation was not about replacing coding agents with something entirely different. 
We wanted to preserve and reuse the assets that had made coding agents successful, while moving to an execution model that was better suited to automation.

At the same time, coding agents had set a high quality bar. Because they worked so well in practice, we were not willing to assume that a cloud service would automatically deliver the same level of quality. So we defined two concrete goals for the cloud path:
- Achieve the same level of quality we were seeing from our existing coding-agent workflows when run in a focused, one-shot investigation.
- Ensure the assets we had already built could continue to be used and improved.

In other words, we were not looking for just another cloud AI system. We were looking for a cloud automation path that could inherit the strengths of coding agents while providing the operational properties automation required.

Comparing the Headless Coding-Agent Execution Service and Azure SRE Agent

To evaluate which approach could meet those requirements, we ran a side-by-side comparison.

One path was a prototype headless coding-agent execution service. It reused the same internal-tool workspace definitions that engineers used locally, but ran them without a human in the loop. When an incident entered a target loop, the system created an agent workspace, prepared repositories, started GitHub Copilot CLI with an initial prompt, collected the analysis, and posted the result back to the incident. It also preserved session artifacts so that an engineer could later review or resume the investigation.

Fig 4. Agent Helped trend: the percentage of usefulness increases with the use of coding agents and the introduction of the headless coding-agent execution service and Azure SRE Agent.

The other path used Azure SRE Agent. Since our earlier experiments, it had improved with preview customer feedback and was nearing general availability.
It now supported newer models, stronger custom-agent behavior, MCP and built-in tools, repository access, and incident-triggered execution. We performed a one-time migration from the coding-agent assets to Azure SRE Agent assets. This was achieved in one day using GitHub Copilot CLI and our existing coding agents.

The comparison was deliberately practical. We already knew that our internal coding-agent environment produced results engineers trusted and liked. That became our quality bar. If Azure SRE Agent could meet or exceed that bar while also satisfying the operational requirements of cloud automation, it would be the stronger long-term path.

Results and Feedback Loop

The first results from the headless coding-agent execution service were very encouraging. In its first set of incidents, the RCA matched the SME conclusion in cases where the agent could safely process the incident. That showed that the assets we had built for local coding agents could transfer effectively into a headless scenario.

Azure SRE Agent also performed strongly from the beginning. The headless coding-agent execution service initially had slightly better analysis in some areas, but Azure SRE Agent was already good enough to be operationally useful.

We then built an evaluation framework that compared:
- Each agent’s RCA, confidence score, and mitigation steps
- The RCA and mitigation reason later provided by a human
- Auto-mitigation recommendations
- Path to auto-mitigation
- Auto-transfer recommendations
- Session-level execution issues

This evaluation became a feedback loop. Engineers reviewed interesting incidents, identified weaknesses, improved agent definitions and skills, and submitted pull requests. We also used agent assistance to generate improvement PRs from comparison reports.

Fig 5. LLM-as-judge side-by-side evaluation of the headless coding-agent execution service (blue) vs. Azure SRE Agent (green)

Within a few weeks, Azure SRE Agent’s quality consistently exceeded the headless coding-agent execution service baseline. At that point, we stopped posting headless results back to incidents and focused on improving the Azure SRE Agent path instead. We also automated synchronization from the internal coding-agent assets so improvements could continue to flow through pull requests.

That shift was important. It meant Azure SRE Agent was no longer just an interesting alternative; it had become the cloud path that could inherit what worked in coding agents while providing a better foundation for automation.

Why Cloud AI Started to Work Better

A common reaction to coding agents is that they feel much better than previous cloud AI experiences. Our experience suggests two main reasons: stronger models and improved access to the right context.

A coding agent sees a workspace. It can use instructions, skills, tools, repositories, and files. Traditional cloud AI systems often did not have access to the same set of resources. Once Azure SRE Agent could see similar assets (the right repositories, the right tools, and the right domain-specific knowledge), it could reach comparable or better quality.

The details of context compaction, tool execution, and orchestration matter. But the core principle is simpler: the agent needs to reach the right knowledge at the right time without carrying unnecessary context all the time.

That means the most important work is not only choosing a model or building a tool. It is creating high-quality AI-ready assets: concise instructions, useful skills, accurate references, well-structured repository access, and domain knowledge that was previously locked in people’s heads.
The cloud-hosted automation path also provided an immediate benefit: the issue analysis is stored in the cloud, not only on the developer's machine. Conclusions and investigations remain available for later review, and engineers can interact with them through the chat interface.

Fig 6. An example of the chat interface for Azure SRE Agent

Conclusion

Our journey started with a personal RCA assistant, moved through structured agentic workflows, accelerated with coding agents, and eventually led us back to a cloud-hosted automation path.

The lesson is not that coding agents or cloud agents are universally better. The lesson is that agent quality depends heavily on what the agent can see, how much irrelevant context it avoids, and whether domain experts have translated their knowledge into usable assets.

For us, the key was not abandoning coding agents. It was carrying their strengths forward into Azure SRE Agent and a cloud execution model that was better suited to automation. Modern agents are now capable enough to make that work worthwhile. For incident response, that opens the door to faster investigation, safer automation, and ultimately lower incident mitigation time for customers.

The Azure Functions team hopes this experience is useful to other teams exploring how to apply AI to complex engineering operations. In the next post, we plan to go deeper into the evaluation framework and how we automated the feedback loop behind these improvements.

Easy Auth Configuration for Logic App Standard through CI/CD
Problem Statement When Easy Auth (Azure App Service’s built-in authentication and authorization) is enabled on a Logic App Standard, users frequently report that they cannot open the run history. Specifically, the inputs and outputs of the trigger and actions fail to load on the run details page, even though the workflow itself runs and the user has access to the resource. Background — How Easy Auth Interacts with Logic Apps Easy Auth is a feature of Azure App Service. Every request that reaches a Logic App Standard is first routed through the App Service layer, and only then handed off to the Logic App runtime for further processing. When Easy Auth is enabled, App Service authenticates each incoming request and decides whether it should be allowed or blocked — before the Logic App runtime ever sees it. This dual-layer model is what causes the run-history symptom: The Logic App runtime authenticates run-history requests using a SAS token specific to that run, generated from the Logic App access keys. The portal calls that load the inputs and outputs of historical runs do not carry a bearer token — they carry the SAS. Because App Service only knows how to validate Easy Auth tokens (not SAS), it blocks these requests whenever unauthenticatedClientAction is set to disallow unauthenticated traffic. The request never reaches the runtime, so the runtime cannot apply its SAS validation, and the inputs/outputs panel stays empty. Solution There are two ways to fix this, depending on what your security policy allows. Option 1 — Allow unauthenticated requests The simplest fix is to configure Easy Auth to allow unauthenticated requests. This does not mean anyone can invoke the workflow. Instead, all calls (failed and successful) are routed through to the Logic App runtime, and the runtime decides how to handle them: A workflow trigger call with no token → the runtime applies its own auth (SAS, AAD, etc.) and rejects unauthorized invocations. 
A run-history call carrying a valid SAS → App Service marks it as “failed Easy Auth” but still forwards it; the runtime sees the valid SAS and returns the data. The underlying App Service platform has no knowledge of SAS or any other Logic-App-specific auth scheme, so letting the runtime arbitrate is what makes the run-history experience work. Option 2 — Keep Easy Auth strict, but exclude the runtime paths In many enterprises the security team will not permit “Allow unauthenticated requests.” For those cases, you can leave authentication required but add the runtime endpoints to the excludedPaths list, so App Service skips Easy Auth specifically for those calls. The Logic App runtime continues to authenticate them via SAS. Important: The Azure portal lets you toggle Easy Auth, but it does not expose the excludedPaths setting. You must configure it through ARM, Bicep, the REST API, or CLI — which is exactly why this needs to live in your CI/CD pipeline. There are two ways to apply this through CI/CD. Approach 1 — ARM Template ( Microsoft.Web/sites/config ) Add a Microsoft.Web/sites/config resource of type authsettingsV2 to the same ARM template that deploys the Logic App. 
Below is the sample template:

{
  "$schema": "https://schema.management.azure.com/schemas/2019-04-01/deploymentTemplate.json#",
  "contentVersion": "1.0.0.0",
  "parameters": {
    "logicAppName": { "type": "string" },
    "location": { "type": "string", "defaultValue": "[resourceGroup().location]" },
    "tenantID": { "type": "string" },
    "ClientID": { "type": "string" }
  },
  "variables": {},
  "resources": [
    {
      "type": "Microsoft.Web/sites",
      "apiVersion": "2022-03-01",
      "name": "[parameters('logicAppName')]",
      "location": "[parameters('location')]",
      "kind": "functionapp,workflowapp",
      "identity": { "type": "SystemAssigned" },
      "properties": {
        "serverFarmId": "<App Service Plan ID>",
        "siteConfig": {
          "appSettings": [
            { "name": "FUNCTIONS_EXTENSION_VERSION", "value": "~4" },
            { "name": "FUNCTIONS_WORKER_RUNTIME", "value": "dotnet" },
            { "name": "AzureWebJobsStorage", "value": "<Storage Account Connection String>" },
            { "name": "APP_KIND", "value": "workflowApp" }
          ]
        },
        "httpsOnly": true
      }
    },
    {
      "type": "Microsoft.Web/sites/config",
      "apiVersion": "2021-02-01",
      "name": "[concat(parameters('logicAppName'), '/authsettingsV2')]",
      "location": "[parameters('location')]",
      "properties": {
        "platform": { "enabled": true, "runtimeVersion": "~1" },
        "globalValidation": {
          "requireAuthentication": true,
          "unauthenticatedClientAction": "Return401",
          "excludedPaths": ["/runtime/*"]
        },
        "identityProviders": {
          "azureActiveDirectory": {
            "enabled": true,
            "registration": {
              "openIdIssuer": "[concat('https://sts.windows.net/', parameters('tenantID'), '/v2.0')]",
              "clientId": "[parameters('ClientID')]",
              "clientSecretSettingName": "OVERRIDE_USE_MI_FIC_ASSERTION_CLIENTID"
            },
            "login": { "disableWWWAuthenticate": false },
            "validation": {
              "jwtClaimChecks": {},
              "allowedAudiences": [],
              "defaultAuthorizationPolicy": {
                "allowedPrincipals": {},
                "allowedApplications": ["<LIST OF ALLOWED APPLICATIONS ID>"]
              }
            }
          }
        }
      },
      "dependsOn": [
        "[resourceId('Microsoft.Web/sites', parameters('logicAppName'))]"
      ]
    }
  ],
  "outputs": {}
}

Key things to notice
in the template: requireAuthentication: true and unauthenticatedClientAction: Return401 keep Easy Auth strict for the public surface. excludedPaths: ["/runtime/*"] carves out the runtime endpoints so the SAS-authenticated run-history calls aren’t blocked. allowedApplications lets you whitelist specific AAD app IDs that are allowed to call the workflow. Reference: Microsoft.Web/sites/config — authsettingsV2 (ARM template) · Bicep variant This is the easiest way to add or update Easy Auth on a new or existing Logic App. Approach 2 — REST API call as a post-deployment pipeline step If you’d rather keep your infra template lean (or you’re updating Easy Auth on a Logic App that already exists), add a step to your CI/CD pipeline that calls the App Service authsettingsV2 REST API after the Logic App infra deployment completes. The payload mirrors the properties block from the ARM example above — including excludedPaths: ["/runtime/*"] . This approach is useful when: The Logic App is provisioned by a different pipeline or team than the one owning auth configuration. You need to update Easy Auth settings without redeploying the site. You want to apply environment-specific values (tenant ID, client ID, allowed application list) at release time rather than template-compile time. Reference: Web Apps - Update Auth Settings V2 - REST API (Azure App Service) | Microsoft Learn · GlobalValidation Summary The “inputs/outputs don’t load on run history” symptom after enabling Easy Auth is caused by App Service blocking SAS-authenticated runtime calls before the Logic App runtime can see them. Either allow unauthenticated requests (and let the runtime do all the auth), or keep Easy Auth strict and exclude /runtime/* . 
Because the portal doesn’t expose excludedPaths, the production-grade fix is to deploy it through CI/CD, either by adding an authsettingsV2 config resource to your ARM template or by calling the App Service auth REST API as a pipeline step after deployment.

Performance Tuning and Scaling Optimization for Large-Scale Azure Workloads
Summary As cloud-native systems scale, performance challenges rarely stem from a single bottleneck. Instead, they emerge from the interaction between compute, orchestration, and data layers under load. This article captures a practical optimization journey of a high-volume Azure-based workload and highlights how controlled scaling, improved orchestration design, and proactive database maintenance can significantly outperform brute-force scaling. Introduction Distributed systems are often designed with the assumption that scaling out will solve performance issues. However, for orchestration-heavy and database-intensive workloads, this approach can introduce more problems than it solves. In this scenario, the system processed millions of transactional records through Azure Functions, Durable Functions, messaging pipelines, APIs, and SQL databases. As the workload grew, the platform began experiencing: CPU and memory spikes Slower SQL queries Service Bus throttling Increased retries and execution delays What stood out was that these issues were not due to insufficient resources, but due to inefficient execution patterns at scale. The optimization effort therefore focused on controlling how the system scaled and executed, rather than simply increasing capacity. Understanding Workload Behavior A critical early step was identifying the nature of the workload—specifically, whether it was CPU-heavy or data-heavy. Rethinking Scaling: More Is Not Always Better One of the most important lessons was that scaling out aggressively can degrade performance. As more function instances processed messages in parallel: Database calls increased sharply API traffic surged Lock contention intensified Retry rates increased This created a cascading effect where retries amplified load, further slowing down the system. 
To address this, scaling was intentionally controlled using: Concurrency limits on function execution Batch-based processing instead of full parallel fan-out Small delays to smooth traffic spikes Chunking of large datasets into manageable units This shift from maximum parallelism to controlled throughput significantly improved system stability. Compute Optimization: CPU and Memory After stabilizing scaling behavior, the next step was optimizing compute usage. CPU Optimization CPU spikes were largely caused by excessive parallel execution and orchestration overhead. Improvements included: Breaking large workloads into smaller units Reducing unnecessary fan-outs of processes Limiting concurrent executions This resulted in more predictable CPU usage and improved execution consistency. Memory Optimization Memory pressure was primarily driven by large payloads and batch processing. Optimizations focused on: Processing data in smaller chunks Avoiding large in-memory payloads and memory leaks Reducing orchestration state size These changes improved system reliability and reduced execution failures under load. Scaling Approaches: Practical Trade-Offs Both vertical and horizontal scaling were used, but with careful consideration. Scale Up (Vertical Scaling) Quick to implement No architectural changes required Useful for immediate stabilization However, it had cost and scalability limits. Scale Out (Horizontal Scaling) Better suited for long-term scalability Enables workload distribution But without control, it can: Increase database contention Amplify retries Introduce instability Key Insight The most effective approach was not choosing one over the other but combining both with strict control over concurrency and execution patterns. Durable Functions: Orchestration Optimization Durable Functions were central to the system, making orchestration design a key factor in performance. 
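The controlled-throughput techniques listed under the scaling discussion above (concurrency limits, batch-based processing, small delays, and chunking) can be sketched in a few lines. This is a minimal illustration using Python's standard `asyncio` library, not the workload's actual implementation; `process_in_batches`, `chunked`, and their parameters are names invented for this example, and in a real Function App the equivalent limits would more likely be applied through host and trigger configuration.

```python
import asyncio
from typing import Awaitable, Callable, Iterator, List, Sequence, TypeVar

T = TypeVar("T")
R = TypeVar("R")


def chunked(items: Sequence[T], size: int) -> Iterator[Sequence[T]]:
    """Split a large dataset into manageable, fixed-size chunks."""
    if size <= 0:
        raise ValueError("size must be positive")
    for start in range(0, len(items), size):
        yield items[start:start + size]


async def process_in_batches(
    items: Sequence[T],
    handler: Callable[[T], Awaitable[R]],
    batch_size: int,
    max_concurrency: int,
    pause_seconds: float = 0.0,
) -> List[R]:
    """Process items batch by batch with a ceiling on in-flight work."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def guarded(item: T) -> R:
        # The semaphore caps how many handlers run at the same time.
        async with semaphore:
            return await handler(item)

    results: List[R] = []
    for batch in chunked(items, batch_size):
        # gather() returns results in the same order as the inputs.
        results.extend(await asyncio.gather(*(guarded(i) for i in batch)))
        if pause_seconds:
            # A small delay after each batch smooths traffic spikes.
            await asyncio.sleep(pause_seconds)
    return results
```

The design choice here is to trade peak parallelism for a predictable ceiling on downstream calls, which is the same trade described in the text: fewer simultaneous database and API requests, and therefore fewer retries feeding back into the load.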
Challenges Observed

The initial design relied heavily on nested sub-orchestrators, which introduced:
- High orchestration overhead
- Increased replay and persistence operations
- Slower execution at scale

Key Improvements

Refactoring unnecessary sub-orchestrators into Activity Functions simplified execution and improved throughput. The benefits included:
- Reduced orchestration latency
- Faster execution cycles
- Lower infrastructure cost

Note: Sub-orchestrators remain the right choice when the design requires composing multiple dependent steps, managing scoped retry/error logic, or isolating orchestration history. The decision should be driven by the complexity and reuse requirements of each workflow segment, not applied as a blanket rule.

Improved Retry Strategy

Retry behavior was also optimized by redefining execution boundaries.

Previously:
- One activity processed multiple records
- A single failure triggered a retry of the entire batch

After optimization:
- One activity handled one logical unit of work

This enabled:
- Granular retries
- Better failure isolation
- Reduced duplicate processing

Database Hygiene: A Critical Foundation

The database emerged as a major bottleneck due to fragmentation and stale statistics caused by continuous high-volume operations.

Issues Identified
- Fragmented indexes
- Inefficient query plans
- Increased query execution time

Optimization Approach

A proactive maintenance strategy was implemented using scheduled jobs to:
- Update statistics regularly
- Rebuild indexes
- Maintain query performance consistency

Controlled Database Load

For heavy, long-running workloads in a multi-tenant architecture, database-intensive processes were intentionally run as singletons at the tenant level to reduce contention.
This approach:
- Prevented concurrent heavy operations
- Improved overall system stability
- Delivered more predictable throughput

Observability: Finding the Real Problem

A major challenge during optimization was distinguishing between symptoms and root causes. For example:
- Slow APIs were often caused by database contention
- High retries were triggered by upstream throttling
- Orchestration delays originated from downstream dependencies

To address this, end-to-end observability was established using:
- Application-level tracing
- Load testing correlations
- Cross-service telemetry analysis

This enabled accurate root cause identification and prevented misdirected optimization efforts.

Key Takeaways

Some key principles emerged from this optimization journey:
- Scaling more does not always mean performing better
- Controlled parallelism is more effective than unrestricted concurrency
- Orchestration design directly impacts system performance
- Database maintenance must be proactive
- Retry strategies should align with logical units of work
- Observability is essential for correct diagnosis

Conclusion

Performance tuning in distributed systems is less about adding resources and more about using them efficiently. By focusing on controlled scaling, simplifying orchestration, maintaining database health, and improving observability, the system achieved higher throughput, lower cost, and significantly improved stability. These lessons are broadly applicable to any Azure-based system handling large-scale, orchestration-heavy workloads and can help teams design more predictable and resilient architectures.

Why Does Azure App Service Return HTTP 404?
When an application deployed to Azure App Service suddenly starts returning HTTP 404 (Not Found), it can be confusing, especially when:
- The deployment completed successfully
- The App Service shows as Running
- No obvious errors appear in the portal

This behaviour is more common than it appears and is often linked to routing, configuration, or platform-specific behavior.

In this article, I’ll walk through real-world reasons why Azure App Service can return HTTP 404 errors. The goal is to help you systematically isolate the root cause, whether it’s application-level, configuration-related, or platform-specific.

What Does HTTP 404 Mean in Azure App Service?

An HTTP 404 response from Azure App Service means that the incoming request successfully reached Azure App Service, but neither the platform nor the application could locate the requested resource.

This distinction is important. Unlike connectivity or DNS issues, a 404 confirms that:
- DNS resolution worked
- The request hit the App Service front end
- The failure happened after request routing

1. Incorrect Application URL or Route

This is the most common cause of 404 errors.

Typical scenarios
- Accessing the root URL (https://<app>.azurewebsites.net) for a Web API that exposes only API routes
- Missing route prefixes such as /api or /v1, or missing controller/action name segments
- Case sensitivity mismatches on Linux App Service

Example

https://myapp.azurewebsites.net returns 404, but https://myapp.azurewebsites.net/weatherforecast works as expected.

✅ Tip: Always validate your routing locally and confirm the exact same path is being accessed in Azure.

2. Application Appears Running, but Startup Failed Partially

It is possible for an App Service to show Running even when the application failed to initialize fully.
Common causes
- Missing or incorrect environment variables
- Invalid connection strings
- Exceptions thrown in Program.cs / Startup.cs
- Dependency initialization failures at startup

In such scenarios, the app may start the host process but fail to register routes, resulting in 404 responses instead of 500 errors.

✅ Where to check
- Application logs
- Deployment logs
- Kudu → LogFiles

Static Files Not Found or Not Being Served
For applications hosting static content (HTML, JavaScript, images, JSON files), a 404 can occur even when files exist.

Common reasons
- Files not deployed to the expected directory (wwwroot, /home/site/wwwroot)
- Missing or unsupported MIME type configuration (commonly seen with .json)
- Static file middleware not enabled in ASP.NET Core applications

✅ Quick validation: Deploy a simple test.html to wwwroot and try accessing it directly.

Windows vs Linux App Service Differences
Behaviour can differ significantly between Windows App Service and Linux App Service.

Common pitfalls on Linux
- Case-sensitive file paths (Index.html ≠ index.html)
- Missing or incorrect startup command
- Differences in request routing handled by Nginx

✅ Tip: If the app works on Windows App Service but fails on Linux, always recheck file casing and startup configuration first.

Custom Domain and Networking Configuration Issues
In some cases, requests reach the App Service but fail due to domain or network constraints.

Possible causes
- Incorrect custom domain binding

✅ Isolation step: Always test using the default *.azurewebsites.net hostname to determine whether the issue is domain-specific.

6. Health Checks or Monitoring Probes Targeting Invalid Paths
Seeing periodic 404 entries in logs, for example every few minutes, is often a sign of misconfigured probes.

Typical scenarios
- App Service Health Check configured with a non-existent endpoint
- External monitoring tools probing /health or other paths that do not exist

✅ Fix: Ensure the health check path maps to a valid endpoint implemented by the application.
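To make the "periodic 404" signal concrete, here is a minimal Python sketch that flags paths whose 404 responses arrive at near-constant intervals, which is the pattern a misconfigured probe tends to leave behind. The input format (tuples of timestamp in seconds, path, and status) and the thresholds `min_hits` and `max_jitter_ratio` are assumptions made for illustration; they are not an App Service log schema, so adapt the parsing to whatever your logs actually contain.

```python
from collections import defaultdict
from statistics import mean, pstdev


def find_periodic_404_paths(entries, min_hits=4, max_jitter_ratio=0.1):
    """Return {path: average_gap_seconds} for paths with regularly spaced 404s.

    `entries` is an iterable of (timestamp_seconds, path, status) tuples.
    This tuple layout is a hypothetical, pre-parsed form of an access log.
    """
    hits = defaultdict(list)
    for timestamp, path, status in entries:
        if status == 404:
            hits[path].append(timestamp)

    periodic = {}
    for path, times in hits.items():
        if len(times) < min_hits:
            continue  # too few samples to call it a pattern
        times.sort()
        gaps = [later - earlier for earlier, later in zip(times, times[1:])]
        average_gap = mean(gaps)
        # Near-constant gaps (low spread relative to the mean) suggest a probe.
        if average_gap > 0 and pstdev(gaps) / average_gap <= max_jitter_ratio:
            periodic[path] = average_gap
    return periodic
```

A path that shows up here with a gap close to your probe interval is a good candidate to compare against the configured health check path and any external monitors.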
7. Missing or Corrupted Deployment Artifacts
Even when deployments report success, application files may not be where the runtime expects them.

Commonly observed with
- Zip deployments
- WEBSITE_RUN_FROM_PACKAGE misconfigurations
- Partial or interrupted deployments

✅ Verify using Kudu: Browse /home/site/wwwroot and check that the files are present.

Quick Troubleshooting Checklist
If your Azure App Service is returning HTTP 404:
- Verify the exact URL and route
- Test a static file (for example, /hostingstart.html)
- Review startup and application logs
- Inspect deployed artifacts via Kudu
- Validate Windows vs Linux behaviour differences
- Review networking, authentication, and health check settings

8. Application Gateway in Front of App Service
If you have Application Gateway in front of App Service, check the rewrite rules so that requests are forwarded to the correct path.

Final Thoughts
HTTP 404 errors on Azure App Service are rarely random. In most cases, they point to:
- Routing mismatches
- Startup or configuration failures
- Platform-specific behaviour differences

By breaking the investigation into platform → configuration → application, you can systematically narrow down the root cause and resolve the issue. Happy debugging 🚀

How to Troubleshoot Azure Functions Service Bus Trigger Issues
Overview Azure Functions integrates with Azure Service Bus via triggers and bindings, allowing you to build event-driven applications that react to queue and topic messages. The Service Bus trigger uses PeekLock mode to receive messages, automatically manages message locks, and completes or abandons messages based on function execution results. When this integration encounters problems, you may see one or more of these symptoms: Messages accumulate in the queue or topic subscription and are not processed Functions execute but messages end up in the dead-letter queue (DLQ) MessageLockLostException or ServiceBusException errors in Application Insights Messages are processed multiple times (duplicate processing) The function app shows connection failures or AMQP errors in logs Trigger scaling does not work as expected — too few or too many instances Session-enabled queues stop processing after a period of time This blog walks you through how the Service Bus trigger works internally, what can go wrong, and — most importantly — how to systematically diagnose and resolve these failures. Understanding How the Service Bus Trigger Works Before diving into troubleshooting, it is important to understand how the Service Bus trigger processes messages. Message Processing Flow Service Bus Namespace (Queue or Topic/Subscription) → Functions runtime discovers serviceBusTrigger binding → ServiceBusProcessor created (PeekLock mode) → Message received → Lock acquired → Function invoked with message payload → Function succeeds → Message Completed ✓ → Function fails → Message Abandoned → Redelivered → Max delivery count reached → Dead-Letter Queue The Functions runtime uses the Azure.Messaging.ServiceBus SDK under the hood. It creates a ServiceBusProcessor (or ServiceBusSessionProcessor for session-enabled entities) that manages the message receive loop, lock renewal, and concurrency. Key Concepts Concept Description PeekLock The default receive mode. 
The message is locked for a duration (default 30 seconds at the entity level) and must be completed or abandoned. Auto-Complete By default (autoCompleteMessages: true), the runtime calls Complete on success and Abandon on failure. You can disable this to handle settlement in your own code. Lock Renewal If function execution takes longer than the lock duration, the runtime automatically renews the lock up to maxAutoLockRenewalDuration (default 5 minutes). Concurrency maxConcurrentCalls (default 16) controls how many messages are processed in parallel per instance. On multi-core plans, this is multiplied by the core count. Prefetch prefetchCount (default 0) controls how many messages are pre-fetched from the broker to improve throughput. Dead-Letter Queue Messages that exceed the maximum delivery count (set on the Service Bus entity, default 10) are moved to the DLQ instead of being redelivered. host.json Configuration Reference All Service Bus trigger settings are configured under the extensions.serviceBus section of host.json: { "version": "2.0", "extensions": { "serviceBus": { "clientRetryOptions":{ "mode": "exponential", "tryTimeout": "00:01:00", "delay": "00:00:00.80", "maxDelay": "00:01:00", "maxRetries": 3 }, "prefetchCount": 0, "transportType": "amqpWebSockets", "webProxy": "https://proxyserver:8080", "autoCompleteMessages": true, "maxAutoLockRenewalDuration": "00:05:00", "maxConcurrentCalls": 16, "maxConcurrentSessions": 8, "maxMessageBatchSize": 1000, "minMessageBatchSize": 1, "maxBatchWaitTime": "00:00:30", "sessionIdleTimeout": "00:01:00", "enableCrossEntityTransactions": false } } } Note: The clientRetryOptions settings apply only to interactions with the Service Bus service. They do not affect retries of function executions. For function-level retries, see Azure Functions error handling and retries. 
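As a rough way to reason about the concurrency and lock renewal settings above, the following Python sketch reads the serviceBus section of a host.json document and derives two numbers: an upper bound on messages per minute for one instance, and whether maxAutoLockRenewalDuration covers an expected per-message execution time. The function names and the `expected_seconds_per_message` input are hypothetical. The throughput figure is a simplification that assumes a single core, single-message (non-batch) processing, and no prefetch or network overhead, and the parser only handles the hh:mm:ss form used in the sample above.

```python
import json
from datetime import timedelta


def parse_timespan(value: str) -> timedelta:
    """Parse an 'hh:mm:ss' or 'hh:mm:ss.ff' string as used in host.json."""
    hours, minutes, seconds = value.split(":")
    return timedelta(hours=int(hours), minutes=int(minutes), seconds=float(seconds))


def summarize_service_bus_settings(host_json: str, expected_seconds_per_message: float) -> dict:
    """Estimate per-instance throughput and lock renewal headroom."""
    settings = json.loads(host_json).get("extensions", {}).get("serviceBus", {})
    # Fallback values are the defaults quoted in the Key Concepts section.
    max_calls = settings.get("maxConcurrentCalls", 16)
    renewal = parse_timespan(settings.get("maxAutoLockRenewalDuration", "00:05:00"))
    return {
        # Upper bound: every concurrent slot is busy for the whole minute.
        "max_messages_per_minute_per_instance": max_calls * 60 / expected_seconds_per_message,
        # False means the lock can expire while the function is still running.
        "lock_renewal_covers_execution": renewal.total_seconds() >= expected_seconds_per_message,
    }
```

For example, with maxConcurrentCalls set to 1 and one second per message, the estimate is 60 messages per minute per instance, which is a quick sanity check against the backlog growth you observe.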
Issue Categories

| Category | Typical Symptoms | Root Cause Area |
| --- | --- | --- |
| Connection | AMQP errors, timeout, function not triggering | Connection string, network, firewall |
| Authentication | 401/403 errors, unauthorized access | Managed identity, RBAC, SAS policy |
| Message Lock | MessageLockLostException, duplicate processing | Long-running functions, lock duration mismatch |
| Dead-Letter | Messages going to DLQ unexpectedly | Function exceptions, max delivery count |
| Scaling | Messages accumulating, underscaling | Target-based scaling, host settings |
| Configuration | Trigger not firing, entity not found | host.json, app settings, binding attributes |
| Session | Session processing stops | Session lock, idle timeout, concurrency |
| Networking | Timeout in VNet-integrated apps | NSG, private endpoints, DNS |

Common Causes and Solutions

1. Connection String or Configuration Errors

Symptoms:
- Function does not trigger at all
- Error: "MessagingEntityNotFoundException" (queue or topic not found)
- Error: "No connection string configured for the Service Bus trigger"
- Error referencing an invalid or missing app setting

Why This Happens:
The Service Bus trigger requires a valid connection to your Service Bus namespace. By default, it looks for an app setting named AzureWebJobsServiceBus. If you specify a custom Connection property on the trigger attribute, the runtime looks for that named setting instead. If the connection string is missing, invalid, or points to the wrong namespace, the trigger cannot create a ServiceBusProcessor and messages will not be processed.
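The lookup rule described above can be sketched as a small offline check over a dictionary of app settings. This Python example is an illustration only: `check_connection_settings` is a hypothetical helper, not part of the Functions runtime, and it does not contact Service Bus. It assumes the public cloud namespace suffix and a SAS-style connection string with Endpoint, SharedAccessKeyName, and SharedAccessKey parts; other valid forms would need extra handling.

```python
REQUIRED_PARTS = ("Endpoint", "SharedAccessKeyName", "SharedAccessKey")


def check_connection_settings(app_settings, connection_name="AzureWebJobsServiceBus"):
    """Return a list of problems found; an empty list means none were found.

    `app_settings` is a plain dict of app setting names to values, for
    example exported from the Function App configuration.
    """
    identity_key = f"{connection_name}__fullyQualifiedNamespace"

    # Identity-based connection: only the namespace host name is configured.
    if identity_key in app_settings:
        namespace = app_settings[identity_key]
        if not namespace.endswith(".servicebus.windows.net"):
            return [f"{identity_key} should look like <namespace>.servicebus.windows.net"]
        return []

    value = app_settings.get(connection_name)
    if not value:
        return [f"Neither {connection_name} nor {identity_key} is set"]

    # Connection strings are 'Key=Value' pairs separated by semicolons.
    # Split on the first '=' only, because keys often end with '=' padding.
    parts = dict(pair.split("=", 1) for pair in value.split(";") if "=" in pair)
    return [f"{connection_name} is missing {name}" for name in REQUIRED_PARTS if not parts.get(name)]
```

Pass the value of the trigger's Connection property as `connection_name`; leaving it at the default mirrors a trigger that does not set Connection at all.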
How to Verify: Check your trigger attribute for the Connection property value:[ServiceBusTrigger("myqueue", Connection = "ServiceBusConnection")] Navigate to your Function App → Settings → Configuration → Application settings Verify the connection setting exists and is correctly named For connection string–based connections, confirm the value contains a valid endpoint, SharedAccessKeyName, and SharedAccessKey For managed identity connections, confirm <CONNECTION_NAME>__fullyQualifiedNamespace is set to <your-namespace>.servicebus.windows.net Solution: Set the correct connection string or managed identity configuration in Application Settings Verify the queue or topic name in the trigger attribute matches the actual entity name in your Service Bus namespace (names are case-sensitive) If using managed identity, ensure the __fullyQualifiedNamespace suffix is used (with double underscores): { "ServiceBusConnection__fullyQualifiedNamespace": "myservicebus.servicebus.windows.net" } Ref: Service Bus trigger — Connections 2. Authentication and Authorization Failures (RBAC / SAS) Symptoms: Error: "Unauthorized access. 'Listen' claim(s) are required to perform this operation." Error: "AuthorizationFailedException" or "UnauthorizedException" Error: "Attempted to perform an unauthorized operation." 401 or 403 errors in Application Insights Why This Happens: The Service Bus trigger requires Listen permission on the queue or subscription. If you are using a Shared Access Signature (SAS) policy that does not include the Listen claim, or a managed identity without the correct RBAC role, the runtime cannot receive messages. For managed identity connections, the identity must be assigned the Azure Service Bus Data Receiver role (or Azure Service Bus Data Owner) at the appropriate scope. For topic subscriptions, the role assignment must have effective scope over the subscription resource, not just the topic. 
How to Verify: For SAS-based connections: Go to your Service Bus namespace → Shared access policies Confirm the policy used in your connection string has the Listen claim If your function also sends messages (output binding), the policy needs Send as well For managed identity: Go to your Service Bus namespace → Access control (IAM) → Role assignments Verify your Function App's managed identity has Azure Service Bus Data Receiver For topic triggers, verify the role is assigned at the subscription level (not just the topic) Solution: For SAS: Use a policy that has the required claims, or create a new policy with Listen (and Send if needed) For managed identity: Assign the correct role. Use the Azure CLI if the portal does not expose the subscription resource as a scope: Ref: Grant permission to the identity 3. Message Lock Lost Exceptions Symptoms: Error: "MessageLockLostException: The lock supplied is invalid. Either the lock expired, or the message has already been removed from the queue." Messages are processed but then redelivered (duplicate processing) Messages eventually end up in the dead-letter queue after repeated failures Why This Happens: When the Service Bus trigger receives a message in PeekLock mode, it acquires a lock for a duration configured on the Service Bus entity (default 30 seconds). The Functions runtime automatically renews this lock while your function is executing, up to the maxAutoLockRenewalDuration (default 5 minutes). A MessageLockLostException occurs when: Function execution exceeds maxAutoLockRenewalDuration — If your function takes longer than 5 minutes (the default), the lock renewal stops and the lock expires. The message becomes available for redelivery. Lock renewal fails due to a transient error — A network blip or Service Bus throttling can prevent a renewal request from succeeding. 
The entity's lock duration is very short: if the lock duration on the queue or subscription is set lower than the time between renewal attempts, the lock may expire between renewals. Batch processing with long execution times: automatic lock renewal is not supported for batch-triggered functions, so maxAutoLockRenewalDuration does not extend the locks and every message in the batch must be settled within the entity-level lock duration.

How to Verify:
- Check Application Insights for MessageLockLostException entries and note the function execution duration
- Compare the execution duration against your maxAutoLockRenewalDuration setting
- Check the lock duration on your Service Bus entity: go to Service Bus namespace → Queue or Topic/Subscription → Properties and note the Lock duration value

Solution:
- Increase maxAutoLockRenewalDuration in host.json to exceed your longest expected function execution time:
  { "version": "2.0", "extensions": { "serviceBus": { "maxAutoLockRenewalDuration": "00:10:00" } } }
- Increase the entity's lock duration on the Service Bus queue or subscription (maximum 5 minutes) to provide a larger window between renewal attempts
- Optimize function execution time. If your function is doing heavy processing, consider:
  - Offloading work to a Durable Functions orchestration
  - Using a queue-based load leveling pattern
  - Breaking long operations into smaller units
- For batch functions, reduce maxMessageBatchSize so that each batch completes within the entity's lock duration, since automatic lock renewal does not apply to batches

Important: maxAutoLockRenewalDuration only applies to single-message functions. For batch functions, the message lock is governed by the entity-level lock duration setting.

4.
Messages Going to the Dead-Letter Queue Symptoms: Messages appear in the dead-letter queue (DLQ) instead of being processed The DeadLetterReason on the dead-lettered message shows MaxDeliveryCountExceeded Function logs show repeated exceptions for the same message Some messages process successfully while others consistently fail Why This Happens: When a function throws an unhandled exception, the runtime calls Abandon on the message (when autoCompleteMessages is true). The message is returned to the queue and its DeliveryCount is incremented. Once the delivery count reaches the entity's Max delivery count (default 10), the message is automatically moved to the DLQ by Service Bus. Common reasons messages repeatedly fail: Poison messages with malformed or unexpected content Transient dependency failures (database, external API) that affect all retries Deserialization errors when the message body does not match the expected type Application bugs triggered by specific message content How to Verify: Check the dead-letter queue using Service Bus Explorer (Azure Portal → Service Bus namespace → Queue → Service Bus Explorer → Dead-letter tab) Inspect the DeadLetterReason and DeadLetterErrorDescription properties on the dead-lettered messages Check Application Insights for exceptions correlated with the message IDs Review the DeliveryCount on the messages — if it equals the max delivery count, the message was redelivered until it was DLQ'd Solution: Fix the root cause — Examine the dead-lettered messages and the corresponding exceptions to identify why processing fails Add error handling — Implement try-catch logic and decide whether to complete, dead-letter, or abandon the message explicitly: [Function(nameof(ProcessMessage))] public async Task ProcessMessage( [ServiceBusTrigger("myqueue", Connection = "ServiceBusConnection", AutoCompleteMessages = false)] ServiceBusReceivedMessage message, ServiceBusMessageActions messageActions) { try { // Process the message await 
ProcessAsync(message); await messageActions.CompleteMessageAsync(message); } catch (InvalidDataException) { // Poison message — send to DLQ with a reason await messageActions.DeadLetterMessageAsync(message, "InvalidData", "Message body could not be deserialized."); } catch (Exception ex) { // Transient failure — abandon for retry _logger.LogError(ex, "Processing failed, abandoning message {MessageId}", message.MessageId); await messageActions.AbandonMessageAsync(message); } } Increase max delivery count on the Service Bus entity if you need more retry attempts before dead-lettering Process the DLQ — Set up a separate function or process to monitor and handle dead-lettered messages Tip: Use ServiceBusMessageActions with AutoCompleteMessages = false. This prevents the runtime from attempting to complete messages after a successful function invocation. 5. Duplicate Message Processing Symptoms: Business logic executes more than once for the same message Database records or downstream operations are duplicated Logs show the same MessageId processed by multiple instances or multiple times on the same instance Why This Happens: Duplicate processing can occur in several scenarios: Message lock lost — If the lock expires (see Issue 3), the message becomes available and is picked up again — either by the same or a different instance Function timeout — If the function exceeds the functionTimeout in host.json (default 5 minutes for Consumption, 30 minutes for Premium/Dedicated), the runtime cancels the invocation but the message may have already been partially processed Instance restarts — If the Function App instance restarts or is scaled down during processing, in-flight messages are abandoned and redelivered At-least-once delivery — Service Bus guarantees at-least-once delivery. 
In rare cases, a message may be delivered more than once even without lock expiration How to Verify: Search Application Insights for the same MessageId appearing in multiple invocations: traces | where message has "MessageId" | summarize count() by tostring(customDimensions["MessageId"]) | where count_ > 1 Check if MessageLockLostException precedes the duplicate invocation Review functionTimeout settings in host.json Solution: Make your function idempotent — Design processing logic so that executing it multiple times with the same message produces the same result. Common patterns: Use the MessageId as a deduplication key Use upserts instead of inserts in your database Check for an existing record before processing Enable duplicate detection on the Service Bus entity: Set requiresDuplicateDetection: true when creating the queue or topic Configure duplicateDetectionHistoryTimeWindow (default 10 minutes) Address lock expiration — Follow the guidance in Issue 3 to prevent lock-related redelivery Use sessions for ordered, exactly-once-per-session processing when your business logic requires it 6. Scaling Issues — Messages Accumulating in the Queue Symptoms: Message count in the queue or subscription grows steadily Only one or a few instances are running despite a large backlog Target-based scaling does not appear to be working Messages are processed very slowly Why This Happens: Azure Functions uses target-based scaling for Service Bus triggers on Consumption, Elastic Premium and Flex Consumption plan. The scale controller monitors the entity's message count and active message count to decide how many instances to allocate. Scaling issues can arise from: maxConcurrentCalls is too low — Each instance processes at most maxConcurrentCalls messages concurrently. If this is set to 1 and messages take 1 second each, a single instance can only process ~60 messages/minute. 
functionTimeout or long processing — If each message takes a long time, fewer messages are processed per instance and scale-out is needed. Consumption plan cold start — New instances take time to spin up and establish connections. Premium plan with VNET_ROUTE_ALL — VNet integration can slow cold starts due to DNS resolution and private endpoint setup. Batch size misconfigured — For batch-triggered functions, a very large maxMessageBatchSize with long processing per message can bottleneck throughput. How to Verify: Check the active message count on your Service Bus entity over time Review the instance count in Metrics → Function App → Instance Count Check Application Insights for function invocation durations Verify maxConcurrentCalls and other settings in host.json Solution: Increase maxConcurrentCalls if your function can safely handle more parallelism: { "version": "2.0", "extensions": { "serviceBus": { "maxConcurrentCalls": 32 } } } Use prefetchCount to reduce latency by pre-fetching messages from the broker: { "version": "2.0", "extensions": { "serviceBus": { "prefetchCount": 32 } } } Use batched functions for high-throughput scenarios — process multiple messages per invocation: [Function(nameof(ProcessBatch))] public void ProcessBatch( [ServiceBusTrigger("myqueue", Connection = "ServiceBusConnection", IsBatched = true)] ServiceBusReceivedMessage[] messages) { foreach (var message in messages) { // Process each message } } Optimize function execution time — Reduce the per-message processing duration to allow higher throughput per instance For Premium plans, consider setting FUNCTIONS_WORKER_PROCESS_COUNT to use multiple language worker processes per instance for out-of-process language workers 7. 
Session-Enabled Queue or Subscription Issues Symptoms: Session processing stops after some time Error: "SessionLockLostException" Only some sessions are being processed while others are idle Sessions appear "stuck" and messages accumulate Why This Happens: When IsSessionsEnabled = true on the trigger, the runtime creates a ServiceBusSessionProcessor. This processor acquires a session lock, processes messages for that session, and then moves to the next session. Issues can arise from: maxConcurrentSessions is too low — The default is 8. If you have many active sessions, some will wait for a processor to become available. sessionIdleTimeout is too short — When no messages arrive for a session within this timeout, the session is released. If messages arrive slightly after the timeout, a new session lock must be acquired, adding latency. Long-running session processing — If processing a message within a session takes longer than the session lock duration, a SessionLockLostException occurs. Single-threaded per session — Within a session, messages are processed sequentially (FIFO). If one message in a session takes very long, it blocks subsequent messages in that session. How to Verify: Check Application Insights for SessionLockLostException Review the maxConcurrentSessions and sessionIdleTimeout settings in host.json Monitor the number of active sessions on your Service Bus entity Solution: Increase maxConcurrentSessions to process more sessions in parallel: { "version": "2.0", "extensions": { "serviceBus": { "maxConcurrentSessions": 32, "sessionIdleTimeout": "00:02:00" } } } Increase maxAutoLockRenewalDuration to prevent session lock expiration during long-running processing Optimize per-message processing time within sessions Review your session design — If you have a very large number of sessions with low message volume per session, consider whether sessions are the right pattern for your use case 8. 
AMQP Connection and Network Errors Symptoms: Error: "An AMQP error occurred (condition: 'amqp:link:detach-forced')." Error: "ServiceBusCommunicationException" or "SocketException" Error: "The link 'xxx' is force detached... due to broker shutting down" Intermittent connection drops and slow reconnects Trigger stops firing after a period of working correctly Why This Happens: The Service Bus trigger communicates with the Service Bus namespace over AMQP (TCP port 5671/5672). Connection issues can occur when: Network firewall blocks AMQP ports — Corporate firewalls or NSGs may block the required ports VNet integration without proper routing — Missing service endpoints, private endpoints, or DNS configuration Service Bus namespace throttling — Exceeding the messaging units for your tier causes throttling responses Idle connection timeout — Long-idle connections may be terminated by intermediate network devices Service Bus service maintenance — Broker restarts or failovers can force-detach links How to Verify: Check Application Insights for ServiceBusCommunicationException or AMQP-related errors Test connectivity from your Function App's network context: For VNet-integrated apps: use Diagnose and solve problems → Network Troubleshooter Test DNS resolution for <namespace>.servicebus.windows.net Test TCP connectivity on port 5671 Check Service Bus namespace metrics for throttling (ThrottledRequests metric) Review NSG rules on the Function App's subnet Solution: Allow AMQP traffic — Ensure ports 5671 and 5672 are open outbound in your NSG/firewall rules. Alternatively, switch to WebSockets: { "version": "2.0", "extensions": { "serviceBus": { "transportType": "amqpWebSockets" } } } Using amqpWebSockets routes traffic over port 443, which is more likely to be allowed by corporate firewalls. 
Configure private endpoints for VNet-integrated apps: Create a private endpoint for your Service Bus namespace Configure private DNS zone privatelink.servicebus.windows.net Ensure DNS zone is linked to your VNet Scale up the Service Bus tier if throttling is the issue — check the namespace's messaging units and consider upgrading from Basic to Standard or Premium Configure retry options in host.json for transient failures: { "version": "2.0", "extensions": { "serviceBus": { "clientRetryOptions": { "mode": "exponential", "maxRetries": 5, "delay": "00:00:01", "maxDelay": "00:01:00", "tryTimeout": "00:02:00" } } } } 9. Extension Bundle or NuGet Package Version Mismatch Symptoms: Error: "The 'serviceBusTrigger' binding type is not registered" Error: "Microsoft.Azure.WebJobs.Host: Error indexing method..." Function works locally but fails in Azure Missing features (e.g., ServiceBusMessageActions, IsBatched) that should be available Why This Happens: The Service Bus trigger implementation lives in the extension package. For non-compiled languages (Node.js, Python, PowerShell, Java) it is delivered via extension bundles. For compiled .NET apps, it comes from NuGet packages. If the version is outdated or mismatched, trigger types may not be registered or newer features may be unavailable. 
| App Type | Package Source |
| --- | --- |
| .NET Isolated | Microsoft.Azure.Functions.Worker.Extensions.ServiceBus (NuGet) |
| .NET In-Process | Microsoft.Azure.WebJobs.Extensions.ServiceBus (NuGet) |
| Node.js, Python, Java, PowerShell | Extension bundle in host.json |

How to Verify:
- For .NET apps: Check the version of the Service Bus extension NuGet package in your .csproj file
- For non-.NET apps: Check the extensionBundle version range in host.json
- Compare against the latest available versions on NuGet

Solution:
- For .NET Isolated apps, update to the latest extension:
  <PackageReference Include="Microsoft.Azure.Functions.Worker.Extensions.ServiceBus" Version="5.22.0" />
- For non-.NET apps, ensure your extension bundle is current:
  { "version": "2.0", "extensionBundle": { "id": "Microsoft.Azure.Functions.ExtensionBundle", "version": "[4.*, 5.0.0)" } }
- For features like ServiceBusMessageActions and IsBatched, ensure you are on extension version 5.14.1 or later

10. Function Timeout Causing Message Redelivery

Symptoms:
- Function execution is cancelled mid-processing
- CancellationToken is triggered before the function completes
- Messages are redelivered and may eventually end up in the DLQ
- Application Insights shows FunctionTimeoutException

Why This Happens:
Azure Functions enforces a maximum execution timeout per invocation. The default depends on your hosting plan (Ref: Function app timeout duration):

| Plan | Default Timeout | Maximum Timeout |
| --- | --- | --- |
| Consumption | 5 minutes | 10 minutes |
| Flex Consumption | 30 minutes | Unlimited |
| Premium | 30 minutes | Unlimited |
| Dedicated (App Service) | 30 minutes | Unlimited |

If your Service Bus-triggered function exceeds this timeout, the runtime cancels the invocation. The message is abandoned and redelivered by Service Bus.
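The plan limits in the timeout table can be turned into a quick offline check. The Python sketch below encodes those defaults and maximums and reports whether an expected execution time would hit the timeout. The plan keys and both helper names (`effective_timeout`, `will_time_out`) are hypothetical and exist only for this illustration; the numbers are copied from the table rather than read from Azure.

```python
from datetime import timedelta

# (default, maximum) per plan, copied from the timeout table.
# A maximum of None stands for "Unlimited".
PLAN_TIMEOUTS = {
    "Consumption": (timedelta(minutes=5), timedelta(minutes=10)),
    "Flex Consumption": (timedelta(minutes=30), None),
    "Premium": (timedelta(minutes=30), None),
    "Dedicated": (timedelta(minutes=30), None),
}


def effective_timeout(plan, configured=None):
    """Return the timeout that applies, given an optional functionTimeout."""
    default, maximum = PLAN_TIMEOUTS[plan]
    if configured is None:
        return default
    if maximum is not None and configured > maximum:
        raise ValueError(f"{plan} allows at most {maximum}, got {configured}")
    return configured


def will_time_out(plan, expected_duration, configured=None):
    """True when the expected execution time exceeds the applicable timeout."""
    return expected_duration > effective_timeout(plan, configured)
```

For instance, a seven-minute handler on the Consumption plan is flagged with the default timeout but passes once functionTimeout is raised to ten minutes; anything longer needs a different plan or a different design.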
How to Verify: Check Application Insights for FunctionTimeoutException Review function execution durations in Application Insights: requests | where name == "ProcessMessage" | summarize avg(duration), max(duration), percentile(duration, 95) by bin(timestamp, 1h) Check the functionTimeout setting in host.json Solution: Increase functionTimeout in host.json (within plan limits): { "version": "2.0", "functionTimeout": "00:10:00" } Upgrade your plan if you need longer execution times — Premium and Dedicated plans support unlimited timeout Optimize processing — Offload long-running work to Durable Functions, or use the claim-check pattern to move heavy payloads out of the message Use the CancellationToken to gracefully handle timeout and avoid partial processing: [Function(nameof(ProcessMessage))] public async Task ProcessMessage( [ServiceBusTrigger("myqueue", Connection = "ServiceBusConnection")] ServiceBusReceivedMessage message, CancellationToken cancellationToken) { await DoWorkAsync(message, cancellationToken); } Using Diagnose and Solve Problems The Azure Portal provides built-in diagnostics for Service Bus integration issues. How to Access: Navigate to your Function App in the Azure Portal Select Diagnose and solve problems from the left menu Search for relevant detectors: Detector What It Checks Function App Down or Reporting Errors Overall app health, host status, crash history Functions Configurations Check host.json and app settings validation Messaging Function Trigger Failure Helps troubleshoot messaging function trigger failures Network Troubleshooter VNet, private endpoint, and access restriction diagnostics These detectors run automated checks and provide targeted recommendations. Quick Troubleshooting Checklist Use this checklist to systematically diagnose Service Bus trigger issues: [ ] Connection: Is the Service Bus connection string or managed identity configuration set correctly in Application Settings? 
[ ] Entity name: Does the queue/topic/subscription name in the trigger attribute match the actual Service Bus entity? [ ] RBAC: For managed identity, does the Function App have Azure Service Bus Data Receiver role? [ ] Extension version: Is the Service Bus extension (NuGet or extension bundle) up to date? [ ] host.json: Is the serviceBus section configured correctly under extensions? [ ] Message locks: Is maxAutoLockRenewalDuration sufficient for your function's execution time? [ ] Dead-letter queue: Are messages accumulating in the DLQ? Check DeadLetterReason. [ ] Function timeout: Is your function completing within the plan's timeout limit? [ ] Network: For VNet-integrated apps, can the app reach the Service Bus namespace on the required ports? [ ] Scaling: Are enough instances allocated? Check instance count vs. message backlog. [ ] Exceptions: Check Application Insights for the first and most frequent exceptions. [ ] Diagnose and Solve: Have you run the built-in detectors in the Azure Portal? Conclusion Azure Functions Service Bus trigger issues span a wide range — from simple connection misconfigurations to complex message lock timing problems. The key to efficient troubleshooting is a systematic approach: Key Takeaways: Start with the basics — Verify connection settings, entity names, and permissions first. Most issues are configuration-related. Understand the lock lifecycle — maxAutoLockRenewalDuration, entity lock duration, and function execution time must be tuned in concert to prevent MessageLockLostException and duplicate processing. Design for at-least-once delivery — Make your functions idempotent. Service Bus guarantees at-least-once, not exactly-once. Use ServiceBusMessageActions for control — Disable autoCompleteMessages and settle messages explicitly for production-grade error handling. Monitor the dead-letter queue — DLQ messages are a direct signal that something is failing. Inspect them regularly. 
Tune concurrency for throughput: maxConcurrentCalls, prefetchCount, and batching settings significantly impact throughput.
Apply one fix at a time: Change one setting, restart, and recheck. Avoid multiple simultaneous changes that obscure which fix resolved the issue.

If you continue to experience issues after following these steps, consider opening a support ticket with Microsoft Azure Support, providing:
Function App name and resource group
Timestamp of when the issue started
Application Insights exceptions and traces around the failure time
Service Bus entity configuration (lock duration, max delivery count, sessions)
host.json serviceBus configuration
Recent deployment or configuration changes
Networking configuration details (if VNet-integrated)

References
Azure Service Bus trigger for Azure Functions
Azure Service Bus bindings - host.json settings
Azure Functions error handling and retries
Target-based scaling for Service Bus
Service Bus PeekLock behavior
Azure Functions networking options
Azure Functions diagnostics
Troubleshoot Azure Functions
Service Bus dead-letter queues
Azure Service Bus RBAC roles

Have questions or feedback? Leave a comment below.

How to Troubleshoot Azure Functions Host Startup Issue
Overview Azure Functions is a powerful serverless compute service that enables you to run event-driven code without managing infrastructure. When you deploy a Function App, the Azure Functions host is the runtime process responsible for discovering your functions, loading extensions and bindings, connecting to storage, and starting trigger listeners. A host startup issue occurs when the Functions runtime fails to initialize and cannot reach a healthy Running state. When this happens, you may see one or more of these symptoms: "Function host is not running" error in the Azure Portal Functions are not visible in the Functions blade Triggers stop firing — HTTP functions return 503, timer/queue functions are silent The portal shows Error state or no response on the host status endpoint Application Insights logs show repeated startup exceptions followed by restarts Log Stream shows a restart loop or no output at all This issue can be frustrating, especially when a deployment appeared to succeed and your code works correctly on your local machine. In this blog, we will explore how the host starts up, what can go wrong, and — most importantly — how to systematically diagnose and resolve startup failures. Understanding How the Host Starts Up Before diving into troubleshooting, it is important to understand the startup sequence. 
The Functions host executes the following steps each time the runtime initializes:

Host Startup Sequence
ASP.NET Core Startup
→ Register WebHost services (DI, secrets, diagnostics, middleware)
→ WebJobsScriptHostService.StartAsync()
→ Check file system (run-from-package validation)
→ Build inner ScriptHost
→ ScriptHost.InitializeAsync()
→ PreInitialize (validate settings, file system)
→ Load function metadata (function.json / decorators)
→ Load extensions and bindings (extension bundles / NuGet)
→ Create function descriptors and register triggers
→ Start trigger listeners
→ State = Running ✓

Complete Source Code: Azure/azure-functions-host

If any step in this sequence fails, the host enters an Error state and attempts to restart with exponential backoff (starting at 1 second, up to 2 minutes between attempts). After repeated failures, the platform may report an application-level failure.

Host States
The Functions host can be in any of the following states:

State | Meaning
Default | Host has not yet been created
Starting | Host is in the process of starting
Initialized | Functions indexed, listeners not yet running
Running | Fully running: triggers active, functions discoverable
Error | Host encountered an error and will attempt restart
Stopping | Host is shutting down
Stopped | Host is stopped
Offline | Host is offline (app_offline.htm is present)

Only when the host reaches the Running state are functions visible in the portal and triggers active. The Error state triggers an automatic restart loop.
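To make the restart cadence concrete, the backoff described above can be sketched in Python. This is only an illustration, not the host's actual code: the post states the 1-second start and the 2-minute ceiling, while the doubling factor and the function name `restart_delays` are assumptions.

```python
from typing import List


def restart_delays(attempts: int, base: float = 1.0, cap: float = 120.0) -> List[float]:
    """Return the wait in seconds before each restart attempt.

    Assumption: the delay doubles after every failed attempt. Only the
    1 s starting point and the 120 s ceiling come from the post.
    """
    delays = []
    delay = base
    for _ in range(attempts):
        delays.append(min(delay, cap))  # never wait longer than the cap
        delay *= 2
    return delays


if __name__ == "__main__":
    print(restart_delays(9))
```

Under these assumptions the ceiling is reached by the eighth attempt, which is why a persistently failing host shows restarts spaced about two minutes apart in the logs.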
Key Settings That Affect Startup

Setting | Purpose | Impact If Wrong
FUNCTIONS_EXTENSION_VERSION | Specifies the runtime version (e.g., ~4) | Host throws startup error if missing or invalid
FUNCTIONS_WORKER_RUNTIME | Specifies the language runtime (e.g., dotnet-isolated, node, python) | Host cannot load the correct worker process
AzureWebJobsStorage | Connection string for the required storage account | Host cannot store keys, coordinate triggers, or maintain state
WEBSITE_RUN_FROM_PACKAGE | Controls how deployment packages are loaded | Host shuts down if package is inaccessible or corrupted
WEBSITE_CONTENTAZUREFILECONNECTIONSTRING | Storage connection for content share (Consumption/Premium) | Host cannot access function code
WEBSITE_CONTENTSHARE | File share name for function content | Host cannot locate function files

Startup Failure Categories

Category | Examples | Typical Symptom
Configuration | Missing/invalid app settings, bad host.json | Host enters Error state immediately
Storage | AzureWebJobsStorage unreachable, expired SAS token, firewall | Host fails repeatedly, storage-related exceptions
Extensions/Bindings | Missing extension bundle, version mismatch, load failure | Host errors during extension loading phase
Deployment/Packaging | Corrupted zip, wrong package structure, missing files | Host starts but finds no functions, or fails to load assemblies
Code/Startup | DI exception, external startup error, assembly conflict | Host errors during initialization with code-specific exception
Runtime/Worker | Wrong worker runtime, language mismatch, gRPC failure | Host cannot establish worker channel
Networking | VNet blocks outbound, DNS failure, private endpoint misconfigured | Host cannot reach storage/dependencies at startup
Platform | Resource exhaustion, app_offline.htm, platform issue | Host enters Offline state or is killed before startup completes

Common Causes and Solutions
1.
Missing or Invalid FUNCTIONS_EXTENSION_VERSION Symptoms: Host immediately fails to start Error message: "Invalid site extension configuration. Please update the App Setting 'FUNCTIONS_EXTENSION_VERSION' to a valid value (e.g. ~4)." Repeated restart loops in Application Insights Why This Happens: The FUNCTIONS_EXTENSION_VERSION app setting tells the platform which version of the Functions runtime to load. When your app runs as a hosted site extension (the normal case in Azure), this setting is validated as one of the first steps in ScriptHost.PreInitialize(). If it is missing, empty, or set to an unrecognized value, the host throws a HostInitializationException and cannot proceed. How to Verify: Navigate to your Function App in the Azure Portal Go to Settings → Configuration → Application settings Look for FUNCTIONS_EXTENSION_VERSION Confirm it is set to a valid value: ~4 (recommended), ~3 (legacy), or a specific version Solution: Set FUNCTIONS_EXTENSION_VERSION to ~4 (or the appropriate version for your app) If the setting was recently changed or removed, restore it Save and restart the Function App Ref: FUNCTIONS_EXTENSION_VERSION 2. Missing or Mismatched FUNCTIONS_WORKER_RUNTIME Symptoms: Error: "The 'FUNCTIONS_WORKER_RUNTIME' setting is required..." (diagnostic code AZFD0011) Error: "The 'FUNCTIONS_WORKER_RUNTIME' is set to 'X', which does not match the worker runtime metadata..." (diagnostic code AZFD0013) Host enters Error state after loading function metadata Why This Happens: The FUNCTIONS_WORKER_RUNTIME setting controls which language worker process the host launches (e.g., dotnet-isolated, node, python, java, powershell). During initialization, the host validates that this setting matches the actual function metadata discovered in your deployment. A mismatch — for example, deploying a Python app but having FUNCTIONS_WORKER_RUNTIME=node — results in a HostInitializationException. 
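The two settings discussed so far can be sanity-checked offline, for example against an exported copy of the app settings. The sketch below is a rough approximation rather than the host's own validation: the function name `check_core_settings`, the accepted version formats, and the idea of passing in the runtime you expect from your project are all assumptions made for illustration.

```python
import re
from typing import Dict, List

# Worker runtime values named in this post; treat the set as illustrative.
KNOWN_WORKER_RUNTIMES = {
    "dotnet", "dotnet-isolated", "node", "python", "java", "powershell",
}


def check_core_settings(settings: Dict[str, str], expected_runtime: str) -> List[str]:
    """Return human-readable problems found in the two core app settings."""
    problems = []

    version = settings.get("FUNCTIONS_EXTENSION_VERSION", "").strip()
    # Assumed formats: "~4"-style values or a dotted, pinned version number.
    if not re.fullmatch(r"~\d+|\d+(\.\d+)*", version):
        problems.append("FUNCTIONS_EXTENSION_VERSION is missing or invalid")

    runtime = settings.get("FUNCTIONS_WORKER_RUNTIME", "").strip()
    if not runtime:
        problems.append("FUNCTIONS_WORKER_RUNTIME is missing (AZFD0011)")
    elif runtime not in KNOWN_WORKER_RUNTIMES:
        problems.append(f"FUNCTIONS_WORKER_RUNTIME '{runtime}' is not a recognized value")
    elif runtime != expected_runtime:
        problems.append(
            f"FUNCTIONS_WORKER_RUNTIME is '{runtime}' but the project is "
            f"'{expected_runtime}' (AZFD0013)"
        )
    return problems


if __name__ == "__main__":
    exported = {"FUNCTIONS_EXTENSION_VERSION": "~4", "FUNCTIONS_WORKER_RUNTIME": "node"}
    for problem in check_core_settings(exported, expected_runtime="python"):
        print(problem)
```

An empty list means neither check fired; it does not prove the host will start, only that these two settings look plausible.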
How to Verify: Check the app setting value in Portal → Configuration Compare against your actual project type: C# in-process: dotnet C# isolated: dotnet-isolated Node.js: node Python: python Java: java PowerShell: powershell Solution: Set FUNCTIONS_WORKER_RUNTIME to the correct value matching your function code If you recently migrated language models (e.g., in-process to isolated), update the setting accordingly Save and restart Ref: FUNCTIONS_WORKER_RUNTIME 3. Storage Account Connectivity Issues (AzureWebJobsStorage) Symptoms: Host fails to start and cannot recover Errors related to Blob storage connectivity "Unable to get function keys" or secret management errors Health check returns Unhealthy Why This Happens: The Functions host requires a valid and reachable storage account for: Storing function keys and secrets Coordinating distributed triggers (e.g., timer triggers, queue listeners) Maintaining internal state and lock management Hosting the content share for Consumption and Premium plans The host runs a background health check (WebJobsStorageHealthCheck) every 30 seconds that verifies Blob storage connectivity. If the storage account is unreachable — due to a wrong connection string, rotated keys, firewall restrictions, deleted account, or expired SAS token — the host will fail to initialize properly. How to Verify: Check your Application Settings for these storage-related values: Setting Required For AzureWebJobsStorage All plans — primary storage connection WEBSITE_CONTENTAZUREFILECONNECTIONSTRING Consumption and Premium plans — content share WEBSITE_CONTENTSHARE Consumption and Premium plans — file share name You can also verify storage connectivity using the host status endpoint. 
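When the storage connection uses a SAS token, an expired token is one of the easier causes to rule out locally. The following is a minimal sketch, assuming the connection string carries a `SharedAccessSignature` segment whose `se` parameter holds the expiry time; the helper names are hypothetical and the parsing is deliberately simple.

```python
from datetime import datetime, timezone
from typing import Dict, Optional
from urllib.parse import parse_qs


def parse_connection_string(value: str) -> Dict[str, str]:
    """Split 'Key=Value;Key=Value' pairs; values may themselves contain '='."""
    parts = {}
    for segment in value.split(";"):
        if "=" in segment:
            key, _, val = segment.partition("=")  # split on the first '=' only
            parts[key.strip()] = val.strip()
    return parts


def sas_expiry(connection_string: str) -> Optional[datetime]:
    """Return the SAS expiry ('se' parameter) as a UTC datetime, if present."""
    sas = parse_connection_string(connection_string).get("SharedAccessSignature")
    if not sas:
        return None
    expiry = parse_qs(sas.lstrip("?")).get("se")
    if not expiry:
        return None
    parsed = datetime.fromisoformat(expiry[0].replace("Z", "+00:00"))
    if parsed.tzinfo is None:  # date-only values carry no offset; assume UTC
        parsed = parsed.replace(tzinfo=timezone.utc)
    return parsed


def is_expired(connection_string: str, now: Optional[datetime] = None) -> bool:
    """True when a SAS expiry exists and is not in the future."""
    expiry = sas_expiry(connection_string)
    now = now or datetime.now(timezone.utc)
    return expiry is not None and expiry <= now
```

Key-based connection strings have no expiry to inspect, so `sas_expiry` returns `None` for them; rotated keys still need to be checked against the storage account itself.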
Solution: Verify the storage account exists — check the Azure Portal to confirm it has not been deleted or disabled Check for rotated keys — if storage keys were recently regenerated, update the connection string: Get the new connection string from the Storage Account → Access keys blade Update AzureWebJobsStorage in your Function App settings Check storage firewall rules: Go to Storage Account → Networking Ensure the Function App has access (public endpoint, service endpoint, or private endpoint depending on your architecture) For SAS-token-based connections — verify the token has not expired (diagnostic code AZFD0006) For VNet-integrated apps: Ensure service endpoints or private endpoints are configured for the storage account Verify DNS resolution works for *.blob.core.windows.net, *.queue.core.windows.net, *.table.core.windows.net, and *.file.core.windows.net For detailed guidance, see Storage considerations for Azure Functions. 4. Invalid host.json Configuration Symptoms: Error: "The host.json file is missing the required 'version' property." (diagnostic code AZFD0009) Error: "'X' is an invalid value for host.json 'version' property." JSON deserialization failures in logs Host enters a special HandlingConfigurationParsingError mode Why This Happens: The host.json file is parsed early in the startup sequence. If it is missing the required "version": "2.0" property, contains invalid JSON syntax, or has unrecognized configuration values, the host throws a HostConfigurationException. The host then restarts in a degraded mode that skips host.json parsing — the admin APIs remain functional for diagnostics, but functions will not load. 
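The two host.json failure modes described above (invalid JSON and a missing or wrong `version` property) are easy to catch before deployment. The sketch below is one way to do that in Python; the function name `check_host_json` is hypothetical, and Python's `json` module is strict, so it may flag files (for example, ones containing comments) that a more lenient parser would accept.

```python
import json
from typing import List


def check_host_json(text: str) -> List[str]:
    """Return a list of problems found in the host.json content."""
    try:
        config = json.loads(text)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc.msg} (line {exc.lineno})"]
    if not isinstance(config, dict):
        return ["host.json must contain a JSON object"]
    if "version" not in config:
        return ["missing required 'version' property (AZFD0009)"]
    if config["version"] != "2.0":
        return [f"'{config['version']}' is an invalid value for 'version'"]
    return []


if __name__ == "__main__":
    with open("host.json", encoding="utf-8") as handle:
        for problem in check_host_json(handle.read()):
            print(problem)
```

This does not detect unrecognized or misspelled configuration keys; those still need to be compared against the host.json reference.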
How to Verify: Check your host.json in the deployment: Windows plans: Use Kudu → Debug Console → Navigate to site/wwwroot/host.json Linux/Flex Consumption: Use SSH or Azure CLI Validate that the file: Is valid JSON (use a JSON validator) Contains the required "version": "2.0" property Does not have unrecognized or misspelled configuration keys Minimal valid host.json: { "version": "2.0" } Typical host.json with extension bundle: { "version": "2.0", "extensionBundle": { "id": "Microsoft.Azure.Functions.ExtensionBundle", "version": "[4.*, 5.0.0)" }, "logging": { "applicationInsights": { "samplingSettings": { "isEnabled": true, "excludedTypes": "Request" } } } } Solution: Fix any JSON syntax errors Ensure "version": "2.0" is present Remove or correct any unrecognized configuration keys Redeploy or edit the file directly via Kudu (Windows plans) Ref: host.json 5. Extension Bundle or Binding Load Failures Symptoms: Host fails to start with extension-related errors in logs Error: "Referenced bundle X of version Y does not meet the required minimum version..." Error: "One or more loaded extensions do not meet the minimum requirements..." Errors referencing ScriptStartUpErrorLoadingExtensionBundle or ScriptStartUpUnableToLoadExtension Works locally but fails in Azure Why This Happens: Azure Functions uses extension bundles to provide trigger and binding implementations (Service Bus, Event Hubs, Cosmos DB, etc.). During startup, the ScriptStartupTypeLocator loads extension assemblies from either the bundle path or the bin folder. If the bundle is missing, the version is incompatible, an assembly fails to load, or the type does not implement the expected interfaces, the host throws a HostInitializationException. 
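The `version` value in `extensionBundle` is an interval: a square bracket includes the bound and a parenthesis excludes it, so `[4.*, 5.0.0)` reads as "any 4.x version, but not 5.0.0". The sketch below shows one simplified way to read that notation in Python. It is not the resolver the host uses, it ignores pre-release tags, and the helper names are hypothetical.

```python
from typing import Tuple


def _parse(version: str) -> Tuple[int, ...]:
    """Turn '4.12.0' or '4.*' into a 3-part tuple; '*' is treated as 0."""
    parts = tuple(0 if p == "*" else int(p) for p in version.strip().split("."))
    return parts + (0,) * (3 - len(parts))  # pad short versions with zeros


def in_bundle_range(version: str, version_range: str) -> bool:
    """Check a version against interval notation such as '[4.*, 5.0.0)'."""
    text = version_range.strip()
    lower_inclusive = text.startswith("[")
    upper_inclusive = text.endswith("]")
    lower, upper = (part.strip() for part in text[1:-1].split(","))
    value = _parse(version)
    if lower:
        low = _parse(lower)
        if value < low or (value == low and not lower_inclusive):
            return False
    if upper:
        high = _parse(upper)
        if value > high or (value == high and not upper_inclusive):
            return False
    return True


if __name__ == "__main__":
    print(in_bundle_range("4.12.0", "[4.*, 5.0.0)"))
```

Reading the range this way makes it easier to see why a bundle version outside the interval is rejected during startup.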
How to Verify: Check host.json for the extensionBundle configuration Verify the version range is compatible with your runtime version For compiled C# apps that don't use bundles, verify all required NuGet packages are present and compatible Solution: Ensure extensionBundle is configured in host.json: { "version": "2.0", "extensionBundle": { "id": "Microsoft.Azure.Functions.ExtensionBundle", "version": "[4.*, 5.0.0)" } } Use the correct version range for your runtime: Functions v4: [4.*, 5.0.0) For compiled .NET apps using explicit extensions: Verify all extension NuGet packages are up to date Ensure extensions.json is present in the bin folder after build Check for assembly version conflicts in the build output 6. Deployment Package Issues (WEBSITE_RUN_FROM_PACKAGE) Symptoms: Host shuts down immediately after startup Error: "Shutting down host due to presence of FAILED TO INITIALIZE RUN FROM PACKAGE.txt" Functions were visible before but disappeared after deployment "No functions found" in the portal Read-only file system errors in logs Why This Happens: When WEBSITE_RUN_FROM_PACKAGE is configured, the Functions host runs directly from a deployment package (ZIP file). During startup, the host checks the file system for failure markers. If the file FAILED TO INITIALIZE RUN FROM PACKAGE.txt is found, the host immediately shuts down the application — this is a fatal, non-recoverable error that requires redeployment. Other common package issues include an inaccessible URL, an expired SAS token, files nested in a subfolder instead of the ZIP root, or a corrupted package. 
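The "files nested in a subfolder" problem can be checked before uploading a package. The following is a minimal sketch using Python's standard `zipfile` module; the function name `check_package_layout` is hypothetical, and the check only looks for host.json at the ZIP root, which is an assumption about what a correctly structured package should contain at minimum.

```python
import io
import zipfile
from typing import List


def check_package_layout(package: bytes) -> List[str]:
    """Return problems with the layout of a deployment ZIP held in memory."""
    try:
        archive = zipfile.ZipFile(io.BytesIO(package))
    except zipfile.BadZipFile:
        return ["package is not a readable ZIP file"]
    with archive:
        names = archive.namelist()
    if "host.json" in names:
        return []  # host.json sits at the ZIP root, as expected
    nested = [name for name in names if name.endswith("/host.json")]
    if nested:
        return [f"host.json is nested at '{nested[0]}'; move files to the ZIP root"]
    return ["host.json not found in package"]


if __name__ == "__main__":
    with open("package.zip", "rb") as handle:  # hypothetical file name
        for problem in check_package_layout(handle.read()):
            print(problem)
```

A clean result here says nothing about whether the package URL is reachable from the app; expired SAS tokens and storage firewalls need to be checked separately.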
WEBSITE_RUN_FROM_PACKAGE Values: Value Behavior 1 Runs from a local package in d:\home\data\SitePackages (Windows) or /home/data/SitePackages (Linux) <URL> Runs from a remote package at the specified URL (required for Linux Consumption) Not set Traditional deployment — files extracted to wwwroot How to Verify: Check WEBSITE_RUN_FROM_PACKAGE in Application Settings If value is 1: Go to Kudu → Debug Console Navigate to d:\home\data\SitePackages Verify a .zip file exists and packagename.txt points to it If value is a URL: Try accessing the URL directly — it should download the ZIP Check for expired SAS tokens (403 response) or missing blobs (404 response) Verify package contents: Download and extract the ZIP Confirm host.json and function files are at the root level, not in a nested subfolder Common Issues: Problem Symptom Fix Expired SAS token Package URL returns 403 Generate new SAS with longer expiry Package URL not accessible Package URL returns 404 Verify blob exists and URL is correct Wrong package structure Files in subfolder Ensure files are at ZIP root Corrupted package Host startup errors Redeploy with a fresh package Storage firewall blocking Timeout errors Allow Function App access to storage Solution: Redeploy your Function App using your preferred deployment method If using URL-based packages, regenerate the SAS token or use managed identity-based access If the failure marker file exists, redeployment will overwrite it Restart the Function App after fixing: Ref: WEBSITE_RUN_FROM_PACKAGE 7. 
Code-Level Startup Exceptions (DI and External Startup) Symptoms: Host Error state with application-specific exception in logs Error: "Error configuring services in an external startup class" (diagnostic code AZFD0005) Dependency injection failures (InvalidOperationException, TypeLoadException) Errors in Program.cs or Startup.cs of your application Assembly binding or version conflict exceptions Why This Happens: For isolated worker (.NET) apps, your Program.cs runs custom startup code before the worker connects to the host. For in-process (.NET) apps, custom IWebJobsStartup implementations run during host initialization. If this code throws — for example, a missing dependency, a failed external service connection, or a type load error — the host catches the exception and enters an Error state with a HostInitializationException. How to Verify: Check Application Insights Exceptions table for the specific exception type and stack trace Look for errors containing AZFD0005 (external startup error) Review your Program.cs / Startup.cs for: Service registrations that depend on external resources (databases, APIs, Key Vault) Missing NuGet packages or assembly version mismatches Configuration values that may differ between local and Azure environments Solution: Fix the exception identified in logs — the stack trace usually points directly to the failing code Ensure all required environment variables and connection strings are set in Application Settings For assembly conflicts, check that all NuGet package versions are compatible and aligned Consider making external-service connections resilient by deferring initialization or adding retry logic Test startup locally with the same environment variables as Azure 8. Language Worker Channel Failure Symptoms: Error: "Failed to start Language Worker Channel for language: {runtime}" Error: "Failed to start Rpc Server. Check if your app is hitting connection limits." 
Host starts but cannot communicate with the language worker process Timeout errors during worker initialization Why This Happens: For out-of-process languages (Node.js, Python, Java, PowerShell, .NET Isolated), the Functions host communicates with a separate worker process over gRPC. If the host cannot start the gRPC server, or the worker process fails to launch or connect, the host throws a HostInitializationException. Common causes include: Port conflicts Missing language runtime or incorrect version Worker process crashes on startup Resource exhaustion (memory, file handles) How to Verify: Check Application Insights for gRPC or worker-related errors Verify the correct language runtime version is installed: For Node.js: Check WEBSITE_NODE_DEFAULT_VERSION For Python: Check the Python version in Configuration → General settings For Java: Check FUNCTIONS_WORKER_JAVA_LOAD_APP_LIBS and Java version For .NET Isolated: Check target framework in the deployed assemblies Check if the Function App is hitting plan resource limits Solution: Ensure the correct language runtime version is configured For Linux Consumption, verify the correct runtime stack is selected in Configuration → General settings If resource limits are suspected, consider scaling up to a higher plan tier Restart the Function App to clear temporary port or resource issues 9. 
Networking Blocking Required Dependencies Symptoms: Host fails to start in VNet-integrated apps Timeout errors connecting to storage or other Azure services Works without VNet integration, fails with it enabled DNS resolution failures in logs NSG or firewall-related errors Why This Happens: During startup, the Functions host must reach several external endpoints: Azure Storage (Blob, Queue, Table, File) — for keys, triggers, and state Extension bundle CDN — to download extension bundles (first run or cold start) Azure Key Vault — if Key Vault references are used in app settings Application Insights — for telemetry (non-blocking, but can delay if timing out) If VNet integration, NSG rules, forced tunneling, or a firewall blocks these outbound connections, the host cannot complete startup. How to Verify: Check if the Function App has VNet integration enabled (Networking blade) Review NSG rules on the integrated subnet — ensure outbound to Azure services is allowed For apps with forced tunneling, verify the firewall/NVA allows required endpoints Check DNS resolution for storage endpoints from within the VNet context Solution: Add NSG rules or firewall rules to allow outbound traffic to the required endpoints Configure service endpoints or private endpoints for storage on the integrated subnet Ensure DNS resolution works for all required endpoints For private DNS zones, ensure proper zone links and records exist for storage See Azure Functions networking options for detailed configuration guidance. 10. app_offline.htm Causing Offline State Symptoms: Host status shows Offline All requests return an offline page Portal shows the app is running but functions return errors Why This Happens: If a file named app_offline.htm exists in the function app's script root directory, the host detects it during startup and enters the Offline state. 
Some deployment tools create this file during deployment to gracefully take the app offline, and it should be removed automatically when deployment completes. If it is left behind (for example, due to a failed deployment), the host remains offline.

How to Verify:
Windows plans: Go to Kudu → Debug Console → Navigate to site/wwwroot and look for app_offline.htm
Linux: Use SSH or Azure CLI to check for the file

Solution:
Delete app_offline.htm from the app's root directory
The host will automatically detect the deletion and restart into a normal state
If the file reappears after deletion, investigate your deployment pipeline: it may be creating the file but failing to remove it

Using Diagnose and Solve Problems
The Azure Portal provides built-in diagnostics specifically designed for Functions host startup issues.

How to Access:
Navigate to your Function App in the Azure Portal
Select Diagnose and solve problems from the left menu
Search for relevant detectors:

Detector | What It Checks
Function App Down or Reporting Errors | Overall app health, host status, crash history
Function App Startup Issue | Specific startup failure analysis, configuration validation
Functions Configurations | Check host.json and app settings validation
Functions Deployment | Recent deployment status and potential issues
Network Troubleshooter | VNet, private endpoint, and access restriction diagnostics

These detectors run automated checks against your Function App and provide targeted recommendations. The detectors often identify the root cause faster than manual investigation.

Verifying Host Status via REST API
You can check the host status programmatically to determine the current state and any reported errors.

Get Host Status:
curl "https://<app>.azurewebsites.net/admin/host/status?code=<master-key>"

See Admin API for details.
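For scripted checks, the same call can be made from Python. The sketch below is an illustration under stated assumptions: the response is taken to be JSON containing the `state` field and an optional `errors` array mentioned in this post, and the function names and summary wording are hypothetical.

```python
import json
import urllib.request
from typing import Any, Dict
from urllib.parse import quote


def host_status_url(app_name: str, master_key: str) -> str:
    """Build the host status URL; the key is URL-encoded for safety."""
    return (
        f"https://{app_name}.azurewebsites.net/admin/host/status"
        f"?code={quote(master_key, safe='')}"
    )


def fetch_host_status(app_name: str, master_key: str, timeout: float = 10.0) -> Dict[str, Any]:
    """Call the endpoint and decode the JSON body (requires network access)."""
    with urllib.request.urlopen(host_status_url(app_name, master_key), timeout=timeout) as response:
        return json.load(response)


def summarize(status: Dict[str, Any]) -> str:
    """Condense a status payload into a one-line triage hint."""
    state = status.get("state", "Unknown")
    errors = status.get("errors") or []
    if state == "Running":
        return "Running: host is healthy"
    if state == "Error":
        first = errors[0] if errors else "no error details reported"
        return f"Error: {first}"
    if state == "Offline":
        return "Offline: look for app_offline.htm"
    return f"{state}: host has not reached Running"
```

A timeout or connection error from `fetch_host_status` is itself a signal: it corresponds to the "no response" case, where platform health and networking are the next things to check.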
The state field is the single most important indicator:

State | Action
Running | Host is healthy; investigate function-level issues
Error | Host startup failed; check the errors array for root cause
Offline | app_offline.htm present; check deployment state
No response / timeout | Host cannot serve requests; check platform health and networking

List Functions (verify discovery):
curl "https://<app>.azurewebsites.net/admin/functions?code=<master-key>"

Quick Troubleshooting Checklist
Use this checklist to systematically diagnose host startup issues:
[ ] Host status: Check /admin/host/status. Is the state Running, Error, or Offline?
[ ] First error: Check Application Insights Exceptions or Log Stream. What is the first exception after the latest restart?
[ ] FUNCTIONS_EXTENSION_VERSION: Is it set to a valid value (e.g., ~4)?
[ ] FUNCTIONS_WORKER_RUNTIME: Is it set correctly and does it match the deployed code?
[ ] AzureWebJobsStorage: Is the connection string valid? Is the storage account reachable from the app's network context?
[ ] host.json: Does it exist, contain valid JSON, and include "version": "2.0"?
[ ] Extension bundle: Is extensionBundle configured with a compatible version range?
[ ] Package deployment: If using WEBSITE_RUN_FROM_PACKAGE, is the package accessible and correctly structured?
[ ] Startup code: For .NET apps, does Program.cs / startup code throw during DI registration?
[ ] Networking: If VNet-integrated, can the app reach storage, Key Vault, and extension CDN endpoints?
[ ] Offline file: Is app_offline.htm present in the root directory?
[ ] Diagnose and Solve: Have you run the Function App Startup Issue detector in the Azure Portal?
Diagnostic Event Codes Reference
When reviewing logs, look for these Azure Functions diagnostic codes that are related to startup failures:

Code | Name | Meaning
AZFD0005 | External Startup Error | Error in a custom IWebJobsStartup class
AZFD0006 | SAS Token Expiring | AzureWebJobsStorage SAS token is expiring or expired
AZFD0009 | Unable to Parse host.json | host.json file is missing or has invalid content
AZFD0011 | Missing FUNCTIONS_WORKER_RUNTIME | The required worker runtime setting is not configured
AZFD0013 | Worker Runtime Mismatch | FUNCTIONS_WORKER_RUNTIME does not match deployed function metadata

These codes appear in Application Insights traces and diagnostic event logs.
Ref: Diagnostic Events

Conclusion
Azure Functions host startup failures can be caused by a wide range of issues, from a simple missing app setting to complex networking misconfigurations. The key to efficient troubleshooting is a systematic approach:

Key Takeaways:
Always check host status first: the /admin/host/status endpoint tells you the current state and any errors
Find the first error, not the cascade: look for the initial exception after the most recent restart
Validate configuration: FUNCTIONS_EXTENSION_VERSION, FUNCTIONS_WORKER_RUNTIME, and AzureWebJobsStorage are the three settings that cause the most startup failures
Check host.json: a missing version property or invalid JSON is a common and easily fixable cause
Verify deployment artifacts: ensure your package is complete, correctly structured, and accessible
Use built-in diagnostics: the Diagnose and Solve Problems detectors are purpose-built for these issues
Apply one fix at a time: change one setting, restart, and recheck.
Avoid multiple simultaneous changes that obscure which fix resolved the issue.

If you continue to experience startup issues after following these steps, consider opening a support ticket with Microsoft Azure Support, providing:
Function App name and resource group
Timestamp of when the issue started
Host status endpoint response (copy the full JSON)
The first exception from Application Insights or Log Stream
Recent deployment or configuration changes
Networking configuration details (if VNet-integrated)

References
Azure Functions host.json reference
Azure Functions app settings reference
Azure Functions deployment technologies
Storage considerations for Azure Functions
Azure Functions networking options
Azure Functions diagnostics
Azure Functions Admin API (host status)
Run your functions from a package file
Troubleshoot Azure Functions

Have questions or feedback? Leave a comment below.

Network Connectivity Check APIs for Logic App Standard
Introduction
When your Logic App Standard is integrated with a Virtual Network (VNET), you can use these APIs to troubleshoot connectivity issues to downstream resources like SQL databases, Storage Accounts, Service Bus, Key Vault, and more. The checks run directly from the worker hosting your Logic App, so the results reflect the actual network path your workflows use.

API Overview

API | HTTP Method | Route Suffix | Purpose
ConnectivityCheck | POST | /connectivityCheck | Validates end-to-end connectivity to an Azure resource (SQL, Key Vault, Storage, Service Bus, etc.)
DnsCheck | POST | /dnsCheck | Performs DNS resolution for a hostname
TcpPingCheck | POST | /tcpPingCheck | Performs a TCP ping to a host and port

How to Call
Using Azure API Playground
Open https://portal.azure.com/#view/Microsoft_Azure_Resources/ArmPlayground.ReactView and sign in with your Azure account.
Use the POST method with the URLs below.
Instead of the API Playground, you can also use PowerShell or az rest.

URL Pattern
Production slot:
POST https://management.azure.com/subscriptions/{subscriptionId}/resourceGroups/{resourceGroupName}/providers/Microsoft.Web/sites/{logicAppName}/connectivityCheck?api-version=2026-03-01-preview
Deployment slot:
POST https://management.azure.com/subscriptions/{subscriptionId}/resourceGroups/{resourceGroupName}/providers/Microsoft.Web/sites/{logicAppName}/slots/{slotName}/connectivityCheck?api-version=2026-03-01-preview

Replace connectivityCheck with dnsCheck or tcpPingCheck as needed. All requests should be JSON.

1. ConnectivityCheck
Tests end-to-end connectivity from your Logic App to an Azure resource. This validates DNS, TCP, and authentication in a single call.
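When scripting these calls, the URL pattern above can be assembled with a small helper. The sketch below simply mirrors the pattern shown in this post; the function name `check_url` and the example identifiers are hypothetical, and it builds the URL only, leaving authentication and the POST itself to whichever client you use.

```python
from typing import Optional

API_VERSION = "2026-03-01-preview"  # version shown in the URL pattern above
CHECKS = {"connectivityCheck", "dnsCheck", "tcpPingCheck"}


def check_url(
    subscription_id: str,
    resource_group: str,
    logic_app: str,
    check: str = "connectivityCheck",
    slot: Optional[str] = None,
) -> str:
    """Build the management URL for one of the three checks."""
    if check not in CHECKS:
        raise ValueError(f"unknown check: {check}")
    # Deployment slots add a /slots/{slotName} segment after the site name.
    site = f"sites/{logic_app}" + (f"/slots/{slot}" if slot else "")
    return (
        f"https://management.azure.com/subscriptions/{subscription_id}"
        f"/resourceGroups/{resource_group}/providers/Microsoft.Web/{site}/{check}"
        f"?api-version={API_VERSION}"
    )


if __name__ == "__main__":
    # Placeholder identifiers for illustration only.
    print(check_url("my-sub", "my-rg", "my-logic-app", check="dnsCheck", slot="staging"))
```

The resulting URL can then be used as the target of a POST from the API Playground, PowerShell, or az rest, with one of the JSON bodies described in the sections that follow.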
Supported Provider Types

ProviderType | Use For
KeyVault | Azure Key Vault
SQL | Azure SQL Database / SQL Server
ServiceBus | Azure Service Bus
EventHubs | Azure Event Hubs
BlobStorage | Azure Blob Storage
FileShare | Azure File Share (only tests port 443; see the Port 445 limitation)
QueueStorage | Azure Queue Storage
TableStorage | Azure Table Storage
Web | Any HTTP/HTTPS endpoint

Credential Types

CredentialType | When to Use
ConnectionString | You have a connection string to provide directly
Authentication | You have an endpoint URL with username and password
CredentialReference | You want to reference an existing connection string or app setting by name
AppSetting | You want to reference an app setting configured on the Logic App
ManagedIdentity | Your Logic App uses Managed Identity to authenticate

Sample Request: Connection String (SQL Database)
POST https://management.azure.com/subscriptions/{subId}/resourceGroups/{rg}/providers/Microsoft.Web/sites/{logicAppName}/connectivityCheck?api-version=2026-03-01-preview
Content-Type: application/json
{ "properties": { "providerType": "SQL", "credentials": { "credentialType": "ConnectionString", "connectionString": "Server=tcp:myserver.database.windows.net,1433;Database=mydb;User ID=myuser;Password=mypassword;Encrypt=True;TrustServerCertificate=False;" }, "resourceMetadata": { "entityName": "" } } }

Sample Request: App Setting Reference (Service Bus)
Use this when your connection string is stored in an app setting on the Logic App (e.g., ServiceBusConnection).
{ "properties": { "providerType": "ServiceBus", "credentials": { "credentialType": "AppSetting", "appSetting": "ServiceBusConnection" }, "resourceMetadata": { "entityName": "myqueue" } } }

Sample Request: Managed Identity (Blob Storage)
Use this when your Logic App authenticates using Managed Identity.
{ "properties": { "providerType": "BlobStorage", "credentials": { "credentialType": "ManagedIdentity", "managedIdentity": { "targetResourceUrl": "https://mystorageaccount.blob.core.windows.net", "clientId": "" } }, "resourceMetadata": { "entityName": "" } } }

Tip: Leave clientId empty to use the system-assigned managed identity. Provide a client ID to use a specific user-assigned managed identity.

2. DnsCheck
Tests whether a hostname can be resolved from your Logic App's worker. This is useful for verifying private DNS zones and private endpoints are configured correctly.

Sample Request
POST https://management.azure.com/subscriptions/{subId}/resourceGroups/{rg}/providers/Microsoft.Web/sites/{logicAppName}/dnsCheck?api-version=2026-03-01-preview
Content-Type: application/json
{ "properties": { "dnsName": "myserver.database.windows.net" } }

3. TcpPingCheck
Tests whether a TCP connection can be established from your Logic App to a specific host and port. This is useful for checking if a port is open and reachable through your VNET.

Sample Request
POST https://management.azure.com/subscriptions/{subId}/resourceGroups/{rg}/providers/Microsoft.Web/sites/{logicAppName}/tcpPingCheck?api-version=2026-03-01-preview
Content-Type: application/json
{ "properties": { "host": "myserver.database.windows.net", "port": "1433" } }

Port 445 (SMB / Azure File Share): Known Limitation
Port 445 cannot be reliably tested using TcpPingCheck or ConnectivityCheck with the FileShare provider type.

Restricted Outgoing Ports
Regardless of address, applications cannot connect to anywhere using ports 445, 137, 138, and 139. In other words, even if connecting to a non-private IP address or the address of a virtual network, connections to ports 445, 137, 138, and 139 are not permitted.

Using an AI Agent to Troubleshoot and Fix Azure Function App Issues
TOC

Preparation
Troubleshooting Workflow
Conclusion

Preparation

Topic: Required tools

AI agent: for example, Copilot CLI / OpenCode / Hermes / OpenClaw, etc. In this example, we use Copilot CLI.
Model access: for example, Anthropic Claude Opus.
Relevant skills: this example does not use skills, but using relevant skills can speed up troubleshooting.

Topic: Compliance with your organization

Enterprise-level projects are sensitive, so you must confirm with the appropriate stakeholders before using an AI agent on them. Enterprise environments may also have strict standards for AI agent usage.

Topic: Network limitations

If the process involves restarting the Function App container or restarting related settings, communication between the user and the agent may be interrupted, and you will need to use /resume.
If the agent needs internet access for investigation, the app must have outbound connectivity.
If the Kudu container cannot be used because of network issues, this type of investigation cannot be carried out.

Topic: Permission limitations

If you are using Azure blessed images, according to the official documentation, the containers use the fixed password Docker!. However, if you are using a custom container, you will need to provide an additional login method.
For resources the agent does not already have permission to investigate, you will need to enable a system-assigned managed identity (SAMI) and assign the appropriate RBAC roles.

Troubleshooting Workflow

Let’s use a classic case where an HTTP trigger cannot be tested from the Azure Portal. As you can see, when clicking Test/Run in the Azure Portal, an error message appears. At the same time, however, the home page does not show any abnormal status.

At this point, we first obtain the Function App’s SAMI and assign it the Owner role for the entire resource group. This is only for demonstration purposes.
In practice, you should follow the principle of least privilege and scope permissions down to only the specific resources and operations that are actually required.

Next, go to the Kudu container, which is the always-on maintenance container dedicated to the app. Install and enable Copilot CLI. Then we can describe the problem we are encountering.

After the agent processes the issue and interacts with you further, it can generate a reasonable investigation report. In this example, it appears that the Function App’s Storage Account access key had been rotated previously, but the Function App had not updated the corresponding environment variable.

Once we understand the issue, we could perform the follow-up actions ourselves. However, to demonstrate the agent’s capabilities, you can also allow it to fix the problem directly, provided that you have granted the corresponding permissions through SAMI. During the process, the container restart will disconnect the session, so you will need to return to the Kudu container and resume the previous session so it can continue.

Finally, it will inform you that the issue has been fixed, and then you can validate the result. This is the validation result, and it looks like the repair was successful.

Conclusion

After each repair, we can even extract the experience from that case into a skill and store it in a Storage Account for future reuse. In this way, we can not only reduce the agent’s initial investigation time for similar issues, but also save tokens. This makes both time and cost management more efficient.

Azure Functions Ignite 2025 Update
Azure Functions is redefining event-driven applications and high-scale APIs in 2025, accelerating innovation for developers building the next generation of intelligent, resilient, and scalable workloads. This year, our focus has been on empowering AI and agentic scenarios: remote MCP server hosting, bulletproofing agents with Durable Functions, and first-class support for critical technologies like OpenTelemetry, .NET 10 and Aspire. With major advances in serverless Flex Consumption, enhanced performance, security, and deployment fundamentals across Elastic Premium and Flex, Azure Functions is the platform of choice for building modern, enterprise-grade solutions.

Remote MCP

Model Context Protocol (MCP) has taken the world by storm, offering an agent a mechanism to discover and work deeply with the capabilities and context of tools. When you want to expose MCP/tools to your enterprise or the world securely, we recommend you think deeply about building remote MCP servers that are designed to run securely at scale. Azure Functions is uniquely optimized to run your MCP servers at scale, offering the serverless and highly scalable features of the Flex Consumption plan, plus two flexible programming model options discussed below. All come together using the hardened Functions service plus new authentication modes for Entra and OAuth using Built-in authentication.

Remote MCP Triggers and Bindings Extension GA

Back in April, we shared a new extension that allows you to author MCP servers using functions with the MCP tool trigger. That MCP extension is now generally available, with support for C# (.NET), Java, JavaScript (Node.js), Python, and TypeScript (Node.js). The MCP tool trigger allows you to focus on what matters most: the logic of the tool you want to expose to agents. Functions will take care of all the protocol and server logistics, with the ability to scale out to support as many sessions as you want to throw at it.
    [Function(nameof(GetSnippet))]
    public object GetSnippet(
        [McpToolTrigger(GetSnippetToolName, GetSnippetToolDescription)] ToolInvocationContext context,
        [BlobInput(BlobPath)] string snippetContent
    )
    {
        return snippetContent;
    }

New: Self-hosted MCP Server (Preview)

If you’ve built servers with official MCP SDKs and want to run them as remote cloud-scale servers without re-writing any code, this public preview is for you. You can now self-host your MCP server on Azure Functions: keep your existing Python, TypeScript, .NET, or Java code and get rapid 0 to N scaling, built-in server authentication and authorization, consumption-based billing, and more from the underlying Azure Functions service. This feature complements the Azure Functions MCP extension for building MCP servers using the Functions programming model (triggers & bindings). Pick the path that fits your scenario: build with the extension or standard MCP SDKs. Either way you benefit from the same scalable, secure, and serverless platform.

Use the official MCP SDKs:

    @mcp.tool()
    async def get_alerts(state: str) -> str:
        """Get weather alerts for a US state.

        Args:
            state: Two-letter US state code (e.g. CA, NY)
        """
        url = f"{NWS_API_BASE}/alerts/active/area/{state}"
        data = await make_nws_request(url)

        if not data or "features" not in data:
            return "Unable to fetch alerts or no alerts found."

        if not data["features"]:
            return "No active alerts for this state."

        alerts = [format_alert(feature) for feature in data["features"]]
        return "\n---\n".join(alerts)

Use Azure Functions Flex Consumption Plan's serverless compute using Custom Handlers in host.json:

    {
      "version": "2.0",
      "configurationProfile": "mcp-custom-handler",
      "customHandler": {
        "description": {
          "defaultExecutablePath": "python",
          "arguments": ["weather.py"]
        },
        "http": {
          "DefaultAuthorizationLevel": "anonymous"
        },
        "port": "8000"
      }
    }

Learn more about MCPTrigger and self-hosted MCP servers at https://aka.ms/remote-mcp

Built-in MCP server authorization (Preview)

The built-in authentication and authorization feature can now be used for MCP server authorization, using a new preview option. You can quickly define identity-based access control for your MCP servers with Microsoft Entra ID or other OpenID Connect providers. Learn more at https://aka.ms/functions-mcp-server-authorization.

Better together with Foundry agents

Microsoft Foundry is the starting point for building intelligent agents, and Azure Functions is the natural next step for extending those agents with remote MCP tools. Running your tools on Functions gives you clean separation of concerns, reuse across multiple agents, and strong security isolation. And with built-in authorization, Functions enables enterprise-ready authentication patterns, from calling downstream services with the agent’s identity to operating on behalf of end users with their delegated permissions. Build your first remote MCP server and connect it to your Foundry agent at https://aka.ms/foundry-functions-mcp-tutorial.

Agents

Microsoft Agent Framework 2.0 (Public Preview Refresh)

We’re excited about the preview refresh 2.0 release of Microsoft Agent Framework that builds on battle-hardened work from Semantic Kernel and AutoGen. Agent Framework is an outstanding solution for building multi-agent orchestrations that are both simple and powerful.
Azure Functions is a strong fit to host Agent Framework with the service’s extreme scale, serverless billing, and enterprise-grade features like VNET networking and built-in auth.

Durable Task Extension for Microsoft Agent Framework (Preview)

The durable task extension for Microsoft Agent Framework transforms how you build production-ready, resilient and scalable AI agents by bringing the proven durable execution (survives crashes and restarts) and distributed execution (runs across multiple instances) capabilities of Azure Durable Functions directly into the Microsoft Agent Framework. Combined with Azure Functions for hosting and event-driven execution, you can now deploy stateful, resilient AI agents that automatically handle session management, failure recovery, and scaling, freeing you to focus entirely on your agent logic.

Key features of the durable task extension include:

- Serverless Hosting: Deploy agents on Azure Functions with auto-scaling from thousands of instances to zero, while retaining full control in a serverless architecture.
- Automatic Session Management: Agents maintain persistent sessions with full conversation context that survives process crashes, restarts, and distributed execution across instances.
- Deterministic Multi-Agent Orchestrations: Coordinate specialized durable agents with predictable, repeatable, code-driven execution patterns.
- Human-in-the-Loop with Serverless Cost Savings: Pause for human input without consuming compute resources or incurring costs.
- Built-in Observability with Durable Task Scheduler: Deep visibility into agent operations and orchestrations through the Durable Task Scheduler UI dashboard.

Create a durable agent:

    endpoint = os.getenv("AZURE_OPENAI_ENDPOINT")
    deployment_name = os.getenv("AZURE_OPENAI_DEPLOYMENT_NAME", "gpt-4o-mini")

    # Create an AI agent following the standard Microsoft Agent Framework pattern
    agent = AzureOpenAIChatClient(
        endpoint=endpoint,
        deployment_name=deployment_name,
        credential=AzureCliCredential()
    ).create_agent(
        instructions="""You are a professional content writer who creates engaging,
        well-structured documents for any given topic. When given a topic, you will:
        1. Research the topic using the web search tool
        2. Generate an outline for the document
        3. Write a compelling document with proper formatting
        4. Include relevant examples and citations""",
        name="DocumentPublisher",
        tools=[
            AIFunctionFactory.Create(search_web),
            AIFunctionFactory.Create(generate_outline)
        ]
    )

    # Configure the function app to host the agent with durable session management
    app = AgentFunctionApp(agents=[agent])
    app.run()

Figure: Durable Task Scheduler dashboard for agent and agent workflow observability and debugging

For more information on the durable task extension for Agent Framework, see the announcement: https://aka.ms/durable-extension-for-af-blog.

Flex Consumption Updates

As you know, Flex Consumption means serverless without compromise.
It combines elastic scale and pay-for-what-you-use pricing with the controls you expect: per-instance concurrency, longer executions, VNet/private networking, and Always Ready instances to minimize cold starts. Since launching GA at Ignite 2024 last year, Flex Consumption has had tremendous growth with over 1.5 billion function executions per day and nearly 40 thousand apps.

Here’s what’s new for Ignite 2025:

- 512 MB instance size (GA). Right-size lighter workloads, scale farther within default quota.
- Availability Zones (GA). Distribute instances across zones.
- Rolling updates (Public Preview). Unlock zero-downtime deployments of code or config by setting a single configuration. See below for more information.
- Even more improvements including: new diagnostic settings to route logs/metrics, use of Key Vault and App Config references, new regions, and Custom Handler support.

To get started, review Flex Consumption samples, or dive into the documentation to see how Flex can support your workloads.

Migrating to Azure Functions Flex Consumption

Migrating to Flex Consumption is simple with our step-by-step guides and agentic tools. Move your Azure Functions apps or AWS Lambda workloads, update your code and configuration, and take advantage of new automation tools. With Linux Consumption retiring, now is the time to switch.

For more information, see:

- Migrate Consumption plan apps to the Flex Consumption plan
- Migrate AWS Lambda workloads to Azure Functions

Durable Functions

Durable Functions introduces powerful new features to help you build resilient, production-ready workflows:

- Distributed Tracing: lets you track requests across components and systems, giving you deep visibility into orchestration and activities with support for App Insights and OpenTelemetry.
- Extended Sessions support in .NET isolated: improves performance by caching orchestrations in memory, ideal for fast sequential activities and large fan-out/fan-in patterns.
- Orchestration versioning (public preview): enables zero-downtime deployments and backward compatibility, so you can safely roll out changes without disrupting in-flight workflows.

Durable Task Scheduler Updates

- Durable Task Scheduler Dedicated SKU (GA): Now generally available, the Dedicated SKU offers advanced orchestration for complex workflows and intelligent apps. It provides predictable pricing for steady workloads, automatic checkpointing, state protection, and advanced monitoring for resilient, reliable execution.
- Durable Task Scheduler Consumption SKU (Public Preview): The new Consumption SKU brings serverless, pay-as-you-go orchestration to dynamic and variable workloads. It delivers the same orchestration capabilities with flexible billing, making it easy to scale intelligent applications as needed.

For more information, see: https://aka.ms/dts-ga-blog

OpenTelemetry support in GA

Azure Functions OpenTelemetry is now generally available, bringing unified, production-ready observability to serverless applications. Developers can now export logs, traces, and metrics using open standards, enabling consistent monitoring and troubleshooting across every workload.

Key capabilities include:

- Unified observability: Standardize logs, traces, and metrics across all your serverless workloads for consistent monitoring and troubleshooting.
- Vendor-neutral telemetry: Integrate seamlessly with Azure Monitor or any OpenTelemetry-compliant backend, ensuring flexibility and choice.
- Broad language support: Works with .NET (isolated), Java, JavaScript, Python, PowerShell, and TypeScript.

Start using OpenTelemetry in Azure Functions today to unlock standards-based observability for your apps. For step-by-step guidance on enabling OpenTelemetry and configuring exporters for your preferred backend, see the documentation.

Deployment with Rolling Updates (Preview)

Achieving zero-downtime deployments has never been easier.
The Flex Consumption plan now offers rolling updates as a site update strategy. Set a single property, and all future code deployments and configuration changes will be released with zero downtime. Instead of restarting all instances at once, the platform now drains existing instances in batches while scaling out the latest version to match real-time demand. This ensures uninterrupted in-flight executions and resilient throughput across your HTTP, non-HTTP, and Durable workloads, even during intensive scale-out scenarios. Rolling updates are now in public preview. Learn more at https://aka.ms/functions/rolling-updates.

Secure Identity and Networking Everywhere By Design

Security and trust are paramount. Azure Functions incorporates proven best practices by design, with full support for managed identity, eliminating secrets and simplifying secure authentication and authorization. Flex Consumption and other plans offer enterprise-grade networking features like VNETs, private endpoints, and NAT gateways for deep protection. The Azure Portal streamlines secure function creation, and updated scenarios and samples showcase these identity and networking capabilities in action. Built-in authentication (discussed above) enables inbound client traffic to use identity as well. Check out our updated Functions Scenarios page with quickstarts or our secure samples gallery to see these identity and networking best practices in action.

.NET 10

Azure Functions now supports .NET 10, bringing in a great suite of new features and performance benefits for your code. .NET 10 is supported on the isolated worker model, and it’s available for all plan types except Linux Consumption. As a reminder, support ends for the legacy in-process model on November 10, 2026, and the in-process model is not being updated with .NET 10. To stay supported and take advantage of the latest features, migrate to the isolated worker model.
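As a side note on the rolling updates described in the Deployment with Rolling Updates section above, the idea of draining existing instances in batches while the new version scales out can be pictured with a small toy model. This is only an illustrative sketch and not the platform's actual algorithm: the function name rolling_update, the batch_size and demand parameters, and the rule of adding new instances before each drain are all assumptions made for the example.

```python
def rolling_update(old_instances: int, batch_size: int, demand: int) -> list[tuple[int, int]]:
    """Toy model of a rolling update (hypothetical, for illustration only).

    Returns the (old, new) instance counts after each step.
    """
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")

    old, new = old_instances, 0
    history = [(old, new)]
    while old > 0:
        batch = min(batch_size, old)
        # Assumption: scale out the new version first, so that total
        # capacity does not drop below demand once the batch is drained.
        needed_after_drain = max(demand - (old - batch), 0)
        new = max(new, needed_after_drain)
        # Then drain one batch of old instances (in-flight work is
        # assumed to complete before the instances go away).
        old -= batch
        history.append((old, new))
    return history


steps = rolling_update(old_instances=6, batch_size=2, demand=6)
for old, new in steps:
    print(f"old={old} new={new} total={old + new}")
```

In this model the total instance count stays at or above the demand at every step, which is the property that distinguishes a rolling update from restarting all instances at once.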
Aspire

Aspire is an opinionated stack that simplifies development of distributed applications in the cloud. The Azure Functions integration for Aspire enables you to develop, debug, and orchestrate an Azure Functions .NET project as part of an Aspire solution. Aspire publish deploys your functions directly to Azure Functions on Azure Container Apps. Aspire 13 includes an updated preview version of the Functions integration that acts as a release candidate with go-live support. The package will be moved to GA quality with Aspire 13.1.

Java 25, Node.js 24

Azure Functions now supports Java 25 and Node.js 24 in preview. You can now develop functions using these versions locally and deploy them to Azure Functions plans. Learn how to upgrade your apps to these versions here.

In Summary

Ready to build what’s next? Update your Azure Functions Core Tools today and explore the latest samples and quickstarts to unlock new capabilities for your scenarios. The guided quickstarts run and deploy in under 5 minutes, and incorporate best practices, from architecture to security to deployment. We’ve made it easier than ever to scaffold, deploy, and scale real-world solutions with confidence. The future of intelligent, scalable, and secure applications starts now. Jump in and see what you can create!