azure paas
180 TopicsContinued Investment in Azure App Service
This blog was originally published to the App Service team blog Recent Investments Premium v4 (Pv4) Azure App Service Premium v4 delivers higher performance and scalability on newer Azure infrastructure while preserving the fully managed PaaS experience developers rely on. Premium v4 offers expanded CPU and memory options, improved price-performance, and continued support for App Service capabilities such as deployment slots, integrated monitoring, and availability zone resiliency. These improvements help teams modernize and scale demanding workloads without taking on additional operational complexity. App Service Managed Instance App Service Managed Instance extends the App Service model to support Windows web applications that require deeper environment control. It enables plan-level isolation, optional private networking, and operating system customization while retaining managed scaling, patching, identity, and diagnostics. Managed Instance is designed to reduce migration friction for existing applications, allowing teams to move to a modern PaaS environment without code changes. Faster Runtime and Language Support Azure App Service continues to invest in keeping pace with modern application stacks. Regular updates across .NET, Node.js, Python, Java, and PHP help developers adopt new language versions and runtime improvements without managing underlying infrastructure. Reliability and Availability Improvements Ongoing investments in platform reliability and resiliency strengthen production confidence. Expanded Availability Zone support and related infrastructure improvements help applications achieve higher availability with more flexible configuration options as workloads scale. Deployment Workflow Enhancements Deployment workflows across Azure App Service continue to evolve, with ongoing improvements to GitHub Actions, Azure DevOps, and platform tooling. These enhancements reduce friction from build to production while preserving the managed App Service experience. A Platform That Grows With You These recent investments reflect a consistent direction for Azure App Service: active development focused on performance, reliability, and developer productivity. Improvements to runtimes, infrastructure, availability, and deployment workflows are designed to work together, so applications benefit from platform progress without needing to re-architect or change operating models. The recent General Availability of Aspire on Azure App Service is another example of this direction. Developers building distributed .NET applications can now use the Aspire AppHost model to define, orchestrate, and deploy their services directly to App Service — bringing a code-first development experience to a fully managed platform. We are also seeing many customers build and run AI-powered applications on Azure App Service, integrating models, agents, and intelligent features directly into their web apps and APIs. App Service continues to evolve to support these scenarios, providing a managed, scalable foundation that works seamlessly with Azure's broader AI services and tooling. Whether you are modernizing with Premium v4, migrating existing workloads using App Service Managed Instance, or running production applications at scale - including AI-enabled workloads - Azure App Service provides a predictable and transparent foundation that evolves alongside your applications. Azure App Service continues to focus on long-term value through sustained investment in a managed platform developers can rely on as requirements grow, change, and increasingly incorporate AI. Get Started Ready to build on Azure App Service? Here are some resources to help you get started: Create your first web app — Deploy a web app in minutes using the Azure portal, CLI, or VS Code. App Service documentation — Explore guides, tutorials, and reference for the full platform. Aspire on Azure App Service — Now generally available. Deploy distributed .NET applications to App Service using the Aspire AppHost model. Pricing and plans — Compare tiers including Premium v4 and find the right fit for your workload. App Service on Azure Architecture Center — Reference architectures and best practices for production deployments.53Views1like0CommentsThe Durable Task Scheduler Consumption SKU is now Generally Available
Today, we're excited to announce that the Durable Task Scheduler Consumption SKU has reached General Availability. Developers can now run durable workflows and agents on Azure with pay-per-use pricing, no storage to manage, no capacity to plan, and no idle costs. Just create a scheduler, connect your app, and start orchestrating. Whether you're coordinating AI agent workflows, processing event-driven pipelines, or running background jobs, the Consumption SKU is ready to go. GET STARTED WITH THE DURABLE TASK SCHEDULER CONSUMPTION SKU Since launching the Consumption SKU in public preview last November, we've seen incredible adoption and have incorporated feedback from developers around the world to ensure the GA release is truly production ready. “The Durable Task Scheduler has become a foundational piece of what we call ‘workflows’. It gives us the reliability guarantees we need for processing financial documents and sensitive workflows, while keeping the programming model straightforward. The combination of durable execution, external event correlation, deterministic idempotency, and the local emulator experience has made it a natural fit for our event-driven architecture. We have been delighted with the consumption SKUs cost model for our lower environments.”– Emily Lewis, CarMax What is the Durable Task Scheduler? If you're new to the Durable Task Scheduler, we recommend checking out our previous blog posts for a detailed background: Announcing Limited Early Access of the Durable Task Scheduler Announcing Workflow in Azure Container Apps with the Durable Task Scheduler Announcing Dedicated SKU GA & Consumption SKU Public Preview In brief, the Durable Task Scheduler is a fully managed orchestration backend for durable execution on Azure, meaning your workflows and agent sessions can reliably resume and run to completion, even through process failures, restarts, and scaling events. Whether you’re running workflows or orchestrating durable agents, it handles task scheduling, state persistence, fault tolerance, and built-in monitoring, freeing developers from the operational overhead of managing their own execution engines and storage backends. The Durable Task Scheduler works across Azure compute environments: Azure Functions: Using the Durable Functions extension across all Function App SKUs, including Flex Consumption. Azure Container Apps: Using the Durable Functions or Durable Task SDKs with built-in workflow support and auto-scaling. Any compute: Azure Kubernetes Service, Azure App Service, or any environment where you can run the Durable Task SDKs (.NET, Python, Java, JavaScript). Why choose the Consumption SKU? With the Consumption SKU you’re charged only for actions dispatched, with no minimum commitments or idle costs. There’s no capacity to size or throughput to reserve. Create a scheduler, connect your app, and you’re running. The Consumption SKU is a natural fit for workloads with unpredictable or bursty usage patterns: AI agent orchestration: Multi-step agent workflows that call LLMs, retrieve data, and take actions. Users trigger these on demand, so volume is spiky and pay-per-use avoids idle costs between bursts. Event-driven pipelines: Processing events from queues, webhooks, or streams with reliable orchestration and automatic checkpointing, where volumes spike and dip unpredictably. API-triggered workflows: User signups, form submissions, payment flows, and other request-driven processing where volume varies throughout the day. Distributed transactions: Retries and compensation logic across microservices with durable sagas that survive failures and restarts. What's included in the Consumption SKU at GA The Consumption SKU has been hardened based on feedback and real-world usage during the public preview. Here's what's included at GA: Performance Up to 500 actions per second: Sufficient throughput for a wide range of workloads, with the option to move to the Dedicated SKU for higher-scale scenarios. Up to 30 days of data retention: View and manage orchestration history, debug failures, and audit execution data for up to 30 days. Built-in monitoring dashboard Filter orchestrations by status, drill into execution history, view visual Gantt and sequence charts, and manage orchestrations (pause, resume, terminate, or raise events), all from the dashboard, secured with Role-Based Access Control (RBAC). Identity-based security The Consumption SKU uses Entra ID for authentication and RBAC for authorization. No SAS tokens or access keys to manage, just assign the appropriate role and connect. Get started with the Durable Task Scheduler today The Consumption SKU is available now Generally Available. Provision a scheduler in the Azure portal, connect your app, and start orchestrating. You only pay for what you use. Documentation Getting started Samples Pricing Consumption SKU docs We'd love to hear your feedback. Reach out to us by filing an issue on our GitHub repository97Views0likes0CommentsBuilding the agentic future together at JDConf 2026
JDConf 2026 is just weeks away, and I’m excited to welcome Java developers, architects, and engineering leaders from around the world for two days of learning and connection. Now in its sixth year, JDConf has become a place where the Java community compares notes on their real-world production experience: patterns, tooling, and hard-earned lessons you can take back to your team, while we keep moving the Java systems that run businesses and services forward in the AI era. This year’s program lines up with a shift many of us are seeing first-hand: delivery is getting more intelligent, more automated, and more tightly coupled to the systems and data we already own. Agentic approaches are moving from demos to backlog items, and that raises practical questions: what’s the right architecture, where do you draw trust boundaries, how do you keep secrets safe, and how do you ship without trading reliability for novelty? JDConf is for and by the people who build and manage the mission-critical apps powering organizations worldwide. Across three regional livestreams, you’ll hear from open source and enterprise practitioners who are making the same tradeoffs you are—velocity vs. safety, modernization vs. continuity, experimentation vs. operational excellence. Expect sessions that go beyond “what” and get into “how”: design choices, integration patterns, migration steps, and the guardrails that make AI features safe to run in production. You’ll find several practical themes for shipping Java in the AI era: connecting agents to enterprise systems with clear governance; frameworks and runtimes adapting to AI-native workloads; and how testing and delivery pipelines evolve as automation gets more capable. To make this more concrete, a sampling of sessions would include topics like Secrets of Agentic Memory Management (patterns for short- and long-term memory and safe retrieval), Modernizing a Java App with GitHub Copilot (end-to-end upgrade and migration with AI-powered technologies), and Docker Sandboxes for AI Agents (guardrails for running agent workflows without risking your filesystem or secrets). The goal is to help you adopt what’s new while hardening your long lived codebases. JDConf is built for community learning—free to attend, accessible worldwide, and designed for an interactive live experience in three time zones. You’ll not only get 23 practitioner-led sessions with production-ready guidance but also free on-demand access after the event to re-watch with your whole team. Pro tip: join live and get more value by discussing practical implications and ideas with your peers in the chat. This is where the “how” details and tradeoffs become clearer. JDConf 2026 Keynote Building the Agentic Future Together Rod Johnson, Embabel | Bruno Borges, Microsoft | Ayan Gupta, Microsoft The JDConf 2026 keynote features Rod Johnson, creator of the Spring Framework and founder of Embabel, joined by Bruno Borges and Ayan Gupta to explore where the Java ecosystem is headed in the agentic era. Expect a practitioner-level discussion on how frameworks like Spring continue to evolve, how MCP is changing the way agents interact with enterprise systems, and what Java developers should be paying attention to right now. Register. Attend. Earn. Register for JDConf 2026 to earn Microsoft Rewards points, which you can use for gift cards, sweepstakes entries, and more. Earn 1,000 points simply by signing up. When you register for any regional JDConf 2026 event with your Microsoft account, you'll automatically receive these points. Get 5,000 additional points for attending live (limited to the first 300 attendees per stream). On the day of your regional event, check in through the Reactor page or your email confirmation link to qualify. Disclaimer: Points are added to your Microsoft account within 60 days after the event. Must register with a Microsoft account email. Up to 10,000 developers eligible. Points will be applied upon registration and attendance and will not be counted multiple times for registering or attending at different events. Terms | Privacy JDConf 2026 Regional Live Streams Americas – April 8, 8:30 AM – 12:30 PM PDT (UTC -7) Bruno Borges hosts the Americas stream, discussing practical agentic Java topics like memory management, multi-agent system design, LLM integration, modernization with AI, and dependency security. Experts from Redis, IBM, Hammerspace, HeroDevs, AI Collective, Tekskills, and Microsoft share their insights. Register for Americas → Asia-Pacific – April 9, 10:00 AM – 2:00 PM SGT (UTC +8) Brian Benz and Ayan Gupta co-host the APAC stream, highlighting Java frameworks and practices for agentic delivery. Topics include Spring AI, multi-agent orchestration, spec-driven development, scalable DevOps, and legacy modernization, with speakers from Broadcom, Alibaba, CERN, MHP (A Porsche Company), and Microsoft. Register for Asia-Pacific → Europe, Middle East and Africa – April 9, 9:00 AM – 12:30 PM GMT (UTC +0) The EMEA stream, hosted by Sandra Ahlgrimm, will address the implementation of agentic Java in production environments. Topics include self-improving systems utilizing Spring AI, Docker sandboxes for agent workflow management, Retrieval-Augmented Generation (RAG) pipelines, modernization initiatives from a national tax authority, and AI-driven CI/CD enhancements. Presentations will feature experts from Broadcom, Docker, Elastic, Azul Systems, IBM, Team Rockstars IT, and Microsoft. Register for EMEA → Make It Interactive: Join Live Come prepared with an actual challenge you’re facing, whether you’re modernizing a legacy application, connecting agents to internal APIs, or refining CI/CD processes. Test your strategies by participating in live chats and Q&As with presenters and fellow professionals. If you’re attending with your team, schedule a debrief after the live stream to discuss how to quickly use key takeaways and insights in your pilots and projects. Learning Resources Java and AI for Beginners Video Series: Practical, episode-based walkthroughs on MCP, GenAI integration, and building AI-powered apps from scratch. Modernize Java Apps Guide: Step-by-step guide using GitHub Copilot agent mode for legacy Java project upgrades, automated fixes, and cloud-ready migrations. AI Agents for Java Webinar: Embedding AI Agent capabilities into Java applications using Microsoft Foundry, from project setup to production deployment. Java Practitioner’s Guide: Learning plan for deploying, managing, and optimizing Java applications on Azure using modern cloud-native approaches. Register Now JDConf 2026 is a free global event for Java teams. Join live to ask questions, connect, and gain practical patterns. All 23 sessions will be available on-demand. Register now to earn Microsoft Rewards points for attending. Register at JDConf.com137Views0likes0CommentsThe Agent that investigates itself
Azure SRE Agent handles tens of thousands of incident investigations each week for internal Microsoft services and external teams running it for their own systems. Last month, one of those incidents was about the agent itself. Our KV cache hit rate alert started firing. Cached token percentage was dropping across the fleet. We didn't open dashboards. We simply asked the agent. It spawned parallel subagents, searched logs, read through its own source code, and produced the analysis. First finding: Claude Haiku at 0% cache hits. The agent checked the input distribution and found that the average call was ~180 tokens, well below Anthropic’s 4,096-token minimum for Haiku prompt caching. Structurally, these requests could never be cached. They were false positives. The real regression was in Claude Opus: cache hit rate fell from ~70% to ~48% over a week. The agent correlated the drop against the deployment history and traced it to a single PR that restructured prompt ordering, breaking the common prefix that caching relies on. It submitted two fixes: one to exclude all uncacheable requests from the alert, and the other to restore prefix stability in the prompt pipeline. That investigation is how we develop now. We rarely start with dashboards or manual log queries. We start by asking the agent. Three months earlier, it could not have done any of this. The breakthrough was not building better playbooks. It was harness engineering: enabling the agent to discover context as the investigation unfolded. This post is about the architecture decisions that made it possible. Where we started In our last post, Context Engineering for Reliable AI Agents: Lessons from Building Azure SRE Agent, we described how moving to a single generalist agent unlocked more complex investigations. The resolution rates were climbing, and for many internal teams, the agent could now autonomously investigate and mitigate roughly 50% of incidents. We were moving in the right direction. But the scores weren't uniform, and when we dug into why, the pattern was uncomfortable. The high-performing scenarios shared a trait: they'd been built with heavy human scaffolding. They relied on custom response plans for specific incident types, hand-built subagents for known failure modes, and pre-written log queries exposed as opaque tools. We weren’t measuring the agent’s reasoning – we were measuring how much engineering had gone into the scenario beforehand. On anything new, the agent had nowhere to start. We found these gaps through manual review. Every week, engineers read through lower-scored investigation threads and pushed fixes: tighten a prompt, fix a tool schema, add a guardrail. Each fix was real. But we could only review fifty threads a week. The agent was handling ten thousand. We were debugging at human speed. The gap between those two numbers was where our blind spots lived. We needed an agent powerful enough to take this toil off us. An agent which could investigate itself. Dogfooding wasn't a philosophy - it was the only way to scale. The Inversion: Three bets The problem we faced was structural - and the KV cache investigation shows it clearly. The cache rate drop was visible in telemetry, but the cause was not. The agent had to correlate telemetry with deployment history, inspect the relevant code, and reason over the diff that broke prefix stability. We kept hitting the same gap in different forms: logs pointing in multiple directions, failure modes in uninstrumented paths, regressions that only made sense at the commit level. Telemetry showed symptoms, but not what actually changed. We'd been building the agent to reason over telemetry. We needed it to reason over the system itself. The instinct when agents fail is to restrict them: pre-write the queries, pre-fetch the context, pre-curate the tools. It feels like control. In practice, it creates a ceiling. The agent can only handle what engineers anticipated in advance. The answer is an agent that can discover what it needs as the investigation unfolds. In the KV cache incident, each step, from metric anomaly to deployment history to a specific diff, followed from what the previous step revealed. It was not a pre-scripted path. Navigating towards the right context with progressive discovery is key to creating deep agents which can handle novel scenarios. Three architectural decisions made this possible – and each one compounded on the last. Bet 1: The Filesystem as the Agent's World Our first bet was to give the agent a filesystem as its workspace instead of a custom API layer. Everything it reasons over – source code, runbooks, query schemas, past investigation notes – is exposed as files. It interacts with that world using read_file, grep, find, and shell. No SearchCodebase API. No RetrieveMemory endpoint. This is an old Unix idea: reduce heterogeneous resources to a single interface. Coding agents already work this way. It turns out the same pattern works for an SRE agent. Frontier models are trained on developer workflows: navigating repositories, grepping logs, patching files, running commands. The filesystem is not an abstraction layered on top of that prior. It matches it. When we materialized the agent’s world as a repo-like workspace, our human "Intent Met" score - whether the agent's investigation addressed the actual root cause as judged by the on-call engineer - rose from 45% to 75% on novel incidents. But interface design is only half the story. The other half is what you put inside it. Code Repositories: the highest-leverage context Teams had prewritten log queries because they did not trust the agent to generate correct ones. That distrust was justified. Models hallucinate table names, guess column schemas, and write queries against the wrong cluster. But the answer was not tighter restriction. It was better grounding. The repo is the schema. Everything else is derived from it. When the agent reads the code that produces the logs, query construction stops being guesswork. It knows the exact exceptions thrown, and the conditions under which each path executes. Stack traces start making sense, and logs become legible. But beyond query grounding, code access unlocked three new capabilities that telemetry alone could not provide: Ground truth over documentation. Docs drift and dashboards show symptoms. The code is what the service actually does. In practice, most investigations only made sense when logs were read alongside implementation. Point-in-time investigation. The agent checks out the exact commit at incident time, not current HEAD, so it can correlate the failure against the actual diffs. That's what cracked the KV cache investigation: a PR broke prefix stability, and the diff was the only place this was visible. Without commit history, you can't distinguish a code regression from external factors. Reasoning even where telemetry is absent. Some code paths are not well instrumented. The agent can still trace logic through source and explain behavior even when logs do not exist. This is especially valuable in novel failure modes – the ones most likely to be missed precisely because no one thought to instrument them. Memory as a filesystem, not a vector store Our first memory system used RAG over past session learnings. It had a circular dependency: a limited agent learned from limited sessions and produced limited knowledge. Garbage in, garbage out. But the deeper problem was retrieval. In SRE Context, embedding similarity is a weak proxy for relevance. “KV cache regression” and “prompt prefix instability” may be distant in embedding space yet still describe the same causal chain. We tried re-ranking, query expansion, and hybrid search. None fixed the core mismatch between semantic similarity and diagnostic relevance. We replaced RAG with structured Markdown files that the agent reads and writes through its standard tool interface. The model names each file semantically: overview.md for a service summary, team.md for ownership and escalation paths, logs.md for cluster access and query patterns, debugging.md for failure modes and prior learnings. Each carry just enough context to orient the agent, with links to deeper files when needed. The key design choice was to let the model navigate memory, not retrieve it through query matching. The agent starts from a structured entry point and follows the evidence toward what matters. RAG assumes you know the right query before you know what you need. File traversal lets relevance emerge as context accumulates. This removed chunking, overlap tuning, and re-ranking entirely. It also proved more accurate, because frontier models are better at following context than embeddings are at guessing relevance. As a side benefit, memory state can be snapshotted periodically. One problem remains unsolved: staleness. When two sessions write conflicting patterns to debugging.md, the model must reconcile them. When a service changes behavior, old entries can become misleading. We rely on timestamps and explicit deprecation notes, but we do not have a systemic solution yet. This is an active area of work, and anyone building memory at scale will run into it. The sandbox as epistemic boundary The filesystem also defines what the agent can see. If something is not in the sandbox, the agent cannot reason about it. We treat that as a feature, not a limitation. Security boundaries and epistemic boundaries are enforced by the same mechanism. Inside that boundary, the agent has full execution: arbitrary bash, python, jq, and package installs through pip or apt. That scope unlocks capabilities we never would have built as custom tools. It opens PRs with gh cli, like the prompt-ordering fix from KV cache incident. It pushes Grafana dashboards, like a cache-hit-rate dashboard we now track by model. It installs domain-specific CLI tools mid-investigation when needed. No bespoke integration required, just a shell. The recurring lesson was simple: a generally capable agent in the right execution environment outperforms a specialized agent with bespoke tooling. Custom tools accumulate maintenance costs. Shell commands compose for free. Bet 2: Context Layering Code access tells the agent what a service does. It does not tell the agent what it can access, which resources its tools are scoped to, or where an investigation should begin. This gap surfaced immediately. Users would ask "which team do you handle incidents for?" and the agent had no answer. Tools alone are not enough. An integration also needs ambient context so the model knows what exists, how it is configured, and when to use it. We fixed this with context hooks: structured context injected at prompt construction time to orient the agent before it takes action. Connectors - what can I access? A manifest of wired systems such as Log Analytics, Outlook, and Grafana, along with their configuration. Repositories - what does this system do? Serialized repo trees, plus files like AGENTS.md, Copilot.md, and CLAUDE.md with team-specific instructions. Knowledge map - what have I learned before? A two-tier memory index with a top-level file linking to deeper scenario-specific files, so the model can drill down only when needed. Azure resource topology - where do things live? A serialized map of relationships across subscriptions, resource groups, and regions, so investigations start in the right scope. Together, these context hooks turn a cold start into an informed one. That matters because a bad early choice does not just waste tokens. It sends the investigation down the wrong trajectory. A capable agent still needs to know what exists, what matters, and where to start. Bet 3: Frugal Context Management Layered context creates a new problem: budget. Serialized repo trees, resource topology, connector manifests, and a memory index fill context fast. Once the agent starts reading source files and logs, complex incidents hit context limits. We needed our context usage to be deliberately frugal. Tool result compression via the filesystem Large tool outputs are expensive because they consume context before the agent has extracted any value from them. In many cases, only a small slice or a derived summary of that output is actually useful. Our framework exposes these results as files to the agent. The agent can then use tools like grep, jq, or python to process them outside the model interface, so that only the final result enters context. The filesystem isn't just a capability abstraction - it's also a budget management primitive. Context Pruning and Auto Compact Long investigations accumulate dead weight. As hypotheses narrow, earlier context becomes noise. We handle this with two compaction strategies. Context Pruning runs mid-session. When context usage crosses a threshold, we trim or drop stale tool calls and outputs - keeping the window focused on what still matters. Auto-Compact kicks in when a session approaches its context limit. The framework summarizes findings and working hypotheses, then resumes from that summary. From the user's perspective, there's no visible limit. Long investigations just work. Parallel subagents The KV cache investigation required reasoning along two independent hypotheses: whether the alert definition was sound, and whether cache behavior had actually regressed. The agent spawned parallel subagents for each task, each operating in its own context window. Once both finished, it merged their conclusions. This pattern generalizes to any task with independent components. It speeds up the search, keeps intermediate work from consuming the main context window, and prevents one hypothesis from biasing another. The Feedback loop These architectural bets have enabled us to close the original scaling gap. Instead of debugging the agent at human speed, we could finally start using it to fix itself. As an example, we were hitting various LLM errors: timeouts, 429s (too many requests), failures in the middle of response streaming, 400s from code bugs that produced malformed payloads. These paper cuts would cause investigations to stall midway and some conversations broke entirely. So, we set up a daily monitoring task for these failures. The agent searches for the last 24 hours of errors, clusters the top hitters, traces each to its root cause in the codebase, and submits a PR. We review it manually before merging. Over two weeks, the errors were reduced by more than 80%. Over the last month, we have successfully used our agent across a wide range of scenarios: Analyzed our user churn rate and built dashboards we now review weekly. Correlated which builds needed the most hotfixes, surfacing flaky areas of the codebase. Ran security analysis and found vulnerabilities in the read path. Helped fill out parts of its own Responsible AI review, with strict human review. Handles customer-reported issues and LiveSite alerts end to end. Whenever it gets stuck, we talk to it and teach it, ask it to update its memory, and it doesn't fail that class of problem again. The title of this post is literal. The agent investigating itself is not a metaphor. It is a real workflow, driven by scheduled tasks, incident triggers, and direct conversations with users. What We Learned We spent months building scaffolding to compensate for what the agent could not do. The breakthrough was removing it. Every prewritten query was a place we told the model not to think. Every curated tool was a decision made on its behalf. Every pre-fetched context was a guess about what would matter before we understood the problem. The inversion was simple but hard to accept: stop pre-computing the answer space. Give the model a structured starting point, a filesystem it knows how to navigate, context hooks that tell it what it can access, and budget management that keeps it sharp through long investigations. The agent that investigates itself is both the proof and the product of this approach. It finds its own bugs, traces them to root causes in its own code, and submits its own fixes. Not because we designed it to. Because we designed it to reason over systems, and it happens to be one. We are still learning. Staleness is unsolved, budget tuning remains largely empirical, and we regularly discover assumptions baked into context that quietly constrain the agent. But we have crossed a new threshold: from an agent that follows your playbook to one that writes the next one. Thanks to visagarwal for co-authoring this post.12KViews6likes0CommentsIndustry-Wide Certificate Changes Impacting Azure App Service Certificates
Executive Summary In early 2026, industry-wide changes mandated by browser applications and the CA/B Forum will affect both how TLS certificates are issued as well as their validity period. The CA/B Forum is a vendor body that establishes standards for securing websites and online communications through SSL/TLS certificates. Azure App Service is aligning with these standards for both App Service Managed Certificates (ASMC, free, DigiCert-issued) and App Service Certificates (ASC, paid, GoDaddy-issued). Most customers will experience no disruption. Action is required only if you pin certificates or use them for client authentication (mTLS). Update: February 17, 2026 We’ve published new Microsoft Learn documentation, Industry-wide certificate changes impacting Azure App Service , which provides more detailed guidance on these compliance-driven changes. The documentation also includes additional information not previously covered in this blog, such as updates to domain validation reuse, along with an expanding FAQ section. The Microsoft Learn documentation now represents the most complete and up-to-date overview of these changes. Going forward, any new details or clarifications will be published there, and we recommend bookmarking the documentation for the latest guidance. Who Should Read This? App Service administrators Security and compliance teams Anyone responsible for certificate management or application security Quick Reference: What’s Changing & What To Do Topic ASMC (Managed, free) ASC (GoDaddy, paid) Required Action New Cert Chain New chain (no action unless pinned) New chain (no action unless pinned) Remove certificate pinning Client Auth EKU Not supported (no action unless cert is used for mTLS) Not supported (no action unless cert is used for mTLS) Transition from mTLS Validity No change (already compliant) Two overlapping certs issued for the full year None (automated) If you do not pin certificates or use them for mTLS, no action is required. Timeline of Key Dates Date Change Action Required Mid-Jan 2026 and after ASMC migrates to new chain ASMC stops supporting client auth EKU Remove certificate pinning if used Transition to alternative authentication if the certificate is used for mTLS Mar 2026 and after ASC validity shortened ASC migrates to new chain ASC stops supporting client auth EKU Remove certificate pinning if used Transition to alternative authentication if the certificate is used for mTLS Actions Checklist For All Users Review your use of App Service certificates. If you do not pin these certificates and do not use them for mTLS, no action is required. If You Pin Certificates (ASMC or ASC) Remove all certificate or chain pinning before their respective key change dates to avoid service disruption. See Best Practices: Certificate Pinning. If You Use Certificates for Client Authentication (mTLS) Switch to an alternative authentication method before their respective key change dates to avoid service disruption, as client authentication EKU will no longer be supported for these certificates. See Sunsetting the client authentication EKU from DigiCert public TLS certificates. See Set Up TLS Mutual Authentication - Azure App Service Details & Rationale Why Are These Changes Happening? These updates are required by major browser programs (e.g., Chrome) and apply to all public CAs. They are designed to enhance security and compliance across the industry. Azure App Service is automating updates to minimize customer impact. What’s Changing? New Certificate Chain Certificates will be issued from a new chain to maintain browser trust. Impact: Remove any certificate pinning to avoid disruption. Removal of Client Authentication EKU Newly issued certificates will not support client authentication EKU. This change aligns with Google Chrome’s root program requirements to enhance security. Impact: If you use these certificates for mTLS, transition to an alternate authentication method. Shortening of Certificate Validity Certificate validity is now limited to a maximum of 200 days. Impact: ASMC is already compliant; ASC will automatically issue two overlapping certificates to cover one year. No billing impact. Frequently Asked Questions (FAQs) Will I lose coverage due to shorter validity? No. For App Service Certificate, App Service will issue two certificates to span the full year you purchased. Is this unique to DigiCert and GoDaddy? No. This is an industry-wide change. Do these changes impact certificates from other CAs? Yes. These changes are an industry-wide change. We recommend you reach out to your certificates’ CA for more information. Do I need to act today? If you do not pin or use these certs for mTLS, no action is required. Glossary ASMC: App Service Managed Certificate (free, DigiCert-issued) ASC: App Service Certificate (paid, GoDaddy-issued) EKU: Extended Key Usage mTLS: Mutual TLS (client certificate authentication) CA/B Forum: Certification Authority/Browser Forum Additional Resources Changes to the Managed TLS Feature Set Up TLS Mutual Authentication Azure App Service Best Practices – Certificate pinning DigiCert Root and Intermediate CA Certificate Updates 2023 Sunsetting the client authentication EKU from DigiCert public TLS certificates Feedback & Support If you have questions or need help, please visit our official support channels or the Microsoft Q&A, where our team and the community can assist you.4.8KViews1like0CommentsTransforming Retry-After Headers in Azure APIM: A Step-by-Step Guide
In this blog post, you'll learn how to customize the Retry-After response header in Azure API Management (APIM) rate-limiting policies, enhancing your API's flexibility and user experience. While it does not delve into the specifics of the rate-limit or rate-limit-by-key policies, it provides a practical guide for altering the Retry-After header. For detailed information on the rate-limit policy, please visit Azure API Management policy reference - rate-limit | Microsoft Learn. Understanding Rate Limiting: Protecting Your API from Overuse Rate limiting is a technique used to control how often requests are made to a resource. It helps prevent excessive or abusive use and ensures the resource is available to all users. Rate limiting is often used to protect against denial-of-service (DoS) attacks, which aim to overwhelm a network or server with too many requests, making it unavailable to legitimate users. It can also limit the number of requests from individual users to prevent a single user or group from monopolizing the resource. Azure API Management Rate Limit Policies In Azure, access to APIs is controlled using the following API Management policies: rate-limit rate-limit-by-key The implementation of these policies is straightforward but somewhat limited and less flexible, in my opinion. The Default Retry-After Header: What You Need to Know In the Azure APIM rate-limit policy documentation, it is mentioned that once the client's requests are throttled, the service starts returning a response header containing the time interval (in seconds) after which the client should retry the request. The default name of the header is Retry-After, and this name can be customized. For example: Retry-After: 60 However, in one use case for a customer, there was a requirement to provide a timestamp instead of a time interval as a header value. For example: Retry-After: 2020-05-04T12:23:41.6181792Z To implement this, the header value needs to change, but this is something that the rate-limit policy does not support. Customizing the Retry-After Header The basis for changing the response header value lies in the on-error scope. You can implement a policy like the following: <inbound> <base> <rate-limit-by-key calls="1000" renewal-period="60" counter-key="@(context.Request.IpAddress)" increment-condition="@(context.Response.StatusCode == 200)" remaining-calls-variable-name="remainingCallsPerIP" retry-after-header-name="Retry-After" remaining-calls-header-name="Requests-Remaining" retry-after-variable-name="retryAfter"> </rate-limit-by-key></inbound> <on-error> <choose> <when condition="@(context.LastError.Reason == " ratelimitexceeded")"=""> <set-header name="Retry-After" exists-action="override"> <value>@(DateTime.UtcNow.AddSeconds(context.Variables.GetValueOrDefault<int>("retryAfter")).ToString("o"))</int></value> </set-header> </when> </choose> <base> </on-error> Please refer to APIM predefined errors for policies here: Error handling in Azure API Management policies | Microsoft Learn Here, the key point is that whenever the APIM rate limit is reached, an error occurs, which is then captured in the on-error scope. To set or override the response header only in rate-limiting scenarios, you need to filter using the RateLimitExceeded error reason. After that, the exact error value is determined by adding the current UTC timestamp with the value of the retryAfter variable in seconds. With this, you have now customized the Retry-After header with a timestamp instead of a time interval (in seconds). Conclusion In conclusion, customizing the Retry-After response header in Azure API Management can significantly enhance the flexibility and user experience of your API services. By leveraging the on-error scope and handling the RateLimitExceeded error, you can provide a more informative and user-friendly response to clients when rate limits are exceeded. This approach not only meets specific customer requirements but also demonstrates the adaptability of Azure APIM in handling various scenarios. With these steps, you can ensure that your API remains robust, efficient, and user centric.399Views1like1CommentRun Playwright Tests on Cloud Browsers using Playwright Workspaces
This post walks through setting up and running Playwright UI and API tests on Azure Playwright Testing Service (Preview). It covers workspace setup, project configuration, remote browser execution, and viewing test reports and traces using Visual Studio or VS Code.1.6KViews1like0CommentsHow SRE Agent Pulls Logs from Grafana and Creates Jira Tickets Without Native Integrations
Your tools. Your workflows. SRE Agent adapts. SRE Agent natively integrates with PagerDuty, ServiceNow, and Azure Monitor. But your team might use Jira for incident tracking. Grafana for dashboards. Loki for logs. Prometheus for metrics. These aren't natively supported. That doesn't matter. SRE Agent supports MCP, the Model Context Protocol. Any MCP-compatible server extends the agent's capabilities. Connect your Grafana instance. Connect your Jira. The agent queries logs, correlates errors, and creates tickets with root cause analysis across tools that were never designed to talk to each other. The Scenario I built a grocery store app that simulates a realistic SRE scenario: an external supplier API starts rate limiting your requests. Customers see "Unable to check inventory" errors. The on-call engineer gets paged. The goal: SRE Agent should diagnose the issue by querying Loki logs through Grafana, identify the root cause, and create a Jira ticket with findings and recommendations. The app runs on Azure Container Apps with Loki for logs and Azure Managed Grafana for visualization. 👉 Deploy it yourself: github.com/dm-chelupati/grocery-sre-demo How I Set Up SRE Agent: Step by Step Step 1: Create SRE Agent I created an SRE Agent and gave it Reader access to my subscription Step 2: Connect to Grafana and Jira via MCP Neither MCP server had a remotely hosted option, and their stdio setup didn't match what SRE Agent supports. So I hosted them myself as Azure Container Apps: Grafana MCP Server — connects to my Azure Managed Grafana instance Atlassian MCP Server — connects to my Jira Cloud instance Now I have two endpoints SRE Agent can reach: https://ca-mcp-grafana.<env>.azurecontainerapps.io/mcp https://ca-mcp-jira.<env>.azurecontainerapps.io/mcp I added both to SRE Agent's MCP configuration as remotely hosted servers. Step 3: Create Sub-Agent with Tools and Instructions I created a sub-agent specifically for incident diagnosis with these tools enabled: Grafana MCP (for querying Loki logs) Atlassian MCP (for creating Jira tickets) Instructions were simple: You are expert in diagnosing applications running on Azure services. You need to use the Grafana tools to get the logs, metrics or traces and create a summary of your findings inside Jira as a ticket. use your knowledge base file loki-queries.md to learn about app configuration with loki and Query the loki for logs in Grafana. Step 4: Invoke Sub-Agent and Watch It Work I went to the SRE Agent chat and asked: @JiraGrafanaexpert: My container app ca-api-3syj3i2fat5dm in resource group rg-groceryapp is experiencing rate limit errors from a supplier API when checking product inventory. The agent: Queried Loki via Grafana MCP: {app="grocery-api"} |= "error" Found 429 rate limit errors spiking — 55+ requests hitting supplier API limits Identified root cause: SUPPLIER_RATE_LIMIT_429 from FreshFoods Wholesale API Created a Jira ticket: One prompt. Logs queried. Root cause identified. Ticket created with remediation steps. Making It Better: The Knowledge File SRE Agent can explore and discover how your apps are wired but you can speed that up. When querying observability data sources, the agent needs to learn the schema, available labels, table structures, and query syntax. For Loki, that means understanding LogQL, knowing which labels your apps use, and what JSON fields appear in logs. SRE Agent can figure things out, but with context, it gets there faster — just like humans. I created a knowledge file that gives the agent a head start: With this context, the agent knows exactly which labels to query, what fields to extract from JSON logs, and which query patterns to use 👉 See my full knowledge file How MCP Makes This Possible SRE Agent supports two ways to connect MCP servers: stdio — runs locally via command. This works for MCP servers that can be invoked via npx, node, or uvx. For example: npx -y @modelcontextprotocol/server-github. Remotely hosted — HTTP endpoint with streamable transport: https://mcp-server.example.com/sse or /mcp The catch: Not every MCP server fits these options out of the box. Some servers only support stdio but not the npx/node/uvx formats SRE Agent expects. Others don't offer a hosted endpoint at all. The solution: host them yourself. Deploy the MCP server as a container with an HTTP endpoint. That's what I did with Grafana MCP Server and Atlassian MCP Server, deployed both as Azure Container Apps exposing /mcp endpoints. Why This Matters Enterprise tooling is fragmented across Azure and non-Azure ecosystems. Some teams use Azure Monitor, others use Datadog. Incident tracking might be ServiceNow in one org and Jira in another. Logs live in Loki, Splunk, Elasticsearch and sometimes all three. SRE Agent meets you where you are. Azure-native tools work out of the box. Everything else connects via MCP. Your observability stack stays the same. Your ticketing system stays the same. The agent becomes the orchestration layer that ties them together. One agent. Any tool. Intelligent workflows across your entire ecosystem. Try It Yourself Create an SRE Agent Deploy MCP servers for your tools (Grafana, Atlassian) Create a sub-agent with the MCP tools connected Add a knowledge file with your app context Ask it to diagnose an issue Watch logs become tickets. Errors become action items. Context becomes intelligence. Learn More Azure SRE Agent documentation Azure SRE Agent blogs Grocery SRE Demo repo MCP specification Azure SRE Agent is currently in preview.1.3KViews0likes0CommentsDeploy Dynatrace OneAgent on your Container Apps
TOC Introduction Setup References 1. Introduction Dynatrace OneAgent is an advanced monitoring tool that automatically collects performance data across your entire IT environment. It provides deep visibility into applications, infrastructure, and cloud services, enabling real-time observability. OneAgent supports multiple platforms, including containers, VMs, and serverless architectures, ensuring seamless monitoring with minimal configuration. It captures detailed metrics, traces, and logs, helping teams diagnose performance issues, optimize resources, and enhance user experiences. With AI-driven insights, OneAgent proactively detects anomalies and automates root cause analysis, making it an essential component for modern DevOps, SRE, and cloud-native monitoring strategies. 2. Setup 1. After registering your account, go to the control panel and search for Deploy OneAgent. 2. Obtain your Environment ID and create a PaaS token. Be sure to save them for later use. 3. In your local environment's console, log in to the Dynatrace registry. docker login -u XXX XXX.live.dynatrace.com # XXX is your Environment ID # Input PaaS Token when password prompt 4. Create a Dockerfile and an sshd_config file. FROM mcr.microsoft.com/devcontainers/javascript-node:20 # Change XXX into your Environment ID COPY --from=XXX.live.dynatrace.com/linux/oneagent-codemodules:all / / ENV LD_PRELOAD /opt/dynatrace/oneagent/agent/lib64/liboneagentproc.so # SSH RUN apt-get update \ && apt-get install -y --no-install-recommends dialog openssh-server tzdata screen lrzsz htop cron \ && echo "root:Docker!" | chpasswd \ && mkdir -p /run/sshd \ && chmod 700 /root/.ssh/ \ && chmod 600 /root/.ssh/id_rsa COPY ./sshd_config /etc/ssh/ # OTHER EXPOSE 2222 CMD ["/usr/sbin/sshd", "-D", "-o", "ListenAddress=0.0.0.0"] Port 2222 ListenAddress 0.0.0.0 LoginGraceTime 180 X11Forwarding yes Ciphers aes128-cbc,3des-cbc,aes256-cbc,aes128-ctr,aes192-ctr,aes256-ctr MACs hmac-sha2-256,hmac-sha2-512,hmac-sha1,hmac-sha1-96 StrictModes yes SyslogFacility DAEMON PasswordAuthentication yes PermitEmptyPasswords no PermitRootLogin yes Subsystem sftp internal-sftp AllowTcpForwarding yes 5. Build the container and push it to Azure Container Registry (ACR). # YYY is your ACR name docker build -t oneagent:202503201710 . --no-cache # you could setup your own image name docker tag oneagent:202503201710 YYY.azurecr.io/oneagent:202503201710 docker push YYY.azurecr.io/oneagent:202503201710 6. Create an Azure Container App (ACA), set Ingress to port 3000, allow all inbound traffic, and specify the ACR image you just created. 7. Once the container starts, open a console and run the following command to create a temporary HTTP server simulating a Node.js app. mkdir app && cd app echo 'console.log("Node.js app started...")' > index.js npm init -y npm install express cat <<EOF > server.js const express = require('express'); const app = express(); app.get('/', (req, res) => res.send('hello')); app.listen(3000, () => console.log('Server running on port 3000')); EOF # Please Press Ctrl + C to terminate the next command and run again for 3 times node server.js 8. You should now see the results on the ACA homepage. 9. Go back to the Dynatrace control panel, search for Host Classic, and you should see the collected data. 3. References Integrate OneAgent on Azure App Service for Linux and containers — Dynatrace Docs2.3KViews0likes1CommentSecure Unique Default Hostnames Now GA for Functions and Logic Apps
We are pleased to announce that Secure Unique Default Hostnames are now generally available (GA) for Azure Functions and Logic Apps (Standard). This expands the security model previously available for Web Apps to the entire App Service ecosystem and provides customers with stronger, more secure, and standardized hostname behavior across all workloads. Why This Feature Matters Historically, App Service resources have used default hostname format such as: <SiteName>.azurewebsites.net While straightforward, this pattern introduced potential security risks, particularly in scenarios where DNS records were left behind after deleting a resource. In those situations, a different user could create a new resource with the same name and unintentionally receive traffic or bindings associated with the old DNS configuration, creating opportunities for issues such as subdomain takeover. Secure Unique Default Hostnames address this by assigning a unique, randomized, region‑scoped hostname to each resource, for example: <SiteName>-<Hash>.<Region>.azurewebsites.net This change ensures that: No other customer can recreate the same default hostname. Apps inherently avoid risks associated with dangling DNS entries. Customers gain a more secure, reliable baseline behavior across App Service. Adopting this model now helps organizations prepare for the long‑term direction of the platform while improving security posture today. What’s New: GA Support for Functions and Logic Apps With this release, both Azure Functions and Logic Apps (Standard) fully support the Secure Unique Default Hostname capability. This brings these services in line with Web Apps and ensures customers across all App Service workloads benefit from the same secure and consistent default hostname model. Azure CLI Support The Azure CLI for Web Apps and Function Apps now includes support for the “--domain-name-scope” parameter. This allows customers to explicitly specify the scope used when generating a unique default hostname during resource creation. Examples: az webapp create --domain-name-scope {NoReuse, ResourceGroupReuse, SubscriptionReuse, TenantReuse} az functionapp create --domain-name-scope {NoReuse, ResourceGroupReuse, SubscriptionReuse, TenantReuse} Including this parameter ensures that deployments consistently use the intended hostname scope and helps teams prepare their automation and provisioning workflows for the secure unique default hostname model. Why Customers Should Adopt This Now While existing resources will continue to function normally, customers are strongly encouraged to adopt Secure Unique Default Hostnames for all new deployments. Early adoption provides several important benefits: A significantly stronger security posture. Protection against dangling DNS and subdomain takeover scenarios. Consistent default hostname behavior as the platform evolves. Alignment with recommended deployment practices and modern DNS hygiene. This feature represents the current best practice for hostname management on App Service and adopting it now helps ensure that new deployments follow the most secure and consistent model available. Recommended Next Steps Enable Secure Unique Default Hostnames for all new Web Apps, Function Apps, and Logic Apps. Update automation and CLI scripts to include the --domain-name-scope parameter when creating new resources. Begin updating internal documentation and operational processes to reflect the new hostname pattern. Additional Resources For readers who want to explore the technical background and earlier announcements in more detail, the following articles offer deeper coverage of unique default hostnames: Public Preview: Creating Web App with a Unique Default Hostname This article introduces the foundational concepts behind unique default hostnames. It explains why the feature was created, how the hostname format works, and provides examples and guidance for enabling the feature when creating new resources. Secure Unique Default Hostnames: GA on App Service Web Apps and Public Preview on Functions This article provides the initial GA announcement for Web Apps and early preview details for Functions. It covers the security benefits, recommended usage patterns, and guidance on how to handle existing resources that were created without unique default hostnames. Conclusion Secure Unique Default Hostnames now provide a more secure and consistent default hostname model across Web Apps, Function Apps, and Logic Apps. This enhancement reduces DNS‑related risks and strengthens application security, and organizations are encouraged to adopt this feature as the standard for new deployments.735Views0likes0Comments