Azure Container Registry

Build and Deploy a Microsoft Foundry Hosted Agent: A Hands-On Workshop
Agents are easy to demo, hard to ship. Most teams can put together a convincing prototype quickly. The harder part starts afterwards: shaping deterministic tools, validating behaviour with tests, building a CI path, packaging for deployment, and proving the experience through a user-facing interface. That is where many promising projects slow down. This workshop helps you close that gap without unnecessary friction. You get a guided path from local run to deployment handoff, then complete the journey with a working chat UI that calls your deployed hosted agent through the project endpoint.

What You Will Build

This is a hands-on, end-to-end learning experience for building and deploying AI agents with Microsoft Foundry. The lab provides a guided, practical journey through hosted-agent development, including deterministic tool design, prompt-guided workflows, CI validation, deployment preparation, and UI integration. It is a prompt-based development lab, built on .NET 10, that uses Copilot guidance and MCP-assisted workflow options during deployment, and it is designed to reduce setup friction with a ready-to-run experience. It covers local development, Copilot-assisted coding, CI, secure deployment to Azure, and a working chat UI.
By the end of the lab, you will have:

- A local hosted agent that responds on the responses contract
- Deterministic tool improvements in core logic with xUnit coverage
- A GitHub Actions CI workflow for restore, build, test, and container validation
- An Azure-ready deployment path using azd, ACR image publishing, and Foundry manifest apply
- A Blazor chat UI that calls openai/v1/responses with agent_reference
- A repeatable implementation shape that teams can adapt to real projects

Who This Lab Is For

- AI developers and software engineers who prefer learning by building
- Motivated beginners who want a guided, step-by-step path
- Experienced developers who want a practical hosted-agent reference implementation
- Architects evaluating deployment shape, validation strategy, and operational readiness
- Technical decision-makers who need to see how demos become deployable systems

Why Hosted Agents

Hosted agents run your code in a managed environment. That matters because it reduces the amount of infrastructure plumbing you need to manage directly, while giving you a clearer path to secure, observable, team-friendly deployments. Prompt-only demos are still useful. They are quick, excellent for ideation, and often the right place to start. Hosted agents complement that approach when you need custom code, tool-backed logic, and a deployment process that can be repeated by a team. Think of this lab as the bridge: you keep the speed of prompt-based iteration, then layer in the real-world patterns needed to run reliably.

What You Will Learn

1) Orchestration

You will practise workflow-oriented reasoning through implementation-shape recommendations and multi-step readiness scenarios. The lab introduces orchestration concepts at a practical level, rather than as a dedicated orchestration framework deep dive.

2) Tool Integration

You will connect deterministic tools and understand how tool calls fit into predictable execution paths. This is a core focus of the workshop and is backed by tests in the solution.
3) Retrieval Patterns (What This Lab Covers Today)

This workshop does not include a full RAG implementation with embeddings and vector search. Instead, it focuses on deterministic local tools and hosted-agent response flow, giving you a strong foundation before adding retrieval infrastructure in a follow-on phase.

4) Observability

You will see light observability foundations through OpenTelemetry usage in the host and practical verification during local and deployed checks. This is introductory coverage intended to support debugging and confidence building.

5) Responsible AI

You will apply production-minded safety basics, including secure secret handling and review hygiene. A full Responsible AI policy and evaluation framework is not the primary goal of this workshop, but the workflow does encourage safe habits from the start.

6) Secure Deployment Path

You will move from local implementation to Azure deployment with a secure, practical workflow: azd provisioning, ACR publishing, manifest deployment, hosted-agent start, status checks, and endpoint validation.

The Learning Journey

The overall flow is simple and memorable:

clone -> open -> run -> iterate -> deploy -> observe

You are not expected to memorize every command. The lab is structured to help you learn through small, meaningful wins that build confidence.

Your First 15 Minutes: Quick Wins

- Open the repo and understand the lab structure in a few minutes
- Set project endpoint and model deployment environment variables
- Run the host locally and validate the responses endpoint
- Inspect the deterministic tools in WorkshopLab.Core
- Run tests and see how behaviour changes are verified
- Review the deployment path so local work maps to Azure steps
- Understand how the UI validates end-to-end behaviour after deployment
- Leave the first session with a working baseline and a clear next step

That first checkpoint is important.
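That local validation step can be as small as a single request against the host's responses endpoint. The sketch below is illustrative only: the port, model, and agent_reference payload shape are assumptions, not the lab's actual values — the endpoint path openai/v1/responses is the one the workshop uses.

```shell
# Build a minimal request body for the local responses endpoint.
# ENDPOINT port and PAYLOAD shape are illustrative assumptions.
ENDPOINT="http://localhost:8080/openai/v1/responses"
PAYLOAD='{"agent_reference": {"name": "workshop-agent"}, "input": "ping"}'
echo "$PAYLOAD" | grep -q '"agent_reference"' && echo "payload ready"

# With the host running locally, send it:
#   curl -s "$ENDPOINT" -H "Content-Type: application/json" -d "$PAYLOAD"
```

A non-error JSON response from the host is the "working loop" checkpoint the lab aims for in the first session.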
Once you see a working loop on your own machine, the rest of the workshop becomes much easier to finish.

Using Copilot and MCP in the Workflow

This lab emphasises prompt-based development patterns that help you move faster while still learning the underlying architecture. You are not only writing code; you are learning to describe intent clearly, inspect generated output, and iterate with discipline. Copilot supports implementation and review in the coding labs. MCP appears as a practical deployment option for hosted-agent lifecycle actions, provided your tools are authenticated to the correct tenant and project context. Together, this creates a development rhythm that is especially useful for learning:

- Define intent with clear prompts
- Generate or adjust implementation details
- Validate behaviour through tests and UI checks
- Deploy and observe outcomes in Azure
- Refine based on evidence, not guesswork

That same rhythm transfers well to real projects. Even if your production environment differs, the patterns from this workshop are adaptable.

Production-Minded Tips

As you complete the lab, keep a production mindset from day one:

- Reliability: keep deterministic logic small, testable, and explicit
- Security: treat secrets, identity, and access boundaries as first-class concerns
- Observability: use telemetry and status checks to speed up debugging
- Governance: keep deployment steps explicit so teams can review and repeat them

You do not need to solve everything in one pass. The goal is to build habits that make your agent projects safer and easier to evolve.

Start Today

If you have been waiting for the right time to move from “interesting demo” to “practical implementation”, this is the moment. The workshop is structured for self-study, and the steps are designed to keep your momentum high.

Start here: https://github.com/microsoft/Hosted_Agents_Workshop_Lab

Want deeper documentation while you go?
These official guides are great companions:

- Hosted agent quickstart
- Hosted agent deployment guide

When you finish, share what you built. Post a screenshot or short write-up in a GitHub issue/discussion, on social, or in comments with one lesson learned. Your example can help the next developer get unstuck faster.

Copy/Paste Progress Checklist

[ ] Clone the workshop repo
[ ] Complete local setup and run the agent
[ ] Make one prompt-based behaviour change
[ ] Validate with tests and chat UI
[ ] Run CI checks
[ ] Provision and deploy via Azure and Foundry workflow
[ ] Review observability signals and refine
[ ] Share what I built + one takeaway

Common Questions

How long does it take? Most developers can complete a meaningful pass in a few focused sessions of 60-75 minutes. You can get the first local success quickly, then continue through deployment and refinement at your own pace.

Do I need an Azure subscription? Yes, for provisioning and deployment steps. You can still begin local development and testing before completing all Azure activities.

Is it beginner-friendly? Yes. The labs are written for beginners, run in sequence, and include expected outcomes for each stage.

Can I adapt it beyond .NET? Yes. The implementation in this workshop is .NET 10, but the architecture and development patterns can be adapted to other stacks.

What if I am evaluating for a team? This lab is a strong team evaluation asset because it demonstrates end-to-end flow: local dev, integration patterns, CI, secure deployment, and operational visibility.

Closing

This workshop gives you more than theory. It gives you a practical path from first local run to deployed hosted agent, backed by tests, CI, and a user-facing UI validation loop. If you want a build-first route into Microsoft Foundry hosted-agent development, this is an excellent place to start.
Begin now: https://github.com/microsoft/Hosted_Agents_Workshop_Lab

Azure Container Registry Premium SKU Now Supports 100 TiB Storage
Today, we're excited to announce that Azure Container Registry Premium SKU now supports up to 100 TiB of registry storage—a 2.5x increase from the previous 40 TiB limit, and a 5x increase from the original 20 TiB limit just two years ago. We've also improved geo-replication data sync speed, reducing data sync times for new replicas. In addition, we're introducing an updated Portal experience for storage capacity visibility—a long-standing customer request. You can now monitor your storage consumption directly from the Monitoring tab in the Azure Portal Overview blade, making it easier to track usage against your registry limits.

Imagine you're managing container infrastructure for a large enterprise. Your teams have embraced containerization, migrating critical workloads from VMs to containers for improved composability and deployment velocity. Meanwhile, your AI and machine learning teams are storing increasingly large model artifacts, agent tooling, and pipeline outputs in your registry. You've watched your storage consumption climb steadily toward the 40 TiB limit, and you're evaluating complex workarounds like splitting workloads across multiple registries. With today's announcement, that constraint is lifted. Premium SKU registries now support up to 100 TiB, giving you the headroom to consolidate workloads and scale confidently.

Background: Container and AI Adoption Drive Storage Growth

Organizations continue to adopt containers at an accelerating pace. The migration from virtual machines to containerized architectures—driven by the composability, portability, and operational benefits of containers—shows no signs of slowing. At the same time, the AI revolution has introduced new storage demands: large language models, vision models, agent frameworks, and their associated tooling all require substantial registry capacity. These parallel trends have pushed many enterprises toward the previous 40 TiB limit faster than anticipated.
The Challenge: Storage Constraints at Scale

For organizations operating at scale, the 40 TiB limit created operational challenges:

- Multi-Registry Complexity: Teams were forced to split workloads across multiple registries, complicating access control, networking, and operational visibility.
- Architectural Workarounds: Some organizations implemented custom garbage collection and artifact lifecycle policies specifically to stay under limits, rather than based on actual retention requirements.
- Growth Planning Uncertainty: Rapidly growing AI workloads made capacity planning difficult, with some organizations uncertain whether they could consolidate new model artifacts in their primary registry.
- Geo-Replication Provisioning: Syncing data to new geo-replicas for expanding global footprints took longer than desired, slowing regional expansion.

Introducing 100 TiB Storage Limits

Premium SKU registries now support up to 100 TiB of storage—a 2.5x increase that provides substantial headroom for continued growth. This limit applies to the total storage across all repositories in a single registry. We've also improved geo-replication data sync speed when expanding your registry's global footprint with new replicas.

What's Changing

Aspect                    | Previous  | New
Premium SKU Storage Limit | 40 TiB    | 100 TiB
Basic/Standard SKU Limits | Unchanged | Unchanged

No Action Required

The new 100 TiB limit is automatically available for all Premium SKU registries. There's no migration, feature flag, or configuration change required—your registry can now grow beyond 40 TiB without any intervention.
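To see how much headroom the new ceiling gives an existing registry, you can compute consumption against the limit directly. A minimal sketch follows; in practice the used-bytes figure would come from `az acr show-usage` for your registry, and the value below is a made-up example of roughly 44 TiB:

```shell
# Illustrative capacity check against the new Premium limit.
# used_bytes is a placeholder; fetch the real value with `az acr show-usage`.
used_bytes=48378511622144
limit_bytes=$(( 100 * 1024 * 1024 * 1024 * 1024 ))   # 100 TiB in bytes
pct=$(( used_bytes * 100 / limit_bytes ))
echo "registry is at ${pct}% of the 100 TiB limit"
```

The same arithmetic is easy to wire into a dashboard or a scheduled alert once the usage value is pulled programmatically.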
Who Benefits

This storage increase is particularly valuable for:

- Enterprise platform teams managing centralized container registries for large organizations with hundreds of development teams
- AI and ML teams storing large model artifacts, training outputs, and inference containers
- Organizations migrating from VMs who are consolidating legacy workloads into containerized architectures
- Global enterprises using geo-replication across many regions, where storage is replicated to each replica

Some of the world's largest AI and financial services organizations have been operating near the previous limit and will benefit immediately from this increase.

Getting Started

Check Your Current Usage

You can view your registry's current storage consumption and the new 100 TiB limit in the Azure Portal (under the Monitoring tab in the Overview blade) or via CLI:

# View registry storage usage and limits. The registry size limit will be under MaximumStorageCapacity.
az acr show-usage --name myregistry --output table

The Portal, CLI, and REST API/SDKs all now reflect the increased 100 TiB capacity. You can programmatically query your registry's storage usage via the List Usages REST API, making it easy to integrate capacity monitoring into your existing tooling and dashboards.

Upgrade to Premium SKU

The 100 TiB storage limit is exclusive to Premium SKU. If you're on Basic or Standard and need higher storage capacity, upgrading to Premium unlocks the full 100 TiB limit along with geo-replication, enhanced throughput, private endpoints, and other enterprise features:

# Upgrade to Premium SKU
az acr update --name myregistry --sku Premium

Related Resources

- Azure Container Registry service tiers and limits
- Geo-replication in Azure Container Registry
- Best practices for Azure Container Registry
- Azure Container Registry pricing
- List Usages REST API

Health-Aware Failover for Azure Container Registry Geo-Replication
Azure Container Registry (ACR) supports geo-replication: one registry resource with active-active (primary-primary), write-enabled geo-replicas across multiple Azure regions. You can push or pull through any replica, and ACR asynchronously replicates content and metadata to all other replicas using an eventual consistency model. For geo-replicated registries, ACR exposes a global endpoint like contoso.azurecr.io; that URL is backed by Azure Traffic Manager (TM), which routes requests to the replica with the best network performance profile (usually the closest region). That's the promise. But TM routing at the global endpoint was latency-aware, not fully workload-health-aware: it could see whether the regional front door responded, but not whether that region could successfully serve real pull and push traffic end to end. This post walks through how we connected ACR Health Monitor's deep dependency checks to Traffic Manager so the global endpoint avoids routing to degraded replicas, improving failover outcomes and reducing customer-facing errors during regional incidents.

The Problem: Healthy on the Outside, Broken on the Inside

Traffic Manager routes traffic using performance-based routing, directing each DNS query to the endpoint with the lowest latency for the caller. To decide whether an endpoint is viable, TM periodically probes a health endpoint — and for ACR, that health check tested exactly one thing: is the reverse proxy responding? The problem is that a container registry is much more than a web server. A successful docker pull touches storage (where layers and manifests live), caching infrastructure, authentication and authorization services, and the metadata service. Any one of those backend dependencies can fail independently while the reverse proxy keeps happily returning 200 OK to Traffic Manager's health probes.
This meant that during real outages — a storage degradation in a region, a caching failure, an authentication service disruption — Traffic Manager had no idea anything was wrong. It kept sending customers straight into a broken region, and those customers got 500 errors on their pull and push operations. We saw this pattern play out across multiple incidents: storage degradations, caching failures, VM outages, and full datacenter events — each lasting hours, all cases where geo-replicated registries had healthy replicas in other regions that could have served traffic, but Traffic Manager kept routing to the degraded region because the shallow health check passed.

The Manual Workaround (and Its Failure Mode)

Customers could work around this by manually disabling the affected endpoint:

az acr replication update --name contoso --region eastus --region-endpoint-enabled false

But this required customers to detect the outage, identify the affected region, and manually disable the endpoint — all during an active incident. Worse, in the most severe scenarios, the manual workaround could not be reliably executed. The endpoint-disable operation itself routes through the regional resource provider — the very infrastructure that's degraded. You can't tell the control plane to reroute traffic away from a region when the control plane in that region is the thing that's down. Customers were stuck.

How Health Monitor Solves This

ACR runs an internal service called Health Monitor within its data plane infrastructure. Its original job was narrowly scoped: it tracked the health of individual nodes so that the load balancer could route traffic to healthy instances within a region. What it didn't do was share that health signal with Traffic Manager for cross-region routing. We extended Health Monitor with a new deep health endpoint that aggregates the health status of multiple critical data plane dependencies.
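The aggregation idea can be shown in a few lines. This is a toy model, not ACR's implementation: each function is a stand-in for a real dependency probe, and the endpoint reports healthy only when every critical check passes.

```shell
# Toy model of a deep health endpoint: healthy only if all critical
# dependency checks pass. Each check is a stand-in that would, in a real
# system, probe storage, caching, auth, and so on.
check_storage() { return 0; }   # storage reachable
check_cache()   { return 0; }   # cache reachable
check_auth()    { return 1; }   # simulate an auth-pipeline outage

deep_health() {
  if check_storage && check_cache && check_auth; then
    echo "healthy"
  else
    echo "unhealthy"
  fi
}

status=$(deep_health)
echo "deep health: ${status}"
```

With the simulated auth outage, the endpoint reports unhealthy even though a shallow reverse-proxy probe would still see a responsive front door — which is exactly the gap the deep check closes.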
Rather than just asking "is the reverse proxy up?", this endpoint answers the real question: "can this region actually serve container registry requests right now?" Before we walk through the implementation details, here is a simplified before-and-after view (diagram: routing before vs. after health-aware failover).

What Gets Checked

The deep health endpoint evaluates the availability of:

- Storage — The storage layer that holds image layers and manifests. This is the most fundamental dependency; if storage is unreachable, no image operations can succeed.
- Caching infrastructure — Used for caching and distributed coordination. Failures here degrade push operations and can affect pull latency.
- Container availability — The health of the internal services that process registry API requests.
- Authentication services — The authorization pipeline that validates whether a caller has permission to pull or push.
- Metadata service — For registries using metadata search capabilities, the metadata service is also monitored.

If the health evaluation determines that the region cannot reliably serve requests, the endpoint returns unhealthy. Traffic Manager sees the failure, degrades the endpoint, and routes subsequent DNS queries to the next-lowest-latency replica — all automatically, with no customer intervention required.

Per-Registry Intelligence

Getting regional health right was the first step — but we needed to go further. A blunt "is the region healthy?" check would be too coarse. In each region, ACR distributes customer data across a large pool of storage accounts. A storage degradation might affect only a subset of those accounts — meaning most registries in the region are fine, and only those whose data lives on the affected accounts need to fail over. Health Monitor evaluates health on a per-registry basis. When a Traffic Manager probe arrives, Health Monitor determines which backing resources that specific registry depends on and evaluates health against those specific resources — not the region's overall health.
This means that if contoso.azurecr.io depends on resources that are experiencing errors but fabrikam.azurecr.io depends on healthy ones in the same region, only Contoso's traffic gets rerouted. Fabrikam keeps getting served locally with no unnecessary latency penalty. The same per-registry logic applies to other dependencies. If a registry has metadata search enabled and the metadata service is down, that registry's endpoint goes unhealthy. If another registry in the same region doesn't use metadata search, it stays healthy.

Tuning for Stability

Failing over too eagerly is almost as bad as not failing over at all. A transient blip shouldn't send traffic across the continent. We tuned the thresholds so that the endpoint is only marked unhealthy after a sustained pattern of failures — not a single transient error. The end-to-end failover timing — from the onset of a real dependency failure through Health Monitor detection, Traffic Manager probe cycles, and DNS TTL propagation — is on the order of minutes, not seconds. This is deliberately conservative: fast enough to catch real regional degradation, but slow enough to ride out the kind of transient errors that resolve on their own. For context, Traffic Manager itself probes endpoints every 30 seconds and requires multiple consecutive failures before degrading an endpoint, and DNS TTL adds additional propagation delay before all clients switch to the new region. It's worth noting that DNS-based failover has an inherent limitation: even after Traffic Manager updates its DNS response, existing clients may continue reaching the degraded endpoint until their local DNS cache expires. Docker daemons, container runtimes, and CI/CD systems all cache DNS resolutions. The failover is not instantaneous — but it is automatic, which is a dramatic improvement over the previous state where failover either required manual intervention or simply didn't happen.
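The timeline implied above can be worked through with back-of-envelope numbers. The 30-second probe interval comes from the post; the tolerated-failure count and DNS TTL below are illustrative assumptions, not ACR's actual settings:

```shell
# Rough worst-case cutover estimate: probe-based detection plus DNS TTL.
# probe_interval is stated in the post; the other two values are assumptions.
probe_interval=30      # seconds between Traffic Manager health probes
tolerated_failures=3   # consecutive failures before the endpoint is degraded
dns_ttl=60             # seconds a client may keep the old DNS answer cached

detection=$(( probe_interval * tolerated_failures ))
worst_case=$(( detection + dns_ttl ))
echo "detection ~${detection}s, client cutover up to ~${worst_case}s"
```

Even with these modest assumptions the cutover lands in the minutes range, which matches the deliberately conservative tuning described above.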
Health Monitor's Own Resilience

A natural question: what happens if Health Monitor itself fails? Health Monitor is designed to fail open. If the monitor process is unable to evaluate dependencies — because it has crashed, is restarting, or cannot reach a dependency to check its status — the health endpoint returns healthy, preserving the pre-existing routing behavior. This ensures that a Health Monitor failure cannot itself cause a false failover. The system degrades gracefully back to the original latency-based routing rather than introducing a new failure mode.

How Routing Changed

The change is transparent to customers. They still access their registry through the same myregistry.azurecr.io hostname. The difference is that the system behind that hostname is now actively steering them away from degraded regions instead of blindly routing on latency alone.

What Customers Should Know

For registries with geo-replication enabled, this improvement is automatic — no configuration changes or action required:

- Pull operations benefit the most. When traffic is rerouted to a healthy replica, image layers are served from that replica's storage. For images that have completed replication to the target region, pulls succeed seamlessly. For recently pushed images that haven't yet replicated, a pull from the failover region may not find the image until replication catches up. If your workflow pushes an image and immediately pulls from a different region, consider building in retry logic or checking replication status before pulling.
- Push operations are more nuanced. If failover or DNS re-resolution happens during an in-flight push, that push can fail and may need to be retried. This failure mode is not new to health-aware failover; it can already occur when DNS resolves a client to a different region during a push. During failover, customers should expect both higher push latency and a higher chance of retries for long-running uploads.
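A generic retry wrapper can absorb those transient failures. This is a hedged sketch, not ACR tooling; the image name in the usage comment is a placeholder.

```shell
# Retry a command with exponential backoff — useful around docker pull or
# docker push while a DNS-based failover propagates.
retry() {
  local max_attempts=$1; shift
  local attempt=1 delay=1
  until "$@"; do
    if [ "$attempt" -ge "$max_attempts" ]; then
      echo "failed after ${attempt} attempts: $*" >&2
      return 1
    fi
    echo "attempt ${attempt} failed; retrying in ${delay}s" >&2
    sleep "$delay"
    attempt=$(( attempt + 1 ))
    delay=$(( delay * 2 ))
  done
}

# Example (placeholder image name):
#   retry 5 docker pull myregistry.azurecr.io/app:v1
```

Because a retried push may partially repeat work, pairing a wrapper like this with idempotent publish steps keeps pipelines safe to re-run.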
For production pipelines, use retry logic and design publish steps to be idempotent. Single-region registries are unaffected by this change. Traffic Manager is only involved when replicas exist; registries without geo-replication continue to route directly to their single region. In the edge case where the only region is degraded, Traffic Manager has nowhere else to route, so it continues routing to the original endpoint — the same behavior as before.

Observability

When a failover occurs, customers can observe the routing change through several signals:

- Increased pull latency from a different region — if your monitoring shows image pull times increasing, it may indicate traffic has been rerouted to a more distant replica.
- Azure Resource Health — check the Resource Health blade for your registry to see if there's a known issue in your primary region.
- Replication status — the replication health API shows the status of each replica, which can help confirm whether a specific region is experiencing issues.

We're actively working on improving the observability story here — including richer signals for when routing changes occur and which region is currently serving your traffic.

Rollout and Safety

We rolled this out incrementally, following Azure's safe deployment practices across ring-based deployment stages. The migration involved updating each registry's Traffic Manager configuration to use the new deep health evaluation. This is controlled at the Traffic Manager level, making it straightforward to roll back a specific registry or region if needed. We also built in safeguards to quickly revert to the previous routing behavior. If Health Monitor's deep health evaluation were to malfunction and falsely report regions as unhealthy, we can disable it and revert to the original pass-through behavior — the same shallow health check as before — as a safety net.
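The replication-status signal mentioned under Observability can be scripted. A hedged sketch: the JSON below is a simplified stand-in for real replication-status output (in practice you would capture it from the CLI, e.g. `az acr replication list`), used here only to show the shape of the check.

```shell
# Simplified stand-in for replication-status JSON; flag replicas not Ready.
# Real output from `az acr replication list` nests status differently.
replicas='[{"location":"eastus","status":"Ready"},{"location":"westeurope","status":"Syncing"}]'

not_ready=$(echo "$replicas" | grep -o '"status":"[^"]*"' | grep -vc '"status":"Ready"')
echo "replicas not ready: ${not_ready}"
```

A check like this before a cross-region pull helps confirm whether a missing image is a replication lag rather than a registry fault.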
The Outcome

Since rolling out Health Monitor-based routing, geo-replicated registries now automatically fail over during the types of regional degradation events that previously required manual intervention or resulted in extended customer impact. The classes of incidents we tracked — storage outages, caching failures, VM disruptions, and authentication service degradation — now trigger automatic rerouting to healthy replicas. This is one piece of a broader effort to improve ACR's resilience for geo-replicated registries. Other recent and ongoing work includes improving replication consistency for rapid tag overwrites, enabling cross-region pull-through for images that haven't finished replicating, and optimizing the replication service's resource utilization for large registries. Geo-replication has always been ACR's answer to multi-region availability. Health Monitor makes sure that promise holds when it matters most — when something goes wrong. To learn more about ACR geo-replication, see Geo-replication in Azure Container Registry. To configure geo-replication for your registry, see Enable geo-replication.

Proactive Health Monitoring and Auto-Communication Now Available for Azure Container Registry
Today, we're introducing the latest service health enhancement for Azure Container Registry (ACR): automated communication through Azure Service Health alerts. When ACR detects degradation in critical operations—authentication, image push, and pull—your teams are now proactively notified through Azure Service Health, delivering better transparency and faster communication without waiting for manual incident reporting. For platform teams, SRE organizations, and enterprises with strict SLA requirements, this means container registry health events are now communicated automatically and integrated into your existing incident management and observability workflows.

Background: Why Registry Availability Matters

Container registries sit at the heart of modern software delivery. Every CI/CD pipeline build, every Kubernetes pod startup, and every production deployment depends on the ability to authenticate, push artifacts, and pull images reliably. When a registry experiences degradation—even briefly—the downstream impact can cascade quickly: failed pipelines, delayed deployments, and application startup failures across multiple clusters and environments. Until now, ACR customers discovered service issues primarily through two paths: monitoring their own workloads for symptoms (failed pulls, auth errors), or checking the Azure Status page reactively. Neither approach gives your team the head start needed to coordinate an effective response before impact is felt.

Auto-Communication Through Azure Service Health Alerts

ACR now provides faster communication when:

- Degradation is detected in your region
- Automated remediation is in progress
- Engineering teams have been engaged and are actively mitigating

These notifications arrive through Azure Service Health, the same platform your teams already use to track planned maintenance and health advisories across all your Azure resources.
You receive timely visibility into registry health events—with rich context including tracking IDs, affected regions, impacted resources, and mitigation timelines—without needing to open a support request or continuously monitor dashboards.

Who Benefits

This capability delivers value across every team that depends on container registry availability:

- Enterprise platform teams managing centralized registries for large organizations will receive early warning before CI/CD pipelines begin failing across hundreds of development teams.
- SRE organizations can integrate ACR health signals into their existing incident management workflows—via webhook integration with PagerDuty, Opsgenie, ServiceNow, and similar tools—rather than relying on synthetic monitoring or customer reports.
- Teams with strict SLA requirements can now correlate production incidents with documented ACR service events, supporting post-incident reviews and customer communication.
- All ACR customers gain a level of registry observability that previously required custom monitoring infrastructure to approximate.

A Part of ACR's Broader Observability Strategy

Automated Service Health communication is one component of ACR's ongoing investment in service health and observability. Combined with Azure Monitor metrics and diagnostic logs and events, Service Health alerts give your teams a layered observability posture:

Signal                | What It Tells You
Service Health alerts | ACR-wide service events in your regions, with official mitigation status
Azure Monitor metrics | Registry-level request rates, success rates, and storage utilization (available soon)
Diagnostic logs       | Repository- and operation-level audit trail

What's next: We are working on exposing additional ACR metrics through Azure Monitor, giving you deeper visibility into registry operations—such as authentication, pull and push API requests, and error breakdowns—directly in the Azure portal.
This will enable self-service diagnostics, allowing your teams to investigate and troubleshoot registry issues independently without opening a support request.

Getting Started

To configure Service Health alerts for ACR, navigate to Service Health in the Azure portal, create an alert rule filtering on Container Registry, and attach an action group with your preferred notification channels (email, SMS, webhook). Alerts can also be created programmatically via ARM templates or Bicep for infrastructure-as-code workflows. For the full step-by-step setup guide—including recommended alert configurations for production-critical, maintenance awareness, and comprehensive monitoring scenarios—see Configure Service Health alerts for Azure Container Registry.

Azure SRE Agent Now Builds Expertise Like Your Best Engineer: Introducing Deep Context
What if SRE Agent already knew your system before the next incident?

Your most experienced SRE didn't become an expert overnight. Day one: reading runbooks, studying architecture diagrams, asking a lot of questions. Month three: knowing which services are fragile, which config changes cascade, which log patterns mean real trouble. Year two: diagnosing a production issue at 2 AM from a single alert because they'd built deep, living context about your systems. That learning process of absorbing documentation, reading code, handling incidents, and building intuition from every interaction is what makes an expert. Azure SRE Agent can now do the same thing.

From pulling context to living in it

Azure SRE Agent already connects to Azure Monitor, PagerDuty, and ServiceNow. It queries Kusto logs, checks resource health, reads your code, and delivers root cause analysis, often resolving incidents without waking anyone up. Thousands of incidents handled. Thousands of engineering hours saved.

Deep Context takes this to the next level. Instead of accessing context on demand, your agent now lives in it: continuously reading your code and knowledge, building persistent memory from every interaction, and evolving its understanding of your systems in the background.

Three things make Deep Context work:

- Continuous access. Source code, terminal, Python runtime, and Azure environment are available whenever the agent needs them. Connected repos are cloned into the agent's workspace automatically. The agent knows your code structure from the first message.
- Persistent memory. Insights from previous investigations, architecture understanding, team context: it all persists across sessions. The next time the agent picks up an alert, it already knows what happened last time.
- Background intelligence. Even when you're not chatting, background services continuously learn. After every conversation, the agent extracts what worked, what failed, and what the root cause was.
It aggregates these across all past investigations to build evolving operational insights. The agent recognizes patterns you haven't noticed yet. One example: connected to Kusto, background scanning auto-discovers every table, documents schemas, and builds reusable query templates. But this learning applies broadly: every conversation, every incident, every data source makes the agent sharper.

Expertise that compounds with every incident

| | New on-call engineer | SRE Agent with Deep Context |
| --- | --- | --- |
| Alert fires | Opens runbook, looks up which service this maps to | Already knows the service, its dependencies, and failure patterns from prior incidents |
| Investigation | Reads logs, searches code, asks teammates | Goes straight to the relevant code path, correlates with logs and persistent insights from similar incidents |
| After 100 incidents | Becomes the team expert, with irreplaceable institutional knowledge | Same institutional knowledge, always available, never forgets, and scales across your entire organization |

A human expert takes months to build this depth. An agent with Deep Context builds it in days, and the knowledge compounds with every interaction.

You shape what your agent learns

Deep Context learns automatically, but the best results come when your team actively guides what the agent retains. Type #remember in chat to save important facts your agent should always know: environment details, escalation paths, team preferences. For example: "#remember our Redis cache uses Premium tier with 6GB" or "#remember database failover takes approximately 15 minutes." These are recalled automatically during future investigations.

Turn investigations into knowledge. After a good investigation, ask your agent to turn the resolution into a runbook: "Create a troubleshooting guide from the steps we just followed and save it to Knowledge settings."
The agent generates a structured document, uploads it, and indexes it, so the next time a similar issue occurs, the agent finds and follows that guide automatically. The agent captures insights from every conversation on its own. Your guidance tells it which ones matter most.

This is exactly how Microsoft's own SRE team gets the best results: "Whenever it gets stuck, we talk to it and teach it, ask it to update its memory, and it doesn't fail that class of problem again." Read the full story in The Agent That Investigates Itself.

See it in action: an Azure Monitor alert, end to end

An HTTP 5xx spike fires on your container app. Your agent is in autonomous mode. It acknowledges the alert, checks resource health, reads logs, and delivers a diagnosis. That's what it already does well. Deep Context makes this dramatically better. Two things change everything:

- The agent already knows your environment. It's already read your code and runbooks and built context from previous investigations. Your route handlers, database layer, deployment configs, operational procedures: it knows all of it. So when this alert fires, it doesn't start from scratch. It goes straight to the relevant code path, correlates a recent connection pooling commit with the deployment timeline, and confirms the root cause.
- The agent remembers. It's seen this pattern before: a similar incident last week was investigated but never permanently fixed. It recognizes the recurrence from persistent memory, skips rediscovery, confirms the issue is still in the code, and this time fixes it.

Because it's in autonomous mode, the agent edits the source code, restarts the container, pushes the fix to a new branch, creates a PR, opens a GitHub Issue, and verifies service health, all before you wake up. The agent delivers a complete remediation summary, including the alert, the root cause with code references, the fix applied, and the PR created, without a single message from you.

Code access turns diagnosis into action.
Persistent memory turns recurring problems into solved problems.

Give your agent your code: here's why it matters

If you're on an IT operations, SRE, or DevOps team, you might think: "Code access? That's for developers." We'd encourage you to rethink that. Your infrastructure-as-code, deployment configs, Helm charts, Terraform files, pipeline definitions: that's all code. And it's exactly the context your agent needs to go from good to extraordinary.

When your agent can read your actual configuration and infrastructure code, investigations transform. Instead of generic troubleshooting, you get root cause analysis that points to the exact file, the exact line, the exact config change. It correlates a deployment failure with a specific commit. It reads your Helm values and spots the misconfiguration that caused the pod crash loop.

"Will the agent modify our production code?" No. The agent works in a secure sandbox, a copy of your repository, not your production environment. When it identifies a fix, it creates a pull request on a new branch. Your code review process, your CI/CD pipeline, your approval gates: all untouched. The agent proposes. Your team decides.

Whether you're a developer, an SRE, or an IT operator managing infrastructure you didn't write, connecting your code is the single highest-impact thing you can do to make your agent smarter.

The compound effects

Deep Context amplifies every other SRE Agent capability:

- Deep Context + Incident management: Alerts fire, and the agent correlates logs with actual code. Root cause analysis references specific files and line numbers.
- Deep Context + Scheduled tasks: Automated code analysis, compliance checks, and drift detection, inspecting your actual infrastructure code, not just metrics.
- Deep Context + MCP connectors: Datadog, Splunk, and PagerDuty data combined with source code context. The full picture in one conversation.
- Deep Context + Knowledge files: Upload runbooks, architecture docs, and postmortems, in any format.
The agent cross-references your team's knowledge with live code, logs, and infrastructure state. Logs tell the agent what happened. Code tells it why. Your knowledge files tell it what to do about it.

Get started

Deep Context is available today as part of Azure SRE Agent GA. New agents have it enabled by default. For a step-by-step walkthrough connecting your code, logs, incidents, and knowledge files, see What It Takes to Give an SRE Agent a Useful Starting Point.

Resources

- SRE Agent GA announcement blog: https://aka.ms/sreagent/ga
- SRE Agent GA what's new post: https://aka.ms/sreagent/blog/whatsnewGA
- SRE Agent documentation: https://aka.ms/sreagent/newdocs
- SRE Agent overview: https://aka.ms/sreagent/newdocsoverview

Microsoft Azure at KubeCon Europe 2026 | Amsterdam, NL - March 23-26
Microsoft Azure is coming back to Amsterdam for KubeCon + CloudNativeCon Europe 2026 in two short weeks, from March 23-26! As a Diamond Sponsor, we have a full week of sessions, hands-on activities, and ways to connect with the engineers behind AKS and our open-source projects. Here's what's on the schedule:

Azure Day with Kubernetes: 23 March 2026

Before the main conference begins, join us at Hotel Casa Amsterdam for a free, full-day technical event built around AKS (registration required for entry; capacity is limited!). Whether you're early in your Kubernetes journey, running clusters at scale, or building AI apps, the day is designed to give you practical guidance from Microsoft product and engineering teams. Morning sessions cover what's new in AKS, including how teams are building and running AI apps on Kubernetes. In the afternoon, pick your track:

- Hands-on AKS Labs: Instructor-led labs to put the morning's concepts into practice.
- Expert Roundtables: Small-group conversations with AKS engineers on topics like security, autoscaling, AI workloads, and performance. Bring your hard questions.
- Evening: Drinks on us.

Capacity is limited, so secure your spot before it closes: aka.ms/AKSDayEU

KubeCon + CloudNativeCon: 24-26 March 2026

There will be lots going on at the main conference! Here's what to add to your calendar:

- Keynote (24 March): Jorge Palma takes the stage to tackle a question the industry is actively wrestling with: can AI agents reliably operate and troubleshoot Kubernetes at scale, and should they?
- Customer Keynote (24 March): Wayve's Mukund Muralikrishnan shares how they handle GPU scheduling across multi-tenant inference workloads using Kueue, providing a practical look at what production AI infrastructure actually requires.
- Demo Theatre (25 March): Anson Qian and Jorge Palma walk through a Kubernetes-native approach to cross-cloud AI inference, covering elastic autoscaling with Karpenter and GPU capacity scheduling across clouds.
- Sessions: Microsoft engineers are presenting across all three days on topics including multi-cluster networking, supply chain security, observability, Istio in production, and more. Full list below.
- Project Pavilion: Find our team at kiosks for Inspektor Gadget, Headlamp, Drasi, Radius, Notary Project, Flatcar, ORAS, Ratify, and Istio.

Brendan Burns, Kubernetes co-founder and Microsoft CVP & Technical Fellow, will also share his thoughts on the latest developments and key Microsoft announcements related to open-source, cloud native, and AI application development in his KubeCon Europe blog on March 24.

Come find us at Microsoft Azure booth #200 all three days. We'll be running short demos and sessions on AKS, running Kubernetes at scale, AI workloads, and cloud-native topics throughout the show, plus fun activations and opportunities to unlock special swag. Read on below for full details on our KubeCon sessions and booth theater presentations.

Sponsored Keynote

Date: Tues 24 March 2026
Start Time: 10:18 AM CET
Room: Hall 12
Title: Scaling Platform Ops with AI Agents: Troubleshooting to Remediation
Speakers: Jorge Palma, Natan Yellin (Robusta)

As AI agents increasingly write our code, can they also operate and troubleshoot our infrastructure? More importantly, should they? This keynote explores the practical reality of deploying AI agents to maintain Kubernetes clusters at scale. We'll demonstrate HolmesGPT, an open-source CNCF sandbox project that connects LLMs to operational and observability data to diagnose production issues. You'll see how agents reduce MTTR by correlating logs, metrics, and cluster state far faster than manual investigation. Then we'll tackle the harder problem: moving from diagnosis to remediation. We'll show how agents with remediation policies can detect and fix issues autonomously, within strict RBAC boundaries, approval workflows, and audit trails.
We'll be honest about challenges: LLM non-determinism, building trust, and why guardrails are non-negotiable. This isn't about replacing SREs; it's about multiplying their effectiveness so they can focus on creative problem-solving and system design.

Customer Keynote

Date: Tues 24 March 2026
Start Time: 9:37 AM CET
Room: Hall 12
Title: Rules of the road for shared GPUs: AI inference scheduling at Wayve
Speaker: Mukund Muralikrishnan, Wayve Technologies

As AI inference workloads grow in both scale and diversity, predictable access to GPUs becomes as important as raw throughput, especially in large, multi-tenant Kubernetes clusters. At Wayve, Kubernetes underpins a wide range of inference workloads, from latency-sensitive evaluation and validation to large-scale synthetic data generation supporting the development of an end-to-end self-driving system. These workloads run side by side, have very different priorities, and all compete for the same GPU capacity. In this keynote, we will share how we manage scheduling and resources for multi-tenant AI inference on Kubernetes. We will explain why default Kubernetes scheduling falls short, and how we use Kueue, a Kubernetes-native queueing and admission control solution, to operate shared GPU clusters reliably at scale. This approach gives teams predictable GPU allocations, improves cluster utilisation, and reduces operational noise. We will close by briefly showing how frameworks like Ray fit into this model as Wayve scales its AI Driver platform.

KubeCon Theatre Demo

Date: Wed 25 March 2026
Start Time: 13:15 CET
Room: Hall 1-5 | Solutions Showcase | Demo Theater
Title: Building cross-cloud AI inference on Kubernetes with OSS
Speakers: Anson Qian, Jorge Palma

Operating AI inference under bursty, latency-sensitive workloads is hard enough on a single cluster. It gets harder when GPU capacity is fragmented across regions and cloud providers.
This demo walks through a Kubernetes-native pattern for cross-cloud AI inference, using an incident triage and root cause analysis workflow as the example. The stack is built on open-source capabilities for lifecycle management, inference, autoscaling, and cross-cloud capacity scheduling. We will specifically highlight Karpenter for elastic autoscaling and a GPU flex nodes project for scheduling capacity across multiple cloud providers into a single cluster. Models, inference endpoints, and GPU resources are treated as first-class Kubernetes objects, enabling elastic scaling, stable routing under traffic spikes, and cross-provider failover without a separate AI control plane.

KubeCon Europe 2026 Sessions with Microsoft Speakers

| Speaker | Title |
| --- | --- |
| Jorge Palma | Microsoft keynote: Scaling Platform Ops with AI Agents: Troubleshooting to Remediation |
| Anson Qian, Jorge Palma | Microsoft demo: Building cross-cloud AI inference on Kubernetes with OSS |
| Will Tsai | Leveling up with Radius: Custom Resources and Headlamp Integration for Real-World Workloads |
| Simone Rodigari | Demystifying the Kubernetes Network Stack (From Pod to Pod) |
| Joaquin Rodriguez | Privacy as Infrastructure: Declarative Data Protection for AI on Kubernetes |
| Cijo Thomas | ⚡Lightning Talk: "Metrics That Lie": Understanding OpenTelemetry's Cardinality Capping and Its Implications |
| Gaurika Poplai | ⚡Lightning Talk: Compliance as Code Meets Developer Portals: Kyverno + Backstage in Action |
| Mereta Degutyte & Anubhab Majumdar | Network Flow Aggregation: Pay for the Logs You Care About! |
| Niranjan Shankar | Expl(AI)n Like I'm 5: An Introduction To AI-Native Networking |
| Danilo Chiarlone | Running Wasmtime in Hardware-Isolated Microenvironments |
| Jack Francis | Cluster Autoscaler Evolution |
| Jackie Maertens | Cloud Native Theater \| Istio Day: Running State of the Art Inference with Istio and LLM-D |
| Jackie Maertens & Mitch Connors | Bob and Alice Revisited: Understanding Encryption in Kubernetes |
| Mitch Connors | Istio in Production: Expected Value, Results, and Effort at GitHub Scale |
| Mitch Connors | Evolution or Revolution: Istio as the Network Platform for Cloud Native |
| René Dudfield | Ping SRE? I Am the SRE! Awesome Fun I Had Drawing a Zine for Troubleshooting Kubernetes Deployments |
| René Dudfield & Santhosh Nagaraj | Does Your Project Want a UI in Kubernetes-SIGs/headlamp? |
| Bridget Kromhout | How Will Customized Kubernetes Distributions Work for You? A Discussion on Options and Use Cases |
| Kenneth Kilty | AI-Powered Cloud Native Modernization: From Real Challenges to Concrete Solutions |
| Mike Morris | Building the Next Generation of Multi-Cluster with Gateway API |
| Toddy Mladenov, Flora Taagen & Dallas Delaney | Beyond Image Pull-Time: Ensuring Runtime Integrity With Image Layer Signing |

Microsoft Booth Theatre Sessions

Tues 24 March (11:00 - 18:00)

- Zero-Migration AI with Drasi: Bridge Your Existing Infrastructure to Modern Workflows
- Bringing real-time Kubernetes observability to AI agents via Model Context Protocol
- Secure Kubernetes Across the Stack: Supply Chain to Runtime
- Cut the Noise, Cut the Bill: Cost‑Smart Network Observability for Kubernetes
- AKS everywhere: one Kubernetes experience from Cloud to Edge
- Teaching AI to Build Better AKS Clusters with Terraform
- AKS-Flex: autoscale GPU nodes from Azure and neocloud like Nebius using karpenter
- Block Game with Block Storage: Running Minecraft on Kubernetes with local NVMe
- When One Cluster Fails: Keeping Kubernetes Services Online with Cilium ClusterMesh
- You Spent How Much? Controlling Your AI Spend with Istio + agentgateway
- Azure Front Door Edge Actions: Hardware-protected CDN functions in Azure
- Secure Your Sensitive Workloads with Confidential Containers on Azure Red Hat OpenShift
- AKS Automatic
- Anyscale on Azure

Wed 25 March

- Kubernetes Answers without AI (And That's Okay)
- Accelerating Cloud‑Native and AI Workloads with Azure Linux on AKS
- Codeless OpenTelemetry: Auto‑Instrumenting Kubernetes Apps in Minutes
- Life After ingress-nginx: Modern Kubernetes Ingress on AKS
- Modern Apps, Faster: Modernization with AKS + GitHub Copilot App Mod
- Get started developing on AKS
- Encrypt Everything, Complicate Nothing: Rethinking Kubernetes Workload Network Security
- From Repo to Running on AKS with GitHub Copilot
- Simplify Multi‑Cluster App Traffic with Azure Kubernetes Application Network
- Open Source with Chainguard and Microsoft: Better Together on AKS
- Accelerating Cloud-Native Delivery for Developers: API-Driven Platforms with Radius
- Operate Kubernetes at Scale with Azure Kubernetes Fleet Manager

Thurs 26 March

- Oooh Wee! An AKS GUI! – Deploy, Secure & Collaborate in Minutes (No CLI Required)
- Sovereign Kubernetes: Run AKS Where the Cloud Can't Go
- Thousand Pods, One SAN: Burst-Scaling Stateful Apps with Azure Container Storage + Elastic SAN

There will also be a wide variety of demos running at our booth throughout the show. Be sure to swing by to chat with the team. We look forward to seeing you at KubeCon Europe 2026 in Amsterdam!

Psst! Local or coming in to Amsterdam early? You can also catch the Microsoft team at:

- Cloud Native Rejekts on 21 March
- Maintainer Summit on 22 March

Rethinking Ingress on Azure: Application Gateway for Containers Explained
Introduction

Azure Application Gateway for Containers is a managed Azure service designed to handle incoming traffic for container-based applications. It brings Layer-7 load balancing, routing, TLS termination, and web application protection outside of the Kubernetes cluster and into an Azure-managed data plane. By separating traffic management from the cluster itself, the service reduces operational complexity while providing a more consistent, secure, and scalable way to expose container workloads on Azure.

Service Overview

What Application Gateway for Containers does

Azure Application Gateway for Containers is a managed Layer-7 load balancing and ingress service built specifically for containerized workloads. Its main job is to receive incoming application traffic (HTTP/HTTPS), apply routing and security rules, and forward that traffic to the right backend containers running in your Kubernetes cluster. Instead of deploying and operating an ingress controller inside the cluster, Application Gateway for Containers runs outside the cluster, as an Azure-managed data plane. It integrates natively with Kubernetes through the Gateway API (and Ingress API), translating Kubernetes configuration into fully managed Azure networking behavior.

In practical terms, it handles:

- HTTP/HTTPS routing based on hostnames, paths, headers, and methods
- TLS termination and certificate management
- Web Application Firewall (WAF) protection
- Scaling and high availability of the ingress layer

All of this is provided as a managed Azure service, without running ingress pods in your cluster.

What problems it solves

Application Gateway for Containers addresses several common challenges teams face with traditional Kubernetes ingress setups:

- Operational overhead: Running ingress controllers inside the cluster means managing upgrades, scaling, certificates, and availability yourself. Moving ingress to a managed Azure service significantly reduces this burden.
- Security boundaries: By keeping traffic management and WAF outside the cluster, you reduce the attack surface of the Kubernetes environment and keep security controls aligned with Azure-native services.
- Consistency across environments: Platform teams can offer a standard, Azure-managed ingress layer that behaves the same way across clusters and environments, instead of relying on different in-cluster ingress configurations.
- Separation of responsibilities: Infrastructure teams manage the gateway and security policies, while application teams focus on Kubernetes resources like routes and services.

How it differs from classic Application Gateway

While both services share the "Application Gateway" name, they target different use cases and operating models. The classic Azure Application Gateway is a general-purpose Layer-7 load balancer primarily designed for VM-based or service-based backends. It relies on centralized configuration through Azure resources and is not Kubernetes-native by design.

Application Gateway for Containers, on the other hand:

- Is designed specifically for container platforms
- Uses Kubernetes APIs (Gateway API / Ingress) instead of manual listener and rule configuration
- Separates control plane and data plane more cleanly
- Enables faster, near real-time updates driven by Kubernetes changes
- Avoids running ingress components inside the cluster

In short, classic Application Gateway is infrastructure-first, while Application Gateway for Containers is platform- and Kubernetes-first.

Architecture at a Glance

At a high level, Azure Application Gateway for Containers is built around a clear separation between control plane and data plane. This separation is one of the key architectural ideas behind the service and explains many of its benefits.

Control plane and data plane

The control plane is responsible for configuration and orchestration.
It listens to Kubernetes resources, such as Gateway API or Ingress objects, and translates them into a running gateway configuration. When you create or update routing rules, TLS settings, or security policies in Kubernetes, the control plane picks up those changes and applies them automatically.

The data plane is where traffic actually flows. It handles incoming HTTP and HTTPS requests, applies routing rules, performs TLS termination, and forwards traffic to the correct backend services inside your cluster. This data plane is fully managed by Azure and runs outside of the Kubernetes cluster, providing isolation and high availability by design. Because the data plane is not deployed as pods inside the cluster, it does not consume cluster resources and does not need to be scaled or upgraded by the customer.

Managed components vs customer responsibilities

One of the goals of Application Gateway for Containers is to reduce what customers need to operate, while still giving them control where it matters.

Managed by Azure:

- Application Gateway for Containers data plane
- Scaling, availability, and patching of the gateway
- Integration with Azure networking
- Web Application Firewall engine and updates
- Translation of Kubernetes configuration into gateway rules

Customer-managed:

- Kubernetes resources (Gateway API or Ingress)
- Backend services and workloads
- TLS certificates and references
- Routing and security intent (hosts, paths, policies)
- Network design and connectivity to the cluster

This split allows platform teams to keep ownership of the underlying Azure infrastructure, while application teams interact with the gateway using familiar Kubernetes APIs. The result is a cleaner operating model with fewer moving parts inside the cluster. In short, Application Gateway for Containers acts as an Azure-managed ingress layer, driven by Kubernetes configuration but operated outside the cluster.
This architecture keeps traffic management simple, scalable, and aligned with Azure-native networking and security services.

Traffic Handling and Routing

This section explains what happens to a request from the moment it reaches Azure until it is forwarded to a container running in your cluster.

Traffic Flow: From Internet to Pod

Azure Application Gateway for Containers (AGC) acts as the specialized "front door" for your Kubernetes workloads. By sitting outside the cluster, it manages high-volume traffic ingestion so your environment remains focused on application logic rather than networking overhead.

The Request Journey

Once a request is initiated by a client, such as a browser or an API, it follows a streamlined path to your container:

1. Entry via Public Frontend: The request reaches AGC's public frontend endpoint. Note: While private frontends are currently the most requested feature and are under high-priority development, the service currently supports public-facing endpoints.
2. Rule Evaluation: AGC evaluates the incoming request against the routing rules you've defined using standard Kubernetes resources (Gateway API or Ingress).
3. Direct Pod Proxying: Once a rule is matched, AGC forwards the traffic directly to the backend pods within your cluster.
4. Azure Native Delivery: Because AGC operates as a managed data plane outside the cluster, traffic reaches your workloads via Azure networking. This removes the need to manage scaling or resource contention for in-cluster ingress pods.

Flexibility in Security and Routing

The architecture is designed to be as "hands-off" or as "hands-on" as your security policy requires:

- Optional TLS Offloading: You have full control over the encryption lifecycle. Depending on your specific use case, you can choose to perform TLS termination at the gateway to offload the compute-intensive decryption, or maintain encryption all the way to the container for end-to-end security.
- Simplified Infrastructure: By using AGC, you eliminate the "hop" typically required by in-cluster controllers, allowing the gateway to communicate with pods with minimal latency and high predictability.

Kubernetes Integration

Application Gateway for Containers is designed to integrate natively with Kubernetes, allowing teams to manage ingress behavior using familiar Kubernetes resources instead of Azure-specific configuration. This makes the service feel like a natural extension of the Kubernetes platform rather than an external load balancer.

Gateway API as the primary integration model

The Gateway API is the preferred and recommended way to integrate Application Gateway for Containers with Kubernetes. With the Gateway API:

- Platform teams define the Gateway and control how traffic enters the cluster.
- Application teams define routes (such as HTTPRoute) to expose their services.
- Responsibilities are clearly separated, supporting multi-team and multi-namespace environments.

Application Gateway for Containers supports core Gateway API resources such as:

- GatewayClass
- Gateway
- HTTPRoute

When these resources are created or updated, Application Gateway for Containers automatically translates them into gateway configuration and applies the changes in near real time.

Ingress API support

For teams that already use the traditional Kubernetes Ingress API, Application Gateway for Containers also provides Ingress support. This allows:

- Reuse of existing Ingress manifests
- A smoother migration path from older ingress controllers
- Gradual adoption of Gateway API over time

Ingress resources are associated with Application Gateway for Containers using a specific ingress class. While fully functional, the Ingress API offers fewer capabilities and less flexibility compared to the Gateway API.
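To make the platform-team / application-team split concrete, here is a minimal sketch of the two Gateway API resources described above: a Gateway with an HTTPS listener terminating TLS from a Kubernetes secret, and an HTTPRoute attached by an application team. The resource names, namespaces, hostnames, and the `azure-alb-external` gateway class name are illustrative assumptions rather than values from this article; check the current Application Gateway for Containers documentation for the exact class name and any required annotations in your environment.

```yaml
# Platform team: defines how traffic enters the cluster, including
# TLS termination at the gateway using a certificate stored in a
# Kubernetes secret. All names here are hypothetical.
apiVersion: gateway.networking.k8s.io/v1
kind: Gateway
metadata:
  name: agc-gateway
  namespace: infra
spec:
  gatewayClassName: azure-alb-external   # assumed class name for this service
  listeners:
    - name: https
      protocol: HTTPS
      port: 443
      hostname: "*.contoso.example"
      tls:
        mode: Terminate
        certificateRefs:
          - kind: Secret
            name: contoso-tls            # certificate referenced from Kubernetes configuration
      allowedRoutes:
        namespaces:
          from: All                      # let application teams attach routes from their namespaces
---
# Application team: exposes a backend Service with host- and path-based routing.
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: store-route
  namespace: store
spec:
  parentRefs:
    - name: agc-gateway
      namespace: infra
  hostnames:
    - "shop.contoso.example"
  rules:
    - matches:
        - path:
            type: PathPrefix
            value: /api
      backendRefs:
        - name: store-api                # hypothetical backend Kubernetes Service
          port: 8080
```

Applying these manifests with kubectl is the whole workflow: the control plane watches the resources and programs the managed data plane in near real time, so there is no separate gateway-side configuration step.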
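For the Ingress path, the equivalent sketch is a standard `networking.k8s.io/v1` Ingress associated with the service through its ingress class. The class name shown and all resource names are assumptions for illustration, not values stated in this article; verify the ingress class name against the current documentation before reuse.

```yaml
# Existing-style Ingress manifest, associated with Application Gateway
# for Containers via its ingress class (class name is an assumption).
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: store-ingress
  namespace: store
spec:
  ingressClassName: azure-alb-external   # assumed ingress class for this service
  rules:
    - host: shop.contoso.example
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: store-api          # hypothetical backend Service
                port:
                  number: 8080
```

Because manifests like this often already exist, migration from an older ingress controller can start by switching the ingress class, with a gradual move to the Gateway API afterwards.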
How teams interact with the service

A key benefit of this integration model is the clean separation of responsibilities:

Platform teams:

- Provision and manage Application Gateway for Containers
- Define gateways, listeners, and security boundaries
- Own network and security policies

Application teams:

- Define routes using Kubernetes APIs
- Control how their applications are exposed
- Do not need direct access to Azure networking resources

This approach enables self-service for application teams while keeping governance and security centralized.

Why this matters

By integrating deeply with Kubernetes APIs, Application Gateway for Containers avoids custom controllers, sidecars, or ingress pods inside the cluster. Configuration stays declarative, changes are automated, and the operational model stays consistent with Kubernetes best practices.

Security Capabilities

Security is a core part of Azure Application Gateway for Containers and one of the main reasons teams choose it over in-cluster ingress controllers. The service brings Azure-native security controls directly in front of your container workloads, without adding complexity inside the cluster.

Web Application Firewall (WAF)

Application Gateway for Containers integrates with Azure Web Application Firewall (WAF) to protect applications against common web attacks such as SQL injection, cross-site scripting, and other OWASP Top 10 threats. A key differentiator of this service is that it leverages Microsoft's global threat intelligence. This provides an enterprise-grade layer of security that constantly evolves to block emerging threats, a significant advantage over many open-source or standard competitor WAF solutions.

Because the WAF operates within the managed data plane, it offers several operational benefits:

- Zero Cluster Footprint: No WAF-specific pods or components are required to run inside your Kubernetes cluster, saving resources for your actual applications.
- Edge Protection: Security rules and policies are applied at the Azure network edge, ensuring malicious traffic is blocked before it ever reaches your workloads.
- Automated Maintenance: All rule updates, patching, and engine maintenance are handled entirely by Azure.
- Centralized Governance: WAF policies can be managed centrally, ensuring consistent security enforcement across multiple teams and namespaces, a critical requirement for regulated environments.

TLS and certificate handling

TLS termination happens directly at the gateway. HTTPS traffic is decrypted at the edge, inspected, and then forwarded to backend services. Key points:

- Certificates are referenced from Kubernetes configuration
- TLS policies are enforced by the Azure-managed gateway
- Applications receive plain HTTP traffic, keeping workloads simpler

This approach allows teams to standardize TLS behavior across clusters and environments, while avoiding certificate logic inside application pods.

Network isolation and exposure control

Because Application Gateway for Containers runs outside the cluster, it provides a clear security boundary between external traffic and Kubernetes workloads. Common patterns include:

- Internet-facing gateways with WAF protection
- Private gateways for internal or zero-trust access
- Controlled exposure of only selected services

By keeping traffic management and security at the gateway layer, clusters remain more isolated and easier to protect.

Security by design

Overall, the security model follows a simple principle: inspect, protect, and control traffic before it enters the cluster. This reduces the attack surface of Kubernetes, centralizes security controls, and aligns container ingress with Azure's broader security ecosystem.

Scale, Performance, and Limits

Azure Application Gateway for Containers is built to handle production-scale traffic without requiring customers to manage capacity, scaling rules, or availability of the ingress layer.
Scalability and performance are handled as part of the managed service.

Interoperability: The Best of Both Worlds

A common hesitation when adopting cloud-native networking is the fear of vendor lock-in. Many organizations worry that using a provider-specific ingress service will tie their application logic too closely to a single cloud's proprietary configuration. Azure Application Gateway for Containers (AGC) addresses this directly by using the Kubernetes Gateway API as its primary integration model. This creates a powerful decoupling between how you define your traffic and how that traffic is actually delivered.

Standardized API, Managed Execution

By adopting this model, you gain two critical advantages simultaneously:

- Zero Vendor Lock-In (Standardized API): Your routing logic is defined using the open-source Kubernetes Gateway API standard. Because HTTPRoute and Gateway resources are community-driven standards, your configuration remains portable and familiar to any Kubernetes professional, regardless of the underlying infrastructure.
- Zero Operational Overhead (Managed Implementation): While the interface is a standard Kubernetes API, the implementation is a high-performance Azure-managed service. You gain the benefits of an enterprise-grade load balancer—automatic scaling, high availability, and integrated security—without the burden of managing, patching, or troubleshooting proxy pods inside your cluster.

The "Pragmatic" Advantage

As highlighted in recent architectural discussions, moving from traditional Ingress to the Gateway API is about more than just new features; it's about interoperability. It allows platform teams to offer a consistent, self-service experience to developers while retaining the ability to leverage the best-in-class performance and security that only a native cloud provider can offer.
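The portability claim is concrete: the routing examples in the appendix of this article all attach to a parent Gateway named agc-gateway, and that Gateway is an ordinary, standard resource. A minimal sketch of what it might look like follows; the gatewayClassName value is an assumption inferred from the azure-alb-external Ingress class used later in this article, so verify the exact class name against the AGC documentation.

```yaml
apiVersion: gateway.networking.k8s.io/v1
kind: Gateway
metadata:
  name: agc-gateway
spec:
  # Assumed GatewayClass name; confirm against the AGC documentation
  gatewayClassName: azure-alb-external
  listeners:
  - name: http
    protocol: HTTP
    port: 80
    allowedRoutes:
      namespaces:
        from: Same
```

Because this is the community-standard Gateway API shape, the same manifest (with a different gatewayClassName) could be served by any conformant implementation, which is exactly the lock-in protection described above.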
The result is a future-proof architecture: your teams use the industry-standard language of Kubernetes to describe what they need, and Azure provides the managed muscle to make it happen.

Scaling model

Application Gateway for Containers uses an automatic scaling model. The gateway data plane scales up or down based on incoming traffic patterns, without manual intervention.

From an operator's perspective:
- There are no ingress pods to scale
- No node capacity planning for ingress
- No separate autoscaler to configure

Scaling is handled entirely by Azure, allowing teams to focus on application behavior rather than ingress infrastructure.

Performance characteristics

Because the data plane runs outside the Kubernetes cluster, ingress traffic does not compete with application workloads for CPU or memory. This often results in:
- More predictable latency
- Better isolation between traffic management and application execution
- Consistent performance under load

The service supports common production requirements such as:
- High concurrent connections
- Low-latency HTTP and HTTPS traffic
- Near real-time configuration updates driven by Kubernetes changes

Service limits and considerations

Like any managed service, Application Gateway for Containers has defined limits that architects should be aware of when designing solutions. These include limits around:
- Number of listeners and routes
- Backend service associations
- Certificates and TLS configurations
- Throughput and connection scaling thresholds

These limits are documented and enforced by the platform to ensure stability and predictable behavior. For most application platforms, these limits are well above typical usage. However, they should be reviewed early when designing large multi-tenant or high-traffic environments.

Designing with scale in mind

The key takeaway is that Application Gateway for Containers removes ingress scaling from the cluster and turns it into an Azure-managed concern.
This simplifies operations and provides a stable, high-performance entry point for container workloads.

When to Use (and When Not to Use)

| Scenario | Use it? | Why |
|---|---|---|
| Kubernetes workloads on Azure | ✅ Yes | The service is designed specifically for container platforms and integrates natively with Kubernetes APIs. |
| Need for managed Layer-7 ingress | ✅ Yes | Routing, TLS, and scaling are handled by Azure without in-cluster components. |
| Enterprise security requirements (WAF, TLS policies) | ✅ Yes | Built-in Azure WAF and centralized TLS enforcement simplify security. |
| Platform team managing ingress for multiple apps | ✅ Yes | Clear separation between platform and application responsibilities. |
| Multi-tenant Kubernetes clusters | ✅ Yes | Gateway API model supports clean ownership boundaries and isolation. |
| Desire to avoid running ingress controllers in the cluster | ✅ Yes | No ingress pods, no cluster resource consumption. |
| VM-based or non-container backends | ❌ No | Classic Application Gateway is a better fit for non-container workloads. |
| Simple, low-traffic test or dev environments | ❌ Maybe not | A lightweight in-cluster ingress may be simpler and more cost-effective. |
| Need for custom or unsupported L7 features | ❌ Maybe not | Some advanced or niche ingress features may not yet be available. |
| Non-Kubernetes platforms | ❌ No | The service is tightly integrated with Kubernetes APIs. |

When to Choose a Different Path: Azure Container Apps

While Application Gateway for Containers provides the ultimate control for Kubernetes environments, not every project requires that level of infrastructure management. For teams that don't need the full flexibility of Kubernetes and are looking for the fastest path to running containers on Azure without managing clusters or ingress infrastructure at all, Azure Container Apps offers a specialized alternative. It provides a fully managed, serverless container platform that handles scaling, ingress, and networking automatically, "out of the box".
Key Differences at a Glance

| Feature | AGC + Kubernetes | Azure Container Apps |
|---|---|---|
| Control | Granular control over cluster and ingress. | Fully managed, serverless experience. |
| Management | You manage the cluster; Azure manages the gateway. | Azure manages both the platform and ingress. |
| Best For | Complex, multi-team, or highly regulated environments. | Rapid development and simplified operations. |

Appendix - Routing configuration examples

The following examples show how Application Gateway for Containers can be configured using both Gateway API and Ingress API for common routing and TLS scenarios. More examples can be found in the detailed documentation.

HTTP listener

```yaml
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: app-route
spec:
  parentRefs:
  - name: agc-gateway
  rules:
  - backendRefs:
    - name: app-service
      port: 80
```

Path routing logic

```yaml
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: path-routing
spec:
  parentRefs:
  - name: agc-gateway
  rules:
  - matches:
    - path:
        type: PathPrefix
        value: /api
    backendRefs:
    - name: api-service
      port: 80
  - backendRefs:
    - name: web-service
      port: 80
```

Weighted canary / rollout

```yaml
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: canary-route
spec:
  parentRefs:
  - name: agc-gateway
  rules:
  - backendRefs:
    - name: app-v1
      port: 80
      weight: 80
    - name: app-v2
      port: 80
      weight: 20
```

TLS Termination

```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: app-ingress
spec:
  ingressClassName: azure-alb-external
  tls:
  - hosts:
    - app.contoso.com
    secretName: tls-cert
  rules:
  - host: app.contoso.com
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: app-service
            port:
              number: 80
```

Regional Endpoints for Geo-Replicated Azure Container Registries (Private Preview)
Imagine you're running Kubernetes clusters in multiple Azure regions—East US, West Europe, and Southeast Asia. You've configured ACR with geo-replication so your container images are available everywhere, but you've noticed something frustrating: you can't control which replica your clusters pull from. Sometimes your East US cluster pulls from West Europe, and you have no way to pin it to the co-located replica or troubleshoot why routing behaves unexpectedly.

This scenario highlights a fundamental challenge with geo-replicated container registries: while Azure-managed routing optimizes for performance, it doesn't provide explicit control for custom failover strategies, troubleshooting, regional affinity, or predictable routing. Regional endpoints solve this by letting you choose exactly which region handles your requests.

Background: How Geo-Replication Works Today

Geo-replication allows you to maintain copies of your container registry in multiple Azure regions around the world. This means your container images are stored closer to where your applications run, reducing download times and improving reliability. You maintain a single registry name (like myregistry.azurecr.io), and Azure automatically routes your requests to the most suitable replica.

The Challenge: Azure-Managed Routing Limitations

While geo-replication has been invaluable for global deployments, the automatic routing creates challenges for some customers. When you push or pull images from a geo-replicated registry, Azure-managed routing automatically directs your request to the most suitable replica based on the client's network performance profile. While this Azure-managed routing works well for many scenarios, it creates several challenges for customers with specific requirements:

- Misrouting Issues: Azure-managed routing may not always select the replica you expect, particularly if network conditions fluctuate or if you're testing specific regional behavior.
- Geographic Ambiguity: Clients located equidistant from two replicas may experience unpredictable routing as Azure switches between them based on minor network performance variations.
- Push/Pull Consistency: Images pushed to one replica may be pulled from another during geo-replication synchronization, creating temporary inconsistencies that can impact deployment pipelines. For more details on troubleshooting push operations with geo-replicated registries, see Troubleshoot push operations.
- Lack of Regional Affinity: Clients may want to establish regional affinity between their applications and a specific replica, but Azure-managed routing doesn't provide a way to maintain this affinity.
- No Client-Side Failover: Without the ability to target specific replicas, you cannot implement client-side failover strategies or disaster recovery logic that explicitly switches between regions based on your own health checks and business rules.

Introducing Regional Endpoints

Regional endpoints solve these challenges by providing direct access to specific geo-replicated regions through dedicated login server URLs. Instead of relying solely on the global endpoint (myregistry.azurecr.io) with Azure-managed routing, you can now target specific regional replicas using the pattern:

myregistry.<region-name>.geo.azurecr.io

For example:
- myregistry.eastus.geo.azurecr.io
- myregistry.westeurope.geo.azurecr.io

Important: Regional endpoints coexist with a geo-replicated registry's global endpoint at myregistry.azurecr.io. Enabling regional endpoints doesn't disable or replace the global endpoint; you can use both simultaneously. This allows you to use the global endpoint for most operations while selectively using regional endpoints when you need explicit regional control.

How It Works

Regional endpoints function as login servers—the URL endpoints you use to authenticate and interact with your registry—for specific geo-replicated regions.
When you authenticate and interact with a regional endpoint instead of a registry's global endpoint, all your registry operations (authentication, artifact uploads/downloads, repository operations, and metadata actions) go directly to that specific regional replica, bypassing Azure-managed routing entirely.

Downloading layer blobs (the actual container image layers) still follows your registry's existing configuration:

- For registries without Private Endpoints or Dedicated Data Endpoints, layer blob downloads still redirect to Azure storage accounts (*.blob.core.windows.net).
- For registries with Private Endpoints or Dedicated Data Endpoints enabled, layer blob downloads redirect to the corresponding region's dedicated data endpoint (myregistry.<region-name>.data.azurecr.io).

Here's how the architecture compares:

Global Endpoint (Azure-Managed Routing):

```
Client → myregistry.azurecr.io (Azure-managed routing) → Geo-Replica with the Best Network Performance Profile
                                                              ↓
                           Geo-Replica's Data Endpoint (blob storage or dedicated data endpoint)
```

Regional Endpoint (Customer-Specified Routing):

```
Client → myregistry.<region-name>.geo.azurecr.io (client-managed routing) → Specific Regional Geo-Replica
                                                              ↓
                           Geo-Replica's Data Endpoint (blob storage or dedicated data endpoint)
```

Regional vs. Global Endpoints

| Endpoint Type | URL Format | Purpose | Use Case |
|---|---|---|---|
| Global Endpoint | myregistry.azurecr.io | Login server with Azure-managed routing | Default, optimal for most scenarios |
| Regional Endpoint | myregistry.<region-name>.geo.azurecr.io | Login server for specific regional replica | Predictable routing, client-side failover, regional affinity, troubleshooting |
| Dedicated Data Endpoint | myregistry.<region-name>.data.azurecr.io | Layer blob downloads for Private Endpoint and Dedicated Data Endpoint-enabled registries | Automatic blob download redirect from login server |
| Storage Account | *.blob.core.windows.net | Layer blob downloads for registries without Private Endpoints or Dedicated Data Endpoints | Automatic blob download redirect from login server |

Getting Started with Private Preview

Prerequisites

To participate in the regional endpoints private preview, you'll need:

- Premium SKU: Regional endpoints are available exclusively on Premium tier registries
- Azure CLI: Version 2.74.0 or later for the --regional-endpoints flag
- API version: The feature is available in all production regions in Azure Public Cloud via the 2026-01-01-preview ACR ARM API version

NOTE: During private preview, regional endpoints are only available in Azure Public Cloud. Support for Azure Government, Azure China, and other national clouds will be available in public preview and beyond.

NOTE: Regional endpoints can be enabled on any Premium SKU registry, even without geo-replication. A registry without geo-replication has a single geo-replica in the home region, which gets one regional endpoint URL. However, the feature is most useful when your registry has at least two geo-replicas.

Step 1: Register the feature flag

Register the RegionalEndpoints feature flag for your subscription:

```shell
az feature register \
  --namespace Microsoft.ContainerRegistry \
  --name RegionalEndpoints
```

The feature registration is auto-approved and takes approximately 1 hour to propagate.
You can check the status with:

```shell
az feature show \
  --namespace Microsoft.ContainerRegistry \
  --name RegionalEndpoints
```

Wait until the state shows Registered before proceeding.

Step 2: Propagate the registration

Once the feature flag shows Registered, propagate the registration to your subscription's resource provider:

```shell
az provider register -n Microsoft.ContainerRegistry
```

Step 3: Install the preview CLI extension

Download the preview CLI extension wheel file from https://aka.ms/acr/regionalendpoints/download and install it:

```shell
az extension add \
  --source acrregionalendpoint-1.0.0b1-py3-none-any.whl \
  --allow-preview true
```

What to Expect

Once setup is complete, you can:
- Enable regional endpoints on both new and existing registries
- Access preview documentation
- Provide feedback via our GitHub roadmap

Technical Deep Dive

Enabling Regional Endpoints

Enabling regional endpoints is simple and can be done for both new and existing registries:

```shell
# Enable for new registry
az acr create -n myregistry -g myrg -l <region-name> --regional-endpoints enabled --sku Premium

# Enable for existing registry
az acr update -n myregistry -g myrg --regional-endpoints enabled
```

When you enable regional endpoints, ACR automatically creates login server URLs for all your geo-replicated regions. There's no need to manually configure individual regions; they're all available immediately.
Authentication and Pushing/Pulling Images

Using regional endpoints follows the same authentication experience as a geo-replicated registry's global endpoint:

```shell
# Login to a specific regional endpoint
az acr login --name myregistry --endpoint eastus

# Tag an image with the regional endpoint URL
docker tag myapp:v1 myregistry.eastus.geo.azurecr.io/myapp:v1

# Push images to the regional endpoint
docker push myregistry.eastus.geo.azurecr.io/myapp:v1

# Pull images from the regional endpoint
docker pull myregistry.eastus.geo.azurecr.io/myapp:v1
```

Regional endpoints support all the same authentication mechanisms as the global endpoint: Microsoft Entra, service principals, managed identities, and admin credentials.

Kubernetes Integration

One of the most powerful uses of regional endpoints is in Kubernetes deployments. You can specify regional endpoints directly in your deployment manifests, ensuring that Kubernetes clusters in specific regions always pull from their local replica:

```yaml
# East US-based AKS cluster deployment
apiVersion: apps/v1
kind: Deployment
metadata:
  name: myapp-eastus
spec:
  template:
    spec:
      containers:
      - name: myapp
        image: myregistry.eastus.geo.azurecr.io/myapp:v1
---
# West Europe-based AKS cluster deployment
apiVersion: apps/v1
kind: Deployment
metadata:
  name: myapp-westeurope
spec:
  template:
    spec:
      containers:
      - name: myapp
        image: myregistry.westeurope.geo.azurecr.io/myapp:v1
```

Integration with Dedicated Data Endpoints

Regional endpoints work seamlessly with ACR's existing dedicated data endpoints feature. If your registry has dedicated data endpoints enabled, blob downloads from regional endpoints will automatically redirect to the dedicated data endpoints for that region, maintaining all the security benefits of scoped firewall rules without wildcard storage access.
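The client-side failover scenario described earlier can be implemented entirely on the client. The following is a minimal sketch, assuming a registry named myregistry with geo-replicas in eastus and westeurope (the registry, regions, and image names are illustrative), that tries each regional endpoint in preference order:

```shell
#!/bin/sh
# Client-side failover sketch: try regional endpoints in preference order.
# Registry name, region list, and image reference are illustrative assumptions.
REGISTRY="myregistry"
REGIONS="eastus westeurope"
IMAGE="myapp:v1"

# Build the full image reference for a given region's endpoint,
# e.g. myregistry.eastus.geo.azurecr.io/myapp:v1
regional_ref() {
  echo "${REGISTRY}.${1}.geo.azurecr.io/${IMAGE}"
}

# Try each region in order; stop at the first successful pull.
pull_with_failover() {
  for region in $REGIONS; do
    if docker pull "$(regional_ref "$region")"; then
      echo "pulled from ${region}"
      return 0
    fi
  done
  echo "all regional pulls failed" >&2
  return 1
}
```

In a real pipeline you would authenticate first (for example with az acr login against each endpoint) and order REGIONS so the co-located replica comes first, turning the remaining entries into explicit disaster-recovery fallbacks.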
Integration with Private Endpoints

For registries with Private Endpoints enabled, enabling regional endpoints creates an additional private IP address allocation for each geo-replicated region in all associated virtual networks (VNets). For example, if you have a registry with 3 existing geo-replicas and enable regional endpoints, each VNet with a private endpoint to your registry will consume 3 additional private IPs (one per regional endpoint).

Firewall and Network Configuration

When using regional endpoints, you'll need to configure your firewall rules to allow access to the specific endpoints you plan to use:

```
# Registry operations using regional endpoints
myregistry.<region-name>.geo.azurecr.io

# Registry operations using the existing global endpoint for Azure-managed routing
myregistry.azurecr.io

# Layer blob downloads (choose based on your registry configuration)
myregistry.<region-name>.data.azurecr.io   # If Private Endpoints or Dedicated Data Endpoints enabled
*.blob.core.windows.net                    # If without Private Endpoints or Dedicated Data Endpoints
```

Related Resources

- Regional endpoints for geo-replicated registries (Preview)
- Geo-replication in Azure Container Registry
- Mitigate data exfiltration with dedicated data endpoints
- Connect privately to an Azure container registry using Azure Private Link
- Configure rules to access an Azure container registry behind a firewall

Simplifying Image Signing with Notary Project and Artifact Signing (GA)
Securing container images is a foundational part of protecting modern cloud-native applications. Teams need a reliable way to ensure that the images moving through their pipelines are authentic, untampered, and produced by trusted publishers.

We're excited to share an updated approach that combines the Notary Project, the CNCF standard for signing and verifying OCI artifacts, with Artifact Signing—formerly Trusted Signing—which is now generally available as a managed signing service. The Notary Project provides an open, interoperable framework for signing and verification across container images and other OCI artifacts, while Notary Project tools like Notation and Ratify enable enforcement in CI/CD pipelines and Kubernetes environments. Artifact Signing complements this by removing the operational complexity of certificate management through short-lived certificates, verified Azure identities, and role-based access control, without changing the underlying standards.

If you previously explored container image signing using Trusted Signing, the core workflows remain unchanged. As Artifact Signing reaches GA, customers will see updated terminology across documentation and tooling, while existing Notary Project–based integrations continue to work without disruption.

Together, Notary Project and Artifact Signing make it easier for teams to adopt image signing as a scalable platform capability, helping ensure that only trusted artifacts move from build to deployment with confidence.

Get started
- Sign container images using Notation CLI
- Sign container images in CI/CD pipelines
- Verify container images in CI/CD pipelines
- Verify container images in AKS
- Extend signing and verification to all OCI artifacts in registries

Related content
- Simplifying Code Signing for Windows Apps: Artifact Signing (GA)
- Simplify Image Signing and Verification with Notary Project (preview article)

Deploy Dynatrace OneAgent on your Container Apps
TOC
1. Introduction
2. Setup
3. References

1. Introduction

Dynatrace OneAgent is an advanced monitoring tool that automatically collects performance data across your entire IT environment. It provides deep visibility into applications, infrastructure, and cloud services, enabling real-time observability. OneAgent supports multiple platforms, including containers, VMs, and serverless architectures, ensuring seamless monitoring with minimal configuration. It captures detailed metrics, traces, and logs, helping teams diagnose performance issues, optimize resources, and enhance user experiences. With AI-driven insights, OneAgent proactively detects anomalies and automates root cause analysis, making it an essential component for modern DevOps, SRE, and cloud-native monitoring strategies.

2. Setup

1. After registering your account, go to the control panel and search for Deploy OneAgent.

2. Obtain your Environment ID and create a PaaS token. Be sure to save them for later use.

3. In your local environment's console, log in to the Dynatrace registry.

```shell
docker login -u XXX XXX.live.dynatrace.com
# XXX is your Environment ID
# Enter the PaaS token at the password prompt
```

4. Create a Dockerfile and an sshd_config file.

```dockerfile
FROM mcr.microsoft.com/devcontainers/javascript-node:20

# Change XXX to your Environment ID
COPY --from=XXX.live.dynatrace.com/linux/oneagent-codemodules:all / /
ENV LD_PRELOAD /opt/dynatrace/oneagent/agent/lib64/liboneagentproc.so

# SSH
RUN apt-get update \
    && apt-get install -y --no-install-recommends dialog openssh-server tzdata screen lrzsz htop cron \
    && echo "root:Docker!" | chpasswd \
    && mkdir -p /run/sshd \
    && chmod 700 /root/.ssh/ \
    && chmod 600 /root/.ssh/id_rsa
COPY ./sshd_config /etc/ssh/

# OTHER
EXPOSE 2222
CMD ["/usr/sbin/sshd", "-D", "-o", "ListenAddress=0.0.0.0"]
```

```
Port 2222
ListenAddress 0.0.0.0
LoginGraceTime 180
X11Forwarding yes
Ciphers aes128-cbc,3des-cbc,aes256-cbc,aes128-ctr,aes192-ctr,aes256-ctr
MACs hmac-sha2-256,hmac-sha2-512,hmac-sha1,hmac-sha1-96
StrictModes yes
SyslogFacility DAEMON
PasswordAuthentication yes
PermitEmptyPasswords no
PermitRootLogin yes
Subsystem sftp internal-sftp
AllowTcpForwarding yes
```

5. Build the container and push it to Azure Container Registry (ACR).

```shell
# YYY is your ACR name
docker build -t oneagent:202503201710 . --no-cache
# You can use your own image name
docker tag oneagent:202503201710 YYY.azurecr.io/oneagent:202503201710
docker push YYY.azurecr.io/oneagent:202503201710
```

6. Create an Azure Container App (ACA), set Ingress to port 3000, allow all inbound traffic, and specify the ACR image you just created.

7. Once the container starts, open a console and run the following commands to create a temporary HTTP server simulating a Node.js app.

```shell
mkdir app && cd app
echo 'console.log("Node.js app started...")' > index.js
npm init -y
npm install express
cat <<EOF > server.js
const express = require('express');
const app = express();
app.get('/', (req, res) => res.send('hello'));
app.listen(3000, () => console.log('Server running on port 3000'));
EOF
# Press Ctrl + C to terminate the next command, then run it again (three times in total)
node server.js
```

8. You should now see the results on the ACA homepage.

9. Go back to the Dynatrace control panel, search for Host Classic, and you should see the collected data.

3. References

Integrate OneAgent on Azure App Service for Linux and containers — Dynatrace Docs