Azure Architecture Blog articles

Cloud Native Platforms: Build

KishoreKumarPattabiraman — Thu, 21 May 2026 22:41:29 GMT

Audience: Cloud architects, platform engineers, engineering leaders making design decisions

Reading time: 8 minutes

Series: Cloud Native Platforms. Build, Run, Evolve. This is Part 1 of 3.

Most engineering teams can build systems.

Few can scale them without rebuilding them.

As platforms grow, complexity does not increase linearly. It multiplies across users, services, tenants, regions, and integrations. The systems that struggle and the systems that scale are rarely separated by which cloud they run on. They are separated by a handful of design choices made early and applied consistently.

This post is about those choices.

The differentiator is not the cloud

Scalable platforms are not built with the right tools. They are built with the right design choices.

Cloud services have closed the gap on infrastructure. The differentiator is no longer which managed service a team picks. It is whether the platform is designed to absorb change, tolerate failure, and support visibility from day one. Five engineering disciplines determine whether a platform scales gracefully or collects technical debt while it grows.

Figure 1. The five disciplines compound into platform scale. Any one neglected becomes the constraint that forces a rewrite later.

1. Flexibility is the foundation of scale

Hard-coded systems work until they do not. The first request to add a tenant, a region, a SKU (a sellable product variant), or a regulatory variant is the moment a rigid design starts to bend. Each subsequent request adds weight.

Scalable platforms move behavior out of code:

Configuration replaces conditional logic
Feature flags enable safer, tenant-scoped rollouts
APIs evolve through versioning, not breaking changes
Schemas evolve additively. Breaking changes go through versioned contracts with a deprecation window long enough that consumers can migrate without downtime.

In practice

The pattern that works: configuration in a managed store, feature flags with tenant scope, and APIs versioned per consumer contract. Cost is the discipline of treating configuration as code (versioned, reviewed, audited). The return is that releases stop being events and start being routine. A change that previously needed a coordinated deployment can be executed in minutes, gated to a single tenant for verification, and rolled out broadly only after the signal is clean. Most platforms reach this state by retrofit, not by design. Doing it earlier costs less than waiting.

If a change requires a redeploy, it should require a very good reason.

2. Failures are normal. Resilience is a choice.

Distributed systems will fail in unpredictable ways. The real question is not how to prevent failure. It is how the system responds when failure happens.

Resilience is engineered, not inherited from the platform. The patterns that move the needle are well known and consistently applied:

Idempotent operations (safe to call multiple times with the same result) that make retries safe
Reliable messaging patterns such as the transaction outbox (writing the message to the same database transaction as the business change, then publishing asynchronously) to avoid lost or duplicated events
Decoupled services that contain blast radius (the scope of damage when one component fails)
Timeouts, retries, and circuit breakers (a wrapper around a dependency that stops calling it for a cool-off window after repeated failures) tuned per dependency
Bulkheads (isolation pools, often a separate compute or queue lane per workload class) that keep noisy neighbours from starving critical paths of resources

In practice

The pattern that works: every write that can be retried carries an idempotency key, every queue consumer is safe to replay, every event published goes through an outbox in the same transactional unit as the business change. When peak load triggers retries, duplicates collapse cleanly instead of producing duplicate orders, double-charged customers, or split-brain state. The contract changes outwards: callers can retry without thinking, queues can be at-least-once instead of exactly-once, and recovery moves from a manual cleanup task to a property of the system. Most teams that adopt this pattern stop seeing certain classes of incident entirely.

Implementation note

An idempotent API is not just a design preference. It changes how the rest of the system can be built. Once writes are safe to repeat, retries become cheap, queues become trustworthy, and recovery becomes automatic.

The naive implementation (read the key, if absent process and save) has a race. Two concurrent requests with the same key both miss the lookup, both call the processor, and both attempt to save. That is the failure mode idempotency exists to prevent. The pattern that survives production is an atomic reserve-then-execute: insert a row keyed by the idempotency key with a unique constraint before doing any work. The first writer wins. Concurrent callers either wait for the original to complete and read its result, or they receive a conflict response.

// Contract for the idempotency store. The two key methods are TryReserveAsync // (atomic insert with unique-key constraint) and CompleteAsync (record the // result of the first writer). GetCompletedResultAsync polls until the first // writer commits or returns 409 Conflict if the in-flight window exceeds the // configured deadline. public interface IIdempotencyStore { Task<Reservation> TryReserveAsync( string idempotencyKey, string requestHash, CancellationToken ct); Task CompleteAsync( string idempotencyKey, OrderResult result, CancellationToken ct); Task<OrderResult> GetCompletedResultAsync( string idempotencyKey, CancellationToken ct, TimeSpan? maxWait = null); } public readonly record struct Reservation( bool IsFirstWriter, string RequestHash); // Idempotency via atomic reserve-then-execute. // First writer wins; replays return the original result; concurrent // duplicates lose the race and read the winner's outcome (or get 409). public async Task<OrderResult> CreateOrderAsync( Order order, string idempotencyKey, CancellationToken ct) { var requestHash = StableHash(order); // canonical content hash // Atomic insert: succeeds for the first caller, fails for the rest. var reserved = await _store.TryReserveAsync( idempotencyKey, requestHash, ct); if (!reserved.IsFirstWriter) { if (reserved.RequestHash != requestHash) throw new IdempotencyKeyReusedException(); // A previous run committed (return its result) or is in-flight // (poll with a bounded deadline; 409 if exceeded). return await _store.GetCompletedResultAsync( idempotencyKey, ct, maxWait: TimeSpan.FromSeconds(5)); } // We are the first writer. Execute, persist, mark complete. var result = await _processor.ProcessAsync(order, ct); await _store.CompleteAsync(idempotencyKey, result, ct); return result; }

Three production details matter:

TTL or compaction on the idempotency record. Without it, the store grows forever. Most teams retain records for the request retry window plus a safety margin (commonly 24 to 72 hours).
Stable content hash, not the default object hash code. The request hash detects key reuse with a different body, so a client that reuses an idempotency key with a different payload receives IdempotencyKeyReusedException rather than silently getting the wrong result. Canonicalise field ordering, locale, and null handling before hashing.
Bound the in-flight window explicitly. The genuinely hard case is when the processor succeeded but the store write failed. Production-grade implementations either run the side-effect and the store write in the same transaction (when the processor and store share a database) or use the transaction outbox pattern to bridge them. The poll-with-deadline in GetCompletedResultAsync handles the duplicate-arrives-mid-flight case; the transactional boundary handles everything else.

3. Observability is not optional

Without observability, teams operate blind. As systems grow, the price of guessing rises faster than the price of seeing.

At build time, observability is a design property. The decisions made before the system reaches production are what determine whether it can be operated at all. The dashboards, alerts, and incident practices covered in Part 2 of this series rely on instrumentation choices made here.

The build-time work that pays off in production:

Request identifiers propagated through every service hop, every queue, every async boundary, so a single user action can be traced end to end
Structured logging with a consistent schema (event name, correlation id, tenant, severity) rather than free-form strings
Metrics emitted at the boundaries that matter (every external call, every queue read or write, every database operation), not only at the entry point
Tracing libraries integrated at the framework or middleware layer so coverage is automatic, not opt-in
Schemas designed so business signals (orders, sessions, transactions) and system signals (CPU, latency, errors) share the same identifiers and can be correlated later

In practice

The pattern that works: a single request id flowing through every service hop, every queue, every async boundary, propagated automatically at the framework layer rather than per-call. Add one structured logging schema across services (event name, correlation id, tenant, severity), so that a single query joins business events with system events. The investment is hours of upfront framework wiring. The return is that production diagnosis stops being archaeology. Cross-service questions become single dashboards; postmortems shrink from days to hours; and the dashboards in Part 2 actually work because the data underneath is shaped to support them.

4. Delivery practices set the ceiling

Scaling teams requires scaling delivery. Small inefficiencies in pipelines, environments, and release coordination compound into measurable drag.

Delivery maturity that pays off at scale:

Pipelines as code, reviewed and versioned like application code
Parallel deployments across services and regions where dependencies allow
Infrastructure as code with shared modules, not hand-managed environments
Automated quality gates: tests, security scans, dependency checks
Trunk-based development (developers commit to a single shared branch many times a day) with short-lived feature branches and progressive delivery. Important caveat: trunk-based works only when test automation and feature flags are already in place. Adopting it before those foundations exist tends to amplify production incidents rather than reduce them.

In practice

The pattern that works: pipelines run in parallel where dependencies allow, infrastructure provisioning is templated rather than per-environment, and quality gates run automatically rather than as discretionary steps. Sequential deployment of a multi-service platform across three environments takes hours; parallelised deployment of the same change takes minutes. The payback is not only release speed. It is the compounding cost reduction of every wait state for every engineer on every release. Teams that treat pipelines as a product feature, not an afterthought, ship more confidently and recover from bad changes faster because the rollback path was exercised, not invented during an incident.

Slow pipelines are not a tooling problem. They are a design problem.

5. Cost discipline is engineering work

Cloud platforms can become expensive quickly when cost is treated as someone else's problem. Cost is a property of the design, not a quarterly review.

The teams that get this right treat cost the same way they treat performance:

Elastic compute and storage tiers chosen per workload pattern
Non-production environments with automated scale-down windows (the easiest savings to leave on the table)
Tagging discipline so cost can be attributed to a service, a feature, a tenant
Egress and data-tier choices, not compute, dominate cloud bills past a certain scale. Right-size storage tiers (hot vs cool vs archive), eliminate cross-region chatter, and watch egress on the data plane more closely than compute on the request path.
Budgets and usage alerts wired into the same channels as reliability alerts
Cost reviews built into design discussions, not deferred to FinOps (Financial Operations: the practice of managing cloud spend as an engineering concern)

In practice

The pattern that works: non-production environments scale down automatically outside business hours, storage tiers match access patterns (hot, cool, archive), and tagging is enforced so every dollar can be attributed to a service or feature. Cost reviews happen at design time, not after the bill arrives. The biggest savings come from data plane decisions, not compute: cross-region egress, oversized storage tiers, and forgotten test environments dominate cloud bills past a certain scale. Treat cost as a first-class non-functional requirement, alongside latency and availability, and the discipline compounds in every design discussion that follows.

A scenario that ties it together

Figure 2. A reference architecture that puts the disciplines into one shape. The request path is decoupled, the data layer is purpose-fit, identity is brokered by managed identity throughout, private endpoints isolate the data tier from public networks, and observability runs as a first-class lane.

Picture a multi-tenant platform at a growth inflection. Onboarding a new tenant takes weeks because tenant-specific behaviour is hard-coded across services. Every release carries risk because there is no way to roll out a change to one tenant without affecting the rest. Incidents linger because logs and metrics live in different tools and nobody can correlate them in production.

Do not start with a rewrite. Start with the smallest set of changes that unlocks the next year of growth: extract configuration out of code, introduce tenant-aware feature flags, wire a unified observability view into the existing services, and parallelise the pipelines. None of these are architectural revolutions. They are design choices applied with discipline, in the order the disciplines compound.

Eighteen months in, onboarding a tenant takes hours instead of weeks. Releases move from monthly events to weekly increments. Incidents are caught earlier and resolved faster. The platform did not get bigger. It got more capable. The five disciplines did the work; the team made the choice to apply them.

What teams get wrong

The common pattern is architecting for the system you have, not the system you are growing into. It looks like progress because the current sprint ships. Pillars get postponed because they feel like overhead.

The cost surfaces later. Each shortcut becomes a constraint. The constraints compound, and three releases later the team is debating a rewrite.

The fix is not premature abstraction. It is small, deliberate investments in flexibility, resilience, observability, delivery, and cost from day one. The discipline is to make these investments before they are urgent.

Where to start when you cannot do everything at once

Five disciplines is a wall, and real teams cannot fund all five at once. The right order depends on whether the platform is being built fresh or already running.

For a system already in production and already in pain, the SRE community's hierarchy of reliability needs gives the most defensible starting order: monitoring and observability first (you cannot fix what you cannot see), then incident response (close the bleeding cleanly), then resilience patterns (idempotency, retries, decoupling) so the bleeding has fewer reasons to start, then flexibility and delivery so safe change can travel at speed. Cost discipline runs alongside throughout, never as the headline.

For a system being built fresh, the order in this post (flexibility, resilience, observability, delivery, cost) reflects the Azure Well-Architected Framework's emphasis on designing for change, failure, and visibility before scaling teams or workloads. Both orders are defensible. What is not defensible is leaving any of the five for later.

The most concrete starter from this post: request id propagation. A single correlation identifier travelling through every service hop, every queue, every async boundary, costs hours up front and pays back every time someone has to debug production for the rest of the platform's life. It is the smallest unit of the observability discipline and the foundation that the dashboards, traces, and incident response in Part 2 all depend on.

The shift

The most important transformation in scaling a platform is not technical. It is mindset.

The shift is from project thinking to platform thinking:

Build reusable capabilities, not one-off solutions
Design systems for long-term evolution, not the next release
Enable other teams, not just deliver for one team

Tools change. Cloud services evolve. The architectural fashions of this year will not be the architectural fashions of the next. What persists is the discipline behind the choices. Scalable systems are not built by tools. They are built by teams that treat design as continuous work. The same discipline shows up again in Part 2 (operating these systems) and Part 3 (using AI to augment that work). The tools change. The disciplines do not.

Want to discuss? What single design choice has paid the most dividends in the platforms you run? Drop a comment with patterns you have seen in your environment. Every reply gets read.

Next in this series: Running Cloud Native Platforms: Why Day 2 Decides Everything. Building is half the journey. The next post looks at what it takes to operate these platforms once they are in production.

Cloud Native Platforms: Run

KishoreKumarPattabiraman — Thu, 21 May 2026 22:41:09 GMT

Audience: SREs (Site Reliability Engineers), platform engineers, engineering managers running production systems

Reading time: 8 minutes

Series: Cloud Native Platforms. Build, Run, Evolve. This is Part 2 of 3.

Most systems are designed thoughtfully.

Most operations are inherited reactively.

The systems that survive are not the ones built with the most care. They are the ones operated with the most discipline. Production has a way of revealing every shortcut taken during design and every assumption left unverified.

This post is about what it takes to operate a platform once the build is done.

How they are run, not how they are built

Systems are not defined by how they are built. They are defined by how they are run.

A well-designed system that is operated reactively will fail in production. A modestly designed system that is operated with discipline will outperform it. Five operational disciplines decide which side of that line a platform lives on. Each one is engineering work, not a checklist for someone else to handle.

Figure 1. The incident lifecycle as a state machine. The states are not optional steps. They are the contract between the team and the system.

1. Observability is the backbone of reliability

Without observability, every operation becomes a guess. As systems grow, the cost of guessing rises faster than the cost of seeing.

Part 1 of this series argued that observability is a design property: instrumentation contracts, request id propagation, structured logging schemas. Production is where those design choices either pay off or do not. Strong observability in production is a contract that lets any engineer answer three questions in minutes: what failed, why it failed, and what the impact was. The shape of that contract matters more than the tool that implements it. (This three-question framing is community-popularised through the SRE community and writers such as Charity Majors. See Honeycomb's What is Observability for the canonical articulation of the three-pillars and question framing; the substance is older than the framing.)

Dashboards organised around user journeys, not infrastructure components
Service level indicators (SLIs: the specific measurements you care about, e.g., success rate, p99 latency) chosen from the user's perspective, not the database's
Alerts that page only on burn-rate against an SLO (Service Level Objective: the target value of an SLI, e.g., 99.9% of requests complete in under 800ms over a rolling month) using a multi-window strategy. A short window catches fast burns; a long window catches slow drifts. This is what makes SLOs operational rather than decorative.
Sampling and retention tuned for cost, but never for blind spots
The distinction between MTTA (mean time to acknowledge: how fast someone notices) and MTTR (mean time to restore: how fast service returns) tracked separately. Conflating them hides whether the team's bottleneck is detection, response, or fix.

In practice

The pattern that works: rebuild the operational view around two or three user journeys (sign-in, place order, view history) rather than per-component charts. Tie alerts to error budget burn rather than raw threshold crossings. Track MTTA and MTTR separately so the team's actual bottleneck (detection, response, or fix) is visible. The investment is rethinking what to measure, not buying a new tool. The return is that incidents stop being discovered by customer complaints first. Teams that make this shift typically find their existing telemetry was sufficient; only the questions being asked of it were wrong.

If a dashboard cannot answer "what is the user experiencing right now", it is not an observability dashboard. It is decoration.

2. Alerts are signals, not notifications

More alerts do not mean better monitoring. In practice, the opposite is true. Once alerts outpace the team's ability to act, important signals start getting missed.

Effective alerting works to a small set of rules:

Severity that maps to action, not to technical category
Ownership baked in, never inferred at runtime
Thresholds tied to user impact, not raw metric values
Noise treated as a defect, with a regular review cadence
Suppression and grouping for known multi-alert patterns

In practice

The pattern that works: audit every alert against one test, "what action would I take in the next five minutes if this fires now?" Demote alerts with no answer to dashboards. Remove alerts where the answer is the same as another alert's. Group related alerts so one incident produces one page, not twelve. Most teams discover their alert volume drops by an order of magnitude after a thorough audit, and the alerts that remain start getting trusted again. Trust is the precondition for every other operational practice. Without it, on-call rotations decay into noise filtering and the real signals get missed.

Figure 2. From raw events to pages, in approximate orders of magnitude. The numbers vary by team and workload; what does not vary is that each stage needs to remove one to two orders of magnitude of noise. Teams that page on raw events end up with on-call rotations nobody trusts.

3. Incident response is a practiced muscle

Failures are inevitable. Unstructured response is not.

The teams that recover quickly do not improvise during incidents. They follow a structure that has been practiced when nothing was on fire. The structure is intentionally simple, because incident time is the worst time to negotiate roles.

Clear roles: incident lead, communications lead, scribe, subject matter expert (the RACI model, Responsible-Accountable-Consulted-Informed, adapted for incident response)
Defined escalation paths with clear handoff criteria. Escalation means re-paging to a higher tier or specialist, not returning to detection. The lifecycle diagram in Figure 1 makes the distinction explicit.
Runbooks for the top failure modes, kept short enough to actually be read
Status communication on a fixed cadence, even when there is nothing new to say. Customer comms and internal comms are tracked separately.
Blameless postmortems (focus on the system that allowed the failure, not the person who pushed the button) that produce action items the team actually completes
Game days: scheduled exercises that simulate failure modes (region outage, dependency unavailability, traffic spike) under controlled conditions, so gaps in runbooks are found before incidents do

In practice

The pattern that works: name the incident lead and the comms lead before the first message goes out. Write runbooks short enough to be scannable at 3 AM. Run blameless postmortems with action items that actually get tracked to completion. Schedule game days quarterly so the runbooks are exercised before real incidents. Teams that operate with this structure do not have more engineers; they have engineers who are not single points of failure during recovery. The deepest experts stay the deepest experts, but the platform stops depending on whether they happen to be online.

Implementation note

A short, well-structured runbook outperforms a long, exhaustive one. The goal during an incident is not to think. It is to act on a procedure that has been thought through in calmer times.

# Runbook header pattern (keep it scannable in incident time) title: High latency on order API slo_protected: # this runbook protects two SLOs - order-completion-success - order-completion-latency severity: # derived from burn rate, not declared fast_burn: P1 # 14.4x budget burn over 1 hour => page now slow_burn: P2 # 6x budget burn over 6 hours => investigate owner: payments-team indicators: # triggers for evaluation, not severity - p99 (99th-percentile) latency exceeds the SLO target for 5 min - error rate exceeds the SLO target for 3 min on order-completion first_actions: - Open the order-journey dashboard. Confirm impact in business terms. - Check Service Bus queue depth and dead-letter rate (the most common cause of API latency under load is downstream backpressure) - Verify Cosmos DB RU/s saturation and partition hotspots - Inspect the most recent deployment for behavioural changes escalate_if: - Latency does not recover in 15 min - Error rate exceeds 5% (fast burn against the SLO) - Customer reports arrive before our own signals do rollback_path: - Feature flag "new-order-pipeline" can be disabled per-tenant - Last known good deployment id is in the release tracker note_on_scaling: # CPU is rarely the cause of latency in this service. Scale only after # confirming the bottleneck is compute, not a downstream dependency or # queue depth. Adding capacity to a saturated downstream amplifies the # incident; it does not resolve it.

The general principle behind that last note travels beyond this runbook: scale-out is the right remediation for compute saturation, not for downstream saturation. When latency rises because a database, queue, or external dependency is saturated, adding capacity in front of the bottleneck moves more requests into the bottleneck and makes the incident worse. This is one of the most common operational mistakes when the dashboard shows red and the on-call instinct says "add more".

4. Release confidence is engineered

Releases get harder as systems grow. The platforms that ship confidently at scale have engineered the path, not learned to fear it.

The patterns that change the math:

Feature flags that allow change without deploy
Canary deployments (releasing the new version to a small slice of traffic first, watching error budget burn before continuing) that surface problems on a small slice
Gradual rollouts with automated rollback triggers
Database migrations split from application releases
Release coordination that scales with services, not with team size

In practice

The pattern that works: every change ships behind a feature flag, canary deployments take a small slice of traffic first, and rollback is a one-click step in the pipeline rather than a procedure to be invented during an incident. The cost is the discipline of building rollback paths and exercising them. The return is releases that stop being events. Issues that previously triggered full rollbacks get isolated to a slice and rolled back automatically before they reach most users. The willingness to ship smaller, more frequent changes follows directly from the confidence that bad changes can be undone fast.

Big releases feel safe because they are rare. They are actually risky because every change rides together.

5. Reliability is continuous, not a milestone

Reliability is not achieved through tools alone. It requires continuous refinement, feedback-driven improvement, and a budget that the team can spend on operational work without negotiating each time.

The disciplines that keep systems reliable over years are codified well in the SRE-book framing of service level objectives and error budgets (the canonical reference is the Google SRE Book chapter on Service Level Objectives, with the operational follow-up in the SRE Workbook chapter on alerting on SLOs). The names matter less than the practice they enable.

SLOs chosen from the user's perspective, with two or three per service rather than ten. More SLOs means none of them shape behaviour.
Error budgets: the inverse of the SLO, expressing how much unreliability the team is willing to spend in a window. Used up early in the month means slow down on releases. Healthy means feature work keeps moving.
Multi-window burn-rate alerting turns SLOs from dashboards into pages: short window catches catastrophic failures, long window catches slow drift. Without burn-rate alerting, SLOs are observation, not operation. (The pattern is documented in the SRE Workbook.)
Reliability work has its own backlog, prioritised against features. Not a wishlist after every incident.
Regular game days that exercise failure modes (region failover, dependency outage, traffic spike) before they happen for real
Capacity planning informed by data, not by anxiety

In practice

The pattern that works: define two or three SLOs per service, expressed from the user's perspective. Compute the error budget weekly. When the budget is healthy, ship feature work. When the budget is burning fast, slow down and fix the cause. The conversation about which incidents matter and which can wait becomes possible because there is a shared number to point at. Reliability becomes a quantified property of the platform, not an opinion debated at every retrospective. Teams that adopt this discipline stop having the recurring "how reliable do we need to be?" argument and start having data-grounded trade-off discussions instead.

A scenario that ties it together

A platform was launching a new region. The build had gone well. Day 1 was clean. Two weeks in, latency started creeping up during peak hours. Alerts fired on raw thresholds, but no one could tell which ones to trust. Incident calls turned into long debugging sessions because three different teams owned overlapping pieces of the request path.

The team did not start by buying a new tool. They started by treating operations as engineering work. The dashboard was redesigned around the user journey. Alerts were audited and most were demoted or removed. Roles for incident response were written down. A short runbook covered the top failure modes. Releases were broken into canary slices behind feature flags.

None of this was new. It was discipline applied consistently to work that was previously assumed to be someone else's. The next region launch took half the effort, and the team's mean time to restore on the failures that did happen was measurably lower.

What teams get wrong

The common pattern is treating Day 2 as the cost of Day 1. Teams design beautifully, ship fast, then quietly absorb the operational debt. Dashboards proliferate. Alerts grow louder. Postmortems pile up.

The fix is not more dashboards. It is treating operations as engineering work with the same rigour as feature delivery. Operability is a property the system either has or does not. It is not earned by adding monitoring. It is earned by designing for visibility and operating with discipline.

Where to start

The most concrete starter from this post: an alert audit. List every alert that fires in the next week and apply a single test to each one: "what action would I take in the next five minutes?" Demote the alerts that have no answer. Remove the alerts where the answer is the same as another alert's. The audit takes a morning. The result usually halves alert volume and lifts trust on what remains, which is the precondition for every other operational practice in this post.

The shift

The most important shift in maturity is not technical. It is in stance.

The shift is from shipping software to operating systems:

Operations is not a phase that follows engineering. It is engineering.
Reliability is not a milestone reached. It is a discipline practiced.
Incidents are not interruptions to the work. They are the work.

The teams that internalise this shift run platforms that are smaller, calmer, and more trusted. They do not have fewer incidents because their systems are more advanced. They have fewer incidents because their operational discipline is more consistent. Part 3 of this series argues that the same discipline applies again, in a different domain: the practices that make platforms operable are the practices that make AI useful in delivery.

Want to discuss? What is the one operational practice your team adopted that changed how you sleep at night? Drop a comment with patterns you have seen in your environment. Every reply gets read.

Previously in this series: Building Cloud Native Platforms That Scale: Patterns That Actually Work. The first post covered the design choices that make scale possible.

Next in this series: AI-First Platform Engineering: From Copilot to Agentic Delivery. Cloud helped us scale infrastructure. The next post looks at how AI is now changing how we build and run platforms.

Cloud Native Platforms: Evolve

KishoreKumarPattabiraman — Thu, 21 May 2026 22:40:33 GMT

Audience: Engineering leaders, platform architects, senior developers exploring how to operationalise AI in their teams

Reading time: 8 minutes

Series: Cloud Native Platforms. Build, Run, Evolve. This is Part 3 of 3.

Cloud helped us scale infrastructure.

AI is starting to do the same thing for the work around the code: the planning, the testing, the release communication, the incident triage, the writing that surrounds writing software.

The conversation about AI in software has narrowed too quickly to "Copilot in the editor". The bigger story is happening across the lifecycle. Planning, design, development, testing, release, and operations are all being augmented at once. The platforms that adopt AI well are not the ones with the most usage. They are the ones with the clearest discipline around how it is used.

This post is about that discipline.

AI is changing how we engineer, not how we type

AI is not changing how we write code. It is changing how we engineer software.

Code generation is the surface. Underneath it, AI is reshaping the unit of leverage. The question is no longer how fast a developer can type. It is how well a workflow can be expressed as a reusable engineering asset. Six disciplines determine whether AI moves the needle on outcomes or just adds another tool to the stack.

Figure 1. AI across the SDLC. Each phase has clear AI assist points and clear human-owned validations. The boundary is not negotiable. It is the design.

1. From assistance to augmentation

Early AI tools focused on assisting individual developers. Code suggestions. Autocomplete. Quick refactors. The value was real but bounded by the editor.

The shift now is into structured workflows that span the lifecycle. The unit of leverage is no longer a single suggestion. It is a sequence of actions executed reliably across phases. ("Agentic" later in this post means a system that makes its own next-step decisions inside guardrails. A workflow follows a fixed sequence; an agent chooses the path.)

Code generation has become baseline, not differentiator
Workflow generation is where the largest gains live
Multi-step assistance with explicit human checkpoints
Context that travels across tools, not just within one

In practice

The pattern that works: start with the single highest-volume writing task on the team (commit messages, code review comments, release notes, postmortem first drafts) and turn the AI assist for that task into a shared workflow rather than each individual's private trick. The cost is one engineer's afternoon documenting the workflow and the eval set. The return is that every engineer on the team inherits the work, and the task that used to consume an engineer's morning every two weeks becomes a background step in the release process. Workflow generation, not faster typing, is where the gains compound across a team.

Code suggestions help one developer. Reusable workflows help the next ten.

2. AI across the SDLC, with guardrails

AI now has a useful role at every phase of delivery. The role is different at each phase, and the guardrails are different too.

Phase	What AI helps with	What humans must validate
Plan	Breaking down requirements, drafting acceptance criteria	Domain context, business priorities, customer impact
Build	Code generation, refactoring, scaffolding	Architectural fit, security boundaries, performance
Test	Test case generation, edge case discovery	Coverage of business-critical paths, regulatory cases
Release	Release notes, changelog summaries, communication drafts	Accuracy, tone, customer-facing claims
Operate	Log triage, incident summaries, runbook drafts	Root cause attribution, action item ownership

The guardrails are not optional decoration. They are the design.

In practice

The pattern that works: stage AI assists for release communication (changelog drafting, customer-facing release notes, internal release announcements) and require a human review before anything goes out. The draft arrives consistently, faster than a human could produce, and easier to compare across releases. The reviewer is not eliminated; the reviewer is moved from author to editor, which is where their judgment actually matters. Teams that adopt this pattern stop missing release-note deadlines and stop publishing inconsistent communication across products.

3. From prompts to reusable assets

Many teams begin with prompt experimentation. Individuals find techniques that work for their tasks. The result is a patchwork of personal practices that do not survive a team change.

The compounding value comes when prompts mature into reusable engineering assets.

Figure 2. The maturity model from prompts to agents. The value compounds at the workflow stage and accelerates at the agent stage. The disciplines that make agents safe are the same ones that made workflows reliable.

The maturity stages, in order of leverage:

Prompts: ad-hoc, individual, hard to share
Templates: parameterised prompts versioned with the project
Workflows: multi-step sequences with clear inputs, outputs, checkpoints
Agents: autonomous task chains operating within explicit guardrails

The diagram is a maturity ladder, not a graduation. In practice teams operate at all four stages simultaneously for different tasks. A senior engineer may use a one-off prompt to explore a refactor, run a versioned template for commit messages, hand off to a workflow for release notes, and trigger an agent for routine PR triage, all in the same hour. The point of the ladder is not to leave earlier stages behind. It is to know which stage a given task belongs to and to invest accordingly.

In practice

The pattern that works: pick the three prompts your team uses every week, codify them as parameterised templates in the same repository as the application code, and treat them as engineering artefacts (reviewed, versioned, owned). New engineers inherit the team's accumulated practice instead of building their own from scratch. Quality becomes consistent because the variance between individuals shrinks. Investment pays back in weeks, not quarters, and the maturity ladder keeps producing returns as the team moves from templates to workflows to agents.

4. Agentic delivery, with guardrails that survive a security review

The next stage is agentic. AI executes sequences of tasks within a defined scope. The risk is not that the agent will fail. It is that the system around the agent will not catch the failure, and that the failure modes are different in kind from traditional automation. Agents are non-deterministic, they can be manipulated through their inputs, and their actions can have side effects in systems the team does not own.

Five guardrails make agentic delivery safe. The first four are necessary. The fifth is what carries the agent through a security review at a regulated enterprise.

Identity and scope: the agent runs as a managed identity (or scoped service principal) with the smallest set of permissions that lets it do its job. Permissions are expressed as allowlists, not denylists. Tools fetched at runtime are subject to the same identity boundary as the agent itself.
Input quarantine: anything the agent reads from a user-controlled source (work item bodies, PR descriptions, customer tickets) is treated as untrusted text. The agent does not execute instructions found in fetched content, and tool calls are validated against an output schema before execution. This is the prompt-injection mitigation, and it is the most common gap in agentic systems shipped today.
Cost and blast-radius caps: every run has a maximum token budget, a maximum number of tool calls, and a maximum spend. Exceeding any cap aborts the run cleanly. Without caps, scoped credentials are not enough to bound the damage.
Evaluations and traceability: agents are evaluated against a fixed test set before deployment, and on every prompt or model change. Every action is logged with inputs, outputs, the model and prompt versions used, and the reasoning trace where the model exposes one. Logs are redacted for secrets and personally identifiable information at write time.
Reversibility taxonomy: actions are categorised by reversibility, not asserted to be reversible in general. A draft write to a private store is reversible. A post to a customer-facing channel is not reversible (deletion does not unsend). A database update may be reversible by a compensating transaction or not at all. Irreversible actions require human approval at the boundary, before they happen, not after. The agent is allowed to draft and stage. The human is the only one who is allowed to make the move that cannot be undone.

In practice

The pattern that works: start with one low-risk agent (release-notes drafter, PR triage assistant) running on read-only inputs, write-only-to-drafts permissions, and a hard cost cap per run. Require explicit human approval at the irreversible step. Wire up an evaluation set on day one, and rerun it on every prompt or model change. Treat regressions as failures, not warnings. The first agent the team ships is rarely the most valuable; it is the rehearsal that establishes the controls every later agent inherits. Teams that skip this rehearsal end up with an agent in production that no one feels safe extending.

Implementation note

An agent without a reversibility taxonomy and a regression eval set is a liability. The discipline is the same one that made workflows reliable: scoped identity, idempotency, traceability, and a clear boundary between machine action and human decision. The YAML below is illustrative, not a runtime contract; it is meant to show the shape of the controls a real agent definition would carry, not the syntax of any specific platform.

# Agent run definition (illustrative; not a specific platform's syntax) name: release-notes-drafter trigger: pre-release identity: type: managed-identity scope: tenant=<tenant-id> resource=release-tools/<app-id> permissions: allow: - read: work-items in milestone (filter: state=Done) - read: pull-requests in milestone (filter: merged) - write: drafts/release-notes/${run-id} # Production channels are NOT in the allowlist. The agent cannot post. limits: max_tokens_per_run: 80000 max_tool_calls_per_run: 20 max_runtime_seconds: 300 max_cost_usd: 0.40 on_exceeded: abort_with_partial_artifact input_handling: treat_fetched_content_as: untrusted # Indirect prompt injection is mitigated by the layered discipline below, # not by a single feature flag. Each item is a separate control. enforce_instruction_hierarchy: true validate_tool_args_against_schema: true validate_outputs_against_schema: true steps: - fetch: completed work items in milestone - draft: release notes from items - validate: required fields present - request-review: from: release-manager idempotency_key: ${milestone-id}-${draft-hash} - on-approval: action: post-to-internal-channel reversibility: not-reversible requires: explicit-human-click # the agent does NOT click this audit: log_inputs: true log_outputs: true redact: - secrets # Pattern-based: handles structured PII like emails, phones, IDs. - pii_patterns: [email, phone, national-id, payment-card, ip-address] # Entity-based: required for unstructured PII like names. Pattern alone # cannot redact a customer name without an entity-recognition step. - pii_entities: ner-based # names, locations, organisations retain: 365_days # tune to your audit policy, not to the demo evaluation: test_set: tests/release-notes/eval-v3.jsonl on_prompt_change: rerun on_model_change: rerun fail_threshold: 5_percent_regression

5. Where AI still needs human judgment

AI has clear boundaries. The boundaries are not embarrassing. They are the design.

What must stay human-owned:

Architectural trade-offs and design decisions
Security validation and threat modelling
Correctness for business-critical and regulatory paths
Domain context that has not been written down
Accountability for outcomes, not just outputs

The goal is collaboration, not replacement. The teams that get the most value from AI are not the ones with the most automation. They are the ones with the clearest sense of where automation ends and judgment begins.

In practice

The pattern that works: name the human-owned items explicitly in the team's working agreement (architecture, security, regulatory correctness, accountability) and audit every AI workflow against that list. When a workflow asks the AI to make a decision in any of those categories, redesign it so the AI prepares the analysis and a human makes the call. Most teams over-trust AI for one of these areas in their first six months and learn the hard way. Naming the boundary up front prevents the lesson from being paid in production. The clarity is the value; the model behind the workflow is interchangeable.

6. Responsible AI is engineering work

The first five disciplines decide whether AI moves the needle. The sixth decides whether the platform can defend the choices it makes with AI. Responsible AI is the engineering practice of building systems whose AI behaviour is fair, transparent, accountable, and safe by design, not by audit after the fact. Treating it as a compliance checkbox at the end of the project is how teams end up shipping AI workflows that fail security review, embarrass the company, or harm users.

Six controls turn responsible AI from a policy into engineering work. These map directly onto the practices Microsoft and the broader industry have converged on, but the names matter less than the practice they enable.

Fairness in inputs and outputs. The training data, eval set, and prompts are reviewed for systematic bias against any group the system serves. The eval set covers under-represented cases by design, not by accident, and regressions on those cases fail the build.
Transparency to end users. When a user sees AI-generated content, they are told. When a decision is AI-assisted, the path from input to output is explainable in plain language, not just in a model card buried in documentation.
Content safety filters. Inputs and outputs pass through safety classifiers (prompt injection, prohibited content, jailbreak patterns) before reaching the model and before reaching the user. Filtering decisions are logged and reviewable.
Accountability ownership. Every AI workflow has a named owner who is accountable for its outcomes, not just its uptime. The owner has the authority to pause or roll back the workflow when harm is detected.
Data minimisation and residency. The AI sees only the data it needs to do the task. Personally identifiable information and customer data are scoped, redacted, and kept inside the boundary the customer agreed to. Cross-tenant leakage is treated as a P1 incident, not a feature request.
Harm evaluation alongside quality evaluation. The eval set measures harm potential (toxicity, hallucination on factual queries, leakage of confidential context) with the same rigour as it measures correctness. Both must pass for a release to ship.

Figure 3. Responsible AI as a set of engineering controls around the AI workflow. The six controls fall into four categories: data discipline (fairness, data minimisation), model discipline (content safety, harm evaluation), deployment discipline (transparency to users), and governance (accountability ownership). All six are necessary; none is sufficient on its own.

In practice

The pattern that works: write the responsible AI plan before the first agent ships, not after the first incident. Pick one workflow that touches user data or generates customer-facing content, and use it as the reference implementation: fairness review on the eval set, content safety filters wrapping the model call, transparency annotation in the UI, redaction of identifying details in logs, harm evals running alongside quality evals on every change, and a named owner with explicit pause authority. The first such workflow takes longer to ship than the unconstrained version. Every workflow after it inherits the controls and ships faster than it would have without them. Teams that defer responsible AI to a future quarter end up retrofitting it under pressure, which is the most expensive way to do it.

A scenario that ties it together

Picture a platform team several months into using Copilot. Adoption is high. Productivity dashboards show gains. But defect rates are not improving and lead time is flat. Leadership asks the obvious question: is AI actually helping, or just feeling like help?

The answer is not to stop using AI. It is to change how AI is measured. Move adoption metrics to the background. Move outcome metrics to the front: defect escape rate, lead time for change, change failure rate, mean time to recovery. In parallel, promote the individual prompts that have proved themselves to shared templates, and the templates to versioned workflows. Retrofit responsible AI controls onto the workflows that shipped first: content safety filters, harm evaluations alongside quality evaluations, transparency annotations on customer-facing output, and a named owner for each workflow.

Six months later, the picture is different. Defect rate improves on the parts of the codebase where reusable workflows were introduced. Onboarding for new engineers is visibly faster. Release notes are consistent across teams. The shift is from celebrating use to tracking outcomes, and once the team measures what matters, the tooling decisions start making themselves.

What teams get wrong

The common pattern is measuring AI by usage, not by outcome. Adoption metrics tell you who tried Copilot. They do not tell you whether defects dropped, lead time improved, or release notes got better.

The fix is not less AI. It is better measurement. The four metrics named in the scenario above (defect escape rate, lead time for change, change failure rate, mean time to recovery) come from the DORA research on software delivery performance and have become a useful default. Two warnings travel with them. First, attribution is hard: an AI workflow rolled out alongside a test refactor and a CI pipeline change cannot claim credit cleanly. Second, baselines matter more than headlines: a single quarter's improvement is not a trend, and a single team's gain is not the platform's gain. Outcome measurement done well needs a baseline window, an attribution discipline, and a kill criterion for workflows that are not paying back. Done poorly, it is just adoption metrics with better names.

There is also the question of cost. AI usage carries a per-run token bill, an evaluation bill on every change, and (for agents) a cost cap that limits damage when something goes wrong. None of these are large compared to the engineering time saved when the workflow works. All of them are visible enough that a finance-aware reader will ask. Track them.

Where to start

The most concrete starter from this post: promote one personal prompt to a shared template. Pick the prompt that gets used most often (commit messages, code reviews, release notes, debugging assist), move it from someone's notes into the repository where the team versions everything else, and watch what changes when the next person on the team runs it. That is the smallest unit of the workflow shift this post argues for, and it is the step where prompts stop being individual practice and start becoming engineering assets.

The shift

The shift is from building systems to building smarter systems:

AI does not replace engineers. It changes what an engineer's leverage looks like.
The unit of value is the workflow, not the suggestion.
The discipline that made platforms operable is the same discipline that makes AI useful.
Responsible AI is not a compliance step. It is the sixth engineering discipline that lets the other five compound safely.

The series ends here, but the arc is consistent across all three posts. The disciplines that make platforms scale are the same disciplines that make AI useful. Build with discipline. Run with discipline. Evolve with discipline. The tools change. The disciplines do not.

Want to discuss? Where has AI moved the needle most in your delivery, and where has it disappointed you? Drop a comment with patterns you have seen in your environment. Every reply gets read.

Previously in this series:

Building Cloud Native Platforms That Scale: Patterns That Actually Work. Part 1 covered the design choices that make scale possible.
Running Cloud Native Platforms: Why Day 2 Decides Everything. Part 2 covered the operational disciplines that decide production outcomes.

This is the third and final post in the series.

WAR, Azure Advisor, and Us (Azure Arch Diagram Builder): Three Ways to Score an Azure Architecture

arturoqu — Thu, 21 May 2026 18:06:15 GMT

Author: Arturo Quiroga, Azure AI services Engineer - Senior Partner Solutions Architect — Microsoft

A few days ago I published From Prompt to Production: Building Azure Architecture Diagrams with AI, introducing the open-source Azure Architecture Diagram Builder. One feature got more follow-up questions than any other: the Well-Architected Framework (WAF) validation. Architects from partners and customers — many of whom already use Azure Advisor and the Well-Architected Review — wanted to know exactly what scoring algorithm we use, how it compares to Microsoft's official tools, and whether they should be using all three.

This post is that answer. It's a deep dive into how design-time WAF validation works, how Microsoft's two official WAF assessment algorithms work, and where each fits in the architecture lifecycle.

TL;DR. Microsoft ships two WAF assessment vehicles — the Well-Architected Review (questionnaire, scored from human answers) and the Azure Advisor score (healthy-resources-÷-applicable-resources weighted per subcategory, with Defender Secure Score for Security and cost-weighted math for Cost). Both require either a human filling in a form or live Azure telemetry. Our app runs at design time on a diagram, before anything is deployed, using a hybrid pipeline: a deterministic rule pre-scan followed by an LLM refinement pass. Same five WAF pillars, different lifecycle stage. Complementary, not competitive.

Why design-time validation matters

Every cost overrun, reliability gap, and security incident I've ever debugged was cheaper to fix on a whiteboard than in production. Yet most WAF tooling assumes the architecture already exists — either because there are deployed resources to scan (Advisor) or because someone has built enough of it to answer 60 specific questions about it (WAR).

That leaves a gap. Between "rough sketch" and "deployed resource group" there is no algorithmic WAF feedback loop. That's the gap the Diagram Builder fills.

Microsoft's two official WAF assessment algorithms

Before describing our approach, it's worth being precise about what Microsoft already ships, because the term "WAF assessment algorithm" can mean either of two very different things.

1. Azure Well-Architected Review (WAR) — questionnaire-based

The Well-Architected Review is a free self-assessment hosted on Microsoft Learn.

Aspect	Detail
Input	Human answers to ~60 questions mapped to the WAF pillar checklists
Workload variants	Core WAR, plus AI/ML, IoT, SAP on Azure, Azure Stack Hub, SaaS, Mission Critical
Scoring	Derived from the answers — each "no" or unanswered question subtracts from the pillar score
Output	Per-pillar maturity score + prioritized recommendations + optional Advisor integration
Improvement tracking	"Milestones" (point-in-time snapshots)
When to use	Periodic deep reviews; greenfield design baselining; brownfield audits

WAR is human-driven. The algorithm is essentially "how many of the recommended practices have you confirmed you do?" — which is exactly the right algorithm when the assessor is the workload team itself.

2. Azure Advisor Score — telemetry-based

The Advisor score is the closest thing Microsoft ships to a real, deterministic WAF algorithm. It runs continuously over your deployed Azure resources.

The math:

Pillar-specific overrides:

Security uses Microsoft Defender for Cloud's Secure Score model.
Cost weights by retail $ cost of healthy resources, plus age-of-recommendation weighting; postponed/dismissed items are removed from the denominator.
Reliability / Performance / Operational Excellence use the healthy-resources ratio above.

Key terms:

Healthy resource — a deployed resource with no open Advisor recommendation against it for that pillar.
Total applicable — resources Advisor was able to evaluate (excludes dismissed/snoozed).

Advisor is the right tool once you're in production. It cannot help you before deployment, because there is nothing to count as "healthy" or "applicable."

The missing stage: design time

Here's the lifecycle, with each tool's domain shaded:



Design / Diagram — Diagram Builder validation runs here.

Operate / Observe — Azure Advisor runs here continuously.
Periodic Review — WAR runs here, typically quarterly or at major milestones.

These three stages are sequential and complementary. Our app does not replace Advisor or WAR — it adds a feedback loop earlier in the lifecycle, where corrections are cheapest.

How design-time validation works in the Azure Architecture Diagram Builder

The validator is a two-phase hybrid pipeline: deterministic local rules first, then LLM refinement. The full source lives in three files:

src/services/architectureValidator.ts — orchestrator and prompt
src/services/wafPatternDetector.ts — topology + service rule engine
src/data/wafRules.ts — the rule knowledge base

Phase 1 — Deterministic rule pre-scan (~1 ms, no LLM)

When you click Validate Architecture, the validator runs a fully client-side rule engine against the diagram's services, connections, and groups. There are two kinds of rules:

Architecture-pattern rules

These fire when a topology anti-pattern is detected:

Pattern	Detection trigger
`single-region`	No global LB (Traffic Manager / Front Door) with ≥3 services
`single-database`	Exactly one database service, no replication signal
`no-cache`	Compute + database present, no Redis/CDN
`no-monitoring`	No Azure Monitor / App Insights / Log Analytics
`no-identity`	No Microsoft Entra ID
`no-waf`	Public web tier without WAF / Front Door / App Gateway
`direct-db-access`	An edge from a frontend service directly into a database
`no-key-vault`	4+ services and no Key Vault
`no-backup`	Database present, no Azure Backup / Recovery Services
`no-api-gateway`	2+ compute services and no APIM / App Gateway / Front Door

Service-specific rules

Every service in the in the generated Azure Architecture diagram is matched against SERVICE_SPECIFIC_RULES by normalized type — App Service, Functions, AKS, Cosmos DB, SQL Database, Storage, Key Vault, and 22 more.

The knowledge base at a glance

Metric	Count
Total rules	73
Architecture-pattern rules	10
Service-specific rules	63
Distinct Azure services covered	29
Rules tagged Reliability	18
Rules tagged Security	34
Rules tagged Cost Optimization	5
Rules tagged Operational Excellence	7
Rules tagged Performance Efficiency	9

The preliminary score

Each finding has a severity, and severity drives a fixed point deduction from a starting score of 100:

Severity	Deduction
critical	−12
high	−7
medium	−3
low	−1

Result is floored at 10 (so even a deliberately bad architecture scores at least 10) and ceilinged at 95 (no findings ≠ perfect — there's always something the model might still catch). This is the deterministic baseline before the LLM ever sees the architecture, and it's what makes the pipeline reproducible.

Phase 2 — LLM contextual refinement

The pre-scan output, the topology, and the optional natural-language description are folded into a focused prompt sent to one of seven Azure OpenAI models (GPT-5.1 through 5.4, GPT-5.x Codex variants, DeepSeek V3.2 Speciale, Grok 4.1 Fast). The system prompt gives the model explicit scoring guardrails:

Score based on what IS present, not what COULD be added.

A well-connected architecture with appropriate services should score 60–80.

Score below 50 only for critical gaps (no auth, no monitoring, single points of failure).

Findings are improvement suggestions, not reasons to penalize the score severely.

The model returns strict JSON:

{
  "overallScore": 0-100,
  "summary": "2–3 sentence assessment",
  "pillars": [
    {
      "pillar": "Reliability | Security | Cost Optimization | Operational Excellence | Performance Efficiency",
      "score": 0-100,
      "findings": [
        {
          "severity": "critical | high | medium | low",
          "category": "...",
          "issue": "...",
          "recommendation": "...",
          "resources": ["service-name-1", "service-name-2"],
          "source": "rule-based | ai-analysis"
        }
      ]
    }
  ],
  "quickWins": [ /* same shape as findings */ ]
}

Two things to call out:

Every finding is tagged rule-based or ai-analysis. That tag is the credibility lever. You can always see what the deterministic engine produced versus what the model contributed on top. If you don't trust the AI layer, you can ignore it entirely — the rule layer still stands.
The LLM is given pattern hints, not the entire rule catalog. The prompt stays small and focused, which is roughly 3–5× faster and cheaper than asking the LLM to do everything from scratch.

What the user sees

On every run the modal reports:

Overall WAF score (0–100)
Per-pillar score × 5 (0–100 each)
Severity breakdown — counts of critical / high / medium / low across all findings
Quick wins — high-impact, low-effort items the model surfaces separately
Hybrid metadata — local findings count, patterns detected, KB rules used, preliminary score, local elapsed ms
AI metrics — model used, reasoning effort, prompt/completion/total tokens, elapsed time
App Insights telemetry — an Architecture_Validated event with model, overall score, finding count, elapsed time

Worked example

Take this prompt, which I've used in demos with partners:

"A multi-region web application: Azure Front Door in front of two App Service instances in West US 2 and East US 2, both reading from an Azure SQL Database with geo-replication, with Application Insights for telemetry. No Entra ID, no Key Vault."

After generation, Validate Architecture runs:

Phase 1 — pre-scan (deterministic), ~1 ms

Patterns detected: no-identity, no-key-vault
Findings produced: 8 (1 critical, 1 high, 3 medium, 3 low)
Preliminary score: 100 − 12 − 7 − (3×3) − (1×3) = 69

Phase 2 — LLM refinement, ~6–9 s depending on model

The model accepts the two pattern hints, validates them in context, and adds three more findings of its own:

Finding	Source	Pillar	Severity
No Microsoft Entra ID for authentication	rule-based	Security	critical
No Key Vault for secret management	rule-based	Security	high
App Service slots not used for safe deploys	ai-analysis	Operational Excellence	medium
SQL DB geo-replication present but RTO/RPO not documented	ai-analysis	Reliability	medium
No CDN for static assets behind Front Door	ai-analysis	Performance Efficiency	low

Final scores returned by the model:

Pillar	Score
Reliability	78
Security	52
Cost Optimization	80
Operational Excellence	70
Performance Efficiency	75
Overall	71

The Security score is the lowest because two of the highest-severity findings landed there — exactly what a human reviewer would flag first.

Multi-model comparison

Because the deterministic floor is identical across runs, the Validation Comparison view becomes a fair shootout of what each LLM adds on top of the same baseline. The same diagram is scored by all seven models, and the UI surfaces:

Overall score per model
Per-pillar score per model
Severity-count deltas
Number of ai-analysis findings each model contributed
Quick wins each model identified

This is genuinely useful for two reasons. First, it shows that LLM scores vary — typically by ±5–10 points on the same architecture — which is exactly why we publish the rule-based vs ai-analysis tag. Second, it lets architects pick the model whose review style matches their own.

How we align with Microsoft's algorithms

Alignment point	What it means
Same five pillars	Identical names and scope to the official WAF
Same source material	Rules derived from WAF docs and Azure Architecture Center service guides
Severity-graded findings	Map conceptually to Advisor's high/medium/low impact recommendations
Per-pillar + overall scoring	Mirrors WAR/Advisor output shape, so the results feel familiar

Where we deliberately differ — and why

Concern	Microsoft	Diagram Builder	Why we differ
Needs deployed resources	Advisor: yes	No — works on a diagram	We're a design-time tool; the architecture doesn't exist yet
Needs human Q&A	WAR: yes	No — derived from the diagram	One-click validation inside the design flow
Healthy/Applicable ratio	Advisor: yes	No	No resource-health signal exists pre-deployment
Subcategory fixed weights	Advisor: yes	No explicit weights	Severity is the de-facto weight (12/7/3/1)
Defender Secure Score for Security	Advisor: yes	No	Defender requires deployed resources
Cost-weighted scoring	Advisor: yes	No (separate Cost Estimation feature)	Cost is a separate pipeline in our app
AI/LLM refinement	Neither	Yes	Catches context-specific issues a static catalog misses, and explains findings in natural language
Multi-model comparison	Neither	Yes	Lets architects see scoring variance across models

Honest limitations

I'd rather you hear these from me than discover them in production:

LLM scores drift. ±5–10 points across models on the same diagram is normal. Treat the score as directional, the findings as actionable. The rule-based tag is your anchor.
No live telemetry. We can't know if your App Service is actually using availability zones — only that you have App Service in the diagram. Advisor will tell you the truth post-deployment.
Generic ruleset. No specialized workload branches yet (AI/ML, IoT, SAP, SaaS). WAR has those.
No milestone tracking. Each validation run is independent. Compare runs manually using the Validation Comparison view.
Rule coverage is finite. 29 services and 73 rules is a strong start but not exhaustive — the LLM layer exists in part to compensate for that gap.

How to use all three together

A lifecycle that actually works:

Design — Use the Diagram Builder to sketch the architecture and validate at design time. Iterate until the per-pillar scores look reasonable and the critical/high findings are addressed.
Deploy — Generate Bicep from the diagram, deploy, and let Azure Advisor start scoring real resources.
Operate — Use Azure Advisor continuously. Use Defender Secure Score for security posture.
Periodic review — Run a Core WAR every quarter or at major milestones to capture the things only humans know (business context, tradeoffs, planned debt).

None of these three replace the others. They cover different stages of the same loop.

What's next

A few things on the roadmap I'd love feedback on:

Milestone tracking so design-time scores can be compared over time the way WAR milestones work.
Workload-specific rulesets mirroring WAR's branches — starting with AI/ML.
Direct Advisor handoff — once a diagram is deployed, surface the corresponding Advisor recommendations in the same UI to close the loop.

Try it, fork it, tell me where it's wrong

Live app: https://aka.ms/diagram-builder
Source: github.com/Arturo-Quiroga-MSFT/azure-architecture-diagram-builder

Useful references:

If you're a partner or customer architect who's already living in Advisor and WAR, I'd genuinely value your reaction — does the design-time stage feel like a real gap to you, or are you already covering it some other way? Open an issue on the repo or reply on LinkedIn.

Posted on the Azure Architecture Blog · Comments and issues welcome on the repo.

From Prompt to Production: Building Azure Architecture Diagrams with AI

arturoqu — Fri, 22 May 2026 18:35:07 GMT

Author: Arturo Quiroga, Senior Partner Solutions Architect — Microsoft

Cloud architects spend significant time translating ideas into architecture diagrams. They toggle between Visio, draw.io, pricing calculators, and documentation. According to the 2024 Stack Overflow Developer Survey, 61% of developers spend more than 30 minutes a day searching for answers or solutions, time lost to context-switching rather than design. What if you could describe your architecture in plain English and get a diagram, cost estimate, and deployment guide in minutes?

The Challenge: Fragmented Architecture Workflows

Designing Azure architectures today typically involves multiple disconnected steps:

Sketch the architecture in a diagramming tool
Look up official Azure icons and drag them into place
Research pricing across regions using the Azure Pricing Calculator
Validate the design against the Well-Architected Framework (WAF)
Write deployment documentation and Infrastructure as Code templates
Compare alternative designs manually

Each step lives in a different tool, and keeping them in sync as designs evolve is costly. The Azure Architecture Diagram Builder brings these workflows together in a single browser-based experience.

How It Works

Describe your architecture in natural language, for example "A HIPAA-compliant healthcare platform with FHIR APIs, event-driven processing, and multi-region disaster recovery", and the AI generates a diagram with grouped services, data flow connections, and logical organization.

Figure 1. Enter a natural-language prompt describing your architecture. Curated example prompts help you get started, and you can optionally upload an existing diagram for the AI to analyze.

The tool uses Azure OpenAI to power generation across multiple models, enabling you to choose the model that best fits your scenario — from fast iterations to deeper reasoning.

Key Features

AI-Powered Architecture Generation

Describe what you need in plain English, and the AI creates an architecture diagram with:

714 official Azure service icons across 29 categories
Smart grouping: services are logically organized (Frontend, Backend, Data, Security)
Data flow connections: labeled edges showing how data moves through the system
13 curated example prompts: from simple web apps to complex enterprise scenarios like Zero Trust networks, Industrial IoT with 5,000+ sensors, and global multiplayer gaming backends

Figure 2. A generated industrial IoT architecture. Top: the clean diagram view as initially produced. Bottom: the same diagram with per-service monthly cost overlays toggled on, plus a running subscription total in the toolbar.

Architecture Image Import

Already have an architecture on a whiteboard or in a screenshot? Upload the image and let the AI analyze it, mapping services to official Azure icons and recreating the architecture as an editable, interactive diagram.

Figure 3. Upload a photo of a whiteboard sketch (top-right reference panel) and the AI recreates it as an editable diagram with official Azure service icons and labeled data flow connections.

ARM Template Import

Import existing ARM templates to visualize your current infrastructure. The AI parses resource definitions and dependencies, groups related resources into logical layers, and produces a meaningful diagram of what you actually have deployed — a fast way to document an inherited environment or sanity-check a template before deployment.

Figure 4. ARM template import in action. Top: the parser status banner while resources and dependencies are being analyzed. Bottom: the resulting diagram, with resources auto-grouped into logical layers (Web Tier, Data Layer, Container Platform, Observability & Logging) and a Generated from: ARM Template badge linking the diagram back to its source file.

Well-Architected Framework Validation

Validate your architecture against all five WAF pillars — Security, Reliability, Performance Efficiency, Cost Optimization, and Operational Excellence. The validator provides:

An overall WAF score with pillar-level breakdowns
Specific findings with severity levels
Actionable recommendations you can select and apply

Select the recommendations you agree with, and the AI regenerates an improved architecture incorporating those changes.

Figure 5. WAF validation results showing the overall score, per-pillar breakdowns, and individual findings with severity badges. Tick the recommendations you want and the AI rebuilds the diagram with those changes applied.

Multi-Model Comparison

Run the same architecture prompt through multiple AI models side-by-side and compare:

Architecture Comparison: service counts, connection counts, groups, token usage, and latency
Validation Comparison: WAF scores across models, severity breakdowns, and finding counts
Apply Winner: pick the best result and apply it to the canvas with one click
Present Critique: a talking avatar narrates the AI-generated ranking with live closed captions

Figure 6. Multi-model comparison. Top: select the models and reasoning effort, then enter the prompt. Bottom: side-by-side results across all selected models with service counts, latency, token usage, and Fastest / Cheapest / Most Thorough badges.

Multi-Region Cost Estimation

Get cost estimates from the Azure Retail Prices API across 8 Azure regions: East US 2, Australia East, Canada Central, Brazil South, Mexico Central, West Europe, Sweden Central, and Southeast Asia. Features include:

Color-coded cost legend (green / yellow / red thresholds)
SKU and tier information for each service
Export options: CSV, JSON, plain-text summary, and an analysis report with top cost drivers, Reserved Instance flags, and a ranked multi-region comparison table

Figure 7. The cost legend overlay shows per-service pricing with color-coded thresholds. The region selector in the toolbar lets you re-price the entire architecture in any of eight Azure regions.

Deployment Guide Generation with Bicep

Generate step-by-step deployment documentation including:

Prerequisites and Azure resource requirements
Step-by-step deployment instructions
Bicep templates for each service (Infrastructure as Code)
Post-deployment verification steps
Security configuration recommendations

Figure 8. Each generated Deployment Guide opens with the architecture name, an estimated deployment time, and a prerequisites checklist covering subscription roles, CLI versions, Microsoft Entra ID permissions, and region requirements, followed by numbered, copy-ready deployment steps.

Figure 9. The Infrastructure as Code section produces a main.bicep orchestrator plus a per-service module (Log Analytics, Key Vault, Cosmos DB, SQL Database, Event Hubs, Azure Functions, and more). The Download All Templates button packages everything into a ready-to-deploy folder.

Workflow Animation & Avatar Presenter

Visualize how data flows through your architecture with step-by-step animations that highlight services on the canvas as each step plays. When the Azure Speech Service is configured, a photorealistic talking avatar can narrate the workflow or present model comparison results, with live word-by-word closed captions in a draggable, resizable panel.

Figure 10. A workflow step is highlighted on the canvas as the Avatar Presenter narrates that step. Live word-by-word closed captions appear in a draggable, resizable panel, useful for accessibility and stakeholder demos.

Export Options

Figure 11. A single-slide PowerPoint export, available in dark or light theme, ready to drop straight into a stakeholder deck.

Format	Use Case
PNG	Documentation, presentations
SVG	Scalable vector graphics
PPTX	Single PowerPoint slide (dark or light theme)
Draw.io	Edit in diagrams.net
JSON	Backup, version control
CSV / ZIP	Cost analysis with multi-region comparison

Highlights

The Azure Architecture Diagram Builder unifies the architecture design lifecycle in a single tool:

End-to-end workflow: from natural-language description to deployable Bicep templates without tool switching
Official Azure icons: 714 icons across 29 categories, mapped directly from the Azure service catalog
Live pricing: queries the Azure Retail Prices API at design time rather than relying on static estimates
WAF-integrated validation: architectural best practices built into the design loop rather than applied after the fact
Multi-model flexibility: choose the AI model that best suits each task, with fast models for iteration and reasoning models for complex designs
Open source: the source code is available for customization and contribution

One-Command Deploy with Azure Developer CLI

The fastest way to get your own instance running is with azd:

# Install azd (once)
brew tap azure/azd && brew install azd   # macOS
winget install microsoft.azd             # Windows

# Clone, configure, and deploy
git clone https://github.com/Arturo-Quiroga-MSFT/azure-architecture-diagram-builder
cd azure-architecture-diagram-builder
azd auth login
azd env set AZURE_OPENAI_ENDPOINT "https://your-resource.openai.azure.com/"
azd env set AZURE_OPENAI_API_KEY  "your-key"
azd up   # Provisions infrastructure + builds + deploys (~8 min)

azd up provisions the following via Bicep:

Resource	Purpose
Azure Container Registry	Stores the Docker image
Azure Container Apps	Runs the app (nginx + token server)
Log Analytics + Application Insights	Monitoring and telemetry
Azure Speech (S0)	Avatar Presenter (optional, keyless auth via managed identity)

Try It Today

The Azure Architecture Diagram Builder is available now:

Live demo: https://aka.ms/diagram-builder
Source code: GitHub repository
Documentation: See the Getting Started Guide for detailed setup instructions

We welcome feedback and contributions. Use the GitHub Issues page to report bugs, suggest features, or share your experience.

Tags: artificial intelligence · application · apps & devops · well architected · infrastructure

How to Secure Azure Databricks without Public Exposure using WAF + Private Endpoints

FaizaanMerchant — Mon, 11 May 2026 23:37:12 GMT

While first thing that comes up to mind is that lets configure IP Access List with keeping Azure Databricks in Hybrid Connectivity. This approach is technically doable, but not the best approach for organizations which follows Zero Trust Architecture Framework.

This is where organizations has to design a solution which follows CAF Principles and is fully secured with Azure Application Gateway with Web Application Firewall (WAF) combined with Private Endpoints (Azure Private Link) becomes critical to enforce Zero Trust Architecture.

In this blog, I’ll walk through:

Why Zero Trust is essential for Databricks
Traffic Flow For Securing Databricks
Architecture Components
Key Considerations
Conclusion

Why Zero Trust for Azure Databricks?

Azure Databricks is a SaaS-managed service—but in many enterprise environments, data sensitivity demands full network isolation.

Azure Private Link enables organizations to:

Eliminate public internet exposure
Ensure traffic remains within private networks
Reduce risk of data exfiltration

Additionally, organizations prefer:

Controlled access via corporate network (VPN/ExpressRoute)
Full audit and inspection of inbound traffic
Strong compliance alignment (RBI, SEBI, PCI-DSS, etc.)

This drives the need for private-only access models, where public access is completely disabled.

Traffic Flow For Securing Databricks –

Organizations following Zero Trust Architecture often has concerns about accessibility in secured manner, while majority being accessed through the intranet but there are scenarios where in subset of users / partners / vendor needs access which are outside of your organization network and they cant be incorporated in the organization network, instead they have to access over Internet.

This becomes very challenging to consider the flow and secure your connectivity. Below chart would explain how the traffic will flow happen entirely for Databricks and how we will secure it –

External User Access (Red Flow)
1. External user connects via internet
2. Request hits Application Gateway (WAF)
3. Traffic is:
  - Inspected
  - Validated
4. Routed to Databricks via Private Endpoint

Ensures all external traffic is secured and inspected

Internal User Access (Green Flow)
1. Internal user connects via VPN / ExpressRoute
2. Traffic enters Hub VNet
3. Routed directly to:
  - Databricks Private Endpoint (in Spoke VNet)

Ensures private, low-latency, secure access without internet exposure

Architecture Components –

In this section of the blog, I’ll go through the architecture and the components that are required for design considerations.

As we are focused on Zero Trust Architecture, we will consider having a secured Hub & Spoke networking model.

Key Services & Considerations –

Application Gateway with WAF
WAF with Custom Rules
Secured Internal Network Connectivity
Databricks Workspace – Public Disabled
Databricks Workspace – Private Endpoint Enabled

1. Application Gateway with WAF

Acts as the single entry point to Databricks, deployed in the Hub VNet.
Provides SSL termination, routing, and traffic inspection, ensuring all access is centralized and backend services are not directly exposed.

2. WAF with Custom Rules

Protects against OWASP Top 10 threats and runs in Prevention mode.
Custom rules enable restriction based on IP, geo-location, or request patterns, giving fine-grained security and compliance control for specific networks only.

3. Secured Internal Network Connectivity

External users → access via WAF (inspected traffic)
Internal users → access via VPN/ExpressRoute (private flow)

Supported by NSGs, VPN Gateway, and firewall controls to enforce secure, segmented access.

4. Databricks Workspace – Public Access Disabled

Public access is completely disabled, ensuring the workspace is not exposed to the internet and only accessible via approved private paths.

5. Databricks Workspace – Private Endpoint Enabled

Exposed via Private Endpoint in the Spoke VNet, mapped to a private IP with DNS integration.
Ensures traffic remains within Azure backbone and is accessible only through authorized networks.

Key Considerations

Application Gateway Listener should be configured with FQDN with which external users will be accessing the Databricks.
Listener FQDN (Example - databricks.example.com) will be resolving to Public Ip of Application Gateway (Frontend Ip).
SSL Certificate & Public DNS Configurations has to be considered for the configuration to work.
Backend Pool for Application Gateway should be configured with FQDN of Databricks Workspace. Note – If you add Private Ip of workspace it will be shown as health backend, but web request will fail
Ensure that DNS Resolution for Private Endpoint is configured appropriately.
Application Gateway should be able to resolve FQDN of Databricks workspace to Private Endpoint.
End-to-End SSL Configuration at Application Gateway Level.
Custom Rules for WAF should be configured for allowing particular Public Network while others should be blocked.

Conclusion

As organizations continue adopting Azure Databricks for critical analytics and AI workloads, securing access becomes non-negotiable. A traditional approach with hybrid network model with IP Access List is no longer sufficient for enterprise-grade security.

By combining Application Gateway (WAF) with Private Endpoints, and leveraging a Hub-Spoke architecture, we can achieve a true Zero Trust design—where every request is validated, inspected, and routed securely without exposing backend services.

This architecture not only reduces the attack surface but also ensures compliance with strict regulatory standards, while still enabling seamless access for both internal and external users.

Useful Links -

Azure WAF Custom Rules - https://learn.microsoft.com/en-us/azure/web-application-firewall/ag/custom-waf-rules-overview

Azure Databricks Private Link - https://learn.microsoft.com/en-us/azure/databricks/security/network/concepts/private-link#choose-the-right-private-link-implementation

Azure Databricks Architecture - https://learn.microsoft.com/en-us/azure/databricks/security/network/concepts/architecture

Azure Databricks VNet - https://learn.microsoft.com/en-us/azure/databricks/security/network/classic/vnet-inject#overview

Azure Hub & Spoke Model - https://learn.microsoft.com/en-us/azure/architecture/networking/architecture/hub-spoke

Configure DNS forwarding for Azure NetApp Files

mkachare — Mon, 11 May 2026 04:20:10 GMT

This post has been written with the collaboration of Rizul Khanna

Applies to: Azure NetApp Files — SMB, dual-protocol, and NFSv4.1 Kerberos volumes deployed in hub-spoke or Azure Virtual WAN topologies using an external private DNS forwarder.

Overview

Azure NetApp Files (ANF) has a hard dependency on DNS for all volume types that integrate with Active Directory (AD): SMB, dual-protocol (SMB + NFS), and NFSv4.1 with Kerberos. Unlike most Azure PaaS services, ANF does not use Azure Private Link and has no privatelink.* zone. Its volumes attach directly to a delegated subnet, and their hostnames are registered into AD-integrated DNS via Secure Dynamic DNS (DDNS). This architecture means DNS design decisions for the ANF delegated subnet are fundamentally different from those that apply to storage accounts, SQL databases, or other services that use private endpoints.

This article documents what DNS resolution ANF requires, how to correctly configure an external private DNS forwarder in hub-spoke and Virtual WAN deployments, and the specific undocumented requirements that cause volume creation failures and SMB permission errors in practice. Several requirements covered here are not present in the official Azure NetApp Files documentation and have been identified through field support cases.

ANF does not inherit the VNET DNS server setting. It queries only the two DNS server IPs configured in the Active Directory connection on the NetApp account. This is not documented in the ANF networking or AD connection articles. The VNET DNS server setting is irrelevant to ANF volume creation and AD join behavior — only the AD connection DNS IPs matter.

Architecture overview

The following diagram shows the two separate DNS paths that must be configured when ANF is deployed in a hub-spoke or Virtual WAN topology with an external private DNS forwarder. The client resolution path (VNET DNS setting) and the ANF internal resolution path (AD connection DNS fields) are distinct and must not be conflated.

Note:

ANF AD connection DNS IPs must point to the external DC IPs directly — not to the private DNS forwarder. The forwarder handles client-side resolution only and must have both forward and reverse rulesets for the AD domain.
Figure 1: DNS resolution paths for ANF with an external private DNS forwarder. Client VMs use the forwarder (VNET DNS setting). ANF uses the external AD DC IPs directly (AD connection DNS fields). Both forward and reverse lookup rulesets are required on the forwarder.

What DNS must provide for Azure NetApp Files

Outbound resolution — ANF querying DNS

ANF must be able to resolve the following records from the DNS IPs specified in the AD connection:

AD domain controller SRV records: _ldap._tcp.<site>._sites.dc._msdcs.<domain>, _kerberos._tcp.dc._msdcs.<domain>, and site-scoped equivalents

Kerberos KDC records: _kerberos._tcp.<domain> and _kerberos-master._tcp/udp.<domain>

DC A records: Forward lookup for each DC hostname to its IP

PTR (reverse) records: IP-to-hostname for each DC — required for dual-protocol volume creation, NFSv4.1 Kerberos, LDAP-over-TLS certificate validation, and NTFS ACL operations on SMB shares.

Note:

_kerberos-master._tcp and _kerberos-master._udp SRV records are not created automatically by Active Directory DNS. They must be added manually in the DNS zone. Their absence causes Kerberos failures that do not clearly identify DNS as the root cause. This requirement is not documented in any ANF article.

ANF performs Secure Dynamic DNS (DDNS) using GSS-TSIG to register SMB and dual-protocol volume hostnames in AD DNS. This requires that the DNS IPs in the AD connection belong to Microsoft AD-integrated DNS servers. External private DNS forwarders (Infoblox, BIND, Unbound, and similar appliances) do not support GSS-TSIG and will silently discard DDNS updates — volume hostnames will not appear in DNS and SMB mounts will fail. No error is surfaced in the ANF portal or activity log when DDNS is silently dropped.

Inbound resolution — clients resolving ANF hostnames

SMB and dual-protocol volumes are accessed via a hostname of the form <smb-prefix>-XXXX.<ad-dns-domain>, where the four-character suffix is assigned by ANF and cannot be overridden. Clients must resolve this hostname to the volume IP via the VNET DNS server setting, which in enterprise environments points to the external private DNS forwarder. The forwarder must have a forward lookup ruleset for the AD domain pointing to the external DC IPs. NFSv3 mounts use the volume IP directly and do not require hostname resolution.

Note:

NFSv3 volume creation success does not indicate SMB readiness. NFSv3 mounts use the volume IP directly and require no AD join, Kerberos exchange, or reverse DNS. SMB and dual-protocol volumes require all three. Using NFSv3 as a connectivity proxy during SMB troubleshooting produces false confidence. This distinction is not documented in ANF troubleshooting guidance.

Configuring the external private DNS forwarder

The two DNS paths — client resolution vs ANF internal

In environments using an external private DNS forwarder (Infoblox, BIND, Windows DNS VM, or similar appliance), two distinct DNS paths must be kept separate. The VNET DNS server setting governs client resolution of ANF SMB hostnames and should point to the external forwarder. The ANF AD connection DNS fields govern ANF's own resolution of DCs and DDNS registration and must point directly to writable Microsoft AD-integrated DC IPs.

DNS Path	Used By	Correct Target
Client resolution (VNET DNS setting)	VMs, Citrix, application servers resolving ANF SMB hostnames.	External private DNS forwarder, which forwards AD zone queries to external DC IPs
ANF internal resolution (AD connection DNS fields)	ANF service — DDNS, Kerberos, LDAP, SRV lookup	Writable AD-integrated external DC IPs directly — not the forwarder

Required rulesets on the external private DNS forwarder

The external private DNS forwarder must have both of the following rulesets configured. Missing either one produces failures that are difficult to diagnose because forward DNS tests pass while the actual failure occurs in a different path.

References:
Understanding Private DNS resolver endpoints & rulesets
How to create Private Reverse DNS records

Forward lookup ruleset

Forward all queries for the AD domain to the external DC IPs:

Zone: ad.contoso.com
Targets: <DC-IP-1>:53, <DC-IP-2>:53 (writable external DC IPs)

Reverse lookup ruleset (most commonly missing)

Forward reverse lookup queries for the DC IP range to the external DC IPs:

Zone: <reverse-octets>.in-addr.arpa.
Targets: <DC-IP-1>:53, <DC-IP-2>:53 (same external DC IPs)

Critical:

The reverse lookup ruleset is the most commonly missing configuration item and causes a failure that forward DNS tests do not detect. Without it, Windows clients cannot resolve DC IPs to hostnames. This produces the following error when provisioning NTFS permissions on an ANF SMB share: 'The program cannot open the required dialog box because it cannot determine whether the computer named is joined to a domain.' All connectivity tests pass. Forward DNS passes. The volume was created successfully. Only the reverse lookup fails — and only when NTFS ACL operations are attempted. This failure mode and its root cause are not documented in any ANF article.

GSS-TSIG constraint — why the forwarder cannot be in the ANF AD connection

External private DNS forwarders (including Infoblox, BIND, Unbound, and third-party appliances) do not support GSS-TSIG, the protocol ANF uses to securely register SMB volume hostnames into AD DNS. If a forwarder IP is placed in the ANF AD connection DNS fields, ANF sends DDNS update packets to the forwarder, which discards them silently. The volume hostname never appears in DNS. Clients cannot mount by name. No error is returned in the ANF portal.

The correct design: external DC IPs in the ANF AD connection, external private DNS forwarder as the VNET DNS server for clients only.

Role of 168.63.129.16

168.63.129.16 is the Azure-provided internal resolver. It should be configured as the upstream forwarder target on the external private DNS forwarder for all queries not covered by AD or other conditional forwarders. This allows Azure-hosted DNS zones (such as any privatelink.* zones linked to your VNET) to resolve correctly through the forwarder.

IMPORTANT:
168.63.129.16 must never be placed in the ANF AD connection DNS fields. It is not AD-aware, cannot answer SRV queries for your domain, cannot accept DDNS updates, and is unreachable from on-premises over ExpressRoute or VPN. Its correct position is as an upstream target on the external private DNS forwarder — not in ANF's AD connection. This is not stated anywhere in the ANF documentation set.

DNS forwarder pattern comparison

Pattern	DDNS Support	Reverse DNS	Ops overhead	Best Fit
AD DNS on DCs + upstream 168.63.129.16	Yes, Native	Yes, if reverse zones on DC's	Medium	Default; simplest topology
External private DNS forwarder (VMs only) + external DC IPs in ANF AD connection	Yes (ANF bypasses forwarder)	Yes, if reverse ruleset on forwarder.	Medium	Enterprise with existing DNS infrastructure
External DNS forwarder in ANF AD connection (incorrect)	No — DDNS silently dropped	N/A — config is wrong	High	Not supported
Azure DNS Private Resolver in ANF AD connection (incorrect)	No — DDNS not accepted	N/A — config is wrong	High	Not supported

Azure Virtual WAN considerations

When ANF is deployed in a spoke VNET connected to an Azure Virtual WAN hub, a routing requirement applies that directly causes Kerberos and LDAP failures — which appear to be DNS or AD failures — when not addressed. This is one of the most common misdiagnoses in ANF deployments using Virtual WAN with NVA inspection.

ANF subnet prefix must be in Routing Intent additional prefixes

Azure Virtual WAN with Routing Intent routes private traffic through an NVA or Azure Firewall in the hub. For return traffic from external AD domain controllers to reach the ANF data plane IP, the hub must have an explicit routing entry for the ANF delegated subnet prefix. If the ANF delegated subnet is a /26 inside a larger VNET (/21 or /16), the broader VNET prefix alone is not sufficient — the specific /26 must be added explicitly.

Action Required:
In Azure Virtual WAN: Hub -> Routing -> Routing Intent -> Private Traffic -> Additional Prefixes. Add the ANF delegated subnet prefix (for example, 10.x.x.0/26) explicitly. Without this, Kerberos and LDAP reply traffic from external domain controllers is dropped before reaching the ANF data plane. The symptom is TCP port 88 connects succeeding followed by KRB5_KDC_UNREACH — which looks like a Kerberos or DNS problem but is a routing problem.

Use availability zone switching to surface detailed error messages

When ANF volume creation fails with the generic 'context deadline exceeded' error from the XMLrequest_filer endpoint, the error does not identify root cause. Redeploying the volume to a different availability zone (AZ1 to AZ2, or AZ2 to AZ3) forces a different backend assignment and consistently produces a more descriptive error that distinguishes routing failures (KRB5_KDC_UNREACH) from Kerberos authentication failures, DNS lookup failures, and LDAP errors.

TIP:
If the detailed error shows 'Successfully connected to ip <DC-IP>, port 88 using TCP' followed by 'Cannot contact any KDC for requested realm', the outbound path works but reply packets are dropped — this is a routing problem, not a DNS or Kerberos problem. Check vWAN Routing Intent, NVA firewall rules, and UDRs for the ANF subnet prefix. This diagnostic technique is not documented in ANF troubleshooting guidance.

Required DNS records

The following records must exist in the AD DNS zone served by the external DC DNS servers and must be resolvable from the ANF delegated subnet. Records marked * are not created automatically by AD DNS and must be added manually.

Forward lookup zone

Record	Type	Notes
_ldap._tcp.dc._msdcs.<domain>	SRV	Domain-wide DC discovery
_kerberos._tcp.dc._msdcs.<domain>	SRV	Domain-wide KDC discovery
_ldap._tcp.<site>._sites.dc._msdcs.<domain>	SRV	Site-scoped — preferred when AD site is specified in ANF
_kerberos._tcp.<site>._sites.dc._msdcs.<domain>	SRV	Site-scoped Kerberos
_kerberos-master._tcp.<domain> *	SRV	NOT auto-created by AD DNS — must be added manually
_kerberos-master._udp.<domain> *	SRV	NOT auto-created by AD DNS — must be added manually
<dc-hostname>	A	Forward A record for each external DC in the AD site
<anf-smb-hostname>	A	Registered by ANF via DDNS — must not be scavenged or blocked

Reverse lookup zone (<reverse-octets>.in-addr.arpa)

Record	Type	Notes
PTR for each external DC IP	PTR	Required for dual-protocol, NFSv4.1 Kerberos, LDAP-over-TLS, and NTFS ACL operations
PTR for each ANF volume IP *	PTR	Required for NFSv4.1 Kerberos reverse-lookup clients — create manually or via DDNS if supported

How ANF internally fails when reverse DNS is missing

When the reverse DNS ruleset is absent from the external private DNS forwarder, the failure does not surface as a DNS error in the ANF portal. Instead, it propagates through ANF's internal security daemon (secd) and presents as a generic InternalServerError or a Kerberos authentication failure. Understanding the internal failure chain explains why reverse DNS is non-negotiable and why the symptom is so misleading.

The secd service list mechanism

ANF uses an internal process called secd (Security Daemon) to manage all Active Directory communication — Kerberos ticket exchange, LDAP binds, and DC discovery. secd maintains a service list of discovered DCs. When a DC communication attempt fails for any reason, secd marks that DC as UNUSABLE and records a forgive time after which it will retry. If all DCs in the service list are simultaneously marked UNUSABLE, secd returns RESULT_ERROR_SECD_NO_SERVER_AVAILABLE, which propagates to the portal as InternalServerError.

The reverse PTR lookup inside secd

A critical and undocumented behavior: before secd completes a SASL/GSSAPI bind to an LDAP server, it performs a reverse PTR lookup of the DC's IP address. This lookup is used to validate the DC identity as part of Kerberos mutual authentication. If the PTR lookup fails — because the external private DNS forwarder has no reverse ruleset for the DC IP range — secd logs the failure and marks that DC UNUSABLE immediately, even though TCP connectivity on ports 88 and 389 succeeded.

The following is the exact failure sequence from ANF backend logs when reverse DNS is absent:

Stage 1 — TCP connects succeed, PTR lookup fails

Successfully connected to ip 10.x.x.60, port 389 using TCP
Entry for host-address: 10.x.x.60 not found in the current source: FILES
Source: DNS unavailable. Entry for host-address: 10.x.x.60 not found in any of the available sources

secd successfully opens the TCP connection to the DC on port 389 (LDAP), then immediately attempts a reverse lookup of that IP. The forwarder has no reverse ruleset, so DNS returns NXDOMAIN. secd logs 'DNS unavailable' and proceeds to mark the DC UNUSABLE.

Stage 2 — GSSAPI bind fails as a consequence

Unable to SASL bind to LDAP server using GSSAPI: Local error
Unable to connect to LDAP (Active Directory) service on dc01.ad.contoso.com (Error: Local error)

Because the PTR lookup failed, secd cannot complete the GSSAPI mutual authentication context. The 'Local error' is not a Kerberos configuration problem — it is the direct result of the identity validation step failing due to missing reverse DNS.

Stage 3 — All DCs marked UNUSABLE

10.x.x.27 UNUSABLE  Wed Apr  8 00:19:24 2026
10.x.x.28 UNUSABLE  Wed Apr  8 00:19:24 2026
10.x.x.29 UNUSABLE  Wed Apr  8 00:19:25 2026
10.x.x.30 UNUSABLE  Wed Apr  8 00:19:25 2026 
... (all DCs in the service list)

secd cycles through every DC in the discovered service list. Each DC fails the same PTR lookup. Each is marked UNUSABLE. Once the list is exhausted:

Stage 4 — Service list exhausted, error propagates

No servers in the service list which aren't marked bad
Unable to select any server in the current serviceList
RESULT_ERROR_SECD_NO_SERVER_AVAILABLE:6940
No servers available for MS_LDAP_AD, domain: ad.contoso.com

This internal error code propagates up through the ANF volume creation stack and is presented to the operator as the generic 'context deadline exceeded (Client.Timeout exceeded while awaiting headers)' error in the portal. The actual cause — missing PTR records — is completely obscured.

Stage 5 — Kerberos pre-auth error (secondary, misleading)

Received error from KDC: -1765328359 / Additional pre-authentication required

This Kerberos error code (KRB5KDC_ERR_PREAUTH_REQUIRED) appears in logs and can mislead investigation toward Kerberos configuration, encryption type mismatches, or clock skew. It is a downstream consequence of the failed PTR-based GSSAPI context — not a root cause. Chasing this error without first verifying reverse DNS is a common and time-consuming dead end.

KEY INSIGHT:
The complete failure chain is: missing reverse PTR ruleset on DNS forwarder → secd PTR lookup returns DNS unavailable → GSSAPI mutual auth cannot complete → DC marked UNUSABLE → all DCs exhausted → InternalServerError at portal. TCP connectivity on ports 88 and 389 succeeds at every stage. Only the PTR lookup fails. This is why all standard connectivity tests pass and the issue remains invisible until reverse DNS is specifically tested.

Why this failure is invisible to standard troubleshooting ?

Standard ANF DNS troubleshooting checks forward SRV records, forward A record resolution, and TCP port connectivity to DCs. All of these pass when only the reverse ruleset is missing. The secd PTR lookup is an internal step that occurs after TCP connectivity is confirmed and is not tested by any of the standard nslookup or Test-NetConnection commands used during initial validation. The only reliable way to surface this failure without access to backend logs is to explicitly test reverse PTR resolution from the ANF VNET — as documented in the verification section below.

Verify DNS configuration

Run the following commands from a test VM in the same VNET as the ANF delegated subnet. Use the external private DNS forwarder IP for client-side tests and an external DC IP for ANF-side tests.

Forward SRV lookup — site-scoped

 nslookup -type=SRV _ldap._tcp.<SITE>._sites.dc._msdcs.<domain> <forwarder-IP>
 nslookup -type=SRV _kerberos._tcp.<SITE>._sites.dc._msdcs.<domain> <forwarder-IP>

Forward SRV lookup — domain-wide

 nslookup -type=SRV _ldap._tcp.dc._msdcs.<domain> <forwarder-IP>
 nslookup -type=SRV _kerberos._tcp.dc._msdcs.<domain> <forwarder-IP>

Reverse PTR lookup — use the external forwarder IP

 nslookup <external-DC-IP> <forwarder-IP>

Expected output:

 Server:  <forwarder-IP>
 <reverse-arpa>  name = dc01.ad.contoso.com.

If reverse lookup returns NXDOMAIN or times out while forward lookup succeeds, add the reverse DNS ruleset to the external private DNS forwarder. This is the most common cause of NTFS permission failures after a volume is successfully created.

Port connectivity from ANF VNET

Test-NetConnection -ComputerName <external-DC-IP> -Port 88    # Kerberos
Test-NetConnection -ComputerName <external-DC-IP> -Port 389   # LDAP
Test-NetConnection -ComputerName <forwarder-IP>  -Port 53     # DNS

Common issues and resolutions

Symptom	Likely cause	Resolution
InternalServerError: context deadline exceeded (XMLrequest_filer)	Generic ANF backend timeout — routing or Kerberos root cause not visible at this level	Switch deployment availability zone for a detailed error. Check vWAN Routing Intent for ANF /26 prefix.
KRB5_KDC_UNREACH — TCP port 88 connects succeed but auth fails	Return traffic from external DCs dropped before reaching ANF NIC — routing issue, not DNS	Add ANF subnet /26 to vWAN Hub Routing Intent > Additional Prefixes > Private Traffic
'Cannot determine whether the computer is joined to a domain' — NTFS permissions	Reverse DNS (PTR) lookup failing on external forwarder for DC IPs	Add reverse lookup ruleset to external private DNS forwarder: <reverse-zone>.in-addr.arpa. > external DC IPs
DDNS fails — SMB hostname not in DNS after volume creation	ANF AD connection DNS IPs point to external forwarder — GSS-TSIG not supported by forwarder	Set AD connection DNS IPs to writable Microsoft AD-integrated external DC IPs directly
'Failed to validate LDAP configuration' during dual-protocol creation	Missing PTR records for external DCs, or reverse zone unreachable	Add PTR records for all external DCs. Verify reverse ruleset is present on the forwarder.
NFSv4.1 Kerberos: 'Cannot determine realm for numeric host address'	Missing PTR for ANF volume IP or external DC IPs	Add PTR records for ANF volume IPs and all external DC IPs in the reverse zone.
SMB hostname resolves on-premises but not from Azure VMs	External private DNS forwarder missing forward ruleset for AD zone, or targeting wrong DC IPs	Verify forward ruleset is present and targeting reachable writable external DC IPs.
Volume creation fails after external DC IP change	External DNS forwarder (especially BIND) caching stale DC IPs — default TTL up to 7 days	Flush forwarder cache. Set short TTLs on DC A records. Consider Microsoft AD-integrated DNS for AD zones.

Summary of key requirements:

ANF AD connection DNS IPs must point to writable Microsoft AD-integrated DNS servers (external DC IPs) — not the external private DNS forwarder, not Azure DNS Private Resolver, not 168.63.129.16.
The external private DNS forwarder must have both a forward ruleset (AD domain > external DC IPs) and a reverse ruleset (in-addr.arpa. zone for DC IP ranges > external DC IPs). The reverse ruleset is required for NTFS ACL operations on SMB shares and is not mentioned in ANF documentation.
168.63.129.16 is the upstream forwarder target on the external DNS forwarder — not a target in the ANF AD connection. It is unreachable from on-premises and is not AD-aware.
External private DNS forwarders (Infoblox, BIND, Unbound) do not support GSS-TSIG. Placing a forwarder IP in the ANF AD connection causes silent DDNS failure with no portal error.
In Virtual WAN deployments, add the ANF delegated subnet /26 to the hub Routing Intent under Additional Prefixes for Private Traffic. The broader VNET prefix alone is not sufficient.
NFSv3 volume creation success does not indicate SMB readiness — NFSv3 uses the IP directly and bypasses AD, Kerberos, and reverse DNS.
_kerberos-master SRV records are not created automatically by AD DNS and must be added manually.
DNS scavenging should be disabled on zones containing ANF records, or records pre-created as static entries, as ANF does not aggressively refresh DDNS registrations.
When volume creation fails with a generic 'context deadline exceeded' error, switch the deployment availability zone before deep troubleshooting to surface a more descriptive error.

Related documentation:

Governing Agent Sprawl: A Multi‑Region AI Agent Landing Zone on Azure (Reference Architecture)

KimVaddi — Thu, 07 May 2026 05:19:02 GMT

It doesn’t take long for AI agents to get out of hand.

In most enterprises, the first few agents are celebrated. A chatbot here. A document summarizer there. Then another team ships an agent that calls APIs. Someone else connects one to internal data. Within months, IT is staring at dozens—or hundreds—of autonomous systems running across subscriptions, regions, and tools.

At that point, the questions stop being about model quality and start being uncomfortable operational ones:

Who owns this agent?
What data can it access?
What happens if it misbehaves?
Why did it just consume half our monthly token budget in a day?

Developers can build an AI agent in minutes—the difficult part is understanding what agents are doing, how they perform, and whether they comply with organizational policy. Signals scatter across tools, context is lost, and governance becomes reactive.

This reference architecture exists to solve that problem.

It describes a multi‑region AI agent landing zone on Azure that treats agents as first‑class, governable workloads—provisioned automatically, constrained by policy, and observable from day one.

The architectural principle: separate control from execution

The design starts with a simple but non‑negotiable rule:

Control plane concerns must be separated from runtime concerns.

Azure landing zones already follow this model. Management groups, Azure Policy, and RBAC are global constructs. Workloads run in regions. This architecture applies the same discipline to AI agents.

The runtime plane is where agents execute, models infer, and data flows—often in multiple Azure regions.
The control plane is where identity, policy, safety, evaluation, and oversight live—independent of region.

This separation is what allows teams to scale agents without losing control.

Layer 1: Azure AI Gateway — governing every request

The first control layer sits directly in the request path.

The AI gateway in Azure API Management provides a policy‑enforcement and observability layer in front of AI models, agents, and tools. It is not a separate service—it extends Azure API Management.

Everything flows through it:

Microsoft Foundry model deployments
Azure AI Model Inference API endpoints
OpenAI‑compatible third‑party models
Self‑hosted models
MCP servers and A2A agent APIs (preview)

What the gateway actually enforces

This layer is intentionally narrow and operational:

Token quotas and rate limits
The llm-token-limit policy (GA) enforces tokens‑per‑minute or quota ceilings per consumer before requests reach the backend. This prevents one application—or one agent—from exhausting shared capacity.
Content safety at ingress
The llm-content-safety policy (GA) integrates Azure AI Content Safety to moderate prompts automatically. Unsafe requests never reach the model.
Traffic routing and resiliency
Azure API Management supports multi‑region gateway deployment (Premium tier). If a region fails, traffic routes to the next closest gateway automatically.
Token usage, prompts, and completions are logged to Azure Monitor and Application Insights using built‑in policies such as llm-emit-token-metric.

The gateway does not understand agent intent or business context. That is by design. It governs traffic, not behavior.

Layer 2: Azure AI Foundry Control Plane — governing behavior at scale

The second layer governs what agents do, not just how requests flow.

Azure AI Foundry Control Plane provides a unified management surface for AI agents, models, and tools across projects and subscriptions. It is designed specifically for agentic systems.

Foundry Control Plane is currently in public preview.

What Foundry Control Plane adds

Fleet‑wide inventory
Every agent, model, and tool appears in a single, searchable view across projects.
Continuous evaluation on production traffic
Foundry runs evaluations that measure task adherence, groundedness, tool‑call accuracy, sensitive data exposure, and other agent‑specific risk dimensions.
Centralized guardrails
Policy is enforced across inputs, outputs, and tool interactions—not just prompts. Bulk remediation can be applied across the fleet.
Security integration
Foundry integrates with:
- Microsoft Entra for agent identity (Entra Agent ID)
- Microsoft Defender for threat signals
- Microsoft Purview for data protection and compliance visibility

Foundry Control Plane also requires an AI Gateway to be configured for advanced governance scenarios—reinforcing the layered approach.

Layer 3: Microsoft Agent 365 — enterprise oversight, not just Azure oversight

The third layer exists because Azure governance alone is not enough.

Agents don’t just call APIs. They act on behalf of users. They access enterprise data. They operate inside Microsoft 365 workflows.

Microsoft Agent 365 is the tenant‑level control plane for AI agents. It brings agents under the same administrative model used for users and applications.

Status: Frontier Preview
General availability: May 1, 2026

Why this layer matters

Agent 365 introduces controls that Azure alone cannot provide:

Agent registry
A single inventory of all agents in the tenant—including sanctioned and shadow agents. Unsanctioned agents can be quarantined.
Identity‑first access control
Every agent is issued an Entra agent ID. Conditional Access policies apply to agents the same way they do to users.
Human‑in‑the‑loop oversight
Agents surface in Microsoft 365 admin workflows, not just Azure portals.
Security and compliance
Defender and Purview extend threat detection and data protection policies to agent activity.

Agent 365 does not replace Foundry Control Plane. It complements it—connecting agent operations to enterprise identity, compliance, and productivity systems.

How the pieces work together

Individually, these services are powerful. The architecture works because they are deliberately layered.

External approval → automated provisioning

When a use case is approved in an external governance system, it triggers an Azure DevOps pipeline using the REST API.

That pipeline:

Provisions subscriptions and resource groups
Deploys Foundry projects
Configures Azure API Management with AI Gateway policies
Enables monitoring and logging

Governance is applied before the first request is made.

One policy model, many regions

Azure landing zones are region‑agnostic at the governance layer. This architecture follows that guidance.

Policies and RBAC apply globally
AI Gateway enforces limits locally in each region
Runtime services scale region by region

Expanding to a new region does not introduce a new governance model—only new capacity.

A single operational view

Signals flow upward:

AI Gateway emits traffic and usage metrics
Foundry Control Plane correlates evaluations, guardrail enforcement, and security alerts
Agent 365 aggregates tenant‑level identity, compliance, and threat signals

Operations teams no longer hunt across dashboards. They work from one prioritized view, with context intact.

What this architecture deliberately does not promise

This is a reference architecture, not a silver bullet.

It does not eliminate the need for:

Clear agent ownership
Business‑level approval processes
Ongoing evaluation of agent usefulness

What it does provide is a foundation—one that lets organizations scale agentic AI without accepting chaos as the cost of innovation.

Closing thoughts

Agent sprawl is not a tooling failure. It’s an architectural one.

By separating control from execution, layering governance where it belongs, and aligning AI operations with existing Azure and Microsoft 365 control planes, this architecture gives enterprises a way to move fast without losing sight of what their agents are doing.

That’s the difference between experimentation—and production.

Co-Contributor:

Jorge Pena Alarcon-Sr. Cloud & AI Specialist

References (official Microsoft sources)

Architecture to Resilience: A Decision Guide

varghesejoji — Mon, 04 May 2026 21:24:03 GMT

Start with the framework, accelerate with the tool

Watch the video walkthrough

The Application Resilience Framework originated from a practical gap we saw in resilience reviews: teams had architecture diagrams, monitoring data, incident history, and runbooks, but no consistent way to connect them into a measurable resilience model.

The framework is intended to close that gap by turning architecture context into a structured lifecycle for risk identification, mitigation validation, health modeling, and governance. It aligns closely with the Reliability pillar of the Azure Well-Architected Framework, especially the guidance around identifying critical flows, performing Failure Mode Analysis, defining reliability targets, and building health models.

Application Resilience Framework flow from artifact import to measurable operational resilience.

The Application Resilience Framework Tool helps teams apply this framework faster by starting with artifacts they already have, such as data flow diagrams or sequence diagrams in Mermaid or image format. The tool extracts workflows, application components, platform components, dependencies, and initial failure modes, then guides the team through the decisions needed to make resilience measurable.

From those artifacts, the tool creates the first version of a resilience model by extracting workflows, application components, platform components, dependencies, and initial failure modes. It then guides the team through one import step followed by four phases:

Import Artifacts -> Phase 1: Failure Mode Analysis -> Phase 2: Mitigation and Validation -> Phase 3: Health Model Mapping -> Phase 4: Operations and Governance

It is not a replacement for WAF guidance or Resilience Hub style assessments. It is a practical way to operationalize those concepts at the workload and workflow level, producing prioritized risks, mitigation plans, validation paths, health signals, dashboards, reports, and governance ownership.

How to use this guide

This guide follows the same flow as the tool. For each step, it covers:

The decision: What needs to be decided?
The options: What paths are available?
The guidance: When each option fits

Use this with the video walkthrough. The video shows the tool in action. This guide explains the choices behind each step.

Question 1: What artifact should you import first?

The import step creates the starting point for the model. Regardless of the input path, the output is the same: workflows that move into Phase 1: Failure Mode Analysis.

Options

Import option	Best for	What happens
Data flow diagram	System, module, data movement, and dependency views	If imported as an image, the tool breaks it into sequence-style flows. Selected flows become workflows.
Sequence diagram	Transaction flow and service interaction views	Converted directly into workflows.
Mermaid input	Diagrams maintained as code in Mermaid format	Converted directly into workflows.
Image input	JPG or PNG diagrams	Azure Foundry Vision models interpret the image and convert it into workflows.
Manual entry	Missing or incomplete diagrams	User creates or corrects workflows manually.

When to pick which

Use data flow for system and dependency views. Use sequence diagrams for transaction or interaction views. Regardless of import path, the output is the same: workflows, components, dependencies, and initial failure modes ready for Phase 1.

Question 2: Which workflows should be analyzed first?

Phase 1 is Failure Mode Analysis. This is where the tool identifies what can fail and how important each failure is.

Options

Critical user flows: Login, checkout, payment, onboarding, request processing.
High-risk platform flows: Database writes, queue processing, storage access, identity, messaging, external APIs.
Known issue areas: Workflows with recent incidents, recurring alerts, or customer impact.

When to pick which

Start where failure creates the highest customer or business impact. The goal is not to model everything at once. The goal is to model the right thing first.

Deliverables

Failure Mode Analysis catalog
RPV risk scores
Criticality classification

Question 3: How should failure modes be prioritized?

After workflows and components are imported, the tool helps score each failure mode using Risk Priority Value or RPV, which uses the four factors of Impact, Likelihood, Detectability and Outage severity.

Options

Use generated failure modes and scores: Best for a fast first pass.
Tune the RPV scores with engineering input: Best when workload context matters.
Add custom failure modes: Best when known risks come from incidents, reviews, or customer experience.

When to pick which

Use the generated model to accelerate the first pass, then adjust it with real system knowledge. The goal is not to create the longest list of risks. The goal is to identify the risks that deserve attention first.

Deliverables

Failure Mode Catalog
RPV Risk Scores
Prioritized criticality list

Question 4: Are mitigations defined or validated?

Phase 2 is Mitigation and Validation. This is where each failure mode gets a response plan.

Options

Detection only: The team can detect the failure, but the response is not defined.
Defined mitigation: The response is documented, such as retry, fallback, failover, scaling, restore, or rebalance.
Validated mitigation: The response has been tested through a controlled validation or chaos test.

When to pick which

For low-risk items, documented mitigation may be enough. For critical and high-risk items, validation is the key. A mitigation that has not been tested is still an assumption.

Deliverables

Mitigation playbooks
Chaos test plans
Support playbooks

Question 5: Which risks need health signals?

Phase 3 is Health Model Mapping. This is where the tool connects risks to observability.

A failure mode should not just sit in a document. It should map to a signal that can show whether the system is healthy, degraded, or unhealthy.

Options

Map all failure modes: Best for small systems or highly critical workloads.
Map critical and high-risk failure modes first: Best for large systems.
Track unmapped risks as gaps: Best when observability coverage is still improving.

When to pick which

Start with the highest RPV items. Every critical failure mode should have at least one signal, such as a metric, log, alert, availability check, or dependency signal.

Deliverables

Health model
Signal definitions
Coverage report
Bicep templates

Question 6: Should the health model be exported or deployed?

Once the health model is built, the next decision is how to use it.

Options

Export for review: Best when the team needs to validate the model first.
Generate monitoring templates: Best when the team wants repeatable implementation.
Deploy to Azure: Best when the model is ready to become part of operations.
Use outputs in downstream tools: Best when support, SRE, or incident response workflows need structured playbooks.

When to pick which

Export first if the model is still being reviewed. Deploy when component relationships, signals, and coverage are accurate enough for operational use.

Question 7: How will governance keep the model current?

Phase 4 is Operations and Governance. This is where the resilience model becomes an ongoing practice.

Options

One-time assessment: Useful for quick discovery but limited long term.
Recurring review: Best for production workloads that change regularly.
Closed-loop governance: Best when incidents, failed validations, and monitoring gaps feed back into the model.

When to pick which

For production systems, use a recurring governance cadence. Assign owners, track gaps, review dashboards, and update the model as the system changes.

Deliverables

Governance model
Dashboards
Reports and exports
Runbooks

Putting it together: three adoption patterns

Once governance is defined, the tool can be used in different ways depending on the team’s maturity and objective. The three common adoption patterns are:

Pattern A: Quick resilience review

Import one critical workflow
Generate failure modes
Review RPV scores
Identify top risks
Export findings

Best for fast architecture reviews or early customer conversations.

Pattern B: Full workload assessment

Import multiple workflows
Build a full Failure Mode Catalog
Define mitigations and recovery steps
Create chaos test plans
Map risks to signals
Produce coverage reports

Best for structured resilience assessments.

Pattern C: Operational health model

Build and tune the health model
Export or deploy monitoring artifacts
Track risk and signal coverage
Review mitigation effectiveness
Assign governance ownership
Feed findings back into the model

Best when the goal is continuous operational improvement.

A short checklist before using the tool

Which workflow should we import first?
Do we have a data flow diagram, sequence diagram, or Mermaid file?
What components and dependencies should be included?
Which failure modes matter most?
How should RPV be adjusted for this workload?
Do critical failure modes have mitigations?
Have those mitigations been validated?
Are failure modes mapped to health signals?
What coverage gaps remain?
Should the health model be exported or deployed?
Who owns ongoing review?
How often should the model be updated?

Closing thought

The Application Resilience Framework Tool provides a practical way to move from architecture artifacts to measurable, continuously improving resilience.

It starts with data flow or sequence diagrams, builds a structured view of the system, and guides teams through the decisions that matter: what can fail, how severe it is, how it is mitigated, how it is detected, and how it is governed.

Tool repo: Application Resilience Framework Tool

How MS Discovery Is Empowering Scientists to Do More

sameeraman — Sun, 03 May 2026 12:05:56 GMT

Research and development has traditionally been a slow, sequential, and largely manual endeavour. Scientists formulate hypotheses, design experiments, run computations in constrained environments, and document results, each stage dependent on the last, each transition requiring human review and intervention. Knowledge is fragmented across systems, insights are bottlenecked by individual capacity, and the gap between hypothesis and actionable outcome can span weeks or months.

For organisations tackling complex scientific and operational challenges, from drug discovery to industrial process optimisation, this pace of iteration is simply no longer acceptable.

At Microsoft, we recently introduced Microsoft Discovery, a platform that I believe fundamentally changes this model. Much like Microsoft 365 transformed the way knowledge workers collaborate and create, Microsoft Discovery is designed to simplify and empower the way scientists and researchers work. It provides a unified, end-to-end platform that integrates advanced artificial intelligence, high-performance computing, and knowledge management to support the full scientific reasoning lifecycle: knowledge gathering, hypothesis generation, experiment design, simulation, results analysis, and documentation.

In this article, I want to share how we used Microsoft Discovery to automate a real-world simulation workflow for a mining organisation and what that experience taught our team about the future of AI-augmented science.

What Is Microsoft Discovery?

Microsoft Discovery is Microsoft's scientific AI platform, a solution designed to accelerate research and experimentation across the full innovation lifecycle. Rather than replacing scientific judgement, Discovery is designed to amplify human expertise, embedding AI assistance at each stage of the R&D process while maintaining governance, traceability, and scientific rigour.

From Traditional R&D to AI-Augmented Science

To appreciate what Discovery enables, it is important to understand where it fits in.

In the traditional R&D model, knowledge discovery centres on manual literature reviews and historical data analysis. Researchers individually search, read, and synthesise information which is a time-intensive process where discovery is limited by each person's capacity to locate and interpret relevant material. Hypothesis generation and experimental design are expert-led and largely manual. Computational experimentation, where it exists, runs in fixed or constrained environments with limited parallelism. Analysis and iteration follow the same sequential pattern: execute, review, document, repeat.

Microsoft Discovery changes this fundamentally. In the AI-cloud-enabled model it provides:

Knowledge synthesis at scale — Researchers can explore literature, historical experiments, and organisational knowledge through a single interface, with intelligent indexing surfacing insights faster than manual search could ever achieve.
AI-assisted hypothesis generation — Collaborative human-and-AI workflows support hypothesis exploration and feasibility assessment, while final decisions remain with the scientist.
Cloud-scale experimentation — Elastic compute and parallel processing allow simulations and experiments to run at scale, with integrated tracking and reproducibility built in.
Continuous feedback and human-in-the-loop governance — Results are analysed and compared more rapidly, enabling faster iteration, with AI-generated insights reviewed and validated by researchers before action.
Governed knowledge assets — Experiment lineage, outcomes, and best practices are captured as reusable, governed assets, supporting long-term organisational learning.

The net effect is a transition from slow, manual, and fragmented research processes to an agile, automated, and data-driven R&D model — one that improves research efficiency, increases the return on innovation investment, and enables faster, higher-impact solutions to complex challenges. In high level, the research and deveolopment loop we discussed and how Microsoft Discovery enriches it show in the following diagram.

The Real-World Problem: Screening Thousands of Molecules

To bring this to life, let me walk you through a real-world use case we worked on recently. A mining organisation needed to identify the best-performing oxidant compounds for a chemical reaction central to their operations. We will be talking about only a workflow that sits squarely in the simulation phase of the scientific loop — and it is a perfect example of the kind of work that Microsoft Discovery can strongly transform.

How Scientists Did It Before

In the traditional process, scientists would begin by selecting candidate molecules from established molecular libraries based on characteristics identified through literature review. These libraries can contain thousands of molecules, each defined in standard molecular file formats (such as XYZ or CIF files) that describe their three-dimensional atomic structures.

From there, a researcher would manually work through a multi-step pipeline:

Pre-processing and preparation: The selected molecular files are processed and prepared for quantum mechanical (QM) calculations. This involves filtering molecules based on properties like the types of metals present, electron count, and atomic weight — criteria that directly affect both the scientific relevance and the computational cost of the simulations. The output is a set of prepared input files (known as GJF files) ready for simulation.
Running quantum mechanical simulations: The prepared input files are submitted to a computational chemistry tool (Gaussian 16) to perform Density Functional Theory (DFT) calculations. These simulations compute the electronic structure and energy states of each molecule across different charge and multiplicity configurations. Crucially, each molecule requires multiple independent simulation runs, and the computational cost scales rapidly with molecular complexity. With thousands of candidate molecules, this step alone can involve thousands of individual simulation jobs.
Collecting and post-processing results: Once all simulations complete, the output log files are collected and processed. For each molecule, the lowest-energy charge and multiplicity combination is identified, and a set of quantum mechanical descriptors and classical molecular descriptors are extracted. These descriptors are then fed into a trained machine learning model to predict the redox potential of each compound, a key metric that indicates how effectively a molecule can act as an oxidant in the target reaction.
Summarisation and filtering: Finally, the predicted redox potentials and other relevant characteristics are compiled into a summary, enabling researchers to identify the most promising candidates for further investigation and experimental validation.

Every step in this pipeline required manual intervention: writing and adjusting scripts, verifying input and output files, monitoring job queues, handling failures, and stitching results together. A single researcher could easily spend days or weeks moving through this process — and any error at one stage meant going back and re-running subsequent steps.

How We Automated This with Microsoft Discovery Agents

When we looked at this workflow through the lens of Microsoft Discovery, the opportunity was clear. The scientific reasoning, selecting which molecules to test, interpreting redox potential results, deciding what to investigate next, should remain with the researcher. But the operational overhead of preparing files, submitting simulations, monitoring jobs, collecting results, and assembling summaries? That could be orchestrated by a team of AI agents.

A Team of Agents, Working Together

We designed a multi-agent architecture within Microsoft Discovery to automate this simulation workflow end to end. Here is how the team of agents operates:

Router Agent: The entry point. When a researcher submits a request for example, asking to run QM calculations on a set of candidate molecules the Router Agent interprets the intent and orchestrates the downstream workflow.

Planner Agent: Once the Router Agent identifies the task, the Planner Agent examines the input files provided by the researcher and formulates a step-by-step execution plan. It determines what needs to happen, in what order, and with what parameters, much like a project manager scoping out a piece of work.

Gaussian Prep Agent: This agent handles the preparation step. It is intelligent enough to inspect the current molecular files, apply the necessary filtering criteria, and prepare them for simulation, generating the input files that the computational chemistry tool requires. What previously involved manual scripting and file-by-file verification is now handled autonomously. We used Microsoft Discovery tools to do the underlying execution with this agent.

MPI Gaussian Agent: This is where the power of cloud-scale computing comes in. The Gaussian Agent submits the prepared simulation jobs and manages their execution using an MPI-based master-worker pattern. This approach enables massive parallel execution scaling out across the cloud to run thousands of simulations concurrently rather than sequentially. Given that the candidate molecule libraries can contain thousands of entries, and each molecule may require multiple simulation runs, this parallel execution capability is transformative. What might have taken days in a constrained local environment can now complete in a fraction of the time.

Redox Potential Agent: Once the simulations are complete, this agent takes over. It processes the simulation outputs, identifies the optimal charge and multiplicity state for each molecule, extracts the relevant QM and classical descriptors, and runs them through the trained machine learning model to predict redox potentials.

Summariser Agent: The final agent in the chain. It maps the predicted redox potentials back to the original molecules, applies any additional filtering criteria, and produces a clean, structured summary a JSON file that the researcher can immediately use to identify the most promising candidates and take them forward into the next phase of their work.

What the Researcher Experiences

From the scientist's perspective, the transformation is striking. Instead of spending days writing scripts, babysitting job queues, and manually stitching results together, they provide their input files and describe what they need. The agents take it from there planning, preparing, executing, processing, and summarising and deliver a curated output ready for scientific interpretation.

The researcher's time is freed to focus on what matters most: thinking critically about the science. Which molecules look most promising? What does the redox potential distribution tell us? Should we adjust the filtering criteria and run another round? These are the high-value questions that require human expertise and now scientists can spend their time on exactly that, rather than on operational mechanics.

The Bigger Picture: Accelerating the Entire Scientific Loop

It is important to note that this simulation workflow is just one piece of the broader scientific loop. The full cycle of scientific research, from initial knowledge gathering and literature review, through hypothesis generation, experimental design, simulation, results analysis, and documentation involves many stages, each of which can benefit from the same kind of AI-augmented approach.

Microsoft Discovery is designed to support this entire cycle. In our project, we did not stop at simulation. We also explored how agents can accelerate the knowledge gathering phase, helping researchers navigate vast bodies of literature and surface relevant prior work more efficiently. We looked at how AI can assist with hypothesis generation and evaluation, helping scientists reason about which directions are most promising before committing to expensive computations. And we examined how agents can support the analysis and reporting phases comparing results against hypotheses, generating visualisations, and even assisting with drafting research documents.

What excites me most about Microsoft Discovery is not any single capability, but the cumulative effect of embedding AI assistance across every stage of the research process. Each phase that gets faster and more efficient creates a multiplier effect on the phases that follow. When knowledge gathering takes hours instead of weeks, researchers generate better hypotheses sooner. When simulations run at cloud scale in parallel, results arrive faster. When analysis is augmented by AI, iteration cycles tighten. The entire loop accelerates.

Conclusion

The way we approach scientific research is undergoing a fundamental shift. Large language models and the AI agents built from them are not replacing scientists, they are empowering them to work at a pace and scale that was previously unimaginable.

Microsoft Discovery represents a new operating model for R&D. By combining advanced AI, high-performance cloud computing, and intelligent workflow orchestration, it enables researchers to offload the repetitive, time-consuming operational work to agents and invest their expertise where it has the greatest impact: in asking better questions, interpreting complex results, and pushing the boundaries of what we know.

In the use case I have shared here, a team of six AI agents automated a simulation pipeline that would have taken a single researcher days of manual work. They prepared molecular input files, scaled out thousands of quantum mechanical simulations in parallel across the cloud, processed the results, predicted redox potentials using machine learning, and delivered a structured summary all with minimal human intervention.

This is just the beginning. As AI agents become more capable and the tools surrounding them more mature, the potential to accelerate discovery across every scientific domain is immense. Whether you are in materials science, pharmaceuticals, energy, agriculture, or any field where complex R&D is central to progress, Microsoft Discovery offers a platform to do more, faster, and with greater confidence.

The future of science is not about working harder. It is about working smarter with AI as your partner in discovery.

Running Diffusion Models at Scale on AKS

PrabalDeb — Thu, 30 Apr 2026 04:38:08 GMT

Diffusion workloads are simple at prototype scale and unforgiving in production. A single demo can run on one GPU-backed VM, but a real platform has to handle bursty demand, long-running jobs, model artifact distribution, secure public access, rollout safety, and hardware-level observability.

Azure Kubernetes Service (AKS) is a strong fit when the requirement is not just to run a model, but to operate a repeatable platform for GPU inference. The reusable pattern is straightforward: keep the API and control layer on CPU nodes, buffer work through a dispatch layer, run inference on isolated GPU capacity, push results to durable storage, and treat security, telemetry, and deployment automation as first-class platform features.

AKS reference architecture for diffusion workloads

The architecture above shows the core operating model. DNS and an edge layer: Application Gateway with WAF, and optionally Front Door for global entry: route traffic to an AKS CPU pool that hosts the API tier. GPU jobs run on a separate GPU pool, while shared add-ons and CSI drivers run on a system pool. Teams can keep dispatch inside Kubernetes or externalize it through Service Bus plus KEDA, and Azure dependencies should be reached over Private Link with Azure Monitor covering both app and hardware telemetry.

The storage block serves two purposes: durable output storage and, if needed, a shared Hugging Face model cache exposed to GPU pods through PV and PVC mounts.

The reference pattern

The core architecture separates control-plane traffic from GPU execution:

A lightweight API tier on the CPU node pool receives requests, validates identity, and hands execution work to a dispatch layer.
That dispatch layer can stay inside Kubernetes using native queueing and controller patterns, or it can publish work to Azure Service Bus for external queue-backed dispatch.
Scaling can likewise stay AKS-native through cluster and workload autoscaling, or it can use KEDA to react directly to queue backlog.
GPU work runs on a dedicated GPU node pool, isolated from the API and cluster add-ons.
GPU workers should mount persistent storage for model caches so Hugging Face assets can survive pod restarts and repeated job submissions.
Results are stored outside the pod lifecycle in blob-backed storage and returned through a stable status API.
Edge routing, TLS termination, and WAF inspection happen at the ingress layer, while token validation is typically enforced in the API tier or a dedicated upstream auth component.

This split lets each lane scale on the right signal: request traffic for the API tier, backlog for dispatch, and job demand for GPU workers. It also keeps tuning simpler for cost, latency, and reliability. For single-region deployments, Application Gateway or Application Gateway for Containers is often enough; Azure Front Door becomes more useful for global entry, multi-region failover, or shared edge policy.

In the reference architecture, the CPU pool hosts the externally reachable APIs and control-plane components that submit work into the dispatch layer. The GPU pool hosts the actual model execution components, including short-lived diffusion jobs and longer-lived worker runtimes. A separate system pool hosts shared cluster services such as AGIC and the Secret Store and Blob CSI drivers, while KEDA is added only when teams choose the Service Bus pattern. That keeps platform plumbing off the application and GPU lanes.

The persistence layer is most useful as a model cache rather than as general-purpose application state. There are two practical ways to back it:

Node-local persistence keeps the cache close to the GPU worker and is the simplest option when jobs benefit from warm data already present on the same node.
Azure Storage backed persistence is more useful when model download times are long enough that keeping artifacts on shared durable storage materially reduces job startup latency.

Choose the dispatch model

The important design decision is not whether to use one branded queueing technology. It is whether the platform needs a fully Kubernetes-native control loop or an explicit external queue with backlog-driven scaling.

The Kubernetes-native option keeps dispatch inside the cluster:

The API creates or signals internal Kubernetes work objects.
A Kubernetes-native queue or controller pattern manages admission and dispatch.
AKS workload autoscaling and cluster autoscaling handle most scale changes.
This path is simpler when the team wants fewer external dependencies and the workload shape is already well understood.

The Azure Service Bus plus KEDA option externalizes the control loop:

The API publishes work to Azure Service Bus.
Queue consumers or schedulers materialize GPU execution from that queue.
KEDA scales the scheduling or worker path directly from queue depth.
This path is better when backlog visibility, queue durability, or burst-driven autoscaling needs to be explicit and independently observable.

Both models can fit the same AKS platform. The GPU isolation, security boundaries, storage pattern, and observability expectations remain the same.

A simple way to choose is:

Start with Kubernetes-native dispatch when the team wants the fewest moving parts and the job profile is already predictable.
Choose Azure Service Bus plus KEDA when durable backlog, explicit queue depth, and burst-driven worker scaling are important operating requirements.
Consider KAITO or the AI toolchain operator add-on when the primary need is managed serving of supported models rather than custom diffusion job orchestration.

Scale by workload lane, not by one generic pool

Not every GPU workload should share the same execution path. Keep short-lived inference, queue-backed workers, and longer-running runtimes in separate lanes so one class does not block another. Where supported, GPU multi-instance configurations can further improve utilization for lighter jobs while leaving full GPUs available for heavier ones.

On AKS, the better pattern is to define separate operating lanes:

An API admission lane on CPU nodes for authentication, validation, and request submission.
A scheduling lane that can use either Kubernetes-native queueing with AKS autoscaling or Azure Service Bus with KEDA.
A GPU execution lane for diffusion jobs and longer-lived worker runtimes.
Dedicated labels, taints, autoscaling bounds, and dashboards per lane.

Within the GPU execution lane, teams can go one step further and define capacity classes for full-GPU and fractional-GPU jobs. That is useful when some models need the memory and throughput of a whole device, while others can run efficiently on a smaller GPU partition. The NVIDIA device plugin DaemonSet shown in the GPU pool is what advertises those GPU resources (and MIG slices) to the kube-scheduler so pods can request them like any other resource.

That gives platform teams clean capacity isolation and avoids letting one workload class starve another.

Secure the edge, the workload, and the secret path

GPU platforms should treat security as a day-one requirement, not a later add-on.

At the edge, use DNS with Application Gateway and WAF, and add Front Door when global routing is needed. Store public TLS certificates in Azure Key Vault and project them into the cluster through the Secrets Store CSI Driver so renewals do not require redeployment. For protected APIs, validate Microsoft Entra ID tokens in the service or a dedicated auth layer, and keep health probe endpoints separate from business routes.

Inside the cluster, keep authorization scoped tightly:

Separate system, CPU, and GPU node pools.
Use namespace boundaries for tenant or environment isolation.
Give the API only the Kubernetes RBAC it needs to create and monitor jobs, plus Azure permissions only when the external queue option is enabled.
Prefer Microsoft Entra Workload ID over long-lived credentials for workload access to Azure resources such as Key Vault and Blob Storage, and extend that to Service Bus when the external queue pattern is used.

For operator access, keep the cluster management path separate from the public request path. In the reference architecture, developers come in through Azure Bastion rather than broad direct exposure of cluster endpoints.

For secrets, move away from cluster-local secrets as early as possible. A production-ready path uses Azure Key Vault with the Secrets Store CSI Driver so credentials are not baked into images, manifests, or CI pipelines. If the platform uses Azure Service Bus, queue access should use managed identity as well. Blob-backed result storage should likewise use managed identity and CSI-based integration instead of embedding long-lived credentials into workloads.

For the network path between the cluster and its Azure dependencies, prefer Private Endpoints over public service endpoints. The diagram uses a single PE icon as shorthand for this pattern: in practice, teams usually create private endpoints per service and pair them with private DNS so ACR pulls, Key Vault reads, Blob I/O, and optional Service Bus traffic resolve to private IPs inside the VNet, which keeps platform traffic off the public internet and simplifies firewall and DNS policy.

Observe both the application and the hardware

GPU workloads need two telemetry views: application behavior and hardware behavior. The first tracks request IDs, job IDs, latency, and failures; the second tracks GPU utilization, memory pressure, and device-level performance. Together they show whether the problem is code, dispatch pressure, or hardware saturation.

On Azure, that split maps well to:

Structured application logs and OpenTelemetry exported to Application Insights.
Azure Monitor dashboards that include internal queue pressure or Service Bus backlog, plus AKS autoscale or KEDA scale activity depending on the chosen pattern.
NVIDIA DCGM exporter metrics scraped into Azure Managed Prometheus and visualized in Azure Managed Grafana.

This model is what turns raw GPU hosting into an operable platform. Without it, teams can see requests failing but not whether the root cause is code, dispatch saturation, scheduling, or hardware contention.

The diagram reflects that split clearly. Application Insights and Azure dashboards track service and dispatch behavior, while Prometheus, Grafana, the NVIDIA device plugin, and DCGM exporter track cluster and GPU health. That combination is what allows teams to correlate dispatch delay, AKS or KEDA scale-out, execution time, and failure rates with actual GPU utilization and memory pressure.

Keep CI/CD small, secretless, and reversible

The deployment model does not need to be complex to be production-grade. A practical AKS pattern is:

Pull request validation for code quality, tests, Dockerfiles, and secret scanning.
Immutable container tags built from the commit SHA.
GitHub Actions with OpenID Connect and Azure workload identity federation.
ACR as the image source of truth.
Environment-based promotion with approval gates for production.
Rollout verification with Kubernetes health checks and smoke tests.

The key principle is separation of concerns. CI/CD should roll forward application images and validated configuration, not rebuild the whole platform on every deploy. Shared components such as ingress, node pools, identity, storage, monitoring, and optional KEDA or Service Bus should remain under controlled infrastructure change management.

What makes this pattern reusable

This AKS pattern generalizes well beyond one model family or one product surface because it is built on fundamentals:

Separate API admission from dispatch and GPU execution.
Choose the dispatch boundary that fits the workload: Kubernetes-native queueing with AKS autoscaling, or Service Bus plus KEDA.
Isolate workload classes into different scaling lanes.
Scale worker capacity from the most useful signal for the chosen model: workload pressure inside AKS or external queue backlog through KEDA.
Put authentication, TLS, and routing at the edge.
Use workload identity and externalized secrets.
Instrument both software behavior and GPU behavior.
Keep deployments automated, traceable, and easy to roll back.

That is the real architecture story. The model can change. The runner can change. Even the queue and gateway choices can evolve. But the engineering fundamentals stay stable, and that stability is what makes diffusion workloads viable at scale.

Possible alternatives: KAITO and the AI toolchain operator add-on

Some teams do not need a fully custom GPU execution platform. On AKS, two adjacent options are KAITO and the AI toolchain operator add-on.

KAITO is the lighter-weight choice for rapid experimentation with supported model presets. The AI toolchain operator add-on is the more managed option for standardized LLM or multimodal serving with AKS-native operational features. Both are less suitable when the platform needs custom diffusion pipelines, queue-backed job orchestration, artifact-heavy workflows, or application-specific dispatch logic.

Reference documents

AKS node pools: Create node pools for a cluster in Azure Kubernetes Service (AKS)
Microsoft Entra Workload ID for AKS: Use Microsoft Entra Workload ID with Azure Kubernetes Service (AKS)
Key Vault integration for AKS: Use the Azure Key Vault provider for Secrets Store CSI Driver in an Azure Kubernetes Service (AKS) cluster
Azure Blob and other CSI storage drivers on AKS: Use Container Storage Interface (CSI) drivers on Azure Kubernetes Service (AKS)
AKS GPU multi-instance support: Use multi-instance GPUs in Azure Kubernetes Service (AKS)
KEDA on AKS: Simplified application autoscaling with Kubernetes Event-driven Autoscaling (KEDA) add-on
Application Gateway Ingress Controller: What is Application Gateway Ingress Controller?
AKS monitoring with Azure Monitor, managed Prometheus, and Grafana: Enable monitoring for Azure Kubernetes Service (AKS) clusters
Application telemetry with Application Insights: Introduction to Application Insights - OpenTelemetry observability
Hugging Face Diffusers: Diffusers documentation
NVIDIA DCGM exporter: dcgm-exporter
NVIDIA device plugin DaemonSet: NVIDIA k8s-device-plugin

Closing thought

Running diffusion models in production is not mainly a model-hosting problem. It is a platform engineering problem with GPUs in the middle. Teams that treat AKS as the control surface for isolation, observability, identity, and repeatable rollout discipline end up with a system that can scale beyond a benchmark and survive real operational demand.

Transforming Video Content into Structured SOPs Using Graph-based RAG

dikshashakya — Wed, 29 Apr 2026 01:04:53 GMT

Introduction

In today’s digital-first environments, a large portion of enterprise knowledge lives inside video content, training sessions, onboarding walkthroughs, and recorded operational procedures.

While videos are great for learning, they are not ideal for quick reference, compliance, or repeatable processes. Converting that knowledge into structured documentation like Standard Operating Procedures (SOPs) is often manual and time-consuming.

What if this process could be automated using AI?

The Problem

Transcripts alone don’t solve the problem.

When videos are converted into text, the output typically lacks:

Clear structure (sections, headings, hierarchy)
Context (relationships between steps, tools, and roles)
Completeness (definitions and dependencies spread across the content)

This leads to a common challenge:

Teams spend significant effort manually reading transcripts, interpreting context, and restructuring them into usable documentation.

As seen in modern architecture challenges, manual and repetitive configurations don’t scale well and increase maintenance effort

Enter Graph-based RAG (GraphRAG)

GraphRAG extends traditional RAG by building a knowledge graph instead of treating content as disconnected chunks.

What GraphRAG Does

Extracts entities (tools, systems, roles, concepts)
Maps relationships between them
Groups related concepts into logical sections
Preserves context across the entire document

Architecture Overview

Below is the high-level pipeline:

Video → Transcription → Knowledge Graph → LLM Generation → Structured SOP

Implementation Approach (Step-by-Step)

Stage 1: Knowledge Graph Construction

Convert video to transcript
Split transcript into chunks
Feed chunks into GraphRAG

GraphRAG performs:

Text Unit Extraction
Entity Recognition
Relationship Mapping
Community Detection

Result: A structured knowledge graph representation of the transcript

Stage 2: Structure Extraction

From the knowledge graph:

Sequential Steps

Preserve procedural flow from transcript order

Logical Sections

Derived using community detection

Key Concepts

Identified using graph centrality (importance via connections)

This creates a framework for the SOP

Stage 3: Intelligent Document Generation

Using Azure OpenAI, each SOP section is generated:

Section	Generated From
Title & Purpose	High-level concepts
Scope	Entity boundaries
Definitions	Entity descriptions
Responsibilities	Role-based entities
Procedures	Sequential steps
References	Linked content

The key advantage: LLM is grounded in graph structure not raw text

Key Benefits

Context Preservation - Relationships between concepts are maintained across sections.

Comprehensive Coverage - Community detection ensures important topics are not missed.

Reduced Hallucination - LLM generation is grounded in structured knowledge.

Scalability- Works for: 30-minute tutorials, 3-hour training sessions and Enterprise knowledge bases

Real-World Impact (Example)

In enterprise scenarios like pharmaceutical SOP generation:

Processing time: ~15–20 minutes for a multi-hour video
Output quality: 8–10 structured SOP sections
Consistency: Terminology and relationships preserved
Coverage: Minimal missing topics

Where This Approach Works Best

Training videos → SOPs
Meeting recordings → action summaries
Technical demos → documentation
Interview recordings → knowledge bases
Tutorials → reference guides

Key Takeaway

This approach represents a shift from text processing → knowledge understanding.

By combining:

Knowledge graphs (structure)
LLMs (language generation)

We can transform raw, unstructured content into usable, enterprise-grade knowledge assets.

Resources

https://microsoft.github.io/graphrag/index/overview/

Final Thoughts

Have you explored GraphRAG or similar approaches in your projects?

What challenges did you face?
How did you handle unstructured knowledge?

Share your experiences — let’s learn together.

Flexible Cooling for AI Growth: How Zonal Architecture Supports Diverse Hardware Needs

stsolo — Fri, 29 May 2026 20:01:33 GMT

By: Ricardo Bianchini, Steve Solomon, Brijesh Warrier, Martin Herbert, Jay Jochim, Husam Alissa, Pulkit Misra, Eric Peterson and Cam Turner

Context -

Microsoft is pioneering zonal cooling in its next-generation AI datacenters, enabling flexible, performant, efficient, and sustainable thermal management for diverse workloads.

The unprecedented growth of artificial intelligence (AI) is transforming datacenter infrastructure. Modern facilities must now support a diverse array of IT equipment, each with distinct cooling requirements. For example, modern GPUs and other AI accelerators require liquid cooling as air cooling is impractical at power draws exceeding 1 kW per accelerator due to the limited heat capacity of air to remove the resulting thermal load. Meanwhile, non-AI-accelerator (i.e., general-purpose) hardware deployments such as CPU-based compute, storage, and networking are expected to mostly remain air-cooled for the foreseeable future.

Furthermore, liquid cooling offers a significant efficiency advantage: its superior heat dissipation allows coolant supply temperatures at the chip as high as 45°C without sacrificing peak performance. In contrast, air-cooled equipment requires much lower supply temperatures—around 30 °C—for optimal efficiency.

The divergence in hardware cooling requirements creates a complex landscape that demands a strategy that is both flexible and adaptive. As shown in Figure 1, relying on a unified facility water system (FWS) introduces major inefficiencies. For example, liquid-cooled GPU racks may receive coolant below their required operating temperature when served by a single-temperature loop. This inefficiency becomes even more pronounced as the proportion of liquid- to air-cooled equipment increases (e.g., 90:10 liquid-to-air ratio for NVIDIA GB300 servers) since a larger share of the equipment is unnecessarily overcooled.

Beyond operational efficiency, sustainability is a key priority for Microsoft even as we grow our AI infrastructure. Among our sustainability commitments, Microsoft has set goals to become carbon negative and eliminate water evaporation as a cooling method in its next-generation datacenters. A key lever for reducing carbon emissions is improving PUE (Power Usage Effectiveness, i.e., total power divided by IT power), a standard measure of datacenter power and energy efficiency. Achieving this requires dynamically matching cooling delivery to the specific needs of each equipment type, ensuring optimal performance, reduced energy consumption, and enhanced sustainability.

Zonal Cooling: Flexible by Design

Zonal cooling is a facility design that introduces multiple independent water loops, each supplying coolant at different temperatures. Figure 2 illustrates a specific implementation of the zonal concept with two facility-level zones: one loop serves air-cooled equipment, maintaining lower temperatures for human comfort and general-purpose hardware, and the other loop caters to liquid-cooled IT AI accelerators, which can operate efficiently at higher supply temperatures. This separation enables datacenter operators to precisely match cooling supply to the requirements of each zone, avoiding the inefficiency of over-cooling all equipment to the lowest common denominator.

A key strength of zonal cooling is its flexibility. As new generations of IT hardware emerge, with varying thermal profiles, zonal cooling allows datacenters to adapt without major infrastructure overhauls. For example, future AI accelerators may need different liquid temperature ranges (see 30℃ Coolant - A Durable Roadmap for the Future) or technological improvements, such as microfluidics, may enable operating at even higher coolant temperatures, while general-purpose equipment requirements may remain unchanged. Zonal cooling’s architecture supports these changes by enabling operators to adjust loop temperatures and reconfigure cooling assignments as needed.

Forms of Zonal Cooling

Liquid cooling expands the allowable coolant supply temperature range and enables temperature-specific zones. This zonal approach can be applied at multiple layers:

Facility-level: Two distinct temperature zones within a datacenter—one for air-cooled equipment and another for liquid-cooled equipment.
Row-level: Tailor coolant temperature for each row based on deployed hardware (e.g., general-purpose vs GPU servers).
Rack-level: Enable multiple temperature zones within a single rack for fine-grained optimization across servers.
Chip-level: Apply zonal cooling inside the server. For example, use colder coolant for a GPU’s high-bandwidth memory (HBM) while supplying warmer coolant for the SoC and CPUs. This fine-grained approach can enable higher HBM stacking for improved performance, while avoiding unnecessary cooling overhead.

Microsoft is building facility-level zonal cooling in the next generation of its AI datacenters going live in 2028 and beyond, while exploring the other three approaches in the lab. Facility-level zonal cooling is expected to reduce PUEs by up to 10%.

Benefits from Zonal Cooling

Zonal cooling is a strategic enabler for performance and efficiency. It can deliver:

Improved energy efficiency and sustainability: By reducing the load on datacenter cooling infrastructure, zonal cooling improves energy efficiency as captured by annualized PUE, which measures average efficiency across all operating conditions. Lower annualized PUE means energy savings and lower carbon emissions.
Increased server density: Tailored zonal cooling reduces peak cooling power demand during the hottest days, which in turn lowers peak PUE. Designers can leverage this reduction to reserve power for lower water temperatures (anticipating future accelerator needs), add more servers within the same utility power envelope, or contract less utility power per datacenter.
Higher performance: Strategic control of coolant temperatures unlocks higher chip performance without sacrificing efficiency. For example, colder loops allow GPUs and CPUs to sustain elevated clock speeds via safe overclocking, while optimized memory cooling supports greater stacking density and increased bandwidth.
Improved flexibility: With independent zones, operators can easily adjust coolant supply temperatures or reconfigure zones as new generations of hardware with varied cooling requirements emerge. This flexibility ensures compatibility with future innovations while maintaining optimal performance.

Looking Ahead

Zonal cooling represents a paradigm shift in datacenter thermal management. Its flexible, zone-specific approach to cooling air- and liquid-cooled IT equipment positions datacenters to efficiently adapt to future hardware innovations and workload diversity. As the industry continues to push boundaries in performance and sustainability, zonal cooling will be a foundational strategy for building performance and efficient infrastructure that meets tomorrow’s challenges.

Modernizing Industrial Safety and Inspection with AI-Driven Drone Automation

PeterTHLee — Sat, 25 Apr 2026 00:07:31 GMT

In large-scale manufacturing and infrastructure environments, maintaining structural integrity is a continuous operational challenge. Industrial facilities—from automotive plants to energy and infrastructure sites—depend on thousands of structural connection points such as bolts and fasteners to ensure safe and reliable operations. Over time, vibration, thermal cycling, and mechanical stress can cause these components to loosen or degrade.

While drones have dramatically improved how inspection data is captured, the analysis of that data—often involving thousands of connection points—remains largely manual. Engineers frequently review footage frame by frame, making the process labor-intensive, inconsistent, difficult to scale, and often reactive rather than predictive.

Bolt inspection is one example of a broader category of high-volume, repetitive visual inspections that are critical for safety but challenging to execute consistently. Environmental factors such as lighting variation, shadows, camera angles, image resolution, and marking inconsistencies further complicate automation.

This creates a clear opportunity for transformation through AI. By combining deterministic computer vision models along with Generative AI reasoning capabilities, organizations can move beyond manual review toward scalable, intelligent inspection systems. Computer vision provides precise detection and measurement, while Generative AI enhances interpretation, contextual validation, and cross-frame reasoning—together enabling more robust defect identification and operational insight.

This article presents validated architecture and practical lessons learned from implementing an AI-driven drone inspection solution. While bolt integrity inspection serves as a representative example, the architecture and approach apply broadly across industrial safety and infrastructure monitoring scenarios.

The Evolution from GenAI Approach to Deterministic Precision

Starting with a Generative AI–driven approach to capture and reason over bolt frames is a fundamentally more effective strategy for this problem space. It accelerates early-stage detection of degraded bolts without requiring large labeled datasets, while simultaneously enabling structured data collection needed to train deterministic machine learning models—which typically require tens of thousands of images.

This approach delivers immediate value by rapidly identifying relevant visual signals in drone footage and uncovering key factors that influence detection accuracy, such as lighting, angle, and alignment. At the same time, it naturally builds the dataset necessary to transition toward a more scalable and repeatable solution.

However, it also makes clear that while Generative AI is powerful for contextual reasoning across frames, it is inherently non-deterministic and sensitive to input variability. For enterprise-grade reliability, precision, and repeatability, a complementary approach is required.

The optimal solution is a hybrid model that combines the strengths of both:

Computer Vision machine learning models provide precise, consistent detection and measurement of structural features at scale.
Generative AI adds contextual reasoning across bolt frames, validates consistency, and interprets ambiguous or borderline defects.

Together, they form a superior system—delivering higher accuracy, reduced ambiguity, and stronger context awareness in complex real-world conditions.

AI cannot compensate for inconsistent input data.

Standardized data capture and operational discipline remain prerequisites for reliable automation.

Solution Components and Architecture

The proposed solution follows a modular, event-driven architecture that combines computer vision and Generative AI to enable scalable, intelligent inspection workflows. At a high level, inspection videos are ingested, processed through deterministic computer vision models for detection and measurement, and enhanced with Generative AI for contextual reasoning and validation. The results are evaluated, stored, and surfaced through analytics platforms to support operational decision-making.

The first diagram provides a system-level view of how core Azure services interact—from data ingestion and model execution to evaluation, storage, and reporting. It highlights the integration of the computer vision pipeline (Azure AI Vision and Azure Machine Learning), the Generative AI reasoning layer (Azure OpenAI), and downstream analytics (Cosmos DB and Power BI), enabling a scalable and flexible architecture.

The second diagram illustrates the step-by-step execution flow across the system. The process begins when a drone operator uploads inspection video to Azure Blob Storage, triggering an event-driven workflow via Azure Functions. Frames are extracted and passed through a quality gate to filter out low-quality data. Valid frames are then processed by computer vision models (Azure AI Vision and Azure Machine Learning) to detect and track bolts, generate bounding boxes, and perform deterministic alignment measurements.

These outputs are further enhanced by a Generative AI layer (Azure OpenAI), which applies contextual reasoning across frames to validate anomalies, reduce false positives, and generate structured summaries. The results are evaluated using Azure AI Foundry to ensure quality, consistency, and reliability before being stored in Cosmos DB. Finally, Power BI dashboards surface insights, trends, and alerts for operational use.

Throughout the pipeline, built-in feedback loops—such as quality filtering, evaluation checks, and quarantine mechanisms—ensure that only high-confidence results are retained, enabling a reliable and production-ready inspection system.

Azure Blob Storage
Primary storage for raw videos, extracted frames, labeled datasets, and model artifacts.
Serves as the ingestion and archival layer for inspection data and training pipelines.
Azure Functions
Serverless event-driven compute used to trigger workflows from video uploads, inspection events, or user actions.
Handles orchestration, preprocessing, and integration between AI services while maintaining lightweight, scalable execution.
Azure Machine Learning (Azure ML Studio)
End-to-end development platform for training, testing, and deploying custom machine learning, computer vision models and Gen AI and evaluation workflow.
- Quality Gate (Frame Filtering)
  Captured video passes through an automated quality gate that removes frames with blur, glare, poor lighting, or unfavorable angles. This ensures that only high-quality, inspection-grade frames are used, safeguarding model accuracy.
- Bolt Detection (CV Model)
  Detects and localizes bolts in each frame with bounding boxes, confidence scores, and coarse defect signals (e.g., using YOLO or RT-DETR).
- Bolt Identification & Tracking (CV + Logic)
  Maintains consistent bolt identity across frames using spatial context or markers (e.g., AprilTags), enabling longitudinal tracking.
- Deterministic Measurement (CV + Geometry)
  Computes precise alignment or rotation using geometric analysis, with threshold-based evaluation for repeatable, auditable results.
- Contextual Validation & Reporting (GenAI Layer)
  Applies cross-frame reasoning to validate results, resolve ambiguities, improve accuracy, reduce false positives, and generate a structured, human-readable summary report of inspection findings.
- Azure AI Evaluation Metrics – (Microsoft Foundry)
  Ensures the quality, reliability, and compliance of generative AI outputs by evaluating key dimensions such as:
  - Groundedness – Verifies that the generated summary and reasoning are based on actual frames and inspection measurements.
  - Coherence – Assesses logical consistency across frames and throughout the report, ensuring observations and conclusions align.
  - Fluency – Measures clarity, readability, and professional language in the human-readable summary report.
    These metrics act as guardrails to maintain enterprise-grade accuracy, trustworthiness, and compliance in all AI-generated inspection insights.

Azure Cosmos DB
Globally distributed NoSQL database for storing structured inspection results, metadata, agent memory, and historical asset data.
Enables longitudinal tracking, contextual retrieval, and scalable real-time querying.
Power BI
Business intelligence and visualization platform used to monitor inspection results, trends, and operational KPIs.
Provides dashboards for maintenance teams, reliability engineers, and leadership decision-making.

Security and Enterprise Considerations

Azure Blob Storage: Storage accounts can be secured by minimizing public exposure, enforcing strong identity‑based access, protecting data, and continuously monitoring for threats. Organizations should use Private Endpoints and disable public network access wherever possible, authenticate users and applications with Microsoft Entra ID instead of shared keys, and apply least‑privilege Azure RBAC with managed identities. Data should be encrypted in transit (TLS 1.2+) and at rest using Microsoft‑managed or customer‑managed keys stored in Azure Key Vault, while Microsoft Defender for Storage, logging, soft delete, backups, and Azure Policy should be enabled to detect threats, support recovery, and enforce compliance at scale. Content Safety can be called from the application layer to block uploads based on image content. Staging containers can be used to isolate untrusted uploads. Content Safety provides signals; your app enforces policy.
Azure AI Vision / Computer Vision (Custom Vision or Vision models): Azure AI Vision supports enterprise-grade security through Microsoft Entra ID–based authentication and Azure Role-Based Access Control (RBAC), ensuring only authorized users, applications, and services can access vision models and image data. Network isolation can be enforced using Virtual Network (VNet) integration and Private Link to restrict public internet exposure and ensure traffic remains within secure enterprise boundaries. All data transmitted to and from Azure AI Vision is encrypted in transit using TLS 1.2+ and encrypted at rest using Microsoft-managed keys or optional Customer-Managed Keys (CMKs).

For threat detection and monitoring, Microsoft Defender for Cloud provides security posture visibility and anomaly detection across AI workloads. Integration with Microsoft Purview enables classification and protection of sensitive image or inspection data, ensuring compliance with enterprise data governance policies.

Azure Machine Learning (Azure ML): Azure Machine Learning provides a secure environment for training, testing, and deploying machine learning and computer vision models. Access control is managed through Entra ID and Azure RBAC, enabling granular permissions for data scientists, engineers, and automated services. Managed Identities allow secure service-to-service authentication without exposing credentials.

For network security, Azure ML supports Virtual Network isolation, Private Link endpoints, and managed network configurations to prevent unauthorized external access. Data used for model training and inference is encrypted in transit and at rest, with support for Customer-Managed Keys (CMKs) stored in Azure Key Vault for enhanced control.

Microsoft Defender for Cloud provides threat detection and vulnerability management across compute instances, endpoints, and model deployments. Azure Policy ensures compliance by auditing and enforcing security configurations across ML workspaces. Additionally, model versioning and governance features support traceability and auditability for safety-critical AI deployments.

Azure Functions: Azure Functions can be secured by using Entra ID authentication and managed identities instead of keys or embedded secrets, and by enforcing least‑privilege access through Azure RBAC. Network exposure should be minimized by enabling HTTPS‑only access, using private endpoints, IP restrictions, and VNet integration where appropriate. Sensitive data and credentials should be stored in Azure Key Vault, with encryption enforced both in transit and at rest. Function apps should be hardened by keeping runtimes and dependencies up to date, disabling unused features, and enforcing secure configurations with Azure Policy. Ongoing protection relies on Azure Monitor, Application Insights, Defender for Cloud, and centralized logging or SIEM integration to detect threats and misconfigurations, along with regular vulnerability management, backups, and governance practices to maintain resilience and compliance.

Azure OpenAI (GPT-4o / GPT-4o mini): Govern which models are approved for use and protect model artifacts and training data from unauthorized access through strong identity, network, encryption, and logging controls. AI applications should be designed with layered defenses, including multi‑stage content filtering, safety meta‑prompts, and least‑privilege permissions for agents and plugins to reduce the risk of prompt injection, data leakage, and unintended actions. High‑risk AI operations should include human‑in‑the‑loop review to prevent autonomous execution of harmful or incorrect outcomes. Organizations must continuously monitor AI systems for misuse, anomalous behavior, and data exfiltration, and they should perform ongoing AI red teaming to identify vulnerabilities such as jailbreaking, adversarial inputs, and model manipulation before they can be exploited.

Azure Cosmos DB : Azure Cosmos enhances network security by supporting access restrictions via Virtual Network (VNet) integrationand secure access through Private Link. Data protection is reinforced by integration with Microsoft Purview, which helps classify and label sensitive data, and Defender for Cosmos DBto detect threats and exfiltration attempts. Cosmos DB ensures all data is encrypted in transit using TLS 1.2+ (mandatory) and at rest using Microsoft-managed or customer-managed keys (CMKs).
Power BI: Power BI leverages Microsoft Entra IDfor secure identity and access management. In Power BI embedded applications, using Credential Scanneris recommended to detect hardcoded secrets and migrate them to secure storage like Azure Key Vault. All data is encrypted both at rest and during processing, with an option for organizations to use their own Customer-Managed Keys (CMKs). Power BI also integrates with Microsoft Purview sensitivity labels to manage and protect sensitive business data throughout the analytics lifecycle. For additional context, Power BI security white paper - Power BI | Microsoft Learn.
Microsoft Foundry: Microsoft Foundry supports robust identity management using Azure Role-Based Access Control (RBAC) to assign roles within Microsoft Entra ID, and it supports Managed Identities for secure resource access. Conditional Access policies allow organizations to enforce access based on location, device, and risk level. For network security, Azure AI Foundry supports Private Link, Managed Network Isolation, and Network Security Groups (NSGs) to restrict resource access. Data is encrypted in transit and at rest using Microsoft-managed keys or optional Customer-Managed Keys (CMKs). Azure Policy enables auditing and enforcing configurations for all resources deployed in the environment. Additionally, Microsoft Entra Agent ID, which extends identity management and access capabilities to AI agents. AI agents created within Microsoft Foundry are automatically assigned identities in a Microsoft Entra directory centralizing agent and user management in one solution. AI Security Posture Management can be used to assess the security posture of AI workloads. Defender for AI Services provides threat protection and insights for you AI resources. Purview APIs enable Azure AI Foundry and developers to integrate data security and compliance controls into custom AI apps and agents. This includes enforcing policies based on how users interact with sensitive information in AI applications. Purview Sensitive Information Types can be used to detect sensitive data in user prompts and responses when interacting with AI applications.
DevOps Security: Embed security throughout the software development lifecycle. Best practices include conducting structured threat modeling with the Microsoft Threat Modeling Tool early in the design phase, securing the software supply chain by verifying provenance and scanning third‑party dependencies, and maintaining a Software Bill of Materials (SBOM).

Security is further “shifted left” by integrating automated controls directly into CI/CD pipelines. GitHub Advanced Security for Azure DevOps, which provides dependency scanning, CodeQL-based static application security testing (SAST), and secret scanning to identify vulnerabilities and exposed credentials in code and third-party libraries. Infrastructure-as-code templates can be validated with Azure Policy and Microsoft Defender for Cloud, while pipeline protections such as protected branches and approvals reduce the risk of unauthorized changes. DevOps environments can be hardened using Azure Key Vault for secrets management, Managed Identities and Microsoft Entra ID for least-privilege access, and monitoring through Azure Monitor . Microsoft Defender for Cloud DevOps Security provides centralized code‑to‑cloud visibility across Azure DevOps, GitHub, and GitLab, identifying risks in code, secrets, dependencies, and IaC and helping teams prioritize fixes early in CI/CD pipelines.

Related and Future Scenarios

Although bolt inspection served as an initial use case, this architecture establishes a scalable pattern for many industrial applications:

Predictive Maintenance: Tracking structural movement over time enables condition-based maintenance rather than schedule-based inspections.
Structural Health Monitoring: The same approach can detect cracks, corrosion, or deformation across industrial assets and infrastructure.
Equipment and Safety Compliance Monitoring: AI-driven visual inspection can monitor equipment wear, safety compliance, and environmental risks.
Digital Twin Integration: Inspection data can feed digital twin environments, enabling real-time visualization of facility health and risk conditions.

Conclusion

Modernizing industrial inspection is not simply about applying AI—it requires aligning technology, operational discipline, and data quality. Early exploration using Generative AI enabled rapid learning and feasibility validation. However, a production-grade solution must be built on deterministic computer vision models supported by standardized data capture and operational controls.

By combining drone-based data capture, deterministic computer vision, and Generative AI for reporting and insights, organizations can achieve scalable, repeatable, and auditable inspection processes. This hybrid approach enables safer operations, reduced manual effort, and the transition from reactive repairs to predictive maintenance across industrial environments.

The result is not just an automated inspection tool, but a scalable AI architecture for modern industrial safety and asset reliability.

Contributors:

This article is maintained by Microsoft. It was originally written by the following contributors. 

Principal authors: 

Peter Lee | Senior Cloud Solution Architect – US Customer Success 
Manasa Ramalinga | Senior Principal Cloud Solution Architect – US Customer Success 
Abed Sau | Principal Cloud Solution Architect – US Customer Success
Yagneswari Kanadam | Senior Cloud Solution Architect – US Customer Success

Designing a Medallion Framework — A Decision Guide

Subhajit1994 — Fri, 24 Apr 2026 16:11:09 GMT

Everyone draws the same picture: Bronze → Silver → Gold. Three boxes, three arrows. Done.

What that picture hides is the dozen design decisions you have to make inside each box — and the ones you make at the boundaries between them. Get those right and onboarding the 200th table feels like onboarding the 2nd. Get them wrong and you’ll be rewriting the framework in 18 months.

This post is a generic walkthrough of how to think about a medallion framework on Databricks (or any other platform): what each layer should own, where the responsibilities blur, and a few opinionated patterns I’ve found worth defending

The classic template -

Bronze → Silver → Gold. Three layers, broadly:

Press enter or click to view image in full size

This template is intentionally vague — and that’s the point. The same three labels can describe a framework for a 10-table marketing pipeline and a 2,000-table enterprise lakehouse. The differences are in how you tweak the template to match your project.

This post walks through the questions that drive those tweaks. There isn’t a single right answer for any of them — only the answer that fits your project’s requirements.

How to read this guide

For each architectural choice, I’ll frame it as:

The question — the requirement you need to clarify
The options — the realistic ways to answer it
When each option fits — what kind of project picks which option

Use this to make your tradeoffs explicit. Document the answers in your design doc. They’ll inform a hundred downstream decisions.

Question 1 — Do you need a Staging layer?

A Staging (stg_*) layer is a transient zone that holds just the current run’s data before it lands in Bronze.

Options:

No staging. Source → Bronze directly.
Staging as a transient table per object, overwritten every run.
Staging as a checkpointed zone (e.g., Auto Loader checkpoints + raw files in a landing path).

When to pick which:

The decision usually comes down to failure isolation and incremental capture clarity. If both are non-issues, you can skip it.

Question 2 — How “raw” should Bronze be?

This is the single biggest tweak point in the medallion architecture. The textbook says “Bronze = raw bytes.” Real projects often deviate.

Options:

A. Strictly raw. Source schema preserved exactly. All columns as STRING. No casting, no trimming.
B. Lightly cleaned. Strong typing, whitespace trimmed, null normalization (“”, “N/A” → NULL), audit columns added. Schema stable.
C. Cleansed + minor enrichment. Above plus reference data lookups, basic standardization (e.g., country codes), key normalization.

When to pick which:

A useful rule of thumb: the more sources and consumers you have, the cleaner Bronze should be. The cost of not cleaning compounds with every notebook downstream.

If you choose B or C, you’ve shifted some traditional Silver responsibilities into Bronze. That’s fine — just be explicit about it so Silver’s contract changes accordingly.

Question 3 — What does Silver actually own?

Silver is the most overloaded layer in any medallion framework. Decide upfront which of these responsibilities Silver owns vs. defers to other layers:

How to decide what Silver owns:

If Silver is the only layer business users query, give it more — including light history and aggregations. (Common in smaller projects.)
If you have a strong Gold layer with multiple marts, keep Silver narrow: business entities only, current state.
If you have multiple consuming teams with different needs, push everything consumer-specific to Gold and keep Silver as the shared canonical model.

The clearest signal that Silver is overloaded: you have one Silver table per source table. Silver should be organized by business entity, not by source. If they line up 1:1, you’ve effectively built “Bronze with cleaning” and skipped Silver’s real value.

Question 4 — Is Gold one zone or several?

The default picture shows Gold as one box. In real projects it often splits.

Options:

Single Gold zone. Marts and history live together.
Gold-Reporting + Gold-History. Reporting marts (denormalized, aggregated, fast) separated from historized snapshots (SCD2, point-in-time, append-mostly).
Gold per consumer. Separate zones per business unit, dashboard family, or external API.

The cost of splitting Gold is some duplication and more pipelines. The benefit is independent SLAs — your dashboard refresh isn’t held hostage by your audit history rebuild.

Question 5 — Load patterns: FullLoad vs DeltaLoad vs CDC

Per source table, decide the load pattern. This decision drives staging design, watermark management, and merge logic.

It’s normal to mix patterns inside the same framework. The metadata-driven approach below makes this trivial — load pattern is just a column in your config table.

Question 6 — How metadata-driven should the framework be?

Options:

Code-per-table. One notebook per ingestion. Simple, easy to reason about, scales poorly.
Hybrid. Generic ingestion notebooks for common patterns, custom notebooks for exceptions.
Fully metadata-driven. Generic notebooks for every layer, behavior driven entirely by metadata tables.

When to pick which:

A fully metadata-driven framework has higher upfront cost but flattens the per-table cost dramatically. The break-even point is usually around 30–50 tables.

Question 7 — Orchestration shape

How do you fan out work across tables?

Options:

Sequential. One table at a time. Simple, slow.
Parallel pool. ThreadPoolExecutor or Databricks Workflows fan-out. Tables run concurrently, no inter-table dependencies.
DAG. Dependency-aware execution. Required when tables depend on each other.

Per-layer guidance:

The decision driver is whether tables in that layer depend on each other. If they don’t, don’t pay the DAG complexity tax.

Question 8 — Failure handling and retries

Options to decide on:

Retry scope. Per statement, per child notebook, per master run, none.
Retry counts. Per layer? Per table? Per environment?
Backoff. Fixed, linear, exponential.
Failure semantics. Fail-fast (stop on first failure) or best-effort (continue and report at the end).

When to pick which:

A good default for most projects: process-level retry (master retries the failed child), exponential backoff, per-layer max retry count, fail-fast within a child.

Question 9 — Observability: how much do you log?

Decide what every run captures:

Execution status, start/end timestamps, duration
Row counts per activity (source read, staging write, target write)

MERGE metrics (inserted, updated, deleted)
Watermark used and watermark captured
Retry attempts
Error message (truncated)

Options for storage:

Logs in source-side metadata DB (e.g., Azure SQL). Easy to query with SQL, integrates with monitoring tools.
Logs in a Delta table in the lakehouse. Native to Databricks, queryable with Spark.
Logs in both. Source-side for ops dashboards, Delta for analytics on the pipeline itself.

When to pick which:

Whatever you pick, make count validation a first-class output. The moment counts mismatch, you want to know — not three reports later.

Question 10 — Schema evolution policy

The cheapest decision to defer and the most painful one to retrofit.

Decide which changes are allowed automatically:

Where to enforce:

At Bronze ingestion — fail loudly if source schema changes in a disallowed way
At Silver — handle by transformation; new Bronze columns don’t auto-flow to Silver
At Gold — strict contracts; consumers depend on the shape

The contract changes per layer reflects the audience. Bronze is forgiving (data engineers see issues); Gold is strict (consumers can’t tolerate surprises).

Question 11 — Idempotency and replay

Can you re-run yesterday’s load and get the same result?

Options:

Idempotent by run_id. Re-running the same run_id is a no-op or produces identical output.
Idempotent by data. Re-running with the same source data produces identical output (regardless of run_id).
Non-idempotent. Replays may produce different results (e.g., timestamps based on current_timestamp()).

Recommendation: aim for data-idempotent in every layer. Concretely:

Staging: overwrite-per-run → idempotent by construction.
Bronze: keyed MERGE → idempotent.
Silver: pure transformation of Bronze inputs → idempotent.
Gold: pure transformation of Silver inputs → idempotent.

If you can’t replay a layer cleanly, that’s a design bug worth fixing early.

Question 12 — Environment topology

How many environments? How do they differ?

Common patterns:

Dev / Test/ Stage / Prod, separate workspaces and data.
Per-developer dev, shared Test/Stage, isolated Prod.

What changes between environments (drive these from config):

Source connection strings
Target storage paths / catalog names
Retry counts (often higher in prod)
Parallelism (often lower in dev to save cost)
Logging verbosity
Data masking rules

Keep code identical across environments. Differences live in environment-scoped config (dev.yml, test.yml, stage.yml, prod.yml) loaded at runtime.

Putting it together — three example shapes

The same framework, three different projects, three different shapes:

Shape A — Small marketing analytics project

15 tables, single source, weekly batch
No staging — source is reliable, volumes small
Bronze: lightly cleaned — analysts query it directly
Silver: full ownership including light history and aggregations (no separate Gold needed)
Gold: optional, only for the executive dashboard
Code-per-table, sequential orchestration, fail-fast, minimal logging

Shape B — Mid-size enterprise data platform

80 tables, 5 source systems, daily batch with some hourly
Staging as transient table for Delta Loads
Bronze: lightly cleaned + audit columns
Silver: business entities (Customer, Policy, Claim), DAG orchestration
Gold: split into Reporting + History zones
Hybrid metadata-driven (generic ingestion, custom transforms), per-layer retry, structured count logs

Shape C — Large multi-tenant Lakehouse

500+ tables, 20+ source systems, mixed batch/streaming
Staging zone with file-level checkpoints (Auto Loader)
Bronze: strictly raw + a parallel Bronze-Curated layer for cleansed views
Silver: shared canonical model, narrow scope
Gold: per-consumer zones with independent SLAs
Fully metadata-driven, DAG everywhere, multi-store logging, strict schema contracts

Notice none of these are “wrong.” They’re calibrated to the project.

A short checklist for your own framework

Before writing code, write down your answers to:

Do we need a Staging layer? Why?
How clean is Bronze? What’s allowed and what’s not?
What does Silver own? Where does it stop?
Is Gold one zone or multiple? How are they divided?
Which load patterns do we support? Per table or universal?
How metadata-driven? Where do exceptions live?
What’s the orchestration shape per layer?
What’s our retry and failure policy per layer?
What does every run log? Where?
What’s our schema evolution policy per layer?
Are all layer's data-idempotent?
What changes per environment, and what stays the same?

If you have an answer for each, you have a framework design. If you skip any, you have a framework that will surprise you in production.

Closing thought

The medallion architecture isn’t a prescription — it’s a vocabulary. Bronze, Silver, Gold give you words to describe responsibilities. The actual responsibilities are yours to assign, based on what your project actually needs.

Tweak deliberately. Document your tweaks. And revisit them when the project’s requirements change — because they will.

Enhancing Enterprise AI Deployments with Zero Trust Networking

kirankumar_manchiwar04 — Fri, 24 Apr 2026 16:08:22 GMT

Problem Statement

Azure OpenAI is publicly accessible by default Azure OpenAI frequently asked questions | Microsoft Learn
Any application with API access can call it from anywhere
Violates:

Enterprise security policies
Zero Trust architecture Key principles of the Zero Trust network model
Regulatory compliance

👉 Enterprises require:

Private connectivity
Controlled access via VNet
DNS-based secure resolution

🏗️ Architecture Overview

✅ Key Components

Azure OpenAI Service
Azure Virtual Network (VNet)
Private Endpoint
Private DNS Zone
Application (VM / App Service / AKS)

✅ High-Level Flow

Application sends request to OpenAI endpoint
DNS resolves endpoint → Private IP
Traffic routed inside VNet
No internet exposure

👉 Private endpoints assign a private IP inside VNet, ensuring secure communication over Azure backbone.

🔹 Architecture Diagram Description

Diagram 01: Architecture Enhancing Enterprise AI Deployments with Zero Trust Networking

End-to-End Flow

User authenticates via Entra ID (MFA) Microsoft Entra multifactor authentication
Traffic passes through WAF (threat filtering)
Enters private Azure VNet
API Management enforces policies
AI services are accessed via private endpoints
Data is securely fetched from private storage/databases
Monitoring tools track all activity continuously

Key Value Proposition: This architecture ensures:

🚫 No public exposure of AI services or data
🔐 Identity-based access instead of network trust
🌐 Fully private, isolated network communication
⚡ Secure and scalable AI workloads
🛡️ Defense-in-depth with monitoring and policy enforcement

Note: This architecture demonstrates how enterprises can securely operationalize AI at scale by combining private networking, identity-driven access, and continuous monitoring—fully aligned with Zero Trust principles.

🔍 Critical Concept: Private Endpoint

A Private Endpoint:

Creates a network interface in your VNet
Assigns a private IP address
Maps to Azure OpenAI service
Redirects traffic internally

👉 Result:

No public internet usage
Fully isolated communication

🔍 Critical Concept: DNS Resolution

Why DNS is critical?

OpenAI endpoint still uses public FQDN
Must resolve to private IP instead

👉 Without correct DNS:

Traffic goes to public endpoint
Security is broken

How it works

Public DNS CNAME → Private Link domain
Private DNS overrides resolution
FQDN resolves to Private Endpoint IP

👉 DNS ensures that traffic routes correctly to private endpoint

🧱 Required Private DNS Zones

For Azure OpenAI:

privatelink.openai.azure.com
privatelink.cognitiveservices.azure.com

👉 These zones map:

OpenAI endpoint → Private IP

👉 Important: Each Private Endpoint must have proper DNS mapping

⚙️ Step-by-Step Configuration

✅ Step 1: Create Virtual Network

Create VNet with:

App subnet
Private Endpoint subnet

👉 Best practice:

Use dedicated subnet for private endpoints

✅ Step 2: Create Azure OpenAI Resource

Go to Azure Portal
Create Azure OpenAI
Select region & resource group

👉 Note:

OpenAI resource doesn’t need same region as VNet (optional)

✅ Step 3: Disable Public Network Access

Navigate to:

Networking → Public Access

Set:

Public Network Access = Disabled

👉 Ensures service is not accessible via internet

✅ Step 4: Create Private Endpoint

Go to OpenAI → Networking → Private Endpoint

Configure:

Setting	Value
VNet	Your VNet
Subnet	Private Endpoint Subnet
Resource Type	Cognitive Services
Sub-resource	account

👉 This creates:

Private IP in subnet
Network interface mapping

✅ Step 5: Configure Private DNS Zone

Create:

privatelink.openai.azure.com

Then:

Link DNS zone to VNet
Add A record automatically (or manually)

👉 DNS maps:

<your-openai-name>.openai.azure.com → Private IP

👉 DNS resolution ensures traffic flows internally

✅ Step 6: Validate Connectivity

From VM inside VNet:

nslookup <openai-name>.openai.azure.com

✅ Expected output:

Private IP (e.g., 10.x.x.x)

Then test API call → should work

✅ Step 7: Application Integration

Your application (AKS / VM / App Service):

Calls OpenAI endpoint
Traffic resolves to private IP
Routed via VNet

👉 Fully secure AI access

🔐 Security Best Practices

✔ Disable public access completely
✔ Use Private Endpoint for all AI services
✔ Use NSG + Firewall for segmentation
✔ Use Managed Identity instead of API keys
✔ Monitor via Azure Monitor

👉 Private endpoints ensure traffic stays inside Azure backbone network

🏢 Real-World Enterprise Use Case

Example:

Banking application using OpenAI
Hosted on AKS
Uses:

Private Endpoint
APIM
DNS resolution

👉 Result:

No internet exposure
Compliance with regulations
Secure data processing

✅ Key Benefits

🔒 Zero internet exposure
🌐 Private connectivity
🛡️ Zero Trust architecture
⚡ Reliable and low latency
🧩 Seamless app integration

🧾 Conclusion

Azure OpenAI is powerful, but security architecture is critical for enterprise adoption.

By using:

Private Endpoints
Private DNS Zones
VNet integration

You can build a secure, scalable, and compliant AI solution.

When RAG Isn’t Enough: Moving from Retrieval to Relationship-Aware Systems in Enterprise AI:

ankitasarkar — Fri, 24 Apr 2026 07:04:57 GMT

The Problem

In an enterprise AI scenario, the goal was to map structured feature data to relevant sections within large technical documents.

At a glance, this appears to be a straightforward semantic matching problem. Initial results using semantic search were promising. However, as the system was used more extensively, certain issues became apparent:

Inconsistent mappings across similar inputs
Occasional matches to contextually unrelated sections
Variability in results across repeated runs

Despite multiple optimizations, the system continued to produce outcomes that lacked reliability.

This pointed to a deeper realization:

The challenge was not just retrieval quality, but the absence of structure in how retrieval was being guided.

Initial Approach: Retrieval-Augmented Generation (RAG)

The system followed a standard RAG architecture:

Documents indexed using embeddings
Semantic similarity used for retrieval
Retrieved context passed to a language model for processing

RAG is highly effective in scenarios involving unstructured data, offering flexibility and strong contextual understanding.

However, an important limitation emerged:

RAG operates on semantic similarity but does not inherently understand relationships or domain constraints.

Observed Challenges

Lack of Contextual Boundaries

Concepts with similar terminology were sometimes mapped across unrelated domains due to overlapping language. Without domain awareness, the system struggled to enforce meaningful boundaries.

Underutilization of Existing Structure

The data already contained valuable structure:

Features were organized into categories
Categories aligned with specific document sections
Relationships followed consistent, rule-driven patterns

This structure was not incorporated into the retrieval process, leading to missed opportunities for improving accuracy.

Variability in Deterministic Scenarios

Some mappings followed clear and consistent rules. However, treating all queries as probabilistic retrieval problems introduced unnecessary variability and reduced confidence in the results.

Introducing Structure with Knowledge Graphs

To address these challenges, a structured layer based on Knowledge Graph concepts was introduced.

At a high level, relationships were modeled as:

Entity → belongs to → Category
Category → linked to → Knowledge Source
Knowledge Source → contains → Relevant Information

This enabled:

Constraint enforcement for rule-based mappings
Relationship traversal across hierarchical data
Improved explainability through traceable decision paths

Hybrid Approach: Combining Knowledge Graph and RAG

Rather than replacing RAG, the system evolved into a hybrid architecture:

Step 1: Knowledge Graph for filtering

Apply domain constraints

Narrow down the search space to relevant sections

Step 2: RAG for semantic refinement

Perform retrieval within the filtered scope
Extract context with greater precision

Key Insight

The transition from a retrieval-first approach to a constraint-guided retrieval model significantly improved consistency and relevance.

When to Use This Approach

RAG is sufficient when:

Data is primarily unstructured
Relationships are weak or undefined
Rapid prototyping is required

A hybrid approach is beneficial when:

The domain includes clear hierarchies or taxonomies
Relationships are deterministic or rule-driven
Consistency and explainability are important
Pure semantic retrieval produces logically incorrect results

Key Takeaways

Leverage existing structure in enterprise data instead of relying solely on semantic similarity
Analyze failure patterns to identify missing constraints
Combine structured and semantic approaches for robust system design
Prioritize explainability for production-grade AI systems

Broader Perspective

As enterprise AI systems scale, it becomes increasingly important to balance:

Semantic understanding (RAG)
Structured reasoning (Knowledge Graphs)

These approaches are not competing—they are complementary. When combined effectively, they enable systems that are both flexible and reliable.

Closing Thought

A key realization from this experience was:

Instead of focusing only on improving retrieval, it is equally important to understand how domain structure can guide and constrain that retrieval.

This article reflects personal learnings and general architectural patterns.

Centralizing Enterprise API Access for Agent-Based Architectures

sbaskaran — Thu, 23 Apr 2026 21:43:23 GMT

Problem Statement

When building AI agents or automation solutions, calling enterprise APIs directly often means configuring individual HTTP actions within each agent for every API. While this works for simple scenarios, it quickly becomes repetitive and difficult to manage as complexity grows.

The challenge becomes more pronounced when a single business domain exposes multiple APIs, or when the same APIs are consumed by multiple agents. This leads to duplicated configurations, higher maintenance effort, inconsistent behavior, and increased governance and security risks.

A more scalable approach is to centralize and reuse API access. By grouping APIs by business domain using an API management layer, shaping those APIs through a Model Context Protocol (MCP) server, and exposing the MCP server as a standardized tool or connector, agents can consume business capabilities in a consistent, reusable, and governable manner.

This pattern not only reduces duplication and configuration overhead but also enables stronger versioning, security controls, observability, and domain‑driven ownership—making agent-based systems easier to scale and operate in enterprise environments.

Designing Agent‑Ready APIs with Azure API Management, an MCP Server, and Copilot Studio

As enterprises increasingly adopt AI‑powered assistants and Copilots, API design must evolve to meet the needs of intelligent agents. Traditional APIs—often designed for user interfaces or backend integrations—can expose excessive data, lack intent-level abstraction, and increase security risk when consumed directly by AI systems. This document outlines a practical, enterprise-‑ready approach to organize APIs in Azure API Management (APIM), introduce a Model Context Protocol (MCP) server to shape and control context, and integrate the solution with Microsoft Copilot Studio. The goal is to make APIs truly agent-‑ready: secure, scalable, reusable, and easy to govern.

Architecture at a glance

Back-end services expose domain APIs.
Azure API Management (APIM) groups and governs those APIs (products, policies, authentication, throttling, versions).
An MCP server calls APIM, orchestrates/filters responses, and returns concise, model-friendly outputs.
Copilot Studio connects to the MCP server and invokes a small set of predictable operations to satisfy user intents.

Why Traditional API Designs Fall Short for AI Agents

Enterprise APIs have historically been built around CRUD operations and service-‑to-‑service integration patterns. While this works well for deterministic applications, AI agents work best with intent-driven operations and context-aware responses. When agents consume traditional APIs directly, common issues include: overly verbose payloads, multiple calls to satisfy a single user intent, and insufficient guardrails for read vs. write operations. The result can be unpredictable agent behavior that is difficult to test, validate, and govern.

Structuring APIs Effectively in Azure API Management

Azure API Management (APIM) is the control plane between enterprise systems and AI agents. A well-‑structured APIM instance improves security, discoverability, and governance through products, policies, subscriptions, and analytics. Key design principles for agent consumption Organize APIs by business capability (for example, Customer, Orders, Billing) rather than technical layers. Expose agent-facing APIs via dedicated APIM products to enable controlled access, throttling, versioning, and independent lifecycle management. Prefer read-only operations where possible; scope write operations narrowly and protect them with explicit checks, approvals, and least-privilege identities. Read‑only APIs should be prioritized, while action‑oriented APIs must be carefully scoped and gated.

The Role of the MCP Server in Agent‑Based Architectures

APIM provides governance and security, but agents also need an intent-level interface and model-friendly responses. A Model Context Protocol (MCP) server fills this gap by acting as a mediator between Copilot Studio and APIM-exposed APIs. Instead of exposing many back-end endpoints directly to the agent, the MCP server can: orchestrate multiple API calls, filter irrelevant fields, enforce business rules, enrich results with additional context, and emit concise, predictable JSON outputs. This makes agent behavior more reliable and easier to validate. Instead of exposing multiple backend APIs directly to the agent, the MCP server aggregates responses, filters irrelevant data, enriches results with business context, and formats responses into LLM‑friendly schemas. By introducing this abstraction layer, Copilot interactions become simpler, safer, and more deterministic. The agent interacts with a small number of well‑defined MCP operations that encapsulate enterprise logic without exposing internal complexity.

Designing an Effective MCP Server

An MCP server should have a focused responsibility: shaping context for AI models. It should not replace core back-end services; it should adapt enterprise capabilities for agent consumption. What MCP should do An MCP server should be designed with a clear and focused responsibility: shaping context for AI models. Its primary role is not to replace backend services, but to adapt enterprise data for intelligent consumption. MCP does not orchestrate enterprise workflows or apply business logic. It standardizes how agents discover and invoke external tools and APIs by exposing them through a structured protocol interface. Orchestration, intent resolution, and policy-driven execution are handled by the agent runtime or host framework. It is equally important to understand what does not belong in MCP. Complex transactional workflows, long‑running processes, and UI‑specific formatting should remain in backend systems. Keeping MCP lightweight ensures scalability and easier maintenance.

Call APIM-managed APIs and orchestrate multi-step retrieval when needed.
Apply security checks and business rules consistently.
Filter and minimize payloads (return only fields needed for the intent).
Normalize and reshape responses into stable, predictable JSON schemas.
Handle errors and edge cases with safe, descriptive messages.

What MCP should not do Avoid implementing complex transactional workflows, long-running processes, or UI-specific formatting in MCP. Keep it lightweight so it remains scalable, testable, and easy to maintain.

Step by step guide

1) Create an MCP server in Azure API Management (APIM)

Open the Azure portal (portal.azure.com).
Go to your API Management instance.
In the left navigation, expand APIs.
Create (or select) an API group for the business domain you want to expose (for example, Orders or Customers).
Add the relevant APIs/operations to that API group.
Create or select an APIM product dedicated for agent usage, and ensure the product requires a subscription (subscription key).
Create an MCP server in APIM and map it to the API (or API group) you want to expose as MCP operations.
In the MCP server settings, ensure Subscription key required is enabled.
From the product’s Subscriptions page, copy the subscription key you will use in Copilot Studio.
Screenshot placeholders: APIM API group, product configuration, MCP server mapping, subscription settings, subscription key location.

* Note: Using an API Management subscription key to access MCP operations is one supported way to authenticate and consume enterprise APIs. However, this approach is best suited for initial setups, demos, or scenarios where key-based access is explicitly required.

For production‑grade enterprise solutions, Microsoft recommends using managed identity–based access control. Managed identities for Azure resources eliminate the need to manage secrets such as subscription keys or client secrets, integrate natively with Microsoft Entra ID, and support fine‑grained role‑based access control (RBAC). This approach improves security posture while significantly reducing operational and governance overhead for agent and service‑to‑service integrations.

Wherever possible, agents and MCP servers should authenticate using managed identities to ensure secure, scalable, and compliant access to enterprise APIs.

2) Create a Copilot Studio agent and connect to the APIM MCP server using a subscription key

Copilot Studio natively supports Model Context Protocol (MCP) servers as tools. When an agent is connected to an MCP server, the tool metadata—including operation names, inputs, and outputs—is automatically discovered and kept in sync, reducing manual configuration and maintenance overhead.

Sign in to Copilot Studio.
Create a new agent and add clear instructions describing when to use the MCP tool and how to present results (for example, concise summaries plus key fields).
Open Tools > Add tool > Model Context Protocol, then choose Create.
Enter the MCP server details: Server endpoint URL: copy this from your MCP server in APIM. Authentication: select API Key. Header name: use the subscription key header required by your APIM configuration.
Select Create new connection, paste the APIM subscription key, and save.
Test the tool in the agent by prompting for a domain-specific task (for example, “Get order status for 12345”). Validate that responses are concise and that errors are handled safely.
Screenshot placeholders: MCP tool creation screen, endpoint + auth configuration, connection creation, test prompt and response.

Operational best practices and guardrails

Least privilege by default: create separate APIM products and identities for agent scenarios; avoid broad access to internal APIs.
Prefer intent-level operations: expose fewer, higher-level MCP operations instead of many low-level endpoints.
Protect write operations: require explicit parameters, validation, and (when appropriate) approval flows; keep “read” and “write” tools separate.
Stable schemas: return predictable JSON shapes and limit optional fields to reduce prompt brittleness.
Observability: log MCP requests/responses (with sensitive fields redacted), monitor APIM analytics, and set alerts for failures and throttling.
Versioning: version MCP operations and APIM APIs; deprecate safely.
Security hygiene: treat subscription keys as secrets, rotate regularly, and avoid exposing them in prompts or logs.

Summary

As organizations scale agent‑based and Copilot‑driven solutions, directly exposing enterprise APIs to AI agents quickly becomes complex and risky. Centralizing API access through Azure API Management, shaping agent‑ready context via a Model Context Protocol (MCP) server, and consuming those capabilities through Copilot Studio establishes a clean and governable architecture.

This pattern reduces duplication, enforces consistent security controls, and enables intent‑driven API consumption without exposing unnecessary backend complexity. By combining domain‑aligned API products, lightweight MCP operations, and least‑privilege identity‑based access, enterprises can confidently scale AI agents while maintaining strong governance, observability, and operational control.

References

Automatic SSO Takeover in Azure AD B2C Custom Policies

anammalu — Thu, 23 Apr 2026 21:34:55 GMT

Why This Matters

As organizations modernize authentication, many shift toward Single Sign-On (SSO) using providers like Microsoft EntraID.

But if you already have users in Azure AD B2C using local accounts (email + password), the transition isn’t straightforward.

You’ll run into:

Duplicate identities when users sign in with SSO using the same email
No clean migration path for existing users
Security gaps if password sign-in remains available after SSO
Confusing UX if password reset still allowed for SSO users

The Goal

A seamless, secure transition where:

Users keep a single identity (objectId)
SSO is automatically linked to existing accounts
Password-based flows are permanently disabled after migration
Non-migrated users continue normal local flows (including Forgot Password)
No manual migration or user intervention is required

The Pattern: “SSO Takeover”

This approach uses custom policies (Identity Experience Framework) to:

Detect when a user signs in via SSO
Check if a local account exists with the same email
Automatically link the identities
Set a flag: extension_ssoMigrated = true
Enforce SSO-only access going forward

Scenario	Outcome
Local user (not migrated)	✅ Password sign-in works
Local user (not migrated) – Forgot Password	✅ Allowed
First SSO login (existing user)	✅ Account linked automatically
SSO-migrated user	✅ SSO works
SSO-migrated user – password login	❌ Blocked
SSO-migrated user – password reset	❌ Blocked

Key Building Blocks

1. Migration Flag: extension_ssoMigrated

A custom boolean attribute stored on the user object. This drives all decisions.

<ClaimType Id="extension_ssoMigrated"> <DataType>boolean</DataType> </ClaimType>

2. Conditional blocking (only for migrated users)

A transformation enforces:

If extension_ssoMigrated = true → block
Otherwise → continue normally

This is applied in:

Local sign-in
Password reset

Using preconditions ensures:

❌ Migrated users are blocked
✅ Non-migrated users are NOT affected

3. Automatic Account Linking

During SSO login:

Look up user by email
If found → attach alternativeSecurityId
Set extension_ssoMigrated = true

No duplication. No manual merge.

Simplified flow

User clicks “Sign in with Microsoft”
System checks existing SSO account
If none → lookup by email
If match → link + mark migrated
Issue token

TrustFrameworkExtension.xml

<?xml version="1.0" encoding="utf-8" ?> <TrustFrameworkPolicy xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xsd="http://www.w3.org/2001/XMLSchema" xmlns="http://schemas.microsoft.com/online/cpim/schemas/2013/06" PolicySchemaVersion="0.3.0.0" TenantId="yourtenant.onmicrosoft.com" PolicyId="B2C_1A_TrustFrameworkExtensions" PublicPolicyUri="http://yourtenant.onmicrosoft.com/B2C_1A_TrustFrameworkExtensions" TenantObjectId="tenantID"> <BasePolicy> <TenantId>yourtenant.onmicrosoft.com</TenantId> <PolicyId>B2C_1A_TrustFrameworkLocalization</PolicyId> </BasePolicy> <BuildingBlocks> <ClaimsSchema> <ClaimType Id="extension_ssoMigrated"> <DisplayName>SSO Migrated</DisplayName> <DataType>boolean</DataType> <UserHelpText>Indicates if user migrated to SSO</UserHelpText> </ClaimType> <ClaimType Id="isForgotPassword"> <DisplayName>isForgotPassword</DisplayName> <DataType>boolean</DataType> <AdminHelpText>Whether the user has clicked Forgot your password</AdminHelpText> </ClaimType> </ClaimsSchema> <ClaimsTransformations> <ClaimsTransformation Id="AssertNotSsoMigrated" TransformationMethod="AssertBooleanClaimIsEqualToValue"> <InputClaims> <InputClaim ClaimTypeReferenceId="extension_ssoMigrated" TransformationClaimType="inputClaim" /> </InputClaims> <InputParameters> <InputParameter Id="valueToCompareTo" DataType="boolean" Value="false" /> </InputParameters> </ClaimsTransformation> </ClaimsTransformations> </BuildingBlocks> <ClaimsProviders> <ClaimsProvider> <Domain>AADBSI</Domain> <DisplayName>Sign in with Microsoft</DisplayName> <TechnicalProfiles> <TechnicalProfile Id="AADBSI-OpenIdConnect"> <DisplayName>Sign in with Microsoft</DisplayName> <Description>Sign in with Microsoft</Description> <Protocol Name="OpenIdConnect" /> <Metadata> <Item Key="METADATA">https://login.microsoftonline.com/common/v2.0/.well-known/openid-configuration</Item> <Item Key="client_id">4a84d062-21d7-4e96-8f80-c6b688e7127b</Item> <Item Key="response_types">code</Item> <Item Key="scope">openid profile</Item> <Item Key="response_mode">form_post</Item> <Item Key="HttpBinding">POST</Item> <Item Key="UsePolicyInRedirectUri">false</Item> <Item Key="DiscoverMetadataByTokenIssuer">true</Item> <Item Key="ValidTokenIssuerPrefixes">https://sts.windows.net/,https://login.microsoftonline.com/</Item> </Metadata> <CryptographicKeys> <Key Id="client_secret" StorageReferenceId="B2C_1A_Multitenancy" /> </CryptographicKeys> <OutputClaims> <OutputClaim ClaimTypeReferenceId="issuerUserId" PartnerClaimType="oid" /> <OutputClaim ClaimTypeReferenceId="tenantId" PartnerClaimType="tid" /> <OutputClaim ClaimTypeReferenceId="givenName" PartnerClaimType="given_name" /> <OutputClaim ClaimTypeReferenceId="surName" PartnerClaimType="family_name" /> <OutputClaim ClaimTypeReferenceId="displayName" PartnerClaimType="name" /> <OutputClaim ClaimTypeReferenceId="authenticationSource" DefaultValue="socialIdpAuthentication" AlwaysUseDefaultValue="true" /> <OutputClaim ClaimTypeReferenceId="identityProvider" PartnerClaimType="iss" /> <OutputClaim ClaimTypeReferenceId="email" PartnerClaimType="email" /> <OutputClaim ClaimTypeReferenceId="otherMails" PartnerClaimType="mail" /> </OutputClaims> <OutputClaimsTransformations> <OutputClaimsTransformation ReferenceId="CreateRandomUPNUserName" /> <OutputClaimsTransformation ReferenceId="CreateUserPrincipalName" /> <OutputClaimsTransformation ReferenceId="CreateAlternativeSecurityId" /> <OutputClaimsTransformation ReferenceId="CreateSubjectClaimFromAlternativeSecurityId" /> </OutputClaimsTransformations> <UseTechnicalProfileForSessionManagement ReferenceId="SM-SocialLogin" /> </TechnicalProfile> </TechnicalProfiles> </ClaimsProvider> <ClaimsProvider> <DisplayName>Local Account SignIn</DisplayName> <TechnicalProfiles> <TechnicalProfile Id="login-NonInteractive"> <Metadata> <Item Key="client_id">a37a58e7-e96a-4365-bb8e-169bee86dde1</Item> <Item Key="IdTokenAudience">92114a5a-99df-40c8-8b08-228616b18c57</Item> </Metadata> <InputClaims> <InputClaim ClaimTypeReferenceId="client_id" DefaultValue="a37a58e7-e96a-4365-bb8e-169bee86dde1" /> <InputClaim ClaimTypeReferenceId="resource_id" PartnerClaimType="resource" DefaultValue="92114a5a-99df-40c8-8b08-228616b18c57" /> </InputClaims> </TechnicalProfile> </TechnicalProfiles> </ClaimsProvider> <ClaimsProvider> <DisplayName>Azure Active Directory - SSO Control</DisplayName> <TechnicalProfiles> <TechnicalProfile Id="AAD-Common"> <Metadata> <Item Key="ApplicationObjectId">3cc7f330-2408-4999-ab68-b74d6feccdf1</Item> <Item Key="ClientId">06c3fab4-3e2a-4d1d-9a78-8954da5d364f</Item> </Metadata> </TechnicalProfile> <TechnicalProfile Id="AAD-UserReadUsingEmailAddress-Takeover"> <Metadata> <Item Key="Operation">Read</Item> <Item Key="RaiseErrorIfClaimsPrincipalDoesNotExist">false</Item> </Metadata> <IncludeInSso>false</IncludeInSso> <InputClaims> <InputClaim ClaimTypeReferenceId="email" PartnerClaimType="signInNames.emailAddress" Required="true" /> </InputClaims> <OutputClaims> <OutputClaim ClaimTypeReferenceId="objectId" /> <OutputClaim ClaimTypeReferenceId="extension_ssoMigrated" /> </OutputClaims> <IncludeTechnicalProfile ReferenceId="AAD-Common" /> </TechnicalProfile> <TechnicalProfile Id="AAD-LinkSSOToExistingUser"> <Metadata> <Item Key="Operation">Write</Item> <Item Key="RaiseErrorIfClaimsPrincipalDoesNotExist">true</Item> </Metadata> <IncludeInSso>false</IncludeInSso> <InputClaims> <InputClaim ClaimTypeReferenceId="objectId" Required="true" /> </InputClaims> <PersistedClaims> <PersistedClaim ClaimTypeReferenceId="objectId" /> <PersistedClaim ClaimTypeReferenceId="alternativeSecurityId" /> <PersistedClaim ClaimTypeReferenceId="extension_ssoMigrated" DefaultValue="true" /> </PersistedClaims> <OutputClaims> <OutputClaim ClaimTypeReferenceId="extension_ssoMigrated" DefaultValue="true" AlwaysUseDefaultValue="true" /> </OutputClaims> <IncludeTechnicalProfile ReferenceId="AAD-Common" /> <UseTechnicalProfileForSessionManagement ReferenceId="SM-AAD" /> </TechnicalProfile>  <TechnicalProfile Id="AAD-UserWriteUsingAlternativeSecurityId"> <PersistedClaims> <PersistedClaim ClaimTypeReferenceId="alternativeSecurityId" /> <PersistedClaim ClaimTypeReferenceId="userPrincipalName" /> <PersistedClaim ClaimTypeReferenceId="mailNickName" DefaultValue="unknown" /> <PersistedClaim ClaimTypeReferenceId="displayName" DefaultValue="unknown" /> <PersistedClaim ClaimTypeReferenceId="otherMails" /> <PersistedClaim ClaimTypeReferenceId="givenName" /> <PersistedClaim ClaimTypeReferenceId="surname" /> <PersistedClaim ClaimTypeReferenceId="email" PartnerClaimType="signInNames.emailAddress" /> </PersistedClaims> </TechnicalProfile>  <TechnicalProfile Id="AAD-UserWriteUsingLogonEmail"> <Metadata> <Item Key="UserMessageIfClaimsPrincipalAlreadyExists">An account already exists for this email via SSO. Please sign in using SSO instead of creating a local account.</Item> </Metadata> </TechnicalProfile>  <TechnicalProfile Id="AAD-ReadSsoMigratedFlag"> <Metadata> <Item Key="Operation">Read</Item> <Item Key="RaiseErrorIfClaimsPrincipalDoesNotExist">false</Item> </Metadata> <IncludeInSso>false</IncludeInSso> <InputClaims> <InputClaim ClaimTypeReferenceId="objectId" Required="true" /> </InputClaims> <OutputClaims> <OutputClaim ClaimTypeReferenceId="extension_ssoMigrated" /> </OutputClaims> <IncludeTechnicalProfile ReferenceId="AAD-Common" /> </TechnicalProfile> </TechnicalProfiles> </ClaimsProvider>  <ClaimsProvider> <DisplayName>Local Account Password Reset</DisplayName> <TechnicalProfiles> <TechnicalProfile Id="ForgotPassword"> <DisplayName>Forgot your password?</DisplayName> <Protocol Name="Proprietary" Handler="Web.TPEngine.Providers.ClaimsTransformationProtocolProvider, Web.TPEngine, Version=1.0.0.0, Culture=neutral, PublicKeyToken=null" /> <OutputClaims> <OutputClaim ClaimTypeReferenceId="isForgotPassword" DefaultValue="true" AlwaysUseDefaultValue="true" /> </OutputClaims> <UseTechnicalProfileForSessionManagement ReferenceId="SM-Noop" /> </TechnicalProfile> </TechnicalProfiles> </ClaimsProvider> <ClaimsProvider> <DisplayName>SSO Migration Check</DisplayName> <TechnicalProfiles> <TechnicalProfile Id="ThrowSsoMigratedError"> <DisplayName>Block local password journeys</DisplayName> <Protocol Name="Proprietary" Handler="Web.TPEngine.Providers.ClaimsTransformationProtocolProvider, Web.TPEngine, Version=1.0.0.0, Culture=neutral, PublicKeyToken=null" /> <InputClaims> <InputClaim ClaimTypeReferenceId="extension_ssoMigrated" /> </InputClaims> <OutputClaims> <OutputClaim ClaimTypeReferenceId="extension_ssoMigrated" /> </OutputClaims> <OutputClaimsTransformations> <OutputClaimsTransformation ReferenceId="AssertNotSsoMigrated" /> </OutputClaimsTransformations> </TechnicalProfile> </TechnicalProfiles> </ClaimsProvider> <ClaimsProvider> <DisplayName>Local Account</DisplayName> <TechnicalProfiles> <TechnicalProfile Id="SelfAsserted-LocalAccountSignin-Email"> <Metadata>  <Item Key="SignUpTarget">SignUpWithLogonEmailExchange</Item> <Item Key="setting.operatingMode">Email</Item> <Item Key="setting.forgotPasswordLinkOverride">ForgotPasswordExchange</Item> <Item Key="ContentDefinitionReferenceId">api.localaccountsignin</Item> <Item Key="IncludeClaimResolvingInClaimsHandling">true</Item> <Item Key="UserMessageIfClaimsTransformationBooleanValueIsNotEqual">This account has been migrated to SSO. Please sign in with SSO instead.</Item> </Metadata> <IncludeInSso>false</IncludeInSso> <InputClaims> <InputClaim ClaimTypeReferenceId="signInName" DefaultValue="{OIDC:LoginHint}" AlwaysUseDefaultValue="true" /> </InputClaims> <OutputClaims> <OutputClaim ClaimTypeReferenceId="signInName" Required="true" /> <OutputClaim ClaimTypeReferenceId="password" Required="true" /> <OutputClaim ClaimTypeReferenceId="objectId" /> <OutputClaim ClaimTypeReferenceId="authenticationSource" /> <OutputClaim ClaimTypeReferenceId="extension_ssoMigrated" /> </OutputClaims> <ValidationTechnicalProfiles>  <ValidationTechnicalProfile ReferenceId="login-NonInteractive" />  <ValidationTechnicalProfile ReferenceId="AAD-ReadSsoMigratedFlag" />  <ValidationTechnicalProfile ReferenceId="ThrowSsoMigratedError"> <Preconditions> <Precondition Type="ClaimsExist" ExecuteActionsIf="false"> <Value>extension_ssoMigrated</Value> <Action>SkipThisValidationTechnicalProfile</Action> </Precondition> <Precondition Type="ClaimEquals" ExecuteActionsIf="false"> <Value>extension_ssoMigrated</Value> <Value>True</Value> <Action>SkipThisValidationTechnicalProfile</Action> </Precondition> </Preconditions> </ValidationTechnicalProfile> </ValidationTechnicalProfiles> <UseTechnicalProfileForSessionManagement ReferenceId="SM-AAD" /> </TechnicalProfile> <TechnicalProfile Id="LocalAccountDiscoveryUsingEmailAddress"> <Metadata> <Item Key="UserMessageIfClaimsTransformationBooleanValueIsNotEqual">This account has been migrated to SSO. Password reset is not available. Please sign in with SSO.</Item> </Metadata> <OutputClaims> <OutputClaim ClaimTypeReferenceId="email" PartnerClaimType="Verified.Email" Required="true" /> <OutputClaim ClaimTypeReferenceId="objectId" /> <OutputClaim ClaimTypeReferenceId="userPrincipalName" /> <OutputClaim ClaimTypeReferenceId="authenticationSource" /> <OutputClaim ClaimTypeReferenceId="extension_ssoMigrated" /> </OutputClaims> <ValidationTechnicalProfiles>  <ValidationTechnicalProfile ReferenceId="AAD-UserReadUsingEmailAddress" />  <ValidationTechnicalProfile ReferenceId="AAD-ReadSsoMigratedFlag" />  <ValidationTechnicalProfile ReferenceId="ThrowSsoMigratedError"> <Preconditions> <Precondition Type="ClaimsExist" ExecuteActionsIf="false"> <Value>extension_ssoMigrated</Value> <Action>SkipThisValidationTechnicalProfile</Action> </Precondition> <Precondition Type="ClaimEquals" ExecuteActionsIf="false"> <Value>extension_ssoMigrated</Value> <Value>True</Value> <Action>SkipThisValidationTechnicalProfile</Action> </Precondition> </Preconditions> </ValidationTechnicalProfile> </ValidationTechnicalProfiles> </TechnicalProfile> <TechnicalProfile Id="LocalAccountWritePasswordUsingObjectId"> <UseTechnicalProfileForSessionManagement ReferenceId="SM-AAD" /> </TechnicalProfile> </TechnicalProfiles> </ClaimsProvider> </ClaimsProviders> <UserJourneys> <UserJourney Id="CustomSignUpOrSignIn"> <OrchestrationSteps> <OrchestrationStep Order="1" Type="CombinedSignInAndSignUp" ContentDefinitionReferenceId="api.signuporsignin"> <ClaimsProviderSelections> <ClaimsProviderSelection ValidationClaimsExchangeId="LocalAccountSigninEmailExchange" /> <ClaimsProviderSelection TargetClaimsExchangeId="AzureCommon-AAD-Exchange" /> <ClaimsProviderSelection TargetClaimsExchangeId="ForgotPasswordExchange" /> </ClaimsProviderSelections> <ClaimsExchanges> <ClaimsExchange Id="LocalAccountSigninEmailExchange" TechnicalProfileReferenceId="SelfAsserted-LocalAccountSignin-Email" /> </ClaimsExchanges> </OrchestrationStep>  <OrchestrationStep Order="2" Type="ClaimsExchange"> <Preconditions> <Precondition Type="ClaimsExist" ExecuteActionsIf="true"> <Value>objectId</Value> <Action>SkipThisOrchestrationStep</Action> </Precondition> </Preconditions> <ClaimsExchanges> <ClaimsExchange Id="SignUpWithLogonEmailExchange" TechnicalProfileReferenceId="LocalAccountSignUpWithLogonEmail" /> <ClaimsExchange Id="AzureCommon-AAD-Exchange" TechnicalProfileReferenceId="AADBSI-OpenIdConnect" /> <ClaimsExchange Id="ForgotPasswordExchange" TechnicalProfileReferenceId="ForgotPassword" /> </ClaimsExchanges> </OrchestrationStep>  <OrchestrationStep Order="3" Type="ClaimsExchange"> <Preconditions> <Precondition Type="ClaimsExist" ExecuteActionsIf="false"> <Value>isForgotPassword</Value> <Action>SkipThisOrchestrationStep</Action> </Precondition> <Precondition Type="ClaimEquals" ExecuteActionsIf="false"> <Value>isForgotPassword</Value> <Value>true</Value> <Action>SkipThisOrchestrationStep</Action> </Precondition> </Preconditions> <ClaimsExchanges> <ClaimsExchange Id="PasswordResetUsingEmailAddressExchange" TechnicalProfileReferenceId="LocalAccountDiscoveryUsingEmailAddress" /> </ClaimsExchanges> </OrchestrationStep> <OrchestrationStep Order="4" Type="ClaimsExchange"> <Preconditions> <Precondition Type="ClaimsExist" ExecuteActionsIf="false"> <Value>isForgotPassword</Value> <Action>SkipThisOrchestrationStep</Action> </Precondition> <Precondition Type="ClaimEquals" ExecuteActionsIf="false"> <Value>isForgotPassword</Value> <Value>true</Value> <Action>SkipThisOrchestrationStep</Action> </Precondition> </Preconditions> <ClaimsExchanges> <ClaimsExchange Id="NewCredentials" TechnicalProfileReferenceId="LocalAccountWritePasswordUsingObjectId" /> </ClaimsExchanges> </OrchestrationStep>  <OrchestrationStep Order="5" Type="ClaimsExchange"> <Preconditions> <Precondition Type="ClaimEquals" ExecuteActionsIf="true"> <Value>isForgotPassword</Value> <Value>true</Value> <Action>SkipThisOrchestrationStep</Action> </Precondition> <Precondition Type="ClaimEquals" ExecuteActionsIf="true"> <Value>authenticationSource</Value> <Value>localAccountAuthentication</Value> <Action>SkipThisOrchestrationStep</Action> </Precondition> </Preconditions> <ClaimsExchanges> <ClaimsExchange Id="AADUserReadUsingAlternativeSecurityId" TechnicalProfileReferenceId="AAD-UserReadUsingAlternativeSecurityId-NoError" /> </ClaimsExchanges> </OrchestrationStep>  <OrchestrationStep Order="6" Type="ClaimsExchange"> <Preconditions> <Precondition Type="ClaimEquals" ExecuteActionsIf="true"> <Value>authenticationSource</Value> <Value>localAccountAuthentication</Value> <Action>SkipThisOrchestrationStep</Action> </Precondition> <Precondition Type="ClaimsExist" ExecuteActionsIf="true"> <Value>objectId</Value> <Action>SkipThisOrchestrationStep</Action> </Precondition> <Precondition Type="ClaimsExist" ExecuteActionsIf="false"> <Value>email</Value> <Action>SkipThisOrchestrationStep</Action> </Precondition> </Preconditions> <ClaimsExchanges> <ClaimsExchange Id="ReadByEmail" TechnicalProfileReferenceId="AAD-UserReadUsingEmailAddress-Takeover" /> </ClaimsExchanges> </OrchestrationStep>  <OrchestrationStep Order="7" Type="ClaimsExchange"> <Preconditions> <Precondition Type="ClaimEquals" ExecuteActionsIf="true"> <Value>isForgotPassword</Value> <Value>true</Value> <Action>SkipThisOrchestrationStep</Action> </Precondition> <Precondition Type="ClaimsExist" ExecuteActionsIf="false"> <Value>objectId</Value> <Action>SkipThisOrchestrationStep</Action> </Precondition> <Precondition Type="ClaimEquals" ExecuteActionsIf="true"> <Value>authenticationSource</Value> <Value>localAccountAuthentication</Value> <Action>SkipThisOrchestrationStep</Action> </Precondition> <Precondition Type="ClaimEquals" ExecuteActionsIf="true"> <Value>extension_ssoMigrated</Value> <Value>True</Value> <Action>SkipThisOrchestrationStep</Action> </Precondition> </Preconditions> <ClaimsExchanges> <ClaimsExchange Id="LinkSSO" TechnicalProfileReferenceId="AAD-LinkSSOToExistingUser" /> </ClaimsExchanges> </OrchestrationStep>  <OrchestrationStep Order="8" Type="ClaimsExchange"> <Preconditions> <Precondition Type="ClaimsExist" ExecuteActionsIf="true"> <Value>objectId</Value> <Action>SkipThisOrchestrationStep</Action> </Precondition> </Preconditions> <ClaimsExchanges> <ClaimsExchange Id="SelfAsserted-Social" TechnicalProfileReferenceId="SelfAsserted-Social" /> </ClaimsExchanges> </OrchestrationStep>  <OrchestrationStep Order="9" Type="ClaimsExchange"> <Preconditions> <Precondition Type="ClaimEquals" ExecuteActionsIf="true"> <Value>authenticationSource</Value> <Value>socialIdpAuthentication</Value> <Action>SkipThisOrchestrationStep</Action> </Precondition> </Preconditions> <ClaimsExchanges> <ClaimsExchange Id="AADUserReadWithObjectId" TechnicalProfileReferenceId="AAD-UserReadUsingObjectId" /> </ClaimsExchanges> </OrchestrationStep>  <OrchestrationStep Order="10" Type="ClaimsExchange"> <Preconditions> <Precondition Type="ClaimEquals" ExecuteActionsIf="true"> <Value>isForgotPassword</Value> <Value>true</Value> <Action>SkipThisOrchestrationStep</Action> </Precondition> <Precondition Type="ClaimsExist" ExecuteActionsIf="true"> <Value>objectId</Value> <Action>SkipThisOrchestrationStep</Action> </Precondition> </Preconditions> <ClaimsExchanges> <ClaimsExchange Id="AADUserWrite" TechnicalProfileReferenceId="AAD-UserWriteUsingAlternativeSecurityId" /> </ClaimsExchanges> </OrchestrationStep> <OrchestrationStep Order="11" Type="SendClaims" CpimIssuerTechnicalProfileReferenceId="JwtIssuer" /> </OrchestrationSteps> <ClientDefinition ReferenceId="DefaultWeb" /> </UserJourney> </UserJourneys> </TrustFrameworkPolicy>

Key takeaways

Migration happens silently on first SSO login
Only migrated users are restricted (not all users)
Forgot password remains fully functional for non-migrated users
A single flag (extension_ssoMigrated) controls everything
No duplication, no confusion, no manual effort

This approach removes the usual friction of identity migration.

Users don’t need to “move” to SSO—the system does it for them, automatically and securely, while preserving a smooth experience for those who haven’t migrated yet.

Enabling Agentic Data Governance with Hybrid Cloud Flexibility in Azure

Moaz_Mirza — Thu, 23 Apr 2026 18:17:35 GMT

The “Why”

Do you manage data in a complex multi-cloud environment? Are you struggling with data silos, evolving regulations, and the pressure to maintain control and compliance across on-prem and multiple clouds? Do you ever wish an intelligent assistant could help shoulder the load of data governance? If so, I can relate. Let me tell you a story that might sound familiar.

Meet Mark (pictured above). He is a data governance officer at Contoso (a fictional but very representative enterprise). Mark’s day job is ensuring data governance and compliance across his company’s vast hybrid cloud estate – think around ~2 million data assets sprawled across 12+ datacenters on-premises and in different public clouds. Regulatory requirements are constantly shifting. Customer data is increasingly sensitive. Each department and region has its own way of doing things. Mark is fighting an uphill battle with data silos and disconnected cloud operations. He bounces between a patchwork of tools – spreadsheets, cloud consoles, governance portals – trying to answer basic questions:

Where is our data?

Who’s using it?

Are we in compliance?

Armed with an old desk calculator and a pile of paper-based reports (a perfect 1990s backdrop), he is dealing with the data around him that has exploded in volume and complexity.

What if Mark had a single pane of glass. The glass that reflects and acts. It reflects your governance state and enforces compliance – a self-hydrating pane of glass accompanied by a conversational AI.

And he’s not alone. We’re all living in a data overload era. Every day, organizations generate and ingest more information than ever before. Transistors and mainframes gave way to the internet boom of the ’90s, then an explosion of mobile devices in the 2000s, social media in the 2010s, and now widespread cloud computing – all funneling data into our systems at an exponential rate. On top of that, a new wave of AI and conversational interfaces has arrived here in the mid-2020s, making data more accessible but also increasing expectations for real-time insight. It’s no wonder modern IT leaders feel overwhelmed.

But these challenges are also opportunities. The way I see it, the incredible growth of data and cloud capabilities means we have a chance to reimagine data governance. The fact that I’m writing about this right now is no coincidence. My customers are looking to resolve problems in this space. In my conversations with them, I hear the same needs: We want better governance, more visibility, streamlined oversight… and cherry on top, we want it in an “agentic” fashion. In other words, they want to delegate the grunt work to the platform toolset augmented by AI, so they can focus on higher-value tasks.

The “What”

That vision – agentic data governance with hybrid cloud flexibility – became the driver for this work. This is a modular solution, and you have these building block style components (cloud services, governance tools, AI agents), which you can snap them together into an intended solution. Think of it as a jumpstart kit for continuous data governance across multiple clouds, with autonomous (“agentic”) assistance baked in that you can leverage and build upon. It’s not the final, productized solution – more a vision of what’s possible.

Contoso’s Requirements

These are the high-level requirements from Contoso:

Data governance across clouds under one roof
A single pane of glass dashboard consolidating reporting on the 5 governance domains:

o Visibility on data residency and lineage

o PII (Personally Identifiable Information) must run on a CC (Confidential Compute)

o Security software (Defender) compliance

o Resource tagging compliance (foundational for a good governance posture)

o OS updates compliance

Ability to enforce compliance in an agentic manner with a human in the loop
Agentic enforcement of compliance pertaining to residency and confidential compute

Solution – The breakdown

The solution is comprised of 8 modules addressing these requirements. These solution modules are:

Foundational (Landing zones, Data Sources, Operational setup, Policies, etc.)
Dashboard Hydration + Agentic Reporting – Residency Compliance
Dashboard Hydration + Agentic Reporting – Confidential Compute for PII Compliance
Dashboard Hydration + Agentic Reporting – MS Defender Compliance
Dashboard Hydration + Agentic Reporting – Resource Tag Compliance
Dashboard Hydration + Agentic Reporting – OS Updates/Patch Compliance
Enforce Compliance via Copilot Agent - Residency Compliance
Enforce Compliance via Copilot Agent – CC PII Compliance

Solution – The architecture view

These are the main technical components that make up the solution architecture:

Data sources of all shapes and sizes on the left, governed by the native Azure or the Arc plane.
Additional Azure services across the bottom layer for the foundational governance posture
Microsoft Purview, in the top middle, as the unified data governance platform
Microsoft Fabric, in the bottom middle, as the end-to-end ingestion and analytics platform
Microsoft Power Platform, on the right, as the low code/no code business flow and the copilot agent experience

Solution – The end user view

So how does Mark see this solution as a data governance officer? He doesn’t see all the intricacies of the solution integration and the logic execution. He sees two things:

A Power BI dashboard running on Microsoft Fabric with

- A compliance dashboard with an overall score in each of the five compliance domains alongside scores for each of the data products across these domains
- Additional reporting views for more granular reporting
- Fabric-based pipeline that hydrates the underlying semantic models from various sources to keep the reports fresh and current

A Copilot agent (in Teams) for both:

- Reporting on all compliance domains
- Enforcing in-scope compliance across selected domains

The agent takes care of it - queries Fabric’s semantic model, calls Azure Function endpoints, updates Purview glossary terms, applies Azure tags, and sends Teams notifications.

The “How” – Residency Compliance

Let’s pick a few modules to walk through how these solution modules work together to give a cohesive agentic governance experience to Mark.

It’s Monday morning, and Mark logs into the Contoso governance portal with a cup of coffee in hand. Instead of a dozen browser tabs, he has two main tools opened: the Data Governance Dashboard and the Contoso Governance Copilot agent.

To address some inquiries that came as an assigned action to him, he interacted with the agent. During this interaction, not only did he validate if there were any residency missing in the unified data governance platform (Purview), but he was also able to address a mismatch between Purview and Azure resource, based on the designed principles. Here is the snippet of the chat:

Now, under the hood, several components have worked on behalf of the agent in performing this governance checking and applying the necessary course of action:

Even before Mark's conversation with the agent, an ongoing hydration process keeps the Fabric Power BI dashboard up to date.

Dashboard Hydration + Agentic Reporting – Residency Compliance

A Fabric notebook runs the residency scorecard code block through a pipeline.
It reads two Lakehouse tables containing latest residency information from Purview and the approved region list
Then, the notebook gets a Microsoft Entra bearer token
Once acquired, the notebook then calls an Azure Function endpoint
This endpoint, then searches for the Azure resources associated with the data products in Purview using an Azure resource tag.
The notebook then compares the declared Purview residency with the approved region list and the associated resource’s region
The notebook then calculates the final 0 / 25 / 50 / 75 / 100 residency compliance score and a reason. For example: A data product without an associated Azure resource gets a 0, while a data product whose residency in Purview is an approved region by Contoso, and also matches with the associated Azure resource, gets a 100.
It then writes the results to the relevant residency compliance Lakehouse tables
The dedicated compliance table then feeds to the semantic model for reporting
The compliance Power BI dashboard is hydrated

Enforce Compliance via Copilot Agent - Residency Compliance

With the dashboard data regularly updated, the agent follows this logic, the updated reporting data, and the actions at its disposal, during the earlier conversation with Mark :

Mark initiates the conversation with the agent
The agent calls a Power Automate flow
This flow retrieves Purview’s residency information stored in the Fabric semantic model
5, 6, 7 and 8. When Mark asks to investigate further on a data product, the agent carries the conversation using a topic, which then leverages a flow, which uses a Power Automate custom connector to access an Azure Function endpoint. This endpoint then retrieves latest glossary (residency) information about the data product in question, from Purview, and provides a preview back to the user
10, 11, 12, and 13. If the update criteria are met, and if there is no conflict, and with Mark’s blessings, the topic then calls another flow to access the Functions Purview Update endpoint, and make the glossary (residency) update in Purview for that data product

The “How” – Confidential Compute for PII Compliance

Dashboard Hydration + Agentic Reporting – Confidential Compute for PII Compliance

The following snippet shows how Mark addresses the compliance risk with a critical data product (application), S/4 HANA, and performed the necessary compliance actions, such as tagging the associated resources and notifying the data product owners via Teams channel.

The following diagram shows the under-the-hood hydration flow for confidential compute compliance:

Enforce Compliance via Copilot Agent – CC PII Compliance

Finally, the diagram below shows how Mark’s conversation flows through the main solution components:

Outcome

Stepping back, what did we accomplish for Mark and Contoso? We turned an onslaught of governance challenges into an opportunity to modernize how data is managed. This gave Mark:

Centralized Visibility into data assets across the landscape through Purview and a unified dashboard
Proactive compliance enabled with automated checks - controlled with Purview exports and Fabric pipeline schedules
And compliance enforcement using an agent
Hybrid Cloud Consistency. By using Azure Arc and a foundational data plane management setup
Reduced Operational overhead with agentic reporting and compliance

Though the solution is comprised of wide variety of components/services, it is built from standard building blocks and is relatively simple to implement. In total, the solution combined around a dozen Azure services and over 40 distinct components (from Purview catalogs to data pipelines, to custom functions and flows). You can choose to implement some or all the compliance domains. Or, better yet, build upon and create new domains and pave new paths.

Wrap-up

I believe many enterprises could take a similar journey. If you’re facing these issues, consider this an invitation to think differently about data governance. Start with the pieces you already have – your own building blocks of cloud services and data – and imagine what you could build. Chances are that a lot of the heavy lifting can be orchestrated with today’s technology. And with the rise of AI copilots, the dream of agentic data governance – where your policies are continuously enforced by smart agents – is no longer science fiction. It’s here, right now, waiting for you to take it for a spin.

Next steps

Watch the video narrative on SAP on Azure YouTube channel:
Build it with the GitHub Repository: https://github.com/moazmirza/data-sov-and-hyb-cloud
Comments/questions: Here, or @ LinkedIn /moazmirza

Solution Selfies

Azure Policy Compliance - Foundational Governance Posture

Purview Data Product Catalog and Data Lineage

Purview Governance Metadata à Fabric Lakehouse

Fabric Semantic Model

Additional Fabric Power BI Dashboard

Copilot Studio Topic Flow

Azure Function Endpoints