<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>Azure Architecture Blog articles</title>
    <link>https://techcommunity.microsoft.com/t5/azure-architecture-blog/bg-p/AzureArchitectureBlog</link>
    <description>Azure Architecture Blog articles</description>
    <pubDate>Sun, 31 May 2026 14:15:56 GMT</pubDate>
    <dc:creator>AzureArchitectureBlog</dc:creator>
    <dc:date>2026-05-31T14:15:56Z</dc:date>
    <item>
      <title>Cloud Native Platforms: Build</title>
      <link>https://techcommunity.microsoft.com/t5/azure-architecture-blog/cloud-native-platforms-build/ba-p/4519605</link>
      <description>&lt;P&gt;&lt;STRONG&gt;Audience:&lt;/STRONG&gt; Cloud architects, platform engineers, engineering leaders making design decisions&lt;/P&gt;
&lt;P&gt;&lt;STRONG&gt;Reading time:&lt;/STRONG&gt; 8 minutes&lt;/P&gt;
&lt;P&gt;&lt;STRONG&gt;Series:&lt;/STRONG&gt; Cloud Native Platforms. Build, Run, Evolve. This is Part 1 of 3.&lt;/P&gt;
&lt;HR /&gt;
&lt;P&gt;Most engineering teams can build systems.&lt;/P&gt;
&lt;P&gt;Few can scale them without rebuilding them.&lt;/P&gt;
&lt;P&gt;As platforms grow, complexity does not increase linearly. It multiplies across users, services, tenants, regions, and integrations. The systems that struggle and the systems that scale are rarely separated by which cloud they run on. They are separated by a handful of design choices made early and applied consistently.&lt;/P&gt;
&lt;P&gt;This post is about those choices.&lt;/P&gt;
&lt;H2&gt;The differentiator is not the cloud&lt;/H2&gt;
&lt;P&gt;Scalable platforms are not built with the right tools. They are built with the right design choices.&lt;/P&gt;
&lt;P&gt;Cloud services have closed the gap on infrastructure. The differentiator is no longer which managed service a team picks. It is whether the platform is designed to absorb change, tolerate failure, and support visibility from day one. Five engineering disciplines determine whether a platform scales gracefully or collects technical debt while it grows.&lt;/P&gt;
&lt;P&gt;&lt;EM&gt;Figure 1. The five disciplines compound into platform scale. Any one neglected becomes the constraint that forces a rewrite later.&lt;/EM&gt;&lt;/P&gt;
&lt;H2&gt;1. Flexibility is the foundation of scale&lt;/H2&gt;
&lt;P&gt;Hard-coded systems work until they do not. The first request to add a tenant, a region, a SKU (a sellable product variant), or a regulatory variant is the moment a rigid design starts to bend. Each subsequent request adds weight.&lt;/P&gt;
&lt;P&gt;Scalable platforms move behavior out of code:&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;Configuration replaces conditional logic&lt;/LI&gt;
&lt;LI&gt;Feature flags enable safer, tenant-scoped rollouts&lt;/LI&gt;
&lt;LI&gt;APIs evolve through versioning, not breaking changes&lt;/LI&gt;
&lt;LI&gt;Schemas evolve additively. Breaking changes go through versioned contracts with a deprecation window long enough that consumers can migrate without downtime.&lt;/LI&gt;
&lt;/UL&gt;
&lt;P&gt;&lt;STRONG&gt;In practice&lt;/STRONG&gt;&lt;/P&gt;
&lt;P&gt;The pattern that works: configuration in a managed store, feature flags with tenant scope, and APIs versioned per consumer contract. The cost is the discipline of treating configuration as code (versioned, reviewed, audited). The return is that releases stop being events and start being routine. A change that previously needed a coordinated deployment can be executed in minutes, gated to a single tenant for verification, and rolled out broadly only after the signal is clean. Most platforms reach this state by retrofit, not by design. Doing it earlier costs less than waiting.&lt;/P&gt;
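&lt;P&gt;As a minimal sketch of tenant-scoped flags, assuming flags have already been loaded from a managed configuration store into memory: the &lt;CODE&gt;FeatureFlag&lt;/CODE&gt; and &lt;CODE&gt;FeatureFlagResolver&lt;/CODE&gt; types below are hypothetical names for illustration, not part of any specific library. A tenant override wins over the global default, and an unknown flag resolves to off.&lt;/P&gt;
&lt;LI-CODE lang="csharp"&gt;using System;
using System.Collections.Generic;

// Hypothetical flag shape: a global default plus per-tenant overrides.
public sealed record FeatureFlag(
    string Name,
    bool DefaultEnabled,
    IReadOnlyDictionary&amp;lt;string, bool&amp;gt; TenantOverrides);

public sealed class FeatureFlagResolver
{
    private readonly Dictionary&amp;lt;string, FeatureFlag&amp;gt; _flags =
        new(StringComparer.Ordinal);

    public FeatureFlagResolver(IEnumerable&amp;lt;FeatureFlag&amp;gt; flags)
    {
        foreach (var flag in flags) _flags[flag.Name] = flag;
    }

    // Tenant override wins; otherwise the global default applies.
    // Unknown flags resolve to false so a missing entry never enables code.
    public bool IsEnabled(string flagName, string tenantId)
    {
        if (!_flags.TryGetValue(flagName, out var flag)) return false;
        return flag.TenantOverrides.TryGetValue(tenantId, out var enabled)
            ? enabled
            : flag.DefaultEnabled;
    }
}
&lt;/LI-CODE&gt;
&lt;P&gt;With a flag that defaults to off and a single override for one tenant, only that tenant sees the new behaviour. Flipping the default later rolls the change out broadly without a redeploy.&lt;/P&gt;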
&lt;BLOCKQUOTE&gt;
&lt;P&gt;&lt;EM&gt;If a change requires a redeploy, it should require a very good reason.&lt;/EM&gt;&lt;/P&gt;
&lt;/BLOCKQUOTE&gt;
&lt;H2&gt;2. Failures are normal. Resilience is a choice.&lt;/H2&gt;
&lt;P&gt;Distributed systems will fail in unpredictable ways. The real question is not how to prevent failure. It is how the system responds when failure happens.&lt;/P&gt;
&lt;P&gt;Resilience is engineered, not inherited from the platform. The patterns that move the needle are well known; the difference comes from applying them consistently:&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;&lt;STRONG&gt;Idempotent operations&lt;/STRONG&gt; (safe to call multiple times with the same result) that make retries safe&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;Reliable messaging patterns such as the transactional outbox&lt;/STRONG&gt; (writing the message in the same database transaction as the business change, then publishing asynchronously) to avoid lost events. Publishing is at-least-once, so consumers still need to tolerate duplicates.&lt;/LI&gt;
&lt;LI&gt;Decoupled services that contain &lt;STRONG&gt;blast radius&lt;/STRONG&gt; (the scope of damage when one component fails)&lt;/LI&gt;
&lt;LI&gt;Timeouts, retries, and &lt;STRONG&gt;circuit breakers&lt;/STRONG&gt; (a wrapper around a dependency that stops calling it for a cool-off window after repeated failures) tuned per dependency&lt;/LI&gt;
&lt;LI&gt;Bulkheads (isolation pools, often a separate compute or queue lane per workload class) that keep noisy neighbours from starving critical paths of resources&lt;/LI&gt;
&lt;/UL&gt;
&lt;P&gt;&lt;STRONG&gt;In practice&lt;/STRONG&gt;&lt;/P&gt;
&lt;P&gt;The pattern that works: every write that can be retried carries an idempotency key, every queue consumer is safe to replay, and every event published goes through an outbox in the same transactional unit as the business change. When peak load triggers retries, duplicates collapse cleanly instead of producing duplicate orders, double-charged customers, or split-brain state. The contract with callers changes as well: callers can retry without thinking, queues can be at-least-once instead of exactly-once, and recovery moves from a manual cleanup task to a property of the system. Most teams that adopt this pattern stop seeing certain classes of incident entirely.&lt;/P&gt;
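&lt;P&gt;The circuit breaker from the list above can be sketched in a few lines. This is an illustrative, single-threaded version under stated assumptions: the &lt;CODE&gt;CircuitBreaker&lt;/CODE&gt; type, its threshold, and its cool-off parameter are hypothetical, and the half-open state is approximated by letting one trial call through after the cool-off. A production service would more likely rely on a maintained resilience library and add thread safety.&lt;/P&gt;
&lt;LI-CODE lang="csharp"&gt;using System;

public enum CircuitState { Closed, Open }

// Counts consecutive failures and rejects calls for a cool-off window
// once the threshold is reached. Not thread-safe; a sketch only.
public sealed class CircuitBreaker
{
    private readonly int _failureThreshold;
    private readonly TimeSpan _coolOff;
    private readonly Func&amp;lt;DateTimeOffset&amp;gt; _clock;
    private int _consecutiveFailures;
    private DateTimeOffset _openedAt;

    public CircuitBreaker(int failureThreshold, TimeSpan coolOff,
        Func&amp;lt;DateTimeOffset&amp;gt;? clock = null)
    {
        _failureThreshold = failureThreshold;
        _coolOff = coolOff;
        _clock = clock ?? (() =&amp;gt; DateTimeOffset.UtcNow);
    }

    public CircuitState State { get; private set; } = CircuitState.Closed;

    public T Execute&amp;lt;T&amp;gt;(Func&amp;lt;T&amp;gt; call)
    {
        if (State == CircuitState.Open)
        {
            if (_clock() - _openedAt &amp;lt; _coolOff)
                throw new InvalidOperationException("Circuit is open.");

            // Cool-off elapsed: allow one trial call. A single further
            // failure re-opens the circuit immediately.
            State = CircuitState.Closed;
            _consecutiveFailures = _failureThreshold - 1;
        }

        try
        {
            var result = call();
            _consecutiveFailures = 0;
            return result;
        }
        catch
        {
            _consecutiveFailures++;
            if (_consecutiveFailures &amp;gt;= _failureThreshold)
            {
                State = CircuitState.Open;
                _openedAt = _clock();
            }
            throw;
        }
    }
}
&lt;/LI-CODE&gt;
&lt;P&gt;Tuning per dependency then means one breaker instance per downstream, each with its own threshold and cool-off, rather than one shared setting for the whole service.&lt;/P&gt;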
&lt;H3&gt;Implementation note&lt;/H3&gt;
&lt;P&gt;An idempotent API is not just a design preference. It changes how the rest of the system can be built. Once writes are safe to repeat, retries become cheap, queues become trustworthy, and recovery becomes automatic.&lt;/P&gt;
&lt;P&gt;The naive implementation (read the key, if absent process and save) has a race. Two concurrent requests with the same key both miss the lookup, both call the processor, and both attempt to save. That is the failure mode idempotency exists to prevent. The pattern that survives production is an atomic reserve-then-execute: insert a row keyed by the idempotency key with a unique constraint before doing any work. The first writer wins. Concurrent callers either wait for the original to complete and read its result, or they receive a conflict response.&lt;/P&gt;
&lt;LI-CODE lang="csharp"&gt;using System;
using System.Threading;
using System.Threading.Tasks;

// Contract for the idempotency store. The two key methods are TryReserveAsync
// (atomic insert with unique-key constraint) and CompleteAsync (record the
// result of the first writer). GetCompletedResultAsync polls until the first
// writer commits, or throws (surfaced to the caller as 409 Conflict) if the
// in-flight window exceeds the configured deadline.
public interface IIdempotencyStore
{
    Task&amp;lt;Reservation&amp;gt; TryReserveAsync(
        string idempotencyKey, string requestHash, CancellationToken ct);

    Task CompleteAsync(
        string idempotencyKey, OrderResult result, CancellationToken ct);

    Task&amp;lt;OrderResult&amp;gt; GetCompletedResultAsync(
        string idempotencyKey, CancellationToken ct,
        TimeSpan? maxWait = null);
}

public readonly record struct Reservation(
    bool IsFirstWriter, string RequestHash);

// Idempotency via atomic reserve-then-execute.
// First writer wins; replays return the original result; concurrent
// duplicates lose the race and read the winner's outcome (or get 409).
public async Task&amp;lt;OrderResult&amp;gt; CreateOrderAsync(
    Order order, string idempotencyKey, CancellationToken ct)
{
    var requestHash = StableHash(order); // canonical content hash

    // Atomic insert: succeeds for the first caller, fails for the rest.
    var reserved = await _store.TryReserveAsync(
        idempotencyKey, requestHash, ct);

    if (!reserved.IsFirstWriter)
    {
        if (reserved.RequestHash != requestHash)
            throw new IdempotencyKeyReusedException();

        // A previous run committed (return its result) or is in-flight
        // (poll with a bounded deadline; 409 if exceeded).
        return await _store.GetCompletedResultAsync(
            idempotencyKey, ct, maxWait: TimeSpan.FromSeconds(5));
    }

    // We are the first writer. Execute, persist, mark complete.
    var result = await _processor.ProcessAsync(order, ct);
    await _store.CompleteAsync(idempotencyKey, result, ct);
    return result;
}

&lt;/LI-CODE&gt;
&lt;P&gt;Three production details matter:&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;&lt;STRONG&gt;TTL or compaction on the idempotency record.&lt;/STRONG&gt; Without it, the store grows forever. Most teams retain records for the request retry window plus a safety margin (commonly 24 to 72 hours).&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;Stable content hash, not the default object hash code.&lt;/STRONG&gt; The request hash detects key reuse with a different body, so a client that reuses an idempotency key with a different payload receives &lt;CODE&gt;IdempotencyKeyReusedException&lt;/CODE&gt; rather than silently getting the wrong result. Canonicalise field ordering, locale, and null handling before hashing.&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;Bound the in-flight window explicitly.&lt;/STRONG&gt; The genuinely hard case is when the processor succeeded but the store write failed. Production-grade implementations either run the side-effect and the store write in the same transaction (when the processor and store share a database) or use the transactional outbox pattern to bridge them. The poll-with-deadline in &lt;CODE&gt;GetCompletedResultAsync&lt;/CODE&gt; handles the duplicate-arrives-mid-flight case; the transactional boundary handles everything else.&lt;/LI&gt;
&lt;/UL&gt;
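&lt;P&gt;One possible shape for the stable content hash, as a sketch rather than the implementation behind &lt;CODE&gt;StableHash&lt;/CODE&gt; above: the post does not define the fields of &lt;CODE&gt;Order&lt;/CODE&gt;, so the record below and the &lt;CODE&gt;OrderHashing&lt;/CODE&gt; helper are assumptions for illustration. Fields are written in a fixed order, numbers are formatted with the invariant culture, and null is encoded differently from an empty string.&lt;/P&gt;
&lt;LI-CODE lang="csharp"&gt;using System;
using System.Globalization;
using System.Security.Cryptography;
using System.Text;

// Hypothetical order shape, for illustration only.
public sealed record Order(string CustomerId, string Sku, int Quantity, string? Note);

public static class OrderHashing
{
    public static string StableHash(Order order)
    {
        // Fixed field order: the hash never depends on property enumeration.
        var canonical = new StringBuilder()
            .Append(Field("customerId", order.CustomerId))
            .Append(Field("sku", order.Sku))
            .Append(Field("quantity",
                order.Quantity.ToString(CultureInfo.InvariantCulture)))
            .Append(Field("note", order.Note))
            .ToString();

        var digest = SHA256.HashData(Encoding.UTF8.GetBytes(canonical));
        return Convert.ToHexString(digest);
    }

    // The length prefix keeps field boundaries unambiguous, and null is
    // encoded differently from the empty string and from the text "null".
    private static string Field(string name, string? value)
    {
        return value is null
            ? name + "=null;"
            : name + "=" + value.Length.ToString(CultureInfo.InvariantCulture)
                + ":" + value + ";";
    }
}
&lt;/LI-CODE&gt;
&lt;P&gt;Two requests with the same field values produce the same hash regardless of server locale, while a reused key with a changed quantity or note produces a different one and trips the key-reuse check.&lt;/P&gt;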
&lt;H2&gt;3. Observability is not optional&lt;/H2&gt;
&lt;P&gt;Without observability, teams operate blind. As systems grow, the price of guessing rises faster than the price of seeing.&lt;/P&gt;
&lt;P&gt;At build time, observability is a design property. The decisions made before the system reaches production are what determine whether it can be operated at all. The dashboards, alerts, and incident practices covered in Part 2 of this series rely on instrumentation choices made here.&lt;/P&gt;
&lt;P&gt;The build-time work that pays off in production:&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;Request identifiers propagated through every service hop, every queue, every async boundary, so a single user action can be traced end to end&lt;/LI&gt;
&lt;LI&gt;Structured logging with a consistent schema (event name, correlation id, tenant, severity) rather than free-form strings&lt;/LI&gt;
&lt;LI&gt;Metrics emitted at the boundaries that matter (every external call, every queue read or write, every database operation), not only at the entry point&lt;/LI&gt;
&lt;LI&gt;Tracing libraries integrated at the framework or middleware layer so coverage is automatic, not opt-in&lt;/LI&gt;
&lt;LI&gt;Schemas designed so business signals (orders, sessions, transactions) and system signals (CPU, latency, errors) share the same identifiers and can be correlated later&lt;/LI&gt;
&lt;/UL&gt;
&lt;P&gt;&lt;STRONG&gt;In practice&lt;/STRONG&gt;&lt;/P&gt;
&lt;P&gt;The pattern that works: a single request id flowing through every service hop, every queue, every async boundary, propagated automatically at the framework layer rather than per-call. Add one structured logging schema across services (event name, correlation id, tenant, severity), so that a single query joins business events with system events. The investment is hours of upfront framework wiring. The return is that production diagnosis stops being archaeology. Cross-service questions become single dashboards; postmortems shrink from days to hours; and the dashboards in Part 2 actually work because the data underneath is shaped to support them.&lt;/P&gt;
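&lt;P&gt;A minimal sketch of that wiring, assuming an ambient context based on &lt;CODE&gt;AsyncLocal&lt;/CODE&gt; so the id follows the request across awaits: &lt;CODE&gt;CorrelationContext&lt;/CODE&gt; and &lt;CODE&gt;StructuredLog&lt;/CODE&gt; are hypothetical names, and where the inbound id comes from (an HTTP header, a message property) is left to the framework layer. Every log entry carries the four schema fields named above.&lt;/P&gt;
&lt;LI-CODE lang="csharp"&gt;using System;
using System.Collections.Generic;
using System.Text.Json;
using System.Threading;

// Ambient correlation id that flows across async/await within one request.
public static class CorrelationContext
{
    private static readonly AsyncLocal&amp;lt;string?&amp;gt; Current = new();

    public static string CorrelationId
    {
        get
        {
            return Current.Value
                ?? throw new InvalidOperationException("No correlation id set.");
        }
    }

    // Reuse the inbound id when one arrived with the request or message;
    // otherwise start a new one. Call this once, at the service boundary.
    public static string Begin(string? inboundId)
    {
        var id = string.IsNullOrWhiteSpace(inboundId)
            ? Guid.NewGuid().ToString("N")
            : inboundId;
        Current.Value = id;
        return id;
    }
}

public static class StructuredLog
{
    // One schema for every service: event, correlation id, tenant, severity.
    public static string Format(string eventName, string tenant, string severity)
    {
        var entry = new Dictionary&amp;lt;string, string&amp;gt;
        {
            ["event"] = eventName,
            ["correlationId"] = CorrelationContext.CorrelationId,
            ["tenant"] = tenant,
            ["severity"] = severity,
        };
        return JsonSerializer.Serialize(entry);
    }
}
&lt;/LI-CODE&gt;
&lt;P&gt;Because the id is set once at the boundary and read implicitly by the logger, individual call sites never pass it around, which is what keeps coverage automatic rather than opt-in.&lt;/P&gt;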
&lt;H2&gt;4. Delivery practices set the ceiling&lt;/H2&gt;
&lt;P&gt;Scaling teams requires scaling delivery. Small inefficiencies in pipelines, environments, and release coordination compound into measurable drag.&lt;/P&gt;
&lt;P&gt;Delivery maturity that pays off at scale:&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;Pipelines as code, reviewed and versioned like application code&lt;/LI&gt;
&lt;LI&gt;Parallel deployments across services and regions where dependencies allow&lt;/LI&gt;
&lt;LI&gt;Infrastructure as code with shared modules, not hand-managed environments&lt;/LI&gt;
&lt;LI&gt;Automated quality gates: tests, security scans, dependency checks&lt;/LI&gt;
&lt;LI&gt;Trunk-based development (developers commit to a single shared branch many times a day) with short-lived feature branches and progressive delivery. &lt;STRONG&gt;Important caveat:&lt;/STRONG&gt; trunk-based works only when test automation and feature flags are already in place. Adopting it before those foundations exist tends to amplify production incidents rather than reduce them.&lt;/LI&gt;
&lt;/UL&gt;
&lt;P&gt;&lt;STRONG&gt;In practice&lt;/STRONG&gt;&lt;/P&gt;
&lt;P&gt;The pattern that works: pipelines run in parallel where dependencies allow, infrastructure provisioning is templated rather than per-environment, and quality gates run automatically rather than as discretionary steps. Sequential deployment of a multi-service platform across three environments takes hours; parallelised deployment of the same change takes minutes. The payback is not only release speed. It is the compounding cost reduction of every wait state for every engineer on every release. Teams that treat pipelines as a product feature, not an afterthought, ship more confidently and recover from bad changes faster because the rollback path was exercised, not invented during an incident.&lt;/P&gt;
&lt;BLOCKQUOTE&gt;
&lt;P&gt;&lt;EM&gt;Slow pipelines are not a tooling problem. They are a design problem.&lt;/EM&gt;&lt;/P&gt;
&lt;/BLOCKQUOTE&gt;
&lt;H2&gt;5. Cost discipline is engineering work&lt;/H2&gt;
&lt;P&gt;Cloud platforms can become expensive quickly when cost is treated as someone else's problem. Cost is a property of the design, not a quarterly review.&lt;/P&gt;
&lt;P&gt;The teams that get this right treat cost the same way they treat performance:&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;Elastic compute and storage tiers chosen per workload pattern&lt;/LI&gt;
&lt;LI&gt;Non-production environments with automated scale-down windows (the easiest savings to leave on the table)&lt;/LI&gt;
&lt;LI&gt;Tagging discipline so cost can be attributed to a service, a feature, a tenant&lt;/LI&gt;
&lt;LI&gt;Egress and data-tier choices, not compute, often dominate cloud bills past a certain scale. Right-size storage tiers (hot vs cool vs archive), eliminate cross-region chatter, and watch egress on the data plane more closely than compute on the request path.&lt;/LI&gt;
&lt;LI&gt;Budgets and usage alerts wired into the same channels as reliability alerts&lt;/LI&gt;
&lt;LI&gt;Cost reviews built into design discussions, not deferred to FinOps (Financial Operations: the practice of managing cloud spend as an engineering concern)&lt;/LI&gt;
&lt;/UL&gt;
&lt;P&gt;&lt;STRONG&gt;In practice&lt;/STRONG&gt;&lt;/P&gt;
&lt;P&gt;The pattern that works: non-production environments scale down automatically outside business hours, storage tiers match access patterns (hot, cool, archive), and tagging is enforced so every dollar can be attributed to a service or feature. Cost reviews happen at design time, not after the bill arrives. The biggest savings come from data plane decisions, not compute: cross-region egress, oversized storage tiers, and forgotten test environments dominate cloud bills past a certain scale. Treat cost as a first-class non-functional requirement, alongside latency and availability, and the discipline compounds in every design discussion that follows.&lt;/P&gt;
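&lt;P&gt;To make the attribution step concrete, here is a small sketch that rolls billing lines up by a tag. The &lt;CODE&gt;CostLine&lt;/CODE&gt; shape and the tag names are assumptions for illustration; real billing exports have their own schema. The point is that untagged spend lands in a visible bucket instead of disappearing.&lt;/P&gt;
&lt;LI-CODE lang="csharp"&gt;using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical billing line: a cost plus the tags on the resource.
public sealed record CostLine(
    decimal Amount, IReadOnlyDictionary&amp;lt;string, string&amp;gt; Tags);

public static class CostAttribution
{
    public const string Untagged = "(untagged)";

    // Sum cost per value of one tag. Lines missing the tag are grouped
    // under "(untagged)" so the enforcement gap shows up in the report.
    public static IReadOnlyDictionary&amp;lt;string, decimal&amp;gt; ByTag(
        IEnumerable&amp;lt;CostLine&amp;gt; lines, string tagKey)
    {
        return lines
            .GroupBy(line =&amp;gt;
                line.Tags.TryGetValue(tagKey, out var value) ? value : Untagged)
            .ToDictionary(g =&amp;gt; g.Key, g =&amp;gt; g.Sum(line =&amp;gt; line.Amount));
    }
}
&lt;/LI-CODE&gt;
&lt;P&gt;Running the same roll-up by service, by feature, and by tenant gives three views of one bill, and the size of the untagged bucket is a direct measure of how well tagging is enforced.&lt;/P&gt;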
&lt;H2&gt;A scenario that ties it together&lt;/H2&gt;
&lt;P&gt;&lt;EM&gt;Figure 2. A reference architecture that puts the disciplines into one shape. The request path is decoupled, the data layer is purpose-fit, identity is brokered by managed identity throughout, private endpoints isolate the data tier from public networks, and observability runs as a first-class lane.&lt;/EM&gt;&lt;/P&gt;
&lt;BLOCKQUOTE&gt;
&lt;P&gt;Picture a multi-tenant platform at a growth inflection. Onboarding a new tenant takes weeks because tenant-specific behaviour is hard-coded across services. Every release carries risk because there is no way to roll out a change to one tenant without affecting the rest. Incidents linger because logs and metrics live in different tools and nobody can correlate them in production.&lt;/P&gt;
&lt;P&gt;Do not start with a rewrite. Start with the smallest set of changes that unlocks the next year of growth: extract configuration out of code, introduce tenant-aware feature flags, wire a unified observability view into the existing services, and parallelise the pipelines. None of these are architectural revolutions. They are design choices applied with discipline, in the order the disciplines compound.&lt;/P&gt;
&lt;P&gt;Eighteen months in, onboarding a tenant takes hours instead of weeks. Releases move from monthly events to weekly increments. Incidents are caught earlier and resolved faster. The platform did not get bigger. It got more capable. The five disciplines did the work; the team made the choice to apply them.&lt;/P&gt;
&lt;/BLOCKQUOTE&gt;
&lt;H2&gt;What teams get wrong&lt;/H2&gt;
&lt;P&gt;The common pattern is &lt;STRONG&gt;architecting for the system you have, not the system you are growing into.&lt;/STRONG&gt; It looks like progress because the current sprint ships. The disciplines get postponed because they feel like overhead.&lt;/P&gt;
&lt;P&gt;The cost surfaces later. Each shortcut becomes a constraint. The constraints compound, and three releases later the team is debating a rewrite.&lt;/P&gt;
&lt;P&gt;The fix is not premature abstraction. It is small, deliberate investments in flexibility, resilience, observability, delivery, and cost from day one. The discipline is to make these investments before they are urgent.&lt;/P&gt;
&lt;H2&gt;Where to start when you cannot do everything at once&lt;/H2&gt;
&lt;P&gt;Five disciplines is a wall, and real teams cannot fund all five at once. The right order depends on whether the platform is being built fresh or already running.&lt;/P&gt;
&lt;P&gt;For a system &lt;STRONG&gt;already in production and already in pain&lt;/STRONG&gt;, the SRE community's &lt;A href="https://sre.google/sre-book/part-III-practices/" target="_blank"&gt;hierarchy of reliability needs&lt;/A&gt; gives the most defensible starting order: &lt;EM&gt;monitoring and observability first&lt;/EM&gt; (you cannot fix what you cannot see), &lt;EM&gt;then incident response&lt;/EM&gt; (stop the bleeding cleanly), &lt;EM&gt;then resilience patterns&lt;/EM&gt; (idempotency, retries, decoupling) so the bleeding has fewer reasons to start, &lt;EM&gt;then flexibility and delivery&lt;/EM&gt; so safe change can travel at speed. Cost discipline runs alongside throughout, never as the headline.&lt;/P&gt;
&lt;P&gt;For a system &lt;STRONG&gt;being built fresh&lt;/STRONG&gt;, the order in this post (flexibility, resilience, observability, delivery, cost) reflects the &lt;A href="https://learn.microsoft.com/azure/well-architected/" target="_blank"&gt;Azure Well-Architected Framework's&lt;/A&gt; emphasis on designing for change, failure, and visibility before scaling teams or workloads. Both orders are defensible. What is not defensible is leaving any of the five for later.&lt;/P&gt;
&lt;P&gt;&lt;STRONG&gt;The most concrete starter from this post: request id propagation.&lt;/STRONG&gt; A single correlation identifier travelling through every service hop, every queue, every async boundary, costs hours up front and pays back every time someone has to debug production for the rest of the platform's life. It is the smallest unit of the observability discipline and the foundation that the dashboards, traces, and incident response in Part 2 all depend on.&lt;/P&gt;
&lt;H2&gt;The shift&lt;/H2&gt;
&lt;P&gt;The most important transformation in scaling a platform is not technical. It is mindset.&lt;/P&gt;
&lt;P&gt;The shift is from &lt;STRONG&gt;project thinking to platform thinking&lt;/STRONG&gt;:&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;Build reusable capabilities, not one-off solutions&lt;/LI&gt;
&lt;LI&gt;Design systems for long-term evolution, not the next release&lt;/LI&gt;
&lt;LI&gt;Enable other teams, not just deliver for one team&lt;/LI&gt;
&lt;/UL&gt;
&lt;P&gt;Tools change. Cloud services evolve. The architectural fashions of this year will not be the architectural fashions of the next. What persists is the discipline behind the choices. Scalable systems are not built by tools. They are built by teams that treat design as continuous work. The same discipline shows up again in Part 2 (operating these systems) and Part 3 (using AI to augment that work). The tools change. The disciplines do not.&lt;/P&gt;
&lt;HR /&gt;
&lt;P&gt;&lt;STRONG&gt;Want to discuss?&lt;/STRONG&gt; What single design choice has paid the most dividends in the platforms you run? Drop a comment with patterns you have seen in your environment. Every reply gets read.&lt;/P&gt;
&lt;P&gt;&lt;STRONG&gt;Next in this series:&lt;/STRONG&gt; &lt;EM&gt;Running Cloud Native Platforms: Why Day 2 Decides Everything.&lt;/EM&gt; Building is half the journey. The next post looks at what it takes to operate these platforms once they are in production.&lt;/P&gt;
</description>
      <pubDate>Thu, 21 May 2026 22:41:29 GMT</pubDate>
      <guid>https://techcommunity.microsoft.com/t5/azure-architecture-blog/cloud-native-platforms-build/ba-p/4519605</guid>
      <dc:creator>KishoreKumarPattabiraman</dc:creator>
      <dc:date>2026-05-21T22:41:29Z</dc:date>
    </item>
    <item>
      <title>Cloud Native Platforms: Run</title>
      <link>https://techcommunity.microsoft.com/t5/azure-architecture-blog/cloud-native-platforms-run/ba-p/4520188</link>
      <description>&lt;P&gt;&lt;STRONG&gt;Audience:&lt;/STRONG&gt; SREs (Site Reliability Engineers), platform engineers, engineering managers running production systems&lt;/P&gt;
&lt;P&gt;&lt;STRONG&gt;Reading time:&lt;/STRONG&gt; 8 minutes&lt;/P&gt;
&lt;P&gt;&lt;STRONG&gt;Series:&lt;/STRONG&gt; Cloud Native Platforms. Build, Run, Evolve. This is Part 2 of 3.&lt;/P&gt;
&lt;HR /&gt;
&lt;P&gt;Most systems are designed thoughtfully.&lt;/P&gt;
&lt;P&gt;Most operations are inherited reactively.&lt;/P&gt;
&lt;P&gt;The systems that survive are not the ones built with the most care. They are the ones operated with the most discipline. Production has a way of revealing every shortcut taken during design and every assumption left unverified.&lt;/P&gt;
&lt;P&gt;This post is about what it takes to operate a platform once the build is done.&lt;/P&gt;
&lt;H2&gt;How they are run, not how they are built&lt;/H2&gt;
&lt;P&gt;Systems are not defined by how they are built. They are defined by how they are run.&lt;/P&gt;
&lt;P&gt;A well-designed system that is operated reactively will fail in production. A modestly designed system that is operated with discipline will outperform it. Five operational disciplines decide which side of that line a platform lives on. Each one is engineering work, not a checklist for someone else to handle.&lt;/P&gt;
&lt;P&gt;&lt;EM&gt;Figure 1. The incident lifecycle as a state machine. The states are not optional steps. They are the contract between the team and the system.&lt;/EM&gt;&lt;/P&gt;
&lt;H2&gt;1. Observability is the backbone of reliability&lt;/H2&gt;
&lt;P&gt;Without observability, every operation becomes a guess. As systems grow, the cost of guessing rises faster than the cost of seeing.&lt;/P&gt;
&lt;P&gt;Part 1 of this series argued that observability is a design property: instrumentation contracts, request id propagation, structured logging schemas. Production is where those design choices either pay off or do not. Strong observability in production is a contract that lets any engineer answer three questions in minutes: what failed, why it failed, and what the impact was. The shape of that contract matters more than the tool that implements it. (This three-question framing was popularised within the SRE community by writers such as Charity Majors. See &lt;A href="https://www.honeycomb.io/what-is-observability" target="_blank"&gt;Honeycomb's &lt;EM&gt;What is Observability&lt;/EM&gt;&lt;/A&gt; for a widely cited articulation of the question-driven framing; the substance is older than the framing.)&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;Dashboards organised around user journeys, not infrastructure components&lt;/LI&gt;
&lt;LI&gt;Service level indicators (SLIs: the specific measurements you care about, e.g., success rate, p99 latency) chosen from the user's perspective, not the database's&lt;/LI&gt;
&lt;LI&gt;Alerts that page only on burn-rate against an SLO (Service Level Objective: the target value of an SLI, e.g., 99.9% of requests complete in under 800ms over a rolling month) using a multi-window strategy. A short window catches fast burns; a long window catches slow drifts. This is what makes SLOs operational rather than decorative.&lt;/LI&gt;
&lt;LI&gt;Sampling and retention tuned for cost, but never for blind spots&lt;/LI&gt;
&lt;LI&gt;MTTA (mean time to acknowledge: how fast someone notices) and MTTR (mean time to restore: how fast service returns) tracked separately. Conflating them hides whether the team's bottleneck is detection, response, or fix.&lt;/LI&gt;
&lt;/UL&gt;
&lt;P&gt;&lt;STRONG&gt;In practice&lt;/STRONG&gt;&lt;/P&gt;
&lt;P&gt;The pattern that works: rebuild the operational view around two or three user journeys (sign-in, place order, view history) rather than per-component charts. Tie alerts to error budget burn rather than raw threshold crossings. Track MTTA and MTTR separately so the team's actual bottleneck (detection, response, or fix) is visible. The investment is rethinking what to measure, not buying a new tool. The return is that incidents stop being discovered by customer complaints first. Teams that make this shift typically find their existing telemetry was sufficient; only the questions being asked of it were wrong.&lt;/P&gt;
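&lt;P&gt;The burn-rate arithmetic behind those alerts is short enough to sketch. Burn rate is the observed error ratio divided by the error budget (one minus the SLO target). The thresholds below (14.4x over 1 hour, 6x over 6 hours) are the ones used in the runbook example later in this post; the &lt;CODE&gt;BurnRate&lt;/CODE&gt; helper itself and the 30-day budget period are assumptions for illustration.&lt;/P&gt;
&lt;LI-CODE lang="csharp"&gt;using System;

public static class BurnRate
{
    // Burn rate = observed error ratio / error budget ratio.
    // With a 99.9% SLO the budget is 0.1% of requests.
    public static double Compute(double errorRatio, double sloTarget)
    {
        var budget = 1.0 - sloTarget;
        if (budget &amp;lt;= 0)
            throw new ArgumentOutOfRangeException(nameof(sloTarget));
        return errorRatio / budget;
    }

    // Share of the budget period consumed by burning at a given rate
    // for the length of one window.
    public static double BudgetConsumed(
        double burnRate, TimeSpan window, TimeSpan period)
    {
        return burnRate * (window.TotalHours / period.TotalHours);
    }

    // Fast burn is checked first so the more urgent severity wins.
    public static string Classify(double burnOver1Hour, double burnOver6Hours)
    {
        if (burnOver1Hour &amp;gt;= 14.4) return "P1";
        if (burnOver6Hours &amp;gt;= 6.0) return "P2";
        return "none";
    }
}
&lt;/LI-CODE&gt;
&lt;P&gt;The thresholds are less arbitrary than they look. Against a 30-day (720-hour) budget, one hour at 14.4x consumes 14.4 / 720 = 2% of the budget, and six hours at 6x consumes 36 / 720 = 5%. That is the sense in which a short window catches fast burns and a long window catches slow drifts.&lt;/P&gt;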
&lt;BLOCKQUOTE&gt;
&lt;P&gt;&lt;EM&gt;If a dashboard cannot answer "what is the user experiencing right now", it is not an observability dashboard. It is decoration.&lt;/EM&gt;&lt;/P&gt;
&lt;/BLOCKQUOTE&gt;
&lt;H2&gt;2. Alerts are signals, not notifications&lt;/H2&gt;
&lt;P&gt;More alerts do not mean better monitoring. In practice, the opposite is true. Once alerts outpace the team's ability to act, important signals start getting missed.&lt;/P&gt;
&lt;P&gt;Effective alerting works to a small set of rules:&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;Severity that maps to action, not to technical category&lt;/LI&gt;
&lt;LI&gt;Ownership baked in, never inferred at runtime&lt;/LI&gt;
&lt;LI&gt;Thresholds tied to user impact, not raw metric values&lt;/LI&gt;
&lt;LI&gt;Noise treated as a defect, with a regular review cadence&lt;/LI&gt;
&lt;LI&gt;Suppression and grouping for known multi-alert patterns&lt;/LI&gt;
&lt;/UL&gt;
&lt;P&gt;&lt;STRONG&gt;In practice&lt;/STRONG&gt;&lt;/P&gt;
&lt;P&gt;The pattern that works: audit every alert against one test, "what action would I take in the next five minutes if this fires now?" Demote alerts with no answer to dashboards. Remove alerts where the answer is the same as another alert's. Group related alerts so one incident produces one page, not twelve. Most teams discover their alert volume drops by an order of magnitude after a thorough audit, and the alerts that remain start getting trusted again. Trust is the precondition for every other operational practice. Without it, on-call rotations decay into noise filtering and the real signals get missed.&lt;/P&gt;
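&lt;P&gt;Grouping is the most mechanical of those steps, so here is a sketch of it. The &lt;CODE&gt;Alert&lt;/CODE&gt; and &lt;CODE&gt;Page&lt;/CODE&gt; shapes are hypothetical, and the incident key is assumed to be assigned upstream (for example, the affected user journey). Alerts that share a key collapse into one page.&lt;/P&gt;
&lt;LI-CODE lang="csharp"&gt;using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical alert shape. IncidentKey identifies the shared cause.
public sealed record Alert(
    string Name, string IncidentKey, DateTimeOffset FiredAt);

public sealed record Page(
    string IncidentKey,
    DateTimeOffset FirstFiredAt,
    IReadOnlyList&amp;lt;string&amp;gt; AlertNames);

public static class AlertGrouping
{
    // Collapse alerts that share an incident key into a single page,
    // ordered by when each incident first fired.
    public static IReadOnlyList&amp;lt;Page&amp;gt; ToPages(IEnumerable&amp;lt;Alert&amp;gt; alerts)
    {
        return alerts
            .GroupBy(a =&amp;gt; a.IncidentKey)
            .Select(g =&amp;gt; new Page(
                g.Key,
                g.Min(a =&amp;gt; a.FiredAt),
                g.Select(a =&amp;gt; a.Name)
                    .Distinct()
                    .OrderBy(n =&amp;gt; n, StringComparer.Ordinal)
                    .ToList()))
            .OrderBy(p =&amp;gt; p.FirstFiredAt)
            .ToList();
    }
}
&lt;/LI-CODE&gt;
&lt;P&gt;Twelve alerts that share one key become one page listing the contributing alert names, which keeps the detail available to the responder without paging for each one.&lt;/P&gt;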
&lt;P&gt;&lt;EM&gt;Figure 2. From raw events to pages, in approximate orders of magnitude. The numbers vary by team and workload; what does not vary is that each stage needs to remove one to two orders of magnitude of noise. Teams that page on raw events end up with on-call rotations nobody trusts.&lt;/EM&gt;&lt;/P&gt;
&lt;H2&gt;3. Incident response is a practiced muscle&lt;/H2&gt;
&lt;P&gt;Failures are inevitable. Unstructured response is not.&lt;/P&gt;
&lt;P&gt;The teams that recover quickly do not improvise during incidents. They follow a structure that has been practiced when nothing was on fire. The structure is intentionally simple, because incident time is the worst time to negotiate roles.&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;Clear roles: incident lead, communications lead, scribe, subject matter expert (the RACI model, Responsible-Accountable-Consulted-Informed, adapted for incident response)&lt;/LI&gt;
&lt;LI&gt;Defined escalation paths with clear handoff criteria. Escalation means re-paging to a higher tier or specialist, not returning to detection. The lifecycle diagram in Figure 1 makes the distinction explicit.&lt;/LI&gt;
&lt;LI&gt;Runbooks for the top failure modes, kept short enough to actually be read&lt;/LI&gt;
&lt;LI&gt;Status communication on a fixed cadence, even when there is nothing new to say. Customer comms and internal comms are tracked separately.&lt;/LI&gt;
&lt;LI&gt;Blameless postmortems (focus on the system that allowed the failure, not the person who pushed the button) that produce action items the team actually completes&lt;/LI&gt;
&lt;LI&gt;Game days: scheduled exercises that simulate failure modes (region outage, dependency unavailability, traffic spike) under controlled conditions, so gaps in runbooks are found before real incidents expose them&lt;/LI&gt;
&lt;/UL&gt;
&lt;P&gt;&lt;STRONG&gt;In practice&lt;/STRONG&gt;&lt;/P&gt;
&lt;P&gt;The pattern that works: name the incident lead and the comms lead before the first message goes out. Write runbooks short enough to be scannable at 3 AM. Run blameless postmortems with action items that actually get tracked to completion. Schedule game days quarterly so the runbooks are exercised before real incidents. Teams that operate with this structure do not have more engineers; they have engineers who are not single points of failure during recovery. The deepest experts stay the deepest experts, but the platform stops depending on whether they happen to be online.&lt;/P&gt;
&lt;H3&gt;Implementation note&lt;/H3&gt;
&lt;P&gt;A short, well-structured runbook outperforms a long, exhaustive one. The goal during an incident is not to think. It is to act on a procedure that has been thought through in calmer times.&lt;/P&gt;
&lt;LI-CODE lang="yaml"&gt;
# Runbook header pattern (keep it scannable in incident time)
title: High latency on order API
slo_protected:                  # this runbook protects two SLOs
  - order-completion-success
  - order-completion-latency
severity:                       # derived from burn rate, not declared
  fast_burn: P1                 # 14.4x budget burn over 1 hour =&amp;gt; page now
  slow_burn: P2                 # 6x budget burn over 6 hours =&amp;gt; investigate
owner: payments-team
indicators:                     # triggers for evaluation, not severity
  - p99 (99th-percentile) latency exceeds the SLO target for 5 min
  - error rate exceeds the SLO target for 3 min on order-completion
first_actions:
  - Open the order-journey dashboard. Confirm impact in business terms.
  - Check Service Bus queue depth and dead-letter rate (the most common
    cause of API latency under load is downstream backpressure)
  - Verify Cosmos DB RU/s saturation and partition hotspots
  - Inspect the most recent deployment for behavioural changes
escalate_if:
  - Latency does not recover in 15 min
  - Error rate exceeds 5% (fast burn against the SLO)
  - Customer reports arrive before our own signals do
rollback_path:
  - Feature flag "new-order-pipeline" can be disabled per-tenant
  - Last known good deployment id is in the release tracker
note_on_scaling: |
  CPU is rarely the cause of latency in this service. Scale only after
  confirming the bottleneck is compute, not a downstream dependency or
  queue depth. Adding capacity to a saturated downstream amplifies the
  incident; it does not resolve it.
&lt;/LI-CODE&gt;
&lt;P&gt;The general principle behind that last note travels beyond this runbook: scale-out is the right remediation for compute saturation, not for downstream saturation. When latency rises because a database, queue, or external dependency is saturated, adding capacity in front of the bottleneck moves more requests into the bottleneck and makes the incident worse. This is one of the most common operational mistakes when the dashboard shows red and the on-call instinct says "add more".&lt;/P&gt;
&lt;H2&gt;4. Release confidence is engineered&lt;/H2&gt;
&lt;P&gt;Releases get harder as systems grow. The platforms that ship confidently at scale have engineered the path, not learned to fear it.&lt;/P&gt;
&lt;P&gt;The patterns that change the math:&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;Feature flags that allow change without deploy&lt;/LI&gt;
&lt;LI&gt;Canary deployments: release the new version to a small slice of traffic first and watch error budget burn before continuing, so problems surface while few users are exposed&lt;/LI&gt;
&lt;LI&gt;Gradual rollouts with automated rollback triggers&lt;/LI&gt;
&lt;LI&gt;Database migrations split from application releases&lt;/LI&gt;
&lt;LI&gt;Release coordination that scales with services, not with team size&lt;/LI&gt;
&lt;/UL&gt;
&lt;P&gt;&lt;STRONG&gt;In practice&lt;/STRONG&gt;&lt;/P&gt;
&lt;P&gt;The pattern that works: every change ships behind a feature flag, canary deployments take a small slice of traffic first, and rollback is a one-click step in the pipeline rather than a procedure to be invented during an incident. The cost is the discipline of building rollback paths and exercising them. The return is releases that stop being events. Issues that previously triggered full rollbacks get isolated to a slice and rolled back automatically before they reach most users. The willingness to ship smaller, more frequent changes follows directly from the confidence that bad changes can be undone fast.&lt;/P&gt;
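&lt;P&gt;As a rough sketch of how canary slices and automated rollback triggers fit together, the definition below shows the shape a staged rollout might take. It is illustrative, not any specific platform's syntax: the field names, traffic percentages, soak times, and thresholds are assumptions to replace with values derived from your own SLOs. The flag name is borrowed from the runbook example earlier in this post.&lt;/P&gt;
&lt;LI-CODE lang="yaml"&gt;# Staged rollout definition (illustrative; field names and values are assumptions)
name: order-api-release
feature_flag: new-order-pipeline        # off by default; enabled per stage
stages:
  - name: canary
    traffic_percent: 5                  # assumed slice size
    soak_minutes: 30
  - name: partial
    traffic_percent: 25
    soak_minutes: 60
  - name: full
    traffic_percent: 100
promote_when:                           # all must hold for the whole soak
  - error budget burn rate stays below 1x
  - p99 latency stays within the SLO target
rollback_when:                          # any one triggers automatic rollback
  - error budget burn rate exceeds 14.4x over 5 min
  - canary error rate exceeds 2x the baseline error rate
rollback:
  action: disable_flag_then_redeploy_last_known_good
  requires_human: false                 # rollback does not wait for approval
database_migrations:
  shipped_with_release: false           # schema changes go out separately&lt;/LI-CODE&gt;
&lt;P&gt;The useful property of this shape is that the rollback conditions are written down before the release starts, so the pipeline can act on them without a decision being made mid-incident.&lt;/P&gt;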
&lt;BLOCKQUOTE&gt;
&lt;P&gt;&lt;EM&gt;Big releases feel safe because they are rare. They are actually risky because every change rides together.&lt;/EM&gt;&lt;/P&gt;
&lt;/BLOCKQUOTE&gt;
&lt;H2&gt;5. Reliability is continuous, not a milestone&lt;/H2&gt;
&lt;P&gt;Reliability is not achieved through tools alone. It requires continuous refinement, feedback-driven improvement, and a budget that the team can spend on operational work without negotiating each time.&lt;/P&gt;
&lt;P&gt;The disciplines that keep systems reliable over years are codified well in the SRE-book framing of service level objectives and error budgets (the canonical reference is the &lt;A href="https://sre.google/sre-book/service-level-objectives/" target="_blank"&gt;Google SRE Book chapter on Service Level Objectives&lt;/A&gt;, with the operational follow-up in the &lt;A href="https://sre.google/workbook/alerting-on-slos/" target="_blank"&gt;SRE Workbook chapter on alerting on SLOs&lt;/A&gt;). The names matter less than the practice they enable.&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;&lt;STRONG&gt;SLOs&lt;/STRONG&gt; chosen from the user's perspective, with two or three per service rather than ten. More SLOs means none of them shape behaviour.&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;Error budgets&lt;/STRONG&gt;: the complement of the SLO (a 99.9% target leaves a 0.1% budget), expressing how much unreliability the team is willing to spend in a window. A budget used up early in the month means slowing down on releases. A healthy budget means feature work keeps moving.&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;Multi-window burn-rate alerting&lt;/STRONG&gt; turns SLOs from dashboards into pages: short window catches catastrophic failures, long window catches slow drift. Without burn-rate alerting, SLOs are observation, not operation. (The pattern is documented in the &lt;A href="https://sre.google/workbook/alerting-on-slos/" target="_blank"&gt;SRE Workbook&lt;/A&gt;.)&lt;/LI&gt;
&lt;LI&gt;Reliability work has its own backlog, prioritised against features. Not a wishlist after every incident.&lt;/LI&gt;
&lt;LI&gt;Regular game days that exercise failure modes (region failover, dependency outage, traffic spike) before they happen for real&lt;/LI&gt;
&lt;LI&gt;Capacity planning informed by data, not by anxiety&lt;/LI&gt;
&lt;/UL&gt;
&lt;P&gt;&lt;STRONG&gt;In practice&lt;/STRONG&gt;&lt;/P&gt;
&lt;P&gt;The pattern that works: define two or three SLOs per service, expressed from the user's perspective. Compute the error budget weekly. When the budget is healthy, ship feature work. When the budget is burning fast, slow down and fix the cause. The conversation about which incidents matter and which can wait becomes possible because there is a shared number to point at. Reliability becomes a quantified property of the platform, not an opinion debated at every retrospective. Teams that adopt this discipline stop having the recurring "how reliable do we need to be?" argument and start having data-grounded trade-off discussions instead.&lt;/P&gt;
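&lt;P&gt;To make the arithmetic concrete: assuming a 99.9% SLO measured over a 30-day window (both figures are assumptions for illustration), the error budget is 0.1% of the window, or about 43 minutes of full unavailability. Burn rate is the observed error rate divided by the budgeted error rate, so a 14.4x burn sustained for one hour spends 2% of the 30-day budget, and a 6x burn sustained for six hours spends 5%. Those are the thresholds suggested in the SRE Workbook and used in the runbook example above. The sketch below is illustrative, not a specific alerting platform's syntax; the field names are hypothetical.&lt;/P&gt;
&lt;LI-CODE lang="yaml"&gt;# Multi-window burn-rate alert policy (illustrative; field names are assumptions)
slo:
  name: order-completion-success
  target_percent: 99.9                  # assumed target
  window_days: 30                       # assumed window
  error_budget_percent: 0.1             # 100 - 99.9
  error_budget_minutes: 43.2            # 0.1% of 30 days (43,200 minutes)
alerts:
  - name: fast-burn
    burn_rate: 14.4                     # spends 2% of the budget in 1 hour
    long_window: 1h
    short_window: 5m                    # both windows must exceed the rate
    severity: P1                        # page now
  - name: slow-burn
    burn_rate: 6                        # spends 5% of the budget in 6 hours
    long_window: 6h
    short_window: 30m
    severity: P2                        # investigate
release_policy:
  budget_healthy: ship feature work
  budget_exhausted: pause feature releases and fix the cause&lt;/LI-CODE&gt;
&lt;P&gt;Requiring both the long and the short window to exceed the burn rate keeps the alert from firing on a brief spike and lets it clear quickly once the problem stops.&lt;/P&gt;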
&lt;H2&gt;A scenario that ties it together&lt;/H2&gt;
&lt;BLOCKQUOTE&gt;
&lt;P&gt;A platform was launching a new region. The build had gone well. Day 1 was clean. Two weeks in, latency started creeping up during peak hours. Alerts fired on raw thresholds, but no one could tell which ones to trust. Incident calls turned into long debugging sessions because three different teams owned overlapping pieces of the request path.&lt;/P&gt;
&lt;P&gt;The team did not start by buying a new tool. They started by treating operations as engineering work. The dashboard was redesigned around the user journey. Alerts were audited and most were demoted or removed. Roles for incident response were written down. A short runbook covered the top failure modes. Releases were broken into canary slices behind feature flags.&lt;/P&gt;
&lt;P&gt;None of this was new. It was discipline applied consistently to work that was previously assumed to be someone else's. The next region launch took half the effort, and the team's mean time to restore on the failures that did happen was measurably lower.&lt;/P&gt;
&lt;/BLOCKQUOTE&gt;
&lt;H2&gt;What teams get wrong&lt;/H2&gt;
&lt;P&gt;The common pattern is &lt;STRONG&gt;treating Day 2 as the cost of Day 1.&lt;/STRONG&gt; Teams design beautifully, ship fast, then quietly absorb the operational debt. Dashboards proliferate. Alerts grow louder. Postmortems pile up.&lt;/P&gt;
&lt;P&gt;The fix is not more dashboards. It is treating operations as engineering work with the same rigour as feature delivery. Operability is a property the system either has or does not. It is not earned by adding monitoring. It is earned by designing for visibility and operating with discipline.&lt;/P&gt;
&lt;H2&gt;Where to start&lt;/H2&gt;
&lt;P&gt;&lt;STRONG&gt;The most concrete starter from this post: an alert audit.&lt;/STRONG&gt; Record every alert that fires over the next week, then apply a single test to each one: "what action would I take in the next five minutes?" Demote the alerts that have no answer. Remove the alerts where the answer is the same as another alert's. Once the week of data is collected, the audit itself takes a morning. The result usually halves alert volume and lifts trust in what remains, which is the precondition for every other operational practice in this post.&lt;/P&gt;
&lt;H2&gt;The shift&lt;/H2&gt;
&lt;P&gt;The most important shift in maturity is not technical. It is in stance.&lt;/P&gt;
&lt;P&gt;The shift is from &lt;STRONG&gt;shipping software to operating systems&lt;/STRONG&gt;:&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;Operations is not a phase that follows engineering. It is engineering.&lt;/LI&gt;
&lt;LI&gt;Reliability is not a milestone reached. It is a discipline practised.&lt;/LI&gt;
&lt;LI&gt;Incidents are not interruptions to the work. They are the work.&lt;/LI&gt;
&lt;/UL&gt;
&lt;P&gt;The teams that internalise this shift run platforms that are smaller, calmer, and more trusted. They do not have fewer incidents because their systems are more advanced. They have fewer incidents because their operational discipline is more consistent. Part 3 of this series argues that the same discipline applies again, in a different domain: the practices that make platforms operable are the practices that make AI useful in delivery.&lt;/P&gt;
&lt;HR /&gt;
&lt;P&gt;&lt;STRONG&gt;Want to discuss?&lt;/STRONG&gt; What is the one operational practice your team adopted that changed how you sleep at night? Drop a comment with patterns you have seen in your environment. Every reply gets read.&lt;/P&gt;
&lt;P&gt;&lt;STRONG&gt;Previously in this series:&lt;/STRONG&gt; &lt;EM&gt;Building Cloud Native Platforms That Scale: Patterns That Actually Work.&lt;/EM&gt; The first post covered the design choices that make scale possible.&lt;/P&gt;
&lt;P&gt;&lt;STRONG&gt;Next in this series:&lt;/STRONG&gt; &lt;EM&gt;AI-First Platform Engineering: From Copilot to Agentic Delivery.&lt;/EM&gt; Cloud helped us scale infrastructure. The next post looks at how AI is now changing how we build and run platforms.&lt;/P&gt;
</description>
      <pubDate>Thu, 21 May 2026 22:41:09 GMT</pubDate>
      <guid>https://techcommunity.microsoft.com/t5/azure-architecture-blog/cloud-native-platforms-run/ba-p/4520188</guid>
      <dc:creator>KishoreKumarPattabiraman</dc:creator>
      <dc:date>2026-05-21T22:41:09Z</dc:date>
    </item>
    <item>
      <title>Cloud Native Platforms: Evolve</title>
      <link>https://techcommunity.microsoft.com/t5/azure-architecture-blog/cloud-native-platforms-evolve/ba-p/4520195</link>
      <description>&lt;P&gt;&lt;STRONG&gt;Audience:&lt;/STRONG&gt; Engineering leaders, platform architects, senior developers exploring how to operationalise AI in their teams&lt;/P&gt;
&lt;P&gt;&lt;STRONG&gt;Reading time:&lt;/STRONG&gt; 8 minutes&lt;/P&gt;
&lt;P&gt;&lt;STRONG&gt;Series:&lt;/STRONG&gt; Cloud Native Platforms. Build, Run, Evolve. This is Part 3 of 3.&lt;/P&gt;
&lt;HR /&gt;
&lt;P&gt;Cloud helped us scale infrastructure.&lt;/P&gt;
&lt;P&gt;AI is starting to do the same thing for the work around the code: the planning, the testing, the release communication, the incident triage, the writing that surrounds writing software.&lt;/P&gt;
&lt;P&gt;The conversation about AI in software has narrowed too quickly to "Copilot in the editor". The bigger story is happening across the lifecycle. Planning, design, development, testing, release, and operations are all being augmented at once. The platforms that adopt AI well are not the ones with the most usage. They are the ones with the clearest discipline around how it is used.&lt;/P&gt;
&lt;P&gt;This post is about that discipline.&lt;/P&gt;
&lt;H2&gt;AI is changing how we engineer, not how we type&lt;/H2&gt;
&lt;P&gt;AI is not changing how we write code. It is changing how we engineer software.&lt;/P&gt;
&lt;P&gt;Code generation is the surface. Underneath it, AI is reshaping the unit of leverage. The question is no longer how fast a developer can type. It is how well a workflow can be expressed as a reusable engineering asset. Six disciplines determine whether AI moves the needle on outcomes or just adds another tool to the stack.&lt;/P&gt;
&lt;P&gt;&lt;EM&gt;Figure 1. AI across the SDLC. Each phase has clear AI assist points and clear human-owned validations. The boundary is not negotiable. It is the design.&lt;/EM&gt;&lt;/P&gt;
&lt;H2&gt;1. From assistance to augmentation&lt;/H2&gt;
&lt;P&gt;Early AI tools focused on assisting individual developers. Code suggestions. Autocomplete. Quick refactors. The value was real but bounded by the editor.&lt;/P&gt;
&lt;P&gt;The shift now is into structured workflows that span the lifecycle. The unit of leverage is no longer a single suggestion. It is a sequence of actions executed reliably across phases. ("Agentic" later in this post means a system that makes its own next-step decisions inside guardrails. A workflow follows a fixed sequence; an agent chooses the path.)&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;Code generation has become baseline, not differentiator&lt;/LI&gt;
&lt;LI&gt;Workflow generation is where the largest gains live&lt;/LI&gt;
&lt;LI&gt;Multi-step assistance with explicit human checkpoints&lt;/LI&gt;
&lt;LI&gt;Context that travels across tools, not just within one&lt;/LI&gt;
&lt;/UL&gt;
&lt;P&gt;&lt;STRONG&gt;In practice&lt;/STRONG&gt;&lt;/P&gt;
&lt;P&gt;The pattern that works: start with the single highest-volume writing task on the team (commit messages, code review comments, release notes, postmortem first drafts) and turn the AI assist for that task into a shared workflow rather than each individual's private trick. The cost is one engineer's afternoon documenting the workflow and the eval set. The return is that every engineer on the team inherits the work, and the task that used to consume an engineer's morning every two weeks becomes a background step in the release process. Workflow generation, not faster typing, is where the gains compound across a team.&lt;/P&gt;
&lt;BLOCKQUOTE&gt;
&lt;P&gt;&lt;EM&gt;Code suggestions help one developer. Reusable workflows help the next ten.&lt;/EM&gt;&lt;/P&gt;
&lt;/BLOCKQUOTE&gt;
&lt;H2&gt;2. AI across the SDLC, with guardrails&lt;/H2&gt;
&lt;P&gt;AI now has a useful role at every phase of delivery. The role is different at each phase, and the guardrails are different too.&lt;/P&gt;
&lt;DIV class="styles_lia-table-wrapper__h6Xo9 styles_table-responsive__MW0lN"&gt;&lt;table border="1" style="border-width: 1px;"&gt;&lt;tbody&gt;&lt;tr&gt;&lt;th&gt;Phase&lt;/th&gt;&lt;th&gt;What AI helps with&lt;/th&gt;&lt;th&gt;What humans must validate&lt;/th&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Plan&lt;/td&gt;&lt;td&gt;Breaking down requirements, drafting acceptance criteria&lt;/td&gt;&lt;td&gt;Domain context, business priorities, customer impact&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Build&lt;/td&gt;&lt;td&gt;Code generation, refactoring, scaffolding&lt;/td&gt;&lt;td&gt;Architectural fit, security boundaries, performance&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Test&lt;/td&gt;&lt;td&gt;Test case generation, edge case discovery&lt;/td&gt;&lt;td&gt;Coverage of business-critical paths, regulatory cases&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Release&lt;/td&gt;&lt;td&gt;Release notes, changelog summaries, communication drafts&lt;/td&gt;&lt;td&gt;Accuracy, tone, customer-facing claims&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Operate&lt;/td&gt;&lt;td&gt;Log triage, incident summaries, runbook drafts&lt;/td&gt;&lt;td&gt;Root cause attribution, action item ownership&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;colgroup&gt;&lt;col style="width: 33.33%" /&gt;&lt;col style="width: 33.33%" /&gt;&lt;col style="width: 33.33%" /&gt;&lt;/colgroup&gt;&lt;/table&gt;&lt;/DIV&gt;
&lt;P&gt;The guardrails are not optional decoration. They are the design.&lt;/P&gt;
&lt;P&gt;&lt;STRONG&gt;In practice&lt;/STRONG&gt;&lt;/P&gt;
&lt;P&gt;The pattern that works: stage AI assists for release communication (changelog drafting, customer-facing release notes, internal release announcements) and require a human review before anything goes out. The draft arrives consistently, faster than a human could produce it, and in a form that is easier to compare across releases. The reviewer is not eliminated; the reviewer is moved from author to editor, which is where their judgment actually matters. Teams that adopt this pattern stop missing release-note deadlines and stop publishing inconsistent communication across products.&lt;/P&gt;
&lt;H2&gt;3. From prompts to reusable assets&lt;/H2&gt;
&lt;P&gt;Many teams begin with prompt experimentation. Individuals find techniques that work for their tasks. The result is a patchwork of personal practices that do not survive a team change.&lt;/P&gt;
&lt;P&gt;The compounding value comes when prompts mature into reusable engineering assets.&lt;/P&gt;
&lt;P&gt;&lt;EM&gt;Figure 2. The maturity model from prompts to agents. The value compounds at the workflow stage and accelerates at the agent stage. The disciplines that make agents safe are the same ones that made workflows reliable.&lt;/EM&gt;&lt;/P&gt;
&lt;P&gt;The maturity stages, in order of leverage:&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;&lt;STRONG&gt;Prompts&lt;/STRONG&gt;: ad-hoc, individual, hard to share&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;Templates&lt;/STRONG&gt;: parameterised prompts versioned with the project&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;Workflows&lt;/STRONG&gt;: multi-step sequences with clear inputs, outputs, checkpoints&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;Agents&lt;/STRONG&gt;: autonomous task chains operating within explicit guardrails&lt;/LI&gt;
&lt;/UL&gt;
&lt;P&gt;The diagram is a maturity ladder, not a graduation. In practice teams operate at all four stages simultaneously for different tasks. A senior engineer may use a one-off prompt to explore a refactor, run a versioned template for commit messages, hand off to a workflow for release notes, and trigger an agent for routine PR triage, all in the same hour. The point of the ladder is not to leave earlier stages behind. It is to know which stage a given task belongs to and to invest accordingly.&lt;/P&gt;
&lt;P&gt;&lt;STRONG&gt;In practice&lt;/STRONG&gt;&lt;/P&gt;
&lt;P&gt;The pattern that works: pick the three prompts your team uses every week, codify them as parameterised templates in the same repository as the application code, and treat them as engineering artefacts (reviewed, versioned, owned). New engineers inherit the team's accumulated practice instead of building their own from scratch. Quality becomes consistent because the variance between individuals shrinks. Investment pays back in weeks, not quarters, and the maturity ladder keeps producing returns as the team moves from templates to workflows to agents.&lt;/P&gt;
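&lt;P&gt;A versioned template can be as small as the sketch below. The layout is hypothetical, not the format of any particular tool: the file path, field names, owner, and eval set path are assumptions, chosen to show what makes a prompt reviewable (an owner, a version, declared parameters, and a test set that reruns when the template changes).&lt;/P&gt;
&lt;LI-CODE lang="yaml"&gt;# prompts/commit-message.yaml (hypothetical path and schema)
name: commit-message
version: 1.2.0                          # bumped on every reviewed change
owner: platform-team                    # assumed owning team
parameters:
  - name: diff
    required: true
  - name: ticket_id
    required: false
template: |
  Summarise the following change as a commit message.
  Use an imperative subject line of 72 characters or fewer.
  Reference ticket {{ticket_id}} if one is provided.

  {{diff}}
evaluation:
  test_set: tests/prompts/commit-message-eval.jsonl
  on_change: rerun&lt;/LI-CODE&gt;
&lt;P&gt;Keeping the file next to the application code means a change to the prompt goes through the same review and history as a change to the code it describes.&lt;/P&gt;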
&lt;H2&gt;4. Agentic delivery, with guardrails that survive a security review&lt;/H2&gt;
&lt;P&gt;The next stage is agentic. AI executes sequences of tasks within a defined scope. The risk is not that the agent will fail. It is that the system around the agent will not catch the failure, and that the failure modes are different in kind from traditional automation. Agents are non-deterministic, they can be manipulated through their inputs, and their actions can have side effects in systems the team does not own.&lt;/P&gt;
&lt;P&gt;Five guardrails make agentic delivery safe. The first four are necessary. The fifth is what carries the agent through a security review at a regulated enterprise.&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;&lt;STRONG&gt;Identity and scope&lt;/STRONG&gt;: the agent runs as a managed identity (or scoped service principal) with the smallest set of permissions that lets it do its job. Permissions are expressed as allowlists, not denylists. Tools fetched at runtime are subject to the same identity boundary as the agent itself.&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;Input quarantine&lt;/STRONG&gt;: anything the agent reads from a user-controlled source (work item bodies, PR descriptions, customer tickets) is treated as untrusted text. The agent does not execute instructions found in fetched content, and tool-call arguments are validated against a schema before execution. This is the prompt-injection mitigation, and it is the most common gap in agentic systems shipped today.&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;Cost and blast-radius caps&lt;/STRONG&gt;: every run has a maximum token budget, a maximum number of tool calls, and a maximum spend. Exceeding any cap aborts the run cleanly. Without caps, scoped credentials are not enough to bound the damage.&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;Evaluations and traceability&lt;/STRONG&gt;: agents are evaluated against a fixed test set before deployment, and on every prompt or model change. Every action is logged with inputs, outputs, the model and prompt versions used, and the reasoning trace where the model exposes one. Logs are redacted for secrets and personally identifiable information at write time.&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;Reversibility taxonomy&lt;/STRONG&gt;: actions are categorised by reversibility, not asserted to be reversible in general. A draft write to a private store is reversible. A post to a customer-facing channel is not reversible (deletion does not unsend). A database update may be reversible by a compensating transaction or not at all. Irreversible actions require human approval at the boundary, before they happen, not after. The agent is allowed to draft and stage. The human is the only one who is allowed to make the move that cannot be undone.&lt;/LI&gt;
&lt;/UL&gt;
&lt;P&gt;&lt;STRONG&gt;In practice&lt;/STRONG&gt;&lt;/P&gt;
&lt;P&gt;The pattern that works: start with one low-risk agent (release-notes drafter, PR triage assistant) running on read-only inputs, write-only-to-drafts permissions, and a hard cost cap per run. Require explicit human approval at the irreversible step. Wire up an evaluation set on day one, and rerun it on every prompt or model change. Treat regressions as failures, not warnings. The first agent the team ships is rarely the most valuable; it is the rehearsal that establishes the controls every later agent inherits. Teams that skip this rehearsal end up with an agent in production that no one feels safe extending.&lt;/P&gt;
&lt;H3&gt;Implementation note&lt;/H3&gt;
&lt;P&gt;An agent without a reversibility taxonomy and a regression eval set is a liability. The discipline is the same one that made workflows reliable: scoped identity, idempotency, traceability, and a clear boundary between machine action and human decision. The YAML below is illustrative, not a runtime contract; it is meant to show the shape of the controls a real agent definition would carry, not the syntax of any specific platform.&lt;/P&gt;
&lt;LI-CODE lang="yaml"&gt;# Agent run definition (illustrative; not a specific platform's syntax)
name: release-notes-drafter
trigger: pre-release
identity:
  type: managed-identity
  scope: tenant=&amp;lt;tenant-id&amp;gt; resource=release-tools/&amp;lt;app-id&amp;gt;
permissions:
  allow:
    - read: "work-items in milestone (filter: state=Done)"
    - read: "pull-requests in milestone (filter: merged)"
    - write: drafts/release-notes/${run-id}
  # Production channels are NOT in the allowlist. The agent cannot post.
limits:
  max_tokens_per_run: 80000
  max_tool_calls_per_run: 20
  max_runtime_seconds: 300
  max_cost_usd: 0.40
  on_exceeded: abort_with_partial_artifact
input_handling:
  treat_fetched_content_as: untrusted
  # Indirect prompt injection is mitigated by the layered discipline below,
  # not by a single feature flag. Each item is a separate control.
  enforce_instruction_hierarchy: true
  validate_tool_args_against_schema: true
  validate_outputs_against_schema: true
steps:
  - fetch: completed work items in milestone
  - draft: release notes from items
  - validate: required fields present
  - request-review:
      from: release-manager
      idempotency_key: ${milestone-id}-${draft-hash}
  - on-approval:
      action: post-to-internal-channel
      reversibility: not-reversible
      requires: explicit-human-click  # the agent does NOT click this
audit:
  log_inputs: true
  log_outputs: true
  redact:
    - secrets
    # Pattern-based: handles structured PII like emails, phones, IDs.
    - pii_patterns: [email, phone, national-id, payment-card, ip-address]
    # Entity-based: required for unstructured PII like names. Pattern alone
    # cannot redact a customer name without an entity-recognition step.
    - pii_entities: ner-based  # names, locations, organisations
  retain: 365_days  # tune to your audit policy, not to the demo
evaluation:
  test_set: tests/release-notes/eval-v3.jsonl
  on_prompt_change: rerun
  on_model_change: rerun
  fail_threshold: 5_percent_regression&lt;/LI-CODE&gt;
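&lt;P&gt;The reversibility taxonomy described earlier can be written down in the same illustrative style. The class names, the approval rule for the middle class, and the action-to-class mapping below are assumptions. The useful property is that every action the agent can take is assigned a class in advance, and anything unclassified is treated as irreversible.&lt;/P&gt;
&lt;LI-CODE lang="yaml"&gt;# Reversibility taxonomy (illustrative; class and action names are assumptions)
classes:
  reversible:
    approval: none
    undo: delete or overwrite the artifact
  compensable:
    approval: human-before-execution    # assumed; stricter than strictly required
    undo: compensating transaction, verified in advance
  irreversible:
    approval: human-before-execution
    undo: none                          # deletion does not unsend
actions:
  write-draft-to-private-store: reversible
  update-database-record: compensable   # only if a compensating transaction exists
  post-to-internal-channel: irreversible
  post-to-customer-channel: irreversible
default_class: irreversible             # unknown actions get the strictest handling&lt;/LI-CODE&gt;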
&lt;H2&gt;5. Where AI still needs human judgment&lt;/H2&gt;
&lt;P&gt;AI has clear boundaries. The boundaries are not embarrassing. They are the design.&lt;/P&gt;
&lt;P&gt;What must stay human-owned:&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;Architectural trade-offs and design decisions&lt;/LI&gt;
&lt;LI&gt;Security validation and threat modelling&lt;/LI&gt;
&lt;LI&gt;Correctness for business-critical and regulatory paths&lt;/LI&gt;
&lt;LI&gt;Domain context that has not been written down&lt;/LI&gt;
&lt;LI&gt;Accountability for outcomes, not just outputs&lt;/LI&gt;
&lt;/UL&gt;
&lt;P&gt;The goal is collaboration, not replacement. The teams that get the most value from AI are not the ones with the most automation. They are the ones with the clearest sense of where automation ends and judgment begins.&lt;/P&gt;
&lt;P&gt;&lt;STRONG&gt;In practice&lt;/STRONG&gt;&lt;/P&gt;
&lt;P&gt;The pattern that works: name the human-owned items explicitly in the team's working agreement (architecture, security, regulatory correctness, accountability) and audit every AI workflow against that list. When a workflow asks the AI to make a decision in any of those categories, redesign it so the AI prepares the analysis and a human makes the call. Most teams over-trust AI for one of these areas in their first six months and learn the hard way. Naming the boundary up front prevents the lesson from being paid in production. The clarity is the value; the model behind the workflow is interchangeable.&lt;/P&gt;
&lt;H2&gt;6. Responsible AI is engineering work&lt;/H2&gt;
&lt;P&gt;The first five disciplines decide whether AI moves the needle. The sixth decides whether the platform can defend the choices it makes with AI. Responsible AI is the engineering practice of building systems whose AI behaviour is fair, transparent, accountable, and safe by design, not by audit after the fact. Treating it as a compliance checkbox at the end of the project is how teams end up shipping AI workflows that fail security review, embarrass the company, or harm users.&lt;/P&gt;
&lt;P&gt;Six controls turn responsible AI from a policy into engineering work. These map directly onto the practices Microsoft and the broader industry have converged on, but the names matter less than the practice they enable.&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;&lt;STRONG&gt;Fairness in inputs and outputs.&lt;/STRONG&gt; The training data, eval set, and prompts are reviewed for systematic bias against any group the system serves. The eval set covers under-represented cases by design, not by accident, and regressions on those cases fail the build.&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;Transparency to end users.&lt;/STRONG&gt; When a user sees AI-generated content, they are told. When a decision is AI-assisted, the path from input to output is explainable in plain language, not just in a model card buried in documentation.&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;Content safety filters.&lt;/STRONG&gt; Inputs and outputs pass through safety classifiers (prompt injection, prohibited content, jailbreak patterns) before reaching the model and before reaching the user. Filtering decisions are logged and reviewable.&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;Accountability ownership.&lt;/STRONG&gt; Every AI workflow has a named owner who is accountable for its outcomes, not just its uptime. The owner has the authority to pause or roll back the workflow when harm is detected.&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;Data minimisation and residency.&lt;/STRONG&gt; The AI sees only the data it needs to do the task. Personally identifiable information and customer data are scoped, redacted, and kept inside the boundary the customer agreed to. Cross-tenant leakage is treated as a P1 incident, not a feature request.&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;Harm evaluation alongside quality evaluation.&lt;/STRONG&gt; The eval set measures harm potential (toxicity, hallucination on factual queries, leakage of confidential context) with the same rigour as it measures correctness. Both must pass for a release to ship.&lt;/LI&gt;
&lt;/UL&gt;
&lt;P&gt;&lt;EM&gt;Figure 3. Responsible AI as a set of engineering controls around the AI workflow. The six controls fall into four categories: data discipline (fairness, data minimisation), model discipline (content safety, harm evaluation), deployment discipline (transparency to users), and governance (accountability ownership). All six are necessary; none is sufficient on its own.&lt;/EM&gt;&lt;/P&gt;
&lt;P&gt;&lt;STRONG&gt;In practice&lt;/STRONG&gt;&lt;/P&gt;
&lt;P&gt;The pattern that works: write the responsible AI plan before the first agent ships, not after the first incident. Pick one workflow that touches user data or generates customer-facing content, and use it as the reference implementation: fairness review on the eval set, content safety filters wrapping the model call, transparency annotation in the UI, redaction of identifying details in logs, harm evals running alongside quality evals on every change, and a named owner with explicit pause authority. The first such workflow takes longer to ship than the unconstrained version. Every workflow after it inherits the controls and ships faster than it would have without them. Teams that defer responsible AI to a future quarter end up retrofitting it under pressure, which is the most expensive way to do it.&lt;/P&gt;
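&lt;P&gt;One way to express "both must pass" is a release gate that lists quality and harm evaluations side by side. The sketch below is illustrative, not a specific platform's syntax; the field names, thresholds, and the harm eval path are assumptions to be replaced by the team's own eval suite.&lt;/P&gt;
&lt;LI-CODE lang="yaml"&gt;# Release gate for an AI workflow (illustrative; names and thresholds are assumptions)
workflow: release-notes-drafter
owner: release-manager                  # named owner with pause authority
gate:
  require_all: true                     # quality and harm must both pass
  quality:
    test_set: tests/release-notes/eval-v3.jsonl
    max_regression_percent: 5
  harm:
    test_set: tests/release-notes/harm-eval.jsonl   # hypothetical path
    checks:
      - toxicity
      - hallucination on factual queries
      - leakage of confidential context
    max_failures: 0
  fairness:
    under_represented_cases: must_not_regress
runtime_controls:
  content_safety_filter: input_and_output
  transparency_label: shown_to_end_user
  log_redaction: enabled
on_gate_failure: block_release&lt;/LI-CODE&gt;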
&lt;H2&gt;A scenario that ties it together&lt;/H2&gt;
&lt;BLOCKQUOTE&gt;
&lt;P&gt;Picture a platform team several months into using Copilot. Adoption is high. Productivity dashboards show gains. But defect rates are not improving and lead time is flat. Leadership asks the obvious question: is AI actually helping, or just feeling like help?&lt;/P&gt;
&lt;P&gt;The answer is not to stop using AI. It is to change how AI is measured. Move adoption metrics to the background. Move outcome metrics to the front: defect escape rate, lead time for change, change failure rate, mean time to recovery. In parallel, promote the individual prompts that have proved themselves to shared templates, and the templates to versioned workflows. Retrofit responsible AI controls onto the workflows that shipped first: content safety filters, harm evaluations alongside quality evaluations, transparency annotations on customer-facing output, and a named owner for each workflow.&lt;/P&gt;
&lt;P&gt;Six months later, the picture is different. Defect rate improves on the parts of the codebase where reusable workflows were introduced. Onboarding for new engineers is visibly faster. Release notes are consistent across teams. The shift is from celebrating use to tracking outcomes, and once the team measures what matters, the tooling decisions start making themselves.&lt;/P&gt;
&lt;/BLOCKQUOTE&gt;
&lt;H2&gt;What teams get wrong&lt;/H2&gt;
&lt;P&gt;The common pattern is &lt;STRONG&gt;measuring AI by usage, not by outcome&lt;/STRONG&gt;. Adoption metrics tell you who tried Copilot. They do not tell you whether defects dropped, lead time improved, or release notes got better.&lt;/P&gt;
&lt;P&gt;The fix is not less AI. It is better measurement. Three of the four metrics named in the scenario above (lead time for change, change failure rate, mean time to recovery) come from the &lt;A href="https://dora.dev/" target="_blank" rel="noopener"&gt;DORA research on software delivery performance&lt;/A&gt; and have become a useful default; defect escape rate is a common quality measure that complements them. Two warnings travel with them. First, attribution is hard: an AI workflow rolled out alongside a test refactor and a CI pipeline change cannot claim credit cleanly. Second, baselines matter more than headlines: a single quarter's improvement is not a trend, and a single team's gain is not the platform's gain. Outcome measurement done well needs a baseline window, an attribution discipline, and a kill criterion for workflows that are not paying back. Done poorly, it is just adoption metrics with better names.&lt;/P&gt;
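&lt;P&gt;As an illustration of the baseline comparison, change failure rate is the share of deployments that cause a failure in production, and it can be computed over a baseline window and a current window. The &lt;CODE&gt;Deployment&lt;/CODE&gt; shape and both function names below are assumptions made for this sketch; attribution still has to be argued separately.&lt;/P&gt;
&lt;PRE class="typescript"&gt;&lt;CODE&gt;// Hypothetical record shape; the field name is an assumption for illustration.
interface Deployment {
  failed: boolean; // true if the change caused a failure in production
}

// Share of deployments in the window that failed, or null for an empty window.
function changeFailureRate(deployments: Deployment[]): number | null {
  if (deployments.length === 0) {
    return null;
  }
  const failures = deployments.filter((d) => d.failed).length;
  return failures / deployments.length;
}

// Positive result: the current window fails less often than the baseline.
function improvementOverBaseline(
  baseline: Deployment[],
  current: Deployment[]
): number | null {
  const before = changeFailureRate(baseline);
  const after = changeFailureRate(current);
  if (before === null || after === null) {
    return null; // no comparison without data in both windows
  }
  return before - after;
}&lt;/CODE&gt;&lt;/PRE&gt;
&lt;P&gt;Returning &lt;CODE&gt;null&lt;/CODE&gt; for an empty window keeps a missing baseline from being read as an improvement.&lt;/P&gt;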
&lt;P&gt;There is also the question of cost. AI usage carries a per-run token bill, an evaluation bill on every change, and (for agents) a cost cap that limits damage when something goes wrong. None of these are large compared to the engineering time saved when the workflow works. All of them are visible enough that a finance-aware reader will ask. Track them.&lt;/P&gt;
&lt;H2&gt;Where to start&lt;/H2&gt;
&lt;P&gt;&lt;STRONG&gt;The most concrete starter from this post: promote one personal prompt to a shared template.&lt;/STRONG&gt; Pick the prompt that gets used most often (commit messages, code reviews, release notes, debugging assist), move it from someone's notes into the repository where the team versions everything else, and watch what changes when the next person on the team runs it. That is the smallest unit of the workflow shift this post argues for, and it is the step where prompts stop being individual practice and start becoming engineering assets.&lt;/P&gt;
&lt;H2&gt;The shift&lt;/H2&gt;
&lt;P&gt;The shift is from &lt;STRONG&gt;building systems to building smarter systems&lt;/STRONG&gt;:&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;AI does not replace engineers. It changes what an engineer's leverage looks like.&lt;/LI&gt;
&lt;LI&gt;The unit of value is the workflow, not the suggestion.&lt;/LI&gt;
&lt;LI&gt;The discipline that made platforms operable is the same discipline that makes AI useful.&lt;/LI&gt;
&lt;LI&gt;Responsible AI is not a compliance step. It is the sixth engineering discipline that lets the other five compound safely.&lt;/LI&gt;
&lt;/UL&gt;
&lt;P&gt;The series ends here, but the arc is consistent across all three posts. The disciplines that make platforms scale are the same disciplines that make AI useful. Build with discipline. Run with discipline. Evolve with discipline. The tools change. The disciplines do not.&lt;/P&gt;
&lt;HR /&gt;
&lt;P&gt;&lt;STRONG&gt;Want to discuss?&lt;/STRONG&gt; Where has AI moved the needle most in your delivery, and where has it disappointed you? Drop a comment with patterns you have seen in your environment. Every reply gets read.&lt;/P&gt;
&lt;P&gt;&lt;STRONG&gt;Previously in this series:&lt;/STRONG&gt;&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;&lt;EM&gt;Building Cloud Native Platforms That Scale: Patterns That Actually Work&lt;/EM&gt;. Part 1 covered the design choices that make scale possible.&lt;/LI&gt;
&lt;LI&gt;&lt;EM&gt;Running Cloud Native Platforms: Why Day 2 Decides Everything&lt;/EM&gt;. Part 2 covered the operational disciplines that decide production outcomes.&lt;/LI&gt;
&lt;/UL&gt;
&lt;P&gt;This is the third and final post in the series.&lt;/P&gt;
</description>
      <pubDate>Thu, 21 May 2026 22:40:33 GMT</pubDate>
      <guid>https://techcommunity.microsoft.com/t5/azure-architecture-blog/cloud-native-platforms-evolve/ba-p/4520195</guid>
      <dc:creator>KishoreKumarPattabiraman</dc:creator>
      <dc:date>2026-05-21T22:40:33Z</dc:date>
    </item>
    <item>
      <title>WAR, Azure Advisor, and Us (Azure Arch Diagram Builder): Three Ways to Score an Azure Architecture</title>
      <link>https://techcommunity.microsoft.com/t5/azure-architecture-blog/war-azure-advisor-and-us-azure-arch-diagram-builder-three-ways/ba-p/4521611</link>
      <description>&lt;P&gt;&lt;STRONG&gt;Author:&lt;/STRONG&gt; Arturo Quiroga, Azure AI services Engineer - Senior Partner Solutions Architect — Microsoft&lt;/P&gt;
&lt;HR /&gt;
&lt;P&gt;A few days ago I published &lt;A href="https://techcommunity.microsoft.com/blog-draft-azure-architecture-blog.md" target="_blank"&gt;&lt;EM&gt;From Prompt to Production: Building Azure Architecture Diagrams with AI&lt;/EM&gt;&lt;/A&gt;, introducing the open-source &lt;A href="https://aka.ms/diagram-builder" target="_blank"&gt;Azure Architecture Diagram Builder&lt;/A&gt;. One feature got more follow-up questions than any other: the &lt;STRONG&gt;Well-Architected Framework (WAF) validation&lt;/STRONG&gt;. Architects from partners and customers — many of whom already use Azure Advisor and the Well-Architected Review — wanted to know exactly what scoring algorithm we use, how it compares to Microsoft's official tools, and whether they should be using all three.&lt;/P&gt;
&lt;P&gt;This post is that answer. It's a deep dive into how design-time WAF validation works, how Microsoft's two official WAF assessment algorithms work, and where each fits in the architecture lifecycle.&lt;/P&gt;
&lt;BLOCKQUOTE&gt;
&lt;P&gt;&lt;STRONG&gt;TL;DR.&lt;/STRONG&gt; Microsoft ships two WAF assessment vehicles — the &lt;STRONG&gt;Well-Architected Review&lt;/STRONG&gt; (questionnaire, scored from human answers) and the &lt;STRONG&gt;Azure Advisor score&lt;/STRONG&gt; (healthy-resources-÷-applicable-resources weighted per subcategory, with Defender Secure Score for Security and cost-weighted math for Cost). Both require either a human filling in a form or live Azure telemetry. Our app runs &lt;STRONG&gt;at design time on a diagram&lt;/STRONG&gt;, before anything is deployed, using a hybrid pipeline: a deterministic rule pre-scan followed by an LLM refinement pass. Same five WAF pillars, different lifecycle stage. Complementary, not competitive.&lt;/P&gt;
&lt;/BLOCKQUOTE&gt;
&lt;HR /&gt;
&lt;H2 id="why-design-time-validation-matters"&gt;Why design-time validation matters&lt;/H2&gt;
&lt;P&gt;Every cost overrun, reliability gap, and security incident I've ever debugged was cheaper to fix on a whiteboard than in production. Yet most WAF tooling assumes the architecture already exists — either because there are deployed resources to scan (Advisor) or because someone has built enough of it to answer 60 specific questions about it (WAR).&lt;/P&gt;
&lt;P&gt;That leaves a gap. &lt;STRONG&gt;Between "rough sketch" and "deployed resource group" there is no algorithmic WAF feedback loop.&lt;/STRONG&gt; That's the gap the Diagram Builder fills.&lt;/P&gt;
&lt;HR /&gt;
&lt;H2 id="microsofts-two-official-waf-assessment-algorithms"&gt;Microsoft's two official WAF assessment algorithms&lt;/H2&gt;
&lt;P&gt;Before describing our approach, it's worth being precise about what Microsoft already ships, because the term "WAF assessment algorithm" can mean either of two very different things.&lt;/P&gt;
&lt;H3 id="1-azure-well-architected-review-war--questionnaire-based"&gt;1. Azure Well-Architected Review (WAR) — questionnaire-based&lt;/H3&gt;
&lt;P&gt;The &lt;A href="https://learn.microsoft.com/assessments/azure-architecture-review/" target="_blank"&gt;Well-Architected Review&lt;/A&gt; is a free self-assessment hosted on Microsoft Learn.&lt;/P&gt;
&lt;DIV class="styles_lia-table-wrapper__h6Xo9 styles_table-responsive__MW0lN"&gt;&lt;table class="lia-border-color-custom-d6dee6 lia-border-style-solid" border="1" style="width: 100%; border-width: 1px;"&gt;&lt;thead&gt;&lt;tr&gt;&lt;th class="lia-border-color-custom-0078d4 lia-border-style-solid" style="border-width: 1px; padding: 10px 12px;"&gt;Aspect&lt;/th&gt;&lt;th class="lia-border-color-custom-0078d4 lia-border-style-solid" style="border-width: 1px; padding: 10px 12px;"&gt;Detail&lt;/th&gt;&lt;/tr&gt;&lt;/thead&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td class="lia-border-color-custom-d6dee6 lia-vertical-align-top lia-border-style-solid" style="border-width: 1px; padding: 8px 12px;"&gt;&lt;STRONG&gt;Input&lt;/STRONG&gt;&lt;/td&gt;&lt;td class="lia-border-color-custom-d6dee6 lia-vertical-align-top lia-border-style-solid" style="border-width: 1px; padding: 8px 12px;"&gt;Human answers to ~60 questions mapped to the WAF pillar checklists&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td class="lia-border-color-custom-d6dee6 lia-vertical-align-top lia-border-style-solid" style="border-width: 1px; padding: 8px 12px;"&gt;&lt;STRONG&gt;Workload variants&lt;/STRONG&gt;&lt;/td&gt;&lt;td class="lia-border-color-custom-d6dee6 lia-vertical-align-top lia-border-style-solid" style="border-width: 1px; padding: 8px 12px;"&gt;Core WAR, plus AI/ML, IoT, SAP on Azure, Azure Stack Hub, SaaS, Mission Critical&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td class="lia-border-color-custom-d6dee6 lia-vertical-align-top lia-border-style-solid" style="border-width: 1px; padding: 8px 12px;"&gt;&lt;STRONG&gt;Scoring&lt;/STRONG&gt;&lt;/td&gt;&lt;td class="lia-border-color-custom-d6dee6 lia-vertical-align-top lia-border-style-solid" style="border-width: 1px; padding: 8px 12px;"&gt;Derived from the answers — each "no" or unanswered question subtracts from the pillar score&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td class="lia-border-color-custom-d6dee6 lia-vertical-align-top lia-border-style-solid" style="border-width: 1px; 
padding: 8px 12px;"&gt;&lt;STRONG&gt;Output&lt;/STRONG&gt;&lt;/td&gt;&lt;td class="lia-border-color-custom-d6dee6 lia-vertical-align-top lia-border-style-solid" style="border-width: 1px; padding: 8px 12px;"&gt;Per-pillar maturity score + prioritized recommendations + optional Advisor integration&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td class="lia-border-color-custom-d6dee6 lia-vertical-align-top lia-border-style-solid" style="border-width: 1px; padding: 8px 12px;"&gt;&lt;STRONG&gt;Improvement tracking&lt;/STRONG&gt;&lt;/td&gt;&lt;td class="lia-border-color-custom-d6dee6 lia-vertical-align-top lia-border-style-solid" style="border-width: 1px; padding: 8px 12px;"&gt;"Milestones" (point-in-time snapshots)&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td class="lia-border-color-custom-d6dee6 lia-vertical-align-top lia-border-style-solid" style="border-width: 1px; padding: 8px 12px;"&gt;&lt;STRONG&gt;When to use&lt;/STRONG&gt;&lt;/td&gt;&lt;td class="lia-border-color-custom-d6dee6 lia-vertical-align-top lia-border-style-solid" style="border-width: 1px; padding: 8px 12px;"&gt;Periodic deep reviews; greenfield design baselining; brownfield audits&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;colgroup&gt;&lt;col style="width: 50.00%" /&gt;&lt;col style="width: 50.00%" /&gt;&lt;/colgroup&gt;&lt;/table&gt;&lt;/DIV&gt;
&lt;P&gt;WAR is human-driven. The algorithm is essentially &lt;EM&gt;"how many of the recommended practices have you confirmed you do?"&lt;/EM&gt; — which is exactly the right algorithm when the assessor is the workload team itself.&lt;/P&gt;
&lt;H3 id="2-azure-advisor-score--telemetry-based"&gt;2. Azure Advisor Score — telemetry-based&lt;/H3&gt;
&lt;P&gt;The &lt;A href="https://learn.microsoft.com/azure/advisor/advisor-score#calculation-of-advisor-score" target="_blank"&gt;Advisor score&lt;/A&gt; is the closest thing Microsoft ships to a real, deterministic WAF &lt;EM&gt;algorithm&lt;/EM&gt;. It runs continuously over your deployed Azure resources.&lt;/P&gt;
&lt;P&gt;The math:&lt;/P&gt;
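&lt;P&gt;&lt;CODE&gt;score = healthy resources ÷ total applicable resources&lt;/CODE&gt;, calculated per subcategory and weighted.&lt;/P&gt;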
&lt;P&gt;Pillar-specific overrides:&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;&lt;STRONG&gt;Security&lt;/STRONG&gt; uses Microsoft Defender for Cloud's &lt;STRONG&gt;Secure Score&lt;/STRONG&gt; model.&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;Cost&lt;/STRONG&gt; weights by retail $ cost of healthy resources, plus age-of-recommendation weighting; postponed/dismissed items are removed from the denominator.&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;Reliability / Performance / Operational Excellence&lt;/STRONG&gt; use the healthy-resources ratio above.&lt;/LI&gt;
&lt;/UL&gt;
&lt;P&gt;Key terms:&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;&lt;EM&gt;Healthy resource&lt;/EM&gt; — a deployed resource with no open Advisor recommendation against it for that pillar.&lt;/LI&gt;
&lt;LI&gt;&lt;EM&gt;Total applicable&lt;/EM&gt; — resources Advisor was able to evaluate (excludes dismissed/snoozed).&lt;/LI&gt;
&lt;/UL&gt;
&lt;P&gt;Advisor is the right tool once you're in production. It cannot help you before deployment, because there is nothing to count as "healthy" or "applicable."&lt;/P&gt;
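&lt;P&gt;The healthy-resources ratio can be sketched as a single function. This is a simplified illustration that ignores the per-subcategory weighting and the Security and Cost overrides described above; the function name &lt;CODE&gt;healthyRatioScore&lt;/CODE&gt; and the 0–100 scaling are assumptions, not Advisor's implementation.&lt;/P&gt;
&lt;PRE class="typescript"&gt;&lt;CODE&gt;// Simplified sketch: plain ratio only, no weighting (assumption).
function healthyRatioScore(healthy: number, applicable: number): number | null {
  if (applicable === 0) {
    // Nothing deployed or evaluated yet, so there is nothing to score.
    return null;
  }
  return (healthy / applicable) * 100;
}&lt;/CODE&gt;&lt;/PRE&gt;
&lt;P&gt;The &lt;CODE&gt;null&lt;/CODE&gt; branch is the pre-deployment case: with zero applicable resources the ratio is undefined.&lt;/P&gt;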
&lt;HR /&gt;
&lt;H2 id="the-missing-stage-design-time"&gt;The missing stage: design time&lt;/H2&gt;
&lt;P&gt;Here's the lifecycle, with the stage each tool covers:&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;&lt;STRONG&gt;Design / Diagram&lt;/STRONG&gt;: &lt;EM&gt;Diagram Builder validation&lt;/EM&gt; runs here.&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;Operate / Observe&lt;/STRONG&gt;: &lt;EM&gt;Azure Advisor&lt;/EM&gt; runs here continuously.&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;Periodic Review&lt;/STRONG&gt;: &lt;EM&gt;WAR&lt;/EM&gt; runs here, typically quarterly or at major milestones.&lt;/LI&gt;
&lt;/UL&gt;
&lt;P&gt;&lt;STRONG&gt;These three stages are sequential and complementary. Our app does not replace Advisor or WAR — it adds a feedback loop earlier in the lifecycle, where corrections are cheapest.&lt;/STRONG&gt;&lt;/P&gt;
&lt;HR /&gt;
&lt;H2 id="how-design-time-validation-works-in-the-diagram-builder"&gt;How design-time validation works in the Azure Architecture Diagram Builder&lt;/H2&gt;
&lt;P&gt;The validator is a &lt;STRONG&gt;two-phase hybrid pipeline&lt;/STRONG&gt;: deterministic local rules first, then LLM refinement. The full source lives in three files:&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;&lt;A href="https://techcommunity.microsoft.com/../src/services/architectureValidator.ts" target="_blank"&gt;&lt;CODE&gt;src/services/architectureValidator.ts&lt;/CODE&gt;&lt;/A&gt; — orchestrator and prompt&lt;/LI&gt;
&lt;LI&gt;&lt;A href="https://techcommunity.microsoft.com/../src/services/wafPatternDetector.ts" target="_blank"&gt;&lt;CODE&gt;src/services/wafPatternDetector.ts&lt;/CODE&gt;&lt;/A&gt; — topology + service rule engine&lt;/LI&gt;
&lt;LI&gt;&lt;A href="https://techcommunity.microsoft.com/../src/data/wafRules.ts" target="_blank"&gt;&lt;CODE&gt;src/data/wafRules.ts&lt;/CODE&gt;&lt;/A&gt; — the rule knowledge base&lt;/LI&gt;
&lt;/UL&gt;
&lt;H3 id="phase-1--deterministic-rule-pre-scan-1-ms-no-llm"&gt;Phase 1 — Deterministic rule pre-scan (~1 ms, no LLM)&lt;/H3&gt;
&lt;P&gt;When you click &lt;STRONG&gt;Validate Architecture&lt;/STRONG&gt;, the validator runs a fully client-side rule engine against the diagram's services, connections, and groups. There are two kinds of rules:&lt;/P&gt;
&lt;H4 id="architecture-pattern-rules"&gt;Architecture-pattern rules&lt;/H4&gt;
&lt;P&gt;&lt;STRONG&gt;These fire when a topology anti-pattern is detected:&lt;/STRONG&gt;&lt;/P&gt;
&lt;DIV class="styles_lia-table-wrapper__h6Xo9 styles_table-responsive__MW0lN"&gt;&lt;table class="lia-border-color-custom-d6dee6 lia-border-style-solid" border="1" style="width: 100%; border-width: 1px;"&gt;&lt;thead&gt;&lt;tr&gt;&lt;th class="lia-border-color-custom-0078d4 lia-border-style-solid" style="border-width: 1px; padding: 10px 12px;"&gt;Pattern&lt;/th&gt;&lt;th class="lia-border-color-custom-0078d4 lia-border-style-solid" style="border-width: 1px; padding: 10px 12px;"&gt;Detection trigger&lt;/th&gt;&lt;/tr&gt;&lt;/thead&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td class="lia-border-color-custom-d6dee6 lia-vertical-align-top lia-border-style-solid" style="border-width: 1px; padding: 8px 12px;"&gt;&lt;CODE&gt;single-region&lt;/CODE&gt;&lt;/td&gt;&lt;td class="lia-border-color-custom-d6dee6 lia-vertical-align-top lia-border-style-solid" style="border-width: 1px; padding: 8px 12px;"&gt;No global LB (Traffic Manager / Front Door) with ≥3 services&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td class="lia-border-color-custom-d6dee6 lia-vertical-align-top lia-border-style-solid" style="border-width: 1px; padding: 8px 12px;"&gt;&lt;CODE&gt;single-database&lt;/CODE&gt;&lt;/td&gt;&lt;td class="lia-border-color-custom-d6dee6 lia-vertical-align-top lia-border-style-solid" style="border-width: 1px; padding: 8px 12px;"&gt;Exactly one database service, no replication signal&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td class="lia-border-color-custom-d6dee6 lia-vertical-align-top lia-border-style-solid" style="border-width: 1px; padding: 8px 12px;"&gt;&lt;CODE&gt;no-cache&lt;/CODE&gt;&lt;/td&gt;&lt;td class="lia-border-color-custom-d6dee6 lia-vertical-align-top lia-border-style-solid" style="border-width: 1px; padding: 8px 12px;"&gt;Compute + database present, no Redis/CDN&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td class="lia-border-color-custom-d6dee6 lia-vertical-align-top lia-border-style-solid" style="border-width: 1px; padding: 8px 12px;"&gt;&lt;CODE&gt;no-monitoring&lt;/CODE&gt;&lt;/td&gt;&lt;td 
class="lia-border-color-custom-d6dee6 lia-vertical-align-top lia-border-style-solid" style="border-width: 1px; padding: 8px 12px;"&gt;No Azure Monitor / App Insights / Log Analytics&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td class="lia-border-color-custom-d6dee6 lia-vertical-align-top lia-border-style-solid" style="border-width: 1px; padding: 8px 12px;"&gt;&lt;CODE&gt;no-identity&lt;/CODE&gt;&lt;/td&gt;&lt;td class="lia-border-color-custom-d6dee6 lia-vertical-align-top lia-border-style-solid" style="border-width: 1px; padding: 8px 12px;"&gt;No Microsoft Entra ID&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td class="lia-border-color-custom-d6dee6 lia-vertical-align-top lia-border-style-solid" style="border-width: 1px; padding: 8px 12px;"&gt;&lt;CODE&gt;no-waf&lt;/CODE&gt;&lt;/td&gt;&lt;td class="lia-border-color-custom-d6dee6 lia-vertical-align-top lia-border-style-solid" style="border-width: 1px; padding: 8px 12px;"&gt;Public web tier without WAF / Front Door / App Gateway&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td class="lia-border-color-custom-d6dee6 lia-vertical-align-top lia-border-style-solid" style="border-width: 1px; padding: 8px 12px;"&gt;&lt;CODE&gt;direct-db-access&lt;/CODE&gt;&lt;/td&gt;&lt;td class="lia-border-color-custom-d6dee6 lia-vertical-align-top lia-border-style-solid" style="border-width: 1px; padding: 8px 12px;"&gt;An edge from a frontend service directly into a database&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td class="lia-border-color-custom-d6dee6 lia-vertical-align-top lia-border-style-solid" style="border-width: 1px; padding: 8px 12px;"&gt;&lt;CODE&gt;no-key-vault&lt;/CODE&gt;&lt;/td&gt;&lt;td class="lia-border-color-custom-d6dee6 lia-vertical-align-top lia-border-style-solid" style="border-width: 1px; padding: 8px 12px;"&gt;4+ services and no Key Vault&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td class="lia-border-color-custom-d6dee6 lia-vertical-align-top lia-border-style-solid" style="border-width: 1px; padding: 8px 
12px;"&gt;&lt;CODE&gt;no-backup&lt;/CODE&gt;&lt;/td&gt;&lt;td class="lia-border-color-custom-d6dee6 lia-vertical-align-top lia-border-style-solid" style="border-width: 1px; padding: 8px 12px;"&gt;Database present, no Azure Backup / Recovery Services&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td class="lia-border-color-custom-d6dee6 lia-vertical-align-top lia-border-style-solid" style="border-width: 1px; padding: 8px 12px;"&gt;&lt;CODE&gt;no-api-gateway&lt;/CODE&gt;&lt;/td&gt;&lt;td class="lia-border-color-custom-d6dee6 lia-vertical-align-top lia-border-style-solid" style="border-width: 1px; padding: 8px 12px;"&gt;2+ compute services and no APIM / App Gateway / Front Door&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;colgroup&gt;&lt;col style="width: 50.00%" /&gt;&lt;col style="width: 50.00%" /&gt;&lt;/colgroup&gt;&lt;/table&gt;&lt;/DIV&gt;
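&lt;P&gt;A few of the triggers in the table can be sketched as simple checks over the diagram's service list. This is an illustrative sketch only: the &lt;CODE&gt;DiagramService&lt;/CODE&gt; shape, the normalized type strings, and both function names are assumptions, and the actual detector in &lt;CODE&gt;wafPatternDetector.ts&lt;/CODE&gt; may work differently.&lt;/P&gt;
&lt;PRE class="typescript"&gt;&lt;CODE&gt;// Hypothetical diagram model; type strings are assumed, not taken from the app.
interface DiagramService {
  name: string;
  type: string; // normalized service type, e.g. "app-service"
}

// True if at least one service has one of the given types.
function hasAnyType(services: DiagramService[], types: string[]): boolean {
  return services.some((service) => types.includes(service.type));
}

// Covers three of the ten pattern rules from the table.
function detectPatterns(services: DiagramService[]): string[] {
  const patterns: string[] = [];
  const monitoring = ["azure-monitor", "application-insights", "log-analytics"];
  if (!hasAnyType(services, monitoring)) {
    patterns.push("no-monitoring");
  }
  if (!hasAnyType(services, ["entra-id"])) {
    patterns.push("no-identity");
  }
  // Table trigger: 4+ services and no Key Vault.
  if (services.length >= 4) {
    if (!hasAnyType(services, ["key-vault"])) {
      patterns.push("no-key-vault");
    }
  }
  return patterns;
}&lt;/CODE&gt;&lt;/PRE&gt;
&lt;P&gt;For a diagram with Front Door, two App Service instances, a SQL Database, and Application Insights, this sketch reports &lt;CODE&gt;no-identity&lt;/CODE&gt; and &lt;CODE&gt;no-key-vault&lt;/CODE&gt;.&lt;/P&gt;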
&lt;H4 id="service-specific-rules"&gt;Service-specific rules&lt;/H4&gt;
&lt;P&gt;&lt;STRONG&gt;Every service in the generated Azure Architecture diagram is matched against&amp;nbsp;&lt;CODE&gt;SERVICE_SPECIFIC_RULES&lt;/CODE&gt; by normalized type: App Service, Functions, AKS, Cosmos DB, SQL Database, Storage, Key Vault, and 22 more.&lt;/STRONG&gt;&lt;/P&gt;
&lt;H4 id="the-knowledge-base-at-a-glance"&gt;The knowledge base at a glance&lt;/H4&gt;
&lt;DIV class="styles_lia-table-wrapper__h6Xo9 styles_table-responsive__MW0lN"&gt;&lt;table class="lia-border-color-custom-d6dee6 lia-border-style-solid" border="1" style="width: 100%; border-width: 1px;"&gt;&lt;thead&gt;&lt;tr&gt;&lt;th class="lia-border-color-custom-0078d4 lia-border-style-solid" style="border-width: 1px; padding: 10px 12px;"&gt;Metric&lt;/th&gt;&lt;th class="lia-border-color-custom-0078d4 lia-border-style-solid" style="border-width: 1px; padding: 10px 12px;"&gt;Count&lt;/th&gt;&lt;/tr&gt;&lt;/thead&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td class="lia-border-color-custom-d6dee6 lia-vertical-align-top lia-border-style-solid" style="border-width: 1px; padding: 8px 12px;"&gt;Total rules&lt;/td&gt;&lt;td class="lia-border-color-custom-d6dee6 lia-vertical-align-top lia-border-style-solid" style="border-width: 1px; padding: 8px 12px;"&gt;&lt;STRONG&gt;73&lt;/STRONG&gt;&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td class="lia-border-color-custom-d6dee6 lia-vertical-align-top lia-border-style-solid" style="border-width: 1px; padding: 8px 12px;"&gt;Architecture-pattern rules&lt;/td&gt;&lt;td class="lia-border-color-custom-d6dee6 lia-vertical-align-top lia-border-style-solid" style="border-width: 1px; padding: 8px 12px;"&gt;&lt;STRONG&gt;10&lt;/STRONG&gt;&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td class="lia-border-color-custom-d6dee6 lia-vertical-align-top lia-border-style-solid" style="border-width: 1px; padding: 8px 12px;"&gt;Service-specific rules&lt;/td&gt;&lt;td class="lia-border-color-custom-d6dee6 lia-vertical-align-top lia-border-style-solid" style="border-width: 1px; padding: 8px 12px;"&gt;&lt;STRONG&gt;63&lt;/STRONG&gt;&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td class="lia-border-color-custom-d6dee6 lia-vertical-align-top lia-border-style-solid" style="border-width: 1px; padding: 8px 12px;"&gt;Distinct Azure services covered&lt;/td&gt;&lt;td class="lia-border-color-custom-d6dee6 lia-vertical-align-top lia-border-style-solid" style="border-width: 1px; padding: 8px 
12px;"&gt;&lt;STRONG&gt;29&lt;/STRONG&gt;&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td class="lia-border-color-custom-d6dee6 lia-vertical-align-top lia-border-style-solid" style="border-width: 1px; padding: 8px 12px;"&gt;Rules tagged &lt;EM&gt;Reliability&lt;/EM&gt;&lt;/td&gt;&lt;td class="lia-border-color-custom-d6dee6 lia-vertical-align-top lia-border-style-solid" style="border-width: 1px; padding: 8px 12px;"&gt;18&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td class="lia-border-color-custom-d6dee6 lia-vertical-align-top lia-border-style-solid" style="border-width: 1px; padding: 8px 12px;"&gt;Rules tagged &lt;EM&gt;Security&lt;/EM&gt;&lt;/td&gt;&lt;td class="lia-border-color-custom-d6dee6 lia-vertical-align-top lia-border-style-solid" style="border-width: 1px; padding: 8px 12px;"&gt;34&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td class="lia-border-color-custom-d6dee6 lia-vertical-align-top lia-border-style-solid" style="border-width: 1px; padding: 8px 12px;"&gt;Rules tagged &lt;EM&gt;Cost Optimization&lt;/EM&gt;&lt;/td&gt;&lt;td class="lia-border-color-custom-d6dee6 lia-vertical-align-top lia-border-style-solid" style="border-width: 1px; padding: 8px 12px;"&gt;5&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td class="lia-border-color-custom-d6dee6 lia-vertical-align-top lia-border-style-solid" style="border-width: 1px; padding: 8px 12px;"&gt;Rules tagged &lt;EM&gt;Operational Excellence&lt;/EM&gt;&lt;/td&gt;&lt;td class="lia-border-color-custom-d6dee6 lia-vertical-align-top lia-border-style-solid" style="border-width: 1px; padding: 8px 12px;"&gt;7&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td class="lia-border-color-custom-d6dee6 lia-vertical-align-top lia-border-style-solid" style="border-width: 1px; padding: 8px 12px;"&gt;Rules tagged &lt;EM&gt;Performance Efficiency&lt;/EM&gt;&lt;/td&gt;&lt;td class="lia-border-color-custom-d6dee6 lia-vertical-align-top lia-border-style-solid" style="border-width: 1px; padding: 8px 12px;"&gt;9&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;colgroup&gt;&lt;col style="width: 
50.00%" /&gt;&lt;col style="width: 50.00%" /&gt;&lt;/colgroup&gt;&lt;/table&gt;&lt;/DIV&gt;
&lt;H4 id="the-preliminary-score"&gt;The preliminary score&lt;/H4&gt;
&lt;P&gt;Each finding has a severity, and severity drives a fixed point deduction from a starting score of 100:&lt;/P&gt;
&lt;DIV class="styles_lia-table-wrapper__h6Xo9 styles_table-responsive__MW0lN"&gt;&lt;table class="lia-border-color-custom-d6dee6 lia-border-style-solid" border="1" style="width: 100%; border-width: 1px;"&gt;&lt;thead&gt;&lt;tr&gt;&lt;th class="lia-border-color-custom-0078d4 lia-border-style-solid" style="border-width: 1px; padding: 10px 12px;"&gt;Severity&lt;/th&gt;&lt;th class="lia-border-color-custom-0078d4 lia-border-style-solid" style="border-width: 1px; padding: 10px 12px;"&gt;Deduction&lt;/th&gt;&lt;/tr&gt;&lt;/thead&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td class="lia-border-color-custom-d6dee6 lia-vertical-align-top lia-border-style-solid" style="border-width: 1px; padding: 8px 12px;"&gt;&lt;SPAN style="display: inline-block; padding: 2px 8px; border-radius: 10px; background: #FEE2E2; color: #b91c1c; font-weight: 600; font-size: 12px; font-family: 'Segoe UI',Arial,sans-serif;"&gt;critical&lt;/SPAN&gt;&lt;/td&gt;&lt;td class="lia-border-color-custom-d6dee6 lia-vertical-align-top lia-border-style-solid" style="border-width: 1px; padding: 8px 12px;"&gt;−12&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td class="lia-border-color-custom-d6dee6 lia-vertical-align-top lia-border-style-solid" style="border-width: 1px; padding: 8px 12px;"&gt;&lt;SPAN style="display: inline-block; padding: 2px 8px; border-radius: 10px; background: #FFEDD5; color: #c2410c; font-weight: 600; font-size: 12px; font-family: 'Segoe UI',Arial,sans-serif;"&gt;high&lt;/SPAN&gt;&lt;/td&gt;&lt;td class="lia-border-color-custom-d6dee6 lia-vertical-align-top lia-border-style-solid" style="border-width: 1px; padding: 8px 12px;"&gt;−7&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td class="lia-border-color-custom-d6dee6 lia-vertical-align-top lia-border-style-solid" style="border-width: 1px; padding: 8px 12px;"&gt;&lt;SPAN style="display: inline-block; padding: 2px 8px; border-radius: 10px; background: #FEF3C7; color: #a16207; font-weight: 600; font-size: 12px; font-family: 'Segoe 
UI',Arial,sans-serif;"&gt;medium&lt;/SPAN&gt;&lt;/td&gt;&lt;td class="lia-border-color-custom-d6dee6 lia-vertical-align-top lia-border-style-solid" style="border-width: 1px; padding: 8px 12px;"&gt;−3&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td class="lia-border-color-custom-d6dee6 lia-vertical-align-top lia-border-style-solid" style="border-width: 1px; padding: 8px 12px;"&gt;&lt;SPAN style="display: inline-block; padding: 2px 8px; border-radius: 10px; background: #DCFCE7; color: #15803d; font-weight: 600; font-size: 12px; font-family: 'Segoe UI',Arial,sans-serif;"&gt;low&lt;/SPAN&gt;&lt;/td&gt;&lt;td class="lia-border-color-custom-d6dee6 lia-vertical-align-top lia-border-style-solid" style="border-width: 1px; padding: 8px 12px;"&gt;−1&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;colgroup&gt;&lt;col style="width: 50.00%" /&gt;&lt;col style="width: 50.00%" /&gt;&lt;/colgroup&gt;&lt;/table&gt;&lt;/DIV&gt;
&lt;P&gt;The result is clamped to the range 10–95. The floor means even a deliberately bad architecture scores at least 10; the cap reflects that no findings ≠ perfect, since there's always something the model might still catch. This is the &lt;STRONG&gt;deterministic baseline&lt;/STRONG&gt; before the LLM ever sees the architecture, and it's what makes the pipeline reproducible.&lt;/P&gt;
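&lt;P&gt;The deduction table and the 10–95 bounds can be sketched in a few lines of TypeScript. This is a minimal illustration, not the app's source: the names &lt;CODE&gt;Finding&lt;/CODE&gt;, &lt;CODE&gt;deductionFor&lt;/CODE&gt;, and &lt;CODE&gt;preliminaryScore&lt;/CODE&gt; are hypothetical, and only the point values and bounds come from the text.&lt;/P&gt;
&lt;PRE class="typescript"&gt;&lt;CODE&gt;// Hypothetical names; deduction values and bounds are taken from the post.
type Severity = "critical" | "high" | "medium" | "low";

interface Finding {
  severity: Severity;
}

// Points deducted for one finding of the given severity.
function deductionFor(severity: Severity): number {
  switch (severity) {
    case "critical":
      return 12;
    case "high":
      return 7;
    case "medium":
      return 3;
    case "low":
      return 1;
  }
}

// Start at 100, subtract per finding, then clamp to the range 10..95.
function preliminaryScore(findings: Finding[]): number {
  let score = 100;
  for (const finding of findings) {
    score -= deductionFor(finding.severity);
  }
  return Math.min(95, Math.max(10, score));
}&lt;/CODE&gt;&lt;/PRE&gt;
&lt;P&gt;One critical, one high, three medium, and three low findings give 100 − 12 − 7 − 9 − 3 = 69, while an empty findings list is capped at 95.&lt;/P&gt;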
&lt;H3 id="phase-2--llm-contextual-refinement"&gt;Phase 2 — LLM contextual refinement&lt;/H3&gt;
&lt;P&gt;&lt;STRONG&gt;The pre-scan output, the topology, and the optional natural-language description are folded into a focused prompt sent to one of seven Azure OpenAI models (GPT-5.1 through 5.4, GPT-5.x Codex variants, DeepSeek V3.2 Speciale, Grok 4.1 Fast). The system prompt gives the model explicit scoring guardrails:&lt;/STRONG&gt;&lt;/P&gt;
&lt;BLOCKQUOTE&gt;
&lt;UL&gt;
&lt;LI&gt;Score based on what IS present, not what COULD be added.&lt;/LI&gt;
&lt;LI&gt;A well-connected architecture with appropriate services should score 60–80.&lt;/LI&gt;
&lt;LI&gt;Score below 50 only for critical gaps (no auth, no monitoring, single points of failure).&lt;/LI&gt;
&lt;LI&gt;Findings are improvement suggestions, not reasons to penalize the score severely.&lt;/LI&gt;
&lt;/UL&gt;
&lt;/BLOCKQUOTE&gt;
&lt;P&gt;The model returns strict JSON:&lt;/P&gt;
&lt;PRE class="jsonc"&gt;&lt;CODE&gt;{
  "overallScore": 0-100,
  "summary": "2–3 sentence assessment",
  "pillars": [
    {
      "pillar": "Reliability | Security | Cost Optimization | Operational Excellence | Performance Efficiency",
      "score": 0-100,
      "findings": [
        {
          "severity": "critical | high | medium | low",
          "category": "...",
          "issue": "...",
          "recommendation": "...",
          "resources": ["service-name-1", "service-name-2"],
          "source": "rule-based | ai-analysis"
        }
      ]
    }
  ],
  "quickWins": [ /* same shape as findings */ ]
}&lt;/CODE&gt;&lt;/PRE&gt;
&lt;P&gt;Two things to call out:&lt;/P&gt;
&lt;OL type="1"&gt;
&lt;LI&gt;&lt;STRONG&gt;Every finding is tagged &lt;CODE&gt;rule-based&lt;/CODE&gt; or &lt;CODE&gt;ai-analysis&lt;/CODE&gt;.&lt;/STRONG&gt; That tag is the credibility lever. You can always see what the deterministic engine produced versus what the model contributed on top. If you don't trust the AI layer, you can ignore it entirely — the rule layer still stands.&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;The LLM is given pattern hints, not the entire rule catalog.&lt;/STRONG&gt; The prompt stays small and focused, which is roughly 3–5× faster and cheaper than asking the LLM to do everything from scratch.&lt;/LI&gt;
&lt;/OL&gt;
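&lt;P&gt;The JSON shape above maps naturally onto TypeScript types, and the &lt;CODE&gt;source&lt;/CODE&gt; tag makes it straightforward to keep only the deterministic layer. The interface and function names below are hypothetical; this is a sketch that assumes the response has already been parsed and matches the documented shape.&lt;/P&gt;
&lt;PRE class="typescript"&gt;&lt;CODE&gt;// Hypothetical type names mirroring the documented JSON fields.
type Source = "rule-based" | "ai-analysis";

interface Finding {
  severity: "critical" | "high" | "medium" | "low";
  category: string;
  issue: string;
  recommendation: string;
  resources: string[];
  source: Source;
}

interface PillarResult {
  pillar: string;
  score: number;
  findings: Finding[];
}

interface ValidationResult {
  overallScore: number;
  summary: string;
  pillars: PillarResult[];
  quickWins: Finding[];
}

// Keep only the findings produced by the deterministic rule engine.
function ruleBasedFindings(result: ValidationResult): Finding[] {
  const kept: Finding[] = [];
  for (const pillar of result.pillars) {
    for (const finding of pillar.findings) {
      if (finding.source === "rule-based") {
        kept.push(finding);
      }
    }
  }
  return kept;
}&lt;/CODE&gt;&lt;/PRE&gt;
&lt;P&gt;Filtering on the tag rather than on severity or pillar is what lets a reader discard the AI layer without losing any rule output.&lt;/P&gt;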
&lt;H3 id="what-the-user-sees"&gt;What the user sees&lt;/H3&gt;
&lt;P&gt;On every run the modal reports:&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;&lt;STRONG&gt;Overall WAF score&lt;/STRONG&gt; (0–100)&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;Per-pillar score&lt;/STRONG&gt; × 5 (0–100 each)&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;Severity breakdown&lt;/STRONG&gt; — counts of critical / high / medium / low across all findings&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;Quick wins&lt;/STRONG&gt; — high-impact, low-effort items the model surfaces separately&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;Hybrid metadata&lt;/STRONG&gt; — local findings count, patterns detected, KB rules used, preliminary score, local elapsed ms&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;AI metrics&lt;/STRONG&gt; — model used, reasoning effort, prompt/completion/total tokens, elapsed time&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;App Insights telemetry&lt;/STRONG&gt; — an &lt;CODE&gt;Architecture_Validated&lt;/CODE&gt; event with model, overall score, finding count, elapsed time&lt;/LI&gt;
&lt;/UL&gt;
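&lt;P&gt;The severity breakdown is a straightforward count over the findings in the response. The sketch below shows one way to derive it, assuming it counts only the findings nested under &lt;CODE&gt;pillars&lt;/CODE&gt; (whether quick wins are included is not stated here). The function name and the trimmed sample data are hypothetical.&lt;/P&gt;

```python
from collections import Counter

SEVERITIES = ("critical", "high", "medium", "low")


def severity_breakdown(response):
    """Count findings per severity across every pillar, zero-filling absent levels."""
    counts = Counter(
        finding["severity"]
        for pillar in response["pillars"]
        for finding in pillar["findings"]
    )
    return {level: counts[level] for level in SEVERITIES}


# Hypothetical response trimmed to the fields this function reads.
example = {
    "pillars": [
        {"pillar": "Security", "findings": [{"severity": "critical"}, {"severity": "high"}]},
        {"pillar": "Reliability", "findings": [{"severity": "medium"}]},
    ]
}
print(severity_breakdown(example))
# {'critical': 1, 'high': 1, 'medium': 1, 'low': 0}
```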
&lt;HR /&gt;
&lt;H2 id="worked-example"&gt;Worked example&lt;/H2&gt;
&lt;P&gt;Take this prompt, which I've used in demos with partners:&lt;/P&gt;
&lt;BLOCKQUOTE&gt;
&lt;P&gt;&lt;EM&gt;"A multi-region web application: Azure Front Door in front of two App Service instances in West US 2 and East US 2, both reading from an Azure SQL Database with geo-replication, with Application Insights for telemetry. No Entra ID, no Key Vault."&lt;/EM&gt;&lt;/P&gt;
&lt;/BLOCKQUOTE&gt;
&lt;P&gt;After generation, &lt;STRONG&gt;Validate Architecture&lt;/STRONG&gt; runs:&lt;/P&gt;
&lt;P&gt;&lt;STRONG&gt;Phase 1 — pre-scan (deterministic), ~1 ms&lt;/STRONG&gt;&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;Patterns detected: &lt;CODE&gt;no-identity&lt;/CODE&gt;, &lt;CODE&gt;no-key-vault&lt;/CODE&gt;&lt;/LI&gt;
&lt;LI&gt;Findings produced: 8 (1 critical, 1 high, 3 medium, 3 low)&lt;/LI&gt;
&lt;LI&gt;Preliminary score: &lt;STRONG&gt;100 − (1×12) − (1×7) − (3×3) − (3×1) = 69&lt;/STRONG&gt; (count × per-severity deduction)&lt;/LI&gt;
&lt;/UL&gt;
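&lt;P&gt;The preliminary score arithmetic can be sketched in a few lines of Python. The deductions of 12, 7, 3, and 1 points come from the calculation above; the function name and the clamp at zero are assumptions, since the post does not say what happens when deductions exceed 100.&lt;/P&gt;

```python
# Points deducted per finding, by severity (12/7/3/1 as in the example above).
SEVERITY_WEIGHTS = {"critical": 12, "high": 7, "medium": 3, "low": 1}


def preliminary_score(counts):
    """Start at 100 and subtract a fixed deduction for each finding.

    Clamping at zero is an assumption made for this sketch.
    """
    deduction = sum(SEVERITY_WEIGHTS[severity] * n for severity, n in counts.items())
    return max(0, 100 - deduction)


# The eight pre-scan findings from the worked example.
print(preliminary_score({"critical": 1, "high": 1, "medium": 3, "low": 3}))  # 69
```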
&lt;P&gt;&lt;STRONG&gt;Phase 2 — LLM refinement, ~6–9 s depending on model&lt;/STRONG&gt;&lt;/P&gt;
&lt;P&gt;The model accepts the two pattern hints, validates them in context, and adds three more findings of its own:&lt;/P&gt;
&lt;DIV class="styles_lia-table-wrapper__h6Xo9 styles_table-responsive__MW0lN"&gt;&lt;table class="lia-border-color-custom-d6dee6 lia-border-style-solid" border="1" style="width: 100%; border-width: 1px;"&gt;&lt;thead&gt;&lt;tr&gt;&lt;th class="lia-border-color-custom-0078d4 lia-border-style-solid" style="border-width: 1px; padding: 10px 12px;"&gt;Finding&lt;/th&gt;&lt;th class="lia-border-color-custom-0078d4 lia-border-style-solid" style="border-width: 1px; padding: 10px 12px;"&gt;Source&lt;/th&gt;&lt;th class="lia-border-color-custom-0078d4 lia-border-style-solid" style="border-width: 1px; padding: 10px 12px;"&gt;Pillar&lt;/th&gt;&lt;th class="lia-border-color-custom-0078d4 lia-border-style-solid" style="border-width: 1px; padding: 10px 12px;"&gt;Severity&lt;/th&gt;&lt;/tr&gt;&lt;/thead&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td class="lia-border-color-custom-d6dee6 lia-vertical-align-top lia-border-style-solid" style="border-width: 1px; padding: 8px 12px;"&gt;No Microsoft Entra ID for authentication&lt;/td&gt;&lt;td class="lia-border-color-custom-d6dee6 lia-vertical-align-top lia-border-style-solid" style="border-width: 1px; padding: 8px 12px;"&gt;&lt;SPAN style="display: inline-block; padding: 2px 8px; border-radius: 10px; background: #DBEAFE; color: #1e3a8a; font-weight: 600; font-size: 12px; font-family: 'Segoe UI',Arial,sans-serif;"&gt;rule-based&lt;/SPAN&gt;&lt;/td&gt;&lt;td class="lia-border-color-custom-d6dee6 lia-vertical-align-top lia-border-style-solid" style="border-width: 1px; padding: 8px 12px;"&gt;Security&lt;/td&gt;&lt;td class="lia-border-color-custom-d6dee6 lia-vertical-align-top lia-border-style-solid" style="border-width: 1px; padding: 8px 12px;"&gt;&lt;SPAN style="display: inline-block; padding: 2px 8px; border-radius: 10px; background: #FEE2E2; color: #b91c1c; font-weight: 600; font-size: 12px; font-family: 'Segoe UI',Arial,sans-serif;"&gt;critical&lt;/SPAN&gt;&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td class="lia-border-color-custom-d6dee6 
lia-vertical-align-top lia-border-style-solid" style="border-width: 1px; padding: 8px 12px;"&gt;No Key Vault for secret management&lt;/td&gt;&lt;td class="lia-border-color-custom-d6dee6 lia-vertical-align-top lia-border-style-solid" style="border-width: 1px; padding: 8px 12px;"&gt;&lt;SPAN style="display: inline-block; padding: 2px 8px; border-radius: 10px; background: #DBEAFE; color: #1e3a8a; font-weight: 600; font-size: 12px; font-family: 'Segoe UI',Arial,sans-serif;"&gt;rule-based&lt;/SPAN&gt;&lt;/td&gt;&lt;td class="lia-border-color-custom-d6dee6 lia-vertical-align-top lia-border-style-solid" style="border-width: 1px; padding: 8px 12px;"&gt;Security&lt;/td&gt;&lt;td class="lia-border-color-custom-d6dee6 lia-vertical-align-top lia-border-style-solid" style="border-width: 1px; padding: 8px 12px;"&gt;&lt;SPAN style="display: inline-block; padding: 2px 8px; border-radius: 10px; background: #FFEDD5; color: #c2410c; font-weight: 600; font-size: 12px; font-family: 'Segoe UI',Arial,sans-serif;"&gt;high&lt;/SPAN&gt;&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td class="lia-border-color-custom-d6dee6 lia-vertical-align-top lia-border-style-solid" style="border-width: 1px; padding: 8px 12px;"&gt;App Service slots not used for safe deploys&lt;/td&gt;&lt;td class="lia-border-color-custom-d6dee6 lia-vertical-align-top lia-border-style-solid" style="border-width: 1px; padding: 8px 12px;"&gt;&lt;SPAN style="display: inline-block; padding: 2px 8px; border-radius: 10px; background: #F3E8FF; color: #6b21a8; font-weight: 600; font-size: 12px; font-family: 'Segoe UI',Arial,sans-serif;"&gt;ai-analysis&lt;/SPAN&gt;&lt;/td&gt;&lt;td class="lia-border-color-custom-d6dee6 lia-vertical-align-top lia-border-style-solid" style="border-width: 1px; padding: 8px 12px;"&gt;Operational Excellence&lt;/td&gt;&lt;td class="lia-border-color-custom-d6dee6 lia-vertical-align-top lia-border-style-solid" style="border-width: 1px; padding: 8px 12px;"&gt;&lt;SPAN style="display: inline-block; padding: 2px 8px; 
border-radius: 10px; background: #FEF3C7; color: #a16207; font-weight: 600; font-size: 12px; font-family: 'Segoe UI',Arial,sans-serif;"&gt;medium&lt;/SPAN&gt;&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td class="lia-border-color-custom-d6dee6 lia-vertical-align-top lia-border-style-solid" style="border-width: 1px; padding: 8px 12px;"&gt;SQL DB geo-replication present but RTO/RPO not documented&lt;/td&gt;&lt;td class="lia-border-color-custom-d6dee6 lia-vertical-align-top lia-border-style-solid" style="border-width: 1px; padding: 8px 12px;"&gt;&lt;SPAN style="display: inline-block; padding: 2px 8px; border-radius: 10px; background: #F3E8FF; color: #6b21a8; font-weight: 600; font-size: 12px; font-family: 'Segoe UI',Arial,sans-serif;"&gt;ai-analysis&lt;/SPAN&gt;&lt;/td&gt;&lt;td class="lia-border-color-custom-d6dee6 lia-vertical-align-top lia-border-style-solid" style="border-width: 1px; padding: 8px 12px;"&gt;Reliability&lt;/td&gt;&lt;td class="lia-border-color-custom-d6dee6 lia-vertical-align-top lia-border-style-solid" style="border-width: 1px; padding: 8px 12px;"&gt;&lt;SPAN style="display: inline-block; padding: 2px 8px; border-radius: 10px; background: #FEF3C7; color: #a16207; font-weight: 600; font-size: 12px; font-family: 'Segoe UI',Arial,sans-serif;"&gt;medium&lt;/SPAN&gt;&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td class="lia-border-color-custom-d6dee6 lia-vertical-align-top lia-border-style-solid" style="border-width: 1px; padding: 8px 12px;"&gt;No CDN for static assets behind Front Door&lt;/td&gt;&lt;td class="lia-border-color-custom-d6dee6 lia-vertical-align-top lia-border-style-solid" style="border-width: 1px; padding: 8px 12px;"&gt;&lt;SPAN style="display: inline-block; padding: 2px 8px; border-radius: 10px; background: #F3E8FF; color: #6b21a8; font-weight: 600; font-size: 12px; font-family: 'Segoe UI',Arial,sans-serif;"&gt;ai-analysis&lt;/SPAN&gt;&lt;/td&gt;&lt;td class="lia-border-color-custom-d6dee6 lia-vertical-align-top lia-border-style-solid" 
style="border-width: 1px; padding: 8px 12px;"&gt;Performance Efficiency&lt;/td&gt;&lt;td class="lia-border-color-custom-d6dee6 lia-vertical-align-top lia-border-style-solid" style="border-width: 1px; padding: 8px 12px;"&gt;&lt;SPAN style="display: inline-block; padding: 2px 8px; border-radius: 10px; background: #DCFCE7; color: #15803d; font-weight: 600; font-size: 12px; font-family: 'Segoe UI',Arial,sans-serif;"&gt;low&lt;/SPAN&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;colgroup&gt;&lt;col style="width: 25.00%" /&gt;&lt;col style="width: 25.00%" /&gt;&lt;col style="width: 25.00%" /&gt;&lt;col style="width: 25.00%" /&gt;&lt;/colgroup&gt;&lt;/table&gt;&lt;/DIV&gt;
&lt;P&gt;Final scores returned by the model:&lt;/P&gt;
&lt;DIV class="styles_lia-table-wrapper__h6Xo9 styles_table-responsive__MW0lN"&gt;&lt;table class="lia-border-color-custom-d6dee6 lia-border-style-solid" border="1" style="width: 100%; border-width: 1px;"&gt;&lt;thead&gt;&lt;tr&gt;&lt;th class="lia-border-color-custom-0078d4 lia-border-style-solid" style="border-width: 1px; padding: 10px 12px;"&gt;Pillar&lt;/th&gt;&lt;th class="lia-border-color-custom-0078d4 lia-border-style-solid" style="border-width: 1px; padding: 10px 12px;"&gt;Score&lt;/th&gt;&lt;/tr&gt;&lt;/thead&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td class="lia-border-color-custom-d6dee6 lia-vertical-align-top lia-border-style-solid" style="border-width: 1px; padding: 8px 12px;"&gt;Reliability&lt;/td&gt;&lt;td class="lia-border-color-custom-d6dee6 lia-vertical-align-top lia-border-style-solid" style="border-width: 1px; padding: 8px 12px;"&gt;78&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td class="lia-border-color-custom-d6dee6 lia-vertical-align-top lia-border-style-solid" style="border-width: 1px; padding: 8px 12px;"&gt;Security&lt;/td&gt;&lt;td class="lia-border-color-custom-d6dee6 lia-vertical-align-top lia-border-style-solid" style="border-width: 1px; padding: 8px 12px;"&gt;52&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td class="lia-border-color-custom-d6dee6 lia-vertical-align-top lia-border-style-solid" style="border-width: 1px; padding: 8px 12px;"&gt;Cost Optimization&lt;/td&gt;&lt;td class="lia-border-color-custom-d6dee6 lia-vertical-align-top lia-border-style-solid" style="border-width: 1px; padding: 8px 12px;"&gt;80&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td class="lia-border-color-custom-d6dee6 lia-vertical-align-top lia-border-style-solid" style="border-width: 1px; padding: 8px 12px;"&gt;Operational Excellence&lt;/td&gt;&lt;td class="lia-border-color-custom-d6dee6 lia-vertical-align-top lia-border-style-solid" style="border-width: 1px; padding: 8px 12px;"&gt;70&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td class="lia-border-color-custom-d6dee6 lia-vertical-align-top 
lia-border-style-solid" style="border-width: 1px; padding: 8px 12px;"&gt;Performance Efficiency&lt;/td&gt;&lt;td class="lia-border-color-custom-d6dee6 lia-vertical-align-top lia-border-style-solid" style="border-width: 1px; padding: 8px 12px;"&gt;75&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td class="lia-border-color-custom-d6dee6 lia-vertical-align-top lia-border-style-solid" style="border-width: 1px; padding: 8px 12px;"&gt;&lt;STRONG&gt;Overall&lt;/STRONG&gt;&lt;/td&gt;&lt;td class="lia-border-color-custom-d6dee6 lia-vertical-align-top lia-border-style-solid" style="border-width: 1px; padding: 8px 12px;"&gt;&lt;STRONG&gt;71&lt;/STRONG&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;colgroup&gt;&lt;col style="width: 50.00%" /&gt;&lt;col style="width: 50.00%" /&gt;&lt;/colgroup&gt;&lt;/table&gt;&lt;/DIV&gt;
&lt;P&gt;Security scores lowest because the two highest-severity findings (one critical, one high) both landed there, which is exactly what a human reviewer would flag first.&lt;/P&gt;
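&lt;P&gt;The overall score is returned by the model, and how the model derives it is not specified here. As a quick sanity check, though, an equal-weight mean of the five pillar scores reproduces the 71 in this example. The sketch below is that check only, not the tool's scoring method; the function name is hypothetical.&lt;/P&gt;

```python
PILLAR_SCORES = {
    "Reliability": 78,
    "Security": 52,
    "Cost Optimization": 80,
    "Operational Excellence": 70,
    "Performance Efficiency": 75,
}


def summarize(scores):
    """Return the rounded unweighted mean and the lowest-scoring pillar."""
    mean = round(sum(scores.values()) / len(scores))
    weakest = min(scores, key=scores.get)
    return mean, weakest


print(summarize(PILLAR_SCORES))  # (71, 'Security')
```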
&lt;HR /&gt;
&lt;H2 id="multi-model-comparison"&gt;Multi-model comparison&lt;/H2&gt;
&lt;P&gt;Because the deterministic floor is identical across runs, the &lt;STRONG&gt;Validation Comparison&lt;/STRONG&gt; view becomes a fair shootout of what each LLM adds on top of the same baseline. The same diagram is scored by all seven models, and the UI surfaces:&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;Overall score per model&lt;/LI&gt;
&lt;LI&gt;Per-pillar score per model&lt;/LI&gt;
&lt;LI&gt;Severity-count deltas&lt;/LI&gt;
&lt;LI&gt;Number of &lt;CODE&gt;ai-analysis&lt;/CODE&gt; findings each model contributed&lt;/LI&gt;
&lt;LI&gt;Quick wins each model identified&lt;/LI&gt;
&lt;/UL&gt;
&lt;P&gt;This is genuinely useful for two reasons. First, it shows that LLM scores vary — typically by ±5–10 points on the same architecture — which is exactly why we publish the &lt;CODE&gt;rule-based&lt;/CODE&gt; vs &lt;CODE&gt;ai-analysis&lt;/CODE&gt; tag. Second, it lets architects pick the model whose review style matches their own.&lt;/P&gt;
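&lt;P&gt;To make the variance concrete, here is a minimal sketch of the kind of summary such a comparison could compute. The model names and every number in &lt;CODE&gt;RESULTS&lt;/CODE&gt; are invented placeholders, not measurements, and &lt;CODE&gt;compare_models&lt;/CODE&gt; is a hypothetical helper rather than part of the tool.&lt;/P&gt;

```python
def compare_models(results):
    """Summarize overall scores and ai-analysis finding counts per model."""
    overall = {model: r["overallScore"] for model, r in results.items()}
    ai_findings = {
        model: sum(
            1
            for pillar in r["pillars"]
            for finding in pillar["findings"]
            if finding["source"] == "ai-analysis"
        )
        for model, r in results.items()
    }
    spread = max(overall.values()) - min(overall.values())
    return {"overall": overall, "spread": spread, "aiFindings": ai_findings}


# Placeholder data: three made-up models scoring the same diagram.
RESULTS = {
    "model-a": {"overallScore": 71, "pillars": [
        {"pillar": "Security", "findings": [
            {"source": "rule-based"}, {"source": "ai-analysis"}]}]},
    "model-b": {"overallScore": 66, "pillars": [
        {"pillar": "Security", "findings": [
            {"source": "rule-based"}, {"source": "ai-analysis"}, {"source": "ai-analysis"}]}]},
    "model-c": {"overallScore": 75, "pillars": [
        {"pillar": "Security", "findings": [{"source": "rule-based"}]}]},
}

report = compare_models(RESULTS)
print(report["spread"])      # 9
print(report["aiFindings"])  # {'model-a': 1, 'model-b': 2, 'model-c': 0}
```

&lt;P&gt;Because the rule-based findings are the same for every model, any spread in this summary is attributable to the LLM layer.&lt;/P&gt;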
&lt;HR /&gt;
&lt;H2 id="how-we-align-with-microsofts-algorithms"&gt;How we align with Microsoft's algorithms&lt;/H2&gt;
&lt;DIV class="styles_lia-table-wrapper__h6Xo9 styles_table-responsive__MW0lN"&gt;&lt;table class="lia-border-color-custom-d6dee6 lia-border-style-solid" border="1" style="width: 100%; border-width: 1px;"&gt;&lt;thead&gt;&lt;tr&gt;&lt;th class="lia-border-color-custom-0078d4 lia-border-style-solid" style="border-width: 1px; padding: 10px 12px;"&gt;Alignment point&lt;/th&gt;&lt;th class="lia-border-color-custom-0078d4 lia-border-style-solid" style="border-width: 1px; padding: 10px 12px;"&gt;What it means&lt;/th&gt;&lt;/tr&gt;&lt;/thead&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td class="lia-border-color-custom-d6dee6 lia-vertical-align-top lia-border-style-solid" style="border-width: 1px; padding: 8px 12px;"&gt;Same five pillars&lt;/td&gt;&lt;td class="lia-border-color-custom-d6dee6 lia-vertical-align-top lia-border-style-solid" style="border-width: 1px; padding: 8px 12px;"&gt;Identical names and scope to the official WAF&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td class="lia-border-color-custom-d6dee6 lia-vertical-align-top lia-border-style-solid" style="border-width: 1px; padding: 8px 12px;"&gt;Same source material&lt;/td&gt;&lt;td class="lia-border-color-custom-d6dee6 lia-vertical-align-top lia-border-style-solid" style="border-width: 1px; padding: 8px 12px;"&gt;Rules derived from WAF docs and Azure Architecture Center service guides&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td class="lia-border-color-custom-d6dee6 lia-vertical-align-top lia-border-style-solid" style="border-width: 1px; padding: 8px 12px;"&gt;Severity-graded findings&lt;/td&gt;&lt;td class="lia-border-color-custom-d6dee6 lia-vertical-align-top lia-border-style-solid" style="border-width: 1px; padding: 8px 12px;"&gt;Map conceptually to Advisor's high/medium/low impact recommendations&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td class="lia-border-color-custom-d6dee6 lia-vertical-align-top lia-border-style-solid" style="border-width: 1px; padding: 8px 12px;"&gt;Per-pillar + overall scoring&lt;/td&gt;&lt;td 
class="lia-border-color-custom-d6dee6 lia-vertical-align-top lia-border-style-solid" style="border-width: 1px; padding: 8px 12px;"&gt;Mirrors WAR/Advisor output shape, so the results feel familiar&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;colgroup&gt;&lt;col style="width: 50.00%" /&gt;&lt;col style="width: 50.00%" /&gt;&lt;/colgroup&gt;&lt;/table&gt;&lt;/DIV&gt;
&lt;H2 id="where-we-deliberately-differ--and-why"&gt;Where we deliberately differ — and why&lt;/H2&gt;
&lt;DIV class="styles_lia-table-wrapper__h6Xo9 styles_table-responsive__MW0lN"&gt;&lt;table class="lia-border-color-custom-d6dee6 lia-border-style-solid" border="1" style="width: 100%; border-width: 1px;"&gt;&lt;thead&gt;&lt;tr&gt;&lt;th class="lia-border-color-custom-0078d4 lia-border-style-solid" style="border-width: 1px; padding: 10px 12px;"&gt;Concern&lt;/th&gt;&lt;th class="lia-border-color-custom-0078d4 lia-border-style-solid" style="border-width: 1px; padding: 10px 12px;"&gt;Microsoft&lt;/th&gt;&lt;th class="lia-border-color-custom-0078d4 lia-border-style-solid" style="border-width: 1px; padding: 10px 12px;"&gt;Diagram Builder&lt;/th&gt;&lt;th class="lia-border-color-custom-0078d4 lia-border-style-solid" style="border-width: 1px; padding: 10px 12px;"&gt;Why we differ&lt;/th&gt;&lt;/tr&gt;&lt;/thead&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td class="lia-border-color-custom-d6dee6 lia-vertical-align-top lia-border-style-solid" style="border-width: 1px; padding: 8px 12px;"&gt;Needs deployed resources&lt;/td&gt;&lt;td class="lia-border-color-custom-d6dee6 lia-vertical-align-top lia-border-style-solid" style="border-width: 1px; padding: 8px 12px;"&gt;Advisor: yes&lt;/td&gt;&lt;td class="lia-border-color-custom-d6dee6 lia-vertical-align-top lia-border-style-solid" style="border-width: 1px; padding: 8px 12px;"&gt;No — works on a diagram&lt;/td&gt;&lt;td class="lia-border-color-custom-d6dee6 lia-vertical-align-top lia-border-style-solid" style="border-width: 1px; padding: 8px 12px;"&gt;We're a &lt;EM&gt;design-time&lt;/EM&gt; tool; the architecture doesn't exist yet&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td class="lia-border-color-custom-d6dee6 lia-vertical-align-top lia-border-style-solid" style="border-width: 1px; padding: 8px 12px;"&gt;Needs human Q&amp;amp;A&lt;/td&gt;&lt;td class="lia-border-color-custom-d6dee6 lia-vertical-align-top lia-border-style-solid" style="border-width: 1px; padding: 8px 12px;"&gt;WAR: yes&lt;/td&gt;&lt;td class="lia-border-color-custom-d6dee6 
lia-vertical-align-top lia-border-style-solid" style="border-width: 1px; padding: 8px 12px;"&gt;No — derived from the diagram&lt;/td&gt;&lt;td class="lia-border-color-custom-d6dee6 lia-vertical-align-top lia-border-style-solid" style="border-width: 1px; padding: 8px 12px;"&gt;One-click validation inside the design flow&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td class="lia-border-color-custom-d6dee6 lia-vertical-align-top lia-border-style-solid" style="border-width: 1px; padding: 8px 12px;"&gt;Healthy/Applicable ratio&lt;/td&gt;&lt;td class="lia-border-color-custom-d6dee6 lia-vertical-align-top lia-border-style-solid" style="border-width: 1px; padding: 8px 12px;"&gt;Advisor: yes&lt;/td&gt;&lt;td class="lia-border-color-custom-d6dee6 lia-vertical-align-top lia-border-style-solid" style="border-width: 1px; padding: 8px 12px;"&gt;No&lt;/td&gt;&lt;td class="lia-border-color-custom-d6dee6 lia-vertical-align-top lia-border-style-solid" style="border-width: 1px; padding: 8px 12px;"&gt;No resource-health signal exists pre-deployment&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td class="lia-border-color-custom-d6dee6 lia-vertical-align-top lia-border-style-solid" style="border-width: 1px; padding: 8px 12px;"&gt;Subcategory fixed weights&lt;/td&gt;&lt;td class="lia-border-color-custom-d6dee6 lia-vertical-align-top lia-border-style-solid" style="border-width: 1px; padding: 8px 12px;"&gt;Advisor: yes&lt;/td&gt;&lt;td class="lia-border-color-custom-d6dee6 lia-vertical-align-top lia-border-style-solid" style="border-width: 1px; padding: 8px 12px;"&gt;No explicit weights&lt;/td&gt;&lt;td class="lia-border-color-custom-d6dee6 lia-vertical-align-top lia-border-style-solid" style="border-width: 1px; padding: 8px 12px;"&gt;Severity is the de-facto weight (12/7/3/1)&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td class="lia-border-color-custom-d6dee6 lia-vertical-align-top lia-border-style-solid" style="border-width: 1px; padding: 8px 12px;"&gt;Defender Secure Score for Security&lt;/td&gt;&lt;td 
class="lia-border-color-custom-d6dee6 lia-vertical-align-top lia-border-style-solid" style="border-width: 1px; padding: 8px 12px;"&gt;Advisor: yes&lt;/td&gt;&lt;td class="lia-border-color-custom-d6dee6 lia-vertical-align-top lia-border-style-solid" style="border-width: 1px; padding: 8px 12px;"&gt;No&lt;/td&gt;&lt;td class="lia-border-color-custom-d6dee6 lia-vertical-align-top lia-border-style-solid" style="border-width: 1px; padding: 8px 12px;"&gt;Defender requires deployed resources&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td class="lia-border-color-custom-d6dee6 lia-vertical-align-top lia-border-style-solid" style="border-width: 1px; padding: 8px 12px;"&gt;Cost-weighted scoring&lt;/td&gt;&lt;td class="lia-border-color-custom-d6dee6 lia-vertical-align-top lia-border-style-solid" style="border-width: 1px; padding: 8px 12px;"&gt;Advisor: yes&lt;/td&gt;&lt;td class="lia-border-color-custom-d6dee6 lia-vertical-align-top lia-border-style-solid" style="border-width: 1px; padding: 8px 12px;"&gt;No (separate Cost Estimation feature)&lt;/td&gt;&lt;td class="lia-border-color-custom-d6dee6 lia-vertical-align-top lia-border-style-solid" style="border-width: 1px; padding: 8px 12px;"&gt;Cost is a separate pipeline in our app&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td class="lia-border-color-custom-d6dee6 lia-vertical-align-top lia-border-style-solid" style="border-width: 1px; padding: 8px 12px;"&gt;AI/LLM refinement&lt;/td&gt;&lt;td class="lia-border-color-custom-d6dee6 lia-vertical-align-top lia-border-style-solid" style="border-width: 1px; padding: 8px 12px;"&gt;Neither&lt;/td&gt;&lt;td class="lia-border-color-custom-d6dee6 lia-vertical-align-top lia-border-style-solid" style="border-width: 1px; padding: 8px 12px;"&gt;Yes&lt;/td&gt;&lt;td class="lia-border-color-custom-d6dee6 lia-vertical-align-top lia-border-style-solid" style="border-width: 1px; padding: 8px 12px;"&gt;Catches context-specific issues a static catalog misses, and explains findings in natural 
language&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td class="lia-border-color-custom-d6dee6 lia-vertical-align-top lia-border-style-solid" style="border-width: 1px; padding: 8px 12px;"&gt;Multi-model comparison&lt;/td&gt;&lt;td class="lia-border-color-custom-d6dee6 lia-vertical-align-top lia-border-style-solid" style="border-width: 1px; padding: 8px 12px;"&gt;Neither&lt;/td&gt;&lt;td class="lia-border-color-custom-d6dee6 lia-vertical-align-top lia-border-style-solid" style="border-width: 1px; padding: 8px 12px;"&gt;Yes&lt;/td&gt;&lt;td class="lia-border-color-custom-d6dee6 lia-vertical-align-top lia-border-style-solid" style="border-width: 1px; padding: 8px 12px;"&gt;Lets architects see scoring variance across models&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;colgroup&gt;&lt;col style="width: 25.00%" /&gt;&lt;col style="width: 25.00%" /&gt;&lt;col style="width: 25.00%" /&gt;&lt;col style="width: 25.00%" /&gt;&lt;/colgroup&gt;&lt;/table&gt;&lt;/DIV&gt;
&lt;HR /&gt;
&lt;H2 id="honest-limitations"&gt;Honest limitations&lt;/H2&gt;
&lt;P&gt;I'd rather you hear these from me than discover them in production:&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;&lt;STRONG&gt;LLM scores drift.&lt;/STRONG&gt; ±5–10 points across models on the same diagram is normal. Treat the score as directional, the findings as actionable. The &lt;CODE&gt;rule-based&lt;/CODE&gt; tag is your anchor.&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;No live telemetry.&lt;/STRONG&gt; We can't know if your App Service is actually using availability zones — only that you have App Service in the diagram. Advisor will tell you the truth post-deployment.&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;Generic ruleset.&lt;/STRONG&gt; No specialized workload branches yet (AI/ML, IoT, SAP, SaaS). WAR has those.&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;No milestone tracking.&lt;/STRONG&gt; Each validation run is independent. Compare runs manually using the Validation Comparison view.&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;Rule coverage is finite.&lt;/STRONG&gt; 29 services and 73 rules is a strong start but not exhaustive — the LLM layer exists in part to compensate for that gap.&lt;/LI&gt;
&lt;/UL&gt;
&lt;HR /&gt;
&lt;H2 id="how-to-use-all-three-together"&gt;How to use all three together&lt;/H2&gt;
&lt;P&gt;&lt;STRONG&gt;A lifecycle that actually works:&lt;/STRONG&gt;&lt;/P&gt;
&lt;OL type="1"&gt;
&lt;LI&gt;&lt;STRONG&gt;Design&lt;/STRONG&gt; — &lt;STRONG&gt;Use the &lt;A href="https://aka.ms/diagram-builder" target="_blank"&gt;Diagram Builder&lt;/A&gt; to sketch the architecture and validate at design time. &lt;/STRONG&gt;Iterate until the per-pillar scores look reasonable and the critical/high findings are addressed.&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;Deploy&lt;/STRONG&gt; — &lt;STRONG&gt;Generate Bicep from the diagram, deploy, &lt;/STRONG&gt;and let Azure Advisor start scoring real resources.&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;Operate&lt;/STRONG&gt; — &lt;STRONG&gt;Use Azure Advisor continuously&lt;/STRONG&gt;. Use Defender Secure Score for security posture.&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;Periodic review&lt;/STRONG&gt; — &lt;STRONG&gt;Run a Core WAR every quarter&lt;/STRONG&gt; or at major milestones to capture the things only humans know (business context, tradeoffs, planned debt).&lt;/LI&gt;
&lt;/OL&gt;
&lt;P&gt;&lt;STRONG&gt;None of these three replaces the others. They cover different stages of the same loop.&lt;/STRONG&gt;&lt;/P&gt;
&lt;HR /&gt;
&lt;H2 id="whats-next"&gt;What's next&lt;/H2&gt;
&lt;P&gt;A few things on the roadmap I'd love feedback on:&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;&lt;STRONG&gt;Milestone tracking&lt;/STRONG&gt; so design-time scores can be compared over time the way WAR milestones work.&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;Workload-specific rulesets&lt;/STRONG&gt; mirroring WAR's branches — starting with AI/ML.&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;Direct Advisor handoff&lt;/STRONG&gt; — once a diagram is deployed, surface the corresponding Advisor recommendations in the same UI to close the loop.&lt;/LI&gt;
&lt;/UL&gt;
&lt;HR /&gt;
&lt;H2 id="try-it-fork-it-tell-me-where-its-wrong"&gt;Try it, fork it, tell me where it's wrong&lt;/H2&gt;
&lt;UL&gt;
&lt;LI&gt;&lt;STRONG&gt;Live app:&lt;/STRONG&gt; &lt;A href="https://aka.ms/diagram-builder" target="_blank"&gt;https://aka.ms/diagram-builder&lt;/A&gt;&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;Source:&lt;/STRONG&gt; &lt;A href="https://github.com/Arturo-Quiroga-MSFT/azure-architecture-diagram-builder" target="_blank"&gt;github.com/Arturo-Quiroga-MSFT/azure-architecture-diagram-builder&lt;/A&gt;&lt;/LI&gt;
&lt;/UL&gt;
&lt;P&gt;Useful references:&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;&lt;A href="https://learn.microsoft.com/azure/well-architected/pillars" target="_blank"&gt;Azure Well-Architected Framework pillars&lt;/A&gt;&lt;/LI&gt;
&lt;LI&gt;&lt;A href="https://learn.microsoft.com/assessments/azure-architecture-review/" target="_blank"&gt;Azure Well-Architected Review tool&lt;/A&gt;&lt;/LI&gt;
&lt;LI&gt;&lt;A href="https://learn.microsoft.com/azure/advisor/advisor-score#calculation-of-advisor-score" target="_blank"&gt;Azure Advisor score — calculation&lt;/A&gt;&lt;/LI&gt;
&lt;LI&gt;&lt;A href="https://learn.microsoft.com/azure/advisor/advisor-assessments#create-azure-advisor-waf-assessments" target="_blank"&gt;Use Azure WAF assessments (Advisor)&lt;/A&gt;&lt;/LI&gt;
&lt;LI&gt;&lt;A href="https://learn.microsoft.com/azure/well-architected/design-guides/implementing-recommendations" target="_blank"&gt;Complete an Azure Well-Architected Review assessment&lt;/A&gt;&lt;/LI&gt;
&lt;/UL&gt;
&lt;P&gt;&lt;STRONG&gt;If you're a partner or customer architect who's already living in Advisor and WAR, I'd genuinely value your reaction — does the design-time stage feel like a real gap to you, or are you already covering it some other way? Open an issue on the repo or reply on LinkedIn.&lt;/STRONG&gt;&lt;/P&gt;
&lt;P&gt;&lt;EM&gt;Posted on the Azure Architecture Blog · Comments and issues welcome on the &lt;A href="https://github.com/Arturo-Quiroga-MSFT/azure-architecture-diagram-builder" target="_blank"&gt;repo&lt;/A&gt;.&lt;/EM&gt;&lt;/P&gt;</description>
      <pubDate>Thu, 21 May 2026 18:06:15 GMT</pubDate>
      <guid>https://techcommunity.microsoft.com/t5/azure-architecture-blog/war-azure-advisor-and-us-azure-arch-diagram-builder-three-ways/ba-p/4521611</guid>
      <dc:creator>arturoqu</dc:creator>
      <dc:date>2026-05-21T18:06:15Z</dc:date>
    </item>
    <item>
      <title>From Prompt to Production: Building Azure Architecture Diagrams with AI</title>
      <link>https://techcommunity.microsoft.com/t5/azure-architecture-blog/from-prompt-to-production-building-azure-architecture-diagrams/ba-p/4520336</link>
      <description>
&lt;P&gt;&lt;STRONG&gt;Author:&lt;/STRONG&gt; Arturo Quiroga, Senior Partner Solutions Architect — Microsoft&lt;/P&gt;
&lt;HR /&gt;
&lt;P&gt;Cloud architects spend significant time translating ideas into architecture diagrams. They toggle between Visio, draw.io, pricing calculators, and documentation. According to the &lt;A href="https://survey.stackoverflow.co/2024/professional-developers#1-daily-time-spent-searching-for-answers-solutions" target="_blank" rel="noopener"&gt;2024 Stack Overflow Developer Survey&lt;/A&gt;, 61% of developers spend more than 30 minutes a day searching for answers or solutions, time lost to context-switching rather than design. What if you could describe your architecture in plain English and get a diagram, cost estimate, and deployment guide in minutes?&lt;/P&gt;
&lt;H2 id="the-challenge-fragmented-architecture-workflows"&gt;The Challenge: Fragmented Architecture Workflows&lt;/H2&gt;
&lt;P&gt;Designing Azure architectures today typically involves multiple disconnected steps:&lt;/P&gt;
&lt;OL type="1"&gt;
&lt;LI&gt;&lt;STRONG&gt;Sketch&lt;/STRONG&gt; the architecture in a diagramming tool&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;Look up&lt;/STRONG&gt; official Azure icons and drag them into place&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;Research&lt;/STRONG&gt; pricing across regions using the Azure Pricing Calculator&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;Validate&lt;/STRONG&gt; the design against the Well-Architected Framework (WAF)&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;Write&lt;/STRONG&gt; deployment documentation and Infrastructure as Code templates&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;Compare&lt;/STRONG&gt; alternative designs manually&lt;/LI&gt;
&lt;/OL&gt;
&lt;P&gt;Each step lives in a different tool, and keeping them in sync as designs evolve is costly. The Azure Architecture Diagram Builder brings these workflows together in a single browser-based experience.&lt;/P&gt;
&lt;H2 id="how-it-works"&gt;How It Works&lt;/H2&gt;
&lt;P&gt;Describe your architecture in natural language, for example &lt;EM&gt;"A HIPAA-compliant healthcare platform with FHIR APIs, event-driven processing, and multi-region disaster recovery"&lt;/EM&gt;, and the AI generates a diagram with grouped services, data flow connections, and logical organization.&lt;/P&gt;
&lt;BLOCKQUOTE&gt;
&lt;P&gt;&lt;STRONG&gt;Figure 1.&lt;/STRONG&gt; Enter a natural-language prompt describing your architecture. Curated example prompts help you get started, and you can optionally upload an existing diagram for the AI to analyze.&lt;/P&gt;
&lt;/BLOCKQUOTE&gt;
&lt;P&gt;The tool uses &lt;STRONG&gt;Azure OpenAI&lt;/STRONG&gt; to power generation across multiple models, enabling you to choose the model that best fits your scenario — from fast iterations to deeper reasoning.&lt;/P&gt;
&lt;H2 id="key-features"&gt;Key Features&lt;/H2&gt;
&lt;H3 id="ai-powered-architecture-generation"&gt;AI-Powered Architecture Generation&lt;/H3&gt;
&lt;P&gt;Describe what you need in plain English, and the AI creates an architecture diagram with:&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;&lt;STRONG&gt;714 official Azure service icons&lt;/STRONG&gt; across 29 categories&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;Smart grouping&lt;/STRONG&gt;: services are logically organized (Frontend, Backend, Data, Security)&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;Data flow connections&lt;/STRONG&gt;: labeled edges showing how data moves through the system&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;13 curated example prompts&lt;/STRONG&gt;: from simple web apps to complex enterprise scenarios like Zero Trust networks, Industrial IoT with 5,000+ sensors, and global multiplayer gaming backends&lt;/LI&gt;
&lt;/UL&gt;
&lt;BLOCKQUOTE&gt;
&lt;P&gt;&lt;STRONG&gt;Figure 2.&lt;/STRONG&gt; A generated industrial IoT architecture. &lt;STRONG&gt;Top:&lt;/STRONG&gt; the clean diagram view as initially produced. &lt;STRONG&gt;Bottom:&lt;/STRONG&gt; the same diagram with per-service monthly cost overlays toggled on, plus a running subscription total in the toolbar.&lt;/P&gt;
&lt;/BLOCKQUOTE&gt;
&lt;H3 id="architecture-image-import"&gt;Architecture Image Import&lt;/H3&gt;
&lt;P&gt;Already have an architecture on a whiteboard or in a screenshot? Upload the image and let the AI analyze it, mapping services to official Azure icons and recreating the architecture as an editable, interactive diagram.&lt;/P&gt;
&lt;img /&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;BLOCKQUOTE&gt;
&lt;P&gt;&lt;STRONG&gt;Figure 3.&lt;/STRONG&gt; Upload a photo of a whiteboard sketch (top-right reference panel) and the AI recreates it as an editable diagram with official Azure service icons and labeled data flow connections.&lt;/P&gt;
&lt;/BLOCKQUOTE&gt;
&lt;H3 id="arm-template-import"&gt;ARM Template Import&lt;/H3&gt;
&lt;P&gt;Import existing ARM templates to visualize your current infrastructure. The AI parses resource definitions and dependencies, groups related resources into logical layers, and produces a meaningful diagram of what you actually have deployed — a fast way to document an inherited environment or sanity-check a template before deployment.&lt;/P&gt;
&lt;img /&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;img /&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;BLOCKQUOTE&gt;
&lt;P&gt;&lt;STRONG&gt;Figure 4.&lt;/STRONG&gt; ARM template import in action. &lt;STRONG&gt;Top:&lt;/STRONG&gt; the parser status banner while resources and dependencies are being analyzed. &lt;STRONG&gt;Bottom:&lt;/STRONG&gt; the resulting diagram, with resources auto-grouped into logical layers (Web Tier, Data Layer, Container Platform, Observability &amp;amp; Logging) and a &lt;EM&gt;Generated from: ARM Template&lt;/EM&gt; badge linking the diagram back to its source file.&lt;/P&gt;
&lt;/BLOCKQUOTE&gt;
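&lt;P&gt;To make the dependency step concrete, here is a minimal Python sketch of how resource dependencies could be read from an ARM template without any AI involved. It is an illustration only, not the tool's parser: the function name &lt;CODE&gt;dependency_edges&lt;/CODE&gt; and the sample template are hypothetical, and the sketch assumes a template whose top-level &lt;CODE&gt;resources&lt;/CODE&gt; is a list with explicit &lt;CODE&gt;dependsOn&lt;/CODE&gt; entries.&lt;/P&gt;
&lt;PRE&gt;&lt;CODE&gt;
```python
import json


def dependency_edges(template_text):
    """Return (dependency, resource) pairs from explicit dependsOn entries."""
    template = json.loads(template_text)
    edges = []
    for resource in template.get("resources", []):
        # Identify each resource by type and name, e.g. Microsoft.Web/sites/app1.
        target = resource["type"] + "/" + resource["name"]
        for dependency in resource.get("dependsOn", []):
            edges.append((dependency, target))
    return edges


# Hypothetical two-resource template: a web app that depends on its plan.
SAMPLE_TEMPLATE = json.dumps({
    "resources": [
        {"type": "Microsoft.Web/serverfarms", "name": "plan1"},
        {
            "type": "Microsoft.Web/sites",
            "name": "app1",
            "dependsOn": ["[resourceId('Microsoft.Web/serverfarms', 'plan1')]"],
        },
    ]
})

for source, target in dependency_edges(SAMPLE_TEMPLATE):
    print(source, "->", target)
```
&lt;/CODE&gt;&lt;/PRE&gt;
&lt;P&gt;Each pair is one directed edge of the kind a diagram would draw. Grouping resources into logical layers is a separate step that this sketch does not attempt.&lt;/P&gt;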
&lt;H3 id="well-architected-framework-validation"&gt;Well-Architected Framework Validation&lt;/H3&gt;
&lt;P&gt;Validate your architecture against all five WAF pillars — Security, Reliability, Performance Efficiency, Cost Optimization, and Operational Excellence. The validator provides:&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;An overall WAF score with pillar-level breakdowns&lt;/LI&gt;
&lt;LI&gt;Specific findings with severity levels&lt;/LI&gt;
&lt;LI&gt;Actionable recommendations you can select and apply&lt;/LI&gt;
&lt;/UL&gt;
&lt;P&gt;Select the recommendations you agree with, and the AI regenerates an improved architecture incorporating those changes.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;img /&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;BLOCKQUOTE&gt;
&lt;P&gt;&lt;STRONG&gt;Figure 5.&lt;/STRONG&gt; WAF validation results showing the overall score, per-pillar breakdowns, and individual findings with severity badges. Tick the recommendations you want and the AI rebuilds the diagram with those changes applied.&lt;/P&gt;
&lt;/BLOCKQUOTE&gt;
&lt;H3 id="multi-model-comparison"&gt;Multi-Model Comparison&lt;/H3&gt;
&lt;P&gt;Run the same architecture prompt through multiple AI models side-by-side and compare:&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;&lt;STRONG&gt;Architecture Comparison&lt;/STRONG&gt;: service counts, connection counts, groups, token usage, and latency&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;Validation Comparison&lt;/STRONG&gt;: WAF scores across models, severity breakdowns, and finding counts&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;Apply Winner&lt;/STRONG&gt;: pick the best result and apply it to the canvas with one click&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;Present Critique&lt;/STRONG&gt;: a talking avatar narrates the AI-generated ranking with live closed captions&lt;/LI&gt;
&lt;/UL&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;img /&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;img /&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;BLOCKQUOTE&gt;
&lt;P&gt;&lt;STRONG&gt;Figure 6.&lt;/STRONG&gt; Multi-model comparison. &lt;STRONG&gt;Top:&lt;/STRONG&gt; select the models and reasoning effort, then enter the prompt. &lt;STRONG&gt;Bottom:&lt;/STRONG&gt; side-by-side results across all selected models with service counts, latency, token usage, and &lt;EM&gt;Fastest / Cheapest / Most Thorough&lt;/EM&gt; badges.&lt;/P&gt;
&lt;/BLOCKQUOTE&gt;
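&lt;P&gt;The post does not describe how the badges are assigned, so the following Python sketch is only one plausible reading. It assumes that &lt;EM&gt;Fastest&lt;/EM&gt; means lowest latency, &lt;EM&gt;Cheapest&lt;/EM&gt; means fewest tokens, and &lt;EM&gt;Most Thorough&lt;/EM&gt; means the highest service count. The class, function, and model names are all hypothetical.&lt;/P&gt;
&lt;PRE&gt;&lt;CODE&gt;
```python
from dataclasses import dataclass


@dataclass
class ModelResult:
    model: str
    latency_seconds: float
    total_tokens: int
    service_count: int


def assign_badges(results):
    """Map each badge name to the model that wins it (assumed criteria)."""
    return {
        "Fastest": min(results, key=lambda r: r.latency_seconds).model,
        "Cheapest": min(results, key=lambda r: r.total_tokens).model,
        "Most Thorough": max(results, key=lambda r: r.service_count).model,
    }


# Hypothetical results for three placeholder models.
RESULTS = [
    ModelResult("model-a", 4.2, 3100, 9),
    ModelResult("model-b", 11.8, 5400, 14),
    ModelResult("model-c", 6.5, 2700, 11),
]

print(assign_badges(RESULTS))
```
&lt;/CODE&gt;&lt;/PRE&gt;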
&lt;H3 id="multi-region-cost-estimation"&gt;Multi-Region Cost Estimation&lt;/H3&gt;
&lt;P&gt;Get cost estimates from the Azure Retail Prices API across &lt;STRONG&gt;8 Azure regions&lt;/STRONG&gt;: East US 2, Australia East, Canada Central, Brazil South, Mexico Central, West Europe, Sweden Central, and Southeast Asia. Features include:&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;Color-coded cost legend (green / yellow / red thresholds)&lt;/LI&gt;
&lt;LI&gt;SKU and tier information for each service&lt;/LI&gt;
&lt;LI&gt;Export options: CSV, JSON, plain-text summary, and an analysis report with top cost drivers, Reserved Instance flags, and a ranked multi-region comparison table&lt;/LI&gt;
&lt;/UL&gt;
&lt;img /&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;BLOCKQUOTE&gt;
&lt;P&gt;&lt;STRONG&gt;Figure 7.&lt;/STRONG&gt; The cost legend overlay shows per-service pricing with color-coded thresholds. The region selector in the toolbar lets you re-price the entire architecture in any of eight Azure regions.&lt;/P&gt;
&lt;/BLOCKQUOTE&gt;
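&lt;P&gt;For readers who want to query the same data source directly, the sketch below builds one Azure Retail Prices API request per region using an OData &lt;CODE&gt;$filter&lt;/CODE&gt; on &lt;CODE&gt;serviceName&lt;/CODE&gt; and &lt;CODE&gt;armRegionName&lt;/CODE&gt;. It only constructs URLs and sends nothing. The function names and the example service name are assumptions, and the requests the tool itself issues may differ.&lt;/P&gt;
&lt;PRE&gt;&lt;CODE&gt;
```python
from urllib.parse import quote, urlencode

# Public endpoint of the Azure Retail Prices API.
RETAIL_PRICES_URL = "https://prices.azure.com/api/retail/prices"

# armRegionName values for the eight regions listed above.
REGIONS = {
    "East US 2": "eastus2",
    "Australia East": "australiaeast",
    "Canada Central": "canadacentral",
    "Brazil South": "brazilsouth",
    "Mexico Central": "mexicocentral",
    "West Europe": "westeurope",
    "Sweden Central": "swedencentral",
    "Southeast Asia": "southeastasia",
}


def build_price_query(service_name, region):
    """Return a Retail Prices API URL filtered to one service in one region."""
    odata_filter = (
        "serviceName eq '" + service_name + "'"
        + " and armRegionName eq '" + REGIONS[region] + "'"
    )
    # quote_via=quote encodes spaces as %20 rather than '+'.
    return RETAIL_PRICES_URL + "?" + urlencode({"$filter": odata_filter}, quote_via=quote)


def build_all_region_queries(service_name):
    """Return one query URL per region, keyed by the region display name."""
    return {region: build_price_query(service_name, region) for region in REGIONS}


# Example: one URL per region for a single (example) service name.
for region, url in build_all_region_queries("Azure Cosmos DB").items():
    print(region, url)
```
&lt;/CODE&gt;&lt;/PRE&gt;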
&lt;H3 id="deployment-guide-generation-with-bicep"&gt;Deployment Guide Generation with Bicep&lt;/H3&gt;
&lt;P&gt;Generate step-by-step deployment documentation including:&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;Prerequisites and Azure resource requirements&lt;/LI&gt;
&lt;LI&gt;Step-by-step deployment instructions&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;Bicep templates&lt;/STRONG&gt; for each service (Infrastructure as Code)&lt;/LI&gt;
&lt;LI&gt;Post-deployment verification steps&lt;/LI&gt;
&lt;LI&gt;Security configuration recommendations&lt;/LI&gt;
&lt;/UL&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;img /&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;BLOCKQUOTE&gt;
&lt;P&gt;&lt;STRONG&gt;Figure 8.&lt;/STRONG&gt; Each generated Deployment Guide opens with the architecture name, an estimated deployment time, and a prerequisites checklist covering subscription roles, CLI versions, Microsoft Entra ID permissions, and region requirements, followed by numbered, copy-ready deployment steps.&lt;/P&gt;
&lt;/BLOCKQUOTE&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;img /&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;BLOCKQUOTE&gt;
&lt;P&gt;&lt;STRONG&gt;Figure 9.&lt;/STRONG&gt; The Infrastructure as Code section produces a &lt;CODE&gt;main.bicep&lt;/CODE&gt; orchestrator plus a per-service module (Log Analytics, Key Vault, Cosmos DB, SQL Database, Event Hubs, Azure Functions, and more). The &lt;STRONG&gt;Download All Templates&lt;/STRONG&gt; button packages everything into a ready-to-deploy folder.&lt;/P&gt;
&lt;/BLOCKQUOTE&gt;
&lt;H3 id="workflow-animation--avatar-presenter"&gt;Workflow Animation &amp;amp; Avatar Presenter&lt;/H3&gt;
&lt;P&gt;Visualize how data flows through your architecture with step-by-step animations that highlight services on the canvas as each step plays. When the Azure Speech Service is configured, a photorealistic talking avatar can narrate the workflow or present model comparison results, with live word-by-word closed captions in a draggable, resizable panel.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;img /&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;BLOCKQUOTE&gt;
&lt;P&gt;&lt;STRONG&gt;Figure 10.&lt;/STRONG&gt; A workflow step is highlighted on the canvas as the Avatar Presenter narrates that step. Live word-by-word closed captions appear in a draggable, resizable panel, useful for accessibility and stakeholder demos.&lt;/P&gt;
&lt;/BLOCKQUOTE&gt;
&lt;H3 id="export-options"&gt;Export Options&lt;/H3&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;img /&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;BLOCKQUOTE&gt;
&lt;P&gt;&lt;STRONG&gt;Figure 11.&lt;/STRONG&gt; A single-slide PowerPoint export, available in dark or light theme, ready to drop straight into a stakeholder deck.&lt;/P&gt;
&lt;/BLOCKQUOTE&gt;
&lt;DIV class="styles_lia-table-wrapper__h6Xo9 styles_table-responsive__MW0lN"&gt;&lt;table border="1" style="border-width: 1px;"&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;Format&lt;/th&gt;&lt;th&gt;Use Case&lt;/th&gt;&lt;/tr&gt;&lt;/thead&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td&gt;&lt;STRONG&gt;PNG&lt;/STRONG&gt;&lt;/td&gt;&lt;td&gt;Documentation, presentations&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;&lt;STRONG&gt;SVG&lt;/STRONG&gt;&lt;/td&gt;&lt;td&gt;Scalable vector graphics&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;&lt;STRONG&gt;PPTX&lt;/STRONG&gt;&lt;/td&gt;&lt;td&gt;Single PowerPoint slide (dark or light theme)&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;&lt;STRONG&gt;Draw.io&lt;/STRONG&gt;&lt;/td&gt;&lt;td&gt;Edit in diagrams.net&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;&lt;STRONG&gt;JSON&lt;/STRONG&gt;&lt;/td&gt;&lt;td&gt;Backup, version control&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;&lt;STRONG&gt;CSV / ZIP&lt;/STRONG&gt;&lt;/td&gt;&lt;td&gt;Cost analysis with multi-region comparison&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;colgroup&gt;&lt;col style="width: 50.00%" /&gt;&lt;col style="width: 50.00%" /&gt;&lt;/colgroup&gt;&lt;/table&gt;&lt;/DIV&gt;
&lt;H2 id="highlights"&gt;Highlights&lt;/H2&gt;
&lt;P&gt;The Azure Architecture Diagram Builder unifies the architecture design lifecycle in a single tool:&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;&lt;STRONG&gt;End-to-end workflow&lt;/STRONG&gt;: from natural-language description to deployable Bicep templates without tool switching&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;Official Azure icons&lt;/STRONG&gt;: 714 icons across 29 categories, mapped directly from the Azure service catalog&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;Live pricing&lt;/STRONG&gt;: queries the Azure Retail Prices API at design time rather than relying on static estimates&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;WAF-integrated validation&lt;/STRONG&gt;: architectural best practices built into the design loop rather than applied after the fact&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;Multi-model flexibility&lt;/STRONG&gt;: choose the AI model that best suits each task, with fast models for iteration and reasoning models for complex designs&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;Open source&lt;/STRONG&gt;: the source code is available for customization and contribution&lt;/LI&gt;
&lt;/UL&gt;
&lt;H2 id="one-command-deploy-with-azure-developer-cli"&gt;One-Command Deploy with Azure Developer CLI&lt;/H2&gt;
&lt;P&gt;The fastest way to get your own instance running is with &lt;A href="https://aka.ms/azd" target="_blank" rel="noopener"&gt;&lt;CODE&gt;azd&lt;/CODE&gt;&lt;/A&gt;:&lt;/P&gt;
&lt;DIV id="cb1" class="sourceCode"&gt;
&lt;PRE class="sourceCode bash"&gt;&lt;CODE class="sourceCode bash"&gt;&lt;SPAN id="cb1-1"&gt;&lt;SPAN class="co"&gt;# Install azd (once)&lt;/SPAN&gt;&lt;/SPAN&gt;
&lt;SPAN id="cb1-2"&gt;&lt;SPAN class="ex"&gt;brew&lt;/SPAN&gt; tap azure/azd &lt;SPAN class="kw"&gt;&amp;amp;&amp;amp;&lt;/SPAN&gt; &lt;SPAN class="ex"&gt;brew&lt;/SPAN&gt; install azd   &lt;SPAN class="co"&gt;# macOS&lt;/SPAN&gt;&lt;/SPAN&gt;
&lt;SPAN id="cb1-3"&gt;&lt;SPAN class="ex"&gt;winget&lt;/SPAN&gt; install microsoft.azd             &lt;SPAN class="co"&gt;# Windows&lt;/SPAN&gt;&lt;/SPAN&gt;
&lt;SPAN id="cb1-4"&gt;&lt;/SPAN&gt;
&lt;SPAN id="cb1-5"&gt;&lt;SPAN class="co"&gt;# Clone, configure, and deploy&lt;/SPAN&gt;&lt;/SPAN&gt;
&lt;SPAN id="cb1-6"&gt;&lt;SPAN class="fu"&gt;git&lt;/SPAN&gt; clone https://github.com/Arturo-Quiroga-MSFT/azure-architecture-diagram-builder&lt;/SPAN&gt;
&lt;SPAN id="cb1-7"&gt;&lt;SPAN class="bu"&gt;cd&lt;/SPAN&gt; azure-architecture-diagram-builder&lt;/SPAN&gt;
&lt;SPAN id="cb1-8"&gt;&lt;SPAN class="ex"&gt;azd&lt;/SPAN&gt; auth login&lt;/SPAN&gt;
&lt;SPAN id="cb1-9"&gt;&lt;SPAN class="ex"&gt;azd&lt;/SPAN&gt; env set AZURE_OPENAI_ENDPOINT &lt;SPAN class="st"&gt;"https://your-resource.openai.azure.com/"&lt;/SPAN&gt;&lt;/SPAN&gt;
&lt;SPAN id="cb1-10"&gt;&lt;SPAN class="ex"&gt;azd&lt;/SPAN&gt; env set AZURE_OPENAI_API_KEY  &lt;SPAN class="st"&gt;"your-key"&lt;/SPAN&gt;&lt;/SPAN&gt;
&lt;SPAN id="cb1-11"&gt;&lt;SPAN class="ex"&gt;azd&lt;/SPAN&gt; up   &lt;SPAN class="co"&gt;# Provisions infrastructure + builds + deploys (~8 min)&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;/CODE&gt;&lt;/PRE&gt;
&lt;/DIV&gt;
&lt;P&gt;&lt;CODE&gt;azd up&lt;/CODE&gt; provisions the following via Bicep:&lt;/P&gt;
&lt;DIV class="styles_lia-table-wrapper__h6Xo9 styles_table-responsive__MW0lN"&gt;&lt;table border="1" style="border-width: 1px;"&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;Resource&lt;/th&gt;&lt;th&gt;Purpose&lt;/th&gt;&lt;/tr&gt;&lt;/thead&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td&gt;Azure Container Registry&lt;/td&gt;&lt;td&gt;Stores the Docker image&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Azure Container Apps&lt;/td&gt;&lt;td&gt;Runs the app (nginx + token server)&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Log Analytics + Application Insights&lt;/td&gt;&lt;td&gt;Monitoring and telemetry&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Azure Speech (S0)&lt;/td&gt;&lt;td&gt;Avatar Presenter (optional, keyless auth via managed identity)&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;colgroup&gt;&lt;col style="width: 50.00%" /&gt;&lt;col style="width: 50.00%" /&gt;&lt;/colgroup&gt;&lt;/table&gt;&lt;/DIV&gt;
&lt;H2 id="try-it-today"&gt;Try It Today&lt;/H2&gt;
&lt;P&gt;The Azure Architecture Diagram Builder is available now:&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;&lt;STRONG&gt;Live demo&lt;/STRONG&gt;: &lt;A href="https://aka.ms/diagram-builder" target="_blank" rel="noopener"&gt;https://aka.ms/diagram-builder&lt;/A&gt;&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;Source code&lt;/STRONG&gt;: &lt;A href="https://github.com/Arturo-Quiroga-MSFT/azure-architecture-diagram-builder" target="_blank" rel="noopener"&gt;GitHub repository&lt;/A&gt;&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;Documentation&lt;/STRONG&gt;: See the &lt;A href="https://github.com/Arturo-Quiroga-MSFT/azure-architecture-diagram-builder/blob/main/DOCS/getting-started-guide.md" target="_blank" rel="noopener"&gt;Getting Started Guide&lt;/A&gt; for detailed setup instructions&lt;/LI&gt;
&lt;/UL&gt;
&lt;P&gt;We welcome feedback and contributions. Use the GitHub Issues page to report bugs, suggest features, or share your experience.&lt;/P&gt;
&lt;HR /&gt;
&lt;P&gt;&lt;STRONG&gt;Tags:&lt;/STRONG&gt; &lt;CODE&gt;artificial intelligence&lt;/CODE&gt; · &lt;CODE&gt;application&lt;/CODE&gt; · &lt;CODE&gt;apps &amp;amp; devops&lt;/CODE&gt; · &lt;CODE&gt;well architected&lt;/CODE&gt; · &lt;CODE&gt;infrastructure&lt;/CODE&gt;&lt;/P&gt;</description>
      <pubDate>Fri, 22 May 2026 18:35:07 GMT</pubDate>
      <guid>https://techcommunity.microsoft.com/t5/azure-architecture-blog/from-prompt-to-production-building-azure-architecture-diagrams/ba-p/4520336</guid>
      <dc:creator>arturoqu</dc:creator>
      <dc:date>2026-05-22T18:35:07Z</dc:date>
    </item>
    <item>
      <title>How to Secure Azure Databricks without Public Exposure using WAF + Private Endpoints</title>
      <link>https://techcommunity.microsoft.com/t5/azure-architecture-blog/how-to-secure-azure-databricks-without-public-exposure-using-waf/ba-p/4517721</link>
      <description>&lt;P&gt;The first idea that comes to mind is to configure an IP access list while keeping Azure Databricks in a hybrid connectivity model. This approach is technically feasible, but it is not the best fit for organizations that follow a Zero Trust architecture.&lt;/P&gt;
&lt;P&gt;Instead, organizations need a solution that follows Cloud Adoption Framework (CAF) principles. Azure Application Gateway with Web Application Firewall (WAF), combined with Private Endpoints (Azure Private Link), becomes critical to enforcing a Zero Trust architecture.&lt;/P&gt;
&lt;P&gt;In this blog, I’ll walk through:&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;Why Zero Trust is essential for Databricks&lt;/LI&gt;
&lt;LI&gt;Traffic Flow For Securing Databricks&lt;/LI&gt;
&lt;LI&gt;Architecture Components&lt;/LI&gt;
&lt;LI&gt;Key Considerations&lt;/LI&gt;
&lt;LI&gt;Conclusion&lt;/LI&gt;
&lt;/UL&gt;
&lt;P&gt;&lt;STRONG&gt;Why Zero Trust for Azure Databricks?&lt;/STRONG&gt;&lt;/P&gt;
&lt;P&gt;Azure Databricks is a SaaS-managed service—but in many enterprise environments, data sensitivity demands full network isolation.&lt;/P&gt;
&lt;P&gt;Azure Private Link enables organizations to:&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;Eliminate public internet exposure&lt;/LI&gt;
&lt;LI&gt;Ensure traffic remains within private networks&lt;/LI&gt;
&lt;LI&gt;Reduce risk of data exfiltration&lt;/LI&gt;
&lt;/UL&gt;
&lt;P&gt;Additionally, organizations prefer:&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;Controlled access via corporate network (VPN/ExpressRoute)&lt;/LI&gt;
&lt;LI&gt;Full audit and inspection of inbound traffic&lt;/LI&gt;
&lt;LI&gt;Strong compliance alignment (RBI, SEBI, PCI-DSS, etc.)&lt;/LI&gt;
&lt;/UL&gt;
&lt;P&gt;This drives the need for private-only access models, where public access is completely disabled.&lt;/P&gt;
&lt;P&gt;&lt;STRONG&gt;Traffic Flow For Securing Databricks –&lt;/STRONG&gt;&lt;/P&gt;
&lt;P&gt;Organizations that follow a Zero Trust architecture are often concerned about providing access securely. Most users reach the workspace through the intranet, but in some scenarios a subset of users, partners, or vendors sit outside the organization's network, cannot be brought into it, and must connect over the internet.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;This makes it challenging to design the traffic flow and secure connectivity. The diagram below shows how traffic flows to Databricks end to end and how each path is secured:&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;img /&gt;
&lt;OL&gt;
&lt;LI&gt;External User Access (Red Flow)
&lt;OL&gt;
&lt;LI&gt;External user connects via internet&lt;/LI&gt;
&lt;LI&gt;Request hits Application Gateway (WAF)&lt;/LI&gt;
&lt;LI&gt;Traffic is:
&lt;UL&gt;
&lt;LI&gt;Inspected&lt;/LI&gt;
&lt;LI&gt;Validated&lt;/LI&gt;
&lt;/UL&gt;
&lt;/LI&gt;
&lt;LI&gt;Routed to Databricks via Private Endpoint&lt;/LI&gt;
&lt;/OL&gt;
&lt;/LI&gt;
&lt;/OL&gt;
&lt;P&gt;This ensures that all external traffic is inspected and secured.&lt;/P&gt;
&lt;OL start="2"&gt;
&lt;LI&gt;Internal User Access (Green Flow)
&lt;OL&gt;
&lt;LI&gt;Internal user connects via VPN / ExpressRoute&lt;/LI&gt;
&lt;LI&gt;Traffic enters Hub VNet&lt;/LI&gt;
&lt;LI&gt;Routed directly to:
&lt;UL&gt;
&lt;LI&gt;Databricks Private Endpoint (in Spoke VNet)&lt;/LI&gt;
&lt;/UL&gt;
&lt;/LI&gt;
&lt;/OL&gt;
&lt;/LI&gt;
&lt;/OL&gt;
&lt;P&gt;This ensures private, low-latency, secure access without internet exposure.&lt;/P&gt;
&lt;P&gt;&lt;STRONG&gt;Architecture Components –&lt;/STRONG&gt;&lt;/P&gt;
&lt;P&gt;In this section, I’ll walk through the architecture and the components the design requires.&lt;/P&gt;
&lt;P&gt;Because the focus is on Zero Trust architecture, the design uses a secured hub-and-spoke networking model.&lt;/P&gt;
&lt;P&gt;&lt;STRONG&gt;Key Services &amp;amp; Considerations –&lt;/STRONG&gt;&lt;/P&gt;
&lt;OL&gt;
&lt;LI&gt;Application Gateway with WAF&lt;/LI&gt;
&lt;LI&gt;WAF with Custom Rules&lt;/LI&gt;
&lt;LI&gt;Secured Internal Network Connectivity&lt;/LI&gt;
&lt;LI&gt;Databricks Workspace – Public Access Disabled&lt;/LI&gt;
&lt;LI&gt;Databricks Workspace – Private Endpoint Enabled&lt;/LI&gt;
&lt;/OL&gt;
&lt;img /&gt;
&lt;P&gt;&lt;STRONG&gt;1. Application Gateway with WAF&lt;/STRONG&gt;&lt;/P&gt;
&lt;P&gt;Acts as the single entry point to Databricks, deployed in the Hub VNet.&lt;BR /&gt;Provides SSL termination, routing, and traffic inspection, ensuring all access is centralized and backend services are not directly exposed.&lt;/P&gt;
&lt;P&gt;&lt;STRONG&gt;2. WAF with Custom Rules&lt;/STRONG&gt;&lt;/P&gt;
&lt;P&gt;Protects against OWASP Top 10 threats and runs in Prevention mode.&lt;BR /&gt;Custom rules enable restriction based on IP, geo-location, or request patterns, giving fine-grained security and compliance control for specific networks only.&lt;/P&gt;
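&lt;P&gt;The allow-list behavior of an IP-based custom rule can be sketched in a few lines of Python. This is only a model of the decision the rule makes, not Azure WAF configuration syntax. The function name and the address ranges (taken from the reserved documentation ranges) are hypothetical.&lt;/P&gt;
&lt;PRE&gt;&lt;CODE&gt;
```python
import ipaddress

# Hypothetical partner networks that are allowed through the WAF.
ALLOWED_NETWORKS = [
    ipaddress.ip_network("203.0.113.0/24"),
    ipaddress.ip_network("198.51.100.16/28"),
]


def waf_action(client_ip):
    """Return "Allow" for clients inside an allowed network, else "Block"."""
    address = ipaddress.ip_address(client_ip)
    if any(address in network for network in ALLOWED_NETWORKS):
        return "Allow"
    return "Block"


for ip in ("203.0.113.7", "198.51.100.20", "198.51.100.40"):
    print(ip, waf_action(ip))
```
&lt;/CODE&gt;&lt;/PRE&gt;
&lt;P&gt;The default is to block, so any network that has not been explicitly listed is denied.&lt;/P&gt;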
&lt;P&gt;&lt;STRONG&gt;3. Secured Internal Network Connectivity&lt;/STRONG&gt;&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;&lt;STRONG&gt;External users&lt;/STRONG&gt;&amp;nbsp;→ access via WAF (inspected traffic)&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;Internal users&lt;/STRONG&gt;&amp;nbsp;→ access via VPN/ExpressRoute (private flow)&lt;/LI&gt;
&lt;/UL&gt;
&lt;P&gt;Supported by NSGs, VPN Gateway, and firewall controls to enforce secure, segmented access.&lt;/P&gt;
&lt;P&gt;&lt;STRONG&gt;4. Databricks Workspace – Public Access Disabled&lt;/STRONG&gt;&lt;/P&gt;
&lt;P&gt;Public access is completely disabled, ensuring the workspace is not exposed to the internet and only accessible via approved private paths.&lt;/P&gt;
&lt;P&gt;&lt;STRONG&gt;5. Databricks Workspace – Private Endpoint Enabled&lt;/STRONG&gt;&lt;/P&gt;
&lt;P&gt;The workspace is exposed via a Private Endpoint in the Spoke VNet, mapped to a private IP with DNS integration.&lt;BR /&gt;This ensures traffic remains on the Azure backbone and the workspace is reachable only from authorized networks.&lt;/P&gt;
&lt;P&gt;&lt;STRONG&gt;Key Considerations&lt;/STRONG&gt;&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;Configure the Application Gateway listener with the FQDN that external users will use to access Databricks.&lt;/LI&gt;
&lt;LI&gt;The listener FQDN (for example,&amp;nbsp;&lt;STRONG&gt;databricks.example.com&lt;/STRONG&gt;) resolves to the public IP of the Application Gateway (frontend IP).&lt;/LI&gt;
&lt;LI&gt;An SSL certificate and public DNS configuration are required for this setup to work.&lt;/LI&gt;
&lt;LI&gt;Configure the Application Gateway backend pool with the FQDN of the Databricks workspace.&amp;nbsp;&lt;EM&gt;Note: if you add the private IP of the workspace instead, the backend is reported as healthy, but web requests fail.&lt;/EM&gt;&lt;/LI&gt;
&lt;LI&gt;Ensure that DNS resolution for the Private Endpoint is configured correctly.&lt;/LI&gt;
&lt;LI&gt;Application Gateway must be able to resolve the FQDN of the Databricks workspace to the Private Endpoint.&lt;/LI&gt;
&lt;LI&gt;Configure end-to-end SSL at the Application Gateway level.&lt;/LI&gt;
&lt;LI&gt;Configure WAF custom rules to allow specific public networks and block all others.&lt;/LI&gt;
&lt;/UL&gt;
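&lt;P&gt;The DNS considerations above can be checked with a short Python sketch: resolve the workspace FQDN from a machine that uses the same resolver as the Application Gateway subnet and confirm that every answer is a private address. The helper names are hypothetical, and the sketch assumes the Private Endpoint IP comes from an RFC 1918 range, which may not hold for every VNet address plan.&lt;/P&gt;
&lt;PRE&gt;&lt;CODE&gt;
```python
import ipaddress
import socket

# RFC 1918 private ranges; assumes the Private Endpoint IP falls in one of them.
RFC1918 = [
    ipaddress.ip_network("10.0.0.0/8"),
    ipaddress.ip_network("172.16.0.0/12"),
    ipaddress.ip_network("192.168.0.0/16"),
]


def is_rfc1918(address):
    """True if the address belongs to one of the RFC 1918 ranges."""
    parsed = ipaddress.ip_address(address)
    return any(parsed in network for network in RFC1918)


def all_private(addresses):
    """True only if at least one address was returned and all are private."""
    return bool(addresses) and all(is_rfc1918(a) for a in addresses)


def resolve(fqdn):
    """Resolve an FQDN using the resolver of the machine running the script."""
    infos = socket.getaddrinfo(fqdn, 443, proto=socket.IPPROTO_TCP)
    return sorted({info[4][0] for info in infos})


# Literal answers are used here so the check itself needs no network access.
print(all_private(["10.20.1.4"]))
print(all_private(["10.20.1.4", "20.50.1.1"]))
```
&lt;/CODE&gt;&lt;/PRE&gt;
&lt;P&gt;In practice you would pass &lt;CODE&gt;resolve(fqdn)&lt;/CODE&gt; for your own workspace FQDN to &lt;CODE&gt;all_private&lt;/CODE&gt;. A public address in the result suggests the lookup is not reaching the private DNS zone.&lt;/P&gt;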
&lt;P&gt;&lt;STRONG&gt;Conclusion&lt;/STRONG&gt;&lt;/P&gt;
&lt;P&gt;As organizations continue adopting Azure Databricks for critical analytics and AI workloads, securing access becomes non-negotiable. The traditional approach of a hybrid network model with an IP access list is no longer sufficient for enterprise-grade security.&lt;/P&gt;
&lt;P&gt;By combining Application Gateway (WAF) with Private Endpoints, and leveraging a Hub-Spoke architecture, we can achieve a true Zero Trust design—where every request is validated, inspected, and routed securely without exposing backend services.&lt;/P&gt;
&lt;P&gt;This architecture not only reduces the attack surface but also ensures compliance with strict regulatory standards, while still enabling seamless access for both internal and external users.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&lt;STRONG&gt;Useful Links -&lt;/STRONG&gt;&lt;/P&gt;
&lt;P&gt;&lt;STRONG&gt;Azure WAF Custom Rules -&lt;/STRONG&gt; &lt;A class="lia-external-url" href="https://learn.microsoft.com/en-us/azure/web-application-firewall/ag/custom-waf-rules-overview" target="_blank"&gt;https://learn.microsoft.com/en-us/azure/web-application-firewall/ag/custom-waf-rules-overview&lt;/A&gt;&lt;/P&gt;
&lt;P&gt;&lt;STRONG&gt;Azure Databricks Private Link -&lt;/STRONG&gt; &lt;A class="lia-external-url" href="https://learn.microsoft.com/en-us/azure/databricks/security/network/concepts/private-link#choose-the-right-private-link-implementation" target="_blank"&gt;https://learn.microsoft.com/en-us/azure/databricks/security/network/concepts/private-link#choose-the-right-private-link-implementation&lt;/A&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&lt;STRONG&gt;Azure Databricks Architecture -&lt;/STRONG&gt; &lt;A class="lia-external-url" href="https://learn.microsoft.com/en-us/azure/databricks/security/network/concepts/architecture" target="_blank"&gt;https://learn.microsoft.com/en-us/azure/databricks/security/network/concepts/architecture&lt;/A&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&lt;STRONG&gt;Azure Databricks VNet -&lt;/STRONG&gt; &lt;A class="lia-external-url" href="https://learn.microsoft.com/en-us/azure/databricks/security/network/classic/vnet-inject#overview" target="_blank"&gt;https://learn.microsoft.com/en-us/azure/databricks/security/network/classic/vnet-inject#overview&lt;/A&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&lt;STRONG&gt;Azure Hub &amp;amp; Spoke Model -&lt;/STRONG&gt; &lt;A class="lia-external-url" href="https://learn.microsoft.com/en-us/azure/architecture/networking/architecture/hub-spoke" target="_blank"&gt;https://learn.microsoft.com/en-us/azure/architecture/networking/architecture/hub-spoke&lt;/A&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Mon, 11 May 2026 23:37:12 GMT</pubDate>
      <guid>https://techcommunity.microsoft.com/t5/azure-architecture-blog/how-to-secure-azure-databricks-without-public-exposure-using-waf/ba-p/4517721</guid>
      <dc:creator>FaizaanMerchant</dc:creator>
      <dc:date>2026-05-11T23:37:12Z</dc:date>
    </item>
    <item>
      <title>Configure DNS forwarding for Azure NetApp Files</title>
      <link>https://techcommunity.microsoft.com/t5/azure-architecture-blog/configure-dns-forwarding-for-azure-netapp-files/ba-p/4516381</link>
      <description>&lt;P&gt;&lt;EM&gt;This post was written in collaboration with &lt;A href="http://rizul@netapp.com" target="_blank" rel="noopener"&gt;Rizul Khanna&lt;/A&gt;&lt;/EM&gt;&lt;/P&gt;
&lt;BLOCKQUOTE&gt;
&lt;P class="styles_lia-table-wrapper__h6Xo9 styles_table-responsive__MW0lN"&gt;&lt;SPAN data-contrast="none"&gt;Applies to: Azure NetApp Files — SMB, dual-protocol, and NFSv4.1 Kerberos volumes deployed in hub-spoke or Azure Virtual WAN topologies using an external private DNS forwarder.&lt;/SPAN&gt;&lt;SPAN data-contrast="none"&gt;&lt;BR /&gt;&lt;/SPAN&gt;&lt;/P&gt;
&lt;/BLOCKQUOTE&gt;
&lt;H4 class="styles_lia-table-wrapper__h6Xo9 styles_table-responsive__MW0lN"&gt;Overview&lt;/H4&gt;
&lt;DIV class="styles_lia-table-wrapper__h6Xo9 styles_table-responsive__MW0lN"&gt;
&lt;P&gt;&lt;SPAN data-contrast="none"&gt;Azure NetApp Files (ANF) has a hard dependency on DNS for all volume types that integrate with Active Directory (AD): SMB,&amp;nbsp;dual-protocol&amp;nbsp;(SMB + NFS), and NFSv4.1 with Kerberos. Unlike most Azure PaaS services, ANF does not use Azure Private Link and has no&amp;nbsp;privatelink.*&amp;nbsp;zone. Its volumes attach directly to a delegated subnet, and their hostnames are registered into AD-integrated DNS via Secure Dynamic DNS (DDNS). This architecture means DNS design decisions for the ANF delegated subnet are fundamentally different from those that apply to storage accounts, SQL databases, or other services that use private endpoints.&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{&amp;quot;335559738&amp;quot;:60,&amp;quot;335559739&amp;quot;:80}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN data-contrast="none"&gt;This article documents what DNS resolution ANF&amp;nbsp;requires, how to correctly configure an external private DNS forwarder in hub-spoke and Virtual WAN deployments, and the specific undocumented requirements that cause volume creation failures and SMB permission errors in practice. Several requirements covered here are not present in the official Azure NetApp Files documentation and have been&amp;nbsp;identified&amp;nbsp;through field support cases.&lt;/SPAN&gt;&lt;/P&gt;
&lt;BLOCKQUOTE&gt;
&lt;P&gt;&lt;SPAN data-contrast="none"&gt;ANF does not inherit the VNET DNS server setting. It queries only the two DNS server IPs configured in the Active Directory connection on the NetApp account. This is not documented in the ANF networking or AD connection articles. The VNET DNS server setting is irrelevant to ANF volume creation and AD join behavior — only the AD connection DNS IPs matter.&lt;/SPAN&gt;&lt;/P&gt;
&lt;/BLOCKQUOTE&gt;
&lt;/DIV&gt;
&lt;H4 aria-level="1"&gt;&lt;SPAN data-contrast="none"&gt;&lt;SPAN data-ccp-parastyle="heading 1"&gt;Architecture overview&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{&amp;quot;335559738&amp;quot;:400,&amp;quot;335559739&amp;quot;:120,&amp;quot;335572079&amp;quot;:4,&amp;quot;335572080&amp;quot;:2,&amp;quot;335572081&amp;quot;:12611584,&amp;quot;469789806&amp;quot;:&amp;quot;single&amp;quot;}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/H4&gt;
&lt;P&gt;&lt;SPAN data-contrast="none"&gt;The following diagram shows the two separate DNS paths that must be configured when ANF is deployed in a hub-spoke or Virtual WAN topology with an external private DNS forwarder. The client resolution path (VNET&amp;nbsp;DNS setting) and the ANF internal resolution path (AD connection DNS fields) are distinct and must not be conflated.&lt;/SPAN&gt;&lt;/P&gt;
&lt;img /&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;BLOCKQUOTE&gt;
&lt;P&gt;&lt;SPAN data-contrast="none"&gt;&lt;SPAN class="lia-text-color-10"&gt;&lt;STRONG&gt;Note&lt;/STRONG&gt;&lt;/SPAN&gt;:&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN data-contrast="none"&gt;ANF AD connection DNS IPs must point to the external DC IPs directly, not to the private DNS forwarder. The forwarder handles client-side resolution only and must have both forward and reverse rulesets for the AD domain.&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN data-contrast="none"&gt;&lt;STRONG&gt;Figure 1&lt;/STRONG&gt;: DNS resolution paths for ANF with an external private DNS forwarder. Client VMs use the forwarder (VNET DNS setting). ANF uses the external AD DC IPs directly (AD connection DNS fields). Both forward and reverse lookup rulesets are required on the forwarder.&lt;/SPAN&gt;&lt;/P&gt;
&lt;/BLOCKQUOTE&gt;
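&lt;P&gt;The rule in the note above lends itself to a simple pre-deployment check. The following Python sketch compares the DNS IPs entered on the AD connection against the known DC and forwarder IPs. The function name and all addresses are hypothetical. The sketch works on values you supply and does not read anything from Azure.&lt;/P&gt;
&lt;PRE&gt;&lt;CODE&gt;
```python
def check_ad_connection_dns(ad_connection_dns, dc_ips, forwarder_ips):
    """Return a list of problems found in the AD connection DNS fields."""
    problems = []
    if not ad_connection_dns:
        problems.append("no DNS IPs configured on the AD connection")
    for ip in ad_connection_dns:
        if ip in forwarder_ips:
            # ANF must query the DCs directly, never the forwarder.
            problems.append(ip + " is the private DNS forwarder, not a DC")
        elif ip not in dc_ips:
            problems.append(ip + " is not a known DC IP")
    return problems


# Hypothetical environment: two DCs and one private DNS forwarder.
DC_IPS = {"10.0.0.4", "10.0.0.5"}
FORWARDER_IPS = {"10.1.0.10"}

print(check_ad_connection_dns(["10.0.0.4", "10.0.0.5"], DC_IPS, FORWARDER_IPS))
print(check_ad_connection_dns(["10.1.0.10"], DC_IPS, FORWARDER_IPS))
```
&lt;/CODE&gt;&lt;/PRE&gt;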
&lt;H5 aria-level="1"&gt;&lt;SPAN data-contrast="none"&gt;&lt;SPAN data-ccp-parastyle="heading 1"&gt;What DNS must provide for Azure NetApp Files&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;/H5&gt;
&lt;P aria-level="1"&gt;&lt;STRONG&gt;Outbound resolution — ANF querying DNS&lt;/STRONG&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN data-contrast="none"&gt;ANF must be able to resolve the following records from the DNS IPs specified in the AD connection:&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{&amp;quot;335559738&amp;quot;:60,&amp;quot;335559739&amp;quot;:80}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/P&gt;
&lt;UL&gt;
&lt;LI aria-setsize="-1" data-leveltext="•" data-font="" data-listid="2" data-list-defn-props="{&amp;quot;335552541&amp;quot;:1,&amp;quot;335559685&amp;quot;:720,&amp;quot;335559991&amp;quot;:360,&amp;quot;469769242&amp;quot;:[8226],&amp;quot;469777803&amp;quot;:&amp;quot;left&amp;quot;,&amp;quot;469777804&amp;quot;:&amp;quot;•&amp;quot;,&amp;quot;469777815&amp;quot;:&amp;quot;hybridMultilevel&amp;quot;}" data-aria-posinset="1" data-aria-level="1"&gt;&lt;SPAN data-contrast="none"&gt;&lt;STRONG&gt;AD domain controller SRV records&lt;/STRONG&gt;:&lt;/SPAN&gt;&lt;SPAN data-contrast="none"&gt;&amp;nbsp;&lt;EM&gt;_ldap._tcp.&amp;lt;site&amp;gt;._sites.dc._msdcs.&amp;lt;domain&amp;gt;, _kerberos._tcp.dc._msdcs.&amp;lt;domain&amp;gt;&lt;/EM&gt;, and site-scoped equivalents&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{&amp;quot;335559738&amp;quot;:40,&amp;quot;335559739&amp;quot;:40}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/LI&gt;
&lt;/UL&gt;
&lt;UL&gt;
&lt;LI aria-setsize="-1" data-leveltext="•" data-font="" data-listid="2" data-list-defn-props="{&amp;quot;335552541&amp;quot;:1,&amp;quot;335559685&amp;quot;:720,&amp;quot;335559991&amp;quot;:360,&amp;quot;469769242&amp;quot;:[8226],&amp;quot;469777803&amp;quot;:&amp;quot;left&amp;quot;,&amp;quot;469777804&amp;quot;:&amp;quot;•&amp;quot;,&amp;quot;469777815&amp;quot;:&amp;quot;hybridMultilevel&amp;quot;}" data-aria-posinset="2" data-aria-level="1"&gt;&lt;SPAN data-contrast="none"&gt;&lt;STRONG&gt;Kerberos KDC records&lt;/STRONG&gt;:&lt;/SPAN&gt;&lt;SPAN data-contrast="none"&gt;&lt;EM&gt;&amp;nbsp;_kerberos._tcp.&amp;lt;domain&amp;gt;&lt;/EM&gt; and &lt;EM&gt;_kerberos-master._tcp/udp.&amp;lt;domain&amp;gt;&lt;/EM&gt;&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{&amp;quot;335559738&amp;quot;:40,&amp;quot;335559739&amp;quot;:40}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/LI&gt;
&lt;/UL&gt;
&lt;UL&gt;
&lt;LI aria-setsize="-1" data-leveltext="•" data-font="" data-listid="2" data-list-defn-props="{&amp;quot;335552541&amp;quot;:1,&amp;quot;335559685&amp;quot;:720,&amp;quot;335559991&amp;quot;:360,&amp;quot;469769242&amp;quot;:[8226],&amp;quot;469777803&amp;quot;:&amp;quot;left&amp;quot;,&amp;quot;469777804&amp;quot;:&amp;quot;•&amp;quot;,&amp;quot;469777815&amp;quot;:&amp;quot;hybridMultilevel&amp;quot;}" data-aria-posinset="3" data-aria-level="1"&gt;&lt;SPAN data-contrast="none"&gt;&lt;STRONG&gt;DC A records&lt;/STRONG&gt;:&lt;/SPAN&gt;&lt;SPAN data-contrast="none"&gt; Forward lookup for each DC hostname to its IP&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{&amp;quot;335559738&amp;quot;:40,&amp;quot;335559739&amp;quot;:40}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/LI&gt;
&lt;/UL&gt;
&lt;UL&gt;
&lt;LI aria-setsize="-1" data-leveltext="•" data-font="" data-listid="2" data-list-defn-props="{&amp;quot;335552541&amp;quot;:1,&amp;quot;335559685&amp;quot;:720,&amp;quot;335559991&amp;quot;:360,&amp;quot;469769242&amp;quot;:[8226],&amp;quot;469777803&amp;quot;:&amp;quot;left&amp;quot;,&amp;quot;469777804&amp;quot;:&amp;quot;•&amp;quot;,&amp;quot;469777815&amp;quot;:&amp;quot;hybridMultilevel&amp;quot;}" data-aria-posinset="4" data-aria-level="1"&gt;&lt;SPAN data-contrast="none"&gt;&lt;STRONG&gt;PTR (reverse) records&lt;/STRONG&gt;:&lt;/SPAN&gt;&lt;SPAN data-contrast="none"&gt; IP-to-hostname for each DC —&amp;nbsp;required&amp;nbsp;for dual-protocol volume creation, NFSv4.1 Kerberos, LDAP-over-TLS certificate validation, and NTFS ACL operations on SMB shares.&lt;/SPAN&gt;&lt;/LI&gt;
&lt;/UL&gt;
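&lt;P&gt;As a rough aid for checking the list above, the following Python sketch builds the SRV record names for one domain and, optionally, one AD site. The function name and the example site name &lt;EM&gt;Azure-Site&lt;/EM&gt; are illustrative assumptions, not ANF or AD identifiers. The resulting names can be passed to whichever lookup tool you already use.&lt;/P&gt;

```python
def required_srv_names(domain, site=None):
    """Return the SRV record names from the list above for one AD DNS domain."""
    names = [
        f"_ldap._tcp.dc._msdcs.{domain}",
        f"_kerberos._tcp.dc._msdcs.{domain}",
        f"_kerberos._tcp.{domain}",
        # These two are the records the article says must be added manually.
        f"_kerberos-master._tcp.{domain}",
        f"_kerberos-master._udp.{domain}",
    ]
    if site:
        # Site-scoped equivalents, only meaningful when an AD site is in use.
        names.append(f"_ldap._tcp.{site}._sites.dc._msdcs.{domain}")
        names.append(f"_kerberos._tcp.{site}._sites.dc._msdcs.{domain}")
    return names


if __name__ == "__main__":
    # "Azure-Site" is a hypothetical AD site name used for illustration.
    for name in required_srv_names("ad.contoso.com", site="Azure-Site"):
        print(name)
```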
&lt;BLOCKQUOTE&gt;
&lt;P&gt;&lt;SPAN data-ccp-props="{&amp;quot;335559739&amp;quot;:120}"&gt;&lt;SPAN class="lia-text-color-10"&gt;&lt;STRONG&gt;Note&lt;/STRONG&gt;&lt;/SPAN&gt;: &lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;EM&gt;&lt;SPAN data-ccp-props="{&amp;quot;335559739&amp;quot;:120}"&gt;_&lt;/SPAN&gt;&lt;/EM&gt;&lt;SPAN data-ccp-props="{&amp;quot;335559739&amp;quot;:120}"&gt;&lt;SPAN data-contrast="none"&gt;&lt;EM&gt;kerberos-master._tcp and _kerberos-master._udp&lt;/EM&gt; SRV records are not created automatically by Active Directory DNS. They must be added manually in the DNS zone. Their absence causes Kerberos failures that do not clearly identify DNS as the root cause. This requirement is not documented in any ANF article.&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;/P&gt;
&lt;/BLOCKQUOTE&gt;
&lt;P&gt;&lt;SPAN data-contrast="none"&gt;&lt;BR /&gt;ANF performs Secure Dynamic DNS (DDNS) using GSS-TSIG to register SMB and dual-protocol volume hostnames in AD DNS. This requires that the DNS IPs in the AD connection belong to Microsoft AD-integrated DNS servers. External private DNS forwarders (Infoblox, BIND, Unbound, and similar appliances) are not authoritative for the AD-integrated zone and, in the forwarder role, do not process the GSS-TSIG-signed updates that ANF sends. The updates are silently discarded, so volume hostnames never appear in DNS and SMB mounts by name fail. No error is surfaced in the ANF portal or activity log when a DDNS update is dropped.&lt;BR /&gt;&lt;BR /&gt;&lt;/SPAN&gt;&lt;STRONG&gt;&lt;SPAN data-contrast="none"&gt;&lt;SPAN data-ccp-parastyle="heading 2"&gt;Inbound resolution: clients resolving ANF hostnames&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;/STRONG&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN data-contrast="none"&gt;SMB and dual-protocol volumes are accessed via a hostname of the form &lt;EM&gt;&amp;lt;smb-prefix&amp;gt;-XXXX.&amp;lt;ad-dns-domain&amp;gt;&lt;/EM&gt;, where the four-character suffix is assigned by ANF and cannot be overridden. Clients must resolve this hostname to the&amp;nbsp;volume&amp;nbsp;IP via the&amp;nbsp;VNET&amp;nbsp;DNS server setting, which in enterprise environments points to the external private DNS forwarder. The forwarder must have a forward lookup ruleset for the AD domain pointing to the external DC IPs. NFSv3 mounts use the volume IP directly and do not require hostname resolution.&lt;BR /&gt;&lt;BR /&gt;&lt;/SPAN&gt;&lt;/P&gt;
&lt;BLOCKQUOTE&gt;
&lt;P&gt;&lt;SPAN data-ccp-props="{&amp;quot;335559738&amp;quot;:60,&amp;quot;335559739&amp;quot;:80}"&gt;&lt;SPAN class="lia-text-color-10"&gt;&lt;STRONG&gt;Note&lt;/STRONG&gt;&lt;/SPAN&gt;:&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN data-ccp-props="{&amp;quot;335559738&amp;quot;:60,&amp;quot;335559739&amp;quot;:80}"&gt;&lt;SPAN data-contrast="none"&gt;NFSv3 volume creation success does not indicate SMB readiness. NFSv3 mounts use the volume IP directly and require no AD join, Kerberos exchange, or reverse DNS. SMB and dual-protocol volumes require all three. Using NFSv3 as a connectivity proxy during SMB troubleshooting produces false confidence. This distinction is not documented in ANF troubleshooting guidance.&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;/P&gt;
&lt;/BLOCKQUOTE&gt;
&lt;H5 aria-level="1"&gt;&lt;SPAN data-contrast="none"&gt;&lt;SPAN data-ccp-parastyle="heading 1"&gt;Configuring the external private DNS forwarder&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{&amp;quot;335559738&amp;quot;:400,&amp;quot;335559739&amp;quot;:120,&amp;quot;335572079&amp;quot;:4,&amp;quot;335572080&amp;quot;:2,&amp;quot;335572081&amp;quot;:12611584,&amp;quot;469789806&amp;quot;:&amp;quot;single&amp;quot;}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/H5&gt;
&lt;P aria-level="2"&gt;&lt;STRONG&gt;&lt;SPAN data-contrast="none"&gt;&lt;SPAN data-ccp-parastyle="heading 2"&gt;The two DNS paths — client resolution vs ANF internal&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;/STRONG&gt;&lt;SPAN data-ccp-props="{&amp;quot;335559738&amp;quot;:280,&amp;quot;335559739&amp;quot;:80}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN data-contrast="none"&gt;In environments using an external private DNS forwarder (Infoblox, BIND, Windows DNS VM, or similar appliance), two distinct DNS paths must be kept separate. The&amp;nbsp;VNET&amp;nbsp;DNS server setting governs client resolution of ANF SMB hostnames and should point to the external forwarder. The ANF AD connection DNS fields govern ANF's own resolution of DCs and DDNS registration and must point directly to writable Microsoft AD-integrated DC IPs.&lt;/SPAN&gt;&lt;/P&gt;
&lt;DIV class="styles_lia-table-wrapper__h6Xo9 styles_table-responsive__MW0lN"&gt;&lt;table border="1" style="width: 99.9983%; height: 69.8618px; border-width: 1px;"&gt;&lt;tbody&gt;&lt;tr style="height: 34.9309px;"&gt;&lt;td style="height: 34.9309px;"&gt;&lt;STRONG&gt;DNS Path&lt;/STRONG&gt;&lt;/td&gt;&lt;td style="height: 34.9309px;"&gt;&lt;STRONG&gt;Used By&lt;/STRONG&gt;&lt;/td&gt;&lt;td style="height: 34.9309px;"&gt;&lt;STRONG&gt;Correct Target&lt;/STRONG&gt;&lt;/td&gt;&lt;/tr&gt;&lt;tr style="height: 34.9309px;"&gt;&lt;td style="height: 34.9309px;"&gt;Client resolution (VNET DNS setting)&lt;/td&gt;&lt;td style="height: 34.9309px;"&gt;&lt;SPAN data-contrast="none"&gt;VMs, Citrix, application servers resolving ANF SMB hostnames.&lt;/SPAN&gt;&lt;/td&gt;&lt;td style="height: 34.9309px;"&gt;&lt;SPAN data-contrast="none"&gt;External private DNS forwarder, which forwards AD zone queries to external DC IPs&lt;/SPAN&gt;&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;&lt;SPAN data-contrast="none"&gt;ANF internal resolution (AD connection DNS fields)&lt;/SPAN&gt;&lt;/td&gt;&lt;td&gt;&lt;SPAN data-contrast="none"&gt;ANF service — DDNS, Kerberos, LDAP, SRV lookup&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/td&gt;&lt;td&gt;&lt;SPAN data-contrast="none"&gt;Writable AD-integrated external DC IPs directly — not the forwarder&lt;/SPAN&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;colgroup&gt;&lt;col style="width: 33.33%" /&gt;&lt;col style="width: 33.33%" /&gt;&lt;col style="width: 33.33%" /&gt;&lt;/colgroup&gt;&lt;/table&gt;&lt;/DIV&gt;
&lt;H5&gt;&lt;SPAN data-contrast="none"&gt;&lt;SPAN data-ccp-parastyle="heading 2"&gt;Required rulesets on the external private DNS forwarder&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;/H5&gt;
&lt;P&gt;&lt;SPAN data-contrast="none"&gt;&lt;SPAN data-ccp-parastyle="heading 2"&gt;The external private DNS forwarder must have both of the following rulesets configured. Missing either one produces failures that are difficult to diagnose because forward DNS tests pass while the actual failure occurs in a different path.&amp;nbsp;&lt;BR /&gt;&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;SPAN data-contrast="none"&gt;&lt;SPAN data-ccp-parastyle="heading 2"&gt;&lt;BR /&gt;References:&lt;BR /&gt;&lt;A class="lia-external-url" href="https://learn.microsoft.com/en-us/azure/dns/private-resolver-endpoints-rulesets" target="_blank" rel="noopener"&gt;Understanding Private DNS resolver endpoints &amp;amp; rulesets&lt;/A&gt;&amp;nbsp;&lt;BR /&gt;&lt;A class="lia-external-url" href="https://learn.microsoft.com/en-us/azure/dns/private-reverse-dns" target="_blank" rel="noopener"&gt;How to create Private Reverse DNS records&amp;nbsp;&lt;/A&gt;&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P aria-level="3"&gt;&lt;STRONG&gt;&lt;SPAN data-contrast="none"&gt;&lt;SPAN data-ccp-parastyle="heading 3"&gt;Forward lookup ruleset&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{&amp;quot;335559738&amp;quot;:200,&amp;quot;335559739&amp;quot;:60}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/STRONG&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN data-contrast="none"&gt;Forward all queries for the AD domain to the external DC IPs:&lt;BR /&gt;&lt;/SPAN&gt;&lt;/P&gt;
&lt;BLOCKQUOTE&gt;
&lt;P&gt;&lt;SPAN data-contrast="none"&gt;&lt;STRONG&gt;Zone&lt;/STRONG&gt;:&amp;nbsp;&amp;nbsp;&amp;nbsp; ad.contoso.com&lt;BR /&gt;&lt;/SPAN&gt;&lt;SPAN data-contrast="none"&gt;&lt;STRONG&gt;Targets&lt;/STRONG&gt;: &amp;lt;DC-IP-1&amp;gt;:53, &amp;lt;DC-IP-2&amp;gt;:53&amp;nbsp;&amp;nbsp; (writable external DC IPs)&lt;/SPAN&gt;&lt;/P&gt;
&lt;/BLOCKQUOTE&gt;
&lt;P aria-level="3"&gt;&lt;STRONG&gt;&lt;SPAN data-contrast="none"&gt;&lt;SPAN data-ccp-parastyle="heading 3"&gt;Reverse lookup ruleset (most commonly missing)&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;/STRONG&gt;&lt;SPAN data-ccp-props="{&amp;quot;335559738&amp;quot;:200,&amp;quot;335559739&amp;quot;:60}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN data-contrast="none"&gt;Forward reverse lookup queries for the DC IP range to the external DC IPs:&lt;/SPAN&gt;&lt;/P&gt;
&lt;BLOCKQUOTE&gt;
&lt;P&gt;&lt;SPAN data-contrast="none"&gt;&lt;STRONG&gt;Zone&lt;/STRONG&gt;:&amp;nbsp;&amp;nbsp;&amp;nbsp; &amp;lt;reverse-octets&amp;gt;.in-addr.arpa.&lt;BR /&gt;&lt;/SPAN&gt;&lt;SPAN data-contrast="none"&gt;&lt;STRONG&gt;Targets&lt;/STRONG&gt;: &amp;lt;DC-IP-1&amp;gt;:53, &amp;lt;DC-IP-2&amp;gt;:53&amp;nbsp;&amp;nbsp; (same external DC IPs)&lt;/SPAN&gt;&lt;/P&gt;
&lt;/BLOCKQUOTE&gt;
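&lt;P&gt;The zone name for the reverse ruleset can be derived from the DC subnet rather than typed by hand. The sketch below uses Python's standard &lt;EM&gt;ipaddress&lt;/EM&gt; module. It assumes an IPv4 network on an octet boundary (/8, /16, or /24), and the 10.20.30.0/24 range in the usage note is a made-up example.&lt;/P&gt;

```python
import ipaddress


def reverse_zone_for_network(cidr):
    """Return the in-addr.arpa zone name for an IPv4 network on an octet boundary."""
    network = ipaddress.ip_network(cidr, strict=True)
    if network.version != 4 or network.prefixlen not in (8, 16, 24):
        raise ValueError("this sketch only handles IPv4 /8, /16 and /24 networks")
    kept = network.prefixlen // 8  # number of leading octets that name the zone
    octets = str(network.network_address).split(".")[:kept]
    return ".".join(reversed(octets)) + ".in-addr.arpa."


def ptr_name_for_ip(ip):
    """Return the PTR owner name that a reverse lookup of this IP queries."""
    return ipaddress.ip_address(ip).reverse_pointer + "."
```

&lt;P&gt;For a DC at the hypothetical address 10.20.30.60 in 10.20.30.0/24, &lt;EM&gt;reverse_zone_for_network("10.20.30.0/24")&lt;/EM&gt; gives the zone to enter in the ruleset, and &lt;EM&gt;ptr_name_for_ip("10.20.30.60")&lt;/EM&gt; gives the record name that must resolve inside it.&lt;/P&gt;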
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;BLOCKQUOTE&gt;
&lt;P&gt;&lt;SPAN class="lia-text-color-10"&gt;&lt;STRONG&gt;Critical&lt;/STRONG&gt;&lt;/SPAN&gt;:&lt;/P&gt;
&lt;P&gt;&lt;SPAN data-contrast="none"&gt;The reverse lookup ruleset is the most commonly missing configuration item and causes a failure that forward DNS tests do not detect. Without it, Windows clients cannot resolve DC IPs to hostnames. This produces the following error when provisioning NTFS permissions on an ANF SMB share: 'The program cannot open the required dialog box because it cannot determine whether the computer named is joined to a domain.' All connectivity tests pass. Forward DNS passes. The volume was created successfully. Only the reverse lookup fails — and only when NTFS ACL operations are attempted. This failure mode and its root cause are not documented in any ANF article.&lt;/SPAN&gt;&lt;/P&gt;
&lt;/BLOCKQUOTE&gt;
&lt;H5 aria-level="2"&gt;&lt;SPAN data-contrast="none"&gt;&lt;SPAN data-ccp-parastyle="heading 2"&gt;GSS-TSIG constraint — why the forwarder cannot be in the ANF AD connection&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{&amp;quot;335559738&amp;quot;:280,&amp;quot;335559739&amp;quot;:80}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/H5&gt;
&lt;P&gt;&lt;SPAN data-contrast="none"&gt;External private DNS forwarders (including Infoblox, BIND, Unbound, and third-party appliances) are not authoritative for the AD-integrated zone and, in the forwarder role, do not process GSS-TSIG-signed updates,&amp;nbsp;which is the mechanism ANF uses to securely register SMB volume hostnames in AD DNS. If a forwarder IP is placed in the ANF AD&amp;nbsp;connection&amp;nbsp;DNS fields, ANF sends DDNS update packets to the forwarder, which discards them silently. The volume hostname never appears in DNS. Clients cannot mount by name. No error is returned in the ANF portal.&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{&amp;quot;335559738&amp;quot;:60,&amp;quot;335559739&amp;quot;:80}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN data-contrast="none"&gt;The correct design: external DC IPs in the ANF AD connection, external private DNS forwarder as the&amp;nbsp;VNET&amp;nbsp;DNS server for clients only.&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{&amp;quot;335559738&amp;quot;:60,&amp;quot;335559739&amp;quot;:80}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;STRONG&gt;&lt;SPAN data-contrast="none"&gt;&lt;SPAN data-ccp-parastyle="heading 2"&gt;Role of 168.63.129.16&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{&amp;quot;335559738&amp;quot;:280,&amp;quot;335559739&amp;quot;:80}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/STRONG&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN data-contrast="none"&gt;168.63.129.16 is the Azure-provided internal resolver. It should be configured as the upstream forwarder target on the external private DNS forwarder for all queries not covered by AD or other conditional forwarders. This allows Azure-hosted DNS zones (such as any&amp;nbsp;privatelink.*&amp;nbsp;zones linked to your&amp;nbsp;VNET) to resolve correctly through the forwarder.&lt;/SPAN&gt;&lt;/P&gt;
&lt;BLOCKQUOTE&gt;
&lt;P&gt;&lt;SPAN data-contrast="none"&gt;&lt;SPAN class="lia-text-color-13"&gt;&lt;STRONG&gt;IMPORTANT&lt;/STRONG&gt;&lt;/SPAN&gt;: &lt;BR /&gt;&lt;STRONG&gt;168.63.129.16&lt;/STRONG&gt; must never be placed in the ANF AD connection DNS fields. It is not AD-aware, cannot answer SRV queries for your domain, cannot accept DDNS updates, and is unreachable from on-premises over ExpressRoute or VPN. Its correct position is as an upstream target on the external private DNS forwarder — not in ANF's AD connection. This is not stated anywhere in the ANF documentation set.&lt;/SPAN&gt;&lt;/P&gt;
&lt;/BLOCKQUOTE&gt;
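&lt;P&gt;The placement rules above lend themselves to a simple pre-deployment check. The following Python sketch is one possible shape for it. The function name, its parameters, and the IP addresses in the usage note are assumptions for illustration, and the lists of DC and forwarder IPs have to come from your own inventory.&lt;/P&gt;

```python
AZURE_PROVIDED_DNS = "168.63.129.16"


def check_ad_connection_dns(ad_connection_dns, writable_dc_ips, forwarder_ips):
    """Return a list of problems with the DNS IPs in the ANF AD connection."""
    problems = []
    for ip in ad_connection_dns:
        if ip == AZURE_PROVIDED_DNS:
            problems.append(f"{ip}: Azure-provided resolver, not AD-aware")
        elif ip in forwarder_ips:
            problems.append(f"{ip}: DNS forwarder, DDNS updates would be dropped")
        elif ip not in writable_dc_ips:
            problems.append(f"{ip}: not a known writable AD-integrated DC")
    return problems
```

&lt;P&gt;An empty list means every DNS IP in the AD connection is one of the writable DCs. For example, with hypothetical DCs 10.1.0.4 and 10.1.0.5 and a forwarder at 10.2.0.10, passing the forwarder or 168.63.129.16 as an AD connection DNS IP returns one problem entry each.&lt;/P&gt;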
&lt;H5 aria-level="2"&gt;&lt;SPAN data-contrast="none"&gt;&lt;SPAN data-ccp-parastyle="heading 2"&gt;DNS forwarder pattern comparison&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;/H5&gt;
&lt;DIV class="styles_lia-table-wrapper__h6Xo9 styles_table-responsive__MW0lN"&gt;&lt;table border="1" style="width: 99.9983%; border-width: 1px;"&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td&gt;&lt;STRONG&gt;Pattern&lt;/STRONG&gt;&lt;/td&gt;&lt;td&gt;&lt;STRONG&gt;DDNS Support&lt;/STRONG&gt;&lt;/td&gt;&lt;td&gt;&lt;STRONG&gt;Reverse DNS&lt;/STRONG&gt;&lt;/td&gt;&lt;td&gt;&lt;STRONG&gt;Ops overhead&lt;/STRONG&gt;&lt;/td&gt;&lt;td&gt;&lt;STRONG&gt;Best Fit&lt;/STRONG&gt;&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;&lt;SPAN data-contrast="none"&gt;AD DNS on DCs + upstream 168.63.129.16&lt;/SPAN&gt;&lt;/td&gt;&lt;td&gt;Yes, native&lt;/td&gt;&lt;td&gt;&lt;SPAN data-contrast="none"&gt;Yes, if reverse zones exist on the DCs&lt;/SPAN&gt;&lt;/td&gt;&lt;td&gt;Medium&lt;/td&gt;&lt;td&gt;&lt;SPAN data-contrast="none"&gt;Default; simplest topology&lt;/SPAN&gt;&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;&lt;SPAN data-contrast="none"&gt;External private DNS forwarder (VMs only) + external DC IPs in ANF AD connection&lt;/SPAN&gt;&lt;/td&gt;&lt;td&gt;&lt;SPAN data-contrast="none"&gt;Yes (ANF bypasses forwarder)&lt;/SPAN&gt;&lt;/td&gt;&lt;td&gt;&lt;SPAN data-contrast="none"&gt;Yes, if reverse ruleset exists on the forwarder&lt;/SPAN&gt;&lt;/td&gt;&lt;td&gt;Medium&lt;/td&gt;&lt;td&gt;&lt;SPAN data-contrast="none"&gt;Enterprise with existing DNS infrastructure&lt;/SPAN&gt;&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;&lt;SPAN data-contrast="none"&gt;External DNS forwarder in ANF AD connection (incorrect)&lt;/SPAN&gt;&lt;/td&gt;&lt;td&gt;&lt;SPAN data-contrast="none"&gt;No (DDNS silently dropped)&lt;/SPAN&gt;&lt;/td&gt;&lt;td&gt;&lt;SPAN data-contrast="none"&gt;N/A (config is wrong)&lt;/SPAN&gt;&lt;/td&gt;&lt;td&gt;High&lt;/td&gt;&lt;td&gt;&lt;SPAN data-contrast="none"&gt;Not supported&lt;/SPAN&gt;&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;&lt;SPAN data-contrast="none"&gt;Azure DNS Private Resolver in ANF AD connection (incorrect)&lt;/SPAN&gt;&lt;/td&gt;&lt;td&gt;&lt;SPAN data-contrast="none"&gt;No (DDNS not accepted)&lt;/SPAN&gt;&lt;/td&gt;&lt;td&gt;&lt;SPAN data-contrast="none"&gt;N/A (config is wrong)&lt;/SPAN&gt;&lt;/td&gt;&lt;td&gt;High&lt;/td&gt;&lt;td&gt;&lt;SPAN data-contrast="none"&gt;Not supported&lt;/SPAN&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;colgroup&gt;&lt;col style="width: 20.00%" /&gt;&lt;col style="width: 20.00%" /&gt;&lt;col style="width: 20.00%" /&gt;&lt;col style="width: 20.00%" /&gt;&lt;col style="width: 20.00%" /&gt;&lt;/colgroup&gt;&lt;/table&gt;&lt;/DIV&gt;
&lt;H4&gt;&lt;STRONG&gt;Azure Virtual WAN considerations&lt;/STRONG&gt;&lt;/H4&gt;
&lt;P&gt;When ANF is deployed in a spoke VNET connected to an Azure Virtual WAN hub, an additional routing requirement applies. If it is not met, Kerberos and LDAP traffic fails in a way that looks like a DNS or AD failure. This is one of the most common misdiagnoses in ANF deployments using Virtual WAN with NVA inspection.&lt;BR /&gt;&lt;BR /&gt;&lt;/P&gt;
&lt;P aria-level="2"&gt;&lt;STRONG&gt;&lt;SPAN data-contrast="none"&gt;&lt;SPAN data-ccp-parastyle="heading 2"&gt;ANF subnet prefix must be in Routing Intent &lt;/SPAN&gt;&lt;SPAN data-ccp-parastyle="heading 2"&gt;additional&lt;/SPAN&gt;&lt;SPAN data-ccp-parastyle="heading 2"&gt;&amp;nbsp;prefixes&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{&amp;quot;335559738&amp;quot;:280,&amp;quot;335559739&amp;quot;:80}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/STRONG&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN data-contrast="none"&gt;Azure Virtual WAN with Routing Intent routes private traffic through an NVA or Azure Firewall in the hub. For return traffic from external AD domain controllers to reach the ANF data plane IP, the hub must have an explicit routing entry for the ANF delegated subnet prefix. If the ANF delegated subnet is a /26 inside a larger VNET (/21 or /16), the broader VNET prefix alone is not sufficient — the specific /26 must be added explicitly.&lt;/SPAN&gt;&lt;/P&gt;
&lt;BLOCKQUOTE&gt;
&lt;P&gt;&lt;SPAN data-contrast="none"&gt;&lt;SPAN class="lia-text-color-13"&gt;&lt;STRONG&gt;Action Required&lt;/STRONG&gt;&lt;/SPAN&gt;: &lt;BR /&gt;In Azure Virtual WAN, go to Hub -&amp;gt; Routing -&amp;gt; Routing Intent -&amp;gt; Private Traffic -&amp;gt; Additional Prefixes and add the ANF delegated subnet prefix (&lt;EM&gt;for example, 10.x.x.0/26&lt;/EM&gt;) explicitly. Without this, Kerberos and LDAP reply traffic from external domain controllers is dropped before reaching the ANF data plane. The symptom is a successful TCP connection to port 88 followed by &lt;EM&gt;KRB5_KDC_UNREACH&lt;/EM&gt;, which looks like a Kerberos or DNS problem but is a routing problem.&lt;/SPAN&gt;&lt;/P&gt;
&lt;/BLOCKQUOTE&gt;
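&lt;P&gt;Whether the exact ANF prefix is listed, as opposed to only a broader VNET prefix that contains it, is easy to check mechanically. The sketch below uses Python's standard &lt;EM&gt;ipaddress&lt;/EM&gt; module. The function names and the 10.50.8.0 ranges in the usage note are hypothetical, and the prefix list is assumed to have been exported from the hub configuration by other means.&lt;/P&gt;

```python
import ipaddress


def anf_prefix_listed(anf_subnet, additional_prefixes):
    """True only if the exact ANF delegated subnet appears in the prefix list."""
    anf = ipaddress.ip_network(anf_subnet)
    return any(ipaddress.ip_network(p) == anf for p in additional_prefixes)


def covering_prefixes(anf_subnet, additional_prefixes):
    """Broader prefixes that contain the ANF subnet without matching it exactly."""
    anf = ipaddress.ip_network(anf_subnet)
    result = []
    for p in additional_prefixes:
        net = ipaddress.ip_network(p)
        if net != anf and anf.subnet_of(net):
            result.append(p)
    return result
```

&lt;P&gt;With a hypothetical ANF subnet 10.50.8.0/26 and a prefix list containing only 10.50.8.0/21, &lt;EM&gt;anf_prefix_listed&lt;/EM&gt; returns False while &lt;EM&gt;covering_prefixes&lt;/EM&gt; reports the /21. That is the situation described above in which the specific /26 still has to be added.&lt;/P&gt;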
&lt;P aria-level="2"&gt;&lt;STRONG&gt;&lt;SPAN data-contrast="none"&gt;&lt;SPAN data-ccp-parastyle="heading 2"&gt;&lt;BR /&gt;Use availability zone switching to surface detailed error messages&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{&amp;quot;335559738&amp;quot;:280,&amp;quot;335559739&amp;quot;:80}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/STRONG&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN data-contrast="none"&gt;When ANF volume creation fails with the generic 'context deadline exceeded' error from the&amp;nbsp;XMLrequest_filer&amp;nbsp;endpoint, the error does not&amp;nbsp;identify&amp;nbsp;the root cause. Redeploying the volume to a different availability zone (AZ1 to AZ2, or AZ2 to AZ3) forces a different backend assignment and consistently produces a more descriptive error that distinguishes routing failures (&lt;EM&gt;KRB5_KDC_UNREACH&lt;/EM&gt;) from Kerberos authentication failures, DNS lookup failures, and LDAP errors.&lt;/SPAN&gt;&lt;/P&gt;
&lt;BLOCKQUOTE&gt;
&lt;P&gt;&lt;SPAN data-ccp-props="{&amp;quot;335559738&amp;quot;:60,&amp;quot;335559739&amp;quot;:80}"&gt;&lt;STRONG&gt;&lt;SPAN class="lia-text-color-11"&gt;TIP&lt;/SPAN&gt;&lt;/STRONG&gt;: &lt;BR /&gt;&lt;SPAN data-contrast="none"&gt;If the detailed error shows 'Successfully connected to ip &amp;lt;DC-IP&amp;gt;, port 88 using TCP' followed by 'Cannot contact any KDC for requested realm', the outbound path works but reply packets are dropped — this is a routing problem, not a DNS or Kerberos problem. Check vWAN Routing Intent, NVA firewall rules, and UDRs for the ANF subnet prefix. This diagnostic technique is not documented in ANF troubleshooting guidance.&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;/P&gt;
&lt;/BLOCKQUOTE&gt;
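&lt;P&gt;The rule of thumb in the tip can be written as a small text check. The Python sketch below matches only the two message fragments quoted in the tip. The function name is an assumption, and real error text may be worded or cased differently, so treat a negative result as inconclusive.&lt;/P&gt;

```python
def looks_like_routing_problem(error_text):
    """Apply the heuristic from the tip above to a detailed error message."""
    text = error_text.lower()
    # Outbound path works: the TCP connection to the KDC port was established.
    connected = "successfully connected to ip" in text and "port 88 using tcp" in text
    # Replies never arrive: the KDC is then reported as unreachable.
    no_kdc = "cannot contact any kdc for requested realm" in text
    return connected and no_kdc
```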
&lt;H4&gt;&lt;SPAN data-ccp-props="{&amp;quot;335559738&amp;quot;:60,&amp;quot;335559739&amp;quot;:80}"&gt;&lt;SPAN data-contrast="none"&gt;&lt;SPAN data-ccp-parastyle="heading 1"&gt;Required DNS records&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;/H4&gt;
&lt;P&gt;&lt;SPAN data-ccp-props="{&amp;quot;335559738&amp;quot;:60,&amp;quot;335559739&amp;quot;:80}"&gt;&lt;SPAN data-contrast="none"&gt;&lt;SPAN data-ccp-parastyle="heading 1"&gt;The following records must exist in the AD DNS zone served by the external DC DNS servers and must be resolvable from the ANF delegated subnet. Records marked * are not created automatically by AD DNS and must be added manually.&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;STRONG&gt;&lt;SPAN data-ccp-props="{&amp;quot;335559738&amp;quot;:60,&amp;quot;335559739&amp;quot;:80}"&gt;&lt;SPAN data-contrast="none"&gt;&lt;SPAN data-ccp-parastyle="heading 1"&gt;&lt;SPAN data-ccp-parastyle="heading 2"&gt;Forward lookup zone&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;/STRONG&gt;&lt;/P&gt;
&lt;DIV class="styles_lia-table-wrapper__h6Xo9 styles_table-responsive__MW0lN"&gt;&lt;table border="1" style="width: 99.9983%; border-width: 1px;"&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td&gt;&lt;STRONG&gt;Record&lt;/STRONG&gt;&lt;/td&gt;&lt;td&gt;&lt;STRONG&gt;Type&lt;/STRONG&gt;&lt;/td&gt;&lt;td&gt;&lt;STRONG&gt;Notes&lt;/STRONG&gt;&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;
&lt;PRE&gt;&lt;SPAN data-contrast="none"&gt;_ldap._tcp.dc._msdcs.&amp;lt;domain&amp;gt;&lt;/SPAN&gt;&lt;/PRE&gt;
&lt;/td&gt;&lt;td&gt;
&lt;P&gt;&lt;SPAN data-contrast="none"&gt;SRV&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/P&gt;
&lt;/td&gt;&lt;td&gt;
&lt;P&gt;&lt;SPAN data-contrast="none"&gt;Domain-wide DC discovery&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/P&gt;
&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;
&lt;PRE&gt;&lt;SPAN data-contrast="none"&gt;_kerberos._tcp.dc._msdcs.&amp;lt;domain&amp;gt;&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/PRE&gt;
&lt;/td&gt;&lt;td&gt;
&lt;P&gt;&lt;SPAN data-contrast="none"&gt;SRV&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/P&gt;
&lt;/td&gt;&lt;td&gt;
&lt;P&gt;&lt;SPAN data-contrast="none"&gt;Domain-wide KDC discovery&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/P&gt;
&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;
&lt;PRE&gt;&lt;SPAN data-contrast="none"&gt;_ldap._tcp.&amp;lt;site&amp;gt;._sites.dc._msdcs.&amp;lt;domain&amp;gt;&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/PRE&gt;
&lt;/td&gt;&lt;td&gt;
&lt;P&gt;&lt;SPAN data-contrast="none"&gt;SRV&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/P&gt;
&lt;/td&gt;&lt;td&gt;
&lt;P&gt;&lt;SPAN data-contrast="none"&gt;Site-scoped — preferred when AD site is specified in ANF&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/P&gt;
&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;
&lt;PRE&gt;&lt;SPAN data-contrast="none"&gt;_kerberos._tcp.&amp;lt;site&amp;gt;._sites.dc._msdcs.&amp;lt;domain&amp;gt;&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/PRE&gt;
&lt;/td&gt;&lt;td&gt;
&lt;P&gt;&lt;SPAN data-contrast="none"&gt;SRV&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/P&gt;
&lt;/td&gt;&lt;td&gt;
&lt;P&gt;&lt;SPAN data-contrast="none"&gt;Site-scoped Kerberos&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/P&gt;
&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;
&lt;PRE&gt;&lt;SPAN data-contrast="none"&gt;_kerberos-master._tcp.&amp;lt;domain&amp;gt; *&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/PRE&gt;
&lt;/td&gt;&lt;td&gt;
&lt;P&gt;&lt;SPAN data-contrast="none"&gt;SRV&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/P&gt;
&lt;/td&gt;&lt;td&gt;
&lt;P&gt;&lt;SPAN data-contrast="none"&gt;NOT auto-created by AD DNS — must be added manually&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/P&gt;
&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;
&lt;PRE&gt;&lt;SPAN data-contrast="none"&gt;_kerberos-master._udp.&amp;lt;domain&amp;gt; *&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/PRE&gt;
&lt;/td&gt;&lt;td&gt;
&lt;P&gt;&lt;SPAN data-contrast="none"&gt;SRV&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/P&gt;
&lt;/td&gt;&lt;td&gt;
&lt;P&gt;&lt;SPAN data-contrast="none"&gt;NOT auto-created by AD DNS — must be added manually&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/P&gt;
&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;
&lt;PRE&gt;&lt;SPAN data-contrast="none"&gt;&amp;lt;dc-hostname&amp;gt;&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/PRE&gt;
&lt;/td&gt;&lt;td&gt;
&lt;P&gt;&lt;SPAN data-contrast="none"&gt;A&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/P&gt;
&lt;/td&gt;&lt;td&gt;
&lt;P&gt;&lt;SPAN data-contrast="none"&gt;Forward A record for each external DC in the AD site&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/P&gt;
&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;
&lt;PRE&gt;&lt;SPAN data-contrast="none"&gt;&amp;lt;anf-smb-hostname&amp;gt;&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/PRE&gt;
&lt;/td&gt;&lt;td&gt;
&lt;P&gt;&lt;SPAN data-contrast="none"&gt;A&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/P&gt;
&lt;/td&gt;&lt;td&gt;
&lt;P&gt;&lt;SPAN data-contrast="none"&gt;Registered by ANF via DDNS — must not be scavenged or blocked&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/P&gt;
&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;colgroup&gt;&lt;col style="width: 33.33%" /&gt;&lt;col style="width: 33.33%" /&gt;&lt;col style="width: 33.33%" /&gt;&lt;/colgroup&gt;&lt;/table&gt;&lt;/DIV&gt;
&lt;P&gt;&lt;BR /&gt;&lt;STRONG&gt;&lt;SPAN data-contrast="none"&gt;&lt;SPAN data-ccp-parastyle="heading 2"&gt;Reverse lookup zone (&amp;lt;reverse-octets&amp;gt;.in-&lt;/SPAN&gt;&lt;SPAN data-ccp-parastyle="heading 2"&gt;addr.arpa&lt;/SPAN&gt;&lt;SPAN data-ccp-parastyle="heading 2"&gt;)&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{&amp;quot;335559738&amp;quot;:280,&amp;quot;335559739&amp;quot;:80}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/STRONG&gt;&lt;/P&gt;
&lt;DIV class="styles_lia-table-wrapper__h6Xo9 styles_table-responsive__MW0lN"&gt;&lt;table border="1" style="width: 99.9983%; border-width: 1px;"&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td&gt;
&lt;P&gt;&lt;STRONG&gt;&lt;SPAN data-contrast="none"&gt;Record&lt;/SPAN&gt;&lt;/STRONG&gt;&lt;/P&gt;
&lt;/td&gt;&lt;td&gt;
&lt;P&gt;&lt;STRONG&gt;&lt;SPAN data-contrast="none"&gt;Type&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/STRONG&gt;&lt;/P&gt;
&lt;/td&gt;&lt;td&gt;
&lt;P&gt;&lt;STRONG&gt;&lt;SPAN data-contrast="none"&gt;Notes&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/STRONG&gt;&lt;/P&gt;
&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;
&lt;P&gt;&lt;SPAN data-contrast="none"&gt;PTR for each external DC IP&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/P&gt;
&lt;/td&gt;&lt;td&gt;
&lt;P&gt;&lt;SPAN data-contrast="none"&gt;PTR&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/P&gt;
&lt;/td&gt;&lt;td&gt;
&lt;P&gt;&lt;SPAN data-contrast="none"&gt;Required for dual-protocol, NFSv4.1 Kerberos, LDAP-over-TLS, and NTFS ACL operations&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/P&gt;
&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;
&lt;P&gt;&lt;SPAN data-contrast="none"&gt;PTR for each ANF volume IP *&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/P&gt;
&lt;/td&gt;&lt;td&gt;
&lt;P&gt;&lt;SPAN data-contrast="none"&gt;PTR&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/P&gt;
&lt;/td&gt;&lt;td&gt;
&lt;P&gt;&lt;SPAN data-contrast="none"&gt;Required for NFSv4.1 Kerberos reverse-lookup clients — create manually or via DDNS if supported&lt;/SPAN&gt;&lt;/P&gt;
&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;colgroup&gt;&lt;col style="width: 33.33%" /&gt;&lt;col style="width: 33.33%" /&gt;&lt;col style="width: 33.33%" /&gt;&lt;/colgroup&gt;&lt;/table&gt;&lt;/DIV&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P aria-level="1"&gt;&lt;STRONG&gt;&lt;SPAN data-contrast="none"&gt;&lt;SPAN data-ccp-parastyle="heading 1"&gt;How ANF internally fails when reverse DNS is missing&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;/STRONG&gt;&lt;SPAN data-ccp-props="{&amp;quot;335559738&amp;quot;:400,&amp;quot;335559739&amp;quot;:120,&amp;quot;335572079&amp;quot;:4,&amp;quot;335572080&amp;quot;:2,&amp;quot;335572081&amp;quot;:12611584,&amp;quot;469789806&amp;quot;:&amp;quot;single&amp;quot;}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN data-contrast="none"&gt;When the reverse DNS ruleset is absent from the external private DNS forwarder, the failure does not surface as a DNS error in the ANF portal. Instead, it propagates through ANF's internal security daemon (secd) and presents as a generic&amp;nbsp;InternalServerError&amp;nbsp;or a Kerberos authentication failure. Understanding the internal failure chain explains why reverse DNS is non-negotiable and why the symptom is so misleading.&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{&amp;quot;335559738&amp;quot;:60,&amp;quot;335559739&amp;quot;:80}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/P&gt;
&lt;P aria-level="2"&gt;&lt;STRONG&gt;&lt;SPAN data-contrast="none"&gt;&lt;SPAN data-ccp-parastyle="heading 2"&gt;The&amp;nbsp;&lt;/SPAN&gt;&lt;SPAN data-ccp-parastyle="heading 2"&gt;secd&lt;/SPAN&gt;&lt;SPAN data-ccp-parastyle="heading 2"&gt;&amp;nbsp;service list mechanism&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;/STRONG&gt;&lt;SPAN data-ccp-props="{&amp;quot;335559738&amp;quot;:280,&amp;quot;335559739&amp;quot;:80}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN data-contrast="none"&gt;ANF uses an internal process called&amp;nbsp;secd&amp;nbsp;(Security Daemon) to manage all Active Directory communication — Kerberos ticket exchange, LDAP binds, and DC discovery.&amp;nbsp;secd&amp;nbsp;maintains a service list of discovered DCs. When a DC communication attempt fails for any reason,&amp;nbsp;secd&amp;nbsp;marks that DC as UNUSABLE and records a forgive time after which it will retry. If all DCs in the service list are simultaneously marked UNUSABLE,&amp;nbsp;secd&amp;nbsp;returns RESULT_ERROR_SECD_NO_SERVER_AVAILABLE, which propagates to the portal as&amp;nbsp;InternalServerError.&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{&amp;quot;335559738&amp;quot;:60,&amp;quot;335559739&amp;quot;:80}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/P&gt;
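&lt;P&gt;The service list behaviour described above can be pictured with a toy model. The Python sketch below is not ANF code. The class, its method names, and the 60-second forgive interval are assumptions made purely to illustrate how every DC being marked UNUSABLE at the same time leads to a single "no server available" result.&lt;/P&gt;

```python
class ServiceList:
    """Toy model of the DC service list behaviour described above."""

    def __init__(self, dc_ips, forgive_seconds=60.0):
        # The real forgive interval is not stated in the article; 60 s is a guess.
        self.forgive_seconds = forgive_seconds
        # Maps DC IP to the time from which it may be retried (0 = usable now).
        self.forgive_at = {ip: 0.0 for ip in dc_ips}

    def mark_unusable(self, ip, now):
        """Record a failure: the DC is skipped until its forgive time passes."""
        self.forgive_at[ip] = now + self.forgive_seconds

    def usable(self, now):
        """Return the DCs whose forgive time has been reached."""
        return [ip for ip, t in self.forgive_at.items() if now >= t]

    def pick(self, now):
        """Return the first usable DC, or fail if none is usable."""
        candidates = self.usable(now)
        if not candidates:
            raise RuntimeError("RESULT_ERROR_SECD_NO_SERVER_AVAILABLE")
        return candidates[0]
```

&lt;P&gt;In this model a single failing check that affects every DC, such as a missing reverse record, empties the usable list at once, which mirrors why one DNS gap can surface as a generic server error.&lt;/P&gt;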
&lt;P aria-level="2"&gt;&lt;STRONG&gt;&lt;SPAN data-contrast="none"&gt;&lt;SPAN data-ccp-parastyle="heading 2"&gt;The reverse PTR lookup inside&amp;nbsp;&lt;/SPAN&gt;&lt;SPAN data-ccp-parastyle="heading 2"&gt;secd&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;/STRONG&gt;&lt;SPAN data-ccp-props="{&amp;quot;335559738&amp;quot;:280,&amp;quot;335559739&amp;quot;:80}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN data-contrast="none"&gt;A critical and undocumented&amp;nbsp;behavior: before&amp;nbsp;secd&amp;nbsp;completes a SASL/GSSAPI bind to an LDAP server, it performs a reverse PTR lookup of the DC's IP address. This lookup is used to&amp;nbsp;validate&amp;nbsp;the DC identity as part of&amp;nbsp;Kerberos&amp;nbsp;mutual authentication. If the PTR lookup fails — because the external private DNS forwarder has no reverse ruleset for the DC IP range —&amp;nbsp;secd&amp;nbsp;logs the failure and marks that DC UNUSABLE&amp;nbsp;immediately, even though TCP connectivity on ports 88 and 389 succeeded.&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{&amp;quot;335559738&amp;quot;:60,&amp;quot;335559739&amp;quot;:80}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN data-contrast="none"&gt;The following is the exact failure sequence from ANF backend logs when reverse DNS is absent:&lt;/SPAN&gt;&lt;/P&gt;
&lt;P aria-level="3"&gt;&lt;STRONG&gt;&lt;SPAN data-contrast="none"&gt;&lt;SPAN data-ccp-parastyle="heading 3"&gt;&lt;BR /&gt;Stage 1 — TCP connects succeed, PTR lookup fails&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{&amp;quot;335559738&amp;quot;:200,&amp;quot;335559739&amp;quot;:60}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/STRONG&gt;&lt;/P&gt;
&lt;PRE&gt;&lt;SPAN data-contrast="none"&gt;Successfully connected to ip 10.x.x.60, port 389 using TCP&lt;/SPAN&gt;&lt;BR /&gt;&lt;SPAN data-contrast="none"&gt;Entry for host-address: 10.x.x.60 not found in the current source: FILES&lt;/SPAN&gt;&lt;BR /&gt;&lt;SPAN data-contrast="none"&gt;Source: DNS unavailable. Entry for host-address: 10.x.x.60 not found in any of the available sources&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{&amp;quot;335557856&amp;quot;:3022366,&amp;quot;335559685&amp;quot;:240,&amp;quot;335559737&amp;quot;:240,&amp;quot;335559738&amp;quot;:40,&amp;quot;335559739&amp;quot;:40}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/PRE&gt;
&lt;P&gt;&lt;SPAN data-contrast="none"&gt;secd&amp;nbsp;successfully opens the TCP connection to the DC on port 389 (LDAP), then&amp;nbsp;immediately&amp;nbsp;attempts&amp;nbsp;a reverse lookup of that IP. The forwarder has no reverse ruleset, so DNS returns NXDOMAIN.&amp;nbsp;secd&amp;nbsp;logs 'DNS unavailable' and&amp;nbsp;proceeds&amp;nbsp;to mark the DC UNUSABLE.&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{&amp;quot;335559738&amp;quot;:60,&amp;quot;335559739&amp;quot;:80}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN data-ccp-props="{&amp;quot;335559739&amp;quot;:120}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/P&gt;
&lt;P aria-level="3"&gt;&lt;STRONG&gt;&lt;SPAN data-contrast="none"&gt;&lt;SPAN data-ccp-parastyle="heading 3"&gt;Stage 2 — GSSAPI bind fails&amp;nbsp;&lt;/SPAN&gt;&lt;SPAN data-ccp-parastyle="heading 3"&gt;as a consequence&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{&amp;quot;335559738&amp;quot;:200,&amp;quot;335559739&amp;quot;:60}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/STRONG&gt;&lt;/P&gt;
&lt;PRE&gt;&lt;SPAN data-contrast="none"&gt;Unable to SASL bind to LDAP server using GSSAPI: Local error&lt;/SPAN&gt;&lt;BR /&gt;&lt;SPAN data-contrast="none"&gt;Unable to connect to LDAP (Active Directory) service on dc01.ad.contoso.com (Error: Local error)&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{&amp;quot;335557856&amp;quot;:3022366,&amp;quot;335559685&amp;quot;:240,&amp;quot;335559737&amp;quot;:240,&amp;quot;335559738&amp;quot;:40,&amp;quot;335559739&amp;quot;:40}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/PRE&gt;
&lt;P&gt;&lt;SPAN data-contrast="none"&gt;Because the PTR lookup failed,&amp;nbsp;secd&amp;nbsp;cannot complete the GSSAPI mutual authentication context. The 'Local error' is not a Kerberos configuration problem — it is the direct result of the identity validation step failing due to missing reverse DNS.&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{&amp;quot;335559738&amp;quot;:60,&amp;quot;335559739&amp;quot;:80}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN data-ccp-props="{&amp;quot;335559739&amp;quot;:120}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/P&gt;
&lt;P aria-level="3"&gt;&lt;STRONG&gt;&lt;SPAN data-contrast="none"&gt;&lt;SPAN data-ccp-parastyle="heading 3"&gt;Stage 3 — All DCs marked UNUSABLE&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{&amp;quot;335559738&amp;quot;:200,&amp;quot;335559739&amp;quot;:60}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/STRONG&gt;&lt;/P&gt;
&lt;PRE&gt;&lt;SPAN data-contrast="none"&gt;10.x.x.27 UNUSABLE&amp;nbsp; Wed Apr&amp;nbsp; 8 00:19:24 2026&lt;/SPAN&gt;&lt;BR /&gt;&lt;SPAN data-contrast="none"&gt;10.x.x.28 UNUSABLE&amp;nbsp; Wed Apr&amp;nbsp; 8 00:19:24 2026&lt;/SPAN&gt;&lt;BR /&gt;&lt;SPAN data-contrast="none"&gt;10.x.x.29 UNUSABLE&amp;nbsp; Wed Apr&amp;nbsp; 8 00:19:25 2026&lt;BR /&gt;&lt;/SPAN&gt;&lt;SPAN data-contrast="none"&gt;10.x.x.30&amp;nbsp;UNUSABLE&amp;nbsp; Wed&amp;nbsp;Apr&amp;nbsp; 8&amp;nbsp;00:19:25 2026&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{&amp;quot;335557856&amp;quot;:3022366,&amp;quot;335559685&amp;quot;:240,&amp;quot;335559737&amp;quot;:240,&amp;quot;335559738&amp;quot;:40,&amp;quot;335559739&amp;quot;:40}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;BR /&gt;&lt;SPAN data-contrast="none"&gt;... (all DCs in the service list)&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{&amp;quot;335557856&amp;quot;:3022366,&amp;quot;335559685&amp;quot;:240,&amp;quot;335559737&amp;quot;:240,&amp;quot;335559738&amp;quot;:40,&amp;quot;335559739&amp;quot;:40}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/PRE&gt;
&lt;P&gt;&lt;SPAN data-contrast="none"&gt;secd&amp;nbsp;cycles through every DC in the discovered service list. Each DC fails the same PTR lookup and is marked UNUSABLE. Once the list is exhausted, secd has no server left to select and returns the error shown in Stage 4.&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{&amp;quot;335559738&amp;quot;:60,&amp;quot;335559739&amp;quot;:80}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN data-ccp-props="{&amp;quot;335559739&amp;quot;:120}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/P&gt;
&lt;P aria-level="3"&gt;&lt;STRONG&gt;&lt;SPAN data-contrast="none"&gt;&lt;SPAN data-ccp-parastyle="heading 3"&gt;Stage 4 — Service list exhausted, error propagates&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;/STRONG&gt;&lt;SPAN data-ccp-props="{&amp;quot;335559738&amp;quot;:200,&amp;quot;335559739&amp;quot;:60}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/P&gt;
&lt;PRE&gt;&lt;SPAN data-contrast="none"&gt;No servers in the service list which aren't marked bad&lt;/SPAN&gt;&lt;BR /&gt;&lt;SPAN data-contrast="none"&gt;Unable to select any server in the current serviceList&lt;/SPAN&gt;&lt;BR /&gt;&lt;SPAN data-contrast="none"&gt;RESULT_ERROR_SECD_NO_SERVER_AVAILABLE:6940&lt;/SPAN&gt;&lt;BR /&gt;&lt;SPAN data-contrast="none"&gt;No servers available for MS_LDAP_AD, domain: ad.contoso.com&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{&amp;quot;335557856&amp;quot;:3022366,&amp;quot;335559685&amp;quot;:240,&amp;quot;335559737&amp;quot;:240,&amp;quot;335559738&amp;quot;:40,&amp;quot;335559739&amp;quot;:40}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/PRE&gt;
&lt;P&gt;&lt;SPAN data-contrast="none"&gt;This internal error code&amp;nbsp;propagates up&amp;nbsp;through the ANF volume creation stack and is presented to the operator as the generic 'context deadline exceeded (Client.Timeout&amp;nbsp;exceeded while awaiting headers)' error in the portal.&amp;nbsp;The actual cause — missing PTR records — is completely obscured.&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{&amp;quot;335559738&amp;quot;:60,&amp;quot;335559739&amp;quot;:80}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN data-ccp-props="{&amp;quot;335559739&amp;quot;:120}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/P&gt;
&lt;P aria-level="3"&gt;&lt;STRONG&gt;&lt;SPAN data-contrast="none"&gt;&lt;SPAN data-ccp-parastyle="heading 3"&gt;Stage 5 — Kerberos pre-auth error (secondary, misleading)&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;/STRONG&gt;&lt;SPAN data-ccp-props="{&amp;quot;335559738&amp;quot;:200,&amp;quot;335559739&amp;quot;:60}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/P&gt;
&lt;PRE&gt;&lt;SPAN data-contrast="none"&gt;Received error from KDC: -1765328359 / Additional pre-authentication&amp;nbsp;required&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{&amp;quot;335557856&amp;quot;:3022366,&amp;quot;335559685&amp;quot;:240,&amp;quot;335559737&amp;quot;:240,&amp;quot;335559738&amp;quot;:40,&amp;quot;335559739&amp;quot;:40}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/PRE&gt;
&lt;P&gt;&lt;SPAN data-contrast="none"&gt;This Kerberos error code (&lt;EM&gt;KRB5KDC_ERR_PREAUTH_REQUIRED&lt;/EM&gt;) appears in logs and can mislead investigation&amp;nbsp;toward&amp;nbsp;Kerberos configuration, encryption type mismatches, or clock skew. It is a downstream consequence of the failed PTR-based GSSAPI context — not a root cause. Chasing this error without first verifying reverse DNS is a common and time-consuming dead end.&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{&amp;quot;335559738&amp;quot;:60,&amp;quot;335559739&amp;quot;:80}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/P&gt;
&lt;BLOCKQUOTE&gt;
&lt;P&gt;&lt;SPAN data-ccp-props="{&amp;quot;335559738&amp;quot;:60,&amp;quot;335559739&amp;quot;:80}"&gt;&lt;STRONG&gt;&lt;SPAN class="lia-text-color-12"&gt;KEY INSIGHT&lt;/SPAN&gt;&lt;/STRONG&gt;: &lt;BR /&gt;&lt;SPAN data-contrast="none"&gt;The complete failure chain is: missing reverse PTR ruleset on DNS forwarder → secd PTR lookup returns DNS unavailable → GSSAPI mutual auth cannot complete → DC marked UNUSABLE → all DCs exhausted → InternalServerError at portal. TCP connectivity on ports 88 and 389 succeeds at every stage. Only the PTR lookup fails. This is why all standard connectivity tests pass and the issue remains invisible until reverse DNS is specifically tested.&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;/P&gt;
&lt;/BLOCKQUOTE&gt;
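&lt;P&gt;When backend log excerpts are available, the failure chain suggests a triage order: look for the PTR failure signature before spending time on the Kerberos message. The following is a rough sketch of that ordering in Python. The matched substrings are taken from the log excerpts in this article; the function name triage and the verdict strings are illustrative assumptions, not ANF terminology.&lt;/P&gt;

```python
# Substrings copied from the log excerpts shown in the stages above.
PTR_FAILURE = "not found in any of the available sources"
LIST_EXHAUSTED = "RESULT_ERROR_SECD_NO_SERVER_AVAILABLE"
PREAUTH_MESSAGE = "Additional pre-authentication required"


def triage(log_lines):
    """Return a coarse verdict for a list of secd log lines.

    The checks run in priority order: a PTR failure outranks every
    other signature because the later errors can follow from it.
    """
    text = "\n".join(log_lines)
    if PTR_FAILURE in text:
        return "check reverse DNS first"
    if LIST_EXHAUSTED in text:
        return "service list exhausted; cause not visible in these lines"
    if PREAUTH_MESSAGE in text:
        return "pre-auth message alone is not conclusive"
    return "no known signature"
```

&lt;P&gt;Passing the Stage 1 through Stage 5 lines together yields the reverse DNS verdict, even though the pre-authentication message is also present.&lt;/P&gt;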
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P aria-level="2"&gt;&lt;STRONG&gt;&lt;SPAN data-contrast="none"&gt;&lt;SPAN data-ccp-parastyle="heading 2"&gt;Why this failure is invisible to standard troubleshooting&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;/STRONG&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN data-contrast="none"&gt;Standard ANF DNS troubleshooting checks forward SRV records, forward A record resolution, and TCP port connectivity to DCs. All of these pass when only the reverse ruleset is missing. The secd PTR lookup is an internal step that occurs after TCP connectivity is confirmed and is not tested by any of the standard nslookup or Test-NetConnection commands used during initial validation. The only reliable way to surface this failure without access to backend logs is to explicitly test reverse PTR resolution from the ANF VNET — as documented in the verification section below.&lt;/SPAN&gt;&lt;/P&gt;
&lt;P aria-level="1"&gt;&lt;SPAN data-contrast="none"&gt;&lt;SPAN data-ccp-parastyle="heading 1"&gt;&lt;BR /&gt;&lt;STRONG&gt;Verify DNS configuration&lt;/STRONG&gt;&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{&amp;quot;335559738&amp;quot;:400,&amp;quot;335559739&amp;quot;:120,&amp;quot;335572079&amp;quot;:4,&amp;quot;335572080&amp;quot;:2,&amp;quot;335572081&amp;quot;:12611584,&amp;quot;469789806&amp;quot;:&amp;quot;single&amp;quot;}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN data-contrast="none"&gt;Run the following commands from a test VM in the same&amp;nbsp;VNET&amp;nbsp;as the ANF delegated subnet. Use the external private DNS forwarder IP for client-side tests and an external DC IP for ANF-side tests.&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{&amp;quot;335559738&amp;quot;:60,&amp;quot;335559739&amp;quot;:80}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/P&gt;
&lt;P aria-level="3"&gt;&lt;STRONG&gt;&lt;SPAN data-contrast="none"&gt;&lt;SPAN data-ccp-parastyle="heading 3"&gt;Forward SRV lookup — site-scoped&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;/STRONG&gt;&lt;/P&gt;
&lt;PRE&gt;&lt;SPAN data-contrast="none"&gt; nslookup -type=SRV _ldap._tcp.&amp;lt;SITE&amp;gt;._sites.dc._msdcs.&amp;lt;domain&amp;gt; &amp;lt;forwarder-IP&amp;gt;&lt;BR /&gt;&lt;/SPAN&gt;&lt;SPAN data-contrast="none"&gt; nslookup -type=SRV _kerberos._tcp.&amp;lt;SITE&amp;gt;._sites.dc._msdcs.&amp;lt;domain&amp;gt; &amp;lt;forwarder-IP&amp;gt;&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{&amp;quot;335557856&amp;quot;:3022366,&amp;quot;335559685&amp;quot;:240,&amp;quot;335559737&amp;quot;:240,&amp;quot;335559738&amp;quot;:40,&amp;quot;335559739&amp;quot;:40}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/PRE&gt;
&lt;P aria-level="3"&gt;&lt;STRONG&gt;&lt;SPAN data-contrast="none"&gt;&lt;SPAN data-ccp-parastyle="heading 3"&gt;Forward SRV lookup — domain-wide&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;/STRONG&gt;&lt;SPAN data-ccp-props="{&amp;quot;335559738&amp;quot;:200,&amp;quot;335559739&amp;quot;:60}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/P&gt;
&lt;PRE&gt;&lt;SPAN data-contrast="none"&gt; nslookup -type=SRV _ldap._tcp.dc._msdcs.&amp;lt;domain&amp;gt; &amp;lt;forwarder-IP&amp;gt;&lt;/SPAN&gt;&lt;BR /&gt;&lt;SPAN data-contrast="none"&gt;&amp;nbsp;nslookup -type=SRV _kerberos._tcp.dc._msdcs.&amp;lt;domain&amp;gt; &amp;lt;forwarder-IP&amp;gt;&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{&amp;quot;335557856&amp;quot;:3022366,&amp;quot;335559685&amp;quot;:240,&amp;quot;335559737&amp;quot;:240,&amp;quot;335559738&amp;quot;:40,&amp;quot;335559739&amp;quot;:40}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/PRE&gt;
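&lt;P&gt;When the same SRV checks have to be repeated for several sites or domains, it can help to generate the query names instead of typing them. The sketch below builds the names used in the nslookup commands above. The function name srv_names is an assumption, and the site name Azure-Site in the usage note is a placeholder.&lt;/P&gt;

```python
def srv_names(domain, site=None):
    """Build the LDAP and Kerberos SRV query names for a domain.

    With a site, the names are site-scoped; without one, they are domain-wide.
    """
    services = ("_ldap", "_kerberos")
    if site:
        return [f"{svc}._tcp.{site}._sites.dc._msdcs.{domain}" for svc in services]
    return [f"{svc}._tcp.dc._msdcs.{domain}" for svc in services]
```

&lt;P&gt;For example, srv_names("ad.contoso.com", site="Azure-Site") returns the two site-scoped names, each of which can be passed to nslookup -type=SRV together with the forwarder IP.&lt;/P&gt;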
&lt;P aria-level="3"&gt;&lt;STRONG&gt;&lt;SPAN data-contrast="none"&gt;&lt;SPAN data-ccp-parastyle="heading 3"&gt;Reverse PTR lookup — use the external forwarder IP&amp;nbsp;&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;/STRONG&gt;&lt;/P&gt;
&lt;PRE&gt;&lt;SPAN data-contrast="none"&gt;&amp;nbsp;nslookup &amp;lt;external-DC-IP&amp;gt; &amp;lt;forwarder-IP&amp;gt;&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{&amp;quot;335557856&amp;quot;:3022366,&amp;quot;335559685&amp;quot;:240,&amp;quot;335559737&amp;quot;:240,&amp;quot;335559738&amp;quot;:40,&amp;quot;335559739&amp;quot;:40}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/PRE&gt;
&lt;P&gt;&lt;STRONG&gt;&lt;SPAN data-contrast="none"&gt;Expected output:&lt;/SPAN&gt;&lt;/STRONG&gt;&lt;SPAN data-ccp-props="{&amp;quot;335559738&amp;quot;:60,&amp;quot;335559739&amp;quot;:80}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/P&gt;
&lt;PRE&gt;&lt;SPAN data-contrast="none"&gt; Server:&amp;nbsp; &amp;lt;forwarder-IP&amp;gt;&lt;BR /&gt;&lt;/SPAN&gt;&lt;SPAN data-contrast="none"&gt; &amp;lt;reverse-arpa&amp;gt;&amp;nbsp; name = dc01.ad.contoso.com.&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{&amp;quot;335557856&amp;quot;:3022366,&amp;quot;335559685&amp;quot;:240,&amp;quot;335559737&amp;quot;:240,&amp;quot;335559738&amp;quot;:40,&amp;quot;335559739&amp;quot;:40}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/PRE&gt;
&lt;P&gt;&lt;SPAN data-contrast="none"&gt;If the reverse lookup returns NXDOMAIN or times out while the forward lookup succeeds, add the reverse DNS ruleset to the external private DNS forwarder. A missing reverse ruleset is the most common cause of NTFS permission failures after a volume is successfully created.&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{&amp;quot;335559738&amp;quot;:60,&amp;quot;335559739&amp;quot;:80}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/P&gt;
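&lt;P&gt;When adding a reverse ruleset, it helps to know the exact names involved: the PTR owner name that a reverse lookup queries, and the in-addr.arpa zone that must cover it. The sketch below derives both with the Python standard library ipaddress module. The address 10.1.2.60 in the usage note is a placeholder, and the /24 zone boundary is an assumption; a different prefix length needs a different zone name.&lt;/P&gt;

```python
import ipaddress


def ptr_name(ip):
    """Return the PTR owner name that a reverse lookup of ip queries."""
    return ipaddress.ip_address(ip).reverse_pointer


def reverse_zone_24(ip):
    """Return the in-addr.arpa zone covering the /24 that contains ip (IPv4 only)."""
    octets = str(ipaddress.IPv4Address(ip)).split(".")
    # The zone name is the first three octets in reverse order.
    return ".".join(reversed(octets[:3])) + ".in-addr.arpa"
```

&lt;P&gt;For the placeholder address, ptr_name("10.1.2.60") gives 60.2.1.10.in-addr.arpa and reverse_zone_24("10.1.2.60") gives 2.1.10.in-addr.arpa, which is the zone the reverse ruleset would need to cover under the /24 assumption.&lt;/P&gt;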
&lt;P aria-level="3"&gt;&lt;STRONG&gt;&lt;SPAN data-contrast="none"&gt;&lt;SPAN data-ccp-parastyle="heading 3"&gt;Port connectivity from ANF&amp;nbsp;&lt;/SPAN&gt;&lt;SPAN data-ccp-parastyle="heading 3"&gt;VNET&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{&amp;quot;335559738&amp;quot;:200,&amp;quot;335559739&amp;quot;:60}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/STRONG&gt;&lt;/P&gt;
&lt;PRE&gt;&lt;SPAN data-contrast="none"&gt;Test-NetConnection -ComputerName &amp;lt;external-DC-IP&amp;gt; -Port 88&amp;nbsp;&amp;nbsp;&amp;nbsp; # &lt;STRONG&gt;Kerberos&lt;/STRONG&gt;&lt;/SPAN&gt;&lt;BR /&gt;&lt;SPAN data-contrast="none"&gt;Test-NetConnection -ComputerName &amp;lt;external-DC-IP&amp;gt; -Port 389&amp;nbsp;&amp;nbsp; # &lt;STRONG&gt;LDAP&lt;/STRONG&gt;&lt;/SPAN&gt;&lt;BR /&gt;&lt;SPAN data-contrast="none"&gt;Test-NetConnection&amp;nbsp;-ComputerName&amp;nbsp;&amp;lt;forwarder-IP&amp;gt;&amp;nbsp; -Port 53&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; # &lt;STRONG&gt;DNS&lt;/STRONG&gt;&lt;/SPAN&gt;&lt;STRONG&gt;&lt;SPAN data-ccp-props="{&amp;quot;335557856&amp;quot;:3022366,&amp;quot;335559685&amp;quot;:240,&amp;quot;335559737&amp;quot;:240,&amp;quot;335559738&amp;quot;:40,&amp;quot;335559739&amp;quot;:40}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/STRONG&gt;&lt;/PRE&gt;
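&lt;P&gt;Test-NetConnection is a PowerShell cmdlet. On a Linux test VM, a comparable TCP check can be approximated with the Python standard library, as in the sketch below. The function name tcp_port_open and the default timeout are assumptions. The check covers TCP only, so it does not exercise DNS over UDP on port 53, and like the commands above it says nothing about reverse DNS.&lt;/P&gt;

```python
import socket


def tcp_port_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port completes within timeout seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable networks.
        return False
```

&lt;P&gt;Calling tcp_port_open with an external DC IP and ports 88 and 389, and with the forwarder IP and port 53, mirrors the three checks above.&lt;/P&gt;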
&lt;H4&gt;&lt;STRONG&gt;&lt;SPAN data-contrast="none"&gt;&lt;SPAN data-ccp-parastyle="heading 1"&gt;Common issues and resolutions&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;/STRONG&gt;&lt;/H4&gt;
&lt;DIV class="styles_lia-table-wrapper__h6Xo9 styles_table-responsive__MW0lN"&gt;&lt;table border="1" style="width: 99.9983%; height: 314.378px; border-width: 1px;"&gt;&lt;tbody&gt;&lt;tr style="height: 34.9309px;"&gt;&lt;td style="height: 34.9309px;"&gt;&lt;STRONG&gt;Symptom&lt;/STRONG&gt;&lt;/td&gt;&lt;td style="height: 34.9309px;"&gt;&lt;STRONG&gt;Likely cause&lt;/STRONG&gt;&lt;/td&gt;&lt;td style="height: 34.9309px;"&gt;&lt;STRONG&gt;Resolution&lt;/STRONG&gt;&lt;/td&gt;&lt;/tr&gt;&lt;tr style="height: 34.9309px;"&gt;&lt;td&gt;
&lt;PRE&gt;&lt;SPAN data-contrast="none"&gt;InternalServerError: context deadline exceeded (XMLrequest_filer)&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/PRE&gt;
&lt;/td&gt;&lt;td&gt;
&lt;P&gt;&lt;SPAN data-contrast="none"&gt;Generic ANF backend timeout — routing or Kerberos root cause not visible at this level&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/P&gt;
&lt;/td&gt;&lt;td&gt;
&lt;P&gt;&lt;SPAN data-contrast="none"&gt;Switch deployment availability zone for a detailed error. Check&amp;nbsp;vWAN&amp;nbsp;Routing Intent for ANF /26 prefix.&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/P&gt;
&lt;/td&gt;&lt;/tr&gt;&lt;tr style="height: 34.9309px;"&gt;&lt;td&gt;
&lt;PRE&gt;&lt;SPAN data-contrast="none"&gt;KRB5_KDC_UNREACH — TCP port 88 connects succeed but auth fails&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/PRE&gt;
&lt;/td&gt;&lt;td&gt;
&lt;P&gt;&lt;SPAN data-contrast="none"&gt;Return traffic from external DCs dropped before reaching ANF NIC — routing issue, not DNS&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/P&gt;
&lt;/td&gt;&lt;td&gt;
&lt;P&gt;&lt;SPAN data-contrast="none"&gt;Add ANF subnet /26 to&amp;nbsp;vWAN&amp;nbsp;Hub Routing Intent &amp;gt; Additional Prefixes &amp;gt; Private Traffic&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/P&gt;
&lt;/td&gt;&lt;/tr&gt;&lt;tr style="height: 34.9309px;"&gt;&lt;td&gt;
&lt;PRE&gt;&lt;SPAN data-contrast="none"&gt;'Cannot determine whether the computer is joined to a domain' — NTFS permissions&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/PRE&gt;
&lt;/td&gt;&lt;td&gt;
&lt;P&gt;&lt;SPAN data-contrast="none"&gt;Reverse DNS (PTR) lookup failing on external forwarder for DC IPs&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/P&gt;
&lt;/td&gt;&lt;td&gt;
&lt;P&gt;&lt;SPAN data-contrast="none"&gt;Add reverse lookup ruleset to external private DNS forwarder: &amp;lt;reverse-zone&amp;gt;.in-addr.arpa. &amp;gt; external DC IPs&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/P&gt;
&lt;/td&gt;&lt;/tr&gt;&lt;tr style="height: 34.9309px;"&gt;&lt;td&gt;
&lt;PRE&gt;&lt;SPAN data-contrast="none"&gt;DDNS fails — SMB hostname not in DNS after volume creation&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/PRE&gt;
&lt;/td&gt;&lt;td&gt;
&lt;P&gt;&lt;SPAN data-contrast="none"&gt;ANF AD connection DNS IPs point to external forwarder — GSS-TSIG not supported by forwarder&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/P&gt;
&lt;/td&gt;&lt;td&gt;
&lt;P&gt;&lt;SPAN data-contrast="none"&gt;Set AD connection DNS IPs to writable Microsoft AD-integrated external DC IPs directly&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/P&gt;
&lt;/td&gt;&lt;/tr&gt;&lt;tr style="height: 34.9309px;"&gt;&lt;td&gt;
&lt;PRE&gt;&lt;SPAN data-contrast="none"&gt;'Failed to validate LDAP configuration' during dual-protocol creation&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/PRE&gt;
&lt;/td&gt;&lt;td&gt;
&lt;P&gt;&lt;SPAN data-contrast="none"&gt;Missing PTR records for external DCs, or reverse zone unreachable&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/P&gt;
&lt;/td&gt;&lt;td&gt;
&lt;P&gt;&lt;SPAN data-contrast="none"&gt;Add PTR records for all external DCs. Verify reverse ruleset is present on the forwarder.&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/P&gt;
&lt;/td&gt;&lt;/tr&gt;&lt;tr style="height: 34.9309px;"&gt;&lt;td&gt;
&lt;PRE&gt;&lt;SPAN data-contrast="none"&gt;NFSv4.1 Kerberos: 'Cannot determine realm for numeric host address'&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/PRE&gt;
&lt;/td&gt;&lt;td&gt;
&lt;P&gt;&lt;SPAN data-contrast="none"&gt;Missing PTR for ANF volume IP or external DC IPs&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/P&gt;
&lt;/td&gt;&lt;td&gt;
&lt;P&gt;&lt;SPAN data-contrast="none"&gt;Add PTR records for ANF volume IPs and all external DC IPs in the reverse zone.&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/P&gt;
&lt;/td&gt;&lt;/tr&gt;&lt;tr style="height: 34.9309px;"&gt;&lt;td&gt;
&lt;PRE&gt;&lt;SPAN data-contrast="none"&gt;SMB hostname resolves&amp;nbsp;on-premises&amp;nbsp;but not from Azure VMs&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/PRE&gt;
&lt;/td&gt;&lt;td&gt;
&lt;P&gt;&lt;SPAN data-contrast="none"&gt;External private DNS forwarder missing forward ruleset for AD zone, or targeting wrong DC IPs&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/P&gt;
&lt;/td&gt;&lt;td&gt;
&lt;P&gt;&lt;SPAN data-contrast="none"&gt;Verify forward ruleset is present and targeting reachable writable external DC IPs.&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/P&gt;
&lt;/td&gt;&lt;/tr&gt;&lt;tr style="height: 34.9309px;"&gt;&lt;td&gt;
&lt;PRE&gt;&lt;SPAN data-contrast="none"&gt;Volume creation fails after external DC IP change&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/PRE&gt;
&lt;/td&gt;&lt;td&gt;
&lt;P&gt;&lt;SPAN data-contrast="none"&gt;External DNS forwarder (especially BIND) caching&amp;nbsp;stale DC IPs — default TTL up to 7 days&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/P&gt;
&lt;/td&gt;&lt;td&gt;
&lt;P&gt;&lt;SPAN data-contrast="none"&gt;Flush forwarder cache. Set short TTLs on DC A records. Consider Microsoft AD-integrated DNS for AD zones.&lt;/SPAN&gt;&lt;/P&gt;
&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;colgroup&gt;&lt;col style="width: 33.33%" /&gt;&lt;col style="width: 33.33%" /&gt;&lt;col style="width: 33.33%" /&gt;&lt;/colgroup&gt;&lt;/table&gt;&lt;/DIV&gt;
&lt;H4 aria-level="1"&gt;&lt;STRONG&gt;&lt;SPAN data-contrast="none"&gt;&lt;SPAN data-ccp-parastyle="heading 1"&gt;Summary of key requirements:&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;/STRONG&gt;&lt;/H4&gt;
&lt;OL&gt;
&lt;LI&gt;&lt;SPAN data-contrast="none"&gt;ANF AD connection DNS IPs must point to writable Microsoft AD-integrated DNS servers (external DC IPs) — not the external private DNS forwarder, not Azure DNS Private Resolver, not 168.63.129.16.&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{&amp;quot;335559738&amp;quot;:40,&amp;quot;335559739&amp;quot;:40}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/LI&gt;
&lt;LI&gt;&lt;SPAN data-contrast="none"&gt;The external private DNS forwarder must have both a forward ruleset (AD domain &amp;gt; external DC IPs) and a reverse ruleset (in-addr.arpa. zone for DC IP ranges &amp;gt; external DC IPs). The reverse ruleset is&amp;nbsp;required&amp;nbsp;for NTFS ACL operations on SMB shares and is not mentioned in ANF documentation.&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{&amp;quot;335559738&amp;quot;:40,&amp;quot;335559739&amp;quot;:40}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/LI&gt;
&lt;LI&gt;&lt;SPAN data-contrast="none"&gt;168.63.129.16 is the upstream forwarder target on the external DNS forwarder — not a target in the ANF AD connection. It is unreachable from on-premises and is not AD-aware.&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{&amp;quot;335559738&amp;quot;:40,&amp;quot;335559739&amp;quot;:40}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/LI&gt;
&lt;LI&gt;&lt;SPAN data-contrast="none"&gt;External private DNS forwarders (Infoblox, BIND, Unbound) that only forward queries for the AD zone do not accept the GSS-TSIG-secured dynamic updates used for DDNS. Placing a forwarder IP in the ANF AD connection causes silent DDNS failure with no portal error.&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{&amp;quot;335559738&amp;quot;:40,&amp;quot;335559739&amp;quot;:40}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/LI&gt;
&lt;LI&gt;&lt;SPAN data-contrast="none"&gt;In Virtual WAN deployments, add the ANF delegated subnet /26 to the hub Routing Intent under Additional Prefixes for Private Traffic. The broader&amp;nbsp;VNET&amp;nbsp;prefix alone is not sufficient.&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{&amp;quot;335559738&amp;quot;:40,&amp;quot;335559739&amp;quot;:40}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/LI&gt;
&lt;LI&gt;&lt;SPAN data-contrast="none"&gt;NFSv3 volume creation success does not&amp;nbsp;indicate&amp;nbsp;SMB readiness — NFSv3 uses the IP directly and bypasses AD, Kerberos, and reverse DNS.&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{&amp;quot;335559738&amp;quot;:40,&amp;quot;335559739&amp;quot;:40}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/LI&gt;
&lt;LI&gt;&lt;SPAN data-contrast="none"&gt;_kerberos-master SRV records are not created automatically by AD DNS and must be added manually.&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{&amp;quot;335559738&amp;quot;:40,&amp;quot;335559739&amp;quot;:40}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/LI&gt;
&lt;LI&gt;&lt;SPAN data-contrast="none"&gt;Disable DNS scavenging on zones&amp;nbsp;containing&amp;nbsp;ANF records, or pre-create the records as static entries, because ANF does not aggressively refresh DDNS registrations.&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{&amp;quot;335559738&amp;quot;:40,&amp;quot;335559739&amp;quot;:40}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/LI&gt;
&lt;LI&gt;&lt;SPAN data-contrast="none"&gt;When volume creation fails with a generic 'context deadline exceeded' error, switch the deployment availability zone before deep troubleshooting to surface a more descriptive error.&lt;/SPAN&gt;&lt;/LI&gt;
&lt;/OL&gt;
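&lt;P&gt;Some of these requirements can be checked mechanically before a volume is created. The following is a minimal sketch of such a pre-flight check, covering the AD connection DNS IPs and the Routing Intent prefix only. It calls no Azure API: the function name preflight and all parameter names are assumptions, and the caller is expected to supply the values from its own configuration.&lt;/P&gt;

```python
import ipaddress

# The Azure platform IP discussed above; it must not appear in the AD connection.
AZURE_PLATFORM_IP = "168.63.129.16"


def preflight(ad_dns_ips, forwarder_ips, anf_subnet, routing_intent_prefixes):
    """Return a list of findings; an empty list means none of the rules below was violated."""
    findings = []
    forwarders = set(forwarder_ips)
    for ip in ad_dns_ips:
        if ip == AZURE_PLATFORM_IP:
            findings.append(f"{ip}: Azure platform IP used as an AD connection DNS IP")
        elif ip in forwarders:
            findings.append(f"{ip}: forwarder IP used as an AD connection DNS IP")

    # The delegated subnet must be listed exactly; a broader prefix does not count.
    subnet = ipaddress.ip_network(anf_subnet)
    listed = {ipaddress.ip_network(prefix) for prefix in routing_intent_prefixes}
    if subnet not in listed:
        findings.append(f"{subnet}: ANF delegated subnet not listed as an additional prefix")
    return findings
```

&lt;P&gt;The exact-match test on the subnet reflects the requirement that the broader VNET prefix alone is not sufficient. The remaining requirements, such as the reverse ruleset and the _kerberos-master SRV records, still need the DNS checks described earlier.&lt;/P&gt;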
&lt;P aria-level="1"&gt;&lt;STRONG&gt;&lt;SPAN data-contrast="none"&gt;&lt;SPAN data-ccp-parastyle="heading 1"&gt;Related documentation:&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;/STRONG&gt;&lt;/P&gt;
&lt;UL&gt;
&lt;LI aria-setsize="-1" data-leveltext="•" data-font="" data-listid="2" data-list-defn-props="{&amp;quot;335552541&amp;quot;:1,&amp;quot;335559685&amp;quot;:720,&amp;quot;335559991&amp;quot;:360,&amp;quot;469769242&amp;quot;:[8226],&amp;quot;469777803&amp;quot;:&amp;quot;left&amp;quot;,&amp;quot;469777804&amp;quot;:&amp;quot;•&amp;quot;,&amp;quot;469777815&amp;quot;:&amp;quot;hybridMultilevel&amp;quot;}" data-aria-posinset="5" data-aria-level="1"&gt;&lt;SPAN data-ccp-props="{&amp;quot;335559738&amp;quot;:40,&amp;quot;335559739&amp;quot;:40}"&gt;&lt;A class="lia-external-url" href="https://learn.microsoft.com/azure/azure-netapp-files/domain-name-system-concept" target="_blank" rel="noopener"&gt;Understand Domain Name Systems in Azure NetApp Files&lt;/A&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/LI&gt;
&lt;LI aria-setsize="-1" data-leveltext="•" data-font="" data-listid="2" data-list-defn-props="{&amp;quot;335552541&amp;quot;:1,&amp;quot;335559685&amp;quot;:720,&amp;quot;335559991&amp;quot;:360,&amp;quot;469769242&amp;quot;:[8226],&amp;quot;469777803&amp;quot;:&amp;quot;left&amp;quot;,&amp;quot;469777804&amp;quot;:&amp;quot;•&amp;quot;,&amp;quot;469777815&amp;quot;:&amp;quot;hybridMultilevel&amp;quot;}" data-aria-posinset="5" data-aria-level="1"&gt;&lt;SPAN data-ccp-props="{&amp;quot;335559738&amp;quot;:40,&amp;quot;335559739&amp;quot;:40}"&gt;&lt;A class="lia-external-url" href="https://learn.microsoft.com/azure/azure-netapp-files/azure-netapp-files-network-topologies" target="_blank" rel="noopener"&gt;Guidelines for Azure NetApp Files network planning&lt;/A&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/LI&gt;
&lt;LI aria-setsize="-1" data-leveltext="•" data-font="" data-listid="2" data-list-defn-props="{&amp;quot;335552541&amp;quot;:1,&amp;quot;335559685&amp;quot;:720,&amp;quot;335559991&amp;quot;:360,&amp;quot;469769242&amp;quot;:[8226],&amp;quot;469777803&amp;quot;:&amp;quot;left&amp;quot;,&amp;quot;469777804&amp;quot;:&amp;quot;•&amp;quot;,&amp;quot;469777815&amp;quot;:&amp;quot;hybridMultilevel&amp;quot;}" data-aria-posinset="5" data-aria-level="1"&gt;&lt;A class="lia-external-url" href="https://learn.microsoft.com/en-us/azure/azure-netapp-files/configure-virtual-wan" target="_blank" rel="noopener"&gt;&lt;SPAN data-ccp-props="{&amp;quot;335559738&amp;quot;:40,&amp;quot;335559739&amp;quot;:40}"&gt;Configure Virtual WAN for Azure NetApp Files&amp;nbsp;&lt;/SPAN&gt;&lt;/A&gt;&lt;/LI&gt;
&lt;LI aria-setsize="-1" data-leveltext="•" data-font="" data-listid="2" data-list-defn-props="{&amp;quot;335552541&amp;quot;:1,&amp;quot;335559685&amp;quot;:720,&amp;quot;335559991&amp;quot;:360,&amp;quot;469769242&amp;quot;:[8226],&amp;quot;469777803&amp;quot;:&amp;quot;left&amp;quot;,&amp;quot;469777804&amp;quot;:&amp;quot;•&amp;quot;,&amp;quot;469777815&amp;quot;:&amp;quot;hybridMultilevel&amp;quot;}" data-aria-posinset="5" data-aria-level="1"&gt;&lt;SPAN data-ccp-props="{&amp;quot;335559738&amp;quot;:40,&amp;quot;335559739&amp;quot;:40}"&gt;&lt;A class="lia-external-url" href="https://learn.microsoft.com/azure/azure-netapp-files/create-active-directory-connections" target="_blank" rel="noopener"&gt;Create and manage Active Directory connections for Azure NetApp Files&lt;/A&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/LI&gt;
&lt;LI aria-setsize="-1" data-leveltext="•" data-font="" data-listid="2" data-list-defn-props="{&amp;quot;335552541&amp;quot;:1,&amp;quot;335559685&amp;quot;:720,&amp;quot;335559991&amp;quot;:360,&amp;quot;469769242&amp;quot;:[8226],&amp;quot;469777803&amp;quot;:&amp;quot;left&amp;quot;,&amp;quot;469777804&amp;quot;:&amp;quot;•&amp;quot;,&amp;quot;469777815&amp;quot;:&amp;quot;hybridMultilevel&amp;quot;}" data-aria-posinset="5" data-aria-level="1"&gt;&lt;SPAN data-ccp-props="{&amp;quot;335559738&amp;quot;:40,&amp;quot;335559739&amp;quot;:40}"&gt;&lt;A class="lia-external-url" href="https://learn.microsoft.com/azure/dns/private-reverse-dns" target="_blank" rel="noopener"&gt;Create and manage reverse DNS zones in Azure Private DNS&lt;/A&gt;&lt;/SPAN&gt;&lt;/LI&gt;
&lt;LI aria-setsize="-1" data-leveltext="•" data-font="" data-listid="2" data-list-defn-props="{&amp;quot;335552541&amp;quot;:1,&amp;quot;335559685&amp;quot;:720,&amp;quot;335559991&amp;quot;:360,&amp;quot;469769242&amp;quot;:[8226],&amp;quot;469777803&amp;quot;:&amp;quot;left&amp;quot;,&amp;quot;469777804&amp;quot;:&amp;quot;•&amp;quot;,&amp;quot;469777815&amp;quot;:&amp;quot;hybridMultilevel&amp;quot;}" data-aria-posinset="5" data-aria-level="1"&gt;&lt;SPAN data-ccp-props="{&amp;quot;335559738&amp;quot;:40,&amp;quot;335559739&amp;quot;:40}"&gt;&lt;A class="lia-external-url" href="https://learn.microsoft.com/azure/azure-netapp-files/understand-guidelines-active-directory-domain-service-site" target="_blank" rel="noopener"&gt;Understand guidelines for Active Directory Domain Services site design and planning&lt;/A&gt;&lt;/SPAN&gt;&lt;/LI&gt;
&lt;LI aria-setsize="-1" data-leveltext="•" data-font="" data-listid="2" data-list-defn-props="{&amp;quot;335552541&amp;quot;:1,&amp;quot;335559685&amp;quot;:720,&amp;quot;335559991&amp;quot;:360,&amp;quot;469769242&amp;quot;:[8226],&amp;quot;469777803&amp;quot;:&amp;quot;left&amp;quot;,&amp;quot;469777804&amp;quot;:&amp;quot;•&amp;quot;,&amp;quot;469777815&amp;quot;:&amp;quot;hybridMultilevel&amp;quot;}" data-aria-posinset="5" data-aria-level="1"&gt;&lt;SPAN data-ccp-props="{&amp;quot;335559738&amp;quot;:40,&amp;quot;335559739&amp;quot;:40}"&gt;&lt;A class="lia-external-url" href="https://learn.microsoft.com/en-us/azure/virtual-wan/how-to-routing-policies" target="_blank" rel="noopener"&gt;How to configure Virtual WAN Hub routing policies&lt;/A&gt;&lt;/SPAN&gt;&lt;/LI&gt;
&lt;LI aria-setsize="-1" data-leveltext="•" data-font="" data-listid="2" data-list-defn-props="{&amp;quot;335552541&amp;quot;:1,&amp;quot;335559685&amp;quot;:720,&amp;quot;335559991&amp;quot;:360,&amp;quot;469769242&amp;quot;:[8226],&amp;quot;469777803&amp;quot;:&amp;quot;left&amp;quot;,&amp;quot;469777804&amp;quot;:&amp;quot;•&amp;quot;,&amp;quot;469777815&amp;quot;:&amp;quot;hybridMultilevel&amp;quot;}" data-aria-posinset="5" data-aria-level="1"&gt;&lt;A class="lia-external-url" href="https://learn.microsoft.com/azure/virtual-network/what-is-ip-address-168-63-129-16" target="_blank" rel="noopener"&gt;What is IP address 168.63.129.16?&lt;/A&gt;&lt;/LI&gt;
&lt;/UL&gt;</description>
      <pubDate>Mon, 11 May 2026 04:20:10 GMT</pubDate>
      <guid>https://techcommunity.microsoft.com/t5/azure-architecture-blog/configure-dns-forwarding-for-azure-netapp-files/ba-p/4516381</guid>
      <dc:creator>mkachare</dc:creator>
      <dc:date>2026-05-11T04:20:10Z</dc:date>
    </item>
    <item>
      <title>Governing Agent Sprawl: A Multi‑Region AI Agent Landing Zone on Azure (Reference Architecture)</title>
      <link>https://techcommunity.microsoft.com/t5/azure-architecture-blog/governing-agent-sprawl-a-multi-region-ai-agent-landing-zone-on/ba-p/4516036</link>
      <description>&lt;P&gt;It doesn’t take long for AI agents to get out of hand.&lt;/P&gt;
&lt;P&gt;In most enterprises, the first few agents are celebrated. A chatbot here. A document summarizer there. Then another team ships an agent that calls APIs. Someone else connects one to internal data. Within months, IT is staring at dozens—or hundreds—of autonomous systems running across subscriptions, regions, and tools.&lt;/P&gt;
&lt;P&gt;At that point, the questions stop being about model quality and start being uncomfortable operational ones:&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;&lt;EM&gt;Who owns this agent?&lt;/EM&gt;&lt;/LI&gt;
&lt;LI&gt;&lt;EM&gt;What data can it access?&lt;/EM&gt;&lt;/LI&gt;
&lt;LI&gt;&lt;EM&gt;What happens if it misbehaves?&lt;/EM&gt;&lt;/LI&gt;
&lt;LI&gt;&lt;EM&gt;Why did it just consume half our monthly token budget in a day?&lt;/EM&gt;&lt;/LI&gt;
&lt;/UL&gt;
&lt;P&gt;Developers can build an AI agent in minutes—the difficult part is understanding what agents are doing, how they perform, and whether they comply with organizational policy. Signals scatter across tools, context is lost, and governance becomes reactive.&lt;/P&gt;
&lt;P&gt;This reference architecture exists to solve that problem.&lt;/P&gt;
&lt;P&gt;It describes a &lt;STRONG&gt;multi‑region AI agent landing zone on Azure&lt;/STRONG&gt; that treats agents as first‑class, governable workloads—provisioned automatically, constrained by policy, and observable from day one.&lt;/P&gt;
&lt;H2&gt;The architectural principle: separate control from execution&lt;/H2&gt;
&lt;P&gt;The design starts with a simple but non‑negotiable rule:&lt;/P&gt;
&lt;P&gt;&lt;STRONG&gt;Control plane concerns must be separated from runtime concerns.&lt;/STRONG&gt;&lt;/P&gt;
&lt;P&gt;Azure landing zones already follow this model. Management groups, Azure Policy, and RBAC are global constructs. Workloads run in regions. This architecture applies the same discipline to AI agents.&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;The &lt;STRONG&gt;runtime plane&lt;/STRONG&gt; is where agents execute, models infer, and data flows—often in multiple Azure regions.&lt;/LI&gt;
&lt;LI&gt;The &lt;STRONG&gt;control plane&lt;/STRONG&gt; is where identity, policy, safety, evaluation, and oversight live—independent of region.&lt;/LI&gt;
&lt;/UL&gt;
&lt;P&gt;This separation is what allows teams to scale agents without losing control.&lt;/P&gt;
&lt;img /&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;H2&gt;Layer 1: Azure AI Gateway — governing every request&lt;/H2&gt;
&lt;P&gt;The first control layer sits directly in the request path.&lt;/P&gt;
&lt;P&gt;The &lt;STRONG&gt;AI gateway in Azure API Management&lt;/STRONG&gt; provides a policy‑enforcement and observability layer in front of AI models, agents, and tools. It is not a separate service—it extends Azure API Management.&lt;/P&gt;
&lt;P&gt;Everything flows through it:&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;Microsoft Foundry model deployments&lt;/LI&gt;
&lt;LI&gt;Azure AI Model Inference API endpoints&lt;/LI&gt;
&lt;LI&gt;OpenAI‑compatible third‑party models&lt;/LI&gt;
&lt;LI&gt;Self‑hosted models&lt;/LI&gt;
&lt;LI&gt;MCP servers and A2A agent APIs (preview)&lt;/LI&gt;
&lt;/UL&gt;
&lt;H3&gt;What the gateway actually enforces&lt;/H3&gt;
&lt;P&gt;This layer is intentionally narrow and operational:&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;&lt;STRONG&gt;Token quotas and rate limits&lt;/STRONG&gt;&lt;BR /&gt;The llm-token-limit policy (GA) enforces tokens‑per‑minute or quota ceilings per consumer before requests reach the backend. This prevents one application—or one agent—from exhausting shared capacity.&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;Content safety at ingress&lt;/STRONG&gt;&lt;BR /&gt;The llm-content-safety policy (GA) integrates Azure AI Content Safety to moderate prompts automatically. Unsafe requests never reach the model.&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;Traffic routing and resiliency&lt;/STRONG&gt;&lt;BR /&gt;Azure API Management supports multi‑region gateway deployment (Premium tier). If a region fails, traffic routes to the next closest gateway automatically.&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;Observability&lt;/STRONG&gt;&lt;BR /&gt;Token usage, prompts, and completions are logged to Azure Monitor and Application Insights using built‑in policies such as llm-emit-token-metric.&lt;/LI&gt;
&lt;/UL&gt;
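&lt;P&gt;To make the quota idea concrete, here is a minimal Python sketch of a per-consumer tokens-per-minute ceiling. It illustrates the concept only and is not how the llm-token-limit policy is implemented; the class name, the fixed one-minute window, and the consumer keys are assumptions made for this example.&lt;/P&gt;

```python
from collections import defaultdict


class TokenBudget:
    """Hypothetical fixed-window limiter: one token counter per consumer per minute."""

    def __init__(self, tokens_per_minute):
        self.tokens_per_minute = tokens_per_minute
        # (consumer, minute index) -> tokens already admitted in that minute
        self.used = defaultdict(int)

    def allow(self, consumer, tokens, now_seconds):
        """Admit the request only if it fits in the consumer's current window."""
        window = (consumer, int(now_seconds // 60))
        if self.used[window] + tokens > self.tokens_per_minute:
            return False  # reject before the request reaches the backend
        self.used[window] += tokens
        return True


budget = TokenBudget(tokens_per_minute=1000)
first = budget.allow("agent-a", 800, now_seconds=0)         # fits within the ceiling
second = budget.allow("agent-a", 300, now_seconds=30)       # would exceed 1000 in the same minute
other = budget.allow("agent-b", 300, now_seconds=30)        # separate consumer, separate budget
next_minute = budget.allow("agent-a", 300, now_seconds=61)  # new window for agent-a
```

&lt;P&gt;Keying the counter by consumer is what keeps one busy agent from draining capacity that other agents share.&lt;/P&gt;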
&lt;P&gt;The gateway does &lt;EM&gt;not&lt;/EM&gt; understand agent intent or business context. That is by design. It governs traffic, not behavior.&lt;/P&gt;
&lt;H2&gt;Layer 2: Azure AI Foundry Control Plane — governing behavior at scale&lt;/H2&gt;
&lt;P&gt;The second layer governs &lt;STRONG&gt;what agents do&lt;/STRONG&gt;, not just how requests flow.&lt;/P&gt;
&lt;P&gt;&lt;STRONG&gt;Azure AI Foundry Control Plane&lt;/STRONG&gt; provides a unified management surface for AI agents, models, and tools across projects and subscriptions. It is designed specifically for agentic systems.&lt;/P&gt;
&lt;P&gt;Foundry Control Plane is currently in &lt;STRONG&gt;public preview&lt;/STRONG&gt;.&lt;/P&gt;
&lt;H3&gt;What Foundry Control Plane adds&lt;/H3&gt;
&lt;UL&gt;
&lt;LI&gt;&lt;STRONG&gt;Fleet‑wide inventory&lt;/STRONG&gt;&lt;BR /&gt;Every agent, model, and tool appears in a single, searchable view across projects.&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;Continuous evaluation on production traffic&lt;/STRONG&gt;&lt;BR /&gt;Foundry runs evaluations that measure task adherence, groundedness, tool‑call accuracy, sensitive data exposure, and other agent‑specific risk dimensions.&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;Centralized guardrails&lt;/STRONG&gt;&lt;BR /&gt;Policy is enforced across inputs, outputs, and tool interactions—not just prompts. Bulk remediation can be applied across the fleet.&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;Security integration&lt;/STRONG&gt;&lt;BR /&gt;Foundry integrates with:
&lt;UL&gt;
&lt;LI&gt;&lt;STRONG&gt;Microsoft Entra&lt;/STRONG&gt; for agent identity (Entra Agent ID)&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;Microsoft Defender&lt;/STRONG&gt; for threat signals&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;Microsoft Purview&lt;/STRONG&gt; for data protection and compliance visibility&lt;/LI&gt;
&lt;/UL&gt;
&lt;/LI&gt;
&lt;/UL&gt;
&lt;P&gt;Foundry Control Plane also requires an &lt;STRONG&gt;AI Gateway to be configured&lt;/STRONG&gt; for advanced governance scenarios—reinforcing the layered approach.&lt;/P&gt;
&lt;H2&gt;Layer 3: Microsoft Agent 365 — enterprise oversight, not just Azure oversight&lt;/H2&gt;
&lt;P&gt;The third layer exists because Azure governance alone is not enough.&lt;/P&gt;
&lt;P&gt;Agents don’t just call APIs. They act on behalf of users. They access enterprise data. They operate inside Microsoft 365 workflows.&lt;/P&gt;
&lt;P&gt;&lt;STRONG&gt;Microsoft Agent 365&lt;/STRONG&gt; is the &lt;STRONG&gt;tenant‑level control plane for AI agents&lt;/STRONG&gt;. It brings agents under the same administrative model used for users and applications.&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;Status: &lt;STRONG&gt;Frontier Preview&lt;/STRONG&gt;&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;General availability: May 1, 2026&lt;/STRONG&gt;&lt;/LI&gt;
&lt;/UL&gt;
&lt;H3&gt;Why this layer matters&lt;/H3&gt;
&lt;P&gt;Agent 365 introduces controls that Azure alone cannot provide:&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;&lt;STRONG&gt;Agent registry&lt;/STRONG&gt;&lt;BR /&gt;A single inventory of all agents in the tenant—including sanctioned and shadow agents. Unsanctioned agents can be quarantined.&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;Identity‑first access control&lt;/STRONG&gt;&lt;BR /&gt;Every agent is issued an Entra Agent ID. Conditional Access policies apply to agents the same way they do to users.&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;Human‑in‑the‑loop oversight&lt;/STRONG&gt;&lt;BR /&gt;Agents surface in Microsoft 365 admin workflows, not just Azure portals.&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;Security and compliance&lt;/STRONG&gt;&lt;BR /&gt;Defender and Purview extend threat detection and data protection policies to agent activity.&lt;/LI&gt;
&lt;/UL&gt;
&lt;P&gt;Agent 365 does not replace Foundry Control Plane. It complements it—connecting agent operations to enterprise identity, compliance, and productivity systems.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;img /&gt;
&lt;H2&gt;How the pieces work together&lt;/H2&gt;
&lt;P&gt;Individually, these services are powerful. The architecture works because they are &lt;STRONG&gt;deliberately layered&lt;/STRONG&gt;.&lt;/P&gt;
&lt;H3&gt;External approval → automated provisioning&lt;/H3&gt;
&lt;P&gt;When a use case is approved in an external governance system, it triggers an Azure DevOps pipeline using the REST API.&lt;/P&gt;
&lt;P&gt;That pipeline:&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;Provisions subscriptions and resource groups&lt;/LI&gt;
&lt;LI&gt;Deploys Foundry projects&lt;/LI&gt;
&lt;LI&gt;Configures Azure API Management with AI Gateway policies&lt;/LI&gt;
&lt;LI&gt;Enables monitoring and logging&lt;/LI&gt;
&lt;/UL&gt;
&lt;P&gt;Governance is applied &lt;EM&gt;before&lt;/EM&gt; the first request is made.&lt;/P&gt;
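&lt;P&gt;As a rough illustration of the approval-to-pipeline handoff, the sketch below builds the HTTP request an external governance system might send to queue a pipeline run. The endpoint shape follows the Azure DevOps Run Pipeline REST API; the api-version value, the organization, project, pipeline ID, token, and template parameter names are placeholders assumed for this example, and the request is built but not sent.&lt;/P&gt;

```python
import base64
import json
import urllib.request


def build_pipeline_run_request(organization, project, pipeline_id, pat, parameters):
    """Build (but do not send) a POST request that queues a pipeline run."""
    url = (
        f"https://dev.azure.com/{organization}/{project}"
        f"/_apis/pipelines/{pipeline_id}/runs?api-version=7.1"
    )
    # Personal access tokens use HTTP Basic auth with an empty user name.
    token = base64.b64encode(f":{pat}".encode()).decode()
    body = json.dumps({"templateParameters": parameters}).encode()
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={
            "Authorization": f"Basic {token}",
            "Content-Type": "application/json",
        },
    )


# All values below are placeholders for illustration.
request = build_pipeline_run_request(
    organization="contoso",
    project="agent-landing-zone",
    pipeline_id=42,
    pat="example-token",
    parameters={"useCaseId": "UC-001", "region": "westeurope"},
)
# urllib.request.urlopen(request) would submit the run; it is left out here.
```

&lt;P&gt;Passing the approved use case and target region as template parameters lets a single pipeline definition serve every approval, which keeps provisioning consistent across regions.&lt;/P&gt;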
&lt;H3&gt;One policy model, many regions&lt;/H3&gt;
&lt;P&gt;Azure landing zones are region‑agnostic at the governance layer. This architecture follows that guidance.&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;Policies and RBAC apply globally&lt;/LI&gt;
&lt;LI&gt;AI Gateway enforces limits locally in each region&lt;/LI&gt;
&lt;LI&gt;Runtime services scale region by region&lt;/LI&gt;
&lt;/UL&gt;
&lt;P&gt;Expanding to a new region does not introduce a new governance model—only new capacity.&lt;/P&gt;
&lt;H3&gt;A single operational view&lt;/H3&gt;
&lt;P&gt;Signals flow upward:&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;&lt;STRONG&gt;AI Gateway&lt;/STRONG&gt; emits traffic and usage metrics&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;Foundry Control Plane&lt;/STRONG&gt; correlates evaluations, guardrail enforcement, and security alerts&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;Agent 365&lt;/STRONG&gt; aggregates tenant‑level identity, compliance, and threat signals&lt;/LI&gt;
&lt;/UL&gt;
&lt;P&gt;Operations teams no longer hunt across dashboards. They work from one prioritized view, with context intact.&lt;/P&gt;
&lt;H2&gt;What this architecture deliberately does &lt;EM&gt;not&lt;/EM&gt; promise&lt;/H2&gt;
&lt;P&gt;This is a reference architecture, not a silver bullet.&lt;/P&gt;
&lt;P&gt;It does not eliminate the need for:&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;Clear agent ownership&lt;/LI&gt;
&lt;LI&gt;Business‑level approval processes&lt;/LI&gt;
&lt;LI&gt;Ongoing evaluation of agent usefulness&lt;/LI&gt;
&lt;/UL&gt;
&lt;P&gt;What it does provide is a &lt;STRONG&gt;foundation&lt;/STRONG&gt;—one that lets organizations scale agentic AI without accepting chaos as the cost of innovation.&lt;/P&gt;
&lt;H2&gt;Closing thoughts&lt;/H2&gt;
&lt;P&gt;Agent sprawl is not a tooling failure. It’s an architectural one.&lt;/P&gt;
&lt;P&gt;By separating control from execution, layering governance where it belongs, and aligning AI operations with existing Azure and Microsoft 365 control planes, this architecture gives enterprises a way to move fast &lt;EM&gt;without losing sight of what their agents are doing&lt;/EM&gt;.&lt;/P&gt;
&lt;P&gt;That’s the difference between experimentation—and production.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;H2&gt;Co-contributor&lt;/H2&gt;
&lt;P&gt;Jorge Pena Alarcon, Sr. Cloud &amp;amp; AI Specialist&lt;/P&gt;
&lt;H3&gt;References (official Microsoft sources)&lt;/H3&gt;
&lt;UL&gt;
&lt;LI&gt;&lt;A class="lia-external-url" href="https://learn.microsoft.com/azure/api-management/genai-gateway-capabilities" target="_blank"&gt;Azure AI Gateway in Azure API Management&lt;/A&gt;&lt;/LI&gt;
&lt;LI&gt;&lt;A class="lia-external-url" href="https://learn.microsoft.com/azure/foundry/configuration/enable-ai-api-management-gateway-portal" target="_blank"&gt;Configure AI Gateway for Foundry&lt;/A&gt;&lt;/LI&gt;
&lt;LI&gt;&lt;A class="lia-external-url" href="https://learn.microsoft.com/azure/foundry/control-plane/overview" target="_blank"&gt;Foundry Control Plane overview&lt;/A&gt;&lt;/LI&gt;
&lt;LI&gt;&lt;A class="lia-external-url" href="https://www.microsoft.com/microsoft-365/blog/2025/11/18/microsoft-agent-365-the-control-plane-for-ai-agents/" target="_blank"&gt;Microsoft Agent 365 announcement&lt;/A&gt;&lt;/LI&gt;
&lt;LI&gt;&lt;A class="lia-internal-link lia-internal-url lia-internal-url-content-type-forum" href="https://techcommunity.microsoft.com/discussions/agent-365-discussions/agent-365-will-be-generally-available-on-may-1-2026/4500380" data-lia-auto-title="Agent 365 GA annoucement" data-lia-auto-title-active="0" target="_blank"&gt;Agent 365 GA announcement&lt;/A&gt;&lt;/LI&gt;
&lt;LI&gt;&lt;A class="lia-external-url" href="https://learn.microsoft.com/azure/cloud-adoption-framework/ready/considerations/regions" target="_blank"&gt;Azure landing zones and regions&lt;/A&gt;&lt;/LI&gt;
&lt;LI&gt;&lt;A class="lia-external-url" href="https://learn.microsoft.com/rest/api/azure/devops/pipelines/runs/run-pipeline" target="_blank"&gt;Azure DevOps pipeline REST API&lt;/A&gt;&lt;/LI&gt;
&lt;/UL&gt;</description>
      <pubDate>Thu, 07 May 2026 05:19:02 GMT</pubDate>
      <guid>https://techcommunity.microsoft.com/t5/azure-architecture-blog/governing-agent-sprawl-a-multi-region-ai-agent-landing-zone-on/ba-p/4516036</guid>
      <dc:creator>KimVaddi</dc:creator>
      <dc:date>2026-05-07T05:19:02Z</dc:date>
    </item>
    <item>
      <title>Architecture to Resilience: A Decision Guide</title>
      <link>https://techcommunity.microsoft.com/t5/azure-architecture-blog/architecture-to-resilience-a-decision-guide/ba-p/4516552</link>
      <description>&lt;H2&gt;Start with the framework, accelerate with the tool&lt;/H2&gt;
&lt;P&gt;&lt;A href="https://youtu.be/2UlnJGTGHY4" target="_blank" rel="noopener"&gt;Watch the video walkthrough&lt;/A&gt;&lt;/P&gt;
&lt;P&gt;The &lt;STRONG&gt;Application Resilience Framework&lt;/STRONG&gt; originated from a practical gap we saw in resilience reviews: teams had architecture diagrams, monitoring data, incident history, and runbooks, but no consistent way to connect them into a measurable resilience model.&lt;/P&gt;
&lt;P&gt;The framework is intended to close that gap by turning architecture context into a structured lifecycle for &lt;STRONG&gt;risk identification, mitigation validation, health modeling, and governance&lt;/STRONG&gt;. It aligns closely with the Reliability pillar of the &lt;STRONG&gt;Azure Well-Architected Framework&lt;/STRONG&gt;, especially the guidance around identifying critical flows, performing Failure Mode Analysis, defining reliability targets, and building health models.&lt;/P&gt;
&lt;img&gt;&lt;EM&gt;Application Resilience Framework flow from artifact import to measurable operational resilience.&lt;/EM&gt;&lt;/img&gt;
&lt;P&gt;The&amp;nbsp;&lt;STRONG&gt;Application Resilience Framework Tool&lt;/STRONG&gt; helps teams apply this framework faster by starting with artifacts they already have, such as data flow diagrams or sequence diagrams in Mermaid or image format.&lt;/P&gt;
&lt;P data-start="564" data-end="826"&gt;From those artifacts, the tool creates the first version of a resilience model by extracting workflows, application components, platform components, dependencies, and initial failure modes. It then guides the team through one import step followed by &lt;STRONG&gt;four phases&lt;/STRONG&gt; of decisions that make resilience measurable:&lt;/P&gt;
&lt;P&gt;Import Artifacts -&amp;gt; &lt;STRONG&gt;Phase 1&lt;/STRONG&gt;: Failure Mode Analysis -&amp;gt; &lt;STRONG&gt;Phase 2&lt;/STRONG&gt;: Mitigation and Validation -&amp;gt; &lt;STRONG&gt;Phase 3&lt;/STRONG&gt;: Health Model Mapping -&amp;gt; &lt;STRONG&gt;Phase 4&lt;/STRONG&gt;: Operations and Governance&lt;/P&gt;
&lt;P&gt;It is not a replacement for Azure Well-Architected Framework guidance or Resilience Hub-style assessments. It is a practical way to operationalize those concepts at the workload and workflow level, producing prioritized risks, mitigation plans, validation paths, health signals, dashboards, reports, and governance ownership.&lt;/P&gt;
&lt;H2&gt;How to use this guide&lt;/H2&gt;
&lt;P&gt;This guide follows the same flow as the tool. For each step, it covers:&lt;/P&gt;
&lt;OL&gt;
&lt;LI&gt;&lt;STRONG&gt;The decision:&lt;/STRONG&gt; What needs to be decided?&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;The options:&lt;/STRONG&gt; What paths are available?&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;The guidance:&lt;/STRONG&gt; When each option fits&lt;/LI&gt;
&lt;/OL&gt;
&lt;P&gt;Use this with the video walkthrough. The video shows the tool in action. This guide explains the choices behind each step.&lt;/P&gt;
&lt;H2&gt;Question 1: What artifact should you import first?&lt;/H2&gt;
&lt;P&gt;The import step creates the starting point for the model. Regardless of the input path, the output is the same: workflows that move into &lt;STRONG&gt;Phase 1: Failure Mode Analysis&lt;/STRONG&gt;.&lt;/P&gt;
&lt;H3&gt;Options&lt;/H3&gt;
&lt;DIV class="styles_lia-table-wrapper__h6Xo9 styles_table-responsive__MW0lN"&gt;&lt;table class="lia-border-style-double" border="1" style="width: 100%; height: 290px; border-width: 1px;"&gt;&lt;tbody&gt;&lt;tr style="height: 39px;"&gt;&lt;td style="height: 39px;"&gt;
&lt;P&gt;&lt;STRONG&gt;Import option&lt;/STRONG&gt;&lt;/P&gt;
&lt;/td&gt;&lt;td style="height: 39px;"&gt;
&lt;P&gt;&lt;STRONG&gt;Best for&lt;/STRONG&gt;&lt;/P&gt;
&lt;/td&gt;&lt;td style="height: 39px;"&gt;
&lt;P&gt;&lt;STRONG&gt;What happens&lt;/STRONG&gt;&lt;/P&gt;
&lt;/td&gt;&lt;/tr&gt;&lt;tr style="height: 67px;"&gt;&lt;td style="height: 67px;"&gt;
&lt;P&gt;&lt;STRONG&gt;Data flow diagram&lt;/STRONG&gt;&lt;/P&gt;
&lt;/td&gt;&lt;td style="height: 67px;"&gt;
&lt;P&gt;System, module, data movement, and dependency views&lt;/P&gt;
&lt;/td&gt;&lt;td style="height: 67px;"&gt;
&lt;P&gt;If imported as an image, the tool breaks it into sequence-style flows. Selected flows become workflows.&lt;/P&gt;
&lt;/td&gt;&lt;/tr&gt;&lt;tr style="height: 67px;"&gt;&lt;td style="height: 67px;"&gt;
&lt;P&gt;&lt;STRONG&gt;Sequence diagram&lt;/STRONG&gt;&lt;/P&gt;
&lt;/td&gt;&lt;td style="height: 67px;"&gt;
&lt;P&gt;Transaction flow and service interaction views&lt;/P&gt;
&lt;/td&gt;&lt;td style="height: 67px;"&gt;
&lt;P&gt;Converted directly into workflows.&lt;/P&gt;
&lt;/td&gt;&lt;/tr&gt;&lt;tr style="height: 39px;"&gt;&lt;td style="height: 39px;"&gt;
&lt;P&gt;&lt;STRONG&gt;Mermaid input&lt;/STRONG&gt;&lt;/P&gt;
&lt;/td&gt;&lt;td style="height: 39px;"&gt;
&lt;P&gt;Diagrams maintained as code in Mermaid format&lt;/P&gt;
&lt;/td&gt;&lt;td style="height: 39px;"&gt;
&lt;P&gt;Converted directly into workflows.&lt;/P&gt;
&lt;/td&gt;&lt;/tr&gt;&lt;tr style="height: 39px;"&gt;&lt;td style="height: 39px;"&gt;
&lt;P&gt;&lt;STRONG&gt;Image input&lt;/STRONG&gt;&lt;/P&gt;
&lt;/td&gt;&lt;td style="height: 39px;"&gt;
&lt;P&gt;JPG or PNG diagrams&lt;/P&gt;
&lt;/td&gt;&lt;td style="height: 39px;"&gt;
&lt;P&gt;Azure Foundry Vision models interpret the image and convert it into workflows.&lt;/P&gt;
&lt;/td&gt;&lt;/tr&gt;&lt;tr style="height: 39px;"&gt;&lt;td style="height: 39px;"&gt;
&lt;P&gt;&lt;STRONG&gt;Manual entry&lt;/STRONG&gt;&lt;/P&gt;
&lt;/td&gt;&lt;td style="height: 39px;"&gt;
&lt;P&gt;Missing or incomplete diagrams&lt;/P&gt;
&lt;/td&gt;&lt;td style="height: 39px;"&gt;
&lt;P&gt;User creates or corrects workflows manually.&lt;/P&gt;
&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;colgroup&gt;&lt;col style="width: 33.33%" /&gt;&lt;col style="width: 33.33%" /&gt;&lt;col style="width: 33.33%" /&gt;&lt;/colgroup&gt;&lt;/table&gt;&lt;/DIV&gt;
&lt;H3&gt;When to pick which&lt;/H3&gt;
&lt;P&gt;Use &lt;STRONG&gt;data flow&lt;/STRONG&gt; for system and dependency views. Use &lt;STRONG&gt;sequence diagrams&lt;/STRONG&gt; for transaction or interaction views. Regardless of import path, the output is the same: workflows, components, dependencies, and initial failure modes ready for Phase 1.&lt;/P&gt;
&lt;H2&gt;Question 2: Which workflows should be analyzed first?&lt;/H2&gt;
&lt;P&gt;Phase 1 is &lt;STRONG&gt;Failure Mode Analysis&lt;/STRONG&gt;. This is where the tool identifies what can fail and how important each failure is.&lt;/P&gt;
&lt;H3&gt;Options&lt;/H3&gt;
&lt;UL&gt;
&lt;LI&gt;&lt;STRONG&gt;Critical user flows:&lt;/STRONG&gt; Login, checkout, payment, onboarding, request processing.&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;High-risk platform flows:&lt;/STRONG&gt; Database writes, queue processing, storage access, identity, messaging, external APIs.&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;Known issue areas:&lt;/STRONG&gt; Workflows with recent incidents, recurring alerts, or customer impact.&lt;/LI&gt;
&lt;/UL&gt;
&lt;H3&gt;When to pick which&lt;/H3&gt;
&lt;P&gt;Start where failure creates the highest customer or business impact. The goal is not to model everything at once. The goal is to model the right thing first.&lt;/P&gt;
&lt;H3&gt;Deliverables&lt;/H3&gt;
&lt;UL&gt;
&lt;LI&gt;Failure Mode Analysis catalog&lt;/LI&gt;
&lt;LI&gt;Risk Priority Value (RPV) scores&lt;/LI&gt;
&lt;LI&gt;Criticality classification&lt;/LI&gt;
&lt;/UL&gt;
&lt;H2&gt;Question 3: How should failure modes be prioritized?&lt;/H2&gt;
&lt;P&gt;After workflows and components are imported, the tool helps score each failure mode using a &lt;STRONG&gt;Risk Priority Value&lt;/STRONG&gt; (RPV), which combines four factors: impact, likelihood, detectability, and outage severity.&lt;/P&gt;
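&lt;P&gt;The post does not state how the four factors are combined, so the following Python sketch is illustrative only. It assumes each factor is rated from 1 to 5 and that the ratings are multiplied, in the style of an FMEA risk priority number; the tool's actual scale and formula may differ, and the failure modes shown are hypothetical.&lt;/P&gt;

```python
from dataclasses import dataclass


@dataclass
class FailureMode:
    name: str
    impact: int           # 1 (minor) to 5 (severe); scale is an assumption
    likelihood: int       # 1 (rare) to 5 (frequent)
    detectability: int    # 1 (easy to detect) to 5 (hard to detect)
    outage_severity: int  # 1 (brief) to 5 (prolonged)

    def rpv(self):
        """Assumed formula: product of the four factor ratings."""
        return self.impact * self.likelihood * self.detectability * self.outage_severity


# Hypothetical failure modes for a checkout workflow.
catalog = [
    FailureMode("Payment API timeout", impact=5, likelihood=3, detectability=2, outage_severity=4),
    FailureMode("Cache node eviction", impact=2, likelihood=4, detectability=1, outage_severity=1),
    FailureMode("Database failover stall", impact=5, likelihood=2, detectability=4, outage_severity=5),
]

# Highest RPV first, so the riskiest failure modes are reviewed first.
ranked = sorted(catalog, key=lambda fm: fm.rpv(), reverse=True)
for fm in ranked:
    print(f"{fm.rpv():4d}  {fm.name}")
```

&lt;P&gt;Under these assumed ratings, a rare failure that is hard to detect and slow to recover from outranks a more frequent one that is caught quickly, which is the kind of ordering engineering input is meant to confirm or correct.&lt;/P&gt;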
&lt;H3&gt;Options&lt;/H3&gt;
&lt;UL&gt;
&lt;LI&gt;&lt;STRONG&gt;Use generated failure modes and scores:&lt;/STRONG&gt; Best for a fast first pass.&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;Tune the RPV scores with engineering input:&lt;/STRONG&gt; Best when workload context matters.&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;Add custom failure modes:&lt;/STRONG&gt; Best when known risks come from incidents, reviews, or customer experience.&lt;/LI&gt;
&lt;/UL&gt;
&lt;H3&gt;When to pick which&lt;/H3&gt;
&lt;P&gt;Use the generated model to accelerate the first pass, then adjust it with real system knowledge. The goal is not to create the longest list of risks. The goal is to identify the risks that deserve attention first.&lt;/P&gt;
&lt;H3&gt;Deliverables&lt;/H3&gt;
&lt;UL&gt;
&lt;LI&gt;Failure Mode Catalog&lt;/LI&gt;
&lt;LI&gt;RPV Risk Scores&lt;/LI&gt;
&lt;LI&gt;Prioritized criticality list&lt;/LI&gt;
&lt;/UL&gt;
&lt;H2&gt;Question 4: Are mitigations defined or validated?&lt;/H2&gt;
&lt;P&gt;Phase 2 is &lt;STRONG&gt;Mitigation and Validation&lt;/STRONG&gt;. This is where each failure mode gets a response plan.&lt;/P&gt;
&lt;H3&gt;Options&lt;/H3&gt;
&lt;UL&gt;
&lt;LI&gt;&lt;STRONG&gt;Detection only:&lt;/STRONG&gt; The team can detect the failure, but the response is not defined.&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;Defined mitigation:&lt;/STRONG&gt; The response is documented, such as retry, fallback, failover, scaling, restore, or rebalance.&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;Validated mitigation:&lt;/STRONG&gt; The response has been tested through a controlled validation or chaos test.&lt;/LI&gt;
&lt;/UL&gt;
&lt;H3&gt;When to pick which&lt;/H3&gt;
&lt;P&gt;For low-risk items, documented mitigation may be enough. For critical and high-risk items, validation is essential. A mitigation that has not been tested is still an assumption.&lt;/P&gt;
&lt;H3&gt;Deliverables&lt;/H3&gt;
&lt;UL&gt;
&lt;LI&gt;Mitigation playbooks&lt;/LI&gt;
&lt;LI&gt;Chaos test plans&lt;/LI&gt;
&lt;LI&gt;Support playbooks&lt;/LI&gt;
&lt;/UL&gt;
&lt;H2&gt;Question 5: Which risks need health signals?&lt;/H2&gt;
&lt;P&gt;Phase 3 is &lt;STRONG&gt;Health Model Mapping&lt;/STRONG&gt;. This is where the tool connects risks to observability.&lt;/P&gt;
&lt;P&gt;A failure mode should not just sit in a document. It should map to a signal that can show whether the system is healthy, degraded, or unhealthy.&lt;/P&gt;
&lt;H3&gt;Options&lt;/H3&gt;
&lt;UL&gt;
&lt;LI&gt;&lt;STRONG&gt;Map all failure modes:&lt;/STRONG&gt; Best for small systems or highly critical workloads.&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;Map critical and high-risk failure modes first:&lt;/STRONG&gt; Best for large systems.&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;Track unmapped risks as gaps:&lt;/STRONG&gt; Best when observability coverage is still improving.&lt;/LI&gt;
&lt;/UL&gt;
&lt;H3&gt;When to pick which&lt;/H3&gt;
&lt;P&gt;Start with the highest RPV items. Every critical failure mode should have at least one signal, such as a metric, log, alert, availability check, or dependency signal.&lt;/P&gt;
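&lt;P&gt;One way to picture the coverage check is the short Python sketch below. It assumes a simple in-memory mapping of failure modes to signals; the data structure, criticality labels, and example entries are hypothetical and do not reflect the tool's export format.&lt;/P&gt;

```python
# Hypothetical failure modes with their criticality and mapped health signals.
failure_modes = {
    "Payment API timeout": {"criticality": "critical", "signals": ["p95 latency alert"]},
    "Database failover stall": {"criticality": "critical", "signals": []},
    "Queue backlog growth": {"criticality": "high", "signals": ["queue depth metric"]},
    "Cache node eviction": {"criticality": "low", "signals": []},
}


def coverage_report(modes, required=("critical", "high")):
    """Return signal coverage for the required criticality levels and the unmapped gaps."""
    in_scope = {name: fm for name, fm in modes.items() if fm["criticality"] in required}
    gaps = sorted(name for name, fm in in_scope.items() if not fm["signals"])
    covered = len(in_scope) - len(gaps)
    percent = round(100 * covered / len(in_scope)) if in_scope else 100
    return {"coverage_percent": percent, "gaps": gaps}


report = coverage_report(failure_modes)
print(report)
```

&lt;P&gt;Listing unmapped critical and high-risk items by name, rather than reporting only a percentage, turns each gap into a trackable work item.&lt;/P&gt;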
&lt;H3&gt;Deliverables&lt;/H3&gt;
&lt;UL&gt;
&lt;LI&gt;Health model&lt;/LI&gt;
&lt;LI&gt;Signal definitions&lt;/LI&gt;
&lt;LI&gt;Coverage report&lt;/LI&gt;
&lt;LI&gt;Bicep templates&lt;/LI&gt;
&lt;/UL&gt;
&lt;H2&gt;Question 6: Should the health model be exported or deployed?&lt;/H2&gt;
&lt;P&gt;Once the health model is built, the next decision is how to use it.&lt;/P&gt;
&lt;H3&gt;Options&lt;/H3&gt;
&lt;UL&gt;
&lt;LI&gt;&lt;STRONG&gt;Export for review:&lt;/STRONG&gt; Best when the team needs to validate the model first.&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;Generate monitoring templates:&lt;/STRONG&gt; Best when the team wants repeatable implementation.&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;Deploy to Azure:&lt;/STRONG&gt; Best when the model is ready to become part of operations.&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;Use outputs in downstream tools:&lt;/STRONG&gt; Best when support, SRE, or incident response workflows need structured playbooks.&lt;/LI&gt;
&lt;/UL&gt;
&lt;H3&gt;When to pick which&lt;/H3&gt;
&lt;P&gt;Export first if the model is still being reviewed. Deploy when component relationships, signals, and coverage are accurate enough for operational use.&lt;/P&gt;
&lt;H2&gt;Question 7: How will governance keep the model current?&lt;/H2&gt;
&lt;P&gt;Phase 4 is &lt;STRONG&gt;Operations and Governance&lt;/STRONG&gt;. This is where the resilience model becomes an ongoing practice.&lt;/P&gt;
&lt;H3&gt;Options&lt;/H3&gt;
&lt;UL&gt;
&lt;LI&gt;&lt;STRONG&gt;One-time assessment:&lt;/STRONG&gt; Useful for quick discovery but limited long term.&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;Recurring review:&lt;/STRONG&gt; Best for production workloads that change regularly.&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;Closed-loop governance:&lt;/STRONG&gt; Best when incidents, failed validations, and monitoring gaps feed back into the model.&lt;/LI&gt;
&lt;/UL&gt;
&lt;H3&gt;When to pick which&lt;/H3&gt;
&lt;P&gt;For production systems, use a recurring governance cadence. Assign owners, track gaps, review dashboards, and update the model as the system changes.&lt;/P&gt;
&lt;H3&gt;Deliverables&lt;/H3&gt;
&lt;UL&gt;
&lt;LI&gt;Governance model&lt;/LI&gt;
&lt;LI&gt;Dashboards&lt;/LI&gt;
&lt;LI&gt;Reports and exports&lt;/LI&gt;
&lt;LI&gt;Runbooks&lt;/LI&gt;
&lt;/UL&gt;
&lt;H2&gt;Putting it together: three adoption patterns&lt;/H2&gt;
&lt;P&gt;Once governance is defined, the tool can be used in different ways depending on the team’s maturity and objective. The three common adoption patterns are:&lt;/P&gt;
&lt;H3&gt;Pattern A: Quick resilience review&lt;/H3&gt;
&lt;UL&gt;
&lt;LI&gt;Import one critical workflow&lt;/LI&gt;
&lt;LI&gt;Generate failure modes&lt;/LI&gt;
&lt;LI&gt;Review RPV scores&lt;/LI&gt;
&lt;LI&gt;Identify top risks&lt;/LI&gt;
&lt;LI&gt;Export findings&lt;/LI&gt;
&lt;/UL&gt;
&lt;P&gt;Best for fast architecture reviews or early customer conversations.&lt;/P&gt;
&lt;H3&gt;Pattern B: Full workload assessment&lt;/H3&gt;
&lt;UL&gt;
&lt;LI&gt;Import multiple workflows&lt;/LI&gt;
&lt;LI&gt;Build a full Failure Mode Catalog&lt;/LI&gt;
&lt;LI&gt;Define mitigations and recovery steps&lt;/LI&gt;
&lt;LI&gt;Create chaos test plans&lt;/LI&gt;
&lt;LI&gt;Map risks to signals&lt;/LI&gt;
&lt;LI&gt;Produce coverage reports&lt;/LI&gt;
&lt;/UL&gt;
&lt;P&gt;Best for structured resilience assessments.&lt;/P&gt;
&lt;H3&gt;Pattern C: Operational health model&lt;/H3&gt;
&lt;UL&gt;
&lt;LI&gt;Build and tune the health model&lt;/LI&gt;
&lt;LI&gt;Export or deploy monitoring artifacts&lt;/LI&gt;
&lt;LI&gt;Track risk and signal coverage&lt;/LI&gt;
&lt;LI&gt;Review mitigation effectiveness&lt;/LI&gt;
&lt;LI&gt;Assign governance ownership&lt;/LI&gt;
&lt;LI&gt;Feed findings back into the model&lt;/LI&gt;
&lt;/UL&gt;
&lt;P&gt;Best when the goal is continuous operational improvement.&lt;/P&gt;
&lt;H2&gt;A short checklist before using the tool&lt;/H2&gt;
&lt;OL&gt;
&lt;LI&gt;Which workflow should we import first?&lt;/LI&gt;
&lt;LI&gt;Do we have a data flow diagram, sequence diagram, or Mermaid file?&lt;/LI&gt;
&lt;LI&gt;What components and dependencies should be included?&lt;/LI&gt;
&lt;LI&gt;Which failure modes matter most?&lt;/LI&gt;
&lt;LI&gt;How should RPV be adjusted for this workload?&lt;/LI&gt;
&lt;LI&gt;Do critical failure modes have mitigations?&lt;/LI&gt;
&lt;LI&gt;Have those mitigations been validated?&lt;/LI&gt;
&lt;LI&gt;Are failure modes mapped to health signals?&lt;/LI&gt;
&lt;LI&gt;What coverage gaps remain?&lt;/LI&gt;
&lt;LI&gt;Should the health model be exported or deployed?&lt;/LI&gt;
&lt;LI&gt;Who owns ongoing review?&lt;/LI&gt;
&lt;LI&gt;How often should the model be updated?&lt;/LI&gt;
&lt;/OL&gt;
&lt;H2&gt;Closing thought&lt;/H2&gt;
&lt;P data-start="20" data-end="176"&gt;The &lt;STRONG data-start="24" data-end="65"&gt;Application Resilience Framework Tool&lt;/STRONG&gt; provides a practical way to move from architecture artifacts to measurable, continuously improving resilience.&lt;/P&gt;
&lt;P data-start="178" data-end="414"&gt;It starts with data flow or sequence diagrams, builds a structured view of the system, and guides teams through the decisions that matter: what can fail, how severe it is, how it is mitigated, how it is detected, and how it is governed.&lt;/P&gt;
&lt;P&gt;&lt;STRONG data-start="416" data-end="430"&gt;Tool repo:&lt;/STRONG&gt; &lt;A class="lia-external-url" href="https://github.com/jvargh/ApplicationResilienceFrameworkTool" target="_blank"&gt;Application Resilience Framework Tool&lt;/A&gt;&lt;/P&gt;</description>
      <pubDate>Mon, 04 May 2026 21:24:03 GMT</pubDate>
      <guid>https://techcommunity.microsoft.com/t5/azure-architecture-blog/architecture-to-resilience-a-decision-guide/ba-p/4516552</guid>
      <dc:creator>varghesejoji</dc:creator>
      <dc:date>2026-05-04T21:24:03Z</dc:date>
    </item>
    <item>
      <title>How MS Discovery Is Empowering Scientists to Do More</title>
      <link>https://techcommunity.microsoft.com/t5/azure-architecture-blog/how-ms-discovery-is-empowering-scientists-to-do-more/ba-p/4516670</link>
      <description>&lt;P data-line="2"&gt;Research and development has traditionally been a slow, sequential, and largely manual endeavour. Scientists formulate hypotheses, design experiments, run computations in constrained environments, and document results, each stage dependent on the last, each transition requiring human review and intervention. Knowledge is fragmented across systems, insights are bottlenecked by individual capacity, and the gap between hypothesis and actionable outcome can span weeks or months.&lt;/P&gt;
&lt;P data-line="4"&gt;For organisations tackling complex scientific and operational challenges, from drug discovery to industrial process optimisation, this pace of iteration is simply no longer acceptable.&lt;/P&gt;
&lt;P data-line="6"&gt;At Microsoft, we recently introduced&amp;nbsp;&lt;A class="lia-external-url" href="https://learn.microsoft.com/en-us/azure/microsoft-discovery/" target="_blank"&gt;&lt;STRONG&gt;Microsoft Discovery&lt;/STRONG&gt;&lt;/A&gt;, a platform that I believe fundamentally changes this model. Much like Microsoft 365 transformed the way knowledge workers collaborate and create, Microsoft Discovery is designed to simplify and empower the way scientists and researchers work. It provides a unified, end-to-end platform that integrates advanced artificial intelligence, high-performance computing, and knowledge management to support the full scientific reasoning lifecycle: knowledge gathering, hypothesis generation, experiment design, simulation, results analysis, and documentation.&lt;/P&gt;
&lt;P data-line="8"&gt;In this article, I want to share how we used Microsoft Discovery to automate a real-world simulation workflow for a mining organisation and what that experience taught our team about the future of AI-augmented science.&lt;/P&gt;
&lt;P data-line="8"&gt;&amp;nbsp;&lt;/P&gt;
&lt;H2 data-line="12"&gt;What Is Microsoft Discovery?&lt;/H2&gt;
&lt;P data-line="14"&gt;Microsoft Discovery is Microsoft's scientific AI platform, a solution designed to accelerate research and experimentation across the full innovation lifecycle. Rather than replacing scientific judgement, Discovery is designed to&amp;nbsp;&lt;STRONG&gt;amplify human expertise&lt;/STRONG&gt;, embedding AI assistance at each stage of the R&amp;amp;D process while maintaining governance, traceability, and scientific rigour.&lt;/P&gt;
&lt;H3 data-line="16"&gt;From Traditional R&amp;amp;D to AI-Augmented Science&lt;/H3&gt;
&lt;P data-line="18"&gt;To appreciate what Discovery enables, it is important to understand where it fits in.&lt;/P&gt;
&lt;P data-line="20"&gt;In the&amp;nbsp;&lt;STRONG&gt;traditional R&amp;amp;D model&lt;/STRONG&gt;, knowledge discovery centres on manual literature reviews and historical data analysis. Researchers individually search, read, and synthesise information, a time-intensive process in which discovery is limited by each person's capacity to locate and interpret relevant material. Hypothesis generation and experimental design are expert-led and largely manual. Computational experimentation, where it exists, runs in fixed or constrained environments with limited parallelism. Analysis and iteration follow the same sequential pattern: execute, review, document, repeat.&lt;/P&gt;
&lt;P data-line="22"&gt;&lt;STRONG&gt;Microsoft Discovery changes this fundamentally.&lt;/STRONG&gt;&amp;nbsp;In the AI-cloud-enabled model it provides:&lt;/P&gt;
&lt;UL data-line="24"&gt;
&lt;LI data-line="24"&gt;&lt;STRONG&gt;Knowledge synthesis at scale&lt;/STRONG&gt;&amp;nbsp;— Researchers can explore literature, historical experiments, and organisational knowledge through a single interface, with intelligent indexing surfacing insights faster than manual search could ever achieve.&lt;/LI&gt;
&lt;LI data-line="25"&gt;&lt;STRONG&gt;AI-assisted hypothesis generation&lt;/STRONG&gt;&amp;nbsp;— Collaborative human-and-AI workflows support hypothesis exploration and feasibility assessment, while final decisions remain with the scientist.&lt;/LI&gt;
&lt;LI data-line="26"&gt;&lt;STRONG&gt;Cloud-scale experimentation&lt;/STRONG&gt;&amp;nbsp;— Elastic compute and parallel processing allow simulations and experiments to run at scale, with integrated tracking and reproducibility built in.&lt;/LI&gt;
&lt;LI data-line="27"&gt;&lt;STRONG&gt;Continuous feedback and human-in-the-loop governance&lt;/STRONG&gt;&amp;nbsp;— Results are analysed and compared more rapidly, enabling faster iteration, with AI-generated insights reviewed and validated by researchers before action.&lt;/LI&gt;
&lt;LI data-line="28"&gt;&lt;STRONG&gt;Governed knowledge assets&lt;/STRONG&gt;&amp;nbsp;— Experiment lineage, outcomes, and best practices are captured as reusable, governed assets, supporting long-term organisational learning.&lt;/LI&gt;
&lt;/UL&gt;
&lt;P data-line="30"&gt;The net effect is a transition from slow, manual, and fragmented research processes to an&amp;nbsp;&lt;STRONG&gt;agile, automated, and data-driven R&amp;amp;D model&lt;/STRONG&gt;, one that improves research efficiency, increases the return on innovation investment, and enables faster, higher-impact solutions to complex challenges. At a high level, the following diagram shows the research and development loop discussed above and how Microsoft Discovery enriches it.&lt;/P&gt;
&lt;img /&gt;
&lt;H2 data-line="36"&gt;The Real-World Problem: Screening Thousands of Molecules&lt;/H2&gt;
&lt;P data-line="38"&gt;To bring this to life, let me walk you through a real-world use case we worked on recently. A mining organisation needed to identify the best-performing oxidant compounds for a chemical reaction central to their operations. This article covers only one workflow, which sits squarely in the&amp;nbsp;&lt;STRONG&gt;simulation&lt;/STRONG&gt;&amp;nbsp;phase of the scientific loop, and it is a good example of the kind of work that Microsoft Discovery can transform.&lt;/P&gt;
&lt;H3 data-line="40"&gt;How Scientists Did It Before&lt;/H3&gt;
&lt;P data-line="42"&gt;In the traditional process, scientists would begin by selecting candidate molecules from established molecular libraries based on characteristics identified through literature review. These libraries can contain thousands of molecules, each defined in standard molecular file formats (such as XYZ or CIF files) that describe their three-dimensional atomic structures.&lt;/P&gt;
&lt;P data-line="44"&gt;From there, a researcher would manually work through a multi-step pipeline:&lt;/P&gt;
&lt;OL data-line="46"&gt;
&lt;LI data-line="46"&gt;&lt;STRONG&gt;Pre-processing and preparation&lt;/STRONG&gt;: The selected molecular files are processed and prepared for quantum mechanical (QM) calculations. This involves filtering molecules based on properties like the types of metals present, electron count, and atomic weight — criteria that directly affect both the scientific relevance and the computational cost of the simulations. The output is a set of prepared input files (known as GJF files) ready for simulation.&lt;/LI&gt;
&lt;LI data-line="48"&gt;&lt;STRONG&gt;Running quantum mechanical simulations&lt;/STRONG&gt;: The prepared input files are submitted to a computational chemistry tool (Gaussian 16) to perform Density Functional Theory (DFT) calculations. These simulations compute the electronic structure and energy states of each molecule across different charge and multiplicity configurations. Crucially, each molecule requires multiple independent simulation runs, and the computational cost scales rapidly with molecular complexity. With thousands of candidate molecules, this step alone can involve thousands of individual simulation jobs.&lt;/LI&gt;
&lt;LI data-line="50"&gt;&lt;STRONG&gt;Collecting and post-processing results&lt;/STRONG&gt;: Once all simulations complete, the output log files are collected and processed. For each molecule, the lowest-energy charge and multiplicity combination is identified, and a set of quantum mechanical descriptors and classical molecular descriptors are extracted. These descriptors are then fed into a trained machine learning model to predict the&amp;nbsp;&lt;STRONG&gt;redox potential&lt;/STRONG&gt;&amp;nbsp;of each compound, a key metric that indicates how effectively a molecule can act as an oxidant in the target reaction.&lt;/LI&gt;
&lt;LI data-line="52"&gt;&lt;STRONG&gt;Summarisation and filtering&lt;/STRONG&gt;: Finally, the predicted redox potentials and other relevant characteristics are compiled into a summary, enabling researchers to identify the most promising candidates for further investigation and experimental validation.&lt;/LI&gt;
&lt;/OL&gt;
&lt;img /&gt;
&lt;P&gt;Every step in this pipeline required manual intervention: writing and adjusting scripts, verifying input and output files, monitoring job queues, handling failures, and stitching results together. A single researcher could easily spend days or weeks moving through this process — and any error at one stage meant going back and re-running subsequent steps.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;H2 data-line="60"&gt;How We Automated This with Microsoft Discovery Agents&lt;/H2&gt;
&lt;P data-line="62"&gt;When we looked at this workflow through the lens of Microsoft Discovery, the opportunity was clear. The scientific reasoning, selecting which molecules to test, interpreting redox potential results, deciding what to investigate next, should remain with the researcher. But the&amp;nbsp;&lt;STRONG&gt;operational overhead&lt;/STRONG&gt;&amp;nbsp;of preparing files, submitting simulations, monitoring jobs, collecting results, and assembling summaries? That could be orchestrated by a team of AI agents.&lt;/P&gt;
&lt;H3 data-line="64"&gt;A Team of Agents, Working Together&lt;/H3&gt;
&lt;P data-line="66"&gt;We designed a multi-agent architecture within Microsoft Discovery to automate this simulation workflow end to end. Here is how the team of agents operates:&lt;/P&gt;
&lt;img /&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P data-line="70"&gt;&lt;STRONG&gt;Router Agent:&lt;/STRONG&gt; The entry point. When a researcher submits a request (for example, asking to run QM calculations on a set of candidate molecules), the Router Agent interprets the intent and orchestrates the downstream workflow.&lt;/P&gt;
&lt;P data-line="72"&gt;&lt;STRONG&gt;Planner Agent:&lt;/STRONG&gt;&amp;nbsp;Once the Router Agent identifies the task, the Planner Agent examines the input files provided by the researcher and formulates a step-by-step execution plan. It determines what needs to happen, in what order, and with what parameters, much like a project manager scoping out a piece of work.&lt;/P&gt;
&lt;P data-line="74"&gt;&lt;STRONG&gt;Gaussian Prep Agent:&lt;/STRONG&gt;&amp;nbsp;This agent handles the preparation step. It inspects the current molecular files, applies the necessary filtering criteria, and prepares them for simulation, generating the input files that the computational chemistry tool requires. What previously involved manual scripting and file-by-file verification is now handled autonomously. The agent relies on Microsoft Discovery tools for the underlying execution.&lt;/P&gt;
&lt;P data-line="76"&gt;&lt;STRONG&gt;MPI Gaussian Agent:&lt;/STRONG&gt;&amp;nbsp;This is where the power of cloud-scale computing comes in. The MPI Gaussian Agent submits the prepared simulation jobs and manages their execution using an MPI-based master-worker pattern. This approach enables &lt;STRONG&gt;massive parallel execution&lt;/STRONG&gt;, scaling out across the cloud to run thousands of simulations concurrently rather than sequentially. Given that the candidate molecule libraries can contain thousands of entries, and each molecule may require multiple simulation runs, this parallel execution capability is transformative. What might have taken days in a constrained local environment can now complete in a fraction of the time.&lt;/P&gt;
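&lt;P&gt;The master-worker idea can be sketched in a few lines. This is an illustration only, not the agent's implementation: it swaps MPI for a thread pool from the Python standard library so that it runs anywhere, and &lt;CODE&gt;run_simulation&lt;/CODE&gt;, &lt;CODE&gt;build_jobs&lt;/CODE&gt;, and the molecule names are hypothetical placeholders rather than Gaussian or Microsoft Discovery APIs.&lt;/P&gt;

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def run_simulation(job):
    """Hypothetical stand-in for one simulation run; a real worker would launch the QM tool."""
    molecule, charge, multiplicity = job
    return {
        "molecule": molecule,
        "charge": charge,
        "multiplicity": multiplicity,
        "status": "done",
    }


def build_jobs(molecules, charges, multiplicities):
    """Master side: expand each molecule into one job per charge/multiplicity pair.

    Assumption: every pair is treated as valid here; a real pipeline would
    filter out combinations that are not physically meaningful.
    """
    return [(m, c, s) for m in molecules for c in charges for s in multiplicities]


def run_all(jobs, max_workers=4):
    """Master side: hand jobs to a pool of workers and gather results as they finish."""
    results = []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(run_simulation, job) for job in jobs]
        for future in as_completed(futures):
            results.append(future.result())
    return results


if __name__ == "__main__":
    jobs = build_jobs(["mol_a", "mol_b"], charges=[0, 1], multiplicities=[1, 3])
    print(len(run_all(jobs)), "simulation runs completed")
```

&lt;P&gt;The point of the pattern is that the master only builds and distributes the job list, so adding workers shortens the wall-clock time without changing the logic.&lt;/P&gt;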
&lt;P data-line="78"&gt;&lt;STRONG&gt;Redox Potential Agent: &lt;/STRONG&gt;Once the simulations are complete, this agent takes over. It processes the simulation outputs, identifies the optimal charge and multiplicity state for each molecule, extracts the relevant QM and classical descriptors, and runs them through the trained machine learning model to predict redox potentials.&lt;/P&gt;
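&lt;P&gt;As a minimal sketch of the state-selection step only, the function below picks the lowest-energy charge and multiplicity combination for each molecule. The record layout, field names, and energy values are assumptions made for illustration; descriptor extraction and the machine learning model are out of scope here.&lt;/P&gt;

```python
from collections import defaultdict


def lowest_energy_states(records):
    """Return, for each molecule, the run with the lowest computed energy.

    Each record is assumed to be a dict with the hypothetical keys
    'molecule', 'charge', 'multiplicity', and 'energy'.
    """
    by_molecule = defaultdict(list)
    for rec in records:
        by_molecule[rec["molecule"]].append(rec)
    return {
        name: min(runs, key=lambda run: run["energy"])
        for name, runs in by_molecule.items()
    }


if __name__ == "__main__":
    # Made-up energies, purely to exercise the function.
    sample = [
        {"molecule": "mol_a", "charge": 0, "multiplicity": 1, "energy": -100.2},
        {"molecule": "mol_a", "charge": 1, "multiplicity": 2, "energy": -100.5},
        {"molecule": "mol_b", "charge": 0, "multiplicity": 1, "energy": -75.0},
    ]
    for name, run in lowest_energy_states(sample).items():
        print(name, run["charge"], run["multiplicity"], run["energy"])
```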
&lt;P data-line="80"&gt;&lt;STRONG&gt;Summariser Agent:&lt;/STRONG&gt; The final agent in the chain. It maps the predicted redox potentials back to the original molecules, applies any additional filtering criteria, and produces a clean, structured summary: a JSON file that the researcher can immediately use to identify the most promising candidates and take them forward into the next phase of their work.&lt;/P&gt;
&lt;H3 data-line="82"&gt;What the Researcher Experiences&lt;/H3&gt;
&lt;P data-line="84"&gt;From the scientist's perspective, the transformation is striking. Instead of spending days writing scripts, babysitting job queues, and manually stitching results together, they provide their input files and describe what they need. The agents take it from there (planning, preparing, executing, processing, and summarising) and deliver a curated output ready for scientific interpretation.&lt;/P&gt;
&lt;P data-line="86"&gt;The researcher's time is freed to focus on what matters most:&amp;nbsp;&lt;STRONG&gt;thinking critically about the science&lt;/STRONG&gt;. Which molecules look most promising? What does the redox potential distribution tell us? Should we adjust the filtering criteria and run another round? These are the high-value questions that require human expertise, and scientists can now spend their time on exactly that, rather than on operational mechanics.&lt;/P&gt;
&lt;H2 data-line="90"&gt;The Bigger Picture: Accelerating the Entire Scientific Loop&lt;/H2&gt;
&lt;P data-line="92"&gt;It is important to note that this simulation workflow is just&amp;nbsp;&lt;STRONG&gt;one piece of the broader scientific loop&lt;/STRONG&gt;. The full cycle of scientific research, from initial knowledge gathering and literature review through hypothesis generation, experimental design, simulation, results analysis, and documentation, involves many stages, each of which can benefit from the same kind of AI-augmented approach.&lt;/P&gt;
&lt;P data-line="94"&gt;Microsoft Discovery is designed to support this entire cycle. In our project, we did not stop at simulation. We also explored how agents can accelerate the&amp;nbsp;&lt;STRONG&gt;knowledge gathering&lt;/STRONG&gt;&amp;nbsp;phase, helping researchers navigate vast bodies of literature and surface relevant prior work more efficiently. We looked at how AI can assist with&amp;nbsp;&lt;STRONG&gt;hypothesis generation and evaluation&lt;/STRONG&gt;, helping scientists reason about which directions are most promising before committing to expensive computations. And we examined how agents can support the&amp;nbsp;&lt;STRONG&gt;analysis and reporting&lt;/STRONG&gt; phases: comparing results against hypotheses, generating visualisations, and even assisting with drafting research documents.&lt;/P&gt;
&lt;P data-line="96"&gt;What excites me most about Microsoft Discovery is not any single capability, but the&amp;nbsp;&lt;STRONG&gt;cumulative effect&lt;/STRONG&gt;&amp;nbsp;of embedding AI assistance across every stage of the research process. Each phase that gets faster and more efficient creates a multiplier effect on the phases that follow. When knowledge gathering takes hours instead of weeks, researchers generate better hypotheses sooner. When simulations run at cloud scale in parallel, results arrive faster. When analysis is augmented by AI, iteration cycles tighten. The entire loop accelerates.&lt;/P&gt;
&lt;H2 data-line="100"&gt;Conclusion&lt;/H2&gt;
&lt;P data-line="102"&gt;The way we approach scientific research is undergoing a fundamental shift. Large language models and the AI agents built from them are not replacing scientists; they are &lt;STRONG&gt;empowering them&lt;/STRONG&gt;&amp;nbsp;to work at a pace and scale that was previously unimaginable.&lt;/P&gt;
&lt;P data-line="104"&gt;Microsoft Discovery represents a new operating model for R&amp;amp;D. By combining advanced AI, high-performance cloud computing, and intelligent workflow orchestration, it enables researchers to offload the repetitive, time-consuming operational work to agents and invest their expertise where it has the greatest impact: in asking better questions, interpreting complex results, and pushing the boundaries of what we know.&lt;/P&gt;
&lt;P data-line="106"&gt;In the use case I have shared here, a team of six AI agents automated a simulation pipeline that would have taken a single researcher days of manual work. They prepared molecular input files, scaled out thousands of quantum mechanical simulations in parallel across the cloud, processed the results, predicted redox potentials using machine learning, and delivered a structured summary, all with minimal human intervention.&lt;/P&gt;
&lt;P data-line="108"&gt;This is just the beginning. As AI agents become more capable and the tools surrounding them more mature, the potential to accelerate discovery across every scientific domain is immense. Whether you are in materials science, pharmaceuticals, energy, agriculture, or any field where complex R&amp;amp;D is central to progress, Microsoft Discovery offers a platform to&amp;nbsp;&lt;STRONG&gt;do more, faster, and with greater confidence&lt;/STRONG&gt;.&lt;/P&gt;
&lt;P data-line="110"&gt;The future of science is not about working harder. It is about working smarter, with AI as your partner in discovery.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Sun, 03 May 2026 12:05:56 GMT</pubDate>
      <guid>https://techcommunity.microsoft.com/t5/azure-architecture-blog/how-ms-discovery-is-empowering-scientists-to-do-more/ba-p/4516670</guid>
      <dc:creator>sameeraman</dc:creator>
      <dc:date>2026-05-03T12:05:56Z</dc:date>
    </item>
    <item>
      <title>Running Diffusion Models at Scale on AKS</title>
      <link>https://techcommunity.microsoft.com/t5/azure-architecture-blog/running-diffusion-models-at-scale-on-aks/ba-p/4513687</link>
      <description>&lt;P&gt;Diffusion workloads are simple at prototype scale and unforgiving in production. A single demo can run on one GPU-backed VM, but a real platform has to handle bursty demand, long-running jobs, model artifact distribution, secure public access, rollout safety, and hardware-level observability.&lt;/P&gt;
&lt;P&gt;Azure Kubernetes Service (AKS) is a strong fit when the requirement is not just to run a model, but to operate a repeatable platform for GPU inference. The reusable pattern is straightforward: keep the API and control layer on CPU nodes, buffer work through a dispatch layer, run inference on isolated GPU capacity, push results to durable storage, and treat security, telemetry, and deployment automation as first-class platform features.&lt;/P&gt;
&lt;img&gt;AKS reference architecture for diffusion workloads&lt;/img&gt;
&lt;P&gt;The architecture above shows the core operating model. DNS and an edge layer (Application Gateway with WAF, and optionally Front Door for global entry) route traffic to an AKS CPU pool that hosts the API tier. GPU jobs run on a separate GPU pool, while shared add-ons and CSI drivers run on a system pool. Teams can keep dispatch inside Kubernetes or externalize it through Service Bus plus KEDA. Azure dependencies should be reached over Private Link, with Azure Monitor covering both app and hardware telemetry.&lt;/P&gt;
&lt;P&gt;The storage block serves two purposes: durable output storage and, if needed, a shared Hugging Face model cache exposed to GPU pods through PV and PVC mounts.&lt;/P&gt;
&lt;H2&gt;The reference pattern&lt;/H2&gt;
&lt;P&gt;The core architecture separates control-plane traffic from GPU execution:&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;A lightweight API tier on the CPU node pool receives requests, validates identity, and hands execution work to a dispatch layer.&lt;/LI&gt;
&lt;LI&gt;That dispatch layer can stay inside Kubernetes using native queueing and controller patterns, or it can publish work to Azure Service Bus for external queue-backed dispatch.&lt;/LI&gt;
&lt;LI&gt;Scaling can likewise stay AKS-native through cluster and workload autoscaling, or it can use KEDA to react directly to queue backlog.&lt;/LI&gt;
&lt;LI&gt;GPU work runs on a dedicated GPU node pool, isolated from the API and cluster add-ons.&lt;/LI&gt;
&lt;LI&gt;GPU workers should mount persistent storage for model caches so Hugging Face assets can survive pod restarts and repeated job submissions.&lt;/LI&gt;
&lt;LI&gt;Results are stored outside the pod lifecycle in blob-backed storage and returned through a stable status API.&lt;/LI&gt;
&lt;LI&gt;Edge routing, TLS termination, and WAF inspection happen at the ingress layer, while token validation is typically enforced in the API tier or a dedicated upstream auth component.&lt;/LI&gt;
&lt;/UL&gt;
&lt;P&gt;This split lets each lane scale on the right signal: request traffic for the API tier, backlog for dispatch, and job demand for GPU workers. It also keeps tuning simpler for cost, latency, and reliability. For single-region deployments, Application Gateway or Application Gateway for Containers is often enough; Azure Front Door becomes more useful for global entry, multi-region failover, or shared edge policy.&lt;/P&gt;
&lt;P&gt;In the reference architecture, the CPU pool hosts the externally reachable APIs and control-plane components that submit work into the dispatch layer. The GPU pool hosts the actual model execution components, including short-lived diffusion jobs and longer-lived worker runtimes. A separate system pool hosts shared cluster services such as the Application Gateway Ingress Controller (AGIC) and the Secrets Store and Blob CSI drivers, while KEDA is added only when teams choose the Service Bus pattern. That keeps platform plumbing off the application and GPU lanes.&lt;/P&gt;
&lt;P&gt;The persistence layer is most useful as a model cache rather than as general-purpose application state. There are two practical ways to back it:&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;Node-local persistence keeps the cache close to the GPU worker and is the simplest option when jobs benefit from warm data already present on the same node.&lt;/LI&gt;
&lt;LI&gt;Azure Storage-backed persistence is more useful when model download times are long enough that keeping artifacts on shared durable storage materially reduces job startup latency.&lt;/LI&gt;
&lt;/UL&gt;
&lt;H2&gt;Choose the dispatch model&lt;/H2&gt;
&lt;P&gt;The important design decision is not whether to use one branded queueing technology. It is whether the platform needs a fully Kubernetes-native control loop or an explicit external queue with backlog-driven scaling.&lt;/P&gt;
&lt;P&gt;The Kubernetes-native option keeps dispatch inside the cluster:&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;The API creates or signals internal Kubernetes work objects.&lt;/LI&gt;
&lt;LI&gt;A Kubernetes-native queue or controller pattern manages admission and dispatch.&lt;/LI&gt;
&lt;LI&gt;AKS workload autoscaling and cluster autoscaling handle most scale changes.&lt;/LI&gt;
&lt;LI&gt;This path is simpler when the team wants fewer external dependencies and the workload shape is already well understood.&lt;/LI&gt;
&lt;/UL&gt;
&lt;P&gt;The Azure Service Bus plus KEDA option externalizes the control loop:&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;The API publishes work to Azure Service Bus.&lt;/LI&gt;
&lt;LI&gt;Queue consumers or schedulers materialize GPU execution from that queue.&lt;/LI&gt;
&lt;LI&gt;KEDA scales the scheduling or worker path directly from queue depth.&lt;/LI&gt;
&lt;LI&gt;This path is better when backlog visibility, queue durability, or burst-driven autoscaling needs to be explicit and independently observable.&lt;/LI&gt;
&lt;/UL&gt;
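&lt;P&gt;To make backlog-driven scaling concrete, here is a simplified sketch of how a worker count can be derived from queue depth. It is not KEDA's code: it only mirrors the general shape of the calculation (backlog divided by a per-worker target, rounded up, then clamped to configured bounds), and the function and parameter names are illustrative. In a KEDA setup the equivalent knobs would roughly be the scaler's target message count and the minimum and maximum replica counts on the scaled workload.&lt;/P&gt;

```python
import math


def desired_workers(queue_depth, messages_per_worker, min_workers, max_workers):
    """Estimate how many workers a given backlog calls for.

    Simplified model: one worker per 'messages_per_worker' queued messages,
    rounded up, then clamped to the range [min_workers, max_workers].
    All parameter names are illustrative assumptions.
    """
    wanted = math.ceil(queue_depth / messages_per_worker)
    return max(min_workers, min(max_workers, wanted))


if __name__ == "__main__":
    for depth in (0, 3, 23, 500):
        print(depth, "queued messages give", desired_workers(depth, 5, 0, 10), "workers")
```

&lt;P&gt;The upper bound matters most on GPU pools, because it caps how many expensive nodes a burst can pull in, while a lower bound of zero lets the lane scale down fully when the queue is empty.&lt;/P&gt;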
&lt;P&gt;Both models can fit the same AKS platform. The GPU isolation, security boundaries, storage pattern, and observability expectations remain the same.&lt;/P&gt;
&lt;P&gt;A simple way to choose is:&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;Start with Kubernetes-native dispatch when the team wants the fewest moving parts and the job profile is already predictable.&lt;/LI&gt;
&lt;LI&gt;Choose Azure Service Bus plus KEDA when durable backlog, explicit queue depth, and burst-driven worker scaling are important operating requirements.&lt;/LI&gt;
&lt;LI&gt;Consider KAITO or the AI toolchain operator add-on when the primary need is managed serving of supported models rather than custom diffusion job orchestration.&lt;/LI&gt;
&lt;/UL&gt;
&lt;H2&gt;Scale by workload lane, not by one generic pool&lt;/H2&gt;
&lt;P&gt;Not every GPU workload should share the same execution path. Keep short-lived inference, queue-backed workers, and longer-running runtimes in separate lanes so one class does not block another. Where supported, GPU multi-instance configurations can further improve utilization for lighter jobs while leaving full GPUs available for heavier ones.&lt;/P&gt;
&lt;P&gt;On AKS, the better pattern is to define separate operating lanes:&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;An API admission lane on CPU nodes for authentication, validation, and request submission.&lt;/LI&gt;
&lt;LI&gt;A scheduling lane that can use either Kubernetes-native queueing with AKS autoscaling or Azure Service Bus with KEDA.&lt;/LI&gt;
&lt;LI&gt;A GPU execution lane for diffusion jobs and longer-lived worker runtimes.&lt;/LI&gt;
&lt;LI&gt;Dedicated labels, taints, autoscaling bounds, and dashboards per lane.&lt;/LI&gt;
&lt;/UL&gt;
&lt;P&gt;Within the GPU execution lane, teams can go one step further and define capacity classes for full-GPU and fractional-GPU jobs. That is useful when some models need the memory and throughput of a whole device, while others can run efficiently on a smaller GPU partition. The NVIDIA device plugin DaemonSet shown in the GPU pool is what advertises those GPU resources (and MIG slices) to the kube-scheduler so pods can request them like any other resource.&lt;/P&gt;
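&lt;P&gt;One hedged way to picture the capacity classes is as a mapping from class name to the extended resource a container requests. The sketch below builds the resources section of a container spec as a plain Python dictionary. The class names are hypothetical, and the fractional resource name assumes a MIG setup that exposes per-profile resources; the exact name depends on the GPU model and on how the device plugin is configured, so treat it as a placeholder.&lt;/P&gt;

```python
# Hypothetical capacity classes mapped to the extended resources they request.
# 'nvidia.com/gpu' is the resource advertised by the NVIDIA device plugin.
# The MIG resource name is an assumption: it varies with the GPU model and
# the MIG strategy configured for the device plugin.
CAPACITY_CLASSES = {
    "full-gpu": {"nvidia.com/gpu": 1},
    "fractional-gpu": {"nvidia.com/mig-1g.5gb": 1},
}


def gpu_container_spec(name, image, capacity_class):
    """Build a container spec fragment that requests the GPU resource for a class."""
    limits = dict(CAPACITY_CLASSES[capacity_class])
    return {"name": name, "image": image, "resources": {"limits": limits}}


if __name__ == "__main__":
    # The image reference is a placeholder, not a real registry path.
    print(gpu_container_spec("diffusion-job", "example.azurecr.io/diffusion:abc123", "full-gpu"))
```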
&lt;P&gt;That gives platform teams clean capacity isolation and avoids letting one workload class starve another.&lt;/P&gt;
&lt;H2&gt;Secure the edge, the workload, and the secret path&lt;/H2&gt;
&lt;P&gt;GPU platforms should treat security as a day-one requirement, not a later add-on.&lt;/P&gt;
&lt;P&gt;At the edge, use DNS with Application Gateway and WAF, and add Front Door when global routing is needed. Store public TLS certificates in Azure Key Vault and project them into the cluster through the Secrets Store CSI Driver so renewals do not require redeployment. For protected APIs, validate Microsoft Entra ID tokens in the service or a dedicated auth layer, and keep health probe endpoints separate from business routes.&lt;/P&gt;
&lt;P&gt;Inside the cluster, keep authorization scoped tightly:&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;Separate system, CPU, and GPU node pools.&lt;/LI&gt;
&lt;LI&gt;Use namespace boundaries for tenant or environment isolation.&lt;/LI&gt;
&lt;LI&gt;Give the API only the Kubernetes RBAC it needs to create and monitor jobs, plus Azure permissions only when the external queue option is enabled.&lt;/LI&gt;
&lt;LI&gt;Prefer Microsoft Entra Workload ID over long-lived credentials for workload access to Azure resources such as Key Vault and Blob Storage, and extend that to Service Bus when the external queue pattern is used.&lt;/LI&gt;
&lt;/UL&gt;
&lt;P&gt;For operator access, keep the cluster management path separate from the public request path. In the reference architecture, developers come in through Azure Bastion rather than broad direct exposure of cluster endpoints.&lt;/P&gt;
&lt;P&gt;For secrets, move away from cluster-local secrets as early as possible. A production-ready path uses Azure Key Vault with the Secrets Store CSI Driver so credentials are not baked into images, manifests, or CI pipelines. If the platform uses Azure Service Bus, queue access should use managed identity as well. Blob-backed result storage should likewise use managed identity and CSI-based integration instead of embedding long-lived credentials into workloads.&lt;/P&gt;
&lt;P&gt;For the network path between the cluster and its Azure dependencies, prefer Private Endpoints over public service endpoints. The diagram uses a single PE icon as shorthand for this pattern: in practice, teams usually create private endpoints per service and pair them with private DNS so ACR pulls, Key Vault reads, Blob I/O, and optional Service Bus traffic resolve to private IPs inside the VNet, which keeps platform traffic off the public internet and simplifies firewall and DNS policy.&lt;/P&gt;
&lt;H2&gt;Observe both the application and the hardware&lt;/H2&gt;
&lt;P&gt;GPU workloads need two telemetry views: application behavior and hardware behavior. The first tracks request IDs, job IDs, latency, and failures; the second tracks GPU utilization, memory pressure, and device-level performance. Together they show whether the problem is code, dispatch pressure, or hardware saturation.&lt;/P&gt;
&lt;P&gt;On Azure, that split maps well to:&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;Structured application logs and OpenTelemetry exported to Application Insights.&lt;/LI&gt;
&lt;LI&gt;Azure Monitor dashboards that include internal queue pressure or Service Bus backlog, plus AKS autoscale or KEDA scale activity depending on the chosen pattern.&lt;/LI&gt;
&lt;LI&gt;NVIDIA DCGM exporter metrics scraped into Azure Managed Prometheus and visualized in Azure Managed Grafana.&lt;/LI&gt;
&lt;/UL&gt;
&lt;P&gt;This model is what turns raw GPU hosting into an operable platform. Without it, teams can see requests failing but not whether the root cause is code, dispatch saturation, scheduling, or hardware contention.&lt;/P&gt;
&lt;P&gt;The diagram reflects that split clearly. Application Insights and Azure dashboards track service and dispatch behavior, while Prometheus, Grafana, the NVIDIA device plugin, and DCGM exporter track cluster and GPU health. That combination is what allows teams to correlate dispatch delay, AKS or KEDA scale-out, execution time, and failure rates with actual GPU utilization and memory pressure.&lt;/P&gt;
&lt;H2&gt;Keep CI/CD small, secretless, and reversible&lt;/H2&gt;
&lt;P&gt;The deployment model does not need to be complex to be production-grade. A practical AKS pattern is:&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;Pull request validation for code quality, tests, Dockerfiles, and secret scanning.&lt;/LI&gt;
&lt;LI&gt;Immutable container tags built from the commit SHA.&lt;/LI&gt;
&lt;LI&gt;GitHub Actions with OpenID Connect and Azure workload identity federation.&lt;/LI&gt;
&lt;LI&gt;ACR as the image source of truth.&lt;/LI&gt;
&lt;LI&gt;Environment-based promotion with approval gates for production.&lt;/LI&gt;
&lt;LI&gt;Rollout verification with Kubernetes health checks and smoke tests.&lt;/LI&gt;
&lt;/UL&gt;
&lt;P&gt;The key principle is separation of concerns. CI/CD should roll forward application images and validated configuration, not rebuild the whole platform on every deploy. Shared components such as ingress, node pools, identity, storage, monitoring, and optional KEDA or Service Bus should remain under controlled infrastructure change management.&lt;/P&gt;
&lt;H2&gt;What makes this pattern reusable&lt;/H2&gt;
&lt;P&gt;This AKS pattern generalizes well beyond one model family or one product surface because it is built on fundamentals:&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;Separate API admission from dispatch and GPU execution.&lt;/LI&gt;
&lt;LI&gt;Choose the dispatch boundary that fits the workload: Kubernetes-native queueing with AKS autoscaling, or Service Bus plus KEDA.&lt;/LI&gt;
&lt;LI&gt;Isolate workload classes into different scaling lanes.&lt;/LI&gt;
&lt;LI&gt;Scale worker capacity from the most useful signal for the chosen model: workload pressure inside AKS or external queue backlog through KEDA.&lt;/LI&gt;
&lt;LI&gt;Put TLS termination, WAF inspection, and routing at the edge, and validate tokens in the API tier or a dedicated auth layer.&lt;/LI&gt;
&lt;LI&gt;Use workload identity and externalized secrets.&lt;/LI&gt;
&lt;LI&gt;Instrument both software behavior and GPU behavior.&lt;/LI&gt;
&lt;LI&gt;Keep deployments automated, traceable, and easy to roll back.&lt;/LI&gt;
&lt;/UL&gt;
&lt;P&gt;That is the real architecture story. The model can change. The runner can change. Even the queue and gateway choices can evolve. But the engineering fundamentals stay stable, and that stability is what makes diffusion workloads viable at scale.&lt;/P&gt;
&lt;H2&gt;Possible alternatives: KAITO and the AI toolchain operator add-on&lt;/H2&gt;
&lt;P&gt;Some teams do not need a fully custom GPU execution platform. On AKS, two adjacent options are &lt;A class="lia-external-url" href="https://learn.microsoft.com/en-us/azure/aks/aks-extension-kaito" target="_blank" rel="noopener"&gt;KAITO&lt;/A&gt; and the &lt;A class="lia-external-url" href="https://learn.microsoft.com/en-us/azure/aks/ai-toolchain-operator" target="_blank" rel="noopener"&gt;AI toolchain operator add-on&lt;/A&gt;.&lt;/P&gt;
&lt;P&gt;KAITO is the lighter-weight choice for rapid experimentation with supported model presets. The AI toolchain operator add-on is the more managed option for standardized LLM or multimodal serving with AKS-native operational features. Both are less suitable when the platform needs custom diffusion pipelines, queue-backed job orchestration, artifact-heavy workflows, or application-specific dispatch logic.&lt;/P&gt;
&lt;H2&gt;Reference documents&lt;/H2&gt;
&lt;UL&gt;
&lt;LI&gt;AKS node pools: &lt;A class="lia-external-url" href="https://learn.microsoft.com/en-us/azure/aks/create-node-pools" target="_blank" rel="noopener"&gt;Create node pools for a cluster in Azure Kubernetes Service (AKS)&lt;/A&gt;&lt;/LI&gt;
&lt;LI&gt;Microsoft Entra Workload ID for AKS:&amp;nbsp;&lt;A class="lia-external-url" href="https://learn.microsoft.com/en-us/azure/aks/workload-identity-overview" target="_blank" rel="noopener"&gt;Use Microsoft Entra Workload ID with Azure Kubernetes Service (AKS)&lt;/A&gt;&lt;/LI&gt;
&lt;LI&gt;Key Vault integration for AKS: &lt;A class="lia-external-url" href="https://learn.microsoft.com/en-us/azure/aks/csi-secrets-store-driver" target="_blank" rel="noopener"&gt;Use the Azure Key Vault provider for Secrets Store CSI Driver in an Azure Kubernetes Service (AKS) cluster&lt;/A&gt;&lt;/LI&gt;
&lt;LI&gt;Azure Blob and other CSI storage drivers on AKS: &lt;A class="lia-external-url" href="https://learn.microsoft.com/en-us/azure/aks/csi-storage-drivers" target="_blank" rel="noopener"&gt;Use Container Storage Interface (CSI) drivers on Azure Kubernetes Service (AKS)&lt;/A&gt;&lt;/LI&gt;
&lt;LI&gt;AKS GPU multi-instance support: &lt;A class="lia-external-url" href="https://learn.microsoft.com/en-us/azure/aks/gpu-multi-instance?tabs=azure-cli" target="_blank" rel="noopener"&gt;Use multi-instance GPUs in Azure Kubernetes Service (AKS)&lt;/A&gt;&lt;/LI&gt;
&lt;LI&gt;KEDA on AKS: &lt;A class="lia-external-url" href="https://learn.microsoft.com/en-us/azure/aks/keda-about" target="_blank" rel="noopener"&gt;Simplified application autoscaling with Kubernetes Event-driven Autoscaling (KEDA) add-on&lt;/A&gt;&lt;/LI&gt;
&lt;LI&gt;Application Gateway Ingress Controller:&amp;nbsp;&lt;A class="lia-external-url" href="https://learn.microsoft.com/en-us/azure/application-gateway/ingress-controller-overview" target="_blank" rel="noopener"&gt;What is Application Gateway Ingress Controller?&lt;/A&gt;&lt;/LI&gt;
&lt;LI&gt;AKS monitoring with Azure Monitor, managed Prometheus, and Grafana: &lt;A class="lia-external-url" href="https://learn.microsoft.com/en-us/azure/azure-monitor/containers/kubernetes-monitoring-enable" target="_blank" rel="noopener"&gt;Enable monitoring for Azure Kubernetes Service (AKS) clusters&lt;/A&gt;&lt;/LI&gt;
&lt;LI&gt;Application telemetry with Application Insights: &lt;A class="lia-external-url" href="https://learn.microsoft.com/en-us/azure/azure-monitor/app/app-insights-overview" target="_blank" rel="noopener"&gt;Introduction to Application Insights - OpenTelemetry observability&lt;/A&gt;&lt;/LI&gt;
&lt;LI&gt;Hugging Face Diffusers: &lt;A class="lia-external-url" href="https://huggingface.co/docs/diffusers/en/index" target="_blank" rel="noopener"&gt;Diffusers documentation&lt;/A&gt;&lt;/LI&gt;
&lt;LI&gt;NVIDIA DCGM exporter: &lt;A class="lia-external-url" href="https://github.com/NVIDIA/dcgm-exporter" target="_blank" rel="noopener"&gt;dcgm-exporter&lt;/A&gt;&lt;/LI&gt;
&lt;LI&gt;NVIDIA device plugin DaemonSet: &lt;A class="lia-external-url" href="https://github.com/NVIDIA/k8s-device-plugin" target="_blank" rel="noopener"&gt;NVIDIA k8s-device-plugin&lt;/A&gt;&lt;/LI&gt;
&lt;/UL&gt;
&lt;H2&gt;Closing thought&lt;/H2&gt;
&lt;P&gt;Running diffusion models in production is not mainly a model-hosting problem. It is a platform engineering problem with GPUs in the middle. Teams that treat AKS as the control surface for isolation, observability, identity, and repeatable rollout discipline end up with a system that can scale beyond a benchmark and survive real operational demand.&lt;/P&gt;</description>
      <pubDate>Thu, 30 Apr 2026 04:38:08 GMT</pubDate>
      <guid>https://techcommunity.microsoft.com/t5/azure-architecture-blog/running-diffusion-models-at-scale-on-aks/ba-p/4513687</guid>
      <dc:creator>PrabalDeb</dc:creator>
      <dc:date>2026-04-30T04:38:08Z</dc:date>
    </item>
    <item>
      <title>Transforming Video Content into Structured SOPs Using Graph-based RAG</title>
      <link>https://techcommunity.microsoft.com/t5/azure-architecture-blog/transforming-video-content-into-structured-sops-using-graph/ba-p/4515038</link>
      <description>&lt;H1&gt;Introduction&lt;/H1&gt;
&lt;P&gt;In today’s digital-first environments, a large portion of enterprise knowledge lives inside video content: training sessions, onboarding walkthroughs, and recorded operational procedures.&lt;/P&gt;
&lt;P&gt;While videos are great for learning, they are &lt;STRONG&gt;not ideal for quick reference, compliance, or repeatable processes&lt;/STRONG&gt;. Converting that knowledge into structured documentation like Standard Operating Procedures (SOPs) is often manual and time-consuming.&lt;/P&gt;
&lt;P&gt;What if this process could be automated using AI?&lt;/P&gt;
&lt;H2&gt;The Problem&lt;/H2&gt;
&lt;P&gt;Transcripts alone don’t solve the problem.&lt;/P&gt;
&lt;P&gt;When videos are converted into text, the output typically lacks:&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;Clear structure (sections, headings, hierarchy)&lt;/LI&gt;
&lt;LI&gt;Context (relationships between steps, tools, and roles)&lt;/LI&gt;
&lt;LI&gt;Completeness (definitions and dependencies spread across the content)&lt;/LI&gt;
&lt;/UL&gt;
&lt;P&gt;This leads to a common challenge:&lt;/P&gt;
&lt;P&gt;Teams spend significant effort manually reading transcripts, interpreting context, and restructuring them into usable documentation.&lt;/P&gt;
&lt;P&gt;As seen in modern architecture challenges, &lt;STRONG&gt;manual and repetitive configurations don’t scale well and increase maintenance effort&lt;/STRONG&gt;.&lt;/P&gt;
&lt;H2&gt;Enter Graph-based RAG (GraphRAG)&lt;/H2&gt;
&lt;P&gt;GraphRAG extends traditional RAG by building a &lt;STRONG&gt;knowledge graph&lt;/STRONG&gt; instead of treating content as disconnected chunks.&lt;/P&gt;
&lt;H3&gt;What GraphRAG Does&lt;/H3&gt;
&lt;UL&gt;
&lt;LI&gt;&lt;STRONG&gt;Extracts entities&lt;/STRONG&gt; (tools, systems, roles, concepts)&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;Maps relationships&lt;/STRONG&gt; between them&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;Groups related concepts into logical sections&lt;/STRONG&gt;&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;Preserves context across the entire document&lt;/STRONG&gt;&lt;/LI&gt;
&lt;/UL&gt;
&lt;H2&gt;Architecture Overview&lt;/H2&gt;
&lt;P&gt;Below is the high-level pipeline:&lt;/P&gt;
&lt;P&gt;&lt;STRONG&gt;Video → Transcription → Knowledge Graph → LLM Generation → Structured SOP&lt;/STRONG&gt;&lt;/P&gt;
&lt;img /&gt;
&lt;H2&gt;Implementation Approach (Step-by-Step)&lt;/H2&gt;
&lt;H3&gt;Stage 1: Knowledge Graph Construction&lt;/H3&gt;
&lt;OL&gt;
&lt;LI&gt;Convert video to transcript&lt;/LI&gt;
&lt;LI&gt;Split transcript into chunks&lt;/LI&gt;
&lt;LI&gt;Feed chunks into GraphRAG&lt;/LI&gt;
&lt;/OL&gt;
&lt;P&gt;GraphRAG performs:&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;&lt;STRONG&gt;Text Unit Extraction&lt;/STRONG&gt;&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;Entity Recognition&lt;/STRONG&gt;&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;Relationship Mapping&lt;/STRONG&gt;&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;Community Detection&lt;/STRONG&gt;&lt;/LI&gt;
&lt;/UL&gt;
&lt;P&gt;Result: a structured&amp;nbsp;&lt;STRONG&gt;knowledge graph representation of the transcript&lt;/STRONG&gt;.&lt;/P&gt;
&lt;img /&gt;
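&lt;P&gt;The chunking step in Stage 1 can be sketched as follows. This is a minimal illustration, not GraphRAG's own text-unit splitter: the helper name chunk_transcript, the word-based window, and the default sizes are all assumptions made for the example. Overlapping windows are one common way to avoid cutting a step description in half at a chunk boundary.&lt;/P&gt;

```python
def chunk_transcript(text, chunk_size=200, overlap=40):
    """Split a transcript into overlapping word windows (hypothetical helper)."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        window = words[start:start + chunk_size]
        chunks.append(" ".join(window))
        # Stop once the current window has reached the end of the transcript.
        if start + chunk_size >= len(words):
            break
    return chunks


# Tiny demonstration with a ten-word "transcript".
demo = " ".join("w" + str(i) for i in range(10))
demo_chunks = chunk_transcript(demo, chunk_size=4, overlap=1)
```

&lt;P&gt;Each chunk shares its first word with the last word of the previous chunk here; real pipelines typically size chunks in tokens rather than words.&lt;/P&gt;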
&lt;H3&gt;Stage 2: Structure Extraction&lt;/H3&gt;
&lt;P&gt;From the knowledge graph:&lt;/P&gt;
&lt;H4&gt;Sequential Steps&lt;/H4&gt;
&lt;UL&gt;
&lt;LI&gt;Preserve procedural flow from transcript order&lt;/LI&gt;
&lt;/UL&gt;
&lt;H4&gt;Logical Sections&lt;/H4&gt;
&lt;UL&gt;
&lt;LI&gt;Derived using community detection&lt;/LI&gt;
&lt;/UL&gt;
&lt;H4&gt;Key Concepts&lt;/H4&gt;
&lt;UL&gt;
&lt;LI&gt;Identified using graph centrality (importance via connections)&lt;/LI&gt;
&lt;/UL&gt;
&lt;P&gt;This creates a&amp;nbsp;&lt;STRONG&gt;framework for the SOP&lt;/STRONG&gt;.&lt;/P&gt;
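&lt;P&gt;Key-concept selection by centrality can be illustrated with degree centrality, the simplest centrality measure. The article does not say which measure is used, so treat this as a sketch under that assumption; the entity names, the edge list, and the function name are hypothetical.&lt;/P&gt;

```python
from collections import defaultdict


def degree_centrality(edges):
    """Return normalized degree centrality for an undirected edge list."""
    neighbors = defaultdict(set)
    for a, b in edges:
        neighbors[a].add(b)
        neighbors[b].add(a)
    n = len(neighbors)
    # Normalize by the maximum possible number of neighbors (n - 1).
    scale = n - 1 if n > 1 else 1
    return {node: len(adj) / scale for node, adj in neighbors.items()}


# Hypothetical entity relationships extracted from a transcript.
edges = [
    ("Operator", "Reactor"),
    ("Operator", "Checklist"),
    ("Operator", "Supervisor"),
    ("Reactor", "Checklist"),
]
scores = degree_centrality(edges)
# The most connected entities become candidate key concepts for the SOP.
key_concepts = sorted(scores, key=scores.get, reverse=True)[:2]
```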
&lt;H3&gt;Stage 3: Intelligent Document Generation&lt;/H3&gt;
&lt;P&gt;Using Azure OpenAI, each SOP section is generated:&lt;/P&gt;
&lt;DIV class="styles_lia-table-wrapper__h6Xo9 styles_table-responsive__MW0lN"&gt;&lt;table border="1" style="border-width: 1px;"&gt;&lt;tbody&gt;&lt;tr&gt;&lt;th&gt;Section&lt;/th&gt;&lt;th&gt;Generated From&lt;/th&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Title &amp;amp; Purpose&lt;/td&gt;&lt;td&gt;High-level concepts&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Scope&lt;/td&gt;&lt;td&gt;Entity boundaries&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Definitions&lt;/td&gt;&lt;td&gt;Entity descriptions&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Responsibilities&lt;/td&gt;&lt;td&gt;Role-based entities&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Procedures&lt;/td&gt;&lt;td&gt;Sequential steps&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;References&lt;/td&gt;&lt;td&gt;Linked content&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;colgroup&gt;&lt;col style="width: 50.00%" /&gt;&lt;col style="width: 50.00%" /&gt;&lt;/colgroup&gt;&lt;/table&gt;&lt;/DIV&gt;
&lt;P&gt;The key advantage:&amp;nbsp;&lt;STRONG&gt;the LLM is grounded in graph structure, not raw text&lt;/STRONG&gt;.&lt;/P&gt;
&lt;img /&gt;
&lt;H2&gt;Key Benefits&lt;/H2&gt;
&lt;UL&gt;
&lt;LI&gt;Context Preservation - Relationships between concepts are maintained across sections.&lt;/LI&gt;
&lt;LI&gt;Comprehensive Coverage - Community detection ensures important topics are not missed.&lt;/LI&gt;
&lt;LI&gt;Reduced Hallucination - LLM generation is grounded in structured knowledge.&lt;/LI&gt;
&lt;LI&gt;Scalability - Works for 30-minute tutorials, 3-hour training sessions, and enterprise knowledge bases.&lt;/LI&gt;
&lt;/UL&gt;
&lt;H2&gt;Real-World Impact (Example)&lt;/H2&gt;
&lt;P&gt;In enterprise scenarios like pharmaceutical SOP generation:&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;&lt;STRONG&gt;Processing time:&lt;/STRONG&gt; ~15–20 minutes for a multi-hour video&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;Output quality:&lt;/STRONG&gt; 8–10 structured SOP sections&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;Consistency:&lt;/STRONG&gt; Terminology and relationships preserved&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;Coverage:&lt;/STRONG&gt; Minimal missing topics&lt;/LI&gt;
&lt;/UL&gt;
&lt;H2&gt;Where This Approach Works Best&lt;/H2&gt;
&lt;UL&gt;
&lt;LI&gt;Training videos → SOPs&lt;/LI&gt;
&lt;LI&gt;Meeting recordings → action summaries&lt;/LI&gt;
&lt;LI&gt;Technical demos → documentation&lt;/LI&gt;
&lt;LI&gt;Interview recordings → knowledge bases&lt;/LI&gt;
&lt;LI&gt;Tutorials → reference guides&lt;/LI&gt;
&lt;/UL&gt;
&lt;H2&gt;Key Takeaway&lt;/H2&gt;
&lt;P&gt;This approach represents a &lt;STRONG&gt;shift from text processing → knowledge understanding&lt;/STRONG&gt;.&lt;/P&gt;
&lt;P&gt;By combining:&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;&lt;STRONG&gt;Knowledge graphs (structure)&lt;/STRONG&gt;&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;LLMs (language generation)&lt;/STRONG&gt;&lt;/LI&gt;
&lt;/UL&gt;
&lt;P&gt;We can transform &lt;STRONG&gt;raw, unstructured content into usable, enterprise-grade knowledge assets&lt;/STRONG&gt;.&lt;/P&gt;
&lt;H2&gt;Resources&lt;/H2&gt;
&lt;UL&gt;
&lt;LI&gt;&lt;A class="lia-external-url" href="https://microsoft.github.io/graphrag/index/overview/" target="_blank" rel="noopener"&gt;https://microsoft.github.io/graphrag/index/overview/&lt;/A&gt;&lt;/LI&gt;
&lt;/UL&gt;
&lt;H2&gt;Final Thoughts&lt;/H2&gt;
&lt;P&gt;Have you explored &lt;STRONG&gt;GraphRAG or similar approaches&lt;/STRONG&gt; in your projects?&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;What challenges did you face?&lt;/LI&gt;
&lt;LI&gt;How did you handle unstructured knowledge?&lt;/LI&gt;
&lt;/UL&gt;
&lt;P&gt;Share your experiences — let’s learn together.&lt;/P&gt;</description>
      <pubDate>Wed, 29 Apr 2026 01:04:53 GMT</pubDate>
      <guid>https://techcommunity.microsoft.com/t5/azure-architecture-blog/transforming-video-content-into-structured-sops-using-graph/ba-p/4515038</guid>
      <dc:creator>dikshashakya</dc:creator>
      <dc:date>2026-04-29T01:04:53Z</dc:date>
    </item>
    <item>
      <title>Flexible Cooling for AI Growth: How Zonal Architecture Supports Diverse Hardware Needs</title>
      <link>https://techcommunity.microsoft.com/t5/azure-architecture-blog/flexible-cooling-for-ai-growth-how-zonal-architecture-supports/ba-p/4514042</link>
      <description>&lt;P&gt;&lt;STRONG&gt;By: Ricardo Bianchini, Steve Solomon, Brijesh Warrier, Martin Herbert, Jay Jochim, Husam Alissa, Pulkit Misra, Eric Peterson and Cam Turner&lt;/STRONG&gt;&lt;/P&gt;
&lt;H3&gt;Context&lt;/H3&gt;
&lt;P&gt;Microsoft is pioneering zonal cooling in its next-generation AI datacenters, enabling flexible, performant, efficient, and sustainable thermal management for diverse workloads.&lt;/P&gt;
&lt;P&gt;The unprecedented growth of artificial intelligence (AI) is transforming datacenter infrastructure. Modern facilities must now support a diverse array of IT equipment, each with distinct cooling requirements. For example, modern GPUs and other AI accelerators require liquid cooling: at power draws exceeding 1 kW per accelerator, air cooling is impractical because air's limited heat capacity cannot carry away the resulting thermal load. Meanwhile, general-purpose (i.e., non-AI-accelerator) hardware such as CPU-based compute, storage, and networking is expected to remain mostly air-cooled for the foreseeable future.&lt;/P&gt;
&lt;P&gt;Furthermore, liquid cooling offers a significant efficiency advantage: its superior heat dissipation allows coolant supply temperatures at the chip as high as 45&amp;nbsp;°C without sacrificing peak performance. In contrast, air-cooled equipment requires much lower supply temperatures, around 30&amp;nbsp;°C, for optimal efficiency.&lt;/P&gt;
&lt;img /&gt;
&lt;P&gt;The divergence in hardware cooling requirements creates a complex landscape that demands a strategy that is both flexible and adaptive. As shown in Figure 1, relying on a unified facility water system (FWS) introduces major inefficiencies. For example, liquid-cooled GPU racks may receive coolant below their required operating temperature when served by a single-temperature loop. This inefficiency becomes even more pronounced as the proportion of liquid- to air-cooled equipment increases (e.g., &lt;A href="https://buy.hpe.com/us/en/compute/rack-scale-system/nvidia-nvl-system/nvidia-gb300-nvl72-by-hpe/p/1014890105" target="_blank" rel="noopener"&gt;90:10 liquid-to-air ratio for NVIDIA GB300 servers&lt;/A&gt;) since a larger share of the equipment is unnecessarily overcooled.&lt;/P&gt;
&lt;P&gt;Beyond operational efficiency, sustainability is a key priority for Microsoft even as we grow our AI infrastructure. Among our &lt;A href="https://www.microsoft.com/en-us/corporate-responsibility/sustainability" target="_blank" rel="noopener"&gt;sustainability commitments&lt;/A&gt;, Microsoft has set goals to become carbon negative and &lt;A href="https://www.microsoft.com/en-us/microsoft-cloud/blog/2024/12/09/sustainable-by-design-next-generation-datacenters-consume-zero-water-for-cooling/" target="_blank" rel="noopener"&gt;eliminate water evaporation&lt;/A&gt; as a cooling method in its next-generation datacenters. A key lever for reducing carbon emissions is improving PUE (&lt;A href="https://datacenters.microsoft.com/sustainability/efficiency/" target="_blank" rel="noopener"&gt;Power Usage Effectiveness&lt;/A&gt;, i.e., total power divided by IT power), a standard measure of datacenter power and energy efficiency. Achieving this requires dynamically matching cooling delivery to the specific needs of each equipment type, ensuring optimal performance, reduced energy consumption, and enhanced sustainability.&lt;/P&gt;
&lt;H3&gt;Zonal Cooling: Flexible by Design&lt;/H3&gt;
&lt;P&gt;Zonal cooling is a facility design that introduces multiple independent water loops, each supplying coolant at a different temperature. Figure 2 illustrates a specific implementation of the zonal concept with two facility-level zones: one loop serves air-cooled equipment, maintaining lower temperatures for human comfort and general-purpose hardware, and the other loop serves liquid-cooled AI accelerators, which can operate efficiently at higher supply temperatures. This separation enables datacenter operators to precisely match cooling supply to the requirements of each zone, avoiding the inefficiency of over-cooling all equipment to the lowest common denominator.&lt;/P&gt;
&lt;img /&gt;
&lt;P&gt;A key strength of zonal cooling is its flexibility. As new generations of IT hardware emerge, with varying thermal profiles, zonal cooling allows datacenters to adapt without major infrastructure overhauls. For example, future AI accelerators may need different liquid temperature ranges (see&amp;nbsp;&lt;A href="https://www.opencompute.org/documents/30-coolant-a-durable-roadmap-for-the-future-rev1-0-pdf" target="_blank" rel="noopener"&gt;30℃ Coolant - A Durable Roadmap for the Future&lt;/A&gt;) or technological improvements, such as &lt;A href="https://news.microsoft.com/source/features/innovation/microfluidics-liquid-cooling-ai-chips/" target="_blank" rel="noopener"&gt;microfluidics&lt;/A&gt;, may enable operating at even higher coolant temperatures, while general-purpose equipment requirements may remain unchanged. Zonal cooling’s architecture supports these changes by enabling operators to adjust loop temperatures and reconfigure cooling assignments as needed.&lt;/P&gt;
&lt;H3&gt;Forms of Zonal Cooling&lt;/H3&gt;
&lt;P&gt;Liquid cooling expands the allowable coolant supply temperature range and enables temperature-specific zones. This zonal approach can be applied at multiple layers:&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;&lt;STRONG&gt;Facility-level&lt;/STRONG&gt;: Two distinct temperature zones within a datacenter—one for air-cooled equipment and another for liquid-cooled equipment.&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;Row-level&lt;/STRONG&gt;: Tailor coolant temperature for each row based on deployed hardware (e.g., general-purpose vs GPU servers).&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;Rack-level&lt;/STRONG&gt;: Enable multiple temperature zones within a single rack for fine-grained optimization across servers.&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;Chip-level&lt;/STRONG&gt;: Apply zonal cooling inside the server. For example, use colder coolant for a GPU’s high-bandwidth memory (HBM) while supplying warmer coolant for the SoC and CPUs. This fine-grained approach can enable higher HBM stacking for improved performance, while avoiding unnecessary cooling overhead.&lt;/LI&gt;
&lt;/UL&gt;
&lt;P&gt;Microsoft is building facility-level zonal cooling in the next generation of its AI datacenters going live in 2028 and beyond, while exploring the other three approaches in the lab. Facility-level zonal cooling is expected to reduce PUEs by up to 10%.&lt;/P&gt;
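&lt;P&gt;To make the PUE arithmetic concrete, here is a small worked sketch. The load figures are invented for illustration and are not Microsoft measurements. It also assumes that "up to 10%" means a relative reduction in the PUE value itself, which is one reading of the statement above.&lt;/P&gt;

```python
def pue(total_facility_kw, it_kw):
    """Power Usage Effectiveness: total facility power divided by IT power."""
    if not it_kw > 0:
        raise ValueError("IT power must be positive")
    return total_facility_kw / it_kw


# Hypothetical facility: 10 MW of IT load, 13 MW total draw.
it_load_kw = 10_000.0
baseline_total_kw = 13_000.0

baseline_pue = pue(baseline_total_kw, it_load_kw)

# Assumed interpretation: a 10% relative reduction in the PUE value.
improved_pue = baseline_pue * (1 - 0.10)

# Facility power no longer needed at the same IT load.
saved_kw = (baseline_pue - improved_pue) * it_load_kw
```

&lt;P&gt;Under these assumed numbers the PUE moves from 1.30 to about 1.17, freeing roughly 1,300 kW at the same IT load. That headroom is what the density benefit described in the next section relies on.&lt;/P&gt;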
&lt;H3&gt;Benefits from Zonal Cooling&lt;/H3&gt;
&lt;P&gt;Zonal cooling is a strategic enabler for performance and efficiency. It can deliver:&lt;/P&gt;
&lt;OL&gt;
&lt;LI&gt;&lt;STRONG&gt;Improved energy efficiency and sustainability: &lt;/STRONG&gt;By reducing the load on datacenter cooling infrastructure, zonal cooling improves energy efficiency as captured by annualized PUE, which measures average efficiency across all operating conditions. Lower annualized PUE means energy savings and lower carbon emissions.&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;Increased server density:&lt;/STRONG&gt; Tailored zonal cooling reduces peak cooling power demand during the hottest days, which in turn lowers peak PUE. Designers can leverage this reduction to reserve power for lower water temperatures (anticipating future accelerator needs), add more servers within the same utility power envelope, or contract less utility power per datacenter.&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;Higher performance: &lt;/STRONG&gt;Strategic control of coolant temperatures unlocks higher chip performance without sacrificing efficiency. For example, colder loops allow GPUs and CPUs to sustain elevated clock speeds via safe overclocking, while optimized memory cooling supports greater stacking density and increased bandwidth.&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;Improved flexibility: &lt;/STRONG&gt;With independent zones, operators can easily adjust coolant supply temperatures or reconfigure zones as new generations of hardware with varied cooling requirements emerge. This flexibility ensures compatibility with future innovations while maintaining optimal performance.&lt;/LI&gt;
&lt;/OL&gt;
&lt;H3&gt;Looking Ahead&lt;/H3&gt;
&lt;P&gt;Zonal cooling represents a paradigm shift in datacenter thermal management. Its flexible, zone-specific approach to cooling air- and liquid-cooled IT equipment positions datacenters to efficiently adapt to future hardware innovations and workload diversity. As the industry continues to push boundaries in performance and sustainability, zonal cooling will be a foundational strategy for building performant and efficient infrastructure that meets tomorrow’s challenges.&lt;/P&gt;</description>
      <pubDate>Fri, 29 May 2026 20:01:33 GMT</pubDate>
      <guid>https://techcommunity.microsoft.com/t5/azure-architecture-blog/flexible-cooling-for-ai-growth-how-zonal-architecture-supports/ba-p/4514042</guid>
      <dc:creator>stsolo</dc:creator>
      <dc:date>2026-05-29T20:01:33Z</dc:date>
    </item>
    <item>
      <title>Modernizing Industrial Safety and Inspection with AI-Driven Drone Automation</title>
      <link>https://techcommunity.microsoft.com/t5/azure-architecture-blog/modernizing-industrial-safety-and-inspection-with-ai-driven/ba-p/4514284</link>
      <description>&lt;P&gt;In large-scale manufacturing and infrastructure environments, maintaining structural integrity is a continuous operational challenge. Industrial facilities—from automotive plants to energy and infrastructure sites—depend on thousands of structural connection points such as bolts and fasteners to ensure safe and reliable operations. Over time, vibration, thermal cycling, and mechanical stress can cause these components to loosen or degrade.&lt;/P&gt;
&lt;P&gt;While drones have dramatically improved how inspection data is captured, the analysis of that data—often involving thousands of connection points—remains largely manual. Engineers frequently review footage frame by frame, making the process labor-intensive, inconsistent, difficult to scale, and often reactive rather than predictive.&lt;/P&gt;
&lt;P&gt;Bolt inspection is one example of a broader category of high-volume, repetitive visual inspections that are critical for safety but challenging to execute consistently. Environmental factors such as lighting variation, shadows, camera angles, image resolution, and marking inconsistencies further complicate automation.&lt;/P&gt;
&lt;P&gt;This creates a clear opportunity for transformation through AI. By combining deterministic computer vision models with Generative AI reasoning capabilities, organizations can move beyond manual review toward scalable, intelligent inspection systems. Computer vision provides precise detection and measurement, while Generative AI enhances interpretation, contextual validation, and cross-frame reasoning. Together, they enable more robust defect identification and operational insight.&lt;/P&gt;
&lt;P&gt;This article presents a validated architecture and practical lessons learned from implementing an AI-driven drone inspection solution. While bolt integrity inspection serves as a representative example, the architecture and approach apply broadly across industrial safety and infrastructure monitoring scenarios.&lt;/P&gt;
&lt;H2&gt;&lt;STRONG&gt;The Evolution from GenAI Approach to Deterministic Precision&lt;/STRONG&gt;&lt;/H2&gt;
&lt;P&gt;Starting with a Generative AI–driven approach to capture and reason over bolt frames is a fundamentally more effective strategy for this problem space. It accelerates early-stage detection of degraded bolts without requiring large labeled datasets, while simultaneously enabling structured data collection needed to train deterministic machine learning models—which typically require tens of thousands of images.&lt;/P&gt;
&lt;P&gt;This approach delivers immediate value by rapidly identifying relevant visual signals in drone footage and uncovering key factors that influence detection accuracy, such as lighting, angle, and alignment. At the same time, it naturally builds the dataset necessary to transition toward a more scalable and repeatable solution.&lt;/P&gt;
&lt;P&gt;However, it also makes clear that while Generative AI is powerful for contextual reasoning across frames, it is inherently non-deterministic and sensitive to input variability. For enterprise-grade reliability, precision, and repeatability, a complementary approach is required.&lt;/P&gt;
&lt;P&gt;The optimal solution is a hybrid model that combines the strengths of both:&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;&lt;STRONG&gt;Computer Vision machine learning models&lt;/STRONG&gt; provide precise, consistent detection and measurement of structural features at scale.&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;Generative AI&lt;/STRONG&gt; adds contextual reasoning across bolt frames, validates consistency, and interprets ambiguous or borderline defects.&lt;/LI&gt;
&lt;/UL&gt;
&lt;P&gt;&lt;STRONG&gt;Together&lt;/STRONG&gt;, they form a superior system—delivering higher accuracy, reduced ambiguity, and stronger context awareness in complex real-world conditions.&lt;/P&gt;
&lt;H3&gt;&lt;STRONG&gt;AI cannot compensate for inconsistent input data.&lt;/STRONG&gt;&lt;/H3&gt;
&lt;P&gt;Standardized data capture and operational discipline remain prerequisites for reliable automation.&lt;/P&gt;
&lt;img /&gt;
&lt;H3&gt;&lt;STRONG&gt;Solution Components and Architecture&lt;/STRONG&gt;&lt;/H3&gt;
&lt;P&gt;The proposed solution follows a modular, event-driven architecture that combines computer vision and Generative AI to enable scalable, intelligent inspection workflows. At a high level, inspection videos are ingested, processed through deterministic computer vision models for detection and measurement, and enhanced with Generative AI for contextual reasoning and validation. The results are evaluated, stored, and surfaced through analytics platforms to support operational decision-making.&lt;/P&gt;
&lt;img /&gt;
&lt;P&gt;The first diagram provides a system-level view of how core Azure services interact—from data ingestion and model execution to evaluation, storage, and reporting. It highlights the integration of the computer vision pipeline (Azure AI Vision and Azure Machine Learning), the Generative AI reasoning layer (Azure OpenAI), and downstream analytics (Cosmos DB and Power BI), enabling a scalable and flexible architecture.&lt;/P&gt;
&lt;img /&gt;
&lt;P&gt;The second diagram illustrates the step-by-step execution flow across the system. The process begins when a drone operator uploads inspection video to Azure Blob Storage, triggering an event-driven workflow via Azure Functions. Frames are extracted and passed through a quality gate to filter out low-quality data. Valid frames are then processed by computer vision models (&lt;STRONG&gt;Azure AI Vision and Azure Machine Learning&lt;/STRONG&gt;) to detect and track bolts, generate bounding boxes, and perform deterministic alignment measurements.&lt;/P&gt;
&lt;P&gt;These outputs are further enhanced by a Generative AI layer (Azure OpenAI), which applies contextual reasoning across frames to validate anomalies, reduce false positives, and generate structured summaries. The results are evaluated using Azure AI Foundry to ensure quality, consistency, and reliability before being stored in Cosmos DB. Finally, Power BI dashboards surface insights, trends, and alerts for operational use.&lt;/P&gt;
&lt;P&gt;Throughout the pipeline, built-in feedback loops—such as quality filtering, evaluation checks, and quarantine mechanisms—ensure that only high-confidence results are retained, enabling a reliable and production-ready inspection system.&lt;/P&gt;
&lt;OL&gt;
&lt;LI&gt;&lt;STRONG&gt;Azure Blob Storage&lt;/STRONG&gt;&lt;BR /&gt;Primary storage for raw videos, extracted frames, labeled datasets, and model artifacts.&lt;BR /&gt;Serves as the ingestion and archival layer for inspection data and training pipelines.&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;Azure Functions&lt;/STRONG&gt;&lt;BR /&gt;Serverless event-driven compute used to trigger workflows from video uploads, inspection events, or user actions.&lt;BR /&gt;Handles orchestration, preprocessing, and integration between AI services while maintaining lightweight, scalable execution.&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;Azure Machine Learning (Azure ML Studio)&lt;/STRONG&gt;&lt;BR /&gt;End-to-end development platform for training, testing, and deploying custom machine learning and computer vision models, as well as Generative AI and evaluation workflows.
&lt;UL&gt;
&lt;LI&gt;&lt;STRONG&gt;Quality Gate (Frame Filtering)&lt;/STRONG&gt;&lt;BR /&gt;Captured video passes through an automated quality gate that removes frames with blur, glare, poor lighting, or unfavorable angles. This ensures that only high-quality, inspection-grade frames are used, safeguarding model accuracy.&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;Bolt Detection (CV Model)&lt;/STRONG&gt;&lt;BR /&gt;Detects and localizes bolts in each frame with bounding boxes, confidence scores, and coarse defect signals (e.g., using YOLO or RT-DETR).&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;Bolt Identification &amp;amp; Tracking (CV + Logic)&lt;/STRONG&gt;&lt;BR /&gt;Maintains consistent bolt identity across frames using spatial context or markers (e.g., AprilTags), enabling longitudinal tracking.&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;Deterministic Measurement (CV + Geometry)&lt;/STRONG&gt;&lt;BR /&gt;Computes precise alignment or rotation using geometric analysis, with threshold-based evaluation for repeatable, auditable results.&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;Contextual Validation &amp;amp; Reporting (GenAI Layer)&lt;/STRONG&gt;&lt;BR /&gt;Applies cross-frame reasoning to validate results, resolve ambiguities, improve accuracy, reduce false positives, and generate a structured, human-readable summary report of inspection findings.&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;Azure AI Evaluation Metrics (Microsoft Foundry)&lt;/STRONG&gt;&lt;BR /&gt;Ensures the quality, reliability, and compliance of generative AI outputs by evaluating key dimensions such as:
&lt;UL&gt;
&lt;LI&gt;&lt;STRONG&gt;Groundedness&lt;/STRONG&gt; – Verifies that the generated summary and reasoning are based on actual frames and inspection measurements.&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;Coherence&lt;/STRONG&gt; – Assesses logical consistency across frames and throughout the report, ensuring observations and conclusions align.&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;Fluency&lt;/STRONG&gt; – Measures clarity, readability, and professional language in the human-readable summary report.&lt;BR /&gt;These metrics act as guardrails to maintain enterprise-grade accuracy, trustworthiness, and compliance in all AI-generated inspection insights.&lt;/LI&gt;
&lt;/UL&gt;
&lt;/LI&gt;
&lt;/UL&gt;
&lt;/LI&gt;
&lt;/OL&gt;
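&lt;P&gt;The deterministic measurement step described above can be illustrated with a minimal sketch. It assumes each bolt carries a painted witness mark, that an upstream detector supplies pixel coordinates for the bolt center and the mark, and that a tolerance of 2 degrees applies. None of these details come from the original solution; the function names and the tolerance are hypothetical.&lt;/P&gt;

```python
import math

# Hypothetical tolerance: maximum allowed rotation from the baseline, in degrees.
ROTATION_TOLERANCE_DEG = 2.0


def mark_angle_deg(center, mark):
    """Angle of the witness mark relative to the bolt center, in degrees."""
    dx = mark[0] - center[0]
    dy = mark[1] - center[1]
    return math.degrees(math.atan2(dy, dx))


def rotation_delta_deg(baseline_deg, current_deg):
    """Smallest signed difference between two angles, in the range [-180, 180)."""
    return (current_deg - baseline_deg + 180.0) % 360.0 - 180.0


def evaluate_bolt(center, baseline_mark, current_mark,
                  tolerance_deg=ROTATION_TOLERANCE_DEG):
    """Return an auditable record: measured rotation plus a threshold verdict."""
    delta = rotation_delta_deg(
        mark_angle_deg(center, baseline_mark),
        mark_angle_deg(center, current_mark),
    )
    return {
        "rotation_deg": round(delta, 2),
        "tolerance_deg": tolerance_deg,
        "within_tolerance": tolerance_deg >= abs(delta),
    }
```

&lt;P&gt;Because the result depends only on geometry and a fixed threshold, the same inputs always produce the same record, which is what makes this stage repeatable and auditable in a way a generative model is not.&lt;/P&gt;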
&lt;OL start="4"&gt;
&lt;LI&gt;&lt;STRONG&gt; Azure Cosmos DB&lt;/STRONG&gt;&lt;BR /&gt;Globally distributed NoSQL database for storing structured inspection results, metadata, agent memory, and historical asset data.&lt;BR /&gt;Enables longitudinal tracking, contextual retrieval, and scalable real-time querying.&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt; Power BI&lt;/STRONG&gt;&lt;BR /&gt;Business intelligence and visualization platform used to monitor inspection results, trends, and operational KPIs.&lt;BR /&gt;Provides dashboards for maintenance teams, reliability engineers, and leadership decision-making.&lt;/LI&gt;
&lt;/OL&gt;
&lt;H3&gt;&lt;STRONG&gt;Security and Enterprise Considerations&lt;/STRONG&gt;&lt;/H3&gt;
&lt;OL&gt;
&lt;LI&gt;&lt;STRONG&gt; Azure Blob Storage: &lt;/STRONG&gt;Storage accounts can be secured by minimizing public exposure, enforcing strong identity‑based access, protecting data, and continuously monitoring for threats. Organizations should use &lt;A href="https://learn.microsoft.com/en-us/azure/private-link/private-endpoint-overview" target="_blank" rel="noopener"&gt;Private Endpoints&lt;/A&gt; and disable public network access wherever possible, authenticate users and applications with Microsoft Entra ID instead of shared keys, and apply least‑privilege Azure RBAC with managed identities. Data should be encrypted in transit (TLS 1.2+) and at rest using Microsoft‑managed or customer‑managed keys stored in &lt;A href="https://learn.microsoft.com/en-us/azure/key-vault/general/overview" target="_blank" rel="noopener"&gt;Azure Key Vault&lt;/A&gt;, while &lt;A href="https://learn.microsoft.com/en-us/azure/defender-for-cloud/defender-for-storage-introduction" target="_blank" rel="noopener"&gt;Microsoft Defender for Storage&lt;/A&gt;, logging, soft delete, backups, and Azure Policy should be enabled to &lt;A href="https://learn.microsoft.com/en-us/azure/defender-for-cloud/enable-defender-for-storage-data-sensitivity" target="_blank" rel="noopener"&gt;detect threats&lt;/A&gt;, support recovery, and enforce compliance at scale. 
&lt;A style="font-style: normal; font-weight: 400; background-color: rgb(255, 255, 255);" href="https://learn.microsoft.com/en-us/azure/ai-services/content-safety/quickstart-image?tabs=visual-studio%2Cwindows&amp;amp;pivots=programming-language-foundry-portal" target="_blank" rel="noopener"&gt;Content Safety&lt;/A&gt;&lt;SPAN style="color: rgb(30, 30, 30);"&gt; can be called from the application layer to block uploads based on &lt;/SPAN&gt;&lt;A style="font-style: normal; font-weight: 400; background-color: rgb(255, 255, 255);" href="https://learn.microsoft.com/en-us/azure/ai-services/content-safety/concepts/harm-categories?tabs=warning" target="_blank" rel="noopener"&gt;image content&lt;/A&gt;&lt;SPAN style="color: rgb(30, 30, 30);"&gt;. Staging containers can be used to isolate untrusted uploads. Content Safety provides signals; your app enforces policy.&lt;BR /&gt;&lt;BR /&gt;&lt;/SPAN&gt;&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt; Azure AI Vision / Computer Vision (Custom Vision or Vision models): &lt;/STRONG&gt;Azure AI Vision supports enterprise-grade security through Microsoft Entra ID–based authentication and &lt;A href="https://learn.microsoft.com/en-us/azure/role-based-access-control/overview" target="_blank" rel="noopener"&gt;Azure Role-Based Access Control (RBAC)&lt;/A&gt;, ensuring only authorized users, applications, and services can access vision models and image data. Network isolation can be enforced using &lt;A href="https://learn.microsoft.com/en-us/azure/virtual-network/quickstart-create-virtual-network?tabs=portal" target="_blank" rel="noopener"&gt;Virtual Network (VNet)&lt;/A&gt; integration and &lt;A href="https://learn.microsoft.com/en-us/azure/private-link/private-link-overview" target="_blank" rel="noopener"&gt;Private Link&lt;/A&gt; to restrict public internet exposure and ensure traffic remains within secure enterprise boundaries. All data transmitted to and from Azure AI Vision is encrypted in transit using TLS 1.2+ and encrypted at rest using Microsoft-managed keys or optional Customer-Managed Keys (CMKs).&lt;BR /&gt;&lt;BR /&gt;For threat detection and monitoring, Microsoft &lt;A style="font-style: normal; font-weight: 400; background-color: rgb(255, 255, 255);" href="https://learn.microsoft.com/en-us/azure/defender-for-cloud/defender-for-cloud-introduction" target="_blank" rel="noopener"&gt;Defender for Cloud&lt;/A&gt;&lt;SPAN style="color: rgb(30, 30, 30);"&gt; provides security posture visibility and anomaly detection across AI workloads. 
Integration with &lt;/SPAN&gt;&lt;A style="font-style: normal; font-weight: 400; background-color: rgb(255, 255, 255);" href="https://docs.azure.cn/en-us/purview/purview" target="_blank" rel="noopener"&gt;Microsoft Purview&lt;/A&gt;&lt;SPAN style="color: rgb(30, 30, 30);"&gt; enables classification and protection of sensitive image or inspection data, ensuring compliance with enterprise data governance policies.&lt;/SPAN&gt;&lt;/LI&gt;
&lt;/OL&gt;
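&lt;P&gt;The staging-container pattern mentioned for Azure Blob Storage above, where Content Safety provides signals and the application enforces policy, might look like the sketch below. It assumes the application has already extracted a per-category severity value from the service response into a plain dictionary. The threshold, the function name, and the action labels are hypothetical.&lt;/P&gt;

```python
# Hypothetical policy: reject when any category reaches this severity.
BLOCK_AT_SEVERITY = 4


def decide_upload(severity_by_category, block_at=BLOCK_AT_SEVERITY):
    """Map content-safety signals to an action for a blob in the staging container."""
    flagged = sorted(
        category
        for category, severity in severity_by_category.items()
        if severity >= block_at
    )
    if flagged:
        # The blob stays out of the trusted container; the caller logs and deletes it.
        return {"action": "reject", "flagged": flagged}
    # The caller copies the blob from staging to the trusted container.
    return {"action": "promote", "flagged": []}
```

&lt;P&gt;Keeping the decision in application code means the threshold can be tuned per workload without changing how the service is called.&lt;/P&gt;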
&lt;OL start="3"&gt;
&lt;LI&gt;&lt;STRONG&gt; Azure Machine Learning (Azure ML): &lt;/STRONG&gt;Azure Machine Learning provides a secure environment for training, testing, and deploying machine learning and computer vision models. Access control is managed through &lt;A href="https://learn.microsoft.com/en-us/entra/identity/" target="_blank" rel="noopener"&gt;Entra ID&lt;/A&gt; and Azure RBAC, enabling granular permissions for data scientists, engineers, and automated services. Managed Identities allow secure service-to-service authentication without exposing credentials.&lt;BR /&gt;&lt;BR /&gt;For network security, Azure ML supports Virtual Network isolation, &lt;A style="font-style: normal; font-weight: 400; background-color: rgb(255, 255, 255);" href="https://learn.microsoft.com/en-us/azure/private-link/private-endpoint-overview" target="_blank" rel="noopener"&gt;Private Link endpoints&lt;/A&gt;&lt;SPAN style="color: rgb(30, 30, 30);"&gt;, and managed network configurations to prevent unauthorized external access. Data used for model training and inference is encrypted in transit and at rest, with support for Customer-Managed Keys (CMKs) stored in &lt;/SPAN&gt;&lt;A style="font-style: normal; font-weight: 400; background-color: rgb(255, 255, 255);" href="https://learn.microsoft.com/en-us/azure/key-vault/general/overview" target="_blank" rel="noopener"&gt;Azure Key Vault&lt;/A&gt;&lt;SPAN style="color: rgb(30, 30, 30);"&gt; for enhanced control.&lt;BR /&gt;&lt;BR /&gt;&lt;/SPAN&gt;Microsoft &lt;A style="font-style: normal; font-weight: 400; background-color: rgb(255, 255, 255);" href="https://learn.microsoft.com/en-us/azure/defender-for-cloud/defender-for-cloud-introduction" target="_blank" rel="noopener"&gt;Defender for Cloud&lt;/A&gt;&lt;SPAN style="color: rgb(30, 30, 30);"&gt; provides threat detection and vulnerability management across compute instances, endpoints, and model deployments. 
&lt;/SPAN&gt;&lt;A style="font-style: normal; font-weight: 400; background-color: rgb(255, 255, 255);" href="https://learn.microsoft.com/en-us/azure/governance/policy/" target="_blank" rel="noopener"&gt;Azure Policy&lt;/A&gt;&lt;SPAN style="color: rgb(30, 30, 30);"&gt; ensures compliance by auditing and enforcing security configurations across ML workspaces. Additionally, model versioning and governance features support traceability and auditability for safety-critical AI deployments.&lt;/SPAN&gt;&lt;/LI&gt;
&lt;/OL&gt;
&lt;OL start="4"&gt;
&lt;LI&gt;&lt;STRONG&gt;Azure Functions: &lt;/STRONG&gt;Azure Functions can be secured by using &lt;A href="https://learn.microsoft.com/en-us/entra/identity/" target="_blank" rel="noopener"&gt;Entra ID&lt;/A&gt; authentication and managed identities instead of keys or embedded secrets, and by enforcing least‑privilege access through &lt;A href="https://learn.microsoft.com/en-us/azure/role-based-access-control/built-in-roles" target="_blank" rel="noopener"&gt;Azure RBAC&lt;/A&gt;. Network exposure should be minimized by enabling HTTPS‑only access, using private endpoints, IP restrictions, and VNet integration where appropriate. Sensitive data and credentials should be stored in &lt;A href="https://learn.microsoft.com/en-us/azure/key-vault/general/overview" target="_blank" rel="noopener"&gt;Azure Key Vault&lt;/A&gt;, with encryption enforced both in transit and at rest. Function apps should be hardened by keeping runtimes and dependencies up to date, disabling unused features, and enforcing secure configurations with Azure Policy. Ongoing protection relies on Azure Monitor, Application Insights, &lt;A href="https://learn.microsoft.com/en-us/azure/defender-for-cloud/defender-for-cloud-introduction" target="_blank" rel="noopener"&gt;Defender for Cloud&lt;/A&gt;, and centralized logging or SIEM integration to detect threats and misconfigurations, along with regular vulnerability management, backups, and governance practices to maintain resilience and compliance.&lt;/LI&gt;
&lt;/OL&gt;
&lt;OL start="5"&gt;
&lt;LI&gt;&lt;STRONG&gt;Azure OpenAI (GPT-4o / GPT-4o mini):&amp;nbsp;&lt;/STRONG&gt;&lt;A href="https://learn.microsoft.com/en-us/security/benchmark/azure/mcsb-v2-artificial-intelligence-security" target="_blank" rel="noopener"&gt;Govern&lt;/A&gt; which models are approved for use and protect model artifacts and training data from unauthorized access through strong identity, network, encryption, and logging controls. AI applications should be designed with layered defenses, including multi‑stage content filtering, safety meta‑prompts, and least‑privilege permissions for agents and plugins to reduce the risk of prompt injection, data leakage, and unintended actions. High‑risk AI operations should include human‑in‑the‑loop review to prevent autonomous execution of harmful or incorrect outcomes. Organizations must continuously monitor AI systems for misuse, anomalous behavior, and data exfiltration, and they should perform ongoing AI red teaming to identify vulnerabilities such as jailbreaking, adversarial inputs, and model manipulation before they can be exploited.&lt;/LI&gt;
&lt;/OL&gt;
&lt;OL start="6"&gt;
&lt;LI&gt;&lt;STRONG&gt; Azure Cosmos DB: &lt;/STRONG&gt;Azure Cosmos DB enhances network security by supporting access restrictions via&amp;nbsp;&lt;A href="https://www.google.com/url?sa=E&amp;amp;q=https%3A%2F%2Flearn.microsoft.com%2Fen-us%2Fazure%2Fcosmos-db%2Fhow-to-configure-vnet-service-endpoint" target="_blank" rel="noopener"&gt;Virtual Network (VNet) integration&lt;/A&gt; and secure access through&amp;nbsp;&lt;A href="https://www.google.com/url?sa=E&amp;amp;q=https%3A%2F%2Flearn.microsoft.com%2Fen-us%2Fazure%2Fcosmos-db%2Fhow-to-configure-private-endpoints" target="_blank" rel="noopener"&gt;Private Link&lt;/A&gt;. Data protection is reinforced by integration with&amp;nbsp;&lt;A href="https://www.google.com/url?sa=E&amp;amp;q=https%3A%2F%2Flearn.microsoft.com%2Fen-us%2Fazure%2Fpurview%2Foverview" target="_blank" rel="noopener"&gt;Microsoft Purview&lt;/A&gt;, which helps classify and label sensitive data, and with&amp;nbsp;&lt;A href="https://www.google.com/url?sa=E&amp;amp;q=https%3A%2F%2Ftechcommunity.microsoft.com%2FOverview%2520of%2520Defender%2520for%2520Azure%2520Cosmos%2520DB%2520-%2520Microsoft%2520Defender%2520for%2520Cloud%2520%7C%2520Microsoft%2520Learn" target="_blank" rel="noopener"&gt;Defender for Cosmos DB&lt;/A&gt;, which detects threats and exfiltration attempts. Cosmos DB ensures all data is encrypted in transit using TLS 1.2+ (mandatory) and at rest using Microsoft-managed or&amp;nbsp;&lt;A href="https://www.google.com/url?sa=E&amp;amp;q=https%3A%2F%2Flearn.microsoft.com%2Fen-us%2Fazure%2Fsecurity%2Ffundamentals%2Fencryption-atrest%23customer-managed-keys" target="_blank" rel="noopener"&gt;customer-managed keys (CMKs)&lt;/A&gt;.&lt;BR /&gt;&lt;BR /&gt;&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt; Power BI: &lt;/STRONG&gt;Power BI uses&amp;nbsp;&lt;A href="https://www.google.com/url?sa=E&amp;amp;q=https%3A%2F%2Flearn.microsoft.com%2Fen-us%2Fpower-bi%2Fadmin%2Fservice-security%23identity-management" target="_blank" rel="noopener"&gt;Microsoft Entra ID&lt;/A&gt; for secure identity and access management. In Power BI embedded applications, using&amp;nbsp;&lt;A href="https://www.google.com/url?sa=E&amp;amp;q=https%3A%2F%2Flearn.microsoft.com%2Fen-us%2Fsecurity%2Fdevelop%2Fcredscan" target="_blank" rel="noopener"&gt;Credential Scanner&lt;/A&gt; is recommended to detect hardcoded secrets and migrate them to secure storage like&amp;nbsp;&lt;A href="https://www.google.com/url?sa=E&amp;amp;q=https%3A%2F%2Flearn.microsoft.com%2Fen-us%2Fazure%2Fkey-vault%2Fgeneral%2Foverview" target="_blank" rel="noopener"&gt;Azure Key Vault&lt;/A&gt;. All data is encrypted both at rest and during processing, with an option for organizations to use their own&amp;nbsp;&lt;A href="https://www.google.com/url?sa=E&amp;amp;q=https%3A%2F%2Flearn.microsoft.com%2Fen-us%2Fpower-bi%2Fadmin%2Fservice-security-data-encryption" target="_blank" rel="noopener"&gt;Customer-Managed Keys (CMKs)&lt;/A&gt;. Power BI also integrates with&amp;nbsp;&lt;A href="https://www.google.com/url?sa=E&amp;amp;q=https%3A%2F%2Flearn.microsoft.com%2Fen-us%2Fpower-bi%2Fadmin%2Fservice-security-sensitivity-label-overview" target="_blank" rel="noopener"&gt;Microsoft Purview sensitivity labels&lt;/A&gt;&amp;nbsp;to manage and protect sensitive business data throughout the analytics lifecycle. For additional context, see the&amp;nbsp;&lt;A href="https://www.google.com/url?sa=E&amp;amp;q=https%3A%2F%2Flearn.microsoft.com%2Fen-us%2Fpower-bi%2Fguidance%2Fwhitepaper-powerbi-security" target="_blank" rel="noopener"&gt;Power BI security white paper&lt;/A&gt;.&lt;BR /&gt;&lt;BR /&gt;&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;Microsoft Foundry:&amp;nbsp;&lt;/STRONG&gt;Microsoft&lt;STRONG&gt; &lt;/STRONG&gt;Foundry supports robust identity management using &lt;A href="https://learn.microsoft.com/en-us/azure/role-based-access-control/overview" target="_blank" rel="noopener"&gt;Azure Role-Based Access Control (RBAC)&lt;/A&gt;&lt;U&gt; &lt;/U&gt;to assign roles within&lt;U&gt; &lt;/U&gt;&lt;A href="https://learn.microsoft.com/en-us/entra/identity/" target="_blank" rel="noopener"&gt;Microsoft Entra ID&lt;/A&gt;, and it supports&lt;U&gt; &lt;/U&gt;&lt;A href="https://learn.microsoft.com/en-us/azure/active-directory/managed-identities-azure-resources/overview" target="_blank" rel="noopener"&gt;Managed Identities&lt;/A&gt; for secure resource access.&lt;U&gt; &lt;/U&gt;&lt;A href="https://learn.microsoft.com/en-us/azure/active-directory/conditional-access/overview" target="_blank" rel="noopener"&gt;Conditional Access&lt;/A&gt; policies allow organizations to enforce access based on location, device, and risk level. For network security, Azure AI Foundry supports &lt;A href="https://learn.microsoft.com/en-us/azure/private-link/private-link-overview" target="_blank" rel="noopener"&gt;Private Link&lt;/A&gt;, Managed Network Isolation, and &lt;A href="https://learn.microsoft.com/en-us/azure/virtual-network/network-security-groups-overview" target="_blank" rel="noopener"&gt;Network Security Groups (NSGs)&lt;/A&gt; to restrict resource access. Data is encrypted in transit and at rest using Microsoft-managed keys or optional &lt;A href="https://learn.microsoft.com/en-us/azure/security/fundamentals/encryption-atrest#customer-managed-keys" target="_blank" rel="noopener"&gt;Customer-Managed Keys (CMKs)&lt;/A&gt;&lt;U&gt;.&lt;/U&gt; &lt;A href="https://learn.microsoft.com/en-us/azure/governance/policy/overview" target="_blank" rel="noopener"&gt;Azure Policy&lt;/A&gt;&lt;U&gt; &lt;/U&gt;enables auditing and enforcing configurations for all resources deployed in the environment. 
Additionally, &lt;A href="https://learn.microsoft.com/en-us/entra/agent-id/identity-professional/microsoft-entra-agent-identities-for-ai-agents" target="_blank" rel="noopener"&gt;Microsoft Entra Agent ID&lt;/A&gt; extends identity management and access capabilities to AI agents. AI agents created within Microsoft Foundry are automatically assigned identities in a Microsoft Entra directory, centralizing agent and user management in one solution. &lt;A href="https://learn.microsoft.com/en-us/azure/defender-for-cloud/ai-security-posture" target="_blank" rel="noopener"&gt;AI Security Posture Management&lt;/A&gt; can be used to assess the security posture of AI workloads. &lt;A href="https://learn.microsoft.com/en-us/azure/defender-for-cloud/ai-onboarding" target="_blank" rel="noopener"&gt;Defender for AI Services&lt;/A&gt; provides threat protection and insights for your AI resources. &lt;A href="https://learn.microsoft.com/en-us/purview/developer/secure-ai-with-purview" target="_blank" rel="noopener"&gt;Purview APIs&lt;/A&gt; enable Azure AI Foundry and developers to integrate data security and compliance controls into custom AI apps and agents. This includes enforcing policies based on how users interact with sensitive information in AI applications. &lt;A href="https://learn.microsoft.com/en-us/purview/ai-microsoft-purview" target="_blank" rel="noopener"&gt;Purview&lt;/A&gt; Sensitive Information Types can be used to detect sensitive data in user prompts and responses when interacting with AI applications.&lt;BR /&gt;&lt;BR /&gt;&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt; DevOps Security: &lt;/STRONG&gt;Embed security throughout the software development lifecycle. Best practices include conducting structured threat modeling with the &lt;A href="https://learn.microsoft.com/azure/security/develop/threat-modeling-tool" target="_blank" rel="noopener"&gt;Microsoft Threat Modeling Tool&lt;/A&gt; early in the design phase, securing the software supply chain by verifying provenance and scanning third‑party dependencies, and maintaining a &lt;A href="https://github.com/microsoft/sbom-tool/blob/main/docs/setting-up-ado-pipelines.md" target="_blank" rel="noopener"&gt;Software Bill of Materials (SBOM)&lt;/A&gt;.&lt;BR /&gt;&lt;BR /&gt;Security is further “shifted left” by integrating automated controls directly into CI/CD pipelines.&lt;STRONG style="color: rgb(30, 30, 30);"&gt; &lt;/STRONG&gt;&lt;A style="font-style: normal; font-weight: 400; background-color: rgb(255, 255, 255);" href="https://learn.microsoft.com/en-us/azure/devops/repos/security/configure-github-advanced-security-features?view=azure-devops&amp;amp;tabs=yaml&amp;amp;pivots=standalone-ghazdo" target="_blank" rel="noopener"&gt;GitHub Advanced Security for Azure DevOps&lt;/A&gt;&lt;SPAN style="color: rgb(30, 30, 30);"&gt; provides dependency scanning, &lt;/SPAN&gt;&lt;A style="font-style: normal; font-weight: 400; background-color: rgb(255, 255, 255);" href="https://codeql.github.com/" target="_blank" rel="noopener"&gt;CodeQL&lt;/A&gt;&lt;SPAN style="color: rgb(30, 30, 30);"&gt;-based static application security testing (SAST), and secret scanning&lt;/SPAN&gt;&lt;STRONG style="color: rgb(30, 30, 30);"&gt; &lt;/STRONG&gt;&lt;SPAN style="color: rgb(30, 30, 30);"&gt;to identify vulnerabilities and exposed credentials in code and third-party libraries. 
Infrastructure-as-code templates can be validated with &lt;/SPAN&gt;&lt;A style="font-style: normal; font-weight: 400; background-color: rgb(255, 255, 255);" href="https://learn.microsoft.com/azure/governance/policy/overview" target="_blank" rel="noopener"&gt;Azure Policy&lt;/A&gt;&lt;SPAN style="color: rgb(30, 30, 30);"&gt; and &lt;/SPAN&gt;&lt;A style="font-style: normal; font-weight: 400; background-color: rgb(255, 255, 255);" href="https://learn.microsoft.com/azure/defender-for-cloud/" target="_blank" rel="noopener"&gt;Microsoft Defender for Cloud&lt;/A&gt;&lt;SPAN style="color: rgb(30, 30, 30);"&gt;, while pipeline protections such as protected branches and approvals reduce the risk of unauthorized changes. DevOps environments can be hardened using &lt;/SPAN&gt;&lt;A style="font-style: normal; font-weight: 400; background-color: rgb(255, 255, 255);" href="https://learn.microsoft.com/azure/key-vault/general/overview" target="_blank" rel="noopener"&gt;Azure Key Vault&lt;/A&gt;&lt;SPAN style="color: rgb(30, 30, 30);"&gt; for secrets management, &lt;/SPAN&gt;&lt;A style="font-style: normal; font-weight: 400; background-color: rgb(255, 255, 255);" href="https://learn.microsoft.com/azure/active-directory/managed-identities-azure-resources/overview" target="_blank" rel="noopener"&gt;Managed Identities&lt;/A&gt;&lt;SPAN style="color: rgb(30, 30, 30);"&gt; and &lt;/SPAN&gt;&lt;A style="font-style: normal; font-weight: 400; background-color: rgb(255, 255, 255);" href="https://learn.microsoft.com/entra/fundamentals/whatis" target="_blank" rel="noopener"&gt;Microsoft Entra ID&lt;/A&gt;&lt;SPAN style="color: rgb(30, 30, 30);"&gt; for least-privilege access, and monitoring through &lt;/SPAN&gt;&lt;A style="font-style: normal; font-weight: 400; background-color: rgb(255, 255, 255);" href="https://learn.microsoft.com/azure/azure-monitor/overview" target="_blank" rel="noopener"&gt;Azure Monitor&lt;/A&gt;&lt;SPAN style="color: rgb(30, 30, 30);"&gt; . 
&lt;/SPAN&gt;&lt;A style="font-style: normal; font-weight: 400; background-color: rgb(255, 255, 255);" href="https://learn.microsoft.com/en-us/azure/defender-for-cloud/defender-for-devops-introduction" target="_blank" rel="noopener"&gt;Microsoft Defender for Cloud DevOps Security&lt;/A&gt;&lt;SPAN style="color: rgb(30, 30, 30);"&gt; provides centralized code‑to‑cloud visibility across Azure DevOps, GitHub, and GitLab, identifying risks in code, secrets, dependencies, and IaC and helping teams prioritize fixes early in CI/CD pipelines.&lt;/SPAN&gt;&lt;/LI&gt;
&lt;/OL&gt;
&lt;H3&gt;&lt;STRONG&gt;Related and Future Scenarios&lt;/STRONG&gt;&lt;/H3&gt;
&lt;P&gt;Although bolt inspection served as an initial use case, this architecture establishes a scalable pattern for many industrial applications:&lt;/P&gt;
&lt;OL&gt;
&lt;LI&gt;Predictive Maintenance: Tracking structural movement over time enables condition-based maintenance rather than schedule-based inspections.&lt;/LI&gt;
&lt;LI&gt;Structural Health Monitoring: The same approach can detect cracks, corrosion, or deformation across industrial assets and infrastructure.&lt;/LI&gt;
&lt;LI&gt;Equipment and Safety Compliance Monitoring: AI-driven visual inspection can monitor equipment wear, safety compliance, and environmental risks.&lt;/LI&gt;
&lt;LI&gt;Digital Twin Integration: Inspection data can feed digital twin environments, enabling real-time visualization of facility health and risk conditions.&lt;/LI&gt;
&lt;/OL&gt;
&lt;H3&gt;&lt;STRONG&gt;Conclusion&lt;/STRONG&gt;&lt;/H3&gt;
&lt;P&gt;Modernizing industrial inspection is not simply about applying AI—it requires aligning technology, operational discipline, and data quality. Early exploration using Generative AI enabled rapid learning and feasibility validation. However, a production-grade solution must be built on deterministic computer vision models supported by standardized data capture and operational controls.&lt;/P&gt;
&lt;P&gt;By combining drone-based data capture, deterministic computer vision, and Generative AI for reporting and insights, organizations can achieve scalable, repeatable, and auditable inspection processes. This hybrid approach enables safer operations, reduced manual effort, and the transition from reactive repairs to predictive maintenance across industrial environments.&lt;/P&gt;
&lt;P&gt;The result is not just an automated inspection tool, but a scalable AI architecture for modern industrial safety and asset reliability.&lt;/P&gt;
&lt;H4&gt;&lt;STRONG&gt;Contributors:&lt;/STRONG&gt;&lt;/H4&gt;
&lt;P&gt;&lt;EM&gt;This article is maintained by Microsoft. It was originally written by the following contributors.&lt;/EM&gt; &amp;nbsp;&lt;/P&gt;
&lt;P&gt;&lt;STRONG&gt;Principal authors:&lt;/STRONG&gt;&lt;STRONG&gt; &amp;nbsp;&lt;/STRONG&gt;&lt;/P&gt;
&lt;OL&gt;
&lt;LI&gt;&lt;A class="lia-external-url" href="https://www.linkedin.com/in/peter-t-lee/" target="_blank" rel="noopener"&gt;Peter Lee&lt;/A&gt; | Senior Cloud Solution Architect – US Customer Success &lt;/LI&gt;
&lt;LI&gt;&lt;A class="lia-external-url" href="https://www.linkedin.com/in/trmanasa" target="_blank" rel="noopener"&gt;Manasa Ramalinga&lt;/A&gt; | Senior Principal Cloud Solution Architect – US Customer Success &lt;/LI&gt;
&lt;LI&gt;&lt;A class="lia-external-url" href="https://www.linkedin.com/in/abed-sau/" target="_blank" rel="noopener"&gt;Abed Sau&lt;/A&gt; | Principal Cloud Solution Architect – US Customer Success&amp;nbsp;&lt;/LI&gt;
&lt;LI&gt;&lt;A class="lia-external-url" href="https://www.linkedin.com/in/yagneswari-kanadam/" target="_blank" rel="noopener"&gt;Yagneswari Kanadam&lt;/A&gt; | Senior Cloud Solution Architect – US Customer Success &lt;/LI&gt;
&lt;/OL&gt;</description>
      <pubDate>Sat, 25 Apr 2026 00:07:31 GMT</pubDate>
      <guid>https://techcommunity.microsoft.com/t5/azure-architecture-blog/modernizing-industrial-safety-and-inspection-with-ai-driven/ba-p/4514284</guid>
      <dc:creator>PeterTHLee</dc:creator>
      <dc:date>2026-04-25T00:07:31Z</dc:date>
    </item>
    <item>
      <title>Designing a Medallion Framework — A Decision Guide</title>
      <link>https://techcommunity.microsoft.com/t5/azure-architecture-blog/designing-a-medallion-framework-a-decision-guide/ba-p/4514349</link>
      <description>&lt;P data-selectable-paragraph=""&gt;Everyone draws the same picture: Bronze → Silver → Gold. Three boxes, three arrows. Done.&lt;/P&gt;
&lt;P data-selectable-paragraph=""&gt;What that picture hides is the dozen design decisions you have to make&amp;nbsp;&lt;EM&gt;inside&lt;/EM&gt;&amp;nbsp;each box — and the ones you make at the boundaries between them. Get those right and onboarding the 200th table feels like onboarding the 2nd. Get them wrong and you’ll be rewriting the framework in 18 months.&lt;/P&gt;
&lt;P data-selectable-paragraph=""&gt;This post is a generic walkthrough of how to think about a medallion framework on Databricks (or any other platform): what each layer should own, where the responsibilities blur, and a few opinionated patterns I’ve found worth defending.&lt;/P&gt;
&lt;H2 data-selectable-paragraph=""&gt;The classic template&lt;/H2&gt;
&lt;P data-selectable-paragraph=""&gt;Bronze → Silver → Gold. Three layers, broadly:&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;img /&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P data-selectable-paragraph=""&gt;This template is intentionally vague — and that’s the point. The same three labels can describe a framework for a 10-table marketing pipeline and a 2,000-table enterprise lakehouse. The differences are in how you tweak the template to match your project.&lt;/P&gt;
&lt;P data-selectable-paragraph=""&gt;This post walks through the questions that drive those tweaks. There isn’t a single right answer for any of them — only the answer that fits your project’s requirements.&lt;/P&gt;
&lt;H2 data-selectable-paragraph=""&gt;How to read this guide&lt;/H2&gt;
&lt;P data-selectable-paragraph=""&gt;For each architectural choice, I’ll frame it as:&lt;/P&gt;
&lt;OL&gt;
&lt;LI data-selectable-paragraph=""&gt;The question — the requirement you need to clarify&lt;/LI&gt;
&lt;LI data-selectable-paragraph=""&gt;The options — the realistic ways to answer it&lt;/LI&gt;
&lt;LI data-selectable-paragraph=""&gt;When each option fits — what kind of project picks which option&lt;/LI&gt;
&lt;/OL&gt;
&lt;P data-selectable-paragraph=""&gt;Use this to make your tradeoffs explicit. Document the answers in your design doc. They’ll inform a hundred downstream decisions.&lt;/P&gt;
&lt;H2 data-selectable-paragraph=""&gt;Question 1 — Do you need a Staging layer?&lt;/H2&gt;
&lt;P data-selectable-paragraph=""&gt;A Staging (stg_*) layer is a transient zone that holds&amp;nbsp;&lt;EM&gt;just the current run’s data&lt;/EM&gt;&amp;nbsp;before it lands in Bronze.&lt;/P&gt;
&lt;P data-selectable-paragraph=""&gt;Options:&lt;/P&gt;
&lt;UL&gt;
&lt;LI data-selectable-paragraph=""&gt;No staging. Source → Bronze directly.&lt;/LI&gt;
&lt;LI data-selectable-paragraph=""&gt;Staging as a transient table per object, overwritten every run.&lt;/LI&gt;
&lt;LI data-selectable-paragraph=""&gt;Staging as a checkpointed zone (e.g., Auto Loader checkpoints + raw files in a landing path).&lt;/LI&gt;
&lt;/UL&gt;
&lt;P data-selectable-paragraph=""&gt;When to pick which:&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;img /&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P data-selectable-paragraph=""&gt;The decision usually comes down to failure isolation and incremental capture clarity. If both are non-issues, you can skip it.&lt;/P&gt;
&lt;H2 data-selectable-paragraph=""&gt;Question 2 — How “raw” should Bronze be?&lt;/H2&gt;
&lt;P data-selectable-paragraph=""&gt;This is the single biggest tweak point in the medallion architecture. The textbook says “Bronze = raw bytes.” Real projects often deviate.&lt;/P&gt;
&lt;P data-selectable-paragraph=""&gt;Options:&lt;/P&gt;
&lt;UL&gt;
&lt;LI data-selectable-paragraph=""&gt;A. Strictly raw. Source schema preserved exactly. All columns as STRING. No casting, no trimming.&lt;/LI&gt;
&lt;LI data-selectable-paragraph=""&gt;B. Lightly cleaned. Strong typing, whitespace trimmed, null normalization (“”, “N/A” → NULL), audit columns added. Schema stable.&lt;/LI&gt;
&lt;LI data-selectable-paragraph=""&gt;C. Cleansed + minor enrichment. Above plus reference data lookups, basic standardization (e.g., country codes), key normalization.&lt;/LI&gt;
&lt;/UL&gt;
&lt;P data-selectable-paragraph=""&gt;When to pick which:&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;img /&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P data-selectable-paragraph=""&gt;A useful rule of thumb: the more sources and consumers you have, the cleaner Bronze should be. The cost of&amp;nbsp;&lt;EM&gt;not&lt;/EM&gt;&amp;nbsp;cleaning compounds with every notebook downstream.&lt;/P&gt;
&lt;P data-selectable-paragraph=""&gt;If you choose B or C, you’ve shifted some traditional Silver responsibilities into Bronze. That’s fine — just be explicit about it so Silver’s contract changes accordingly.&lt;/P&gt;
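&lt;P data-selectable-paragraph=""&gt;To make option B concrete, here is a minimal sketch of a lightly cleaned Bronze row in plain Python. On Databricks this logic would normally be expressed as DataFrame transformations; plain dictionaries are used here only to keep the example self-contained. The null tokens, the audit column names (_source, _loaded_at), and the function names are assumptions for illustration.&lt;/P&gt;

```python
from datetime import datetime, timezone

# Hypothetical set of source values treated as missing.
NULL_TOKENS = {"", "N/A"}


def clean_value(value):
    """Trim whitespace and normalize null-like strings to None."""
    if value is None:
        return None
    trimmed = value.strip()
    return None if trimmed in NULL_TOKENS else trimmed


def to_bronze_row(raw_row, casts, source_name, loaded_at=None):
    """Apply option B: trim, normalize nulls, cast types, add audit columns."""
    row = {}
    for column, value in raw_row.items():
        cleaned = clean_value(value)
        cast = casts.get(column)
        if cast is not None and cleaned is not None:
            cleaned = cast(cleaned)
        row[column] = cleaned
    # Audit columns record where the row came from and when it was loaded.
    row["_source"] = source_name
    row["_loaded_at"] = loaded_at or datetime.now(timezone.utc)
    return row
```

&lt;P data-selectable-paragraph=""&gt;Calling to_bronze_row({"id": " 42 ", "country": "N/A"}, {"id": int}, "crm") yields an integer id, a NULL country, and the two audit columns. Everything this function does is work Silver no longer has to repeat, which is why the shift needs to be written down.&lt;/P&gt;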
&lt;H2 data-selectable-paragraph=""&gt;Question 3 — What does Silver actually own?&lt;/H2&gt;
&lt;P data-selectable-paragraph=""&gt;Silver is the most overloaded layer in any medallion framework. Decide upfront which of these responsibilities Silver owns vs. defers to other layers:&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;img /&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P data-selectable-paragraph=""&gt;How to decide what Silver owns:&lt;/P&gt;
&lt;UL&gt;
&lt;LI data-selectable-paragraph=""&gt;If Silver is the only layer business users query, give it more — including light history and aggregations. (Common in smaller projects.)&lt;/LI&gt;
&lt;LI data-selectable-paragraph=""&gt;If you have a strong Gold layer with multiple marts, keep Silver narrow: business entities only, current state.&lt;/LI&gt;
&lt;LI data-selectable-paragraph=""&gt;If you have multiple consuming teams with different needs, push everything consumer-specific to Gold and keep Silver as the shared canonical model.&lt;/LI&gt;
&lt;/UL&gt;
&lt;P data-selectable-paragraph=""&gt;The clearest signal that Silver is overloaded: you have one Silver table per source table. Silver should be organized by business entity, not by source. If they line up 1:1, you’ve effectively built “Bronze with cleaning” and skipped Silver’s real value.&lt;/P&gt;
&lt;H2 data-selectable-paragraph=""&gt;Question 4 — Is Gold one zone or several?&lt;/H2&gt;
&lt;P data-selectable-paragraph=""&gt;The default picture shows Gold as one box. In real projects it often splits.&lt;/P&gt;
&lt;P data-selectable-paragraph=""&gt;Options:&lt;/P&gt;
&lt;UL&gt;
&lt;LI data-selectable-paragraph=""&gt;Single Gold zone. Marts and history live together.&lt;/LI&gt;
&lt;LI data-selectable-paragraph=""&gt;Gold-Reporting + Gold-History. Reporting marts (denormalized, aggregated, fast) separated from historized snapshots (SCD2, point-in-time, append-mostly).&lt;/LI&gt;
&lt;LI data-selectable-paragraph=""&gt;Gold per consumer. Separate zones per business unit, dashboard family, or external API.&lt;/LI&gt;
&lt;/UL&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;img /&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P data-selectable-paragraph=""&gt;The cost of splitting Gold is some duplication and more pipelines. The benefit is independent SLAs — your dashboard refresh isn’t held hostage by your audit history rebuild.&lt;/P&gt;
&lt;H2 data-selectable-paragraph=""&gt;Question 5 — Load patterns: FullLoad vs DeltaLoad vs CDC&lt;/H2&gt;
&lt;P data-selectable-paragraph=""&gt;Per source table, decide the load pattern. This decision drives staging design, watermark management, and merge logic.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;img /&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P data-selectable-paragraph=""&gt;It’s normal to mix patterns inside the same framework. The metadata-driven approach below makes this trivial — load pattern is just a column in your config table.&lt;/P&gt;
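&lt;P data-selectable-paragraph=""&gt;As a minimal sketch of "load pattern is just a column", assuming a hypothetical config table with table_name and load_pattern columns (the rows, function names, and messages below are illustrative, not part of any specific framework), a generic runner can pick the loader per table:&lt;/P&gt;

```python
from typing import Callable, Dict, List

# Hypothetical config rows; in practice these would be read from a metadata table.
CONFIG: List[Dict[str, str]] = [
    {"table_name": "customer", "load_pattern": "FullLoad"},
    {"table_name": "orders", "load_pattern": "DeltaLoad"},
    {"table_name": "payments", "load_pattern": "CDC"},
]


def full_load(table: str) -> str:
    return f"{table}: overwrite target with a full extract"


def delta_load(table: str) -> str:
    return f"{table}: read rows past the watermark, then merge"


def cdc_load(table: str) -> str:
    return f"{table}: apply captured inserts, updates, and deletes"


# One entry per supported pattern; adding a pattern means adding one function.
LOADERS: Dict[str, Callable[[str], str]] = {
    "FullLoad": full_load,
    "DeltaLoad": delta_load,
    "CDC": cdc_load,
}


def plan_loads(config: List[Dict[str, str]]) -> List[str]:
    """Return one planned action per configured table."""
    plans = []
    for row in config:
        pattern = row["load_pattern"]
        if pattern not in LOADERS:
            raise ValueError(f"Unknown load pattern: {pattern}")
        plans.append(LOADERS[pattern](row["table_name"]))
    return plans


if __name__ == "__main__":
    for line in plan_loads(CONFIG):
        print(line)
```

&lt;P data-selectable-paragraph=""&gt;Switching a table from FullLoad to DeltaLoad then becomes a config change rather than a code change.&lt;/P&gt;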
&lt;H2 data-selectable-paragraph=""&gt;Question 6 — How metadata-driven should the framework be?&lt;/H2&gt;
&lt;P data-selectable-paragraph=""&gt;Options:&lt;/P&gt;
&lt;UL&gt;
&lt;LI data-selectable-paragraph=""&gt;Code-per-table. One notebook per ingestion. Simple, easy to reason about, scales poorly.&lt;/LI&gt;
&lt;LI data-selectable-paragraph=""&gt;Hybrid. Generic ingestion notebooks for common patterns, custom notebooks for exceptions.&lt;/LI&gt;
&lt;LI data-selectable-paragraph=""&gt;Fully metadata-driven. Generic notebooks for every layer, behavior driven entirely by metadata tables.&lt;/LI&gt;
&lt;/UL&gt;
&lt;P data-selectable-paragraph=""&gt;When to pick which:&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;img /&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P data-selectable-paragraph=""&gt;A fully metadata-driven framework has higher upfront cost but flattens the per-table cost dramatically. The break-even point is usually around 30–50 tables.&lt;/P&gt;
&lt;H2 data-selectable-paragraph=""&gt;Question 7 — Orchestration shape&lt;/H2&gt;
&lt;P data-selectable-paragraph=""&gt;How do you fan out work across tables?&lt;/P&gt;
&lt;P data-selectable-paragraph=""&gt;Options:&lt;/P&gt;
&lt;UL&gt;
&lt;LI data-selectable-paragraph=""&gt;Sequential. One table at a time. Simple, slow.&lt;/LI&gt;
&lt;LI data-selectable-paragraph=""&gt;Parallel pool. ThreadPoolExecutor or Databricks Workflows fan-out. Tables run concurrently, no inter-table dependencies.&lt;/LI&gt;
&lt;LI data-selectable-paragraph=""&gt;DAG. Dependency-aware execution. Required when tables depend on each other.&lt;/LI&gt;
&lt;/UL&gt;
&lt;P data-selectable-paragraph=""&gt;Per-layer guidance:&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;img /&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P data-selectable-paragraph=""&gt;The decision driver is whether tables in that layer depend on each other. If they don’t, don’t pay the DAG complexity tax.&lt;/P&gt;
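&lt;P data-selectable-paragraph=""&gt;The two non-trivial shapes can be sketched with the Python standard library alone. The example assumes a placeholder ingest_table function and made-up table names; a real framework would call its own notebooks or jobs instead. The parallel pool suits layers with independent tables, and the topological sort shows the ordering a DAG has to respect when tables depend on each other:&lt;/P&gt;

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from graphlib import TopologicalSorter  # Python 3.9+
from typing import Dict, List, Set


def ingest_table(table: str) -> int:
    """Placeholder for a real ingestion call; returns a fake row count."""
    return len(table) * 100


def run_parallel(tables: List[str], max_workers: int = 4) -> Dict[str, int]:
    """Run independent table loads concurrently and collect their results."""
    results: Dict[str, int] = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(ingest_table, t): t for t in tables}
        for future in as_completed(futures):
            results[futures[future]] = future.result()
    return results


def dag_order(dependencies: Dict[str, Set[str]]) -> List[str]:
    """Return a valid execution order; each key maps to the tables it depends on."""
    return list(TopologicalSorter(dependencies).static_order())


if __name__ == "__main__":
    print(run_parallel(["customer", "orders"]))
    print(dag_order({"silver_customer": {"bronze_crm", "bronze_erp"}}))
```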
&lt;H2 data-selectable-paragraph=""&gt;Question 8 — Failure handling and retries&lt;/H2&gt;
&lt;P data-selectable-paragraph=""&gt;Options to decide on:&lt;/P&gt;
&lt;UL&gt;
&lt;LI data-selectable-paragraph=""&gt;Retry scope. Per statement, per child notebook, per master run, none.&lt;/LI&gt;
&lt;LI data-selectable-paragraph=""&gt;Retry counts. Per layer? Per table? Per environment?&lt;/LI&gt;
&lt;LI data-selectable-paragraph=""&gt;Backoff. Fixed, linear, exponential.&lt;/LI&gt;
&lt;LI data-selectable-paragraph=""&gt;Failure semantics. Fail-fast (stop on first failure) or best-effort (continue and report at the end).&lt;/LI&gt;
&lt;/UL&gt;
&lt;P data-selectable-paragraph=""&gt;When to pick which:&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;img /&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P data-selectable-paragraph=""&gt;A good default for most projects: process-level retry (master retries the failed child), exponential backoff, per-layer max retry count, fail-fast within a child.&lt;/P&gt;
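&lt;P data-selectable-paragraph=""&gt;A minimal sketch of that default, assuming the child run is exposed as a plain Python callable (the function name run_with_retry and its parameters are hypothetical). The child stays fail-fast: any exception ends the attempt, and the master waits base_delay * 2**attempt before retrying:&lt;/P&gt;

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def run_with_retry(
    child: Callable[[], T],
    max_retries: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call child(); on failure wait base_delay * 2**attempt and try again.

    After max_retries retries have failed, the last exception is re-raised.
    The sleep function is injectable so the policy can be tested without waiting.
    """
    attempt = 0
    while True:
        try:
            return child()
        except Exception:
            if attempt >= max_retries:
                raise
            sleep(base_delay * (2 ** attempt))
            attempt += 1
```

&lt;P data-selectable-paragraph=""&gt;The max_retries value is the natural candidate to set per layer and per environment from config.&lt;/P&gt;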
&lt;H2 data-selectable-paragraph=""&gt;Question 9 — Observability: how much do you log?&lt;/H2&gt;
&lt;P data-selectable-paragraph=""&gt;Decide what every run captures:&lt;/P&gt;
&lt;UL&gt;
&lt;LI data-selectable-paragraph=""&gt;Execution status, start/end timestamps, duration&lt;/LI&gt;
&lt;LI data-selectable-paragraph=""&gt;Row counts per activity (source read, staging write, target write)&lt;/LI&gt;
&lt;/UL&gt;
&lt;UL&gt;
&lt;LI data-selectable-paragraph=""&gt;MERGE metrics (inserted, updated, deleted)&lt;/LI&gt;
&lt;LI data-selectable-paragraph=""&gt;Watermark used and watermark captured&lt;/LI&gt;
&lt;LI data-selectable-paragraph=""&gt;Retry attempts&lt;/LI&gt;
&lt;LI data-selectable-paragraph=""&gt;Error message (truncated)&lt;/LI&gt;
&lt;/UL&gt;
&lt;P data-selectable-paragraph=""&gt;Options for storage:&lt;/P&gt;
&lt;UL&gt;
&lt;LI data-selectable-paragraph=""&gt;Logs in source-side metadata DB (e.g., Azure SQL). Easy to query with SQL, integrates with monitoring tools.&lt;/LI&gt;
&lt;LI data-selectable-paragraph=""&gt;Logs in a Delta table in the lakehouse. Native to Databricks, queryable with Spark.&lt;/LI&gt;
&lt;LI data-selectable-paragraph=""&gt;Logs in both. Source-side for ops dashboards, Delta for analytics on the pipeline itself.&lt;/LI&gt;
&lt;/UL&gt;
&lt;P data-selectable-paragraph=""&gt;When to pick which:&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;img /&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P data-selectable-paragraph=""&gt;Whatever you pick, make count validation a first-class output. The moment counts mismatch, you want to know — not three reports later.&lt;/P&gt;
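&lt;P data-selectable-paragraph=""&gt;One way to make count validation a first-class output is to reconcile the logged counts at the end of every run. The sketch below uses hypothetical field names and assumes the simplest case, where every staged row ends up either inserted or updated in the target (no deletes and no skipped unchanged rows); adjust the rule if your MERGE behaves differently:&lt;/P&gt;

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class RunCounts:
    source_read: int
    staging_write: int
    target_inserted: int
    target_updated: int


def validate_counts(c: RunCounts) -> List[str]:
    """Return mismatch messages; an empty list means the counts reconcile."""
    problems = []
    if c.source_read != c.staging_write:
        problems.append(
            f"source_read={c.source_read} != staging_write={c.staging_write}"
        )
    merged = c.target_inserted + c.target_updated
    if merged != c.staging_write:
        problems.append(f"merged rows={merged} != staging_write={c.staging_write}")
    return problems


if __name__ == "__main__":
    print(validate_counts(RunCounts(10, 10, 4, 6)))  # counts reconcile
    print(validate_counts(RunCounts(10, 9, 4, 6)))   # two mismatches reported
```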
&lt;H2 data-selectable-paragraph=""&gt;Question 10 — Schema evolution policy&lt;/H2&gt;
&lt;P data-selectable-paragraph=""&gt;The cheapest decision to defer and the most painful one to retrofit.&lt;/P&gt;
&lt;P data-selectable-paragraph=""&gt;Decide which changes are allowed automatically:&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;img /&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P data-selectable-paragraph=""&gt;Where to enforce:&lt;/P&gt;
&lt;UL&gt;
&lt;LI data-selectable-paragraph=""&gt;At Bronze ingestion — fail loudly if source schema changes in a disallowed way&lt;/LI&gt;
&lt;LI data-selectable-paragraph=""&gt;At Silver — handle by transformation; new Bronze columns don’t auto-flow to Silver&lt;/LI&gt;
&lt;LI data-selectable-paragraph=""&gt;At Gold — strict contracts; consumers depend on the shape&lt;/LI&gt;
&lt;/UL&gt;
&lt;P data-selectable-paragraph=""&gt;The contract changes per layer to reflect the audience. Bronze is forgiving (data engineers see issues); Gold is strict (consumers can’t tolerate surprises).&lt;/P&gt;
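&lt;P data-selectable-paragraph=""&gt;A small sketch of a per-layer schema check, assuming schemas are available as column-name to type-name maps. The specific policy shown (dropped columns and type changes always rejected, new columns allowed only when a flag is set) is an illustrative assumption, not a rule stated above; the flag is what would differ between a forgiving Bronze and a strict Gold:&lt;/P&gt;

```python
from typing import Dict, List


def schema_violations(
    expected: Dict[str, str],
    incoming: Dict[str, str],
    allow_new_columns: bool = True,
) -> List[str]:
    """Compare column-name to type maps and list the disallowed changes."""
    violations = []
    for name, dtype in expected.items():
        if name not in incoming:
            violations.append(f"dropped column: {name}")
        elif incoming[name] != dtype:
            violations.append(f"type change: {name} {dtype} -> {incoming[name]}")
    if not allow_new_columns:
        for name in incoming:
            if name not in expected:
                violations.append(f"new column: {name}")
    return violations


if __name__ == "__main__":
    expected = {"id": "bigint", "name": "string"}
    incoming = {"id": "bigint", "name": "string", "email": "string"}
    print(schema_violations(expected, incoming))                           # forgiving
    print(schema_violations(expected, incoming, allow_new_columns=False))  # strict
```

&lt;P data-selectable-paragraph=""&gt;Failing the run when the returned list is non-empty gives the "fail loudly" behavior at ingestion.&lt;/P&gt;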
&lt;H2 data-selectable-paragraph=""&gt;Question 11 — Idempotency and replay&lt;/H2&gt;
&lt;P data-selectable-paragraph=""&gt;Can you re-run yesterday’s load and get the same result?&lt;/P&gt;
&lt;P data-selectable-paragraph=""&gt;Options:&lt;/P&gt;
&lt;UL&gt;
&lt;LI data-selectable-paragraph=""&gt;Idempotent by run_id. Re-running the same run_id is a no-op or produces identical output.&lt;/LI&gt;
&lt;LI data-selectable-paragraph=""&gt;Idempotent by data. Re-running with the same source data produces identical output (regardless of run_id).&lt;/LI&gt;
&lt;LI data-selectable-paragraph=""&gt;Non-idempotent. Replays may produce different results (e.g., timestamps based on current_timestamp()).&lt;/LI&gt;
&lt;/UL&gt;
&lt;P data-selectable-paragraph=""&gt;Recommendation: aim for data-idempotent in every layer. Concretely:&lt;/P&gt;
&lt;UL&gt;
&lt;LI data-selectable-paragraph=""&gt;Staging: overwrite-per-run → idempotent by construction.&lt;/LI&gt;
&lt;LI data-selectable-paragraph=""&gt;Bronze: keyed MERGE → idempotent.&lt;/LI&gt;
&lt;LI data-selectable-paragraph=""&gt;Silver: pure transformation of Bronze inputs → idempotent.&lt;/LI&gt;
&lt;LI data-selectable-paragraph=""&gt;Gold: pure transformation of Silver inputs → idempotent.&lt;/LI&gt;
&lt;/UL&gt;
&lt;P data-selectable-paragraph=""&gt;If you can’t replay a layer cleanly, that’s a design bug worth fixing early.&lt;/P&gt;
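&lt;P data-selectable-paragraph=""&gt;To see why a keyed MERGE replays cleanly while a blind append does not, here is a toy model that uses a Python dict in place of a Delta table. The row shape and function names are hypothetical:&lt;/P&gt;

```python
from typing import Dict, Iterable, List, Tuple

Row = Tuple[int, str]  # (business key, payload); hypothetical shape


def keyed_merge(target: Dict[int, str], batch: Iterable[Row]) -> Dict[int, str]:
    """Upsert by key: the result depends only on the target and batch contents."""
    merged = dict(target)
    for key, payload in batch:
        merged[key] = payload
    return merged


def append_only(target: List[Row], batch: Iterable[Row]) -> List[Row]:
    """Blind append: replaying the same batch duplicates rows."""
    return list(target) + list(batch)


if __name__ == "__main__":
    batch = [(1, "a"), (2, "b")]
    once = keyed_merge({}, batch)
    twice = keyed_merge(once, batch)
    print(once == twice)                                      # replay changes nothing
    print(len(append_only(append_only([], batch), batch)))    # replay doubles the rows
```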
&lt;H2 data-selectable-paragraph=""&gt;Question 12 — Environment topology&lt;/H2&gt;
&lt;P data-selectable-paragraph=""&gt;How many environments? How do they differ?&lt;/P&gt;
&lt;P data-selectable-paragraph=""&gt;Common patterns:&lt;/P&gt;
&lt;UL&gt;
&lt;LI data-selectable-paragraph=""&gt;Dev / Test / Stage / Prod, separate workspaces and data.&lt;/LI&gt;
&lt;LI data-selectable-paragraph=""&gt;Per-developer dev, shared Test/Stage, isolated Prod.&lt;/LI&gt;
&lt;/UL&gt;
&lt;P data-selectable-paragraph=""&gt;What changes between environments (drive these from config):&lt;/P&gt;
&lt;UL&gt;
&lt;LI data-selectable-paragraph=""&gt;Source connection strings&lt;/LI&gt;
&lt;LI data-selectable-paragraph=""&gt;Target storage paths / catalog names&lt;/LI&gt;
&lt;LI data-selectable-paragraph=""&gt;Retry counts (often higher in prod)&lt;/LI&gt;
&lt;LI data-selectable-paragraph=""&gt;Parallelism (often lower in dev to save cost)&lt;/LI&gt;
&lt;LI data-selectable-paragraph=""&gt;Logging verbosity&lt;/LI&gt;
&lt;LI data-selectable-paragraph=""&gt;Data masking rules&lt;/LI&gt;
&lt;/UL&gt;
&lt;P data-selectable-paragraph=""&gt;Keep code identical across environments. Differences live in environment-scoped config (dev.yml, test.yml, stage.yml, prod.yml) loaded at runtime.&lt;/P&gt;
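&lt;P data-selectable-paragraph=""&gt;A sketch of environment-scoped settings layered over shared defaults. To stay dependency-free, the values are inlined as Python dicts; in a real project they would be parsed from the per-environment files. All keys and values here are hypothetical:&lt;/P&gt;

```python
from typing import Any, Dict

# Shared defaults; every environment starts from these.
DEFAULTS: Dict[str, Any] = {
    "max_retries": 1,
    "parallelism": 2,
    "log_level": "DEBUG",
    "catalog": "lakehouse_dev",
}

# Only the differences are listed per environment.
OVERRIDES: Dict[str, Dict[str, Any]] = {
    "dev": {},
    "prod": {
        "max_retries": 3,
        "parallelism": 16,
        "log_level": "INFO",
        "catalog": "lakehouse_prod",
    },
}


def load_settings(env: str) -> Dict[str, Any]:
    """Merge the chosen environment's overrides over the defaults."""
    if env not in OVERRIDES:
        raise KeyError(f"Unknown environment: {env}")
    return {**DEFAULTS, **OVERRIDES[env]}


if __name__ == "__main__":
    print(load_settings("dev"))
    print(load_settings("prod"))
```

&lt;P data-selectable-paragraph=""&gt;The pipeline code only ever calls load_settings with the environment name, so the same code path runs everywhere.&lt;/P&gt;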
&lt;H2 data-selectable-paragraph=""&gt;Putting it together — three example shapes&lt;/H2&gt;
&lt;P data-selectable-paragraph=""&gt;The same framework, three different projects, three different shapes:&lt;/P&gt;
&lt;H2 data-selectable-paragraph=""&gt;Shape A — Small marketing analytics project&lt;/H2&gt;
&lt;UL&gt;
&lt;LI data-selectable-paragraph=""&gt;15 tables, single source, weekly batch&lt;/LI&gt;
&lt;LI data-selectable-paragraph=""&gt;No staging — source is reliable, volumes small&lt;/LI&gt;
&lt;LI data-selectable-paragraph=""&gt;Bronze: lightly cleaned — analysts query it directly&lt;/LI&gt;
&lt;LI data-selectable-paragraph=""&gt;Silver: full ownership including light history and aggregations (no separate Gold needed)&lt;/LI&gt;
&lt;LI data-selectable-paragraph=""&gt;Gold: optional, only for the executive dashboard&lt;/LI&gt;
&lt;LI data-selectable-paragraph=""&gt;Code-per-table, sequential orchestration, fail-fast, minimal logging&lt;/LI&gt;
&lt;/UL&gt;
&lt;H2 data-selectable-paragraph=""&gt;Shape B — Mid-size enterprise data platform&lt;/H2&gt;
&lt;UL&gt;
&lt;LI data-selectable-paragraph=""&gt;80 tables, 5 source systems, daily batch with some hourly&lt;/LI&gt;
&lt;LI data-selectable-paragraph=""&gt;Staging as transient table for Delta Loads&lt;/LI&gt;
&lt;LI data-selectable-paragraph=""&gt;Bronze: lightly cleaned + audit columns&lt;/LI&gt;
&lt;LI data-selectable-paragraph=""&gt;Silver: business entities (Customer, Policy, Claim), DAG orchestration&lt;/LI&gt;
&lt;LI data-selectable-paragraph=""&gt;Gold: split into Reporting + History zones&lt;/LI&gt;
&lt;LI data-selectable-paragraph=""&gt;Hybrid metadata-driven (generic ingestion, custom transforms), per-layer retry, structured count logs&lt;/LI&gt;
&lt;/UL&gt;
&lt;H2 data-selectable-paragraph=""&gt;Shape C — Large multi-tenant Lakehouse&lt;/H2&gt;
&lt;UL&gt;
&lt;LI data-selectable-paragraph=""&gt;500+ tables, 20+ source systems, mixed batch/streaming&lt;/LI&gt;
&lt;LI data-selectable-paragraph=""&gt;Staging zone with file-level checkpoints (Auto Loader)&lt;/LI&gt;
&lt;LI data-selectable-paragraph=""&gt;Bronze: strictly raw + a parallel Bronze-Curated layer for cleansed views&lt;/LI&gt;
&lt;LI data-selectable-paragraph=""&gt;Silver: shared canonical model, narrow scope&lt;/LI&gt;
&lt;LI data-selectable-paragraph=""&gt;Gold: per-consumer zones with independent SLAs&lt;/LI&gt;
&lt;LI data-selectable-paragraph=""&gt;Fully metadata-driven, DAG everywhere, multi-store logging, strict schema contracts&lt;/LI&gt;
&lt;/UL&gt;
&lt;P data-selectable-paragraph=""&gt;Notice none of these are “wrong.” They’re calibrated to the project.&lt;/P&gt;
&lt;H2 data-selectable-paragraph=""&gt;A short checklist for your own framework&lt;/H2&gt;
&lt;P data-selectable-paragraph=""&gt;Before writing code, write down your answers to:&lt;/P&gt;
&lt;OL&gt;
&lt;LI data-selectable-paragraph=""&gt;Do we need a Staging layer? Why?&lt;/LI&gt;
&lt;LI data-selectable-paragraph=""&gt;How clean is Bronze? What’s allowed and what’s not?&lt;/LI&gt;
&lt;LI data-selectable-paragraph=""&gt;What does Silver own? Where does it stop?&lt;/LI&gt;
&lt;LI data-selectable-paragraph=""&gt;Is Gold one zone or multiple? How are they divided?&lt;/LI&gt;
&lt;LI data-selectable-paragraph=""&gt;Which load patterns do we support? Per table or universal?&lt;/LI&gt;
&lt;LI data-selectable-paragraph=""&gt;How metadata-driven? Where do exceptions live?&lt;/LI&gt;
&lt;LI data-selectable-paragraph=""&gt;What’s the orchestration shape per layer?&lt;/LI&gt;
&lt;LI data-selectable-paragraph=""&gt;What’s our retry and failure policy per layer?&lt;/LI&gt;
&lt;LI data-selectable-paragraph=""&gt;What does every run log? Where?&lt;/LI&gt;
&lt;LI data-selectable-paragraph=""&gt;What’s our schema evolution policy per layer?&lt;/LI&gt;
&lt;LI data-selectable-paragraph=""&gt;Are all layers data-idempotent?&lt;/LI&gt;
&lt;LI data-selectable-paragraph=""&gt;What changes per environment, and what stays the same?&lt;/LI&gt;
&lt;/OL&gt;
&lt;P data-selectable-paragraph=""&gt;If you have an answer for each, you have a framework design. If you skip any, you have a framework that will surprise you in production.&lt;/P&gt;
&lt;H2 data-selectable-paragraph=""&gt;Closing thought&lt;/H2&gt;
&lt;P data-selectable-paragraph=""&gt;The medallion architecture isn’t a prescription — it’s a vocabulary. Bronze, Silver, Gold give you words to describe responsibilities. The actual responsibilities are yours to assign, based on what your project actually needs.&lt;/P&gt;
&lt;P data-selectable-paragraph=""&gt;Tweak deliberately. Document your tweaks. And revisit them when the project’s requirements change — because they will.&lt;/P&gt;</description>
      <pubDate>Fri, 24 Apr 2026 16:11:09 GMT</pubDate>
      <guid>https://techcommunity.microsoft.com/t5/azure-architecture-blog/designing-a-medallion-framework-a-decision-guide/ba-p/4514349</guid>
      <dc:creator>Subhajit1994</dc:creator>
      <dc:date>2026-04-24T16:11:09Z</dc:date>
    </item>
    <item>
      <title>Enhancing Enterprise AI Deployments with Zero Trust Networking</title>
      <link>https://techcommunity.microsoft.com/t5/azure-architecture-blog/enhancing-enterprise-ai-deployments-with-zero-trust-networking/ba-p/4513662</link>
      <description>&lt;P&gt;&lt;STRONG&gt;Problem Statement&lt;/STRONG&gt;&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;Azure OpenAI is publicly accessible by default (&lt;A class="lia-external-url" href="https://learn.microsoft.com/en-us/azure/foundry-classic/openai/faq" target="_blank" rel="noopener"&gt;Azure OpenAI frequently asked questions | Microsoft Learn&lt;/A&gt;)&lt;/LI&gt;
&lt;LI&gt;Any application with API access can call it from anywhere&lt;/LI&gt;
&lt;LI&gt;This open access violates:&lt;/LI&gt;
&lt;UL&gt;
&lt;LI&gt;Enterprise security policies&lt;/LI&gt;
&lt;LI&gt;Zero Trust architecture &lt;A class="lia-external-url" href="https://learn.microsoft.com/en-us/security/zero-trust/deploy/networks#key-principles-of-the-zero-trust-network-model" target="_blank" rel="noopener"&gt;Key principles of the Zero Trust network model&lt;/A&gt;&lt;/LI&gt;
&lt;LI&gt;Regulatory compliance&lt;/LI&gt;
&lt;/UL&gt;
&lt;/UL&gt;
&lt;P&gt;👉 Enterprises require:&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;Private connectivity&lt;/LI&gt;
&lt;LI&gt;Controlled access via VNet&lt;/LI&gt;
&lt;LI&gt;DNS-based secure resolution&lt;/LI&gt;
&lt;/UL&gt;
&lt;P&gt;&lt;STRONG&gt;🏗️ Architecture Overview&lt;/STRONG&gt;&lt;/P&gt;
&lt;P&gt;&lt;STRONG&gt;✅ Key Components&lt;/STRONG&gt;&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;Azure OpenAI Service&lt;/LI&gt;
&lt;LI&gt;Azure Virtual Network (VNet)&lt;/LI&gt;
&lt;LI&gt;Private Endpoint&lt;/LI&gt;
&lt;LI&gt;Private DNS Zone&lt;/LI&gt;
&lt;LI&gt;Application (VM / App Service / AKS)&lt;/LI&gt;
&lt;/UL&gt;
&lt;P&gt;&lt;STRONG&gt;✅ High-Level Flow&lt;/STRONG&gt;&lt;/P&gt;
&lt;OL&gt;
&lt;LI&gt;Application sends request to OpenAI endpoint&lt;/LI&gt;
&lt;LI&gt;DNS resolves endpoint → Private IP&lt;/LI&gt;
&lt;LI&gt;Traffic routed inside VNet&lt;/LI&gt;
&lt;LI&gt;No internet exposure&lt;/LI&gt;
&lt;/OL&gt;
&lt;P&gt;👉 A private endpoint assigns a private IP address inside the VNet, so traffic to the service travels over the Azure backbone network instead of the public internet.&lt;/P&gt;
&lt;P&gt;&lt;STRONG&gt;🔹 Architecture Diagram Description&lt;/STRONG&gt;&lt;/P&gt;
&lt;img&gt;Diagram 01: Architecture &lt;SPAN style="color: rgb(112, 112, 112);" data-mce-style="color: rgb(112, 112, 112);"&gt;Enhancing Enterprise AI Deployments with Zero Trust Networking&lt;/SPAN&gt;&lt;/img&gt;
&lt;P&gt;&lt;STRONG style="color: rgb(30, 30, 30);"&gt;End-to-End Flow&lt;/STRONG&gt;&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;User authenticates via Entra ID (MFA) &lt;A class="lia-external-url" href="https://learn.microsoft.com/en-us/entra/identity/authentication/concept-mfa-howitworks" target="_blank" rel="noopener"&gt;Microsoft Entra multifactor authentication&lt;/A&gt;&lt;/LI&gt;
&lt;LI&gt;Traffic passes through WAF (threat filtering)&lt;/LI&gt;
&lt;LI&gt;Enters private Azure VNet&lt;/LI&gt;
&lt;LI&gt;API Management enforces policies&lt;/LI&gt;
&lt;LI&gt;AI services are accessed via private endpoints&lt;/LI&gt;
&lt;LI&gt;Data is securely fetched from private storage/databases&lt;/LI&gt;
&lt;LI&gt;Monitoring tools track all activity continuously&lt;/LI&gt;
&lt;/UL&gt;
&lt;P&gt;&lt;STRONG&gt;Key Value Proposition: &lt;/STRONG&gt;This architecture ensures:&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;🚫 No public exposure of AI services or data&lt;/LI&gt;
&lt;LI&gt;🔐 Identity-based access instead of network trust&lt;/LI&gt;
&lt;LI&gt;🌐 Fully private, isolated network communication&lt;/LI&gt;
&lt;LI&gt;⚡ Secure and scalable AI workloads&lt;/LI&gt;
&lt;LI&gt;🛡️ Defense-in-depth with monitoring and policy enforcement&lt;/LI&gt;
&lt;/UL&gt;
&lt;P&gt;&lt;STRONG&gt;Note: &lt;/STRONG&gt;This architecture demonstrates how enterprises can securely operationalize AI at scale by combining private networking, identity-driven access, and continuous monitoring—fully aligned with Zero Trust principles.&lt;/P&gt;
&lt;P&gt;&lt;STRONG&gt;🔍 Critical Concept: Private Endpoint&lt;/STRONG&gt;&lt;/P&gt;
&lt;P&gt;A Private Endpoint:&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;Creates a network interface in your VNet&lt;/LI&gt;
&lt;LI&gt;Assigns a private IP address&lt;/LI&gt;
&lt;LI&gt;Maps to Azure OpenAI service&lt;/LI&gt;
&lt;LI&gt;Redirects traffic internally&lt;/LI&gt;
&lt;/UL&gt;
&lt;P&gt;👉 Result:&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;No public internet usage&lt;/LI&gt;
&lt;LI&gt;Fully isolated communication&lt;/LI&gt;
&lt;/UL&gt;
&lt;P&gt;&lt;STRONG&gt;🔍 Critical Concept: DNS Resolution&lt;/STRONG&gt;&lt;/P&gt;
&lt;P&gt;&lt;STRONG&gt;Why is DNS critical?&lt;/STRONG&gt;&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;The OpenAI endpoint keeps its public FQDN even after a private endpoint is added&lt;/LI&gt;
&lt;LI&gt;Inside the VNet, that FQDN must resolve to the private IP instead&lt;/LI&gt;
&lt;/UL&gt;
&lt;P&gt;👉 Without correct DNS:&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;Traffic goes to public endpoint&lt;/LI&gt;
&lt;LI&gt;Security is broken&lt;/LI&gt;
&lt;/UL&gt;
&lt;P&gt;&lt;STRONG&gt;How it works&lt;/STRONG&gt;&lt;/P&gt;
&lt;OL&gt;
&lt;LI&gt;Public DNS CNAME → Private Link domain&lt;/LI&gt;
&lt;LI&gt;Private DNS overrides resolution&lt;/LI&gt;
&lt;LI&gt;FQDN resolves to Private Endpoint IP&lt;/LI&gt;
&lt;/OL&gt;
&lt;P&gt;👉 DNS ensures that traffic routes correctly to private endpoint&lt;/P&gt;
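&lt;P&gt;The three steps above can be modeled with a few dictionaries. This is a deliberately simplified simulation, not a real resolver: the resource name contoso-ai and all IP addresses are made up, and real public resolution involves further CNAME hops that are collapsed here into a single A record.&lt;/P&gt;

```python
from typing import Dict, Optional

# All names and addresses below are made up for illustration.
PUBLIC_CNAMES = {
    "contoso-ai.openai.azure.com": "contoso-ai.privatelink.openai.azure.com",
}
PUBLIC_A = {"contoso-ai.privatelink.openai.azure.com": "203.0.113.10"}
PRIVATE_ZONE_A = {"contoso-ai.privatelink.openai.azure.com": "10.0.1.4"}


def resolve(fqdn: str, private_zone: Optional[Dict[str, str]]) -> str:
    """Follow the public CNAME, then prefer the linked private zone's A record."""
    name = PUBLIC_CNAMES.get(fqdn, fqdn)
    if private_zone is not None and name in private_zone:
        return private_zone[name]
    return PUBLIC_A[name]


if __name__ == "__main__":
    # Client in a VNet with the private zone linked: private endpoint IP.
    print(resolve("contoso-ai.openai.azure.com", PRIVATE_ZONE_A))
    # Client without the private zone: falls through to the public address.
    print(resolve("contoso-ai.openai.azure.com", None))
```

&lt;P&gt;The second call illustrates the failure mode described earlier: without the private zone linked to the VNet, the same FQDN resolves to the public endpoint.&lt;/P&gt;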
&lt;P&gt;&lt;STRONG&gt;🧱 Required Private DNS Zones&lt;/STRONG&gt;&lt;/P&gt;
&lt;P&gt;For Azure OpenAI:&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;privatelink.openai.azure.com&lt;/LI&gt;
&lt;LI&gt;privatelink.cognitiveservices.azure.com&lt;/LI&gt;
&lt;/UL&gt;
&lt;P&gt;👉 These zones map:&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;OpenAI endpoint → Private IP&lt;/LI&gt;
&lt;/UL&gt;
&lt;P&gt;👉 Important: Each Private Endpoint must have proper DNS mapping&lt;/P&gt;
&lt;P&gt;&lt;STRONG&gt;⚙️ Step-by-Step Configuration&lt;/STRONG&gt;&lt;/P&gt;
&lt;P&gt;&lt;STRONG&gt;✅ Step 1: Create Virtual Network&lt;/STRONG&gt;&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;Create VNet with:&lt;/LI&gt;
&lt;UL&gt;
&lt;LI&gt;App subnet&lt;/LI&gt;
&lt;LI&gt;Private Endpoint subnet&lt;/LI&gt;
&lt;/UL&gt;
&lt;/UL&gt;
&lt;P&gt;👉 Best practice:&lt;/P&gt;
&lt;P&gt;Use dedicated subnet for private endpoints&lt;/P&gt;
&lt;P&gt;&lt;STRONG&gt;✅ Step 2: Create Azure OpenAI Resource&lt;/STRONG&gt;&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;Go to Azure Portal&lt;/LI&gt;
&lt;LI&gt;Create Azure OpenAI&lt;/LI&gt;
&lt;LI&gt;Select region &amp;amp; resource group&lt;/LI&gt;
&lt;/UL&gt;
&lt;P&gt;👉 Note:&lt;/P&gt;
&lt;P&gt;The OpenAI resource doesn’t need to be in the same region as the VNet.&lt;/P&gt;
&lt;P&gt;&lt;STRONG&gt;✅ Step 3: Disable Public Network Access&lt;/STRONG&gt;&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;Navigate to:&lt;/LI&gt;
&lt;UL&gt;
&lt;LI&gt;Networking → Public Access&lt;/LI&gt;
&lt;/UL&gt;
&lt;LI&gt;Set:&lt;/LI&gt;
&lt;/UL&gt;
&lt;P&gt;Public Network Access = Disabled&lt;/P&gt;
&lt;P&gt;👉 Ensures service is not accessible via internet&lt;/P&gt;
&lt;P&gt;&lt;STRONG&gt;✅ Step 4: Create Private Endpoint&lt;/STRONG&gt;&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;Go to OpenAI → Networking → Private Endpoint&lt;/LI&gt;
&lt;/UL&gt;
&lt;P&gt;Configure:&lt;/P&gt;
&lt;DIV class="styles_lia-table-wrapper__h6Xo9 styles_table-responsive__MW0lN"&gt;&lt;table border="1" style="border-width: 1px;"&gt;&lt;thead&gt;&lt;tr&gt;&lt;td&gt;
&lt;P&gt;&lt;STRONG&gt;Setting&lt;/STRONG&gt;&lt;/P&gt;
&lt;/td&gt;&lt;td&gt;
&lt;P&gt;&lt;STRONG&gt;Value&lt;/STRONG&gt;&lt;/P&gt;
&lt;/td&gt;&lt;/tr&gt;&lt;/thead&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td&gt;
&lt;P&gt;VNet&lt;/P&gt;
&lt;/td&gt;&lt;td&gt;
&lt;P&gt;Your VNet&lt;/P&gt;
&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;
&lt;P&gt;Subnet&lt;/P&gt;
&lt;/td&gt;&lt;td&gt;
&lt;P&gt;Private Endpoint Subnet&lt;/P&gt;
&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;
&lt;P&gt;Resource Type&lt;/P&gt;
&lt;/td&gt;&lt;td&gt;
&lt;P&gt;Cognitive Services&lt;/P&gt;
&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;
&lt;P&gt;Sub-resource&lt;/P&gt;
&lt;/td&gt;&lt;td&gt;
&lt;P&gt;account&lt;/P&gt;
&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;colgroup&gt;&lt;col style="width: 50.00%" /&gt;&lt;col style="width: 50.00%" /&gt;&lt;/colgroup&gt;&lt;/table&gt;&lt;/DIV&gt;
&lt;P&gt;👉 This creates:&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;Private IP in subnet&lt;/LI&gt;
&lt;LI&gt;Network interface mapping&lt;/LI&gt;
&lt;/UL&gt;
&lt;P&gt;&lt;STRONG&gt;✅ Step 5: Configure Private DNS Zone&lt;/STRONG&gt;&lt;/P&gt;
&lt;P&gt;Create:&lt;/P&gt;
&lt;P&gt;privatelink.openai.azure.com&lt;/P&gt;
&lt;P&gt;Then:&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;Link DNS zone to VNet&lt;/LI&gt;
&lt;LI&gt;Add A record automatically (or manually)&lt;/LI&gt;
&lt;/UL&gt;
&lt;P&gt;👉 DNS maps:&lt;/P&gt;
&lt;P&gt;&amp;lt;your-openai-name&amp;gt;.openai.azure.com → Private IP&lt;/P&gt;
&lt;P&gt;👉 DNS resolution ensures traffic flows internally&lt;/P&gt;
&lt;P&gt;&lt;STRONG&gt;✅ Step 6: Validate Connectivity&lt;/STRONG&gt;&lt;/P&gt;
&lt;P&gt;From VM inside VNet:&lt;/P&gt;
&lt;P&gt;nslookup &amp;lt;openai-name&amp;gt;.openai.azure.com&lt;/P&gt;
&lt;P&gt;✅ Expected output:&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;Private IP (e.g., 10.x.x.x)&lt;/LI&gt;
&lt;/UL&gt;
&lt;P&gt;Then test API call → should work&lt;/P&gt;
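&lt;P&gt;The nslookup check can also be scripted, for example as a startup or health check inside the application. The sketch below uses only the Python standard library; the helper name resolves_privately is hypothetical, and the resolver is injectable so the logic can be exercised without network access. Run it from a machine inside the VNet with your own endpoint hostname:&lt;/P&gt;

```python
import ipaddress
import socket
from typing import Callable


def resolves_privately(
    hostname: str,
    resolver: Callable[[str], str] = socket.gethostbyname,
) -> bool:
    """Return True when hostname resolves to a private-range IP address."""
    address = resolver(hostname)
    return ipaddress.ip_address(address).is_private


if __name__ == "__main__":
    # Stub resolvers stand in for real DNS so the example runs anywhere.
    print(resolves_privately("example.internal", resolver=lambda host: "10.0.1.4"))
    print(resolves_privately("example.public", resolver=lambda host: "8.8.8.8"))
```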
&lt;P&gt;&lt;STRONG&gt;✅ Step 7: Application Integration&lt;/STRONG&gt;&lt;/P&gt;
&lt;P&gt;Your application (AKS / VM / App Service):&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;Calls OpenAI endpoint&lt;/LI&gt;
&lt;LI&gt;Traffic resolves to private IP&lt;/LI&gt;
&lt;LI&gt;Routed via VNet&lt;/LI&gt;
&lt;/UL&gt;
&lt;P&gt;👉 Fully secure AI access&lt;/P&gt;
&lt;P&gt;&lt;STRONG&gt;🔐 Security Best Practices&lt;/STRONG&gt;&lt;/P&gt;
&lt;P&gt;✔ Disable public access completely&lt;BR /&gt;✔ Use Private Endpoint for all AI services&lt;BR /&gt;✔ Use NSG + Firewall for segmentation&lt;BR /&gt;✔ Use Managed Identity instead of API keys&lt;BR /&gt;✔ Monitor via Azure Monitor&lt;/P&gt;
&lt;P&gt;👉 Private endpoints ensure traffic stays inside Azure backbone network&lt;/P&gt;
&lt;P&gt;&lt;STRONG&gt;🏢 Real-World Enterprise Use Case&lt;/STRONG&gt;&lt;/P&gt;
&lt;P&gt;Example:&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;Banking application using OpenAI&lt;/LI&gt;
&lt;LI&gt;Hosted on AKS&lt;/LI&gt;
&lt;LI&gt;Uses:&lt;/LI&gt;
&lt;UL&gt;
&lt;LI&gt;Private Endpoint&lt;/LI&gt;
&lt;LI&gt;APIM&lt;/LI&gt;
&lt;LI&gt;DNS resolution&lt;/LI&gt;
&lt;/UL&gt;
&lt;/UL&gt;
&lt;P&gt;👉 Result:&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;No internet exposure&lt;/LI&gt;
&lt;LI&gt;Compliance with regulations&lt;/LI&gt;
&lt;LI&gt;Secure data processing&lt;/LI&gt;
&lt;/UL&gt;
&lt;P&gt;&lt;STRONG&gt;✅ Key Benefits&lt;/STRONG&gt;&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;🔒 Zero internet exposure&lt;/LI&gt;
&lt;LI&gt;🌐 Private connectivity&lt;/LI&gt;
&lt;LI&gt;🛡️ Zero Trust architecture&lt;/LI&gt;
&lt;LI&gt;⚡ Reliable and low latency&lt;/LI&gt;
&lt;LI&gt;🧩 Seamless app integration&lt;/LI&gt;
&lt;/UL&gt;
&lt;P&gt;&lt;STRONG&gt;🧾 Conclusion&lt;/STRONG&gt;&lt;/P&gt;
&lt;P&gt;Azure OpenAI is powerful, but security architecture is critical for enterprise adoption.&lt;/P&gt;
&lt;P&gt;By using:&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;Private Endpoints&lt;/LI&gt;
&lt;LI&gt;Private DNS Zones&lt;/LI&gt;
&lt;LI&gt;VNet integration&lt;/LI&gt;
&lt;/UL&gt;
&lt;P&gt;You can build a secure, scalable, and compliant AI solution.&lt;/P&gt;</description>
      <pubDate>Fri, 24 Apr 2026 16:08:22 GMT</pubDate>
      <guid>https://techcommunity.microsoft.com/t5/azure-architecture-blog/enhancing-enterprise-ai-deployments-with-zero-trust-networking/ba-p/4513662</guid>
      <dc:creator>kirankumar_manchiwar04</dc:creator>
      <dc:date>2026-04-24T16:08:22Z</dc:date>
    </item>
    <item>
      <title>When RAG Isn’t Enough: Moving from Retrieval to Relationship-Aware Systems in Enterprise AI</title>
      <link>https://techcommunity.microsoft.com/t5/azure-architecture-blog/when-rag-isn-t-enough-moving-from-retrieval-to-relationship/ba-p/4514185</link>
      <description>&lt;H3&gt;&lt;STRONG&gt;The Problem&lt;/STRONG&gt;&lt;/H3&gt;
&lt;P&gt;In an enterprise AI scenario, the goal was to map structured feature data to relevant sections within large technical documents.&lt;/P&gt;
&lt;P&gt;At a glance, this appears to be a straightforward semantic matching problem. Initial results using semantic search were promising. However, as the system was used more extensively, certain issues became apparent:&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;Inconsistent mappings across similar inputs&lt;/LI&gt;
&lt;LI&gt;Occasional matches to contextually unrelated sections&lt;/LI&gt;
&lt;LI&gt;Variability in results across repeated runs&lt;/LI&gt;
&lt;/UL&gt;
&lt;P&gt;Despite multiple optimizations, the system continued to produce outcomes that lacked reliability.&lt;/P&gt;
&lt;P&gt;This pointed to a deeper realization:&lt;/P&gt;
&lt;P&gt;The challenge was not just retrieval quality, but the absence of structure in how retrieval was being guided.&lt;/P&gt;
&lt;H5&gt;&lt;STRONG&gt;Initial Approach: Retrieval-Augmented Generation (RAG)&lt;/STRONG&gt;&lt;/H5&gt;
&lt;P&gt;The system followed a standard RAG architecture:&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;Documents indexed using embeddings&lt;/LI&gt;
&lt;LI&gt;Semantic similarity used for retrieval&lt;/LI&gt;
&lt;LI&gt;Retrieved context passed to a language model for processing&lt;/LI&gt;
&lt;/UL&gt;
&lt;P&gt;RAG is highly effective in scenarios involving unstructured data, offering flexibility and strong contextual understanding.&lt;/P&gt;
&lt;P&gt;However, an important limitation emerged:&lt;/P&gt;
&lt;P&gt;RAG operates on semantic similarity but does not inherently understand relationships or domain constraints.&lt;/P&gt;
&lt;H5&gt;&lt;STRONG&gt;Observed Challenges&lt;/STRONG&gt;&lt;/H5&gt;
&lt;OL&gt;
&lt;LI&gt;Lack of Contextual Boundaries&lt;/LI&gt;
&lt;/OL&gt;
&lt;P&gt;Concepts with similar terminology were sometimes mapped across unrelated domains due to overlapping language. Without domain awareness, the system struggled to enforce meaningful boundaries.&lt;/P&gt;
&lt;OL start="2"&gt;
&lt;LI&gt;Underutilization of Existing Structure&lt;/LI&gt;
&lt;/OL&gt;
&lt;P&gt;The data already contained valuable structure:&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;Features were organized into categories&lt;/LI&gt;
&lt;LI&gt;Categories aligned with specific document sections&lt;/LI&gt;
&lt;LI&gt;Relationships followed consistent, rule-driven patterns&lt;/LI&gt;
&lt;/UL&gt;
&lt;P&gt;This structure was not incorporated into the retrieval process, leading to missed opportunities for improving accuracy.&lt;/P&gt;
&lt;OL start="3"&gt;
&lt;LI&gt;Variability in Deterministic Scenarios&lt;/LI&gt;
&lt;/OL&gt;
&lt;P&gt;Some mappings followed clear and consistent rules. However, treating all queries as probabilistic retrieval problems introduced unnecessary variability and reduced confidence in the results.&lt;/P&gt;
&lt;H5&gt;&lt;STRONG&gt;Introducing Structure with Knowledge Graphs&lt;/STRONG&gt;&lt;/H5&gt;
&lt;P&gt;To address these challenges, a structured layer based on Knowledge Graph concepts was introduced.&lt;/P&gt;
&lt;P&gt;At a high level, relationships were modeled as:&lt;/P&gt;
&lt;UL&gt;
&lt;LI data-section-id="12odvye" data-start="661" data-end="699"&gt;&lt;STRONG data-start="663" data-end="697"&gt;Entity → belongs to → Category&lt;/STRONG&gt;&lt;/LI&gt;
&lt;LI data-section-id="1ypknl9" data-start="700" data-end="747"&gt;&lt;STRONG data-start="702" data-end="745"&gt;Category → linked to → Knowledge Source&lt;/STRONG&gt;&lt;/LI&gt;
&lt;LI data-section-id="u3fe4b" data-start="748" data-end="804"&gt;&lt;STRONG data-start="750" data-end="804"&gt;Knowledge Source → contains → Relevant Information&lt;/STRONG&gt;&lt;/LI&gt;
&lt;/UL&gt;
&lt;P&gt;This enabled:&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;Constraint enforcement for rule-based mappings&lt;/LI&gt;
&lt;LI&gt;Relationship traversal across hierarchical data&lt;/LI&gt;
&lt;LI&gt;Improved explainability through traceable decision paths&lt;/LI&gt;
&lt;/UL&gt;
&lt;H5&gt;&lt;STRONG&gt;Hybrid Approach: Combining Knowledge Graph and RAG&lt;/STRONG&gt;&lt;/H5&gt;
&lt;P&gt;Rather than replacing RAG, the system evolved into a hybrid architecture:&lt;/P&gt;
&lt;P&gt;Step 1: Knowledge Graph for filtering&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;Apply domain constraints&lt;/LI&gt;
&lt;LI&gt;Narrow down the search space to relevant sections&lt;/LI&gt;
&lt;/UL&gt;
&lt;P&gt;Step 2: RAG for semantic refinement&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;Perform retrieval within the filtered scope&lt;/LI&gt;
&lt;LI&gt;Extract context with greater precision&lt;/LI&gt;
&lt;/UL&gt;
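&lt;P&gt;The two steps can be sketched as follows. This is an illustration under stated assumptions, not the system described in this article: the chunks, the entity-to-category lookup, and the word-overlap score (a stand-in for embedding similarity) are all hypothetical.&lt;/P&gt;

```python
# Hypothetical chunks; each one is tagged with the category it belongs to.
CHUNKS = [
    {"id": "c1", "category": "Billing", "text": "refund requests are processed within five days"},
    {"id": "c2", "category": "Billing", "text": "invoices are issued monthly"},
    {"id": "c3", "category": "Shipping", "text": "refund of shipping fees for late delivery"},
]

# Hypothetical graph lookup: entity type -> allowed category.
ENTITY_CATEGORY = {"invoice": "Billing", "parcel": "Shipping"}


def overlap_score(query, text):
    """Toy relevance score: number of shared lowercase words.

    Stand-in for embedding similarity in a real RAG pipeline.
    """
    return len(set(query.lower().split()).intersection(text.lower().split()))


def hybrid_retrieve(query, entity_type, top_k=1):
    # Step 1: the graph constraint narrows the search space.
    category = ENTITY_CATEGORY[entity_type]
    scoped = [c for c in CHUNKS if c["category"] == category]
    # Step 2: similarity ranking runs only inside the filtered scope.
    ranked = sorted(scoped, key=lambda c: overlap_score(query, c["text"]), reverse=True)
    return [c["id"] for c in ranked[:top_k]]


print(hybrid_retrieve("refund for late invoice", "invoice"))
```

&lt;P&gt;In this toy data the Shipping chunk shares the most words with the query, so similarity alone would rank it first. The category constraint removes it before ranking, which is the kind of logically incorrect match the hybrid approach is meant to prevent.&lt;/P&gt;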
&lt;H5&gt;&lt;STRONG&gt;Key Insight&lt;/STRONG&gt;&lt;/H5&gt;
&lt;P&gt;The transition from a retrieval-first approach to a constraint-guided retrieval model significantly improved consistency and relevance.&lt;/P&gt;
&lt;H5&gt;&lt;STRONG&gt;When to Use This Approach&lt;/STRONG&gt;&lt;/H5&gt;
&lt;P&gt;RAG is sufficient when:&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;Data is primarily unstructured&lt;/LI&gt;
&lt;LI&gt;Relationships are weak or undefined&lt;/LI&gt;
&lt;LI&gt;Rapid prototyping is required&lt;/LI&gt;
&lt;/UL&gt;
&lt;P&gt;A hybrid approach is beneficial when:&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;The domain includes clear hierarchies or taxonomies&lt;/LI&gt;
&lt;LI&gt;Relationships are deterministic or rule-driven&lt;/LI&gt;
&lt;LI&gt;Consistency and explainability are important&lt;/LI&gt;
&lt;LI&gt;Pure semantic retrieval produces logically incorrect results&lt;/LI&gt;
&lt;/UL&gt;
&lt;H5&gt;&lt;STRONG&gt;Key Takeaways&lt;/STRONG&gt;&lt;/H5&gt;
&lt;UL&gt;
&lt;LI&gt;Leverage existing structure in enterprise data instead of relying solely on semantic similarity&lt;/LI&gt;
&lt;LI&gt;Analyze failure patterns to identify missing constraints&lt;/LI&gt;
&lt;LI&gt;Combine structured and semantic approaches for robust system design&lt;/LI&gt;
&lt;LI&gt;Prioritize explainability for production-grade AI systems&lt;/LI&gt;
&lt;/UL&gt;
&lt;H5&gt;&lt;STRONG&gt;Broader Perspective&lt;/STRONG&gt;&lt;/H5&gt;
&lt;P&gt;As enterprise AI systems scale, it becomes increasingly important to balance:&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;Semantic understanding (RAG)&lt;/LI&gt;
&lt;LI&gt;Structured reasoning (Knowledge Graphs)&lt;/LI&gt;
&lt;/UL&gt;
&lt;P&gt;These approaches are not competing—they are complementary. When combined effectively, they enable systems that are both flexible and reliable.&lt;/P&gt;
&lt;H5&gt;&lt;STRONG&gt;Closing Thought&lt;/STRONG&gt;&lt;/H5&gt;
&lt;P&gt;A key realization from this experience was:&lt;/P&gt;
&lt;P&gt;Instead of focusing only on improving retrieval, it is equally important to understand how domain structure can guide and constrain that retrieval.&lt;/P&gt;
&lt;P&gt;&lt;EM&gt;This article reflects personal learnings and general architectural patterns.&lt;/EM&gt;&lt;/P&gt;</description>
      <pubDate>Fri, 24 Apr 2026 07:04:57 GMT</pubDate>
      <guid>https://techcommunity.microsoft.com/t5/azure-architecture-blog/when-rag-isn-t-enough-moving-from-retrieval-to-relationship/ba-p/4514185</guid>
      <dc:creator>ankitasarkar</dc:creator>
      <dc:date>2026-04-24T07:04:57Z</dc:date>
    </item>
    <item>
      <title>Centralizing Enterprise API Access for Agent-Based Architectures</title>
      <link>https://techcommunity.microsoft.com/t5/azure-architecture-blog/centralizing-enterprise-api-access-for-agent-based-architectures/ba-p/4511792</link>
      <description>&lt;H2&gt;Problem Statement&lt;/H2&gt;
&lt;P&gt;When building AI agents or automation solutions, calling enterprise APIs directly often means configuring individual HTTP actions within each agent for every API. While this works for simple scenarios, it quickly becomes repetitive and difficult to manage as complexity grows.&lt;/P&gt;
&lt;P&gt;The challenge becomes more pronounced when a single business domain exposes multiple APIs, or when the same APIs are consumed by multiple agents. This leads to duplicated configurations, higher maintenance effort, inconsistent behavior, and increased governance and security risks.&lt;/P&gt;
&lt;P&gt;A more scalable approach is to centralize and reuse API access. By grouping APIs by business domain using an API management layer, shaping those APIs through a Model Context Protocol (MCP) server, and exposing the MCP server as a standardized tool or connector, agents can consume business capabilities in a consistent, reusable, and governable manner.&lt;/P&gt;
&lt;P&gt;This pattern not only reduces duplication and configuration overhead but also enables stronger versioning, security controls, observability, and domain‑driven ownership—making agent-based systems easier to scale and operate in enterprise environments.&lt;/P&gt;
&lt;H2&gt;Designing Agent‑Ready APIs with Azure API Management, an MCP Server, and Copilot Studio&lt;/H2&gt;
&lt;P&gt;As enterprises increasingly adopt AI‑powered assistants and Copilots, API design must evolve to meet the needs of intelligent agents. Traditional APIs, often designed for user interfaces or backend integrations, can expose excessive data, lack intent-level abstraction, and increase security risk when consumed directly by AI systems. This document outlines a practical, enterprise-ready approach: organize APIs in Azure API Management (APIM), introduce a Model Context Protocol (MCP) server to shape and control context, and integrate the solution with Microsoft Copilot Studio. The goal is to make APIs truly agent-ready: secure, scalable, reusable, and easy to govern.&lt;/P&gt;
&lt;H3&gt;Architecture at a glance&lt;/H3&gt;
&lt;OL&gt;
&lt;LI&gt;Back-end services expose domain APIs.&lt;/LI&gt;
&lt;LI&gt;Azure API Management (APIM) groups and governs those APIs (products, policies, authentication, throttling, versions).&lt;/LI&gt;
&lt;LI&gt;An MCP server calls APIM, orchestrates/filters responses, and returns concise, model-friendly outputs.&lt;/LI&gt;
&lt;LI&gt;Copilot Studio connects to the MCP server and invokes a small set of predictable operations to satisfy user intents.&lt;/LI&gt;
&lt;/OL&gt;
&lt;H3&gt;Why Traditional API Designs Fall Short for AI Agents&lt;/H3&gt;
&lt;P&gt;Enterprise APIs have historically been built around CRUD operations and service-to-service integration patterns. While this works well for deterministic applications, AI agents work best with intent-driven operations and context-aware responses. When agents consume traditional APIs directly, common issues include overly verbose payloads, multiple calls to satisfy a single user intent, and insufficient guardrails for read vs. write operations. The result can be unpredictable agent behavior that is difficult to test, validate, and govern.&lt;/P&gt;
&lt;H3&gt;Structuring APIs Effectively in Azure API Management&lt;/H3&gt;
&lt;P&gt;Azure API Management (APIM) is the control plane between enterprise systems and AI agents. A well-structured APIM instance improves security, discoverability, and governance through products, policies, subscriptions, and analytics.&lt;/P&gt;
&lt;P&gt;&lt;STRONG&gt;Key design principles for agent consumption&lt;/STRONG&gt;&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;Organize APIs by &lt;EM&gt;business capability&lt;/EM&gt; (for example, Customer, Orders, Billing) rather than technical layers.&lt;/LI&gt;
&lt;LI&gt;Expose agent-facing APIs via dedicated APIM products to enable controlled access, throttling, versioning, and independent lifecycle management.&lt;/LI&gt;
&lt;LI&gt;Prefer read-only operations where possible; scope write operations narrowly and protect them with explicit checks, approvals, and least-privilege identities.&lt;/LI&gt;
&lt;/UL&gt;
&lt;H3&gt;The Role of the MCP Server in Agent‑Based Architectures&lt;/H3&gt;
&lt;P&gt;APIM provides governance and security, but agents also need an intent-level interface and model-friendly responses. A Model Context Protocol (MCP) server fills this gap by acting as a mediator between Copilot Studio and APIM-exposed APIs. Instead of exposing many back-end endpoints directly to the agent, the MCP server can orchestrate multiple API calls, filter irrelevant fields, enforce business rules, enrich results with business context, and emit concise, predictable JSON outputs in LLM‑friendly schemas. With this abstraction layer, Copilot interactions become simpler, safer, and more deterministic: the agent works with a small number of well‑defined MCP operations that encapsulate enterprise logic without exposing internal complexity, which makes agent behavior more reliable and easier to validate.&lt;/P&gt;
&lt;H3&gt;Designing an Effective MCP Server&lt;/H3&gt;
&lt;P&gt;An MCP server should have a focused responsibility: shaping context for AI models. It should not replace core back-end services; it should adapt enterprise capabilities and data for agent consumption.&lt;/P&gt;
&lt;P&gt;The MCP protocol itself does not orchestrate enterprise workflows or apply business logic. It standardizes how agents discover and invoke external tools and APIs by exposing them through a structured protocol interface. Workflow orchestration, intent resolution, and policy-driven execution are handled by the agent runtime or host framework. The multi-step retrieval and rule checks listed below are lightweight shaping steps in the server implementation, not a replacement for that orchestration.&lt;/P&gt;
&lt;P&gt;&lt;STRONG&gt;What MCP should do&lt;/STRONG&gt;&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;Call APIM-managed APIs and orchestrate multi-step retrieval when needed.&lt;/LI&gt;
&lt;LI&gt;Apply security checks and business rules consistently.&lt;/LI&gt;
&lt;LI&gt;Filter and minimize payloads (return only fields needed for the intent).&lt;/LI&gt;
&lt;LI&gt;Normalize and reshape responses into stable, predictable JSON schemas.&lt;/LI&gt;
&lt;LI&gt;Handle errors and edge cases with safe, descriptive messages.&lt;/LI&gt;
&lt;/UL&gt;
&lt;P&gt;&lt;STRONG&gt;What MCP should not do&lt;/STRONG&gt; Avoid implementing complex transactional workflows, long-running processes, or UI-specific formatting in MCP. Keep it lightweight so it remains scalable, testable, and easy to maintain.&lt;/P&gt;
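&lt;P&gt;The shaping responsibilities listed above (filter, normalize, fail safely) can be sketched in a few lines. The field names and the order-status intent are hypothetical; this shows the idea of a stable output schema and is not a prescribed implementation.&lt;/P&gt;

```python
# Hypothetical intent-level schema: the only fields the agent needs for "order status".
ORDER_STATUS_FIELDS = ("orderId", "status", "estimatedDelivery")


def shape_order_status(raw):
    """Reduce a verbose back-end payload to a stable, minimal shape."""
    if not isinstance(raw, dict) or "orderId" not in raw:
        # Safe, descriptive error without leaking back-end details.
        return {"ok": False, "error": "Order not found or response was malformed."}
    # Every key is always present, so the schema stays predictable for the model.
    return {"ok": True, "order": {field: raw.get(field) for field in ORDER_STATUS_FIELDS}}


BACKEND_PAYLOAD = {"orderId": "12345", "status": "Shipped", "internalNotes": "call back"}
print(shape_order_status(BACKEND_PAYLOAD))
```

&lt;P&gt;Fields the intent does not need (here, the internal notes) never reach the agent, and a missing field appears as an explicit null rather than changing the shape of the response.&lt;/P&gt;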
&lt;H3&gt;Step-by-step guide&lt;/H3&gt;
&lt;H3&gt;1) Create an MCP server in Azure API Management (APIM)&lt;/H3&gt;
&lt;OL&gt;
&lt;LI&gt;Open the Azure portal (&lt;A href="http://portal.azure.com/" target="_blank" rel="noopener" data-test-app-aware-link=""&gt;portal.azure.com&lt;/A&gt;).&lt;/LI&gt;
&lt;LI&gt;Go to your &lt;STRONG&gt;API Management&lt;/STRONG&gt; instance.&lt;/LI&gt;
&lt;LI&gt;In the left navigation, expand &lt;STRONG&gt;APIs&lt;/STRONG&gt;.&lt;/LI&gt;
&lt;LI&gt;Create (or select) an &lt;STRONG&gt;API group&lt;/STRONG&gt; for the business domain you want to expose (for example, Orders or Customers).&lt;/LI&gt;
&lt;LI&gt;Add the relevant APIs/operations to that API group.&lt;/LI&gt;
&lt;LI&gt;Create or select an &lt;STRONG&gt;APIM product&lt;/STRONG&gt; dedicated for agent usage, and ensure the product requires a &lt;STRONG&gt;subscription&lt;/STRONG&gt; (subscription key).&lt;/LI&gt;
&lt;LI&gt;Create an &lt;STRONG&gt;MCP server&lt;/STRONG&gt; in APIM and map it to the API (or API group) you want to expose as MCP operations.&lt;/LI&gt;
&lt;LI&gt;In the MCP server settings, ensure &lt;STRONG&gt;Subscription key required&lt;/STRONG&gt; is enabled.&lt;/LI&gt;
&lt;LI&gt;From the product’s &lt;STRONG&gt;Subscriptions&lt;/STRONG&gt; page, copy the subscription key you will use in Copilot Studio.&lt;/LI&gt;
&lt;/OL&gt;
&lt;P&gt;&lt;STRONG&gt;Note:&lt;/STRONG&gt; Using an API Management subscription key to access MCP operations is one supported way to authenticate and consume enterprise APIs. However, this approach is best suited for initial setups, demos, or scenarios where key-based access is explicitly required.&lt;/P&gt;
&lt;P&gt;For production‑grade enterprise solutions, Microsoft recommends using &lt;STRONG&gt;managed identity–based access control&lt;/STRONG&gt;. Managed identities for Azure resources eliminate the need to manage secrets such as subscription keys or client secrets, integrate natively with &lt;STRONG&gt;Microsoft Entra ID&lt;/STRONG&gt;, and support fine‑grained role‑based access control (RBAC). This approach improves security posture while significantly reducing operational and governance overhead for agent and service‑to‑service integrations.&lt;/P&gt;
&lt;P&gt;Wherever possible, agents and MCP servers should authenticate using managed identities to ensure secure, scalable, and compliant access to enterprise APIs.&lt;/P&gt;
&lt;H3&gt;2) Create a Copilot Studio agent and connect to the APIM MCP server using a subscription key&lt;/H3&gt;
&lt;P&gt;Copilot Studio natively supports Model Context Protocol (MCP) servers as tools. When an agent is connected to an MCP server, the tool metadata—including operation names, inputs, and outputs—is automatically discovered and kept in sync, reducing manual configuration and maintenance overhead.&lt;/P&gt;
&lt;OL&gt;
&lt;LI&gt;Sign in to Copilot Studio.&lt;/LI&gt;
&lt;LI&gt;Create a new &lt;STRONG&gt;agent&lt;/STRONG&gt; and add clear instructions describing when to use the MCP tool and how to present results (for example, concise summaries plus key fields).&lt;/LI&gt;
&lt;LI&gt;Open &lt;STRONG&gt;Tools&lt;/STRONG&gt; &amp;gt; &lt;STRONG&gt;Add tool&lt;/STRONG&gt; &amp;gt; &lt;STRONG&gt;Model Context Protocol&lt;/STRONG&gt;, then choose &lt;STRONG&gt;Create&lt;/STRONG&gt;.&lt;/LI&gt;
&lt;LI&gt;Enter the MCP server details:
&lt;UL&gt;
&lt;LI&gt;&lt;STRONG&gt;Server endpoint URL&lt;/STRONG&gt;: copy this from your MCP server in APIM.&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;Authentication&lt;/STRONG&gt;: select &lt;STRONG&gt;API Key&lt;/STRONG&gt;.&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;Header name&lt;/STRONG&gt;: use the subscription key header required by your APIM configuration (the APIM default is Ocp-Apim-Subscription-Key).&lt;/LI&gt;
&lt;/UL&gt;
&lt;/LI&gt;
&lt;LI&gt;Select &lt;STRONG&gt;Create new connection&lt;/STRONG&gt;, paste the APIM subscription key, and save.&lt;/LI&gt;
&lt;LI&gt;Test the tool in the agent by prompting for a domain-specific task (for example, “Get order status for 12345”). Validate that responses are concise and that errors are handled safely.&lt;/LI&gt;
&lt;/OL&gt;
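&lt;P&gt;To check the endpoint and key outside Copilot Studio, a request like the following sketch may help. It only builds the request and does not send it. The endpoint URL and key are placeholders, the header is the APIM default subscription key header, and the body is the JSON-RPC tools/list call defined by MCP. Depending on the server, an initialize handshake may be required before tools/list is accepted, so treat this as a starting point rather than a complete client.&lt;/P&gt;

```python
import json
import urllib.request


def build_tools_list_request(endpoint, subscription_key):
    """Build (but do not send) a JSON-RPC request asking the server to list its tools."""
    body = json.dumps({"jsonrpc": "2.0", "id": 1, "method": "tools/list"}).encode("utf-8")
    headers = {
        "Content-Type": "application/json",
        "Accept": "application/json, text/event-stream",
        # Default APIM subscription key header; use your configured name if it differs.
        "Ocp-Apim-Subscription-Key": subscription_key,
    }
    return urllib.request.Request(endpoint, data=body, headers=headers, method="POST")


# Hypothetical endpoint; copy the real one from the MCP server in APIM.
REQUEST = build_tools_list_request("https://contoso.azure-api.net/orders-mcp/mcp", "your-key")
# urllib.request.urlopen(REQUEST) would send it; omitted so the sketch has no side effects.
print(REQUEST.get_method(), REQUEST.full_url)
```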
&lt;H3&gt;Operational best practices and guardrails&lt;/H3&gt;
&lt;UL&gt;
&lt;LI&gt;&lt;STRONG&gt;Least privilege by default:&lt;/STRONG&gt; create separate APIM products and identities for agent scenarios; avoid broad access to internal APIs.&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;Prefer intent-level operations:&lt;/STRONG&gt; expose fewer, higher-level MCP operations instead of many low-level endpoints.&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;Protect write operations:&lt;/STRONG&gt; require explicit parameters, validation, and (when appropriate) approval flows; keep “read” and “write” tools separate.&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;Stable schemas:&lt;/STRONG&gt; return predictable JSON shapes and limit optional fields to reduce prompt brittleness.&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;Observability:&lt;/STRONG&gt; log MCP requests/responses (with sensitive fields redacted), monitor APIM analytics, and set alerts for failures and throttling.&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;Versioning:&lt;/STRONG&gt; version MCP operations and APIM APIs; deprecate safely.&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;Security hygiene:&lt;/STRONG&gt; treat subscription keys as secrets, rotate regularly, and avoid exposing them in prompts or logs.&lt;/LI&gt;
&lt;/UL&gt;
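&lt;P&gt;For the observability guardrail, redaction can be applied before anything is written to a log. The sketch below assumes a hypothetical list of sensitive field names; the real list depends on your data classification.&lt;/P&gt;

```python
import copy

# Hypothetical, lowercase names of fields that must never reach logs.
SENSITIVE_KEYS = {"subscriptionkey", "authorization", "email", "ssn"}


def redact(payload):
    """Return a copy with sensitive values masked, at any nesting depth."""
    if isinstance(payload, dict):
        return {
            key: "***" if str(key).lower() in SENSITIVE_KEYS else redact(value)
            for key, value in payload.items()
        }
    if isinstance(payload, list):
        return [redact(item) for item in payload]
    return copy.deepcopy(payload)


SAMPLE = {"orderId": "1", "customer": {"Email": "ana@contoso.com"}, "items": [{"ssn": "x"}]}
print(redact(SAMPLE))
```

&lt;P&gt;The function returns a new structure and leaves the original payload untouched, so the response sent to the agent is not affected by logging.&lt;/P&gt;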
&lt;H3&gt;Summary&lt;/H3&gt;
&lt;P&gt;As organizations scale agent‑based and Copilot‑driven solutions, directly exposing enterprise APIs to AI agents quickly becomes complex and risky. Centralizing API access through &lt;STRONG&gt;Azure API Management&lt;/STRONG&gt;, shaping agent‑ready context via a &lt;STRONG&gt;Model Context Protocol (MCP) server&lt;/STRONG&gt;, and consuming those capabilities through &lt;STRONG&gt;Copilot Studio&lt;/STRONG&gt; establishes a clean and governable architecture.&lt;/P&gt;
&lt;P&gt;This pattern reduces duplication, enforces consistent security controls, and enables intent‑driven API consumption without exposing unnecessary backend complexity. By combining domain‑aligned API products, lightweight MCP operations, and least‑privilege identity‑based access, enterprises can confidently scale AI agents while maintaining strong governance, observability, and operational control.&lt;/P&gt;
&lt;H2&gt;References&lt;/H2&gt;
&lt;UL&gt;
&lt;LI&gt;&lt;A class="lia-external-url" href="https://learn.microsoft.com/azure/api-management/" target="_blank" rel="noopener"&gt;&lt;STRONG&gt;Azure API Management (APIM) – Overview&lt;/STRONG&gt;&lt;/A&gt;&lt;/LI&gt;
&lt;LI&gt;&lt;A class="lia-external-url" href="https://learn.microsoft.com/azure/api-management/api-management-key-concepts" target="_blank" rel="noopener"&gt;&lt;STRONG&gt;Azure API Management – Key Concepts&lt;/STRONG&gt;&lt;/A&gt;&lt;/LI&gt;
&lt;LI&gt;&lt;A class="lia-external-url" href="https://learn.microsoft.com/azure/developer/azure-mcp-server/" target="_blank" rel="noopener"&gt;&lt;STRONG&gt;Azure MCP Server Documentation (Model Context Protocol)&lt;/STRONG&gt;&lt;/A&gt;&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;&lt;A class="lia-external-url" href="https://learn.microsoft.com/en-us/microsoft-copilot-studio/agent-extend-action-mcp" target="_blank"&gt;Extend your agent with Model Context Protocol&lt;/A&gt;&lt;/STRONG&gt;&lt;/LI&gt;
&lt;LI&gt;&lt;A class="lia-external-url" href="https://learn.microsoft.com/azure/active-directory/managed-identities-azure-resources/overview" target="_blank"&gt;&lt;STRONG&gt;Managed identities for Azure resources – Overview&lt;/STRONG&gt;&lt;/A&gt;&lt;/LI&gt;
&lt;/UL&gt;</description>
      <pubDate>Thu, 23 Apr 2026 21:43:23 GMT</pubDate>
      <guid>https://techcommunity.microsoft.com/t5/azure-architecture-blog/centralizing-enterprise-api-access-for-agent-based-architectures/ba-p/4511792</guid>
      <dc:creator>sbaskaran</dc:creator>
      <dc:date>2026-04-23T21:43:23Z</dc:date>
    </item>
    <item>
      <title>Automatic SSO Takeover in Azure AD B2C Custom Policies</title>
      <link>https://techcommunity.microsoft.com/t5/azure-architecture-blog/automatic-sso-takeover-in-azure-ad-b2c-custom-policies/ba-p/4514167</link>
      <description>&lt;H2 data-section-id="1u9nex9" data-start="431" data-end="450"&gt;Why This Matters&lt;/H2&gt;
&lt;P data-start="452" data-end="597"&gt;As organizations modernize authentication, many shift toward &lt;STRONG data-start="513" data-end="537"&gt;Single Sign-On (SSO)&lt;/STRONG&gt; using providers like &lt;A class="lia-external-url" href="https://learn.microsoft.com/en-us/entra/fundamentals/what-is-entra" target="_blank"&gt;Microsoft Entra ID&lt;/A&gt;.&lt;/P&gt;
&lt;P data-start="599" data-end="748"&gt;But if you already have users in &lt;A class="lia-external-url" href="https://learn.microsoft.com/en-us/azure/active-directory-b2c/" target="_blank"&gt;Azure AD B2C&lt;/A&gt; using local accounts (email + password), the transition isn’t straightforward.&lt;/P&gt;
&lt;P data-start="750" data-end="766"&gt;You’ll run into:&lt;/P&gt;
&lt;UL data-start="768" data-end="1034"&gt;
&lt;LI data-section-id="4q8fxl" data-start="768" data-end="845"&gt;&lt;STRONG data-start="770" data-end="794"&gt;Duplicate identities&lt;/STRONG&gt; when users sign in with SSO using the same email&lt;/LI&gt;
&lt;LI data-section-id="xiomg7" data-start="846" data-end="896"&gt;&lt;STRONG data-start="848" data-end="875"&gt;No clean migration path&lt;/STRONG&gt; for existing users&lt;/LI&gt;
&lt;LI data-section-id="jbkaf5" data-start="897" data-end="969"&gt;&lt;STRONG data-start="899" data-end="916"&gt;Security gaps&lt;/STRONG&gt; if password sign-in remains available after SSO&lt;/LI&gt;
&lt;LI data-section-id="pftjxk" data-start="970" data-end="1034"&gt;&lt;STRONG data-start="972" data-end="988"&gt;Confusing UX&lt;/STRONG&gt; if password reset is still allowed for SSO users&lt;/LI&gt;
&lt;/UL&gt;
&lt;H2 data-section-id="1g0f15" data-start="1041" data-end="1052"&gt;The Goal&lt;/H2&gt;
&lt;P data-start="1054" data-end="1090"&gt;A seamless, secure transition where:&lt;/P&gt;
&lt;UL data-start="1092" data-end="1317"&gt;
&lt;LI data-section-id="11cwd0g" data-start="1092" data-end="1139"&gt;Users keep a &lt;STRONG data-start="1105" data-end="1139"&gt;single identity (objectId)&lt;/STRONG&gt;&lt;/LI&gt;
&lt;LI data-section-id="m6mmrn" data-start="1140" data-end="1194"&gt;SSO is &lt;STRONG data-start="1149" data-end="1173"&gt;automatically linked&lt;/STRONG&gt; to existing accounts&lt;/LI&gt;
&lt;LI data-section-id="1el97or" data-start="1195" data-end="1262"&gt;Password-based flows are &lt;STRONG data-start="1222" data-end="1262"&gt;permanently disabled after migration&lt;/STRONG&gt;&lt;/LI&gt;
&lt;LI data-section-id="1el97or" data-start="1195" data-end="1262"&gt;Non-migrated users continue normal local flows (including Forgot Password)&lt;/LI&gt;
&lt;LI data-section-id="tetug5" data-start="1263" data-end="1317"&gt;No manual migration or user intervention is required&lt;/LI&gt;
&lt;/UL&gt;
&lt;H2 data-section-id="m81flh" data-start="1324" data-end="1354"&gt;The Pattern: “SSO Takeover”&lt;/H2&gt;
&lt;P data-start="1356" data-end="1430"&gt;This approach uses &lt;STRONG data-start="1375" data-end="1426"&gt;custom policies (Identity Experience Framework)&lt;/STRONG&gt; to:&lt;/P&gt;
&lt;OL data-start="1432" data-end="1663"&gt;
&lt;LI data-section-id="9vknj" data-start="1432" data-end="1472"&gt;Detect when a user signs in via SSO&lt;/LI&gt;
&lt;LI data-section-id="1q16x80" data-start="1473" data-end="1529"&gt;Check if a local account exists with the same email&lt;/LI&gt;
&lt;LI data-section-id="swiayc" data-start="1530" data-end="1572"&gt;Automatically &lt;STRONG data-start="1547" data-end="1570"&gt;link the identities&lt;/STRONG&gt;&lt;/LI&gt;
&lt;LI data-section-id="zhdsd" data-start="1573" data-end="1620"&gt;Set a flag: &lt;STRONG&gt;extension_ssoMigrated = true&lt;/STRONG&gt;&lt;/LI&gt;
&lt;LI data-section-id="omkimh" data-start="1621" data-end="1663"&gt;Enforce SSO-only access going forward&lt;/LI&gt;
&lt;/OL&gt;
&lt;DIV class="styles_lia-table-wrapper__h6Xo9 styles_table-responsive__MW0lN"&gt;&lt;table border="1" style="border-width: 1px;"&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;Scenario&lt;/th&gt;&lt;th&gt;Outcome&lt;/th&gt;&lt;/tr&gt;&lt;/thead&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td&gt;Local user (not migrated)&lt;/td&gt;&lt;td&gt;✅ Password sign-in works&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Local user (not migrated) – Forgot Password&lt;/td&gt;&lt;td&gt;✅ Allowed&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;First SSO login (existing user)&lt;/td&gt;&lt;td&gt;✅ Account linked automatically&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;SSO-migrated user&lt;/td&gt;&lt;td&gt;✅ SSO works&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;SSO-migrated user – password login&lt;/td&gt;&lt;td&gt;❌ Blocked&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;SSO-migrated user – password reset&lt;/td&gt;&lt;td&gt;❌ Blocked&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;colgroup&gt;&lt;col style="width: 50.00%" /&gt;&lt;col style="width: 50.00%" /&gt;&lt;/colgroup&gt;&lt;/table&gt;&lt;/DIV&gt;
&lt;H2 data-section-id="1o8ju5a" data-start="2311" data-end="2333"&gt;Key Building Blocks&lt;/H2&gt;
&lt;H3 data-section-id="gjz6en" data-start="2335" data-end="2381"&gt;1. Migration Flag: extension_ssoMigrated&lt;/H3&gt;
&lt;P data-start="2383" data-end="2436"&gt;A custom boolean attribute stored on the user object. This drives all decisions.&lt;/P&gt;
&lt;LI-CODE lang="xml"&gt;&amp;lt;ClaimType Id="extension_ssoMigrated"&amp;gt;
  &amp;lt;DataType&amp;gt;boolean&amp;lt;/DataType&amp;gt;
&amp;lt;/ClaimType&amp;gt;&lt;/LI-CODE&gt;
&lt;H3 data-section-id="x3ph7o" data-start="2605" data-end="2643"&gt;2. Conditional Blocking (Only for Migrated Users)&lt;/H3&gt;
&lt;P data-start="1975" data-end="2001"&gt;A transformation enforces:&lt;/P&gt;
&lt;UL data-start="2003" data-end="2082"&gt;
&lt;LI data-section-id="efkpob" data-start="2003" data-end="2048"&gt;If extension_ssoMigrated = true → block&lt;/LI&gt;
&lt;LI data-section-id="1r1n05t" data-start="2049" data-end="2082"&gt;Otherwise → continue normally&lt;/LI&gt;
&lt;/UL&gt;
&lt;P data-start="2084" data-end="2103"&gt;This is applied in:&lt;/P&gt;
&lt;UL data-start="2104" data-end="2140"&gt;
&lt;LI data-section-id="54pvss" data-start="2104" data-end="2121"&gt;Local sign-in&lt;/LI&gt;
&lt;LI data-section-id="cdogea" data-start="2122" data-end="2140"&gt;Password reset&lt;/LI&gt;
&lt;/UL&gt;
&lt;P data-start="2142" data-end="2170"&gt;Using preconditions ensures:&lt;/P&gt;
&lt;UL data-start="2171" data-end="2245"&gt;
&lt;LI data-section-id="1nii7vh" data-start="2171" data-end="2203"&gt;❌ Migrated users are blocked&lt;/LI&gt;
&lt;LI data-section-id="1804kt5" data-start="2204" data-end="2245"&gt;✅ Non-migrated users are NOT affected&lt;/LI&gt;
&lt;/UL&gt;
&lt;H3 data-section-id="1bta1mv" data-start="2917" data-end="2949"&gt;3. Automatic Account Linking&lt;/H3&gt;
&lt;P data-start="2951" data-end="2968"&gt;During SSO login:&lt;/P&gt;
&lt;UL data-start="2970" data-end="3080"&gt;
&lt;LI data-section-id="1709yb0" data-start="2970" data-end="2995"&gt;Look up user by email&lt;/LI&gt;
&lt;LI data-section-id="x4y2v0" data-start="2996" data-end="3041"&gt;If found → attach alternativeSecurityId&lt;/LI&gt;
&lt;LI data-section-id="pv56r1" data-start="3042" data-end="3080"&gt;Set extension_ssoMigrated = true&lt;/LI&gt;
&lt;/UL&gt;
&lt;P data-start="3082" data-end="3114"&gt;No duplication. No manual merge.&lt;/P&gt;
&lt;H3 data-section-id="lvspw8" data-start="2431" data-end="2450"&gt;Simplified flow&lt;/H3&gt;
&lt;OL data-start="2452" data-end="2622"&gt;
&lt;LI data-section-id="1kyr7dh" data-start="2452" data-end="2497"&gt;User clicks &lt;STRONG data-start="2467" data-end="2495"&gt;“Sign in with Microsoft”&lt;/STRONG&gt;&lt;/LI&gt;
&lt;LI data-section-id="1d3kims" data-start="2498" data-end="2537"&gt;System checks for an existing SSO account&lt;/LI&gt;
&lt;LI data-section-id="1slf8ai" data-start="2538" data-end="2568"&gt;If none → look up by email&lt;/LI&gt;
&lt;LI data-section-id="145kcdm" data-start="2569" data-end="2605"&gt;If match → link + mark migrated&lt;/LI&gt;
&lt;LI data-section-id="10yj4t8" data-start="2606" data-end="2622"&gt;Issue token&lt;/LI&gt;
&lt;/OL&gt;
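&lt;P&gt;The decision logic of this flow, together with the blocking rule from the previous section, can be modeled in plain Python to make the branches easier to reason about. This is only a mental model: the in-memory user list and function names are hypothetical, and the actual enforcement lives in the custom policy XML. The source flow does not describe what happens when no account matches, so that branch simply returns None here.&lt;/P&gt;

```python
# Hypothetical in-memory stand-in for the B2C directory.
USERS = [
    {"objectId": "u1", "email": "ana@contoso.com", "alternativeSecurityId": None,
     "extension_ssoMigrated": False},
]


def sso_sign_in(users, issuer_user_id, email):
    """Return the objectId for the token, linking by email on the first SSO login."""
    # Existing SSO account: nothing to migrate.
    for user in users:
        if user["alternativeSecurityId"] == issuer_user_id:
            return user["objectId"]
    # No SSO account yet: find a local account with the same email and take it over.
    for user in users:
        if user["email"].lower() == email.lower():
            user["alternativeSecurityId"] = issuer_user_id
            user["extension_ssoMigrated"] = True
            return user["objectId"]
    # No match: not covered by the flow above, so this sketch returns None.
    return None


def password_flow_allowed(user):
    """Local sign-in and password reset stay open only for non-migrated users."""
    return user.get("extension_ssoMigrated") is not True


print(password_flow_allowed(USERS[0]))
print(sso_sign_in(USERS, "oid-1", "Ana@Contoso.com"))
print(password_flow_allowed(USERS[0]))
```

&lt;P&gt;The same objectId is returned before and after linking, which is the property that avoids duplicate identities, and the password flows close for that user as soon as the flag is set.&lt;/P&gt;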
&lt;H3 data-section-id="lvspw8" data-start="2431" data-end="2450"&gt;&lt;STRONG&gt;TrustFrameworkExtension.xml&lt;/STRONG&gt;&lt;/H3&gt;
&lt;LI-CODE lang="xml"&gt;&amp;lt;?xml version="1.0" encoding="utf-8" ?&amp;gt;
&amp;lt;TrustFrameworkPolicy
  xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
  xmlns:xsd="http://www.w3.org/2001/XMLSchema"
  xmlns="http://schemas.microsoft.com/online/cpim/schemas/2013/06"
  PolicySchemaVersion="0.3.0.0"
  TenantId="yourtenant.onmicrosoft.com"
  PolicyId="B2C_1A_TrustFrameworkExtensions"
  PublicPolicyUri="http://yourtenant.onmicrosoft.com/B2C_1A_TrustFrameworkExtensions"
  TenantObjectId="tenantID"&amp;gt;

  &amp;lt;BasePolicy&amp;gt;
    &amp;lt;TenantId&amp;gt;yourtenant.onmicrosoft.com&amp;lt;/TenantId&amp;gt;
    &amp;lt;PolicyId&amp;gt;B2C_1A_TrustFrameworkLocalization&amp;lt;/PolicyId&amp;gt;
  &amp;lt;/BasePolicy&amp;gt;

  &amp;lt;BuildingBlocks&amp;gt;
    &amp;lt;ClaimsSchema&amp;gt;
      &amp;lt;ClaimType Id="extension_ssoMigrated"&amp;gt;
        &amp;lt;DisplayName&amp;gt;SSO Migrated&amp;lt;/DisplayName&amp;gt;
        &amp;lt;DataType&amp;gt;boolean&amp;lt;/DataType&amp;gt;
        &amp;lt;UserHelpText&amp;gt;Indicates if user migrated to SSO&amp;lt;/UserHelpText&amp;gt;
      &amp;lt;/ClaimType&amp;gt;
      &amp;lt;ClaimType Id="isForgotPassword"&amp;gt;
        &amp;lt;DisplayName&amp;gt;isForgotPassword&amp;lt;/DisplayName&amp;gt;
        &amp;lt;DataType&amp;gt;boolean&amp;lt;/DataType&amp;gt;
        &amp;lt;AdminHelpText&amp;gt;Whether the user has clicked Forgot your password&amp;lt;/AdminHelpText&amp;gt;
      &amp;lt;/ClaimType&amp;gt;
    &amp;lt;/ClaimsSchema&amp;gt;

    &amp;lt;ClaimsTransformations&amp;gt;
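      &amp;lt;!-- Asserts that extension_ssoMigrated equals false; otherwise the transformation throws and the calling step fails. --&amp;gt;
      &amp;lt;ClaimsTransformation Id="AssertNotSsoMigrated" TransformationMethod="AssertBooleanClaimIsEqualToValue"&amp;gt;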
        &amp;lt;InputClaims&amp;gt;
          &amp;lt;InputClaim ClaimTypeReferenceId="extension_ssoMigrated" TransformationClaimType="inputClaim" /&amp;gt;
        &amp;lt;/InputClaims&amp;gt;
        &amp;lt;InputParameters&amp;gt;
          &amp;lt;InputParameter Id="valueToCompareTo" DataType="boolean" Value="false" /&amp;gt;
        &amp;lt;/InputParameters&amp;gt;
      &amp;lt;/ClaimsTransformation&amp;gt;
    &amp;lt;/ClaimsTransformations&amp;gt;
  &amp;lt;/BuildingBlocks&amp;gt;

  &amp;lt;ClaimsProviders&amp;gt;
    &amp;lt;ClaimsProvider&amp;gt;
      &amp;lt;Domain&amp;gt;AADBSI&amp;lt;/Domain&amp;gt;
      &amp;lt;DisplayName&amp;gt;Sign in with Microsoft&amp;lt;/DisplayName&amp;gt;
      &amp;lt;TechnicalProfiles&amp;gt;
        &amp;lt;TechnicalProfile Id="AADBSI-OpenIdConnect"&amp;gt;
          &amp;lt;DisplayName&amp;gt;Sign in with Microsoft&amp;lt;/DisplayName&amp;gt;
          &amp;lt;Description&amp;gt;Sign in with Microsoft&amp;lt;/Description&amp;gt;
          &amp;lt;Protocol Name="OpenIdConnect" /&amp;gt;
          &amp;lt;Metadata&amp;gt;
            &amp;lt;Item Key="METADATA"&amp;gt;https://login.microsoftonline.com/common/v2.0/.well-known/openid-configuration&amp;lt;/Item&amp;gt;
            &amp;lt;Item Key="client_id"&amp;gt;4a84d062-21d7-4e96-8f80-c6b688e7127b&amp;lt;/Item&amp;gt;
            &amp;lt;Item Key="response_types"&amp;gt;code&amp;lt;/Item&amp;gt;
            &amp;lt;Item Key="scope"&amp;gt;openid profile&amp;lt;/Item&amp;gt;
            &amp;lt;Item Key="response_mode"&amp;gt;form_post&amp;lt;/Item&amp;gt;
            &amp;lt;Item Key="HttpBinding"&amp;gt;POST&amp;lt;/Item&amp;gt;
            &amp;lt;Item Key="UsePolicyInRedirectUri"&amp;gt;false&amp;lt;/Item&amp;gt;
            &amp;lt;Item Key="DiscoverMetadataByTokenIssuer"&amp;gt;true&amp;lt;/Item&amp;gt;
            &amp;lt;Item Key="ValidTokenIssuerPrefixes"&amp;gt;https://sts.windows.net/,https://login.microsoftonline.com/&amp;lt;/Item&amp;gt;
          &amp;lt;/Metadata&amp;gt;
          &amp;lt;CryptographicKeys&amp;gt;
            &amp;lt;Key Id="client_secret" StorageReferenceId="B2C_1A_Multitenancy" /&amp;gt;
          &amp;lt;/CryptographicKeys&amp;gt;
          &amp;lt;OutputClaims&amp;gt;
            &amp;lt;OutputClaim ClaimTypeReferenceId="issuerUserId" PartnerClaimType="oid" /&amp;gt;
            &amp;lt;OutputClaim ClaimTypeReferenceId="tenantId" PartnerClaimType="tid" /&amp;gt;
            &amp;lt;OutputClaim ClaimTypeReferenceId="givenName" PartnerClaimType="given_name" /&amp;gt;
            &amp;lt;OutputClaim ClaimTypeReferenceId="surName" PartnerClaimType="family_name" /&amp;gt;
            &amp;lt;OutputClaim ClaimTypeReferenceId="displayName" PartnerClaimType="name" /&amp;gt;
            &amp;lt;OutputClaim ClaimTypeReferenceId="authenticationSource" DefaultValue="socialIdpAuthentication" AlwaysUseDefaultValue="true" /&amp;gt;
            &amp;lt;OutputClaim ClaimTypeReferenceId="identityProvider" PartnerClaimType="iss" /&amp;gt;
            &amp;lt;OutputClaim ClaimTypeReferenceId="email" PartnerClaimType="email" /&amp;gt;
            &amp;lt;OutputClaim ClaimTypeReferenceId="otherMails" PartnerClaimType="mail" /&amp;gt;
          &amp;lt;/OutputClaims&amp;gt;
          &amp;lt;OutputClaimsTransformations&amp;gt;
            &amp;lt;OutputClaimsTransformation ReferenceId="CreateRandomUPNUserName" /&amp;gt;
            &amp;lt;OutputClaimsTransformation ReferenceId="CreateUserPrincipalName" /&amp;gt;
            &amp;lt;OutputClaimsTransformation ReferenceId="CreateAlternativeSecurityId" /&amp;gt;
            &amp;lt;OutputClaimsTransformation ReferenceId="CreateSubjectClaimFromAlternativeSecurityId" /&amp;gt;
          &amp;lt;/OutputClaimsTransformations&amp;gt;
          &amp;lt;UseTechnicalProfileForSessionManagement ReferenceId="SM-SocialLogin" /&amp;gt;
        &amp;lt;/TechnicalProfile&amp;gt;
      &amp;lt;/TechnicalProfiles&amp;gt;
    &amp;lt;/ClaimsProvider&amp;gt;

    &amp;lt;ClaimsProvider&amp;gt;
      &amp;lt;DisplayName&amp;gt;Local Account SignIn&amp;lt;/DisplayName&amp;gt;
      &amp;lt;TechnicalProfiles&amp;gt;
        &amp;lt;TechnicalProfile Id="login-NonInteractive"&amp;gt;
          &amp;lt;Metadata&amp;gt;
            &amp;lt;Item Key="client_id"&amp;gt;a37a58e7-e96a-4365-bb8e-169bee86dde1&amp;lt;/Item&amp;gt;
            &amp;lt;Item Key="IdTokenAudience"&amp;gt;92114a5a-99df-40c8-8b08-228616b18c57&amp;lt;/Item&amp;gt;
          &amp;lt;/Metadata&amp;gt;
          &amp;lt;InputClaims&amp;gt;
            &amp;lt;InputClaim ClaimTypeReferenceId="client_id" DefaultValue="a37a58e7-e96a-4365-bb8e-169bee86dde1" /&amp;gt;
            &amp;lt;InputClaim ClaimTypeReferenceId="resource_id" PartnerClaimType="resource" DefaultValue="92114a5a-99df-40c8-8b08-228616b18c57" /&amp;gt;
          &amp;lt;/InputClaims&amp;gt;
        &amp;lt;/TechnicalProfile&amp;gt;
      &amp;lt;/TechnicalProfiles&amp;gt;
    &amp;lt;/ClaimsProvider&amp;gt;

    &amp;lt;ClaimsProvider&amp;gt;
      &amp;lt;DisplayName&amp;gt;Azure Active Directory - SSO Control&amp;lt;/DisplayName&amp;gt;
      &amp;lt;TechnicalProfiles&amp;gt;
        &amp;lt;TechnicalProfile Id="AAD-Common"&amp;gt;
          &amp;lt;Metadata&amp;gt;
            &amp;lt;Item Key="ApplicationObjectId"&amp;gt;3cc7f330-2408-4999-ab68-b74d6feccdf1&amp;lt;/Item&amp;gt;
            &amp;lt;Item Key="ClientId"&amp;gt;06c3fab4-3e2a-4d1d-9a78-8954da5d364f&amp;lt;/Item&amp;gt;
          &amp;lt;/Metadata&amp;gt;
        &amp;lt;/TechnicalProfile&amp;gt;

        &amp;lt;TechnicalProfile Id="AAD-UserReadUsingEmailAddress-Takeover"&amp;gt;
          &amp;lt;Metadata&amp;gt;
            &amp;lt;Item Key="Operation"&amp;gt;Read&amp;lt;/Item&amp;gt;
            &amp;lt;Item Key="RaiseErrorIfClaimsPrincipalDoesNotExist"&amp;gt;false&amp;lt;/Item&amp;gt;
          &amp;lt;/Metadata&amp;gt;
          &amp;lt;IncludeInSso&amp;gt;false&amp;lt;/IncludeInSso&amp;gt;
          &amp;lt;InputClaims&amp;gt;
            &amp;lt;InputClaim ClaimTypeReferenceId="email" PartnerClaimType="signInNames.emailAddress" Required="true" /&amp;gt;
          &amp;lt;/InputClaims&amp;gt;
          &amp;lt;OutputClaims&amp;gt;
            &amp;lt;OutputClaim ClaimTypeReferenceId="objectId" /&amp;gt;
            &amp;lt;OutputClaim ClaimTypeReferenceId="extension_ssoMigrated" /&amp;gt;
          &amp;lt;/OutputClaims&amp;gt;
          &amp;lt;IncludeTechnicalProfile ReferenceId="AAD-Common" /&amp;gt;
        &amp;lt;/TechnicalProfile&amp;gt;

        &amp;lt;TechnicalProfile Id="AAD-LinkSSOToExistingUser"&amp;gt;
          &amp;lt;Metadata&amp;gt;
            &amp;lt;Item Key="Operation"&amp;gt;Write&amp;lt;/Item&amp;gt;
            &amp;lt;Item Key="RaiseErrorIfClaimsPrincipalDoesNotExist"&amp;gt;true&amp;lt;/Item&amp;gt;
          &amp;lt;/Metadata&amp;gt;
          &amp;lt;IncludeInSso&amp;gt;false&amp;lt;/IncludeInSso&amp;gt;
          &amp;lt;InputClaims&amp;gt;
            &amp;lt;InputClaim ClaimTypeReferenceId="objectId" Required="true" /&amp;gt;
          &amp;lt;/InputClaims&amp;gt;
          &amp;lt;PersistedClaims&amp;gt;
            &amp;lt;PersistedClaim ClaimTypeReferenceId="objectId" /&amp;gt;
            &amp;lt;PersistedClaim ClaimTypeReferenceId="alternativeSecurityId" /&amp;gt;
            &amp;lt;PersistedClaim ClaimTypeReferenceId="extension_ssoMigrated" DefaultValue="true" /&amp;gt;
          &amp;lt;/PersistedClaims&amp;gt;
          &amp;lt;OutputClaims&amp;gt;
            &amp;lt;OutputClaim ClaimTypeReferenceId="extension_ssoMigrated" DefaultValue="true" AlwaysUseDefaultValue="true" /&amp;gt;
          &amp;lt;/OutputClaims&amp;gt;
          &amp;lt;IncludeTechnicalProfile ReferenceId="AAD-Common" /&amp;gt;
          &amp;lt;UseTechnicalProfileForSessionManagement ReferenceId="SM-AAD" /&amp;gt;
        &amp;lt;/TechnicalProfile&amp;gt;

        &amp;lt;!-- When SSO creates a new user, also write email as signInNames.emailAddress.
             This claims the email so any subsequent local signup with the same email
             is blocked by RaiseErrorIfClaimsPrincipalAlreadyExists in
             AAD-UserWriteUsingLogonEmail — preventing duplicate accounts.
             The UserMessageIfClaimsPrincipalAlreadyExists override on
             AAD-UserWriteUsingLogonEmail below surfaces the SSO-guidance message to the user. --&amp;gt;
        &amp;lt;TechnicalProfile Id="AAD-UserWriteUsingAlternativeSecurityId"&amp;gt;
          &amp;lt;PersistedClaims&amp;gt;
            &amp;lt;PersistedClaim ClaimTypeReferenceId="alternativeSecurityId" /&amp;gt;
            &amp;lt;PersistedClaim ClaimTypeReferenceId="userPrincipalName" /&amp;gt;
            &amp;lt;PersistedClaim ClaimTypeReferenceId="mailNickName" DefaultValue="unknown" /&amp;gt;
            &amp;lt;PersistedClaim ClaimTypeReferenceId="displayName" DefaultValue="unknown" /&amp;gt;
            &amp;lt;PersistedClaim ClaimTypeReferenceId="otherMails" /&amp;gt;
            &amp;lt;PersistedClaim ClaimTypeReferenceId="givenName" /&amp;gt;
            &amp;lt;PersistedClaim ClaimTypeReferenceId="surname" /&amp;gt;
            &amp;lt;PersistedClaim ClaimTypeReferenceId="email" PartnerClaimType="signInNames.emailAddress" /&amp;gt;
          &amp;lt;/PersistedClaims&amp;gt;
        &amp;lt;/TechnicalProfile&amp;gt;

        &amp;lt;!-- Override to surface a clear SSO-guidance message when local signup is blocked
             because the email is already claimed by an SSO-first account.
             UserMessageIfClaimsPrincipalAlreadyExists must be on the AAD write TP. --&amp;gt;
        &amp;lt;TechnicalProfile Id="AAD-UserWriteUsingLogonEmail"&amp;gt;
          &amp;lt;Metadata&amp;gt;
            &amp;lt;Item Key="UserMessageIfClaimsPrincipalAlreadyExists"&amp;gt;An account already exists for this email via SSO. Please sign in using SSO instead of creating a local account.&amp;lt;/Item&amp;gt;
          &amp;lt;/Metadata&amp;gt;
        &amp;lt;/TechnicalProfile&amp;gt;

        &amp;lt;!-- Reads the extension_ssoMigrated flag by objectId.
             Runs as a validation step once objectId has been set: by login-NonInteractive
             during sign-in, or by AAD-UserReadUsingEmailAddress during password reset.
             Used to conditionally block sign-in/reset for SSO-migrated users. --&amp;gt;
        &amp;lt;TechnicalProfile Id="AAD-ReadSsoMigratedFlag"&amp;gt;
          &amp;lt;Metadata&amp;gt;
            &amp;lt;Item Key="Operation"&amp;gt;Read&amp;lt;/Item&amp;gt;
            &amp;lt;Item Key="RaiseErrorIfClaimsPrincipalDoesNotExist"&amp;gt;false&amp;lt;/Item&amp;gt;
          &amp;lt;/Metadata&amp;gt;
          &amp;lt;IncludeInSso&amp;gt;false&amp;lt;/IncludeInSso&amp;gt;
          &amp;lt;InputClaims&amp;gt;
            &amp;lt;InputClaim ClaimTypeReferenceId="objectId" Required="true" /&amp;gt;
          &amp;lt;/InputClaims&amp;gt;
          &amp;lt;OutputClaims&amp;gt;
            &amp;lt;OutputClaim ClaimTypeReferenceId="extension_ssoMigrated" /&amp;gt;
          &amp;lt;/OutputClaims&amp;gt;
          &amp;lt;IncludeTechnicalProfile ReferenceId="AAD-Common" /&amp;gt;
        &amp;lt;/TechnicalProfile&amp;gt;
      &amp;lt;/TechnicalProfiles&amp;gt;
    &amp;lt;/ClaimsProvider&amp;gt;

    &amp;lt;!-- Inline password reset: surfaces as the ForgotPasswordExchange ClaimsProviderSelection
         inside the CombinedSignInAndSignUp step. This keeps the forgot-password flow entirely
         within B2C — no AADB2C90118 error is returned to the application. --&amp;gt;
    &amp;lt;ClaimsProvider&amp;gt;
      &amp;lt;DisplayName&amp;gt;Local Account Password Reset&amp;lt;/DisplayName&amp;gt;
      &amp;lt;TechnicalProfiles&amp;gt;
        &amp;lt;TechnicalProfile Id="ForgotPassword"&amp;gt;
          &amp;lt;DisplayName&amp;gt;Forgot your password?&amp;lt;/DisplayName&amp;gt;
          &amp;lt;Protocol Name="Proprietary" Handler="Web.TPEngine.Providers.ClaimsTransformationProtocolProvider, Web.TPEngine, Version=1.0.0.0, Culture=neutral, PublicKeyToken=null" /&amp;gt;
          &amp;lt;OutputClaims&amp;gt;
            &amp;lt;OutputClaim ClaimTypeReferenceId="isForgotPassword" DefaultValue="true" AlwaysUseDefaultValue="true" /&amp;gt;
          &amp;lt;/OutputClaims&amp;gt;
          &amp;lt;UseTechnicalProfileForSessionManagement ReferenceId="SM-Noop" /&amp;gt;
        &amp;lt;/TechnicalProfile&amp;gt;
      &amp;lt;/TechnicalProfiles&amp;gt;
    &amp;lt;/ClaimsProvider&amp;gt;

    &amp;lt;ClaimsProvider&amp;gt;
      &amp;lt;DisplayName&amp;gt;SSO Migration Check&amp;lt;/DisplayName&amp;gt;
      &amp;lt;TechnicalProfiles&amp;gt;
        &amp;lt;TechnicalProfile Id="ThrowSsoMigratedError"&amp;gt;
          &amp;lt;DisplayName&amp;gt;Block local password journeys&amp;lt;/DisplayName&amp;gt;
          &amp;lt;Protocol Name="Proprietary" Handler="Web.TPEngine.Providers.ClaimsTransformationProtocolProvider, Web.TPEngine, Version=1.0.0.0, Culture=neutral, PublicKeyToken=null" /&amp;gt;
          &amp;lt;InputClaims&amp;gt;
            &amp;lt;InputClaim ClaimTypeReferenceId="extension_ssoMigrated" /&amp;gt;
          &amp;lt;/InputClaims&amp;gt;
          &amp;lt;OutputClaims&amp;gt;
            &amp;lt;OutputClaim ClaimTypeReferenceId="extension_ssoMigrated" /&amp;gt;
          &amp;lt;/OutputClaims&amp;gt;
          &amp;lt;OutputClaimsTransformations&amp;gt;
            &amp;lt;OutputClaimsTransformation ReferenceId="AssertNotSsoMigrated" /&amp;gt;
          &amp;lt;/OutputClaimsTransformations&amp;gt;
        &amp;lt;/TechnicalProfile&amp;gt;
      &amp;lt;/TechnicalProfiles&amp;gt;
    &amp;lt;/ClaimsProvider&amp;gt;

    &amp;lt;ClaimsProvider&amp;gt;
      &amp;lt;DisplayName&amp;gt;Local Account&amp;lt;/DisplayName&amp;gt;
      &amp;lt;TechnicalProfiles&amp;gt;
        &amp;lt;TechnicalProfile Id="SelfAsserted-LocalAccountSignin-Email"&amp;gt;
          &amp;lt;Metadata&amp;gt;
            &amp;lt;!-- Carry over required base metadata so the TP renders and validates correctly --&amp;gt;
            &amp;lt;Item Key="SignUpTarget"&amp;gt;SignUpWithLogonEmailExchange&amp;lt;/Item&amp;gt;
            &amp;lt;Item Key="setting.operatingMode"&amp;gt;Email&amp;lt;/Item&amp;gt;
            &amp;lt;Item Key="setting.forgotPasswordLinkOverride"&amp;gt;ForgotPasswordExchange&amp;lt;/Item&amp;gt;
            &amp;lt;Item Key="ContentDefinitionReferenceId"&amp;gt;api.localaccountsignin&amp;lt;/Item&amp;gt;
            &amp;lt;Item Key="IncludeClaimResolvingInClaimsHandling"&amp;gt;true&amp;lt;/Item&amp;gt;
            &amp;lt;Item Key="UserMessageIfClaimsTransformationBooleanValueIsNotEqual"&amp;gt;This account has been migrated to SSO. Please sign in with SSO instead.&amp;lt;/Item&amp;gt;
          &amp;lt;/Metadata&amp;gt;
          &amp;lt;IncludeInSso&amp;gt;false&amp;lt;/IncludeInSso&amp;gt;
          &amp;lt;InputClaims&amp;gt;
            &amp;lt;InputClaim ClaimTypeReferenceId="signInName" DefaultValue="{OIDC:LoginHint}" AlwaysUseDefaultValue="true" /&amp;gt;
          &amp;lt;/InputClaims&amp;gt;
          &amp;lt;OutputClaims&amp;gt;
            &amp;lt;OutputClaim ClaimTypeReferenceId="signInName" Required="true" /&amp;gt;
            &amp;lt;OutputClaim ClaimTypeReferenceId="password" Required="true" /&amp;gt;
            &amp;lt;OutputClaim ClaimTypeReferenceId="objectId" /&amp;gt;
            &amp;lt;OutputClaim ClaimTypeReferenceId="authenticationSource" /&amp;gt;
            &amp;lt;OutputClaim ClaimTypeReferenceId="extension_ssoMigrated" /&amp;gt;
          &amp;lt;/OutputClaims&amp;gt;
          &amp;lt;ValidationTechnicalProfiles&amp;gt;
            &amp;lt;!-- Step 1: validate credentials, sets objectId --&amp;gt;
            &amp;lt;ValidationTechnicalProfile ReferenceId="login-NonInteractive" /&amp;gt;
            &amp;lt;!-- Step 2: read the real extension_ssoMigrated flag from the directory.
                 RaiseErrorIfClaimsPrincipalDoesNotExist=false on the TP definition
                 means this silently returns nothing if the user or attribute is missing. --&amp;gt;
            &amp;lt;ValidationTechnicalProfile ReferenceId="AAD-ReadSsoMigratedFlag" /&amp;gt;
            &amp;lt;!-- Step 3: block only users who have been migrated to SSO.
                 ClaimsExist guard ensures this VTP is skipped when extension_ssoMigrated
                 was never set (normal local users, extension attributes not configured). --&amp;gt;
            &amp;lt;ValidationTechnicalProfile ReferenceId="ThrowSsoMigratedError"&amp;gt;
              &amp;lt;Preconditions&amp;gt;
                &amp;lt;Precondition Type="ClaimsExist" ExecuteActionsIf="false"&amp;gt;
                  &amp;lt;Value&amp;gt;extension_ssoMigrated&amp;lt;/Value&amp;gt;
                  &amp;lt;Action&amp;gt;SkipThisValidationTechnicalProfile&amp;lt;/Action&amp;gt;
                &amp;lt;/Precondition&amp;gt;
                &amp;lt;Precondition Type="ClaimEquals" ExecuteActionsIf="false"&amp;gt;
                  &amp;lt;Value&amp;gt;extension_ssoMigrated&amp;lt;/Value&amp;gt;
                  &amp;lt;Value&amp;gt;True&amp;lt;/Value&amp;gt;
                  &amp;lt;Action&amp;gt;SkipThisValidationTechnicalProfile&amp;lt;/Action&amp;gt;
                &amp;lt;/Precondition&amp;gt;
              &amp;lt;/Preconditions&amp;gt;
            &amp;lt;/ValidationTechnicalProfile&amp;gt;
          &amp;lt;/ValidationTechnicalProfiles&amp;gt;
          &amp;lt;UseTechnicalProfileForSessionManagement ReferenceId="SM-AAD" /&amp;gt;
        &amp;lt;/TechnicalProfile&amp;gt;

        &amp;lt;TechnicalProfile Id="LocalAccountDiscoveryUsingEmailAddress"&amp;gt;
          &amp;lt;Metadata&amp;gt;
            &amp;lt;Item Key="UserMessageIfClaimsTransformationBooleanValueIsNotEqual"&amp;gt;This account has been migrated to SSO. Password reset is not available. Please sign in with SSO.&amp;lt;/Item&amp;gt;
          &amp;lt;/Metadata&amp;gt;
          &amp;lt;OutputClaims&amp;gt;
            &amp;lt;OutputClaim ClaimTypeReferenceId="email" PartnerClaimType="Verified.Email" Required="true" /&amp;gt;
            &amp;lt;OutputClaim ClaimTypeReferenceId="objectId" /&amp;gt;
            &amp;lt;OutputClaim ClaimTypeReferenceId="userPrincipalName" /&amp;gt;
            &amp;lt;OutputClaim ClaimTypeReferenceId="authenticationSource" /&amp;gt;
            &amp;lt;OutputClaim ClaimTypeReferenceId="extension_ssoMigrated" /&amp;gt;
          &amp;lt;/OutputClaims&amp;gt;
          &amp;lt;ValidationTechnicalProfiles&amp;gt;
            &amp;lt;!-- Step 1: look up account by email, sets objectId --&amp;gt;
            &amp;lt;ValidationTechnicalProfile ReferenceId="AAD-UserReadUsingEmailAddress" /&amp;gt;
            &amp;lt;!-- Step 2: read the real extension_ssoMigrated flag.
                 RaiseErrorIfClaimsPrincipalDoesNotExist=false on the TP definition
                 means this silently returns nothing if the user or attribute is missing. --&amp;gt;
            &amp;lt;ValidationTechnicalProfile ReferenceId="AAD-ReadSsoMigratedFlag" /&amp;gt;
            &amp;lt;!-- Step 3: block password reset only for SSO-migrated users.
                 ClaimsExist guard ensures this VTP is skipped when extension_ssoMigrated
                 was never set. --&amp;gt;
            &amp;lt;ValidationTechnicalProfile ReferenceId="ThrowSsoMigratedError"&amp;gt;
              &amp;lt;Preconditions&amp;gt;
                &amp;lt;Precondition Type="ClaimsExist" ExecuteActionsIf="false"&amp;gt;
                  &amp;lt;Value&amp;gt;extension_ssoMigrated&amp;lt;/Value&amp;gt;
                  &amp;lt;Action&amp;gt;SkipThisValidationTechnicalProfile&amp;lt;/Action&amp;gt;
                &amp;lt;/Precondition&amp;gt;
                &amp;lt;Precondition Type="ClaimEquals" ExecuteActionsIf="false"&amp;gt;
                  &amp;lt;Value&amp;gt;extension_ssoMigrated&amp;lt;/Value&amp;gt;
                  &amp;lt;Value&amp;gt;True&amp;lt;/Value&amp;gt;
                  &amp;lt;Action&amp;gt;SkipThisValidationTechnicalProfile&amp;lt;/Action&amp;gt;
                &amp;lt;/Precondition&amp;gt;
              &amp;lt;/Preconditions&amp;gt;
            &amp;lt;/ValidationTechnicalProfile&amp;gt;
          &amp;lt;/ValidationTechnicalProfiles&amp;gt;
        &amp;lt;/TechnicalProfile&amp;gt;

        &amp;lt;TechnicalProfile Id="LocalAccountWritePasswordUsingObjectId"&amp;gt;
          &amp;lt;UseTechnicalProfileForSessionManagement ReferenceId="SM-AAD" /&amp;gt;
        &amp;lt;/TechnicalProfile&amp;gt;
      &amp;lt;/TechnicalProfiles&amp;gt;
    &amp;lt;/ClaimsProvider&amp;gt;
  &amp;lt;/ClaimsProviders&amp;gt;

  &amp;lt;UserJourneys&amp;gt;
    &amp;lt;UserJourney Id="CustomSignUpOrSignIn"&amp;gt;
      &amp;lt;OrchestrationSteps&amp;gt;
        &amp;lt;OrchestrationStep Order="1" Type="CombinedSignInAndSignUp" ContentDefinitionReferenceId="api.signuporsignin"&amp;gt;
          &amp;lt;ClaimsProviderSelections&amp;gt;
            &amp;lt;ClaimsProviderSelection ValidationClaimsExchangeId="LocalAccountSigninEmailExchange" /&amp;gt;
            &amp;lt;ClaimsProviderSelection TargetClaimsExchangeId="AzureCommon-AAD-Exchange" /&amp;gt;
            &amp;lt;ClaimsProviderSelection TargetClaimsExchangeId="ForgotPasswordExchange" /&amp;gt;
          &amp;lt;/ClaimsProviderSelections&amp;gt;
          &amp;lt;ClaimsExchanges&amp;gt;
            &amp;lt;ClaimsExchange Id="LocalAccountSigninEmailExchange" TechnicalProfileReferenceId="SelfAsserted-LocalAccountSignin-Email" /&amp;gt;
          &amp;lt;/ClaimsExchanges&amp;gt;
        &amp;lt;/OrchestrationStep&amp;gt;

        &amp;lt;!-- Step 2: Sign-up / SSO / Forgot-password exchange.
             CombinedSignInAndSignUp (Step 1) acts as the implicit ClaimsProviderSelection,
             so this step is a ClaimsExchange directly — no separate selection step needed.
             Every TargetClaimsExchangeId from Step 1 must have a matching exchange here:
               • "Sign up now" → SignUpWithLogonEmailExchange
               • "Sign in with Microsoft" → AzureCommon-AAD-Exchange
               • "Forgot your password?" → ForgotPasswordExchange --&amp;gt;
        &amp;lt;OrchestrationStep Order="2" Type="ClaimsExchange"&amp;gt;
          &amp;lt;Preconditions&amp;gt;
            &amp;lt;Precondition Type="ClaimsExist" ExecuteActionsIf="true"&amp;gt;
              &amp;lt;Value&amp;gt;objectId&amp;lt;/Value&amp;gt;
              &amp;lt;Action&amp;gt;SkipThisOrchestrationStep&amp;lt;/Action&amp;gt;
            &amp;lt;/Precondition&amp;gt;
          &amp;lt;/Preconditions&amp;gt;
          &amp;lt;ClaimsExchanges&amp;gt;
            &amp;lt;ClaimsExchange Id="SignUpWithLogonEmailExchange" TechnicalProfileReferenceId="LocalAccountSignUpWithLogonEmail" /&amp;gt;
            &amp;lt;ClaimsExchange Id="AzureCommon-AAD-Exchange" TechnicalProfileReferenceId="AADBSI-OpenIdConnect" /&amp;gt;
            &amp;lt;ClaimsExchange Id="ForgotPasswordExchange" TechnicalProfileReferenceId="ForgotPassword" /&amp;gt;
          &amp;lt;/ClaimsExchanges&amp;gt;
        &amp;lt;/OrchestrationStep&amp;gt;

        &amp;lt;!-- Steps 3-4: Inline forgot-password flow.
             Two preconditions ensure each of these steps runs ONLY when the user clicked
             "Forgot your password?":
               1. ClaimsExist guard — skips when isForgotPassword was never set
                  (signup, SSO, and local-sign-in flows).
               2. ClaimEquals guard — skips when isForgotPassword exists but ≠ true. --&amp;gt;
        &amp;lt;OrchestrationStep Order="3" Type="ClaimsExchange"&amp;gt;
          &amp;lt;Preconditions&amp;gt;
            &amp;lt;Precondition Type="ClaimsExist" ExecuteActionsIf="false"&amp;gt;
              &amp;lt;Value&amp;gt;isForgotPassword&amp;lt;/Value&amp;gt;
              &amp;lt;Action&amp;gt;SkipThisOrchestrationStep&amp;lt;/Action&amp;gt;
            &amp;lt;/Precondition&amp;gt;
            &amp;lt;Precondition Type="ClaimEquals" ExecuteActionsIf="false"&amp;gt;
              &amp;lt;Value&amp;gt;isForgotPassword&amp;lt;/Value&amp;gt;
              &amp;lt;Value&amp;gt;true&amp;lt;/Value&amp;gt;
              &amp;lt;Action&amp;gt;SkipThisOrchestrationStep&amp;lt;/Action&amp;gt;
            &amp;lt;/Precondition&amp;gt;
          &amp;lt;/Preconditions&amp;gt;
          &amp;lt;ClaimsExchanges&amp;gt;
            &amp;lt;ClaimsExchange Id="PasswordResetUsingEmailAddressExchange" TechnicalProfileReferenceId="LocalAccountDiscoveryUsingEmailAddress" /&amp;gt;
          &amp;lt;/ClaimsExchanges&amp;gt;
        &amp;lt;/OrchestrationStep&amp;gt;

        &amp;lt;OrchestrationStep Order="4" Type="ClaimsExchange"&amp;gt;
          &amp;lt;Preconditions&amp;gt;
            &amp;lt;Precondition Type="ClaimsExist" ExecuteActionsIf="false"&amp;gt;
              &amp;lt;Value&amp;gt;isForgotPassword&amp;lt;/Value&amp;gt;
              &amp;lt;Action&amp;gt;SkipThisOrchestrationStep&amp;lt;/Action&amp;gt;
            &amp;lt;/Precondition&amp;gt;
            &amp;lt;Precondition Type="ClaimEquals" ExecuteActionsIf="false"&amp;gt;
              &amp;lt;Value&amp;gt;isForgotPassword&amp;lt;/Value&amp;gt;
              &amp;lt;Value&amp;gt;true&amp;lt;/Value&amp;gt;
              &amp;lt;Action&amp;gt;SkipThisOrchestrationStep&amp;lt;/Action&amp;gt;
            &amp;lt;/Precondition&amp;gt;
          &amp;lt;/Preconditions&amp;gt;
          &amp;lt;ClaimsExchanges&amp;gt;
            &amp;lt;ClaimsExchange Id="NewCredentials" TechnicalProfileReferenceId="LocalAccountWritePasswordUsingObjectId" /&amp;gt;
          &amp;lt;/ClaimsExchanges&amp;gt;
        &amp;lt;/OrchestrationStep&amp;gt;

        &amp;lt;!-- Step 5: Read SSO user from directory by alternativeSecurityId --&amp;gt;
        &amp;lt;OrchestrationStep Order="5" Type="ClaimsExchange"&amp;gt;
          &amp;lt;Preconditions&amp;gt;
            &amp;lt;Precondition Type="ClaimEquals" ExecuteActionsIf="true"&amp;gt;
              &amp;lt;Value&amp;gt;isForgotPassword&amp;lt;/Value&amp;gt;
              &amp;lt;Value&amp;gt;true&amp;lt;/Value&amp;gt;
              &amp;lt;Action&amp;gt;SkipThisOrchestrationStep&amp;lt;/Action&amp;gt;
            &amp;lt;/Precondition&amp;gt;
            &amp;lt;Precondition Type="ClaimEquals" ExecuteActionsIf="true"&amp;gt;
              &amp;lt;Value&amp;gt;authenticationSource&amp;lt;/Value&amp;gt;
              &amp;lt;Value&amp;gt;localAccountAuthentication&amp;lt;/Value&amp;gt;
              &amp;lt;Action&amp;gt;SkipThisOrchestrationStep&amp;lt;/Action&amp;gt;
            &amp;lt;/Precondition&amp;gt;
          &amp;lt;/Preconditions&amp;gt;
          &amp;lt;ClaimsExchanges&amp;gt;
            &amp;lt;ClaimsExchange Id="AADUserReadUsingAlternativeSecurityId" TechnicalProfileReferenceId="AAD-UserReadUsingAlternativeSecurityId-NoError" /&amp;gt;
          &amp;lt;/ClaimsExchanges&amp;gt;
        &amp;lt;/OrchestrationStep&amp;gt;

        &amp;lt;!-- Step 6: SSO takeover — look up existing local account by email --&amp;gt;
        &amp;lt;OrchestrationStep Order="6" Type="ClaimsExchange"&amp;gt;
          &amp;lt;Preconditions&amp;gt;
            &amp;lt;Precondition Type="ClaimEquals" ExecuteActionsIf="true"&amp;gt;
              &amp;lt;Value&amp;gt;authenticationSource&amp;lt;/Value&amp;gt;
              &amp;lt;Value&amp;gt;localAccountAuthentication&amp;lt;/Value&amp;gt;
              &amp;lt;Action&amp;gt;SkipThisOrchestrationStep&amp;lt;/Action&amp;gt;
            &amp;lt;/Precondition&amp;gt;
            &amp;lt;Precondition Type="ClaimsExist" ExecuteActionsIf="true"&amp;gt;
              &amp;lt;Value&amp;gt;objectId&amp;lt;/Value&amp;gt;
              &amp;lt;Action&amp;gt;SkipThisOrchestrationStep&amp;lt;/Action&amp;gt;
            &amp;lt;/Precondition&amp;gt;
            &amp;lt;Precondition Type="ClaimsExist" ExecuteActionsIf="false"&amp;gt;
              &amp;lt;Value&amp;gt;email&amp;lt;/Value&amp;gt;
              &amp;lt;Action&amp;gt;SkipThisOrchestrationStep&amp;lt;/Action&amp;gt;
            &amp;lt;/Precondition&amp;gt;
          &amp;lt;/Preconditions&amp;gt;
          &amp;lt;ClaimsExchanges&amp;gt;
            &amp;lt;ClaimsExchange Id="ReadByEmail" TechnicalProfileReferenceId="AAD-UserReadUsingEmailAddress-Takeover" /&amp;gt;
          &amp;lt;/ClaimsExchanges&amp;gt;
        &amp;lt;/OrchestrationStep&amp;gt;

        &amp;lt;!-- Step 7: Link SSO identity to existing local account --&amp;gt;
        &amp;lt;OrchestrationStep Order="7" Type="ClaimsExchange"&amp;gt;
          &amp;lt;Preconditions&amp;gt;
            &amp;lt;Precondition Type="ClaimEquals" ExecuteActionsIf="true"&amp;gt;
              &amp;lt;Value&amp;gt;isForgotPassword&amp;lt;/Value&amp;gt;
              &amp;lt;Value&amp;gt;true&amp;lt;/Value&amp;gt;
              &amp;lt;Action&amp;gt;SkipThisOrchestrationStep&amp;lt;/Action&amp;gt;
            &amp;lt;/Precondition&amp;gt;
            &amp;lt;Precondition Type="ClaimsExist" ExecuteActionsIf="false"&amp;gt;
              &amp;lt;Value&amp;gt;objectId&amp;lt;/Value&amp;gt;
              &amp;lt;Action&amp;gt;SkipThisOrchestrationStep&amp;lt;/Action&amp;gt;
            &amp;lt;/Precondition&amp;gt;
            &amp;lt;Precondition Type="ClaimEquals" ExecuteActionsIf="true"&amp;gt;
              &amp;lt;Value&amp;gt;authenticationSource&amp;lt;/Value&amp;gt;
              &amp;lt;Value&amp;gt;localAccountAuthentication&amp;lt;/Value&amp;gt;
              &amp;lt;Action&amp;gt;SkipThisOrchestrationStep&amp;lt;/Action&amp;gt;
            &amp;lt;/Precondition&amp;gt;
            &amp;lt;Precondition Type="ClaimEquals" ExecuteActionsIf="true"&amp;gt;
              &amp;lt;Value&amp;gt;extension_ssoMigrated&amp;lt;/Value&amp;gt;
              &amp;lt;Value&amp;gt;True&amp;lt;/Value&amp;gt;
              &amp;lt;Action&amp;gt;SkipThisOrchestrationStep&amp;lt;/Action&amp;gt;
            &amp;lt;/Precondition&amp;gt;
          &amp;lt;/Preconditions&amp;gt;
          &amp;lt;ClaimsExchanges&amp;gt;
            &amp;lt;ClaimsExchange Id="LinkSSO" TechnicalProfileReferenceId="AAD-LinkSSOToExistingUser" /&amp;gt;
          &amp;lt;/ClaimsExchanges&amp;gt;
        &amp;lt;/OrchestrationStep&amp;gt;

        &amp;lt;!-- Step 8: Collect display name etc. for brand-new SSO users --&amp;gt;
        &amp;lt;OrchestrationStep Order="8" Type="ClaimsExchange"&amp;gt;
          &amp;lt;Preconditions&amp;gt;
            &amp;lt;Precondition Type="ClaimsExist" ExecuteActionsIf="true"&amp;gt;
              &amp;lt;Value&amp;gt;objectId&amp;lt;/Value&amp;gt;
              &amp;lt;Action&amp;gt;SkipThisOrchestrationStep&amp;lt;/Action&amp;gt;
            &amp;lt;/Precondition&amp;gt;
          &amp;lt;/Preconditions&amp;gt;
          &amp;lt;ClaimsExchanges&amp;gt;
            &amp;lt;ClaimsExchange Id="SelfAsserted-Social" TechnicalProfileReferenceId="SelfAsserted-Social" /&amp;gt;
          &amp;lt;/ClaimsExchanges&amp;gt;
        &amp;lt;/OrchestrationStep&amp;gt;

        &amp;lt;!-- Step 9: Read local-account user attributes for the token --&amp;gt;
        &amp;lt;OrchestrationStep Order="9" Type="ClaimsExchange"&amp;gt;
          &amp;lt;Preconditions&amp;gt;
            &amp;lt;Precondition Type="ClaimEquals" ExecuteActionsIf="true"&amp;gt;
              &amp;lt;Value&amp;gt;authenticationSource&amp;lt;/Value&amp;gt;
              &amp;lt;Value&amp;gt;socialIdpAuthentication&amp;lt;/Value&amp;gt;
              &amp;lt;Action&amp;gt;SkipThisOrchestrationStep&amp;lt;/Action&amp;gt;
            &amp;lt;/Precondition&amp;gt;
          &amp;lt;/Preconditions&amp;gt;
          &amp;lt;ClaimsExchanges&amp;gt;
            &amp;lt;ClaimsExchange Id="AADUserReadWithObjectId" TechnicalProfileReferenceId="AAD-UserReadUsingObjectId" /&amp;gt;
          &amp;lt;/ClaimsExchanges&amp;gt;
        &amp;lt;/OrchestrationStep&amp;gt;

        &amp;lt;!-- Step 10: Write new SSO user to directory if not already written --&amp;gt;
        &amp;lt;OrchestrationStep Order="10" Type="ClaimsExchange"&amp;gt;
          &amp;lt;Preconditions&amp;gt;
            &amp;lt;Precondition Type="ClaimEquals" ExecuteActionsIf="true"&amp;gt;
              &amp;lt;Value&amp;gt;isForgotPassword&amp;lt;/Value&amp;gt;
              &amp;lt;Value&amp;gt;true&amp;lt;/Value&amp;gt;
              &amp;lt;Action&amp;gt;SkipThisOrchestrationStep&amp;lt;/Action&amp;gt;
            &amp;lt;/Precondition&amp;gt;
            &amp;lt;Precondition Type="ClaimsExist" ExecuteActionsIf="true"&amp;gt;
              &amp;lt;Value&amp;gt;objectId&amp;lt;/Value&amp;gt;
              &amp;lt;Action&amp;gt;SkipThisOrchestrationStep&amp;lt;/Action&amp;gt;
            &amp;lt;/Precondition&amp;gt;
          &amp;lt;/Preconditions&amp;gt;
          &amp;lt;ClaimsExchanges&amp;gt;
            &amp;lt;ClaimsExchange Id="AADUserWrite" TechnicalProfileReferenceId="AAD-UserWriteUsingAlternativeSecurityId" /&amp;gt;
          &amp;lt;/ClaimsExchanges&amp;gt;
        &amp;lt;/OrchestrationStep&amp;gt;

        &amp;lt;OrchestrationStep Order="11" Type="SendClaims" CpimIssuerTechnicalProfileReferenceId="JwtIssuer" /&amp;gt;
      &amp;lt;/OrchestrationSteps&amp;gt;
      &amp;lt;ClientDefinition ReferenceId="DefaultWeb" /&amp;gt;
    &amp;lt;/UserJourney&amp;gt;
  &amp;lt;/UserJourneys&amp;gt;

&amp;lt;/TrustFrameworkPolicy&amp;gt;&lt;/LI-CODE&gt;
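&lt;P&gt;For readers tracing how the block is enforced: ThrowSsoMigratedError calls the AssertNotSsoMigrated claims transformation, and the UserMessageIfClaimsTransformationBooleanValueIsNotEqual metadata on the self-asserted profiles supplies the error text. The fragment below is a minimal sketch of what such a transformation could look like, assuming it is built on the AssertBooleanClaimIsEqualToValue method with a comparison value of false. It is an illustration only; the policy's own definition of AssertNotSsoMigrated is authoritative and may differ.&lt;/P&gt;
&lt;LI-CODE lang="xml"&gt;&amp;lt;!-- Sketch under stated assumptions, not the policy's actual definition.
     Asserts that extension_ssoMigrated equals false; if the claim is true,
     the assertion fails and the calling self-asserted profile shows its
     UserMessageIfClaimsTransformationBooleanValueIsNotEqual text. --&amp;gt;
&amp;lt;ClaimsTransformation Id="AssertNotSsoMigrated" TransformationMethod="AssertBooleanClaimIsEqualToValue"&amp;gt;
  &amp;lt;InputClaims&amp;gt;
    &amp;lt;InputClaim ClaimTypeReferenceId="extension_ssoMigrated" TransformationClaimType="inputClaim" /&amp;gt;
  &amp;lt;/InputClaims&amp;gt;
  &amp;lt;InputParameters&amp;gt;
    &amp;lt;InputParameter Id="valueToCompareTo" DataType="boolean" Value="false" /&amp;gt;
  &amp;lt;/InputParameters&amp;gt;
&amp;lt;/ClaimsTransformation&amp;gt;&lt;/LI-CODE&gt;
&lt;P&gt;Under that assumption, the two preconditions on the ThrowSsoMigratedError validation step skip it unless extension_ssoMigrated exists and equals True, so the assertion is evaluated, and fails, only for migrated users.&lt;/P&gt;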
&lt;H3 data-section-id="utx20n" data-start="2831" data-end="2848"&gt;Key takeaways&lt;/H3&gt;
&lt;UL data-start="2850" data-end="3145"&gt;
&lt;LI data-section-id="1y5kjtq" data-start="2850" data-end="2903"&gt;Migration happens &lt;STRONG data-start="2870" data-end="2901"&gt;silently on first SSO login&lt;/STRONG&gt;&lt;/LI&gt;
&lt;LI data-section-id="kla7t3" data-start="2904" data-end="2962"&gt;&lt;STRONG data-start="2906" data-end="2944"&gt;Only migrated users are restricted&lt;/STRONG&gt; (not all users)&lt;/LI&gt;
&lt;LI data-section-id="rvfykk" data-start="2963" data-end="3030"&gt;Forgot password remains fully functional for non-migrated users&lt;/LI&gt;
&lt;LI data-section-id="6h3abr" data-start="3031" data-end="3094"&gt;A single flag (extension_ssoMigrated) determines whether local sign-in and password reset are blocked&lt;/LI&gt;
&lt;LI data-section-id="crm7e7" data-start="3095" data-end="3145"&gt;No duplicate accounts, no confusion, no manual effort&lt;/LI&gt;
&lt;/UL&gt;
&lt;P data-start="3171" data-end="3234"&gt;This approach removes the usual friction of identity migration.&lt;/P&gt;
&lt;P data-start="3236" data-end="3399" data-is-last-node="" data-is-only-node=""&gt;Users don’t need to “move” to SSO—the system does it for them, automatically and securely, while preserving a smooth experience for those who haven’t migrated yet.&lt;/P&gt;</description>
      <pubDate>Thu, 23 Apr 2026 21:34:55 GMT</pubDate>
      <guid>https://techcommunity.microsoft.com/t5/azure-architecture-blog/automatic-sso-takeover-in-azure-ad-b2c-custom-policies/ba-p/4514167</guid>
      <dc:creator>anammalu</dc:creator>
      <dc:date>2026-04-23T21:34:55Z</dc:date>
    </item>
    <item>
      <title>Enabling Agentic Data Governance with Hybrid Cloud Flexibility in Azure</title>
      <link>https://techcommunity.microsoft.com/t5/azure-architecture-blog/enabling-agentic-data-governance-with-hybrid-cloud-flexibility/ba-p/4513511</link>
      <description>&lt;H1&gt;&lt;A class="lia-anchor" target="_blank" name="_Toc227662162"&gt;&lt;/A&gt;The “Why”&lt;/H1&gt;
&lt;img /&gt;
&lt;P&gt;Do you manage data in a complex multi-cloud environment? Are you struggling with data silos, evolving regulations, and the pressure to maintain control and compliance across on-prem and multiple clouds? Do you ever wish an intelligent assistant could help shoulder the load of data governance? If so, I can relate. Let me tell you a story that might sound familiar.&lt;/P&gt;
&lt;P&gt;Meet Mark (pictured above). He is a data governance officer at Contoso (a fictional but very representative enterprise). Mark’s day job is ensuring data governance and compliance across his company’s vast hybrid cloud estate – think roughly 2 million data assets sprawled across 12+ datacenters on-premises and in different public clouds. Regulatory requirements are constantly shifting. Customer data is increasingly sensitive. Each department and region has its own way of doing things. Mark is fighting an uphill battle with data silos and disconnected cloud operations. He bounces between a patchwork of tools – spreadsheets, cloud consoles, governance portals – trying to answer basic questions:&lt;/P&gt;
&lt;P&gt;&lt;EM&gt;Where is our data? &lt;/EM&gt;&lt;/P&gt;
&lt;P&gt;&lt;EM&gt;Who’s using it? &lt;/EM&gt;&lt;/P&gt;
&lt;P&gt;&lt;EM&gt;Are we in compliance?&lt;/EM&gt;&lt;/P&gt;
&lt;img /&gt;
&lt;P&gt;Armed with an old desk calculator and a pile of paper-based reports (a perfect 1990s backdrop), he is dealing with data that has exploded in volume and complexity.&lt;/P&gt;
&lt;P&gt;What if Mark had a single pane of glass? A glass that both reflects and acts: it reflects the governance state and enforces compliance, a self-hydrating pane of glass accompanied by a conversational AI.&lt;/P&gt;
&lt;img /&gt;
&lt;P&gt;And he’s not alone. We’re all living in a data overload era. Every day, organizations generate and ingest more information than ever before. Transistors and mainframes gave way to the internet boom of the ’90s, then an explosion of mobile devices in the 2000s, social media in the 2010s, and now widespread cloud computing – all funneling data into our systems at an exponential rate. On top of that, a new wave of AI and conversational interfaces has arrived here in the mid-2020s, making data more accessible but also increasing expectations for real-time insight. It’s no wonder modern IT leaders feel overwhelmed.&lt;/P&gt;
&lt;P&gt;But these challenges are also opportunities. The way I see it, the incredible growth of data and cloud capabilities means we have a chance to reimagine data governance. The fact that I’m writing about this right now is no coincidence. My customers are looking to resolve problems in this space. In my conversations with them, I hear the same needs: &lt;EM&gt;We want better governance, more visibility, streamlined oversight… and cherry on top, we want it in an “agentic” fashion.&lt;/EM&gt; In other words, they want to delegate the grunt work to the platform toolset augmented by AI, so they can focus on higher-value tasks.&lt;/P&gt;
&lt;H1&gt;&lt;A class="lia-anchor" target="_blank" name="_Toc227662163"&gt;&lt;/A&gt;The “What”&lt;/H1&gt;
&lt;img /&gt;
&lt;P&gt;That vision – &lt;EM&gt;agentic data governance with hybrid cloud flexibility&lt;/EM&gt; – became the driver for this work. The solution is modular: building-block components (cloud services, governance tools, AI agents) that you can snap together into the solution you need. Think of it as a &lt;STRONG&gt;jumpstart kit&lt;/STRONG&gt; for continuous data governance across multiple clouds, with autonomous (“agentic”) assistance baked in that you can leverage and build upon. It’s not the final, productized solution – more a vision of what’s possible.&lt;/P&gt;
&lt;H3&gt;&lt;A class="lia-anchor" target="_blank" name="_Toc227662164"&gt;&lt;/A&gt;Contoso’s Requirements&lt;/H3&gt;
&lt;P&gt;These are the high-level requirements from Contoso:&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;Data governance across clouds under one roof&lt;/LI&gt;
&lt;LI&gt;A single pane of glass dashboard consolidating reporting on the 5 governance domains:
&lt;UL&gt;
&lt;LI&gt;Visibility on data residency and lineage&lt;/LI&gt;
&lt;LI&gt;PII (&lt;STRONG&gt;P&lt;/STRONG&gt;ersonally &lt;STRONG&gt;I&lt;/STRONG&gt;dentifiable &lt;STRONG&gt;I&lt;/STRONG&gt;nformation) must run on CC (&lt;STRONG&gt;C&lt;/STRONG&gt;onfidential &lt;STRONG&gt;C&lt;/STRONG&gt;ompute)&lt;/LI&gt;
&lt;LI&gt;Security software (Defender) compliance&lt;/LI&gt;
&lt;LI&gt;Resource tagging compliance (foundational for a good governance posture)&lt;/LI&gt;
&lt;LI&gt;OS updates compliance&lt;/LI&gt;
&lt;/UL&gt;
&lt;/LI&gt;
&lt;LI&gt;Ability to enforce compliance in an agentic manner with a human in the loop&lt;/LI&gt;
&lt;LI&gt;Agentic enforcement of compliance pertaining to residency and confidential compute&lt;/LI&gt;
&lt;/UL&gt;
&lt;H3&gt;&lt;A class="lia-anchor" target="_blank" name="_Toc227662165"&gt;&lt;/A&gt;Solution – The breakdown&lt;/H3&gt;
&lt;P&gt;The solution comprises eight modules that address these requirements:&lt;/P&gt;
&lt;OL&gt;
&lt;LI&gt;Foundational (Landing zones, Data Sources, Operational setup, Policies, etc.)&lt;/LI&gt;
&lt;LI&gt;Dashboard Hydration + Agentic Reporting – Residency Compliance&lt;/LI&gt;
&lt;LI&gt;Dashboard Hydration + Agentic Reporting – Confidential Compute for PII Compliance&lt;/LI&gt;
&lt;LI&gt;Dashboard Hydration + Agentic Reporting – MS Defender Compliance&lt;/LI&gt;
&lt;LI&gt;Dashboard Hydration + Agentic Reporting – Resource Tag Compliance&lt;/LI&gt;
&lt;LI&gt;Dashboard Hydration + Agentic Reporting – OS Updates/Patch Compliance&lt;/LI&gt;
&lt;LI&gt;Enforce Compliance via Copilot Agent - Residency Compliance&lt;/LI&gt;
&lt;LI&gt;Enforce Compliance via Copilot Agent – CC PII Compliance&lt;/LI&gt;
&lt;/OL&gt;
&lt;H3&gt;&lt;A class="lia-anchor" target="_blank" name="_Toc227662166"&gt;&lt;/A&gt;Solution – The architecture view&lt;/H3&gt;
&lt;img /&gt;
&lt;P&gt;These are the main technical components that make up the solution architecture:&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;Data sources of all shapes and sizes on the left, governed by the native Azure or the Arc plane.&lt;/LI&gt;
&lt;LI&gt;Additional Azure services across the bottom layer for the foundational governance posture&lt;/LI&gt;
&lt;LI&gt;Microsoft Purview, in the top middle, as the unified data governance platform&lt;/LI&gt;
&lt;LI&gt;Microsoft Fabric, in the bottom middle, as the end-to-end ingestion and analytics platform&lt;/LI&gt;
&lt;LI&gt;Microsoft Power Platform, on the right, as the low code/no code business flow and the copilot agent experience&lt;/LI&gt;
&lt;/UL&gt;
&lt;H3&gt;&lt;A class="lia-anchor" target="_blank" name="_Toc227662167"&gt;&lt;/A&gt;Solution – The end user view&lt;/H3&gt;
&lt;P&gt;So how does Mark see this solution as a data governance officer? He doesn’t see all the intricacies of the solution integration and the logic execution. He sees two things:&lt;/P&gt;
&lt;OL&gt;
&lt;LI&gt;A Power BI dashboard running on Microsoft Fabric with:
&lt;UL&gt;
&lt;LI&gt;A compliance dashboard with an overall score in each of the five compliance domains alongside scores for each of the data products across these domains&lt;/LI&gt;
&lt;LI&gt;Additional reporting views for more granular reporting&lt;/LI&gt;
&lt;LI&gt;A Fabric-based pipeline that hydrates the underlying semantic models from various sources to keep the reports fresh and current&lt;/LI&gt;
&lt;/UL&gt;
&lt;/LI&gt;
&lt;LI&gt;A Copilot agent (in Teams) for both:
&lt;UL&gt;
&lt;LI&gt;&lt;EM&gt;Reporting on all compliance domains&lt;/EM&gt;&lt;/LI&gt;
&lt;LI&gt;&lt;EM&gt;Enforcing in-scope compliance across selected domains&lt;/EM&gt;&lt;/LI&gt;
&lt;/UL&gt;
&lt;/LI&gt;
&lt;/OL&gt;
&lt;P&gt;The agent takes care of the rest: it queries Fabric’s semantic model, calls Azure Function endpoints, updates Purview glossary terms, applies Azure tags, and sends Teams notifications.&lt;/P&gt;
&lt;H1&gt;&lt;A class="lia-anchor" target="_blank" name="_Toc227662168"&gt;&lt;/A&gt;The “How” – Residency Compliance&lt;/H1&gt;
&lt;P&gt;Let’s pick a few modules to walk through how these solution modules work together to give a cohesive agentic governance experience to Mark.&lt;/P&gt;
&lt;P&gt;It’s Monday morning, and Mark logs into the Contoso governance portal with a cup of coffee in hand. Instead of a dozen browser tabs, he has two main tools open: the Data Governance Dashboard and the Contoso Governance Copilot agent.&lt;/P&gt;
&lt;P&gt;To address some inquiries assigned to him, he chats with the agent. During this interaction, he not only checks whether any residency information is missing in the unified data governance platform (Purview), but also resolves a mismatch between Purview and an Azure resource, based on the design principles. Here is a snippet of the chat:&lt;/P&gt;
&lt;img /&gt;
&lt;P&gt;Under the hood, several components work on behalf of the agent to perform this governance check and apply the necessary course of action.&lt;/P&gt;
&lt;P&gt;Even before Mark's conversation with the agent, an ongoing hydration process keeps the Fabric Power BI dashboard up to date.&lt;/P&gt;
&lt;H3&gt;&lt;A class="lia-anchor" target="_blank" name="_Toc227662169"&gt;&lt;/A&gt;Dashboard Hydration + Agentic Reporting – Residency Compliance&lt;/H3&gt;
&lt;img /&gt;
&lt;OL&gt;
&lt;LI&gt;A Fabric notebook runs the residency scorecard code block through a pipeline.&lt;/LI&gt;
&lt;LI&gt;It reads two Lakehouse tables containing the latest residency information from Purview and the approved region list.&lt;/LI&gt;
&lt;LI&gt;The notebook then gets a Microsoft Entra bearer token.&lt;/LI&gt;
&lt;LI&gt;Once the token is acquired, the notebook calls an Azure Function endpoint.&lt;/LI&gt;
&lt;LI&gt;This endpoint searches for the Azure resources associated with the data products in Purview using an Azure resource tag.&lt;/LI&gt;
&lt;LI&gt;The notebook then compares the declared Purview residency with the approved region list and the associated resource’s region.&lt;/LI&gt;
&lt;LI&gt;The notebook then calculates the final residency compliance score (0, 25, 50, 75, or 100) and a reason. For example, a data product without an associated Azure resource gets a 0, while a data product whose Purview residency is a Contoso-approved region and also matches the region of the associated Azure resource gets a 100.&lt;/LI&gt;
&lt;LI&gt;It then writes the results to the relevant residency compliance Lakehouse tables.&lt;/LI&gt;
&lt;LI&gt;The dedicated compliance table then feeds the semantic model for reporting.&lt;/LI&gt;
&lt;LI&gt;The compliance Power BI dashboard is hydrated.&lt;/LI&gt;
&lt;/OL&gt;
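&lt;P&gt;The comparison and scoring in steps 6 and 7 can be sketched roughly as follows. This is a minimal illustration, not the code from the solution: the function name residency_score, the reason strings, and the case-insensitive region comparison are assumptions made for this example. Only the 0 and 100 cases are described in this post; the 25, 50, and 75 tiers shown here are one plausible way to fill in the scale.&lt;/P&gt;

```python
from typing import Optional, Set, Tuple


def residency_score(
    purview_region: Optional[str],
    resource_region: Optional[str],
    approved_regions: Set[str],
) -> Tuple[int, str]:
    """Return a (score, reason) pair for one data product.

    Hypothetical helper: only the 0 and 100 outcomes come from the post;
    the intermediate tiers are assumptions for illustration.
    """
    # Described in the post: no associated Azure resource scores 0.
    if not resource_region:
        return 0, "No associated Azure resource found"

    # Assumed tier: a resource exists but Purview declares no residency.
    if not purview_region:
        return 25, "No residency declared in Purview"

    # Normalize so that 'WestEurope' and 'westeurope' compare equal.
    declared = purview_region.strip().lower()
    actual = resource_region.strip().lower()
    approved = {region.strip().lower() for region in approved_regions}

    is_approved = declared in approved
    matches_resource = declared == actual

    # Described in the post: approved and matching scores 100.
    if is_approved and matches_resource:
        return 100, "Declared region is approved and matches the resource"
    # Assumed tiers for partial compliance.
    if is_approved:
        return 75, "Declared region is approved but the resource runs elsewhere"
    if matches_resource:
        return 50, "Declared region matches the resource but is not approved"
    return 25, "Declared region is neither approved nor matching"


if __name__ == "__main__":
    approved = {"westeurope", "northeurope"}
    print(residency_score("westeurope", "westeurope", approved))
    print(residency_score("northeurope", "westeurope", approved))
    print(residency_score("westeurope", None, approved))
```

&lt;P&gt;Returning the reason alongside the score keeps each dashboard row self-explanatory, which matches the post’s description of a score plus a reason being written to the compliance tables.&lt;/P&gt;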
&lt;H3&gt;&lt;A class="lia-anchor" target="_blank" name="_Toc227662170"&gt;&lt;/A&gt;Enforce Compliance via Copilot Agent - Residency Compliance&lt;/H3&gt;
&lt;P&gt;With the dashboard data regularly updated, the agent draws on the following logic, the refreshed reporting data, and the actions at its disposal during the earlier conversation with Mark:&lt;/P&gt;
&lt;img /&gt;
&lt;OL&gt;
&lt;LI&gt;Mark initiates the conversation with the agent.&lt;/LI&gt;
&lt;LI&gt;The agent calls a Power Automate flow.&lt;/LI&gt;
&lt;LI&gt;This flow retrieves Purview’s residency information stored in the Fabric semantic model.&lt;/LI&gt;
&lt;LI&gt;(Steps 4 to 8 in the diagram) When Mark asks to investigate a data product further, the agent carries the conversation using a topic, which leverages a flow, which in turn uses a Power Automate custom connector to access an Azure Function endpoint. This endpoint retrieves the latest glossary (residency) information about the data product in question from Purview and provides a preview back to the user.&lt;/LI&gt;
&lt;LI&gt;(Steps 9 to 13 in the diagram) If the update criteria are met, there is no conflict, and Mark approves, the topic calls another flow to access the Functions Purview Update endpoint and make the glossary (residency) update in Purview for that data product.&lt;/LI&gt;
&lt;/OL&gt;
&lt;H1&gt;&lt;A class="lia-anchor" target="_blank" name="_Toc227662171"&gt;&lt;/A&gt;The “How” – Confidential Compute for PII Compliance&lt;/H1&gt;
&lt;H3&gt;&lt;A class="lia-anchor" target="_blank" name="_Toc227662172"&gt;&lt;/A&gt;Dashboard Hydration + Agentic Reporting – Confidential Compute for PII Compliance&lt;/H3&gt;
&lt;P&gt;The following snippet shows how Mark addresses the compliance risk with a critical data product (application), S/4 HANA, and performs the necessary compliance actions, such as tagging the associated resources and notifying the data product owners via a Teams channel.&lt;/P&gt;
&lt;img /&gt;
&lt;P&gt;The following diagram shows the under-the-hood hydration flow for confidential compute compliance:&lt;/P&gt;
&lt;img /&gt;
&lt;H3&gt;&lt;A class="lia-anchor" target="_blank" name="_Toc227662173"&gt;&lt;/A&gt;Enforce Compliance via Copilot Agent – CC PII Compliance&lt;/H3&gt;
&lt;P&gt;Finally, the diagram below shows how Mark’s conversation flows through the main solution components:&lt;/P&gt;
&lt;img /&gt;
&lt;H1&gt;&lt;A class="lia-anchor" target="_blank" name="_Toc227662174"&gt;&lt;/A&gt;Outcome&lt;/H1&gt;
&lt;P&gt;Stepping back, what did we accomplish for Mark and Contoso? We turned an onslaught of governance challenges into an opportunity to modernize how data is managed. This gave Mark:&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;Centralized visibility into data assets across the landscape through Purview and a unified dashboard&lt;/LI&gt;
&lt;LI&gt;Proactive compliance through automated checks, controlled by Purview exports and Fabric pipeline schedules&lt;/LI&gt;
&lt;LI&gt;Compliance enforcement using an agent&lt;/LI&gt;
&lt;LI&gt;Hybrid cloud consistency through Azure Arc and a foundational data plane management setup&lt;/LI&gt;
&lt;LI&gt;Reduced operational overhead with agentic reporting and compliance&lt;/LI&gt;
&lt;/UL&gt;
&lt;P&gt;Though the solution comprises a wide variety of components and services, it is built from standard building blocks and is relatively simple to implement. In total, it combines around a dozen Azure services and over 40 distinct components (from Purview catalogs to data pipelines to custom functions and flows). You can choose to implement some or all of the compliance domains. Or, better yet, build upon them to create new domains and pave new paths.&lt;/P&gt;
&lt;H1&gt;&lt;A class="lia-anchor" target="_blank" name="_Toc227662175"&gt;&lt;/A&gt;Wrap-up&lt;/H1&gt;
&lt;P&gt;I believe many enterprises could take a similar journey. If you’re facing these issues, consider this an invitation to think differently about data governance. Start with the pieces you already have – your own building blocks of cloud services and data – and imagine what you could build. Chances are that a lot of the heavy lifting can be orchestrated with today’s technology. And with the rise of AI copilots, the dream of agentic data governance – where your policies are continuously enforced by smart agents – is no longer science fiction. It’s here, right now, waiting for you to take it for a spin.&lt;/P&gt;
&lt;P&gt;&lt;STRONG&gt;Next steps&lt;/STRONG&gt;&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;Watch the video narrative on the SAP on Azure YouTube channel:&amp;nbsp;&lt;div contenteditable="false" class="lia-embeded-content"&gt;&lt;iframe src="https://cdn.embedly.com/widgets/media.html?src=https%3A%2F%2Fwww.youtube.com%2Fembed%2FzuZ6IDRNSHs%3Ffeature%3Doembed&amp;amp;display_name=YouTube&amp;amp;url=https%3A%2F%2Fwww.youtube.com%2Fwatch%3Fv%3DzuZ6IDRNSHs&amp;amp;image=https%3A%2F%2Fi.ytimg.com%2Fvi%2FzuZ6IDRNSHs%2Fhqdefault.jpg&amp;amp;type=text%2Fhtml&amp;amp;schema=youtube" title="YouTube embed" scrolling="no" allowfullscreen="allowfullscreen" frameborder="0" allow="autoplay; fullscreen; encrypted-media; picture-in-picture;" class="embedly-embed" sandbox="allow-scripts allow-same-origin"&gt;&lt;/iframe&gt;&lt;/div&gt;&lt;/LI&gt;
&lt;LI&gt;Build it with the GitHub Repository: &lt;A href="https://github.com/moazmirza/data-sov-and-hyb-cloud" target="_blank" rel="noopener" data-lia-auto-title-active="1"&gt;https://github.com/moazmirza/data-sov-and-hyb-cloud&lt;/A&gt;&lt;/LI&gt;
&lt;LI&gt;Comments/questions: Here, or @ LinkedIn /moazmirza&lt;/LI&gt;
&lt;/UL&gt;
&lt;H1&gt;&lt;A class="lia-anchor" target="_blank" name="_Toc227662176"&gt;&lt;/A&gt;Solution Selfies&lt;/H1&gt;
&lt;DIV class="styles_lia-table-wrapper__h6Xo9 styles_table-responsive__MW0lN"&gt;&lt;BR /&gt;&lt;table border="1" style="border-width: 1px;"&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td&gt;
&lt;P&gt;Azure Policy Compliance - Foundational Governance Posture&lt;/P&gt;
&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;&lt;img /&gt;&lt;img /&gt;&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;
&lt;P&gt;Purview Data Product Catalog and Data Lineage&lt;/P&gt;
&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;&lt;img /&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;img /&gt;&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;
&lt;P&gt;Purview Governance Metadata → Fabric Lakehouse&lt;/P&gt;
&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;&lt;img /&gt;&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;
&lt;P&gt;Fabric Semantic Model&lt;/P&gt;
&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;img /&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;
&lt;P&gt;Additional Fabric Power BI Dashboard&lt;/P&gt;
&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;img /&gt;&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;
&lt;P&gt;Copilot Studio Topic Flow&lt;/P&gt;
&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;img /&gt;&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;
&lt;P&gt;Azure Function Endpoints&lt;/P&gt;
&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;img /&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;colgroup&gt;&lt;col style="width: 100.00%" /&gt;&lt;/colgroup&gt;&lt;/table&gt;&lt;/DIV&gt;
&lt;DIV class="styles_lia-table-wrapper__h6Xo9 styles_table-responsive__MW0lN"&gt;&amp;nbsp;&lt;/DIV&gt;</description>
      <pubDate>Thu, 23 Apr 2026 18:17:35 GMT</pubDate>
      <guid>https://techcommunity.microsoft.com/t5/azure-architecture-blog/enabling-agentic-data-governance-with-hybrid-cloud-flexibility/ba-p/4513511</guid>
      <dc:creator>Moaz_Mirza</dc:creator>
      <dc:date>2026-04-23T18:17:35Z</dc:date>
    </item>
  </channel>
</rss>

