Cloud Native Platforms: Evolve
Audience: Engineering leaders, platform architects, senior developers exploring how to operationalise AI in their teams
Reading time: 8 minutes
Series: Cloud Native Platforms. Build, Run, Evolve. This is Part 3 of 3.

Cloud helped us scale infrastructure. AI is starting to do the same thing for the work around the code: the planning, the testing, the release communication, the incident triage, the writing that surrounds writing software.

The conversation about AI in software has narrowed too quickly to "Copilot in the editor". The bigger story is happening across the lifecycle. Planning, design, development, testing, release, and operations are all being augmented at once. The platforms that adopt AI well are not the ones with the most usage. They are the ones with the clearest discipline around how it is used.

This post is about that discipline.

AI is changing how we engineer, not how we type

AI is not changing how we write code. It is changing how we engineer software.

Code generation is the surface. Underneath it, AI is reshaping the unit of leverage. The question is no longer how fast a developer can type. It is how well a workflow can be expressed as a reusable engineering asset. Six disciplines determine whether AI moves the needle on outcomes or just adds another tool to the stack.

Figure 1. AI across the SDLC. Each phase has clear AI assist points and clear human-owned validations. The boundary is not negotiable. It is the design.

1. From assistance to augmentation

Early AI tools focused on assisting individual developers. Code suggestions. Autocomplete. Quick refactors. The value was real but bounded by the editor.

The shift now is into structured workflows that span the lifecycle. The unit of leverage is no longer a single suggestion. It is a sequence of actions executed reliably across phases. ("Agentic" later in this post means a system that makes its own next-step decisions inside guardrails. A workflow follows a fixed sequence; an agent chooses the path.)

- Code generation has become baseline, not differentiator
- Workflow generation is where the largest gains live
- Multi-step assistance with explicit human checkpoints
- Context that travels across tools, not just within one

In practice

The pattern that works: start with the single highest-volume writing task on the team (commit messages, code review comments, release notes, postmortem first drafts) and turn the AI assist for that task into a shared workflow rather than each individual's private trick. The cost is one engineer's afternoon documenting the workflow and the eval set. The return is that every engineer on the team inherits the work, and the task that used to consume an engineer's morning every two weeks becomes a background step in the release process. Workflow generation, not faster typing, is where the gains compound across a team.

Code suggestions help one developer. Reusable workflows help the next ten.

2. AI across the SDLC, with guardrails

AI now has a useful role at every phase of delivery. The role is different at each phase, and the guardrails are different too.

| Phase | What AI helps with | What humans must validate |
|---|---|---|
| Plan | Breaking down requirements, drafting acceptance criteria | Domain context, business priorities, customer impact |
| Build | Code generation, refactoring, scaffolding | Architectural fit, security boundaries, performance |
| Test | Test case generation, edge case discovery | Coverage of business-critical paths, regulatory cases |
| Release | Release notes, changelog summaries, communication drafts | Accuracy, tone, customer-facing claims |
| Operate | Log triage, incident summaries, runbook drafts | Root cause attribution, action item ownership |

The guardrails are not optional decoration. They are the design.
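The phase table above can be carried into tooling as plain data, so that an AI-produced artefact cannot move forward without a named human reviewer. The sketch below is a minimal illustration under assumptions: `AiArtifact`, `HUMAN_VALIDATIONS`, and `can_proceed` are hypothetical names, not part of any pipeline product.

```python
from dataclasses import dataclass
from typing import Optional

# What a human must validate at each phase, copied from the table above.
HUMAN_VALIDATIONS = {
    "plan": "domain context, business priorities, customer impact",
    "build": "architectural fit, security boundaries, performance",
    "test": "coverage of business-critical paths, regulatory cases",
    "release": "accuracy, tone, customer-facing claims",
    "operate": "root cause attribution, action item ownership",
}


@dataclass
class AiArtifact:
    """An AI-produced output waiting at a phase boundary (hypothetical type)."""
    phase: str
    description: str
    validated_by: Optional[str] = None  # name of the human reviewer, if any


def can_proceed(artifact: AiArtifact) -> bool:
    """Allow the artefact through only if a human has signed off for its phase."""
    if artifact.phase not in HUMAN_VALIDATIONS:
        raise ValueError(f"unknown phase: {artifact.phase!r}")
    return artifact.validated_by is not None


draft = AiArtifact(phase="release", description="release notes draft")
print(can_proceed(draft))            # no reviewer recorded yet
draft.validated_by = "release-manager"
print(can_proceed(draft))            # reviewer recorded
```

The check is deliberately simple: it records who validated, not what they concluded. A real gate would likely also store the review outcome and a timestamp.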
In practice

The pattern that works: stage AI assists for release communication (changelog drafting, customer-facing release notes, internal release announcements) and require a human review before anything goes out. The draft arrives in a consistent shape, faster than a human could produce it, and is easier to compare across releases. The reviewer is not eliminated; the reviewer is moved from author to editor, which is where their judgment actually matters. Teams that adopt this pattern stop missing release-note deadlines and stop publishing inconsistent communication across products.

3. From prompts to reusable assets

Many teams begin with prompt experimentation. Individuals find techniques that work for their tasks. The result is a patchwork of personal practices that do not survive a team change.

The compounding value comes when prompts mature into reusable engineering assets.

Figure 2. The maturity model from prompts to agents. The value compounds at the workflow stage and accelerates at the agent stage. The disciplines that make agents safe are the same ones that made workflows reliable.

The maturity stages, in order of leverage:

- Prompts: ad-hoc, individual, hard to share
- Templates: parameterised prompts versioned with the project
- Workflows: multi-step sequences with clear inputs, outputs, checkpoints
- Agents: autonomous task chains operating within explicit guardrails

The diagram is a maturity ladder, not a graduation. In practice teams operate at all four stages simultaneously for different tasks. A senior engineer may use a one-off prompt to explore a refactor, run a versioned template for commit messages, hand off to a workflow for release notes, and trigger an agent for routine PR triage, all in the same hour. The point of the ladder is not to leave earlier stages behind. It is to know which stage a given task belongs to and to invest accordingly.
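As an illustration of the Templates stage, a parameterised prompt can be as small as a string template with a version number, kept in the repository next to the code. This is a sketch under assumptions: the template wording, the version constant, and `render_commit_prompt` are hypothetical, and the only library used is Python's standard `string.Template`.

```python
from string import Template

# Hypothetical template; in a real repository this text would live in a
# reviewed, versioned file rather than in someone's notes.
COMMIT_MESSAGE_TEMPLATE_VERSION = "1.0.0"
COMMIT_MESSAGE_TEMPLATE = Template(
    "You are drafting a commit message for the $project repository.\n"
    "Summarise the diff below in an imperative subject line of at most "
    "$max_subject_chars characters, followed by a short body.\n\n"
    "Diff:\n$diff\n"
)


def render_commit_prompt(project: str, diff: str, max_subject_chars: int = 72) -> str:
    """Fill in the template. substitute() raises KeyError if a parameter is missing."""
    return COMMIT_MESSAGE_TEMPLATE.substitute(
        project=project,
        diff=diff,
        max_subject_chars=max_subject_chars,
    )


prompt = render_commit_prompt("orders-api", "+ added retry to payment client")
print(prompt)
```

Using `substitute()` rather than `safe_substitute()` is a deliberate choice here: a missing parameter fails loudly instead of sending a half-filled prompt to the model.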
In practice

The pattern that works: pick the three prompts your team uses every week, codify them as parameterised templates in the same repository as the application code, and treat them as engineering artefacts (reviewed, versioned, owned). New engineers inherit the team's accumulated practice instead of building their own from scratch. Quality becomes consistent because the variance between individuals shrinks. Investment pays back in weeks, not quarters, and the maturity ladder keeps producing returns as the team moves from templates to workflows to agents.

4. Agentic delivery, with guardrails that survive a security review

The next stage is agentic. AI executes sequences of tasks within a defined scope. The risk is not that the agent will fail. It is that the system around the agent will not catch the failure, and that the failure modes are different in kind from traditional automation. Agents are non-deterministic, they can be manipulated through their inputs, and their actions can have side effects in systems the team does not own.

Five guardrails make agentic delivery safe. The first four are necessary. The fifth is what carries the agent through a security review at a regulated enterprise.

- Identity and scope: the agent runs as a managed identity (or scoped service principal) with the smallest set of permissions that lets it do its job. Permissions are expressed as allowlists, not denylists. Tools fetched at runtime are subject to the same identity boundary as the agent itself.
- Input quarantine: anything the agent reads from a user-controlled source (work item bodies, PR descriptions, customer tickets) is treated as untrusted text. The agent does not execute instructions found in fetched content, and tool calls are validated against a schema before execution. This is the prompt-injection mitigation, and it is the most common gap in agentic systems shipped today.
- Cost and blast-radius caps: every run has a maximum token budget, a maximum number of tool calls, and a maximum spend. Exceeding any cap aborts the run cleanly. Without caps, scoped credentials are not enough to bound the damage.
- Evaluations and traceability: agents are evaluated against a fixed test set before deployment, and on every prompt or model change. Every action is logged with inputs, outputs, the model and prompt versions used, and the reasoning trace where the model exposes one. Logs are redacted for secrets and personally identifiable information at write time.
- Reversibility taxonomy: actions are categorised by reversibility, not asserted to be reversible in general. A draft write to a private store is reversible. A post to a customer-facing channel is not reversible (deletion does not unsend). A database update may be reversible by a compensating transaction or not at all. Irreversible actions require human approval at the boundary, before they happen, not after. The agent is allowed to draft and stage. The human is the only one who is allowed to make the move that cannot be undone.

In practice

The pattern that works: start with one low-risk agent (release-notes drafter, PR triage assistant) running on read-only inputs, write-only-to-drafts permissions, and a hard cost cap per run. Require explicit human approval at the irreversible step. Wire up an evaluation set on day one, and rerun it on every prompt or model change. Treat regressions as failures, not warnings. The first agent the team ships is rarely the most valuable; it is the rehearsal that establishes the controls every later agent inherits. Teams that skip this rehearsal end up with an agent in production that no one feels safe extending.

Implementation note

An agent without a reversibility taxonomy and a regression eval set is a liability.
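Two of the guardrails above, cost and blast-radius caps and the reversibility taxonomy, are small enough to sketch in code. This is a minimal illustration assuming a hypothetical in-process agent runner: `RunBudget`, `ACTIONS`, and `authorise` are invented names and do not belong to any agent framework.

```python
from dataclasses import dataclass
from enum import Enum


class Reversibility(Enum):
    REVERSIBLE = "reversible"      # e.g. a draft written to a private store
    IRREVERSIBLE = "irreversible"  # e.g. a post to a customer-facing channel


class CapExceeded(Exception):
    """Raised when a run goes over any of its caps."""


class ApprovalRequired(Exception):
    """Raised when an irreversible action lacks human approval."""


@dataclass
class RunBudget:
    max_tokens: int
    max_tool_calls: int
    max_cost_usd: float
    tokens: int = 0
    tool_calls: int = 0
    cost_usd: float = 0.0

    def charge(self, tokens: int, cost_usd: float) -> None:
        """Record one tool call and abort the run if any cap is exceeded."""
        self.tokens += tokens
        self.tool_calls += 1
        self.cost_usd += cost_usd
        if (self.tokens > self.max_tokens
                or self.tool_calls > self.max_tool_calls
                or self.cost_usd > self.max_cost_usd):
            raise CapExceeded(f"cap exceeded after {self.tool_calls} tool calls")


# Allowlist: any action not named here is denied outright.
ACTIONS = {
    "write_draft": Reversibility.REVERSIBLE,
    "post_to_channel": Reversibility.IRREVERSIBLE,
}


def authorise(action: str, human_approved: bool = False) -> Reversibility:
    """Check the allowlist, then require approval before irreversible actions."""
    if action not in ACTIONS:
        raise PermissionError(f"{action!r} is not in the allowlist")
    kind = ACTIONS[action]
    if kind is Reversibility.IRREVERSIBLE and not human_approved:
        raise ApprovalRequired(f"{action!r} needs human approval first")
    return kind
```

The approval flag is passed in by the caller on purpose: the agent never sets it for itself, which keeps the irreversible step on the human side of the boundary.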
The discipline is the same one that made workflows reliable: scoped identity, idempotency, traceability, and a clear boundary between machine action and human decision. The YAML below is illustrative, not a runtime contract; it is meant to show the shape of the controls a real agent definition would carry, not the syntax of any specific platform.

    # Agent run definition (illustrative; not a specific platform's syntax)
    name: release-notes-drafter
    trigger: pre-release

    identity:
      type: managed-identity
      scope: tenant=<tenant-id> resource=release-tools/<app-id>

    permissions:
      allow:
        - read: "work-items in milestone (filter: state=Done)"
        - read: "pull-requests in milestone (filter: merged)"
        - write: drafts/release-notes/${run-id}
      # Production channels are NOT in the allowlist. The agent cannot post.

    limits:
      max_tokens_per_run: 80000
      max_tool_calls_per_run: 20
      max_runtime_seconds: 300
      max_cost_usd: 0.40
      on_exceeded: abort_with_partial_artifact

    input_handling:
      treat_fetched_content_as: untrusted
      # Indirect prompt injection is mitigated by the layered discipline below,
      # not by a single feature flag. Each item is a separate control.
      enforce_instruction_hierarchy: true
      validate_tool_args_against_schema: true
      validate_outputs_against_schema: true

    steps:
      - fetch: completed work items in milestone
      - draft: release notes from items
      - validate: required fields present
      - request-review:
          from: release-manager
          idempotency_key: ${milestone-id}-${draft-hash}
      - on-approval:
          action: post-to-internal-channel
          reversibility: not-reversible
          requires: explicit-human-click  # the agent does NOT click this

    audit:
      log_inputs: true
      log_outputs: true
      redact:
        - secrets
        # Pattern-based: handles structured PII like emails, phones, IDs.
        - pii_patterns: [email, phone, national-id, payment-card, ip-address]
        # Entity-based: required for unstructured PII like names. Pattern alone
        # cannot redact a customer name without an entity-recognition step.
        - pii_entities: ner-based  # names, locations, organisations
      retain: 365_days  # tune to your audit policy, not to the demo

    evaluation:
      test_set: tests/release-notes/eval-v3.jsonl
      on_prompt_change: rerun
      on_model_change: rerun
      fail_threshold: 5_percent_regression

5. Where AI still needs human judgment

AI has clear boundaries. The boundaries are not embarrassing. They are the design.

What must stay human-owned:

- Architectural trade-offs and design decisions
- Security validation and threat modelling
- Correctness for business-critical and regulatory paths
- Domain context that has not been written down
- Accountability for outcomes, not just outputs

The goal is collaboration, not replacement. The teams that get the most value from AI are not the ones with the most automation. They are the ones with the clearest sense of where automation ends and judgment begins.

In practice

The pattern that works: name the human-owned items explicitly in the team's working agreement (architecture, security, regulatory correctness, accountability) and audit every AI workflow against that list. When a workflow asks the AI to make a decision in any of those categories, redesign it so the AI prepares the analysis and a human makes the call. Most teams over-trust AI for one of these areas in their first six months and learn the hard way. Naming the boundary up front prevents the lesson from being paid for in production. The clarity is the value; the model behind the workflow is interchangeable.

6. Responsible AI is engineering work

The first five disciplines decide whether AI moves the needle. The sixth decides whether the platform can defend the choices it makes with AI.

Responsible AI is the engineering practice of building systems whose AI behaviour is fair, transparent, accountable, and safe by design, not by audit after the fact. Treating it as a compliance checkbox at the end of the project is how teams end up shipping AI workflows that fail security review, embarrass the company, or harm users.
Six controls turn responsible AI from a policy into engineering work. These map directly onto the practices Microsoft and the broader industry have converged on, but the names matter less than the practice they enable.

- Fairness in inputs and outputs. The training data, eval set, and prompts are reviewed for systematic bias against any group the system serves. The eval set covers under-represented cases by design, not by accident, and regressions on those cases fail the build.
- Transparency to end users. When a user sees AI-generated content, they are told. When a decision is AI-assisted, the path from input to output is explainable in plain language, not just in a model card buried in documentation.
- Content safety filters. Inputs and outputs pass through safety classifiers (prompt injection, prohibited content, jailbreak patterns) before reaching the model and before reaching the user. Filtering decisions are logged and reviewable.
- Accountability ownership. Every AI workflow has a named owner who is accountable for its outcomes, not just its uptime. The owner has the authority to pause or roll back the workflow when harm is detected.
- Data minimisation and residency. The AI sees only the data it needs to do the task. Personally identifiable information and customer data are scoped, redacted, and kept inside the boundary the customer agreed to. Cross-tenant leakage is treated as a P1 incident, not a feature request.
- Harm evaluation alongside quality evaluation. The eval set measures harm potential (toxicity, hallucination on factual queries, leakage of confidential context) with the same rigour as it measures correctness. Both must pass for a release to ship.

Figure 3. Responsible AI as a set of engineering controls around the AI workflow. The six controls fall into four categories: data discipline (fairness, data minimisation), model discipline (content safety, harm evaluation), deployment discipline (transparency to users), and governance (accountability ownership).
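The rule in the sixth control, that quality and harm evaluations must both pass before a release ships, can be expressed as a small gate. This is a sketch under stated assumptions: `release_gate` is a hypothetical function, the 5 percent figure is read as a relative drop against a baseline quality score, and "harm passes" is taken to mean zero failing harm cases. A real policy may define both differently.

```python
def release_gate(
    baseline_quality: float,
    candidate_quality: float,
    harm_failures: int,
    max_regression: float = 0.05,
) -> bool:
    """Return True only if quality has not regressed too far AND no harm case fails.

    Assumptions (not from any specific eval framework):
    - quality scores are fractions in (0, 1], higher is better
    - regression is measured relative to the baseline score
    - any failing harm case blocks the release
    """
    if baseline_quality <= 0:
        raise ValueError("baseline_quality must be positive")
    regression = (baseline_quality - candidate_quality) / baseline_quality
    quality_ok = regression <= max_regression
    harm_ok = harm_failures == 0
    return quality_ok and harm_ok


# Small quality dip, no harm failures: the gate opens.
print(release_gate(0.90, 0.88, harm_failures=0))
# Same quality, one harm failure: the gate stays closed.
print(release_gate(0.90, 0.90, harm_failures=1))
```

Keeping the two checks in one function makes it harder for a pipeline to run the quality eval and quietly skip the harm eval.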
All six are necessary; none is sufficient on its own.

In practice

The pattern that works: write the responsible AI plan before the first agent ships, not after the first incident. Pick one workflow that touches user data or generates customer-facing content, and use it as the reference implementation: fairness review on the eval set, content safety filters wrapping the model call, transparency annotation in the UI, redaction of identifying details in logs, harm evals running alongside quality evals on every change, and a named owner with explicit pause authority. The first such workflow takes longer to ship than the unconstrained version. Every workflow after it inherits the controls and ships faster than it would have without them. Teams that defer responsible AI to a future quarter end up retrofitting it under pressure, which is the most expensive way to do it.

A scenario that ties it together

Picture a platform team several months into using Copilot. Adoption is high. Productivity dashboards show gains. But defect rates are not improving and lead time is flat. Leadership asks the obvious question: is AI actually helping, or just feeling like help?

The answer is not to stop using AI. It is to change how AI is measured. Move adoption metrics to the background. Move outcome metrics to the front: defect escape rate, lead time for change, change failure rate, mean time to recovery. In parallel, promote the individual prompts that have proved themselves to shared templates, and the templates to versioned workflows. Retrofit responsible AI controls onto the workflows that shipped first: content safety filters, harm evaluations alongside quality evaluations, transparency annotations on customer-facing output, and a named owner for each workflow.

Six months later, the picture is different. Defect rate improves on the parts of the codebase where reusable workflows were introduced. Onboarding for new engineers is visibly faster. Release notes are consistent across teams. The shift is from celebrating use to tracking outcomes, and once the team measures what matters, the tooling decisions start making themselves.

What teams get wrong

The common pattern is measuring AI by usage, not by outcome. Adoption metrics tell you who tried Copilot. They do not tell you whether defects dropped, lead time improved, or release notes got better. The fix is not less AI. It is better measurement.

The four metrics named in the scenario above (defect escape rate, lead time for change, change failure rate, mean time to recovery) come from the DORA research on software delivery performance and have become a useful default. Two warnings travel with them. First, attribution is hard: an AI workflow rolled out alongside a test refactor and a CI pipeline change cannot claim credit cleanly. Second, baselines matter more than headlines: a single quarter's improvement is not a trend, and a single team's gain is not the platform's gain. Outcome measurement done well needs a baseline window, an attribution discipline, and a kill criterion for workflows that are not paying back. Done poorly, it is just adoption metrics with better names.

There is also the question of cost. AI usage carries a per-run token bill, an evaluation bill on every change, and (for agents) a cost cap that limits damage when something goes wrong. None of these are large compared to the engineering time saved when the workflow works. All of them are visible enough that a finance-aware reader will ask. Track them.

Where to start

The most concrete starter from this post: promote one personal prompt to a shared template. Pick the prompt that gets used most often (commit messages, code reviews, release notes, debugging assist), move it from someone's notes into the repository where the team versions everything else, and watch what changes when the next person on the team runs it. That is the smallest unit of the workflow shift this post argues for, and it is the step where prompts stop being individual practice and start becoming engineering assets.

The shift

The shift is from building systems to building smarter systems:

- AI does not replace engineers. It changes what an engineer's leverage looks like.
- The unit of value is the workflow, not the suggestion.
- The discipline that made platforms operable is the same discipline that makes AI useful.
- Responsible AI is not a compliance step. It is the sixth engineering discipline that lets the other five compound safely.

The series ends here, but the arc is consistent across all three posts. The disciplines that make platforms scale are the same disciplines that make AI useful. Build with discipline. Run with discipline. Evolve with discipline. The tools change. The disciplines do not.

Want to discuss?

Where has AI moved the needle most in your delivery, and where has it disappointed you? Drop a comment with patterns you have seen in your environment. Every reply gets read.

Previously in this series:

- Building Cloud Native Platforms That Scale: Patterns That Actually Work. Part 1 covered the design choices that make scale possible.
- Running Cloud Native Platforms: Why Day 2 Decides Everything. Part 2 covered the operational disciplines that decide production outcomes.

This is the third and final post in the series.

Cloud Native Platforms: Run
Audience: SREs (Site Reliability Engineers), platform engineers, engineering managers running production systems
Reading time: 8 minutes
Series: Cloud Native Platforms. Build, Run, Evolve. This is Part 2 of 3.

Most systems are designed thoughtfully. Most operations are inherited reactively.

The systems that survive are not the ones built with the most care. They are the ones operated with the most discipline. Production has a way of revealing every shortcut taken during design and every assumption left unverified.

This post is about what it takes to operate a platform once the build is done.

How they are run, not how they are built

Systems are not defined by how they are built. They are defined by how they are run.

A well-designed system that is operated reactively will fail in production. A modestly designed system that is operated with discipline will outperform it. Five operational disciplines decide which side of that line a platform lives on. Each one is engineering work, not a checklist for someone else to handle.

Figure 1. The incident lifecycle as a state machine. The states are not optional steps. They are the contract between the team and the system.

1. Observability is the backbone of reliability

Without observability, every operation becomes a guess. As systems grow, the cost of guessing rises faster than the cost of seeing.

Part 1 of this series argued that observability is a design property: instrumentation contracts, request id propagation, structured logging schemas. Production is where those design choices either pay off or do not. Strong observability in production is a contract that lets any engineer answer three questions in minutes: what failed, why it failed, and what the impact was. The shape of that contract matters more than the tool that implements it. (This three-question framing was popularised through the SRE community and writers such as Charity Majors. See Honeycomb's What is Observability for the canonical articulation of the three-pillars and question framing; the substance is older than the framing.)

- Dashboards organised around user journeys, not infrastructure components
- Service level indicators (SLIs: the specific measurements you care about, e.g., success rate, p99 latency) chosen from the user's perspective, not the database's
- Alerts that page only on burn-rate against an SLO (Service Level Objective: the target value of an SLI, e.g., 99.9% of requests complete in under 800ms over a rolling month) using a multi-window strategy. A short window catches fast burns; a long window catches slow drifts. This is what makes SLOs operational rather than decorative.
- Sampling and retention tuned for cost, but never for blind spots
- MTTA (mean time to acknowledge: how fast someone notices) and MTTR (mean time to restore: how fast service returns) tracked separately. Conflating them hides whether the team's bottleneck is detection, response, or fix.

In practice

The pattern that works: rebuild the operational view around two or three user journeys (sign-in, place order, view history) rather than per-component charts. Tie alerts to error budget burn rather than raw threshold crossings. Track MTTA and MTTR separately so the team's actual bottleneck (detection, response, or fix) is visible. The investment is rethinking what to measure, not buying a new tool. The return is that incidents stop being discovered by customer complaints first. Teams that make this shift typically find their existing telemetry was sufficient; only the questions being asked of it were wrong.

If a dashboard cannot answer "what is the user experiencing right now", it is not an observability dashboard. It is decoration.

2. Alerts are signals, not notifications

More alerts do not mean better monitoring. In practice, the opposite is true. Once alerts outpace the team's ability to act, important signals start getting missed.
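One way to keep pages tied to user impact is the burn-rate alerting described in section 1. The sketch below is a minimal illustration, not a monitoring tool's API: `burn_rate` and `classify` are hypothetical names, and the thresholds (14.4x over one hour, 6x over six hours) are the ones used in this post's runbook example and should be tuned per service.

```python
def burn_rate(bad_requests: int, total_requests: int, slo_target: float) -> float:
    """How fast the error budget is being spent in a window.

    A value of 1.0 means the budget would last exactly the SLO period;
    higher values mean it runs out sooner.
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be between 0 and 1, e.g. 0.999")
    if total_requests == 0:
        return 0.0
    error_budget = 1.0 - slo_target           # allowed failure fraction
    observed_error_rate = bad_requests / total_requests
    return observed_error_rate / error_budget


def classify(burn_1h: float, burn_6h: float) -> str:
    """Map burn rates from a short and a long window onto an action."""
    if burn_1h >= 14.4:    # fast burn: page now
        return "page"
    if burn_6h >= 6.0:     # slow burn: investigate during working hours
        return "investigate"
    return "none"


# Example: 99.9% SLO, 200 failures in 10,000 requests over the last hour.
fast = burn_rate(200, 10_000, slo_target=0.999)
slow = burn_rate(700, 100_000, slo_target=0.999)
print(classify(fast, slow))
```

The short window reacts to sudden failures, while the long window catches slow drift that never crosses a raw threshold. Production setups often also require a second, shorter confirmation window before paging, which this sketch leaves out.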
Effective alerting works to a small set of rules:

- Severity that maps to action, not to technical category
- Ownership baked in, never inferred at runtime
- Thresholds tied to user impact, not raw metric values
- Noise treated as a defect, with a regular review cadence
- Suppression and grouping for known multi-alert patterns

In practice

The pattern that works: audit every alert against one test, "what action would I take in the next five minutes if this fires now?" Demote alerts with no answer to dashboards. Remove alerts where the answer is the same as another alert's. Group related alerts so one incident produces one page, not twelve. Most teams discover their alert volume drops by an order of magnitude after a thorough audit, and the alerts that remain start getting trusted again. Trust is the precondition for every other operational practice. Without it, on-call rotations decay into noise filtering and the real signals get missed.

Figure 2. From raw events to pages, in approximate orders of magnitude. The numbers vary by team and workload; what does not vary is that each stage needs to remove one to two orders of magnitude of noise. Teams that page on raw events end up with on-call rotations nobody trusts.

3. Incident response is a practiced muscle

Failures are inevitable. Unstructured response is not.

The teams that recover quickly do not improvise during incidents. They follow a structure that has been practiced when nothing was on fire. The structure is intentionally simple, because incident time is the worst time to negotiate roles.

- Clear roles: incident lead, communications lead, scribe, subject matter expert (the RACI model, Responsible-Accountable-Consulted-Informed, adapted for incident response)
- Defined escalation paths with clear handoff criteria. Escalation means re-paging to a higher tier or specialist, not returning to detection. The lifecycle diagram in Figure 1 makes the distinction explicit.
- Runbooks for the top failure modes, kept short enough to actually be read
- Status communication on a fixed cadence, even when there is nothing new to say. Customer comms and internal comms are tracked separately.
- Blameless postmortems (focus on the system that allowed the failure, not the person who pushed the button) that produce action items the team actually completes
- Game days: scheduled exercises that simulate failure modes (region outage, dependency unavailability, traffic spike) under controlled conditions, so gaps in runbooks are found before incidents find them

In practice

The pattern that works: name the incident lead and the comms lead before the first message goes out. Write runbooks short enough to be scannable at 3 AM. Run blameless postmortems with action items that actually get tracked to completion. Schedule game days quarterly so the runbooks are exercised before real incidents. Teams that operate with this structure do not have more engineers; they have engineers who are not single points of failure during recovery. The deepest experts stay the deepest experts, but the platform stops depending on whether they happen to be online.

Implementation note

A short, well-structured runbook outperforms a long, exhaustive one. The goal during an incident is not to think. It is to act on a procedure that has been thought through in calmer times.

    # Runbook header pattern (keep it scannable in incident time)
    title: High latency on order API
    slo_protected:            # this runbook protects two SLOs
      - order-completion-success
      - order-completion-latency
    severity:                 # derived from burn rate, not declared
      fast_burn: P1           # 14.4x budget burn over 1 hour => page now
      slow_burn: P2           # 6x budget burn over 6 hours => investigate
    owner: payments-team
    indicators:               # triggers for evaluation, not severity
      - p99 (99th-percentile) latency exceeds the SLO target for 5 min
      - error rate exceeds the SLO target for 3 min on order-completion
    first_actions:
      - Open the order-journey dashboard. Confirm impact in business terms.
      - Check Service Bus queue depth and dead-letter rate
        (the most common cause of API latency under load is downstream backpressure)
      - Verify Cosmos DB RU/s saturation and partition hotspots
      - Inspect the most recent deployment for behavioural changes
    escalate_if:
      - Latency does not recover in 15 min
      - Error rate exceeds 5% (fast burn against the SLO)
      - Customer reports arrive before our own signals do
    rollback_path:
      - Feature flag "new-order-pipeline" can be disabled per-tenant
      - Last known good deployment id is in the release tracker
    note_on_scaling:
      # CPU is rarely the cause of latency in this service. Scale only after
      # confirming the bottleneck is compute, not a downstream dependency or
      # queue depth. Adding capacity to a saturated downstream amplifies the
      # incident; it does not resolve it.

The general principle behind that last note travels beyond this runbook: scale-out is the right remediation for compute saturation, not for downstream saturation. When latency rises because a database, queue, or external dependency is saturated, adding capacity in front of the bottleneck moves more requests into the bottleneck and makes the incident worse. This is one of the most common operational mistakes when the dashboard shows red and the on-call instinct says "add more".

4. Release confidence is engineered

Releases get harder as systems grow. The platforms that ship confidently at scale have engineered the path, not learned to fear it.
The patterns that change the math:

- Feature flags that allow change without deploy
- Canary deployments that release the new version to a small slice of traffic first and watch error budget burn before continuing, so problems surface on that slice
- Gradual rollouts with automated rollback triggers
- Database migrations split from application releases
- Release coordination that scales with services, not with team size

In practice

The pattern that works: every change ships behind a feature flag, canary deployments take a small slice of traffic first, and rollback is a one-click step in the pipeline rather than a procedure to be invented during an incident. The cost is the discipline of building rollback paths and exercising them. The return is releases that stop being events. Issues that previously triggered full rollbacks get isolated to a slice and rolled back automatically before they reach most users. The willingness to ship smaller, more frequent changes follows directly from the confidence that bad changes can be undone fast.

Big releases feel safe because they are rare. They are actually risky because every change rides together.

5. Reliability is continuous, not a milestone

Reliability is not achieved through tools alone. It requires continuous refinement, feedback-driven improvement, and a budget that the team can spend on operational work without negotiating each time.

The disciplines that keep systems reliable over years are codified well in the SRE-book framing of service level objectives and error budgets (the canonical reference is the Google SRE Book chapter on Service Level Objectives, with the operational follow-up in the SRE Workbook chapter on alerting on SLOs). The names matter less than the practice they enable.

- SLOs chosen from the user's perspective, with two or three per service rather than ten. More SLOs means none of them shape behaviour.
- Error budgets: the inverse of the SLO, expressing how much unreliability the team is willing to spend in a window. A budget used up early in the month means slowing down on releases. A healthy budget means feature work keeps moving.
- Multi-window burn-rate alerting turns SLOs from dashboards into pages: a short window catches catastrophic failures, a long window catches slow drift. Without burn-rate alerting, SLOs are observation, not operation. (The pattern is documented in the SRE Workbook.)
- Reliability work has its own backlog, prioritised against features. Not a wishlist after every incident.
- Regular game days that exercise failure modes (region failover, dependency outage, traffic spike) before they happen for real
- Capacity planning informed by data, not by anxiety

In practice

The pattern that works: define two or three SLOs per service, expressed from the user's perspective. Compute the error budget weekly. When the budget is healthy, ship feature work. When the budget is burning fast, slow down and fix the cause. The conversation about which incidents matter and which can wait becomes possible because there is a shared number to point at. Reliability becomes a quantified property of the platform, not an opinion debated at every retrospective. Teams that adopt this discipline stop having the recurring "how reliable do we need to be?" argument and start having data-grounded trade-off discussions instead.

A scenario that ties it together

A platform was launching a new region. The build had gone well. Day 1 was clean. Two weeks in, latency started creeping up during peak hours. Alerts fired on raw thresholds, but no one could tell which ones to trust. Incident calls turned into long debugging sessions because three different teams owned overlapping pieces of the request path.

The team did not start by buying a new tool. They started by treating operations as engineering work. The dashboard was redesigned around the user journey. Alerts were audited and most were demoted or removed. Roles for incident response were written down. A short runbook covered the top failure modes. Releases were broken into canary slices behind feature flags.

None of this was new. It was discipline applied consistently to work that was previously assumed to be someone else's. The next region launch took half the effort, and the team's mean time to restore on the failures that did happen was measurably lower.

What teams get wrong

The common pattern is treating Day 2 as the cost of Day 1. Teams design beautifully, ship fast, then quietly absorb the operational debt. Dashboards proliferate. Alerts grow louder. Postmortems pile up. The fix is not more dashboards. It is treating operations as engineering work with the same rigour as feature delivery.

Operability is a property the system either has or does not. It is not earned by adding monitoring. It is earned by designing for visibility and operating with discipline.

Where to start

The most concrete starter from this post: an alert audit. List every alert that fires in the next week and apply a single test to each one: "what action would I take in the next five minutes?" Demote the alerts that have no answer. Remove the alerts where the answer is the same as another alert's. The audit takes a morning. The result usually halves alert volume and lifts trust in what remains, which is the precondition for every other operational practice in this post.

The shift

The most important shift in maturity is not technical. It is in stance.

The shift is from shipping software to operating systems:

- Operations is not a phase that follows engineering. It is engineering.
- Reliability is not a milestone reached. It is a discipline practiced.
- Incidents are not interruptions to the work. They are the work.

The teams that internalise this shift run platforms that are smaller, calmer, and more trusted. They do not have fewer incidents because their systems are more advanced. They have fewer incidents because their operational discipline is more consistent.

Part 3 of this series argues that the same discipline applies again, in a different domain: the practices that make platforms operable are the practices that make AI useful in delivery.

Want to discuss?

What is the one operational practice your team adopted that changed how you sleep at night? Drop a comment with patterns you have seen in your environment. Every reply gets read.

Previously in this series:

- Building Cloud Native Platforms That Scale: Patterns That Actually Work. The first post covered the design choices that make scale possible.

Next in this series:

- AI-First Platform Engineering: From Copilot to Agentic Delivery. Cloud helped us scale infrastructure. The next post looks at how AI is now changing how we build and run platforms.

Cloud Native Platforms: Build
Audience: Cloud architects, platform engineers, engineering leaders making design decisions
Reading time: 8 minutes
Series: Cloud Native Platforms. Build, Run, Evolve. This is Part 1 of 3.

Most engineering teams can build systems. Few can scale them without rebuilding them.

As platforms grow, complexity does not increase linearly. It multiplies across users, services, tenants, regions, and integrations. The systems that struggle and the systems that scale are rarely separated by which cloud they run on. They are separated by a handful of design choices made early and applied consistently. This post is about those choices.

The differentiator is not the cloud

Scalable platforms are not built with the right tools. They are built with the right design choices.

Cloud services have closed the gap on infrastructure. The differentiator is no longer which managed service a team picks. It is whether the platform is designed to absorb change, tolerate failure, and support visibility from day one. Five engineering disciplines determine whether a platform scales gracefully or collects technical debt while it grows.

Figure 1. The five disciplines compound into platform scale. Any one neglected becomes the constraint that forces a rewrite later.

1. Flexibility is the foundation of scale

Hard-coded systems work until they do not. The first request to add a tenant, a region, a SKU (a sellable product variant), or a regulatory variant is the moment a rigid design starts to bend. Each subsequent request adds weight.

Scalable platforms move behavior out of code:

- Configuration replaces conditional logic
- Feature flags enable safer, tenant-scoped rollouts
- APIs evolve through versioning, not breaking changes
- Schemas evolve additively. Breaking changes go through versioned contracts with a deprecation window long enough that consumers can migrate without downtime.
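The first two items in the list above (configuration replacing conditional logic, and tenant-scoped flags) can be pictured with a minimal sketch. Everything named here is hypothetical and for illustration only: the `FeatureFlag` class, the `FLAGS` dictionary standing in for a managed configuration store, and the `new-invoice-engine` flag are assumptions, not part of any specific product.

```python
from dataclasses import dataclass, field


@dataclass
class FeatureFlag:
    """Hypothetical flag: a global default plus explicit per-tenant overrides."""
    name: str
    default: bool = False
    # Assumed shape: tenant id -> forced on/off for that tenant only.
    tenant_overrides: dict[str, bool] = field(default_factory=dict)

    def is_enabled(self, tenant_id: str) -> bool:
        # An explicit tenant override wins; otherwise fall back to the default.
        return self.tenant_overrides.get(tenant_id, self.default)


# Configuration as data. In a real platform this would be loaded from a
# managed store and versioned/reviewed like code; here it is an in-memory dict.
FLAGS = {
    "new-invoice-engine": FeatureFlag(
        name="new-invoice-engine",
        default=False,
        tenant_overrides={"tenant-pilot": True},
    ),
}


def render_invoice(tenant_id: str) -> str:
    """Pick the code path from configuration instead of hard-coded tenant checks."""
    if FLAGS["new-invoice-engine"].is_enabled(tenant_id):
        return "v2"
    return "v1"
```

With this shape, gating a change to a single tenant for verification is a configuration edit (`tenant_overrides`), and a broad rollout is a change to `default`, neither of which needs a redeploy.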
In practice

The pattern that works: configuration in a managed store, feature flags with tenant scope, and APIs versioned per consumer contract. Cost is the discipline of treating configuration as code (versioned, reviewed, audited). The return is that releases stop being events and start being routine. A change that previously needed a coordinated deployment can be executed in minutes, gated to a single tenant for verification, and rolled out broadly only after the signal is clean. Most platforms reach this state by retrofit, not by design. Doing it earlier costs less than waiting.

If a change requires a redeploy, it should require a very good reason.

2. Failures are normal. Resilience is a choice.

Distributed systems will fail in unpredictable ways. The real question is not how to prevent failure. It is how the system responds when failure happens.

Resilience is engineered, not inherited from the platform. The patterns that move the needle are well known and consistently applied:

- Idempotent operations (safe to call multiple times with the same result) that make retries safe
- Reliable messaging patterns such as the transaction outbox (writing the message to the same database transaction as the business change, then publishing asynchronously) to avoid lost or duplicated events
- Decoupled services that contain blast radius (the scope of damage when one component fails)
- Timeouts, retries, and circuit breakers (a wrapper around a dependency that stops calling it for a cool-off window after repeated failures) tuned per dependency
- Bulkheads (isolation pools, often a separate compute or queue lane per workload class) that keep noisy neighbours from starving critical paths of resources

In practice

The pattern that works: every write that can be retried carries an idempotency key, every queue consumer is safe to replay, every event published goes through an outbox in the same transactional unit as the business change. When peak load triggers retries, duplicates collapse cleanly instead of producing duplicate orders, double-charged customers, or split-brain state. The contract changes outwards: callers can retry without thinking, queues can be at-least-once instead of exactly-once, and recovery moves from a manual cleanup task to a property of the system. Most teams that adopt this pattern stop seeing certain classes of incident entirely.

Implementation note

An idempotent API is not just a design preference. It changes how the rest of the system can be built. Once writes are safe to repeat, retries become cheap, queues become trustworthy, and recovery becomes automatic.

The naive implementation (read the key, if absent process and save) has a race. Two concurrent requests with the same key both miss the lookup, both call the processor, and both attempt to save. That is the failure mode idempotency exists to prevent.

The pattern that survives production is an atomic reserve-then-execute: insert a row keyed by the idempotency key with a unique constraint before doing any work. The first writer wins. Concurrent callers either wait for the original to complete and read its result, or they receive a conflict response.

    // Contract for the idempotency store. The two key methods are TryReserveAsync
    // (atomic insert with unique-key constraint) and CompleteAsync (record the
    // result of the first writer). GetCompletedResultAsync polls until the first
    // writer commits or returns 409 Conflict if the in-flight window exceeds the
    // configured deadline.
    public interface IIdempotencyStore
    {
        Task<Reservation> TryReserveAsync(
            string idempotencyKey, string requestHash, CancellationToken ct);

        Task CompleteAsync(
            string idempotencyKey, OrderResult result, CancellationToken ct);

        Task<OrderResult> GetCompletedResultAsync(
            string idempotencyKey, CancellationToken ct, TimeSpan? maxWait = null);
    }

    public readonly record struct Reservation(
        bool IsFirstWriter, string RequestHash);

    // Idempotency via atomic reserve-then-execute.
    // First writer wins; replays return the original result; concurrent
    // duplicates lose the race and read the winner's outcome (or get 409).
    public async Task<OrderResult> CreateOrderAsync(
        Order order, string idempotencyKey, CancellationToken ct)
    {
        var requestHash = StableHash(order); // canonical content hash

        // Atomic insert: succeeds for the first caller, fails for the rest.
        var reserved = await _store.TryReserveAsync(
            idempotencyKey, requestHash, ct);

        if (!reserved.IsFirstWriter)
        {
            if (reserved.RequestHash != requestHash)
                throw new IdempotencyKeyReusedException();

            // A previous run committed (return its result) or is in-flight
            // (poll with a bounded deadline; 409 if exceeded).
            return await _store.GetCompletedResultAsync(
                idempotencyKey, ct, maxWait: TimeSpan.FromSeconds(5));
        }

        // We are the first writer. Execute, persist, mark complete.
        var result = await _processor.ProcessAsync(order, ct);
        await _store.CompleteAsync(idempotencyKey, result, ct);
        return result;
    }

Three production details matter:

- TTL or compaction on the idempotency record. Without it, the store grows forever. Most teams retain records for the request retry window plus a safety margin (commonly 24 to 72 hours).
- Stable content hash, not the default object hash code. The request hash detects key reuse with a different body, so a client that reuses an idempotency key with a different payload receives IdempotencyKeyReusedException rather than silently getting the wrong result. Canonicalise field ordering, locale, and null handling before hashing.
- Bound the in-flight window explicitly. The genuinely hard case is when the processor succeeded but the store write failed. Production-grade implementations either run the side-effect and the store write in the same transaction (when the processor and store share a database) or use the transaction outbox pattern to bridge them. The poll-with-deadline in GetCompletedResultAsync handles the duplicate-arrives-mid-flight case; the transactional boundary handles everything else.

3. Observability is not optional

Without observability, teams operate blind. As systems grow, the price of guessing rises faster than the price of seeing.

At build time, observability is a design property. The decisions made before the system reaches production are what determine whether it can be operated at all. The dashboards, alerts, and incident practices covered in Part 2 of this series rely on instrumentation choices made here.

The build-time work that pays off in production:

- Request identifiers propagated through every service hop, every queue, every async boundary, so a single user action can be traced end to end
- Structured logging with a consistent schema (event name, correlation id, tenant, severity) rather than free-form strings
- Metrics emitted at the boundaries that matter (every external call, every queue read or write, every database operation), not only at the entry point
- Tracing libraries integrated at the framework or middleware layer so coverage is automatic, not opt-in
- Schemas designed so business signals (orders, sessions, transactions) and system signals (CPU, latency, errors) share the same identifiers and can be correlated later

In practice

The pattern that works: a single request id flowing through every service hop, every queue, every async boundary, propagated automatically at the framework layer rather than per-call. Add one structured logging schema across services (event name, correlation id, tenant, severity), so that a single query joins business events with system events. The investment is hours of upfront framework wiring. The return is that production diagnosis stops being archaeology. Cross-service questions become single dashboards; postmortems shrink from days to hours; and the dashboards in Part 2 actually work because the data underneath is shaped to support them.

4. Delivery practices set the ceiling

Scaling teams requires scaling delivery. Small inefficiencies in pipelines, environments, and release coordination compound into measurable drag.

Delivery maturity that pays off at scale:

- Pipelines as code, reviewed and versioned like application code
- Parallel deployments across services and regions where dependencies allow
- Infrastructure as code with shared modules, not hand-managed environments
- Automated quality gates: tests, security scans, dependency checks
- Trunk-based development (developers commit to a single shared branch many times a day) with short-lived feature branches and progressive delivery. Important caveat: trunk-based works only when test automation and feature flags are already in place. Adopting it before those foundations exist tends to amplify production incidents rather than reduce them.

In practice

The pattern that works: pipelines run in parallel where dependencies allow, infrastructure provisioning is templated rather than per-environment, and quality gates run automatically rather than as discretionary steps. Sequential deployment of a multi-service platform across three environments takes hours; parallelised deployment of the same change takes minutes. The payback is not only release speed. It is the compounding cost reduction of every wait state for every engineer on every release. Teams that treat pipelines as a product feature, not an afterthought, ship more confidently and recover from bad changes faster because the rollback path was exercised, not invented during an incident.

Slow pipelines are not a tooling problem. They are a design problem.

5. Cost discipline is engineering work

Cloud platforms can become expensive quickly when cost is treated as someone else's problem. Cost is a property of the design, not a quarterly review.

The teams that get this right treat cost the same way they treat performance:

- Elastic compute and storage tiers chosen per workload pattern
- Non-production environments with automated scale-down windows (the easiest savings to leave on the table)
- Tagging discipline so cost can be attributed to a service, a feature, a tenant
- Egress and data-tier choices, not compute, dominate cloud bills past a certain scale. Right-size storage tiers (hot vs cool vs archive), eliminate cross-region chatter, and watch egress on the data plane more closely than compute on the request path.
- Budgets and usage alerts wired into the same channels as reliability alerts
- Cost reviews built into design discussions, not deferred to FinOps (Financial Operations: the practice of managing cloud spend as an engineering concern)

In practice

The pattern that works: non-production environments scale down automatically outside business hours, storage tiers match access patterns (hot, cool, archive), and tagging is enforced so every dollar can be attributed to a service or feature. Cost reviews happen at design time, not after the bill arrives. The biggest savings come from data plane decisions, not compute: cross-region egress, oversized storage tiers, and forgotten test environments dominate cloud bills past a certain scale. Treat cost as a first-class non-functional requirement, alongside latency and availability, and the discipline compounds in every design discussion that follows.

A scenario that ties it together

Figure 2. A reference architecture that puts the disciplines into one shape. The request path is decoupled, the data layer is purpose-fit, identity is brokered by managed identity throughout, private endpoints isolate the data tier from public networks, and observability runs as a first-class lane.
Picture a multi-tenant platform at a growth inflection. Onboarding a new tenant takes weeks because tenant-specific behaviour is hard-coded across services. Every release carries risk because there is no way to roll out a change to one tenant without affecting the rest. Incidents linger because logs and metrics live in different tools and nobody can correlate them in production.

Do not start with a rewrite. Start with the smallest set of changes that unlocks the next year of growth: extract configuration out of code, introduce tenant-aware feature flags, wire a unified observability view into the existing services, and parallelise the pipelines. None of these are architectural revolutions. They are design choices applied with discipline, in the order the disciplines compound.

Eighteen months in, onboarding a tenant takes hours instead of weeks. Releases move from monthly events to weekly increments. Incidents are caught earlier and resolved faster. The platform did not get bigger. It got more capable. The five disciplines did the work; the team made the choice to apply them.

What teams get wrong

The common pattern is architecting for the system you have, not the system you are growing into. It looks like progress because the current sprint ships. Pillars get postponed because they feel like overhead. The cost surfaces later. Each shortcut becomes a constraint. The constraints compound, and three releases later the team is debating a rewrite.

The fix is not premature abstraction. It is small, deliberate investments in flexibility, resilience, observability, delivery, and cost from day one. The discipline is to make these investments before they are urgent.

Where to start when you cannot do everything at once

Five disciplines is a wall, and real teams cannot fund all five at once. The right order depends on whether the platform is being built fresh or already running.
For a system already in production and already in pain, the SRE community's hierarchy of reliability needs gives the most defensible starting order: monitoring and observability first (you cannot fix what you cannot see), then incident response (close the bleeding cleanly), then resilience patterns (idempotency, retries, decoupling) so the bleeding has fewer reasons to start, then flexibility and delivery so safe change can travel at speed. Cost discipline runs alongside throughout, never as the headline.

For a system being built fresh, the order in this post (flexibility, resilience, observability, delivery, cost) reflects the Azure Well-Architected Framework's emphasis on designing for change, failure, and visibility before scaling teams or workloads.

Both orders are defensible. What is not defensible is leaving any of the five for later.

The most concrete starter from this post: request id propagation. A single correlation identifier travelling through every service hop, every queue, every async boundary, costs hours up front and pays back every time someone has to debug production for the rest of the platform's life. It is the smallest unit of the observability discipline and the foundation that the dashboards, traces, and incident response in Part 2 all depend on.

The shift

The most important transformation in scaling a platform is not technical. It is mindset. The shift is from project thinking to platform thinking:

- Build reusable capabilities, not one-off solutions
- Design systems for long-term evolution, not the next release
- Enable other teams, not just deliver for one team

Tools change. Cloud services evolve. The architectural fashions of this year will not be the architectural fashions of the next. What persists is the discipline behind the choices.

Scalable systems are not built by tools. They are built by teams that treat design as continuous work.
The same discipline shows up again in Part 2 (operating these systems) and Part 3 (using AI to augment that work). The tools change. The disciplines do not.

Want to discuss?

What single design choice has paid the most dividends in the platforms you run? Drop a comment with patterns you have seen in your environment. Every reply gets read.

Next in this series: Running Cloud Native Platforms: Why Day 2 Decides Everything. Building is half the journey. The next post looks at what it takes to operate these platforms once they are in production.

From Prompt to Production: Building Azure Architecture Diagrams with AI
Author: Arturo Quiroga, Senior Partner Solutions Architect, Microsoft

Cloud architects spend significant time translating ideas into architecture diagrams. They toggle between Visio, draw.io, pricing calculators, and documentation. According to the 2024 Stack Overflow Developer Survey, 61% of developers spend more than 30 minutes a day searching for answers or solutions, time lost to context-switching rather than design.

What if you could describe your architecture in plain English and get a diagram, cost estimate, and deployment guide in minutes?

The Challenge: Fragmented Architecture Workflows

Designing Azure architectures today typically involves multiple disconnected steps:

- Sketch the architecture in a diagramming tool
- Look up official Azure icons and drag them into place
- Research pricing across regions using the Azure Pricing Calculator
- Validate the design against the Well-Architected Framework (WAF)
- Write deployment documentation and Infrastructure as Code templates
- Compare alternative designs manually

Each step lives in a different tool, and keeping them in sync as designs evolve is costly. The Azure Architecture Diagram Builder brings these workflows together in a single browser-based experience.

How It Works

Describe your architecture in natural language, for example "A HIPAA-compliant healthcare platform with FHIR APIs, event-driven processing, and multi-region disaster recovery", and the AI generates a diagram with grouped services, data flow connections, and logical organization.

Figure 1. Enter a natural-language prompt describing your architecture. Curated example prompts help you get started, and you can optionally upload an existing diagram for the AI to analyze.

The tool uses Azure OpenAI to power generation across multiple models, enabling you to choose the model that best fits your scenario, from fast iterations to deeper reasoning.
Key Features

AI-Powered Architecture Generation

Describe what you need in plain English, and the AI creates an architecture diagram with:

- 714 official Azure service icons across 29 categories
- Smart grouping: services are logically organized (Frontend, Backend, Data, Security)
- Data flow connections: labeled edges showing how data moves through the system
- 13 curated example prompts: from simple web apps to complex enterprise scenarios like Zero Trust networks, Industrial IoT with 5,000+ sensors, and global multiplayer gaming backends

Figure 2. A generated industrial IoT architecture. Top: the clean diagram view as initially produced. Bottom: the same diagram with per-service monthly cost overlays toggled on, plus a running subscription total in the toolbar.

Architecture Image Import

Already have an architecture on a whiteboard or in a screenshot? Upload the image and let the AI analyze it, mapping services to official Azure icons and recreating the architecture as an editable, interactive diagram.

Figure 3. Upload a photo of a whiteboard sketch (top-right reference panel) and the AI recreates it as an editable diagram with official Azure service icons and labeled data flow connections.

ARM Template Import

Import existing ARM templates to visualize your current infrastructure. The AI parses resource definitions and dependencies, groups related resources into logical layers, and produces a meaningful diagram of what you actually have deployed. This is a fast way to document an inherited environment or sanity-check a template before deployment.

Figure 4. ARM template import in action. Top: the parser status banner while resources and dependencies are being analyzed. Bottom: the resulting diagram, with resources auto-grouped into logical layers (Web Tier, Data Layer, Container Platform, Observability & Logging) and a Generated from: ARM Template badge linking the diagram back to its source file.
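To make the ARM import step more concrete, here is a rough sketch of the kind of information such an import has to pull out of a template: resources, a layer for each, and dependency edges. This is not the tool's parser (the post describes that step as AI-driven). The `LAYER_BY_NAMESPACE` mapping and the `summarise_template` function are hypothetical, and the sketch assumes `dependsOn` entries are plain resource names, whereas real templates often use `resourceId()` expressions that would need evaluating first.

```python
import json

# Hypothetical mapping from resource-provider namespace to a display layer.
# The layer names mirror those shown in the Figure 4 caption.
LAYER_BY_NAMESPACE = {
    "Microsoft.Web": "Web Tier",
    "Microsoft.Sql": "Data Layer",
    "Microsoft.Storage": "Data Layer",
    "Microsoft.OperationalInsights": "Observability & Logging",
}


def summarise_template(template_json: str):
    """Return (layers, edges) for the top-level resources of an ARM template.

    layers: layer name -> list of resource names
    edges:  list of (resource name, dependency) pairs taken from dependsOn
    """
    template = json.loads(template_json)
    layers: dict[str, list[str]] = {}
    edges: list[tuple[str, str]] = []
    for resource in template.get("resources", []):
        # "Microsoft.Web/sites" -> "Microsoft.Web"
        namespace = resource["type"].split("/")[0]
        layer = LAYER_BY_NAMESPACE.get(namespace, "Other")
        layers.setdefault(layer, []).append(resource["name"])
        for dependency in resource.get("dependsOn", []):
            edges.append((resource["name"], dependency))
    return layers, edges
```

The layers become the groups on the canvas and the edges become the connections between services; anything the mapping does not recognise falls into an "Other" bucket rather than being dropped.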
Well-Architected Framework Validation

Validate your architecture against all five WAF pillars: Security, Reliability, Performance Efficiency, Cost Optimization, and Operational Excellence. The validator provides:

- An overall WAF score with pillar-level breakdowns
- Specific findings with severity levels
- Actionable recommendations you can select and apply

Select the recommendations you agree with, and the AI regenerates an improved architecture incorporating those changes.

Figure 5. WAF validation results showing the overall score, per-pillar breakdowns, and individual findings with severity badges. Tick the recommendations you want and the AI rebuilds the diagram with those changes applied.

Multi-Model Comparison

Run the same architecture prompt through multiple AI models side-by-side and compare:

- Architecture Comparison: service counts, connection counts, groups, token usage, and latency
- Validation Comparison: WAF scores across models, severity breakdowns, and finding counts
- Apply Winner: pick the best result and apply it to the canvas with one click
- Present Critique: a talking avatar narrates the AI-generated ranking with live closed captions

Figure 6. Multi-model comparison. Top: select the models and reasoning effort, then enter the prompt. Bottom: side-by-side results across all selected models with service counts, latency, token usage, and Fastest / Cheapest / Most Thorough badges.

Multi-Region Cost Estimation

Get cost estimates from the Azure Retail Prices API across 8 Azure regions: East US 2, Australia East, Canada Central, Brazil South, Mexico Central, West Europe, Sweden Central, and Southeast Asia. Features include:

- Color-coded cost legend (green / yellow / red thresholds)
- SKU and tier information for each service
- Export options: CSV, JSON, plain-text summary, and an analysis report with top cost drivers, Reserved Instance flags, and a ranked multi-region comparison table

Figure 7. The cost legend overlay shows per-service pricing with color-coded thresholds. The region selector in the toolbar lets you re-price the entire architecture in any of eight Azure regions.

Deployment Guide Generation with Bicep

Generate step-by-step deployment documentation including:

- Prerequisites and Azure resource requirements
- Step-by-step deployment instructions
- Bicep templates for each service (Infrastructure as Code)
- Post-deployment verification steps
- Security configuration recommendations

Figure 8. Each generated Deployment Guide opens with the architecture name, an estimated deployment time, and a prerequisites checklist covering subscription roles, CLI versions, Microsoft Entra ID permissions, and region requirements, followed by numbered, copy-ready deployment steps.

Figure 9. The Infrastructure as Code section produces a main.bicep orchestrator plus a per-service module (Log Analytics, Key Vault, Cosmos DB, SQL Database, Event Hubs, Azure Functions, and more). The Download All Templates button packages everything into a ready-to-deploy folder.

Workflow Animation & Avatar Presenter

Visualize how data flows through your architecture with step-by-step animations that highlight services on the canvas as each step plays. When the Azure Speech Service is configured, a photorealistic talking avatar can narrate the workflow or present model comparison results, with live word-by-word closed captions in a draggable, resizable panel.

Figure 10. A workflow step is highlighted on the canvas as the Avatar Presenter narrates that step. Live word-by-word closed captions appear in a draggable, resizable panel, useful for accessibility and stakeholder demos.

Export Options

Figure 11. A single-slide PowerPoint export, available in dark or light theme, ready to drop straight into a stakeholder deck.
| Format | Use Case |
| --- | --- |
| PNG | Documentation, presentations |
| SVG | Scalable vector graphics |
| PPTX | Single PowerPoint slide (dark or light theme) |
| Draw.io | Edit in diagrams.net |
| JSON | Backup, version control |
| CSV / ZIP | Cost analysis with multi-region comparison |

Highlights

The Azure Architecture Diagram Builder unifies the architecture design lifecycle in a single tool:

- End-to-end workflow: from natural-language description to deployable Bicep templates without tool switching
- Official Azure icons: 714 icons across 29 categories, mapped directly from the Azure service catalog
- Live pricing: queries the Azure Retail Prices API at design time rather than relying on static estimates
- WAF-integrated validation: architectural best practices built into the design loop rather than applied after the fact
- Multi-model flexibility: choose the AI model that best suits each task, with fast models for iteration and reasoning models for complex designs
- Open source: the source code is available for customization and contribution

One-Command Deploy with Azure Developer CLI

The fastest way to get your own instance running is with azd:

    # Install azd (once)
    brew tap azure/azd && brew install azd   # macOS
    winget install microsoft.azd             # Windows

    # Clone, configure, and deploy
    git clone https://github.com/Arturo-Quiroga-MSFT/azure-architecture-diagram-builder
    cd azure-architecture-diagram-builder
    azd auth login
    azd env set AZURE_OPENAI_ENDPOINT "https://your-resource.openai.azure.com/"
    azd env set AZURE_OPENAI_API_KEY "your-key"
    azd up   # Provisions infrastructure + builds + deploys (~8 min)

azd up provisions the following via Bicep:

| Resource | Purpose |
| --- | --- |
| Azure Container Registry | Stores the Docker image |
| Azure Container Apps | Runs the app (nginx + token server) |
| Log Analytics + Application Insights | Monitoring and telemetry |
| Azure Speech (S0) | Avatar Presenter (optional, keyless auth via managed identity) |

Try It Today

The Azure Architecture Diagram Builder is available now:

- Live demo: https://aka.ms/diagram-builder
- Source code: GitHub repository
- Documentation: See the Getting Started Guide for detailed setup instructions

We welcome feedback and contributions. Use the GitHub Issues page to report bugs, suggest features, or share your experience.

Tags: artificial intelligence · application · apps & devops · well architected · infrastructure

WAR, Azure Advisor, and Us (Azure Arch Diagram Builder): Three Ways to Score an Azure Architecture
Author: Arturo Quiroga, Azure AI services Engineer - Senior Partner Solutions Architect, Microsoft

A few days ago I published From Prompt to Production: Building Azure Architecture Diagrams with AI, introducing the open-source Azure Architecture Diagram Builder. One feature got more follow-up questions than any other: the Well-Architected Framework (WAF) validation.

Architects from partners and customers, many of whom already use Azure Advisor and the Well-Architected Review, wanted to know exactly what scoring algorithm we use, how it compares to Microsoft's official tools, and whether they should be using all three.

This post is that answer. It's a deep dive into how design-time WAF validation works, how Microsoft's two official WAF assessment algorithms work, and where each fits in the architecture lifecycle.

TL;DR. Microsoft ships two WAF assessment vehicles: the Well-Architected Review (questionnaire, scored from human answers) and the Azure Advisor score (healthy-resources-÷-applicable-resources weighted per subcategory, with Defender Secure Score for Security and cost-weighted math for Cost). Both require either a human filling in a form or live Azure telemetry. Our app runs at design time on a diagram, before anything is deployed, using a hybrid pipeline: a deterministic rule pre-scan followed by an LLM refinement pass. Same five WAF pillars, different lifecycle stage. Complementary, not competitive.

Why design-time validation matters

Every cost overrun, reliability gap, and security incident I've ever debugged was cheaper to fix on a whiteboard than in production. Yet most WAF tooling assumes the architecture already exists, either because there are deployed resources to scan (Advisor) or because someone has built enough of it to answer 60 specific questions about it (WAR).

That leaves a gap. Between "rough sketch" and "deployed resource group" there is no algorithmic WAF feedback loop. That's the gap the Diagram Builder fills.
Microsoft's two official WAF assessment algorithms Before describing our approach, it's worth being precise about what Microsoft already ships, because the term "WAF assessment algorithm" can mean either of two very different things. 1. Azure Well-Architected Review (WAR) — questionnaire-based The Well-Architected Review is a free self-assessment hosted on Microsoft Learn. Aspect Detail Input Human answers to ~60 questions mapped to the WAF pillar checklists Workload variants Core WAR, plus AI/ML, IoT, SAP on Azure, Azure Stack Hub, SaaS, Mission Critical Scoring Derived from the answers — each "no" or unanswered question subtracts from the pillar score Output Per-pillar maturity score + prioritized recommendations + optional Advisor integration Improvement tracking "Milestones" (point-in-time snapshots) When to use Periodic deep reviews; greenfield design baselining; brownfield audits WAR is human-driven. The algorithm is essentially "how many of the recommended practices have you confirmed you do?" — which is exactly the right algorithm when the assessor is the workload team itself. 2. Azure Advisor Score — telemetry-based The Advisor score is the closest thing Microsoft ships to a real, deterministic WAF algorithm. It runs continuously over your deployed Azure resources. The math: Pillar-specific overrides: Security uses Microsoft Defender for Cloud's Secure Score model. Cost weights by retail $ cost of healthy resources, plus age-of-recommendation weighting; postponed/dismissed items are removed from the denominator. Reliability / Performance / Operational Excellence use the healthy-resources ratio above. Key terms: Healthy resource — a deployed resource with no open Advisor recommendation against it for that pillar. Total applicable — resources Advisor was able to evaluate (excludes dismissed/snoozed). Advisor is the right tool once you're in production. It cannot help you before deployment, because there is nothing to count as "healthy" or "applicable." 
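The healthy-to-applicable ratio described above can be sketched in a few lines. This is only an illustration of the ratio and of a weighted roll-up: the type names, the `weight` field, and the sample numbers are hypothetical, and Advisor's real subcategory weights and its Security and Cost overrides are not reproduced here.

```typescript
// Illustrative only: names, weights, and sample data are assumptions, not Advisor's API.
interface SubcategoryCount {
  name: string;
  healthy: number;    // resources with no open recommendation in this subcategory
  applicable: number; // resources Advisor could evaluate (dismissed/snoozed excluded)
  weight: number;     // hypothetical relative weight of the subcategory
}

// Ratio for one subcategory as a percentage. Returns null when nothing is
// applicable, which is the pre-deployment situation the post describes.
function subcategoryScore(s: SubcategoryCount): number | null {
  if (s.applicable <= 0) return null;
  return (100 * s.healthy) / s.applicable;
}

// Weighted average over the subcategories that could be scored.
function pillarScore(subcategories: SubcategoryCount[]): number | null {
  let weighted = 0;
  let totalWeight = 0;
  for (const s of subcategories) {
    const score = subcategoryScore(s);
    if (score === null) continue;
    weighted += score * s.weight;
    totalWeight += s.weight;
  }
  return totalWeight === 0 ? null : weighted / totalWeight;
}

// Made-up sample data for a single pillar.
const reliability: SubcategoryCount[] = [
  { name: "zone-redundancy", healthy: 8, applicable: 10, weight: 2 },
  { name: "backup", healthy: 3, applicable: 6, weight: 1 },
];
const reliabilityScore = pillarScore(reliability);
const emptyScore = pillarScore([]); // nothing deployed, so nothing to score
```

The `null` return for an empty input mirrors the point made above: with no deployed resources there is no denominator, so a ratio-based score cannot exist at design time.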
The missing stage: design time Here's the lifecycle, with each tool's domain shaded: Design / Diagram — Diagram Builder validation runs here. Operate / Observe — Azure Advisor runs here continuously. Periodic Review — WAR runs here, typically quarterly or at major milestones. These three stages are sequential and complementary. Our app does not replace Advisor or WAR — it adds a feedback loop earlier in the lifecycle, where corrections are cheapest. How design-time validation works in the Azure Architecture Diagram Builder The validator is a two-phase hybrid pipeline: deterministic local rules first, then LLM refinement. The full source lives in three files: src/services/architectureValidator.ts — orchestrator and prompt src/services/wafPatternDetector.ts — topology + service rule engine src/data/wafRules.ts — the rule knowledge base Phase 1 — Deterministic rule pre-scan (~1 ms, no LLM) When you click Validate Architecture, the validator runs a fully client-side rule engine against the diagram's services, connections, and groups. 
There are two kinds of rules:

Architecture-pattern rules
These fire when a topology anti-pattern is detected:

Pattern | Detection trigger
single-region | No global LB (Traffic Manager / Front Door) with ≥3 services
single-database | Exactly one database service, no replication signal
no-cache | Compute + database present, no Redis/CDN
no-monitoring | No Azure Monitor / App Insights / Log Analytics
no-identity | No Microsoft Entra ID
no-waf | Public web tier without WAF / Front Door / App Gateway
direct-db-access | An edge from a frontend service directly into a database
no-key-vault | 4+ services and no Key Vault
no-backup | Database present, no Azure Backup / Recovery Services
no-api-gateway | 2+ compute services and no APIM / App Gateway / Front Door

Service-specific rules
Every service in the generated Azure architecture diagram is matched against SERVICE_SPECIFIC_RULES by normalized type: App Service, Functions, AKS, Cosmos DB, SQL Database, Storage, Key Vault, and 22 more.

The knowledge base at a glance

Metric | Count
Total rules | 73
Architecture-pattern rules | 10
Service-specific rules | 63
Distinct Azure services covered | 29
Rules tagged Reliability | 18
Rules tagged Security | 34
Rules tagged Cost Optimization | 5
Rules tagged Operational Excellence | 7
Rules tagged Performance Efficiency | 9

The preliminary score
Each finding has a severity, and severity drives a fixed point deduction from a starting score of 100:

Severity | Deduction
critical | −12
high | −7
medium | −3
low | −1

The result is floored at 10 (so even a deliberately bad architecture scores at least 10) and capped at 95 (no findings does not mean perfect; there is always something the model might still catch). This is the deterministic baseline before the LLM ever sees the architecture, and it's what makes the pipeline reproducible.
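The pre-scan and the preliminary score can be sketched as follows. This is a minimal illustration, not the code in `wafPatternDetector.ts`: the type and function names are assumptions, only two of the ten pattern rules are shown, matching is reduced to substring checks, and the severities attached to the two patterns are taken from the worked example later in the post.

```typescript
// Sketch only: names and matching logic are assumptions, not the app's source.
type Severity = "critical" | "high" | "medium" | "low";

interface Finding {
  pattern: string;
  severity: Severity;
}

// Deductions, floor, and ceiling as stated in the text.
const DEDUCTION: Record<Severity, number> = { critical: 12, high: 7, medium: 3, low: 1 };
const FLOOR = 10;
const CEILING = 95;

// Start at 100, subtract a fixed amount per finding, then clamp to [10, 95].
function preliminaryScore(findings: Finding[]): number {
  const raw = findings.reduce((score, f) => score - DEDUCTION[f.severity], 100);
  return Math.min(CEILING, Math.max(FLOOR, raw));
}

// Two of the pattern rules, simplified to substring checks on service type names.
function detectPatterns(serviceTypes: string[]): Finding[] {
  const has = (needle: string): boolean =>
    serviceTypes.some((t) => t.toLowerCase().includes(needle));
  const findings: Finding[] = [];
  if (!has("entra")) {
    findings.push({ pattern: "no-identity", severity: "critical" });
  }
  if (serviceTypes.length >= 4 && !has("key vault")) {
    findings.push({ pattern: "no-key-vault", severity: "high" });
  }
  return findings;
}

// Hypothetical diagram: five services, no Entra ID, no Key Vault.
const services = ["Front Door", "App Service", "App Service", "SQL Database", "Application Insights"];
const detected = detectPatterns(services);
const score = preliminaryScore(detected); // 100 - 12 - 7
```

Because the deductions are fixed and the clamp is applied last, the same diagram always produces the same preliminary score, which is the reproducibility property the text relies on.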
Phase 2 — LLM contextual refinement The pre-scan output, the topology, and the optional natural-language description are folded into a focused prompt sent to one of seven Azure OpenAI models (GPT-5.1 through 5.4, GPT-5.x Codex variants, DeepSeek V3.2 Speciale, Grok 4.1 Fast). The system prompt gives the model explicit scoring guardrails: Score based on what IS present, not what COULD be added. A well-connected architecture with appropriate services should score 60–80. Score below 50 only for critical gaps (no auth, no monitoring, single points of failure). Findings are improvement suggestions, not reasons to penalize the score severely. The model returns strict JSON: { "overallScore": 0-100, "summary": "2–3 sentence assessment", "pillars": [ { "pillar": "Reliability | Security | Cost Optimization | Operational Excellence | Performance Efficiency", "score": 0-100, "findings": [ { "severity": "critical | high | medium | low", "category": "...", "issue": "...", "recommendation": "...", "resources": ["service-name-1", "service-name-2"], "source": "rule-based | ai-analysis" } ] } ], "quickWins": [ /* same shape as findings */ ] } Two things to call out: Every finding is tagged rule-based or ai-analysis . That tag is the credibility lever. You can always see what the deterministic engine produced versus what the model contributed on top. If you don't trust the AI layer, you can ignore it entirely — the rule layer still stands. The LLM is given pattern hints, not the entire rule catalog. The prompt stays small and focused, which is roughly 3–5× faster and cheaper than asking the LLM to do everything from scratch. 
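The JSON contract above maps naturally onto a few TypeScript types, and the `source` tag makes it straightforward to separate the deterministic findings from the model's additions. The sketch below is an assumption about how a consumer might do that: the type names and `splitBySource` are hypothetical, and the sample payload is made up for illustration.

```typescript
// Types mirror the JSON shape quoted above; all names here are assumptions.
type Severity = "critical" | "high" | "medium" | "low";
type Source = "rule-based" | "ai-analysis";

interface Finding {
  severity: Severity;
  category: string;
  issue: string;
  recommendation: string;
  resources: string[];
  source: Source;
}

interface PillarResult {
  pillar: string;
  score: number;
  findings: Finding[];
}

interface ValidationResult {
  overallScore: number;
  summary: string;
  pillars: PillarResult[];
  quickWins: Finding[];
}

// Group findings by provenance so the rule layer can be read on its own.
function splitBySource(result: ValidationResult): Record<Source, Finding[]> {
  const out: Record<Source, Finding[]> = { "rule-based": [], "ai-analysis": [] };
  for (const pillar of result.pillars) {
    for (const finding of pillar.findings) {
      out[finding.source].push(finding);
    }
  }
  return out;
}

// Made-up sample payload in the documented shape.
const raw = `{
  "overallScore": 71,
  "summary": "Sample response for illustration.",
  "pillars": [
    { "pillar": "Security", "score": 52, "findings": [
      { "severity": "critical", "category": "Identity", "issue": "No Microsoft Entra ID",
        "recommendation": "Add Entra ID for authentication",
        "resources": ["app-service"], "source": "rule-based" }
    ]},
    { "pillar": "Operational Excellence", "score": 70, "findings": [
      { "severity": "medium", "category": "Deployment", "issue": "Deployment slots not used",
        "recommendation": "Use slots for safe deploys",
        "resources": ["app-service"], "source": "ai-analysis" }
    ]}
  ],
  "quickWins": []
}`;
const parsed = JSON.parse(raw) as ValidationResult;
const bySource = splitBySource(parsed);
```

A reader who does not trust the AI layer would look only at `bySource["rule-based"]`; the cast on `JSON.parse` is unchecked, so a production consumer would likely validate the payload first.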
What the user sees On every run the modal reports: Overall WAF score (0–100) Per-pillar score × 5 (0–100 each) Severity breakdown — counts of critical / high / medium / low across all findings Quick wins — high-impact, low-effort items the model surfaces separately Hybrid metadata — local findings count, patterns detected, KB rules used, preliminary score, local elapsed ms AI metrics — model used, reasoning effort, prompt/completion/total tokens, elapsed time App Insights telemetry — an Architecture_Validated event with model, overall score, finding count, elapsed time Worked example Take this prompt, which I've used in demos with partners: "A multi-region web application: Azure Front Door in front of two App Service instances in West US 2 and East US 2, both reading from an Azure SQL Database with geo-replication, with Application Insights for telemetry. No Entra ID, no Key Vault." After generation, Validate Architecture runs: Phase 1 — pre-scan (deterministic), ~1 ms Patterns detected: no-identity , no-key-vault Findings produced: 8 (1 critical, 1 high, 3 medium, 3 low) Preliminary score: 100 − 12 − 7 − (3×3) − (1×3) = 69 Phase 2 — LLM refinement, ~6–9 s depending on model The model accepts the two pattern hints, validates them in context, and adds three more findings of its own: Finding Source Pillar Severity No Microsoft Entra ID for authentication rule-based Security critical No Key Vault for secret management rule-based Security high App Service slots not used for safe deploys ai-analysis Operational Excellence medium SQL DB geo-replication present but RTO/RPO not documented ai-analysis Reliability medium No CDN for static assets behind Front Door ai-analysis Performance Efficiency low Final scores returned by the model: Pillar Score Reliability 78 Security 52 Cost Optimization 80 Operational Excellence 70 Performance Efficiency 75 Overall 71 The Security score is the lowest because two of the highest-severity findings landed there — exactly what a human 
reviewer would flag first. Multi-model comparison Because the deterministic floor is identical across runs, the Validation Comparison view becomes a fair shootout of what each LLM adds on top of the same baseline. The same diagram is scored by all seven models, and the UI surfaces: Overall score per model Per-pillar score per model Severity-count deltas Number of ai-analysis findings each model contributed Quick wins each model identified This is genuinely useful for two reasons. First, it shows that LLM scores vary — typically by ±5–10 points on the same architecture — which is exactly why we publish the rule-based vs ai-analysis tag. Second, it lets architects pick the model whose review style matches their own. How we align with Microsoft's algorithms Alignment point What it means Same five pillars Identical names and scope to the official WAF Same source material Rules derived from WAF docs and Azure Architecture Center service guides Severity-graded findings Map conceptually to Advisor's high/medium/low impact recommendations Per-pillar + overall scoring Mirrors WAR/Advisor output shape, so the results feel familiar Where we deliberately differ — and why Concern Microsoft Diagram Builder Why we differ Needs deployed resources Advisor: yes No — works on a diagram We're a design-time tool; the architecture doesn't exist yet Needs human Q&A WAR: yes No — derived from the diagram One-click validation inside the design flow Healthy/Applicable ratio Advisor: yes No No resource-health signal exists pre-deployment Subcategory fixed weights Advisor: yes No explicit weights Severity is the de-facto weight (12/7/3/1) Defender Secure Score for Security Advisor: yes No Defender requires deployed resources Cost-weighted scoring Advisor: yes No (separate Cost Estimation feature) Cost is a separate pipeline in our app AI/LLM refinement Neither Yes Catches context-specific issues a static catalog misses, and explains findings in natural language Multi-model comparison Neither 
Yes Lets architects see scoring variance across models Honest limitations I'd rather you hear these from me than discover them in production: LLM scores drift. ±5–10 points across models on the same diagram is normal. Treat the score as directional, the findings as actionable. The rule-based tag is your anchor. No live telemetry. We can't know if your App Service is actually using availability zones — only that you have App Service in the diagram. Advisor will tell you the truth post-deployment. Generic ruleset. No specialized workload branches yet (AI/ML, IoT, SAP, SaaS). WAR has those. No milestone tracking. Each validation run is independent. Compare runs manually using the Validation Comparison view. Rule coverage is finite. 29 services and 73 rules is a strong start but not exhaustive — the LLM layer exists in part to compensate for that gap. How to use all three together A lifecycle that actually works: Design — Use the Diagram Builder to sketch the architecture and validate at design time. Iterate until the per-pillar scores look reasonable and the critical/high findings are addressed. Deploy — Generate Bicep from the diagram, deploy, and let Azure Advisor start scoring real resources. Operate — Use Azure Advisor continuously. Use Defender Secure Score for security posture. Periodic review — Run a Core WAR every quarter or at major milestones to capture the things only humans know (business context, tradeoffs, planned debt). None of these three replace the others. They cover different stages of the same loop. What's next A few things on the roadmap I'd love feedback on: Milestone tracking so design-time scores can be compared over time the way WAR milestones work. Workload-specific rulesets mirroring WAR's branches — starting with AI/ML. Direct Advisor handoff — once a diagram is deployed, surface the corresponding Advisor recommendations in the same UI to close the loop. 
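The score drift noted under "Honest limitations" is easy to quantify once several models have scored the same diagram. The sketch below computes the spread of overall scores; the `ModelRun` type, the function name, and the sample values are hypothetical and are not output from the app.

```typescript
// Sketch: sample data and names are hypothetical.
interface ModelRun {
  model: string;
  overallScore: number;
}

// Minimum, maximum, and spread of overall scores across model runs.
function scoreSpread(runs: ModelRun[]): { min: number; max: number; spread: number } {
  if (runs.length === 0) throw new Error("at least one run is required");
  const scores = runs.map((r) => r.overallScore);
  const min = Math.min(...scores);
  const max = Math.max(...scores);
  return { min, max, spread: max - min };
}

const runs: ModelRun[] = [
  { model: "model-a", overallScore: 71 },
  { model: "model-b", overallScore: 66 },
  { model: "model-c", overallScore: 75 },
];
const spread = scoreSpread(runs);
```

A spread that stays within the range the post calls normal supports treating the score as directional; a much larger spread would be a reason to lean on the rule-based findings instead.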
Try it, fork it, tell me where it's wrong
Live app: https://aka.ms/diagram-builder
Source: github.com/Arturo-Quiroga-MSFT/azure-architecture-diagram-builder
Useful references: Azure Well-Architected Framework pillars; Azure Well-Architected Review tool; Azure Advisor score (calculation); Use Azure WAF assessments (Advisor); Complete an Azure Well-Architected Review assessment
If you're a partner or customer architect who's already living in Advisor and WAR, I'd genuinely value your reaction: does the design-time stage feel like a real gap to you, or are you already covering it some other way? Open an issue on the repo or reply on LinkedIn.
Posted on the Azure Architecture Blog · Comments and issues welcome on the repo.

Governing Agent Sprawl: A Multi‑Region AI Agent Landing Zone on Azure (Reference Architecture)
It doesn’t take long for AI agents to get out of hand. In most enterprises, the first few agents are celebrated. A chatbot here. A document summarizer there. Then another team ships an agent that calls APIs. Someone else connects one to internal data. Within months, IT is staring at dozens—or hundreds—of autonomous systems running across subscriptions, regions, and tools. At that point, the questions stop being about model quality and start being uncomfortable operational ones: Who owns this agent? What data can it access? What happens if it misbehaves? Why did it just consume half our monthly token budget in a day? Developers can build an AI agent in minutes—the difficult part is understanding what agents are doing, how they perform, and whether they comply with organizational policy. Signals scatter across tools, context is lost, and governance becomes reactive. This reference architecture exists to solve that problem. It describes a multi‑region AI agent landing zone on Azure that treats agents as first‑class, governable workloads—provisioned automatically, constrained by policy, and observable from day one. The architectural principle: separate control from execution The design starts with a simple but non‑negotiable rule: Control plane concerns must be separated from runtime concerns. Azure landing zones already follow this model. Management groups, Azure Policy, and RBAC are global constructs. Workloads run in regions. This architecture applies the same discipline to AI agents. The runtime plane is where agents execute, models infer, and data flows—often in multiple Azure regions. The control plane is where identity, policy, safety, evaluation, and oversight live—independent of region. This separation is what allows teams to scale agents without losing control. Layer 1: Azure AI Gateway — governing every request The first control layer sits directly in the request path. 
The AI gateway in Azure API Management provides a policy‑enforcement and observability layer in front of AI models, agents, and tools. It is not a separate service—it extends Azure API Management. Everything flows through it: Microsoft Foundry model deployments Azure AI Model Inference API endpoints OpenAI‑compatible third‑party models Self‑hosted models MCP servers and A2A agent APIs (preview) What the gateway actually enforces This layer is intentionally narrow and operational: Token quotas and rate limits The llm-token-limit policy (GA) enforces tokens‑per‑minute or quota ceilings per consumer before requests reach the backend. This prevents one application—or one agent—from exhausting shared capacity. Content safety at ingress The llm-content-safety policy (GA) integrates Azure AI Content Safety to moderate prompts automatically. Unsafe requests never reach the model. Traffic routing and resiliency Azure API Management supports multi‑region gateway deployment (Premium tier). If a region fails, traffic routes to the next closest gateway automatically. Token usage, prompts, and completions are logged to Azure Monitor and Application Insights using built‑in policies such as llm-emit-token-metric. The gateway does not understand agent intent or business context. That is by design. It governs traffic, not behavior. Layer 2: Azure AI Foundry Control Plane — governing behavior at scale The second layer governs what agents do, not just how requests flow. Azure AI Foundry Control Plane provides a unified management surface for AI agents, models, and tools across projects and subscriptions. It is designed specifically for agentic systems. Foundry Control Plane is currently in public preview. What Foundry Control Plane adds Fleet‑wide inventory Every agent, model, and tool appears in a single, searchable view across projects. 
Continuous evaluation on production traffic Foundry runs evaluations that measure task adherence, groundedness, tool‑call accuracy, sensitive data exposure, and other agent‑specific risk dimensions. Centralized guardrails Policy is enforced across inputs, outputs, and tool interactions—not just prompts. Bulk remediation can be applied across the fleet. Security integration Foundry integrates with: Microsoft Entra for agent identity (Entra Agent ID) Microsoft Defender for threat signals Microsoft Purview for data protection and compliance visibility Foundry Control Plane also requires an AI Gateway to be configured for advanced governance scenarios—reinforcing the layered approach. Layer 3: Microsoft Agent 365 — enterprise oversight, not just Azure oversight The third layer exists because Azure governance alone is not enough. Agents don’t just call APIs. They act on behalf of users. They access enterprise data. They operate inside Microsoft 365 workflows. Microsoft Agent 365 is the tenant‑level control plane for AI agents. It brings agents under the same administrative model used for users and applications. Status: Frontier Preview General availability: May 1, 2026 Why this layer matters Agent 365 introduces controls that Azure alone cannot provide: Agent registry A single inventory of all agents in the tenant—including sanctioned and shadow agents. Unsanctioned agents can be quarantined. Identity‑first access control Every agent is issued an Entra agent ID. Conditional Access policies apply to agents the same way they do to users. Human‑in‑the‑loop oversight Agents surface in Microsoft 365 admin workflows, not just Azure portals. Security and compliance Defender and Purview extend threat detection and data protection policies to agent activity. Agent 365 does not replace Foundry Control Plane. It complements it—connecting agent operations to enterprise identity, compliance, and productivity systems. 
How the pieces work together Individually, these services are powerful. The architecture works because they are deliberately layered. External approval → automated provisioning When a use case is approved in an external governance system, it triggers an Azure DevOps pipeline using the REST API. That pipeline: Provisions subscriptions and resource groups Deploys Foundry projects Configures Azure API Management with AI Gateway policies Enables monitoring and logging Governance is applied before the first request is made. One policy model, many regions Azure landing zones are region‑agnostic at the governance layer. This architecture follows that guidance. Policies and RBAC apply globally AI Gateway enforces limits locally in each region Runtime services scale region by region Expanding to a new region does not introduce a new governance model—only new capacity. A single operational view Signals flow upward: AI Gateway emits traffic and usage metrics Foundry Control Plane correlates evaluations, guardrail enforcement, and security alerts Agent 365 aggregates tenant‑level identity, compliance, and threat signals Operations teams no longer hunt across dashboards. They work from one prioritized view, with context intact. What this architecture deliberately does not promise This is a reference architecture, not a silver bullet. It does not eliminate the need for: Clear agent ownership Business‑level approval processes Ongoing evaluation of agent usefulness What it does provide is a foundation—one that lets organizations scale agentic AI without accepting chaos as the cost of innovation. Closing thoughts Agent sprawl is not a tooling failure. It’s an architectural one. By separating control from execution, layering governance where it belongs, and aligning AI operations with existing Azure and Microsoft 365 control planes, this architecture gives enterprises a way to move fast without losing sight of what their agents are doing. 
That’s the difference between experimentation and production.
Co-Contributor: Jorge Pena Alarcon, Sr. Cloud & AI Specialist
References (official Microsoft sources): Azure AI Gateway in Azure API Management; Configure AI Gateway for Foundry; Foundry Control Plane overview; Microsoft Agent 365 announcement; Agent 365 GA announcement; Azure landing zones and regions; Azure DevOps pipeline REST API

How to Secure Azure Databricks without Public Exposure using WAF + Private Endpoints
This blog outlines a Zero Trust–aligned architecture for securing Azure Databricks using Application Gateway (WAF) and Private Endpoints within a Hub-Spoke network model. The design enables a true Zero Trust model, ensuring:
No direct exposure of Databricks
Full traffic inspection
Compliance-ready secure access for both internal and external users

Architecture to Resilience: A Decision Guide
Start with the framework, accelerate with the tool
Watch the video walkthrough
The Application Resilience Framework originated from a practical gap we saw in resilience reviews: teams had architecture diagrams, monitoring data, incident history, and runbooks, but no consistent way to connect them into a measurable resilience model. The framework is intended to close that gap by turning architecture context into a structured lifecycle for risk identification, mitigation validation, health modeling, and governance. It aligns closely with the Reliability pillar of the Azure Well-Architected Framework, especially the guidance around identifying critical flows, performing Failure Mode Analysis, defining reliability targets, and building health models.
The Application Resilience Framework Tool helps teams apply this framework faster by starting with artifacts they already have, such as data flow diagrams or sequence diagrams in Mermaid or image format. From those artifacts, the tool creates the first version of a resilience model by extracting workflows, application components, platform components, dependencies, and initial failure modes. It then guides the team through one import step followed by four phases, which together cover the decisions needed to make resilience measurable:
Import Artifacts -> Phase 1: Failure Mode Analysis -> Phase 2: Mitigation and Validation -> Phase 3: Health Model Mapping -> Phase 4: Operations and Governance
It is not a replacement for WAF guidance or Resilience Hub style assessments. It is a practical way to operationalize those concepts at the workload and workflow level, producing prioritized risks, mitigation plans, validation paths, health signals, dashboards, reports, and governance ownership.
How to use this guide
This guide follows the same flow as the tool.
For each step, it covers: The decision: What needs to be decided? The options: What paths are available? The guidance: When each option fits Use this with the video walkthrough. The video shows the tool in action. This guide explains the choices behind each step. Question 1: What artifact should you import first? The import step creates the starting point for the model. Regardless of the input path, the output is the same: workflows that move into Phase 1: Failure Mode Analysis. Options Import option Best for What happens Data flow diagram System, module, data movement, and dependency views If imported as an image, the tool breaks it into sequence-style flows. Selected flows become workflows. Sequence diagram Transaction flow and service interaction views Converted directly into workflows. Mermaid input Diagrams maintained as code in Mermaid format Converted directly into workflows. Image input JPG or PNG diagrams Azure Foundry Vision models interpret the image and convert it into workflows. Manual entry Missing or incomplete diagrams User creates or corrects workflows manually. When to pick which Use data flow for system and dependency views. Use sequence diagrams for transaction or interaction views. Regardless of import path, the output is the same: workflows, components, dependencies, and initial failure modes ready for Phase 1. Question 2: Which workflows should be analyzed first? Phase 1 is Failure Mode Analysis. This is where the tool identifies what can fail and how important each failure is. Options Critical user flows: Login, checkout, payment, onboarding, request processing. High-risk platform flows: Database writes, queue processing, storage access, identity, messaging, external APIs. Known issue areas: Workflows with recent incidents, recurring alerts, or customer impact. When to pick which Start where failure creates the highest customer or business impact. The goal is not to model everything at once. The goal is to model the right thing first. 
Deliverables Failure Mode Analysis catalog RPV risk scores Criticality classification Question 3: How should failure modes be prioritized? After workflows and components are imported, the tool helps score each failure mode using Risk Priority Value or RPV, which uses the four factors of Impact, Likelihood, Detectability and Outage severity. Options Use generated failure modes and scores: Best for a fast first pass. Tune the RPV scores with engineering input: Best when workload context matters. Add custom failure modes: Best when known risks come from incidents, reviews, or customer experience. When to pick which Use the generated model to accelerate the first pass, then adjust it with real system knowledge. The goal is not to create the longest list of risks. The goal is to identify the risks that deserve attention first. Deliverables Failure Mode Catalog RPV Risk Scores Prioritized criticality list Question 4: Are mitigations defined or validated? Phase 2 is Mitigation and Validation. This is where each failure mode gets a response plan. Options Detection only: The team can detect the failure, but the response is not defined. Defined mitigation: The response is documented, such as retry, fallback, failover, scaling, restore, or rebalance. Validated mitigation: The response has been tested through a controlled validation or chaos test. When to pick which For low-risk items, documented mitigation may be enough. For critical and high-risk items, validation is the key. A mitigation that has not been tested is still an assumption. Deliverables Mitigation playbooks Chaos test plans Support playbooks Question 5: Which risks need health signals? Phase 3 is Health Model Mapping. This is where the tool connects risks to observability. A failure mode should not just sit in a document. It should map to a signal that can show whether the system is healthy, degraded, or unhealthy. Options Map all failure modes: Best for small systems or highly critical workloads. 
Map critical and high-risk failure modes first: Best for large systems. Track unmapped risks as gaps: Best when observability coverage is still improving. When to pick which Start with the highest RPV items. Every critical failure mode should have at least one signal, such as a metric, log, alert, availability check, or dependency signal. Deliverables Health model Signal definitions Coverage report Bicep templates Question 6: Should the health model be exported or deployed? Once the health model is built, the next decision is how to use it. Options Export for review: Best when the team needs to validate the model first. Generate monitoring templates: Best when the team wants repeatable implementation. Deploy to Azure: Best when the model is ready to become part of operations. Use outputs in downstream tools: Best when support, SRE, or incident response workflows need structured playbooks. When to pick which Export first if the model is still being reviewed. Deploy when component relationships, signals, and coverage are accurate enough for operational use. Question 7: How will governance keep the model current? Phase 4 is Operations and Governance. This is where the resilience model becomes an ongoing practice. Options One-time assessment: Useful for quick discovery but limited long term. Recurring review: Best for production workloads that change regularly. Closed-loop governance: Best when incidents, failed validations, and monitoring gaps feed back into the model. When to pick which For production systems, use a recurring governance cadence. Assign owners, track gaps, review dashboards, and update the model as the system changes. Deliverables Governance model Dashboards Reports and exports Runbooks Putting it together: three adoption patterns Once governance is defined, the tool can be used in different ways depending on the team’s maturity and objective. 
The three common adoption patterns are:

Pattern A: Quick resilience review
Import one critical workflow
Generate failure modes
Review RPV scores
Identify top risks
Export findings
Best for fast architecture reviews or early customer conversations.

Pattern B: Full workload assessment
Import multiple workflows
Build a full Failure Mode Catalog
Define mitigations and recovery steps
Create chaos test plans
Map risks to signals
Produce coverage reports
Best for structured resilience assessments.

Pattern C: Operational health model
Build and tune the health model
Export or deploy monitoring artifacts
Track risk and signal coverage
Review mitigation effectiveness
Assign governance ownership
Feed findings back into the model
Best when the goal is continuous operational improvement.

A short checklist before using the tool
Which workflow should we import first?
Do we have a data flow diagram, sequence diagram, or Mermaid file?
What components and dependencies should be included?
Which failure modes matter most?
How should RPV be adjusted for this workload?
Do critical failure modes have mitigations?
Have those mitigations been validated?
Are failure modes mapped to health signals?
What coverage gaps remain?
Should the health model be exported or deployed?
Who owns ongoing review?
How often should the model be updated?

Closing thought
The Application Resilience Framework Tool provides a practical way to move from architecture artifacts to measurable, continuously improving resilience. It starts with data flow or sequence diagrams, builds a structured view of the system, and guides teams through the decisions that matter: what can fail, how severe it is, how it is mitigated, how it is detected, and how it is governed.
Tool repo: Application Resilience Framework Tool

Azure Course Blueprints
Each Blueprint serves as a 1:1 visual representation of the official Microsoft instructor‑led course (ILT), ensuring full alignment with the learning path. This helps learners: see exactly how topics fit into the broader Azure landscape, map concepts interactively as they progress, and understand the “why” behind each module, not just the “what.” Formats Available: PDF · Visio · Excel · Video Every icon is clickable and links directly to the related Learn module. Layers and Cross‑Course Comparisons For expert‑level certifications like SC‑100 and AZ‑305, the Visio Template+ includes additional layers for each associate-level course. This allows trainers and students to compare certification paths at a glance: 🔐 Security Path SC‑100 side‑by‑side with SC‑200, SC‑300, AZ‑500 🏗️ Infrastructure & Dev Path AZ‑305 alongside AZ‑104, AZ‑204, AZ‑700, AZ‑140 This helps learners clearly identify: prerequisites, skill gaps, overlapping modules, progression paths toward expert roles. Because associate certifications (e.g., SC‑300 → SC‑100 or AZ‑104 → AZ‑305) are often prerequisites or recommended foundations, this comparison layer makes it easy to understand what additional knowledge is required as learners advance. Azure Course Blueprints + Demo Deploy Demos are essential for achieving end‑to‑end understanding of Azure. To reduce preparation overhead, we collaborated with Peter De Tender to align each Blueprint with the official Trainer Demo Deploy scenarios. With a single click, trainers can deploy the full environment and guide learners through practical, aligned demonstrations. https://aka.ms/DemoDeployPDF Benefits for Students 🎯 Defined Goals Learners clearly see the skills and services they are expected to master. 🔍 Focused Learning By spotlighting what truly matters, the Blueprint keeps learners oriented toward core learning objectives. 📈 Progress Tracking Students can easily identify what they’ve already mastered and where more study is needed. 
📊 Slide Deck Topic Lists (Excel): A downloadable .xlsx file provides a topic list for every module, links to Microsoft Learn, and prerequisite dependencies. This file helps students build their own study plan while keeping all links organized.

Download links

Associate Level (PDF - Demo, Visio, Contents)
AZ-104 Azure Administrator Associate (R: 12/14/2023, U: 12/17/2025): Blueprint · Demo · Video · Visio · Excel
AZ-204 Azure Developer Associate (R: 11/05/2024, U: 12/17/2025): Blueprint · Demo · Visio · Excel
AZ-500 Azure Security Engineer Associate (R: 01/09/2024, U: 10/10/2024): Blueprint · Demo · Visio+ · Excel
AZ-700 Azure Network Engineer Associate (R: 01/25/2024, U: 12/17/2025): Blueprint · Demo · Visio · Excel
SC-200 Security Operations Analyst Associate (R: 04/03/2025, U: 04/09/2025): Blueprint · Demo · Visio · Excel
SC-300 Identity and Access Administrator Associate (R: 10/10/2024): Blueprint · Demo · Excel

Specialty (PDF, Visio)
AZ-140 Azure Virtual Desktop Specialty (R: 01/03/2024, U: 12/17/2025): Blueprint · Demo · Visio · Excel

Expert level (PDF, Visio)
AZ-305 Designing Microsoft Azure Infrastructure Solutions (R: 05/07/2024, U: 12/17/2025): Blueprint · Demo · Visio+ · AZ-104 · AZ-204 · AZ-700 · AZ-140 · Excel
SC-100 Microsoft Cybersecurity Architect (R: 10/10/2024, U: 04/09/2025): Blueprint · Demo · Visio+ · AZ-500 · SC-300 · SC-200 · Excel

Skill based Credentialing (PDF)
AZ-1002 Configure secure access to your workloads using Azure virtual networking (R: 05/27/2024): Blueprint · Visio · Excel
AZ-1003 Secure storage for Azure Files and Azure Blob Storage (R: 02/07/2024, U: 02/05/2024): Blueprint · Excel

Subscribe to get notified of new releases and updates.

Author: Ilan Nyska, Microsoft Technical Trainer
My email: ilan.nyska@microsoft.com
LinkedIn: https://www.linkedin.com/in/ilan-nyska/

I’ve received so many kind messages, thank-you notes, and reshares, and I’m truly grateful.
But here’s the reality: 💬 The only thing I can use internally to justify continuing this project is your engagement through this survey: https://lnkd.in/gnZ8v4i8
___
Benefits for Trainers: Trainers can follow this plan to design a tailored diagram for their course, filled with notes. They can construct this comprehensive diagram during class on a whiteboard and continuously add to it in each session. This evolving visual aid can be shared with students to enhance their grasp of the subject matter.

Explore Azure Course Blueprints! | Microsoft Community Hub
Visio stencils
Azure icons - Azure Architecture Center | Microsoft Learn
___
Curious how grounding Copilot in Azure Course Blueprints turns your study journey into a smarter, more visual experience?
🧭 Clickable guides that transform modules into intuitive roadmaps
🌐 Dynamic visual maps revealing how Azure services connect
⚖️ Side-by-side comparisons that clarify roles, services, and security models
Whether you're a trainer, a student, or just certification-curious, Copilot becomes your shortcut to clarity, confidence, and mastery.

Navigating Azure Certifications with Copilot and Azure Course Blueprints | Microsoft Community Hub

Centralizing Enterprise API Access for Agent-Based Architectures
Problem Statement

When building AI agents or automation solutions, calling enterprise APIs directly often means configuring individual HTTP actions within each agent for every API. While this works for simple scenarios, it quickly becomes repetitive and difficult to manage as complexity grows. The challenge becomes more pronounced when a single business domain exposes multiple APIs, or when the same APIs are consumed by multiple agents. This leads to duplicated configurations, higher maintenance effort, inconsistent behavior, and increased governance and security risks.

A more scalable approach is to centralize and reuse API access. By grouping APIs by business domain using an API management layer, shaping those APIs through a Model Context Protocol (MCP) server, and exposing the MCP server as a standardized tool or connector, agents can consume business capabilities in a consistent, reusable, and governable manner. This pattern not only reduces duplication and configuration overhead but also enables stronger versioning, security controls, observability, and domain-driven ownership, making agent-based systems easier to scale and operate in enterprise environments.

Designing Agent-Ready APIs with Azure API Management, an MCP Server, and Copilot Studio

As enterprises increasingly adopt AI-powered assistants and Copilots, API design must evolve to meet the needs of intelligent agents. Traditional APIs, often designed for user interfaces or backend integrations, can expose excessive data, lack intent-level abstraction, and increase security risk when consumed directly by AI systems.

This document outlines a practical, enterprise-ready approach to organize APIs in Azure API Management (APIM), introduce a Model Context Protocol (MCP) server to shape and control context, and integrate the solution with Microsoft Copilot Studio. The goal is to make APIs truly agent-ready: secure, scalable, reusable, and easy to govern.
Architecture at a glance
- Back-end services expose domain APIs.
- Azure API Management (APIM) groups and governs those APIs (products, policies, authentication, throttling, versions).
- An MCP server calls APIM, orchestrates/filters responses, and returns concise, model-friendly outputs.
- Copilot Studio connects to the MCP server and invokes a small set of predictable operations to satisfy user intents.

Why Traditional API Designs Fall Short for AI Agents

Enterprise APIs have historically been built around CRUD operations and service-to-service integration patterns. While this works well for deterministic applications, AI agents work best with intent-driven operations and context-aware responses. When agents consume traditional APIs directly, common issues include overly verbose payloads, multiple calls to satisfy a single user intent, and insufficient guardrails for read vs. write operations. The result can be unpredictable agent behavior that is difficult to test, validate, and govern.

Structuring APIs Effectively in Azure API Management

Azure API Management (APIM) is the control plane between enterprise systems and AI agents. A well-structured APIM instance improves security, discoverability, and governance through products, policies, subscriptions, and analytics.

Key design principles for agent consumption
- Organize APIs by business capability (for example, Customer, Orders, Billing) rather than technical layers.
- Expose agent-facing APIs via dedicated APIM products to enable controlled access, throttling, versioning, and independent lifecycle management.
- Prefer read-only operations where possible; scope write operations narrowly and gate them with explicit checks, approvals, and least-privilege identities.
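To make the mediation step in the architecture concrete, here is a minimal Python sketch of one intent-level, read-only operation of the kind an MCP server might expose. It is an illustration under stated assumptions, not the APIM or MCP SDK: the operation name `get_order_status`, the field allow-list, and the sample payload are all hypothetical, and the `fetch` argument stands in for the call to an APIM-managed API so the sketch runs without network access.

```python
import json
from typing import Callable

# Hypothetical allow-list for an "order status" intent (an assumption,
# not a schema taken from the article).
ORDER_STATUS_FIELDS = ("orderId", "status", "estimatedDelivery")


def shape_order_status(raw: dict) -> dict:
    """Reduce a verbose back-end payload to a small, stable schema."""
    return {field: raw.get(field) for field in ORDER_STATUS_FIELDS}


def get_order_status(order_id: str, fetch: Callable[[str], dict]) -> dict:
    """Read-only, intent-level operation.

    `fetch` is a stand-in for the request to the APIM-managed API.
    """
    if not order_id.isdigit():
        # Return a safe, descriptive error instead of a raw failure.
        return {"error": "order_id must be numeric"}
    try:
        raw = fetch(order_id)
    except LookupError:
        return {"error": f"order {order_id} not found"}
    return shape_order_status(raw)


def fake_fetch(order_id: str) -> dict:
    """Invented sample data imitating a verbose back-end response."""
    if order_id != "12345":
        raise LookupError(order_id)
    return {
        "orderId": "12345",
        "status": "Shipped",
        "estimatedDelivery": "2025-01-15",
        "internalNotes": "call customer before delivery",
        "warehouseCode": "W-17",
        "customer": {"id": "C-9", "email": "user@example.com"},
    }


result = get_order_status("12345", fake_fetch)
print(json.dumps(result))
```

The agent only ever sees the three allow-listed fields; internal notes, warehouse codes, and customer details stay behind the server. Keeping the allow-list explicit also gives the operation a stable output shape that is easy to test.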
The Role of the MCP Server in Agent-Based Architectures

APIM provides governance and security, but agents also need an intent-level interface and model-friendly responses. A Model Context Protocol (MCP) server fills this gap by acting as a mediator between Copilot Studio and APIM-exposed APIs.

Instead of exposing many back-end endpoints directly to the agent, the MCP server can:
- orchestrate multiple API calls,
- filter irrelevant fields,
- enforce business rules,
- enrich results with additional business context, and
- emit concise, predictable JSON outputs in LLM-friendly schemas.
This makes agent behavior more reliable and easier to validate.

By introducing this abstraction layer, Copilot interactions become simpler, safer, and more deterministic. The agent interacts with a small number of well-defined MCP operations that encapsulate enterprise logic without exposing internal complexity.

Designing an Effective MCP Server

An MCP server should have a clear and focused responsibility: shaping context for AI models. It should not replace core back-end services; it should adapt enterprise data and capabilities for agent consumption.

The protocol itself does not orchestrate enterprise workflows or apply business logic. It standardizes how agents discover and invoke external tools and APIs by exposing them through a structured protocol interface. Agent-level orchestration, intent resolution, and policy-driven execution are handled by the agent runtime or host framework; the filtering and rule enforcement described above live in the server implementation behind that interface.

It is equally important to understand what does not belong in MCP.
Complex transactional workflows, long-running processes, and UI-specific formatting should remain in backend systems. Keeping MCP lightweight ensures that it stays scalable, testable, and easy to maintain.

Within those limits, the MCP server should:
- Call APIM-managed APIs and orchestrate multi-step retrieval when needed.
- Apply security checks and business rules consistently.
- Filter and minimize payloads (return only fields needed for the intent).
- Normalize and reshape responses into stable, predictable JSON schemas.
- Handle errors and edge cases with safe, descriptive messages.

Step by step guide

1) Create an MCP server in Azure API Management (APIM)
- Open the Azure portal (portal.azure.com).
- Go to your API Management instance.
- In the left navigation, expand APIs.
- Create (or select) an API group for the business domain you want to expose (for example, Orders or Customers).
- Add the relevant APIs/operations to that API group.
- Create or select an APIM product dedicated to agent usage, and ensure the product requires a subscription (subscription key).
- Create an MCP server in APIM and map it to the API (or API group) you want to expose as MCP operations.
- In the MCP server settings, ensure Subscription key required is enabled.
- From the product’s Subscriptions page, copy the subscription key you will use in Copilot Studio.

Note: Using an API Management subscription key to access MCP operations is one supported way to authenticate and consume enterprise APIs. However, this approach is best suited for initial setups, demos, or scenarios where key-based access is explicitly required. For production-grade enterprise solutions, Microsoft recommends using managed identity-based access control.
Managed identities for Azure resources eliminate the need to manage secrets such as subscription keys or client secrets, integrate natively with Microsoft Entra ID, and support fine-grained role-based access control (RBAC). This approach improves security posture while significantly reducing operational and governance overhead for agent and service-to-service integrations. Wherever possible, agents and MCP servers should authenticate using managed identities to ensure secure, scalable, and compliant access to enterprise APIs.

2) Create a Copilot Studio agent and connect to the APIM MCP server using a subscription key

Copilot Studio natively supports Model Context Protocol (MCP) servers as tools. When an agent is connected to an MCP server, the tool metadata (operation names, inputs, and outputs) is automatically discovered and kept in sync, reducing manual configuration and maintenance overhead.
- Sign in to Copilot Studio.
- Create a new agent and add clear instructions describing when to use the MCP tool and how to present results (for example, concise summaries plus key fields).
- Open Tools > Add tool > Model Context Protocol, then choose Create.
- Enter the MCP server details:
  Server endpoint URL: copy this from your MCP server in APIM.
  Authentication: select API Key.
  Header name: use the subscription key header required by your APIM configuration.
- Select Create new connection, paste the APIM subscription key, and save.
- Test the tool in the agent by prompting for a domain-specific task (for example, “Get order status for 12345”). Validate that responses are concise and that errors are handled safely.

Operational best practices and guardrails
- Least privilege by default: create separate APIM products and identities for agent scenarios; avoid broad access to internal APIs.
- Prefer intent-level operations: expose fewer, higher-level MCP operations instead of many low-level endpoints.
- Protect write operations: require explicit parameters, validation, and (when appropriate) approval flows; keep “read” and “write” tools separate.
- Stable schemas: return predictable JSON shapes and limit optional fields to reduce prompt brittleness.
- Observability: log MCP requests/responses (with sensitive fields redacted), monitor APIM analytics, and set alerts for failures and throttling.
- Versioning: version MCP operations and APIM APIs; deprecate safely.
- Security hygiene: treat subscription keys as secrets, rotate regularly, and avoid exposing them in prompts or logs.

Summary

As organizations scale agent-based and Copilot-driven solutions, directly exposing enterprise APIs to AI agents quickly becomes complex and risky. Centralizing API access through Azure API Management, shaping agent-ready context via a Model Context Protocol (MCP) server, and consuming those capabilities through Copilot Studio establishes a clean and governable architecture. This pattern reduces duplication, enforces consistent security controls, and enables intent-driven API consumption without exposing unnecessary backend complexity. By combining domain-aligned API products, lightweight MCP operations, and least-privilege identity-based access, enterprises can confidently scale AI agents while maintaining strong governance, observability, and operational control.

References
- Azure API Management (APIM) – Overview
- Azure API Management – Key Concepts
- Azure MCP Server Documentation (Model Context Protocol)
- Extend your agent with Model Context Protocol
- Managed identities for Azure resources – Overview