Cloud Native Platforms: Evolve
Audience: Engineering leaders, platform architects, senior developers exploring how to operationalise AI in their teams
Reading time: 8 minutes
Series: Cloud Native Platforms. Build, Run, Evolve. This is Part 3 of 3.

Cloud helped us scale infrastructure. AI is starting to do the same thing for the work around the code: the planning, the testing, the release communication, the incident triage, the writing that surrounds writing software.

The conversation about AI in software has narrowed too quickly to "Copilot in the editor". The bigger story is happening across the lifecycle. Planning, design, development, testing, release, and operations are all being augmented at once. The platforms that adopt AI well are not the ones with the most usage. They are the ones with the clearest discipline around how it is used. This post is about that discipline.

AI is changing how we engineer, not how we type

AI is not changing how we write code. It is changing how we engineer software. Code generation is the surface. Underneath it, AI is reshaping the unit of leverage. The question is no longer how fast a developer can type. It is how well a workflow can be expressed as a reusable engineering asset. Six disciplines determine whether AI moves the needle on outcomes or just adds another tool to the stack.

Figure 1. AI across the SDLC. Each phase has clear AI assist points and clear human-owned validations. The boundary is not negotiable. It is the design.

1. From assistance to augmentation

Early AI tools focused on assisting individual developers. Code suggestions. Autocomplete. Quick refactors. The value was real but bounded by the editor. The shift now is into structured workflows that span the lifecycle. The unit of leverage is no longer a single suggestion. It is a sequence of actions executed reliably across phases. ("Agentic" later in this post means a system that makes its own next-step decisions inside guardrails. A workflow follows a fixed sequence; an agent chooses the path.)

- Code generation has become baseline, not differentiator
- Workflow generation is where the largest gains live
- Multi-step assistance with explicit human checkpoints
- Context that travels across tools, not just within one

In practice

The pattern that works: start with the single highest-volume writing task on the team (commit messages, code review comments, release notes, postmortem first drafts) and turn the AI assist for that task into a shared workflow rather than each individual's private trick. The cost is one engineer's afternoon documenting the workflow and the eval set. The return is that every engineer on the team inherits the work, and the task that used to consume an engineer's morning every two weeks becomes a background step in the release process. Workflow generation, not faster typing, is where the gains compound across a team.

Code suggestions help one developer. Reusable workflows help the next ten.

2. AI across the SDLC, with guardrails

AI now has a useful role at every phase of delivery. The role is different at each phase, and the guardrails are different too.

| Phase | What AI helps with | What humans must validate |
| --- | --- | --- |
| Plan | Breaking down requirements, drafting acceptance criteria | Domain context, business priorities, customer impact |
| Build | Code generation, refactoring, scaffolding | Architectural fit, security boundaries, performance |
| Test | Test case generation, edge case discovery | Coverage of business-critical paths, regulatory cases |
| Release | Release notes, changelog summaries, communication drafts | Accuracy, tone, customer-facing claims |
| Operate | Log triage, incident summaries, runbook drafts | Root cause attribution, action item ownership |

The guardrails are not optional decoration. They are the design.
In practice The pattern that works: stage AI assists for release communication (changelog drafting, customer-facing release notes, internal release announcements) and require a human review before anything goes out. The draft arrives consistently, faster than a human could produce, and easier to compare across releases. The reviewer is not eliminated; the reviewer is moved from author to editor, which is where their judgment actually matters. Teams that adopt this pattern stop missing release-note deadlines and stop publishing inconsistent communication across products. 3. From prompts to reusable assets Many teams begin with prompt experimentation. Individuals find techniques that work for their tasks. The result is a patchwork of personal practices that do not survive a team change. The compounding value comes when prompts mature into reusable engineering assets. Figure 2. The maturity model from prompts to agents. The value compounds at the workflow stage and accelerates at the agent stage. The disciplines that make agents safe are the same ones that made workflows reliable. The maturity stages, in order of leverage: Prompts: ad-hoc, individual, hard to share Templates: parameterised prompts versioned with the project Workflows: multi-step sequences with clear inputs, outputs, checkpoints Agents: autonomous task chains operating within explicit guardrails The diagram is a maturity ladder, not a graduation. In practice teams operate at all four stages simultaneously for different tasks. A senior engineer may use a one-off prompt to explore a refactor, run a versioned template for commit messages, hand off to a workflow for release notes, and trigger an agent for routine PR triage, all in the same hour. The point of the ladder is not to leave earlier stages behind. It is to know which stage a given task belongs to and to invest accordingly. 
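To make the Templates stage of the ladder concrete, here is a minimal sketch of a parameterised prompt kept as a versioned artefact in the repository. It is an illustration under stated assumptions, not a prescribed format: the `PromptTemplate` class, its fields, and the commit-message wording are all hypothetical, and it uses only Python's standard-library `string.Template`.

```python
from dataclasses import dataclass
from string import Template


@dataclass(frozen=True)
class PromptTemplate:
    """A versioned, parameterised prompt kept in the repo next to the code.

    Hypothetical structure for illustration; not any specific tool's format.
    """
    name: str
    version: str
    owner: str
    body: Template

    def render(self, **params: str) -> str:
        # substitute() raises KeyError when a required parameter is missing,
        # so an incomplete call fails loudly instead of sending a broken prompt.
        return self.body.substitute(**params)


# Hypothetical template; the wording and parameters are illustrative only.
COMMIT_MESSAGE = PromptTemplate(
    name="commit-message",
    version="1.2.0",
    owner="platform-team",
    body=Template(
        "Write a commit message for the following diff.\n"
        "Follow the $convention convention and keep the subject under "
        "$max_subject_chars characters.\n\n"
        "Diff:\n$diff"
    ),
)

prompt = COMMIT_MESSAGE.render(
    convention="Conventional Commits",
    max_subject_chars="72",
    diff="- retries = 3\n+ retries = 5",
)
print(prompt)
```

Because the template carries a name, a version, and an owner, a change to its wording can go through the same review as any other change in the repository, which is the property that separates a template from a personal prompt.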
In practice The pattern that works: pick the three prompts your team uses every week, codify them as parameterised templates in the same repository as the application code, and treat them as engineering artefacts (reviewed, versioned, owned). New engineers inherit the team's accumulated practice instead of building their own from scratch. Quality becomes consistent because the variance between individuals shrinks. Investment pays back in weeks, not quarters, and the maturity ladder keeps producing returns as the team moves from templates to workflows to agents. 4. Agentic delivery, with guardrails that survive a security review The next stage is agentic. AI executes sequences of tasks within a defined scope. The risk is not that the agent will fail. It is that the system around the agent will not catch the failure, and that the failure modes are different in kind from traditional automation. Agents are non-deterministic, they can be manipulated through their inputs, and their actions can have side effects in systems the team does not own. Five guardrails make agentic delivery safe. The first four are necessary. The fifth is what carries the agent through a security review at a regulated enterprise. Identity and scope: the agent runs as a managed identity (or scoped service principal) with the smallest set of permissions that lets it do its job. Permissions are expressed as allowlists, not denylists. Tools fetched at runtime are subject to the same identity boundary as the agent itself. Input quarantine: anything the agent reads from a user-controlled source (work item bodies, PR descriptions, customer tickets) is treated as untrusted text. The agent does not execute instructions found in fetched content, and tool calls are validated against an output schema before execution. This is the prompt-injection mitigation, and it is the most common gap in agentic systems shipped today. 
Cost and blast-radius caps: every run has a maximum token budget, a maximum number of tool calls, and a maximum spend. Exceeding any cap aborts the run cleanly. Without caps, scoped credentials are not enough to bound the damage. Evaluations and traceability: agents are evaluated against a fixed test set before deployment, and on every prompt or model change. Every action is logged with inputs, outputs, the model and prompt versions used, and the reasoning trace where the model exposes one. Logs are redacted for secrets and personally identifiable information at write time. Reversibility taxonomy: actions are categorised by reversibility, not asserted to be reversible in general. A draft write to a private store is reversible. A post to a customer-facing channel is not reversible (deletion does not unsend). A database update may be reversible by a compensating transaction or not at all. Irreversible actions require human approval at the boundary, before they happen, not after. The agent is allowed to draft and stage. The human is the only one who is allowed to make the move that cannot be undone. In practice The pattern that works: start with one low-risk agent (release-notes drafter, PR triage assistant) running on read-only inputs, write-only-to-drafts permissions, and a hard cost cap per run. Require explicit human approval at the irreversible step. Wire up an evaluation set on day one, and rerun it on every prompt or model change. Treat regressions as failures, not warnings. The first agent the team ships is rarely the most valuable; it is the rehearsal that establishes the controls every later agent inherits. Teams that skip this rehearsal end up with an agent in production that no one feels safe extending. Implementation note An agent without a reversibility taxonomy and a regression eval set is a liability. 
The discipline is the same one that made workflows reliable: scoped identity, idempotency, traceability, and a clear boundary between machine action and human decision. The YAML below is illustrative, not a runtime contract; it is meant to show the shape of the controls a real agent definition would carry, not the syntax of any specific platform.

    # Agent run definition (illustrative; not a specific platform's syntax)
    name: release-notes-drafter
    trigger: pre-release
    identity:
      type: managed-identity
      scope: tenant=<tenant-id> resource=release-tools/<app-id>
    permissions:
      allow:
        - read: work-items in milestone (filter: state=Done)
        - read: pull-requests in milestone (filter: merged)
        - write: drafts/release-notes/${run-id}
      # Production channels are NOT in the allowlist. The agent cannot post.
    limits:
      max_tokens_per_run: 80000
      max_tool_calls_per_run: 20
      max_runtime_seconds: 300
      max_cost_usd: 0.40
      on_exceeded: abort_with_partial_artifact
    input_handling:
      treat_fetched_content_as: untrusted
      # Indirect prompt injection is mitigated by the layered discipline below,
      # not by a single feature flag. Each item is a separate control.
      enforce_instruction_hierarchy: true
      validate_tool_args_against_schema: true
      validate_outputs_against_schema: true
    steps:
      - fetch: completed work items in milestone
      - draft: release notes from items
      - validate: required fields present
      - request-review:
          from: release-manager
          idempotency_key: ${milestone-id}-${draft-hash}
      - on-approval:
          action: post-to-internal-channel
          reversibility: not-reversible
          requires: explicit-human-click  # the agent does NOT click this
    audit:
      log_inputs: true
      log_outputs: true
      redact:
        - secrets
        # Pattern-based: handles structured PII like emails, phones, IDs.
        - pii_patterns: [email, phone, national-id, payment-card, ip-address]
        # Entity-based: required for unstructured PII like names. Pattern alone
        # cannot redact a customer name without an entity-recognition step.
        - pii_entities: ner-based  # names, locations, organisations
      retain: 365_days  # tune to your audit policy, not to the demo
    evaluation:
      test_set: tests/release-notes/eval-v3.jsonl
      on_prompt_change: rerun
      on_model_change: rerun
      fail_threshold: 5_percent_regression

5. Where AI still needs human judgment

AI has clear boundaries. The boundaries are not embarrassing. They are the design.

What must stay human-owned:

- Architectural trade-offs and design decisions
- Security validation and threat modelling
- Correctness for business-critical and regulatory paths
- Domain context that has not been written down
- Accountability for outcomes, not just outputs

The goal is collaboration, not replacement. The teams that get the most value from AI are not the ones with the most automation. They are the ones with the clearest sense of where automation ends and judgment begins.

In practice

The pattern that works: name the human-owned items explicitly in the team's working agreement (architecture, security, regulatory correctness, accountability) and audit every AI workflow against that list. When a workflow asks the AI to make a decision in any of those categories, redesign it so the AI prepares the analysis and a human makes the call. Most teams over-trust AI for one of these areas in their first six months and learn the hard way. Naming the boundary up front prevents the lesson from being paid in production. The clarity is the value; the model behind the workflow is interchangeable.

6. Responsible AI is engineering work

The first five disciplines decide whether AI moves the needle. The sixth decides whether the platform can defend the choices it makes with AI. Responsible AI is the engineering practice of building systems whose AI behaviour is fair, transparent, accountable, and safe by design, not by audit after the fact. Treating it as a compliance checkbox at the end of the project is how teams end up shipping AI workflows that fail security review, embarrass the company, or harm users.
Six controls turn responsible AI from a policy into engineering work. These map directly onto the practices Microsoft and the broader industry have converged on, but the names matter less than the practice they enable. Fairness in inputs and outputs. The training data, eval set, and prompts are reviewed for systematic bias against any group the system serves. The eval set covers under-represented cases by design, not by accident, and regressions on those cases fail the build. Transparency to end users. When a user sees AI-generated content, they are told. When a decision is AI-assisted, the path from input to output is explainable in plain language, not just in a model card buried in documentation. Content safety filters. Inputs and outputs pass through safety classifiers (prompt injection, prohibited content, jailbreak patterns) before reaching the model and before reaching the user. Filtering decisions are logged and reviewable. Accountability ownership. Every AI workflow has a named owner who is accountable for its outcomes, not just its uptime. The owner has the authority to pause or roll back the workflow when harm is detected. Data minimisation and residency. The AI sees only the data it needs to do the task. Personally identifiable information and customer data are scoped, redacted, and kept inside the boundary the customer agreed to. Cross-tenant leakage is treated as a P1 incident, not a feature request. Harm evaluation alongside quality evaluation. The eval set measures harm potential (toxicity, hallucination on factual queries, leakage of confidential context) with the same rigour as it measures correctness. Both must pass for a release to ship. Figure 3. Responsible AI as a set of engineering controls around the AI workflow. The six controls fall into four categories: data discipline (fairness, data minimisation), model discipline (content safety, harm evaluation), deployment discipline (transparency to users), and governance (accountability ownership). 
All six are necessary; none is sufficient on its own. In practice The pattern that works: write the responsible AI plan before the first agent ships, not after the first incident. Pick one workflow that touches user data or generates customer-facing content, and use it as the reference implementation: fairness review on the eval set, content safety filters wrapping the model call, transparency annotation in the UI, redaction of identifying details in logs, harm evals running alongside quality evals on every change, and a named owner with explicit pause authority. The first such workflow takes longer to ship than the unconstrained version. Every workflow after it inherits the controls and ships faster than it would have without them. Teams that defer responsible AI to a future quarter end up retrofitting it under pressure, which is the most expensive way to do it. A scenario that ties it together Picture a platform team several months into using Copilot. Adoption is high. Productivity dashboards show gains. But defect rates are not improving and lead time is flat. Leadership asks the obvious question: is AI actually helping, or just feeling like help? The answer is not to stop using AI. It is to change how AI is measured. Move adoption metrics to the background. Move outcome metrics to the front: defect escape rate, lead time for change, change failure rate, mean time to recovery. In parallel, promote the individual prompts that have proved themselves to shared templates, and the templates to versioned workflows. Retrofit responsible AI controls onto the workflows that shipped first: content safety filters, harm evaluations alongside quality evaluations, transparency annotations on customer-facing output, and a named owner for each workflow. Six months later, the picture is different. Defect rate improves on the parts of the codebase where reusable workflows were introduced. Onboarding for new engineers is visibly faster. Release notes are consistent across teams. 
The shift is from celebrating use to tracking outcomes, and once the team measures what matters, the tooling decisions start making themselves. What teams get wrong The common pattern is measuring AI by usage, not by outcome. Adoption metrics tell you who tried Copilot. They do not tell you whether defects dropped, lead time improved, or release notes got better. The fix is not less AI. It is better measurement. The four metrics named in the scenario above (defect escape rate, lead time for change, change failure rate, mean time to recovery) come from the DORA research on software delivery performance and have become a useful default. Two warnings travel with them. First, attribution is hard: an AI workflow rolled out alongside a test refactor and a CI pipeline change cannot claim credit cleanly. Second, baselines matter more than headlines: a single quarter's improvement is not a trend, and a single team's gain is not the platform's gain. Outcome measurement done well needs a baseline window, an attribution discipline, and a kill criterion for workflows that are not paying back. Done poorly, it is just adoption metrics with better names. There is also the question of cost. AI usage carries a per-run token bill, an evaluation bill on every change, and (for agents) a cost cap that limits damage when something goes wrong. None of these are large compared to the engineering time saved when the workflow works. All of them are visible enough that a finance-aware reader will ask. Track them. Where to start The most concrete starter from this post: promote one personal prompt to a shared template. Pick the prompt that gets used most often (commit messages, code reviews, release notes, debugging assist), move it from someone's notes into the repository where the team versions everything else, and watch what changes when the next person on the team runs it. 
That is the smallest unit of the workflow shift this post argues for, and it is the step where prompts stop being individual practice and start becoming engineering assets.

The shift

The shift is from building systems to building smarter systems:

- AI does not replace engineers. It changes what an engineer's leverage looks like.
- The unit of value is the workflow, not the suggestion.
- The discipline that made platforms operable is the same discipline that makes AI useful.
- Responsible AI is not a compliance step. It is the sixth engineering discipline that lets the other five compound safely.

The series ends here, but the arc is consistent across all three posts. The disciplines that make platforms scale are the same disciplines that make AI useful. Build with discipline. Run with discipline. Evolve with discipline. The tools change. The disciplines do not.

Want to discuss?

Where has AI moved the needle most in your delivery, and where has it disappointed you? Drop a comment with patterns you have seen in your environment. Every reply gets read.

Previously in this series:

- Building Cloud Native Platforms That Scale: Patterns That Actually Work. Part 1 covered the design choices that make scale possible.
- Running Cloud Native Platforms: Why Day 2 Decides Everything. Part 2 covered the operational disciplines that decide production outcomes.

This is the third and final post in the series.

Cloud Native Platforms: Run
Audience: SREs (Site Reliability Engineers), platform engineers, engineering managers running production systems Reading time: 8 minutes Series: Cloud Native Platforms. Build, Run, Evolve. This is Part 2 of 3. Most systems are designed thoughtfully. Most operations are inherited reactively. The systems that survive are not the ones built with the most care. They are the ones operated with the most discipline. Production has a way of revealing every shortcut taken during design and every assumption left unverified. This post is about what it takes to operate a platform once the build is done. How they are run, not how they are built Systems are not defined by how they are built. They are defined by how they are run. A well-designed system that is operated reactively will fail in production. A modestly designed system that is operated with discipline will outperform it. Five operational disciplines decide which side of that line a platform lives on. Each one is engineering work, not a checklist for someone else to handle. Figure 1. The incident lifecycle as a state machine. The states are not optional steps. They are the contract between the team and the system. 1. Observability is the backbone of reliability Without observability, every operation becomes a guess. As systems grow, the cost of guessing rises faster than the cost of seeing. Part 1 of this series argued that observability is a design property: instrumentation contracts, request id propagation, structured logging schemas. Production is where those design choices either pay off or do not. Strong observability in production is a contract that lets any engineer answer three questions in minutes: what failed, why it failed, and what the impact was. The shape of that contract matters more than the tool that implements it. (This three-question framing is community-popularised through the SRE community and writers such as Charity Majors. 
See Honeycomb's What is Observability for the canonical articulation of the three-pillars and question framing; the substance is older than the framing.) Dashboards organised around user journeys, not infrastructure components Service level indicators (SLIs: the specific measurements you care about, e.g., success rate, p99 latency) chosen from the user's perspective, not the database's Alerts that page only on burn-rate against an SLO (Service Level Objective: the target value of an SLI, e.g., 99.9% of requests complete in under 800ms over a rolling month) using a multi-window strategy. A short window catches fast burns; a long window catches slow drifts. This is what makes SLOs operational rather than decorative. Sampling and retention tuned for cost, but never for blind spots The distinction between MTTA (mean time to acknowledge: how fast someone notices) and MTTR (mean time to restore: how fast service returns) tracked separately. Conflating them hides whether the team's bottleneck is detection, response, or fix. In practice The pattern that works: rebuild the operational view around two or three user journeys (sign-in, place order, view history) rather than per-component charts. Tie alerts to error budget burn rather than raw threshold crossings. Track MTTA and MTTR separately so the team's actual bottleneck (detection, response, or fix) is visible. The investment is rethinking what to measure, not buying a new tool. The return is that incidents stop being discovered by customer complaints first. Teams that make this shift typically find their existing telemetry was sufficient; only the questions being asked of it were wrong. If a dashboard cannot answer "what is the user experiencing right now", it is not an observability dashboard. It is decoration. 2. Alerts are signals, not notifications More alerts do not mean better monitoring. In practice, the opposite is true. Once alerts outpace the team's ability to act, important signals start getting missed. 
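As one illustration of alerting on user impact rather than raw metric values, the burn-rate check described in section 1 can be sketched in a few lines. This is a minimal sketch under stated assumptions, not a production alerting rule: the `Window`, `BurnRule`, and `evaluate` names are hypothetical, the request counts are invented, and the 14.4x and 6x thresholds are borrowed from the runbook example later in this post. A burn rate of 1.0 means errors are being spent exactly fast enough to use up the budget by the end of the SLO period.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Window:
    """Request counts observed over one lookback window."""
    total: int
    failed: int  # "bad" events: errors, or requests slower than the target


@dataclass(frozen=True)
class BurnRule:
    window: str        # key into the windows mapping, e.g. "1h"
    threshold: float   # multiple of the "exactly on budget" burn rate
    severity: str


def burn_rate(window: Window, slo_target: float) -> float:
    """Observed bad-event rate divided by the error budget (1 - SLO target)."""
    budget = 1.0 - slo_target
    error_rate = window.failed / window.total if window.total else 0.0
    return error_rate / budget


def evaluate(windows: dict, rules: list, slo_target: float) -> Optional[str]:
    """Return the severity of the first rule that trips, or None."""
    for rule in rules:
        if burn_rate(windows[rule.window], slo_target) >= rule.threshold:
            return rule.severity
    return None


SLO_TARGET = 0.999  # 99.9% target, so the error budget is 0.1%

# Rules are ordered most severe first. Thresholds are assumptions taken
# from the runbook example in this post, not universal constants.
RULES = [
    BurnRule(window="1h", threshold=14.4, severity="P1"),  # fast burn
    BurnRule(window="6h", threshold=6.0, severity="P2"),   # slow burn
]

# Invented sample data for three situations.
fast = evaluate({"1h": Window(100_000, 2_000), "6h": Window(600_000, 2_400)},
                RULES, SLO_TARGET)
slow = evaluate({"1h": Window(100_000, 300), "6h": Window(600_000, 4_200)},
                RULES, SLO_TARGET)
healthy = evaluate({"1h": Window(100_000, 50), "6h": Window(600_000, 300)},
                   RULES, SLO_TARGET)
print(fast, slow, healthy)  # prints: P1 P2 None
```

The same raw error count can page, open a ticket, or stay silent depending on how fast it consumes the budget, which is the sense in which the threshold is tied to user impact rather than to the metric itself.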
Effective alerting works to a small set of rules:

- Severity that maps to action, not to technical category
- Ownership baked in, never inferred at runtime
- Thresholds tied to user impact, not raw metric values
- Noise treated as a defect, with a regular review cadence
- Suppression and grouping for known multi-alert patterns

In practice

The pattern that works: audit every alert against one test, "what action would I take in the next five minutes if this fires now?" Demote alerts with no answer to dashboards. Remove alerts where the answer is the same as another alert's. Group related alerts so one incident produces one page, not twelve. Most teams discover their alert volume drops by an order of magnitude after a thorough audit, and the alerts that remain start getting trusted again. Trust is the precondition for every other operational practice. Without it, on-call rotations decay into noise filtering and the real signals get missed.

Figure 2. From raw events to pages, in approximate orders of magnitude. The numbers vary by team and workload; what does not vary is that each stage needs to remove one to two orders of magnitude of noise. Teams that page on raw events end up with on-call rotations nobody trusts.

3. Incident response is a practiced muscle

Failures are inevitable. Unstructured response is not. The teams that recover quickly do not improvise during incidents. They follow a structure that has been practiced when nothing was on fire. The structure is intentionally simple, because incident time is the worst time to negotiate roles.

- Clear roles: incident lead, communications lead, scribe, subject matter expert (the RACI model, Responsible-Accountable-Consulted-Informed, adapted for incident response)
- Defined escalation paths with clear handoff criteria. Escalation means re-paging to a higher tier or specialist, not returning to detection. The lifecycle diagram in Figure 1 makes the distinction explicit.
- Runbooks for the top failure modes, kept short enough to actually be read
- Status communication on a fixed cadence, even when there is nothing new to say. Customer comms and internal comms are tracked separately.
- Blameless postmortems (focus on the system that allowed the failure, not the person who pushed the button) that produce action items the team actually completes
- Game days: scheduled exercises that simulate failure modes (region outage, dependency unavailability, traffic spike) under controlled conditions, so gaps in runbooks are found before real incidents find them

In practice

The pattern that works: name the incident lead and the comms lead before the first message goes out. Write runbooks short enough to be scannable at 3 AM. Run blameless postmortems with action items that actually get tracked to completion. Schedule game days quarterly so the runbooks are exercised before real incidents. Teams that operate with this structure do not have more engineers; they have engineers who are not single points of failure during recovery. The deepest experts stay the deepest experts, but the platform stops depending on whether they happen to be online.

Implementation note

A short, well-structured runbook outperforms a long, exhaustive one. The goal during an incident is not to think. It is to act on a procedure that has been thought through in calmer times.

    # Runbook header pattern (keep it scannable in incident time)
    title: High latency on order API
    slo_protected:            # this runbook protects two SLOs
      - order-completion-success
      - order-completion-latency
    severity:                 # derived from burn rate, not declared
      fast_burn: P1           # 14.4x budget burn over 1 hour => page now
      slow_burn: P2           # 6x budget burn over 6 hours => investigate
    owner: payments-team
    indicators:               # triggers for evaluation, not severity
      - p99 (99th-percentile) latency exceeds the SLO target for 5 min
      - error rate exceeds the SLO target for 3 min on order-completion
    first_actions:
      - Open the order-journey dashboard. Confirm impact in business terms.
      - Check Service Bus queue depth and dead-letter rate
        (the most common cause of API latency under load is
        downstream backpressure)
      - Verify Cosmos DB RU/s saturation and partition hotspots
      - Inspect the most recent deployment for behavioural changes
    escalate_if:
      - Latency does not recover in 15 min
      - Error rate exceeds 5% (fast burn against the SLO)
      - Customer reports arrive before our own signals do
    rollback_path:
      - Feature flag "new-order-pipeline" can be disabled per-tenant
      - Last known good deployment id is in the release tracker
    note_on_scaling:
      # CPU is rarely the cause of latency in this service. Scale only after
      # confirming the bottleneck is compute, not a downstream dependency or
      # queue depth. Adding capacity to a saturated downstream amplifies the
      # incident; it does not resolve it.

The general principle behind that last note travels beyond this runbook: scale-out is the right remediation for compute saturation, not for downstream saturation. When latency rises because a database, queue, or external dependency is saturated, adding capacity in front of the bottleneck moves more requests into the bottleneck and makes the incident worse. This is one of the most common operational mistakes when the dashboard shows red and the on-call instinct says "add more".

4. Release confidence is engineered

Releases get harder as systems grow. The platforms that ship confidently at scale have engineered the path, not learned to fear it.
The patterns that change the math: Feature flags that allow change without deploy Canary deployments (releasing the new version to a small slice of traffic first, watching error budget burn before continuing) that surface problems on a small slice Gradual rollouts with automated rollback triggers Database migrations split from application releases Release coordination that scales with services, not with team size In practice The pattern that works: every change ships behind a feature flag, canary deployments take a small slice of traffic first, and rollback is a one-click step in the pipeline rather than a procedure to be invented during an incident. The cost is the discipline of building rollback paths and exercising them. The return is releases that stop being events. Issues that previously triggered full rollbacks get isolated to a slice and rolled back automatically before they reach most users. The willingness to ship smaller, more frequent changes follows directly from the confidence that bad changes can be undone fast. Big releases feel safe because they are rare. They are actually risky because every change rides together. 5. Reliability is continuous, not a milestone Reliability is not achieved through tools alone. It requires continuous refinement, feedback-driven improvement, and a budget that the team can spend on operational work without negotiating each time. The disciplines that keep systems reliable over years are codified well in the SRE-book framing of service level objectives and error budgets (the canonical reference is the Google SRE Book chapter on Service Level Objectives, with the operational follow-up in the SRE Workbook chapter on alerting on SLOs). The names matter less than the practice they enable. SLOs chosen from the user's perspective, with two or three per service rather than ten. More SLOs means none of them shape behaviour. 
Error budgets: the inverse of the SLO, expressing how much unreliability the team is willing to spend in a window. Used up early in the month means slow down on releases. Healthy means feature work keeps moving. Multi-window burn-rate alerting turns SLOs from dashboards into pages: short window catches catastrophic failures, long window catches slow drift. Without burn-rate alerting, SLOs are observation, not operation. (The pattern is documented in the SRE Workbook.) Reliability work has its own backlog, prioritised against features. Not a wishlist after every incident. Regular game days that exercise failure modes (region failover, dependency outage, traffic spike) before they happen for real Capacity planning informed by data, not by anxiety In practice The pattern that works: define two or three SLOs per service, expressed from the user's perspective. Compute the error budget weekly. When the budget is healthy, ship feature work. When the budget is burning fast, slow down and fix the cause. The conversation about which incidents matter and which can wait becomes possible because there is a shared number to point at. Reliability becomes a quantified property of the platform, not an opinion debated at every retrospective. Teams that adopt this discipline stop having the recurring "how reliable do we need to be?" argument and start having data-grounded trade-off discussions instead. A scenario that ties it together A platform was launching a new region. The build had gone well. Day 1 was clean. Two weeks in, latency started creeping up during peak hours. Alerts fired on raw thresholds, but no one could tell which ones to trust. Incident calls turned into long debugging sessions because three different teams owned overlapping pieces of the request path. The team did not start by buying a new tool. They started by treating operations as engineering work. The dashboard was redesigned around the user journey. Alerts were audited and most were demoted or removed. 
Roles for incident response were written down. A short runbook covered the top failure modes. Releases were broken into canary slices behind feature flags. None of this was new. It was discipline applied consistently to work that was previously assumed to be someone else's. The next region launch took half the effort, and the team's mean time to restore on the failures that did happen was measurably lower. What teams get wrong The common pattern is treating Day 2 as the cost of Day 1. Teams design beautifully, ship fast, then quietly absorb the operational debt. Dashboards proliferate. Alerts grow louder. Postmortems pile up. The fix is not more dashboards. It is treating operations as engineering work with the same rigour as feature delivery. Operability is a property the system either has or does not. It is not earned by adding monitoring. It is earned by designing for visibility and operating with discipline. Where to start The most concrete starter from this post: an alert audit. List every alert that fires in the next week and apply a single test to each one: "what action would I take in the next five minutes?" Demote the alerts that have no answer. Remove the alerts where the answer is the same as another alert's. The audit takes a morning. The result usually halves alert volume and lifts trust on what remains, which is the precondition for every other operational practice in this post. The shift The most important shift in maturity is not technical. It is in stance. The shift is from shipping software to operating systems: Operations is not a phase that follows engineering. It is engineering. Reliability is not a milestone reached. It is a discipline practiced. Incidents are not interruptions to the work. They are the work. The teams that internalise this shift run platforms that are smaller, calmer, and more trusted. They do not have fewer incidents because their systems are more advanced. 
They have fewer incidents because their operational discipline is more consistent. Part 3 of this series argues that the same discipline applies again, in a different domain: the practices that make platforms operable are the practices that make AI useful in delivery.

Want to discuss?
What is the one operational practice your team adopted that changed how you sleep at night? Drop a comment with patterns you have seen in your environment. Every reply gets read.

Previously in this series: Building Cloud Native Platforms That Scale: Patterns That Actually Work. The first post covered the design choices that make scale possible.

Next in this series: AI-First Platform Engineering: From Copilot to Agentic Delivery. Cloud helped us scale infrastructure. The next post looks at how AI is now changing how we build and run platforms.

Cloud Native Platforms: Build
Audience: Cloud architects, platform engineers, engineering leaders making design decisions Reading time: 8 minutes Series: Cloud Native Platforms. Build, Run, Evolve. This is Part 1 of 3. Most engineering teams can build systems. Few can scale them without rebuilding them. As platforms grow, complexity does not increase linearly. It multiplies across users, services, tenants, regions, and integrations. The systems that struggle and the systems that scale are rarely separated by which cloud they run on. They are separated by a handful of design choices made early and applied consistently. This post is about those choices. The differentiator is not the cloud Scalable platforms are not built with the right tools. They are built with the right design choices. Cloud services have closed the gap on infrastructure. The differentiator is no longer which managed service a team picks. It is whether the platform is designed to absorb change, tolerate failure, and support visibility from day one. Five engineering disciplines determine whether a platform scales gracefully or collects technical debt while it grows. Figure 1. The five disciplines compound into platform scale. Any one neglected becomes the constraint that forces a rewrite later. 1. Flexibility is the foundation of scale Hard-coded systems work until they do not. The first request to add a tenant, a region, a SKU (a sellable product variant), or a regulatory variant is the moment a rigid design starts to bend. Each subsequent request adds weight. Scalable platforms move behavior out of code: Configuration replaces conditional logic Feature flags enable safer, tenant-scoped rollouts APIs evolve through versioning, not breaking changes Schemas evolve additively. Breaking changes go through versioned contracts with a deprecation window long enough that consumers can migrate without downtime. 
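To make "configuration replaces conditional logic" and "tenant-scoped rollouts" concrete, here is a minimal sketch of a flag lookup. The `FlagDefinition` shape, the `newCheckout` flag, and the `isEnabled` helper are all hypothetical names chosen for illustration; a real platform would load the definitions from a managed configuration store rather than inline them.

```typescript
// Hypothetical flag shape: a global default plus optional per-tenant overrides.
interface FlagDefinition {
  defaultEnabled: boolean;
  tenantOverrides: Map<string, boolean>;
}

// Assumption: inlined here to keep the sketch self-contained. In practice this
// data would come from a versioned, reviewed configuration store.
const flags = new Map<string, FlagDefinition>([
  [
    "newCheckout",
    {
      defaultEnabled: false,
      tenantOverrides: new Map([["tenant-a", true]]),
    },
  ],
]);

// Resolve a flag for one tenant: an explicit tenant override wins, otherwise
// the global default applies. Unknown flags resolve to false, so a missing
// entry never switches behaviour on by accident.
function isEnabled(
  config: Map<string, FlagDefinition>,
  flag: string,
  tenantId: string
): boolean {
  const definition = config.get(flag);
  if (definition === undefined) {
    return false;
  }
  return definition.tenantOverrides.get(tenantId) ?? definition.defaultEnabled;
}
```

With this shape, gating a change to a single tenant is a configuration edit (adding an override) rather than a code change, which is the property the list above is describing.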
In practice The pattern that works: configuration in a managed store, feature flags with tenant scope, and APIs versioned per consumer contract. Cost is the discipline of treating configuration as code (versioned, reviewed, audited). The return is that releases stop being events and start being routine. A change that previously needed a coordinated deployment can be executed in minutes, gated to a single tenant for verification, and rolled out broadly only after the signal is clean. Most platforms reach this state by retrofit, not by design. Doing it earlier costs less than waiting. If a change requires a redeploy, it should require a very good reason. 2. Failures are normal. Resilience is a choice. Distributed systems will fail in unpredictable ways. The real question is not how to prevent failure. It is how the system responds when failure happens. Resilience is engineered, not inherited from the platform. The patterns that move the needle are well known and consistently applied: Idempotent operations (safe to call multiple times with the same result) that make retries safe Reliable messaging patterns such as the transaction outbox (writing the message to the same database transaction as the business change, then publishing asynchronously) to avoid lost or duplicated events Decoupled services that contain blast radius (the scope of damage when one component fails) Timeouts, retries, and circuit breakers (a wrapper around a dependency that stops calling it for a cool-off window after repeated failures) tuned per dependency Bulkheads (isolation pools, often a separate compute or queue lane per workload class) that keep noisy neighbours from starving critical paths of resources In practice The pattern that works: every write that can be retried carries an idempotency key, every queue consumer is safe to replay, every event published goes through an outbox in the same transactional unit as the business change. 
When peak load triggers retries, duplicates collapse cleanly instead of producing duplicate orders, double-charged customers, or split-brain state. The contract changes outwards: callers can retry without thinking, queues can be at-least-once instead of exactly-once, and recovery moves from a manual cleanup task to a property of the system. Most teams that adopt this pattern stop seeing certain classes of incident entirely.

Implementation note
An idempotent API is not just a design preference. It changes how the rest of the system can be built. Once writes are safe to repeat, retries become cheap, queues become trustworthy, and recovery becomes automatic. The naive implementation (read the key, if absent process and save) has a race. Two concurrent requests with the same key both miss the lookup, both call the processor, and both attempt to save. That is the failure mode idempotency exists to prevent. The pattern that survives production is an atomic reserve-then-execute: insert a row keyed by the idempotency key with a unique constraint before doing any work. The first writer wins. Concurrent callers either wait for the original to complete and read its result, or they receive a conflict response.

    // Contract for the idempotency store. The two key methods are TryReserveAsync
    // (atomic insert with unique-key constraint) and CompleteAsync (record the
    // result of the first writer). GetCompletedResultAsync polls until the first
    // writer commits or returns 409 Conflict if the in-flight window exceeds the
    // configured deadline.
    public interface IIdempotencyStore
    {
        Task<Reservation> TryReserveAsync(
            string idempotencyKey,
            string requestHash,
            CancellationToken ct);

        Task CompleteAsync(
            string idempotencyKey,
            OrderResult result,
            CancellationToken ct);

        Task<OrderResult> GetCompletedResultAsync(
            string idempotencyKey,
            CancellationToken ct,
            TimeSpan? maxWait = null);
    }

    public readonly record struct Reservation(
        bool IsFirstWriter,
        string RequestHash);

    // Idempotency via atomic reserve-then-execute.
    // First writer wins; replays return the original result; concurrent
    // duplicates lose the race and read the winner's outcome (or get 409).
    public async Task<OrderResult> CreateOrderAsync(
        Order order,
        string idempotencyKey,
        CancellationToken ct)
    {
        var requestHash = StableHash(order); // canonical content hash

        // Atomic insert: succeeds for the first caller, fails for the rest.
        var reserved = await _store.TryReserveAsync(
            idempotencyKey, requestHash, ct);

        if (!reserved.IsFirstWriter)
        {
            if (reserved.RequestHash != requestHash)
                throw new IdempotencyKeyReusedException();

            // A previous run committed (return its result) or is in-flight
            // (poll with a bounded deadline; 409 if exceeded).
            return await _store.GetCompletedResultAsync(
                idempotencyKey, ct, maxWait: TimeSpan.FromSeconds(5));
        }

        // We are the first writer. Execute, persist, mark complete.
        var result = await _processor.ProcessAsync(order, ct);
        await _store.CompleteAsync(idempotencyKey, result, ct);
        return result;
    }

Three production details matter:
- TTL or compaction on the idempotency record. Without it, the store grows forever. Most teams retain records for the request retry window plus a safety margin (commonly 24 to 72 hours).
- Stable content hash, not the default object hash code. The request hash detects key reuse with a different body, so a client that reuses an idempotency key with a different payload receives IdempotencyKeyReusedException rather than silently getting the wrong result. Canonicalise field ordering, locale, and null handling before hashing.
- Bound the in-flight window explicitly. The genuinely hard case is when the processor succeeded but the store write failed.
Production-grade implementations either run the side-effect and the store write in the same transaction (when the processor and store share a database) or use the transaction outbox pattern to bridge them. The poll-with-deadline in GetCompletedResultAsync handles the duplicate-arrives-mid-flight case; the transactional boundary handles everything else. 3. Observability is not optional Without observability, teams operate blind. As systems grow, the price of guessing rises faster than the price of seeing. At build time, observability is a design property. The decisions made before the system reaches production are what determine whether it can be operated at all. The dashboards, alerts, and incident practices covered in Part 2 of this series rely on instrumentation choices made here. The build-time work that pays off in production: Request identifiers propagated through every service hop, every queue, every async boundary, so a single user action can be traced end to end Structured logging with a consistent schema (event name, correlation id, tenant, severity) rather than free-form strings Metrics emitted at the boundaries that matter (every external call, every queue read or write, every database operation), not only at the entry point Tracing libraries integrated at the framework or middleware layer so coverage is automatic, not opt-in Schemas designed so business signals (orders, sessions, transactions) and system signals (CPU, latency, errors) share the same identifiers and can be correlated later In practice The pattern that works: a single request id flowing through every service hop, every queue, every async boundary, propagated automatically at the framework layer rather than per-call. Add one structured logging schema across services (event name, correlation id, tenant, severity), so that a single query joins business events with system events. The investment is hours of upfront framework wiring. 
The return is that production diagnosis stops being archaeology. Cross-service questions become single dashboards; postmortems shrink from days to hours; and the dashboards in Part 2 actually work because the data underneath is shaped to support them. 4. Delivery practices set the ceiling Scaling teams requires scaling delivery. Small inefficiencies in pipelines, environments, and release coordination compound into measurable drag. Delivery maturity that pays off at scale: Pipelines as code, reviewed and versioned like application code Parallel deployments across services and regions where dependencies allow Infrastructure as code with shared modules, not hand-managed environments Automated quality gates: tests, security scans, dependency checks Trunk-based development (developers commit to a single shared branch many times a day) with short-lived feature branches and progressive delivery. Important caveat: trunk-based works only when test automation and feature flags are already in place. Adopting it before those foundations exist tends to amplify production incidents rather than reduce them. In practice The pattern that works: pipelines run in parallel where dependencies allow, infrastructure provisioning is templated rather than per-environment, and quality gates run automatically rather than as discretionary steps. Sequential deployment of a multi-service platform across three environments takes hours; parallelised deployment of the same change takes minutes. The payback is not only release speed. It is the compounding cost reduction of every wait state for every engineer on every release. Teams that treat pipelines as a product feature, not an afterthought, ship more confidently and recover from bad changes faster because the rollback path was exercised, not invented during an incident. Slow pipelines are not a tooling problem. They are a design problem. 5. 
Cost discipline is engineering work Cloud platforms can become expensive quickly when cost is treated as someone else's problem. Cost is a property of the design, not a quarterly review. The teams that get this right treat cost the same way they treat performance: Elastic compute and storage tiers chosen per workload pattern Non-production environments with automated scale-down windows (the easiest savings to leave on the table) Tagging discipline so cost can be attributed to a service, a feature, a tenant Egress and data-tier choices, not compute, dominate cloud bills past a certain scale. Right-size storage tiers (hot vs cool vs archive), eliminate cross-region chatter, and watch egress on the data plane more closely than compute on the request path. Budgets and usage alerts wired into the same channels as reliability alerts Cost reviews built into design discussions, not deferred to FinOps (Financial Operations: the practice of managing cloud spend as an engineering concern) In practice The pattern that works: non-production environments scale down automatically outside business hours, storage tiers match access patterns (hot, cool, archive), and tagging is enforced so every dollar can be attributed to a service or feature. Cost reviews happen at design time, not after the bill arrives. The biggest savings come from data plane decisions, not compute: cross-region egress, oversized storage tiers, and forgotten test environments dominate cloud bills past a certain scale. Treat cost as a first-class non-functional requirement, alongside latency and availability, and the discipline compounds in every design discussion that follows. A scenario that ties it together Figure 2. A reference architecture that puts the disciplines into one shape. The request path is decoupled, the data layer is purpose-fit, identity is brokered by managed identity throughout, private endpoints isolate the data tier from public networks, and observability runs as a first-class lane. 
Picture a multi-tenant platform at a growth inflection. Onboarding a new tenant takes weeks because tenant-specific behaviour is hard-coded across services. Every release carries risk because there is no way to roll out a change to one tenant without affecting the rest. Incidents linger because logs and metrics live in different tools and nobody can correlate them in production. Do not start with a rewrite. Start with the smallest set of changes that unlocks the next year of growth: extract configuration out of code, introduce tenant-aware feature flags, wire a unified observability view into the existing services, and parallelise the pipelines. None of these are architectural revolutions. They are design choices applied with discipline, in the order the disciplines compound. Eighteen months in, onboarding a tenant takes hours instead of weeks. Releases move from monthly events to weekly increments. Incidents are caught earlier and resolved faster. The platform did not get bigger. It got more capable. The five disciplines did the work; the team made the choice to apply them. What teams get wrong The common pattern is architecting for the system you have, not the system you are growing into. It looks like progress because the current sprint ships. Pillars get postponed because they feel like overhead. The cost surfaces later. Each shortcut becomes a constraint. The constraints compound, and three releases later the team is debating a rewrite. The fix is not premature abstraction. It is small, deliberate investments in flexibility, resilience, observability, delivery, and cost from day one. The discipline is to make these investments before they are urgent. Where to start when you cannot do everything at once Five disciplines is a wall, and real teams cannot fund all five at once. The right order depends on whether the platform is being built fresh or already running. 
For a system already in production and already in pain, the SRE community's hierarchy of reliability needs gives the most defensible starting order: monitoring and observability first (you cannot fix what you cannot see), then incident response (close the bleeding cleanly), then resilience patterns (idempotency, retries, decoupling) so the bleeding has fewer reasons to start, then flexibility and delivery so safe change can travel at speed. Cost discipline runs alongside throughout, never as the headline. For a system being built fresh, the order in this post (flexibility, resilience, observability, delivery, cost) reflects the Azure Well-Architected Framework's emphasis on designing for change, failure, and visibility before scaling teams or workloads. Both orders are defensible. What is not defensible is leaving any of the five for later. The most concrete starter from this post: request id propagation. A single correlation identifier travelling through every service hop, every queue, every async boundary, costs hours up front and pays back every time someone has to debug production for the rest of the platform's life. It is the smallest unit of the observability discipline and the foundation that the dashboards, traces, and incident response in Part 2 all depend on. The shift The most important transformation in scaling a platform is not technical. It is mindset. The shift is from project thinking to platform thinking: Build reusable capabilities, not one-off solutions Design systems for long-term evolution, not the next release Enable other teams, not just deliver for one team Tools change. Cloud services evolve. The architectural fashions of this year will not be the architectural fashions of the next. What persists is the discipline behind the choices. Scalable systems are not built by tools. They are built by teams that treat design as continuous work. 
The same discipline shows up again in Part 2 (operating these systems) and Part 3 (using AI to augment that work). The tools change. The disciplines do not.

Want to discuss?
What single design choice has paid the most dividends in the platforms you run? Drop a comment with patterns you have seen in your environment. Every reply gets read.

Next in this series: Running Cloud Native Platforms: Why Day 2 Decides Everything. Building is half the journey. The next post looks at what it takes to operate these platforms once they are in production.

How to Modernise a Microsoft Access Database (Forms + VBA) to Node.JS, OpenAPI and SQL Server
Microsoft Access has played a significant role in enterprise environments for over three decades. Released in November 1992, its flexibility and ease of use made it a popular choice for organizations of all sizes, from FTSE250 companies to startups and the public sector. The platform enables rapid development of graphical user interfaces (GUIs) paired with relational databases, allowing users to quickly create professional-looking applications. Developers, data architects, and power users have all leveraged Microsoft Access to address various enterprise challenges. Its integration with Microsoft Visual Basic for Applications (VBA), an object-based programming language, ensured that Access solutions often became central to business operations. Unsurprisingly, modernizing these applications is a common requirement in contemporary IT engagements, because these solutions lead to data fragmentation, poor integration with master data systems, and multiple copies of the same data replicated across separate Access databases.

At first glance, upgrading a Microsoft Access application may seem simple, given its reliance on forms, VBA code, queries, and tables. However, substantial complexity often lurks beneath this straightforward exterior. Modernization efforts must consider whether to retain the familiar user interface to reduce staff retraining, how to accurately re-implement business logic, strategies for seamless data migration, and whether to introduce an API layer for data access. These factors can significantly increase the scope and effort required to deliver a modern equivalent, especially when dealing with numerous web forms, making manual rewrites a daunting task. This is where GitHub Copilot can have a transformative impact, dramatically reducing redevelopment time. By following a defined migration path, it is possible to deliver a modernized solution in as little as two weeks.
In this blog post, I’ll walk you through each tier of the application and give you example prompts used at each stage.

🏛️Architecture Breakdown: The N-Tier Approach
Breaking down the application architecture reveals a classic N-Tier structure, consisting of a presentation layer, business logic layer, data access layer, and data management layer.

💫First-Layer Migration: Migrating a Microsoft Access Database to SQL Server
The migration process began with the database layer, which is typically the most straightforward to move from Access to another relational database management system (RDBMS). In this case, SQL Server was selected to leverage the SQL Server Migration Assistant (SSMA) for Microsoft Access, a free tool from Microsoft that streamlines database migration to SQL Server, Azure SQL Database, or Azure SQL Managed Instance (SQL MI). While GitHub Copilot could generate new database schemas and insert scripts, the availability of a specialized tool made the process more efficient.

Using SSMA, the database was migrated to SQL Server with minimal effort. However, it is important to note that relationships in Microsoft Access may lack explicit names. In such cases, SSMA appends a GUID or uses one entirely to create unique foreign key names, which can result in confusing relationship names post-migration. Fortunately, GitHub Copilot can batch-rename these relationships in the generated SQL scripts, applying more meaningful naming conventions. By dropping and recreating the constraints, relationships become easier to understand and maintain.

SSMA handles the bulk of the migration workload, allowing you to quickly obtain a fully functional SQL Server database containing all original data. In practice, renaming and recreating constraints often takes longer than the data migration itself.

Prompt Used:
# Context
I want to refactor the #file:script.sql SQL script. Your task is to follow the below steps to analyse it and refactor it according to the specified rules. You are allowed to create / run any python scripts or terminal commands to assist in the analysis and refactoring process.
# Analysis Phase
Identify:
- Any warning comments
- Relations between tables
- Foreign key creation
- References to these foreign keys in 'MS_SSMA_SOURCE' metadata
# Refactor Phase
Refactor any SQL matching the following rules:
- Create a new script file with the same name as the original but with a `.refactored.sql` extension
- Rename any primary key constraints to follow the format PK_{table_name}_{column_name}
- Rename any foreign key constraints like [TableName]${GUID} to FK_{child_table}_{parent_table}
- Rename any indexes like [TableName]${GUID} to IDX_{table_name}_{column_name}
- Ensure any updated foreign keys are updated elsewhere in the script
- Identify which warnings flagged by the migration assistant need addressed
# Summary Phase
Create a summary file in markdown format with the following sections:
- Summary of changes made
- List of warnings addressed
- List of foreign keys renamed
- Any other relevant notes

🤖Bonus: Introduce Database Automation and Change Management
With a SQL Server database in place, we needed a controlled way to roll out future schema changes, so we introduced Liquibase as the formal change-management tool within the solution.

Prompt Used:
# Context
I want to refactor #file:db.changelog.xml. Your task is to follow the below steps to analyse it and refactor it according to the specified rules. You are allowed to create / run any python scripts or terminal commands to assist in the analysis and refactoring process.
# Analysis Phase
Analyse the generated changelog to identify the structure and content.
Identify the tables, columns, data types, constraints, and relationships present in the database.
Identify any default values, indexes, and foreign keys that need to be included in the changelog.
Identify any vendor specific data types / functions that need to be converted to common Liquibase types.
# Refactor Phase
DO NOT modify the original #file:db.changelog.xml file in any way. Instead, create a new changelog file called `db.changelog-1-0.xml` to store the refactored changesets. The new file should follow the structure and conventions of Liquibase changelogs.
You can fetch https://docs.liquibase.com/concepts/data-type-handling.html to get available Liquibase types and their mappings across RDBMS implementations.
Copy the original changesets from the `db.changelog.xml` file into the new file
Refactor the changesets according to the following rules:
- The main changelog should only include child changelogs and not directly run migration operations
- Child changelogs should follow the convention db.changelog-{version}.xml and start at 1-0
- Ensure data types are converted to common Liquibase data types. For example:
  - `nvarchar(max)` should be converted to `TEXT`
  - `datetime2` should be converted to `TIMESTAMP`
  - `bit` should be converted to `BOOLEAN`
- Ensure any default values are retained but ensure that they are compatible with the liquibase data type for the column.
- Use standard SQL functions like `CURRENT_TIMESTAMP` instead of vendor-specific functions.
- Only use vendor specific data types or functions if they are necessary and cannot be converted to common Liquibase types. These must be documented in the changelog and summary.
Ensure that the original changeset IDs are preserved for traceability.
Ensure that the author of all changesets is "liquibase (generated)"
# Validation Phase
Validate the new changelog file against the original #file:db.changelog.xml to ensure that all changesets are correctly refactored and that the structure is maintained.
Confirm no additional changesets are added that were not present in the original changelog.
# Finalisation Phase
Provide a summary of the changes made in the new changelog file.
Document any vendor specific data types or functions that were used and why they could not be converted to common Liquibase types.
Ensure the main changelog file (`db.changelog.xml`) is updated to include the new child changelog file (`db.changelog-1-0.xml`).

🤖Bonus: Synthetic Data Generation
Since the legacy system lacked synthetic data for development or testing, GitHub Copilot was used to generate fake seed data. Care was taken to ensure all generated data was clearly fictional (using placeholders like ‘Fake Name’ and ‘Fake Town’) to avoid any confusion with real-world information. This step greatly improved the maintainability of the project, enabling developers to test features without handling sensitive or real data.

💫Second-Layer Migration: OpenAPI Specifications
With data migration complete, the focus shifted to implementing an API-driven approach for data retrieval. Adopting modern standards, OpenAPI specifications were used to define new RESTful APIs for creating, reading, updating, and deleting data. Because these APIs mapped directly to underlying entities, GitHub Copilot efficiently generated the required endpoints and services in Node.js, utilizing a repository pattern. This approach not only provided robust APIs but also included comprehensive self-describing documentation, validation at the API boundary, automatic error handling, and safeguards against invalid data reaching business logic or database layers.

💫Third-Layer Migration: Business Logic
The business logic, originally authored in VBA, was generally straightforward. GitHub Copilot translated this logic into its Node.js equivalent and created corresponding tests for each method. These tests were developed directly from the code, adding a layer of quality assurance that was absent in the original Access solution. The result was a set of domain services mirroring the functionality of their VBA predecessors, successfully completing the migration of the third layer.
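As an illustration of the kind of translation described for the business logic layer, here is a minimal sketch. The VBA function shown in the comment and the `calculateLateFee` name are hypothetical and are not taken from the migrated application; they only show the shape of a small rule moving from VBA into a typed domain-service function that can be tested.

```typescript
// Hypothetical VBA original, shown for comparison only:
//
//   Public Function LateFee(DaysOverdue As Integer, DailyRate As Currency) As Currency
//       If DaysOverdue <= 0 Then
//           LateFee = 0
//       Else
//           LateFee = DaysOverdue * DailyRate
//       End If
//   End Function

// Equivalent domain-service function. Amounts are passed in whole minor units
// (for example pence or cents) because JavaScript numbers are floating point,
// whereas VBA's Currency type is fixed point.
function calculateLateFee(daysOverdue: number, dailyRateMinorUnits: number): number {
  if (!Number.isInteger(daysOverdue) || !Number.isInteger(dailyRateMinorUnits)) {
    throw new RangeError("daysOverdue and dailyRateMinorUnits must be whole numbers");
  }
  // Mirrors the VBA branch: nothing is owed until the item is overdue.
  if (daysOverdue <= 0) {
    return 0;
  }
  return daysOverdue * dailyRateMinorUnits;
}

// Example usage: three days overdue at 250 minor units per day.
const exampleFee = calculateLateFee(3, 250);
```

Tests generated from the translated code confirm what the new code does, not what the Access application did, so it is worth checking a handful of cases against the original system as well.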
At this stage, the project had a new database, a fresh API tier, and updated business logic, all conforming to the latest organizational standards. The final major component was the user interface, an area where advances in GitHub Copilot’s capabilities became especially evident.

💫Fourth Layer: User Interface
The modernization of the Access Forms user interface posed unique challenges. To minimize retraining requirements, the new system needed to retain as much of the original layout as possible, ensuring familiar placement of buttons, dropdowns, and other controls. At the same time, it was necessary to meet new accessibility standards and best practices. Some Access forms were complex, spanning multiple tabs and containing numerous controls. Manually describing each interface for redevelopment would have been time-consuming.

Fortunately, newer versions of GitHub Copilot support image-based prompts, allowing screenshots of Access Forms to serve as context. Using these screenshots, Copilot generated Government Digital Service Views that closely mirrored the original application while incorporating required accessibility features, such as descriptive labels and field selectors. Although the automatically generated UI might not fully comply with all current accessibility standards, prompts referencing WCAG guidelines helped guide Copilot’s improvements. The generated interfaces provided a strong starting point for UX engineers to further refine accessibility and user experience to meet organizational requirements.

🤖Bonus: User Story Generation from the User Interface
For organizations seeking a specification-driven development approach, GitHub Copilot can convert screenshots and business logic into user stories following the “As a … I want to … So that …” format. While not flawless, this capability is invaluable for systems lacking formal requirements, giving business analysts a foundation to build upon in future iterations.
🤖Bonus: Introducing MongoDB
Towards the end of the modernization engagement, there was interest in demonstrating migration from SQL Server to MongoDB. GitHub Copilot can facilitate this migration, provided it is given adequate context. As with all NoSQL databases, the design should be based on application data access patterns, typically reading and writing related data together. Copilot's ability to automate this process depends on a comprehensive understanding of the application's data relationships and patterns.

# Context
The `<business_entity>` entity from the existing system needs to be added to the MongoDB schema. You have been provided with the following:
- #file:documentation - System documentation to provide domain / business entity context
- #file:db.changelog.xml - Liquibase changelog for SQL context
- #file:mongo-erd.md - Contains the current Mongo schema Mermaid ERD. Create this if it does not exist.
- #file:stories - Contains the user stories that the system will be built around

# Analysis Phase
Analyse the available documentation and changelog to identify the structure, relationships, and business context of the `<business_entity>`.
Identify:
- All relevant data fields and attributes
- Relationships with other entities
- Any specific data types, constraints, or business rules

Determine how this entity fits into the overall MongoDB schema:
- Should it be a separate collection?
- Should it be embedded in another document?
- Should it be a reference to another collection for lookups or relationships?
- Explore the benefit of denormalization for performance and business needs

Consider the data access patterns and how this entity will be used in the application.
# MongoDB Schema Design
Using the analysis, suggest how the `<business_entity>` should be represented in MongoDB:
- The name of the MongoDB collection that will represent this entity
- List each field in the collection, its type, any constraints, and what it maps to in the original business context
- For fields that are embedded, document the parent collection and how the fields are nested. Nested fields should follow the format `parentField->childField`.
- For fields that are referenced, document the reference collection and how the lookup will be performed.
- Provide any additional notes on indexing, performance considerations, or specific MongoDB features that should be used
- Always use pascal case for collection names and camel case for field names

# ERD Creation
Create or update the Mermaid ERD in `mongo-erd.md` to include the results of your analysis. The ERD should reflect:
- The new collection or embedded document structure
- Any relationships with other collections/entities
- The data types, constraints, and business rules that are relevant for MongoDB
- Ensure the ERD is clear and follows best practices for MongoDB schema design

Each entity in the ERD should have the following layout:
**Entity Name**: The name of the MongoDB collection / schema
**Fields**: A list of fields in the collection, including:
- Field Name (in camel case)
- Data Type (e.g., String, Number, Date, ObjectId)
- Constraints (e.g. indexed, unique, not null, nullable)

In this example, the Liquibase changelog supplied the necessary context, detailing entities, columns, data types, and relationships. Based on this, Copilot could offer architectural recommendations for new document or collection types, including whether to embed documents or use separate collections with cache references for lookup data. Copilot can also generate an entity relationship diagram (ERD), allowing for review and validation before proceeding.
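The embed-versus-reference question in the prompt can be made concrete with a small sketch. This is an illustration only, not the code Copilot generated for the project: the `customer_row` and `address_rows` data, the `to_camel` helper, and `to_embedded_document` are all hypothetical names, chosen to show how rows that are read together in SQL might collapse into one document with camel-case field names.

```python
from typing import Any

# Hypothetical rows, as they might come from two related SQL Server tables.
customer_row = {"customer_id": 1, "full_name": "Fake Name"}
address_rows = [
    {"customer_id": 1, "town": "Fake Town", "is_primary": True},
    {"customer_id": 1, "town": "Fake City", "is_primary": False},
]


def to_camel(name: str) -> str:
    """Convert a snake_case column name to a camelCase field name."""
    head, *tail = name.split("_")
    return head + "".join(part.capitalize() for part in tail)


def to_embedded_document(
    parent: dict[str, Any],
    children: list[dict[str, Any]],
    child_field: str,
    foreign_key: str,
) -> dict[str, Any]:
    """Embed the child rows that share the parent's key inside the parent document."""
    doc = {to_camel(key): value for key, value in parent.items()}
    doc[child_field] = [
        # Drop the foreign key: nesting already expresses the relationship.
        {to_camel(key): value for key, value in child.items() if key != foreign_key}
        for child in children
        if child[foreign_key] == parent[foreign_key]
    ]
    return doc


customer_doc = to_embedded_document(customer_row, address_rows, "addresses", "customer_id")
print(customer_doc)
```

Embedding suits data that is read and written together, as in this sketch. Lookup data shared by many documents is usually a better fit for a separate collection with references.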
From there, a new data access layer can be generated, configurable to switch between SQL Server and MongoDB as needed. While production environments typically standardize on a single database model, this demonstration showcased the speed and flexibility with which strategic architectural components can be introduced using GitHub Copilot.

👨💻 Conclusion
This modernization initiative demonstrated how strategic use of automation and best practices can transform legacy Microsoft Access solutions into scalable, maintainable architectures utilizing Node.js, SQL Server, MongoDB, and OpenAPI. By carefully planning each migration layer, from database and API specifications to business logic, the team preserved core functionality while introducing modern standards and enhanced capabilities. GitHub Copilot played a pivotal role, not only speeding up redevelopment but also improving code quality through automated documentation, test generation, and meaningful naming conventions.

The result was a significant reduction in development time, with a robust, standards-compliant system delivered in just two weeks compared to an estimated six to eight months using traditional manual methods. This project serves as a blueprint for organizations seeking to modernize their Access-based applications, highlighting the efficiency gains and quality improvements that can be achieved by leveraging AI-powered tools and well-defined migration strategies. The approach ensures future scalability, easier maintenance, and alignment with contemporary enterprise requirements.

Centralizing Enterprise API Access for Agent-Based Architectures
Problem Statement
When building AI agents or automation solutions, calling enterprise APIs directly often means configuring individual HTTP actions within each agent for every API. While this works for simple scenarios, it quickly becomes repetitive and difficult to manage as complexity grows.

The challenge becomes more pronounced when a single business domain exposes multiple APIs, or when the same APIs are consumed by multiple agents. This leads to duplicated configurations, higher maintenance effort, inconsistent behavior, and increased governance and security risks.

A more scalable approach is to centralize and reuse API access. By grouping APIs by business domain using an API management layer, shaping those APIs through a Model Context Protocol (MCP) server, and exposing the MCP server as a standardized tool or connector, agents can consume business capabilities in a consistent, reusable, and governable manner. This pattern not only reduces duplication and configuration overhead but also enables stronger versioning, security controls, observability, and domain-driven ownership, making agent-based systems easier to scale and operate in enterprise environments.

Designing Agent-Ready APIs with Azure API Management, an MCP Server, and Copilot Studio
As enterprises increasingly adopt AI-powered assistants and Copilots, API design must evolve to meet the needs of intelligent agents. Traditional APIs, often designed for user interfaces or backend integrations, can expose excessive data, lack intent-level abstraction, and increase security risk when consumed directly by AI systems.

This document outlines a practical, enterprise-ready approach to organize APIs in Azure API Management (APIM), introduce a Model Context Protocol (MCP) server to shape and control context, and integrate the solution with Microsoft Copilot Studio. The goal is to make APIs truly agent-ready: secure, scalable, reusable, and easy to govern.
Architecture at a glance
- Back-end services expose domain APIs.
- Azure API Management (APIM) groups and governs those APIs (products, policies, authentication, throttling, versions).
- An MCP server calls APIM, orchestrates/filters responses, and returns concise, model-friendly outputs.
- Copilot Studio connects to the MCP server and invokes a small set of predictable operations to satisfy user intents.

Why Traditional API Designs Fall Short for AI Agents
Enterprise APIs have historically been built around CRUD operations and service-to-service integration patterns. While this works well for deterministic applications, AI agents work best with intent-driven operations and context-aware responses.

When agents consume traditional APIs directly, common issues include: overly verbose payloads, multiple calls to satisfy a single user intent, and insufficient guardrails for read vs. write operations. The result can be unpredictable agent behavior that is difficult to test, validate, and govern.

Structuring APIs Effectively in Azure API Management
Azure API Management (APIM) is the control plane between enterprise systems and AI agents. A well-structured APIM instance improves security, discoverability, and governance through products, policies, subscriptions, and analytics.

Key design principles for agent consumption
- Organize APIs by business capability (for example, Customer, Orders, Billing) rather than technical layers.
- Expose agent-facing APIs via dedicated APIM products to enable controlled access, throttling, versioning, and independent lifecycle management.
- Prioritize read-only operations where possible; scope write (action-oriented) operations narrowly and gate them with explicit checks, approvals, and least-privilege identities.
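The "overly verbose payloads" issue above can be sketched in a few lines. This is a minimal illustration, not an MCP SDK or APIM call: the `raw_order` payload, the field names, and `shape_order_status` are hypothetical, and the sketch shows only the shaping step that turns a CRUD-style response into a small, stable schema for one intent.

```python
from typing import Any

# Hypothetical verbose payload, as a CRUD-style order API might return it.
raw_order = {
    "orderId": "12345",
    "status": "Shipped",
    "estimatedDelivery": "2025-01-15",
    "internalWarehouseCode": "WH-07",
    "auditTrail": [{"user": "svc-batch", "action": "update"}],
    "customer": {"id": "C-1", "email": "fake@example.com"},
}

# The only fields the "order status" intent needs (an assumption for this sketch).
ORDER_STATUS_FIELDS = ("orderId", "status", "estimatedDelivery")


def shape_order_status(payload: dict[str, Any]) -> dict[str, Any]:
    """Return a fixed set of keys so the agent always sees the same shape.

    Missing fields become None instead of disappearing, which keeps the
    schema stable across responses.
    """
    return {field: payload.get(field) for field in ORDER_STATUS_FIELDS}


print(shape_order_status(raw_order))
```

Returning the same keys on every call, even when a value is missing, is one way to keep tool outputs predictable and avoid leaking internal fields to the model.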
The Role of the MCP Server in Agent-Based Architectures
APIM provides governance and security, but agents also need an intent-level interface and model-friendly responses. A Model Context Protocol (MCP) server fills this gap by acting as a mediator between Copilot Studio and APIM-exposed APIs.

Instead of exposing many back-end endpoints directly to the agent, the MCP server can orchestrate multiple API calls, filter irrelevant fields, enforce business rules, enrich results with business context, and emit concise, predictable JSON in LLM-friendly schemas. This makes agent behavior more reliable and easier to validate.

By introducing this abstraction layer, Copilot interactions become simpler, safer, and more deterministic. The agent interacts with a small number of well-defined MCP operations that encapsulate enterprise logic without exposing internal complexity.

Designing an Effective MCP Server
An MCP server should have a clear and focused responsibility: shaping context for AI models. Its role is not to replace core back-end services but to adapt enterprise capabilities and data for agent consumption.

The protocol itself does not orchestrate enterprise workflows or apply business logic. It standardizes how agents discover and invoke external tools and APIs by exposing them through a structured protocol interface. Any aggregation or filtering lives in the server implementation behind that interface, while intent resolution and policy-driven execution are handled by the agent runtime or host framework.

What MCP should do
- Call APIM-managed APIs and orchestrate multi-step retrieval when needed.
- Apply security checks and business rules consistently.
- Filter and minimize payloads (return only fields needed for the intent).
- Normalize and reshape responses into stable, predictable JSON schemas.
- Handle errors and edge cases with safe, descriptive messages.

What MCP should not do
It is equally important to understand what does not belong in MCP. Complex transactional workflows, long-running processes, and UI-specific formatting should remain in back-end systems. Keeping MCP lightweight keeps it scalable, testable, and easy to maintain.

Step-by-step guide
1) Create an MCP server in Azure API Management (APIM)
1. Open the Azure portal (portal.azure.com).
2. Go to your API Management instance.
3. In the left navigation, expand APIs.
4. Create (or select) an API group for the business domain you want to expose (for example, Orders or Customers).
5. Add the relevant APIs/operations to that API group.
6. Create or select an APIM product dedicated to agent usage, and ensure the product requires a subscription (subscription key).
7. Create an MCP server in APIM and map it to the API (or API group) you want to expose as MCP operations.
8. In the MCP server settings, ensure Subscription key required is enabled.
9. From the product's Subscriptions page, copy the subscription key you will use in Copilot Studio.

Screenshot placeholders: APIM API group, product configuration, MCP server mapping, subscription settings, subscription key location.

Note: Using an API Management subscription key to access MCP operations is one supported way to authenticate and consume enterprise APIs. However, this approach is best suited for initial setups, demos, or scenarios where key-based access is explicitly required. For production-grade enterprise solutions, Microsoft recommends using managed identity-based access control.
Managed identities for Azure resources eliminate the need to manage secrets such as subscription keys or client secrets, integrate natively with Microsoft Entra ID, and support fine-grained role-based access control (RBAC). This approach improves security posture while significantly reducing operational and governance overhead for agent and service-to-service integrations. Wherever possible, agents and MCP servers should authenticate using managed identities to ensure secure, scalable, and compliant access to enterprise APIs.

2) Create a Copilot Studio agent and connect to the APIM MCP server using a subscription key
Copilot Studio natively supports Model Context Protocol (MCP) servers as tools. When an agent is connected to an MCP server, the tool metadata (operation names, inputs, and outputs) is automatically discovered and kept in sync, reducing manual configuration and maintenance overhead.
1. Sign in to Copilot Studio.
2. Create a new agent and add clear instructions describing when to use the MCP tool and how to present results (for example, concise summaries plus key fields).
3. Open Tools > Add tool > Model Context Protocol, then choose Create.
4. Enter the MCP server details:
   - Server endpoint URL: copy this from your MCP server in APIM.
   - Authentication: select API Key.
   - Header name: use the subscription key header required by your APIM configuration.
5. Select Create new connection, paste the APIM subscription key, and save.
6. Test the tool in the agent by prompting for a domain-specific task (for example, "Get order status for 12345"). Validate that responses are concise and that errors are handled safely.

Screenshot placeholders: MCP tool creation screen, endpoint + auth configuration, connection creation, test prompt and response.

Operational best practices and guardrails
- Least privilege by default: create separate APIM products and identities for agent scenarios; avoid broad access to internal APIs.
- Prefer intent-level operations: expose fewer, higher-level MCP operations instead of many low-level endpoints.
- Protect write operations: require explicit parameters, validation, and (when appropriate) approval flows; keep "read" and "write" tools separate.
- Stable schemas: return predictable JSON shapes and limit optional fields to reduce prompt brittleness.
- Observability: log MCP requests/responses (with sensitive fields redacted), monitor APIM analytics, and set alerts for failures and throttling.
- Versioning: version MCP operations and APIM APIs; deprecate safely.
- Security hygiene: treat subscription keys as secrets, rotate regularly, and avoid exposing them in prompts or logs.

Summary
As organizations scale agent-based and Copilot-driven solutions, directly exposing enterprise APIs to AI agents quickly becomes complex and risky. Centralizing API access through Azure API Management, shaping agent-ready context via a Model Context Protocol (MCP) server, and consuming those capabilities through Copilot Studio establishes a clean and governable architecture. This pattern reduces duplication, enforces consistent security controls, and enables intent-driven API consumption without exposing unnecessary backend complexity. By combining domain-aligned API products, lightweight MCP operations, and least-privilege identity-based access, enterprises can confidently scale AI agents while maintaining strong governance, observability, and operational control.

References
- Azure API Management (APIM) – Overview
- Azure API Management – Key Concepts
- Azure MCP Server Documentation (Model Context Protocol)
- Extend your agent with Model Context Protocol
- Managed identities for Azure resources – Overview

Advancing to Agentic AI with Azure NetApp Files VS Code Extension v1.2.0
The Azure NetApp Files VS Code Extension v1.2.0 introduces a major leap toward agentic, AI-informed cloud operations with the debut of the autonomous Volume Scanner. Moving beyond traditional assistive AI, this release enables intelligent infrastructure analysis that can detect configuration risks, recommend remediations, and execute approved changes under user governance. Complemented by an expanded natural language interface, developers can now manage, optimize, and troubleshoot Azure NetApp Files resources through conversational commands - from performance monitoring to cross-region replication, backup orchestration, and ARM template generation. Version 1.2.0 establishes the foundation for a multi-agent system built to reduce operational toil and accelerate a shift toward self-managing enterprise storage in the cloud.

Designing Reliable Health Check Endpoints for IIS Behind Azure Application Gateway
Why Health Probes Matter in Azure Application Gateway
Azure Application Gateway relies entirely on health probes to determine whether backend instances should receive traffic. If a probe:
- Receives a non-200 response
- Times out
- Gets redirected
- Requires authentication
…the backend is marked Unhealthy, and traffic is stopped, resulting in user-facing errors. A healthy IIS application does not automatically mean a healthy Application Gateway backend.

Failure Flow: How a Misconfigured Health Probe Leads to 502 Errors
One of the most confusing scenarios teams encounter is when the IIS application is running correctly, yet users intermittently receive 502 Bad Gateway errors. This typically happens when health probes fail, causing Azure Application Gateway to mark backend instances as Unhealthy and stop routing traffic to them. The following diagram illustrates this failure flow.

Failure Flow Diagram (Probe Fails → Backend Unhealthy → 502)

Key takeaway: Most 502 errors behind Azure Application Gateway are not application failures; they are health probe failures.

What's Happening Here?
1. Azure Application Gateway periodically sends health probes to backend IIS instances.
2. If the probe endpoint:
   - Redirects to /login
   - Requires authentication
   - Returns 401 / 403 / 302
   - Times out
   the probe is considered failed.
3. After consecutive failures, the backend instance is marked Unhealthy.
4. Application Gateway stops forwarding traffic to unhealthy backends.
5. If all backend instances are unhealthy, every client request results in a 502 Bad Gateway, even though IIS itself may still be running.

This is why a dedicated, lightweight, unauthenticated health endpoint is critical for production stability.

Common Health Probe Pitfalls with IIS
Before designing a solution, let's look at what commonly goes wrong.

1. Probing the Root Path (/)
Many IIS applications:
- Redirect / → /login
- Require authentication
- Return 401 / 302 / 403
Application Gateway expects a clean 200 OK, not redirects or auth challenges.

2. Authentication-Enabled Endpoints
Health probes do not support authentication headers. If your app enforces:
- Windows Authentication
- OAuth / JWT
- Client certificates
…the probe will fail.

3. Slow or Heavy Endpoints
Probing a controller that:
- Calls a database
- Performs startup checks
- Loads configuration
can cause intermittent failures, especially under load.

4. Certificate and Host Header Mismatch
TLS-enabled backends may fail probes due to:
- Missing Host header
- Incorrect SNI configuration
- Certificate CN mismatch

Design Principles for a Reliable IIS Health Endpoint
A good health check endpoint should be:
- Lightweight
- Anonymous
- Fast (< 100 ms)
- Always return HTTP 200
- Independent of business logic

    Client Browser
         |
         | HTTPS (Public DNS)
         v
    +-------------------------------------------------+
    | Azure Application Gateway (v2)                  |
    |  - HTTPS Listener                               |
    |  - SSL Certificate                              |
    |  - Custom Health Probe (/health)                |
    +-------------------------------------------------+
         |
         | HTTPS (SNI + Host Header)
         v
    +-------------------------------------------------+
    | IIS Backend VM                                  |
    |                                                 |
    | Site Bindings:                                  |
    |  - HTTPS : app.domain.com                       |
    |                                                 |
    | Endpoints:                                      |
    |  - /health (Anonymous, Static, 200 OK)          |
    |  - /login  (Authenticated)                      |
    +-------------------------------------------------+

Azure Application Gateway health probe architecture for IIS backends using a dedicated /health endpoint.

Azure Application Gateway continuously probes a dedicated /health endpoint on each IIS backend instance. The health endpoint is designed to return a fast, unauthenticated 200 OK response, allowing Application Gateway to reliably determine backend health while keeping application endpoints secure.
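The failure flow can be modelled with a short sketch. It is a simplified illustration, not Application Gateway's actual implementation: the function name, the threshold of 3, the strict `healthy_codes=(200,)` default (the accepted status range is configurable on a real probe), and the rule that one success restores health are all assumptions made for this example. `None` stands for a probe that timed out.

```python
def backend_is_healthy(probe_statuses, unhealthy_threshold=3, healthy_codes=(200,)):
    """Replay probe results in order and return the backend's final state.

    The backend turns unhealthy once `unhealthy_threshold` probes fail in a
    row; a single passing probe resets the counter (an assumption of this sketch).
    """
    consecutive_failures = 0
    healthy = True
    for status in probe_statuses:
        if status in healthy_codes:
            consecutive_failures = 0
            healthy = True
        else:
            consecutive_failures += 1
            if consecutive_failures >= unhealthy_threshold:
                healthy = False
    return healthy


# A root path that redirects to /login fails every probe, so the backend is pulled.
print(backend_is_healthy([302, 302, 302]))
# An occasional failure that does not reach the threshold leaves the backend in rotation.
print(backend_is_healthy([200, 302, 302, 200]))
```

The model shows why intermittent 502s appear: the application never stopped serving users, but a run of failed probes was enough to remove it from rotation.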
Step 1: Create a Dedicated Health Endpoint
Recommended Path

    /health

This endpoint should:
- Bypass authentication
- Avoid redirects
- Avoid database calls

Example: Simple IIS Health Page
Create a static file:

    C:\inetpub\wwwroot\website\health\index.html

- Static
- Fast
- Zero dependencies

Step 2: Exclude the Health Endpoint from Authentication
If your IIS site uses authentication, explicitly allow anonymous access to /health.

web.config Example

    <location path="health">
      <system.webServer>
        <security>
          <authentication>
            <anonymousAuthentication enabled="true" />
            <windowsAuthentication enabled="false" />
          </authentication>
        </security>
      </system.webServer>
    </location>

⚠️ This ensures probes succeed even if the rest of the site is secured.

Step 3: Configure Azure Application Gateway Health Probe
Recommended Probe Settings

    Setting                     | Value
    Protocol                    | HTTPS
    Path                        | /health
    Interval                    | 30 seconds
    Timeout                     | 30 seconds
    Unhealthy threshold         | 3
    Pick host name from backend | Enabled

Why "Pick host name from backend" matters
This ensures:
- Correct Host header
- Proper certificate validation
- Avoids TLS handshake failures

Step 4: Validate Health Probe Behavior
From Application Gateway
- Navigate to Backend health
- Ensure status shows Healthy
- Confirm response code = 200

From the IIS VM

    Invoke-WebRequest https://your-app-domain/health

Expected:

    StatusCode : 200

Troubleshooting Common Failures
Probe shows Unhealthy but app works
✔ Check authentication rules
✔ Verify /health does not redirect
✔ Confirm HTTP 200 response

TLS or certificate errors
✔ Ensure certificate CN matches backend domain
✔ Enable "Pick host name from backend"
✔ Validate certificate is bound in IIS

Intermittent failures
✔ Reduce probe complexity
✔ Avoid DB or service calls
✔ Use static content

Production Best Practices
- Use separate health endpoints per application
- Never reuse business endpoints for probes
- Monitor probe failures as early warning signs
- Test probes after every deployment
- Keep health endpoints simple and boring
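"Test probes after every deployment" can be automated with a small script. The sketch below uses only the Python standard library. The `NoRedirect`, `classify`, and `probe` names are hypothetical, `https://your-app-domain/health` is the placeholder from the validation step, and the strict "200 only" rule follows this article's recommendation rather than any particular probe configuration. Redirects are deliberately not followed, so a 302 to /login is reported instead of being hidden.

```python
import time
import urllib.error
import urllib.request


class NoRedirect(urllib.request.HTTPRedirectHandler):
    """Stop urllib from following redirects so a 3xx is reported as-is."""

    def redirect_request(self, req, fp, code, msg, headers, newurl):
        return None  # urllib then raises HTTPError carrying the 3xx code


def classify(status, elapsed_s, timeout_s=30.0):
    """Turn one probe observation into a verdict string."""
    if status is None:
        return "unhealthy: no response"
    if elapsed_s > timeout_s:
        return "unhealthy: timed out"
    if status != 200:
        return f"unhealthy: HTTP {status}"
    return "healthy"


def probe(url, timeout_s=30.0):
    """Request the health URL once, without following redirects."""
    opener = urllib.request.build_opener(NoRedirect)
    start = time.monotonic()
    try:
        with opener.open(url, timeout=timeout_s) as response:
            status = response.status
    except urllib.error.HTTPError as err:
        status = err.code  # 3xx, 4xx and 5xx responses arrive here
    except OSError:
        status = None  # DNS, TLS, connection, or socket timeout failures
    return classify(status, time.monotonic() - start, timeout_s)


if __name__ == "__main__":
    # Placeholder host name; replace with the backend's real host name.
    print(probe("https://your-app-domain/health"))
```

Running this from a machine that resolves the backend host name gives a quick pass/fail signal in a deployment pipeline. It does not replace the Backend health view in the portal.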
Final Thoughts
A reliable health check endpoint is not optional when running IIS behind Azure Application Gateway; it is a core part of application availability. By designing a dedicated, authentication-free, lightweight health endpoint, you can eliminate a large class of false outages and significantly improve platform stability.

If you're migrating IIS applications to Azure or troubleshooting unexplained Application Gateway failures, start with your health probe. It's often the silent culprit.

Granting Azure Resources Access to SharePoint Online Sites Using Managed Identity
When integrating Azure resources like Logic Apps, Function Apps, or Azure VMs with SharePoint Online, you often need secure and granular access control. Rather than handling credentials manually, Managed Identity is the recommended approach to securely authenticate to Microsoft Graph and access SharePoint resources. High-level steps: Step 1: Enable Managed Identity (or App Registration) Step 2: Grant Sites.Selected Permission in Microsoft Entra ID Step 3: Assign SharePoint Site-Level Permission Step 1: Enable Managed Identity (or App Registration) For your Azure resource (e.g., Logic App): Navigate to the Azure portal. Go to the resource (e.g., Logic App). Under Identity, enable System-assigned Managed Identity. Note the Object ID and Client ID (you’ll need the Client ID later). Alternatively, use an App Registration if you prefer a multi-tenant or reusable identity. How to register an app in Microsoft Entra ID - Microsoft identity platform | Microsoft Learn Step 2: Grant Sites.Selected Permission in Microsoft Entra Open Microsoft Entra ID > App registrations. Select your Logic App’s managed identity or app registration. Under API permissions, click Add a permission > Microsoft Graph. Select Application permissions and add: Sites.Selected Click Grant admin consent. Note: Sites.Selected ensures least-privilege access — you must explicitly allow site-level access later. Step 3: Assign SharePoint Site-Level Permission SharePoint Online requires site-level consent for apps with Sites.Selected. Use the script below to assign access. Note: You must be a SharePoint Administrator and have the Sites.FullControl.All permission when running this. 
PowerShell Script:

    # Replace with your values
    $application = @{
        id          = "{ApplicationID}"   # Client ID of the Managed Identity
        displayName = "{DisplayName}"     # Display name (optional but recommended)
    }
    $appRole   = "write"                  # Can be "read" or "write"
    $spoTenant = "contoso.sharepoint.com" # SharePoint site host
    $spoSite   = "{Sitename}"             # SharePoint site name

    # Site ID format for Graph API: <host>:/sites/<site name>:
    $spoSiteId = $spoTenant + ":/sites/" + $spoSite + ":"

    # Load Microsoft Graph module
    Import-Module Microsoft.Graph.Sites

    # Connect with appropriate permissions
    Connect-MgGraph -Scopes "Sites.FullControl.All"

    # Grant site-level permission
    New-MgSitePermission -SiteId $spoSiteId -Roles $appRole -GrantedToIdentities @{ Application = $application }

That's it. Your Logic App or Azure resource can now call Microsoft Graph APIs to interact with that specific SharePoint site (e.g., list files, upload documents). You maintain centralized control and least-privilege access, complying with enterprise security standards.

By following this approach, you ensure secure, auditable, and scalable access from Azure services to SharePoint Online: no secrets, no user credentials, just managed identity done right.

Building a Secure and Compliant Azure AI Landing Zone: Policy Framework & Best Practices
As organizations accelerate their AI adoption on Microsoft Azure, governance, compliance, and security become critical pillars for success. Deploying AI workloads without a structured compliance framework can expose enterprises to data privacy issues, misconfigurations, and regulatory risks.

To address this challenge, the Azure AI Landing Zone provides a scalable and secure foundation, bringing together Azure Policy, Blueprints, and Infrastructure-as-Code (IaC) to ensure every resource aligns with organizational and regulatory standards. The Azure Policy & Compliance Framework acts as the governance backbone of this landing zone. It enforces consistency across environments by applying policy definitions, initiatives, and assignments that monitor and remediate non-compliant resources automatically.

This blog will guide you through:
🧭 The architecture and layers of an AI Landing Zone
🧩 How Azure Policy as Code enables automated governance
⚙️ Steps to implement and deploy policies using IaC pipelines
📈 Visualizing compliance flows for AI-specific resources

What is Azure AI Landing Zone (AI ALZ)?
AI ALZ is a foundational architecture that integrates core Azure services (ML, OpenAI, Cognitive Services) with best practices in identity, networking, governance, and operations. To ensure consistency, security, and responsibility, a robust policy framework is essential.

Policy & Compliance in AI ALZ
Azure Policy helps enforce standards across subscriptions and resource groups. You define policies (single rules), group them into initiatives (policy sets), and assign them to a scope, with exemptions where needed. Compliance reporting helps surface noncompliant resources for mitigation.
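The relationship between policies, initiatives, assignments, and compliance reporting can be shown with a toy model. This is a sketch of the concepts only, not the Azure Policy engine or its JSON schema: the rule names, the resource dictionaries, the shortened resource IDs, and the `non_compliant` helper are all hypothetical.

```python
# Toy model: a "policy" is a function that returns True when a resource complies.
def allowed_locations(allowed):
    return lambda resource: resource["location"] in allowed


def public_access_disabled(resource):
    return resource.get("publicNetworkAccess", "Enabled") == "Disabled"


# An "initiative" groups single rules under one name each.
initiative = {
    "allowed-locations": allowed_locations({"westeurope", "northeurope"}),
    "deny-public-network-access": public_access_disabled,
}

# Hypothetical resources with shortened IDs for readability.
resources = [
    {"id": "/subscriptions/sub-ai/storage/a", "location": "westeurope", "publicNetworkAccess": "Disabled"},
    {"id": "/subscriptions/sub-ai/storage/b", "location": "eastus", "publicNetworkAccess": "Enabled"},
    {"id": "/subscriptions/sub-ai/storage/c", "location": "eastus", "publicNetworkAccess": "Enabled"},
    {"id": "/subscriptions/other/storage/d", "location": "eastus", "publicNetworkAccess": "Enabled"},
]


def non_compliant(resources, initiative, scope, exempt_ids=frozenset()):
    """Evaluate an assignment: report failed rules for in-scope, non-exempt resources."""
    report = {}
    for resource in resources:
        if not resource["id"].startswith(scope) or resource["id"] in exempt_ids:
            continue
        failed = [name for name, rule in initiative.items() if not rule(resource)]
        if failed:
            report[resource["id"]] = failed
    return report


report = non_compliant(
    resources,
    initiative,
    scope="/subscriptions/sub-ai",
    exempt_ids={"/subscriptions/sub-ai/storage/c"},
)
print(report)
```

In this model the assignment is the call itself: it binds the initiative to a scope and a set of exemptions, and the returned report plays the role of the compliance view.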
AI workloads bring some unique considerations:
- Sensitive data (PII, models)
- Model accountability, logging, audit trails
- Cost & performance from heavy compute usage
- Preview features and frequent updates

Scope
This framework covers:
- Azure Machine Learning (AML)
- Azure API Management
- Azure AI Foundry
- Azure App Service
- Azure Cognitive Services
- Azure OpenAI
- Azure Storage Accounts
- Azure Databases (SQL, Cosmos DB, MySQL, PostgreSQL)
- Azure Key Vault
- Azure Kubernetes Service

Core Policy Categories
1. Networking & Access Control
- Restrict resource deployment to approved regions (e.g., Europe only).
- Enforce private link and private endpoint usage for all critical resources.
- Disable public network access for workspaces, storage, search, and key vaults.

2. Identity & Authentication
- Require user-assigned managed identities for resource access.
- Disable local authentication; enforce Microsoft Entra ID (Azure AD) authentication.

3. Data Protection
- Enforce encryption at rest with customer-managed keys (CMK).
- Restrict public access to storage accounts and databases.

4. Monitoring & Logging
- Deploy diagnostic settings to Log Analytics for all key resources.
- Ensure activity/resource logs are enabled and retained for at least one year.

5. Resource-Specific Guardrails
- Apply built-in and custom policy initiatives for OpenAI, Kubernetes, App Services, Databases, etc.

A detailed list of all policies is bundled and attached at the end of this blog. Be sure to check it out for a ready-to-use Excel file, suitable for customer workshops, which includes policy type (Standalone/Initiative), origin (Built-in/Custom), and more.

Implementation: Policy-as-Code using EPAC
To turn policies from Excel/JSON into operational governance, Enterprise Policy as Code (EPAC) is a powerful tool. EPAC transforms policy artifacts into a desired state repository and handles deployment, lifecycle, versioning, and CI/CD automation.

What is EPAC & Why Use It?
- EPAC is a set of PowerShell scripts / modules to deploy policy definitions, initiatives, assignments, role assignments, and exemptions.
- It supports CI/CD integration (GitHub Actions, Azure DevOps) so policy changes can be treated like code.
- It handles ordering, dependency resolution, and enforcement of a "desired state": any policy resources not in your repo may be pruned (depending on configuration).
- It integrates with Azure Landing Zones (including governance baseline) out of the box.

References & Further Reading
- EPAC GitHub Repository
- Advanced Azure Policy management - Microsoft Learn
- How to deploy Azure policies the DevOps way - Rabobank

Cross-Region Zero Trust: Connecting Power Platform to Azure PaaS across different regions
In the modern enterprise cloud landscape, data rarely sits in one place. You might face a scenario where your Power Platform environment (Dynamics 365, Power Apps, or Power Automate) is hosted in Region A for centralized management, while your sensitive SQL Databases or Storage Accounts must reside in Region B due to data sovereignty, latency requirements, or legacy infrastructure. Connecting these two worlds usually involves traversing the public internet - a major "red flag" for security teams.

The Missing Link in Cloud Security
When we talk about enterprise security, "Public Access: Disabled" is the holy grail. But for Power Platform architects, this setting is often followed by a headache. The challenge is simple but daunting: how can a Power Platform environment (e.g., in Region A) communicate with an Azure PaaS service (e.g., Storage or SQL in Region B) when that resource is completely locked down behind a Private Endpoint?

Existing documentation usually covers single-region setups with no firewalls. This post details a "Zero Trust" architecture that bridges this gap. It is a walkthrough for setting up a Cross-Region Private Link that routes traffic from the Power Platform in Region A, through a secure Azure Hub, and down the Azure Global Backbone to a Private Endpoint in Region B, without a single packet ever touching the public internet.

1. Understanding the Foundation: VNet Support
Before we build, we must understand what moves: Power Platform VNet integration is an "Outbound" technology. It allows the platform to connect to data sources secured within an Azure Virtual Network by "injecting" its traffic into your Virtual Network, without needing to install or manage an on-premises data gateway.

According to Microsoft's official documentation, this integration supports a wide range of services:
- Dataverse: Plugins and Virtual Tables.
- Power Automate: Cloud Flows using standard connectors.
- Power Apps: Canvas Apps calling private APIs.
This means that once the "tunnel" is built, your entire Power Platform ecosystem can reach your private Azure universe.
Virtual Network support overview – Power Platform | Microsoft Learn

2. The Architecture: A Cross-Region Global Bridge

Based on the Hub-and-Spoke topology, this architecture relies on four key components working in unison:
Source (Region A): The Power Platform environment uses VNet Injection. This injects the platform's outbound traffic into a dedicated, delegated subnet within your Region A Spoke VNet.
The Hub: A central VNet containing an Azure Firewall. This acts as the regional traffic cop and DNS Proxy, inspecting traffic and resolving private names before allowing packets to traverse the global backbone.
The Bridge (Global Backbone): We use Global VNet Peering to connect Region A to the Region B Spoke. This keeps traffic on Microsoft's private fiber backbone.
Destination (Region B): The Azure PaaS service (e.g., a Storage Account) is locked down with Public Access Disabled. It is only accessible via a Private Endpoint.

The Architecture: Visualizing the Flow

As illustrated in the diagram below, this solution separates the responsibilities into two distinct layers: the Network Admin (Azure Infrastructure) and the Power Platform Admin (Enterprise Policy).

3. The High Availability Constraint: Regional Pairs

A common pitfall in these deployments is configuring only a single region. Power Platform environments are inherently redundant. In a geography like Europe, your environment is actually hosted across a Regional Pair (e.g., West Europe and North Europe).

Why does this matter? If one Azure region in the pair experiences an outage, your Power Platform environment will fail over to the second region. If your VNet Policy isn't already in place there, your private connectivity will break.

To maintain High Availability (HA) for your private tunnel, your Azure footprint must mirror this:
Two VNets: You must create a Virtual Network in each region of the pair.
Two Delegated Subnets: Each VNet requires a subnet delegated specifically to Microsoft.PowerPlatform/enterprisePolicies.
Two Network Policies: You must create an Enterprise Policy in each region and link both to your environment to ensure traffic flows even during a regional failover.

Ensure your Azure subscription is registered for the Microsoft.PowerPlatform resource provider by running the SetupSubscriptionForPowerPlatform.ps1 script.

4. Solving the DNS Riddle with Azure Firewall

In a Hub-and-Spoke model, peering the VNets is only half the battle. If your Power Platform environment in Region A asks for mystorage.blob.core.windows.net, it will receive a public IP by default, and your connection will be blocked. To fix this, we use the Azure Firewall as a DNS Proxy:
Link the Private DNS Zone: Ensure your Private DNS Zones (e.g., privatelink.blob.core.windows.net) are linked to the Hub VNet.
Enable DNS Proxy: Turn on the DNS Proxy feature on your Azure Firewall.
Configure Custom DNS: Set the DNS servers of your Spoke VNets (Region A) to the Firewall's internal IP.

Now the DNS query flows through the Firewall, which "sees" the Private DNS Zone and returns the Private IP to the Power Platform.

5. Secretless Security with User-Assigned Managed Identity

Private networking secures the path, but identity secures the access. Instead of managing fragile Client Secrets, we use a User-Assigned Managed Identity (UAMI).

Phase A: The Azure Setup
Create the Identity: Generate a User-Assigned Managed Identity in your Azure subscription.
Assign RBAC Roles: Grant this identity specific permissions on your destination resource. For example, assign the Storage Blob Data Contributor role to allow the identity to manage files in your private storage account.

Phase B: The Power Platform Integration
To make the environment recognize this identity, you must register it as an Application User:
Navigate to the Power Platform Admin Center.
Go to Environments > [Your Environment] > Settings > Users + permissions > Application users.
Add a new app and select the Managed Identity you created in Azure.

6. Creating Enterprise Policy using PowerShell Scripts

One of the most important things to realize is that Enterprise Policies cannot be created manually in the Azure Portal UI. They must be deployed via PowerShell or CLI.

While Microsoft provides a comprehensive official GitHub repository with all the necessary templates, it is designed to be highly modular and granular. This means that to achieve a High Availability (HA) setup, an admin usually needs to execute deployments for each region separately and then perform the linking step.

To simplify this workflow, I have developed a Simplified Scripts Repository on my GitHub. These scripts use the official Microsoft templates as their foundation but add an orchestration layer specifically for the Regional Pair requirement:
Regional Pair Automation: Instead of running separate deployments, my script handles the dual-VNet injection in a single flow. It automates the creation of policies in both regions and links them to your environment in one execution.
Focused Scenarios: I've distilled the most essential scripts for Network Injection and Encryption (CMK), making it easier for admins to get up and running without navigating the entire modular library.
The Goal: To provide a "Fast-Track" experience that follows Microsoft's best practices while reducing the manual steps required to achieve a resilient, multi-region architecture.

Owning the Keys with Encryption Policies (CMK)

While Microsoft encrypts Dataverse data by default, many enterprise compliance standards require Customer-Managed Keys (CMK). This ensures that you, not Microsoft, control the encryption keys for your environments.
- Manage your customer-managed encryption key - Power Platform | Microsoft Learn

Key Requirements:
Key Vault Configuration: Your Key Vault must have Purge Protection and Soft Delete enabled to prevent accidental data loss.
The Identity Bridge: The Encryption Policy uses the User-Assigned Managed Identity (created in Step 5) to authenticate against the Key Vault.
Permissions: You must grant the Managed Identity the Key Vault Crypto Service Encryption User role so it can wrap and unwrap the encryption keys.

7. The Final Handshake: Linking Policies to Your Environment

Creating the Enterprise Policy in Azure is only the first half of the process. You must now "inform" your Power Platform environment that it should use these policies for its outbound traffic and identity.

Linking the Policies to Your Environment:
For VNet Injection: In the Admin Center, go to Security > Data and privacy > Azure Virtual Network Policies. Select your environment and link it to the Network Injection policies you created.
For Encryption (CMK): Go to Security > Data and privacy > Customer-managed encryption key. Select the Encryption Enterprise Policy, choose Edit Policy, then Add Environment.
Crucial Step: You must first grant the Power Platform service "Get", "List", "Wrap" and "Unwrap" permissions on your specific key within Azure Key Vault before the environment can successfully validate the policy.

Verification: The "Smoking Gun" in Log Analytics

After one of the Power Platform services has successfully reached a resource, you can check whether the connection was private. How do you prove it's private? Use KQL in Azure Log Analytics to verify the Network Security Perimeter (NSP) ID.

The Proof: When you see a GUID in the NetworkPerimeter field, it is strong evidence that the resource accepted the request because it arrived via your authorized private bridge.
In the Azure Portal, navigate to your resource (for example, the Key Vault), open Logs, and use the following KQL:

    AzureDiagnostics
    | where ResourceProvider == "MICROSOFT.KEYVAULT"
    | where OperationName == "KeyGet" or OperationName == "KeyUnwrap"
    | where ResultType == "Success"
    | project TimeGenerated, OperationName, VaultName = Resource, ResultType, CallerIP = CallerIPAddress, EnterprisePolicy = identity_claim_xms_mirid_s, NetworkPerimeter = identity_claim_xms_az_nwperimid_s
    | sort by TimeGenerated desc

Result: By implementing the Network and Encryption Enterprise Policies, you transition the Power Platform from a public SaaS tool into a fully governed, private extension of your Azure infrastructure. You no longer have to choose between the agility of low-code and the security of a private cloud.

To summarize the transformation from public endpoints to a complete Zero Trust architecture across regions, here is the end-to-end workflow:

PHASE 1: Azure Infrastructure Foundation
Create Network Fabric (HA): Deploy VNets and Delegated Subnets in both regions of the pair.
Deploy the Hub: Set up the Central Hub VNet with Azure Firewall.
Connect Globally: Establish Global VNet Peering between all Spokes and the Hub.
Solve DNS: Enable DNS Proxy on the Firewall and link Private DNS Zones to the Hub VNet.
↓
PHASE 2: Identity & Security Prep
Create Identity: Generate a User-Assigned Managed Identity (UAMI).
Grant Access (RBAC): Give the UAMI permissions on the target PaaS resource (e.g., Storage Blob Data Contributor).
Prepare CMK: Configure Key Vault access policies for the UAMI (Wrap/Unwrap permissions).
↓
PHASE 3: Deploy Enterprise Policies (PowerShell/IaC)
Deploy Network Policies: Create "Network Injection" policies in Azure for both regions.
Deploy Encryption Policy: Create the "CMK" policy linking to your Key Vault and Identity.
↓
PHASE 4: Power Platform Final Link (Admin Center)
Link Network: Associate the Environment with the two Network Policies.
Link Encryption: Activate the Customer-Managed Key on the environment.
Register User: Add the Managed Identity as an "Application User" in the environment.
↓
PHASE 5: Verification
Run Workload: Trigger a Flow or Plugin.
Audit Logs: Use KQL in Log Analytics to confirm the presence of the NetworkPerimeter ID.
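
Alongside the log check, the DNS behavior described in section 4 can be spot-checked before running a workload. The following is a minimal Python sketch, not part of the original scripts: it assumes you run it from a test VM in the Region A Spoke VNet that uses the Firewall as its DNS server, which only approximates what the delegated subnet sees. The function names resolve_addresses and all_private are hypothetical helpers introduced here for illustration.

```python
import ipaddress
import socket


def resolve_addresses(hostname: str) -> list[str]:
    """Return the unique IPv4 addresses the local resolver gives for hostname."""
    infos = socket.getaddrinfo(
        hostname, 443, family=socket.AF_INET, proto=socket.IPPROTO_TCP
    )
    # Each entry is (family, type, proto, canonname, sockaddr); sockaddr[0] is the IP.
    return sorted({info[4][0] for info in infos})


def all_private(addresses: list[str]) -> bool:
    """True only if at least one address was returned and none is public.

    ipaddress treats the RFC 1918 ranges (10/8, 172.16/12, 192.168/16) as
    private, along with other non-global ranges such as loopback.
    """
    return bool(addresses) and all(
        ipaddress.ip_address(addr).is_private for addr in addresses
    )
```

Calling all_private(resolve_addresses("mystorage.blob.core.windows.net")) from that VM should return True once the Private DNS Zone is linked to the Hub and the DNS Proxy is in place. A False result suggests the name is still resolving to a public IP, so the DNS chain is worth rechecking before you look at the Enterprise Policy.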