Beyond the Canvas: The Azure Architecture Diagram Builder Becomes Agent-Ready
AZURE ARCHITECTURE BLOG · 8 MIN READ

Author: Arturo Quiroga, Senior Partner Solutions Architect — Microsoft

Two months ago I published From Prompt to Production: Building Azure Architecture Diagrams with AI, introducing the open-source Azure Architecture Diagram Builder. The response was humbling — thousands of you read it, tried the tool, and filed issues and feature requests. A follow-up on how the Well-Architected Framework scoring works went deep on validation. You asked, and the tool grew.

This post is about what’s new since May — and one change big enough to reframe the whole project: the Azure Architecture Diagram Builder is no longer just an app you click. It’s a partner you chat with, and a tool other agents can call.

TL;DR. Three arcs of new capability: (1) Architecture Chat turns diagram design into a multi-turn conversation over the live canvas; (2) Blueprint Diagrams produce hand-drawn, whiteboard-style deliverables alongside the formal topology; and (3) the app now exposes its capabilities as a Model Context Protocol (MCP) server, so AI agents can generate, validate, cost, and render Azure architectures programmatically. Plus a 13-model fleet, deployment guides grounded in Microsoft Learn, and July output enhancements.

What’s new at a glance

| Capability | What it does |
| --- | --- |
| Architecture Chat | Refine a diagram by conversation — “add Front Door with WAF,” then “now make it zone-redundant.” Each turn reads the live canvas and auto-saves to history. |
| Blueprint Diagrams (BETA) | Hand-drawn, whiteboard-style renders with nested zones and numbered flow arrows. Topology, Blueprint, or Both. |
| A fleet of 13 models | Multi-provider roster — GPT-5.x, DeepSeek, Grok, Mistral, and Kimi — with side-by-side comparison to pick the right brain per task. |
| MCP server | The app is now a remote MCP server. Agents can list_services, validate_architecture, estimate_costs, generate_bicep, and render_diagram with typed, structured outputs. |
| Microsoft Learn grounding | Deployment guides now cite live Microsoft Learn documentation. |
| Output enhancements (July 2026) | Cost badges, light/dark render themes, and metadata panels in rendered diagrams. |

From clicking to conversing: Architecture Chat

The single most common request after the launch post was some version of “I love the first diagram, but I want to iterate without re-writing the whole prompt.” Regenerating from scratch every time you tweak a requirement is slow and loses context.

Architecture Chat solves this. It’s a conversational panel that sits alongside the canvas and treats your diagram as a living document. Each message is a turn in an ongoing design session:

- “Add an Azure Front Door with WAF in front of the app tier.”
- “Now make the data layer zone-redundant.”
- “Swap the SQL Database for Cosmos DB and update the connections.”

Every turn reads the current state of the canvas — not the original prompt — so refinements compound naturally the way they would with a human architect at a whiteboard. The conversation auto-saves to history, so you can step back through the evolution of a design or branch from an earlier point.

Architecture Chat panel beside the canvas, showing a multi-turn conversation that incrementally adds and modifies services on the diagram.
Figure 1. Architecture Chat treats the diagram as a living document. Each message refines the current canvas — adding services, changing SKUs, or reorganizing groups — and the full exchange is saved to history.

The shift is subtle but important: architecture design stops being a one-shot prompt and becomes an iterative dialogue.

The whiteboard deliverable: Blueprint Diagrams (BETA)

Formal topology diagrams with official Azure icons are perfect for documentation and stakeholder decks. But early-stage design conversations often want something looser — the hand-drawn feel of a whiteboard sketch that communicates intent without implying finality.
Blueprint Diagrams generate exactly that: a whiteboard-style render with nested zones (subscription → VNet → subnet), numbered flow arrows, and a deliberately sketchy aesthetic. You choose the output mode:

- Topology — the formal, icon-based diagram from the launch post
- Blueprint — the hand-drawn whiteboard style
- Both — generate the two side by side

The formal topology diagram of an architecture shown next to a Blueprint-style hand-drawn version of the same design with nested zones and numbered flow arrows.
Figure 2. The same architecture in two visual languages. Left: the formal, icon-based topology. Right: Blueprint mode — a whiteboard-style render with nested zones and numbered flow steps, plus a numbered legend explaining each hop. Use Blueprint for early design conversations and Topology for final documentation.

It’s the same underlying architecture — two visual languages for two different moments in the design lifecycle.

A fleet of 13 models: pick the right brain per task

The launch post shipped with multi-model support. That fleet has grown to 13 models across five providers, so you can match the model to the job — fast models for iteration, reasoning models for complex designs, code-optimized models for Bicep generation:

- OpenAI GPT-5.x — GPT-5.1, GPT-5.2, GPT-5.2 Codex, GPT-5.3 Codex, GPT-5.4, GPT-5.4 Mini
- DeepSeek — V3.2 Speciale, V4 Pro
- xAI Grok — 4.1 Fast, 4.3
- Mistral — Large 3
- MoonshotAI Kimi — K2.5, K2.7 Code

The Compare Models feature runs the same prompt through any subset of these in parallel and ranks them on service count, token usage, latency, and cost — with Fastest / Cheapest / Most Thorough badges — so you can make an evidence-based choice rather than a guess.

Compare Models results grid showing side-by-side metrics across all 13 models with Fastest, Cheapest, and Most Thorough badges.
AI Critique panel with an overall ranking and per-model analysis generated by a critic model.
Figure 3. Multi-model comparison across the full 13-model fleet.
Top: the results grid ranks every model on service count, connections, token usage, latency, and cost, with Fastest / Cheapest / Most Thorough badges. Bottom: an optional AI Critique uses a critic model to rank the outputs and explain each model’s strengths and gaps.

Adding a model is now a small, well-understood change — a testament to how the multi-provider abstraction has matured since May.

The headline: the Diagram Builder is now an MCP server

Here’s the change that reframes the project. Everything above is about a person using a web app. But the same capabilities — generating a diagram, validating it against WAF, estimating its cost, producing Bicep — are exactly the things an AI agent needs when it reasons about Azure architecture. So we exposed them.

The Azure Architecture Diagram Builder now runs as a Model Context Protocol (MCP) server. Any MCP-capable agent can call its tools with typed inputs and structured outputs:

| Tool | What the agent gets |
| --- | --- |
| list_services | The catalog of supported Azure services and categories |
| validate_architecture | A WAF assessment with pillar scores and findings |
| estimate_costs | Multi-region cost estimates from the Azure Retail Prices API |
| generate_bicep | Infrastructure-as-Code templates for the design |
| render_diagram | A rendered diagram (topology or blueprint) of the architecture |

This means an agent can hold a conversation like “design a HIPAA-compliant platform, check it against the Well-Architected Framework, tell me the monthly cost in West Europe, and give me the Bicep” — and the Diagram Builder answers each part programmatically, returning structured data the agent can reason over and chain.

Microsoft Scout invoking the Diagram Builder’s render_diagram MCP tool, showing the tool-call parameters and saving the generated SVG to the workspace.
The Azure architecture diagram rendered by the MCP tool and displayed inline in the Microsoft Scout conversation.
Figure 4. The Diagram Builder as an MCP server inside Microsoft Scout.
Top: from a natural-language request, the agent calls the render_diagram tool with structured parameters (title, format, direction, theme, region) and saves the returned SVG to its workspace. Bottom: the rendered architecture — grouped zones, labeled flows, and cost badges — appears inline in the conversation, generated entirely through agent tool calls.

The tool that started as a canvas for humans is now also a building block for agents. That’s the arc: from an app you click, to a partner you chat with, to a tool other agents call.

Grounded in Microsoft Learn, and sharper output

Two smaller-but-meaningful improvements round out the release:

- Microsoft Learn grounding. Deployment guides now search official Microsoft Learn documentation at generation time and cite it, so the guidance reflects current, authoritative practice rather than a model’s training snapshot.
- Output enhancements (July 2026). Rendered diagrams now carry per-service cost badges, support light and dark render themes, and include metadata panels that summarize the architecture — service counts, regions, and estimated cost — directly on the image.
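
For readers who want to see what an agent-side call to one of the MCP tools might look like on the wire, here is a minimal sketch. It assumes the standard MCP `tools/call` method carried as a JSON-RPC 2.0 message. The argument names follow the parameters listed in the Figure 4 caption (title, format, direction, theme, region), but every value shown is a hypothetical placeholder, and the server's real input schema may require other fields. The helper name `build_tool_call` is illustrative only.

```python
import json
from itertools import count

# Monotonic request ids; JSON-RPC only needs each id to be unique per session.
_request_ids = count(1)


def build_tool_call(tool_name: str, arguments: dict) -> dict:
    """Build an MCP `tools/call` request as a JSON-RPC 2.0 message."""
    return {
        "jsonrpc": "2.0",
        "id": next(_request_ids),
        "method": "tools/call",
        "params": {"name": tool_name, "arguments": arguments},
    }


# Argument names mirror the Figure 4 caption; all values are hypothetical.
request = build_tool_call(
    "render_diagram",
    {
        "title": "HIPAA-compliant healthcare platform",
        "format": "svg",
        "direction": "LR",
        "theme": "dark",
        "region": "westeurope",
    },
)

# Serialize exactly as it would be sent over the MCP transport.
print(json.dumps(request, indent=2))
```

In practice an MCP client library would build and send this message for you; the sketch only makes the shape of the exchange visible.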
Highlights

Since the May launch, the Azure Architecture Diagram Builder has grown from a design tool into an agent-ready platform:

- Conversational design: iterate on a diagram by chatting over the live canvas, with full history
- Two visual languages: formal topology and hand-drawn Blueprint, from the same architecture
- 13 models, five providers: choose the right brain per task, with evidence-based comparison
- Agent-ready: an MCP server exposing generation, validation, costing, and IaC as callable tools
- Grounded guidance: deployment guides cite live Microsoft Learn documentation
- Still open source: every capability above is available to inspect, extend, and contribute to

Try It Today

- Live demo: https://aka.ms/diagram-builder
- Source code: GitHub repository
- Documentation: See the Getting Started Guide for setup, and the repository’s MCP server directory for agent integration.

If you read the first post and tried the tool — thank you. The features above exist because you told me what you needed. Keep the feedback coming via GitHub Issues.

Tags: artificial intelligence · application · apps & devops · well architected · infrastructure
arturoqu
Jul 10, 2026 · Azure Architecture Blog
From Prompt to Production: Building Azure Architecture Diagrams with AI
Author: Arturo Quiroga, Senior Partner Solutions Architect — Microsoft

Cloud architects spend significant time translating ideas into architecture diagrams. They toggle between Visio, draw.io, pricing calculators, and documentation. According to the 2024 Stack Overflow Developer Survey, 61% of developers spend more than 30 minutes a day searching for answers or solutions, time lost to context-switching rather than design.

What if you could describe your architecture in plain English and get a diagram, cost estimate, and deployment guide in minutes?

The Challenge: Fragmented Architecture Workflows

Designing Azure architectures today typically involves multiple disconnected steps:

- Sketch the architecture in a diagramming tool
- Look up official Azure icons and drag them into place
- Research pricing across regions using the Azure Pricing Calculator
- Validate the design against the Well-Architected Framework (WAF)
- Write deployment documentation and Infrastructure as Code templates
- Compare alternative designs manually

Each step lives in a different tool, and keeping them in sync as designs evolve is costly. The Azure Architecture Diagram Builder brings these workflows together in a single browser-based experience.

How It Works

Describe your architecture in natural language, for example "A HIPAA-compliant healthcare platform with FHIR APIs, event-driven processing, and multi-region disaster recovery", and the AI generates a diagram with grouped services, data flow connections, and logical organization.

Figure 1. Enter a natural-language prompt describing your architecture. Curated example prompts help you get started, and you can optionally upload an existing diagram for the AI to analyze.

The tool uses Azure OpenAI to power generation across multiple models, enabling you to choose the model that best fits your scenario — from fast iterations to deeper reasoning.
Key Features

AI-Powered Architecture Generation

Describe what you need in plain English, and the AI creates an architecture diagram with:

- 714 official Azure service icons across 29 categories
- Smart grouping: services are logically organized (Frontend, Backend, Data, Security)
- Data flow connections: labeled edges showing how data moves through the system
- 13 curated example prompts: from simple web apps to complex enterprise scenarios like Zero Trust networks, Industrial IoT with 5,000+ sensors, and global multiplayer gaming backends

Figure 2. A generated industrial IoT architecture. Top: the clean diagram view as initially produced. Bottom: the same diagram with per-service monthly cost overlays toggled on, plus a running subscription total in the toolbar.

Architecture Image Import

Already have an architecture on a whiteboard or in a screenshot? Upload the image and let the AI analyze it, mapping services to official Azure icons and recreating the architecture as an editable, interactive diagram.

Figure 3. Upload a photo of a whiteboard sketch (top-right reference panel) and the AI recreates it as an editable diagram with official Azure service icons and labeled data flow connections.

ARM Template Import

Import existing ARM templates to visualize your current infrastructure. The AI parses resource definitions and dependencies, groups related resources into logical layers, and produces a meaningful diagram of what you actually have deployed — a fast way to document an inherited environment or sanity-check a template before deployment.

Figure 4. ARM template import in action. Top: the parser status banner while resources and dependencies are being analyzed. Bottom: the resulting diagram, with resources auto-grouped into logical layers (Web Tier, Data Layer, Container Platform, Observability & Logging) and a Generated from: ARM Template badge linking the diagram back to its source file.
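
To make the ARM import idea concrete, here is a rough sketch of turning a template's `resources` array into layers and dependency edges. It is not the tool's actual parser, which the post describes as AI-driven. The provider-to-layer mapping is a hypothetical lookup that reuses the layer names from the Figure 4 caption, and the dependency match is deliberately naive: it only handles literal resource names, not names built from `parameters()` or `variables()` expressions.

```python
import json

# Hypothetical mapping from ARM resource-provider namespaces to diagram layers.
LAYER_BY_PROVIDER = {
    "Microsoft.Web": "Web Tier",
    "Microsoft.Sql": "Data Layer",
    "Microsoft.DocumentDB": "Data Layer",
    "Microsoft.App": "Container Platform",
    "Microsoft.ContainerRegistry": "Container Platform",
    "Microsoft.OperationalInsights": "Observability & Logging",
    "Microsoft.Insights": "Observability & Logging",
}


def template_to_graph(template: dict) -> tuple[dict, list]:
    """Return (layers, edges) for the top-level resources of an ARM template."""
    resources = template.get("resources", [])
    names = [res["name"] for res in resources]

    layers: dict[str, list[str]] = {}
    edges: list[tuple[str, str]] = []
    for res in resources:
        provider = res["type"].split("/")[0]
        layer = LAYER_BY_PROVIDER.get(provider, "Other")
        layers.setdefault(layer, []).append(res["name"])
        # Naive match: a dependsOn entry points at a resource if the entry
        # text contains that resource's literal name.
        for dep in res.get("dependsOn", []):
            for target in names:
                if target != res["name"] and target in dep:
                    edges.append((res["name"], target))
    return layers, edges


SAMPLE = json.loads("""
{
  "resources": [
    {"type": "Microsoft.Sql/servers", "name": "sql-main"},
    {"type": "Microsoft.OperationalInsights/workspaces", "name": "logs"},
    {"type": "Microsoft.Web/sites", "name": "web-app",
     "dependsOn": ["[resourceId('Microsoft.Sql/servers', 'sql-main')]", "logs"]}
  ]
}
""")

layers, edges = template_to_graph(SAMPLE)
print(layers)
print(edges)
```

Each edge is a `(dependent, dependency)` pair, which maps directly onto a connection drawn between two nodes on the canvas.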
Well-Architected Framework Validation

Validate your architecture against all five WAF pillars — Security, Reliability, Performance Efficiency, Cost Optimization, and Operational Excellence. The validator provides:

- An overall WAF score with pillar-level breakdowns
- Specific findings with severity levels
- Actionable recommendations you can select and apply

Select the recommendations you agree with, and the AI regenerates an improved architecture incorporating those changes.

Figure 5. WAF validation results showing the overall score, per-pillar breakdowns, and individual findings with severity badges. Tick the recommendations you want and the AI rebuilds the diagram with those changes applied.

Multi-Model Comparison

Run the same architecture prompt through multiple AI models side-by-side and compare:

- Architecture Comparison: service counts, connection counts, groups, token usage, and latency
- Validation Comparison: WAF scores across models, severity breakdowns, and finding counts
- Apply Winner: pick the best result and apply it to the canvas with one click
- Present Critique: a talking avatar narrates the AI-generated ranking with live closed captions

Figure 6. Multi-model comparison. Top: select the models and reasoning effort, then enter the prompt. Bottom: side-by-side results across all selected models with service counts, latency, token usage, and Fastest / Cheapest / Most Thorough badges.

Multi-Region Cost Estimation

Get cost estimates from the Azure Retail Prices API across 8 Azure regions: East US 2, Australia East, Canada Central, Brazil South, Mexico Central, West Europe, Sweden Central, and Southeast Asia. Features include:

- Color-coded cost legend (green / yellow / red thresholds)
- SKU and tier information for each service
- Export options: CSV, JSON, plain-text summary, and an analysis report with top cost drivers, Reserved Instance flags, and a ranked multi-region comparison table

Figure 7.
The cost legend overlay shows per-service pricing with color-coded thresholds. The region selector in the toolbar lets you re-price the entire architecture in any of eight Azure regions.

Deployment Guide Generation with Bicep

Generate step-by-step deployment documentation including:

- Prerequisites and Azure resource requirements
- Step-by-step deployment instructions
- Bicep templates for each service (Infrastructure as Code)
- Post-deployment verification steps
- Security configuration recommendations

Figure 8. Each generated Deployment Guide opens with the architecture name, an estimated deployment time, and a prerequisites checklist covering subscription roles, CLI versions, Microsoft Entra ID permissions, and region requirements, followed by numbered, copy-ready deployment steps.

Figure 9. The Infrastructure as Code section produces a main.bicep orchestrator plus a per-service module (Log Analytics, Key Vault, Cosmos DB, SQL Database, Event Hubs, Azure Functions, and more). The Download All Templates button packages everything into a ready-to-deploy folder.

Workflow Animation & Avatar Presenter

Visualize how data flows through your architecture with step-by-step animations that highlight services on the canvas as each step plays. When the Azure Speech Service is configured, a photorealistic talking avatar can narrate the workflow or present model comparison results, with live word-by-word closed captions in a draggable, resizable panel.

Figure 10. A workflow step is highlighted on the canvas as the Avatar Presenter narrates that step. Live word-by-word closed captions appear in a draggable, resizable panel, useful for accessibility and stakeholder demos.

Export Options

Figure 11. A single-slide PowerPoint export, available in dark or light theme, ready to drop straight into a stakeholder deck.
| Format | Use Case |
| --- | --- |
| PNG | Documentation, presentations |
| SVG | Scalable vector graphics |
| PPTX | Single PowerPoint slide (dark or light theme) |
| Draw.io | Edit in diagrams.net |
| JSON | Backup, version control |
| CSV / ZIP | Cost analysis with multi-region comparison |

Highlights

The Azure Architecture Diagram Builder unifies the architecture design lifecycle in a single tool:

- End-to-end workflow: from natural-language description to deployable Bicep templates without tool switching
- Official Azure icons: 714 icons across 29 categories, mapped directly from the Azure service catalog
- Live pricing: queries the Azure Retail Prices API at design time rather than relying on static estimates
- WAF-integrated validation: architectural best practices built into the design loop rather than applied after the fact
- Multi-model flexibility: choose the AI model that best suits each task, with fast models for iteration and reasoning models for complex designs
- Open source: the source code is available for customization and contribution

One-Command Deploy with Azure Developer CLI

The fastest way to get your own instance running is with azd:

    # Install azd (once)
    brew tap azure/azd && brew install azd   # macOS
    winget install microsoft.azd             # Windows

    # Clone, configure, and deploy
    git clone https://github.com/Arturo-Quiroga-MSFT/azure-architecture-diagram-builder
    cd azure-architecture-diagram-builder
    azd auth login
    azd env set AZURE_OPENAI_ENDPOINT "https://your-resource.openai.azure.com/"
    azd env set AZURE_OPENAI_API_KEY "your-key"
    azd up   # Provisions infrastructure + builds + deploys (~8 min)

azd up provisions the following via Bicep:

| Resource | Purpose |
| --- | --- |
| Azure Container Registry | Stores the Docker image |
| Azure Container Apps | Runs the app (nginx + token server) |
| Log Analytics + Application Insights | Monitoring and telemetry |
| Azure Speech (S0) | Avatar Presenter (optional, keyless auth via managed identity) |

Try It Today

The Azure Architecture Diagram Builder is available now:

- Live demo: https://aka.ms/diagram-builder
- Source code: GitHub repository
- Documentation: See the Getting Started Guide for detailed setup instructions

We welcome feedback and contributions. Use the GitHub Issues page to report bugs, suggest features, or share your experience.

Tags: artificial intelligence · application · apps & devops · well architected · infrastructure
arturoqu
Jul 09, 2026 · Azure Architecture Blog
Revolutionizing Document Intelligence: Scaling Construction Industries with AI-Driven Extraction
Introduction Generative AI (GenAI) is poised to transform the construction industry by addressing chronic challenges such as low productivity, cost overruns, schedule delays, and labor shortages. By automating the analysis of drawings, specifications, contracts, and project documentation, GenAI can reduce manual effort, accelerate decision-making, and improve coordination across architects, engineers, contractors, and suppliers. Industry studies indicate that AI-powered workflows can increase productivity by 20–40% in planning, engineering, and administrative functions while reducing costly rework and errors. The result is faster project delivery, improved resource utilization, lower costs, and more predictable project outcomes. A major opportunity for GenAI in construction lies in its ability to unlock the vast amount of information trapped within AutoCAD drawings, architectural plans, BIM models, specifications, and engineering documents. Today, project teams spend countless hours manually reviewing drawings, performing quantity takeoffs, identifying dependencies, and translating design intent into actionable work packages for downstream trades. GenAI can automate this process by extracting and interpreting dimensions, materials, quantities, assemblies, and building components directly from design artifacts, then intelligently distributing that information to foundation, framing, roofing, insulation, MEP, and finish teams. This creates a digital thread from design through execution, eliminating manual handoffs, reducing human error, and ensuring every stakeholder works from a single source of truth. The impact extends beyond productivity gains—GenAI enables more accurate material forecasting, streamlined procurement, reduced waste, faster response to design changes, fewer change orders, and greater confidence that the architect's vision is executed precisely in the field. 
In an industry where margins are tight and inefficiencies are costly, GenAI has the potential to fundamentally redefine how construction projects are planned, coordinated, and delivered.

This article specifically demonstrates how organizations can leverage Azure AI services—including Azure Content Understanding, Azure AI Foundry, Azure Blob Storage, and Azure OpenAI—to extract, understand, and operationalize information from construction drawings and project documentation. The solution illustrates how Azure's AI platform can transform unstructured design artifacts into actionable intelligence that improves productivity, reduces risk, accelerates procurement, and enables more efficient execution across the entire construction lifecycle.

This transformation is now achievable through a hybrid AI architecture. By combining structured layout understanding models with Generative AI reasoning capabilities, organizations can build highly scalable, intelligent extraction systems that meet the rigorous safety and compliance standards of the construction sector.

The Evolution from GenAI Approach to Deterministic Precision

Starting with a Generative AI–driven approach to extract structured fields from documents is an effective initial strategy. It accelerates early-stage extraction without requiring large, labeled datasets, while simultaneously enabling the structured data collection needed to train deterministic models, which typically require thousands of annotated samples. This approach delivers immediate value by rapidly identifying relevant data patterns in documents and uncovering key factors that influence extraction accuracy, such as document quality, layout complexity, and multi-section ambiguity. At the same time, it naturally builds the dataset necessary to transition toward a more scalable and repeatable solution.
However, while powerful for contextual reasoning across document sections, Generative AI is inherently probabilistic and sensitive to input variability. For enterprise-grade reliability, precision, and repeatable structured document extraction, a complementary approach is required. The optimal solution is a hybrid model that combines the strengths of both:

- Azure Content Understanding provides precise, consistent field extraction with per-field confidence scores at scale.
- Azure OpenAI GPT-5.2 (generative) adds contextual reasoning, validates ambiguous fields, fills extraction gaps, and interprets complex multi-section relationships.
- AI Agent (bounded triage) handles exception cases with structured CORRECT/ACCEPT/ESCALATE decisions before human escalation.

Together, they form a superior system—delivering higher accuracy, reduced ambiguity, bounded AI cost, and stronger auditability in complex real-world conditions.

Note: AI cannot compensate for inconsistent input data. Standardized document schemas and operational discipline remain prerequisites for reliable automation.

Solution Components and Architecture

The solution follows a modular, event-driven architecture that combines deterministic document understanding and Generative AI to enable scalable, intelligent extraction workflows. At a high level, documents are ingested, deduplicated, processed through Azure Content Understanding for primary extraction, enhanced with GPT-5.2 for gap-fill verification, validated against business rules, and routed through a confidence-based decision system before persistence.

The code repository for the solution can be found here.

Conceptual Architecture

Azure Architecture

The pipeline execution follows this flow: a document is uploaded to Azure Blob Storage, triggering the orchestrator. The pipeline checks for duplicates via SHA-256 hash against Cosmos DB.
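
The duplicate check in that flow can be sketched in a few lines. This is an illustration under stated assumptions, not the solution's code: the `DedupIndex` class is a hypothetical in-memory stand-in for the Cosmos DB lookup, and the blob names and byte strings are made-up examples. Only the use of a SHA-256 content hash comes from the description above.

```python
import hashlib


class DedupIndex:
    """In-memory stand-in for the Cosmos DB hash lookup (hypothetical helper)."""

    def __init__(self) -> None:
        self._seen: dict[str, str] = {}  # content hash -> first blob name seen

    def check(self, blob_name: str, content: bytes) -> tuple[bool, str]:
        """Return (is_duplicate, sha256_hex) and record documents not seen before."""
        digest = hashlib.sha256(content).hexdigest()
        if digest in self._seen:
            return True, digest
        self._seen[digest] = blob_name
        return False, digest


index = DedupIndex()
first = index.check("plans/site-a.pdf", b"%PDF-1.7 example bytes")
again = index.check("uploads/copy-of-site-a.pdf", b"%PDF-1.7 example bytes")
other = index.check("plans/site-b.pdf", b"%PDF-1.7 different bytes")
print(first[0], again[0], other[0])  # False True False
```

Hashing the file content rather than the file name means a re-upload under a different name is still recognized as the same document.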
New documents are submitted to Azure Content Understanding, which returns structured fields with per-field confidence scores. The AI Schema Mapper then identifies gaps—fields that are missing or have confidence below 0.70—and sends only those to GPT-5.2 for verification. Results are normalized, validated against cross-field business rules, and routed based on aggregate confidence. Throughout the pipeline, built-in feedback loops—quality filtering, validation checks, and confidence gates—ensure that only high-confidence results are persisted automatically, enabling a reliable and production-ready extraction system.

- Azure Blob Storage — Primary storage for source PDFs and extraction artifacts. Standard_LRS, Hot tier, HTTPS-only with SAS-secured access for Content Understanding.
- Azure Content Understanding — Primary deterministic extractor with custom analyzer supporting 100+ configurable fields. Returns per-field confidence scores (0.0–1.0) plus raw markdown text. Non-LLM, repeatable, and auditable.
- Azure AI Foundry / OpenAI (GPT-5.2) — Bounded gap-fill verifier invoked only for missing or low-confidence fields (typically 10–20% of total). Temperature 0.0, JSON response format enforced, schema-aware prompting with domain rules.
- Azure Cosmos DB (Serverless) — Document persistence with SHA-256 deduplication, version increment on re-processing, and partition-by-document-type for efficient querying. Pay-per-request scales from zero.
- Azure Service Bus (Basic) — Event-driven queue integration with `document-processing` and `human-review` queues for processing triggers and escalation routing.
- Application Insights + OpenTelemetry — End-to-end observability with per-stage telemetry events, custom metrics (fill_rate, record_confidence, extraction_duration_ms), and distributed tracing.

Cost Impact of Hybrid Approach

| Metric | CU-Only | GPT-Only | Hybrid (This Architecture) |
| --- | --- | --- | --- |
| Cost per document | ~$0.01 | $0.15–0.30 | $0.03–0.05 |
| Determinism | 100% | Variable | 95%+ |
| Accuracy | 75–80% | 80–90% | 90–95% |
| Auditability | Full | Limited | Per-field source attribution |

Cost savings: 60–80% reduction compared to GPT-only by limiting LLM to gap fields.

Security and Enterprise Considerations

Azure Blob Storage: Storage accounts can be secured by minimizing public exposure, enforcing strong identity‑based access, protecting data, and continuously monitoring for threats. Organizations should use Private Endpoints and disable public network access wherever possible, authenticate users and applications with Microsoft Entra ID instead of shared keys, and apply least‑privilege Azure RBAC with managed identities. Data should be encrypted in transit (TLS 1.2+) and at rest using Microsoft‑managed or customer‑managed keys stored in Azure Key Vault, while Microsoft Defender for Storage, logging, soft delete, backups, and Azure Policy should be enabled to detect threats, support recovery, and enforce compliance at scale. Content Safety can be called from the application layer to block uploads based on image content. Staging containers can be used to isolate untrusted uploads. Content Safety provides signals; your app enforces policy.

Azure Content Understanding / AI Vision: Azure AI services support enterprise-grade security through Microsoft Entra ID–based authentication and Azure RBAC, ensuring only authorized applications can access extraction models. Network isolation can be enforced using Virtual Network (VNet) integration and Private Link to restrict public internet exposure. All data transmitted is encrypted in transit and at rest.
Microsoft Defender for Cloud provides continuous security posture visibility across these AI workloads.

Azure OpenAI: Govern which models are approved for use and protect model artifacts and training data from unauthorized access through strong identity, network, encryption, and logging controls. AI applications should be designed with layered defenses, including multi‑stage content filtering, safety meta‑prompts, and least‑privilege permissions for agents and plugins to reduce the risk of prompt injection, data leakage, and unintended actions. High‑risk AI operations should include human‑in‑the‑loop review to prevent autonomous execution of harmful or incorrect outcomes. Organizations must continuously monitor AI systems for misuse, anomalous behavior, and data exfiltration, and they should perform ongoing AI red teaming to identify vulnerabilities such as jailbreaking, adversarial inputs, and model manipulation before they can be exploited.

Azure Cosmos DB: Azure Cosmos DB enhances network security by supporting access restrictions via Virtual Network (VNet) integration and secure access through Private Link. Data protection is reinforced by integration with Microsoft Purview, which helps classify and label sensitive data, and Defender for Cosmos DB to detect threats and exfiltration attempts. Cosmos DB ensures all data is encrypted in transit using TLS 1.2+ (mandatory) and at rest using Microsoft-managed or customer-managed keys (CMKs).

Azure Functions / Compute: Secured with Entra ID authentication and managed identities, least-privilege RBAC, HTTPS-only access, private endpoints, VNet integration, and Key Vault for secrets. Hardened with Azure Policy, Defender for Cloud, and centralized logging.

Microsoft Foundry: Microsoft Foundry supports robust identity management using Azure Role-Based Access Control (RBAC) to assign roles within Microsoft Entra ID, and it supports Managed Identities for secure resource access.
Conditional Access policies allow organizations to enforce access based on location, device, and risk level. For network security, Azure AI Foundry supports Private Link, Managed Network Isolation, and Network Security Groups (NSGs) to restrict resource access. Data is encrypted in transit and at rest using Microsoft-managed keys or optional Customer-Managed Keys (CMKs). Azure Policy enables auditing and enforcing configurations for all resources deployed in the environment. Additionally, Microsoft Entra Agent ID extends identity management and access capabilities to AI agents. AI agents created within Microsoft Foundry are automatically assigned identities in a Microsoft Entra directory, centralizing agent and user management in one solution. AI Security Posture Management can be used to assess the security posture of AI workloads. Defender for AI Services provides threat protection and insights for your AI resources. Purview APIs enable Azure AI Foundry and developers to integrate data security and compliance controls into custom AI apps and agents. This includes enforcing policies based on how users interact with sensitive information in AI applications. Purview Sensitive Information Types can be used to detect sensitive data in user prompts and responses when interacting with AI applications.

DevOps Security: Security is further “shifted left” by integrating automated controls directly into CI/CD pipelines. GitHub Advanced Security for Azure DevOps provides dependency scanning, CodeQL-based static application security testing (SAST), and secret scanning to identify vulnerabilities and exposed credentials in code and third-party libraries. Infrastructure-as-code templates can be validated with Azure Policy and Microsoft Defender for Cloud, while pipeline protections such as protected branches and approvals reduce the risk of unauthorized changes.
DevOps environments can be hardened using Azure Key Vault for secrets management, Managed Identities and Microsoft Entra ID for least-privilege access, and monitoring through Azure Monitor. Microsoft Defender for Cloud DevOps Security provides centralized code‑to‑cloud visibility across Azure DevOps, GitHub, and GitLab, identifying risks in code, secrets, dependencies, and IaC and helping teams prioritize fixes early in CI/CD pipelines.
Related and Future Scenarios
Although document extraction serves as the initial use case, this architecture establishes a scalable pattern for many applications:
Insurance Claims Processing: Swap schema to claim fields; update CU analyzer for claim forms
Legal Contract Analysis: Schema for clauses, parties, dates; add NER in normalization
Healthcare Medical Records: HIPAA-compliant Cosmos; schema for diagnoses, medications, vitals
Financial Document Processing: Schema for transactions, accounts; add currency normalization
Engineering/Construction Plans: Schema for dimensions, materials, specifications
Digital Twin Integration: Feed extracted data into asset models for real-time facility visualization
Predictive Analytics: Track extracted values over time for trend detection and forecasting
Conclusion
Modernizing document extraction is not simply about applying AI; it requires aligning technology, operational discipline, and data quality. Early exploration using Generative AI enabled rapid learning and feasibility validation. However, a production-grade solution must be built on structured layout understanding models supported by standardized schema definitions and operational controls. By combining primary structured extraction with Generative AI reasoning for bounded gap-fill verification, organizations can achieve scalable, repeatable, and auditable extraction processes. This hybrid approach enables reduced manual effort, lower error rates, and the transition from batch manual processing to intelligent, automated workflows.
The result is not just an automated extraction tool, but a scalable AI architecture for modern document intelligence—adaptable to any industry, any document type, and any structured data need. Contributors: This article is maintained by Microsoft. It was originally written by the following contributors. Gaurav Bhardwaj | Senior Cloud Solution Architect – US Customer Success  Manasa Ramalinga | Senior Principal Cloud Solution Architect – US Customer Success  Abed Sau | Principal Cloud Solution Architect – US Customer Success
gauravbhardwaj
Jun 18, 2026 Place Azure Architecture Blog
415 Views
0 likes
0 Comments
Azure Cost Optimisation
One of the 5 pillars of WAF, Cost optimisation is a fairly popular choice, as everyone wants to save money in Azure and realise value. I share my experience in cost optimisation after delivering Cost Optimisation Assessments at a rate of about once a fortnight for our large enterprise customers.
Marc Kean
Jun 03, 2026 Place Azure Architecture Blog
197K Views
16 likes
6 Comments
Azure Course Blueprints
Each Blueprint serves as a 1:1 visual representation of the official Microsoft instructor‑led course (ILT), ensuring full alignment with the learning path. This helps learners: see exactly how topics fit into the broader Azure landscape, map concepts interactively as they progress, and understand the “why” behind each module, not just the “what.” Formats Available: PDF · Visio · Excel · Video Every icon is clickable and links directly to the related Learn module. Layers and Cross‑Course Comparisons For expert‑level certifications like SC‑100 and AZ‑305, the Visio Template+ includes additional layers for each associate-level course. This allows trainers and students to compare certification paths at a glance: 🔐 Security Path SC‑100 side‑by‑side with SC‑200, SC‑300, SC‑500 🏗️ Infrastructure; Ai & Dev Path AZ‑305 alongside Ai-103, AZ‑104, Ai‑200, AZ‑700, AZ‑140 This helps learners clearly identify: prerequisites, skill gaps, overlapping modules, progression paths toward expert roles. Because associate certifications (e.g., SC‑300 → SC‑100 or AZ‑104 → AZ‑305) are often prerequisites or recommended foundations, this comparison layer makes it easy to understand what additional knowledge is required as learners advance. Benefits for Students 🎯 Defined Goals Learners clearly see the skills and services they are expected to master. 🔍 Focused Learning By spotlighting what truly matters, the Blueprint keeps learners oriented toward core learning objectives. 📈 Progress Tracking Students can easily identify what they’ve already mastered and where more study is needed. 📊 Slide Deck Topic Lists (Excel) A downloadable .xlsx file provides: a topic list for every module, links to Microsoft Learn, prerequisite dependencies. This file helps students build their own study plan while keeping all links organized. 
Download links
Associate Level PDF - Demo Visio Contents
AZ-104 Azure Administrator Associate R: 12/14/2023 U: 12/17/2025 Blueprint Demo Video Layer of Visio+ Excel
Ai-103 Developing AI Apps and Agents on Azure R: 05/20/2026 Blueprint Includes Ai-200 Layer of Visio+
Ai-200 Azure Developer Associate R: 11/05/2024 U: 12/17/2025 Blueprint Demo Layer of Visio+ Excel
SC-500 Azure Security Engineer Associate R: 01/09/2024 U: 05/20/2026 Blueprint Demo Layer of Visio+ Excel
AZ-700 Azure Network Engineer Associate R: 01/25/2024 U: 12/17/2025 Blueprint Demo Layer of Visio+ Excel
SC-200 Security Operations Analyst Associate R: 04/03/2025 U: 04/09/2025 Blueprint Demo Layer of Visio+ Excel
SC-300 Identity and Access Administrator Associate R: 10/10/2024 Blueprint Demo Layer of Visio+ Excel
Specialty PDF Visio
AZ-140 Azure Virtual Desktop Specialty R: 01/03/2024 U: 12/17/2025 Blueprint Demo Layer of Visio+ Excel
Expert level PDF Visio
AZ-305 Designing Microsoft Azure Infrastructure Solutions R: 05/07/2024 U: 05/20/2026 Blueprint Demo Visio+ AZ-104 AZ-700 AZ-140 Ai-103 Ai-200 Excel
SC-100 Microsoft Cybersecurity Architect R: 10/10/2024 U: 05/20/2026 Blueprint Demo Visio+ SC-500 SC-300 SC-200 Excel
Skill based Credentialing PDF
AZ-1002 Configure secure access to your workloads using Azure virtual networking R: 05/27/2024 Blueprint Visio Excel
AZ-1003 Secure storage for Azure Files and Azure Blob Storage R: 02/07/2024 U: 02/05/2024 Blueprint Excel
Subscribe if you want to be notified of new releases or updates.
Author: Ilan Nyska, Microsoft Technical Trainer
My email ilan.nyska@microsoft.com
LinkedIn https://www.linkedin.com/in/ilan-nyska/
I’ve received so many kind messages, thank-you notes, and reshares, and I’m truly grateful.
But here’s the reality: 💬 The only thing I can use internally to justify continuing this project is your engagement through this survey: https://lnkd.in/gnZ8v4i8
___
Benefits for Trainers: Trainers can follow this plan to design a tailored diagram for their course, filled with notes. They can construct this comprehensive diagram during class on a whiteboard and continuously add to it in each session. This evolving visual aid can be shared with students to enhance their grasp of the subject matter.
Explore Azure Course Blueprints! | Microsoft Community Hub
Visio stencils Azure icons - Azure Architecture Center | Microsoft Learn
___
Curious how grounding Copilot in Azure Course Blueprints transforms your study journey into a smarter, more visual experience?
🧭 Clickable guides that transform modules into intuitive roadmaps
🌐 Dynamic visual maps revealing how Azure services connect
⚖️ Side-by-side comparisons that clarify roles, services, and security models
Whether you're a trainer, a student, or just certification-curious, Copilot becomes your shortcut to clarity, confidence, and mastery.
Navigating Azure Certifications with Copilot and Azure Course Blueprints | Microsoft Community Hub
Azure Course Blueprints + Demo Deploy
Demos are essential for achieving end‑to‑end understanding of Azure. To reduce preparation overhead, we collaborated with Peter De Tender to align each Blueprint with the official Trainer Demo Deploy scenarios. With a single click, trainers can deploy the full environment and guide learners through practical, aligned demonstrations. https://aka.ms/DemoDeployPDF
Ilan_Nyska
May 27, 2026 Place Azure Architecture Blog
37K Views
15 likes
20 Comments
Cloud Native Platforms: Evolve
Audience: Engineering leaders, platform architects, senior developers exploring how to operationalise AI in their teams Reading time: 8 minutes Series: Cloud Native Platforms. Build, Run, Evolve. This is Part 3 of 3. Cloud helped us scale infrastructure. AI is starting to do the same thing for the work around the code: the planning, the testing, the release communication, the incident triage, the writing that surrounds writing software. The conversation about AI in software has narrowed too quickly to "Copilot in the editor". The bigger story is happening across the lifecycle. Planning, design, development, testing, release, and operations are all being augmented at once. The platforms that adopt AI well are not the ones with the most usage. They are the ones with the clearest discipline around how it is used. This post is about that discipline. AI is changing how we engineer, not how we type AI is not changing how we write code. It is changing how we engineer software. Code generation is the surface. Underneath it, AI is reshaping the unit of leverage. The question is no longer how fast a developer can type. It is how well a workflow can be expressed as a reusable engineering asset. Six disciplines determine whether AI moves the needle on outcomes or just adds another tool to the stack. Figure 1. AI across the SDLC. Each phase has clear AI assist points and clear human-owned validations. The boundary is not negotiable. It is the design. 1. From assistance to augmentation Early AI tools focused on assisting individual developers. Code suggestions. Autocomplete. Quick refactors. The value was real but bounded by the editor. The shift now is into structured workflows that span the lifecycle. The unit of leverage is no longer a single suggestion. It is a sequence of actions executed reliably across phases. ("Agentic" later in this post means a system that makes its own next-step decisions inside guardrails. 
A workflow follows a fixed sequence; an agent chooses the path.)
Code generation has become baseline, not differentiator
Workflow generation is where the largest gains live
Multi-step assistance with explicit human checkpoints
Context that travels across tools, not just within one
In practice
The pattern that works: start with the single highest-volume writing task on the team (commit messages, code review comments, release notes, postmortem first drafts) and turn the AI assist for that task into a shared workflow rather than each individual's private trick. The cost is one engineer's afternoon documenting the workflow and the eval set. The return is that every engineer on the team inherits the work, and the task that used to consume an engineer's morning every two weeks becomes a background step in the release process. Workflow generation, not faster typing, is where the gains compound across a team.
Code suggestions help one developer. Reusable workflows help the next ten.
2. AI across the SDLC, with guardrails
AI now has a useful role at every phase of delivery. The role is different at each phase, and the guardrails are different too.
Phase | What AI helps with | What humans must validate
Plan | Breaking down requirements, drafting acceptance criteria | Domain context, business priorities, customer impact
Build | Code generation, refactoring, scaffolding | Architectural fit, security boundaries, performance
Test | Test case generation, edge case discovery | Coverage of business-critical paths, regulatory cases
Release | Release notes, changelog summaries, communication drafts | Accuracy, tone, customer-facing claims
Operate | Log triage, incident summaries, runbook drafts | Root cause attribution, action item ownership
The guardrails are not optional decoration. They are the design.
In practice The pattern that works: stage AI assists for release communication (changelog drafting, customer-facing release notes, internal release announcements) and require a human review before anything goes out. The draft arrives consistently, faster than a human could produce, and easier to compare across releases. The reviewer is not eliminated; the reviewer is moved from author to editor, which is where their judgment actually matters. Teams that adopt this pattern stop missing release-note deadlines and stop publishing inconsistent communication across products. 3. From prompts to reusable assets Many teams begin with prompt experimentation. Individuals find techniques that work for their tasks. The result is a patchwork of personal practices that do not survive a team change. The compounding value comes when prompts mature into reusable engineering assets. Figure 2. The maturity model from prompts to agents. The value compounds at the workflow stage and accelerates at the agent stage. The disciplines that make agents safe are the same ones that made workflows reliable. The maturity stages, in order of leverage: Prompts: ad-hoc, individual, hard to share Templates: parameterised prompts versioned with the project Workflows: multi-step sequences with clear inputs, outputs, checkpoints Agents: autonomous task chains operating within explicit guardrails The diagram is a maturity ladder, not a graduation. In practice teams operate at all four stages simultaneously for different tasks. A senior engineer may use a one-off prompt to explore a refactor, run a versioned template for commit messages, hand off to a workflow for release notes, and trigger an agent for routine PR triage, all in the same hour. The point of the ladder is not to leave earlier stages behind. It is to know which stage a given task belongs to and to invest accordingly. 
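As a concrete picture of the Templates stage, a parameterised prompt can live in the repository as a small, versioned object. The sketch below is a minimal illustration in Python, not a prescribed format: the PromptTemplate class, its fields, the version scheme, and the PLAT-123 ticket ID are all hypothetical names chosen for the example.

```python
from dataclasses import dataclass
from string import Template


@dataclass(frozen=True)
class PromptTemplate:
    """A parameterised prompt kept in the repository and reviewed like code.

    All field names here are illustrative assumptions, not a standard.
    """
    name: str
    version: str   # bumped on every reviewed change
    owner: str     # team accountable for the template
    body: Template

    def render(self, **params: str) -> str:
        # Template.substitute raises KeyError when a placeholder is missing,
        # so an incomplete call fails loudly instead of sending a broken prompt.
        return self.body.substitute(**params)


COMMIT_MESSAGE = PromptTemplate(
    name="commit-message",
    version="1.2.0",
    owner="platform-team",
    body=Template(
        "Write a commit message for the following diff.\n"
        "Ticket: $ticket_id\n"
        "Style: imperative mood, subject line of at most 72 characters.\n"
        "Diff:\n$diff"
    ),
)

prompt = COMMIT_MESSAGE.render(ticket_id="PLAT-123", diff="- old\n+ new")
```

Keeping the name, version, and owner next to the prompt text is what lets the template be reviewed and owned like any other engineering artefact; the rendering logic itself stays deliberately trivial.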
In practice The pattern that works: pick the three prompts your team uses every week, codify them as parameterised templates in the same repository as the application code, and treat them as engineering artefacts (reviewed, versioned, owned). New engineers inherit the team's accumulated practice instead of building their own from scratch. Quality becomes consistent because the variance between individuals shrinks. Investment pays back in weeks, not quarters, and the maturity ladder keeps producing returns as the team moves from templates to workflows to agents. 4. Agentic delivery, with guardrails that survive a security review The next stage is agentic. AI executes sequences of tasks within a defined scope. The risk is not that the agent will fail. It is that the system around the agent will not catch the failure, and that the failure modes are different in kind from traditional automation. Agents are non-deterministic, they can be manipulated through their inputs, and their actions can have side effects in systems the team does not own. Five guardrails make agentic delivery safe. The first four are necessary. The fifth is what carries the agent through a security review at a regulated enterprise. Identity and scope: the agent runs as a managed identity (or scoped service principal) with the smallest set of permissions that lets it do its job. Permissions are expressed as allowlists, not denylists. Tools fetched at runtime are subject to the same identity boundary as the agent itself. Input quarantine: anything the agent reads from a user-controlled source (work item bodies, PR descriptions, customer tickets) is treated as untrusted text. The agent does not execute instructions found in fetched content, and tool calls are validated against an output schema before execution. This is the prompt-injection mitigation, and it is the most common gap in agentic systems shipped today. 
Cost and blast-radius caps: every run has a maximum token budget, a maximum number of tool calls, and a maximum spend. Exceeding any cap aborts the run cleanly. Without caps, scoped credentials are not enough to bound the damage. Evaluations and traceability: agents are evaluated against a fixed test set before deployment, and on every prompt or model change. Every action is logged with inputs, outputs, the model and prompt versions used, and the reasoning trace where the model exposes one. Logs are redacted for secrets and personally identifiable information at write time. Reversibility taxonomy: actions are categorised by reversibility, not asserted to be reversible in general. A draft write to a private store is reversible. A post to a customer-facing channel is not reversible (deletion does not unsend). A database update may be reversible by a compensating transaction or not at all. Irreversible actions require human approval at the boundary, before they happen, not after. The agent is allowed to draft and stage. The human is the only one who is allowed to make the move that cannot be undone. In practice The pattern that works: start with one low-risk agent (release-notes drafter, PR triage assistant) running on read-only inputs, write-only-to-drafts permissions, and a hard cost cap per run. Require explicit human approval at the irreversible step. Wire up an evaluation set on day one, and rerun it on every prompt or model change. Treat regressions as failures, not warnings. The first agent the team ships is rarely the most valuable; it is the rehearsal that establishes the controls every later agent inherits. Teams that skip this rehearsal end up with an agent in production that no one feels safe extending. Implementation note An agent without a reversibility taxonomy and a regression eval set is a liability. 
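Two of the guardrails above, the per-run caps and the approval gate on irreversible actions, are small enough to sketch in code. The Python below is a minimal sketch under stated assumptions: RunBudget, RunAborted, Reversibility, and authorise are hypothetical names invented for this example, and the two-value taxonomy is a simplification of the categories described above.

```python
from dataclasses import dataclass
from enum import Enum


class Reversibility(Enum):
    REVERSIBLE = "reversible"
    NOT_REVERSIBLE = "not-reversible"


class RunAborted(Exception):
    """Raised when a run exceeds a cap or attempts an unapproved irreversible action."""


@dataclass
class RunBudget:
    """Per-run caps; names and structure are illustrative assumptions."""
    max_tokens: int
    max_tool_calls: int
    max_cost_usd: float
    tokens_used: int = 0
    tool_calls_made: int = 0
    cost_usd: float = 0.0

    def charge(self, tokens: int, cost_usd: float) -> None:
        """Record one tool call and abort the run if any cap is exceeded."""
        self.tokens_used += tokens
        self.tool_calls_made += 1
        self.cost_usd += cost_usd
        if self.tokens_used > self.max_tokens:
            raise RunAborted("token cap exceeded")
        if self.tool_calls_made > self.max_tool_calls:
            raise RunAborted("tool-call cap exceeded")
        if self.cost_usd > self.max_cost_usd:
            raise RunAborted("cost cap exceeded")


def authorise(action: str, reversibility: Reversibility, human_approved: bool) -> str:
    """Allow reversible actions; require prior human approval for the rest."""
    if reversibility is Reversibility.NOT_REVERSIBLE and not human_approved:
        raise RunAborted(f"'{action}' is irreversible and needs human approval first")
    return action


budget = RunBudget(max_tokens=80_000, max_tool_calls=20, max_cost_usd=0.40)
budget.charge(tokens=1_200, cost_usd=0.01)
staged = authorise("write-draft", Reversibility.REVERSIBLE, human_approved=False)
```

The approval check runs before the action, not after it, which mirrors the rule that the agent may draft and stage but only a human makes the move that cannot be undone.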
The discipline is the same one that made workflows reliable: scoped identity, idempotency, traceability, and a clear boundary between machine action and human decision. The YAML below is illustrative, not a runtime contract; it is meant to show the shape of the controls a real agent definition would carry, not the syntax of any specific platform.

# Agent run definition (illustrative; not a specific platform's syntax)
name: release-notes-drafter
trigger: pre-release
identity:
  type: managed-identity
  scope: tenant=<tenant-id> resource=release-tools/<app-id>
permissions:
  allow:
    - read: work-items in milestone (filter: state=Done)
    - read: pull-requests in milestone (filter: merged)
    - write: drafts/release-notes/${run-id}
  # Production channels are NOT in the allowlist. The agent cannot post.
limits:
  max_tokens_per_run: 80000
  max_tool_calls_per_run: 20
  max_runtime_seconds: 300
  max_cost_usd: 0.40
  on_exceeded: abort_with_partial_artifact
input_handling:
  treat_fetched_content_as: untrusted
  # Indirect prompt injection is mitigated by the layered discipline below,
  # not by a single feature flag. Each item is a separate control.
  enforce_instruction_hierarchy: true
  validate_tool_args_against_schema: true
  validate_outputs_against_schema: true
steps:
  - fetch: completed work items in milestone
  - draft: release notes from items
  - validate: required fields present
  - request-review:
      from: release-manager
      idempotency_key: ${milestone-id}-${draft-hash}
  - on-approval:
      action: post-to-internal-channel
      reversibility: not-reversible
      requires: explicit-human-click  # the agent does NOT click this
audit:
  log_inputs: true
  log_outputs: true
  redact:
    - secrets
    # Pattern-based: handles structured PII like emails, phones, IDs.
    - pii_patterns: [email, phone, national-id, payment-card, ip-address]
    # Entity-based: required for unstructured PII like names. Pattern alone
    # cannot redact a customer name without an entity-recognition step.
    - pii_entities: ner-based  # names, locations, organisations
  retain: 365_days  # tune to your audit policy, not to the demo
evaluation:
  test_set: tests/release-notes/eval-v3.jsonl
  on_prompt_change: rerun
  on_model_change: rerun
  fail_threshold: 5_percent_regression

5. Where AI still needs human judgment
AI has clear boundaries. The boundaries are not embarrassing. They are the design.
What must stay human-owned:
Architectural trade-offs and design decisions
Security validation and threat modelling
Correctness for business-critical and regulatory paths
Domain context that has not been written down
Accountability for outcomes, not just outputs
The goal is collaboration, not replacement. The teams that get the most value from AI are not the ones with the most automation. They are the ones with the clearest sense of where automation ends and judgment begins.
In practice
The pattern that works: name the human-owned items explicitly in the team's working agreement (architecture, security, regulatory correctness, accountability) and audit every AI workflow against that list. When a workflow asks the AI to make a decision in any of those categories, redesign it so the AI prepares the analysis and a human makes the call. Most teams over-trust AI for one of these areas in their first six months and learn the hard way. Naming the boundary up front prevents the lesson from being paid in production. The clarity is the value; the model behind the workflow is interchangeable.
6. Responsible AI is engineering work
The first five disciplines decide whether AI moves the needle. The sixth decides whether the platform can defend the choices it makes with AI. Responsible AI is the engineering practice of building systems whose AI behaviour is fair, transparent, accountable, and safe by design, not by audit after the fact. Treating it as a compliance checkbox at the end of the project is how teams end up shipping AI workflows that fail security review, embarrass the company, or harm users.
Six controls turn responsible AI from a policy into engineering work. These map directly onto the practices Microsoft and the broader industry have converged on, but the names matter less than the practice they enable. Fairness in inputs and outputs. The training data, eval set, and prompts are reviewed for systematic bias against any group the system serves. The eval set covers under-represented cases by design, not by accident, and regressions on those cases fail the build. Transparency to end users. When a user sees AI-generated content, they are told. When a decision is AI-assisted, the path from input to output is explainable in plain language, not just in a model card buried in documentation. Content safety filters. Inputs and outputs pass through safety classifiers (prompt injection, prohibited content, jailbreak patterns) before reaching the model and before reaching the user. Filtering decisions are logged and reviewable. Accountability ownership. Every AI workflow has a named owner who is accountable for its outcomes, not just its uptime. The owner has the authority to pause or roll back the workflow when harm is detected. Data minimisation and residency. The AI sees only the data it needs to do the task. Personally identifiable information and customer data are scoped, redacted, and kept inside the boundary the customer agreed to. Cross-tenant leakage is treated as a P1 incident, not a feature request. Harm evaluation alongside quality evaluation. The eval set measures harm potential (toxicity, hallucination on factual queries, leakage of confidential context) with the same rigour as it measures correctness. Both must pass for a release to ship. Figure 3. Responsible AI as a set of engineering controls around the AI workflow. The six controls fall into four categories: data discipline (fairness, data minimisation), model discipline (content safety, harm evaluation), deployment discipline (transparency to users), and governance (accountability ownership). 
All six are necessary; none is sufficient on its own. In practice The pattern that works: write the responsible AI plan before the first agent ships, not after the first incident. Pick one workflow that touches user data or generates customer-facing content, and use it as the reference implementation: fairness review on the eval set, content safety filters wrapping the model call, transparency annotation in the UI, redaction of identifying details in logs, harm evals running alongside quality evals on every change, and a named owner with explicit pause authority. The first such workflow takes longer to ship than the unconstrained version. Every workflow after it inherits the controls and ships faster than it would have without them. Teams that defer responsible AI to a future quarter end up retrofitting it under pressure, which is the most expensive way to do it. A scenario that ties it together Picture a platform team several months into using Copilot. Adoption is high. Productivity dashboards show gains. But defect rates are not improving and lead time is flat. Leadership asks the obvious question: is AI actually helping, or just feeling like help? The answer is not to stop using AI. It is to change how AI is measured. Move adoption metrics to the background. Move outcome metrics to the front: defect escape rate, lead time for change, change failure rate, mean time to recovery. In parallel, promote the individual prompts that have proved themselves to shared templates, and the templates to versioned workflows. Retrofit responsible AI controls onto the workflows that shipped first: content safety filters, harm evaluations alongside quality evaluations, transparency annotations on customer-facing output, and a named owner for each workflow. Six months later, the picture is different. Defect rate improves on the parts of the codebase where reusable workflows were introduced. Onboarding for new engineers is visibly faster. Release notes are consistent across teams. 
The shift is from celebrating use to tracking outcomes, and once the team measures what matters, the tooling decisions start making themselves. What teams get wrong The common pattern is measuring AI by usage, not by outcome. Adoption metrics tell you who tried Copilot. They do not tell you whether defects dropped, lead time improved, or release notes got better. The fix is not less AI. It is better measurement. The four metrics named in the scenario above (defect escape rate, lead time for change, change failure rate, mean time to recovery) come from the DORA research on software delivery performance and have become a useful default. Two warnings travel with them. First, attribution is hard: an AI workflow rolled out alongside a test refactor and a CI pipeline change cannot claim credit cleanly. Second, baselines matter more than headlines: a single quarter's improvement is not a trend, and a single team's gain is not the platform's gain. Outcome measurement done well needs a baseline window, an attribution discipline, and a kill criterion for workflows that are not paying back. Done poorly, it is just adoption metrics with better names. There is also the question of cost. AI usage carries a per-run token bill, an evaluation bill on every change, and (for agents) a cost cap that limits damage when something goes wrong. None of these are large compared to the engineering time saved when the workflow works. All of them are visible enough that a finance-aware reader will ask. Track them. Where to start The most concrete starter from this post: promote one personal prompt to a shared template. Pick the prompt that gets used most often (commit messages, code reviews, release notes, debugging assist), move it from someone's notes into the repository where the team versions everything else, and watch what changes when the next person on the team runs it. 
That is the smallest unit of the workflow shift this post argues for, and it is the step where prompts stop being individual practice and start becoming engineering assets. The shift The shift is from building systems to building smarter systems: AI does not replace engineers. It changes what an engineer's leverage looks like. The unit of value is the workflow, not the suggestion. The discipline that made platforms operable is the same discipline that makes AI useful. Responsible AI is not a compliance step. It is the sixth engineering discipline that lets the other five compound safely. The series ends here, but the arc is consistent across all three posts. The disciplines that make platforms scale are the same disciplines that make AI useful. Build with discipline. Run with discipline. Evolve with discipline. The tools change. The disciplines do not. Want to discuss? Where has AI moved the needle most in your delivery, and where has it disappointed you? Drop a comment with patterns you have seen in your environment. Every reply gets read. Previously in this series: Building Cloud Native Platforms That Scale: Patterns That Actually Work. Part 1 covered the design choices that make scale possible. Running Cloud Native Platforms: Why Day 2 Decides Everything. Part 2 covered the operational disciplines that decide production outcomes. This is the third and final post in the series.
KishoreKumarPattabiraman
May 23, 2026 Place Azure Architecture Blog
374 Views
1 like
1 Comment
Cloud Native Platforms: Run
Audience: SREs (Site Reliability Engineers), platform engineers, engineering managers running production systems Reading time: 8 minutes Series: Cloud Native Platforms. Build, Run, Evolve. This is Part 2 of 3. Most systems are designed thoughtfully. Most operations are inherited reactively. The systems that survive are not the ones built with the most care. They are the ones operated with the most discipline. Production has a way of revealing every shortcut taken during design and every assumption left unverified. This post is about what it takes to operate a platform once the build is done. How they are run, not how they are built Systems are not defined by how they are built. They are defined by how they are run. A well-designed system that is operated reactively will fail in production. A modestly designed system that is operated with discipline will outperform it. Five operational disciplines decide which side of that line a platform lives on. Each one is engineering work, not a checklist for someone else to handle. Figure 1. The incident lifecycle as a state machine. The states are not optional steps. They are the contract between the team and the system. 1. Observability is the backbone of reliability Without observability, every operation becomes a guess. As systems grow, the cost of guessing rises faster than the cost of seeing. Part 1 of this series argued that observability is a design property: instrumentation contracts, request id propagation, structured logging schemas. Production is where those design choices either pay off or do not. Strong observability in production is a contract that lets any engineer answer three questions in minutes: what failed, why it failed, and what the impact was. The shape of that contract matters more than the tool that implements it. (This three-question framing is community-popularised through the SRE community and writers such as Charity Majors. 
See Honeycomb's What is Observability for the canonical articulation of the three-pillars and question framing; the substance is older than the framing.) Dashboards organised around user journeys, not infrastructure components Service level indicators (SLIs: the specific measurements you care about, e.g., success rate, p99 latency) chosen from the user's perspective, not the database's Alerts that page only on burn-rate against an SLO (Service Level Objective: the target value of an SLI, e.g., 99.9% of requests complete in under 800ms over a rolling month) using a multi-window strategy. A short window catches fast burns; a long window catches slow drifts. This is what makes SLOs operational rather than decorative. Sampling and retention tuned for cost, but never for blind spots The distinction between MTTA (mean time to acknowledge: how fast someone notices) and MTTR (mean time to restore: how fast service returns) tracked separately. Conflating them hides whether the team's bottleneck is detection, response, or fix. In practice The pattern that works: rebuild the operational view around two or three user journeys (sign-in, place order, view history) rather than per-component charts. Tie alerts to error budget burn rather than raw threshold crossings. Track MTTA and MTTR separately so the team's actual bottleneck (detection, response, or fix) is visible. The investment is rethinking what to measure, not buying a new tool. The return is that incidents stop being discovered by customer complaints first. Teams that make this shift typically find their existing telemetry was sufficient; only the questions being asked of it were wrong. If a dashboard cannot answer "what is the user experiencing right now", it is not an observability dashboard. It is decoration. 2. Alerts are signals, not notifications More alerts do not mean better monitoring. In practice, the opposite is true. Once alerts outpace the team's ability to act, important signals start getting missed. 
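One way to keep pages tied to error-budget burn rather than raw thresholds is the multi-window check described under observability above. The Python below is a minimal sketch, assuming burn rate is defined as the observed error rate divided by the error budget (1 minus the SLO target); the function names, the window tuples, and the 14.4 threshold used in the example call are illustrative assumptions, not values prescribed by any tool.

```python
from typing import Tuple

Window = Tuple[int, int]  # (failed requests, total requests) observed in a window


def burn_rate(errors: int, requests: int, slo_target: float) -> float:
    """How many times faster than allowed the error budget is being consumed."""
    if requests == 0:
        return 0.0
    error_budget = 1.0 - slo_target  # e.g. roughly 0.001 for a 99.9% SLO
    return (errors / requests) / error_budget


def should_page(long_window: Window, short_window: Window,
                slo_target: float, threshold: float) -> bool:
    """Page only when both windows burn faster than the threshold.

    The long window shows the burn is sustained; the short window shows
    it is still happening, so a recovered blip does not page anyone.
    """
    return (burn_rate(*long_window, slo_target) >= threshold
            and burn_rate(*short_window, slo_target) >= threshold)


# Sustained and still ongoing: both windows exceed the threshold.
page_now = should_page((150, 10_000), (20, 1_000), slo_target=0.999, threshold=14.4)

# Sustained earlier but already recovered: the short window is quiet.
recovered = should_page((150, 10_000), (1, 1_000), slo_target=0.999, threshold=14.4)
```

Requiring both windows to agree is the design choice that keeps this kind of alert trustworthy: a single noisy minute cannot page on its own, and neither can an hour-old problem that has already cleared.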
Effective alerting works to a small set of rules:

- Severity that maps to action, not to technical category
- Ownership baked in, never inferred at runtime
- Thresholds tied to user impact, not raw metric values
- Noise treated as a defect, with a regular review cadence
- Suppression and grouping for known multi-alert patterns

In practice

The pattern that works: audit every alert against one test, "what action would I take in the next five minutes if this fires now?" Demote alerts with no answer to dashboards. Remove alerts where the answer is the same as another alert's. Group related alerts so one incident produces one page, not twelve. Most teams discover their alert volume drops by an order of magnitude after a thorough audit, and the alerts that remain start getting trusted again. Trust is the precondition for every other operational practice. Without it, on-call rotations decay into noise filtering and the real signals get missed.

Figure 2. From raw events to pages, in approximate orders of magnitude. The numbers vary by team and workload; what does not vary is that each stage needs to remove one to two orders of magnitude of noise. Teams that page on raw events end up with on-call rotations nobody trusts.

3. Incident response is a practiced muscle

Failures are inevitable. Unstructured response is not. The teams that recover quickly do not improvise during incidents. They follow a structure that has been practiced when nothing was on fire. The structure is intentionally simple, because incident time is the worst time to negotiate roles.

- Clear roles: incident lead, communications lead, scribe, subject matter expert (the RACI model, Responsible-Accountable-Consulted-Informed, adapted for incident response)
- Defined escalation paths with clear handoff criteria. Escalation means re-paging to a higher tier or specialist, not returning to detection. The lifecycle diagram in Figure 1 makes the distinction explicit.
- Runbooks for the top failure modes, kept short enough to actually be read
- Status communication on a fixed cadence, even when there is nothing new to say. Customer comms and internal comms are tracked separately.
- Blameless postmortems (focus on the system that allowed the failure, not the person who pushed the button) that produce action items the team actually completes
- Game days: scheduled exercises that simulate failure modes (region outage, dependency unavailability, traffic spike) under controlled conditions, so gaps in runbooks are found before incidents find them

In practice

The pattern that works: name the incident lead and the comms lead before the first message goes out. Write runbooks short enough to be scannable at 3 AM. Run blameless postmortems with action items that actually get tracked to completion. Schedule game days quarterly so the runbooks are exercised before real incidents. Teams that operate with this structure do not have more engineers; they have engineers who are not single points of failure during recovery. The deepest experts stay the deepest experts, but the platform stops depending on whether they happen to be online.

Implementation note

A short, well-structured runbook outperforms a long, exhaustive one. The goal during an incident is not to think. It is to act on a procedure that has been thought through in calmer times.

    # Runbook header pattern (keep it scannable in incident time)
    title: High latency on order API
    slo_protected:            # this runbook protects two SLOs
      - order-completion-success
      - order-completion-latency
    severity:                 # derived from burn rate, not declared
      fast_burn: P1           # 14.4x budget burn over 1 hour => page now
      slow_burn: P2           # 6x budget burn over 6 hours => investigate
    owner: payments-team
    indicators:               # triggers for evaluation, not severity
      - p99 (99th-percentile) latency exceeds the SLO target for 5 min
      - error rate exceeds the SLO target for 3 min on order-completion
    first_actions:
      - Open the order-journey dashboard. Confirm impact in business terms.
      - Check Service Bus queue depth and dead-letter rate (the most common cause of API latency under load is downstream backpressure)
      - Verify Cosmos DB RU/s saturation and partition hotspots
      - Inspect the most recent deployment for behavioural changes
    escalate_if:
      - Latency does not recover in 15 min
      - Error rate exceeds 5% (fast burn against the SLO)
      - Customer reports arrive before our own signals do
    rollback_path:
      - Feature flag "new-order-pipeline" can be disabled per-tenant
      - Last known good deployment id is in the release tracker
    note_on_scaling:
      # CPU is rarely the cause of latency in this service. Scale only after
      # confirming the bottleneck is compute, not a downstream dependency or
      # queue depth. Adding capacity to a saturated downstream amplifies the
      # incident; it does not resolve it.

The general principle behind that last note travels beyond this runbook: scale-out is the right remediation for compute saturation, not for downstream saturation. When latency rises because a database, queue, or external dependency is saturated, adding capacity in front of the bottleneck moves more requests into the bottleneck and makes the incident worse. This is one of the most common operational mistakes when the dashboard shows red and the on-call instinct says "add more".

4. Release confidence is engineered

Releases get harder as systems grow. The platforms that ship confidently at scale have engineered the path, not learned to fear it.
The patterns that change the math:

- Feature flags that allow change without deploy
- Canary deployments (releasing the new version to a small slice of traffic first, watching error budget burn before continuing) that surface problems while exposure is still small
- Gradual rollouts with automated rollback triggers
- Database migrations split from application releases
- Release coordination that scales with services, not with team size

In practice

The pattern that works: every change ships behind a feature flag, canary deployments take a small slice of traffic first, and rollback is a one-click step in the pipeline rather than a procedure to be invented during an incident. The cost is the discipline of building rollback paths and exercising them. The return is releases that stop being events. Issues that previously triggered full rollbacks get isolated to a slice and rolled back automatically before they reach most users. The willingness to ship smaller, more frequent changes follows directly from the confidence that bad changes can be undone fast.

Big releases feel safe because they are rare. They are actually risky because every change rides together.

5. Reliability is continuous, not a milestone

Reliability is not achieved through tools alone. It requires continuous refinement, feedback-driven improvement, and a budget that the team can spend on operational work without negotiating each time.

The disciplines that keep systems reliable over years are codified well in the SRE-book framing of service level objectives and error budgets (the canonical reference is the Google SRE Book chapter on Service Level Objectives, with the operational follow-up in the SRE Workbook chapter on alerting on SLOs). The names matter less than the practice they enable.

- SLOs chosen from the user's perspective, with two or three per service rather than ten. More SLOs means none of them shape behaviour.
- Error budgets: the inverse of the SLO, expressing how much unreliability the team is willing to spend in a window. A budget used up early in the month means slowing down on releases. A healthy budget means feature work keeps moving.
- Multi-window burn-rate alerting turns SLOs from dashboards into pages: the short window catches catastrophic failures, the long window catches slow drift. Without burn-rate alerting, SLOs are observation, not operation. (The pattern is documented in the SRE Workbook.)
- Reliability work has its own backlog, prioritised against features. Not a wishlist after every incident.
- Regular game days that exercise failure modes (region failover, dependency outage, traffic spike) before they happen for real
- Capacity planning informed by data, not by anxiety

In practice

The pattern that works: define two or three SLOs per service, expressed from the user's perspective. Compute the error budget weekly. When the budget is healthy, ship feature work. When the budget is burning fast, slow down and fix the cause. The conversation about which incidents matter and which can wait becomes possible because there is a shared number to point at. Reliability becomes a quantified property of the platform, not an opinion debated at every retrospective. Teams that adopt this discipline stop having the recurring "how reliable do we need to be?" argument and start having data-grounded trade-off discussions instead.

A scenario that ties it together

A platform was launching a new region. The build had gone well. Day 1 was clean. Two weeks in, latency started creeping up during peak hours. Alerts fired on raw thresholds, but no one could tell which ones to trust. Incident calls turned into long debugging sessions because three different teams owned overlapping pieces of the request path.

The team did not start by buying a new tool. They started by treating operations as engineering work. The dashboard was redesigned around the user journey. Alerts were audited and most were demoted or removed. Roles for incident response were written down. A short runbook covered the top failure modes. Releases were broken into canary slices behind feature flags.

None of this was new. It was discipline applied consistently to work that was previously assumed to be someone else's. The next region launch took half the effort, and the team's mean time to restore on the failures that did happen was measurably lower.

What teams get wrong

The common pattern is treating Day 2 as the cost of Day 1. Teams design beautifully, ship fast, then quietly absorb the operational debt. Dashboards proliferate. Alerts grow louder. Postmortems pile up. The fix is not more dashboards. It is treating operations as engineering work with the same rigour as feature delivery.

Operability is a property the system either has or does not. It is not earned by adding monitoring. It is earned by designing for visibility and operating with discipline.

Where to start

The most concrete starter from this post: an alert audit. List every alert that fires in the next week and apply a single test to each one: "what action would I take in the next five minutes?" Demote the alerts that have no answer. Remove the alerts where the answer is the same as another alert's. The audit takes a morning. The result usually halves alert volume and lifts trust on what remains, which is the precondition for every other operational practice in this post.

The shift

The most important shift in maturity is not technical. It is in stance. The shift is from shipping software to operating systems:

- Operations is not a phase that follows engineering. It is engineering.
- Reliability is not a milestone reached. It is a discipline practiced.
- Incidents are not interruptions to the work. They are the work.

The teams that internalise this shift run platforms that are smaller, calmer, and more trusted. They do not have fewer incidents because their systems are more advanced. They have fewer incidents because their operational discipline is more consistent.

Part 3 of this series argues that the same discipline applies again, in a different domain: the practices that make platforms operable are the practices that make AI useful in delivery.

Want to discuss?

What is the one operational practice your team adopted that changed how you sleep at night? Drop a comment with patterns you have seen in your environment. Every reply gets read.

Previously in this series: Building Cloud Native Platforms That Scale: Patterns That Actually Work. The first post covered the design choices that make scale possible.

Next in this series: AI-First Platform Engineering: From Copilot to Agentic Delivery. Cloud helped us scale infrastructure. The next post looks at how AI is now changing how we build and run platforms.
KishoreKumarPattabiraman
May 23, 2026 · Azure Architecture Blog
Cloud Native Platforms: Build
Audience: Cloud architects, platform engineers, engineering leaders making design decisions
Reading time: 8 minutes
Series: Cloud Native Platforms. Build, Run, Evolve. This is Part 1 of 3.

Most engineering teams can build systems. Few can scale them without rebuilding them. As platforms grow, complexity does not increase linearly. It multiplies across users, services, tenants, regions, and integrations. The systems that struggle and the systems that scale are rarely separated by which cloud they run on. They are separated by a handful of design choices made early and applied consistently. This post is about those choices.

The differentiator is not the cloud

Scalable platforms are not built with the right tools. They are built with the right design choices. Cloud services have closed the gap on infrastructure. The differentiator is no longer which managed service a team picks. It is whether the platform is designed to absorb change, tolerate failure, and support visibility from day one. Five engineering disciplines determine whether a platform scales gracefully or collects technical debt while it grows.

Figure 1. The five disciplines compound into platform scale. Any one neglected becomes the constraint that forces a rewrite later.

1. Flexibility is the foundation of scale

Hard-coded systems work until they do not. The first request to add a tenant, a region, a SKU (a sellable product variant), or a regulatory variant is the moment a rigid design starts to bend. Each subsequent request adds weight.

Scalable platforms move behavior out of code:

- Configuration replaces conditional logic
- Feature flags enable safer, tenant-scoped rollouts
- APIs evolve through versioning, not breaking changes
- Schemas evolve additively. Breaking changes go through versioned contracts with a deprecation window long enough that consumers can migrate without downtime.
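The first two items in the list above can be sketched in a few lines. This is a minimal illustration in Python under stated assumptions, not a reference to any particular flag service: it assumes flag rules have already been loaded from a managed configuration store into an in-memory mapping, and the names FlagRule, is_enabled, the flag name, and the tenant ids are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class FlagRule:
    """One flag's rollout rule, as it might be stored in configuration."""
    enabled_by_default: bool = False
    enabled_tenants: frozenset[str] = field(default_factory=frozenset)
    disabled_tenants: frozenset[str] = field(default_factory=frozenset)


def is_enabled(flags: dict[str, FlagRule], flag: str, tenant_id: str) -> bool:
    """Resolve a flag for one tenant; explicit tenant rules beat the default."""
    rule = flags.get(flag)
    if rule is None:
        return False  # unknown flags fail closed
    if tenant_id in rule.disabled_tenants:
        return False  # per-tenant kill switch wins over everything else
    if tenant_id in rule.enabled_tenants:
        return True   # tenant is in the verification cohort
    return rule.enabled_by_default


# Hypothetical configuration: the new flow is gated to a single tenant.
FLAGS = {
    "new-checkout-flow": FlagRule(enabled_tenants=frozenset({"tenant-a"})),
}

print(is_enabled(FLAGS, "new-checkout-flow", "tenant-a"))  # True
print(is_enabled(FLAGS, "new-checkout-flow", "tenant-b"))  # False
```

Widening the rollout then becomes a configuration change (adding tenants, or flipping enabled_by_default) rather than a new branch in code and a redeploy.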
In practice

The pattern that works: configuration in a managed store, feature flags with tenant scope, and APIs versioned per consumer contract. The cost is the discipline of treating configuration as code (versioned, reviewed, audited). The return is that releases stop being events and start being routine. A change that previously needed a coordinated deployment can be executed in minutes, gated to a single tenant for verification, and rolled out broadly only after the signal is clean. Most platforms reach this state by retrofit, not by design. Doing it earlier costs less than waiting.

If a change requires a redeploy, it should require a very good reason.

2. Failures are normal. Resilience is a choice.

Distributed systems will fail in unpredictable ways. The real question is not how to prevent failure. It is how the system responds when failure happens. Resilience is engineered, not inherited from the platform.

The patterns that move the needle are well known and consistently applied:

- Idempotent operations (safe to call multiple times with the same result) that make retries safe
- Reliable messaging patterns such as the transaction outbox (writing the message in the same database transaction as the business change, then publishing asynchronously) to avoid lost or duplicated events
- Decoupled services that contain blast radius (the scope of damage when one component fails)
- Timeouts, retries, and circuit breakers (a wrapper around a dependency that stops calling it for a cool-off window after repeated failures) tuned per dependency
- Bulkheads (isolation pools, often a separate compute or queue lane per workload class) that keep noisy neighbours from starving critical paths of resources

In practice

The pattern that works: every write that can be retried carries an idempotency key, every queue consumer is safe to replay, every event published goes through an outbox in the same transactional unit as the business change. When peak load triggers retries, duplicates collapse cleanly instead of producing duplicate orders, double-charged customers, or split-brain state. The contract changes outwards: callers can retry without thinking, queues can be at-least-once instead of exactly-once, and recovery moves from a manual cleanup task to a property of the system. Most teams that adopt this pattern stop seeing certain classes of incident entirely.

Implementation note

An idempotent API is not just a design preference. It changes how the rest of the system can be built. Once writes are safe to repeat, retries become cheap, queues become trustworthy, and recovery becomes automatic.

The naive implementation (read the key, if absent process and save) has a race. Two concurrent requests with the same key both miss the lookup, both call the processor, and both attempt to save. That is the failure mode idempotency exists to prevent. The pattern that survives production is an atomic reserve-then-execute: insert a row keyed by the idempotency key with a unique constraint before doing any work. The first writer wins. Concurrent callers either wait for the original to complete and read its result, or they receive a conflict response.

    using System;
    using System.Threading;
    using System.Threading.Tasks;

    // Contract for the idempotency store. The two key methods are TryReserveAsync
    // (atomic insert with unique-key constraint) and CompleteAsync (record the
    // result of the first writer). GetCompletedResultAsync polls until the first
    // writer commits or returns 409 Conflict if the in-flight window exceeds the
    // configured deadline.
    public interface IIdempotencyStore
    {
        Task<Reservation> TryReserveAsync(
            string idempotencyKey, string requestHash, CancellationToken ct);

        Task CompleteAsync(
            string idempotencyKey, OrderResult result, CancellationToken ct);

        Task<OrderResult> GetCompletedResultAsync(
            string idempotencyKey, CancellationToken ct, TimeSpan? maxWait = null);
    }

    public readonly record struct Reservation(
        bool IsFirstWriter, string RequestHash);

    // Idempotency via atomic reserve-then-execute.
    // First writer wins; replays return the original result; concurrent
    // duplicates lose the race and read the winner's outcome (or get 409).
    public async Task<OrderResult> CreateOrderAsync(
        Order order, string idempotencyKey, CancellationToken ct)
    {
        var requestHash = StableHash(order); // canonical content hash

        // Atomic insert: succeeds for the first caller, fails for the rest.
        var reserved = await _store.TryReserveAsync(
            idempotencyKey, requestHash, ct);

        if (!reserved.IsFirstWriter)
        {
            if (reserved.RequestHash != requestHash)
                throw new IdempotencyKeyReusedException();

            // A previous run committed (return its result) or is in-flight
            // (poll with a bounded deadline; 409 if exceeded).
            return await _store.GetCompletedResultAsync(
                idempotencyKey, ct, maxWait: TimeSpan.FromSeconds(5));
        }

        // We are the first writer. Execute, persist, mark complete.
        var result = await _processor.ProcessAsync(order, ct);
        await _store.CompleteAsync(idempotencyKey, result, ct);
        return result;
    }

Three production details matter:

- TTL or compaction on the idempotency record. Without it, the store grows forever. Most teams retain records for the request retry window plus a safety margin (commonly 24 to 72 hours).
- Stable content hash, not the default object hash code. The request hash detects key reuse with a different body, so a client that reuses an idempotency key with a different payload receives IdempotencyKeyReusedException rather than silently getting the wrong result. Canonicalise field ordering, locale, and null handling before hashing.
- Bound the in-flight window explicitly. The genuinely hard case is when the processor succeeded but the store write failed. Production-grade implementations either run the side-effect and the store write in the same transaction (when the processor and store share a database) or use the transaction outbox pattern to bridge them. The poll-with-deadline in GetCompletedResultAsync handles the duplicate-arrives-mid-flight case; the transactional boundary handles everything else.

3. Observability is not optional

Without observability, teams operate blind. As systems grow, the price of guessing rises faster than the price of seeing.

At build time, observability is a design property. The decisions made before the system reaches production are what determine whether it can be operated at all. The dashboards, alerts, and incident practices covered in Part 2 of this series rely on instrumentation choices made here.

The build-time work that pays off in production:

- Request identifiers propagated through every service hop, every queue, every async boundary, so a single user action can be traced end to end
- Structured logging with a consistent schema (event name, correlation id, tenant, severity) rather than free-form strings
- Metrics emitted at the boundaries that matter (every external call, every queue read or write, every database operation), not only at the entry point
- Tracing libraries integrated at the framework or middleware layer so coverage is automatic, not opt-in
- Schemas designed so business signals (orders, sessions, transactions) and system signals (CPU, latency, errors) share the same identifiers and can be correlated later

In practice

The pattern that works: a single request id flowing through every service hop, every queue, every async boundary, propagated automatically at the framework layer rather than per-call. Add one structured logging schema across services (event name, correlation id, tenant, severity), so that a single query joins business events with system events. The investment is hours of upfront framework wiring. The return is that production diagnosis stops being archaeology. Cross-service questions become single dashboards; postmortems shrink from days to hours; and the dashboards in Part 2 actually work because the data underneath is shaped to support them.

4. Delivery practices set the ceiling

Scaling teams requires scaling delivery. Small inefficiencies in pipelines, environments, and release coordination compound into measurable drag.

Delivery maturity that pays off at scale:

- Pipelines as code, reviewed and versioned like application code
- Parallel deployments across services and regions where dependencies allow
- Infrastructure as code with shared modules, not hand-managed environments
- Automated quality gates: tests, security scans, dependency checks
- Trunk-based development (developers commit to a single shared branch many times a day) with short-lived feature branches and progressive delivery. Important caveat: trunk-based works only when test automation and feature flags are already in place. Adopting it before those foundations exist tends to amplify production incidents rather than reduce them.

In practice

The pattern that works: pipelines run in parallel where dependencies allow, infrastructure provisioning is templated rather than per-environment, and quality gates run automatically rather than as discretionary steps. Sequential deployment of a multi-service platform across three environments takes hours; parallelised deployment of the same change takes minutes. The payback is not only release speed. It is the compounding cost reduction of every wait state for every engineer on every release. Teams that treat pipelines as a product feature, not an afterthought, ship more confidently and recover from bad changes faster because the rollback path was exercised, not invented during an incident.

Slow pipelines are not a tooling problem. They are a design problem.

5. Cost discipline is engineering work

Cloud platforms can become expensive quickly when cost is treated as someone else's problem. Cost is a property of the design, not a quarterly review.

The teams that get this right treat cost the same way they treat performance:

- Elastic compute and storage tiers chosen per workload pattern
- Non-production environments with automated scale-down windows (the easiest savings to leave on the table)
- Tagging discipline so cost can be attributed to a service, a feature, a tenant
- Egress and data-tier choices, not compute, dominate cloud bills past a certain scale. Right-size storage tiers (hot vs cool vs archive), eliminate cross-region chatter, and watch egress on the data plane more closely than compute on the request path.
- Budgets and usage alerts wired into the same channels as reliability alerts
- Cost reviews built into design discussions, not deferred to FinOps (Financial Operations: the practice of managing cloud spend as an engineering concern)

In practice

The pattern that works: non-production environments scale down automatically outside business hours, storage tiers match access patterns (hot, cool, archive), and tagging is enforced so every dollar can be attributed to a service or feature. Cost reviews happen at design time, not after the bill arrives. The biggest savings come from data plane decisions, not compute: cross-region egress, oversized storage tiers, and forgotten test environments dominate cloud bills past a certain scale. Treat cost as a first-class non-functional requirement, alongside latency and availability, and the discipline compounds in every design discussion that follows.

A scenario that ties it together

Figure 2. A reference architecture that puts the disciplines into one shape. The request path is decoupled, the data layer is purpose-fit, identity is brokered by managed identity throughout, private endpoints isolate the data tier from public networks, and observability runs as a first-class lane.
Picture a multi-tenant platform at a growth inflection. Onboarding a new tenant takes weeks because tenant-specific behaviour is hard-coded across services. Every release carries risk because there is no way to roll out a change to one tenant without affecting the rest. Incidents linger because logs and metrics live in different tools and nobody can correlate them in production.

Do not start with a rewrite. Start with the smallest set of changes that unlocks the next year of growth: extract configuration out of code, introduce tenant-aware feature flags, wire a unified observability view into the existing services, and parallelise the pipelines. None of these are architectural revolutions. They are design choices applied with discipline, in the order the disciplines compound.

Eighteen months in, onboarding a tenant takes hours instead of weeks. Releases move from monthly events to weekly increments. Incidents are caught earlier and resolved faster. The platform did not get bigger. It got more capable. The five disciplines did the work; the team made the choice to apply them.

What teams get wrong

The common pattern is architecting for the system you have, not the system you are growing into. It looks like progress because the current sprint ships. Pillars get postponed because they feel like overhead. The cost surfaces later. Each shortcut becomes a constraint. The constraints compound, and three releases later the team is debating a rewrite.

The fix is not premature abstraction. It is small, deliberate investments in flexibility, resilience, observability, delivery, and cost from day one. The discipline is to make these investments before they are urgent.

Where to start when you cannot do everything at once

Five disciplines is a wall, and real teams cannot fund all five at once. The right order depends on whether the platform is being built fresh or already running.
For a system already in production and already in pain, the SRE community's hierarchy of reliability needs gives the most defensible starting order: monitoring and observability first (you cannot fix what you cannot see), then incident response (close the bleeding cleanly), then resilience patterns (idempotency, retries, decoupling) so the bleeding has fewer reasons to start, then flexibility and delivery so safe change can travel at speed. Cost discipline runs alongside throughout, never as the headline.

For a system being built fresh, the order in this post (flexibility, resilience, observability, delivery, cost) reflects the Azure Well-Architected Framework's emphasis on designing for change, failure, and visibility before scaling teams or workloads.

Both orders are defensible. What is not defensible is leaving any of the five for later.

The most concrete starter from this post: request id propagation. A single correlation identifier travelling through every service hop, every queue, every async boundary, costs hours up front and pays back every time someone has to debug production for the rest of the platform's life. It is the smallest unit of the observability discipline and the foundation that the dashboards, traces, and incident response in Part 2 all depend on.

The shift

The most important transformation in scaling a platform is not technical. It is mindset. The shift is from project thinking to platform thinking:

- Build reusable capabilities, not one-off solutions
- Design systems for long-term evolution, not the next release
- Enable other teams, not just deliver for one team

Tools change. Cloud services evolve. The architectural fashions of this year will not be the architectural fashions of the next. What persists is the discipline behind the choices. Scalable systems are not built by tools. They are built by teams that treat design as continuous work.

The same discipline shows up again in Part 2 (operating these systems) and Part 3 (using AI to augment that work). The tools change. The disciplines do not.

Want to discuss?

What single design choice has paid the most dividends in the platforms you run? Drop a comment with patterns you have seen in your environment. Every reply gets read.

Next in this series: Running Cloud Native Platforms: Why Day 2 Decides Everything. Building is half the journey. The next post looks at what it takes to operate these platforms once they are in production.
KishoreKumarPattabiraman
May 23, 2026 · Azure Architecture Blog
WAR, Azure Advisor, and Us (Azure Arch Diagram Builder): Three Ways to Score an Azure Architecture
Author: Arturo Quiroga, Azure AI services Engineer - Senior Partner Solutions Architect, Microsoft

A few days ago I published From Prompt to Production: Building Azure Architecture Diagrams with AI, introducing the open-source Azure Architecture Diagram Builder. One feature got more follow-up questions than any other: the Well-Architected Framework (WAF) validation. Architects from partners and customers (many of whom already use Azure Advisor and the Well-Architected Review) wanted to know exactly what scoring algorithm we use, how it compares to Microsoft's official tools, and whether they should be using all three.

This post is that answer. It's a deep dive into how design-time WAF validation works, how Microsoft's two official WAF assessment algorithms work, and where each fits in the architecture lifecycle.

TL;DR. Microsoft ships two WAF assessment vehicles: the Well-Architected Review (questionnaire, scored from human answers) and the Azure Advisor score (healthy-resources-÷-applicable-resources weighted per subcategory, with Defender Secure Score for Security and cost-weighted math for Cost). Both require either a human filling in a form or live Azure telemetry. Our app runs at design time on a diagram, before anything is deployed, using a hybrid pipeline: a deterministic rule pre-scan followed by an LLM refinement pass. Same five WAF pillars, different lifecycle stage. Complementary, not competitive.

Why design-time validation matters

Every cost overrun, reliability gap, and security incident I've ever debugged was cheaper to fix on a whiteboard than in production. Yet most WAF tooling assumes the architecture already exists, either because there are deployed resources to scan (Advisor) or because someone has built enough of it to answer 60 specific questions about it (WAR).

That leaves a gap. Between "rough sketch" and "deployed resource group" there is no algorithmic WAF feedback loop. That's the gap the Diagram Builder fills.
Microsoft's two official WAF assessment algorithms Before describing our approach, it's worth being precise about what Microsoft already ships, because the term "WAF assessment algorithm" can mean either of two very different things. 1. Azure Well-Architected Review (WAR) — questionnaire-based The Well-Architected Review is a free self-assessment hosted on Microsoft Learn. Aspect Detail Input Human answers to ~60 questions mapped to the WAF pillar checklists Workload variants Core WAR, plus AI/ML, IoT, SAP on Azure, Azure Stack Hub, SaaS, Mission Critical Scoring Derived from the answers — each "no" or unanswered question subtracts from the pillar score Output Per-pillar maturity score + prioritized recommendations + optional Advisor integration Improvement tracking "Milestones" (point-in-time snapshots) When to use Periodic deep reviews; greenfield design baselining; brownfield audits WAR is human-driven. The algorithm is essentially "how many of the recommended practices have you confirmed you do?" — which is exactly the right algorithm when the assessor is the workload team itself. 2. Azure Advisor Score — telemetry-based The Advisor score is the closest thing Microsoft ships to a real, deterministic WAF algorithm. It runs continuously over your deployed Azure resources. The math: Pillar-specific overrides: Security uses Microsoft Defender for Cloud's Secure Score model. Cost weights by retail $ cost of healthy resources, plus age-of-recommendation weighting; postponed/dismissed items are removed from the denominator. Reliability / Performance / Operational Excellence use the healthy-resources ratio above. Key terms: Healthy resource — a deployed resource with no open Advisor recommendation against it for that pillar. Total applicable — resources Advisor was able to evaluate (excludes dismissed/snoozed). Advisor is the right tool once you're in production. It cannot help you before deployment, because there is nothing to count as "healthy" or "applicable." 
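As a rough illustration of the healthy-resources ratio described above, the following TypeScript sketch computes a per-subcategory score and a weighted pillar score. The type names, the `weight` field, the handling of an empty subcategory, and the sample numbers are all assumptions made for illustration. Advisor's real subcategory weights and the Security and Cost overrides are not reproduced here.

```typescript
// Hypothetical shape for one Advisor subcategory; field names are illustrative only.
interface SubcategoryStats {
  name: string;
  healthy: number;    // resources with no open recommendation for this pillar
  applicable: number; // resources Advisor could evaluate (dismissed/snoozed excluded)
  weight: number;     // assumed relative weight; the real weights are set by Advisor
}

// Score for one subcategory: healthy / applicable, expressed as a percentage.
function subcategoryScore(s: SubcategoryStats): number {
  // Assumption: a subcategory with nothing to evaluate is treated as fully healthy.
  if (s.applicable === 0) return 100;
  return (s.healthy * 100) / s.applicable;
}

// Pillar score: weighted average of the subcategory scores.
function pillarScore(subs: SubcategoryStats[]): number {
  let totalWeight = 0;
  let weighted = 0;
  for (const s of subs) {
    totalWeight += s.weight;
    weighted += subcategoryScore(s) * s.weight;
  }
  return totalWeight === 0 ? 100 : weighted / totalWeight;
}

// Made-up sample data: 8 of 10 and 3 of 6 resources healthy.
const reliability: SubcategoryStats[] = [
  { name: "High availability", healthy: 8, applicable: 10, weight: 2 },
  { name: "Backup", healthy: 3, applicable: 6, weight: 1 },
];

console.log(pillarScore(reliability));
```

The sketch also makes the pre-deployment limitation concrete: with no deployed resources, `applicable` is zero everywhere and the ratio carries no information.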
The missing stage: design time Here's the lifecycle, with each tool's domain shaded: Design / Diagram — Diagram Builder validation runs here. Operate / Observe — Azure Advisor runs here continuously. Periodic Review — WAR runs here, typically quarterly or at major milestones. These three stages are sequential and complementary. Our app does not replace Advisor or WAR — it adds a feedback loop earlier in the lifecycle, where corrections are cheapest. How design-time validation works in the Azure Architecture Diagram Builder The validator is a two-phase hybrid pipeline: deterministic local rules first, then LLM refinement. The full source lives in three files: src/services/architectureValidator.ts — orchestrator and prompt src/services/wafPatternDetector.ts — topology + service rule engine src/data/wafRules.ts — the rule knowledge base Phase 1 — Deterministic rule pre-scan (~1 ms, no LLM) When you click Validate Architecture, the validator runs a fully client-side rule engine against the diagram's services, connections, and groups. 
There are two kinds of rules.

Architecture-pattern rules
These fire when a topology anti-pattern is detected:

Pattern | Detection trigger
single-region | No global LB (Traffic Manager / Front Door) with ≥3 services
single-database | Exactly one database service, no replication signal
no-cache | Compute + database present, no Redis/CDN
no-monitoring | No Azure Monitor / App Insights / Log Analytics
no-identity | No Microsoft Entra ID
no-waf | Public web tier without WAF / Front Door / App Gateway
direct-db-access | An edge from a frontend service directly into a database
no-key-vault | 4+ services and no Key Vault
no-backup | Database present, no Azure Backup / Recovery Services
no-api-gateway | 2+ compute services and no APIM / App Gateway / Front Door

Service-specific rules
Every service in the generated Azure architecture diagram is matched against SERVICE_SPECIFIC_RULES by normalized type: App Service, Functions, AKS, Cosmos DB, SQL Database, Storage, Key Vault, and 22 more.

The knowledge base at a glance

Metric | Count
Total rules | 73
Architecture-pattern rules | 10
Service-specific rules | 63
Distinct Azure services covered | 29
Rules tagged Reliability | 18
Rules tagged Security | 34
Rules tagged Cost Optimization | 5
Rules tagged Operational Excellence | 7
Rules tagged Performance Efficiency | 9

The preliminary score
Each finding has a severity, and severity drives a fixed point deduction from a starting score of 100:

Severity | Deduction
critical | −12
high | −7
medium | −3
low | −1

The result is floored at 10 (so even a deliberately bad architecture scores at least 10) and capped at 95 (no findings ≠ perfect; there's always something the model might still catch). This is the deterministic baseline before the LLM ever sees the architecture, and it's what makes the pipeline reproducible.
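The deduction table and two of the pattern triggers above can be sketched in a few lines of TypeScript. This is a minimal illustration and not the code in wafPatternDetector.ts: the `Diagram`, `Service`, and `Connection` shapes, the `role` field, and the function names are hypothetical. Only the deduction values, the floor of 10, the cap of 95, and the trigger descriptions come from the post.

```typescript
type Severity = "critical" | "high" | "medium" | "low";

// Deductions and clamps as stated in the post.
const DEDUCTION: Record<Severity, number> = { critical: 12, high: 7, medium: 3, low: 1 };
const FLOOR = 10;
const CEILING = 95;

// Start at 100, subtract a fixed amount per finding, then clamp to [10, 95].
function preliminaryScore(severities: Severity[]): number {
  let total = 0;
  for (const s of severities) total += DEDUCTION[s];
  return Math.min(CEILING, Math.max(FLOOR, 100 - total));
}

// Hypothetical diagram model, for illustration only.
interface Service { id: string; type: string; role: "frontend" | "compute" | "database" | "other"; }
interface Connection { from: string; to: string; }
interface Diagram { services: Service[]; connections: Connection[]; }

function roleOf(d: Diagram, id: string): string | undefined {
  for (const s of d.services) if (s.id === id) return s.role;
  return undefined;
}

// no-key-vault: 4+ services and no Key Vault.
function detectNoKeyVault(d: Diagram): boolean {
  return d.services.length >= 4 && !d.services.some((s) => s.type === "Key Vault");
}

// direct-db-access: an edge from a frontend service directly into a database.
function detectDirectDbAccess(d: Diagram): boolean {
  return d.connections.some(
    (c) => roleOf(d, c.from) === "frontend" && roleOf(d, c.to) === "database"
  );
}

// Made-up diagram that trips both detectors.
const sample: Diagram = {
  services: [
    { id: "web", type: "Static Web Apps", role: "frontend" },
    { id: "api", type: "App Service", role: "compute" },
    { id: "sql", type: "SQL Database", role: "database" },
    { id: "ai", type: "Application Insights", role: "other" },
  ],
  connections: [{ from: "web", to: "sql" }, { from: "api", to: "sql" }],
};

// Severity mix from the post's worked example: 1 critical, 1 high, 3 medium, 3 low.
const workedExample: Severity[] = [
  "critical", "high", "medium", "medium", "medium", "low", "low", "low",
];

console.log(detectNoKeyVault(sample), detectDirectDbAccess(sample));
console.log(preliminaryScore(workedExample));
```

Because the deductions are fixed and the clamp is a pure function of the findings, the same diagram always yields the same preliminary score, which is the reproducibility property the post relies on.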
Phase 2 — LLM contextual refinement The pre-scan output, the topology, and the optional natural-language description are folded into a focused prompt sent to one of seven Azure OpenAI models (GPT-5.1 through 5.4, GPT-5.x Codex variants, DeepSeek V3.2 Speciale, Grok 4.1 Fast). The system prompt gives the model explicit scoring guardrails: Score based on what IS present, not what COULD be added. A well-connected architecture with appropriate services should score 60–80. Score below 50 only for critical gaps (no auth, no monitoring, single points of failure). Findings are improvement suggestions, not reasons to penalize the score severely. The model returns strict JSON: { "overallScore": 0-100, "summary": "2–3 sentence assessment", "pillars": [ { "pillar": "Reliability | Security | Cost Optimization | Operational Excellence | Performance Efficiency", "score": 0-100, "findings": [ { "severity": "critical | high | medium | low", "category": "...", "issue": "...", "recommendation": "...", "resources": ["service-name-1", "service-name-2"], "source": "rule-based | ai-analysis" } ] } ], "quickWins": [ /* same shape as findings */ ] } Two things to call out: Every finding is tagged rule-based or ai-analysis . That tag is the credibility lever. You can always see what the deterministic engine produced versus what the model contributed on top. If you don't trust the AI layer, you can ignore it entirely — the rule layer still stands. The LLM is given pattern hints, not the entire rule catalog. The prompt stays small and focused, which is roughly 3–5× faster and cheaper than asking the LLM to do everything from scratch. 
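For readers consuming this JSON, the shape can be captured as TypeScript types, and the `source` tag makes it easy to separate the deterministic findings from the model's additions. The sketch below is an assumption-laden illustration: the interface and function names are hypothetical, the runtime check is deliberately shallow, and the sample payload is made up to mirror the field names shown above.

```typescript
type Severity = "critical" | "high" | "medium" | "low";
type Source = "rule-based" | "ai-analysis";

// Field names follow the JSON in the post; the interface names are hypothetical.
interface Finding {
  severity: Severity;
  category: string;
  issue: string;
  recommendation: string;
  resources: string[];
  source: Source;
}
interface PillarResult { pillar: string; score: number; findings: Finding[]; }
interface ValidationResult {
  overallScore: number;
  summary: string;
  pillars: PillarResult[];
  quickWins: Finding[];
}

// Parse the model output and reject payloads that are obviously the wrong shape.
function parseValidation(raw: string): ValidationResult {
  const data = JSON.parse(raw) as ValidationResult;
  if (typeof data.overallScore !== "number" || !Array.isArray(data.pillars)) {
    throw new Error("Unexpected validation payload");
  }
  return data;
}

// Collect findings across all pillars that carry the given source tag.
function findingsBySource(result: ValidationResult, source: Source): Finding[] {
  const out: Finding[] = [];
  for (const p of result.pillars) {
    for (const f of p.findings) {
      if (f.source === source) out.push(f);
    }
  }
  return out;
}

// Made-up payload for illustration.
const raw = JSON.stringify({
  overallScore: 71,
  summary: "Sample assessment.",
  pillars: [
    {
      pillar: "Security",
      score: 52,
      findings: [
        { severity: "critical", category: "Identity", issue: "No Microsoft Entra ID",
          recommendation: "Add Entra ID", resources: ["app-service"], source: "rule-based" },
        { severity: "high", category: "Secrets", issue: "No Key Vault",
          recommendation: "Add Key Vault", resources: ["app-service"], source: "rule-based" },
      ],
    },
    {
      pillar: "Operational Excellence",
      score: 70,
      findings: [
        { severity: "medium", category: "Deployment", issue: "Slots not used",
          recommendation: "Use deployment slots", resources: ["app-service"], source: "ai-analysis" },
      ],
    },
  ],
  quickWins: [],
});

const result = parseValidation(raw);
console.log(findingsBySource(result, "rule-based").length);
console.log(findingsBySource(result, "ai-analysis").length);
```

Filtering on `rule-based` is one way to act on the point above: a reader who does not trust the AI layer can keep only the deterministic findings.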
What the user sees
On every run the modal reports:
- Overall WAF score (0–100)
- Per-pillar score × 5 (0–100 each)
- Severity breakdown: counts of critical / high / medium / low across all findings
- Quick wins: high-impact, low-effort items the model surfaces separately
- Hybrid metadata: local findings count, patterns detected, KB rules used, preliminary score, local elapsed ms
- AI metrics: model used, reasoning effort, prompt/completion/total tokens, elapsed time
- App Insights telemetry: an Architecture_Validated event with model, overall score, finding count, elapsed time

Worked example
Take this prompt, which I've used in demos with partners:

"A multi-region web application: Azure Front Door in front of two App Service instances in West US 2 and East US 2, both reading from an Azure SQL Database with geo-replication, with Application Insights for telemetry. No Entra ID, no Key Vault."

After generation, Validate Architecture runs:

Phase 1: pre-scan (deterministic), ~1 ms
- Patterns detected: no-identity, no-key-vault
- Findings produced: 8 (1 critical, 1 high, 3 medium, 3 low)
- Preliminary score: 100 − 12 − 7 − (3×3) − (3×1) = 69

Phase 2: LLM refinement, ~6–9 s depending on model
The model accepts the two pattern hints, validates them in context, and adds three more findings of its own:

Finding | Source | Pillar | Severity
No Microsoft Entra ID for authentication | rule-based | Security | critical
No Key Vault for secret management | rule-based | Security | high
App Service slots not used for safe deploys | ai-analysis | Operational Excellence | medium
SQL DB geo-replication present but RTO/RPO not documented | ai-analysis | Reliability | medium
No CDN for static assets behind Front Door | ai-analysis | Performance Efficiency | low

Final scores returned by the model:

Pillar | Score
Reliability | 78
Security | 52
Cost Optimization | 80
Operational Excellence | 70
Performance Efficiency | 75
Overall | 71

The Security score is the lowest because two of the highest-severity findings landed there, exactly what a human
reviewer would flag first. Multi-model comparison Because the deterministic floor is identical across runs, the Validation Comparison view becomes a fair shootout of what each LLM adds on top of the same baseline. The same diagram is scored by all seven models, and the UI surfaces: Overall score per model Per-pillar score per model Severity-count deltas Number of ai-analysis findings each model contributed Quick wins each model identified This is genuinely useful for two reasons. First, it shows that LLM scores vary — typically by ±5–10 points on the same architecture — which is exactly why we publish the rule-based vs ai-analysis tag. Second, it lets architects pick the model whose review style matches their own. How we align with Microsoft's algorithms Alignment point What it means Same five pillars Identical names and scope to the official WAF Same source material Rules derived from WAF docs and Azure Architecture Center service guides Severity-graded findings Map conceptually to Advisor's high/medium/low impact recommendations Per-pillar + overall scoring Mirrors WAR/Advisor output shape, so the results feel familiar Where we deliberately differ — and why Concern Microsoft Diagram Builder Why we differ Needs deployed resources Advisor: yes No — works on a diagram We're a design-time tool; the architecture doesn't exist yet Needs human Q&A WAR: yes No — derived from the diagram One-click validation inside the design flow Healthy/Applicable ratio Advisor: yes No No resource-health signal exists pre-deployment Subcategory fixed weights Advisor: yes No explicit weights Severity is the de-facto weight (12/7/3/1) Defender Secure Score for Security Advisor: yes No Defender requires deployed resources Cost-weighted scoring Advisor: yes No (separate Cost Estimation feature) Cost is a separate pipeline in our app AI/LLM refinement Neither Yes Catches context-specific issues a static catalog misses, and explains findings in natural language Multi-model comparison Neither 
Yes Lets architects see scoring variance across models Honest limitations I'd rather you hear these from me than discover them in production: LLM scores drift. ±5–10 points across models on the same diagram is normal. Treat the score as directional, the findings as actionable. The rule-based tag is your anchor. No live telemetry. We can't know if your App Service is actually using availability zones — only that you have App Service in the diagram. Advisor will tell you the truth post-deployment. Generic ruleset. No specialized workload branches yet (AI/ML, IoT, SAP, SaaS). WAR has those. No milestone tracking. Each validation run is independent. Compare runs manually using the Validation Comparison view. Rule coverage is finite. 29 services and 73 rules is a strong start but not exhaustive — the LLM layer exists in part to compensate for that gap. How to use all three together A lifecycle that actually works: Design — Use the Diagram Builder to sketch the architecture and validate at design time. Iterate until the per-pillar scores look reasonable and the critical/high findings are addressed. Deploy — Generate Bicep from the diagram, deploy, and let Azure Advisor start scoring real resources. Operate — Use Azure Advisor continuously. Use Defender Secure Score for security posture. Periodic review — Run a Core WAR every quarter or at major milestones to capture the things only humans know (business context, tradeoffs, planned debt). None of these three replace the others. They cover different stages of the same loop. What's next A few things on the roadmap I'd love feedback on: Milestone tracking so design-time scores can be compared over time the way WAR milestones work. Workload-specific rulesets mirroring WAR's branches — starting with AI/ML. Direct Advisor handoff — once a diagram is deployed, surface the corresponding Advisor recommendations in the same UI to close the loop. 
Try it, fork it, tell me where it's wrong Live app: https://aka.ms/diagram-builder Source: github.com/Arturo-Quiroga-MSFT/azure-architecture-diagram-builder Useful references: Azure Well-Architected Framework pillars Azure Well-Architected Review tool Azure Advisor score — calculation Use Azure WAF assessments (Advisor) Complete an Azure Well-Architected Review assessment If you're a partner or customer architect who's already living in Advisor and WAR, I'd genuinely value your reaction — does the design-time stage feel like a real gap to you, or are you already covering it some other way? Open an issue on the repo or reply on LinkedIn. Posted on the Azure Architecture Blog · Comments and issues welcome on the repo.
arturoqu
May 21, 2026 · Azure Architecture Blog
Governing Agent Sprawl: A Multi‑Region AI Agent Landing Zone on Azure (Reference Architecture)
It doesn’t take long for AI agents to get out of hand. In most enterprises, the first few agents are celebrated. A chatbot here. A document summarizer there. Then another team ships an agent that calls APIs. Someone else connects one to internal data. Within months, IT is staring at dozens—or hundreds—of autonomous systems running across subscriptions, regions, and tools. At that point, the questions stop being about model quality and start being uncomfortable operational ones: Who owns this agent? What data can it access? What happens if it misbehaves? Why did it just consume half our monthly token budget in a day? Developers can build an AI agent in minutes—the difficult part is understanding what agents are doing, how they perform, and whether they comply with organizational policy. Signals scatter across tools, context is lost, and governance becomes reactive. This reference architecture exists to solve that problem. It describes a multi‑region AI agent landing zone on Azure that treats agents as first‑class, governable workloads—provisioned automatically, constrained by policy, and observable from day one. The architectural principle: separate control from execution The design starts with a simple but non‑negotiable rule: Control plane concerns must be separated from runtime concerns. Azure landing zones already follow this model. Management groups, Azure Policy, and RBAC are global constructs. Workloads run in regions. This architecture applies the same discipline to AI agents. The runtime plane is where agents execute, models infer, and data flows—often in multiple Azure regions. The control plane is where identity, policy, safety, evaluation, and oversight live—independent of region. This separation is what allows teams to scale agents without losing control. Layer 1: Azure AI Gateway — governing every request The first control layer sits directly in the request path. 
The AI gateway in Azure API Management provides a policy‑enforcement and observability layer in front of AI models, agents, and tools. It is not a separate service—it extends Azure API Management. Everything flows through it: Microsoft Foundry model deployments Azure AI Model Inference API endpoints OpenAI‑compatible third‑party models Self‑hosted models MCP servers and A2A agent APIs (preview) What the gateway actually enforces This layer is intentionally narrow and operational: Token quotas and rate limits The llm-token-limit policy (GA) enforces tokens‑per‑minute or quota ceilings per consumer before requests reach the backend. This prevents one application—or one agent—from exhausting shared capacity. Content safety at ingress The llm-content-safety policy (GA) integrates Azure AI Content Safety to moderate prompts automatically. Unsafe requests never reach the model. Traffic routing and resiliency Azure API Management supports multi‑region gateway deployment (Premium tier). If a region fails, traffic routes to the next closest gateway automatically. Token usage, prompts, and completions are logged to Azure Monitor and Application Insights using built‑in policies such as llm-emit-token-metric. The gateway does not understand agent intent or business context. That is by design. It governs traffic, not behavior. Layer 2: Azure AI Foundry Control Plane — governing behavior at scale The second layer governs what agents do, not just how requests flow. Azure AI Foundry Control Plane provides a unified management surface for AI agents, models, and tools across projects and subscriptions. It is designed specifically for agentic systems. Foundry Control Plane is currently in public preview. What Foundry Control Plane adds Fleet‑wide inventory Every agent, model, and tool appears in a single, searchable view across projects. 
Continuous evaluation on production traffic Foundry runs evaluations that measure task adherence, groundedness, tool‑call accuracy, sensitive data exposure, and other agent‑specific risk dimensions. Centralized guardrails Policy is enforced across inputs, outputs, and tool interactions—not just prompts. Bulk remediation can be applied across the fleet. Security integration Foundry integrates with: Microsoft Entra for agent identity (Entra Agent ID) Microsoft Defender for threat signals Microsoft Purview for data protection and compliance visibility Foundry Control Plane also requires an AI Gateway to be configured for advanced governance scenarios—reinforcing the layered approach. Layer 3: Microsoft Agent 365 — enterprise oversight, not just Azure oversight The third layer exists because Azure governance alone is not enough. Agents don’t just call APIs. They act on behalf of users. They access enterprise data. They operate inside Microsoft 365 workflows. Microsoft Agent 365 is the tenant‑level control plane for AI agents. It brings agents under the same administrative model used for users and applications. Status: Frontier Preview General availability: May 1, 2026 Why this layer matters Agent 365 introduces controls that Azure alone cannot provide: Agent registry A single inventory of all agents in the tenant—including sanctioned and shadow agents. Unsanctioned agents can be quarantined. Identity‑first access control Every agent is issued an Entra agent ID. Conditional Access policies apply to agents the same way they do to users. Human‑in‑the‑loop oversight Agents surface in Microsoft 365 admin workflows, not just Azure portals. Security and compliance Defender and Purview extend threat detection and data protection policies to agent activity. Agent 365 does not replace Foundry Control Plane. It complements it—connecting agent operations to enterprise identity, compliance, and productivity systems. 
How the pieces work together Individually, these services are powerful. The architecture works because they are deliberately layered. External approval → automated provisioning When a use case is approved in an external governance system, it triggers an Azure DevOps pipeline using the REST API. That pipeline: Provisions subscriptions and resource groups Deploys Foundry projects Configures Azure API Management with AI Gateway policies Enables monitoring and logging Governance is applied before the first request is made. One policy model, many regions Azure landing zones are region‑agnostic at the governance layer. This architecture follows that guidance. Policies and RBAC apply globally AI Gateway enforces limits locally in each region Runtime services scale region by region Expanding to a new region does not introduce a new governance model—only new capacity. A single operational view Signals flow upward: AI Gateway emits traffic and usage metrics Foundry Control Plane correlates evaluations, guardrail enforcement, and security alerts Agent 365 aggregates tenant‑level identity, compliance, and threat signals Operations teams no longer hunt across dashboards. They work from one prioritized view, with context intact. What this architecture deliberately does not promise This is a reference architecture, not a silver bullet. It does not eliminate the need for: Clear agent ownership Business‑level approval processes Ongoing evaluation of agent usefulness What it does provide is a foundation—one that lets organizations scale agentic AI without accepting chaos as the cost of innovation. Closing thoughts Agent sprawl is not a tooling failure. It’s an architectural one. By separating control from execution, layering governance where it belongs, and aligning AI operations with existing Azure and Microsoft 365 control planes, this architecture gives enterprises a way to move fast without losing sight of what their agents are doing. 
That’s the difference between experimentation and production.

Co-Contributor: Jorge Pena Alarcon, Sr. Cloud & AI Specialist

References (official Microsoft sources)
- Azure AI Gateway in Azure API Management
- Configure AI Gateway for Foundry
- Foundry Control Plane overview
- Microsoft Agent 365 announcement
- Agent 365 GA announcement
- Azure landing zones and regions
- Azure DevOps pipeline REST API
KimVaddi
May 15, 2026 · Azure Architecture Blog