armchair architects

36 Topics

WAR, Azure Advisor, and Us (Azure Arch Diagram Builder): Three Ways to Score an Azure Architecture
Author: Arturo Quiroga, Azure AI services Engineer - Senior Partner Solutions Architect, Microsoft

A few days ago I published From Prompt to Production: Building Azure Architecture Diagrams with AI, introducing the open-source Azure Architecture Diagram Builder. One feature got more follow-up questions than any other: the Well-Architected Framework (WAF) validation. Architects from partners and customers, many of whom already use Azure Advisor and the Well-Architected Review, wanted to know exactly what scoring algorithm we use, how it compares to Microsoft's official tools, and whether they should be using all three.

This post is that answer. It's a deep dive into how design-time WAF validation works, how Microsoft's two official WAF assessment algorithms work, and where each fits in the architecture lifecycle.

TL;DR. Microsoft ships two WAF assessment vehicles: the Well-Architected Review (questionnaire, scored from human answers) and the Azure Advisor score (healthy resources ÷ applicable resources, weighted per subcategory, with Defender Secure Score for Security and cost-weighted math for Cost). Both require either a human filling in a form or live Azure telemetry. Our app runs at design time on a diagram, before anything is deployed, using a hybrid pipeline: a deterministic rule pre-scan followed by an LLM refinement pass. Same five WAF pillars, different lifecycle stage. Complementary, not competitive.

Why design-time validation matters

Every cost overrun, reliability gap, and security incident I've ever debugged was cheaper to fix on a whiteboard than in production. Yet most WAF tooling assumes the architecture already exists, either because there are deployed resources to scan (Advisor) or because someone has built enough of it to answer 60 specific questions about it (WAR).

That leaves a gap. Between "rough sketch" and "deployed resource group" there is no algorithmic WAF feedback loop. That's the gap the Diagram Builder fills.
Microsoft's two official WAF assessment algorithms

Before describing our approach, it's worth being precise about what Microsoft already ships, because the term "WAF assessment algorithm" can mean either of two very different things.

1. Azure Well-Architected Review (WAR): questionnaire-based

The Well-Architected Review is a free self-assessment hosted on Microsoft Learn.

| Aspect | Detail |
| --- | --- |
| Input | Human answers to ~60 questions mapped to the WAF pillar checklists |
| Workload variants | Core WAR, plus AI/ML, IoT, SAP on Azure, Azure Stack Hub, SaaS, Mission Critical |
| Scoring | Derived from the answers: each "no" or unanswered question subtracts from the pillar score |
| Output | Per-pillar maturity score + prioritized recommendations + optional Advisor integration |
| Improvement tracking | "Milestones" (point-in-time snapshots) |
| When to use | Periodic deep reviews; greenfield design baselining; brownfield audits |

WAR is human-driven. The algorithm is essentially "how many of the recommended practices have you confirmed you do?", which is exactly the right algorithm when the assessor is the workload team itself.

2. Azure Advisor Score: telemetry-based

The Advisor score is the closest thing Microsoft ships to a real, deterministic WAF algorithm. It runs continuously over your deployed Azure resources.

The math: each pillar score is the ratio of healthy resources to total applicable resources, weighted per subcategory.

Pillar-specific overrides:
- Security uses Microsoft Defender for Cloud's Secure Score model.
- Cost weights by retail $ cost of healthy resources, plus age-of-recommendation weighting; postponed/dismissed items are removed from the denominator.
- Reliability / Performance / Operational Excellence use the healthy-resources ratio above.

Key terms:
- Healthy resource: a deployed resource with no open Advisor recommendation against it for that pillar.
- Total applicable: resources Advisor was able to evaluate (excludes dismissed/snoozed).

Advisor is the right tool once you're in production. It cannot help you before deployment, because there is nothing to count as "healthy" or "applicable."
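To make the healthy-to-applicable ratio concrete, here is a minimal sketch for the pillars that use the plain ratio. It is a simplification under stated assumptions: the type and function names are hypothetical, subcategory weighting and the Security and Cost overrides are left out, and none of this is Azure Advisor's actual implementation or API.

```typescript
// Simplified sketch of the healthy-resources ratio described above.
// All names are hypothetical; this is not Azure Advisor's code or API.
interface PillarCounts {
  healthy: number;    // resources with no open recommendation for the pillar
  applicable: number; // resources that could be evaluated (dismissed/snoozed excluded)
}

// Returns a 0-100 score, or null when there is nothing to evaluate,
// which is the pre-deployment situation the post describes.
function pillarRatioScore(counts: PillarCounts): number | null {
  if (counts.applicable <= 0) {
    return null;
  }
  return Math.round((counts.healthy / counts.applicable) * 100);
}
```

With 18 healthy resources out of 24 applicable, the sketch yields 75; with zero applicable resources it returns null, which mirrors why a ratio of this kind has nothing to say before anything is deployed.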
The missing stage: design time

Here's the lifecycle, with each tool's domain shaded:
- Design / Diagram: Diagram Builder validation runs here.
- Operate / Observe: Azure Advisor runs here continuously.
- Periodic Review: WAR runs here, typically quarterly or at major milestones.

These three stages are sequential and complementary. Our app does not replace Advisor or WAR. It adds a feedback loop earlier in the lifecycle, where corrections are cheapest.

How design-time validation works in the Azure Architecture Diagram Builder

The validator is a two-phase hybrid pipeline: deterministic local rules first, then LLM refinement. The full source lives in three files:
- src/services/architectureValidator.ts: orchestrator and prompt
- src/services/wafPatternDetector.ts: topology + service rule engine
- src/data/wafRules.ts: the rule knowledge base

Phase 1: Deterministic rule pre-scan (~1 ms, no LLM)

When you click Validate Architecture, the validator runs a fully client-side rule engine against the diagram's services, connections, and groups.
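As an illustration of what a client-side rule pass over services and connections can look like, the sketch below checks two of the topology patterns the post names, `no-key-vault` and `direct-db-access`. The data shapes, the `category` field, and the function name are assumptions made for this example. They are not taken from wafPatternDetector.ts.

```typescript
// Hypothetical diagram model; the real app's types may differ.
type Category = "frontend" | "compute" | "database" | "security" | "other";

interface Service {
  id: string;
  type: string; // normalized service type, e.g. "Key Vault"
  category: Category;
}

interface Connection {
  from: string; // source service id
  to: string;   // target service id
}

interface Diagram {
  services: Service[];
  connections: Connection[];
}

// Returns the names of the patterns that fire for the given diagram.
function detectPatterns(diagram: Diagram): string[] {
  const patterns: string[] = [];

  // Index services by id so edges can be resolved to their endpoints.
  const byId: { [id: string]: Service } = {};
  for (const s of diagram.services) {
    byId[s.id] = s;
  }

  // no-key-vault: 4+ services and no Key Vault in the diagram.
  const hasKeyVault = diagram.services.some((s) => s.type === "Key Vault");
  if (diagram.services.length >= 4 && !hasKeyVault) {
    patterns.push("no-key-vault");
  }

  // direct-db-access: an edge from a frontend service straight into a database.
  const direct = diagram.connections.some((c) => {
    const src = byId[c.from];
    const dst = byId[c.to];
    return (
      src !== undefined &&
      dst !== undefined &&
      src.category === "frontend" &&
      dst.category === "database"
    );
  });
  if (direct) {
    patterns.push("direct-db-access");
  }

  return patterns;
}
```

Because checks like these are plain predicates over in-memory arrays, running them in the browser in about a millisecond is plausible, and the same diagram always produces the same patterns.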
There are two kinds of rules:

Architecture-pattern rules

These fire when a topology anti-pattern is detected:

| Pattern | Detection trigger |
| --- | --- |
| single-region | No global LB (Traffic Manager / Front Door) with ≥3 services |
| single-database | Exactly one database service, no replication signal |
| no-cache | Compute + database present, no Redis/CDN |
| no-monitoring | No Azure Monitor / App Insights / Log Analytics |
| no-identity | No Microsoft Entra ID |
| no-waf | Public web tier without WAF / Front Door / App Gateway |
| direct-db-access | An edge from a frontend service directly into a database |
| no-key-vault | 4+ services and no Key Vault |
| no-backup | Database present, no Azure Backup / Recovery Services |
| no-api-gateway | 2+ compute services and no APIM / App Gateway / Front Door |

Service-specific rules

Every service in the generated Azure architecture diagram is matched against SERVICE_SPECIFIC_RULES by normalized type: App Service, Functions, AKS, Cosmos DB, SQL Database, Storage, Key Vault, and 22 more.

The knowledge base at a glance

| Metric | Count |
| --- | --- |
| Total rules | 73 |
| Architecture-pattern rules | 10 |
| Service-specific rules | 63 |
| Distinct Azure services covered | 29 |
| Rules tagged Reliability | 18 |
| Rules tagged Security | 34 |
| Rules tagged Cost Optimization | 5 |
| Rules tagged Operational Excellence | 7 |
| Rules tagged Performance Efficiency | 9 |

The preliminary score

Each finding has a severity, and severity drives a fixed point deduction from a starting score of 100:

| Severity | Deduction |
| --- | --- |
| critical | −12 |
| high | −7 |
| medium | −3 |
| low | −1 |

The result is floored at 10 (so even a deliberately bad architecture scores at least 10) and capped at 95 (no findings ≠ perfect, because there's always something the model might still catch). This is the deterministic baseline before the LLM ever sees the architecture, and it's what makes the pipeline reproducible.
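The preliminary score is simple enough to write down directly. The sketch below applies the deduction table, then the floor of 10 and the cap of 95. The function and constant names are illustrative and are not taken from the project's source.

```typescript
type Severity = "critical" | "high" | "medium" | "low";

// Fixed point deductions per finding, as listed in the severity table.
const DEDUCTION: Record<Severity, number> = {
  critical: 12,
  high: 7,
  medium: 3,
  low: 1,
};

// Starts at 100, subtracts one deduction per finding,
// then clamps the result to the [10, 95] range.
function preliminaryScore(severities: Severity[]): number {
  let raw = 100;
  for (const s of severities) {
    raw -= DEDUCTION[s];
  }
  return Math.min(95, Math.max(10, raw));
}
```

An empty findings list returns 95 here, not 100, because the cap applies even when nothing is deducted. One critical, one high, three medium, and three low findings give 100 − 12 − 7 − 9 − 3 = 69.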
Phase 2: LLM contextual refinement

The pre-scan output, the topology, and the optional natural-language description are folded into a focused prompt sent to one of seven Azure OpenAI models (GPT-5.1 through 5.4, GPT-5.x Codex variants, DeepSeek V3.2 Speciale, Grok 4.1 Fast).

The system prompt gives the model explicit scoring guardrails:
- Score based on what IS present, not what COULD be added.
- A well-connected architecture with appropriate services should score 60–80.
- Score below 50 only for critical gaps (no auth, no monitoring, single points of failure).
- Findings are improvement suggestions, not reasons to penalize the score severely.

The model returns strict JSON:

    {
      "overallScore": 0-100,
      "summary": "2–3 sentence assessment",
      "pillars": [
        {
          "pillar": "Reliability | Security | Cost Optimization | Operational Excellence | Performance Efficiency",
          "score": 0-100,
          "findings": [
            {
              "severity": "critical | high | medium | low",
              "category": "...",
              "issue": "...",
              "recommendation": "...",
              "resources": ["service-name-1", "service-name-2"],
              "source": "rule-based | ai-analysis"
            }
          ]
        }
      ],
      "quickWins": [ /* same shape as findings */ ]
    }

Two things to call out:
- Every finding is tagged rule-based or ai-analysis. That tag is the credibility lever. You can always see what the deterministic engine produced versus what the model contributed on top. If you don't trust the AI layer, you can ignore it entirely, and the rule layer still stands.
- The LLM is given pattern hints, not the entire rule catalog. The prompt stays small and focused, which is roughly 3–5× faster and cheaper than asking the LLM to do everything from scratch.
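The response shape above maps naturally onto TypeScript types. The sketch below shows one way a caller might type the JSON, run a few basic sanity checks, and separate findings by their `source` tag. The interface and function names are assumptions for illustration, and the checks are deliberately minimal. This is not the validator's actual parsing code.

```typescript
type FindingSeverity = "critical" | "high" | "medium" | "low";
type FindingSource = "rule-based" | "ai-analysis";

interface Finding {
  severity: FindingSeverity;
  category: string;
  issue: string;
  recommendation: string;
  resources: string[];
  source: FindingSource;
}

interface PillarResult {
  pillar: string;
  score: number;
  findings: Finding[];
}

interface ValidationResult {
  overallScore: number;
  summary: string;
  pillars: PillarResult[];
  quickWins: Finding[];
}

// Parses the model output and rejects obviously malformed responses.
function parseValidation(raw: string): ValidationResult {
  const data = JSON.parse(raw) as ValidationResult;
  if (
    typeof data.overallScore !== "number" ||
    data.overallScore < 0 ||
    data.overallScore > 100
  ) {
    throw new Error("overallScore must be a number between 0 and 100");
  }
  if (!Array.isArray(data.pillars)) {
    throw new Error("pillars must be an array");
  }
  for (const p of data.pillars) {
    if (!Array.isArray(p.findings)) {
      throw new Error("each pillar needs a findings array");
    }
  }
  return data;
}

// Collects findings from every pillar that carry the given source tag.
function findingsBySource(
  result: ValidationResult,
  source: FindingSource
): Finding[] {
  const out: Finding[] = [];
  for (const p of result.pillars) {
    for (const f of p.findings) {
      if (f.source === source) {
        out.push(f);
      }
    }
  }
  return out;
}
```

Calling `findingsBySource(result, "rule-based")` is the code-level version of ignoring the AI layer: it returns only what the deterministic engine produced.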
What the user sees

On every run the modal reports:
- Overall WAF score (0–100)
- Per-pillar score × 5 (0–100 each)
- Severity breakdown: counts of critical / high / medium / low across all findings
- Quick wins: high-impact, low-effort items the model surfaces separately
- Hybrid metadata: local findings count, patterns detected, KB rules used, preliminary score, local elapsed ms
- AI metrics: model used, reasoning effort, prompt/completion/total tokens, elapsed time
- App Insights telemetry: an Architecture_Validated event with model, overall score, finding count, elapsed time

Worked example

Take this prompt, which I've used in demos with partners:

"A multi-region web application: Azure Front Door in front of two App Service instances in West US 2 and East US 2, both reading from an Azure SQL Database with geo-replication, with Application Insights for telemetry. No Entra ID, no Key Vault."

After generation, Validate Architecture runs:

Phase 1: pre-scan (deterministic), ~1 ms
- Patterns detected: no-identity, no-key-vault
- Findings produced: 8 (1 critical, 1 high, 3 medium, 3 low)
- Preliminary score: 100 − 12 − 7 − (3×3) − (3×1) = 69

Phase 2: LLM refinement, ~6–9 s depending on model

The model accepts the two pattern hints, validates them in context, and adds three more findings of its own:

| Finding | Source | Pillar | Severity |
| --- | --- | --- | --- |
| No Microsoft Entra ID for authentication | rule-based | Security | critical |
| No Key Vault for secret management | rule-based | Security | high |
| App Service slots not used for safe deploys | ai-analysis | Operational Excellence | medium |
| SQL DB geo-replication present but RTO/RPO not documented | ai-analysis | Reliability | medium |
| No CDN for static assets behind Front Door | ai-analysis | Performance Efficiency | low |

Final scores returned by the model:

| Pillar | Score |
| --- | --- |
| Reliability | 78 |
| Security | 52 |
| Cost Optimization | 80 |
| Operational Excellence | 70 |
| Performance Efficiency | 75 |
| Overall | 71 |

The Security score is the lowest because two of the highest-severity findings landed there, exactly what a human
reviewer would flag first.

Multi-model comparison

Because the deterministic floor is identical across runs, the Validation Comparison view becomes a fair shootout of what each LLM adds on top of the same baseline. The same diagram is scored by all seven models, and the UI surfaces:
- Overall score per model
- Per-pillar score per model
- Severity-count deltas
- Number of ai-analysis findings each model contributed
- Quick wins each model identified

This is genuinely useful for two reasons. First, it shows that LLM scores vary, typically by ±5–10 points on the same architecture, which is exactly why we publish the rule-based vs ai-analysis tag. Second, it lets architects pick the model whose review style matches their own.

How we align with Microsoft's algorithms

| Alignment point | What it means |
| --- | --- |
| Same five pillars | Identical names and scope to the official WAF |
| Same source material | Rules derived from WAF docs and Azure Architecture Center service guides |
| Severity-graded findings | Map conceptually to Advisor's high/medium/low impact recommendations |
| Per-pillar + overall scoring | Mirrors WAR/Advisor output shape, so the results feel familiar |

Where we deliberately differ, and why

| Concern | Microsoft | Diagram Builder | Why we differ |
| --- | --- | --- | --- |
| Needs deployed resources | Advisor: yes | No (works on a diagram) | We're a design-time tool; the architecture doesn't exist yet |
| Needs human Q&A | WAR: yes | No (derived from the diagram) | One-click validation inside the design flow |
| Healthy/Applicable ratio | Advisor: yes | No | No resource-health signal exists pre-deployment |
| Subcategory fixed weights | Advisor: yes | No explicit weights | Severity is the de-facto weight (12/7/3/1) |
| Defender Secure Score for Security | Advisor: yes | No | Defender requires deployed resources |
| Cost-weighted scoring | Advisor: yes | No (separate Cost Estimation feature) | Cost is a separate pipeline in our app |
| AI/LLM refinement | Neither | Yes | Catches context-specific issues a static catalog misses, and explains findings in natural language |
| Multi-model comparison | Neither | Yes | Lets architects see scoring variance across models |

Honest limitations

I'd rather you hear these from me than discover them in production:
- LLM scores drift. ±5–10 points across models on the same diagram is normal. Treat the score as directional, the findings as actionable. The rule-based tag is your anchor.
- No live telemetry. We can't know if your App Service is actually using availability zones, only that you have App Service in the diagram. Advisor will tell you the truth post-deployment.
- Generic ruleset. No specialized workload branches yet (AI/ML, IoT, SAP, SaaS). WAR has those.
- No milestone tracking. Each validation run is independent. Compare runs manually using the Validation Comparison view.
- Rule coverage is finite. 29 services and 73 rules is a strong start but not exhaustive. The LLM layer exists in part to compensate for that gap.

How to use all three together

A lifecycle that actually works:
1. Design: Use the Diagram Builder to sketch the architecture and validate at design time. Iterate until the per-pillar scores look reasonable and the critical/high findings are addressed.
2. Deploy: Generate Bicep from the diagram, deploy, and let Azure Advisor start scoring real resources.
3. Operate: Use Azure Advisor continuously. Use Defender Secure Score for security posture.
4. Periodic review: Run a Core WAR every quarter or at major milestones to capture the things only humans know (business context, tradeoffs, planned debt).

None of these three replaces the others. They cover different stages of the same loop.

What's next

A few things on the roadmap I'd love feedback on:
- Milestone tracking so design-time scores can be compared over time the way WAR milestones work.
- Workload-specific rulesets mirroring WAR's branches, starting with AI/ML.
- Direct Advisor handoff: once a diagram is deployed, surface the corresponding Advisor recommendations in the same UI to close the loop.
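Since score drift across models is listed as a known limitation, it can help to quantify it when comparing runs by hand. The sketch below summarizes the overall scores from several runs on the same diagram. The types and the function are hypothetical helpers for illustration, not part of the Validation Comparison view.

```typescript
// One validation run of the same diagram by a given model (hypothetical shape).
interface ModelRun {
  model: string;
  overallScore: number;
}

interface ScoreSpread {
  min: number;
  max: number;
  mean: number;
  spread: number; // max minus min, in points
}

// Summarizes how far overall scores diverge across runs.
function scoreSpread(runs: ModelRun[]): ScoreSpread {
  if (runs.length === 0) {
    throw new Error("at least one run is required");
  }
  let min = Infinity;
  let max = -Infinity;
  let sum = 0;
  for (const r of runs) {
    if (r.overallScore < min) min = r.overallScore;
    if (r.overallScore > max) max = r.overallScore;
    sum += r.overallScore;
  }
  return { min, max, mean: sum / runs.length, spread: max - min };
}
```

A spread of around 10 points is consistent with the ±5–10 drift described above. A much larger spread would be a reason to lean on the rule-based findings and read the individual ai-analysis findings more closely.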
Try it, fork it, tell me where it's wrong

- Live app: https://aka.ms/diagram-builder
- Source: github.com/Arturo-Quiroga-MSFT/azure-architecture-diagram-builder

Useful references:
- Azure Well-Architected Framework pillars
- Azure Well-Architected Review tool
- Azure Advisor score (calculation)
- Use Azure WAF assessments (Advisor)
- Complete an Azure Well-Architected Review assessment

If you're a partner or customer architect who's already living in Advisor and WAR, I'd genuinely value your reaction: does the design-time stage feel like a real gap to you, or are you already covering it some other way? Open an issue on the repo or reply on LinkedIn.

Posted on the Azure Architecture Blog · Comments and issues welcome on the repo.
arturoqu
May 21, 2026 Place Azure Architecture Blog
From Prompt to Production: Building Azure Architecture Diagrams with AI
Author: Arturo Quiroga, Senior Partner Solutions Architect — Microsoft Cloud architects spend significant time translating ideas into architecture diagrams. They toggle between Visio, draw.io, pricing calculators, and documentation. According to the 2024 Stack Overflow Developer Survey, 61% of developers spend more than 30 minutes a day searching for answers or solutions, time lost to context-switching rather than design. What if you could describe your architecture in plain English and get a diagram, cost estimate, and deployment guide in minutes? The Challenge: Fragmented Architecture Workflows Designing Azure architectures today typically involves multiple disconnected steps: Sketch the architecture in a diagramming tool Look up official Azure icons and drag them into place Research pricing across regions using the Azure Pricing Calculator Validate the design against the Well-Architected Framework (WAF) Write deployment documentation and Infrastructure as Code templates Compare alternative designs manually Each step lives in a different tool, and keeping them in sync as designs evolve is costly. The Azure Architecture Diagram Builder brings these workflows together in a single browser-based experience. How It Works Describe your architecture in natural language, for example "A HIPAA-compliant healthcare platform with FHIR APIs, event-driven processing, and multi-region disaster recovery", and the AI generates a diagram with grouped services, data flow connections, and logical organization. Figure 1. Enter a natural-language prompt describing your architecture. Curated example prompts help you get started, and you can optionally upload an existing diagram for the AI to analyze. The tool uses Azure OpenAI to power generation across multiple models, enabling you to choose the model that best fits your scenario — from fast iterations to deeper reasoning. 
Key Features AI-Powered Architecture Generation Describe what you need in plain English, and the AI creates an architecture diagram with: 714 official Azure service icons across 29 categories Smart grouping: services are logically organized (Frontend, Backend, Data, Security) Data flow connections: labeled edges showing how data moves through the system 13 curated example prompts: from simple web apps to complex enterprise scenarios like Zero Trust networks, Industrial IoT with 5,000+ sensors, and global multiplayer gaming backends Figure 2. A generated industrial IoT architecture. Top: the clean diagram view as initially produced. Bottom: the same diagram with per-service monthly cost overlays toggled on, plus a running subscription total in the toolbar. Architecture Image Import Already have an architecture on a whiteboard or in a screenshot? Upload the image and let the AI analyze it, mapping services to official Azure icons and recreating the architecture as an editable, interactive diagram. Figure 3. Upload a photo of a whiteboard sketch (top-right reference panel) and the AI recreates it as an editable diagram with official Azure service icons and labeled data flow connections. ARM Template Import Import existing ARM templates to visualize your current infrastructure. The AI parses resource definitions and dependencies, groups related resources into logical layers, and produces a meaningful diagram of what you actually have deployed — a fast way to document an inherited environment or sanity-check a template before deployment. Figure 4. ARM template import in action. Top: the parser status banner while resources and dependencies are being analyzed. Bottom: the resulting diagram, with resources auto-grouped into logical layers (Web Tier, Data Layer, Container Platform, Observability & Logging) and a Generated from: ARM Template badge linking the diagram back to its source file. 
Well-Architected Framework Validation Validate your architecture against all five WAF pillars — Security, Reliability, Performance Efficiency, Cost Optimization, and Operational Excellence. The validator provides: An overall WAF score with pillar-level breakdowns Specific findings with severity levels Actionable recommendations you can select and apply Select the recommendations you agree with, and the AI regenerates an improved architecture incorporating those changes. Figure 5. WAF validation results showing the overall score, per-pillar breakdowns, and individual findings with severity badges. Tick the recommendations you want and the AI rebuilds the diagram with those changes applied. Multi-Model Comparison Run the same architecture prompt through multiple AI models side-by-side and compare: Architecture Comparison: service counts, connection counts, groups, token usage, and latency Validation Comparison: WAF scores across models, severity breakdowns, and finding counts Apply Winner: pick the best result and apply it to the canvas with one click Present Critique: a talking avatar narrates the AI-generated ranking with live closed captions Figure 6. Multi-model comparison. Top: select the models and reasoning effort, then enter the prompt. Bottom: side-by-side results across all selected models with service counts, latency, token usage, and Fastest / Cheapest / Most Thorough badges. Multi-Region Cost Estimation Get cost estimates from the Azure Retail Prices API across 8 Azure regions: East US 2, Australia East, Canada Central, Brazil South, Mexico Central, West Europe, Sweden Central, and Southeast Asia. Features include: Color-coded cost legend (green / yellow / red thresholds) SKU and tier information for each service Export options: CSV, JSON, plain-text summary, and an analysis report with top cost drivers, Reserved Instance flags, and a ranked multi-region comparison table Figure 7. 
The cost legend overlay shows per-service pricing with color-coded thresholds. The region selector in the toolbar lets you re-price the entire architecture in any of eight Azure regions. Deployment Guide Generation with Bicep Generate step-by-step deployment documentation including: Prerequisites and Azure resource requirements Step-by-step deployment instructions Bicep templates for each service (Infrastructure as Code) Post-deployment verification steps Security configuration recommendations Figure 8. Each generated Deployment Guide opens with the architecture name, an estimated deployment time, and a prerequisites checklist covering subscription roles, CLI versions, Microsoft Entra ID permissions, and region requirements, followed by numbered, copy-ready deployment steps. Figure 9. The Infrastructure as Code section produces a main.bicep orchestrator plus a per-service module (Log Analytics, Key Vault, Cosmos DB, SQL Database, Event Hubs, Azure Functions, and more). The Download All Templates button packages everything into a ready-to-deploy folder. Workflow Animation & Avatar Presenter Visualize how data flows through your architecture with step-by-step animations that highlight services on the canvas as each step plays. When the Azure Speech Service is configured, a photorealistic talking avatar can narrate the workflow or present model comparison results, with live word-by-word closed captions in a draggable, resizable panel. Figure 10. A workflow step is highlighted on the canvas as the Avatar Presenter narrates that step. Live word-by-word closed captions appear in a draggable, resizable panel, useful for accessibility and stakeholder demos. Export Options Figure 11. A single-slide PowerPoint export, available in dark or light theme, ready to drop straight into a stakeholder deck. 
| Format | Use Case |
| --- | --- |
| PNG | Documentation, presentations |
| SVG | Scalable vector graphics |
| PPTX | Single PowerPoint slide (dark or light theme) |
| Draw.io | Edit in diagrams.net |
| JSON | Backup, version control |
| CSV / ZIP | Cost analysis with multi-region comparison |

Highlights

The Azure Architecture Diagram Builder unifies the architecture design lifecycle in a single tool:
- End-to-end workflow: from natural-language description to deployable Bicep templates without tool switching
- Official Azure icons: 714 icons across 29 categories, mapped directly from the Azure service catalog
- Live pricing: queries the Azure Retail Prices API at design time rather than relying on static estimates
- WAF-integrated validation: architectural best practices built into the design loop rather than applied after the fact
- Multi-model flexibility: choose the AI model that best suits each task, with fast models for iteration and reasoning models for complex designs
- Open source: the source code is available for customization and contribution

One-Command Deploy with Azure Developer CLI

The fastest way to get your own instance running is with azd:

    # Install azd (once)
    brew tap azure/azd && brew install azd   # macOS
    winget install microsoft.azd             # Windows

    # Clone, configure, and deploy
    git clone https://github.com/Arturo-Quiroga-MSFT/azure-architecture-diagram-builder
    cd azure-architecture-diagram-builder
    azd auth login
    azd env set AZURE_OPENAI_ENDPOINT "https://your-resource.openai.azure.com/"
    azd env set AZURE_OPENAI_API_KEY "your-key"
    azd up   # Provisions infrastructure + builds + deploys (~8 min)

azd up provisions the following via Bicep:

| Resource | Purpose |
| --- | --- |
| Azure Container Registry | Stores the Docker image |
| Azure Container Apps | Runs the app (nginx + token server) |
| Log Analytics + Application Insights | Monitoring and telemetry |
| Azure Speech (S0) | Avatar Presenter (optional, keyless auth via managed identity) |

Try It Today

The Azure Architecture Diagram Builder is available now:
- Live demo: https://aka.ms/diagram-builder
- Source code: GitHub repository
- Documentation: See the Getting Started Guide for detailed setup instructions

We welcome feedback and contributions. Use the GitHub Issues page to report bugs, suggest features, or share your experience.

Tags: artificial intelligence · application · apps & devops · well architected · infrastructure
arturoqu
May 19, 2026 Place Azure Architecture Blog
Advancing to Agentic AI with Azure NetApp Files VS Code Extension v1.2.0
The Azure NetApp Files VS Code Extension v1.2.0 introduces a major leap toward agentic, AI‑informed cloud operations with the debut of the autonomous Volume Scanner. Moving beyond traditional assistive AI, this release enables intelligent infrastructure analysis that can detect configuration risks, recommend remediations, and execute approved changes under user governance. Complemented by an expanded natural language interface, developers can now manage, optimize, and troubleshoot Azure NetApp Files resources through conversational commands - from performance monitoring to cross‑region replication, backup orchestration, and ARM template generation. Version 1.2.0 establishes the foundation for a multi‑agent system built to reduce operational toil and accelerate a shift toward self-managing enterprise storage in the cloud.
GeertVanTeylingen
Apr 09, 2026 Place Azure Architecture Blog
Deploy PostgreSQL on Azure VMs with Azure NetApp Files: Production-Ready Infrastructure as Code
PostgreSQL is a popular open‑source cloud database for modern web applications and AI/ML workloads, and deploying it on Azure VMs with high‑performance storage should be simple. In practice, however, using Azure NetApp Files requires many coordinated steps—from provisioning networking and storage to configuring NFS, installing and initializing PostgreSQL, and maintaining consistent, secure, and high‑performance environments across development, test, and production. To address this complexity, we’ve built production‑ready Infrastructure as Code templates that fully automate the deployment, from infrastructure setup to database initialization, ensuring PostgreSQL runs on high‑performance Azure NetApp Files storage from day one.
GeertVanTeylingen
Jan 15, 2026 Place Azure Architecture Blog
What's New with Azure NetApp Files VS Code Extension
The latest update to the Azure NetApp Files (ANF) VS Code Extension introduces powerful enhancements designed to simplify cloud storage management for developers. From multi-tenant support to intuitive right-click mounting and AI-powered commands, this release focuses on improving productivity and streamlining workflows within Visual Studio Code. Explore the new features, learn how they accelerate development, and see why this extension is becoming an essential tool for cloud-native applications.
GeertVanTeylingen
Jan 15, 2026 Place Azure Architecture Blog
Streamline Azure NetApp Files Management—Right from Your IDE
The Azure NetApp Files VS Code Extension is designed to streamline storage provisioning and management directly within the developer’s IDE. Traditional workflows often require extensive portal navigation, manual configuration, and policy management, leading to inefficiencies and context switching. The extension addresses these challenges by enabling AI-powered automation through natural language commands, reducing provisioning time from hours to minutes while minimizing errors and improving compliance. Key capabilities include generating production-ready ARM templates, validating resources, and delivering optimization insights—all without leaving the coding environment.
GeertVanTeylingen
Dec 15, 2025 Place Azure Architecture Blog
Accelerating Cloud-Native Development with AI-Powered Azure NetApp Files VS Code Extension
Streamlining enterprise storage provisioning through intelligent automation and developer-centric tooling.
GeertVanTeylingen
Oct 28, 2025 Place Azure Architecture Blog
Armchair Architects: Architects vs. The Ivory Tower
Do software architects really live in Ivory Towers, descending only to share their wisdom?
EricCharran
Aug 06, 2025 Place Azure Architecture Blog
Azure NetApp Files solutions for three EDA Cloud-Compute scenarios
Table of Contents
- Abstract
- Introduction
- EDA Cloud-Compute scenarios
- Scenario 1: Burst to Azure from on-premises Data Center
- Scenario 2: “24x7 Single Set Workload”
- Scenario 3: "Data Center Supplement"
- Summary

Abstract

Azure NetApp Files (ANF) is transforming Electronic Design Automation (EDA) workflows in the cloud by delivering unparalleled performance, scalability, and efficiency. This blog explores how ANF addresses critical challenges in three cloud compute scenarios: Cloud Bursting, 24x7 All-in-Cloud, and Cloud-based Data Center Supplement. These solutions are tailored to optimize EDA processes, which rely on high-performance NFS file systems to design advanced semiconductor products. With the ability to support clusters exceeding 50,000 cores, ANF enhances productivity, shortens design cycles, and eliminates infrastructure concerns, making it the default choice for EDA workloads in Azure. Additionally, innovations such as increased L3 cache and the transition to DDR5 memory enable performance boosts of up to 60%, further accelerating the pace of chip design and innovation.

Co-authors:
- Andy Chan, Principal Product Manager Azure NetApp Files
- Arnt de Gier, Technical Marketing Engineer Azure NetApp Files

Introduction

Azure NetApp Files (ANF) solutions support three major cloud compute scenarios running Electronic Design Automation (EDA) in Azure:
- Cloud Bursting
- 24x7 All-in-Cloud
- Cloud-based Data Center Supplement

ANF solutions can address the key challenges associated with each scenario. By providing an optimized solution stack for EDA engineers, ANF will increase productivity and shorten design cycles, making ANF the de facto standard file system for running EDA workloads in Azure.

Electronic Design Automation (EDA) processes comprise a suite of software tools and workflows used to design semiconductor products such as advanced computer processors (chips), all of which need high-performance NFS file system solutions.
The increasing demand for chips with superior performance, reduced size, and lower power consumption (PPA) is driven by today's rapid pace of innovation to power workloads such as AI. To meet this growing demand, EDA tools require numerous nodes and multiple CPUs (cores) in a cluster. This is where Azure NetApp Files (ANF) comes into play with its high-performance, scalable file system. ANF ensures that data is efficiently delivered to these compute nodes. This means a single cluster, sometimes encompassing more than 50,000 cores, can function as a unified entity, providing both the scale-out performance and the consistency that are essential for designing advanced semiconductor products. ANF is the most performance-optimized NFS storage in Azure, making it the de facto solution for EDA workloads.

According to Philip Steinke, AMD's Fellow of CAD Infrastructure and Physical Design, the main priority is to maximize the productivity of chip designers by eliminating infrastructure concerns related to compute and file system expansion typically experienced with on-premises deployments that require long planning cycles and significant capital expenditure.

In register-transfer level (RTL) simulations, Microsoft Azure showcased that moving to a CPU with greater amounts of L3 cache can give EDA users a performance boost of up to 60% for their workloads. This improvement is attributed to increased L3 cache, higher clock speeds (instructions per cycle), and the transition from DDR4 to DDR5 memory.

Azure’s commitment to providing high-performing, on-demand HPC (High-Performance Computing) infrastructure is a well-known advantage and has become the primary reason EDA companies are increasingly adopting Azure for their chip design needs.

In this paper, three different scenarios of Azure for EDA are explored, namely “Cloud Bursting”, “24x7 Single Set Workload” and “Data Center Supplement”, as a reference framework to help guide engineers on their Azure for EDA journey.
EDA Cloud-Compute scenarios

The following sections delve into three key scenarios that address the computational needs of EDA workflows: “Cloud Bursting,” “24x7 Single Set Workload,” and “Data Center Supplement.” Each scenario highlights how Azure's robust infrastructure, combined with high-performance solutions like Azure NetApp Files, enables engineering teams to overcome traditional limitations, streamline chip design processes, and significantly enhance productivity.

Scenario 1: Burst to Azure from on-premises Data Center

An EDA workload is made up of a series of workflows, some of whose steps are bursty. This can lead to periods in semiconductor project cycles where compute demand exceeds the capacity of the on-premises HPC server cluster. Many EDA customers have been bursting to Azure to speed up their engineering projects. In one example, a total of 120,000 cores were deployed across many clusters, all well supported by the high-performance capabilities of ANF.

As design projects approach completion, the design is continuously and incrementally modified to fix bugs and synthesis and timing issues, to optimize area, timing, and power, to resolve issues associated with manufacturing design rule checks, and so on. When design changes are made, many if not all of the design steps must be re-run to ensure the change did not break the design. As a result, “design spins” or “large regression” jobs put a large compute demand on the HPC server cluster. This leads to long job scheduler queues (IBM LSF and Univa Grid Engine are two common schedulers for EDA) where jobs wait to be dispatched to an available compute node.

Competing project schedules are another reason HPC server cluster demand can exceed fixed on-premises capacity. Most engineering divisions within a company share infrastructure resources across teams and projects, which inevitably leads to oversubscription of compute capacity and long job queues, resulting in production delays.
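To make the capacity gap described above concrete, here is a minimal sketch that sizes a burst from a snapshot of the scheduler queue. The function names, the job sizes, and the 96-core node size are hypothetical illustrations; this is not an LSF or Grid Engine API.

```python
import math


def cores_to_burst(pending_job_cores: list[int], free_on_prem_cores: int) -> int:
    """Cores of queued demand that the on-premises cluster cannot serve right now."""
    demand = sum(pending_job_cores)
    return max(0, demand - free_on_prem_cores)


def cloud_nodes_needed(burst_cores: int, cores_per_node: int) -> int:
    """Whole cloud nodes required to cover the overflow demand."""
    return math.ceil(burst_cores / cores_per_node)


# Hypothetical queue: four regression jobs waiting, 1,000 cores free on-premises.
pending = [64, 128, 512, 2048]
burst = cores_to_burst(pending, free_on_prem_cores=1000)
nodes = cloud_nodes_needed(burst, cores_per_node=96)
print(burst, nodes)  # 1752 19
```

A real burst decision would also weigh EDA license availability and how much data has to be staged in the cloud, neither of which this sketch models.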
Bursting EDA jobs into Azure, with its available compute capacity, is a way to alleviate these delays. For example, Azure’s latest CPU offering can deliver up to 47% shorter turnaround times for RTL simulation than on-premises. Engineering management tries to increase productivity through effective use of EDA tool licensing. Utilizing Azure's on-demand compute resources and high-performance storage solutions like Azure NetApp Files enables engineering teams to accelerate design cycles and reduce Non-recurring Engineering (NRE) costs, enhancing productivity significantly.

“Burst to Azure” scenarios give engineers quick access to compute resources to finish a job without worrying about the underlying NFS infrastructure and its traditionally complex management overhead. For these scenarios, ANF delivers:

High Performance: Up to 826,000 IOPS per large volume, serving data for the most demanding simulations with ease and reducing turnaround time.
Scalability: As EDA projects advance, the data generated can grow exponentially. ANF provides large-capacity single namespaces with volumes up to 2 PiB, enabling your storage solution to scale seamlessly while supporting compute clusters with more than 50,000 cores.
Ease of Use: ANF is designed for simplicity, with a SaaS-like user experience that allows deployment and management with a few clicks or through API automation. Because storage can be deployed rapidly, engineering teams can access their EDA HPC hardware quickly for their jobs.
Cost-Effectiveness: ANF offers cool access, which transparently moves ‘cold’ data blocks to lower-cost Azure Storage. Additionally, Reserved Capacity (RC) can provide significant cost savings compared to pay-as-you-go pricing, and avoids the high upfront CapEx costs and long procurement cycles associated with on-premises storage solutions. Use the ANF effective pricing estimator to estimate your savings.
Reliability and Security: ANF provides enterprise-grade data management and security features, with key management and encryption built in, ensuring that your critical EDA data is protected and available when you need it.

Scenario 2: “24x7 Single Set Workload”

As Azure for EDA has matured and the value of giving engineers readily available, faster HPC infrastructure has become more widely recognized, more users are now moving entire sets of workloads that run 24x7 into Azure. In addition to SPICE or RTL simulations, one such set of workloads is “digital signoff”, with the same goal of increasing productivity. Scenario 1 concerns cloud bursting, which involves batch processes that need high performance and rapid deployment, whereas Scenario 2 involves operating a set of workloads with additional ANF capabilities for data security and user control needs.

QoS support: ANF's QoS function fine-tunes storage utilization by establishing a direct correlation between volume size (quota) and performance, which sets the storage limits an EDA tool or workload may have access to.
Snapshot data protection: As more users are using Azure resources, data protection is crucial. ANF snapshots protect primary data frequently and efficiently, allowing fast recovery from corruption or loss by restoring a volume to a snapshot in seconds or by restoring individual files from a snapshot. For this reason, enabling snapshots is also recommended for user home directories and group shares.
Large volume support: A set of workloads generates more output than a single workload, so ANF’s large volume support is being widely adopted by EDA users in this scenario. ANF now supports single volumes up to 2 PiB in size, allowing more fine-tuned management of users' storage footprint.
Cool access: Cool access is an ANF feature that enables better cost control because only data that is being worked on at any given time remains in the hot tier.
This functionality enables inactive data blocks from the volume and volume snapshots to be transferred from the hot tier to an Azure storage account (the cool tier), saving cost. Because EDA workloads are known to be metadata heavy, ANF does not relocate metadata to the cool tier, ensuring that metadata operations perform as expected.
Dynamic capacity pool resizing: Cloud compute resources can be dynamically allocated. To support this deployment model, Azure NetApp Files (ANF) also offers dynamic pool resizing, which further enhances Azure-for-EDA's value proposition. If the size of the pool remains constant but performance requirements fluctuate, enabling dynamic provisioning and deprovisioning of capacity pools of different types provides just-in-time performance. This approach lowers costs during periods when high performance is not needed.
Reserved Capacity: Azure allows compute resources to be reserved as a way to guarantee access to that capacity while providing significant cost savings compared to the standard "pay-as-you-go" pricing model. This Azure offering is also available for ANF. Reservations are available in units of 100 TiB and 1 PiB per month, for a one- or three-year term, for a particular service level within a region.

Scenario 3: "Data Center Supplement"

This scenario builds on Scenarios 1 and 2: EDA users expand their workflow into Azure and use it as their data center. In this scenario, a mixed EDA flow is deployed with tools from several EDA ISVs, spanning frontend, backend, and analog mixed-signal. As an example of Scenario 3, EDA companies such as d-Matrix were able to design an entire AI chip entirely in Azure. In this data center supplement scenario, data mobility and additional data life cycle management solutions are essential.
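Two of the Scenario 2 features, QoS support and Reserved Capacity, come down to simple arithmetic, sketched below. The per-TiB throughput values are assumptions based on the publicly documented auto QoS service levels and should be checked against current documentation. The greedy unit split ignores pricing, so it is only one possible way to cover a capacity target. All function names are hypothetical.

```python
import math

# Assumed auto QoS throughput per provisioned TiB, in MiB/s, by service level.
THROUGHPUT_PER_TIB_MIBS = {"Standard": 16, "Premium": 64, "Ultra": 128}


def auto_qos_throughput_mibs(quota_tib: float, service_level: str) -> float:
    """Throughput ceiling implied by volume quota (ignores per-volume maximums)."""
    return quota_tib * THROUGHPUT_PER_TIB_MIBS[service_level]


def reservation_units(capacity_tib: float) -> dict[str, int]:
    """Greedy split of a capacity target into 1-PiB and 100-TiB reservation units."""
    pib_units = int(capacity_tib // 1024)          # 1 PiB = 1024 TiB
    remainder_tib = capacity_tib - pib_units * 1024
    tib100_units = math.ceil(remainder_tib / 100)  # round up to cover the rest
    return {"1PiB": pib_units, "100TiB": tib100_units}


print(auto_qos_throughput_mibs(50, "Ultra"))  # 6400
print(reservation_units(1500))                # {'1PiB': 1, '100TiB': 5}
```

Because quota and throughput are tied together in this model, growing a volume is also a way to raise its performance ceiling, which is the correlation the QoS item describes.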
Once again, Azure NetApp Files (ANF) rises to the challenge by offering additional features within its solution stack:

Backup support: ANF has a policy-based backup feature that uses AES-256 encryption during the encoding of the received backup data. Backup frequency is defined by a policy.
Cross-region replication: ANF data can be replicated asynchronously between Azure NetApp Files volumes (source and destination) with cross-region replication. The source and destination volumes must be deployed in different Azure regions. The service level of the destination capacity pool can be the same as or different from the source, allowing customers to meet their data protection needs as efficiently as possible.
Cross-zone replication: Similar to cross-region replication, the cross-zone replication (CZR) capability provides data protection between volumes in different availability zones. You can asynchronously replicate data from an Azure NetApp Files volume (source) in one availability zone to another Azure NetApp Files volume (destination) in another availability zone. This capability enables you to fail over your critical application if a zone-wide outage or disaster happens.
BC/DR: Users can construct a solution that fits their own goals by using a variety of BC/DR templates that include snapshots, various replication types, failover capabilities, backup, and support for REST API, Azure CLI, and Terraform.

Summary

The integration of ANF into the EDA workflow addresses the limitations of traditional on-premises infrastructure. By leveraging the latest CPU generations and Azure's on-demand HPC infrastructure, EDA users can achieve significant performance gains and improve productivity, all while connected to a highly optimized, performant file system that is simple to deploy and support.
The three Azure for EDA scenarios—Cloud Bursting, 24x7 Single Set Workload, and Data Center Supplement—showcase Azure's adaptability and effectiveness in fulfilling the changing needs of the semiconductor industry. As a result, ANF has become the default NFS solution for EDA in Azure, allowing businesses to innovate even faster.
GeertVanTeylingen
May 13, 2025 Place Azure Architecture Blog
Armchair Architects: What Is Responsible AI?
We will dive deeper into the meaning of “what is responsible AI?” and what it entails. It sounds like a cool concept, but it may not be well understood, so let’s have the armchair architects share their views on it.
AriyaKhamvongsa
Apr 19, 2024 Place Azure Architecture Blog