Azure Architecture Blog

Architecture to Resilience: A Decision Guide

varghesejoji, Microsoft
May 04, 2026

A practical guide to turning existing architecture artifacts into a measurable resilience model, from failure analysis through health modeling and governance.

Start with the framework, accelerate with the tool

Watch the video walkthrough

The Application Resilience Framework originated from a practical gap we saw in resilience reviews: teams had architecture diagrams, monitoring data, incident history, and runbooks, but no consistent way to connect them into a measurable resilience model.

The framework is intended to close that gap by turning architecture context into a structured lifecycle for risk identification, mitigation validation, health modeling, and governance. It aligns closely with the Reliability pillar of the Azure Well-Architected Framework, especially the guidance around identifying critical flows, performing Failure Mode Analysis, defining reliability targets, and building health models.

Application Resilience Framework flow from artifact import to measurable operational resilience.

The Application Resilience Framework Tool helps teams apply this framework faster by starting with artifacts they already have, such as data flow diagrams or sequence diagrams in Mermaid or image format. The tool extracts workflows, application components, platform components, dependencies, and initial failure modes, then guides the team through the decisions needed to make resilience measurable.

From those artifacts, the tool creates the first version of a resilience model, then guides the team through one import step followed by four phases:

Import Artifacts -> Phase 1: Failure Mode Analysis -> Phase 2: Mitigation and Validation -> Phase 3: Health Model Mapping -> Phase 4: Operations and Governance

It is not a replacement for WAF guidance or Resilience Hub style assessments. It is a practical way to operationalize those concepts at the workload and workflow level, producing prioritized risks, mitigation plans, validation paths, health signals, dashboards, reports, and governance ownership.

How to use this guide

This guide follows the same flow as the tool. For each step, it covers:

  1. The decision: What needs to be decided?
  2. The options: What paths are available?
  3. The guidance: When each option fits

Use this with the video walkthrough. The video shows the tool in action. This guide explains the choices behind each step.

Question 1: What artifact should you import first?

The import step creates the starting point for the model. Regardless of the input path, the output is the same: workflows that move into Phase 1: Failure Mode Analysis.

Options

Import option     | Best for                                            | What happens
Data flow diagram | System, module, data movement, and dependency views | If imported as an image, the tool breaks it into sequence-style flows. Selected flows become workflows.
Sequence diagram  | Transaction flow and service interaction views      | Converted directly into workflows.
Mermaid input     | Diagrams maintained as code in Mermaid format       | Converted directly into workflows.
Image input       | JPG or PNG diagrams                                 | Azure Foundry Vision models interpret the image and convert it into workflows.
Manual entry      | Missing or incomplete diagrams                      | User creates or corrects workflows manually.

When to pick which

Use data flow for system and dependency views. Use sequence diagrams for transaction or interaction views. Regardless of import path, the output is the same: workflows, components, dependencies, and initial failure modes ready for Phase 1.
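The tool's internals are not shown here, but the Mermaid path is easy to picture: a sequence diagram maintained as code already encodes components (participants) and interactions (arrows), which map naturally onto workflow steps. A minimal sketch of that conversion, where the parsing logic, diagram content, and workflow shape are illustrative assumptions rather than the tool's actual implementation:

```python
import re

# A sequence diagram maintained as code: participants are components,
# arrows are interactions between them.
mermaid = """
sequenceDiagram
    Client->>API: submit order
    API->>Queue: enqueue payment
    Queue->>PaymentService: process payment
    PaymentService->>Database: write transaction
"""

def extract_workflow(diagram: str) -> list[dict]:
    """Turn each `A->>B: label` line into a workflow step (illustrative)."""
    steps = []
    for line in diagram.splitlines():
        match = re.match(r"\s*(\w+)\s*->>\s*(\w+)\s*:\s*(.+)", line)
        if match:
            source, target, action = match.groups()
            steps.append({"from": source, "to": target, "action": action.strip()})
    return steps

workflow = extract_workflow(mermaid)
components = sorted({s["from"] for s in workflow} | {s["to"] for s in workflow})
print(components)    # all components found in the diagram
print(workflow[0])   # first workflow step
```

The same idea explains why diagrams-as-code import cleanly while images need a vision model first: the structure is already explicit in the text.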

Question 2: Which workflows should be analyzed first?

Phase 1 is Failure Mode Analysis. This is where the tool identifies what can fail and how important each failure is.

Options

  • Critical user flows: Login, checkout, payment, onboarding, request processing.
  • High-risk platform flows: Database writes, queue processing, storage access, identity, messaging, external APIs.
  • Known issue areas: Workflows with recent incidents, recurring alerts, or customer impact.

When to pick which

Start where failure creates the highest customer or business impact. The goal is not to model everything at once. The goal is to model the right thing first.

Deliverables

  • Failure Mode Analysis catalog
  • RPV risk scores
  • Criticality classification

Question 3: How should failure modes be prioritized?

After workflows and components are imported, the tool helps score each failure mode using a Risk Priority Value (RPV), which combines four factors: Impact, Likelihood, Detectability, and Outage severity.

Options

  • Use generated failure modes and scores: Best for a fast first pass.
  • Tune the RPV scores with engineering input: Best when workload context matters.
  • Add custom failure modes: Best when known risks come from incidents, reviews, or customer experience.

When to pick which

Use the generated model to accelerate the first pass, then adjust it with real system knowledge. The goal is not to create the longest list of risks. The goal is to identify the risks that deserve attention first.
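The post names the four RPV factors but not the formula. FMEA-style scoring commonly multiplies factor ratings, so the sketch below assumes a simple product of 1–5 ratings; the formula, scale, and example failure modes are illustrative assumptions, not the tool's actual scoring:

```python
from dataclasses import dataclass

@dataclass
class FailureMode:
    name: str
    impact: int         # 1 (minor) .. 5 (severe customer impact)
    likelihood: int     # 1 (rare) .. 5 (frequent)
    detectability: int  # 1 (caught immediately) .. 5 (hard to detect)
    outage: int         # 1 (brief degradation) .. 5 (full outage)

    @property
    def rpv(self) -> int:
        # Assumed formula: product of the four factors, FMEA-style.
        return self.impact * self.likelihood * self.detectability * self.outage

catalog = [
    FailureMode("Database write timeout", impact=5, likelihood=3, detectability=2, outage=4),
    FailureMode("Cache miss storm", impact=2, likelihood=4, detectability=3, outage=2),
    FailureMode("Identity provider outage", impact=5, likelihood=2, detectability=1, outage=5),
]

# Prioritize: highest RPV first, so engineering attention goes to the top risks.
for fm in sorted(catalog, key=lambda f: f.rpv, reverse=True):
    print(f"{fm.name}: RPV={fm.rpv}")
```

Tuning with engineering input then means adjusting individual factor ratings, which is more defensible than editing the final scores directly.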

Deliverables

  • Failure Mode Catalog
  • RPV Risk Scores
  • Prioritized criticality list

Question 4: Are mitigations defined or validated?

Phase 2 is Mitigation and Validation. This is where each failure mode gets a response plan.

Options

  • Detection only: The team can detect the failure, but the response is not defined.
  • Defined mitigation: The response is documented, such as retry, fallback, failover, scaling, restore, or rebalance.
  • Validated mitigation: The response has been tested through a controlled validation or chaos test.

When to pick which

For low-risk items, documented mitigation may be enough. For critical and high-risk items, validation is key. A mitigation that has not been tested is still an assumption.
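The difference between a defined and a validated mitigation can be made concrete. A defined mitigation such as retry-with-fallback is just code or a runbook; validating it means injecting the failure deliberately and confirming the mitigation actually engages. The fault-injection shape below is an illustrative sketch, not a specific chaos tool:

```python
import time

def with_retry_and_fallback(primary, fallback, attempts=3, delay=0.0):
    """Defined mitigation: retry the primary path, then fall back."""
    for _ in range(attempts):
        try:
            return primary()
        except Exception:
            if delay:
                time.sleep(delay)
    return fallback()

# Validation: inject the failure and assert the mitigation engages as documented.
calls = {"primary": 0}

def failing_primary():
    calls["primary"] += 1
    raise ConnectionError("simulated dependency outage")

result = with_retry_and_fallback(failing_primary, lambda: "served from fallback")
print(result)            # "served from fallback"
print(calls["primary"])  # 3: all retries were exhausted before falling back
```

Until a test like this has run against the real dependency path, the mitigation belongs in the "defined" column, not the "validated" one.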

Deliverables

  • Mitigation playbooks
  • Chaos test plans
  • Support playbooks

Question 5: Which risks need health signals?

Phase 3 is Health Model Mapping. This is where the tool connects risks to observability.

A failure mode should not just sit in a document. It should map to a signal that can show whether the system is healthy, degraded, or unhealthy.

Options

  • Map all failure modes: Best for small systems or highly critical workloads.
  • Map critical and high-risk failure modes first: Best for large systems.
  • Track unmapped risks as gaps: Best when observability coverage is still improving.

When to pick which

Start with the highest RPV items. Every critical failure mode should have at least one signal, such as a metric, log, alert, availability check, or dependency signal.
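Mapping risks to signals and tracking unmapped risks as gaps reduces to a simple coverage check over the failure mode catalog. A minimal sketch, where the signal names, failure modes, and criticality labels are illustrative assumptions:

```python
# Map each failure mode to the health signals that can surface it.
signal_map = {
    "Database write timeout":   ["sql_write_latency_p99", "sql_deadlock_alert"],
    "Identity provider outage": ["auth_availability_check"],
    "Cache miss storm":         [],  # no signal yet: tracked as a gap
}

criticality = {
    "Database write timeout": "critical",
    "Identity provider outage": "critical",
    "Cache miss storm": "medium",
}

# Unmapped risks become explicit gaps rather than silent blind spots.
gaps = [fm for fm, signals in signal_map.items() if not signals]

# Coverage report: what share of critical failure modes has at least one signal?
critical_covered = [fm for fm, level in criticality.items()
                    if level == "critical" and signal_map[fm]]
coverage = len(critical_covered) / sum(1 for v in criticality.values() if v == "critical")

print(f"Critical coverage: {coverage:.0%}")
print(f"Gaps to track: {gaps}")
```

The coverage report deliverable is essentially this computation run over the whole catalog, with the gap list feeding Phase 4 governance.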

Deliverables

  • Health model
  • Signal definitions
  • Coverage report
  • Bicep templates

Question 6: Should the health model be exported or deployed?

Once the health model is built, the next decision is how to use it.

Options

  • Export for review: Best when the team needs to validate the model first.
  • Generate monitoring templates: Best when the team wants repeatable implementation.
  • Deploy to Azure: Best when the model is ready to become part of operations.
  • Use outputs in downstream tools: Best when support, SRE, or incident response workflows need structured playbooks.

When to pick which

Export first if the model is still being reviewed. Deploy when component relationships, signals, and coverage are accurate enough for operational use.

Question 7: How will governance keep the model current?

Phase 4 is Operations and Governance. This is where the resilience model becomes an ongoing practice.

Options

  • One-time assessment: Useful for quick discovery but limited long term.
  • Recurring review: Best for production workloads that change regularly.
  • Closed-loop governance: Best when incidents, failed validations, and monitoring gaps feed back into the model.

When to pick which

For production systems, use a recurring governance cadence. Assign owners, track gaps, review dashboards, and update the model as the system changes.

Deliverables

  • Governance model
  • Dashboards
  • Reports and exports
  • Runbooks

Putting it together: three adoption patterns

Once governance is defined, the tool can be used in different ways depending on the team’s maturity and objective. The three common adoption patterns are:

Pattern A: Quick resilience review

  • Import one critical workflow
  • Generate failure modes
  • Review RPV scores
  • Identify top risks
  • Export findings

Best for fast architecture reviews or early customer conversations.

Pattern B: Full workload assessment

  • Import multiple workflows
  • Build a full Failure Mode Catalog
  • Define mitigations and recovery steps
  • Create chaos test plans
  • Map risks to signals
  • Produce coverage reports

Best for structured resilience assessments.

Pattern C: Operational health model

  • Build and tune the health model
  • Export or deploy monitoring artifacts
  • Track risk and signal coverage
  • Review mitigation effectiveness
  • Assign governance ownership
  • Feed findings back into the model

Best when the goal is continuous operational improvement.

A short checklist before using the tool

  1. Which workflow should we import first?
  2. Do we have a data flow diagram, sequence diagram, or Mermaid file?
  3. What components and dependencies should be included?
  4. Which failure modes matter most?
  5. How should RPV be adjusted for this workload?
  6. Do critical failure modes have mitigations?
  7. Have those mitigations been validated?
  8. Are failure modes mapped to health signals?
  9. What coverage gaps remain?
  10. Should the health model be exported or deployed?
  11. Who owns ongoing review?
  12. How often should the model be updated?

Closing thought

The Application Resilience Framework Tool provides a practical way to move from architecture artifacts to measurable, continuously improving resilience.

It starts with data flow or sequence diagrams, builds a structured view of the system, and guides teams through the decisions that matter: what can fail, how severe it is, how it is mitigated, how it is detected, and how it is governed.

Tool repo: Application Resilience Framework Tool 

Updated May 04, 2026
Version 1.0