Well-Architected Branches for Assessing Workload Types
Published Mar 25 2022 02:36 PM

Microsoft offers prescriptive guidance called the Well-Architected Framework for optimizing workloads implemented and deployed on Azure. This guidance is generalized to apply to most workloads and creates a basis for reliable, secure, and cost-optimized applications.


We have begun to build on this base content set to include more precise guidance for specific workload types, such as machine learning, data services and analytics, IoT, SAP, mission-critical apps, and web apps. Machine Learning was the first branch from the base content, which came to fruition in the Fall of 2021.


Azure-oriented prescriptive guidance needs to consider multiple dimensions of the workload type. Thus, these branches are developed by teams across Microsoft, including customer-facing, partner-facing, product, and content teams.


Branches must meet several critical release criteria to become generally available, including:

  • Curated documentation based on all Well-Architected Framework pillars: Security, Reliability, Cost Optimization, Performance Efficiency, and Operational Excellence.
  • An assessment based on the core Well-Architected Review, so that questions, answers, and recommendations are specific to the workload type but remain in line with the objectives of the Well-Architected Review (to convey guidance, not a sales pitch).
  • Early customers who have satisfactory results after implementing recommendations from the assessment.

As an example, teams across Microsoft led by the Customer Success Unit developed the Machine Learning branch to meet the specific needs of MLOps teams. This new space has all the same considerations as other workloads, but the technologies and processes used to create workloads leveraging machine learning capabilities differ dramatically.




The following statements, which are intentionally general in the Well-Architected Review, become prescriptive in the AI/ML branch:


| Pillar | General Statement | Specific AI/ML Guidance |
| --- | --- | --- |
| Cost Optimization | Managing costs to maximize the value delivered. | Provisioning of CPU/GPU for classical and deep learning models; use of compute clusters for training; termination policies to stop poorly performing runs and save on compute costs; etc. |
| Operational Excellence | Operational processes that keep a system running in production. | Building, designing, and orchestrating Azure Machine Learning (AML) with MLOps principles; monitoring performance of deployed models; segregation of environments; development using the SDK, Automated ML, and AML Designer; etc. |
| Performance Efficiency | The ability of a system to adapt to changes in load. | Running experiments in parallel; data partitioning strategy; autoscaling for scalability; AML ParallelRunConfig for processing large amounts of data; etc. |
| Reliability | The ability of a system to recover from failures and continue to function. | Scalability; network capacity; managing quotas; dataset versioning; workspace capacity limits; logging ML runs; private links; etc. |
| Security | Protecting applications and data from threats. | RBAC for AML; authentication; data encryption best practices; identity management; use of VNets; responsible ML; differential privacy; model interpretability; homomorphic encryption; etc. |
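As a concrete illustration of the autoscaling and cost guidance above, a training compute cluster in Azure Machine Learning can be declared so that it scales down to zero nodes when idle. The following is a minimal sketch using the Azure ML CLI (v2) YAML schema; the cluster name and VM size are illustrative choices, not values prescribed by the assessment.

```yaml
# Hypothetical compute cluster definition (Azure ML CLI v2 schema).
# min_instances: 0 lets the cluster scale to zero when no jobs are running,
# so idle nodes stop accruing compute charges.
$schema: https://azuremlschemas.azureedge.net/latest/amlCompute.schema.json
name: cpu-train-cluster        # illustrative name
type: amlcompute
size: Standard_DS3_v2          # illustrative VM size
min_instances: 0
max_instances: 4
idle_time_before_scale_down: 1800   # seconds before idle nodes are released
```

Applied with `az ml compute create --file <file>.yaml`, a cluster like this adds nodes only while training jobs are queued, which is one of the cost-optimization behaviors the AML assessment checks for.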


By providing more precise guidance, MLOps teams have been much more effective in implementing recommendations that generate meaningful impact across their workloads. As a result, we have seen a three-fold increase in recommendations implemented by customers in the AML pilot compared with those who used the general assessment. From everyone's standpoint, that's a huge success!


As a result, we are moving the AML branch from Public Preview to General Availability in April 2022.


To learn more about the AML branch, see the links and the video below.

Additional assessment branches for SAP, IoT, Mission-Critical, and Data Services are coming soon. Be on the lookout for subsequent blog posts on each!


See the Enablement Show video below!


Version history
Last update: Apr 27 2022 03:17 PM