Introduction
Chaos Studio is an Azure service that helps you measure, understand, and build application and service resilience to real-world incidents, such as an unexpected infrastructure disruption or an application failure causing 100% CPU usage on a VM. In this new series of blog posts, we’ll share best practices on performing resilience tests for common failure scenarios, provide step-by-step tutorials, and discuss how to leverage test results to improve the resilience of your cloud applications. Today, we’ll focus on using Chaos Studio to simulate a compute failure.
Resilience Testing Best Practices
We recommend using a hypothesis-driven approach for resilience testing to ensure actionable results:
- Define a hypothesis: outline a specific failure scenario and predict how your infrastructure will perform if it occurs.
- Design a fault injection experiment that reflects the failure scenario you wish to test and set up proper telemetry to monitor performance over the course of the experiment.
- Run your experiment and analyze results to determine if your hypothesis was validated or invalidated.
- Make necessary improvements to your configurations based on your findings.
As your cloud infrastructure changes and evolves, new dependency and configuration issues may arise – repeat this process over time to ensure continued reliability.
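One way to keep a hypothesis actionable is to write it down as measurable thresholds before the experiment starts. The sketch below is illustrative only: the `Hypothesis` class, its field names, and the threshold values are assumptions made for this example and are not part of Chaos Studio.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hypothesis:
    """A resilience hypothesis expressed as measurable thresholds (hypothetical helper)."""
    scenario: str                 # failure being injected
    metric: str                   # telemetry signal to watch
    min_acceptable: float         # lowest value the metric may reach
    max_recovery_minutes: float   # time allowed to return to baseline


def evaluate(h: Hypothesis, lowest_observed: float, recovery_minutes: float) -> str:
    """Return 'validated' only if both predictions held during the run."""
    held = (lowest_observed >= h.min_acceptable
            and recovery_minutes <= h.max_recovery_minutes)
    return "validated" if held else "invalidated"


# Example thresholds; pick values that match your own service objectives.
zone_loss = Hypothesis(
    scenario="Shut down one Availability Zone",
    metric="average instance availability",
    min_acceptable=0.6,
    max_recovery_minutes=10,
)

print(evaluate(zone_loss, lowest_observed=0.67, recovery_minutes=7))
print(evaluate(zone_loss, lowest_observed=0.67, recovery_minutes=25))
```

Recording the thresholds up front makes the "validated or invalidated" decision in the third step a comparison rather than a judgment call.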
Simulate a Compute Failure Scenario
Today, we’ll be performing an Availability Zone shutdown on a Virtual Machine Scale Set configured with instances across multiple Availability Zones. Remember to define a hypothesis before conducting your resilience test, for example: “If one Availability Zone is shut down, the Virtual Machine Scale Set’s autoscale configuration will detect the drop in instance count and automatically provision additional instances in the remaining zones, maintaining overall capacity and performance.” Next, we’ll create and run a fault injection experiment to test our scenario using Chaos Studio.
Prerequisites
- A valid Azure subscription. If you don’t have one, you can create an account at https://azure.microsoft.com/en-us/pricing/purchase-options/azure-account?icid=azurefreeaccount.
- A Virtual Machine Scale Set configured with instances across multiple Availability Zones. Ensure that it is located in a region where Chaos Studio is available (see https://learn.microsoft.com/en-us/azure/chaos-studio/chaos-studio-region-availability). If you don’t have one, you can follow the quickstart at https://learn.microsoft.com/en-us/azure/virtual-machine-scale-sets/flexible-virtual-machine-scale-sets-portal to create one.
If this is your first time using Chaos Studio, follow the instructions below to register the resource provider for your subscription:
- Open the Azure portal (https://portal.azure.com/).
- Search for and select Subscriptions. Select the subscription you’d like to use.
- Select Settings > Resource providers from the left-side menu.
- Search for and select Microsoft.Chaos. Select Register.
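If you prefer scripting to the portal, the same registration can be done with the Azure CLI command `az provider register`. The sketch below only builds and prints the command so you can inspect it first; it assumes the Azure CLI is installed and signed in, and the subscription ID shown is a placeholder.

```python
from __future__ import annotations

import shlex


def register_provider_command(subscription_id: str,
                              namespace: str = "Microsoft.Chaos") -> list[str]:
    """Build the Azure CLI argument list that registers a resource provider."""
    return [
        "az", "provider", "register",
        "--namespace", namespace,
        "--subscription", subscription_id,
    ]


# Placeholder subscription ID; replace it with your own.
cmd = register_provider_command("00000000-0000-0000-0000-000000000000")
print(shlex.join(cmd))

# To actually run it, pass the list to subprocess:
#   import subprocess
#   subprocess.run(cmd, check=True)
```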
Create an Experiment and Set Up Monitoring
To create an Availability Zone shutdown experiment on your Virtual Machine Scale Set, do the following:
- Open the Azure portal (https://portal.azure.com/). Search for and select Chaos Studio.
- Select Targets from the left-side menu.
- Select the Virtual Machine Scale Set you’d like to test and select Enable targets > Enable service-direct targets (All resources) > Review + Enable > Enable.
- Navigate back to Chaos Studio and select Experiments from the left-side menu.
- Select Create > New experiment.
- On the Basics tab, select a subscription and resource group for your experiment. Give your experiment a name and select the region you’d like to store it in.
- On the Permissions tab, select whether you’d like to use a system-assigned or user-assigned managed identity (https://learn.microsoft.com/en-us/entra/identity/managed-identities-azure-resources/overview) to manage your experiment permissions. If you’re unsure of which to choose, select the system-assigned identity option. Check the Enable custom role creation and assignment checkbox – this will allow Chaos Studio to automatically assign the necessary permissions to your managed identity based on your experiment configuration.
- On the Experiment designer tab, select Add action > Add fault. Choose the VMSS Shutdown (version 2.0) fault from the dropdown. Select Next: Target resources and select your Virtual Machine Scale Set. Select Next: Scope, choose the zone you’d like to shut down, and select Add.
- Select the Review + create button, review the experiment configuration, and select Create.
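For readers who want to keep experiments in source control, the portal steps above correspond roughly to a JSON experiment definition. The sketch below assembles one as a Python dictionary. Treat it as an approximation: the field names, the fault URN, the `abruptShutdown` parameter, and the 10-minute duration are assumptions based on our reading of the Chaos Studio fault library, so verify them against the current documentation before use. The resource ID, zone, and region are placeholders.

```python
import json


def vmss_zone_shutdown_experiment(vmss_resource_id: str, zone: str,
                                  location: str, duration: str = "PT10M") -> dict:
    """Assemble an experiment definition that shuts down one zone of a scale set."""
    # Chaos targets are child resources of the resource being tested.
    target_id = (
        f"{vmss_resource_id}/providers/Microsoft.Chaos/targets/"
        "Microsoft-VirtualMachineScaleSet"
    )
    return {
        "location": location,
        "identity": {"type": "SystemAssigned"},
        "properties": {
            "selectors": [{
                "type": "List",
                "id": "vmssSelector",
                "targets": [{"type": "ChaosTarget", "id": target_id}],
                # Scope the fault to a single Availability Zone.
                "filter": {"type": "Simple", "parameters": {"zones": [zone]}},
            }],
            "steps": [{
                "name": "Step 1",
                "branches": [{
                    "name": "Branch 1",
                    "actions": [{
                        "type": "continuous",
                        # Assumed URN for the VMSS Shutdown (version 2.0) fault.
                        "name": "urn:csci:microsoft:virtualMachineScaleSet:shutdown/2.0",
                        "selectorId": "vmssSelector",
                        "duration": duration,  # ISO 8601 duration
                        "parameters": [{"key": "abruptShutdown", "value": "false"}],
                    }],
                }],
            }],
        },
    }


body = vmss_zone_shutdown_experiment(
    vmss_resource_id=(
        "/subscriptions/00000000-0000-0000-0000-000000000000/resourceGroups/my-rg"
        "/providers/Microsoft.Compute/virtualMachineScaleSets/my-vmss"
    ),
    zone="1",
    location="eastus",
)
print(json.dumps(body, indent=2))
```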
The metrics you should monitor for your experiment run depend on the hypothesis you came up with for your scenario. Since our sample hypothesis predicted that our Virtual Machine Scale Set would provision additional instances in the event of a disruption based on its autoscale setting, we’ll show you how to track the availability of your Virtual Machine Scale Set’s virtual machine instances:
- Search for your Virtual Machine Scale Set by name using the Azure portal search bar and select it to go to its overview page.
- Select Monitoring > Metrics from the left-side menu.
- Configure a metric with the following values:
  - Scope: your Virtual Machine Scale Set
  - Metric Namespace: Virtual Machine Host
  - Metric: VM Availability Metric (Preview)
  - Aggregation: Avg
- Select Add metric.
- You may select Save to dashboard and choose the Pin to dashboard, Pin to Grafana, or Send to workbook options to save your metric where you’d like.
The VM Availability Metric will now display an average of the availability of your virtual machine instances within your Virtual Machine Scale Set over the course of your experiment run.
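To make the Avg aggregation concrete, here is a small sketch of how the charted value would change when one zone goes down. It assumes each instance reports 1 while it is available and 0 while it is not; the instance names and the two-instances-per-zone layout are made up for illustration.

```python
# Per-instance availability at one point in time: 1.0 = available, 0.0 = unavailable.
before = {
    "vm-z1-a": 1.0, "vm-z1-b": 1.0,
    "vm-z2-a": 1.0, "vm-z2-b": 1.0,
    "vm-z3-a": 1.0, "vm-z3-b": 1.0,
}
# Zone 1 is shut down: both of its instances become unavailable.
during = {**before, "vm-z1-a": 0.0, "vm-z1-b": 0.0}


def average_availability(instances: dict) -> float:
    """Average the per-instance values, as an Avg aggregation would."""
    return sum(instances.values()) / len(instances)


print(round(average_availability(before), 2))  # 1.0
print(round(average_availability(during), 2))  # 0.67
```

In this made-up layout, losing one of three evenly loaded zones pulls the average down to about two thirds until replacement instances come online.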
Run the Experiment and Analyze Results
To run your experiment, do the following:
- Within the Azure portal, navigate back to Chaos Studio and select Experiments from the left-side menu.
- Select your experiment and select Start experiment(s) > Yes from the bar at the top of the page.
- Select your experiment’s name to navigate to its overview page. Select the Details button under History to monitor its progress while running.
While your experiment is running, navigate to your Virtual Machine Scale Set > Monitoring > Metrics, or the location where you saved your VM Availability Metric, and view the impact of the Availability Zone shutdown on your Virtual Machine Scale Set’s average instance availability.
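Once the run finishes, one simple way to analyze the chart is to measure how long the average stayed below its baseline. The sketch below does this for a list of samples; the sample values are invented for illustration, and the `minutes_to_recover` helper is a hypothetical name rather than part of any Azure SDK.

```python
def minutes_to_recover(series, baseline=1.0):
    """Return minutes from the first sample below baseline to the first later
    sample back at baseline, or None if there was no dip or no recovery.

    series: list of (minute, average_availability) tuples ordered by time.
    """
    dip_start = None
    for minute, value in series:
        if dip_start is None:
            if value < baseline:
                dip_start = minute      # the dip begins here
        elif value >= baseline:
            return minute - dip_start   # back at baseline
    return None


# Invented samples: the zone goes down at minute 2 and capacity is restored by minute 5.
recovered = [(0, 1.0), (1, 1.0), (2, 0.67), (3, 0.67), (4, 0.83), (5, 1.0), (6, 1.0)]
# Invented samples: availability never returns to baseline during the window.
stuck = [(0, 1.0), (1, 0.67), (2, 0.67), (3, 0.67)]

print(minutes_to_recover(recovered))  # 3
print(minutes_to_recover(stuck))      # None
```

Comparing the result with the recovery time you predicted in your hypothesis tells you whether the autoscale configuration behaved as expected.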
Recommendations to Improve Resiliency
Did your Virtual Machine Scale Set perform as you expected it to during the Availability Zone shutdown? If not, here are some steps you can take to improve your resiliency for future tests and protect against real-world incidents:
- Configure or review the autoscale settings on your Virtual Machine Scale Set to ensure rapid provisioning of additional instances in unaffected zones during a failure.
- Maintain a balanced instance count across Availability Zones to minimize the impact of losing an entire zone.
- Set up load balancing or adjust configurations to seamlessly redistribute traffic when a zone becomes unavailable.
After making improvements to your Virtual Machine Scale Set configuration, be sure to test and iterate on them by continuing to perform resilience testing regularly.
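To see why a balanced instance count matters, the sketch below computes the fraction of capacity that survives if the most heavily loaded zone is lost. The zone layouts are made-up examples, and `worst_case_remaining` is a hypothetical helper name.

```python
def worst_case_remaining(instances_per_zone: dict) -> float:
    """Fraction of instances left if the zone holding the most instances fails."""
    total = sum(instances_per_zone.values())
    largest = max(instances_per_zone.values())
    return (total - largest) / total


balanced = {"1": 2, "2": 2, "3": 2}   # six instances spread evenly
skewed = {"1": 4, "2": 1, "3": 1}     # six instances, most of them in zone 1

print(round(worst_case_remaining(balanced), 2))  # 0.67
print(round(worst_case_remaining(skewed), 2))    # 0.33
```

With the same total instance count, the skewed layout can lose two thirds of its capacity in a single zone failure, while the balanced layout loses only one third.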
Conclusion
In this blog post, we have shown you how to use Chaos Studio to test your Virtual Machine Scale Sets against Availability Zone shutdowns. With the best practices laid out in this guide, you can conduct resilience tests on services across your cloud infrastructure using faults in Chaos Studio’s fault library (https://learn.microsoft.com/en-us/azure/chaos-studio/chaos-studio-fault-library). Be sure to look out for more blog posts covering other scenarios in the “Resilience Testing with Azure Chaos Studio” series soon. Feel free to add a comment below on which scenarios you’d like to see next. Happy resilience testing!
Additional resources
- Chaos Studio Overview: http://aka.ms/AzureChaosStudio
- Documentation: https://docs.microsoft.com/en-us/azure/chaos-studio/
- MS Build Session Recording: https://www.youtube.com/watch?v=lk1yxLMj-7A
- Azure blog post: https://azure.microsoft.com/en-us/blog/advancing-microsoft-azure-resilience-with-chaos-studio/
- Region availability: https://learn.microsoft.com/en-us/azure/chaos-studio/chaos-studio-region-availability