At Microsoft, we take resilience seriously. We understand that the consequences of unavailability are severe: your projects, applications, and even businesses depend on the Azure Cloud to be highly available and resilient to failure. If users can’t access your services, they are likely to get upset, and beyond that, there can be financial, legal, and even life-or-death consequences when your application is down.
That’s why today we are excited to announce the public preview of Azure Chaos Studio, a new service for improving your applications’ resilience to disruptions. With Chaos Studio you can practice chaos engineering: a method of experimenting with controlled fault injection against your applications to help you measure, understand, and improve resilience against real-world incidents, such as region outages or memory leaks in an application. In this post, we’ll briefly share why practicing chaos engineering is important, give an overview of Chaos Studio, and share a little bit about our roadmap. When you’re ready to get started, visit our documentation or try it out for yourself in the Azure portal.
Why practice chaos engineering?
The move to cloud services introduces new challenges to building reliable applications. Layers of abstraction from physical infrastructure provide relief from needing to manage failures, but can also introduce new dependencies that are a “black box” when things go wrong. Cloud-native architectures simplify deployment and management, but teams may lack confidence that applications built on them will remain resilient to failure. Much like security, resilience requires constant attention from both the cloud provider and the cloud consumer.
Whether you are developing a new application that will be hosted on Azure, migrating an existing application to Azure, or operating an application that already runs on Azure, it is important to improve your application's ability to handle and recover from disruptions that can negatively impact your customers' experience and erode their trust in your business or mission. To avoid these negative consequences, you need to validate that your application responds effectively to disruptions that could be caused by a service you depend on, disruptions caused by a failure in the service itself, or even disruptions to incident response tooling and processes. Chaos experimentation enables you to test that your application is resilient to these failures at any phase in the service lifecycle, from development through to production.
Chaos engineering can be used for a wide variety of scenarios, including post-incident analysis, "game days," BCDR drills, and validation of live site tooling and on-call processes. For many of these scenarios, you first build resilience using ad-hoc chaos experimentation, then continuously validate that new deployments won't regress resilience by adding chaos testing as a deployment gate in your CI/CD pipeline.
How does Chaos Studio help?
Designing an experiment in Chaos Studio
Chaos Studio enables you to orchestrate fault injection on your Azure resources in a safe and controlled way. A few of the key benefits of Chaos Studio are:
- A fully-managed service: Unlike custom scripts and OSS tools, Chaos Studio doesn’t require you to do any management or maintenance of the service – just define your experiment and click start.
- Deeply integrated into Azure: Chaos Studio uses Azure Resource Manager (ARM), which makes it easy to deploy and run your chaos experiments the same way you deploy and manage your infrastructure – leveraging ARM templates, Azure Policy, role-based access control, and more. Chaos Studio also integrates with Azure Application Insights and leverages Azure Active Directory for securing access to your resources.
- Expanding fault library: Chaos Studio has 25+ faults in our fault library at public preview that cover several Azure services, and we’re just getting started. We will continuously add new faults that align to the failures users want to replicate in Azure and beyond.
- Replication of real-world scenarios: Chaos Studio’s orchestration capabilities enable you to build up complex experiments that replicate real-world incidents. You can run faults in sequence and/or in parallel, add time delays, and group target resources across regions.
- Controlled chaos: With Chaos Studio, you can easily stop an experiment and roll back the fault being injected to avoid having more impact on an environment than originally intended. Chaos Studio also uses a sophisticated resource onboarding and permissions model that ensures fault injection is performed only by authorized users, against authorized resources, using authorized faults.
Chaos Studio is free to use through April 4, 2022, and thereafter usage will be charged pay-as-you-go by the target action-minute.
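To make the orchestration ideas above more concrete (faults in sequence and in parallel, time delays, and grouped targets across regions), here is a minimal Python sketch of how such an experiment could be modeled. It is an illustration only and not the Chaos Studio experiment schema or API: the class names (`Action`, `Branch`, `Step`), the fault names, and the resource names are all hypothetical. The sketch assumes that steps run one after another, branches within a step run in parallel, and actions within a branch run in order.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Action:
    """A fault or a time delay. All names here are hypothetical."""
    name: str
    duration_minutes: int
    targets: List[str] = field(default_factory=list)  # empty for a delay


@dataclass
class Branch:
    """Actions in a branch are assumed to run one after another."""
    actions: List[Action]


@dataclass
class Step:
    """Branches in a step are assumed to run in parallel."""
    branches: List[Branch]


def branch_minutes(branch: Branch) -> int:
    # Sequential actions: durations add up.
    return sum(action.duration_minutes for action in branch.actions)


def experiment_minutes(steps: List[Step]) -> int:
    # Sequential steps add up; each step lasts as long as its longest branch.
    return sum(
        max((branch_minutes(b) for b in step.branches), default=0)
        for step in steps
    )


# A hypothetical group of target resources spread across two regions.
web_vms = ["vm-web-eastus", "vm-web-westus"]

experiment = [
    # Step 1: two faults in parallel, the second one after a 5-minute delay.
    Step(branches=[
        Branch(actions=[Action("cpu-pressure", 10, web_vms)]),
        Branch(actions=[
            Action("delay", 5),
            Action("network-latency", 10, ["cache-eastus"]),
        ]),
    ]),
    # Step 2: runs only after step 1 has finished.
    Step(branches=[
        Branch(actions=[Action("vm-shutdown", 5, ["vm-web-eastus"])]),
    ]),
]

total_minutes = experiment_minutes(experiment)
print(total_minutes)  # 20 (step 1 takes 15 minutes, step 2 takes 5)
```

Modeling the experiment as nested lists keeps the sequence-versus-parallel distinction explicit, which makes it easy to reason about how long an experiment will run before starting it.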
What are some of our customers and partners saying?
Chaos Studio is already being used by Azure customers that span industries including retail, finance, healthcare and emergency services, and it is being used across Microsoft to improve quality as well. Over 50 teams across Microsoft are running chaos experiments with Chaos Studio, including the Power Platform team and the Azure Key Vault team. Check out this video to hear how Chaos Studio has helped them to identify opportunities for improved resilience:
What’s next?
Today we begin our public preview for Chaos Studio, and we’re excited to hear your feedback on what faults and features you’d like to see next. If you’re ready to systematically improve resilience with controlled chaos, get started by visiting our documentation or trying it out in the Azure portal. Let the chaos begin!