Chaos Engineering: What is it? Who should do it? What can you learn? How do you get started? Armchair Architects Uli Homann and Eric Charran joined David Blank-Edelman on the Azure Enablement Show for a lively discussion of Chaos Engineering from an architect’s point of view. Read below for highlights and watch the video.
How do architects define and think about Chaos Engineering?
Chaos engineering is the part of application development in which you try to remove uncertainty about specific failure conditions that might occur in the system or its components. Many architects just guess what's going to happen if there's a failure condition, and in many instances they don't even like to think about it. When you’re designing an elegant solution, the last thing you want to think about is what happens if somebody kicks over a rack, unplugs something, or doesn't update a cert. How is your application going to respond? In today's world we need to think about these things, and we shouldn't be guessing. The practices of chaos engineering help us replace guessing with certainty about what failure looks like in our applications and their architecture. "Always expect the unexpected" is the mantra.
There are two independent lifecycles: the cloud lifecycle and the application lifecycle. Because they evolve independently of each other, they can clash and cause failures when incompatibilities arise. Chaos engineering therefore matters more in the cloud than on-premises, where you have more control over the entire lifecycle.
Introducing failure into your system is a great art form. The most extreme form is to do it in production, but most customers and partners are not ready for that level of chaos engineering. Introducing failure in the development and test cycle, or perhaps in a canary deployment where the impact is very limited, is a great starting point.
How do you know if your org is mature enough?
Mature organizations should do chaos testing in production; however, not all organizations are mature enough for this. For example, if you don't write unit tests or your code has no exception handling, then your organization may not be mature enough for chaos engineering.
What does it mean to introduce failure?
There’s an opportunity for us as architects not to guess but to actually acquire knowledge before failures happen and to see what failure actually looks like. A principal component of chaos engineering is fault injection, and fault injection should be part of the application development process. You can think about it in simple terms: suspend services, suspend components, pull out the network infrastructure, turn off the integration between your components, and then see what happens. It's a mindset that you have to embrace.
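As a minimal sketch of what "suspend a component and see what happens" can look like at the code level, the Python example below wraps a dependency in a switch that makes it fail on demand. The names `FaultInjector`, `fetch_price`, and `show_price` are hypothetical and exist only for illustration; real fault injection usually happens at the infrastructure or network level rather than inside application code.

```python
class FaultInjector:
    """Wraps a dependency call so a test can suspend it on demand."""

    def __init__(self, func):
        self._func = func
        self.suspended = False  # flip to True to inject the fault

    def __call__(self, *args, **kwargs):
        if self.suspended:
            raise ConnectionError("injected fault: dependency suspended")
        return self._func(*args, **kwargs)


def fetch_price(item):
    """Hypothetical downstream dependency (stands in for a real service)."""
    return {"widget": 9.99}[item]


price_service = FaultInjector(fetch_price)


def show_price(item):
    """Caller under test: should degrade gracefully when the dependency fails."""
    try:
        return f"{item}: ${price_service(item):.2f}"
    except ConnectionError:
        return f"{item}: price temporarily unavailable"


print(show_price("widget"))      # normal behavior
price_service.suspended = True   # inject the fault
print(show_price("widget"))      # observe how the caller responds
price_service.suspended = False  # restore the dependency
```

The point of the sketch is the observation step: with the fault switched on, you find out whether the caller degrades gracefully or crashes, instead of guessing.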
You need to look at how gracefully your components handle a failure, starting with the user experience and going all the way down to the data or storage layer. Did they handle it well, or can the handling be improved? Once you establish that baseline of failure, you fix your components and redeploy them, whether in a canary or in a testing environment. Then run the same tests again and see how much you improved compared to your baseline.
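The baseline-then-retest loop can be sketched in a few lines of Python. This is an illustrative toy, not a real chaos tool: `run_experiment`, `backend`, `handler_v1`, and `handler_v2` are hypothetical names, and the fault is injected deterministically on every third request so the two runs are directly comparable.

```python
def run_experiment(handler, requests, fail_every=3):
    """Inject a fault into every `fail_every`-th request; return the success rate."""
    succeeded = 0
    for i, req in enumerate(requests, start=1):
        inject_fault = (i % fail_every == 0)
        try:
            handler(req, inject_fault)
            succeeded += 1
        except Exception:
            pass  # an unhandled failure counts against the success rate
    return succeeded / len(requests)


def backend(req, fault):
    """Hypothetical dependency that times out when a fault is injected."""
    if fault:
        raise TimeoutError("injected fault")
    return req.upper()


def handler_v1(req, fault):
    """Original component: no handling, so the failure reaches the user."""
    return backend(req, fault)


def handler_v2(req, fault):
    """Fixed component: falls back to a cached answer when the backend fails."""
    try:
        return backend(req, fault)
    except TimeoutError:
        return "cached:" + req


requests = ["a", "b", "c", "d", "e", "f"]
baseline = run_experiment(handler_v1, requests)  # measure first
improved = run_experiment(handler_v2, requests)  # same test after the fix
print(f"baseline: {baseline:.0%}, after fix: {improved:.0%}")
```

Keeping the injected faults identical between the two runs is what makes the comparison against the baseline meaningful.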
At each step of the way, embrace the failure. It's like smashing your beautiful glass house with a hammer, but you need to do that to remove obscurity about what will happen as systems fail.
Applying the scientific method to Chaos Engineering
You can identify a hypothesis and then test it to see what happens. The idea is to understand the unknown. What happens if I apply CPU pressure? What happens if I apply memory pressure? What happens if I run out of disk space? What if someone sets the clocks ahead by two hours, or daylight saving time kicks in? Most people don’t run tests in production, but if you don't, the internet will test it for you.
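To make the hypothesis-and-experiment idea concrete, here is a small Python sketch of the clock example. The hypothesis is "a token issued a moment ago is still accepted if the clock jumps ahead by two hours." The `FakeClock` class, the `token_is_valid` function, and the one-hour time-to-live are assumptions made for illustration, not part of any real library.

```python
from datetime import datetime, timedelta, timezone


class FakeClock:
    """Controllable clock so the experiment can move time deliberately."""

    def __init__(self, start):
        self.current = start

    def now(self):
        return self.current

    def jump(self, delta):
        self.current += delta


def token_is_valid(issued_at, clock, ttl=timedelta(hours=1)):
    """Hypothetical validity check: a token is accepted for `ttl` after issue."""
    age = clock.now() - issued_at
    return timedelta(0) <= age <= ttl


clock = FakeClock(datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc))
issued_at = clock.now()

before = token_is_valid(issued_at, clock)  # control: clock is correct
clock.jump(timedelta(hours=2))             # experiment: set the clock ahead
after = token_is_valid(issued_at, clock)   # observe the outcome

print(f"valid before jump: {before}, valid after jump: {after}")
```

In this sketch the experiment refutes the hypothesis: the token is rejected after the jump, which tells you that every active session would be dropped if the clock moved. That is exactly the kind of knowledge you want before it happens in production.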
Some systems are not internet-facing, so that is less applicable to them. The failures you can introduce include degrading the network, increasing latency, reducing throughput, and having services disappear and then reappear. Sometimes reappearing is more dangerous than disappearing, because people plan for a service disappearing but don't plan for what they will do when it comes back up again.
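One reason a reappearing service can be dangerous is that clients often queue work during the outage and release it all at once on recovery. The Python sketch below simulates that effect with made-up numbers; `simulate`, the request rate, and the capacity figure are illustrative assumptions, not measurements from a real system.

```python
def simulate(outage, ticks=8, rate=2):
    """Return the load the service sees at each tick.

    Clients produce `rate` requests per tick. During the outage they queue
    their requests; on recovery they flush the whole backlog at once.
    """
    backlog = 0
    load = []
    for t in range(ticks):
        if t in outage:
            backlog += rate        # service is gone: requests pile up
            load.append(0)
        else:
            load.append(rate + backlog)  # service is up: backlog is flushed
            backlog = 0
    return load


capacity = 5  # assumed maximum requests per tick the service can absorb
load = simulate(outage=range(3, 6))  # service disappears for ticks 3, 4, 5
overloaded = [t for t, n in enumerate(load) if n > capacity]

print("load per tick:", load)
print("ticks over capacity:", overloaded)
```

In this toy run the only tick that exceeds capacity is the first one after the service returns, which is why recovery deserves its own experiment alongside the outage itself.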
Other things to consider
It’s helpful to create a database of past incidents to draw on when building a chaos testing repertoire. Review your incident database and make sure those issues are covered by your tests, combining hypothesis-based testing with real-world events that have happened before. It’s also important to consider timing: chaos testing is not only about what you test but also when you test. Throwing in an error when it's least expected adds another level of robustness.
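As a rough illustration of turning an incident database into a testing repertoire, the Python sketch below groups past incidents by failure mode and orders the resulting experiments by how often each failure has occurred. The incident records, the `failure_mode` field, and the `build_repertoire` function are all hypothetical.

```python
from collections import Counter

# Hypothetical incident records; a real database would hold far more detail.
incidents = [
    {"id": 1, "failure_mode": "expired certificate"},
    {"id": 2, "failure_mode": "disk full"},
    {"id": 3, "failure_mode": "expired certificate"},
    {"id": 4, "failure_mode": "dependency timeout"},
    {"id": 5, "failure_mode": "expired certificate"},
    {"id": 6, "failure_mode": "disk full"},
]


def build_repertoire(incidents):
    """Turn past incidents into experiments, most frequent failure mode first."""
    counts = Counter(incident["failure_mode"] for incident in incidents)
    return [
        {"experiment": f"inject: {mode}", "past_incidents": n}
        for mode, n in counts.most_common()
    ]


repertoire = build_repertoire(incidents)
for entry in repertoire:
    print(entry["past_incidents"], entry["experiment"])
```

Ordering by frequency is only one possible prioritization; severity or time since the last occurrence could serve equally well.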