Armchair Architects: Architecting Mission Critical Apps
Published Jul 22 2022
Microsoft


In a new episode of the Azure Enablement Show, Uli, Eric, and David have a lively discussion about what architects need to consider when designing mission critical solutions such as emergency services that must always work.

 

Read below for highlights and watch the video.

 

How is the architecture for mission critical apps different from that for other apps?

We'd like there to be a consistent set of architectural principles for everything from the small to the large, but when you're building mission critical applications you need to think in terms of the most important architectural patterns and system characteristics. When looking at mission critical apps, it's important to understand that failure will be present somewhere in these applications at the component level. The real question with mission critical apps is: what happens when one of those application components has an outage? How graceful is your user experience? Can you employ some of the techniques we talked about in season one to increase reliability and then potentially self-heal and self-recover, so that the failure stays transient and the users of that mission critical application never know the difference? The goal is that no data is lost and no transactions are sacrificed.
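As a minimal sketch of that self-heal-and-retry idea (the function and exception names are invented for illustration, not from the episode), a call into a flaky component can be wrapped in a retry with exponential backoff so a transient fault never reaches the user:

```python
import random
import time

class TransientError(Exception):
    """Raised by a component for recoverable faults (timeouts, 503s, ...)."""

def call_with_retry(operation, max_attempts=5, base_delay=0.5):
    """Retry a flaky call with exponential backoff and jitter so a
    transient component failure stays invisible to the end user."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except TransientError:
            if attempt == max_attempts:
                raise  # no longer transient: escalate instead of hiding it
            # back off exponentially, with jitter to avoid retry storms
            delay = base_delay * (2 ** (attempt - 1)) * random.uniform(0.8, 1.2)
            time.sleep(delay)
```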

 

Do we need to treat this class of applications differently from other apps?

Mission critical is always subjective in terms of the mission you are serving. Sometimes it's as extreme as lives being at risk. Sometimes it's the business that's at risk. Calling an environment mission critical simply means that the application is super important to you, and that your business, or even lives, are at risk if it fails.

 

Let's take the example of 911 emergency calling services in the US. If that service is down, people are at risk, and therefore you need to make sure the system is available. If you are running a business with a Black Friday shopping website, it needs to be up and running because a lot of business runs through the site. If your system is not up, you might put the business at risk.

 

At the end of the day it's a subjective decision, but once you decide that an application, system, or solution is mission critical, you need to change gears, and that includes a cost conversation. A mission critical system is more expensive to host, operate, and maintain than a system with, say, two nines as an availability measure. Two nines might be good enough for a lot of workloads, but for the key pieces that really run your business you want a higher level of availability and robustness.
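To see where the cost pressure comes from, each additional nine cuts the permitted downtime by a factor of ten. A quick back-of-the-envelope calculation:

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600

for nines, availability in [(2, 0.99), (3, 0.999), (4, 0.9999), (5, 0.99999)]:
    downtime = MINUTES_PER_YEAR * (1 - availability)
    print(f"{nines} nines ({availability * 100:g}%) -> "
          f"~{downtime:,.0f} minutes of downtime/year")

# 2 nines allow roughly 5,256 minutes (about 3.7 days) per year;
# 5 nines allow only about 5 minutes per year.
```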

 

 

Does mission criticality mean more expense?

An important first step is to classify the critical nature of the system. Once you understand the classification scheme, the goal is to identify which parts of the system, or which of its components, carry different levels of mission criticality. If something is so important that it can't tolerate more than five minutes of downtime a year, that classification will help you justify the required investment to your decision makers if they want that particular high level of service.
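One lightweight way to make that classification concrete is to record each component's tier next to the availability target it implies, so the conversation with decision makers starts from explicit numbers. The tier names, targets, and components below are illustrative assumptions, not an official taxonomy:

```python
# Illustrative criticality tiers; labels and targets are assumptions.
TIERS = {
    "mission_critical":  {"availability": 0.99999, "downtime_min_per_year": 5.3},
    "business_critical": {"availability": 0.9999,  "downtime_min_per_year": 52.6},
    "standard":          {"availability": 0.99,    "downtime_min_per_year": 5256},
}

# Hypothetical components mapped to tiers.
COMPONENTS = {
    "call-routing":    "mission_critical",   # the 911-style core path
    "checkout":        "business_critical",
    "recommendations": "standard",
}

for component, tier in COMPONENTS.items():
    target = TIERS[tier]
    print(f"{component}: {tier}, target availability {target['availability']:.5f}")
```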

 

It can be really expensive to replicate data across multiple regions so that data isn't lost in the case of a failure. Based on the criticality, you can decide whether to modify the system architecture or provision alternative scale units, and make the system highly available accordingly.

 

It's important to look at criticality and classification: what do they translate to in terms of reliability, security, and network connectivity, and how do they impact health modeling, pen testing, and any operational procedures that need to be in place to support the system?

 

Within a solution, can workloads be mission critical for different reasons?

When you think about mission critical workloads you also have to think about timing. Criticality isn't necessarily black and white; it can depend on the time of day or the situation you're in. Sometimes it's important to be able to vary the architecture patterns and to turn things on and off.

 

For example, imagine a system with two components or two workloads: a video stream of a sports event and a social capability around that event. On the day of the event, the video stream suddenly becomes mission critical, because people are sitting in front of the TV or on the internet watching the event, and they want to see it live. The provider also offers video on demand later, but people love to watch sports events in real time. The social component, while important, is not as mission critical: people will forgive you if it is a bit slow. They won't be happy, but they won't be as angry as they would be if the video feed failed while the game is on. The situation most likely reverses after the game has ended. Afterwards, people want to talk about the game, so the social component becomes far more important than the video stream.
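One way that time-dependent criticality might be expressed is with a simple scheduled priority switch; the workload names and event window below are made up for illustration:

```python
from datetime import datetime, timezone

# Hypothetical live window; in practice this would come from an event schedule.
EVENT_START = datetime(2022, 7, 22, 18, 0, tzinfo=timezone.utc)
EVENT_END   = datetime(2022, 7, 22, 21, 0, tzinfo=timezone.utc)

def workload_priorities(now: datetime) -> dict:
    """During the live event the video stream is mission critical;
    after the game, the social workload takes over that role."""
    if EVENT_START <= now <= EVENT_END:
        return {"video_stream": "mission_critical", "social": "best_effort"}
    return {"video_stream": "standard", "social": "mission_critical"}

print(workload_priorities(datetime.now(timezone.utc)))
```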

 

To detect and understand when the video signal is degrading or the social platform is showing signs of fatigue, you need robust health modeling. You want a mission critical application that tells you, as the owner, that there's a problem before it becomes widespread, before it becomes noticeable to customers, and certainly before they call you or start tweeting about it. Robust health monitoring includes architecture in terms of observability patterns, pipelines, logs, and metrics. You need to emit that telemetry, but also aggregate it and run inferences across it, so it can tell you, for instance, that people in a particular region watching on particular devices are showing network contention because they've dropped from 4K down to 720p. You'll know when it looks like there's going to be a problem, and you can do something about it.
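As a minimal sketch of that kind of health model (the metric names and the 1080p threshold are assumptions for illustration), aggregated playback telemetry can be scored per region and device class so degradation is flagged before customers start tweeting:

```python
from dataclasses import dataclass

@dataclass
class StreamSample:
    region: str
    device: str
    resolution_p: int  # e.g., 2160 for 4K, 720 for 720p

def degradation_signal(samples: list[StreamSample], floor_p: int = 1080) -> dict:
    """Fraction of sessions per (region, device) below the resolution floor;
    a rising fraction is an early warning, well before a full outage."""
    buckets: dict[tuple, list[int]] = {}
    for s in samples:
        buckets.setdefault((s.region, s.device), []).append(s.resolution_p)
    return {key: sum(1 for r in res if r < floor_p) / len(res)
            for key, res in buckets.items()}

samples = [StreamSample("westeurope", "mobile", 720),
           StreamSample("westeurope", "mobile", 2160),
           StreamSample("eastus", "tv", 2160)]
print(degradation_signal(samples))
# {('westeurope', 'mobile'): 0.5, ('eastus', 'tv'): 0.0}
```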

 

 

Would an architect ever plan on changing design patterns on the fly?

There may never be a point in an application's life cycle where you decide in advance to switch to the beta design pattern instead of the alpha one. But there may be a scenario where your observability pipeline and health modeling surface a problem state you didn't foresee, and you decide the application as it exists today in production can't cope with it. You need to quickly figure out what to do: scale test the fix and deploy it as quickly as possible into the affected regions, or across the entire implementation, so you can resolve it before it becomes an issue. There's an element of deployment and testing here, in terms of CI/CD, and being able to fix things quickly so the system doesn't just come down. This is really important.

 

 

Wouldn't an architect want to have a contingency pattern in place to fall back on?

A key point here is the composition of architecture patterns and solutions into an end-to-end capability. Go back to the sports event example: there were three workloads, the real-time event, the social media around it, and video on demand, which is a different solution but might reuse some capabilities of the real-time workload. Because these workloads are different, you can vary the cost, reliability, and availability patterns accordingly. That flexibility also means you can use various patterns against this load: load balancing, elasticity, load shedding, or throttling. All of those are fair game, and obviously not just for mission critical systems.
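As one small illustration of the throttling and load-shedding family (the rate and burst numbers are arbitrary), a token bucket admits requests at a sustained rate and sheds the excess rather than letting a spike take the service down:

```python
import time

class TokenBucket:
    """Admit requests at a sustained rate; shed the excess instead of
    queueing unboundedly when a traffic spike hits."""
    def __init__(self, rate_per_sec: float, burst: int):
        self.rate = rate_per_sec
        self.capacity = burst
        self.tokens = float(burst)
        self.last = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        # refill proportionally to elapsed time, capped at burst capacity
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False  # shed this request (e.g., answer with HTTP 429)

bucket = TokenBucket(rate_per_sec=100, burst=20)
if not bucket.allow():
    pass  # respond "try again later" rather than degrading everyone
```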

 

Mission critical systems mostly deal with availability and reliability, meaning the system cannot fail when disaster strikes. Let's say you are in a specific region and that entire region is out. The system still must be available, and ideally it is made of multiple parts. Not necessarily many parts, but multiple parts, so that you can determine, based upon requirements, time of day, or whatever it might be, how to make the capability available at a specific level of availability or robustness. Building super robust systems, meaning making them available in multiple areas, costs money, and therefore it must be planned for the right reasons, not just because you can.

 

Watch the full video here:

 

For more information about the Microsoft Azure Well-Architected Framework: https://aka.ms/azenable/82/02

 

 

 

Bonus Episode

 

In a bonus episode, our Armchair Architects, Uli and Eric, revisit the off-camera conversation with David about how cloud solution architects think about microservices when designing mission-critical applications.

 

 

Using microservice architecture patterns to maintain high availability

There is an architecture pattern the community has embraced called microservices. We generally associate microservices with a smaller feature scope and faster, more agile delivery against that managed scope. But it's also a really great pattern to think about for mission critical applications, or applications in general. You can compose microservices into areas that potentially have different availability requirements, and then you can utilize capabilities like fault domains and update domains.

 

A fault domain simply describes how many copies of the application run across failure zones in a Microsoft data center. For example, a failure zone might be a rack, because the top-of-rack switch has proven to be the most unreliable component in a compute environment. That's being mitigated with multiple top-of-rack switches, but the concept of a fault domain is still there and still applicable. An update domain simply says: if I have five copies running, I can only take a maximum of, say, two out at the same time to update them to the next version.

Microservices allow us to think about this in an end-to-end fashion and apply functional updates, and potentially technical-debt updates, within that scope so that the system as a whole stays available. An individual component might have degraded performance, because taking one update domain out leaves only three copies to do the work. Putting it all together in mission critical solutions, availability, update domains, fault domains, and operational constraints, is really important as part of thinking through how you make your application highly available. It's not a static thing. It lives and breathes like all great applications do.
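A toy sketch of those two constraints (the replica and domain counts mirror the five-copies example above, but the code itself is invented for illustration): replicas are spread across fault domains, and an update walks one update domain at a time so most copies keep serving:

```python
# Five replicas spread across five fault domains (e.g., one rack each),
# grouped into three update domains so at most two go down per wave.
REPLICAS = [f"replica-{i}" for i in range(5)]
FAULT_DOMAINS = {r: i for i, r in enumerate(REPLICAS)}
UPDATE_DOMAINS = {r: i % 3 for i, r in enumerate(REPLICAS)}

def rolling_update(version: str):
    """Take down at most one update domain at a time; the remaining
    replicas keep serving, with temporarily reduced headroom."""
    for ud in sorted(set(UPDATE_DOMAINS.values())):
        wave = [r for r in REPLICAS if UPDATE_DOMAINS[r] == ud]
        print(f"updating {wave} to {version}; "
              f"{len(REPLICAS) - len(wave)} replicas still serving")
        # ...drain, update, health-check, then move to the next domain...

rolling_update("v2")
```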

 

 

Don’t microservices run counter to the concept of simplicity?

Introducing microservices increases complexity. The alternative is the monolith, a monolithic piece of code that gets deployed as a unit, tested as a unit, regression tested as a unit, and managed as a unit, and that has significant drawbacks in terms of flexibility and accommodating resiliency and failure. In a monolith, if there's a failure, then the entire application likely fails. In a microservices architecture there is tolerance of transient failure among service components, and the ability to inject capabilities that accommodate that failure. So while the complexity is significant, it's not necessarily a different level of complexity between monolith and microservice architectures. What that complexity buys you is governance around domain-specific elements of functionality that can fail gracefully, if your application is architected the right way, versus the monolithic approach.

 

There are solutions that contain hundreds of microservices, but it doesn't have to be that way. Consider a commerce application where the front end is four or five microservices and the back end is maybe 10 or 15 fulfillment services. You have two systems that coordinate and cooperate, ideally over a robust mechanism like a queue, so that the front-end system doesn't interact with the back-end system directly. It simply puts requests on the queue and uses master data and cacheable data to fulfill requests. Because the two systems are separated by a queuing mechanism, they can scale independently of each other. They can also fail independently of each other, for example if the back-end system is unavailable or can't quite keep up with demand. If you architect it right, the front end can continue to accept orders while the back end catches up. There is a certain breaking point, though: if the queue gets too long, the back-end system cannot keep up and you cannot recover anymore.
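A minimal sketch of that queue-based decoupling (the function names are hypothetical, and a production system would use a durable broker such as a managed queue service rather than this in-process stand-in):

```python
import queue

order_queue: "queue.Queue[dict]" = queue.Queue()

def front_end_accept(order: dict) -> str:
    """The front end only enqueues; it never calls the back end directly,
    so it can keep taking orders while fulfillment catches up."""
    order_queue.put(order)
    return "order accepted"

def fulfill(order: dict):
    print(f"fulfilling {order['id']}")  # placeholder fulfillment step

def back_end_drain(batch_size: int = 10):
    """Fulfillment pulls at its own pace and scales independently."""
    for _ in range(batch_size):
        try:
            order = order_queue.get_nowait()
        except queue.Empty:
            return  # nothing to do; the back end can scale down
        fulfill(order)

front_end_accept({"id": "A-100"})
back_end_drain()
```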

 

 

Is a microservices solution more or less expensive to operate?

Let's take the commerce application as an example. We have a commerce front end and a commerce back end. You separate them with a queue and don't allow complex shared data updates, like inventory; if you need to update the inventory often, that gets expensive. One could argue the system is ultimately cheaper because you can scale the front end independently of the back end, with the queue acting as a buffer. You can also scale the queue independently of the front and back ends, so you have three levers for managing cost and growth. You should be able to manage the solution more affordably than if the front end had direct access to the back end and, in the worst case, directly instantiated back-end objects.

 

Consider the alternative: as soon as a session object is created in the front end, something equivalent is created on the back end, which means the front end and back end have to scale at the same level. Most of the time the back-end components, like fulfillment or logistics, are far more expensive to instantiate than the front-end components that deal with users. Linking them in the semantic model would be a bad idea, so you decouple them, which lets you scale them independently and recognizes that the cheaper front-end components and the more process-heavy back-end components are effectively not symmetric.

 

Arguably, the monolith is always running, all of its components are humming, and there's only one level of capability: it's either on or off. If you separate the components, the high-use components are optimized for cost and take the traffic, while the other, potentially more expensive components sit idle until they're needed. They get spun up, do their jobs, and go back to sleep.

 

Are microservices easier or harder to test?

There are more options, but with more options comes more complexity, unless you have a plan. For example, if you've got a microservice with two or three dependencies on other microservices, you have options: you can test your component against a canary version of the target microservice, a test version of that microservice, a backwards- or down-level-compatible version, or a mock object, and you determine which as you run through baseline tests. Each of those options is valid from an A/B testing perspective, but the focus is on what condition you're actually releasing for and what you're testing for. Coupled with a canary methodology, in which you release a version of your microservice, A/B tested against other dependent versions, to a small group of real-world users, the goal is to leverage this flexibility to the best of your ability in order to release quality code and experiences.
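As a concrete illustration of the mock-object option (the checkout function and pricing client are invented; only unittest.mock is real), a dependency injected as a client can be swapped for a canary, a down-level version, or a mock during baseline tests:

```python
from unittest.mock import Mock

def checkout(cart, pricing_client):
    """Depends on a pricing microservice; the client is injected so tests
    can substitute a canary, a down-level version, or a mock."""
    total = sum(pricing_client.price(item) for item in cart)
    return {"total": total, "items": len(cart)}

# Baseline test: stand in for the real pricing service with a mock.
mock_pricing = Mock()
mock_pricing.price.return_value = 10.0
assert checkout(["a", "b"], mock_pricing) == {"total": 20.0, "items": 2}
```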

 

The canary is really a replacement for user acceptance testing (UAT) environments, because UAT usually relies on artificially created traffic, whereas a canary is an actual production environment, possibly with a lower SLA. In Azure we have a canary region that is a normal Azure region that just happens to have a lower SLA, because all the changes we make to the various Azure services go in there first, before they get released into the broader Azure regions. From our perspective it's a great way to replace UAT with a real production service, while recognizing that the changes haven't yet been proven against real production workloads, so you have to manage user expectations. It's largely a psychological experience: canary users want the latest features, and they're going to be more tolerant of disruptions in quality and code. It's a really authentic way to see how people interact with your system, albeit at a smaller scale, with the added benefit that people are friendlier to your system, report bugs, and help you improve quality before you take it into the general release cycle.

  

If you think about the way user acceptance testing works, it's not necessarily monolithic from a technology perspective, but it is monolithic from a testing perspective: you waterfall all changes through stages such as dev, test, user acceptance, and pre-production. In this new world, where microservices are the way we deliver, you don't necessarily want that waterfall. You want services to roll into production on their own schedule, based upon user demand and capability. Therefore you can use a canary-based, ring-based model, where ring zero is the canary ring, ring one is the next production region, and so forth. This is much faster; Azure is composed of about 400-450 services, and they don't all have to ship on the same day. We let each service do its own thing, and we use the canary method and the ring methodology to make sure the quality is what it needs to be for each individual service, so we don't need user acceptance testing in the classic sense. If you're really serious about microservices, you should think about canary testing and ring-based deployment models as part of your methodology.
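A schematic of that ring progression (the ring names and the health gate are illustrative): each service promotes its own release outward from the canary ring only after the previous ring looks healthy:

```python
# Ring 0 is the canary; later rings carry progressively more traffic.
RINGS = ["ring0-canary", "ring1-first-region", "ring2-broad", "ring3-global"]

def ring_is_healthy(ring: str) -> bool:
    """Placeholder for real gates: error rates, latency, user reports."""
    return True

def promote(service: str, version: str):
    """Each service moves through the rings on its own schedule."""
    for ring in RINGS:
        print(f"{service} {version} -> {ring}")
        if not ring_is_healthy(ring):
            print(f"halting rollout of {service} at {ring}")
            return  # roll back or fix forward before continuing
    print(f"{service} {version} fully released")

promote("catalog-service", "2.4.1")
```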

 

 

Watch the full video of the bonus episode here:

 

If you’d like to learn more about Azure’s Well-Architected initiative, go to our public page for an overview.

 
