Over the last couple of months I’ve been writing about Intune’s journey to become a globally scaled cloud service running on Azure. I’m treating this as Part 3 (here’s Part 1 and Part 2) of a 4-part series.
Today, I’ll explain how we were able to make such dramatic improvements to our SLAs, scale, performance, and engineering agility.
I think the things we learned while doing this can apply to any engineering team building a cloud service.
Last time, I noted the three major things we learned during the development process:
After we rolled out our new architecture, we focused on evolving and optimizing our services/resources and improving agility. We came up with four groups of goals to help us evolve quickly and at high quality:
Here’s how we did it:
The overarching goal we defined for availability/SLA (strictly speaking, SLO) was to achieve 4+ 9’s for all our Intune services.
Before we started the entire process described in this blog series, less than 25% of our services were running at 4+ 9’s, and 90% were running at 3+ 9’s.
Clearly something needed to change.
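To put those numbers in perspective, four nines of availability leaves a downtime budget of only about 4.3 minutes per month, compared with roughly 43 minutes at three nines. The quick calculation below is a back-of-the-envelope sketch (not taken from our dashboards) showing where those figures come from:

```python
# Rough downtime budgets implied by different availability targets.
# Illustrative arithmetic only, not Intune's actual measurements.

MINUTES_PER_MONTH = 30 * 24 * 60    # 43,200 minutes
MINUTES_PER_YEAR = 365 * 24 * 60    # 525,600 minutes

for label, availability in [("3 nines (99.9%)", 0.999),
                            ("4 nines (99.99%)", 0.9999)]:
    downtime_month = MINUTES_PER_MONTH * (1 - availability)
    downtime_year = MINUTES_PER_YEAR * (1 - availability)
    print(f"{label}: about {downtime_month:.1f} min/month, "
          f"{downtime_year:.1f} min/year of allowed downtime")
```

Tightening the downtime budget by 10x for every service is what made the systematic review described next necessary.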
First, a carefully selected group of engineers began a systematic review of where we needed to drive SLA improvements across the 150+ services. This review uncovered a variety of hidden issues, and the fixes we rolled out over the next six months made a dramatic difference. Here are a few of the big ones:
The result of all the above efforts was phenomenal. The chart below demonstrates this dramatic improvement after the changes were rolled out.
You’ll notice that we started with less than 25% of services at 4+ 9’s, and by the time we rolled out all the changes, 95% or more of our services were running at 4+ 9’s! Today, Intune maintains 4+ 9’s for over 95% of our services across all our clusters around the world.
The re-architecture process enabled us to handle most of our growth by scaling out the cluster. It was clear, however, that the growth we were experiencing also required scale-up optimizations. The biggest workload in Intune is triggered when a device checks in to the service to receive policies, settings, apps, etc. – and we chose this workload as our first target.
The scale target goal we set was 50k devices checking in within a short period (approximately 10 minutes) for a given cluster. For reference, at the time we set this goal, our scale was at 3k devices in a 10-minute window for an individual cluster – in other words our scale had to increase by about 17x.
As with the SLA work we did, a group of engineers pursued this effort and acted as a single unit to tackle the problem. Some of the issues they identified and improved included:
By the end of this exercise, we were able to increase the scale from 3k device check-ins to 70k+ device check-ins – an increase of more than 23x – and we did this without scaling out the cluster!
Our goal for performance had a very specific target: Culture change.
We wanted to ensure that our performance was continuously evaluated in production, and we wanted to be able to catch performance regressions before releasing to production.
To do this, we first used the Azure profiler and its associated flame graphs to perform continuous profiling in production. This process helped our engineers drive several key improvements, and it subsequently became a powerful daily tool for finding bottlenecks in code, inefficiencies, high CPU usage, and so on. Some of the improvements the engineering team identified as a result of this continuous profiling include:
Our next action was to stand up a benchmark service that constantly ran high traffic in our pre-production environments, with the goal of catching performance regressions. We also began running a consistent traffic load (equivalent to production loads) across all services in our pre-production environments, and we made it a practice to treat a performance drop in pre-production as a major blocker for production releases.
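A release gate of this kind can be as simple as comparing the latest pre-production benchmark numbers against a stored baseline and blocking the release if any service regresses beyond a threshold. The sketch below is hypothetical: the service names, metrics, and 10% threshold are illustrative assumptions, not our actual tooling.

```python
# Hypothetical sketch of a pre-production performance gate.
# Baseline and latest numbers would come from the benchmark service;
# here they are hard-coded for illustration.

REGRESSION_THRESHOLD = 0.10  # fail the gate on a >10% p95 latency increase

baseline_p95_ms = {"checkin": 180.0, "policy": 95.0, "apps": 120.0}
latest_p95_ms   = {"checkin": 185.0, "policy": 140.0, "apps": 118.0}

def find_regressions(baseline, latest, threshold):
    """Return services whose p95 latency grew by more than `threshold` (fractional)."""
    regressions = {}
    for service, base in baseline.items():
        current = latest.get(service)
        if current is not None and (current - base) / base > threshold:
            regressions[service] = (base, current)
    return regressions

regressions = find_regressions(baseline_p95_ms, latest_p95_ms, REGRESSION_THRESHOLD)
if regressions:
    for service, (base, current) in regressions.items():
        print(f"BLOCKER: {service} p95 regressed from {base:.0f} ms to {current:.0f} ms")
    raise SystemExit(1)  # block the production release
print("No performance regressions detected; release can proceed.")
```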
Together, both of these actions became the norm in the engineering organization, and we are proud of this positive culture change in meeting the performance goal.
As called out in the first post in this series, Intune is composed of many independent and decoupled Service Fabric services. The development and deployment of these services, however, are genuinely monolithic in nature: they deploy as a single unit, and all services are developed in a single large repo – essentially, a monolith. This setup was an intentional decision when we started our modern service journey, because a large portion of the team was focused on the re-architecture effort and our cloud engineering maturity was not yet fully realized. For these reasons we chose simplicity over agility. As we dramatically increased our feature investments (both in the number of features and the number of engineers working on them), we started experiencing agility issues. The solution was to decouple the services in the monolith from development, deployment, and maintenance perspectives.
To do this we set three primary goals for improving agility:
As indicated above, our lack of agility was initially hurting us when it came to rapidly delivering features. Pull requests (PRs) would sometimes take days to complete due to the aforementioned monolithic nature of the build environment – any change anywhere by anyone in Intune would impact everyone’s PR. On any given day, the churn was so high that it was extremely hard to get stable, fast builds or PRs. This, in turn, impacted our ability to deploy this massive build to our internal dogfood environments. In the best case, we were able to deploy once or twice per week. This, obviously, was not something we wanted to sustain.
We invested in decoupling the services in the monolith to improve our agility. Over a period of 2+ years, we made two major improvements:
As this investment progressed and evolved, we started seeing huge benefits. Here are some examples:
The chart below shows one such example. The black line shows that we went from single digits to 1000’s of deployments per month in production environments. In pre-production dogfood environments, this number was even higher – typically reaching 10’s of deployments per day across all the services in a single cluster.
Today, Intune is part monolith and part micro services. Eventually, we expect to compose 40-50 micro services from the existing monolith. There are challenges in managing micro services because of the way they are independently created and managed, and we are developing tooling to address some of these issues. For example, binary dependencies between micro services can be a problem because of version conflicts, so we developed a dependency tool to identify conflicting or missing binary dependencies between micro services. Automation also matters when a critical fix needs to be rolled out across all micro services to mitigate a common library issue; without proper tooling, it can be very hard and time consuming to propagate the fix everywhere. Similarly, we are developing tooling to determine all the resources required by a micro service, as well as handle resource management aspects such as key rotation, expiration, etc.
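To give a flavor of what a dependency tool like that does, the sketch below checks a set of micro services for binary packages referenced at conflicting versions. The service and package names are made up for illustration; our internal tool is more involved than this.

```python
# Hypothetical sketch of a binary-dependency conflict check across micro services.
# Each service declares the package versions it ships; the check flags any
# package referenced at more than one version.

from collections import defaultdict

service_dependencies = {
    "checkin-service": {"Common.Logging": "2.1.0", "Policy.Contracts": "1.4.0"},
    "policy-service":  {"Common.Logging": "2.1.0", "Policy.Contracts": "1.5.0"},
    "apps-service":    {"Common.Logging": "1.9.0"},
}

def find_version_conflicts(services):
    """Map each package to the set of versions referenced, keeping only conflicts."""
    versions_by_package = defaultdict(set)
    for deps in services.values():
        for package, version in deps.items():
            versions_by_package[package].add(version)
    return {pkg: vers for pkg, vers in versions_by_package.items() if len(vers) > 1}

for package, versions in find_version_conflicts(service_dependencies).items():
    users = [svc for svc, deps in service_dependencies.items() if package in deps]
    print(f"Conflict: {package} referenced at versions {sorted(versions)} by {users}")
```

The same inventory of per-service dependencies also makes it easier to automate rolling out a common library fix to every micro service that references it.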
There were 3 learnings from this experience that are applicable to any large-scale cloud service:
The improvements that came about from this stage of our cloud journey have been incredibly encouraging, and we are proud of operating our services with high SLA and performance while also rapidly increasing the scale of our traffic and services.
The next stage of our evolution will be covered in Part 4 of this series: a look at our efforts to make the Intune service even more reliable and efficient by ensuring that the rollout of new features produces minimal-to-no impact on existing feature usage by customers – all while continuing to improve our engineering agility.