Earlier this year, I published the first post in a four-part series that examines Intune’s journey to become a global, scalable cloud service. Today, in Part 2, I’ll explain the three proactive actions we took to prepare for the growth immediately ahead of us. The key things we learned along the way are summarized at the end.
While this blog primarily discusses engineering learnings, if you are an Intune administrator, I hope it gives you an added level of confidence in the service you depend on every day; an extraordinary amount of dedication and thought goes into building, operating, scaling, and, most importantly, continuously improving Intune as a service. I also hope some of these learnings are applicable to you; we certainly learned a ton over the years about the importance of data-driven analysis and planning.
To quickly recap Part 1 in this series, the four key things we learned from re-building Intune were:
Deciding to make our culture and decision-making ethos entirely data-driven was our absolute top priority. When we realized that the data and telemetry available to us could be core parts of engineering the Intune services, the decision was obvious. But we went further by making the use of data a fundamental part of every person’s job and every step we took with the product.
To entrench data-driven thinking into our teams, we took a couple different approaches:
In other words: We took every opportunity, in any incident or meeting, to emphasize data usage – and we kept doing it until the shift to a hypothesis-driven engineering mindset became a natural part of our behavior. Once we had this, every feature we built had telemetry and alerting in place, verified in our pre-production environments before release to customers in production.
Now, whenever we find a gap in telemetry or alerting in production, we make it a high priority to track and fix it. This continues to be a core part of our culture today.
The result of this change was dramatic and measurable. For example, before this culture change, we didn’t have access to (nor did we track) telemetry on how many customer incidents we could detect via our internal telemetry and alerting mechanisms. Now, a majority of our core customer scenarios are detected by internal telemetry, and our goal is to push this above 90%.
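To make the metric concrete, here is a minimal sketch of how a detection-coverage number like this can be computed. The incident records and field names are hypothetical, not Intune's actual schema:

```python
from dataclasses import dataclass

@dataclass
class Incident:
    id: str
    detected_by: str  # "telemetry" (internal alerting) or "customer" (reported to support)

def telemetry_detection_rate(incidents):
    """Fraction of customer-impacting incidents first caught by internal telemetry."""
    if not incidents:
        return 0.0
    internal = sum(1 for i in incidents if i.detected_by == "telemetry")
    return internal / len(incidents)

# Hypothetical sample: 3 of 4 incidents were detected internally.
incidents = [
    Incident("INC-1", "telemetry"),
    Incident("INC-2", "customer"),
    Incident("INC-3", "telemetry"),
    Incident("INC-4", "telemetry"),
]
rate = telemetry_detection_rate(incidents)
print(f"Detected internally: {rate:.0%} (goal: >90%)")  # prints "Detected internally: 75% (goal: >90%)"
```

Tracking this single ratio over time is what turns "we think our alerting is good" into a measurable engineering goal.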
Having predictive capacity analysis within Intune was something we simply could not live without. We had to have a way to take proactive action by anticipating potential scale limits well before we actually hit them. To do this, we invested in predictive models by analyzing all our scenarios, their traffic and call patterns, and their resource consumption.
The modeling was a fairly complex and automated process, but here it is at a high level:
Once we had defined the capacity and workload units, we could easily chart the maximum workload units we could support alongside existing usage, and be alerted any time usage exceeded a pre-defined percentage of capacity so that we could take proactive steps.
Initially, our thresholds were 45% of capacity as the “red” line and 30% as the “orange” line, to account for any errors in our models. We also chose a preference toward over-provisioning rather than over-optimizing for perf and scale. A snapshot of such a chart is included below in Figure 1. The blue bars represent our maximum capacity, the black lines represent our current workloads, and the orange and red lines represent their respective thresholds. Each blue bar represents one ASF cluster (refer to the first blog on ASF). Over time, once we had verified our models, we raised these thresholds significantly.
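The threshold check itself is simple once capacity is expressed in workload units. Here is a minimal sketch using the 30%/45% lines described above; the cluster names and unit counts are made up for illustration:

```python
# Thresholds from the capacity model: alert early at "orange", act at "red".
ORANGE = 0.30
RED = 0.45

def cluster_status(current_units, max_units, orange=ORANGE, red=RED):
    """Classify a cluster by its workload-unit utilization."""
    utilization = current_units / max_units
    if utilization >= red:
        return "red"
    if utilization >= orange:
        return "orange"
    return "ok"

# Hypothetical clusters: (current workload units, max workload units).
clusters = {
    "cluster-01": (2600, 10000),
    "cluster-02": (3400, 10000),
    "cluster-03": (4800, 10000),
}
for name, (current, cap) in clusters.items():
    print(f"{name}: {current / cap:.0%} utilized -> {cluster_status(current, cap)}")
```

A chart like Figure 1 is essentially this computation rendered per cluster; the conservative gap between the orange line and actual capacity is what buys time for proactive scaling.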
The results of the capacity modeling and prediction we designed turned out to be a major eye-opener. As you can see in Figure 1, we were above the “orange” line for many of our clusters, and this indicated that we needed to take action. From this data (and upon further analysis of our services, clusters, and a variety of metrics), we drew the following three valuable insights:
We quickly realized that even though we could scale out, we could not scale our nodes up from the existing SKUs because we were running on pinned clusters. In other words, it was not possible to upgrade these nodes to the more powerful D15 Azure SKU (with 3x the CPU cores, 2.5x the memory, SSDs, etc.). As noted in Learning #2 above, discovering that an in-place upgrade of the cluster to a higher SKU was not possible was a big lesson for us. As a result, we had to stand up an entirely new cluster with the new nodes – and, since all our data was in-memory, this meant that we needed to perform a data migration from the existing cluster to the new cluster.
This type of data migration from one cluster to another was not something we had ever practiced before, and it required us to invest in many data-migration drills. As we ran these in production, we learned yet another valuable lesson: any data move from one source to another requires efficient and intelligent data integrity checks that can be completed in a matter of seconds.
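One common way to get integrity checks down to seconds (a sketch of a general technique, not necessarily Intune's actual implementation) is to compare compact per-partition digests instead of individual records, so verifying millions of objects reduces to a handful of hash comparisons:

```python
import hashlib

def partition_digest(records):
    """Order-independent digest of a partition: XOR of per-record SHA-256 hashes.

    XOR makes the digest independent of iteration order, so source and
    destination can compute it without sorting.
    """
    acc = 0
    for key, value in records.items():
        h = hashlib.sha256(f"{key}={value}".encode()).digest()
        acc ^= int.from_bytes(h, "big")
    return acc

# Hypothetical device-state partitions on the old and new clusters.
source = {"device-1": "compliant", "device-2": "noncompliant"}
dest = dict(source)
assert partition_digest(source) == partition_digest(dest)  # migration verified

# A corrupted or missing record changes the digest (with overwhelming probability).
dest["device-2"] = "compliant"
assert partition_digest(source) != partition_digest(dest)
```

Only partitions whose digests disagree need a slower record-by-record comparison, which is what keeps the common (all-clean) case fast.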
The second major change (as mentioned in the three insights above) was implementing persistence for our in-memory services, which allowed us to rebuild state in just a matter of seconds. Our analyses had shown rebuild times growing steadily and causing significant availability losses, because state transfer used a full copy from the primary to the secondary replicas. We also had a great collaboration (with very promising results) with Azure Service Fabric in implementing persistence with Reliable Collections.
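The general idea behind this kind of persistence can be sketched simply (this is an illustration of log-replay rebuild, not the actual Reliable Collections API): each mutation is appended to a local log, so a restarted replica rebuilds its in-memory state by replaying the log rather than copying the full state over the network from the primary:

```python
import io
import json

class PersistedDict:
    """In-memory dict whose mutations are appended to a durable log."""

    def __init__(self, log):
        self.log = log      # any writable text stream (a file on disk in practice)
        self.state = {}

    def set(self, key, value):
        self.state[key] = value
        # Append-only write: cheap on the hot path, replayable on restart.
        self.log.write(json.dumps({"k": key, "v": value}) + "\n")

    @staticmethod
    def rebuild(log_contents):
        """Rebuild in-memory state locally by replaying the mutation log."""
        d = PersistedDict(io.StringIO())
        for line in log_contents.splitlines():
            entry = json.loads(line)
            d.state[entry["k"]] = entry["v"]  # later entries win
        return d

log = io.StringIO()
d = PersistedDict(log)
d.set("policy-1", "assigned")
d.set("policy-1", "revoked")

restored = PersistedDict.rebuild(log.getvalue())
assert restored.state == {"policy-1": "revoked"}
```

Because replay is a local, sequential read, rebuild time scales with log size rather than with network transfer of the entire state, which is consistent with the order-of-magnitude rebuild improvements described below.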
The next major change was moving away from our home-grown pub/sub architecture, which was showing signs of end-of-life. We recognized that it was time to re-evaluate our assumptions about usage, data/traffic patterns, and design so that we could assess whether the design was still valid and scalable for the changes we were seeing. We found that, in the meantime, Azure had evolved significantly and now offered solutions that fit our needs far better than anything we could build ourselves.
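For readers unfamiliar with the pattern, here is a minimal in-process sketch of the role a pub/sub component plays (the topic names are hypothetical); at cloud scale, this is the responsibility a managed Azure messaging service can take over from a home-grown system:

```python
from collections import defaultdict

class PubSub:
    """Minimal topic-based publish/subscribe dispatcher."""

    def __init__(self):
        self.subscribers = defaultdict(list)

    def subscribe(self, topic, handler):
        self.subscribers[topic].append(handler)

    def publish(self, topic, message):
        # Fan the message out to every handler registered for the topic.
        for handler in self.subscribers[topic]:
            handler(message)

received = []
bus = PubSub()
bus.subscribe("device.checkin", received.append)
bus.publish("device.checkin", {"device": "d1"})
bus.publish("device.enroll", {"device": "d2"})  # no subscribers; dropped
assert received == [{"device": "d1"}]
```

The hard parts a managed service adds over this sketch (durable queues, delivery guarantees, back-pressure, scale-out) are exactly the ones that make a home-grown implementation age poorly.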
The changes noted above represented what was essentially a re-architecture of Intune services, and this was a major project to undertake. Ultimately, it would take a year to complete. But, fortunately, this news did not catch us off guard; we had very early warning signs from the capacity models and the orange line thresholds which we had set earlier. These early warning signs gave us sufficient time to take proactive steps for scaling up, out, and for the re-architecture.
The results of the re-architecture were extremely impressive. See Figures 2, 3, and 4 below, which summarize the results. Figure 2 shows that P99 CPU usage dropped by more than 50%, Figure 3 shows that P99 latency was reduced by 65%, and Figure 4 shows that the rebuild performance for a state transfer of 2.4M objects went from 10 minutes to 20 seconds.
Through this process, we learned three critical things that are applicable to any large-scale cloud service:
After the rollout of our re-architecture, the capacity charts immediately showed a significant improvement. The reliability of our capacity models, as well as the ability to scale up and out, gave us enough confidence to increase the thresholds for orange and red lines to higher numbers. Today, most of our clusters are under the orange line, and we continue to constantly evaluate and examine the capacity planning models – and we also use them to load balance our clusters globally.
By doing these things we were ready and able to evolve our tools and optimize our resources. This, in turn, allowed us to scale better, improve SLAs, and increase the agility of our engineering teams. I’ll cover this in Part 3.