Intune’s Journey to a Highly Scalable Globally Distributed Cloud Service
Earlier this year, I published the 1st blog post in a 4-part series that examines Intune’s journey to become a global, scalable cloud service. Today, in Part 2, I’ll explain the three proactive actions we took to prepare for immediate future growth. The key things we learned along the way are summarized at the end.
While this blog primarily discusses engineering learnings, if you are an Intune administrator, I hope this blog gives you an added level of confidence on the service that you depend on every day; there is extraordinary amount of dedication and thought that go into building, operating, scaling and most importantly continuously improving Intune as a service. I hope some of the learnings in this blog are also applicable to you, we certainly learned a ton over the years on the importance of data driven analysis and planning.
To quickly recap Part 1 in this series, the four key things we learned from re-building Intune were:
- Make telemetry and alerting one of the most critical parts of your design – and continue to refine the telemetry and alerting after the feature is in production.
- Know your dependencies. If your scale solution doesn’t align to your dependent platform, all bets are off.
- Continually validate your assumptions. Many cloud services/platforms are still evolving, and assumptions from 1 month ago may no longer be valid.
- Make it a priority to do capacity This is the difference between being reactive and proactive for scale issues.
With all of that in mind, here (in chronological order) are the actions we took based on what we learned:
Action #1: Fostering a Data-driven Culture
Deciding to make our culture and decision-making ethos entirely data-driven was our absolute top priority. When we realized that the data and telemetry available to us could be core parts of engineering the Intune services, the decision was obvious. But we went further by making the use of data a fundamental part of every person’s job and every step we took with the product.
To entrench data-driven thinking into our teams, we took a couple different approaches:
- Moneyball training
- Repeated emphasis in daily standups and data examinations.
- Instituting weekly and monthly post-mortem reviews as well as Intune-wide service, SLA, and incident reviews.
In other words: We took every opportunity, in any incident or meeting, to emphasize data usage – and we kept doing it until the culture shift to a hypothesis-driven engineering mindset became a natural part of our behavior. Once we had this, every feature that we built had telemetry and alerting in place and it was verified in our pre-production environments before releasing to customers in production.
Now, every time we found a gap in telemetry and/or alerting in production, we could make it a high priority to track and fix the gaps. This continues to be a core part of our culture today.
The result of this change was dramatic and measurable. For example, before this culture change, we didn’t have access to (nor did we track) telemetry on how many customer incidents we could detect via our internal telemetry and alerting mechanism. Now, a majority of our core customer scenarios are detected by internal telemetry, and our goal is to get this to > 90%.
Action #2: Capacity Planning
Having predictive capacity analysis within Intune was something we simply could not live without. We had to have a way to take proactive actions by anticipating potential scale limits much earlier than they actually happened. To do this, we invested in predictive models by analyzing all our scenarios, their traffic and call patterns, and their resource consumptions.
The modeling was a fairly complex and automated process, but here it is at a high level:
- This model resulted in what we called workload units.
- A workload unit is defined as a resource-consuming operation.
- For example, a user login may equal to 1 workload unit while a device login may equal to 4 workload units – i.e., 4 users consume similar number of resources as 1 device.
- A resource is defined by looking at a variety of metrics:
- CPU cores
- Memory
- Network
- Storage
- Disk space
- And evaluating the most limiting resource(s).
- Typically, this turns out to be CPU and/or memory.
- Using the workload definitions, we generated capacity in terms of workload units.
- For example, if 1 user consumed 0.001% of CPU, 1 CPU core would equate to a capacity of 100,000 user workload units.
- That is, we can support a max of 100,000 users or 25,000 devices (since 4 users == 1 device) or combinations of them with 1 CPU core.
- We then compute the total capacity (i.e., max workloads) of the cluster based on the number of nodes in the cluster.
Once we had defined the capacities and workloads units, we could easily chart the maximum workload units we could support, the existing usage, and be alerted anytime the threshold exceeded a pre-defined percentage so that we could take proactive steps.
Initially, our thresholds were 45% of capacity as “red” line, and 30% as “orange” line to account for any errors in our models. We also chose a preference toward over-provisioning rather than over-optimizing for perf and scale. A snapshot of such a chart is included below in Figure 1. The blue bars represent our maximum capacity, the black lines represent our current workloads, and the orange and red lines represent their respective thresholds. Each blue bar represents one ASF cluster (refer to first blog on ASF). Over time, once we verified our models, we increased our thresholds significantly higher.
Figure 1: Intune’s Predictive Capacity Model
Action #3: A Re-Architecture Resulting from Capacity Prediction
The results of the capacity modeling and prediction we designed turned out to be a major eye-opener. As you can see in Figure 1, we were above the “orange” line for many of our clusters, and this indicated that we needed to take some actions. From this data (and upon further analyses of our services, cluster, and a variety of metrics), we drew the following very valuable three insights:
- Our limiting factor was the node as well as cluster This meant we had to scale up and out.
- Our architecture with stateful in-memory services required persistence so that secondary replicas could rebuild from on-node disk states rather than performing a full copy state transfer every time the secondary replica started (g. such as process, node restarts, etc).
- Our messaging (pub/sub) architecture needed to be replaced from a home-grown solution with Azure Event Hubs so that we could leverage the platform that satisfied our needs of high throughput and scale.
We quickly realized that even though we could scale out, we could not scale our nodes up from the existing SKUs because we were running on pinned clusters. In other words, it was not possible to upgrade these nodes to a higher and more powerful D15 Azure SKU (running 3x CPU cores, 2.5x memory, SSDs, etc). As noted in Learning #2 above, learning that that an in-place upgrade of the cluster with higher SKU was not possible was a big lesson for us. As a result, we had to stand up an entirely new cluster with the new nodes – and, since all our data was in-memory, this meant that we needed to perform a data migration from the existing cluster to the new cluster.
This type of data migration from one cluster to another cluster was not something we had ever practiced before, and it required us to invest in many data migration drills. As we ran these in production, we also learned yet another valuable lesson: Any data move from one source to another required efficient and intelligent data integrity checks that could be completed in a matter of seconds.
The second major change (as mentioned in the three insights above) was implementing persistence for our in-memory services. This allowed us to rebuild our state in just a matter of a few seconds. Our analyses showed increasing amounts of time for rebuilds that were causing significant availability losses due to the state transfer using a full copy from primary to the secondary replicas. We also had a great collaboration (and very promising results) with Azure Service Fabric in implementing persistence with Reliable Collections.
The next major change was moving away from our home-grown pub/sub architecture which was showing signs of end-of-life. We recognized that it was time to re-evaluate our assumptions about usage, data/traffic patterns, and designs so that we could assess whether the design was still valid and scalable for the changes we were seeing. We found that, in the meantime, Azure had evolved significantly and now offered a much better solution that fit beyond what we could create.
The changes noted above represented what was essentially a re-architecture of Intune services, and this was a major project to undertake. Ultimately, it would take a year to complete. But, fortunately, this news did not catch us off guard; we had very early warning signs from the capacity models and the orange line thresholds which we had set earlier. These early warning signs gave us sufficient time to take proactive steps for scaling up, out, and for the re-architecture.
The results of the re-architecture were extremely impressive. See below for Figures 2, 3, and 4 which summarize the results. Figure 2 shows that the P99 CPU usage dropped by more than 50%, Figure 3 shows that the P99 latency reduced by 65%, and Figure 4 shows that the rebuild performance for state transfer of 2.4M objects went from 10 minutes to 20 seconds.
Figure 2: P99 CPU Usage After Intune Services’ Re-Architecture
Figure 3: P99 Latency After Intune Services’ Re-Architecture
Figure 4: P99 Rebuild Times After Intune Services’ Re-Architecture
Learnings
Through this process, we learned 3 critical things that are applicable to any large-scale cloud service:
- Every data move that copies or moves data from one location to another must have data integrity checks to make sure that the copied data is consistent with the source data.
- This is a critical part of ensuring that there is no data loss and this has to be done before switching over and making the new data as active and/or authoritative.
- There are a variety of efficient/intelligent ways to achieve this without requiring an excessive amount of time or memory – but this is the topic of another blog. 😊
- It is a very bad idea to invent your own database (No-SQL or SQL, etc), unless you are already in the database business.
- Instead, leverage the infrastructures and solutions that have already been proven to work, and have been built by teams whose purpose is to build and maintain the databases.
- If you do attempt to do this yourself, you will inevitably encounter the same problems, waste precious time re-inventing the solutions, and then spend even more time maintaining your database instead of spending the time with the business logic.
- Finally, the experiences detailed above taught us that it’s far better to over-provision than over-optimize.
- In our case, because we set our orange lines thresholds low, it gave us sufficient time to react and re-architect. This mean, of course, that we were over provisioned, but it was a huge benefit to our customers.
Conclusion
After the rollout of our re-architecture, the capacity charts immediately showed a significant improvement. The reliability of our capacity models, as well as the ability to scale up and out, gave us enough confidence to increase the thresholds for orange and red lines to higher numbers. Today, most of our clusters are under the orange line, and we continue to constantly evaluate and examine the capacity planning models – and we also use them to load balance our clusters globally.
By doing these things we were ready and able to evolve our tools and optimize our resources. This, in turn, allowed us to scale better, improve SLAs, and increase the agility of our engineering teams. I’ll cover this in Part 3.
Source: EM+S Blog Feed