How we upgrade 300k ConfigMgr clients in 10 days
Published Jun 14 2019 10:33 AM
Microsoft

Client Health

It all starts with ConfigMgr client health.  If clients aren't healthy, getting them to upgrade is hard.  Getting and keeping clients healthy is a daily activity that we have people focused on, and that gives us a good foundation to start from.


Early testing

When we roll out a new build, the first thing we do is send it to our test hierarchy, which we refer to as "PPE".  We upgrade the hierarchy and then run automated tests against that environment covering various areas, including ConfigMgr client health.  We test two scenarios: an existing client interacting with the upgraded server side (site/Management Point/Distribution Point, etc.), and then a client upgraded to the new client bits running through basic functionality tests.  Those automated tests take about 12 minutes to run.  We refer to them as our Quality Control, or "QC", tests.
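
To make the shape of those QC runs concrete, here is a minimal sketch of what an automated pass over both scenarios could look like. This is hypothetical Python, not our actual harness; the individual checks are placeholders for real tests against the PPE hierarchy.

```python
"""Minimal sketch of a QC pass (hypothetical; not the actual harness)."""
import time

def check_client_health(scenario):
    # Placeholder: a real check would query a PPE test client
    # (e.g. agent service state, last policy request time).
    return True

def check_policy_retrieval(scenario):
    # Placeholder: request machine policy and confirm it is applied.
    return True

CHECKS = [check_client_health, check_policy_retrieval]
SCENARIOS = [
    "existing client against upgraded site/MP/DP",
    "client upgraded to the new bits",
]

def run_qc():
    start = time.time()
    # Run every check in both scenarios and collect pass/fail results.
    results = {
        (scenario, check.__name__): check(scenario)
        for scenario in SCENARIOS
        for check in CHECKS
    }
    failed = [key for key, ok in results.items() if not ok]
    print(f"{len(results) - len(failed)}/{len(results)} checks passed "
          f"in {time.time() - start:.1f}s")
    return not failed

if __name__ == "__main__":
    raise SystemExit(0 if run_qc() else 1)
```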


Production testing

Once we upgrade our production (PROD) hierarchy, we use the pre-production client functionality of the product to target the new ConfigMgr client to a small subset of actual production machines.  These are actual Microsoft employee machines as well as a few test machines.  For those test machines we again run our QC tests, taking about another 12 minutes.  Combined with all our other QC tests (run in PPE and PROD), we reach a point of being ready for full client deployment in about 5 hours.  Due to all the other things we have going on, combined with a little bit of caution and risk management, we usually let things sit for a day after the upgrade.
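
As a rough illustration of the kind of go/no-go check that feeds that readiness decision, here is a hypothetical Python gate over the pre-production pilot. The machine names, version strings, thresholds, and data source are all made up for the example.

```python
"""Hypothetical promotion gate for the pre-production client pilot
(illustration only; data, version, and thresholds are assumptions)."""

PILOT = [
    {"name": "TEST-01",    "version": "5.00.8790.1007", "healthy": True},
    {"name": "TEST-02",    "version": "5.00.8790.1007", "healthy": True},
    {"name": "EMP-LT-113", "version": "5.00.8740.1012", "healthy": True},  # not upgraded yet
]

TARGET_VERSION = "5.00.8790.1007"   # assumed new client version
MIN_UPGRADED = 0.90                 # assumed thresholds
MIN_HEALTHY = 0.95

def ready_to_promote(pilot):
    # Share of pilot machines on the new client, and share reporting healthy.
    upgraded = sum(m["version"] == TARGET_VERSION for m in pilot) / len(pilot)
    healthy = sum(m["healthy"] for m in pilot) / len(pilot)
    print(f"upgraded: {upgraded:.0%}, healthy: {healthy:.0%}")
    return upgraded >= MIN_UPGRADED and healthy >= MIN_HEALTHY

print("promote" if ready_to_promote(PILOT) else "hold")
```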


6-day push (kind of)

The day after the upgrade we check our telemetry, such as client status in the ConfigMgr admin console, and if it all looks good we promote the new client to be our production client.  Prior to the hierarchy upgrade we had disabled automatic client upgrades, so it is at this point that we enable them again.  We have the upgrade window set to 6 days, which we find reasonable for our ~300,000 clients.  We see a surge of traffic against our Management Point and, lately, our Cloud Management Gateway (CMG), but we are built to expect this and are prepared to scale up servers if it becomes necessary.  There is a small trick here, however.  We used to set the window to 10 days but didn't reach the numbers we wanted within those 10 days.  By setting it to 6 days we force clients to schedule their upgrades earlier, so machines that are off for the weekend or traveling and offline still have time to catch up, and we end up with fewer "trailing machines" after 10 days (there's a rough simulation of this after the chart below).  Here is an example of what the outcome ends up being for us:


[Image: ClientUpgrades.jpg]

This leaves us with a "long tail" to chase.  The last 10-20% of machines often belong to people on vacation or are completely gone but still in our system (being a software development company means we have a lot of test machines in our environment).
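
To see why the 6-day window beats the old 10-day setting for us, here is a rough, back-of-the-envelope simulation. The offline probability and the slip model are assumptions for illustration, not our telemetry: each client picks a random time inside the configured window and slips a day whenever the machine happens to be off.

```python
"""Back-of-the-envelope comparison of a 6-day vs 10-day upgrade window
(illustrative only; the offline model and probability are assumptions)."""
import random

random.seed(1)
CLIENTS = 300_000
P_OFFLINE_PER_DAY = 0.25   # assumed chance a machine is off on any given day

def completed_by(window_days, horizon_days=10):
    done = 0
    for _ in range(CLIENTS):
        # Client schedules its upgrade at a random point in the window.
        day = int(random.uniform(0, window_days))
        # Slip forward while the machine happens to be offline that day.
        while day < horizon_days and random.random() < P_OFFLINE_PER_DAY:
            day += 1
        if day < horizon_days:
            done += 1
    return done / CLIENTS

for window in (6, 10):
    print(f"{window}-day window: {completed_by(window):.1%} upgraded within 10 days")
```

With these made-up numbers the 6-day window leaves noticeably fewer trailing machines at the 10-day mark, which matches the behavior described above.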

6 Comments
Brass Contributor

Thanks for sharing this! 
I do have a question. When you mention "Getting and keeping clients healthy is a daily activity we have people focused on and that provides us with a good foundation to start with.", how many people do you actually have focused on that task on a daily basis? 
We have a much smaller footprint than 300k clients, but with 85+ locations globally, 24/7 production plants, etc., it's something we struggle with regularly (devices being offline, users being on vacation, devices hanging, etc.), and it impacts our 100% compliance target. We usually get to 98%, but that last 2% is a real pain to chase, regardless of the SCCM client health scripts and remediation jobs we have running.
So I was just wondering how much effort is spent on getting to 100% in other environments. 

Microsoft

Hi Andy,

We have essentially two headcount focused daily on client health related work.  They spend much of their time gathering data to find trends and categorize things into "buckets" that we then try to address.  In our case that often means investigating and filing bugs back to Windows or ConfigMgr for unreleased OS versions and such.  The rest of the time is spent writing scripts to proactively target and remediate other "buckets" of issues, done in a way that avoids negative side effects.
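
A tiny, made-up illustration of that bucketing idea (the failure reasons and data are invented, not our actual tooling): group client health failures by a reported reason so the biggest buckets can be prioritized and scripted remediations targeted at them.

```python
"""Hypothetical bucketing of client health failures (illustration only)."""
from collections import Counter

# e.g. rows exported from client health reporting
failures = [
    {"machine": "A", "reason": "WMI repository corrupt"},
    {"machine": "B", "reason": "client service not running"},
    {"machine": "C", "reason": "WMI repository corrupt"},
    {"machine": "D", "reason": "no heartbeat in 14 days"},
]

# Count machines per failure reason, largest buckets first.
buckets = Counter(row["reason"] for row in failures)
for reason, count in buckets.most_common():
    print(f"{count:>5}  {reason}")
```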

Brass Contributor

Hi Mike,


Thanks for the quick reply and the useful information. It's nice to know how ConfigMgr client health is handled internally at Microsoft; those things always provide some valuable insights. 

Microsoft
This post didn't go into depth about how we do client health specifically, but it sounds like you have some interest/curiosity, so I put it on our list for future blog postings. Much of what we do is similar to what the rest of the ConfigMgr community does, but we can share and perhaps learn from everyone as well.
Brass Contributor

Hi Mike,


That would be great, getting a view of how Microsoft does ConfigMgr Client Health internally :) Looking forward to that one! 
One of our main struggles is convincing senior management that Client Health isn't just a ConfigMgr thing; it's a shared accountability, as a lot of other infrastructure components need to be healthy for ConfigMgr to be healthy or "successful at doing its task". 


Microsoft

Hi Andy.


You are exactly correct about the shared accountability.  I was recently in a conversation in our org where folks were talking about how another process could "fix" ConfigMgr client health better than ConfigMgr.  I had to point out that the same process couldn't "fix" itself, but ConfigMgr could.  There is a lot of confusion in the client health space about what it takes to "fix the fixer", since more powerful software has more dependencies to keep it functional.


On a related note, if you have a Microsoft Premier support agreement, I suggest you check out the PFE client health offering (https://blogs.technet.microsoft.com/michaelgriswold/2011/12/16/find-and-fix-those-unhealthy-clients/).
