Windows IT Pro Blog

Windows resiliency: Best practices and the path forward

John_Cable
Microsoft
Jul 25, 2024

The broad, open nature and scale of the Windows computing ecosystem is part of what makes it a powerful and unmatched choice across the globe. The recent CrowdStrike incident underscores the need for mission-critical resiliency within every organization, and our unique ability to support the change required.

When a major incident arises, we focus on remediation, learning, and change, all while communicating transparently to our ecosystem. On Saturday, David Weston described our "first responder" approach. Since the start, we have engaged more than 5,000 support engineers, working 24x7, to help bring critical services back online. We are providing ongoing updates via the Windows release health dashboard, where we detail remediation steps, including a signed Microsoft Recovery Tool.

Editor's note 7.29.2024 - We recently released an analysis of CrowdStrike's outage report. To learn how security vendors and organizations can use the flexibility and integrated capabilities of Windows for increased security and reliability, see Windows Security best practices for integrating and managing security tools.

Our goal is to be your trusted partner as you leverage technology and the end-to-end Microsoft stack to deliver amazing value for your workforce, your customers, and your partners. That means, when an issue arises, we immediately engage with partners and customers to dig into the details, help, learn, and evolve.

This incident shows clearly that Windows must prioritize change and innovation in the area of end-to-end resilience. These improvements must go hand in hand with ongoing improvements in security and be in close cooperation with our many partners, who also care deeply about the security of the Windows ecosystem.

Examples of innovation include the recently announced VBS enclaves, which provide an isolated compute environment that does not require kernel mode drivers to be tamper resistant, and the Microsoft Azure Attestation service, which can help determine boot path security posture. These examples use modern Zero Trust approaches and show what can be done to encourage development practices that do not rely on kernel access. We will continue to develop these capabilities, harden our platform, and do even more to improve the resiliency of the Windows ecosystem, working openly and collaboratively with the broad security community.

There is always the chance that an outage will impact an organization. Over the last few days, we've been on thousands of calls with organizations around the world. We've observed that those who were able to remediate and recover the most quickly followed a similar set of practices. We want to share those best practices with you.

Best practices to support resiliency in your organization

  1. Have business continuity planning (BCP) and a major incident response plan (MIRP) in place. Include response and recovery best practices that outline the steps needed to get your environment back up and operating, including who to call and how to get support.
  2. Back up data securely and often. We recommend your organization utilize cloud storage and backup solutions, as these are great options for securely accessing, sharing, and collaborating on files from anywhere. Organizations utilizing cloud storage solutions have had better experiences getting back online, as this removed barriers to simply resetting the device.
  3. Ensure that you can restore your Windows devices quickly. A key component of resiliency in the event of an issue is to regularly create system restore points and use Windows built-in recovery options to restore devices. If you use Azure virtual machines, you can take a snapshot of your VMs. Organizations with recent restore points were able to recover more quickly from the recent CrowdStrike issue and we observed that virtualized/cloud environments were among the quickest to recover.
  4. Utilize deployment rings. Extend safe deployment practices into your environment by creating deployment rings to manage the rollout of updates and new features. Utilize your existing device management tools to manage deployment risk using the same approach Microsoft does. Alternatively, take advantage of automated deployment with Windows Autopatch. If you are using non-Microsoft products in your environment, including antivirus solutions, ensure that they offer ring-based deployment so you can control the pace and scale for your environment. As an example, Microsoft Defender allows for custom configuration of both engine and intelligence update staging.
  5. Use the latest Windows security defaults and enable Windows security baselines. Enable the security features that are available in Windows by default. Take advantage of Windows security baselines, which provide Microsoft-recommended, well-tested configurations based on feedback from Microsoft security engineering teams, product groups, partners, and organizations. Windows offers several built-in security features, from firewalls to encryption to biometrics. At the enterprise level, these extend to endpoint detection and response (EDR), data protection, vulnerability management, compliance monitoring, and more.
  6. Adopt a cloud-native approach to managing Windows devices. Doing so can make it easier to deploy updates and support recovery efforts in outage scenarios. Look at ways to move away from on-premises solutions to cloud management solutions, cloud identity solutions, and ring-based deployment and update management solutions like Windows Autopatch.
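
To make the deployment-ring practice in item 4 concrete, here is a minimal sketch in Python. It is not how Windows Autopatch, Microsoft Defender, or any other product implements rings. The ring names, the fleet shares, the 1% failure threshold, and the helper names `ring_for` and `may_advance` are all assumptions chosen for illustration. The sketch shows two ideas: assigning each device to a ring deterministically, and refusing to widen a rollout until the previous ring looks healthy.

```python
import hashlib
from dataclasses import dataclass

# Hypothetical ring layout: (name, cumulative share of the fleet).
# Roughly 1% of devices land in "pilot", the next 9% in "early", the rest in "broad".
RINGS = [("pilot", 0.01), ("early", 0.10), ("broad", 1.00)]


def ring_for(device_id: str) -> str:
    """Map a device to a ring deterministically from a hash of its ID."""
    digest = hashlib.sha256(device_id.encode("utf-8")).digest()
    # Interpret the first 8 bytes as a fraction in [0, 1).
    position = int.from_bytes(digest[:8], "big") / 2**64
    for name, cumulative_share in RINGS:
        if position < cumulative_share:
            return name
    return RINGS[-1][0]


@dataclass
class RingResult:
    """Outcome of rolling an update out to one ring (illustrative fields)."""
    ring: str
    devices_updated: int
    devices_failed: int


def may_advance(result: RingResult,
                max_failure_rate: float = 0.01,
                min_devices: int = 1) -> bool:
    """Allow the next ring only if this ring reported enough devices and stayed healthy."""
    if result.devices_updated < min_devices:
        return False  # not enough signal yet; hold the rollout
    failure_rate = result.devices_failed / result.devices_updated
    return failure_rate <= max_failure_rate
```

In practice the health signal would come from your device management tooling rather than hand-entered counts. The point of the gate is that a faulty update stops at the smallest ring instead of reaching the whole fleet at once.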

Our commitment to transparency

Our focus continues to be on helping our customers recover from this incident. We will practice transparency in sharing learnings, best practices, and, eventually, more detailed discussions that include changes designed to strengthen the broader ecosystem moving forward.

Updated Jul 29, 2024
Version 2.0
  • jdrch
    Brass Contributor

    Olli_Janatuinen if by "the Linux way" you mean bad or build-failed 3rd party kernel modules rendering the kernel unbootable, no thanks. And many mission-critical applications rely on such modules, such as Veeam.

  • jdrch
    Brass Contributor

    ynotn IIRC there's a regulatory angle to it too. I think the last time MSFT proposed locking down the kernel, 3rd party vendors complained Microsoft was unfairly giving its own tools privileged kernel access.

  • Olli_Janatuinen
    Copper Contributor

    ynotn Sure, this was CrowdStrike's fault and hopefully their customers will sue them.

    But this discussion is more about what Microsoft can do, and there jdrch's comment is critical. If any change is made to the current situation, Defender must be migrated to use it as well. Otherwise other vendors will not accept it. If this can be handled with VBS then fine, but I'm a bit skeptical about it, which is why I tried to propose some alternatives, but most likely there are other options too.

  • Olli_Janatuinen
    Copper Contributor

    Even if things like VBS are promising improvements, I think that security products will always need access to kernel mode.

    An alternative approach which I haven’t seen discussed yet would be the Linux way, meaning that kernel mode drivers would be forced to be published as open source, for example 5 years from now. That would allow having just one kernel mode driver which all security products would share.

  • FabrizioDegni
    Brass Contributor

    Microsoft needs to be part of the certified apps with kernel access mode.

    The certification must follow the whole lcm.

  • Olli_Janatuinen
    Copper Contributor

    FabrizioDegni exactly, currently anyone can get any garbage driver certified because Microsoft doesn't have access to the source code. I even tested that by publishing this: https://github.com/olljanat/BlueScreenOnce (a Windows certified driver which purposely causes a blue screen). However, I don't believe that all the experts on this are working for Microsoft, and anyway it is just wasting resources to build multiple drivers for the same purpose, as is the case now when every EDR vendor builds their own driver, which is why I'm proposing to open source this part. EDRs can then still compete with their user mode parts.

    jdrch I mean drivers which are needed by EDR products. Veeam does not directly belong to that category, but I see that their CBT driver is also using StartType=0 like EDR drivers do. That probably should be denied and replaced with a new start type which makes sure that the driver normally gets started in safe mode too (which I believe is the reason to use StartType=0) but which allows Windows to skip the driver in case it prevents the system from starting. Then only real boot-critical drivers (disk, etc.) would use StartType=0.

  • ynotn
    Copper Contributor

    I say boot all third party EDR vendors. And keep them at VBS Enclave level. Force them to use VBS vs the raw Kernel. Apple is able to do this, why not Microsoft?

  • ynotn
    Copper Contributor

    I don't agree with Olli_Janatuinen, and neither do all the victims of the global meltdown caused by a very basic oversight of a third party EDR, CrowdStrike. They should and need to be punished and made an example of and be walled off COMPLETELY from Kernel Level. The CEO is an ACCOUNTANT for crying out loud. He has no formal training in matters of software development and computer science. How can someone like that vet and make decisions so crucial and delicate as protocols for Kernel Level Delivery? It's absurd. THAT'S THE REAL PROBLEM HERE. Microsoft needs to be stringent here. Microsoft is getting blamed for a fault that lies entirely in CrowdStrike's horrible, mind-numbingly negligent push of updated data to a Kernel driver. Microsoft CANNOT let this stand, or this will continue to happen over and over and over again.

    Lives can be lost, literally. Surgical procedures were halted. Schools were closed. 911 calls were not able to be made. Planes can fall from the sky, cars can end up crashing..... many things can go wrong... systems are too interlaced to take this lightly. Very real life-threatening dangers exist here. This is NOT a light matter. The CrowdStrike CEO must be made an example of. CrowdStrike MUST be held accountable here, as an example of how serious the consequences are. This is a great YouTube video from a retired Microsoft developer on the matter: https://youtu.be/ZHrayP-Y71Q?si=hsyTMSJFNF6_Pw8U

    Also: "This is the 2nd time CrowdStrike CEO George Kurtz has been at the center of a global tech failure"
    https://www.bundle.app/en/finance/this-is-the-2nd-time-crowdstrike-ceo-george-kurtz-has-been-at-the-center-of-a-global-tech-failure-45DA5B0F-8A96-42E9-8589-CE69C7FDBB1E

    George Kurtz background and 'qualifications': https://www.crowdstrike.com/about-crowdstrike/executive-team/george-kurtz/ You'll note he is an Accountant, and his CTO? An Industrial Engineer, not a Software or Computer Scientist, or Software Engineer, but an INDUSTRIAL Engineer.

    https://www.crowdstrike.com/about-crowdstrike/executive-team/elia-zaitsev/

    You see a pattern here yet? Because I do.

  • ynotn
    Copper Contributor

    jdrch Yes, I'm sure the EU will tone it down. Besides, perhaps Microsoft will take the initiative to, at least here in the States, put their stake in the ground. CrowdStrike has been found deficient. They have been weighed and found deficient. Let them pay for their own gross incompetence and negligence. It shouldn't be Microsoft.