On July 19th, CrowdStrike (a cybersecurity platform provider) released an update that caused crashes on 8.5 million Microsoft Windows PCs and servers. Instead of starting up, computers displayed “blue screens of death”; the resulting disruption closed ports, prevented consumers from using ATMs, and delayed medical procedures.1
While CrowdStrike released a fix the same day, enterprises had to undertake an arduous process of manually rebooting tens of thousands of servers and PCs in safe mode and deleting files associated with the faulty update before installing a fix.
This was not a cyberattack, nor was it a unique kind of disruption. Several recent, widespread technology failures have created havoc across entire value chains: this year’s ransomware attacks that prevented auto dealers from doing business and healthcare providers from receiving payments, and the unstable, aged systems that stranded holiday travelers a couple of years ago.
Since the outage started, we’ve had dozens of discussions with business and technology executives wrestling with its impact. Technology teams have already mobilized to fix the problems their companies face, but senior business leaders also have an important role to play in providing resources, support, and guidance. Here we present the questions they should ask to mitigate the impact of this event and to reduce the risk of the next one.
Our understanding of the issue
Nearly three-quarters of the world’s computers run the Microsoft Windows operating system, including both corporate servers that run applications and the laptops or PCs that employees use.2
CrowdStrike’s Falcon sensor is an endpoint detection and response (EDR) product. It installs an agent on PCs and servers to identify and contain malware and other types of cyberattacks. In response to evolving threats, CrowdStrike pushes configuration updates to the agent, sometimes multiple times per day; the one it released on July 19th was faulty. Because the Falcon agent runs at a low level and loads early in the Windows start-up process, remediation could not use automated software distribution tools and required manual intervention.
Here is what happened:
- On Friday, July 19th, at 4:09 UTC, one of CrowdStrike’s channel file updates contained a logic error that, when triggered, caused Windows to crash.
- The channel file in question (Channel File 291) is used to provide the logic for evaluating and protecting against the abuse of named pipes (named pipes are mechanisms used by Windows for interprocess or intersystem communication).
- The update in the channel file was designed to detect and protect against newly observed, malicious named pipes used by common C2 (command and control) frameworks in cyberattacks.3 (A conceptual sketch of this kind of content-driven detection appears below.)
The logic error affected Microsoft Windows systems that downloaded the faulty channel file between its release at 4:09 UTC and 5:27 UTC, when CrowdStrike rolled the file back to a previous stable version; systems that came online after that point received the corrected file and were not affected.
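To make that mechanism concrete, here is a minimal, purely illustrative Python sketch of content-driven detection in general: an agent loads externally supplied rules (here, regular expressions describing suspicious named-pipe names) and evaluates observed pipe names against them. This is not CrowdStrike’s implementation; the JSON format, field names, and file name are assumptions for illustration only. The broader point is that a content file is effectively input to highly privileged code, so it has to be validated as defensively as the code itself.

```python
import json
import re
from pathlib import Path

def load_channel_rules(path: Path) -> list[re.Pattern]:
    """Load detection content (a list of regex strings) shipped separately from the agent code.

    A real EDR agent consumes a proprietary binary format; JSON is used here
    purely for illustration.
    """
    rules = []
    for entry in json.loads(path.read_text()):
        try:
            rules.append(re.compile(entry["pipe_name_pattern"]))
        except (KeyError, TypeError, re.error) as exc:
            # Defensive validation: a malformed entry is skipped and reported
            # rather than allowed to crash the privileged consumer.
            print(f"skipping malformed rule {entry!r}: {exc}")
    return rules

def is_suspicious_pipe(pipe_name: str, rules: list[re.Pattern]) -> bool:
    """Flag a named pipe whose name matches any known-malicious pattern."""
    return any(rule.search(pipe_name) for rule in rules)

if __name__ == "__main__":
    # Write a tiny example content file; the file name and format are hypothetical.
    content = Path("example_channel_file.json")
    content.write_text(json.dumps([
        {"pipe_name_pattern": r"\\\\\.\\pipe\\evil_c2_.*"},
        {"bad_field": "no pattern here"},  # malformed entry: skipped, not fatal
    ]))
    rules = load_channel_rules(content)
    print(is_suspicious_pipe(r"\\.\pipe\evil_c2_beacon", rules))  # True
```

Because the real sensor runs in kernel mode, an unhandled error while processing such content crashes the operating system itself rather than a single application, which is why the faulty file produced blue screens instead of a quietly failing process.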
Given the privileged position CrowdStrike’s agent holds within the Windows kernel, remediation required manual activity on each impacted endpoint:
- For laptops/PCs: remediation involved repeatedly rebooting the Microsoft Windows host in the hope that it would pick up the corrected content automatically; if that did not work, the next steps required booting the computer in safe mode and deleting the offending files (a minimal sketch of that deletion step appears after this list). Remediation was more complicated for companies that had chosen to encrypt end-user hard drives for security reasons.
- For cloud hosts: remediation involved either a “rollback” to a snapshot taken before 4:09 UTC or detaching the system disk volume, fixing the issue manually, and reattaching the volume.
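Once a machine is in safe mode (or its system volume is attached to a healthy recovery host), the clean-up itself is a single file-deletion step that many organizations scripted. Below is a minimal Python sketch of that step, assuming the widely published workaround of removing the channel file 291 variants from the CrowdStrike driver directory; confirm the exact path and file pattern against the vendor’s current guidance, and run with administrative rights.

```python
import os
from pathlib import Path

# Default location of CrowdStrike channel files on Windows; adjust the drive
# letter if the system volume is mounted on a recovery host instead.
DRIVER_DIR = Path(os.environ.get("SystemRoot", r"C:\Windows")) / "System32" / "drivers" / "CrowdStrike"

def remove_faulty_channel_file(driver_dir: Path = DRIVER_DIR) -> int:
    """Delete channel file 291 variants (the faulty content) and return the count removed."""
    removed = 0
    for candidate in driver_dir.glob("C-00000291*.sys"):
        candidate.unlink()
        removed += 1
    return removed

if __name__ == "__main__":
    count = remove_faulty_channel_file()
    print(f"Removed {count} file(s); reboot normally so the agent can pull the corrected content.")
```

For cloud hosts taking the detach-and-fix route, the same deletion is performed against the volume mounted on the recovery instance (pointing driver_dir at the mounted drive letter) before the repaired volume is reattached and the original host restarted.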
The nature of this outage illustrates the trade-off IT organizations must manage between updating their environments quickly to protect against cyberattacks and controlling the pace of change to avoid introducing instability.
Immediately: How to accelerate and sustain recovery
Technology organizations at affected entities launched recovery efforts the same day as the outage. They established war rooms, communicated to stakeholders, and developed technical remediation plans to restore operations.
Still, there are questions senior executives should ask to ensure that recovery efforts are as fast and as sustainable as possible:
What does our team need to sustain pace through the remediation?
This is a tough and stressful time for IT teams that have been working nonstop since the outage. How long they will need to keep up the pace will depend on the complexity of their technology environment and the number of computers affected.
Senior leaders can ask their recovery teams what they need to see the effort through to the end—it might be more resources to remediate systems, or it might be as simple as a visit from members of the executive team to the war room to demonstrate how much the company values their efforts.
Can IT enlist end users to help in remediating PCs and laptops?
In some cases, IT staff will want to fix the problem themselves. That hands-on approach is necessary for servers, but less so for PCs. With clear instructions, end users can boot their computers in safe mode, delete the problematic files, and reboot, saving IT support personnel from having to touch thousands of machines themselves.
Are we being sufficiently transparent and responsive to our employees and customers?
This outage has massively affected employees and customers. Past outages indicate that taking the time to acknowledge the impact and communicate in direct terms about what you know (and what you do not) matters a lot. After a large ransomware attack, one company’s CEO called major customers to offer apologies and explain the incident. Even years later, customers still recognize and appreciate this.
Sometimes, transparency and empathy are not enough. Many customers of impacted companies experienced not just inconvenience but also economic loss, and there may be challenging decisions ahead about what type of compensation to consider.
In the coming days: How to reduce the risk of future events
Events like this will happen again. Providers will suffer outages and other problems that will disrupt companies’ ability to conduct business. To manage these risks, senior executives should ask questions that can help their companies prepare for such events and reduce their impact:
Do we have economic, operational, and technical transparency into our risks?
What would be the economic impact of seeing a factory, a process, or a site unable to operate for a few—or many—days? Many companies don’t know. Which applications supporting critical business processes run on resilient technology platforms, and which are mired in technical debt, creating risk? Many companies have a sense of this but don’t have systematic and reliable data. What technology vendors could put a company out of business for a few days if they had a problem? How many companies were monitoring their EDR platform as a top-level technology risk before last week? Senior executives can and must push for quantification and prioritization of different types of risks.
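One simple way to start that quantification is the classic annualized-loss calculation used in operational risk management: estimate what a single outage of each critical dependency would cost and how often such an outage is plausible, then rank the results. The figures in the sketch below are invented placeholders intended only to show the mechanics.

```python
# Illustrative annualized loss expectancy (ALE) ranking: ALE = events per year x cost per event.
# All figures are invented placeholders, not benchmarks.
risks = {
    "EDR/endpoint platform outage": {"events_per_year": 0.5, "cost_per_event": 4_000_000},
    "ERP unavailable for 2 days":   {"events_per_year": 0.2, "cost_per_event": 9_000_000},
    "Key logistics vendor outage":  {"events_per_year": 1.0, "cost_per_event": 1_500_000},
}

ranked = sorted(
    ((name, r["events_per_year"] * r["cost_per_event"]) for name, r in risks.items()),
    key=lambda item: item[1],
    reverse=True,
)

for name, ale in ranked:
    print(f"{name:35s} expected annual loss = ${ale:,.0f}")
```

Even a rough ranking like this gives executives a defensible basis for deciding which resiliency investments to fund first.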
What architectural changes should we make to enhance resiliency—and how much will it cost?
CIOs and CTOs often struggle with the business’s enthusiasm for investing in new features rather than in reducing technical debt and improving resiliency. The business case for resiliency looks limited right up until an outage costs millions of dollars in lost revenue. In this instance, “re-paveable,” cloud-based systems that can be reinitiated with one touch could have accelerated recovery, and geo-resilient application architectures that can fail over between regions can help ensure availability. Senior executives should ask technology teams: What have we not invested in, and what should we invest in now? In some cases, companies may need to increase their technology spending materially in order to achieve the resiliency they need.
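As one concrete illustration of what “re-paveable” can mean in practice, the sketch below uses the AWS boto3 SDK to replace a broken virtual machine rather than repair it in place: terminate the instance and relaunch it from a known-good image. The image ID, instance type, and region are placeholders, and comparable patterns exist on other clouds; in practice this logic usually lives in infrastructure-as-code pipelines rather than in ad hoc scripts.

```python
import boto3

# Placeholder values for illustration; in practice these come from
# infrastructure-as-code definitions, not hard-coded constants.
REGION = "us-east-1"
GOLDEN_IMAGE_ID = "ami-0123456789abcdef0"   # known-good, pre-hardened image
INSTANCE_TYPE = "m5.large"

def repave_instance(broken_instance_id: str) -> str:
    """Replace a broken instance with a fresh one built from the golden image."""
    ec2 = boto3.client("ec2", region_name=REGION)

    # Tear down the unhealthy host instead of trying to repair it in place.
    ec2.terminate_instances(InstanceIds=[broken_instance_id])
    ec2.get_waiter("instance_terminated").wait(InstanceIds=[broken_instance_id])

    # Recreate it from the known-good image; configuration is reapplied at boot
    # (for example, via user data or a configuration-management agent).
    response = ec2.run_instances(
        ImageId=GOLDEN_IMAGE_ID,
        InstanceType=INSTANCE_TYPE,
        MinCount=1,
        MaxCount=1,
    )
    return response["Instances"][0]["InstanceId"]
```

The design choice is the key point: when hosts are disposable and rebuilt from trusted images, recovery time is bounded by how fast new capacity can be created, not by how fast people can repair broken machines one by one.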
Do we need to introduce more staging and testing into change processes?
Almost all resiliency problems stem from change. Somebody somewhere changed a configuration or updated a piece of software that disrupted the intricate technology ecosystem that lets companies run their business.
Deploying a new update first to just 1 or 5 percent of nodes, rather than to the entire fleet at once, can nonetheless dramatically reduce the disruption caused by a flawed release. This phased model requires more resources but may be worth the investment, given the reduction in risk.
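A minimal sketch of that phased model follows. The deployment and health-check functions are hypothetical stand-ins for whatever tooling an organization actually uses; the point is the control flow: release to a small ring first, let it soak, verify health, and halt automatically before the blast radius grows.

```python
import random
import time

# Hypothetical stand-ins for real deployment and monitoring tooling.
def deploy_update(node: str) -> None:
    print(f"deploying update to {node}")

def node_is_healthy(node: str) -> bool:
    return random.random() > 0.01  # simulate a ~1% failure rate for the demo

def phased_rollout(nodes: list[str], ring_fractions=(0.01, 0.05, 0.25, 1.0),
                   soak_seconds: int = 0) -> bool:
    """Deploy in expanding rings; stop at the first ring that fails health checks."""
    deployed: set[str] = set()
    for fraction in ring_fractions:
        ring = [n for n in nodes[: int(len(nodes) * fraction) or 1] if n not in deployed]
        for node in ring:
            deploy_update(node)
            deployed.add(node)
        time.sleep(soak_seconds)  # let the change "soak" before widening the ring
        if not all(node_is_healthy(n) for n in deployed):
            print(f"halting rollout: failures detected after {len(deployed)} nodes")
            return False
    return True

if __name__ == "__main__":
    fleet = [f"node-{i:04d}" for i in range(1000)]
    phased_rollout(fleet, soak_seconds=0)
```

In a faulty release, this control flow limits the damage to the first ring of nodes instead of the whole fleet, which is exactly the trade of speed for safety described above.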
Is our disaster recovery/business continuity (DR/BC) planning and testing sufficiently extensive and robust?
Every company plans for DR/BC. Many companies, however, do so in an incomplete and perfunctory fashion. Senior executives can ask the following questions:
- Do our DR/BC plans test and stress a wide range of scenarios, prioritized by business impact?
- What would it take to do more live tests, in which the technology team brings up applications in the DR environment, rather than conducting a paper-based exercise?
- Does it make sense to conduct a senior-level crisis simulation that prepares the executive team for the tough decisions it may need to make with limited information in the event of a major outage?
Our entire economy runs on complicated, sometimes fragile, technology platforms—and companies have a responsibility to shareholders and customers to provide “all day, every day” support for business processes. Senior executives can get the insight they need to support and push IT managers on this imperative by asking pointed questions about speed of response and about preventing or limiting the impact of the next event.