On July 19, 2024, organizations around the world began to experience the “blue screen of death” in what would soon be considered one of the largest IT outages in history. Early rumors of a mass cyberattack were quickly quashed: it seemed a minor software update was to blame for countless shopping excursions cut short, airline flights grounded and critical surgeries postponed.
Nearly three weeks later, the world is still reeling from the faulty CrowdStrike update, and new details are emerging about what went wrong. On August 6, the company published an in-depth technical root cause analysis and acknowledged shortcomings in its software testing processes.
As the dust settles and we continue to learn more, here are three observations and lessons in digital resilience that every organization can take away from the incident:
1. Prepare for Your Worst Day
The global CrowdStrike outages highlighted the risks of vendor lock-in (over-reliance on any one vendor) and even led some organizations to question their cloud strategies altogether. This scrutiny is important, but it needs to be balanced with practicality.
Virtually every organization today relies on cloud services for some aspect of its business. Keeping “crown jewels” on-prem and distributing workloads across different providers can sometimes limit the blast radius of a failure, but these measures also add complexity and cost. All of these factors must be weighed carefully when building an antifragile organization with a solid IT infrastructure.
As security leaders, we must prepare our organizations to function with limited digital capacity in the event of an outage or service degradation. Anything can happen: in May 2024, an isolated, entirely accidental misconfiguration led Google Cloud to delete a customer account, causing two weeks of downtime for 647,000 users.
As the saying goes, plans are nothing; planning is everything. Evaluate existing disaster recovery and business continuity plans with fresh eyes. Run and stress-test playbooks regularly for a wide range of scenarios. Go through the entire exercise of bringing backups online to see what’s working and what isn’t. Then, do it all over again and again.
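To make that exercise concrete, here is a minimal sketch, in Python, of what an automated restore drill might look like. The "restore-tool" command, staging hostname and health-check URL are illustrative assumptions rather than any specific product's interface; the point is simply to restore a backup into an isolated environment, verify the service answers and record how long recovery actually took.

```python
# Hypothetical restore-drill sketch. The "restore-tool" command, hostnames and
# health-check URL are illustrative assumptions, not a specific product's API.
import datetime
import subprocess


def run_command(cmd: list[str]) -> bool:
    """Run a command and return True on success; a missing tool counts as a drill failure."""
    try:
        return subprocess.run(cmd, capture_output=True, text=True).returncode == 0
    except FileNotFoundError:
        return False


def run_restore_drill(backup_id: str, staging_host: str) -> dict:
    """Bring a backup online in an isolated staging environment and record what worked."""
    started = datetime.datetime.now(datetime.timezone.utc)
    results = {"backup_id": backup_id, "started": started.isoformat()}

    # 1. Restore the backup onto a staging host (placeholder command).
    results["restore_ok"] = run_command(
        ["restore-tool", "--backup", backup_id, "--target", staging_host]
    )

    # 2. Verify the restored service actually answers (placeholder health check).
    results["service_healthy"] = run_command(
        ["curl", "--fail", "--silent", f"https://{staging_host}/healthz"]
    )

    # 3. Record how long recovery took so RTO assumptions can be checked against reality.
    elapsed = datetime.datetime.now(datetime.timezone.utc) - started
    results["recovery_seconds"] = elapsed.total_seconds()
    return results


if __name__ == "__main__":
    print(run_restore_drill("nightly-2024-08-01", "dr-staging.example.internal"))
```

Scheduling a drill like this on a regular cadence, and comparing the recorded recovery times against your stated recovery time objectives, is one way to turn the playbook from a document into a habit.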
2. Ask Hard Questions
The CrowdStrike incident has prompted many organizations to examine their third-party dependencies to better understand how vendor outages could impact their operations. Now is also a good time to review critical vendor vetting processes for both existing and future partners. For instance, until last month, you may not have considered the importance of phased software updates. Does the vendor give you the option to roll out updates gradually, first validating a patch on a test server, then deploying it to a small pilot group of users, and to stop mid-way if an issue appears before it impacts the entire organization? How robust are the vendor’s secure development lifecycle and quality assurance processes? How do they test and validate their updates before sending them out into the world? What security certifications do they have to back up these claims?
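To make the phased-update question concrete, here is a minimal sketch, in Python, of what a ring-based rollout could look like. The ring names, host lists, soak time and health check are hypothetical stand-ins, not any vendor's actual update mechanism; the point is that each ring must prove healthy before the blast radius widens, and the rollout halts mid-way the moment it does not.

```python
# Hypothetical phased-rollout sketch. Ring names, host lists, soak time and the
# health check are illustrative assumptions, not any vendor's actual mechanism.
import time

# Deployment "rings": a test server first, then a small pilot group, then everyone else.
ROLLOUT_RINGS = [
    {"name": "test-server", "hosts": ["test-01"]},
    {"name": "pilot-group", "hosts": ["ws-101", "ws-102", "ws-103"]},
    {"name": "production", "hosts": ["ws-201", "ws-202", "ws-203", "ws-204"]},
]


def deploy_update(update_id: str, hosts: list[str]) -> None:
    """Placeholder for pushing the update to a set of hosts."""
    print(f"Deploying {update_id} to {len(hosts)} host(s): {hosts}")


def ring_is_healthy(hosts: list[str]) -> bool:
    """Placeholder health check; in practice this would query monitoring or EDR telemetry."""
    return True


def phased_rollout(update_id: str, soak_seconds: int = 5) -> bool:
    """Roll an update through each ring, halting immediately if any ring degrades."""
    for ring in ROLLOUT_RINGS:
        deploy_update(update_id, ring["hosts"])
        time.sleep(soak_seconds)  # soak period before widening the blast radius
        if not ring_is_healthy(ring["hosts"]):
            print(f"Halting rollout of {update_id}: ring '{ring['name']}' is unhealthy")
            return False
    print(f"Rollout of {update_id} completed across all rings")
    return True


if __name__ == "__main__":
    phased_rollout("content-update-2024-07-19")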
Asking the right questions is critical to building trust. The more customers know, the better they can prepare for the unknown. On the vendor side, clearly defined customer expectations can drive process and quality improvements and, ultimately, ensure more resilient systems.
3. Communicate Openly
Consistent, transparent communication is critical during a crisis. On July 19, in the early hours of the incident, CrowdStrike’s communications were tightly coordinated and centralized—no speculation or mixed messages to be found on social media. Company leadership quickly took responsibility, apologized and kept customers in the loop as they worked to remediate the problem. Despite widespread issues, people respected this transparent approach, and many customer organizations have publicly voiced their continued loyalty to CrowdStrike.
Organizations can apply these valuable crisis communications lessons to their own DR/BCP contingency planning efforts. How will security leaders keep business stakeholders apprised of an unfolding situation? Are the right communication channels in place to quickly mobilize internal teams and get systems back online? What are the best ways to keep customers and partners in the know? What is the corporate social media policy, and who is authorized to speak with members of the press during an incident?
There Will Be Another Black Swan
The CrowdStrike incident surfaced critical questions around software testing and update quality assurance that must be addressed. It also reinforces the inherent dangers of a technological world that, in the words of Thomas Friedman, “we’ve taken from connected to interconnected to interdependent.” This interdependency means that every organization will experience a black swan event at some point. It may come in the form of a critical vendor outage, a ransomware attack or something else. By embracing an “assume breach” mindset and continuously stress-testing contingency plans and processes, your team will be better prepared—mentally and operationally—to face a crisis, respond rapidly and emerge even stronger.
Omer Grossman is the global chief information officer at CyberArk. You can check out more content from Omer on CyberArk’s Security Matters | CIO Connections page.
Editor’s Note: For more insights from CyberArk CIO Omer Grossman on this topic and beyond, check out his appearance on CyberArk’s Trust Issues podcast episode, “Trust and Resilience in the Wake of CrowdStrike’s Black Swan.” The episode is available on most major podcast platforms.