A routine software update performed by the cybersecurity firm CrowdStrike caused one of the largest IT outages of all time as it affected numerous critical infrastructures and organizations across the globe.
CrowdStrike’s Falcon Sensor, a widely used security product that runs on the Microsoft Windows operating system, shipped with a coding defect that crashed millions of computers and triggered the infamous “blue screen of death”, a fatal system error that forces a reboot.
The company revealed that the root problem causing the system’s crash was a bug in the program that should have caught the error before the software was updated. Without adequate oversight, the update was rolled out across customers’ computers and resulted in the catastrophic failure of major IT systems powering airports, banks, hospitals, and corporations.
Airline operations were among the most affected. Hundreds of flights were delayed or canceled globally, disrupting passengers’ itineraries and cargo delivery times. Airlines that reported operational issues included United Airlines, Delta Air Lines, KLM, and Frontier.
TV and radio stations, including Sky News, were also impacted, and the London Stock Exchange (LSE) reportedly experienced issues with its website and online systems.
In addition, the New York City Subway, 911 services in the state of Alaska, and multiple clinics and hospitals across the globe reported disruptions to their operations.
CrowdStrike estimated that around 8.5 million computers were affected by the faulty software update. The incident highlights the widespread adoption of the company’s solutions but also the weaknesses in its internal processes for analyzing and vetting these patches before they are rolled out.
CrowdStrike Assisted Customers But Fixing the Issue Was a True Headache
CrowdStrike immediately responded with a solution to the issue once it was identified. The company’s CEO George Kurtz went on social media to issue a clear statement: “This is not a security incident or cyberattack. The issue has been identified, isolated and a fix has been deployed.”
He also apologized to the affected parties for all of the inconveniences caused: “We’re deeply sorry for the impact that we’ve caused to customers, to travelers, to anyone affected by this, including our company.”
However, implementing a solution was challenging. First, the fix would need to be implemented manually. This resulted in a time-consuming task for IT teams across the globe that were already handling a spike in support tickets from within their organizations.
Moreover, some of the affected computers were stuck in a reboot loop that made it difficult for IT professionals to apply the required fixes. Thousands of servers, workstations, and other similar hardware had to be restored individually.
The Outage and Its Implications May Be Scrutinized by Congress
Microsoft (MSFT), whose Windows operating system was at the center of the outage, also got involved in these efforts. The company stated that it was taking “mitigation actions” and was actively investigating issues with cloud services in the US.
CrowdStrike took immediate steps to address the issue, including quickly identifying and isolating the problem, deploying a fix, and directing customers to its support portal for the latest updates.
The company also mentioned that it was taking steps to prevent such issues from happening again. Some of these actions include giving customers more control over the update process, providing additional details about planned updates to customers, and releasing a full analysis of the incident and its causes.
A group of lawmakers has already voiced concern about the impact of the outage and how it jeopardized the operations of critical infrastructure. These calls for increased scrutiny of CrowdStrike’s actions could lead to its CEO and other key personnel being summoned to testify on the issue before members of the House of Representatives.
“This is a really big mess, to be honest,” commented Steve Zuromski, the vice president of IT at Bridgewater State University.
“What technicians are having to do, at least as of now, is manually go to each work station, boot it into what’s called safe mode, which is basically a diagnostic mode, delete the file that’s causing it to crash and then reboot,” he added.
Meanwhile, Jon Amato, an IT security expert from Gartner, said: “This may be the biggest stress test that I’ve ever seen for direct, first-line IT support teams.”
Lessons Learned from the CrowdStrike Outage
The CrowdStrike outage is raising concerns about the vulnerabilities of critical systems in the United States and overseas. If a single, seemingly routine software update can cause this kind of havoc, hostile nations and bad actors could take advantage of future updates to hijack or disrupt the operations of organizations across the globe.
The incident also highlighted the importance of keeping response plans for these kinds of situations up to date, and it tested the readiness of IT teams at hundreds of organizations.
Moreover, it became clear that some countries depend more heavily than others on systems developed by Microsoft. Countries like China emerged nearly unscathed from the incident because their IT infrastructures are mostly isolated from Western technologies.
The strategic importance of these companies and their solutions is more evident than ever and this may prompt the United States and its allies to increase their oversight of these organizations, their products, and contingency plans to protect their strategic systems from being targeted and disrupted by criminal organizations and hostile nation-states.
While this incident was not a cyberattack, it raised serious questions about the resilience of both Windows operating systems and the cybersecurity measures provided by CrowdStrike. It serves as a wake-up call for reassessing and potentially overhauling existing cybersecurity strategies and IT management practices.
Industry Experts Explain How to Protect Your Computers
Companies and organizations should take note of what happened and strengthen their systems’ resilience and their ability to respond to faulty software updates and IT outages.
The best approach in these cases is to prepare for the worst by coming up with comprehensive step-by-step contingency plans and testing the preparedness of IT teams via live drills.
“What you want to identify first is: when we imagine all the different things that can go wrong, which one is going to be the most disastrous for us,” commented Spencer Kimball, the CEO of Cockroach Labs, a software company that develops a specialized database designed to reduce downtime.
While not always feasible, organizations should consider diversifying their IT infrastructure and working with various providers to avoid single points of failure in their architectures.
“Reliance on single software providers is part of the reality in modern IT estate management,” commented Frank Trovato from the Info-Tech Research Group. He argued that companies that rely solely on Azure – Microsoft’s cloud service – are entirely vulnerable to an Azure outage.
Progressive rollouts are also recommended over large single-package installations. It is often safer to update critical software module by module than to perform large updates that affect the entire program. It also doesn’t hurt to wait for others to test and vet an update before you install it. Organizations that didn’t immediately install the CrowdStrike update were mostly unaffected.
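To make the idea of a progressive rollout concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not a description of how CrowdStrike or any other vendor deploys updates: the function name `staged_rollout`, the stage sizes, the failure threshold, and the `apply_update` callback (which is assumed to install the update on one machine and return whether the machine is still healthy) are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass
class RolloutResult:
    updated: list[str]  # hosts that took the update and passed the health check
    failed: list[str]   # hosts that failed the health check after updating
    halted: bool        # True if a stage exceeded the failure threshold


def staged_rollout(
    hosts: Sequence[str],
    apply_update: Callable[[str], bool],  # hypothetical: returns True if host is healthy
    stage_fractions: Sequence[float] = (0.01, 0.10, 0.50, 1.0),  # assumed stage sizes
    max_failure_rate: float = 0.02,  # assumed tolerance per stage
) -> RolloutResult:
    """Update hosts in growing batches and stop as soon as a batch looks unhealthy."""
    updated: list[str] = []
    failed: list[str] = []
    done = 0  # number of hosts processed so far

    for fraction in stage_fractions:
        # Cumulative number of hosts that should be processed by the end of this stage.
        target = min(len(hosts), max(done + 1, round(len(hosts) * fraction)))
        batch = hosts[done:target]
        if not batch:
            break

        batch_failed = [host for host in batch if not apply_update(host)]
        failed.extend(batch_failed)
        updated.extend(host for host in batch if host not in batch_failed)
        done = target

        # Halt before the next, larger stage if too many hosts in this batch broke.
        if len(batch_failed) / len(batch) > max_failure_rate:
            return RolloutResult(updated, failed, halted=True)

    return RolloutResult(updated, failed, halted=False)
```

With 200 machines and an update that breaks every machine it touches, this sketch would stop after the first stage of two machines instead of reaching all 200, which is the point of rolling out gradually.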
In large organizations with multiple development teams, standardizing DevOps practices can help prevent inconsistencies that might lead to issues slipping through the cracks. This can improve the overall quality of the software that is being rolled out and reduce the risk of major outages.
“This has happened before, and we can expect something like this to happen again in the future,” Trovato added.
As our reliance on interconnected IT systems continues to grow, so does the importance of building resilient systems. This incident should prompt organizations across all sectors to reassess their IT contingency plans, diversify their technological dependencies, and prepare better for potential large-scale disruptions.