CrowdStrike outage: Reasons, analysis & precautions

19th July will go down in history as the day the world stood still. A seemingly harmless software update from Texas-based cybersecurity firm CrowdStrike flirted the wrong way with Microsoft Windows and boom, the world plunged into chaos as IT systems cascaded into outages affecting banking, airlines, hospitals, television news stations, emergency services and many other industries and companies globally, including the London Stock Exchange, Mumbai Airport in India and many airlines.

Windows users around the world reported their computers showing the operating system’s Blue Screen of Death. As per Microsoft estimates, the incident, which is being described as one of the worst IT outages in history, affected 8.5 million computers around the world.

CrowdStrike now says that “a significant number” of affected devices are back online.

Omer Grossman, CIO at CyberArk, says the damage to business processes at the global level is dramatic. “The glitch is due to a software update of CrowdStrike’s EDR product. This is a product that runs with high privileges that protects endpoints. A malfunction in this can, as we are seeing in the current incident, cause the operating system to crash.”

Why It Happened

Kumar Ritesh, CEO and Founder of CYFIRMA, explains that the massive outage in Microsoft systems caused by CrowdStrike updates was due to a compatibility issue between CrowdStrike’s Falcon sensor and a Windows update.

“When the CrowdStrike sensor, a critical endpoint protection agent, was updated, it conflicted with changes introduced in the latest Windows update,” he explains.

Jake Moore, Global Security Advisor at ESET, says these outages are increasing in volume due to the sheer increase in the number of online users and traffic.

“After witnessing the blue screen of death (BSOD), many people are quick to suspect a cyberattack or find similarities to Netflix’s Leave The World Behind but this can often add to the confusion. It highlights the importance of these services and the millions of people they serve.”

He goes on to stress that such an event shows how dependent we are on big tech organizations. “The inconvenience caused by the loss of access to services for thousands of people serves as a reminder of our dependence on Big Tech such as Microsoft in running our daily lives and businesses. Upgrades and maintenance to systems and networks can unintentionally include small errors, which can have wide-reaching consequences as experienced today by CrowdStrike’s customers.”

He also adds that another aspect of this incident relates to “diversity” in the use of large-scale IT infrastructure.

“This applies to critical systems like operating systems (OSes), cybersecurity products, and other globally deployed (scaled) applications. Where diversity is low, a single technical incident, not to mention a security issue, can lead to global-scale outages with subsequent knock-on effects.”

Grossman says the malfunction could have several causes: “The range of possibilities ranges from human error – for instance a developer who downloaded an update without sufficient quality control – to the complex and intriguing scenario of a deep cyberattack, prepared ahead of time and involving an attacker activating a ‘doomsday command’ or ‘kill switch’.”

Precautions for the Future

Moore says businesses must test their infrastructure and have multiple fail-safes in place, however large the company is. “This is typically referred to as a cyber-resilience plan. But as often is the case, it is simply impossible to simulate the size and magnitude of the issue in a safe environment without testing the actual network,” he advises.

“Such incidents underscore the importance of rigorous compatibility testing between security solutions and operating system updates to prevent widespread disruptions. There are measures that can be put in place to avoid such disruptions,” Kumar Ritesh says.

His advice to avoid such a situation again is to create a testing environment that mirrors production systems before deploying any security update or software patch.

“Test the update thoroughly in this environment to identify any compatibility issues or unexpected behavior. Avoid deploying updates across all systems simultaneously. Instead, roll them out gradually to a subset of machines. Monitor these systems closely for any adverse effects. If everything looks good, proceed with a wider rollout.

“Regularly back up critical systems so that in case an update causes problems like the current situation with CrowdStrike updates, you can restore the system to a previous state. Ensure backups are tested and reliable. Use patch management tools to automate the deployment of updates. These tools allow you to schedule updates, track their status, and roll back changes if needed.

“We would always encourage organizations to implement monitoring solutions that detect anomalies, performance issues, or unexpected behavior. And set up alerts to notify you immediately if any critical system experiences problems,” he says.
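The gradual rollout Ritesh describes can be illustrated with a minimal sketch. It is not CrowdStrike's or any patch management tool's actual mechanism. The function name `staged_rollout`, the stage fractions, and the `deploy`, `is_healthy` and `rollback` callbacks are all hypothetical placeholders that an organization would wire to its own tooling.

```python
from typing import Callable, Iterable, List


def staged_rollout(
    hosts: List[str],
    deploy: Callable[[str], None],
    is_healthy: Callable[[str], bool],
    rollback: Callable[[str], None],
    stage_fractions: Iterable[float] = (0.01, 0.10, 0.50, 1.0),  # assumed stages
) -> bool:
    """Deploy to growing subsets of hosts; stop and roll back if any host is unhealthy."""
    updated: List[str] = []
    for fraction in stage_fractions:
        # Cumulative number of hosts that should be updated by the end of this stage.
        target = max(1, int(len(hosts) * fraction))
        for host in hosts[len(updated):target]:
            deploy(host)
            updated.append(host)
        # Monitor everything updated so far before widening the rollout.
        if not all(is_healthy(h) for h in updated):
            for h in reversed(updated):
                rollback(h)
            return False
    return True


if __name__ == "__main__":
    fleet = [f"host-{i}" for i in range(100)]
    ok = staged_rollout(
        fleet,
        deploy=lambda h: print(f"deploying to {h}"),
        is_healthy=lambda h: True,   # stand-in for a real health probe
        rollback=lambda h: print(f"rolling back {h}"),
    )
    print("rollout completed" if ok else "rollout halted")
```

The point of the design is that a faulty update is caught while it has reached only a small share of machines, rather than the whole fleet at once. In practice the health check would need to run out of band, since a machine stuck on a crash screen cannot report its own status.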

Aftermath

While the world slowly gets back to normal, it’s taking time. According to media reports, while several banks and other financial institutions were affected, the medical and airline sectors had it the worst. Hundreds of flight cancellations are still occurring. According to Wired, many hospitals in the UK, Australia and Israel faced chaos. The outage even impacted the national organ transplant matching system in the US. Banks that were temporarily affected included Wells Fargo, Charles Schwab and TD Bank. While most systems are now up and running, the fix for the CrowdStrike outage is manual and will take some time.

Grossman points out that getting customers back online and restoring continuity of business processes is not an easy task.

“It turns out that because the endpoints have crashed – the Blue Screen of Death – they cannot be updated remotely and this problem must be solved manually, endpoint by endpoint. This is expected to be a process that will take days.”

“CrowdStrike’s analysis and updates in the coming days will be of the utmost interest,” he added.

Navanwita Bora Sachdev

Navanwita is the editor of The Tech Panda and also frequently publishes stories in news outlets such as The Indian Express, Entrepreneur India, and The Business Standard.