Our SecOps team director shares what we can learn from the incident that got 8.5 million screens worldwide singing the blues.
On July 19th, 2024, we witnessed one of the most significant IT outages in history, caused by a problematic update of CrowdStrike’s Falcon Sensor platform. The numbers are staggering: more than 8 million devices were affected, more than 5,000 flights were canceled, and the total damage is estimated at 10 billion USD.
In the software industry, there’s a well-known principle that one should be extra careful when deploying changes on a Friday. Especially to 100% of your customers, all at the same time.
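For readers who want to see what the alternative to "everyone at once" can look like, here is a minimal sketch of a percentage-based staged rollout. It is an illustration under our own assumptions, not a description of how CrowdStrike or any other vendor ships updates: the stage percentages, the device IDs, and the function names are all hypothetical.

```python
import hashlib

# Hypothetical rollout stages: share of the fleet (in percent) that gets the update.
ROLLOUT_STAGES = [1, 10, 50, 100]


def rollout_bucket(device_id: str) -> int:
    """Map a device ID to a stable bucket from 0 to 99."""
    digest = hashlib.sha256(device_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % 100


def should_receive_update(device_id: str, stage_percent: int) -> bool:
    """Return True if the device falls inside the current rollout stage."""
    return rollout_bucket(device_id) < stage_percent


if __name__ == "__main__":
    # Hypothetical fleet, used only to show how the stages widen.
    fleet = [f"device-{n}" for n in range(10_000)]
    for percent in ROLLOUT_STAGES:
        targeted = sum(should_receive_update(d, percent) for d in fleet)
        print(f"stage {percent:>3}%: {targeted} of {len(fleet)} devices")
```

Because the bucket is derived from a hash of the device ID, a device that received the update at the 1% stage stays included at every later stage, and a bad update can be halted before it reaches the rest of the fleet.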
However, this is not the only lesson we can take away from this incident that caused worldwide havoc.
Testing will never go out of fashion
Even though CrowdStrike is a cybersecurity company, this was not a security incident. It’s the oldest story in the book—a seemingly trivial change that broke everything.
This time it came in the shape of a faulty configuration update to CrowdStrike’s security software that was supposed to improve how it gathers information on potential threats. The update inadvertently caused recurring BSODs (“blue screen of death”, a critical error that indicates a system crash) on Windows machines worldwide, resulting in a historic disruption of daily life, businesses, and governments.
Everyone in software development and testing knows that minor changes are often the most dangerous ones because they can easily fly under the radar.
As CrowdStrike’s post-mortem identifies, the tool they used for verifying the update had an issue, which meant the faulty patch went into production with the underlying problem undetected. It’s interesting to note the process improvements that CrowdStrike announced it would implement after the incident:
- Local developer testing
- Content update and rollback testing
- Stress testing, fuzzing, and fault injection
- Stability testing
- Content interface testing
Any QA specialist will immediately recognize some of these as no more than common, everyday software testing processes. So why would a company as serious as CrowdStrike wait for a crisis of this magnitude to fortify its testing and deployment processes? We can only speculate, but someone somewhere must have declared the above a technical redundancy and prioritized deployment speed above all else.
Unfortunately, it isn’t just CrowdStrike. The industry has been trying to proclaim testing dead for decades, pushing to replace the entire trade with automated checks running on CI/CD platforms. “If it’s green, we’re good to go!” is today’s modus operandi.
Yet, in the supercharged, fast-paced environment where software is delivered, it is worth reminding ourselves that critical software requires more focus on quality, diligent testing, and intelligent investigation. The alternative, as this case has shown, can have global ramifications.
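To make one item from the list above less abstract, here is a minimal sketch of fuzzing: throwing randomly generated input at a parser and checking that it only ever fails in a controlled way. The rule format (`name|pattern|severity`), the `parse_rule` function, and the field count are hypothetical and chosen purely for illustration; they are not CrowdStrike's actual content format or tooling.

```python
import random

EXPECTED_FIELDS = 3  # hypothetical: each rule line carries exactly three fields


def parse_rule(line: str) -> dict:
    """Parse a hypothetical 'name|pattern|severity' rule line.

    Malformed input is rejected with ValueError instead of failing later.
    """
    fields = line.split("|")
    if len(fields) != EXPECTED_FIELDS:
        raise ValueError(f"expected {EXPECTED_FIELDS} fields, got {len(fields)}")
    name, pattern, severity = fields
    if not name or not pattern:
        raise ValueError("name and pattern must be non-empty")
    if not (severity.isascii() and severity.isdigit()):
        raise ValueError("severity must be a non-negative integer")
    return {"name": name, "pattern": pattern, "severity": int(severity)}


def fuzz_parse_rule(iterations: int = 10_000, seed: int = 0) -> int:
    """Feed random strings to parse_rule and return how many were accepted.

    Any exception other than ValueError propagates and fails the run.
    """
    rng = random.Random(seed)
    alphabet = "ab|019 \n"
    accepted = 0
    for _ in range(iterations):
        length = rng.randint(0, 20)
        line = "".join(rng.choice(alphabet) for _ in range(length))
        try:
            parse_rule(line)
            accepted += 1
        except ValueError:
            pass  # a controlled rejection is the expected outcome for bad input
    return accepted


if __name__ == "__main__":
    print(f"accepted inputs: {fuzz_parse_rule()}")
```

A green CI check only tells you that the inputs someone thought of behave as expected. A fuzz run like this one probes the inputs nobody thought of, which is where seemingly trivial changes tend to break things.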
Massive outages are a breeding ground for cybercrime
It’s not just that planes stopped flying for a day or two; incidents of this scale open up ample opportunities for additional damage. One could say that every significant IT issue is a cybersecurity issue waiting to happen.
Immediately after the outage was reported, panic ensued and every IT technician in the world was looking into ways to bring the system back from the dead. However, not every IT technician had passed security awareness training and had enough composure to weed out the scams from legitimate fixes.
Panic, a rapidly developing situation, and a lack of education make for a perfect storm for social engineering scams of all shapes and sizes. Indeed, cybercriminals started impersonating CrowdStrike employees and shipping malicious “fixes” in record time to penetrate someone’s weakened defenses.
As noted in a blog post from KnowBe4’s CEO:
Within hours of mass IT outages […], a surge of new domains began appearing online, all sharing one common factor: the name CrowdStrike. As the company grapples with a global tech outage that has delayed flights and disrupted emergency services, opportunistic cybercriminals are quick to exploit the chaos.
It’s hard not to notice the irony of “fixes” for a malfunctioning cybersecurity product being actual cyber-attacks. But that’s our world, and it won’t get any simpler soon.
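The surge of brand-named domains described in the quote is something a security team can screen for. Below is a minimal sketch, assuming a simple allowlist approach: flag any domain that contains the brand name but is neither a trusted domain nor a subdomain of one. The allowlist contents and the function name are our assumptions, and the sample domains in the usage section are made up (they use the reserved `.example` suffix).

```python
BRAND = "crowdstrike"

# Assumption: a short allowlist of domains the organization trusts for this brand.
TRUSTED_DOMAINS = {"crowdstrike.com"}


def is_suspicious(domain: str) -> bool:
    """Flag domains that mention the brand but are not on or under the allowlist."""
    normalized = domain.strip().lower().rstrip(".")
    if BRAND not in normalized:
        return False
    return not any(
        normalized == trusted or normalized.endswith("." + trusted)
        for trusted in TRUSTED_DOMAINS
    )


if __name__ == "__main__":
    # Hypothetical sample domains, for illustration only.
    samples = [
        "crowdstrike.com",
        "www.crowdstrike.com",
        "crowdstrike-fix.example",
        "crowdstrike.com.support.example",
        "unrelated.example",
    ]
    for sample in samples:
        label = "SUSPICIOUS" if is_suspicious(sample) else "ok"
        print(f"{sample}: {label}")
```

A check this simple will miss misspelled variants and lookalike characters, so treat it as a first filter for mail gateways or proxy logs rather than a replacement for security awareness training.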
Preparation and training are crucial
One thing is sure—this is not the first issue of its kind, and it won’t be the last. We can (and should) take all the steps to prevent such incidents from happening, but there is no 100% guarantee. So how can we reduce their harmful effects?
When a massive outage takes place, good incident response plans, readily available technical staff, and clear mitigation steps will help soften the blow. Coming into such situations utterly unprepared on the operations side of things is something no company can afford today, especially if their business depends almost wholly on digital systems.
Further, we don’t want to make a bad situation worse by having someone fall for a scam mid-crisis, so we need to focus on strengthening every company’s weakest link – its employees.
Unless you want to blame interns for a major scandal, security awareness training is crucial. An employee who can recognize a scam will not introduce additional risk into a system already on the brink of collapse.
Complementing that with security policies that are up to date and actually followed will result in an organization that enters chaos with less risk and more assurance that working order will be restored in no time.
Food for thought?
As media outlets move on to the next big story and CrowdStrike’s marketing department works to develop a strategy to repair the reputational damage, everyone else would be wise to focus on the lessons we can learn from the incident.
The event has underlined the importance of rigorous testing, the ever-present threat of cybercrime, and the necessity of robust preparation and training for incident prevention and response. As we navigate an increasingly digital world, the threats keep multiplying, and you should double-check that your systems and processes are fortified against them.
If you feel like your company could work on these areas, we have some resources to get you started. Here’s how phishing simulations contribute to enterprise security, and if you need any assistance, you can always check out our cybersecurity services.