How the CrowdStrike Perfect Storm formed
Null pointer reference, Ring Zero, not testing on a live box, WHQL, and more
Imagine, if you will, that a very large cybersecurity company rolled out software that crashed millions of systems around the world, wreaking havoc on millions of people.
It’s not The Twilight Zone. It’s the real-life CrowdStrike Perfect Storm.
The company rolled out an update that contained a serious flaw: a data file that was reported at the time to be filled with zeros, hence null. When CrowdStrike’s software tried to process that file, the result was a null pointer reference. The file was pushed to subscribers’ PCs, physical and virtual.
Feeding an unprocessable file like that, with no valid content to read, into software that does not check its input first will cause it to crash. The software will not and cannot run until the bad file is removed.
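To make the failure mode concrete, here is a minimal sketch in Python of the kind of defensive check a consumer of such a file could perform before acting on it. The file layout (a 4-byte magic value followed by a record count), the magic value itself, and the names `parse_channel_file`, `load_or_fallback`, and `ChannelFileError` are hypothetical, invented for illustration; they are not CrowdStrike’s actual format or code.

```python
import struct

# Hypothetical layout for illustration only: a 4-byte magic value, then a
# little-endian 32-bit record count. This is not CrowdStrike's real format.
MAGIC = b"CHNL"
HEADER = struct.Struct("<4sI")


class ChannelFileError(ValueError):
    """Raised when a data file fails validation."""


def parse_channel_file(blob: bytes) -> int:
    """Return the record count, or raise ChannelFileError on bad input."""
    if len(blob) < HEADER.size:
        raise ChannelFileError("file too short to contain a header")
    if not any(blob):
        # Every byte is zero: nothing here can be interpreted.
        raise ChannelFileError("file contains only zero bytes")
    magic, count = HEADER.unpack_from(blob)
    if magic != MAGIC:
        raise ChannelFileError("unexpected magic value")
    return count


def load_or_fallback(blob: bytes, last_good: int) -> int:
    """Keep the last known-good value instead of crashing on a bad file."""
    try:
        return parse_channel_file(blob)
    except ChannelFileError:
        return last_good
```

In ordinary application code, a rejected file is an error to log and move past. In kernel-mode code, where an unhandled fault takes down the whole machine, rejecting the file and falling back to the last known-good one is the difference between a missed update and a crash.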
Compounding the problem, and preventing an otherwise quick repair of the systems, the null pointer reference arose from a data file that CrowdStrike’s security program, called Falcon, could not run without, because its drivers depended on it.
These drivers were tied directly into the kernel, the very heart of the Windows OS, where code runs with the highest level of privilege, called Ring Zero.
Cybersecurity software on Windows requires such privileged, certified access to the operating system in order to defend against viruses and malware that target the kernel. With kernel access, viruses can do as they please on the system; without kernel access, antivirus software cannot stop or prevent viruses.
The faulty file was a data file of the sort CrowdStrike calls channel files, which tie into its drivers. Drivers are software that works at different levels of the system to connect a device with its operating system.
In this case, the CrowdStrike drivers were required for the system to boot. They relied on the data file that turned out to trigger the null pointer reference, and when that file caused Falcon to crash, the system could not start. No Falcon, no boot.
Also complicating matters, and ironically making the damage worse, Microsoft has a process whereby drivers get certified and validated to run on Windows: WHQL, or Windows Hardware Quality Labs testing. The program is now called the Windows Hardware Compatibility Program, but the term WHQL is still used. Otherwise, developers can digitally sign their drivers themselves, but doing so will prompt a warning to the user at the time of installation.
CrowdStrike had obtained that certification, meaning that Microsoft itself signed CrowdStrike’s drivers. Backed by WHQL certification, CrowdStrike could roll out drivers and their related data files, which could be, and were, installed automatically on systems subscribing to its service, without user interaction.
On the one hand, updates for antivirus must be implemented quickly, to keep up with quickly evolving malware variants.
But on the other hand, rolling out antivirus updates without adequate testing causes outages and damage not unlike those caused by viruses.
CrowdStrike used virtual validators to test their code and patch files. That was insufficient. Had they done real-world testing, they would have seen how their faulty update files crashed systems and left them inoperable.
They did not test their code on actual Windows boxes, that is, devices running Windows, which is exactly what crashed: millions of them, belonging to their customers. The cascading effects were enormous.
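One common way to catch this class of failure before it reaches every customer is a staged, or canary, rollout: push the update to a small group of real machines first, and widen the rollout only if they stay healthy. The following Python sketch illustrates the idea. The stage sizes, the 1% failure threshold, and the `deploy` callback are assumptions made for illustration and do not describe CrowdStrike’s pipeline.

```python
from typing import Callable, Sequence


def staged_rollout(
    hosts: Sequence[str],
    deploy: Callable[[str], bool],
    stage_fractions: Sequence[float] = (0.01, 0.10, 1.0),
    max_failure_rate: float = 0.01,
) -> tuple[bool, int]:
    """Deploy in widening stages; halt if a stage exceeds the failure rate.

    `deploy(host)` is a hypothetical callback that returns True if the host
    is still healthy after the update. Returns (completed, hosts_touched).
    """
    done = 0
    for fraction in stage_fractions:
        # Cumulative number of hosts that should be updated after this stage.
        target = max(1, int(len(hosts) * fraction))
        batch = hosts[done:target]
        if not batch:
            continue
        failures = sum(1 for host in batch if not deploy(host))
        done = target
        if failures / len(batch) > max_failure_rate:
            return False, done  # stop before the next, larger stage
    return True, done
```

With an update that breaks every machine it touches, a gate like this halts after the first stage, 10 hosts out of 1,000 with the defaults above, instead of reaching the entire fleet.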
CrowdStrike’s DevOps failure speaks for itself; it is res ipsa loquitur, in legal terms. It would also amount to a prima facie case of negligence, though of course that would have to be established through litigation or resolved through settlements. There are limits to contractual limitations of liability; they do not confer absolute immunity from recovery of damages by first or third parties, and certainly not from actions by shareholders for diminished value.
And in this case, the impacted systems could not be made operable again except by removal of the null pointer reference file through manual intervention. Those methods were and are cumbersome and time-consuming, requiring, in the case of physical machines, a visit to each and every affected system, one by one, to remediate.
The recovery steps involved rebooting up to 15 times in the hope that the CrowdStrike Falcon software could download and replace the damaged file in time, before the damaged one got called and triggered a crash; manually booting into Safe Mode, where fewer programs are running, and removing the damaged file; entering the BitLocker recovery key, to be able to interact with an encrypted system; and/or booting from a USB stick containing a special remediation script provided by Microsoft.
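The file-removal step can be sketched as follows. The directory and the file name pattern are taken from CrowdStrike’s published workaround; the function name and the `dry_run` flag are hypothetical. This Python sketch only illustrates the logic. On an affected machine in Safe Mode or the recovery environment, Python would generally not be available, and the same step was typically carried out by hand from a command prompt.

```python
from pathlib import Path

# Directory and file pattern from CrowdStrike's published workaround.
DEFAULT_DIR = Path(r"C:\Windows\System32\drivers\CrowdStrike")
PATTERN = "C-00000291*.sys"


def remove_faulty_channel_files(directory: Path = DEFAULT_DIR,
                                dry_run: bool = True) -> list[str]:
    """List (and, unless dry_run is set, delete) files matching PATTERN."""
    matches = sorted(directory.glob(PATTERN))
    if not dry_run:
        for path in matches:
            path.unlink()
    return [path.name for path in matches]
```

Defaulting to a dry run means a first pass only reports what would be deleted. That is a sensible precaution in a drivers directory, where removing the wrong file can itself keep a system from booting.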
And there we have it: the Perfect Storm.