Sunday, July 21, 2024

Well That Was Really The King Of IT Outages So Far!

Here is the basic outline of what happened in case you have been hiding under a rock for the last few days!

The software patch that shook the world

Asa Fitch, Sam Schechner and Sarah E. Needleman

21 July, 2024

Hemant Rathod, an Indian executive, was sipping tea in a conference room on Friday morning in Delhi, about to send a long email to his team, when his computer went haywire.

The HP laptop suddenly said it needed to restart. Then the screen turned blue. He tried in vain to reboot. Within 10 minutes, the screens of three other colleagues in the room turned blue too.

“I had taken so much time to draft that email,” Rathod, a senior vice president at Pidilite Industries, a construction-materials company, said by phone half a day later, still carrying his dead laptop with him. “I really hope it’s still there so I don’t have to write it again.”

The outage, one of the most momentous in recent memory, crippled computers worldwide and drove home the brittleness of the interlaced global software systems that we rely on.

Triggered by an errant software update from the cybersecurity company CrowdStrike, the disruption began as those in Asia were starting their days and Australians were well into them.

Over the course of less than 80 minutes before CrowdStrike stopped it, the update sailed into Microsoft Windows-based computers worldwide, turning corporate laptops into unusable bricks and paralysing operations at restaurants, media companies and other businesses.

US 911 call centres were disrupted, Amazon.com employees’ corporate email system went on the fritz, and tens of thousands of global flights were delayed or cancelled.

“In my 30-year technical career, this is by far the biggest impact I’ve ever seen,” said B.J. Moore, chief information officer for the Renton, Wash.-based healthcare system Providence, whose hospitals struggled to access patient records, perform surgeries and conduct CT scans.

Fixing the problem involved technical steps that confounded many users who aren’t tech-savvy. Some corporate IT departments were still working to unfreeze computer systems late on Friday. CrowdStrike said the outage isn’t a cyberattack.

Adding to the chaos – and further underlining the vulnerability of the global IT system – a separate problem hit Microsoft’s Azure cloud computing system on Thursday shortly before the CrowdStrike glitch, causing an outage for customers including some US airlines and users of Xbox and Microsoft 365.

The CrowdStrike problem laid bare the risks of a world in which IT systems are increasingly intertwined and dependent on myriad software companies – many not household names.

That can cause huge problems when their technology malfunctions or is compromised. The software operates on our laptops and within corporate IT setups, where, unknown to most users, they are automatically updated for enhancements or new security protections.

In a 2020 hack, Russian perpetrators inserted malicious code into updates of SolarWinds software in a way that compromised a swath of the US government and scores of private companies.

The rising frequency and impact of cyberattacks, including ones that insert damaging ransomware and spyware, have helped fuel the growth of CrowdStrike and such competitors as Palo Alto Networks and SentinelOne in recent years. CrowdStrike’s annual revenue has grown 12-fold over the past five years to over $US3 billion ($4.5bn).

But cybersecurity software such as CrowdStrike’s can be especially disruptive when things go wrong because it must have deep access into computer systems to rebuff malicious attacks.

Not all updates happen automatically, and computer attacks often occur because people or businesses are slow to adopt patches sent by software companies to fix vulnerabilities – in essence, failing to take the medicine the doctors prescribe. In this case, the medicine itself hurt the patients.

The global outage began with an update of a so-called “channel file”, a file containing data that helps CrowdStrike’s software neutralise cyber threats, CrowdStrike said. The update was timestamped 4:09am UTC – just after midnight in New York and just after 2pm Friday in eastern Australia.
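The time-zone arithmetic behind that timestamp can be checked with a short sketch. It assumes the update went out on Friday 19 July 2024 (the Friday before this article's date) and uses fixed offsets of UTC-4 for New York and UTC+10 for eastern Australia, which is what applied in July; the variable names are purely illustrative.

```python
from datetime import datetime, timedelta, timezone

# Assumption: the update went out on Friday 19 July 2024 at 04:09 UTC.
update_utc = datetime(2024, 7, 19, 4, 9, tzinfo=timezone.utc)

# Assumption: fixed July offsets (US Eastern Daylight Time and
# Australian Eastern Standard Time), rather than a full time-zone database.
new_york = timezone(timedelta(hours=-4), "EDT")
sydney = timezone(timedelta(hours=10), "AEST")

ny_time = update_utc.astimezone(new_york)
syd_time = update_utc.astimezone(sydney)

print(ny_time.strftime("%H:%M %Z"))   # 00:09 EDT
print(syd_time.strftime("%H:%M %Z"))  # 14:09 AEST
```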

That update caused CrowdStrike’s software to crash the brains of the Windows operating system, known as the kernel. Restarting the computer simply caused it to crash again, meaning that many users had to surgically remove the offending file from each affected computer.

The nature of the patch meant that the impact was uneven, with people in the same office even experiencing the outage very differently. Apple Macs, which don’t use the affected Windows software, were OK, and servers and PCs that weren’t on and internet-connected didn’t receive the toxic update.

CrowdStrike soon realised something was amiss and the update to the file was rolled back 78 minutes later. That meant it wouldn’t affect computers that were off or in sleep mode during that period. But for many of those that were switched on, the damage was done.
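That exposure window can be sketched in a few lines. The 04:09 UTC start and the 78-minute duration come from the article; the date (19 July 2024), the function names and the idea of describing a machine by a single online interval are assumptions made for illustration.

```python
from datetime import datetime, timedelta, timezone

# From the article: update timestamped 04:09 UTC, rolled back 78 minutes later.
# Assumption: the date was Friday 19 July 2024.
WINDOW_START = datetime(2024, 7, 19, 4, 9, tzinfo=timezone.utc)
WINDOW_END = WINDOW_START + timedelta(minutes=78)


def was_exposed(online_from: datetime, online_until: datetime) -> bool:
    """Return True if a machine's online interval overlaps the update window.

    Hypothetical helper: real exposure also depended on the machine running
    CrowdStrike's Windows software and actually fetching the update.
    """
    return online_from < WINDOW_END and online_until > WINDOW_START


def utc(hour: int, minute: int) -> datetime:
    """Build a UTC time on the assumed date, for the examples below."""
    return datetime(2024, 7, 19, hour, minute, tzinfo=timezone.utc)


print(WINDOW_END.strftime("%H:%M UTC"))    # 05:27 UTC
print(was_exposed(utc(3, 0), utc(4, 30)))  # True: online when the update went out
print(was_exposed(utc(6, 0), utc(9, 0)))   # False: switched on after the rollback
```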

In a blog post, CrowdStrike told those users to boot into the Windows “safe mode,” delete the offending file – called C-00000291*.sys – and reboot.
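The file-matching step in those instructions can be illustrated with a minimal sketch. Only the file pattern comes from the article; the function name, the directory argument and the dry-run default are assumptions, and a machine stuck in safe mode may well not have Python available, so treat this as an illustration of the logic rather than a recovery tool.

```python
from pathlib import Path

# Pattern quoted in the article; everything else here is illustrative.
CHANNEL_FILE_PATTERN = "C-00000291*.sys"


def remove_channel_files(directory: str, dry_run: bool = True) -> list[str]:
    """List (and optionally delete) files matching the faulty channel-file pattern.

    `directory` is supplied by the caller: the article does not say where
    the file lives, so no path is assumed here.
    """
    matches = sorted(
        p for p in Path(directory).glob(CHANNEL_FILE_PATTERN) if p.is_file()
    )
    if not dry_run:
        for path in matches:
            path.unlink()  # delete the matching file
    return [p.name for p in matches]
```

Called with the default `dry_run=True` it only reports what matches; passing `dry_run=False` deletes the matching files and leaves everything else in the directory alone.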

IT teams often can fix problems on employees’ computers using remote-access software – tools that became especially common during the work-from-home boom of the pandemic. But for laptops and other PCs, that approach doesn’t work if the machines can’t restart.

For those systems, CrowdStrike’s fix had to be done in person – either by a tech-support person on site, or by a regular employee trying to apply the instructions.

Moore, the Washington State healthcare CIO, was away on vacation and initially wasn’t worried when emails about malfunctioning computer applications started landing in his inbox on Thursday night.

By late that night, he had learned that the outage had engulfed the nonprofit health system’s approximately 50 hospitals and 1000 clinics across seven US states. Hundreds of IT employees began deploying patches, which required manual remediation, he said.

Some of the system’s affected computers and devices were fixed by 6am Friday US time, and most were humming again by 10am. “It will be the end of the day before we get it all done,” Moore said on Friday morning.

As companies were grappling with the impact, CrowdStrike’s co-founder and chief executive officer, George Kurtz, was on TV trying to reassure customers – and shareholders – looking haggard after a long night.

“We identified this very quickly and rolled back this particular content file,” Kurtz said in a CNBC interview about nine hours after the faulty update.

“Some systems may not fully recover, and we’re working individually with each and every customer to make sure that we can get them up and running and operational,” he added.

The timeframe for the recovery could be hours or “a bit longer”, he said. Kurtz said on X that the outage isn’t “a security incident or cyberattack”.


Microsoft CEO Satya Nadella took to X to offer his own reassurance that the company was working closely with CrowdStrike to bring systems back online.

Tesla CEO Elon Musk responded, “This gave a seizure to the automotive supply chain,” and later said, “We just deleted CrowdStrike from all our systems.”

For Rathod, the senior vice president at Pidilite, the travails didn’t end with his potentially lost email.

After switching to his iPad to keep working, he had to rush to the airport for a flight – only to find long lines and flummoxed security staff checking boarding passes manually. Flight information screens weren’t working, so he had to find airline staff to direct him to the right gate.

“It was a mess at Delhi airport,” Rathod said. “How can we depend so much on one company?”

Tom Dotan and Robert McMillan contributed to this article.

The Wall Street Journal

Here is the link:

https://www.theaustralian.com.au/business/the-wall-street-journal/the-software-patch-that-shook-the-world/news-story/fe6cce2fc54d97dbae8f57489532640f

I reckon all that can be said about this outage has pretty much been said and the article above is a pretty good summary for the record.

I find it interesting that my system just kept chugging along as I have no need of or awareness of CrowdStrike! To me what happened makes a case for simplicity in critical systems (hospitals and the like) and for planning any updates to happen at times when the use of the machine is not vital. (Given the rapid response, just making sure updates happened on the weekend would have saved you!)

I am sure everyone from philosophers to us humble plebs is going to be pleased that the simplicity of our operations and lack of pushed updates saved us completely.

There has to be a case for a total rethink of all the updating that seems to be presently inflicted on us all. Have you ever had a Windows Update that you really felt you needed? Maybe 3-4 times a year? I do not know, but I feel we need some clever souls to redesign what is done, now we have seen the possible harm!

I also think we need to do something about the Windows Hegemony – Linux anyone? Surely Coles and Woolies would be better off with a Linux terminal network?

There are some hard questions that need answers.

What do you think?

David.

p.s. Remember this outage needs considerable effort to fix - if you are affected:

Here is the drill:

"The only remedy for Windows users affected by the “blue screen of death” error involves rebooting the computer and manually deleting CrowdStrike’s botched file update."

Glad you had never heard of CrowdStrike?

D.
