Wednesday, May 22, 2024

This Sort Of Outage Affecting So Many Really Should Not Happen

This broke last week:

Data deleted: UniSuper outage raises lockout fears

By David Swan

May 18, 2024 — 12.05am

For more than a week in early May, more than half a million Australians were unable to access their superannuation funds – more than $130 billion worth – and were left wondering if their balances had been wiped out or taken by hackers.

One of the nation’s largest superannuation funds, UniSuper, succumbed to a technical glitch, which knocked its services offline and left each one of its 600,000 customers – this reporter included – locked out of their accounts.

The root cause? The fund’s cloud computing provider, US tech giant Google, had accidentally erased UniSuper’s Google Cloud account.

Normally that would be easy enough to come back from, given UniSuper typically has duplication in place across two geographies, so if one service goes down it can be restored from the other. However, the deletion hit UniSuper’s account across both geographies, part of what the companies now call an “unprecedented sequence of events” and an isolated, one-of-a-kind occurrence.

But far from being a one-of-a-kind occurrence, outages and disruptions are the unfortunate new reality of our digital era, exacerbated by what some see as an overreliance on a handful of giant companies.

That such a seemingly innocuous error could affect billions of dollars and cause such widespread frustration among the public has raised questions about the degree to which the Australian economy relies on an increasingly small number of US-based tech giants for their computing services.

Just three American companies – Amazon, Microsoft, and Google – command about two-thirds of Australia’s cloud infrastructure.

When UniSuper picked up the issue, the company mobilised a team of more than 100 people from both UniSuper and Google Cloud, who worked 24/7 in shifts to try and fix it.

A day after the outage started, it issued a statement assuring members the incident was not a cyberattack and that no personal information or funds were compromised.

“We drew on every resource we had to get these systems online again for our members,” a UniSuper spokesman told this masthead.

“It was an isolated, ‘one-of-a-kind occurrence’ that should not have happened, as Google Cloud has confirmed.”

The spokesman said that UniSuper had backups in place with both Google and another unnamed provider, improving its ability to restore services.

Handing over the keys

The problem with cloud services is that the businesses using them have little or no control over the infrastructure and its security.

There are great benefits to using the cloud. By using third-party cloud providers like Google Cloud, Amazon Web Services or Microsoft Azure, businesses often have reduced infrastructure costs and can scale resources up or down automatically based on demand. Using the cloud often means hiring fewer in-house IT staff, and therefore significantly reduced costs, and often higher levels of security and reliability.

Nearly half a million companies globally rely on Google’s ‘platform-as-a-service’, while an increasing number of Australian public sector departments use its technology. Australia Post, the CSIRO, the NSW Department of Customer Service and the Australian Department of Health and Aged Care are all Google Cloud customers.

Australia’s banks are also moving their infrastructure to the cloud: NAB signed a multimillion-dollar, long-term deal with Amazon Web Services in 2022, while CBA boss Matt Comyn has led a five-year cloud transition since 2020.

UniSuper shifted a significant portion of its operations to Google Cloud last year. The process involved transferring all non-production tasks, including 1900 virtual machines, across to Google, work that was previously spread across Microsoft’s Azure cloud platform and two of its data centres.

A UniSuper executive not authorised to speak publicly said that the company’s membership levels haven’t been affected by the outage and that members continue to join the fund.

So what lessons can we take from the incident?

One obvious solution is for companies to stop using the cloud so much, and instead return to how computing looked in the 1990s. That was a more decentralised model in which companies ran their own servers and were responsible for their own infrastructure.

US-based technology executive Lisa Rehurek is the founder and CEO of The RFP Success Company. Rehurek says a hybrid architecture – a combination of on-premise and cloud infrastructure – would have greatly reduced the chance of a single mistake causing such widespread problems.

“What the incident highlights is the potential risks of vendor lock-in,” she said. “Cloud providers offer many benefits but relying too much on one can increase an organisation’s vulnerability ... Keeping a hybrid set-up can help lower this risk.”

Transparency and communication were also issues during the outage, according to Rehurek. “There were concerns about the lack of clear and timely info from Google Cloud about the cause and fixes. Clear and prompt communication keeps trust and lets customers choose.”

Analyst firm Telsyte’s managing director Foad Fadaghi said firms are increasingly spreading their computing workload across multiple providers, rather than relying on just one.

Telsyte’s research shows that Amazon Web Services has a commanding share of Australia’s $4.3 billion infrastructure-as-a-service market, accounting for 42 per cent of the market, with Microsoft Azure at 27 per cent and Google Cloud at 13 per cent.

“Ironically, Google has benefited from being the second or third choice cloud provider in Australia,” he said.

“Google increased its market share in 2023 as Australian businesses look to de-risk and use multiple providers.”

For Gartner vice president and analyst Michael Warrilow, UniSuper did the right thing. “Failures at cloud mega vendors are rare. And they are typically the result of software bugs rather than hardware because of the cloud-native architecture that’s used,” he said.

“For every cloud outage there are many more outages with traditional IT environments. UniSuper showed best practice for cloud by having a backup in another provider,” Warrilow said.

“Other examples of best practice include monitoring, problem/incident and change management. [Using the] cloud doesn’t obviate the need to continue these disciplines.”

Most UniSuper customers – and members of the public more broadly – don’t care where the computing happens; they just want it to work.

Here is the link:

https://www.smh.com.au/technology/data-deleted-unisuper-outage-raises-lockout-fears-20240516-p5je2m.html

I have to say I am still wondering how this sort of failure can be so prolonged given the infrastructure UniSuper is using, but I guess you do rather shoot yourself in the foot with big deletions!

David.

1 comment:

  1. David,

    Your banner says it all:

    H. L. Mencken - "For every complex problem there is an answer that is clear, simple, and wrong."
