Here are the results of the poll.
Are National Telco Outages, Like The Optus National Event Last Week, Technically Preventable At Acceptable Cost?
Yes 32 (76%)
No 10 (24%)
I Have no Idea 0 (0%)
Total No. Of Votes: 42
A clear outcome, with a large majority feeling it is possible to stop huge national outages. Maybe just spend a bit more, more effectively?
Any insights on the poll are welcome, as a comment, as usual!
A good number of votes. But also a very clear outcome!
None of the 42 who answered the poll admitted to not being sure of the answer to the question!
Again, many, many thanks to all those who voted!
David.
Technically preventable, or at least a shorter time to repair (mitigation), but reducing/mitigating risk is harder in practice. The cost of each added step to reduce/mitigate grows almost exponentially.
The main focus is minimising the number of systems affected by problems & restoring service faster. They already use a lot of component, device, power, link & location redundancy, plus change management processes. Redundancy costs 2x, 3x, 4x as much, & it costs a lot to have technicians on standby, watching each update.
1) Duplicate systems to test changes on, before updating only a small part of the live system.
2) Once a few devices have been changed & checked, gradually roll out the changes in small batches & continue testing.
3) If a problem is identified it needs to be isolated from the rest of the system. Standby devices running the previous version need to be ready, or other methods of faster rollback (see the rollout sketch after this list).
4) More info on what is being updated & when.
5) Technicians on standby in person near some of the changes, or in critical locations ready to be deployed without relying on the broken network/phone system.
6) Use services/links supplied by other networks (adds redundancy) between offices/datacentres, in case work has to be coordinated when the primary network isn't operational.
7) Contact a taxi/Uber to drive to senior employees to relay a message when a critical problem is discovered (knock on the door at 5am if they have to).
8) An always-on device & software at their homes that detects a loss of service & alerts the technician/executive (see the watchdog sketch after this list). Clear procedures to follow (on paper or pre-saved to the device) in the event of each type of scenario (incl. people & locations to contact, how & when to contact media). A cloud service to monitor & log service problems, so some on-call technicians permanently using a competitor network can be alerted.
9) Copper landline to Telstra exchange for better chance to contact critical people.
10) Could the mobile phone, broadband & company systems be run more separately, instead of unified?
11) Adding redundancy adds another point of failure & relies on something knowing when to switch to a backup (see the failover sketch below).
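To make points 1-3 concrete, here is a minimal Python sketch of a batched rollout with a soak period & rollback. It is a sketch under assumptions, not real carrier tooling: the devices are toy dicts, & apply_update, health_check, rollback, BATCH_SIZE & SOAK_SECONDS are all hypothetical stand-ins.

import time

BATCH_SIZE = 5        # hypothetical: change only a small number of devices at a time
SOAK_SECONDS = 600    # hypothetical: watch each batch before touching the next

def apply_update(device, version):
    device["version"] = version            # stand-in for pushing a change to one device

def health_check(device):
    return device.get("healthy", True)     # stand-in for polling the device's status

def rollback(devices, previous_version):
    for d in devices:                      # previous version kept ready on standby (point 3)
        d["version"] = previous_version

def staged_rollout(devices, new_version, previous_version):
    updated = []
    for i in range(0, len(devices), BATCH_SIZE):
        batch = devices[i:i + BATCH_SIZE]
        for d in batch:
            apply_update(d, new_version)
        updated.extend(batch)
        time.sleep(SOAK_SECONDS)           # soak: let problems surface before continuing
        if not all(health_check(d) for d in updated):
            rollback(updated, previous_version)  # isolate the fault & revert everything touched
            raise RuntimeError("rollout halted & rolled back")
    return updated

The point of the structure is that a bad change can only ever hit BATCH_SIZE devices plus whatever the soak period misses, rather than the whole network at once.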
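And a toy sketch of the always-on watchdog in point 8: probe a service endpoint, & after a few consecutive failures raise an alert over an independent path. The endpoint, thresholds & alert action are all made up; a real box would page over a competitor network or the copper line in point 9.

import socket
import time

CHECK_HOST = "203.0.113.1"   # hypothetical endpoint (a TEST-NET documentation address)
CHECK_PORT = 443
FAILURES_BEFORE_ALERT = 3    # tolerate brief blips before waking anyone at 5am
CHECK_INTERVAL = 60          # seconds between probes

def service_reachable(host, port, timeout=5):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

def watchdog():
    consecutive_failures = 0
    while True:
        if service_reachable(CHECK_HOST, CHECK_PORT):
            consecutive_failures = 0
        else:
            consecutive_failures += 1
            if consecutive_failures == FAILURES_BEFORE_ALERT:
                # Alert over the independent path: competitor SIM, pager, or
                # the paper runbook from point 8.
                print("ALERT: primary service down, follow the outage runbook")
        time.sleep(CHECK_INTERVAL)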
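Finally, the caution in point 11, that something has to know when to switch, can be shown in a few lines. The link names & probe are hypothetical; the useful property is that a dead backup path is surfaced loudly rather than silently hiding behind the primary.

def pick_link(links_in_priority_order, is_up):
    """Return the first healthy link, e.g. own fibre before a leased competitor path."""
    # e.g. pick_link(["own-fibre", "competitor-wavelength"], some_probe_function)
    for name in links_in_priority_order:
        if is_up(name):
            return name
    # The switch-over logic is itself a point of failure (point 11), so the
    # double failure is raised instead of being swallowed.
    raise RuntimeError("no healthy link: primary & backup are both down")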
Or you can just replace the CEO, reset the clock and move on - the cheaper option.
Queensland Health just demonstrated to NSW Health what the future looks like with an iEMR. The root cause will be fascinating.