Bouncing back. The role of resilience in IT.

Richard Kempsey, Content Writer, Brennan

Never let a good crisis go to waste. And as the dust settles on what has been labelled the largest IT outage of all time, many are using the episode to entirely reboot the concept of resilience. But what are some of the practical considerations IT organisations can act on to be better prepared for whatever comes next?

In the words of the philosopher Alain de Botton, resilience is “a good half of the art of living”. As the waves of chaos unleashed by the recent global CrowdStrike outage recede, it turns out that resilience may also be the smarter half of the cost of doing business.

Many heaved a sigh of relief that the incident wasn’t malicious. That will be cold comfort to the many businesses and their customers who lost billions in revenue and millions of hours to disruption, and who then bore the expense of technical remediation.

The global outage is another timely reminder not only of how deeply interlinked the world’s technology infrastructure, systems, and software are, but of how this interdependence is both their strength and their weakness.

Even if your organisation wasn’t directly impacted, odds are it felt the effects indirectly. When the wider system folds, it doesn’t matter how robust your own walls are if a wrecking ball crashes into the common wall you share with your neighbours.

Business continuity is how a company reacts in times of trouble. Resilience is what keeps the trouble from taking hold in the first place. In this particular incident, the software and hardware vendors in question have released mea culpas and long lists of practical, technical, and ethical upgrades. All of which is needed. But perhaps one of the most discussed and arguably most beneficial upsides is the renewed focus on systems and infrastructure resilience, and what it might look like.

These are five thought starters on our minds here at Brennan.

1. A connected response. A thorough review.

When you smashed the “Break in case of emergency” glass, what happened? Much will have been revealed over the past month, including the durability of your strategies, plans, and protocols. Did a well-oiled disaster response plan kick in? Were back-up systems in place and ready to deploy? Were recovery protocols followed? Was your in-house and partner support able to turn on a dime? Was everyone singing from the same hymn sheet? Had you run drills to pressure-test your responses? If so, how did the actual delivery match up? A Post Incident Review (PIR), prepared internally and augmented by third-party providers, can be an indispensable tool for forensically analysing what happened, and what didn’t.

2. Cool heads. Steady hands.

As bad as the outage was, the killer wasn’t the disruption alone. It was the physical, hands-on intervention needed to remediate the 8.5 million affected machines – some of which were in extremely far-flung locations or physically inaccessible spots. Automated remediation works up to a point. But in this instance, there was no substitute for cool heads and steady hands. In short, living, breathing humans, responsive to calls, proficient in the challenges, and adept in knowing and deploying the fixes. From CIOs rolling up their sleeves and going desk-to-desk to roll out resets, to tech specialists driving through the night to outback locations, stories of how the tribe came together were a humbling reminder of how the tech community rallies during a crisis.

3. Know thy neighbours.

One of the defining characteristics of any organisation’s value chain is that risk no longer sits just within your own walls. It’s interlinked with the customers you serve, the partners you align with, and even their partners. When one goes, all go. While the Security of Critical Infrastructure (SOCI) Act holds organisations to account across eleven sectors earmarked as critical to Australia’s sovereignty, security, and economy, events of the nature we’ve just experienced highlight the value of knowing where your neighbours’ systems and infrastructure intersect with yours, how response-ready they are in times of crisis, and their appetite for and activity in mitigating risk.

4. Think disastrously.

Whether it’s a cyber breach, an IT failure, human error, or a natural disaster, the damage inflicted by unexpected events may look different, but the impact hurts the same. Over the past few years, cyber risks and threats have dominated discussions on prevention and mitigation strategies. But the recent outage has underscored the need for organisations of all stripes to take a 30,000-foot view of risk tolerance, risk assessment, and risk management. Whether it’s a renewed focus on recoverability (driven by robust, always-on backup protocols), a drive to inject more diversity across infrastructure and operating systems, a doubling down on policies and procedures, or a mix of all three, there’s no substitute for a higher-order interrogation of what can go wrong to inform the program of works needed to avoid the pain.

5. Be strategic. Get buy-in.

Organisations can’t control external threats. But they can control their preparedness. And if resilience and contingency planning weren’t already on the Board’s and ELT’s radar, they are now. Core to this is cultivating a culture of resilience that’s baked in from top to bottom, and embedding robust contingency plans that encompass infrastructure and key business operations. These need to be vetted, assessed for effectiveness, presented to business leads, and run through with the boots on the ground, inside and outside your organisation. Have you scheduled regular backup and recovery drills? Have loss-of-access scenarios been factored in? Who has access to the incident response plan? Is it regularly reviewed for clarity and efficacy? Should you adopt a proactive posture with pen testing? What are the critical controls you need in place at three months, six months, a year? These are some of the starting points worth considering.

The worldwide outage was painful, costly and, for most organisations, unavoidable. Is it possible to predict, deflect, or avoid every future risk? Of course not. But pain breeds innovation, innovation fosters resilience, and the cost of inertia will always outweigh the price of proactivity. Bet on the future. It’s going to happen anyway.
