The importance of ‘live testing’ – learning from TSB’s error

Headline story: “TSB lacked common sense before IT meltdown, says the report.” In a word – ouch.

Several interesting points can be drawn from this headline and from the articles that followed it. The two that stand out most to me are:

It specifically mentions IT – because banks, like a vast and growing number of companies today, are IT businesses.
The wording – “TSB lacked…”. Not “TSB’s IT department lacked…” or “TSB’s CIO lacked…” – but a reference to the business as a whole. Because, like security issues, IT issues in an IT company are now a boardroom topic.

These points, and a firm grasp of the reasons behind them, should drive home the importance of testing. They should also highlight the importance of the frequency, depth and duration of that testing and, most importantly, the level of risk associated with the activity that the testing supports.

What is live testing?

It’s quite common in IT-driven companies to have significant, segregated test platforms, known as “pre-production environments” or “pre-production platforms”. They mirror the technology and configuration of the live system on independent hardware and systems. Less commonly, depending on the level of risk involved in that system, they also mirror the scale and capacity of the live systems, which may be needed to prove performance. It is important to note that this is not the same as the development or traditional test environment: it is controlled in the same way that the live environment is, and quite often it is managed not by the development team but by the infrastructure team to ensure this control.

This pre-production platform is where most testing takes place. It provides a safe space that is controlled in terms of change, which ensures the test is true and that the results can be trusted. It can be quite expensive to deploy, maintain and manage – but that effort and expenditure are worth it to mitigate the risk of untested changes impacting the live environment.

However, in a system that is sufficiently large-scale, or perhaps public-facing, it is far more difficult to simulate the load and behaviour of the REAL user base in a pre-production platform. The sheer scale of the production platform and the associated workload can affect how a system accepts change or behaves, compared to the test platform. Add to that the complexity of human behaviour, which can be quite unpredictable, and you introduce an element of risk that demands attention. This is where live testing comes into its own.

In live testing, when a change is pushed to live, it is effectively on a trial basis. It may go to a subset of users initially, allowing a control group to interact with and ‘use’ the system on a live basis. Or it may be that a maintenance window is created to introduce that control. Either way, the key is that the change is not signed off until all the test criteria have been met in production, and the performance and stability of the platform have been observed to be correct. Until that point, the rollback plan, which should also have been thoroughly proven in pre-production, should be ‘waiting in the wings’, ready to run.
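As a rough illustration of the trial-basis idea, the sketch below shows one way a subset of users could be selected and how a sign-off decision might be gated on what is observed in production. It is a minimal sketch under stated assumptions, not a description of any particular bank's or vendor's process: the names `in_trial_group`, `LiveObservation` and `decide`, and the error-rate and latency thresholds, are all hypothetical.

```python
import hashlib
from dataclasses import dataclass


def in_trial_group(user_id: str, percent: int) -> bool:
    """Deterministically place a user in the trial group.

    Hashing the user ID means the same user always gets the same answer,
    so their experience does not flip between old and new behaviour.
    """
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") % 100  # value in 0..99
    return bucket < percent


@dataclass
class LiveObservation:
    """Hypothetical measurements taken in production during the trial."""
    criteria_passed: int    # test criteria met so far in production
    criteria_total: int     # test criteria agreed before the change
    error_rate: float       # fraction of failed requests, 0.0 to 1.0
    p95_latency_ms: float   # 95th percentile response time


def decide(obs: LiveObservation,
           max_error_rate: float = 0.01,
           max_p95_latency_ms: float = 500.0) -> str:
    """Return 'rollback', 'keep_observing' or 'sign_off'.

    The thresholds are illustrative assumptions; real values would come
    from the agreed test plan for the system in question.
    """
    healthy = (obs.error_rate <= max_error_rate
               and obs.p95_latency_ms <= max_p95_latency_ms)
    if not healthy:
        return "rollback"          # run the pre-proven rollback plan
    if obs.criteria_passed < obs.criteria_total:
        return "keep_observing"    # healthy, but not yet fully tested
    return "sign_off"              # all criteria met in production
```

For example, `in_trial_group(user_id, 5)` would route roughly 5% of users to the change, and `decide(...)` would only return `"sign_off"` once every agreed criterion has passed while the platform remains within its limits. The point of the design is that sign-off is an explicit outcome of observation in production, and rollback is always an available answer.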

How do we make sure it doesn’t happen to us?

I can only imagine the internal conversations that happened at TSB in the hours and days following the outage, but I suspect they included such statements as “it was tested” and “we got sign-off for the change”. But I would ask: were the risk and impact of that change properly understood when it was signed off?

The key question a leader should be asking when anyone in IT, or indeed the whole business, mentions words like ‘patching’, ‘updates’, ‘roll-out’ or ‘migration’ is “what is the worst that can happen?” This should be closely followed by “what is the test plan?” and “what is the rollback plan, and has it been tested?”

What they should be asking themselves, and demanding their reports ask themselves before presenting the answers is, “Does the depth and rigour of testing, and rollback planning, reflect the level of operational and/or financial impact associated with an outage to the specific system/service in question?”

Sometimes it is as simple as numbers (e.g. retail), sometimes it is SLAs or contractual obligations (e.g. finance or law), sometimes it is reputation (public transport) and sometimes it is a combination of these factors.
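One way to make that question concrete is to tie the minimum level of testing to the worst-case impact of an outage. The sketch below is purely illustrative: the three impact dimensions follow the examples above (numbers, contractual obligations, reputation), but the `Impact` scale, the `REQUIRED_RIGOUR` mapping and the `required_rigour` function are hypothetical, and any real organisation would define its own tiers.

```python
from enum import IntEnum


class Impact(IntEnum):
    """Hypothetical three-point scale for the impact of an outage."""
    LOW = 1
    MEDIUM = 2
    HIGH = 3


# Illustrative mapping from worst-case impact to the minimum rigour
# expected before a change is approved. These tiers are assumptions.
REQUIRED_RIGOUR = {
    Impact.LOW: [
        "test in pre-production",
    ],
    Impact.MEDIUM: [
        "test in pre-production",
        "rehearse the rollback plan in pre-production",
    ],
    Impact.HIGH: [
        "test in pre-production",
        "rehearse the rollback plan in pre-production",
        "live test with a subset of users or a maintenance window",
    ],
}


def required_rigour(revenue: Impact,
                    contractual: Impact,
                    reputation: Impact) -> list[str]:
    """Return the minimum testing steps for the worst of the three impacts.

    Taking the maximum means a change cannot be treated as low risk
    just because two of the three dimensions look harmless.
    """
    worst = max(revenue, contractual, reputation)
    return REQUIRED_RIGOUR[worst]
```

Under these assumptions, a change with low revenue impact but high contractual impact would still call for a live test, because the rigour follows the worst single dimension rather than an average.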

The important point to understand at C-level, and to drive into the culture within IT, is that change control and risk decisions are not just IT decisions anymore – they are business decisions. It is the responsibility of everyone involved, regardless of whether they work in IT, to take the time to effectively understand and communicate both the risk and the impact all the way up to C-level, and inspect all the way back down to the planning level, to ensure that what has been planned is appropriate.

Then, when the worst happens and something comes out of the blue, we can roll back – and we know the rollback will work… because we tested it.

For more information about pre-production environments and IT project planning, contact us. We’re happy to help.

Enterprise Architect

David Murphy

I look after technical presales and architecture functions across Acora’s customer environments, our private cloud platform and our own internal infrastructure. In my 20+ years, I’ve worked on service desk, in engineering and design, and as a consultant and technical account manager, giving me a deep understanding of a whole range of client issues. I’m here to help clients make good technology decisions and ensure a smooth implementation journey that delivers real value and business benefits.
