Information technology (IT) is at the core of operations across industries, and even the slightest disruption in IT services can heavily dent an enterprise’s business performance.
Uncertainty caused by events like the COVID-19 pandemic has forced business models to change on an urgent basis, and has thrown light on IT’s challenges in terms of scalability, security, and business sustenance. We have seen in the past how technological glitches—for instance the AWS outage and the Salesforce outage of May 2021—have resulted in revenue loss and customer dissatisfaction.
Technology-related disruptions that impact business outcomes include:
- Resource unavailability
- Vulnerabilities, patches, and updates
- Identity and access management
- Insecure services enabled
- Performance issues
- Dynamic scalability
These risks are dynamic in nature, with complex interdependencies between them. Amid such chaotic technological disruptions, what should enterprises do to mitigate technology-related risks and ensure business resilience?
Chaos engineering to strengthen resilience
Typically, different roles and teams under IT (such as application development, infrastructure and operations, testing and quality analysis, security, and site reliability) support and operationalize business systems. For a business to function seamlessly, these teams need to work together in harmony. In practice, however, we have seen these functions work in silos or with very limited interaction (for example, DevOps), and resilience, security, and performance are not considered integral to application development and delivery.
There is a need for a framework that enables an alliance between IT teams to proactively mitigate technology-related risks. The chaos engineering framework bridges this gap. It allows IT teams to incorporate nonfunctional requirements from the ground up during the application lifecycle. It enables them to test hypotheses or assumptions in the real world and check system resilience.
The framework allows teams to introduce precise and measured amounts of failure and error into the system, for the purpose of improving business resilience. It enables organizations to prioritize the business services that will benefit most from improved resilience. It helps businesses investigate vulnerabilities in the technology ecosystem and apply resilience patterns.
It offers a gamut of capabilities, such as:
- Understanding of system modes and dependencies
- Monitoring, tracing, and observing behavior of IT systems
- Checking effectiveness of incident response process in case of emergencies
- Testing out stability patterns
- Identifying weaknesses and bugs that can cause business outages
- Performing blameless postmortems
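To make the idea of a "precise and measured amount of failure" concrete, here is a minimal Python sketch. It is an illustration only, not the method of any particular chaos tool: the names `with_fault_injection`, `InjectedFault`, `lookup_price`, and `lookup_price_with_fallback` are hypothetical, and the 20% failure rate is an assumed value. The wrapper makes a chosen fraction of calls fail, so a team can observe whether a simple resilience pattern (here, a fallback value) keeps the caller working.

```python
import random


class InjectedFault(Exception):
    """Raised in place of a real failure during a chaos experiment."""


def with_fault_injection(func, failure_rate, rng):
    """Wrap func so that roughly failure_rate of its calls raise InjectedFault."""
    if not 0.0 <= failure_rate <= 1.0:
        raise ValueError("failure_rate must be between 0 and 1")

    def wrapper(*args, **kwargs):
        # Fail a measured fraction of calls; pass the rest through unchanged.
        if rng.random() < failure_rate:
            raise InjectedFault("injected by chaos experiment")
        return func(*args, **kwargs)

    return wrapper


def lookup_price(item):
    # Hypothetical stand-in for a call to a downstream service.
    return {"book": 12.0, "pen": 1.5}[item]


def lookup_price_with_fallback(flaky_lookup, item, default=0.0):
    # A simple resilience pattern: fall back to a default when the call fails.
    try:
        return flaky_lookup(item)
    except InjectedFault:
        return default


rng = random.Random(42)  # seeded so the experiment is repeatable
flaky = with_fault_injection(lookup_price, failure_rate=0.2, rng=rng)
results = [lookup_price_with_fallback(flaky, "book") for _ in range(1000)]
fallbacks = results.count(0.0)
print(f"{fallbacks} of {len(results)} calls used the fallback")
```

Passing the random generator in explicitly keeps the amount of injected failure controlled and repeatable, which is what separates an experiment from an accident.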
A collaboration, not an IT delivery function
Chaos engineering experiments cannot be treated as just another function of IT delivery. They require a change in mindset and a collaborative effort between the different stakeholders involved in the application lifecycle: chaos specialists, production support, incident management, domain experts, testing experts, and DevOps teams.
Adopting chaos engineering is a journey that involves building competency in a structured and collaborative manner. Below is a sample maturity model for adopting chaos engineering:
- Unit and nonfunctional testing of application ecosystem
- Integrated testing
- Performance testing
- Penetration testing
- Chaos testing
Organizations can start small when implementing a chaos engineering framework.
A typical chaos experiment process involves identifying and prioritizing the business function where resilience is needed, and defining its steady state. A chaos team then identifies failure scenarios, monitors key metrics, defines the blast radius of the experiment, gets buy-in from the business, communicates with the stakeholders, and plans the game day for resilience testing.
On the game day, the team conducts the experiment, and performs blameless game day incident analysis and postmortem of the experiment. The chaos team then identifies the action and resilience pattern required for the resolution. As part of the next step, they coordinate with the team concerned to test and validate the resilience pattern in a test or preproduction environment, before applying the changes to the production environment, and plan for the next game day.
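The experiment flow described above can be sketched in a few lines of Python. This is a simplified simulation under stated assumptions, not a production tool: the `Experiment` fields, the `handle_request` stand-in, the 90% replica success rate, and the 99% steady-state threshold are all hypothetical values chosen for illustration. The sketch exposes only a limited fraction of requests to the fault (the blast radius) and then checks whether the steady-state hypothesis still holds.

```python
import random
from dataclasses import dataclass


@dataclass
class Experiment:
    name: str
    blast_radius: float      # fraction of requests exposed to the fault
    min_success_rate: float  # steady-state hypothesis threshold


def handle_request(fault_active, rng):
    """Hypothetical service call; returns True on success.

    When the fault is active, the primary dependency is treated as down and
    a retry against a replica is assumed to succeed 90% of the time.
    """
    if not fault_active:
        return True
    return rng.random() < 0.9


def run_experiment(exp, total_requests, rng):
    """Inject the fault within the blast radius and evaluate the steady state."""
    successes = 0
    exposed = 0
    for _ in range(total_requests):
        fault_active = rng.random() < exp.blast_radius
        if fault_active:
            exposed += 1
        if handle_request(fault_active, rng):
            successes += 1
    success_rate = successes / total_requests
    return {
        "experiment": exp.name,
        "exposed_requests": exposed,
        "success_rate": success_rate,
        "steady_state_held": success_rate >= exp.min_success_rate,
    }


exp = Experiment("primary-db-unavailable", blast_radius=0.05, min_success_rate=0.99)
report = run_experiment(exp, total_requests=10_000, rng=random.Random(7))
print(report)
```

If `steady_state_held` comes back false, that result feeds the blameless postmortem: the team picks a resilience pattern, validates it in a test or preproduction environment, and reruns the experiment on the next game day. Keeping the blast radius small limits how many real users a failed hypothesis can affect.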
Many commercial and open-source tools are available for conducting chaos experiments. Depending on their IT environment and internal capabilities, organizations can choose the tools that best fit these experiments.
By leveraging chaos engineering, businesses can build resilience by reducing surprises and the frequency and severity of disruptions, which in turn enables them to deliver definitive value to their customers.