Chaos Engineering: What Is It & How Does It Work?
Distributed software systems have many points of failure. Can the process of chaos engineering help identify problems and gauge resiliency?
What is Chaos Engineering?
Chaos engineering is the process of experimenting with a distributed computer system by introducing unexpected disruptions in order to gauge the system's resiliency and to identify potential points of failure.
Chaos engineering is often mistaken for a form of testing. However, it is not testing, since tests are simply statements about a system’s known properties. A test makes assertions based on existing knowledge rather than uncovering new properties of the system.
On the other hand, experimentation is all about exploring the unknown. It creates new knowledge by proposing a hypothesis. If it’s proven, then the confidence grows and if it’s disproved, then we learn something new.
Chaos engineering is all about experimentation. Testing alone cannot match the insights it yields, because experiments can explore scenarios that never occur during testing.
How Chaos Engineering Started
Chaos engineering is not a very old concept. It started in 2008, when Netflix began migrating from data centers to the cloud. The idea was to avoid single points of failure such as large databases and vertically scaled components. Moving to the cloud would require horizontally scaled components, which would reduce the number of single points of failure. However, cloud deployment alone did not bring the expected boost in uptime.
Netflix ran on Amazon Web Services (AWS) cloud infrastructure, and it wanted to ensure that the loss of an Amazon instance wouldn’t affect the streaming experience. The risk was real: a huge AWS outage in 2012 left Netflix customers unable to select and stream videos.
In 2010, the engineering team at Netflix created Chaos Monkey, a tool to test the resilience of its IT infrastructure. Unlike most such practices, Chaos Monkey ran its experiments in the production environment during business hours rather than in a simulated environment that nobody was actually using. That way, the team could respond to any fallout with all of its resources on hand, and it knew that a real outage would have exactly the same effects. Running chaos experiments this way is risky, because it can affect real customers. Most organizations therefore run chaos experiments only in non-production environments.
Principles of Chaos Engineering
Despite what the name may suggest, chaos engineering follows a systematic approach. The following principles describe an ideal way to run experiments on your system.
Build a Hypothesis
The first principle of chaos engineering is to focus on the measurable output of a system instead of its internal attributes. Measuring that output over a short period shows you the system’s steady state. Metrics that represent a steady state include latency percentiles, error rates, and system throughput. You can then hypothesize how that output will change when an incident is introduced. By observing system-wide patterns during the experiment, you can verify whether those outputs actually reflect the health of your system.
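As a rough illustration of a steady state and a hypothesis built on it, the following Python sketch summarizes a batch of measurements as a 99th-percentile latency and an error rate, then checks a hypothesis against them. The function names, the sample numbers, and the thresholds (20% latency growth, 1% error rate) are all assumptions made for this example and do not come from any particular tool.

```python
from statistics import quantiles


def steady_state(latencies_ms, errors, requests):
    """Summarize externally measurable output as p99 latency and error rate."""
    p99 = quantiles(latencies_ms, n=100)[98]  # 99th percentile cut point
    return {"p99_ms": p99, "error_rate": errors / requests}


def hypothesis_holds(baseline, observed, max_p99_ratio=1.2, max_error_rate=0.01):
    """Assumed hypothesis: p99 grows by at most 20% and errors stay at or under 1%."""
    return (
        observed["p99_ms"] <= baseline["p99_ms"] * max_p99_ratio
        and observed["error_rate"] <= max_error_rate
    )


# Made-up samples: one from normal operation, one taken during an injected fault.
baseline = steady_state([100 + i for i in range(100)], errors=2, requests=1000)
during_fault = steady_state([150 + 2 * i for i in range(100)], errors=30, requests=1000)

print(hypothesis_holds(baseline, baseline))      # the steady state satisfies itself
print(hypothesis_holds(baseline, during_fault))  # latency and errors both exceed the limits
```

In a real experiment the samples would come from your monitoring system rather than hard-coded lists, and the thresholds would be agreed on before the fault is injected.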
Introduce New Real-world Events
Chaos variables reflect real-world events, specifically hardware and software failure, and non-failure events like a traffic spike. Each event is prioritized by its frequency or potential impact. Any event that has the potential to disrupt the system’s steady state is a potential variable in chaos experiments.
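One simple way to rank candidate events is sketched below in Python. The text only says that events are prioritized by frequency or potential impact, so multiplying the two into a single score is an assumption of this example, as are the event names, the yearly frequencies, and the 1-to-5 impact scale.

```python
from dataclasses import dataclass


@dataclass
class ChaosEvent:
    name: str
    yearly_frequency: float  # assumed: how often the event occurs per year
    impact: int              # assumed scale: 1 (minor) to 5 (severe)


def prioritize(events):
    """Order events by frequency times impact, highest score first (an assumed scoring rule)."""
    return sorted(events, key=lambda e: e.yearly_frequency * e.impact, reverse=True)


# Hypothetical candidates covering failures and a non-failure event.
candidates = [
    ChaosEvent("region outage", yearly_frequency=0.5, impact=5),
    ChaosEvent("traffic spike", yearly_frequency=4, impact=3),
    ChaosEvent("instance failure", yearly_frequency=12, impact=2),
    ChaosEvent("dependency timeout", yearly_frequency=50, impact=1),
]

ranked = [event.name for event in prioritize(candidates)]
print(ranked)
```

A product score favors frequent, low-impact events over rare, severe ones. A team that cares more about worst cases could sort by impact first instead.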
Run Experiments in Environments Matching Production
Systems behave differently depending on traffic patterns and environment. Because that behavior varies, the only reliable way to capture the request path is to model your testing environment accurately, mirroring real traffic conditions and usage.
Chaos experiments are usually run in a simulation of the production environment that is as accurate as possible. This is a protective measure to prevent a worst-case scenario that could occur when testing in production. You also need to have proper control over the system environment in case the experiment goes sideways.
Automate Experiments
Designing experiments is interesting work that requires a lot of creativity. However, running them manually takes engineers away from more meaningful work. Once an experiment is established, with its outputs logged and its range of conditions set, automate it so that it runs continuously and explores a wider range of conditions.
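The idea of automating an established experiment over a range of conditions can be sketched in Python as a simple parameter sweep. Here `simulated_system` is a made-up stand-in for a real system under test, and the injected latencies, packet-loss rates, and the 5% error threshold are assumptions chosen only to make the example run.

```python
from itertools import product


def simulated_system(latency_ms, packet_loss):
    """Stand-in for a real system: returns an error rate for the injected conditions."""
    return min(1.0, 0.001 + latency_ms / 100_000 + packet_loss * 0.5)


def sweep(latencies_ms, loss_rates, max_error_rate=0.05):
    """Run the same experiment for every combination of conditions and log each outcome."""
    results = []
    for latency_ms, loss in product(latencies_ms, loss_rates):
        error_rate = simulated_system(latency_ms, loss)
        results.append({
            "latency_ms": latency_ms,
            "packet_loss": loss,
            "error_rate": error_rate,
            "passed": error_rate <= max_error_rate,
        })
    return results


results = sweep(latencies_ms=[0, 100, 500], loss_rates=[0.0, 0.05, 0.2])
failures = [r for r in results if not r["passed"]]
for failure in failures:
    print(failure)
```

In practice a scheduler or CI job would call something like `sweep` on a regular basis, with the fault injection and measurement delegated to whatever chaos tooling you use.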
Understand Blast Radius
Experimenting in production is a bold move and can cause unnecessary pain for customers. While small incidents can be handled quickly, the chaos engineer must ensure that any fallout is contained and minimized. You should also have the incident response team on call to handle incident management.
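Two common ways to contain an experiment are to cap how many targets it may touch and to stop as soon as a health metric crosses a limit. The Python sketch below illustrates both. The host names, the 25% cap, the abort threshold, and the `inject` and `measure_error_rate` callables are hypothetical placeholders for real tooling.

```python
import random


def pick_targets(hosts, max_fraction, rng):
    """Choose at most max_fraction of the hosts (but at least one) for the experiment."""
    count = max(1, int(len(hosts) * max_fraction))
    return rng.sample(hosts, count)


def run_with_abort(targets, inject, measure_error_rate, abort_threshold):
    """Inject the fault one host at a time and stop once the error rate crosses the threshold."""
    affected = []
    for host in targets:
        inject(host)
        affected.append(host)
        if measure_error_rate(affected) > abort_threshold:
            return affected, True  # aborted early to contain the fallout
    return affected, False


hosts = [f"host-{i}" for i in range(20)]
targets = pick_targets(hosts, max_fraction=0.25, rng=random.Random(7))

# Placeholder fault and metric: each affected host adds one percentage point of errors.
affected, aborted = run_with_abort(
    targets,
    inject=lambda host: None,
    measure_error_rate=lambda hit: len(hit) / 100,
    abort_threshold=0.035,
)
print(len(targets), len(affected), aborted)
```

The cap limits the worst case before the experiment starts, while the abort check limits it while the experiment is running.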
Why is Chaos Engineering Important?
As your users depend more on your service, it becomes more and more important to be resilient to every possible incident, not just the ones you can plan for. If you suffer a major outage, your customers won’t accept “we didn’t think that could happen” as an excuse. Chaos engineering lets you explore the rare or strange events that you are unlikely to encounter in normal production.
Goals of Chaos Engineering
Chaos engineering works much like preventive medicine. The goal is to identify a failure before it becomes an outage by proactively testing the system under stress. You deliberately break things here and there to build a more resilient system.
Which Companies Practice Chaos Engineering?
Chaos engineering is a fairly new practice, and not many companies run chaos experiments at the moment. It’s generally practiced at big tech companies that use distributed systems and a microservices architecture. Some big names that practice chaos engineering are:
- Netflix
- Amazon
- Microsoft
- Twilio
Over the past few years, companies in traditional industries such as banking have also adopted chaos engineering. For example, the National Australia Bank migrated to AWS in 2014. After making the transition from physical to cloud infrastructure, the bank used Netflix’s Chaos Monkey tool for chaos engineering and significantly reduced its number of incidents.
Benefits of Chaos Engineering
Chaos engineering can be extremely beneficial to businesses in many ways. We will describe some business, technical, and customer benefits of chaos engineering in this section.
Business Benefits
With chaos engineering, companies can avoid big incidents and prevent lengthy outages and revenue losses. It also gives companies an opportunity to scale without compromising reliability by scoping out potential problems with scaling using chaos experiments.
Technical Benefits
Chaos engineering offers benefits to technical teams. Insights gained from the chaos experiments help reduce incidents and help teams figure out why certain incidents might happen ahead of time. Over time, the tech team gets a better understanding of the system modes and dependencies, which allows them to build a more robust system. Moreover, chaos experiments are an excellent opportunity for on-call engineers to practice incident management.
Customer Benefits
The goal of chaos engineering is to identify system vulnerabilities before they become an issue. Being proactive about system management results in fewer outages and less disruption for the end-user. The two major customer benefits of chaos engineering are availability and reliability.
How Does Chaos Engineering Work?
In chaos engineering, everything is an experiment and each experiment starts with a specific fault injected into the system. Later, the admins observe what actually happened and compare it to what they thought would happen.
Chaos engineering experiments generally involve two groups of engineers. The first group generally controls the failure injection and the second one deals with the effects. Here’s the step-by-step flow of chaos engineering experiments in practice:
- Define the system’s steady state as a measurable output that indicates normal behavior.
- Hypothesize how the system’s output will change in the experimental groups compared to the control group.
- Introduce variables that reflect real-life events ranging from hardware and software failure to non-failure events like traffic spikes.
- Work on disproving the hypothesis by comparing the system’s steady state in both control and experimental groups.
The harder it is to disrupt the system’s steady state, the more confidence you gain in its behavior. If you instead discover a weakness, work to fix it before the behavior manifests in the system at large.
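The flow above can be sketched as a small Python simulation that compares a control group with an experimental group. Everything here is a toy model built on assumptions: the three replicas, the routing by request ID, the retry behavior, and the 1% tolerance are invented for illustration.

```python
REPLICAS = ["a", "b", "c"]


def handle(request_id, down, retries):
    """Route by request id; if the chosen replica is down, retry on the next one."""
    for attempt in range(retries + 1):
        replica = REPLICAS[(request_id + attempt) % len(REPLICAS)]
        if replica not in down:
            return True
    return False


def success_rate(request_ids, down, retries):
    """Steady-state metric: the share of requests that succeed."""
    ok = sum(handle(r, down, retries) for r in request_ids)
    return ok / len(request_ids)


def run_experiment(retries, tolerance=0.01):
    """Hypothesis: losing replica 'b' does not lower the success rate by more than the tolerance."""
    control = success_rate(range(0, 300), down=set(), retries=retries)
    experiment = success_rate(range(300, 600), down={"b"}, retries=retries)
    return {
        "control": control,
        "experiment": experiment,
        "hypothesis_disproved": control - experiment > tolerance,
    }


print(run_experiment(retries=0))  # no retries: the fault shows up as failed requests
print(run_experiment(retries=1))  # one retry: the steady state survives the fault
```

The first run exposes a weakness to fix (requests pinned to a dead replica), while the second shows the outcome that builds confidence.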
Who Uses Chaos Engineering?
Chaos engineering involves multiple stakeholders, because chaos experiments affect a wide array of technology and decisions. It requires a holistic view that draws on the perspectives of developers, operators, customer success teams, and others. As the blast radius of a chaos experiment widens, so does the number of stakeholders from various teams.
The teams involved depend on the size of the experiment. For example, when the blast radius is small and manageable, the application development team can generally run the experiments without fear of breaking the container. For a wider blast radius, such as testing Kubernetes infrastructure, the application development team needs to involve the platform engineering teams.
Chaos Engineering Maturity Model
As organizations mature and expand, chaos engineering offers more opportunities. However, these opportunities do not come without their challenges. We will break down the opportunities and challenges that you can expect at various stages of maturity.
It’s important to note that chaos engineering offers many benefits regardless of your organization’s maturity level. Starting sooner gives you more time to develop your expertise in the area.
Chaos Engineering Tools
Chaos Monkey
Developed by Netflix in 2010, Chaos Monkey was the first chaos engineering tool. It’s an open-source tool designed to test AWS infrastructure. Today, many companies besides Netflix use Chaos Monkey.
Chaos Kong
Chaos Kong is another chaos engineering tool developed by Netflix. Both Chaos Monkey and Chaos Kong are designed to test the AWS architecture, but Chaos Kong simulates a complete AWS region being dropped. This allows the team to respond and recover by moving the traffic to another region without affecting performance.
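The question a region-drop exercise answers is whether the surviving regions can absorb the displaced traffic. The Python sketch below is a toy capacity model of that question and is not how Chaos Kong itself is implemented. The region names, demand and capacity figures, and the rule of splitting traffic in proportion to spare capacity are all assumptions of this example.

```python
def evacuate(demand, capacity, failed):
    """Move the failed region's demand to the survivors in proportion to their spare capacity."""
    survivors = [region for region in demand if region != failed]
    spare = {region: capacity[region] - demand[region] for region in survivors}
    total_spare = sum(spare.values())
    displaced = demand[failed]

    new_demand = {}
    for region in survivors:
        if total_spare > 0:
            share = displaced * spare[region] / total_spare
        else:
            share = displaced / len(survivors)  # no headroom anywhere: split evenly
        new_demand[region] = demand[region] + share

    overloaded = [region for region in survivors if new_demand[region] > capacity[region]]
    return new_demand, overloaded


# Hypothetical regions with requests-per-second demand.
demand = {"region-a": 400, "region-b": 300, "region-c": 300}

roomy = {"region-a": 600, "region-b": 600, "region-c": 600}
tight = {"region-a": 600, "region-b": 450, "region-c": 450}

print(evacuate(demand, roomy, failed="region-a"))  # survivors absorb the traffic
print(evacuate(demand, tight, failed="region-a"))  # both survivors end up over capacity
```

A model like this only estimates headroom. The point of running the real exercise is to find out whether routing, caches, and dependencies behave as the estimate assumes.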
The Simian Army
The Simian Army is a set of tools that introduce various failures into a system to test its ability to survive them. It was inspired by Chaos Monkey and includes tools such as Janitor Monkey, Latency Monkey, Security Monkey, and Doctor Monkey.
Gremlin Platform
The Gremlin platform is a SaaS solution that is designed to improve web-based reliability. It helps clients set up and control chaos experiments and it can test an entire system based on different parameters and conditions.
AWS Fault Injection Simulator
Focused on chaos testing AWS services, AWS Fault Injection Simulator (FIS) is a managed service that enables users to perform fault injection experiments on their AWS workloads. These experiments create disruptive events to observe how an application will respond to stress.
Chaos Engineering in SRE and DevOps
The goal of chaos engineering is to improve a system’s reliability and resilience, which makes it an essential part of any mature SRE (site reliability engineering) solution. The chaos engineering mindset also helps DevOps teams work with unpredictability. Many SRE practices such as SLOs (service level objectives), retrospectives, and runbooks can integrate with chaos engineering to improve efficiency.
Of these practices, SLOs have the most important effect on chaos engineering. The team needs to determine the impact of a hypothetical failure, which is not easy. For example, an experiment that takes an entire service offline for about an hour sounds frightening. But what if only a small fraction of your customers use that service? Alternatively, imagine that traffic surpasses a given threshold and a certain page starts loading three seconds slower for every customer. SLOs allow teams to determine which scenario affects customers most.
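To make the comparison between the two scenarios concrete, the Python sketch below expresses each one as a share of an error budget. Every number in it is an assumption chosen for illustration: a 99.9% SLO over a 30-day window, 2% of customers using the offline service, a slowdown that lasts 20 minutes, and an SLO under which a page load delayed by three seconds counts as a bad event. Weighting downtime by the fraction of customers affected is one common simplification and does not cover every way of measuring an SLO.

```python
def error_budget_minutes(slo_target, window_days=30):
    """Total 'bad' minutes the SLO tolerates over the window."""
    return (1 - slo_target) * window_days * 24 * 60


def budget_consumed(affected_fraction, duration_minutes, slo_target, window_days=30):
    """Share of the error budget an incident uses, weighting duration by customers affected."""
    bad_minutes = affected_fraction * duration_minutes
    return bad_minutes / error_budget_minutes(slo_target, window_days)


SLO_TARGET = 0.999  # assumed target

# Scenario 1: a service is fully offline for an hour, but only 2% of customers use it.
outage = budget_consumed(affected_fraction=0.02, duration_minutes=60, slo_target=SLO_TARGET)

# Scenario 2: a page is three seconds slower for every customer for 20 minutes.
slowdown = budget_consumed(affected_fraction=1.0, duration_minutes=20, slo_target=SLO_TARGET)

print(f"outage uses {outage:.1%} of the budget, slowdown uses {slowdown:.1%}")
```

Under these assumptions the slowdown that sounds minor consumes far more of the budget than the outage that sounds frightening. That is the kind of ranking an SLO makes explicit.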
Chaos engineering also helps SRE teams improve their runbooks by giving them more opportunities to evaluate them. It also helps them build a library of incident retrospectives as teams write retrospectives for chaos experiments as they would for a real event.
How Can Blameless Help?
Nowadays, you cannot afford outages, and chaos engineering is your best bet against them. Blameless tools such as our SLO Manager, incident retrospectives, and incident resolution can help your organization make the most of your chaos engineering experiments. To learn more, sign up for the newsletter or request a demo.