Today's software systems are built from advanced, complex components; they are inherently distributed and depend on many other platforms in the industry. Even a simple software system is likely to consist of a couple of microservices, some cloud infrastructure and a mobile setup. Most of these systems depend heavily on Cloud Service Providers like AWS, Google Cloud and Azure, which become core dependencies that the system cannot survive without.
How confident are you about your system when you develop a software platform? What if your cloud provider goes down for 8 hours? What if your system load increases by 10 times? You never know until this actually happens in your production environment.
As your system grows and takes on SLAs and customers paying a licensing fee, those customers expect your software platform to stay available without interruption, for the sake of their business continuity. To provide an uninterrupted service, you need to prepare for any kind of chaos that can happen in production.
This is where Chaos Engineering comes into practice.
Chaos Engineering is the Art of breaking things in Production.
Site Reliability Engineering (SRE) plays a vital role in Chaos Engineering; it is all about ensuring the reliability of the site, even if half of the production system goes down. Sounds unrealistic? Well, this article will give an introduction to Chaos Engineering and how this should be practiced in your organisation to build more resilient systems for your customers.
In this article, I would like to talk about the following topics.
What is Chaos Engineering?
Chaos Engineering is the discipline of experimenting on a system in order to build confidence in the system’s capability to withstand turbulent conditions in production. If you have ever run distributed systems in production, you know very well that something is bound to go wrong. This is because these systems depend on many other components, and they need those interactions to survive and to function properly. The number of ways your system can go down is enormous: a network failure, an identity provider (IdP) failure, unstable pods, a surge in user traffic and many more.
When the above incidents start occurring, performance becomes poor, outages are triggered and so on. This is why it is important to identify these issues beforehand and prepare for them, to prevent future outages from happening.
Moreover, most of these platforms have Service Level Agreements (SLAs) with their users, promising uninterrupted service uptime. Violating an SLA is not just about credit discounts; it also affects your reliability and competitiveness in the industry. Furthermore, whether or not it is bound by a legal document, certain performance drops or outages can cause serious losses for an organisation.
Chaos Engineering is the method of simulating these outages in a production environment, bringing systemic weaknesses to light. It is experimentation to ensure that your system can withstand turbulent situations if they occur. Chaos Engineering is an empirical process where verification leads to more resilient systems and builds confidence in the operational behaviour of those systems. It can be as simple as killing a few services or as drastic as disconnecting an entire cloud datacenter.
We learn about the behavior of a distributed system by observing it during a controlled experiment. We call this Chaos Engineering.
Chaos doesn't cause problems, it reveals them
As Site Reliability Engineers (SREs), we want to be confident that our systems are resilient enough to withstand any chaotic situation. With Chaos Engineering, you can address those weaknesses proactively, going beyond the reactive processes that currently dominate most incident response models.
In a nutshell, Chaos Engineering is:
What Chaos Engineering is NOT…
These are common misconceptions that I want to point out in this article. The following are NOT Chaos Engineering practices:
How is Chaos Engineering Different from Testing Procedures?
Chaos Engineering is an experimental procedure. There is a fine distinction between testing and experimentation. In testing, an assertion is made: given specific conditions, a system will emit a specific output based on the given specifications. Tests are typically binary and determine whether a property is true or false. Strictly speaking, this does not generate new knowledge about the system; it just assigns valence to a known property of it.
Experimentation generates new knowledge, and often suggests new avenues of exploration. Chaos Engineering draws on a range of methods to generate that knowledge. For example, if you want to uncover complex behavioural defects in a system, injecting communication failures is often a good choice.
It is important to understand this, because some engineers might say that they are confident about their product or system, after proper unit testing and integration tests. This is true. No argument about that. Testing is the first phase of making sure that you're confident about your system. But it is not enough.
Resilience is about resisting shocks and continuing to operate as before. This is only one part of Chaos Engineering. The best part is about exposing the weak points and building a highly confident system on top of what you learn from them.
What is “Chaos Monkey”?
I also want to give a brief introduction to Chaos Monkey, which is very famous and provides historical context for what Chaos Engineering is really about. Chaos Monkey is a tool invented in 2011 by Netflix to test the resilience of its IT infrastructure. It works by intentionally disabling computers in Netflix's production network to test how the remaining systems respond to the outage.
The name "Chaos Monkey" is explained in the book Chaos Monkeys by Antonio Garcia Martinez:
“Imagine a monkey entering a 'data center', these 'farms' of servers that host all the critical functions of our online activities. The monkey randomly rips cables, destroys devices and returns everything that passes by the hand [i.e. flings excrement]. The challenge for IT managers is to design the information system they are responsible for so that it can work despite these monkeys, which no one ever knows when they arrive and what they will destroy.”
Netflix has built an entire army of monkeys to simulate chaotic situations in the production environment, and this is called the Simian Army. Some famous monkeys are...
Principles of Chaos Engineering
In this section, I would like to describe the advanced principles of Chaos Engineering and how Chaos Engineering can be practiced in your organisation. Always think of Chaos Engineering as an empirical approach where you explore the weak points of your software system. There are 5 main principles.
The entire story of Chaos Engineering is wrapped around the diagram below.
Let's have a look at each of these in detail.
Principle 1: Build a Hypothesis around Steady State Behavior
This can be broken down into two sections. It is important to identify the “Steady State” of your system and “how to build a hypothesis” around it.
What is Steady State?
Steady state is the state your system is in when it is considered steady. This is similar to humans: we call a person “steady” when they are in a certain condition of health. Similarly, steady state is a measurable output of your system’s behaviour, such as overall throughput, error rates and latency percentiles. Formulate these numbers into a “state” and you can say “our system is steady when its metrics are within this range”. An example steady state is given below.
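As a minimal sketch, a steady state can be written down as a set of thresholds and checked against observed metrics. The `SteadyState` class, the `is_steady` function, the metric names and every number here are assumptions made for illustration; real values have to come from your own monitoring.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SteadyState:
    """Hypothetical thresholds that define 'steady' for one service."""
    min_throughput_rps: float   # requests per second the system should sustain
    max_error_rate: float       # fraction of failed requests, 0.0 to 1.0
    max_p99_latency_ms: float   # ceiling for the 99th percentile latency


def is_steady(state: SteadyState, throughput_rps: float,
              error_rate: float, p99_latency_ms: float) -> bool:
    """Return True when every observed metric is inside its allowed range."""
    return (throughput_rps >= state.min_throughput_rps
            and error_rate <= state.max_error_rate
            and p99_latency_ms <= state.max_p99_latency_ms)


# Illustrative numbers only, not taken from any real system.
checkout_steady = SteadyState(min_throughput_rps=500,
                              max_error_rate=0.01,
                              max_p99_latency_ms=300)

print(is_steady(checkout_steady, 620, 0.004, 240))  # every metric in range
print(is_steady(checkout_steady, 620, 0.030, 240))  # error rate too high
```

Keeping the definition in one small object makes it easy to reuse the same check before, during and after an experiment.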
Build the Hypotheses
Now that the steady state is decided, you can simply build multiple hypotheses around it. Think of these as the “what if” questions.
Think of things that can possibly go wrong in the production environment. But always make sure of the following:
Don’t make a hypothesis that you know will break you. Why? Because, if you know that it will break you, you can simply fix it or ignore it. You don’t really need an experiment to test it out, right? Chaos Engineering experiments could be expensive and catastrophic. Hence, always use them to identify unknown vulnerabilities of your system.
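A "what if" question can be recorded as a small data structure: the fault you plan to inject, the metric you will watch, and how far that metric may drift before the hypothesis counts as refuted. This is only a sketch; the `Hypothesis` class, its fields, the relative tolerance rule and the numbers are all assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hypothesis:
    """A 'what if' question paired with the outcome we expect."""
    what_if: str       # the fault we plan to inject
    metric: str        # the steady-state metric we will watch
    baseline: float    # value of the metric before the experiment
    tolerance: float   # allowed relative deviation, e.g. 0.10 means 10%

    def holds(self, observed: float) -> bool:
        """True when the observed value stays within tolerance of baseline."""
        return abs(observed - self.baseline) <= self.tolerance * self.baseline


# Hypothetical experiment: we do not yet know how the system will react.
h = Hypothesis(what_if="one of three cache nodes is terminated",
               metric="p99_latency_ms", baseline=200.0, tolerance=0.10)

print(h.holds(215.0))  # 7.5% above baseline, so the hypothesis holds
print(h.holds(260.0))  # 30% above baseline, so the hypothesis is refuted
```

A refuted hypothesis is the useful outcome here, because it points at a weakness you did not know about.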
Principle 2: Vary Real-world Events
Always consider events that are plausible and real. This decision can come with years of experience in the industry, where certain events seem realistic and some are not. Prioritise events by either potential impact or estimated frequency. Consider events that correspond to hardware failures like servers dying, software failures like malformed responses, and non-failure events like a spike in traffic or a scaling event. Any event capable of disrupting steady state is a potential variable in a Chaos experiment.
Some example events are as follows.
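Failure events and latency events of this kind can be simulated by wrapping a call to a dependency. The sketch below is a toy, not a production fault injector: `with_chaos`, `InjectedFault` and `fetch_profile` are hypothetical names, and the failure rate and seed are arbitrary.

```python
import random
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class InjectedFault(Exception):
    """Raised in place of a real dependency failure."""


def with_chaos(call: Callable[[], T], *, failure_rate: float = 0.0,
               added_latency_s: float = 0.0,
               rng: Optional[random.Random] = None) -> T:
    """Run `call`, optionally delaying it or failing it first."""
    rng = rng or random.Random()
    if added_latency_s > 0:
        time.sleep(added_latency_s)      # simulates a slow network hop
    if rng.random() < failure_rate:      # random() is in [0.0, 1.0)
        raise InjectedFault("simulated dependency failure")
    return call()


def fetch_profile() -> dict:
    """Stand-in for a real downstream call."""
    return {"user": "demo"}


rng = random.Random(42)  # seeded so that a run can be repeated
outcomes = []
for _ in range(10):
    try:
        with_chaos(fetch_profile, failure_rate=0.3, rng=rng)
        outcomes.append("ok")
    except InjectedFault:
        outcomes.append("failed")

print(outcomes.count("failed"), "of", len(outcomes), "calls failed")
```

Passing the random generator in explicitly keeps the experiment repeatable, which matters when you want to compare runs.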
Principle 3: Run Experiments in Production
Many software systems we see today go through different environments and different tests before they actually reach production. Each of these environments behaves differently from the actual production environment. If you want to see what the users actually go through, the production environment is your best choice. To guarantee both authenticity of the way in which the system is exercised and relevance to the currently deployed system, Chaos Engineering strongly prefers to experiment directly on production traffic.
But you might ask, why are we trying to break the production environment? Isn't it risky to perform a chaotic experiment in production? That is true. However, you can never fully replicate the actual production settings in a different environment. Chaos Engineering wants to capture the loopholes in the production environment. Hence, it is important that this is performed in the production environment itself. Don't worry! This is done as a controlled experiment.
Examples of inputs for chaos experiments:
When running experiments in production, it is always recommended to use canary deployments. You can run the experiment against the canary that has the lowest user traffic.
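Choosing the canary with the lowest traffic can be a one-line selection once you have traffic figures per canary. This is a sketch under assumptions: `pick_experiment_target`, the canary names and the request rates are hypothetical, and how you obtain the traffic numbers depends on your own tooling.

```python
def pick_experiment_target(traffic_by_canary: dict[str, float]) -> str:
    """Return the name of the canary currently serving the least traffic."""
    if not traffic_by_canary:
        raise ValueError("no canaries available for the experiment")
    return min(traffic_by_canary, key=traffic_by_canary.get)


# Hypothetical requests per second for each canary deployment.
traffic = {"canary-eu": 120.0, "canary-us": 340.0, "canary-ap": 45.0}

print(pick_experiment_target(traffic))  # canary-ap
```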
Principle 4: Automate Experiments to Run Continuously
The practice of Chaos Engineering is a long-running and labour-intensive process. Hence, it is important to automate it to avoid engineer burnout. Automate experiments and run them continuously. Chaos Engineering builds automation into the system to drive both orchestration and analysis.
With each experiment, gather important metrics, perform the necessary calculations and persist the information in a suitable location. Some example metrics collected from an experiment are as follows. (These can also be considered the results of an experiment.)
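The cycle of injecting a fault, measuring, rolling back and persisting the result can be sketched as a small runner that a scheduler calls on a regular basis. Everything named here is an assumption for illustration: `run_experiment`, `persist`, the record fields, the experiment name and the JSON Lines file are not taken from any particular tool.

```python
import json
import os
import tempfile
import time
from typing import Callable


def run_experiment(name: str, inject: Callable[[], None],
                   rollback: Callable[[], None],
                   measure: Callable[[], dict]) -> dict:
    """Inject a fault, measure the system, always roll back, return a record."""
    started = time.time()
    try:
        inject()
        metrics = measure()
    finally:
        rollback()   # undo the fault even if injecting or measuring fails
    return {"experiment": name,
            "started_at": started,
            "duration_s": time.time() - started,
            "metrics": metrics}


def persist(record: dict, path: str) -> None:
    """Append one JSON line per run so results can be analysed later."""
    with open(path, "a", encoding="utf-8") as fh:
        fh.write(json.dumps(record) + "\n")


# Toy system: a flag stands in for a real fault, and the metrics are made up.
state = {"fault_active": False}
record = run_experiment(
    "kill-one-cache-node",
    inject=lambda: state.update(fault_active=True),
    rollback=lambda: state.update(fault_active=False),
    measure=lambda: {"error_rate": 0.002,
                     "fault_active": state["fault_active"]},
)

results_path = os.path.join(tempfile.mkdtemp(), "chaos_results.jsonl")
persist(record, results_path)
print(record["experiment"], record["metrics"])
```

Putting the rollback in a `finally` block is the important design choice: the fault is removed even when the measurement step raises an exception.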
Principle 5: Minimise Blast Radius
Trust me, the last thing you want from Chaos Engineering is to cause actual chaos in your production platform. Even when performing these experiments, it is possible that certain customers feel the degradation of the platform. It is the responsibility and obligation of the Chaos Engineer to ensure that the fallout from experiments is minimised and contained.
When you perform a Chaos Engineering experiment, always remember to identify metrics like the following. (This is to ensure that the Blast Radius is contained and identified.)
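One way to keep the blast radius contained is to agree on hard limits before the experiment starts and to abort as soon as any of them is crossed. The sketch below assumes three such limits; `BlastRadiusLimits`, `should_abort` and the numbers are hypothetical and would need to be chosen for your own platform.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BlastRadiusLimits:
    """Hypothetical limits agreed on before the experiment starts."""
    max_affected_users: int   # hard cap on users exposed to the fault
    max_error_rate: float     # abort threshold, fraction of failed requests
    max_duration_s: float     # the experiment must not run longer than this


def should_abort(limits: BlastRadiusLimits, affected_users: int,
                 error_rate: float, elapsed_s: float) -> bool:
    """Return True as soon as any limit is exceeded."""
    return (affected_users > limits.max_affected_users
            or error_rate > limits.max_error_rate
            or elapsed_s > limits.max_duration_s)


limits = BlastRadiusLimits(max_affected_users=100,
                           max_error_rate=0.05,
                           max_duration_s=300)

print(should_abort(limits, affected_users=40, error_rate=0.01, elapsed_s=60))
print(should_abort(limits, affected_users=40, error_rate=0.12, elapsed_s=60))
```

A check like this is meant to run repeatedly while the fault is active, so that the experiment stops itself instead of waiting for a human to notice.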
Why Do Chaos Engineering?
This is a challenging question to answer. But when you look at your software system as an architect or an engineering manager, you should be able to determine why Chaos Engineering is required for your organisation. I would like to point out some obvious reasons related to the architecture of any software system.
In addition to these, there are other reasons that can help engineers in your organisation to be strong and confident in what they do. This could be the on-call engineers, or even the engineers who actually perform the product development.
This can also help the sales team of your company to offer stronger SLAs and to pitch how confident you are about your products.
Which Companies are doing this?
Netflix may have started this, but the practice has since spread across many industries all over the world. Chaos Engineering is practised in industries ranging from finance and e-commerce to aviation and beyond. Some of the famous software engineering companies that regularly practise Chaos Engineering are as follows.
Have a look at the industry adoption of Chaos Engineering and some of the people behind certain initiatives. (View Diagram). You can also learn more from the Chaos Engineering Community and Chaos Conf.
Challenges Faced in Chaos Engineering
Do you really need Chaos Engineering?
A simple answer for this would be YES. But, if you actually think about it, some companies don’t really need Chaos Engineering and this would be an additional engineering cost that they cannot bear. Let me break down the factors to think about when making this decision. There could be more factors in addition to what is mentioned below.
Does your product have an SLA with its users?
If the answer is yes, then it would be ideal to practice Chaos Engineering to ensure that you provide the agreed availability for your product. Then again, if your customer base is still small and you can tolerate this kind of downtime, then this can be done a bit later in the roadmap.
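To see what an agreed availability means in practice, it helps to turn the percentage into a downtime budget. The short calculation below assumes a 30-day measurement period and a 99.9% target, both chosen for illustration, and reuses the earlier "what if your cloud provider goes down for 8 hours" question. The function names are hypothetical.

```python
def downtime_budget_minutes(availability_pct: float,
                            period_days: int = 30) -> float:
    """Minutes of downtime allowed in the period at the given availability."""
    total_minutes = period_days * 24 * 60
    return total_minutes * (1 - availability_pct / 100)


def availability_after_outage(outage_hours: float,
                              period_days: int = 30) -> float:
    """Availability in percent for the period, given total outage hours."""
    total_hours = period_days * 24
    return 100 * (1 - outage_hours / total_hours)


print(round(downtime_budget_minutes(99.9), 1))   # 43.2
print(round(availability_after_outage(8), 2))    # 98.89
```

Under these assumptions, a single 8-hour outage uses up roughly eleven times the monthly budget of a 99.9% SLA, which is why it pays to find such failure modes before they find you.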
Do you have strong competitors in the market?
If you have strong competitors in the market, Chaos Engineering becomes an essential way to ensure the reliability and resiliency of your product. It would also be a good selling point for your sales team when taking your product to market.
How big is your customer base?
If your customer base is huge and growing, then your system will also have to scale and be distributed as much as possible to provide high availability. Practicing Chaos Engineering would show how your system reacts to growing user requests and how to refine the architecture to meet the demand.
Do you have an architecture that is highly performant, distributed and/or fault tolerant?
In this case, it is very important to ensure that your system is strongly resilient to unexpected chaotic situations. Chaos Engineering is a must in this case, to fortify your system for its best performance.