
Chaos Engineering - The Art of Breaking Things in Production

Modern software systems are built from many complex components. They are inherently distributed and depend on many other platforms in the industry. Even a simple software system typically consists of a few microservices, cloud infrastructure and a mobile client. Most of these systems depend heavily on cloud service providers such as AWS, Google Cloud and Azure, which become core dependencies that the system cannot run without.

How confident are you about your system when you develop a software platform? What if your cloud provider goes down for 8 hours? What if your system load increases by 10 times? You never know until this actually happens in your production environment.

As your system grows, with SLAs in place and customers paying licensing fees, those customers expect your software platform to be available without interruption so that their business can continue. To provide an uninterrupted service, you need to prepare for any kind of chaos that can happen in production.

This is where Chaos Engineering comes into practice.

Chaos Engineering is the Art of breaking things in Production.

Site Reliability Engineering (SRE) plays a vital role in Chaos Engineering; it is all about ensuring the reliability of the site, even if half of the production system goes down. Sounds unrealistic? Well, this article will give an introduction to Chaos Engineering and how this should be practiced in your organisation to build more resilient systems for your customers.


In this article, I would like to talk about the following topics.

  • What is Chaos Engineering?
  • How is Chaos Engineering Different from Testing Procedures?
  • What is Chaos Monkey?
  • Principles of Chaos Engineering
  • Why Do Chaos Engineering?
  • Which Companies are doing this?
  • Challenges Faced in Chaos Engineering
  • Do you really need Chaos Engineering?


What is Chaos Engineering?

Chaos Engineering is the discipline of experimenting on a system in order to build confidence in the system’s capability to withstand turbulent conditions in production. If you have ever run distributed systems in production, you know very well that something is bound to go wrong. These systems depend on many other components, and they cannot function properly without those interactions. The number of ways your system can go down is enormous: a network failure, an identity provider (IdP) failure, unstable pods, a surge in user traffic and many more.

When such incidents occur, performance degrades and outages follow. This is why it is important to identify these weaknesses beforehand and prepare for them, to prevent future outages from happening.

Moreover, most of these platforms have Service Level Agreements (SLAs) with their users, promising uninterrupted service uptime. Violating an SLA is not just about credits and discounts; it also affects your reliability and competitiveness in the industry. Furthermore, whether or not it is bound by a legal document, a performance drop or outage can cause serious losses for an organisation.




Chaos Engineering is the method of simulating these outages in a production environment, bringing systemic weaknesses to light. It is experimentation to ensure that your system can withstand turbulent situations if they occur. Chaos Engineering is an empirical process where verification leads to more resilient systems and builds confidence in the operational behaviour of those systems. An experiment can be as simple as killing a few services or as drastic as disconnecting an entire cloud datacenter.

We learn about the behavior of a distributed system by observing it during a controlled experiment.  We call this Chaos Engineering.

Chaos doesn't cause problems, it reveals them

As Site Reliability Engineers (SREs), we want to be confident that our systems are resilient enough to withstand any chaotic situation. With Chaos Engineering, you can address those weaknesses proactively, going beyond the reactive processes that currently dominate most incident response models.

In a nutshell, Chaos Engineering is:

  • Controlled and planned Engineering experiments
  • Preparing for unpredictable failures
  • Preparing Engineers for failures
  • Preparing for Game Day
  • A way to improve SLAs by Fortifying Systems


What Chaos Engineering is NOT…

There are some common misconceptions that I want to point out in this article. The following are NOT Chaos Engineering practices:

  • Random Chaos Engineering Experiments
  • Unsupervised Chaos Engineering Experiments
  • Unexpected Chaos Engineering Experiments
  • Breaking production by Accident


How is Chaos Engineering Different from Testing Procedures?

Chaos Engineering is an experimental procedure. There is a fine distinction between testing and experimentation. In testing, an assertion is made: given specific conditions, a system will emit a specific output based on the given specifications. Tests are typically binary and determine whether a property is true or false. Strictly speaking, this does not generate new knowledge about the system; it just assigns valence to a known property of it.

Experimentation generates new knowledge, and often suggests new avenues of exploration. Chaos Engineering is a form of experimentation: you inject a failure and observe how the system behaves. For example, if you want to understand how a complex system misbehaves, injecting communication failures between its components is a good place to start.

It is important to understand this, because some engineers might say that they are confident about their product or system, after proper unit testing and integration tests. This is true. No argument about that. Testing is the first phase of making sure that you're confident about your system. But it is not enough.

Resilience is about absorbing shocks and continuing to operate as before. This is only one part of Chaos Engineering. The best part is about exposing the weak points and fixing them, so that you end up with a system you can be highly confident in.


What is “Chaos Monkey”?

I also want to give a brief introduction to Chaos Monkey, a well-known tool that shows, historically, what Chaos Engineering is really about. Chaos Monkey is a tool invented in 2011 by Netflix to test the resilience of its IT infrastructure. It works by intentionally disabling computers in Netflix's production network to test how the remaining systems respond to the outage.

The name "Chaos Monkey" is explained in the book Chaos Monkeys by Antonio Garcia Martinez:

“Imagine a monkey entering a 'data center', these 'farms' of servers that host all the critical functions of our online activities. The monkey randomly rips cables, destroys devices and returns everything that passes by the hand [i.e. flings excrement]. The challenge for IT managers is to design the information system they are responsible for so that it can work despite these monkeys, which no one ever knows when they arrive and what they will destroy.”


Netflix has built an entire army of monkeys to simulate chaotic situations in the production environment. This is called the Simian Army. Some famous members are:

  • Chaos Kong
  • Chaos Gorilla
  • Latency Monkey
  • Doctor Monkey etc.


Principles of Chaos Engineering

In this section, I would like to describe the advanced principles of Chaos Engineering and how Chaos Engineering can be practiced in your organisation. Always think of Chaos Engineering as an empirical approach where you explore the weak points of your software system. There are 5 main principles.

  • Build a Hypothesis around Steady State Behavior
  • Vary Real-world Events
  • Run Experiments in Production
  • Automate Experiments to Run Continuously
  • Minimise Blast Radius


The entire story of Chaos Engineering is wrapped around the diagram below.

[Diagram: overview of Chaos Engineering]


Let's have a look at each of these in detail.


Principle 1: Build a Hypothesis around Steady State Behavior

This can be broken down into two sections. It is important to identify the “Steady State” of your system and “how to build a hypothesis” around it.

What is Steady State?

Steady state is the state your system is in when it is considered to be behaving normally. The idea is similar to human health: we call a person “steady” when their vital signs are within a healthy range. For a system, steady state is expressed through measurable outputs of its behaviour, such as overall throughput, error rates and latency percentiles. Combine these numbers into a “state” and you can say “our system is steady when its metrics are within these ranges”. An example steady state is given below.

  • 5xx Error rate below 5%
  • p90 latency is below 500 ms
  • Ops per second is above 10,000
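
The example steady state above can be written down as a simple check. The following is a minimal sketch in Python, assuming the three metrics are already collected by your monitoring stack; the `Metrics` class and `is_steady` function are hypothetical names used only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Metrics:
    """A snapshot of system metrics (field names are illustrative)."""
    error_rate_5xx: float   # fraction of responses that are 5xx, e.g. 0.02 = 2%
    p90_latency_ms: float   # 90th percentile latency in milliseconds
    ops_per_second: float   # throughput


def is_steady(m: Metrics) -> bool:
    """Return True when every metric is inside the example steady-state range."""
    return (
        m.error_rate_5xx < 0.05        # 5xx error rate below 5%
        and m.p90_latency_ms < 500     # p90 latency below 500 ms
        and m.ops_per_second > 10_000  # ops per second above 10,000
    )


# Made-up readings, purely for demonstration.
healthy = Metrics(error_rate_5xx=0.01, p90_latency_ms=320, ops_per_second=12_500)
degraded = Metrics(error_rate_5xx=0.01, p90_latency_ms=740, ops_per_second=12_500)

print(is_steady(healthy))   # True
print(is_steady(degraded))  # False
```

A check like this is what an experiment would evaluate before, during and after injecting a failure.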

Build the Hypotheses

Now that the steady state is decided, you can simply build multiple hypotheses around it. Think of these as the “what if” questions.

  • What if the load balancer breaks?
  • What if the cluster goes down?
  • What if the auth server breaks?
  • What if Redis becomes slow?
  • What if latency increases by 300 ms?

Think of things that could go wrong in the production environment. But always keep the following in mind:

Don’t form a hypothesis that you already know will break your system. Why? Because if you know that it will break, you can simply fix it or accept it. You don’t need an experiment to find that out, right? Chaos Engineering experiments can be expensive, and catastrophic if they go wrong. Hence, always use them to identify unknown vulnerabilities of your system.


Principle 2:  Vary Real-world Events

Always consider events that are plausible and real. This judgment comes with years of experience in the industry, which tells you which events are realistic and which are not. Prioritise events by either potential impact or estimated frequency. Consider events that correspond to hardware failures like servers dying, software failures like malformed responses, and non-failure events like a spike in traffic or a scaling event. Any event capable of disrupting steady state is a potential variable in a Chaos experiment.

Some example events are as follows.

  • Hardware failures
  • Functional bugs
  • State transmission errors (e.g., inconsistency of states between sender and receiver nodes)
  • Network latency and partition
  • Large fluctuations in input (up or down) and retry storms
  • Resource exhaustion
  • Unusual or unpredictable combinations of inter-service communication
  • Byzantine failures (e.g., a node believing it has the most current data when it actually does not)
  • Race conditions
  • Downstream dependencies malfunction


Principle 3: Run Experiments in Production

Many software systems today go through several environments and several rounds of tests before they actually reach production, and each of these environments behaves differently from the actual production environment. If you want to see what users actually go through, the production environment is your best choice. To guarantee both authenticity of the way in which the system is exercised and relevance to the currently deployed system, Chaos Engineering strongly prefers to experiment directly on production traffic.

But you might ask, why are we trying to break the production environment? Isn't it risky to perform a chaotic experiment in production? That is true. However, you can never fully replicate the actual production settings in a different environment. Chaos Engineering aims to expose the weaknesses of the production environment itself, so it is important that the experiments are performed there. Don't worry! This is done as a controlled experiment.

Examples of inputs for chaos experiments:

  • Simulating the failure of an entire region or datacenter.
  • Partially deleting Kafka topics over a variety of instances to recreate an issue that occurred in production.
  • Injecting latency between services for a select percentage of traffic over a predetermined period of time.
  • Function-based chaos (runtime injection): randomly causing functions to throw exceptions.
  • Code insertion: Adding instructions to the target program and allowing fault injection to occur prior to certain instructions.
  • Time travel: forcing system clocks out of sync with each other.
  • Executing a routine in driver code emulating I/O errors.
  • Maxing out CPU cores on an Elasticsearch cluster.
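
Two of the inputs above, function-based chaos and latency injection for a percentage of traffic, can be sketched as a small Python decorator. This is only an illustrative sketch: `chaos`, `InjectedFault` and `get_profile` are hypothetical names, and a real setup would normally rely on a dedicated fault-injection tool with a kill switch rather than hand-written wrappers.

```python
import random
import time
from functools import wraps


class InjectedFault(Exception):
    """Raised by the chaos wrapper (hypothetical name for this sketch)."""


def chaos(failure_rate=0.0, latency_rate=0.0, latency_s=0.0, rng=None):
    """Wrap a function so that a fraction of its calls fail or slow down.

    failure_rate: probability that a call raises InjectedFault
    latency_rate: probability that a call is delayed by latency_s seconds
    rng: optional random.Random instance, useful for repeatable runs
    """
    rng = rng or random.Random()

    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            if rng.random() < failure_rate:
                raise InjectedFault(f"injected failure in {func.__name__}")
            if rng.random() < latency_rate:
                time.sleep(latency_s)  # simulate a slow dependency
            return func(*args, **kwargs)
        return wrapper
    return decorator


# Roughly 20% of the calls to this made-up function will fail.
@chaos(failure_rate=0.2, rng=random.Random(42))
def get_profile(user_id):
    return {"id": user_id}


failures = 0
for user_id in range(1000):
    try:
        get_profile(user_id)
    except InjectedFault:
        failures += 1

print(f"{failures} of 1000 calls failed")
```

Keeping the rates as parameters makes it easy to start with a very small percentage of traffic and increase it only while the steady state holds.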

When running experiments in production, it is recommended to use canary deployments. You can run the experiment against the canary that receives the lowest user traffic.





Principle 4: Automate Experiments to Run Continuously

The practice of Chaos Engineering is a long-running and labour-intensive process. Hence, it is important to automate it to avoid engineer burnout. Automate experiments and run them continuously. Chaos Engineering builds automation into the system to drive both orchestration and analysis.

With each experiment, gather the key metrics, perform the necessary calculations and persist the information in a suitable location. Some example metrics collected from an experiment are listed below. (These can also be considered the results of the experiment.)

  • Time to detect
  • Time for Notification and Escalation
  • Time to public notification
  • Time for graceful degradation to kick in
  • Time for self-healing to happen
  • Time to recovery - partial or full
  • Time to all clear and stable
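
Most of these metrics are simply the elapsed time between the moment the fault was injected and a later milestone. Below is a minimal Python sketch, assuming the experiment records a timestamp for each milestone; the event names and timestamps are made up for illustration.

```python
from datetime import datetime


def elapsed_since_injection(timeline: dict) -> dict:
    """Return seconds from fault injection to every later milestone."""
    start = timeline["fault_injected"]
    return {
        name: (moment - start).total_seconds()
        for name, moment in timeline.items()
        if name != "fault_injected"
    }


# Hypothetical timeline of one experiment (event names are assumptions).
timeline = {
    "fault_injected": datetime(2021, 3, 1, 10, 0, 0),
    "detected": datetime(2021, 3, 1, 10, 0, 45),
    "on_call_notified": datetime(2021, 3, 1, 10, 2, 0),
    "recovered": datetime(2021, 3, 1, 10, 7, 30),
}

timings = elapsed_since_injection(timeline)
print(timings)
# {'detected': 45.0, 'on_call_notified': 120.0, 'recovered': 450.0}
```

Persisting these numbers for every run lets you see whether time to detect and time to recovery improve from one experiment to the next.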


Principle 5: Minimise Blast Radius

Trust me, the last thing you want from Chaos Engineering is to cause actual chaos in your production platform. Even when performing these experiments, it is possible that certain customers feel the degradation of the platform. It is the responsibility and obligation of the Chaos Engineer to ensure that the fallout from experiments is minimised and contained.

When you perform a Chaos Engineering experiment, always remember to answer questions like the following. (This is to ensure that the Blast Radius is identified and contained.)

  • Who is impacted?
  • How many workloads?
  • What functionality?
  • How many locations?
  • And more
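
One common way to keep the blast radius contained is to watch the steady state while the fault is active and to roll back as soon as it is violated. The sketch below illustrates that idea in Python; it is an assumption about how such a guard could look, not a prescribed implementation, and `run_guarded` together with its callbacks are hypothetical names.

```python
from typing import Callable


def run_guarded(inject: Callable[[], None],
                rollback: Callable[[], None],
                is_steady: Callable[[], bool],
                max_checks: int = 5) -> str:
    """Inject a fault, poll the steady-state check, and always roll back."""
    inject()
    try:
        for _ in range(max_checks):
            if not is_steady():
                return "aborted"    # stop early: impact went beyond the agreed limit
        return "completed"
    finally:
        rollback()                  # runs on abort, completion or exception


# Simulated run: the third steady-state reading fails, so the run is aborted.
log = []
readings = iter([True, True, False, True, True])
result = run_guarded(
    inject=lambda: log.append("inject"),
    rollback=lambda: log.append("rollback"),
    is_steady=lambda: next(readings),
)
print(result, log)  # aborted ['inject', 'rollback']
```

In a real experiment the loop would wait between checks and read live metrics; the important property is that the rollback runs no matter how the experiment ends.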


Why Do Chaos Engineering?

This is a challenging question to answer. But when you look at your software system as an architect or an engineering manager, you should be able to determine why Chaos Engineering is required for your organisation. I would like to point out some obvious reasons, related to the architecture of any software system.

  • Systems need to scale fast and smoothly
  • Microservice architecture is tricky
  • Services will fail
  • Dependencies on other companies will fail
  • Reduce the number of outages and downtime (lose less money)
  • Prepare for real world scenarios
  • Attackers trying to perform DDoS attacks

In addition to these, there are other reasons that help the engineers in your organisation to be strong and confident in what they do. These could be the on-call engineers, or the engineers who actually perform the product development.

  • Train on-call engineers to be prepared for different kinds of outages
  • Train development engineers to build more resilient systems
  • Help engineering architects to make solid and reliable decisions

This can also help the sales team of your company to offer stronger SLAs and to pitch how confident you are in your products.


Which Companies are doing this? 

Netflix may have started this, but the practice has since spread to many industries all over the world. Chaos Engineering is practiced in industries ranging from finance and e-commerce to aviation and beyond. Some of the famous software engineering companies that regularly practice Chaos Engineering are as follows.

  • Netflix
  • Amazon
  • Dropbox
  • Uber
  • Slack
  • Twilio
  • Facebook
  • And many more!

Have a look at the industry adoption of Chaos Engineering and some of the people behind these initiatives. (View Diagram). You can also learn more from the Chaos Engineering Community and Chaos Conf.


Challenges Faced in Chaos Engineering

  • No time or flexibility to simulate disasters
  • Teams will always be spending their time fixing things, and building new features
  • This can be very political inside the organisation
  • Cost involved in fixing and simulating disasters
  • And many more company-related matters that build up resistance


Do you really need Chaos Engineering?

A simple answer for this would be YES. But, if you actually think about it, some companies don’t really need Chaos Engineering and this would be an additional engineering cost that they cannot bear. Let me break down the factors to think about when making this decision. There could be more factors in addition to what is mentioned below.


Does your product have an SLA with its users?

If the answer is yes, then it would be ideal to practice Chaos Engineering to ensure that you provide the agreed availability for your product. Then again, if your customer base is still small and you can tolerate some downtime, this can be done a bit later in the roadmap.
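
To judge how much downtime you can tolerate, it helps to turn the agreed availability into a downtime budget. The short Python sketch below does that arithmetic; the availability targets and the 30-day period are illustrative assumptions, not figures from any particular SLA.

```python
def allowed_downtime_minutes(availability_percent: float, period_days: int = 30) -> float:
    """Minutes of downtime permitted in the period for a given availability target."""
    total_minutes = period_days * 24 * 60
    return total_minutes * (1 - availability_percent / 100)


# Example targets, chosen only for illustration.
for target in (99.0, 99.9, 99.99):
    budget = round(allowed_downtime_minutes(target), 2)
    print(f"{target}% availability -> {budget} minutes per 30 days")
# 99.0% availability -> 432.0 minutes per 30 days
# 99.9% availability -> 43.2 minutes per 30 days
# 99.99% availability -> 4.32 minutes per 30 days
```

The tighter the target, the smaller the budget, and the more valuable it becomes to find weaknesses in a controlled experiment instead of during a real outage.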


Do you have strong competitors in the market?

If you have strong competitors in the market, Chaos Engineering becomes essential to ensure the reliability and resiliency of your product. It is also a good selling point for your sales team when taking your product to market.


How big is your customer base?

If your customer base is huge and growing, then your system will also have to scale and be distributed as much as possible to provide high availability. Practicing Chaos Engineering shows how your system reacts to growing user requests and how to refine the architecture to meet the demand.


Do you have an architecture that is highly performant, distributed and/or fault tolerant?

In this case, it is very important to ensure that your system is resilient to unexpected chaotic situations. Chaos Engineering is a must here, to fortify your system for its best performance.










View my original blog here



© 2021 Creative Software. All Rights Reserved