Chaos Engineering - The Art of Breaking Things in Production

Keet Sugathadasa

March 22, 2023

Today's software systems are built from advanced, complex components, which makes them inherently distributed and dependent on many other platforms in the industry. Even a simple software system will typically consist of a couple of microservices, some cloud infrastructure and a mobile setup. Most of these systems are highly dependent on cloud service providers like AWS, Google Cloud and Azure, which become core dependencies that the system cannot survive without.

How confident are you about your system when you develop a software platform? What if your cloud provider goes down for 8 hours? What if your system load increases by 10 times? You never know until this actually happens in your production environment.

When your system grows, with SLAs in place and customers paying a licensing fee, those customers expect your software platform to be uninterrupted and available so that their business can continue. To provide an uninterrupted service, you need to prepare for any kind of chaos that can happen in production.

This is where Chaos Engineering comes into play.

Chaos Engineering is the art of breaking things in production.

Site Reliability Engineering (SRE) plays a vital role in Chaos Engineering; it is all about ensuring the reliability of the site, even if half of the production system goes down. Sounds unrealistic? Well, this article gives an introduction to Chaos Engineering and how it should be practiced in your organisation to build more resilient systems for your customers.

In this article, I would like to talk about the following topics.

· What is Chaos Engineering?

· How is Chaos Engineering Different from Testing Procedures?

· What is Chaos Monkey?

· Principles of Chaos Engineering

· Why Do Chaos Engineering?

· Which Companies are doing this?

· Challenges Faced in Chaos Engineering

· Do you really need Chaos Engineering?

What is Chaos Engineering?

Chaos Engineering is the discipline of experimenting on a system in order to build confidence in the system’s capability to withstand turbulent conditions in production. If you've ever run distributed systems in production, you know very well that something is bound to go wrong. This is because these systems depend on many other components, and they need those interactions both to survive and to function properly. The number of ways your system can go down is enormous. It could be a network failure, an IdP (identity provider) failure, unstable pods, a surge in user traffic and many more.

When incidents like these start occurring, performance degrades, outages are triggered and so on. This is why it is important to identify these issues beforehand and prepare for them, to prevent future outages from happening.

Moreover, most of these platforms have Service Level Agreements (SLAs) with their users, promising uninterrupted service uptime. Violating SLAs is not just about credit discounts; it is also about your reliability and competitiveness in the industry. Furthermore, whether or not it is bound to a legal document, certain performance drops or outages can cause serious losses for an organisation.

Chaos Engineering is the method of simulating these outages in the production environment, bringing systemic weaknesses to light. It is experimentation to ensure that your system can withstand turbulent situations if they occur. Chaos Engineering is an empirical process where verification leads to more resilient systems and builds confidence in the operational behaviour of those systems. It can be as simple as killing a few services or as drastic as disconnecting an entire cloud datacenter.

We learn about the behavior of a distributed system by observing it during a controlled experiment. We call this Chaos Engineering.

Chaos doesn't cause problems, it reveals them

As Site Reliability Engineers (SREs), we want to be confident that our systems are resilient enough to withstand any chaotic situation. With Chaos Engineering, you can address those weaknesses proactively, going beyond the reactive processes that currently dominate most incident response models.

In a nutshell, Chaos Engineering is:

· Controlled and planned Engineering experiments

· Preparing for unpredictable failures

· Preparing Engineers for failures

· Preparing for Game Day

· A way to improve SLAs by Fortifying Systems

What Chaos Engineering is NOT…

These are common misconceptions and I want to point them out in this article. The following are NOT Chaos Engineering Practices:

· Random Chaos Engineering Experiments

· Unsupervised Chaos Engineering Experiments

· Unexpected Chaos Engineering Experiments

· Breaking production by Accident

How is Chaos Engineering Different from Testing Procedures?

Chaos Engineering is an experimental procedure. There is a fine distinction between testing and experimentation. In Testing, an assertion is made; given specific conditions, a system will emit a specific output based on the given specifications. Tests are typically binary and determine whether a property is true or false. Strictly speaking, this does not generate new knowledge about the system; it just assigns valence to a known property of it.

Experimentation generates new knowledge, and often suggests new avenues of exploration. Chaos Engineering is a form of experimentation, and it covers multiple methods of generating that kind of new knowledge. If you want to detect complex behavioural defects in the system, injecting communication failures is often a better choice than writing another test.

It is important to understand this, because some engineers might say that they are confident about their product or system after proper unit testing and integration tests. This is true. No argument about that. Testing is the first phase of making sure that you're confident about your system. But it is not enough.

Resilience is about absorbing shocks and continuing to operate as before. This is only one part of Chaos Engineering. The best part is about finding the weak points and using them to build a system you can be highly confident in.

What is “Chaos Monkey”?

I want to give a brief introduction to Chaos Monkey as well, which is very famous and gives a historical introduction to what Chaos Engineering is really about. Chaos Monkey is a tool invented in 2011 by Netflix to test the resilience of its IT infrastructure. It works by intentionally disabling computers in Netflix's production network to test how remaining systems respond to the outage.

The name "Chaos Monkey" is explained in the book Chaos Monkeys by Antonio Garcia Martinez:

“Imagine a monkey entering a 'data center', these 'farms' of servers that host all the critical functions of our online activities. The monkey randomly rips cables, destroys devices and returns everything that passes by the hand [i.e. flings excrement]. The challenge for IT managers is to design the information system they are responsible for so that it can work despite these monkeys, which no one ever knows when they arrive and what they will destroy.”

Netflix has built an entire army of monkeys to simulate chaotic situations in the production environment, and this is called the Simian Army. Some famous monkeys are...

· Chaos Kong

· Chaos Gorilla

· Latency Monkey

· Doctor Monkey etc.

Principles of Chaos Engineering

In this section, I would like to describe the advanced principles of Chaos Engineering and how Chaos Engineering can be practiced in your organisation. Always think of Chaos Engineering as an empirical approach where you explore the weak points of your software system. There are 5 main principles.

· Build a Hypothesis aroundSteady State Behavior

· Vary Real-world Events

· Run Experiments in Production

· Automate Experiments to Run Continuously

· Minimise Blast Radius

The entire story of Chaos Engineering is wrapped around the diagram below.

Let's have a look at each of these in detail.

Principle 1: Build a Hypothesis around Steady State Behavior

This can be broken down into two sections. It is important to identify the “Steady State” of your system and “how to build a hypothesis” around it.

What is Steady State?

Steady state is the state your system is in when it is considered steady. This is similar to humans: we call a person “steady” when their health indicators are within certain ranges. Similarly, steady state is a measurable output of your system’s behaviour, like the overall system’s throughput, error rates, latency percentiles etc. Formulate these numbers into a “state” and you can say “our system is steady when it is within this range”. An example steady state is given below.

· 5xx Error rate below 5%

· p90 latency is below 500 ms

· Ops per second is above 10,000
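
The example thresholds above could be captured in a small check like the one below. This is only a sketch: the `Metrics` class, its field names and the `is_steady` function are hypothetical names chosen for illustration, and in practice the numbers would come from whatever monitoring system you already use.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Metrics:
    """A snapshot of system-wide metrics (hypothetical field names)."""
    error_rate_5xx: float   # fraction of responses that are 5xx, e.g. 0.02 == 2%
    p90_latency_ms: float   # 90th percentile latency in milliseconds
    ops_per_second: float   # overall throughput


def is_steady(m: Metrics) -> bool:
    """Return True when every metric is within the example thresholds."""
    return (
        m.error_rate_5xx < 0.05         # 5xx error rate below 5%
        and m.p90_latency_ms < 500      # p90 latency below 500 ms
        and m.ops_per_second > 10_000   # ops per second above 10,000
    )


if __name__ == "__main__":
    # Made-up numbers, only to show how the check is called.
    snapshot = Metrics(error_rate_5xx=0.01, p90_latency_ms=320, ops_per_second=12_500)
    print("steady" if is_steady(snapshot) else "not steady")
```

Expressing the steady state as a single yes/no function makes it easy to evaluate the same definition before, during and after an experiment.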

Build the Hypotheses

Now that the steady state is decided, you can simply build multiple hypotheses around it. Think of these as the “what if” questions.

· What if the load balancer breaks?

· What if the cluster goes down?

· What if the auth server breaks?

· What if Redis becomes slow?

· What if latency increases by 300 ms? Etc.

Think of things that can possibly go wrong in the production environment. But always make sure of the following:

Don’t make a hypothesis that you know will break you. Why? Because if you know that it will break you, you can simply fix it or ignore it. You don’t really need an experiment to test it out, right? Chaos Engineering experiments can be expensive and catastrophic. Hence, always use them to identify unknown vulnerabilities of your system.

Principle 2: Vary Real-world Events

Always consider events that are plausible and real. This decision can come with years of experience in the industry, where certain events seem realistic and some do not. Prioritise events by either potential impact or estimated frequency. Consider events that correspond to hardware failures like servers dying, software failures like malformed responses, and non-failure events like a spike in traffic or a scaling event. Any event capable of disrupting steady state is a potential variable in a Chaos experiment.

Some example events are as follows.

· Hardware failures

· Functional bugs

· State transmission errors (e.g., inconsistency of states between sender and receiver nodes)

· Network latency and partition

· Large fluctuations in input (up or down) and retry storms

· Resource exhaustion

· Unusual or unpredictable combinations of inter-service communication

· Byzantine failures (e.g., a node believing it has the most current data when it actually does not)

· Race conditions

· Downstream dependencies malfunction

Principle 3: Run Experiments in Production

Many software systems we see today go through different environments and different tests before they actually reach production, and each of these environments behaves differently from the actual production environment. If you want to see what the users actually go through, the production environment is your best choice. To guarantee both authenticity of the way in which the system is exercised and relevance to the current deployed system, Chaos Engineering strongly prefers to experiment directly on production traffic.

But you might ask, why are we trying to break the production environment? Isn't it risky to perform a chaotic experiment in production? That is true. However, you can never fully replicate the actual production settings in a different environment. Chaos Engineering wants to capture the loopholes in the production environment. Hence, it is important that this is performed in the production environment itself. Don't worry! This is done as a controlled experiment.

Examples of inputs for chaos experiments:

· Simulating the failure of an entire region or datacenter.

· Partially deleting Kafka topics over a variety of instances to recreate an issue that occurred in production.

· Injecting latency between services for a select percentage of traffic over a predetermined period of time.

· Function-based chaos (runtime injection): randomly causing functions to throw exceptions.

· Code insertion: Adding instructions to the target program and allowing fault injection to occur prior to certain instructions.

· Time travel: forcing system clocks out of sync with each other.

· Executing a routine in driver code emulating I/O errors.

· Maxing out CPU cores on an Elasticsearch cluster.

When running experiments in production, it is always recommended to use canary deployments. You can run the experiment against the canary that has the lowest user traffic.
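
Two of the inputs listed above, injecting latency for a select percentage of traffic and function-based chaos, can be illustrated with a small Python decorator. This is a minimal sketch and not how any particular chaos tool works: `inject_faults`, its parameters and the `get_profile` function are all hypothetical names invented for this example.

```python
import functools
import random
import time


def inject_faults(latency_s=0.0, latency_rate=0.0, error_rate=0.0,
                  rng=None, sleep=time.sleep):
    """Wrap a function so that a fraction of calls is delayed or fails.

    latency_rate and error_rate are probabilities between 0.0 and 1.0.
    rng and sleep can be replaced, which keeps the behaviour testable.
    """
    if rng is None:
        rng = random.Random()

    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            # Function-based chaos: fail a fraction of the calls.
            if rng.random() < error_rate:
                raise RuntimeError("chaos: injected failure")
            # Latency injection: delay a fraction of the calls.
            if rng.random() < latency_rate:
                sleep(latency_s)
            return func(*args, **kwargs)
        return wrapper
    return decorator


# Hypothetical service call: 5% of calls get 300 ms of extra latency
# and 1% of calls raise an exception.
@inject_faults(latency_s=0.3, latency_rate=0.05, error_rate=0.01)
def get_profile(user_id):
    return {"id": user_id}
```

Keeping the rates low and applying the decorator only on a canary is one way to keep such an experiment controlled.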

Principle 4: Automate Experiments to Run Continuously

Practicing Chaos Engineering is a long-running and labour-intensive process. Hence, it is important to automate it to avoid engineer burnout. Automate experiments and run them continuously. Chaos Engineering builds automation into the system to drive both orchestration and analysis.

With each experiment, gather important metrics, perform the necessary calculations and persist the information in a suitable location. Some example metrics collected from an experiment are as follows. (These can also be considered the results of an experiment.)

· Time to detect

· Time for Notification and Escalation

· Time to public notification

· Time for graceful degradation to kick in

· Time for self-healing to happen

· Time to recovery - partial or full

· Time to all clear and stable
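
Most of the metrics above are durations measured from the moment the fault was injected. A minimal sketch of that calculation is shown below; the function name `experiment_timings`, the event names and the timestamps are all made up for illustration.

```python
from datetime import datetime


def experiment_timings(events):
    """Return seconds elapsed from fault injection to each recorded event.

    `events` maps an event name to the datetime at which it happened and
    must contain the key "fault_injected" (a name chosen for this sketch).
    """
    start = events["fault_injected"]
    return {
        name: (when - start).total_seconds()
        for name, when in events.items()
        if name != "fault_injected"
    }


if __name__ == "__main__":
    # Made-up timeline, only to show the shape of the input.
    timeline = {
        "fault_injected": datetime(2023, 1, 1, 10, 0, 0),
        "detected": datetime(2023, 1, 1, 10, 1, 30),
        "notified": datetime(2023, 1, 1, 10, 3, 0),
        "recovered": datetime(2023, 1, 1, 10, 12, 0),
    }
    for name, seconds in experiment_timings(timeline).items():
        print(f"time to {name}: {seconds:.0f}s")
```

Persisting these numbers for every run lets you compare experiments over time and see whether detection and recovery are actually improving.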

Principle 5: Minimise Blast Radius

Trust me, the last thing you want from Chaos Engineering is to cause actual chaos in your production platform. Even when performing these experiments, it is possible that certain customers feel the degradation of the platform. It is the responsibility and obligation of the Chaos Engineer to ensure the fallout from experiments is minimised and contained.

When you perform a Chaos Engineering experiment, always remember to identify metrics like the following. (This is to ensure that the Blast Radius is contained and identified.)

· Who is impacted?

· How many workloads?

· What functionality?

· How many locations?

· And more
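
One way to make the questions above actionable is to write down the blast radius you are willing to accept before the experiment, and compare it with what you observe while the experiment runs. The sketch below assumes a hypothetical `BlastRadius` record and `within_limits` check; neither comes from a real tool, and the fields simply mirror the questions in the list.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BlastRadius:
    """Scope of an experiment (hypothetical fields mirroring the list above)."""
    impacted_users: int         # who is impacted?
    workloads: int              # how many workloads?
    functionality: frozenset    # what functionality?
    locations: frozenset        # how many locations?


def within_limits(observed: BlastRadius, limit: BlastRadius) -> bool:
    """Return True when the observed impact stays inside the agreed limit."""
    return (
        observed.impacted_users <= limit.impacted_users
        and observed.workloads <= limit.workloads
        and observed.functionality <= limit.functionality   # subset check
        and observed.locations <= limit.locations           # subset check
    )


if __name__ == "__main__":
    # Made-up values, only to show how the check is used.
    limit = BlastRadius(100, 2, frozenset({"search"}), frozenset({"eu-west"}))
    observed = BlastRadius(140, 1, frozenset({"search"}), frozenset({"eu-west"}))
    if not within_limits(observed, limit):
        print("blast radius exceeded: stop the experiment")
```

A check like this can serve as the stop condition of an automated experiment, so that it halts as soon as the impact grows beyond what was agreed.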

Why Do Chaos Engineering?

This is a challenging question to answer. But when you look at your software system as an architect or an engineering manager, you should be able to determine why Chaos Engineering is required for your organisation. I would like to point out some obvious reasons, related to the architecture of any software system.

· Systems need to scale fast and smoothly

· Microservice architecture is tricky

· Services will fail

· Dependencies on other companies will fail

· Reduce the number of outages and downtime (lose less money)

· Prepare for real world scenarios

· Attackers trying to perform DDoS attacks

In addition to these, there are other reasons that can help engineers in your organisation to be strong and confident in what they do. This could be the on-call engineers, or even the engineers who actually perform the product development.

· Train On-Call Engineers to be Prepared for Different kinds of Outages

· Train Development Engineers to build more resilient systems

· Engineering architects to make solid and reliable decisions

This can also help the sales team of your company to come up with stronger SLAs and pitch in about how confident you are about your products.

Which Companies are doing this?

Netflix may have started this, but the practice has since spread across industries all over the world. Chaos Engineering is practiced in industries ranging from finance and e-commerce to aviation and beyond. Some of the famous software engineering companies that regularly practice Chaos Engineering are as follows.

· Netflix

· Amazon

· Dropbox

· Uber

· Slack

· Twilio

· Facebook

· And many more!

Have a look at the industry adoption of Chaos Engineering and some of the personalities behind certain initiatives. (View Diagram). You can also learn more from the Chaos Engineering Community and Chaos Conf.

Challenges Faced in Chaos Engineering

· No time or flexibility to simulate disasters

· Teams will always be spending their time fixing things, and building new features

· This can be very political inside the organisation

· Cost involved in fixing and simulating disasters

· And many more company-related matters that build up resistance

Do you really need Chaos Engineering?

A simple answer for this would be YES. But if you actually think about it, some companies don’t really need Chaos Engineering, and it would be an additional engineering cost that they cannot bear. Let me break down the factors to think about when making this decision. There could be more factors in addition to what is mentioned below.

Does your product have an SLA with its users?

If the answer is yes, then it would be ideal to practice Chaos Engineering to ensure that you provide the agreed availability for your product. Then again, if your customer base is still small and you can tolerate some downtime, this can be done a bit later in the roadmap.

Do you have strong competitors in the market?

If you have strong competitors in the market, Chaos Engineering is an essential way to ensure the reliability and resiliency of your product. It is also a good selling point for your sales team when taking your product to market.

How Big is your customer base?

If your customer base is huge and growing, then your system will also have to scale and be distributed as much as possible to provide high availability. Practicing Chaos Engineering would show how your system reacts to growing user requests and how to polish the architecture to meet the demand.

Do you have an architecture that is highly performant, distributed and/or fault tolerant?

In this case, it is very important to ensure that your system is strongly resilient to unexpected chaotic situations. Chaos Engineering is a must in this case, to fortify your system for its best performance.

References

1. http://principlesofchaos.org/

2. https://www.slideshare.net/AnaMedina42/introduction-to-chaos-engineering-srecon-asia-ana-medina

3. https://learning.oreilly.com/library/view/chaos-engineering/9781491988459/

4. https://www.cuelogic.com/blog/chaos-engineering

5. https://www.slideshare.net/AmazonWebServices/chaos-engineering-why-breaking-things-should-be-practiced-aws-developer-workshop-at-web-summit-2018
