Chaos Monkey, the tool that causes minor faults in order to prevent greater ones

Development, Startups / 14 February 2017

In the world of development, any change can cause a chain of errors throughout the whole system, so it’s essential always to be prepared. The idea behind Chaos Monkey is that the developers themselves should trigger faults in their tools and development as a form of training.

Alejandro Guirao, DevOps engineer at intelygenz and an expert in this tool, talked about its scope at the event Haciendo el Chaos Monkey (Making Chaos Monkey), held at the BBVA Innovation Center in Madrid. We asked him to explain in detail how the tool works.

What’s the process behind the tool?

These types of tools belong to the discipline known as resilience engineering, and they follow a kind of scientific method. First you have to measure your system: you must make sure it’s working and that its performance is adequate without doing anything to it. Then you formulate a hypothesis: if I attack it, if I mess around with it like this, will it hold up? That’s when you start launching attacks with the tool. At the end you measure again and compare, and you may come to a different conclusion, which leads to a new experiment. It’s a fairly iterative cycle.
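
The cycle described above (measure, hypothesize, attack, measure again and compare) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `ToyService`, `measure_success_rate` and `run_experiment` are hypothetical names invented for this example and are not part of Chaos Monkey, and the "attack" simply removes one replica from an in-memory toy model rather than touching real infrastructure.

```python
from dataclasses import dataclass


@dataclass
class ToyService:
    """Hypothetical stand-in for a real system: identical replicas behind a load balancer."""
    replicas: int = 3

    def handle_request(self) -> bool:
        # In this toy model a request succeeds while at least one replica is alive.
        return self.replicas > 0

    def kill_replica(self) -> None:
        # The "attack": take one replica away, never going below zero.
        self.replicas = max(0, self.replicas - 1)


def measure_success_rate(service: ToyService, requests: int = 100) -> float:
    """Send a batch of requests and return the fraction that succeeded."""
    ok = sum(service.handle_request() for _ in range(requests))
    return ok / requests


def run_experiment(service: ToyService, attack, tolerance: float = 0.01) -> dict:
    """One pass of the cycle: measure, attack, measure again, compare."""
    baseline = measure_success_rate(service)  # 1. measure the untouched system
    # 2. hypothesis: the success rate stays within `tolerance` of the baseline
    attack(service)                           # 3. launch the attack
    after = measure_success_rate(service)     # 4. measure again and compare
    return {
        "baseline": baseline,
        "after": after,
        "hypothesis_held": baseline - after <= tolerance,
    }


if __name__ == "__main__":
    print(run_experiment(ToyService(replicas=3), lambda s: s.kill_replica()))
    print(run_experiment(ToyService(replicas=1), lambda s: s.kill_replica()))
```

With three replicas the hypothesis holds after losing one; with a single replica it does not, which is the kind of result that would prompt a fix and then a new experiment.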

What are the main advantages of using Chaos Monkey?

It basically tests a system’s resilience. A system is composed of software, an architecture in which the software is installed, and a series of processes that are sometimes business, sometimes human. All this forms a pyramid, and this is what’s ultimately tested with Chaos Monkey. The simple act of provoking a minor glitch lets you see whether you’re really able to survive it.

Is Chaos Monkey a worthwhile tool for companies in any sector?

The tool itself is mainly focused on the technical and IT side of companies. That’s why its greatest benefit is for systems or software development departments. But if we go a little further and don’t look so much at the tool but at its principles and practices, it can be extrapolated to any area. In fact, this is something that’s been done for some time now in the aeronautics industry, the security industry and even in the medical industry.

Could these simulated production failures ever actually harm the company?

Yes, in theory they could. In fact, that would be one way of uncovering a major problem in what we’ve created, whether in the architecture, the software or the processes. The risk is real, and it has to be accepted at the very top levels of the company. That is, all senior and middle management must be aware that these experiments are being run and that there is a not insignificant probability that they will affect production. They also need to understand that, in the end, the medium- to long-term benefits will vastly outweigh that risk.

Isn’t using Chaos Monkey like throwing stones at your own glass house?

Let’s say that instead of waiting for a really big stone to land on your glass house, you start by throwing a little tiny stone to see whether it holds up or collapses. That way you can see whether there are any gaps in your system, and it’s a way of being able to fix it.

Does it test teamwork?

Yes it does. There’s no doubt that these types of tools that cause problems end up requiring multidisciplinary teams to resolve them, so they encourage teamwork, not only among people from systems, operations and development, but among everybody. For example, when Google –using the resilience engineering philosophy– conducted simulations of flooding in data centers, there was one case where they had to use a diesel generator during the simulation, and they didn’t have any diesel. So people began to see how they could manage to buy diesel; the engineers began to call around, but then the people in the administration did too, and other departments came up with the phone numbers of people they knew who could get hold of diesel or who could lend them money to get diesel. In fact one employee even offered the company his credit card so they could buy diesel. In the end it’s an effort that involves multidisciplinary teams.

“The best way of avoiding failure is to fail constantly.” What do you make of this phrase?

When you learn judo the first thing you learn to do is to fall, what’s known as “ukemi waza”, so that when you do the exercises you don’t hurt yourself, because you’re no longer afraid and you know how to fall correctly. This is much the same thing: if you’re used to a series of minor faults, you can avoid major faults thanks to the experience you’ve gained. It’s linked to the lean philosophy of startups, the idea of failing fast.

How has Chaos Monkey been received by the open source community?

Very favorably –of course, it was a real shock. When it was first announced by Netflix, nobody knew it was doing something on this scale. Everything Netflix does always sets a precedent. The fact that they said that they were constantly provoking these failures in their production, but that they had no effect because they’ve reached such a level of software development and engineering that they’re almost immune to numerous catastrophic errors… That made a lot of people want to emulate them. That’s when we began to see posts from companies that had decided to take the leap. And then the open source community as a whole began to use it.

Does the fact that Chaos Monkey was successful with the Netflix development team endorse the tool?

Today the Netflix team is unique in the world. It has engineers who are genuine experts in many performance issues. Netflix currently operates on the Amazon Web Services (AWS) platform and they have people who know more about AWS than the people at Amazon themselves. It’s impressive. So it’s always an endorsement when it comes from them.

What other major companies have used or use the tool?

As well as Netflix, Google –which uses its own version of Chaos Monkey– and Amazon, there’s Cover Flow, IBM and Yahoo, for example, who have published articles in their technical blogs saying they were beginning to use the tool. Also some other brands like Nike, which has a technology division, although that’s maybe not what comes to mind when you think of Chaos Monkey.

Do you recommend using it all the time or just as a stress test in particular situations in a process?

I think it should be used constantly in production, that is to say, on a regular basis. It shouldn’t be done just once and then stopped; rather, the frequency should be increased until you reach a point you’re more or less satisfied with. But you have to be careful when you first install it, as at the beginning it’s bound to be rather catastrophic and you’ll have some faults that affect production. With time things will settle down. Your system will have improved considerably if you’ve learned from your mistakes, and you can finally use it on an ongoing basis. What’s more, as Netflix said in the introduction to this tool: “You never know if that change you made yesterday has caused your platform to become weaker”. There are always new changes, developers add new features, and someone may have gone in to fix something at a particular time and caused totally unforeseen consequences.
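
One way to picture "increasing the frequency until you're satisfied" is a simple rule for choosing the gap between experiments. The sketch below is an assumption made for illustration only: the function `next_interval_hours`, the halving and doubling policy, and the default bounds of one hour and one week are all hypothetical, and none of them comes from Chaos Monkey or from Netflix.

```python
def next_interval_hours(current: float, last_run_survived: bool,
                        floor: float = 1.0, ceiling: float = 168.0) -> float:
    """Pick the gap, in hours, before the next chaos experiment.

    Hypothetical policy: if the system survived the last experiment, run the
    next one sooner (half the gap, never below `floor`). If it did not, back
    off (double the gap, never above `ceiling`) to leave time for fixes.
    """
    if last_run_survived:
        return max(floor, current / 2)
    return min(ceiling, current * 2)


if __name__ == "__main__":
    interval = 168.0  # start with one experiment per week
    for survived in (True, True, False, True):
        interval = next_interval_hours(interval, survived)
        print(f"survived={survived}: next experiment in {interval} hours")
```

Starting with infrequent runs and tightening the schedule only after the system has survived reflects the advice above: expect some production impact at first, learn from it, and then move towards continuous use.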

Do you think open source is the future for companies in terms of software and development?

I personally believe it’s the future. The guaranteed quality of the code that open source provides, particularly if it’s free software, and the ability not only to see the code but to modify it, extend it and adapt it, is something that can’t be achieved with proprietary software. In fact the software the big companies use, the ones Netflix and Google use… everything they’re based on and all the technology the Internet runs on is built on free software tools, so I’m 100% sure of it.

