BBVA API Market
In the world of development, any change can set off a chain of errors throughout the whole system, so it's essential to always be prepared. The idea behind Chaos Monkey is that developers themselves deliberately trigger faults in their own tools and systems as a form of training.
Alejandro Guirao, DevOps engineer at intelygenz and an expert in this tool, talked about its scope at the event Haciendo el Chaos Monkey (Making Chaos Monkey), held at the BBVA Innovation Center in Madrid. We asked him to explain in detail how this tool works.
These types of tools belong to the discipline known as resilience engineering, and they apply a kind of scientific method. First you measure your system: you make sure it's working and that its performance is adequate before touching anything. Then you formulate a hypothesis: if I attack it, if I mess around with it like this, will it hold up? That's when you start launching attacks with the tool. At the end you measure again, compare, and perhaps reach a different conclusion, which leads to a new experiment. It's a fairly iterative cycle.
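The cycle he describes (measure the steady state, formulate a hypothesis, attack, measure again) can be sketched in a few lines of Python. This is a minimal illustration, not Chaos Monkey itself: the metric, the fault and the threshold are hypothetical stand-ins for real monitoring and real attacks.

```python
def measure(system):
    """Steady-state reading: here, the fraction of healthy replicas (a stand-in metric)."""
    return system["healthy_replicas"] / system["total_replicas"]

def inject_fault(system):
    """The 'attack': knock out one replica."""
    if system["healthy_replicas"] > 0:
        system["healthy_replicas"] -= 1

def experiment(system, threshold=0.5):
    baseline = measure(system)   # 1. measure before doing anything
    # 2. hypothesis: after the attack, availability stays above the threshold
    inject_fault(system)         # 3. launch the attack
    result = measure(system)     # 4. measure again and compare
    return baseline, result, result >= threshold

system = {"healthy_replicas": 4, "total_replicas": 4}
print(experiment(system))  # (1.0, 0.75, True): the system survived this attack
```

A surprising result at step 4 (the hypothesis falsified, or the system more fragile than expected) is what feeds the next experiment in the cycle.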
It basically tests a system's resilience. A system is composed of software, the architecture the software runs on, and a series of processes, some business, some human. All of this forms a pyramid, and that is what's ultimately tested with Chaos Monkey. Simply provoking a minor glitch lets you see whether you're really able to survive it.
The tool itself is mainly focused on the technical and IT side of companies. That’s why its greatest benefit is for systems or software development departments. But if we go a little further and don’t look so much at the tool but at its principles and practices, it can be extrapolated to any area. In fact, this is something that’s been done for some time now in the aeronautics industry, the security industry and even in the medical industry.
Yes, in theory they could. In fact, this would be one way to uncover a major problem in what we've created, whether in the architecture, the software or the processes. Problems may arise; it's a real risk, and one that has to be assumed at the very top levels of the company. That is, all of senior and middle management must be aware that these experiments are being run and that there is a non-negligible probability that they may affect production, but also that, in the end, the medium- to long-term benefits will vastly outweigh the cost of this process.
Let's say that instead of waiting for a really big stone to land on your glass house, you start by throwing a tiny stone to see whether it holds up or collapses. That way you can see whether there are any gaps in your system, and then fix them.
Yes, it does. There's no doubt that tools that deliberately cause problems end up requiring multidisciplinary teams to resolve them, so they encourage teamwork, not only among people from systems, operations and development, but among everybody. For example, when Google, applying the resilience engineering philosophy, simulated flooding in its data centers, there was one case where they had to run a diesel generator during the simulation and had no diesel. People began to work out how to get some: the engineers started calling around, but so did the administrative staff, and other departments came up with contacts who could get hold of diesel or lend them money to buy it. One employee even offered the company his credit card for the purchase. In the end it's an effort that involves multidisciplinary teams.
When you learn judo, the first thing you're taught is how to fall, what's known as "ukemi waza", so that you don't hurt yourself during the exercises, because you're no longer afraid and you know how to fall correctly. This is much the same thing: if you're used to a series of minor faults, you can avoid major faults thanks to that experience. It's linked to the lean startup philosophy of failing fast.
Very favorably, although of course it was a real shock. When Netflix first announced it, nobody knew anyone was doing something on this scale, and everything Netflix does sets a precedent. They said they were constantly provoking these failures in production, yet with no visible effect, because they had reached such a level of software development and engineering that they were almost immune to numerous catastrophic errors. That made a lot of people want to emulate them. That's when we began to see posts from companies that had decided to take the leap, and then the open source community as a whole began to use it.
Today the Netflix team is unique in the world. It has engineers who are genuine experts in many performance issues. Netflix currently operates on the Amazon Web Services (AWS) platform and they have people who know more about AWS than the people at Amazon themselves. It’s impressive. So it’s always an endorsement when it comes from them.
As well as Netflix, Google (which uses its own version of Chaos Monkey) and Amazon, there are companies such as Cover Flow, IBM and Yahoo that have published articles in their technical blogs saying they were beginning to use the tool. There are also other brands like Nike, which has a technology division, although that's maybe not what comes to mind when you think of Chaos Monkey.
I think it should be used constantly in production, that's to say, fairly regularly. It shouldn't be run just once and then stopped; rather, the frequency should be increased until you reach a point where you're more or less satisfied. But you have to be careful when you first install it, as at the beginning it's bound to be rather catastrophic and you'll have some faults that affect production. With time things settle down: your system will have improved considerably if you've learned from your mistakes, and you can finally run it on an ongoing basis. What's more, as Netflix said when introducing the tool: "You never know if that change you made yesterday has caused your platform to become weaker". There are always new changes: developers take on new features, and someone may have gone in to fix something at a particular moment and caused totally unforeseen consequences.
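What "constantly, but carefully" can look like in practice: the sketch below is loosely modeled on Chaos Monkey's known habit of striking only on weekdays during working hours, when engineers are around to respond. The probability and the hour range are hypothetical knobs, not real Chaos Monkey configuration.

```python
import random
from datetime import datetime

# Hypothetical knobs (made up for illustration):
TERMINATION_PROBABILITY = 0.2   # chance of a termination on any given check
WORK_HOURS = range(9, 16)       # only strike while engineers are on hand

def should_unleash(now: datetime, rng: random.Random) -> bool:
    """Decide whether to terminate a random instance at this moment."""
    if now.weekday() >= 5:          # never on weekends
        return False
    if now.hour not in WORK_HOURS:  # never outside working hours
        return False
    return rng.random() < TERMINATION_PROBABILITY

# A Monday at 10:00 with a seeded generator, for a reproducible run
print(should_unleash(datetime(2016, 3, 7, 10, 0), random.Random(1)))  # True
```

Restricting the chaos to working hours is what makes "constant" use tolerable: faults still happen regularly, but always when the team can observe and fix them.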
I personally believe it's the future. The quality guarantees of open source code (particularly if it's free software), and the ability not only to see the code but to modify, extend and adapt it, are something that can't be achieved with proprietary software. In fact, the software the big companies use, everything Netflix and Google are built on and all the technology the Internet runs on, is based on free software tools, so I'm 100% sure of it.