Chaos Monkey is a tool developed by Netflix to test the resiliency of their servers on the Amazon cloud when faced with failures. It periodically terminates a random virtual machine that is running their application. Their automated error recovery is supposed to spin up a new virtual machine to replace the one that failed, and do so in a manner that appears seamless to customers.
Rather than just implementing the error recovery code, testing it once, and assuming that it will do the job, they are constantly testing it, figuring that if it really can seamlessly recover from failures, there should be no problem with randomly blowing away virtual machines.
Realizing that there is the possibility that the recovery could fail, they run Chaos Monkey between 9 AM and 3 PM on weekdays, so if a problem does occur, there will be people present who can deal with it. They also have a way for applications that they know are not ready for this to opt out.
This got me thinking about testing the agility of our teams.
One of the big reasons we do Agile development is so that we can change direction at sprint boundaries, if the priorities for delivering particular stories change. By finishing all their work by the end of the sprint, the team is able to change direction immediately.
Some teams have trouble understanding this. They resist breaking large stories into sprint-sized pieces, because they say it will increase the overall elapsed time to implement the change. This overlooks several things:
- If it takes you six months to implement the change, the customer's needs may have changed by the time you finish.
- If you try to test six months' worth of coding and it doesn't work, you have to wade through all that code to find the error. If you are implementing it in sprint-sized stories, you only have one sprint's worth of code to look through.
- Priorities may change. The Product Owner may need the team to implement some feature ahead of schedule to keep from losing a customer. If you are two months into a six-month project and you have dozens of modules open, it is very difficult to change direction.
For teams that cling to elapsed time as the only viable metric, I would propose engaging the Product Owner in the following exercise:
If you have several epics that have similar priorities, mix them up each Sprint. If the first epic has stories A, B, and C, and the second epic has stories P, Q, and R, and the team is currently working on story A, they will expect that in the next sprint they will be working on story B, and in the next, story C. Instead, have them work on story A, then P, then B, then Q, etc.
This will make transparent how agile they are, and will help get them out of the habit of assuming that if they don't finish a story in a sprint, they can just roll it over to the next with no consequences.
Of course, just as Netflix runs Chaos Monkey only during weekdays when people are present, you would want to be careful about how you do this exercise:
- Don't do it during a deadline crunch.
- Let the team know a sprint or two in advance that you are going to be doing this, and that you expect them to be able to handle it.
- Discuss in the Retrospective how well this worked, and what they can do to make it work better.
- A few stubborn teams may say that this is stupid and a waste of their time. You may have to remind them that the Product Owner is responsible for the priority of stories in the backlog, and the team is responsible for committing only to what they can accomplish in a Sprint.
Even teams that already buy into the idea of being able to change direction at sprint boundaries may discover impediments to doing so that they didn't see before.
Just as Netflix exercises their recovery software so they know it will work when they get a real failure, teams should exercise their agility regularly, so that when a critical customer demand comes along, they are practiced and ready for it.
One oft-mentioned feature of Lean manufacturing is the andon light or the andon cord. The idea is that any employee on the assembly line who encounters a problem pulls the andon cord, the line is stopped, and the light comes on to indicate where the problem is.
By the way, andon is the Japanese word for paper lantern, which has apparently been generalized to mean any lantern.
Although andon lights are frequently mentioned in the Agile development literature, some of their most important points are sometimes glossed over.
In A Study of the Toyota Production System, Shigeo Shingo, the industrial engineer noted for Toyota's SMED (Single Minute Exchange of Dies) program, which reduced setup times for punch presses from many hours to single-digit minutes, said:
The andon is a visual control that communicates important information and signals the need for immediate action by supervisor. There are some managers who believe that a variety of production problems can be overcome by implementing Toyota’s visual control system. At Toyota, however, the most important issue is not how quickly personnel are alerted to a problem, but what solutions are implemented. Makeshift or temporary measures, although they may restore the operation to normal most quickly, are not appropriate.
The key point is that each time the line is stopped because of the andon, the team strives to make sure that the same error does not happen again. Shingo states this more forcefully when he says “At Toyota, there is only one reason to stop the line—to ensure that it won’t have to stop again.”
The andon lights used by continuous integration teams come close to this philosophy. The light comes on when a build fails, or the automated tests that run as part of the build fail. Everyone on the team stops what they are doing and works on fixing the build.
What is missing sometimes is the idea of making sure that it doesn’t happen again, perhaps by implementing a Poka-yoke solution that will prevent the error from happening again.
When you read about the Toyota production system, what is striking is the commitment to doing whatever it takes to eliminate waste and errors, even at the expense of some short-term pain.
This is difficult to do in a mature company that is trying to adapt to Agile development, because it is a major change, and various departments are not used to working closely together. But it is essential to eliminate waste.
Online game companies are frequently at the forefront of technology, both in the technology of the games themselves and in how they are developed. For example, IMVU, a 3D online chat website, has been a leader in continuous deployment, deploying as many as 50 changes a day.
Another development leader was Cmune, a Chinese company that used to make the MMO (Massively Multiplayer Online) first-person shooter game, UberStrike. (UberStrike was "sunsetted" in June 2016.)
In case you are not familiar with first-person-shooter games, there are various levels in a game, each one more difficult than the previous ones. As a player proceeds through the game, gaining more points (however this is achieved in the mechanics of the particular game), the player ascends to harder and harder levels.
Each level has a map, which defines the terrain and buildings that the player has to negotiate while playing the level. Designing a map is a two-pronged affair. First the terrain and buildings have to be defined in such a way as to be fun to play. Next they have to be modeled and textured, so they look realistic.
Traditionally, both steps are completed before the level is made available to players. If the majority of players decide that the level is too easy or too difficult, then all of the effort in modeling and texturing it is wasted.
Cmune decoupled these two steps for UberStrike with their Bluebox Maps program. Proposed level maps were made available to interested customers. They were not textured (they had a uniform blue color, thus the name of the program), and high-quality modeling had not been completed. Also, game mechanics, such as shooting, were not implemented.
Participating customers could download a Bluebox map and try it out, in order to determine whether it would be fun to play. Based on the feedback Cmune received on a map, they either continued with the high-quality modeling and texturing, or they discarded the map.
Developers outside of the game world can learn from the Bluebox program. When we think about getting feedback from customers, we usually think of showing them completed features. Since we break even large epics down into sprint-sized pieces, the feature we are demonstrating to the customer may be a small, incremental change, but it is generally complete.
In some cases, it may be beneficial to break changes into even smaller pieces, large enough that the customer can see if we are going in the right direction, but not polished enough to actually release.
This must be done with caution, particularly if we are using continuous integration, or other SCM methodologies where everything gets checked into the main branch. (For some hints, see my previous post, Small Stories, Legacy Code, and Scaffolding.) Perhaps a feature flag can be added, so when it is turned on, it lets the customer go as far as the part being demonstrated, and then stops.
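The feature-flag approach can be sketched very simply; the flag name and the checkout function below are invented for illustration, assuming a flag that is off for normal users and on only for the customers previewing the work:

```python
# Gate an unfinished feature behind a flag so it can be demonstrated
# without being released. (Flag and function names are hypothetical.)
FEATURE_FLAGS = {"preview_checkout": False}  # off for normal users

def checkout(cart, flags=FEATURE_FLAGS):
    if not flags.get("preview_checkout"):
        return "feature unavailable"
    # Demo path: let the customer go as far as the part being shown...
    total = sum(item["price"] for item in cart)
    # ...then stop before the unfinished payment step.
    return f"demo stops here: total would be {total}"
```

Turning the flag on for a demo session lets the customer walk up to the edge of the unfinished work; everyone else never sees it, even though the code is checked into the main branch.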
One of the buzzwords of Agile development is failing fast. The sooner you can find out that what you are developing is not what the customer wants, the sooner you can change course, without a lot of wasted development time.
Poka-yoke (the final "e" is pronounced like "eh?" in English) is the Japanese term for "error proofing", formalized by industrial engineer Shigeo Shingo as part of the Toyota Production System. (He is said to have picked the term "error proofing" rather than "fool proofing" [baka-yoke] to underscore that the problem was not foolish workers, but the fact that everyone makes mistakes from time to time.)
In the Toyota Production System, poka-yoke deals with designing vehicles and their assembly processes in such a way that it is difficult to assemble them incorrectly.
Here is an example from the 1940s of the need for poka-yoke: my father worked at Kaiser-Frazer when they made a car called the Frazer. It had "FRAZER" spelled out in individual chrome letters on the front of the hood.
Posts that stuck out the back of each letter went through holes in the hood to hold the letters in place. The positions of the posts were standardized, making all the letters interchangeable. Sometimes, when an assembly worker was having a bad day, he would end up making ZEFFERs instead of FRAZERs. Automakers subsequently changed the letters so they could not be interchanged, either by putting the posts in a different position on each letter, or by making the logo a single unit.
A more modern example is the floppy disk drives that used to be in PCs. They used an unkeyed four-pin power connector that could be installed the right way, which would make the drive work, or the wrong way, which would make it go up in smoke. You had to remember that the red wire went away from the data connector, unless you were working with the odd brand where the red wire had to go toward the data connector. (I fried several floppy disk drives this way.) That was the last unkeyed connector I remember seeing on a PC, so the PC industry evidently adopted poka-yoke.
There are all kinds of modern examples, like polarized power plugs and IKEA furniture that is difficult to assemble incorrectly, and the same concept applies to software development.
For example, suppose you have a routine that performs several different functions. (That, in itself, is probably a violation of the Single Responsibility Principle, but suppose there is a good reason for it to be that way.) Some of the functions require few parameters, while others require many. Callers have to remember to pass the right number of parameters, including dummy parameters. So you get something like this:
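A sketch of such a call, with an invented routine and parameter list, might look like this:

```python
# One routine that performs several functions, with a long positional
# parameter list; simple calls must still pass dummies for the unused
# slots. (The routine and its parameters are invented for illustration.)
def render(mode, width, height, depth, shading, lighting, color, overlay, dpi):
    return (mode, width, height)  # details omitted

# The caller has to count commas to hit the right slots:
render("thumbnail", 64, 64, 0, None, None, "gray", None, 0)
```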
If you miscount the commas, you have a problem. It would be much less error-prone to have multiple entry points, each with a fixed number of parameters.
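Continuing with the same hypothetical routine, the multiple-entry-point version gives each use case a fixed, small parameter list wrapped around one shared implementation:

```python
# Each entry point takes exactly the parameters its use case needs;
# callers never see the dummy slots. (Names are invented for illustration.)
def _render(mode, width, height, dpi=0, color="gray"):
    return (mode, width, height, dpi, color)

def render_thumbnail(width, height):
    return _render("thumbnail", width, height)

def render_print(width, height, dpi, color):
    return _render("print", width, height, dpi, color)

render_thumbnail(64, 64)  # no dummy parameters, no commas to miscount
```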
The same concept applies to development tools. In one build system I worked with, you have to type in the names of all the modules that are part of your change. If the build fails because you left out one of the changed modules, you have to re-submit the build request, again typing in the names of all the modules. If you have 30 or 40 modules in your change, you might mistype or leave out one of the names you got right in the first request, causing the build to fail again. If you could just call up the first request and say you wanted to add a module, there would be much less chance of error.
Another case is anywhere that you have to enter the same information into two different places. Eventually, you will forget to update one of them, or will mistype the information while entering it the second time. If the systems can be made to talk to each other, this greatly lessens the chance for error.
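One way to make the systems "talk to each other" is to keep a single authoritative record and derive every other copy from it, so the information is never typed a second time. A minimal sketch, with invented names and values:

```python
# Single source of truth for service ports; any other system that needs
# this information derives it instead of duplicating it by hand.
SERVICE_PORTS = {"api": 8080, "metrics": 9090}

def firewall_rules(ports):
    """Derive firewall configuration lines from the authoritative port map."""
    return [f"allow tcp/{port}  # {name}" for name, port in sorted(ports.items())]

firewall_rules(SERVICE_PORTS)
```

Change the port in one place and the derived copy cannot drift out of sync, which is exactly the error the duplicated entry invites.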
Continuous Deployment (CD), where changes are released several times a day, is popular among online game sites. IMVU, for example, is a very strong champion of CD, deploying as many as 50 times a day.
I used to use IMVU, and as far as I could tell, the release of changes was completely invisible to the user. But I have tried some other online games that use CD, and that was not the case for them. What frequently happens is that you are in the middle of playing, and a window pops up that says "The game has been enhanced. Please refresh your browser." And when you refresh, the game takes two minutes to load in all of its assets, which really disrupts the game play.
An even worse situation is a Real-Time Strategy wargame that a friend was playing. The game deployed a change in the battle rules, regarding how much food had to be sent along with troops in battle. The change took effect immediately, right in the middle of a battle, and all her troops starved to death because the amount of food she had sent was no longer sufficient.
In case some of you are thinking "These are just games. What's the big deal if your imaginary troops starve to death?", there are a couple of things to keep in mind. One is that there is a lot of money in online games, and incidents like this make your users angry enough to go play somebody else's game. Another is that the same thing could happen in any online system. Suppose someone is halfway through booking a hotel room, and you deploy a change that alters the quoted room rate. They will not be happy when their bill does not match what you quoted them.
It's probably not possible to come up with a universal design pattern to make live deployments transparent to users, but as a minimum, the change should not completely lose the user's state, should not require a bunch of time to reload assets that were already loaded, and should not change the outcome of transactions that are in process (like battles in a game, or hotel rooms that are being booked).
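One common tactic for the last of those requirements is to pin each in-flight transaction to the rule set that was current when it started, so a live deployment cannot change its outcome mid-flight. A minimal sketch, with the rule names and values invented to mirror the wargame story above:

```python
# Pin each transaction (here, a battle) to the rules in force when it
# began; deployments only affect transactions that start afterward.
RULES_V1 = {"food_per_troop": 2}
RULES_V2 = {"food_per_troop": 3}  # deployed mid-game

class Battle:
    def __init__(self, troops, rules):
        self.troops = troops
        self.rules = rules  # pinned at start; later deployments don't apply

    def food_needed(self):
        return self.troops * self.rules["food_per_troop"]

battle = Battle(troops=100, rules=RULES_V1)  # started before the deploy
battle.food_needed()  # 200 under the pinned V1 rules, even after V2 ships
```

New battles started after the deployment would be constructed with `RULES_V2`, so the change still rolls out, just not in the middle of someone's transaction.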