Chaos engineering is much more than a set of tools and rules. It involves adopting a culture in which teams trust each other and collaborate to build resiliency, advance innovation, and launch products and services. As Nora Jones, an original member of the Netflix chaos engineering team, put it, culture change starts when teams begin asking not “what happens if this fails” but “what happens when this fails.” The Netflix team has always been clear about both the technical re-education and the culture shift required to make chaos engineering a success.
To understand this culture shift, it helps to think back to when DevOps was new. Sure, people would say they were using DevOps tools, but that did not necessarily mean they were actually practicing DevOps. DevOps involves breaking down silos between different groups in an organization, creating an atmosphere of trust, and enabling collaboration, which are some of the same attributes a chaos engineering culture needs. True DevOps required people’s mindsets to shift before the new tools could be truly effective. You may be thinking: OK, if an organization is already practicing great DevOps, then surely it can’t be that big a culture leap to adopt chaos engineering?
Yes… and also very much no. Chaos engineering forces organizations to confront something they often don’t like talking about: failure. A strong collaborative culture will undoubtedly be an excellent foundation for your chaos engineering. However, there are (at least) two critical differences. Firstly, with DevOps, the parts of the organization that come together to work better have interdependencies that are known and understood. Chaos engineering cannot assume that: modern software architectures have become so complex, with so many moving parts, a vast number of known dependencies, and many unknown ones, that no single person actually understands the whole system anymore. Secondly, skipping the education part of chaos engineering can be dangerous, unethical, and a highly dubious career move. The part where you actually give different service owners a chance to learn, ask questions, and get on board is crucial. Put it this way: you really don’t want to be educating them about your chaos engineering experiment for the first time after the experiment is done, you’ve revealed a failure, and they’re busy fixing it.
Instead, consider implementing the following measures as part of your chaos engineering journey.
Be the change you want to see
Modeling the behaviors you’d like to see in those around you is the best way to start. You can encourage practices like collaboration and openness by codifying these values and skills into the way you operate as a company.
- Training and induction: Chaos engineering is not just a set of tools, but a fresh approach to failure and its relationship with success. That’s a skill that can be learnt in several scenarios, so consider building principles of chaos engineering into the onboarding process for new employees and making it part of the training opportunities you offer.
- Measure it: If your organization is serious about collaboration and trust, ensure that this is reflected in employees’ KPIs and assessed in their annual reviews.
- See and be seen: Join the growing chaos engineering community out there. Ask questions. Attend conferences, webinars and tutorials. Share relevant stories on social media.
- Start small and share your successes: One of the best ways to win people over to the idea that failure can be the prelude to success is to show them small projects that illustrate exactly what post-experiment analysis looks like, how changes are implemented, and how resiliency is bolstered. Once people understand, they tend to want a piece of it. Importantly, you’ve also helped take some of the stigma out of failure and shown them a way to stimulate a more innovative culture.
Creating trust
People tend to be wary of chaos engineering at the outset. They assume you plan to randomly break something they’re responsible for, which they’ll then have to fix and may later get blamed for. It is your job to show them that this is not how chaos engineering works. Follow our step-by-step guide in “Root Out Failures Before they Become Outages, with Chaos Engineering.”
In reality, chaos engineering experiments are carefully planned to minimize potential damage. That means taking into consideration the environment, the variables, lag, and even the time of day. It may sound obvious, but start with experiments during office hours, when those responsible for fixing any issues will be on hand. Too many times, infrastructure and operations teams are thrown in at the deep end with an urgent outage that’s costing the business money, and have to fix problems they don’t fully understand with incomplete data or tools. The experience burns many people. Ironically, this is exactly what good chaos engineering will help avoid in the future.
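The planning considerations above can be sketched as a simple pre-flight check that refuses to start an experiment unless basic safety conditions hold. This is only an illustrative sketch, not a prescribed implementation: the `ExperimentPlan` fields, the 09:00 to 17:00 weekday window, and the staging-only rule are all assumptions made for the example, and a real team would choose its own conditions.

```python
from dataclasses import dataclass
from datetime import datetime, time

# Hypothetical settings chosen for illustration only.
OFFICE_START = time(9, 0)
OFFICE_END = time(17, 0)
ALLOWED_ENVIRONMENTS = {"staging"}


@dataclass
class ExperimentPlan:
    name: str
    environment: str           # where the experiment will run
    owners_notified: bool      # service owners were briefed in advance
    responders_on_hand: bool   # people who can fix issues are available


def preflight_problems(plan: ExperimentPlan, now: datetime) -> list[str]:
    """Return the reasons the experiment should not start (empty if none)."""
    problems = []
    if plan.environment not in ALLOWED_ENVIRONMENTS:
        problems.append("environment not approved")
    # weekday() is 0 for Monday through 6 for Sunday.
    in_office_hours = now.weekday() < 5 and OFFICE_START <= now.time() < OFFICE_END
    if not in_office_hours:
        problems.append("outside office hours")
    if not plan.owners_notified:
        problems.append("service owners not notified")
    if not plan.responders_on_hand:
        problems.append("no responders on hand")
    return problems


if __name__ == "__main__":
    plan = ExperimentPlan("stop-one-cache-node", "staging", True, True)
    issues = preflight_problems(plan, datetime.now())
    print("OK to run" if not issues else f"Do not run: {issues}")
```

Returning a list of reasons rather than a bare yes or no keeps the check useful as a conversation starter with service owners, since it tells them exactly which condition is not yet met.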
Collaboration as part of the culture
Start building trust, and collaboration will follow. When there are known issues within a set of services, the first step is to bring together those responsible for a frank discussion about what is happening and why. In organizations where it has not traditionally been easy to escalate or resolve problems, this can feel alien at first. I’m always struck by how illuminating these discussions can be. They bring to light interdependencies that were not obvious at first, and they are an essential foundation for framing a successful chaos engineering experiment.
Success builds confidence
Starting small is a good idea, so go for the quick (but meaningful) wins. Alongside the obvious goal of improving Service A or launching System B, you should pursue a second goal in parallel: enabling a more honest and informed approach to failure. Some level of failure in a complex distributed system is a given. It cannot be avoided. But can it be predicted? Yes. Can its effects be mitigated? Yes. Can the underlying system be strengthened as a result? Yes.
Finally, whether you are starting out with your first experiments or your organization is already well acquainted with chaos engineering, there are times when it can be a good idea to ask for help. It could be a case of knowing how and when to escalate an issue internally. Alternatively, you can tap into the knowledge of the tightly knit chaos engineering community via events, training opportunities, and one-to-one relationships. Or your organization could call on the services of an external chaos engineering expert, especially if there’s pressure to get new digital services up and running in production in a short time. Bringing in external help enables organizations to accelerate the scale, scope, and coverage of testing, ultimately speeding up time to market and improving product quality and customer experience.