In a world of ever more complex systems, there is nothing more fragile than an attempt to make nothing fail. A system that assumes that everything must work is a system designed to fail. The reality of the world is that things will fail, and those failures must not bring down the whole business. As British Airways has amply demonstrated, a fragile system where everything fails is not good for business.
Many years ago I wrote some posts on the challenges of five nines in a distributed world. As systems increasingly deliver functionality through a combination of services, micro-services and networks, designing for failure becomes ever more important, and the foundation of designing for failure is assuming it will happen.
The issue with failure is not just to define what a service’s operating conditions are; it is to understand what those calling and interacting with a service must do when it is not operational. Services can be grouped into four classes according to how their callers must treat a failure:
- Assumed to Fail – These services should be assumed to not be operational; these could be external services or services designed to add “sparkle” to a solution. Callers should not only be able to operate when these services aren’t working, but should actually assume that not working is the norm. These services may in fact disappear entirely, never to return, and the callers should continue on regardless.
- Reboot Allowed – These are services whose availability is variable and may be up one moment and down the next. When they are up, they provide new functionality or information, but callers should handle, seamlessly, the service changing from available to unavailable from one call to the next.
- Minimum Viable Business (MVB) – These services should not fail, and when they do, it represents a degradation in operations. For example, if payment services aren’t available, a site cannot close off its transactions. Degraded business operations may involve knock-on effects and other services being shut down while the dependent service is fixed. These are often considered business-critical services.
- Minimum Viable Operations (MVO) – This is the bare minimum of operations for a business or solution. This is the set of services that cannot fail, as to do so would represent a safety-critical failure.
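To make the four classes concrete, here is a minimal sketch of how a caller might change its behaviour depending on the class of the service it is calling. Everything named here (`FailureClass`, `DegradedOperations`, `call_service`) is hypothetical and invented for illustration, and the policy for each class is my reading of the list above rather than a definitive implementation.

```python
from enum import Enum
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class FailureClass(Enum):
    """The four classes from the list above (names are illustrative)."""
    ASSUMED_TO_FAIL = 1
    REBOOT_ALLOWED = 2
    MINIMUM_VIABLE_BUSINESS = 3
    MINIMUM_VIABLE_OPERATIONS = 4


class DegradedOperations(Exception):
    """Signals that a business-critical (MVB) dependency is unavailable."""


def call_service(
    failure_class: FailureClass,
    call: Callable[[], T],
    fallback: Optional[T] = None,
) -> Optional[T]:
    """Invoke `call` and handle failure according to the service's class."""
    try:
        return call()
    except Exception as exc:
        if failure_class in (
            FailureClass.ASSUMED_TO_FAIL,
            FailureClass.REBOOT_ALLOWED,
        ):
            # Assumption: the caller carries on seamlessly with a fallback.
            return fallback
        if failure_class is FailureClass.MINIMUM_VIABLE_BUSINESS:
            # Assumption: the caller signals degraded operations so that
            # dependent functionality can be shut down while this is fixed.
            raise DegradedOperations("business-critical service down") from exc
        # MVO: the caller cannot absorb this failure, so it propagates as-is.
        raise
```

A caller of an "Assumed to Fail" recommendations service, for example, would pass `fallback=[]` and render the page without recommendations, while a caller of a payment service would let `DegradedOperations` travel up to whatever decides how the business degrades.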
When building anything, but particularly a service or micro-service, it is essential to understand what the true MVO is and what the MVB is. To engineer for an MVO is to engineer something that truly can’t fail, and that is expensive to do at scale. Whether it is the locks in a prison, the control system of an aircraft or the steering and brakes in a car, the MVO is a true minimum. The MVB is the very least a business requires to be considered valid. For British Airways, this would have been flight scheduling, check-in and gate allocation. It wouldn’t have included the website, booking, holiday search, loyalty program and a host of other elements.
Systems, especially networked systems and those designed with multiple independent parts, must be designed to fail correctly. A service that calls another must change the way it relies on that call depending on the type of failure scenario allowed, and most importantly, when the worst happens, it must be possible to deliberately fail a system down to the MVB and then the MVO. A system that cannot be managed down to minimum operations is not a system designed well for a crisis.
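Deliberately failing a system down can be sketched as a simple partition over a service registry: pick a floor (MVB first, then MVO) and shed everything classed below it. The registry, the `fail_down` function and the class assigned to each service are assumptions made for illustration. In particular, `safety_critical_core` is a hypothetical placeholder, and the classes given to the website, booking, holiday search and loyalty program are guesses; the only thing taken from the discussion above is that they sit below the MVB.

```python
from enum import IntEnum
from typing import Dict, List, Tuple


class FailureClass(IntEnum):
    """Ordered so that a higher value means a more critical service."""
    ASSUMED_TO_FAIL = 1
    REBOOT_ALLOWED = 2
    MINIMUM_VIABLE_BUSINESS = 3
    MINIMUM_VIABLE_OPERATIONS = 4


# Hypothetical registry; the classifications are illustrative assumptions.
SERVICES: Dict[str, FailureClass] = {
    "safety_critical_core": FailureClass.MINIMUM_VIABLE_OPERATIONS,
    "flight_scheduling": FailureClass.MINIMUM_VIABLE_BUSINESS,
    "check_in": FailureClass.MINIMUM_VIABLE_BUSINESS,
    "gate_allocation": FailureClass.MINIMUM_VIABLE_BUSINESS,
    "website": FailureClass.REBOOT_ALLOWED,
    "booking": FailureClass.REBOOT_ALLOWED,
    "holiday_search": FailureClass.ASSUMED_TO_FAIL,
    "loyalty_program": FailureClass.ASSUMED_TO_FAIL,
}


def fail_down(
    services: Dict[str, FailureClass], floor: FailureClass
) -> Tuple[List[str], List[str]]:
    """Split services into (keep running, shut down) for a given floor."""
    keep = sorted(name for name, cls in services.items() if cls >= floor)
    shed = sorted(name for name, cls in services.items() if cls < floor)
    return keep, shed


# First step in a crisis: fail down to the MVB.
keep, shed = fail_down(SERVICES, FailureClass.MINIMUM_VIABLE_BUSINESS)
print("keep:", keep)
print("shed:", shed)
```

Calling `fail_down` again with `FailureClass.MINIMUM_VIABLE_OPERATIONS` as the floor gives the second step. The point of the sketch is that the shedding order is decided at design time, in the registry, rather than discovered during the crisis.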
This article is published as part of the IDG Contributor Network.