Learn a new word: Fault Tolerance
Why does your car continue to run if one of the tires goes flat?
How was Sully able to still steer and point the plane, eventually landing in the Hudson River, when both of the plane's engines had lost power?
How are our organizations able to carry on (more or less) when something goes wrong, when someone fails to get the email, or when Jerry in accounting just screws up?
It's called Fault Tolerance, and it's today's entry in the wildly popular 'Learn a new word' series. First, some definitions from our pals at Wikipedia:
Fault tolerance is the property that enables a system to continue operating properly in the event of the failure of (or one or more faults within) some of its components. If its operating quality decreases at all, the decrease is proportional to the severity of the failure, as compared to a naively designed system in which even a small failure can cause total breakdown. Fault tolerance is particularly sought after in high-availability or life-critical systems. The ability of maintaining functionality when portions of a system break down is referred to as graceful degradation.
A fault-tolerant design enables a system to continue its intended operation, possibly at a reduced level, rather than failing completely, when some part of the system fails. The term is most commonly used to describe computer systems designed to continue more or less fully operational with, perhaps, a reduction in throughput or an increase in response time in the event of some partial failure. That is, the system as a whole is not stopped due to problems either in the hardware or the software. A structure is able to retain its integrity in the presence of damage due to causes such as fatigue, corrosion, manufacturing flaws, or impact.
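For the more code-minded readers, here is a minimal sketch of what "continue at a reduced level" can look like in software. It is only an illustration, not anything from the Wikipedia entry: the names `fetch_with_fallback`, `live_price`, and `cached_price` are made up for this example, and the "pricing service" is simulated by a function that always fails.

```python
def fetch_with_fallback(primary, fallback):
    """Try the primary source; if it fails, degrade to the fallback.

    Returns a (value, degraded) pair so the caller knows when it is
    running at reduced quality instead of crashing outright.
    """
    try:
        return primary(), False
    except Exception:
        # The primary component failed; keep the system running
        # with a lower-quality answer rather than stopping entirely.
        return fallback(), True


def live_price():
    # Hypothetical primary component, simulated here as always failing.
    raise ConnectionError("pricing service is down")


def cached_price():
    # Hypothetical fallback: a stale but usable value.
    return 9.99


price, degraded = fetch_with_fallback(live_price, cached_price)
print(price, "(degraded)" if degraded else "(live)")
```

The point of returning the `degraded` flag is that the rest of the system can tell it is operating on a stale value, which is roughly the "graceful degradation" idea from the definition above.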
Why does fault tolerance matter?
Obviously it matters a ton in complex, mission-critical technologies and machines that rely on hundreds, if not thousands, of components, connections, and systems. If every single failure in a car, a plane, or a power delivery grid caused the entire system to crash and become inoperable, then, well, we would hardly ever drive or fly anywhere, and we'd be sitting in the cold and dark in our houses most of the time.
As the sage Bender once said, 'Screws fall all the time, sir. The world is an imperfect place.'
But why does fault tolerance matter more generally?
Because I think we don't spend nearly enough time thinking about what will happen when something goes wrong in our organizations, or in our lives for that matter. Even just thinking about bad things happening is so unpleasant for folks that we tend to underestimate the chances of them happening, and undervalue the impact when they do happen.
But the engineers who design systems and processes and machines with the idea of fault tolerance in mind seem to have come to terms with the inevitability of bad things happening, like both engines going dead on a jet plane, and have proactively designed the system response to such failures.
Put more simply, they know something is going to go wrong, because something ALWAYS goes wrong. The trick is knowing ahead of time not just that something will go wrong, but how to prepare the rest of the system and people and processes so that the thing that went wrong doesn't crash the entire system.
Something always goes wrong. In your car and in your semi-annual budget task force.
Be ready instead of surprised next time. Think about fault tolerance and what it means for your shop.