Applying the “Hierarchy of Controls”

One of the US Chemical Safety Board’s “Key Lessons” from the Atchison, Kansas accidental mixing incident is that the operator in question, and others like it, should

apply the [National Institute for Occupational Safety and Health (NIOSH)] hierarchy of controls when evaluating controls and safeguards for preventing inadvertent mixing. For all chemical unloading activities that require human interaction, […] identify and address human factors issues that may increase the potential for an incorrect connection.1

The scourge of every Slack workspace I’ve ever joined: NIOSH’s inverted pyramid of controls. Save this bad boy to your work laptop. Apply liberally.

The Hierarchy’s basic premise is that there may be several ways to mitigate a hazard, but some of those strategies are stronger than others. From most to least effective: Elimination, Substitution, Engineering controls, Administrative controls, and Personal Protective Equipment (PPE).

I’ve had the privilege to work on several software teams that take systems of controls (in security, compliance, and plain program correctness) very seriously. Some practices from process engineering, e.g. the “blameless” retrospective reports we write after major failures, are already commonplace “best practices” in tech: after a major incident, the responding team will usually draw up some takeaways and propose some changes (to code or process) to prevent the issue from recurring.

That juncture — comparing candidate controls — is your cue to pull up the NIOSH diagram! The Hierarchy and its common examples are written for industrial environments, but the applicability is straightforward.

You might eliminate a long-disused table of sensitive data, substitute a vetted vendor for a sketchy one, introduce regression tests (an engineering control), or tweak your code review practices (an administrative one).
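To make the engineering/administrative distinction concrete, here’s a minimal sketch of a regression test acting as an engineering control. The table name and schema helper are invented for illustration, not taken from any real incident: once the sensitive table has been eliminated, the build itself refuses to let it come back, instead of relying on reviewers to remember why it left.

```python
# Hypothetical regression test: an Engineering control that enforces an
# Elimination decision. The schema helper and table name are invented for
# illustration; point this at however your project introspects its schema.
import sqlite3


def get_schema_table_names(conn: sqlite3.Connection) -> set[str]:
    """Return the names of every table in the connected database."""
    rows = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table'"
    ).fetchall()
    return {name for (name,) in rows}


def test_sensitive_archive_table_stays_eliminated():
    # A real suite would connect to a migrated test database; an in-memory
    # stand-in keeps the sketch self-contained.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY)")

    # The control: reintroducing the eliminated table fails the build,
    # rather than relying on reviewers to remember the incident.
    assert "users_ssn_archive" not in get_schema_table_names(conn)
```

The administrative version of the same safeguard is a note in the runbook; the engineering version is a failing build.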

Some teams emphasize “defense in depth” when they pick controls. A recent manager of mine put special emphasis on depth along the causal chain: can we add a control that makes recurrence less likely, along with a monitoring control for detecting it sooner?

That’s a useful principle, and it’s totally compatible with the Hierarchy of Controls model: a process plant doesn’t eschew PPE just because it has substituted a lesser hazard for a greater one; it, too, defends in depth. But practicing defense in depth without a Hierarchy of Controls has three main shortcomings:

  1. It prompts an “incremental” defense. The call for a “preventative” control subtly promotes solutions where the same failure mode can still structurally occur but is less likely. This underemphasizes options for Elimination and Substitution (e.g. stop retaining that sensitive user data rather than trying to secure it). This is part of the additive bias I discuss in “Safety Through Incompatibility.”

  2. It encourages quantity over quality. It underemphasizes the difference between Engineering controls and Administrative controls. Anything that must be read by engineers, enforced in code review, covered in training, etc. is an Administrative control. If “we’ll make sure it doesn’t happen again” is the best your team has to offer after an incident, you have no plan (only a shallow root-cause analysis).

  3. The last layer of defense, automated monitoring, is some mix of Administrative control (“thou shalt respond to this page”) and PPE (“don the masks and douse the flames before they spread”). It’s valuable only if it’s designed alongside the other controls: if you’re Substituting a component for the hazardous one, your monitoring improvements should cover the new component’s failure modes.

Just use the Hierarchy. The next time you’re working on an incident retrospective, tag the controls you propose by their type — Elimination, Substitution, Engineering, Administrative, and PPE. Refer to them by name when you review postmortem plans. Complex schemes of weak controls mean waste and risk (and wasted weekends, paged to death). No more!
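If your retrospective process runs through tooling rather than a doc, the tagging can even be checked mechanically. Here’s a rough sketch, assuming a made-up action-item structure and a review rule of my own choosing (flag plans composed entirely of weak controls):

```python
# Hypothetical sketch: tag postmortem action items with their level in the
# NIOSH hierarchy, ordered strongest (1) to weakest (5).
from dataclasses import dataclass
from enum import IntEnum


class ControlType(IntEnum):
    ELIMINATION = 1
    SUBSTITUTION = 2
    ENGINEERING = 3
    ADMINISTRATIVE = 4
    PPE = 5


@dataclass
class ActionItem:
    description: str
    control_type: ControlType


proposed = [
    ActionItem("Drop the long-disused sensitive-data table", ControlType.ELIMINATION),
    ActionItem("Add a schema regression test to CI", ControlType.ENGINEERING),
    ActionItem("Update the on-call runbook", ControlType.ADMINISTRATIVE),
]

# One possible review rule: flag plans made up entirely of weak controls.
if all(item.control_type >= ControlType.ADMINISTRATIVE for item in proposed):
    print("Only Administrative/PPE controls proposed; look for something stronger.")
else:
    # List the strongest proposed controls first when reviewing the plan.
    for item in sorted(proposed, key=lambda i: i.control_type):
        print(f"[{item.control_type.name}] {item.description}")
```

The ordering in the enum is the point: sorting or filtering by it puts the strongest proposed control at the top of the review.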