October 21, 2016: a Harcros Chemicals delivery truck arrives at the MGPI Processing plant in Atchison, Kansas for a routine early-morning delivery of sulfuric acid. Together with a plant operator, the driver runs a hose from the truck’s trailer tank of pressurized sulfuric acid to one of the facility’s fill lines, checks for leaks, and starts the transfer of acid.
Not fifteen minutes later, a yellow-green cloud erupts from one of the facility’s holding tanks. It swallows the delivery truck. It engulfs the nearby control building; the operators scramble from the plant on foot. The driver takes refuge in a wastewater treatment plant nearby, and the green cloud — highly toxic chlorine gas — drifts northeast, on the wind, over Atchison.
How does a routine delivery cause an uncontrolled release of chlorine? Through a reaction between the delivered acid and a tank of sodium hypochlorite. At the MGPI plant, stocks of these two “incompatible” chemicals are loaded from the same connection area, using identical hose connections. The driver connected to the wrong unmarked fill line.
Anyone who’s done their time on a rushed software engineering team has cleaned up an analogous accident: two things, usually separated, are mixed by accident or mistaken design.
What did it cost you? A couple hours of work and some griping in Slack? For a chemical process engineer (or any other species of real engineer), the stakes are astronomically higher. We have much to learn.
Industrial chemical accidents in the U.S. are subject to investigation by the U.S. Chemical Safety and Hazard Investigation Board (USCSB). According to their report, the Atchison incident teaches a “Key Lesson”: if components should never be connected, make it impossible to connect them!
Work with chemical distributors to select hose couplings and fill line connections with uniquely shaped and color-coded fittings for each chemical or class of chemicals, especially where severe chemicals are unloaded in close proximity. This can include a combination of accepted fittings with unique shapes (e.g., square for acids, hexagon for bases) or different sized diameters (e.g., 2-inch or 3-inch round) for each fill line.2
An early-career software engineer generally works additively. One starts with some raw stuff (an empty project structure, then some language primitives, some data…) and accrues features and affordances until the program is at least capable of completing some task.
Considering capabilities as hazards doesn’t come as naturally, even though examples abound. There are examples in our ancient jokes: rm is hazardously compatible with your precious root directory. That’s fine for a library, but death in a system3 — you don’t need an API endpoint that drops your production database, even though your database libraries could do so.
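As a minimal sketch of designing that capability away (the ReadOnlyStore interface and its method are invented for illustration, not from any particular codebase): wrap the full-powered database handle in a narrow interface, so the rest of the system never holds the dangerous capability at all.

package main

import (
	"context"
	"database/sql"
)

// ReadOnlyStore exposes only the capabilities the handlers actually need.
type ReadOnlyStore interface {
	CustomerName(ctx context.Context, id int) (string, error)
}

// sqlStore keeps the full-powered handle private. Nothing holding a
// ReadOnlyStore can reach db.Exec, so "drop the production database"
// is not a capability the rest of the system has.
type sqlStore struct{ db *sql.DB }

func (s sqlStore) CustomerName(ctx context.Context, id int) (string, error) {
	var name string
	err := s.db.QueryRowContext(ctx,
		"SELECT name FROM customers WHERE id = $1", id).Scan(&name)
	return name, err
}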
Unfortunately for our safety analysis, the raw stuff of software engineering is less definite than the raw stuff of chemical plants (you know, the chemicals).
One valuable analogy is types. A function’s parameters are its fill lines — they constrain what sorts of data can be pumped into it. Whereas chemical process engineers use incompatible hose couplings, programmers can elect to use incompatible types to avoid mix-ups.
Strongly typed identifiers are a classic example of this principle. While there’s some underlying primitive type (e.g. an integer for auto-incrementing IDs, or a string representation for a UUID4), it’s categorically incorrect to use an ID from one table to perform a lookup in another.
You can incorrectly mix IDs from different tables in queries, of course. Consider this buggy SQL query:
SELECT cart.id, cart.customer_id
FROM customers AS customer
JOIN carts AS cart
-- This *should* match cart.customer_id to customer.id, but doesn't.
ON customer.id = cart.id;
A library may allow this kind of mistake, but your program shouldn’t. A good ORM should support strongly typed identifiers out of the box. You may need to carefully massage raw ID representations into the correct ID types at your system boundaries (e.g. in your API handlers), but then you can code within that boundary without paranoia.
These kinds of mistakes happen. They make nightmarish bugs: updates meant for one entity are applied to an arbitrary other entity in its table (Customer A adds an item to Customer B’s cart — yikes!). They also slip through code review, especially in tandem with over-abbreviated variable names. A team I worked with recently switched to typed IDs, and uncovered a handful of these bugs in production, tucked away in sleepy code-paths where their filthy impact went unnoticed.
// Less safe: inadvertent mixing.
func unsafeDeleteCustomer(id int) { ... }

func unsafeOperation(c Cart) {
	// Compiles, but uses a cart ID to look up a customer.
	unsafeDeleteCustomer(c.id)
}

// More safe, and more expressive to boot!
func deleteCustomer(id CustomerId) { ... }

func operation(c Cart) {
	// Compiler error! c.id is not a CustomerId.
	deleteCustomer(c.id)
}
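On that boundary-massaging point: here is a minimal sketch of what the conversion might look like, assuming a CustomerId wrapper type. The ParseCustomerId helper is hypothetical, not part of any particular ORM.

import "fmt"

// CustomerId wraps the raw primitive so it can't mix with other IDs.
type CustomerId int

// ParseCustomerId is the one place a bare int becomes a CustomerId.
// Call it at the system boundary (say, an API handler that just
// decoded a request); everything inside works with the typed value.
func ParseCustomerId(raw int) (CustomerId, error) {
	if raw <= 0 {
		return 0, fmt.Errorf("invalid customer id: %d", raw)
	}
	return CustomerId(raw), nil
}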
This type-level safety through incompatibility is the generalized response to the code smell Martin Fowler terms “primitive obsession.”5 Using proper closed enums over grab-bags of named values is another neat example; as with typed IDs, a good ORM and a good database schema do much of the heavy lifting.
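A rough Go sketch of the enum case (Go only approximates closed enums through typed constants, and the OrderStatus type here is invented for illustration):

// OrderStatus is a closed set of states, not a grab-bag of strings.
type OrderStatus int

const (
	StatusPending OrderStatus = iota
	StatusShipped
	StatusDelivered
)

// markShipped accepts only OrderStatus values; passing a string or
// an int variable is a compile error.
func markShipped(s OrderStatus) OrderStatus {
	if s == StatusPending {
		return StatusShipped
	}
	return s
}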
Other incompatibilities are more specific to your system, your business logic, or the quirks of the data you inherited; these are harder to spot, but just as important. I recently worked on a system where two different sources of activity data were keyed differently:
- Granular transactions, dated by when they occurred.
- Daily summaries, dated by when they are generated. The summary with date August 10, 2024 actually summarizes transactions dated to August 9, 2024!

The difference in those dates — a poor data model design, to be sure — was a constant stumbling block for new engineers: it seemed like the two dates would naturally correspond, but treating them interchangeably (comparing them to join sets of transactions with summaries) is completely misleading!6
We initially tried to solve this with what the USCSB would term an “administrative control”: authors were expected to understand the difference between the two dates, and reviewers were expected to confirm they were used correctly. Even if you can remember to do it, this is hard code to refactor. That activityDate argument your function receives — what kind of date is it? Has it already been shifted?
Selective incompatibility proved a stronger solution. These were both dates, sure, but they should not mix. Define two wrapper types for dates — SnapshotDate and TransactionDate — that can’t be directly compared. Instead, define projections from one type into the other, as in the sketch below.
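A minimal sketch of those wrappers, assuming Go and the one-day shift described above (the type and method names are invented for illustration):

import "time"

// TransactionDate is the day a transaction actually occurred.
type TransactionDate struct{ day time.Time }

// SnapshotDate is the day a summary was generated: one day after the
// transactions it covers.
type SnapshotDate struct{ day time.Time }

// CoveredTransactionDate projects a snapshot date onto the transaction
// date it summarizes; the only sanctioned way to cross between types.
func (s SnapshotDate) CoveredTransactionDate() TransactionDate {
	return TransactionDate{s.day.AddDate(0, 0, -1)}
}

Because the two are distinct types, comparing or assigning one to the other no longer compiles; any join between transactions and summaries has to pass through the projection, which encodes the one-day shift in exactly one place.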
The higher-level lesson for a software engineer reading the Atchison incident report isn’t specific to “accidental mixing” (whether acids or IDs). This incident analysis, with its parallels in software, illustrates the value of drawing safety lessons from older, more spectacularly dangerous engineering disciplines.7 There is nothing new under the sun.
Nothing new for process safety engineers, either.
Trevor Kletz’s 1985 book An Engineer’s View of Human Error describes “wrong connections.” His explanation of the concept gives an example from the dawn of anesthesia:
Figure 2.13 shows the simple apparatus devised in 1867, in the early days of anaesthetics, to mix chloroform vapour with air and deliver it to the patient. If it was connected up the wrong way round liquid chloroform was blown into the patient with results that could be fatal. Redesigning the apparatus so that the two pipes could not be interchanged was easy; all that was needed were different types of connection or different sizes of pipe. Persuading doctors to use the new design was more difficult and the old design was still killing people in 1928. Doctors believed that highly skilled professional men would not make such a simple error but as we have seen everyone can make slips, however well-motivated and well-trained; in fact, slips occur only when well-trained.8
The book’s third edition (2001) includes another example:
Do not assume that chemical engineers would not make similar errors. In 1989, in a polyethylene plant in Texas, a leak of ethylene exploded, killing 23 people. The leak occurred because a line was opened for repair while the air-operated valve isolating it from the rest of the plant was open. It was open because the two compressed air lines, one to open the valve and one to close it, had identical couplings, and they had been interchanged. […]9
The problem wasn’t solved in 1989, and it wasn’t solved in 2016; I’d bet the U.S. still hasn’t consigned inadvertent mixing accidents to history. The ultra-serious process engineers are still learning from their past mistakes. So should we — from their mistakes and ours.