October 21, 2016: a Harcros Chemicals delivery truck arrives at the MGPI Processing plant in Atchison, Kansas for a routine early-morning delivery of sulfuric acid. Together with a plant operator, the driver runs a hose from the truck’s trailer tank of pressurized sulfuric acid to one of the facility’s fill lines, checks for leaks, and starts the transfer of acid.
Not fifteen minutes later, a yellow-green cloud erupts from one of the facility’s holding tanks. It swallows the delivery truck. It engulfs the nearby control building; the operators scramble from the plant on foot. The driver takes refuge in a wastewater treatment plant nearby, and the green cloud — highly toxic chlorine gas — drifts northeast, on the wind, over Atchison.
How does a routine delivery cause an uncontrolled release of chlorine? Through a reaction between the delivered acid and a tank of sodium hypochlorite. At the MGPI plant, stocks of these two “incompatible” chemicals are loaded from the same connection area, using identical hose connections. The driver connected to the wrong unmarked fill line.
Anyone who’s done their time on a rushed software engineering team has cleaned up an analogous accident: two things, usually separated, are mixed by accident or mistaken design.
What did it cost you? A couple hours of work and some griping in Slack? For a chemical process engineer (or any other species of real engineer), the stakes are astronomically higher. We have much to learn.
Industrial chemical accidents in the U.S. are subject to investigation by the U.S. Chemical Safety and Hazard Investigation Board (USCSB). According to their report, the Atchison incident teaches a “Key Lesson”: if components should never be connected, make it impossible to connect them!
Work with chemical distributors to select hose couplings and fill line connections with uniquely shaped and color-coded fittings for each chemical or class of chemicals, especially where severe chemicals are unloaded in close proximity. This can include a combination of accepted fittings with unique shapes (e.g., square for acids, hexagon for bases) or different sized diameters (e.g., 2-inch or 3-inch round) for each fill line.2
An early-career software engineer generally works additively. One starts with some raw stuff (an empty project structure, then some language primitives, some data…) and accrues features and affordances until the program is at least capable of completing some task.
Considering capabilities as hazards doesn’t come as naturally, even though examples abound. There are examples in our ancient jokes: rm is hazardously compatible with your precious root directory. That’s fine for a library, but death in a system3 — you don’t need an API endpoint that drops your production database, even though your database libraries could do so.
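As a minimal sketch of designing that capability away (the ReadOnlyStore interface and its method are invented for illustration, not from any particular codebase): wrap the full-powered database handle in a narrow interface, so the rest of the system never holds the dangerous capability at all.

package main

import (
	"context"
	"database/sql"
)

// ReadOnlyStore exposes only the capabilities the handlers actually need.
type ReadOnlyStore interface {
	CustomerName(ctx context.Context, id int) (string, error)
}

// sqlStore keeps the full-powered handle private. Nothing holding a
// ReadOnlyStore can reach db.Exec, so "drop the production database"
// is not a capability the rest of the system has.
type sqlStore struct{ db *sql.DB }

func (s sqlStore) CustomerName(ctx context.Context, id int) (string, error) {
	var name string
	err := s.db.QueryRowContext(ctx,
		"SELECT name FROM customers WHERE id = $1", id).Scan(&name)
	return name, err
}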
Unfortunately for our safety analysis, the raw stuff of software engineering is less definite than the raw stuff of chemical plants (you know, the chemicals).
One valuable analogy is types. A function’s parameters are its fill lines — they constrain what sorts of data can be pumped into it. Whereas chemical process engineers use incompatible hose couplings, programmers can elect to use incompatible types to avoid mix-ups.
Strongly typed identifiers are a classic example of this principle. While there’s some underlying primitive type (e.g. an integer for auto-incrementing IDs, or a string representation for a UUID4), it’s categorically incorrect to use an ID from one table to perform a lookup in another.
You can incorrectly mix IDs from different tables in queries, of course. Consider this buggy SQL query:
SELECT cart.id, cart.customer_id
FROM customers AS customer
JOIN carts AS cart
-- This *should* match cart.customer_id to customer.id, but doesn't.
ON customer.id = cart.id;
A library may allow this kind of mistake, but your program shouldn’t. A good ORM should support strongly typed identifiers out of the box. You may need to carefully massage raw ID representations into the correct ID types at your system boundaries (e.g. in your API handlers), but then you can code within that boundary without paranoia.
These kinds of mistakes happen. They make nightmarish bugs: updates meant for one entity are applied to an arbitrary other entity in its table (Customer A adds an item to Customer B’s cart — yikes!). They also slip through code review, especially in tandem with over-abbreviated variable names. A team I worked with recently switched to typed IDs, and uncovered a handful of these bugs in production, tucked away in sleepy code-paths where their filthy impact went unnoticed.
// Less safe: inadvertent mixing.
func unsafeDeleteCustomer(id int) { ... }

func unsafeOperation(c Cart) {
	// Compiles, but uses a cart ID to look up a customer.
	unsafeDeleteCustomer(c.id)
}

// More safe, and more expressive to boot!
func deleteCustomer(id CustomerId) { ... }

func operation(c Cart) {
	// Compiler error! c.id is not a CustomerId.
	deleteCustomer(c.id)
}
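On that boundary-massaging point: here is a minimal sketch of what the conversion might look like, assuming a CustomerId wrapper type. The ParseCustomerId helper is hypothetical, not part of any particular ORM.

import "fmt"

// CustomerId wraps the raw primitive so it can't mix with other IDs.
type CustomerId int

// ParseCustomerId is the one place a bare int becomes a CustomerId.
// Call it at the system boundary (say, an API handler that just
// decoded a request); everything inside works with the typed value.
func ParseCustomerId(raw int) (CustomerId, error) {
	if raw <= 0 {
		return 0, fmt.Errorf("invalid customer id: %d", raw)
	}
	return CustomerId(raw), nil
}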
This type-level safety through incompatibility is the generalized response to the code smell Martin Fowler terms “primitive obsession.”5 Using proper closed enums over grab-bags of named values is another neat example; as with typed IDs, a good ORM and a good database schema do much of the heavy lifting.
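A rough Go sketch of the enum case (Go only approximates closed enums through typed constants, and the OrderStatus type here is invented for illustration):

// OrderStatus is a closed set of states, not a grab-bag of strings.
type OrderStatus int

const (
	StatusPending OrderStatus = iota
	StatusShipped
	StatusDelivered
)

// markShipped accepts only OrderStatus values; passing a string or
// an int variable is a compile error.
func markShipped(s OrderStatus) OrderStatus {
	if s == StatusPending {
		return StatusShipped
	}
	return s
}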
Other incompatibilities are more specific to your system, your business logic, or the quirks of the data you inherited; these are harder to spot, but just as important. I recently worked on a system where two different sources of activity data were keyed differently:
- Granular transactions, dated by when they occurred.
- Daily summaries, dated by when they are generated. The summary with date August 10, 2024 actually summarizes transactions dated to August 9, 2024!

The difference in those dates — a poor data model design, to be sure — was a constant stumbling block for new engineers: it seemed like the two dates would naturally correspond, but treating them interchangeably (comparing them to join sets of transactions with summaries) is completely misleading!6
We initially tried to solve this with what the USCSB would term an “administrative control”: authors were expected to understand the difference between the two dates, and reviewers were expected to confirm they were used correctly. Even if you can remember to do it, this is hard code to refactor. That activityDate argument your function receives — what kind of date is it? Has it already been shifted?
Selective incompatibility proved a stronger solution. These were both dates, sure, but they should not mix. Define two wrapper types for dates — SnapshotDate and TransactionDate — that can’t be directly compared. Instead, define projections from one type into the other, as in the sketch below.
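A minimal sketch of those wrappers, assuming Go and the one-day shift described above (the type and method names are invented for illustration):

import "time"

// TransactionDate is the day a transaction actually occurred.
type TransactionDate struct{ day time.Time }

// SnapshotDate is the day a summary was generated: one day after the
// transactions it covers.
type SnapshotDate struct{ day time.Time }

// CoveredTransactionDate projects a snapshot date onto the transaction
// date it summarizes; the only sanctioned way to cross between types.
func (s SnapshotDate) CoveredTransactionDate() TransactionDate {
	return TransactionDate{s.day.AddDate(0, 0, -1)}
}

Because the two are distinct types, comparing or assigning one to the other no longer compiles; any join between transactions and summaries has to pass through the projection, which encodes the one-day shift in exactly one place.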
The higher-level lesson for a software engineer reading the Atchison incident report isn’t specific to “accidental mixing” (whether acids or IDs). This incident analysis, with its parallels in software, illustrates the value of drawing safety lessons from older, more spectacularly dangerous engineering disciplines.7 There is nothing new under the sun.
Nothing new for process safety engineers, either.
Trevor Kletz’s 1985 book An Engineer’s View of Human Error describes “wrong connections.” His explanation of the concept gives an example from the dawn of anesthesia:
Figure 2.13 shows the simple apparatus devised in 1867, in the early days of anaesthetics, to mix chloroform vapour with air and deliver it to the patient. If it was connected up the wrong way round liquid chloroform was blown into the patient with results that could be fatal. Redesigning the apparatus so that the two pipes could not be interchanged was easy; all that was needed were different types of connection or different sizes of pipe. Persuading doctors to use the new design was more difficult and the old design was still killing people in 1928. Doctors believed that highly skilled professional men would not make such a simple error but as we have seen everyone can make slips, however well-motivated and well-trained; in fact, slips occur only when well-trained.8
The book’s third edition (2001) includes another example:
Do not assume that chemical engineers would not make similar errors. In 1989, in a polyethylene plant in Texas, a leak of ethylene exploded, killing 23 people. The leak occurred because a line was opened for repair while the air-operated valve isolating it from the rest of the plant was open. It was open because the two compressed air lines, one to open the valve and one to close it, had identical couplings, and they had been interchanged. […]9
The problem wasn’t solved in 1989, and it wasn’t solved in 2016; I’d bet the U.S. still hasn’t consigned inadvertent mixing accidents to history. The ultra-serious process engineers are still learning from their past mistakes. So should we — from their mistakes and ours.