You probably have a folder of old technical incident writeups — that time you dropped an important database table, or accidentally disabled an error rate monitor, or sent every customer a test email. Every company has a corporate euphemism for these writeups; I’ll call them “postmortems.”
In Trevor Kletz’s advice to process plant managers, this history takes physical form: the little ‘black book.’
> Keep a ‘black book’ or ‘memory book’ in each control room, a folder of reports on accidents that have occurred. Do not include falls and bruises, but only accidents of technical interest, and include accidents from other plants. [1]
Now be honest: do you read your own postmortems? Has anyone revisited them since they were first written and approved? “The black book should be compulsory reading for newcomers, at all levels,” Kletz continues, “and old hands should dip into it from time to time to refresh their memories.” Postmortem writeups shouldn’t just serve to marshal an initial response; if you’re serious about preventing repeat incidents, they should be a living part of your operational culture.
In their control room, a process plant manager like Kletz could see evidence of that role in wear and tear.
> If spotlessly clean, like poetry books in libraries, they are probably never read and the reasons for this should be sought. Perhaps they are incomprehensible. [… One] may be surprised how often they are out-of-date, or cannot readily be found or are spotlessly clean. [2]
Software engineers often ignore this secondary role of incident postmortems — to teach future engineers and incident-responders — in favor of immediately pragmatic tasks: satisfy requirements from on high; assuage fears; outline a narrow root cause. Managers of software engineers aren’t much better.
You can have a laudable incident management culture — a blameless and honest one, earnest about improving your processes — that still shits out dry, perfunctory, box-ticking reports for nobody to read ever again.
Instead, make postmortem readability an explicit value (not just “this can be read,” but “I want to read it”). A postmortem’s author should step back from the obligatory template and consider what they want various readers to remember.
Audience | Questions to consider |
---|---|
Peer engineers | Is this failure part of a pattern of preventable incidents in your system? Has some link in this incident’s causal chain cropped up in other incidents? Do other parts of your product have similar failure modes? Even if you won’t rewrite the code in remediation, could this have been pre-empted with a different design/implementation? |
New hires | What’s the single most important rule or reminder to take away? Does the incident demonstrate anything unique about your system, a characteristic failure mode the uninitiated wouldn’t expect? |
Outsiders | How would you describe the incident to a friend on another team, or one writing software at a different company? How would you describe it to a chemical process engineer? Is that narrative (the memorable, widely relatable version) clear in your written report? |
These ‘black book’ summaries can go in your primary report or live in some team-internal folder. They shouldn’t interfere with writing a precise and descriptive report, and your responsibility to write something interesting doesn’t override responsibilities to your colleagues (fairness) or customers (objectivity).
Postmortems can be great reads. The NTSB and USCSB have small cult followings, largely outside their respective industries; I only arrived at “Safety Through Incompatibility” because the USCSB’s YouTube backlog tickles my brain. A failure that seems highly specific may exemplify a much broader class of failures outside your domain, just as chemical process failures are (surprise!) good analogies for crummy software engineering.
Keep in mind: it’s a privilege (one worth exercising) to treat incident management lightheartedly. The tradition of keeping ‘black books’ and revisiting them was a hard-learned lesson in process engineering; I remember Kletz’s somber epigraph for Learning from Accidents:
This book has been written
to remember the dead and injured
and to warn the living