I’ve been studying corporate enterprise systems for thirty years and have worked with about eighty different organizations in the process. Most of them had mission-critical applications that failed a lot. I was the software “jack of all trades” who came along to remedy those failures.
Out of that experience, some incredibly strong patterns have emerged, which have illuminated my problem solving this whole time. These patterns are themselves consistent, reliable, and solid—even though the phenomena they describe are complex, expensive, and often painful to deal with.
Pattern: Change is the enemy
First off, if a working system stops working, it’s because something has changed. That’s obvious, right? If your enterprise system is working, and if you’re able to keep everything around it from changing, you will have perfect reliability—so any failure must be a result of some kind of change.
That means changes to code, configuration, operating systems, APIs, and database schemas. But it might also mean a change in availability, which can happen suddenly and without warning. And what about changes that are entirely unauthorized, such as malicious changes that should have been spotted by your intrusion detection system?
So we start with an incredibly basic principle: If you want a guarantee of zero risk, it means you have to have zero change.
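To make the idea concrete, here is a minimal sketch of how one might notice that something has changed, assuming the things you care about (configuration files, deployed scripts, schema dumps) exist as files on disk. The function names `snapshot` and `changed` are hypothetical, invented for this illustration, and the approach covers only file contents, not availability or anything else that changes at run time.

```python
import hashlib
from pathlib import Path


def snapshot(paths):
    """Map each path to the SHA-256 digest of its contents (None if the file is missing)."""
    result = {}
    for p in paths:
        path = Path(p)
        if path.is_file():
            result[str(p)] = hashlib.sha256(path.read_bytes()).hexdigest()
        else:
            result[str(p)] = None  # a missing file is recorded, not skipped
    return result


def changed(baseline, current):
    """Return the sorted paths whose digest differs between two snapshots."""
    keys = set(baseline) | set(current)
    return sorted(k for k in keys if baseline.get(k) != current.get(k))
```

Take a snapshot while the system is known to be working, take another when it fails, and compare the two. An empty result does not prove that nothing changed. It only means that none of the tracked files did.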