I’ve been studying corporate enterprise systems for thirty years and have worked with about eighty different organizations in the process. Most of them had mission-critical applications that failed a lot. I was the software “jack of all trades” who came along to remedy those failures.
Out of that experience, some incredibly strong patterns have emerged, and they have guided my problem solving ever since. The patterns themselves are consistent, reliable, and solid, even though the phenomena they describe are complex, expensive, and often painful to deal with.
Pattern: Change is the enemy
First off, if a working system stops working, it’s because something has changed. That’s obvious, right? If your enterprise system is working, and if you’re able to keep everything around it from changing, you will have perfect reliability—so any failure must be a result of some kind of change.
That includes code, configuration, operating systems, APIs, and database schemas. But it also might include availability, which could change suddenly and without warning. And what about changes that are entirely unauthorized, such as malicious changes that should have been spotted by your intrusion detection system?
So we start with an incredibly basic principle: If you want a guarantee of zero risk, it means you have to have zero change.
Pattern: Change is imperative
That having been said, no usable system is without change. Everything matters: regulations get revised, new products are launched, someone comes up with better ideas, someone with authority comes up with worse ideas, external data formats change, your markets become international, your transaction volume grows beyond its architectural limits, bits of technology become obsolete, cost structures fluctuate. Versions of cloud resources can change every few days, often with literally zero warning.
The problem is that change can cause your systems to fail, but you do not have the option of escaping change. You’re carrying fragile, interconnected systems through a dynamic business and regulatory world.
If only you could lock it all down! Control all the changes so there are no surprises! But alas.
Pattern: Change isn’t controlled
You might be using a container management platform like Rancher, or your enterprise might be sold on an infrastructure-as-code tool like Terraform. These go a long way toward controlling everything you can control.
Under a declarative tool, you describe the state you want and the tooling works to make reality match it. In principle, you would not have to respond to change because change would be responding to you. That’s an especially important (implied, not stated) part of Terraform’s appeal.
Additionally, many shops have dedicated time for a (weekly or so) Change Control meeting. In a certain Fortune 100 shop I worked in, those procedures were absolutely necessary, and they helped a lot. Did they eliminate the need for troubleshooting and recovery? No, and the two reasons go right to the heart of the larger problem here:
- Not every change is subject to a change control regime; and
- Just because a change is understood, approved, tested, and verified doesn’t mean it can’t have undesired results in the field.
You will never win with change control alone because some changes escape your control, and because control doesn’t guarantee success. It’s a partial solution.
Pattern: Change can be monitored
To recap the first three patterns: Your applications require stability, but you live in a dynamic world, and it’s not possible to control all the changes.
That sucks! Fortunately, changes that can’t be controlled can be monitored.
For example, it’s impossible to prevent your API vendor from making a “backward compatible” change in their WSDL, but it’s definitely possible to monitor that WSDL and keep a log of changes. You cannot control the SSL certs or the corresponding DNS settings of resources that are outside of your operation, but you can absolutely raise an alert when they do change.
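The simplest form of that monitoring is just snapshot-and-compare. Here’s a minimal sketch in Python; the poll-and-hash approach and the function names (`fetch`, `detect_change`) are my own illustration, not part of any particular monitoring product, and in practice you’d want to store the full snapshot so you can diff it, not just a digest:

```python
import hashlib
import urllib.request
from typing import Optional, Tuple

def fingerprint(payload: bytes) -> str:
    """Return a stable SHA-256 digest for a snapshot of a resource."""
    return hashlib.sha256(payload).hexdigest()

def fetch(url: str) -> bytes:
    """Fetch the current bytes of a remote resource (e.g. a vendor's WSDL)."""
    with urllib.request.urlopen(url) as resp:
        return resp.read()

def detect_change(current: bytes, last_digest: Optional[str]) -> Tuple[bool, str]:
    """Compare today's snapshot against the last recorded digest.

    Returns (changed, new_digest). A True result doesn't mean anything
    is broken -- only that the vendor altered the document and someone
    should go look at the diff."""
    digest = fingerprint(current)
    changed = last_digest is not None and digest != last_digest
    return changed, digest
```

Run something like this on a schedule, persist the digest between runs, and raise an alert (rather than an error) whenever `changed` comes back true.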
Further, it’s entirely possible to roll out a change that you thought would be harmless (like bumping the embedded version number on an application) only to find an unanticipated outcome (like breaking some downstream batch app that parses your log files overeagerly). That problem might only reveal itself days later, and the correlation could be hard to pick up on.
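To make the failure mode concrete, here’s a hypothetical version of that overeager downstream parser (the log format and regex are invented for illustration). It works fine until the upstream team adds a patch number to the version string:

```python
import re

# A downstream batch job that parses free-form log lines.
# It silently assumes the version is always exactly two dot-separated
# numbers -- an assumption nobody wrote down anywhere.
STARTUP_LINE = re.compile(r"^app v(\d+)\.(\d+) started$")

def parse_version(log_line: str):
    """Extract (major, minor) from a startup log line.

    Raises ValueError on any line that doesn't match -- which is
    exactly what happens the day the upstream app logs 'v2.3.1'."""
    m = STARTUP_LINE.match(log_line)
    if m is None:
        raise ValueError(f"unparseable log line: {log_line!r}")
    return int(m.group(1)), int(m.group(2))
```

The version bump itself was “harmless,” but this job starts failing, possibly days later when the next batch run picks up the new logs.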
Here’s the takeaway: If you want to reduce the instances of downtime and cultivate fast recovery when something does go wrong, continuous state monitoring is an essential part of the plan.
Result: A radical four-part plan for remedy and resilience
A DBA I collaborated with on various gigs in the past liked to say “Nothing hard is ever easy.” He was incredibly annoying but also completely right. (So, basically your typical DBA.)
Monitoring the state of all the resources that make up an enterprise application is like that. It’s surprisingly difficult if you haven’t done it already!
In an upcoming blog post 🔜 I will share my four-step process for implementing that monitoring in a way that makes sense, yields consistency, and most importantly doesn’t impose on your existing deployment structures.