Your system is feature-perfect, but is it still fragile?

Dotan Reis
4 min read · Feb 3, 2021

All too often we find ourselves deploying software that works flawlessly on the drawing board, only to see it underperform or fail in production. Sometimes it happens quite fast, other times it takes a while.

The problem is, shipping a complete system to production after only testing it locally is a little bit like trying to go surfing after you’ve thoroughly and absolutely learned how to surf — in a classroom.

This is because production environments are complex. There are a lot of unknowns: your business needs can change. Customer behavior can surprise you. Third-party services can change their behavior. Your infrastructure can start acting up when it’s experiencing heavy traffic.

The issue with complex environments is that you can’t really predict them. They are chaotic systems, like the weather. You can predict a little bit in advance but not more than that. So how do you prepare for what you can’t predict?

Chaos (Photo by NASA on Unsplash)

It’s good to keep this in mind when thinking about the architecture process. Too often the process is this: ask the product people to give you a spec, or a list of required features, and design a system that satisfies them in the most efficient way.

But this is just shifting the responsibility onto other people, and they can’t predict complex environments any more than you can. What’s more, they don’t always have enough engineering context to have the correct intuition as to what the relevant weaknesses in your design might be, and how to put that into a requirements list.

When designing the system it’s crucial to have all the context: the business and product context of what might change in the requirements, and the engineering context of what the system can handle. But this is also not enough.

Bad at predicting the future (Photo by Viva Luna Studios on Unsplash)

We all know intuitively that systems that have been around long enough, and had to sustain significant enough stress from users, attackers and business, are by now probably pretty resilient. They will probably be harder to break, having endured so much. Can we get that resiliency faster?

The concept of Anti-Fragility discusses exactly this. The main idea: you don’t want to wait for things to break, and you can’t predict the future. But there’s hope: when a system becomes stronger in response to a particular stress, it also becomes stronger in general. It doesn’t make that much of a difference which stresses you defend against, because if you defend against a large enough group of things, you’re likely to be safe against what’s really going to happen.

Strength (Photo by Hans-Jurgen Mager on Unsplash)

So what do you do? First, you need to have “a seat at the table”. You can’t leave it to someone else to write a spec for you. You need to know what the business and product uncertainties are, and they need to know what the system’s possible weaknesses are.

Then, create a list of possible things that could happen and stress your system. Be creative, and remember: we’re not dealing with probabilities. This can be things like, “competitors drop their prices (maybe we have to too)”, “third-party X stops working (maybe we need to switch vendors)”, “we need to start supporting multiple languages (maybe we need different modularity)”, “we need to switch databases to handle the scale”, “all the customers use the system at the same time”, “aliens land and become our customers”, “our messaging system becomes slow” and so on.

At riseup, we have a monthly strategy meeting for the entire company, which brings up a lot of possible future changes, and gives developers a chance to describe the state of the system. If you don’t have that, try talking to different people in the company, and ask them to come up with scenarios about future needs the company may have, regardless of how unlikely.

Once you have a list, go over the items and try to see how each one might affect the system. Do you think it will just work fine, or can you solve it with the flip of a switch? Great, move on to the next item. If not, think about how you could modify your design today to tackle this possible challenge tomorrow.
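To make the walk-through concrete, here is a minimal sketch of that review in Python. Everything in it is an assumption made for illustration: the `Scenario` fields, the `triage` function, the cost estimates in days, and the cheapest-first ordering are hypothetical and not a process described in this post or used at riseup. It only shows one way the list might be sorted into "already fine", "worth fixing now" and "leave for later".

```python
from dataclasses import dataclass


@dataclass
class Scenario:
    """One 'what if' from the brainstorming list (hypothetical structure)."""
    description: str
    handled_today: bool      # works as-is, or needs only a config switch
    mitigation: str = ""     # design change to consider if not handled
    cost_days: float = 0.0   # rough effort estimate (assumed unit: days)


def triage(scenarios, budget_days):
    """Split scenarios into (handled, planned, deferred).

    Handled scenarios need no work. Unhandled ones are planned
    cheapest-first until the budget runs out; the rest are deferred.
    Likelihood is deliberately ignored, since we are not dealing
    with probabilities here.
    """
    handled = [s for s in scenarios if s.handled_today]
    open_items = sorted(
        (s for s in scenarios if not s.handled_today),
        key=lambda s: s.cost_days,
    )
    planned, deferred = [], []
    remaining = budget_days
    for s in open_items:
        if s.cost_days <= remaining:
            planned.append(s)
            remaining -= s.cost_days
        else:
            deferred.append(s)
    return handled, planned, deferred


if __name__ == "__main__":
    # Example items taken from the list above; the estimates are made up.
    scenarios = [
        Scenario("all the customers use the system at the same time", True),
        Scenario("third-party X stops working", False,
                 "hide the vendor behind our own interface", 3),
        Scenario("we need to support multiple languages", False,
                 "move user-facing text out of the code", 8),
    ]
    handled, planned, deferred = triage(scenarios, budget_days=5)
    for label, group in (("handled", handled), ("planned", planned),
                         ("deferred", deferred)):
        for s in group:
            print(f"{label}: {s.description}")
```

Ranking purely by cost is a simplification; in practice you might also weigh how badly each scenario would hurt if it happened.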

Naturally, unless you have unlimited time and money, you won’t be able to defend against everything and create a system that’s 100% bullet-proof, but hopefully you’ll find some adjustments that are cost-effective enough to implement immediately. Next time the outside world surprises you, this might make a big difference.

Photo by Jonathan Borba on Unsplash

In the end it’s like the army saying: “Hard in training, easy in battle”. Our systems go into battle when we ship them to production, so let’s send them prepared.

This post is inspired by a great episode of Software Engineering Radio that covers most of this post and other methods for achieving anti-fragility: https://www.se-radio.net/2020/01/episode-396-barry-oreilly-on-antifragile-architecture/

Dotan Reis

Software developer @ riseup. MA student @ The Cohn Institute in Tel Aviv University