“Resilience Thinking” in the Microservice Architecture

Microservices (the idea of splitting your “monolith” software architecture into dozens, even hundreds, of small services) pop up all over the place these days. At Egnyte we also discuss this topic regularly, and we keep investigating how to move further towards this model.

Resilience is a hot topic in this area. It is all about gracefully handling failures, which will inevitably happen at increasing rates as your system becomes more distributed (a natural consequence of the microservice architecture).

As you tear more and more pieces of your software apart and move them into separate services, the number of possible points of failure increases.

Before, you just called method “b” on object “a” and it always “just worked™.” But now you send message “b” to service “a”, and you have to take into account that the network may be down, that “a” may be unreachable, overloaded, or down, that it may crash while processing, or that it is a different “a” than the one you talked to 5 seconds ago.
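To make that concrete, here is a minimal sketch in Python, assuming a hypothetical service reachable over HTTP (the URL, the payload, and the use of the requests library are all stand-ins for illustration). The same one-line local call turns into a call where every failure mode needs an explicit decision:

```python
import requests

# Stand-in for the old in-process object "a" with method "b".
class A:
    def b(self, payload):
        return {"ok": True}

a = A()
payload = {"key": "value"}

# Before: a local call. The only realistic failure is a bug in b() itself.
result = a.b(payload)

# After: the same call as a message to a remote service "a".
try:
    response = requests.post(
        "http://a-service.internal/b",  # hypothetical URL, for illustration
        json=payload,
        timeout=2.0,  # don't wait forever on a hung or overloaded "a"
    )
    response.raise_for_status()  # "a" answered, but reported an error
    result = response.json()
except requests.exceptions.Timeout:
    ...  # "a" is overloaded or crashed mid-request; did "b" happen at all?
except requests.exceptions.ConnectionError:
    ...  # the network is down, or "a" is not reachable
except requests.exceptions.HTTPError:
    ...  # "a" is up but failed while processing, or is a different "a"
```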

For sure, this adds complexity, but what I like about it is that it surfaces risks that have always existed yet may not have been part of a developer’s mental model.

It was always the case that calling method “b” on “a” could fail, but you never saw that happen in practice, so whatever. Now that method “b” lives elsewhere on the network, you have to assume it will fail and figure out how to deal with it.

Will you automatically retry? If so, how many times, and with or without exponential back-off? How do you make sure that “b” didn’t silently crash somewhere without you noticing? How do you debug that?
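One common answer is a bounded retry with exponential back-off. Here is a minimal sketch (the attempt count, the delays, and the send_message_b call are made-up values for illustration; real implementations usually add jitter):

```python
import time

def call_with_backoff(call, max_attempts=5, base_delay=0.1):
    """Retry `call` with exponential back-off; re-raise the last failure."""
    for attempt in range(max_attempts):
        try:
            return call()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # give up and let the caller (or an alert) handle it
            # Sleep 0.1s, 0.2s, 0.4s, 0.8s, ... so a struggling service
            # gets breathing room instead of a thundering herd of retries.
            time.sleep(base_delay * (2 ** attempt))

# Usage, assuming some send_message_b() that talks to service "a":
# result = call_with_backoff(lambda: send_message_b(payload))
```

Note that retrying blindly is only safe if “b” is idempotent; if a timeout hit after “b” already ran, a retry executes it twice. That, too, is a question this architecture forces you to answer.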

You’re forced to think about these issues up front (I hope; if not, good luck with that), whereas before, resilience could realistically be handled in a more reactive fashion: “Hey, writing this file fails once a day, perhaps we should put a try-catch around that so it no longer crashes the whole thread.”
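In code, that reactive fix looks something like this (a sketch; the function name and path are placeholders):

```python
import logging

def write_report(path, data):
    try:
        with open(path, "w") as f:
            f.write(data)
    except OSError:
        # The reactive fix: added only after the write had already
        # crashed the whole thread in production a few times.
        logging.exception("writing %s failed, skipping this run", path)
```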

Given infinite time, software will crash at every single line of your code base (even if it’s because some random electron shoots through your hardware and causes a crash; please don’t comment that this is not actually possible, I don’t know much about physics). Again, microservices just throw the “unexpected failure” problem in developers’ faces more visibly, so that they can no longer ignore it.

My point is this: the microservice architecture comes with many costs, but the end result will be that “resilience thinking” will have to become part of every developer’s life. And in my view, that’s a good thing.