Have you ever seen an architecture diagram with many layers of services, and wondered how reliable it is?

After all, KISS is better, right?

Our intuition tells us it makes sense: if each layer can fail, more layers just mean more opportunities for failure.

But exactly how many layers are too many?

1. Model

By layers, I mean units that depend on each other to process an input.

This can be:

  • calling an API (such as a web service) that itself calls another service, which in turn calls another one, and so on;

  • each step of a single process (checking input validity, processing, formatting, logging, etc.);

  • and so on.

The key point is that the failure of a single layer causes the failure of the whole process.

In this model, each processing unit is a box, and each link is a line between two boxes.

This is similar to having electrical components connected in series:

Model

\(I\) is the input of the process and \(O\) is its output. \(U_i\) is the \(i^{th}\) unit; it processes the data and sends it to the next unit.

2. Modeling

We want to calculate \(P(S)\), the probability that the whole system \(S\) works, knowing \(P(U_i)\), the probability that unit \(i\) works.

Let’s first assume we only have 2 units: \(U_1\) and \(U_2\).

In this case, the probability that the system works is the probability that both \(U_1\) and \(U_2\) work at the same time:

\[P(S) = P(U_1\ and\ U_2)\]

In probability, we know that \(P(A\ and\ B) = P(A \cap B)\), so we have:

\[P(S) = P(U_1 \cap U_2)\]

We also assume that \(U_1\) and \(U_2\) are independent systems, allowing us to write \(P(U_1 \cap U_2) = P(U_1).P(U_2)\):

\[P(S) = P(U_1).P(U_2)\]
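For example, with two hypothetical units that succeed 99% and 95% of the time:

\[P(S) = 0.99 \times 0.95 = 0.9405\]

Even two fairly reliable units already bring the chain below 95%.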

This generalizes to \(n\) units:

\[P(S) = \prod_{i=1}^{n} P(U_i)\]

This gives us the probability that the system works; the probability that it fails is simply \(P(\overline{S}) = 1 - P(S)\).
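As a quick sanity check, these two formulas fit in a few lines of Python (the per-unit reliabilities below are made-up, purely illustrative values):

```python
from math import prod

# Hypothetical per-unit probabilities of success (illustrative values only).
unit_reliabilities = [0.99, 0.95, 0.98]

# P(S): all units must work, and we assume they fail independently.
p_success = prod(unit_reliabilities)

# P(not S) = 1 - P(S): the probability that at least one unit fails.
p_failure = 1 - p_success

print(f"P(S)    = {p_success:.4f}")  # 0.9217
print(f"P(fail) = {p_failure:.4f}")  # 0.0783
```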

3. Simulation

Let’s simulate many units, each having a random probability of success between a lower bound and 100%, and graph \(P(S)\) as the number of layers grows.
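Here is a minimal sketch of such a simulation, assuming NumPy and Matplotlib; the lower bound and the maximum number of layers are arbitrary choices for illustration:

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(42)

def chain_reliability(n_layers, lower_bound):
    """Draw one success probability per unit, uniformly in
    [lower_bound, 1], and return P(S) as their product."""
    reliabilities = rng.uniform(lower_bound, 1.0, size=n_layers)
    return reliabilities.prod()

layers = np.arange(1, 31)
p_success = [chain_reliability(n, lower_bound=0.80) for n in layers]

plt.plot(layers, p_success)
plt.xlabel("Number of layers")
plt.ylabel("P(S)")
plt.title("Probability that the whole chain works")
plt.show()
```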

3.1. Unreliable units

With units having a random probability of success between 80% and 100% (for example, if those units are people):

Graph of P(S) with the number of layers

This decreases very quickly: at 10 layers, the probability of failure is already around 70%.
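A quick back-of-the-envelope check is consistent with this: assuming every unit sits at the 90% midpoint of that range,

\[P(S) \approx 0.9^{10} \approx 0.35\]

so the failure rate after 10 layers is already around 65%, in line with the simulated curve above.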

3.2. Medium reliability units

With units having a random probability of success between 97% and 100% (maybe those are not very well tested services):

Graph of P(S) with the number of layers

This time the curve decreases more slowly, and after 20 layers, it reaches a 5% failure rate.

3.3. High reliability units

If each unit is a High Availability service, with a "six nines" availability (so each unit has a probability of success between 99.9999% and 100%), we get:

Graph of P(S) with the number of layers
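A back-of-the-envelope check shows why: even if every unit sat at the low end of that range (99.9999%), after, say, 20 layers we would still get

\[P(S) \geq 0.999999^{20} \approx 1 - 2 \times 10^{-5}\]

that is, a failure rate of roughly 0.002%. The number of layers is no longer the limiting factor.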

4. Conclusion

We’ve seen that the failure rate of a chain of sequential processing units increases dramatically with the number of units when those units are unreliable, but that adding layers barely affects reliability when each unit is highly reliable.

In the real world, units are dissimilar: highly reliable ones (such as computing units or transistors) are mixed with less reliable ones (such as networks).
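To make that last point concrete, here is a small Python sketch with made-up reliabilities for such a mixed chain; the least reliable links dominate the result:

```python
from math import prod

# Hypothetical reliabilities: very reliable compute steps mixed with
# less reliable network hops (illustrative values only).
chain = {
    "load balancer": 0.99999,
    "network hop 1": 0.995,
    "api service":   0.9999,
    "network hop 2": 0.995,
    "database":      0.99995,
}

p_success = prod(chain.values())
print(f"P(S) = {p_success:.4f}")  # ~0.9899, dominated by the two network hops
```

No matter how reliable the other units are, this chain can never do better than its two 99.5% network hops.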