Writing unit-tests is an integral part of writing software.

How many tests are needed?

What coverage target should one aim for?

What’s coverage anyway?

Follow along on this quick journey, where I wonder those very questions.

## 1. Why do we need any tests?

First and foremost, why would you need any tests, if you could simply prove that the program is right?

After all, in mathematics, if we assume that there are an infinite number of prime numbers for example, we can just prove it, and we don’t have to test all numbers to check that this statement is true.

That idea led to Formal Methods.

Formal methods offer a spectrum of tools to write formal specifications, formal verifications and formal proofs of the architecture/software.

However, formal proof of software is today so complex that it isn’t used in most cases, but mostly on critical subsets of programs (such as a kernel, a cryptographic library, network switches, CPU designs, etc.).

Despite those limitations, formal methods are not a black & white choice, but more of a spectrum. As such, using some of those methods can help a great deal in preventing mistakes in the first place.

In particular, the following helps in achieving this purpose:

• TLA+, to specify your program/system;

• Type Systems, to enforce the coherence of data structures throughout the program;

• Algebraic Data Types, to make invalid states unrepresentable;

• Languages that enforce certain safety properties (such as memory safety, thread safety, etc.): Rust, Ada, garbage-collected languages, etc.;

• Etc.

So, testing our software is a necessity, coming from the fact that we cannot prove that we didn’t make any mistakes, whether in the specification of said program, in its implementation, or in its configuration.

Furthermore, is it acceptable that the first time a line of code is executed is in production, or at the customer’s site?

If you never tested the code you wrote, how can you assure that it is working as expected?

## 2. Target: 100% coverage!

That line of thought naturally leads to the goal of testing everything and, as a consequence, to targeting 100% test coverage.

Not so fast! 100% coverage of what exactly? Should each line of code be tested? Or each function? Or each feature? Or each if branch?

Let’s define and explore different types of coverage.

## 3. Coverages

### 3.1. Entry-point coverage

Entry-point coverage measures the proportion of functions that have been called at least once.

This is the easiest form of coverage, and 100% should be the target, without a doubt.

### 3.2. Statement coverage

Statement coverage measures the proportion of statements/lines that have been executed at least once.

At first glance, we should also aim to cover 100% of lines of code, but this is usually impossible.

Indeed, let’s consider the following artificial bad Java code:

```java
enum State {
    ON,
    OFF,
    SWITCHING
}

class Switch {
    static boolean isStateOn(final State state) {
        if (state == State.ON) {
            return true;
        }
        if (state == State.OFF) {
            return false;
        }
        if (state == State.SWITCHING) {
            return false;
        }
        throw new UnsupportedOperationException();
    }
}
```

In this case, the final throw statement cannot be covered, because the State enum does not contain any value allowing the program to reach this point.

However, this line is needed in the event that the enum gains more values: isStateOn would then need to be updated. It serves as a way to make the code fail fast in the future, should developers add more functionality. This is often a requirement from coding standards, and is considered good practice.

The good news is that such cases are usually limited in number, so they only need to be understood. Some languages allow avoiding some of those quirks entirely (with union-type support, for example).
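In fact, Java itself offers a way out here: since Java 14, a switch expression over an enum is checked for exhaustiveness by the compiler, so the unreachable defensive throw disappears. A minimal sketch, reusing the names from the example above:

```java
// Sketch: the same Switch example, rewritten with a Java 14+ switch
// expression so that every line becomes coverable.
enum State {
    ON,
    OFF,
    SWITCHING
}

class Switch {
    static boolean isStateOn(final State state) {
        // The compiler checks exhaustiveness: adding a new State constant
        // turns this switch into a compilation error instead of a runtime
        // UnsupportedOperationException.
        return switch (state) {
            case ON -> true;
            case OFF, SWITCHING -> false;
        };
    }

    public static void main(String[] args) {
        System.out.println(isStateOn(State.ON)); // prints "true"
    }
}
```

This moves the “future developer adds a value” safety net from a runtime exception to a compile-time error, which is strictly earlier feedback.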

### 3.3. Branch coverage

Branch coverage can be, and is, defined in multiple ways. Let’s look at a simple piece of code:

```java
if (a < 5 && b > 17) {
    return true;
}
return false;
```

This may be seen as having 2 branches (the 2 return statements), or many more (such as 5, if you take into account that a < 5 and b > 17 can each be true or false).

Thus, branch coverage is ill-defined for most purposes.

### 3.4. MC/DC coverage

Modified Condition/Decision Coverage (MC/DC) is a stricter form of coverage.

This requires that:

• all functions must be entered at least once;

• all simple branches have been executed (the 2 from the previous section);

• all elementary boolean conditions in every branch condition must have been true & false at least once;

• each elementary boolean condition must be shown to independently affect the branch outcome, all others being held constant.

The number of tests can grow quite quickly though.
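To make this concrete, here is a sketch of a minimal MC/DC test set for the two-condition decision from section 3.3 (the input values are arbitrary picks of mine): three tests suffice, because each elementary condition is flipped once while the other is held constant.

```java
class McDcExample {
    // The decision from section 3.3, extracted into a function.
    static boolean decision(final int a, final int b) {
        return a < 5 && b > 17;
    }

    public static void main(String[] args) {
        // Test 1: both elementary conditions true -> decision is true.
        System.out.println(decision(4, 20));  // prints "true"
        // Test 2: only "a < 5" flipped to false -> decision flips too,
        // showing "a < 5" independently affects the outcome.
        System.out.println(decision(10, 20)); // prints "false"
        // Test 3: only "b > 17" flipped to false -> decision flips too,
        // showing "b > 17" independently affects the outcome.
        System.out.println(decision(4, 10));  // prints "false"
    }
}
```

In general, a decision with n independent elementary conditions needs at least n + 1 tests to reach MC/DC, which is where the quick growth comes from.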

## 4. Number of tests & coverage

We naturally arrive at the following question: how many tests are thus needed?

### 4.1. Basic path coverage

All code can be represented as a Graph.

In particular, each line represents a node in our graph, and each execution can lead to a different node: this is a Control Flow Graph.

Usually, there is an implicit link between each line, as the program is executed sequentially. However, an if can occur and connect us to a different line in the program.

Let’s see an example:

```java
static Boolean doYourThing(final Integer a, final Integer b) {
    Boolean exit = null;
    if (a < 5 && b > 17) {
        exit = true;
    }
    if (a < -1 && b <= 42) {
        exit = false;
    }
    return exit;
}
```

We effectively have 9 nodes, because we have 9 meaningful lines (lines 1 to 9; line 10, the closing brace, is decorative only).

Furthermore, because we do not have goto statements, the entry point (line 1) and the exit point of the function (line 9) can be treated as the same node, leaving 8 nodes.

Now it’s just a matter of counting the flow branches we see: we have 3 branches.

### 4.2. McCabe Cyclomatic Complexity

Given a program represented as a control flow graph, we can define its cyclomatic complexity M as follows:

M = E - N + 2P

with:

• E: number of edges;

• N: number of nodes;

• P: number of connected components.

In our previous example, we have M = 9 - 8 + 2×1 = 3.

We can see that the number of branches is equal to the cyclomatic complexity.

### 4.3. Number of tests: lower bound

The cyclomatic complexity being the number of branches present in a program, we can use it as the minimal number of tests needed to cover the basis paths.

Often, the cyclomatic complexity of a function is limited to 10 by static analysis tools. Another perspective on this is: if you have a function with a cyclomatic complexity greater than 10, have you written at least 10 tests for it? Maybe not.
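As a sketch, here are three tests matching the cyclomatic complexity of the doYourThing example above (exit is initialized to null so that the code compiles; the input values are mine):

```java
class DoYourThingTest {
    // Same function as in the control-flow-graph example.
    static Boolean doYourThing(final Integer a, final Integer b) {
        Boolean exit = null;
        if (a < 5 && b > 17) {
            exit = true;
        }
        if (a < -1 && b <= 42) {
            exit = false;
        }
        return exit;
    }

    public static void main(String[] args) {
        // Path 1: only the first if is taken.
        System.out.println(doYourThing(0, 20));  // prints "true"
        // Path 2: only the second if is taken.
        System.out.println(doYourThing(-5, 10)); // prints "false"
        // Path 3: neither if is taken; exit keeps its initial value.
        System.out.println(doYourThing(10, 10)); // prints "null"
    }
}
```

Three tests, one per branch counted earlier: this is the lower bound the cyclomatic complexity gives us, not a guarantee of correctness.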

### 4.4. Combinatorial Testing

The minimal coverage doesn’t take into account every input value of the parameters the function takes.

This raises the following question: what is the number of tests needed to actually cover its behaviour for its input space?

### 4.5. Number of tests: upper bound

This is the product over all inputs of the number of values that a given input can take.

In our previous example, it would be (number of values for a)×(number of values for b) = (2³¹ − 1)² ≈ 4.6×10¹⁸.
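A quick BigInteger sanity check of that product, assuming (like the text) 2³¹ − 1 values per Integer parameter:

```java
import java.math.BigInteger;

class ExhaustiveCount {
    static BigInteger exhaustive() {
        // 2^31 - 1 values for a, times 2^31 - 1 values for b.
        final BigInteger perParameter =
            BigInteger.TWO.pow(31).subtract(BigInteger.ONE);
        return perParameter.multiply(perParameter);
    }

    public static void main(String[] args) {
        // prints "4611686014132420609", i.e. ≈ 4.6×10^18
        System.out.println(exhaustive());
    }
}
```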

### 4.6. Empirical upper bound: Combinatorial Testing

This is a lot of tests.

Luckily, it has been shown that “when a program has many parameters, then exhaustive combinations of a few of them tend to uncover almost all problems”.

We can then wonder how many tests might be needed to actually test all combinations of t parameters of a given function. Unfortunately, the exact number is not known, but an upper bound has been found so far:

$$upperBound = \nu^t \log(N)$$

with ν the largest number of states a given parameter can take, t the number of parameters taken at the same time, and N the total number of the function’s parameters.

For an extended version of our previous example, with the following signature:

```java
Boolean doYourThing(final Integer a, final Boolean b, final Boolean c)
```

We have:

| t | upperBound |
|---|------------|
| 2 | 5×10¹⁸ |
| 3 | 1×10²⁸ |
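Those orders of magnitude can be reproduced with a short computation (assuming ν = 2³¹ for the widest parameter, N = 3, and a natural logarithm, which is what matches the table):

```java
class CombinatorialBound {
    // upperBound = nu^t * ln(N), with nu = 2^31 and N = 3 parameters.
    static double bound(final int t) {
        final double nu = Math.pow(2, 31);
        final int totalParameters = 3;
        return Math.pow(nu, t) * Math.log(totalParameters);
    }

    public static void main(String[] args) {
        System.out.printf("t=2: %.1e%n", bound(2)); // ≈ 5×10^18
        System.out.printf("t=3: %.1e%n", bound(3)); // ≈ 1×10^28
    }
}
```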

Programs exist to estimate a better number of tests, and even generate those tests automatically (by the NIST here (archive)).

### 4.7. Effectiveness of coverage testing

Coverage testing is a way to ensure we detect faults in the code.

The number of faults in a given code being unknown, we may ask the following question: “Does a% of MC/DC coverage correlate with a particular level of fault coverage?”

Y. Wei et al. (2010) (archive) gave a partial answer to this. It seems that fault detection correlates strongly with MC/DC coverage when coverage grows from a low value, that the correlation is rather weak in the middle, and that it is high again when coverage approaches 100% (at least with random testing).

You can effectively see random testing as a function that looks like a logit curve, with the number of faults detected on the y-axis and the number of tests on the x-axis.

## 5. Reference

The content of this post is mostly inspired by Embedded Software Development for Safety-Critical Systems (2016) by Chris Hobbs (based on multiple standards for the Medical, Planes, Trains & Automotive industries). ISBN: 9781498726702. Section V.17.