The paradox of test coverage

When I learn that code owned by a team has low test coverage, I expect “here be dragons.” But I never know what to expect if the code coverage is high. I call this a paradox of high test coverage.

High test coverage does not tell much about the quality of unit tests. Low coverage does.

The low coverage argument is self-explanatory. If tests cover only a small portion of the product code, they cannot prevent bugs in the code that is not covered. The opposite is, however, not true: high test coverage does not guarantee a quality product. How is this possible?

Test issues

While unit tests ensure the quality of the product code, nothing, except the developer, ensures the quality of the unit tests. As a result, tests sometimes have issues that allow bugs to sneak in. Finding unit test issues is more luck than science. It usually happens by accident—usually when tests continue to pass despite code changes that should trigger test failures.

One of the simplest examples of a unit test issue is missing asserts. Tests without asserts are unlikely to flag issues. Other common problems include incorrect setup and bugs caused by copying existing tests and incorrectly adapting them to test a new scenario.

Mocking issues

Mocking allows the code under test to be isolated from its dependencies and simulate the dependency behavior. However, when the simulation is incorrect or the behavior of the dependency changes, tests may happily pass, hiding serious issues.

I’ve been working with C++ code bases, and I often see developers assume, without confirming, that a dependency they use won’t throw an exception. So, when they mock this dependency, they forget about the exception case. Even though their tests cover all the code, an exception in production takes the entire service down.

Uncovered code

Getting to 100% code coverage is usually impractical, if not impossible. As a result, a small amount of code is still not covered. Similar to the low coverage scenarios, any change to the code that is not covered can introduce a bug that won’t be detected.

Chasing the coverage number

Test coverage is only a metric. I’ve seen teams do whatever it takes to achieve the metric goal, especially if it was mandated externally, e.g., at the organization or company level. Occasionally, I encountered teams that wrote “test” code whose primary purpose was increasing coverage. Detecting or preventing bugs was a non-goal.

Low test coverage is only the tip of the iceberg

At first sight, low test coverage seems a benign issue. But it often signals bigger problems the team is facing, like:

spending a significant amount of time fixing regressions
shipping high-quality new features is slow due to excessive manual validation
many bugs reach production and are only caught and reported by users
the on-call, if the team has one, is challenging
the engineering culture of the team is poor, or the team is under pressure to ship new features at an unsustainable pace
the code is not very well organized and might be hard to work with, only slowing down the development even further
test coverage is likely lower than admitted to and will continue to deteriorate

I’ve worked on a few teams where developers understood the value of unit testing. They treated test code like product code and never sent a PR without unit tests. Because of this, even if they experienced the problems listed above, it was at a much smaller scale. They also never needed to worry about meeting the test coverage goals – they achieved them as a side effect.

On-call Manual: Measuring the quality of the on-call

Reasonable on-call is no accident. Getting there requires a lot of hard work. But how can you tell if you’re on the right track if the experience can completely change from one shift to another? One answer to this question is monitoring.

How does monitoring help?

At the high level, monitoring can tell you if the on-call duty is improving, staying the same, or deteriorating over a longer period. Understanding the trend is important to decide whether the current investment in keeping the on-call reasonable is sufficient.

At the more granular level, monitoring allows identifying areas that need attention the most, like:

noisy alerts
problematic dependencies
features causing customers’ complaints
repetitive tasks

Continuously addressing the top issues will gradually improve the overall on-call experience.

What metrics to monitor

There is no one correct answer to what metrics to monitor. It depends a lot on what the team does. For example, frontend teams may choose to monitor the number of tickets opened by the customers, while backend teams may want to focus more on time spent on fixing broken builds or failing tests. Here are some metrics to consider:

outages of the products the team owns
external incidents impacting the products the team owns
the number of alerts, broken down by urgency
the number of alerts alerts acted on and ignored
the number of alerts outside the working hours
time to acknowledge alerts
the number of tickets opened by customers
the number of internal tasks
build breaks
test failures

How to monitor?

On-call monitoring is difficult because there isn’t a single metric that can reflect the health of the on-call. My team uses quantitative (data) and qualitative metrics (opinions).

Qualitative metrics

Quantitative metrics can usually be collected from alerting systems, bug trackers, and task management systems. Here are a few examples of quantitative metrics we are tracking on our team:

the number of alerts
the number of tasks
the number of alerts outside the working hours
the noisiest alerts, tracked by alert ID

As quantitative metrics are collected automatically, we built a dashboard to show them in an easy-to-understand way. Keeping historical data allows us to track trends.

Qualitative metrics

Qualitative metrics are opinions about the shift from the person ending the shift. Using qualitative metrics in addition to quantitative metrics is necessary because numbers are sometimes misleading. Here is an example: handling a dozen tasks that can be closed almost immediately without much effort is easier than collaborating with a few teams to investigate a hard-to-reproduce customer report. However, considering only how many tasks each on-call got during their shift, the first shift appears heavier than the second.

On our team, each person going off-call fills out an On-call survey that is part of the On-call report. Here are some of the questions from the survey:

Rate your on-call experience from 1 to 10 (1: easy, 10: horrible)
Rate your experience with resources available for resolving on-call issues (e.g., runbooks, documentation, tools, etc.) from 1 to 10 (1: no resources or very poor resources, 10: excellent resources that helped solve issues quickly)
How much time did you spend on urgent activities like alerts, fire fighting, etc. (0%-100%)?
How much time did you spend on non-urgent activities like non-urgent tasks, noise, etc. (0%-100%)?
Additional comments (free flow)

We’ve been conducting this survey for a couple of years now. One interesting observation I made is that it is not uncommon for a horrible shift for one person to be decent for someone else. Experienced on-calls usually rate their shifts easier than developers who just finished their first shift. This is understandable. We still treat all opinions equally—improving the on-call quality for one person improves it for everyone.

The Additional comments question is my favorite as it provides insights no other metric can capture.

Call to Action

If being on-call is part of your team’s responsibilities and you don’t monitor it, I highly encourage you to start doing so. Even a simple monitoring system will tell you a lot about your on-call and allow you to improve it by addressing the most annoying issues.

“Think big” or make progress?

I am being told to “think big.”

But I don’t know what this means.

And I doubt that most people who tell others to think big can do this themselves.

It is easy to come up with big ideas that are not realistic. “Inhabit Venus” sounds like a big idea, but I can do nothing meaningful to implement it. Finding big ideas that are also realistic is hard.

Big ideas

For the sake of the argument, let’s define a Big Idea as follows:

An idea is big if its implementation spans multiple teams or requires substantially altering business-critical systems. By definition, implementing such an idea takes quarters or even years to complete.

Given the risk and the funding Big Ideas require, pitching them is not easy. In many cases, even Staff Software Engineers do not have enough credibility to get such an idea funded. Instead, it takes a team of Product Managers, Engineering Managers, and Software Engineers to prepare and propose the idea.

Big Ideas promise high rewards but are inherently risky. Many won’t bring the promised benefits, and some will flop completely. Because of how long it takes to implement a Big Idea, not everyone who started it will be there to witness its completion and get the reward.

Earlier in my career, I spent most of my time trying to find the “big thing.” I was not successful. I was obsessed with “thinking big” but couldn’t come up with anything. It took me a while to realize that time was passing, but my career was not progressing. This realization led me to an alternative path.

If not Big Ideas, then what?

When I found that trying to “think big” was not helping my career, I shifted my focus to finding and solving problems I and my team or users faced. No problem was too small. I simplified code that was hard to modify, added missing test coverage, or sped up the build. These were simple changes that I could tackle immediately, and they usually didn’t take more than a day to finish. But they helped push my career on a growing trajectory. Here are the most important reasons why:

I learned how to take the initiative and propose improvements no one asked me to do
I got better at identifying problems
I improved life for our users, my team, and myself

The nice thing about these small bets is that they can boost a career even though they are easy to find and not risky:

due to the smaller scope, they are much easier to complete
if something goes wrong, it usually is not a big deal
delivering them consistently helps build the credibility
they sometimes lead to bigger ideas

Some of these small ideas occasionally had a much bigger impact than I expected. Here is one example. In response to an incident, I needed to write a tool that inspected data in our data store and deleted stale records. The store my team used was one of the most common infra used by many teams across the company. When researching how to complete my task, I noticed that some other teams already built similar one-off solutions. Instead of creating another specialized tool, I wrote a framework to make these tools rapidly. Because my framework saved weeks of development time, a few teams adopted it. I moved to a different team a few years ago, but my framework is still in use. Some developers even extended it to handle more scenarios.

I observed that, with time, my ideas started to grow. I started noticing bigger problems that needed more time and people to be solved. They still are not in the “Big Ideas” category, but many are noticeable as they span multiple teams.

Parting words

Please note that I am not saying never to “think big.” If you can propose or help drive a “Big Idea,” by all means, do so. But don’t dismiss smaller problems, especially if your “Big Idea” is not quite there yet. At the end of the day, when the review time comes, it’s always better to show a few completed small ideas than a “Big Idea” that hasn’t or couldn’t be implemented.

Do Unit Tests Find Bugs?

I’ve been writing software for over 20 years and don’t believe unit tests find bugs.

Yet, I wouldn’t want to work in a code base without unit tests.

Why unit tests don’t find bugs?

To understand why unit tests don’t find bugs, we can look at how they are created. Here are the three main ways to handle unit tests:

developers write the tests along with writing the code
Test Driven Development (TDD)
unit tests are considered a waste of time, so they don’t exist

When the same software developer writes unit tests and code simultaneously, the tests tend to reflect closely what the code does. Both tests and code follow the same logic, stemming from the same understanding of the problem. As a result, the tests won’t find major implementation issues. If they find small typos or bugs, it’s usually only by chance.

Test-driven development calls for writing unit tests before implementing product changes. Because no product code exists, the unit tests are expected to fail initially or even not compile. The goal is to write product code to make the tests pass. In TDD, new unit tests are added mostly to drive the implementation of new scenarios. An unsupported scenario could be considered a bug, but it’s far-fetched. As a result, TDD rarely finds existing bugs.

If unit tests don’t exist, they cannot find any bugs.

If unit tests don’t find bugs, why do we write them?

While unit tests are not great at finding bugs, they are extremely effective at preventing new ones. Unit tests pin the program’s behavior. Any change that visibly modifies this behavior should make the tests fail. The developer whose changes caused the failures should examine them and either fix the tests—if the change in the behavior was intentional—or fix the code. Many test failures indicate assumptions that the developer unknowingly broke. Without tests, they would turn into customer-impacting bugs.

Other important advantages of unit tests include:

Documentation – comprehensive unit tests can serve as product specification
More modular and maintainable code – writing unit tests for tightly coupled code is difficult. Unit tests drive writing more modular and loosely coupled code because it is much easier to test.
Automated testing – unit tests are much faster to run and more comprehensive than testing changes manually.

If unit tests don’t find bugs, what does?

There are many ways to find bugs in the code. Integration testing, fuzz testing, and stress testing are just some examples. However, the three below are my favorite because they require little to no additional effort from the developers:

Exploratory testing: Try using the product you’re working on. See what happens if you combine a few features or try less common scenarios.
Code reviews: One weakness of unit tests is that they are implemented with the same perspective as the code. Code reviews offer the ability to look at the change from a different angle, which often leads to discovering issues.
Paying attention: Whenever you code, debug, or troubleshoot an issue, have your eyes open. Many bugs are hiding in plain sight. Carefully reading error messages, logs, or stack traces can lead to identifying serious problems.

Code, the Universe and Everything…

The paradox of test coverage

Test issues

Mocking issues

Uncovered code

Chasing the coverage number

Low test coverage is only the tip of the iceberg

On-call Manual: Measuring the quality of the on-call

How does monitoring help?

What metrics to monitor

How to monitor?

Qualitative metrics

Qualitative metrics

Call to Action

Top 5 Unit Test Problems That Haunt Software Developers

Flaky unit tests

Slow tests

Bugs in unit tests

Hard to maintain tests

Writing “tests” only to hit the code coverage number

“Think big” or make progress?

Big ideas

If not Big Ideas, then what?

Parting words

Do Unit Tests Find Bugs?

Why unit tests don’t find bugs?

If unit tests don’t find bugs, why do we write them?

If unit tests don’t find bugs, what does?