On-call Manual: Measuring the quality of the on-call

Reasonable on-call is no accident. Getting there requires a lot of hard work. But how can you tell if you’re on the right track if the experience can completely change from one shift to another? One answer to this question is monitoring.

How does monitoring help?

At the high level, monitoring can tell you if the on-call duty is improving, staying the same, or deteriorating over a longer period. Understanding the trend is important to decide whether the current investment in keeping the on-call reasonable is sufficient.

At the more granular level, monitoring allows identifying areas that need attention the most, like:

  • noisy alerts
  • problematic dependencies
  • features causing customers’ complaints
  • repetitive tasks

Continuously addressing the top issues will gradually improve the overall on-call experience.

What metrics to monitor

There is no one correct answer to what metrics to monitor. It depends a lot on what the team does. For example, frontend teams may choose to monitor the number of tickets opened by the customers, while backend teams may want to focus more on time spent on fixing broken builds or failing tests. Here are some metrics to consider:

  • outages of the products the team owns
  • external incidents impacting the products the team owns
  • the number of alerts, broken down by urgency
  • the number of alerts alerts acted on and ignored
  • the number of alerts outside the working hours
  • time to acknowledge alerts
  • the number of tickets opened by customers
  • the number of internal tasks
  • build breaks
  • test failures

How to monitor?

On-call monitoring is difficult because there isn’t a single metric that can reflect the health of the on-call. My team uses quantitative (data) and qualitative metrics (opinions).

Qualitative metrics

Quantitative metrics can usually be collected from alerting systems, bug trackers, and task management systems. Here are a few examples of quantitative metrics we are tracking on our team:

  • the number of alerts
  • the number of tasks
  • the number of alerts outside the working hours
  • the noisiest alerts, tracked by alert ID

As quantitative metrics are collected automatically, we built a dashboard to show them in an easy-to-understand way. Keeping historical data allows us to track trends.

Qualitative metrics

Qualitative metrics are opinions about the shift from the person ending the shift. Using qualitative metrics in addition to quantitative metrics is necessary because numbers are sometimes misleading. Here is an example: handling a dozen tasks that can be closed almost immediately without much effort is easier than collaborating with a few teams to investigate a hard-to-reproduce customer report. However, considering only how many tasks each on-call got during their shift, the first shift appears heavier than the second.

On our team, each person going off-call fills out an On-call survey that is part of the On-call report. Here are some of the questions from the survey:

  • Rate your on-call experience from 1 to 10 (1: easy, 10: horrible)
  • Rate your experience with resources available for resolving on-call issues (e.g., runbooks, documentation, tools, etc.) from 1 to 10 (1: no resources or very poor resources, 10: excellent resources that helped solve issues quickly)
  • How much time did you spend on urgent activities like alerts, fire fighting, etc. (0%-100%)?
  • How much time did you spend on non-urgent activities like non-urgent tasks, noise, etc. (0%-100%)?
  • Additional comments (free flow)

We’ve been conducting this survey for a couple of years now. One interesting observation I made is that it is not uncommon for a horrible shift for one person to be decent for someone else. Experienced on-calls usually rate their shifts easier than developers who just finished their first shift. This is understandable. We still treat all opinions equally—improving the on-call quality for one person improves it for everyone.

The Additional comments question is my favorite as it provides insights no other metric can capture.

Call to Action

If being on-call is part of your team’s responsibilities and you don’t monitor it, I highly encourage you to start doing so. Even a simple monitoring system will tell you a lot about your on-call and allow you to improve it by addressing the most annoying issues.

Top 5 Unit Test Problems That Haunt Software Developers

Well-written unit tests are one of the most effective tools for ensuring product quality. Unfortunately, not all unit tests are well written, and the ones that are not are often a source of frustration and lost productivity. Here are the most common unit test issues I encountered during my career.

Flaky unit tests

Flaky tests pass most of the time, but not always. They may randomly fail even though no code has changed. The quickest and most common “fix” developers employ is to re-run them. With time, the number of flaky tests grows, and even multiple re-runs are insufficient.

Flaky tests are caused primarily by the following:

  • shared state
  • dependency on external systems

A shared state is the number one cause of test flakiness. Static variables could be one example. If one test sets a static variable and another passes only if this variable is set, the second test will fail if the order of execution changes.

Debugging flakiness caused by shared state is usually tricky because sharing state is rarely intentional.

Tests that depend on external systems tend to be flaky because the systems they rely on are outside their control. Any deployments, crashes, or throttling will cause test failures. Network, which is inherently unreliable, is yet another contributor. The best fix is to mock external dependencies.

Multithreaded applications deserve special mention. Race conditions in the product code could make tests for these applications flaky, and finding the root cause is often challenging.

Slow tests

Slow tests are a productivity killer. If running tests for a code change takes more than a few seconds, developers will use it as an excuse to find a distraction.

One of the most common reasons tests are slow is their dependency on external systems: network calls and the time to process the requests initiated by tests add up.

But tests that depend on external systems are also flaky, so slowness and flakiness go hand-in-hand.

Again, mocking external dependencies is the best fix to make tests fast and reliable.

If relying on external systems is intentional (e.g., end-to-end testing), it is worth separating end-to-end tests into a dedicated suite executed separately, for instance, as part of the nightly build.

I was once on a team where running all the tests took more than two hours because most of them communicated with a database. These tests were also flaky, so merging more than one Pull Request a day was virtually impossible.

Bugs in unit tests

Tests are there to ensure the quality of the product, but nothing is there to ensure the quality of tests. As a result, tests may fail to do their job due to bugs. Unfortunately, identifying these bugs is not easy. Paying attention can help. For instance, if all tests continue to pass after changing the product code, it usually indicates either bugs in tests or missing test coverage.

Hard to maintain tests

Tying tests and implementation details closely usually causes numerous test failures after even simple product code changes. Keeping tests focused on functionality instead of on the implementation can significantly reduce the number of unnecessary test failures.

Writing “tests” only to hit the code coverage number

Test code written solely to meet code coverage goals is usually low quality. Assertions in such code are often missing because they don’t contribute to the coverage goal but can cause failures. Test coverage reported by tools can make the manager look good, but this test code is useless as it can’t prevent bugs. What’s worse, the high coverage hides areas that do need attention.

This is my list of the top 5 unit test issues. What’s yours?

“Think big” or make progress?

I am being told to “think big.”

But I don’t know what this means.

And I doubt that most people who tell others to think big can do this themselves.

It is easy to come up with big ideas that are not realistic. “Inhabit Venus” sounds like a big idea, but I can do nothing meaningful to implement it. Finding big ideas that are also realistic is hard.

Big ideas

For the sake of the argument, let’s define a Big Idea as follows:

An idea is big if its implementation spans multiple teams or requires substantially altering business-critical systems. By definition, implementing such an idea takes quarters or even years to complete.

Given the risk and the funding Big Ideas require, pitching them is not easy. In many cases, even Staff Software Engineers do not have enough credibility to get such an idea funded. Instead, it takes a team of Product Managers, Engineering Managers, and Software Engineers to prepare and propose the idea.

Big Ideas promise high rewards but are inherently risky. Many won’t bring the promised benefits, and some will flop completely. Because of how long it takes to implement a Big Idea, not everyone who started it will be there to witness its completion and get the reward.

Earlier in my career, I spent most of my time trying to find the “big thing.” I was not successful. I was obsessed with “thinking big” but couldn’t come up with anything. It took me a while to realize that time was passing, but my career was not progressing. This realization led me to an alternative path.

If not Big Ideas, then what?

When I found that trying to “think big” was not helping my career, I shifted my focus to finding and solving problems I and my team or users faced. No problem was too small. I simplified code that was hard to modify, added missing test coverage, or sped up the build. These were simple changes that I could tackle immediately, and they usually didn’t take more than a day to finish. But they helped push my career on a growing trajectory. Here are the most important reasons why:

  • I learned how to take the initiative and propose improvements no one asked me to do
  • I got better at identifying problems
  • I improved life for our users, my team, and myself

The nice thing about these small bets is that they can boost a career even though they are easy to find and not risky:

  • due to the smaller scope, they are much easier to complete
  • if something goes wrong, it usually is not a big deal
  • delivering them consistently helps build the credibility
  • they sometimes lead to bigger ideas

Some of these small ideas occasionally had a much bigger impact than I expected. Here is one example. In response to an incident, I needed to write a tool that inspected data in our data store and deleted stale records. The store my team used was one of the most common infra used by many teams across the company. When researching how to complete my task, I noticed that some other teams already built similar one-off solutions. Instead of creating another specialized tool, I wrote a framework to make these tools rapidly. Because my framework saved weeks of development time, a few teams adopted it. I moved to a different team a few years ago, but my framework is still in use. Some developers even extended it to handle more scenarios.

I observed that, with time, my ideas started to grow. I started noticing bigger problems that needed more time and people to be solved. They still are not in the “Big Ideas” category, but many are noticeable as they span multiple teams.

Parting words

Please note that I am not saying never to “think big.” If you can propose or help drive a “Big Idea,” by all means, do so. But don’t dismiss smaller problems, especially if your “Big Idea” is not quite there yet. At the end of the day, when the review time comes, it’s always better to show a few completed small ideas than a “Big Idea” that hasn’t or couldn’t be implemented.

Do Unit Tests Find Bugs?

I’ve been writing software for over 20 years and don’t believe unit tests find bugs.

Yet, I wouldn’t want to work in a code base without unit tests.

Why unit tests don’t find bugs?

To understand why unit tests don’t find bugs, we can look at how they are created. Here are the three main ways to handle unit tests:

  • developers write the tests along with writing the code
  • Test Driven Development (TDD)
  • unit tests are considered a waste of time, so they don’t exist

When the same software developer writes unit tests and code simultaneously, the tests tend to reflect closely what the code does. Both tests and code follow the same logic, stemming from the same understanding of the problem. As a result, the tests won’t find major implementation issues. If they find small typos or bugs, it’s usually only by chance.

Test-driven development calls for writing unit tests before implementing product changes. Because no product code exists, the unit tests are expected to fail initially or even not compile. The goal is to write product code to make the tests pass. In TDD, new unit tests are added mostly to drive the implementation of new scenarios. An unsupported scenario could be considered a bug, but it’s far-fetched. As a result, TDD rarely finds existing bugs.  

If unit tests don’t exist, they cannot find any bugs.

If unit tests don’t find bugs, why do we write them?

While unit tests are not great at finding bugs, they are extremely effective at preventing new ones. Unit tests pin the program’s behavior. Any change that visibly modifies this behavior should make the tests fail. The developer whose changes caused the failures should examine them and either fix the tests—if the change in the behavior was intentional—or fix the code. Many test failures indicate assumptions that the developer unknowingly broke. Without tests, they would turn into customer-impacting bugs.

Other important advantages of unit tests include:

  • Documentation – comprehensive unit tests can serve as product specification
  • More modular and maintainable code – writing unit tests for tightly coupled code is difficult. Unit tests drive writing more modular and loosely coupled code because it is much easier to test.
  • Automated testing – unit tests are much faster to run and more comprehensive than testing changes manually.

If unit tests don’t find bugs, what does?

There are many ways to find bugs in the code. Integration testing, fuzz testing, and stress testing are just some examples. However, the three below are my favorite because they require little to no additional effort from the developers:

  • Exploratory testing: Try using the product you’re working on. See what happens if you combine a few features or try less common scenarios.
  • Code reviews: One weakness of unit tests is that they are implemented with the same perspective as the code. Code reviews offer the ability to look at the change from a different angle, which often leads to discovering issues.
  • Paying attention: Whenever you code, debug, or troubleshoot an issue, have your eyes open. Many bugs are hiding in plain sight. Carefully reading error messages, logs, or stack traces can lead to identifying serious problems.

The Curious Case of Bugs that Fix Themselves

I can’t count how many times I had this conversation:
  “Good news! The bug is fixed!”
  “Did you fix it?”
  “No.”
  “Did anyone else fix it?”
  “No.”
  “How is it fixed then?”
  “I don’t know. I could reproduce it last week, but I cannot reproduce it anymore, so it’s fixed.”
  “Hmmm, can you dig a bit more to understand why it no longer reproduces?”
  [Two hours later]
  “I have a fix. Can you review my PR?”

Can bugs fix themselves?

I grow extremely skeptical whenever a fellow software developer tries to convince me that an issue they were assigned to fix magically fixed itself. Software defects are a result of incorrect program logic. This logic has to change for the defect to be fixed.

Can bugs disappear? Yes, they can and do disappear. But this doesn’t mean that they are fixed. Unless the root cause has been identified and addressed, the issue exists and will pop up again.

The most common reasons bugs disappear

Seeing an issue disappear might feel like a stroke of luck – no bug, no problem. But if the bug was not fixed, it is there – always lurking, ready to strike. Here are the most common reasons a bug may disappear:

Environment changes

A change in the environment no longer triggers the condition responsible for the bug. For instance, a bug that could be easily reproduced on February 29th is not reproducible on March 1st.

Configuration changes

The code path responsible for the bug may no longer be exercised after reconfiguring the application.

Data changes

Many bugs only manifest for specific data. If this data is removed, the bug disappears until the next time the same data shows up.

Unrelated code changes

Someone modified the code, changing the condition that triggers the bug.

Concurrency (threading) bugs

Concurrency bugs are among the hardest to crack because they can’t be reproduced consistently. Troubleshooting is difficult: even small modifications to the program (e.g., adding additional logging) can make reproducing the issue much harder, which is why concurrency bugs are a great example of Heisenbugs. And the worst part: when the fix lands, there is always the uncertainty of whether it worked because the bug could never be reproduced consistently, to begin with.

The bug was indeed fixed

A developer touching the code fixed the bug. This fix doesn’t have to be intentional – sometimes, refactoring or implementing a feature may result in deleting or fixing the buggy code path.

The bug didn’t disappear

The developer tasked with fixing the bug missed something or didn’t understand the bug in the first place. We’ve all been there. Dismissing a popup without reading what it says or ignoring an error message indicating a problem happens to everyone.

Fixing a bug can be easier than figuring out why it stopped manifesting. But understanding why a bug suddenly disappeared is important. It allows for re-assessing its severity and priority under new circumstances.

Conclusion

If nobody fixed it, it ain’t fixed.