On-call Manual: Onboarding a new person to the on-call rotation

One (selfish) reason to celebrate a new team member is that they will eventually join the on-call rotation. And when they do, the existing shifts will move farther apart. However, adding an unprepared engineer to the on-call rotation can be a disaster. This post describes what on-call onboarding looks like on our team.

The on-call onboarding process is the same for each new team member. It consists of the following steps:

  1. Regular ramp-up
  2. On-call overview
  3. Shadow shift
  4. Reverse shadow shift
  5. First solo shift

Let’s look into each of these steps in more detail.

Regular ramp-up

The regular ramp-up aims to help new team members familiarize themselves with the problems the team is solving and teach them how to work effectively in the team’s codebase. We want new colleagues to work on the code they will be responsible for when they are on call later. This approach allows them to acquire basic context that will be useful for maintaining this code and troubleshooting issues.

On-call overview

Regular ramp-up is rarely sufficient for new people to grasp the entire infra the team is responsible for. And knowing this infra is just the tip of the iceberg. There is much more an effective on-call needs to be familiar with, for instance:

  • what are the dependencies, and what is the impact of their failures
  • how to find dashboards and use them for debugging
  • where to find the documentation (e.g., runbooks)
  • expectations, e.g., is the on-call responsible for alerts raised outside working hours
  • how to do deployments and rollbacks
  • tools used to troubleshoot and fix issues
  • standard operating procedures
  • and more

On our team, we organize knowledge-sharing sessions that give new team members an overview of all these areas. We record these sessions to make revisiting unclear topics easy.

Shadow on-call shift

During the shadow on-call shift, the on-call-in-training (a.k.a. secondary on-call) shadows an experienced on-call (a.k.a. primary on-call). Both on-calls are subscribed to all tasks and alerts, but resolving issues is the primary on-call’s responsibility. The primary on-call is expected to show the secondary on-call how to deal with outages. This is usually limited to problems occurring during working hours. Finally, the primary on-call can ask the secondary on-call to handle non-critical tasks, providing guidance as needed.

Reverse shadow on-call shift

After the shadow shift, things get real: the on-call in training becomes the primary on-call. They are now responsible for handling all alerts, tasks, deployments, etc. However, they are not alone—they have an experienced on-call having their back during the entire shift.

We schedule shadow and reverse shadow shifts back-to-back. This way, everything the on-call-in-training learned during the first shift is fresh when they become the primary on-call.

First solo shift

Once shadowing is complete, we add the new team member to the on-call rotation. We add them to the queue’s end, giving them additional time to learn more about our systems and the infrastructure.

In addition to training new on-calls, our team maintains a chat to discuss on-call problems and get help when resolving issues. Both new and experienced on-calls regularly use this chat when they are stuck because they know someone will be there to help them.

On-call Manual: Boost your career by improving your team’s on-call

I have yet to find a team maintaining critical systems that is happy with its on-call. Most engineers dread their on-call shifts and want to forget about on-call as soon as their shift ends. For some, hectic on-call shifts are the reason to leave the team or even the company.

But this is great news for you. All these factors make improving on-call a great career opportunity. Here are a few reasons:

  • Team-wide impact. Making the on-call better increases work satisfaction for everyone on the team.
  • Finding work is easy. No on-call is perfect. There’s always something to fix.
  • No competition. Most engineers consider work related to on-call uninteresting, so you can fully own the entire area. As a result, your scope might be bigger than any other development work you own.

Getting started

It is difficult to propose meaningful improvements to your team’s on-call before your first shift. You need to become familiar with your team’s on-call responsibilities and problems before trying to make it better.

Once you have a few shifts under your belt, you should know the most problematic areas. Come up with a few concrete actions to remedy the biggest issues. This list doesn’t have to be complete to get started. Some examples include tuning (or deleting) the noisiest alerts, refactoring fragile code, or automating time-consuming manual tasks.

Talk to your manager about the improvements you want to make. No manager who cares about their team would refuse the offer to improve the team’s on-call. If the timing is not right (e.g., your team is closing a big release), ask your manager when a better time would be. Mention that you may need their help to ensure the participation of all team members.

Set your expectations right. Despite the improvements, don’t expect your team members to suddenly start loving their on-call. It’s a win if they stop dreading it.

Execution

From my experience, the most effective way to improve the on-call is to combine regular (e.g., twice a year) fixathons with ongoing maintenance.

During a fixathon, the entire team spends a few days fixing the biggest on-call issues. In most cases, these will be issues that started occurring since the previous fixathon but weren’t taken care of by on-calls during their shifts. You may need to work closely with your manager to ensure the entire team’s participation, especially at the beginning.

Ongoing maintenance involves fixing problems as they arise, usually done by the person on call. As some shifts are heavier than others, the on-call may not always be able to address all issues.

Your role

Before talking about what your role is, let’s talk about what your role isn’t.

Your role isn’t to single-handedly fix all on-call issues.

This approach doesn’t scale. If you try it, you will eventually burn out, struggling to do two full-time jobs simultaneously: your regular responsibilities and fixing on-call issues. The worst part is that your team members won’t feel responsible for maintaining the on-call quality. They might even care less because now somebody is fixing issues for them.

While you should still participate in fixing on-call issues, your main role is to:

  • organize fixathons – identify the most pressing issues and distribute issues for the team to work on, track progress, and measure the improvement
  • ensure on-calls are addressing issues they encountered during their shifts
  • build tools – e.g., dashboards to monitor the quality of the on-call or queries that help identify the biggest problems quickly

If you do this consistently, your team members will eventually find fixing on-call issues natural.

Skills you will learn

Driving on-call improvements will help you hone a few skills that are key for successful senior and even staff engineers:

  • leading without authority – as the owner of the on-call improvement area you’re responsible for coming up with the plan and leading its execution
  • scaling through others – because you involve the entire team, you can get much more done than if you did it yourself
  • influencing the engineering culture of the team – ingraining a sense of responsibility for the on-call quality in team members is an impactful change
  • holding people accountable – making sure everyone does their part is always a challenge
  • identifying problems worth solving – instead of being told what problems to solve, you are responsible for finding these problems and deciding if they are worth solving

Expanding your scope

Once you start seeing the results of your work, you can take it further to expand your scope.

You can become the engineer who manages the on-call rotation for your team. This work doesn’t take a lot of time but can save a lot of headaches for your manager. The typical responsibilities include:

  • managing the on-call schedule
  • organizing onboarding new team members to the on-call rotation
  • helping figure out shift swaps and substitutions

Another way to increase your scope is to share your experience with other teams. You can organize talks showing what you did, the results you achieved, and what worked and what didn’t. You can also generalize the tools you built so that other teams can use them.

The paradox of test coverage

When I learn that code owned by a team has low test coverage, I expect “here be dragons.” But I never know what to expect if the code coverage is high. I call this a paradox of high test coverage.

High test coverage does not tell you much about the quality of unit tests. Low coverage does.

The low coverage argument is self-explanatory. If tests cover only a small portion of the product code, they cannot prevent bugs in the code that is not covered. The opposite is, however, not true: high test coverage does not guarantee a quality product. How is this possible?

Test issues

While unit tests ensure the quality of the product code, nothing, except the developer, ensures the quality of the unit tests. As a result, tests sometimes have issues that allow bugs to sneak in. Finding unit test issues is more luck than science. It usually happens by accident, typically when tests continue to pass despite code changes that should trigger test failures.

One of the simplest examples of a unit test issue is missing asserts. Tests without asserts are unlikely to flag issues. Other common problems include incorrect setup and bugs caused by copying existing tests and incorrectly adapting them to test a new scenario.
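To make the missing-assert problem concrete, here is a minimal Python sketch. The `apply_discount` function, its bug, and the two checks are hypothetical and exist only for illustration. Both checks execute the same product code, so both contribute equally to coverage, but only one of them can fail.

```python
def apply_discount(price: float, percent: float) -> float:
    """Hypothetical product code with a bug: the discount is added, not subtracted."""
    return price + price * percent / 100


def discount_check_without_assert() -> None:
    # Executes the product code (and counts toward coverage) but verifies nothing.
    apply_discount(100.0, 10.0)


def discount_check_with_assert() -> None:
    # Same call, but the result is verified, so the bug is caught.
    assert apply_discount(100.0, 10.0) == 90.0


def run(check) -> str:
    """Tiny stand-in for a test runner: reports whether a check passed."""
    try:
        check()
    except AssertionError:
        return "failed"
    return "passed"


if __name__ == "__main__":
    print(run(discount_check_without_assert))  # the buggy code goes unnoticed
    print(run(discount_check_with_assert))     # the bug is flagged
```

The check without an assert keeps passing no matter how `apply_discount` changes, which is exactly the "tests continue to pass despite code changes" symptom described above.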

Mocking issues

Mocking allows the code under test to be isolated from its dependencies and simulate the dependency behavior. However, when the simulation is incorrect or the behavior of the dependency changes, tests may happily pass, hiding serious issues.

I’ve been working with C++ code bases, and I often see developers assume, without confirming, that a dependency they use won’t throw an exception. So, when they mock this dependency, they forget about the exception case. Even though their tests cover all the code, an exception in production takes the entire service down.
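The same trap exists outside C++. Below is a small Python analogue using the standard `unittest.mock` module. The `get_display_name` function and the `user_store` dependency are made up for this sketch. The first mock simulates only the happy path, the second simulates a failure the real dependency might produce.

```python
from unittest.mock import Mock


def get_display_name(user_store, user_id: int) -> str:
    """Hypothetical product code: silently assumes user_store.fetch never raises."""
    user = user_store.fetch(user_id)
    return user["name"].title()


# Happy-path mock: the check passes and every line of get_display_name is covered.
happy_store = Mock()
happy_store.fetch.return_value = {"name": "ada lovelace"}
assert get_display_name(happy_store, 1) == "Ada Lovelace"

# Mock that also simulates the failure mode (assumed here to be a ConnectionError).
failing_store = Mock()
failing_store.fetch.side_effect = ConnectionError("store unavailable")
try:
    get_display_name(failing_store, 1)
    outcome = "handled"
except ConnectionError:
    # Nothing in the product code deals with the error, so it escapes to the caller.
    outcome = "unhandled exception escaped"
```

With only the happy-path mock, coverage for `get_display_name` is 100%, yet the unhandled exception stays invisible until the second mock (or production) triggers it.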

Uncovered code

Getting to 100% code coverage is usually impractical, if not impossible. As a result, a small amount of code is still not covered. Similar to the low coverage scenarios, any change to the code that is not covered can introduce a bug that won’t be detected.

Chasing the coverage number

Test coverage is only a metric. I’ve seen teams do whatever it takes to achieve the metric goal, especially if it was mandated externally, e.g., at the organization or company level. Occasionally, I encountered teams that wrote “test” code whose primary purpose was increasing coverage. Detecting or preventing bugs was a non-goal.

Low test coverage is only the tip of the iceberg

At first sight, low test coverage seems a benign issue. But it often signals bigger problems the team is facing, like:

  • spending a significant amount of time fixing regressions
  • shipping high-quality new features is slow due to excessive manual validation
  • many bugs reach production and are only caught and reported by users
  • the on-call, if the team has one, is challenging
  • the engineering culture of the team is poor, or the team is under pressure to ship new features at an unsustainable pace
  • the code is not very well organized and might be hard to work with, only slowing down the development even further
  • test coverage is likely lower than admitted to and will continue to deteriorate

I’ve worked on a few teams where developers understood the value of unit testing. They treated test code like product code and never sent a PR without unit tests. Because of this, even if they experienced the problems listed above, it was at a much smaller scale. They also never needed to worry about meeting the test coverage goals – they achieved them as a side effect.

On-call Manual: Measuring the quality of the on-call

Reasonable on-call is no accident. Getting there requires a lot of hard work. But how can you tell if you’re on the right track if the experience can completely change from one shift to another? One answer to this question is monitoring.

How does monitoring help?

At a high level, monitoring can tell you if the on-call duty is improving, staying the same, or deteriorating over a longer period. Understanding the trend is important to decide whether the current investment in keeping the on-call reasonable is sufficient.

At a more granular level, monitoring helps identify the areas that need the most attention, like:

  • noisy alerts
  • problematic dependencies
  • features causing customers’ complaints
  • repetitive tasks

Continuously addressing the top issues will gradually improve the overall on-call experience.

What metrics to monitor

There is no one correct answer to what metrics to monitor. It depends a lot on what the team does. For example, frontend teams may choose to monitor the number of tickets opened by the customers, while backend teams may want to focus more on time spent on fixing broken builds or failing tests. Here are some metrics to consider:

  • outages of the products the team owns
  • external incidents impacting the products the team owns
  • the number of alerts, broken down by urgency
  • the number of alerts acted on and ignored
  • the number of alerts outside working hours
  • time to acknowledge alerts
  • the number of tickets opened by customers
  • the number of internal tasks
  • build breaks
  • test failures

How to monitor?

On-call monitoring is difficult because there isn’t a single metric that can reflect the health of the on-call. My team uses both quantitative metrics (data) and qualitative metrics (opinions).

Quantitative metrics

Quantitative metrics can usually be collected from alerting systems, bug trackers, and task management systems. Here are a few examples of quantitative metrics we are tracking on our team:

  • the number of alerts
  • the number of tasks
  • the number of alerts outside working hours
  • the noisiest alerts, tracked by alert ID

As quantitative metrics are collected automatically, we built a dashboard to show them in an easy-to-understand way. Keeping historical data allows us to track trends.
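As a rough illustration of how such numbers could be derived, here is a short Python sketch. It is not our actual dashboard code: the `Alert` record, the alert IDs, the sample data, and the assumed working hours (Monday to Friday, 09:00 to 17:00) are all hypothetical.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Alert:
    alert_id: str       # hypothetical identifier of the alert definition
    fired_at: datetime  # local time when the alert fired


def is_outside_working_hours(ts: datetime, start_hour: int = 9, end_hour: int = 17) -> bool:
    """Assumed working hours: Monday to Friday, 09:00 to 17:00 local time."""
    return ts.weekday() >= 5 or not (start_hour <= ts.hour < end_hour)


def summarize(alerts: list[Alert], top_n: int = 3) -> dict:
    """Compute a few of the quantitative metrics listed above."""
    return {
        "total_alerts": len(alerts),
        "outside_working_hours": sum(is_outside_working_hours(a.fired_at) for a in alerts),
        "noisiest": Counter(a.alert_id for a in alerts).most_common(top_n),
    }


# Made-up sample data for a single shift.
alerts = [
    Alert("disk-full", datetime(2024, 3, 4, 10, 15)),    # Monday, working hours
    Alert("disk-full", datetime(2024, 3, 4, 23, 40)),    # Monday night
    Alert("high-latency", datetime(2024, 3, 9, 14, 0)),  # Saturday
    Alert("disk-full", datetime(2024, 3, 5, 3, 5)),      # Tuesday, early morning
]
summary = summarize(alerts)
```

Storing one such summary per shift is enough to plot trends over time and to spot which alert IDs keep showing up at the top of the list.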

Qualitative metrics

Qualitative metrics are opinions about the shift from the person ending the shift. Using qualitative metrics in addition to quantitative metrics is necessary because numbers are sometimes misleading. Here is an example: handling a dozen tasks that can be closed almost immediately without much effort is easier than collaborating with a few teams to investigate a hard-to-reproduce customer report. However, considering only how many tasks each on-call got during their shift, the first shift appears heavier than the second.

On our team, each person going off-call fills out an On-call survey that is part of the On-call report. Here are some of the questions from the survey:

  • Rate your on-call experience from 1 to 10 (1: easy, 10: horrible)
  • Rate your experience with resources available for resolving on-call issues (e.g., runbooks, documentation, tools, etc.) from 1 to 10 (1: no resources or very poor resources, 10: excellent resources that helped solve issues quickly)
  • How much time did you spend on urgent activities like alerts, fire fighting, etc. (0%-100%)?
  • How much time did you spend on non-urgent activities like non-urgent tasks, noise, etc. (0%-100%)?
  • Additional comments (free flow)

We’ve been conducting this survey for a couple of years now. One interesting observation I made is that it is not uncommon for a horrible shift for one person to be decent for someone else. Experienced on-calls usually rate their shifts easier than developers who just finished their first shift. This is understandable. We still treat all opinions equally—improving the on-call quality for one person improves it for everyone.

The Additional comments question is my favorite as it provides insights no other metric can capture.

Call to Action

If being on-call is part of your team’s responsibilities and you don’t monitor it, I highly encourage you to start doing so. Even a simple monitoring system will tell you a lot about your on-call and allow you to improve it by addressing the most annoying issues.

Top 5 Unit Test Problems That Haunt Software Developers

Well-written unit tests are one of the most effective tools for ensuring product quality. Unfortunately, not all unit tests are well written, and the ones that are not are often a source of frustration and lost productivity. Here are the most common unit test issues I encountered during my career.

Flaky unit tests

Flaky tests pass most of the time, but not always. They may randomly fail even though no code has changed. The quickest and most common “fix” developers employ is to re-run them. With time, the number of flaky tests grows, and even multiple re-runs are insufficient.

Flaky tests are caused primarily by the following:

  • shared state
  • dependency on external systems

Shared state is the number one cause of test flakiness. Static variables are one example. If one test sets a static variable and another passes only if this variable is set, the second test will fail if the order of execution changes.

Debugging flakiness caused by shared state is usually tricky because sharing state is rarely intentional.
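The order dependency can be reproduced in a few lines. The sketch below uses a Python module-level dictionary as a stand-in for a static variable. The names (`_cache`, `load_config`, `current_region`) and the tiny runner are invented for this example.

```python
_cache: dict[str, str] = {}  # module-level state shared by every check


def load_config() -> None:
    _cache["region"] = "eu-west"


def current_region() -> str:
    return _cache.get("region", "unknown")


def check_load_config() -> None:
    load_config()
    assert _cache["region"] == "eu-west"


def check_current_region() -> None:
    # Passes only if check_load_config ran earlier: a hidden order dependency.
    assert current_region() == "eu-west"


def run_in_order(*checks) -> list[str]:
    """Run checks in the given order, starting from a clean shared state."""
    _cache.clear()
    results = []
    for check in checks:
        try:
            check()
            results.append("passed")
        except AssertionError:
            results.append("failed")
    return results


original_order = run_in_order(check_load_config, check_current_region)
reversed_order = run_in_order(check_current_region, check_load_config)
```

Both checks pass in the original order, while the reversed order makes `check_current_region` fail even though no code changed. Having each check set up the state it needs removes the dependency.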

Tests that depend on external systems tend to be flaky because the systems they rely on are outside their control. Any deployments, crashes, or throttling can cause test failures. The network, which is inherently unreliable, is yet another contributor. The best fix is to mock external dependencies.

Multithreaded applications deserve special mention. Race conditions in the product code could make tests for these applications flaky, and finding the root cause is often challenging.

Slow tests

Slow tests are a productivity killer. If running tests for a code change takes more than a few seconds, developers will use it as an excuse to find a distraction.

One of the most common reasons tests are slow is their dependency on external systems: network calls and the time to process the requests initiated by tests add up.

But tests that depend on external systems are also flaky, so slowness and flakiness go hand-in-hand.

Again, mocking external dependencies is the best fix to make tests fast and reliable.

If relying on external systems is intentional (e.g., end-to-end testing), it is worth separating end-to-end tests into a dedicated suite executed separately, for instance, as part of the nightly build.

I was once on a team where running all the tests took more than two hours because most of them communicated with a database. These tests were also flaky, so merging more than one Pull Request a day was virtually impossible.

Bugs in unit tests

Tests are there to ensure the quality of the product, but nothing is there to ensure the quality of tests. As a result, tests may fail to do their job due to bugs. Unfortunately, identifying these bugs is not easy. Paying attention can help. For instance, if all tests continue to pass after changing the product code, it usually indicates either bugs in tests or missing test coverage.

Hard to maintain tests

Tying tests and implementation details closely usually causes numerous test failures after even simple product code changes. Keeping tests focused on functionality instead of on the implementation can significantly reduce the number of unnecessary test failures.
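Here is a hedged sketch of the difference, using a hypothetical `Slugifier` class and a behavior-preserving refactoring of it. One check looks only at inputs and outputs. The other uses `unittest.mock.patch.object` to insist that a private helper is called.

```python
from unittest.mock import patch


class Slugifier:
    """Hypothetical product code, version 1."""

    def slugify(self, title: str) -> str:
        return "-".join(self._words(title))

    def _words(self, title: str) -> list[str]:
        return title.lower().split()


class RefactoredSlugifier(Slugifier):
    """Version 2: same behavior, but the private helper is no longer used."""

    def slugify(self, title: str) -> str:
        return "-".join(title.lower().split())


def behavior_check(cls) -> bool:
    # Focuses on functionality: input in, expected output out.
    return cls().slugify("Hello Unit Tests") == "hello-unit-tests"


def implementation_check(cls) -> bool:
    # Coupled to an implementation detail: insists that _words() gets called.
    with patch.object(cls, "_words", return_value=["hello"]) as helper:
        cls().slugify("Hello")
        return helper.called


results = {
    "behavior/v1": behavior_check(Slugifier),
    "behavior/v2": behavior_check(RefactoredSlugifier),
    "implementation/v1": implementation_check(Slugifier),
    "implementation/v2": implementation_check(RefactoredSlugifier),
}
```

The behavior-focused check gives the same answer for both versions. The implementation-coupled check breaks after the refactoring, although nothing observable changed for callers.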

Writing “tests” only to hit the code coverage number

Test code written solely to meet code coverage goals is usually low quality. Assertions in such code are often missing because they don’t contribute to the coverage goal but can cause failures. Test coverage reported by tools can make the manager look good, but this test code is useless as it can’t prevent bugs. What’s worse, the high coverage hides areas that do need attention.

This is my list of the top 5 unit test issues. What’s yours?