On-call Manual: Onboarding a new person to the on-call rotation

One (selfish) reason to celebrate a new team member is that they will eventually join the on-call rotation. And when they do, the existing shifts will move farther apart. However, adding an unprepared engineer to the on-call rotation can be a disaster. This post describes what on-call onboarding looks like on our team.

The on-call onboarding process is the same for each new team member. It consists of the following steps:

Regular ramp-up
On-call overview
Shadow shift
Reverse shadow shift
First solo shift

Let’s look into each of these steps in more detail.

Regular ramp-up

The regular ramp-up aims to help new team members familiarize themselves with the problems the team is solving and teach them how to work effectively in the team’s codebase. We want new colleagues to work on the code they will be responsible for when they are on call later. This approach allows them to acquire basic context that will be useful for maintaining this code and troubleshooting issues.

On-call overview

Regular ramp-up is rarely sufficient for new people to grasp the entire infra the team is responsible for. And knowing this infra is just the tip of the iceberg. There is much more an effective on-call needs to be familiar with, for instance:

what are the dependencies, and what is the impact of their failures
how to find dashboards and use them for debugging
where to find the documentation (e.g., runbooks)
expectations, e.g., is the on-call responsible for alerts raised outside working hours
how to do deployments and rollbacks
tools used to troubleshoot and fix issues
standard operating procedures
and more

On our team, we organize knowledge-sharing sessions that give new team members an overview of all these areas. We record these sessions to make revisiting unclear topics easy.

Shadow on-call shift

During the shadow on-call shift, the on-call-in-training (a.k.a. secondary on-call) shadows an experienced on-call (a.k.a. primary on-call). Both on-calls are subscribed to all tasks and alerts, but resolving issues is the primary on-call’s responsibility. The primary on-call is expected to show the secondary on-call how to deal with outages. This is usually limited to problems occurring during working hours. Finally, the primary on-call can ask the secondary on-call to handle non-critical tasks, providing guidance as needed.

Reverse shadow on-call shift

After the shadow shift, things get real: the on-call in training becomes the primary on-call. They are now responsible for handling all alerts, tasks, deployments, etc. However, they are not alone—they have an experienced on-call having their back during the entire shift.

We schedule shadow and reverse shadow shifts back-to-back. This way, everything the on-call-in-training learned during the first shift is fresh when they become the primary on-call.

First solo shift

Once shadowing is complete, we add the new team member to the on-call rotation. We put them at the end of the queue, giving them additional time to learn more about our systems and the infrastructure.

In addition to training new on-calls, our team maintains a chat to discuss on-call problems and get help when resolving issues. Both new and experienced on-calls regularly use this chat when they are stuck because they know someone will be there to help them.

On-call Manual: Boost your career by improving your team’s on-call

I have yet to find a team maintaining critical systems that is happy with its on-call. Most engineers dread their on-call shifts and want to forget about on-call as soon as their shift ends. For some, hectic on-call shifts are the reason to leave the team or even the company.

But this is great news for you. All these factors make improving on-call a great career opportunity. Here are a few reasons:

Team-wide impact. Making the on-call better increases work satisfaction for everyone on the team.
Finding work is easy. No on-call is perfect. There’s always something to fix.
No competition. Most engineers consider work related to on-call uninteresting, so you can fully own the entire area. As a result, your scope might be bigger than that of any other development work you own.

Getting started

It is difficult to propose meaningful improvements to your team’s on-call before your first shift. You need to become familiar with your team’s on-call responsibilities and problems before trying to make it better.

Once you have a few shifts under your belt, you should know the most problematic areas. Come up with a few concrete actions to remedy the biggest issues. This list doesn’t have to be complete to get started. Some examples include tuning (or deleting) the noisiest alerts, refactoring fragile code, or automating time-consuming manual tasks.

Talk to your manager about the improvements you want to make. No manager who cares about their team would refuse the offer to improve the team’s on-call. If the timing is not right (e.g., your team is closing a big release), ask your manager when a better time would be. Mention that you may need their help to ensure the participation of all team members.

Set your expectations right. Despite the improvements, don’t expect your team members to suddenly start loving their on-call. It’s a win if they stop dreading it.

Execution

From my experience, the most effective way to improve the on-call is to combine regular (e.g., twice a year) fixathons with ongoing maintenance.

During a fixathon, the entire team spends a few days fixing the biggest on-call issues. In most cases, these will be issues that have appeared since the previous fixathon but weren’t taken care of by on-calls during their shifts. You may need to work closely with your manager to ensure the entire team’s participation, especially at the beginning.

Ongoing maintenance involves fixing problems as they arise, usually done by the person on call. As some shifts are heavier than others, the on-call may not always be able to address all issues.

Your role

Before talking about what your role is, let’s talk about what your role isn’t.

Your role isn’t to single-handedly fix all on-call issues.

This approach doesn’t scale. If you try it, you will eventually burn out, struggling to do two full-time jobs simultaneously: your regular responsibilities and fixing on-call issues. The worst part is that your team members won’t feel responsible for maintaining the on-call quality. They might even care less because now somebody is fixing issues for them.

While you should still participate in fixing on-call issues, your main role is to:

organize fixathons – identify the most pressing issues and distribute issues for the team to work on, track progress, and measure the improvement
ensure on-calls are addressing issues they encountered during their shifts
build tools – e.g., dashboards to monitor the quality of the on-call or queries that help identify the biggest problems quickly

If you do this consistently, your team members will eventually find fixing on-call issues natural.

Skills you will learn

Driving on-call improvements will help you hone a few skills that are key for successful senior and even staff engineers:

leading without authority – as the owner of the on-call improvement area you’re responsible for coming up with the plan and leading its execution
scaling through others – because you involve the entire team, you can get much more done than if you did it yourself
influencing the engineering culture of the team – ingraining a sense of responsibility for the on-call quality in team members is an impactful change
holding people accountable – making sure everyone does their part is always a challenge
identifying problems worth solving – instead of being told what problems to solve, you are responsible for finding these problems and deciding if they are worth solving

Expanding your scope

Once you start seeing the results of your work, you can take it further to expand your scope.

You can become the engineer who manages the on-call rotation for your team. This work doesn’t take a lot of time but can save a lot of headaches for your manager. The typical responsibilities include:

managing the on-call schedule
organizing onboarding new team members to the on-call rotation
helping figure out shift swaps and substitutions

Another way to increase your scope is to share your experience with other teams. You can organize talks showing what you did, the results you achieved, and what worked and what didn’t. You can also generalize the tools you built so that other teams can use them.

The paradox of test coverage

When I learn that code owned by a team has low test coverage, I expect “here be dragons.” But I never know what to expect if the code coverage is high. I call this a paradox of high test coverage.

High test coverage does not tell much about the quality of unit tests. Low coverage does.

The low coverage argument is self-explanatory. If tests cover only a small portion of the product code, they cannot prevent bugs in the code that is not covered. The opposite is, however, not true: high test coverage does not guarantee a quality product. How is this possible?

Test issues

While unit tests ensure the quality of the product code, nothing, except the developer, ensures the quality of the unit tests. As a result, tests sometimes have issues that allow bugs to sneak in. Finding unit test issues is more luck than science. It usually happens by accident, typically when tests continue to pass despite code changes that should trigger test failures.

One of the simplest examples of a unit test issue is missing asserts. Tests without asserts are unlikely to flag issues. Other common problems include incorrect setup and bugs caused by copying existing tests and incorrectly adapting them to test a new scenario.

Mocking issues

Mocking isolates the code under test from its dependencies by simulating their behavior. However, when the simulation is incorrect or the behavior of the dependency changes, tests may happily pass, hiding serious issues.

I’ve been working with C++ code bases, and I often see developers assume, without confirming, that a dependency they use won’t throw an exception. So, when they mock this dependency, they forget about the exception case. Even though their tests cover all the code, an exception in production takes the entire service down.
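The same trap can be sketched in a few lines of Python (used here for brevity; the idea carries over to C++ mocking frameworks). Everything in this example is an assumption made for illustration: `total_price` stands in for the code under test, and `fetch_price` for a dependency call that can fail. The first test mocks only the success case and still covers every line, while the second uses `side_effect` from `unittest.mock` to simulate the failure that the first test silently ignores.

```python
from unittest import mock


def total_price(client, items):
    """Hypothetical code under test: sums prices fetched from a dependency."""
    return sum(client.fetch_price(item) for item in items)


def test_happy_path_only():
    # The mock always returns a value, mirroring the unverified assumption
    # that the dependency never fails. Every line of total_price is covered.
    client = mock.Mock()
    client.fetch_price.return_value = 5
    assert total_price(client, ["a", "b"]) == 10


def test_dependency_failure():
    # side_effect makes the mock raise, simulating the case the first test
    # ignores. total_price has no error handling, so the error propagates.
    client = mock.Mock()
    client.fetch_price.side_effect = TimeoutError("backend unavailable")
    try:
        total_price(client, ["a"])
    except TimeoutError:
        return "propagated"
    return "handled"
```

A suite containing only the first test reports full coverage of `total_price` and says nothing about what happens when the dependency fails.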

Uncovered code

Getting to 100% code coverage is usually impractical, if not impossible. As a result, a small amount of code is still not covered. Similar to the low coverage scenarios, any change to the code that is not covered can introduce a bug that won’t be detected.

Chasing the coverage number

Test coverage is only a metric. I’ve seen teams do whatever it takes to achieve the metric goal, especially if it was mandated externally, e.g., at the organization or company level. Occasionally, I encountered teams that wrote “test” code whose primary purpose was increasing coverage. Detecting or preventing bugs was a non-goal.

Low test coverage is only the tip of the iceberg

At first sight, low test coverage seems a benign issue. But it often signals bigger problems the team is facing, like:

spending a significant amount of time fixing regressions
shipping high-quality new features is slow due to excessive manual validation
many bugs reach production and are only caught and reported by users
the on-call, if the team has one, is challenging
the engineering culture of the team is poor, or the team is under pressure to ship new features at an unsustainable pace
the code is not very well organized and might be hard to work with, only slowing down the development even further
test coverage is likely lower than admitted to and will continue to deteriorate

I’ve worked on a few teams where developers understood the value of unit testing. They treated test code like product code and never sent a PR without unit tests. Because of this, even if they experienced the problems listed above, it was at a much smaller scale. They also never needed to worry about meeting the test coverage goals – they achieved them as a side effect.

On-call Manual: Measuring the quality of the on-call

Reasonable on-call is no accident. Getting there requires a lot of hard work. But how can you tell if you’re on the right track if the experience can completely change from one shift to another? One answer to this question is monitoring.

How does monitoring help?

At a high level, monitoring can tell you if the on-call duty is improving, staying the same, or deteriorating over a longer period. Understanding the trend is important to decide whether the current investment in keeping the on-call reasonable is sufficient.

At a more granular level, monitoring helps identify the areas that need attention the most, like:

noisy alerts
problematic dependencies
features causing customers’ complaints
repetitive tasks

Continuously addressing the top issues will gradually improve the overall on-call experience.

What metrics to monitor

There is no one correct answer to what metrics to monitor. It depends a lot on what the team does. For example, frontend teams may choose to monitor the number of tickets opened by the customers, while backend teams may want to focus more on time spent on fixing broken builds or failing tests. Here are some metrics to consider:

outages of the products the team owns
external incidents impacting the products the team owns
the number of alerts, broken down by urgency
the number of alerts acted on and ignored
the number of alerts outside the working hours
time to acknowledge alerts
the number of tickets opened by customers
the number of internal tasks
build breaks
test failures

How to monitor?

On-call monitoring is difficult because there isn’t a single metric that can reflect the health of the on-call. My team uses both quantitative metrics (data) and qualitative metrics (opinions).

Quantitative metrics

Quantitative metrics can usually be collected from alerting systems, bug trackers, and task management systems. Here are a few examples of quantitative metrics we are tracking on our team:

the number of alerts
the number of tasks
the number of alerts outside the working hours
the noisiest alerts, tracked by alert ID

As quantitative metrics are collected automatically, we built a dashboard to show them in an easy-to-understand way. Keeping historical data allows us to track trends.
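As a rough illustration of how such numbers could be derived, here is a minimal Python sketch. It is not the dashboard described above: the alert records, the alert IDs, and the 9:00 to 17:00 weekday definition of working hours are all assumptions made for the example. In practice, the records would come from an export of your alerting system.

```python
from collections import Counter
from datetime import datetime

# Hypothetical alert records: (alert ID, time the alert fired).
ALERTS = [
    ("disk-full", datetime(2024, 3, 4, 2, 15)),     # Monday, at night
    ("high-latency", datetime(2024, 3, 4, 10, 30)),
    ("disk-full", datetime(2024, 3, 5, 14, 0)),
    ("disk-full", datetime(2024, 3, 9, 11, 45)),    # Saturday
]


def is_outside_working_hours(fired_at, start_hour=9, end_hour=17):
    """Assumed definition: weekends, or weekdays before 9:00 or from 17:00."""
    return fired_at.weekday() >= 5 or not start_hour <= fired_at.hour < end_hour


def summarize(alerts):
    """Compute total alerts, off-hours alerts, and the noisiest alert IDs."""
    counts = Counter(alert_id for alert_id, _ in alerts)
    return {
        "total": len(alerts),
        "outside_working_hours": sum(
            is_outside_working_hours(fired_at) for _, fired_at in alerts
        ),
        "noisiest": counts.most_common(3),
    }
```

Calling `summarize(ALERTS)` once per shift and storing the results is one simple way to keep the historical data needed for trends.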

Qualitative metrics

Qualitative metrics are opinions about the shift from the person ending the shift. Using qualitative metrics in addition to quantitative metrics is necessary because numbers are sometimes misleading. Here is an example: handling a dozen tasks that can be closed almost immediately without much effort is easier than collaborating with a few teams to investigate a hard-to-reproduce customer report. However, considering only how many tasks each on-call got during their shift, the first shift appears heavier than the second.

On our team, each person going off-call fills out an On-call survey that is part of the On-call report. Here are some of the questions from the survey:

Rate your on-call experience from 1 to 10 (1: easy, 10: horrible)
Rate your experience with resources available for resolving on-call issues (e.g., runbooks, documentation, tools, etc.) from 1 to 10 (1: no resources or very poor resources, 10: excellent resources that helped solve issues quickly)
How much time did you spend on urgent activities like alerts, fire fighting, etc. (0%-100%)?
How much time did you spend on non-urgent activities like non-urgent tasks, noise, etc. (0%-100%)?
Additional comments (free flow)

We’ve been conducting this survey for a couple of years now. One interesting observation I made is that it is not uncommon for a horrible shift for one person to be decent for someone else. Experienced on-calls usually rate their shifts easier than developers who just finished their first shift. This is understandable. We still treat all opinions equally—improving the on-call quality for one person improves it for everyone.

The Additional comments question is my favorite as it provides insights no other metric can capture.
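If you want to track survey answers over time, the questions above can be modeled as a small record. The sketch below is only one possible shape: the class name, the field names, and the validation rules are assumptions for illustration and do not describe an existing tool.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class SurveyResponse:
    """One off-call survey; field names are assumptions for illustration."""
    shift_rating: int        # 1: easy, 10: horrible
    resources_rating: int    # 1: very poor resources, 10: excellent
    urgent_pct: int          # share of time spent on urgent activities
    non_urgent_pct: int      # share of time spent on non-urgent activities
    comments: str = ""

    def __post_init__(self):
        # Reject answers outside the ranges the survey questions allow.
        for rating in (self.shift_rating, self.resources_rating):
            if not 1 <= rating <= 10:
                raise ValueError("ratings must be between 1 and 10")
        for pct in (self.urgent_pct, self.non_urgent_pct):
            if not 0 <= pct <= 100:
                raise ValueError("percentages must be between 0 and 100")


def average_shift_rating(responses):
    """Every response counts equally, regardless of the on-call's experience."""
    return mean(r.shift_rating for r in responses)
```

Averaging without weights matches the choice to treat all opinions equally. The free-form comments are kept as plain text because they are meant to be read, not aggregated.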

Call to Action

If being on-call is part of your team’s responsibilities and you don’t monitor it, I highly encourage you to start doing so. Even a simple monitoring system will tell you a lot about your on-call and allow you to improve it by addressing the most annoying issues.

Tag: career
