A simple way to ship maintainable software

It was my first solo on-call shift on my new team. I was almost ready to go home when a Critical alert fired. I acknowledged it almost instantly and started troubleshooting. But it was not going well. Wherever I turned, I hit a roadblock. The alert runbook was empty. The dashboards didn’t work. And I couldn’t see any logs because logging was disabled.

Some team members were still around, and I turned to them for help. I learned that the impacted service had shipped only a week earlier, and barely anyone knew how it worked. The person who wrote and shipped it was on sick leave.

It took us a few hours to figure out what was happening and to mitigate the outage. The work made one thing apparent: this service was not ready for prime time.

In the week following the incident, we filled the gaps we had found during the outage. Our main goal was to ensure that future on-calls wouldn’t have to scramble when encountering issues with this service.

But the bigger question remained unanswered: how can we avoid similar issues with any new service or feature we ship in the future?

The idea we came up with was the Service Readiness Checklist.

What is the Service Readiness Checklist?

The Service Readiness Checklist is a list of requirements each service (or larger feature) needs to meet before it can be considered ready to ship. It serves two purposes:

  • to guarantee that no aspect of operating the service has been forgotten
  • to make it clear who is responsible for ensuring that requirements have been reviewed and met

When we are close to shipping, we create a task containing a copy of the readiness checklist and assign it to the engineer driving the project. They become responsible for ensuring all requirements on the checklist are met.

Having one engineer responsible for the checklist helps avoid situations where requirements fall through the cracks because everyone thought someone else was taking care of them. The primary job of this engineer is to ensure all checkboxes are checked. They may do the work themselves, or assign items to the people involved in the project and coordinate the work.

Occasionally, the checklist owner may decide that some requirements are inapplicable. For example, the checklist may call for setting up deployment, but there is nothing to do if the existing deployment infrastructure automatically covers it.

The checklist will usually contain more than ten requirements. Each of them is obvious on its own, but it is easy to miss some simply because of how many there are.

Example readiness checklist

There is no single readiness checklist that works for every team because each team operates differently. They all follow different processes and have their own ways of running their code and detecting and troubleshooting outages. There is, however, a common subset of requirements that can serve as a starting point for a team-specific readiness checklist (a sketch of keeping it as a reusable template follows the list):

  • [ ] Has the service/feature been introduced to the on-call?
  • [ ] Has sufficient documentation been created for the service? Does it contain information about dependencies, including the on-calls who own them?
  • [ ] Does the service have working dashboards?
  • [ ] Have alerts been created and tested?
  • [ ] Does the service/feature have runbooks (a.k.a. playbooks)?
  • [ ] Has the service been load tested?
  • [ ] Is logging for the service/feature enabled at the appropriate level?
  • [ ] Is automated deployment configured?
  • [ ] Does the service/feature have sufficient test coverage?
  • [ ] Has a rollout plan been developed?
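
The checklist itself doesn’t need special tooling. As a minimal sketch (not a description of any particular team’s setup), the items above could be kept as a small Python template that renders a fresh copy into a task description whenever a project approaches shipping; the service and owner names in the example are hypothetical.

    # Minimal sketch: render the readiness checklist into a task description.
    # The items mirror the example list above; the output can be pasted into
    # whatever task tracker the team uses (no tracker API is assumed here).
    CHECKLIST = [
        "Has the service/feature been introduced to the on-call?",
        "Has sufficient documentation been created for the service?",
        "Does the service have working dashboards?",
        "Have alerts been created and tested?",
        "Does the service/feature have runbooks (a.k.a. playbooks)?",
        "Has the service been load tested?",
        "Is logging for the service/feature enabled at the appropriate level?",
        "Is automated deployment configured?",
        "Does the service/feature have sufficient test coverage?",
        "Has a rollout plan been developed?",
    ]

    def render_checklist(service: str, owner: str) -> str:
        """Return a task body with one unchecked box per requirement."""
        lines = [f"Readiness checklist: {service}", f"Owner: {owner}", ""]
        lines += [f"[ ] {item}" for item in CHECKLIST]
        return "\n".join(lines)

    if __name__ == "__main__":
        print(render_checklist("example-service", "project lead"))  # hypothetical values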

Success story

Our team was tasked with solving a relatively big problem on a tight timeline. The solution required building a pipeline of a few services. Because we didn’t have enough people to implement this infrastructure within the allotted time, we asked for help. Soon after, a few engineers temporarily joined our team. We were worried, however, that this partnership might not work out because of the differences in our engineering cultures. The Service Readiness Checklist was one of the things (others included coding guidelines, interface-based programming, etc.) that helped set clear expectations. With both teams on the same page, the collaboration was smooth, and we shipped the project on time.

On-call Manual: How to cope with on-call anxiety

On-call anxiety is real. I’ve been there, and I know engineers who experienced it. Many factors contribute to it, but from my experience, three stand out.

1. Unpredictability

Unpredictability is the number one reason for on-call anxiety. You might be responsible for a wide range of services. They may break at any time for all sorts of reasons: network issues, deployments, failing dependencies, shared infra outages, data center drains (a.k.a. storms), excavators damaging cables, and so on. On-calls, especially new ones, worry they won’t know what to do if they get an alert. How would they figure out what broke? How would they come up with a fix?

What to do about it?

With experience, the unpredictability aspect of the on-call gets easier. But even for the most seasoned on-call engineers, handling an outage can be difficult without the proper tools like:

  • Easy-to-navigate dashboards that make it quick to tell whether a service is working correctly and to identify problematic areas in case of failure
  • Playbooks (a.k.a. runbooks) explaining troubleshooting and mitigation steps
  • Documentation describing the service and its dependencies, including the relevant on-call rotations to reach out to if necessary

Having a team eager to jump in and help mitigate an outage quickly is priceless. Your team members understand some areas better than you. Knowing they have your back is reassuring.

2. Too many alerts and incidents

The second most common reason engineers fear their on-call is a never-ending litany of alerts, requests, and tasks. If you get a new alert when you have barely finished acknowledging the previous one, and you are also expected to handle customer tickets and deal with requests from other teams, dreading your on-call is understandable. The exhaustion is usually exacerbated by the feeling of not doing a decent job. I was on a rotation like this once. After a while, I realized that everyone, not only me, was overwhelmed. Even though we toiled long hours, most alerts were ignored, customer tickets remained unanswered, and requests from other teams were only handled after being escalated to the manager.

What to do about it?

There is no way a single person can fix a very heavy on-call by themselves. They won’t have the time during their shift, and by the time the shift ends, they will be so fed up that they won’t want to hear about anything on-call-related. There is, however, some low-hanging fruit that can help improve the quality of the on-call quickly:

  • Delete alerts – find routinely ignored alerts and determine if they’re useful. If they aren’t, delete them (a sketch for finding candidates follows this list).
  • Tune noisy but useful alerts – adjust thresholds and windows for flapping alerts, alerts that fire prematurely, and short-lived alerts.
  • Get a secondary on-call – a second person could help handle tasks the primary on-call does not have the capacity to deal with (e.g., customer tickets). This could be only temporary.
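
How you find routinely ignored alerts depends on your alerting system. As a rough sketch, assuming you can export alert history to a CSV with an alert name and a flag saying whether each firing was acted on (the column names below are made up), ranking alerts by volume and ignore rate surfaces deletion and tuning candidates quickly:

    # Rough sketch: rank alerts by how often they fire and how often they are ignored.
    # Assumes a CSV export with hypothetical columns: alert_name, acted_on ("yes"/"no").
    import csv
    from collections import Counter

    def rank_alerts(path: str):
        fired, ignored = Counter(), Counter()
        with open(path, newline="") as f:
            for row in csv.DictReader(f):
                name = row["alert_name"]
                fired[name] += 1
                if row["acted_on"].strip().lower() != "yes":
                    ignored[name] += 1
        # Sort by ignore rate, then by volume: the top entries are candidates
        # for deletion (if useless) or tuning (if useful but noisy).
        report = [(name, fired[name], ignored[name] / fired[name]) for name in fired]
        return sorted(report, key=lambda r: (r[2], r[1]), reverse=True)

    if __name__ == "__main__":
        for name, count, ignore_rate in rank_alerts("alert_history.csv")[:10]:
            print(f"{name}: fired {count} times, ignored {ignore_rate:.0%}")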

These ideas can alleviate on-call pain but are unlikely to fix a bad on-call for good. Improving a heavy on-call requires identifying and addressing problems at their source and demands effort from the entire team to maintain on-call quality. I wrote a post dedicated to this topic. Take a look.

3. Middle-of-the-night alerts

Many on-call rotations are 24/7. The on-call is responsible for dealing with incidents promptly, even if they happen in the middle of the night. Waking to an alert is not fun, and if it happens regularly, it is a valid reason to resent being on-call.

What to do about it?

While it may not be possible to avoid all middle-of-the-night alerts, there might be some actions you can take to reduce the disruption. A lot will depend on your specific situation, but here are some ideas:

  • Check your dashboards in the evening and address any issues that could raise an alert.
  • Increase alert thresholds outside working hours. If your traffic is cyclical – e.g., much lower at night because most requests come from one timezone – relaxing thresholds outside working hours can be safe: even if an incident happens, its impact will be smaller. Also, alerts get much noisier when traffic volume is low (e.g., if you get ten requests in an hour and one fails, you might get an alert due to a 10% error rate). See the sketch after this list.
  • Disable alerts at night. Some outages won’t cause any impact unless they last for a long time. For instance, our team was responsible for a service that would work fine even if one of its dependencies was down for a day. This 24-hour grace period allowed us to turn off alerts at night.
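
To make the low-traffic example concrete, here is a small sketch (not tied to any particular alerting system) of an error-rate check that requires a minimum number of requests and relaxes the threshold outside working hours; all numbers are illustrative, not recommendations:

    # Sketch of a time-aware error-rate check. With only 10 requests in an hour,
    # a single failure is already a 10% error rate, so the check requires a minimum
    # sample size and uses a relaxed threshold at night. Numbers are illustrative.
    from datetime import datetime

    DAY_THRESHOLD = 0.01     # 1% error rate during working hours
    NIGHT_THRESHOLD = 0.05   # 5% error rate outside working hours
    MIN_REQUESTS = 100       # don't alert on tiny samples
    WORKING_HOURS = range(9, 18)

    def should_alert(requests: int, errors: int, now: datetime) -> bool:
        if requests < MIN_REQUESTS:
            return False  # not enough traffic to trust the error rate
        threshold = DAY_THRESHOLD if now.hour in WORKING_HOURS else NIGHT_THRESHOLD
        return errors / requests > threshold

    # One failure out of ten nightly requests does not page anyone.
    print(should_alert(requests=10, errors=1, now=datetime(2024, 1, 10, 3, 0)))  # False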

On-call Manual: Onboarding a new person to the on-call rotation

One (selfish) reason to celebrate a new team member is that they will eventually join the on-call rotation. And when they do, the existing shifts will move farther apart. However, adding an unprepared engineer to the on-call rotation can be a disaster. This post describes what on-call onboarding looks like on our team.

The on-call onboarding process is the same for each new team member. It consists of the following steps:

  1. Regular ramp-up
  2. On-call overview
  3. Shadow shift
  4. Reverse shadow shift
  5. First solo shift

Let’s look into each of these steps in more detail.

Regular ramp-up

The regular ramp-up aims to help new team members familiarize themselves with the problems the team is solving and teach them how to work effectively in the team’s codebase. We want new colleagues to work on the code they will be responsible for when they are on call later. This approach allows them to acquire basic context that will be useful for maintaining this code and troubleshooting issues.

On-call overview

Regular ramp-up is rarely sufficient for new people to grasp the entire infra the team is responsible for. And knowing this infra is just the tip of the iceberg. There is much more an effective on-call needs to be familiar with, for instance:

  • what are the dependencies, and what is the impact of their failures
  • how to find dashboards and use them for debugging
  • where to find the documentation (e.g., runbooks)
  • expectations, e.g., is the on-call responsible for alerts raised outside working hours
  • how to do deployments and rollbacks
  • tools used to troubleshoot and fix issues
  • standard operating procedures
  • and more

On our team, we organize knowledge-sharing sessions that give new team members an overview of all these areas. We record these sessions to make revisiting unclear topics easy.

Shadow on-call shift

During the shadow on-call shift, the on-call-in-training (a.k.a. secondary on-call) shadows an experienced on-call (a.k.a. primary on-call). Both on-calls are subscribed to all tasks and alerts, but resolving issues is the primary on-call’s responsibility. The primary on-call is expected to show the secondary on-call how to deal with outages. This is usually limited to problems occurring during working hours. Finally, the primary on-call can ask the secondary on-call to handle non-critical tasks, providing guidance as needed.

Reverse shadow on-call shift

After the shadow shift, things get real: the on-call-in-training becomes the primary on-call. They are now responsible for handling all alerts, tasks, deployments, etc. However, they are not alone – an experienced on-call has their back during the entire shift.

We schedule shadow and reverse shadow shifts back-to-back. This way, everything the on-call-in-training learned during the first shift is fresh when they become the primary on-call.

First solo shift

Once shadowing is complete, we add the new team member to the on-call rotation. We put them at the end of the queue, giving them additional time to learn more about our systems and infrastructure.

In addition to training new on-calls, our team maintains a chat to discuss on-call problems and get help when resolving issues. Both new and experienced on-calls regularly use this chat when they are stuck because they know someone will be there to help them.

On-call Manual: Boost your career by improving your team’s on-call

I have yet to find a team maintaining critical systems that is happy with its on-call. Most engineers dread their on-call shifts and want to forget about on-call as soon as their shift ends. For some, hectic on-call shifts are the reason to leave the team or even the company.

But this is great news for you. All these factors make improving on-call a great career opportunity. Here are a few reasons:

  • Team-wide impact. Making the on-call better increases work satisfaction for everyone on the team.
  • Finding work is easy. No on-call is perfect. There’s always something to fix.
  • No competition. Most engineers consider on-call-related work uninteresting, so you can fully own the entire area. As a result, your scope might be bigger than that of any other development work you own.

Getting started

It is difficult to propose meaningful improvements to your team’s on-call before your first shift. You need to become familiar with your team’s on-call responsibilities and problems before trying to make it better.

Once you have a few shifts under your belt, you should know the most problematic areas. Come up with a few concrete actions to remedy the biggest issues. This list doesn’t have to be complete to get started. Some examples include tuning (or deleting) the noisiest alerts, refactoring fragile code, or automating time-consuming manual tasks.

Talk to your manager about the improvements you want to make. No manager who cares about their team would refuse the offer to improve the team’s on-call. If the timing is not right (e.g., your team is closing a big release), ask your manager when a better time would be. Mention that you may need their help to ensure the participation of all team members.

Set your expectations right. Despite the improvements, don’t expect your team members to suddenly start loving their on-call. It’s a win if they stop dreading it.

Execution

In my experience, the two most effective ways to improve the on-call are regular (e.g., twice-a-year) fixathons combined with ongoing maintenance.

During a fixathon, the entire team spends a few days fixing the biggest on-call issues. In most cases, these will be issues that have appeared since the previous fixathon but weren’t taken care of by on-calls during their shifts. You may need to work closely with your manager to ensure the entire team participates, especially at the beginning.

Ongoing maintenance involves fixing problems as they arise, usually done by the person on call. As some shifts are heavier than others, the on-call may not always be able to address all issues.

Your role

Before talking about what your role is, let’s talk about what your role isn’t.

Your role isn’t to single-handedly fix all on-call issues.

This approach doesn’t scale. If you try it, you will eventually burn out, struggling to do two full-time jobs simultaneously: your regular responsibilities and fixing on-call issues. The worst part is that your team members won’t feel responsible for maintaining the on-call quality. They might even care less because now somebody is fixing issues for them.

While you should still participate in fixing on-call issues, your main role is to:

  • organize fixathons – identify the most pressing issues, distribute them among the team to work on, track progress, and measure the improvement
  • ensure on-calls address the issues they encounter during their shifts
  • build tools – e.g., dashboards to monitor the quality of the on-call or queries that make it quick to identify the biggest problems

If you do this consistently, your team members will eventually find fixing on-call issues natural.

Skills you will learn

Driving on-call improvements will help you hone a few skills that are key for successful senior and even staff engineers:

  • leading without authority – as the owner of the on-call improvement area you’re responsible for coming up with the plan and leading its execution
  • scaling through others – because you involve the entire team, you can get much more done than if you did it yourself
  • influencing the engineering culture of the team – ingraining a sense of responsibility for the on-call quality in team members is an impactful change
  • holding people accountable – making sure everyone does their part is always a challenge
  • identifying problems worth solving – instead of being told what problems to solve, you are responsible for finding these problems and deciding if they are worth solving

Expanding your scope

Once you start seeing the results of your work, you can take it further to expand your scope.

You can become the engineer who manages the on-call rotation for your team. This work doesn’t take a lot of time but can save a lot of headaches for your manager. The typical responsibilities include:

  • managing the on-call schedule
  • organizing the onboarding of new team members to the on-call rotation
  • helping figure out shift swaps and substitutions

Another way to increase your scope is to share your experience with other teams. You can organize talks showing what you did, the results you achieved, and what worked and what didn’t. You can also generalize the tools you built so that other teams can use them.

On-call Manual: Measuring the quality of the on-call

Reasonable on-call is no accident. Getting there requires a lot of hard work. But how can you tell if you’re on the right track if the experience can completely change from one shift to another? One answer to this question is monitoring.

How does monitoring help?

At a high level, monitoring can tell you whether the on-call duty is improving, staying the same, or deteriorating over a longer period. Understanding the trend is important for deciding whether the current investment in keeping the on-call reasonable is sufficient.

At a more granular level, monitoring helps identify the areas that need attention the most, such as:

  • noisy alerts
  • problematic dependencies
  • features causing customer complaints
  • repetitive tasks

Continuously addressing the top issues will gradually improve the overall on-call experience.

What metrics to monitor

There is no one correct answer to which metrics to monitor; it depends a lot on what the team does. For example, frontend teams may choose to monitor the number of tickets opened by customers, while backend teams may want to focus more on the time spent fixing broken builds or failing tests. Here are some metrics to consider (a sketch showing how a few of the alert metrics can be computed follows the list):

  • outages of the products the team owns
  • external incidents impacting the products the team owns
  • the number of alerts, broken down by urgency
  • the number of alerts acted on vs. ignored
  • the number of alerts outside the working hours
  • time to acknowledge alerts
  • the number of tickets opened by customers
  • the number of internal tasks
  • build breaks
  • test failures
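
Most of the alert-related metrics above can be derived from an export of your alerting system’s history. As a sketch, assuming each record carries a hypothetical fired_at timestamp, an acked_at timestamp, and an urgency label, a short script can produce counts by urgency, out-of-hours counts, and the time to acknowledge:

    # Sketch: derive a few of the alert metrics listed above from an exported log.
    # Assumes each record has hypothetical fields: fired_at and acked_at (ISO timestamps)
    # and urgency ("critical"/"high"/"low").
    from collections import Counter
    from datetime import datetime
    from statistics import median

    WORKING_HOURS = range(9, 18)

    def summarize(alerts: list[dict]) -> dict:
        by_urgency = Counter(a["urgency"] for a in alerts)
        out_of_hours = sum(
            1 for a in alerts
            if datetime.fromisoformat(a["fired_at"]).hour not in WORKING_HOURS
        )
        minutes_to_ack = [
            (datetime.fromisoformat(a["acked_at"]) - datetime.fromisoformat(a["fired_at"])).total_seconds() / 60
            for a in alerts if a.get("acked_at")
        ]
        return {
            "alerts_by_urgency": dict(by_urgency),
            "alerts_out_of_hours": out_of_hours,
            "median_minutes_to_ack": median(minutes_to_ack) if minutes_to_ack else None,
        }

    # Two made-up records: one critical night alert, one low-urgency daytime alert.
    print(summarize([
        {"fired_at": "2024-01-10T03:15:00", "acked_at": "2024-01-10T03:40:00", "urgency": "critical"},
        {"fired_at": "2024-01-10T11:00:00", "acked_at": "2024-01-10T11:05:00", "urgency": "low"},
    ]))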

How to monitor?

On-call monitoring is difficult because there isn’t a single metric that can reflect the health of the on-call. My team uses both quantitative metrics (data) and qualitative metrics (opinions).

Quantitative metrics

Quantitative metrics can usually be collected from alerting systems, bug trackers, and task management systems. Here are a few examples of quantitative metrics we are tracking on our team:

  • the number of alerts
  • the number of tasks
  • the number of alerts outside the working hours
  • the noisiest alerts, tracked by alert ID

As quantitative metrics are collected automatically, we built a dashboard to show them in an easy-to-understand way. Keeping historical data allows us to track trends.

Qualitative metrics

Qualitative metrics are opinions about the shift from the person ending the shift. Using qualitative metrics in addition to quantitative metrics is necessary because numbers are sometimes misleading. Here is an example: handling a dozen tasks that can be closed almost immediately without much effort is easier than collaborating with a few teams to investigate a hard-to-reproduce customer report. However, considering only how many tasks each on-call got during their shift, the first shift appears heavier than the second.

On our team, each person going off-call fills out an On-call survey that is part of the On-call report. Here are some of the questions from the survey (a sketch for tracking the ratings over time follows the list):

  • Rate your on-call experience from 1 to 10 (1: easy, 10: horrible)
  • Rate your experience with resources available for resolving on-call issues (e.g., runbooks, documentation, tools, etc.) from 1 to 10 (1: no resources or very poor resources, 10: excellent resources that helped solve issues quickly)
  • How much time did you spend on urgent activities like alerts, fire fighting, etc. (0%-100%)?
  • How much time did you spend on non-urgent activities like non-urgent tasks, noise, etc. (0%-100%)?
  • Additional comments (free flow)
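
The survey answers become far more useful when they are kept over time. Below is a small sketch, with made-up data, that averages the two 1–10 ratings per quarter so the trend is visible at a glance; the record layout is an assumption rather than the actual report format.

    # Sketch: track the two 1-10 survey ratings over time to see the trend.
    # Each response is assumed to be a (shift_date, experience, resources) record,
    # where experience uses 1 = easy, 10 = horrible and resources uses 1 = poor, 10 = excellent.
    from collections import defaultdict
    from datetime import date
    from statistics import mean

    def quarterly_averages(responses):
        buckets = defaultdict(list)
        for day, experience, resources in responses:
            quarter = f"{day.year}-Q{(day.month - 1) // 3 + 1}"
            buckets[quarter].append((experience, resources))
        return {
            q: (mean(e for e, _ in vals), mean(r for _, r in vals))
            for q, vals in sorted(buckets.items())
        }

    # Made-up responses: experience ratings trending down (better), resources trending up.
    print(quarterly_averages([
        (date(2024, 1, 15), 7, 4),
        (date(2024, 2, 20), 5, 6),
        (date(2024, 4, 3), 4, 7),
    ]))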

We’ve been conducting this survey for a couple of years now. One interesting observation: it is not uncommon for a shift that was horrible for one person to be decent for someone else. Experienced on-calls usually rate their shifts as easier than developers who have just finished their first shift do. This is understandable. We still treat all opinions equally – improving the on-call quality for one person improves it for everyone.

The Additional comments question is my favorite as it provides insights no other metric can capture.

Call to Action

If being on-call is part of your team’s responsibilities and you don’t monitor it, I highly encourage you to start doing so. Even a simple monitoring system will tell you a lot about your on-call and allow you to improve it by addressing the most annoying issues.