On-call Manual: How to cope with on-call anxiety

On-call anxiety is real. I’ve been there, and I know engineers who experienced it. Many factors contribute to it, but from my experience, three stand out.

1. Unpredictability

Unpredictability is the number one reason for on-call anxiety. You might be responsible for a wide range of services. They may break anytime for various reasons like network issues, deployments, failing dependencies, shared infra outages, data center drains (a.k.a. storms), excavators damaging etc. On-calls, especially new ones, worry they won’t know what to do if they get an alert. How would they figure out what broke? How would they come up with a fix?

What to do about it?

With experience, the unpredictability aspect of the on-call gets easier. But even for the most seasoned on-call engineers, handling an outage can be difficult without the proper tools like:

  • Easy-to-navigate dashboards that allow to tell quickly if a service is working correctly and identify problematic areas in case of failures
  • Playbooks (a.k.a. runbooks) explaining troubleshooting and mitigation steps
  • Documentation describing the service and its dependencies, including the relevant on-call rotations to reach out if necessary

Having a team eager to jump in and help mitigate an outage quickly is priceless. Your team members understand some areas better than you. Knowing they have your back is reassuring.

2. Too many alerts and incidents

The second most common reason engineers fear their on-call is a never-ending litany of alerts, requests, and tasks. If you get a new alert when you barely finished acknowledging a previous one and are also expected to handle customer tickets and deal with requests from other teams, fretting your on-call is understandable. The exhaustion is usually exacerbated by the feeling of not doing a decent job. I was on a rotation like this once. After a while, I realized that everyone, not only me, was overwhelmed. Even though we toiled long hours, most alerts were ignored, customer tickets remained answered, and requests from other teams were only handled after they escalated them to the manager.

What to do about it?

There is no way a single person can fix a very heavy on-call by themselves. They won’t have the time during their shift, and by the time the shift ends, they will be so fed up that they won’t want to hear about anything on-call-related. There are, however, a few low-hanging fruits that can help improve the quality of the on-call quickly:

  • Delete alerts – find routinely ignored alerts and determine if they’re useful. If they aren’t – delete them.
  • Tune noisy but useful alerts – adjust thresholds and windows for flapping alerts, alerts that fire prematurely, and short-lived alerts.
  • Get a secondary on-call – a second person could help handle tasks the primary on-call does not have the capacity to deal with (e.g., customer tickets). This could be only temporary.

These ideas can alleviate on-call pain but are unlikely to fix a bad on-call for good. Improving a heavy on-call requires identifying and addressing problems at their source and demands effort from the entire team to maintain on-call quality. I wrote a post dedicated to this topic. Take a look.

3. Middle-of-the-night alerts

Many on-call rotations are 24/7. The on-call is responsible for dealing with incidents promptly, even if they happen in the middle of the night. Waking to an alert is not fun, and if it happens regularly, it is a valid reason to resent being on-call.

What to do about it?

While it may not be possible to avoid all middle-of-the-night alerts, there might be some actions you can take to reduce the disruption. A lot will depend on your specific situation, but here are some ideas:

  • Check your dashboards in the evening and address any issues that could raise an alert.
  • Increase alert thresholds outside working hours. If your traffic is cyclical – e.g., you have much lower traffic at night because most requests come from one timezone – you may be able to relax thresholds outside working hours. Even if an incident happens, its impact will be smaller. Also, alerts get much noisier if the traffic volume is low (e.g., if you get ten requests in an hour and one fails, you might get an alert due to a 10% error rate).
  • Disable alerts at night. Some outages won’t cause any impact unless they last for a long time. For instance, our team was responsible for a service that would work fine even if one of its dependencies was down for a day. This 24-hour grace period allowed us to turn off alerts at night.

How to improve your coding skills (without spending a lot of time)

Every developer wants to get better at coding. But they sometimes don’t know how and feel stuck. The good news is that there are a few simple ways to improve coding skills that many developers neglect. Most can be done as part of a regular job, so they don’t require additional time. And doing them consistently can make you a better coder.

Code reviews

Code reviews are the easiest way to learn from others. Unfortunately, many developers treat code reviews as a chore. They want to spend as little time as possible on them. This way, they lose great learning opportunities.

What makes code reviews special is their interactivity. They allow asking the author about their choices and discuss alternatives they considered. Other participants often leave interesting comments or propose novel alternative solutions.

Reading other developers’ code

I learned a lot about programming by reading code written by other developers. I often want to know how a feature or library I use works. Usually, the fastest way to get this information is by inspecting its code.

For instance, TypeScript supported some features, e.g., async/await, before they were added to JavaScript. I was curious how it was possible. So, I wrote short TypeScript snippets and checked how they were transpiled.

Your company’s codebase is another great resource to learn from. When I get stuck, I often search my company repository to check how other developers solved a similar problem. Our repo is big, so if I can’t find anything useful, I am almost sure what I am trying to do is questionable.

I use the same strategy for my side projects but search GitHub.

If you find reading other developers’ code challenging (I sure did), take a look at this post.

Debugging

Stepping through the code, analyzing the stack trace, and inspecting variables will allow you to understand important nuances and easy-to-miss details much deeper. I often fire a debugger if I can’t answer all my questions after reading the code.

“Borrowing” code

Let’s be honest. Not all code needs to be written from scratch. Sometimes, we just need a boilerplate. But sometimes, we don’t know how to solve a problem. In these cases, copying and adapting code is often faster (and easier). It could be the code you wrote in the past or someone else’s code, e.g., copied from StackOverflow (I have yet to find a software developer who never copied code from StackOverflow.) AI-powered programming tools are built on this idea. Tools like Github Copilot ask you to constantly vet and adapt the code they generate. Here is the thing, though. You’ll learn nothing if you don’t try to understand why the code you copied works or can’t correctly adapt it.

Programming contests

Advent of Code taught me a lot. It is a light programming contest that takes place every December and consists of a series of small programming puzzles that can be solved in any programming language. I find it an excellent way to keep my coding skills sharp.

Solving Advent of Code problems is a good exercise, but examining other participants’ solutions is where the real learnings are. And quite frankly, it can be a humbling experience. The different ways and techniques the participants use to solve the problems can be astonishing. I remember being proud of my ultra-short, 30-line-long solution, only to see someone else solve the same problem in the same programming language with just two lines of code because they used a clever idea.

What it is like to work in Meta’s (Facebook’s) monorepo

I love monorepos! Or at least I love Meta’s (Facebook’s) monorepo, which happens to be the only monorepo I have ever worked with. Here is why.

Easy access to all code

Meta’s monorepo contains most of the company’s code. Any developer working at Meta has access to it. We can search it, read it, and check the commit history. We also can, and frequently do, modify code code managed by other teams.

This easy access to all the code is great for the developer’s productivity. Engineers can understand their dependencies deeper, debug issues across the entire stack, and implement features or bug fixes regardless of who manages the code. All this is available at their fingertips. They can hit the ground quickly without talking to other teams, reading their out-of-date wikis, and spending time figuring out how to clone and build their code.

Linear commit history

Meta’s monorepo does not use branches, so the commit history is linear. The linear commiit history saves engineers from having to reverse engineer a London Tube Map-like merge history to determine if a given commit’s snapshot contains their changes. With linear commit history, answering this question boils down to comparing commit times.

No versioning

Versioning is one of the most complex problems when working with multiple repos. Each repo is independent, and teams are free to decide which versions of dependencies they want to adopt. However, because each repo evolves at its own pace, different repos will inevitably end up with different versions of the same package. These inconsistencies lead to situations where a project may contain more than one version of the same dependency, but no single version works for everyone.

I experienced this firsthand during my time at Amazon. I was working on the Alexa app, which consisted of tens of packages, each pulling in at least a few dependencies. It was a versioning hell: conflicts were common, and resolving them was difficult. For example – one package used an older dependency because a newer version contained a bug. Another package, however, required the latest version because older versions lacked the needed features.

A monorepo solves versioning issues in a simple way: there is no versioning. All code is built together, so each package or project has only one version for a given commit.

Atomic commits

Monorepos allow atomic cross-project commits. Developers can rename classes or change function signatures without breaking code or tests. They just need to fix all the code affected by their change in the same commit.

Doing the same is impossible in a multi-repo environment. Introducing breaking changes is either safe but slow (as it requires multiple commits for a proper migration) or fast but at the expense of broken builds.

This problem plagued the ASP.NET Core project in its early days (ProjectK anyone?). The team was working on getting abstractions right, so the foundational interfaces constantly changed. Many packages (each in its repo) implemented or used these interfaces. Whenever they changed, most repos stopped compiling and needed fixes.

Build

Builds in monorepos are conceptually simple: all code in the repo is built at a given commit.

This approach makes it possible to quickly tell what’s included in the build and create bundles where all build artifacts match.

While the idea is simple, building the entire monorepo becomes increasingly challenging as the repository grows. Compiling big monorepos, like Meta’s, in a reasonable time is impossible without specialized build tools and massive infra.

Multiple repos make creating a list of matching packages surprisingly hard. I learned this when working on ASP.NET Core. The framework initially consisted of a couple of dozen of repos. Our build servers were constantly grinding because of what we called “build waves.” A build wave was initiated by a single commit that triggered a build. When this build finished, it triggered builds in repos depending on it. This process continued until all repos were built. Not only was this process slow and fragile, but with a steady stream of commits across all the repos, producing a list of matching packages was difficult.

The ASP.Net Core team eventually consolidated all the code in a single repository adopting the monorepo approach. This change happened after I left the team, but I believe the challenges behind getting fast and consistent builds were an important reason.

What are the problems with monorepos?

If monorepos are so great, why isn’t everyone using them? There are several reasons.

Scale

Scale poses the biggest challenge for monorepos. Meta’s repository is counted in terabytes and receives thousands of commits each day. Detecting conflicts and ensuring that all changes are merged correctly and don’t break the build without hurting developers’ productivity is tough. As most off-the-shelf tools cannot handle this scale, Meta has many dedicated teams that maintain the build infrastructure. Sometimes, they need to go to great lengths to do their job. Here is an example:

Back in 2013, tooling teams ran a simulation that showed that in a few years, basic git commands would take 45 minutes to execute if the repo continued to grow at the rate it did. It was unacceptable, so Facebook engineers turned to Git folks to solve this problem. At that time, Git was uninterested in modifying their SCM (Source Code Management) to support such a big repo. The Mercurial (hg) team, however, was more receptive. With significant contributions from Facebook, it rearchitected Mercurial to meet Facebook’s requirements. This is why Meta (a.k.a. Facebook) uses Mercurial (hg) as its source control.

Granular project permissions

Monorepos make accessing any code in the repository easy, which is great for developers’ productivity. However, companies often have sensitive code only selected developers should be able to access. This requirement goes against the idea of the monorepo, which aims to make all code easily accessible. As a result, enforcing access to code in a monorepo is problematic. Creating separate repos for sensitive projects is also not ideal, especially if these projects use the common infrastructure the monorepo provides for free.

Release management

A common strategy to maintain multiple releases is to create a branch for each release. Follow-ups (e.g., bug fixes) can be merged to these branches without bringing unrelated changes that could destabilize the release. This strategy won’t work in monorepos with a linear history.

I must admit that I don’t know how teams that ship their products publicly handle their releases. Our team owns a few services we deploy to production frequently. If we find an issue, we roll back our deployment and fix the bug forward.

A single commit can break the build

Because for monorepos, the entire codebase is built at a given commit, merging a mistake that causes compilation errors will break the build. These situations happen despite the tooling that is supposed to prevent them. In practice, this is only rarely a problem. Developers are only affected if the project that doesn’t compile is a dependency. And even then, they can workaround the problem by working off of an older commit until the breakage is fixed.

On-call Manual: Boost your career by improving your team’s on-call

I have yet to find a team maintaining critical systems that is happy with its on-call. Most engineers dread their on-call shifts and want to forget about on-call as soon as their shift ends. For some, hectic on-call shifts are the reason to leave the team or even the company.

But this is great news for you. All these factors make improving on-call a great career opportunity. Here are a few reasons:

  • Team-wide impact. Making the on-call better increases work satisfaction for everyone on the team.
  • Finding work is easy. No on-call is perfect. There’s always something to fix.
  • No competition. Most engineers consider work related to on-call uninteresting, so you can fully own the entire area. As a result, your scope might be bigger than any other development work you own.

Getting started

It is difficult to propose meaningful improvements to your team’s on-call before your first shift. You need to become familiar with your team’s on-call responsibilities and problems before trying to make it better.

Once you have a few shifts under your belt, you should know the most problematic areas. Come up with a few concrete actions to remedy the biggest issues. This list doesn’t have to be complete to get started. Some examples include tuning (or deleting) the noisiest alerts, refactoring fragile code, or automating time-consuming manual tasks.

Talk to your manager about the improvements you want to make. No manager who cares about their team would refuse the offer to improve the team’s on-call. If the timing is not right (e.g., your team is closing a big release), ask your manager when a better time would be. Mention that you may need their help to ensure the participation of all team members.

Set your expectations right. Despite the improvements, don’t expect your team members to suddenly start loving their on-call. It’s a win if they stop dreading it.

Execution

From my experience, the two most effective ways to improve the on-call is to have regular (e.g., twice a year) fixathons combined with ongoing maintenance.

During a fixathon, the entire team spends a few days fixing the biggest on-call issues. In most cases, these will be issues that started occurring since the previous fixathon but weren’t taken care of by on-calls during their shifts. You may need to work closely with your manager to ensure the entire team’s participation, especially at the beginning.

Ongoing maintenance involves fixing problems as they arise, usually done by the person on call. As some shifts are heavier than others, the on-call may not always be able to address all issues.

Your role

Before talking about what your role is, let’s talk about what your role isn’t.

Your role isn’t to single-handedly fix all on-call issues.

This approach doesn’t scale. If you try it, you will eventually burn out, struggling to do two full-time jobs simultaneously: your regular responsibilities and fixing on-call issues. The worst part is that your team members won’t feel responsible for maintaining the on-call quality. They might even care less because now somebody is fixing issues for them.

While you should still participate in fixing on-call issues, your main role is to:

  • organize fixathons – identify the most pressing issues and distribute issues for the team to work on, track progress, and measure the improvement
  • ensure on-calls are addressing issues they encountered during their shifts
  • build tools – e.g., dashboards to monitor the quality of the on-call or queries that allow to identify the biggest problems quickly

If you do this consistently, your team members will eventually find fixing on-call issues natural.

Skills you will learn

Driving on-call improvements will help you hone a few skills that are key for successful senior and even staff engineers:

  • leading without authority – as the owner of the on-call improvement area you’re responsible for coming up with the plan and leading its execution
  • scaling through others – because you involve the entire team, you can get much more done than if you did it yourself
  • influencing the engineering culture of the team – ingraining a sense of responsibility for the on-call quality in team members is an impactful change
  • holding people accountable – making sure everyone does their part is always a challenge
  • identifying problems worth solving – instead of being told what problems to solve, you are responsible for finding these problems and deciding if they are worth solving

Expanding your scope

Once you start seeing the results of your work, you can take it further to expand your scope.

You can become the engineer who manages the on-call rotation for your team. This work doesn’t take a lot of time but can save a lot of headaches for your manager. The typical responsibilities include:

  • managing the on-call schedule
  • organizing onboarding new team members to the on-call rotation
  • helping figure out shift swaps and substitutions

Another way to increase your scope is to share your experience with other teams. You can organize talks showing what you did, the results you achieved, and what worked and what didn’t. You can also generalize the tools you built so that other teams can use them.

Generating Ideas and Driving them to Completion

It is impossible to achieve a successful and fulfilling career in Software Engineering only by following someone else’s orders. A significant part of taking the lead is coming up with ideas to innovate and move the team or the company forward.

The two models of idea generation

Over the years, I’ve witnessed companies using two main ways to generate ideas: on-demand and organic.

On-demand idea generation

The on-demand idea generation model works as follows: the manager shows up out of nowhere and demands, “We need some ideas for X!” They organize a brainstorming meeting where the team members try to invent some ideas. After the meeting, everyone returns to their work, feeling they have fulfilled their idea generation duty until the next time.

My experience with on-demand idea generation has been mixed. This setting usually seeks big ideas, which are generally quite hard to come up with on the spot and under pressure. In the end, only a few ideas are proposed, and barely any are implemented.

The most important reason why on-demand idea generation is ineffective is that most interesting ideas strike at unexpected times and not during a scheduled meeting.

Organic idea generation

Organic idea generation is the opposite of the on-demand model. Here, the ideas stem from observations made when working on daily tasks:

  • writing or reviewing code
  • investigating issues reported by users
  • struggling with tools or infrastructure
  • mitigating incidents
  • discussing issues with co-workers

All these activities are great opportunities to identify problems and propose improvements.

One of the biggest advantages of organic idea generation is that it is a continuous process. As a result, it allows for the generation of many ideas.

Most ideas generated organically are small: refactor some code, add test coverage, or fix a non-critical but annoying bug. Some are medium, e.g., redesigning a component for better extensibility. Occasionally, you will stumble upon a big idea that may lead to revamping your entire architecture and unlocking previously unthinkable possibilities.

Executing ideas

Even the best idea is not worth much if not acted upon. However, careless execution may have negative consequences. For example, failing to deliver a promised feature on time due to working on unplanned and non-critical refactoring is hard to justify. Here is my approach to avoiding these problems.

I start by noting the idea in my work log. This way, I rest assured that I won’t forget about it and will consider it when planning my work for the next week. Implementing small ideas is usually a matter of finding time to work on it. If I don’t have the bandwidth, I may ask a fellow developer working on related code to pick it up or use it to ramp up a new team member.

Medium and big ideas require more thinking. When planning my week, I block a couple of hours to write a one-pager describing the idea in more detail. Writing allows me to get more clarity on my idea, understand its feasibility, and weigh the costs and benefits.

Most ideas never reach the execution stage. Some are just not great, and external circumstances may block others. Over time, when these circumstances change, an infeasible idea may become viable. I recently revived an idea I had a year ago when I learned that our partner team had fixed a long-standing issue in their system.

I share ideas I believe are worth pursuing with my team and my manager to gather feedback. Depending on this feedback, I either shelve the idea or continue working on it until completed, often with other teammates.

But there is one more step after successful implementation: spreading the word about what we did, how we did it, and who made it possible. This step is especially important for bigger projects that take a while to implement and involve other team members. Everyone who contributed deserves to get the credit.

Conclusion

The most successful software developers generate many ideas because they understand that only some will come to fruition. But ideas are only the first step. The key is execution. Successfully executing an idea, letting the right people know, and sharing the credit is a huge career booster.