On-call Manual: Handling Incidents

If your on-call is perfect and you can’t wait for your next shift, you can stop reading now.

But, in all likelihood, this isn’t the case.

Rather, you feel exhausted by dealing with tasks, incidents, and alerts that often fire after hours, and you constantly check the clock to see how much time is remaining in your shift.

The on-call can be overwhelming. You may not know all the systems you are responsible for very well. Often, you will get tasks asking you to do things you have never done before. You might be pulled into investigating outages of infrastructure owned by other teams.

I know this very well – I have been participating in on-call for almost seven years, and I decided to write a few posts to share what I learned. Today, I would like to talk about handling incidents.

Handling Incidents ⚠️

Unexpected system outages are the most stressful part of an on-call shift. A sudden alert, sometimes in the middle of the night, can be a source of distress. Here are the steps I recommend following when dealing with incidents.

1. Acknowledge

Acknowledging alerts is one of the most important on-call responsibilities. It tells people that the on-call is aware of the problem and is working on resolving it. In many companies, alerts will be escalated up the management chain if not acknowledged promptly.   

2. Triage

Triaging means assessing the issue’s impact and urgency and assigning it a priority. This process is easier when nothing else is going on. If there are other active alerts, it is crucial to understand whether the new alert is related to them. If it is not, the on-call needs to decide which alerts are more important.

3. Troubleshoot

Troubleshooting is, in my opinion, the most difficult task when dealing with alerts. It requires checking and correlating dashboards, logs, code, output from diagnostic tools, etc., to understand the problem. All this happens under huge pressure. Runbooks (a.k.a. playbooks) with clear troubleshooting steps and remediations make troubleshooting easier and faster. 

4. Mitigate

Quickly mitigating the outage is the top priority. While it may sound counterintuitive, understanding the root cause is not the goal at this point and is often unnecessary for implementing an effective mitigation. Here are some common mitigations:

  • Rolling back a deployment – outages caused by deployments can be quickly mitigated by rolling back the deployment to the previous version.
  • Reverting configuration changes – problems caused by configuration changes can be fixed by reverting these changes. 
  • Restarting a service – allowing the service to start from a clean state can fix entire classes of problems. One example could be leaking resources: a service sometimes fails to close a database connection. Over time, it exhausts the connection pool and, as a result, can’t connect to the database. 
  • Temporarily stopping a service – if a service is misbehaving, e.g., corrupting or losing data due to a failing dependency, temporarily shutting it down could be a good way to stop the bleeding.
  • Scaling – problems resulting from surges in traffic can be fixed by scaling the fleet.

5. Ask for help

Many on-call rotations cover multiple services, and it may be impossible to be an expert in all of them. Turning to a more knowledgeable team member is often the best thing to do during an incident. Getting help quickly is especially important for urgent issues with a big negative impact. You should also ask for assistance when you are dealing with multiple simultaneous issues or cannot keep up with incoming tasks.

6. Root cause

The root cause of an outage is often found as a side effect of troubleshooting. When this is not the case, it is essential to identify it once the outage has been mitigated. Failing to do so will make preventing future outages caused by the same issue impossible.

7. Prevention

The final step is to implement mechanisms that prevent similar outages in the future. Often, this requires fixing team culture or a process. For example, if team members regularly merge code despite failing tests, an outage is bound to happen.

I use these steps for each critical alert I get as an on-call, and I find them extremely effective.

The self-inflicted pain of premature abstractions

Premature abstraction occurs when developers try making their code very general without a clear need. Examples of premature abstraction include:

  • Creating a base class (or interface) even though there is only one known specialization/implementation (see the sketch after this list)
  • Implementing a more general solution and using it for one purpose, e.g., coding the visitor pattern only to check if a value exists in a binary search tree
  • Building a bunch of microservices for an MVP (Minimum Viable Product) application serving a handful of requests per minute
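
To make the first item concrete, here is a minimal, hypothetical sketch in C# – the types are invented purely for illustration. The “abstract” version adds an interface and a factory for a single known implementation; the simple version is all the code that is actually needed:

```csharp
// Hypothetical types, invented only to illustrate the pattern.

// Premature: an interface and a factory for a single known implementation.
public interface IGreetingStrategy
{
    string BuildGreeting(string name);
}

public sealed class DefaultGreetingStrategy : IGreetingStrategy
{
    public string BuildGreeting(string name) => $"Hello, {name}!";
}

public static class GreetingStrategyFactory
{
    public static IGreetingStrategy Create() => new DefaultGreetingStrategy();
}

// Simple: all the code that is actually needed today.
public static class Greeter
{
    public static string Greet(string name) => $"Hello, {name}!";
}
```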

I have seen many mid-level and even senior software developers, myself included, fall into this trap. The goal is always noble: to come up with a clean, beautiful, and reusable architecture. The result? An unnecessarily complex mess that even the author cannot comprehend and which slows down the entire team.

Why is premature abstraction problematic?

Abstractions added before they are needed create needless friction because they make the code more difficult to read and understand. This, in turn, increases the time to review code changes and risks introducing bugs simply because the code was misunderstood. Implementing new features takes longer. Performance may degrade, thorough testing is hard to achieve, and maintenance becomes a burden.

Abstractions created when only one use case exists are almost always biased toward this use case. Adding a second use case to fit this abstraction is often only possible with serious modifications. As the changes can’t break the first use case, the new “abstraction” becomes an awkward mix of both use cases that doesn’t abstract anything.

With each commit, the abstraction becomes more rooted in the product. After a while, it can’t be removed without significantly rewriting the code, so it stays there forever and slows the team down.

I witnessed all these problems firsthand when, a few years ago, I joined a team that owned an important functionality in a popular mobile app. At that time, the team was migrating their code to React Native. One of the foundations of this migration was a workflow framework, implemented by a couple of team members and inspired by Operations from Apple’s Foundation framework. When I joined the team, the workflow framework was a few weeks late but “almost ready.” It took another couple of months before it was possible to start using it to implement simple features. Only then did we find out how difficult it was! Even a simple functionality like sending an HTTP request required writing hundreds of lines of code. Simple features took weeks to finish, especially since no one was willing to invest their time reviewing huge diffs.

One of the framework’s features was “triggers,” which could invoke an operation automatically if certain conditions were satisfied. These triggers were a source of constant performance issues as they would often unexpectedly invoke random operations, including expensive ones like querying the database. Many team members struggled to wrap their heads around this framework and questioned why we needed it. Writing simple code would have been much easier, faster, and more enjoyable. After months of grinding, many missed deadlines, and tons of functional and performance issues, something had to be done. Unfortunately, it turned out that removing the framework was not an option. Not only did “the team invest so much time and effort in it,” but we also released a few features that would have to be rewritten. Eventually, we ended up reducing the framework’s usage to the absolute minimum for any new work.

What to do instead?

It is impossible to foresee the future, and adding code because it might be needed later rarely ends well. Instead, write simple code, follow the SOLID principles, and maintain good test coverage. This way, you can add new abstractions later, when you actually need them, without introducing regressions and breaking your app.

Prioritize bugs like a boss

Me at my first job: “A bug? Oh, no! We MUST fix it!”

Me at Microsoft: “A bug? Oh, no! We CAN’T fix it!”

Me now: “A bug? Let’s talk!”

When I started my career over twenty years ago, I found dealing with bugs easy: any reported defect had to be fixed as soon as possible. This approach could work because our company was small, and the software we built was not complex by today’s standards.

My early years at Microsoft taught me the opposite: fixing any bug is extremely risky and should not be taken lightly. At that time, I worked on the .NET Framework (the one before .NET Core), which had an extremely high backward compatibility bar. The reason for this was simple: .NET Framework was a Windows component used by thousands of applications. Updates, often included in Windows Service Packs, were installed in place. Any update that changed the .NET Framework behavior could silently break users’ applications. As a team, we spent more time weighing the risk of fixing bugs than fixing them.

Both of these situations were extremes and wouldn’t be possible today. As software complexity has skyrocketed and users can choose from many alternatives, dealing with bugs has become much more nuanced. Impaired functionality is one aspect to consider when prioritizing a bug fix, but not always the most important one. Here are the most common criteria to look at when triaging bugs.

Is it a bug?

While in most situations there is no doubt that a reported issue is a bug, this is not always the case. When I was on the ASP.NET Core team, users sometimes reported bugs because they expected or preferred an API to behave differently. That did not mean, however, that these issues were valid bugs. For example, if you are building a spec-compliant HTTP server, you can’t fix the typo in the [Referer HTTP header](https://en.wikipedia.org/wiki/HTTP_referer), regardless of how many bug reports asking you to do so you receive.

Security

Security bugs and vulnerabilities can lead to unauthorized access to sensitive data, financial losses, or operational disruptions. As they can also be exploited to deploy malware and infiltrate company networks, fixing security bugs is almost always the highest priority.

Regulatory and Compliance

Regulatory and Compliance bugs are another category of high-priority bugs. Even if they don’t significantly impact the functionality, they may have serious legal and financial consequences.

Privacy

Bugs that lead to the disclosure of sensitive data are treated very seriously, and fixing them is always a top priority. In the U.S., the law requires that businesses and government agencies report data breaches.

Business Impact

Business impact is an important aspect of determining a bug’s priority. Bugs that impact company revenue or other key business metrics will almost always be a higher priority than bugs that do not impact the bottom line.

I worked at Amazon during the 2018 Prime Day. Due to extremely heavy traffic, the website experienced issues for hours, making shopping impossible. Bringing the website back to life was the company’s top priority that day, followed by months of bug fixing and reliability improvements.

Functional Impact

Impact on functionality is usually the first thing that comes to mind when hearing the word “bug.” Rightly so! Functionality limited due to bugs leaves users extremely frustrated. Even small issues can lead to increased customer support tickets, customer loss, or damage to the company’s reputation.

Timing

Timing could be an important factor in deciding whether or not to fix a bug. Hasty bug fixes merged just before releasing a new version of a product can destabilize it and block the release. Given the pressure and shortened validation time, assessing the severity of these bugs and the risk of the fixes is crucial. In many companies, bugs reported in the last days before a major release get a lot of scrutiny, and fixing them may require the approval of a Director or even a VP.

The cost and difficulty of fixing the bug

Sometimes, bugs don’t get fixed because of the high cost. Even serious bugs may be punted on for years if fixing them requires rewriting the product or has significant undesirable side effects. Interestingly, users often get used to these limitations over time and learn to live with them.

The impact of fixing a bug

Every bug fix changes the software’s behavior. While the new behavior is correct, users or applications may rely on the old, incorrect behavior, and modifying it might be disruptive. In the case of the .NET Framework, many valid bugs were rejected because fixing them could break thousands of applications.

The extreme case of a bugfix that backfired spectacularly was when our team fixed a serious bug in one of the Microsoft Windows libraries. The fix broke a critical application of a big company, an important Microsoft customer. The company claimed that fixing and deploying the application on their side was not feasible. As we didn’t want to (and couldn’t) revert the fix that shipped to hundreds of millions of PCs worldwide, we had to re-introduce the bug to bring back the old behavior and gate it behind a key in the Windows registry.

Is there a workaround?

When triaging a bug, it is worth checking if it has an acceptable workaround. Even a cumbersome workaround is better than being unable to do something because of a bug. A reasonable workaround often reduces the priority of fixing a bug.

When was the last time you used this? – Part 2: Algorithms

In the previous post, I reviewed the most common data structures and reflected on how often I have used them as a software engineer over the past twenty or so years. In this post, I will do the same for the best-known algorithms.

Searching

Searching for an item in a collection is one of the most common operations every developer faces daily. It can be done by writing a basic loop or using a built-in function like indexOf in Java or First in C#.

If the elements in a collection are sorted, it is possible to use Binary Search to find an item much faster. Binary search is a conceptually simple algorithm that is notoriously tricky to implement. As noted by Jon Bentley in the Programming Pearls book: “While the first binary search was published in 1946, the first binary search that works correctly for all values of n did not appear until 1962.” Ironically, the Binary Search implementation Bentley included in his book also had a bug – an overflow error. The same error was present in the Java library implementation for nine years before it was corrected in 2006. I mention this to warn you: please refrain from implementing binary search. All mainstream languages offer binary search either as a method on the array type or as part of the standard library, and it is easier and faster to use these implementations.

Fun fact: even though I know I should never try implementing binary search on the job, I had to code it in a few of my coding interviews. In one of them, the interviewer was intrigued by my midpoint computation. I got “additional points” for explaining the overflow error and telling them how long it took to spot and fix it in the Java library.
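
For illustration, here is roughly what that midpoint issue looks like, sketched in C# (the exact form of the Java fix differed slightly, but the idea is the same):

```csharp
using System;

int[] sorted = { 1, 3, 5, 8, 13 };
int low = 0, high = sorted.Length - 1;

// The classic bug: for very large arrays, low + high can overflow int
// and produce a negative midpoint.
int buggyMid = (low + high) / 2;

// An overflow-safe alternative:
int safeMid = low + (high - low) / 2;

// In everyday code, just call the library, e.g. in C#:
int index = Array.BinarySearch(sorted, 8);   // returns 3
```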

Sorting

Sorting is another extremely common operation. While it is interesting to implement different sorting algorithms as an exercise, almost no developer should be expected or tempted to implement sorting as part of their job. There is no need to reinvent the wheel when language libraries already contain excellent implementations.

Sometimes, it is better to use a data structure that stores items in sorted order instead of sorting them after the fact.
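
For example, in C# (a minimal sketch; other languages have similar containers):

```csharp
using System;
using System.Collections.Generic;

// Sorting after the fact:
var numbers = new List<int> { 5, 1, 4, 2 };
numbers.Sort();                                   // [1, 2, 4, 5]

// Keeping items sorted as they arrive:
var sorted = new SortedSet<int> { 5, 1, 4, 2 };   // tree-based, stays ordered
foreach (var n in sorted)
    Console.Write($"{n} ");                       // prints 1 2 4 5
```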

Bitwise operations

Even though the level of abstraction software developers work at has increased and bitwise operations have become rarer, I still deal with them frequently. What I do, however, is no longer the Hacker’s Delight level of hackery. Rather, it is as simple as setting a bit or checking if a given bit (a flag) is set. Software developers working closer to the metal or on networking protocols use bitwise operations all the time.
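
A minimal C# sketch of the kind of flag manipulation I mean (the flag names are invented):

```csharp
// Hypothetical flag values; each one occupies a single bit.
const int ReadFlag  = 1 << 0;   // 0b001
const int WriteFlag = 1 << 1;   // 0b010

int permissions = 0;

permissions |= WriteFlag;                         // set a bit
bool canWrite = (permissions & WriteFlag) != 0;   // check whether a bit is set
permissions &= ~WriteFlag;                        // clear a bit
```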

Recursion

Mandatory joke about recursion: To understand recursion, you must first understand recursion

I used recursive algorithms multiple times on every job I’ve had. Recursion allows solving certain classes of problems elegantly and concisely. It is also a common topic asked in coding interviews.

Backtracking

The backtracking algorithm recursively searches the solution space by incrementally building a candidate solution. A classic example is a Sudoku solver, which tries to populate an empty cell with a valid digit and then fills the remaining cells in the same manner. If the chosen digit doesn’t lead to a solution, the solver tries the next valid digit and continues until a solution is found or all possibilities are exhausted.
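
A condensed C# sketch of such a solver might look like this (not production code, just the shape of the recursion and the undo step):

```csharp
static class SudokuSolver
{
    // Returns true if the board was filled in completely; 0 marks an empty cell.
    public static bool Solve(int[,] board)
    {
        for (int row = 0; row < 9; row++)
        {
            for (int col = 0; col < 9; col++)
            {
                if (board[row, col] != 0) continue;      // cell already filled

                for (int digit = 1; digit <= 9; digit++)
                {
                    if (!IsValid(board, row, col, digit)) continue;

                    board[row, col] = digit;             // try a candidate
                    if (Solve(board)) return true;       // recurse into the rest of the board
                    board[row, col] = 0;                 // undo and try the next digit
                }
                return false;                            // no digit fits here: backtrack
            }
        }
        return true;                                     // no empty cells left: solved
    }

    private static bool IsValid(int[,] board, int row, int col, int digit)
    {
        for (int i = 0; i < 9; i++)
        {
            if (board[row, i] == digit || board[i, col] == digit) return false;
            int r = 3 * (row / 3) + i / 3, c = 3 * (col / 3) + i % 3;
            if (board[r, c] == digit) return false;      // same 3x3 box
        }
        return true;
    }
}
```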

So far, I haven’t used backtracking at my job but I used it to solve some programming puzzles. I found that backtracking can become impractical quickly due to a combinatorial explosion.

Depth-First Search

Depth-First Search (DFS) is an algorithm for traversing trees and graphs. Because I frequently work with trees, I have used DFS (the recursive version) often, and I have never had to implement the non-recursive version.
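
The recursive version is only a few lines; here is a minimal C# sketch with a hypothetical Node type:

```csharp
using System;
using System.Collections.Generic;

// A hypothetical n-ary tree node.
class Node
{
    public string Value = "";
    public List<Node> Children = new();
}

static class TreeTraversal
{
    // Pre-order DFS: handle the node first, then recurse into each subtree.
    public static void Dfs(Node node, Action<Node> visit)
    {
        visit(node);
        foreach (var child in node.Children)
            Dfs(child, visit);
    }
}

// Usage: TreeTraversal.Dfs(root, n => Console.WriteLine(n.Value));
```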

DFS is a good bet for interview questions that involve trees.

Breadth-First Search

Breadth-First Search (BFS) is another very popular algorithm for traversing graphs. It has many applications but is probably best known for finding the shortest path between two nodes in a graph.
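
A minimal C# sketch of BFS shortest-path search on an unweighted graph (represented here as an adjacency list) could look like this:

```csharp
using System.Collections.Generic;

static class Graphs
{
    // Returns the length of the shortest path from start to target, or -1 if unreachable.
    // The graph is an adjacency list; edges are unweighted.
    public static int ShortestPath(Dictionary<int, List<int>> graph, int start, int target)
    {
        var queue = new Queue<(int Node, int Distance)>();
        var visited = new HashSet<int> { start };
        queue.Enqueue((start, 0));

        while (queue.Count > 0)
        {
            var (node, distance) = queue.Dequeue();
            if (node == target) return distance;          // first visit is the shortest path

            if (!graph.TryGetValue(node, out var neighbors)) continue;
            foreach (var neighbor in neighbors)
            {
                if (visited.Add(neighbor))                // Add returns false if already seen
                    queue.Enqueue((neighbor, distance + 1));
            }
        }
        return -1;
    }
}
```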

I have used BFS only sporadically to solve problems at work. DFS was usually a simpler or better choice. BFS is, however, an essential tool for Advent of Code puzzles – each year, BFS is sufficient to solve at least a few puzzles. BFS is also a very common algorithm for coding interviews.

Memoization

Memoization is a fancy word for caching. More precisely, the result of a function call is cached on the first invocation, and the cached result is returned for the subsequent calls with the same arguments. Memoization only works for functions that return the same result for the same arguments (a.k.a. pure functions). This technique is widely popular as it can considerably boost the performance of expensive functions (at the expense of memory).
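
A generic memoization helper is a few lines in most languages; here is a C# sketch (the wrapped function in the usage comment is hypothetical):

```csharp
using System;
using System.Collections.Generic;

static class Memoizer
{
    // Wraps a pure function so that repeated calls with the same argument hit the cache.
    public static Func<TIn, TOut> Memoize<TIn, TOut>(Func<TIn, TOut> f) where TIn : notnull
    {
        var cache = new Dictionary<TIn, TOut>();
        return x =>
        {
            if (!cache.TryGetValue(x, out var result))
            {
                result = f(x);       // compute once...
                cache[x] = result;   // ...and remember the result
            }
            return result;
        };
    }
}

// Usage (ExpensiveLookup is hypothetical):
// var cachedLookup = Memoizer.Memoize<string, int>(key => ExpensiveLookup(key));
```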

Dynamic programming

I have to admit that grokking the bottom-up approach to dynamic programming took me a while. Even now, I occasionally encounter problems for which I struggle to define the sub-problems correctly. Fortunately, there is also the top-down approach, which is much more intuitive: it is a combination of a recursive algorithm and memoization. The solutions to sub-problems are cached after being computed for the first time to avoid re-computing them if they are needed again.
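
As an illustration of the top-down approach, here is a C# sketch for a classic puzzle – the fewest coins needed to make a given amount (my example, not one from a real project):

```csharp
using System;
using System.Collections.Generic;

static class CoinChange
{
    // Fewest coins needed to make 'amount'; int.MaxValue means "impossible".
    public static int MinCoins(int amount, int[] coins, Dictionary<int, int> memo)
    {
        if (amount == 0) return 0;
        if (amount < 0) return int.MaxValue;
        if (memo.TryGetValue(amount, out var cached)) return cached;   // sub-problem already solved

        int best = int.MaxValue;
        foreach (var coin in coins)
        {
            int rest = MinCoins(amount - coin, coins, memo);           // solve the smaller sub-problem
            if (rest != int.MaxValue) best = Math.Min(best, rest + 1);
        }

        memo[amount] = best;                                           // cache before returning
        return best;
    }
}

// CoinChange.MinCoins(11, new[] { 1, 2, 5 }, new Dictionary<int, int>()) returns 3 (5 + 5 + 1).
```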

I have used the top-down approach a handful of times in my career, but I have never used the bottom-up approach for work-related projects.

Advanced algorithms

There are many advanced algorithms, like graph coloring, Dijkstra, or Minimum spanning trees, that I have never used at work. I implemented some of them as an exercise, but I don’t work on problems that require using these algorithms.

Image: AlwaysAngry, CC BY-SA 4.0 https://creativecommons.org/licenses/by-sa/4.0, via Wikimedia Commons

“When was the last time you used this?” – Part 1: Data Structures

A candidate recently asked me, “When was the last time you used this data structure, if ever?”

The candidate admitted that as someone who worked on company internal tools, they hadn’t needed to use more advanced data structures in years. They were genuinely curious about how often I dealt with problems where these data structures were useful.

Their question prompted me to review the data structures I have used at work, learned for interviews, or used to solve programming puzzles, and to think about how often I reached for each of them. I share my list below.

Caveat: Every software developer deals with different problems. I created the list based on my experience. If I haven’t used a data structure, it doesn’t mean that it is not used or not useful. Instead, it likely means I could solve my problems without it.

Dictionary

Dictionary is one of the most commonly used data structures. It can be applied to a wide range of problems. I use dictionaries daily.

Typical implementations of Dictionary use a hash table (with a linked list to handle collisions) or a balanced Binary Search Tree. Understanding the underlying data structure gives immediate insights into the cost of basic operations like insertion or lookup.

Nowadays, every modern programming language offers an implementation of Dictionary.

Set

Set is another data structure that I use very frequently. It is surprising how often we need to handle duplicates efficiently. Set shares a lot with Dictionary. These similarities make sense because Set could be considered a Dictionary without the value.

Linked list

I implemented linked lists several times at the beginning of my career over twenty years ago, but I haven’t needed to do this since. Many standard libraries include implementations of linked lists, but again, I don’t remember the last time I needed them.

Linked list-related questions used to be a staple of coding interviews, but fortunately, they are less popular these days.

Knowing how linked lists work could still be valuable because they are sometimes used to implement other data structures, such as stacks or queues.

Stack

While I rarely need to use a stack directly, this data structure is extremely common.

Every program uses a stack (the call stack) to track invoked functions, pass parameters, and store local data.

Stack is a foundation for many algorithms, such as backtracking, tree traversal, and recursive algorithms in general. It is often used to evaluate arithmetic expressions and for syntax parsing. Both the JVM (Java Virtual Machine) and the CLR (Common Language Runtime) are implemented as stack machines.

Even though this happened long ago, I vividly remember reviewing a diff in which a recursive tree traversal was converted to the iterative version with explicit stack to avoid stack overflow errors for extremely deep trees.
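
The transformation looks roughly like this (a C# sketch with a hypothetical TreeNode type, not the actual diff):

```csharp
using System;
using System.Collections.Generic;

// A hypothetical n-ary tree node.
class TreeNode
{
    public string Value = "";
    public List<TreeNode> Children = new();
}

static class IterativeTraversal
{
    // Pre-order traversal with an explicit stack instead of recursion,
    // so very deep trees cannot overflow the call stack.
    public static void Dfs(TreeNode root, Action<TreeNode> visit)
    {
        var stack = new Stack<TreeNode>();
        stack.Push(root);

        while (stack.Count > 0)
        {
            var node = stack.Pop();
            visit(node);
            // Push children in reverse so they are visited in their original order.
            for (int i = node.Children.Count - 1; i >= 0; i--)
                stack.Push(node.Children[i]);
        }
    }
}
```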

Queue

Task execution management is one of the most common applications of queues: iOS uses DispatchQueue, web servers queue incoming requests, and drinks at Starbucks are prepared on the FIFO (First-In, First-Out) principle.

I also use queues most frequently for task execution. My second most frequent use is solving Advent of Code puzzles with BFS (Breadth First Search), which uses a queue to store nodes to visit.

An interesting implementation fact about queues is that they often use a circular buffer under the hood for performance reasons. Implementations using linked lists are usually slower due to allocating and deallocating each node individually.
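
Here is a stripped-down sketch of the idea in C# – real library queues also grow the backing array when it fills up, which this fixed-capacity version omits:

```csharp
using System;

// A stripped-down circular-buffer queue, shown only to illustrate how the
// head and tail indices wrap around the array.
class RingQueue<T>
{
    private readonly T[] _buffer;
    private int _head;    // index of the next item to dequeue
    private int _tail;    // index where the next item will be enqueued
    private int _count;

    public RingQueue(int capacity) => _buffer = new T[capacity];

    public void Enqueue(T item)
    {
        if (_count == _buffer.Length) throw new InvalidOperationException("Queue is full");
        _buffer[_tail] = item;
        _tail = (_tail + 1) % _buffer.Length;   // wrap around
        _count++;
    }

    public T Dequeue()
    {
        if (_count == 0) throw new InvalidOperationException("Queue is empty");
        var item = _buffer[_head];
        _head = (_head + 1) % _buffer.Length;   // wrap around
        _count--;
        return item;
    }
}
```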

Heap / Priority Queue

I don’t remember when I had to use the Heap data structure to solve a problem at work. I don’t feel too bad about this (except when I forgot about Heap during an interview). Microsoft added the PriorityQueue type only in .NET 6 – about 20 years after they shipped the first version of .NET Framework. Apparently, they, too, didn’t consider Heap critical.

Although I didn’t need to use Heap directly, I am sure some libraries I integrate my code use it. Heap is crucial to efficiently implementing many algorithms (e.g., Dijkstra, Kruskal’s Minimum Spanning Trees).

Trees

It is challenging to talk about trees because they come in many shapes and colors. There are binary trees, n-ary trees, Binary Search Trees (BST), B-trees, Quadtrees, Octrees, and Segment Trees, to name just a few.

I have worked with (mostly n-ary) trees in every job. HTML and XML Document Object Models (DOM), C# Expression Trees, Abstract Syntax Trees, and domain-specific hierarchical data all require an understanding of the Tree data structure.

I have never had to implement a BST at work, but balanced BSTs are one way to implement (sorted) Dictionaries and Sets. For instance, the std::map and std::set containers in C++ are usually implemented as Red-black trees.

I used Quadtrees and Octrees only to solve a few Advent of Code puzzles that required spatial partitioning.

Graphs

I’ve only rarely had to use graphs for my daily job. In most cases, they were “natural” graphs – e.g., a dependency graph – that naturally formed an adjacency list.

Having said that, entire domains, such as Computer or Telecommunication Networks, Logistics, or Circuit Design, are built on graphs, so developers working in these domains work with graphs much more often.

This is my list. How about you? Are there data structures I haven’t included but you use all the time? Or maybe you don’t use some that I consider a must? Please let me know.

Image: Jorge Stolfi, CC BY-SA 3.0, via Wikimedia Commons