Why Should You Care About Minimal Reproducible Examples (and How to Create One)

You’ve spent hours debugging a tricky bug. You can reproduce it but can’t quite figure out the root cause. You’re starting to believe that the bug might not be in your code, but in the library you are using. Given how much time you’ve already spent on this investigation, you are getting desperate and want to ask for help. What’s the best way to do it? Create a minimal repro!

What’s a minimal repro?

A minimal repro, sometimes called a Minimal Reproducible Example, is a code snippet that reproduces a bug and provides the necessary context with as little code as possible. In the ideal case, another person should be able to copy the code, run it on their machine, and reproduce the bug. This is often impossible to achieve fully, but the closer you get to this ideal, the better.

How to create a minimal repro?

Creating a minimal repro requires some thought. It’s not about code-golfing. The minimal repro should be as small as possible, but it should also retain all the necessary context. Here are a few tips:

  • Remove any code that is not needed to reproduce the issue, but make sure that your example still compiles.
  • Avoid changes that make the code shorter at the expense of understandability – e.g., don’t shorten variable names if it makes the code harder to comprehend.
  • Keep only the important data – if your array has 1 million items but you need only two of them to reproduce the issue, include just those two (see the sketch after this list).
  • Remove any artifacts, like configuration files, that are not needed to reproduce the issue. If possible, set all mandatory options and inputs directly in the code.
  • Reduce the additional steps needed to reproduce the issue to the absolute minimum.
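
To make these tips concrete, here is a minimal sketch of what a trimmed-down repro might look like. It assumes a hypothetical command-line tool (csv-report) that crashes on rows with negative values – the tool, its flags, and the data are made up purely for illustration:

    # The original input had about a million rows; only these two are
    # needed to trigger the crash, so the repro includes just them.
    printf 'alice,42\nbob,-1\n' > input.csv

    # All mandatory options are passed directly on the command line,
    # so no extra configuration files are required.
    csv-report --input input.csv --format summary
    # Expected: a one-line summary. Actual: the (hypothetical) tool crashes.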

Pro tip: Occasionally, instead of removing unneeded code to isolate the issue, it is better to start a new project and try to write code that replicates the bug from scratch.

Why create a minimal repro?

Surprisingly, the main benefit of creating a minimal repro is not making a code snippet that you could use to ask for help. Rather, ruthlessly eliminating noise helps build a deeper understanding of the problem and frequently leads to finding the root cause of the issue and a proper fix.

If creating a minimal repro didn’t help you figure out the cause of the bug and a fix, you now have something you can use to ask for assistance. You can share your example with your teammates, post it on StackOverflow, or include it when opening a GitHub issue. This is where minimal repros shine – they are critical to getting help quickly.

During my time at Microsoft, I was one of the maintainers of the EntityFramework, ASP.NET Core, and SignalR repos. As part of my job, I investigated hundreds of issues reported by users. Clean, concise repros were one of the main factors deciding whether or not an issue was resolved quickly.

For most reported issues containing a concise, clean repro, engineers needed only a glance to determine that it was indeed a bug. If it was, they could often find the culprit in the code within a few minutes. Finally, they frequently used the example included in the bug report to create a unit test for the fix.

Reports with convoluted or incomplete examples dragged on for weeks. Building a repro often required multiple follow-ups with the author. Due to the excruciatingly slow progress, these bug reports had a higher abandonment rate.

The bottom line is that you will get help faster if you make it easy to give. A minimal repro is an effective way to do this.

A Powerful Git Trick No One Knows About

Here is a Git trick I learned a long time ago that I can’t live without (and when I say no one knows about it, I mean it – it is not well documented, and no developers I have worked with knew about it):

Some git commands take - (dash) as the reference to the previous branch.

git checkout and git merge are two commands I use it with all the time.

git checkout - switches to the previous branch. This makes toggling between the two most recently used branches quick and super easy:

Using git checkout -
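
For example, a quick terminal session might look like this (the branch names are made up):

    git checkout feature/login   # start on a feature branch
    git checkout main            # jump to main to check something
    git checkout -               # back to feature/login
    git checkout -               # and back to main again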

git merge - merges the previous branch into the current branch. It is especially powerful when combined with git checkout -. You can switch to the target branch and then merge from the previous branch like this:

Using git merge -
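
A typical flow might look like this (again, the branch names are hypothetical):

    git checkout feature/login   # finish the work on the feature branch
    git checkout main            # switch to the target branch...
    git merge -                  # ...and merge feature/login into it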

One command I wish supported - is git branch -d. It would make cleaning up branches after merging effortless. Presumably, this option is not available to prevent accidentally deleting the wrong branch.

Bonus trick

While we are at it – did you know that the cd (change directory) shell command also supports -? You can use cd - to toggle between the two most recent directories.
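
For example (the directory names are made up):

    cd /var/log            # poke around in the logs
    cd ~/projects/my-app   # jump to the project directory
    cd -                   # back to /var/log (the shell prints the directory it switches to)
    cd -                   # and back to ~/projects/my-app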

Don’t let “later” derail your software engineering career

One thing I learned during my career as a software engineer is that leaving unfinished work to complete it “later” never works.

Resuming paused work is simply so hard that it hardly ever happens without external motivation.

Does it matter, though?

If the work was left unfinished not because it was deprioritized but because it was boring, tedious, or difficult, then more often than not, it does matter. The remaining tasks are usually in the “important but not urgent” category.

What happens if this work is never finished?

Sometimes, there are no consequences. Occasionally, things explode. In most cases, it’s a toll the team is silently paying every day.

From my observations, the most common software engineering activities left “for later” are these:

  • adding tests
  • fixing temporary hacks
  • writing documentation

Adding tests “later”

Insufficient test coverage slows teams down. Verifying each change manually and thoroughly takes time, so it is often skipped. This results in a high number of bugs the team needs to focus on instead of building new features. What’s worse, many bugs keep recurring, as there is no easy way to prevent them.

I don’t think there is ever a good reason to leave writing tests for “later.” The best developers I know treat test code like they treat product code. They wouldn’t ship a half-baked feature, and they won’t ship code without tests – no exceptions.

Fixing temporary hacks “later”

Software developers introduce temporary hacks into their code for many reasons. The problem is that there is nothing more permanent than temporary solutions. “The show must go on,” so new code gets added on top of the existing hacks. This new code often includes additional hacks required to work around the previous ones. With time, adding new features or fixing bugs becomes extremely difficult, and removing the “temporary” hack is impossible without a major rewrite.

In an ideal world, software developers would never need to resort to hacks. The reality is more complex than that. Most hacks are added for good reasons, like working around an issue in someone else’s code, fixing an urgent and critical bug, or shipping a product on time. However, the decision to introduce a hack should include a commitment to the proper, long-term solution. Otherwise, the tech debt will grow quickly and impact everyone working with that codebase.

Writing documentation “later”

Internal documentation for software projects is almost always an afterthought. Yet it is another area that, if neglected, will cause the team pain. Anyone who has been on call knows how difficult it is to troubleshoot and mitigate an issue quickly without a decent runbook.

In addition, documentation also saves a lot of time when working with other teams or onboarding new team members. It is always faster to send a link to a wiki describing the architecture of your system than to explain it again and again.

One way to ensure that documentation won’t be forgotten is to include writing documentation as a project milestone. To make it easier for the team, this milestone could be scheduled for after coding has been completed or even after the product has shipped. If the entire team participates, the most important topics can be covered in just a few days.

How does “later” impact YOU?

Leaving unfinished work for “later” impacts you in two significant ways. First, it strains your mental capacity. The brain tends to constantly remind us about unfinished tasks, which leads to stress and anxiety (Zeigarnik effect). Second, being routinely “done, except for” can create an impression of unreliability. This perception may hurt your career, as it could result in fewer opportunities to work on critical projects.

Top 5 Jokes Software Developers Tell Themselves and Their Managers

Software developers are boring. Not only do they keep repeating the same jokes all the time, but they also take them very seriously!

Here are the most common ones:

I will add unit tests later

“I will add unit tests later” is one of the most common jokes software developers tell. In most cases, they believe it. But then reality catches up, and tests are never added. If you were too busy to add unit tests when it was the easiest, you won’t have time to make up for this later, when it becomes more difficult, and you might even be questioned about the priority and value of this work.

I am 99% done

I am terrified to hear that someone “is 99% done,” as if it were positive. I have seen too many times how the last 1% took more time and effort than the first 99%. Even worse, I participated in projects that were 99% completed but never shipped.

My code has no bugs

Claiming that your code has no bugs is a very bold statement. I did it several times at the beginning of my career, only to be humbled by spectacular crashes or non-working core functionality. I found that saying: “So far, I haven’t found bugs in my code. I tested the following scenarios: …” is a much better option. Listing what I did for validation can be especially useful as it invites questions about scenarios I might have missed.

This joke is especially funny when paired with “I will add unit tests later.”

No risk – it’s just one line of code

While a one-line change can feel less risky than bigger changes, that doesn’t mean there is no risk. Many serious outages have been caused by one-line configuration changes, and millions of dollars are lost every year due to last-minute “one-liners.”

Estimates

Most of the estimates provided by software developers are jokes. The reason for this is simple: estimating how much time a software development task will take is more art than science. Trivial mistakes, like misplaced semicolons, can completely derail even simple development tasks. Bigger, multi-month, multi-person projects have so many unknowns that providing accurate estimates is practically impossible. From my experience, this is how this game works:

  • software developers provide estimates, often adding some buffer
  • their managers feel that these estimates are too low, so they add more buffer
  • project managers promise to deliver the project by some date that has nothing to do with the estimates they got from engineering
  • the project takes longer than the most pessimistic estimates
  • everyone keeps a straight face

Bonus

“My on-call was pretty good! I was woken up only three times this week.”

On-call Manual: Handling Incidents

If your on-call is perfect and you can’t wait for your next shift, you can stop reading now.

But, in all likelihood, this isn’t the case.

Rather, you feel exhausted by dealing with tasks, incidents, and alerts that often fire after hours, and you constantly check the clock to see how much time is remaining in your shift.

The on-call can be overwhelming. You may not know all the systems you are responsible for very well. Often, you will get tasks asking you to do things you have never done before. You might be pulled into investigating outages of infrastructure owned by other teams.

I know this very well – I have been participating in on-call for almost seven years, and I decided to write a few posts to share what I learned. Today, I would like to talk about handling incidents.

Handling Incidents ⚠️

Unexpected system outages are the most stressful part of the on-call. A sudden alert, sometimes in the middle of the night, can be a source of distress. Here are the steps I recommend following when dealing with incidents.

1. Acknowledge

Acknowledging alerts is one of the most important on-call responsibilities. It tells people that the on-call is aware of the problem and is working on resolving it. In many companies, alerts will be escalated up the management chain if not acknowledged promptly.   

2. Triage

Triaging means assessing the issue’s impact and urgency and assigning it a priority. This process is easier when nothing else is going on. If there are other active alerts, it is crucial to understand whether the new alert is related to them. If it is not, the on-call needs to decide which alerts are more important.

3. Troubleshoot

Troubleshooting is, in my opinion, the most difficult task when dealing with alerts. It requires checking and correlating dashboards, logs, code, output from diagnostic tools, etc., to understand the problem. All this happens under huge pressure. Runbooks (a.k.a. playbooks) with clear troubleshooting steps and remediations make troubleshooting easier and faster. 

4. Mitigate

Quickly mitigating the outage is the top priority. While it may sound counterintuitive, understanding the root cause is not a goal and is often unnecessary for implementing an effective mitigation. Here are some common mitigations (a couple of them are sketched in the example after this list):

  • Rolling back a deployment – outages caused by deployments can be quickly mitigated by rolling back the deployment to the previous version.
  • Reverting configuration changes – problems caused by configuration changes can be fixed by reverting these changes. 
  • Restarting a service – allowing the service to start from a clean state can fix entire classes of problems. One example could be leaking resources: a service sometimes fails to close a database connection. Over time, it exhausts the connection pool and, as a result, can’t connect to the database. 
  • Temporarily stopping a service – if a service is misbehaving, e.g., corrupting or losing data due to a failing dependency, temporarily shutting it down could be a good way to stop the bleeding.
  • Scaling – problems resulting from surges in traffic can be fixed by scaling the fleet.
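
As a rough illustration, here is what a couple of these mitigations might look like for a service running on Kubernetes – the platform, the service name, and the replica count are assumptions made for the sake of the example:

    # Roll back a bad deployment to the previous version (assumes Kubernetes)
    kubectl rollout undo deployment/checkout-service
    kubectl rollout status deployment/checkout-service   # wait until the rollback completes

    # Scale out to absorb a surge in traffic
    kubectl scale deployment/checkout-service --replicas=10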

5. Ask for help

Many on-call rotations cover multiple services, and it may be impossible to be an expert in all of them. Turning to a more knowledgeable team member is often the best thing to do during incidents. Getting help quickly is especially important for urgent issues that have a big negative impact. Other situations when you should ask for assistance are when you deal with multiple simultaneous issues or cannot keep up with incoming tasks.  

6. Root cause

The root cause of an outage is often found as a side effect of troubleshooting. When this is not the case, it is essential to identify it once the outage has been mitigated. Failing to do so will make it impossible to prevent future outages caused by the same issue.

7. Prevention

The final step is to implement mechanisms that prevent similar outages in the future. Often, this requires fixing team culture or a process. For example, if team members regularly merge code despite failing tests, an outage is bound to happen.

I use these steps for each critical alert I get as an on-call, and I find them extremely effective.