Overthinking Software Projects

“Weeks of coding can save you hours of planning.”

Most developers learn this truth the hard way. I did when I implemented a feature based on incorrect assumptions, and the only way to salvage it was a rewrite.

While junior engineers often fall into the trap of jumping straight to coding without thinking about the problem, more senior engineers fall into a different trap: overthinking.

Senior developers are routinely responsible for projects requiring a handful of engineers and a few months to complete. There is no way to execute these projects without careful thinking and planning. Even if you wanted to start coding on the first day, you couldn’t. You simply wouldn’t know where to start.

The top priority in the initial phases of bigger software projects is to sort out the ambiguity that blocks development. The most common way to do this is to devise a design that will guide the implementation.

The tricky part is that figuring out the design is by itself an ambiguous problem. There is usually more than one way to implement the solution, and many constraints and requirements are unclear. Even estimating how long it will take to prepare the design can be difficult.

All these uncertainties put pressure on the engineer(s) responsible for the project. Making a bad decision can lead to wasted time (amplified by the size of the team) or even a project failure.

The “Am I Overthinking This?” book cover.

When stakes are high, it is natural to proceed carefully, explore potential problems, and work with others to identify gaps that may otherwise go unnoticed. However, setting limits for these activities is crucial. Failing to do so will inevitably lead to overthinking, which can also put the project at risk due to:

  • Delays – analyzing every imaginable scenario takes a long time and eats into development time, making it impossible to meet the expected timelines.
  • Overengineering – trying to address minor or hypothetical issues leads to overly complex designs that are hard to implement and expensive to maintain.
  • Missed opportunity – endless discussions, revisions, and feedback rounds steal time engineers could spend on other project activities or elsewhere.

How do you know if you are overthinking?

One of the problems with overthinking is that the line between productive analysis and overthinking is thin and easy to miss. However, there are signs that you are crossing it:

  • Although all critical requirements have been satisfied, new requirements are being added.
  • The same topics continue to be discussed repeatedly without reaching any resolution.
  • Edge cases, minor issues, and esoteric scenarios start to dominate the discussion.
  • The debate moves to future scenarios that are beyond the scope of the project at hand.

How to deal with overthinking?

It is important to understand that the goal of the planning and design phases is not to identify and solve all possible problems. First, it is impossible even to list all the issues. No matter how much time you spend thinking, you will miss something. Second, many problems will never materialize, or if they do, their impact will be minimal. To focus on what’s important, create a list of requirements, decide which are critical (the list should be short), and satisfy those. Leave out the remaining ones and revisit them if they become a major problem.

Many problems have no one correct or even best solution. No amount of debate is going to change that. It may take time to arrive at this conclusion, but once it is settled, you must pick one of the alternatives and live with the consequences.

If you are not sure, prototype. An hour of coding can save you a few hours of meetings. Code does win arguments.

Design for your current needs. If you don’t operate at Facebook’s scale, don’t design for Facebook’s scale. Don’t build abstractions for your hypothetical future features.
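To make this concrete, here is a minimal sketch (TypeScript, with invented names) contrasting an abstraction built for imagined future backends with code that covers today’s single requirement:

```typescript
import { writeFile } from "node:fs/promises";

// Overthought: a pluggable storage abstraction added "for when we support
// S3, GCS, and Azure someday" – while the app only ever writes local files.
interface StorageProvider {
  write(key: string, data: string): Promise<void>;
}

class LocalDiskProvider implements StorageProvider {
  async write(key: string, data: string): Promise<void> {
    await writeFile(key, data);
  }
}

// Sufficient today: a plain function covering the one real requirement.
async function saveReport(path: string, data: string): Promise<void> {
  await writeFile(path, data);
}
```

If a second backend ever materializes, introducing the interface then is a small, mechanical refactoring – and by that point you will know what the abstraction actually needs to look like.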

Accept the fact that things may not work out. Fortunately, you are not pouring concrete. This is software, and the ability to modify it is one of its greatest advantages. You can build small and evolve your solution when your needs grow.

Build vertically. A small feature working end-to-end will allow you to discover issues early and course-correct. If you build horizontally, you won’t see gaps until very late when all the layers are ready. Fixing bigger problems at this stage will be an undertaking.
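As a rough sketch of what building vertically can look like (hypothetical TypeScript; all names invented), each layer gets only what this one feature needs:

```typescript
// Storage layer: just enough persistence for this one feature.
const notes = new Map<string, string>();

// Business logic: just enough rules for this one feature.
function addNote(id: string, text: string): void {
  if (text.trim() === "") throw new Error("note text must not be empty");
  notes.set(id, text);
}

// API layer: just enough surface for this one feature.
function handleAddNote(request: { id: string; text: string }): string {
  addNote(request.id, request.text);
  return "created";
}

// The slice works end-to-end today: it can be tested and demoed,
// exposing integration gaps long before every layer is "complete".
console.log(handleAddNote({ id: "1", text: "build vertically" }));
```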

Want your productivity to skyrocket? Avoid this trap!

As a junior engineer, I felt the urge to jump on each new project that appeared on the horizon. I never checked what I already had on my plate. Inevitably, I ended up with too many projects to work on simultaneously. Whenever my manager or a customer mentioned one of my projects, I immediately switched to it to show I was making progress. It took me years to realize that while this approach pleased the customer or the manager (and saved my junior dev butt) for the moment, it quietly hurt everyone.

The three-line execution graph

Executing multiple projects of the same priority at the same time looks like this:

Three-line execution graph – non-focused.

Initially, there are two projects: Project A and Project B. You start working on Project A, but after a while, you receive a call from the customer inquiring about the progress of Project B. To make this customer happy, you switch to Project B. In the meantime, a new interesting project, C, pops up. It is cool and seems small, so you pick it up. Your manager realizes that Project A is dragging and asks about it. You somehow manage to finish your toy Project C and move to A, which is well past the deadline. Then you pick up B again.

If you didn’t jump from project to project, the execution could look like this:

Three-line execution graph – focused.

If you compare these graphs, three things stand out in the second scenario:

  • Overall, the execution took less time. Resuming a paused project requires time to remember where it was left off and get back in the groove, i.e., to switch the context, which is time-consuming.
  • Projects A and B finished much sooner than in the first case. While you didn’t make customer B feel good by saying you were working on their project, their project was ultimately completed earlier than it would have been had you bounced between the two.
  • Project C arrived last, so it should wait unless it has a much higher priority than the other projects. Otherwise, it disrupts their execution.

I know that life is not that simple. Completely avoiding context switching is rarely possible. But if you can limit it, your productivity will dramatically increase.

Does it mean you should only work on one thing at a time?

In the past, I thought a good way for junior engineers to combat context switching was to work on just one project at a time. But this idea has a serious drawback – projects often get stuck due to factors outside our control. If this happens and there is no backup, idling until the issue gets resolved is a waste of time.

Having two projects with different priorities works best. You execute on the higher-priority project whenever you can. If you can’t, you turn to the other project until the main project gets unblocked.

What I like about this approach is that it is always clear what to work on: the higher-priority project wins unless working on it is impossible.

Falling back to the lower-priority project means there might be some context switching. While not ideal, it is better than idly waiting until the issues blocking the main project are resolved.

But my TL (Tech Lead) always works on five projects!

Indeed, experienced senior and staff engineers often work on a few projects at the same time. In my experience, however, it is a different kind of work: preparing a high-level design, aligning with a partner team, breaking projects into smaller tasks, and tracking progress.

The secret is that most of these activities don’t require as much focus as coding. Handling a few of them at the same time is much more manageable because the cost of switching contexts is much lower.

A simple way to ship maintainable software

This was my first solo on-call shift on my new team. I was almost ready to go home when a Critical alert fired. I acknowledged it almost instantly and started troubleshooting. But this was not going well. Wherever I turned, I hit a roadblock. The alert runbook was empty. The dashboards didn’t work. And I couldn’t see any logs because logging was disabled.

Some team members were still around, and I turned to them for help. I learned that the impacted service had shipped merely a week earlier and that barely anyone knew how it worked. The person who wrote and shipped it was on sick leave.

It took us a few hours to figure out what was happening and to mitigate the outage. This work made one thing apparent – this service was not ready for prime time.

In the week following the incident, we filled the gaps we had found during the outage. Our main goal was to ensure that future on-calls wouldn’t have to scramble when encountering issues with this service.

But the bigger question remained unanswered: how can we avoid similar issues with any new service or feature we ship in the future?

The idea we came up with was the Service Readiness Checklist.

What is the Service Readiness Checklist?

The Service Readiness Checklist is a list of requirements each service (or bigger feature) needs to meet to be considered ready to ship. It serves two purposes:

  • to guarantee that none of the aspects related to operating the service have been forgotten
  • to make it clear who is responsible for ensuring that requirements have been reviewed and met

When we are close to shipping, we create a task that contains a copy of the readiness checklist and assign it to the engineer driving the project. They become responsible for ensuring all requirements on the checklist are met.

Having one engineer responsible for the checklist helps avoid situations where some requirements fall through the cracks because everyone thought someone else was taking care of them. The primary job of this engineer is to ensure all checkboxes are checked. They may do the work themselves if they choose to or assign items to people involved in the project and coordinate the work.

Occasionally, the checklist owner may decide that some requirements are inapplicable. For example, the checklist may call for setting up deployment, but there is nothing to do if the existing deployment infrastructure automatically covers it.

The checklist will usually contain more than ten requirements. They are all obvious, but it is easy to miss some just because of how many there are.

Example readiness checklist

There is no single readiness checklist that would work for every team because each team operates differently. They all follow different processes and have their own ways of running their code and detecting and troubleshooting outages. There is, however, a common subset of requirements that can be a starting point for a team-specific readiness checklist:

  • [ ] Has the service/feature been introduced to the on-call?
  • [ ] Has sufficient documentation been created for the service? Does it contain information about dependencies, including the on-calls who own them?
  • [ ] Does the service have working dashboards?
  • [ ] Have alerts been created and tested?
  • [ ] Does the service/feature have runbooks (a.k.a. playbooks)?
  • [ ] Has the service been load tested?
  • [ ] Is logging for the service/feature enabled at the appropriate level?
  • [ ] Is automated deployment configured?
  • [ ] Does the service/feature have sufficient test coverage?
  • [ ] Has a rollout plan been developed?

Success story

Our team was tasked with solving a relatively big problem on a tight timeline. The solution required building a pipeline of a few services. Because we didn’t have enough people to implement this infrastructure within the allotted time, we asked for help. Soon after, a few engineers temporarily joined our team. We were worried, however, that this partnership might not work out because of the differences in our engineering cultures. The Service Readiness Checklist was one of the things (others included coding guidelines, interface-based programming, etc.) that helped set clear expectations. With both teams on the same page, the collaboration was smooth, and we shipped the project on time.

Do code reviews find bugs?

I recently overheard this: “Code reviews are a waste of time – they don’t find bugs! We should rely on our tests and skip all this code review mumbo-jumbo.”

And I agree – tests are important. But they won’t catch many of the issues code reviews will. Here are a few examples.

Unwanted dependencies

Developers often add dependencies they could easily do without. Bringing in libraries that are megabytes in size only to use one small function or relying on packages like true may not be worth the cost. Code reviews are a great opportunity to spot new dependencies and discuss the value they bring.
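A well-known illustration from the JavaScript ecosystem (left-pad is a real package; the surrounding code is a sketch):

```typescript
// Before: a whole new dependency for a one-liner.
// import leftPad from "left-pad";
// const padded = leftPad(orderId, 8, "0");

// After: the standard library already covers it.
const orderId = "42";
const padded = orderId.padStart(8, "0"); // "00000042"
console.log(padded);
```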

Potential performance issues

In my experience, most automated tests use only basic test inputs. For instance, tests for code that operates on arrays rarely use arrays with more than a few items. These inputs might be sufficient to test the basic functionality but won’t put the code under stress.

Code reviews make it possible to spot suboptimal algorithms whose execution time grows rapidly with input size, as well as scenarios prone to combinatorial explosion.

The latter bit our team not too long ago. When reviewing a code change, one of the reviewers mentioned the possibility of a combinatorial explosion. After discussing it with the author, they concluded it would never happen. Fast forward a few weeks, and our service occasionally spiked to 100% CPU before crashing due to running out of memory. Guess what? The hypothetical scenario did happen. Had we analyzed the concern raised in the code review more thoroughly, we would’ve avoided the problem completely.
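As an illustration (a made-up example, not the code from our incident), here is the kind of function that passes small-input tests but melts down at production scale, next to the linear version a reviewer might suggest:

```typescript
// O(n^2): items.indexOf is itself O(n), so the filter rescans the array
// for every element. Invisible with 5 test items, painful with 100,000.
function findDuplicatesSlow(items: string[]): string[] {
  return [...new Set(items.filter((item, i) => items.indexOf(item) !== i))];
}

// O(n): a single pass with a Set.
function findDuplicatesFast(items: string[]): string[] {
  const seen = new Set<string>();
  const dupes = new Set<string>();
  for (const item of items) {
    if (seen.has(item)) dupes.add(item);
    seen.add(item);
  }
  return [...dupes];
}
```

A unit test with five items cannot tell these two functions apart; a reviewer reading the nested scan can.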

Code complexity and readability

Computers execute all code with the same ease. They don’t care what it looks like. Humans are different. The more complex the code, the harder it is to understand and correctly modify. Code review is the best time to identify code that will become a maintenance nightmare due to its complexity and poor readability.

Missing test coverage

The purpose of automated tests is to flag bugs and regressions. But how do we ensure that these tests exist in the first place? Through a code review! If test coverage for a proposed change is insufficient, asking for more tests in the review is usually all it takes.

Bugs? Yes, sometimes.

Code reviews do find bugs. They don’t find all of them, but any bug caught before it reaches the repository is a win. Code reviews are one of the first stages in the software development process, making them the earliest chance to catch bugs. And the sooner a bug is found, the cheaper it is to fix.

Conclusion

Tests are not a replacement for code reviews. However, code reviews are also not a replacement for tests. These are two different tools. Even though there might be some overlap, they have different purposes. Having both helps maintain high-quality code.

Storytime

One memorable issue I found through a code review was the questionable use of regular expressions. My experience taught me to be careful with regular expressions because even innocent-looking ones can lead to serious performance issues. But when I saw the proposed change, I was speechless: the code generated regular expressions on the fly in a loop and executed them.

At that time, I was 19 years into my software engineering career (including 11 at Microsoft), and I had never seen a problem whose solution required generating regular expressions on the fly. I didn’t even completely understand what the code did, but I was convinced it should never be checked in (despite passing tests). After digging deeper into the problem, we found that a single for loop with two indices could solve it. If not for the code review, this change would’ve made it to millions of phones, including, perhaps, even yours, because this app has hundreds of millions of downloads across iOS and Android.
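I can’t reproduce the actual change here, but a hypothetical TypeScript example in the same spirit shows the contrast (the subsequence-matching problem below is invented for illustration):

```typescript
// The red flag: building and executing a regular expression on the fly.
// Does every character of `query` appear, in order, somewhere in `text`?
function isSubsequenceRegex(query: string, text: string): boolean {
  const escaped = query
    .split("")
    .map((c) => c.replace(/[.*+?^${}()|[\]\\]/g, "\\$&"));
  return new RegExp(escaped.join(".*")).test(text);
}

// The simpler alternative: a single loop with two indices.
function isSubsequence(query: string, text: string): boolean {
  let q = 0;
  for (let t = 0; t < text.length && q < query.length; t++) {
    if (text[t] === query[q]) q++;
  }
  return q === query.length;
}
```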

What it is like to work in Meta’s (Facebook’s) monorepo

I love monorepos! Or at least I love Meta’s (Facebook’s) monorepo, which happens to be the only monorepo I have ever worked with. Here is why.

Easy access to all code

Meta’s monorepo contains most of the company’s code. Any developer working at Meta has access to it. We can search it, read it, and check the commit history. We also can, and frequently do, modify code managed by other teams.

This easy access to all the code is great for developer productivity. Engineers can understand their dependencies more deeply, debug issues across the entire stack, and implement features or bug fixes regardless of who manages the code. All this is available at their fingertips. They can hit the ground running without talking to other teams, reading their out-of-date wikis, or spending time figuring out how to clone and build their code.

Linear commit history

Meta’s monorepo does not use branches, so the commit history is linear. The linear commit history saves engineers from having to reverse engineer a London Tube Map-like merge history to determine whether a given commit’s snapshot contains their changes. With linear commit history, answering this question boils down to comparing commit times.

No versioning

Versioning is one of the most complex problems when working with multiple repos. Each repo is independent, and teams are free to decide which versions of dependencies they want to adopt. However, because each repo evolves at its own pace, different repos will inevitably end up with different versions of the same package. These inconsistencies lead to situations where a project may contain more than one version of the same dependency, but no single version works for everyone.

I experienced this firsthand during my time at Amazon. I was working on the Alexa app, which consisted of tens of packages, each pulling in at least a few dependencies. It was versioning hell: conflicts were common, and resolving them was difficult. For example, one package used an older dependency because a newer version contained a bug. Another package, however, required the latest version because older versions lacked the needed features.

A monorepo solves versioning issues in a simple way: there is no versioning. All code is built together, so each package or project has only one version for a given commit.

Atomic commits

Monorepos allow atomic cross-project commits. Developers can rename classes or change function signatures without breaking code or tests. They just need to fix all the code affected by their change in the same commit.
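A minimal sketch of such an atomic change (hypothetical TypeScript; both “projects” shown inline):

```typescript
// --- project-a/user-service.ts (API owner): id changed from number to string.
export type User = { id: string; name: string };

export async function getUser(id: string): Promise<User> {
  return { id, name: "example" }; // stubbed lookup
}

// --- project-b/profile-page.ts (consumer): the caller is fixed in the same
// commit, so there is no intermediate state where project-b fails to compile.
export async function renderProfile(userId: number): Promise<string> {
  const user = await getUser(String(userId));
  return `Hello, ${user.name}!`;
}
```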

Doing the same is impossible in a multi-repo environment. Introducing breaking changes is either safe but slow (as it requires multiple commits for a proper migration) or fast but at the expense of broken builds.

This problem plagued the ASP.NET Core project in its early days (ProjectK, anyone?). The team was working on getting the abstractions right, so the foundational interfaces constantly changed. Many packages (each in its own repo) implemented or used these interfaces. Whenever the interfaces changed, most repos stopped compiling and needed fixes.

Build

Builds in monorepos are conceptually simple: all code in the repo is built at a given commit.

This approach makes it possible to quickly tell what’s included in the build and create bundles where all build artifacts match.

While the idea is simple, building the entire monorepo becomes increasingly challenging as the repository grows. Compiling big monorepos, like Meta’s, in a reasonable time is impossible without specialized build tools and massive infra.

Multiple repos make creating a list of matching packages surprisingly hard. I learned this when working on ASP.NET Core. The framework initially consisted of a couple dozen repos. Our build servers were constantly grinding because of what we called “build waves.” A build wave was initiated by a single commit that triggered a build. When that build finished, it triggered builds in the repos that depended on it. This process continued until all repos were built. Not only was this process slow and fragile, but with a steady stream of commits across all the repos, producing a list of matching packages was difficult.

The ASP.NET Core team eventually consolidated all the code into a single repository, adopting the monorepo approach. This change happened after I left the team, but I believe the challenges behind getting fast and consistent builds were an important reason.

What are the problems with monorepos?

If monorepos are so great, why isn’t everyone using them? There are several reasons.

Scale

Scale poses the biggest challenge for monorepos. Meta’s repository is measured in terabytes and receives thousands of commits each day. Detecting conflicts and ensuring that all changes merge correctly and don’t break the build – all without hurting developer productivity – is tough. Because most off-the-shelf tools cannot handle this scale, Meta has many dedicated teams that maintain the build infrastructure. Sometimes, they need to go to great lengths to do their job. Here is an example:

Back in 2013, tooling teams ran a simulation showing that, in a few years, basic git commands would take 45 minutes to execute if the repo continued to grow at its current rate. That was unacceptable, so Facebook engineers turned to the Git folks to solve the problem. At the time, the Git maintainers were uninterested in modifying their SCM (source code management) system to support such a big repo. The Mercurial (hg) team, however, was more receptive. With significant contributions from Facebook, they rearchitected Mercurial to meet Facebook’s requirements. This is why Meta (a.k.a. Facebook) uses Mercurial (hg) as its source control.

Granular project permissions

Monorepos make accessing any code in the repository easy, which is great for developer productivity. However, companies often have sensitive code that only selected developers should be able to access. This requirement goes against the idea of the monorepo, which aims to make all code easily accessible. As a result, enforcing access restrictions in a monorepo is problematic. Creating separate repos for sensitive projects is also not ideal, especially if these projects use the common infrastructure the monorepo provides for free.

Release management

A common strategy to maintain multiple releases is to create a branch for each release. Follow-ups (e.g., bug fixes) can be merged to these branches without bringing unrelated changes that could destabilize the release. This strategy won’t work in monorepos with a linear history.

I must admit that I don’t know how teams that ship their products publicly handle their releases. Our team owns a few services we deploy to production frequently. If we find an issue, we roll back our deployment and fix the bug forward.

A single commit can break the build

Because the entire monorepo codebase is built at a given commit, merging a mistake that causes compilation errors will break the build. These situations happen despite the tooling that is supposed to prevent them. In practice, this is rarely a problem. Developers are affected only if the project that doesn’t compile is one of their dependencies. And even then, they can work around the problem by working off an older commit until the breakage is fixed.