Program testing can be a very effective way to show the presence of bugs, but it is hopelessly inadequate for showing their absence.

— Edsger W. Dijkstra, “The Humble Programmer”, 1972

Preface

I’m writing this primarily to organize my own thinking about software testing. Over the years, I’ve accumulated a set of (strong) opinions (that are loosely held) and intuitions about what testing is good at, where it fails, and how it should be used in practice, but I’ve often struggled to articulate them clearly, sometimes even to myself. I certainly don’t pretend that any of these are truths or best practices.

Testing is an intense topic, with no shortage of strong beliefs and dogma. There are people willing to die on certain hills over software development practices, yet their deaths rarely make those hills more attractive to non-believers. After almost two decades in the software industry, both as an individual contributor and a (still hands-on) manager, I decided one Christmas break to examine my feelings and beliefs and, hopefully, form a coherent view. What follows is an attempt to reason through them from first principles and see if they hold together. Wish me luck!

Big Beautiful Test Suite

We’ve all been there and seen this situation many times. Over time, a codebase accumulates a large and impressive test suite: thousands of tests, high coverage numbers, dashboards that look reassuring (and concrete!). At some point, test coverage becomes something your manager’s manager asks to track.

At the same time, the costs become hard to ignore. The CI pipeline gets slower and more expensive. Builds take long enough that waiting for a green one becomes a regular and annoying part of the job. Tests become flaky (and suddenly your manager’s manager is asking to track flakiness too), reruns become routine, race conditions start showing up, and the jokes about “waiting for the green build” stubbornly remain relevant and salty. To keep the machine running, many companies sacrifice a human and spawn a Release Engineer role, which is unironically considered a sign of engineering maturity rather than dysfunction.

DORA metric meme

I don’t like this situation (nobody likes it) but is there an alternative? Writing fewer tests feels irresponsible, and suggesting it often sounds heretical. Yet the costs are real: infrastructure, ongoing maintenance, slower cycle time, and growing opportunity cost as engineers spend more time dealing with the test suite than building new shiny features.

What bothers me most is that these tradeoffs are rarely made explicit. We keep adding lots of tests because not adding them feels wrong, without thinking too much about their cost. The mere existence of a large test suite often becomes evidence of quality on its own, even when getting a green build has turned into a daily struggle.

The worst part is that no matter how big the beautiful test suite is, or whether you have 100% test coverage, you still get a steady monthly supply of production defects, many of which are hard to reproduce and fix. Which, inevitably, becomes yet another metric your manager’s manager asks to track.

So the next time a major customer discovers yet another critical defect and threatens to switch to a competitor, you find yourself asking a question:

Why do we even write tests?

For most of my career, this wasn’t even a question. The reaction was always the same: Why didn’t we test this scenario, how did we miss it? Let’s do a retro with 5-whys! Let’s RCA it! Let’s collect lessons learned. And among other things, we almost always learn the same lesson: we need to write more tests, or better tests. Until the next time it happens again.

Does your team have the same bias: When in doubt, write more tests?

The funny thing about software is that it doesn’t occupy physical space. But if tests were physical objects, the CI pipeline of a “mature” organization would look like a hoarder’s house, filled with all these “it won’t hurt to have it” and “we might need this someday” kinds of tests. But I digress.

iPod meme

It was only later in my career that I seriously asked this question: So why do we write tests, really?

Everyone knows that software has bugs, so we write tests to reduce bugs. But tests are software too. So the way we reduce bugs is by adding more software, and therefore more bugs. In other words, testing is the practice of writing additional buggy software in the hope that it will reduce the bugs in the original buggy software. Good thing we’re not writing tests for tests.

Tests also make us feel safer when we refactor. But in practice, in many cases, refactoring code leads to refactoring tests as well. Yes, I know, this is where the discussion turns into why tests were written incorrectly, are too tightly coupled, or violate some principle we should have stuck to more carefully, but we’ve already established that tests are software, and software has bugs, hence tests have bugs too.

Tests are also often described as documentation. Indeed, it’s a very common practice to start onboarding to a new project by reading its test suite and memorizing the most profound and insightful parts of it. Unit tests are, after all, a form of cyberpunk haiku, and at some point I genuinely hoped someone would mint an NFT out of a unit test.

These are all well-known truths, basic common knowledge. And yet they didn’t answer the question that kept bothering me:

Why do we keep getting more and more miserable as our Big Beautiful Test Suite grows?

Then one day there was an angry @channel message in Slack asking who owned a particular Jenkins job. It turned out that the job had spun up several massive EC2 instances and left them running over a long weekend. Again. And we burnt a lot of money! AGAIN!!!

Wait… money? Right…

We’re a business. Customers don’t pay for tests, they pay for working software. Building software costs money, and writing and maintaining tests (and documentation) costs money too. So there has to be a good reason to spend money on something customers never see.

That reason is to make sure customers never see the failures we already expect to happen when we accidentally break something they rely on while building new features or improving existing ones.

After all, it’s not that uncommon to refactor a toaster into a blender with an InstantiationAwareBeanPostProcessor in front of it. We pay to see, report, and fix those failures ourselves, so the customer never knows. And the sausage-making stays on our side. I hereby declare:

Tests don’t make software correct: they make failures cheaper

This already sounds like a good enough reason to justify spending money on tests. And it’s a very familiar and practical one. Anyone who remembers the early days of their career (being a junior developer triaging and fixing bugs) knows that defects aren’t assessed just by finding the root cause and coding a fix.

We also look at the risk of the fix, the effort of the fix (which translates to cost), and the impact and severity of the defect itself, all at the same time, sometimes in dedicated triage meetings where engineering, business, and product people meet. Some organizations formalize this even further and calculate defect scores, where effort and risk reduce the score, while severity and business impact increase it.
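To make that concrete, here is a minimal sketch of what such a defect score might look like. The weights, the 0 to 10 scale, and the threshold interpretation are all invented for illustration; any organization that does this tunes its own formula.

```java
// Hypothetical defect score: severity and business impact push it up,
// fix effort and fix risk pull it down. All weights are made up.
public final class DefectScore {

    static double score(double severity, double businessImpact,
                        double fixEffort, double fixRisk) {
        // Inputs are assumed to be normalized to a 0..10 scale during triage.
        double benefit = 0.6 * severity + 0.4 * businessImpact;
        double cost = 0.7 * fixEffort + 0.3 * fixRisk;
        return benefit - cost; // below some agreed threshold: park it, revisit later
    }

    public static void main(String[] args) {
        // Hard, risky fix for a low-impact defect: likely postponed.
        System.out.println(score(3, 2, 8, 7)); // clearly negative
        // Easy, safe fix for a severe, high-impact defect: fix it now.
        System.out.println(score(9, 8, 2, 1)); // clearly positive
    }
}
```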

A defect that’s hard and risky to fix, but only affects a small number of users in non-critical scenarios, often doesn’t get fixed. Or at least not now. You get pushback, a request for a workaround, and a suggestion to “handle it properly” in some future major release where a larger redesign or refactor is already planned, along with a long list of other things quietly swept under the rug.

Best I can do is postpone the fix meme

At that point, we’re not really doing software engineering anymore. We’re doing economics, weighing the actual cost of the fix, the cost of the risk that comes with it, and the business loss of leaving the defect in place.

Now apply the same model we use for defects to tests

Tests cover scenarios, not just lines of code, and you can reason about those scenarios using the same categories. Is this a critical scenario? How often does it happen? How many other pieces of code actually depend on it working correctly? And how brittle is it likely to be as the system evolves?

Effort is still effort, but it’s broader than just writing the test once. It includes maintaining it, debugging false failures, rerunning flaky tests, and paying the infrastructure cost of running it. A heavy, slow test isn’t just more expensive to execute. It stretches feedback loops, slows cycle time, quietly makes the developer experience worse and time-to-market slower, and turns time-to-green-build into yet another metric for your manager’s manager. This is probably less applicable to unit tests, but it very much depends on the size of your Big Beautiful Test Suite.

Framed this way, writing a test is just another trade-off. We’re making a bet that a scenario is important enough, stable enough, and central enough to justify the ongoing cost, fragility, and drag on iteration speed we’re introducing. Pretending that tests live outside the usual trade-offs is a narrow way of thinking that ignores cost, risk, timelines and impact.

A test doesn’t live just in a pull request. It lives in the P&L, hidden in the fragmentation between payroll and infrastructure costs, and in anonymous employee satisfaction surveys. It’s a small cog in the oppression machine of a CI pipeline, part of a sprawling network of other tests, all ready to fail your green build at the slightest provocation, watched over by sparkling corgis with laser eyes, judging your diff silently and without mercy. So think twice before adding a new test. It will outlive the pull request. It knows where you work. It will look for you. It will find you. And it will flake.

I will find you and I will flake for you meme

So if we can afford not to fix actual defects, then, dare I say it, maybe, just maybe, we can afford to skip writing some tests?

I don’t have a rule of thumb for deciding when a test is not worth writing. I honestly don’t. I’ve been in plenty of discussions where I said something like “you’re testing something, but not the thing you’re actually fixing or building”, but that judgment is always local and very specific to the situation. Maybe someone smarter than me will generalize this one day. I can’t. What I can do is tell a story.

We once had a genuinely interesting defect. One system was trying to connect to another over a SOCKS proxy using Protobuf, with Netty involved along the way. On the client side, everything looked fine. The logs clearly showed the correct port. On the server side, however, it looked like the client was trying to connect to a different, incorrect port.

The root cause turned out to be endianness. Netty is big-endian by default. The client was sending the port as a little-endian short. So the port looked correct in the client logs and completely wrong on the server.
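For anyone who hasn’t been bitten by this before, here’s a toy illustration of the mismatch. It’s plain Java, not the actual code from that system, and the port number is made up.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// A two-byte port written little-endian and read back in network byte order
// (big-endian, which is also Netty's default) turns into a different number.
public class EndiannessDemo {
    public static void main(String[] args) {
        int port = 1080; // looks perfectly fine in the client logs

        ByteBuffer wire = ByteBuffer.allocate(2).order(ByteOrder.LITTLE_ENDIAN);
        wire.putShort((short) port); // the client's mistake
        wire.flip();

        int seenByServer = ByteBuffer.wrap(wire.array())
                .order(ByteOrder.BIG_ENDIAN) // what the server expects
                .getShort() & 0xFFFF;

        System.out.println(port + " on the wire becomes " + seenByServer);
        // 1080 (0x0438) comes out as 14340 (0x3804)
    }
}
```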

To create a test, we had to build a SOCKS proxy using Netty’s own building blocks. That required implementing a protected Netty method. And since it was protected, the only way to access it was to put the test into Netty’s internal SOCKS codec package. Yes, the test literally declared its package as io.netty.handler.codec.socks, which rightfully belongs to Netty, not our Big Beautiful Test Suite. At that point, this should have raised eyebrows. It didn’t. Well, it did, but more like “cool, that’s smart!”

Beavis meme

Then we upgraded Netty. In the new version, Netty moved that protected method to a different package (io.netty.handler.codec.socksx.v5). Guess what happened? That’s right, the test broke. Nothing else, just the test. The actual fix was fine. And in that moment, something fundamental shifted. Trust was lost. Innocence was gone. Dreams were shattered. Illusions dissolved. The CI pipeline stood silent, judging us. That day, an L3 engineer got an L4 battle scar.

This test caused real financial damage and created no value whatsoever. The bug was rare and unlikely to regress, while the test was heavy, awkward, tightly coupled to Netty’s SOCKS internals, and predictably broke as soon as the dependency evolved. If we routinely accept that some real defects are not worth fixing, why is it so hard to admit that some tests are not worth writing? This particular test was a great way to reproduce and fix the issue, but definitely not worth adding to the Big Beautiful Test Suite.

I would argue that the eventual test failure in that endianness bug fix was actually a good thing. It forced us to come back later, revisit the test, and remove that monstrosity. Most tests aren’t like this. They stay green and quiet, sitting there and consuming CI resources without raising any alarms. Sometimes for years. Possibly decades. Some of these tests were written long before you joined the project by people who are no longer with us. Nobody touches these tests.

All I’m saying is that it’s easy not to notice a single PR adding a few seconds to CI, but a few hundred PRs later, the build is ten minutes slower. Meanwhile, I’ve never seen anyone check how much build time their new tests add. Maybe we should?
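Even something as small as a per-test timer makes the cost visible. Here’s a rough sketch of a JUnit 5 extension that prints how long each test takes; a real setup would report the numbers to CI instead of stdout, and none of this comes from the codebase in the story above.

```java
import org.junit.jupiter.api.extension.AfterTestExecutionCallback;
import org.junit.jupiter.api.extension.BeforeTestExecutionCallback;
import org.junit.jupiter.api.extension.ExtensionContext;

// Logs the wall-clock duration of every test it is attached to.
public class TimingExtension implements BeforeTestExecutionCallback, AfterTestExecutionCallback {

    private static final ExtensionContext.Namespace NS =
            ExtensionContext.Namespace.create(TimingExtension.class);

    @Override
    public void beforeTestExecution(ExtensionContext context) {
        context.getStore(NS).put("start", System.currentTimeMillis());
    }

    @Override
    public void afterTestExecution(ExtensionContext context) {
        long start = context.getStore(NS).remove("start", long.class);
        System.out.printf("%s took %d ms%n",
                context.getRequiredTestMethod().getName(),
                System.currentTimeMillis() - start);
    }
}
```

Attach it with @ExtendWith(TimingExtension.class) on a test class and the durations show up in the build log, which is at least a starting point for noticing the PRs that add those few extra seconds.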

Tests on probation

When writing tests, we often just don’t know enough upfront. We don’t really know how often a code path will be exercised, which inputs are common, or which edge cases are theoretical versus real, so we make an educated guess. Not that different from estimates.

Shot in the dark meme

This doesn’t have to be a permanent decision. We can come back to it later, but only if we make a conscious effort to leave ourselves that option. Yes, manager’s manager, I’m looking at you to enable that. Developers usually instrument systems so they can troubleshoot failures. Much less often do they instrument them to understand how the system is actually used.

After the code has been live for a while, those guesses can be reviewed against actual data in logs and traces. We can see which paths are hot, how parameters cluster into ranges, and which cases almost never happen. We can bucket values, look at frequencies, and even spot outliers with minimal setup in Kibana, Grafana, or our own data warehouse! From there, we can adjust our tests accordingly: focus coverage on hot paths and common values, trim tests around inputs that never seem to occur, and hopefully shave a few seconds off the build time.
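As a sketch of what “bucket values and look at frequencies” can mean in practice, here’s a trivial example. The pageSize parameter and the numbers are invented; in real life they would come from logs, traces, or a warehouse query rather than a hard-coded list.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

// Group observed parameter values into coarse buckets and count how often each occurs.
public class UsageBuckets {

    static String bucket(int pageSize) {
        if (pageSize < 50) return "0-49";
        if (pageSize < 100) return "50-99";
        return "100+";
    }

    public static void main(String[] args) {
        // Stand-in for values pulled out of production logs or traces.
        List<Integer> observedPageSizes = List.of(10, 10, 25, 10, 50, 10, 25, 10, 500);

        Map<String, Long> frequencies = observedPageSizes.stream()
                .collect(Collectors.groupingBy(UsageBuckets::bucket, TreeMap::new, Collectors.counting()));

        System.out.println(frequencies); // {0-49=7, 100+=1, 50-99=1}
        // Almost all traffic sits in the smallest bucket; the "huge page" edge case
        // barely happens, so maybe it doesn't deserve five dedicated integration tests.
    }
}
```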

As a side effect, we also end up with real visibility into how the product is actually used, which will make our product management and data engineering friends extremely happy. It also reduces the temptation to pipe sensitive data into Mixpanel and hope it doesn’t pipe it further like it did last time.

Eventual inconsistency

Production has a lot more entropy than any test environment we run. But test suites actually do the opposite — they try to minimize and contain entropy. They usually start from a clean slate: databases are recreated, fixtures are reloaded, schemas are current, and APIs look exactly like in today’s Postman collection.

In production, the state doesn’t reset, it accumulates. So we end up with a customer record created five years ago, partially shaped by migrations everyone forgot about, including one that caused a small outage at the time. It’s a five-year-long stream of events applied to that record, while fields get added, renamed, and half-deprecated, catalogs update, invariants shift, and the compensating actions are often manual migrations and workarounds. This kind of state drift is rarely (if ever) modeled in tests.

Instead, we only document this kind of drift after it breaks production: we make a code change and add a “regression” test. Which often wasn’t actually a regression, or even an edge case. It was a dormant invariant violation in the grey area of undefined behaviour, one that emerged from a design gap, accumulated state, entropy, and their combinatorial complexity, not from a recent code change.

Types of headache meme

I’m not even sure it’s possible to model our software-specific flavour of entropy, unless we keep the same database instance around for years, treat its data the same way we do in production, and even then accept that we’ll still miss the inputs and state transitions we couldn’t reasonably foresee. Some of the things that break systems are, by definition, the ones nobody thought to model in the first place.

In that light, aren’t penetration testing and chaos testing often better tools for exploring the input space, discovering edge cases, and protecting against entropy than yet another pseudo-regression test?

Conclusion

Again, I’m not advocating for writing fewer tests or claiming that some kinds of tests are categorically better than others. What I’m suggesting is a change in behavior. Instead of adding tests on autopilot every time we build a feature or fix a defect, we should pause and remember that tests come with real cost and long-lasting impact on developer productivity and overall experience. Once we acknowledge that, we can try to quantify the impact of the tests we’re adding and make an explicit trade-off.

Ideally, that changes how we think about testing. It expands our default options (and our System 1 thinking), and nudges our culture away from “when in doubt, just add more tests” toward a broader toolbox. Sometimes that means not writing a test. Sometimes it means choosing a different kind of test. Sometimes it means rewriting the code to support faster, cheaper tests, or making it more observable so we can learn from real usage. And sometimes it means treating penetration testing and chaos testing as first-class citizens in our workflows, instead of afterthoughts.