Organize automated tests without getting eaten by your devs

There’s not much a developer hates more than a pipeline blocked by a flaky test. Well, maybe having to refactor someone’s legacy code, but pipeline delays are right up there.

They’re on their own deadlines, and every minute counts. If the blocker is a real bug, no one argues - better to stop a bad merge than fire-drill it in production. But if the blocker turns out to be a flaky test? That’s when things get… heated. Faster than you can say “sorry,” your end-to-end tests get yanked from the pipeline and you’re back to hoping the right bugs get caught before prod. Not a great look.

And yet, ignoring tests isn’t an option either. If you’ve ever had something critical fail in production because “oh yeah, that test was turned off,” you know the pain. Tests need to run, and they need to run as early in the pipeline as possible. Anything else is just a slow slide into “we’ll fix it later” - a.k.a. never.

How do you keep your pipeline trustworthy without triggering a developer revolt?

Here’s the strategy we use at Octomind that’s been keeping our devs (mostly) happy, our pipeline (mostly) green, and our releases (mostly) bug-free.

The e2e testing setup: An hourglass, not a pyramid

Forget the textbook “testing pyramid.” Our shape looks more like an hourglass.

  • Thousands of unit tests: These are quick to write, quick to run, and cover the bulk of our low-level logic. Modern AI helps here - generating solid starting points for many cases - but we still make sure they test useful things, not just padding our coverage stats. Unit tests are our wide base, and they don’t slow anyone down.
  • A pinch of integration tests: Just a handful. They’re trickier to write, more prone to timing issues, and not something we want to balloon in number. They cover only the most critical cross-component behaviors.
  • A generous layer of e2e tests: This is where things get interesting. We use our own tooling to create, run, and maintain them. Contributions come from all over - developers, QA, even business folks. The key here isn’t the quantity, it’s the stability.

The OctoQA rotation

We don’t have a dedicated QA, but we consider software quality to be an essential part of every Octoneer’s work. To keep stability front and center, we introduced a rotating role: OctoQA.

Each week, a different person wears the OctoQA hat. Their mission: monitor and manage our “non-pipeline” e2e tests.

Why non-pipeline? Because fresh e2e tests are often flaky at first. Not because the code is bad, but because test setup is hard - isolation issues, data dependencies, timing quirks. Even seasoned test writers don’t always nail it on the first try.

The test quarantine process

Here’s how it works:

  1. A new e2e test is written → it does not go straight into the blocking pipeline.
  2. Instead, it is run in staging and as part of nightly scheduled runs.
  3. Each morning, OctoQA reviews the results:
    • Fails? Sent back to the author for fixes.
    • Passes 10 consecutive runs? Promoted to the next environment, eventually graduating to the pipeline.

This “quarantine first, promote later” approach means our blocking pipeline stays green for the right reasons - not because we stripped all the tests out of it, but because only proven and stable tests make it in.
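
The morning review rule above is simple enough to sketch in a few lines of Python. This is only an illustration, not our actual tooling: the names QuarantinedTest, review_run and the decision strings are made up for this example, and it assumes that any failure resets the streak, which is what “consecutive” implies.

```python
from dataclasses import dataclass

# Consecutive passes required before promotion, as stated in the list above.
PROMOTION_THRESHOLD = 10


@dataclass
class QuarantinedTest:
    """Hypothetical record for one e2e test that is still in quarantine."""
    name: str
    author: str
    consecutive_passes: int = 0


def review_run(test: QuarantinedTest, passed: bool) -> str:
    """Apply one scheduled-run result and return the review decision."""
    if not passed:
        # Assumption: a single failure resets the streak to zero.
        test.consecutive_passes = 0
        return "return_to_author"
    test.consecutive_passes += 1
    if test.consecutive_passes >= PROMOTION_THRESHOLD:
        return "promote"
    return "keep_in_quarantine"
```

Feeding one result per nightly run into review_run gives the person on OctoQA duty a single decision per test each morning: send it back, leave it where it is, or promote it.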

What happens if a pipeline test fails without cause?

It rarely happens, but even after promotion, a test can occasionally turn flaky. In that case, we don’t let it torture developers. It’s immediately pulled back into quarantine for investigation. Once fixed and stable again, it can rejoin the main pipeline.
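
How you decide that a failure is “without cause” is up to you, and in practice it involves a human looking at it. One common heuristic, shown here purely as an assumption and not as a description of our setup, is to re-run the failed test against the same commit: if it passes on a retry, it is probably flaky and goes back to quarantine. The function names and the two sets below are hypothetical.

```python
from typing import Callable, Set


def classify_failure(run_test: Callable[[], bool], retries: int = 3) -> str:
    """Re-run a failed test on the same commit; any pass suggests flakiness."""
    for _ in range(retries):
        if run_test():
            return "flaky"
    return "real_failure"


def handle_pipeline_failure(
    name: str,
    run_test: Callable[[], bool],
    pipeline: Set[str],
    quarantine: Set[str],
) -> str:
    """Move a flaky test out of the blocking pipeline and into quarantine."""
    verdict = classify_failure(run_test)
    if verdict == "flaky":
        pipeline.discard(name)  # stop blocking developers right away
        quarantine.add(name)    # it has to prove itself stable again
    return verdict
```

A test that keeps failing on every retry stays in the pipeline, because that is most likely the real bug the pipeline is supposed to block on.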

Lessons learned

After running this system for a while, a few truths have become obvious:

  • Trust is everything: The fastest way to kill a pipeline’s credibility is to have it cry wolf all the time.
  • New tests need a proving ground: “Write it → ship it” is fine for unit tests, but e2e tests need to earn their way into the pipeline.
  • Shared responsibility works: Rotating ownership means no one can shrug and say “not my problem.”
  • It’s easier to promote a good test than to repair the broken reputation of end-to-end testing: Once devs stop trusting your tests, getting them to take failures seriously again is a long road.

So yes, the pipeline still blocks when it has to. But it blocks for real bugs, not false alarms. That’s how you avoid both broken prod releases and angry dev mobs.

daniel roedler, CTO/CPO of Octomind
Daniel Rödler
Chief Product Officer and Co-founder