"Green" Isn't Done

"Green" Isn't Done
"Certainly! Here's your complete project!"

Your codebase is fighting entropy every day. AI doesn't change that — it accelerates entropy unless your harness enforces hard boundaries. The goal isn't "more autonomy." The goal is bounded entropy so you're not on call for your AI.

This is the first post in a series about that problem. We'll start where most of us start: staring at a green test suite and feeling safe.


We've all done the responsible version of this.

You wrote a well-researched PRD.

You wrote an implementation plan.

You wrote thoughtful acceptance criteria.

You handed all that context to Claude / Codex / whatever.

Your agent wrote a ton of code.

You ran 57 shiny new unit tests, all green.

Your agent said: "Staff Engineer standards." Or: "This is an amazing feature." Or: "You're so smart, I love you."

Pictured: Your Agent

You started dogfooding.

The feature wasn't wired into prod.

The handler existed; nothing registered it. The feature flag existed; nothing ever read it. The real user path never touched the code you just "validated."

You validated a component. You didn't validate the system.

And those green tests? They were a done-shaped artifact — something that looks like proof when you're tired and want to believe.

"That doesn't apply to me. I have tests."

Great. Did they prove behavior, or did they prove you can make a green test?

Because in agentic coding, "lots of tests" is not automatically "lots of safety." It's often just a bigger stage.

And the performance playing on it has a name.

Verification Theater

Verification Theater is what happens when the artifacts look like proof but the system was never actually validated. It has three recurring patterns. You've met them all. You just didn't have names for them.

Test confetti. A pile of unit tests that never touch real behavior. Mocks. Helpers. Utilities. The happy path of a function that isn't called in production. It feels convincing because it's work — it's volume, it's green, it's 47% of your codebase. It's also useless if it can't fail when the feature isn't wired.

Hollow implementations. Code that exists, compiles, satisfies interfaces, and doesn't do the work. Stubs. TODOs. Hardcoded returns. "In production we'd do X" code that never gets replaced.
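A minimal sketch of what a hollow implementation looks like. All names here (PaymentService, charge) are invented for illustration, not from any real codebase:

```python
# Hypothetical example: a "payment" service that satisfies the interface
# but never does the work.

class PaymentService:
    def charge(self, user_id: str, amount_cents: int) -> dict:
        # TODO: call the real payment provider.
        # "In production we'd do X" -- but X never arrives.
        return {"status": "success", "charged": amount_cents}

def test_charge_succeeds():
    # This unit test passes forever, even though no money ever moves.
    result = PaymentService().charge("user-1", 499)
    assert result["status"] == "success"  # green, and meaningless
```

The interface is satisfied, the test is green, and nothing real ever happened.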

A harness that accepts hollow implementations is paying bounties for dead cobras and acting surprised when your agent starts a cobra farm at industrial scale. Anyway.

Integration blindness. No test that drives a real request through the real flow. The handler exists. The router never calls it. The feature flag exists. Nothing reads it.
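Integration blindness in miniature, using a toy router so it's self-contained (Router, my_new_handler, and the path are all invented for illustration):

```python
# Hypothetical sketch: the handler exists, but nothing registers it.

class Router:
    def __init__(self):
        self.routes = {}

    def register(self, path, handler):
        self.routes[path] = handler

    def dispatch(self, path):
        # Returns the handler's status, or 404 if nothing is wired.
        if path not in self.routes:
            return 404
        return self.routes[path]()

def my_new_handler():
    return 200  # fully implemented, fully unit-tested, fully unreachable

router = Router()
# router.register("/api/new-feature", my_new_handler)  # <-- the line nobody wrote

print(router.dispatch("/api/new-feature"))  # 404: green unit tests, dead feature
```

A unit test that calls `my_new_handler()` directly stays green forever. Only a test that goes through `router.dispatch` can notice the missing registration.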

"A better model wouldn't ship dead code."

Well yes… but actually no.

Even frontier models behave like greedy search engines for done-shaped artifacts — they latch onto the first thing that looks like completion and optimize toward it. Your agent is doing what you do at 2am on DoorDash: first thing that looks edible, ship it. Except it's very confident about the order.

Your harness decides what "done-shaped" means. If your harness rewards "produce tests," it produces tests. If your harness rewards "green," it optimizes for green.

And then it hands you a monument, proudly declaring "Look upon my works, ye Mighty, and despair." Except the "works" are fifty green unit tests and an unused codepath.

Be honest: if you had to bet real money, which would you trust more?

A) Test confetti: looks like coverage, proves nothing.

def test_handler_returns_success():
    handler = MyNewHandler()
    result = handler.handle(fake_request())
    assert result.status == 200  # ✅ Green. Handler works in isolation.

# But: is MyNewHandler registered in the router?
# Is it reachable from any real user path?
# Would the app even import this module?
# This test cannot answer any of those questions.

B) One integration check that actually catches dead code.

def test_feature_is_reachable():
    response = client.post("/api/the-actual-endpoint", json=valid_payload)
    assert response.status_code == 200
    assert response.json()["feature_flag_value"] is not None
    
# If MyNewHandler isn't registered, this fails.
# If the feature flag isn't read, this fails.
# If the route doesn't exist, this fails.

One of these is a monument to process theater. The other is proof.

Goodhart's Law: when a measure becomes a target, it ceases to be a good measure. Your agent optimized for green. Green stopped measuring anything.

The fix isn't "use a better model." It's stop rewarding a bad proxy.

What counts as done, then?

Done means proven, not narrated.

Not "it feels high quality." Not "the agent is confident." Not "there are tests." The cobra bounty is the whole lesson: reward the proxy, get proxies at scale.

So — if verification didn't actually run, is the task done, blocked, or basically done?

It's blocked. Always.

"Basically done" is how you end up debugging your agent's shipped feature at 1:17 AM.

"Blocked" is how you end up asleep.

OK, now what? How do I fix this today without adopting process theater?

If you do only one thing differently:

Stop treating unit tests as the completion signal for agent work.

Here’s the founder-friendly checklist.

The “Don’t be on call for your AI” checklist

  1. One integration test beats fifty unit tests. Pick one real user path and assert it: HTTP request, CLI command, UI flow, DB write.
  2. Require a production anchor. Every test should touch real production code and fail if the implementation is hollow.
  3. Ban tautologies. No assert(true). No “returns something” when the spec is “returns the right thing.”
  4. Require verification to run. If the agent didn’t actually run the checks, it isn’t done.
  5. If it can’t be verified, it’s blocked — not done. Missing secrets, missing services, flaky tests. Surface it. Stop pretending.
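Rule 3 in practice: the same function, asserted two ways. The function and its spec (10% off) are invented for illustration:

```python
def apply_discount(price_cents: int) -> int:
    # Spec: 10% off, rounded down.
    return price_cents - price_cents // 10

def test_tautology():
    # "Returns something" -- passes even if the discount logic is wrong.
    assert apply_discount(1000) is not None

def test_behavior():
    # The spec: 1000 cents with 10% off is exactly 900.
    assert apply_discount(1000) == 900
```

The first test can only fail if the function crashes. The second fails the moment the math is wrong, which is the whole point of having it.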

Once “blocked” is a real outcome, you stop being on call for optimism.

"But integration tests are slow and flaky."

Yes. That's why they're valuable as a completion signal.

We're not talking about writing 200 brittle end-to-end tests. We're talking about proving at least one real path so dead code can't masquerade as delivery. A single stable smoke check catches the common kills: not wired, wrong registration, wrong entrypoint, wrong flag, wrong injection, wrong environment assumption.

And if your repo can't support one stable smoke path today, that's not an AI problem. That's the accumulated entropy you've been carrying, and all AI does is compound your practices — good or bad. This isn't anyone's fault — it's a consequence of misaligned incentives, as most interesting failures are.

This is why I built verification enforcement into my workflow: "done" means the checks actually ran and passed, or the task is explicitly blocked with a reason. Done means done.

What's next

Next: why "don't touch X" in a prompt is a polite request, not a safety boundary — and what hard walls actually look like. Over the next 12 weeks I'll be posting a series on how to improve your agentic software development without needing to grind LeetCode, so keep an eye out!


If you've ever been on call for your AI — reply or DM me your story. What looked "proven"? What broke in reality? What do you wish the harness had enforced?