Saad Tarhi

Why AI Makes Testing More Important, Not Less

AI helps small teams build complex systems way earlier than before. I argue that this “complexity compression” is exactly why tests, even a few targeted ones, matter more in AI-era startups.

Nov 25, 2025 · 6 min read
AI · Testing · Codebase

How AI Compresses Complexity Into Month Two

If you only read product launches and investment decks, you could get the impression that AI will eventually “write all the code” and that testing will somehow become less of a concern.

My experience has been the opposite.

In the AI-heavy startup codebase I was brought in to stabilize, which I described in an earlier piece, the lack of tests turned a fast-moving experiment into a slow-motion disaster. And the more I read about how AI affects codebases in general, the more it matches what I saw.

This article is about a simple idea:

AI compresses complexity into the early life of a project. That makes tests more valuable, not less.


Complexity compression in the AI era

Before AI coding assistants were mainstream, you usually reached serious complexity after some combination of:

  • years of development
  • many features
  • multiple teams contributing

At that point you would inevitably start investing in:

  • more tests
  • better CI
  • stricter reviews

With AI-assisted coding and “vibe coding”, that timeline is compressing:

  • Tools can scaffold multi-service architectures in days.
  • Single developers can generate huge chunks of code across frontend, backend, and infra.
  • Entire products can be prototyped with minimal manual typing.

Reports like GitClear’s analysis of AI-driven code churn point in the same direction, and the pattern matches smaller academic and industry studies on AI-assisted code quality:

  • Code change datasets show increases in duplication and short-lived churn in AI-heavy repos, both traditional technical debt signals.
  • Security reviews of Copilot-style generated code keep finding common architecture and security anti-patterns when teams accept AI output without systematic review.

In plain language: you can reach “late-stage codebase problems” in month two.

That is complexity compression.


How this looked in a real AI-generated codebase

In one AI-heavy startup that asked me to help untangle their codebase:

  • The architecture already had:
    • several Node and Next.js services
    • remote sandboxes
    • a reverse proxy
    • WebSockets for streaming logs and previews
    • a database and ORM
  • The codebase was large and evolving rapidly.
  • And there were zero automated tests.

The dynamic was predictable:

  • When something broke, nobody knew where to start.
  • Fixing one path would break another because there were no regression tests.
  • Every change felt risky, even though the team was shipping constantly.

The same kinds of issues big companies solved with layers of tests and tooling were showing up in a very young startup that had none of that infrastructure.

The difference is that large organizations had years to grow their testing culture as their systems grew. AI-heavy startups often skip straight to “big system, no safety net.”


Common arguments I hear against testing in AI-heavy teams

I have heard some variants of these:

  1. “The code is mostly AI-generated anyway; tests will slow us down.”
    • Reality: tests are one of the few ways to detect when AI output quietly regresses important behavior, especially across services.
  2. “We are still searching for product-market fit; tests are for later.”
    • Reality: AI lets you build “later stage” complexity before you find product-market fit. A total lack of tests makes it harder to iterate quickly once you get initial traction.
  3. “We test manually when we ship.”
    • Reality: with multi-service systems and many flows, you will miss things. Manual testing also does not scale with AI speed.

The point is not that you need perfect coverage on day one. It is that “zero tests” is much more dangerous than it used to be, because your system reaches non-trivial complexity much earlier.


Where tests give the highest leverage in AI-generated systems

You do not need an elaborate test strategy to get value. In my experience, a few targeted tests can give you outsized protection.

1. Smoke tests for critical paths

Identify one or two paths that define whether your product “works” or not. For example:

  • “User asks AI to create an app, gets a running preview.”
  • “User can reconnect to an existing app without errors.”

Write end-to-end or high-level integration tests that exercise these flows, even if they are slow.

If you only have time to test one thing, test the thing that makes your product worth existing.

2. Contract tests at architectural boundaries

AI-generated code makes it easy to accidentally break contracts between services. A few simple contract tests can help:

  • Check that the public API of a service still returns what other services expect.
  • Validate that message formats and queue payloads match what consumers expect.
  • For sandbox systems, verify that the sandbox is reachable and serving something non-empty on the expected port.

These are particularly useful in multi-repo setups, where local type safety is not enough.

3. Guardrail tests for “gates” around AI output

Many AI-heavy systems now include “gates” that validate or auto-fix AI output before it hits production:

  • syntax checks
  • security filters
  • domain-specific validators

Write unit or small integration tests around those gates:

  • When the AI produces clearly bad output, the gate should catch it.
  • When the AI produces acceptable output, the gate should let it through.

This is one of the few places where tests can directly tame the messiness of generative models.


How to start testing if your AI project already feels out of control

If you are already in “complex codebase, no tests, lots of AI output” territory, it can feel impossible to retrofit tests.

I would start small:

  1. Pick one painful incident and encode it as a test
    • For example, the infinite loading bug from my story.
    • Reproduce it once, write a failing test, then fix the code so the test passes.
  2. Add a tiny smoke test to CI
    • Even a single end-to-end check that ensures the main flow works is better than nothing.
    • Accept flakiness at first, then make it more reliable over time.
  3. Decide one place where you will always add tests going forward
    • New service boundaries
    • New AI output gates
    • New core flows
  4. Use AI to generate test skeletons, then review them like any critical code

The goal is not perfect coverage. It is breaking the “zero tests” pattern and slowly building your own safety net.


Final thought

AI is changing how we write software, but it is not changing what makes software reliable.

If anything, the combination of:

  • faster code generation
  • easier access to complex architectures
  • more pressure to ship flashy AI features

means that we reach the failure modes of mature systems earlier and more often.

Tests, reviews, and clear boundaries are not old fashioned ceremony. They are the only things standing between “this looks very impressive” and “we have no idea why everything is on fire.”

If you are building with AI, you do not need to become a testing purist. You just need to respect the fact that AI is very good at creating complexity, and very bad at cleaning it up for you.
