Why AI Makes Testing More Important, Not Less
AI helps small teams build complex systems way earlier than before. I argue that this “complexity compression” is exactly why tests, even a few targeted ones, matter more in AI era startups.
How AI Compresses Complexity Into Month Two
If you only read product launches and investment decks, you could get the impression that AI will eventually “write all the code” and testing will somehow become less of a concern.
My experience has been the opposite.
In the AI heavy startup codebase I was brought in to stabilize, which I described in an earlier piece, the lack of tests turned a fast moving experiment into a slow motion disaster. And the more I read about how AI affects codebases in general, the more it matches what I saw.
This article is about a simple idea:
AI compresses complexity into the early life of a project. That makes tests more valuable, not less.
Complexity compression in the AI era
Before AI coding assistants were mainstream, you usually reached serious complexity after some combination of:
- years of development
- many features
- multiple teams contributing
At that point you would inevitably start investing in:
- more tests
- better CI
- stricter reviews
With AI assisted coding and “vibe coding”, that timeline is compressing:
- Tools can scaffold multi service architectures in days.
- Single developers can generate huge chunks of code across frontend, backend, and infra.
- Entire products can be prototyped with minimal manual typing.
The broader data matches what I saw:
- GitClear’s analysis of AI driven code churn shows increases in duplication and short lived churn in AI heavy repos, both of which are traditional technical debt signals. Smaller academic and industry studies on AI assisted code quality show the same pattern.
- Security reviews of Copilot style generated code keep finding common architecture and security anti patterns when teams accept AI output without systematic review.
In plain language: you can reach “late stage codebase problems” in month two.
That is complexity compression.
How this looked in a real AI generated codebase
In one AI heavy startup that asked me to help untangle their codebase:
- The architecture already had:
  - several Node and Next.js services
  - remote sandboxes
  - a reverse proxy
  - WebSockets for streaming logs and previews
  - a database and ORM
- The codebase was large and evolving rapidly.
- And there were zero automated tests.
The dynamic was predictable:
- When something broke, nobody knew where to start.
- Fixing one path would break another because there were no regression tests.
- Every change felt risky, even though the team was shipping constantly.
The same kinds of issues that big companies solved with layers of tests and tooling were showing up in a very young startup that had none of that infrastructure.
The difference is that large organizations had years to grow their testing culture as their systems grew. AI heavy startups often skip straight to “big system, no safety net.”
Common arguments I hear against testing in AI heavy teams
I keep hearing variants of these arguments:
- “The code is mostly AI generated anyway, tests will slow us down.”
  - Reality: tests are one of the few ways you can detect when AI output quietly regresses important behavior, especially across services.
- “We are still searching for product market fit, tests are for later.”
  - Reality: AI lets you build “later stage” complexity before you find product market fit. A total lack of tests makes it harder to iterate quickly once you get initial traction.
- “We test manually when we ship.”
  - Reality: with multi service systems and many flows, you will miss things. Manual testing also does not scale with AI speed.
The point is not that you need perfect coverage on day one. It is that “zero tests” is much more dangerous than it used to be, because your system reaches non trivial complexity much earlier.
Where tests give the highest leverage in AI generated systems
You do not need an elaborate test strategy to get value. In my experience, a few targeted tests can give you outsized protection.
1. Smoke tests for critical paths
Identify one or two paths that define whether your product “works” or not. For example:
- “User asks AI to create an app, gets a running preview.”
- “User can reconnect to an existing app without errors.”
Write high level integration or end to end tests that exercise these flows completely, even if they are slow.
If you only have time to test one thing, test the thing that makes your product worth existing.
2. Contract tests at architectural boundaries
AI generated code makes it easy to accidentally break contracts between services. Some simple contract tests can help:
- Check that the public API of a service still returns what other services expect.
- Validate that message formats and queue payloads match what consumers expect.
- For sandbox systems, verify that the sandbox is reachable and serving something non empty on the expected port.
These are particularly useful in multi repo setups, where local type safety is not enough.
3. Guardrail tests for “gates” around AI output
Many AI heavy systems now include “gates” that validate or auto fix AI output before it hits production:
- syntax checks
- security filters
- domain specific validators
Write unit or small integration tests around those gates:
- When the AI produces clearly bad output, the gate should catch it.
- When the AI produces acceptable output, the gate should let it through.
This is one of the few places where tests can directly tame the messiness of generative models.
How to start testing if your AI project already feels out of control
If you are already in “complex codebase, no tests, lots of AI output” territory, it can feel impossible to retrofit tests.
I would start small:
- Pick one painful incident and encode it as a test.
  - For example, the infinite loading bug from my story.
  - Reproduce it once, write a failing test, then fix the code so the test passes.
- Add a tiny smoke test to CI.
  - Even a single end to end check that ensures the main flow works is better than nothing.
  - Accept flakiness at first, then make it more reliable over time.
- Decide one place where you will always add tests going forward:
  - new service boundaries
  - new AI output gates
  - new core flows
- Use AI to generate test skeletons, then review them like any critical code.
  - There are already guides on using AI to help test AI applications, including overviews of AI assisted code review and technical debt, that show how to lean on models without skipping human judgment.
The goal is not perfect coverage. It is breaking the “zero tests” pattern and slowly building your own safety net.
Final thought
AI is changing how we write software, but it is not changing what makes software reliable.
If anything, the combination of:
- faster code generation
- easier access to complex architectures
- more pressure to ship flashy AI features
means that we reach the failure modes of mature systems earlier and more often.
Tests, reviews, and clear boundaries are not old fashioned ceremony. They are the only things standing between “this looks very impressive” and “we have no idea why everything is on fire.”
If you are building with AI, you do not need to become a testing purist. You just need to respect the fact that AI is very good at creating complexity, and very bad at cleaning it up for you.