Saad Tarhi

What It's Really Like Debugging An AI Generated Startup Codebase

A look inside a small AI startup where most of the code was generated by models. I share what it actually felt like to debug the system, what broke, and what that says about AI driven “ship fast” culture.

Pinned · Nov 25, 2025 · 10 min read
Tags: AI, Debugging, Codebase

Stepping Into An AI Built Codebase

A small AI startup that had just launched its product brought me in to help steady the foundations of its codebase and make it safer to run in production.

From the outside, the app looked surprisingly polished. From the inside, the codebase felt like a house built in a rush on wet concrete: an impressive facade on a foundation that had never been allowed to set.

This is not a takedown of that startup. It is a field note from the AI era: what it actually feels like to debug a heavily AI generated codebase inside a fast paced “ship at all costs” culture, and what that means for the rest of us.


From “we are 99 percent done” to “nothing really feels safe”

The founder’s view was simple: the product was “97 or 99 percent done.” A few stubborn bugs were blocking users from reaching the “magic moment,” but the foundation was “basically there.”

Then I opened the repos.

  • A Next.js app for the UI and AI chat experience
  • A Node.js service orchestrating AI calls, sandboxes, and database writes
  • A separate Node.js reverse proxy speaking WebSockets to remote sandboxes
  • A database layer with PostgreSQL and Prisma
  • Daytona or Modal style sandboxes running user apps
  • Several repos glued together with custom scripts

For a 2 or 3 month old project with a small team, that is a lot of moving parts.

In the past, you needed years and multiple engineers to accidentally grow into this level of complexity. Today, with AI coding assistants and “vibe coding”, small teams can get there in weeks.

The question is not “can you build it.” It is “can you carry it.”


How AI lets you build complexity you cannot carry

“Vibe coding” is now an official concept, even named Collins Dictionary’s Word of the Year for 2025. It usually means leaning heavily on an AI assistant to write the code while you steer with prompts and tests instead of typing everything yourself. Teams using vibe coding in production describe exactly that pattern: the human supplies intent, the AI does most of the typing.

Used carefully, that can be powerful. But there is a brutal flip side that recent analyses are starting to call out:

  • Large scale code studies, like one analysis of AI assisted changes in real codebases, report more duplicate code and short lived churn when AI assistants are widely used, which is a nice way of saying “mess accumulates faster.”
  • Security reviews of AI generated code in the wild keep finding that AI output often looks production ready while hiding a high rate of vulnerabilities if you accept it without review.

My time in this codebase, trying to make it safe enough to run, was basically the qualitative version of those graphs.

AI had enabled the team to spin up:

  • A multi repository, multi service architecture
  • Integration with remote sandboxes, orchestration, proxies, WebSockets
  • A fairly advanced UI and onboarding flow

All in a short time and with little engineering capacity.

The cost was hidden: the system looked senior level from a distance, but behaved like a rushed side project when you tried to reason about it.


The illusion of solidity: code and docs that look great and mean nothing

What hit me first was the surface quality.

The code looked clean:

  • TypeScript everywhere
  • Modern patterns
  • Long comments explaining intent
  • Markdown docs describing the architecture

But after a few hours, a pattern emerged:

  • Many comments were clearly written by Claude style AI suggestions. You can recognize the tone. They were generic, repetitive, and sometimes just wrong.
  • Some functions started with a clear idea, then changed direction halfway, then stopped. Classic “model lost the thread” behavior.
  • There were helpers that were never called anywhere, or partially wired strategies that did nothing in practice.
  • The docs were extremely verbose, but in places they described how the system “should” behave, not how it actually behaved now.

The result was a new kind of problem:

The codebase looked more trustworthy than it was.

If you are used to human written code, lots of comments and long docs are usually a positive signal that someone cared. With heavy AI usage, those signals can be fake: you can paste in generated documentation that sounds convincing even though nobody has actually validated it.

I found myself treating most comments as suspicious, even though the code was full of them. That is a strange feeling.


Shipping without understanding

The second shock was the shipping pattern.

Looking at the pull requests and commit history:

  • Huge diffs were common, with hundreds of lines changed per PR
  • PR descriptions were minimal
  • There was basically no code review
  • Deployments were frequent and aggressive

When I asked about specific pieces of logic, the common answer was some version of:

“Yeah, I think Claude wrote that, it was working at some point, I am not totally sure now.”

That matches the incentives:

  • The founder sees shipping velocity and feature count.
  • Developers see that the ones who “ship more” are implicitly rewarded.
  • The easiest way to ship more is to lean harder on AI and skip review and tests.

This is not unique to this startup. People have been warning for years that “ship now, fix later” cultures turn into technical debt time bombs, especially in startups that reward speed over maintainability.

AI just adds more fuel. When an assistant can generate entire features in a single prompt, it is tempting to accept large chunks of code that nobody fully understands, and to push them straight to production because they “seem to work.”


The debugging story that made it real

One bug captures the feeling pretty well.

Symptom: infinite loading

From the user side, the problem was simple:

  • You ask the AI to build an app.
  • The UI shows a nice loading state.
  • It never finishes.

I started where most frontend folks start: the Next.js app.

  • No obvious infinite loops.
  • Network requests looked fine until they did not.
  • The UI was waiting for something that never arrived.

So I attached a debugger to the first Node.js service, the one that:

  • Takes the AI’s code output
  • Writes files into a sandbox
  • Triggers builds
  • Saves some state into PostgreSQL via Prisma

Following the data, I found:

  • Some validation helpers existed but were not actually wired into the main flow.
  • Bad code generated by the AI could slip straight into the sandbox.
  • If the sandbox failed to build, the failure did not propagate correctly to the frontend.

I wired in the existing checks so that obviously broken output would be caught earlier. That fixed part of the issue, but then another failure showed up.
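The shape of that fix can be sketched in a few lines. This is a minimal illustration, not the startup's actual code: `GeneratedFile`, `validateGeneratedFiles`, and `stageForSandbox` are hypothetical names, and the specific checks are stand-ins for whatever validation already existed in the repo.

```typescript
// Hypothetical shape of an AI-generated file set.
interface GeneratedFile {
  path: string;
  contents: string;
}

// Illustrative validation gate: reject obviously broken output
// before it ever reaches the sandbox.
function validateGeneratedFiles(files: GeneratedFile[]): string[] {
  const errors: string[] = [];
  if (files.length === 0) errors.push("model returned no files");
  for (const f of files) {
    if (f.contents.trim() === "") errors.push(`${f.path} is empty`);
    if (!f.path.startsWith("src/") && f.path !== "package.json") {
      errors.push(`${f.path} is outside the expected layout`);
    }
  }
  return errors;
}

// Wire the check into the main flow so bad output fails fast
// instead of producing a sandbox build that never finishes.
function stageForSandbox(
  files: GeneratedFile[]
): { ok: boolean; errors: string[] } {
  const errors = validateGeneratedFiles(files);
  if (errors.length > 0) return { ok: false, errors };
  // ...write files into the sandbox and trigger the build here...
  return { ok: true, errors: [] };
}
```

The point is not the specific checks. It is that the validation runs on the main path and produces an explicit failure the frontend can render, instead of being a helper that nothing calls.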

Now users would sometimes see a “proxy error” instead of an infinite spinner.

The mystery “proxy error”

The “proxy error” came from a separate Node.js service acting as a reverse proxy and WebSocket bridge between the main app and the remote sandbox.

To understand it, I had to:

  • Set up a debugger on that proxy server
  • Step through the WebSocket handshake
  • Compare behavior when the sandbox was healthy vs when it was not

The behavior was roughly:

  • If the proxy tries to open a WebSocket to a sandbox port where no app is listening, it surfaces a generic “proxy error” to the frontend.
  • It did not clearly distinguish between “infra down” and “user app never started.”

After more digging, I found that some sandboxes were missing critical template files and configs. Earlier AI generated changes had written partial file sets or deleted things they should not. From the outside, it looked like “the proxy is flaky.” In reality, the system expected an app to be running where nothing had ever successfully booted.

To fix this class of failures, I ended up:

  • Checking whether an existing sandbox was actually healthy before reusing it
  • Rebuilding sandboxes from the database source of truth when they were in a broken state
  • Being more explicit about error categories (sandbox unreachable vs app crashed vs build failed)
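The error-category part of that list is the kind of thing a discriminated union expresses well. A sketch, with hypothetical names (`SandboxFailure`, `SandboxStatus`, and `classifyFailure` are illustrative, not the actual service's types):

```typescript
// Hypothetical error taxonomy: one explicit category per failure mode,
// instead of a single generic "proxy error".
type SandboxFailure =
  | { kind: "sandbox_unreachable"; detail: string }
  | { kind: "app_crashed"; detail: string }
  | { kind: "build_failed"; detail: string };

// What a health probe might report about an existing sandbox.
interface SandboxStatus {
  reachable: boolean;
  buildSucceeded: boolean;
  appListening: boolean;
}

// Classify a probe result so the frontend can say "your app failed
// to build" instead of "proxy error". A null result means the
// sandbox is healthy and safe to reuse; anything else means rebuild
// from the database source of truth.
function classifyFailure(status: SandboxStatus): SandboxFailure | null {
  if (!status.reachable) {
    return { kind: "sandbox_unreachable", detail: "no response from sandbox host" };
  }
  if (!status.buildSucceeded) {
    return { kind: "build_failed", detail: "build exited with errors" };
  }
  if (!status.appListening) {
    return { kind: "app_crashed", detail: "nothing listening on the app port" };
  }
  return null;
}
```

Once failures carry a `kind`, "infra down" and "user app never started" stop looking identical from the outside, which is exactly the confusion that made the proxy look flaky.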

Even with those improvements, it was obvious that the system needed a lot more than “the last 3 percent” of polish.


Zero tests in a multi service AI system

One more detail that matters:

There were no automated tests at all.

  • No unit tests
  • No integration tests
  • No end to end tests
  • Not even test dependencies in package.json

In a classic CRUD app, this is already risky. In a multi repo, AI heavy system orchestrating remote sandboxes and proxies, it is basically flying blind.

Recent work on AI and technical debt has started to argue that AI enhanced development makes testing and technical debt management more important, not less, exactly because complexity shows up so early.

I felt that directly. Even basic guardrail tests around the critical path:

  • “User asks AI for app”
  • “Sandbox builds and serves a page”
  • “Proxy connects to a live app”

would have caught many of the regressions pre deployment.

Instead, every change was effectively a manual experiment in production.


What this taught me about AI era engineering

I walked away with a few conclusions.

1. AI does not remove the need for fundamentals. It amplifies the cost of ignoring them.

You can now:

  • Generate complex architectures
  • Produce mountains of code and documentation
  • Wire things together with impressive speed

But AI will not:

  • Decide which parts of the system must stay simple
  • Enforce test coverage on your critical paths
  • Refuse to generate code you do not understand

The less you invest in architecture and testing, the more AI behaves like an army of juniors that never sleeps. That matches how some security researchers describe AI generated code: highly functional, but systematically lacking architectural judgment.

2. Velocity metrics are lying if nobody understands the code

“Number of features shipped per week” tells you almost nothing if:

  • The code is mostly AI generated
  • There is no review
  • There are no tests

You can feel like a “high velocity team” while actually slowing your future self down dramatically. You are not just borrowing against the future. You are borrowing against your own ability to even reason about the system later.

3. Testing is more important in AI heavy startups, not less

The way AI compresses timelines means that the problems mature companies face after years of development now show up in month two:

  • Complex dependency graphs
  • Hidden coupling between services
  • Hard to reason about state flows

Big companies have spent years building testing and review cultures to survive this. Early stage AI startups often have none of that, but can still reach similar complexity.

That is a dangerous mismatch.


If you are leading or joining an AI heavy startup

I am not arguing for perfect purity. Startups will always move fast and accept some risk.

What stabilizing this kind of AI heavy codebase convinced me of is more modest:

  • Admit that AI lets you create more complexity than you can carry.
  • Do not trust surface signals like comments and pretty docs without sampling the reality underneath.
  • Treat AI generated code as untrusted until it is reviewed and tested, especially around infra and security sensitive paths.
  • Keep at least a few minimal tests on the critical path, so that you know when the house is cracking.

And if you ever find yourself thinking “we are 99 percent done, we just need to fix that last bug,” it may be worth opening the foundation first.
