How to Actually Build with AI Agents Without Creating Chaos

A practical workflow for using AI agents in real software delivery, from specs and task slicing to review loops and QA.

Series

Engineering in the Age of AI

Part 2 of 3

  1. Part 1: From Vibe Coding to Agentic Engineering
  2. Part 2: How to Actually Build with AI Agents Without Creating Chaos
  3. Part 3: Maintaining AI-Built Systems Without Losing the Plot

In the first article, I looked at why AI coding feels chaotic in practice and why vibe coding starts to break down as systems grow. I also touched on the bigger shift underneath that, from writing code to designing the system around it.

In this article, I want to move from that idea into the practical side of it: what that system actually looks like when you’re using AI to build something real.

The core idea

The major problems with building software with AI are drift and hallucination, so we need to anchor the model so that it works within the same boundaries every time it produces output.

The shift is to treat AI as a contributor inside a structured delivery system.

Instead of saying, “build me this feature”, you say, “implement this task within these constraints”.

That changes your role. You move from implementer to system designer. The AI is not the system. It operates within one. Your job is to define the boundaries and constraints it works within, and the points where drift should be caught. Most importantly, the human still retains control over the final decisions.

If the system is weak, the outputs will drift. If the system is strong, the outputs become far more consistent and reliable.

So what does that system actually look like?

Foundations before writing code

If article one was about anchoring the AI’s context, this is where that starts. Before writing any code, the AI needs to understand not just the task in front of it, but the broader context of the solution.

That is where the specification comes in.

The specification is the foundation of the system and the reference point for everything that follows. It does not exist for its own sake: its job is to be a constant anchor that reduces the AI’s decision-making freedom in the wrong places.

At a minimum, it should cover:

  • what you are building
  • constraints you need to operate within
  • key edge cases and scenarios

This does not need to be a single document. In practice, it is usually a set of documents covering different concerns:

  • a product requirements document (PRD)
  • technical decisions and architecture
  • coding standards and conventions

The goal is simple. Reduce ambiguity.

If the spec is vague, the output will be vague.

The spec will evolve over time, but it remains the foundation of the system. It is also one of the biggest side benefits of this approach. You end up with far better documentation than most projects ever produce.

Agent instructions

Most AI coding tools create something like a CLAUDE.md file when you start a project. The name varies, but the idea applies to any tool or model.

I really like how the Matsen group article describes it:

Think of it as the onboarding document you wish you had when you first joined the project, but written for an agent that can actually use it to take action.

This file defines how the agent should behave.

It typically includes:

  • coding standards and style
  • non-negotiables
  • architectural rules
  • tech stack decisions

In practice, this is where you give the agent the things that would otherwise get lost between prompts. That might be file structure, naming conventions, rules around where business logic should live, commands it should run before finishing, or shortcuts it should avoid even if they seem quicker in the moment.
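As an illustrative sketch, a minimal CLAUDE.md might look like the following. The stack, paths, and commands here are hypothetical, not taken from any particular project:

```markdown
# CLAUDE.md

## Tech stack
- TypeScript, Node 20, PostgreSQL (do not introduce other databases)

## Architecture rules
- Business logic lives in `src/domain/`; route handlers stay thin
- No direct database access outside the repository layer

## Non-negotiables
- Every endpoint change includes tests for invalid input
- Run `npm run lint && npm test` before declaring a task complete
- Never disable a failing test to make the build pass
```

The point is not the specific rules, but that they are written down once and apply to every prompt.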

This is what stops the AI from drifting as the codebase grows and the context becomes harder to manage. It gives the agent a consistent way of working, even when the conversation history changes.

In effect, this becomes a stable reference point for your implementation agent as the project grows.

Review instructions

The review instructions (REVIEW.md) define how code should be checked.

Where the implementation agent focuses on building, the review agent focuses on validating.

This file defines:

  • what good looks like
  • what to check for
  • what should be rejected

It is also useful to include a structured output format so the review is consistent and easy to act on. That matters not just for consistency, but because it makes feeding the review back into the implementation loop much easier.
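For example, a REVIEW.md might pin down the checks and the output format like this. Again, this is a hypothetical sketch rather than the actual project file:

```markdown
# REVIEW.md

## Check against
- prd.md and the technical decisions documents
- The conventions in CLAUDE.md

## Reject if
- Business logic appears outside `src/domain/`
- New code lacks tests for failure paths
- The change touches files outside the scope of the task

## Output format
For each finding, report:
- **Severity**: blocker | should-fix | nit
- **Location**: file and line
- **Issue**: which rule or spec point it violates
- **Suggested fix**: one sentence
```

A structured format like this is what lets you paste the review straight back into the implementation agent as a worklist.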

This creates a proper feedback loop. The implementation agent builds, the review agent checks the work against the spec and conventions, and any issues are fed back before moving on. It is not enough for the review step merely to exist: a vague prompt like “review the implementation” catches very little. Just as the implementation agent needs clear instructions, the review agent needs explicit rules and standards to check against, rather than relying on its own taste.

Without this step, drift compounds quickly.

Implementation plan

AI agents do not work well when asked to implement something large in one go. That is where most of the chaos comes from.

The solution is to break the work down.

The implementation plan is essentially a structured breakdown of the system into smaller, manageable pieces. It is the same idea as splitting a feature into sub-tasks in Jira.

Once you have your specification, you break it into small, deliverable slices that can be implemented in isolation.

The smaller the slice, the easier it is to:

  • keep the AI focused
  • review the output
  • catch issues early

A simple example of this is the difference between asking an agent to “build authentication” and asking it to “add validation to the registration endpoint, write tests for invalid input, and update the API contract” with specific reference files and lines of code for context. One is broad and open-ended. The other is small enough to implement, review, and verify without the AI wandering off course.
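Written down in the implementation plan, a slice like that might look as follows. The file paths and task numbering are hypothetical, purely to show the level of detail:

```markdown
## Task 2.3 - Registration input validation
- Add validation to POST /register (see src/api/auth.ts)
- Reject invalid emails and weak passwords with field-level errors
- Tests: invalid-input cases in tests/auth.spec.ts
- Update the API contract in docs/api.md
- Acceptance: invalid input returns 422; the valid flow is unchanged
```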

This is where the workflow starts to come together.

The implementation loop

Once you have the pieces in place, the workflow becomes a repeatable loop:

  1. Task slice - Select the next small task from the implementation plan.
  2. Implementation - Ask the agent to read the relevant documents (spec, implementation plan, CLAUDE.md) and implement the task within those constraints.
  3. Verification - Do a quick human sanity check. Does it behave as expected? Does it roughly match the intended architecture?
  4. Review - Run a separate review agent against the same task using REVIEW.md. This checks structure, standards, and alignment with the spec.
  5. Fix loop - Feed the review back into the implementation agent and resolve any issues.
  6. Commit - Commit the change as a small, atomic unit. This makes the history easier to understand and reuse later.
  7. Repeat - Move on to the next slice and continue the cycle.
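The loop above can be sketched as control flow. Here `run_agent` is a hypothetical stand-in for however you actually drive the agents (a CLI tool, an API call); the prompts and `max_fix_rounds` limit are illustrative:

```python
# Sketch of the task loop. run_agent(role, prompt) is a hypothetical hook
# for invoking an agent; swap in whatever tool you use.
import subprocess


def git_commit(message: str) -> None:
    """Commit the current change as one small, atomic unit (assumes a git repo)."""
    subprocess.run(["git", "commit", "-am", message], check=True)


def implementation_loop(tasks, run_agent, commit=git_commit, max_fix_rounds=3):
    """Run each task slice through implement -> review -> fix -> commit."""
    for task in tasks:
        # 1-2. Implement the slice, anchored to the project documents.
        run_agent(
            "implementer",
            f"Read prd.md and implementation_plan.md. You are adhering "
            f"to CLAUDE.md. Implement task: {task}",
        )
        # 3. Human verification happens here: a quick manual sanity check.
        # 4-5. Review against REVIEW.md, feeding findings back until clean.
        for _ in range(max_fix_rounds):
            findings = run_agent(
                "reviewer", f"Review the change for '{task}' against REVIEW.md."
            )
            if not findings.strip():
                break  # review came back clean
            run_agent("implementer", f"Resolve these review findings:\n{findings}")
        # 6. Commit as one atomic unit so the history stays readable.
        commit(f"feat: {task}")
```

The structure is the point: the documents are re-read on every slice, and nothing moves forward until the review comes back clean or the fix budget is spent.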

A key distinction here is between verification and review. Verification is a quick human check that the feature works. Review is a structured check that the implementation is correct.

Both are needed.

A simple example might be a task to add validation to a registration endpoint. The implementation agent reads the relevant spec, the implementation plan, and CLAUDE.md, then makes the change. I do a quick sanity check to make sure the endpoint behaves as expected. The review agent then checks the change against REVIEW.md and flags anything structural, such as business logic in the wrong layer or missing tests. That feedback goes back into the fix loop, the change is updated, and once it looks right it gets committed as a small unit of work. That is a much more reliable flow than one long conversation where the AI is being asked to remember everything at once.

Why this works

This works because it reintroduces engineering discipline in a way that AI can actually operate within.

The specification defines intent. The agent instructions define boundaries. The implementation plan controls scope. The review loop enforces quality.

You are anchoring the AI at every step.

Instead of relying on one long conversation where intent gradually degrades, you are giving it stable reference points it can keep returning to. That is what keeps the system coherent as it grows.

It also shifts your role.

You spend less time writing code and more time defining constraints, making architectural decisions, spotting drift, and reviewing outputs. The AI handles more of the implementation, but only because the system around it makes that implementation more reliable.

And this does not sit outside normal engineering practice. Small atomic commits still matter. Tests still matter. CI still matters. The difference is that the AI is now working inside that delivery system rather than improvising outside of it.

A real example - Estimation Intelligence Platform

This is the approach we took when planning the Estimation Intelligence Platform at Equator.

The problem was not just “replace a spreadsheet”. The deeper issue was that estimation knowledge walked out the door with people, developer estimates got compressed without evidence, there was no reliable link between ballparks and actual delivery cost, and the existing spreadsheet had become fragile and difficult to trust.

The core loop we wanted to support was:

scope -> challenge -> evidence -> decide -> record -> learn

From that, we landed on three core areas:

  • a real-time collaborative scoping tool
  • an evidence retrieval system using semantic search
  • a commercial visibility layer linking estimates to delivery outcomes

This is not a small app. It has multiple moving parts, deliberate architectural choices, and real business impact.

So before any code was written, we front-loaded the thinking. We produced:

  • a full PRD
  • backend technical decisions
  • frontend technical decisions
  • a data model
  • CLAUDE.md for implementation
  • a phased build plan with acceptance criteria
  • REVIEW.md for code review

That front-loading was critical.

By the time implementation started, the agent was not being asked to invent the architecture on the fly. It was working within decisions that had already been made. That reduced ambiguity, reduced rework, and made it much easier to keep the build aligned with the shape of the system we actually wanted.

Real examples of task slicing

One of the biggest factors in making this work was how we sliced the work.

We did not ask the AI to “build the app”. We broke it into phases that were small enough to implement, review, and verify without too much drift.

The first phase was pure scaffolding. No features, just project setup, infrastructure, and wiring. That was deliberate. It was the safest place to learn the workflow and establish the basic patterns we wanted the rest of the system to follow.

The next phase introduced Projects CRUD. Still contained, but enough to prove the architecture and start building confidence that the system held together the way we expected.

Later, when we added MVP features, the slicing became more granular again. Instead of broad asks like “build the scoping tool”, the work became smaller implementation steps with clear review points.

That might mean setting up the project shell and real-time presence first, then adding a focused CRUD flow, then layering more complex behaviour on top once the underlying patterns were in place.

The smaller the slice, the easier it was to keep the AI honest.

The review loop

The two-agent review loop was one of the most valuable parts of the process.

One agent implements. Another reviews.

That separation matters. The implementation agent builds momentum in a direction. The review agent comes in fresh and checks whether that direction is correct. The second agent’s purpose is not added intelligence, but separation of concerns so that one agent is not marking its own homework.

This is where most drift gets caught.

The most useful part of REVIEW.md was not just general standards, but specific drift patterns to watch for. Things like:

  • business logic creeping into the wrong layers
  • architectural boundaries being ignored
  • convenience shortcuts that break long-term structure

These are the kinds of issues that feel harmless in isolation but compound quickly.

Without this step, the process would be faster. It would also be significantly messier.

What worked well

A few things worked particularly well.

Front-loading the thinking paid off. The stronger the documents were up front, the less the agent had to guess later. That made the early output more consistent and cut down on rework.

The phased plan also worked well. Breaking the system into smaller slices made progress easier to manage and made drift much easier to spot before it had a chance to spread.

Having separate documents for separate concerns helped too. The PRD defined what we were building. The technical decisions defined how it should be built. CLAUDE.md defined how the implementation agent should behave. REVIEW.md defined how the work should be checked. Each document had a clear job, and that clarity helped the overall workflow hold together.

And the two-agent pattern worked. Having one agent implement and another review created a feedback loop that felt much closer to engineering than prompting. It slowed things down slightly in the short term, but produced much cleaner output over time.

What did not work, or needed adjustment

It was not perfect.

Continuity across sessions was still a challenge. But once the documents and the system were in place, acting as the memory instead of the chat history, the process was far more efficient than anything I had experienced on other AI-first projects.

My key takeaway from this project was:

Don’t rely on the AI to remember. Put the memory in the repo.

Prompt size was another issue. Smaller prompts that referenced documents worked far better than trying to include everything inline. When you put the effort into building the system and making sure the detail lives in the documents, you can give the AI much smaller prompts like: “Read prd.md and implementation_plan.md. You are adhering to instructions in claude.md - implement task 1.1.”

I realised after the initial build that the documents had to evolve with the project. The initial set got me to an MVP feature set, but they started to become stale once I began looking toward the next set of milestones and features.

So the key learning is simple.

The documents themselves are not static. They evolve. As the project changes, the system needs to be updated to reflect that.

Above all else, the overarching lesson was this:

AI-first development does not remove the need for discipline. If anything, it increases it.

A natural next step - adding a QA agent

One thing this process highlights is that code review and behavioural testing are not the same thing.

The review agent answers, “does this follow the rules?”

It does not answer, “does this actually work?”

That gap was still being filled manually.

A natural extension is a three-agent loop:

  • implementation agent
  • review agent
  • QA agent

The implementation agent gives you output.

The review agent gives you structural discipline.

The QA agent gives you behavioural confidence.

That is an important distinction.

The QA agent focuses on behaviour. It tests the application against acceptance criteria and verifies that it works end to end.

Like the others, it needs its own instructions. A QA.md file can define test cases, environments, and what “done” actually means.
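As with the other files, a sketch helps make this concrete. The environment, commands, and test case below are hypothetical:

```markdown
# QA.md

## Environment
- Run against the local dev stack (`docker compose up`)

## For each task
- Locate the acceptance criteria in the phased build plan
- Exercise the feature end to end, including failure paths

## Example test case
- Given an invalid email on POST /register, the API returns 422
  with a field-level error and no user record is created

## Definition of done
- All acceptance criteria for the task pass
- No regressions in the existing smoke tests
```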

This creates a progression:

  • one agent is fast but fragile
  • two agents improve structural quality
  • three agents introduce behavioural confidence

Closing

What I like about this process is that it does not pretend the AI is magic.

It recognises that the AI is good at implementation, but only if the human does the work of defining the problem, setting boundaries, and reviewing outputs.

That is the real shift.

AI-first development is not about replacing software engineering. It is about changing where the effort goes.

Less time typing code. More time writing specs, making decisions, and building systems that make the implementation reliable.

That is the part most of the AI coding conversation still skips over.

In the next article, I’ll look at what happens after the initial build, because getting to v1 is only half the story. The harder part is maintaining and evolving a system like this as it grows. That means dealing with drift over time, keeping the codebase coherent, updating the memory layer, evolving the docs and patterns as the system changes, and making the whole approach work across a wider team.


Source for the CLAUDE.md framing quoted above: Matsen group