The Three-Agent Model: How We Approach AI-Native Development

By Elliott Torres

There's a version of AI-assisted development that most people picture: a developer typing a prompt, reading the output, and deciding whether to keep it. One human, one model, back and forth. It's better than writing everything from scratch. It's not how we work.


At Foxbox we use what's become known in the engineering community as the three-agent model. It's a multi-agent architecture pattern — a way of structuring AI agents in a hierarchy so that complex software work can be decomposed, executed in parallel, and verified before it ships. Understanding it properly requires stepping back from individual AI tools and thinking about how a system of agents can work together.


This post explains the model, where it comes from, how we apply it in practice, and what it looks like on a real engagement.

Microsoft's Azure Architecture Center, Google's Agent Development Kit documentation, and Anthropic's guidance on multi-agent workflows all describe variations of the same core pattern. It goes by different names: orchestrator-subagent-evaluator, generator-critic loop, maker-checker. The underlying structure is consistent.

The Three-Agent Model

The three-agent model didn't originate with any single company or tool. It emerged from the broader field of multi-agent systems research and has been formalized in the past year or so by engineering teams at Anthropic, Google, Microsoft, and OpenAI as AI coding agents became capable enough to use it in practice.

The core idea is old: in any well-designed system, generation and verification should be separate concerns. You don't have the same person write the code and sign off on its correctness. You don't have the same process that builds the artifact also audit it. This separation of concerns is foundational to software engineering — it's why code review exists, why QA is a distinct function, why tests are written independently of the code they test.

The three-agent model applies that principle to AI agents. Instead of one agent trying to do everything — plan, implement, check, refine — you give each role to a specialized agent with its own context, instructions, and scope.
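
As a sketch, the separation looks like three functions with disjoint inputs. Everything here is invented for illustration (a real orchestrator, subagents, and critic are each LLM calls with their own prompts and context), but the shape is the point:

```python
from dataclasses import dataclass

@dataclass
class Task:
    name: str
    spec: str  # what the subagent should produce

@dataclass
class Review:
    passed: bool
    notes: str

def decompose(feature: str) -> list[Task]:
    """Orchestrator: plans and delegates, never implements.
    (Toy decomposition -- in practice this is an LLM planning call.)"""
    return [Task(part, f"{part} for {feature}")
            for part in ("schema", "logic", "api")]

def execute(task: Task) -> str:
    """Subagent: sees only its own spec, not the whole feature."""
    return f"artifact:{task.name}"

def evaluate(output: str) -> Review:
    """Critic: fresh context, explicit criteria, no generation."""
    return Review(passed=output.startswith("artifact:"), notes="ok")

def run(feature: str) -> list[str]:
    tasks = decompose(feature)                          # plan
    outputs = [execute(t) for t in tasks]               # parallel in practice
    return [o for o in outputs if evaluate(o).passed]   # verify
```

Each role could be swapped out independently, which is exactly the property the pattern is after.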

The Three Roles

The Orchestrator

The orchestrator is the agent at the top of the hierarchy. It receives the task — a feature, a bug fix, a module to build — and does the planning work: decomposing the problem, deciding what needs to happen in what order, assigning work to subagents, and synthesizing results.

The orchestrator does not write code directly. Its job is coordination and judgment. It maps the work, specifies what each subagent should produce, and determines when the output meets the standard to proceed.

This mirrors how a senior engineer works. When a complex feature lands in a sprint, the senior engineer doesn't immediately start coding. They break it down. They think about dependencies. They decide what the interface should look like before anyone touches an implementation. That planning work is what the orchestrator does.

The Subagents

Subagents are the execution layer. Each receives a scoped task from the orchestrator: a specification, the context relevant to that task, and nothing else. Because each subagent's context is limited to its own work, several can run in parallel without stepping on one another.

The Critic

The critic evaluates output before it's accepted. It doesn't generate code; it reviews it against explicit acceptance criteria, with a fresh context and different instructions than the generator had. Output that fails review goes back for revision before it proceeds.


Why Three Roles and Not One

The practical reason is context. A single AI agent working on a large feature has to hold the entire task in a single context window. As that window fills, quality degrades — the model loses track of earlier constraints, makes decisions that contradict earlier ones, and produces output that works in isolation but conflicts with the rest of the system.


Distributing the work across multiple agents, each with a focused context, prevents this. The orchestrator's context stays lean because it only manages plans and summaries, not implementation details. Each subagent's context is scoped to its specific task. The critic's context is structured around evaluation criteria, not generation.


There's also a quality argument independent of context. Generation and evaluation are different cognitive modes. An agent that wrote the code has already committed to a set of assumptions. A fresh agent reviewing that code — with different instructions and no prior context — will catch things the generator missed. This is why pair programming works. It's why code review finds bugs that the original author didn't. Separating the roles bakes that dynamic into the process.
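
That dynamic is often wired as a generator-critic loop: the critic reviews each draft against the spec, and a failed review goes back to the generator as feedback. A minimal sketch, with the agents passed in as plain callables, since the loop structure rather than the agents is what's being shown:

```python
from typing import Callable

def generate_and_review(
    spec: str,
    generate: Callable[[str, str], str],              # (spec, feedback) -> draft
    critique: Callable[[str, str], tuple[bool, str]], # (spec, draft) -> (ok, notes)
    max_rounds: int = 3,
) -> str:
    """Generate, then review; revise until the critic accepts
    or the round budget runs out."""
    draft = generate(spec, "")
    for _ in range(max_rounds):
        ok, notes = critique(spec, draft)  # critic never edits the draft
        if ok:
            return draft
        draft = generate(spec, notes)      # revision sees the critique
    return draft  # budget exhausted; in practice, escalate to a human
```

The bounded round count matters: without it, a vague critic and a stubborn generator can loop indefinitely.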

How We Apply It at Foxbox

We don't implement this mechanically for every task. A simple bug fix doesn't need an orchestrator. A single-endpoint API change doesn't need parallel subagents. The pattern earns its overhead when the work is complex enough that a single agent would hit its limits.

Where we use it consistently:

Feature builds across multiple layers. When a feature touches the database, the API, and the frontend — or requires integration with a third-party system alongside product logic — the orchestrator decomposes the feature into separate workstreams that can be specified and verified independently.


Codebase-wide changes. Refactoring a data model, migrating an API version, updating an authentication system — these are changes that affect many files and many layers. Parallel subagents, each responsible for a specific domain, move through these changes faster and with less risk of cross-contamination.


High-stakes domains. Any work where correctness is non-negotiable gets a critic configured with explicit acceptance criteria, not just a general quality review. In regulated industries — healthcare, fintech, anything with real compliance exposure — the critic's checklist includes the specific requirements that can't be compromised.

A Real Example: Building a Care Coordination Feature

One of our current engagements is with a digital health company building a care coordination platform. The product manages care plans for patients across multiple providers. The feature we were building: a notification system that alerts care team members when a patient's status changes, with each alert routed based on the member's role, their relationship to the patient, and the type of status change.

This is a non-trivial feature. It has a rules engine (who gets notified, under what conditions), a delivery layer (push, SMS, in-app), a preference system (notification settings per user), and a data model change (new notification event types with appropriate audit logging). It touches HIPAA-covered data throughout.

Here's how the orchestrator decomposed it:


  1. Define the event schema for status change notifications and the audit log format
  2. Build the rules engine that maps patient status changes to notification recipients
  3. Build the delivery service (separate subagent per channel: push, SMS, in-app)
  4. Build the user preferences API
  5. Wire the components together and validate the end-to-end flow
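
That plan can be encoded as a small dependency graph and grouped into parallel waves. The task names below are invented labels for the numbered steps — a sketch of the scheduling, not our tooling:

```python
# Hypothetical encoding of the decomposition above. The rules engine
# depends on the schema because it consumed that subagent's output.
PLAN = {
    "event_schema": [],
    "rules_engine": ["event_schema"],
    "delivery_push": [],
    "delivery_sms": [],
    "delivery_inapp": [],
    "preferences_api": [],
    "integration": ["event_schema", "rules_engine", "delivery_push",
                    "delivery_sms", "delivery_inapp", "preferences_api"],
}

def waves(plan: dict[str, list[str]]) -> list[list[str]]:
    """Group tasks into waves; everything in a wave can run in parallel."""
    done: set[str] = set()
    order: list[list[str]] = []
    while len(done) < len(plan):
        ready = [t for t, deps in plan.items()
                 if t not in done and all(d in done for d in deps)]
        if not ready:
            raise ValueError("cycle in plan")
        order.append(ready)
        done.update(ready)
    return order
```

Five of the seven tasks land in the first wave, which is where the parallelism pays off.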


Tasks 1 through 4 were assigned to parallel subagents. Each had a scoped context: the event schema agent only saw the existing data model and the requirements doc. The rules engine agent saw the event schema output and the business logic spec. The delivery service agents each saw only their channel's integration requirements.

The critic for this engagement was configured with specific criteria: every function touching PHI must pass through the existing access control middleware, no patient identifiers in log output, notification content must not include clinical notes.
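
A sketch of what explicit criteria look like in practice. These string checks are simplified stand-ins — the real critic is an agent evaluating diffs against a checklist — but they show criteria precise enough to actually fail on:

```python
import re

# Simplified, illustrative stand-ins for the critic's checks.
# Function and field names here are invented for the example.
CHECKS = {
    "phi_access_goes_through_middleware":
        lambda src: "read_phi(" not in src or "access_control" in src,
    "no_patient_identifiers_in_logs":
        lambda src: not re.search(r"log\w*\.\w+\(.*patient_id", src),
    "no_clinical_notes_in_notification_payload":
        lambda src: "clinical_note" not in src,
}

def run_critic(source: str) -> list[str]:
    """Return the names of every acceptance check the source fails."""
    return [name for name, check in CHECKS.items() if not check(source)]
```

Contrast this with "make sure this is good": each check names the specific failure it exists to catch.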

That last requirement caught something on the first pass. The rules engine subagent had included a brief status summary in the notification payload — the kind of thing a developer building quickly might include as helpful context, without realizing it constituted protected health information in transit. The critic flagged it. The subagent revised the payload to pass only identifiers and a status code, with the readable content fetched client-side after authentication.
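
Side by side, the fix looks like this. The field names and values are invented for illustration, not the client's actual schema:

```python
# First pass: a readable summary travels inside the notification.
# That summary is PHI in transit -- this is what the critic flagged.
flagged_payload = {
    "patient_id": "pt_0001",
    "event_type": "STATUS_CHANGE",
    "status_summary": "Discharged; follow-up labs pending",  # PHI
}

# Revision: identifiers and a status code only. The client fetches
# the readable content after it authenticates.
revised_payload = {
    "patient_id": "pt_0001",
    "event_type": "STATUS_CHANGE",
    "status_code": "DISCHARGED",
}
```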


That's a real catch. It's the kind of thing that gets through code review when reviewers are moving fast and aren't specialized in HIPAA requirements. Having an agent whose entire job is to evaluate the output against a defined compliance checklist — before it leaves the development workflow — is a different category of protection.


What This Means for How We Staff Engagements

The three-agent model changes what senior engineers spend their time on, not whether you need them.


The engineers at Foxbox are not spending time on boilerplate. They're not typing out CRUD endpoints or writing the same authentication middleware they've written a dozen times.

What they're doing is what the model can't replace: designing the orchestrator's decomposition, defining the critic's acceptance criteria, reviewing the synthesized output before it merges, and making the architectural decisions that determine whether a system will hold up at scale.

This is, in our view, what senior engineering capacity should be used for. The judgment work. The design work. The work that requires understanding the whole system, not just the current task.

What AI has changed is the ratio. More of a senior engineer's day is judgment. Less of it is execution.


The Honest Caveats

The model has real costs. Each agent invocation adds latency and token usage. Orchestrating five parallel subagents costs more than running one. For small tasks, the overhead isn't worth it.


Poorly specified acceptance criteria undermine the critic. If you give the critic a vague mandate ("make sure this is good"), it won't catch the specific things you care about. The value of the critic is proportional to the precision of its instructions.


And the model requires engineers who understand it. Setting up effective orchestration, writing good subagent specifications, configuring critics with meaningful criteria — these are skills. The tools are increasingly accessible. The judgment required to use them well is not.

The Short Version

The three-agent model is: an orchestrator that plans and coordinates, specialized subagents that execute in parallel within isolated contexts, and a critic that evaluates output before it's accepted.

It's a pattern with real roots in software engineering and multi-agent systems research, now formalized by the major AI labs and increasingly standard in how serious engineering teams work with these tools.


We use it because it produces better software faster than any single-agent approach at the scale we operate. And because in the domains our clients work in — healthcare, fintech, enterprise software — "faster" without "better" isn't actually faster. It's just debt accumulated at speed.