Multi-Agent AI Coding Workflow: Step-by-Step (2026)
Build a 3-agent AI coding workflow with CrewAI and Python. One agent writes, one reviews, one writes tests. Full code included.
Throw a single prompt at an AI coding assistant and you get a single perspective: one model, one pass, one set of blind spots. It writes the function, maybe adds a docstring, and calls it done. Nobody checks the edge cases. Nobody writes a test. Nobody asks whether the approach was right in the first place.
A multi-agent workflow fixes this by splitting the work the way a real engineering team would: one agent writes the code, a second reviews it with fresh eyes, and a third writes tests against what actually shipped. Each agent has a narrow job, a clear handoff, and no memory of how hard the work was, so it isn't tempted to rubber-stamp its own output.
This guide builds that pipeline from scratch using CrewAI and Python. By the end you'll have a working 3-agent code review crew you can run on any function, plus the patterns to extend it with a security agent, a CI hook, or your own task definitions.
If you're already living inside an AI coding assistant day to day, it's worth comparing this orchestrated approach to how Claude Code's own subagent system handles parallel work. The mental model is the same: narrow roles, clear handoffs, no single agent doing everything.
Single-Prompt AI vs a Multi-Agent Workflow
When you ask a single model to "write a function and make sure it's good," you're asking it to be the author and the critic in the same breath. Models, like people, are bad at grading their own homework. The same context that produced the code also produced the justification for why the code is fine.
Splitting the work into separate agents with separate prompts (and often separate model calls) breaks that loop. The Reviewer agent never sees the Coder's reasoning, only its output, so it evaluates the code on its own merits. The Test Writer never sees the review either; it just sees the final function and writes tests that would catch real regressions.
This isn't a novelty. It mirrors how software teams already work: someone writes the PR, someone else reviews it, and (ideally) someone writes or extends the test suite. Multi-agent frameworks just give that structure to an LLM pipeline so it happens automatically, every time, without anyone forgetting to ask for a second opinion.
| Approach | What you get | What you miss |
|---|---|---|
| Single prompt | Fast, one round trip, good for small snippets | No independent review, no tests, blind spots go unchecked |
| Single prompt + "review your code" | Slightly better self-correction on obvious issues | Same context bias; the model defends its own choices |
| Multi-agent (Coder → Reviewer → Tester) | Independent critique, generated tests, an audit trail of each stage | More tokens, more latency, more moving parts to debug |
Why CrewAI Maps to How You Already Think About Teams
There are several multi-agent frameworks (LangGraph, AutoGen, OpenAI's Swarm-style patterns), but CrewAI is the easiest on-ramp if you've ever managed or worked on a small team. Its core abstractions map almost one-to-one onto roles, responsibilities, and handoffs.
An Agent has a role, a goal, and a backstory. These aren't decoration: they materially change how the underlying model behaves, because they become part of every prompt that agent sees. A Task is a unit of work assigned to an agent, with an expected_output that tells the model what "done" looks like. A Crew wires agents and tasks together and runs them in order (or in parallel, if you configure it that way).
The part that matters most for a code review pipeline is context: a task can declare that it depends on the output of a previous task, and CrewAI automatically feeds that output into the new task's prompt. That's the handoff mechanism, the digital equivalent of "here's the PR, go review it."
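To make the handoff concrete, here is a toy model in plain Python of what declaring `context` amounts to: the outputs of earlier tasks are placed in front of the later task's own instructions. This is an illustration only, not CrewAI's actual implementation; `ToyTask` and `build_prompt` are hypothetical names invented for this sketch.

```python
from dataclasses import dataclass, field


@dataclass
class ToyTask:
    """Hypothetical stand-in for a task; this is NOT the CrewAI Task class."""
    description: str
    output: str = ""  # filled in once the task has run
    context: list["ToyTask"] = field(default_factory=list)


def build_prompt(task: ToyTask) -> str:
    """Combine the outputs of upstream tasks with this task's description."""
    upstream = [t.output for t in task.context if t.output]
    return "\n\n".join(upstream + [task.description])


# The "write" task has already produced output; the "review" task depends on it.
write = ToyTask(description="Write add(a, b).", output="def add(a, b): return a + b")
review = ToyTask(description="Review the function above.", context=[write])

print(build_prompt(review))
```

The reviewer's prompt contains the code but nothing about how it was produced, which is the property the pipeline relies on.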
Build a 3-Agent Code Review Pipeline
We're going to build a crew of three agents: a Coder who writes a function from a spec, a Reviewer who critiques that function for bugs, style, and edge cases, and a Test Writer who produces a pytest suite based on the final code and the review notes. Each step below adds one piece, and by the end you'll have a complete, runnable script.
1. Install CrewAI and Set Up Your Project
CrewAI ships as a regular pip package along with an optional tools package for things like web search and file I/O. We won't need the tools package for this pipeline, but it's worth installing now if you plan to extend the crew later.
The commands below use a standard `venv` setup. If you want a faster install and lockfile workflow, switching your Python setup to uv drops straight into this same project structure.

```bash
mkdir agent-review-crew && cd agent-review-crew
python -m venv .venv
source .venv/bin/activate
pip install crewai crewai-tools python-dotenv
```

Create a `.env` file for your API key. CrewAI reads provider credentials the same way the underlying SDKs do, so if you're using OpenAI or Anthropic, the standard environment variable names work out of the box.

```bash
# .env
OPENAI_API_KEY=sk-your-key-here
# or, if you're using Anthropic models:
# ANTHROPIC_API_KEY=sk-ant-your-key-here
```

2. Define the Coder Agent
Start a file called `crew.py`. The first agent's job is narrow: take a plain-language spec and produce working Python. Notice that its backstory explicitly tells it not to explain or apologize, just to write code. That constraint matters: without it, models tend to wrap code in paragraphs of hedging that pollute the next agent's input.

```python
# crew.py
import os

from dotenv import load_dotenv
from crewai import Agent, Task, Crew, Process, LLM

load_dotenv()

llm = LLM(model="gpt-4o-mini", temperature=0.2)

coder = Agent(
    role="Senior Python Developer",
    goal="Write clean, correct, well-typed Python functions from a spec",
    backstory=(
        "You are a senior backend engineer who values readability and "
        "correctness over cleverness. You write idiomatic Python with type "
        "hints and docstrings. You output only code, no explanations, no "
        "markdown fences, no commentary about how the code works."
    ),
    llm=llm,
    verbose=True,
)
```

Lower temperature (around 0.2) keeps the Coder's output consistent and reduces the chance of syntactically inventive but wrong code. You can raise it later if you want more varied first drafts to review.

3. Define the Reviewer and Test Writer Agents
The Reviewer's backstory is doing the heavy lifting here. It's deliberately adversarial: its job is to find problems, not to be polite about the Coder's work. The Test Writer, in turn, is told to write tests that would actually fail if the code were subtly wrong, not tests that just exercise the happy path.
```python
# crew.py (continued)
reviewer = Agent(
    role="Staff Engineer Code Reviewer",
    goal="Find real bugs, missing edge cases, and style issues in Python code",
    backstory=(
        "You are a staff engineer doing a thorough code review. You are not "
        "here to be nice, you are here to catch what the author missed: "
        "off-by-one errors, unhandled exceptions, missing input validation, "
        "type mismatches, and unclear naming. List concrete issues with line "
        "references where possible. If the code is genuinely solid, say so "
        "briefly and move on, don't invent problems."
    ),
    llm=llm,
    verbose=True,
)

test_writer = Agent(
    role="QA Engineer",
    goal="Write a pytest suite that would catch real regressions in the given function",
    backstory=(
        "You write pytest test suites that go beyond the happy path. You "
        "cover edge cases, invalid inputs, boundary values, and any issues "
        "raised in a code review. You output only valid Python test code "
        "using pytest conventions, no explanations."
    ),
    llm=llm,
    verbose=True,
)
```

4. Chain the Tasks With Context
This is where the pipeline actually becomes a pipeline. Each `Task` declares an `agent` to run it and, critically, a `context` list of prior tasks whose output should be fed in as additional input. CrewAI handles the plumbing: the Reviewer automatically receives the Coder's function, and the Test Writer automatically receives both the function and the review.

```python
# crew.py (continued)
write_task = Task(
    description=(
        "Write a Python function that takes a list of order dictionaries "
        "(each with 'id', 'amount', and 'status' keys) and returns the total "
        "amount of all orders with status 'completed'. Handle missing keys "
        "and non-numeric amounts gracefully. Include type hints and a docstring."
    ),
    expected_output="A complete, runnable Python function with type hints and a docstring.",
    agent=coder,
)

review_task = Task(
    description=(
        "Review the Python function produced in the previous task. Identify "
        "concrete bugs, missing edge cases, and style issues. Be specific "
        "about what could go wrong and why."
    ),
    expected_output="A bullet list of specific issues, or a brief note that the code is solid.",
    agent=reviewer,
    context=[write_task],
)

test_task = Task(
    description=(
        "Write a pytest test suite for the function from the first task, "
        "taking into account the issues raised in the review. Cover the "
        "happy path, missing keys, non-numeric amounts, empty lists, and "
        "any edge cases the reviewer flagged."
    ),
    expected_output="A complete pytest file with multiple test functions, ready to run with pytest.",
    agent=test_writer,
    context=[write_task, review_task],
)
```

Note that `test_task` lists both `write_task` and `review_task` in its context. That's the whole trick: the final agent sees the original code and the critique of that code, so its tests target the things most likely to be wrong, not just the things that are obviously right.

5. Assemble the Crew and Run It
With the agents and tasks defined, wiring up the crew is a few lines. `Process.sequential` runs tasks in the order you listed them, which is exactly what we want for a write-review-test pipeline where each stage depends on the last.

```python
# crew.py (continued)
crew = Crew(
    agents=[coder, reviewer, test_writer],
    tasks=[write_task, review_task, test_task],
    process=Process.sequential,
    verbose=True,
)

if __name__ == "__main__":
    result = crew.kickoff()
    print("\n\n=== FINAL OUTPUT ===\n")
    print(result)
```

```bash
python crew.py
```

Run it and you'll see verbose logs for each agent as it works through its task, followed by the final output (the test suite, since that's the last task in the chain). If you want to inspect every stage rather than just the last one, each `Task` object exposes its own `.output` after `kickoff()` completes.

```python
print("--- CODE ---")
print(write_task.output.raw)
print("\n--- REVIEW ---")
print(review_task.output.raw)
print("\n--- TESTS ---")
print(test_task.output.raw)
```
Common Failure Modes (And How to Recover)
Multi-agent pipelines fail differently than single-prompt calls. The good news is that the failure modes are predictable once you've seen them, and most have a cheap fix. Here are the five you'll hit most often.
- The Reviewer rubber-stamps everything. If your Reviewer keeps saying "looks good, no issues," its backstory probably isn't adversarial enough, or its temperature is too low for it to think critically. Make the goal explicitly about finding problems, and consider raising its temperature slightly above the Coder's.
- Output bleeds between agents. You'll sometimes see the Test Writer include explanatory prose, markdown fences, or even copy the Coder's docstring verbatim into a test file. Tighten the `expected_output` description ("a complete pytest file, no markdown fences, no explanations") and, if it persists, add an output parser step that strips fences before the result reaches the next task.
- The chain drifts off-spec. Each handoff is a chance for the original requirement to get diluted. If task three is solving a slightly different problem than task one described, add the original spec back into later tasks' `description` fields rather than relying solely on `context` to carry it forward.
- Costs and latency add up fast. Three agents means at minimum three model calls per run, often more with CrewAI's internal reasoning steps. For iteration, use a cheaper or smaller model (or a local model via Ollama) and reserve your strongest model for the final production run.
- One bad stage poisons the rest. Because each task's input includes the previous task's raw output, a malformed or truncated response from the Coder can cascade into a useless review and useless tests. Add lightweight validation between stages (does the code parse? does it define the expected function name?) and fail fast rather than letting garbage propagate.
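Two of the fixes above, stripping stray fences and validating code between stages, need nothing beyond the standard library. The sketch below is one possible shape for them: `strip_fences` and `validate_stage_code` are hypothetical helper names, and the expected function name is an assumption, since the spec in this guide doesn't fix one (you would add it to the Coder's task description).

```python
import ast
import re

# Matches a ```lang opening line and a closing ``` wrapped around the whole text.
_FENCE_RE = re.compile(r"^\s*```[\w+-]*[ \t]*\n(.*?)\n?```\s*$", re.DOTALL)


def strip_fences(text: str) -> str:
    """Return the text without a surrounding markdown code fence, if present."""
    match = _FENCE_RE.match(text)
    return match.group(1) if match else text.strip()


def validate_stage_code(source: str, expected_function: str) -> list[str]:
    """Return a list of problems; an empty list means both checks passed."""
    try:
        tree = ast.parse(source)
    except SyntaxError as exc:
        # Truncated or malformed output from the Coder ends up here.
        return [f"code does not parse: {exc.msg} (line {exc.lineno})"]
    defined = {n.name for n in ast.walk(tree) if isinstance(n, ast.FunctionDef)}
    if expected_function not in defined:
        return [f"expected function {expected_function!r} not found"]
    return []


raw = "```python\ndef total_completed(orders):\n    return 0\n```"
problems = validate_stage_code(strip_fences(raw), "total_completed")
if problems:
    raise SystemExit("; ".join(problems))  # fail fast instead of passing garbage on
```

Running these checks means executing the tasks one stage at a time rather than in a single `kickoff()`, so treat this as the validation logic only, not the orchestration around it.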
When Multi-Agent Beats a Single Agent (And When It Doesn't)
Multi-agent pipelines aren't free. Every additional agent is another model call, another chance for drift, and another thing to debug when the output looks wrong. They're worth the overhead when the task genuinely benefits from independent perspectives or a structured handoff, and they're overkill when a single well-prompted call would do.
| | Use multi-agent | Stick with a single prompt |
|---|---|---|
| Task complexity | Multi-step work where each step has a genuinely different skill set (write, critique, test, document) | Self-contained tasks that fit comfortably in one response, like generating a regex or a SQL query |
| Need for independent review | You want a check that isn't biased by the same context that produced the original output | You trust the model's first pass and just need it fast |
| Auditability | You need a record of what was written, what was flagged, and what was tested, for compliance or learning | You don't need an audit trail, just the final artifact |
| Latency and cost tolerance | You're running this in CI or batch jobs where a few extra seconds and tokens are acceptable | You need a near-instant response, like inline editor completions |
| Team or enterprise | Onboarding non-experts to a workflow that mirrors how your engineering team already reviews code | A solo developer prototyping quickly who can review their own output just as fast as an agent could |
What to Build Next: Add a Security Agent and CI
Once the 3-agent crew is running reliably, the natural next step is to add a fourth agent focused purely on security: SQL injection, unsanitized input, secrets in code, unsafe deserialization, that kind of thing. Give it a narrow, security-specific backstory rather than folding it into the general Reviewer's job description; specialized agents consistently outperform generalists asked to wear multiple hats in one pass.
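As a starting point, the security agent's definition might look like the sketch below. The role, goal, and backstory wording is illustrative, not a tested prompt. It's kept as plain dicts so it can be read and adjusted on its own, and the commented lines show how it would plug into the existing script, assuming the `llm` and `write_task` objects from `crew.py`.

```python
# Illustrative wording only; tune it against your own codebase.
SECURITY_AGENT = {
    "role": "Application Security Engineer",
    "goal": "Find security vulnerabilities in Python code",
    "backstory": (
        "You review code only for security problems: SQL injection, "
        "unsanitized input, secrets in code, and unsafe deserialization. "
        "You ignore style and naming. If you find nothing, say so briefly, "
        "don't invent problems."
    ),
}

SECURITY_TASK = {
    "description": (
        "Review the function from the first task for security issues only. "
        "Be specific about how each issue could be exploited."
    ),
    "expected_output": (
        "A bullet list of specific security issues, "
        "or a brief note that none were found."
    ),
}

# In crew.py (assumes the llm and write_task objects defined there):
# security_reviewer = Agent(**SECURITY_AGENT, llm=llm, verbose=True)
# security_task = Task(**SECURITY_TASK, agent=security_reviewer, context=[write_task])
```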
From there, wiring the crew into CI is mostly plumbing you already know. Trigger crew.py from a GitHub Actions workflow on pull request open, capture each task's .output.raw, and post the review and generated tests as a PR comment. The crew becomes a first-pass reviewer that runs before a human ever opens the diff, catching the obvious issues so your team's review time goes toward the things that actually need human judgment.
A smaller but high-leverage addition: persist every run's stage outputs (code, review, tests, and eventually security notes) to a database. Over time that log becomes a dataset you can use to spot which agent role tends to miss things, and to fine-tune your prompts (or a smaller model) against your team's actual code patterns rather than generic examples.
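A run log like that doesn't need much infrastructure. The sketch below uses SQLite from the standard library, with one row per stage per run; the table name, columns, and `run_id` format are assumptions made for this example.

```python
import sqlite3
from datetime import datetime, timezone


def init_db(conn: sqlite3.Connection) -> None:
    """Create the log table if it doesn't exist yet."""
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS stage_outputs (
            id INTEGER PRIMARY KEY AUTOINCREMENT,
            run_id TEXT NOT NULL,
            stage TEXT NOT NULL,
            output TEXT NOT NULL,
            created_at TEXT NOT NULL
        )
        """
    )


def save_run(conn: sqlite3.Connection, run_id: str, outputs: dict[str, str]) -> None:
    """Store one row per stage for a single crew run."""
    now = datetime.now(timezone.utc).isoformat()
    with conn:  # commits on success, rolls back on error
        conn.executemany(
            "INSERT INTO stage_outputs (run_id, stage, output, created_at) "
            "VALUES (?, ?, ?, ?)",
            [(run_id, stage, text, now) for stage, text in outputs.items()],
        )


conn = sqlite3.connect(":memory:")  # use a file path to keep the log between runs
init_db(conn)
save_run(
    conn,
    "run-001",
    {"code": "def f(): ...", "review": "- looks fine", "tests": "def test_f(): ..."},
)
print(conn.execute("SELECT run_id, stage FROM stage_outputs ORDER BY id").fetchall())
```

With the crew from this guide, the `outputs` dict would be filled from each task's `.output.raw`.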
Frequently Asked Questions
What is CrewAI and how is it different from LangChain?
CrewAI is a Python framework for orchestrating multiple LLM agents that collaborate on a task, each with a defined role, goal, and set of responsibilities. It's built on top of the same kinds of LLM integrations LangChain provides, but where LangChain is a general-purpose toolkit for building LLM applications (chains, retrieval, memory, agents, and more), CrewAI is purpose-built around the specific pattern of a 'crew' of role-based agents handing work off to one another in sequence or in parallel.
In practice, that narrower focus makes CrewAI faster to get started with for multi-agent pipelines specifically. You define agents and tasks, wire them into a crew, and run it. LangChain (and its graph-based sibling LangGraph) gives you more low-level control over the flow, at the cost of more boilerplate to get the same result.
Does a multi-agent crew cost more than a single prompt?
Yes, generally. A 3-agent crew makes at least three model calls per run (often more, since CrewAI's internal reasoning can add extra round trips), versus one call for a single prompt. For a small function, that might mean 3 to 6 times the token usage and a noticeably longer wall-clock time.
If your team is already tracking how token-based billing affects AI coding costs, treat a multi-agent crew the same way: profile it on a few real tasks before rolling it out broadly, since the per-run cost compounds quickly across a team.
| Setup | Typical model calls | Relative cost |
|---|---|---|
| Single prompt | 1 | 1x (baseline) |
| Single prompt + self-review | 2 | ~2x |
| 3-agent crew (Coder, Reviewer, Tester) | 3-5 | ~3-6x |
Whether that's worth it depends on what you're getting back: independent review, generated tests, and an audit trail are often worth far more than the marginal token cost, especially compared to the cost of a bug that ships to production.
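The arithmetic behind the table is simple enough to script when you profile your own runs. A rough sketch: the token count and price below are placeholder numbers for illustration only, not real pricing, and `estimate_run_cost` is a hypothetical helper.

```python
def estimate_run_cost(
    calls: int, tokens_per_call: int, price_per_million_tokens: float
) -> float:
    """Rough cost of one run: calls x tokens per call x price per token."""
    return calls * tokens_per_call * price_per_million_tokens / 1_000_000


# Placeholder numbers; substitute your provider's pricing and
# token counts measured from your own runs.
TOKENS_PER_CALL = 2_000
PRICE_PER_MILLION = 1.00

single_cost = estimate_run_cost(1, TOKENS_PER_CALL, PRICE_PER_MILLION)
crew_cost = estimate_run_cost(5, TOKENS_PER_CALL, PRICE_PER_MILLION)
print(f"single: ${single_cost:.4f}  crew: ${crew_cost:.4f}  ratio: {crew_cost / single_cost:.1f}x")
```

This model assumes every call uses the same number of tokens. In a real crew, later stages receive earlier outputs as context, so their prompts are longer, which is why the cost multiple can exceed the call count.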
Can each agent in a crew use a different LLM?
Yes. CrewAI's Agent constructor accepts its own llm parameter, so you can mix and match providers and model sizes. A common pattern is to give the Coder your strongest available model (since it's doing the creative, generative work), and give the Reviewer and Test Writer smaller, cheaper models, since their jobs are narrower and more mechanical.
```python
from crewai import Agent, LLM

strong_llm = LLM(model="gpt-4o", temperature=0.2)
fast_llm = LLM(model="gpt-4o-mini", temperature=0.3)

coder = Agent(
    role="Senior Python Developer",
    llm=strong_llm,
    # ... remaining arguments as before
)
reviewer = Agent(
    role="Staff Engineer Code Reviewer",
    llm=fast_llm,
    # ... remaining arguments as before
)
```

This mix-and-match approach is one of the easiest ways to bring the cost of a multi-agent pipeline down without sacrificing much quality, since the most demanding work (writing correct code from a spec) gets the best model, and the more mechanical work (listing issues, writing tests against known code) can run on something cheaper.
How do I debug a crew when the final output looks wrong?
Start by inspecting each task's output individually rather than just the final result. Every Task exposes its output after crew.kickoff() completes, so you can print or log the Coder's code, the Reviewer's notes, and the Test Writer's suite separately to see exactly where things went off the rails.
```python
result = crew.kickoff()

for task in [write_task, review_task, test_task]:
    print(f"--- {task.agent.role} ---")
    print(task.output.raw)
    print()
```

Almost always, a bad final result traces back to a bad intermediate one: truncated code, a Reviewer that hallucinated issues, or a Test Writer that misunderstood the spec because it was diluted somewhere along the chain. Fixing the pipeline means fixing the first stage where things went wrong, not patching the last one.
Can I run a multi-agent crew entirely with local models?
Yes. CrewAI supports local models through Ollama and other OpenAI-compatible local servers, so you can run an entire crew without sending code to a third-party API. This is worth considering if you're working with proprietary code that can't leave your machine, or if you simply want to avoid per-token costs while iterating on agent prompts.
```python
from crewai import LLM

local_llm = LLM(
    model="ollama/qwen2.5-coder:14b",
    base_url="http://localhost:11434",
)
```

The trade-off is quality: smaller local models tend to need more explicit, more constrained prompts to produce output as reliable as a frontier hosted model would on the first try. It's a reasonable setup for prototyping your agent definitions cheaply, then switching to a stronger hosted model for production runs.