ai tools · 10 min read

Multi-Agent AI Coding Workflow: Step-by-Step (2026)

Build a 3-agent AI coding workflow with CrewAI and Python. One agent writes, one reviews, one writes tests. Full code included.

Zeeshan Tofiq

June 7, 2026

Throw a single prompt at an AI coding assistant and you get a single perspective: one model, one pass, one set of blind spots. It writes the function, maybe adds a docstring, and calls it done. Nobody checks the edge cases. Nobody writes a test. Nobody asks whether the approach was right in the first place.

A multi-agent workflow fixes this by splitting the work the way a real engineering team would: one agent writes the code, a second reviews it with fresh eyes, and a third writes tests against what actually shipped. Each agent has a narrow job, a clear handoff, and no memory of how hard the work was, so it isn't tempted to rubber-stamp its own output.

This guide builds that pipeline from scratch using CrewAI and Python. By the end you'll have a working 3-agent code review crew you can run on any function, plus the patterns to extend it with a security agent, a CI hook, or your own task definitions.

If you're already living inside an AI coding assistant day to day, it's worth comparing this orchestrated approach to how Claude Code's own subagent system handles parallel work. The mental model is the same: narrow roles, clear handoffs, no single agent doing everything.

Single-Prompt AI vs a Multi-Agent Workflow

When you ask a single model to "write a function and make sure it's good," you're asking it to be the author and the critic in the same breath. Models, like people, are bad at grading their own homework. The same context that produced the code also produced the justification for why the code is fine.

Splitting the work into separate agents with separate prompts (and often separate model calls) breaks that loop. The Reviewer agent never sees the Coder's reasoning, only its output, so it evaluates the code on its own merits. The Test Writer likewise sees only the function and the review notes, not the reasoning behind either, and writes tests that would catch real regressions.

This isn't a novelty. It mirrors how software teams already work: someone writes the PR, someone else reviews it, and (ideally) someone writes or extends the test suite. Multi-agent frameworks just give that structure to an LLM pipeline so it happens automatically, every time, without anyone forgetting to ask for a second opinion.

| Approach | What you get | What you miss |
| --- | --- | --- |
| Single prompt | Fast, one round trip, good for small snippets | No independent review, no tests, blind spots go unchecked |
| Single prompt + "review your code" | Slightly better self-correction on obvious issues | Same context bias; the model defends its own choices |
| Multi-agent (Coder → Reviewer → Tester) | Independent critique, generated tests, an audit trail of each stage | More tokens, more latency, more moving parts to debug |

Why CrewAI Maps to How You Already Think About Teams

There are several multi-agent frameworks (LangGraph, AutoGen, OpenAI's Swarm-style patterns), but CrewAI is the easiest on-ramp if you've ever managed or worked on a small team. Its core abstractions map almost one-to-one onto roles, responsibilities, and handoffs.

An Agent has a role, a goal, and a backstory. These aren't decoration: they materially change how the underlying model behaves, because they become part of every prompt that agent sees. A Task is a unit of work assigned to an agent, with an expected_output that tells the model what "done" looks like. A Crew wires agents and tasks together and runs them in order (or in parallel, if you configure it that way).

The part that matters most for a code review pipeline is context: a task can declare that it depends on the output of a previous task, and CrewAI automatically feeds that output into the new task's prompt. That's the handoff mechanism, the digital equivalent of "here's the PR, go review it."

Build a 3-Agent Code Review Pipeline

We're going to build a crew of three agents: a Coder who writes a function from a spec, a Reviewer who critiques that function for bugs, style, and edge cases, and a Test Writer who produces a pytest suite based on the final code and the review notes. Each step below adds one piece, and by the end you'll have a complete, runnable script.

1. Install CrewAI and Set Up Your Project
CrewAI ships as a regular pip package along with an optional tools package for things like web search and file I/O. We won't need the tools package for this pipeline, but it's worth installing now if you plan to extend the crew later.
The commands below use a standard venv setup. If you want a faster install and lockfile workflow, switching your Python setup to uv drops straight into this same project structure.
bash
```
mkdir agent-review-crew && cd agent-review-crew
python -m venv .venv
source .venv/bin/activate

pip install crewai crewai-tools python-dotenv
```
Create a .env file for your API key. CrewAI reads provider credentials the same way the underlying SDKs do, so if you're using OpenAI or Anthropic, the standard environment variable names work out of the box.
bash — .env
```
OPENAI_API_KEY=sk-your-key-here
# or, if you're using Anthropic models:
# ANTHROPIC_API_KEY=sk-ant-your-key-here
```
Warning
Add .env to your .gitignore before your first commit. API keys committed to a repo, even a private one, have a way of ending up in places you didn't intend.

2. Define the Coder Agent

Start a file called crew.py. The first agent's job is narrow: take a plain-language spec and produce working Python. Notice that its backstory explicitly tells it not to explain or apologize, just to write code. That constraint matters: without it, models tend to wrap code in paragraphs of hedging that pollute the next agent's input.

python — crew.py

```
import os
from dotenv import load_dotenv
from crewai import Agent, Task, Crew, Process, LLM

load_dotenv()

llm = LLM(model="gpt-4o-mini", temperature=0.2)

coder = Agent(
    role="Senior Python Developer",
    goal="Write clean, correct, well-typed Python functions from a spec",
    backstory=(
        "You are a senior backend engineer who values readability and "
        "correctness over cleverness. You write idiomatic Python with type "
        "hints and docstrings. You output only code, no explanations, no "
        "markdown fences, no commentary about how the code works."
    ),
    llm=llm,
    verbose=True,
)
```

Lower temperature (around 0.2) keeps the Coder's output consistent and reduces the chance of syntactically inventive but wrong code. You can raise it later if you want more varied first drafts to review.

3. Define the Reviewer and Test Writer Agents

The Reviewer's backstory is doing the heavy lifting here. It's deliberately adversarial: its job is to find problems, not to be polite about the Coder's work. The Test Writer, in turn, is told to write tests that would actually fail if the code were subtly wrong, not tests that just exercise the happy path.

python — crew.py (continued)

```
reviewer = Agent(
    role="Staff Engineer Code Reviewer",
    goal="Find real bugs, missing edge cases, and style issues in Python code",
    backstory=(
        "You are a staff engineer doing a thorough code review. You are not "
        "here to be nice, you are here to catch what the author missed: "
        "off-by-one errors, unhandled exceptions, missing input validation, "
        "type mismatches, and unclear naming. List concrete issues with line "
        "references where possible. If the code is genuinely solid, say so "
        "briefly and move on, don't invent problems."
    ),
    llm=llm,
    verbose=True,
)

test_writer = Agent(
    role="QA Engineer",
    goal="Write a pytest suite that would catch real regressions in the given function",
    backstory=(
        "You write pytest test suites that go beyond the happy path. You "
        "cover edge cases, invalid inputs, boundary values, and any issues "
        "raised in a code review. You output only valid Python test code "
        "using pytest conventions, no explanations."
    ),
    llm=llm,
    verbose=True,
)
```

4. Chain the Tasks With Context

This is where the pipeline actually becomes a pipeline. Each Task declares an agent to run it and, critically, a context list of prior tasks whose output should be fed in as additional input. CrewAI handles the plumbing: the Reviewer automatically receives the Coder's function, and the Test Writer automatically receives both the function and the review.

python — crew.py (continued)

```
write_task = Task(
    description=(
        "Write a Python function that takes a list of order dictionaries "
        "(each with 'id', 'amount', and 'status' keys) and returns the total "
        "amount of all orders with status 'completed'. Handle missing keys "
        "and non-numeric amounts gracefully. Include type hints and a docstring."
    ),
    expected_output="A complete, runnable Python function with type hints and a docstring.",
    agent=coder,
)

review_task = Task(
    description=(
        "Review the Python function produced in the previous task. Identify "
        "concrete bugs, missing edge cases, and style issues. Be specific "
        "about what could go wrong and why."
    ),
    expected_output="A bullet list of specific issues, or a brief note that the code is solid.",
    agent=reviewer,
    context=[write_task],
)

test_task = Task(
    description=(
        "Write a pytest test suite for the function from the first task, "
        "taking into account the issues raised in the review. Cover the "
        "happy path, missing keys, non-numeric amounts, empty lists, and "
        "any edge cases the reviewer flagged."
    ),
    expected_output="A complete pytest file with multiple test functions, ready to run with pytest.",
    agent=test_writer,
    context=[write_task, review_task],
)
```

Note that test_task lists both write_task and review_task in its context. That's the whole trick: the final agent sees the original code and the critique of that code, so its tests target the things most likely to be wrong, not just the things that are obviously right.

5. Assemble the Crew and Run It
With the agents and tasks defined, wiring up the crew is a few lines. Process.sequential runs tasks in the order you listed them, which is exactly what we want for a write-review-test pipeline where each stage depends on the last.
python — crew.py (continued)
```
crew = Crew(
    agents=[coder, reviewer, test_writer],
    tasks=[write_task, review_task, test_task],
    process=Process.sequential,
    verbose=True,
)

if __name__ == "__main__":
    result = crew.kickoff()
    print("\n\n=== FINAL OUTPUT ===\n")
    print(result)
```
bash
```
python crew.py
```
Run it and you'll see verbose logs for each agent as it works through its task, followed by the final output (the test suite, since that's the last task in the chain). If you want to inspect every stage rather than just the last one, each Task object exposes its own .output after kickoff() completes.
python
```
print("--- CODE ---")
print(write_task.output.raw)

print("\n--- REVIEW ---")
print(review_task.output.raw)

print("\n--- TESTS ---")
print(test_task.output.raw)
```
Info
Logging every stage's output to a file (or a database) is worth doing from day one. When the pipeline produces something odd, having the intermediate review and the original code side by side is the fastest way to figure out which agent introduced the problem.

Common Failure Modes (And How to Recover)

Multi-agent pipelines fail differently than single-prompt calls. The good news is that the failure modes are predictable once you've seen them, and most have a cheap fix. Here are the five you'll hit most often.

The Reviewer rubber-stamps everything. If your Reviewer keeps saying "looks good, no issues," its backstory probably isn't adversarial enough. Make the goal explicitly about finding problems. A very low temperature can also push it toward the safest, most agreeable answer, so consider raising its temperature slightly above the Coder's.
Output bleeds between agents. You'll sometimes see the Test Writer include explanatory prose, markdown fences, or even copy the Coder's docstring verbatim into a test file. Tighten the expected_output description ("a complete pytest file, no markdown fences, no explanations") and, if it persists, add an output parser step that strips fences before the result reaches the next task.
The chain drifts off-spec. Each handoff is a chance for the original requirement to get diluted. If task three is solving a slightly different problem than task one described, add the original spec back into later tasks' description fields rather than relying solely on context to carry it forward.
Costs and latency add up fast. Three agents means at minimum three model calls per run, often more with CrewAI's internal reasoning steps. For iteration, use a cheaper or smaller model (or a local model via Ollama) and reserve your strongest model for the final production run.
One bad stage poisons the rest. Because each task's input includes the previous task's raw output, a malformed or truncated response from the Coder can cascade into a useless review and useless tests. Add lightweight validation between stages (does the code parse? does it define the expected function name?) and fail fast rather than letting garbage propagate.
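
The validation check from the last item can be sketched with the standard library alone. This is one possible shape, not a CrewAI feature: `validate_stage_code` is a hypothetical helper, and the function name you check for is an assumption, since the spec above does not name the function (you would have to require a name in the task description). How you hook it in between stages depends on your setup; the sketch only covers the check itself.

```python
import ast


def validate_stage_code(source: str, expected_function: str) -> list[str]:
    """Return a list of problems; an empty list means the code looks usable."""
    try:
        tree = ast.parse(source)
    except SyntaxError as exc:
        # Truncated or fenced output usually fails right here.
        return [f"code does not parse: {exc.msg} (line {exc.lineno})"]

    # Collect every function defined anywhere in the module.
    defined = {
        node.name
        for node in ast.walk(tree)
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef))
    }
    if expected_function not in defined:
        return [f"expected function {expected_function!r} not found"]
    return []


# Hypothetical usage, assuming the spec asked for this function name:
# problems = validate_stage_code(write_task.output.raw, "total_completed_orders")
# if problems:
#     raise ValueError(f"Coder output rejected: {problems}")
```

Parsing does not prove the code is correct, only that it is code, which is enough to stop a truncated response before the Reviewer and Test Writer spend tokens on it.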

When Multi-Agent Beats a Single Agent (And When It Doesn't)

Multi-agent pipelines aren't free. Every additional agent is another model call, another chance for drift, and another thing to debug when the output looks wrong. They're worth the overhead when the task genuinely benefits from independent perspectives or a structured handoff, and they're overkill when a single well-prompted call would do.

| Factor | Use multi-agent | Stick with a single prompt |
| --- | --- | --- |
| Task complexity | Multi-step work where each step has a genuinely different skill set (write, critique, test, document) | Self-contained tasks that fit comfortably in one response, like generating a regex or a SQL query |
| Need for independent review | You want a check that isn't biased by the same context that produced the original output | You trust the model's first pass and just need it fast |
| Auditability | You need a record of what was written, what was flagged, and what was tested, for compliance or learning | You don't need an audit trail, just the final artifact |
| Latency and cost tolerance | You're running this in CI or batch jobs where a few extra seconds and tokens are acceptable | You need a near-instant response, like inline editor completions |
| Team or enterprise | Onboarding non-experts to a workflow that mirrors how your engineering team already reviews code | A solo developer prototyping quickly who can review their own output just as fast as an agent could |

What to Build Next: Add a Security Agent and CI

Once the 3-agent crew is running reliably, the natural next step is to add a fourth agent focused purely on security: SQL injection, unsanitized input, secrets in code, unsafe deserialization, that kind of thing. Give it a narrow, security-specific backstory rather than folding it into the general Reviewer's job description; a narrowly scoped agent tends to catch more than a generalist asked to wear multiple hats in one pass.
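
One way the fourth agent might look, reusing only the `Agent` and `Task` arguments already shown in this guide. The role name, backstory wording, and the `build_security_stage` helper are illustrative assumptions, not a recommended or official configuration.

```python
SECURITY_AGENT_CONFIG = {
    "role": "Application Security Reviewer",
    "goal": "Find security vulnerabilities in Python code",
    "backstory": (
        "You review code only for security problems: SQL injection, "
        "unsanitized input, secrets in code, and unsafe deserialization. "
        "You ignore style and naming. List each finding with the risky "
        "line and a concrete fix. If you find nothing, say so briefly."
    ),
}

SECURITY_TASK_DESCRIPTION = (
    "Audit the function from the first task for security issues only. "
    "Do not repeat general review comments."
)


def build_security_stage(llm, write_task):
    """Create the security agent and its task, mirroring the three above."""
    # Imported here so the config above can be inspected without crewai installed.
    from crewai import Agent, Task

    agent = Agent(llm=llm, verbose=True, **SECURITY_AGENT_CONFIG)
    task = Task(
        description=SECURITY_TASK_DESCRIPTION,
        expected_output="A bullet list of security findings, or a brief note that none were found.",
        agent=agent,
        context=[write_task],
    )
    return agent, task
```

Calling `build_security_stage(llm, write_task)` returns an agent and a task that you would append to the `agents` and `tasks` lists passed to `Crew`.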

From there, wiring the crew into CI is mostly plumbing you already know. Trigger crew.py from a GitHub Actions workflow on pull request open, capture each task's .output.raw, and post the review and generated tests as a PR comment. The crew becomes a first-pass reviewer that runs before a human ever opens the diff, catching the obvious issues so your team's review time goes toward the things that actually need human judgment.

A smaller but high-leverage addition: persist every run's stage outputs (code, review, tests, and eventually security notes) to a database. Over time that log becomes a dataset you can use to spot which agent role tends to miss things, and to fine-tune your prompts (or a smaller model) against your team's actual code patterns rather than generic examples.
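
A minimal sketch of that persistence using the standard library's `sqlite3`. The `stage_outputs` table, its columns, and the `save_run` helper are assumptions made for illustration; any database and schema would do.

```python
import sqlite3


def save_run(conn: sqlite3.Connection, stages: dict[str, str]) -> int:
    """Store one run's stage outputs and return the new run id."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS stage_outputs ("
        "run_id INTEGER, stage TEXT, output TEXT, "
        "created_at TEXT DEFAULT CURRENT_TIMESTAMP)"
    )
    # Next run id is one past the highest stored so far (1 for an empty table).
    row = conn.execute(
        "SELECT COALESCE(MAX(run_id), 0) + 1 FROM stage_outputs"
    ).fetchone()
    run_id = row[0]
    conn.executemany(
        "INSERT INTO stage_outputs (run_id, stage, output) VALUES (?, ?, ?)",
        [(run_id, stage, output) for stage, output in stages.items()],
    )
    conn.commit()
    return run_id


# Hypothetical usage after a run:
# conn = sqlite3.connect("crew_runs.db")
# save_run(conn, {"code": write_task.output.raw, "review": review_task.output.raw,
#                 "tests": test_task.output.raw})
```

One row per stage, rather than one row per run, means a later security stage needs no schema change.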

Frequently Asked Questions

What is CrewAI and how is it different from LangChain?

CrewAI is a Python framework for orchestrating multiple LLM agents that collaborate on a task, each with a defined role, goal, and set of responsibilities. It's built on top of the same kinds of LLM integrations LangChain provides, but where LangChain is a general-purpose toolkit for building LLM applications (chains, retrieval, memory, agents, and more), CrewAI is purpose-built around the specific pattern of a 'crew' of role-based agents handing work off to one another in sequence or in parallel.

In practice, that narrower focus makes CrewAI faster to get started with for multi-agent pipelines specifically. You define agents and tasks, wire them into a crew, and run it. LangChain (and its graph-based sibling LangGraph) gives you more low-level control over the flow, at the cost of more boilerplate to get the same result.

Does a multi-agent crew cost more than a single prompt?

Yes, generally. A 3-agent crew makes at least three model calls per run (often more, since CrewAI's internal reasoning can add extra round trips), versus one call for a single prompt. For a small function, that might mean 3 to 6 times the token usage and a noticeably longer wall-clock time.

If your team is already tracking how token-based billing affects AI coding costs, treat a multi-agent crew the same way: profile it on a few real tasks before rolling it out broadly, since the per-run cost compounds quickly across a team.

| Setup | Typical model calls | Relative cost |
| --- | --- | --- |
| Single prompt | 1 | 1x (baseline) |
| Single prompt + self-review | 2 | ~2x |
| 3-agent crew (Coder, Reviewer, Tester) | 3-5 | ~3-6x |

Whether that's worth it depends on what you're getting back: independent review, generated tests, and an audit trail are often worth far more than the marginal token cost, especially compared to the cost of a bug that ships to production.
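
To profile a crew on your own tasks, the arithmetic is simple enough to sketch. Every number below is a placeholder chosen for illustration, not a measured token count or a real price; substitute figures from your provider's usage reports.

```python
def estimate_run_cost(
    calls: list[tuple[int, int]], input_price: float, output_price: float
) -> float:
    """Sum the cost of model calls given (input_tokens, output_tokens) pairs.

    Prices are per one million tokens.
    """
    total = 0.0
    for input_tokens, output_tokens in calls:
        total += input_tokens / 1_000_000 * input_price
        total += output_tokens / 1_000_000 * output_price
    return total


# Placeholder numbers for illustration only.
INPUT_PRICE, OUTPUT_PRICE = 1.0, 4.0
single_prompt = [(500, 400)]
# Later stages carry earlier outputs as context, so their inputs grow.
three_agent_crew = [(500, 400), (1000, 300), (1400, 600)]

baseline = estimate_run_cost(single_prompt, INPUT_PRICE, OUTPUT_PRICE)
crew_cost = estimate_run_cost(three_agent_crew, INPUT_PRICE, OUTPUT_PRICE)
print(f"crew costs about {crew_cost / baseline:.1f}x the single prompt")
```

With these placeholder inputs the ratio comes out above 3x even with exactly three calls, because the Reviewer and Test Writer prompts include the earlier stages' output.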

Can each agent in a crew use a different LLM?

Yes. CrewAI's Agent constructor accepts its own llm parameter, so you can mix and match providers and model sizes. A common pattern is to give the Coder your strongest available model (since it's doing the creative, generative work), and give the Reviewer and Test Writer smaller, cheaper models, since their jobs are narrower and more mechanical.

python

```
from crewai import Agent, LLM

strong_llm = LLM(model="gpt-4o", temperature=0.2)
fast_llm = LLM(model="gpt-4o-mini", temperature=0.3)

# goal= and backstory= are omitted here for brevity; reuse the ones defined earlier.
coder = Agent(role="Senior Python Developer", llm=strong_llm)
reviewer = Agent(role="Staff Engineer Code Reviewer", llm=fast_llm)
```

This mix-and-match approach is one of the easiest ways to bring the cost of a multi-agent pipeline down without sacrificing much quality, since the most demanding work (writing correct code from a spec) gets the best model, and the more mechanical work (listing issues, writing tests against known code) can run on something cheaper.

How do I debug a crew when the final output looks wrong?

Start by inspecting each task's output individually rather than just the final result. Every Task exposes its output after crew.kickoff() completes, so you can print or log the Coder's code, the Reviewer's notes, and the Test Writer's suite separately to see exactly where things went off the rails.

python

```
result = crew.kickoff()

for task in [write_task, review_task, test_task]:
    print(f"--- {task.agent.role} ---")
    print(task.output.raw)
    print()
```

Almost always, a bad final result traces back to a bad intermediate one: truncated code, a Reviewer that hallucinated issues, or a Test Writer that misunderstood the spec because it was diluted somewhere along the chain. Fixing the pipeline means fixing the first stage where things went wrong, not patching the last one.

What if an agent's output leaks into the wrong stage of the pipeline?

This usually shows up as the Test Writer including explanatory prose meant for a human, or the Reviewer copying chunks of the original code into its critique verbatim instead of referencing it. The fix starts with tightening each task's expected_output to be explicit about format: 'output only valid Python, no markdown fences, no explanations' is a stronger constraint than 'write tests for this function.'

If tightening the prompt doesn't fully solve it, add a small parsing step between stages that strips known noise (markdown code fences, leading 'Here is the code:' phrases, and so on) before the output is handed to the next task. CrewAI lets you intercept and transform task outputs, so this can live entirely in your pipeline code without touching the agent definitions.
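
The parsing step itself can be a small pure function. The sketch below is one possible version: `strip_noise` is a hypothetical name, and it only handles the fenced-block case (when a fenced block is present, any surrounding prose is dropped along with the fences). How you attach it to a task depends on your CrewAI version, so that wiring is not shown.

```python
import re

FENCE = "`" * 3  # built at runtime so this example contains no literal fence

# An opening fence with an optional language tag, the body, then a closing fence.
_FENCED_BLOCK = re.compile(
    rf"{FENCE}[\w+-]*[ \t]*\n(.*?)\n?{FENCE}",
    re.DOTALL,
)


def strip_noise(raw: str) -> str:
    """Return the code inside the first fenced block, or the trimmed text if there is none."""
    match = _FENCED_BLOCK.search(raw)
    if match:
        return match.group(1).strip("\n")
    return raw.strip()
```

Unfenced output with a leading phrase such as 'Here is the code:' passes through unchanged apart from trimming, so pair this with a parse check if that case shows up in your logs.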

Can I run a multi-agent crew entirely with local models?

Yes. CrewAI supports local models through Ollama and other OpenAI-compatible local servers, so you can run an entire crew without sending code to a third-party API. This is worth considering if you're working with proprietary code that can't leave your machine, or if you simply want to avoid per-token costs while iterating on agent prompts.

python

```
from crewai import LLM

local_llm = LLM(
    model="ollama/qwen2.5-coder:14b",
    base_url="http://localhost:11434",
)
```

The trade-off is quality: smaller local models tend to need more explicit, more constrained prompts to produce output as reliable as a frontier hosted model would on the first try. It's a reasonable setup for prototyping your agent definitions cheaply, then switching to a stronger hosted model for production runs.
