AI Orchestration Patterns
Supervisor, Orchestrator, and the MCP Revolution
Introduction: The Modern Architect's Problem
Hey, here we are again. Look, if you're reading this, it's because you've already realized something important: integrating AI into applications isn't just calling an API and calling it a day. It's like a building architect thinking they can build a skyscraper just because they know how to lay bricks. No way: you need to understand the structure, the foundations, and how the different systems communicate.
In this chapter we're going deep into architectural patterns for orchestrating AI agents. We'll see how they work with and without MCP (Model Context Protocol), and most importantly: you'll understand when to use each one. Because there's no magic solution, there's context.
"The difference between a junior and senior developer isn't that the senior knows more patterns, it's that they know which one to use in each situation."
The N×M Problem: Why We Need Orchestration
Before we dive into patterns, let me explain the problem we're solving. Imagine you have an application that needs to connect to different tools: databases, external APIs, file systems, third-party services. Now imagine you want to use different AI models: Claude, GPT-4, Gemini, local models.
Without a standard, you have an N×M problem. If you have 10 tools and 5 models, you need to write 50 different integrations. Each model has its own function-calling format, and each tool needs its own specific adapter. It's a mess.
Orchestration solves this. Instead of point-to-point connections, we establish a central pattern that coordinates everything. It's like the foreman on a construction site: they don't lay bricks themselves, but they know who has to lay them, when, and in what order.
The Four Fundamental Orchestration Patterns
Look, before talking about frameworks and protocols, you need to understand the base patterns. They're like the foundation of a building: no matter what technology you use on top, these patterns will be present.
1. The Supervisor Pattern (Hub-and-Spoke)
This is the most intuitive pattern and probably the one you'll use most. It works exactly as it sounds: you have a central agent (the Supervisor) that coordinates multiple specialized agents (the Workers).
Construction analogy: Think of the chief architect on a construction site. They don't run the electrical cables, don't weld the pipes, don't install the windows. But they know exactly what each specialist needs to do, in what order, and how to combine everything so the building works. The electrician reports to them, the plumber reports to them, and they make the coordination decisions.
 ┌─────────────────────────┐
 │        SUPERVISOR       │
 │  (receives, delegates,  │
 │      synthesizes)       │
 └───┬────────┬────────┬───┘
     │        │        │
┌────┴───┐┌───┴───┐┌───┴────┐
│ Worker ││Worker ││ Worker │
│Research││Writer ││Analyst │
└────────┘└───────┘└────────┘
When to use it:
- Tasks with clear decomposition ("research", "write", "review")
- When you need a centralized audit trail
- Sequential flows or with simple conditions
Main limitation: The supervisor becomes a bottleneck if you have many concurrent agents. It's the classic "single point of failure" problem.
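To make the shape concrete, here's a minimal sketch in TypeScript. The callModel function, the worker names, and the prompts are placeholders I'm inventing for illustration, not any particular SDK; the point is the structure: the supervisor picks who works next, collects results, and synthesizes at the end:

// Minimal Supervisor sketch. callModel stands in for whatever LLM client
// you use (Anthropic, OpenAI, a local model); everything else is plain TS.
type CallModel = (prompt: string) => Promise<string>;

interface Worker {
  name: string;          // e.g. "researcher", "writer", "analyst"
  description: string;   // what the supervisor should delegate to it
  run: (task: string) => Promise<string>;
}

async function supervise(
  callModel: CallModel,
  workers: Worker[],
  request: string,
): Promise<string> {
  const results: string[] = [];

  // The supervisor loop: pick a worker (or finish), delegate, collect.
  for (let step = 0; step < 10; step++) {
    const decision = await callModel(
      `Request: ${request}\n` +
      `Workers: ${workers.map(w => `${w.name} (${w.description})`).join('; ')}\n` +
      `Results so far: ${results.join(' | ') || 'none'}\n` +
      `Reply with the name of the next worker to delegate to, or DONE.`,
    );
    if (decision.trim() === 'DONE') break;

    const worker = workers.find(w => decision.includes(w.name));
    if (!worker) break; // the supervisor produced something unusable: stop
    results.push(`${worker.name}: ${await worker.run(request)}`);
  }

  // Synthesis step: the supervisor combines the workers' outputs.
  return callModel(`Combine these into one answer for "${request}":\n${results.join('\n')}`);
}

The loop cap and the DONE sentinel are arbitrary choices; what matters is that every decision and every result passes through one place, which is what gives you the audit trail and, at scale, the bottleneck.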
2. The Hierarchical Pattern (Multi-level)
This is basically the Supervisor pattern but at multiple levels. You have a main supervisor who coordinates team supervisors, and each team supervisor coordinates their own workers.
Construction analogy: It's like a large construction company. The CEO doesn't talk directly to the electrician. The CEO talks to the project manager, the project manager talks to the site foreman, and the foreman coordinates the different teams. Each level has its scope of responsibility.
           ┌─────────────────┐
           │  ORCHESTRATOR   │
           └────────┬────────┘
      ┌─────────────┼─────────────┐
┌─────┴─────┐ ┌─────┴─────┐ ┌─────┴─────┐
│ Team Lead │ │ Team Lead │ │ Team Lead │
│ Research  │ │  Writing  │ │  Review   │
└─────┬─────┘ └─────┬─────┘ └─────┬─────┘
   ┌──┴──┐       ┌──┴──┐       ┌──┴──┐
   │W│ │W│       │W│ │W│       │W│ │W│
   └─┘ └─┘       └─┘ └─┘       └─┘ └─┘
Golden rule: Never more than 2-3 levels. Each extra level adds latency and complexity without proportional benefit. If you need more than 3 levels, you're probably solving the wrong problem.
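If you want to keep yourself honest about that rule, you can encode it directly. Here's a small sketch (the type and function names are mine, invented for illustration) that models the hierarchy as a recursive tree and refuses to go deeper than three levels:

// Hierarchy as a recursive tree: a node either does work itself
// or delegates to children. The depth guard encodes the 2-3 level rule.
type AgentNode =
  | { kind: 'worker'; name: string; run: (task: string) => Promise<string> }
  | { kind: 'supervisor'; name: string; children: AgentNode[] };

async function runNode(node: AgentNode, task: string, depth = 1): Promise<string> {
  if (depth > 3) throw new Error(`Hierarchy too deep at "${node.name}" (max 3 levels)`);
  if (node.kind === 'worker') return node.run(task);

  // Naive delegation: fan the task out to every child and concatenate.
  // A real team lead would route selectively instead of broadcasting.
  const parts = await Promise.all(node.children.map(c => runNode(c, task, depth + 1)));
  return parts.join('\n');
}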
3. The ReAct Pattern (Reasoning + Acting)
This pattern is different because it's iterative by nature. The agent alternates between thinking (reasoning) and acting in a continuous loop until completing the task.
The cycle is simple:
- Thought: "I need to search for information about X"
- Action: Executes the search tool
- Observation: Receives the search result
- Repeat: Returns to Thought with the new information
Construction analogy: It's like an architect designing in real-time while visiting the site. They look at the ground, think "I need to know if it's clay", run a test, see the result, think "then I need deeper foundations", do the calculations, and so on.
Critical trade-off: ReAct is very flexible and handles emergent cases well, but each iteration accumulates context. In a long task, you can end up with thousands of tokens just from history. This impacts both cost and quality (the model can "forget" things from the beginning).
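Here's a minimal sketch of the loop in TypeScript, with callModel standing in for whatever LLM client you use and the tools reduced to simple string functions. Notice how the history string only ever grows; that's exactly the context accumulation the trade-off above is about:

// Minimal ReAct loop. callModel is a placeholder for your LLM client;
// the tools are toy examples. The key detail is the ever-growing history.
type CallModel = (prompt: string) => Promise<string>;
type Tool = (input: string) => Promise<string>;

async function react(
  callModel: CallModel,
  tools: Record<string, Tool>,
  task: string,
  maxSteps = 8,
): Promise<string> {
  let history = `Task: ${task}\n`;

  for (let i = 0; i < maxSteps; i++) {
    // Thought + Action: ask the model what to do next, given everything so far.
    const step = await callModel(
      `${history}\nRespond as either "ACT <tool> <input>" or "FINISH <answer>".\n` +
      `Tools: ${Object.keys(tools).join(', ')}`,
    );

    if (step.startsWith('FINISH')) return step.slice('FINISH'.length).trim();

    const [, toolName, ...rest] = step.split(' ');
    const tool = toolName ? tools[toolName] : undefined;
    if (!tool) { history += `\nObservation: unknown tool "${toolName}"`; continue; }

    // Observation: the tool result goes back into the context for the next turn.
    const observation = await tool(rest.join(' '));
    history += `\nAction: ${toolName}\nObservation: ${observation}`;
  }
  return 'Gave up after too many iterations';
}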
4. The Plan-and-Execute Pattern
This is my favorite for complex tasks. The idea is simple but powerful: you completely separate planning from execution.
First, a Planner (a large, capable model) generates a complete plan with all the steps. Then, Executors (which can be smaller, cheaper models) execute each step without needing to reason about the complete plan.
┌───────────────────────────────┐
│            PLANNER            │
│     (Claude Opus / GPT-4)     │
│      Input: Complex task      │
│    Output: [Step1...Step5]    │
└──────────────┬────────────────┘
               │
               ▼
┌───────────────────────────────┐
│           EXECUTORS           │
│   (Haiku / GPT-3.5 - cheap)   │
│       Step1 ──► Result1       │
│       Step2 ──► Result2       │
│       Step3 ──► Result3       │
└───────────────────────────────┘
Construction analogy: It's exactly how real construction works. The architect (Planner) designs the complete blueprints. Then, different teams (Executors) execute each part of the blueprint without needing to understand the complete design. The electrician doesn't need to know why the kitchen is where it is, they just need to know where the outlets go.
Security benefit: Once the plan is created, tool data cannot modify it. This protects you against prompt injection: if a tool returns malicious content, that content cannot inject new actions into the plan.
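A sketch of the split, assuming two placeholder call functions (planner for the big model, executor for the cheap one) and a deliberately naive numbered-list plan format. The property that matters is that the plan is frozen before any step runs, which is where the injection protection comes from:

// Plan-and-Execute sketch: a big model plans once, cheap models execute.
// planner and executor are placeholders for your actual LLM clients.
type CallModel = (prompt: string) => Promise<string>;

async function planAndExecute(
  planner: CallModel,    // e.g. a large, capable model
  executor: CallModel,   // e.g. a smaller, cheaper model
  task: string,
): Promise<string[]> {
  // 1. Planning: produce the complete list of steps up front.
  const planText = await planner(
    `Break this task into short, numbered, self-contained steps:\n${task}`,
  );
  const steps = planText
    .split('\n')
    .map(line => line.replace(/^\d+[.)]\s*/, '').trim())
    .filter(Boolean);

  // 2. Execution: each step runs in isolation; step results cannot rewrite the plan.
  const results: string[] = [];
  for (const step of steps) {
    results.push(await executor(`Carry out this step and report the result:\n${step}`));
  }
  return results;
}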
MCP: The Paradigm Shift
Now let's talk about the Model Context Protocol (MCP). Anthropic launched it, and it's changing how we think about integrating tools with AI.
The problem MCP solves is exactly the N×M I mentioned before. Instead of each application having to implement custom integrations for each tool, MCP establishes a standard protocol. It's like USB for AI: before you needed a different cable for each device, now everything uses the same connector.
Architecture: Host, Client, Server
MCP has three main components you need to understand:
- Host: The application that orchestrates everything: Claude Desktop, your IDE with an AI extension, or your custom application. The Host handles security and the lifecycle of its clients.
- Client: Establishes 1:1 connections with servers and handles the protocol mechanics. Think of it as the "driver" that translates between Host and Server.
- Server: Exposes capabilities through standard primitives. A GitHub server exposes tools for creating PRs, a PostgreSQL server exposes queries, and so on.
The Four MCP Primitives
MCP defines four types of capabilities a server can expose:
| Primitive | Controlled by | Typical use |
|---|---|---|
| Tools | Model | Actions requiring consent: API calls, modify data, execute commands |
| Resources | Application | Data for context (like GET endpoints). Files, documents, static data |
| Prompts | User | Predefined templates for guided workflows. Like slash commands |
| Sampling | Server → Client | Allows server to request completions from model. Recursive reasoning |
Traditional Integration vs MCP: The Practical Example
Let's see real code so you understand the difference.
Without MCP (Traditional integration with function calling):
// You have to manually define the schema for each tool
const tools = [
  {
    name: 'get_weather',
    description: 'Get weather for a location',
    input_schema: {
      type: 'object',
      properties: {
        location: { type: 'string', description: 'City name' },
      },
      required: ['location'],
    },
  },
];

// And then you have to handle execution manually
if (toolCall.name === 'get_weather') {
  const result = await weatherAPI.get(toolCall.input.location);
  // ... handle result
}
With MCP (using FastMCP):
# The server defines the tool, MCP generates the schema automatically
from fastmcp import FastMCP

mcp = FastMCP('WeatherServer')

@mcp.tool()
def get_weather(location: str) -> dict:
    """Get weather for a location"""
    return {'temperature': 22, 'conditions': 'sunny'}

# Any MCP-compatible host can discover and use this tool
# JSON schema is automatically generated from type hints
See the difference? With MCP, the client can call tools/list at runtime and discover what tools are available. You don't need to compile anything, you don't need to update your application when you add a new tool to the server.
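And here's roughly what that discovery looks like from the host side, using the official TypeScript MCP SDK. Treat this as a sketch: import paths and option shapes vary between SDK versions, and the server filename is just the FastMCP example above saved to disk:

// Sketch of runtime discovery with the TypeScript MCP SDK.
// Check the SDK docs for the exact imports in the version you install.
import { Client } from '@modelcontextprotocol/sdk/client/index.js';
import { StdioClientTransport } from '@modelcontextprotocol/sdk/client/stdio.js';

async function discoverAndCall() {
  // Launch the weather server from the Python example as a subprocess.
  const transport = new StdioClientTransport({
    command: 'python',
    args: ['weather_server.py'],   // hypothetical filename for the FastMCP server above
  });

  const client = new Client({ name: 'demo-host', version: '1.0.0' });
  await client.connect(transport);

  // tools/list at runtime: no hardcoded schemas in the application.
  const { tools } = await client.listTools();
  console.log(tools.map(t => t.name)); // e.g. ['get_weather']

  // Call a discovered tool by name with structured arguments.
  const result = await client.callTool({
    name: 'get_weather',
    arguments: { location: 'Madrid' },
  });
  console.log(result);
}

discoverAndCall().catch(console.error);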
MCP-Specific Orchestration Patterns
Now comes the juicy part. MCP not only simplifies tool integration, it also enables orchestration patterns that were previously very difficult to implement.
The Handoff Pattern: Specialists with Persistence
This pattern is great for chatbots and support systems. The idea is that you have specialized agents and the system "hands off" the user to the right specialist based on intent.
But here's the important part: once the user is talking to the billing specialist, they stay with billing until the topic changes. You're not re-classifying each message.
User: "I have a problem with my invoice"
│
▼
┌───────────────────┐
│ Intent Classifier │ ──► "billing"
└─────────┬─────────┘
│
▼ HANDOFF
┌───────────────────┐
│ Billing Agent │ ◄── PERSISTENT
│ (MCP: billing- │
│ server tools) │
└───────────────────┘
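In code, the persistence part is just a bit of session state. A sketch with invented names (classifyIntent, the specialist map), since the details depend entirely on your stack:

// Handoff sketch. classifyIntent and the specialist agents are placeholders;
// the point is that routing is sticky: we don't re-classify every message.
type Agent = { handle: (message: string) => Promise<string> };

interface Session {
  activeSpecialist?: string;   // e.g. 'billing', 'technical', 'sales'
}

async function handleMessage(
  session: Session,
  message: string,
  specialists: Record<string, Agent>,
  classifyIntent: (message: string) => Promise<string>,
): Promise<string> {
  // Only classify when nobody owns the conversation yet.
  // A real system would also detect topic changes and clear activeSpecialist.
  if (!session.activeSpecialist) {
    session.activeSpecialist = await classifyIntent(message); // the HANDOFF happens here
  }

  const agent = specialists[session.activeSpecialist];
  return agent.handle(message);
}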
The Magentic Pattern: Manager with Task Ledger
This is a more sophisticated pattern that combines the best of Supervisor and Plan-and-Execute. You have a Manager that maintains a "Task Ledger" - basically a record of all tasks, their status, dependencies, and results.
The Manager doesn't just delegate, it also does dynamic re-planning. If a task fails or returns unexpected results, it can adjust the plan without having to start from scratch.
┌───────────────────────────────┐
│         MANAGER AGENT         │
│ ┌───────────────────────────┐ │
│ │        TASK LEDGER        │ │
│ │ T1: Research  [✓]  Res:{} │ │
│ │ T2: Analyze   [⟳]  Dep:T1 │ │
│ │ T3: Write     [○]  Dep:T2 │ │
│ │ T4: Review    [○]  Dep:T3 │ │
│ └───────────────────────────┘ │
└───────────────────────────────┘
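Here's a sketch of what the ledger itself can look like as plain data (the field names are mine, just for illustration). The Manager reads and rewrites this record instead of keeping all that state buried in its prompt:

// Task Ledger sketch: the Manager's working memory as explicit data.
type TaskStatus = 'pending' | 'running' | 'done' | 'failed';

interface LedgerTask {
  id: string;
  description: string;
  dependsOn: string[];        // e.g. T2 depends on T1
  status: TaskStatus;
  result?: string;
}

class TaskLedger {
  constructor(private tasks: LedgerTask[]) {}

  // A task is ready when it's pending and all of its dependencies are done.
  ready(): LedgerTask[] {
    return this.tasks.filter(t =>
      t.status === 'pending' &&
      t.dependsOn.every(id => this.tasks.find(d => d.id === id)?.status === 'done'),
    );
  }

  complete(id: string, result: string) {
    const task = this.tasks.find(t => t.id === id);
    if (task) { task.status = 'done'; task.result = result; }
  }

  // Dynamic re-planning hook: on failure the Manager can replace the
  // remaining tasks without touching the ones already completed.
  replacePending(newTasks: LedgerTask[]) {
    this.tasks = this.tasks.filter(t => t.status === 'done').concat(newTasks);
  }
}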
The Code Execution Pattern: Anthropic's Recommendation
This pattern blew my mind when I saw it. Instead of the agent making many direct tool calls, the agent writes code that interacts with MCP servers.
Anthropic reported a 98.7% reduction in tokens in some cases (from 150,000 to 2,000 tokens). Why? Because instead of making 50 tool calls with all the context overhead, the agent writes a script that does the 50 operations and returns only the final result.
Concrete example:
# ❌ Traditional approach: many tool calls
tool_call: list_files('/docs')       # +500 tokens context
tool_call: read_file('file1.md')     # +500 tokens context
tool_call: read_file('file2.md')     # +500 tokens context
# ... 47 more calls ...
# Total: ~25,000 tokens overhead

# ✅ Code Execution approach: one script
execute_code('''
import mcp_client

files = mcp_client.list_files('/docs')
results = []
for f in files:
    content = mcp_client.read_file(f)
    if 'keyword' in content:
        results.append(summarize(content))
return results  # Only the final result
''')
# Total: ~500 tokens
The Decision Framework: MCP or Direct Integration
This is the million-dollar question, and the answer is: it depends. But I'm not going to leave you with just that; here's a concrete framework.
Choose MCP when...
- You're building tools that will be used by multiple applications. If your GitHub tools server will be used by Claude Desktop, your IDE, and your custom app, MCP saves you from maintaining three integrations.
- You need dynamic capability discovery. If the available tools can change at runtime (for example, a plugin marketplace), MCP gives you that for free.
- You want provider-agnostic interfaces. Today you use Claude; tomorrow you may want to try GPT-4 or a local model. With MCP, your servers work with any compatible host.
- You'll leverage the existing ecosystem. There are already over 5,000 MCP servers for Salesforce, Jira, PostgreSQL, AWS, GitHub, Docker... Why reinvent the wheel?
Choose Direct Integration when...
- It's a single application with specific tools. If your tool only exists for your app and will never be reused, the MCP overhead isn't justified.
- Performance is critical. MCP adds protocol latency. For cases where every millisecond counts, direct integration is more efficient.
- You need total control over execution. If you have very specific requirements for error handling, retry logic, or circuit breakers, it's sometimes easier to implement them directly.
- It's an MVP or proof of concept. To validate an idea quickly, direct function calling is simpler to implement.
The Hybrid Approach (What I recommend)
In practice, most serious applications end up using a hybrid approach:
- MCP for external integrations: databases, third-party APIs, enterprise services. You leverage the ecosystem and the standardization.
- Direct integration for core logic: tools that are specific to your domain and tightly coupled to your business logic.
Think of it this way: MCP is for your building's standard electrical connections (outlets, lights, air conditioning). Direct integration is for the custom automated system you designed specifically for your smart building.
Clean Architecture for AI Systems
Now comes the part I like most. How does all this fit into Clean Architecture? Because if you know me, you know I won't let AI contaminate my layers.
The fundamental principle is simple: LLMs are external infrastructure, not domain logic. Treat Claude or GPT-4 the same way you'd treat a database or third-party API.
The Layer Structure
┌─────────────────────────────┐
│       INFRASTRUCTURE        │
│ ┌───────┐┌───────┐┌───────┐ │
│ │  LLM  ││Vector ││  MCP  │ │
│ │Adapter││Stores ││Clients│ │
│ └───┬───┘└───┬───┘└───┬───┘ │
└─────┼────────┼────────┼─────┘
      │        │        │
      ▼        ▼        ▼
┌─────────────────────────────┐
│         APPLICATION         │
│ ┌─────────────────────────┐ │
│ │   ORCHESTRATION LAYER   │ │
│ │ Agents, RAG, Workflows  │ │
│ └─────────────────────────┘ │
│ ┌─────────────────────────┐ │
│ │        USE CASES        │ │
│ │ProcessDoc,Analyze,Report│ │
│ └─────────────────────────┘ │
└─────────────────────────────┘
               │
               ▼
┌─────────────────────────────┐
│           DOMAIN            │
│ Entities,ValueObjects,Rules │
│      ⚠️ NO AI DEPS ⚠️       │
└─────────────────────────────┘
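In code, that boundary is just a port and an adapter. A sketch with invented names so you can see where each piece lives (the file paths match the folder structure shown further down):

// domain/ports/text-generator.ts -- the domain owns this interface.
// It says nothing about Anthropic, OpenAI, tokens, or prompt formats.
export interface TextGenerator {
  generate(instruction: string, context: string): Promise<string>;
}

// infrastructure/llm/anthropic-adapter.ts -- the only place that knows
// about the vendor. Shown here with a placeholder client so the sketch
// stays self-contained; swap in the real SDK call inside this class.
export class AnthropicTextGenerator implements TextGenerator {
  constructor(private client: { complete: (prompt: string) => Promise<string> }) {}

  async generate(instruction: string, context: string): Promise<string> {
    return this.client.complete(`${instruction}\n\n${context}`);
  }
}

// application/use-cases/summarize-document.ts -- depends on the port,
// never on the adapter. Swapping Claude for GPT-4 touches infrastructure only.
export class SummarizeDocument {
  constructor(private generator: TextGenerator) {}

  run(document: string): Promise<string> {
    return this.generator.generate('Summarize this document in three bullets.', document);
  }
}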
The Sacred Principle
LLMs interpret and orchestrate. The domain validates and executes.
This is critical, so I'll repeat it: never, ever put critical business logic in an LLM. Models are stochastic: sometimes they're wrong, sometimes they hallucinate. Your domain layer must validate EVERYTHING that comes from the agent before executing it.
Bad example:
// ❌ The agent decides if the transaction is valid
const result = await agent.process('Transfer $10000 to account X');
await bankingService.execute(result.transaction);
Good example:
// ✅ The agent interprets, the domain validates
const intent = await agent.parseIntent('Transfer $10000 to account X');
const transaction = transactionFactory.create(intent);
const validation = transactionValidator.validate(transaction);
if (validation.isValid) {
  await bankingService.execute(transaction);
}
Recommended Folder Structure
src/
├── domain/                  # Pure, no AI deps
│   ├── entities/
│   ├── value-objects/
│   ├── services/
│   └── ports/               # Interfaces
│
├── application/             # Orchestration + Use Cases
│   ├── agents/              # AI Agents
│   │   ├── supervisor.ts
│   │   ├── researcher.ts
│   │   └── writer.ts
│   ├── orchestration/       # Coordination
│   │   ├── pipelines/
│   │   └── workflows/
│   └── use-cases/
│
└── infrastructure/          # External adapters
    ├── llm/                 # LLM Providers
    │   ├── anthropic-adapter.ts
    │   └── openai-adapter.ts
    ├── mcp/                 # MCP Clients
    └── vector-stores/       # Pinecone, Weaviate, etc.
Taking It to Production: What Nobody Tells You
Alright, now comes the part where I tell you the things you learn the hard way. Because building an agent that works on your machine is one thing, having it running 24/7 in production is another story.
Error Handling: It's Not Optional
In AI systems you have three types of errors you need to handle:
- Execution errors: the tool failed, a timeout, the API is down. These are the "easy" ones: retry with exponential backoff + jitter.
- Semantic errors: the model called an API with technically valid but incorrect parameters. Syntax OK, semantics wrong. These are the treacherous ones.
- Planning errors: the agent's plan has circular dependencies, impossible steps, or simply doesn't make sense. These require complete re-planning.
Concrete rules:
- Retry policy: 3-5 attempts for external APIs with ~5-second delays plus jitter; 2-3 attempts for LLMs with longer delays that respect rate limits (see the sketch after this list).
- Circuit breaker: if a tool fails X consecutive times, stop calling it for a while. You won't fix anything by saturating a service that's already down.
- Fallback LLMs: if Claude is down, route to GPT-4. If GPT-4 is down, route to a local model. Always have a plan B.
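Here's a sketch of the first two rules in TypeScript. The thresholds and delays are the ones suggested above; tune them for your own APIs:

// Retry-with-jitter plus a minimal circuit breaker, hand-rolled for illustration.
async function withRetry<T>(fn: () => Promise<T>, attempts = 4, baseDelayMs = 5000): Promise<T> {
  let lastError: unknown;
  for (let i = 0; i < attempts; i++) {
    try { return await fn(); } catch (err) {
      lastError = err;
      // Exponential backoff plus jitter so clients don't retry in lockstep.
      const delay = baseDelayMs * 2 ** i + Math.random() * 1000;
      await new Promise(resolve => setTimeout(resolve, delay));
    }
  }
  throw lastError;
}

class CircuitBreaker {
  private failures = 0;
  private openedAt = 0;
  constructor(private maxFailures = 5, private cooldownMs = 60_000) {}

  async call<T>(fn: () => Promise<T>): Promise<T> {
    // While open, fail fast instead of hammering a service that's already down.
    if (this.failures >= this.maxFailures && Date.now() - this.openedAt < this.cooldownMs) {
      throw new Error('Circuit open: skipping call');
    }
    try {
      const result = await fn();
      this.failures = 0;          // a success closes the circuit again
      return result;
    } catch (err) {
      this.failures++;
      if (this.failures >= this.maxFailures) this.openedAt = Date.now();
      throw err;
    }
  }
}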
Observability: If You Don't Measure It, It Doesn't Exist
Four categories of metrics you need to track:
| Category | Key Metrics |
|---|---|
| Performance | Total response time, latency per stage (retrieval, synthesis, writing), time to first token |
| Reliability | Error rate by type, completion rate, retry rate, circuit breaker trips |
| Quality | Accuracy (LLM-as-judge), hallucination rate, user feedback, task success rate |
| Cost | Token usage per request, API costs, cost per completed task, cost trends |
Recommended tools:
- Langfuse: Open source, excellent for LLM call tracing
- Datadog LLM Observability: If you already use Datadog, integration is natural
- Arize Phoenix: Very good for detecting drift and quality degradation
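Whichever tool you pick, the habit matters more than the vendor: wrap every model call and record something in each category. A hand-rolled sketch (the field names are illustrative, not any vendor's schema):

// Minimal instrumentation wrapper: one record per LLM call, covering
// performance, reliability, and cost. Quality metrics (LLM-as-judge,
// user feedback) usually arrive later and attach to the same requestId.
interface CallRecord {
  requestId: string;
  stage: string;            // e.g. 'retrieval', 'synthesis', 'writing'
  latencyMs: number;
  success: boolean;
  inputTokens?: number;
  outputTokens?: number;
}

async function traced<T>(
  requestId: string,
  stage: string,
  sink: (record: CallRecord) => void,   // e.g. forward to Langfuse or Datadog
  fn: () => Promise<{ value: T; inputTokens?: number; outputTokens?: number }>,
): Promise<T> {
  const start = Date.now();
  try {
    const { value, inputTokens, outputTokens } = await fn();
    sink({ requestId, stage, latencyMs: Date.now() - start, success: true, inputTokens, outputTokens });
    return value;
  } catch (err) {
    sink({ requestId, stage, latencyMs: Date.now() - start, success: false });
    throw err;
  }
}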
Testing: The Elephant in the Room
Testing AI systems is difficult because they're non-deterministic. The same input can give different outputs. But that doesn't mean you can't test.
Layered testing strategy:
- Domain unit tests: 100% deterministic, no AI involved. Your validators, factories, and business rules must be covered.
- Agent component tests: fixed inputs, evaluating output structure rather than exact content. Did it call the right tools? Does the plan make sense? (A sketch follows this list.)
- LLM-as-judge evaluation: use one model to evaluate the outputs of another. Frameworks like DeepEval and Giskard help here.
- Regression datasets: each production bug becomes a test case. Build your regression dataset organically.
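As an example of the second layer, here's a sketch using Node's built-in test runner. The agent is stubbed out with a hypothetical planTrip function so the example stays self-contained; the point is that the assertions are about structure, never about exact wording:

// Component-style test sketch: fixed input, structural assertions.
import { test } from 'node:test';
import assert from 'node:assert/strict';

interface AgentTrace {
  toolCalls: string[];           // names of the tools the agent invoked
  plan: string[];                // the steps it decided on
}

// Stub standing in for the real agent so the sketch runs on its own.
async function planTrip(_request: string): Promise<AgentTrace> {
  return { toolCalls: ['search_flights', 'search_hotels'], plan: ['find flights', 'find hotels', 'summarize'] };
}

test('trip agent calls the right tools and produces a sane plan', async () => {
  const trace = await planTrip('Plan a weekend in Lisbon');

  // Structure, not exact content: the right tools, in any order.
  assert.ok(trace.toolCalls.includes('search_flights'));
  assert.ok(trace.toolCalls.includes('search_hotels'));

  // The plan should be non-trivial but bounded.
  assert.ok(trace.plan.length >= 2 && trace.plan.length <= 10);
});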
Conclusion: The Path Forward
Well, we've reached the end. Let's recap the important stuff:
Orchestration patterns are MCP-agnostic. Supervisor, Hierarchical, ReAct, Plan-and-Execute - all work with or without MCP. MCP changes how you expose and integrate tools, not how agents coordinate with each other.
MCP solves the N×M problem. If you have tools that will be reused, if you want to leverage the ecosystem, if you need dynamic discovery - MCP is your friend. If it's a specific app with custom tools, direct integration may be sufficient.
Clean Architecture still applies. LLMs are external infrastructure. Your domain must be pure, your application layer orchestrates, and AI adapters live in infrastructure. Never let the model make business decisions without domain validation.
In production, observability is not optional. Multi-layer error handling, performance/reliability/quality/cost metrics, and a testing strategy that accepts the non-deterministic nature of these systems.
"We're not building applications that use AI. We're building software systems where AI is just another component - powerful, yes, but a component that must respect the same architecture rules as any other external dependency."
Let's go. Now go build. 🚀
References and Resources
Official Documentation:
- Model Context Protocol Specification - modelcontextprotocol.io
- LangGraph Documentation - langchain-ai.github.io/langgraph
- CrewAI Documentation - docs.crewai.com
- Microsoft AutoGen - microsoft.github.io/autogen
Foundational Papers:
- ReAct: Synergizing Reasoning and Acting in Language Models (arXiv:2210.03629)
- Plan-and-Solve Prompting (arXiv:2305.04091)
Observability Tools:
- Langfuse - langfuse.com (Open Source)
- Arize Phoenix - arize.com/phoenix
- DeepEval - docs.confident-ai.com
MCP Servers Ecosystem:
- MCP Server Registry - github.com/modelcontextprotocol/servers
- FastMCP - github.com/jlowin/fastmcp