Agentic AI is the most significant shift in how we build LLM applications since RAG. Instead of a single LLM call answering a question, agents decompose complex tasks into sub-tasks, call tools, reason about intermediate results, and iteratively refine their output. The challenge isn't building one agent — it's orchestrating multiple agents that collaborate reliably in production.
At EPAM, I've been designing multi-agent workflows for enterprise automation — from document processing pipelines to multi-cloud infrastructure management. This post covers the orchestration patterns that actually work, the frameworks I've evaluated, and the hard lessons from putting agents into production.
Why Multi-Agent? The Limitations of Single-Agent Systems
A single agent with a long system prompt and 20 tools sounds elegant. In practice, it fails in predictable ways:
- Tool selection confusion: With 15+ tools, LLMs increasingly pick the wrong tool or hallucinate tool arguments. Accuracy degrades as the tool set grows.
- Context window saturation: A single agent handling a complex task accumulates tool outputs, intermediate reasoning, and error messages. By step 8, the context is polluted with irrelevant information from steps 1-3.
- No separation of concerns: A single agent handling research, analysis, writing, and review produces mediocre results at everything instead of excellent results at one thing.
Multi-agent systems solve this by giving each agent a narrow scope, a small tool set, and a focused system prompt. A "researcher" agent only searches and retrieves. An "analyst" agent only reasons about data. A "writer" agent only produces output. Each agent excels at its role.
LangGraph: State Machines for Agents
After evaluating LangGraph, CrewAI, AutoGen, and OpenAI's Agents SDK, I've settled on LangGraph for production multi-agent systems. The reason: it models agent workflows as explicit state machines with typed state, conditional routing, and human-in-the-loop checkpoints.
from langgraph.graph import StateGraph, END
from typing import TypedDict

class AgentState(TypedDict):
    task: str
    research_output: str
    analysis_output: str
    final_output: str
    needs_human_review: bool

graph = StateGraph(AgentState)

# Add agent nodes
graph.add_node("researcher", researcher_agent)
graph.add_node("analyst", analyst_agent)
graph.add_node("writer", writer_agent)
graph.add_node("human_review", human_review_checkpoint)

# Define edges (workflow)
graph.set_entry_point("researcher")
graph.add_edge("researcher", "analyst")
graph.add_conditional_edges("analyst", route_after_analysis, {
    "needs_review": "human_review",
    "ready": "writer",
})
graph.add_edge("human_review", "writer")
graph.add_edge("writer", END)

app = graph.compile(checkpointer=memory_checkpointer)
The key advantage over frameworks like CrewAI: the workflow is explicit, debuggable, and deterministic. You can see exactly which agent runs when, what state gets passed, and where decisions are made. In production, this observability is non-negotiable.
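The conditional edge above references a `route_after_analysis` function without showing it. A minimal sketch, assuming the analyst agent sets a `needs_human_review` flag on the state (the flag name matches the `AgentState` definition; the logic is illustrative):

```python
def route_after_analysis(state: dict) -> str:
    """Return the branch label consumed by add_conditional_edges."""
    # Route to a human whenever the analyst flagged its own output.
    if state.get("needs_human_review", False):
        return "needs_review"
    return "ready"
```

The returned string must match a key in the mapping passed to `add_conditional_edges`; LangGraph uses it to pick the next node.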
Orchestration Patterns
Pattern 1: Sequential Pipeline
The simplest pattern — agents run in sequence, each transforming the state. Use when: tasks have clear sequential dependencies (research → analyze → write → review).
Research Agent → Analysis Agent → Writing Agent → Review Agent
      ↓                ↓                ↓              ↓
  retrieves       reasons about     generates      validates
  documents         findings          output        quality
Pattern 2: Router (Supervisor)
A supervisor agent receives the task and routes it to the appropriate specialist agent based on task classification. Use when: incoming requests vary in type and each type needs different handling.
User Request → Supervisor Agent → classify task type
                       |
      ┌────────────────┼────────────────┐
      ↓                ↓                ↓
  Code Agent     Research Agent     Data Agent
 (writes code)   (searches docs)    (queries DB)
At EPAM, our IAM policy generator uses this pattern. The supervisor classifies the request (single-service vs multi-cloud vs compliance audit) and routes to specialized policy agents, each with their own RAG context and tool set.
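The supervisor's classification step reduces to a function that maps a request to a specialist node name. In production this would be an LLM call with a constrained output schema; the keyword rules below are an illustrative stand-in, and the agent names are hypothetical:

```python
def route_request(task: str) -> str:
    """Classify an incoming task and return the specialist node to run.

    A real supervisor would ask an LLM to classify against a fixed label
    set; keyword matching here just sketches the contract.
    """
    lowered = task.lower()
    if any(kw in lowered for kw in ("write", "implement", "refactor")):
        return "code_agent"
    if any(kw in lowered for kw in ("query", "table", "database")):
        return "data_agent"
    return "research_agent"  # default: gather context first
```

The function plugs into `add_conditional_edges` on the supervisor node exactly like any other router.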
Pattern 3: Parallel Fan-Out / Fan-In
Multiple agents work on sub-tasks simultaneously, and a synthesizer agent combines their outputs. Use when: the task can be decomposed into independent sub-problems.
Task Decomposer → [Agent A, Agent B, Agent C] → Synthesizer Agent
                      (parallel execution)       (combines results)
Pattern 4: Iterative Refinement (Reflection)
An executor agent produces output, a critic agent evaluates it, and the executor refines based on feedback — looping until quality criteria are met. Use when: output quality matters more than latency.
Executor Agent ←──── feedback ────── Critic Agent
      ↓                                   ↑
 generates output ───── evaluate ─────────┘
      ↓ (when approved)
     END
"The reflection pattern is the closest thing we have to 'thinking hard' in LLM systems. Two rounds of self-critique improve output quality by 30-40% on our internal benchmarks — but each round adds latency and cost. We cap at 3 iterations maximum."
Tool-Use Routing: Getting It Right
Tool selection is where most agent failures happen. Strategies that worked:
- Small tool sets per agent: Max 5 tools per agent. If you need more, split into multiple agents. Our research agent has 3 tools: vector search, web search, and document reader.
- Typed tool signatures: Pydantic models for tool inputs/outputs. This gives the LLM structured schemas and catches malformed arguments before execution.
- Tool descriptions as prompts: The tool description is part of the prompt. A vague description like "searches documents" leads to incorrect invocations. "Searches the IAM policy documentation index and returns the top-5 most relevant policy snippets for the given AWS service name" works dramatically better.
- Fallback chains: If the primary tool fails (API timeout, empty result), define a fallback. Our web search falls back to cached results, then to a summary of previously retrieved context.
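The fallback chain in the last bullet is straightforward to implement as a wrapper that tries each tool in order, falling through on exceptions or empty results. A minimal sketch (the tool names in the usage line are hypothetical):

```python
def with_fallbacks(*tools):
    """Chain tool callables: try each in order, falling through when one
    raises (e.g. an API timeout) or returns an empty result."""
    def run(query):
        for tool in tools:
            try:
                result = tool(query)
            except Exception:
                continue  # primary failed; try the next tool
            if result:
                return result
        return None  # every tool failed or came back empty
    return run

# Illustrative usage, mirroring the chain described above:
# search = with_fallbacks(web_search, cached_search, summarize_prior_context)
```

Returning `None` rather than raising lets the calling agent decide whether to proceed without that input or escalate to a human.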
Memory Management
Agents need memory at three levels:
- Working memory (within a task): The state object in LangGraph. Holds intermediate results, tool outputs, and agent decisions for the current task. Cleared when the task completes.
- Short-term memory (across turns): Conversation history within a session. We use a sliding window of the last 10 messages plus a summary of older messages — full history blows up the context window.
- Long-term memory (across sessions): Persistent facts learned from previous interactions. Stored in a vector database, retrieved when relevant. For our policy generator, this includes previously approved policy patterns and client-specific rules.
# LangGraph checkpointing for persistence
from langgraph.checkpoint.sqlite import SqliteSaver
checkpointer = SqliteSaver.from_conn_string("agent_memory.db")
app = graph.compile(checkpointer=checkpointer)
# Resume from checkpoint (enables human-in-the-loop)
config = {"configurable": {"thread_id": "user-123-task-456"}}
result = app.invoke(state, config=config)
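The short-term memory scheme described above (last 10 messages verbatim, older history compressed) can be sketched as a small helper. The `summarize` callable is assumed to be LLM-backed in practice; here any `list -> str` function works, and the placeholder fallback is illustrative:

```python
def windowed_history(messages, window=10, summarize=None):
    """Keep the last `window` messages verbatim; compress everything
    older into a single leading summary message."""
    if len(messages) <= window:
        return list(messages)
    older, recent = messages[:-window], messages[-window:]
    if summarize is not None:
        summary = summarize(older)  # LLM-backed in production
    else:
        summary = f"[summary of {len(older)} earlier messages]"
    return [summary] + recent
```

This keeps prompt size roughly constant per turn regardless of session length, at the cost of one summarization call whenever the window rolls over.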
Human-in-the-Loop: The Production Requirement
Fully autonomous agents are a demo feature. In production, especially for enterprise workflows, humans must be in the loop for high-stakes decisions. LangGraph makes this explicit with interrupt points:
- Approval gates: Before executing destructive actions (creating IAM policies, modifying infrastructure), pause and present the plan to a human for approval.
- Quality checkpoints: After the analysis agent but before the writer, let a human validate the reasoning. Cheaper to catch errors here than after a full report is generated.
- Escalation: When the agent's confidence is low (detected via output parsing or self-evaluation), route to a human instead of guessing.
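In LangGraph, approval gates are typically wired in at compile time via `interrupt_before=["human_review"]`, which pauses the checkpointed run until a human resumes it. The gate logic itself reduces to something like the sketch below, where `is_destructive` and `ask_human` are hypothetical hooks (a classifier and a review UI, respectively):

```python
def approval_gate(plan, is_destructive, ask_human):
    """Only execute destructive plans after explicit human sign-off.

    `is_destructive(plan) -> bool` classifies the action;
    `ask_human(plan) -> bool` would surface the plan for review.
    Non-destructive plans pass through without interruption.
    """
    if is_destructive(plan) and not ask_human(plan):
        raise PermissionError("plan rejected by reviewer")
    return plan
```

The key property is fail-closed behavior: absent an explicit approval, the destructive action never runs.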
Production Observability
Multi-agent systems are hard to debug without proper observability. Our stack:
- LangSmith traces: Every agent invocation, tool call, and state transition is traced. When a workflow produces bad output, we can replay the exact sequence of decisions.
- Structured logging: Each agent logs its input state, reasoning, tool calls, and output in structured JSON. This feeds into dashboards that show success rates per agent, average tool call counts, and common failure modes.
- Cost tracking: Token usage per agent per workflow. We discovered our "researcher" agent was consuming 60% of total tokens due to overly long retrieval contexts — truncating to top-3 chunks cut cost by 40%.
- Quality metrics: Automated evaluation on a representative test set after every deploy. Human evaluation weekly on a sample of production outputs.
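The structured-logging bullet can be implemented as a thin decorator around each agent function, emitting one JSON line per invocation. The field names below are illustrative, not a fixed schema:

```python
import functools
import json
import time

def traced(agent_name):
    """Decorator emitting one structured JSON log line per agent call:
    agent name, input state keys, latency, and a success flag."""
    def wrap(fn):
        @functools.wraps(fn)
        def inner(state):
            start = time.perf_counter()
            ok = True
            try:
                return fn(state)
            except Exception:
                ok = False
                raise
            finally:
                print(json.dumps({
                    "agent": agent_name,
                    "input_keys": sorted(state),
                    "latency_ms": round((time.perf_counter() - start) * 1000, 1),
                    "ok": ok,
                }))
        return inner
    return wrap
```

Because the log line is emitted in a `finally` block, failed invocations are recorded too, which is exactly where the failure-mode dashboards get their data.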
Framework Comparison: What I've Learned
- LangGraph: Best for production. Explicit state machines, typed state, checkpointing, human-in-the-loop. Steeper learning curve but worth it for anything beyond prototypes.
- CrewAI: Great for quick prototypes. Role-based agents are intuitive. But implicit state management and limited control flow make production debugging painful.
- AutoGen: Good for conversational multi-agent patterns (agents talking to each other). Less suited for structured workflows with deterministic steps.
- OpenAI Agents SDK: Clean API, good tool integration. But vendor-locked to OpenAI models, which limits flexibility in production where you may need to switch providers.
Lessons from Production
- Start with a single agent. Add multi-agent complexity only when you've proven that a single agent can't handle the task well enough. Most tasks don't need 5 agents.
- Make every transition explicit. Implicit agent-to-agent communication (agents "chatting") is impossible to debug. Use typed state objects and deterministic routing.
- Budget tokens per agent. Without limits, agents will use unlimited context. Set max token budgets per agent per step.
- Test with adversarial inputs. Agents fail in creative ways. Test with malformed inputs, edge cases, and tasks that require saying "I can't do this."
- Version your workflows. Agent prompts, tool definitions, and graph topology should all be versioned. Rolling back a broken agent workflow should be a one-click operation.
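The per-agent token budget from the third lesson can be enforced by trimming the oldest messages until the prompt fits. A real system would count tokens with the model's own tokenizer; the 4-characters-per-token heuristic here is an illustrative stand-in:

```python
def enforce_budget(messages, max_tokens, count_tokens=None):
    """Drop the oldest messages until the prompt fits the per-agent
    token budget. Always keeps at least the most recent message."""
    count = count_tokens or (lambda m: max(1, len(m) // 4))
    trimmed = list(messages)
    while len(trimmed) > 1 and sum(count(m) for m in trimmed) > max_tokens:
        trimmed.pop(0)  # oldest context goes first
    return trimmed
```

Pairing this with the sliding-window summarization described in the memory section keeps budgets hard limits rather than soft guidelines.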
"Agentic AI isn't about building smarter models — it's about building smarter systems. The orchestration layer matters more than the underlying LLM. A well-orchestrated system with GPT-4o-mini often outperforms a single GPT-4 call on complex tasks."