AI Agents: Autonomous LLM Systems
Tool use, memory, reasoning patterns, multi-agent collaboration, and production deployment of LLM-powered autonomous systems.
A chatbot responds. An agent acts. The difference is not intelligence; it is the ability to take actions with real-world consequences: browsing the web, writing files, executing code, calling APIs, and persisting across multiple steps toward a goal. When you give an LLM tools and a goal, you get an agent.
A standard LLM is a stateless input-output transformation: a message goes in, text comes out. Between turns there is no persistent state, no access to the outside world, and no capability to take actions beyond generating words. It is powerful but passive.
An AI agent wraps that LLM with tools, memory, and a loop. The LLM becomes the "brain": it decides what to do. Tools are the "hands": they actually do it. The agent pursues a goal across multiple steps, using the result of each action to inform the next. Real-world example: Devin reads a GitHub ticket, writes code across 15 files, runs the test suite, and iterates until all tests pass, without a human in the loop for each file.
| Property | Standard LLM | AI Agent |
|---|---|---|
| Interaction style | Single-turn input → output | Multi-step task execution |
| State | Stateless between turns | Maintains state across steps |
| Output type | Text only | Real-world actions (files, APIs, code) |
| External access | None | Tools, APIs, databases, browsers |
| Model calls per task | One | Many (until goal achieved) |
| Side effects | None | Creates files, sends emails, runs code |
Agency is not binary. It exists on a spectrum from a pure LLM that never touches the outside world, to a fully autonomous system that operates for hours without human input. Understanding where on this spectrum your system sits is the first design decision in building any agent, because it determines safety requirements, reliability challenges, and appropriate use cases.
- Level 1: Chat only. No tools, no memory, no actions. User asks → LLM answers. Example: vanilla ChatGPT without plugins.
- Level 2: One-shot tool use per turn. Example: ChatGPT with web search enabled. Tool execution is deterministic: one search per turn.
- Level 3: Plans and calls tools, but takes one action per user turn. Example: Code Interpreter executes one cell and returns the results.
- Level 4: Loops: think → act → observe → repeat until the goal is reached. Examples: Claude computer use, AutoGPT, most LangChain agents. Most 2024 production agents are here.
- Level 5: Operates indefinitely without human input. Self-assigns subtasks, spawns sub-agents, recovers from failures. Examples: Devin, SWE-Agent on multi-day tasks.

Levels 3–4 are where the most reliable production deployments sit. Level 5 is emerging but requires careful safety design, sandboxing, and human-in-the-loop checkpoints.
Every agent, regardless of framework or task domain, is built from the same four primitives. These are not optional enhancements; they are the minimal set of components required for an LLM to pursue a multi-step goal in the world.
- LLM brain: The reasoning engine. Decides what to do, interprets results, generates plans, selects tools, evaluates progress. All "intelligence" lives here; everything else is infrastructure.
- Tools: The interface to the world. Each tool has a name, description, input schema, and callable implementation. The LLM selects the tool and its arguments. Examples: web search, code executor, file I/O, database query.
- Memory: Short-term memory is the context window (all prior messages plus tool results). Long-term memory is a vector store or database for knowledge that persists across sessions. Full treatment in Ch 8.4.
- Planning: How the agent decides what to do next. Simple: one LLM call per step. Complex: ReAct loops, tree-of-thought, plan-and-execute. Planning strategy determines reliability and capability. See Ch 8.3.
Designing an agent requires explicitly specifying what it can see (perception space) and what it can do (action space). These two boundaries define capability and determine risk. An unrestricted write action space with irreversible operations is dangerous; a read-only agent is safe but limited.
Perception space (what the agent can see):
- Natural language messages and instructions
- Document and file contents (PDF, code, data)
- Web page HTML and rendered screenshots
- Tool call results and API responses
- Database and vector store query results
- Structured JSON / API response payloads
- Full conversation history (context window)
Action space (what the agent can do):
- Generate text responses, plans, and analysis
- Call external tools with structured arguments
- Write and execute code (Python, bash, SQL)
- Control a computer (mouse, keyboard, navigation)
- Read and write files and databases
- Send emails, post messages, create tickets
- Spawn sub-agents to handle subtasks
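
One way to make the action-space boundary explicit is to register every action together with a flag saying whether its effect can be undone, and to refuse irreversible actions unless a caller has approved them. The sketch below is illustrative only: `Action`, `ActionSpace`, and the `approved` flag are hypothetical names, not part of any framework mentioned in this chapter.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict


@dataclass
class Action:
    name: str
    fn: Callable[..., str]
    irreversible: bool  # True means the effect cannot be undone


@dataclass
class ActionSpace:
    actions: Dict[str, Action] = field(default_factory=dict)

    def register(self, action: Action) -> None:
        self.actions[action.name] = action

    def execute(self, name: str, approved: bool = False, **kwargs) -> str:
        # Anything not registered is outside the action space by definition.
        if name not in self.actions:
            return f"error: '{name}' is outside the action space"
        action = self.actions[name]
        # Irreversible actions need an explicit approval flag from the caller.
        if action.irreversible and not approved:
            return f"blocked: '{name}' is irreversible and needs approval"
        return action.fn(**kwargs)


space = ActionSpace()
space.register(Action("read_file", lambda path: f"contents of {path}", irreversible=False))
space.register(Action("send_email", lambda to: f"email sent to {to}", irreversible=True))

print(space.execute("read_file", path="notes.txt"))
print(space.execute("send_email", to="a@example.com"))
print(space.execute("send_email", approved=True, to="a@example.com"))
print(space.execute("delete_db"))
```

A read-only agent is simply an action space in which nothing is marked irreversible; widening it is then a deliberate, reviewable change.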
The agent loop is the fundamental runtime of any multi-step agent. It is deceptively simple: observe the current state, ask the LLM what to do, execute the chosen action, update the context, and repeat. Everything else (ReAct, tool calling, memory retrieval, planning) is a variation on this core loop.
One important consequence: every loop iteration adds tokens to the context window. Long tasks can exhaust the context limit. A production agent must decide what to keep verbatim, what to summarise, and what to offload to long-term memory. This is one of the primary engineering challenges in building reliable agents.
The agent loop is not deterministic. The same goal can take 3 steps or 30 depending on what the LLM decides, what tool results come back, and what errors occur along the way. Reliability engineering for agents means designing the loop to handle all three outcomes (success, recoverable failure, and unrecoverable failure) with a graceful exit for each.
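
The loop and its three exits can be sketched without any LLM at all. In the sketch below the model is replaced by a scripted `decide` function, so `run_loop`, `Decision`, and `scripted_decide` are all hypothetical names chosen for illustration; a real agent would put a model call where `decide` is.

```python
from typing import Callable, Dict, List, Tuple

# A decision is ("tool", tool_name, argument) or ("final", answer, "").
Decision = Tuple[str, str, str]


def run_loop(goal: str,
             decide: Callable[[List[str]], Decision],
             tools: Dict[str, Callable[[str], str]],
             max_steps: int = 5) -> str:
    context = [f"goal: {goal}"]                      # observe: start from the goal
    for _ in range(max_steps):
        kind, name, arg = decide(context)            # think: ask the policy
        if kind == "final":
            return name                              # exit 1: success
        if name not in tools:
            return f"aborted: unknown tool '{name}'" # exit 2: unrecoverable failure
        try:
            observation = tools[name](arg)           # act
        except Exception as exc:
            observation = f"error: {exc}"            # recoverable: policy sees the error
        context.append(f"{name}({arg}) -> {observation}")  # update
    return "aborted: step limit reached"             # exit 3: budget exhausted


def scripted_decide(context: List[str]) -> Decision:
    # Stand-in for the LLM: look something up once, then answer with the result.
    if len(context) == 1:
        return ("tool", "lookup", "capital of Japan")
    return ("final", context[-1].split("-> ")[1], "")


tools = {"lookup": lambda q: "Tokyo"}
print(run_loop("Find the capital of Japan", scripted_decide, tools))  # Tokyo
```

Note that `context` grows by one entry per iteration, which is the token-accumulation problem described above in miniature.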
Call external functions: search engines, calculators, APIs, databases. The most common agent type. LLM selects which tool and what arguments. Examples: Perplexity (search), Code Interpreter, Claude with tools.
Multi-turn dialogue with users or other agents. Track conversation history, maintain context, handle follow-up questions. Examples: customer service bots, tutoring agents, interview assistants.
Write and execute code as the primary action. Can install packages, run tests, debug errors, and iterate until code works. Examples: Devin, SWE-Agent, GitHub Copilot Workspace.
Control a computer via screenshot observation and mouse/keyboard actions. Can operate any GUI application as a human would, with no API needed. Examples: Claude computer use, OpenAI Operator.
The concept of an autonomous goal-seeking agent is decades old in AI research. What changed between 2022 and 2024 is the convergence of three enabling factors that finally made production agents practical.
- Stronger reasoning: GPT-4 and Claude 3 crossed the threshold needed to reliably plan, reason about tool results, and self-correct on errors. Earlier models failed too often for practical multi-step agent loops.
- Structured function calling: OpenAI function calling (June 2023) and Anthropic tool use gave models a reliable structured way to invoke tools. Before this, tool use required fragile prompt-parsing heuristics.
- Mature frameworks: LangChain, LangGraph, AutoGen, and CrewAI abstract the boilerplate. Developers build production agents in hours, not weeks. MCP standardises tool and context protocols across models.
Chapter 8.1: Key Takeaways
- Agent = LLM + tools + memory + goal: it takes actions with real-world consequences rather than only generating text
- The agency spectrum: Level 1 (pure LLM) → Level 5 (fully autonomous); most 2024 production systems sit at Level 3–4
- Four components required: LLM Brain, Tools, Memory, Planning; all four are needed for complex multi-step tasks
- Agent loop: Observe → Think → Act → Update; repeats until the goal is achieved, max steps are hit, or an error occurs
- Enabled by: GPT-4-class reasoning + structured function calling APIs (June 2023) + mature frameworks (LangChain, LangGraph)
- Key risk: agents can take irreversible real-world actions, so safety design and human oversight are non-negotiable
An LLM without tools is a very smart autocomplete. Tools are what turn text generation into action. The function calling API, released by OpenAI in June 2023, was the single most important infrastructure change that made production agents practical. Before it, tool use was fragile prompt engineering. After it, it was engineering.
LLMs have three fundamental limitations that tools directly address. Knowledge cutoff: training data has a fixed date, so models cannot tell you today's stock price or last night's sports result. Computation: LLMs are unreliable at precise arithmetic, code execution, and structured data queries. Side effects: a language model can describe writing an email but cannot actually send one. Tools bridge each of these gaps, turning a model that only speaks into one that acts.
Web search, Wikipedia, news feeds, stock data, weather APIs, knowledge bases, RAG retrieval. Overcome the knowledge cutoff by giving the model access to current information.
Code interpreter (Python), calculator, SQL database, image generation, data analysis. Deliver precise, verifiable results the model itself cannot compute reliably.
Email sending, calendar events, file creation/editing, web browser control, API calls, form submission. Create real-world side effects: the agent does things, not just says things.
Before structured function calling, using tools required prompting the model to output JSON and then parsing it, a fragile approach that broke on any formatting variation. OpenAI's function calling API (June 2023) changed this: the model's request to use a tool now arrives as a structured tool_call object in a dedicated field of the API response, not as free text that the developer has to locate and parse. Anthropic's tool use API follows the same pattern with minor schema differences.
The round-trip has exactly six steps: define tools → model requests a tool call → developer executes the function → developer returns the result → model reasons over the result → model produces the final answer. The developer drives steps 3 and 4; the model does everything else.

    # Anthropic tool use: complete working example
    import json

    import anthropic

    client = anthropic.Anthropic()

    tools = [{
        "name": "get_weather",
        "description": "Get current weather for a city. Returns temp and conditions.",
        "input_schema": {
            "type": "object",
            "properties": {
                "city": {"type": "string", "description": "City name e.g. 'Tokyo'"},
                "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]}
            },
            "required": ["city"]
        }
    }]

    def get_weather(city: str, unit: str = "celsius") -> dict:
        # Stub: returns fixed data instead of calling a real weather service
        return {"city": city, "temperature": 22, "condition": "sunny", "unit": unit}

    def run_agent(user_message: str) -> str:
        messages = [{"role": "user", "content": user_message}]
        while True:
            response = client.messages.create(
                model="claude-sonnet-4-6",
                max_tokens=1024,
                tools=tools,
                messages=messages
            )
            if response.stop_reason == "tool_use":
                # Echo the assistant turn back, then answer every tool_use block
                messages.append({"role": "assistant", "content": response.content})
                tool_results = []
                for block in response.content:
                    if block.type == "tool_use":
                        result = get_weather(**block.input)  # execute the function
                        tool_results.append({
                            "type": "tool_result",
                            "tool_use_id": block.id,
                            "content": json.dumps(result)
                        })
                messages.append({"role": "user", "content": tool_results})
                continue  # loop continues: model sees tool results and responds
            # Any other stop reason (end_turn, max_tokens, ...) ends the loop,
            # so an unexpected value cannot cause an infinite loop
            return next((b.text for b in response.content if b.type == "text"), "")

    print(run_agent("What's the weather in Tokyo right now?"))
    # Example reply (wording varies): "Currently in Tokyo: 22°C and sunny."

A tool schema is a prompt. The model reads your name, description, and parameter descriptions to decide whether to call the tool, when to call it, and what arguments to pass. A vague or misleading schema leads to incorrect tool calls or failed executions, and these errors compound when tools call other tools.

    name: "weather"
    description: "weather tool"
    parameters:
      location: string

Vague name, no description of output, no guidance on location format. Model calls with "Paris" → ambiguous city → tool error.

    name: "get_current_weather"
    description: "Retrieve current conditions for a city. Returns temp, condition, humidity, wind. Use for current weather. NOT for historical data."
    parameters:
      city: "Full name e.g. 'Paris, France'"
      unit: celsius|fahrenheit (optional)

Verb_noun name, precise description, example value, explicit scope (what NOT to use for). Model calls with "Paris, France" → success.
Best practices: verb_noun naming (get_weather, search_web, create_file); description states what it does AND when to use it; parameters include format examples ("e.g. 'Paris, France'"); required vs optional explicitly declared; enum constraints where possible. Think of it as writing documentation for an AI colleague who will read it exactly once before making a decision.
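
Because the model can still produce arguments that violate the schema, it can help to check them before executing the tool and to return the problems to the model as the tool result. The sketch below assumes the Anthropic-style `input_schema` layout used earlier in this chapter and covers only `required`, string types, and `enum`; `validate_args` is a hypothetical helper, not a library function, and a full JSON Schema validator would do far more.

```python
from typing import Any, Dict, List

schema = {
    "name": "get_current_weather",
    "description": "Retrieve current conditions for a city. NOT for historical data.",
    "input_schema": {
        "type": "object",
        "properties": {
            "city": {"type": "string", "description": "Full name e.g. 'Paris, France'"},
            "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]},
        },
        "required": ["city"],
    },
}


def validate_args(tool: Dict[str, Any], args: Dict[str, Any]) -> List[str]:
    """Return a list of problems; an empty list means the call looks valid."""
    spec = tool["input_schema"]
    problems = [f"missing required argument '{k}'"
                for k in spec.get("required", []) if k not in args]
    for key, value in args.items():
        prop = spec["properties"].get(key)
        if prop is None:
            problems.append(f"unknown argument '{key}'")
        elif prop.get("type") == "string" and not isinstance(value, str):
            problems.append(f"'{key}' must be a string")
        elif "enum" in prop and value not in prop["enum"]:
            problems.append(f"'{key}' must be one of {prop['enum']}")
    return problems


print(validate_args(schema, {"city": "Paris, France", "unit": "celsius"}))  # []
print(validate_args(schema, {"unit": "kelvin"}))
```

Feeding the problem list back as the tool result gives the model a chance to correct its own call on the next loop iteration.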
To understand how agents really work, trace a complete multi-step execution. Task: "Find the top 3 Python packages for data visualisation and compare their GitHub stars." The agent needs to: (a) discover which packages are popular, (b) fetch star counts for each, and (c) synthesise the results. Three separate tool calls, four LLM invocations.
Modern LLM APIs support returning multiple tool calls in a single model response. When the model determines that two tool calls are independent (neither depends on the output of the other), it can request them simultaneously. The developer then executes both in parallel and returns both results in a single follow-up message. For tasks with several independent lookups this can cut latency by roughly 30–50%, because the wait is set by the slowest call rather than by the sum of all calls.
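
On the developer side, running independent calls concurrently can be as simple as a thread pool. The sketch below uses a made-up `get_stars` tool that sleeps to simulate network latency, and the shape of `tool_calls` is illustrative rather than any provider's exact response format; the star counts are placeholders, not real data.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def get_stars(repo: str) -> dict:
    """Hypothetical tool: pretends to call a slow API."""
    time.sleep(0.2)
    return {"repo": repo, "stars": len(repo) * 1000}  # placeholder, not real data


# Tool calls as they might arrive in one model response (illustrative shape).
tool_calls = [
    {"id": "call_1", "name": "get_stars", "input": {"repo": "matplotlib"}},
    {"id": "call_2", "name": "get_stars", "input": {"repo": "plotly"}},
    {"id": "call_3", "name": "get_stars", "input": {"repo": "seaborn"}},
]
registry = {"get_stars": get_stars}


def run_call(call: dict) -> dict:
    result = registry[call["name"]](**call["input"])
    # Each result keeps the id of the call it answers.
    return {"tool_use_id": call["id"], "content": result}


start = time.perf_counter()
with ThreadPoolExecutor(max_workers=len(tool_calls)) as pool:
    results = list(pool.map(run_call, tool_calls))  # map preserves input order
elapsed = time.perf_counter() - start

print(f"{len(results)} results in {elapsed:.2f}s")
```

Run sequentially, the three calls would take at least 0.6 s; in the pool they overlap. Threads suit I/O-bound tools like this one; CPU-bound tools would need processes instead.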
| Tool Category | Examples | Latency | Risk Level |
|---|---|---|---|
| Web / Search | Brave Search, Bing, Serper, SerpAPI, Tavily | 200–500ms | Low |
| Code Execution | Python REPL, JavaScript sandbox, Jupyter kernel | 100ms–30s | Medium |
| File System | read_file, write_file, list_dir, delete_file | <10ms | High (irreversible) |
| Database | SQL query, vector search, NoSQL get/set | 10–100ms | Medium–High |
| External APIs | REST calls, GraphQL, gRPC services | 100ms–2s | Varies |
| Communication | send_email, post_slack, create_ticket | 200–500ms | High (irreversible) |
| Browser / Computer | navigate, click, type, screenshot | 500ms–2s | High |
| Memory | vector_store_add, retrieve, entity_update | 10–100ms | Low |
Before MCP, every agent framework defined tools differently: LangChain tools, AutoGen tools, and custom code were all incompatible. A tool built for one framework couldn't be used in another without rewriting the wrapper. Anthropic's Model Context Protocol (released 2024) is an open standard that solves this; think of it as HTTP for tool use.
An MCP server exposes tools over a standardised JSON-RPC interface (via stdio or HTTP+SSE). Any MCP-compatible client can connect to any MCP server without modification. The ecosystem already includes servers for: filesystem, PostgreSQL, Slack, GitHub, Google Drive, Puppeteer (browser control), and dozens more.
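
To make the JSON-RPC framing concrete, the sketch below builds a `tools/call` request and answers it with a toy in-process dispatcher. It is deliberately simplified and rests on assumptions: the `add` tool and the `handle` function are invented for illustration, the transport (stdio or HTTP) and the initialisation handshake that real clients and servers perform are omitted, and production code would use an MCP SDK rather than hand-rolled dispatch.

```python
import json

# Toy tool table; a real MCP server would register tools through an SDK.
TOOLS = {
    "add": {
        "description": "Add two integers.",
        "inputSchema": {
            "type": "object",
            "properties": {"a": {"type": "integer"}, "b": {"type": "integer"}},
            "required": ["a", "b"],
        },
        "handler": lambda a, b: a + b,
    }
}


def handle(raw: str) -> str:
    """Answer one JSON-RPC 2.0 request string with a response string."""
    req = json.loads(raw)
    if req["method"] == "tools/list":
        result = {"tools": [
            {"name": n, "description": t["description"], "inputSchema": t["inputSchema"]}
            for n, t in TOOLS.items()
        ]}
    elif req["method"] == "tools/call":
        tool = TOOLS[req["params"]["name"]]
        value = tool["handler"](**req["params"]["arguments"])
        result = {"content": [{"type": "text", "text": str(value)}]}
    else:
        # -32601 is the JSON-RPC 2.0 code for "Method not found".
        error = {"code": -32601, "message": "Method not found"}
        return json.dumps({"jsonrpc": "2.0", "id": req["id"], "error": error})
    return json.dumps({"jsonrpc": "2.0", "id": req["id"], "result": result})


request = {"jsonrpc": "2.0", "id": 1, "method": "tools/call",
           "params": {"name": "add", "arguments": {"a": 2, "b": 3}}}
response = json.loads(handle(json.dumps(request)))
print(response["result"]["content"][0]["text"])  # 5
```

Because discovery (`tools/list`) and invocation (`tools/call`) use the same message envelope for every tool, a client needs no tool-specific code to talk to a new server.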
Chapter 8.2: Key Takeaways
- Tools solve three LLM limits: knowledge cutoff, computation accuracy, world side effects; they turn text generation into action
- Function calling: structured JSON tool_call output → reliable, parseable tool invocation; the key enabler for production agents
- Tool schemas are prompts: precise descriptions and parameter constraints are critical for correct tool selection and argument generation
- Multi-step loop: tool results are added to context → the model reasons over accumulating evidence across multiple LLM calls
- Parallel tool use: call independent tools simultaneously; reduces latency ~40% with no code changes beyond handling multiple results
- MCP: universal standard for tool connectivity; any client works with any server, eliminating framework lock-in
The core insight of ReAct is simple but profound: don't just think, then act. Think, act, observe, think again, interleaving reasoning with real-world grounding. Each tool result updates the plan. Each thought commits to the next action. This is why modern agents are more reliable than either pure reasoning or pure acting alone.
Chain-of-thought prompting was originally a single-turn technique: adding "Let's think step by step" before the answer dramatically improved multi-step reasoning on math and logic tasks. In agents, CoT becomes something more structural: it is the backbone of every decision step. Before acting, the model writes out its reasoning. This reasoning serves as working memory and directly constrains the next action.
Verbalised reasoning helps agents in four concrete ways: it forces commitment to a plan before executing an irreversible action; it makes the agent's reasoning auditable, so you can inspect exactly why a choice was made; it enables error recovery, because bad reasoning is visible and can be interrupted; and it helps the model notice contradictions before they compound across multiple steps.
When an agent writes "Thought: I need to search for the current price first, then calculate the percentage change", it is not just narrating; it is programming its own next action. The thought IS the plan. This is why verbalised reasoning improves agent reliability: the model checks its own logic before committing to an action.
Yao et al. (Princeton/Google, 2022) introduced ReAct in "ReAct: Synergizing Reasoning and Acting in Language Models". The core insight: interleave reasoning traces (Thought) with tool-grounded actions (Action / Observation) step by step. Not "think then act" as two separate phases, but thinking and acting interwoven at every step.
Pure reasoning (CoT) lets models hallucinate facts with no grounding in reality: there is nothing to correct wrong assumptions. Pure acting wastes tool calls without strategy: the model fires searches randomly without a plan. ReAct addresses both: each Observation updates the model's plan; each Thought grounds the next Action in accumulated evidence. The original paper reported gains over act-only baselines on HotpotQA, FEVER, and WebShop; against CoT-only the picture on the knowledge tasks was more mixed, and the strongest results there came from combining ReAct with CoT.
Multi-hop questions require chaining multiple lookups where each result informs the next query. ReAct handles this naturally because each Observation is added to the context before the next Thought. Task: "What is the population of the capital city of the country that hosted the 2020 Olympics?" This requires three information hops.
ReAct requires no framework: it is just the standard tool-use loop with a system prompt that instructs the model to think before acting. The key is the system prompt structure and the loop that feeds observations back to the model. The implementation below is complete and runnable with the Anthropic API.

    from anthropic import Anthropic

    client = Anthropic()

    tools = [{
        "name": "search",
        "description": "Search the web for current information. Use for factual queries, current events, or when you need to look something up.",
        "input_schema": {
            "type": "object",
            "properties": {
                "query": {"type": "string", "description": "Search query"}
            },
            "required": ["query"]
        }
    }]

    def mock_search(query: str) -> str:
        # Canned results; a key matches only if it appears verbatim in the query
        results = {
            "2020 Summer Olympics host": "Held in Tokyo, Japan in 2021",
            "Tokyo population": "City: ~13.96M · Metro: ~37.4M (2024)"
        }
        for key in results:
            if key.lower() in query.lower():
                return results[key]
        return f"Search results for: {query}"

    def react_agent(task: str, max_steps: int = 10) -> str:
        system = (
            "You are a helpful agent. Think step by step before each action. "
            "Always start with a Thought explaining your reasoning before calling a tool."
        )
        messages = [{"role": "user", "content": task}]
        for step in range(max_steps):
            response = client.messages.create(
                model="claude-sonnet-4-6",
                max_tokens=1024,
                system=system,
                tools=tools,
                messages=messages
            )
            messages.append({"role": "assistant", "content": response.content})
            # Anything other than a tool request (end_turn, max_tokens, ...) ends
            # the loop; this avoids sending an empty tool_result message
            if response.stop_reason != "tool_use":
                return next((b.text for b in response.content if b.type == "text"),
                            "No final answer produced")
            tool_results = []
            for block in response.content:
                if block.type == "tool_use":
                    result = mock_search(block.input["query"])
                    print(f"  Action: {block.name}({block.input})")
                    print(f"  Observation: {result}\n")
                    tool_results.append({
                        "type": "tool_result",
                        "tool_use_id": block.id,
                        "content": result
                    })
            messages.append({"role": "user", "content": tool_results})
        return "Max steps reached without final answer"

    answer = react_agent(
        "What is the population of the capital of the 2020 Olympics host country?"
    )
    print(f"\nFinal Answer: {answer}")

Standard ReAct agents repeat mistakes across attempts: there is no mechanism to learn from failure within a session. Shinn et al. (2023) introduced Reflexion ("Reflexion: Language Agents with Verbal Reinforcement Learning") to address this. After each failed attempt, the agent generates a verbal self-critique explaining what went wrong and what to try differently. This critique is stored in memory and prepended to the next attempt.
Reflexion adds three components to the standard loop: an Evaluator that judges whether an attempt succeeded, a Self-Reflection step that generates a verbal critique on failure, and a Memory store that accumulates critiques across attempts. The result is a form of in-session learning without any gradient updates: pure verbal reinforcement.
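
The three components fit in a few lines once the model calls are abstracted away. The sketch below is a simplified reading of the idea rather than the paper's implementation: `attempt`, `evaluate`, and `reflect` are hypothetical callables that would each wrap an LLM call (or a test suite, for the evaluator), and the toy stubs exist only so the control flow can be run.

```python
from typing import Callable, List


def reflexion(task: str,
              attempt: Callable[[str, List[str]], str],
              evaluate: Callable[[str], bool],
              reflect: Callable[[str], str],
              max_attempts: int = 3) -> dict:
    critiques: List[str] = []                     # memory of verbal feedback
    for n in range(1, max_attempts + 1):
        answer = attempt(task, critiques)          # actor sees all past critiques
        if evaluate(answer):                       # evaluator judges the attempt
            return {"answer": answer, "attempts": n, "critiques": critiques}
        critiques.append(reflect(answer))          # self-reflection on failure
    return {"answer": None, "attempts": max_attempts, "critiques": critiques}


# Toy stubs standing in for LLM calls.
def toy_attempt(task: str, critiques: List[str]) -> str:
    return "4" if any("digits" in c for c in critiques) else "four"


def toy_evaluate(answer: str) -> bool:
    return answer.isdigit()


def toy_reflect(answer: str) -> str:
    return f"'{answer}' was rejected; answer using digits only."


outcome = reflexion("What is 2 + 2?", toy_attempt, toy_evaluate, toy_reflect)
print(outcome["answer"], outcome["attempts"])  # 4 2
```

The only state carried between attempts is the list of critiques, which is what makes this learning without weight updates.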
Yao et al. (2023) introduced Tree of Thoughts (ToT): instead of a single linear reasoning chain, maintain multiple candidate reasoning branches simultaneously. At each step, generate multiple next-thoughts, evaluate the promise of each (another LLM call), and explore the most promising, backtracking from dead ends using BFS or DFS.
Standard CoT picks one path and commits. If that path leads to a wrong conclusion there is no recovery. ToT is best for problems where early choices are high-stakes: puzzle solving, proof writing, strategic planning with multiple valid initial moves. The cost is significantly more LLM calls, often 5–20× more than ReAct. Use it only when the added cost is justified by problem complexity.
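
The breadth-first variant can be sketched as a beam search over partial solutions. In the toy below a "thought" is a tuple of numbers and the scorer is simple arithmetic; in an actual ToT system both `propose` and `score` would be LLM calls. All names here are hypothetical, and the puzzle (pick three numbers from 2 and 5 that sum to 9) is chosen only because a greedy single path fails on it.

```python
from typing import Callable, List, Tuple

State = Tuple[int, ...]


def tree_of_thoughts(propose: Callable[[State], List[State]],
                     score: Callable[[State], float],
                     is_goal: Callable[[State], bool],
                     depth: int, beam: int) -> State:
    frontier: List[State] = [()]
    for _ in range(depth):
        # Expand every surviving branch by one more thought.
        candidates = [nxt for state in frontier for nxt in propose(state)]
        if not candidates:
            break
        for c in candidates:
            if is_goal(c):
                return c
        # Keep only the most promising branches; the rest are pruned.
        frontier = sorted(candidates, key=score, reverse=True)[:beam]
    return max(frontier, key=score)


def propose(state: State) -> List[State]:
    return [state + (n,) for n in (2, 5)] if len(state) < 3 else []


def score(state: State) -> float:
    return -abs(9 - sum(state))          # closer to the target sum is better


def is_goal(state: State) -> bool:
    return len(state) == 3 and sum(state) == 9


print(tree_of_thoughts(propose, score, is_goal, depth=3, beam=2))  # (5, 2, 2)
print(tree_of_thoughts(propose, score, is_goal, depth=3, beam=1))  # (5, 5, 2)
```

With `beam=1` the search commits to the locally best branch `(5, 5)` and cannot reach 9; with `beam=2` the second-best branch `(5, 2)` survives and leads to the solution. The price is that every extra branch kept means extra `propose` and `score` calls.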
| Approach | Paths | Backtracking | LLM Calls | Best For |
|---|---|---|---|---|
| CoT | 1 linear chain | None | 1–3 | Simple reasoning, clear next step |
| ReAct | 1 path + tools | Implicit via observations | 3–10 | Most agent tasks, multi-hop queries |
| Reflexion | Multiple attempts | Between attempts | 5–30 | Tasks requiring iterative refinement |
| ToT | Multiple branches | Within attempt | 20–100+ | Hard puzzles, proofs, high-stakes planning |
Chapter 8.3: Key Takeaways
- CoT gives agents verbalised reasoning: it makes plans auditable and helps agents self-correct before committing to actions
- ReAct (Yao et al. 2022): interleave Thought → Action → Observation; grounding each reasoning step in tool results addresses the weaknesses of both pure CoT and act-only approaches
- Multi-hop reasoning: each Observation is added to context before the next Thought, which chains "2020 Olympics → Japan → Tokyo → Population" naturally
- Reflexion: verbal self-critique stored in memory; agents improve across successive failed attempts without gradient updates
- Tree of Thoughts: multiple reasoning branches explored and evaluated; best for high-stakes complex problems; expensive (20–100+ LLM calls)
- Production default: ReAct is the standard for most tasks; add Reflexion for iterative refinement; use ToT only when early mistakes are catastrophic
A stateless agent is amnesiac: it forgets everything between turns. A memory-augmented agent can recall a user's preferences from last week, learn from its own past mistakes, and maintain a coherent project context across hundreds of conversations. Memory is what turns a chatbot into a collaborator.
Agent memory maps directly onto human cognitive memory systems. In-context memory is working memory: everything visible to the model right now. Vector store memory is associative memory: retrieve by similarity ("what do I know about X?"). Episodic memory is autobiographical: specific events with timestamps ("what happened last session?"). Semantic/entity memory is world knowledge: structured facts ("who is Alice, what does she prefer?").
- In-context memory: the context window, holding all messages, tool results, and plans from the current session. The LLM sees all of it without any retrieval. Fast but finite: 8K–200K tokens depending on model. Lost when the session ends.
- Vector store memory: text embedded as dense vectors. Retrieve by semantic similarity, not exact key match. "What do I know about the user's preferences?" returns all relevant stored facts. Persistent across sessions.
- Episodic memory: timestamped logs of past interactions. "Last Monday we discussed the invoice API" is a specific event with when, what, and outcome. Enables cross-session continuity and learning from past attempts.
- Semantic/entity memory: structured facts about known entities. User profiles: name, role, preferences, communication style. Project facts: stack, status, blockers. Explicitly maintained, not inferred from logs.
The context window is the agent's working memory. Every message, tool result, plan, and observation from the current session lives here. The LLM sees all of it simultaneously: no retrieval needed, no similarity search, no latency penalty. It is the default memory for any agent and is sufficient for most short tasks.
The fundamental limitation is that the context window is finite. Models support 8K to 200K tokens depending on provider. Long multi-step tasks accumulate tokens rapidly: every tool result, every thought, every observation adds to the total. When context approaches the limit, the agent must decide what to keep, compress, or offload.
Keep only the last N messages. Oldest messages are dropped when context fills. Simple, no retrieval cost. Loses history permanently. Use for: short task-focused assistants.
Compress old turns progressively: recent turns in full detail, older turns as a paragraph summary, oldest as a single sentence. Never completely loses information. Use for: long-running support bots.
Keep all critical tool results but summarise conversational turns. Identify which information is load-bearing (facts, decisions, errors) vs. noise (filler, redundant acknowledgements).
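
The first two strategies can be sketched in a few lines. This is a minimal illustration under stated assumptions: `count_tokens` is a crude word count standing in for a real tokenizer, and `summarise` is a hypothetical callable that would normally be an LLM call; neither name comes from a library.

```python
from typing import Callable, Dict, List

Message = Dict[str, str]


def count_tokens(text: str) -> int:
    # Crude stand-in: one token per whitespace-separated word.
    return len(text.split())


def sliding_window(messages: List[Message], budget: int) -> List[Message]:
    """Keep the most recent messages that fit within the token budget."""
    kept: List[Message] = []
    used = 0
    for msg in reversed(messages):            # walk from newest to oldest
        cost = count_tokens(msg["content"])
        if used + cost > budget:
            break                             # everything older is dropped
        kept.append(msg)
        used += cost
    return list(reversed(kept))


def summarise_older(messages: List[Message], keep_recent: int,
                    summarise: Callable[[List[Message]], str]) -> List[Message]:
    """Replace all but the last keep_recent messages with one summary message."""
    split = len(messages) - keep_recent
    if split <= 0:
        return messages
    older, recent = messages[:split], messages[split:]
    summary = {"role": "user",
               "content": f"Summary of earlier turns: {summarise(older)}"}
    return [summary] + recent


history = [
    {"role": "user", "content": "My name is Alice and I use FastAPI"},
    {"role": "assistant", "content": "Noted"},
    {"role": "user", "content": "Run the tests"},
    {"role": "assistant", "content": "All tests passed"},
]

print(len(sliding_window(history, budget=7)))   # 3
compact = summarise_older(history, 2, lambda msgs: f"{len(msgs)} earlier messages")
print(compact[0]["content"])
```

The sliding window silently loses the user's name here, while the summarising variant at least has a chance to preserve it; that trade-off is the reason to pick one strategy over the other.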
External memory lives outside the context window, in a database, vector store, or file system. The agent interacts with it via explicit tool calls: write stores important information, search retrieves relevant information at query time. External memory enables persistence across sessions, scalability beyond context limits, and cross-session learning.
The write/search interface is the minimal design. Every agent memory system needs at minimum three operations: memory_write(key, content, tags) to store, memory_search(query, n) for semantic retrieval, and memory_get(key) for exact lookup. The implementation below is a working in-memory vector store using sentence-transformers.

    from datetime import datetime
    from typing import Optional

    import numpy as np
    from sentence_transformers import SentenceTransformer

    class AgentMemory:
        """Simple in-memory vector store for agent memories"""

        def __init__(self):
            self.model = SentenceTransformer('all-MiniLM-L6-v2')
            self.memories = []  # list of {key, content, embedding, timestamp, tags}

        def write(self, key: str, content: str, tags: Optional[list] = None) -> str:
            """Store a memory with its embedding"""
            embedding = self.model.encode(content)
            self.memories.append({
                "key": key,
                "content": content,
                "embedding": embedding,
                "timestamp": datetime.now().isoformat(),
                "tags": tags or []
            })
            return f"Stored memory: {key}"

        def search(self, query: str, n: int = 3) -> list:
            """Find n most relevant memories by cosine similarity"""
            if not self.memories:
                return []
            q_emb = self.model.encode(query)
            scores = [
                np.dot(q_emb, m["embedding"]) /
                (np.linalg.norm(q_emb) * np.linalg.norm(m["embedding"]))
                for m in self.memories
            ]
            top_n = sorted(zip(scores, self.memories), key=lambda x: -x[0])[:n]
            return [{"score": s, **m} for s, m in top_n]

        def get(self, key: str) -> Optional[dict]:
            """Exact lookup: return the most recent memory stored under key"""
            for m in reversed(self.memories):
                if m["key"] == key:
                    return m
            return None

    # Usage
    mem = AgentMemory()
    mem.write("user_pref_1", "User prefers Python over JavaScript", tags=["preference", "code"])
    mem.write("task_note_1", "User's project is a FastAPI service for invoice processing", tags=["project"])
    results = mem.search("What programming language does the user prefer?")
    print(results[0]["content"])  # expected: "User prefers Python over JavaScript"

Vector store memory retrieves information by meaning, not by exact key. Each piece of stored text is converted into a dense numerical vector (embedding) by an embedding model. At retrieval time, the query is embedded and the vector database returns the stored items with the highest cosine similarity. This makes it possible to ask "What do I know about the user's technical background?" and retrieve all relevant stored facts, even if they use completely different words.
Episodic memory stores specific past events with timestamps: not just what was learned, but when it happened, in what context, and with what outcome. For agents, episodic memory enables cross-session continuity: the agent on Friday remembers the conversation from Monday without the user needing to re-explain.
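
A minimal episodic store is little more than an append-only log with time-based queries. The sketch below is one possible shape, assuming a plain in-process list; `Episode`, `EpisodicLog`, and `recap` are hypothetical names, and a production system would persist the log to a database.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List, Optional


@dataclass
class Episode:
    timestamp: datetime
    summary: str      # what happened
    outcome: str      # how it ended


class EpisodicLog:
    def __init__(self) -> None:
        self.episodes: List[Episode] = []

    def record(self, summary: str, outcome: str,
               when: Optional[datetime] = None) -> None:
        self.episodes.append(Episode(when or datetime.now(), summary, outcome))

    def since(self, cutoff: datetime) -> List[Episode]:
        """Episodes at or after the cutoff, oldest first."""
        return sorted((e for e in self.episodes if e.timestamp >= cutoff),
                      key=lambda e: e.timestamp)

    def recap(self, cutoff: datetime) -> str:
        """Text block that could be prepended to a new session's prompt."""
        return "\n".join(f"[{e.timestamp:%Y-%m-%d}] {e.summary} ({e.outcome})"
                         for e in self.since(cutoff))


log = EpisodicLog()
log.record("Discussed the invoice API", "agreed on schema", datetime(2024, 6, 3))
log.record("Debugged failing tests", "fixed", datetime(2024, 6, 7))
print(log.recap(datetime(2024, 6, 5)))  # [2024-06-07] Debugged failing tests (fixed)
```

Injecting the output of `recap` at the start of a session is what lets the agent pick up where it left off without the user re-explaining.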
Semantic memory stores structured facts about known entities: timeless information distinct from episodic "when it happened" logs. A user entity has: name, role, technical preferences, communication style, current projects. A project entity has: name, stack, status, key files, blockers. This structured profile grows as the agent learns more and eliminates the need to re-ask the same onboarding questions.
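
An entity store can be sketched as a dictionary of profiles that are updated in place as new facts arrive. The names below (`Entity`, `EntityMemory`, `profile`) are hypothetical, and keeping superseded values in a `history` list is one design choice among several; the sketch assumes facts are simple string key/value pairs.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Entity:
    name: str
    kind: str                                          # e.g. "user" or "project"
    facts: Dict[str, str] = field(default_factory=dict)
    history: List[str] = field(default_factory=list)   # superseded values


class EntityMemory:
    def __init__(self) -> None:
        self._entities: Dict[str, Entity] = {}

    def update(self, name: str, kind: str, **facts: str) -> Entity:
        """Create the entity if needed, then merge in the new facts."""
        entity = self._entities.setdefault(name, Entity(name, kind))
        for key, value in facts.items():
            old = entity.facts.get(key)
            if old is not None and old != value:
                entity.history.append(f"{key}: {old}")  # keep what was replaced
            entity.facts[key] = value
        return entity

    def profile(self, name: str) -> str:
        """Render an entity as text suitable for injecting into a prompt."""
        entity = self._entities.get(name)
        if entity is None:
            return f"No stored facts about {name}."
        lines = [f"{k}: {v}" for k, v in sorted(entity.facts.items())]
        return f"{entity.name} ({entity.kind})\n" + "\n".join(lines)


entities = EntityMemory()
entities.update("Alice", "user", role="backend engineer", language="Python")
alice = entities.update("Alice", "user", language="Rust")
print(entities.profile("Alice"))
```

Unlike the episodic log, this structure answers "what is true now?" rather than "what happened when?", which is why the two are usually kept separate.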
| Use Case | Best Memory Type | Implementation | Persistence |
|---|---|---|---|
| Current conversation state | In-context | Context window messages | Session only |
| Recent task results | In-context | Tool result messages | Session only |
| Long conversation (>100 turns) | External (vector) | Chunked + embedded history | Across sessions |
| User preferences & profile | Semantic/entity | JSON profile + vector search | Permanent |
| Past task attempts & failures | Episodic | Timestamped summary logs | Permanent |
| Domain knowledge base | External (vector) | RAG pipeline on documents | Permanent |
| Cross-session continuity | Episodic + semantic | Combined: summaries + profile | Permanent |
Three patterns cover the vast majority of production agent memory designs. Pattern selection depends on task length, session frequency, and personalisation requirements.
Keep the last N messages. Drop oldest when context fills. Simple, no retrieval cost, no infrastructure needed. Loses history permanently. Best for: short task-focused assistants with well-scoped goals.
Recent turns: full detail. Older turns: paragraph summary. Oldest: one sentence. Never completely loses information; it compresses to the gist. Best for: long-running customer support, multi-day coding sessions.
All facts stored in vector DB. At each step: retrieve relevant memories + inject into context. Working memory = retrieved context, not full history. Infinite effective memory, small context footprint. Best for: personalised cross-session agents.
Chapter 8.4: Key Takeaways
- Four memory types: in-context (window), vector (semantic), episodic (logs), entity (facts); each serves a different recall need
- In-context is the default: fast and zero-retrieval, but limited by context window size and lost when the session ends
- Vector store: embed → store → retrieve by semantic similarity; answers "what do I know about X?" regardless of exact wording
- Episodic memory: timestamped session summaries; enables cross-session continuity without re-explanation
- Semantic/entity memory: structured profiles of users and domains; enables personalisation and avoids repetitive onboarding
- Most production agents need all four types working together: context for now, vector for knowledge, episodic for history, entity for identity
ReAct decides one step at a time. Planning decides the whole path before taking the first step. For short tasks, step-by-step is fine. For tasks with irreversible actions, dependencies, and ten or more steps, a plan prevents early mistakes that cannot be undone. The art is knowing when to plan, how deeply, and when to abandon the plan and replan.
For simple tasks, ReAct's step-by-step approach is perfectly adequate: decide, act, observe, repeat. Planning becomes necessary when tasks involve irreversible actions (sending emails, committing code, deleting files), long horizons where the agent may lose track of the original goal after ten or more steps, or parallel sub-tasks where independent work streams could be executed concurrently for efficiency.
Planning also enables human oversight: when a complete plan is generated before any action is taken, a human can review and approve the plan before irreversible operations start. This is one of the most practical safety mechanisms in production agents.
Decide step-by-step as observations accumulate. Works for: simple queries, short tasks, tasks where every step is reversible. Risk: no global strategy, can get lost on long tasks.
"Think step by step before acting": a loose structure in which a single LLM call outlines the approach before starting. Works for: medium complexity tasks requiring a rough roadmap but flexible execution.
Explicit numbered plan generated and tracked step by step. Enables human review before execution. Works for: irreversible actions, long tasks, tasks with clear sequential dependencies.
Goals decomposed into sub-goals recursively. Independent sub-tasks run in parallel. Works for: complex multi-domain tasks where specialised sub-agents handle different branches.
Wang et al. (2023) introduced Plan-and-Solve prompting, which asks the model to devise a plan first and then carry it out. Plan-and-Execute agents apply the same idea as a two-phase architecture: a Planner LLM call generates a complete numbered plan, and an Executor loop carries out each step using tools. This separation provides a critical advantage: the full plan is explicit before any action is taken, enabling human review, dependency analysis, and progress tracking.
```python
from anthropic import Anthropic

client = Anthropic()  # reads ANTHROPIC_API_KEY from the environment
MODEL = "claude-sonnet-4-6"


def create_plan(goal: str) -> list[str]:
    """Phase 1: Planner. Generate a complete numbered plan."""
    response = client.messages.create(
        model=MODEL,
        max_tokens=1024,
        messages=[{"role": "user", "content": f"""Create a numbered step-by-step plan for:
Goal: {goal}

Output ONLY a numbered list. Each step should be concrete and actionable. Max 7 steps."""}],
    )
    plan_text = response.content[0].text
    # Keep only the lines that start with a digit, i.e. the numbered steps
    return [line.strip() for line in plan_text.split("\n")
            if line.strip() and line.strip()[0].isdigit()]


def execute_step(step: str, completed: list[str]) -> str:
    """Phase 2: Executor. Carry out one plan step.

    This sketch asks the LLM directly; a real executor would dispatch to tools.
    """
    context = "\n".join(f"- {s}" for s in completed) or "(none yet)"
    response = client.messages.create(
        model=MODEL,
        max_tokens=1024,
        messages=[{"role": "user", "content": f"""Completed steps:
{context}

Execute this step and provide its output: {step}"""}],
    )
    return response.content[0].text


def plan_and_execute(goal: str) -> str:
    # Phase 1: generate the full plan (a human could review it here)
    plan = create_plan(goal)
    print(f"Plan ({len(plan)} steps):")
    for step in plan:
        print(f"  {step}")  # steps already carry their own numbers

    # Phase 2: execute each step, passing earlier results as context
    completed: list[str] = []
    for step in plan:
        result = execute_step(step, completed)
        completed.append(f"{step} -> {result[:80]}...")
        print(f"\nDone: {step}\n  {result[:200]}")
    return f"Completed {len(plan)} steps"


if __name__ == "__main__":
    plan_and_execute("Research and write a brief comparison of Redis vs Memcached")
```
Complex goals naturally form a tree of sub-goals. A top-level goal like "Build a data analysis report" decomposes into "Collect data", "Analyse data", and "Write report", each of which decomposes further into concrete executable steps. Hierarchical planning makes this structure explicit. Independent sub-trees can run in parallel; sequential dependencies are enforced between tree levels.
Static plans assume the world matches their assumptions. In practice, steps fail: the database is down, the API changed, the file doesn't exist. An agent that cannot adapt to failure is brittle. Replanning is the mechanism for detecting when reality has diverged from the plan and generating a revised plan from the current state.
Three strategies, in order of cost: step retry (try the same step differently); partial replan (regenerate only the remaining steps given the new situation); full replan (abandon the current plan entirely and regenerate from the current state). The key is that the agent must actively recognise a failure condition rather than blindly proceeding.
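The escalation from retry to replan can be sketched as a small control loop. This is an illustrative sketch only: `run_with_replanning` is a hypothetical name, and the `execute` and `replan` callables are assumptions standing in for tool execution and a planner LLM call. A full replan fits the same hook if `replan` returns an entirely new list of steps.

```python
from typing import Callable


def run_with_replanning(
    plan: list[str],
    execute: Callable[[str, int], str],
    replan: Callable[[list, list[str], Exception], list[str]],
    max_retries: int = 1,
    max_replans: int = 2,
) -> list[tuple[str, str]]:
    """Run plan steps in order, escalating from step retry to replanning."""
    completed: list[tuple[str, str]] = []
    remaining = list(plan)
    replans = 0
    while remaining:
        step = remaining[0]
        for attempt in range(max_retries + 1):
            try:
                # The attempt number lets execute() vary its approach on retry
                result = execute(step, attempt)
            except Exception as exc:
                error = exc  # remember the latest failure for the planner
                continue
            completed.append((step, result))
            remaining.pop(0)
            break
        else:
            # Every attempt failed: escalate to a replan, within a budget
            if replans >= max_replans:
                raise RuntimeError(f"giving up on {step!r}: {error}")
            replans += 1
            remaining = replan(completed, remaining, error)
    return completed


# Example with stubbed callables: the database step always fails,
# so the planner swaps it for a cache read.
def demo_execute(step: str, attempt: int) -> str:
    if step == "query_db":
        raise ConnectionError("database is down")
    return f"done:{step}"


def demo_replan(completed, remaining, error):
    return ["read_cache"] + remaining[1:]


demo_result = run_with_replanning(["fetch", "query_db", "report"], demo_execute, demo_replan)
```

The retry and replan budgets are what keep the loop from cycling forever when the environment stays broken.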
Zhou et al. (2023) introduced Least-to-Most Prompting, inspired by educational scaffolding. Rather than attacking a complex question directly, the method first decomposes it into simpler sub-problems, then solves each in order, using prior solutions as context for the next. This approach excels at compositional problems where a complex answer genuinely depends on simpler intermediate results.
```python
from anthropic import Anthropic

client = Anthropic()  # reads ANTHROPIC_API_KEY from the environment
MODEL = "claude-sonnet-4-6"


def ask(prompt: str, max_tokens: int) -> str:
    """Send a single user message and return the text of the reply."""
    response = client.messages.create(
        model=MODEL,
        max_tokens=max_tokens,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.content[0].text


def least_to_most(question: str) -> str:
    """Stage 1: decompose. Stage 2: solve each subproblem in order."""
    # Stage 1: decompose into ordered subproblems
    decomposition = ask(
        f"""Break this question into simpler subproblems that must be solved first,
ordered from simplest to most complex.
Question: {question}
Output: A numbered list of simpler subproblems.""",
        max_tokens=512,
    )
    subproblems = [line.strip() for line in decomposition.split("\n")
                   if line.strip() and line.strip()[0].isdigit()]

    # Stage 2: solve each subproblem, using prior answers as context
    context = f"Original question: {question}\n\nSubproblem solutions:\n"
    for sub in subproblems:
        answer = ask(f"{context}\nNow solve: {sub}", max_tokens=256)
        context += f"\n{sub}: {answer}"

    # Final answer using the accumulated subproblem solutions
    return ask(f"{context}\n\nNow answer the original question.", max_tokens=512)


if __name__ == "__main__":
    # Example: compositional maths/reasoning
    print(least_to_most(
        "If a train travels 120km in 1.5 hours and then 80km in 1 hour, "
        "what is its average speed?"
    ))
```
Subgoal decomposition is the general skill underlying all planning methods: given a goal, identify the minimal set of concrete sub-tasks that, when completed, guarantee the goal is achieved. The quality of decomposition determines everything downstream: bad decomposition leads to missing steps, incorrect dependencies, and wasted effort.
Steps must run in order, because each depends on the previous one. Use when: output of step N is an input to step N+1. Example: Collect data → Clean data → Analyse → Report.
Sub-tasks are independent, so all can run simultaneously. Use when: sub-tasks share the same input but produce separate outputs. Example: search 3 different sources in parallel.
Next sub-task depends on the result of a prior sub-task. Use when: branching logic exists. Example: "if search returns no results, try alternative query; else proceed to analyse."
Agent spends too many steps planning instead of acting ("analysis paralysis"), for example 10 planning steps for a 2-step task. Fix: limit planning depth, set a max plan length, use ReAct for simple tasks.
Agent follows the initial plan rigidly even when observations contradict it. Committed to the plan, ignores reality. Fix: explicit plan-evaluation step after each observation: "does the plan still make sense?"
Plan assumes step 4 can proceed before step 2 finishes. Parallel execution races cause unpredictable failures. Fix: build an explicit dependency graph before execution that identifies all step inputs and outputs.
In long plans, agent slowly drifts from the original goal, and sub-goals become ends in themselves. Fix: re-state the original goal at regular checkpoints in the context window.
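The fix for missing dependencies, an explicit dependency graph, can be sketched with Python's standard-library `graphlib`. This is a minimal sketch: `execution_batches` and the step names are hypothetical, and the graph maps each step to the set of steps it depends on. Steps in the same batch have no unmet dependencies and could run in parallel.

```python
from graphlib import TopologicalSorter


def execution_batches(dependencies: dict[str, set[str]]) -> list[list[str]]:
    """Group plan steps into batches; each batch only needs earlier batches."""
    sorter = TopologicalSorter(dependencies)
    sorter.prepare()  # raises graphlib.CycleError on circular dependencies
    batches: list[list[str]] = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # all steps whose inputs are done
        batches.append(ready)
        sorter.done(*ready)  # mark the batch finished to unlock successors
    return batches


# step -> steps it depends on
plan_graph = {
    "collect": set(),
    "search_web": set(),
    "clean": {"collect"},
    "analyse": {"clean"},
    "report": {"analyse", "search_web"},
}
batches = execution_batches(plan_graph)
```

Building the graph before execution also surfaces circular plans early, since `prepare()` fails before any step has run.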
Chapter 8.5: Key Takeaways
- Planning is essential for irreversible actions, long tasks, and parallel sub-tasks, where ReAct alone is insufficient.
- Plan-and-Execute: generate the full plan first, then execute step by step. This enables human review before any action runs.
- Hierarchical: decompose the goal tree recursively. Independent sub-trees run in parallel; sequential dependencies enforce ordering.
- Dynamic replanning: detect failure, then decide between retry and partial or full replan. This is critical for robustness in real environments.
- Least-to-Most: simpler subproblems are solved first, and each answer scaffolds the next for compositional tasks.
- Key pitfalls: over-planning, stale plans, missing dependencies, and goal drift over long tasks. All require explicit mitigation.
A single agent is a generalist. A team of agents is a specialised organisation. When tasks decompose into distinct roles (researcher, coder, critic, writer), routing each to a specialist often outperforms one agent doing everything. The Supervisor pattern has become the default architecture for production systems that need reliability, auditability, and scale.
A single agent with all tools solves many problems, but it hits four fundamental limits. Context overload: fifty tool calls in one context window can approach or exceed the model's limit. Specialisation: a code-writing specialist with a focused system prompt and code-specific tools outperforms a generalist at the same task. Parallelism: three independent research threads can run simultaneously in three agents for up to 3× throughput. Verification: a separate critic agent reviewing the author agent's output catches errors that self-review misses.
Independent sub-tasks run simultaneously. A research agent + a code agent + a writing agent can all work at once, giving up to a 3× speedup for parallel workstreams compared to sequential single-agent execution.
Each agent optimised for one role: researcher, coder, critic, planner. Focused system prompt + role-specific tool set → better per-role performance than a generalist attempting all roles.
Separate critic/reviewer agent checks the main agent's work independently. Two-agent "author + reviewer" consistently produces fewer errors than one agent doing both roles.
Route different task types to different specialised agents. Handle more users by running parallel agent instances. Different domains (legal, medical, code) get domain-specific agents.
Four structural patterns cover the vast majority of multi-agent system designs. Each makes different trade-offs between simplicity, flexibility, and parallelism.
The Supervisor/Worker pattern is the dominant architecture in production multi-agent systems. The Supervisor is an LLM whose sole job is orchestration: it understands the overall goal, decides which specialist to invoke and with what task, and synthesises results. It does not itself call tools. Worker agents are specialised: each has a focused system prompt, a specific tool set, and a narrow domain of responsibility.
```python
import operator
from typing import Annotated, TypedDict

from langchain_anthropic import ChatAnthropic
from langgraph.graph import END, StateGraph


class AgentState(TypedDict):
    goal: str
    messages: Annotated[list, operator.add]  # reducer: node outputs are appended
    next_agent: str
    final_output: str


llm = ChatAnthropic(model="claude-sonnet-4-6")
ROUTES = {"RESEARCHER": "researcher", "CODER": "coder", "WRITER": "writer", "FINISH": END}


def supervisor(state: AgentState) -> dict:
    """Decide which agent to call next."""
    prompt = f"""You are an orchestrator managing a team of agents.
Goal: {state['goal']}
Progress: {state['messages'][-3:] if state['messages'] else 'None'}
Available agents: RESEARCHER, CODER, WRITER, FINISH
Which agent should act next? Respond with ONLY the agent name."""
    choice = llm.invoke(prompt).content.strip().upper()
    # Return only the changed key: re-returning `messages` would make the
    # operator.add reducer append the existing history to itself.
    # Unrecognised answers fall back to FINISH instead of raising a KeyError.
    return {"next_agent": choice if choice in ROUTES else "FINISH"}


# Placeholder workers: each returns only the state keys it updates
def researcher(state: AgentState) -> dict:
    result = f"[Research results for: {state['goal']}]"
    return {"messages": [f"Researcher: {result}"]}


def coder(state: AgentState) -> dict:
    result = f"[Code for: {state['goal']}]"
    return {"messages": [f"Coder: {result}"]}


def writer(state: AgentState) -> dict:
    result = f"[Final document for: {state['goal']}]"
    return {"messages": [f"Writer: {result}"], "final_output": result}


# Build the graph
graph = StateGraph(AgentState)
for name, fn in [("supervisor", supervisor), ("researcher", researcher),
                 ("coder", coder), ("writer", writer)]:
    graph.add_node(name, fn)
graph.set_entry_point("supervisor")
graph.add_conditional_edges("supervisor", lambda s: s["next_agent"], ROUTES)
for node in ["researcher", "coder", "writer"]:
    graph.add_edge(node, "supervisor")  # workers always return to supervisor

app = graph.compile()
result = app.invoke({"goal": "Create a Python web scraper for HN jobs", "messages": []})
```
A handoff is the transfer of execution from one agent to another with relevant context. What gets transferred determines whether the receiving agent can proceed effectively: too little context and the agent re-does already-completed work; too much context and the receiving agent's context window overflows.
The minimal handoff payload should include: the current task description, relevant completed-step outputs, constraints and requirements, and optionally a summary of conversation history. The handoff type determines urgency and routing: specialisation (this requires coding → transfer to coder), escalation (this requires human approval → pause and notify), error (I failed → transfer to error-recovery agent), and completion (sub-task done → return to supervisor).
Current agent identifies a task outside its specialty. Passes: task description + context needed. Example: "This requires code generation. Handing to Coder agent."
Task requires approval or capabilities beyond any agent. Routes to human-in-the-loop. Example: "This action is irreversible. Notifying human for approval before proceeding."
Agent has failed and cannot self-recover. Passes: failure description + attempted approaches. Routes to error-recovery agent or supervisor for replanning.
Sub-task successfully completed. Returns result to supervisor with: output summary, status, any side effects created. Supervisor decides next step.
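The handoff payload and the four handoff types described above might be modelled as a small data structure. This is a hedged sketch: `Handoff`, `HandoffType`, `route`, and the destination names ("human", "error_recovery", "supervisor") are hypothetical choices for illustration, and the per-output character cap is one simple way to right-size the payload.

```python
from dataclasses import dataclass, field
from enum import Enum


class HandoffType(Enum):
    SPECIALISATION = "specialisation"
    ESCALATION = "escalation"
    ERROR = "error"
    COMPLETION = "completion"


@dataclass
class Handoff:
    type: HandoffType
    task: str
    target: str = ""  # destination agent; only used for specialisation
    completed_outputs: list[str] = field(default_factory=list)
    constraints: list[str] = field(default_factory=list)
    history_summary: str = ""  # optional summary of the conversation so far

    def to_prompt(self, max_output_chars: int = 200) -> str:
        """Render the payload, truncating each prior output to limit context size."""
        lines = [f"Task: {self.task}"]
        lines += [f"Completed: {out[:max_output_chars]}" for out in self.completed_outputs]
        lines += [f"Constraint: {c}" for c in self.constraints]
        if self.history_summary:
            lines.append(f"History: {self.history_summary}")
        return "\n".join(lines)


def route(handoff: Handoff) -> str:
    """Pick the next recipient from the handoff type."""
    if handoff.type is HandoffType.SPECIALISATION:
        return handoff.target
    if handoff.type is HandoffType.ESCALATION:
        return "human"
    if handoff.type is HandoffType.ERROR:
        return "error_recovery"
    return "supervisor"  # completion returns control to the supervisor
```

Keeping the payload typed makes it harder to forget constraints at a handoff, and the truncation parameter gives one place to tune how much context travels with the task.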
LangGraph (LangChain, 2024) is a graph-based framework for stateful multi-agent systems. Its core abstraction is a StateGraph: a directed graph where nodes are functions (agents or tools), edges are transitions, and a shared typed State dictionary persists across all nodes. Conditional edges route execution based on the current state, so the LLM's output can determine which node runs next.
LangGraph's key advantages over plain Python loops are built-in persistence (checkpoint state to a database to resume after failure and enable human-in-the-loop approvals), streaming (intermediate steps stream to the UI in real time), and parallel execution (fork-join for simultaneous node execution).
| Framework | Paradigm | Strengths | Best For | Abstraction |
|---|---|---|---|---|
| LangGraph | Graph-based stateful | Persistence, streaming, fine-grained control | Production agents, complex flows | Low (explicit graph) |
| AutoGen (Microsoft) | Conversational agents | Easy multi-agent chat, human-in-loop | Research, prototyping, group chat | Medium |
| CrewAI | Role-based crews | Simple role/goal definition, no graph | Rapid prototyping, simple multi-role | High (declarative) |
| LangChain Agents | Tool-using agents | Rich tool ecosystem, many integrations | Single agent with many tools | Medium |
| OpenAI Assistants | Thread-based managed | Built-in memory, code interpreter | Simple production assistants | High (managed) |
| Semantic Kernel | Plugin-based | .NET/Python, enterprise planning | Enterprise applications | Medium |
How agents communicate determines system reliability, debuggability, and scalability. Three communication patterns cover most production systems: shared state (agents read and write a common typed dictionary, as in LangGraph), message passing (agents send structured messages to each other, as in AutoGen), and function calling (supervisor calls worker as a tool with structured input/output). Shared state is easiest to debug; message passing is most flexible; function calling is the most familiar interface for developers.
All agents read/write a common typed state dictionary. Every step is visible to every agent. Easy to inspect and debug. Used by LangGraph. Risk: state conflicts if agents write the same field.
Agents send structured messages to each other. Each agent maintains its own conversation thread. Flexible, decoupled. Used by AutoGen. Risk: message format mismatches between agents.
Supervisor calls workers as tools with structured JSON input/output. Clean interface boundary. Workers are stateless from the supervisor's perspective. Risk: loses intermediate worker context.
Chapter 8.6: Key Takeaways
- Multiple agents enable parallelism, specialisation, verification, and scale. They are justified when a single agent hits context or quality limits.
- Four topologies: Sequential, Fan-out, Supervisor/Worker, Peer Debate. Supervisor is the dominant production pattern.
- Supervisor pattern: an orchestrator routes tasks to specialised workers. LangGraph provides the graph infrastructure for this.
- Handoffs pass context between agents. Too little means re-work; too much means context overflow. Right-size the handoff payload.
- LangGraph: explicit stateful graph with persistence and streaming. Best for production requiring reliability and human-in-the-loop.
- CrewAI/AutoGen for rapid prototyping; LangGraph for production systems requiring checkpointing and auditability.
You cannot improve what you cannot measure. And you cannot deploy what you cannot trust. Agent evaluation is harder than model evaluation because agents act, and actions can be irreversible. The same agent solving the same task may take a completely different path each time. Building reliable agents means knowing where they fail, how often, and why, and designing safety mechanisms before those failures have real-world consequences.
Standard LLM evaluation is straightforward: fixed input, expected output, compute a score (BLEU, accuracy, human preference). Agent evaluation is fundamentally harder across five dimensions: multiple valid paths (many different tool call sequences can reach the correct answer, so which counts?); process vs outcome (did the agent succeed by doing the right thing, or by luck?); stochasticity (same task, different execution every run); long horizons (50 steps: where exactly did it go wrong?); and irreversibility (some failures cannot be undone, limiting how many evaluation runs you can afford).
Production agent evaluation requires a controlled simulation environment: tools return deterministic results from a test fixture, enabling repeatable evaluation of the agent's decision-making independently of external API variability.
A task may be solvable by 20 different tool call sequences. Exact-match evaluation is meaningless. Must evaluate the outcome, not the path, or evaluate both separately.
Same agent, same task → different execution each time. A single-run evaluation is unreliable. Need N ≥ 10 runs per task to get a stable success rate estimate.
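To see why a handful of runs is not enough, it helps to put a confidence interval around the measured success rate. The sketch below uses the Wilson score interval, which is a choice made for this illustration rather than something the text prescribes; `success_rate_interval` is a hypothetical helper name and `z = 1.96` corresponds to roughly 95% confidence.

```python
import math


def success_rate_interval(outcomes: list[bool], z: float = 1.96) -> tuple[float, float, float]:
    """Return (rate, low, high) for repeated runs, using the Wilson score interval."""
    n = len(outcomes)
    if n == 0:
        raise ValueError("need at least one run")
    p = sum(outcomes) / n  # observed success rate
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return p, centre - half_width, centre + half_width


# 8 successes out of 10 runs of the same task
rate, low, high = success_rate_interval([True] * 8 + [False] * 2)
```

With 8 successes in 10 runs the interval spans roughly 0.49 to 0.94, so ten runs is better read as a floor than as a comfortable sample size; the interval tightens as more runs are added.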
"Right answer, wrong process" is a real failure mode. An agent that guesses correctly without using tools may be brittle on harder variants. Both outcome and trajectory quality matter.
No single metric captures agent quality. Production agent monitoring requires tracking at least six dimensions simultaneously: success, efficiency, recovery, safety, cost, and latency. An agent that succeeds 90% of the time but costs 10× more than necessary is not production-ready.
| Metric | Formula | What It Measures | Target |
|---|---|---|---|
| Task Success Rate | correct / total tasks | End-to-end task completion | >80% |
| Step Efficiency | optimal_steps / actual_steps | Tool call efficiency, no wasted steps | >0.7 |
| Error Recovery Rate | recoveries / total errors | Robustness to tool failures | >70% |
| Safety Rate | 1 − violations / actions | Avoidance of unsafe actions | >99% |
| Cost per Task | $ tokens + $ tool calls | Economic efficiency | Benchmark-dependent |
| P90 Latency | 90th percentile wall-clock | Real-world responsiveness | <30s typical |
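The formulas in the table can be computed directly from per-run logs. The sketch below is one possible shape for that calculation: `RunRecord` and `summarise` are hypothetical names, the field layout is an assumption about what a run log contains, and P90 latency uses the nearest-rank method.

```python
from dataclasses import dataclass


@dataclass
class RunRecord:
    success: bool
    actual_steps: int
    optimal_steps: int
    errors: int
    recoveries: int
    actions: int
    violations: int
    cost_usd: float
    latency_s: float


def summarise(runs: list[RunRecord]) -> dict[str, float]:
    """Aggregate the six metrics from the table over a list of runs."""
    if not runs:
        raise ValueError("need at least one run")
    n = len(runs)
    total_errors = sum(r.errors for r in runs)
    total_actions = sum(r.actions for r in runs)
    latencies = sorted(r.latency_s for r in runs)
    p90_index = (9 * n + 9) // 10 - 1  # ceil(0.9 * n) - 1, in integer arithmetic
    return {
        "task_success_rate": sum(r.success for r in runs) / n,
        "step_efficiency": sum(r.optimal_steps / r.actual_steps for r in runs) / n,
        # With no errors observed there was nothing to recover from
        "error_recovery_rate": sum(r.recoveries for r in runs) / total_errors if total_errors else 1.0,
        "safety_rate": 1 - sum(r.violations for r in runs) / total_actions if total_actions else 1.0,
        "cost_per_task": sum(r.cost_usd for r in runs) / n,
        "p90_latency_s": latencies[p90_index],
    }


example_runs = [
    RunRecord(True, 5, 4, 1, 1, 5, 0, 0.10, 12.0),
    RunRecord(False, 10, 5, 2, 1, 10, 1, 0.30, 40.0),
]
metrics = summarise(example_runs)
```

Step efficiency is averaged per run here; weighting by total steps instead is an equally defensible choice and would give a different number.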
| Benchmark | Domain | Measure | Top Score (2024) | Notes |
|---|---|---|---|---|
| SWE-bench | Software engineering | % GitHub issues resolved | ~50% (best systems) | Hard: requires real codebase understanding |
| WebArena | Web navigation | Task success rate | ~35–50% | Browse, fill forms, extract info |
| AgentBench | 8 domains (OS, DB, web, game) | Avg task success | ~50–60% | Diverse agent task suite |
| HotpotQA | Multi-hop QA | EM + F1 score | ~70–80% | ReAct baseline well-established |
| ALFWorld | Household navigation (text) | Task success rate | ~90%+ | Simulated environment |
| GAIA | General AI (diverse) | % correct (requires tools) | ~50% frontier models | Real-world tools required |
Agent generates arguments that don't match the tool schema, e.g. search(url='...') when the schema requires search(query='...'). Fix: strict JSON schema validation before execution, input sanitisation.
Agent calls the same tool repeatedly with the same arguments, making no progress. Caused by unhelpful tool results that don't resolve the impasse. Fix: max_steps limit, same-tool+args loop detection.
After many steps, agent forgets the original goal and pursues sub-goals as ends in themselves. Fix: re-state original goal in system prompt, periodic goal-check steps in the agent loop.
Tool results + messages exceed the context window. Model truncation corrupts task state, and the agent loses track of what it was doing. Fix: summarise old messages, external memory, limit tool result size.
Agent takes actions outside intended scope: sends emails not requested, deletes files to "clean up". Fix: explicit scope constraints in system prompt, tool allow-list, human approval gates for irreversible actions.
One failed tool call causes misinterpretation of state, and all subsequent steps are wrong because they are built on a bad premise. Fix: explicit error detection, tool result validation, partial-plan recovery on failure.
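Two of the fixes listed above, a `max_steps` limit and same-tool+args loop detection, fit in one small guard object. This is a sketch under assumptions: `LoopGuard` is a hypothetical class, the default thresholds are arbitrary, and arguments are compared by their sorted JSON serialisation.

```python
import json


class LoopGuard:
    """Aborts an agent loop on a step budget or on repeated identical tool calls."""

    def __init__(self, max_steps: int = 20, max_repeats: int = 2):
        self.max_steps = max_steps
        self.max_repeats = max_repeats
        self.steps = 0
        self.seen: dict[str, int] = {}  # call signature -> times seen

    def check(self, tool: str, args: dict) -> None:
        """Call before executing each tool call; raises RuntimeError to abort."""
        self.steps += 1
        if self.steps > self.max_steps:
            raise RuntimeError(f"step budget of {self.max_steps} exceeded")
        # Sorted keys so that argument order cannot hide a repeated call
        key = tool + ":" + json.dumps(args, sort_keys=True, default=str)
        self.seen[key] = self.seen.get(key, 0) + 1
        if self.seen[key] > self.max_repeats:
            raise RuntimeError(
                f"loop detected: {tool} called {self.seen[key]} times with identical arguments"
            )
```

In an agent loop, `guard.check(tool_name, tool_args)` would run just before each tool execution, and the raised error can be fed back to the model or used to trigger a replan.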
Not all agent actions should be fully autonomous. Actions exist on a risk spectrum: read-only actions (search, read files) carry no irreversible risk and can auto-execute; reversible writes (draft email, temp file) are low risk; irreversible actions (send email, delete file, submit form) require explicit approval; critical actions (financial transactions, public communications, security changes) always require human review.
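The risk spectrum above can be turned into an approval gate. The sketch below is illustrative: the `Risk` tiers follow the four categories in the text, while the `TOOL_RISK` mapping, the tool names, and the `approve` callback (which stands in for a human-in-the-loop prompt) are assumptions made for this example.

```python
from enum import IntEnum
from typing import Callable


class Risk(IntEnum):
    READ_ONLY = 0     # search, read files: auto-execute
    REVERSIBLE = 1    # drafts, temp files: low risk
    IRREVERSIBLE = 2  # send, delete, submit: explicit approval
    CRITICAL = 3      # money, public comms, security: always human review


# Hypothetical tool names, mapped to the tiers described above
TOOL_RISK = {
    "search": Risk.READ_ONLY,
    "read_file": Risk.READ_ONLY,
    "draft_email": Risk.REVERSIBLE,
    "send_email": Risk.IRREVERSIBLE,
    "delete_file": Risk.IRREVERSIBLE,
    "transfer_funds": Risk.CRITICAL,
}


def classify(tool: str) -> Risk:
    """Unknown tools default to the most restrictive tier."""
    return TOOL_RISK.get(tool, Risk.CRITICAL)


def run_tool(tool: str, action: Callable[[], str], approve: Callable[[str, Risk], bool]) -> str:
    """Execute action() directly for low-risk tools; otherwise ask for approval first."""
    risk = classify(tool)
    if risk >= Risk.IRREVERSIBLE and not approve(tool, risk):
        return f"BLOCKED: {tool} requires approval"
    return action()
```

Defaulting unclassified tools to `CRITICAL` means a newly added tool cannot slip past the gate just because nobody assigned it a tier.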
Prompt injection is a class of attack where malicious content in the environment hijacks the agent's behaviour. Unlike a chatbot where injection only generates text, an agent can execute code, send emails, and access files, which makes prompt injection potentially catastrophic rather than merely embarrassing.
Indirect injection is the most dangerous variant for agents: malicious instructions embedded in a web page, retrieved document, or tool result, that is, content the agent is supposed to process, not follow. The agent doesn't distinguish "content to read" from "instructions to follow" without explicit architectural protections.
Give the agent ONLY the tools it needs for its specific task. A research agent doesn't need email tools. A customer service agent doesn't need file deletion. Minimum tool surface = minimum blast radius.
Design reversible tool variants: draft_email not send_email · move_to_trash not delete · stage_changes not commit. When irreversible is unavoidable, require explicit confirmation.
Define clear triggers: any action affecting >N users · any financial action >$X · any irreversible modification · any external communication. Automate the classification, not the override.
Log all tool calls with: timestamp, agent ID, tool name, args, result, latency. Immutable audit trail for debugging, compliance, and rollback. Essential for any agent with real-world consequences.
Run code execution in isolated containers: network restrictions · filesystem limits · CPU/memory caps · timeout enforcement. Never execute agent-generated code in the host environment directly.
State both permissions and prohibitions: "You MAY: read files in /workspace, search the web, run Python. You may NEVER: send emails, modify system files, access credentials." Explicit prohibition reduces accidental violations.
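Least privilege and audit logging can be combined in a thin wrapper around tool dispatch. The following is a minimal sketch, not a hardened implementation: `ToolRegistry` is a hypothetical class, the log is an in-memory list of JSON lines (real immutability would need append-only storage outside the agent's reach), and results are truncated to 200 characters as an arbitrary choice.

```python
import json
import time
from typing import Any, Callable


class ToolRegistry:
    """Dispatches only allow-listed tools and records every attempt."""

    def __init__(self, agent_id: str, allowed: dict[str, Callable[..., Any]]):
        self.agent_id = agent_id
        self._allowed = dict(allowed)    # tool name -> callable
        self.audit_log: list[str] = []   # one JSON line per tool call

    def call(self, name: str, **args: Any) -> Any:
        if name not in self._allowed:
            self._record(name, args, "DENIED: tool not on allow-list", 0.0)
            raise PermissionError(f"{self.agent_id} may not call {name}")
        start = time.perf_counter()
        result = self._allowed[name](**args)
        self._record(name, args, result, time.perf_counter() - start)
        return result

    def _record(self, tool: str, args: dict, result: Any, latency_s: float) -> None:
        entry = {
            "timestamp": time.time(),
            "agent_id": self.agent_id,
            "tool": tool,
            "args": args,
            "result": str(result)[:200],  # cap size so logs stay readable
            "latency_s": round(latency_s, 4),
        }
        self.audit_log.append(json.dumps(entry, default=str))


# A research agent gets a search tool and nothing else
registry = ToolRegistry("researcher", {"search": lambda query: f"results for {query}"})
```

Denied calls are logged as well as successful ones, since attempts to reach tools outside the allow-list are often the earliest sign of scope creep or injection.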
Chapter 8.7: Key Takeaways
- Agent eval is hard: multiple valid paths, stochastic execution, long horizons, and irreversible actions. It requires simulation environments.
- Six key metrics: task success, step efficiency, error recovery, safety violations, cost, latency. No single number suffices.
- Most common production failures: context overflow (35%) and goal drift (28%). Address these first.
- Human-in-the-loop: irreversible and high-risk actions require approval gates. Classify risk, then route automatically.
- Prompt injection: malicious content in tool results can hijack agent instructions. Separate instruction and content contexts, and apply allow-listing.
- Safe design pillars: least privilege, reversible actions, explicit scope constraints, sandbox execution, immutable audit trail.
The most important question about AI agents is not how they work in research papers but how they fail in production. Every frontier lab in 2024 is shipping agents. The gap between demo and production is where fortunes are made and lost. This chapter is about closing that gap.
By 2024, AI agents have moved from research demonstrations to production products used by millions of developers and consumers. The systems below represent the state of the art across different application domains, and each offers a case study in a different architectural approach to the core challenges of reliability, autonomy, and safety.
First "AI software engineer": autonomously completes software engineering tasks end-to-end. Tools: shell, code editor, browser, test runner. Reads issue → plans → writes code → runs tests → debugs → submits PR. Architecture: long-horizon planning + parallel exploration.
Open-source code agent for research. Key insight: purpose-built ACI (Agent-Computer Interface) tools such as search_code, edit_file, and find_function outperform generic bash tools. SWE-bench: ~18% with improved tools.
IDE-integrated code agent. Reads issue/PR → generates plan → implements across multiple files. Human reviews the plan before execution starts. Deep repo context: file tree, PR history, test results. Available to millions of GitHub users.
Perceives a computer screen via screenshots, acts via mouse/keyboard. Can operate any GUI application as a human would, with no API needed. Architecture: multimodal LLM (vision) + computer action tools. Use cases: forms, legacy software, UI testing.
Web automation agent integrated into ChatGPT. Completes multi-step web tasks: book flights, fill forms, complete purchases. Safety-first: only proceeds for clearly benign tasks, pauses for human confirmation on sensitive or irreversible actions.
Research agent: plans search queries, executes multiple searches, synthesises into cited answers. Lower autonomy (Level 2–3) but highest reliability in its domain. Best research product available to consumers in 2024.
| System | Domain | Architecture | Autonomy | Notable |
|---|---|---|---|---|
| Devin | Software engineering | ReAct + planning + specialised tools | High | First "AI software engineer" |
| SWE-Agent | Code + GitHub | ACI + specialised tools | Medium-High | Open source, research |
| Copilot Workspace | IDE + code | Plan-then-execute | Medium (human reviews plan) | Mass market, GitHub |
| Claude Computer Use | Any GUI | Screenshot → action loop | High | Any app, no API needed |
| OpenAI Operator | Web automation | Web browsing + actions | Medium | Integrated in ChatGPT |
| Perplexity | Research | Search + synthesis | Low-Medium | Best research product 2024 |
Code agents are the most mature agent category. The reasons are structural: code is verifiable (it either passes the tests or it doesn't), safe to iterate (test → error → fix is a tight feedback loop with no irreversible side effects), and the tools are well-defined (shell, editor, test runner: standard interfaces that haven't changed in decades).
The key lesson from SWE-Agent is that tool design matters as much as the model. Generic tools (bash, read_file) force the agent to navigate file systems and parse raw output manually. Purpose-built ACI tools (search_code, edit_function, apply_patch) return structured results that map directly to how a developer thinks about code, dramatically reducing the reasoning burden per step.
Computer use agents are the most general form of agent: instead of purpose-built APIs, the agent perceives a screenshot of any GUI application and takes actions via simulated mouse clicks and keyboard input. No integration work is required: if a human can see and click it, the agent can too.
Three patterns cover the architecture of most production agent deployments. Choosing the right pattern depends on task duration, response time requirements, and whether tasks can run independently in the background.
Production agents can cost $0.05–$5.00 per task depending on complexity. The dominant cost driver is LLM tokens, particularly in the planning and synthesis steps. The dominant latency driver is the number of serial LLM calls. Both can be dramatically reduced through model routing (match model size to step complexity), parallel tool execution, and result caching.
| Agent Step | Complexity | Recommended Model | Est. Cost / 1K tokens | Latency |
|---|---|---|---|---|
| Initial goal analysis | High | Claude Opus / GPT-4o | $0.015 | 2–4s |
| Planning generation | High | Claude Sonnet / GPT-4o | $0.003 | 1–3s |
| Tool call routing | Low | Claude Haiku / GPT-4o-mini | $0.00025 | 0.3–0.8s |
| Tool result parsing | Low | Claude Haiku | $0.00025 | 0.3–0.5s |
| Error recovery | Medium | Claude Sonnet | $0.003 | 1–2s |
| Final synthesis | High | Claude Sonnet / Opus | $0.003–0.015 | 1–4s |
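A quick back-of-the-envelope calculation shows how much model routing can matter. The per-1K-token prices below are taken from the table; the step list and token counts are hypothetical, and the table's single price per model is treated as a blended rate (real pricing usually distinguishes input from output tokens).

```python
# Estimated cost per 1K tokens, from the table above
PRICE_PER_1K = {"opus": 0.015, "sonnet": 0.003, "haiku": 0.00025}

# (step, model tier, tokens used): token counts are made up for illustration
ROUTED_STEPS = [
    ("goal_analysis", "opus", 2000),
    ("planning", "sonnet", 3000),
    ("tool_routing", "haiku", 4000),
    ("tool_parsing", "haiku", 6000),
    ("synthesis", "sonnet", 3000),
]


def task_cost(steps: list[tuple[str, str, int]]) -> float:
    """Sum the cost of each step: tokens / 1000 * price for that step's model."""
    return sum(tokens / 1000 * PRICE_PER_1K[model] for _, model, tokens in steps)


routed_cost = task_cost(ROUTED_STEPS)
# Same workload, but every step sent to the largest model
all_opus_cost = task_cost([(name, "opus", tokens) for name, _, tokens in ROUTED_STEPS])
```

For this hypothetical 18K-token task the routed version comes to about $0.05 against about $0.27 for sending everything to the largest model, a saving of roughly 5×, with most of it coming from the high-volume, low-complexity routing and parsing steps.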
The open problems in agentic AI (2024–2026) are: long-horizon reliability (tasks spanning hours or days, with errors compounding over hundreds of steps), cross-agent trust (how does agent A know agent B is trustworthy, not compromised or hallucinating), persistent identity (memory that degrades gracefully over months, not sessions), and self-improving agents (agents that improve their own tools and strategies through experience rather than requiring manual retraining).
Specialist vertical agents (legal, medical, finance). Enterprise deployment platforms. Agent-to-agent marketplaces. MCP as universal tool standard. RL-trained agents from real task outcomes. Computer use at production reliability.
Multi-month task horizons. Agents that manage other agents at scale. Self-improving tool use. Persistent agent identities across years. Deep integration with physical systems (robotics + agents). Standardised agent-to-agent protocols.
Domain 8 Complete: Agentic AI
- Ch 8.1: Agent = LLM + tools + memory + goal. Agency spectrum: Level 1 (chatbot) → Level 5 (autonomous). Most 2024 systems are Level 3–4.
- Ch 8.2: Function calling = structured tool invocation. Tool schemas are prompts, so precise descriptions prevent failures. MCP standardises tool connectivity.
- Ch 8.3: ReAct: Thought → Action → Observation loop interleaves reasoning with grounding. Reflexion adds verbal self-critique for iterative improvement.
- Ch 8.4: Four memory types: in-context, vector store, episodic, semantic. Context management is critical for long tasks.
- Ch 8.5: Plan-and-Execute for complex tasks; dynamic replanning for failures. Planning prevents irreversible early mistakes.
- Ch 8.6: Supervisor/Worker is the dominant production multi-agent pattern. LangGraph for production; CrewAI/AutoGen for prototyping.
- Ch 8.7: Agent failures: context overflow and goal drift are most common. Human-in-the-loop gates required for high-risk irreversible actions.
- Ch 8.8: Production agents (Devin, Copilot Workspace, Claude computer use) are here now. Cost and latency optimisation via model routing and parallel tool execution are essential.
This Foundation chapter introduced the concepts. But production agents are far more complex:
- Tool reliability issues: what happens when APIs fail or return unexpected results
- Latency constraints: making agents feel fast while managing multiple LLM calls
- Hallucinated actions: agents confidently executing the wrong tool or wrong arguments
- Orchestration challenges: coordinating multi-step agents with human checkpoints
→ Covered in depth: AI Agents in Production (Advanced)
Agentic AI is where all previous domains converge. Domain 2 (maths): the reasoning the LLM uses. Domain 4 (deep learning): the model powering the agent brain. Domain 5 (NLP): the language understanding and generation. Domain 6 (CV): computer use agents seeing the screen. Domain 7 (RL): the RLHF that aligned the agent to be helpful. An agent is the sum of everything we've built.
The question that remains, and the one Domain 9 addresses, is: as these agents become more capable and more autonomous, how do we ensure they remain aligned with human values?