AI Foundation · Domain 08

AI Agents — Autonomous LLM Systems

Tool use, memory, reasoning patterns, multi-agent collaboration, and production deployment of LLM-powered autonomous systems.

8.1

Chapter 8.1

What Are AI Agents? — From Chatbots to Autonomous Systems

A chatbot responds. An agent acts. The difference is not intelligence — it is the ability to take actions with real-world consequences: browsing the web, writing files, executing code, calling APIs, and persisting across multiple steps toward a goal. When you give an LLM tools and a goal, you get an agent.

LLM vs Agent Core

A standard LLM is a stateless input-output transformation: a message goes in, text comes out. Between turns there is no persistent state, no access to the outside world, and no capability to take actions beyond generating words. It is powerful but passive.

An AI agent wraps that LLM with tools, memory, and a loop. The LLM becomes the "brain" — it decides what to do. Tools are the "hands" — they actually do it. The agent pursues a goal across multiple steps, using the result of each action to inform the next. Real-world example: Devin reads a GitHub ticket, writes code across 15 files, runs the test suite, and iterates until all tests pass — without a human in the loop for each file.

Property	Standard LLM	AI Agent
Interaction style	Single-turn input → output	Multi-step task execution
State	Stateless between turns	Maintains state across steps
Output type	Text only	Real-world actions (files, APIs, code)
External access	None	Tools, APIs, databases, browsers
Model calls per task	One	Many (until goal achieved)
Side effects	None	Creates files, sends emails, runs code

LLM vs Agent — text response vs goal-directed multi-step action

The Agency Spectrum In-depth

Agency is not binary. It exists on a spectrum from a pure LLM that never touches the outside world, to a fully autonomous system that operates for hours without human input. Understanding where on this spectrum your system sits is the first design decision in building any agent — it determines safety requirements, reliability challenges, and appropriate use cases.

💬

Level 1 — Pure LLM

Chat only. No tools, no memory, no actions. User asks → LLM answers. Example: vanilla ChatGPT without plugins.

🔧

Level 2 — LLM + Fixed Tools

One-shot tool use per turn. ChatGPT with web search enabled. Tool execution is deterministic — one search per turn.

⚡

Level 3 — Single-step Agent

Plans and calls tools, but one action per user turn. Code Interpreter: executes one cell, returns results.

🔄

Level 4 — Multi-step Agent

Loops: think → act → observe → repeat until goal reached. Claude computer use, AutoGPT, most LangChain agents. Most 2024 production agents are here.

🤖

Level 5 — Autonomous Agent

Operates indefinitely without human input. Self-assigns subtasks, spawns sub-agents, recovers from failures. Devin, SWE-Agent on multi-day tasks.

📍

2024 Sweet Spot: Level 3–4

Most reliable production deployments. Level 5 is emerging but requires careful safety design, sandboxing, and human-in-the-loop checkpoints.

The Agency Spectrum — from pure LLM response to fully autonomous action

Agent Anatomy In-depth

Every agent — regardless of framework or task domain — is built from the same four primitives. These are not optional enhancements; they are the minimal set of components required for an LLM to pursue a multi-step goal in the world.

🧠

1 — LLM Brain

The reasoning engine. Decides what to do, interprets results, generates plans, selects tools, evaluates progress. All "intelligence" lives here — everything else is infrastructure.

🔧

2 — Tools

The interface to the world. Each tool has a name, description, input schema, and callable implementation. LLM selects tool + arguments. Examples: web search, code executor, file I/O, database query.

🗂️

3 — Memory

Short-term: context window — all prior messages + tool results. Long-term: vector store or database for knowledge persisting across sessions. Full treatment in Ch 8.4.

📋

4 — Planning System

How the agent decides what to do next. Simple: one LLM call per step. Complex: ReAct loops, tree-of-thought, plan-and-execute. Planning strategy determines reliability and capability. See Ch 8.3.

Agent Anatomy — LLM Brain + Tools + Memory + Planning

Perception & Action Space Core

Designing an agent requires explicitly specifying what it can see (perception space) and what it can do (action space). These two boundaries define capability and determine risk. An unrestricted write action space with irreversible operations is dangerous; a read-only agent is safe but limited.

👁️

Agent Perception (What It Sees)

Natural language messages and instructions
Document and file contents (PDF, code, data)
Web page HTML and rendered screenshots
Tool call results and API responses
Database and vector store query results
Structured JSON / API response payloads
Full conversation history (context window)

⚡

Agent Actions (What It Does)

Generate text responses, plans, and analysis
Call external tools with structured arguments
Write and execute code (Python, bash, SQL)
Control a computer (mouse, keyboard, navigation)
Read and write files and databases
Send emails, post messages, create tickets
Spawn sub-agents to handle subtasks

The Agent Loop In-depth

The agent loop is the fundamental runtime of any multi-step agent. It is deceptively simple: observe the current state, ask the LLM what to do, execute the chosen action, update the context, and repeat. Everything else — ReAct, tool calling, memory retrieval, planning — is a variation on this core loop.

One important consequence: every loop iteration adds tokens to the context window. Long tasks can exhaust the context limit. A production agent must decide what to keep verbatim, what to summarise, and what to offload to long-term memory. This is one of the primary engineering challenges in building reliable agents.

The agent loop is not deterministic. The same goal can take 3 steps or 30 depending on what the LLM decides, what tool results come back, and what errors occur along the way. Reliability engineering for agents means designing the loop to handle all three: success, recoverable failure, and unrecoverable failure — with graceful exits for each.

Agent Loop — Observe → Think → Act → Update → repeat until done

Agent Types Core

🔧

Tool-Using Agents

Call external functions: search engines, calculators, APIs, databases. The most common agent type. LLM selects which tool and what arguments. Examples: Perplexity (search), Code Interpreter, Claude with tools.

🤝

Conversational Agents

Multi-turn dialogue with users or other agents. Track conversation history, maintain context, handle follow-up questions. Examples: customer service bots, tutoring agents, interview assistants.

💻

Code Agents

Write and execute code as the primary action. Can install packages, run tests, debug errors, and iterate until code works. Examples: Devin, SWE-Agent, GitHub Copilot Workspace.

🖥️

Computer Use Agents

Control a computer via screenshot observation and mouse/keyboard actions. Can operate any GUI application as a human would — no API needed. Examples: Claude computer use, OpenAI Operator.

Why Agents Now? Core

The concept of an autonomous goal-seeking agent is decades old in AI research. What changed between 2022 and 2024 is the convergence of three enabling factors that finally made production agents practical.

🎯

1 — LLM Capability Threshold

GPT-4 and Claude 3 crossed the threshold needed to reliably plan, reason about tool results, and self-correct on errors. Earlier models failed too often for practical multi-step agent loops.

⚙️

2 — Structured Tool Calling APIs

OpenAI function calling (June 2023) and Anthropic tool use gave models a reliable structured way to invoke tools. Before this, tool use required fragile prompt-parsing heuristics.

🏗️

3 — Ecosystem Maturity

LangChain, LangGraph, AutoGen, CrewAI abstract the boilerplate. Developers build production agents in hours, not weeks. MCP standardises tool and context protocols across models.

AI Agent Timeline — from research prototype to production infrastructure

∑ Chapter 8.1 — Key Takeaways

Agent = LLM + tools + memory + goal — takes actions with real-world consequences, not just generates text
The agency spectrum: Level 1 (pure LLM) → Level 5 (fully autonomous) — most 2024 production systems sit at Level 3–4
Four components required: LLM Brain, Tools, Memory, Planning — all four needed for complex multi-step tasks
Agent loop: Observe → Think → Act → Update — repeats until goal achieved, max steps hit, or error occurs
Enabled by: GPT-4-class reasoning + structured function calling APIs (June 2023) + mature frameworks (LangChain, LangGraph)
Key risk: agents can take irreversible real-world actions — safety design and human oversight are non-negotiable

8.2

Chapter 8.2

Tool Use & Function Calling

An LLM without tools is a very smart autocomplete. Tools are what turn text generation into action. The function calling API — released by OpenAI in June 2023 — was the single most important infrastructure change that made production agents practical. Before it, tool use was fragile prompt engineering. After it, it was engineering.

Why Tools? Core

LLMs have three fundamental limitations that tools directly address. Knowledge cutoff: training data has a fixed date — models cannot tell you today's stock price or last night's sports result. Computation: LLMs are unreliable at precise arithmetic, code execution, and structured data queries. Side effects: a language model can describe writing an email but cannot actually send one. Tools bridge each of these gaps, turning a model that only speaks into one that acts.

📅

Knowledge Tools

Web search, Wikipedia, news feeds, stock data, weather APIs, knowledge bases, RAG retrieval. Overcome knowledge cutoff — give the model access to current information.

🔢

Computation Tools

Code interpreter (Python), calculator, SQL database, image generation, data analysis. Deliver precise, verifiable results the model itself cannot compute reliably.

⚡

Action Tools

Email sending, calendar events, file creation/editing, web browser control, API calls, form submission. Create real-world side effects — the agent does things, not just says things.

Function Calling API In-depth

Before structured function calling, using tools required prompting the model to output JSON and then parsing it — a fragile approach that broke on any formatting variation. OpenAI's function calling API (June 2023) changed this: the model now outputs a structured tool_call object guaranteed by the API, not a string that needs parsing. Anthropic's tool use API follows the same pattern with minor schema differences.

The round-trip has exactly six steps: define tools → model requests a tool call → developer executes the function → developer returns the result → model reasons over the result → model produces the final answer. The developer drives steps 3 and 4; the model does everything else.

Function Calling Round-Trip — 6 steps from tool definition to final answer

# Anthropic Tool Use — complete working example
import anthropic, json

client = anthropic.Anthropic()

tools = [{
    "name": "get_weather",
    "description": "Get current weather for a city. Returns temp and conditions.",
    "input_schema": {
        "type": "object",
        "properties": {
            "city": {"type": "string", "description": "City name e.g. 'Tokyo'"},
            "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]}
        },
        "required": ["city"]
    }
}]

def get_weather(city: str, unit: str = "celsius") -> dict:
    return {"city": city, "temperature": 22, "condition": "sunny", "unit": unit}

def run_agent(user_message: str) -> str:
    messages = [{"role": "user", "content": user_message}]

    while True:
        response = client.messages.create(
            model="claude-sonnet-4-6", max_tokens=1024,
            tools=tools, messages=messages
        )
        if response.stop_reason == "end_turn":
            return next(b.text for b in response.content if b.type == "text")

        if response.stop_reason == "tool_use":
            messages.append({"role": "assistant", "content": response.content})
            tool_results = []
            for block in response.content:
                if block.type == "tool_use":
                    result = get_weather(**block.input)         # execute the function
                    tool_results.append({
                        "type": "tool_result",
                        "tool_use_id": block.id,
                        "content": json.dumps(result)
                    })
            messages.append({"role": "user", "content": tool_results})
            # loop continues — model sees tool results and responds

print(run_agent("What's the weather in Tokyo right now?"))
# → "Currently in Tokyo: 22°C and sunny."

Tool Schema Design In-depth

A tool schema is a prompt. The model reads your name, description, and parameter descriptions to decide whether to call the tool, when to call it, and what arguments to pass. A vague or misleading schema leads to incorrect tool calls or failed executions — and these errors compound when tools call other tools.

❌

Bad Schema

name: "weather"
description: "weather tool"
parameters:
  location: string

Vague name, no description of output, no guidance on location format. Model calls with "Paris" → ambiguous city → tool error.

✅

Good Schema

name: "get_current_weather"
description: "Retrieve current conditions
  for a city. Returns temp, condition,
  humidity, wind. Use for current weather.
  NOT for historical data."
parameters:
  city: "Full name e.g. 'Paris, France'"
  unit: celsius|fahrenheit (optional)

Verb_noun name, precise description, example value, explicit scope (what NOT to use for). Model calls with "Paris, France" → success.

Best practices: verb_noun naming (get_weather, search_web, create_file); description states what it does AND when to use it; parameters include format examples ("e.g. 'Paris, France'"); required vs optional explicitly declared; enum constraints where possible. Think of it as writing documentation for an AI colleague who will read it exactly once before making a decision.

Schema Quality Impact — precise descriptions prevent tool call failures

Tracing a Full Tool Call In-depth

To understand how agents really work, trace a complete multi-step execution. Task: "Find the top 3 Python packages for data visualisation and compare their GitHub stars." The agent needs to: (a) discover which packages are popular, (b) fetch star counts for each, and (c) synthesise the results. Three separate tool calls, four LLM invocations.

Full Tool Call Trace — step-by-step multi-tool agent execution

Parallel Tool Use Core

Modern LLM APIs support returning multiple tool calls in a single model response. When the model determines that two tool calls are independent — neither depends on the output of the other — it can request them simultaneously. The developer then executes both in parallel and returns both results in a single follow-up message. This typically reduces latency by 30–50% for tasks with multiple independent lookups.

Sequential vs Parallel Tool Calls — reduce latency for independent operations

Tool Taxonomy Core

Tool Category	Examples	Latency	Risk Level
Web / Search	Brave Search, Bing, Serper, SerpAPI, Tavily	200–500ms	Low
Code Execution	Python REPL, JavaScript sandbox, Jupyter kernel	100ms–30s	Medium
File System	read_file, write_file, list_dir, delete_file	<10ms	High (irreversible)
Database	SQL query, vector search, NoSQL get/set	10–100ms	Medium–High
External APIs	REST calls, GraphQL, gRPC services	100ms–2s	Varies
Communication	send_email, post_slack, create_ticket	200–500ms	High (irreversible)
Browser / Computer	navigate, click, type, screenshot	500ms–2s	High
Memory	vector_store_add, retrieve, entity_update	10–100ms	Low

Model Context Protocol (MCP) Core

Before MCP, every agent framework defined tools differently: LangChain tools, AutoGen tools, and custom code were all incompatible. A tool built for one framework couldn't be used in another without rewriting the wrapper. Anthropic's Model Context Protocol (released 2024) is an open standard that solves this — think of it as HTTP for tool use.

An MCP server exposes tools over a standardised JSON-RPC interface (via stdio or HTTP+SSE). Any MCP-compatible client can connect to any MCP server without modification. The ecosystem already includes servers for: filesystem, PostgreSQL, Slack, GitHub, Google Drive, Puppeteer (browser control), and dozens more.

MCP — Universal Protocol for Agent Tool Connectivity

∑ Chapter 8.2 — Key Takeaways

Tools solve three LLM limits: knowledge cutoff, computation accuracy, world side effects — they turn text generation into action
Function calling: structured JSON tool_call output → reliable, parseable tool invocation — the key enabler for production agents
Tool schemas are prompts — precise descriptions and parameter constraints are critical for correct tool selection and argument generation
Multi-step loop: tool results added to context → model reasons over accumulating evidence across multiple LLM calls
Parallel tool use: call independent tools simultaneously — reduces latency ~40% with no code changes beyond handling multiple results
MCP: universal standard for tool connectivity — any client works with any server, eliminating framework lock-in

8.3

Chapter 8.3

ReAct & Reasoning Loops

The core insight of ReAct is simple but profound: don't just think, then act. Think, act, observe, think again — interleaving reasoning with real-world grounding. Each tool result updates the plan. Each thought commits to the next action. This is why modern agents are more reliable than either pure reasoning or pure acting alone.

Chain-of-Thought for Agents Core

Chain-of-thought prompting was originally a single-turn technique: "Let's think step by step" before answering dramatically improved multi-step reasoning on math and logic tasks. In agents, CoT becomes something more structural — it is the backbone of every decision step. Before acting, the model writes out its reasoning. This reasoning serves as working memory and directly constrains the next action.

Verbalised reasoning helps agents in four concrete ways: it forces commitment to a plan before executing an irreversible action; it makes the agent's reasoning auditable — you can inspect exactly why a choice was made; it enables error recovery — bad reasoning is visible and can be interrupted; and it helps the model notice contradictions before they compound across multiple steps.

When an agent writes "Thought: I need to search for the current price first, then calculate the percentage change", it is not just narrating — it is programming its own next action. The thought IS the plan. This is why verbalised reasoning improves agent reliability: the model checks its own logic before committing to an action.

ReAct: Reason + Act In-depth

Yao et al. (Princeton/Google, 2022) introduced ReAct in "ReAct: Synergising Reasoning and Acting in Language Models". The core insight: interleave reasoning traces (Thought) with tool-grounded actions (Action / Observation) step by step. Not "think then act" as two separate phases — but thinking and acting interwoven at every step.

Pure reasoning (CoT) lets models hallucinate facts with no grounding in reality — there is nothing to correct wrong assumptions. Pure acting wastes tool calls without strategy — the model fires searches randomly without a plan. ReAct solves both: each Observation updates the model's plan; each Thought grounds the next Action in accumulated evidence. The original paper showed ReAct outperforms CoT-only and Act-only on HotpotQA, FEVER, and WebShop benchmarks.

ReAct Loop Structure: Thought_t → Action_t: tool_name(args) → Observation_t: tool result → repeat until Action_t = "Final Answer: [answer]" Each Thought is the model's reasoning about current state · Each Observation grounds the next Thought · Loop terminates on Final Answer or max steps

ReAct vs CoT vs Act-Only — interleaving reasoning and acting beats both alone

ReAct Full Worked Trace In-depth

Multi-hop questions require chaining multiple lookups where each result informs the next query. ReAct handles this naturally because each Observation is added to the context before the next Thought. Task: "What is the population of the capital city of the country that hosted the 2020 Olympics?" — requires three information hops.

ReAct Multi-Hop Trace — chaining observations to answer complex questions

Implementing ReAct In-depth

ReAct requires no framework — it is just the standard tool-use loop with a system prompt that instructs the model to think before acting. The key is the system prompt structure and the loop that feeds observations back to the model. The implementation below is complete and runnable with the Anthropic API.

from anthropic import Anthropic

client = Anthropic()

tools = [{
    "name": "search",
    "description": "Search the web for current information. Use for factual queries, current events, or when you need to look something up.",
    "input_schema": {
        "type": "object",
        "properties": {
            "query": {"type": "string", "description": "Search query"}
        },
        "required": ["query"]
    }
}]

def mock_search(query: str) -> str:
    results = {
        "2020 Summer Olympics host": "Held in Tokyo, Japan in 2021",
        "Tokyo population": "City: ~13.96M · Metro: ~37.4M (2024)"
    }
    for key in results:
        if key.lower() in query.lower():
            return results[key]
    return f"Search results for: {query}"

def react_agent(task: str, max_steps: int = 10) -> str:
    system = """You are a helpful agent. Think step by step before each action.
Always start with a Thought explaining your reasoning before calling a tool."""

    messages = [{"role": "user", "content": task}]

    for step in range(max_steps):
        response = client.messages.create(
            model="claude-sonnet-4-6", max_tokens=1024,
            system=system, tools=tools, messages=messages
        )
        messages.append({"role": "assistant", "content": response.content})

        if response.stop_reason == "end_turn":
            return next(b.text for b in response.content if b.type == "text")

        tool_results = []
        for block in response.content:
            if block.type == "tool_use":
                result = mock_search(block.input["query"])
                print(f"  Action: {block.name}({block.input})")
                print(f"  Observation: {result}\n")
                tool_results.append({
                    "type": "tool_result",
                    "tool_use_id": block.id,
                    "content": result
                })
        messages.append({"role": "user", "content": tool_results})

    return "Max steps reached without final answer"

answer = react_agent(
    "What is the population of the capital of the 2020 Olympics host country?"
)
print(f"\nFinal Answer: {answer}")

Self-Reflection & Reflexion In-depth

Standard ReAct agents repeat mistakes across attempts — there is no mechanism to learn from failure within a session. Shinn et al. (2023) introduced Reflexion ("Language Agents with Verbal Reinforcement Learning") to address this. After each failed attempt, the agent generates a verbal self-critique explaining what went wrong and what to try differently. This critique is stored in memory and prepended to the next attempt.

Reflexion adds three components to the standard loop: an Evaluator that judges whether an attempt succeeded, a Self-Reflection step that generates a verbal critique on failure, and a Memory store that accumulates critiques across attempts. The result is a form of in-session learning without any gradient updates — pure verbal reinforcement.

Reflexion — Verbal self-critique stored in memory guides successive attempts

Tree of Thought Core

Yao et al. (2023) introduced Tree of Thoughts (ToT): instead of a single linear reasoning chain, maintain multiple candidate reasoning branches simultaneously. At each step, generate multiple next-thoughts, evaluate the promise of each (another LLM call), and explore the most promising — backtracking from dead ends using BFS or DFS.

Standard CoT picks one path and commits. If that path leads to a wrong conclusion there is no recovery. ToT is best for problems where early choices are high-stakes: puzzle solving, proof writing, strategic planning with multiple valid initial moves. The cost is significantly more LLM calls — often 5–20× more than ReAct. Use it only when the added cost is justified by problem complexity.

Approach	Paths	Backtracking	LLM Calls	Best For
CoT	1 linear chain	None	1–3	Simple reasoning, clear next step
ReAct	1 path + tools	Implicit via observations	3–10	Most agent tasks, multi-hop queries
Reflexion	Multiple attempts	Between attempts	5–30	Tasks requiring iterative refinement
ToT	Multiple branches	Within attempt	20–100+	Hard puzzles, proofs, high-stakes planning

Chain of Thought vs Tree of Thought — linear vs deliberate multi-path reasoning

∑ Chapter 8.3 — Key Takeaways

CoT gives agents verbalised reasoning — makes plans auditable and helps agents self-correct before committing to actions
ReAct (Yao et al. 2022): interleave Thought → Action → Observation — grounded reasoning outperforms both pure CoT and act-only approaches
Multi-hop reasoning: each Observation is added to context before the next Thought — chains "2020 Olympics → Japan → Tokyo → Population" naturally
Reflexion: verbal self-critique stored in memory — agents improve across successive failed attempts without gradient updates
Tree of Thought: multiple reasoning branches explored and evaluated — best for high-stakes complex problems; expensive (20–100+ LLM calls)
Production default: ReAct is the standard for most tasks; add Reflexion for iterative refinement; use ToT only when early mistakes are catastrophic

8.4

Chapter 8.4

Memory Systems — How Agents Remember

A stateless agent is amnesiac — it forgets everything between turns. A memory-augmented agent can recall a user's preferences from last week, learn from its own past mistakes, and maintain a coherent project context across hundreds of conversations. Memory is what turns a chatbot into a collaborator.

Memory Taxonomy Core

Agent memory maps directly onto human cognitive memory systems. In-context memory is working memory — everything visible to the model right now. Vector store memory is associative memory — retrieve by similarity ("what do I know about X?"). Episodic memory is autobiographical — specific events with timestamps ("what happened last session?"). Semantic/entity memory is world knowledge — structured facts ("who is Alice, what does she prefer?").

💭

In-Context (Working Memory)

The context window — all messages, tool results, and plans from the current session. The LLM sees all of it without any retrieval. Fast but finite: 8K–200K tokens depending on model. Lost when the session ends.

🔍

Vector Store (Associative Memory)

Text embedded as dense vectors. Retrieve by semantic similarity — not exact key match. "What do I know about the user's preferences?" returns all relevant stored facts. Persistent across sessions.

📅

Episodic Memory

Timestamped logs of past interactions. "Last Monday we discussed the invoice API" — specific events with when, what, and outcome. Enables cross-session continuity and learning from past attempts.

👤

Semantic / Entity Memory

Structured facts about known entities. User profiles: name, role, preferences, communication style. Project facts: stack, status, blockers. Explicitly maintained — not inferred from logs.

Four Agent Memory Types — in-context, vector, episodic, semantic

In-Context Memory In-depth

The context window is the agent's working memory. Every message, tool result, plan, and observation from the current session lives here. The LLM sees all of it simultaneously — no retrieval needed, no similarity search, no latency penalty. It is the default memory for any agent and is sufficient for most short tasks.

The fundamental limitation is the context window is finite. Models support 8K to 200K tokens depending on provider. Long multi-step tasks accumulate tokens rapidly: every tool result, every thought, every observation adds to the total. When context approaches the limit, the agent must decide what to keep, compress, or offload.

🪟

Sliding Window

Keep only the last N messages. Oldest messages are dropped when context fills. Simple, no retrieval cost. Loses history permanently. Use for: short task-focused assistants.

📄

Hierarchical Summarisation

Compress old turns progressively: recent turns in full detail, older turns as a paragraph summary, oldest as a single sentence. Never completely loses information. Use for: long-running support bots.

💾

Selective Retention

Keep all critical tool results but summarise conversational turns. Identify which information is load-bearing (facts, decisions, errors) vs. noise (filler, redundant acknowledgements).

Context Window Filling — token accumulation over a long multi-step task

External Memory & Retrieval In-depth

External memory lives outside the context window — in a database, vector store, or file system. The agent interacts with it via explicit tool calls: write stores important information, search retrieves relevant information at query time. External memory enables persistence across sessions, scalability beyond context limits, and cross-session learning.

The write/search interface is the minimal design. Every agent memory system needs at minimum three operations: memory_write(key, content, tags) to store, memory_search(query, n) for semantic retrieval, and memory_get(key) for exact lookup. The implementation below is a working in-memory vector store using sentence-transformers.

import json
from datetime import datetime
from sentence_transformers import SentenceTransformer
import numpy as np

class AgentMemory:
    """Simple in-memory vector store for agent memories"""

    def __init__(self):
        self.model = SentenceTransformer('all-MiniLM-L6-v2')
        self.memories = []  # list of {key, content, embedding, timestamp, tags}

    def write(self, key: str, content: str, tags: list = None) -> str:
        """Store a memory with its embedding"""
        embedding = self.model.encode(content)
        self.memories.append({
            "key": key,
            "content": content,
            "embedding": embedding,
            "timestamp": datetime.now().isoformat(),
            "tags": tags or []
        })
        return f"Stored memory: {key}"

    def search(self, query: str, n: int = 3) -> list:
        """Find n most relevant memories by cosine similarity"""
        if not self.memories: return []
        q_emb = self.model.encode(query)
        scores = [
            np.dot(q_emb, m["embedding"]) /
            (np.linalg.norm(q_emb) * np.linalg.norm(m["embedding"]))
            for m in self.memories
        ]
        top_n = sorted(zip(scores, self.memories), key=lambda x: -x[0])[:n]
        return [{"score": s, **m} for s, m in top_n]

# Usage
mem = AgentMemory()
mem.write("user_pref_1", "User prefers Python over JavaScript", tags=["preference", "code"])
mem.write("task_note_1", "User's project is a FastAPI service for invoice processing", tags=["project"])

results = mem.search("What programming language does the user prefer?")
print(results[0]["content"])  # "User prefers Python over JavaScript"

Vector Store Memory Core

Vector store memory retrieves information by meaning, not by exact key. Each piece of stored text is converted into a dense numerical vector (embedding) by an embedding model. At retrieval time, the query is embedded and the vector database returns the stored items with the highest cosine similarity. This makes it possible to ask "What do I know about the user's technical background?" and retrieve all relevant stored facts, even if they use completely different words.

Vector Store Memory — semantic write and retrieve by meaning

Episodic Memory Core

Episodic memory stores specific past events with timestamps — not just what was learned, but when it happened, in what context, and with what outcome. For agents, episodic memory enables cross-session continuity: the agent on Friday remembers the conversation from Monday without the user needing to re-explain.

Episodic Memory — building context across multiple sessions

Entity & Semantic Memory Core

Semantic memory stores structured facts about known entities — timeless information distinct from episodic "when it happened" logs. A user entity has: name, role, technical preferences, communication style, current projects. A project entity has: name, stack, status, key files, blockers. This structured profile grows as the agent learns more and eliminates the need to re-ask the same onboarding questions.

Use Case	Best Memory Type	Implementation	Persistence
Current conversation state	In-context	Context window messages	Session only
Recent task results	In-context	Tool result messages	Session only
Long conversation (>100 turns)	External (vector)	Chunked + embedded history	Across sessions
User preferences & profile	Semantic/entity	JSON profile + vector search	Permanent
Past task attempts & failures	Episodic	Timestamped summary logs	Permanent
Domain knowledge base	External (vector)	RAG pipeline on documents	Permanent
Cross-session continuity	Episodic + semantic	Combined: summaries + profile	Permanent

Memory Architecture Patterns In-depth

Three patterns cover the vast majority of production agent memory designs. Pattern selection depends on task length, session frequency, and personalisation requirements.

🪟

Pattern 1 — Sliding Window

Keep the last N messages. Drop oldest when context fills. Simple, no retrieval cost, no infrastructure needed. Loses history permanently. Best for: short task-focused assistants with well-scoped goals.

📑

Pattern 2 — Hierarchical Summarisation

Recent turns: full detail. Older turns: paragraph summary. Oldest: one sentence. Never completely loses info — compresses to gist. Best for: long-running customer support, multi-day coding sessions.

🗃️

Pattern 3 — RAG Memory

All facts stored in vector DB. At each step: retrieve relevant memories + inject into context. Working memory = retrieved context, not full history. Infinite effective memory, small context footprint. Best for: personalised cross-session agents.

Memory Architecture Patterns — sliding window, hierarchical, and RAG-memory

∑ Chapter 8.4 — Key Takeaways

Four memory types: in-context (window), vector (semantic), episodic (logs), entity (facts) — each serves a different recall need
In-context is the default — fast and zero-retrieval but limited by context window size and lost when session ends
Vector store: embed → store → retrieve by semantic similarity — "what do I know about X?" regardless of exact wording
Episodic memory: timestamped session summaries — enables cross-session continuity without re-explanation
Semantic/entity memory: structured profiles of users and domains — enables personalisation and avoids repetitive onboarding
Most production agents need all four types working together: context for now, vector for knowledge, episodic for history, entity for identity

8.5

Chapter 8.5

Planning & Task Decomposition

ReAct decides one step at a time. Planning decides the whole path before taking the first step. For short tasks, step-by-step is fine. For tasks with irreversible actions, dependencies, and ten or more steps — a plan prevents early mistakes that cannot be undone. The art is knowing when to plan, how deeply, and when to abandon the plan and replan.

Why Planning Matters Core

For simple tasks, ReAct's step-by-step approach is perfectly adequate — decide, act, observe, repeat. Planning becomes necessary when tasks involve irreversible actions (sending emails, committing code, deleting files), long horizons where the agent may lose track of the original goal after ten or more steps, or parallel sub-tasks where independent work streams could be executed concurrently for efficiency.

Planning also enables human oversight: when a complete plan is generated before any action is taken, a human can review and approve the plan before irreversible operations start. This is one of the most practical safety mechanisms in production agents.

⚡

No Planning (ReAct)

Decide step-by-step as observations accumulate. Works for: simple queries, short tasks, tasks where every step is reversible. Risk: no global strategy, can get lost on long tasks.

📋

Soft Planning

"Think step by step before acting" — loose structure, single LLM call to outline approach before starting. Works for: medium complexity tasks requiring a rough roadmap but flexible execution.

📐

Hard Planning

Explicit numbered plan generated and tracked step by step. Enables human review before execution. Works for: irreversible actions, long tasks, tasks with clear sequential dependencies.

🌳

Hierarchical Planning

Goals decomposed into sub-goals recursively. Independent sub-tasks run in parallel. Works for: complex multi-domain tasks where specialised sub-agents handle different branches.

Plan-and-Execute In-depth

Wang et al. (2023) introduced Plan-and-Solve: a two-phase architecture where a Planner LLM call generates a complete numbered plan, and an Executor loop carries out each step using tools. This separation provides a critical advantage: the full plan is explicit before any action is taken, enabling human review, dependency analysis, and progress tracking.

ReAct vs Plan-and-Execute — incremental vs planned execution

from anthropic import Anthropic

client = Anthropic()

def create_plan(goal: str) -> list:
    """Phase 1: Planner — generate a complete numbered plan"""
    response = client.messages.create(
        model="claude-sonnet-4-6", max_tokens=1024,
        messages=[{"role": "user", "content": f"""Create a numbered step-by-step plan for:

Goal: {goal}

Output ONLY a numbered list. Each step should be concrete and actionable. Max 7 steps."""}]
    )
    plan_text = response.content[0].text
    return [l.strip() for l in plan_text.split('\n')
            if l.strip() and l.strip()[0].isdigit()]

def execute_step(step: str, completed: list) -> str:
    """Phase 2: Executor — carry out one plan step"""
    context = "\n".join(f"- {s}" for s in completed)
    response = client.messages.create(
        model="claude-sonnet-4-6", max_tokens=1024,
        messages=[{"role": "user", "content": f"""Completed steps:
{context}

Execute this step and provide its output: {step}"""}]
    )
    return response.content[0].text

def plan_and_execute(goal: str) -> str:
    # Phase 1: generate full plan
    plan = create_plan(goal)
    print(f"Plan ({len(plan)} steps):")
    for i, step in enumerate(plan, 1):
        print(f"  {i}. {step}")

    # Phase 2: execute each step
    completed = []
    for step in plan:
        result = execute_step(step, completed)
        completed.append(f"{step} → {result[:80]}...")
        print(f"\n✓ {step}\n  {result[:200]}")

    return f"Completed {len(plan)} steps successfully"

plan_and_execute("Research and write a brief comparison of Redis vs Memcached")

Hierarchical Planning In-depth

Complex goals naturally form a tree of sub-goals. A top-level goal like "Build a data analysis report" decomposes into "Collect data", "Analyse data", and "Write report" — each of which decomposes further into concrete executable steps. Hierarchical planning makes this structure explicit. Independent sub-trees can run in parallel; sequential dependencies are enforced between tree levels.

Hierarchical Task Decomposition — root goal → sub-goals → executable steps

Dynamic Replanning Core

Static plans assume the world matches their assumptions. In practice, steps fail: the database is down, the API changed, the file doesn't exist. An agent that cannot adapt to failure is brittle. Replanning is the mechanism for detecting when reality has diverged from the plan and generating a revised plan from the current state.

Three strategies, in order of cost: step retry — try the same step differently; partial replan — regenerate only the remaining steps given the new situation; full replan — abandon the current plan entirely and regenerate from the current state. The key is that the agent must actively recognise a failure condition rather than blindly proceeding.

Dynamic Replanning — detect failure, decide retry vs replan, continue

Least-to-Most Prompting Core

Zhou et al. (2023) introduced Least-to-Most Prompting, inspired by educational scaffolding. Rather than attacking a complex question directly, the method first decomposes it into simpler sub-problems, then solves each in order — using prior solutions as context for the next. This approach excels at compositional problems where a complex answer genuinely depends on simpler intermediate results.

def least_to_most(question: str) -> str:
    """Stage 1: decompose. Stage 2: solve each subproblem in order."""

    # Stage 1: Decompose into ordered subproblems
    decompose = client.messages.create(
        model="claude-sonnet-4-6", max_tokens=512,
        messages=[{"role": "user", "content": f"""Break this question into simpler subproblems
that must be solved first, ordered from simplest to most complex.

Question: {question}

Output: A numbered list of simpler subproblems."""}]
    )
    subproblems = [l.strip() for l in decompose.content[0].text.split('\n')
                   if l.strip() and l.strip()[0].isdigit()]

    # Stage 2: Solve each subproblem, using prior answers as context
    context = f"Original question: {question}\n\nSubproblem solutions:\n"
    for sub in subproblems:
        resp = client.messages.create(
            model="claude-sonnet-4-6", max_tokens=256,
            messages=[{"role": "user", "content": f"{context}\nNow solve: {sub}"}]
        )
        context += f"\n{sub}: {resp.content[0].text}"

    # Final answer using accumulated subproblem solutions
    final = client.messages.create(
        model="claude-sonnet-4-6", max_tokens=512,
        messages=[{"role": "user", "content": f"{context}\n\nNow answer the original question."}]
    )
    return final.content[0].text

# Example: compositional maths/reasoning
answer = least_to_most(
    "If a train travels 120km in 1.5 hours and then 80km in 1 hour, what is its average speed?"
)
print(answer)

Subgoal Decomposition Core

Subgoal decomposition is the general skill underlying all planning methods: given a goal, identify the minimal set of concrete sub-tasks that, when completed, guarantee the goal is achieved. The quality of decomposition determines everything downstream — bad decomposition leads to missing steps, incorrect dependencies, and wasted effort.

✂️

Sequential Decomposition

Steps must run in order — each depends on the previous. Use when: output of step N is an input to step N+1. Example: Collect data → Clean data → Analyse → Report.

⚡

Parallel Decomposition

Sub-tasks are independent — all can run simultaneously. Use when: sub-tasks share the same input but produce separate outputs. Example: search 3 different sources in parallel.

🌳

Conditional Decomposition

Next sub-task depends on the result of a prior sub-task. Use when: branching logic exists. Example: "if search returns no results, try alternative query; else proceed to analyse."

Planning Failure Modes Core

🔄

Over-Planning

Agent spends too many steps planning instead of acting — "analysis paralysis". 10 planning steps for a 2-step task. Fix: limit planning depth, set a max plan length, use ReAct for simple tasks.

📋

Stale Plans

Agent follows the initial plan rigidly even when observations contradict it. Committed to the plan, ignores reality. Fix: explicit plan-evaluation step after each observation — "does the plan still make sense?"

⛓️

Missing Dependencies

Plan assumes step 4 can proceed before step 2 finishes. Parallel execution races cause unpredictable failures. Fix: explicit dependency graph before execution — identify all step inputs and outputs.

🎯

Goal Drift

In long plans, agent slowly drifts from the original goal — sub-goals become ends in themselves. Fix: re-state the original goal at regular checkpoints in the context window.

∑ Chapter 8.5 — Key Takeaways

Planning is essential for irreversible actions, long tasks, and parallel sub-tasks — ReAct alone is insufficient
Plan-and-Execute: generate full plan first, then execute step by step — enables human review before any action runs
Hierarchical: decompose goal tree recursively — independent sub-trees run in parallel, sequential arrows enforce ordering
Dynamic replanning: detect failure, decide retry vs partial/full replan — critical for robustness in real environments
Least-to-Most: simpler subproblems solved first — each answer scaffolds the next for compositional tasks
Key pitfalls: over-planning, stale plans, missing dependencies, goal drift over long tasks — all require explicit mitigation

8.6

Chapter 8.6

Multi-Agent Systems

A single agent is a generalist. A team of agents is a specialised organisation. When tasks decompose into distinct roles — researcher, coder, critic, writer — routing each to a specialist consistently outperforms one agent doing everything. The Supervisor pattern has become the default architecture for production systems that need reliability, auditability, and scale.

Why Multiple Agents? Core

A single agent with all tools solves many problems — but hits four fundamental limits. Context overload: fifty tool calls in one context window approaches or exceeds any model's limit. Specialisation: a code-writing specialist with a focused system prompt and code-specific tools outperforms a generalist at the same task. Parallelism: three independent research threads can run simultaneously in three agents for 3× throughput. Verification: a separate critic agent reviewing the author agent's output catches errors that self-review misses.

⚡

Parallelism

Independent sub-tasks run simultaneously. A research agent + a code agent + a writing agent can all work at once — 3× speedup for parallel workstreams compared to sequential single-agent execution.

🎯

Specialisation

Each agent optimised for one role: researcher, coder, critic, planner. Focused system prompt + role-specific tool set → better per-role performance than a generalist attempting all roles.

✅

Verification

Separate critic/reviewer agent checks the main agent's work independently. Two-agent "author + reviewer" consistently produces fewer errors than one agent doing both roles.

📏

Scale & Routing

Route different task types to different specialised agents. Handle more users by running parallel agent instances. Different domains (legal, medical, code) get domain-specific agents.

Multi-Agent Topologies In-depth

Four structural patterns cover the vast majority of multi-agent system designs. Each makes different trade-offs between simplicity, flexibility, and parallelism.

Four Multi-Agent Topologies — Sequential, Fan-out, Supervisor, Debate

The Supervisor Pattern In-depth

The Supervisor/Worker pattern is the dominant architecture in production multi-agent systems. The Supervisor is an LLM whose sole job is orchestration — it understands the overall goal, decides which specialist to invoke and with what task, and synthesises results. It does not itself call tools. Worker agents are specialised: each has a focused system prompt, a specific tool set, and a narrow domain of responsibility.

Supervisor-Worker Pattern — orchestrator routes tasks to specialised agents

from langgraph.graph import StateGraph, END
from langchain_anthropic import ChatAnthropic
from typing import TypedDict, Annotated
import operator

class AgentState(TypedDict):
    goal: str
    messages: Annotated[list, operator.add]
    next_agent: str
    final_output: str

llm = ChatAnthropic(model="claude-sonnet-4-6")

def supervisor(state: AgentState) -> AgentState:
    """Decides which agent to call next"""
    prompt = f"""You are an orchestrator managing a team of agents.

Goal: {state['goal']}
Progress: {state['messages'][-3:] if state['messages'] else 'None'}

Available agents: RESEARCHER, CODER, WRITER, FINISH
Which agent should act next? Respond with ONLY the agent name."""

    response = llm.invoke(prompt)
    return {**state, "next_agent": response.content.strip()}

def researcher(state: AgentState) -> AgentState:
    result = f"[Research results for: {state['goal']}]"
    return {**state, "messages": [f"Researcher: {result}"]}

def coder(state: AgentState) -> AgentState:
    result = f"[Code for: {state['goal']}]"
    return {**state, "messages": [f"Coder: {result}"]}

def writer(state: AgentState) -> AgentState:
    result = f"[Final document for: {state['goal']}]"
    return {**state, "messages": [f"Writer: {result}"], "final_output": result}

# Build the graph
graph = StateGraph(AgentState)
for name, fn in [("supervisor", supervisor), ("researcher", researcher),
                   ("coder", coder), ("writer", writer)]:
    graph.add_node(name, fn)

graph.set_entry_point("supervisor")
graph.add_conditional_edges("supervisor", lambda s: s["next_agent"], {
    "RESEARCHER": "researcher",
    "CODER": "coder",
    "WRITER": "writer",
    "FINISH": END
})
for node in ["researcher", "coder", "writer"]:
    graph.add_edge(node, "supervisor")  # workers always return to supervisor

app = graph.compile()
result = app.invoke({"goal": "Create a Python web scraper for HN jobs", "messages": []})

Agent Handoffs Core

A handoff is the transfer of execution from one agent to another with relevant context. What gets transferred determines whether the receiving agent can proceed effectively: too little context and the agent re-does already-completed work; too much context and the receiving agent's context window overflows.

The minimal handoff payload should include: the current task description, relevant completed-step outputs, constraints and requirements, and optionally a summary of conversation history. The handoff type determines urgency and routing: specialisation (this requires coding — transfer to coder), escalation (this requires human approval — pause and notify), error (I failed — transfer to error-recovery agent), and completion (sub-task done — return to supervisor).

🎯

Specialisation Handoff

Current agent identifies a task outside its specialty. Passes: task description + context needed. Example: "This requires code generation — handing to Coder agent."

⬆️

Escalation Handoff

Task requires approval or capabilities beyond any agent. Routes to human-in-the-loop. Example: "This action is irreversible — notifying human for approval before proceeding."

🔴

Error Handoff

Agent has failed and cannot self-recover. Passes: failure description + attempted approaches. Routes to error-recovery agent or supervisor for replanning.

✅

Completion Handoff

Sub-task successfully completed. Returns result to supervisor with: output summary, status, any side effects created. Supervisor decides next step.

LangGraph In-depth

LangGraph (LangChain, 2024) is a graph-based framework for stateful multi-agent systems. Its core abstraction is a StateGraph: a directed graph where nodes are functions (agents or tools), edges are transitions, and a shared typed State dictionary persists across all nodes. Conditional edges route execution based on the current state — the LLM's output determines which node runs next.

LangGraph's key advantages over plain Python loops are built-in persistence (checkpoint state to a database — resume after failure, enable human-in-the-loop approvals), streaming (intermediate steps stream to the UI in real time), and parallel execution (fork-join for simultaneous node execution).

LangGraph Execution Flow — stateful graph with conditional routing

AutoGen & CrewAI — Framework Comparison Core

Framework	Paradigm	Strengths	Best For	Abstraction
LangGraph	Graph-based stateful	Persistence, streaming, fine-grained control	Production agents, complex flows	Low (explicit graph)
AutoGen (Microsoft)	Conversational agents	Easy multi-agent chat, human-in-loop	Research, prototyping, group chat	Medium
CrewAI	Role-based crews	Simple role/goal definition, no graph	Rapid prototyping, simple multi-role	High (declarative)
LangChain Agents	Tool-using agents	Rich tool ecosystem, many integrations	Single agent with many tools	Medium
OpenAI Assistants	Thread-based managed	Built-in memory, code interpreter	Simple production assistants	High (managed)
Semantic Kernel	Plugin-based	.NET/Python, enterprise planning	Enterprise applications	Medium

Inter-Agent Communication Core

How agents communicate determines system reliability, debuggability, and scalability. Three communication patterns cover most production systems: shared state (agents read and write a common typed dictionary — LangGraph's model), message passing (agents send structured messages to each other — AutoGen's model), and function calling (supervisor calls worker as a tool with structured input/output). Shared state is easiest to debug; message passing is most flexible; function calling is the most familiar interface for developers.

🗂️

Shared State

All agents read/write a common typed state dictionary. Every step is visible to every agent. Easy to inspect and debug. Used by LangGraph. Risk: state conflicts if agents write the same field.

✉️

Message Passing

Agents send structured messages to each other. Each agent maintains its own conversation thread. Flexible, decoupled. Used by AutoGen. Risk: message format mismatches between agents.

🔧

Function Calling

Supervisor calls workers as tools with structured JSON input/output. Clean interface boundary. Workers are stateless from the supervisor's perspective. Risk: loses intermediate worker context.

∑ Chapter 8.6 — Key Takeaways

Multiple agents enable: parallelism, specialisation, verification, and scale — justified when a single agent hits context or quality limits
Four topologies: Sequential, Fan-out, Supervisor/Worker, Peer Debate — Supervisor is the dominant production pattern
Supervisor pattern: orchestrator routes tasks to specialised workers — LangGraph provides the graph infrastructure for this
Handoffs pass context between agents — too little = re-work; too much = context overflow — right-size the handoff payload
LangGraph: explicit stateful graph with persistence and streaming — best for production requiring reliability and human-in-the-loop
CrewAI/AutoGen for rapid prototyping; LangGraph for production systems requiring checkpointing and auditability

8.7

Chapter 8.7

Agent Evaluation, Safety & Reliability

You cannot improve what you cannot measure. And you cannot deploy what you cannot trust. Agent evaluation is harder than model evaluation because agents act — and actions can be irreversible. The same agent solving the same task may take a completely different path each time. Building reliable agents means knowing where they fail, how often, and why — and designing safety mechanisms before those failures have real-world consequences.

Why Agent Evaluation Is Hard Core

Standard LLM evaluation is straightforward: fixed input, expected output, compute a score (BLEU, accuracy, human preference). Agent evaluation is fundamentally harder across five dimensions: multiple valid paths (many different tool call sequences can reach the correct answer — which counts?); process vs outcome (did the agent succeed by doing the right thing, or by luck?); stochasticity (same task, different execution every run); long horizons (50 steps — where exactly did it go wrong?); and irreversibility (some failures cannot be undone, limiting how many evaluation runs you can afford).

Production agent evaluation requires a controlled simulation environment: tools return deterministic results from a test fixture, enabling repeatable evaluation of the agent's decision-making independently of external API variability.

🔀

Multiple Valid Paths

A task may be solvable by 20 different tool call sequences. Exact-match evaluation is meaningless. Must evaluate the outcome, not the path — or evaluate both separately.

🎲

Stochasticity

Same agent, same task → different execution each time. A single-run evaluation is unreliable. Need N≥10 runs per task to get a stable success rate estimate.

🔍

Process vs Outcome

"Right answer, wrong process" is a real failure mode. An agent that guesses correctly without using tools may be brittle on harder variants. Both outcome and trajectory quality matter.

Agent Metrics In-depth

No single metric captures agent quality. Production agent monitoring requires tracking at least six dimensions simultaneously — success, efficiency, recovery, safety, cost, and latency. An agent that succeeds 90% of the time but costs 10× more than necessary is not production-ready.

Agent Evaluation Radar — multi-dimensional performance comparison

Metric	Formula	What It Measures	Target
Task Success Rate	correct / total tasks	End-to-end task completion	>80%
Step Efficiency	optimal_steps / actual_steps	Tool call efficiency, no wasted steps	>0.7
Error Recovery Rate	recoveries / total errors	Robustness to tool failures	>70%
Safety Rate	1 − violations / actions	Avoidance of unsafe actions	>99%
Cost per Task	$ tokens + $ tool calls	Economic efficiency	Benchmark-dependent
P90 Latency	90th percentile wall-clock	Real-world responsiveness	<30s typical

Agent Benchmarks Core

Benchmark	Domain	Measure	Top Score (2024)	Notes
SWE-bench	Software engineering	% GitHub issues resolved	~50% (best systems)	Hard — real codebase understanding
WebArena	Web navigation	Task success rate	~35–50%	Browse, fill forms, extract info
AgentBench	8 domains (OS, DB, web, game)	Avg task success	~50–60%	Diverse agent task suite
HotpotQA	Multi-hop QA	EM + F1 score	~70–80%	ReAct baseline well-established
ALFWorld	Household navigation (text)	Task success rate	~90%+	Simulated environment
GAIA	General AI (diverse)	% correct (requires tools)	~50% frontier models	Real-world tools required

Agent Failure Modes In-depth

🔁

Hallucinated Tool Calls

Agent generates arguments that don't match tool schema — e.g. search(url='...') when schema requires search(query='...'). Fix: strict JSON schema validation before execution, input sanitisation.

🌀

Infinite Loops

Agent calls the same tool repeatedly with the same arguments, making no progress. Caused by unhelpful tool results that don't resolve the impasse. Fix: max_steps limit, same-tool+args loop detection.

🎯

Goal Abandonment

After many steps, agent forgets the original goal — pursues sub-goals as ends in themselves. Fix: re-state original goal in system prompt, periodic goal-check steps in the agent loop.

📤

Context Overflow

Tool results + messages exceed the context window. Model truncation corrupts task state — agent loses track of what it was doing. Fix: summarise old messages, external memory, limit tool result size.

🔐

Unauthorised Actions

Agent takes actions outside intended scope — sends emails not requested, deletes files to "clean up". Fix: explicit scope constraints in system prompt, tool allow-list, human approval gates for irreversible actions.

⛓️

Cascading Failures

One failed tool call causes misinterpretation of state — all subsequent steps are wrong because built on a bad premise. Fix: explicit error detection, tool result validation, partial-plan recovery on failure.

Agent Failure Mode Frequency — production observations

Human-in-the-Loop In-depth

Not all agent actions should be fully autonomous. Actions exist on a risk spectrum: read-only actions (search, read files) carry no irreversible risk and can auto-execute; reversible writes (draft email, temp file) are low risk; irreversible actions (send email, delete file, submit form) require explicit approval; critical actions (financial transactions, public communications, security changes) always require human review.

Human-in-the-Loop Gates — risk-based approval routing

Prompt Injection Attacks In-depth

Prompt injection is a class of attack where malicious content in the environment hijacks the agent's behaviour. Unlike a chatbot where injection only generates text, an agent can execute code, send emails, and access files — making prompt injection potentially catastrophic rather than merely embarrassing.

Indirect injection is the most dangerous variant for agents: malicious instructions embedded in a web page, retrieved document, or tool result — content the agent is supposed to process, not follow. The agent doesn't distinguish "content to read" from "instructions to follow" without explicit architectural protections.

Indirect Prompt Injection — malicious content in tool results hijacks agent

Safe Agent Design Principles Core

🔒

Principle of Least Privilege

Give the agent ONLY the tools it needs for its specific task. A research agent doesn't need email tools. A customer service agent doesn't need file deletion. Minimum tool surface = minimum blast radius.

🔄

Prefer Reversible Actions

Design reversible tool variants: draft_email not send_email · move_to_trash not delete · stage_changes not commit. When irreversible is unavoidable, require explicit confirmation.

👤

Human Approval Gates

Define clear triggers: any action affecting >N users · any financial action >$X · any irreversible modification · any external communication. Automate the classification, not the override.

📋

Audit Everything

Log all tool calls with: timestamp, agent ID, tool name, args, result, latency. Immutable audit trail for debugging, compliance, and rollback. Essential for any agent with real-world consequences.

🧱

Sandbox Execution

Run code execution in isolated containers: network restrictions · filesystem limits · CPU/memory caps · timeout enforcement. Never execute agent-generated code in the host environment directly.

🎯

Explicit Scope in System Prompt

State both permissions and prohibitions: "You MAY: read files in /workspace, search the web, run Python. You may NEVER: send emails, modify system files, access credentials." Explicit prohibition reduces accidental violations.

∑ Chapter 8.7 — Key Takeaways

Agent eval is hard: multiple valid paths, stochastic execution, long-horizon, irreversible actions — requires simulation environments
Six key metrics: task success, step efficiency, error recovery, safety violations, cost, latency — no single number suffices
Most common production failures: context overflow (35%) and goal drift (28%) — address these first
Human-in-the-loop: irreversible and high-risk actions require approval gates — classify risk, then route automatically
Prompt injection: malicious content in tool results can hijack agent instructions — separate instruction/content contexts, apply allow-listing
Safe design pillars: least privilege, reversible actions, explicit scope constraints, sandbox execution, immutable audit trail

8.8

Chapter 8.8

Production Agents — Real-World Systems & Frameworks

The most important question about AI agents is not how they work in research papers — it is how they fail in production. Every frontier lab in 2024 is shipping agents. The gap between demo and production is where fortunes are made and lost. This chapter is about closing that gap.

Real Production Systems In-depth

By 2024, AI agents have moved from research demonstrations to production products used by millions of developers and consumers. The systems below represent the state of the art across different application domains — each offers a case study in a different architectural approach to the core challenges of reliability, autonomy, and safety.

💻

Devin (Cognition AI, 2024)

First "AI software engineer" — autonomously completes software engineering tasks end-to-end. Tools: shell, code editor, browser, test runner. Reads issue → plans → writes code → runs tests → debugs → submits PR. Architecture: long-horizon planning + parallel exploration.

🔬

SWE-Agent (Princeton, 2024)

Open-source code agent for research. Key insight: purpose-built ACI (Agent-Computer Interface) tools — search_code, edit_file, find_function — outperform generic bash tools. SWE-bench: ~18% with improved tools.

🔧

GitHub Copilot Workspace (2024)

IDE-integrated code agent. Reads issue/PR → generates plan → implements across multiple files. Human reviews the plan before execution starts. Deep repo context: file tree, PR history, test results. Available to millions of GitHub users.

🖥️

Claude Computer Use (Anthropic, 2024)

Perceives a computer screen via screenshots, acts via mouse/keyboard. Can operate any GUI application as a human would — no API needed. Architecture: multimodal LLM (vision) + computer action tools. Use cases: forms, legacy software, UI testing.

🌐

OpenAI Operator (2025)

Web automation agent integrated into ChatGPT. Completes multi-step web tasks: book flights, fill forms, complete purchases. Safety-first: only proceeds for clearly benign tasks, pauses for human confirmation on sensitive or irreversible actions.

🔎

Perplexity (2023–2024)

Research agent: plans search queries, executes multiple searches, synthesises into cited answers. Lower autonomy (Level 2–3) but highest reliability in its domain. Best research product available to consumers in 2024.

System	Domain	Architecture	Autonomy	Notable
Devin	Software engineering	ReAct + planning + specialised tools	High	First "AI software engineer"
SWE-Agent	Code + GitHub	ACI + specialised tools	Medium-High	Open source, research
Copilot Workspace	IDE + code	Plan-then-execute	Medium (human reviews plan)	Mass market, GitHub
Claude Computer Use	Any GUI	Screenshot → action loop	High	Any app, no API needed
OpenAI Operator	Web automation	Web browsing + actions	Medium	Integrated in ChatGPT
Perplexity	Research	Search + synthesis	Low-Medium	Best research product 2024

Code Agents Deep Dive In-depth

Code agents are the most mature agent category. The reasons are structural: code is verifiable (it either passes the tests or it doesn't), safe to iterate (test → error → fix is a tight feedback loop with no irreversible side effects), and the tools are well-defined (shell, editor, test runner — standard interfaces that haven't changed in decades).

The key lesson from SWE-Agent is that tool design matters as much as the model. Generic tools (bash, read_file) force the agent to navigate file systems and parse raw output manually. Purpose-built ACI tools (search_code, edit_function, apply_patch) return structured results that map directly to how a developer thinks about code — dramatically reducing the reasoning burden per step.

Code Agent Tool Stack — ACI layer bridges agent to execution environment

Computer Use Agents Core

Computer use agents are the most general form of agent: instead of purpose-built APIs, the agent perceives a screenshot of any GUI application and takes actions via simulated mouse clicks and keyboard input. No integration work is required — if a human can see and click it, the agent can too.

Computer Use Loop — Screenshot → Perceive → Act → Verify → repeat

Production Architecture Patterns In-depth

Three patterns cover the architecture of most production agent deployments. Choosing the right pattern depends on task duration, response time requirements, and whether tasks can run independently in the background.

Three Production Agent Architectures — sync, async, event-driven

Cost & Latency Optimisation In-depth

Production agents can cost $0.05–$5.00 per task depending on complexity. The dominant cost driver is LLM tokens — particularly in the planning and synthesis steps. The dominant latency driver is the number of serial LLM calls. Both can be dramatically reduced through model routing (match model size to step complexity), parallel tool execution, and result caching.

Agent Cost Breakdown — planning calls dominate, caching and routing reduce costs

Agent Step	Complexity	Recommended Model	Est. Cost / 1K tokens	Latency
Initial goal analysis	High	Claude Opus / GPT-4o	$0.015	2–4s
Planning generation	High	Claude Sonnet / GPT-4o	$0.003	1–3s
Tool call routing	Low	Claude Haiku / GPT-4o-mini	$0.00025	0.3–0.8s
Tool result parsing	Low	Claude Haiku	$0.00025	0.3–0.5s
Error recovery	Medium	Claude Sonnet	$0.003	1–2s
Final synthesis	High	Claude Sonnet / Opus	$0.003–0.015	1–4s

The Future of Agents Core

The open problems in agentic AI (2024–2026) are: long-horizon reliability (tasks spanning hours or days — compounding errors over hundreds of steps), cross-agent trust (how does agent A know agent B is trustworthy, not compromised or hallucinating), persistent identity (memory that degrades gracefully over months, not sessions), and self-improving agents (agents that improve their own tools and strategies through experience rather than requiring manual retraining).

🔮

Near-Term (2024–2026)

Specialist vertical agents (legal, medical, finance). Enterprise deployment platforms. Agent-to-agent marketplaces. MCP as universal tool standard. RL-trained agents from real task outcomes. Computer use at production reliability.

🌐

Medium-Term (2026–2030)

Multi-month task horizons. Agents that manage other agents at scale. Self-improving tool use. Persistent agent identities across years. Deep integration with physical systems (robotics + agents). Standardised agent-to-agent protocols.

🎓 Domain 8 Complete — Agentic AI

Ch 8.1: Agent = LLM + tools + memory + goal. Agency spectrum: Level 1 (chatbot) → Level 5 (autonomous). Most 2024 systems are Level 3–4.
Ch 8.2: Function calling = structured tool invocation. Tool schemas are prompts — precise descriptions prevent failures. MCP standardises tool connectivity.
Ch 8.3: ReAct: Thought → Action → Observation loop interleaves reasoning with grounding. Reflexion adds verbal self-critique for iterative improvement.
Ch 8.4: Four memory types: in-context, vector store, episodic, semantic. Context management is critical for long tasks.
Ch 8.5: Plan-and-Execute for complex tasks; dynamic replanning for failures. Planning prevents irreversible early mistakes.
Ch 8.6: Supervisor/Worker is the dominant production multi-agent pattern. LangGraph for production; CrewAI/AutoGen for prototyping.
Ch 8.7: Agent failures: context overflow and goal drift are most common. Human-in-the-loop gates required for high-risk irreversible actions.
Ch 8.8: Production agents (Devin, Copilot Workspace, Claude computer use) are here now. Cost and latency optimisation via model routing and parallel tool execution are essential.

🚀 Go Deeper — Production Agents

This Foundation chapter introduced the concepts. But production agents are far more complex:

Tool reliability issues — what happens when APIs fail or return unexpected results
Latency constraints — making agents feel fast while managing multiple LLM calls
Hallucinated actions — agents confidently executing the wrong tool or wrong arguments
Orchestration challenges — coordinating multi-step agents with human checkpoints

→ Covered in depth: AI Agents in Production (Advanced)

Agentic AI is where all previous domains converge. Domain 2 (maths) → the reasoning the LLM uses. Domain 4 (deep learning) → the model powering the agent brain. Domain 5 (NLP) → the language understanding and generation. Domain 6 (CV) → computer use agents seeing the screen. Domain 7 (RL) → the RLHF that aligned the agent to be helpful. An agent is the sum of everything we've built.

The question that remains — and the one Domain 9 addresses — is: as these agents become more capable and more autonomous, how do we ensure they remain aligned with human values?

← Domain 07: Reinforcement Learning Domain 09: MLOps →