
Agent Suicide by Context

Guillaume Lebedel, CTO
15 min read

Your agent just killed itself.

It didn’t crash. It didn’t throw an exception. It made a perfectly reasonable decision that happened to pull 250,000 tokens into its context, blowing past its 200K window. The request failed, the task died, and the agent never understood why.

This is context suicide: agents taking actions that destroy their own ability to continue.

Why AI Agents Are Bad at Managing Context Limits

Context limits aren’t new, but the shape of the problem has changed. Early discussions focused on RAG: how much to pre-load before the conversation starts. Agentic workflows changed everything. Context now builds dynamically through tool calls, accumulating unpredictably as agents take actions.

Modern agents can see their token budgets. Claude models with extended thinking can track usage during agentic loops. But knowing your budget doesn’t mean you’ll manage it well. Both Anthropic’s context engineering guide and Factory.ai’s research show that even million-token windows fail against real enterprise workloads.

The fundamental issue is architectural. LLM interaction is single-threaded: request in, response out, tool call, response back. There’s no native way to fork, parallelize, or isolate risky operations. Without orchestration, everything happens in one context with no recovery path. The harness layer is where this gets fixed: sub-agents add isolation, turning every tool call from a gamble into a recoverable operation.

And agents gamble constantly. When an agent calls a list tool or reads a file, it could check the size first, but most don’t. They do what they’re told (finding, searching, trawling for more context) even when it hurts them. The tool call looks reasonable. The 150K response is fatal.

We see this regularly at StackOne when building tools for Notion and GitHub. An agent retrieves a page, the page contains embedded databases, the response balloons to 200K tokens. Dead before it processes anything useful. We can help by offering options to retrieve less data. But we’re tool providers, not the agent harness. The architecture has to be designed for survival.

How AI Agents Die from Context Overflow

Here are the common ways agents exceed their context limits:

The Killing Blow: Fatal Context Overflows

Some agents start so heavy that the first tool call finishes them off.

RAG Pre-Loading. An agent configured to retrieve “relevant context” at startup pulls 50 documents averaging 2,500 tokens each. That’s 125K tokens before the user says anything. The agent isn’t dead yet, but it’s wounded. The user asks a question, the agent makes one tool call to fetch more context, and that 80K response pushes it over.

// Agent starts with 125K of pre-loaded context
{
  "relevant_context": [
    { "doc_id": "auth-001", "content": "## Authentication Flow...", "tokens": 2847 },
    { "doc_id": "auth-002", "content": "## Token Refresh...", "tokens": 1923 },
    // ... 48 more documents
  ],
  "total_tokens": 125000  // Already 62% of 200K limit before user speaks
}
// User: "How does the payment flow work?" → +85K tokens → Dead

Tool Definition Catch-22. MCP servers expose tools with JSON schemas. Detailed descriptions and input/output schemas improve accuracy, but the more detail you add, the more context you burn. A server with 500+ tools can use 100K+ tokens on definitions alone.

The irony: you need good tool definitions for accuracy, but verbose definitions eat the context you need for actual work. Sparse definitions save tokens but increase hallucination. There’s no winning without dynamic tool loading.

// Just 3 of 500+ tool definitions in the system prompt
{
  "tools": [
    {
      "name": "salesforce_create_lead",
      "description": "Creates a new lead in Salesforce CRM...",
      "parameters": {
        "type": "object",
        "properties": {
          "first_name": { "type": "string", "description": "Lead's first name" },
          "last_name": { "type": "string", "description": "Lead's last name" },
          "company": { "type": "string", "description": "Company name" },
          "email": { "type": "string", "format": "email" },
          // ... 20 more fields per tool
        },
        "required": ["last_name", "company"]
      }
    },
    // ... 499 more tools with similar verbosity
  ]
}

Greedy File Reads. “Read the codebase to understand the structure” sounds reasonable. The agent reads three files, each 15K tokens. Then it decides it needs more context. Ten files later, it’s dead.
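A cheap guard in the harness goes a long way here. Below is a minimal sketch, not taken from any particular framework: it stats a file and estimates its token cost before allowing a full read into context. The ~4 bytes-per-token ratio, the 25% budget cutoff, and the readWithBudget helper are illustrative assumptions.

import { promises as fs } from "fs";

const BYTES_PER_TOKEN = 4; // rough heuristic; real tokenizers vary

// Hypothetical guard: estimate a file's token cost from its size on disk
// before letting a full read into the agent's context.
async function readWithBudget(path: string, remainingTokens: number): Promise<string> {
  const { size } = await fs.stat(path);
  const estimatedTokens = Math.ceil(size / BYTES_PER_TOKEN);

  if (estimatedTokens > remainingTokens * 0.25) {
    // Too big to swallow whole: return a small preview plus a size warning
    const handle = await fs.open(path, "r");
    const { buffer, bytesRead } = await handle.read(Buffer.alloc(8192), 0, 8192, 0);
    await handle.close();
    return `[file is ~${estimatedTokens} tokens; showing first 8KB only]\n` +
      buffer.subarray(0, bytesRead).toString("utf8");
  }
  return fs.readFile(path, "utf8");
}

The point isn’t the exact threshold; it’s that the size check happens before the tokens are spent, not after.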

Context Death by a Thousand Cuts

Other deaths are slower. Drew Breunig’s analysis of context rot found that model correctness began to fall around 32K tokens, with agents increasingly tending to repeat actions from their own long history as context grows. Research on the “lost in the middle” effect shows that models tend to ignore information buried in the middle of long contexts.

Conversation Bloat. Every turn adds tokens. Without compaction, a 20-turn conversation easily exceeds 200K tokens. The agent “remembers” everything but understands less and less.

Context Poisoning. When Gemini agents played Pokemon, researchers found they would occasionally hallucinate, poisoning their own context with false information. Once the “goals” section was corrupted, the agent developed nonsensical strategies and repeated behaviors in pursuit of impossible objectives.

Intermediate Result Accumulation. Each tool call returns results. Each result stays in context. An agent making 30 API calls accumulates 30 responses. Even if each response is only 3K tokens, that’s 90K tokens of intermediate state.

// Context after 30 sequential API calls
{
  "conversation": [
    { "role": "user", "content": "Sync all employee data from our HRIS systems" },
    { "role": "assistant", "content": "I'll fetch employees from each system." },
    { "role": "tool", "name": "workday_list_employees", "tokens": 3200 },
    { "role": "tool", "name": "bamboohr_list_employees", "tokens": 2800 },
    { "role": "tool", "name": "adp_list_employees", "tokens": 3100 },
    // ... 27 more tool responses
  ],
  "accumulated_tool_tokens": 94500,
  "total_context": 98200  // Half the context is intermediate results
}

Why AI Agents Don’t Protect Themselves

Agents can be taught to monitor their own context. Some frameworks expose token counts. You can prompt agents to check response sizes before committing. The problem: context management is a job in itself.

An agent juggling “solve the user’s problem” and “don’t kill yourself” is doing two jobs at once. Tracking resource constraints competes with solving the actual problem. Every token spent reasoning about context is a token not spent on the problem.

This is partly why sub-agent architectures work: you can dedicate the orchestrator to planning and context management while sub-agents focus purely on execution. The orchestrator doesn’t need to predict response sizes if it delegates risky operations to disposable workers.

AI Agent Survival Architectures for Context Management

You can’t predict your way out of this. You need architectures that survive when prediction fails.

Sub-Agent Isolation for Context Management

The strongest pattern: delegate risky operations to sub-agents with their own context windows.

Anthropic’s Claude Code does this. The main agent maintains a high-level plan with clean context. Sub-agents handle focused tasks (file search, code analysis, test execution). Each sub-agent might use 50K tokens exploring deeply, but returns only a 2K token summary.

If a sub-agent dies from context overflow, the orchestrator notices the failure and adapts. It can spawn a new sub-agent with different instructions: “search only the src/ directory” instead of “search the entire codebase.”
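Here is a minimal sketch of that loop, assuming a hypothetical runSubAgent call that runs a task in a fresh context window and reports whether it overflowed; the retry logic and the narrowed instruction are illustrative, not Claude Code’s actual implementation.

// Hypothetical sub-agent runner: executes a task in its own context window
// and returns either a short summary or a context-overflow failure.
type SubAgentResult =
  | { ok: true; summary: string }                      // e.g. ~2K tokens
  | { ok: false; reason: "context_overflow" | "error" };

declare function runSubAgent(instruction: string): Promise<SubAgentResult>;

// Orchestrator: its own context only ever receives summaries or failure signals.
async function explore(scope: string): Promise<string> {
  const result = await runSubAgent(`Search ${scope} for the auth flow and summarize it`);
  if (result.ok) return result.summary;

  if (result.reason === "context_overflow") {
    // The sub-agent died; the orchestrator didn't. Retry with a narrower scope.
    const retry = await runSubAgent(`Search only ${scope}/src for the auth flow and summarize it`);
    if (retry.ok) return retry.summary;
  }
  return "exploration failed; plan around the missing context";
}

The important property is that the overflow happens inside the sub-agent’s disposable context; the orchestrator only ever pays for the summary or the failure signal.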

Sub-Agent Benefits

  • Isolation: Sub-agent death doesn’t kill the orchestrator
  • Recovery: Failed tasks can be retried with adjusted parameters
  • Compression: 50K tokens of exploration becomes 2K of summary
  • Specialization: Each agent optimized for its specific task

Code Mode: Programmatic Filtering

Sub-agents isolate risk. But there’s another approach: never let large results enter context at all.

Code mode lets agents write code that fetches data, chains tool calls, and filters output before it enters context. Cloudflare pioneered this approach, and Anthropic’s Claude Code implements a similar pattern.

Traditional approach: call a tool, receive the response directly into context, process the response. If the response is 150K tokens, it’s now in your context whether you need it or not.

Code mode approach: write code that fetches, processes, and extracts only what you need. The difference is dramatic. When agents filter data in a code execution environment instead of receiving it directly, token usage drops by over 90% (e.g., 150K down to a few thousand).

// Data filtering: agent sees 5 rows, not 10,000
const allRows = await gdrive.getSheet({ sheetId: 'abc123' });
const pendingOrders = allRows.filter(row => row["Status"] === 'pending');
console.log(pendingOrders.slice(0, 5));  // ~200 tokens instead of 40K

// Tool chaining: data flows between tools without entering context
const transcript = (await gdrive.getDocument({ documentId: 'notes' })).content;
await salesforce.updateRecord({
  objectType: 'SalesMeeting',
  recordId: '00Q5f000001abcXYZ',
  data: { Notes: transcript }  // 15K tokens never seen by the model
});

The agent is programming its own data pipeline, not passively receiving tool outputs. It can filter 10,000 rows to the 5 that matter, chain tool calls where data flows directly between tools, and run analysis in the execution environment instead of loading everything into context.

File-Based Iteration

A related pattern uses the filesystem as a buffer. Instead of loading everything into context, dump results to files and use bash to iterate through them.

# Dump search results to a file
search_logs "authentication" > /tmp/results.json

# Agent can now selectively read what it needs
head -100 /tmp/results.json        # First 100 lines
jq '.items[0:5]' /tmp/results.json  # First 5 items
grep "error" /tmp/results.json | head -20  # Filtered subset

The full results live on disk. The agent iteratively explores with bash commands, each returning small slices. If it needs more, it reads more. If it doesn’t, those 50K tokens never touched context.

Memory Pointers for Token Reduction

Code mode writes to files. But what about data that needs to persist across many operations?

Recent research shows a memory pointer approach that reduces token usage by 87%. Instead of passing raw data through context, the agent references external storage by ID.

The agent sees: [memory:doc_12345] instead of the 10K token document. When it needs the content, it retrieves it. When it doesn’t, the pointer costs 20 tokens instead of 10,000.

// Without memory pointers: 50 documents x 10K tokens = 500K tokens
{
  "documents": [
    { "id": "doc_001", "content": "Full 10,000 token document..." },
    { "id": "doc_002", "content": "Another 10,000 token document..." },
    // ... 48 more full documents
  ],
  "total_tokens": 500000  // Impossible to fit in any context
}

// With memory pointers: 50 pointers x 20 tokens = 1K tokens
{
  "document_refs": [
    "[memory:doc_001]", "[memory:doc_002]", "[memory:doc_003]",
    // ... 47 more pointers
  ],
  "current_document": { "id": "doc_001", "content": "Full content..." },
  "total_tokens": 11000  // Only the active document + pointers
}

This works particularly well for repetitive operations. An agent processing 50 similar documents doesn’t need all 50 in context simultaneously. It needs the current document and references to the others.

Progressive Context Compaction

The techniques above handle tool response overflow. But conversations also accumulate tokens turn by turn, even without large tool responses. What about gradual bloat from conversation history itself?

Compaction summarizes conversation history to reclaim context space. Anthropic describes this as distilling context contents in a high-fidelity manner while preserving architectural decisions and implementation details.

The limitation: compaction only helps with gradual degradation. If an agent takes a single action that returns 250K tokens, compaction doesn’t save it. The death is instant.

Compaction works best when combined with other techniques. Use sub-agents to prevent one-shot kills. Use compaction to handle conversation accumulation over time.

Compaction Tradeoffs

  • Prevents: Gradual context bloat from long conversations
  • Doesn’t prevent: One-shot kills from massive tool responses
  • Risk: Overly aggressive summarization loses subtle context
  • Best for: Multi-turn tasks with clear milestones
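For illustration, here’s a minimal compaction step, assuming a generic message history, a countTokens helper, and a summarize call backed by an LLM; the threshold and the number of recent turns to preserve are arbitrary choices, not a prescribed recipe.

type Message = { role: "system" | "user" | "assistant" | "tool"; content: string };

// Assumed helpers: a token counter and an LLM call that distills older turns.
declare function countTokens(messages: Message[]): number;
declare function summarize(messages: Message[]): Promise<string>;

const COMPACTION_THRESHOLD = 120_000; // compact well before the 200K limit
const KEEP_RECENT_TURNS = 10;         // recent messages stay verbatim

async function maybeCompact(history: Message[]): Promise<Message[]> {
  if (countTokens(history) < COMPACTION_THRESHOLD) return history;

  const older = history.slice(0, -KEEP_RECENT_TURNS);
  const recent = history.slice(-KEEP_RECENT_TURNS);

  // Distill older turns into one summary message; keep decisions and open tasks.
  const summary = await summarize(older);
  return [
    { role: "system", content: `Summary of earlier conversation:\n${summary}` },
    ...recent,
  ];
}

Compacting well below the hard limit matters: the margin is what absorbs the next tool response.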

Recursive Language Model (RLM)

Sub-agents work. But what if you need to process more data than any single agent can handle?

RLM (Recursive Language Model) takes a different approach: store context in a Python REPL environment and let the model write code to interact with it.

Instead of loading data into the context window, RLM loads data into REPL variables. The model then writes Python code to peek at subsets, run regex queries, partition into chunks, and summarize results. Data lives in memory, not in context.

# Context stored in REPL variable, not model context
context = load_document("large_codebase.tar.gz")  # 500K tokens

# Model writes code to explore without loading everything
preview = context.peek(lines=100)  # See structure first
matches = context.search(r"def authenticate")  # Regex query
chunks = context.partition(chunk_size=5000)  # Split for processing

# Recursive call: spawn sub-instances with smaller contexts
results = []
for chunk in chunks:
    summary = rlm.call(query="Find security issues", context=chunk)
    results.append(summary)  # Only summaries enter the main context

The REPL environment manages control flow. The model decides decomposition strategies at inference time: how to chunk, what to search for, when to spawn recursive calls. Each recursive call gets its own isolated context, processes a piece, and returns a summary.

This works for tasks that would otherwise be impossible: analyzing entire codebases, processing thousands of documents, running comprehensive security audits. The main agent’s context grows slowly because it only sees summaries, not raw data.

RLM Pattern

  • Storage: Data in REPL variables, not context window
  • Interaction: Model writes Python to peek, search, partition
  • Recursion: Spawns isolated sub-instances for chunks
  • Scale: Can process inputs far beyond any context limit

MCP Tool Metadata for Context Management

The architectures above are defensive. What about giving agents the information they need to protect themselves?

MCP is already moving in this direction. The spec includes a size field for resources that hosts can use to estimate context window usage. But this only covers resources, not tool responses.

Good tool design would extend this pattern:

// Tool definition with size hints
{
  name: "search_logs",
  description: "Search application logs",
  parameters: { ... },
  responseMetadata: {
    estimatedTokens: "variable",
    averageItemSize: 500,      // tokens per log entry
    defaultLimit: 100,         // items returned by default
    pagination: {
      supported: true,
      recommendedPageSize: 50
    }
  }
}

With this metadata, an agent can reason: “I have 60K tokens remaining. This tool returns ~500 tokens per item with a default limit of 100. That’s 50K tokens. I should request a smaller page or use a sub-agent.”

Even better: a dry run mode where the agent can ask “how big would this response be?” before committing:

// Dry run returns size estimate without fetching data
const estimate = await search.dryRun({ query: "auth errors", days: 30 });
// { estimatedTokens: 180000, itemCount: 3600 }

// Agent can now make an informed decision
if (estimate.estimatedTokens > remainingContext * 0.5) {
  // Delegate to sub-agent or adjust parameters
}

If you’re building tools and you give agents this information, their survival becomes their responsibility. They have the budget awareness, they have the size estimates, they can decide whether to proceed or delegate. Tools that surface their resource implications let agents make informed choices instead of blind ones.

Building an AI Agent Context Strategy

The techniques above aren’t mutually exclusive. Most production agent systems combine several:

  • Sub-agents handle anything with unpredictable output (file reads, searches, API calls)
  • Code mode and file-based iteration keep large results out of context entirely
  • Compaction manages gradual bloat in long conversations
  • RLM patterns tackle tasks that exceed any single context window
  • Tool metadata lets agents make informed decisions about risky calls
  • Built-in filters let tools return exactly what’s needed without harness changes

The common thread: assume your agent will encounter context-killing situations, and design systems that recover gracefully when it happens.

What MCP Tool Providers Can Do About Context Overflow

Most survival architectures live in the agent harness. But tool providers aren’t helpless. There’s a spectrum of how much we can help.

Discovery at the server level. Instead of exposing 500 tools upfront, build discovery into the MCP server itself. At StackOne, we built meta-execute and search tools that let agents discover available actions without loading every schema. A good harness can also solve this (Anthropic’s Claude SDK does dynamic tool loading), but embedding discovery in the server means it works regardless of how sophisticated the harness is.
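A sketch of the idea, not StackOne’s actual implementation: the server exposes only two meta-tools, a search that returns matching tool schemas on demand and an execute that runs any discovered tool by name. The registry, runTool, and the substring matching are assumptions for illustration.

// Hypothetical server-side registry: 500+ full tool schemas live here,
// but none of them are sent to the model up front.
type ToolSchema = { name: string; description: string; parameters: object };
declare const registry: Map<string, ToolSchema>;
declare function runTool(name: string, args: object): Promise<unknown>;

// Meta-tool 1: search_tools — return only the few schemas that match a query.
export async function searchTools(query: string, limit = 5): Promise<ToolSchema[]> {
  const q = query.toLowerCase();
  return [...registry.values()]
    .filter(t => t.name.toLowerCase().includes(q) || t.description.toLowerCase().includes(q))
    .slice(0, limit); // a handful of schemas instead of 100K+ tokens of definitions
}

// Meta-tool 2: execute — run any tool the agent discovered by name.
export async function execute(name: string, args: object): Promise<unknown> {
  if (!registry.has(name)) throw new Error(`Unknown tool: ${name}`);
  return runTool(name, args);
}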

Code mode is powerful but heavy. Letting agents generate code to filter and transform data works well. But it requires infrastructure: a secure sandbox to execute arbitrary code, proper isolation, timeout handling. That’s a lot to ask from every tool provider. Most won’t build it.

The middle ground: built-in filters. What if tools offered filter parameters using syntax LLMs already know? Think JSONPath or SQLite-style queries:

// Instead of returning 50K tokens of Notion pages...
notion.search({
  query: "Q4 planning",
  filter: "$.results[?(@.properties.status == 'In Progress')]",  // JSONPath
  select: ["id", "title", "last_edited"]  // Only these fields
})

// Or SQL-like for structured data
logs.query({
  where: "level = 'error' AND timestamp > '2026-01-01'",
  limit: 50,
  select: ["message", "stack_trace"]
})

This isn’t full code mode. No sandbox needed. The syntax is well-known to LLMs (they’ve seen millions of JSONPath and SQL examples). It doesn’t require generating and executing scripts, just a filtering parameter on existing tools. But it achieves the same goal: agents request exactly what they need instead of receiving everything and drowning.

The filtering happens server-side. The agent gets 2K tokens instead of 50K. No harness changes required.

The Future of AI Agent Context Management

The architectures above work. Sub-agents, code mode, RLM, memory pointers, tool metadata, built-in filters. They’re all effective. But they’re also fragmented. Every harness implements its own version. Every tool provider makes different choices.

SDKs and agent frameworks still lack most of this functionality, and that makes sense. The space moves fast. Commit to one approach today, and it might be obsolete in three months. Teams are cautious about building infrastructure that could become technical debt.

For now, we build survival architectures. Assume agents will kill themselves. Design systems that recover anyway. The harnesses that get this right become the platforms, not because of features, but because their agents actually finish the task.