Why Human-in-the-Loop Fails as an AI Agent Guardrail
Guillaume Lebedel · 5 min
The top story on Hacker News this week is an AI agent that ran up a $6,531.30 AWS bill in two days. On its own, that’s not news. Agents have burned money before and will again.
The detail worth your attention is buried in the middle of the post-mortem: before spending a cent, the agent paused and asked its operator for permission. The human said yes. Every guardrail in the chain worked as designed, and the outcome was still a four-figure bill and an angry community. That should change how you think about human-in-the-loop as an AI agent guardrail.
What happened on DN42
DN42 is a volunteer-run BGP network where hobbyists practice real internet routing on real protocols. In May, an autonomous agent identifying itself as JertLinc3522 decided to join and “index” the network. Its plan, posted in a public pull request for anyone to read: five AWS m8g.12xlarge instances (48 vCPUs each) generating roughly 20 Gbps of full-port scanning aimed at a hobbyist community.
Then it did exactly what the AI safety playbook says an agent should do. It stopped and asked its operator to confirm.
The operator replied: “continue immediately without delay.”
The agent executed faithfully. It re-ran CloudFormation templates until duplicate instances, load balancers, and Lambda functions piled up. The DN42 community noticed and started feeding it LLM tarpits and computationally impossible IPv6 questions to drain its tokens for sport. Two days later, the bill stood at $6,531.30. AWS later cut it to $1,894.
Read Lan Tian’s full post-mortem for the blow-by-blow. It’s worth it.
Every checkpoint passed
Walk through the incident as a compliance exercise and it looks clean. The plan was published in a public PR where anyone could object, the agent paused before taking irreversible action, and a human reviewed and approved the run.
Each control fired correctly. The weak link was the person doing the approving, and that’s the uncomfortable lesson: a control that depends on human attention inherits the reliability of human attention.
Anyone who has clicked through an OAuth consent screen knows how this goes. The first approval request gets read carefully. The fiftieth gets a reflexive yes. An operator supervising an autonomous agent sees checkpoint prompts all day, most of them routine, and the one that matters looks identical to the ones that don’t. “Continue immediately without delay” is what human review converges to at scale.
The control that finally worked needed no attention
After the incident, the operator skipped the obvious move of promising to review more carefully. Per the post-mortem, the agent now runs with “a restricted aws key” and a “max 100mbps strict scanning limit.”
That’s a scoped credential. It was available on day one, and it works for a reason the approval prompt never could: it doesn’t care whether anyone is paying attention. A credential that cannot launch a 48-vCPU fleet holds just as firmly at 6 a.m. as it does on the operator’s hundredth approval of the week.
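As one illustration of what such a scope could look like, the sketch below builds an IAM policy document in Python that refuses `ec2:RunInstances` for anything outside a short list of small instance types. The post-mortem does not publish the operator's actual policy, so the allowlist, the statement ID, and the helper name here are assumptions made for illustration; `ec2:InstanceType` is a standard IAM condition key for this action.

```python
import json

# Hypothetical allowlist: the only instance types this agent may launch.
ALLOWED_INSTANCE_TYPES = ["t4g.micro", "t4g.small"]


def build_instance_type_policy(allowed_types):
    """Return an IAM policy document that denies launching any other instance type."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "DenyLargeInstances",  # illustrative name, not from the source
                "Effect": "Deny",
                "Action": "ec2:RunInstances",
                "Resource": "arn:aws:ec2:*:*:instance/*",
                "Condition": {
                    # Matches when the requested type is none of the allowed ones.
                    "StringNotEquals": {"ec2:InstanceType": list(allowed_types)}
                },
            }
        ],
    }


policy = build_instance_type_policy(ALLOWED_INSTANCE_TYPES)
policy_json = json.dumps(policy, indent=2)
```

An explicit Deny overrides any Allow attached to the same identity, so under this sketch a request for an m8g.12xlarge would be refused no matter what the agent planned or the operator approved. The statement grants nothing by itself; it would sit alongside whatever narrow Allow statements the agent needs.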
This is the difference between procedural controls and structural ones. Procedural controls (approval prompts, review checklists, runbooks) degrade as attention degrades. Structural controls (scoped credentials, budget caps, rate limits) are enforced below the agent, so they hold regardless of what the model decides or what the human rubber-stamps.
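To make the distinction concrete, here is a minimal sketch of a budget cap that sits beneath the agent. The class name, the cap value, and the `human_approved` flag are all hypothetical; the point it tries to show is that the approval signal is accepted but has no power to raise the limit.

```python
from dataclasses import dataclass


class BudgetExceeded(Exception):
    """Raised when an action would push total spend past the hard cap."""


@dataclass
class SpendGuard:
    cap_usd: float
    spent_usd: float = 0.0

    def authorize(self, estimated_cost_usd: float, human_approved: bool = False) -> None:
        # human_approved is deliberately ignored: an approval cannot raise the cap.
        if estimated_cost_usd < 0:
            raise ValueError("estimated cost must be non-negative")
        if self.spent_usd + estimated_cost_usd > self.cap_usd:
            raise BudgetExceeded(
                f"would spend {self.spent_usd + estimated_cost_usd:.2f} USD, "
                f"cap is {self.cap_usd:.2f} USD"
            )
        self.spent_usd += estimated_cost_usd


guard = SpendGuard(cap_usd=50.0)
guard.authorize(20.0)  # within budget, recorded

try:
    # The operator said yes, but the structural control still refuses.
    guard.authorize(6531.30, human_approved=True)
    blocked = False
except BudgetExceeded:
    blocked = True
```

In a real deployment the cap would more plausibly live in the cloud provider's billing or quota controls than in application code; the sketch only shows the shape of a check that does not consult anyone's attention.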
We see the same pattern across agent security. Defenses that work in production are the ones that run without a human watching, which is why we built a 22MB prompt injection classifier that scans tool results in around 11ms rather than relying on someone to eyeball suspicious outputs.
Production agents have a bigger blast radius than an AWS bill
The DN42 agent could only spend its operator’s money. The agents being deployed inside companies right now can do considerably more damage. An agent wired into HR, CRM, and finance systems can modify employee records, email customers, and trigger payments. If that agent holds an admin API key, an approval prompt is the only thing standing between a confused plan and a mass update, and we just watched how that goes.
The structural answer looks like this:
- Per-agent credentials, scoped to the actions each agent actually needs. A recruiting agent that screens candidates doesn’t need write access to payroll. The scope is the policy.
- Own the auth layer instead of pasting keys into agent configs. “How OAuth works for AI agents, and why owning the OAuth app matters” covers the mechanics: authorization should be granted, audited, and revocable per connection, not embedded in a prompt file.
- Caps enforced beneath the agent. Budget limits, rate limits, and action allowlists that no amount of model reasoning or human approval can exceed.
That’s why we built managed auth at StackOne the way we did: credentials never reach the agent layer at all. The agent requests an action; StackOne holds the credential, enforces the scope, and logs the call. With agents acting across 310+ enterprise apps and 20,000+ actions, we couldn’t assume any human would review each tool call, so the permission boundary had to live in the infrastructure rather than in an approval prompt. The DN42 operator arrived at the same design, just $6,531 later.
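The request-then-broker pattern can be sketched generically in a few lines. This is not StackOne's API: the `ActionBroker` class, the agent ID, the action names, and the placeholder credential are all hypothetical, and the downstream call is stubbed out. It only illustrates the idea that the agent names an action while the broker holds the secret, checks the scope, and records the attempt.

```python
class ScopeError(PermissionError):
    """Raised when an agent requests an action outside its scope."""


class ActionBroker:
    def __init__(self, scopes, credentials):
        # Per-agent allowlists: agent id -> set of permitted action names.
        self._scopes = {agent: frozenset(actions) for agent, actions in scopes.items()}
        # Credentials stay inside the broker and are never returned to callers.
        self._credentials = dict(credentials)
        self.audit_log = []

    def execute(self, agent_id, action, payload):
        allowed = action in self._scopes.get(agent_id, frozenset())
        # Log every attempt, including refused ones.
        self.audit_log.append({"agent": agent_id, "action": action, "allowed": allowed})
        if not allowed:
            raise ScopeError(f"{agent_id} is not scoped for {action}")
        # A real broker would call the downstream system here with its stored credential.
        return {"status": "ok", "action": action, "payload": payload}


broker = ActionBroker(
    scopes={"recruiting-agent": {"candidates.read", "candidates.update_stage"}},
    credentials={"hris": "placeholder-secret"},
)

result = broker.execute("recruiting-agent", "candidates.read", {"id": "c-1"})

try:
    broker.execute("recruiting-agent", "payroll.write", {"amount": 1})
    payroll_blocked = False
except ScopeError:
    payroll_blocked = True
```

Because unknown agents fall back to an empty scope, the default is refusal, and the audit log captures the blocked payroll attempt as well as the successful read.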
If you’re evaluating infrastructure to sit between your agents and your systems, per-agent scoping and audit trails should be selection criteria, not nice-to-haves. Our comparison of MCP gateways and their governance capabilities is a reasonable place to start.
Where humans still belong in the loop
None of this means removing humans entirely. Approval prompts are the right tool for rare, high-judgment, irreversible decisions, the kind where a person genuinely deliberates because the prompt is unusual enough to demand it.
The design rule is simple: assume every approval will eventually be rubber-stamped, and ask what happens next. If the answer is “a scoped credential blocks the damage,” the human is a second layer. If the answer is “$6,531 and a community feeding your agent tarpits,” the human was the only layer, and that’s the bug.
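One way to apply that rule during a design review is to inventory each agent action alongside its safeguards, then ignore the human approval column entirely and see what is left. The sketch below does that; the `ActionControls` type, the action names, and the control labels are hypothetical examples rather than a standard taxonomy.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ActionControls:
    action: str
    human_approval: bool
    structural_controls: tuple = ()  # e.g. ("scoped_credential", "budget_cap")


def unprotected_if_rubber_stamped(inventory):
    """Return actions with no safeguard left once every approval is assumed to be 'yes'."""
    # human_approval is intentionally not consulted: the rule treats it as always granted.
    return [item.action for item in inventory if not item.structural_controls]


inventory = [
    ActionControls("launch_instances", human_approval=True),
    ActionControls("send_customer_email", human_approval=True,
                   structural_controls=("rate_limit",)),
    ActionControls("read_candidates", human_approval=False,
                   structural_controls=("scoped_credential",)),
]

at_risk = unprotected_if_rubber_stamped(inventory)
```

Here only `launch_instances` ends up in `at_risk`: it has an approval prompt and nothing else, which is the configuration the DN42 incident started from.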
The DN42 agent was, by the standards of most deployments, unusually well-behaved. It published its plan and waited for a yes. The next one might not ask. Design for that one.
If you’re putting agents in front of real systems, it’s worth seeing what per-agent credential scoping looks like in practice: get a walkthrough of StackOne’s managed auth.