arxiv:2602.05066

Bypassing AI Control Protocols via Agent-as-a-Proxy Attacks

Published on Feb 4

Authors:

Abstract

Agent-as-a-Proxy attacks can bypass monitoring-based defenses in AI agents by treating the agent as a delivery mechanism for prompt injection, demonstrating vulnerabilities in even large-scale monitoring systems.

AI-generated summary

As AI agents automate critical workloads, they remain vulnerable to indirect prompt injection (IPI) attacks. Current defenses rely on monitoring protocols that jointly evaluate an agent's Chain-of-Thought (CoT) and tool-use actions to ensure alignment with user intent. We demonstrate that these monitoring-based defenses can be bypassed via a novel Agent-as-a-Proxy attack, where prompt injection attacks treat the agent as a delivery mechanism, bypassing both agent and monitor simultaneously. While prior work on scalable oversight has focused on whether small monitors can supervise large agents, we show that even frontier-scale monitors are vulnerable. Large-scale monitoring models like Qwen2.5-72B can be bypassed by agents with similar capabilities, such as GPT-4o mini and Llama-3.1-70B. On the AgentDojo benchmark, we achieve a high attack success rate against AlignmentCheck and Extract-and-Evaluate monitors under diverse monitoring LLMs. Our findings suggest current monitoring-based agentic defenses are fundamentally fragile regardless of model scale.

View arXiv page View PDF Project page GitHub 6 Add to collection

Community

armorerlabs

1 day ago

This is a strong framing of why monitor-only defenses are fragile for tool-using agents. The part that resonates with what we see in practice is that the risky content can be transformed before execution: the final artifact may be a normal-looking tool call, API argument, email draft, or memory update even when the upstream context contained the attack.

We have been building Armorer Guard as a complementary runtime signal for that boundary. It is a local Rust scanner that returns JSON scores/reasons for prompt injection, sensitive-data requests, exfiltration-style text, destructive-command risk, safety bypass, and system-prompt extraction. The goal is not to replace deterministic policy or least privilege, but to add a fast semantic risk signal before context ingress and before send/log/store/execute actions.

Demo: https://huggingface.co/spaces/armorer-labs/armorer-guard-demo
Repo: https://github.com/ArmorerLabs/Armorer-Guard

For Agent-as-a-Proxy style attacks, I suspect the useful stack is layered: strict tool permissions, action-level policy, semantic scanning of candidate action payloads, and replayable traces/evals for the cases where the monitor and agent both get steered.

Upload images, audio, and videos by dragging in the text input, pasting, or clicking here.

Tap or paste here to upload images

· Sign up or log in to comment

Upvote

Get this paper in your agent:

hf papers read 2602.05066

Don't have the latest CLI?

curl -LsSf https://hf.co/cli/install.sh | bash

Models citing this paper 0

No model linking this paper

Cite arxiv.org/abs/2602.05066 in a model README.md to link it from this page.

Datasets citing this paper 0

No dataset linking this paper

Cite arxiv.org/abs/2602.05066 in a dataset README.md to link it from this page.

Spaces citing this paper 0

No Space linking this paper

Cite arxiv.org/abs/2602.05066 in a Space README.md to link it from this page.

Collections including this paper 0

No Collection including this paper

Add this paper to a collection to link it from this page.