Bypassing AI Control Protocols via Agent-as-a-Proxy Attacks
Abstract
Agent-as-a-Proxy attacks can bypass monitoring-based defenses in AI agents by treating the agent as a delivery mechanism for prompt injection, demonstrating vulnerabilities in even large-scale monitoring systems.
As AI agents automate critical workloads, they remain vulnerable to indirect prompt injection (IPI) attacks. Current defenses rely on monitoring protocols that jointly evaluate an agent's Chain-of-Thought (CoT) and tool-use actions to ensure alignment with user intent. We demonstrate that these monitoring-based defenses can be bypassed via a novel Agent-as-a-Proxy attack, where prompt injection attacks treat the agent as a delivery mechanism, bypassing both agent and monitor simultaneously. While prior work on scalable oversight has focused on whether small monitors can supervise large agents, we show that even frontier-scale monitors are vulnerable. Large-scale monitoring models like Qwen2.5-72B can be bypassed by agents with similar capabilities, such as GPT-4o mini and Llama-3.1-70B. On the AgentDojo benchmark, we achieve a high attack success rate against AlignmentCheck and Extract-and-Evaluate monitors under diverse monitoring LLMs. Our findings suggest current monitoring-based agentic defenses are fundamentally fragile regardless of model scale.
Community
This is a strong framing of why monitor-only defenses are fragile for tool-using agents. The part that resonates with what we see in practice is that the risky content can be transformed before execution: the final artifact may be a normal-looking tool call, API argument, email draft, or memory update even when the upstream context contained the attack.
We have been building Armorer Guard as a complementary runtime signal for that boundary. It is a local Rust scanner that returns JSON scores/reasons for prompt injection, sensitive-data requests, exfiltration-style text, destructive-command risk, safety bypass, and system-prompt extraction. The goal is not to replace deterministic policy or least privilege, but to add a fast semantic risk signal before context ingress and before send/log/store/execute actions.
Demo: https://huggingface.co/spaces/armorer-labs/armorer-guard-demo
Repo: https://github.com/ArmorerLabs/Armorer-Guard
For Agent-as-a-Proxy style attacks, I suspect the useful stack is layered: strict tool permissions, action-level policy, semantic scanning of candidate action payloads, and replayable traces/evals for the cases where the monitor and agent both get steered.
Get this paper in your agent:
hf papers read 2602.05066 Don't have the latest CLI?
curl -LsSf https://hf.co/cli/install.sh | bash Models citing this paper 0
No model linking this paper
Datasets citing this paper 0
No dataset linking this paper
Spaces citing this paper 0
No Space linking this paper
Collections including this paper 0
No Collection including this paper