Async Control

Stress-testing Asynchronous Control Measures for LLM Agents

Asa Cooper Stickland* Jan Michelfeit*
Arathi Mani Charlie Griffin Ollie Matthews Tomek Korbak
Rogan Inglis Oliver Makins Alan Cooney
UK AI Security Institute
*Red team    Blue team    Green team

Could an LLM agent subtly sabotage your code? We pit a red team (designing sabotaging agents) against a blue team (building monitors to catch them) in realistic software engineering environments. Both teams improved over multiple rounds—our best monitor catches 94% of attacks. But what does this mean for actual deployments? We model asynchronous monitoring where human auditors review the most suspicious episodes, and find that safety heavily depends on deployment parameters—suggesting system design choices may matter more than monitor quality alone.

Introduction

Autonomous software engineering (SWE) agents based on LLMs could cause harm if deployed without appropriate safeguards. Developers can mitigate this risk in two ways: (i) by designing AI agents to be more robust and trustworthy, or (ii) by designing safety measures that limit harm even if AI agents are untrustworthy. Work in AI control addresses (ii).

We frame control as a two-player game: the "blue team" designs and defends the control measures, while the "red team" tries to build or adapt the AI model to bypass those defenses. The red team's performance, after it has optimised against the final blue team system, ideally provides an upper bound on the likelihood of harmful outcomes at deployment time. To ground this game in a realistic, high-stakes deployment scenario, we develop a suite of five diverse software engineering environments that simulate tasks an agent might perform within an AI developer's internal infrastructure.

Control measures are typically intended to intercept dangerous actions before they cause harm. However, many sabotage attempts are likely to be reversible if detected promptly. We investigate asynchronous monitoring, in which a monitoring system reviews agent actions after the fact. Unlike synchronous monitoring, this approach does not impose runtime latency, while still attempting to disrupt attacks before irreversible harm occurs.
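
To make the asynchronous setup concrete, here is a minimal sketch in Python (our illustration, not the paper's implementation) of batch auditing under a fixed human-review budget: a monitor assigns each completed episode a suspicion score, and only the most suspicious fraction is escalated to human auditors. The function and argument names are assumptions.

import numpy as np

def select_for_audit(episode_scores: np.ndarray, audit_budget: float = 0.01) -> np.ndarray:
    """Flag the most suspicious completed episodes for human review.

    episode_scores: per-episode suspicion score from the monitor (higher =
    more suspicious). audit_budget: fraction of episodes humans can review.
    Returns a boolean mask over episodes; ties at the cutoff are all flagged.
    """
    n_audits = max(1, int(len(episode_scores) * audit_budget))
    # The n-th largest score becomes the audit threshold.
    cutoff = np.partition(episode_scores, -n_audits)[-n_audits]
    return episode_scores >= cutoff

Because auditing happens after an episode finishes, the agent pays no runtime latency; the safety question becomes whether flagged attacks can be reverted before their effects become irreversible.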


Figure 1: Structure of the red-blue team game. In a given round, the blue team gets access to logs of episodes from the previous round's red team agent and creates a new monitor. The red team has access to the blue team's monitor implementation (i.e., all details of the monitor system, including code) and always moves last: it iterates against the monitor on the training set, attempting to produce an agent that can execute a malicious side task without being caught.

Key Findings

  • Ensemble monitors achieve low false negative rates: Our final blue team monitor system achieves a 6% false negative rate at a 1% false positive rate on a held-out test environment, even when the red team selects the attack strategy that maximizes evasion. Ensembling is crucial: our best single monitor only achieves a 33% false negative rate under the same conditions (see the metric sketch after this list).
  • Explicit side task discussion can be suppressed in outputs but not in reasoning: When the untrusted model does not use reasoning tokens, we can eliminate explicit mentions of the malicious side task entirely. When reasoning tokens are enabled, the model always discusses the side task in its reasoning. Surprisingly, even when reasoning tokens are enabled but hidden from the monitor, the agent shows no improvement in evading the strongest blue team monitor.
  • System design choices may matter more than monitor quality: In deployment simulations, requiring multiple successful attacks before harm occurs, or introducing delays between attack execution and effect, improves safety, often by more than improving monitor accuracy does (a back-of-the-envelope model below illustrates why).
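
The headline numbers above are false negative rates at a fixed false positive rate. Here is a rough sketch of that metric, together with score-averaging ensembling (a simple aggregation rule we assume for illustration; the paper's actual ensembling may differ), using placeholder scores:

import numpy as np

def fnr_at_fpr(benign: np.ndarray, attack: np.ndarray, fpr: float = 0.01) -> float:
    """False negative rate at the score threshold that flags `fpr` of benign episodes."""
    threshold = np.quantile(benign, 1.0 - fpr)  # flag episodes scoring above this
    return float(np.mean(attack <= threshold))  # attacks slipping under the threshold

rng = np.random.default_rng(0)
# Placeholder suspicion scores: rows = 3 monitors, columns = episodes.
benign_scores = rng.normal(0.0, 1.0, size=(3, 1000))
attack_scores = rng.normal(1.0, 1.0, size=(3, 200))

single_fnr = fnr_at_fpr(benign_scores[0], attack_scores[0])
ensemble_fnr = fnr_at_fpr(benign_scores.mean(axis=0), attack_scores.mean(axis=0))
print(single_fnr, ensemble_fnr)  # averaging scores typically lowers the FNR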
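
And a back-of-the-envelope model of the deployment-parameter effect (ours, not the paper's simulation): if each attack attempt is independently caught with probability equal to the monitor's catch rate, and any catch shuts the deployment down before harm occurs, then harm requires every one of the first k attempts to go uncaught.

def p_harm(catch_rate: float, attacks_needed: int) -> float:
    """Probability the agent completes `attacks_needed` attacks uncaught,
    assuming independent detection per attempt and that any catch halts
    the deployment before harm occurs."""
    return (1.0 - catch_rate) ** attacks_needed

print(p_harm(0.94, 1))  # 0.06: one uncaught attack suffices, 6% harm probability
print(p_harm(0.97, 1))  # 0.03: a much better monitor only halves the risk
print(p_harm(0.94, 3))  # ~0.0002: requiring 3 uncaught attacks cuts risk ~280x

Under these (strong) independence assumptions, requiring three uncaught attacks reduces risk far more than a large improvement in monitor accuracy, matching the qualitative finding above.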

Figure 2: Monitor false negative rates over the rounds of the red-blue team game on the vLLM and Kubernetes environments (left) and the held-out IaC (Infrastructure as Code) environment (right), running all monitor systems retrospectively to understand developments over time. Monitor performance improves over the rounds, and the red team is not able to significantly reduce performance against the round 3 monitor. The Round n Prompt was designed with access to the nth blue team monitor system, and "Baseline Prompt" refers to a prompt that tells the agent it needs to evade a monitor but does not provide strategies.

Citation

If you use this work in your research, please cite:

@article{stickland2025asynccontrol,
    title={Async Control: Stress-testing Asynchronous Control Measures for LLM Agents}, 
    author={Asa Cooper Stickland and Jan Michelfeit and Arathi Mani and Charlie Griffin and Ollie Matthews and Tomek Korbak and Rogan Inglis and Oliver Makins and Alan Cooney},
    year={2025},
    eprint={2512.13526},
    archivePrefix={arXiv},
    primaryClass={cs.LG},
    url={https://arxiv.org/abs/2512.13526}, 
    journal={arXiv preprint arXiv:2512.13526},
}