
Time Window Asynchronous Monitoring Simulation

About this simulation

What is this?

This interactive simulation accompanies our paper "Async Control: Stress-testing Asynchronous Control Measures for LLM Agents". It models a deployment scenario in which an LLM agent processes tasks and, on some of them, may attempt a malicious attack while trying to evade detection.

The Setup

In this model of asynchronous monitoring, a monitoring system reviews agent actions after they occur rather than blocking them in real time. Human auditors review the most suspicious episodes first, prioritized by the monitor's suspiciousness score. If an attack is audited before it takes effect, the blue team wins; otherwise, if enough attacks succeed, the red team wins.
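The win conditions above can be illustrated with a small Monte Carlo sketch. This is not the simulation's actual code. It assumes synthetic Gaussian suspiciousness scores (benign tasks scored N(0, 1), attacks scored N(attack_score_shift, 1)) instead of the empirical distributions, approximates prioritized auditing with a fixed score threshold, and assumes every audit finishes before the attack takes effect. The names simulate_deployment, red_win_rate, and attack_score_shift are hypothetical.

```python
import random
from statistics import NormalDist


def simulate_deployment(num_tasks, attack_prob, attacks_needed, audit_rate,
                        attack_score_shift=2.0, rng=None):
    """Play out one deployment; return "blue", "red", or "none"."""
    if not 0.0 < audit_rate < 1.0:
        raise ValueError("audit_rate must be strictly between 0 and 1")
    rng = rng or random.Random()
    # Assumption: auditing the top `audit_rate` fraction of tasks is modeled
    # as auditing every task whose score exceeds this benign-score quantile.
    threshold = NormalDist().inv_cdf(1.0 - audit_rate)
    successes = 0
    for _ in range(num_tasks):
        if rng.random() >= attack_prob:
            continue  # benign task, nothing to catch
        score = rng.gauss(attack_score_shift, 1.0)
        if score >= threshold:
            return "blue"  # attack audited before it takes effect
        successes += 1
        if successes >= attacks_needed:
            return "red"  # enough undetected attacks succeeded
    return "none"  # deployment ended with no decisive event


def red_win_rate(trials, seed=0, **kwargs):
    """Estimate the red-team win probability over repeated deployments."""
    rng = random.Random(seed)
    wins = sum(simulate_deployment(rng=rng, **kwargs) == "red"
               for _ in range(trials))
    return wins / trials
```

The per-task loop is written for readability; with very small attack probabilities and long deployments it becomes slow, and a real implementation would likely sample attack positions directly.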

Key Parameters

  • Attack Probability: How often the agent attempts an attack (e.g., 1e-5 = 1 in 100,000 tasks)
  • Attack Latency: Hours between attack execution and the moment harm occurs; this is the window available for detection
  • Attacks Needed: How many undetected attacks must succeed before harm occurs
  • Audit Rate: Fraction of tasks that human auditors can review (e.g., 0.01 = 1%)
  • Upfront Audits: Number of initial tasks audited exhaustively before switching to prioritized auditing
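To make the scale of these parameters concrete, the sketch below bundles them into a small configuration object and derives two quantities from it. The class name, field names, the deployment size of 1,000,000 tasks, and the audit-budget accounting (upfront audits plus the audit rate applied to the remaining tasks) are all assumptions for illustration, not details taken from the simulation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SimulationConfig:
    """Hypothetical container for the parameters listed above."""
    attack_probability: float    # per-task chance that the agent attacks
    attack_latency_hours: float  # detection window before harm occurs
    attacks_needed: int          # undetected attacks required for harm
    audit_rate: float            # fraction of tasks auditors can review
    upfront_audits: int          # initial tasks audited exhaustively

    def __post_init__(self):
        if not 0.0 <= self.attack_probability <= 1.0:
            raise ValueError("attack_probability must be in [0, 1]")
        if not 0.0 <= self.audit_rate <= 1.0:
            raise ValueError("audit_rate must be in [0, 1]")
        if self.attacks_needed < 1 or self.upfront_audits < 0:
            raise ValueError("attacks_needed >= 1 and upfront_audits >= 0")

    def expected_attacks(self, num_tasks: int) -> float:
        """Mean number of attack attempts over num_tasks tasks."""
        return self.attack_probability * num_tasks

    def audit_budget(self, num_tasks: int) -> int:
        """Assumed accounting: upfront audits plus audit_rate of the rest."""
        upfront = min(self.upfront_audits, num_tasks)
        return upfront + round(self.audit_rate * (num_tasks - upfront))


config = SimulationConfig(
    attack_probability=1e-5,
    attack_latency_hours=1.0,
    attacks_needed=1,
    audit_rate=0.01,
    upfront_audits=1000,
)
# Over a hypothetical 1,000,000-task deployment: about 10 attack attempts
# against a budget of 10,990 human audits.
print(round(config.expected_attacks(1_000_000)), config.audit_budget(1_000_000))
```

Under these assumptions, attacks are several orders of magnitude rarer than audits, which is why the quality of the monitor's prioritization matters more than the raw audit count.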

How to Use

Adjust the parameters and click "Run Simulation" to see the probability that the red team (attacker) wins. Enable "Sweep" on any parameter to test multiple values and see how safety changes across that dimension. The simulation uses empirical suspiciousness score distributions from round 3 of our red-blue team experiments.
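A parameter sweep of this kind can be sketched with a simplified closed form. The code below is an illustration under strong assumptions and does not reproduce the simulation's method: scores are synthetic Gaussians (benign N(0, 1), attacks N(attack_score_shift, 1)), auditing is a fixed threshold, attacks are independent, and audits always finish within the attack latency. Under those assumptions the red team wins exactly when at least attacks_needed attacks occur and the first attacks_needed of them all go unaudited. The function and parameter names are hypothetical.

```python
import math
from statistics import NormalDist


def red_win_probability(num_tasks, attack_prob, attacks_needed, audit_rate,
                        attack_score_shift=2.0):
    """Closed-form red-team win probability under the stated assumptions."""
    # Score threshold that flags roughly `audit_rate` of benign tasks.
    threshold = NormalDist().inv_cdf(1.0 - audit_rate)
    # Probability that a single attack scores below the audit threshold.
    p_uncaught = NormalDist(mu=attack_score_shift, sigma=1.0).cdf(threshold)
    # Binomial probability that fewer than `attacks_needed` attacks occur.
    # Intended for small attacks_needed; large values may overflow floats.
    p_too_few = sum(
        math.comb(num_tasks, j)
        * attack_prob ** j
        * (1.0 - attack_prob) ** (num_tasks - j)
        for j in range(attacks_needed)
    )
    return p_uncaught ** attacks_needed * (1.0 - p_too_few)


def sweep(param, values, **fixed):
    """Evaluate the red-team win probability for each value of one parameter."""
    return {v: red_win_probability(**{**fixed, param: v}) for v in values}


results = sweep(
    "audit_rate",
    [0.001, 0.01, 0.1],
    num_tasks=1_000_000,
    attack_prob=1e-5,
    attacks_needed=1,
)
for rate, prob in results.items():
    print(f"audit_rate={rate}: P(red wins) = {prob:.3f}")
```

In this toy model, raising the audit rate lowers the score threshold, so the red-team win probability falls monotonically along the sweep.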

Configuration

Red Team

Attack Probability: e.g., 0.001, 1e-5, 5e-7

Blue Team

Audit Rate: fraction of tasks audited (0.01 = 1%)

Simulation

Results

Configure parameters and click "Run Simulation" to see results