Time Window Blocking Simulation

What is this?

This interactive simulation accompanies our paper "Async Control: Stress-testing Asynchronous Control Measures for LLM Agents". It models a deployment in which an LLM agent processes a stream of tasks and may, on some of them, attempt a malicious attack while evading detection.

The Setup

In this model of asynchronous monitoring, a monitor reviews the agent's actions after they occur rather than blocking them in real time. The monitor assigns each episode a suspiciousness score, and human auditors review the most suspicious episodes first. If an attack is audited before it takes effect, the blue team wins; otherwise, if enough attacks succeed, the red team wins.
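
As a rough illustration of the win condition described above, the sketch below decides the outcome of one deployment from a list of attacks. It is not the simulation's actual code: the `Attack` record, the `outcome` function, the "undecided" result, and the choice to process events in time order are all assumptions made for this example.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Attack:
    """Hypothetical record of one attack attempt (times in hours)."""
    executed_at: float
    audited_at: Optional[float]  # None if no auditor ever reviews it


def outcome(attacks: Sequence[Attack], latency_hours: float, attacks_needed: int) -> str:
    """Return "blue", "red", or "undecided" for one deployment.

    Assumption: events are resolved in time order, so whichever happens
    first (an attack caught in time, or enough attacks taking effect)
    decides the game.
    """
    events = []
    for attack in attacks:
        takes_effect_at = attack.executed_at + latency_hours
        if attack.audited_at is not None and attack.audited_at < takes_effect_at:
            # Audited inside the latency window: the attack is caught.
            events.append((attack.audited_at, "caught"))
        else:
            # Audited too late or never: the attack takes effect.
            events.append((takes_effect_at, "succeeded"))

    succeeded = 0
    for _, kind in sorted(events):
        if kind == "caught":
            return "blue"
        succeeded += 1
        if succeeded >= attacks_needed:
            return "red"
    return "undecided"
```

For example, with a 2-hour latency, an attack executed at hour 0 and audited at hour 1 is caught in time, while the same attack audited at hour 3 has already taken effect.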

Key Parameters

Attack Probability: How often the agent attempts an attack (e.g., 1e-5 = 1 in 100,000 tasks)
Attack Latency: Hours between an attack's execution and the moment it takes effect; this is the window for detection
Attacks Needed: How many attacks must succeed undetected before the red team wins
Audit Rate: Fraction of tasks that human auditors can review (e.g., 0.01 = 1%)
Upfront Audits: Number of initial tasks audited exhaustively before switching to prioritized auditing
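
To give a feel for the scale these parameters imply, here is a minimal sketch that groups them into one record and computes two back-of-the-envelope quantities. The `SimulationParams` class, its field names, and the deployment size are hypothetical and are not taken from the simulation itself; only the example values 1e-5 and 0.01 come from the list above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SimulationParams:
    """Hypothetical container mirroring the parameters listed above."""
    attack_probability: float    # per-task chance that the agent attacks
    attack_latency_hours: float  # hours before an attack takes effect
    attacks_needed: int          # successful attacks required for a red win
    audit_rate: float            # fraction of tasks humans can review
    upfront_audits: int          # initial tasks audited exhaustively

    def expected_attacks(self, n_tasks: int) -> float:
        """Expected number of attack attempts over n_tasks tasks."""
        return n_tasks * self.attack_probability

    def audit_budget(self, n_tasks: int) -> int:
        """Tasks auditors can review, rounded to the nearest whole task."""
        return round(n_tasks * self.audit_rate)


# Illustrative values: latency, attacks_needed, upfront_audits, and the
# 1,000,000-task deployment size are made up for this example.
example = SimulationParams(
    attack_probability=1e-5,
    attack_latency_hours=24.0,
    attacks_needed=3,
    audit_rate=0.01,
    upfront_audits=1000,
)
n_tasks = 1_000_000
print(example.expected_attacks(n_tasks), example.audit_budget(n_tasks))
```

Under these assumed numbers, about ten attacks are expected across a million tasks, while auditors can review roughly ten thousand of them, which is why the prioritization by suspiciousness score matters.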

How to Use

Adjust the parameters and click "Run Simulation" to see the probability that the red team (attacker) wins. Enable "Sweep" on any parameter to test multiple values and see how safety changes across that dimension. The simulation uses empirical suspiciousness score distributions from round 3 of our red-blue team experiments.
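
The idea behind "Sweep" can be sketched as re-running a Monte Carlo estimate while varying one parameter and holding the rest fixed. The toy model below is a deliberately simplified stand-in, not the simulation's implementation: it replaces the empirical score distributions, auditing, and latency with a single hypothetical `catch_probability`, and all names and default values are assumptions.

```python
import random
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class ToyParams:
    attack_probability: float = 0.01  # per-task chance of an attack
    attacks_needed: int = 2           # successes required for a red win
    catch_probability: float = 0.5    # HYPOTHETICAL: chance an attack is caught in time
    n_tasks: int = 500                # HYPOTHETICAL deployment length


def red_wins_once(p: ToyParams, rng: random.Random) -> bool:
    """Simulate one deployment; True if the red team wins."""
    succeeded = 0
    for _ in range(p.n_tasks):
        if rng.random() >= p.attack_probability:
            continue  # benign task
        if rng.random() < p.catch_probability:
            return False  # attack caught in time: blue team wins
        succeeded += 1
        if succeeded >= p.attacks_needed:
            return True
    return False  # deployment ended without enough successful attacks


def estimate_red_win(p: ToyParams, trials: int = 1000, seed: int = 0) -> float:
    """Monte Carlo estimate of the red-team win probability."""
    rng = random.Random(seed)
    return sum(red_wins_once(p, rng) for _ in range(trials)) / trials


def sweep(base: ToyParams, name: str, values) -> dict:
    """Estimate the red-team win probability for each value of one parameter."""
    return {v: estimate_red_win(replace(base, **{name: v})) for v in values}


if __name__ == "__main__":
    for value, p_red in sweep(ToyParams(), "catch_probability", [0.1, 0.5, 0.9]).items():
        print(f"catch_probability={value}: P(red wins) ~ {p_red:.3f}")
```

Reusing the same seed for every swept value keeps the comparison between values less noisy, at the cost of correlated estimates.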
