# PaperBench Scoring Design
This document describes the architecture for scoring PaperBench submissions in a clean sandbox environment.
## Overview
The scorer consists of three components:

- Blacklist URL monitor - checks that the model didn't access any blacklisted URLs during solving (a minimal check is sketched after this list)
- Reproduction - runs the `reproduce.sh` script that the model generated in a fresh environment
- LLM judge - takes the reproduction results and scores them according to the rubric
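As a rough illustration of the first component, the blacklist check reduces to comparing every URL the agent accessed (however the harness records them) against a list of blocked prefixes. The snippet below is a standalone sketch of that comparison, not the actual `blacklist_url_monitor` scorer; the URLs in the example are hypothetical.

```python
def find_blacklist_violations(accessed_urls: list[str],
                              blacklisted_prefixes: list[str]) -> list[str]:
    """Return every accessed URL that starts with a blacklisted prefix."""
    return [
        url
        for url in accessed_urls
        if any(url.startswith(prefix) for prefix in blacklisted_prefixes)
    ]


# Hypothetical example: flag any access to the paper's own code repository.
violations = find_blacklist_violations(
    accessed_urls=[
        "https://pypi.org/simple/numpy/",
        "https://github.com/authors/paper-code",
    ],
    blacklisted_prefixes=["https://github.com/authors/paper-code"],
)
assert violations == ["https://github.com/authors/paper-code"]
```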
## Problem
The current implementation runs `reproduce.sh` in the same sandbox where the agent worked. This means the reproduction might accidentally depend on:
- System packages the agent installed globally
- Files created outside the submission directory
- Environment variables or state left behind
We want to verify that `reproduce.sh` works on a fresh machine with the same specs.
## Solution
Split scoring into two tasks:
- Task 1: Agent solves the paper, exports submission
- Task 2: Fresh sandbox loads submission, runs reproduction, checks blacklist, judges results
## Components
| Component | Type | Task | Purpose |
|---|---|---|---|
| `paperbench_solver` | Solver | Task 1 | Agent reproduces the paper |
| `save_submission` | Cleanup | Task 1 | Export tarball + JSON metadata |
| `load_submission` | Setup | Task 2 | Unpack tarball to sandbox |
| `reproduce` | Solver | Task 2 | Run `reproduce.sh` |
| `blacklist_url_monitor` | Scorer | Task 2 | Check agent didn't access blacklisted URLs |
| `paperbench_judge_scorer` | Scorer | Task 2 | LLM evaluates against rubric |
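To make the `reproduce` step concrete: at its core it executes the submission's `reproduce.sh` inside the fresh sandbox and captures the output for the judge. The snippet below is a framework-free sketch of that core logic; the path and timeout are illustrative, not the values used by the real solver.

```python
import subprocess
from pathlib import Path

# Illustrative values; the real solver gets these from task configuration.
SUBMISSION_DIR = Path("/submission")
REPRODUCE_TIMEOUT_SECONDS = 4 * 60 * 60


def run_reproduction(submission_dir: Path = SUBMISSION_DIR) -> subprocess.CompletedProcess:
    """Run reproduce.sh from the submission root and capture its output."""
    script = submission_dir / "reproduce.sh"
    if not script.exists():
        raise FileNotFoundError(f"No reproduce.sh found in {submission_dir}")
    return subprocess.run(
        ["bash", str(script)],
        cwd=submission_dir,        # run from the submission root
        capture_output=True,       # keep stdout/stderr for the judge
        text=True,
        timeout=REPRODUCE_TIMEOUT_SECONDS,
    )
```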
## Task Composition

```python
# Task 1: Agent + Export
Task(
    solver=paperbench_solver(),
    cleanup=save_submission(submissions_output_dir="./submissions"),
)

# Task 2: Load + Reproduce + Judge
Task(
    dataset=load_manifest_dataset("./submissions"),
    setup=load_submission(submissions_output_dir="./submissions"),
    solver=reproduce(),
    scorer=[
        blacklist_url_monitor(),
        paperbench_judge_scorer(),
    ],
)
```
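The `save_submission` cleanup is where the tarball and per-sample metadata get produced. A minimal sketch of that export logic is below, assuming a plain filesystem layout; the 10MB cutoff and file names mirror the metadata format described later, but the helper itself is illustrative rather than the actual implementation.

```python
import json
import tarfile
from pathlib import Path

MAX_FILE_BYTES = 10 * 1024 * 1024  # files above 10MB are excluded from the tarball


def export_submission(submission_dir: Path, output_dir: Path,
                      sample_id: str, paper_id: str) -> None:
    """Tar the submission (skipping oversized files) and write metadata JSON."""
    output_dir.mkdir(parents=True, exist_ok=True)
    excluded: list[str] = []

    with tarfile.open(output_dir / f"{sample_id}.tar.gz", "w:gz") as tar:
        for path in submission_dir.rglob("*"):
            rel = path.relative_to(submission_dir)
            if path.is_file() and path.stat().st_size > MAX_FILE_BYTES:
                excluded.append(str(rel))
                continue
            tar.add(path, arcname=str(rel), recursive=False)

    # One JSON file per sample: concurrency-safe and idempotent on retry.
    metadata = {"sample_id": sample_id, "paper_id": paper_id, "excluded_files": excluded}
    (output_dir / f"{sample_id}.json").write_text(json.dumps(metadata, indent=2))
```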
## Data Flow

```
          Task 1                                Task 2
┌─────────────────────────┐           ┌─────────────────────────┐
│ paperbench_solver       │           │ load_submission         │
│   Agent works on paper  │           │   Unpack tarball        │
│                         │           │                         │
│ save_submission         │  tarball  │ reproduce               │
│   Export submission ────┼───────────┼─► Run reproduce.sh      │
│   Write JSON metadata   │  + JSON   │                         │
└─────────────────────────┘           │ paperbench_judge_scorer │
                                      │   LLM evaluates rubric  │
                                      └─────────────────────────┘
```
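On the Task 2 side, `load_submission` only needs to unpack the exported tarball into the fresh sandbox before `reproduce.sh` runs. A bare-bones sketch, with an assumed destination path:

```python
import tarfile
from pathlib import Path


def unpack_submission(tarball: Path, destination: Path = Path("/submission")) -> Path:
    """Extract an exported submission tarball into the fresh sandbox."""
    destination.mkdir(parents=True, exist_ok=True)
    with tarfile.open(tarball, "r:gz") as tar:
        # filter="data" (Python 3.12+) guards against path traversal on extract.
        tar.extractall(destination, filter="data")
    return destination
```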
## Per-Sample Metadata Format
Task 1 outputs a JSON file for each sample, which Task 2 reads via `load_manifest_dataset`:

```
submissions/
  abc123.json
  abc123.tar.gz
  def456.json
  def456.tar.gz
```
Each JSON file contains a `SubmissionMetadata` object:

```json
{
  "sample_id": "abc123",
  "paper_id": "arxiv_2301_001",
  "excluded_files": ["large_model.bin", "data/huge.csv"]
}
```

`excluded_files`: files larger than 10MB that were excluded from the tarball.
This format is concurrency-safe (samples never write to a shared file) and retry-safe (a retried sample overwrites its own files instead of appending duplicates).
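To make the manifest loading concrete, here is a plain-Python sketch of the metadata type and of how `load_manifest_dataset` could enumerate the per-sample JSON files. The real implementation wraps each entry in the eval framework's sample/dataset types; the `load_manifest` helper name is illustrative.

```python
import json
from dataclasses import dataclass, field
from pathlib import Path


@dataclass
class SubmissionMetadata:
    sample_id: str
    paper_id: str
    # Files larger than 10MB that were left out of the tarball.
    excluded_files: list[str] = field(default_factory=list)

    @property
    def tarball_name(self) -> str:
        return f"{self.sample_id}.tar.gz"


def load_manifest(submissions_dir: Path) -> list[SubmissionMetadata]:
    """Read every per-sample JSON file in the submissions directory."""
    return [
        SubmissionMetadata(**json.loads(path.read_text()))
        for path in sorted(submissions_dir.glob("*.json"))
    ]
```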