ArxivRollBench
A rolling benchmark for evaluating reasoning over recent scientific text from arXiv papers. ArxivRollBench constructs multiple-choice sequencing, cloze, and next-fragment prediction tasks over newly released scientific papers across arXiv domains, with compact and full public releases.
Overview
⚠️ External evaluation. Code lives in an upstream repository. inspect_evals lists it for discoverability; review the upstream repo and pinned commit before running.
Source: liangzid/ArxivRoll@01ec6e1 · Listed by @liangzid
Usage
Installation
This is an externally-maintained evaluation. Clone the upstream repository at the pinned commit and install its dependencies:
git clone https://github.com/liangzid/ArxivRoll
cd ArxivRoll
git checkout 01ec6e132e1cccdc204bed8b20a8473bb82c3ca7
uv sync
Running evaluations
CLI
uv run inspect eval arxivrollbench_inspect.py@arxivrollbench --model openai/gpt-5-nano
Python
from inspect_ai import eval
from arxivrollbench_inspect import arxivrollbench
eval(arxivrollbench(), model="openai/gpt-5-nano")
View logs
uv run inspect view
More information
For the dataset, scorer, task parameters, and validation, see the upstream repo: liangzid/ArxivRoll.
Options
You can control a variety of options from the command line. For example:
uv run inspect eval arxivrollbench_inspect.py@arxivrollbench --limit 10
uv run inspect eval arxivrollbench_inspect.py@arxivrollbench --max-connections 10
uv run inspect eval arxivrollbench_inspect.py@arxivrollbench --temperature 0.5
See uv run inspect eval --help for all available options.
More command-line options: Inspect docs
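When several options need to be combined or varied across runs, it can help to assemble the command programmatically. The following is a minimal sketch, not part of Inspect or the upstream repository: build_eval_command is a hypothetical helper, and it assumes every option maps to a flag by replacing underscores with hyphens, which holds for the three flags shown above (--limit, --max-connections, --temperature).

```python
import shlex

# Task spec copied from the commands above.
TASK = "arxivrollbench_inspect.py@arxivrollbench"


def build_eval_command(task: str = TASK, **options) -> list[str]:
    """Build the argv for `uv run inspect eval` (hypothetical helper).

    Keyword names become flags by replacing underscores with hyphens,
    e.g. max_connections=10 becomes `--max-connections 10`.
    Options set to None are skipped.
    """
    argv = ["uv", "run", "inspect", "eval", task]
    for name, value in options.items():
        if value is None:
            continue
        argv += [f"--{name.replace('_', '-')}", str(value)]
    return argv


if __name__ == "__main__":
    cmd = build_eval_command(limit=10, max_connections=10, temperature=0.5)
    # Print a copy-pasteable shell command.
    print(shlex.join(cmd))
```

The returned list can be passed to subprocess.run(cmd, check=True) from inside the cloned ArxivRoll directory, so that arxivrollbench_inspect.py resolves relative to the working directory.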