Running a Specific Version of an Evaluation Task
How to install and run a specific historical version of an inspect_evals evaluation task.
This document explains how to run a specific version of an inspect_evals evaluation task. This is useful for reproducing prior results or maintaining comparability with earlier evaluation runs.
Each task carries a version in N-X format (e.g. 3-A), where N tracks comparability-breaking changes and X tracks interface changes. See TASK_VERSIONING.md for the full versioning policy.
The comparability version (N) is what appears as task_version in evaluation logs and on the Task object. When two logs share the same task_version, their results are comparable.
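For example, to check which comparability version produced an existing log, you can read it back from the log header. A minimal sketch, assuming inspect_ai's `read_eval_log` and a log file of your own (the path below is a placeholder):

```python
from inspect_ai.log import read_eval_log

# Placeholder path: point this at one of your own log files.
log = read_eval_log("logs/2026-02-20T12-00-00_mle-bench.eval")

# The task name and the comparability version (N) recorded for the run.
print(log.eval.task, log.eval.task_version)
```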
In this example, we run mle_bench at version 3-A, which predates the current version available on PyPI.
Setup
```bash
python -m venv .venv
source .venv/bin/activate
pip install "inspect-evals[mle_bench]"
```

Check the current task version
```python
from inspect_evals.mle_bench import mle_bench

task = mle_bench()
print(task.version)
print(task.metadata)
```

```
>>> 4
>>> {'full_task_version': '4-B', 'task_interface_version': 'B', 'task_comparability_version': 4, ...}
```

`task.version` returns the comparability version (N). The full N-X version and its components are in `task.metadata`.
Run at the current version
```python
from inspect_evals.mle_bench import mle_bench
from inspect_ai import eval as _eval

log = _eval(
    mle_bench(split="low.txt"),
    model="openai/gpt-4o",
)
```

| mle_bench_scorer | valid_submission | above_median | any_medal |
| --- | --- | --- | --- |
| accuracy | 0.909 | 0.591 | 0.318 |
| stderr | 0.061 | 0.105 | 0.099 |
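These metrics can also be read back from the returned log objects. A rough sketch, assuming the EvalLog structure exposed by inspect_ai (a results object holding per-scorer metrics); field names may differ across inspect_ai versions:

```python
# _eval returns a list of EvalLog objects; take the first (and only) one here.
results = log[0].results
if results is not None:
    for score in results.scores:
        # Collect each metric's value, e.g. accuracy and stderr per scorer.
        metrics = {name: metric.value for name, metric in score.metrics.items()}
        print(score.name, metrics)
```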
Find the relevant commit

If the desired version (e.g. 3-A) is not available on PyPI or the current tip of main, find the historical commit that contained it.
Via the eval’s changelog
Each eval’s README contains a changelog that documents version bumps with links to associated PRs.
The MLE-Bench changelog shows:
```markdown
### [4-B] - 2026-02-26

- Bump per-sample timeout to 6 minutes.
- Add configurable CPU and memory.
- Make dataset downloading lazy.

### [3-A] - 2026-02-19

- Bugfix: Skip the `the-icml-2013-whale-challenge-right-whale-redux` competition due to being unavailable.
```

To run at version 3-A, find the commit that introduced 4-B and use the commit before it.
Via git
```bash
git clone https://github.com/UKGovernmentBEIS/inspect_evals.git
cd inspect_evals
```

Post-migration task versions live in eval.yaml; older versions live in the Python task files. Search for when a version was introduced in eval.yaml:
```bash
git log \
  -S '"4-B"' \
  -p \
  -- src/inspect_evals/mle_bench/eval.yaml
```

Note: Versions before the N-X migration were set directly in Python task files using strings (e.g. "2.0.1"). The N-X migration landed on 2026-02-18 (see commit 3cbce9e). To find these older versions, search the task's Python files:

```bash
git log -S "2.0.1" -p -- src/inspect_evals/mle_bench/*.py
```
Read the git log to find the commit hash that contains the desired version.
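If you prefer to script this step, the parent of the commit that introduced 4-B is the last commit that still carried 3-A. A hedged sketch using a subprocess call to git (the hash below is a placeholder, not a real commit):

```python
import subprocess

# Placeholder: the commit that introduced version 4-B, as found in the git log above.
bump_commit = "abc1234"

# Its parent commit is the last one that still carried version 3-A.
last_3a_commit = subprocess.run(
    ["git", "rev-parse", f"{bump_commit}^"],
    capture_output=True,
    text=True,
    check=True,
).stdout.strip()

print(last_3a_commit)
```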
Via GitHub
Browse the commit history for the relevant files:
- MLE-Bench eval.yaml history
- MLE-Bench task history (this will be noisier, but is sometimes necessary for finding older versions).
Install inspect_evals at the specific commit
pip install "inspect-evals[mle_bench] @ git+https://github.com/UKGovernmentBEIS/inspect_evals@<commit_hash>"Confirm the task version
```python
from inspect_evals.mle_bench import mle_bench

task = mle_bench()
print(task.version)
print(task.metadata.get("full_task_version"))
```

```
>>> 3
>>> 3-A
```

This matches the desired version.
Run the eval
```python
from inspect_evals.mle_bench import mle_bench
from inspect_ai import eval as _eval

log = _eval(
    mle_bench(split="low.txt"),
    model="openai/gpt-4o",
)
```

| mle_bench_scorer | valid_submission | above_median | any_medal |
| --- | --- | --- | --- |
| accuracy | 0.909 | 0.455 | 0.182 |
| stderr | 0.061 | 0.106 | 0.082 |

The scores are lower at version 3-A than at 4-B. This is expected: version 4-B increased the per-sample timeout, giving the model more time to produce better submissions.
As always, compare results in light of stderr; differences that fall within it are likely noise.
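As a rough back-of-the-envelope illustration (not part of inspect_evals' tooling), using the above_median numbers from the two runs above:

```python
import math

# above_median accuracy and stderr from the two runs above.
acc_4b, se_4b = 0.591, 0.105  # current version (4-B)
acc_3a, se_3a = 0.455, 0.106  # historical version (3-A)

# Difference in accuracy and its combined standard error.
diff = acc_4b - acc_3a
se_diff = math.sqrt(se_4b**2 + se_3a**2)

# Here diff (about 0.14) is smaller than the combined stderr (about 0.15),
# so this particular gap is hard to distinguish from noise.
print(f"difference = {diff:.3f}, combined stderr = {se_diff:.3f}")
```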