Running a Specific Version of an Evaluation Task

How to install and run a specific historical version of an inspect_evals evaluation task.

This document explains how to run a specific version of an inspect_evals evaluation task. This is useful for reproducing prior results or maintaining comparability with earlier evaluation runs.

Each task carries a version in N-X format (e.g. 3-A), where N tracks comparability-breaking changes and X tracks interface changes. See TASK_VERSIONING.md for the full versioning policy.

The comparability version (N) is what appears as task_version in evaluation logs and on the Task object. When two logs share the same task_version, their results are comparable.
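
You can also check comparability across saved logs programmatically. A minimal sketch, assuming inspect_ai’s log API (read_eval_log and the task_version field on the log header); the log paths here are hypothetical:

from inspect_ai.log import read_eval_log

# Hypothetical paths to logs from two separate runs
log_a = read_eval_log("logs/run_a.eval")
log_b = read_eval_log("logs/run_b.eval")

# task_version in the log header is the comparability version (N)
print(log_a.eval.task_version == log_b.eval.task_version)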

In this example, we run mle_bench at version 3-A, which predates the current version available on PyPI.

Setup

python -m venv .venv
source .venv/bin/activate
pip install "inspect-evals[mle_bench]"
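
If you want to confirm which release of the package you installed (the PyPI package version is separate from each task’s N-X version), you can check the installed distribution:

pip show inspect-evals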

Check the current task version

from inspect_evals.mle_bench import mle_bench

task = mle_bench()
print(task.version)
print(task.metadata)
>>> 4
>>> {'full_task_version': '4-B', 'task_interface_version': 'B', 'task_comparability_version': 4, ...}

task.version returns the comparability version (N). The full N-X version and its components are in task.metadata.

Run at the current version

from inspect_evals.mle_bench import mle_bench
from inspect_ai import eval as _eval

log = _eval(
    mle_bench(split="low.txt"),
    model="openai/gpt-4o",
)
mle_bench_scorer    valid_submission    above_median    any_medal
accuracy            0.909               0.591           0.318
stderr              0.061               0.105           0.099
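
The table above is printed from the evaluation log. A sketch of reading the same information programmatically, assuming inspect_ai’s EvalLog structure (eval() returns a list of logs; metrics live under results.scores):

run = log[0]
print(run.eval.task_version)  # comparability version recorded for this run
if run.results is not None:
    for score in run.results.scores:
        print(score.name, {name: metric.value for name, metric in score.metrics.items()})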

Find the relevant commit

If the desired version (e.g. 3-A) is not available on PyPI or the current tip of main, find the historical commit that contained it.

Via the eval’s changelog

Each eval’s README contains a changelog that documents version bumps with links to associated PRs.

The MLE-Bench changelog shows:

### [4-B] - 2026-02-26

- Bump per-sample timeout to 6 minutes.
- Add configurable CPU and memory.
- Make dataset downloading lazy.

### [3-A] - 2026-02-19

- Bugfix: Skip the `the-icml-2013-whale-challenge-right-whale-redux` competition due to being unavailable.

To run at version 3-A, find the commit that introduced 4-B and use the commit before it.

Via git

git clone https://github.com/UKGovernmentBEIS/inspect_evals.git
cd inspect_evals

Post-migration task versions live in eval.yaml; older versions live in the Python task files. To find when a version was introduced in eval.yaml, search the file’s history:

git log \
    -S '"4-B"' \
    -p \
    -- src/inspect_evals/mle_bench/eval.yaml

Note: Versions before the N-X migration (which landed on 2026-02-18; see commit 3cbce9e) were set directly in Python task files as strings (e.g. "2.0.1"). To find these older versions, search the task’s Python files:

git log -S "2.0.1" -p -- src/inspect_evals/mle_bench/*.py

Read the git log to find the commit hash that contains the desired version.
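
The commit to install for 3-A is the parent of the commit that introduced 4-B. With that hash in hand (the placeholder below stands in for it), you can resolve the parent with:

# prints the hash of the last commit at version 3-A
git rev-parse "<commit_that_introduced_4-B>^"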

Via GitHub

Browse the commit history for the relevant files on GitHub, e.g. https://github.com/UKGovernmentBEIS/inspect_evals/commits/main/src/inspect_evals/mle_bench, and pick the commit immediately before the one that bumped the version.

Install inspect_evals at the specific commit

pip install "inspect-evals[mle_bench] @ git+https://github.com/UKGovernmentBEIS/inspect_evals@<commit_hash>"

Confirm the task version

from inspect_evals.mle_bench import mle_bench

task = mle_bench()
print(task.version)
print(task.metadata.get("full_task_version"))
>>> 3
>>> 3-A

This matches the desired version.

Run the eval

from inspect_evals.mle_bench import mle_bench
from inspect_ai import eval as _eval

log = _eval(
    mle_bench(split="low.txt"),
    model="openai/gpt-4o",
)
mle_bench_scorer    valid_submission    above_median    any_medal
accuracy            0.909               0.455           0.182
stderr              0.061               0.106           0.082

The scores are lower at version 3-A than at 4-B. This is expected: version 4-B increased the per-sample timeout, giving the model more time to produce better submissions.

As always, compare results in light of stderr; differences that fall within it are likely noise.
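
For example, a rough back-of-envelope check on the any_medal gap, treating the two runs as independent and combining their standard errors in quadrature:

# any_medal: 0.318 at 4-B vs 0.182 at 3-A
diff = 0.318 - 0.182                        # 0.136
combined_se = (0.099**2 + 0.082**2) ** 0.5  # ~0.129
print(diff / combined_se)                   # ~1.06 combined standard errors

At roughly one combined standard error, this particular gap is not strong evidence of a real difference on its own.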