Task Versioning

Inspect Evals contains > 200 Inspect AI Tasks.

Tasks evolve via bug fixes, refactors, dataset updates, or evaluation logic changes.

Changes that can break task invocations or affect the comparability of eval results require a task version bump.

Task version structure

The task version attribute must be set on every task in inspect_evals. See versioning-metadata for how this is set.

For all tasks in inspect evals, the version must conform to the scheme N-X.

N tracks changes that affect task comparability, signaling that evaluation results may not be comparable with previous versions. Integer 1, 2, 3, ....

X tracks changes that affect the Task interface. Upper case A, B, C, ....

Increments to N and X components never decrease, e.g. 1-C to 2-C, but never 1-C to 2-A.

Since N and X are independent, a change can affect both simultaneously, e.g. 1-C to 2-D.

Versions for new evals should be set as 1-A.

Example

Version information is set on the eval.yaml in the associated eval folder:

title: 'MASK: Disentangling Honesty from Accuracy in AI Systems'
description: |
  Evaluates honesty in large language models by testing whether they contradict their own beliefs when pressured to lie.
arxiv: https://arxiv.org/abs/2503.03750
group: Safeguards
contributors:
- 'Scott-Simmons'
version: "2-A"
tasks:
- name: mask
  dataset_samples: 1000

When to bump the task version

Bump N when

A task changes in a way that breaks fair evaluation comparisons between versions of that task.

Bump X when

A task’s interface changes in a way that could break compatibility with users’ workflows.

Examples of when to bump the task version

N –> N+1: Changes to a Task’s comparability

Bump a task version whenever a change could make results non-comparable for the same model across versions, e.g. 3-D goes to 4-D.

Examples include:

  • Bug fixes that affect evaluation results
  • Changes to scoring logic
  • Dataset changes
  • Parameter changes that influence model behavior or scoring
  • Prompt or template changes

X –> Y (next letter): Incompatible changes to a Task’s interface

Backward or forward-incompatible Task interface changes require a version bump, e.g. 3-D goes to 3-E.

Backward-incompatible change

Changes that break existing invocations.

Before:

@task
def example_task(
    max_toks: int,
    ...,
) -> Task:
    ...

After:

@task
def example_task(
    max_tokens: int,
    ...,
) -> Task:
    ...

A version bump is required because old invocations using max_toks will fail after the change is applied.

Forward-incompatible change

Changes that old versions cannot express.

Before:

@task
def example_task(
    ...,
) -> Task:
    dataset = get_dataset(shuffle=False)
    ...

After:

@task
def example_task(
    ...,
    shuffle: bool = False,
) -> Task:
    dataset = get_dataset(shuffle=shuffle)
    ...

A version bump is required because old versions would fail when explicitly invoked with the shuffle parameter (i.e. -T shuffle=… will fail).

Bumping the Changelog

Once an eval task has a version bump, the associated README requires a changelog that tracks increments in task versions. Example.

If the changelog section already exists, add an entry for the new version.

Evals with > 1 task

Evals can sometimes be associated with multiple tasks, such as fortress (task 1: fortress_adversarial; task 2: fortress_benign).

They share a single changelog, i.e. version bumps are applied together. Example.