Versioning and Changelog Policy

This document describes the versioning scheme and changelog management process for Inspect Evals.

Overview

Inspect Evals uses Semantic Versioning (SemVer) for releases and maintains a changelog following the Keep a Changelog convention.

We use scriv to manage the changelog.

For Contributors

When submitting a PR that includes a user-facing change:

uv run scriv create

Fill out the generated file with details of the change and commit it.

Fragment Format

Each bullet in a changelog fragment must follow this format:

- Name: Description of the change.
- Name (vN-X): Description of the change.

Eval name: Always use the eval’s print name (the human-readable title from eval.yaml, e.g. “MLE-Bench”, “NIAH”, “GDM Intercode CTF”), not the Python module name. The changelog is a user-facing document, so gdm_intercode_ctf should be written as “GDM Intercode CTF”.

Task version: Include the new version in parentheses after the name only when the change bumps the task version. Omit it for changes that don’t bump the version (e.g. doc-only or infra changes).

New Evals vs Existing Evals: Use the same format for both categories — the only difference is the heading. New evals don’t need a version annotation (the initial version is always 1-A).

Examples

### New Evals

- SWE-Lancer: New eval for freelance software engineering tasks.

### Existing Evals

- StrongREJECT (v2-B): Fix judge model resolution for grader role fallback.
- MLE-Bench: Freeze upstream dependency.

### Other

- Fix broken reference to AGENTS.md in CONTRIBUTING.md.

For Maintainers

When preparing a release:

uv run scriv collect --version X.Y.Z

Commit the updated *md files.

Tag the commit with the new version.

Create a github release from the tag.

Versioning Scheme

Use the MAJOR.MINOR.PATCH versioning format as per overview.

Version Component When to Bump Examples
Major Breaking changes Removing an eval, API changes, scorer output format changes
Minor New features Adding new evals, new task parameters, new utilities
Patch Bug fixes Eval fixes, scorer fixes, dataset loading fixes