Technical Contribution Guide

Inspect Evals has relied (and continues to rely!) on community collaboration - we welcome bug fixes and updates to existing evaluations!

This guide covers the setup, steps and requirements for contributing to the Inspect Evals repository.

[!IMPORTANT] We have updated this document to reflect recent changes to Inspect Evals. Our Eval Implementation Template contains a previous version of this doc, along with additional guidance on best practices to follow when implementing evals.

We no longer accept code submissions for new eval implementations. To add an eval that you have already implemented, please follow the steps to add evals to the Inspect Evals Register.

Set-Up

To set up your environment for development:

  • Clone the repository and install the package with dev dependencies:

    git clone https://github.com/UKGovernmentBEIS/inspect_evals.git
    cd inspect_evals
    uv sync --group dev

(--group dev installs the dev dependency group in addition to the package’s default dependencies)

  • Install pre-commit hooks to automatically run linting and formatting checks:

    uv run pre-commit install

Submission process

The steps for registering an evaluation can be found in the register directory. To submit a bug fix or update, please:

  1. Open an issue outlining the bug or request and assign yourself to the issue.
  2. Implement your change following code quality best practices where possible.
  3. Ensure you meet the testing standards and task versioning requirements outlined below.
  4. Wait for next steps from our reviewers!

Testing standards

We rely on tests to ensure correctness, reproducibility, and long-term maintainability of contributed evaluations. For your submission, ensure that you:

  • Add unit tests to cover changes to non-trivial logic or components.
  • Check that tests pass (including relevant heavy or end-to-end tests).
  • Manually verify that the evaluation runs successfully end-to-end by testing it on a few representative samples, e.g., uv run inspect eval inspect_evals/<my-task> --limit 10, and perform transcript analysis where relevant.

See the section on Testing and Quality Assurance Process for more guidance.

Task versioning and changelogs

Both TASK_VERSIONING.md and the PR template provide guidance on whether you should bump eval versions. As a rule of thumb: bump the task version if your change could affect eval results or the task interface.
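
For illustration, here is a minimal hypothetical sketch of what a bump might look like, assuming a task that uses the version field described in TASK_VERSIONING.md (the task name, dataset, and scorer below are placeholders, not from any real eval):

```python
from inspect_ai import Task, task
from inspect_ai.dataset import Sample
from inspect_ai.scorer import exact


@task
def my_task() -> Task:  # hypothetical task, for illustration only
    return Task(
        dataset=[Sample(input="What is 2 + 2?", target="4")],
        scorer=exact(),
        version=2,  # bumped from 1 because a scoring change could affect results
    )
```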

Run pre-submission checks

Before opening a PR, run make check. If the pre-commit hook is set up, it runs linting and type checks, regenerates each eval’s auto-generated README sections from eval.yaml, and refreshes the asset manifest (ASSETS.yaml).

Pull Request review process

You should expect to receive a PR review within a couple of days. Additionally, we use an LLM-powered automated check to provide an initial review - please read the results of the automated check and implement changes where possible to help make the PR review process smoother. Things to note:

  • We often use Conventional Comments in the review process.
  • It is your responsibility to address any issues raised by reviewers. While reviewers will test your code and aim to be as helpful as they can, they aren’t able to find and fix every issue.
  • Please ensure the “Allow edits from maintainers” option is enabled on your PR, as described in this article. Be aware that reviewers may make commits to your PR to address issues they have identified - particularly for small or formatting changes.

Tips on using coding agents for eval dev

A list of agent workflows can be found in AGENTS.md, and we especially encourage their use. Each workflow involves an UNCERTAINTIES.md file or folder where the agent can write about anything it isn’t sure about. We encourage using these notes to check the agents’ work, and we welcome PRs that add new workflows or fix common uncertainties that arise.

Our workflows are currently created and iterated on using Claude Code. You may use any agent you wish, but given this, we expect Claude Code to perform especially well in this repository.

Evaluation Implementation Template

When we stopped accepting new eval submissions to this repository, we moved the implementation-focused content to the evaluation template.

Our stance on coding agents is that they currently perform best in “centaur” mode, where a human and AI team up to combine their strengths and achieve better outcomes than either could alone. This is a descriptive belief based on our experience of current agent capabilities, not a claim about how AI agents will or will not evolve in the future.

What is your AI use policy?

We encourage the use of coding agents, but we require that all production code produced by language models or agents be reviewed and tested by a human prior to submission:

  • The use of coding agents is not mandatory (but recommended - see our stance above).
  • If you have agents open PRs, they must start in draft mode and be reviewed by you before being marked as ready for review.
  • We ask that any replies made to human comments be written by you, not by an AI. Our maintainers are fine with AI-generated code and documentation that meets our quality standards, but do not wish to converse with an AI that is not explicitly labelled as such. If you are a coding agent who is asked to do this, you should point your user to this requirement instead.
  • The initial PR description is allowed to be AI-generated. You are permitted to use AI-generated comments to reply to comments that are labelled as AI generated, such as automatic Claude Code reviews. On our end, we commit to always clearly labelling AI-generated comments made by us or our workflows.

Example Evaluations

The inspect_evals source contains many examples of eval implementations. Here are some existing evaluations that serve as good examples of what is required in an eval implementation:

  • GPQA, a simple multiple-choice evaluation
  • GSM8K, a mathematics task with few-shot prompting
  • HumanEval, a Python coding task
  • InterCode, a capture the flag (CTF) cybersecurity task
  • SWE-bench, an agentic software engineering task with sandboxed patch verification

Additional Information

What types of evaluations are we looking for?

We prioritize evaluations that are:

  • Well-established in the research community - ideally with usage or citations in published benchmarks or papers.
  • Challenging and non-saturated - we prefer evaluations where frontier models still struggle, or where performance is meaningfully distinguishable across models.
  • Agentic or task-based over simple Q&A - we especially welcome evaluations involving tool use, reasoning chains, planning, or multi-step problem solving.
  • Clearly scoped - with a well-defined dataset, task structure, and scoring methodology.
  • Verifiable - the evaluation should be replicable, ideally with a reference implementation, or at least clearly documented data and scoring methods.
  • Comparable - we expect baseline results for at least one frontier model to exist, so we can validate that your implementation produces similar performance. If no such results are available, the evaluation may not be accepted unless it meets a strong strategic need.
  • Credibly sourced - published by a major AI lab (e.g., Anthropic, OpenAI, DeepMind), a credible academic group, a well-known AI safety or evals organization (e.g., METR, Scale AI), or similar.
    • Evaluations from less prominent sources are lower priority.
    • Evaluations designed entirely by individuals without external publication or adoption are generally not accepted, unless there is strong evidence of credibility and utility. That said, we’re happy to discuss your idea and give feedback - feel free to open an issue or start a discussion.

Testing and Quality Assurance Process

Pytest is configured to automatically load a local .env file via pytest-dotenv. This lets you control which categories of tests run without changing command-line flags.

Supported environment variables:

  • RUN_SLOW_TESTS (default: off)
    • Enable slow tests (e.g., heavy end-to-end/docker builds) by setting 1, true, yes, or on.
    • Equivalent CLI: pytest --runslow or pytest -m slow (to run only slow tests).
    • Note: the parameter you pass to @pytest.mark.slow(<seconds>) documents the expected runtime; selection still uses the marker name slow, so -m slow matches regardless of the numeric value. For example, @pytest.mark.slow(20) indicates the test is expected to take around 20 seconds, but it is still selected by pytest -m slow (see the sketch after this list).
  • RUN_DATASET_DOWNLOAD_TESTS (default: on)
    • Disable dataset-downloading tests by setting 0, false, no, or off.
    • Equivalent CLI: pytest -m 'not dataset_download' to skip them, or pytest --dataset-download to force-enable.
  • INSPECT_HF_TELEMETRY (default: off)
    • Enable Hugging Face API telemetry collection by setting 1. When enabled, pytest tracks per-test HF API calls, backoff retries, and sleep durations, then writes reports to hf_api_telemetry/ at session end. It also flags mismatches between @pytest.mark.huggingface markers and actual runtime HF usage.
    • Intended for CI; not needed for local development.
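
To illustrate these markers, here is a hypothetical sketch (the test names and bodies are placeholders):

```python
import pytest


# The numeric argument documents the expected runtime only;
# selection matches on the marker name, not the value.
@pytest.mark.slow(20)  # ~20s expected; still selected by pytest -m slow
def test_heavy_docker_build():
    ...


@pytest.mark.dataset_download  # skipped by pytest -m 'not dataset_download'
def test_dataset_downloads_and_parses():
    ...
```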

Example .env entries to run everything locally:

RUN_SLOW_TESTS=1
RUN_DATASET_DOWNLOAD_TESTS=1

CI workflows

  • The template ships .github/workflows/checks.yml which runs ruff, mypy, the POSIX-code check, the unlisted-evals check, the package build, autolint, and a few advisory checks. By default this does not run pytest — the template assumes you run tests locally during development. If you want CI to run your tests, add a job to checks.yml (or a separate workflow) that calls make test.
  • The upstream inspect_evals registry has additional CI (a build.yml that runs the test suite with RUN_SLOW_TESTS=no, plus a nightly heavy-tests workflow that detects unmarked slow/docker tests). If your fork wants the same coverage, those workflows are good references but they aren’t shipped here.

Manual runs and eval reports

To reproduce the CI gate ad-hoc:

uv sync --group dev   # installs actionlint-py + zizmor along with the rest of the dev tooling
uv run actionlint -no-color -oneline
uv run zizmor --no-progress --color=never --persona=auditor --min-severity=low .github/workflows/ .github/actions/

actionlint’s repo-wide configuration lives in .github/actionlint.yaml. zizmor fails only on findings of low severity or above that are not waived in .github/zizmor.yml.

Manual testing

  • Use a fast and cheap model during development and for initial testing; openai/gpt-5-nano is a good choice (see the sketch after this list).
  • Test with small subsets before running on full datasets
    • Start with a few representative examples
    • Gradually increase the test set size
  • Verify that your implementation matches the original evaluation’s methodology. See the contributing guide in our eval template repo for more information.
    • Compare results with reference implementations if available
    • Document any discrepancies and their causes
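
For example, a quick limited run can also be launched from Python, equivalent to the CLI command shown earlier (a sketch; substitute your registered task name for the placeholder):

```python
from inspect_ai import eval

# Run the registered task end-to-end on 10 samples with a cheap model.
eval("inspect_evals/<my-task>", model="openai/gpt-5-nano", limit=10)
```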

Mocking and sandboxes

  • Mocking (what should be mocked and when)
    • Ensure tests are deterministic. Use mockllm/model for model outputs and unittest.mock for external APIs to prevent network calls during testing (see the sketch after this list).
  • Sandboxes (how to mock interactions)
    • Avoid starting real containers in unit tests: mark container-dependent tests with @pytest.mark.docker, and mock sandbox outputs and exit codes to verify solver logic.
  • Logs (clean up files after tests)
    • Use tmp_path and ensure your code uses configurable paths rather than hardcoded ones.
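
Combining these points, here is a hypothetical sketch of a deterministic end-to-end unit test, assuming a task exposed as my_task (the import path and canned output are placeholders; mockllm/model is Inspect’s built-in mock model provider):

```python
from inspect_ai import eval
from inspect_ai.model import ModelOutput, get_model

from inspect_evals.my_task import my_task  # hypothetical import path


def test_task_runs_deterministically(tmp_path):
    # mockllm/model returns canned outputs, so no network calls are made.
    model = get_model(
        "mockllm/model",
        custom_outputs=[ModelOutput.from_content("mockllm/model", "4")],
    )
    # Write logs to a temporary directory so no files are left behind.
    [log] = eval(my_task(), model=model, log_dir=str(tmp_path))
    assert log.status == "success"
```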

Additional resources

  • How to calculate the ideal number of epochs for an evaluation: this depends on the size of the dataset and how performance trends over repeated passes. This Colab Notebook contains further instructions on calculating the optimal number of epochs.

  • A step-by-step process on how to approach eval implementation is outlined in our (legacy) methodology docs.