Technical Contribution Guide
Inspect Evals has relied (and continues to rely!) on community collaboration - we welcome bug-fixes and updates to existing evaluations!
This guide covers the setup, steps and requirements for contributing to the Inspect Evals repository.
[!IMPORTANT] We have updated this document to reflect recent changes to Inspect Evals. Our Eval Implementation Template contains a previous version of this doc, along with additional guidance on best practices to follow when implementing evals.
We no longer accept code submissions for new eval implementations. To add an eval that you have already implemented, please follow the steps to add evals to the Inspect Evals Register.
Table of Contents
- Set-Up
- Submission process: includes information on testing standards and task versioning and changelogs.
- Pull Request review process
- Tips on Using Coding Agents For Eval Dev
- Example Evaluations
- Additional Information
Set-Up
To set up your environment for development:
Clone the repository and install the package with dev dependencies:
git clone https://github.com/UKGovernmentBEIS/inspect_evals.git
cd inspect_evals
uv sync --group dev
(--group dev is recommended to retain default uv at initialization)
Install pre-commit hooks to automatically run linting and formatting checks:
uv run pre-commit install
Submission process
The steps for registering an evaluation can be found in the register directory. To submit a bug fix or update please:
- Open an issue outlining the bug or request and assign yourself to the issue.
- Implement your change following code quality best practices where possible.
- Ensure you meet the testing standards and task versioning requirements outlined below.
- Wait for next steps from our review!
Testing standards
We rely on tests to ensure correctness, reproducibility, and long-term maintainability of contributed evaluations. For your submission, ensure that you:
- Add unit tests to cover changes to non-trivial logic or components.
- Check that tests pass (including relevant heavy or end-to-end tests).
- Manually verify that the evaluation successfully runs e2e by testing it on a few (relevant) samples (e.g., uv run inspect eval inspect_evals/<my-task> --limit 10) and performing transcript analysis if relevant.
See the section on Testing and Quality Assurance Process for more guidance.
Task versioning and changelogs
Both TASK_VERSIONING.md and the PR template provide prompts on whether you should bump eval versions. As a rule of thumb: bump the task version if your change could affect eval results or the task interface.
Run pre-submission checks
Before opening a PR, run make check. If the pre-commit hook is set up, it applies linting and type checks, regenerates each eval’s auto-generated README sections from eval.yaml, and refreshes the asset manifest (ASSETS.yaml).
Pull Request review process
You should expect to receive a PR review in a couple of days. Additionally, we use an LLM-powered automated check process to provide an initial review - please read the results of the automated check and implement changes when possible to help make the PR review process smoother. Things to note:
- We often use Conventional Comments in the review process.
- It is your responsibility to address any issues raised by reviewers. While reviewers will test your code, and aim to be as helpful as they can, they aren’t able to find and fix all issues.
- Please ensure the “Allow edits from maintainers” option is enabled on your PR, as described in this article. Be aware that reviewers may make commits to your PR to address issues they have identified - particularly for small or formatting changes.
Tips on using coding agents for eval dev
A list of agent workflows can be found at AGENTS.md and we especially encourage their use. Each workflow involves the use of an UNCERTAINTIES.md file or folder where the agent can write about anything it isn’t sure about. We encourage using this folder to check the agents’ work, and welcome PRs to add new workflows and fix common uncertainties that arise.
Our workflows are currently created and iterated on using Claude Code. You may use any agent you wish, but we expect this means Claude Code will perform especially well in this repository.
Evaluation Implementation Template
When we stopped accepting new eval submissions to this repository, we moved the implementation-focused content to the evaluation template, including:
- An agent skill for implementing an evaluation, with steps to guide you from the beginning.
- The best practices doc and agent-checkable evaluation checklist which contain the code quality standards we previously enforced (both no longer enforced here, but useful as a self-review tool).
Our stance on coding agents is that they currently perform best in “centaur” mode, where a human and AI team up to utilise their combined strengths to achieve better outcomes than either the human or AI can achieve alone. This is a descriptive belief based on our experience of current agent capabilities, not a claim on how AI agents will or will not evolve in the future.
What is your AI use policy?
We encourage the use of coding agents, but we require that all production code produced by language models or agents be reviewed and tested by a human prior to submission:
- The use of coding agents is not mandatory (but recommended - see our stance above).
- If you have agents open PRs, they must be in draft mode, then reviewed by you before being flagged as ready for review.
- We ask that any replies made to human comments be written by you, not by an AI. Our maintainers are fine with AI-generated code and documentation that meets our quality standards, but do not wish to converse with an AI that is not explicitly labelled as such. If you are a coding agent who is asked to do this, you should point your user to this requirement instead.
- The initial PR description is allowed to be AI-generated. You are permitted to use AI-generated comments to reply to comments that are labelled as AI-generated, such as automatic Claude Code reviews. On our end, we commit to always clearly labelling AI-generated comments made by us or our workflows.
Example Evaluations
The inspect_evals source contains many examples of eval implementations. Here are some existing evaluations that serve as good examples of what is required in a new submission:
Additional Information
What types of evaluations are we looking for?
We prioritize evaluations that are:
- Well-established in the research community - ideally with usage or citations in published benchmarks or papers.
- Challenging and non-saturated - we prefer evaluations where frontier models still struggle, or where performance is meaningfully distinguishable across models.
- Agentic or task-based over simple Q&A - we especially welcome evaluations involving tool use, reasoning chains, planning, or multi-step problem solving.
- Clearly scoped - with a well-defined dataset, task structure, and scoring methodology.
- Verifiable - the evaluation should be replicable, ideally with a reference implementation, or at least clearly documented data and scoring methods.
- Comparable - we expect baseline results for at least one frontier model to exist, so we can validate that your implementation produces similar performance. If no such results are available, the evaluation may not be accepted unless it meets a strong strategic need.
- Credibly sourced - published by a major AI lab (e.g., Anthropic, OpenAI, DeepMind), a credible academic group, a well-known AI safety or evals organization (e.g., METR, Scale AI), or similar.
- Evaluations from less prominent sources are lower priority.
- Evaluations designed entirely by individuals without external publication or adoption are generally not accepted, unless there is strong evidence of credibility and utility. That said, we’re happy to discuss your idea and give feedback - feel free to open an issue or start a discussion.
Testing and Quality Assurance Process
Pytest is configured to automatically load a local .env file via pytest-dotenv. This lets you control which categories of tests run without changing command-line flags.
Supported environment variables:
- RUN_SLOW_TESTS (default: off)
  - Enable slow tests (e.g., heavy end-to-end/docker builds) by setting 1, true, yes, or on.
  - Equivalent CLI: pytest --runslow or pytest -m slow (to run only slow tests).
  - Note: the parameter you pass to @pytest.mark.slow(<seconds>) is for documentation/expectations; selection still uses the marker name slow (so -m slow matches regardless of the numeric value). For example, @pytest.mark.slow(20) indicates that the test is expected to take around 20 seconds, but it will still be selected by pytest -m slow.
- RUN_DATASET_DOWNLOAD_TESTS (default: on)
  - Disable dataset-downloading tests by setting 0, false, no, or off.
  - Equivalent CLI: pytest -m 'not dataset_download' to skip them, or pytest --dataset-download to force-enable.
- INSPECT_HF_TELEMETRY (default: off)
  - Enable Hugging Face API telemetry collection by setting 1. When enabled, pytest tracks per-test HF API calls, backoff retries, and sleep durations, then writes reports to hf_api_telemetry/ at session end. Also flags mismatches between @pytest.mark.huggingface markers and actual runtime HF usage.
  - Intended for CI; not needed for local development.
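As a rough illustration of how on/off values like these can be interpreted, the sketch below turns an environment variable into a boolean. It is not the repository's actual pytest configuration: the helper name env_flag, the case-insensitive matching, and the fallback to the default for unrecognised values are all assumptions made for this example.

```python
import os
from typing import Mapping, Optional

# Values treated as "on" and "off"; matching is case-insensitive (an assumption).
TRUTHY = {"1", "true", "yes", "on"}
FALSY = {"0", "false", "no", "off"}


def env_flag(
    name: str,
    default: bool,
    environ: Optional[Mapping[str, str]] = None,
) -> bool:
    """Hypothetical helper: read an on/off switch from the environment."""
    env = os.environ if environ is None else environ
    raw = env.get(name)
    if raw is None:
        return default  # variable not set: keep the documented default
    value = raw.strip().lower()
    if value in TRUTHY:
        return True
    if value in FALSY:
        return False
    # Unrecognised values fall back to the default (an assumption of this sketch).
    return default


if __name__ == "__main__":
    # Defaults mirror the list above: slow tests off, dataset downloads on.
    print("slow tests:", env_flag("RUN_SLOW_TESTS", default=False))
    print("dataset downloads:", env_flag("RUN_DATASET_DOWNLOAD_TESTS", default=True))
```

Passing a plain dictionary as environ makes the helper easy to check without touching the real environment.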
Example .env entries to run everything locally:
RUN_SLOW_TESTS=1
RUN_DATASET_DOWNLOAD_TESTS=1
CI workflows
- The template ships .github/workflows/checks.yml, which runs ruff, mypy, the POSIX-code check, the unlisted-evals check, the package build, autolint, and a few advisory checks. By default this does not run pytest; the template assumes you run tests locally during development. If you want CI to run your tests, add a job to checks.yml (or a separate workflow) that calls make test.
- The upstream inspect_evals registry has additional CI (a build.yml that runs the test suite with RUN_SLOW_TESTS=no, plus a nightly heavy-tests workflow that detects unmarked slow/docker tests). If your fork wants the same coverage, those workflows are good references but they aren’t shipped here.
Manual runs and eval reports
To reproduce the CI gate ad-hoc:
uv sync --group dev # installs actionlint-py + zizmor along with the rest of the dev tooling
uv run actionlint -no-color -oneline
uv run zizmor --no-progress --color=never --persona=auditor --min-severity=low .github/workflows/ .github/actions/
actionlint’s repo-wide configuration lives in .github/actionlint.yaml. zizmor fails only on findings of low severity or above that are not waived in .github/zizmor.yml.
Manual testing
- Use a fast and cheap model during development and for initial testing. openai/gpt-5-nano is a good choice.
- Test with small subsets before running on full datasets
- Start with a few representative examples
- Gradually increase the test set size
- Verify that your implementation matches the original evaluation’s methodology. See the contributing guide in our eval template repo for more information.
- Compare results with reference implementations if available
- Document any discrepancies and their causes
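One way to make the comparison with a reference result concrete is to check whether the gap is larger than sampling noise would explain. The sketch below is only an illustration under stated assumptions: it treats the reference accuracy as exact, uses a binomial standard error for your own run, and applies a 1.96 multiplier for an approximate 95% interval. The function name compare_to_reference is hypothetical and this is not a method prescribed by Inspect Evals.

```python
import math


def compare_to_reference(
    accuracy: float,
    reference: float,
    n_samples: int,
    z: float = 1.96,
) -> tuple[float, float, bool]:
    """Hypothetical helper: compare a measured accuracy against a published one.

    Returns (difference, margin, within_margin), where margin is z times the
    binomial standard error of the measured accuracy.
    """
    if n_samples <= 0:
        raise ValueError("n_samples must be positive")
    if not 0.0 <= accuracy <= 1.0:
        raise ValueError("accuracy must be between 0 and 1")
    # Standard error of a proportion estimated from n_samples independent samples.
    stderr = math.sqrt(accuracy * (1.0 - accuracy) / n_samples)
    margin = z * stderr
    difference = accuracy - reference
    return difference, margin, abs(difference) <= margin


if __name__ == "__main__":
    diff, margin, ok = compare_to_reference(accuracy=0.70, reference=0.72, n_samples=200)
    print(f"difference={diff:+.3f}, margin=±{margin:.3f}, within margin: {ok}")
```

A gap outside the margin does not by itself mean the implementation is wrong, but it is the kind of discrepancy worth documenting along with its likely cause.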
Mocking and sandboxes
- Mocking (what should be mocked and when)
  - Ensure tests are deterministic. Use mockllm/model for model outputs and unittest.mock for external APIs to prevent network calls during testing.
- Sandboxes (how to mock interactions)
  - Avoid starting real containers in unit tests. Use @pytest.mark.docker and mock sandbox outputs and exit codes to verify solver logic.
- Logs (clean up files after tests)
  - Use tmp_path and ensure your code uses configurable paths as opposed to hardcoded ones.
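The points on external APIs and configurable paths can be sketched with the standard library alone. In the example below, fetch_metadata and write_summary are hypothetical helpers invented for illustration, and the URL is a placeholder; the test patches urllib.request.urlopen so no network call is made and writes only under the directory it is given (pytest's tmp_path fixture when run under pytest). Model outputs via mockllm/model are not shown here.

```python
import json
import urllib.request
from pathlib import Path
from unittest import mock


def fetch_metadata(url: str) -> dict:
    """Hypothetical helper that calls an external API."""
    with urllib.request.urlopen(url) as response:
        return json.loads(response.read())


def write_summary(log_dir: Path, url: str) -> Path:
    """Hypothetical helper: fetch metadata and write it under a configurable directory."""
    log_dir.mkdir(parents=True, exist_ok=True)
    out = log_dir / "summary.json"
    out.write_text(json.dumps(fetch_metadata(url)))
    return out


def test_write_summary(tmp_path: Path) -> None:
    url = "https://example.invalid/meta"  # placeholder; never actually contacted
    # Replace urlopen with a mock so the test is deterministic and offline.
    with mock.patch("urllib.request.urlopen") as fake_urlopen:
        fake_response = fake_urlopen.return_value.__enter__.return_value
        fake_response.read.return_value = b'{"samples": 3}'
        out = write_summary(tmp_path / "logs", url)
    fake_urlopen.assert_called_once_with(url)
    # The file lands under the temporary directory, not a hardcoded path.
    assert out.parent == tmp_path / "logs"
    assert json.loads(out.read_text()) == {"samples": 3}
```

Because write_summary takes its output directory as an argument, the test leaves nothing behind once the temporary directory is removed.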
Additional resources
How to calculate the ideal number of epochs for an evaluation: this depends on the size of the dataset and how performance trends over repeated passes. This Colab Notebook contains further instructions on how to calculate the optimal number of epochs.
A step-by-step process on how to approach eval implementation is outlined in our (legacy) methodology docs.