Technical Contribution Guide
This guide covers the technical requirements, standards, and processes for contributing evaluations to the Inspect Evals repository. For strategic guidance on benchmark implementation methodology, see docs/methodology.md.
We welcome new evaluations and improvements to existing evaluations. You can contribute by following the technical setup and submission requirements outlined below.
What types of evaluations are we looking for?
We prioritize evaluations that are:
- Well-established in the research community — ideally with usage or citations in published benchmarks or papers.
- Credibly sourced — published by a major AI lab (e.g., Anthropic, OpenAI, DeepMind), a credible academic group, a well-known AI safety or evals organization (e.g., METR, Scale AI), or similar.
  - Evaluations from less prominent sources are lower priority.
  - Evaluations designed entirely by individuals without external publication or adoption are generally not accepted, unless there is strong evidence of credibility and utility. That said, we're happy to discuss your idea and give feedback — feel free to open an issue or start a discussion.
- Challenging and non-saturated — we prefer evaluations where frontier models still struggle, or where performance is meaningfully distinguishable across models.
- Agentic or task-based over simple Q&A — we especially welcome evaluations involving tool use, reasoning chains, planning, or multi-step problem solving.
- Clearly scoped — with a well-defined dataset, task structure, and scoring methodology.
- Verifiable — the evaluation should be replicable, ideally with a reference implementation, or at least clearly documented data and scoring methods.
- Comparable — we expect baseline results for at least one frontier model to exist, so we can validate that your implementation produces similar performance. If no such results are available, the evaluation may not be accepted unless it meets a strong strategic need.
If you’re unsure whether your evaluation idea meets these criteria, feel free to open a GitHub issue for discussion. We’re happy to help!
If you’ve come across an evaluation that fits our priorities but don’t have time to implement it yourself, please still raise an issue! Highlighting useful evaluations for others to pick up is a valuable way to contribute.
Once there’s an issue raised for the evaluation, make sure nobody has commented saying they are implementing it, and leave a comment saying you intend to implement it yourself. If you submit a PR for an evaluation that someone else is already working on, your work could be wasted!
Development
To develop a new evaluation:
1. Clone the repository and install the package with the `-e` flag and `[dev]` optional dependencies:

   ```bash
   git clone https://github.com/UKGovernmentBEIS/inspect_evals.git
   cd inspect_evals
   pip install -e ".[dev]"
   ```
2. Create a sub-directory in `src/inspect_evals` for your evaluation and add your evaluation task and related code. Notes:

   - Do not pass a `name` parameter to the task - this is only used for dynamically created tasks (i.e. tasks that are not addressable on the filesystem or in a package).
   - Avoid uploading a new copy of the dataset to Hugging Face. Use the official source whenever possible to ensure consistency and traceability and to reduce duplication.
3. Create an `__init__.py` file in the directory and use it to export your task and other useful functions.

4. Register your task(s) for execution by the `inspect eval` CLI command by importing them into `src/inspect_evals/_registry.py` (see the sketch after this list).

5. Confirm that your evaluation is properly registered by running it:

   ```bash
   inspect eval inspect_evals/<my-task>
   ```
6. Take note of the total number of samples in the dataset.

7. Calculate the ideal number of epochs. You'll need to determine how many epochs the evaluation should run for, based on dataset size and how performance trends over repeated passes. Your README should explain how you arrived at your chosen value.

   - There are further instructions in this Colab Notebook to calculate the optimal number of epochs.
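As a rough sketch of the layout described in steps 2-4, a minimal evaluation module, its `__init__.py` export, and the registry import might look like the following. All names (`my_eval`, the dataset path, the epoch count) are illustrative, and a real evaluation will usually need a more involved solver and scorer.

```python
# src/inspect_evals/my_eval/my_eval.py (illustrative names throughout)
from typing import Any

from inspect_ai import Epochs, Task, task
from inspect_ai.dataset import Sample, hf_dataset
from inspect_ai.scorer import match
from inspect_ai.solver import generate


def record_to_sample(record: dict[str, Any]) -> Sample:
    # Map one dataset record to an Inspect Sample
    return Sample(input=record["question"], target=record["answer"])


@task  # note: no `name` parameter is passed to the task
def my_eval() -> Task:
    return Task(
        dataset=hf_dataset(
            path="example-org/example-dataset",  # use the official dataset source
            split="test",
            sample_fields=record_to_sample,
        ),
        solver=generate(),
        scorer=match(),
        epochs=Epochs(3, "mean"),  # justify your chosen value in the README
    )
```

```python
# src/inspect_evals/my_eval/__init__.py
from inspect_evals.my_eval.my_eval import my_eval

__all__ = ["my_eval"]
```

`src/inspect_evals/_registry.py` then imports `my_eval` alongside the other registered tasks.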
Submission
To prepare an evaluation for submission as a pull request:
1. Create a `README.md` file in your sub-directory following the template used by other evaluations.

2. Add custom dependencies in the `pyproject.toml` file, if required. Don't forget to add your evaluation to the `[test]` dependencies if you've added any custom dependencies.

   ```toml
   [project.optional-dependencies]
   worldsense = ["pandas"]
   your_evaluation = ["example-python-package"]

   test = [
     "inspect_evals[dev]",
     "inspect_evals[your_evaluation]"
   ]
   ```

   Pin the package version if your evaluation might be sensitive to version updates.
3. Make sure your module does not import your custom dependencies at module import time. This breaks the way `inspect_ai` imports tasks from the registry (see the sketch after this list).

   To check for this, `pip uninstall` your optional dependencies and run another eval with `--limit=0`. If it fails due to one of your imports, you need to change how you are importing that dependency.

4. Add pytest tests covering the custom functions included in your evaluation. See the Testing Standards section for more details.
5. Run `pytest` to ensure all new and existing tests pass.

   To install test dependencies:

   ```bash
   pip install -e ".[test]"
   ```

6. Run linting, formatting, and type checking and address any issues:

   ```bash
   make check
   ```
   Note that we automatically run the `make check` commands on all incoming PRs, so be sure your contribution passes those before opening a PR.

7. If an evaluation is based on an existing paper or ported from another evaluation suite, you should validate that it roughly matches the results reported there before submitting your PR. This is to ensure that your new implementation faithfully reproduces the original evaluation. See the Evaluation Report section for more details.

   Comparison results between existing implementations and your implementation should be included in the `README.md` file and in the pull request description.
8. Add your evaluation to `src/inspect_evals/listing.yaml` to include it in the documentation (which is automatically generated and deployed by GitHub Actions after merge). See the Listing.yaml Reference section below for a complete example and field documentation.

   Don't forget to add any custom dependencies to `pyproject.toml` if applicable.

9. Run `tools/listing.py` to format your evaluation's README file according to our standards.
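One common way to satisfy the import check in step 3 is to defer the import of the optional dependency into the function that uses it. A minimal sketch, assuming a hypothetical helper module that needs `pandas`:

```python
# src/inspect_evals/your_evaluation/utils.py (hypothetical helper)
from typing import Any


def normalize_records(rows: list[dict[str, Any]]) -> list[dict[str, Any]]:
    """Normalise raw dataset rows before converting them to Samples."""
    # Deferred import: pandas is only needed when this function runs, so
    # importing the module (e.g. during task registry discovery) works even
    # when the optional dependency is not installed.
    import pandas as pd

    frame = pd.DataFrame(rows).fillna("")
    return frame.to_dict(orient="records")
```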
You can expect to receive a PR review within a couple of days.
Code Quality Standards
- Write the code to be read by other developers with no prior knowledge of the eval - it should be possible to understand the code without reading the paper
- Follow existing project structure and naming conventions
- Include type hints for all functions
- Document complex logic with comments
- Add docstrings following the Google style guide (see the example below)
- Ensure all tests pass before submission
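For example, a helper with type hints and a Google-style docstring might look like this (the function and its behaviour are purely illustrative):

```python
def normalize_answer(answer: str, strip_punctuation: bool = True) -> str:
    """Normalize a model answer before comparing it with the target.

    Args:
        answer: Raw answer text extracted from the model output.
        strip_punctuation: Whether to remove trailing punctuation.

    Returns:
        The normalized answer string.
    """
    answer = answer.strip().lower()
    if strip_punctuation:
        answer = answer.rstrip(".!?")
    return answer
```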
Inclusion of third-party code
If there is an official implementation of the eval, include a link to it in the README. It is permissible and encouraged to utilise code from the official implementation where possible. This both reduces the amount of code you need to write and maintains consistency with the original implementation.
If you are able to use a significant amount of code from the official implementation, it can be added as a dependency.
- If the official implementation has been released as a package on PyPI, you should add the package as a dependency in the `pyproject.toml` file, and import functions and classes from it (e.g. `swe_bench` in `pyproject.toml`).
- If the official implementation has not been released on PyPI but is structured as a package, the GitHub repo can be added to the `pyproject.toml` file as a dependency (e.g. `ifeval` in `pyproject.toml`).
- If the official implementation is not structured as a package, you should copy the code into your evaluation and include a comment with a link to the original implementation (e.g. the scorer code in `sciknoweval`). Consider opening a PR to the original implementation to structure it as a package.
- If there is only a small amount of code that you are able to use from the official implementation, it is permissible to copy the code into your evaluation and include a comment with a link to the original implementation (e.g. the annotate solver in `air_bench`).
Code from inspect_ai
Code from `inspect_ai` should be utilised where possible, concentrating on public components such as the built-in dataset functions, solvers, scorers, and metrics.

Internally, `inspect_ai` uses non-public functions and modules as building blocks for its public components. These are indicated by leading underscores in the package and function names, such as `inspect_ai.scorer._reducer` and `_compute_dict_stat`. These aren't guaranteed to be stable and do not go through a deprecation process if they change.
There’s no hard-and-fast rule against using them, but it would mean that the results of the benchmark could silently change in response to an internal change in the framework. If making use of non-public components, it is recommended to have tests in place that exhaustively confirm the functionality of the components you build with them, as this would give an early signal if the components changed in a way that affected your evaluation.
Alternatively, copy the implementations of the non-public components you need into your evaluation, and include a comment with a link to the original implementation (e.g. the helper functions in the `zerobench` reducer).
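If you do build on (or copy) non-public components, a small pinning test like the sketch below gives that early signal. The module path and `median_score_reducer` helper are hypothetical:

```python
# tests/your_eval/test_reducer.py (hypothetical module and helper names)
from inspect_ai.scorer import Score

from inspect_evals.your_eval.reducer import median_score_reducer  # hypothetical helper


def test_median_score_reducer_pins_behaviour():
    # If an upstream change alters the behaviour this helper relies on,
    # the assertion fails and flags it before benchmark results drift.
    scores = [Score(value=0.0), Score(value=1.0), Score(value=1.0)]
    assert median_score_reducer(scores).value == 1.0
```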
Solvers, Scorers and Metrics
[This section to be expanded with guidelines and examples]
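As a starting point, a minimal custom scorer built from `inspect_ai`'s public components looks roughly like the sketch below; the scorer name and matching logic are illustrative rather than project guidance.

```python
from inspect_ai.scorer import CORRECT, INCORRECT, Score, Scorer, Target, accuracy, scorer
from inspect_ai.solver import TaskState


@scorer(metrics=[accuracy()])
def exact_answer() -> Scorer:
    """Minimal custom scorer: exact string match against the target."""

    async def score(state: TaskState, target: Target) -> Score:
        answer = state.output.completion.strip()
        return Score(
            value=CORRECT if answer == target.text else INCORRECT,
            answer=answer,
        )

    return score
```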
Testing Standards
⚠️ If your evaluation is not adequately tested, it will not be accepted. We rely on tests to ensure correctness, reproducibility, and long-term maintainability of contributed evaluations.
Look in the `tests` directory for examples.
Unit Tests
Include unit tests that cover all non-trivial custom functions. This will often include (but is not limited to):
- Solver, scorer and dataset functions
- Custom tools
- Custom utils or functions
Include edge cases in your test coverage. Test error conditions and invalid inputs. Unit tests should also support your end-to-end tests by providing coverage for any code you mock the results of. Create a test that demonstrates your `record_to_sample` function with an actual example from the dataset. This serves as both a test and documentation of the expected behavior.
Example:
```python
from inspect_ai.dataset import Sample

from inspect_evals.your_eval import record_to_sample  # adjust to your module


def test_record_to_sample():
    record = {"question": "What is 2+2?", "answer": "4"}  # Example from actual dataset, showing all fields
    expected = Sample(input="What is 2+2?", target="4")
    assert record_to_sample(record) == expected
```
End-to-end Tests
It is also required that you implement at least one end-to-end test for your evaluation. This should be a test that demonstrates the full evaluation pipeline, including the solver and scorer. It should use `mockllm/model` as the model, which will return a default output for each sample. See mockllm for more details, including how to provide custom outputs.
If your eval has multiple tasks or a task with multiple variants, you should implement one success end-to-end test and one error handling end-to-end test for each task/variant.
If the test triggers the download of a dataset, mark it with `@pytest.mark.dataset_download`.
Examples:
```python
from inspect_ai import eval
from inspect_ai.model import ModelOutput, get_model

from inspect_evals.your_eval import your_eval_task


def test_end_to_end_your_eval_with_default_mock_responses():
    """
    This test confirms that the evaluation pipeline works end-to-end with the default mock responses.
    Note that all responses will presumably be incorrect, but the test should surface any issues with the evaluation pipeline.
    """
    [log] = eval(
        tasks=your_eval_task(
            task_parameter="task_parameter_value",
        ),
        sample_id="sample_id_1",
        model="mockllm/model",
    )
    assert log.status == "success"
    assert "accuracy" in log.results.scores[0].metrics
    # more asserts here


def test_end_to_end_your_eval_with_custom_mock_responses():
    """
    This test confirms that the evaluation pipeline works end-to-end with custom mock responses.
    This allows you to check the solver and metrics are working as expected.
    """
    [log] = eval(
        tasks=your_eval_task(
            task_parameter="task_parameter_value",
        ),
        sample_id="sample_id_1",
        model=get_model(
            "mockllm/model",
            custom_outputs=[
                ModelOutput.from_content(
                    model="mockllm/model",
                    content="ANSWER: A",  # correct answer
                ),
            ],
        ),
    )
    assert log.status == "success"
    assert "accuracy" in log.results.scores[0].metrics
    assert log.results.scores[0].metrics["accuracy"].value == 1.0  # all correct
```
Manual testing
- Use a fast and cheap model during development and for initial testing. `openai/gpt-4o-mini` is a good choice.
- Test with small subsets before running on full datasets
- Start with a few representative examples
- Gradually increase the test set size
- Verify that your implementation matches the original evaluation’s methodology
- Compare results with reference implementations if available
- Document any discrepancies and their causes
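For instance, a quick development run over a small subset might look like the following (the task name and sample limit are illustrative); the equivalent CLI invocation uses the `--model` and `--limit` flags.

```python
from inspect_ai import eval

# Run only the first 10 samples against a fast, cheap model during development.
logs = eval(
    "inspect_evals/my_task",  # illustrative task name
    model="openai/gpt-4o-mini",
    limit=10,
)
```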
Guidelines for specific test scenarios
[This section to be expanded]
- Mocking (what should be mocked and when)
- Sandboxes (how to mock interactions)
- Logs (cleanup files after tests)
Data Quality
- If you discover a broken record in the dataset:
  - Raise an issue in the upstream source (e.g., Hugging Face dataset or original repository)
  - If necessary, filter the dataset to exclude the broken record
  - Document the exclusion in your code and PR description
  - Example: `_sample_is_not_known_duplicate` in the `drop` evaluation
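A minimal sketch of such an exclusion, assuming hypothetical dataset and record identifiers:

```python
from inspect_ai.dataset import Dataset, FieldSpec, Sample, hf_dataset

# Identifier of the record confirmed broken upstream (hypothetical; link the issue here).
KNOWN_BROKEN_IDS = {"example-broken-record-id"}


def _sample_is_not_broken(sample: Sample) -> bool:
    return sample.id not in KNOWN_BROKEN_IDS


def load_dataset() -> Dataset:
    return hf_dataset(
        path="example-org/example-dataset",  # illustrative dataset path
        split="test",
        sample_fields=FieldSpec(input="question", target="answer", id="id"),
    ).filter(_sample_is_not_broken)
```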
Evaluation Report Guidelines
Overview
The evaluation report is a brief summary of results for your evaluation implementation compared against a standard set of existing results. We use your evaluation report to help validate that your implementation has accurately replicated the design of your eval into the Inspect framework.
Before completing your evaluation report
We expect that you will have:
- Completed small runs of samples with different models. You should also verify these results and their evaluation logs to be confident that:
  - The samples can run end to end
  - The metrics are calculated correctly and reported as expected
  - The evaluation logs have an error-free flow and a valid set of responses
  - The evaluation logs do not contain unexpected failure cases, e.g. incorrect responses due to strict mismatches, failing runs, or hitting limits when not expected
  - All subsets of the dataset pass these checks if they have notable differences in the dataset, or in their solver or scorer implementation
Comparing your results
Your implementation needs to be compared against the original paper or an existing reputable leaderboard result. If your benchmark does not have published results to compare against, you should raise this in your pull request.
We recommend producing results using at least two models, ideally with the full dataset. Depending on the evaluation implementation and choice of model, a full run comparison might be quite expensive (~$100 USD or more), so it is important that you calculate the costs for your runs first. If the cost is high, we recommend using a representative subset (20-30% of the total dataset).
If you have completed an implementation but are unable to generate your report due to a cost barrier (e.g. you are unable to afford it, or the cost is still too high even on a subset), you can clearly state this in your PR and submit without the report.
Reporting your results
You should include your evaluation report in the README.md in a section titled Evaluation Report.
Your report needs to include the following:
- Implementation Details
- Any deviations from reference implementations
- Known limitations or edge cases
- Results
- The results of your eval report runs and the comparison results, ideally in a table format
- Performance comparison with original paper or existing reputable leaderboard results
- Analysis of findings and any notable observations
- Reproducibility Information
- The number of samples run, out of the total number of samples in the dataset
- A timestamp of when the results were produced (a full timestamp, or just month and year, is enough)
- Clear instructions for replication
- Environment details (model versions, key dependencies)
- Random seed information where applicable
- Justification for choice of model temperature
The livebench evaluation report is a good example.
Listing.yaml Reference
The `listing.yaml` file serves as a central registry of all evaluations available in the repository. It's used to generate documentation and organize evaluations into categories.
Required Fields for Each Evaluation
- `title`: The display name of the evaluation (e.g., "HumanEval: Python Function Generation from Instructions")
- `description`: A brief description of what the evaluation measures, usually adapted from the abstract of the paper
- `path`: The filesystem path to the evaluation directory (e.g., "src/inspect_evals/humaneval")
- `arxiv`: Link to the paper or documentation (preferably an arXiv link when available)
- `group`: The category this evaluation belongs to (e.g., "Coding", "Cybersecurity", "Mathematics")
- `contributors`: List of GitHub usernames who contributed to this evaluation
- `tasks`: List of task configurations with:
  - `name`: The task identifier used in the CLI
  - `dataset_samples`: Number of samples in the dataset
Optional Fields
- `dependency`: Name of the optional dependency from `pyproject.toml` if the evaluation requires additional packages
- `tags`: List of documentation-related tags used for categorization in the documentation. These should only be used for documentation purposes, not for system configuration. Valid tags include:
  - `"Agent"`: For evaluations involving agents that can take multiple actions or use tools
  - `"Multimodal"`: For evaluations involving multiple input modalities (e.g., text and images)

  Omit this field if no documentation tags apply. Do not use it for system/configuration information.
- `metadata`: (Optional) Object containing system/configuration information. All fields are optional. May include:
  - `sandbox`: List of components that use a sandboxed environment. Can include `"solver"` and/or `"scorer"`. Omit if no sandbox is used.
  - `requires_internet`: Set to `true` only if the evaluation requires internet access. Omit if false.
  - `environment`: Special environment requirements (e.g., `"Kali-Linux"`). Only include if a non-standard environment is needed.
  Example:

  ```yaml
  metadata:
    sandbox: ["solver", "scorer"]  # Both solver and scorer use sandbox
    requires_internet: true
    environment: "Kali-Linux"  # Only include if special environment is needed
  ```

  Omit the entire `metadata` field if none of its subfields are needed.
- `human_baseline`: Optional field for evaluations with known human performance metrics

  ```yaml
  human_baseline:
    metric: accuracy  # The metric used (accuracy, F1, etc.)
    score: 0.875  # The human performance score
    source: https://arxiv.org/abs/0000.00000  # Source of the baseline
  ```
Example Entry
```yaml
- title: "Example-Bench: Your Amazing AI Benchmark"
  description: |
    A detailed description of what this benchmark evaluates and why it's important.
    This can span multiple lines using YAML's pipe (|) syntax.
  path: src/inspect_evals/your_evaluation
  arxiv: https://arxiv.org/abs/1234.12345
  group: Coding
  contributors: ["your-github-handle"]
  tasks:
    - name: task-name
      dataset_samples: 100
  dependency: "your_evaluation"  # Optional
  tags: ["Agent"]  # Optional
  metadata:  # optional metadata documenting eval information. All fields optional
    sandbox: []  # list of eval aspects that use a sandbox, can include "solver" and/or "scorer"
    requires_internet: true  # boolean indicating whether the eval requires internet access
    environment: "Kali-Linux"  # optional environment information
```
Best Practices
- Keep descriptions concise but informative
- Use the pipe (|) syntax for multi-line descriptions
- Always include the arXiv link when available
- List all contributors using their GitHub usernames
- Keep the `group` consistent with existing categories
- For agent-based evaluations, include the "Agent" tag in the `tags` field
- Use the `metadata` field for all system/configuration information
- Omit empty `tags` and `metadata` fields
- Update the listing when adding new tasks or making significant changes to existing ones
Examples
Here are some existing evaluations that serve as good examples of what is required in a new submission: