Contribution Guide

We welcome new evaluations and improvements to existing evaluations. You can contribute by installing the development version of inspect_evals, adding your evaluation, and opening a pull request in this repo.

What types of evaluations are we looking for?

We prioritize evaluations that are:

  • Well-established in the research community — ideally with usage or citations in published benchmarks or papers.
  • Credibly sourced — published by a major AI lab (e.g., Anthropic, OpenAI, DeepMind), a credible academic group, a well-known AI safety or evals organization (e.g., METR, Scale AI), or similar.
    • Evaluations from less prominent sources are lower priority.
    • Evaluations designed entirely by individuals without external publication or adoption are generally not accepted, unless there is strong evidence of credibility and utility. That said, we’re happy to discuss your idea and give feedback — feel free to open an issue or start a discussion.
  • Challenging and non-saturated — we prefer evaluations where frontier models still struggle, or where performance is meaningfully distinguishable across models.
  • Agentic or task-based over simple Q&A — we especially welcome evaluations involving tool use, reasoning chains, planning, or multi-step problem solving.
  • Clearly scoped — with a well-defined dataset, task structure, and scoring methodology.
  • Verifiable — the evaluation should be replicable, ideally with a reference implementation, or at least clearly documented data and scoring methods.
  • Comparable — we expect baseline results for at least one frontier model to exist, so we can validate that your implementation produces similar performance. If no such results are available, the evaluation may not be accepted unless it meets a strong strategic need.

If you’re unsure whether your evaluation idea meets these criteria, feel free to open a GitHub issue for discussion. We’re happy to help!

If you’ve come across an evaluation that fits our priorities but don’t have time to implement it yourself, please still raise an issue! Highlighting useful evaluations for others to pick up is a valuable way to contribute.

Development

To develop a new evaluation:

  1. Clone the repository and install the package with the -e flag and [dev] optional dependencies:

    git clone https://github.com/UKGovernmentBEIS/inspect_evals.git
    cd inspect_evals
    pip install -e ".[dev]"
  2. Create a sub-directory in src/inspect_evals for your evaluation and add your evaluation task and related code. Notes:

  • Do not pass a name parameter to the task - this is only used for dynamically created tasks (i.e. tasks that are not addressable on the filesystem or in a package).
  • Avoid uploading a new copy of the dataset to Hugging Face. Use the official source whenever possible to ensure consistency and traceability and to reduce duplication.
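
    For reference, a minimal task module might look roughly like the following. This is only a sketch: example_bench, the example-org/example-dataset dataset id, and the record fields are hypothetical placeholders, and the right solver and scorer depend entirely on your evaluation.

      # src/inspect_evals/example_bench/example_bench.py (hypothetical layout)
      from typing import Any

      from inspect_ai import Task, task
      from inspect_ai.dataset import Sample, hf_dataset
      from inspect_ai.scorer import match
      from inspect_ai.solver import generate


      def record_to_sample(record: dict[str, Any]) -> Sample:
          # Map one raw dataset record onto an Inspect Sample.
          return Sample(input=record["question"], target=record["answer"])


      @task
      def example_bench() -> Task:
          return Task(
              dataset=hf_dataset(
                  path="example-org/example-dataset",  # placeholder dataset id
                  split="test",
                  sample_fields=record_to_sample,
              ),
              solver=generate(),
              scorer=match(),
          )
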
  3. Create an __init__.py file in the directory and use it to export your task and other useful functions.
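
    For example (again only a sketch, reusing the hypothetical example_bench names from above):

      # src/inspect_evals/example_bench/__init__.py (hypothetical)
      from inspect_evals.example_bench.example_bench import example_bench, record_to_sample

      __all__ = ["example_bench", "record_to_sample"]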

  4. Register your task(s) with the inspect eval CLI by importing them into src/inspect_evals/_registry.py.
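
    Registration is just an import, for example (hypothetical name; follow the style of the existing imports in that file):

      # src/inspect_evals/_registry.py
      from inspect_evals.example_bench import example_bench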

  5. Confirm that your evaluation is properly registered by running it:

    inspect eval inspect_evals/<my-task>

    Take note of the total number of samples in the dataset.

  6. Calculate the ideal number of epochs, based on the dataset size and how performance trends over repeated passes. Your README should explain how you arrived at your chosen value.
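
    If you end up with more than one epoch, you can set the value on the task itself so that inspect eval uses it by default; it can still be overridden with the --epochs CLI option. A self-contained sketch (assuming the epochs argument on Task provided by current inspect_ai releases; the one-sample in-memory dataset and the value of 5 are purely illustrative):

      from inspect_ai import Task, task
      from inspect_ai.dataset import MemoryDataset, Sample
      from inspect_ai.scorer import match
      from inspect_ai.solver import generate


      @task
      def example_bench() -> Task:
          return Task(
              # A tiny in-memory dataset just to keep the sketch runnable.
              dataset=MemoryDataset([Sample(input="2 + 2 = ?", target="4")]),
              solver=generate(),
              scorer=match(),
              epochs=5,  # illustrative only; justify your chosen value in the README
          )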

Submission

To prepare an evaluation for submission as a pull request:

  1. Create a README.md file in your sub-directory following the template used by other evaluations.

  2. Add custom dependencies in the pyproject.toml file. Don’t forget to add your evaluation to the [test] dependencies.

    [project.optional-dependencies]
    worldsense = ["pandas"]
    your_evaluation = ["example-python-package"]
    
    test = [
       "inspect_evals[dev]",
       "inspect_evals[your_evaluation]"
    ]

    Pin the package version (e.g. "example-python-package==1.2.3") if your evaluation might be sensitive to version updates.

    Make sure your module does not import your custom dependencies at module level (i.e. when the module itself is imported): doing so breaks the way inspect_ai imports tasks from the registry.
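
    One way to keep such imports lazy is to move them inside the functions that need them, for example (a sketch; the module and function names are hypothetical, and pandas stands in for whatever optional dependency you declared):

      # src/inspect_evals/your_evaluation/utils.py (hypothetical)

      def load_records(path: str) -> list[dict]:
          # Import the optional dependency inside the function rather than at
          # module level: inspect_ai imports this package when it scans the
          # task registry, and that import must succeed even when the extra
          # dependency is not installed.
          import pandas as pd  # declared under [project.optional-dependencies]

          return pd.read_csv(path).to_dict(orient="records")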

  3. Add pytest tests covering the custom functions included in your evaluation. Demonstrating an example input and output of your record_to_sample function is fantastic documentation. Look in the tests directory for examples. These tests will often cover (but are not limited to):

    • Solver, scorer and dataset functions
    • Custom tools
    • Custom utils or functions

    It is also required that you implement an end-to-end test for your evaluation.

    ⚠️ If your evaluation is not adequately tested, it will not be accepted. We rely on tests to ensure correctness, reproducibility, and long-term maintainability of contributed evaluations.
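
    A record_to_sample test might look roughly like this (a sketch only; the test path, record fields, and expected values are placeholders for whatever your dataset actually contains):

      # tests/your_evaluation/test_your_evaluation.py (hypothetical path)
      from inspect_ai.dataset import Sample

      from inspect_evals.your_evaluation.your_evaluation import record_to_sample


      def test_record_to_sample():
          # A representative raw record; this doubles as documentation of the
          # dataset schema.
          record = {"question": "2 + 2 = ?", "answer": "4"}

          sample = record_to_sample(record)

          assert isinstance(sample, Sample)
          assert sample.input == "2 + 2 = ?"
          assert sample.target == "4"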

  4. Run pytest to ensure all new and existing tests pass.

    To install test dependencies:

    pip install -e ".[test]"
  5. Run linting, formatting, and type checking and address any issues:

    make check

    Note that make check runs automatically on all incoming PRs, so be sure your contribution passes it locally before opening a PR.

  6. If an evaluation is based on an existing paper or ported from another evaluation suite, you should validate that it roughly matches the results reported there before submitting your PR. This is to ensure that your new implementation faithfully reproduces the original evaluation. See the Evaluation Report section for more details.

    Comparison results between existing implementations and your implementation should be included in the README.md file and in the pull request description.

  7. Add your evaluation to src/inspect_evals/listing.yaml to include it in the documentation (which is automatically generated and deployed by GitHub Actions after merge).

     - title: "Example-Bench: Your Amazing AI Benchmark"
       description: Describe what the benchmark is about here.
       path: src/inspect_evals/your_evaluation
       arxiv: https://arxiv.org/abs/1234.12345
       group: Coding
       contributors: ["your-github-handle"]
       tasks:
         - name: task-name
           dataset_samples: 365  # number of samples in the dataset used for this task
           human_baseline:  # optional field for when a human baseline is available for the task
             metric: accuracy
             score: 0.875
             source: https://arxiv.org/abs/0000.00000
       dependency: "your_evaluation"  # optional field for custom dependency from pyproject.toml
       tags: ["Agent"]  # optional tag for agentic evaluations

    Don’t forget to add the custom dependency from pyproject.toml if applicable.

  8. Run tools/listing.py to format your evaluation’s README file according to our standards.

You can expect to receive a PR review within a couple of days.

Evaluation Report Guidelines

Overview

The evaluation report is a brief summary of results for your evaluation implementation compared against a standard set of existing results. We use your evaluation report to help validate that your implementation accurately replicates the design of the original eval within the Inspect framework.

Before completing your evaluation report

We expect that you will have:

  • Completed small runs of samples with different models (see the sketch after this list). You should also verify these results and their evaluation logs to be confident that:
    • The samples can run end to end
    • The metrics are calculated correctly and reported as expected
    • The evaluation logs have an error-free flow and a valid set of responses
    • The evaluation logs do not contain unexpected failure cases, e.g. incorrect responses due to overly strict matching, failed runs, or hitting limits when not expected
    • All subsets of the dataset pass these checks if they differ notably in their data or in their solver or scorer implementation
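
A convenient way to do these small runs is inspect_ai’s Python eval() API, which returns the logs you will want to inspect (a sketch only; the model name and sample limit are arbitrary, and you can equally use the inspect eval CLI with --limit and then open the logs with inspect view):

    from inspect_ai import eval

    # Run a handful of samples end to end as a sanity check.
    logs = eval(
        "inspect_evals/<my-task>",   # your registered task name
        model="openai/gpt-4o-mini",  # any model you have credentials for
        limit=5,                     # a small limit keeps the run cheap
    )

    for log in logs:
        print(log.status)  # expect "success"; review the full log for details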

Comparing your results

Your implementation needs to be compared against the original paper or an existing reputable leaderboard result. If your benchmark does not have published results to compare against, you should raise this in your pull request.

We recommend producing results using at least two models, ideally with the full dataset. Depending on the evaluation implementation and choice of model, a full run comparison might be quite expensive (~$100 USD or more), so it is important to estimate the cost of your runs first. If the cost is high, we recommend using a representative subset (20-30% of the total dataset).

If you have completed an implementation but are unable to generate your report because of cost (e.g. you cannot afford the runs, or even a representative subset is still too expensive), clearly state this in your PR and submit without a report.

Reporting your results

You should include your evaluation report in the README.md in a section titled Evaluation Report.

Your report needs to include the following:

  • The results of your eval report runs and the comparison results, ideally in a table format
  • The number of samples run out of the total number of samples in the dataset
  • A timestamp of when the results were produced (a full timestamp, or month and year, is enough)

An evaluation report example can be found here

Listing.yaml Reference

The listing.yaml file serves as a central registry of all evaluations available in the repository. It’s used to generate documentation and organize evaluations into categories.

Required Fields for Each Evaluation

  • title: The display name of the evaluation (e.g., “HumanEval: Python Function Generation from Instructions”)
  • description: A brief description of what the evaluation measures, usually adapted from the abstract of the paper
  • path: The filesystem path to the evaluation directory (e.g., “src/inspect_evals/humaneval”)
  • arxiv: Link to the paper or documentation (preferably arXiv link when available)
  • group: The category this evaluation belongs to (e.g., “Coding”, “Cybersecurity”, “Mathematics”)
  • contributors: List of GitHub usernames who contributed to this evaluation
  • tasks: List of task configurations with:
    • name: The task identifier used in the CLI
    • dataset_samples: Number of samples in the dataset

Optional Fields

  • dependency: Name of the optional dependency from pyproject.toml if the evaluation requires additional packages

  • tags: List of tags that describe special characteristics or requirements of the evaluation:

    • "Agent": Indicates the evaluation involves an agent that can take multiple actions or use tools
    • "Multimodal": For evaluations that involve multiple input modalities (e.g., text and images)
    • "requires_internet": The evaluation requires internet access to function
    • "Kali-Linux": The evaluation uses a Kali Linux environment, potentially making it higher risk from a security perspective
    • "sandbox:solver": The evaluation uses a sandboxed environment for the solver component
    • "sandbox:scorer": The evaluation uses a sandboxed environment for the scoring component

    Multiple tags can be combined as needed, e.g., ["Agent", "requires_internet"]

  • human_baseline: Optional field for evaluations with known human performance metrics

    human_baseline:
      metric: accuracy  # The metric used (accuracy, F1, etc.)
      score: 0.875     # The human performance score
      source: https://arxiv.org/abs/0000.00000  # Source of the baseline

Example Entry

- title: "Example-Bench: Your Amazing AI Benchmark"
  description: |
    A detailed description of what this benchmark evaluates and why it's important.
    This can span multiple lines using YAML's pipe (|) syntax.
  path: src/inspect_evals/your_evaluation
  arxiv: https://arxiv.org/abs/1234.12345
  group: Coding
  contributors: ["your-github-handle"]
  tasks:
    - name: task-name
      dataset_samples: 100
  dependency: "your_evaluation"  # Optional
  tags: ["Agent"]  # Optional

Best Practices

  1. Keep descriptions concise but informative
  2. Use the pipe (|) syntax for multi-line descriptions
  3. Always include the arXiv link when available
  4. List all contributors using their GitHub usernames
  5. Keep the group consistent with existing categories
  6. For agent-based evaluations, include the “Agent” tag
  7. Update the listing when adding new tasks or making significant changes to existing ones

Examples

Here are some existing evaluations that serve as good examples of what is required in a new submission:

  • GPQA, a simple multiple-choice evaluation
  • GSM8K, a mathematics task with fewshot prompting
  • HumanEval, a Python coding task
  • InterCode, a capture the flag (CTF) cybersecurity task