Contribution Guide
We welcome new evaluations and improvements to existing evaluations. You can contribute by installing the development version of inspect_evals, adding your evaluation, and opening a pull request in this repo.
What types of evaluations are we looking for?
We prioritize evaluations that are:

- Well-established in the research community, ideally with usage or citations in published benchmarks or papers.
- Credibly sourced: published by a major AI lab (e.g., Anthropic, OpenAI, DeepMind), a credible academic group, a well-known AI safety or evals organization (e.g., METR, Scale AI), or similar.
  - Evaluations from less prominent sources are lower priority.
  - Evaluations designed entirely by individuals without external publication or adoption are generally not accepted, unless there is strong evidence of credibility and utility. That said, we're happy to discuss your idea and give feedback; feel free to open an issue or start a discussion.
- Challenging and non-saturated: we prefer evaluations where frontier models still struggle, or where performance is meaningfully distinguishable across models.
- Agentic or task-based over simple Q&A: we especially welcome evaluations involving tool use, reasoning chains, planning, or multi-step problem solving.
- Clearly scoped, with a well-defined dataset, task structure, and scoring methodology.
- Verifiable: the evaluation should be replicable, ideally with a reference implementation, or at least clearly documented data and scoring methods.
- Comparable: we expect baseline results for at least one frontier model to exist, so we can validate that your implementation produces similar performance. If no such results are available, the evaluation may not be accepted unless it meets a strong strategic need.
If you’re unsure whether your evaluation idea meets these criteria, feel free to open a GitHub issue for discussion. We’re happy to help!
If you’ve come across an evaluation that fits our priorities but don’t have time to implement it yourself, please still raise an issue! Highlighting useful evaluations for others to pick up is a valuable way to contribute.
Development
To develop a new evaluation:
Clone the repository and install the package with the `-e` flag and `[dev]` optional dependencies:

```bash
git clone https://github.com/UKGovernmentBEIS/inspect_evals.git
cd inspect_evals
pip install -e ".[dev]"
```
Create a sub-directory in `src/inspect_evals` for your evaluation and add your evaluation task and related code (a minimal sketch of a task module follows these notes). Notes:

- Do not pass a `name` parameter to the task. This is only used for dynamically created tasks (i.e. tasks that are not addressable on the filesystem or in a package).
- Avoid uploading a new copy of the dataset to Hugging Face. Use the official source whenever possible to ensure consistency and traceability, and to reduce duplication.
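As a rough illustration only (the module path, dataset name, and record fields below are hypothetical, and your evaluation will differ), a task module typically defines a `record_to_sample` converter and a `@task`-decorated function that returns an `inspect_ai` `Task`:

```python
# src/inspect_evals/your_evaluation/your_evaluation.py (hypothetical path and names)
from typing import Any

from inspect_ai import Task, task
from inspect_ai.dataset import Sample, hf_dataset
from inspect_ai.scorer import match
from inspect_ai.solver import generate


def record_to_sample(record: dict[str, Any]) -> Sample:
    # Map one dataset record onto an Inspect Sample.
    return Sample(
        input=record["question"],
        target=record["answer"],
        id=record["id"],
    )


@task
def your_task() -> Task:
    return Task(
        # "org/dataset-name" is a placeholder for the official dataset source.
        dataset=hf_dataset(
            path="org/dataset-name",
            split="test",
            sample_fields=record_to_sample,
        ),
        solver=[generate()],
        scorer=match(),
    )
```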
Create an `__init__.py` file in the directory and use it to export your task and other useful functions.

Register your task(s) for execution by the `inspect eval` CLI command by importing them into `src/inspect_evals/_registry.py` (a sketch of both files follows below).

Confirm that your evaluation is properly registered by running it:

```bash
inspect eval inspect_evals/<my-task>
```
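For example (hypothetical names; the exact layout of `_registry.py` in the repository may differ slightly), the export and registration might look like:

```python
# src/inspect_evals/your_evaluation/__init__.py (hypothetical names)
from inspect_evals.your_evaluation.your_evaluation import your_task

__all__ = ["your_task"]
```

```python
# src/inspect_evals/_registry.py: add an import for your task alongside the existing ones
from .your_evaluation import your_task
```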
Take note of the total number of samples in the dataset.
Calculate the ideal number of epochs. You’ll need to determine how many epochs the evaluation should run for, based on dataset size and how performance trends over repeated passes. Your README should explain how you arrived at your chosen value.
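As a sketch of where this setting lives (the tiny in-memory dataset, the reducer, and the value 4 below are placeholders rather than recommendations), the number of epochs can be passed to the `Task` via `inspect_ai`'s `Epochs`:

```python
from inspect_ai import Epochs, Task, task
from inspect_ai.dataset import MemoryDataset, Sample
from inspect_ai.scorer import match
from inspect_ai.solver import generate


@task
def epochs_demo() -> Task:
    dataset = MemoryDataset([Sample(input="2 + 2 = ?", target="4")])
    return Task(
        dataset=dataset,
        solver=[generate()],
        scorer=match(),
        # Run each sample 4 times and average the per-epoch scores.
        epochs=Epochs(4, "mean"),
    )
```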
Submission
To prepare an evaluation for submission as a pull request:
Create a `README.md` file in your sub-directory following the template used by other evaluations.

Add custom dependencies in the `pyproject.toml` file. Don't forget to add your evaluation to the `[test]` dependencies:

```toml
[project.optional-dependencies]
worldsense = ["pandas"]
your_evaluation = ["example-python-package"]

test = [
  "inspect_evals[dev]",
  "inspect_evals[your_evaluation]"
]
```
Pin the package version if your evaluation might be sensitive to version updates.
Make sure your module does not import your custom dependencies at module import time, as this breaks the way `inspect_ai` imports tasks from the registry (a sketch of the lazy-import pattern follows this list).

Add pytest tests covering the custom functions included in your evaluation. Demonstrating an example input and output of your `record_to_sample` function is fantastic documentation (see the test sketch below). Look in the `tests` directory for examples. Coverage will often include (but is not limited to):

- Solver, scorer and dataset functions
- Custom tools
- Custom utils or functions

It is also required that you implement an end-to-end test for your evaluation.
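To illustrate the lazy-import rule above, here is a rough sketch (names are hypothetical, with `pandas` standing in for an optional dependency) that defers the import until the dataset is actually loaded:

```python
from inspect_ai import Task, task
from inspect_ai.dataset import MemoryDataset, Sample
from inspect_ai.scorer import match
from inspect_ai.solver import generate


def load_dataset() -> MemoryDataset:
    # Import the optional dependency here rather than at module top level, so
    # merely importing the module (as the registry does) never requires it.
    import pandas as pd

    df = pd.DataFrame({"question": ["2 + 2 = ?"], "answer": ["4"]})
    samples = [Sample(input=row.question, target=row.answer) for row in df.itertuples()]
    return MemoryDataset(samples)


@task
def your_task() -> Task:
    return Task(dataset=load_dataset(), solver=[generate()], scorer=match())
```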
⚠️ If your evaluation is not adequately tested, it will not be accepted. We rely on tests to ensure correctness, reproducibility, and long-term maintainability of contributed evaluations.
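For the unit tests, a minimal sketch of a `record_to_sample` test (the file path, module path, and record fields below are hypothetical) might look like:

```python
# tests/your_evaluation/test_your_evaluation.py (hypothetical path)
from inspect_evals.your_evaluation.your_evaluation import record_to_sample


def test_record_to_sample():
    record = {"id": 1, "question": "2 + 2 = ?", "answer": "4"}
    sample = record_to_sample(record)
    assert sample.input == "2 + 2 = ?"
    assert sample.target == "4"
    assert sample.id == 1
```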
Run `pytest` to ensure all new and existing tests pass. To install test dependencies:

```bash
pip install -e ".[test]"
```
Run linting, formatting, and type checking and address any issues:

```bash
make check
```
Note that we automatically run the `make check` commands on all incoming PRs, so be sure your contribution passes those before opening a PR.

If an evaluation is based on an existing paper or ported from another evaluation suite, you should validate that it roughly matches the results reported there before submitting your PR. This is to ensure that your new implementation faithfully reproduces the original evaluation. See the Evaluation Report Guidelines section for more details.

Comparison results between existing implementations and your implementation should be included in the `README.md` file and in the pull request description.
Add your evaluation to `src/inspect_evals/listing.yaml` to include it in the documentation (which is automatically generated and deployed by GitHub Actions after merge):

```yaml
- title: "Example-Bench: Your Amazing AI Benchmark"
  description: Describe what the benchmark is about here.
  path: src/inspect_evals/your_evaluation
  arxiv: https://arxiv.org/abs/1234.12345
  group: Coding
  contributors: ["your-github-handle"]
  tasks:
    - name: task-name
      dataset_samples: 365 # number of samples in the dataset used for this task
      human_baseline: # optional field for when a human baseline is available for the task
        metric: accuracy
        score: 0.875
        source: https://arxiv.org/abs/0000.00000
  dependency: "your_evaluation" # optional field for custom dependency from pyproject.toml
  tags: ["Agent"] # optional tag for agentic evaluations
```
Don’t forget to add the custom dependency from pyproject.toml if applicable.
Run `tools/listing.py` to format your evaluation's README file according to our standards.
You can expect to receive a PR review within a couple of days.
Evaluation Report Guidelines
Overview
The evaluation report is a brief summary of results for your evaluation implementation compared against a standard set of existing results. We use your evaluation report to help validate that your implementation accurately replicates the design of the original evaluation within the Inspect framework.
Before completing your evaluation report
We expect that you will have:
- Completed small runs of samples with different models. You should also verify these results and their evaluation logs (one way to inspect a log programmatically is sketched after this list) to be confident that:
  - The samples can run end to end
  - The metrics are calculated correctly and reported as expected
  - The evaluation logs show an error-free flow and a valid set of responses
  - The evaluation logs do not contain unexpected failure cases, e.g. incorrect responses due to strict mismatches, failing runs, or hitting limits when not expected
- All subsets of the dataset pass these checks if they have notable differences in the dataset, or in their solver or scorer implementation
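Logs can be browsed with `inspect view`, or inspected programmatically. The snippet below is a rough sketch using `inspect_ai`'s log API; the log path is a placeholder for a log file produced by one of your small runs:

```python
from inspect_ai.log import read_eval_log

# Placeholder path: point this at a log from one of your runs.
log = read_eval_log("logs/2025-01-01T00-00-00_your-task_abc123.eval")

print(log.status)  # expect "success" for an error-free run
if log.results is not None:
    for score in log.results.scores:
        # Confirm the metrics are calculated and reported as expected.
        print(score.name, score.metrics)
```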
Comparing your results
Your implementation needs to be compared against the original paper or an existing reputable leaderboard result. If your benchmark does not have published results to compare against, you should raise this in your pull request.
We recommend producing results using at least two models, ideally with the full dataset. Depending on the evaluation implementation and choice of model, a full run comparison might be quite expensive (~$100 USD or more). It is important that you estimate the cost of your runs first. If the cost is high, we recommend using a representative subset (20-30% of the total dataset).
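For instance, one way to run a capped subset is via `inspect_ai`'s Python `eval()` API (the task name, model, and limit of 100 samples below are placeholders):

```python
from inspect_ai import eval

# Placeholder task and model; `limit` caps the number of samples evaluated,
# which is useful for estimating cost before committing to a full run.
logs = eval(
    "inspect_evals/your_task",
    model="openai/gpt-4o-mini",
    limit=100,
)
```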
If you have completed an implementation but are unable to generate your report due to any kind of cost barrier (e.g. you cannot afford the runs, or the cost is still too high even on a subset), you can clearly state this in your PR and submit without it.
Reporting your results
You should include your evaluation report in the README.md in a section titled Evaluation Report.
Your report needs to include the following:

- The results of your eval report runs and the comparison results, ideally in a table format (a placeholder layout is sketched below)
- The total samples run out of how many total samples for the dataset
- A timestamp of when the results were produced (a full timestamp, or month and year, is enough)
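As a purely illustrative layout (every name and value below is a placeholder, not a result), such a table might look like:

```markdown
| Model        | This implementation | Reported (paper / leaderboard) |
| ------------ | ------------------- | ------------------------------ |
| model-name-1 | xx.x                | xx.x                           |
| model-name-2 | xx.x                | xx.x                           |

Results computed on N of M samples, <month year>.
```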
An evaluation report example can be found here
Listing.yaml Reference
The `listing.yaml` file serves as a central registry of all evaluations available in the repository. It's used to generate documentation and organize evaluations into categories.
Required Fields for Each Evaluation
- `title`: The display name of the evaluation (e.g., "HumanEval: Python Function Generation from Instructions")
- `description`: A brief description of what the evaluation measures, usually adapted from the abstract of the paper
- `path`: The filesystem path to the evaluation directory (e.g., "src/inspect_evals/humaneval")
- `arxiv`: Link to the paper or documentation (preferably an arXiv link when available)
- `group`: The category this evaluation belongs to (e.g., "Coding", "Cybersecurity", "Mathematics")
- `contributors`: List of GitHub usernames who contributed to this evaluation
- `tasks`: List of task configurations with:
  - `name`: The task identifier used in the CLI
  - `dataset_samples`: Number of samples in the dataset
Optional Fields
- `dependency`: Name of the optional dependency from pyproject.toml if the evaluation requires additional packages
- `tags`: List of tags that describe special characteristics or requirements of the evaluation:
  - `"Agent"`: Indicates the evaluation involves an agent that can take multiple actions or use tools
  - `"Multimodal"`: For evaluations that involve multiple input modalities (e.g., text and images)
  - `"requires_internet"`: The evaluation requires internet access to function
  - `"Kali-Linux"`: The evaluation uses a Kali Linux environment, potentially making it higher risk from a security perspective
  - `"sandbox:solver"`: The evaluation uses a sandboxed environment for the solver component
  - `"sandbox:scorer"`: The evaluation uses a sandboxed environment for the scoring component

  Multiple tags can be combined as needed, e.g., `["Agent", "requires_internet"]`
- `human_baseline`: Optional field for evaluations with known human performance metrics:

  ```yaml
  human_baseline:
    metric: accuracy # The metric used (accuracy, F1, etc.)
    score: 0.875 # The human performance score
    source: https://arxiv.org/abs/0000.00000 # Source of the baseline
  ```
Example Entry
- title: "Example-Bench: Your Amazing AI Benchmark"
description: |
A detailed description of what this benchmark evaluates and why it's important.
This can span multiple lines using YAML's pipe (|) syntax. path: src/inspect_evals/your_evaluation
arxiv: https://arxiv.org/abs/1234.12345
group: Coding
contributors: ["your-github-handle"]
tasks:
- name: task-name
dataset_samples: 100
dependency: "your_evaluation" # Optional
tags: ["Agent"] # Optional
Best Practices
- Keep descriptions concise but informative
- Use the pipe (|) syntax for multi-line descriptions
- Always include the arXiv link when available
- List all contributors using their GitHub usernames
- Keep the `group` consistent with existing categories
- For agent-based evaluations, include the "Agent" tag
- Update the listing when adding new tasks or making significant changes to existing ones
Examples
Here are some existing evaluations that serve as good examples of what is required in a new submission: