Technical Contribution Guide
This guide covers the technical requirements, standards, and processes for contributing evaluations to the Inspect Evals repository. For strategic guidance on benchmark implementation methodology, see docs/methodology.md.
We welcome new evaluations and improvements to existing evaluations. You can contribute by following the technical setup and submission requirements outlined below.
What types of evaluations are we looking for?
We prioritize evaluations that are:
- Well-established in the research community — ideally with usage or citations in published benchmarks or papers.
- Credibly sourced — published by a major AI lab (e.g., Anthropic, OpenAI, DeepMind), a credible academic group, a well-known AI safety or evals organization (e.g., METR, Scale AI), or similar.
  - Evaluations from less prominent sources are lower priority.
  - Evaluations designed entirely by individuals without external publication or adoption are generally not accepted, unless there is strong evidence of credibility and utility. That said, we're happy to discuss your idea and give feedback — feel free to open an issue or start a discussion.
- Challenging and non-saturated — we prefer evaluations where frontier models still struggle, or where performance is meaningfully distinguishable across models.
- Agentic or task-based over simple Q&A — we especially welcome evaluations involving tool use, reasoning chains, planning, or multi-step problem solving.
- Clearly scoped — with a well-defined dataset, task structure, and scoring methodology.
- Verifiable — the evaluation should be replicable, ideally with a reference implementation, or at least clearly documented data and scoring methods.
- Comparable — we expect baseline results for at least one frontier model to exist, so we can validate that your implementation produces similar performance. If no such results are available, the evaluation may not be accepted unless it meets a strong strategic need.
If you’re unsure whether your evaluation idea meets these criteria, feel free to open a GitHub issue for discussion. We’re happy to help!
If you’ve come across an evaluation that fits our priorities but don’t have time to implement it yourself, please still raise an issue! Highlighting useful evaluations for others to pick up is a valuable way to contribute.
Once there’s an issue raised for the evaluation, make sure nobody has commented saying they are implementing it, and leave a comment saying you intend to implement it yourself. If you submit a PR for an evaluation that someone else is already working on, your work could be wasted!
Development
To develop a new evaluation:
1. Clone the repository and install the package with the `-e` flag and `[dev]` optional dependencies:

   ```bash
   git clone https://github.com/UKGovernmentBEIS/inspect_evals.git
   cd inspect_evals
   pip install -e ".[dev]"
   ```
2. Create a sub-directory in `src/inspect_evals` for your evaluation and add your evaluation task and related code. Notes:

   - Do not pass a `name` parameter to the task - this is only used for dynamically created tasks (i.e. tasks that are not addressable on the filesystem or in a package).
   - Avoid uploading a new copy of the dataset to Hugging Face. Use the official source whenever possible to ensure consistency and traceability and to reduce duplication.
3. Create an `__init__.py` file in the directory and use it to export your task and other useful functions.

4. Register your task(s) for execution by the `inspect eval` CLI command by importing them into `src/inspect_evals/_registry.py` (see the sketch after this list).

5. Confirm that your evaluation is properly registered by running it:

   ```bash
   inspect eval inspect_evals/<my-task>
   ```
6. Take note of the total number of samples in the dataset.

7. Calculate the ideal number of epochs. You'll need to determine how many epochs the evaluation should run for, based on dataset size and how performance trends over repeated passes. Your README should explain how you arrived at your chosen value.

   - There are further instructions in this Colab Notebook to calculate the optimal number of epochs.
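As a rough sketch of the layout described in steps 2-4, a minimal evaluation module, its `__init__.py` export, and the registry import might look like the following. All names (`my_eval`, the dataset path, the epoch count) are illustrative, and a real evaluation will usually need a more involved solver and scorer.

```python
# src/inspect_evals/my_eval/my_eval.py (illustrative names throughout)
from typing import Any

from inspect_ai import Epochs, Task, task
from inspect_ai.dataset import Sample, hf_dataset
from inspect_ai.scorer import match
from inspect_ai.solver import generate


def record_to_sample(record: dict[str, Any]) -> Sample:
    # Map one dataset record to an Inspect Sample
    return Sample(input=record["question"], target=record["answer"])


@task  # note: no `name` parameter is passed to the task
def my_eval() -> Task:
    return Task(
        dataset=hf_dataset(
            path="example-org/example-dataset",  # use the official dataset source
            split="test",
            sample_fields=record_to_sample,
        ),
        solver=generate(),
        scorer=match(),
        epochs=Epochs(3, "mean"),  # justify your chosen value in the README
    )
```

```python
# src/inspect_evals/my_eval/__init__.py
from inspect_evals.my_eval.my_eval import my_eval

__all__ = ["my_eval"]
```

`src/inspect_evals/_registry.py` then imports `my_eval` alongside the other registered tasks.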
Submission
To prepare an evaluation for submission as a pull request:
1. Create a `README.md` file in your sub-directory following the template used by other evaluations.

2. Add custom dependencies in the `pyproject.toml` file, if required. Don't forget to add your evaluation to the `[test]` dependencies if you've added any custom dependencies.

   ```toml
   [project.optional-dependencies]
   worldsense = ["pandas"]
   your_evaluation = ["example-python-package"]

   test = [
     "inspect_evals[dev]",
     "inspect_evals[your_evaluation]"
   ]
   ```

   Pin the package version if your evaluation might be sensitive to version updates.
3. Make sure your module does not import your custom dependencies at module import time. This breaks the way `inspect_ai` imports tasks from the registry (see the sketch after this list).

   To check for this, `pip uninstall` your optional dependencies and run another eval with `--limit=0`. If it fails due to one of your imports, you need to change how you are importing that dependency.

4. Add pytest tests covering the custom functions included in your evaluation. See the Testing Standards section for more details.
5. Run `pytest` to ensure all new and existing tests pass.

   To install test dependencies:

   ```bash
   pip install -e ".[test]"
   ```

6. Run linting, formatting, and type checking and address any issues:

   ```bash
   make check
   ```
   Note that we automatically run the `make check` commands on all incoming PRs, so be sure your contribution passes those before opening a PR.

7. If an evaluation is based on an existing paper or ported from another evaluation suite, you should validate that it roughly matches the results reported there before submitting your PR. This is to ensure that your new implementation faithfully reproduces the original evaluation. See the Evaluation Report section for more details.

   Comparison results between existing implementations and your implementation should be included in the `README.md` file and in the pull request description.
8. Add your evaluation to `src/inspect_evals/listing.yaml` to include it in the documentation (which is automatically generated and deployed by GitHub Actions after merge). See the Listing.yaml Reference section below for a complete example and field documentation.

   Don't forget to add any custom dependencies to `pyproject.toml` if applicable.

9. Run `tools/listing.py` to format your evaluation's README file according to our standards.
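One common way to satisfy the import check in step 3 is to defer the import of the optional dependency into the function that uses it. A minimal sketch, assuming a hypothetical helper module that needs `pandas`:

```python
# src/inspect_evals/your_evaluation/utils.py (hypothetical helper)
from typing import Any


def normalize_records(rows: list[dict[str, Any]]) -> list[dict[str, Any]]:
    """Normalise raw dataset rows before converting them to Samples."""
    # Deferred import: pandas is only needed when this function runs, so
    # importing the module (e.g. during task registry discovery) works even
    # when the optional dependency is not installed.
    import pandas as pd

    frame = pd.DataFrame(rows).fillna("")
    return frame.to_dict(orient="records")
```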
You can expect to receive a PR review within a couple of days.
Code Quality Standards
- Write the code to be read by other developers with no prior knowledge of the eval - it should be possible to understand the code without reading the paper
- Follow existing project structure and naming conventions
- Include type hints for all functions
- Document complex logic with comments
- Add docstrings following the Google style guide (see the example below)
- Ensure all tests pass before submission
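For example, a helper with type hints and a Google-style docstring might look like this (the function and its behaviour are purely illustrative):

```python
def normalize_answer(answer: str, strip_punctuation: bool = True) -> str:
    """Normalize a model answer before comparing it with the target.

    Args:
        answer: Raw answer text extracted from the model output.
        strip_punctuation: Whether to remove trailing punctuation.

    Returns:
        The normalized answer string.
    """
    answer = answer.strip().lower()
    if strip_punctuation:
        answer = answer.rstrip(".!?")
    return answer
```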
Inclusion of third-party code
If there is an official implementation of the eval, include a link to it in the README. It is permissible and encouraged to utilise code from the official implementation where possible. This both reduces the amount of code you need to write and maintains consistency with the original implementation.
If you are able to use a significant amount of code from the official implementation, it can be added as a dependency.
- If the official implementation has been released as a package on PyPI, you should add the package as a dependency in the `pyproject.toml` file, and import functions and classes from it (e.g. `swe_bench` in `pyproject.toml`).
- If the official implementation has not been released on PyPI but is structured as a package, the GitHub repo can be added to the `pyproject.toml` file as a dependency (e.g. `ifeval` in `pyproject.toml`).
- If the official implementation is not structured as a package, you should copy the code into your evaluation and include a comment with a link to the original implementation (e.g. the scorer code in `sciknoweval`). Consider opening a PR to the original implementation to structure it as a package.
- If there is only a small amount of code that you are able to use from the official implementation, it is permissible to copy the code into your evaluation and include a comment with a link to the original implementation (e.g. the annotate solver in `air_bench`).
Code from inspect_ai
Code from `inspect_ai` should be utilised where possible, concentrating on public components such as the built-in dataset functions, solvers, scorers, and metrics.

Internally, `inspect_ai` uses non-public functions and modules as building blocks for its public components. These are indicated by leading underscores in the package and function names, such as `inspect_ai.scorer._reducer` and `_compute_dict_stat`. These aren't guaranteed to be stable and do not go through a deprecation process if they change.
There’s no hard-and-fast rule against using them, but it would mean that the results of the benchmark could silently change in response to an internal change in the framework. If making use of non-public components, it is recommended to have tests in place that exhaustively confirm the functionality of the components you build with them, as this would give an early signal if the components changed in a way that affected your evaluation.
Alternatively, copy the implementations of the non-public components you need into your evaluation, and include a comment with a link to the original implementation (e.g. the helper functions in the `zerobench` reducer).
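If you do build on (or copy) non-public components, a small pinning test like the sketch below gives that early signal. The module path and `median_score_reducer` helper are hypothetical:

```python
# tests/your_eval/test_reducer.py (hypothetical module and helper names)
from inspect_ai.scorer import Score

from inspect_evals.your_eval.reducer import median_score_reducer  # hypothetical helper


def test_median_score_reducer_pins_behaviour():
    # If an upstream change alters the behaviour this helper relies on,
    # the assertion fails and flags it before benchmark results drift.
    scores = [Score(value=0.0), Score(value=1.0), Score(value=1.0)]
    assert median_score_reducer(scores).value == 1.0
```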
Solvers, Scorers and Metrics
[This section to be expanded with guidelines and examples]
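As a starting point, a minimal custom scorer built from `inspect_ai`'s public components looks roughly like the sketch below; the scorer name and matching logic are illustrative rather than project guidance.

```python
from inspect_ai.scorer import CORRECT, INCORRECT, Score, Scorer, Target, accuracy, scorer
from inspect_ai.solver import TaskState


@scorer(metrics=[accuracy()])
def exact_answer() -> Scorer:
    """Minimal custom scorer: exact string match against the target."""

    async def score(state: TaskState, target: Target) -> Score:
        answer = state.output.completion.strip()
        return Score(
            value=CORRECT if answer == target.text else INCORRECT,
            answer=answer,
        )

    return score
```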
Testing Standards
⚠️ If your evaluation is not adequately tested, it will not be accepted. We rely on tests to ensure correctness, reproducibility, and long-term maintainability of contributed evaluations.
Look in the `tests` directory for examples.
Unit Tests
Include unit tests that cover all non-trivial custom functions. This will often include (but is not limited to):
- Solver, scorer and dataset functions
- Custom tools
- Custom utils or functions
Include edge cases in your test coverage. Test error conditions and invalid inputs. Unit tests should also support your end-to-end tests by providing coverage for any code you mock the results of. Create a test that demonstrates your `record_to_sample` function with an actual example from the dataset. This serves as both a test and documentation of the expected behavior.
Example:
```python
from inspect_ai.dataset import Sample

from inspect_evals.your_eval import record_to_sample  # adjust to your module


def test_record_to_sample():
    record = {"question": "What is 2+2?", "answer": "4"}  # Example from actual dataset, showing all fields
    expected = Sample(input="What is 2+2?", target="4")
    assert record_to_sample(record) == expected
```
End-to-end Tests
It is also required that you implement at least one end-to-end test for your evaluation. This should be a test that demonstrates the full evaluation pipeline, including the solver and scorer. It should use `mockllm/model` as the model, which will return a default output for each sample. See mockllm for more details, including how to provide custom outputs.
If your eval has multiple tasks or a task with multiple variants, you should implement one success end-to-end test and one error handling end-to-end test for each task/variant.
If the test triggers the download of a dataset, mark it with `@pytest.mark.dataset_download`.
Examples:
```python
from inspect_ai import eval
from inspect_ai.model import ModelOutput, get_model

from inspect_evals.your_eval import your_eval_task


def test_end_to_end_your_eval_with_default_mock_responses():
    """
    This test confirms that the evaluation pipeline works end-to-end with the default mock responses.
    Note that all responses will presumably be incorrect, but the test should surface any issues with the evaluation pipeline.
    """
    [log] = eval(
        tasks=your_eval_task(
            task_parameter="task_parameter_value",
        ),
        sample_id="sample_id_1",
        model="mockllm/model",
    )
    assert log.status == "success"
    assert "accuracy" in log.results.scores[0].metrics
    # more asserts here


def test_end_to_end_your_eval_with_custom_mock_responses():
    """
    This test confirms that the evaluation pipeline works end-to-end with custom mock responses.
    This allows you to check the solver and metrics are working as expected.
    """
    [log] = eval(
        tasks=your_eval_task(
            task_parameter="task_parameter_value",
        ),
        sample_id="sample_id_1",
        model=get_model(
            "mockllm/model",
            custom_outputs=[
                ModelOutput.from_content(
                    model="mockllm/model",
                    content="ANSWER: A",  # correct answer
                ),
            ],
        ),
    )
    assert log.status == "success"
    assert "accuracy" in log.results.scores[0].metrics
    assert log.results.scores[0].metrics["accuracy"].value == 1.0  # all correct
```
Manual testing
- Use a fast and cheap model during development and for initial testing. `openai/gpt-4o-mini` is a good choice.
- Test with small subsets before running on full datasets
- Start with a few representative examples
- Gradually increase the test set size
- Verify that your implementation matches the original evaluation’s methodology
- Compare results with reference implementations if available
- Document any discrepancies and their causes
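For instance, a quick development run over a small subset might look like the following (the task name and sample limit are illustrative); the equivalent CLI invocation uses the `--model` and `--limit` flags.

```python
from inspect_ai import eval

# Run only the first 10 samples against a fast, cheap model during development.
logs = eval(
    "inspect_evals/my_task",  # illustrative task name
    model="openai/gpt-4o-mini",
    limit=10,
)
```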
Guidelines for specific test scenarios
[This section to be expanded]
- Mocking (what should be mocked and when)
- Sandboxes (how to mock interactions)
- Logs (cleanup files after tests)
Data Quality
- If you discover a broken record in the dataset:
  - Raise an issue in the upstream source (e.g., Hugging Face dataset or original repository)
  - If necessary, filter the dataset to exclude the broken record
  - Document the exclusion in your code and PR description
  - Example: `_sample_is_not_known_duplicate` in the `drop` evaluation
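A minimal sketch of such an exclusion, assuming hypothetical dataset and record identifiers:

```python
from inspect_ai.dataset import Dataset, FieldSpec, Sample, hf_dataset

# Identifier of the record confirmed broken upstream (hypothetical; link the issue here).
KNOWN_BROKEN_IDS = {"example-broken-record-id"}


def _sample_is_not_broken(sample: Sample) -> bool:
    return sample.id not in KNOWN_BROKEN_IDS


def load_dataset() -> Dataset:
    return hf_dataset(
        path="example-org/example-dataset",  # illustrative dataset path
        split="test",
        sample_fields=FieldSpec(input="question", target="answer", id="id"),
    ).filter(_sample_is_not_broken)
```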
Evaluation Report Guidelines
Overview
The evaluation report is a brief summary of results for your evaluation implementation compared against a standard set of existing results. We use your evaluation report to help validate that your implementation has accurately replicated the design of your eval into the Inspect framework.
Before completing your evaluation report
We expect that you will have:
- Completed small runs of samples with different models. You should also verify these results and their evaluation logs to be confident that:
  - The samples can run end to end
  - The metrics are calculated correctly and reported as expected
  - The evaluation logs have an error-free flow and a valid set of responses
  - The evaluation logs do not contain unexpected failure cases, e.g. incorrect responses due to strict mismatches, failing runs, or hitting limits when not expected
  - All subsets of the dataset pass these checks if they have notable differences in the dataset, or in their solver or scorer implementation
Comparing your results
Your implementation needs to be compared against the original paper or an existing reputable leaderboard result. If your benchmark does not have published results to compare against, you should raise this in your pull request.
We recommend producing results using at least two models, ideally with the full dataset. Depending on the evaluation implementation and choice of model, a full run comparison might be quite expensive (~$100 USD or more), so it is important that you calculate the costs for your runs first. If the cost is high, we recommend using a representative subset (20-30% of the total dataset).
If you have completed an implementation but are unable to generate your report due to a cost barrier (e.g. you are unable to afford it, or the cost is still too high even on a subset), you can clearly state this in your PR and submit without the report.
Reporting your results
You should include your evaluation report in the README.md in a section titled Evaluation Report.
Your report needs to include the following:
- Implementation Details
- Any deviations from reference implementations
- Known limitations or edge cases
- Results
- The results of your eval report runs and the comparison results, ideally in a table format
- Performance comparison with original paper or existing reputable leaderboard results
- Analysis of findings and any notable observations
- Reproducibility Information
- The number of samples run, out of the total number of samples in the dataset
- A timestamp of when the results were produced (a full timestamp, or just month and year, is enough)
- Clear instructions for replication
- Environment details (model versions, key dependencies)
- Random seed information where applicable
- Justification for choice of model temperature
The livebench evaluation report is a good example.
Listing.yaml Reference
The `listing.yaml` file serves as a central registry of all evaluations available in the repository. It's used to generate documentation and organize evaluations into categories.
Required Fields for Each Evaluation
- `title`: The display name of the evaluation (e.g., "HumanEval: Python Function Generation from Instructions")
- `description`: A brief description of what the evaluation measures, usually adapted from the abstract of the paper
- `path`: The filesystem path to the evaluation directory (e.g., "src/inspect_evals/humaneval")
- `arxiv`: Link to the paper or documentation (preferably an arXiv link when available)
- `group`: The category this evaluation belongs to (e.g., "Coding", "Cybersecurity", "Mathematics")
- `contributors`: List of GitHub usernames who contributed to this evaluation
- `tasks`: List of task configurations with:
  - `name`: The task identifier used in the CLI
  - `dataset_samples`: Number of samples in the dataset
Optional Fields
- `dependency`: Name of the optional dependency from `pyproject.toml` if the evaluation requires additional packages
- `tags`: List of documentation-related tags used for categorization in the documentation. These should only be used for documentation purposes, not for system configuration. Valid tags include:
  - `"Agent"`: For evaluations involving agents that can take multiple actions or use tools
  - `"Multimodal"`: For evaluations involving multiple input modalities (e.g., text and images)

  Omit this field if no documentation tags apply. Do not use it for system/configuration information.
- `metadata`: (Optional) Object containing system/configuration information. All fields are optional. May include:
  - `sandbox`: List of components that use a sandboxed environment. Can include `"solver"` and/or `"scorer"`. Omit if no sandbox is used.
  - `requires_internet`: Set to `true` only if the evaluation requires internet access. Omit if false.
  - `environment`: Special environment requirements (e.g., `"Kali-Linux"`). Only include if a non-standard environment is needed.
  Example:

  ```yaml
  metadata:
    sandbox: ["solver", "scorer"]  # Both solver and scorer use sandbox
    requires_internet: true
    environment: "Kali-Linux"  # Only include if special environment is needed
  ```

  Omit the entire `metadata` field if none of its subfields are needed.
- `human_baseline`: Optional field for evaluations with known human performance metrics

  ```yaml
  human_baseline:
    metric: accuracy  # The metric used (accuracy, F1, etc.)
    score: 0.875  # The human performance score
    source: https://arxiv.org/abs/0000.00000  # Source of the baseline
  ```
Example Entry
```yaml
- title: "Example-Bench: Your Amazing AI Benchmark"
  description: |
    A detailed description of what this benchmark evaluates and why it's important.
    This can span multiple lines using YAML's pipe (|) syntax.
  path: src/inspect_evals/your_evaluation
  arxiv: https://arxiv.org/abs/1234.12345
  group: Coding
  contributors: ["your-github-handle"]
  tasks:
    - name: task-name
      dataset_samples: 100
  dependency: "your_evaluation"  # Optional
  tags: ["Agent"]  # Optional
  metadata:  # optional metadata documenting eval information. All fields optional
    sandbox: []  # list of eval aspects that use a sandbox, can include "solver" and/or "scorer"
    requires_internet: true  # boolean indicating whether the eval requires internet access
    environment: "Kali-Linux"  # optional environment information
```
Best Practices
- Keep descriptions concise but informative
- Use the pipe (|) syntax for multi-line descriptions
- Always include the arXiv link when available
- List all contributors using their GitHub usernames
- Keep the `group` consistent with existing categories
- For agent-based evaluations, include the "Agent" tag in the `tags` field
- Use the `metadata` field for all system/configuration information
- Omit empty `tags` and `metadata` fields
- Update the listing when adding new tasks or making significant changes to existing ones
Examples
Here are some existing evaluations that serve as good examples of what is required in a new submission: