# Contribution Guide
We welcome new evaluations and improvements to existing evaluations. You can contribute by installing the development version of `inspect_evals`, adding your evaluation, and opening a pull request in this repo.
## What types of evaluations are we looking for?
We are currently prioritizing evaluations that are:

- Well-established in the research community, with usage or citations in papers or benchmarks.
- Published by credible sources, such as academic papers, industry benchmarks, or evaluation platforms.
- Clearly scoped, with a defined dataset, task structure, and scoring methodology.
- Verifiable, meaning they include reference implementations or clearly reported baseline results that can be replicated.
## Development
To develop a new evaluation:
1. Clone the repository and install the package with the `-e` flag and `[dev]` optional dependencies:

   ```bash
   git clone https://github.com/UKGovernmentBEIS/inspect_evals.git
   cd inspect_evals
   pip install -e ".[dev]"
   ```
2. Create a sub-directory in `src/inspect_evals` for your evaluation and add your evaluation task and related code. Note: do not pass a `name` parameter to the task, as this is only used for dynamically created tasks (i.e. tasks that are not addressable on the filesystem or in a package).

3. Create an `__init__.py` file in the directory and use it to export your task and other useful functions.
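   As a point of reference, a minimal task module and its export might look like the following sketch. All names here (`your_evaluation`, `your_task`, the dataset id, and the record fields) are hypothetical placeholders, not a prescribed layout:

   ```python
   # src/inspect_evals/your_evaluation/your_evaluation.py (hypothetical layout)
   from typing import Any

   from inspect_ai import Task, task
   from inspect_ai.dataset import Sample, hf_dataset
   from inspect_ai.scorer import choice
   from inspect_ai.solver import multiple_choice


   def record_to_sample(record: dict[str, Any]) -> Sample:
       """Map one raw dataset record onto an Inspect Sample."""
       return Sample(
           input=record["question"],
           choices=record["choices"],
           target=record["answer"],
       )


   @task
   def your_task() -> Task:
       # Note: no `name` parameter is passed to Task (see the note above)
       return Task(
           dataset=hf_dataset(
               path="org/your-dataset",  # hypothetical dataset id
               split="test",
               sample_fields=record_to_sample,
           ),
           solver=multiple_choice(),
           scorer=choice(),
       )
   ```

   And the corresponding export:

   ```python
   # src/inspect_evals/your_evaluation/__init__.py
   from inspect_evals.your_evaluation.your_evaluation import your_task

   __all__ = ["your_task"]
   ```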
4. Register your task(s) for execution by the `inspect eval` CLI command by importing them into `src/inspect_evals/_registry.py` (see the sketch after this list).

5. Confirm that your evaluation is properly registered by running it:

   ```bash
   inspect eval inspect_evals/<my-task>
   ```

   Take note of the total number of samples in the dataset.
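For step 4, registration in `_registry.py` is typically just an import of the exported task; a sketch using the hypothetical names from above:

```python
# src/inspect_evals/_registry.py
# add your import alongside the existing ones (hypothetical names)
from inspect_evals.your_evaluation import your_task
```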
## Submission
To prepare an evaluation for submission as a pull request:
1. Create a `README.md` file in your sub-directory following the template used by other evaluations.

2. Add custom dependencies in the `pyproject.toml` file. Don’t forget to add your evaluation to the `[test]` dependencies:

   ```toml
   [project.optional-dependencies]
   worldsense = ["pandas"]
   your_evaluation = ["example-python-package"]

   test = [
       "inspect_evals[dev]",
       "inspect_evals[your_evaluation]"
   ]
   ```

   Pin the package version if your evaluation might be sensitive to version updates.
3. Make sure your module does not import your custom dependencies when the module is imported. This breaks the way `inspect_ai` imports tasks from the registry.
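   One common way to achieve this is to defer the import into the function that needs it. A minimal sketch, assuming a hypothetical optional dependency on `pandas` and a hypothetical CSV schema:

   ```python
   from inspect_ai.dataset import Sample


   def load_samples(csv_path: str) -> list[Sample]:
       # Deferred import: pandas is only loaded when this function runs,
       # so merely importing the module (as registry discovery does)
       # succeeds even when the optional dependency is not installed.
       import pandas as pd

       df = pd.read_csv(csv_path)
       return [
           Sample(input=row["question"], target=row["answer"])
           for _, row in df.iterrows()
       ]
   ```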
4. Add pytest tests covering the custom functions included in your evaluation. Demonstrating an example input and output of your `record_to_sample` function is fantastic documentation (see the sketch below). Look in the `tests` directory for examples. Coverage will often include (but is not limited to):

   - Solver, scorer, and dataset functions
   - Custom tools
   - Custom utils or functions

   It is also recommended that you implement an end-to-end test for your evaluation.
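   A minimal sketch of such a test, reusing the hypothetical `record_to_sample` from the Development sketch above:

   ```python
   # tests/your_evaluation/test_your_evaluation.py (hypothetical path)
   from inspect_evals.your_evaluation.your_evaluation import record_to_sample


   def test_record_to_sample():
       # example raw record mirroring the (hypothetical) dataset schema
       record = {
           "question": "What is 2 + 2?",
           "choices": ["3", "4", "5"],
           "answer": "B",
       }
       sample = record_to_sample(record)
       assert sample.input == "What is 2 + 2?"
       assert sample.choices == ["3", "4", "5"]
       assert sample.target == "B"
   ```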
5. Run `pytest` to ensure all new and existing tests pass. To install test dependencies:

   ```bash
   pip install -e ".[test]"
   ```
6. Run linting, formatting, and type checking, and address any issues:

   ```bash
   make check
   ```
   Note that we automatically run the `make check` commands on all incoming PRs, so be sure your contribution passes those before opening a PR.

7. If an evaluation is based on an existing paper or ported from another evaluation suite, you should validate that it roughly matches the results reported there before submitting your PR. This ensures that your new implementation faithfully reproduces the original evaluation. See the Evaluation Report section for more details.

   Comparison results between existing implementations and your implementation should be included in the README.md file and in the pull request description.
8. Add your evaluation to `src/inspect_evals/listing.yaml` to include it in the documentation (which is automatically generated and deployed by GitHub Actions after merge):
   ```yaml
   - title: "Example-Bench: Your Amazing AI Benchmark"
     description: Describe what the benchmark is about here.
     path: src/inspect_evals/your_evaluation
     arxiv: https://arxiv.org/abs/1234.12345
     group: Coding
     contributors: ["your-github-handle"]
     tasks:
       - name: task-name
         dataset_samples: 365 # number of samples in the dataset used for this task
         human_baseline: # optional field for when a human baseline is available for the task
           metric: accuracy
           score: 0.875
           source: https://arxiv.org/abs/0000.00000
     dependency: "your_evaluation" # optional field for custom dependency from pyproject.toml
     tags: ["Agent"] # optional tag for agentic evaluations
   ```
   Don’t forget to add the custom dependency from `pyproject.toml` if applicable.

9. Run `tools/listing.py` to format your evaluation’s README file according to our standards.
You can expect to receive a PR review within a couple of days.
## Evaluation Report Guidelines
### Overview
The evaluation report is a brief summary of results for your evaluation implementation, compared against a standard set of existing results. We use your evaluation report to help validate that your implementation accurately replicates the design of the original evaluation in the Inspect framework.
### Before completing your evaluation report
We expect that you will have:

- Completed small runs of samples with different models. You should also verify these results and their evaluation logs to be confident that:
  - The samples can run end to end
  - The metrics are calculated correctly and reported as expected
  - The evaluation logs have an error-free flow and a valid set of responses
  - The evaluation logs do not contain unexpected failure cases, e.g. incorrect responses due to strict mismatches, failing runs, or hitting limits when not expected
  - All subsets of the dataset pass these checks if they have notable differences in the dataset, or in their solver or scorer implementation
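For the small verification runs, Inspect’s Python API is convenient; a sketch, where the task placeholder and model name are examples only (the `--limit` and `--model` options of the `inspect eval` CLI achieve the same thing):

```python
from inspect_ai import eval

# Run a small subset end to end, then review the logs (e.g. with `inspect view`)
logs = eval(
    "inspect_evals/<my-task>",
    model="openai/gpt-4o-mini",  # example model; repeat with a second model
    limit=10,  # small sample count for a quick check
)
```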
### Comparing your results
Your implementation needs to be compared against the original paper or an existing reputable leaderboard result. If your benchmark does not have published results to compare against, you should raise this in your pull request.
We recommend producing results using at least two models, ideally with the full dataset. Depending on the evaluation implementation and choice of model, a full-run comparison might be quite expensive (~$100 USD or more), so it is important that you calculate the costs for your runs first. If the cost is high, we recommend using a representative subset (20-30% of the total dataset).
If you have completed an implementation but are unable to generate your report due to any kind of cost barrier (e.g. unable to afford the runs, or the cost being too high even on a subset), you can clearly state this in your PR and submit without it.
### Reporting your results
You should include your evaluation report in the README.md, in a section titled Evaluation Report. Your report needs to include the following:

- The results of your eval report runs and the comparison results, ideally in a table format
- The total samples run out of how many total samples for the dataset
- A timestamp of when the results were produced (a full timestamp, or month and year, is enough)
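As an illustration of the format only (every number below is a placeholder, not a real result), a report table might look like:

| Model | This implementation | Reported result | Samples run | Date |
|---------|---------------------|-----------------|-------------|---------|
| model-a | 0.81 | 0.83 | 200/1,000 | 2025-01 |
| model-b | 0.74 | 0.75 | 200/1,000 | 2025-01 |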
An evaluation report example can be found here.
## Examples
Here are some existing evaluations that serve as good examples of what is required in a new submission: