Evaluation Checklist

This checklist covers everything you should need to produce a high-quality evaluation. It's designed for full evaluation submissions and is too heavyweight for most PRs.

Using agents to help with coding and to complete agentic workflows is recommended but not mandatory:

  1. Full list of agent workflows

  2. Our philosophy on agents

  3. Recommended permissions

Master Checklist

This checklist must be followed when submitting new evaluations. The following artefacts must be uploaded as part of a new evaluation (see the example layout after this list):

  • agent_artefacts/<eval_name>/review/SUMMARY.md — overall quality summary from the evaluation review
  • agent_artefacts/<eval_name>_<version>/validity/VALIDITY_REPORT.md — evaluation validity report assessing name accuracy, dataset validity, and scoring validity (the /eval-validity-review skill names this directory <eval_name>_<version>, e.g., gpqa_1_1_2)
  • agent_artefacts/trajectory_analysis/<eval_name>/<eval_name>_<model_name>_ANALYSIS.md — trajectory analysis results. Only the final trajectory analysis from the Evaluation Report results is required. This means there should be multiple files, one per model. If more than 100 samples are included in the evaluation report, it is permitted to cap trajectory analysis at a random sample of 100.
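For example, for a hypothetical evaluation named my_eval at version 1.0.0, run against two models, the uploaded artefacts might be laid out as follows (the eval name, version suffix, and model names are purely illustrative; use whatever the workflows actually generate):

```
agent_artefacts/
├── my_eval/
│   └── review/
│       └── SUMMARY.md
├── my_eval_1_0_0/
│   └── validity/
│       └── VALIDITY_REPORT.md
└── trajectory_analysis/
    └── my_eval/
        ├── my_eval_claude-3-7-sonnet_ANALYSIS.md
        └── my_eval_gpt-4o_ANALYSIS.md
```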

We recommend using the agent workflows. They are optimised for use with Claude Code (setup instructions). If you prefer not to use them, you can create the artefacts above yourself, but this is quite a lot of work and we don't suggest going it alone; spot-checking and modifying the agent's work is both accepted and encouraged.

Human Ordering

At this point a human maintainer will go over the code within a few days, and possibly suggest fixes. After the maintainer is satisfied with the code’s quality and correctness, we’ll merge your new evaluation into Inspect Evals!

Agent Ordering

You should assume you have a user in the loop unless you are specifically told otherwise. A well-designed autonomous workflow will make it clear to you that no human is available.

Evaluation Validity

These checks assess whether the evaluation measures what it claims to measure. We recommend using the Evaluation Validity Review workflow (/eval-validity-review) to assist with these checks.

Name Validity

Dataset Validity

Scoring Validity

Manually Examined Checks

Best Practices

Evaluation Report

Evaluation Report Guidelines

We recommend using the Evaluation Report workflow (/eval-report-workflow) to assist in this step. To verify the error rate of the logs, use the Trajectory Analysis workflow (/check-trajectories-workflow) on the log files produced while generating the evaluation report.
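As a rough sketch of where those log files come from (the eval name, model string, and CLI flags below are assumptions to check against the Inspect documentation, not prescribed values):

```
# Run the evaluation to generate the logs used for the evaluation report
inspect eval inspect_evals/my_eval --model anthropic/claude-3-7-sonnet-latest --log-dir logs/my_eval

# Then, in a Claude Code session, run the trajectory analysis workflow
# (/check-trajectories-workflow) and point it at the logs/my_eval directory.
```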

Evaluation Report Notes


Agent Runnable Checks

The following items can be checked by an LLM agent with access to the codebase. These checks require reading code and comparing it against conventions, but they do not require running the evaluation or any external context beyond the repository. If you are running the Review An Evaluation workflow (/eval-quality-workflow), you don't need to read these checks; they are here as a reference in case of errors.

Code Quality (Agent)

Unit Tests (Agent)

End-to-End Tests (Agent)

Apply Pytest Marks (Agent)

Best Practices - Task Design (Agent)

Best Practices - Control Flow (Agent)

Best Practices - Datasets and Variants (Agent)

Best Practices - Scoring and Metrics (Agent)

Best Practices - Documentation, Environments, and Tooling (Agent)

Licensing and Attribution (Agent)

If any code has been copied or adapted from an external source (e.g., a reference implementation, paper repository, or another library), the following must be present:

To check: search the evaluation’s source files for comments like adapted from, ported from, copied from, based on, or obvious structural similarity to a known reference implementation. Cross-reference any such files against the NOTICE file.
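A minimal shell sketch of this search (the eval name is illustrative):

```
# Look for attribution markers in a hypothetical eval's source files
grep -rniE "adapted from|ported from|copied from|based on" src/inspect_evals/my_eval/

# Cross-reference any flagged files against the NOTICE file
grep -ni "my_eval" NOTICE
```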

Evaluation Report (Agent)

Infrastructure Changes (Agent)

This check applies to all PRs, not just eval submissions. If the PR modifies any of the following high-impact files, verify that the relevant documentation was updated:

High-impact files — changes here often require a docs update:

  • src/inspect_evals/metadata.py — changes to the eval.yaml schema (new/removed/required fields) should be reflected in the eval.yaml Reference section of CONTRIBUTING.md
  • pyproject.toml — new required dependencies or changed tooling should be documented
  • .claude/skills/ — new or updated agent workflows should be reflected in AGENTS.md
  • tools/ or Makefile — changed contributor tooling should be documented
  • .github/workflows/ — changes to CI that affect contributors (new required checks, changed test environments) should be documented

To check (a shell sketch of these steps follows the list):

  1. Run git diff --name-only $(git merge-base HEAD origin/main) to see what files changed in this PR.
  2. If any high-impact files are in the diff, check whether CONTRIBUTING.md, EVALUATION_CHECKLIST.md, or AGENTS.md were also updated.
  3. If not, flag it.
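A minimal shell sketch of these steps (the sets of high-impact paths and documentation files simply mirror the bullets above):

```
# 1. Files changed in this PR relative to main
changed=$(git diff --name-only "$(git merge-base HEAD origin/main)")

# 2. If any high-impact file changed, check whether a docs file changed as well
if echo "$changed" | grep -qE '^(src/inspect_evals/metadata\.py|pyproject\.toml|\.claude/skills/|tools/|Makefile|\.github/workflows/)'; then
  # 3. If no docs file was touched, flag it
  echo "$changed" | grep -qE '^(CONTRIBUTING|EVALUATION_CHECKLIST|AGENTS)\.md$' \
    || echo "High-impact files changed without a docs update: flag for review."
fi
```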

This checklist is a living document

Feedback on this checklist is very welcome: you can reach out via this Google Form, or open an issue/PR.