Evaluation Checklist

This checklist covers everything you should need to produce a high-quality evaluation. It's designed for full evaluation submissions and is too heavyweight for most PRs.

Using agents to help with coding and to complete agentic workflows is recommended but not mandatory:

  1. Full list of agent workflows

  2. Our philosophy on agents

  3. Recommended permissions

Master Checklist

This checklist must be followed when submitting new evaluations. The following artefacts must be uploaded as part of a new evaluation (see the example layout after this list):

  • agent_artefacts/<eval_name>/review/SUMMARY.md — overall quality summary from the evaluation review
  • agent_artefacts/<eval_name>_<version>/validity/VALIDITY_REPORT.md — evaluation validity report assessing name accuracy, dataset validity, and scoring validity (the /eval-validity-review skill names this directory <eval_name>_<version>, e.g., gpqa_1_1_2)
  • agent_artefacts/trajectory_analysis/<eval_name>/<eval_name>_<model_name>_ANALYSIS.md — trajectory analysis results. Only the final trajectory analysis from the Evaluation Report results is required. This means there should be multiple files, one per model. If more than 100 samples are included in the evaluation report, it is permitted to cap trajectory analysis at a random sample of 100.
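For example, for a hypothetical evaluation named my_eval at version 1.0.0, run against two models, the uploaded artefacts might be laid out as follows (the eval name, version suffix, and model names are purely illustrative; use whatever the workflows actually generate):

```
agent_artefacts/
├── my_eval/
│   └── review/
│       └── SUMMARY.md
├── my_eval_1_0_0/
│   └── validity/
│       └── VALIDITY_REPORT.md
└── trajectory_analysis/
    └── my_eval/
        ├── my_eval_claude-3-7-sonnet_ANALYSIS.md
        └── my_eval_gpt-4o_ANALYSIS.md
```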

We recommend using the agent workflows. They are optimised for use with Claude Code (setup instructions). If you prefer not to use them, you can create the artefacts above yourself, but this is quite a lot of work and we don't suggest going it alone; spot-checking and modifying the agent's work is both accepted and encouraged.

Human Ordering

At this point a human maintainer will go over the code within a few days, and possibly suggest fixes. After the maintainer is satisfied with the code’s quality and correctness, we’ll merge your new evaluation into Inspect Evals!

Agent Ordering

You should assume you have a user in the loop unless you are specifically told otherwise. A well-designed autonomous workflow will make it clear to you that no human is available.

Evaluation Validity

These checks assess whether the evaluation measures what it claims to measure. We recommend using the Evaluation Validity Review workflow (/eval-validity-review) to assist with these checks.

Name Validity

Dataset Validity

Scoring Validity

Manually Examined Checks

Best Practices

Evaluation Report

Evaluation Report Guidelines

We recommend using the Evaluation Report workflow (/eval-report-workflow) to assist in this step. To verify the error rate of the logs, use the Trajectory Analysis workflow (/check-trajectories-workflow) on the log files produced while generating the evaluation report.
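As a rough sketch of where those log files come from (the eval name, model string, and CLI flags below are assumptions to check against the Inspect documentation, not prescribed values):

```
# Run the evaluation to generate the logs used for the evaluation report
inspect eval inspect_evals/my_eval --model anthropic/claude-3-7-sonnet-latest --log-dir logs/my_eval

# Then, in a Claude Code session, run the trajectory analysis workflow
# (/check-trajectories-workflow) and point it at the logs/my_eval directory.
```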

Evaluation Report Notes


Agent Runnable Checks

The following items can be checked by an LLM agent with access to the codebase. These checks require reading code and comparing it against conventions, but they do not require running the evaluation or any external context beyond the repository. If you are running the Review An Evaluation workflow (/eval-quality-workflow), you don't need to read these checks; they are here as a reference in case of errors.

Code Quality (Agent)

Unit Tests (Agent)

End-to-End Tests (Agent)

Apply Pytest Marks (Agent)

Best Practices - Task Design (Agent)

Best Practices - Control Flow (Agent)

Best Practices - Datasets and Variants (Agent)

Best Practices - Scoring and Metrics (Agent)

Best Practices - Documentation, Environments, and Tooling (Agent)

Licensing and Attribution (Agent)

If any code has been copied or adapted from an external source (e.g., a reference implementation, paper repository, or another library), the following must be present:

To check: search the evaluation’s source files for comments like adapted from, ported from, copied from, based on, or obvious structural similarity to a known reference implementation. Cross-reference any such files against the NOTICE file.
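A minimal shell sketch of this search (the eval name is illustrative):

```
# Look for attribution markers in a hypothetical eval's source files
grep -rniE "adapted from|ported from|copied from|based on" src/inspect_evals/my_eval/

# Cross-reference any flagged files against the NOTICE file
grep -ni "my_eval" NOTICE
```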

Evaluation Report (Agent)

Infrastructure Changes (Agent)

This check applies to all PRs, not just eval submissions. If the PR modifies any of the following high-impact files, verify that the relevant documentation was updated:

High-impact files — changes here often require a docs update:

  • src/inspect_evals/metadata.py — changes to the eval.yaml schema (new/removed/required fields) should be reflected in the eval.yaml Reference section of CONTRIBUTING.md
  • pyproject.toml — new required dependencies or changed tooling should be documented
  • .claude/skills/ — new or updated agent workflows should be reflected in AGENTS.md
  • tools/ or Makefile — changed contributor tooling should be documented
  • .github/workflows/ — changes to CI that affect contributors (new required checks, changed test environments) should be documented

To check (a shell sketch of these steps follows the list):

  1. Run git diff --name-only $(git merge-base HEAD origin/main) to see what files changed in this PR.
  2. If any high-impact files are in the diff, check whether CONTRIBUTING.md, EVALUATION_CHECKLIST.md, or AGENTS.md were also updated.
  3. If not, flag it.
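A minimal shell sketch of these steps (the sets of high-impact paths and documentation files simply mirror the bullets above):

```
# 1. Files changed in this PR relative to main
changed=$(git diff --name-only "$(git merge-base HEAD origin/main)")

# 2. If any high-impact file changed, check whether a docs file changed as well
if echo "$changed" | grep -qE '^(src/inspect_evals/metadata\.py|pyproject\.toml|\.claude/skills/|tools/|Makefile|\.github/workflows/)'; then
  # 3. If no docs file was touched, flag it
  echo "$changed" | grep -qE '^(CONTRIBUTING|EVALUATION_CHECKLIST|AGENTS)\.md$' \
    || echo "High-impact files changed without a docs update: flag for review."
fi
```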

This checklist is a living document

Feedback on this checklist is very welcome: you can reach out via this Google Form, or open an issue/PR.