Evaluation Checklist
This checklist covers the requirements for producing a high-quality evaluation. It is designed for full evaluations, and it is too heavyweight for most PRs.
Using agents to help with coding and to complete agentic workflows is recommended but not mandatory.
Master Checklist
This checklist must be followed when submitting new evaluations. The following artefacts must be uploaded as part of a new evaluation:
- `agent_artefacts/<eval_name>/review/SUMMARY.md` — overall quality summary from the evaluation review
- `agent_artefacts/trajectory_analysis/<eval_name>/<eval_name>_<model_name>_ANALYSIS.md` — trajectory analysis results. Only the final trajectory analysis from the Evaluation Report results is required, so there should be multiple files, one per model. If the evaluation report includes more than 100 samples, trajectory analysis may be capped at a random sample of 100.
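To make the naming scheme concrete, the required paths can be sketched as below. The eval and model names are placeholders, not real values:

```python
# Sketch of how the required artefact paths are formed.
# "my_eval" and the model names are illustrative placeholders.
eval_name = "my_eval"
model_names = ["model_a", "model_b"]

# One review summary per evaluation.
review_summary = f"agent_artefacts/{eval_name}/review/SUMMARY.md"

# One trajectory analysis file per model in the evaluation report.
trajectory_analyses = [
    f"agent_artefacts/trajectory_analysis/{eval_name}/{eval_name}_{name}_ANALYSIS.md"
    for name in model_names
]
```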
We recommend using agent workflows. Our agent workflows are optimised for use with Claude Code (setup instructions). If you do not wish to use agent workflows, you can create the artefacts above yourself, but this is a lot of work and we don't recommend doing it unassisted. Spot-checking and modifying the agent's work, however, is both accepted and encouraged.
Human Ordering
At this point a human maintainer will go over the code within a few days, and possibly suggest fixes. After the maintainer is satisfied with the code’s quality and correctness, we’ll merge your new evaluation into Inspect Evals!
Agent Ordering
You should assume you have a user in the loop unless you are specifically told otherwise. A well-designed autonomous workflow will make it clear to you that no human is available.
Manually Examined Checks
Best Practices
Evaluation Report
Evaluation Report Guidelines
We recommend using the Evaluation Report workflow (/eval-report-workflow) to assist with this step. To verify the error rate of logs, use the Trajectory Analysis workflow (/check-trajectories-workflow) on the log files produced while generating the evaluation report.
Evaluation Report Notes
Agent Runnable Checks
The following items can be checked by an LLM agent with access to the codebase. These checks require reading code and comparing it against conventions, but do not require running the evaluation or any external context beyond the repository. If you are running the Review An Evaluation workflow (/eval-quality-workflow), you don't need to read these checks; they are here as a reference in case of errors.
Code Quality (Agent)
Unit Tests (Agent)
End-to-End Tests (Agent)
Apply Pytest Marks (Agent)
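As a rough illustration of what this check looks for, pytest marks are applied with the `@pytest.mark.<name>` decorator. The mark name below (`slow`) is hypothetical; the real names are whatever the repository registers (e.g. in its pyproject.toml):

```python
import pytest

# Hypothetical mark name for illustration only; use the marks the
# repository actually registers so pytest can select or skip them.
@pytest.mark.slow
def test_full_dataset_loads():
    assert True  # stand-in for a slow end-to-end check
```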
Best Practices - Task Design (Agent)
Best Practices - Control Flow (Agent)
Best Practices - Datasets and Variants (Agent)
Best Practices - Scoring and Metrics (Agent)
Best Practices - Documentation, Environments, and Tooling (Agent)
Licensing and Attribution (Agent)
If any code has been copied or adapted from an external source (e.g., a reference implementation, paper repository, or another library), the following must be present:
To check: search the evaluation's source files for comments like `adapted from`, `ported from`, `copied from`, `based on`, or obvious structural similarity to a known reference implementation. Cross-reference any such files against the NOTICE file.
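The search step can be sketched in Python. This is a minimal illustration, not the workflow's actual implementation; the phrase list simply mirrors the examples above:

```python
import re
from pathlib import Path

# Phrases that signal copied or adapted code; matching is case-insensitive.
ATTRIBUTION = re.compile(r"adapted from|ported from|copied from|based on", re.IGNORECASE)

def find_attribution_markers(root: Path) -> list[tuple[str, int, str]]:
    """Return (file, line number, line text) for lines with attribution phrases."""
    hits = []
    for path in sorted(root.rglob("*.py")):
        for lineno, line in enumerate(path.read_text().splitlines(), start=1):
            if ATTRIBUTION.search(line):
                hits.append((str(path), lineno, line.strip()))
    return hits
```

Any file reported by such a scan should have a corresponding entry in the NOTICE file.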
Evaluation Report (Agent)
This checklist is a living document
Feedback on this checklist is very welcome: you may reach out via this Google Form, or open an issue/PR.