Evaluation Checklist
This checklist covers the requirements for producing a high-quality evaluation. It is designed for full evaluations, and it is too heavyweight for most PRs.
Using agents to help with coding and to complete agentic workflows is recommended but not mandatory.
Master Checklist
Contributors: ensure each step is done. Reviewers: verify each step has been done.
We recommend the use of agent workflows. Our agent workflows are optimised for use with Claude Code (setup instructions).
Human-Required Checks
Best Practices
Evaluation Report
Evaluation Report Guidelines
We recommend using the Evaluation Report workflow to assist with this step. To verify the error rate in the logs, run the Trajectory Analysis Workflow on the log files produced by the evaluation report.
Evaluation Report Notes
Agent Runnable Checks
The following items can be checked by an LLM agent with access to the codebase. These checks require reading code and comparing it against conventions, but they do not require running the evaluation or any external context beyond the repository. If you are running the Review An Evaluation Workflow, you do not need to read these checks directly; they are here as a reference in case of errors.
Code Quality (Agent)
Unit Tests (Agent)
End-to-End Tests (Agent)
Apply Pytest Marks (Agent)
Best Practices - Task Design (Agent)
Best Practices - Control Flow (Agent)
Best Practices - Datasets and Variants (Agent)
Best Practices - Scoring and Metrics (Agent)
Best Practices - Documentation, Environments, and Tooling (Agent)
Evaluation Report (Agent)
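The "Apply Pytest Marks (Agent)" check above refers to separating fast unit tests from slow end-to-end tests. As a minimal sketch (the `parse_score` helper and the `slow` mark name are hypothetical, not part of this checklist's conventions), marks can be applied like this:

```python
import pytest

def parse_score(raw: str) -> float:
    """Parse a scorer's raw string output into a float (hypothetical helper)."""
    return float(raw.strip())

@pytest.mark.parametrize("raw, expected", [("0.5", 0.5), (" 1 ", 1.0)])
def test_parse_score(raw, expected):
    # Fast unit test: unmarked, so it runs by default.
    assert parse_score(raw) == expected

@pytest.mark.slow
def test_full_evaluation_end_to_end():
    # Slow end-to-end test: deselect with `pytest -m "not slow"`.
    # Custom marks like `slow` should be registered (e.g. under
    # [tool.pytest.ini_options] markers in pyproject.toml) to avoid warnings.
    ...
```

Reviewers can then run only the fast checks with `pytest -m "not slow"`, and the full suite without the filter.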
This checklist is in beta
We will continue to improve this checklist over time, and we aim to make it our standard approach for both contributors and reviewers by the end of Q1 2026. Feedback is very welcome: reach out via this Google Form, or open an issue/PR.