Best Practices for Evaluations
A checklist of the most important of these can be found at EVALUATION_CHECKLIST.md
Rules of Thumb
Use these guidelines when implementing and reviewing evaluations. They focus on clarity, reproducibility, and minimal custom code while leveraging the Inspect framework wherever possible.
Scoring and metrics
Align scoring with the outcome
If the outcome is discrete or continuous (e.g., a numeric quantity), make the primary Score.value that numeric measure. Use [mean(), stderr()] as a minimal set of metrics. If the outcome is binary, use [accuracy(), stderr()] as a minimal set of metrics.
Binary = Score is either a success or a failure, with no in-between. (E.g, GPQA) Discrete = Score is done in a series of steps. (E.g, how many steps out of 5 were done successfully?) Continuous = Score is done with a metric with essentially infinite values. (E.g, NoveltyBench)
Normalize outputs before comparison (unless exact-match is required)
Trim whitespace/newlines and consider case/format normalization. If exact string match is required, document this clearly.
Avoid magic values; use framework constants
Compare against CORRECT/INCORRECT (and other provided constants) rather than literal strings that represent these values.
Record important run metadata
Err on the side of including dataset fields in sample.metadata - these can be useful for debugging and analysis.
Store additional information (e.g., interaction length, duration, configuration/prompt version, and exit reasons) in state.metadata for scoring and analysis.
Include potentially useful information in Score.metadata, in addition to providing Score.value as the primary outcome and Score.explanation if relevant.
Avoid storing reducible score values in Score.metadata
Inspect first aggregates samples across epochs (epoch-aggregation), and then aggregates over samples via a metric (sample-aggregation).
The default epoch-aggregation reduces Score.value by computing the mean, but does not reduce Score.metadata.
If your custom metrics read score values from metadata, they will not calculate correctly with epochs > 1 unless a custom --epochs-reducer is used (see custom-reducers).
To be compatible with default epoch aggregation:
- Do: Store all values that need to be aggregated/reduced in
Score.value(can be a dict with multiple keys) - Don’t: Store score values in
Score.metadatathat metrics will read and aggregate
Use Score.metadata only for:
- Identifiers for grouping that do not vary across epochs (e.g.,
task_id,category) - Debugging/logging information
- Values that should NOT be reduced across epochs
Task design and API usage
Leverage Inspect components wherever possible
If you can avoid writing a custom solver, scorer, or metric, do so. Use the framework components provided by Inspect wherever possible. Less code is better.
Use Model Roles for multi-model evals
Represent distinct agents/tools via Model Roles. Multi-model evals include evaluations with a scoring/judging model.
Only call get_model() inside solver/scorer
Resolve concrete models late (inside @solver/@scorer) so tasks remain declarative and configurable. See Model Roles for more details.
Keep prompts, tools, and scoring aligned
Ensure the system prompt, available tools, and what the scorer expects are consistent.
Separate prompt templates from formatting
Keep prompt text as module-level constants and inject variables at call time (e.g., via .format() or f-strings in a small wrapper). This simplifies testing and mocking.
Follow clear naming conventions for defaults
Use ALL_CAPS for constants; prefer get_default_* function names for default factories (e.g., get_default_judge_prompt(game), get_default_periodic_message(game)).
Use absolute imports for run-anywhere execution
Prefer absolute package imports (e.g., from inspect_evals.your_eval.dataset import get_dataset instead of from .dataset import get_dataset) so modules/tests run from IDEs and plugins that execute from arbitrary working directories. This is not necessary for code that runs only in sandbox containers.
Prefer public APIs; avoid private internals
Avoid importing from underscored/private modules in Inspect AI. (e.g., inspect_ai.*._internal). Use the public API wherever possible to ensure forward compatibility.
Avoid import-time side effects and optional dependency failures
Defer optional imports (e.g., nltk) inside functions/methods. This ensures the package can be imported without optional deps installed and avoids side effects at import time. See src/inspect_evals/livebench/utils.py for patterns.
Remove dead code and unused members early
Eliminate unused attributes, functions, or branches.
Use of default arguments
You should have a very good reason for having a default value for an argument.
Typically, this is appropriate in top-level, user-facing code.
Default arguments should generally live high in the call stack, not in lower-level helpers.
Extract repeated or unclear magic numbers to constants
Extract a magic number to a named constant if it appears 3+ times or if its meaning is not clear from context. Otherwise, inline magic numbers in function defaults are acceptable (e.g., max_turns: int = 5 is fine if used once and obviously a turn limit).
Control flow, limits, and performance
Respect “no-limit” semantics
If 0 means “no limit” for turns/duration, ensure loops still progress. Otherwise, validate early and document that at least one limit must be > 0.
Confirm tool timeouts and message limits are sufficient
Choose sensible defaults and allow overrides via task options.
Some models may take more turns to complete agentic tasks because they return a plan and then execute the tool call in a separate turn.
Confirm that any limit is sufficient to complete the task by having a capable model attempt the task and see how many turns/tokens/seconds it takes.
Consider parallelism prudently
The framework supports parallelism, but this can hit resource limits, particularly with sandboxes. Consider the trade-offs between parallelism and serial execution.
Validate parameters
Check for invalid combinations (e.g., negative limits) and provide informative errors or defaults.
Test critical behaviors
Add unit tests for: regex detection (including edge cases), model fallback path, and loop/limit behavior (including no-limit).
Provide defaults and allow overrides for datasets, solvers, scorers, metrics, and grader models
Users of inspect-evals should be able to override datasets, solvers/agents, scorers, metrics, and grader models. Try to make outputs conform to convention such that substituting different components works as expected. For example, as much as possible, metrics should not depend on specific scorer outputs, and scorers should not depend on specific metrics. Test this by running the eval with a different but compatible metric or reducer from the command line (eg at_least(n) vs max_score()).
This does not require adding these items as parameters to the task, as task_with() can be used to override. It does mean avoiding unchangable logic that alters these objects unless doing so is strictly necessary for the evaluation to function, such as a setup script that runs before any solver the user submits.
Handling errors
Only handle errors gracefully if you have a clear, justified reason. Otherwise, let them raise and crash. Failing fast is better than silently running broken code.
Do not write try catch blocks unless it absolutely makes sense.
Prefer narrow checks
For example, often it’s important to differentiate ‘no data’ from empty collections ([]).
Prefer narrow checks like if foo is None over if not foo, since the latter will match on "", [], {}, and None.
Datasets and Variants
Use stable, canonical IDs for samples
Use included sample ids where available. If not, make generated IDs consistent, predictable, and concise so that --sample-id can be used to select a specific sample, even if the dataset is shuffled.
If the dataset has categories, consider using the category name in the sample ID to make it more human-readable like {category}_{index}.
Pin datasets to specific versions
Pin HuggingFace datasets by passing revision=<commit_sha> to hf_dataset() or load_dataset() calls. For GitHub raw URLs, use commit SHAs instead of branch names like main or master. This ensures reproducibility even if upstream datasets change.
We recommend storing datasets on GitHub or HuggingFace, and may enforce this in future.
Ensure deterministic behavior where possible
Control shuffling and randomness with explicit seeds; document or expose as configuration when relevant.
Define clear turn semantics
If the benchmark has “turns” that include multiple steps, be explicit about what a “turn” counts. Make counting consistent with limits and document it.
Differentiate tasks from dataset splits via parameters
If variants share structure and scoring (e.g., difficulties), expose them via a task parameter like difficulty rather than separate task functions. Consider grouped scoring to report per-difficulty metrics in addition to overall results. If variants are conceptually different tasks, use separate task functions.
Provide oracle/gold solvers for testing
Include a simple gold solver to validate plumbing and scoring logic; namespace it clearly to avoid accidental use in benchmarks.
Documentation, environment, and tooling
Keep docs and defaults in sync
Ensure README, examples, and task signature defaults match (models, limits, prompt versions). Avoid provider bias in examples unless required; explain rationale for defaults.
Regenerate docs after changes to eval.yaml/README
Keep eval.yaml categories accurate and run tools/generate_readmes.py to ensure README formatting and docs are consistent.
Document and validate environment constraints
If an eval only supports specific OS or dependencies, document this in the README and validate at runtime with clear errors.
Least-privilege tooling
Run tools as non-root by default; elevate only when necessary and document when/why.
Keep dependency metadata and lockfiles in sync
When changing dependencies, update the lockfile (uv lock) and ensure CI/pre-commit hooks cover regeneration where applicable. This should happen automatically via the pre-commit hooks.
Don’t run uv update unless it’s necessary for a specific reason.
Writing comments
Prefer less comments
Comment why you’re doing something, not what you’re doing.
If you feel the need to write a comment describing what you’re doing, consider that you could instead:
- name functions more clearly
- name variables more clearly
- separate a chunk of logic into a function
- seperate an inlined computation into a meaningfully named variable
Avoid narrating changes
Base comments on the state of the code itself, not the changes you’re making.
These are examples of narrativising comments that should be avoided:
# the function now does x instead of y# changed x because of reason y
Typing
Don’t add type annotations outside of function arguments when they’re redundant e.g.
def do_something(age: int) -> str:
# name: str is not an
# informative type hint
name: str = "Foo"
return f'{name} (age: {age})'