Best Practices for Evaluations

A checklist of the most important of these can be found at EVALUATION_CHECKLIST.md

Rules of Thumb

Use these guidelines when implementing and reviewing evaluations. They focus on clarity, reproducibility, and minimal custom code while leveraging the Inspect framework wherever possible.

Scoring and metrics

Align scoring with the outcome

If the outcome is discrete or continuous (e.g., a numeric quantity), make the primary Score.value that numeric measure. Use [mean(), stderr()] as a minimal set of metrics. If the outcome is binary, use [accuracy(), stderr()] as a minimal set of metrics.

Binary = Score is either a success or a failure, with no in-between. (E.g, GPQA) Discrete = Score is done in a series of steps. (E.g, how many steps out of 5 were done successfully?) Continuous = Score is done with a metric with essentially infinite values. (E.g, NoveltyBench)

Normalize outputs before comparison (unless exact-match is required)

Trim whitespace/newlines and consider case/format normalization. If exact string match is required, document this clearly.

Avoid magic values; use framework constants

Compare against CORRECT/INCORRECT (and other provided constants) rather than literal strings that represent these values.

Validate judge-based comparisons

When using an LLM judge to compare systems (e.g., which model, prompt, or agent configuration produces better outputs), there is no reason to expect the judge to care about the same things as your domain experts. Judge scores live on the judge’s scale, not the scale of the expensive labels you actually care about (human ratings, ground truth). Standard error bars won’t catch this gap because they only reflect sampling noise, not judge miscalibration.

To get unbiased estimates in the oracle space without labeling every example:

Start with 50–100 oracle labels (human ratings or a stronger reference judge)
Use tools/judge_calibration_diagnostics.py to learn the judge-to-oracle mapping and produce calibrated estimates with confidence intervals
Check whether rankings hold and whether CIs are tight enough to distinguish the systems being compared
If confidence intervals are too wide, collect more oracle labels and re-run

This tool uses CJE (Causal Judge Evaluation) to learn the mapping via isotonic calibration, producing estimates with uncertainty that accounts for both sampling noise and calibration error.

See tools/README.md for usage examples. Requires optional dependency: pip install inspect-evals[cje]

Record important run metadata

Err on the side of including dataset fields in sample.metadata - these can be useful for debugging and analysis.

Store additional information (e.g., interaction length, duration, configuration/prompt version, and exit reasons) in state.metadata for scoring and analysis.

Include potentially useful information in Score.metadata, in addition to providing Score.value as the primary outcome and Score.explanation if relevant.

Avoid storing reducible score values in Score.metadata

Inspect first aggregates samples across epochs (epoch-aggregation), and then aggregates over samples via a metric (sample-aggregation).

The default epoch-aggregation reduces Score.value by computing the mean, but does not reduce Score.metadata.

If your custom metrics read score values from metadata, they will not calculate correctly with epochs > 1 unless a custom --epochs-reducer is used (see custom-reducers).

To be compatible with default epoch aggregation:

Do: Store all values that need to be aggregated/reduced in Score.value (can be a dict with multiple keys)
Don’t: Store score values in Score.metadata that metrics will read and aggregate

Use Score.metadata only for:

Identifiers for grouping that do not vary across epochs (e.g., task_id, category)
Debugging/logging information
Values that should NOT be reduced across epochs

Task design and API usage

Leverage Inspect components wherever possible

If you can avoid writing a custom solver, scorer, or metric, do so. Use the framework components provided by Inspect wherever possible. Less code is better.

Use Model Roles for multi-model evals

Represent distinct agents/tools via Model Roles. Multi-model evals include evaluations with a scoring/judging model.

Only call `get_model()` inside solver/scorer

Resolve concrete models late (inside @solver/@scorer) so tasks remain declarative and configurable. See Model Roles for more details.

Keep prompts, tools, and scoring aligned

Ensure the system prompt, available tools, and what the scorer expects are consistent.

Separate prompt templates from formatting

Keep prompt text as module-level constants and inject variables at call time (e.g., via .format() or f-strings in a small wrapper). This simplifies testing and mocking.

Follow clear naming conventions for defaults

Use ALL_CAPS for constants; prefer get_default_* function names for default factories (e.g., get_default_judge_prompt(game), get_default_periodic_message(game)).

Use absolute imports for run-anywhere execution

Prefer absolute package imports (e.g., from inspect_evals.your_eval.dataset import get_dataset instead of from .dataset import get_dataset) so modules/tests run from IDEs and plugins that execute from arbitrary working directories. This is not necessary for code that runs only in sandbox containers.

Prefer public APIs; avoid private internals

Avoid importing from underscored/private modules in Inspect AI. (e.g., inspect_ai.*._internal). Use the public API wherever possible to ensure forward compatibility.

Avoid import-time side effects and optional dependency failures

Defer optional imports (e.g., nltk) inside functions/methods. This ensures the package can be imported without optional deps installed and avoids side effects at import time. See src/inspect_evals/livebench/utils.py for patterns.

Remove dead code and unused members early

Eliminate unused attributes, functions, or branches.

Use of default arguments

You should have a very good reason for having a default value for an argument.

Typically, this is appropriate in top-level, user-facing code.

Default arguments should generally live high in the call stack, not in lower-level helpers.

Extract repeated or unclear magic numbers to constants

Extract a magic number to a named constant if it appears 3+ times or if its meaning is not clear from context. Otherwise, inline magic numbers in function defaults are acceptable (e.g., max_turns: int = 5 is fine if used once and obviously a turn limit).

Control flow, limits, and performance

Respect “no-limit” semantics

If 0 means “no limit” for turns/duration, ensure loops still progress. Otherwise, validate early and document that at least one limit must be > 0.

Confirm tool timeouts and message limits are sufficient

Choose sensible defaults and allow overrides via task options.

Some models may take more turns to complete agentic tasks because they return a plan and then execute the tool call in a separate turn.

Confirm that any limit is sufficient to complete the task by having a capable model attempt the task and see how many turns/tokens/seconds it takes.

Consider parallelism prudently

The framework supports parallelism, but this can hit resource limits, particularly with sandboxes. Consider the trade-offs between parallelism and serial execution.

Validate parameters

Check for invalid combinations (e.g., negative limits) and provide informative errors or defaults.

Test critical behaviors

Add unit tests for: regex detection (including edge cases), model fallback path, and loop/limit behavior (including no-limit).

Provide defaults and allow overrides for datasets, solvers, scorers, metrics, and grader models

Users of inspect-evals should be able to override datasets, solvers/agents, scorers, metrics, and grader models. Try to make outputs conform to convention such that substituting different components works as expected. For example, as much as possible, metrics should not depend on specific scorer outputs, and scorers should not depend on specific metrics. Test this by running the eval with a different but compatible metric or reducer from the command line (eg at_least(n) vs max_score()).

This does not require adding these items as parameters to the task, as task_with() can be used to override. It does mean avoiding unchangable logic that alters these objects unless doing so is strictly necessary for the evaluation to function, such as a setup script that runs before any solver the user submits.

Handling errors

Only handle errors gracefully if you have a clear, justified reason. Otherwise, let them raise and crash. Failing fast is better than silently running broken code.

Do not write try catch blocks unless it absolutely makes sense.

Prefer narrow checks

For example, often it’s important to differentiate ‘no data’ from empty collections ([]).

Prefer narrow checks like if foo is None over if not foo, since the latter will match on "", [], {}, and None.

Datasets and Variants

Use stable, canonical IDs for samples

Use included sample ids where available. If not, make generated IDs consistent, predictable, and concise so that --sample-id can be used to select a specific sample, even if the dataset is shuffled.

If the dataset has categories, consider using the category name in the sample ID to make it more human-readable like {category}_{index}.

Pin datasets to specific versions

Pin HuggingFace datasets by passing revision=<commit_sha> to hf_dataset() or load_dataset() calls. For GitHub raw URLs, use commit SHAs instead of branch names like main or master. This ensures reproducibility even if upstream datasets change.

We recommend storing datasets on GitHub or HuggingFace, and may enforce this in future.

Ensure deterministic behavior where possible

Control shuffling and randomness with explicit seeds; document or expose as configuration when relevant.

Define clear turn semantics

If the benchmark has “turns” that include multiple steps, be explicit about what a “turn” counts. Make counting consistent with limits and document it.

Differentiate tasks from dataset splits via parameters

If variants share structure and scoring (e.g., difficulties), expose them via a task parameter like difficulty rather than separate task functions. Consider grouped scoring to report per-difficulty metrics in addition to overall results. If variants are conceptually different tasks, use separate task functions.

Provide oracle/gold solvers for testing

Include a simple gold solver to validate plumbing and scoring logic; namespace it clearly to avoid accidental use in benchmarks.

Documentation, environment, and tooling

Keep docs and defaults in sync

Ensure README, examples, and task signature defaults match (models, limits, prompt versions). Avoid provider bias in examples unless required; explain rationale for defaults.

Regenerate docs after changes to eval.yaml/README

Keep eval.yaml categories accurate and run tools/generate_readmes.py to ensure README formatting and docs are consistent.

Document and validate environment constraints

If an eval only supports specific OS or dependencies, document this in the README and validate at runtime with clear errors.

Least-privilege tooling

Run tools as non-root by default; elevate only when necessary and document when/why.

Keep dependency metadata and lockfiles in sync

When changing dependencies, update the lockfile (uv lock) and ensure CI/pre-commit hooks cover regeneration where applicable. This should happen automatically via the pre-commit hooks.

Don’t run uv update unless it’s necessary for a specific reason.

Writing comments

Prefer less comments

Comment why you’re doing something, not what you’re doing.

If you feel the need to write a comment describing what you’re doing, consider that you could instead:

name functions more clearly
name variables more clearly
separate a chunk of logic into a function
seperate an inlined computation into a meaningfully named variable

Avoid narrating changes

Base comments on the state of the code itself, not the changes you’re making.

These are examples of narrativising comments that should be avoided:

# the function now does x instead of y
# changed x because of reason y

Typing

Don’t add type annotations outside of function arguments when they’re redundant e.g.

def do_something(age: int) -> str:
    # name: str is not an
    # informative type hint
    name: str = "Foo"
    return f'{name} (age: {age})'