Configurability Standard for inspect_evals tasks

This guide defines the repository-wide standard for task configurability in inspect_evals.

See also ADR-0008 for the architectural decision behind some of this guidance.

Purpose

This document defines the baseline configurability standard for evaluations in inspect_evals.

The goal is to make evals:

  • predictable to run
  • consistent to configure
  • easy to reuse without forking
  • reviewable against clear, shared expectations

This is a repo policy and guidance document, not a full API reference.

For the underlying Inspect mechanisms and their exact signatures, see the Inspect AI configuration docs.

Scope

This standard applies to task interfaces exposed by inspect_evals.

It is primarily about:

  • what kinds of configuration an eval should support
  • which configuration layer each kind of setting should use
  • what must be documented
  • what patterns should be avoided

It is not intended to require that every eval expose every internal detail as a task parameter.

Configuration Layers

Inspect has four main layers of configuration, from inner to outer:

  1. Task definition
  2. task_with()
  3. Environment variables / .env
  4. eval() / CLI

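In general, later (outer) layers override earlier (inner) ones, so a setting passed to eval() or the CLI takes precedence over the same setting in the task definition.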
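As a rough mental model of this precedence rule, the sketch below merges four plain dictionaries so that outer layers win. It is an illustration only, not Inspect's implementation: resolve_config and the setting values are hypothetical, and None is assumed to mean "not set at this layer".

```python
from collections import ChainMap


def resolve_config(task_definition, task_with_overrides, environment, eval_args):
    """Merge four configuration layers so that outer layers win.

    Toy model only: a value of None is assumed to mean "not set at this layer".
    """
    # ChainMap looks keys up left to right, so list the outermost layer first.
    layers = [eval_args, environment, task_with_overrides, task_definition]
    set_values = [{k: v for k, v in layer.items() if v is not None} for layer in layers]
    return dict(ChainMap(*set_values))


resolved = resolve_config(
    task_definition={"temperature": 0.0, "epochs": 1},  # author's defaults
    task_with_overrides={"epochs": 4},                   # Python-level adaptation
    environment={"model": "provider/model-a"},           # env var / .env
    eval_args={"temperature": 0.7, "model": None},       # eval() / CLI
)
```

Here the task definition's temperature is replaced by the eval() value, while the model falls through to the environment layer because eval() left it unset.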

For authors of inspect_evals tasks, the important implication is that the task definition does not have to expose all configuration options as parameters. Instead, it should focus on the core task-specific options that are essential for the evaluation.

Users can run the eval with its default configuration, and can cover most use cases with env vars, command-line flags, or eval() arguments. These handle normal runtime controls such as model, generation config, sample selection, limits, and execution behavior.

Advanced users and researchers can further customise task behaviour with task_with(), which programmatically replaces task components such as the dataset, scorer, and metrics. These components cannot be configured from the CLI.

For example:

  • temperature should usually be controlled via eval() / CLI, not baked into a task parameter.
  • scorer cannot be changed from the CLI, so introducing a task parameter would be appropriate to enable switching between two provided scorers. For a user to use their own custom scorer, task_with() is the appropriate mechanism.

Standard

An eval satisfies the configurability standard if all of the following are true.

1. Reasonable Defaults

An eval must have sensible defaults such that a normal user can run it with a standard command like:

inspect eval inspect_evals/<eval_name> --model <model>

This means:

  • required task parameters should be avoided unless they are truly essential
  • default solver, scorer, metrics, and environment behavior must be coherent
  • paper-faithful or special-purpose settings should not force complexity onto the default invocation unless there is a strong reason

A good default task definition is usually:

  • a dataset factory
  • a small amount of task-specific wiring
  • a default solver
  • a default scorer
  • documented examples for runtime overrides

This makes the task easier to reuse through task_with() and eval().

2. Use the Correct Configuration Layer

Evals must use Inspect’s standard configuration layers rather than inventing custom configuration paths.

In particular:

  • Define task parameters for eval-specific behavior that is naturally part of the task interface and is representable in CLI-friendly values.
  • Rely on eval() / CLI for users to set framework-level runtime controls such as model, GenerateConfig, sample limits, epochs, sandbox, approval, and error handling.
  • Rely on task_with() for users to replace task components such as dataset, scorer, solver, metrics, setup, cleanup, or metadata.

An eval should not expose a task parameter for something the framework already handles cleanly unless there is a strong eval-specific reason.

Task parameters are most valuable when they express a meaningful task-level choice rather than a framework-level runtime tweak.

Good examples:

  • choosing between two scorer modes that imply different semantics
  • selecting a benchmark variant that changes prompts or scoring rules
  • filtering to a subset of samples
  • enabling an eval-specific setup mode

A less good example:

  • adding temperature or max_tokens as task parameters when runtime config already handles them well

Note that there is currently no inspect eval --scorer ... flag. In inspect_evals, there are usually two valid patterns for scorer configurability:

  • authors should expose a task parameter like scorer="default" | "original" when switching among built-in scorer modes is part of the task’s intended interface
  • users can use task_with(..., scorer=...) for user-supplied custom scorers

The first is a task convention, not a framework-level scorer override mechanism.
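The two scorer patterns can be sketched with plain Python stand-ins. Everything here is hypothetical and simplified: TaskSpec is a toy dataclass rather than Inspect's Task, my_eval and the scorer functions are invented for illustration, and dataclasses.replace plays the role that task_with(..., scorer=...) plays in the real framework.

```python
from dataclasses import dataclass, replace
from typing import Callable, Literal

Scorer = Callable[[str, str], bool]  # (model_output, target) -> correct?


def default_scorer(output: str, target: str) -> bool:
    """Lenient built-in mode: ignores case and surrounding whitespace."""
    return output.strip().lower() == target.strip().lower()


def original_scorer(output: str, target: str) -> bool:
    """Strict built-in mode: exact match (stand-in for a paper's rule)."""
    return output == target


@dataclass(frozen=True)
class TaskSpec:
    """Toy stand-in for a task object; not Inspect's Task class."""
    name: str
    scorer: Scorer


def my_eval(scorer: Literal["default", "original"] = "default") -> TaskSpec:
    """Pattern 1: a task parameter switches between built-in scorer modes."""
    scorers = {"default": default_scorer, "original": original_scorer}
    if scorer not in scorers:
        raise ValueError(f"unknown scorer mode: {scorer!r}")
    return TaskSpec(name="my_eval", scorer=scorers[scorer])


def contains_scorer(output: str, target: str) -> bool:
    """A user-supplied scorer the eval author never anticipated."""
    return target in output


# Pattern 2: swap the component from Python without touching eval source.
custom = replace(my_eval(), scorer=contains_scorer)
```

The task parameter only accepts CLI-friendly string values, while the user-supplied scorer is an arbitrary callable and therefore goes through the Python-level path.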

3. Python-Level Reuse Without Forking

Users must be able to adapt the task from Python without modifying eval source code.

At minimum, the eval should be compatible with replacing or overriding the relevant task components via task_with(), including where appropriate:

  • dataset
  • solver or agent
  • scorer
  • metrics
  • model roles
  • execution limits or task metadata

This does not mean every one of these should be a task parameter.

It does mean the eval should avoid baking in irreversible logic that prevents component substitution.

4. Runtime Controls Should Flow Through Inspect

Settings that Inspect already supports at runtime should generally remain runtime controls.

This includes things like:

  • model
  • model_roles
  • temperature, max_tokens, and other GenerateConfig fields
  • limit, sample_id, sample_shuffle
  • epochs (unless the eval specifically requires a particular number of epochs, or uses a specific reducer)
  • approval
  • error-handling and resource limits

If an eval hardcodes one of these in a way that blocks normal runtime override, that should be treated as a deviation requiring justification.

Evals can set GenerateConfig fields in their task definition, and CLI flags will still override them at runtime, but doing so may prevent users from relying on model-specific defaults for those settings.

Agentic tasks may need to set max_messages or other agent-specific config in the task definition to ensure eventual stopping, but this should be documented clearly.

QA evals do not need a sandbox. Agentic evals, however, must specify a sandbox in the task definition — usually docker, along with a Dockerfile or compose.yaml that defines the environment.

If an eval uses a judge or grader model, prefer resolving it through get_model(role="grader") or a similar named role pattern. This keeps the grader configurable via standard Inspect mechanisms rather than requiring task-specific parameters like grader_model. If the eval has paper-faithful grader settings, document them explicitly rather than hardcoding them invisibly.
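The named-role idea can be illustrated with a small lookup that falls back to a documented default. This is a toy model of role resolution, not the get_model() implementation: resolve_model, grader_for_run, and the model names are assumptions made for the example.

```python
PAPER_GRADER = "provider/paper-grader-model"  # hypothetical; would be documented in the README


def resolve_model(role, model_roles, default):
    """Return the model bound to `role`, or `default` when the role is unset."""
    return model_roles.get(role, default)


def grader_for_run(model_roles):
    """What a scorer would call instead of accepting a grader_model parameter."""
    return resolve_model("grader", model_roles, default=PAPER_GRADER)


# No role supplied: the documented paper-faithful grader is used.
paper_run = grader_for_run({})

# Role supplied by the user at runtime: the override wins.
custom_run = grader_for_run({"grader": "provider/other-model"})
```

Because the grader is resolved by role, the task signature stays free of grader-specific parameters and the paper-faithful choice remains visible as a single named default.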

5. No Hidden Behavior

An eval must not contain hidden or surprising configuration logic.

In particular:

  • changing a dataset, scorer, solver, or metric should not silently trigger unrelated behavior changes unless that relationship is intrinsic and documented
  • task code should not quietly rewrite user-supplied components without an explicit interface
  • “magic” defaults that materially change behavior should either be visible in the task interface, visible in runtime config, or clearly documented

If special logic is necessary, it should either:

  • apply only to the built-in default path, or
  • be exposed as an explicit option
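A minimal sketch of the first option, keeping special logic on the built-in default path only. The function names are hypothetical, and the scorers are plain callables rather than Inspect scorers.

```python
def normalize(text):
    """Eval-specific answer clean-up: trim, drop a trailing period, lowercase."""
    return text.strip().rstrip(".").lower()


def builtin_scorer(output, target):
    # The special logic lives inside the built-in scorer, where it is visible.
    return normalize(output) == normalize(target)


def run_sample(output, target, scorer=builtin_scorer):
    # The task hands the raw output to whichever scorer is in use; it never
    # pre-normalizes on behalf of a user-supplied scorer.
    return scorer(output, target)


def exact_scorer(output, target):
    """A user-supplied scorer that expects untouched model output."""
    return output == target
```

With this shape, replacing the scorer changes scoring and nothing else: the custom scorer sees exactly what the model produced.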

6. Configurable Components Should Be Composable

As much as reasonably possible, configurable components should be substitutable without breaking the task.

This implies:

  • scorers should not depend on one exact metric implementation unless necessary
  • metrics should not depend on private scorer internals unless that coupling is essential and documented
  • solver replacement should not fail because of avoidable assumptions about message shape, metadata, or internal control flow
  • tasks should put required preparation in the setup parameter rather than in the default solver, where appropriate, so that the preparation still runs when a user substitutes a different solver
  • datasets should be defined in a factory function rather than inline within the task, so that they can be used by researchers who want to run their own evaluations

Perfect decoupling is not required, but unnecessary coupling should be avoided.
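The point about setup can be sketched with a toy runner in which preparation is a separate step from the solver. This is a simplified model under stated assumptions: run_sample, prepare_workspace, and the dictionary-based state are hypothetical and do not reflect Inspect's TaskState or solver signatures.

```python
def prepare_workspace(state):
    """Setup step: task-owned preparation that every solver relies on."""
    return {**state, "files": {"data.csv": "a,b\n1,2\n"}}


def default_solver(state):
    """Built-in solver: reports how many files the setup step provided."""
    return f"default saw {len(state['files'])} file(s)"


def custom_solver(state):
    """User-supplied replacement that knows nothing about the preparation."""
    return f"custom saw {sorted(state['files'])}"


def run_sample(sample_input, solver=default_solver, setup=prepare_workspace):
    """Toy runner: the setup step runs first, whichever solver is in use."""
    state = {"input": sample_input, "files": {}}
    state = setup(state)
    return solver(state)
```

Had the file preparation been written inside default_solver, the custom solver would have started from an empty workspace.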

7. Paper-Faithful Settings Should Be Explicit

If an eval has paper-faithful settings that differ from the recommended or default runtime path, they should be represented explicitly and transparently.

Good patterns include:

  • documented config files
  • documented --model-role examples
  • a clearly named task parameter when the variation is genuinely task-specific

The provenance of paper-faithful settings should be documented when it is not obvious from the code or paper.

8. Documentation Must Match Reality

Each eval must document its actual configuration surface, including:

  • the important defaults
  • the intended way to override them
  • any paper-faithful or recommended settings
  • any important limitations or Python-only paths

The documentation should describe the real user-facing interface, not a hypothetical one.

Verification

The standard should be verified using a mix of review, tests, and automation.

Manual Review

Reviewers should ask:

  • Can I run the task with a normal inspect eval ... --model ... command?
  • Are task parameters only being used for genuinely task-specific controls?
  • Are framework-level controls left to Inspect runtime config where possible?
  • Can I replace the scorer, metrics, dataset, and model roles from Python without patching source code?
  • Are any important defaults hidden or surprising?
  • Does the README explain the actual configuration story?

Test Expectations

Where practical, evals should include tests that exercise configurability-relevant behavior, for example:

  • task runs with defaults
  • model_roles works when the scorer or solver uses get_model(role=...)
  • scorer logic still works when used with compatible metric/reducer paths
  • task-specific parameters change the intended behavior and nothing else
  • task can be run with epochs > 1 and scores are reduced correctly
  • component replacement paths are not blocked by hidden assumptions

Not every eval needs a dedicated test for every configuration path, but the design should be testable.
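As one possible shape for the "changes the intended behavior and nothing else" expectation, the sketch below compares two task objects field by field. TaskSpec and my_eval are hypothetical stand-ins; a real test would build the actual task and compare whichever attributes matter for that eval.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class TaskSpec:
    """Toy stand-in for a task object; not Inspect's Task class."""
    dataset: tuple
    solver: str
    scorer: str


def my_eval(scorer="default"):
    """Hypothetical task factory with one task-specific parameter."""
    return TaskSpec(dataset=("q1", "q2"), solver="generate", scorer=f"{scorer}_scorer")


def changed_fields(a, b):
    """Names of the fields whose values differ between two task specs."""
    return {f.name for f in fields(a) if getattr(a, f.name) != getattr(b, f.name)}


def test_scorer_parameter_changes_only_the_scorer():
    assert changed_fields(my_eval(), my_eval(scorer="original")) == {"scorer"}
```

A test like this fails as soon as the parameter starts to influence the dataset or solver as a side effect.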

Automation

Potential automated checks include:

  • task can be imported and instantiated with defaults
  • README parameter sections stay in sync with task signatures
  • known runtime-configurable settings are not redundantly re-exposed in confusing ways
  • common anti-patterns are flagged during review or autolint
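Two of these checks can be approximated with the standard library alone. The sketch below is an assumption-laden illustration: RUNTIME_CONTROLS is a hand-picked subset, the README convention of naming parameters in backticks is invented for the example, and my_eval is a hypothetical task function.

```python
import inspect

# Illustrative subset of settings that Inspect already handles at runtime.
RUNTIME_CONTROLS = {"model", "temperature", "max_tokens", "epochs", "limit"}


def redundant_parameters(task_fn):
    """Task parameters that re-expose a known runtime control."""
    return sorted(set(inspect.signature(task_fn).parameters) & RUNTIME_CONTROLS)


def undocumented_parameters(task_fn, readme_text):
    """Task parameters never mentioned (in backticks) in the README text."""
    names = inspect.signature(task_fn).parameters
    return sorted(name for name in names if f"`{name}`" not in readme_text)


def my_eval(subset: str = "all", temperature: float = 0.0):
    """Hypothetical task function with one questionable parameter."""


README = "Parameters: `subset` selects which benchmark split to run."

flagged = redundant_parameters(my_eval)
missing = undocumented_parameters(my_eval, README)
```

Findings from a check like this are prompts for review rather than hard failures, since the standard allows justified deviations.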

Anti-Patterns

The following patterns should generally be avoided:

  • hardcoding a grader model instead of using model roles
  • adding task parameters for ordinary GenerateConfig fields without a clear reason
  • exposing task parameters that duplicate CLI behavior and create two competing configuration paths
  • mixing hidden branching logic into scorer or solver setup
  • making metrics depend on ad hoc scorer payload structure when a cleaner interface is possible
  • making component replacement impossible without editing source
  • storing scores in Score.metadata instead of Score.value, breaking reducibility
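The last anti-pattern is easiest to see with a toy reducer. The Score class below is a minimal stand-in with only the two fields named above, and mean_reducer is an assumed example of reducing one sample's scores across epochs; neither is Inspect's own implementation.

```python
from dataclasses import dataclass, field
from statistics import mean


@dataclass
class Score:
    """Toy stand-in exposing only `value` and `metadata`."""
    value: float
    metadata: dict = field(default_factory=dict)


def mean_reducer(scores):
    """Reduce one sample's scores across epochs by averaging `value`."""
    return mean(score.value for score in scores)


# Result carried in `value`: the reducer sees it.
good = [Score(1.0), Score(0.0), Score(1.0), Score(1.0)]

# Result hidden in `metadata` with a placeholder `value`: the reducer cannot see it.
bad = [Score(0.0, {"accuracy": a}) for a in (1.0, 0.0, 1.0, 1.0)]

good_reduced = mean_reducer(good)
bad_reduced = mean_reducer(bad)
```

Both lists describe the same four outcomes, but only the first reduces to a meaningful number, because a generic reducer has no way to know which metadata key holds the real result.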

SimpleQA as an Example

SimpleQA is a useful example of this standard in practice:

  • default tasks are runnable with normal Inspect commands
  • paper-faithful generation settings are documented in YAML config files
  • grader configuration uses model_roles
  • runtime generation settings use --generate-config / standard GenerateConfig
  • scorer selection is exposed as a task-level semantic choice
  • the underlying generic scorer can be reused independently of SimpleQA

This is close to the intended shape for configurable evals in the repo.

Rationale

This standard exists so that inspect_evals tasks are:

  • Reproducible — users can tell which settings matter and where they came from
  • Predictable — the same kinds of settings are configured in the same ways across evals
  • Reusable — downstream users can adapt tasks without forking them
  • Reviewable — maintainers can assess config quality against concrete expectations
  • Documentable — the configuration story can be explained once and applied consistently