# Configurability Standard for inspect_evals tasks
This guide defines the repository-wide standard for task configurability in inspect_evals.
See also ADR-0008 for the architectural decision behind some of this guidance.
## Purpose
This document defines the baseline configurability standard for evaluations in inspect_evals.
The goal is to make evals:
- predictable to run
- consistent to configure
- easy to reuse without forking
- reviewable against clear, shared expectations
This is a repo policy and guidance document, not a full API reference.
For the underlying Inspect mechanisms and their exact signatures, see the Inspect AI configuration docs.
## Scope
This standard applies to task interfaces exposed by inspect_evals.
It is primarily about:
- what kinds of configuration an eval should support
- which configuration layer each kind of setting should use
- what must be documented
- what patterns should be avoided
It is not intended to require that every eval expose every internal detail as a task parameter.
## Configuration Layers

Inspect has four main layers of configuration, from inner to outer:

- Task definition
- `task_with()`
- Environment variables / `.env`
- `eval()` / CLI
In general, later layers override earlier layers.
For authors of inspect_evals tasks, the important implication is that the task definition does not have to expose all configuration options as parameters. Instead, it should focus on the core task-specific options that are essential for the evaluation.
Users can run the eval with the default configuration, and use environment variables, command-line flags, or `eval()` to cover the majority of use cases involving normal runtime controls such as model, generation config, sample selection, limits, and execution behavior.
Advanced users and researchers can further customize task behavior using `task_with()`, which provides a programmatic way to replace task components such as dataset, scorer, and metrics that are not configurable via the CLI.
For example:
- `temperature` should usually be controlled via `eval()` / CLI, not baked into a task parameter.
- `scorer` cannot be changed from the CLI, so introducing a task parameter would be appropriate to enable switching between two provided scorers. For a user to supply their own custom scorer, `task_with()` is the appropriate mechanism.
## Standard
An eval satisfies the configurability standard if all of the following are true.
### 1. Reasonable Defaults
An eval must have sensible defaults such that a normal user can run it with a standard command like:
```
inspect eval inspect_evals/<eval_name> --model <model>
```

This means:
- required task parameters should be avoided unless they are truly essential
- default solver, scorer, metrics, and environment behavior must be coherent
- paper-faithful or special-purpose settings should not force complexity onto the default invocation unless there is a strong reason
A good default task definition is usually:
- a dataset factory
- a small amount of task-specific wiring
- a default solver
- a default scorer
- documented examples for runtime overrides
This makes the task easier to reuse through `task_with()` and `eval()`.
### 2. Use the Correct Configuration Layer
Evals must use Inspect’s standard configuration layers rather than inventing custom configuration paths.
In particular:
- Define task parameters for eval-specific behavior that is naturally part of the task interface and is representable in CLI-friendly values.
- Rely on `eval()` / CLI for users to set framework-level runtime controls such as model, `GenerateConfig`, sample limits, epochs, sandbox, approval, and error handling.
- Rely on `task_with()` for users to replace task components such as dataset, scorer, solver, metrics, setup, cleanup, or metadata.
An eval should not expose a task parameter for something the framework already handles cleanly unless there is a strong eval-specific reason.
Task parameters are most valuable when they express a meaningful task-level choice rather than a framework-level runtime tweak.
Good examples:
- choosing between two scorer modes that imply different semantics
- selecting a benchmark variant that changes prompts or scoring rules
- filtering to a subset of samples
- enabling an eval-specific setup mode
Less good examples:
- adding `temperature` or `max_tokens` as task parameters when runtime config already handles them well
Note that there is no `inspect eval --scorer ...` flag. In inspect_evals, there are usually two valid patterns for scorer configurability:
- authors should expose a task parameter like `scorer="default" | "original"` when switching among built-in scorer modes is part of the task's intended interface
- users can use `task_with(..., scorer=...)` for user-supplied custom scorers
The second is a task convention, not a framework-level scorer override mechanism.
### 3. Python-Level Reuse Without Forking
Users must be able to adapt the task from Python without modifying eval source code.
At minimum, the eval should be compatible with replacing or overriding the relevant task components via `task_with()`, including where appropriate:
- dataset
- solver or agent
- scorer
- metrics
- model roles
- execution limits or task metadata
This does not mean every one of these should be a task parameter.
It does mean the eval should avoid baking in irreversible logic that prevents component substitution.
### 4. Runtime Controls Should Flow Through Inspect
Settings that Inspect already supports at runtime should generally remain runtime controls.
This includes things like:
- `model`
- `model_roles`
- `temperature`, `max_tokens`, and other `GenerateConfig` fields
- `limit`, `sample_id`, `sample_shuffle`
- `epochs` (unless the eval specifically requires a particular number of epochs, or uses a specific reducer)
- `approval`
- error-handling and resource limits
If an eval hardcodes one of these in a way that blocks normal runtime override, that should be treated as a deviation requiring justification.
Evals can set `GenerateConfig` fields in their task definition, and CLI flags will still override them at runtime, but doing so may prevent users from relying on model-specific defaults for those settings.
Agentic tasks may need to set `max_messages` or other agent-specific config in the task definition to ensure eventual stopping, but this should be documented clearly.
### 5. Sandboxes and Grader Models
QA evals do not need a sandbox. Agentic evals, however, must specify a sandbox in the task definition (usually `docker`), along with a `Dockerfile` or `compose.yaml` that defines the environment.
If an eval uses a judge or grader model, prefer resolving it through `get_model(role="grader")` or a similar named-role pattern. This keeps the grader configurable via standard Inspect mechanisms rather than requiring task-specific parameters like `grader_model`. If the eval has paper-faithful grader settings, document them explicitly rather than hardcoding them invisibly.
### 6. Configurable Components Should Be Composable
As much as reasonably possible, configurable components should be substitutable without breaking the task.
This implies:
- scorers should not depend on one exact metric implementation unless necessary
- metrics should not depend on private scorer internals unless that coupling is essential and documented
- solver replacement should not fail because of avoidable assumptions about message shape, metadata, or internal control flow
- tasks should use the `setup` parameter to facilitate the use of other solvers where appropriate
- datasets should be defined in a factory function rather than inline within the task, so that they can be used by researchers who want to run their own evaluations
Perfect decoupling is not required, but unnecessary coupling should be avoided.
### 7. Paper-Faithful Settings Should Be Explicit
If an eval has paper-faithful settings that differ from the recommended or default runtime path, they should be represented explicitly and transparently.
Good patterns include:
- documented config files
- documented `--model-role` examples
- a clearly named task parameter when the variation is genuinely task-specific
The provenance of paper-faithful settings should be documented when it is not obvious from the code or paper.
### 8. Documentation Must Match Reality
Each eval must document its actual configuration surface, including:
- the important defaults
- the intended way to override them
- any paper-faithful or recommended settings
- any important limitations or Python-only paths
The documentation should describe the real user-facing interface, not a hypothetical one.
## Verification
The standard should be verified using a mix of review, tests, and automation.
### Manual Review
Reviewers should ask:
- Can I run the task with a normal `inspect eval ... --model ...` command?
- Are task parameters only being used for genuinely task-specific controls?
- Are framework-level controls left to Inspect runtime config where possible?
- Can I replace the scorer, metrics, dataset, and model roles from Python without patching source code?
- Are any important defaults hidden or surprising?
- Does the README explain the actual configuration story?
### Test Expectations
Where practical, evals should include tests that exercise configurability-relevant behavior, for example:
- task runs with defaults
- `model_roles` works when the scorer or solver uses `get_model(role=...)`
- scorer logic still works when used with compatible metric/reducer paths
- task-specific parameters change the intended behavior and nothing else
- task can be run with `epochs` > 1 and scores are reduced correctly
- component replacement paths are not blocked by hidden assumptions
Not every eval needs a dedicated test for every configuration path, but the design should be testable.
### Automation
Potential automated checks include:
- task can be imported and instantiated with defaults
- README parameter sections stay in sync with task signatures
- known runtime-configurable settings are not redundantly re-exposed in confusing ways
- common anti-patterns are flagged during review or autolint
## Anti-Patterns
The following patterns should generally be avoided:
- hardcoding a grader model instead of using model roles
- adding task parameters for ordinary `GenerateConfig` fields without a clear reason
- exposing task parameters that duplicate CLI behavior and create two competing configuration paths
- mixing hidden branching logic into scorer or solver setup
- making metrics depend on ad hoc scorer payload structure when a cleaner interface is possible
- making component replacement impossible without editing source
- storing scores in `Score.metadata` instead of `Score.value`, breaking reducibility
## SimpleQA as an Example
SimpleQA is a useful example of this standard in practice:
- default tasks are runnable with normal Inspect commands
- paper-faithful generation settings are documented in YAML config files
- grader configuration uses `model_roles`
- runtime generation settings use `--generate-config` / standard `GenerateConfig`
- scorer selection is exposed as a task-level semantic choice
- the underlying generic scorer can be reused independently of SimpleQA
This is close to the intended shape for configurable evals in the repo.
## Rationale
This standard exists so that inspect_evals tasks are:
- Reproducible — users can tell which settings matter and where they came from
- Predictable — the same kinds of settings are configured in the same ways across evals
- Reusable — downstream users can adapt tasks without forking them
- Reviewable — maintainers can assess config quality against concrete expectations
- Documentable — the configuration story can be explained once and applied consistently