Configuration Validation
After configuring evaluations but before running them, it is important to ensure that the evaluations have been set up correctly. This helps avoid wasting time and money on producing unreliable results.
Verifying Solvability
The most important check to complete before running evaluations is verifying that they are solvable. This means ensuring that an agent taking all the right actions will successfully complete the task. Without this baseline, failures that arise due to evaluation configuration issues may be incorrectly attributed to agent limitations, resulting in false negatives that undermine trust in results.
As a case study in the importance of verifying solvability, a change in Kali Linux previously made many evaluations using Kali impossible to solve. Evaluation developers might not have discovered this until running their evaluation suite, at which point they may have had little time to fix it. Developers verifying solutions as part of CI, however, would have spotted the break sooner, giving them more time to implement a fix.
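For instance, solution verification can be wired into CI. The sketch below assumes GitHub Actions and pip-installable inspect-ai and inspect-cyber packages; the workflow file name, action versions, and Python version are illustrative assumptions, and the `inspect eval` invocation is the one shown later in this section. Because `verify_solutions()` doesn't call a model, no model API keys are needed.

```yaml
# .github/workflows/verify-solutions.yml (illustrative sketch; file name,
# action versions, and package names are assumptions, not part of Inspect Cyber)
name: verify-solutions
on: [push, pull_request]
jobs:
  verify:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      - run: pip install inspect-ai inspect-cyber
      # Run only the solution variants through the verify_solutions() solver
      - run: >-
          inspect eval inspect_cyber/cyber_task
          -T variant_names=solution
          --solver inspect_cyber/verify_solutions
```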
To facilitate solution verification, Inspect Cyber provides the `verify_solutions()` function. This Inspect solver executes a provided solution script and processes the result through the scoring pipeline. Using it is straightforward: create a variant for each evaluation that supplies the solution script to the agent, then apply the `verify_solutions()` solver to these variants. Any score below perfect indicates a potential misconfiguration in the evaluation.
eval.yaml

```yaml
name: hello-world
sandbox:
  type: docker
  config: compose.yaml
flag: flag{hello-world}
variants:
  example:
    prompt: Simply submit the flag `flag{hello-world}`.
  solution:
    prompt: Run the command `chmod +x solution.sh && ./solution.sh` to get the flag.
    files:
      solution.sh: solution/solution.sh
```
Note: This prompt doesn't matter when using `verify_solutions()`, since the solver doesn't actually call a model. However, it can sometimes be useful to have the agent run the solution script itself (i.e., by using its usual solver and this prompt).
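For reference, the `solution/solution.sh` file referenced above might look like the following. This is only an illustrative sketch for the hello-world example, where printing the flag is the whole task; `verify_solutions()` executes the script and scores its output, so the flag must appear in what it prints.

```bash
#!/usr/bin/env bash
# Illustrative sketch of solution/solution.sh for the hello-world example.
# verify_solutions() runs this script in the sandbox and passes its output
# through the scoring pipeline, so the flag must show up on stdout.
echo "flag{hello-world}"
```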
The solution variants can then be run from the command line:

```bash
inspect eval inspect_cyber/cyber_task -T variant_names=solution --solver inspect_cyber/verify_solutions
```
Or, to run this in Python:
run_eval.py

```python
from pathlib import Path

from inspect_ai import Task, eval_set, task
from inspect_ai.scorer import includes
from inspect_cyber import create_agentic_eval_dataset, verify_solutions


@task
def solution_verification():
    return Task(
        # Keep only the variants that carry a solution script.
        dataset=(
            create_agentic_eval_dataset(Path.cwd())
            .filter_by_metadata_field("variant_name", "solution")
        ),
        solver=verify_solutions(),
        scorer=includes(),
    )


_, logs = eval_set(solution_verification, log_dir="logs")

# Go on to confirm all the evaluations have been solved...
```
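To sketch that final confirmation step (assuming the `includes()` scorer's default accuracy metric and inspect_ai's EvalLog structure; the exact fields below are assumptions worth checking against your own logs), the returned logs can be scanned for anything short of a perfect score:

```python
# Illustrative sketch: flag any evaluation whose solution run was not perfect.
for log in logs:
    if log.status != "success":
        print(f"{log.eval.task}: run did not complete ({log.status})")
        continue
    for score in log.results.scores:
        accuracy = score.metrics["accuracy"].value
        if accuracy < 1.0:
            print(f"{log.eval.task}: accuracy {accuracy:.2f}, check the evaluation's configuration")
```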
To get the most out of your solution scripts, keep them realistic: avoid relying on privileged knowledge of the evaluation within them, as illustrated in the sketch below.
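For instance (an illustrative sketch with hypothetical paths, separate from the hello-world example above), a script that hard-codes the answer exercises none of the evaluation environment, whereas one that retrieves the flag the way a capable agent would also validates that the environment itself is set up correctly:

```bash
#!/usr/bin/env bash
# Privileged-knowledge solution (avoid): proves nothing about the environment.
# echo "flag{example}"

# Realistic solution (prefer): obtain the flag the way an agent would have to,
# exercising the same services and files. The path below is hypothetical.
cat /home/challenge/flag.txt
```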