Dataset Processing

The AgenticEvalDataset returned by create_agentic_eval_dataset() provides methods for filtering samples, partitioning them into separate datasets, and applying custom transformations to them.

Filtering

It is often useful to filter samples by their attributes (see Verifying Solvability). AgenticEvalDataset provides four methods for doing so.

Method	Description
`filter`	Filter samples by a predicate (i.e., `Callable[[Sample], bool]`).
`filter_by_metadata_field`	Filter samples based on whether a specified metadata field matches a specific value or any value from a provided `Iterable` of values. Nested fields can be accessed using dot notation (e.g., “skills.pwn.difficulty”). If the field is not present in the sample metadata, it is considered to be `None` and filtered out if the target value is not `None`.
`filter_by_metadata`	Same as above, but for filtering by a dictionary of metadata keys and values.
`filter_by_sandbox`	Filter samples to those which use specific sandbox types.

"eval_name" and "variant_name" are always included as fields in sample metadata, and they are common to use for filtering. For example, to run the “easy” variant of a “hello-world” evaluation using only Docker sandboxing, one might write:

task.py

...
dataset = (
        create_agentic_eval_dataset(Path.cwd())
        .filter_by_metadata_field("eval_name", "hello-world")
        .filter_by_metadata_field("variant_name", "easy")
        .filter_by_sandbox("docker")
    )

Or, to only run evaluations from a specific benchmark that involve certain skills, one might write:

task.py

...
dataset = (
        create_agentic_eval_dataset(Path.cwd())
        .filter_by_metadata({
            "benchmark": "cyber-bench",
            "skills.categories": ["web", "rev"]
        })
    )

Partitioning

When running many evaluations and variants, it is often useful to partition them into separate tasks. This enables running them more efficiently (e.g., in parallel as an Eval Set) as well as splitting their logs into manageable files containing relevant groups of samples.

This is the purpose of AgenticEvalDataset’s group_by() method. It partitions the dataset according to a key:

Key	Result
`"all`	Returns the dataset as is, with all samples in one dataset.
`"eval"`	Groups samples by evaluation names, creating one dataset per evaluation (with as many samples as there are variants).
`"variant"`	Creates a dataset for each sample.

For example, if one were running many evaluations and wanting to produce a task and log file for each, they might write:

task.py

from pathlib import Path

from inspect_ai import Task, eval_set, task
from inspect_ai.agent import react
from inspect_ai.scorer import includes
from inspect_ai.tool import bash, python
from inspect_cyber import create_agentic_eval_dataset


@task
def create_task(dataset):
    return Task(
        dataset=dataset,
        solver=react(tools=[bash(), python()]),
        scorer=includes(),
        name=dataset.name,
    )


datasets = create_agentic_eval_dataset(Path.cwd()).group_by("eval")

eval_set([create_task(dataset) for dataset in datasets], log_dir="logs")

Transformations

Occasionally it is useful to apply further processing to samples. This may even include extending one sample into multiple.

AgenticEvalDataset’s flat_map() method enables this by allowing users to apply a custom mapper to the dataset. A mapper takes in a sample and returns any number of transformed samples (i.e., Callable[[Sample], Iterable[Sample]]).