Dataset Processing
The AgenticEvalDataset
returned by create_agentic_eval_dataset()
provides methods for filtering samples, partitioning them into separate datasets, and applying custom transformations to them.
Filtering
It is often useful to filter samples by their attributes (see Verifying Solvability). AgenticEvalDataset
provides four methods for doing so.
Method | Description |
---|---|
filter |
Filter samples by a predicate (i.e., Callable[[Sample], bool] ). |
filter_by_metadata_field |
Filter samples based on whether a specified metadata field matches a specific value or any value from a provided Iterable of values. Nested fields can be accessed using dot notation (e.g., “skills.pwn.difficulty”). If the field is not present in the sample metadata, it is considered to be None and filtered out if the target value is not None . |
filter_by_metadata |
Same as above, but for filtering by a dictionary of metadata keys and values. |
filter_by_sandbox |
Filter samples to those which use specific sandbox types. |
"eval_name"
and "variant_name"
are always included as fields in sample metadata, and they are common to use for filtering. For example, to run the “easy” variant of a “hello-world” evaluation using only Docker sandboxing, one might write:
task.py
...= (
dataset
create_agentic_eval_dataset(Path.cwd())"eval_name", "hello-world")
.filter_by_metadata_field("variant_name", "easy")
.filter_by_metadata_field("docker")
.filter_by_sandbox( )
Or, to only run evaluations from a specific benchmark that involve certain skills, one might write:
task.py
...= (
dataset
create_agentic_eval_dataset(Path.cwd())
.filter_by_metadata({"benchmark": "cyber-bench",
"skills.categories": ["web", "rev"]
}) )
Partitioning
When running many evaluations and variants, it is often useful to partition them into separate tasks. This enables running them more efficiently (e.g., in parallel as an Eval Set) as well as splitting their logs into manageable files containing relevant groups of samples.
This is the purpose of AgenticEvalDataset
’s group_by()
method. It partitions the dataset according to a key
:
Key | Result |
---|---|
"all |
Returns the dataset as is, with all samples in one dataset. |
"eval" |
Groups samples by evaluation names, creating one dataset per evaluation (with as many samples as there are variants). |
"variant" |
Creates a dataset for each sample. |
For example, if one were running many evaluations and wanting to produce a task and log file for each, they might write:
task.py
from pathlib import Path
from inspect_ai import Task, eval_set, task
from inspect_ai.agent import react
from inspect_ai.scorer import includes
from inspect_ai.tool import bash, python
from inspect_cyber import create_agentic_eval_dataset
@task
def create_task(dataset):
return Task(
=dataset,
dataset=react(tools=[bash(), python()]),
solver=includes(),
scorer=dataset.name,
name
)
= create_agentic_eval_dataset(Path.cwd()).group_by("eval")
datasets
for dataset in datasets], log_dir="logs") eval_set([create_task(dataset)
Transformations
Occasionally it is useful to apply further processing to samples. This may even include extending one sample into multiple.
AgenticEvalDataset
’s flat_map()
method enables this by allowing users to apply a custom mapper
to the dataset. A mapper
takes in a sample and returns any number of transformed samples (i.e., Callable[[Sample], Iterable[Sample]]
).