Datasets

Overview

Inspect has native support for reading datasets in the CSV, JSON, and JSON Lines formats, as well as from Hugging Face. In addition, the core dataset interface for the evaluation pipeline is flexible enough to accept data read from just about any source (see the Custom Reader section below for details).

If your data is already in a format amenable for direct reading as an Inspect Sample, reading a dataset is as simple as this:

from inspect_ai.dataset import csv_dataset, json_dataset
dataset1 = csv_dataset("dataset1.csv")
dataset2 = json_dataset("dataset2.json")

Of course, many real-world datasets won’t be so trivial to read. Below we’ll discuss the various ways you can adapt your datasets for use with Inspect.

Dataset Samples

The core data type underlying the use of datasets with Inspect is the Sample, which consists of a required input field and several other optional fields:

Class inspect_ai.dataset.Sample

Field	Type	Description
`input`	`str \| list[ChatMessage]`	The input to be submitted to the model.
`choices`	`list[str] \| None`	Optional. Multiple choice answer list.
`target`	`str \| list[str] \| None`	Optional. Ideal target output. May be a literal value or narrative text to be used by a model grader.
`id`	`str \| None`	Optional. Unique identifier for sample.
`metadata`	`dict[str \| Any] \| None`	Optional. Arbitrary metadata associated with the sample.
`sandbox`	`str \| tuple[str,str]`	Optional. Sandbox environment type (or optionally a tuple with type and config file)
`files`	`dict[str \| str] \| None`	Optional. Files that go along with the sample (copied to sandbox environments).
`setup`	`str \| None`	Optional. Setup script to run for sample (executed within default sandbox environment).

So a CSV dataset with the following structure:

input	target
What cookie attributes should I use for strong security?	secure samesite and httponly
How should I store passwords securely for an authentication system database?	strong hashing algorithms with salt like Argon2 or bcrypt

Can be read directly with:

dataset = csv_dataset("security_guide.csv")

Note that samples from datasets without an id field will automatically be assigned ids based on an auto-incrementing integer starting with 1.

If your samples include choices, then the target should be a capital letter representing the correct answer in choices, see multiple_choice

Files

The files field maps sandbox target file paths to file contents (where contents can be either a filesystem path, a URL, or a string with inline content). For example, to copy a local file named flag.txt into the sandbox path /shared/flag.txt you would use this:

"/shared/flag.txt": "flag.txt"

Files are copied into the default sandbox environment unless their name contains a prefix mapping them into another environment. For example, to copy into the victim sandbox:

"victim:/shared/flag.txt": "flag.txt"

Setup

The setup field contains either a path to a bash setup script (resolved relative to the dataset path) or the contents of a script to execute. Setup scripts are executed with a 5 minute timeout. If you have setup scripts that may take longer than this you should move some of your setup code into the container build setup (e.g. Dockerfile).

Field Mapping

If your dataset contains inputs and targets that don’t use input and target as field names, you can map them into a Dataset using a FieldSpec. This same mechanism also enables you to collect arbitrary additional fields into the Sample metadata bucket. For example:

from inspect_ai.dataset import FieldSpec, json_dataset

dataset = json_dataset(
    "popularity.jsonl",
    FieldSpec(
        input="question",
        target="answer_matching_behavior",
        id="question_id",
        metadata=["label_confidence"],
    ),
)

If you need to do more than just map field names and actually do custom processing of the data, you can instead pass a function which takes a record (represented as a dict) from the underlying file and returns a Sample. For example:

from inspect_ai.dataset import Sample, json_dataset

def record_to_sample(record):
    return Sample(
        input=record["question"],
        target=record["answer_matching_behavior"].strip(),
        id=record["question_id"],
        metadata={
            "label_confidence": record["label_confidence"]
        }
    )

dataset = json_dataset("popularity.jsonl", record_to_sample)

Typed Metadata

If you want a more strongly typed interface to sample metadata, you can define a Pydantic model and use it to both validate and read metadata.

For validation, pass a BaseModel derived class in the FieldSpec. The interface to metadata is read-only so you must also specify frozen=True. For example:

from pydantic import BaseModel

class PopularityMetadata(BaseModel, frozen=True):
    category: str
    label_confidence: float

dataset = json_dataset(
    "popularity.jsonl",
    FieldSpec(
        input="question",
        target="answer_matching_behavior",
        id="question_id",
        metadata=PopularityMetadata,
    ),
)

To read metadata in a typesafe fashion, use the metadata_as() method on Sample or TaskState:

metadata = state.metadata_as(PopularityMetadata)

Note again that the intended semantics of metadata are read-only, so attempting to write into the returned metadata will raise a Pydantic FrozenInstanceError.

If you need per-sample mutable data, use the sample store, which also supports typing using Pydantic models.

Filter and Shuffle

The Dataset class includes filter() and shuffle() methods, as well as support for the slice operator.

To select a subset of the dataset, use filter():

dataset = json_dataset("popularity.jsonl", record_to_sample)
dataset = dataset.filter(
    lambda sample : sample.metadata["category"] == "advanced"
)

To select a subset of records, use standard Python slicing:

dataset = dataset[0:100]

Shuffling is often helpful when you want to vary the samples used during evaluation development. To do this, either use the shuffle() method or the shuffle parameter of the dataset loading functions:

# shuffle method
dataset = dataset.shuffle()

# shuffle on load
dataset = json_dataset("data.jsonl", shuffle=True)

Note that both of these methods optionally support specifying a random seed for shuffling.

Shuffling Choices

When working with datasets that contain multiple-choice options, you can randomize the order of these choices during data loading. The shuffling operation automatically updates any corresponding target values to maintain correct answer mappings.

For datasets that contain choices, you can shuffle the choices when the data is loaded. Shuffling choices will randomly re-order the choices and update the sample’s target value or values to align with the shuffled choices.

There are two ways to shuffle choices:

# Method 1: Using the dataset method
dataset = dataset.shuffle_choices()

# Method 2: During dataset loading
dataset = json_dataset("data.jsonl", shuffle_choices=True)

For reproducible shuffling, you can specify a random seed:

# Using a seed with the dataset method
dataset = dataset.shuffle_choices(seed=42)

# Using a seed during loading
dataset = json_dataset("data.jsonl", shuffle_choices=42)

Hugging Face

Hugging Face Datasets is a library for easily accessing and sharing datasets for machine learning, and features integration with Hugging Face Hub, a repository with a broad selection of publicly shared datasets. Typically datasets on Hugging Face will require specification of which split within the dataset to use (e.g. train, test, or validation) as well as some field mapping. Use the hf_dataset() function to read a dataset and specify the requisite split and field names:

from inspect_ai.dataset import FieldSpec, hf_dataset

dataset=hf_dataset("openai_humaneval", 
  split="test", 
  sample_fields=FieldSpec(
    id="task_id",
    input="prompt",
    target="canonical_solution",
    metadata=["test", "entry_point"]
  )
)

Note that some HuggingFace datasets execute Python code in order to resolve the underlying dataset files. Since this code is run on your local machine, you need to specify trust = True in order to perform the download. This option should only be set to True for repositories you trust and in which you have read the code. Here’s an example of using the trust option (note that it defaults to False if not specified):

dataset=hf_dataset("openai_humaneval", 
  split="test", 
  trust=True,
  ...
)

Under the hood, the hf_dataset() function is calling the load_dataset() function in the Hugging Face datasets package. You can additionally pass arbitrary parameters on to load_dataset() by including them in the call to hf_dataset(). For example hf_dataset(..., cache_dir="~/my-cache-dir").

Amazon S3

Inspect has integrated support for storing datasets on Amazon S3. Compared to storing data on the local file-system, using S3 can provide more flexible sharing and access control, and a more reliable long term store than local files.

Using S3 is mostly a matter of substituting S3 URLs (e.g. s3://my-bucket-name) for local file-system paths. For example, here is how you load a dataset from S3:

json_dataset("s3://my-bucket/dataset.jsonl")

S3 buckets are normally access controlled so require authentication to read from. There are a wide variety of ways to configure your client for AWS authentication, all of which work with Inspect. See the article on Configuring the AWS CLI for additional details.

Chat Messages

The most important data structure within Sample is the ChatMessage. Note that often datasets will contain a simple string as their input (which is then internally converted to a ChatMessageUser). However, it is possible to include a full message history as the input via ChatMessage. Another useful application of ChatMessage is providing multi-modal input (e.g. images).

Class inspect_ai.model.ChatMessage

Field	Type	Description
`role`	`"system" \| "user" \| "assistant" \| "tool"`	Role of this chat message.
`content`	`str \| list[Content]`	The content of the message. Can be a simple string or a list of content parts intermixing text and images.

An input with chat messages in your dataset might will look something like this:

"input": [
  {
    "role": "user",
    "content": "What cookie attributes should I use for strong security?"
  }
]

Note that for this example we wouldn’t normally use a full chat message object (rather we’d just provide a simple string). Chat message objects are more useful when you want to include a system prompt or prime the conversation with “assistant” responses.

Custom Reader

You are not restricted to the built in dataset functions for reading samples. You can also construct a MemoryDataset, and pass that to a task. For example:

from inspect_ai import Task, task
from inspect_ai.dataset import MemoryDataset, Sample
from inspect_ai.scorer import model_graded_fact
from inspect_ai.solver import generate, system_message

dataset=MemoryDataset([
    Sample(
        input="What cookie attributes should I use for strong security?",
        target="secure samesite and httponly",
    )
])

@task
def security_guide():
    return Task(
        dataset=dataset,
        solver=[system_message(SYSTEM_MESSAGE), generate()],
        scorer=model_graded_fact(),
    )

So if the built in dataset functions don’t meet your needs, you can create a custom function that yields a MemoryDatasetand pass those directly to your Task.