Extensions

Overview

There are several ways to extend Inspect to integrate with systems not directly supported by the core package. These include:

  1. Model APIs (model hosting services, local inference engines, etc.)

  2. Tool Environments (local or cloud container runtimes)

  3. Storage Systems (for datasets, prompts, and evaluation logs)

For each of these, you can create an extension within a Python package, and then use it without any special registration with Inspect (this is done via setuptools entry points).

Model APIs

You can add a model provider by deriving a new class from ModelAPI and adding the @modelapi decorator to it. For example:

@modelapi(name="custom")
class CustomModelAPI(ModelAPI):
    def __init__(
        self, 
        model_name: str,
        base_url: str | None = None,
        api_key: str | None = None,
        config: GenerateConfig = GenerateConfig(),
        **model_args: dict[str,Any]
    ) -> None:
        super().__init__(model_name, base_url, api_key, config)
  
    async def generate(
        self,
        input: list[ChatMessage],
        tools: list[ToolInfo],
        tool_choice: ToolChoice,
        config: GenerateConfig,
    ) -> ModelOutput:
        ...

The __init__() method must call the super().__init__() method, and typically instantiates the model client library.

The generate() method handles interacting with the model, converting inspect messages, tools, and config into model native data structures. In addition, there are some optional properties you can override to specify various behaviours and constraints (default max tokens and connections, identifying rate limit errors, whether to collapse consecutive user and/or assistant messages, etc.).

See the ModelAPI source code for further documentation on these properties. See the implementation of the built-in model providers for additional insight on building a custom provider.

Model Registration

If you are publishing a custom model API within a Python package, you should register an inspect_ai setuptools entry point. This will ensure that inspect loads your extension before it attempts to resolve a model name that uses your provider.

For example, if your package was named inspect_package and your model provider was exported from a source file named inspect_extensions.py at the root of your package, you would register it like this in pyproject.toml:

[project.entry-points.inspect_ai]
inspect_package = "inspect_package.inspect_extensions"
[tool.poetry.plugins.inspect_ai]
inspect_package = "inspect_package.inspect_extensions"

Model Usage

Once you’ve created the class, decorated it with @modelapi as shown above, and registered it, then you can use it as follows:

inspect eval ctf.py --model custom/my-model

Where my-model is the name of some model supported by your provider (this will be passed to __init()__ in the model_name argument).

You can also reference it from within Python calls to get_model() or eval():

# get a model instance
model = get_model("custom/my-model")

# run an eval with the model
eval(math, model = "custom/my-model")

Tool Environments

Tool Environments provide a mechanism for sandboxing execution of tool code as well as providing more sophisticated infrastructure (e.g. creating network hosts for a cybersecurity eval). Inspect comes with two tool environments built in:

Environment Type Description
local Run tool_environment() methods in the same file system as the running evaluation (should only be used if you are already running your evaluation in another sandbox).
docker Run tool_environment() methods within a Docker container

To create a custom tool environment, derive a class from ToolEnvironment, implement the required static and instance methods, and add the @toolenv decorator to it. For example:

@toolenv(name="podman")
class PodmanToolEnvironment(ToolEnvironment):

    @classmethod
    async def task_init(
        cls, task_name: str, config: str | None
    ) -> None:
        ...

    @classmethod
    async def sample_init(
        cls, 
        task_name: str, 
        config: str | None, 
        metadata: dict[str, str]
    ) -> dict[str, ToolEnvironment]:
        ...

    @classmethod
    async def sample_cleanup(
        cls,
        task_name: str,
        config: str | None,
        environments: dict[str, ToolEnvironment],
        interrupted: bool,
    ) -> None:
        ...

    @classmethod
    async def task_cleanup(
        cls, task_name: str, config: str | None, cleanup: bool
    ) -> None:
       ...

    @classmethod
    async def cli_cleanup(cls, id: str | None) -> None:
        ...

    async def exec(
        self,
        cmd: list[str],
        input: str | bytes | None = None,
        env: dict[str, str] = {},
        timeout: int | None = None,
    ) -> ExecResult[str]:
        ...

    async def write_file(
        self, file: str, contents: str | bytes
    ) -> None:
        ...

    @overload
    async def read_file(
        self, file: str, text: Literal[True] = True
    ) -> str: 
        ...
      
    @overload
    async def read_file(
        self, file: str, text: Literal[False]
    ) -> bytes: 
        ...
      
    async def read_file(
        self, file: str, text: bool = True
    ) -> Union[str | bytes]:
        ...

The class methods take care of various stages of initialisation, setup, and teardown:

Method Lifecycle Purpose
task_init() Called at the beginning of each Task. Expensive initialisation operations (e.g. pulling or building images)
sample_init() Called at the beginning of each Sample. Create ToolEnvironment instances for the sample.
sample_cleanup() Called at the end of each Sample Cleanup ToolEnvironment instances for the sample.
task_cleanup() Called at the end of each Task. Last chance handler for any resources not yet cleaned up (see also discussion below).
cli_cleanup() Called via inspect toolenv cleanup CLI invoked manual cleanup of resources created by this ToolEnvironment.
max_samples() Called at startup Provide a default max_samples (used to cap the default, explicit max_samples will override this).

In the case of parallel execution of a group of tasks that share a working directory and tool environment, the task_init() and task_cleanup() functions may be called once for the entire group as a performance optimisation.

The task_cleanup() has a number of important functions:

  1. There may be global resources that are not tied to samples that need to be cleaned up.
  2. It’s possible that sample_cleanup() will be interrupted (e.g. via a Ctrl+C) during execution. In that case its resources are still not cleaned up.
  3. The sample_cleanup() function might be long running, and in the case of error or interruption you want to provide explicit user feedback on the cleanup in the console (which isn’t possible when cleanup is run “inline” with samples). An interrupted flag is passed to sample_cleanup() which allows for varying behaviour for this scenario.
  4. Cleanup may be disabled (e.g. when the user passes --no-toolenv-cleanup) in which case it should print container IDs and instructions for cleaning up after the containers are no longer needed.

To implement task_cleanup() properly, you’ll likely need to track running environments using a per-coroutine ContextVar. The DockerToolEnvironment provides an example of this. Note that the cleanup argument passed to task_cleanup() indicates whether to actually clean up (it would be False if --no-toolenv-cleanup was passed to inspect eval). In this case you might want to print a list of the resources that were not cleaned up and provide directions on how to clean them up manually.

The cli_cleanup() function is a global cleanup handler that should be able to do the following:

  1. Cleanup all environments created by this provider (corresponds to e.g. inspect toolenv cleanup docker at the CLI).
  2. Cleanup a single environment created by this provider (corresponds to e.g. inspect toolenv cleanup docker <id> at the CLI).

The task_cleanup() function will typically print out the information required to invoke cli_cleanup() when it is invoked with cleanup = False. Try invoking the DockerToolEnvironmet with --no-toolenv-cleanup to see an example.

The ToolEnvironment instance methods provide access to process execution and file input/output within the environment. A few notes on implementing these methods:

  1. The exec() method currently only handles text output. If a call results in binary output then a UnicodeDecodeError will be raised. Tool environments should catch this and raise a ToolError.

  2. The read_file() method raise a FileNotFoundError if the specified file does not exist in the tool environment, as tools calling read_file() will often want to catch the FileNotFoundError and re-throw a ToolError (since models will frequently attempt to read files that do not exist).

The best way to learn about writing tool environments is to look at the source code for the built in environments, LocalToolEnvironment and DockerToolEnvironment.

Environment Registration

You should build your custom tool environment within a Python package, and then register an inspect_ai setuptools entry point. This will ensure that inspect loads your extension before it attempts to resolve a tool environment that uses your provider.

For example, if your package was named inspect_package and your tool environment provider was exported from a source file named inspect_extensions.py at the root of your package, you would register it like this in pyproject.toml:

[project.entry-points.inspect_ai]
inspect_package = "inspect_package.inspect_extensions"
[tool.poetry.plugins.inspect_ai]
inspect_package = "inspect_package.inspect_extensions"

Environment Usage

Once the package is installed, you can refer to the custom tool environment the same way you’d refer to a built in tool environment. For example:

Task(
    ...,
    tool_environment="podman"
)

Tool environments can be invoked with an optional configuration parameter, which is passed as the config argument to the startup() and setup() methods. In Python this is done with a tuple

Task(
    ...,
    tool_environment=("podman","config.yaml")
)

Storage

Filesystems with fsspec

Datasets, prompt templates, and evaluation logs can be stored using either the local filesystem or a remote filesystem. Inspect uses the fsspec package to read and write files, which provides support for a wide variety of filesystems, including:

Support for Amazon S3 is built in to Inspect via the s3fs package. Other filesystems may require installation of additional packages. See the list of built in filesystems and other known implementations for all supported storage back ends.

See Custom Filesystems below for details on implementing your own fsspec compatible filesystem as a storage back-end.

Filesystem Functions

The following Inspect API functions use fsspec:

  • resource() for reading prompt templates and other supporting files.

  • csv_dataset() and json_dataset() for reading datasets (note that files referenced within samples can also use fsspec filesystem references).

  • list_eval_logs() , read_eval_log(), write_eval_log(), and retryable_eval_logs().

For example, to use S3 you would prefix your paths with s3://:

# read a prompt template from s3
prompt_template("s3://inspect-prompts/ctf.txt")

# read a dataset from S3
csv_dataset("s3://inspect-datasets/ctf-12.csv")

# read eval logs from S3
list_eval_logs("s3://my-s3-inspect-log-bucket")

Custom Filesystems

See the fsspec developer documentation for details on implementing a custom filesystem. Note that if your implementation is only for use with Inspect, you need to implement only the subset of the fsspec API used by Inspect. The properties and methods used by Inspect include:

  • sep
  • open()
  • makedirs()
  • info()
  • created()
  • exists()
  • ls()
  • walk()
  • unstrip_protocol()
  • invalidate_cache()

As with Model APIs and Tool Environments, fsspec filesystems should be registered using a setuptools entry point. For example, if your package is named inspect_package and you have implemented a myfs:// filesystem using the MyFs class exported from the root of the package, you would register it like this in pyproject.toml:

[project.entry-points."fsspec.specs"]
myfs = "inspect_package:MyFs"
[tool.poetry.plugins."fsspec.specs"]
myfs = "inspect_package:MyFs"

Once this package is installed, you’ll be able to use myfs:// with Inspect without any further registration.