Eval Suites

Overview

Most of the examples in the documentation run a single evaluation task by either passing a script name to inspect eval or by calling the eval() function directly. While this is a good workflow for developing evaluations, once you’ve settled on a group of evaluations you want to run frequently, you’ll typically want to run them all together as an evaluation suite. Below we’ll cover the various tools and techniques available to create eval suites.

Prerequisites

Before describing the various ways you can define and run eval suites, we’ll cover some universal prerequisites related to logging and task definitions.

Logging Context

A precursor to running any evaluation suite is to establish an isolated logging context for it. This enables you to enumerate and analyse all of the eval logs in the suite as a cohesive whole (rather than having them intermixed with the results of other runs). Generally, you’ll do this by setting the INSPECT_LOG_DIR prior to running the suite. For example:

export INSPECT_LOG_DIR=./security-mistral_04-07-2024
export INSPECT_EVAL_MODEL=mistral/mistral-large-latest
inspect eval security

This will group all of the log files for the suite, enabling you to call list_eval_logs() to collect and analyse all of the tasks.
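
For example, here is a minimal sketch of collecting the logs for the suite once it has run (the directory name matches the INSPECT_LOG_DIR set above):

from inspect_ai.log import list_eval_logs

# collect all of the logs written to the suite's log directory
logs = list_eval_logs("./security-mistral_04-07-2024")
print(f"{len(logs)} logs in suite")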

Task Definitions

Whether you are working on evaluations in Python scripts or Jupyter Notebooks, you likely have a lot of code that looks roughly like this:

@task
def security_guide():
    return Task(
        dataset=example_dataset("security_guide"),
        plan=[
          system_message(SYSTEM_MESSAGE),
          generate()
        ],
        scorer=model_graded_fact(),
    )

eval(security_guide, model="google/gemini-1.0-pro")

This is a natural and convenient way to run evals during development, but in a task suite you’ll want inspect eval to do the execution rather than direct calls to eval() (as this allows for varying the model, generation config, and task parameters dynamically). You can keep your existing code more or less as-is, but you’ll just want to add one line above eval():

if __name__ == "__main__":
    eval(security_guide, model="google/gemini-1.0-pro")

Doing this allows your source file to be both a Python script that is convenient to run during development and a Python module from which tasks can be read without executing the eval. There is no real downside to this, and in general it’s a good way to write all of your eval scripts and notebooks (see the docs on __main__ for additional details).
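
Putting this together (and adding the imports that the snippet above assumes; the SYSTEM_MESSAGE placeholder stands in for your actual system prompt), a complete eval script might look roughly like this:

from inspect_ai import Task, eval, task
from inspect_ai.dataset import example_dataset
from inspect_ai.scorer import model_graded_fact
from inspect_ai.solver import generate, system_message

SYSTEM_MESSAGE = "..."

@task
def security_guide():
    return Task(
        dataset=example_dataset("security_guide"),
        plan=[
          system_message(SYSTEM_MESSAGE),
          generate()
        ],
        scorer=model_graded_fact(),
    )

if __name__ == "__main__":
    eval(security_guide, model="google/gemini-1.0-pro")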

Use Cases

Multiple Tasks in a File

The simplest possible eval suite would be multiple tasks defined in a single source file. Consider this source file (ctf.py) with two tasks in it:

@task
def jeopardy():
  return Task(
    ...
  )

@task
def attack_defense():
  return Task(
    ...
  )

We can run both of these tasks with the following command (note that for this and the remainder of the examples we’ll assume that you have set the INSPECT_EVAL_MODEL environment variable so you don’t need to pass the --model argument explicitly):

$ inspect eval ctf.py 

Note we could also run the tasks individually as follows (e.g. for development and debugging):

$ inspect eval ctf.py@jeopardy
$ inspect eval ctf.py@attack_defense

Multiple Tasks in a Directory

Next, let’s consider multiple tasks in a directory. Imagine you have the following directory structure, where jeopardy.py and attack_defense.py each have one or more @task functions defined:

security/
  import.py
  analyze.py
  jeopardy.py
  attack_defense.py

Here is the listing of all the tasks in the suite:

$ inspect list tasks security
jeopardy.py@crypto
jeopardy.py@decompile
jeopardy.py@packet
jeopardy.py@heap_trouble
attack_defense.py@saar
attack_defense.py@bank
attack_defense.py@voting
attack_defense.py@dns

You can run this eval suite as follows:

$ inspect eval security

Note that some of the files in this directory don’t contain evals (e.g. import.py and analyze.py). These files are not read or executed by inspect eval (which only executes files that contain @task definitions).

If we wanted to run more than one directory we could do so by just passing multiple directory names. For example:

$ inspect eval security persuasion

Eval Function

Note that all of the above example uses of inspect eval apply equally to the eval() function. In the context of the above, all of these statements would work as expected:

eval("ctf.py")
eval("ctf.py@jeopardy")
eval("ctf.py@attack_defense")

eval("security")
eval(["security", "persuasion"])

Listing and Filtering

Recursive Listings

Note that directories or expanded globs of directory names passed to eval are recursively scanned for tasks. So you could have a very deep hierarchy of directories, with a mix of task and non-task scripts, and the eval command or function will discover all of the tasks automatically.

There are some rules for how recursive directory scanning works that you should keep in mind:

  1. Source files and directories that start with . or _ are not scanned for tasks.
  2. Directories named env, venv, and tests are not scanned for tasks.
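
To illustrate, in the hypothetical directory tree below (the file names are just for illustration), only ctf.py and web/phishing.py would be scanned for tasks:

security/
  ctf.py            # scanned
  _shared.py        # not scanned (starts with _)
  .scratch.py       # not scanned (starts with .)
  tests/
    test_ctf.py     # not scanned (tests directory)
  web/
    phishing.py     # scanned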

Attributes and Filters

Eval suites will sometimes be defined purely by directory structure, but there will be cross-cutting concerns that are also used to filter what is run. For example, you might want to define some tasks as part of a “light” suite that is less expensive and time consuming to run. This is supported by adding attributes to task decorators. For example:

@task(light=True)
def jeopardy():
  return Task(
    ...
  )

Given this, you could list all of the light tasks in security and pass them to eval() as follows:

light_suite = list_tasks(
  "security", 
  filter = lambda task: task.attribs.get("light") is True
)
logs = eval(light_suite)

Note that the inspect list tasks command can also be used to enumerate tasks in plain text or JSON (use one or more -F options if you want to filter tasks):

$ inspect list tasks security
$ inspect list tasks security --json
$ inspect list tasks security --json -F light=true

One important thing to keep in mind when using attributes to filter tasks is that neither inspect list tasks nor the underlying list_tasks() function executes code when scanning for tasks (rather, they parse it). This means that if you want to use a task attribute in a filtering expression it needs to be a constant (rather than the result of a function call). For example:

# this is valid for filtering expressions
@task(light=True)
def jeopardy():
  ...

# this is NOT valid for filtering expressions
@task(light=light_enabled("ctf"))
def jeopardy():
  ...

Errors and Retries

If a runtime error occurs during an evaluation, it is caught, logged, and reported, and then the eval() function returns as normal. The returned EvalLog has a status field which can be used to see which tasks need to be retried, and the failed log file can be passed directly to eval_retry(), for example:

# list the security suite and run it
task_suite = list_tasks("security")
eval_logs = eval(task_suite)

# check for failed evals and retry
error_logs = [log for log in eval_logs if log.status != "success"]
eval_retry(error_logs)

Note that eval_retry() does not overwrite previous log files, but rather creates a new one (preserving the task_id from the original file). In addition, completed samples from the original file are preserved and copied to the new eval.

Retry Workflow

If you want to create a task suite supervisor that can robustly retry failed evaluations until all work is completed, we recommend the following approach:

  1. For a given suite of tasks, provision a dedicated log directory where all work will be recorded (you might track this independently in a supervisor database so retries can happen “later” as opposed to immediately after the first run).

  2. Run the task suite.

  3. After the initial run (and perhaps after a delay), query the log directory for retryable tasks, and then execute those retries (possibly using a lower max_connections if rate limiting was the source of failures).

  4. Repeat (3) as required until there are no more retryable tasks.

  5. Collect up all of the successful task logs from the log directory for analysis.

Here is a somewhat simplified version of the code required to implement this workflow. We start by creating a log directory (imagine we have a create_log_dir() function that will provision a new log_dir with a unique name) and running our evals (contained in a directory named “suite”):

from inspect_ai import eval

# create a new log dir with a unique path/name
log_dir = create_log_dir()

# run the suite against two models (using the log_dir)
for model in ["openai/gpt-4", "google/gemini-1.0"]:
    eval("suite", model=model, log_dir=log_dir)

After this first pass, all of the evals may have completed successfully, or there could be some errors. We use the retryable_eval_logs() function to filter the list of logs in the directory to those which need a retry to complete. After the retries, there still could be failures, so we run in a loop until there are no more retryable logs:

from inspect_ai import eval_retry
from inspect_ai.log import list_eval_logs, retryable_eval_logs

retryable = retryable_eval_logs(list_eval_logs(log_dir))
while len(retryable) > 0:
    eval_retry(retryable, log_dir=log_dir)
    retryable = retryable_eval_logs(list_eval_logs(log_dir))

This is oversimplified because we’d likely also want to (a) Wait for some time between retries; (b) Have a maximum number of iterations before giving up; and (c) Analyse the errors and try to remedy (e.g. reduce max_connections for rate limit errors).
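
As a rough sketch of those refinements (the delay, attempt limit, and max_connections values here are arbitrary), the retry loop might look something like this:

import time

from inspect_ai import eval_retry
from inspect_ai.log import list_eval_logs, retryable_eval_logs

MAX_ATTEMPTS = 5        # give up after this many retry passes
RETRY_DELAY = 5 * 60    # wait 5 minutes between passes

attempts = 0
retryable = retryable_eval_logs(list_eval_logs(log_dir))
while len(retryable) > 0 and attempts < MAX_ATTEMPTS:
    time.sleep(RETRY_DELAY)
    # use a lower max_connections in case the failures were rate-limit related
    eval_retry(retryable, log_dir=log_dir, max_connections=5)
    retryable = retryable_eval_logs(list_eval_logs(log_dir))
    attempts += 1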

The retryable_eval_logs() function takes a log listing and filters it as follows:

  1. Finds all logs with status "error" or "cancelled"

  2. Checks to see if another log with the same task_id has a status of "success" (in that case, discard the log from the retryable pool).

  3. For each retryable log not found to have been subsequently completed, takes the most recent one associated with the task_id (to handle multiple retries).

When retryable_eval_logs() returns an empty list, it indicates that all of the tasks have run successfully. At this point, we’ll likely want to collect up all of the successful logs (note that there will still be logs with errors in the log_dir as logs aren’t overwritten on retry). We can do this with a filter as follows:

logs = list_eval_logs(
  log_dir=log_dir,
  filter=lambda log: log.status == "success"
)
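
From here you might read each of the successful logs for further analysis. For example (a minimal sketch that prints the task each log belongs to):

from inspect_ai.log import read_eval_log

for log_file in logs:
    log = read_eval_log(log_file)
    print(log.eval.task, log.status)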

Sample Preservation

When retrying a log file, Inspect will attempt to re-use completed samples from the original task. This can result in substantial time and cost savings compared to starting over from the beginning.

IDs and Shuffling

An important constraint on the ability to re-use completed samples is matching them up correctly with samples in the new task. To do this, Inspect requires stable unique identifiers for each sample. This can be achieved in one of two ways:

  1. Samples can have an explicit id field which contains the unique identifier; or

  2. You can rely on Inspect’s assignment of an auto-incrementing id for samples; however, this will not work correctly if your dataset is shuffled. Inspect will log a warning and not re-use samples if it detects that the dataset.shuffle() method was called, but if you are shuffling by some other means this automatic safeguard won’t be applied.

If dataset shuffling is important to your evaluation and you want to preserve samples for retried tasks, then you should include an explicit id field in your dataset.
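
For example, here is a sample with an explicit id (the values are purely illustrative):

from inspect_ai.dataset import Sample

# explicit, stable id so the completed sample can be matched up on retry
sample = Sample(
    id="sample-001",
    input="What is the capital of France?",
    target="Paris"
)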

Max Samples

Another consideration is max_samples, which is the maximum number of samples to run concurrently within a task. Larger numbers of concurrent samples will result in higher throughput, but will also result in completed samples being written less frequently to the log file, and consequently fewer total recoverable samples in the case of an interrupted task.

By default, Inspect sets the value of max_samples to max_connections + 1, ensuring that the model API is always fully saturated (note that it would rarely make sense to set it lower than max_connections). The default max_connections is 10, which will typically result in samples being written to the log frequently. On the other hand, setting a very large max_connections (e.g. 100 max_connections for a dataset with 100 samples) may result in very few recoverable samples in the case of an interruption.
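
For example, you might deliberately cap concurrency when running a suite so that completed samples are written to the log more frequently (the specific values here are illustrative):

eval("security", max_connections=10, max_samples=11)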

Note also that when using Tool Environments, the tool environment provider may place an additional cap on the default max_samples (for example, the Docker provider limits the default max_samples to no more than 25).