Eval Suites
Overview
Most of the examples in the documentation run a single evaluation task by either passing a script name to inspect eval
or by calling the eval()
function directly. While this is a good workflow for developing evaluations, once you’ve settled on a group of evaluations you want to run frequently, you’ll typically want to run them all together as an evaluation suite. Below we’ll cover the various tools and techniques available to create eval suites.
Prerequisites
Before describing the various ways you can define and run eval suites, we’ll cover some universal prerequisites related to logging and task definitions.
Logging Context
A precursor to running any evaluation suite is to establish an isolated logging context for it. This enables you to enumerate and analyse all of the eval logs in the suite as a cohesive whole (rather than having them intermixed with the results of other runs). Generally, you’ll do this by setting the INSPECT_LOG_DIR
prior to running the suite. For example:
export INSPECT_LOG_DIR=./security-mistral_04-07-2024
export INSPECT_EVAL_MODEL=mistral/mistral-large-latest
inspect eval security
This will group all of the log files for the suite, enabling you to call list_eval_logs()
to collect and analyse all of the tasks.
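In Python, list_eval_logs() is the way to enumerate a suite's logs. Purely as an illustration of what the isolated log directory buys you, here is a minimal stand-in sketch (plain pathlib, not Inspect's API, and the .json extension is an assumption) that treats every log file in the suite's directory as one cohesive set:

```python
import tempfile
from pathlib import Path

def logs_in_dir(log_dir):
    # enumerate every log file written under the suite's dedicated
    # directory, treating the suite's logs as one cohesive set
    return sorted(p.name for p in Path(log_dir).glob("*.json"))

# build a throwaway log dir with two stand-in log files (real Inspect
# logs would be written here by running the suite)
demo = Path(tempfile.mkdtemp())
(demo / "security_guide.json").write_text("{}")
(demo / "prompt_injection.json").write_text("{}")

print(logs_in_dir(demo))  # ['prompt_injection.json', 'security_guide.json']
```

Because nothing else writes into this directory, no filtering is needed to separate the suite's results from other runs.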
Task Definitions
Whether you are working on evaluations in Python scripts or Jupyter Notebooks, you likely have a lot of code that looks roughly like this:
@task
def security_guide():
    return Task(
        dataset=example_dataset("security_guide"),
        plan=[
            system_message(SYSTEM_MESSAGE),
            generate()
        ],
        scorer=model_graded_fact(),
    )

eval(security_guide, model="google/gemini-1.0-pro")
This is a natural and convenient way to run evals during development, but in a task suite you’ll want inspect eval
to do the execution rather than direct calls to eval()
(as this allows for varying the model, generation config, and task parameters dynamically). You can keep your existing code more or less as-is, but you’ll just want to add one line above eval()
:
if __name__ == "__main__":
    eval(security_guide, model="google/gemini-1.0-pro")
Doing this allows your source file to be both a Python script that is convenient to run during development and a Python module from which tasks can be read without executing the eval. There is no real downside to this, and it’s a good way in general to write all of your eval scripts and notebooks (see the docs on __main__ for additional details).
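To see why the guard works, here is a small self-contained sketch (not Inspect code) that executes the same module source twice, once with the __name__ an import would receive and once simulating a direct script run: the eval call only fires in the __main__ case.

```python
import types

MODULE_SRC = """
evals_run = []

def security_guide():
    return "task"

if __name__ == "__main__":
    # stands in for eval(security_guide, model=...)
    evals_run.append(security_guide())
"""

def load(src, module_name):
    # compile and execute the source with the given __name__,
    # mimicking the difference between `import` and `python file.py`
    mod = types.ModuleType(module_name)
    exec(compile(src, "<demo>", "exec"), mod.__dict__)
    return mod

imported = load(MODULE_SRC, "security_guide")  # read for tasks only
as_script = load(MODULE_SRC, "__main__")       # run directly

print(imported.evals_run)   # []
print(as_script.evals_run)  # ['task']
```

This is exactly what lets inspect eval read @task definitions from your file without triggering the development-time eval() call.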
Use Cases
Multiple Tasks in a File
The simplest possible eval suite would be multiple tasks defined in a single source file. Consider this source file (ctf.py
) with two tasks in it:
@task
def jeopardy():
    return Task(
        ...
    )

@task
def attack_defense():
    return Task(
        ...
    )
We can run both of these tasks with the following command (note for this and the remainder of examples we’ll assume that you have set an INSPECT_EVAL_MODEL
environment variable so you don’t need to pass the --model
argument explicitly):
$ inspect eval ctf.py
Note we could also run the tasks individually as follows (e.g. for development and debugging):
$ inspect eval ctf.py@jeopardy
$ inspect eval ctf.py@attack_defense
Multiple Tasks in a Directory
Next, let’s consider multiple tasks in a directory. Imagine you have the following directory structure, where jeopardy.py
and attack_defense.py
each have one or more @task
functions defined:
security/
import.py
analyze.py
jeopardy.py
attack_defense.py
Here is the listing of all the tasks in the suite:
$ inspect list tasks security
jeopardy.py@crypto
jeopardy.py@decompile
jeopardy.py@packet
jeopardy.py@heap_trouble
jeopardy.py@saar
attack_defense.py@bank
attack_defense.py@voting
attack_defense.py@dns
You can run this eval suite as follows:
$ inspect eval security
Note that some of the files in this directory don’t contain evals (e.g. import.py
and analyze.py
). These files are not read or executed by inspect eval
(which only executes files that contain @task
definitions).
If we wanted to run more than one directory we could do so by just passing multiple directory names. For example:
$ inspect eval security persuasion
Eval Function
Note that all of the above example uses of inspect eval apply equally to the eval() function. In the context of the above, all of these statements would work as expected:
eval("ctf.py")
eval("ctf.py@jeopardy")
eval("ctf.py@attack_defense")
eval("security")
eval(["security", "persuasion"])
Listing and Filtering
Recursive Listings
Note that directories or expanded globs of directory names passed to eval
are recursively scanned for tasks. So you could have a very deep hierarchy of directories, with a mix of task and non task scripts, and the eval
command or function will discover all of the tasks automatically.
There are some rules for how recursive directory scanning works that you should keep in mind:
- Source files and directories that start with . or _ are not scanned for tasks.
- Directories named env, venv, and tests are not scanned for tasks.
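The rules above can be sketched as a simple predicate (an illustrative re-implementation, not Inspect's actual scanner):

```python
# directory names that are never scanned for tasks
SKIP_DIRS = {"env", "venv", "tests"}

def should_scan(name, is_dir=False):
    # files and directories starting with "." or "_" are skipped
    if name.startswith(".") or name.startswith("_"):
        return False
    # certain well-known directory names are skipped too
    if is_dir and name in SKIP_DIRS:
        return False
    return True

print(should_scan("jeopardy.py"))           # True
print(should_scan("_helpers.py"))           # False
print(should_scan(".hidden", is_dir=True))  # False
print(should_scan("venv", is_dir=True))     # False
```

Applying this predicate recursively at every level of the hierarchy yields the discovery behavior described above.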
Attributes and Filters
Eval suites will sometimes be defined purely by directory structure, but there will be cross-cutting concerns that are also used to filter what is run. For example, you might want to define some tasks as part of a “light” suite that is less expensive and time consuming to run. This is supported by adding attributes to task decorators. For example:
@task(light=True)
def jeopardy():
    return Task(
        ...
    )
Given this, you could list all of the light tasks in security
and pass them to eval()
as follows:
light_suite = list_tasks(
    "security",
    filter = lambda task: task.attribs.get("light") is True
)
logs = eval(light_suite)
Note that the inspect list tasks
command can also be used to enumerate tasks in plain text or JSON (use one or more -F
options if you want to filter tasks):
$ inspect list tasks security
$ inspect list tasks security --json
$ inspect list tasks security --json -F light=true
One important thing to keep in mind when using attributes to filter tasks is that neither inspect list tasks
nor the underlying list_tasks()
function executes code when scanning for tasks (rather, they parse it). This means that if you want to use a task attribute in a filtering expression it needs to be a constant (rather than the result of a function call). For example:
# this is valid for filtering expressions
@task(light=True)
def jeopardy():
...
# this is NOT valid for filtering expressions
@task(light=light_enabled("ctf"))
def jeopardy():
...
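Since tasks are parsed rather than executed, a constant attribute can be read straight out of the decorator in the syntax tree. The following sketch (using the standard ast module; it is not Inspect's implementation) shows why light=True is recoverable while light=light_enabled("ctf") is not:

```python
import ast

SRC = '''
@task(light=True)
def jeopardy():
    ...

@task(light=light_enabled("ctf"))
def attack_defense():
    ...
'''

def task_attribs(source):
    # walk the parsed tree and collect decorator keyword arguments,
    # keeping only values that are literal constants
    attribs = {}
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.FunctionDef):
            for dec in node.decorator_list:
                if isinstance(dec, ast.Call):
                    for kw in dec.keywords:
                        if isinstance(kw.value, ast.Constant):
                            attribs.setdefault(node.name, {})[kw.arg] = kw.value.value
    return attribs

print(task_attribs(SRC))  # {'jeopardy': {'light': True}}
```

The function-call attribute on attack_defense has no value at parse time, so a purely static scan cannot recover it.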
Errors and Retries
If a runtime error occurs during an evaluation, it is caught, logged, and reported, and then the eval()
function returns as normal. The returned EvalLog
has a status
field which can be used to see which tasks need to be retried, and the failed log file can be passed directly to eval_retry()
, for example:
# list the security suite and run it
task_suite = list_tasks("security")
eval_logs = eval(task_suite)

# check for failed evals and retry
error_logs = [log for log in eval_logs if log.status != "success"]
eval_retry(error_logs)
Note that eval_retry()
does not overwrite previous log files, but rather creates a new one (preserving the task_id
from the original file). In addition, completed samples from the original file are preserved and copied to the new eval.
Retry Workflow
If you want to create a task suite supervisor that can robustly retry failed evaluations until all work is completed, we recommend the following approach:
1. For a given suite of tasks, provision a dedicated log directory where all work will be recorded (you might track this independently in a supervisor database so retries can happen “later” as opposed to immediately after the first run).
2. Run the task suite.
3. After the initial run (and perhaps after a delay), query the log directory for retryable tasks, and then execute those retries (possibly using a lower max_connections if rate limiting was the source of failures).
4. Repeat (3) as required until there are no more retryable tasks.
5. Collect up all of the successful task logs from the log directory for analysis.
Here is a somewhat simplified version of the code required to implement this workflow. We start by creating a log directory (imagine we have a create_log_dir()
function that will provision a new log_dir
with a unique name) and running our evals (contained in a directory named “suite”):
from inspect_ai import eval

# create a new log dir with a unique path/name
log_dir = create_log_dir()

# run the suite against two models (using the log_dir)
for model in ["openai/gpt-4", "google/gemini-1.0"]:
    eval("suite", model=model, log_dir=log_dir)
After this first pass, all of the evals may have completed successfully, or there could be some errors. We use the retryable_eval_logs()
function to filter the list of logs in the directory to those which need a retry to complete. After the retries, there still could be failures, so we run in a loop until there are no more retryable logs:
from inspect_ai import eval_retry
from inspect_ai.log import list_eval_logs, retryable_eval_logs

retryable = retryable_eval_logs(list_eval_logs(log_dir))
while len(retryable) > 0:
    eval_retry(retryable, log_dir=log_dir)
    retryable = retryable_eval_logs(list_eval_logs(log_dir))
This is oversimplified because we’d likely also want to (a) Wait for some time between retries; (b) Have a maximum number of iterations before giving up; and (c) Analyse the errors and try to remedy (e.g. reduce max_connections
for rate limit errors).
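Those refinements might look roughly like the following sketch, where the Inspect calls are abstracted behind callables so the shape of the supervisor stays clear (the delay, max_iterations, and the callable parameters are illustrative choices, not part of Inspect's API):

```python
import time

def supervise_retries(list_retryable, retry, max_iterations=5, delay=60.0):
    # retry failed logs until none remain, giving up after max_iterations
    for _ in range(max_iterations):
        retryable = list_retryable()
        if not retryable:
            return True  # all tasks completed successfully
        retry(retryable)
        time.sleep(delay)
    return False  # gave up with work still outstanding

# usage with stand-ins: two logs fail on the first pass, succeed on retry
pending = ["log-a", "log-b"]
done = supervise_retries(
    list_retryable=lambda: list(pending),
    retry=lambda logs: pending.clear(),
    delay=0.0,
)
print(done)  # True
```

In a real supervisor, list_retryable would wrap retryable_eval_logs(list_eval_logs(log_dir)) and retry would wrap eval_retry(), perhaps with a reduced max_connections.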
The retryable_eval_logs()
function takes a log listing and filters it as follows:
1. Finds all logs with status "error" or "cancelled".
2. Checks to see if another log with the same task_id has a status of "success" (in that case, discard the log from the retryable pool).
3. For each retryable log not found to have been subsequently completed, take the most recent one associated with the task_id (for handling multiple retries).
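The same filtering rules can be sketched over plain dictionaries (the timestamps, field names, and logs here are made up for illustration; the real function operates on EvalLog listings):

```python
def retryable(logs):
    # task_ids that have completed successfully at any point
    succeeded = {log["task_id"] for log in logs if log["status"] == "success"}
    # keep only the most recent failed/cancelled log per task_id
    latest = {}
    for log in sorted(logs, key=lambda log: log["time"]):
        if log["status"] in ("error", "cancelled") and log["task_id"] not in succeeded:
            latest[log["task_id"]] = log
    return list(latest.values())

logs = [
    {"task_id": "bank", "status": "error", "time": 1},
    {"task_id": "bank", "status": "success", "time": 2},   # later success: drop
    {"task_id": "dns", "status": "error", "time": 1},
    {"task_id": "dns", "status": "cancelled", "time": 3},  # most recent retryable
]
print([log["time"] for log in retryable(logs)])  # [3]
```

Only the most recent failed dns log survives: the earlier dns error is superseded and the bank failure was subsequently completed.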
When retryable_eval_logs()
returns an empty list, it indicates that all of the tasks have run successfully. At this point, we’ll likely want to collect up all of the successful logs (note that there will still be logs with errors in the log_dir
as logs aren’t overwritten on retry). We can do this with a filter
as follows:
logs = list_eval_logs(
    log_dir=log_dir,
    filter=lambda log: log.status == "success"
)
Sample Preservation
When retrying a log file, Inspect will attempt to re-use completed samples from the original task. This can result in substantial time and cost savings compared to starting over from the beginning.
IDs and Shuffling
An important constraint on the ability to re-use completed samples is matching them up correctly with samples in the new task. To do this, Inspect requires stable unique identifiers for each sample. This can be achieved in one of two ways:
1. Samples can have an explicit id field which contains the unique identifier; or
2. You can rely on Inspect’s assignment of an auto-incrementing id for samples; however, this will not work correctly if your dataset is shuffled. Inspect will log a warning and not re-use samples if it detects that the dataset.shuffle() method was called, but if you are shuffling by some other means this automatic safeguard won’t be applied.
If dataset shuffling is important to your evaluation and you want to preserve samples for retried tasks, then you should include an explicit id
field in your dataset.
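Here is a small sketch of why explicit ids make sample re-use shuffle-proof (plain dictionaries standing in for samples; the field names mirror the description above):

```python
import random

# samples with explicit, stable ids
dataset = [{"id": f"s{i}", "input": f"question {i}"} for i in range(5)]

# a retried task may see the dataset in a different order
shuffled = list(dataset)
random.shuffle(shuffled)

# completed samples from the original run, keyed by id
completed = {"s1": "answer 1", "s3": "answer 3"}

# matching by id works regardless of ordering; positional
# (auto-incrementing) matching would pair up the wrong samples
reused = [s["id"] for s in shuffled if s["id"] in completed]
print(sorted(reused))  # ['s1', 's3']
```

With auto-incremented ids, the third sample of the retried run need not be the third sample of the original run, which is exactly why Inspect declines to re-use samples from a shuffled dataset.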
Max Samples
Another consideration is max_samples
, which is the maximum number of samples to run concurrently within a task. Larger numbers of concurrent samples will result in higher throughput, but will also result in completed samples being written less frequently to the log file, and consequently fewer recoverable samples in the case of an interrupted task.
By default, Inspect sets the value of max_samples
to max_connections + 1
, ensuring that the model API is always fully saturated (note that it would rarely make sense to set it lower than max_connections
). The default max_connections
is 10, which will typically result in samples being written to the log frequently. On the other hand, setting a very large max_connections
(e.g. 100 for a dataset with 100 samples) may result in very few recoverable samples in the case of an interruption.
Note also that when using Tool Environments, the tool environment provider may place an additional cap on the default max_samples
(for example, the Docker provider limits the default max_samples
to no more than 25).