OSWorld
Benchmark for multimodal agents for open-ended tasks in real computer environments.
Overview
This is an inspect-native implementation of the OSWorld dataset, a benchmark for multimodal agents for open-ended tasks in real computer environments.
[!NOTE]
For reasons described in Unsupported Samples below, this Inspect eval currently supports a subset of the tests defined in OSWorld’s test_all.json and test_small.json. It supports 246 of the 369
test_all.json
tests and 22 of the 39test_small.json
tests.If you are interested in helping increase the percentage of OSWorld tests that this eval supports, please reach out to @epatey.
Usage
First, install the inspect_ai
and inspect_evals
Python packages with:
pip install inspect_ai
pip install git+https://github.com/UKGovernmentBEIS/inspect_evals
Then, evaluate against one or more models with:
inspect eval inspect_evals/osworld --model openai/gpt-4o
inspect eval inspect_evals/osworld_small --model openai/gpt-4o
After running evaluations, you can view their logs using the inspect view
command:
inspect view
If you don’t want to specify the --model
each time you run an evaluation, create a .env
configuration file in your working directory that defines the INSPECT_EVAL_MODEL
environment variable along with your API key. For example:
INSPECT_EVAL_MODEL=anthropic/claude-3-5-sonnet-20240620
ANTHROPIC_API_KEY=<anthropic-api-key>
[!NOTE] When first running the osworld task, it will build the necessary docker image. This will take several minutes and create a ~8GB image.
OSWorld will take a while to run, and use a lot of tokens.
Options
You can control a variety of options from the command line. For example:
inspect eval inspect_evals/osworld --limit 10
inspect eval inspect_evals/osworld_small --max-connections 10
inspect eval inspect_evals/osworld --temperature 0.5
See inspect eval --help
for all available options.
Corpus
OSWorld defines two different corpora for their samples — all
with 369 samples and small
with 39 samples. The osworld
task has a corpus
option that can be used to specify which to use. This can be controlled via the command line with:
inspect eval inspect_evals/osworld -T corpus="all"
inspect eval inspect_evals/osworld -T corpus="small"
inspect eval inspect_evals/osworld -T corpus="~/my_custom_corpus.json"
Notes: - The parameter also supports a file path to a custom corpus. - The osworld_small
task provides shorthand for selecting the small
corpus. - When no corpus is specified, the task will default to using the all
corpus.
Dataset
Sample Selection
Applications
Within a corpus, OSWorld further categorizes samples by application. The current applications are: - chrome
, - gimp
, - libreoffice_calc
, - libreoffice_impress
, - libreoffice_write
, - os
, - thunderbird
, - vlc
, - vs_code
, - multi_apps
,
This eval has two additional options that can be used to control which applications should be evaluated. - include_apps
— Specifies the apps to filter the dataset. Can be a single app, a list of apps, or omitted to include all apps. - exclude_apps
— Specifies the apps to exclude from the dataset. Can be a single app, a list of apps, or omitted to exclude no apps.
For example, to evaluate all corpus applications except vlc
:
inspect eval inspect_evals/osworld -T exclude_apps=vlc
To evaluate only gimp
and vs_code
samples in the corpus:
inspect eval inspect_evals/osworld -T include_apps=gimp,vs_code
The two options could also be used together in the case of the multi_apps
application. For example, to run all multi_apps
examples except those that involve thunderbird
:
inspect eval inspect_evals/osworld -T include_apps=multi_apps -T exclude_apps=thunderbird
Connected
By default, OSWorld samples that require network connectivity (excluding sample setup) are disabled. This behavior can be overridden.
inspect eval inspect_evals/osworld -T include_connected=true
Unsupported Samples
In the osworld
dataset, certain samples are excluded from the dataset for various reasons. These exclusions are categorized into three main groups:
Unsupported Applications: Some applications are not yet supported within the dataset due to specific limitations. For example:
chrome
: The Docker container does not yet have Chrome installed.thunderbird
: The credentials included in these OSWorld examples are invalid, causing modal error windows to open and distract the model.
(see:
unsupported_apps
in src/inspect_evals/osworld/dataset.py)Examples Requiring Network Access: These examples require network access to function correctly. If network access is not allowed, these examples are excluded from the dataset. This includes examples that use VLC’s HTTP interface, require internet for instructions, or need to install applications or extensions. These examples are excluded by default, but they can be included by passing
-T include_connected=true
on the command line.(see:
examples_requiring_network
in src/inspect_evals/osworld/dataset.py)Unsupported Examples: These are specific example IDs that are excluded because they are not yet supported for various reasons, such as reliance on systemd, use of accessibility tree getters, or assumptions about the environment that are not met in Docker.
(see:
unsupported_examples
in src/inspect_evals/osworld/dataset.py)
Algorithm
You can inspect the code that implements the above inclusion/exclusion logic in the function _should_include_example
in dataset.py
Architectural Design
OSWorld has a large volume of support code to score their samples. A principle of this eval is to leverage as much of that code as possible without modifying it. This allows our eval to automatically benefit from future work in their repo — minimizing the need for our code to be updated.
To accomplish this goal, I’ve taken the approach of doing all execution of OSWorld code within the SandboxEnvironment.
OSWorld phases
OSWorld’s architecture includes various phases of execution.
config
— Prepares the sandbox for the sample by downloading files, opening applications, etc.post-config
— Executes code within the sandbox to provide inputs for evaluation.evaluate
— Assesses the success of the sample. This phase is implemented in terms of code in OSWorld getters and metrics modules.
example.json
All of these phases are described declaratively in a json file for each sample. These files conform to the Example
type defined in osworld_types.py. This json file is copied into the container to guide the execution of the phases.
cli.py
With the one exception described in the note below, all of these phases are executed strictly within the sandbox via a simple cli.py
application. That application supports two commands — config
and evaluate
. (post-config
is an implied pre-requisite step of evaluate
).
[!NOTE] In order to allow as many samples as possible to execute in a sandbox without network access,
download
andcloud_file
operations are executed on the Inspect host and the resulting files are placed in the sandbox withfiles=
.
Inspect Eval Code design
OSWorld defines two abstract interfaces, env
and controller
, that are presented to the getters and metrics. This eval provides Inspect specific implementations of these interfaces. This code is designed to be the minimal, yet idiomatic, code to present the OSWorld samples into the inspect ecosystem.