AgentBench: Evaluate LLMs as Agents

Coding

A benchmark designed to evaluate LLMs as Agents

Contributed By

@Felhof

@hannagabor

@shaheenahmedc

Code

src/inspect_evals/agent_bench

Paper

https://arxiv.org/abs/2308.03688

Overview

AgentBench is designed to evaluate LLMs as agents. The task agent_bench_os contains the “Operating System” tasks from AgentBench.

Felix Hofstätter did the bigger part of implementing these tasks in inspect. Hanna Gábor fixed some bugs, wrote tests and refactored some of the code.

Usage

Installation

There are two ways of using Inspect Evals, from pypi as a dependency of your own project and as a standalone checked out GitHub repository.

If you are using it from pypi, install the package and its dependencies via:

pip install inspect-evals[agent_bench]

If you are using Inspect Evals in its repository, start by installing the necessary dependencies with:

uv sync --extra agent_bench

Running evaluations

Now you can start evaluating models. For simplicity’s sake, this section assumes you are using Inspect Evals from the standalone repo. If that’s not the case and you are not using uv to manage dependencies in your own project, you can use the same commands with uv run dropped.

uv run inspect eval inspect_evals/agent_bench_os --model openai/gpt-5-nano

You can also import tasks as normal Python objects and run them from python:

from inspect_ai import eval
from inspect_evals.agent_bench import agent_bench_os
eval(agent_bench_os)

After running evaluations, you can view their logs using the inspect view command:

uv run inspect view

For VS Code, you can also download Inspect AI extension for viewing logs.

If you don’t want to specify the --model each time you run an evaluation, create a .env configuration file in your working directory that defines the INSPECT_EVAL_MODEL environment variable along with your API key. For example:

INSPECT_EVAL_MODEL=anthropic/claude-opus-4-1-20250805
ANTHROPIC_API_KEY=<anthropic-api-key>

The bash and python code is executed inside a Docker container, so you will need to install Docker Engine in order to run the evaluation.

Options

You can control a variety of options from the command line. For example:

uv run inspect eval inspect_evals/agent_bench_os --limit 10
uv run inspect eval inspect_evals/agent_bench_os --max-connections 10
uv run inspect eval inspect_evals/agent_bench_os --temperature 0.5

See uv run inspect eval --help for all available options.

Parameters

`agent_bench_os`

split (str): Dataset split to use (“dev” or “test”). Defaults to “test”., “dev”: 26 samples with public answers for development, “test”: 125 samples for final evaluation (default: 'test')
solver (Solver | None): Optional solver to use for the task. If not provided, uses the default agent. (default: None)
max_attempts (int): Maximum number of attempts allowed per sample. (default: 1)
max_messages (int): Maximum number of messages allowed per attempt. (default: 100)
max_sample_time_mins (int): Maximum time a given sample can run. (default: 10)

Dataset

The data originates from this repository. The dev set is stored in data/agent_bench_os.json and the test set is stored in data/agent_bench_os_test.json.

Here are a few example tasks (see the paper appendix for more detailed examples including their outputs):

Example 1

There's a program that writes 'succeed' into /root/log periodically. I only know the interval is 1 second, 2 seconds, or 3 seconds. Please find out the interval. Answer me only the integer number of seconds. (e.g. 1, 2, or 3)

Example 2

I would like to implement the following bash command and expose it in your home directory: entering the "calc" command will enable the calculation of an expression. The expression can include addition, subtraction, multiplication, division, and parentheses. If the absolute error between the calculated answer and the expected answer is less than 1e-5, it will be considered correct. For example, I can calculate the result by entering "calc 2 * (9 / 3)", and the output will be 6.

Example 3

Find out count of linux users on this system who belong to at least 2 groups.

Scoring

In most cases, we simply check if the answer matches the answer given in the data file.

For task IDs 3 and 5, the agent needs to write a command. In these cases we compare the output of the agent-written command with the solution-command.

In the original implementation, the scoring (“judging”) mostly happened by running an example script and comparing the agent’s answer to the result of that script. However, this opens up the possibility for the agent to modify the environment in a way that makes the task easier. For example, it could set up its own program that writes to /root/log instead of inspecting how often the currently running program writes into that file. We kept the example codes for testing and documentation purposes.

Changes to the original implementation

General Changes

Change in the scoring mechanism mentioned in scoring
Assigned a task id to each task

Task Specific Changes

Tasks 0-4

Added newline when “succeed” is written to file.

Task 6

The description of the task is “How many hidden files are in /home? (not including subdirectories)”

We changed the example code from:

ls -a /home | grep \"^\\.\" | wc -l

to:

ls -1Ap /home | grep '^\\.' | grep -v '/$' | wc -l

This was needed because of two reasons:

-1 forces one entry per line so the count is correct across environments that may format ls output in multiple columns.
-A avoids counting . and ..; and -p appends / to directories, enabling grep -v '/$' to exclude directories so only files are counted.

Task 7

Changed the description from “How many files whose name starts with letter exist in /root?” to “How many files whose name starts with an alphanumeric character exist in /root?” to match the code provided later (ls -1 /root | grep \"^[[:alpha:]]\" | wc -l).

Task 12

The description of the task is “How many files in /home (recursively) were modified in the last 24 hours?”

To make this task deterministic, we modified the timestamps of the files present in the docker container to be outside the 24 hours and created new files only after that. More specifically, we changed the code that is run in the docker setup from:

echo \"This is a random string: wuafjva-niqt12r-j9asd19-afj12-sdmxai\">/home/OSbench\ntouch /home/tmpfile\nmkdir /home/tmpdir\ntouch /home/tmpdir/inner"

to:

find /home -type f -exec touch -d \"2023-03-23 12:00:00\" {} \\;\n echo \"This is a random string: <random-string> \">/home/OSbench\ntouch /home/tmpfile\nmkdir /home/tmpdir\ntouch /home/tmpdir/inner

Task 13

Changed task 13 answer to 4 cores and pinned CPUs with cpuset: "0-3" so nproc reports 4 cores consistently.

Task 19

The description is “Find out count of linux users on this system who belong to at least 2 groups.”

Changed the example code from:

getent passwd | cut -d: -f1 | while read user; do groups $user | cut -d ':' -f 2 | tr -s ' ' | sed 's/^ *//;s/ *$//' | grep -q ' ' && echo $user; done | wc -l

to:

getent passwd | cut -d: -f1 | while read user; do groups $user | cut -d ':' -f 2 | tr -s ' ' | sed 's/^ *//;s/ *$//' | tr ' ' '\\n' | grep -v '^$' | tr '\\' '\n' | wc -l | grep -q -P '[2-9]|\\d{2,}' && echo $user; done | wc -l

This was not strictly necessary, we changed this to be consistent with task 20.

Task 20

The description is “Find out count of users who belong to at least 4 groups.”

We changed the example code from:

getent passwd | cut -d: -f1 | while read user; do groups $user | cut -d ':' -f 2 | tr -s ' ' | sed 's/^ *//;s/ *$//' | tr ' ' '\\n' | grep -v '^$' | wc -l | grep -q -w '4' && echo $user; done | wc -l

to:

getent passwd | cut -d: -f1 | while read user; do groups $user | cut -d ':' -f 2 | tr -s ' ' | sed 's/^ *//;s/ *$//' | tr ' ' '\\n' | grep -v '^$' | tr '\\' '\n' | wc -l | grep -q -P '[4-9]|\\d{2,}' && echo $user; done | wc -l

The original command counted only the users who belonged to exactly 4 groups, excluding users who belonged to more than 4 groups.

Task 21

Changed example code from find . -type f -name \"*.helloworld\" | wc -l to find . -type f -name *.helloworld | wc -l

Task 23

Now checks the per-user process limit (ulimit -u) instead of the OS-wide threads max, so the answer doesn’t vary by host. The compose file sets a fixed limit.

ulimit -u

Comparison to the results in the original paper

The reported success rate for gpt-4 is $42.4\% \; (N=144)$. Running this implementation with gpt-4 only on dev data, the success rate is $69.2\% \pm 9.2\% \; (N=26)$. Running this implementation with gpt-4 on test data, the success rate is $36\% \pm 9.8\% \; (N=25)$.

To-Do

We kept the current path-rewrite approach in recursively_set_agent_home() (blanket string replacement of /root and /usr to /home/agent) as-is for now. See this PR comment.