AgentBench: Evaluate LLMs as Agents
A benchmark designed to evaluate LLMs as Agents
Overview
AgentBench is designed to evaluate LLMs as agents. The task agent_bench_os
contains the “Operating System” dev tasks from AgentBench.
Felix Hofstätter implemented most of these tasks in Inspect. Hanna Gábor fixed some bugs, wrote tests, and refactored some of the code.
Usage
First, install the dependencies:
uv sync --extra agent_bench
Then, evaluate against one or more models with:
uv run inspect eval inspect_evals/agent_bench_os --model openai/gpt-4o
After running evaluations, you can view their logs using the inspect view
command:
uv run inspect view
If you don’t want to specify the --model each time you run an evaluation, create a .env configuration file in your working directory that defines the INSPECT_EVAL_MODEL environment variable along with your API key. For example:
INSPECT_EVAL_MODEL=anthropic/claude-3-5-sonnet-20240620
ANTHROPIC_API_KEY=<anthropic-api-key>
The bash and python code is executed inside a Docker container, so you will need to install Docker Engine in order to run the evaluation.
Options
You can control a variety of options from the command line. For example:
uv run inspect eval inspect_evals/agent_bench_os --limit 10
uv run inspect eval inspect_evals/agent_bench_os --max-connections 10
uv run inspect eval inspect_evals/agent_bench_os --temperature 0.5
See uv run inspect eval --help
for all available options.
Dataset
This implementation currently uses the dev set; onboarding the test set has been left for a future PR. The data originates from this repository and is stored in data/agent_bench_os.json.
Here are a few example tasks (see the paper appendix for more detailed examples including their outputs):
Example 1
There's a program that writes 'succeed' into /root/log periodically. I only know the interval is 1 second, 2 seconds, or 3 seconds. Please find out the interval. Answer me only the integer number of seconds. (e.g. 1, 2, or 3)
Example 2
I would like to implement the following bash command and expose it in your home directory: entering the "calc" command will enable the calculation of an expression. The expression can include addition, subtraction, multiplication, division, and parentheses. If the absolute error between the calculated answer and the expected answer is less than 1e-5, it will be considered correct. For example, I can calculate the result by entering "calc 2 * (9 / 3)", and the output will be 6.
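One of many possible solutions to this task is sketched below (hypothetical; the "calc" name comes from the task, but the use of awk for evaluation is our illustrative choice, not the reference solution):

```shell
#!/bin/sh
# Hypothetical sketch of a "calc" script an agent might install in the
# home directory; awk evaluates +, -, *, / and parentheses.
# Note: the expression should be quoted when calling, otherwise the
# shell may glob-expand "*" before awk sees it.
awk "BEGIN { print $* }"
```

Saved as an executable file named calc on the PATH, `calc '2 * (9 / 3)'` prints 6.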
Example 3
Find out count of linux users on this system who belong to at least 2 groups.
Scoring
In most cases, we simply check whether the agent’s answer matches the answer given in the data file.
For task IDs 3 and 5, the agent needs to write a command. In these cases, we compare the output of the agent-written command with the output of the solution command.
In the original implementation, the scoring (“judging”) mostly happened by running an example script and comparing the agent’s answer to the result of that script. However, this opens up the possibility for the agent to modify the environment in a way that makes the task easier. For example, it could set up its own program that writes to /root/log
instead of inspecting how often the currently running program writes into that file. We kept the example scripts for testing and documentation purposes.
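As a rough sketch (illustrative only, not the actual implementation; the function names are hypothetical), the two scoring modes can be pictured as:

```shell
# Illustrative sketch of the two scoring modes described above.

# Most tasks: exact match against the answer stored in the data file.
score_match() {
  [ "$1" = "$2" ] && echo correct || echo incorrect
}

# Task IDs 3 and 5: run both commands and compare their outputs.
score_command() {
  [ "$(sh -c "$1")" = "$(sh -c "$2")" ] && echo correct || echo incorrect
}

score_match "3" "3"                      # correct
score_command "echo hi" "printf 'hi\n'"  # correct
```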
Changes to the original implementation
General Changes
- Changed the scoring mechanism, as described in Scoring
- Assigned a task ID to each task
Task Specific Changes
Tasks 0-4
Added a newline when “succeed” is written to the file.
Task 6
The description of the task is “How many hidden files are in /home? (not including subdirectories)”
We changed the example code from:
ls -a /home | grep "^\." | wc -l
to:
ls -1Ap /home | grep '^\.' | grep -v '/$' | wc -l
This was needed for the following reasons:
- -1 forces one entry per line, so the count is correct across environments that may format ls output in multiple columns.
- -A avoids counting . and ..
- -p appends / to directories, enabling grep -v '/$' to exclude directories so only files are counted.
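The difference is easy to check in a throwaway directory (a quick sanity check, not part of the evaluation):

```shell
# Create a scratch directory with a hidden file, a visible file, and a
# hidden directory; only the hidden *file* should be counted.
dir=$(mktemp -d)
touch "$dir/.hidden_file" "$dir/visible_file"
mkdir "$dir/.hidden_dir"
ls -1Ap "$dir" | grep '^\.' | grep -v '/$' | wc -l   # prints 1
rm -rf "$dir"
```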
Task 7
Changed the description from “How many files whose name starts with letter exist in /root?” to “How many files whose name starts with an alphanumeric character exist in /root?” to match the code provided later (ls -1 /root | grep "^[[:alpha:]]" | wc -l).
Task 12
The description of the task is “How many files in /home (recursively) were modified in the last 24 hours?”
To make this task deterministic, we modified the timestamps of the files present in the docker container to be outside the 24 hours and created new files only after that. More specifically, we changed the code that is run in the docker setup from:
echo \"This is a random string: wuafjva-niqt12r-j9asd19-afj12-sdmxai\">/home/OSbench\ntouch /home/tmpfile\nmkdir /home/tmpdir\ntouch /home/tmpdir/inner"
to:
find /home -type f -exec touch -d "2023-03-23 12:00:00" {} \;
echo "This is a random string: <random-string> " > /home/OSbench
touch /home/tmpfile
mkdir /home/tmpdir
touch /home/tmpdir/inner
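The effect can be verified in a scratch directory (illustrative only; it assumes GNU touch and find, and the actual scorer may express the 24-hour window differently):

```shell
# Back-date one file, then create another now; only the new file falls
# within the last 24 hours (1440 minutes).
dir=$(mktemp -d)
touch "$dir/old"
touch -d "2023-03-23 12:00:00" "$dir/old"   # timestamp outside the window
touch "$dir/new"
find "$dir" -type f -mmin -1440 | wc -l     # prints 1
rm -rf "$dir"
```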
Task 13
Changed the task 13 answer to 4 cores and pinned CPUs with cpuset: "0-3" so nproc reports 4 cores consistently.
Task 19
The description is “Find out count of linux users on this system who belong to at least 2 groups.”
Changed the example code from:
getent passwd | cut -d: -f1 | while read user; do groups $user | cut -d ':' -f 2 | tr -s ' ' | sed 's/^ *//;s/ *$//' | grep -q ' ' && echo $user; done | wc -l
to:
getent passwd | cut -d: -f1 | while read user; do groups $user | cut -d ':' -f 2 | tr -s ' ' | sed 's/^ *//;s/ *$//' | tr ' ' '\n' | grep -v '^$' | wc -l | grep -q -P '[2-9]|\d{2,}' && echo $user; done | wc -l
This change was not strictly necessary; we made it for consistency with task 20.
Task 20
The description is “Find out count of users who belong to at least 4 groups.”
We changed the example code from:
getent passwd | cut -d: -f1 | while read user; do groups $user | cut -d ':' -f 2 | tr -s ' ' | sed 's/^ *//;s/ *$//' | tr ' ' '\n' | grep -v '^$' | wc -l | grep -q -w '4' && echo $user; done | wc -l
to:
getent passwd | cut -d: -f1 | while read user; do groups $user | cut -d ':' -f 2 | tr -s ' ' | sed 's/^ *//;s/ *$//' | tr ' ' '\n' | grep -v '^$' | wc -l | grep -q -P '[4-9]|\d{2,}' && echo $user; done | wc -l
The original command counted only the users who belonged to exactly 4 groups, excluding users who belonged to more than 4 groups.
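The difference between the two checks is easy to verify directly (requires GNU grep for the -P option):

```shell
# The old check matches only the exact count 4; the new pattern matches
# any count of 4 or more (a single digit 4-9, or any 2+ digit number).
echo 5  | grep -q -w '4'             && echo "old: match" || echo "old: no match"   # old: no match
echo 5  | grep -q -P '[4-9]|\d{2,}'  && echo "new: match" || echo "new: no match"   # new: match
echo 12 | grep -q -P '[4-9]|\d{2,}'  && echo "new: match" || echo "new: no match"   # new: match
echo 3  | grep -q -P '[4-9]|\d{2,}'  && echo "new: match" || echo "new: no match"   # new: no match
```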
Task 21
Changed the example code from find . -type f -name "*.helloworld" | wc -l
to find . -type f -name *.helloworld | wc -l.
Task 23
Now checks the per-user process limit (ulimit -u) instead of the OS-wide maximum number of threads, so the answer doesn’t vary by host. The compose file sets a fixed limit.
Comparison to the results in the original paper
The test set used in the paper is now public, but we have left onboarding it to a future PR and include only the dev set, as discussed in Dataset. This also means our results are not directly comparable to the paper’s.
The reported success rate for gpt-4 is \(42.4\% \; (N=144)\). Running this implementation with gpt-4 on the dev data only, the success rate is \(69.2\% \pm 9.2\% \; (N=26)\).
To-Do
- We kept the current path-rewrite approach in recursively_set_agent_home() (blanket string replacement of /root and /usr with /home/agent) as-is for now. See the comment here.