The Agent Company: Evaluating multi-tool autonomous agents in a synthetic company
The Agent Company benchmark evaluates autonomous agents in a realistic, self-contained company environment. Tasks require browsing internal web services, reading and writing files, running code, and coordinating tools to solve multi-step problems.
Overview
This eval provides an adaptation of the original TheAgentCompany implementation. Where possible it reuses code with attribution.
The full eval comprises 175 different tasks that connect to a range of bespoke services. To make this implementation easier to review, this work is split into stages:
- ownCloud example tasks - 10 tasks - to check the general approach, including both string and LLM judges of tasks
- ownCloud only remaining 23 tasks - splitting to make the review easier
- ownCloud and rocketchat example - this requires adding interactive NPCs - 5 tasks to check approach
- ownCloud and rocketchat remaining tasks - about 50 remaining
- gitlab tasks - 5 examples (I haven’t looked at this in detail but I think it’d be fairly straightforward)
- all remaining tasks - 87 tasks
This PR comprises step 1 only.
Usage
Installation
There are two ways of using Inspect Evals, from pypi as a dependency of your own project and as a standalone checked out GitHub repository.
If you are using it from pypi, install the package and its dependencies via:
pip install inspect-evals[theagentcompany]If you are using Inspect Evals in its repository, start by installing the necessary dependencies with:
uv sync --extra theagentcompanyRunning evaluations
Now you can start evaluating models. For simplicity’s sake, this section assumes you are using Inspect Evals from the standalone repo. If that’s not the case and you are not using uv to manage dependencies in your own project, you can use the same commands with uv run dropped.
uv run inspect eval inspect_evals/theagentcompany --model openai/gpt-5-nanoYou can also import tasks as normal Python objects and run them from python:
from inspect_ai import eval
from inspect_evals.theagentcompany import theagentcompany
eval(theagentcompany)After running evaluations, you can view their logs using the inspect view command:
uv run inspect viewFor VS Code, you can also download Inspect AI extension for viewing logs.
If you don’t want to specify the --model each time you run an evaluation, create a .env configuration file in your working directory that defines the INSPECT_EVAL_MODEL environment variable along with your API key. For example:
INSPECT_EVAL_MODEL=anthropic/claude-opus-4-1-20250805
ANTHROPIC_API_KEY=<anthropic-api-key>Options
You can control a variety of options from the command line. For example:
uv run inspect eval inspect_evals/theagentcompany --limit 10
uv run inspect eval inspect_evals/theagentcompany --max-connections 10
uv run inspect eval inspect_evals/theagentcompany --temperature 0.5See uv run inspect eval --help for all available options.
Parameters
theagentcompany
task_name(Optional[Literal[‘ds_answer_numerical_data_question’, ‘ds_sql_exercise’, ‘hr_salary_analysis’, ‘hr_populate_salary_increase_memo’, ‘hr_resume_categorization’, ‘ml_grade_exam’, ‘research_answer_questions_on_paper’, ‘sde_copy_table_from_pdf_to_xlsx’, ‘sde_create_sqlite_database’, ‘sde_run_rising_wave_locally’, ‘test_basic_llm_judge’, ‘test_owncloud_file_reading’, ‘test_rocketchat_connect’, ‘test_install_permissions’]]): The name of the task to load. Can be a real task or a test task. If omitted, runs all real tasks and dynamically selects the correct scorer per sample via the sample’seval_namemetadata. (default:None)scoring_mode(Literal[‘original’, ‘improved’]): Fords_sql_exercise, chooseoriginal(exact file match) orimproved(LLM-as-judge on the same outputs). Other tasks ignore this and use their usual scorer. (default:'original')max_messages(int): Maximum number of messages in agent conversation. This is set to 100 to match the original TheAgentCompany implementation. (default:100)
Running individual tasks
To run a specific task, pass task_name via -T:
# Run a specific task
uv run inspect eval inspect_evals/theagentcompany -T task_name=hr_populate_salary_increase_memo --model anthropic/claude-haiku-4-5Some tasks also support an LLM-judge scoring mode:
uv run inspect eval inspect_evals/theagentcompany -T task_name=ds_sql_exercise -T scoring_mode=improved --model openai/gpt-5-nano --limit 1Dataset
This eval provides a self-contained environment with internal websites and data. To replicate those systems, it uses additional service containers. For example, a task might require the agent to use a web browser to access a company website served by a connected container.
A simpler task is to download a spreadsheet from a web server, read the file, perform calculations, and save the result in the workspace. More complex tasks can involve interacting with colleagues, installing software, and running local tooling.
There is also a small set of test tasks, provided to help development and to test that the environment and agent is configured correctly: test_basic_llm_judge, test_owncloud_file_reading, test_rocketchat_connect.
Scoring
Scoring is done using a custom scorer for each task, which runs on the host. All tasks implemented so far have an original scoring method, which matches the original implementation as closely as possible.
However, many of these scorers have issues - for example relying on string matching e.g. ds_answer_numerical_data_question, and therefore failing correct answers due to minor formatting differences. Others fail models for not achieving results not specified in the eval, for example ml_grade_exam, or use very stringent marking criteria research_answer_questions_on_paper. In these cases, an improved scoring method has been implemented, access using the parameter -T scoring_mode=improved.
Evaluation Report
The Agent Company team maintains a leaderboard of model performance. The table below reproduces figures from their published run OpenHands–Versa with Claude Sonnet 4 (evaluation 1.0.0, 14 June 2025).
Report configuration:
- Eval version:
1-A - A model ID (published baseline):
anthropic/claude-sonnet-4 - B model ID (this implementation):
anthropic/claude-haiku-4-5-20251001
Commands used to produce the B results in this table:
uv run inspect eval inspect_evals/theagentcompany --model anthropic/claude-haiku-4-5-20251001 --limit 1
uv run inspect eval inspect_evals/theagentcompany -T task_name=ds_sql_exercise -T scoring_mode=improved --model anthropic/claude-haiku-4-5-20251001 --limit 1
uv run inspect eval inspect_evals/theagentcompany -T task_name=research_answer_questions_on_paper -T scoring_mode=improved --model anthropic/claude-haiku-4-5-20251001 --limit 1The Agent Company leaderboard results (⭐ indicates perfect completion)
A — OpenHands–Versa with Claude Sonnet 4 (anthropic/claude-sonnet-4, published run linked above). B — this implementation with Claude Haiku 4.5 (anthropic/claude-haiku-4-5-20251001).
| Filename | Total | A - Result | A - Score | B - Result | B - Score |
|---|---|---|---|---|---|
| ds-answer-numerical-data-question | 6 | 0 | 0.00 | 0 | 0.00 |
| ds-sql-exercise | 6 | 6 | 1.00 ⭐ | 3 | 0.50 |
| hr-populate-salary-increase-memo | 7 | 2 | 0.14 | 2 | 0.00 |
| hr-resume-categorization | 4 | 1 | 0.12 | 1 | 0.25 |
| hr-salary-analysis | 2 | 0 | 0.00 | 0 | 0.00 |
| ml-grade-exam | 8 | 0 | 0.00 | 0 | 0.00 |
| research-answer-questions-on-paper | 12 | 0 | 0.00 | 0 | 0.00 |
| sde-copy-table-from-pdf-to-xlsx | 5 | 0 | 0.00 | 1 | 0.20 |
| sde-create-sqlite-database | 8 | 7 | 0.44 (1) | 6 | 0.75 |
| sde-run-rising-wave-locally | 2 | 0 | 0.00 | 1 | 0.50 |
Notes
- I think this could be an error in the leaderboard scoring, I think getting 7 of 8 checkpoints should be a score of 0.75 or higher.
Testing
This eval includes several test tasks that should pass quickly for most capable models. They verify that services are connected and required data is loaded.
test-owncloud-file-reading- verifies ownCloud data is loaded and accessibletest-basic-llm-judge- verifies an LLM judge can evaluate container-provided contenttest-rocketchat-connect- verifies the agent can connect to Rocket.Chat