The Agent Company: Evaluating multi-tool autonomous agents in a synthetic company


The Agent Company benchmark evaluates autonomous agents in a realistic, self-contained company environment. Tasks require browsing internal web services, reading and writing files, running code, and coordinating tools to solve multi-step problems.

Overview

This eval provides an adaptation of the original TheAgentCompany implementation. Where possible it reuses code with attribution.

The full eval comprises 175 different tasks that connect to a range of bespoke services. To make this implementation easier to review, this work is split into stages:

  1. ownCloud example tasks (10 tasks) - to check the general approach, including both string and LLM judges
  2. remaining ownCloud-only tasks (23 tasks) - split out to make review easier
  3. ownCloud and Rocket.Chat example tasks (5 tasks) - requires adding interactive NPCs; checks the approach
  4. remaining ownCloud and Rocket.Chat tasks (about 50 tasks)
  5. GitLab tasks - 5 examples (I haven’t looked at these in detail, but I expect them to be fairly straightforward)
  6. all remaining tasks (87 tasks)

This PR comprises stage 1 only.

Usage

Installation

There are two ways of using Inspect Evals: from PyPI as a dependency of your own project, or as a standalone checked-out GitHub repository.

If you are using it from PyPI, install the package and its dependencies via:

pip install inspect-evals[theagentcompany]

If you are using Inspect Evals in its repository, start by installing the necessary dependencies with:

uv sync --extra theagentcompany

Running evaluations

Now you can start evaluating models. For simplicity’s sake, this section assumes you are using Inspect Evals from the standalone repo. If that’s not the case, and you are not using uv to manage dependencies in your own project, run the same commands with the uv run prefix dropped.

uv run inspect eval inspect_evals/theagentcompany --model openai/gpt-5-nano

You can also import tasks as normal Python objects and run them from Python:

from inspect_ai import eval
from inspect_evals.theagentcompany import theagentcompany
eval(theagentcompany)

After running evaluations, you can view their logs using the inspect view command:

uv run inspect view

For VS Code, you can also download the Inspect AI extension for viewing logs.

If you don’t want to specify the --model each time you run an evaluation, create a .env configuration file in your working directory that defines the INSPECT_EVAL_MODEL environment variable along with your API key. For example:

INSPECT_EVAL_MODEL=anthropic/claude-opus-4-1-20250805
ANTHROPIC_API_KEY=<anthropic-api-key>
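Inspect model IDs use a provider/model format, as in the .env example above. The snippet below is a hypothetical sanity check (not part of this eval) that splits a configured value into its two parts:

```python
# Hypothetical sanity check: an Inspect model ID is "provider/model",
# so partition() splits the configured value into its two parts.
model = "anthropic/claude-opus-4-1-20250805"  # value from the .env example
provider, _, model_name = model.partition("/")
print(provider)    # anthropic
print(model_name)  # claude-opus-4-1-20250805
```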

Options

You can control a variety of options from the command line. For example:

uv run inspect eval inspect_evals/theagentcompany --limit 10
uv run inspect eval inspect_evals/theagentcompany --max-connections 10
uv run inspect eval inspect_evals/theagentcompany --temperature 0.5

See uv run inspect eval --help for all available options.

Parameters

theagentcompany

  • task_name (Optional[Literal['ds_answer_numerical_data_question', 'ds_sql_exercise', 'hr_salary_analysis', 'hr_populate_salary_increase_memo', 'hr_resume_categorization', 'ml_grade_exam', 'research_answer_questions_on_paper', 'sde_copy_table_from_pdf_to_xlsx', 'sde_create_sqlite_database', 'sde_run_rising_wave_locally', 'test_basic_llm_judge', 'test_owncloud_file_reading', 'test_rocketchat_connect', 'test_install_permissions']]): The name of the task to load. Can be a real task or a test task. If omitted, runs all real tasks and dynamically selects the correct scorer per sample via the sample's eval_name metadata. (default: None)
  • scoring_mode (Literal['original', 'improved']): For ds_sql_exercise, choose original (exact file match) or improved (LLM-as-judge on the same outputs). Other tasks ignore this and use their usual scorer. (default: 'original')
  • max_messages (int): Maximum number of messages in agent conversation. This is set to 100 to match the original TheAgentCompany implementation. (default: 100)

Running individual tasks

To run a specific task, pass task_name via -T:

# Run a specific task
uv run inspect eval inspect_evals/theagentcompany -T task_name=hr_populate_salary_increase_memo --model anthropic/claude-haiku-4-5

Some tasks also support an LLM-judge scoring mode:

uv run inspect eval inspect_evals/theagentcompany -T task_name=ds_sql_exercise -T scoring_mode=improved --model openai/gpt-5-nano --limit 1

Dataset

This eval provides a self-contained environment with internal websites and data. To replicate those systems, it uses additional service containers. For example, a task might require the agent to use a web browser to access a company website served by a connected container.

A simpler task is to download a spreadsheet from a web server, read the file, perform calculations, and save the result in the workspace. More complex tasks can involve interacting with colleagues, installing software, and running local tooling.
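As an illustrative sketch of that simpler workflow (this is not the eval's actual task code; the data and filenames are invented), the read-calculate-save steps look like:

```python
import csv
import io

# Invented example data standing in for a downloaded spreadsheet.
downloaded = io.StringIO("name,salary\nAlice,50000\nBob,60000\n")

# Read the file and perform a calculation.
total = sum(int(row["salary"]) for row in csv.DictReader(downloaded))

# Save the result in the workspace (hypothetical output filename).
with open("salary_total.txt", "w") as f:
    f.write(str(total))

print(total)  # 110000
```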

There is also a small set of test tasks, provided to aid development and to check that the environment and agent are configured correctly: test_basic_llm_judge, test_owncloud_file_reading, test_rocketchat_connect.

Scoring

Scoring is done using a custom scorer for each task, which runs on the host. All tasks implemented so far have an original scoring method, which matches the original implementation as closely as possible.

However, many of these scorers have issues. Some rely on exact string matching (e.g. ds_answer_numerical_data_question) and therefore fail correct answers over minor formatting differences. Others penalise models for not achieving results that the task instructions never specify (e.g. ml_grade_exam), or use very stringent marking criteria (research_answer_questions_on_paper). In these cases, an improved scoring method has been implemented, accessed using the parameter -T scoring_mode=improved.
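To illustrate the string-matching issue, the standalone sketch below (not the eval's actual scorer code) shows how an exact comparison rejects a correct answer over formatting, while a normalising comparison, or an LLM judge, would accept it:

```python
def exact_match(expected: str, answer: str) -> bool:
    # Original-style scoring: byte-for-byte equality.
    return expected == answer

def normalized_match(expected: str, answer: str) -> bool:
    # Sketch of a more forgiving comparison: strip currency symbols,
    # thousands separators and trailing zeros before comparing numbers.
    def norm(s: str) -> str:
        s = s.strip().lstrip("$").replace(",", "")
        try:
            return format(float(s), "g")
        except ValueError:
            return s.lower()
    return norm(expected) == norm(answer)

print(exact_match("12500", "12,500.00"))       # False: correct answer rejected
print(normalized_match("12500", "12,500.00"))  # True
```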

Evaluation Report

The Agent Company team maintains a leaderboard of model performance. The table below reproduces figures from their published run of OpenHands–Versa with Claude Sonnet 4 (evaluation 1.0.0, 14 June 2025).

Report configuration:

  • Eval version: 1-A
  • A model ID (published baseline): anthropic/claude-sonnet-4
  • B model ID (this implementation): anthropic/claude-haiku-4-5-20251001

Commands used to produce the B results in this table:

uv run inspect eval inspect_evals/theagentcompany --model anthropic/claude-haiku-4-5-20251001 --limit 1
uv run inspect eval inspect_evals/theagentcompany -T task_name=ds_sql_exercise -T scoring_mode=improved --model anthropic/claude-haiku-4-5-20251001 --limit 1
uv run inspect eval inspect_evals/theagentcompany -T task_name=research_answer_questions_on_paper -T scoring_mode=improved --model anthropic/claude-haiku-4-5-20251001 --limit 1

The Agent Company leaderboard results (⭐ indicates perfect completion)

A — OpenHands–Versa with Claude Sonnet 4 (anthropic/claude-sonnet-4, published run linked above). B — this implementation with Claude Haiku 4.5 (anthropic/claude-haiku-4-5-20251001).

| Filename | Total | A - Result | A - Score | B - Result | B - Score |
| --- | --- | --- | --- | --- | --- |
| ds-answer-numerical-data-question | 6 | 0 | 0.00 | 0 | 0.00 |
| ds-sql-exercise | 6 | 6 | 1.00 ⭐ | 3 | 0.50 |
| hr-populate-salary-increase-memo | 7 | 2 | 0.14 | 2 | 0.00 |
| hr-resume-categorization | 4 | 1 | 0.12 | 1 | 0.25 |
| hr-salary-analysis | 2 | 0 | 0.00 | 0 | 0.00 |
| ml-grade-exam | 8 | 0 | 0.00 | 0 | 0.00 |
| research-answer-questions-on-paper | 12 | 0 | 0.00 | 0 | 0.00 |
| sde-copy-table-from-pdf-to-xlsx | 5 | 0 | 0.00 | 1 | 0.20 |
| sde-create-sqlite-database | 8 | 7 | 0.44 (1) | 6 | 0.75 |
| sde-run-rising-wave-locally | 2 | 0 | 0.00 | 1 | 0.50 |

Notes

  1. This may be an error in the leaderboard scoring: passing 7 of 8 checkpoints should arguably give a score of 0.75 or higher.
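For context, the 0.44 figure is consistent with our reading of the leaderboard's partial-credit formula (treat this reading as an assumption): half the score scales with the fraction of checkpoints passed, and the other half is awarded only on full completion.

```python
def leaderboard_score(passed: int, total: int) -> float:
    # Assumed leaderboard formula: 0.5 * checkpoint fraction, plus a
    # 0.5 bonus only when every checkpoint passes.
    partial = 0.5 * (passed / total)
    bonus = 0.5 if passed == total else 0.0
    return partial + bonus

print(round(leaderboard_score(7, 8), 2))  # 0.44 -- matches note (1) above
print(round(leaderboard_score(6, 6), 2))  # 1.0  -- matches ds-sql-exercise (A)
print(round(leaderboard_score(2, 7), 2))  # 0.14 -- matches hr-populate-salary-increase-memo (A)
```

Under this formula, 7/8 checkpoints yields only 0.44 by design, which is why the leaderboard figure looks surprisingly low.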

Testing

This eval includes several test tasks that should pass quickly for most capable models. They verify that services are connected and required data is loaded.

  • test-owncloud-file-reading - verifies ownCloud data is loaded and accessible
  • test-basic-llm-judge - verifies an LLM judge can evaluate container-provided content
  • test-rocketchat-connect - verifies the agent can connect to Rocket.Chat

Changelog