The Agent Company: Evaluating multi-tool autonomous agents in a synthetic company

Assistants

inspect-evals

Agent

Tools

The Agent Company benchmark evaluates autonomous agents in a realistic, self-contained company environment. Tasks require browsing internal web services, reading and writing files, running code, and coordinating tools to solve multi-step problems.

Contributed By

@bndxn

Code

src/inspect_evals/theagentcompany

Paper

https://arxiv.org/abs/2412.14161

Overview

This eval provides an adaptation of the original TheAgentCompany implementation. Where possible it reuses code with attribution.

The full eval comprises 175 different tasks that connect to a range of bespoke services. To make this implementation easier to review, this work is split into stages:

1. Basic example tasks - 5 tasks, ownCloud only - to check the approach
2. Basic remaining - 28 tasks, ownCloud only - split out to make the review easier
3. Supporting non-player LLM interactions - 5 tasks, adding Rocket.Chat - to check the approach
4. Remaining non-player LLM interactions - about 50 tasks
5. Coding tasks - 5 tasks, GitLab
6. All remaining tasks - 87 tasks

The first PR has been merged. The current PR focuses on stage 2.

Usage

Installation

There are two ways of using Inspect Evals: from PyPI as a dependency of your own project, or as a standalone checkout of the GitHub repository.

If you are using it from PyPI, install the package and its dependencies via:

pip install inspect-evals[theagentcompany]

If you are using Inspect Evals in its repository, start by installing the necessary dependencies with:

uv sync --extra theagentcompany

Running evaluations

Now you can start evaluating models. For simplicity’s sake, this section assumes you are using Inspect Evals from the standalone repo. If that’s not the case and you are not using uv to manage dependencies in your own project, you can use the same commands with uv run dropped.

uv run inspect eval inspect_evals/theagentcompany --model openai/gpt-5-nano

You can also import tasks as normal Python objects and run them from Python:

from inspect_ai import eval
from inspect_evals.theagentcompany import theagentcompany
eval(theagentcompany)

After running evaluations, you can view their logs using the inspect view command:

uv run inspect view

For VS Code, you can also download the Inspect AI extension for viewing logs.

If you don’t want to specify the --model each time you run an evaluation, create a .env configuration file in your working directory that defines the INSPECT_EVAL_MODEL environment variable along with your API key. For example:

INSPECT_EVAL_MODEL=anthropic/claude-opus-4-1-20250805
ANTHROPIC_API_KEY=<anthropic-api-key>

Pre-flight checks (before running agents)

Run the structural checks (dataset YAML, asset paths, scorer imports, optional Docker Compose validation) without a live model:

uv sync --extra theagentcompany
uv run pytest tests/theagentcompany/
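To illustrate what an asset-path check of this kind might look like, here is a minimal sketch. It is not the code in tests/theagentcompany/; the function name find_missing_assets and the task-to-paths mapping are hypothetical and exist only for this example.

```python
import tempfile
from pathlib import Path


def find_missing_assets(root: Path, assets_by_task: dict) -> dict:
    """Return {task: [missing relative paths]} for assets absent under root.

    Hypothetical helper: the real pre-flight tests may be organised differently.
    """
    missing = {}
    for task, rel_paths in assets_by_task.items():
        absent = [p for p in rel_paths if not (root / p).exists()]
        if absent:
            missing[task] = absent
    return missing


# Demonstrate against a throwaway directory containing a single asset.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "data.csv").write_text("a,b\n1,2\n")
    report = find_missing_assets(
        root,
        {"task_a": ["data.csv"], "task_b": ["missing.xlsx"]},
    )

print(report)
```

A check like this needs no model or running containers, which is why it suits a pre-flight step.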

Options

You can control a variety of options from the command line. For example:

uv run inspect eval inspect_evals/theagentcompany --limit 10
uv run inspect eval inspect_evals/theagentcompany --max-connections 10
uv run inspect eval inspect_evals/theagentcompany --temperature 0.5

See uv run inspect eval --help for all available options.

Parameters

`theagentcompany`

task_name (Union[Literal['admin_make_spreadsheet', 'admin_mass_forms_filling', 'admin_remove_pages_pdf', 'admin_translate_sales_chat', 'admin_watch_video', 'ds_answer_numerical_data_question', 'ds_coffee_shop_database_management', 'ds_find_meeting_spreadsheet', 'ds_fix_table_values_and_missing_answers', 'ds_format_excel_sheets', 'ds_predictive_modeling', 'ds_sql_exercise', 'ds_stock_analysis_slides', 'ds_visualize_data_in_pie_and_bar_chart', 'finance_budget_variance', 'finance_check_attendance_payroll', 'finance_expense_validation', 'finance_invoice_matching', 'finance_nonqualified_bill_ask_for_reimburse', 'hr_check_attendance_multiple_days', 'hr_check_attendance_multiple_days_department', 'hr_check_attendance_one_day', 'hr_create_employee_manual', 'hr_new_grad_job_description', 'hr_organize_talent_info', 'hr_populate_salary_increase_memo', 'hr_resume_categorization', 'hr_salary_analysis', 'ml_generate_gradcam', 'ml_grade_exam', 'research_answer_questions_on_paper', 'sde_copy_table_from_pdf_to_xlsx', 'sde_create_sqlite_database', 'sde_run_rising_wave_locally'], Literal['test_basic_llm_judge', 'test_owncloud_file_reading', 'test_install_permissions', 'test_evaluator_helpers'], None]): The name of the task to load. Can be a real task or a test task. If omitted, runs all real tasks and dynamically selects the correct scorer per sample via the sample’s eval_name metadata. (default: None)
scoring_mode (Literal['original', 'improved']): For ds_sql_exercise, choose original (exact file match) or improved (LLM-as-judge on the same outputs). Other tasks ignore this and use their usual scorer. (default: 'original')
max_messages (int): Maximum number of messages in agent conversation. This is set to 100 to match the original TheAgentCompany implementation. (default: 100)

Running individual tasks

To run a specific task, pass task_name via -T:

# Run a specific task
uv run inspect eval inspect_evals/theagentcompany -T task_name=hr_populate_salary_increase_memo --model anthropic/claude-haiku-4-5

Some tasks also support an LLM-judge scoring mode:

uv run inspect eval inspect_evals/theagentcompany -T task_name=ds_sql_exercise -T scoring_mode=improved --model openai/gpt-5-nano --limit 1
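If you launch many single-task runs from a script, it can help to assemble the command line programmatically. The sketch below builds the same argument list as the command above; build_eval_command is a hypothetical helper for this example and not part of inspect-evals.

```python
import shlex


def build_eval_command(model, task_name=None, scoring_mode=None, limit=None):
    """Assemble the `uv run inspect eval` argument list for this eval."""
    cmd = ["uv", "run", "inspect", "eval", "inspect_evals/theagentcompany"]
    # Task parameters are passed as repeated -T key=value pairs.
    if task_name is not None:
        cmd += ["-T", f"task_name={task_name}"]
    if scoring_mode is not None:
        cmd += ["-T", f"scoring_mode={scoring_mode}"]
    cmd += ["--model", model]
    if limit is not None:
        cmd += ["--limit", str(limit)]
    return cmd


command = build_eval_command(
    model="openai/gpt-5-nano",
    task_name="ds_sql_exercise",
    scoring_mode="improved",
    limit=1,
)
# Print a shell-safe string; pass `command` to subprocess.run to execute it.
print(shlex.join(command))
```

Keeping the command as a list avoids shell-quoting problems when it is handed to subprocess.run.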

Dataset

This eval provides a self-contained environment with internal websites and data. To replicate those systems, it uses additional service containers. For example, a task might require the agent to use a web browser to access a company website served by a connected container.

A simpler task is to download a spreadsheet from a web server, read the file, perform calculations, and save the result in the workspace. More complex tasks can involve interacting with colleagues, installing software, and running local tooling.

There is also a small set of test tasks, provided to help development and to check that the environment and agent are configured correctly: test_basic_llm_judge, test_owncloud_file_reading, test_install_permissions, test_evaluator_helpers.

Scoring

Scoring is done using a custom scorer for each task, which runs on the host. All tasks implemented so far have an original scoring method, which matches the original implementation as closely as possible.

However, many of these scorers have issues. Some rely on exact string matching (for example ds_answer_numerical_data_question) and therefore fail correct answers over minor formatting differences. Others fail models for not producing results that the task never specifies (for example ml_grade_exam), or apply very stringent marking criteria (for example research_answer_questions_on_paper). In these cases an improved scoring method has been implemented; select it with the parameter -T scoring_mode=improved.
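The string-matching problem can be shown with a small, self-contained sketch. This is not the scorer code from this eval, and the tolerant comparison is not the improved mode (which the parameter description says uses an LLM judge). Both function names are hypothetical and only contrast strict and format-tolerant comparison of a numeric answer.

```python
import re


def exact_match(answer: str, expected: str) -> bool:
    """Strict comparison: any formatting difference counts as a failure."""
    return answer == expected


def numeric_match(answer: str, expected: str, tol: float = 1e-6) -> bool:
    """Compare the first number found in each string, ignoring formatting."""

    def first_number(text: str):
        # Drop thousands separators, then take the first integer or decimal.
        match = re.search(r"-?\d+(?:\.\d+)?", text.replace(",", ""))
        return float(match.group()) if match else None

    a, e = first_number(answer), first_number(expected)
    return a is not None and e is not None and abs(a - e) <= tol


# The same value written two ways fails the strict check but passes the tolerant one.
print(exact_match("1,250.0", "1250"))
print(numeric_match("1,250.0", "1250"))
```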

Metrics

Each task reports two kinds of outcome in a single Score.value dict:

score — overall pass or fail (C or I), matching the original TheAgentCompany “all checkpoints passed” rule.
checkpoints — partial progress as a float from 0.0 to 1.0 (for example 0.2 means roughly one in five checkpoints passed).

That split is deliberate: you get a clear pass/fail for leaderboard-style accuracy, and a separate signal for how far the agent got when it did not fully succeed.
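As a rough illustration of how the two fields relate, the sketch below derives both from a list of per-checkpoint outcomes. It assumes every checkpoint carries equal weight, which may not hold for every task, and summarize_checkpoints is a hypothetical name rather than a function from this eval.

```python
def summarize_checkpoints(passed: list) -> dict:
    """Build a Score.value-style dict from per-checkpoint pass/fail results.

    Assumption for this sketch: checkpoints are equally weighted.
    """
    if not passed:
        raise ValueError("at least one checkpoint is required")
    fraction = sum(1 for p in passed if p) / len(passed)
    return {
        # "C" only when every checkpoint passed, otherwise "I".
        "score": "C" if all(passed) else "I",
        "checkpoints": fraction,
    }


print(summarize_checkpoints([True, False, False, False, False]))
print(summarize_checkpoints([True, True, True]))
```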

Inspect is wired to treat each field on its own. TAC_SCORER_METRICS maps metrics to dict keys so accuracy() runs on score and checkpoint averaging runs on checkpoints, rather than on the whole dict. In the eval log you should see both metrics (sometimes under separate entries named score and checkpoints).
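The effect of per-key metrics can be approximated in plain Python. The sketch below only mimics the idea of running accuracy on score and a mean on checkpoints; it does not use Inspect or TAC_SCORER_METRICS, and aggregate_by_key is a hypothetical name.

```python
from statistics import mean


def aggregate_by_key(values: list) -> dict:
    """Aggregate Score.value-style dicts one key at a time."""
    # Accuracy: fraction of samples whose pass/fail field is "C".
    accuracy = mean(1.0 if v["score"] == "C" else 0.0 for v in values)
    # Partial progress: plain average of the checkpoint fractions.
    checkpoint_mean = mean(v["checkpoints"] for v in values)
    return {"score": accuracy, "checkpoints": checkpoint_mean}


sample_values = [
    {"score": "C", "checkpoints": 1.0},
    {"score": "I", "checkpoints": 0.5},
    {"score": "I", "checkpoints": 0.0},
    {"score": "C", "checkpoints": 1.0},
]
print(aggregate_by_key(sample_values))
```

Here half the samples pass outright, while the checkpoint average is higher because one failing sample still made partial progress.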

The react solver also needs to know whether a submission passed. After each submit(), it checks the score to decide whether to retry. theagentcompany.py passes tac_attempt_score_value, which reads only the score field for that decision—the same pass/fail semantics as the metrics, without mixing in checkpoint progress.
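A minimal sketch of that idea follows. The real tac_attempt_score_value may differ in signature and behaviour; attempt_passed is a hypothetical stand-in that shows only the principle of reading the score field and ignoring checkpoints.

```python
def attempt_passed(value) -> float:
    """Map a score value to 1.0 (passed) or 0.0 (not passed).

    Hypothetical stand-in: accepts either the full dict or a bare "C"/"I".
    """
    if isinstance(value, dict):
        # Only the pass/fail field matters; checkpoint progress is ignored.
        value = value.get("score")
    return 1.0 if value == "C" else 0.0


# High checkpoint progress does not count as a pass on its own.
print(attempt_passed({"score": "I", "checkpoints": 0.8}))
print(attempt_passed({"score": "C", "checkpoints": 1.0}))
```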

If you still see a warning like unable to convert value to float: {'score': 'C', 'checkpoints': 1.0}, Inspect received the full dict where it expected a single scalar. That usually means an older checkout, a log rescored before this layout, or a custom integration that has not adopted the dict-key metrics. Re-running the eval on current code is the right fix; the run itself is otherwise fine.

Evaluation Report

The Agent Company team maintains a leaderboard of model performance. The table below reproduces figures from their published run, OpenHands–Versa with Claude Sonnet 4 (evaluation 1.0.0, 14 June 2025).

Report configuration:

Eval version: 1-A
A model ID (published baseline): anthropic/claude-sonnet-4
B model ID (this implementation): anthropic/claude-haiku-4-5-20251001

Commands used to produce the B results in this table:

uv run inspect eval inspect_evals/theagentcompany --model anthropic/claude-haiku-4-5-20251001 --limit 1
uv run inspect eval inspect_evals/theagentcompany -T task_name=ds_sql_exercise -T scoring_mode=improved --model anthropic/claude-haiku-4-5-20251001 --limit 1
uv run inspect eval inspect_evals/theagentcompany -T task_name=research_answer_questions_on_paper -T scoring_mode=improved --model anthropic/claude-haiku-4-5-20251001 --limit 1

For tasks that require a grader model, the agent model is also used as the grader by default. To specify a different grader model, pass --model-role grader=<model>:

uv run inspect eval inspect_evals/theagentcompany --model anthropic/claude-haiku-4-5-20251001 --limit 1 --model-role grader=anthropic/claude-haiku-4-5-20251001

The Agent Company leaderboard results (⭐ indicates perfect completion)

This section contains a comparison with results from the original implementation, available in this folder. I was unable to set up the original implementation on a Mac, so the published leaderboard results are used instead. One of the most capable models on that leaderboard is Claude Sonnet 4. At the time of writing it is no longer available, so I use Claude Haiku 4.5, which has similar performance at lower cost.

A — OpenHands–Versa with Claude Sonnet 4 (anthropic/claude-sonnet-4, published run linked above). B — this implementation with Claude Haiku 4.5 (anthropic/claude-haiku-4-5-20251001). The grader model for this implementation is the same as the agent model: Claude Haiku 4.5 (anthropic/claude-haiku-4-5-20251001).

Extracted task scores for implemented tasks from comparison_results.md (parsed programmatically, original implementation) and TRAJECTORY_ANALYSIS.md (checkpoint scores for Inspect implementation):

Task	A: original implementation (task score)	B: Inspect implementation (checkpoint score)
`ds-fix-table-values-and-missing-answers-image`	1.00	0.67
`ds-sql-exercise-image`	1.00	0.67
`finance-check-attendance-payroll-image`	1.00	0.33
`hr-check-attendance-one-day-image`	1.00	1.00
`hr-new-grad-job-description-image`	1.00	0.00
`sde-create-sqlite-database-image`	0.44	0.88
`finance-expense-validation-image`	0.38	0.50
`ds-find-meeting-spreadsheet-image`	0.25	0.50
`ds-visualize-data-in-pie-and-bar-chart-image`	0.25	0.50
`hr-check-attendance-multiple-days-image`	0.25	0.67
`admin-remove-pages-pdf-image`	0.17	0.67
`hr-populate-salary-increase-memo-image`	0.14	0.83
`finance-budget-variance-image`	0.12	1.00
`hr-create-employee-manual-image`	0.12	0.33
`hr-organize-talent-info-image`	0.12	0.25
`hr-resume-categorization-image`	0.12	0.25
`admin-make-spreadsheet-image`	0.00	0.00
`admin-mass-forms-filling-image`	0.00	0.00
`admin-translate-sales-chat-image`	0.00	0.00
`admin-watch-video-image`	0.00	0.00
`ds-answer-numerical-data-question-image`	0.00	0.00
`ds-coffee-shop-database-management-image`	0.00	0.00
`ds-format-excel-sheets-image`	0.00	0.00
`ds-predictive-modeling-image`	0.00	1.00
`ds-stock-analysis-slides-image`	0.00	0.17
`finance-invoice-matching-image`	0.00	0.33
`finance-nonqualified-bill-ask-for-reimburse-image`	0.00	0.00
`hr-check-attendance-multiple-days-department-image`	0.00	0.00
`hr-salary-analysis-image`	0.00	0.00
`ml-generate-gradcam-image`	0.00	0.00
`ml-grade-exam-image`	0.00	0.00
`research-answer-questions-on-paper-image`	0.00	0.58
`sde-copy-table-from-pdf-to-xlsx-image`	0.00	0.20
`sde-run-rising-wave-locally-image`	0.00	1.00
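The rows above use the original image names (hyphenated, with an -image suffix), while the task_name parameter uses underscores. The sketch below parses tab-separated rows in this layout and converts the names. It assumes the naming pattern holds for every row; both helper names are hypothetical, and the two sample rows are copied from the table.

```python
def image_name_to_task_name(image_name: str) -> str:
    """Convert e.g. 'ds-sql-exercise-image' to 'ds_sql_exercise'."""
    return image_name.removesuffix("-image").replace("-", "_")


def parse_rows(text: str) -> dict:
    """Parse 'name<TAB>original<TAB>inspect' rows into {task_name: (a, b)}."""
    rows = {}
    for line in text.strip().splitlines():
        name, original, ported = line.split("\t")
        task_name = image_name_to_task_name(name.strip("`"))
        rows[task_name] = (float(original), float(ported))
    return rows


SAMPLE_ROWS = (
    "`ds-sql-exercise-image`\t1.00\t0.67\n"
    "`finance-budget-variance-image`\t0.12\t1.00\n"
)
parsed = parse_rows(SAMPLE_ROWS)
print(parsed)
```

The resulting keys can be passed directly as -T task_name=... values when re-running a single row.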

Scoring and multiple epochs

Task scorers store the aggregate checkpoint completion rate in Score.value["checkpoints"] so Inspect can reduce it correctly when running with --epochs greater than 1. Per-checkpoint detail remains in Score.metadata["checkpoints"] for log inspection; that metadata is not averaged across epochs.
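The distinction between the reducible value and the per-epoch metadata can be sketched as follows. EpochScore is a hypothetical stand-in for Inspect's Score object, and the mean is used here only as an example reducer; the checkpoint names in the metadata are invented for the illustration.

```python
from dataclasses import dataclass, field
from statistics import mean


@dataclass
class EpochScore:
    """Hypothetical stand-in for a score from one epoch of one sample."""

    value: dict
    metadata: dict = field(default_factory=dict)


def mean_checkpoints(scores: list) -> float:
    """Average the scalar checkpoint rate across epochs."""
    return mean(s.value["checkpoints"] for s in scores)


epochs = [
    EpochScore({"score": "I", "checkpoints": 0.5}, {"checkpoints": {"cp1": True, "cp2": False}}),
    EpochScore({"score": "C", "checkpoints": 1.0}, {"checkpoints": {"cp1": True, "cp2": True}}),
    EpochScore({"score": "I", "checkpoints": 0.75}, {"checkpoints": {"cp1": True, "cp2": False}}),
]

# The scalar rate reduces cleanly; per-checkpoint detail stays attached to each epoch.
print(mean_checkpoints(epochs))
print([s.metadata["checkpoints"] for s in epochs])
```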

Testing

This eval includes several test tasks that should pass quickly for most capable models. They verify that services are connected and required data is loaded.

test_owncloud_file_reading - verifies ownCloud data is loaded and accessible via the browser
test_basic_llm_judge - verifies an LLM judge can evaluate container-provided content
test_install_permissions - verifies the agent can install packages in the sandbox (poppler-utils)
test_evaluator_helpers - verifies shared file-read helpers (text, CSV, Excel, ownCloud lookup)

Rocket.Chat is not covered by a test task in this PR; connectivity checks will be added with the Rocket.Chat tasks in stage 3.

Changelog

[1-A] - 2026-05-19

Initial Inspect adaptation of The Agent Company benchmark. The full benchmark has 175 tasks; this implementation is rolled out in stages for reviewability.
Stage 1 (merged): five ownCloud-only example tasks to validate the approach.
Stage 2: twenty-eight additional ownCloud-only tasks (34 dataset samples including test tasks).
Per-task checkpoint scoring with separate score (pass/fail) and checkpoints (partial progress) metrics; optional improved scoring mode where original scorers are known to be brittle.
Test tasks for ownCloud connectivity, LLM judge, sandbox install permissions, and evaluator helpers. Rocket.Chat connectivity checks deferred to stage 3.