The Agent Company: Evaluating multi-tool autonomous agents in a synthetic company
The Agent Company benchmark evaluates autonomous agents in a realistic, self-contained company environment. Tasks require browsing internal web services, reading and writing files, running code, and coordinating tools to solve multi-step problems.
Overview
This eval provides an adaptation of the original TheAgentCompany implementation. Where possible it reuses code with attribution.
The full eval comprises 175 different tasks that connect to a range of bespoke services. To make this implementation easier to review, this work is split into stages:
- Basic example tasks - 5 tasks, ownCloud only - to check approach
- Basic remaining - 28 tasks, ownCloud only - splitting to make the review easier
- Supporting non-player LLM interactions - 5 tasks, adding Rocketchat - check approach
- Remaining non-player LLM interactions - about 50
- Coding tasks - 5 tasks, gitlab
- All remaining tasks - 87 tasks
The first PR has been merged. The current PR focuses on step 2.
Usage
Installation
There are two ways of using Inspect Evals, from pypi as a dependency of your own project and as a standalone checked out GitHub repository.
If you are using it from pypi, install the package and its dependencies via:
pip install inspect-evals[theagentcompany]If you are using Inspect Evals in its repository, start by installing the necessary dependencies with:
uv sync --extra theagentcompanyRunning evaluations
Now you can start evaluating models. For simplicity’s sake, this section assumes you are using Inspect Evals from the standalone repo. If that’s not the case and you are not using uv to manage dependencies in your own project, you can use the same commands with uv run dropped.
uv run inspect eval inspect_evals/theagentcompany --model openai/gpt-5-nanoYou can also import tasks as normal Python objects and run them from python:
from inspect_ai import eval
from inspect_evals.theagentcompany import theagentcompany
eval(theagentcompany)After running evaluations, you can view their logs using the inspect view command:
uv run inspect viewFor VS Code, you can also download Inspect AI extension for viewing logs.
If you don’t want to specify the --model each time you run an evaluation, create a .env configuration file in your working directory that defines the INSPECT_EVAL_MODEL environment variable along with your API key. For example:
INSPECT_EVAL_MODEL=anthropic/claude-opus-4-1-20250805
ANTHROPIC_API_KEY=<anthropic-api-key>Pre-flight checks (before running agents)
Structural checks (dataset YAML, asset paths, scorer imports, optional Docker Compose validation) without a live model:
uv sync --extra theagentcompany
uv run pytest tests/theagentcompany/Options
You can control a variety of options from the command line. For example:
uv run inspect eval inspect_evals/theagentcompany --limit 10
uv run inspect eval inspect_evals/theagentcompany --max-connections 10
uv run inspect eval inspect_evals/theagentcompany --temperature 0.5See uv run inspect eval --help for all available options.
Parameters
theagentcompany
task_name(Union[Literal[‘admin_make_spreadsheet’, ‘admin_mass_forms_filling’, ‘admin_remove_pages_pdf’, ‘admin_translate_sales_chat’, ‘admin_watch_video’, ‘ds_answer_numerical_data_question’, ‘ds_coffee_shop_database_management’, ‘ds_find_meeting_spreadsheet’, ‘ds_fix_table_values_and_missing_answers’, ‘ds_format_excel_sheets’, ‘ds_predictive_modeling’, ‘ds_sql_exercise’, ‘ds_stock_analysis_slides’, ‘ds_visualize_data_in_pie_and_bar_chart’, ‘finance_budget_variance’, ‘finance_check_attendance_payroll’, ‘finance_expense_validation’, ‘finance_invoice_matching’, ‘finance_nonqualified_bill_ask_for_reimburse’, ‘hr_check_attendance_multiple_days’, ‘hr_check_attendance_multiple_days_department’, ‘hr_check_attendance_one_day’, ‘hr_create_employee_manual’, ‘hr_new_grad_job_description’, ‘hr_organize_talent_info’, ‘hr_populate_salary_increase_memo’, ‘hr_resume_categorization’, ‘hr_salary_analysis’, ‘ml_generate_gradcam’, ‘ml_grade_exam’, ‘research_answer_questions_on_paper’, ‘sde_copy_table_from_pdf_to_xlsx’, ‘sde_create_sqlite_database’, ‘sde_run_rising_wave_locally’], Literal[‘test_basic_llm_judge’, ‘test_owncloud_file_reading’, ‘test_install_permissions’, ‘test_evaluator_helpers’], None]): The name of the task to load. Can be a real task or a test task. If omitted, runs all real tasks and dynamically selects the correct scorer per sample via the sample’seval_namemetadata. (default:None)scoring_mode(Literal[‘original’, ‘improved’]): Fords_sql_exercise, chooseoriginal(exact file match) orimproved(LLM-as-judge on the same outputs). Other tasks ignore this and use their usual scorer. (default:'original')max_messages(int): Maximum number of messages in agent conversation. This is set to 100 to match the original TheAgentCompany implementation. (default:100)
Running individual tasks
To run a specific task, pass task_name via -T:
# Run a specific task
uv run inspect eval inspect_evals/theagentcompany -T task_name=hr_populate_salary_increase_memo --model anthropic/claude-haiku-4-5Some tasks also support an LLM-judge scoring mode:
uv run inspect eval inspect_evals/theagentcompany -T task_name=ds_sql_exercise -T scoring_mode=improved --model openai/gpt-5-nano --limit 1Dataset
This eval provides a self-contained environment with internal websites and data. To replicate those systems, it uses additional service containers. For example, a task might require the agent to use a web browser to access a company website served by a connected container.
A simpler task is to download a spreadsheet from a web server, read the file, perform calculations, and save the result in the workspace. More complex tasks can involve interacting with colleagues, installing software, and running local tooling.
There is also a small set of test tasks, provided to help development and to test that the environment and agent is configured correctly: test_basic_llm_judge, test_owncloud_file_reading, test_install_permissions, test_evaluator_helpers.
Scoring
Scoring is done using a custom scorer for each task, which runs on the host. All tasks implemented so far have an original scoring method, which matches the original implementation as closely as possible.
However, many of these scorers have issues - for example relying on string matching e.g. ds_answer_numerical_data_question, and therefore failing correct answers due to minor formatting differences. Others fail models for not achieving results not specified in the eval, for example ml_grade_exam, or use very stringent marking criteria research_answer_questions_on_paper. In these cases, an improved scoring method has been implemented, access using the parameter -T scoring_mode=improved.
Metrics
Each task reports two kinds of outcome in a single Score.value dict:
score— overall pass or fail (CorI), matching the original TheAgentCompany “all checkpoints passed” rule.checkpoints— partial progress as a float from 0.0 to 1.0 (for example0.2means roughly one in five checkpoints passed).
That split is deliberate: you get a clear pass/fail for leaderboard-style accuracy, and a separate signal for how far the agent got when it did not fully succeed.
Inspect is wired to treat each field on its own. TAC_SCORER_METRICS maps metrics to dict keys so accuracy() runs on score and checkpoint averaging runs on checkpoints, rather than on the whole dict. In the eval log you should see both metrics (sometimes under separate entries named score and checkpoints).
The react solver also needs to know whether a submission passed. After each submit(), it checks the score to decide whether to retry. theagentcompany.py passes tac_attempt_score_value, which reads only the score field for that decision—the same pass/fail semantics as the metrics, without mixing in checkpoint progress.
If you still see a warning like unable to convert value to float: {'score': 'C', 'checkpoints': 1.0}, Inspect received the full dict where it expected a single scalar. That usually means an older checkout, a log rescored before this layout, or a custom integration that has not adopted the dict-key metrics. Re-running the eval on current code is the right fix; the run itself is otherwise fine.
Evaluation Report
The Agent Company team maintains a leaderboard of model performance. The table below reproduces figures from their published run OpenHands–Versa with Claude Sonnet 4 (evaluation 1.0.0, 14 June 2025).
Report configuration:
- Eval version:
1-A - A model ID (published baseline):
anthropic/claude-sonnet-4 - B model ID (this implementation):
anthropic/claude-haiku-4-5-20251001
Commands used to produce the B results in this table:
uv run inspect eval inspect_evals/theagentcompany --model anthropic/claude-haiku-4-5-20251001 --limit 1
uv run inspect eval inspect_evals/theagentcompany -T task_name=ds_sql_exercise -T scoring_mode=improved --model anthropic/claude-haiku-4-5-20251001 --limit 1
uv run inspect eval inspect_evals/theagentcompany -T task_name=research_answer_questions_on_paper -T scoring_mode=improved --model anthropic/claude-haiku-4-5-20251001 --limit 1By default the same model as the agent will be used as the grader, for tasks requiring a grader model. If you would like to specify a different grader model, you can do this using this syntax:
uv run inspect eval inspect_evals/theagentcompany --model anthropic/claude-haiku-4-5-20251001 --limit 1 --model-role grader=anthropic/claude-haiku-4-5-20251001The Agent Company leaderboard results (⭐ indicates perfect completion)
This section contains a comparison with results from the original implementation, available in this folder. I spent a while trying to set up the original implementation on a Mac but was not able to, so the results from this leaderboard are used. One of the most capable models they used was Claude Sonnet 4. At the time of writing this is no longer available, so I use Claude Haiku 4.5 as it has similar performance and lower cost.
A — OpenHands–Versa with Claude Sonnet 4 (anthropic/claude-sonnet-4, published run linked above). B — this implementation with Claude Haiku 4.5 (anthropic/claude-haiku-4-5-20251001). The grader model for this implementation is the same as the agent model: Claude Haiku 4.5 (anthropic/claude-haiku-4-5-20251001).
Extracted task scores for implemented tasks from comparison_results.md (parsed programmatically, original implementation) and TRAJECTORY_ANALYSIS.md (checkpoint scores for Inspect implementation):
| Task | Original implementation | Inspect implementation |
|---|---|---|
ds-fix-table-values-and-missing-answers-image |
1.00 | 0.67 |
ds-sql-exercise-image |
1.00 | 0.67 |
finance-check-attendance-payroll-image |
1.00 | 0.33 |
hr-check-attendance-one-day-image |
1.00 | 1.00 |
hr-new-grad-job-description-image |
1.00 | 0.00 |
sde-create-sqlite-database-image |
0.44 | 0.88 |
finance-expense-validation-image |
0.38 | 0.50 |
ds-find-meeting-spreadsheet-image |
0.25 | 0.50 |
ds-visualize-data-in-pie-and-bar-chart-image |
0.25 | 0.50 |
hr-check-attendance-multiple-days-image |
0.25 | 0.67 |
admin-remove-pages-pdf-image |
0.17 | 0.67 |
hr-populate-salary-increase-memo-image |
0.14 | 0.83 |
finance-budget-variance-image |
0.12 | 1.00 |
hr-create-employee-manual-image |
0.12 | 0.33 |
hr-organize-talent-info-image |
0.12 | 0.25 |
hr-resume-categorization-image |
0.12 | 0.25 |
admin-make-spreadsheet-image |
0.00 | 0.00 |
admin-mass-forms-filling-image |
0.00 | 0.00 |
admin-translate-sales-chat-image |
0.00 | 0.00 |
admin-watch-video-image |
0.00 | 0.00 |
ds-answer-numerical-data-question-image |
0.00 | 0.00 |
ds-coffee-shop-database-management-image |
0.00 | 0.00 |
ds-format-excel-sheets-image |
0.00 | 0.00 |
ds-predictive-modeling-image |
0.00 | 1.00 |
ds-stock-analysis-slides-image |
0.00 | 0.17 |
finance-invoice-matching-image |
0.00 | 0.33 |
finance-nonqualified-bill-ask-for-reimburse-image |
0.00 | 0.00 |
hr-check-attendance-multiple-days-department-image |
0.00 | 0.00 |
hr-salary-analysis-image |
0.00 | 0.00 |
ml-generate-gradcam-image |
0.00 | 0.00 |
ml-grade-exam-image |
0.00 | 0.00 |
research-answer-questions-on-paper-image |
0.00 | 0.58 |
sde-copy-table-from-pdf-to-xlsx-image |
0.00 | 0.20 |
sde-run-rising-wave-locally-image |
0.00 | 1.00 |
Scoring and multiple epochs
Task scorers store the aggregate checkpoint completion rate in Score.value["checkpoints"] so Inspect can reduce it correctly when running with --epochs greater than 1. Per-checkpoint detail remains in Score.metadata["checkpoints"] for log inspection; that metadata is not averaged across epochs.
Testing
This eval includes several test tasks that should pass quickly for most capable models. They verify that services are connected and required data is loaded.
test-owncloud-file-reading- verifies ownCloud data is loaded and accessible via the browsertest-basic-llm-judge- verifies an LLM judge can evaluate container-provided contenttest-install-permissions- verifies the agent can install packages in the sandbox (poppler-utils)test-evaluator-helpers- verifies shared file-read helpers (text, CSV, Excel, ownCloud lookup)
Rocket.Chat is not covered by a test task in this PR; connectivity checks will be added with the Rocket.Chat tasks in stage 3.
Changelog
[1-A] - 2026-05-19
- Initial Inspect adaptation of The Agent Company benchmark. The full benchmark has 175 tasks; this implementation is rolled out in stages for reviewability.
- Stage 1 (merged): five ownCloud-only example tasks to validate the approach.
- Stage 2: twenty-eight additional ownCloud-only tasks (34 dataset samples including test tasks).
- Per-task checkpoint scoring with separate
score(pass/fail) andcheckpoints(partial progress) metrics; optionalimprovedscoring mode where original scorers are known to be brittle. - Test tasks for ownCloud connectivity, LLM judge, sandbox install permissions, and evaluator helpers. Rocket.Chat connectivity checks deferred to stage 3.