Terminal-Bench 2.0: Terminal Mastery Challenges

Overview

Terminal-Bench is a popular benchmark for measuring the capabilities of agents and language models to perform realistic, end-to-end challenges in containerized environments. It measures whether an agent can execute workflows that resemble real engineering, data, devops, and software-automation tasks, not just text generation.

Usage

Installation

There are two ways of using Inspect Evals: from PyPI as a dependency of your own project, or as a standalone checked-out GitHub repository.

If you are using it from PyPI, install the package and its dependencies via:

pip install inspect-evals[terminal_bench_2]

If you are using Inspect Evals in its repository, start by installing the necessary dependencies with:

uv sync --extra terminal_bench_2

Running evaluations

Now you can start evaluating models. For simplicity’s sake, this section assumes you are using Inspect Evals from the standalone repo. If that’s not the case and you are not using uv to manage dependencies in your own project, use the same commands without the uv run prefix.

uv run inspect eval inspect_evals/terminal_bench_2 --model openai/gpt-5-nano

You can also import tasks as normal Python objects and run them from python:

from inspect_ai import eval
from inspect_evals.terminal_bench_2 import terminal_bench_2
eval(terminal_bench_2)

After running evaluations, you can view their logs using the inspect view command:

uv run inspect view

For VS Code, you can also install the Inspect AI extension for viewing logs.

If you don’t want to specify the --model each time you run an evaluation, create a .env configuration file in your working directory that defines the INSPECT_EVAL_MODEL environment variable along with your API key. For example:

INSPECT_EVAL_MODEL=anthropic/claude-opus-4-1-20250805
ANTHROPIC_API_KEY=<anthropic-api-key>

Options

You can control a variety of options from the command line. For example:

uv run inspect eval inspect_evals/terminal_bench_2 --limit 10
uv run inspect eval inspect_evals/terminal_bench_2 --max-connections 10
uv run inspect eval inspect_evals/terminal_bench_2 --temperature 0.5

See uv run inspect eval --help for all available options.

Parameters

terminal_bench_2

  • eval_names (str | list[str] | None): The list of evals to load. If None, all evals are loaded. (default: None)
  • variant_names (str | list[str] | None): The variant or list of variants to include. If None, only the default variant is included. (default: 'default')
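For example, to run only a subset of challenges, these parameters can be passed when constructing the task from Python. This is a minimal sketch; the challenge name below is a placeholder, so substitute a real folder name from the challenges subdirectory.

from inspect_ai import eval
from inspect_evals.terminal_bench_2 import terminal_bench_2

# "example-challenge" is a placeholder; substitute a folder name from
# src/inspect_evals/terminal_bench_2/challenges.
eval(
    terminal_bench_2(eval_names=["example-challenge"]),
    model="openai/gpt-5-nano",
)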

Dataset

The dataset source is the Terminal-Bench 2.0 GitHub repository, containing 89 challenges (as of Dec 2025). Each challenge is manually authored and includes:

  • A natural-language instruction
  • A deterministic Docker environment
  • A test suite defining ground-truth behavior
  • An oracle (reference) solution

For 2.0, all challenges went through manual review + LLM-assisted verification to ensure they are solvable, deterministic, and robust.

Scoring

Each challenge provides a test.sh script, which installs test dependencies and verifies that the agent completed the instruction. Terminal-Bench uses all-or-nothing scoring: the agent passes a challenge only if every test defined in the test script succeeds. The agent is given a single attempt at each challenge.

For an eval run including several challenges, the accuracy score is calculated as: (Number of tasks successfully completed / Total tasks attempted) × 100%

For stronger statistical confidence, Terminal-Bench typically runs:

  • 3 repetitions for standard evaluation
  • 5 repetitions for leaderboard runs

Final accuracy = mean pass rate across repetitions.
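If you want to approximate this repetition protocol with Inspect, a minimal sketch is to use Inspect’s epochs option, which by default averages scores across epochs. Note that the results reported below were produced with a single epoch.

from inspect_ai import eval
from inspect_evals.terminal_bench_2 import terminal_bench_2

# Three repetitions of the full task; the default epoch reducer reports the
# mean pass rate across repetitions.
eval(terminal_bench_2, epochs=3, model="openai/gpt-5-nano")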

Evaluation Report

Implementation Details

Full details in this doc

Key implementation characteristics:

  • A single task is created for all the Terminal-Bench 2.0 challenges. In the challenges subdirectory, each challenge has its own folder containing all the files required to configure and run the eval. Inspect Cyber scans the challenges subdirectory, automatically finds the evaluation configuration files, and converts the variants within them into samples.
  • A ReAct agent with bash and python tools is used as a solver. The agent prompt has been customized to closely resemble Harbor’s Terminus 2 prompts while working within Inspect’s agent framework.
  • The custom scorer simply runs the test script in the sandbox and checks the value written to the reward.txt file to determine success or failure (see the sketch after this list).
  • Uses pre-built Docker images from Docker Hub by default
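For illustration, below is a minimal sketch of this style of scorer. It is not the exact scorer used in this implementation: the test-script path and the reward.txt location are assumptions, and the retry logic mentioned under Known limitations is omitted.

from inspect_ai.scorer import CORRECT, INCORRECT, Score, Target, accuracy, scorer, stderr
from inspect_ai.solver import TaskState
from inspect_ai.util import sandbox


@scorer(metrics=[accuracy(), stderr()])
def reward_file_scorer():
    async def score(state: TaskState, target: Target) -> Score:
        # Run the challenge's test script inside the sandbox
        # ("/tests/test.sh" is an assumed path for this sketch).
        result = await sandbox().exec(["bash", "/tests/test.sh"], timeout=600)
        # The test script is expected to write its verdict to reward.txt
        # ("/tests/reward.txt" is likewise assumed); treat a reward of 1 as a pass.
        try:
            reward = (await sandbox().read_file("/tests/reward.txt")).strip()
        except FileNotFoundError:
            reward = "0"
        return Score(
            value=CORRECT if reward == "1" else INCORRECT,
            explanation=(result.stdout or "")[-1000:],
        )

    return score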

Known differences from Terminal-Bench 2.0 (Terminus 2 agent):

  • This implementation uses Inspect’s ReAct agent framework, while Terminus 2 uses Harbor’s custom agent implementation
  • Prompting strategy has been adapted to work with Inspect’s tool calling patterns
  • The ReAct agent’s reasoning and tool-use patterns may differ from Terminus 2’s approach

Known limitations:

  • Some challenges require internet access for installing dependencies during test execution, which may occasionally result in transient network failures (mitigated by automatic retry logic)
  • Docker container startup time can vary based on system resources

Results

Model             Provider   Accuracy  Stderr  Time
gpt-5-mini        OpenAI     0.315     0.050   1h 21m
gpt-5-nano        OpenAI     0.124     0.035   1h 48m
claude-haiku-4-5  Anthropic  0.371     0.051   3h 07m

Comparison with official leaderboard:

The official Terminal-Bench 2.0 leaderboard reports the following results using the Terminus 2 agent (evaluated October 31, 2025):

Model             Inspect (Dec 19, 2025)  Terminus 2 (Oct 31, 2025)  Difference
gpt-5-mini        31.5% ± 5.0%            24.0% ± 2.5%               +7.5pp
gpt-5-nano        12.4% ± 3.5%            7.9% ± 1.9%                +4.5pp
claude-haiku-4-5  37.1% ± 5.1%            28.3% ± 2.9%               +8.8pp

Analysis:

Our Inspect implementation shows notably higher accuracy compared to the official Terminus 2 agent results. This performance difference is likely attributable to several factors:

  1. Agent architecture differences: Inspect’s ReAct agent and Harbor’s Terminus 2 implementation have different reasoning and tool-use patterns
  2. Prompting strategies: Adapted prompts for Inspect’s framework may elicit different model behaviors
  3. Model improvements over time: The leaderboard results are from October 31, 2025, while these results are from December 19, 2025 (~7 weeks later). OpenAI/Anthropic may have deployed model updates during this period that improved performance, even though the model names remained the same. This was confirmed by re-running the full eval on Harbor using gpt-5-nano (harbor run -d terminal-bench@2.0 -a terminus-2 -m gpt-5-nano --quiet -n 6). This new run, on Dec 23, 2025, scored 13.5%, much closer to our Inspect results.
  4. Scorer differences: During testing we found that the scorer sporadically fails due to transient network errors (the test scripts often install several dependencies in the container). To mitigate this, our implementation allows the scorer to retry up to 3 times. While this is a slight deviation from Harbor, which does not have such a retry, we chose it to produce a fairer score.

All evaluated models successfully completed all 89 challenges without errors, indicating the implementation is stable and functionally correct.

Reproducibility Information

Evaluation runs:

  • Date: Dec 19, 2025 (OpenAI) and Dec 21, 2025 (Anthropic)
  • Dataset: Full Terminal-Bench 2.0 dataset (89 challenges)
  • Samples completed: 89/89 for all evaluated models (100% completion rate)
  • Epochs: 1 (single pass through dataset)

Environment details:

  • Inspect AI version: 0.3.154.dev37+g1289c1ad
  • Models tested:
    • openai/gpt-5-mini (run started: 2025-12-19T12:04:34+00:00)
    • openai/gpt-5-nano (run started: 2025-12-19T10:08:03+00:00)
    • anthropic/claude-haiku-4-5-20251001 (run started 2025-12-21T21:21:22+00:00)
  • Model configuration:
    • Temperature: Default (not explicitly set, uses Inspect’s default behavior)
    • Max messages: Unlimited
  • Docker: Pre-built images from Docker Hub
  • Platform: Linux ubuntu 6.8.0-88-generic

Replication instructions:

To replicate these results, run:

uv run inspect eval inspect_evals/terminal_bench_2 \
    --model openai/gpt-5-mini,openai/gpt-5-nano,anthropic/claude-haiku-4-5-20251001 --max-samples 12

For reproducibility considerations:

  • Temperature is not explicitly set in this evaluation (uses Inspect defaults), which allows for some variation in model outputs across runs
  • Each challenge has deterministic test suites, but agent behavior may vary due to the stochastic nature of model generation
  • Docker environments are fully specified and deterministic
  • No random seed is set for model generation; results may vary slightly on repeated runs

How to migrate more Terminal-Bench 2.0 challenges?

The migrate_terminal_bench.py script can be reused if more challenges are added to the Terminal Bench 2.0 repository after the initial port to inspect_evals.

From the root of the inspect_evals project, execute the following command:

# For a single challenge
uv run --with ruamel.yaml python src/inspect_evals/terminal_bench_2/migration_scripts/migrate_terminal_bench.py \
    --source /path/to/terminal-bench-2 \
    --target src/inspect_evals/terminal_bench_2/challenges \
    --challenge <new_challenge_name>
    
# For multiple challenges at once
uv run --with ruamel.yaml python src/inspect_evals/terminal_bench_2/migration_scripts/migrate_terminal_bench.py \
    --source /path/to/terminal-bench-2 \
    --target src/inspect_evals/terminal_bench_2/challenges \
    --all \
    --overwrite    # Re-migrates every challenge; already-migrated, unchanged challenges are unaffected because the migration is idempotent.