MLE-bench: Evaluating Machine Learning Agents on Machine Learning Engineering

Machine learning tasks drawn from 75 Kaggle competitions.

Overview

This is an inspect-native implementation of MLE-bench, a benchmark for evaluating language model agents on Machine Learning Engineering tasks sourced from Kaggle competitions.

Usage

Note: This eval has dependencies that conflict with other evals in this repo. It ships its own isolated environment under packages/mle_bench/ rather than sharing the repo’s main one.

Installation

From the repository (recommended)

Sync the eval’s environment from its package directory:

cd packages/mle_bench
uv sync

Then run evaluations from that directory:

uv run inspect eval inspect_evals/mle_bench --model openai/gpt-5-nano
uv run inspect eval inspect_evals/mle_bench_full --model openai/gpt-5-nano
uv run inspect eval inspect_evals/mle_bench_lite --model openai/gpt-5-nano

To run multiple tasks simultaneously, use inspect eval-set:

uv run inspect eval-set inspect_evals/mle_bench inspect_evals/mle_bench_full inspect_evals/mle_bench_lite

Into an existing environment

If you already have a Python environment, install the eval’s package directly — it brings in inspect-evals alongside the eval-specific dependencies:

pip install /path/to/inspect_evals/packages/mle_bench/

Then run without uv run:

inspect eval inspect_evals/mle_bench --model openai/gpt-5-nano
inspect eval inspect_evals/mle_bench_full --model openai/gpt-5-nano
inspect eval inspect_evals/mle_bench_lite --model openai/gpt-5-nano

You can also import the task directly from Python:

from inspect_ai import eval, eval_set
from inspect_evals.mle_bench import mle_bench, mle_bench_full, mle_bench_lite
eval(mle_bench)
eval_set([mle_bench, mle_bench_full, mle_bench_lite], log_dir='logs-run-42')

From a Jupyter notebook (using Inspect Flow)

Inspect Flow manages the isolated environment for you. Install it once:

pip install inspect-flow

Then from a notebook cell, point Flow at this eval’s pyproject.toml so it builds the isolated venv from the same dependency set as uv sync:

from inspect_flow import FlowSpec, FlowTask, FlowDependencies
from inspect_flow.api import run

spec = FlowSpec(
    execution_type="venv",
    log_dir="logs/",
    dependencies=FlowDependencies(
        dependency_file="path/to/inspect_evals/packages/mle_bench/pyproject.toml",
    ),
    tasks=[FlowTask(name="inspect_evals/mle_bench", model="openai/gpt-5-nano")],
)

run(spec)  # blocking; builds isolated venv on first run, cached thereafter

After the run completes, read and analyse logs in the same notebook — the eval’s heavy dependencies are not required for log reading.

Running evaluations

After installing via any of the methods above, you can evaluate models. The examples below assume the repository approach; drop uv run if you are managing dependencies yourself.

uv run inspect eval inspect_evals/mle_bench --model openai/gpt-5-nano
uv run inspect eval inspect_evals/mle_bench_full --model openai/gpt-5-nano
uv run inspect eval inspect_evals/mle_bench_lite --model openai/gpt-5-nano

After running evaluations, you can view their logs using the inspect view command:

uv run inspect view

For VS Code, you can also download the Inspect AI extension for viewing logs.

If you don’t want to specify the --model each time you run an evaluation, create a .env configuration file in your working directory that defines the INSPECT_EVAL_MODEL environment variable along with your API key. For example:

INSPECT_EVAL_MODEL=anthropic/claude-opus-4-1-20250805
ANTHROPIC_API_KEY=<anthropic-api-key>

Note: The first run of the mle_bench task builds the necessary Docker image. This takes about half an hour and creates a ~25GB image.

The dataset is sourced from the Kaggle API. Download your Kaggle credentials file (kaggle.json) and place it in the ~/.kaggle/ directory, the default location where the Kaggle API looks for credentials.
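Before starting a long run, it can help to confirm that the credentials file is where the Kaggle API expects it. The following is a minimal sketch, not part of this eval: check_kaggle_credentials is a hypothetical helper, and it assumes the standard token layout with username and key fields.

```python
import json
from pathlib import Path

DEFAULT_PATH = Path.home() / ".kaggle" / "kaggle.json"


def check_kaggle_credentials(path: Path = DEFAULT_PATH) -> list[str]:
    """Return a list of problems with the credentials file (empty if none)."""
    if not path.is_file():
        return [f"{path} does not exist"]
    try:
        data = json.loads(path.read_text())
    except json.JSONDecodeError as exc:
        return [f"{path} is not valid JSON: {exc}"]
    if not isinstance(data, dict):
        return [f"{path} does not contain a JSON object"]
    # Assumption: the token file carries a username and an API key.
    return [
        f"missing or empty field: {field}"
        for field in ("username", "key")
        if not data.get(field)
    ]
```

Calling check_kaggle_credentials() with no arguments inspects the default location; an empty list means the file looks usable.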

For each competition, the dataset will be downloaded and prepared automatically. You will need to manually join each competition by clicking the “Late Submission” button on the Kaggle competition page. We strongly recommend accepting the rules for all competitions you intend to run before starting the evaluation — see Competition Rules Acceptance below.

Competition Rules Acceptance

Before running evaluations, you must accept the competition rules on Kaggle for each competition in your chosen split. You can optionally skip the runtime TOS verification with -T skip_tos_check=true.

Full benchmark (`all.txt`) — 73 competitions

Note: The competitions detecting-insults-in-social-commentary and the-icml-2013-whale-challenge-right-whale-redux are excluded — their rules can no longer be accepted on Kaggle.

Lite benchmark (`low.txt`) — 20 competitions

Note: The competitions detecting-insults-in-social-commentary and the-icml-2013-whale-challenge-right-whale-redux are also in this split file but are excluded at runtime — their rules can no longer be accepted on Kaggle.
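To get a checklist of the competitions whose rules still need accepting, the split file can be filtered against the two excluded competitions. This is a sketch under stated assumptions: the split file is assumed to hold one competition id per line, and EXCLUDED and runnable_competitions are hypothetical names rather than this eval's API.

```python
from pathlib import Path

# The two competitions whose rules can no longer be accepted on Kaggle.
EXCLUDED = frozenset(
    {
        "detecting-insults-in-social-commentary",
        "the-icml-2013-whale-challenge-right-whale-redux",
    }
)


def runnable_competitions(split_file: Path) -> list[str]:
    """Return the competition ids in a split file, minus the excluded ones."""
    ids = [line.strip() for line in split_file.read_text().splitlines()]
    # Skip blank lines as well as excluded competitions.
    return [cid for cid in ids if cid and cid not in EXCLUDED]
```

Each id in the returned list corresponds to one Kaggle competition page where the rules need to be accepted.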

Options

You can control a variety of options from the command line. For example:

uv run inspect eval inspect_evals/mle_bench --limit 10
uv run inspect eval inspect_evals/mle_bench_full --max-connections 10
uv run inspect eval inspect_evals/mle_bench_lite --temperature 0.5

See uv run inspect eval --help for all available options.

Parameters

`mle_bench`

split (pathlib.Path | str): str | Path The split to use. This should be the name of a .txt file within the splits directory, or a path to a custom split file. low.txt corresponds to MLE-bench-LITE, while all.txt is the full benchmark. The default is a split containing only the spaceship-titanic competition. (default: 'spaceship-titanic.txt')
solver (Solver | None): Solver The solver to use when creating the task. If None, uses the default solver. (default: None)
max_messages (int): int The maximum number of messages to allow for each sample. Only applies when using the default solver. (default: 30)
scorer (Scorer | list[Scorer] | None): Scorer | list[Scorer] | None The scorer to use when evaluating mle_bench. If None, uses the default scorer which returns boolean values for the criteria of “valid_submission”, “above_median” and “any_medal”. (default: None)
epochs (int): int Number of times to repeat each sample. (default: 1)
docker_image_name (str | None): str | None The name of the docker image to be used for each sandbox. If such an image does not already exist, it will be built from inspect_evals/mle_bench/Dockerfile. (default: None)
force_rebuild (bool): bool Whether to rebuild when an image already exists with the name docker_image_name. (default: False)
data_dir (pathlib.Path | str | None): Path | str | None The directory that will be used for per competition data. If it does not exist, the data will be prepared here. If None, a directory in the user’s cache will be used by default. (default: None)
cpu_override (int): int Number of CPUs to restrict the sandbox to. (default: 1)
mem_override (int | None): int | None Amount of RAM (in GB) to restrict the sandbox to. (default: None)
gpu_driver (str | None): str | None GPU driver to use, e.g. “nvidia”. Requires the NVIDIA Container Toolkit (or equivalent) on the host. (default: None)
gpu_count (int | None): int | None Number of GPUs to reserve for the sandbox. (default: None)
compose_overrides (dict[Any, Any] | None): dict[Any, Any] | None Docker Compose service-level keys to deep-merge into the generated sandbox config. Use this to set any option not exposed as a dedicated parameter, e.g. {"shm_size": "2gb"} to increase shared memory, or {"deploy": {"resources": {"reservations": {"devices": [{"driver": "nvidia", "device_ids": ["0"], "capabilities": ["gpu"]}]}}}} to pin the container to a specific GPU device ID. Overrides are applied after all other config is built, so they take precedence. (default: None)
skip_tos_check (bool): bool Skip the upfront Kaggle competition rules (TOS) verification. (default: False)
sandbox_type (str): str The sandbox provider to use (e.g., “docker”, “k8s”, or a custom registered provider). (default: 'docker')
sandbox_config (Optional[Callable[[str, inspect_ai.dataset.Sample], SandboxEnvironmentSpec]]): Callable[[str, Sample], SandboxEnvironmentSpec] | None Optional custom function to create sandbox specs. Receives (sandbox_type, sample) and returns a SandboxEnvironmentSpec. If None, uses default Docker compose file generation. (default: None)
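The deep-merge behaviour described for compose_overrides can be pictured with a small sketch. This is an illustration rather than the eval's own merge code: deep_merge and the sample service config are hypothetical, and the choice to replace lists outright instead of concatenating them is an assumption.

```python
from copy import deepcopy
from typing import Any


def deep_merge(base: dict[str, Any], overrides: dict[str, Any]) -> dict[str, Any]:
    """Recursively merge overrides into a copy of base; overrides win on conflicts."""
    merged = deepcopy(base)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            # Both sides are mappings: merge them key by key.
            merged[key] = deep_merge(merged[key], value)
        else:
            # Scalars and lists replace the existing value (an assumption).
            merged[key] = deepcopy(value)
    return merged


# Hypothetical generated service config, for illustration only.
service = {"image": "example-image", "deploy": {"resources": {"limits": {"cpus": "1"}}}}
overrides = {"shm_size": "2gb", "deploy": {"resources": {"reservations": {"devices": []}}}}
merged = deep_merge(service, overrides)
```

In this sketch the existing limits entry survives alongside the new reservations entry, which is what distinguishes a deep merge from a plain dict update.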

`mle_bench_full`

solver (Solver | None): Solver The solver to use when creating the task. If None, uses the default solver. (default: None)
max_messages (int): int The maximum number of messages to allow for each sample. Only applies when using the default solver. (default: 30)
scorer (Scorer | list[Scorer] | None): Scorer | list[Scorer] | None The scorer to use when evaluating mle_bench. If None, uses the default scorer which returns boolean values for the criteria of “valid_submission”, “above_median” and “any_medal”. (default: None)
epochs (int): int Number of times to repeat each sample. (default: 1)
docker_image_name (str | None): str | None The name of the docker image to be used for each sandbox. If such an image does not already exist, it will be built from inspect_evals/mle_bench/Dockerfile. (default: None)
force_rebuild (bool): bool Whether to rebuild when an image already exists with the name docker_image_name. (default: False)
data_dir (pathlib.Path | str | None): Path | str | None The directory that will be used for per competition data. If it does not exist, the data will be prepared here. If None, a directory in the user’s cache will be used by default. (default: None)
cpu_override (int): int Number of CPUs to restrict the sandbox to. (default: 1)
mem_override (int | None): int | None Amount of RAM (in GB) to restrict the sandbox to. (default: None)
gpu_driver (str | None): str | None GPU driver to use, e.g. “nvidia”. Requires the NVIDIA Container Toolkit (or equivalent) on the host. (default: None)
gpu_count (int | None): int | None Number of GPUs to reserve for the sandbox. (default: None)
compose_overrides (dict[Any, Any] | None): dict[Any, Any] | None Docker Compose service-level keys to deep-merge into the generated sandbox config. See mle_bench for details. (default: None)
skip_tos_check (bool): bool Skip the upfront Kaggle competition rules (TOS) verification. (default: False)
sandbox_type (str): str The sandbox provider to use (e.g., “docker”, “k8s”, or a custom registered provider). (default: 'docker')
sandbox_config (Optional[Callable[[str, inspect_ai.dataset.Sample], SandboxEnvironmentSpec]]): Callable[[str, Sample], SandboxEnvironmentSpec] | None Optional custom function to create sandbox specs. See mle_bench for details. (default: None)

`mle_bench_lite`

solver (Solver | None): Solver The solver to use when creating the task. If None, uses the default solver. (default: None)
max_messages (int): int The maximum number of messages to allow for each sample. Only applies when using the default solver. (default: 30)
scorer (Scorer | list[Scorer] | None): Scorer | list[Scorer] | None The scorer to use when evaluating mle_bench. If None, uses the default scorer which returns boolean values for the criteria of “valid_submission”, “above_median” and “any_medal”. (default: None)
epochs (int): int Number of times to repeat each sample. (default: 1)
docker_image_name (str | None): str | None The name of the docker image to be used for each sandbox. If such an image does not already exist, it will be built from inspect_evals/mle_bench/Dockerfile. (default: None)
force_rebuild (bool): bool Whether to rebuild when an image already exists with the name docker_image_name. (default: False)
data_dir (pathlib.Path | str | None): Path | str | None The directory that will be used for per competition data. If it does not exist, the data will be prepared here. If None, a directory in the user’s cache will be used by default. (default: None)
cpu_override (int): int Number of CPUs to restrict the sandbox to. (default: 1)
mem_override (int | None): int | None Amount of RAM (in GB) to restrict the sandbox to. (default: None)
gpu_driver (str | None): str | None GPU driver to use, e.g. “nvidia”. Requires the NVIDIA Container Toolkit (or equivalent) on the host. (default: None)
gpu_count (int | None): int | None Number of GPUs to reserve for the sandbox. (default: None)
compose_overrides (dict[Any, Any] | None): dict[Any, Any] | None Docker Compose service-level keys to deep-merge into the generated sandbox config. See mle_bench for details. (default: None)
skip_tos_check (bool): bool Skip the upfront Kaggle competition rules (TOS) verification. (default: False)
sandbox_type (str): str The sandbox provider to use (e.g., “docker”, “k8s”, or a custom registered provider). (default: 'docker')
sandbox_config (Optional[Callable[[str, inspect_ai.dataset.Sample], SandboxEnvironmentSpec]]): Callable[[str, Sample], SandboxEnvironmentSpec] | None Optional custom function to create sandbox specs. See mle_bench for details. (default: None)

Splits

A split is a .txt file listing competition ids. Splits live in the splits directory and are run as follows:

inspect eval inspect_evals/mle_bench -T split="spaceship-titanic.txt"
inspect eval inspect_evals/mle_bench -T split="low.txt"
inspect eval inspect_evals/mle_bench -T split="all.txt"

Information about the splits can be found at MLE-bench.

Notes:

The full dataset is ~3.3TB in size and will take ~2 days to download and prepare from scratch.
The lite dataset (split = “low.txt”) is ~158GB.
You are not required to use the full datasets and can just download the competitions you are interested in. “spaceship-titanic.txt” and “dev.txt” are both useful for testing and development purposes, and you can pass custom splits as well.
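A custom split can be produced with a few lines of Python. The sketch below assumes a split file holds one competition id per line; write_split is a hypothetical helper, not something this eval provides.

```python
from pathlib import Path


def write_split(path: Path, competition_ids: list[str]) -> Path:
    """Write competition ids to a split file, one per line, dropping duplicates."""
    cleaned = (cid.strip() for cid in competition_ids)
    # dict.fromkeys removes duplicates while keeping first-seen order.
    unique = list(dict.fromkeys(cid for cid in cleaned if cid))
    path.write_text("\n".join(unique) + "\n")
    return path
```

The resulting file path can then be passed as the split task parameter, in the same way as the built-in split names above.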

Dataset

The dataset is sourced from Kaggle competitions. MLE-bench includes:

75 curated ML engineering competitions from Kaggle
Tasks covering tabular, vision, NLP, and other domains
Public and private test sets with standardized scoring

Scoring

The scorer returns boolean values for three criteria:

valid_submission: Whether the agent produced a valid submission
above_median: Whether the submission scored above the median
any_medal: Whether the submission achieved any medal (bronze, silver, or gold)
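When comparing runs, the three booleans are often summarised as per-criterion rates. The following sketch works on plain dictionaries and is not the eval's own aggregation: criterion_rates is a hypothetical helper, and using the mean across samples is an assumption.

```python
from collections.abc import Iterable, Mapping

CRITERIA = ("valid_submission", "above_median", "any_medal")


def criterion_rates(scores: Iterable[Mapping[str, bool]]) -> dict[str, float]:
    """Return the fraction of samples for which each criterion is True."""
    rows = list(scores)
    if not rows:
        return {name: 0.0 for name in CRITERIA}
    return {
        # A missing criterion is counted as False.
        name: sum(bool(row.get(name, False)) for row in rows) / len(rows)
        for name in CRITERIA
    }
```

For example, passing one dictionary per competition yields the share of competitions with a valid submission, an above-median score, and a medal.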

Example Invocation with `compose_overrides`

Use compose_overrides to supply Docker Compose settings that have no dedicated task parameter:

uv run inspect eval inspect_evals/mle_bench \
        -T gpu_driver=nvidia \
        -T skip_tos_check=true \
        -T 'compose_overrides={"shm_size": "4gb", "deploy": {"resources": {"reservations": {"devices": [{"driver": "nvidia", "device_ids": ["1"], "capabilities": ["gpu"]}]}}}}' \
        --model openai/gpt-4o-mini \
        --log-level info
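Hand-writing nested JSON inside shell quotes is easy to get wrong. One option is to build the argument in Python and let the standard library handle the quoting. This is a sketch: task_arg is a hypothetical helper, and the overrides dictionary simply mirrors the command above.

```python
import json
import shlex

overrides = {
    "shm_size": "4gb",
    "deploy": {
        "resources": {
            "reservations": {
                "devices": [
                    {"driver": "nvidia", "device_ids": ["1"], "capabilities": ["gpu"]}
                ]
            }
        }
    },
}


def task_arg(name: str, value: object) -> str:
    """Format a task parameter as a shell-safe `-T name=<json>` argument."""
    return "-T " + shlex.quote(f"{name}={json.dumps(value)}")


print(task_arg("compose_overrides", overrides))
```

The printed string can be pasted into the command line in place of the hand-written -T 'compose_overrides=...' argument.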

Changelog

[7-F] - 2026-06-11

Added sandbox_type and sandbox_config parameters for provider flexibility.

[7-E] - 2026-04-21

Replaced deprecated basic_agent() with react() as the default agent. This fixes a pathological case where basic_agent() would loop on repeated content_filter stops until hitting message_limit. It also changes the stall-nudge message (sent when the model stops without calling a tool) to append a reminder to call the submit tool — a one-sentence prompt addition that may bias models toward earlier submission.

[6-E] - 2026-04-09

Fix compose file collision: filenames now include a hash of the sandbox config, so concurrent inspect invocations with different GPU/CPU settings no longer overwrite each other’s compose files.
Add compose_overrides task parameter, for supplying extra docker compose configuration.

[6-D] - 2026-03-24

Update pinned upstream commit to 2451bcb.

[5-D] - 2026-03-24

Add skip_tos_check parameter to skip upfront TOS verification.

[5-C] - 2026-03-23

Add gpu_driver and gpu_count parameters to reserve GPU devices in the sandbox container, e.g. -T gpu_driver=nvidia -T gpu_count=2.

[5-B] - 2026-03-18

Pin Dockerfile mle-bench clone to match pyproject.toml rev (1d391b0).

[4-B] - 2026-02-26

Bump per-sample timeout to 6 minutes.
Add configurable CPU and memory.
Make dataset downloading lazy.

[3-A] - 2026-02-19

Bugfix: Skip the the-icml-2013-whale-challenge-right-whale-redux competition due to being unavailable.

[2-A] - 2026-02-16

Migrate version to new scheme. See #907.

[2.0.1] - 2026-02-12

Bugfix: Ensure grading conda env is correctly setup and used.

[2.0.0] - 2026-02-10

Bugfix: Skip the detecting-insults-in-social-commentary competition due to being unavailable.