MLE-bench: Evaluating Machine Learning Agents on Machine Learning Engineering
Machine learning tasks drawn from 75 Kaggle competitions.
Overview
This is an inspect-native implementation of MLE-bench, a benchmark for evaluating language model agents on Machine Learning Engineering tasks sourced from Kaggle competitions.
Usage
Note: This eval has dependencies that conflict with other evals in this repo. It ships its own isolated environment under packages/mle_bench/ rather than sharing the repo’s main one.
Installation
From the repository (recommended)
Sync the eval’s environment from its package directory:
cd packages/mle_bench
uv sync
Then run evaluations from that directory:
uv run inspect eval inspect_evals/mle_bench --model openai/gpt-5-nano
uv run inspect eval inspect_evals/mle_bench_full --model openai/gpt-5-nano
uv run inspect eval inspect_evals/mle_bench_lite --model openai/gpt-5-nano
To run multiple tasks simultaneously, use inspect eval-set:
uv run inspect eval-set inspect_evals/mle_bench inspect_evals/mle_bench_full inspect_evals/mle_bench_lite
Into an existing environment
If you already have a Python environment, install the eval’s package directly — it brings in inspect-evals alongside the eval-specific dependencies:
pip install /path/to/inspect_evals/packages/mle_bench/
Then run without uv run:
inspect eval inspect_evals/mle_bench --model openai/gpt-5-nano
inspect eval inspect_evals/mle_bench_full --model openai/gpt-5-nano
inspect eval inspect_evals/mle_bench_lite --model openai/gpt-5-nano
You can also import the task directly from Python:
from inspect_ai import eval, eval_set
from inspect_evals.mle_bench import mle_bench, mle_bench_full, mle_bench_lite
eval(mle_bench)
eval_set([mle_bench, mle_bench_full, mle_bench_lite], log_dir='logs-run-42')
From a Jupyter notebook (using Inspect Flow)
Inspect Flow manages the isolated environment for you. Install it once:
pip install inspect-flow
Then from a notebook cell, point Flow at this eval’s pyproject.toml so it builds the isolated venv from the same dependency set as uv sync:
from inspect_flow import FlowSpec, FlowTask, FlowDependencies
from inspect_flow.api import run
spec = FlowSpec(
    execution_type="venv",
    log_dir="logs/",
    dependencies=FlowDependencies(
        dependency_file="path/to/inspect_evals/packages/mle_bench/pyproject.toml",
    ),
    tasks=[FlowTask(name="inspect_evals/mle_bench", model="openai/gpt-5-nano")],
)
run(spec)  # blocking; builds isolated venv on first run, cached thereafter
After the run completes, read and analyse logs in the same notebook; the eval’s heavy dependencies are not required for log reading.
Running evaluations
After installing via any of the methods above, you can evaluate models. The examples below assume the repository approach; drop uv run if you are managing dependencies yourself.
uv run inspect eval inspect_evals/mle_bench --model openai/gpt-5-nano
uv run inspect eval inspect_evals/mle_bench_full --model openai/gpt-5-nano
uv run inspect eval inspect_evals/mle_bench_lite --model openai/gpt-5-nano
After running evaluations, you can view their logs using the inspect view command:
uv run inspect view
For VS Code, you can also download the Inspect AI extension for viewing logs.
If you don’t want to specify the --model each time you run an evaluation, create a .env configuration file in your working directory that defines the INSPECT_EVAL_MODEL environment variable along with your API key. For example:
INSPECT_EVAL_MODEL=anthropic/claude-opus-4-1-20250805
ANTHROPIC_API_KEY=<anthropic-api-key>
Note: When first running the mle_bench task, it will build the necessary Docker image. This takes about half an hour and creates a ~25GB image.
The dataset is sourced from the Kaggle API. You will need to ensure that you have downloaded your Kaggle credentials (kaggle.json) and placed it in the ~/.kaggle/ directory (this is the default location where the Kaggle API looks for your credentials).
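A quick pre-flight check can catch a missing or malformed credentials file before a long run starts. The following is a minimal sketch, not part of this eval: the function name check_kaggle_credentials is hypothetical, and it assumes the standard kaggle.json layout with "username" and "key" fields.

```python
import json
import stat
from pathlib import Path


def check_kaggle_credentials(
    path: Path = Path.home() / ".kaggle" / "kaggle.json",
) -> list[str]:
    """Return a list of problems found with the credentials file (empty if none)."""
    if not path.is_file():
        return [f"{path} does not exist"]
    try:
        creds = json.loads(path.read_text())
    except json.JSONDecodeError:
        return [f"{path} is not valid JSON"]
    if not isinstance(creds, dict):
        return [f"{path} does not contain a JSON object"]

    problems = []
    # Assumption: the file holds the usual "username" and "key" fields.
    for field in ("username", "key"):
        if not creds.get(field):
            problems.append(f"missing '{field}' field")
    # The Kaggle client warns when the file is readable by other users.
    if path.stat().st_mode & (stat.S_IRGRP | stat.S_IROTH):
        problems.append("file is readable by other users; consider chmod 600")
    return problems


if __name__ == "__main__":
    for problem in check_kaggle_credentials():
        print("kaggle.json:", problem)
```

An empty list means the file looks usable; it does not confirm that the key itself is still valid on Kaggle.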
For each competition, the dataset will be downloaded and prepared automatically. You will need to manually join each competition by clicking the “Late Submission” button on the Kaggle competition page. We strongly recommend accepting the rules for all competitions you intend to run before starting the evaluation — see Competition Rules Acceptance below.
Competition Rules Acceptance
Before running evaluations, you must accept the competition rules on Kaggle for each competition in your chosen split. You can optionally skip the runtime TOS verification with -T skip_tos_check=true.
Full benchmark (all.txt) — 73 competitions
- 3d-object-detection-for-autonomous-vehicles
- AI4Code
- aerial-cactus-identification
- alaska2-image-steganalysis
- aptos2019-blindness-detection
- billion-word-imputation
- bms-molecular-translation
- cassava-leaf-disease-classification
- cdiscount-image-classification-challenge
- chaii-hindi-and-tamil-question-answering
- champs-scalar-coupling
- denoising-dirty-documents
- dog-breed-identification
- dogs-vs-cats-redux-kernels-edition
- facebook-recruiting-iii-keyword-extraction
- freesound-audio-tagging-2019
- google-quest-challenge
- google-research-identify-contrails-reduce-global-warming
- h-and-m-personalized-fashion-recommendations
- herbarium-2020-fgvc7
- herbarium-2021-fgvc8
- herbarium-2022-fgvc9
- histopathologic-cancer-detection
- hms-harmful-brain-activity-classification
- hotel-id-2021-fgvc8
- hubmap-kidney-segmentation
- icecube-neutrinos-in-deep-ice
- imet-2020-fgvc7
- inaturalist-2019-fgvc6
- iwildcam-2019-fgvc6
- iwildcam-2020-fgvc7
- jigsaw-toxic-comment-classification-challenge
- jigsaw-unintended-bias-in-toxicity-classification
- kuzushiji-recognition
- leaf-classification
- learning-agency-lab-automated-essay-scoring-2
- lmsys-chatbot-arena
- mlsp-2013-birds
- multi-modal-gesture-recognition
- new-york-city-taxi-fare-prediction
- nfl-player-contact-detection
- nomad2018-predict-transparent-conductors
- osic-pulmonary-fibrosis-progression
- petfinder-pawpularity-score
- plant-pathology-2020-fgvc7
- plant-pathology-2021-fgvc8
- predict-volcanic-eruptions-ingv-oe
- random-acts-of-pizza
- ranzcr-clip-catheter-line-classification
- rsna-2022-cervical-spine-fracture-detection
- rsna-breast-cancer-detection
- rsna-miccai-brain-tumor-radiogenomic-classification
- seti-breakthrough-listen
- siim-covid19-detection
- siim-isic-melanoma-classification
- smartphone-decimeter-2022
- spooky-author-identification
- stanford-covid-vaccine
- statoil-iceberg-classifier-challenge
- tabular-playground-series-dec-2021
- tabular-playground-series-may-2022
- tensorflow-speech-recognition-challenge
- tensorflow2-question-answering
- text-normalization-challenge-english-language
- text-normalization-challenge-russian-language
- tgs-salt-identification-challenge
- tweet-sentiment-extraction
- us-patent-phrase-to-phrase-matching
- uw-madison-gi-tract-image-segmentation
- ventilator-pressure-prediction
- vesuvius-challenge-ink-detection
- vinbigdata-chest-xray-abnormalities-detection
- whale-categorization-playground
Note: The competitions detecting-insults-in-social-commentary and the-icml-2013-whale-challenge-right-whale-redux are excluded because their rules can no longer be accepted on Kaggle.
Lite benchmark (low.txt) — 20 competitions
- aerial-cactus-identification
- aptos2019-blindness-detection
- denoising-dirty-documents
- dog-breed-identification
- dogs-vs-cats-redux-kernels-edition
- histopathologic-cancer-detection
- jigsaw-toxic-comment-classification-challenge
- leaf-classification
- mlsp-2013-birds
- new-york-city-taxi-fare-prediction
- nomad2018-predict-transparent-conductors
- plant-pathology-2020-fgvc7
- random-acts-of-pizza
- ranzcr-clip-catheter-line-classification
- siim-isic-melanoma-classification
- spooky-author-identification
- tabular-playground-series-dec-2021
- tabular-playground-series-may-2022
- text-normalization-challenge-english-language
- text-normalization-challenge-russian-language
Note: The competitions detecting-insults-in-social-commentary and the-icml-2013-whale-challenge-right-whale-redux are also in this split file but are excluded at runtime because their rules can no longer be accepted on Kaggle.
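To work out which rules pages still need accepting for a given split, the split file can be read directly. This is a rough sketch rather than the eval's own loader: it assumes one competition id per line, the helper names are hypothetical, and the rules URL pattern is an assumption about Kaggle's site layout.

```python
from pathlib import Path

# The two competitions the notes above describe as excluded.
EXCLUDED = {
    "detecting-insults-in-social-commentary",
    "the-icml-2013-whale-challenge-right-whale-redux",
}


def competitions_to_accept(split_file: Path) -> list[str]:
    """Read one competition id per line, skipping blanks and excluded ids."""
    ids = [line.strip() for line in split_file.read_text().splitlines()]
    return [cid for cid in ids if cid and cid not in EXCLUDED]


def rules_url(competition_id: str) -> str:
    """Build the rules page URL (assumed pattern, not taken from this eval)."""
    return f"https://www.kaggle.com/competitions/{competition_id}/rules"


if __name__ == "__main__":
    # Replace with the real location of low.txt or all.txt in the splits directory.
    split = Path("low.txt")
    if split.is_file():
        for cid in competitions_to_accept(split):
            print(rules_url(cid))
```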
Options
You can control a variety of options from the command line. For example:
uv run inspect eval inspect_evals/mle_bench --limit 10
uv run inspect eval inspect_evals/mle_bench_full --max-connections 10
uv run inspect eval inspect_evals/mle_bench_lite --temperature 0.5
See uv run inspect eval --help for all available options.
Parameters
mle_bench
- split (pathlib.Path | str): The split to use. This should be the name of a .txt file within the splits directory, or a path to a custom split file. low.txt corresponds to MLE-bench-LITE, while all.txt is the full benchmark. The default is a split containing only the spaceship-titanic competition. (default: 'spaceship-titanic.txt')
- solver (Solver | None): The solver to use when creating the task. If None, uses the default solver. (default: None)
- max_messages (int): The maximum number of messages to allow for each sample. Only applies when using the default solver. (default: 30)
- scorer (Scorer | list[Scorer] | None): The scorer to use when evaluating mle_bench. If None, uses the default scorer which returns boolean values for the criteria of “valid_submission”, “above_median” and “any_medal”. (default: None)
- epochs (int): Number of times to repeat each sample. (default: 1)
- docker_image_name (str | None): The name of the docker image to be used for each sandbox. If such an image does not already exist, it will be built from inspect_evals/mle_bench/Dockerfile. (default: None)
- force_rebuild (bool): Whether to rebuild when an image already exists with the name docker_image_name. (default: False)
- data_dir (pathlib.Path | str | None): The directory that will be used for per competition data. If it does not exist, the data will be prepared here. If None, a directory in the user’s cache will be used by default. (default: None)
- cpu_override (int): Number of CPUs to restrict the sandbox to. (default: 1)
- mem_override (int | None): Amount of RAM (in GB) to restrict the sandbox to. (default: None)
- gpu_driver (str | None): GPU driver to use, e.g. “nvidia”. Requires the NVIDIA Container Toolkit (or equivalent) on the host. (default: None)
- gpu_count (int | None): Number of GPUs to reserve for the sandbox. (default: None)
- compose_overrides (dict[Any, Any] | None): Docker Compose service-level keys to deep-merge into the generated sandbox config. Use this to set any option not exposed as a dedicated parameter, e.g. {"shm_size": "2gb"} to increase shared memory, or {"deploy": {"resources": {"reservations": {"devices": [{"driver": "nvidia", "device_ids": ["0"], "capabilities": ["gpu"]}]}}}} to pin the container to a specific GPU device ID. Overrides are applied after all other config is built, so they take precedence. (default: None)
- skip_tos_check (bool): Skip the upfront Kaggle competition rules (TOS) verification. (default: False)
- sandbox_type (str): The sandbox provider to use (e.g., “docker”, “k8s”, or a custom registered provider). (default: 'docker')
- sandbox_config (Optional[Callable[[str, inspect_ai.dataset.Sample], SandboxEnvironmentSpec]]): Optional custom function to create sandbox specs. Receives (sandbox_type, sample) and returns a SandboxEnvironmentSpec. If None, uses default Docker compose file generation. (default: None)
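The deep-merge behaviour described for compose_overrides can be illustrated with a small sketch. This is not the eval's actual implementation: the deep_merge helper, the example service config, and the image name are hypothetical, and the choice to replace lists wholesale is an assumption of this sketch.

```python
from copy import deepcopy
from typing import Any


def deep_merge(base: dict[str, Any], overrides: dict[str, Any]) -> dict[str, Any]:
    """Return a copy of `base` with `overrides` merged in recursively."""
    merged = deepcopy(base)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            # Both sides are mappings: merge key by key.
            merged[key] = deep_merge(merged[key], value)
        else:
            # Scalars and lists are replaced wholesale (assumption of this sketch),
            # so the override wins over any generated value.
            merged[key] = deepcopy(value)
    return merged


# Hypothetical generated service config; the real one is built by the eval.
service = {
    "image": "example-mle-bench-image",
    "deploy": {"resources": {"limits": {"cpus": "1"}}},
}
overrides = {
    "shm_size": "2gb",
    "deploy": {
        "resources": {
            "reservations": {
                "devices": [
                    {"driver": "nvidia", "device_ids": ["0"], "capabilities": ["gpu"]}
                ]
            }
        }
    },
}

merged = deep_merge(service, overrides)
# The CPU limit survives alongside the new GPU reservation and shm_size.
print(merged)
```

Merging recursively, rather than replacing top-level keys, is what lets an override add a GPU reservation under deploy without discarding limits that were already generated there.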
mle_bench_full
- solver (Solver | None): The solver to use when creating the task. If None, uses the default solver. (default: None)
- max_messages (int): The maximum number of messages to allow for each sample. Only applies when using the default solver. (default: 30)
- scorer (Scorer | list[Scorer] | None): The scorer to use when evaluating mle_bench. If None, uses the default scorer which returns boolean values for the criteria of “valid_submission”, “above_median” and “any_medal”. (default: None)
- epochs (int): Number of times to repeat each sample. (default: 1)
- docker_image_name (str | None): The name of the docker image to be used for each sandbox. If such an image does not already exist, it will be built from inspect_evals/mle_bench/Dockerfile. (default: None)
- force_rebuild (bool): Whether to rebuild when an image already exists with the name docker_image_name. (default: False)
- data_dir (pathlib.Path | str | None): The directory that will be used for per competition data. If it does not exist, the data will be prepared here. If None, a directory in the user’s cache will be used by default. (default: None)
- cpu_override (int): Number of CPUs to restrict the sandbox to. (default: 1)
- mem_override (int | None): Amount of RAM (in GB) to restrict the sandbox to. (default: None)
- gpu_driver (str | None): GPU driver to use, e.g. “nvidia”. Requires the NVIDIA Container Toolkit (or equivalent) on the host. (default: None)
- gpu_count (int | None): Number of GPUs to reserve for the sandbox. (default: None)
- compose_overrides (dict[Any, Any] | None): Docker Compose service-level keys to deep-merge into the generated sandbox config. See mle_bench for details. (default: None)
- skip_tos_check (bool): Skip the upfront Kaggle competition rules (TOS) verification. (default: False)
- sandbox_type (str): The sandbox provider to use (e.g., “docker”, “k8s”, or a custom registered provider). (default: 'docker')
- sandbox_config (Optional[Callable[[str, inspect_ai.dataset.Sample], SandboxEnvironmentSpec]]): Optional custom function to create sandbox specs. See mle_bench for details. (default: None)
mle_bench_lite
- solver (Solver | None): The solver to use when creating the task. If None, uses the default solver. (default: None)
- max_messages (int): The maximum number of messages to allow for each sample. Only applies when using the default solver. (default: 30)
- scorer (Scorer | list[Scorer] | None): The scorer to use when evaluating mle_bench. If None, uses the default scorer which returns boolean values for the criteria of “valid_submission”, “above_median” and “any_medal”. (default: None)
- epochs (int): Number of times to repeat each sample. (default: 1)
- docker_image_name (str | None): The name of the docker image to be used for each sandbox. If such an image does not already exist, it will be built from inspect_evals/mle_bench/Dockerfile. (default: None)
- force_rebuild (bool): Whether to rebuild when an image already exists with the name docker_image_name. (default: False)
- data_dir (pathlib.Path | str | None): The directory that will be used for per competition data. If it does not exist, the data will be prepared here. If None, a directory in the user’s cache will be used by default. (default: None)
- cpu_override (int): Number of CPUs to restrict the sandbox to. (default: 1)
- mem_override (int | None): Amount of RAM (in GB) to restrict the sandbox to. (default: None)
- gpu_driver (str | None): GPU driver to use, e.g. “nvidia”. Requires the NVIDIA Container Toolkit (or equivalent) on the host. (default: None)
- gpu_count (int | None): Number of GPUs to reserve for the sandbox. (default: None)
- compose_overrides (dict[Any, Any] | None): Docker Compose service-level keys to deep-merge into the generated sandbox config. See mle_bench for details. (default: None)
- skip_tos_check (bool): Skip the upfront Kaggle competition rules (TOS) verification. (default: False)
- sandbox_type (str): The sandbox provider to use (e.g., “docker”, “k8s”, or a custom registered provider). (default: 'docker')
- sandbox_config (Optional[Callable[[str, inspect_ai.dataset.Sample], SandboxEnvironmentSpec]]): Optional custom function to create sandbox specs. See mle_bench for details. (default: None)
Splits
A split is a .txt file containing a list of competition ids. Splits can be found in the splits directory and are run as follows:
inspect eval inspect_evals/mle_bench -T split="spaceship-titanic.txt"
inspect eval inspect_evals/mle_bench -T split="low.txt"
inspect eval inspect_evals/mle_bench -T split="all.txt"
Information about the splits can be found at MLE-bench.
Notes:
- The full dataset is ~3.3TB in size and will take ~2 days to download and prepare from scratch.
- The lite dataset (split = “low.txt”) is ~158GB.
- You are not required to use the full datasets and can just download the competitions you are interested in. “spaceship-titanic.txt” and “dev.txt” are both useful for testing and development purposes, and you can pass custom splits as well.
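A custom split can be produced with a few lines of Python. The sketch below assumes the file format is simply one competition id per line, which matches how the splits are described here but is not verified against the loader; write_split and the file name my-split.txt are hypothetical.

```python
from pathlib import Path


def write_split(path: Path, competition_ids: list[str]) -> Path:
    """Write one competition id per line, dropping blanks and duplicates in order."""
    cleaned = (cid.strip() for cid in competition_ids)
    unique = list(dict.fromkeys(cid for cid in cleaned if cid))
    path.write_text("\n".join(unique) + "\n")
    return path


if __name__ == "__main__":
    split = write_split(
        Path("my-split.txt"),
        ["leaf-classification", "random-acts-of-pizza", "leaf-classification"],
    )
    print(split.read_text())
```

Since the split parameter accepts a path to a custom split file, the result could then be passed as -T split=/path/to/my-split.txt.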
Dataset
The dataset is sourced from Kaggle competitions. MLE-bench includes:
- 75 curated ML engineering competitions from Kaggle
- Tasks covering tabular, vision, NLP, and other domains
- Public and private test sets with standardized scoring
Scoring
The scorer returns boolean values for three criteria:
- valid_submission: Whether the agent produced a valid submission
- above_median: Whether the submission scored above the median
- any_medal: Whether the submission achieved any medal (bronze, silver, or gold)
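When several competitions are run, the three booleans are usually summarised as rates. The following sketch shows one way to do that over plain dictionaries; it does not read Inspect log files, and the criterion_rates helper and the sample data are hypothetical.

```python
from collections.abc import Iterable, Mapping

CRITERIA = ("valid_submission", "above_median", "any_medal")


def criterion_rates(results: Iterable[Mapping[str, bool]]) -> dict[str, float]:
    """Return the fraction of samples for which each criterion is True."""
    rows = list(results)
    if not rows:
        return {name: 0.0 for name in CRITERIA}
    return {
        # A missing criterion is counted as False.
        name: sum(bool(row.get(name, False)) for row in rows) / len(rows)
        for name in CRITERIA
    }


# Made-up per-competition results, for illustration only.
example = [
    {"valid_submission": True, "above_median": True, "any_medal": False},
    {"valid_submission": True, "above_median": False, "any_medal": False},
]
rates = criterion_rates(example)
print(rates)
```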
Example Invocation with compose_overrides
compose_overrides is useful for supplying specific docker compose file config:
uv run --group mle_bench inspect eval inspect_evals/mle_bench \
  -T gpu_driver=nvidia \
  -T skip_tos_check=true \
  -T 'compose_overrides={"shm_size": "4gb", "deploy": {"resources": {"reservations": {"devices": [{"driver": "nvidia", "device_ids": ["1"], "capabilities": ["gpu"]}]}}}}' \
  --model openai/gpt-4o-mini \
  --log-level info
Changelog
[7-F] - 2026-06-11
- Added sandbox_type and sandbox_config parameters for provider flexibility.
[7-E] - 2026-04-21
- Replaced deprecated basic_agent() with react() as the default agent. This fixes a pathological case where basic_agent() would loop on repeated content_filter stops until hitting message_limit. It also changes the stall-nudge message (sent when the model stops without calling a tool) to append a reminder to call the submit tool, a one-sentence prompt addition that may bias models toward earlier submission.
[6-E] - 2026-04-09
- Fix compose file collision: filenames now include a hash of the sandbox config, so concurrent inspect invocations with different GPU/CPU settings no longer overwrite each other’s compose files.
- Add compose_overrides task parameter, for supplying extra docker compose configuration.
[6-D] - 2026-03-24
- Update pinned upstream commit to 2451bcb.
[5-D] - 2026-03-24
- Add skip_tos_check parameter to skip upfront TOS verification.
[5-C] - 2026-03-23
- Add gpu_driver and gpu_count parameters to reserve GPU devices in the sandbox container, e.g. -T gpu_driver=nvidia -T gpu_count=2.
[5-B] - 2026-03-18
- Pin Dockerfile mle-bench clone to match pyproject.toml rev (1d391b0).
[4-B] - 2026-02-26
- Bump per-sample timeout to 6 minutes.
- Add configurable CPU and memory.
- Make dataset downloading lazy.
[3-A] - 2026-02-19
- Bugfix: Skip the the-icml-2013-whale-challenge-right-whale-redux competition due to being unavailable.
[2-A] - 2026-02-16
- Migrate version to new scheme. See #907.
[2.0.1] - 2026-02-12
- Bugfix: Ensure grading conda env is correctly setup and used.
[2.0.0] - 2026-02-10
- Bugfix: Skip the detecting-insults-in-social-commentary competition due to being unavailable.