MLRC-Bench: Can Language Agents Solve Machine Learning Research Challenges?
This benchmark evaluates LLM-based research agents on their ability to propose and implement novel methods using tasks from recent ML conference competitions, assessing both novelty and effectiveness compared to a baseline and top human solutions.
Overview
This is an Inspect AI implementation of MLRC-Bench, a benchmark for evaluating whether AI agents can solve open-ended machine learning research challenges by proposing and implementing novel methods that improve on a provided baseline.
Usage
Installation
There are two ways of using Inspect Evals: from PyPI as a dependency of your own project, or as a standalone checked-out GitHub repository.
If you are using it from PyPI, install the package and its dependencies via:

```bash
pip install inspect-evals
```

If you are using Inspect Evals in its repository, start by installing the necessary dependencies with:

```bash
uv sync
```

Running evaluations
Now you can start evaluating models. For simplicity’s sake, this section assumes you are using Inspect Evals from the standalone repo. If that’s not the case and you are not using uv to manage dependencies in your own project, you can use the same commands with the `uv run` prefix dropped.
```bash
uv run inspect eval inspect_evals/mlrc_bench --model openai/gpt-5-nano
```

You can also import tasks as normal Python objects and run them from Python:
```python
from inspect_ai import eval
from inspect_evals.mlrc_bench import mlrc_bench

eval(mlrc_bench)
```

After running evaluations, you can view their logs using the `inspect view` command:
```bash
uv run inspect view
```

For VS Code, you can also install the Inspect AI extension for viewing logs.
If you don’t want to specify the --model each time you run an evaluation, create a .env configuration file in your working directory that defines the INSPECT_EVAL_MODEL environment variable along with your API key. For example:
```
INSPECT_EVAL_MODEL=anthropic/claude-opus-4-1-20250805
ANTHROPIC_API_KEY=<anthropic-api-key>
```

Helper models
This eval uses Model Roles for its helper models.
Roles used
- `edit_script`: Model used by the “Edit Script (AI)” tool (default: `openai/o4-mini`).
- `fast`: Model used for lightweight helper tools such as reflection and file understanding (default: `openai/gpt-4o-mini`).
Example override of the default helper models:

```bash
uv run inspect eval inspect_evals/mlrc_bench \
  --model openai/gpt-5-nano \
  --model-role edit_script=openai/o4-mini \
  --model-role fast=openai/gpt-4o-mini
```
Running product-recommendation task
The product-rec task requires a dataset downloaded from AIcrowd. This is done automatically during the Docker image build, but first you need to create an AIcrowd account and set the AICROWD_API_KEY environment variable to your account’s API key.
Running machine-unlearning task
The machine-unlearning task requires access to Kaggle to retrieve evaluation results. To set this up, navigate to your Kaggle account settings page and scroll to the “Legacy API Credentials” section. Click “Create Legacy API Key” to generate a kaggle.json file, then store it at ~/.kaggle/kaggle.json. You can also specify a custom path to this file by using the kaggle_json_path argument.
In addition, you must manually accept the competition terms and conditions on Kaggle. Most importantly, running the test will push a notebook to Kaggle, and you must submit it manually. After you receive the Kaggle score, compute the final metrics absolute_improvement_to_baseline and relative_improvement_to_human using the following formulas:
- `relative_improvement_to_human`: \(\frac{s_{\text{agent}} - s_{\text{baseline}}}{s_{\text{top human}} - s_{\text{baseline}}} \times 100\)
- `absolute_improvement_to_baseline`: \(\frac{s_{\text{agent}} - s_{\text{baseline}}}{s_{\text{baseline}}} \times 100\)
Where:
- \(s_{\text{agent}}\): the agent’s test performance as reported by Kaggle.
- \(s_{\text{baseline}}\): the baseline test performance (0.06085605833).
- \(s_{\text{top human}}\): the top human test performance in the competition (0.0984971060).
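As a sketch, the manual computation can be done as follows. The baseline and top-human constants are those listed above; the agent score is a placeholder for whatever value Kaggle reports:

```python
# Constants for the machine-unlearning task, as listed above.
S_BASELINE = 0.06085605833
S_TOP_HUMAN = 0.0984971060

s_agent = 0.08  # hypothetical Kaggle score; substitute the real one

absolute_improvement_to_baseline = (s_agent - S_BASELINE) / S_BASELINE * 100
relative_improvement_to_human = (
    (s_agent - S_BASELINE) / (S_TOP_HUMAN - S_BASELINE) * 100
)

print(f"absolute_improvement_to_baseline = {absolute_improvement_to_baseline:.2f}")
print(f"relative_improvement_to_human = {relative_improvement_to_human:.2f}")
```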
Running llm-merging task
The llm-merging task requires access to Kaggle to retrieve evaluation results. To set this up, navigate to your Kaggle account settings page and scroll to the “Legacy API Credentials” section. Click “Create Legacy API Key” to generate a kaggle.json file, then store it at ~/.kaggle/kaggle.json. You can also specify a custom path to this file by using the kaggle_json_path argument. Additionally, you must accept the competition rules. Go to the llm-merging competition page, click the Late Submission button, and accept the rules.
It also requires access to the Hugging Face API to download a model checkpoint. Set the HF_AUTH_TOKEN environment variable to your account’s API key. Additionally, it requires access to the Meta Llama 3 collection. You can request access here.
The weather-forecast task
The weather-forecast task requires a dataset (normally downloaded automatically) that was unavailable at the time the code was implemented. Running this task therefore results in an error.
This task is filtered out of the evaluation by default. Changing this requires code edits in the dataset.py file, as the original evaluation authors are not expected to rectify the issue.
Options
You can control a variety of options from the command line. For example:
```bash
uv run inspect eval inspect_evals/mlrc_bench --limit 10
uv run inspect eval inspect_evals/mlrc_bench --max-connections 10
uv run inspect eval inspect_evals/mlrc_bench --temperature 0.5
```

See `uv run inspect eval --help` for all available options.
Parameters
mlrc_bench
- `task_names` (`Union[list[Literal['llm-merging', 'backdoor-trigger', 'temporal-action-loc', 'machine-unlearning', 'meta-learning', 'product-rec', 'weather-forecast']], Literal['llm-merging', 'backdoor-trigger', 'temporal-action-loc', 'machine-unlearning', 'meta-learning', 'product-rec', 'weather-forecast'], None]`): Optional allowlist of task IDs (either a single task ID or a list). Available tasks: llm-merging, backdoor-trigger, temporal-action-loc, machine-unlearning, meta-learning, product-rec, weather-forecast. (default: `None`)
- `exclude_task_names` (`Union[list[Literal['llm-merging', 'backdoor-trigger', 'temporal-action-loc', 'machine-unlearning', 'meta-learning', 'product-rec', 'weather-forecast']], Literal['llm-merging', 'backdoor-trigger', 'temporal-action-loc', 'machine-unlearning', 'meta-learning', 'product-rec', 'weather-forecast'], None]`): Optional blocklist of task IDs (either a single task ID or a list). (default: `None`)
- `limit` (`int`): Limit number of tasks loaded into the dataset (0 = no limit). (default: `0`)
- `force_download` (`bool`): Re-download and re-extract the upstream benchmark repository. (default: `False`)
- `edit_script_max_tokens` (`int`): Max tokens for the “edit script” high-level action. (default: `4000`)
- `max_time_in_hours` (`float`): Inspect task `time_limit` in hours. (default: `5.0`)
- `message_limit` (`int | None`): Inspect task `message_limit` (`None` disables message limiting). (default: `50`)
- `kaggle_json_path` (`str`): Path to Kaggle API credentials JSON file (typically `~/.kaggle/kaggle.json`). (default: `'~/.kaggle/kaggle.json'`)
Dataset
MLRC-Bench is a benchmark designed to evaluate how effectively AI agents can solve open-ended Machine Learning research problems that require genuine methodological novelty. Unlike other benchmarks that focus on standard coding tasks or Kaggle-style engineering, MLRC-Bench focuses on the “research” aspect: proposing and implementing new methods to improve upon a baseline.
Source: The tasks are curated from 7 recent ML conference competitions (e.g., NeurIPS, KDD, ECCV) covering diverse areas like LLM safety, computer vision, and AI for science.
Goal: Agents are given a research problem, a baseline code repository, and a literature background. They must propose a novel idea, implement it, and achieve a performance improvement over the baseline.
Evaluation: Performance is measured objectively against a reliable baseline and the top human solution from the actual competition. The benchmark also evaluates the efficiency and simplicity of the solution, not just raw accuracy.
Scoring
This evaluation uses the upstream MLRC-Bench harness to run each task’s test phase and then reports the resulting objective metrics. Concretely, after the agent writes an implementation, the scorer runs the upstream command:
```bash
python main.py -m <method_name> -p test
```
and reads the produced output/idea_evals.json. The scorer then compares the implementation against the task’s provided baseline and top human constants and returns a structured set of metrics.
The following fields are reported per task run (i.e., per sample):
- `performance` (\(s_{\text{agent}}\)): The task’s test performance metric for the implemented method. (The metric is task-specific; depending on the task it may be defined such that higher is better or lower is better.)
- `relative_improvement_to_human`: Measures how much of the performance gap between baseline and human the agent has closed: \(\frac{s_{\text{agent}} - s_{\text{baseline}}}{s_{\text{top human}} - s_{\text{baseline}}} \times 100\)
- `absolute_improvement_to_baseline`: Measures the percentage improvement over the baseline performance: \(\frac{s_{\text{agent}} - s_{\text{baseline}}}{s_{\text{baseline}}} \times 100\)
Where:
- \(s_{\text{baseline}}\): the baseline test performance.
- \(s_{\text{top human}}\): the top human test performance in the competition.
- `runtime`: The test runtime of the agent’s method.
- `relative_runtime`: Percent change in runtime vs. the baseline runtime: \(\frac{\text{runtime} - \text{baseline runtime}}{\text{baseline runtime}} \times 100\)
- `method_complexity`: The method implementation complexity.
- `baseline_complexity`: The baseline method complexity.
- `relative_complexity`: Percent change in complexity vs. baseline complexity: \(\frac{\text{method complexity} - \text{baseline complexity}}{\text{baseline complexity}} \times 100\)
- `error`: Indicates whether the evaluation encountered an error. The error usually stems from the agent’s inability to evaluate the method on the validation split. This failure prevents the generation of a results file, which the benchmark requires to identify the method name before proceeding to the final test split. Alternatively, the issue may originate from a bug within the original benchmark code itself. `1` means an error occurred; `0` means the evaluation completed successfully.
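As an illustration of these formulas, here is a minimal sketch; all numbers are hypothetical (the real values come from the task’s constants and the upstream harness output):

```python
# Hypothetical example values.
s_agent, s_baseline, s_top_human = 0.62, 0.50, 0.70
runtime, baseline_runtime = 120.0, 100.0
method_complexity, baseline_complexity = 450, 300

relative_improvement_to_human = (s_agent - s_baseline) / (s_top_human - s_baseline) * 100
absolute_improvement_to_baseline = (s_agent - s_baseline) / s_baseline * 100
relative_runtime = (runtime - baseline_runtime) / baseline_runtime * 100
relative_complexity = (method_complexity - baseline_complexity) / baseline_complexity * 100

# relative_improvement_to_human ≈ 60.0  (agent closed 60% of the human gap)
# absolute_improvement_to_baseline ≈ 24.0
# relative_runtime = 20.0  (20% slower than the baseline)
# relative_complexity = 50.0
```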
The benchmark-level summary metrics include only relative_improvement_to_human and absolute_improvement_to_baseline, which are the main benchmark metrics. They are averaged across all tasks (i.e., samples), except for the machine-unlearning task, because its score requires a manual submission step. If an evaluation run errors, it is treated as an agent failure and relative_improvement_to_human and absolute_improvement_to_baseline are set to 0.0, which lowers the overall average.
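The aggregation described above can be sketched as follows; the per-sample values are hypothetical:

```python
# Hypothetical per-sample results; "error" mirrors the scorer's error flag.
samples = [
    {"task": "llm-merging", "rel_human": 1.5, "abs_baseline": 3.0, "error": 0},
    {"task": "meta-learning", "rel_human": -2.0, "abs_baseline": -6.0, "error": 0},
    {"task": "temporal-action-loc", "rel_human": None, "abs_baseline": None, "error": 1},
    {"task": "machine-unlearning", "rel_human": 0.5, "abs_baseline": 1.0, "error": 0},
]

# machine-unlearning is excluded (manual submission step);
# errored runs count as 0.0, which lowers the average.
included = [s for s in samples if s["task"] != "machine-unlearning"]
rel_human_avg = sum(
    0.0 if s["error"] else s["rel_human"] for s in included
) / len(included)
abs_baseline_avg = sum(
    0.0 if s["error"] else s["abs_baseline"] for s in included
) / len(included)
```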
GPU requirements
This benchmark requires a GPU with either 16GB of VRAM (for temporal-action-loc, machine-unlearning, meta-learning, and product-rec) or 48GB of VRAM (for llm-merging, backdoor-trigger, and weather-forecast), depending on the task.
Evaluation Report
Generally, agents struggle to produce novel, viable methods within the prescribed message limit. Reaching the stage of obtaining numerical results on the test split is challenging, as the agent must first successfully execute the evaluation script on the validation split at least once. This execution generates a results file from which the benchmark infers the method for final evaluation on the test split — a process that occurs without further agent involvement.
In cases where a results file is produced, the outcomes often reflect the baseline method (which is available for each task), as the benchmark defaults to whichever method the agent tested last. Since the original paper reports the best approach across eight trials, it is plausible that some models completed the task correctly by chance in a single trial.
To verify that my implementation aligns with the original, I ran the original source code twice on the temporal-action-loc task; the inputs and agent behavior closely matched those in my implementation.
Below are the best results I obtained while evaluating openai/gpt-4o. These can be compared directly with the results on the MLRC-Bench leaderboard (MLAB (gpt-4o)).
Relative Improvement to Human
| Agent | Temporal Action Localisation | LLM Merging | Next Product Recommendation | Rainfall Prediction | Meta Learning | Machine Unlearning | Backdoor Trigger Recovery |
|---|---|---|---|---|---|---|---|
| MLAB (gpt-4o) | -0.8 | 2.0 | -0.9 | - | -2.9 | -3.4 | -0.3 |
Absolute Improvement to Baseline
| Agent | Temporal Action Localisation | LLM Merging | Next Product Recommendation | Rainfall Prediction | Meta Learning | Machine Unlearning | Backdoor Trigger Recovery |
|---|---|---|---|---|---|---|---|
| MLAB (gpt-4o) | -2.1 | 1.4 | -3.9 | - | -8.9 | -2.1 | -1.8 |
Notes
Updates to the original task-specific environments
The environments and scripts required fixes due to errors encountered during the build or execution phases. The corrected files are located in the orig_benchmark_corrections folder within the MLRC-Bench directory. Below is the list of changes:
llm-merging:
- `environment.yml`: Created a minimal `environment.yml` file, as the original appeared to be a full environment export and contained hard package conflicts.
- `BaseMethod.py`: Ensured the correct Hugging Face token is used for data downloads.
- `evaluation.py`: Added retries for Kaggle submissions and a polling mechanism for results to improve robustness.
machine-unlearning:
- `environment.yml`: Fixed the formatting of the `.yml` file.
- `evaluation.py`: Modified the kernel pushed to Kaggle to remove underscores (`_`), as Kaggle was not handling them correctly.
meta-learning:
- `environment.yml`: Switched to pre-built PyTorch wheels to avoid compilation errors and ensured PyTorch is installed first. Adjusted the Jax version for compatibility.
product-rec:
- `prepare.py`: Ensured the API key is used for authenticating with the AIcrowd CLI.
weather-forecast:
- `environment.yml`: Configured pip to fetch PyTorch CUDA wheels from the official PyTorch index.
Helper models
The original implementation used o1-mini to edit the scripts (the edit_script role model); however, that model is now deprecated. This version uses o4-mini instead.
Snapshots
In MLRC-Bench, the agent can iteratively modify and test the codebase multiple times on the development set. The original implementation stores a snapshot of the agent’s code after every change. Furthermore, whenever an execution occurs on the development set, the resulting metrics and the name of the method evaluated for that specific snapshot are recorded. Upon task completion, the snapshot that achieved the best development performance is selected for evaluation on the test set to determine the final results.
In the current implementation, however, snapshots are not stored due to the complexity of implementing them, and the method evaluated last is used for the test-set evaluation. In practice, this change does not appear to significantly influence benchmark outcomes; the agent typically submits its answer immediately after implementing its first approach (a behavior verified using gpt-4o and o3-mini).
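The difference between the two selection strategies can be illustrated with a small sketch; the method names and scores are hypothetical:

```python
# Hypothetical snapshot history: each entry records the method evaluated
# on the development set and the score it achieved.
snapshots = [
    {"method": "baseline", "dev_score": 0.52},
    {"method": "idea_v1", "dev_score": 0.57},
    {"method": "idea_v2", "dev_score": 0.55},
]

# Original MLRC-Bench: the snapshot with the best dev performance goes to test.
best = max(snapshots, key=lambda s: s["dev_score"])

# This implementation: the method evaluated last goes to test.
last = snapshots[-1]

print(best["method"])  # idea_v1
print(last["method"])  # idea_v2
```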