Frontier-CS: Benchmarking LLMs on Computer Science Problems

Coding

inspect-evals

Agent

238 open-ended computer science problems spanning algorithmic (172) and research (66) tracks. Problems feature continuous partial scoring, with algorithmic solutions evaluated via compilation and test-case checking, and research solutions evaluated via custom evaluator scripts. Current frontier models score well below human expert baselines, making this a challenging, unsaturated benchmark.

Contributed By

@JayBaileyCS

Code

src/inspect_evals/frontier_cs

Paper

https://arxiv.org/abs/2512.15699

Overview

Frontier-CS is a benchmark of 238 open-ended computer science problems spanning two tracks: algorithmic (172 C++ competitive programming problems) and research (66 Python problems covering GPU kernel optimization, symbolic regression, and more). Problems feature continuous partial scoring — algorithmic solutions are compiled and tested against test cases with custom checkers, while research solutions are evaluated by problem-specific Python scripts. Current frontier models score well below human expert baselines, making this a challenging, unsaturated benchmark.

Usage

Installation

There are two ways of using Inspect Evals, from pypi as a dependency of your own project and as a standalone checked out GitHub repository.

If you are using it from pypi, install the package and its dependencies via:

pip install inspect-evals

If you are using Inspect Evals in its repository, start by installing the necessary dependencies with:

uv sync

Running evaluations

Now you can start evaluating models. For simplicity’s sake, this section assumes you are using Inspect Evals from the standalone repo. If that’s not the case and you are not using uv to manage dependencies in your own project, you can use the same commands with uv run dropped.

uv run inspect eval inspect_evals/frontier_cs --model openai/gpt-5-nano
uv run inspect eval inspect_evals/frontier_cs_algorithmic --model openai/gpt-5-nano
uv run inspect eval inspect_evals/frontier_cs_research --model openai/gpt-5-nano

To run multiple tasks simulteneously use inspect eval-set:

uv run inspect eval-set inspect_evals/frontier_cs inspect_evals/frontier_cs_algorithmic inspect_evals/frontier_cs_research

You can also import tasks as normal Python objects and run them from python:

from inspect_ai import eval, eval_set
from inspect_evals.frontier_cs import frontier_cs, frontier_cs_algorithmic, frontier_cs_research
eval(frontier_cs)
eval_set([frontier_cs, frontier_cs_algorithmic, frontier_cs_research], log_dir='logs-run-42')

After running evaluations, you can view their logs using the inspect view command:

uv run inspect view

For VS Code, you can also download Inspect AI extension for viewing logs.

If you don’t want to specify the --model each time you run an evaluation, create a .env configuration file in your working directory that defines the INSPECT_EVAL_MODEL environment variable along with your API key. For example:

INSPECT_EVAL_MODEL=anthropic/claude-opus-4-1-20250805
ANTHROPIC_API_KEY=<anthropic-api-key>

Options

You can control a variety of options from the command line. For example:

uv run inspect eval inspect_evals/frontier_cs --limit 10
uv run inspect eval inspect_evals/frontier_cs_algorithmic --max-connections 10
uv run inspect eval inspect_evals/frontier_cs_research --temperature 0.5

See uv run inspect eval --help for all available options.

Parameters

`frontier_cs`

track (str): (default: 'all')
solver (Solver | Agent | None): (default: None)
scorer (Scorer | list[Scorer] | None): (default: None)
message_limit (int): (default: 100)
sandbox (str | tuple[str, str] | SandboxEnvironmentSpec): (default: ('docker', 'src/inspect_evals/frontier_cs/compose.yaml'))
agentic (bool): (default: True)
include_gpu_problems (bool): (default: False)
shuffle (bool): (default: False)
seed (int): (default: 42)

`frontier_cs_algorithmic`

solver (Solver | Agent | None): (default: None)
scorer (Scorer | list[Scorer] | None): (default: None)
message_limit (int): (default: 100)
sandbox (str | tuple[str, str] | SandboxEnvironmentSpec): (default: ('docker', 'src/inspect_evals/frontier_cs/compose.yaml'))
agentic (bool): (default: True)
shuffle (bool): (default: False)
seed (int): (default: 42)

`frontier_cs_research`

solver (Solver | Agent | None): (default: None)
scorer (Scorer | list[Scorer] | None): (default: None)
message_limit (int): (default: 100)
sandbox (str | tuple[str, str] | SandboxEnvironmentSpec): (default: ('docker', 'src/inspect_evals/frontier_cs/compose.yaml'))
agentic (bool): (default: True)
include_gpu_problems (bool): (default: False)
shuffle (bool): (default: False)
seed (int): (default: 42)

Prerequisites

This evaluation requires Docker to be installed and running. On first run, a local Docker image is built containing g++, python3, and testlib.h for compiling and running solutions and checkers.

Dataset

The dataset is loaded from HuggingFace (238 problems in the test split). Each problem has a problem_id, category (“algorithmic” or “research”), statement (the problem text), and config (YAML string with time/memory limits and scoring metadata).

Test data (inputs, expected outputs, custom checkers) is downloaded from the FrontierCS GitHub repository and cached locally.

Scoring

Algorithmic track: Solutions are compiled with g++ -O2 -std=gnu++17, run against test cases, and checked with custom testlib-based checkers. Each test case yields a score ratio (0.0-1.0), and the final score is the mean across all test cases.

Research track: Solutions are evaluated by problem-specific Python evaluator scripts. The score is parsed from the evaluator’s stdout and normalized to 0.0-1.0.

Metrics reported: mean and stderr of per-problem scores.

Evaluation Report

Algorithmic Track (43/172 problems, C++)

Model	Provider	Mean Score	Stderr	Time
claude-sonnet-4-5-20250929	Anthropic	0.207	0.056	32m 56s
gpt-5.1-2025-11-13	OpenAI	0.110	0.041	42m 7s

Research Track (43 problems, Python, GPU problems excluded)

Model	Provider	Mean Score	Stderr	Time
claude-sonnet-4-5-20250929	Anthropic	0.194	0.046	55m 25s
gpt-5.1-2025-11-13	OpenAI	0.132	0.043	57m 42s

Reproducibility:

uv run inspect eval inspect_evals/frontier_cs_algorithmic --model anthropic/claude-sonnet-4-5-20250929,openai/gpt-5.1-2025-11-13 --limit 43 --max-tasks 1 -T shuffle=True -T message_limit=100
uv run inspect eval inspect_evals/frontier_cs_research --model anthropic/claude-sonnet-4-5-20250929,openai/gpt-5.1-2025-11-13 --limit 43 --max-tasks 1 -T shuffle=True -T message_limit=100

Single-Turn (Non-Agentic) Results

These results use agentic=False, which matches the paper’s single-turn generation methodology — the model produces a solution in one pass without iterative tool use.

Algorithmic Track (43/172 problems, C++)

Model	Provider	Mean Score	Stderr	Time
claude-sonnet-4-5-20250929	Anthropic	0.119	0.038	2m 4s
gpt-5.1-2025-11-13	OpenAI	0.076	0.031	5m 30s

Research Track (43 problems, Python, GPU problems excluded)

Model	Provider	Mean Score	Stderr	Time
claude-sonnet-4-5-20250929	Anthropic	0.172	0.045	16m 18s
gpt-5.1-2025-11-13	OpenAI	0.082	0.032	16m 15s

Reproducibility:

uv run inspect eval inspect_evals/frontier_cs_algorithmic --model anthropic/claude-sonnet-4-5-20250929,openai/gpt-5.1-2025-11-13 --limit 43 --max-sandboxes 10 -T agentic=False -T shuffle=True -T seed=42
uv run inspect eval inspect_evals/frontier_cs_research --model anthropic/claude-sonnet-4-5-20250929,openai/gpt-5.1-2025-11-13 --limit 43 --max-sandboxes 10 -T agentic=False -T shuffle=True -T seed=42

Notes:

Runs used the default agentic solver (react with bash and python tools, message_limit=100)
The research track excludes 23 GPU-requiring problems (include_gpu_problems=False by default), reducing the track from 66 to 43 problems
Algorithmic track ran 43 of 172 problems using shuffle=True and --limit 43 to match the research track size for cost efficiency. No seed was set, so each model received a different random subset; a seed parameter (default 42) has since been added for reproducibility
Results produced February 2026
Post-evaluation trajectory analysis identified environment issues (missing PySR dependency, SIFT1M dataset download timeouts, interactive problem guidance in system prompts). These have been fixed but the evaluation has not been rerun due to cost constraints; scores may improve in future runs
Paper baselines (Score@1, single-turn generation): Gemini 3.0 Pro scored 33.12% on algorithmic and 46.55% on research
Single-turn scores are significantly lower than agentic scores, as expected — multi-turn tool use allows iterative debugging and refinement
Claude Sonnet 4.5 outperforms GPT-5.1 on both tracks in single-turn mode
All single-turn runs used shuffle=True, seed=42, limit=43, max_sandboxes=10

Changelog

[2-A] - 2026-04-21

Replaced deprecated basic_agent() with react() as the default agent. Fixes a pathological case where basic_agent() would loop on repeated content_filter stops until hitting message_limit. Also changes the stall-nudge message (sent when the model stops without calling a tool) to append a reminder to call the submit tool — a one-sentence prompt addition that may bias models toward earlier submission.

[1-A] - 2026-03-05

Initial release: agentic and single-turn evaluation modes for algorithmic (C++) and research (Python) tracks. GPU-requiring research problems excluded by default (include_gpu_problems=False).