Frontier-CS: Benchmarking LLMs on Computer Science Problems
238 open-ended computer science problems spanning algorithmic (172) and research (66) tracks. Problems feature continuous partial scoring, with algorithmic solutions evaluated via compilation and test-case checking, and research solutions evaluated via custom evaluator scripts. Current frontier models score well below human expert baselines, making this a challenging, unsaturated benchmark.
Overview
Frontier-CS is a benchmark of 238 open-ended computer science problems spanning two tracks: algorithmic (172 C++ competitive programming problems) and research (66 Python problems covering GPU kernel optimization, symbolic regression, and more). Problems feature continuous partial scoring — algorithmic solutions are compiled and tested against test cases with custom checkers, while research solutions are evaluated by problem-specific Python scripts. Current frontier models score well below human expert baselines, making this a challenging, unsaturated benchmark.
Usage
Installation
There are two ways of using Inspect Evals: from PyPI as a dependency of your own project, or as a standalone checked-out GitHub repository.
If you are using it from PyPI, install the package and its dependencies via:
pip install inspect-evals
If you are using Inspect Evals in its repository, start by installing the necessary dependencies with:
uv sync
Running evaluations
Now you can start evaluating models. For simplicity’s sake, this section assumes you are using Inspect Evals from the standalone repo. If that’s not the case and you are not using uv to manage dependencies in your own project, you can use the same commands with uv run dropped.
uv run inspect eval inspect_evals/frontier_cs --model openai/gpt-5-nano
uv run inspect eval inspect_evals/frontier_cs_algorithmic --model openai/gpt-5-nano
uv run inspect eval inspect_evals/frontier_cs_research --model openai/gpt-5-nano
To run multiple tasks simultaneously, use inspect eval-set:
uv run inspect eval-set inspect_evals/frontier_cs inspect_evals/frontier_cs_algorithmic inspect_evals/frontier_cs_research
You can also import tasks as normal Python objects and run them from Python:
from inspect_ai import eval, eval_set
from inspect_evals.frontier_cs import frontier_cs, frontier_cs_algorithmic, frontier_cs_research
eval(frontier_cs)
eval_set([frontier_cs, frontier_cs_algorithmic, frontier_cs_research], log_dir='logs-run-42')
After running evaluations, you can view their logs using the inspect view command:
uv run inspect view
For VS Code, you can also download the Inspect AI extension for viewing logs.
If you don’t want to specify the --model each time you run an evaluation, create a .env configuration file in your working directory that defines the INSPECT_EVAL_MODEL environment variable along with your API key. For example:
INSPECT_EVAL_MODEL=anthropic/claude-opus-4-1-20250805
ANTHROPIC_API_KEY=<anthropic-api-key>
Options
You can control a variety of options from the command line. For example:
uv run inspect eval inspect_evals/frontier_cs --limit 10
uv run inspect eval inspect_evals/frontier_cs_algorithmic --max-connections 10
uv run inspect eval inspect_evals/frontier_cs_research --temperature 0.5
See uv run inspect eval --help for all available options.
Parameters
frontier_cs
- track (str): (default: 'all')
- solver (Solver | None): (default: None)
- scorer (Scorer | list[Scorer] | None): (default: None)
- message_limit (int): (default: 100)
- sandbox (str | tuple[str, str] | SandboxEnvironmentSpec): (default: ('docker', 'src/inspect_evals/frontier_cs/compose.yaml'))
- agentic (bool): (default: True)
- include_gpu_problems (bool): (default: False)
- shuffle (bool): (default: False)
- seed (int): (default: 42)
frontier_cs_algorithmic
- solver (Solver | None): (default: None)
- scorer (Scorer | list[Scorer] | None): (default: None)
- message_limit (int): (default: 100)
- sandbox (str | tuple[str, str] | SandboxEnvironmentSpec): (default: ('docker', 'src/inspect_evals/frontier_cs/compose.yaml'))
- agentic (bool): (default: True)
- shuffle (bool): (default: False)
- seed (int): (default: 42)
frontier_cs_research
- solver (Solver | None): (default: None)
- scorer (Scorer | list[Scorer] | None): (default: None)
- message_limit (int): (default: 100)
- sandbox (str | tuple[str, str] | SandboxEnvironmentSpec): (default: ('docker', 'src/inspect_evals/frontier_cs/compose.yaml'))
- agentic (bool): (default: True)
- include_gpu_problems (bool): (default: False)
- shuffle (bool): (default: False)
- seed (int): (default: 42)
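Task parameters such as agentic, shuffle, and seed are passed on the command line as -T name=value pairs, as in the reproducibility commands in this README. The following is a minimal sketch of assembling such a command programmatically; build_inspect_command is a hypothetical helper written for illustration, not part of Inspect or Inspect Evals.

```python
import shlex


def build_inspect_command(task, model=None, task_args=None, limit=None):
    """Assemble an `inspect eval` command line as a list of arguments.

    Hypothetical helper for illustration only; it just formats arguments.
    """
    argv = ["uv", "run", "inspect", "eval", task]
    if model is not None:
        argv += ["--model", model]
    if limit is not None:
        argv += ["--limit", str(limit)]
    # Each task parameter becomes its own `-T name=value` pair.
    for name, value in (task_args or {}).items():
        argv += ["-T", f"{name}={value}"]
    return argv


cmd = build_inspect_command(
    "inspect_evals/frontier_cs_research",
    model="openai/gpt-5-nano",
    task_args={"agentic": False, "shuffle": True, "seed": 42},
    limit=43,
)
# Print a copy-pasteable shell command.
print(shlex.join(cmd))
```

The resulting list can be handed to subprocess.run if you prefer to launch runs from a script rather than a shell.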
Prerequisites
This evaluation requires Docker to be installed and running. On first run, a local Docker image is built containing g++, python3, and testlib.h for compiling and running solutions and checkers.
Dataset
The dataset is loaded from HuggingFace (238 problems in the test split). Each problem has a problem_id, category (“algorithmic” or “research”), statement (the problem text), and config (YAML string with time/memory limits and scoring metadata).
Test data (inputs, expected outputs, custom checkers) is downloaded from the FrontierCS GitHub repository and cached locally.
Scoring
Algorithmic track: Solutions are compiled with g++ -O2 -std=gnu++17, run against test cases, and checked with custom testlib-based checkers. Each test case yields a score ratio (0.0-1.0), and the final score is the mean across all test cases.
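As an illustration of the aggregation described for the algorithmic track, here is a minimal sketch that averages per-test-case ratios into a problem score. The function name and the handling of an empty list (treated as 0.0, for example when nothing could be run) are assumptions made for this example, not the evaluation's actual code.

```python
from statistics import fmean


def algorithmic_problem_score(case_ratios):
    """Mean of per-test-case score ratios, each expected in [0.0, 1.0]."""
    if not case_ratios:
        # Assumption: no test-case results is scored as zero.
        return 0.0
    for ratio in case_ratios:
        if not 0.0 <= ratio <= 1.0:
            raise ValueError(f"score ratio out of range: {ratio}")
    return fmean(case_ratios)


# One full pass, two partial results, one failure.
print(algorithmic_problem_score([1.0, 0.5, 0.0, 0.25]))
```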
Research track: Solutions are evaluated by problem-specific Python evaluator scripts. The score is parsed from the evaluator’s stdout and normalized to 0.0-1.0.
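This README does not specify the evaluator output format or the normalization rule, so the following is only a sketch under loudly stated assumptions: the score is taken to be a number on the last non-empty line of stdout, on a 0 to 100 scale, and unparseable output is scored as 0.0. Both parse_research_score and max_score are hypothetical names.

```python
import math


def parse_research_score(stdout, max_score=100.0):
    """Extract a score from evaluator stdout and normalize it to [0.0, 1.0].

    Assumptions (not confirmed by the source): the last non-empty line
    holds the numeric score, and scores range from 0 to max_score.
    """
    lines = [line.strip() for line in stdout.splitlines() if line.strip()]
    if not lines:
        return 0.0
    try:
        raw = float(lines[-1])
    except ValueError:
        return 0.0
    if math.isnan(raw):
        return 0.0
    # Clamp so that out-of-range evaluator output cannot leave [0, 1].
    return min(1.0, max(0.0, raw / max_score))


print(parse_research_score("running evaluator...\n73.5\n"))
```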
Metrics reported: mean and stderr of per-problem scores.
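To make the two reported metrics concrete, the sketch below computes a mean and a standard error from a list of per-problem scores. It assumes the common estimator, sample standard deviation (n - 1 denominator) divided by the square root of n; the exact estimator used by the reported stderr metric is not stated in this README.

```python
from math import sqrt
from statistics import fmean, stdev


def mean_and_stderr(scores):
    """Return (mean, standard error of the mean) for per-problem scores."""
    n = len(scores)
    if n == 0:
        raise ValueError("need at least one score")
    mean = fmean(scores)
    # Assumption: stderr = sample standard deviation / sqrt(n).
    stderr = stdev(scores) / sqrt(n) if n > 1 else 0.0
    return mean, stderr


mean, stderr = mean_and_stderr([0.0, 0.5, 1.0, 0.5])
print(f"mean={mean:.3f} stderr={stderr:.3f}")
```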
Evaluation Report
Algorithmic Track (43/172 problems, C++)
| Model | Provider | Mean Score | Stderr | Time |
|---|---|---|---|---|
| claude-sonnet-4-5-20250929 | Anthropic | 0.207 | 0.056 | 32m 56s |
| gpt-5.1-2025-11-13 | OpenAI | 0.110 | 0.041 | 42m 7s |
Research Track (43 problems, Python, GPU problems excluded)
| Model | Provider | Mean Score | Stderr | Time |
|---|---|---|---|---|
| claude-sonnet-4-5-20250929 | Anthropic | 0.194 | 0.046 | 55m 25s |
| gpt-5.1-2025-11-13 | OpenAI | 0.132 | 0.043 | 57m 42s |
Reproducibility:
uv run inspect eval inspect_evals/frontier_cs_algorithmic --model anthropic/claude-sonnet-4-5-20250929,openai/gpt-5.1-2025-11-13 --limit 43 --max-tasks 1 -T shuffle=True -T message_limit=100
uv run inspect eval inspect_evals/frontier_cs_research --model anthropic/claude-sonnet-4-5-20250929,openai/gpt-5.1-2025-11-13 --limit 43 --max-tasks 1 -T shuffle=True -T message_limit=100
Single-Turn (Non-Agentic) Results
These results use agentic=False, which matches the paper’s single-turn generation methodology — the model produces a solution in one pass without iterative tool use.
Algorithmic Track (43/172 problems, C++)
| Model | Provider | Mean Score | Stderr | Time |
|---|---|---|---|---|
| claude-sonnet-4-5-20250929 | Anthropic | 0.119 | 0.038 | 2m 4s |
| gpt-5.1-2025-11-13 | OpenAI | 0.076 | 0.031 | 5m 30s |
Research Track (43 problems, Python, GPU problems excluded)
| Model | Provider | Mean Score | Stderr | Time |
|---|---|---|---|---|
| claude-sonnet-4-5-20250929 | Anthropic | 0.172 | 0.045 | 16m 18s |
| gpt-5.1-2025-11-13 | OpenAI | 0.082 | 0.032 | 16m 15s |
Reproducibility:
uv run inspect eval inspect_evals/frontier_cs_algorithmic --model anthropic/claude-sonnet-4-5-20250929,openai/gpt-5.1-2025-11-13 --limit 43 --max-sandboxes 10 -T agentic=False -T shuffle=True -T seed=42
uv run inspect eval inspect_evals/frontier_cs_research --model anthropic/claude-sonnet-4-5-20250929,openai/gpt-5.1-2025-11-13 --limit 43 --max-sandboxes 10 -T agentic=False -T shuffle=True -T seed=42
Notes:
- Runs used the default agentic solver (basic_agent with bash and python tools, message_limit=100)
- The research track excludes 23 GPU-requiring problems (include_gpu_problems=False by default), reducing the track from 66 to 43 problems
- Algorithmic track ran 43 of 172 problems using shuffle=True and --limit 43 to match the research track size for cost efficiency. No seed was set, so each model received a different random subset; a seed parameter (default 42) has since been added for reproducibility
- Results produced February 2026
- Post-evaluation trajectory analysis identified environment issues (missing PySR dependency, SIFT1M dataset download timeouts, interactive problem guidance in system prompts). These have been fixed but the evaluation has not been rerun due to cost constraints; scores may improve in future runs
- Paper baselines (Score@1, single-turn generation): Gemini 3.0 Pro scored 33.12% on algorithmic and 46.55% on research
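- Single-turn scores are lower than agentic scores for both models on both tracks, as expected, since multi-turn tool use allows iterative debugging and refinement. With 43 problems per run, each gap is smaller than twice its combined standard error, so the differences should be read as indicative rather than conclusive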
- Claude Sonnet 4.5 outperforms GPT-5.1 on both tracks in single-turn mode
- All single-turn runs used shuffle=True, seed=42, limit=43, max_sandboxes=10
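When comparing the agentic and single-turn tables, it helps to express each gap in units of its combined standard error. The sketch below does this with the means and standard errors copied from the tables above. It treats the two runs as independent samples, which is an assumption (the agentic runs used unseeded random subsets), so this is a rough screen rather than a formal significance test.

```python
from math import sqrt

# ((agentic mean, stderr), (single-turn mean, stderr)) from the tables above.
RESULTS = {
    ("algorithmic", "claude-sonnet-4-5-20250929"): ((0.207, 0.056), (0.119, 0.038)),
    ("algorithmic", "gpt-5.1-2025-11-13"): ((0.110, 0.041), (0.076, 0.031)),
    ("research", "claude-sonnet-4-5-20250929"): ((0.194, 0.046), (0.172, 0.045)),
    ("research", "gpt-5.1-2025-11-13"): ((0.132, 0.043), (0.082, 0.032)),
}


def gap_in_stderrs(agentic, single_turn):
    """Difference of means divided by the combined standard error."""
    (mean_a, se_a), (mean_s, se_s) = agentic, single_turn
    # Assumption: the two runs are independent, so variances add.
    combined_se = sqrt(se_a ** 2 + se_s ** 2)
    return (mean_a - mean_s) / combined_se


gaps = {key: gap_in_stderrs(*pair) for key, pair in RESULTS.items()}
for (track, model), z in gaps.items():
    print(f"{track:12s} {model:28s} gap = {z:.2f} combined stderr")
```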
Changelog
- v1-A (initial release): Agentic and single-turn evaluation modes for algorithmic (C++) and research (Python) tracks. GPU-requiring research problems excluded by default (include_gpu_problems=False).