LiveCodeBench-Pro: Competitive Programming Benchmark
Evaluates LLMs on competitive programming problems using a specialized Docker sandbox (LightCPVerifier) to execute and judge C++ code submissions against hidden test cases with time and memory constraints.
Overview
LiveCodeBench-Pro is a benchmark for evaluating LLMs on competitive programming problems. It uses a specialized Docker sandbox (LightCPVerifier) to execute and judge C++ code submissions.
Contributed by @GauravJoshi
Usage
Installation
There are two ways to use Inspect Evals: install it from PyPI as a dependency of your own project, or work from a standalone checkout of the GitHub repository.
If you are using it from PyPI, install the package and its dependencies via:
pip install inspect-evals
If you are using Inspect Evals from its repository, start by installing the necessary dependencies with:
uv sync
Running evaluations
Now you can start evaluating models. For simplicity, this section assumes you are using Inspect Evals from the standalone repository. If instead you are using it as a dependency in your own project and are not managing dependencies with uv, run the same commands with the uv run prefix dropped.
uv run inspect eval inspect_evals/livecodebench_pro --model openai/gpt-5-nano
You can also import tasks as normal Python objects and run them from Python:
from inspect_ai import eval
from inspect_evals.livecodebench_pro import livecodebench_pro
eval(livecodebench_pro)
After running evaluations, you can view their logs using the inspect view command:
uv run inspect view
For VS Code, you can also install the Inspect AI extension for viewing logs.
If you don’t want to specify the --model each time you run an evaluation, create a .env configuration file in your working directory that defines the INSPECT_EVAL_MODEL environment variable along with your API key. For example:
INSPECT_EVAL_MODEL=anthropic/claude-opus-4-1-20250805
ANTHROPIC_API_KEY=<anthropic-api-key>
Options
You can control a variety of options from the command line. For example:
uv run inspect eval inspect_evals/livecodebench_pro --limit 10
uv run inspect eval inspect_evals/livecodebench_pro --max-connections 10
uv run inspect eval inspect_evals/livecodebench_pro --temperature 0.5
See uv run inspect eval --help for all available options.
Parameters
livecodebench_pro
- split (str | list[str] | None): Dataset split(s) to evaluate (e.g., "quater_2024_10_12"). If None, all available splits are loaded. (default: None)
- difficulty (Literal['easy', 'medium', 'hard'] | list[Literal['easy', 'medium', 'hard']] | None): Problem difficulty filter ("easy", "medium", "hard"). If None, no difficulty filtering is applied. (default: None)
- custom_solver (Solver | list[Solver] | None): Custom solver(s) to use. Defaults to [system_message(), generate()]. (default: None)
- custom_scorer (Scorer | None): Custom scorer to use. (default: None)
- docker_handling (DockerHandling): How to handle Docker images: default, force_build, or force_pull. (default: DockerHandling.DEFAULT)
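For example, to run the task programmatically on a single quarterly split at a fixed difficulty, you can pass these parameters directly. This is a minimal sketch; the split and model names are taken from the evaluation report below.

```python
from inspect_ai import eval

from inspect_evals.livecodebench_pro import livecodebench_pro

# Evaluate only the easy problems from the 2025 Q1 split; all other
# parameters (custom_solver, custom_scorer, docker_handling) keep their defaults.
eval(
    livecodebench_pro(split="quater_2025_1_3", difficulty="easy"),
    model="openai/gpt-4.1-mini-2025-04-14",
)
```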
Dataset
The dataset is loaded from QAQAQAQAQ/LiveCodeBench-Pro on Hugging Face. Each sample contains a competitive programming problem with:
- Problem Statement: A detailed description of the programming task
- Test Cases: Hidden test cases used by the judge to evaluate correctness
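For a quick look at the raw data, the problem dataset can be loaded directly from the Hub. This is a sketch only: it assumes the quarterly splits are exposed as Hugging Face splits, and the record fields are not enumerated here.

```python
from datasets import load_dataset

# Load one quarterly split of the problem dataset and print the first record.
# The split name is taken from the evaluation commands later in this README.
problems = load_dataset("QAQAQAQAQ/LiveCodeBench-Pro", split="quater_2025_1_3")
print(problems[0])
```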
Here is an example of how problems are presented to the model:
You are a competitive programmer. You will be given a problem statement, please implement solution in C++. The execution time and memory limit are also stated in the statement so be aware of the complexity of the program. Please wrap the code in ```cpp and ``` so that it is properly formatted.
[Problem Statement with detailed description, input/output format, constraints, and examples]
The model is tasked with generating a complete C++ solution that solves the problem within the given constraints.
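Per the parameter defaults above, a custom_solver equivalent to the default chain could be assembled as in the following sketch; the exact prompt wiring in the implementation may differ.

```python
from inspect_ai.solver import generate, system_message

# The full system prompt is the one quoted above; abbreviated here for brevity.
SYSTEM_PROMPT = (
    "You are a competitive programmer. You will be given a problem statement, "
    "please implement solution in C++. [...]"
)

# Default chain: a system message followed by a single generation step.
default_solver = [system_message(SYSTEM_PROMPT), generate()]
```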
Scoring
The evaluation uses a custom scorer that:
- Extracts C++ code from the model’s output
- Submits the code to the LightCPVerifier judge service running in a Docker sandbox
- Executes the code against hidden test cases with time and memory limits enforced
- Returns the verdict:
- CORRECT: If the judge returns “Accepted” (all test cases passed)
- INCORRECT: For any other verdict
- Judge Results include:
  - Invalid: Invalid result
  - Accepted: Correct result
  - MemoryLimitExceeded: Execution exceeded the allowed memory
  - TimeLimitExceeded: Execution exceeded the allowed time
  - OutputLimitExceeded: Output exceeded the allowed size
  - FileError: Error accessing required files
  - NonZeroExitStatus: Program exited with a non-zero return code
  - Signalled: Program was terminated by a signal
  - DangerousSyscall: Program attempted a forbidden system call
  - InternalError: An internal judge error occurred
The final metric reported is accuracy (percentage of problems solved correctly) along with standard error.
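As an illustration of the first scoring step above, pulling the fenced C++ block out of a completion might look like this sketch; the scorer's actual extraction logic may differ.

```python
import re

# Matches a cpp-fenced code block in a model completion.
CPP_BLOCK = re.compile(r"```cpp\s*(.*?)```", re.DOTALL)

def extract_cpp_code(completion: str) -> str | None:
    """Return the last cpp-fenced code block in a completion, or None if absent."""
    blocks = CPP_BLOCK.findall(completion)
    return blocks[-1].strip() if blocks else None
```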
Evaluation Report
Results from LiveCodeBench Pro paper
| Model | Hard | Medium | Easy | Rating | Pct.% | AvgTok | AvgCost |
|---|---|---|---|---|---|---|---|
| Reasoning Models | |||||||
| o4-mini-high | 0.0% | 53.5% | 83.1% | 2,116 | 1.5% | 23,819 | $0.1048 |
| Gemini 2.5 Pro | 0.0% | 25.4% | 70.4% | 1,992 | 2.3% | 29,879 | $0.2988 |
| o3-mini | 0.0% | 16.9% | 77.5% | 1,777 | 4.9% | 18,230 | $0.0802 |
| DeepSeek R1 | 0.0% | 9.9% | 56.3% | 1,442 | 18.0% | 16,716 | $0.0366 |
| Gemini 2.5 Flash | 0.0% | 12.7% | 47.9% | 1,334 | 30.3% | 35,085 | $0.0116 |
| DeepSeek R1 Distill-Llama-70B | 0.0% | 2.8% | 33.8% | 999 | 56.0% | 12,425 | $0.0050 |
| Claude 3.7 Sonnet (Max Reasoning) | 0.0% | 1.4% | 36.6% | 992 | 56.5% | 19,075 | $0.2861 |
| Gemini 2.0 Flash Reasoning | 0.0% | 0.0% | 29.6% | 893 | 63.1% | 11,143 | $0.0390 |
| Non-Reasoning Models | |||||||
| GPT-4.1 mini | 0.0% | 5.6% | 28.2% | 1,006 | 55.5% | 2,662 | $0.0043 |
| DeepSeek V3 0324 | 0.0% | 5.6% | 32.4% | 984 | 57.1% | 2,712 | $0.0030 |
| GPT-4.1 | 0.0% | 0.0% | 23.9% | 889 | 64.2% | 2,131 | $0.0170 |
| GPT-4.5 | 0.0% | 0.0% | 26.8% | 881 | 64.8% | 968 | $0.1452 |
| Qwen-Max | 0.0% | 0.0% | 14.1% | 821 | 69.4% | 1,244 | $0.0080 |
| Claude 3.7 Sonnet (No Reasoning) | 0.0% | 1.4% | 16.9% | 804 | 70.7% | 3,554 | $0.0533 |
| Llama 4 Maverick | 0.0% | 0.0% | 15.5% | 634 | 80.4% | 1,160 | $0.0007 |
| Claude 3.5 Sonnet | 0.0% | 0.0% | 14.1% | 617 | 81.4% | 810 | $0.0122 |
| Gemma 3 27B | 0.0% | 0.0% | 8.5% | 601 | 82.5% | 668 | $0.0001 |
| GPT-4o | 0.0% | 0.0% | 9.9% | 592 | 83.1% | 1,133 | $0.0227 |
| Meta Llama 3.1 405B Instruct | 0.0% | 0.0% | 9.9% | 574 | 84.3% | 568 | $0.0005 |
| DeepSeek V3 | 0.0% | 0.0% | 12.7% | 557 | 84.9% | 1,020 | $0.0011 |
Results from eval
Evaluation Version: 1-A
| Model | Provider | Split | Accuracy | Stderr | Samples | Difficulty |
|---|---|---|---|---|---|---|
| gpt-4.1-mini-2025-04-14 | OpenAI | quater_2025_1_3 | 0.281 | 0.048 | 89 | easy |
| gemini-3-flash-preview | Google | quater_2025_1_3 | 0.921 | 0.029 | 89 | easy |
| gemini-3-flash-preview | Google | quater_2025_4_6 | 0.607 | 0.066 | 56 | medium |
| gpt-4o | OpenAI | All | 0.100 | 0.025 | 150 | Unspecified |
Evaluation Commands:
uv run inspect eval inspect_evals/livecodebench_pro --model openai/gpt-4.1-mini-2025-04-14 --max-connections 4 -T difficulty=easy -T split=quater_2025_1_3
uv run inspect eval inspect_evals/livecodebench_pro --model google/gemini-3-flash-preview --max-connections 4 -T difficulty=easy -T split=quater_2025_1_3
uv run inspect eval inspect_evals/livecodebench_pro --model google/gemini-3-flash-preview --max-connections 4 -T difficulty=medium -T split=quater_2025_4_6
uv run inspect eval inspect_evals/livecodebench_pro --limit 150 --model openai/gpt-4o --log-level INFO
Notes:
- Missing test cases are skipped
- The test-case dataset QAQAQAQAQ/LiveCodeBench-Pro-Testcase is currently missing test cases for the 'quarter_2025_7_9' split
- Gemini 3 Flash Preview results can be compared against the LiveCodeBench-Pro leaderboard by split and difficulty
Differences from Original Implementation
- Dataset handling: allows filtering the dataset by split, difficulty, and sample limit
- More logging: logs code extraction, judge connectivity, execution results, and error handling
- Custom Docker Compose: a compose.yaml for Inspect AI's sandbox environment
Security Considerations
This evaluation requires running the judge service container in privileged mode (privileged: true in compose.yaml).
The LightCPVerifier judge service uses go-judge internally, which requires elevated privileges.
References
- Paper: LiveCodeBench-Pro: A Benchmark for Evaluating LLMs on Competitive Programming
- Original Implementation: GavinZhengOI/LiveCodeBench-Pro
- Judge Service: YanagiOrigami/LightCPVerifier