ComputeEval: CUDA Code Generation Benchmark

Coding

inspect-evals

Evaluates LLM capability to generate correct CUDA code for kernel implementation, memory management, and parallel algorithm optimization tasks.

Contributed By

@Vitamoon

Code

src/inspect_evals/compute_eval

Overview

ComputeEval is a benchmark for evaluating whether LLMs can generate correct CUDA code. It is developed by NVIDIA Research and available at github.com/NVIDIA/compute-eval and huggingface.co/datasets/nvidia/compute-eval.

As noted in the original repo:

This repository executes machine-generated CUDA C++ code. While it’s unlikely that the code is malicious, it could still pose potential risks. We strongly recommend using a sandbox environment (e.g., a Docker container or virtual machine) when running generated code to minimize security risks.

Usage

Installation

There are two ways to use Inspect Evals: from PyPI as a dependency of your own project, or as a standalone checked-out GitHub repository.

If you are using it from PyPI, install the package and its dependencies via:

pip install inspect-evals

If you are using Inspect Evals in its repository, start by installing the necessary dependencies with:

uv sync

Running evaluations

Now you can start evaluating models. For simplicity’s sake, this section assumes you are using Inspect Evals from the standalone repo. If that’s not the case and you are not using uv to manage dependencies in your own project, you can use the same commands with uv run dropped.

uv run inspect eval inspect_evals/compute_eval --model openai/gpt-5-nano

You can also import tasks as normal Python objects and run them from Python:

from inspect_ai import eval
from inspect_evals.compute_eval import compute_eval
eval(compute_eval)

After running evaluations, you can view their logs using the inspect view command:

uv run inspect view

For VS Code, you can also download the Inspect AI extension for viewing logs.

If you don’t want to specify the --model each time you run an evaluation, create a .env configuration file in your working directory that defines the INSPECT_EVAL_MODEL environment variable along with your API key. For example:

INSPECT_EVAL_MODEL=anthropic/claude-opus-4-1-20250805
ANTHROPIC_API_KEY=<anthropic-api-key>

Options

You can control a variety of options from the command line. For example:

uv run inspect eval inspect_evals/compute_eval --limit 10
uv run inspect eval inspect_evals/compute_eval --max-connections 10
uv run inspect eval inspect_evals/compute_eval --temperature 0.5

See uv run inspect eval --help for all available options.

Parameters

`compute_eval`

limit (int | None): Maximum number of samples to evaluate (None for all) (default: None)
dataset_version (str): Dataset version to use. Options: “default” (latest), “2025-1”, “2025-2”, “2025-3”. Default is “default” which uses the latest version. (default: 'default')
skip_cuda_check (bool): Skip host CUDA environment validation. Ignored when sandbox_type is set; the host check is always skipped when using a sandbox since nvcc need not be present on the host. (default: False)
num_epochs (int): Number of solutions to generate per problem for pass@k evaluation. Use higher values (e.g., 10) with epoch_reducer=“pass_at_1” for unbiased pass@k estimation. (default: 1)
epoch_reducer (str | list[str]): Strategy to reduce multiple epoch scores. Options: “mean”, “max”, “median”, “mode”, or “pass_at_k” where k <= num_epochs. Can also be a list of reducers. (default: 'mean')
temperature (float): Temperature for generation. Use a nonzero value (e.g., 0.8) when generating multiple samples for pass@k. (default: 1.0)
benchmark_performance (bool): If True, measure execution time of passing solutions by running them multiple times and recording the average. (default: False)
benchmark_runs (int): Number of times to run each passing solution when benchmark_performance is enabled, used to compute average execution time. (default: 3)
sandbox_type (Optional[Literal['docker']]): Sandbox type for isolating code execution. Use "docker" (default) to run code inside a container. Set to None to run directly on the host machine. (default: 'docker')

Dataset

ComputeEval contains CUDA programming challenges spanning kernel implementation, memory management, and algorithm optimization. Each task provides:

A natural language prompt describing the task
Context files (headers, interfaces) visible to the model
Hidden test files for evaluation
Build commands for compilation

Problem types:

cuda_cpp - C++ CUDA kernel implementations
cuda_python - Python CUDA code (e.g., CuPy, Numba). Currently, the dataset by NVIDIA does not contain any Python questions.

Releases:

Version	Problems	Status
2025.1	127	Released
2025.2	232	Released
2025.3	406	Released

Note: default is an alias for the latest version (2025.3).

Scoring

The scorer compiles the generated CUDA code using nvcc and runs the test harness to verify functional correctness. A solution is marked correct if:

Compilation succeeds without errors
All test cases pass (exit code 0)
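
The two conditions above can be sketched as a small decision function. This is an illustrative sketch only: StepResult and is_correct are hypothetical names, not types or functions from the scorer, and the real scorer may record more detail than a return code.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class StepResult:
    """Outcome of one shell step (hypothetical container, not the scorer's own type)."""

    returncode: int
    stderr: str = ""


def is_correct(compile_step: StepResult, test_step: Optional[StepResult]) -> bool:
    """Return True only if compilation succeeded and the test harness exited with 0."""
    if compile_step.returncode != 0:
        return False  # compilation failed, so the tests never ran
    if test_step is None:
        return False  # no test result available
    return test_step.returncode == 0


compiled_and_passed = is_correct(StepResult(0), StepResult(0))
compile_error = is_correct(StepResult(1, "error: expected ';'"), None)
test_failure = is_correct(StepResult(0), StepResult(1))
```

Treating a missing test result as incorrect keeps the function conservative: a solution is only counted when both steps are known to have succeeded.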

Requirements

NVIDIA GPU with CUDA Toolkit 12.0+
nvcc compiler available in PATH

Other Notes

CUDA Environment Validation

Before running the benchmark, the implementation validates your CUDA environment:

Checks that nvcc is available in PATH
Verifies CUDA version is 12.0+ (required for -arch=native support)
Tests that -arch=native compilation flag works

If validation fails, the benchmark will raise a RuntimeError with instructions. You can skip this check with skip_cuda_check=True, but compilation may fail on individual problems.
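
As a rough illustration of the first two checks, the sketch below locates nvcc, parses the release number from the text printed by nvcc --version, and raises a RuntimeError when the version is too old. The function names are hypothetical and are not taken from the implementation; the sketch also assumes the usual "release X.Y" line in nvcc's output and does not attempt the -arch=native test.

```python
import re
import shutil
import subprocess
from typing import Optional, Tuple

MIN_CUDA = (12, 0)  # minimum version stated in this README


def read_nvcc_version_output() -> Optional[str]:
    """Return the text printed by `nvcc --version`, or None if nvcc is not in PATH."""
    nvcc = shutil.which("nvcc")
    if nvcc is None:
        return None
    result = subprocess.run([nvcc, "--version"], capture_output=True, text=True)
    return result.stdout


def parse_cuda_version(nvcc_output: str) -> Optional[Tuple[int, int]]:
    """Extract (major, minor) from a line such as '... release 12.4, V12.4.131'."""
    match = re.search(r"release (\d+)\.(\d+)", nvcc_output)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def check_cuda_version(nvcc_output: Optional[str]) -> None:
    """Raise RuntimeError if nvcc is missing, unparseable, or older than MIN_CUDA."""
    if nvcc_output is None:
        raise RuntimeError("nvcc not found in PATH; install the CUDA Toolkit")
    version = parse_cuda_version(nvcc_output)
    if version is None:
        raise RuntimeError("could not parse the CUDA version from nvcc output")
    if version < MIN_CUDA:
        raise RuntimeError(
            f"CUDA {version[0]}.{version[1]} found, "
            f"but {MIN_CUDA[0]}.{MIN_CUDA[1]}+ is required"
        )


sample = "Cuda compilation tools, release 12.4, V12.4.131"
check_cuda_version(sample)  # passes silently for a new enough toolkit
```

On a real host, check_cuda_version(read_nvcc_version_output()) would perform the same check against the installed toolkit.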

Per-Problem CUDA Version Requirements

Some problems in the dataset require a specific minimum CUDA version (specified in the min_cuda_toolkit field). If a problem requires a newer CUDA version than the one installed, it is skipped and marked as incorrect with metadata {"skipped": True, "reason": "cuda_version_insufficient"}.
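
A minimal sketch of this skip decision follows. It assumes that min_cuda_toolkit is a string such as "12.4" and that the field may be absent; both the string format and the function names are assumptions for illustration, not details confirmed by the dataset or the scorer.

```python
from typing import Optional, Tuple


def parse_version(text: str) -> Tuple[int, int]:
    """Turn '12' or '12.4' into (major, minor), defaulting the minor part to 0."""
    parts = text.strip().split(".")
    major = int(parts[0])
    minor = int(parts[1]) if len(parts) > 1 else 0
    return major, minor


def skip_metadata(
    min_cuda_toolkit: Optional[str], installed: Tuple[int, int]
) -> Optional[dict]:
    """Return the skip metadata if the installed toolkit is too old, else None."""
    if not min_cuda_toolkit:
        return None  # the problem states no minimum version
    if installed < parse_version(min_cuda_toolkit):
        return {"skipped": True, "reason": "cuda_version_insufficient"}
    return None


too_old = skip_metadata("12.8", installed=(12, 4))
new_enough = skip_metadata("12.0", installed=(12, 4))
```

Comparing (major, minor) tuples avoids the pitfalls of comparing version strings, where "12.10" would sort before "12.4".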

Pass@k Evaluation

For unbiased pass@k estimation, use multiple epochs with temperature > 0:

uv run inspect eval inspect_evals/compute_eval \
    --model openai/gpt-5-nano \
    -T num_epochs=10 \
    -T epoch_reducer=pass_at_1 \
    -T temperature=0.8

This generates 10 solutions per problem and, with the pass_at_1 reducer, estimates the probability that a single sampled solution is correct.
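
For reference, the commonly used unbiased pass@k estimator for a problem with n generated solutions, c of them correct, is 1 - C(n-c, k) / C(n, k). The sketch below implements that formula directly; whether the epoch reducer computes exactly this expression is an assumption, and pass_at_k is a hypothetical helper name.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate from n samples with c correct (requires 1 <= k <= n)."""
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    if n - c < k:
        return 1.0  # every size-k subset contains at least one correct sample
    # Probability that a random size-k subset contains no correct sample,
    # subtracted from 1.
    return 1.0 - comb(n - c, k) / comb(n, k)


# With 10 epochs and 3 correct solutions, pass@1 reduces to c / n.
estimate_at_1 = pass_at_k(n=10, c=3, k=1)
estimate_at_5 = pass_at_k(n=10, c=3, k=5)
```

For k = 1 the estimator is simply the fraction of correct samples, so generating more epochs reduces the variance of the estimate rather than changing what is being measured.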

Prompt Fidelity

This implementation uses prompts that match NVIDIA’s original implementation to ensure comparable benchmark results. The system prompt and user prompt format are taken directly from NVIDIA’s compute-eval repository.

Source Reference Validation

Some problems include source_references metadata specifying required API calls or symbols that must appear in the solution. The scorer validates these references and marks solutions as incorrect if required symbols are missing, even if they would otherwise compile and pass tests.
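
One simple way to perform such a check is a whole-identifier search over the generated source, sketched below. The structure of source_references and the matching rules used by the scorer are not described here, so the list-of-names input and the missing_symbols helper are assumptions made for illustration.

```python
import re
from typing import Iterable, List


def missing_symbols(solution_source: str, required: Iterable[str]) -> List[str]:
    """Return the required symbols that never appear as whole identifiers."""
    missing = []
    for symbol in required:
        # Reject matches that are only a prefix or suffix of a longer identifier.
        pattern = r"(?<![A-Za-z0-9_])" + re.escape(symbol) + r"(?![A-Za-z0-9_])"
        if re.search(pattern, solution_source) is None:
            missing.append(symbol)
    return missing


source = "cudaMalloc(&d_buf, bytes); cudaMemcpyAsync(d_buf, h_buf, bytes, kind, stream);"
absent = missing_symbols(source, ["cudaMalloc", "cudaMemcpy"])
solution_is_valid = not absent
```

In this example cudaMemcpy is reported as missing because only cudaMemcpyAsync appears, which is why the sketch matches whole identifiers rather than plain substrings.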

Performance Benchmarking

Performance benchmarking is disabled by default because it re-runs each passing solution benchmark_runs times (default 3), which can significantly increase total evaluation time. Enable it with benchmark_performance=True:

uv run inspect eval inspect_evals/compute_eval -T benchmark_performance=True -T benchmark_runs=5

This runs the solution multiple times and records average execution time in the score metadata. If the problem includes a reference_time_ms, speedup relative to the reference is also computed.
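
The averaging and speedup arithmetic can be sketched as follows. The helper names are hypothetical, the timing here wraps an arbitrary Python callable rather than a compiled CUDA binary, and defining speedup as reference time divided by measured time is the conventional choice, assumed rather than confirmed by the implementation.

```python
import time
from statistics import mean
from typing import Callable, Optional


def average_time_ms(run_once: Callable[[], None], runs: int = 3) -> float:
    """Run the callable `runs` times and return the mean wall-clock time in ms."""
    if runs < 1:
        raise ValueError("runs must be at least 1")
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        run_once()
        samples.append((time.perf_counter() - start) * 1000.0)
    return mean(samples)


def speedup(reference_time_ms: Optional[float], measured_ms: float) -> Optional[float]:
    """Return reference / measured, or None when no reference time is available."""
    if reference_time_ms is None or measured_ms <= 0:
        return None
    return reference_time_ms / measured_ms


avg_ms = average_time_ms(lambda: sum(range(1000)), runs=5)
twice_as_fast = speedup(reference_time_ms=10.0, measured_ms=5.0)
```

A value above 1 means the generated solution ran faster than the reference; a problem without reference_time_ms simply yields no speedup figure.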

Evaluation Report

See the ComputeEval page for evaluation methodology. This report was produced using version 1-A of the repository.

The following command was used (with the model varying for each row):

uv run inspect eval inspect_evals/compute_eval --model openai/gpt-5-nano-2025-08-07

Model	Accuracy	Stderr	Time	Tokens
openai/gpt-5-nano-2025-08-07	45.3%	±0.025	45:49	2,819,584
openai/gpt-5.2-2025-12-11	60.3%	±0.024	35:58	766,827

Note: No paper or leaderboard available for comparison.