KernelBench: Can LLMs Write Efficient GPU Kernels?

A benchmark for evaluating the ability of LLMs to write efficient GPU kernels.

Contributed By

@jiito

Code

src/inspect_evals/kernelbench

Paper

https://arxiv.org/html/2502.10517v1

Overview

KernelBench is a benchmark for evaluating whether LLMs can generate efficient GPU kernels.

This implementation follows KernelBench 0.1, an improved version of KernelBench by the original authors. See this blog post for more information: https://scalingintelligence.stanford.edu/blogs/kernelbenchv01/

Usage

Note: This eval has dependencies that conflict with other evals in this repo. It ships its own isolated environment under packages/kernelbench/ rather than sharing the repo’s main one.

Installation

From the repository (recommended)

Sync the eval’s environment from its package directory:

cd packages/kernelbench
uv sync

Then run evaluations from that directory:

uv run inspect eval inspect_evals/kernelbench --model openai/gpt-5-nano

Into an existing environment

If you already have a Python environment, install the eval’s package directly — it brings in inspect-evals alongside the eval-specific dependencies:

pip install /path/to/inspect_evals/packages/kernelbench/

Then run without uv run:

inspect eval inspect_evals/kernelbench --model openai/gpt-5-nano

You can also import the task directly from Python:

from inspect_ai import eval
from inspect_evals.kernelbench import kernelbench
eval(kernelbench)

From a Jupyter notebook (using Inspect Flow)

Inspect Flow manages the isolated environment for you. Install it once:

pip install inspect-flow

Then from a notebook cell, point Flow at this eval’s pyproject.toml so it builds the isolated venv from the same dependency set as uv sync:

from inspect_flow import FlowSpec, FlowTask, FlowDependencies
from inspect_flow.api import run

spec = FlowSpec(
    execution_type="venv",
    log_dir="logs/",
    dependencies=FlowDependencies(
        dependency_file="path/to/inspect_evals/packages/kernelbench/pyproject.toml",
    ),
    tasks=[FlowTask(name="inspect_evals/kernelbench", model="openai/gpt-5-nano")],
)

run(spec)  # blocking; builds isolated venv on first run, cached thereafter

After the run completes, read and analyse logs in the same notebook — the eval’s heavy dependencies are not required for log reading.

Running evaluations

After installing via any of the methods above, you can evaluate models. The examples below assume the repository approach; drop uv run if you are managing dependencies yourself.

uv run inspect eval inspect_evals/kernelbench --model openai/gpt-5-nano

After running evaluations, you can view their logs using the inspect view command:

uv run inspect view

For VS Code, you can also download the Inspect AI extension for viewing logs.

If you don’t want to specify the --model each time you run an evaluation, create a .env configuration file in your working directory that defines the INSPECT_EVAL_MODEL environment variable along with your API key. For example:

INSPECT_EVAL_MODEL=anthropic/claude-opus-4-1-20250805
ANTHROPIC_API_KEY=<anthropic-api-key>

Options

You can control a variety of options from the command line. For example:

uv run inspect eval inspect_evals/kernelbench --limit 10
uv run inspect eval inspect_evals/kernelbench --max-connections 10
uv run inspect eval inspect_evals/kernelbench --temperature 0.5

See uv run inspect eval --help for all available options.

Parameters

`kernelbench`

splits (list[Literal['level_1', 'level_2', 'level_3', 'level_4']]): Dataset splits to use for evaluation (default: ['level_1', 'level_2', 'level_3'])
limit (int | None): Maximum number of samples to evaluate (None for all) (default: None)
option (Literal['zero_shot', 'one_shot', 'few_shot']): Prompt option for kernel generation (default: 'one_shot')
precision (Literal['fp32', 'fp16', 'bf16']): Computation precision (torch.dtype) (default: 'fp32')
include_hardware (bool): Whether to include hardware-specific information in prompts (default: False)
gpu_name (Literal['L40S', 'H100', 'A100', 'A100-80GB', 'L4', 'T4', 'A10G']): Target GPU name for hardware-specific prompts (default: 'L40S')
backend (Literal['cuda', 'triton', 'tilelang', 'cute', 'thunderkittens']): Backend to use (‘cuda’, ‘triton’, ‘tilelang’, or ‘cute’) (default: 'cuda')
check_kernel (bool): Whether to perform static kernel validation before scoring (default: True)
max_regenerations (int): Maximum number of regeneration attempts for invalid kernels (default: 10)
num_correct_trials (int): Number of trials with different random inputs for correctness testing; correctness passes only if all trials pass (default: 5)
num_perf_trials (int): Number of performance trials to run for averaging timing results (default: 100)
measure_performance (bool): Whether to measure kernel performance (default: True)
timing_method (Literal['cuda_event', 'do_bench', 'do_bench_impl', 'host_time']): Method to time kernel execution (see timing.py for details) (default: 'cuda_event')
verbose (bool): Whether to enable verbose output during evaluation (default: False)
build_dir (pathlib.Path | None): Directory for building kernels (default: None)
device (int | None): GPU (cuda) device to run the evaluation on (default: None)
p (float): Threshold for fast_p metric (fraction of samples that are correct and have speedup > p) (default: 1.0)
temperature (float): Sampling temperature for generation, defaults to 0.0 for this task (default: 0.0)
sandbox_type (Literal['docker']): Sandbox provider type. Currently only “docker” is supported; this may be expanded to additional providers in future. (default: 'docker')

Dataset

copied from: https://github.com/ScalingIntelligence/KernelBench

KernelBench is a comprehensive benchmark designed to evaluate LLMs’ ability to generate efficient GPU kernels. Each task provides a PyTorch operator implementation as a reference and asks the model to transpile or generate a corresponding GPU kernel (in CUDA or related DSLs). Tasks are organized into several levels of increasing complexity:

Level 1: Single-kernel operators (e.g., matrix multiplications, convolutions, layer normalization).
Level 2: Simple fusion patterns (e.g., combining operations like conv+bias+ReLU).
Level 3: Full model architectures (e.g., MobileNet, VGG, MiniGPT, Mamba).
Level 4: Hugging Face model-level optimizations.

The problems can be sourced either from a local dataset or the Hugging Face KernelBench dataset.

Scoring

copied from: https://github.com/ScalingIntelligence/KernelBench

KernelBench evaluates two core aspects of model-generated kernels:

Correctness: Each generated kernel is run against randomized test cases and checked for consistency with the corresponding PyTorch reference implementation.
Performance (Speedup): The runtime of the generated kernel is compared to the PyTorch baseline over multiple trials to compute the speedup.

The main benchmark metric is fast_p: the fraction of tasks for which the model output is both correct and achieves a speedup greater than a threshold p. For example:

fast_1 measures the fraction of tasks where the kernel is correct and faster than PyTorch.
fast_2 measures the fraction of tasks where the kernel is correct and at least 2x faster than PyTorch.
fast_0 is simply the overall correctness rate.

Notes

KernelBench uses temperature 0.0 as a default value, however this can be overridden using the available -T temperature parameter

There is a fixed set of GPUs that this can be run on while using the original prompts. You can track the development of fix in this issue.

Due to dependencies (uvloop) in the original repository that we depend on, this eval is not compatible on Windows.

Evaluation Report

Following the pattern from the original paper and METRs write-up, the evaluation tested on Levels 1 - 3. The explanation from METR:

As with the original KernelBench work, we did not evaluate our models on level 4. Level 4 of KernelBench consists of 20 tasks that require models to directly optimize the performance of 20 medium-sized 2019-2022 language models. This is mainly because the KernelBench PyTorch implementation for Level 4 does not provide information about the internals of the models but instead loads the model using the HuggingFace transformer package – fairly evaluating frontier model performance in writing these agents would require providing agent scaffolding that efficiently parses the HuggingFace codebase, just to provide the agents with the information required to perform the task.

Inspect evals version 1.0.0 of Kernelbench used (this implementation is Kernelbench 0.1)

Model	Implementation	Level 1 Fast_1 (n=100)	Level 1 Correct	Level 2 Fast_1 (n=100)	Level 2 Correct	Level 3 Fast_1 (n=50)	Level 3 Correct
gpt-4o	Original (temp=0)	0.04	30.0%	0.07	8.0%	0.04	8.0%
gpt-4o	inspect (temp=0)	0.07	30.0%	0.06	7.0%	0.02	6.0%

Changelog

[5-B] - 2026-05-17

Reworked the scorer to distinguish infrastructure failures from verdicts on the generated kernel. See #1679.
- Outcomes attributable to the model’s kernel are scored: static-check failure, no code block, segfault, timeout, and any verdict encoded in output.json (including compiled=False / correctness=False). A timeout is a documented judgment call: it is ambiguous (hanging kernel vs. slow compile under contention), but upstream KernelBench treats it as a kernel failure.
- Failures of the evaluation machinery raise instead, so the sample is marked ERRORED and stays eligible for --retry-on-error: eval_runner exiting non-zero without a segfault, unparseable output.json (the runner was killed mid-write), and "error" payloads, which mean no verdict was reached (an exception escaped eval_kernel_against_ref, or compilation lock retries were exhausted). Previously these paths were scored 0.0, conflating them with kernel failures.
- eval_runner now tags error payloads with error_kind ("exception" with a traceback, or "retry_exhausted") so the failure cause is visible in the error message and log.

[4-B] - 2026-04-13

Pin HuggingFace dataset to revision ca1464e5.

[3-B] - 2026-03-02

Scorer refactored to execute kernel evaluation inside a Docker sandbox instead of via subprocess on the host. Results are not directly comparable with v2-A due to the changed execution environment.
Added sandbox_type task parameter.

[2-A] - 2026-02-16

Migrate version to new scheme. See #907.