KernelBench: Can LLMs Write Efficient GPU Kernels?

Coding

A benchmark for evaluating the ability of LLMs to write efficient GPU kernels.

Overview

KernelBench is a benchmark for evaluating whether LLMs can generate efficient GPU kernels.

This implementation follows KernelBench 0.1, an improved version of KernelBench by the original authors. See this blog post for more information: https://scalingintelligence.stanford.edu/blogs/kernelbenchv01/

Usage

Installation

There are two ways of using Inspect Evals: from PyPI as a dependency of your own project, or from a standalone checkout of the GitHub repository.

If you are using it from PyPI, install the package and its dependencies via:

pip install inspect-evals[kernelbench]

If you are using Inspect Evals from its repository, start by installing the necessary dependencies with:

uv sync --extra kernelbench

Running evaluations

Now you can start evaluating models. For simplicity’s sake, this section assumes you are using Inspect Evals from the standalone repository. If that is not the case and you are not using uv to manage dependencies in your own project, run the same commands with the uv run prefix dropped.

uv run inspect eval inspect_evals/kernelbench --model openai/gpt-5-nano

You can also import tasks as normal Python objects and run them from Python:

from inspect_ai import eval
from inspect_evals.kernelbench import kernelbench
eval(kernelbench)
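
If you prefer to select the model from Python rather than via the command line or environment variables, eval also accepts a model argument (the model name below is illustrative):

from inspect_ai import eval
from inspect_evals.kernelbench import kernelbench

# Any model supported by Inspect AI can be passed here; this one is illustrative.
eval(kernelbench, model="openai/gpt-5-nano")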

After running evaluations, you can view their logs using the inspect view command:

uv run inspect view

If you use VS Code, you can also install the Inspect AI extension for viewing logs.

If you don’t want to specify the --model each time you run an evaluation, create a .env configuration file in your working directory that defines the INSPECT_EVAL_MODEL environment variable along with your API key. For example:

INSPECT_EVAL_MODEL=anthropic/claude-opus-4-1-20250805
ANTHROPIC_API_KEY=<anthropic-api-key>

Options

You can control a variety of options from the command line. For example:

uv run inspect eval inspect_evals/kernelbench --limit 10
uv run inspect eval inspect_evals/kernelbench --max-connections 10
uv run inspect eval inspect_evals/kernelbench --temperature 0.5

See uv run inspect eval --help for all available options.

Parameters

kernelbench

  • splits (list[Literal['level_1', 'level_2', 'level_3', 'level_4']]): Dataset splits to use for evaluation (default: ['level_1', 'level_2', 'level_3'])
  • limit (int | None): Maximum number of samples to evaluate (None for all) (default: None)
  • option (Literal['zero_shot', 'one_shot', 'few_shot']): Prompt option for kernel generation (default: 'one_shot')
  • precision (Literal['fp32', 'fp16', 'bf16']): Computation precision (torch.dtype) (default: 'fp32')
  • include_hardware (bool): Whether to include hardware-specific information in prompts (default: False)
  • gpu_name (Literal['L40S', 'H100', 'A100', 'A100-80GB', 'L4', 'T4', 'A10G']): Target GPU name for hardware-specific prompts (default: 'L40S')
  • backend (Literal['cuda', 'triton', 'tilelang', 'cute', 'thunderkittens']): Kernel backend to use (default: 'cuda')
  • check_kernel (bool): Whether to perform static kernel validation before scoring (default: True)
  • max_regenerations (int): Maximum number of regeneration attempts for invalid kernels (default: 10)
  • num_correct_trials (int): Number of trials with different random inputs for correctness testing; correctness passes only if all trials pass (default: 5)
  • num_perf_trials (int): Number of performance trials to run for averaging timing results (default: 100)
  • measure_performance (bool): Whether to measure kernel performance (default: True)
  • timing_method (Literal['cuda_event', 'do_bench', 'do_bench_impl', 'host_time']): Method to time kernel execution (see timing.py for details) (default: 'cuda_event')
  • verbose (bool): Whether to enable verbose output during evaluation (default: False)
  • build_dir (pathlib.Path | None): Directory for building kernels (default: None)
  • device (int | None): GPU (cuda) device to run the evaluation on (default: None)
  • p (float): Threshold for fast_p metric (fraction of samples that are correct and have speedup > p) (default: 1.0)
  • temperature (float): Sampling temperature for generation, defaults to 0.0 for this task (default: 0.0)
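
Task parameters can also be set from the command line with the -T flag. For example (the parameter values below are purely illustrative):

uv run inspect eval inspect_evals/kernelbench -T backend=triton -T num_correct_trials=3
uv run inspect eval inspect_evals/kernelbench -T include_hardware=true -T gpu_name=H100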

Dataset

copied from: https://github.com/ScalingIntelligence/KernelBench

KernelBench is a comprehensive benchmark designed to evaluate LLMs’ ability to generate efficient GPU kernels. Each task provides a PyTorch operator implementation as a reference and asks the model to transpile or generate a corresponding GPU kernel (in CUDA or related DSLs). Tasks are organized into several levels of increasing complexity:

  • Level 1: Single-kernel operators (e.g., matrix multiplications, convolutions, layer normalization).
  • Level 2: Simple fusion patterns (e.g., combining operations like conv+bias+ReLU).
  • Level 3: Full model architectures (e.g., MobileNet, VGG, MiniGPT, Mamba).
  • Level 4: Hugging Face model-level optimizations.

The problems can be sourced either from a local dataset or the Hugging Face KernelBench dataset.
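
If you want to browse the problems directly, the Hugging Face dataset can be loaded with the datasets library. This is a minimal sketch: the dataset id ScalingIntelligence/KernelBench and the split name level_1 are assumptions based on the level names above.

from datasets import load_dataset

# Dataset id and split names are assumed; each sample contains a reference
# PyTorch implementation of the operator to be turned into a GPU kernel.
problems = load_dataset("ScalingIntelligence/KernelBench", split="level_1")
print(len(problems))
print(problems[0])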

Scoring

copied from: https://github.com/ScalingIntelligence/KernelBench

KernelBench evaluates two core aspects of model-generated kernels:

  • Correctness: Each generated kernel is run against randomized test cases and checked for consistency with the corresponding PyTorch reference implementation.
  • Performance (Speedup): The runtime of the generated kernel is compared to the PyTorch baseline over multiple trials to compute the speedup.

The main benchmark metric is fast_p: the fraction of tasks for which the model output is both correct and achieves a speedup greater than a threshold p. For example:

  • fast_1 measures the fraction of tasks where the kernel is correct and faster than PyTorch.
  • fast_2 measures the fraction of tasks where the kernel is correct and at least 2x faster than PyTorch.
  • fast_0 is simply the overall correctness rate.
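
The metric can be illustrated with a minimal sketch, assuming per-sample results that record whether the generated kernel was correct and its measured speedup over the PyTorch baseline (speedup = baseline time / kernel time):

# results: list of (correct, speedup) pairs; values below are illustrative
def fast_p(results, p=1.0):
    hits = sum(1 for correct, speedup in results if correct and speedup > p)
    return hits / len(results)

results = [(True, 1.8), (True, 0.9), (False, 0.0)]
print(fast_p(results, p=1.0))  # 1/3: only the first kernel is correct and faster than PyTorch
print(fast_p(results, p=0.0))  # 2/3: the overall correctness rate (fast_0)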

Notes

KernelBench uses a temperature of 0.0 by default; this can be overridden with the -T temperature task parameter.
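
For example (the temperature value below is illustrative):

uv run inspect eval inspect_evals/kernelbench -T temperature=0.7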

There is a fixed set of GPUs that this eval can be run on while using the original prompts. You can track the development of a fix in this issue.

Because the original KernelBench repository we depend on uses uvloop, this eval is not compatible with Windows.

Evaluation Report

Following the pattern from the original paper and METR’s write-up, the evaluation was run on Levels 1-3. METR’s explanation for excluding Level 4:

“As with the original KernelBench work, we did not evaluate our models on level 4. Level 4 of KernelBench consists of 20 tasks that require models to directly optimize the performance of 20 medium-sized 2019-2022 language models. This is mainly because the KernelBench PyTorch implementation for Level 4 does not provide information about the internals of the models but instead loads the model using the HuggingFace transformer package – fairly evaluating frontier model performance in writing these agents would require providing agent scaffolding that efficiently parses the HuggingFace codebase, just to provide the agents with the information required to perform the task.”

The evaluation report was produced with version 1.0.0 of the Inspect Evals KernelBench task (this implementation follows KernelBench 0.1).

Model | Implementation | Level 1 Fast_1 (n=100) | Level 1 Correct | Level 2 Fast_1 (n=100) | Level 2 Correct | Level 3 Fast_1 (n=50) | Level 3 Correct
------|----------------|------------------------|-----------------|------------------------|-----------------|-----------------------|----------------
gpt-4o | Original (temp=0) | 0.04 | 30.0% | 0.07 | 8.0% | 0.04 | 8.0%
gpt-4o | inspect (temp=0) | 0.07 | 30.0% | 0.06 | 7.0% | 0.02 | 6.0%