SWE-bench Verified: Resolving Real-World GitHub Issues

Coding

Agent

Evaluates AI’s ability to resolve genuine software engineering issues sourced from 12 popular Python GitHub repositories, reflecting realistic coding and debugging scenarios.

Contributed By

@max-kaufmann

Code

src/inspect_evals/swe_bench

Paper

https://arxiv.org/abs/2310.06770

Overview

This is an inspect-native implementation of the SWE-bench dataset, a benchmark for testing the ability of AI agents to solve software engineering tasks.

Usage

Installation

There are two ways of using Inspect Evals, from pypi as a dependency of your own project and as a standalone checked out GitHub repository.

If you are using it from pypi, install the package and its dependencies via:

pip install inspect-evals[swe_bench]

If you are using Inspect Evals in its repository, start by installing the necessary dependencies with:

uv sync --extra swe_bench

Running evaluations

Now you can start evaluating models. For simplicity’s sake, this section assumes you are using Inspect Evals from the standalone repo. If that’s not the case and you are not using uv to manage dependencies in your own project, you can use the same commands with uv run dropped.

uv run inspect eval inspect_evals/swe_bench --model openai/gpt-5-nano
uv run inspect eval inspect_evals/swe_bench_verified_mini --model openai/gpt-5-nano

To run multiple tasks simulteneously use inspect eval-set:

uv run inspect eval-set inspect_evals/swe_bench inspect_evals/swe_bench_verified_mini

You can also import tasks as normal Python objects and run them from python:

from inspect_ai import eval, eval_set
from inspect_evals.swe_bench import swe_bench, swe_bench_verified_mini
eval(swe_bench)
eval_set([swe_bench, swe_bench_verified_mini], log_dir='logs-run-42')

After running evaluations, you can view their logs using the inspect view command:

uv run inspect view

For VS Code, you can also download Inspect AI extension for viewing logs.

If you don’t want to specify the --model each time you run an evaluation, create a .env configuration file in your working directory that defines the INSPECT_EVAL_MODEL environment variable along with your API key. For example:

INSPECT_EVAL_MODEL=anthropic/claude-opus-4-1-20250805
ANTHROPIC_API_KEY=<anthropic-api-key>

[!NOTE] When first running the swe_bench task, it will build the necessary docker images. This can be resource intensive - for the full swe_bench Verified split, up to several hours, and ~280GB of storage.

SWE-bench will take a while to run, and uses a lot of tokens. If things are too slow, you should increase the level of parallelism - see https://inspect.ai-safety-institute.org.uk/parallelism.html. Note that running too many docker containers on your machine can also cause issues, most notably with a ‘ALL PREDEFINED ADDRESS POOLS HAVE BEEN FULLY SUBNETTED’ error - we don’t recommend running more than 32 containers at any one time.

Options

You can control a variety of options from the command line. For example:

uv run inspect eval inspect_evals/swe_bench --limit 10
uv run inspect eval inspect_evals/swe_bench_verified_mini --max-connections 10
uv run inspect eval inspect_evals/swe_bench --temperature 0.5

See uv run inspect eval --help for all available options.

Parameters

`swe_bench`

dataset (str): (default: 'princeton-nlp/SWE-bench_Verified')
split (str): (default: 'test')
solver (Solver | None): (default: None)
input_prompt (str): (default: 'Please solve the following coding issue:\n\n{issue_text}')
instance_ids (list[str] | None): (default: None)
scorer (Scorer | list[Scorer] | None): (default: None)
epochs (int): (default: 1)
sandbox_type (Literal[‘docker’, ‘k8s’]): (default: 'docker')
build_docker_images (bool): (default: True)
pull_remote_images_if_available (bool): (default: True)
docker_image_from_id (Callable[[str], str]): (default: get_remote_docker_image_from_id)
allow_internet (bool): (default: False)
sandbox_config_template_file (str | None): (default: None)

`swe_bench_verified_mini`

split (str): Dataset split to use. (default: 'test')
solver (Solver | None): Solver to use for the evaluation. (default: None)
input_prompt (str): Prompt template for the task input. (default: 'Please solve the following coding issue:\n\n{issue_text}')
instance_ids (list[str] | None): List of instance IDs to filter the dataset to. (default: None)
scorer (Scorer | list[Scorer] | None): Scorer to use for the evaluation. (default: None)
epochs (int): Number of epochs to run the evaluation for. (default: 1)
sandbox_type (Literal[‘docker’, ‘k8s’]): Type of sandbox to use (docker or k8s). (default: 'docker')
build_docker_images (bool): Whether to build Docker images for the samples. (default: True)
pull_remote_images_if_available (bool): Whether to pull the images from DockerHub. (default: True)
docker_image_from_id (Callable[[str], str]): Function that takes an instance ID and returns the corresponding Docker image name. (default: get_remote_docker_image_from_id)
allow_internet (bool): Whether to allow internet access in the sandbox. (default: False)
sandbox_config_template_file (str | None): Path to a custom sandbox config file template (e.g., compose.yaml for Docker or values.yaml for k8s). The file should contain {{IMAGE_NAME}} placeholder which will be replaced with the actual Docker image name for each instance. (default: None)

Dataset

You will be provided with a partial code base and an issue statement explaining a problem to resolve.

Issue: napoleon_use_param should also affect “other parameters” section

Problem:

Currently, napoleon always renders the Other parameters section as if napoleon_use_param was False, see source ``` def parse_other_parameters_section(self, section): # type: (unicode) -> List[unicode] return self._format_fields((‘Other Parameters’), self._consume_fields())
def _parse_parameters_section(self, section):
    # type: (unicode) -> List[unicode]
    fields = self._consume_fields()
    if self._config.napoleon_use_param:
        return self._format_docutils_params(fields)
    else:
        return self._format_fields(_('Parameters'), fields)
```

Other Notes

Running the benchmark

``swe_bench.pyfile containsswe_bench``` function, which creates an instance of a SWE-bench Task:

from inspect_ai import eval
from inspect_ai.solver import system_message, basic_agent
from inspect_ai.tool import bash, python
from inspect_evals.swe_bench import swe_bench

# Create an agent that only uses bash, and an agent that only uses python
agent = basic_agent(tools=[bash(timeout=30)])
task = swe_bench(
    dataset="princeton-nlp/SWE-bench_Verified",
    solver=agent,
    instance_ids=["sympy__sympy-22914", "sympy__sympy-13878"],
)
# Compare how these two agents perform.
eval(task, model="openai/gpt-4o")

Comparing to official swe-bench baselines

Submissions to the official SWE-bench leaderboard often come with a set of trajectories and per-swebench-instance results. To download these baselines, you need to follow the instructions at https://github.com/swe-bench/experiments?tab=readme-ov-file#-viewing-logs-trajectories to set up the AWS CLI, and download the baselines:

./download_baselines.sh

This means that you can compare the performance of your agent not just on the full dataset, but also on subsets of that dataset. This is often useful when trying to run smaller, cheaper experiments. To make this comparison easy, we provide a scorer that will calculate the average performance of an existing baseline on that set of trajectories:

from inspect_evals.swe_bench import swe_bench_baseline_scorer
from pathlib import Path
from platformdirs import user_cache_dir

agent = basic_agent(tools=[bash(timeout=30)])
baseline_dir = Path(user_cache_dir("inspect_swebench_test")) / "swebench_baselines"
scorers = [
    swe_bench_baseline_scorer(
        str(baseline_dir / "20240620_sweagent_claude3.5sonnet"),
        name="SWE-agent_3.5_sonnet",
    ),
    swe_bench_scorer(),
]
task = swe_bench(
    dataset="princeton-nlp/SWE-bench_Verified",
    solver=agent,
    instance_ids=["sympy__sympy-22914", "sympy__sympy-13878"],
    scorer=scorers,
)

eval(task, model="openai/gpt-4o")

This will lead to both numbers being reported in the final output, allowing you to compare baselines:

Parity with the original implementation

We keep track of any known issues with our scoring. We recommend that before submitting to the leaderboard, or comparing to public results in a paper, you use the save_outputs_to_swebench_format function to score with the original implementation:

from inspect_evals.swe_bench import save_outputs_to_swebench_format
logs = eval(task = swe_bench,solver=agent)
save_outputs_to_swebench_format(logs, "./swebench_formatted_outputs/")

You can then use the CLI to score these with the original implementation (as described in their README):

python -m swebench.harness.run_evaluation \
    --predictions_path path-to-outputs \
    --max_workers 4 \
    --run_id check-outputs