SWE-Lancer: Can Frontier LLMs Earn $1 Million from Real-World Freelance Software Engineering?

Coding
Agent

A benchmark of freelance software engineering tasks from Upwork, valued at $1 million USD total in realworld payouts.

Overview

This is an inspect-native implementation of SWE_Lancer, a benchmark for evaluating language model agents on freelancer tasks posted on Upwork for the Expensify open source repository.

Note: This evaluation is currently affected by the following issue: https://github.com/UKGovernmentBEIS/inspect_ai/pull/2778

This can lead to memory usage build up due to incomplete shutdown of jest and node processes when running the ic_swe variant.

Usage

Installation

There are two ways of using Inspect Evals, from pypi as a dependency of your own project and as a standalone checked out GitHub repository.

If you are using it from pypi, install the package and its dependencies via:

pip install inspect-evals

If you are using Inspect Evals in its repository, start by installing the necessary dependencies with:

uv sync

Running evaluations

Now you can start evaluating models. For simplicity’s sake, this section assumes you are using Inspect Evals from the standalone repo. If that’s not the case and you are not using uv to manage dependencies in your own project, you can use the same commands with uv run dropped.

uv run inspect eval inspect_evals/swe_lancer --model openai/gpt-5-nano

You can also import tasks as normal Python objects and run them from python:

from inspect_ai import eval
from inspect_evals.swe_lancer import swe_lancer
eval(swe_lancer)

After running evaluations, you can view their logs using the inspect view command:

uv run inspect view

For VS Code, you can also download Inspect AI extension for viewing logs.

If you don’t want to specify the --model each time you run an evaluation, create a .env configuration file in your working directory that defines the INSPECT_EVAL_MODEL environment variable along with your API key. For example:

INSPECT_EVAL_MODEL=anthropic/claude-opus-4-1-20250805
ANTHROPIC_API_KEY=<anthropic-api-key>

Parameters

swe_lancer

  • task_variant (Literal[‘ic_swe’, ‘swe_manager’, ‘all’]): The variant of the dataset to use. Can be “swe_manager”, “ic_swe”, or “all”. (default: 'all')
  • solver (Solver | None): The solver to use for generating completions. (default: None)
  • scorer (Scorer | None): The scorer to use for grading completions. (default: None)
  • epochs (int): The number of epochs to run the task for. (default: 1)
  • use_user_tool (bool): Whether to use the user tool for interaction. (default: True)
  • use_per_task_images (bool): Whether to use per-task images. (default: False)
  • debug (bool): If True, enables debugging features: Preserves video recordings (copied before cleanup), Enables Playwright tracing, Automatically exports traces and videos via debug_artifact_exporter scorer, Artifacts saved to debug_export_path/{issue_id}/{sample_uuid}/. (default: False)
  • debug_export_path (str | pathlib.Path): Path to export debug artifacts (traces, videos). Only used when debug=True. (default: Path('{INSPECT_EVALS_CACHE_PATH}/swelancer/artifacts'))
  • tool_timeout (int): Timeout in seconds for bash and python tools. (default: 180)
  • setup_timeout (int): Timeout in seconds for setup. (default: 600)

Detailed usage

Example usage (ic_swe):

uv run inspect eval inspect_evals/swe_lancer --model openai/gpt-4o --limit 1 -T task_variant="ic_swe" -T use_user_tool=True --epochs 1

It’s also recommended to identify the –max-samples total that your setup can run at once. If you are concerned about specific samples causing errors that break the full run, use the –no-fail-on-error flag.

uv run inspect eval inspect_evals/swe_lancer --model openai/gpt-4o --limit 1 -T task_variant="ic_swe" -T use_user_tool=True --epochs 1 --max-samples 5 --no-fail-on-error

Example Usage (swe_manager):

uv run inspect eval inspect_evals/swe_lancer --model openai/gpt-4o --limit 1 -T task_variant="swe_manager" --epochs 1 --no-fail-on-error

To specify individual issues, use the inspect --sample-id option, e.g.:

uv run inspect eval inspect_evals/swe_lancer --sample-id 16912_4

Environment variables

The evaluation can optionally read environment variables from a swe_lancer.env file in the src/inspect_evals/swe_lancer/ directory. If not provided, the following default values are used:

USE_WEB_PROXY=false
EXPENSIFY_URL=https://www.expensify.com
NEW_EXPENSIFY_URL=https://www.new.expensify.com

Additional environment variables are automatically set by the evaluation (PUSHER_APP_KEY, PUSHER_APP_SECRET, PUSHER_APP_ID, ISSUE_ID, LC_ALL, EVAL_VARIANT).

Docker Images

This evaluation provides official pre-built Docker images from Docker Hub. Each task has its own pre-built image that includes the issue ID, and tagged with “releasev1” (but not “latest”) e.g., swelancer/swelancer_x86_16921_1:releasev1.

The images are automatically pulled when needed. No local Docker image building is required.

To use custom images (e.g., for development), set the SWE_LANCER_IMAGE_REGISTRY environment variable:

export SWE_LANCER_IMAGE_REGISTRY=myregistry.io/swe-lancer

Timing Reference

The following table shows approximate times for various steps on ARM64 (Apple Silicon). Times will vary based on hardware and network conditions.

Step Per-Task Image Monolith Image
Image pull ~40s per task ~40s once
Container initialization ~40s ~200s
Webpack compilation Pre-built ~60-90s
npm test (full suite) ~9 min ~9 min
user_tool execution ~3-4 min ~3-4 min

Notes

  • Per-task images have webpack pre-compiled at build time
  • Monolith images run RUNTIME_SETUP at container start (git checkout, npm install, webpack build)
  • user_tool includes webpack recompilation, server startup, and Playwright tests
  • First npm test run may be slower due to Jest cache warming
  • The user_tool does the same sequence of operations as the scorer, but with different handling of output and logging

Options

You can control a variety of options from the command line. For example:

uv run inspect eval inspect_evals/swe_lancer --limit 10
uv run inspect eval inspect_evals/swe_lancer --max-connections 10
uv run inspect eval inspect_evals/swe_lancer --temperature 0.5

See uv run inspect eval --help for all available options.

Dataset

The SWE-Lancer dataset contains 463 freelancer software engineering tasks sourced from Upwork postings for the Expensify open source repository. The dataset is derived from the SWE-Lancer benchmark by OpenAI.

Task Variants

The dataset includes two task variants:

Variant | Count | Description |
ic_swe | 198 | Individual Contributor tasks where the agent must implement a fix or feature directly |
swe_manager | 265 | Manager tasks where the agent must select the correct proposal from a set of candidates |

Use the task_variant parameter to filter by variant: "ic_swe", "swe_manager", or "all" (default).

Note about samples

Issues have been found with three of the IC_swe samples: “29854_602”, “49523_849”, “32236_652” These samples have been excluded from the dataset, reducing the number of IC_swe samples from 198 to 195, and the total number of samples to 460.

Data Fields

Each sample includes:

  • question_id: Unique task identifier (e.g., 16912_4)
  • title: Task title from the original posting
  • description: Full task description with HTML/Markdown formatting
  • price: Payment amount for the task (used in scoring)
  • variant: Task type (ic_swe or swe_manager)
  • cwd: Working directory within the Expensify repository
  • proposals: Candidate proposals (for swe_manager tasks)

Scoring

Both SWE_Manager and IC_SWE tasks are scored using the following metrics:

  • Accuracy: Percentage of tasks completed correctly
  • Money Total: Total earnings from successfully completed tasks (sum of task prices)
  • Number of correct samples: Count of successfully completed tasks
  • Number of incorrect samples: Count of failed tasks

The evaluation uses pass_at_k with k equal to the number of epochs (default: 1).

Debugging

When debugging SWE-Lancer tasks, enable debug mode to access Playwright traces, video recordings, and VNC access:

uv run inspect eval inspect_evals/swe_lancer --model openai/gpt-4o --sample-id 16912_4 -T debug=True

Debug Features

Set debug=True on the task to enable debugging features:

  • VNC Access: Exposes VNC ports to watch Playwright execute in the container via browser. Access at http://localhost:{port}/vnc.html (port is logged at startup).
  • Playwright Traces: Enables tracing for step-by-step debugging. View with npx playwright show-trace <path>.
  • Video Recording: Preserves video recordings of test execution.
  • Artifact Export: Automatically exports traces and videos to {debug_export_path}/{issue_id}/{sample_uuid}/.

You can override the export path with debug_export_path=....

Viewing Setup Logs

Container setup logs (git checkout, npm install, webpack build) can be viewed by checking the Docker container logs:

docker logs <container_id>

Evaluation Report

Note: The dataset has been updated since the original paper, and so the total amount of samples has shifted.

IC SWE Tasks (Diamond split, 198 samples)

Model | Samples | Accuracy | Notes |
GPT-4o | 198 | 8.0% | Original paper results |
GPT-4o | 19 | 5.3% (1/19) | Partial run (inspect-evals) |

SWE Manager Tasks (Diamond split, 265 samples)

Model | Samples | Accuracy | Notes |
GPT-4o | 265 | 37.0% | Original Paper results |
GPT-4o | 100 | 46.0% | Partial run (Inspect Evals) |

Original Paper Reference: Results from SWE-Lancer: Can Frontier LLMs Earn $1 Million from Real-World Freelance Software Engineering? (OpenAI, 2025).

Changelog

[1-A] - 2026-02-16

  • Migrate version to new scheme. See #907.
  • Initial implementation