SWE-Lancer: Can Frontier LLMs Earn $1 Million from Real-World Freelance Software Engineering?
A benchmark of freelance software engineering tasks from Upwork, valued at $1 million USD total in realworld payouts.
Overview
This is an inspect-native implementation of SWE_Lancer, a benchmark for evaluating language model agents on freelancer tasks posted on Upwork for the Expensify open source repository.
Note: This evaluation is currently affected by the following issue: https://github.com/UKGovernmentBEIS/inspect_ai/pull/2778
This can lead to memory usage build up due to incomplete shutdown of jest and node processes when running the ic_swe variant.
Usage
Installation
There are two ways of using Inspect Evals, from pypi as a dependency of your own project and as a standalone checked out GitHub repository.
If you are using it from pypi, install the package and its dependencies via:
pip install inspect-evalsIf you are using Inspect Evals in its repository, start by installing the necessary dependencies with:
uv syncRunning evaluations
Now you can start evaluating models. For simplicity’s sake, this section assumes you are using Inspect Evals from the standalone repo. If that’s not the case and you are not using uv to manage dependencies in your own project, you can use the same commands with uv run dropped.
uv run inspect eval inspect_evals/swe_lancer --model openai/gpt-5-nanoYou can also import tasks as normal Python objects and run them from python:
from inspect_ai import eval
from inspect_evals.swe_lancer import swe_lancer
eval(swe_lancer)After running evaluations, you can view their logs using the inspect view command:
uv run inspect viewFor VS Code, you can also download Inspect AI extension for viewing logs.
If you don’t want to specify the --model each time you run an evaluation, create a .env configuration file in your working directory that defines the INSPECT_EVAL_MODEL environment variable along with your API key. For example:
INSPECT_EVAL_MODEL=anthropic/claude-opus-4-1-20250805
ANTHROPIC_API_KEY=<anthropic-api-key>Parameters
swe_lancer
task_variant(Literal[‘ic_swe’, ‘swe_manager’, ‘all’]): The variant of the dataset to use. Can be “swe_manager”, “ic_swe”, or “all”. (default:'all')solver(Solver | None): The solver to use for generating completions. (default:None)scorer(Scorer | None): The scorer to use for grading completions. (default:None)epochs(int): The number of epochs to run the task for. (default:1)use_user_tool(bool): Whether to use the user tool for interaction. (default:True)use_per_task_images(bool): Whether to use per-task images. (default:False)debug(bool): If True, enables debugging features: Preserves video recordings (copied before cleanup), Enables Playwright tracing, Automatically exports traces and videos via debug_artifact_exporter scorer, Artifacts saved to debug_export_path/{issue_id}/{sample_uuid}/. (default:False)debug_export_path(str | pathlib.Path): Path to export debug artifacts (traces, videos). Only used when debug=True. (default:Path('{INSPECT_EVALS_CACHE_PATH}/swelancer/artifacts'))tool_timeout(int): Timeout in seconds for bash and python tools. (default:180)setup_timeout(int): Timeout in seconds for setup. (default:600)
Detailed usage
Example usage (ic_swe):
uv run inspect eval inspect_evals/swe_lancer --model openai/gpt-4o --limit 1 -T task_variant="ic_swe" -T use_user_tool=True --epochs 1It’s also recommended to identify the –max-samples total that your setup can run at once. If you are concerned about specific samples causing errors that break the full run, use the –no-fail-on-error flag.
uv run inspect eval inspect_evals/swe_lancer --model openai/gpt-4o --limit 1 -T task_variant="ic_swe" -T use_user_tool=True --epochs 1 --max-samples 5 --no-fail-on-errorExample Usage (swe_manager):
uv run inspect eval inspect_evals/swe_lancer --model openai/gpt-4o --limit 1 -T task_variant="swe_manager" --epochs 1 --no-fail-on-errorTo specify individual issues, use the inspect --sample-id option, e.g.:
uv run inspect eval inspect_evals/swe_lancer --sample-id 16912_4Environment variables
The evaluation can optionally read environment variables from a swe_lancer.env file in the src/inspect_evals/swe_lancer/ directory. If not provided, the following default values are used:
USE_WEB_PROXY=false
EXPENSIFY_URL=https://www.expensify.com
NEW_EXPENSIFY_URL=https://www.new.expensify.com
Additional environment variables are automatically set by the evaluation (PUSHER_APP_KEY, PUSHER_APP_SECRET, PUSHER_APP_ID, ISSUE_ID, LC_ALL, EVAL_VARIANT).
Docker Images
This evaluation provides official pre-built Docker images from Docker Hub. Each task has its own pre-built image that includes the issue ID, and tagged with “releasev1” (but not “latest”) e.g., swelancer/swelancer_x86_16921_1:releasev1.
The images are automatically pulled when needed. No local Docker image building is required.
To use custom images (e.g., for development), set the SWE_LANCER_IMAGE_REGISTRY environment variable:
export SWE_LANCER_IMAGE_REGISTRY=myregistry.io/swe-lancerTiming Reference
The following table shows approximate times for various steps on ARM64 (Apple Silicon). Times will vary based on hardware and network conditions.
| Step | Per-Task Image | Monolith Image |
|---|---|---|
| Image pull | ~40s per task | ~40s once |
| Container initialization | ~40s | ~200s |
| Webpack compilation | Pre-built | ~60-90s |
npm test (full suite) |
~9 min | ~9 min |
user_tool execution |
~3-4 min | ~3-4 min |
Notes
- Per-task images have webpack pre-compiled at build time
- Monolith images run
RUNTIME_SETUPat container start (git checkout, npm install, webpack build) user_toolincludes webpack recompilation, server startup, and Playwright tests- First
npm testrun may be slower due to Jest cache warming - The user_tool does the same sequence of operations as the scorer, but with different handling of output and logging
Options
You can control a variety of options from the command line. For example:
uv run inspect eval inspect_evals/swe_lancer --limit 10
uv run inspect eval inspect_evals/swe_lancer --max-connections 10
uv run inspect eval inspect_evals/swe_lancer --temperature 0.5See uv run inspect eval --help for all available options.
Dataset
The SWE-Lancer dataset contains 463 freelancer software engineering tasks sourced from Upwork postings for the Expensify open source repository. The dataset is derived from the SWE-Lancer benchmark by OpenAI.
Task Variants
The dataset includes two task variants:
ic_swe | 198 | Individual Contributor tasks where the agent must implement a fix or feature directly |swe_manager | 265 | Manager tasks where the agent must select the correct proposal from a set of candidates |Use the task_variant parameter to filter by variant: "ic_swe", "swe_manager", or "all" (default).
Note about samples
Issues have been found with three of the IC_swe samples: “29854_602”, “49523_849”, “32236_652” These samples have been excluded from the dataset, reducing the number of IC_swe samples from 198 to 195, and the total number of samples to 460.
Data Fields
Each sample includes:
- question_id: Unique task identifier (e.g.,
16912_4) - title: Task title from the original posting
- description: Full task description with HTML/Markdown formatting
- price: Payment amount for the task (used in scoring)
- variant: Task type (
ic_sweorswe_manager) - cwd: Working directory within the Expensify repository
- proposals: Candidate proposals (for
swe_managertasks)
Scoring
Both SWE_Manager and IC_SWE tasks are scored using the following metrics:
- Accuracy: Percentage of tasks completed correctly
- Money Total: Total earnings from successfully completed tasks (sum of task prices)
- Number of correct samples: Count of successfully completed tasks
- Number of incorrect samples: Count of failed tasks
The evaluation uses pass_at_k with k equal to the number of epochs (default: 1).
Debugging
When debugging SWE-Lancer tasks, enable debug mode to access Playwright traces, video recordings, and VNC access:
uv run inspect eval inspect_evals/swe_lancer --model openai/gpt-4o --sample-id 16912_4 -T debug=TrueDebug Features
Set debug=True on the task to enable debugging features:
- VNC Access: Exposes VNC ports to watch Playwright execute in the container via browser. Access at
http://localhost:{port}/vnc.html(port is logged at startup). - Playwright Traces: Enables tracing for step-by-step debugging. View with
npx playwright show-trace <path>. - Video Recording: Preserves video recordings of test execution.
- Artifact Export: Automatically exports traces and videos to
{debug_export_path}/{issue_id}/{sample_uuid}/.
You can override the export path with debug_export_path=....
Viewing Setup Logs
Container setup logs (git checkout, npm install, webpack build) can be viewed by checking the Docker container logs:
docker logs <container_id>Evaluation Report
Note: The dataset has been updated since the original paper, and so the total amount of samples has shifted.
IC SWE Tasks (Diamond split, 198 samples)
GPT-4o | 198 | 8.0% | Original paper results |
GPT-4o | 19 | 5.3% (1/19) | Partial run (inspect-evals) |
SWE Manager Tasks (Diamond split, 265 samples)
GPT-4o | 265 | 37.0% | Original Paper results |
GPT-4o | 100 | 46.0% | Partial run (Inspect Evals) |
Original Paper Reference: Results from SWE-Lancer: Can Frontier LLMs Earn $1 Million from Real-World Freelance Software Engineering? (OpenAI, 2025).
Changelog
[1-A] - 2026-02-16
- Migrate version to new scheme. See #907.
- Initial implementation