GDPval
GDPval measures model performance on economically valuable, real-world tasks across 44 occupations.
Overview
GDPval measures model performance on economically valuable, real-world tasks. The first version of this evaluation spans 44 occupations selected from the top 9 industries contributing to U.S. GDP (Gross Domestic Product). The full GDPval set includes 1,320 specialized tasks (220 in the gold open-sourced set), each meticulously crafted and vetted by professionals from these fields with over 14 years of experience on average. Every task is based on real work products, such as a legal brief, an engineering blueprint, a customer support conversation, or a nursing care plan.
This implementation includes only the 220 tasks in the gold open-sourced set.
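If you want a quick look at that gold set, the minimal sketch below loads it with the datasets library. The repo ID openai/gdpval and the default configuration are assumptions here, not part of this implementation; adjust them if the dataset is published elsewhere.
from datasets import load_dataset

# repo ID is an assumption; the printed output shows the available splits and fields
ds = load_dataset("openai/gdpval")
print(ds)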
Scoring this eval involves saving the results as a dataset, uploading it to huggingface, and submitting it to the OpenAI auto grader. We provide a helper function to automatically upload to huggingface, but you will have to submit it to the OpenAI auto grader through their website. See Judging for more details.
An example of the result from running the Inspect Eval implementation can be found in this uploaded huggingface repo.
Usage
Installation
There are two ways of using Inspect Evals: from PyPI as a dependency of your own project, or as a standalone checked-out GitHub repository.
If you are using it from PyPI, install the package and its dependencies via:
pip install inspect-evals[gdpval]

If you are using Inspect Evals in its repository, start by installing the necessary dependencies with:
uv sync --extra gdpval

Running evaluations
Now you can start evaluating models. For simplicity’s sake, this section assumes you are using Inspect Evals from the standalone repo. If that’s not the case, and you are not using uv to manage dependencies in your own project, you can run the same commands without the uv run prefix.
uv run inspect eval inspect_evals/gdpval --model openai/gpt-5-nano

You can also import tasks as normal Python objects and run them from Python:
from inspect_ai import eval
from inspect_evals.gdpval import gdpval
eval(gdpval)

After running evaluations, you can view their logs using the inspect view command:
uv run inspect view

For VS Code, you can also install the Inspect AI extension for viewing logs.
If you don’t want to specify the --model each time you run an evaluation, create a .env configuration file in your working directory that defines the INSPECT_EVAL_MODEL environment variable along with your API key. For example:
INSPECT_EVAL_MODEL=anthropic/claude-opus-4-1-20250805
ANTHROPIC_API_KEY=<anthropic-api-key>

An output folder with the deliverable files will be created. See Output Folder Naming below.
Options
You can specify a grader (scorer), a solver, shuffling, and whether to automatically upload the output folder to huggingface. For example:
uv run inspect eval inspect_evals/gdpval -T upload_to_hf=True
uv run inspect eval inspect_evals/gdpval -T shuffle=True

Parameters
gdpval
- solver (Solver): The solver to use. (default: solve)
- scorer (Scorer): Scorer for the deliverable text/output. We used exact(), but this can be anything, as long as it generates the Score object so we can capture the final text output to be included in the table as deliverable_text. (default: score)
- shuffle (bool): Whether to shuffle the dataset. Differs from --sample-shuffle by shuffling after limiting. (default: False)
- upload_to_hf (bool): If true, upload the resulting folder to HuggingFace. (default: False)
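These parameters can also be passed as keyword arguments when calling the task from Python. A minimal sketch (the model name is just an example):
from inspect_ai import eval
from inspect_evals.gdpval import gdpval

# pass the task parameters listed above as keyword arguments
eval(gdpval(shuffle=True, upload_to_hf=False), model="openai/gpt-5-nano")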
Replication Notes
We ran our implementation for GPT-5 with thinking set to low, submitted it to OpenAI, and obtained the following results.
Dataset: https://huggingface.co/datasets/jeqcho/2025-10-16T21-42-02plus00-00_gdpval
Average Score: 47.3% ± 6.2%
Tasks graded: 219 / 220 (see report for errors)
Grading methodology
Our automated grader runs pairwise comparisons against a golden set of human solutions, using GPT-5 with a variety of tools to analyze the deliverables. We grade each submission three times to account for variance in the grader, and report a 95% CI calculated by bootstrapping over the grades.
The one task that errored failed because the "Responses API did not complete within 3600.0 seconds". You can find the sample submission and full results in this hf repo.
Note that the official GDPval leaderboard by OpenAI reports a lower score of 31.9% (wins only) and 34.6% (wins and ties).
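For intuition, here is a minimal sketch of how a bootstrapped 95% CI over grades could be computed. This is illustrative only, not the official grader’s code; the 1.0/0.5/0.0 encoding of win/tie/loss is an assumption.
import random

def bootstrap_ci(grades: list[float], n_boot: int = 10_000, seed: int = 0) -> tuple[float, float, float]:
    # grades: one value per (task, grading repeat), e.g. 1.0 win, 0.5 tie, 0.0 loss
    rng = random.Random(seed)
    mean = sum(grades) / len(grades)
    boot_means = sorted(
        sum(rng.choices(grades, k=len(grades))) / len(grades) for _ in range(n_boot)
    )
    return mean, boot_means[int(0.025 * n_boot)], boot_means[int(0.975 * n_boot)]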
Judging
Note that OpenAI provides their automated grader model for free. We thus provide an option -T upload_to_hf=True for users to automatically upload the results to huggingface. Users should log in to huggingface beforehand:
hf auth login
uv run inspect eval inspect_evals/gdpval -T upload_to_hf=True

If the user doesn’t specify upload_to_hf, it is False by default. Later on, the user can decide to upload with:
hf upload {username}/{folder_path.name} {folder_path} --repo-type dataset

This command, with the username and folder paths prefilled, is generated and shown at the end of each run. It is important to use this generated path as the repo ID, as the path is used when generating data for the submission.
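If you prefer to upload from Python, a rough equivalent using standard huggingface_hub calls is sketched below; the folder path and username are placeholders, and this snippet is not part of the eval itself.
from pathlib import Path
from huggingface_hub import HfApi

folder_path = Path("gdpval_hf_upload/2025-10-16T21-42-02plus00-00_gdpval")  # placeholder path
repo_id = f"your-username/{folder_path.name}"  # must match the generated folder name

api = HfApi()
api.create_repo(repo_id, repo_type="dataset", exist_ok=True)
api.upload_folder(folder_path=str(folder_path), repo_id=repo_id, repo_type="dataset")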
After uploading, the user should submit the OpenAI form to request grading.
Output Folder Naming
Similar to Inspect logs, the evaluation creates output folders with timestamps in the format: gdpval_hf_upload/YYYY-MM-DDTHH-MM-SS[plus|minus]HH-MM_gdpval/
Note: The + sign in UTC timezone offsets (e.g., +00:00) is replaced with plus to avoid issues with special characters in folder names (particularly when uploading to huggingface). Negative offsets use - as normal.
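As an illustration of this convention (not the eval’s own code), the following sketch reproduces the folder name format:
from datetime import datetime, timezone

now = datetime.now(timezone.utc)
stamp = now.strftime("%Y-%m-%dT%H-%M-%S")
offset = now.strftime("%z")                 # e.g. "+0000" for UTC
offset = f"{offset[:3]}-{offset[3:]}"       # -> "+00-00"
offset = offset.replace("+", "plus")        # '+' is unsafe in folder/repo names
folder = f"gdpval_hf_upload/{stamp}{offset}_gdpval"
print(folder)  # e.g. gdpval_hf_upload/2025-10-16T21-42-02plus00-00_gdpval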
Custom Judges
If you prefer to use your own grader instead of submitting to OpenAI, we also provide the option to specify a custom scorer; see the example below.
# make a new file, say src/inspect_evals/gdpval/eval_gdpval.py
from inspect_ai import eval
from inspect_ai.scorer import Score, Scorer, Target, accuracy, scorer
from inspect_ai.solver import TaskState

from inspect_evals.gdpval.gdpval import gdpval

@scorer(metrics=[accuracy()])
def a_nice_scorer() -> Scorer:
    async def score(state: TaskState, target: Target) -> Score:
        # grade the deliverable however you like; include the final text output
        # in the Score so it can be captured as deliverable_text
        return Score(value=1.0, answer=state.output.completion)

    return score

eval(gdpval(scorer=a_nice_scorer()))
# then run `uv run src/inspect_evals/gdpval/eval_gdpval.py`

Note that the preliminary metric in the paper is the win rate against human deliverables, which is what is implemented in the OpenAI grading server.
Notes
In the GDPval paper:
For the OpenAI models, we enabled the web search tool and the code interpreter tool, with background sampling.
Inspect doesn’t support the code interpreter API. Instead, we provide the model with the default bash and python tools, and the task is run within a Docker sandbox (see the sketch after these notes).
The packages provided are listed in the appendix section A.6.4 of the paper. Even though the title is “AUTOMATED GRADER PACKAGES”, the paper also says “These were also made available to the agent during sampling of OpenAI models.”
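As a rough illustration of that setup (not the eval’s exact solver), here is how a basic Inspect agent with bash and python tools might be configured; the timeout values are arbitrary:
from inspect_ai.solver import basic_agent
from inspect_ai.tool import bash, python

# bash/python execute inside the task's Docker sandbox
solver = basic_agent(tools=[bash(timeout=180), python(timeout=180)])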
Docker Build Times
Docker build can take up to 10 minutes due to the number of packages installed. This build time could be eliminated in the future if Inspect Evals (or OpenAI) share pre-built images. Initialising a sandbox during an eval times out after 120 seconds by default, so the docker image should be built before running the eval.
To build the docker image before the eval run, use the provided script:
./src/inspect_evals/gdpval/build_docker_image.sh

Or manually with:
docker build -t gdpval -f src/inspect_evals/gdpval/Dockerfile src/inspect_evals/gdpval

Note that once the image is built, running the eval doesn’t incur the long build time anymore.
Docker Build Known Issues
Packages Without Pre-built Wheels
The following packages from the original requirements have been commented out in docker-requirements.txt due to lack of pre-built wheels for the linux/amd64 platform with Python 3.11:
- databricks-sql-connector==0.9.1
  - Issue: Requires pyarrow>=8.0.0,<8.1.0, but pyarrow 8.0.0 has no pre-built wheel for linux/amd64
  - Attempted workarounds: Pre-installing newer pyarrow, using pip constraints, --only-binary flags
    - All attempts failed as pip’s build dependency resolver still tries to build pyarrow 8.0.0 from source
    - Building from source requires Apache Arrow C++ libraries not available in Debian repos
  - Workaround: Could be re-enabled by:
    - Upgrading to a newer databricks-sql-connector version that supports newer pyarrow versions
    - Installing Apache Arrow C++ dev libraries from custom repos
    - Using a different base image with Arrow pre-installed
- snowflake-connector-python==2.7.12
  - Issue: Same as databricks-sql-connector - requires pyarrow>=8.0.0,<8.1.0
  - Same workarounds and resolution strategies apply
Package Replacements
The following packages have been replaced with alternatives that provide pre-built wheels:
- dlib==19.24.2 → dlib-bin==19.24.6
  - Reason: Original dlib requires compilation with no pre-built wheel for Python 3.11 linux/amd64
  - dlib-bin provides the same functionality with pre-built wheels
Platform Configuration
The Docker image is built explicitly for the linux/amd64 platform to maximize package compatibility. While this means the image will run under emulation on ARM64 machines (like Apple Silicon Macs), it ensures the broadest package availability since most Python packages prioritize AMD64 wheels. Note that it could take up to 10 minutes to build the image on an M3 chip, and the eval could run slowly under emulation on a Mac.
We developed and ran this eval on a t3.2xlarge EC2 instance with no issues.