BrowseComp: A Simple Yet Challenging Benchmark for Browsing Agents

A benchmark for evaluating agents’ ability to browse the web. The dataset consists of challenging questions that generally require web access to answer correctly.

Overview

BrowseComp (“Browsing Competition”) is a benchmark for evaluating agents’ ability to browse the web. This eval is based on the authors’ implementation found here and consists of questions that are very challenging to solve without access to a browser.

Usage

First, install the inspect_ai and inspect_evals Python packages with:

pip install inspect_ai
pip install git+https://github.com/UKGovernmentBEIS/inspect_evals

Or, if developing on a clone of the inspect_evals repo, you can install the package in editable mode with:

pip install -e ".[dev]"

Then, evaluate against one or more models with:

inspect eval inspect_evals/browse_comp --model openai/gpt-4o
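You can also run the eval programmatically rather than via the CLI. A minimal sketch is shown below; it assumes the task function is exported as browse_comp from inspect_evals.browse_comp (check the package for the exact import path):

from inspect_ai import eval
from inspect_evals.browse_comp import browse_comp  # assumed import path

# Programmatic equivalent of the CLI invocation above
eval(browse_comp(), model="openai/gpt-4o")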

After running evaluations, you can view their logs using the inspect view command:

inspect view

If you don’t want to specify the --model each time you run an evaluation, create a .env configuration file in your working directory that defines the INSPECT_EVAL_MODEL environment variable along with your API key. For example:

INSPECT_EVAL_MODEL=anthropic/claude-3-5-sonnet-20240620
ANTHROPIC_API_KEY=<anthropic-api-key>

This eval uses the web_search and web_browser tools. For web_search, the eval defaults to OpenAI’s built-in web search tool when the evaluated model is an OpenAI model that supports it. Otherwise, it falls back to Tavily or Google, and you’ll need to configure the corresponding API keys.
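
If a fallback search provider is used, the web_search tool reads its credentials from the environment. As an illustration, your .env might include entries like the following (the variable names follow the Inspect web_search documentation; double-check them for your Inspect version):

TAVILY_API_KEY=<tavily-api-key>
# or, for the Google provider:
GOOGLE_CSE_ID=<google-cse-id>
GOOGLE_CSE_API_KEY=<google-api-key>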

Options

You can control a variety of options from the command line. For example:

inspect eval inspect_evals/browse_comp --limit 10
inspect eval inspect_evals/browse_comp --max-connections 10
inspect eval inspect_evals/browse_comp --temperature 0.5

See inspect eval --help for all available options.

Eval-specific options

There are several options that are specific to the BrowseComp eval and that can be set via the CLI:

with_browsing: whether to enable browsing (default: false)
num_samples: how many samples to run (default: all samples)
scorer_model: the model used to score the QA model's answer (defaults to the model specified via `--model`)
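
These options are passed using Inspect's -T flag for task arguments. For example:

inspect eval inspect_evals/browse_comp -T with_browsing=true -T num_samples=50 --model openai/gpt-4o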

Dataset

The dataset used in this eval can be found here. The entries in the dataset are encrypted and can be decrypted using the functions in utils.py.

Each entry consists of a specific question, a target answer, and a canary used for encryption and decryption. An example question (not from the dataset) and answer combination could be:

Question: What was the name of the 1995 film starring the actress who played the character Victoria and who married the final scorer for the 1998 World Cup winner two years after the World Cup win?

Answer: La nouvelle tribu
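
Since each entry carries its own canary, decryption is self-contained per sample. For orientation, a minimal sketch of an XOR-based scheme of this kind is shown below; it is illustrative only, and the actual functions live in utils.py:

import base64
import hashlib

def derive_key(canary: str, length: int) -> bytes:
    # Illustrative: stretch a SHA-256 digest of the canary to the required length
    digest = hashlib.sha256(canary.encode()).digest()
    return (digest * (length // len(digest) + 1))[:length]

def decrypt(ciphertext_b64: str, canary: str) -> str:
    # Illustrative XOR decryption of a base64-encoded ciphertext
    data = base64.b64decode(ciphertext_b64)
    key = derive_key(canary, len(data))
    return bytes(a ^ b for a, b in zip(data, key)).decode()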

Scoring

Scoring is done by an LLM grader that determines whether the produced answer matches the target answer; the score is a binary choice between CORRECT and INCORRECT. The evaluated model must also provide a confidence score alongside its answer. Accuracy is computed over all CORRECT/INCORRECT judgments, and a calibration error is computed from the model’s confidence values. The authors of the original paper do not provide their calibration error formula, so this evaluation uses the Expected Calibration Error (ECE), a common measure of how well a model’s confidence predicts the correctness of its answers. See, for example, the PyTorch documentation for more details.
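
For reference, a standard binned ECE computation looks like the sketch below (the bin count and binning scheme are illustrative; the eval’s implementation may differ in detail):

import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    # Weighted average gap between mean confidence and accuracy within each bin
    confidences = np.asarray(confidences, dtype=float)  # confidence in [0, 1]
    correct = np.asarray(correct, dtype=float)          # 1.0 if CORRECT, 0.0 if INCORRECT
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        in_bin = (confidences > lo) & (confidences <= hi)
        if in_bin.any():
            gap = abs(correct[in_bin].mean() - confidences[in_bin].mean())
            ece += in_bin.mean() * gap
    return ece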

Evaluation Report

The original authors evaluated the BrowseComp dataset (1,266 samples) on the following models: gpt-4o, gpt-4o w/ browsing, gpt-4.5, o1, and deep research. At pricing as of May 21, 2025, running the eval on the lowest-performing model, gpt-4o, costs about $20. Back-of-the-envelope calculations for o1 put the cost of this eval at around $350-400, making it prohibitively expensive to self-fund.

The following table shows the results produced by this eval vs the results from the original authors:

|        | Model              | Accuracy (%)  | Calibration Error (%) | Comment                                                                      |
|--------|--------------------|---------------|-----------------------|------------------------------------------------------------------------------|
| Theirs | gpt-4o             | 0.6%          | 69%                   | No browsing                                                                  |
| Ours   | gpt-4o             | 0.9% (+0.3%)  | 56% (-13%)            | No browsing                                                                  |
| Theirs | gpt-4o w/ browsing | 1.9%          | 82%                   |                                                                              |
| Ours   | gpt-4o w/ browsing | 2.5% (+0.6%)  | 80% (-2%)             | Evaluated on first 276 samples (22%), using OpenAI’s built-in browsing tool  |
| Theirs | o1                 | 9.9%          | 65%                   |                                                                              |
| Ours   | o1                 | 10% (+0.1%)   | 48% (-17%)            | Only evaluated on first 100 samples                                          |