BrowseComp: A Simple Yet Challenging Benchmark for Browsing Agents
A benchmark for evaluating agents’ ability to browse the web. The dataset consists of challenging questions that generally require web access to answer correctly.
Overview
BrowseComp (“Browsing Competition”) is a benchmark for evaluating agents’ ability to browse the web. This eval is based on the authors’ implementation found here and consists of questions that are very challenging to solve without access to a browser.
Usage
First, install the `inspect_ai` and `inspect_evals` Python packages with:

```bash
pip install inspect_ai
pip install git+https://github.com/UKGovernmentBEIS/inspect_evals
```
Or, if developing on a clone of the `inspect_evals` repo, you can install the package in editable mode with:

```bash
pip install -e ".[dev]"
```
Then, evaluate against one or more models with:

```bash
inspect eval inspect_evals/browse_comp --model openai/gpt-4o
```
After running evaluations, you can view their logs using the `inspect view` command:

```bash
inspect view
```
If you don’t want to specify the `--model` each time you run an evaluation, create a `.env` configuration file in your working directory that defines the `INSPECT_EVAL_MODEL` environment variable along with your API key. For example:

```
INSPECT_EVAL_MODEL=anthropic/claude-3-5-sonnet-20240620
ANTHROPIC_API_KEY=<anthropic-api-key>
```
This eval uses the `web_search` and `web_browser` tools. For `web_search`, the eval defaults to OpenAI’s built-in web search tool when using an OpenAI model that supports it; otherwise, it falls back to Tavily or Google, and you’ll need to configure the corresponding API keys.
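When falling back to an external search provider, the provider is configured through environment variables. The variable names below follow the Inspect `web_search` documentation; confirm them against your installed version. You can add them to the same `.env` file:

```
# Only needed when falling back to an external search provider
TAVILY_API_KEY=<tavily-api-key>
# or, for the Google provider:
GOOGLE_CSE_ID=<google-cse-id>
GOOGLE_CSE_API_KEY=<google-cse-api-key>
```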
Options
You can control a variety of options from the command line. For example:
```bash
inspect eval inspect_evals/browse_comp --limit 10
inspect eval inspect_evals/browse_comp --max-connections 10
inspect eval inspect_evals/browse_comp --temperature 0.5
```
See `inspect eval --help` for all available options.
Eval-specific options
There are several options that are specific to the BrowseComp eval and that can be set via the CLI (see the example below):

- `with_browsing`: whether to use browsing or not (default: `false`)
- `num_samples`: how many samples to run (default: all samples)
- `scorer_model`: which model to use for scoring the QA model's answer (defaults to the model specified via `--model`)
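These task options are passed with Inspect's `-T` flag. For example, a run that enables browsing, limits the run to 50 samples, and uses a separate scorer model might look like this (the model names are illustrative):

```bash
inspect eval inspect_evals/browse_comp \
  --model openai/gpt-4o \
  -T with_browsing=true \
  -T num_samples=50 \
  -T scorer_model=openai/gpt-4o-mini
```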
Dataset
The dataset used in this eval can be found here. The entries in the dataset are encrypted and can be decrypted using the functions in `utils.py`.
Each entry consists of a specific question, a target answer, and a canary used for encryption and decryption. An example question (not from the dataset) and answer combination could be:
Question: What was the name of the 1995 film starring the actress who played the character Victoria, who married the final scorer of the 1998 World Cup winner two years after the World Cup win?
Answer: La nouvelle tribu
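For reference, the original BrowseComp release encrypts each field by XOR-ing the plaintext with a keystream derived from the SHA-256 hash of the canary and base64-encoding the result. The sketch below illustrates that scheme; the exact function names and details used by this eval live in `utils.py` and may differ.

```python
import base64
import hashlib

# Sketch of the XOR-based scheme from the original BrowseComp release;
# the helpers in utils.py may be named differently but follow the same idea.
def derive_key(password: str, length: int) -> bytes:
    """Stretch the SHA-256 digest of the canary to the required length."""
    key = hashlib.sha256(password.encode()).digest()
    return (key * (length // len(key) + 1))[:length]

def decrypt(ciphertext_b64: str, password: str) -> str:
    """XOR the base64-decoded ciphertext with the derived keystream."""
    encrypted = base64.b64decode(ciphertext_b64)
    key = derive_key(password, len(encrypted))
    return bytes(a ^ b for a, b in zip(encrypted, key)).decode()
```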
Scoring
Scoring is done by an LLM that determines whether the produced answer matches the target answer. The score is a binary choice between CORRECT and INCORRECT. In addition to its answer, the model must provide a confidence score. Accuracy is computed over all CORRECT/INCORRECT judgments, and a calibration error is also calculated from the model’s confidence values. While the authors of the original paper do not provide their calibration error formula, this evaluation uses the Expected Calibration Error (ECE), a common measure of how well a model’s confidence predicts the correctness of its answers. See, for example, the PyTorch documentation for more details.
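As a rough illustration, ECE bins predictions by confidence and averages the gap between each bin's mean confidence and its empirical accuracy, weighted by bin size. The sketch below shows the idea; it is not the eval's actual scorer code, and binning details may differ.

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins: int = 10) -> float:
    """Minimal ECE sketch: confidences in [0, 1], correct as 0/1 outcomes."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            # gap between empirical accuracy and mean confidence in this bin,
            # weighted by the fraction of samples that fall in the bin
            gap = abs(correct[mask].mean() - confidences[mask].mean())
            ece += mask.mean() * gap
    return ece
```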
Evaluation Report
The original authors evaluated the BrowseComp dataset (1266 samples) on the following models: gpt-4o, gpt-4o w/ browsing, gpt-4.5, o1, and deep research. At pricing as of May 21st, 2025, running the eval on the lowest-performing model, gpt-4o, costs about $20. Back-of-the-envelope calculations for o1 put the cost of this eval at around $350-400, making it prohibitively expensive for self-funding.
The following table shows the results produced by this eval vs the results from the original authors:
| Source | Model | Accuracy (%) | Calibration Error (%) | Comment |
|---|---|---|---|---|
| Theirs | gpt-4o | 0.6% | 69% | No browsing |
| Ours | gpt-4o | 0.9% (+0.3%) | 56% (-13%) | No browsing |
| Theirs | gpt-4o w/ browsing | 1.9% | 82% | |
| Ours | gpt-4o w/ browsing | 2.5% (+0.6%) | 80% (-2%) | Evaluated on first 276 samples (22%), using OpenAI’s built-in browsing tool |
| Theirs | o1 | 9.9% | 65% | |
| Ours | o1 | 10% (+0.1%) | 48% (-17%) | Only evaluated on first 100 samples |