BFCL: Berkeley Function-Calling Leaderboard

Evaluates LLM function/tool-calling ability on a simplified split of the Berkeley Function-Calling Leaderboard (BFCL).

Overview

The Berkeley Function-Calling Leaderboard (BFCL) is the first comprehensive evaluation of LLMs’ ability to call functions and tools.

It has now gone through three generations and contains over a dozen datasets which evaluate various aspects of LLM function-calling.

This port of the BFCL to the Inspect API implements V1 (original categories), V2 (live datasets), and V3 (multi-turn). To mirror the structure of the official BFCL implementation, the port is split into four main components:

  1. Category ‘handlers’ in utils.py
  2. Functions from the official implementation in scorer.py and solver.py. The rules for AST matching can be found in Appendix H of the BFCL paper.
  3. Dataset loading and pre-processing in data.py
  4. Mapping to Inspect AI components in bfcl.py

Evaluation Categories

This evaluation contains multiple categories, which require different loading, solving, and scoring. To address this, the bfcl scorer and solver contain logic to choose the appropriate method for each category.
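As a rough illustration of that dispatch, the sketch below maps a category name to the kind of scoring listed for it in the tables of this section. The function name scoring_method and the returned labels are hypothetical; they are not taken from the port's code, which may organise this logic differently.

```python
def scoring_method(category: str) -> str:
    """Return an illustrative label for how a BFCL category is scored.

    Hypothetical helper: the labels mirror the 'Evaluation' column of the
    category tables, not identifiers from the actual implementation.
    """
    if category.startswith("multi_turn_"):
        # V3 categories are scored on final state and responses.
        return "state_and_response"
    if category in {"irrelevance", "live_irrelevance"}:
        # The model is expected to abstain from calling any function.
        return "abstention"
    if category == "live_relevance":
        # The model is expected to make some function call.
        return "function_call_check"
    # Remaining single-turn categories (including exec_*) use AST matching.
    return "ast_match"


for name in ("simple_python", "live_irrelevance", "multi_turn_base"):
    print(name, "->", scoring_method(name))
```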

Single-Turn Tasks (AST Evaluation)

| Category | Description | Samples | Evaluation |
| --- | --- | --- | --- |
| simple_python | Single function call in Python | 400 | AST match |
| parallel_multiple | Multiple calls, different functions | 200 | AST match |
| parallel | Multiple calls, same function | 200 | AST match |
| multiple | Choose from 2-4 functions | 200 | AST match |
| simple_java | Java-specific types | 100 | AST match |
| simple_javascript | JavaScript-specific types | 50 | AST match |
| irrelevance | Relevance detection (abstain) | 240 | Hallucination Measure |
| rest | REST API calls | 100 | Execute match (implemented with AST scorer) |
| exec_simple | Single function call | 220 | Execute match (implemented with AST scorer) |
| exec_parallel_multiple | Multiple calls, different functions | 40 | Execute match (implemented with AST scorer) |
| exec_parallel | Multiple calls, same function | 50 | Execute match (implemented with AST scorer) |
| exec_multiple | Choose from 2-4 functions | 50 | Execute match (implemented with AST scorer) |
| live_simple | User-contributed simple | 258 | AST match |
| live_multiple | User-contributed multiple | 1053 | AST match |
| live_parallel | User-contributed parallel | 16 | AST match |
| live_parallel_multiple | User-contributed parallel multiple | 24 | AST match |
| live_relevance | User-contributed relevance | 16 | Function call check |
| live_irrelevance | User-contributed irrelevance | 882 | Abstention check |

Multi-Turn Tasks (State/Response Evaluation)

| Category | Description | Samples | Evaluation |
| --- | --- | --- | --- |
| multi_turn_base | Basic multi-step interactions | 200 | State & Response-based |
| multi_turn_miss_func | Missing functions handling | 200 | State & Response-based |
| multi_turn_miss_param | Underspecified queries | 200 | State & Response-based |
| multi_turn_long_context | Long conversation contexts | 200 | State & Response-based |
| multi_turn_composite | Combines missing funcs, missing params, and long context | 200 | State & Response-based |

Note: multi_turn_composite is not included in the score comparison below, as the category is missing from the paper results (Figure 1 on page 6).

Installation

There are two ways of using Inspect Evals: from PyPI as a dependency of your own project, or as a standalone checked-out GitHub repository.

If you are using it from PyPI, install the package and its dependencies via:

pip install inspect-evals[bfcl]

If you are using Inspect Evals in its repository, start by installing the necessary dependencies with:

uv sync --extra bfcl

Execution

To run the BFCL evaluation, use the following command:

inspect eval inspect_evals/bfcl --model openai/gpt-4o-mini

An example from the dataset is shown below. To pass, the model must call the calc_binomial_probability function with the correct parameters.

{
  "id":"exec_simple_0",
  "question":[
    [
      {
        "role":"user",
        "content":"I've been playing a game where rolling a six is somehow more likely than usual, and the chance of it happening on a single roll is 60%. I'm curious, if I roll the die 20 times, what are the odds that I'll get exactly five sixes?"
      }
    ]
  ],
  "function":[
    {
      "name":"calc_binomial_probability",
      "description":"Calculates the probability of getting k successes in n trials.",
      "parameters":{
        "type":"dict",
        "properties":{
          "n":{
            "type":"integer",
            "description":"The number of trials."
          },
          "k":{
            "type":"integer",
            "description":"The number of successes."
          },
          "p":{
            "type":"float",
            "description":"The probability of success."
          }
        },
        "required":[
          "n",
          "k",
          "p"
        ]
      }
    }
  ],
  "execution_result_type":[
    "exact_match"
  ],
  "ground_truth":[
    "calc_binomial_probability(n=20, k=5, p=0.6)"
  ]
}
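To make the ground_truth field concrete, here is a minimal sketch of how such a call string could be parsed and compared against a model's tool call using Python's standard ast module. The helpers parse_call and calls_match are hypothetical names introduced for illustration, and the strict equality check is an assumption; the actual AST matching rules (Appendix H of the BFCL paper) are more permissive.

```python
import ast


def parse_call(call_source: str) -> tuple[str, dict[str, object]]:
    """Parse 'func(a=1, b=2)' into ('func', {'a': 1, 'b': 2}).

    Hypothetical helper: only handles a plain function name with
    literal keyword arguments.
    """
    node = ast.parse(call_source, mode="eval").body
    if not isinstance(node, ast.Call) or not isinstance(node.func, ast.Name):
        raise ValueError(f"not a simple function call: {call_source!r}")
    arguments: dict[str, object] = {}
    for keyword in node.keywords:
        if keyword.arg is None:  # '**kwargs' style unpacking is not supported
            raise ValueError("keyword unpacking is not supported")
        # literal_eval accepts an AST node and only evaluates literals.
        arguments[keyword.arg] = ast.literal_eval(keyword.value)
    return node.func.id, arguments


def calls_match(expected_source: str, name: str, arguments: dict[str, object]) -> bool:
    """Strict comparison: same function name and identical keyword arguments."""
    expected_name, expected_arguments = parse_call(expected_source)
    return name == expected_name and arguments == expected_arguments


ground_truth = "calc_binomial_probability(n=20, k=5, p=0.6)"
print(parse_call(ground_truth))
print(calls_match(ground_truth, "calc_binomial_probability", {"n": 20, "k": 5, "p": 0.6}))
```

A model response that calls the same function with k=6, or with a missing parameter, would not match under this strict check.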

Limitations

This implementation supports the V1 (original), V2 (live), and V3 (multi-turn) categories, with one exception: the REST category is not yet implemented.

Parameters

bfcl

  • categories (str | list[str]): (default: ['exec_multiple', 'exec_parallel', 'exec_parallel_multiple', 'exec_simple', 'irrelevance', 'live_irrelevance', 'live_multiple', 'live_parallel', 'live_parallel_multiple', 'live_relevance', 'live_simple', 'multi_turn_base', 'multi_turn_composite', 'multi_turn_long_context', 'multi_turn_miss_func', 'multi_turn_miss_param', 'multiple', 'parallel', 'parallel_multiple', 'simple_java', 'simple_javascript', 'simple_python', 'sql'])

Issues

  • Implement the REST category.
  • Implement the execution scorer for the exec categories. This is low priority, as this code was never published and these categories were retired as of PR #943.
  • Go through the logs, as some samples may be scored incorrectly. A trajectory analysis of 100 samples from the single-turn categories found valid results, but there are smaller things to check. For example, Invalid value for function: 'lambda t: 3t**2 + 2t + 1'. Expected one of ['lambda x: 3x**2 + 2x + 1'], where the two answers are equivalent apart from the variable name but the sample was marked incorrect. Another sample was marked incorrect because a tweet was posted with ..."psi = 3.0" rather than the ground truth, which was ..."psi =3" (arguably, the first is more correct).
  • Do an error analysis on the multi-turn categories. Based on a visual inspection of a handful of incorrect samples, there may be false negatives. For example, multi_turn_base_45 was marked incorrect because the model didn’t run the search with the key term ‘draft’; however, this did not affect the model’s output and answer, which correctly identified the names containing the term ‘draft’.
  • Provide a ‘grouping’ helper function to format the results in the same way as the live leaderboard, which groups categories in specific ways. See https://github.com/ShishirPatil/gorilla/blob/cf12f01fc5582837cfcb496e78bc5dafd18f5f0e/berkeley-function-call-leaderboard/bfcl_eval/eval_checker/eval_runner_helper.py#L288 for the official implementation mapping eval results to the leaderboard.
  • The AST scorer implements one-level-deep element type checking for arrays. To be more complete, this should be recursive. However, the official BFCL implementation also checks only one level deep, so we followed it and did not implement recursion, to avoid unnecessary complexity.
  • OpenAI gpt-4.1-mini fails on multi-turn samples with Optional parameters. Three upstream backend methods (find and echo in gorilla_file_system.py, and get_user_tickets in ticket_api.py) have Optional[str] type annotations. When _make_recording_tool wraps these methods, functools.wraps copies __annotations__, and inspect.signature(method) carries the Optional[str] types. Inspect AI’s schema builder (parse_tool_info → get_type_hints → json_schema) then converts these to anyOf: [{type: "string"}, {type: "null"}] in the tool JSON schema. GPT-4.1-mini rejects this with a misleading “could not parse the JSON body” error (400 BadRequestError), whereas GPT-5.1 handles the same schema correctly. Recommended fix: in _make_recording_tool (multi_turn_solver.py), after setting __signature__, strip the Optional wrappers from the parameter annotations (unwrap Optional[X] / X | None to X) and patch both __signature__ and __annotations__ before the @tool decorator builds the schema. This keeps the fix local to BFCL without modifying the Inspect AI framework.
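The recommended fix in the last item could look roughly like the sketch below. It is not the code in multi_turn_solver.py: the helper names unwrap_optional and strip_optional_annotations are hypothetical, and the sketch assumes the wrapped callable is a plain function whose annotations are real type objects rather than strings.

```python
import inspect
import types
from typing import Any, Callable, Optional, Union, get_args, get_origin

# typing.Union covers Optional[X]; types.UnionType covers 'X | None' (Python 3.10+).
_UNION_ORIGINS = (Union, getattr(types, "UnionType", Union))


def unwrap_optional(annotation: Any) -> Any:
    """Turn Optional[X] / X | None into X; leave every other annotation alone."""
    if get_origin(annotation) in _UNION_ORIGINS:
        members = get_args(annotation)
        non_none = [member for member in members if member is not type(None)]
        # Only unwrap a union of exactly one real type plus None.
        if len(non_none) == 1 and len(members) == 2:
            return non_none[0]
    return annotation


def strip_optional_annotations(func: Callable[..., Any]) -> Callable[..., Any]:
    """Patch __signature__ and __annotations__ in place (hypothetical helper)."""
    signature = inspect.signature(func)
    parameters = [
        parameter.replace(annotation=unwrap_optional(parameter.annotation))
        for parameter in signature.parameters.values()
    ]
    func.__signature__ = signature.replace(parameters=parameters)
    func.__annotations__ = {
        name: unwrap_optional(annotation)
        for name, annotation in getattr(func, "__annotations__", {}).items()
    }
    return func


def find(path: str = ".", name: Optional[str] = None) -> str:
    """Stand-in for an upstream method with an Optional[str] parameter."""
    return f"{path}:{name}"


strip_optional_annotations(find)
print(inspect.signature(find))
```

Defaults are left untouched, so a parameter such as name keeps its None default while being annotated as a plain str.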

Evaluation Report

| Model | Provider | Accuracy (Inspect) | Standard Error (Inspect) | Time |
| --- | --- | --- | --- | --- |
| claude-haiku-4-5-20251001 | Anthropic | 0.806 | 0.006 | 11m 55s |
| gpt-4.1-mini-2025-04-14 | OpenAI | 0.788 | 0.006 | 10m 3s |

Results by Category

claude-haiku-4-5-20251001

| Category | Accuracy | Stderr |
| --- | --- | --- |
| exec_multiple | 0.860 | 0.050 |
| exec_parallel | 0.780 | 0.059 |
| exec_parallel_multiple | 0.725 | 0.071 |
| exec_simple | 0.940 | 0.024 |
| irrelevance | 0.846 | 0.023 |
| live_irrelevance | 0.848 | 0.012 |
| live_multiple | 0.767 | 0.013 |
| live_parallel | 0.750 | 0.112 |
| live_parallel_multiple | 0.708 | 0.095 |
| live_relevance | 0.625 | 0.121 |
| live_simple | 0.767 | 0.026 |
| multiple | 0.940 | 0.017 |
| parallel | 0.905 | 0.021 |
| parallel_multiple | 0.870 | 0.024 |
| simple_java | 0.380 | 0.049 |
| simple_javascript | 0.220 | 0.059 |
| simple_python | 0.928 | 0.013 |
| sql | 0.440 | 0.050 |

gpt-4.1-mini-2025-04-14

| Category | Accuracy | Stderr |
| --- | --- | --- |
| exec_multiple | 0.840 | 0.052 |
| exec_parallel | 0.820 | 0.055 |
| exec_parallel_multiple | 0.700 | 0.073 |
| exec_simple | 0.940 | 0.024 |
| irrelevance | 0.850 | 0.023 |
| live_irrelevance | 0.782 | 0.014 |
| live_multiple | 0.774 | 0.013 |
| live_parallel | 0.750 | 0.112 |
| live_parallel_multiple | 0.625 | 0.101 |
| live_relevance | 0.812 | 0.098 |
| live_simple | 0.733 | 0.028 |
| multiple | 0.915 | 0.020 |
| parallel | 0.915 | 0.020 |
| parallel_multiple | 0.870 | 0.024 |
| simple_java | 0.460 | 0.050 |
| simple_javascript | 0.360 | 0.069 |
| simple_python | 0.930 | 0.013 |
| sql | 0.170 | 0.038 |

  • Evaluation date: 2026-03-06
  • Total samples: 3,981
  • uv run inspect eval inspect_evals/bfcl --model openai/gpt-4.1-mini-2025-04-14,anthropic/claude-haiku-4-5-20251001 --sample-shuffle
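The Stderr columns above can be sanity-checked against the accuracy and sample count of each category. The sketch below assumes each sample is scored 0 or 1, so the standard error follows from the binomial variance; whether the reported values use the population or the sample (n - 1) variance is an assumption here, and the two can differ in the third decimal for small categories. The function name accuracy_stderr is hypothetical.

```python
import math


def accuracy_stderr(accuracy: float, n: int, sample_variance: bool = False) -> float:
    """Standard error of a mean of n binary (0/1) scores.

    With sample_variance=True the variance is scaled by n / (n - 1),
    which matters mostly for small categories.
    """
    if n < 2:
        raise ValueError("need at least two samples")
    variance = accuracy * (1.0 - accuracy)
    if sample_variance:
        variance *= n / (n - 1)
    return math.sqrt(variance / n)


# simple_java for claude-haiku-4-5: accuracy 0.380 over 100 samples.
print(round(accuracy_stderr(0.380, 100), 3))  # 0.049

# simple_javascript for gpt-4.1-mini: accuracy 0.360 over 50 samples.
print(round(accuracy_stderr(0.360, 50, sample_variance=True), 3))  # 0.069
```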

V3 Multi-Turn Results

| Model | Provider | Accuracy (Inspect) | Standard Error (Inspect) | Time |
| --- | --- | --- | --- | --- |
| claude-haiku-4-5-20251001 | Anthropic | 0.419 | 0.016 | 35m 49s |
| gpt-5.1-2025-11-13 | OpenAI | 0.273 | 0.037 | 7m 38s |

claude-haiku-4-5-20251001

| Category | Accuracy | Stderr |
| --- | --- | --- |
| multi_turn_base | 0.575 | 0.035 |
| multi_turn_miss_func | 0.445 | 0.035 |
| multi_turn_miss_param | 0.535 | 0.035 |
| multi_turn_long_context | 0.525 | 0.035 |
| multi_turn_composite | 0.015 | 0.009 |

gpt-5.1-2025-11-13

| Category | n | Accuracy | Stderr |
| --- | --- | --- | --- |
| multi_turn_base | 28 | 0.464 | 0.096 |
| multi_turn_miss_func | 39 | 0.308 | 0.075 |
| multi_turn_miss_param | 22 | 0.273 | 0.097 |
| multi_turn_long_context | 28 | 0.357 | 0.092 |
| multi_turn_composite | 33 | 0.000 | 0.000 |

  • Note that only 150 of the 1,000 multi-turn samples (15%) were run for cost purposes (where n is the number of samples in that category).

  • Evaluation date: 2026-03-24

  • claude-haiku-4-5-20251001: 1,000 samples (200 per category × 5 categories). multi_turn_base and multi_turn_long_context from initial run; multi_turn_miss_func, multi_turn_miss_param, and multi_turn_composite rerun after bug fix.

  • gpt-5.1-2025-11-13: 150 samples across the 5 categories (30 per category on average; the shuffled sample gives 22 to 39 per category, see n in the table above)

  • uv run inspect eval inspect_evals/bfcl -T "categories=['multi_turn_base','multi_turn_miss_func','multi_turn_miss_param','multi_turn_long_context','multi_turn_composite']" --model anthropic/claude-haiku-4-5-20251001 --sample-shuffle

  • uv run inspect eval inspect_evals/bfcl -T "categories=['multi_turn_base','multi_turn_miss_func','multi_turn_miss_param','multi_turn_long_context','multi_turn_composite']" --model openai/gpt-5.1-2025-11-13 --limit 150 --sample-shuffle

Results Comparison to Paper

The tables below compare the results of the Inspect Evals run with the BFCL leaderboard (last updated: 2026-03-06).

FC means the model used native support for function/tool calling, whilst Prompt is the workaround option, which relies on the model’s normal text-generation capability to produce function calls.

claude-haiku-4-5-20251001

| Category | Notes on how to calculate | Inspect Evals Run | FC | Prompt |
| --- | --- | --- | --- | --- |
| Simple Function (AST) | Unweighted mean of simple_python, simple_java, simple_javascript | 0.509 | 0.710 | 0.557 |
| Multiple Function (AST) | multiple | 0.940 | 0.940 | 0.840 |
| Parallel Function (AST) | parallel | 0.905 | 0.925 | 0.380 |
| Parallel Multiple (AST) | parallel_multiple | 0.870 | 0.885 | 0.440 |
| Live Simple (AST) | live_simple | 0.767 | 0.837 | 0.667 |
| Live Multiple (AST) | live_multiple | 0.767 | 0.776 | 0.498 |
| Live Parallel (AST) | live_parallel | 0.750 | 0.750 | 0.563 |
| Live Parallel Multiple (AST) | live_parallel_multiple | 0.708 | 0.750 | 0.167 |
| Live Relevance | live_relevance | 0.625 | 0.625 | 0.313 |
| Live Irrelevance | live_irrelevance | 0.848 | 0.851 | 0.953 |

gpt-4.1-mini-2025-04-14

| Category | Notes on how to calculate | Inspect Evals Run | FC | Prompt |
| --- | --- | --- | --- | --- |
| Simple Function (AST) | Unweighted mean of simple_python, simple_java, simple_javascript | 0.583 | 0.733 | 0.749 |
| Multiple Function (AST) | multiple | 0.915 | 0.890 | 0.925 |
| Parallel Function (AST) | parallel | 0.915 | 0.910 | 0.875 |
| Parallel Multiple (AST) | parallel_multiple | 0.870 | 0.820 | 0.835 |
| Live Simple (AST) | live_simple | 0.733 | 0.671 | 0.806 |
| Live Multiple (AST) | live_multiple | 0.774 | 0.698 | 0.733 |
| Live Parallel (AST) | live_parallel | 0.750 | 0.438 | 0.813 |
| Live Parallel Multiple (AST) | live_parallel_multiple | 0.625 | 0.625 | 0.708 |
| Live Relevance | live_relevance | 0.812 | 0.813 | 0.875 |
| Live Irrelevance | live_irrelevance | 0.782 | 0.817 | 0.739 |

claude-haiku-4-5-20251001 (Multi-Turn)

| Category | Notes on how to calculate | Inspect Evals Run | FC | Prompt |
| --- | --- | --- | --- | --- |
| Overall Accuracy | Mean of 4 categories below (excl. composite) | 0.520 | 0.536 | 0.018 |
| Multi-Turn Base | multi_turn_base | 0.575 | 0.635 | 0.015 |
| Multi-Turn Miss Func | multi_turn_miss_func | 0.445 | 0.425 | 0.000 |
| Multi-Turn Miss Param | multi_turn_miss_param | 0.535 | 0.525 | 0.040 |
| Multi-Turn Long Context | multi_turn_long_context | 0.525 | 0.560 | 0.015 |

Note: The leaderboard uses an unweighted average across language subcategories despite very different sample sizes (simple_python: 400, simple_java: 100, simple_javascript: 50) (see: https://github.com/ShishirPatil/gorilla/blob/cf12f01fc5582837cfcb496e78bc5dafd18f5f0e/berkeley-function-call-leaderboard/bfcl_eval/eval_checker/eval_runner_helper.py#L320).
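To see how much the averaging choice matters, the sketch below recomputes the Simple Function (AST) score for claude-haiku-4-5-20251001 from the per-category accuracies and sample counts reported in this document, once unweighted (as in the comparison table) and once weighted by sample count. The helper names are illustrative only.

```python
# (accuracy, sample count) per language subcategory, taken from the tables above.
simple_scores = {
    "simple_python": (0.928, 400),
    "simple_java": (0.380, 100),
    "simple_javascript": (0.220, 50),
}


def unweighted_mean(scores: dict[str, tuple[float, int]]) -> float:
    """Average the subcategory accuracies, ignoring sample counts."""
    return sum(accuracy for accuracy, _ in scores.values()) / len(scores)


def weighted_mean(scores: dict[str, tuple[float, int]]) -> float:
    """Average the subcategory accuracies, weighting each by its sample count."""
    total = sum(count for _, count in scores.values())
    return sum(accuracy * count for accuracy, count in scores.values()) / total


print(round(unweighted_mean(simple_scores), 3))  # 0.509
print(round(weighted_mean(simple_scores), 3))    # 0.764
```

The unweighted figure is the one that is comparable with the leaderboard; the weighted figure is dominated by simple_python because it has eight times as many samples as simple_javascript.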

Note: The leaderboard is now V4 and includes additional categories (web search, memory) not implemented here. Our Non-live and Live AST aggregate scores are lower than the leaderboard’s in part because we exclude exec categories from the non-live aggregate.

Note: The leaderboard appears to exclude the non-live irrelevance score from the Hallucination Measurements (relevance and irrelevance).

Note: multi_turn_composite is excluded from the overall accuracy comparison as it is missing from the paper results (Figure 1 on page 6).

Changelog

[5-B] - 2026-04-01

  • Fix crash in multi-turn solver when missed_function_docs is empty for a turn but missed_function_names has entries.

[4-B] - 2026-03-12

  • Added multi-turn categories.
  • Added a multi-turn solver and scorer, including the import of function modules.

[3-B] - 2026-02-20

  • Renamed the category parameter to categories.
  • Expanded from a single category (exec_simple) to all V1 and V2 categories (except rest).
  • Updated scorers (matching functions) to match the official implementation more closely.

[2-A] - 2026-02-16

  • Migrate version to new scheme. See #907.

[1.0.2] - 2026-02-02

  • This update aims to preserve the core logic of the exec_simple category while switching the data source from a Hugging Face repo that has not been updated to the official GitHub source.
  • Creates a CategoryConfig object to handle different categories.
  • Adds shuffle parameter and sorting by ID following the official implementation.
  • Adds _func_doc_language_specific_pre_processing as per the official implementation (this should not change the running of exec_simple)

[1.0.1] - 2025-12-18

  • Adds a backoff policy for functions that connect to Hugging Face servers.