BFCL: Berkeley Function-Calling Leaderboard

Assistants

Evaluates LLM function/tool-calling ability on a simplified split of the Berkeley Function-Calling Leaderboard (BFCL).

Overview

The Berkeley Function-Calling Leaderboard (BFCL) is the first comprehensive evaluation on the LLM’s ability to call functions and tools.

It has now gone through three generations and contains over a dozen datasets which evaluate various aspects of LLM function-calling.

This port of the BFCL to the Inspect API implements V1 (original categories) and V2 (live datasets). To map the components of the official BFCL implementation we have broken down the port to 4 main components:

  1. Category ‘handlers’ in utils.py
  2. Functions from the official implementation in scorer.py and solver.py. The rules for AST matching can be found in Appendix H of the BFCL paper.
  3. Dataset loading and pre-processing in data.py
  4. Mapping to Inspect AI components in bfcl.py

Evaluation Categories

This evaluation contains multiple evaluation categories, requiring different loading, solving, and scoring. To address this, the bfcl scorer and solver contains logic to choose the appropriate method.

Single-Turn Tasks (AST Evaluation)

Category Description Samples Evaluation
simple_python Single function call in Python 400 AST match
parallel_multiple Multiple calls, different functions 200 AST match
parallel Multiple calls, same function 200 AST match
multiple Choose from 2-4 functions 200 AST match
simple_java Java-specific types 100 AST match
simple_javascript JavaScript-specific types 50 AST match
irrelevance Relevance detection (abstain) 240 Hallucination Measure
rest REST API calls 100 Execute match (implemented with AST scorer)
exec_simple Single function call 220 Execute match (implemented with AST scorer)
exec_parallel_multiple Multiple calls, different functions 40 Execute match (implemented with AST scorer)
exec_parallel Multiple calls, same function 50 Execute match (implemented with AST scorer)
exec_multiple Choose from 2-4 functions 50 Execute match (implemented with AST scorer)
live_simple User-contributed simple 258 AST match
live_multiple User-contributed multiple 1053 AST match
live_parallel User-contributed parallel 16 AST match
live_parallel_multiple User-contributed parallel 24 AST match
live_relevance User-contributed relevance 16 Function call check
live_irrelevance User-contributed irrelevance 882 Abstention check

Multi-Turn Tasks (State/Response Evaluation)

Category Description Samples Evaluation
multi_turn_base Basic multi-step interactions 200 State & Response-based (not implemented)
multi_turn_miss_func Missing functions handling 200 State & Response-based (not implemented)
multi_turn_miss_param Underspecified queries 200 State & Response-based (not implemented)
multi_turn_long_context Long conversation contexts 200 State & Response-based (not implemented)

Execution

To run the BFCL evaluation, use the following command:

inspect eval inspect_evals/bfcl --model openai/gpt-4o-mini

An example from the dataset - the model must invoke the parameters of the calc_binomial_probability function to pass.

{
  "id":"exec_simple_0",
  "question":[
    [
      {
        "role":"user",
        "content":"I've been playing a game where rolling a six is somehow more likely than usual, and the chance of it happening on a single roll is 60%. I'm curious, if I roll the die 20 times, what are the odds that I'll get exactly five sixes?"
      }
    ]
  ],
  "function":[
    {
      "name":"calc_binomial_probability",
      "description":"Calculates the probability of getting k successes in n trials.",
      "parameters":{
        "type":"dict",
        "properties":{
          "n":{
            "type":"integer",
            "description":"The number of trials."
          },
          "k":{
            "type":"integer",
            "description":"The number of successes."
          },
          "p":{
            "type":"float",
            "description":"The probability of success."
          }
        },
        "required":[
          "n",
          "k",
          "p"
        ]
      }
    }
  ],
  "execution_result_type":[
    "exact_match"
  ],
  "ground_truth":[
    "calc_binomial_probability(n=20, k=5, p=0.6)"
  ]
}

Limitations

This implementation supports all V1 (original) and V2 (live) categories. V3 (multi-turn) categories and the REST category are not yet implemented.

See also

Parameters

bfcl

  • categories (str | list[str]): (default: ['exec_multiple', 'exec_parallel', 'exec_parallel_multiple', 'exec_simple', 'irrelevance', 'live_irrelevance', 'live_multiple', 'live_parallel', 'live_parallel_multiple', 'live_relevance', 'live_simple', 'multiple', 'parallel', 'parallel_multiple', 'simple_java', 'simple_javascript', 'simple_python', 'sql'])
  • shuffle (bool): (default: True)

Issues

  • Implement the Rest category.
  • Implement the execution scorer for the exec categories. This is low priority as it this code was never published and these categories retired as of PR #943..
  • Going through the logs as there might be some false positives. A trajectory analysis of 100 samples was done and found to have valid results, but there are other smaller things to check for example Invalid value for function: 'lambda t: 3t**2 + 2t + 1'. Expected one of ['lambda x: 3x**2 + 2x + 1'] where the two answers are syntactically equivalent but marked incorrectly. Or a sample that was marked incorrect because a tweet was posted with ..."psi = 3.0" rather than the ground truth which was ..."psi =3" (argueably, the first first is more correct).
  • Provide a ‘grouping’ helper function to format the results in the same way as the live leaderboard (categories are grouped in specific ways.). See https://github.com/ShishirPatil/gorilla/blob/cf12f01fc5582837cfcb496e78bc5dafd18f5f0e/berkeley-function-call-leaderboard/bfcl_eval/eval_checker/eval_runner_helper.py#L288 for the official implementation mapping eval results to their leaderboard.
  • AST scorer implements one-level-deep element type checking for arrays. To be more completely, this should be recursive, but following the the official BFCL implementation, which also only checks one level deep, we did not implement recursion to avoid unnecessary complexity.

Evaluation Report

Model Provider Accuracy (Inspect) Standard Error (Inspect) Time
claude-haiku-4-5-20251001 Anthropic 0.806 0.006 11m 55s
gpt-4.1-mini-2025-04-14 OpenAI 0.788 0.006 10m 3s

Results by Category

claude-haiku-4-5-20251001

Category Accuracy Stderr
exec_multiple 0.860 0.050
exec_parallel 0.780 0.059
exec_parallel_multiple 0.725 0.071
exec_simple 0.940 0.024
irrelevance 0.846 0.023
live_irrelevance 0.848 0.012
live_multiple 0.767 0.013
live_parallel 0.750 0.112
live_parallel_multiple 0.708 0.095
live_relevance 0.625 0.121
live_simple 0.767 0.026
multiple 0.940 0.017
parallel 0.905 0.021
parallel_multiple 0.870 0.024
simple_java 0.380 0.049
simple_javascript 0.220 0.059
simple_python 0.928 0.013
sql 0.440 0.050

gpt-4.1-mini-2025-04-14

Category Accuracy Stderr
exec_multiple 0.840 0.052
exec_parallel 0.820 0.055
exec_parallel_multiple 0.700 0.073
exec_simple 0.940 0.024
irrelevance 0.850 0.023
live_irrelevance 0.782 0.014
live_multiple 0.774 0.013
live_parallel 0.750 0.112
live_parallel_multiple 0.625 0.101
live_relevance 0.812 0.098
live_simple 0.733 0.028
multiple 0.915 0.020
parallel 0.915 0.020
parallel_multiple 0.870 0.024
simple_java 0.460 0.050
simple_javascript 0.360 0.069
simple_python 0.930 0.013
sql 0.170 0.038
  • Evaluation date: 2026-03-06
  • Total samples: 3,981
  • uv run inspect eval inspect_evals/bfcl --model openai/gpt-4.1-mini-2025-04-14,anthropic/claude-haiku-4-5-20251001 -T shuffle=True

Results Comparison to Paper

The table compares the results of the Inspect Evals run with the BFCL leaderboard (Last Updated: 2026-03-06).

FC means they used native support for function/tool calling whilst Prompt describes the walk-around option for function calling, using model’s normal text generation capability.

claude-haiku-4-5-20251001

Category Notes on how to calculate Inspect Evals Run FC Prompt
Simple Function (AST) Unweighted mean of simple_python, simple_java, simple_javascript 0.509 0.710 0.557
Multiple Function (AST) multiple 0.940 0.940 0.840
Parallel Function (AST) parallel 0.905 0.925 0.380
Parallel Multiple (AST) parallel_multiple 0.870 0.885 0.440
Live Simple (AST) live_simple 0.767 0.837 0.667
Live Multiple (AST) live_multiple 0.767 0.776 0.498
Live Parallel (AST) live_parallel 0.750 0.750 0.563
Live Parallel Multiple (AST) live_parallel_multiple 0.708 0.750 0.167
Live Relevance live_relevance 0.625 0.625 0.313
Live Irrelevance live_irrelevance 0.848 0.851 0.953

gpt-4.1-mini-2025-04-14

Category Notes on how to calculate Inspect Evals Run FC Prompt
Simple Function (AST) Unweighted mean of simple_python, simple_java, simple_javascript 0.583 0.733 0.749
Multiple Function (AST) multiple 0.915 0.890 0.925
Parallel Function (AST) parallel 0.915 0.910 0.875
Parallel Multiple (AST) parallel_multiple 0.870 0.820 0.835
Live Simple (AST) live_simple 0.733 0.671 0.806
Live Multiple (AST) live_multiple 0.774 0.698 0.733
Live Parallel (AST) live_parallel 0.750 0.438 0.813
Live Parallel Multiple (AST) live_parallel_multiple 0.625 0.625 0.708
Live Relevance live_relevance 0.812 0.813 0.875
Live Irrelevance live_irrelevance 0.782 0.817 0.739

Note: The leaderboard uses an unweighted average across language subcategories despite very different sample sizes (simple_python: 400, simple_java: 100, simple_javascript: 50) (see: https://github.com/ShishirPatil/gorilla/blob/cf12f01fc5582837cfcb496e78bc5dafd18f5f0e/berkeley-function-call-leaderboard/bfcl_eval/eval_checker/eval_runner_helper.py#L320).

Note: The leaderboard is now V4 and includes additional categories (web search, memory, multi-turn) not implemented here. Our Non-live and Live AST aggregate scores are lower than the leaderboard’s in part because we exclude exec categories from the non-live aggregate.

Note: The leaderboard appears to exclude the non-live irrelevance score from the Hallucination Measurements (relevance and irrelevance).

Changelog

[3-B] - 2026-02-20

  • Renamed the category parameter to categories.
  • Expanded from single category (exec_simple) to all V1 and V2 categories(except rest).
  • Updated scorers (matching functions) to match the official implementation more closely.

[2-A] - 2026-02-16

  • Migrate version to new scheme. See #907.

[1.0.2] - 2026-02-02

  • This update aims to preserve the core logic of the exec_simple category while updating the data source from a Huggingface repo that has not been updated to the official github source.
  • Creates a CategoryConfig object handle different categories.
  • Adds shuffle parameter and sorting by ID following the official implementation.
  • Adds _func_doc_language_specific_pre_processing as per the official implementation (this should not change the running of exec_simple)

[1.0.1] - 2025-12-18

  • Adds backoff policy for functions that connect to huggingface servers.