BFCL: Berkeley Function-Calling Leaderboard

Assistants

Evaluates LLM function/tool-calling ability on a simplified split of the Berkeley Function-Calling Leaderboard (BFCL).

Overview

The Berkeley Function-Calling Leaderboard (BFCL) is the first comprehensive evaluation of LLMs' ability to call functions and tools.

It has now gone through three generations and contains over a dozen datasets which evaluate various aspects of LLM function-calling.

This port of the BFCL to the Inspect API implements V1 (original categories), V2 (live datasets), and V3 (multi-turn). To mirror the structure of the official BFCL implementation, the port is split into four main components:

Category ‘handlers’ in utils.py
Functions from the official implementation in scorer.py and solver.py. The rules for AST matching can be found in Appendix H of the BFCL paper.
Dataset loading and pre-processing in data.py
Mapping to Inspect AI components in bfcl.py

Evaluation Categories

This evaluation contains multiple categories that differ in how they are loaded, solved, and scored. To address this, the bfcl solver and scorer contain logic to choose the appropriate method for each category.
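
As a rough illustration of that per-category selection, the sketch below maps a category name to the evaluation approach listed in the tables that follow. The function name and the returned labels are hypothetical and are not taken from scorer.py or solver.py; the real implementation may choose its method differently.

```python
def evaluation_method(category: str) -> str:
    """Return an illustrative label for how a BFCL category is scored.

    The labels mirror the "Evaluation" column of the category tables;
    they are assumptions made for this sketch, not identifiers from the port.
    """
    if category.startswith("multi_turn_"):
        return "state_and_response"
    # Check "irrelevance" before "relevance": both names end in "relevance".
    if category.endswith("irrelevance"):
        return "abstention"
    if category.endswith("relevance"):
        return "function_call_check"
    # Everything else, including the exec_* categories, is scored with the AST scorer.
    return "ast_match"


for name in ("simple_python", "live_irrelevance", "live_relevance", "multi_turn_base"):
    print(name, "->", evaluation_method(name))
```

The order of the checks matters here because a plain suffix test on "relevance" would also match the irrelevance categories.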

Single-Turn Tasks (AST Evaluation)

Category	Description	Samples	Evaluation
`simple_python`	Single function call in Python	400	AST match
`parallel_multiple`	Multiple calls, different functions	200	AST match
`parallel`	Multiple calls, same function	200	AST match
`multiple`	Choose from 2-4 functions	200	AST match
`simple_java`	Java-specific types	100	AST match
`simple_javascript`	JavaScript-specific types	50	AST match
`irrelevance`	Relevance detection (abstain)	240	Hallucination Measure
`rest`	REST API calls	100	Execute match (not yet implemented; see Limitations)
`exec_simple`	Single function call	220	Execute match (implemented with AST scorer)
`exec_parallel_multiple`	Multiple calls, different functions	40	Execute match (implemented with AST scorer)
`exec_parallel`	Multiple calls, same function	50	Execute match (implemented with AST scorer)
`exec_multiple`	Choose from 2-4 functions	50	Execute match (implemented with AST scorer)
`live_simple`	User-contributed simple	258	AST match
`live_multiple`	User-contributed multiple	1053	AST match
`live_parallel`	User-contributed parallel	16	AST match
`live_parallel_multiple`	User-contributed parallel multiple	24	AST match
`live_relevance`	User-contributed relevance	16	Function call check
`live_irrelevance`	User-contributed irrelevance	882	Abstention check

Multi-Turn Tasks (State/Response Evaluation)

Category	Description	Samples	Evaluation
`multi_turn_base`	Basic multi-step interactions	200	State & Response-based
`multi_turn_miss_func`	Missing functions handling	200	State & Response-based
`multi_turn_miss_param`	Underspecified queries	200	State & Response-based
`multi_turn_long_context`	Long conversation contexts	200	State & Response-based
`multi_turn_composite`	Combines missing funcs, missing params, and long context	200	State & Response-based

Note: multi_turn_composite is not included in the score comparison below, as the category is missing from the paper results (figure 1 on page 6).

Installation

There are two ways of using Inspect Evals: from PyPI as a dependency of your own project, or as a standalone checked-out GitHub repository.

If you are using it from PyPI, install the package and its dependencies via:

pip install inspect-evals[bfcl]

If you are using Inspect Evals in its repository, start by installing the necessary dependencies with:

uv sync --extra bfcl

Execution

To run the BFCL evaluation, use the following command:

inspect eval inspect_evals/bfcl --model openai/gpt-4o-mini

An example from the dataset is shown below. To pass, the model must call the calc_binomial_probability function with the correct arguments.

{
  "id":"exec_simple_0",
  "question":[
    [
      {
        "role":"user",
        "content":"I've been playing a game where rolling a six is somehow more likely than usual, and the chance of it happening on a single roll is 60%. I'm curious, if I roll the die 20 times, what are the odds that I'll get exactly five sixes?"
      }
    ]
  ],
  "function":[
    {
      "name":"calc_binomial_probability",
      "description":"Calculates the probability of getting k successes in n trials.",
      "parameters":{
        "type":"dict",
        "properties":{
          "n":{
            "type":"integer",
            "description":"The number of trials."
          },
          "k":{
            "type":"integer",
            "description":"The number of successes."
          },
          "p":{
            "type":"float",
            "description":"The probability of success."
          }
        },
        "required":[
          "n",
          "k",
          "p"
        ]
      }
    }
  ],
  "execution_result_type":[
    "exact_match"
  ],
  "ground_truth":[
    "calc_binomial_probability(n=20, k=5, p=0.6)"
  ]
}
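
To make the pass condition concrete, here is a minimal sketch of comparing a model's tool call against the ground_truth string above using Python's ast module. The helper names parse_ground_truth and matches are hypothetical, and the comparison is a plain equality check; the official AST matching rules (Appendix H of the BFCL paper) are richer, so treat this as an illustration rather than the scorer used in this port.

```python
import ast


def parse_ground_truth(call_str: str) -> tuple[str, dict]:
    """Parse a string such as 'f(a=1, b=2)' into (function name, keyword arguments)."""
    node = ast.parse(call_str, mode="eval").body
    if not isinstance(node, ast.Call) or not isinstance(node.func, ast.Name):
        raise ValueError(f"not a simple function call: {call_str!r}")
    # literal_eval only accepts literals, so arbitrary expressions are rejected.
    kwargs = {kw.arg: ast.literal_eval(kw.value) for kw in node.keywords}
    return node.func.id, kwargs


def matches(tool_name: str, tool_args: dict, ground_truth: str) -> bool:
    """Return True when the call has the expected name and exactly the expected arguments."""
    expected_name, expected_args = parse_ground_truth(ground_truth)
    return tool_name == expected_name and tool_args == expected_args


truth = "calc_binomial_probability(n=20, k=5, p=0.6)"
print(matches("calc_binomial_probability", {"n": 20, "k": 5, "p": 0.6}, truth))  # True
print(matches("calc_binomial_probability", {"n": 20, "k": 6, "p": 0.6}, truth))  # False
```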

Limitations

This implementation supports the V1 (original), V2 (live), and V3 (multi-turn) categories, with one exception: the REST category is not yet implemented.

Parameters

`bfcl`

categories (str | list[str]): (default: ['exec_multiple', 'exec_parallel', 'exec_parallel_multiple', 'exec_simple', 'irrelevance', 'live_irrelevance', 'live_multiple', 'live_parallel', 'live_parallel_multiple', 'live_relevance', 'live_simple', 'multi_turn_base', 'multi_turn_composite', 'multi_turn_long_context', 'multi_turn_miss_func', 'multi_turn_miss_param', 'multiple', 'parallel', 'parallel_multiple', 'simple_java', 'simple_javascript', 'simple_python', 'sql'])

Issues

Implement the REST category.
Implement the execution scorer for the exec categories. This is low priority, as this code was never published and these categories were retired as of PR #943.
Go through the logs, as some samples may be scored incorrectly. A trajectory analysis of 100 samples from the single-turn categories found the results to be valid, but smaller issues remain to be checked. One example is Invalid value for function: 'lambda t: 3t**2 + 2t + 1'. Expected one of ['lambda x: 3x**2 + 2x + 1'], where the two answers are equivalent (they differ only in the variable name) but the sample was marked incorrect. Another is a sample that was marked incorrect because a tweet was posted with ..."psi = 3.0" rather than the ground truth, which was ..."psi =3" (arguably, the first is more correct).
Do an error analysis on the multi-turn categories. Based on a visual inspection of a handful of incorrect samples, there may be false negatives. For example, multi_turn_base_45 was marked incorrect because it did not run the search with the key term 'draft'; however, this did not affect the model's final answer, which correctly identified the names containing the term 'draft'.
Provide a 'grouping' helper function to format the results in the same way as the live leaderboard (categories are grouped in specific ways). See https://github.com/ShishirPatil/gorilla/blob/cf12f01fc5582837cfcb496e78bc5dafd18f5f0e/berkeley-function-call-leaderboard/bfcl_eval/eval_checker/eval_runner_helper.py#L288 for the official implementation mapping eval results to their leaderboard.
The AST scorer implements one-level-deep element type checking for arrays. To be more complete, this should be recursive, but the official BFCL implementation also checks only one level deep, so we did not implement recursion to avoid unnecessary complexity.
OpenAI gpt-4.1-mini fails on multi-turn samples with Optional parameters. Three upstream backend methods (find, echo in gorilla_file_system.py and get_user_tickets in ticket_api.py) have Optional[str] type annotations. When _make_recording_tool wraps these methods, functools.wraps copies __annotations__ and inspect.signature(method) carries the Optional[str] types. Inspect AI’s schema builder (parse_tool_info → get_type_hints → json_schema) then converts these to anyOf: [{type: "string"}, {type: "null"}] in the tool JSON schema. GPT-4.1-mini rejects this with a misleading “could not parse the JSON body” error (400 BadRequestError). GPT-5.1 handles the same schema correctly. Recommended fix: In _make_recording_tool (multi_turn_solver.py), after setting the __signature__, strip Optional wrappers from parameter annotations (unwrap Optional[X] / X | None to X) and patch both __signature__ and __annotations__ before the @tool decorator builds the schema. This keeps the fix local to BFCL without modifying the Inspect AI framework.

Evaluation Report

Model	Provider	Accuracy (Inspect)	Standard Error (Inspect)	Time
claude-haiku-4-5-20251001	Anthropic	0.806	0.006	11m 55s
gpt-4.1-mini-2025-04-14	OpenAI	0.788	0.006	10m 3s

Results by Category

claude-haiku-4-5-20251001

Category	Accuracy	Stderr
exec_multiple	0.860	0.050
exec_parallel	0.780	0.059
exec_parallel_multiple	0.725	0.071
exec_simple	0.940	0.024
irrelevance	0.846	0.023
live_irrelevance	0.848	0.012
live_multiple	0.767	0.013
live_parallel	0.750	0.112
live_parallel_multiple	0.708	0.095
live_relevance	0.625	0.121
live_simple	0.767	0.026
multiple	0.940	0.017
parallel	0.905	0.021
parallel_multiple	0.870	0.024
simple_java	0.380	0.049
simple_javascript	0.220	0.059
simple_python	0.928	0.013
sql	0.440	0.050

gpt-4.1-mini-2025-04-14

Category	Accuracy	Stderr
exec_multiple	0.840	0.052
exec_parallel	0.820	0.055
exec_parallel_multiple	0.700	0.073
exec_simple	0.940	0.024
irrelevance	0.850	0.023
live_irrelevance	0.782	0.014
live_multiple	0.774	0.013
live_parallel	0.750	0.112
live_parallel_multiple	0.625	0.101
live_relevance	0.812	0.098
live_simple	0.733	0.028
multiple	0.915	0.020
parallel	0.915	0.020
parallel_multiple	0.870	0.024
simple_java	0.460	0.050
simple_javascript	0.360	0.069
simple_python	0.930	0.013
sql	0.170	0.038

Evaluation date: 2026-03-06
Total samples: 3,981
uv run inspect eval inspect_evals/bfcl --model openai/gpt-4.1-mini-2025-04-14,anthropic/claude-haiku-4-5-20251001 --sample-shuffle

V3 Multi-Turn Results

Model	Provider	Accuracy (Inspect)	Standard Error (Inspect)	Time
claude-haiku-4-5-20251001	Anthropic	0.419	0.016	35m 49s
gpt-5.1-2025-11-13	OpenAI	0.273	0.037	7m 38s

claude-haiku-4-5-20251001

Category	Accuracy	Stderr
multi_turn_base	0.575	0.035
multi_turn_miss_func	0.445	0.035
multi_turn_miss_param	0.535	0.035
multi_turn_long_context	0.525	0.035
multi_turn_composite	0.015	0.009

gpt-5.1-2025-11-13

Category	Accuracy	Stderr
multi_turn_base (n=28)	0.464	0.096
multi_turn_miss_func (n=39)	0.308	0.075
multi_turn_miss_param (n=22)	0.273	0.097
multi_turn_long_context (n=28)	0.357	0.092
multi_turn_composite (n=33)	0.000	0.000

Note that for gpt-5.1-2025-11-13 only a subset of the eval (150 of the 1,000 multi-turn samples) was run for cost reasons; n is the number of samples run in each category.
Evaluation date: 2026-03-24
claude-haiku-4-5-20251001: 1,000 samples (200 per category × 5 categories). multi_turn_base and multi_turn_long_context from initial run; multi_turn_miss_func, multi_turn_miss_param, and multi_turn_composite rerun after bug fix.
gpt-5.1-2025-11-13: 150 samples across the 5 categories (30 per category on average; the actual per-category counts are given as n in the table above)
uv run inspect eval inspect_evals/bfcl -T "categories=['multi_turn_base','multi_turn_miss_func','multi_turn_miss_param','multi_turn_long_context','multi_turn_composite']" --model anthropic/claude-haiku-4-5-20251001 --sample-shuffle
uv run inspect eval inspect_evals/bfcl -T "categories=['multi_turn_base','multi_turn_miss_func','multi_turn_miss_param','multi_turn_long_context','multi_turn_composite']" --model openai/gpt-5.1-2025-11-13 --limit 150 --sample-shuffle

Results Comparison to Paper

The tables below compare the results of the Inspect Evals runs with the BFCL leaderboard (last updated: 2026-03-06).

FC means the model used native support for function/tool calling, whilst Prompt describes the workaround option for function calling, which uses the model's normal text generation capability.

claude-haiku-4-5-20251001

Category	Notes on how to calculate	Inspect Evals Run	FC	Prompt
Simple Function (AST)	Unweighted mean of simple_python, simple_java, simple_javascript	0.509	0.710	0.557
Multiple Function (AST)	multiple	0.940	0.940	0.840
Parallel Function (AST)	parallel	0.905	0.925	0.380
Parallel Multiple (AST)	parallel_multiple	0.870	0.885	0.440
Live Simple (AST)	live_simple	0.767	0.837	0.667
Live Multiple (AST)	live_multiple	0.767	0.776	0.498
Live Parallel (AST)	live_parallel	0.750	0.750	0.563
Live Parallel Multiple (AST)	live_parallel_multiple	0.708	0.750	0.167
Live Relevance	live_relevance	0.625	0.625	0.313
Live Irrelevance	live_irrelevance	0.848	0.851	0.953

gpt-4.1-mini-2025-04-14

Category	Notes on how to calculate	Inspect Evals Run	FC	Prompt
Simple Function (AST)	Unweighted mean of simple_python, simple_java, simple_javascript	0.583	0.733	0.749
Multiple Function (AST)	multiple	0.915	0.890	0.925
Parallel Function (AST)	parallel	0.915	0.910	0.875
Parallel Multiple (AST)	parallel_multiple	0.870	0.820	0.835
Live Simple (AST)	live_simple	0.733	0.671	0.806
Live Multiple (AST)	live_multiple	0.774	0.698	0.733
Live Parallel (AST)	live_parallel	0.750	0.438	0.813
Live Parallel Multiple (AST)	live_parallel_multiple	0.625	0.625	0.708
Live Relevance	live_relevance	0.812	0.813	0.875
Live Irrelevance	live_irrelevance	0.782	0.817	0.739

claude-haiku-4-5-20251001 (Multi-Turn)

Category	Notes on how to calculate	Inspect Evals Run	FC	Prompt
Overall Accuracy	Mean of 4 categories below (excl. composite)	0.520	0.536	0.018
Multi-Turn Base	multi_turn_base	0.575	0.635	0.015
Multi-Turn Miss Func	multi_turn_miss_func	0.445	0.425	0.000
Multi-Turn Miss Param	multi_turn_miss_param	0.535	0.525	0.040
Multi-Turn Long Context	multi_turn_long_context	0.525	0.560	0.015

Note: The leaderboard uses an unweighted average across language subcategories despite very different sample sizes (simple_python: 400, simple_java: 100, simple_javascript: 50) (see: https://github.com/ShishirPatil/gorilla/blob/cf12f01fc5582837cfcb496e78bc5dafd18f5f0e/berkeley-function-call-leaderboard/bfcl_eval/eval_checker/eval_runner_helper.py#L320).

Note: The leaderboard is now V4 and includes additional categories (web search, memory) not implemented here. Our Non-live and Live AST aggregate scores are lower than the leaderboard’s in part because we exclude exec categories from the non-live aggregate.

Note: The leaderboard appears to exclude the non-live irrelevance score from the Hallucination Measurements (relevance and irrelevance).

Note: Multi-turn composite is excluded from the overall accuracy comparison as it is missing from the paper results (figure 1 on page 6).
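
The aggregation described in these notes can be checked with a few lines of arithmetic. The sketch below recomputes the Simple Function (AST) and multi-turn Overall Accuracy rows for claude-haiku-4-5-20251001 from the per-category numbers reported above; the helper names are hypothetical and this is not the leaderboard's own code. The weighted mean is shown only for contrast with the unweighted one.

```python
from statistics import mean

# Accuracies and sample counts for claude-haiku-4-5-20251001, copied from the tables in this document.
simple_accuracy = {"simple_python": 0.928, "simple_java": 0.380, "simple_javascript": 0.220}
simple_samples = {"simple_python": 400, "simple_java": 100, "simple_javascript": 50}
multi_turn_accuracy = {
    "multi_turn_base": 0.575,
    "multi_turn_miss_func": 0.445,
    "multi_turn_miss_param": 0.535,
    "multi_turn_long_context": 0.525,
    "multi_turn_composite": 0.015,
}


def unweighted_mean(accuracy, exclude=()):
    """Average the category accuracies, ignoring how many samples each category has."""
    return mean(v for k, v in accuracy.items() if k not in exclude)


def weighted_mean(accuracy, samples):
    """Average the category accuracies, weighting each by its sample count."""
    total = sum(samples[c] for c in accuracy)
    return sum(accuracy[c] * samples[c] for c in accuracy) / total


print(round(unweighted_mean(simple_accuracy), 3))  # 0.509
print(round(weighted_mean(simple_accuracy, simple_samples), 3))  # 0.764
print(round(unweighted_mean(multi_turn_accuracy, exclude={"multi_turn_composite"}), 3))  # 0.52
```

The gap between 0.509 and 0.764 shows how much the choice of averaging matters when the Java and JavaScript subsets are small and score far lower than Python.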

Changelog

[5-B] - 2026-04-01

Fix crash in multi-turn solver when missed_function_docs is empty for a turn but missed_function_names has entries.

[4-B] - 2026-03-12

Added the multi-turn categories.
Added a multi-turn solver and scorer, including the import of function modules.

[3-B] - 2026-02-20

Renamed the category parameter to categories.
Expanded from a single category (exec_simple) to all V1 and V2 categories (except rest).
Updated scorers (matching functions) to match the official implementation more closely.

[2-A] - 2026-02-16

Migrate version to new scheme. See #907.

[1.0.2] - 2026-02-02

This update aims to preserve the core logic of the exec_simple category while switching the data source from an outdated Hugging Face repo to the official GitHub source.
Creates a CategoryConfig object to handle different categories.
Adds shuffle parameter and sorting by ID following the official implementation.
Adds _func_doc_language_specific_pre_processing as per the official implementation (this should not change the running of exec_simple)

[1.0.1] - 2025-12-18

Adds a backoff policy for functions that connect to Hugging Face servers.
