BFCL: Berkeley Function-Calling Leaderboard
Evaluates LLM function/tool-calling ability on a simplified split of the Berkeley Function-Calling Leaderboard (BFCL).
Overview
The Berkeley Function-Calling Leaderboard (BFCL) is the first comprehensive evaluation of an LLM’s ability to call functions and tools.
It has now gone through three generations and contains over a dozen datasets which evaluate various aspects of LLM function-calling.
This port of the BFCL to the Inspect API implements V1 (original categories) and V2 (live datasets). To mirror the structure of the official BFCL implementation, the port is split into four main components:
- Category ‘handlers’ in utils.py
- Functions from the official implementation in scorer.py and solver.py. The rules for AST matching can be found in Appendix H of the BFCL paper.
- Dataset loading and pre-processing in data.py
- Mapping to Inspect AI components in bfcl.py
Evaluation Categories
This evaluation covers multiple categories, each requiring different loading, solving, and scoring. To handle this, the bfcl scorer and solver contain logic to choose the appropriate method for each category.
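As a rough illustration of that selection logic, the sketch below maps a category name to the evaluation type listed in the tables that follow. The helper names `evaluation_kind` and `passes_call_check` are hypothetical and are not taken from the port. The rule that a relevance sample passes when at least one tool call is made is also an assumption about what "Function call check" means.

```python
def evaluation_kind(category: str) -> str:
    """Return the evaluation type for a BFCL category (hypothetical helper)."""
    if category.startswith("multi_turn"):
        # State & response-based evaluation is not implemented in this port.
        raise NotImplementedError(f"{category} is not supported")
    if category.endswith("irrelevance"):
        return "abstention"
    if category == "live_relevance":
        return "function_call_check"
    # Everything else, including the exec_* categories, uses the AST scorer.
    return "ast_match"


def passes_call_check(category: str, num_tool_calls: int) -> bool:
    """Score relevance-style categories from the number of tool calls made."""
    kind = evaluation_kind(category)
    if kind == "abstention":
        # The model should decline to call any function.
        return num_tool_calls == 0
    if kind == "function_call_check":
        # Assumption: any function call counts as a pass.
        return num_tool_calls > 0
    raise ValueError(f"{category} needs AST matching, not a call-count check")


print(evaluation_kind("exec_simple"))            # ast_match
print(passes_call_check("live_irrelevance", 0))  # True
```

A string-based rule like this is only a compact way to show the idea; a per-category configuration object is a more explicit alternative.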
Single-Turn Tasks (AST Evaluation)
| Category | Description | Samples | Evaluation |
|---|---|---|---|
| simple_python | Single function call in Python | 400 | AST match |
| parallel_multiple | Multiple calls, different functions | 200 | AST match |
| parallel | Multiple calls, same function | 200 | AST match |
| multiple | Choose from 2-4 functions | 200 | AST match |
| simple_java | Java-specific types | 100 | AST match |
| simple_javascript | JavaScript-specific types | 50 | AST match |
| irrelevance | Relevance detection (abstain) | 240 | Hallucination Measure |
| rest | REST API calls | 100 | Execute match (not implemented) |
| exec_simple | Single function call | 220 | Execute match (implemented with AST scorer) |
| exec_parallel_multiple | Multiple calls, different functions | 40 | Execute match (implemented with AST scorer) |
| exec_parallel | Multiple calls, same function | 50 | Execute match (implemented with AST scorer) |
| exec_multiple | Choose from 2-4 functions | 50 | Execute match (implemented with AST scorer) |
| live_simple | User-contributed simple | 258 | AST match |
| live_multiple | User-contributed multiple | 1053 | AST match |
| live_parallel | User-contributed parallel | 16 | AST match |
| live_parallel_multiple | User-contributed parallel multiple | 24 | AST match |
| live_relevance | User-contributed relevance | 16 | Function call check |
| live_irrelevance | User-contributed irrelevance | 882 | Abstention check |
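To make "AST match" concrete, here is a minimal sketch that parses a ground-truth call string, such as the one in the exec_simple_0 example, and compares it with a predicted call. It is a simplification under stated assumptions: `parse_call` and `calls_match` are hypothetical names, only keyword arguments with literal values are handled, and the official rules (Appendix H of the BFCL paper) are more permissive about acceptable values than the strict equality used here.

```python
import ast
from typing import Any


def parse_call(call_source: str) -> tuple[str, dict[str, Any]]:
    """Parse 'f(a=1, b=2)' into ('f', {'a': 1, 'b': 2}) using Python's ast module."""
    node = ast.parse(call_source, mode="eval").body
    if not isinstance(node, ast.Call):
        raise ValueError(f"not a function call: {call_source!r}")
    kwargs: dict[str, Any] = {}
    for keyword in node.keywords:
        if keyword.arg is None:
            raise ValueError("**kwargs expansion is not supported in this sketch")
        # literal_eval accepts an AST node and only evaluates literals.
        kwargs[keyword.arg] = ast.literal_eval(keyword.value)
    return ast.unparse(node.func), kwargs


def calls_match(name: str, arguments: dict[str, Any], ground_truth: str) -> bool:
    """Strict comparison of a predicted call against one ground-truth call string."""
    expected_name, expected_arguments = parse_call(ground_truth)
    return name == expected_name and arguments == expected_arguments


truth = "calc_binomial_probability(n=20, k=5, p=0.6)"
print(parse_call(truth))
print(calls_match("calc_binomial_probability", {"n": 20, "k": 5, "p": 0.6}, truth))
```

Comparing parsed structures rather than raw strings means argument order and whitespace in the model's output do not affect the result.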
Multi-Turn Tasks (State/Response Evaluation)
| Category | Description | Samples | Evaluation |
|---|---|---|---|
| multi_turn_base | Basic multi-step interactions | 200 | State & Response-based (not implemented) |
| multi_turn_miss_func | Missing functions handling | 200 | State & Response-based (not implemented) |
| multi_turn_miss_param | Underspecified queries | 200 | State & Response-based (not implemented) |
| multi_turn_long_context | Long conversation contexts | 200 | State & Response-based (not implemented) |
Execution
To run the BFCL evaluation, use the following command:
inspect eval inspect_evals/bfcl --model openai/gpt-4o-mini

An example from the dataset is shown below. To pass, the model must call the calc_binomial_probability function with the correct parameters.
{
  "id": "exec_simple_0",
  "question": [
    [
      {
        "role": "user",
        "content": "I've been playing a game where rolling a six is somehow more likely than usual, and the chance of it happening on a single roll is 60%. I'm curious, if I roll the die 20 times, what are the odds that I'll get exactly five sixes?"
      }
    ]
  ],
  "function": [
    {
      "name": "calc_binomial_probability",
      "description": "Calculates the probability of getting k successes in n trials.",
      "parameters": {
        "type": "dict",
        "properties": {
          "n": {
            "type": "integer",
            "description": "The number of trials."
          },
          "k": {
            "type": "integer",
            "description": "The number of successes."
          },
          "p": {
            "type": "float",
            "description": "The probability of success."
          }
        },
        "required": [
          "n",
          "k",
          "p"
        ]
      }
    }
  ],
  "execution_result_type": [
    "exact_match"
  ],
  "ground_truth": [
    "calc_binomial_probability(n=20, k=5, p=0.6)"
  ]
}

Limitations
This implementation supports all V1 (original) and V2 (live) categories except rest. The V3 (multi-turn) categories and the rest category are not yet implemented.
Parameters
bfcl
- categories (str | list[str]): (default: ['exec_multiple', 'exec_parallel', 'exec_parallel_multiple', 'exec_simple', 'irrelevance', 'live_irrelevance', 'live_multiple', 'live_parallel', 'live_parallel_multiple', 'live_relevance', 'live_simple', 'multiple', 'parallel', 'parallel_multiple', 'simple_java', 'simple_javascript', 'simple_python', 'sql'])
- shuffle (bool): (default: True)
Issues
- Implement the Rest category.
- Implement the execution scorer for the exec categories. This is low priority, as this code was never published and these categories were retired as of PR #943.
- Go through the logs, as there might be some false positives. A trajectory analysis of 100 samples found valid results, but there are smaller things to check. For example, Invalid value for function: 'lambda t: 3t**2 + 2t + 1'. Expected one of ['lambda x: 3x**2 + 2x + 1'], where the two answers are semantically equivalent but the sample was marked incorrect. Another sample was marked incorrect because a tweet was posted with ..."psi = 3.0" rather than the ground truth, which was ..."psi =3" (arguably, the first is more correct).
- Provide a ‘grouping’ helper function to format the results in the same way as the live leaderboard (categories are grouped in specific ways). See https://github.com/ShishirPatil/gorilla/blob/cf12f01fc5582837cfcb496e78bc5dafd18f5f0e/berkeley-function-call-leaderboard/bfcl_eval/eval_checker/eval_runner_helper.py#L288 for the official implementation mapping eval results to their leaderboard.
- The AST scorer implements one-level-deep element type checking for arrays. To be more complete, this should be recursive, but the official BFCL implementation also checks only one level deep, so we did not implement recursion to avoid unnecessary complexity.
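The one-level-deep array check mentioned in the last item can be pictured with the following sketch. It is an illustration only: `elements_match_type` is a hypothetical name, and the port's actual type handling may differ.

```python
from typing import Any


def elements_match_type(value: Any, element_type: type) -> bool:
    """Check that value is a list whose direct elements all have element_type.

    Nested containers are NOT inspected, which is the one-level-deep behaviour.
    """
    if not isinstance(value, list):
        return False
    for item in value:
        # bool is a subclass of int in Python, so reject it explicitly for int.
        if element_type is int and isinstance(item, bool):
            return False
        if not isinstance(item, element_type):
            return False
    return True


print(elements_match_type([1, 2, 3], int))               # True
print(elements_match_type([1, "2", 3], int))             # False
# The inner "x" goes unnoticed because only the outer elements are checked.
print(elements_match_type([[1, 2], [3, "x"]], list))     # True
```

The last call shows the trade-off: a wrongly typed value inside a nested array passes the check, which a recursive version would catch.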
Evaluation Report
| Model | Provider | Accuracy (Inspect) | Standard Error (Inspect) | Time |
|---|---|---|---|---|
| claude-haiku-4-5-20251001 | Anthropic | 0.806 | 0.006 | 11m 55s |
| gpt-4.1-mini-2025-04-14 | OpenAI | 0.788 | 0.006 | 10m 3s |
Results by Category
claude-haiku-4-5-20251001
| Category | Accuracy | Stderr |
|---|---|---|
| exec_multiple | 0.860 | 0.050 |
| exec_parallel | 0.780 | 0.059 |
| exec_parallel_multiple | 0.725 | 0.071 |
| exec_simple | 0.940 | 0.024 |
| irrelevance | 0.846 | 0.023 |
| live_irrelevance | 0.848 | 0.012 |
| live_multiple | 0.767 | 0.013 |
| live_parallel | 0.750 | 0.112 |
| live_parallel_multiple | 0.708 | 0.095 |
| live_relevance | 0.625 | 0.121 |
| live_simple | 0.767 | 0.026 |
| multiple | 0.940 | 0.017 |
| parallel | 0.905 | 0.021 |
| parallel_multiple | 0.870 | 0.024 |
| simple_java | 0.380 | 0.049 |
| simple_javascript | 0.220 | 0.059 |
| simple_python | 0.928 | 0.013 |
| sql | 0.440 | 0.050 |
gpt-4.1-mini-2025-04-14
| Category | Accuracy | Stderr |
|---|---|---|
| exec_multiple | 0.840 | 0.052 |
| exec_parallel | 0.820 | 0.055 |
| exec_parallel_multiple | 0.700 | 0.073 |
| exec_simple | 0.940 | 0.024 |
| irrelevance | 0.850 | 0.023 |
| live_irrelevance | 0.782 | 0.014 |
| live_multiple | 0.774 | 0.013 |
| live_parallel | 0.750 | 0.112 |
| live_parallel_multiple | 0.625 | 0.101 |
| live_relevance | 0.812 | 0.098 |
| live_simple | 0.733 | 0.028 |
| multiple | 0.915 | 0.020 |
| parallel | 0.915 | 0.020 |
| parallel_multiple | 0.870 | 0.024 |
| simple_java | 0.460 | 0.050 |
| simple_javascript | 0.360 | 0.069 |
| simple_python | 0.930 | 0.013 |
| sql | 0.170 | 0.038 |
- Evaluation date: 2026-03-06
- Total samples: 3,981
uv run inspect eval inspect_evals/bfcl --model openai/gpt-4.1-mini-2025-04-14,anthropic/claude-haiku-4-5-20251001 -T shuffle=True
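As a sanity check on the reported standard errors, the sketch below applies the simple binomial approximation, sqrt(p(1 - p) / n), to the overall accuracies and the total sample count. This is an assumption about how the figures can be approximated, and the stderr reported by Inspect may be computed slightly differently.

```python
import math


def binomial_stderr(accuracy: float, n_samples: int) -> float:
    """Approximate standard error of an accuracy measured over n_samples."""
    return math.sqrt(accuracy * (1.0 - accuracy) / n_samples)


TOTAL_SAMPLES = 3981  # "Total samples" from the report above

print(round(binomial_stderr(0.806, TOTAL_SAMPLES), 3))  # 0.006 (claude-haiku-4-5)
print(round(binomial_stderr(0.788, TOTAL_SAMPLES), 3))  # 0.006 (gpt-4.1-mini)
```

The same formula explains why small categories such as live_parallel (16 samples) have much wider error bars than the overall score.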
Results Comparison to Paper
The tables compare the results of the Inspect Evals run with the BFCL leaderboard (Last Updated: 2026-03-06).
FC means the model used native support for function/tool calling, whilst Prompt denotes the workaround of eliciting function calls through the model’s normal text generation capability.
claude-haiku-4-5-20251001
| Category | Notes on how to calculate | Inspect Evals Run | FC | Prompt |
|---|---|---|---|---|
| Simple Function (AST) | Unweighted mean of simple_python, simple_java, simple_javascript | 0.509 | 0.710 | 0.557 |
| Multiple Function (AST) | multiple | 0.940 | 0.940 | 0.840 |
| Parallel Function (AST) | parallel | 0.905 | 0.925 | 0.380 |
| Parallel Multiple (AST) | parallel_multiple | 0.870 | 0.885 | 0.440 |
| Live Simple (AST) | live_simple | 0.767 | 0.837 | 0.667 |
| Live Multiple (AST) | live_multiple | 0.767 | 0.776 | 0.498 |
| Live Parallel (AST) | live_parallel | 0.750 | 0.750 | 0.563 |
| Live Parallel Multiple (AST) | live_parallel_multiple | 0.708 | 0.750 | 0.167 |
| Live Relevance | live_relevance | 0.625 | 0.625 | 0.313 |
| Live Irrelevance | live_irrelevance | 0.848 | 0.851 | 0.953 |
gpt-4.1-mini-2025-04-14
| Category | Notes on how to calculate | Inspect Evals Run | FC | Prompt |
|---|---|---|---|---|
| Simple Function (AST) | Unweighted mean of simple_python, simple_java, simple_javascript | 0.583 | 0.733 | 0.749 |
| Multiple Function (AST) | multiple | 0.915 | 0.890 | 0.925 |
| Parallel Function (AST) | parallel | 0.915 | 0.910 | 0.875 |
| Parallel Multiple (AST) | parallel_multiple | 0.870 | 0.820 | 0.835 |
| Live Simple (AST) | live_simple | 0.733 | 0.671 | 0.806 |
| Live Multiple (AST) | live_multiple | 0.774 | 0.698 | 0.733 |
| Live Parallel (AST) | live_parallel | 0.750 | 0.438 | 0.813 |
| Live Parallel Multiple (AST) | live_parallel_multiple | 0.625 | 0.625 | 0.708 |
| Live Relevance | live_relevance | 0.812 | 0.813 | 0.875 |
| Live Irrelevance | live_irrelevance | 0.782 | 0.817 | 0.739 |
Note: The leaderboard uses an unweighted average across language subcategories despite very different sample sizes (simple_python: 400, simple_java: 100, simple_javascript: 50) (see: https://github.com/ShishirPatil/gorilla/blob/cf12f01fc5582837cfcb496e78bc5dafd18f5f0e/berkeley-function-call-leaderboard/bfcl_eval/eval_checker/eval_runner_helper.py#L320).
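The effect of the unweighted average can be seen by recomputing the Simple Function (AST) score for claude-haiku-4-5-20251001 from the per-category results above. The sample-weighted figure is shown only for contrast and is not a number reported by the leaderboard.

```python
# (accuracy, sample count) per language subcategory, from the tables above.
simple_scores = {
    "simple_python": (0.928, 400),
    "simple_java": (0.380, 100),
    "simple_javascript": (0.220, 50),
}

# Leaderboard-style aggregate: every subcategory counts equally.
unweighted = sum(acc for acc, _ in simple_scores.values()) / len(simple_scores)

# Alternative for contrast: every sample counts equally.
total_samples = sum(n for _, n in simple_scores.values())
weighted = sum(acc * n for acc, n in simple_scores.values()) / total_samples

print(round(unweighted, 3))  # 0.509
print(round(weighted, 3))    # 0.764
```

Because simple_java and simple_javascript together hold 150 of the 550 samples but two thirds of the weight, weak results in those two languages pull the unweighted score down sharply.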
Note: The leaderboard is now V4 and includes additional categories (web search, memory, multi-turn) not implemented here. Our Non-live and Live AST aggregate scores are lower than the leaderboard’s in part because we exclude exec categories from the non-live aggregate.
Note: The leaderboard appears to exclude the non-live irrelevance score from the Hallucination Measurements (relevance and irrelevance).
Changelog
[3-B] - 2026-02-20
- Renamed the category parameter to categories.
- Expanded from single category (exec_simple) to all V1 and V2 categories (except rest).
- Updated scorers (matching functions) to match the official implementation more closely.
[2-A] - 2026-02-16
- Migrate version to new scheme. See #907.
[1.0.2] - 2026-02-02
- This update aims to preserve the core logic of the exec_simple category while updating the data source from a Hugging Face repo that has not been updated to the official GitHub source.
- Creates a CategoryConfig object to handle different categories.
- Adds shuffle parameter and sorting by ID following the official implementation.
- Adds _func_doc_language_specific_pre_processing as per the official implementation (this should not change the running of exec_simple).
[1.0.1] - 2025-12-18
- Adds backoff policy for functions that connect to Hugging Face servers.