BFCL: Berkeley Function-Calling Leaderboard
Evaluates LLM function/tool-calling ability on a simplified split of the Berkeley Function-Calling Leaderboard (BFCL).
Overview
The Berkeley Function-Calling Leaderboard (BFCL) is the first comprehensive evaluation on the LLM’s ability to call functions and tools.
It has now gone through three generations and contains over a dozen datasets which evaluate various aspects of LLM function-calling.
This is an initial port of the BFCL to the Inspect API such that models can be checked for tool calling challenges before being evaluated on more complex tasks.
Execution
To run the BFCL evaluation, use the following command:
inspect eval inspect_evals/bfcl --model openai/gpt-4o-mini
An example from the dataset - the model must invoke the parameters of the calc_binomial_probability
function to pass.
{
"id":"exec_simple_0",
"question":[
[
{
"role":"user",
"content":"I've been playing a game where rolling a six is somehow more likely than usual, and the chance of it happening on a single roll is 60%. I'm curious, if I roll the die 20 times, what are the odds that I'll get exactly five sixes?"
}
]
],
"function":[
{
"name":"calc_binomial_probability",
"description":"Calculates the probability of getting k successes in n trials.",
"parameters":{
"type":"dict",
"properties":{
"n":{
"type":"integer",
"description":"The number of trials."
},
"k":{
"type":"integer",
"description":"The number of successes."
},
"p":{
"type":"float",
"description":"The probability of success."
}
},
"required":[
"n",
"k",
"p"
]
}
}
],
"execution_result_type":[
"exact_match"
],
"ground_truth":[
"calc_binomial_probability(n=20, k=5, p=0.6)"
]
}
Limitations
Note that the complete BFCL v3 evaluation has thousands of samples spanning over a dozen splits, each with different properties. Due to this complexity, we currently only support the exec_simple
split, and it is likely that future work should integrate with the BFCL codebase so we can import a lot of their data transformation components.