Changelog
All notable changes to the Berkeley Function Calling Leaderboard will be documented in this file.
- [Oct 17, 2024] #683: Fix ambiguity in action intention and function parameters in the multi-turn categories.
- [Oct 17, 2024] #709: Rephrase question prompt for Java and JavaScript categories to improve clarity and action intent.
- [Oct 17, 2024] #708: Update the ground truth for the REST category to be up-to-date with the latest API response structure.
- [Oct 16, 2024] #701: Fix a bug in the multi-turn function source code for TravelAPI.
- [Oct 16, 2024] #696: Add the following new models to the leaderboard:
- google/gemma-2-2b-it
- google/gemma-2-9b-it
- google/gemma-2-27b-it
- [Oct 16, 2024] #661: Bug fix in the dataset and possible answers.
- Irrelevance: 1 affected
- Parallel Multiple: 2 affected
- Live Simple: 104 affected
- Live Multiple: 547 affected
- Live Parallel: 11 affected
- Live Parallel Multiple: 17 affected
- [Oct 11, 2024] #667: Add the following new models to the leaderboard:
- MadeAgents/Hammer2.0-7b
- MadeAgents/Hammer2.0-3b
- MadeAgents/Hammer2.0-1.5b
- MadeAgents/Hammer2.0-0.5b
- [Oct 10, 2024] #621, #675: Add a basic command-line interface for ease of use.
- [Oct 5, 2024] #633: Add new model openbmb/MiniCPM3-4B to the leaderboard.
- [Oct 5, 2024] #642: Add the following new models to the leaderboard:
- Qwen/Qwen2.5-7B-Instruct
- Qwen/Qwen2.5-1.5B-Instruct
- Qwen/Qwen2-7B-Instruct
- Qwen/Qwen2-1.5B-Instruct
- [Oct 4, 2024] #653: Add new model Team-ACE/ToolACE-8B to the leaderboard.
- [Oct 4, 2024] #671: Speed up inference for locally hosted models by parallelizing the inference requests.
- [Sept 27, 2024] #640: Add the following new models to the leaderboard:
- microsoft/Phi-3.5-mini-instruct
- microsoft/Phi-3-medium-128k-instruct
- microsoft/Phi-3-medium-4k-instruct
- microsoft/Phi-3-small-128k-instruct
- microsoft/Phi-3-small-8k-instruct
- microsoft/Phi-3-mini-128k-instruct
- microsoft/Phi-3-mini-4k-instruct
- [Sept 25, 2024] #660: Bug fix in parse_nested_value function to handle nested dictionary values properly.
- [Sept 24, 2024] #657: Add the following new models to the leaderboard:
- meta-llama/Llama-3.2-1B-Instruct
- meta-llama/Llama-3.2-1B-Instruct-FC
- meta-llama/Llama-3.2-3B-Instruct
- meta-llama/Llama-3.2-3B-Instruct-FC
- meta-llama/Llama-3.1-8B-Instruct
- meta-llama/Llama-3.1-8B-Instruct-FC
- meta-llama/Llama-3.1-70B-Instruct
- meta-llama/Llama-3.1-70B-Instruct-FC
- [Sept 24, 2024] #648: Add the following new models to the leaderboard:
- gemini-1.5-pro-002
- gemini-1.5-pro-002-FC
- gemini-1.5-pro-001
- gemini-1.5-pro-001-FC
- gemini-1.5-flash-002
- gemini-1.5-flash-002-FC
- gemini-1.5-flash-001
- gemini-1.5-flash-001-FC
- gemini-1.0-pro-002
- gemini-1.0-pro-002-FC
- [Sept 19, 2024] #644: BFCL V3 release:
- Introduce new multi-turn dataset and state-based evaluation metric
- Separate ast_checker and executable_checker for readability
- Several outdated or deprecated models will be excluded from the leaderboard and replaced with their updated successors to improve the leaderboard's overall maintainability.
- Switch to use vllm serve for OSS model inference
- [Sept 13, 2024] #638: Fix prompt formatting issue for THUDM/glm-4-9b-chat.
- [Sept 12, 2024] #635: Add new models o1-preview-2024-09-12 and o1-mini-2024-09-12 to the leaderboard.
- [Sept 8, 2024] #627: Add new model MadeAgents/Hammer-7b to the leaderboard.
- [Sept 7, 2024] #626: Fix prompt format for Llama models.
- [Sept 4, 2024] #623: Fix decoding issue in the NvidiaHandler; remove duplicate ArcticHandler class.
- [August 29, 2024] #616: Add the following new models to the leaderboard:
- Salesforce/xLAM-7b-r
- Salesforce/xLAM-8x7b-r
- Salesforce/xLAM-8x22b-r
- [August 28, 2024] #565, #612: Package the BFCL pipeline for easier deployment and maintenance.
- [August 27, 2024] #608: Bug fix in the dataset and possible answers.
- simple: 16 affected
- multiple: 5 affected
- [August 23, 2024] #600: Bug fix in the dataset and possible answers.
- simple: 12 affected
- multiple: 3 affected
- parallel: 3 affected
- parallel multiple: 6 affected
- [August 22, 2024] #593:
- Move formatting instructions and function documentation to system prompt instead of user prompt in the message section. All prompting models are affected.
- Bug fix in the dataset and possible answers.
- irrelevance: 1 affected
- live_irrelevance: 1 affected
- live_simple: 1 affected
- live_parallel: 3 affected
- [August 19, 2024] #580: Introduce BFCL V2 Live dataset, featuring user-contributed live prompts and function docs. To read more about the composition and construction of this dataset, please refer to our blog. All CLI commands have been updated to support the new dataset.
- [August 8, 2024] #574: Set temperature to 0.001 for all models for consistency and reproducibility.
- [August 7, 2024] #571: Support parallel inference for hosted models. Users can specify the number of threads to use for parallel inference by setting the --num-threads flag. The default is 1, which means no parallel inference.
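A minimal sketch of what thread-based parallel inference could look like, assuming each request is an independent call to a hosted model. The names `run_inference` and `query_model` are hypothetical and are not taken from the BFCL code base; only the meaning of a thread count of 1 (no parallelism) comes from the entry above.

```python
from concurrent.futures import ThreadPoolExecutor


def run_inference(prompts, query_model, num_threads=1):
    """Send every prompt to query_model, optionally using several threads.

    query_model is a hypothetical callable standing in for a hosted-model
    API request; it is an assumption, not part of BFCL.
    """
    if num_threads <= 1:
        # A thread count of 1 means plain sequential inference.
        return [query_model(prompt) for prompt in prompts]
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        # pool.map returns results in the same order as the inputs.
        return list(pool.map(query_model, prompts))
```

Threads suit this case because the work is dominated by waiting on network responses rather than by local computation.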
- [August 6, 2024] #569, #570, #573: Add the following new models to the leaderboard:
- open-mistral-nemo-2407
- open-mistral-nemo-2407-FC
- open-mixtral-8x22b
- open-mixtral-8x22b-FC
- open-mixtral-8x7b
- gpt-4o-mini-2024-07-18
- gpt-4o-mini-2024-07-18-FC
- gpt-4o-2024-08-06
- gpt-4o-2024-08-06-FC
- meetkai/functionary-medium-v3.1-FC
- meetkai/functionary-small-v3.1-FC
- meetkai/functionary-small-v3.2-FC
- [August 5, 2024] #568: Rephrase the question prompt for the executable_parallel_function category to remove potentially misleading information implying multi-turn function calls.
- [August 4, 2024] #557: Bug fix in the possible answers.
- simple: 7 affected
- multiple function: 3 affected
- parallel function: 5 affected
- parallel multiple function: 6 affected
- executable parallel function: 1 affected
- javascript: 3 affected
- [July 26, 2024] #549: Fix js_type_converter.py to properly handle JavaScript array values inside dictionaries.
- [July 25, 2024] #532, #543, #556, #542: Add the following new models to the leaderboard:
- Salesforce/xLAM-7b-fc-r
- Salesforce/xLAM-1b-fc-r
- yi-large-fc
- NousResearch/Hermes-2-Pro-Llama-3-8B
- NousResearch/Hermes-2-Pro-Llama-3-70B
- NousResearch/Hermes-2-Theta-Llama-3-8B
- NousResearch/Hermes-2-Theta-Llama-3-70B
- [July 22, 2024] #540: Chore: Improve handling of vLLM's cleanup phase error by combining all selected test categories into one single task to submit to the vLLM server.
- [July 21, 2024] #538, #545: Fix the language_specific_pre_processing and convert_to_tool functions to properly handle pre-processing for prompts and function docs in Java and JavaScript test categories. All entries in these categories are affected.
- [July 20, 2024] #537: Update the generation script for locally hosted OSS models to use single-node multi-GPU inference (tensor parallel). Ray is no longer used.
- [July 16, 2024] #525, #536: Add new model ibm-granite/granite-20b-functioncalling to the leaderboard.
- [July 10, 2024] #522: Bug fix in the evaluation dataset for Executable Parallel Multiple category. This includes updates to both prompts and function docs. 2 entries are affected.
- [July 8, 2024] #516: Fix double-casting issue in model_handler for Java and JavaScript test categories.
- [July 7, 2024] #504, #505, #506, #508, #512, #517: Make BFCL user-friendly and easy to extend.
- [July 6, 2024] #423 and #503: Bug fix in possible answers for the AST evaluation dataset (parallel category: 14 affected; parallel_multiple category: 25 affected).
- [July 5, 2024] #496: Updates to API status checks. Checking the health of executable APIs is now off by default. Further, even when triggered, unhealthy APIs will not terminate the evaluation process. Users can enable this feature by setting the --api-sanity-check flag or -c for short. The previous --skip-api-sanity-check or -s flag is now deprecated.
- [July 3, 2024] #489: Add new model nvidia/nemotron-4-340b-instruct to the leaderboard.
- [July 2, 2024] #474: Add new model THUDM/glm-4-9b-chat to the leaderboard.
- [June 18, 2024] #470: Add new model firefunction-v2-FC to the leaderboard.
- [June 15, 2024] #437: Fix prompting issues for Nexusflow-Raven-v2 (FC).
- [June 7, 2024] #407, #462: Update the AST evaluation logic to allow the use of int values for Python parameters expecting float values. This is to accommodate the Python auto-conversion feature from int to float.
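The relaxed rule can be illustrated with a small type check. This is a sketch only: `type_matches` is a hypothetical helper, not the BFCL checker, and the explicit rejection of `bool` values is an added assumption (Python treats `bool` as a subclass of `int`).

```python
def type_matches(value, expected_type):
    """Return True if value is acceptable for a parameter of expected_type."""
    # Assumption: booleans are never accepted as numbers, even though
    # bool is a subclass of int in Python.
    if isinstance(value, bool):
        return expected_type is bool
    if expected_type is float:
        # An int is accepted where a float is expected, mirroring
        # Python's implicit int-to-float conversion.
        return isinstance(value, (int, float))
    return isinstance(value, expected_type)
```

With this rule, `type_matches(3, float)` is accepted while `type_matches(3.0, int)` is not, so the relaxation only goes in one direction.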
- [May 14, 2024] #426:
- Add the following new models to the leaderboard: gpt-4o-2024-05-13, gpt-4o-2024-05-13-FC, gemini-1.5-pro-preview-0514, gemini-1.5-flash-preview-0514.
- Update price for the following models: all Gemini series; Claude-2.1 (Prompt) and Claude-instant-1.2 (Prompt); Mistral-large and Mistral-Small; GPT-3.5-Turbo-0125.
- [May 8, 2024] #406 and #421: Update the gemini_handler.py to better handle parallel function calls for Gemini models.
- [May 6, 2024] #412: Bug fix in evaluation dataset for AST categories. This includes updates to both prompts and function docs.
- [May 2, 2024] #405: Bug fix in the possible answers for the AST Simple evaluation dataset. Prompt and function docs are not affected.
- [April 28, 2024] #397: Add new model snowflake/arctic to the leaderboard. Note that there are multiple ways to run inference with the model, and we chose to do it via the Nvidia API catalog.
- [April 27, 2024] #390: Bug fix in cost and latency calculation for open-source models, which are now all calculated when serving the model with vLLM using 8 V100 GPUs for consistency. $$\text{Cost} = \text{Latency per 1000 function calls} \times \frac{\text{8xV100 Azure pay-as-you-go price per hour}}{3600}$$
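The cost formula in the April 27 entry can be worked through as follows. The sketch assumes latency is measured in seconds (which the division by 3600 suggests), and the price used in the example is a made-up placeholder, not an actual Azure rate.

```python
def cost_per_1000_calls(latency_seconds_per_1000_calls, hourly_price_usd):
    """Cost of 1000 function calls on a machine billed by the hour.

    Assumes latency is given in seconds for 1000 calls.
    """
    price_per_second = hourly_price_usd / 3600
    return latency_seconds_per_1000_calls * price_per_second


# Hypothetical numbers for illustration only: 500 seconds per 1000 calls
# on a machine that costs 36 USD per hour.
example_cost = cost_per_1000_calls(500, 36.0)
```

With these placeholder values the hourly price works out to 0.01 USD per second, so 500 seconds of latency cost about 5 USD.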
- [April 25, 2024] #386: Add 5 new models to the leaderboard: meta-llama/Meta-Llama-3-8B-Instruct, meta-llama/Meta-Llama-3-70B-Instruct, gemini-1.5-pro-preview-0409, command-r-plus, command-r-plus-FC.
- [April 19, 2024] #377:
- Bug fix for the evaluation dataset in the executable test categories. This includes updates to both prompts and function docs.
- The evaluation_result field has been removed to accommodate the variability in API execution results across different evaluation runs. Instead, a human-verified ground_truth is now included for the executable test categories. During each evaluation run, evaluation_result is generated anew using the ground_truth, and then compared against the model output.
- A stricter metric has been adopted when using the structural_match (aka type match) evaluation criterion: for list results, the lengths are compared; for dict results, the keys are matched. This is to account for the fast-changing nature of some of the real-time API results while ensuring the evaluation remains meaningful.
- Added another evaluation criterion, real_time_match, for the executable category, which is a looser form of exact_match specifically for numerical execution results. The execution result must be within a certain percentage threshold (20%) of the expected result to accommodate the live updates of API responses. Users can change this threshold value in eval_checker_constant.py.
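The two criteria described in the April 19 entry can be sketched as below. These functions are illustrative and are not the BFCL implementation: the function signatures, the requirement that both results share the same type, and the handling of an expected value of zero are all assumptions.

```python
def structural_match(actual, expected):
    """Type match, plus length match for lists and key match for dicts."""
    if type(actual) is not type(expected):
        return False
    if isinstance(expected, list):
        return len(actual) == len(expected)
    if isinstance(expected, dict):
        return actual.keys() == expected.keys()
    return True


def real_time_match(actual, expected, threshold=0.2):
    """Numeric match within a relative threshold (20% by default)."""
    if expected == 0:
        # Assumption: with an expected value of zero, only zero matches.
        return actual == 0
    return abs(actual - expected) <= abs(expected) * threshold
```

For example, `real_time_match(110, 100)` passes because the result is within 20% of the expected value, while `real_time_match(130, 100)` does not.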
- [April 18, 2024] #375: A more comprehensive API sanity check is included; the APIs that are invoked during the non-REST executable evaluation process will also be checked for their availability before running the evaluation. Also, add support for the shortcut -s for the --skip-api-sanity-check flag, based on the community feedback.
- [April 16, 2024] #366: Switch to use Anthropic's new Tool Use Beta tools-2024-04-04 when generating Claude 3 FC series data. gpt-4-turbo-2024-04-09 and gpt-4-turbo-2024-04-09-FC are also added to the leaderboard.
- [April 11, 2024] #347: Add the 95th percentile latency to the leaderboard statistics. This metric is useful for understanding the latency distribution of the models, especially the worst-case scenario.
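As an illustration of the statistic, the sketch below computes a percentile with the nearest-rank method. The changelog does not state which percentile definition the leaderboard uses, so the method and the helper name `percentile_nearest_rank` are assumptions.

```python
import math


def percentile_nearest_rank(values, pct):
    """Return the pct-th percentile of values using the nearest-rank method."""
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)
    # Nearest rank: the smallest position that covers pct percent of the data.
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


# Hypothetical latencies in seconds; one slow outlier dominates the tail.
latencies = [0.2, 0.3, 0.25, 5.0]
p95 = percentile_nearest_rank(latencies, 95)
```

With only four samples, the 95th percentile is the slowest request, which shows how this metric surfaces worst-case behavior that a mean would dilute.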
- [April 10, 2024] #339: Introduce REST API sanity check for the REST executable test category. It ensures that all the API endpoints involved during the execution evaluation process are working properly. If any of them are not behaving as expected, the evaluation process will be stopped by default as the result will be inaccurate. Users can choose to bypass this check by setting the --skip-api-sanity-check flag or -s for short.
- [April 9, 2024] #338: Bug fix in the evaluation datasets (including both prompts and function docs). Bug fix for possible answers as well.
- [April 8, 2024] #330: Fixed an oversight that was introduced in #299. For function-calling (FC) models that cannot take the float type in input, when the parameter type is a float, the evaluation procedure will convert that type to number in the model input and mention in the parameter description that "This is a float type value." An additional field, format: float, will also be included in the model input to make the type clear. Updated the model handler for Claude, Mistral, and OSS to better parse the model output.
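The float-to-number conversion could look roughly like this. The sketch assumes function parameters are described by a JSON-Schema-like dictionary of properties; that shape and the helper name `convert_float_params` are assumptions and are not taken from the BFCL code.

```python
import copy


def convert_float_params(properties):
    """Return a copy of properties with float parameters rewritten as number.

    properties is assumed to map parameter names to dicts with optional
    "type" and "description" keys.
    """
    converted = copy.deepcopy(properties)
    for spec in converted.values():
        if spec.get("type") == "float":
            spec["type"] = "number"
            spec["format"] = "float"
            description = spec.get("description", "")
            # Append the note so the model still sees the intended type.
            spec["description"] = (description + " This is a float type value.").strip()
    return converted
```

The input is deep-copied so the original function documentation stays unchanged for models that do accept the float type.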
- [April 8, 2024] #327: Add new model NousResearch/Hermes-2-Pro-Mistral-7B to the leaderboard.
- [April 3, 2024] #309: Bug fix for evaluation dataset possible answers. Implement string standardization for the AST evaluation pipeline, i.e. removing white spaces and a subset of punctuation (,./-_*^) to make the AST evaluation more robust and accurate. Fixed AST evaluation issue for type tuple. Add 2 new models, meetkai/functionary-small-v2.4 (FC) and meetkai/functionary-medium-v2.4 (FC), to the leaderboard.
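The string standardization step can be sketched with a single regular expression. This is a minimal illustration, assuming standardization only strips characters; the helper name `standardize_string` is hypothetical, and whether the real pipeline applies further normalization is not stated in the entry.

```python
import re

# Whitespace plus the punctuation subset named in the entry: , . / - _ * ^
_STRIP_PATTERN = re.compile(r"[\s,./\-_*^]")


def standardize_string(text):
    """Remove whitespace and the listed punctuation characters from text."""
    return _STRIP_PATTERN.sub("", text)
```

Under this rule, a model answer of "New York, NY" and a ground truth of "New York NY" both reduce to "NewYorkNY" and compare as equal.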
- [April 1, 2024] #299: Leaderboard update with new models (Claude-3-Haiku, Databrick-DBRX-Instruct), more advanced AST evaluation procedure, and updated evaluation datasets. Cost and latency statistics during evaluation are also measured. We also released the manual that our evaluation procedure is based on, available here.
- [Mar 11, 2024] #254: Leaderboard update with 3 new models: Claude-3-Opus-20240229 (Prompt), Claude-3-Sonnet-20240229 (Prompt), and meetkai/functionary-medium-v2.2 (FC).
- [Mar 5, 2024] #237 and #238: Leaderboard update resulting from #223; 3 new models: mistral-large-2402, gemini-1.0-pro, and google/gemma-7b-it.
- [Feb 29, 2024] #223: modifications to REST evaluation.