# BigCodeBench: Benchmarking Code Generation with Diverse Function Calls and Complex Instructions
Python coding benchmark with 1,140 diverse questions drawing on numerous Python libraries.
## Overview
BigCodeBench is a coding benchmark with a special focus on models’ use of Python libraries.
Currently, this eval pulls from version v0.1.2 of the dataset on Hugging Face, which has a total of 1,140 questions. As of Nov 25, 2024, gpt-4o-mini gets 55.4% of them correct in this Inspect implementation¹ (compared to 57.4% on the official leaderboard). Running gpt-4o-mini took a total of 372,319 input tokens and 799,339 output tokens.
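If you want to browse the raw records outside of the eval, the dataset can be loaded directly with the Hugging Face `datasets` library. The dataset id `bigcode/bigcodebench` and the `v0.1.2` split name are assumptions here; check the dataset page for the exact names. A minimal sketch:

```python
# Minimal sketch for browsing the raw dataset. The dataset id and split name
# are assumptions; verify them on the Hugging Face dataset page.
from datasets import load_dataset

dataset = load_dataset("bigcode/bigcodebench", split="v0.1.2")
print(len(dataset))       # expected: 1140 records
record = dataset[0]
print(record["task_id"])  # e.g. "BigCodeBench/0"
print(record.keys())      # fields include the complete and instruct prompts and the unit tests
```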
This implementation borrows heavily from the HumanEval implementation in Inspect. The biggest difference is that it requires access to several Python packages, which are installed through the Dockerfile. The Dockerfile in this folder deviates from the official Dockerfiles by installing tensorflow without pinning a version, to keep it platform-independent.
Instructions for BigCodeBench come in two versions: `complete_prompt` and `instruct_prompt`. The complete prompt contains package imports and a Python function header with some examples.
Example complete prompt, taken from BigCodeBench/1116:
```python
import random
import statistics
# Constants
AGE_RANGE = (22, 60)

def task_func(dict1):
    """
    Calculate the mean, the median, and the mode(s) of the age of the employees in the department "EMP$$."
    Generate random ages for each employee within the range [22, 60].

    Parameters:
    dict1 (dict): A dictionary with department codes as keys and number of employees
    as values.

    Returns:
    tuple: A tuple of mean, median, and a list of mode(s) of employee ages.

    Requirements:
    - random
    - statistics

    Example:
    >>> random.seed(0)
    >>> d = {'EMP$$': 10, 'MAN$$': 5, 'DEV$$': 8, 'HR$$': 7}
    >>> stats = task_func(d)
    >>> print(stats)
    (44.7, 46.5, [46, 48, 24, 38, 54, 53, 47, 41, 52, 44])
    """
```
And the instruct version:
Calculate the mean, the median, and the mode(s) of the age of the employees in the department “EMP$$.” Generate random ages for each employee within the range [22, 60].
The function should output with: tuple: A tuple of mean, median, and a list of mode(s) of employee ages. You should write self-contained code starting with:
```python
import random
import statistics
# Constants
AGE_RANGE = (22, 60)
def task_func(dict1):
```
Since the complete prompt contains examples not seen in the instruct prompt, this implementation always uses the complete prompt, prefaced with the following (customizable) instruction prompt:
> Read the following function signature and docstring, and fully implement the function described. Make sure to include the import statements and use the same variable names as stated in the header.
By default, this evaluation checks the pass@1 rate (i.e., the model has one chance of getting the question correct).
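For reference, pass@1 with a single sample per task is simply the fraction of tasks whose completion passes the unit tests; the general unbiased pass@k estimator from the Codex paper reduces to that in this case. A minimal sketch:

```python
# Unbiased pass@k estimator (Chen et al., 2021):
# pass@k = 1 - C(n - c, k) / C(n, k), where n samples are drawn per task
# and c of them pass the unit tests. With n = k = 1 this is plain accuracy.
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

assert pass_at_k(n=1, c=1, k=1) == 1.0
assert pass_at_k(n=1, c=0, k=1) == 0.0
```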
## Usage

First, install the `inspect_ai` and `inspect_evals` Python packages with:

```bash
pip install inspect_ai
pip install git+https://github.com/UKGovernmentBEIS/inspect_evals
```
Then, evaluate against one or more models with:

```bash
inspect eval inspect_evals/bigcodebench --model openai/gpt-4o
```
This will automatically set up the Docker containers, which takes around 12 minutes. We recommend building the Docker environment first (it will then be cached) before running the evaluation:
```bash
cd src/inspect_evals/bigcodebench
docker compose build
```
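You can also drive the evaluation from Python rather than the shell using Inspect's `eval()` API. In the sketch below, the `inspect_evals.bigcodebench` import path and the `instruction_prompt` parameter are assumptions based on the "customizable" instruction prompt noted above; check the task source for the exact names.

```python
# Sketch of running the eval from Python instead of the CLI. The import path
# and the instruction_prompt argument are assumptions; confirm them against
# the task source before relying on this.
from inspect_ai import eval
from inspect_evals.bigcodebench import bigcodebench

eval(
    bigcodebench(
        instruction_prompt=(
            "Read the following function signature and docstring, and fully "
            "implement the function described."
        ),
    ),
    model="openai/gpt-4o-mini",
    limit=10,  # evaluate a small subset first
)
```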
After running evaluations, you can view their logs using the `inspect view` command:

```bash
inspect view
```
If you don’t want to specify the `--model` each time you run an evaluation, create a `.env` configuration file in your working directory that defines the `INSPECT_EVAL_MODEL` environment variable along with your API key. For example:

```bash
INSPECT_EVAL_MODEL=anthropic/claude-3-5-sonnet-20240620
ANTHROPIC_API_KEY=<anthropic-api-key>
```
## Options
You can control a variety of options from the command line. For example:
```bash
inspect eval inspect_evals/bigcodebench --limit 10
inspect eval inspect_evals/bigcodebench --max-connections 10
inspect eval inspect_evals/bigcodebench --temperature 0.5
```

See `inspect eval --help` for all available options.
## Scoring
We run the model’s submitted answer in a Docker sandbox and check whether it passes the unit tests in the BigCodeBench dataset.
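Concretely, the scorer's job is roughly: take the model's completed function, append the dataset's unit tests, and execute the combined program inside the Docker sandbox, marking the sample correct only if the process exits cleanly. The snippet below is a simplified sketch of that idea, not the exact scorer code:

```python
# Simplified sketch of the scoring idea (not the exact scorer implementation):
# append the dataset's unit tests to the submitted code and run the combined
# program inside the sandbox, passing iff the process exits successfully.
from inspect_ai.util import sandbox

VERIFY_TIMEOUT = 120  # illustrative timeout in seconds

async def passes_tests(submitted_code: str, test_code: str) -> bool:
    program = f"{submitted_code}\n\n{test_code}\n"
    result = await sandbox().exec(
        cmd=["python", "-c", program],
        timeout=VERIFY_TIMEOUT,
    )
    return result.success
```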
## Errata
As with all evaluation datasets, there are mistakes in the data (see, e.g., the issues with the HumanEval dataset here). As of Nov 26, 2024 (v0.1.2), we’ve found the following potential bugs:
- Task 111: The provided answer does not pass all test cases when run locally.
- Task 227: The libsndfile C library appears to be required for this task to work; it currently fails with an import error.
- Task 917: It looks plausible that the code produces floating-point discrepancies that get flagged as actual errors. Running gpt-4o’s code results in:
```
FAIL: test_case_1 (__main__.TestCases)
Traceback (most recent call last):
  File "<string>", line 85, in test_case_1
AssertionError: 110.99401141660307 != 110.99999292499156 within 2 places (0.0059815083884871 difference)
======================================================================
FAIL: test_case_2 (__main__.TestCases)
Traceback (most recent call last):
  File "<string>", line 106, in test_case_2
AssertionError: 211.99463480210375 != 211.99999982088116 within 2 places (0.005365018777411024 difference)
======================================================================
FAIL: test_case_4 (__main__.TestCases)
Traceback (most recent call last):
  File "<string>", line 147, in test_case_4
AssertionError: 412.9939192324403 != 412.9999964967954 within 2 places (0.006077264355099032 difference)
```
## Footnotes
¹ A slightly different instruction prompt was used for this evaluation.↩︎