# IFEval: Instruction-Following Evaluation for Large Language Models
Evaluates the ability to follow a set of “verifiable instructions” such as “write in more than 400 words” and “mention the keyword of AI at least 3 times”. Demonstrates custom scoring.
## Overview
IFEval is a benchmark that evaluates a model’s ability to follow verifiable natural-language instructions. This implementation uses an installable package from a fork of the official implementation.
## Usage
First, install the `inspect_ai` and `inspect_evals` Python packages with:

```bash
pip install inspect_ai
pip install git+https://github.com/UKGovernmentBEIS/inspect_evals
```
Then, evaluate against one or more models with:
```bash
inspect eval inspect_evals/ifeval --model openai/gpt-4o
```
After running evaluations, you can view their logs using the `inspect view` command:

```bash
inspect view
```
If you don’t want to specify the `--model` each time you run an evaluation, create a `.env` configuration file in your working directory that defines the `INSPECT_EVAL_MODEL` environment variable along with your API key. For example:

```bash
INSPECT_EVAL_MODEL=anthropic/claude-3-5-sonnet-20240620
ANTHROPIC_API_KEY=<anthropic-api-key>
```
## Options
You can control a variety of options from the command line. For example:
```bash
inspect eval inspect_evals/ifeval --limit 10
inspect eval inspect_evals/ifeval --max-connections 10
inspect eval inspect_evals/ifeval --temperature 0.5
```
See `inspect eval --help` for all available options.
## Dataset
Here is an example prompt from the dataset:
> Write a 300+ word summary of the wikipedia page “https://en.wikipedia.org/wiki/Raymond_III,_Count_of_Tripoli”. Do not use any commas and highlight at least 3 sections that has titles in markdown format, for example highlighted section part 1, highlighted section part 2, highlighted section part 3.
The model then performs a simple generation task to complete the sequence.
## Scoring
Once a generation is completed, the sequence is tested to evaluate the model’s instruction-following performance. There are 25 verifiable instructions in total, which are explored further in the paper.
The example prompt above is accompanied by three verifiable instructions, though this number varies depending on the data point: `["punctuation:no_comma", "detectable_format:number_highlighted_sections", "length_constraints:number_words"]`.
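To make these concrete, here is a minimal sketch of what checkers for the three instruction IDs above could look like. The function names, the `*...*` highlight convention, and the thresholds are illustrative assumptions; in the actual package each instruction is implemented with its own per-sample arguments.

```python
import re

def no_comma(resp: str) -> bool:
    # punctuation:no_comma -- the response must contain no commas.
    return "," not in resp

def number_highlighted_sections(resp: str, min_sections: int = 3) -> bool:
    # detectable_format:number_highlighted_sections -- count markdown
    # *highlighted* spans (an assumed convention for "highlighted section").
    return len(re.findall(r"\*[^*\n]+\*", resp)) >= min_sections

def number_words(resp: str, min_words: int = 300) -> bool:
    # length_constraints:number_words -- require at least min_words words.
    return len(resp.split()) >= min_words
```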
The evaluation then measures accuracy across prompt-level and instruction-level hierarchies, as well as strict and loose accuracy criteria. We define these terms more concretely below:
### Strict Accuracy
For a given response and a single instruction, this is simply
$$
\text{is\_followed}(\text{resp}, \text{inst}) = \begin{cases}
\text{True,} & \text{if the instruction is followed.} \\
\text{False,} & \text{otherwise.}
\end{cases}
$$
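In code, the strict check simply applies an instruction’s checker to the raw response text. A minimal sketch, reusing the hypothetical checkers sketched above:

```python
# Map instruction IDs to the hypothetical checkers sketched earlier.
CHECKERS = {
    "punctuation:no_comma": no_comma,
    "detectable_format:number_highlighted_sections": number_highlighted_sections,
    "length_constraints:number_words": number_words,
}

def is_followed(resp: str, inst: str) -> bool:
    # Strict: evaluate the instruction against the response as-is.
    return CHECKERS[inst](resp)
```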
### Loose Accuracy
In this case, we define it using
$$
\text{is\_followed}_{\text{loose}}(\text{resp}, \text{inst}) = \text{Any}\left(\text{is\_followed}(\text{transform}_t(\text{resp}), \text{inst}) \text{ for } t = 1, 2, \ldots\right)
$$
Here, there are a few transformations defined for the input response:
- Remove commonly seen font modifiers in the markdown syntax, especially `*` and `**`.
- Remove the first line of the response, so that we skip intros like “Sure, here it is:”.
- Remove the last line of the response, so that we skip outros like “Hope it helps.”.
We take the powerset of these transformations (including the identity transformation) to have a total of eight transformations that we apply to the input response.
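A minimal sketch of the loose check under these assumptions, reusing `is_followed` from above; iterating over the powerset of the three transformations (the empty subset being the identity) yields the eight variants:

```python
from itertools import combinations

def remove_font_modifiers(resp: str) -> str:
    # Strip the markdown emphasis markers ** and *.
    return resp.replace("**", "").replace("*", "")

def remove_first_line(resp: str) -> str:
    return "\n".join(resp.split("\n")[1:])

def remove_last_line(resp: str) -> str:
    return "\n".join(resp.split("\n")[:-1])

TRANSFORMS = [remove_font_modifiers, remove_first_line, remove_last_line]

def is_followed_loose(resp: str, inst: str) -> bool:
    # Try every subset of the transforms (2**3 = 8 variants, including
    # the identity) and pass if any transformed response passes.
    for r in range(len(TRANSFORMS) + 1):
        for subset in combinations(TRANSFORMS, r):
            variant = resp
            for transform in subset:
                variant = transform(variant)
            if is_followed(variant, inst):
                return True
    return False
```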
### Prompt-level
Per input response, there is only a single boolean output evaluating accuracy (in both the strict and loose cases).
### Instruction-level
This one enumerates over the instructions, as opposed to the input responses. Continuing from our running example, the output for the `["punctuation:no_comma", "detectable_format:number_highlighted_sections", "length_constraints:number_words"]` instruction set would be `[True, True, True]` if all the instructions pass. The accuracy is then calculated over all instructions across the entire dataset.
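As a sketch of how the two hierarchies aggregate over a dataset, assume each sample stores its per-instruction booleans under `"strict"` and `"loose"` keys (an illustrative layout, not the package’s internal one):

```python
def mean(xs):
    return sum(xs) / len(xs)

def prompt_level_accuracy(samples, key):
    # One boolean per response: every instruction must be followed.
    return mean([all(sample[key]) for sample in samples])

def instruction_level_accuracy(samples, key):
    # One boolean per instruction, pooled across the whole dataset.
    return mean([followed for sample in samples for followed in sample[key]])
```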
### Final Accuracy
This is simply the average of prompt-level-strict, prompt-level-loose, instruction-level-strict, and instruction-level-loose.
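Continuing the sketch above, the headline score is just the mean of those four numbers:

```python
def final_accuracy(samples):
    return mean([
        prompt_level_accuracy(samples, "strict"),
        prompt_level_accuracy(samples, "loose"),
        instruction_level_accuracy(samples, "strict"),
        instruction_level_accuracy(samples, "loose"),
    ])
```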