IFEval: Instruction-Following Evaluation

Reasoning

Evaluates how well language models can strictly follow detailed instructions, such as writing responses with specific word counts or including required keywords.

Contributed By

@adil-a

Code

src/inspect_evals/ifeval

Paper

https://arxiv.org/abs/2311.07911

Overview

IFEval is a benchmark to evaluate a model’s ability to follow automatically verifiable natural language instructions. This implementation uses an installable package from a fork of the official implementation.

Usage

Installation

There are two ways of using Inspect Evals, from pypi as a dependency of your own project and as a standalone checked out GitHub repository.

If you are using it from pypi, install the package and its dependencies via:

pip install inspect-evals

If you are using Inspect Evals in its repository, start by installing the necessary dependencies with:

uv sync --group ifeval

Note the --group flag. The vast majority of our evals use uv sync with --extra. This eval has a dependency on a git URL, so it’s managed differently.

Running evaluations

Now you can start evaluating models. For simplicity’s sake, this section assumes you are using Inspect Evals from the standalone repo. If that’s not the case and you are not using uv to manage dependencies in your own project, you can use the same commands with uv run dropped.

uv run inspect eval inspect_evals/ifeval --model openai/gpt-5-nano

You can also import tasks as normal Python objects and run them from python:

from inspect_ai import eval
from inspect_evals.ifeval import ifeval
eval(ifeval)

After running evaluations, you can view their logs using the inspect view command:

uv run inspect view

For VS Code, you can also download Inspect AI extension for viewing logs.

If you don’t want to specify the --model each time you run an evaluation, create a .env configuration file in your working directory that defines the INSPECT_EVAL_MODEL environment variable along with your API key. For example:

INSPECT_EVAL_MODEL=anthropic/claude-opus-4-1-20250805
ANTHROPIC_API_KEY=<anthropic-api-key>

Options

You can control a variety of options from the command line. For example:

uv run inspect eval inspect_evals/ifeval --limit 10
uv run inspect eval inspect_evals/ifeval --max-connections 10
uv run inspect eval inspect_evals/ifeval --temperature 0.5

See uv run inspect eval --help for all available options.

Dataset

Here is an example prompt from the dataset:

Write a 300+ word summary of the wikipedia page “https://en.wikipedia.org/wiki/Raymond_III,_Count_of_Tripoli”. Do not use any commas and highlight at least 3 sections that has titles in markdown format, for example highlighted section part 1, highlighted section part 2, highlighted section part 3.

The model then completes a simple generation task to complete the sequence.

Scoring

Once a generation is completed, the sequence is tested to evaluate the model’s instruction following performance. There are a total of 25 verifiable instructions which can be further explored in the paper.

The example prompt above comes accompanied with three verifiable instructions, though this number varies depending on the data point. These are ["punctuation:no_comma", "detectable_format:number_highlighted_sections", "length_constraints:number_words"].

The evaluation then measures accuracy across prompt-level and instruction-level hierarchies as well as a strict and loose accuracy criteria. We define these terms more concretely below:

Strict Accuracy

For a given response and a single instruction, this is simply

\text{is\_followed}(\text{resp}, \text{inst}) = \begin{cases}
\text{True,} & \text{if instruction is followed.} \\
\text{False,} & \text{otherwise.}
\end{cases}

Loose Accuracy

In this case, we define it using

\text{is\_followed}_{\text{loose}}(\text{resp}, \text{inst}) = \text{Any}\left(\text{is\_followed}(\text{transform}_t(\text{resp}), \text{inst}) \text{ for } t = 1, 2, \ldots\right)

Here, there are a few transformations defined for the input response:

Remove commonly seen font modifiers in the markdown syntax, especially * and **.
Remove the first line of the response, so that we skip intros like “Sure, here it is:”.
Remove the last line of the response, so that we skip outros like “Hope it helps.”.

We take the powerset of these transformations (including the identity transformation) to have a total of eight transformations that we apply to the input response.

Prompt-level

Per input response, we will only have a single boolean output evaluating the accuracy (both in the case of strict and loose accuracy).

Instruction-level

This one enumerates over the instructions as opposed the input responses. Continuing from our running example, the output for the ["punctuation:no_comma", "detectable_format:number_highlighted_sections", "length_constraints:number_words"] instruction set would be [True, True, True] if all the instructions pass. The accuracy is then calculated over all instructions and the entire dataset.

Final Accuracy

This is simply an average of prompt-level-strict, prompt-level-loose, instruction-level-strict, instruction-level-loose.

Standard Errors

Standard errors are computed naively for the the prompt-level scores, and clustered on prompts for the instruction-level scores.

The Final Accuracy score presents an interesting analytic challenge for computing standard errors, as it sums across both prompt-level and instruction-level scores. Here we develop the analytic machinery for it.

Let \(s_{p}\) represent the prompt-level score (average of loose and strict) on prompt \(p\), and let \(s_{p, i}\) represent the instruction-level score (average of loose and strict) of instruction \(i\) within prompt \(p\). Let \(n_p\) represent the number of instructions associated with prompt \(p\), let \(P\) represent the number of prompts, and let

\bar{n}=\frac{1}{P}\sum_{p=1}^P n_p

that is, the average value of \(n_p\). Then we can construct a final accuracy as an average of the prompt-level scores and instruction-level scores as

FA=\frac{1}{2}\left(\frac{1}{P} \sum_{p=1}^P s_p\right) + \frac{1}{2}\left(\frac{1}{P\bar{n}} \sum_{p=1}^P \sum_{i=1}^{n_p} s_{p, i}\right)

This expression can be rearranged as

FA=\frac{1}{P\bar{n}} \sum_{p=1}^P \sum_{i=1}^{n_p} \frac{1}{2}\left(s_{p, i} + \frac{\bar{n}}{n_p} s_p\right)

In other words, we can add a weighted version of the prompt-level score to each instruction-level score and treat our unit of observation as the prompt-adjusted instruction score

\frac{1}{2}\left(s_{p, i} + \frac{\bar{n}}{n_p} s_p\right)

Since the final accuracy is the instruction-level mean of this score, a correct standard error can be computed by clustering this term by the prompt.