NoveltyBench: Evaluating Language Models for Humanlike Diversity
Evaluates how well language models generate diverse, humanlike responses to prompts involving randomness, subjectivity, and creativity, assessing whether LLMs can produce varied outputs rather than repetitive or uniform answers.
Overview
Language models have demonstrated remarkable capabilities on standard benchmarks, yet they increasingly suffer from mode collapse: the inability to generate diverse and novel outputs. NoveltyBench is a benchmark for measuring how well language models can generate multiple differing but still correct answers to user requests that involve subjectivity, randomness, and creativity. It is intended to complement existing quality-based evaluation benchmarks, encouraging model developers to strive toward more pluralistic models while still taking generation quality into account.
Usage
Installation
There are two ways of using Inspect Evals: from PyPI as a dependency of your own project, or as a standalone checked-out GitHub repository.
If you are using it from PyPI, install the package and its dependencies via:

```bash
pip install inspect-evals[novelty_bench]
```

If you are using Inspect Evals from its repository, start by installing the necessary dependencies with:
```bash
uv sync --extra novelty_bench
```

Running evaluations
Now you can start evaluating models. For simplicity’s sake, this section assumes you are using Inspect Evals from the standalone repo. If that’s not the case and you are not using uv to manage dependencies in your own project, you can use the same commands with uv run dropped.
```bash
uv run inspect eval inspect_evals/novelty_bench --model openai/gpt-5-nano
```

You can also import tasks as normal Python objects and run them from Python:
```python
from inspect_ai import eval
from inspect_evals.novelty_bench import novelty_bench

eval(novelty_bench)
```

After running evaluations, you can view their logs using the `inspect view` command:
```bash
uv run inspect view
```

For VS Code, you can also install the Inspect AI extension for viewing logs.
If you don’t want to specify the --model each time you run an evaluation, create a .env configuration file in your working directory that defines the INSPECT_EVAL_MODEL environment variable along with your API key. For example:
```bash
INSPECT_EVAL_MODEL=anthropic/claude-opus-4-1-20250805
ANTHROPIC_API_KEY=<anthropic-api-key>
```

Options
You can control a variety of options from the command line. For example:
```bash
uv run inspect eval inspect_evals/novelty_bench --limit 10
uv run inspect eval inspect_evals/novelty_bench --max-connections 10
uv run inspect eval inspect_evals/novelty_bench --temperature 0.5
```

See `uv run inspect eval --help` for all available options.
Parameters
novelty_bench
- `dataset_split` (Literal['curated', 'wildchat']): Which dataset split to use: "curated" (100 samples) or "wildchat" (1000 samples). (default: 'curated')
- `num_generations` (int): Number of independent generations per prompt for diversity evaluation. (default: 10)
- `equivalence_threshold` (float): Classifier threshold for determining functional equivalence between responses. The higher the threshold, the larger the difference between two responses must be for them to be judged as non-equivalent. Defaults to 0.102, taken from the original implementation. (default: 0.102)
- `quality_model` (Literal['small', 'large']): Which reward model to use for quality scoring: "small" for Skywork-Reward-V2-Qwen3-1.7B (default, faster) or "large" for Skywork-Reward-Gemma-2-27B-v0.2 (as in the original implementation). (default: 'small')
- `shuffle` (bool): Whether to randomly shuffle the dataset order. (default: True)
- `quality_model_device` (str | None): Device for the quality/reward model. None selects a device automatically (avoids MPS in threads, otherwise auto). Examples: "cpu", "cuda", "cuda:0", "mps". (default: None)
- `partition_model_device` (str | None): Device for the partition/classifier model. None selects a device automatically. Examples: "cpu", "cuda", "cuda:0", "mps". (default: None)
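These task parameters can be overridden when running the eval; from the command line they can be passed as task arguments with `inspect eval`'s `-T` flag (e.g. `-T dataset_split=wildchat`). Below is a minimal sketch using the Python API; the override values are arbitrary examples, not recommended settings.

```python
from inspect_ai import eval
from inspect_evals.novelty_bench import novelty_bench

# Example parameter overrides only; see the list above for meanings and defaults.
eval(novelty_bench(dataset_split="wildchat", num_generations=5, quality_model="large"))
```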
Dataset
1,100 prompts designed to elicit diverse human responses
NBCURATED (100 prompts)
Manually curated prompts across 4 categories, each annotated with responses from 8 human annotators:
- Randomness: Prompts involving random selection (e.g., “Roll a make-believe 20-sided die”)
- Factual Knowledge: Underspecified factual questions with multiple valid answers (e.g., “List a capital city in Africa”)
- Creative Writing: Poetry, riddles, and story prompts (e.g., “Tell me a riddle”)
- Subjectivity: Opinion-based questions (e.g., “What’s the best car to get in 2023?”)
Note: Annotator responses represent a lower bound on human diversity due to the homogeneous annotator pool.
NBWILDCHAT (1,000 prompts)
Automatically curated from WildChat (1M real ChatGPT conversations) using:
- Llama Guard 3 filtering for inappropriate content
- IP-based deduplication for topic diversity
- GPT-4o classification to select prompts likely to elicit diverse responses
Scoring
Metrics
The evaluation measures both diversity and quality of model responses using two metrics:
- `distinct_k`: Number of functionally unique responses among k generations. We generate k outputs for every prompt/sample, group them by functional equivalence (see the definition below), then count how many distinct response classes there are. A higher `distinct_k` score indicates that the model produces more diverse answers rather than slightly reworded duplicates.
- `utility_k`: Cumulative utility combining novelty and quality, weighted by a user patience model (p = 0.8). It is calculated as a weighted sum in which each novel generation (one that is functionally different from all previous ones) contributes its quality score, weighted by p^(i-1) to account for decreasing user patience; only truly novel generations (those in distinct equivalence classes) contribute (see the sketch below). A higher `utility_k` value means that a model produces useful and diverse responses.
Functional equivalence is defined as: two generations are considered equivalent only if a user who saw one would gain no benefit from seeing the other.
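To make the two metrics concrete, here is a minimal computational sketch. It assumes the equivalence-class labels and per-generation quality scores (utility values in 1–10) have already been computed, and that the patience discount p^(i-1) is indexed by generation position as described above.

```python
from typing import Sequence


def distinct_k(class_labels: Sequence[int]) -> int:
    """Number of functionally distinct responses among the k generations."""
    return len(set(class_labels))


def utility_k(
    class_labels: Sequence[int],
    quality: Sequence[float],
    patience: float = 0.8,
) -> float:
    """Patience-weighted cumulative utility: only the first generation in each
    equivalence class contributes, discounted by patience**(i - 1)."""
    seen: set[int] = set()
    total = 0.0
    for i, (label, q) in enumerate(zip(class_labels, quality), start=1):
        if label not in seen:  # novel: first member of its equivalence class
            total += (patience ** (i - 1)) * q
            seen.add(label)
    return total


# Toy example: generations 1 and 3 are functionally equivalent.
labels = [0, 1, 0, 2]
scores = [8, 7, 8, 6]  # utility values in 1..10
print(distinct_k(labels))          # 3
print(utility_k(labels, scores))   # 8 + 0.8 * 7 + 0.8**3 * 6
```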
Scoring Method
- A fine-tuned DeBERTa-v3-large classifier partitions outputs into equivalence classes by comparing each new generation against a random sample from existing classes (see the sketch after this list). The classifier was fine-tuned on 1,000 human-annotated pairs of language model outputs (responses to the same prompts from NBCURATED and NBWILDCHAT), learning to predict whether two outputs are functionally equivalent or meaningfully different.
- Quality is scored using a Skywork reward model (Skywork-Reward-Gemma-2-27B-v0.2 in the original implementation; see the `quality_model` parameter above), with rewards mapped to utility values {1, 2, …, 10} via MT-Bench calibration.
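The partitioning step can be pictured with the following sketch. It is illustrative only, not the benchmark's actual code: `diff_score` stands in for the DeBERTa classifier, and the convention that scores below the threshold mean "equivalent" is an assumption chosen to be consistent with the `equivalence_threshold` description above.

```python
import random
from typing import Callable, List


def partition(
    generations: List[str],
    diff_score: Callable[[str, str], float],  # stand-in for the DeBERTa classifier
    threshold: float = 0.102,
) -> List[int]:
    """Greedily assign each generation to an equivalence class (returned as labels)."""
    classes: List[List[str]] = []  # members of each existing class
    labels: List[int] = []
    for gen in generations:
        assigned = None
        for idx, members in enumerate(classes):
            # Compare against a random member of the class, mirroring the
            # "random sample from existing classes" comparison.
            if diff_score(gen, random.choice(members)) < threshold:
                assigned = idx  # below threshold: treated as functionally equivalent
                break
        if assigned is None:  # novel response: start a new equivalence class
            classes.append([gen])
            labels.append(len(classes) - 1)
        else:
            classes[assigned].append(gen)
            labels.append(assigned)
    return labels
```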
Models are compared on average distinct generations (out of 10 samples) and cumulative utility scores (maximum 10), where higher scores indicate better performance at producing meaningful alternatives.
Evaluation Results
The eval was run for the gpt-4o-mini model on both splits of the dataset, using k=10 generations and the small quality scoring model.
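A sketch of how such a run could be reproduced is shown below; beyond the settings stated above, the exact configuration of the reported run is an assumption.

```python
from inspect_ai import eval
from inspect_evals.novelty_bench import novelty_bench

# Run both dataset splits against gpt-4o-mini with k=10 and the small reward model.
for split in ("curated", "wildchat"):
    eval(
        novelty_bench(dataset_split=split, num_generations=10, quality_model="small"),
        model="openai/gpt-4o-mini",
    )
```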
| Metric | curated dataset | wildchat dataset |
|---|---|---|
| distinct_k | 2.86 | 2.5936 |
| utility_k | 3.234 | 3.5822 |
(Results from GPT-4o Mini, Dec 2025)
The paper only provides overall results for both datasets. The combined results are computed using a weighted average based on each split size: (curated_result * 100 + wildchat_result * 1000) / 1100.
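As a quick check, the combined numbers in the comparison table below follow from that weighted average:

```python
# Weight per-split results by split size (100 curated, 1000 wildchat prompts).
def combine(curated: float, wildchat: float) -> float:
    return (curated * 100 + wildchat * 1000) / 1100


print(round(combine(2.86, 2.5936), 2))    # distinct_k -> 2.62
print(round(combine(3.234, 3.5822), 2))   # utility_k  -> 3.55
```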
The `distinct_k` results match the paper much more closely than the `utility_k` results do. This is expected, as our eval run and the paper use different quality scoring models.
| Metric | inspect_eval results | Paper results |
|---|---|---|
| distinct_k | 2.62 | 2.65 |
| utility_k | 3.55 | 3.11 |
(Comparing results on GPT-4o Mini)
Ref: https://arxiv.org/pdf/2504.05228 Table 1