NoveltyBench: Evaluating Language Models for Humanlike Diversity
Evaluates how well language models generate diverse, humanlike responses to prompts involving randomness, subjectivity, and creativity, assessing whether LLMs can produce varied outputs rather than repetitive or uniform answers.
Overview
Language models have demonstrated remarkable capabilities on standard benchmarks, yet they increasingly suffer from mode collapse: the inability to generate diverse and novel outputs. NoveltyBench is a benchmark for measuring how well language models can generate multiple differing but still correct answers to user requests that involve subjectivity, randomness, and creativity. It is intended to complement existing quality-based evaluation benchmarks, encouraging model developers to strive toward more pluralistic models while still taking generation quality into account.
Usage
Installation
There are two ways of using Inspect Evals: from PyPI as a dependency of your own project, or as a standalone checked-out GitHub repository.
If you are using it from PyPI, install the package and its dependencies via:

```bash
pip install inspect-evals[novelty_bench]
```

If you are using Inspect Evals from its repository, start by installing the necessary dependencies with:
```bash
uv sync --extra novelty_bench
```

Running evaluations
Now you can start evaluating models. For simplicity’s sake, this section assumes you are using Inspect Evals from the standalone repo. If that’s not the case and you are not using uv to manage dependencies in your own project, you can use the same commands with uv run dropped.
```bash
uv run inspect eval inspect_evals/novelty_bench --model openai/gpt-5-nano
```

You can also import tasks as normal Python objects and run them from Python:
```python
from inspect_ai import eval
from inspect_evals.novelty_bench import novelty_bench

eval(novelty_bench)
```

After running evaluations, you can view their logs using the `inspect view` command:
```bash
uv run inspect view
```

For VS Code, you can also install the Inspect AI extension for viewing logs.
If you don’t want to specify the --model each time you run an evaluation, create a .env configuration file in your working directory that defines the INSPECT_EVAL_MODEL environment variable along with your API key. For example:
```bash
INSPECT_EVAL_MODEL=anthropic/claude-opus-4-1-20250805
ANTHROPIC_API_KEY=<anthropic-api-key>
```

Options
You can control a variety of options from the command line. For example:
```bash
uv run inspect eval inspect_evals/novelty_bench --limit 10
uv run inspect eval inspect_evals/novelty_bench --max-connections 10
uv run inspect eval inspect_evals/novelty_bench --temperature 0.5
```

See `uv run inspect eval --help` for all available options.
Parameters
novelty_bench
- `dataset_split` (Literal['curated', 'wildchat']): Which dataset split to use: "curated" (100 samples) or "wildchat" (1000 samples). (default: 'curated')
- `num_generations` (int): Number of independent generations per prompt for diversity evaluation. (default: 10)
- `equivalence_threshold` (float): Classifier threshold for determining functional equivalence between responses. The higher the threshold, the larger the difference between two responses must be for them to be judged as non-equivalent. Defaults to 0.102, taken from the original implementation. (default: 0.102)
- `quality_model` (Literal['small', 'large']): Which reward model to use for quality scoring: "small" for Skywork-Reward-V2-Qwen3-1.7B (default, faster) or "large" for Skywork-Reward-Gemma-2-27B-v0.2 (as in the original implementation). (default: 'small')
- `shuffle` (bool): Whether to randomly shuffle the dataset order. (default: True)
- `quality_model_device` (str | None): Device for the quality/reward model. None selects a device automatically (avoids MPS in threads, otherwise auto). Examples: "cpu", "cuda", "cuda:0", "mps". (default: None)
- `partition_model_device` (str | None): Device for the partition/classifier model. None selects a device automatically. Examples: "cpu", "cuda", "cuda:0", "mps". (default: None)
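These task parameters can be overridden when running the eval; from the command line they can be passed as task arguments with `inspect eval`'s `-T` flag (e.g. `-T dataset_split=wildchat`). Below is a minimal sketch using the Python API; the override values are arbitrary examples, not recommended settings.

```python
from inspect_ai import eval
from inspect_evals.novelty_bench import novelty_bench

# Example parameter overrides only; see the list above for meanings and defaults.
eval(novelty_bench(dataset_split="wildchat", num_generations=5, quality_model="large"))
```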
Dataset
1,100 prompts designed to elicit diverse human responses
NBCURATED (100 prompts)
Manually curated prompts across 4 categories, each annotated with responses from 8 human annotators:
- Randomness: Prompts involving random selection (e.g., “Roll a make-believe 20-sided die”)
- Factual Knowledge: Underspecified factual questions with multiple valid answers (e.g., “List a capital city in Africa”)
- Creative Writing: Poetry, riddles, and story prompts (e.g., “Tell me a riddle”)
- Subjectivity: Opinion-based questions (e.g., “What’s the best car to get in 2023?”)
Note: Annotator responses represent a lower bound on human diversity due to the homogeneous annotator pool.
NBWILDCHAT (1,000 prompts)
Automatically curated from WildChat (1M real ChatGPT conversations) using:
- Llama Guard 3 filtering for inappropriate content
- IP-based deduplication for topic diversity
- GPT-4o classification to select prompts likely to elicit diverse responses
Scoring
Metrics
The evaluation measures both diversity and quality of model responses using two metrics:
- `distinct_k`: Number of functionally unique responses among k generations. We generate k outputs for every prompt/sample, group them by functional equivalence (see the definition below), then count how many distinct response classes there are. A higher `distinct_k` score indicates that the model produces more diverse answers rather than slightly reworded duplicates.
- `utility_k`: Cumulative utility combining novelty and quality, weighted by a user patience model (p = 0.8). It is calculated as a weighted sum in which each novel generation (one that is functionally different from all previous ones) contributes its quality score, weighted by p^(i-1) to account for decreasing user patience; only truly novel generations (those in distinct equivalence classes) contribute (see the sketch below). A higher `utility_k` value means that a model produces useful and diverse responses.
Functional equivalence is defined as: two generations are considered equivalent only if a user who saw one would gain no benefit from seeing the other.
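To make the two metrics concrete, here is a minimal computational sketch. It assumes the equivalence-class labels and per-generation quality scores (utility values in 1–10) have already been computed, and that the patience discount p^(i-1) is indexed by generation position as described above.

```python
from typing import Sequence


def distinct_k(class_labels: Sequence[int]) -> int:
    """Number of functionally distinct responses among the k generations."""
    return len(set(class_labels))


def utility_k(
    class_labels: Sequence[int],
    quality: Sequence[float],
    patience: float = 0.8,
) -> float:
    """Patience-weighted cumulative utility: only the first generation in each
    equivalence class contributes, discounted by patience**(i - 1)."""
    seen: set[int] = set()
    total = 0.0
    for i, (label, q) in enumerate(zip(class_labels, quality), start=1):
        if label not in seen:  # novel: first member of its equivalence class
            total += (patience ** (i - 1)) * q
            seen.add(label)
    return total


# Toy example: generations 1 and 3 are functionally equivalent.
labels = [0, 1, 0, 2]
scores = [8, 7, 8, 6]  # utility values in 1..10
print(distinct_k(labels))          # 3
print(utility_k(labels, scores))   # 8 + 0.8 * 7 + 0.8**3 * 6
```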
Scoring Method
- A fine-tuned DeBERTa-v3-large classifier partitions outputs into equivalence classes by comparing each new generation against a random sample from existing classes (see the sketch after this list). The classifier was fine-tuned on 1,000 human-annotated pairs of language model outputs (responses to the same prompts from NBCURATED and NBWILDCHAT), learning to predict whether two outputs are functionally equivalent or meaningfully different.
- Quality is scored using a Skywork reward model (Skywork-Reward-Gemma-2-27B-v0.2 in the original implementation; see the `quality_model` parameter above), with rewards mapped to utility values {1, 2, …, 10} via MT-Bench calibration.
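The partitioning step can be pictured with the following sketch. It is illustrative only, not the benchmark's actual code: `diff_score` stands in for the DeBERTa classifier, and the convention that scores below the threshold mean "equivalent" is an assumption chosen to be consistent with the `equivalence_threshold` description above.

```python
import random
from typing import Callable, List


def partition(
    generations: List[str],
    diff_score: Callable[[str, str], float],  # stand-in for the DeBERTa classifier
    threshold: float = 0.102,
) -> List[int]:
    """Greedily assign each generation to an equivalence class (returned as labels)."""
    classes: List[List[str]] = []  # members of each existing class
    labels: List[int] = []
    for gen in generations:
        assigned = None
        for idx, members in enumerate(classes):
            # Compare against a random member of the class, mirroring the
            # "random sample from existing classes" comparison.
            if diff_score(gen, random.choice(members)) < threshold:
                assigned = idx  # below threshold: treated as functionally equivalent
                break
        if assigned is None:  # novel response: start a new equivalence class
            classes.append([gen])
            labels.append(len(classes) - 1)
        else:
            classes[assigned].append(gen)
            labels.append(assigned)
    return labels
```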
Models are compared on average distinct generations (out of 10 samples) and cumulative utility scores (maximum 10), where higher scores indicate better performance at producing meaningful alternatives.
Evaluation Results
The eval was run for the gpt-4o-mini model on both splits of the dataset, using k=10 generations and the small quality scoring model.
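A sketch of how such a run could be reproduced is shown below; beyond the settings stated above, the exact configuration of the reported run is an assumption.

```python
from inspect_ai import eval
from inspect_evals.novelty_bench import novelty_bench

# Run both dataset splits against gpt-4o-mini with k=10 and the small reward model.
for split in ("curated", "wildchat"):
    eval(
        novelty_bench(dataset_split=split, num_generations=10, quality_model="small"),
        model="openai/gpt-4o-mini",
    )
```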
| Metric | curated dataset | wildchat dataset |
|---|---|---|
| distinct_k | 2.86 | 2.5936 |
| utility_k | 3.234 | 3.5822 |
(Results from GPT-4o Mini, Dec 2025)
The paper only provides overall results for both datasets. The combined results are computed using a weighted average based on each split size: (curated_result * 100 + wildchat_result * 1000) / 1100.
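As a quick check, the combined numbers in the comparison table below follow from that weighted average:

```python
# Weight per-split results by split size (100 curated, 1000 wildchat prompts).
def combine(curated: float, wildchat: float) -> float:
    return (curated * 100 + wildchat * 1000) / 1100


print(round(combine(2.86, 2.5936), 2))    # distinct_k -> 2.62
print(round(combine(3.234, 3.5822), 2))   # utility_k  -> 3.55
```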
The `distinct_k` results match the paper much more closely than the `utility_k` results do. This is expected, as our eval run and the paper use different quality scoring models.
| Metric | inspect_eval results | Paper results |
|---|---|---|
| distinct_k | 2.62 | 2.65 |
| utility_k | 3.55 | 3.11 |
(Comparing results on GPT-4o Mini)
Ref: https://arxiv.org/pdf/2504.05228 Table 1