Needle in a Haystack (NIAH): In-Context Retrieval Benchmark for Long Context LLMs

NIAH evaluates the in-context retrieval ability of long-context LLMs by testing whether a model can extract specific factual information from long inputs.

Overview

The Inspect implementation of NIAH (Needle in a Haystack) is designed to evaluate the in-context retrieval ability of long-context large language models (LLMs). The evaluation places a random fact or statement (the “needle”) within a long context window (the “haystack”) and asks the model to retrieve it, iterating over various document depths and context lengths to measure performance.

This evaluation was contributed by Owen Parsons.

Installation

To get started, install the required Python packages:

pip install inspect_ai
pip install git+https://github.com/UKGovernmentBEIS/inspect_evals

Running the Evaluation

You can evaluate against models using the following command:

inspect eval inspect_evals/niah --model openai/gpt-3.5-turbo-16k

If you prefer not to specify the --model argument each time you run an evaluation, create a .env configuration file in your working directory that defines the INSPECT_EVAL_MODEL environment variable along with your API key.

Example of the .env file:

INSPECT_EVAL_MODEL=openai/gpt-3.5-turbo-16k

You can also use the -T option to pass task parameters to the evaluation. For example:

inspect eval inspect_evals/niah --model openai/gpt-3.5-turbo -T n_needles=20
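The evaluation can also be run from Python using Inspect's eval() function. The sketch below assumes the task is exposed as niah in the inspect_evals.niah module and that its keyword arguments match the configuration variables listed below; check the installed package for the exact import path before relying on it.

# Minimal sketch of running NIAH from Python (the import path and parameter
# names are assumptions; verify against the installed inspect_evals package).
from inspect_ai import eval
from inspect_evals.niah import niah

eval(
    niah(n_needles=2, n_contexts=5),  # task parameters mirror the -T values
    model="openai/gpt-3.5-turbo-16k",
)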

Configuration Variables

Here are the configuration variables used in the NIAH evaluation, along with their default values and types:

min_context (int, default 10000): Minimum context length to evaluate.
max_context (int, default 120000): Maximum context length to evaluate.
n_contexts (int, default 15): Number of context lengths to evaluate.
n_positions (int, default 15): Number of needle positions to evaluate for a given context length.
start_buffer (int, default 0): Token buffer at the start of the context in which needles are not placed. (Example: if start_buffer is 100, the first needle position aims for the 100th token of the context.)
end_buffer (int, default 0): Token buffer at the end of the context in which needles are not placed. (Example: for a context length of 1000, if end_buffer is 100, the final needle position aims for the 900th token of the context.)
n_needles (int, default 1): Number of needles to sample.
sample_method (str, default "fixed"): Method for sampling the needles ("fixed", "sequential", or "random").
fixed_index (int, default 0): Index of the needle to use when sample_method is set to "fixed".
n_runs (int, default 1): Number of runs for the evaluation.
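To make the interplay of these variables concrete, the sketch below shows one way the sweep over context lengths and needle positions could be derived from them. It is an illustration of the configuration semantics, not the exact generation code used by the evaluation.

# Illustrative sketch (not the exact implementation) of how the configuration
# values translate into a sweep over context lengths and needle positions.
min_context, max_context, n_contexts = 10_000, 120_000, 15
n_positions, start_buffer, end_buffer = 15, 0, 0

def spread(lo: int, hi: int, n: int) -> list[int]:
    """Return n evenly spaced integer values from lo to hi inclusive."""
    if n == 1:
        return [lo]
    step = (hi - lo) / (n - 1)
    return [round(lo + i * step) for i in range(n)]

# Evenly spaced context lengths between the configured minimum and maximum.
context_lengths = spread(min_context, max_context, n_contexts)

# For each context length, evenly spaced target token positions for the needle,
# staying clear of the configured start and end buffers.
needle_positions = {
    length: spread(start_buffer, length - end_buffer, n_positions)
    for length in context_lengths
}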

Scoring Metric

This benchmark uses a modified version of model_graded_qa() in which the scorer call includes only the question related to the needle, rather than the full original prompt from the benchmark task. This avoids the grading model having to process long-context inputs during scoring.

The scoring criteria are taken from Greg Kamradt’s implementation and are shown below:

Score 1: The answer is completely unrelated to the reference.
Score 3: The answer has minor relevance but does not align with the reference.
Score 5: The answer has moderate relevance but contains inaccuracies.
Score 7: The answer aligns with the reference but has minor omissions.
Score 10: The answer is completely accurate and aligns perfectly with the reference.
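For reference, the sketch below shows how a rubric along these lines could be passed to the stock model_graded_qa() scorer from inspect_ai. It is only an approximation of the benchmark's modified scorer, which additionally ensures that the grader sees just the needle question rather than the full long-context prompt.

# Approximate sketch: configuring the stock model_graded_qa() scorer with a
# Kamradt-style 1-10 rubric. The benchmark's own scorer is a modified version
# that passes only the needle question to the grading model.
from inspect_ai.scorer import model_graded_qa

GRADING_INSTRUCTIONS = """
Rate the answer against the reference on the scale below, then finish your
response with a line of the form GRADE: <number>.
Score 1: The answer is completely unrelated to the reference.
Score 3: The answer has minor relevance but does not align with the reference.
Score 5: The answer has moderate relevance but contains inaccuracies.
Score 7: The answer aligns with the reference but has minor omissions.
Score 10: The answer is completely accurate and aligns perfectly with the reference.
"""

needle_scorer = model_graded_qa(
    instructions=GRADING_INSTRUCTIONS,
    grade_pattern=r"GRADE:\s*(\d+)",  # extract the numeric grade from the grader output
)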

Dataset

The haystack and needle data are read from OpenCompass, and the specific dataset used for the evaluation is generated from them based on the values defined in the configuration.

Dataset Construction

The final dataset used by the NIAH evaluation is generated using the OpenCompass dataset and the configuration variables by following the series of steps summarised below.

  1. Model Encoding: An encoder for the specified model is created using the tiktoken library, which facilitates tokenisation for text processing (see the tokenisation sketch after this list).

  2. Context Length Generation: Context lengths are generated for the specified range of context values given by min_context, max_context and n_contexts.

  3. Needle Position Generation: Needle positions are determined across the generated contexts based on the number of needles specified by n_needles, accounting for a buffer at the start/end of the document if specified by start_buffer or end_buffer.

  4. Data Retrieval: The relevant haystack and needle datasets are extracted from the Hugging Face dataset repository.

  5. Needle Filtering: The needles are filtered to include only English entries.

  6. Needle Sampling: Needles are sampled based on the chosen method (fixed, sequential, or random).

  7. Needle Adjustment: The needles are repeated and shifted to prepare for multiple runs.

  8. Token Length Calculations: The maximum combined token lengths for the needles, questions, and answers are computed, along with the token counts for the main and question prompts.

  9. Context Reading and Trimming: The relevant context texts are read, combined and trimmed to match the required context token lengths.

  10. Period Indices Calculation: Period indices are determined for each context length to facilitate the insertion of needles at appropriate positions between sentences (see the insertion sketch after this list).

  11. Model Context Length Verification: A check is performed to ensure that the model context length is larger than the specified context lengths.

  12. Dataset Creation: The final dataset is constructed by combining the contexts, needle insertions, and associated metadata.

  13. DataFrame Conversion: The complete dataset is converted into a Pandas DataFrame.
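As a rough illustration of the tokenisation step (step 1) and the token-length calculations (step 8), the sketch below creates a tiktoken encoder for the target model, with an assumed fallback encoding for models tiktoken does not recognise, and uses it to count needle tokens.

# Sketch of the tokenisation step: build a tiktoken encoder for the target
# model and use it to count tokens. The fallback encoding is an assumption.
import tiktoken

def get_encoder(model_name: str) -> tiktoken.Encoding:
    try:
        return tiktoken.encoding_for_model(model_name)
    except KeyError:
        return tiktoken.get_encoding("cl100k_base")  # assumed default fallback

encoder = get_encoder("gpt-3.5-turbo-16k")

needle = "The secret ingredient in the recipe is cardamom."
needle_tokens = len(encoder.encode(needle))  # token count used when budgeting context lengths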
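The sketch below illustrates the idea behind the period-index calculation and needle insertion (steps 10 and 12): find the sentence boundary closest to the target depth and splice the needle in there. It is an illustration only, not the evaluation's exact logic.

# Illustration of inserting a needle at the sentence boundary nearest a target
# token depth; the real implementation computes these indices over the trimmed contexts.
import tiktoken

encoder = tiktoken.get_encoding("cl100k_base")

def insert_needle(context: str, needle: str, target_token: int) -> str:
    tokens = encoder.encode(context)
    # Character position corresponding to the target token depth.
    target_char = len(encoder.decode(tokens[:target_token]))
    # Character indices of full stops, i.e. candidate sentence boundaries.
    period_indices = [i for i, ch in enumerate(context) if ch == "."]
    # Insert just after the full stop closest to the target depth.
    insert_at = min(period_indices, key=lambda i: abs(i - target_char)) + 1
    return context[:insert_at] + " " + needle + context[insert_at:]

print(insert_needle("First sentence. Second sentence. Third sentence.", "The needle fact.", 6))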