Needle in a Haystack (NIAH): In-Context Retrieval Benchmark for Long Context LLMs
NIAH evaluates the in-context retrieval ability of long-context LLMs by testing whether a model can extract factual information embedded within long input contexts.
Overview
The Inspect implementation of NIAH (Needle in a Haystack) evaluates the in-context retrieval ability of long-context large language models (LLMs). The evaluation places a random fact or statement (the "needle") within a long context window (the "haystack") and asks the model to retrieve it, iterating over a range of needle depths and context lengths to measure performance.
This evaluation was contributed by Owen Parsons.
Installation
To get started, install the required Python packages:
```bash
pip install inspect_ai
pip install git+https://github.com/UKGovernmentBEIS/inspect_evals
```
Running the Evaluation
You can evaluate against models using the following command:
```bash
inspect eval inspect_evals/niah --model openai/gpt-3.5-turbo-16k
```
If you prefer not to specify the `--model` argument each time you run an evaluation, create a `.env` configuration file in your working directory that defines the `INSPECT_EVAL_MODEL` environment variable along with your API key.

Example `.env` file:

```
INSPECT_EVAL_MODEL=openai/gpt-3.5-turbo-16k
```
You can also use the `-T` option to define input values for the evaluation. For example:

```bash
inspect eval inspect_evals/niah --model openai/gpt-3.5-turbo -T n_needles=20
```
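Multiple task parameters can be combined by repeating `-T`. For example, the following (illustrative values, using the configuration variables described in the next section) evaluates a smaller grid of context lengths and positions:

```bash
inspect eval inspect_evals/niah --model openai/gpt-3.5-turbo-16k \
  -T min_context=5000 -T max_context=50000 \
  -T n_contexts=5 -T n_positions=5
```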
Configuration Variables
Here are the configuration variables used in the NIAH evaluation, along with their default values and types:
| Variable | Type | Default Value | Description |
| --- | --- | --- | --- |
| `min_context` | `int` | `10000` | Minimum context length to evaluate. |
| `max_context` | `int` | `120000` | Maximum context length to evaluate. |
| `n_contexts` | `int` | `15` | The number of context lengths to evaluate. |
| `n_positions` | `int` | `15` | The number of needle positions to evaluate for a given context length. |
| `start_buffer` | `int` | `0` | Buffer at the top of the context within which no needles are placed. (Example: if `start_buffer` is `100`, the first needle position will aim to be at the 100th token of the context.) |
| `end_buffer` | `int` | `0` | Buffer at the bottom of the context within which no needles are placed. (Example: for a context length of `1000`, if `end_buffer` is `100`, the final needle position will aim to be at the 900th token of the context.) |
| `n_needles` | `int` | `1` | The number of needles to sample. |
| `sample_method` | `str` | `"fixed"` | Method for sampling the needles: `"fixed"`, `"sequential"` or `"random"`. |
| `fixed_index` | `int` | `0` | The index of the needle to use when `sample_method` is `"fixed"`. |
| `n_runs` | `int` | `1` | The number of runs for the evaluation. |
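These parameters can also be passed to the task from Python rather than via `-T` on the command line. The snippet below is a minimal sketch, assuming the task function is importable as `niah` from `inspect_evals.niah`; the parameter values are illustrative:

```python
from inspect_ai import eval
from inspect_evals.niah import niah

# Build the task with a smaller grid of context lengths and needle positions.
task = niah(
    min_context=5000,
    max_context=50000,
    n_contexts=5,
    n_positions=5,
    n_needles=1,
    sample_method="fixed",
)

# Run the evaluation against a long-context model.
eval(task, model="openai/gpt-3.5-turbo-16k")
```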
Scoring Metric
This benchmark uses a modified version of `model_graded_qa()` that passes only the question related to the needle to the scorer, rather than the full prompt used during the benchmark task. This avoids the grading model having to handle long-context inputs during scoring.

The scoring criteria are taken from Greg Kamradt's implementation and are shown below:
Score 1: The answer is completely unrelated to the reference.
Score 3: The answer has minor relevance but does not align with the reference.
Score 5: The answer has moderate relevance but contains inaccuracies.
Score 7: The answer aligns with the reference but has minor omissions.
Score 10: The answer is completely accurate and aligns perfectly with the reference.
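For reference, `model_graded_qa()` in `inspect_ai` accepts custom grading instructions and a grade-extraction pattern, so a scorer using this scale could be configured roughly as follows. This is a minimal sketch for illustration only, not the benchmark's exact scorer (which additionally restricts the scorer input to the needle question):

```python
from inspect_ai.scorer import model_graded_qa

# Grading instructions adapted from the scale above; the grading model is asked
# to reply with "GRADE: <number>" so the grade_pattern can extract the score.
GRADING_INSTRUCTIONS = """
Score the submission against the reference on the following scale:
1: The answer is completely unrelated to the reference.
3: The answer has minor relevance but does not align with the reference.
5: The answer has moderate relevance but contains inaccuracies.
7: The answer aligns with the reference but has minor omissions.
10: The answer is completely accurate and aligns perfectly with the reference.
Finish your reply with 'GRADE: <number>'.
"""

scorer = model_graded_qa(
    instructions=GRADING_INSTRUCTIONS,
    grade_pattern=r"GRADE:\s*(\d+)",
)
```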
Dataset
The haystack and needle data are read from OpenCompass, and the specific dataset used for the evaluation is generated from them based on the values defined in the configuration.
Dataset Construction
The final dataset used by the NIAH evaluation is generated using the OpenCompass dataset and the configuration variables by following the series of steps summarised below.
1. **Model Encoding**: An encoder for the specified model is created using the `tiktoken` library, which facilitates tokenisation for text processing.
2. **Context Length Generation**: Context lengths are generated for the specified range of context values given by `min_context`, `max_context` and `n_contexts` (this step and the next are illustrated in the sketch after this list).
3. **Needle Position Generation**: Needle positions are determined across the generated contexts based on the number of needles specified by `n_needles`, accounting for a buffer at the start/end of the document if specified by `start_buffer` or `end_buffer`.
4. **Data Retrieval**: The relevant haystack and needle datasets are extracted from the Hugging Face dataset repository.
5. **Needle Filtering**: The needles are filtered to include only English entries.
6. **Needle Sampling**: Needles are sampled based on the chosen method (fixed, sequential or random).
7. **Needle Adjustment**: The needles are repeated and shifted to prepare for multiple runs.
8. **Token Length Calculations**: The maximum combined token lengths for the needles, questions and answers are computed, along with the token counts for the main and question prompts.
9. **Context Reading and Trimming**: The relevant context texts are read, combined and trimmed to match the required context token lengths.
10. **Period Indices Calculation**: Period indices are determined for each context length to facilitate the insertion of needles at appropriate positions between sentences.
11. **Model Context Length Verification**: A check is performed to ensure that the model's context length is larger than the specified context lengths.
12. **Dataset Creation**: The final dataset is constructed by combining the contexts, needle insertions and associated metadata.
13. **DataFrame Conversion**: The complete dataset is converted into a Pandas DataFrame.
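The two generation steps above amount to choosing a grid of values between the configured bounds. The following is a rough sketch of the idea, assuming evenly spaced values; it is not the exact implementation and the helper names are illustrative:

```python
import numpy as np


def generate_context_lengths(min_context: int, max_context: int, n_contexts: int) -> list[int]:
    """Evenly spaced context lengths (in tokens) between min_context and max_context."""
    return [int(x) for x in np.linspace(min_context, max_context, n_contexts)]


def generate_needle_positions(n_positions: int, context_length: int,
                              start_buffer: int, end_buffer: int) -> list[int]:
    """Evenly spaced target token positions for needles, respecting the start/end buffers."""
    return [int(x) for x in np.linspace(start_buffer, context_length - end_buffer, n_positions)]


# Example: 15 needle positions across a 10,000-token context with a 100-token buffer at each end.
positions = generate_needle_positions(15, 10_000, 100, 100)
```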