# FRAMES: Factual Retrieval and Multi-hop Evaluation Set
FRAMES evaluates multi-hop reasoning by requiring models to answer questions that can only be resolved by chaining information across multiple Wikipedia documents. Two task variants are provided:

- `frames_baseline`: all source documents are provided upfront; tests whether a model can reason across a full document set.
- `frames_socrates`: documents are withheld and the model must iteratively request them via a `request_document` tool (sketched below). Retrieval is constrained to a per-sample allowlist of ground-truth source documents, preventing eval-awareness contamination. Scored with a decaying reward, `accuracy × min(optimal_hops / actual_hops, 1)`, that penalises unnecessary hops and hallucinated document requests.
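For intuition, here is a minimal sketch of what a retrieval tool of this shape might look like using Inspect's `@tool` API. The tool name comes from the description above, but the factory signature, docstring, and allowlist handling are assumptions for illustration; the upstream implementation fetches content live from Wikipedia.

```python
from inspect_ai.tool import tool


@tool
def request_document(allowlist: dict[str, str]):
    # `allowlist` maps ground-truth Wikipedia titles to document text
    # (an assumption for this sketch; upstream fetches via the
    # MediaWiki API at eval time).
    async def execute(title: str) -> str:
        """Request a source document by its Wikipedia title.

        Args:
            title: Exact title of the Wikipedia article to retrieve.

        Returns:
            The document text, or a refusal if the title is outside
            the per-sample allowlist (counted as a hallucinated request).
        """
        if title not in allowlist:
            return f"'{title}' is not an available source document."
        return allowlist[title]

    return execute
```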
## Overview
> ⚠️ **External evaluation.** Code lives in an upstream repository. inspect_evals lists it for discoverability; review the upstream repo and pinned commit before running.
Source: sahil350/frames-eval@2fbb1a1 · Listed by @sahil350
## Usage
### Installation
This is an externally-maintained evaluation. Clone the upstream repository at the pinned commit and install its dependencies:
```bash
git clone https://github.com/sahil350/frames-eval
cd frames-eval
git checkout 2fbb1a123431d559180547e325f0765d97ab2a39
uv sync
```

### Running evaluations
#### CLI
```bash
uv run inspect eval src/frames_eval/frames.py@frames_baseline --model openai/gpt-5-nano
```

#### Python
```python
from inspect_ai import eval
from frames_eval.frames import frames_baseline

eval(frames_baseline(), model="openai/gpt-5-nano")
```

#### View logs
```bash
uv run inspect view
```

### More information
For the dataset, scorer, task parameters, and validation, see the upstream repo: sahil350/frames-eval.
## Options
You can control a variety of options from the command line. For example:
```bash
uv run inspect eval src/frames_eval/frames.py@frames_baseline --limit 10
uv run inspect eval src/frames_eval/frames.py@frames_socrates --max-connections 10
uv run inspect eval src/frames_eval/frames.py@frames_baseline --temperature 0.5
```

See `uv run inspect eval --help` for all available options.
More command-line options are covered in the Inspect docs.
## Evaluation Report
Timestamp: May 2026
Commit: 2fbb1a1
### frames_baseline
| Model | Accuracy |
|---|---|
| ollama/llama3.2 | 0.060 |
| ollama/llama3.1:8b | 0.080 |
### frames_socrates
| Model | Mean Score | Accuracy | Avg Hops | Avg Hallucinations | Stderr |
|---|---|---|---|---|---|
| ollama/llama3.2 | 0.108 | 0.120 | 2.000 | 0.580 | |
| ollama/llama3.1:8b | 0.270 | 0.275 | 2.000 | 0.470 | 0.070 |
Notes:
- `frames_socrates` score = `accuracy × min(5 / actual_hops, 1) − 0.2 × hallucinations` (a worked sketch follows these notes)
- Retrieval is constrained to a per-sample allowlist of ground-truth Wikipedia source documents.
- Wikipedia content fetched live via the MediaWiki API at eval time.
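To make the arithmetic concrete, here is a small sketch of the scoring rule from the first note, with the numerator fixed at 5 as reported for these runs. The upstream scorer is authoritative; `socrates_score` and its signature are illustrative.

```python
def socrates_score(correct: bool, actual_hops: int, hallucinations: int) -> float:
    # Decaying reward: full credit for up to 5 hops, scaled down beyond
    # that; each hallucinated document request costs a flat 0.2.
    accuracy = 1.0 if correct else 0.0
    decay = min(5 / max(actual_hops, 1), 1.0)
    return accuracy * decay - 0.2 * hallucinations


# A correct answer reached in 7 hops with one hallucinated request:
# 1.0 * (5 / 7) - 0.2 ≈ 0.514
print(socrates_score(True, 7, 1))
```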