FRAMES: Factual Retrieval and Multi-hop Evaluation Set

Reasoning · frames-eval · reasoning · retrieval · multi-hop · Agent

FRAMES evaluates multi-hop reasoning by requiring models to answer questions that can only be resolved by chaining information across multiple Wikipedia documents. Two task variants are provided:

  • frames_baseline: all source documents are provided upfront; tests whether a model can reason across a full document set.
  • frames_socrates: documents are withheld and the model must iteratively request them via a request_document tool. Retrieval is constrained to a per-sample allowlist of ground-truth source documents, preventing eval-awareness contamination. Scored with a decaying reward, accuracy × min(optimal_hops / actual_hops, 1), which penalises unnecessary hops, with an additional penalty for hallucinated document requests (a sketch of this reward follows the list).
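
As a concrete reading of that reward, here is a minimal Python sketch. The function name and signature are illustrative, not the upstream scorer; the flat 0.2-per-hallucination penalty matches the formula in the evaluation report below.

def socrates_score(accuracy: float, optimal_hops: int,
                   actual_hops: int, hallucinations: int) -> float:
    # Full credit only when the model uses no more hops than
    # necessary; extra hops decay the reward proportionally.
    decay = min(optimal_hops / actual_hops, 1.0) if actual_hops else 1.0
    # Each hallucinated document request costs a flat 0.2
    # (per the notes in the evaluation report below).
    return accuracy * decay - 0.2 * hallucinations

# e.g. a correct answer in 4 hops when 2 suffice, with one
# hallucinated request: 1.0 * 0.5 - 0.2 = 0.3
print(socrates_score(1.0, optimal_hops=2, actual_hops=4, hallucinations=1))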

Overview

⚠️ External evaluation. Code lives in an upstream repository. inspect_evals lists it for discoverability; review the upstream repo and pinned commit before running.

Source: sahil350/frames-eval@2fbb1a1 · Listed by @sahil350

Usage

Installation

This is an externally maintained evaluation. Clone the upstream repository at the pinned commit and install its dependencies:

git clone https://github.com/sahil350/frames-eval
cd frames-eval
git checkout 2fbb1a123431d559180547e325f0765d97ab2a39
uv sync

Running evaluations

CLI

uv run inspect eval src/frames_eval/frames.py@frames_baseline --model openai/gpt-5-nano

Python

from inspect_ai import eval
from frames_eval.frames import frames_baseline

eval(frames_baseline(), model="openai/gpt-5-nano")
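
The agentic variant runs the same way from Python. A sketch, assuming frames_socrates is importable from the same module (it is addressed as frames.py@frames_socrates on the CLI):

from inspect_ai import eval
from frames_eval.frames import frames_socrates

# Agentic variant: the model requests documents one at a time
# via the request_document tool.
eval(frames_socrates(), model="openai/gpt-5-nano")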

View logs

uv run inspect view

More information

For the dataset, scorer, task parameters, and validation, see the upstream repo: sahil350/frames-eval.

Options

You can control a variety of options from the command line. For example:

uv run inspect eval src/frames_eval/frames.py@frames_baseline --limit 10
uv run inspect eval src/frames_eval/frames.py@frames_socrates --max-connections 10
uv run inspect eval src/frames_eval/frames.py@frames_baseline --temperature 0.5

See uv run inspect eval --help for all available options.

More command-line options: Inspect docs.

Evaluation Report

Timestamp: May 2026

Commit: 2fbb1a1

frames_baseline

Model               Accuracy
ollama/llama3.2     0.060
ollama/llama3.1:8b  0.080

frames_socrates

Model               Mean Score  Accuracy  Avg Hops  Avg Hallucinations  Stderr
ollama/llama3.2     0.108       0.120     2.000     0.580
ollama/llama3.1:8b  0.270       0.275     2.000     0.470               0.070

Notes:

  • frames_socrates score = accuracy × min(5 / actual_hops, 1) − 0.2 × hallucinations
  • Retrieval is constrained to a per-sample allowlist of ground-truth Wikipedia source documents.
  • Wikipedia content is fetched live via the MediaWiki API at eval time (see the sketch below).
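
For context, a live fetch of this kind typically goes through the standard MediaWiki query API. The sketch below uses requests and the public en.wikipedia.org endpoint; the upstream repo's actual retrieval code may differ.

import requests

def fetch_wikipedia_extract(title: str) -> str:
    # Ask the MediaWiki API for a plain-text extract of one article.
    resp = requests.get(
        "https://en.wikipedia.org/w/api.php",
        params={
            "action": "query",
            "prop": "extracts",
            "explaintext": 1,
            "titles": title,
            "format": "json",
        },
        timeout=30,
    )
    resp.raise_for_status()
    # Results are keyed by page id; take the single returned page.
    pages = resp.json()["query"]["pages"]
    return next(iter(pages.values())).get("extract", "")

print(fetch_wikipedia_extract("Wikipedia")[:200])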