FRAMES: Factual Retrieval and Multi-hop Evaluation Set

Reasoning · frames-eval · reasoning · retrieval · multi-hop · Agent

FRAMES evaluates multi-hop reasoning by requiring models to answer questions that can only be resolved by chaining information across multiple Wikipedia documents. Two task variants are provided:

  • frames_baseline: all source documents are provided upfront; tests whether a model can reason across a full document set.
  • frames_socrates: documents are withheld and the model must iteratively request them via a request_document tool. Retrieval is constrained to a per-sample allowlist of ground-truth source documents, preventing eval-awareness contamination. Scored with a decaying reward, accuracy × min(optimal_hops / actual_hops, 1), which penalises unnecessary hops, with an additional penalty for hallucinated document requests (a sketch of this reward follows the list).
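
As a concrete reading of that reward, here is a minimal Python sketch. The function name and signature are illustrative, not the upstream scorer; the flat 0.2-per-hallucination penalty matches the formula in the evaluation report below.

def socrates_score(accuracy: float, optimal_hops: int,
                   actual_hops: int, hallucinations: int) -> float:
    # Full credit only when the model uses no more hops than
    # necessary; extra hops decay the reward proportionally.
    decay = min(optimal_hops / actual_hops, 1.0) if actual_hops else 1.0
    # Each hallucinated document request costs a flat 0.2
    # (per the notes in the evaluation report below).
    return accuracy * decay - 0.2 * hallucinations

# e.g. a correct answer in 4 hops when 2 suffice, with one
# hallucinated request: 1.0 * 0.5 - 0.2 = 0.3
print(socrates_score(1.0, optimal_hops=2, actual_hops=4, hallucinations=1))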

Overview

⚠️ External evaluation. Code lives in an upstream repository. inspect_evals lists it for discoverability; review the upstream repo and pinned commit before running.

Source: sahil350/frames-eval@2fbb1a1 · Listed by @sahil350

Usage

Installation

This is an externally maintained evaluation. Clone the upstream repository at the pinned commit and install its dependencies:

git clone https://github.com/sahil350/frames-eval
cd frames-eval
git checkout 2fbb1a123431d559180547e325f0765d97ab2a39
uv sync

Running evaluations

CLI

uv run inspect eval src/frames_eval/frames.py@frames_baseline --model openai/gpt-5-nano

Python

from inspect_ai import eval
from frames_eval.frames import frames_baseline

eval(frames_baseline(), model="openai/gpt-5-nano")
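
The agentic variant runs the same way from Python. A sketch, assuming frames_socrates is importable from the same module (it is addressed as frames.py@frames_socrates on the CLI):

from inspect_ai import eval
from frames_eval.frames import frames_socrates

# Agentic variant: the model requests documents one at a time
# via the request_document tool.
eval(frames_socrates(), model="openai/gpt-5-nano")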

View logs

uv run inspect view

More information

For the dataset, scorer, task parameters, and validation, see the upstream repo: sahil350/frames-eval.

Options

You can control a variety of options from the command line. For example:

uv run inspect eval src/frames_eval/frames.py@frames_baseline --limit 10
uv run inspect eval src/frames_eval/frames.py@frames_socrates --max-connections 10
uv run inspect eval src/frames_eval/frames.py@frames_baseline --temperature 0.5

See uv run inspect eval --help for all available options.

More command-line options: Inspect docs.

Evaluation Report

Timestamp: May 2026

Commit: 2fbb1a1

frames_baseline

Model               Accuracy
ollama/llama3.2     0.060
ollama/llama3.1:8b  0.080

frames_socrates

Model               Mean Score  Accuracy  Avg Hops  Avg Hallucinations  Stderr
ollama/llama3.2     0.108       0.120     2.000     0.580
ollama/llama3.1:8b  0.270       0.275     2.000     0.470               0.070

Notes:

  • frames_socrates score = accuracy × min(5 / actual_hops, 1) − 0.2 × hallucinations
  • Retrieval is constrained to a per-sample allowlist of ground-truth Wikipedia source documents.
  • Wikipedia content is fetched live via the MediaWiki API at eval time (see the sketch below).
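
For context, a live fetch of this kind typically goes through the standard MediaWiki query API. The sketch below uses requests and the public en.wikipedia.org endpoint; the upstream repo's actual retrieval code may differ.

import requests

def fetch_wikipedia_extract(title: str) -> str:
    # Ask the MediaWiki API for a plain-text extract of one article.
    resp = requests.get(
        "https://en.wikipedia.org/w/api.php",
        params={
            "action": "query",
            "prop": "extracts",
            "explaintext": 1,
            "titles": title,
            "format": "json",
        },
        timeout=30,
    )
    resp.raise_for_status()
    # Results are keyed by page id; take the single returned page.
    pages = resp.json()["query"]["pages"]
    return next(iter(pages.values())).get("extract", "")

print(fetch_wikipedia_extract("Wikipedia")[:200])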