CTI-REALM: Cyber Threat Intelligence Detection Rule Development Benchmark

Cybersecurity
Agent

Evaluates AI systems’ ability to analyze cyber threat intelligence and develop comprehensive detection capabilities through a realistic 5-subtask workflow: MITRE technique mapping, data source discovery, Sigma rule generation, KQL development and testing against real telemetry data, and results analysis.

Overview

CTI-REALM (Cyber Threat Real World Evaluation and LLM Benchmarking) evaluates AI agents’ ability to interpret threat intelligence and develop detection rules. The benchmark provides a realistic environment replicating the security analyst workflow, enabling agents to examine CTI reports, execute queries, understand schema structures, and construct detection rules.

Overview

CTI-REALM tests an AI agent’s capability across a 5-stage detection engineering workflow:

  1. CTI Report Analysis — Retrieve and analyze relevant threat intelligence reports
  2. MITRE Technique Mapping — Identify relevant ATT&CK techniques from threat intelligence
  3. Data Source Discovery — Explore available Kusto data sources dynamically
  4. KQL Development — Write and test queries against real telemetry data
  5. Detection Rule Generation — Produce Sigma rules and validated KQL queries

Evaluation involves emulated attacks of varying complexity across Linux systems, cloud platforms, and Azure Kubernetes Service (AKS). Agent performance is measured through both final detection results and trajectory-based rewards that capture decision-making effectiveness.

Setup

Prerequisites

  • Docker (required for the containerized evaluation environment)
  • At least 4GB available RAM for Docker containers

Docker Image Sizes

The evaluation requires approximately 6.5 GB of Docker images:

Image Size Purpose
kustainer-linux ~4.5 GB Azure Data Explorer emulator (Kusto)
inspect-cti_realm-*-default ~1.7 GB Main evaluation environment
python:3.11-slim ~190 MB Init services (kusto-init, mitre-service)

First-run startup takes several minutes as images are pulled and data is loaded into Kusto.

Download Data

All data is hosted on Hugging Face in the dataset arjun180-new/cti_realm. Before running the evaluation, download the required data files:

cd src/inspect_evals/cti_realm
python download_data.py

This downloads:

  1. Dataset files (to data/) — Detection objectives, sample metadata, and ground truth answers
  2. Kusto telemetry data (to docker/kusto_init/data/) — 12 log sources including endpoint, AKS, cloud, identity, and application telemetry
  3. CTI reports (to cti_reports/) — Threat intelligence reports for analysis

Troubleshooting

If you see this error when running the evaluation:

fork/exec /usr/local/lib/docker/cli-plugins/docker-buildx: no such file or directory
Failed to build docker containers

This means Docker Desktop is not running. Start Docker Desktop before running the evaluation.

If you see a TimeoutError or RuntimeError: No services started with kusto-init failing, the Kusto data initialisation is taking longer than the Docker Compose wait timeout. Try increasing the retries value in the kusto-emulator healthcheck in docker/compose.yaml — Inspect uses retries * (interval + timeout) to calculate the overall wait timeout.

Architecture

The evaluation environment is a containerized Docker system integrated with Inspect AI that provides:

  • CTI Repository — 37 source reports from Microsoft Security, Datadog Security Labs, Palo Alto Networks, and Splunk Security Content
  • Kusto Cluster — Query engine for executing KQL against telemetry data
  • Telemetry Logs — Multi-source security logs from attack simulations across Linux endpoints, AKS clusters, and Azure cloud
  • MITRE ATT&CK Database — Techniques and tactic mappings for threat contextualization
  • Sigma Rules Database — Reference collection of existing detection rules
  • Tool API — 8 specialized functions for CTI retrieval, data exploration, query execution, and threat context mapping

Agents interact via a ReAct framework, alternating between reasoning traces and tool invocations.

Usage

Task Variants

Task Samples Description
cti_realm_25 25 Balanced subset (12 Linux, 9 AKS, 4 Cloud)
cti_realm_50 50 Full evaluation (25 Linux, 17 AKS, 8 Cloud)
cti_realm_25_minimal 25 Ablation — CTI-specific tools removed
cti_realm_25_seeded 25 Memory augmentation — pre-seeded workflow knowledge

CTI-REALM-25 is a subset of CTI-REALM-50. All variants use hard difficulty (minimal prompting, no workflow guidance).

Seeded Memory Variant

The cti_realm_25_seeded task pre-populates the agent’s /memories/ directory with three guidance files before execution begins:

  • workflow.md — Step-by-step detection engineering workflow (CTI analysis → MITRE mapping → data exploration → query development → rule generation)
  • tool_tips.md — Tips for effective use of available tools (Kusto queries, Sigma rule formatting, CTI report retrieval)
  • patterns.md — Common detection patterns and KQL query templates for different attack categories

The agent is instructed to read these files before starting its investigation. This variant tests whether providing structured knowledge can close the performance gap between smaller and larger models. In our evaluation, seeded memory closed 33% of the gap between GPT-5-Mini and GPT-5 (Med).

Running Evaluations

# Run 25-sample benchmark
uv run inspect eval inspect_evals/cti_realm_25 --model anthropic/claude-opus-4-6

# Run 50-sample benchmark
uv run inspect eval inspect_evals/cti_realm_50 --model openai/gpt-5

# Run with limited samples
uv run inspect eval inspect_evals/cti_realm_25 --model anthropic/claude-opus-4-5 --limit 5

Notes

  • Use --max-samples 2 for stable performance when running multiple samples
  • Agents are permitted a maximum of 70 messages per task (configurable via -T message_limit=N)

Scoring

CTI-REALM uses a trajectory-based evaluation framework combining deterministic checkpoints with LLM-as-judge assessment. The total reward is:

\[R_{\text{total}} = \sum_{i \in \{C0, C1, C2, C3, C4\}} w_i \cdot r_i \in [0, 1]\]

Checkpoints

Checkpoint Weight Method Description
C0 — CTI Analysis 0.125 LLM-as-judge Correct identification of relevant threat intelligence reports
C1 — MITRE Mapping 0.075 Jaccard similarity Accuracy of ATT&CK technique extraction
C2 — Data Exploration 0.100 Jaccard similarity Identification of relevant telemetry sources
C3 — Query Execution 0.050 Binary Iterative query refinement (≥2 successful queries)
C4 — Detection Quality 0.650 F1-score + LLM-as-judge KQL correctness via F1-score, Sigma rule quality via judge

Checkpoints C0–C3 comprise 35% of the total weight (trajectory reward), while C4 accounts for 65% (ground truth reward).

Grader Model

Checkpoints C0 and C4 use an LLM-as-judge. By default, the grader uses openai/azure/gpt-5-mini. To specify a different grader, use the grader model role:

uv run inspect eval inspect_evals/cti_realm_25 --model anthropic/claude-opus-4-6 --model grader=openai/azure/gpt-5-mini

The results reported below were generated with openai/azure/gpt-5-mini as the grader model.

Judge Calibration

The LLM-as-judge components (C0 and C4 Sigma) were calibrated against human oracle labels on a stratified sample of 20 samples (10 Linux, 7 AKS, 3 Cloud) from an O3 evaluation run. Results using openai/azure/gpt-5-mini as the grader:

Metric C0 (CTI Alignment) C4 (Sigma Quality)
Judge mean 0.593 0.672
Oracle mean 0.698 0.673
Bias -0.105 -0.001
MAE 0.145 0.073
Pearson r 0.483 0.866

C4 Sigma quality judge is well-calibrated: near-zero bias, low MAE, and strong correlation (r=0.87) with human ratings.

C0 CTI alignment judge slightly underestimates quality (bias -0.10) with moderate correlation (r=0.48). Since C0 contributes 12.5% of the total score, the bias translates to ~0.013 on the final score — within standard error bars and unlikely to affect model rankings.

When substituting a different grader model, consider re-validating calibration using tools/judge_calibration_diagnostics.py — see BEST_PRACTICES.md, section “Validate judge-based comparisons” for guidance.

Evaluation Report

Implementation Details

  • This is the reference implementation of CTI-REALM
  • Docker-based sandbox with a local Kusto-compatible server for realistic KQL query execution against actual telemetry data
  • Scoring combines trajectory-based checkpoints (C0–C3) with ground truth F1-score evaluation (C4)
  • GPT-5-Mini is the default LLM-as-judge for C0 and C4 Sigma evaluation, configurable via the grader model role (see Grader Model)

Results

Results from CTI-REALM-50 (50 samples, all models run with message_limit=70, task version 1-A):

Model Provider Rewards StdErr 95% CI
claude-opus-4-6 Anthropic 0.6373 0.0374 [0.562, 0.712]
claude-opus-4-5 Anthropic 0.6244 0.0338 [0.556, 0.692]
claude-sonnet-4-5 Anthropic 0.5872 0.0328 [0.521, 0.653]
gpt-5-2025-03-01-preview (medium) OpenAI 0.5720 0.0337 [0.504, 0.640]
gpt-5.2-2025-03-01-preview (medium) OpenAI 0.5716 0.0325 [0.506, 0.637]
gpt-5-2025-03-01-preview (high) OpenAI 0.5633 0.0329 [0.497, 0.629]
gpt-5.2-2025-03-01-preview (high) OpenAI 0.5513 0.0367 [0.478, 0.625]
gpt-5-2025-03-01-preview (low) OpenAI 0.5413 0.0315 [0.478, 0.605]
gpt-5.1-2025-03-01-preview (medium) OpenAI 0.5133 0.0300 [0.453, 0.574]
gpt-5.1-2025-03-01-preview (high) OpenAI 0.4946 0.0373 [0.420, 0.570]
gpt-5.2-2025-03-01-preview (low) OpenAI 0.4898 0.0248 [0.440, 0.540]
o3-2025-03-01-preview OpenAI 0.4707 0.0281 [0.414, 0.527]
gpt-5-mini-2025-03-01-preview OpenAI 0.4506 0.0292 [0.392, 0.509]
gpt-4.1-2025-03-01-preview OpenAI 0.4186 0.0235 [0.371, 0.466]
gpt-5.1-2025-03-01-preview (low) OpenAI 0.3731 0.0211 [0.331, 0.415]
o4-mini-2025-03-01-preview OpenAI 0.3602 0.0238 [0.312, 0.408]

Key findings:

  • Anthropic models consistently outperform OpenAI models, with Claude Opus 4.6 leading at 0.637
  • Medium reasoning effort achieves the best results across all GPT-5 generations — diminishing returns from additional reasoning computation
  • Dedicated reasoning models (O3, O4-Mini) underperform general-purpose models
  • Performance varies substantially by platform: Linux (0.585 avg) > AKS (0.517) > Cloud (0.282)
  • C4 (detection quality, weight 0.65) is the most discriminating checkpoint
  • Ablation study confirms CTI tools provide 0.077–0.150 point improvements across all models
  • Memory augmentation closes 33% of the gap between GPT-5-Mini and GPT-5 (Med)
  • The benchmark is not saturated — the best model achieves ~64% of the maximum possible reward

Reproducibility Information

  • Total samples: 50 out of 50 (full CTI-REALM-50 dataset)

  • Task version: 1-A

  • Date: February–March 2026

  • Command:

    uv run inspect eval inspect_evals/cti_realm_50 --model <model_name>
  • Models: OpenAI models run via Azure OpenAI (API version 2025-03-01-preview). Anthropic models run via Anthropic API.

  • Epochs: 1 for initial evaluation. The ideal number of epochs is 3, based on our variance analysis (Section 5.3 of the paper) which showed that 3 epochs on CTI-REALM-25 produce stable rankings with standard deviations of 0.196–0.263 across top-5 models.

Changelog

[1-A] — 2026-03-27 — Initial Implementation

  • Initial release of CTI-REALM benchmark
  • 4 task variants: cti_realm_25, cti_realm_50, cti_realm_25_seeded, cti_realm_25_minimal
  • Docker-based sandbox with Kusto emulator for realistic KQL query execution
  • 5-checkpoint trajectory-based scoring (C0–C4) with LLM-as-judge for C0 and C4
  • Evaluation report with 16 models across 50 samples