CTI-REALM: Cyber Threat Intelligence Detection Rule Development Benchmark
Evaluates AI systems’ ability to analyze cyber threat intelligence and develop comprehensive detection capabilities through a realistic 5-subtask workflow: MITRE technique mapping, data source discovery, Sigma rule generation, KQL development and testing against real telemetry data, and results analysis.
Overview
CTI-REALM (Cyber Threat Real World Evaluation and LLM Benchmarking) evaluates AI agents’ ability to interpret threat intelligence and develop detection rules. The benchmark provides a realistic environment replicating the security analyst workflow, enabling agents to examine CTI reports, execute queries, understand schema structures, and construct detection rules.
CTI-REALM tests an AI agent’s capability across a 5-stage detection engineering workflow:
- CTI Report Analysis — Retrieve and analyze relevant threat intelligence reports
- MITRE Technique Mapping — Identify relevant ATT&CK techniques from threat intelligence
- Data Source Discovery — Explore available Kusto data sources dynamically
- KQL Development — Write and test queries against real telemetry data
- Detection Rule Generation — Produce Sigma rules and validated KQL queries
Evaluation involves emulated attacks of varying complexity across Linux systems, cloud platforms, and Azure Kubernetes Service (AKS). Agent performance is measured through both final detection results and trajectory-based rewards that capture decision-making effectiveness.
Setup
Prerequisites
- Docker (required for the containerized evaluation environment)
- At least 4 GB of RAM available for Docker containers
Docker Image Sizes
The evaluation requires approximately 6.5 GB of Docker images:
| Image | Size | Purpose |
|---|---|---|
| `kustainer-linux` | ~4.5 GB | Azure Data Explorer emulator (Kusto) |
| `inspect-cti_realm-*-default` | ~1.7 GB | Main evaluation environment |
| `python:3.11-slim` | ~190 MB | Init services (kusto-init, mitre-service) |
First-run startup takes several minutes as images are pulled and data is loaded into Kusto.
Download Data
All data is hosted on Hugging Face in the dataset `arjun180-new/cti_realm`. Before running the evaluation, download the required data files:

```bash
cd src/inspect_evals/cti_realm
python download_data.py
```

This downloads:
- Dataset files (to `data/`) — Detection objectives, sample metadata, and ground truth answers
- Kusto telemetry data (to `docker/kusto_init/data/`) — 12 log sources including endpoint, AKS, cloud, identity, and application telemetry
- CTI reports (to `cti_reports/`) — Threat intelligence reports for analysis
Troubleshooting
If you see this error when running the evaluation:
```
fork/exec /usr/local/lib/docker/cli-plugins/docker-buildx: no such file or directory
Failed to build docker containers
```
This typically means Docker Desktop is not running, so the buildx CLI plugin is unavailable. Start Docker Desktop before re-running the evaluation.
If you see a TimeoutError or RuntimeError: No services started with kusto-init failing, the Kusto data initialisation is taking longer than the Docker Compose wait timeout. Try increasing the retries value in the kusto-emulator healthcheck in docker/compose.yaml — Inspect uses retries * (interval + timeout) to calculate the overall wait timeout.
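As a sketch, the relevant fragment of `docker/compose.yaml` might look like the following; the probe command and the interval/timeout values are illustrative assumptions, and only the service name and the `retries` key come from the note above. With these values, Inspect would wait up to 60 × (5 s + 10 s) = 15 minutes.

```yaml
services:
  kusto-emulator:
    healthcheck:
      # Illustrative probe; your compose file's actual test command may differ.
      test: ["CMD-SHELL", "curl -f http://localhost:8080 || exit 1"]
      interval: 5s
      timeout: 10s
      retries: 60   # raise this if kusto-init keeps timing out
```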
Architecture
The evaluation environment is a containerized Docker system integrated with Inspect AI that provides:
- CTI Repository — 37 source reports from Microsoft Security, Datadog Security Labs, Palo Alto Networks, and Splunk Security Content
- Kusto Cluster — Query engine for executing KQL against telemetry data
- Telemetry Logs — Multi-source security logs from attack simulations across Linux endpoints, AKS clusters, and Azure cloud
- MITRE ATT&CK Database — Techniques and tactic mappings for threat contextualization
- Sigma Rules Database — Reference collection of existing detection rules
- Tool API — 8 specialized functions for CTI retrieval, data exploration, query execution, and threat context mapping
Agents interact via a ReAct framework, alternating between reasoning traces and tool invocations.
Usage
Task Variants
| Task | Samples | Description |
|---|---|---|
| `cti_realm_25` | 25 | Balanced subset (12 Linux, 9 AKS, 4 Cloud) |
| `cti_realm_50` | 50 | Full evaluation (25 Linux, 17 AKS, 8 Cloud) |
| `cti_realm_25_minimal` | 25 | Ablation — CTI-specific tools removed |
| `cti_realm_25_seeded` | 25 | Memory augmentation — pre-seeded workflow knowledge |
CTI-REALM-25 is a subset of CTI-REALM-50. All variants use hard difficulty (minimal prompting, no workflow guidance).
Seeded Memory Variant
The `cti_realm_25_seeded` task pre-populates the agent’s `/memories/` directory with three guidance files before execution begins:
- `workflow.md` — Step-by-step detection engineering workflow (CTI analysis → MITRE mapping → data exploration → query development → rule generation)
- `tool_tips.md` — Tips for effective use of available tools (Kusto queries, Sigma rule formatting, CTI report retrieval)
- `patterns.md` — Common detection patterns and KQL query templates for different attack categories
The agent is instructed to read these files before starting its investigation. This variant tests whether providing structured knowledge can close the performance gap between smaller and larger models. In our evaluation, seeded memory closed 33% of the gap between GPT-5-Mini and GPT-5 (Med).
Running Evaluations
```bash
# Run 25-sample benchmark
uv run inspect eval inspect_evals/cti_realm_25 --model anthropic/claude-opus-4-6

# Run 50-sample benchmark
uv run inspect eval inspect_evals/cti_realm_50 --model openai/gpt-5

# Run with limited samples
uv run inspect eval inspect_evals/cti_realm_25 --model anthropic/claude-opus-4-5 --limit 5
```

Notes
- Use `--max-samples 2` for stable performance when running multiple samples
- Agents are permitted a maximum of 70 messages per task (configurable via `-T message_limit=N`)
Scoring
CTI-REALM uses a trajectory-based evaluation framework combining deterministic checkpoints with LLM-as-judge assessment. The total reward is:
\[R_{\text{total}} = \sum_{i \in \{C0, C1, C2, C3, C4\}} w_i \cdot r_i \in [0, 1]\]
Checkpoints
| Checkpoint | Weight | Method | Description |
|---|---|---|---|
| C0 — CTI Analysis | 0.125 | LLM-as-judge | Correct identification of relevant threat intelligence reports |
| C1 — MITRE Mapping | 0.075 | Jaccard similarity | Accuracy of ATT&CK technique extraction |
| C2 — Data Exploration | 0.100 | Jaccard similarity | Identification of relevant telemetry sources |
| C3 — Query Execution | 0.050 | Binary | Iterative query refinement (≥2 successful queries) |
| C4 — Detection Quality | 0.650 | F1-score + LLM-as-judge | KQL correctness via F1-score, Sigma rule quality via judge |
Checkpoints C0–C3 comprise 35% of the total weight (trajectory reward), while C4 accounts for 65% (ground truth reward).
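The weighted sum above can be sketched in Python; the weights mirror the table, and the Jaccard helper reflects the C1/C2 method. The per-checkpoint rewards in the example are hypothetical illustrative values, not outputs of the actual scorer (the real C0 and C4 scores come from the judge and the F1 computation).

```python
def jaccard(predicted: set, truth: set) -> float:
    """Jaccard similarity, as used for C1 (MITRE mapping) and C2 (data exploration)."""
    if not predicted and not truth:
        return 1.0
    return len(predicted & truth) / len(predicted | truth)

# Checkpoint weights from the table above (sum to 1.0).
WEIGHTS = {"C0": 0.125, "C1": 0.075, "C2": 0.100, "C3": 0.050, "C4": 0.650}

def total_reward(rewards: dict) -> float:
    """R_total = sum_i w_i * r_i, with each r_i in [0, 1]."""
    return sum(WEIGHTS[c] * rewards[c] for c in WEIGHTS)

# Hypothetical per-checkpoint rewards for one sample:
r = {
    "C0": 0.8,                                      # LLM-as-judge CTI alignment
    "C1": jaccard({"T1059", "T1105"}, {"T1059"}),   # 0.5 (1 shared / 2 total techniques)
    "C2": 0.75,                                     # telemetry-source overlap
    "C3": 1.0,                                      # binary: >= 2 successful queries
    "C4": 0.6,                                      # F1-score + judge blend
}
print(round(total_reward(r), 4))  # → 0.6525
```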
Grader Model
Checkpoints C0 and C4 use an LLM-as-judge. By default, the grader uses `openai/azure/gpt-5-mini`. To specify a different grader, use the `grader` model role:

```bash
uv run inspect eval inspect_evals/cti_realm_25 --model anthropic/claude-opus-4-6 --model grader=openai/azure/gpt-5-mini
```

The results reported below were generated with `openai/azure/gpt-5-mini` as the grader model.
Judge Calibration
The LLM-as-judge components (C0 and C4 Sigma) were calibrated against human oracle labels on a stratified sample of 20 samples (10 Linux, 7 AKS, 3 Cloud) from an O3 evaluation run. Results using openai/azure/gpt-5-mini as the grader:
| Metric | C0 (CTI Alignment) | C4 (Sigma Quality) |
|---|---|---|
| Judge mean | 0.593 | 0.672 |
| Oracle mean | 0.698 | 0.673 |
| Bias | -0.105 | -0.001 |
| MAE | 0.145 | 0.073 |
| Pearson r | 0.483 | 0.866 |
The C4 Sigma quality judge is well calibrated: near-zero bias, low MAE, and strong correlation (r = 0.87) with human ratings.
The C0 CTI alignment judge slightly underestimates quality (bias −0.10) with moderate correlation (r = 0.48). Since C0 contributes 12.5% of the total score, this bias translates to roughly 0.013 on the final score — within standard error bars and unlikely to affect model rankings.
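The ~0.013 figure is simply the C0 checkpoint weight multiplied by the observed bias magnitude; a quick check:

```python
# Upper-bound impact of the C0 judge bias on the final score:
# checkpoint weight (0.125) times bias magnitude (0.105).
c0_weight = 0.125
c0_bias_magnitude = 0.105
impact = c0_weight * c0_bias_magnitude
print(round(impact, 3))  # → 0.013
```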
When substituting a different grader model, consider re-validating calibration using tools/judge_calibration_diagnostics.py — see BEST_PRACTICES.md, section “Validate judge-based comparisons” for guidance.
Evaluation Report
Implementation Details
- This is the reference implementation of CTI-REALM
- Docker-based sandbox with a local Kusto-compatible server for realistic KQL query execution against actual telemetry data
- Scoring combines trajectory-based checkpoints (C0–C3) with ground truth F1-score evaluation (C4)
- GPT-5-Mini is the default LLM-as-judge for C0 and C4 Sigma evaluation, configurable via the `grader` model role (see Grader Model)
Results
Results from CTI-REALM-50 (50 samples, all models run with message_limit=70, task version 1-A):
| Model | Provider | Rewards | StdErr | 95% CI |
|---|---|---|---|---|
| claude-opus-4-6 | Anthropic | 0.6373 | 0.0374 | [0.562, 0.712] |
| claude-opus-4-5 | Anthropic | 0.6244 | 0.0338 | [0.556, 0.692] |
| claude-sonnet-4-5 | Anthropic | 0.5872 | 0.0328 | [0.521, 0.653] |
| gpt-5-2025-03-01-preview (medium) | OpenAI | 0.5720 | 0.0337 | [0.504, 0.640] |
| gpt-5.2-2025-03-01-preview (medium) | OpenAI | 0.5716 | 0.0325 | [0.506, 0.637] |
| gpt-5-2025-03-01-preview (high) | OpenAI | 0.5633 | 0.0329 | [0.497, 0.629] |
| gpt-5.2-2025-03-01-preview (high) | OpenAI | 0.5513 | 0.0367 | [0.478, 0.625] |
| gpt-5-2025-03-01-preview (low) | OpenAI | 0.5413 | 0.0315 | [0.478, 0.605] |
| gpt-5.1-2025-03-01-preview (medium) | OpenAI | 0.5133 | 0.0300 | [0.453, 0.574] |
| gpt-5.1-2025-03-01-preview (high) | OpenAI | 0.4946 | 0.0373 | [0.420, 0.570] |
| gpt-5.2-2025-03-01-preview (low) | OpenAI | 0.4898 | 0.0248 | [0.440, 0.540] |
| o3-2025-03-01-preview | OpenAI | 0.4707 | 0.0281 | [0.414, 0.527] |
| gpt-5-mini-2025-03-01-preview | OpenAI | 0.4506 | 0.0292 | [0.392, 0.509] |
| gpt-4.1-2025-03-01-preview | OpenAI | 0.4186 | 0.0235 | [0.371, 0.466] |
| gpt-5.1-2025-03-01-preview (low) | OpenAI | 0.3731 | 0.0211 | [0.331, 0.415] |
| o4-mini-2025-03-01-preview | OpenAI | 0.3602 | 0.0238 | [0.312, 0.408] |
Key findings:
- Anthropic models consistently outperform OpenAI models, with Claude Opus 4.6 leading at 0.637
- Medium reasoning effort achieves the best results across all GPT-5 generations — diminishing returns from additional reasoning computation
- Dedicated reasoning models (O3, O4-Mini) underperform general-purpose models
- Performance varies substantially by platform: Linux (0.585 avg) > AKS (0.517) > Cloud (0.282)
- C4 (detection quality, weight 0.65) is the most discriminating checkpoint
- Ablation study confirms CTI tools provide 0.077–0.150 point improvements across all models
- Memory augmentation closes 33% of the gap between GPT-5-Mini and GPT-5 (Med)
- The benchmark is not saturated — the best model achieves ~64% of the maximum possible reward
Reproducibility Information
Total samples: 50 out of 50 (full CTI-REALM-50 dataset)
Task version: 1-A
Date: February–March 2026
Command:

```bash
uv run inspect eval inspect_evals/cti_realm_50 --model <model_name>
```

Models: OpenAI models were run via Azure OpenAI (API version `2025-03-01-preview`); Anthropic models via the Anthropic API.
Epochs: 1 for the initial evaluation. Our variance analysis (Section 5.3 of the paper) indicates 3 epochs as the ideal: on CTI-REALM-25, 3 epochs produce stable rankings with standard deviations of 0.196–0.263 across the top-5 models.
Changelog
[1-A] — 2026-03-27 — Initial Implementation
- Initial release of CTI-REALM benchmark
- 4 task variants: `cti_realm_25`, `cti_realm_50`, `cti_realm_25_seeded`, `cti_realm_25_minimal`
- Docker-based sandbox with Kusto emulator for realistic KQL query execution
- 5-checkpoint trajectory-based scoring (C0–C4) with LLM-as-judge for C0 and C4
- Evaluation report with 16 models across 50 samples