CTI-REALM Transparency Document

OVERVIEW

Detection rule writing is the process of creating logic-based rules to identify malicious activity in security logs and network traffic. This is a critical blue teaming task. The CTI-REALM (Cyber Threat Intelligence Real-world Evaluation of Agent LLM capabilities for threat Mitigation) benchmark measures how effectively AI models interpret threat intelligence to develop detection rule writing capabilities.

WHAT CAN CTI-REALM DO

CTI-REALM is designed to assess the performance of LLM agents in generating detection rules from Cyber Threat Intelligence (CTI) reports. Agents are provided with real-world threat intelligence documents and must produce valid detection rules (such as Sigma rules) that accurately capture the described malicious behaviors. The benchmark evaluates both the syntactic validity and semantic accuracy of generated rules against ground-truth detection logic.

INTENDED USES

This work advances AI safety and security by providing a comprehensive benchmark specifically designed for evaluating detection rule generation from threat intelligence. The benchmark enables assessment not only of whether agents produce syntactically valid detection rules, but also of how accurately they capture the threat behaviors described in CTI reports.

OUT-OF-SCOPE USES

CTI-REALM evaluation datasets are not a universal tool for evaluating all cyber-security detection capabilities across all environments. Instead, use CTI-REALM to benchmark AI systems on their ability to translate threat intelligence into actionable detection rules.

We do not recommend using CTI-REALM in commercial or real-world production security environments without additional validation. These datasets are being released for research purposes only. Developers should consider the inherent focus on specific threat types and detection rule formats as they select their own use cases, and evaluate and mitigate for accuracy, safety, and fairness concerns specific to each intended downstream use.

CTI-REALM is for research purposes only, and should not be used in highly regulated domains where inaccurate outputs could lead to missed detections or false positives that negatively impact security operations.

CTI-REALM should not be used as the sole basis for production security decisions without human expert review and validation.

DATASET DETAILS

DATASET CONTENTS

CTI-REALM features curated cyber threat intelligence reports paired with corresponding ground-truth detection rules. Each entry includes:

  • A CTI report describing threat actor tactics, techniques, and procedures (TTPs)
  • Ground-truth detection rules (e.g., Sigma format) that accurately detect the described behaviors
  • Metadata about the threat type and detection coverage

DATA CREATION & PROCESSING

CTI-REALM was constructed using publicly available threat intelligence reports and corresponding detection rules. The dataset focuses on real-world threat scenarios and detection patterns used by security professionals.

Due to the potential presence of sensitive information within threat intelligence reports, appropriate anonymization and sanitization procedures were employed to ensure no active operational intelligence is exposed.

PEOPLE & IDENTIFIERS

CTI-REALM does not contain personally identifiable information (PII). Any references to individuals, organizations, or infrastructure in the threat intelligence reports have been anonymized or are already publicly disclosed information from published threat research.

SENSITIVE OR HARMFUL CONTENT

The data used to create CTI-REALM consists of threat intelligence descriptions and detection logic. While the content describes malicious activities, it is presented in a defensive security context aimed at improving detection capabilities rather than enabling attacks.

CTI-REALM should be used responsibly and only for defensive security research purposes.

HOW TO GET STARTED

To begin using CTI-REALM, please see the instructions in the Inspect Evals repository for running the CTI-REALM evaluation tasks.

uv run inspect eval inspect_evals/cti_realm --model <your-model>

VALIDATION

To evaluate the effectiveness of CTI-REALM for its intended use, security researchers and detection engineers reviewed the threat intelligence reports and corresponding detection rules to confirm their validity and alignment with real-world detection practices.

EVALUATION

CTI-REALM was evaluated on its ability to assess and compare model capabilities in translating threat intelligence into detection rules.

EVALUATION METHODS

CTI-REALM frames detection rule development as a sequential decision-making task modeled as a Markov Decision Process (MDP). The reward function provides feedback at five checkpoints:

Trajectory Rewards (35% weight) - Measures reasoning across checkpoints:

  • C0 (12.5%): CTI report retrieval
  • C1 (7.5%): MITRE ATT&CK technique extraction
  • C2 (10%): Telemetry source identification
  • C3 (5%): Iterative query refinement

Ground Truth Rewards (65% weight) - Evaluates final detection quality:

  • C4 (65%): Detection rule effectiveness (F1-score and Sigma rule quality)

Evaluation Techniques: We combine deterministic evaluation (tool usage verification for C0/C3, Jaccard similarity for C1/C2, regex/F1-score for C4) with LLM-as-a-judge assessment (GPT-5-mini) for report relevance and rule quality.

EVALUATION RESULTS

We evaluated 15 models using a ReAct agent implemented in Inspect AI, with a maximum of 70 messages per task. In broad comparison testing on CTI-REALM50, Claude Opus 4.5 and Claude Sonnet 4.5 achieved the highest scores (0.590 and 0.589 respectively), significantly outperforming most GPT variants. GPT-5.2 Medium and GPT-5 High showed competitive performance at 0.576 and 0.574, while older models struggled at 36-38%. Variance analysis on CTI-REALM25 with 3 epochs (75 total runs per model) revealed that Claude Sonnet 4.5 achieved the highest mean score (0.595) while Claude Opus 4.5 demonstrated the lowest per-sample variance (StdDev=0.063), indicating more consistent reasoning patterns. Key findings include: Claude models consistently achieved highest scores at ~59%; agentic evaluation exhibits inherent variance with no samples achieving identical scores across epochs; and higher token usage does not ensure better performance—GPT-5 High outperformed GPT-5.1 High despite using fewer tokens.

LIMITATIONS

  • CTI-REALM was developed for research and experimental purposes only. Further testing and validation are needed before considering its application in commercial or real-world scenarios.

  • CTI-REALM was designed and tested using the English language. Performance in other languages may vary.

  • Outputs generated by AI may include syntactically valid but semantically incorrect detection rules. Users are responsible for validating generated rules before deployment.

  • All detection rules generated using this benchmark should be reviewed by human security experts before operational use.

  • CTI-REALM focuses on specific detection rule formats (e.g., Sigma) and may not cover all detection platforms or query languages.

  • There has not been a systematic effort to ensure that systems using CTI-REALM are protected from adversarial inputs. Any systems using it should take proactive measures to validate inputs and outputs.

BEST PRACTICES

  • Always validate generated detection rules against test data before deployment
  • Use models with robust responsible AI mitigations
  • Combine AI-generated rules with human expert review
  • Test detection rules in a controlled environment before production deployment
  • Consider the specific threat landscape and detection infrastructure of your environment

LICENSE

MIT License

TRADEMARKS

This project may contain trademarks or logos for projects, products, or services. Use of any trademarks or logos must follow their respective guidelines and policies.

CONTACT

We welcome feedback and collaboration. If you have suggestions, questions, or observe unexpected behavior in this benchmark, please open an issue in the repository or contact the maintainers.

If reports of issues are received or problems are identified independently, the repository will be updated with appropriate mitigations.