Every security operations center is being told to adopt AI. Vendors promise autonomous threat detection, instant incident response, and the end of alert fatigue. The reality is messier. Most SOC teams are still figuring out where AI fits into their existing workflows, and jumping straight to autonomous agents without building foundational trust is a recipe for expensive failure.

The Crawl, Walk, Run framework offers a more honest path. It's not a new concept. Cloud migration teams, DevOps organizations, and Zero Trust programs have used it for years. But it maps remarkably well to how security teams should adopt AI. Each phase builds organizational trust, governance maturity, and technical capability that the next phase depends on. Skip a phase and the risk compounds.

This guide is written for SOC leaders and practitioners who want a practical, phased approach to AI adoption, not a vendor pitch.
Before You Crawl: Know Your Starting Point
Before introducing AI, it helps to have honest answers to a few foundational questions. What are the current mean-time-to-detect and mean-time-to-respond benchmarks? What percentage of alerts are false positives? How many analyst hours does a typical investigation consume?
To be clear, having perfect metrics is not a prerequisite for getting started. Waiting for a pristine environment before touching AI means many organizations will never start at all. The SANS 2025 SOC Survey found that the majority of SOCs lack formal metrics programs, and CardinalOps research revealed that 18% of SIEM rules in production are incapable of firing due to configuration errors. These are exactly the kinds of problems AI can help identify and fix. But without at least a rough baseline, you'll have no way to tell whether AI is genuinely improving outcomes or just adding complexity. Capture what you can, acknowledge what you can't, and start. You can sharpen the measurements as you mature.
Crawl: The Approved Chatbot as Trusted Advisor
The Crawl phase is deceptively simple: give analysts access to an authorized, organization-approved AI chatbot and let them use it as a research assistant, sounding board, and accelerator for tasks they already perform.
This is not about a specialized security AI product. It's about tools like ChatGPT Enterprise, Microsoft 365 Copilot, Google Gemini for Workspace, or even a self-hosted open-source model behind a corporate gateway. The specific product matters far less than three properties: it is sanctioned by the organization (not shadow AI), it operates within data governance boundaries (prompts and responses are not leaking sensitive telemetry to consumer-grade services), and it provides a safe on-ramp for analysts to develop AI interaction skills.
Practical use cases at this stage include asking the chatbot to explain an unfamiliar log source or event ID, drafting an initial incident summary from raw alert data, getting a second opinion on whether observed behavior is suspicious, querying organizational policies or playbook steps in natural language, and translating between query languages or understanding unfamiliar script syntax.
The value here is not automation. It's augmentation. Tier 1 and Tier 2 analysts gain a research tool that operates at the speed of their curiosity rather than the speed of documentation searches. Microsoft's randomized controlled trials demonstrated that analysts using AI assistants completed triage tasks 22–26% faster with meaningful accuracy improvements, and the effect was most pronounced among less experienced analysts. A separate RCT published on arXiv showed a 34.5% accuracy improvement and 29.8% time reduction for IT administration security tasks. In practical terms, a chatbot-assisted junior analyst can approach the investigative depth previously reserved for more senior team members.
For teams just getting started, the learning curve on effective prompting is real. The difference between a vague prompt and a well-structured one is often the difference between a useless response and a genuinely helpful investigation assist. See the Appendix at the end of this post for a set of example prompts tailored to common SOC analyst tasks.
Governance at this stage is lightweight but essential. An acceptable use policy should define what data can and cannot be shared with the AI tool. If the organization is using a cloud-hosted model, teams should verify that the provider's data handling terms meet organizational requirements, particularly that prompts are not used for model training. Audit logging of AI interactions should be enabled where available. And perhaps most importantly, every SOC team member should understand that AI outputs are advisory, not authoritative. The analyst is still the decision-maker.
The Crawl phase is also where organizations confront shadow AI head-on. Cloud Security Alliance research found that 38% of employees share confidential data with AI platforms without approval. Netwrix reported that organizations with high unauthorized AI usage face significantly elevated breach costs. Sanctioning an approved tool and making it easy to access is both a productivity play and a risk reduction measure.
Readiness signal to advance: Analysts are consistently using the approved chatbot in daily workflows, adoption is measured and growing, an acceptable use policy is in place, and baseline operational metrics are established for comparison.
Walk: Bounded LLM Integration in Defined Workflows
The Walk phase shifts from conversational AI assistance to embedding language models in specific, bounded tasks within SOC processes. The key word is bounded. The LLM performs a defined function within a structured workflow, with constrained inputs, validated outputs, and human oversight at decision points.
The architectural pattern is straightforward: an automation platform (SOAR tools, Logic Apps, n8n, or similar orchestration engines) triggers a workflow based on a security event, passes specific data to an LLM via API, receives structured output, and routes that output into existing processes. The LLM never acts independently. It is a processing step within a deterministic pipeline.
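The pattern above can be sketched in a few lines. This is a minimal illustration, not any vendor's API: `call_llm` is a hypothetical stand-in for whatever sanctioned enterprise LLM endpoint the organization uses, and the alert fields are invented.

```python
import json

def call_llm(prompt: str) -> str:
    """Placeholder for the organization's sanctioned LLM API.
    In production this would be a scoped, logged API call."""
    raise NotImplementedError

def enrich_alert(alert: dict, llm=call_llm) -> dict:
    """One bounded processing step: pass only whitelisted fields to the
    LLM, demand structured JSON back, and validate before routing on."""
    # Constrained input: only approved fields leave the pipeline.
    scoped = {k: alert[k] for k in ("rule", "host", "user") if k in alert}
    prompt = (
        "Summarize this alert as JSON with keys 'summary' and "
        "'next_steps' (a list). Alert: " + json.dumps(scoped)
    )
    raw = llm(prompt)
    parsed = json.loads(raw)  # structural check: must be valid JSON
    assert {"summary", "next_steps"} <= parsed.keys()  # validated output
    return {**alert, "ai_enrichment": parsed}  # routed into existing process

# Usage with a canned response standing in for the model:
fake = lambda _: '{"summary": "Possible brute force", "next_steps": ["Check source IP"]}'
result = enrich_alert({"rule": "4625 burst", "host": "DC01"}, llm=fake)
```

Note that the LLM call is just one step between deterministic input scoping and deterministic output validation; the model never decides what happens next.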
High-value use cases for this phase include:
IOC extraction from unstructured reports. Threat intelligence arrives as PDFs, emails, and blog posts. An LLM can parse these documents, extract indicators of compromise (IP addresses, domains, hashes, TTPs), and output structured data that feeds directly into SIEM watchlists or detection rules. Research on LLMCloudHunter demonstrated 99.18% successful compilation of LLM-generated detection rules into functional SIEM queries. A broader study on LLMs for threat intelligence workflows confirmed 92–99% extraction precision across multiple models.
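Because extracted indicators feed directly into watchlists, they should pass a structural check before anything downstream trusts them. A minimal sketch of that validation step follows; the regex patterns are simplified illustrations, not production-grade validators.

```python
import re

# Simplified structural patterns for the IOC types named above.
IOC_PATTERNS = {
    "ipv4":   re.compile(r"^(?:\d{1,3}\.){3}\d{1,3}$"),
    "domain": re.compile(r"^(?:[a-z0-9-]+\.)+[a-z]{2,}$", re.I),
    "sha256": re.compile(r"^[0-9a-f]{64}$", re.I),
}

def validate_iocs(extracted: dict) -> dict:
    """Keep only indicators that pass a structural check; the LLM's
    claims never reach a SIEM watchlist unverified. Types without a
    validator are dropped rather than passed through blindly."""
    return {
        ioc_type: [v for v in values if IOC_PATTERNS[ioc_type].match(v)]
        for ioc_type, values in extracted.items()
        if ioc_type in IOC_PATTERNS
    }

# What an LLM might return from a parsed threat report (invented values):
candidate = {
    "ipv4": ["203.0.113.7", "203.0.113"],   # second entry is truncated
    "domain": ["evil.example.com"],
    "ttp": ["T1021"],                        # no validator defined
}
clean = validate_iocs(candidate)  # truncated IP and unvalidated type dropped
```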
Alert enrichment and summarization. When a SIEM alert fires, a bounded LLM workflow can pull relevant context (asset ownership, recent activity, threat intelligence matches), generate a plain-language summary of what happened and why it matters, and suggest next investigation steps. This transforms a raw alert from a wall of fields into an analyst-ready briefing.
Risk scoring with organizational context. By grounding the LLM with organization-specific data through retrieval-augmented generation (RAG) patterns, using asset criticality databases, CMDB records, and historical incident data, the model can provide risk assessments that incorporate business context a generic alert severity score cannot capture.
Natural language query generation. Analysts describe what they're looking for in plain language and the LLM translates to the appropriate query syntax (KQL, SPL, SQL) for the organization's SIEM. This is especially powerful when the LLM is provided the actual table schemas from the environment, which Microsoft's Threat Hunting Agent documentation confirms dramatically improves query accuracy.
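Grounding the model in the environment's actual schemas can be as simple as building them into the prompt. The sketch below shows one way to do that; the table and column names are hypothetical placeholders for real SIEM schemas.

```python
# Hypothetical table schemas; substitute the real ones from your SIEM.
SCHEMAS = {
    "SigninLogs": ["TimeGenerated", "UserPrincipalName", "IPAddress", "ResultType"],
    "DeviceEvents": ["TimeGenerated", "DeviceName", "ActionType"],
}

def build_query_prompt(request: str, schemas: dict = SCHEMAS) -> str:
    """Ground the LLM in the tables it may actually query; anything
    outside the listed schemas is off-limits by instruction."""
    schema_text = "\n".join(
        f"- {table}({', '.join(cols)})" for table, cols in schemas.items()
    )
    return (
        "You translate analyst requests into KQL.\n"
        "Use ONLY these tables and columns:\n"
        f"{schema_text}\n"
        "Return the query only, no commentary.\n"
        f"Request: {request}"
    )

prompt = build_query_prompt("failed sign-ins from a single IP in the last hour")
```

The same pattern works for SPL or SQL targets; only the instruction line and the schema source change.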
The critical guardrail for this phase is treating all LLM output as untrusted. Even in bounded workflows, language models hallucinate. Every LLM-generated output that feeds into a downstream process needs validation: structural checks on extracted IOCs, syntax validation on generated queries, confidence thresholds on risk scores. A three-layer guardrail model works well here: constrain the model's behavior (system prompts, temperature settings), constrain the data it can access (scoped API permissions, filtered context windows), and constrain the actions it can influence (human approval before any remediation step).
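The three layers can be made explicit configuration rather than convention. This is an illustrative sketch only; the class and field names are invented, not from any specific product.

```python
from dataclasses import dataclass, field

@dataclass
class GuardrailPolicy:
    """The three guardrail layers as explicit, reviewable settings."""
    system_prompt: str                    # layer 1: constrain behavior
    temperature: float = 0.0              # layer 1: deterministic-leaning output
    allowed_fields: set = field(default_factory=set)  # layer 2: constrain data
    requires_approval: bool = True        # layer 3: constrain actions

def scope_input(event: dict, policy: GuardrailPolicy) -> dict:
    """Filter the context window before anything reaches the model."""
    return {k: v for k, v in event.items() if k in policy.allowed_fields}

policy = GuardrailPolicy(
    system_prompt="You summarize alerts. Never suggest remediation commands.",
    allowed_fields={"rule", "host", "severity"},
)
event = {"rule": "T1021 RDP", "host": "SRV9", "severity": "high",
         "raw_packet": "..."}
scoped = scope_input(event, policy)  # raw_packet never reaches the LLM
```

Making the policy an object also gives auditors a single artifact to review instead of behavior scattered across playbooks.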
Data governance tightens considerably at this phase. When LLMs process security telemetry, organizations need architectural guarantees that sensitive data stays where it belongs, not just contractual ones. For regulated environments and government workloads, this means deploying a sandboxed LLM instance within the organization's own cloud boundary. Azure AI Foundry, AWS Bedrock with VPC endpoints, or self-hosted open-weight models all provide this pattern: the model runs inside infrastructure the organization controls, and data physically cannot flow back to a third-party model provider. For less sensitive environments, enterprise LLM APIs with verified data handling terms may be sufficient, but the default recommendation should be architectural isolation over contractual trust.
A landmark empirical study from 2025 analyzed 3,090 queries from 45 analysts over 10 months in a real SOC and found that LLM usage concentrated among a subset of "power user" analysts. The takeaway: adoption programs should identify and empower champions rather than expecting uniform adoption across the team.
Readiness signal to advance: Multiple bounded LLM workflows are in production with measured accuracy and efficiency improvements, guardrail frameworks are validated, analysts trust AI-enriched data for investigation (not just summaries), and governance controls are audited and functional.
Run: Agentic Operations with Earned Autonomy
The Run phase requires a deliberate decision about what belongs inside an LLM reasoning loop and what should stay deterministic. Not every task benefits from AI reasoning, and using an agent where a simple API call would suffice adds latency, cost, and unpredictability for no gain.
A useful mental model: if a task has a known, repeatable answer given the same inputs (looking up an IOC in a threat intel feed, checking whether an IP appears in a blocklist, retrieving asset ownership from a CMDB), it should be a deterministic tool call. If a task requires judgment, synthesis, or interpretation of ambiguous data (assessing whether a sequence of events constitutes lateral movement, prioritizing which of twelve alerts to investigate first, generating a hypothesis about attacker intent), that's where LLM reasoning adds value. Protocols like the Model Context Protocol (MCP) make this distinction architectural: they expose deterministic data retrieval and action tools that agents can call, while keeping the reasoning and decision-making in the LLM layer where it belongs.
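The routing decision itself can live in plain code. The sketch below shows the split described above, with a deterministic blocklist lookup handled as a tool call and judgment tasks routed to the reasoning layer; the feed contents and task names are hypothetical.

```python
# Hypothetical threat intel blocklist (would be a feed lookup in practice).
BLOCKLIST = {"203.0.113.7", "198.51.100.23"}

def check_blocklist(ip: str) -> bool:
    """Deterministic: same input, same answer, fully auditable."""
    return ip in BLOCKLIST

def route_task(task: dict) -> dict:
    """Send repeatable lookups to code; reserve the LLM for judgment."""
    if task["kind"] == "ioc_lookup":
        return {"source": "tool", "hit": check_blocklist(task["value"])}
    # Synthesis/interpretation tasks go to the reasoning layer instead.
    return {"source": "llm", "prompt": f"Assess: {task['value']}"}

lookup = route_task({"kind": "ioc_lookup", "value": "203.0.113.7"})
judgment = route_task({"kind": "assess_lateral_movement",
                       "value": "4 interactive logons across 3 hosts in 2 min"})
```

In an MCP-style architecture, `check_blocklist` would be an exposed tool and the `"llm"` branch would be the agent's reasoning loop; the routing logic stays deterministic either way.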
Microsoft's Sentinel MCP server, CrowdStrike's Falcon platform, Google's security operations agents, and Cisco's agentic capabilities all implement variations of this pattern. The capability is here. The question is whether organizations are ready for it.
This separation of concerns is also the key to managing non-determinism, the defining challenge of agentic operations. Traditional SOAR playbooks are deterministic: same input, same output, fully auditable. Agentic AI is probabilistic by design. An agent investigating the same alert twice may follow different reasoning paths and reach different conclusions. That's valuable when exploring novel threats, and a liability when auditability and consistency matter. Red Canary's practical guidance frames the solution well: design agents as structured workflows where the overall process is deterministic, but individual reasoning steps within bounded nodes leverage AI flexibility. The workflow is the agent. Deterministic tasks get handled by structured code nodes. The LLM reasons only where reasoning is actually needed.
The security risks are real and documented. The OWASP Top 10 for Agentic Applications (released December 2025) catalogs threats including agent goal hijacking, tool misuse, privilege abuse, memory poisoning, and cascading failures across multi-agent systems. Real-world incidents have already demonstrated prompt injection attacks against security copilots that exfiltrated sensitive data with zero user interaction. Obsidian Security research found prompt injection in over 73% of production AI deployments assessed during security audits.
The AWS Agentic AI Security Scoping Matrix provides a useful framework for deciding autonomy levels. The governance model for agentic operations should follow a tiered approach. High-impact actions (isolating hosts, disabling accounts, modifying firewall rules) must remain human-in-the-loop, requiring explicit analyst approval before execution. Medium-impact actions (enrichment queries, ticket creation, notification routing) can operate human-on-the-loop, where agents act but humans monitor and can intervene. Low-impact actions (log queries, IOC lookups, report generation) can operate autonomously once accuracy is demonstrated through extensive testing against labeled datasets.
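The tiered model reduces to a small enforcement gate. The sketch below is illustrative: action names and tier labels are invented, and a real implementation would log every decision and route approval requests to the analyst queue.

```python
# Hypothetical action-to-tier mapping reflecting the governance model above.
TIERS = {
    "isolate_host":  "human_in_the_loop",   # high impact
    "create_ticket": "human_on_the_loop",   # medium impact, monitored
    "ioc_lookup":    "autonomous",          # low impact, proven accuracy
}

def authorize(action: str, analyst_approved: bool = False) -> bool:
    """High-impact actions execute only with explicit analyst approval.
    Unknown actions fail safe: they default to requiring a human."""
    tier = TIERS.get(action, "human_in_the_loop")
    if tier == "human_in_the_loop":
        return analyst_approved
    return True  # monitored and autonomous tiers proceed

blocked = authorize("isolate_host")                          # no approval yet
allowed = authorize("isolate_host", analyst_approved=True)
routine = authorize("ioc_lookup")
```

The fail-safe default matters: an agent that invents a novel action name should hit the approval gate, not slip past it.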
Autonomy must be earned, not assumed. Every expansion of agent authority should be backed by measured evidence: accuracy rates on historical data, simulated decision testing, and bounded pilot deployments with kill switches. Gartner projects that over 40% of agentic AI projects will be canceled by 2027 due to unclear value or inadequate controls. The organizations that succeed will be those that built trust incrementally.
The Human Element: Evolving, Not Disappearing
A persistent concern across all three phases is skills atrophy. Research from Anthropic confirmed that developers who fully delegated work to AI scored under 40% on comprehension assessments. A 2024 study published in PNAS found that AI assistance accelerated skill decay without the users being aware it was happening. The aviation industry learned this decades ago: autopilot dependency contributed to pilot proficiency gaps that became dangerous in edge cases.
SOC teams should apply the same lesson. Rotate analysts between AI-augmented and manual investigation work. Maintain training exercises that operate without AI assistance. Continuously assess analytical competencies alongside AI capability expansion.
The analyst role does not disappear. It evolves. At Crawl, analysts gain a research accelerator. At Walk, they become automation designers and output validators. At Run, they become AI supervisors and exception handlers, focusing on novel threats, strategic detection engineering, and the judgment calls that AI cannot make. Over 64% of cybersecurity job listings now require AI, ML, or automation skills, signaling that this evolution is already underway.
Start Where You Are
The Cloud Security Alliance found that organizations with mature AI governance are nearly twice as likely to successfully adopt agentic AI compared to those with partial governance. Governance is a maturity multiplier, not a brake. The IBM Cost of a Data Breach Report found that organizations with extensive AI and automation in security saved $1.88 million per breach and identified threats 98 days faster than those without.
The organizations that will succeed with AI in security operations are not the ones that deploy the most advanced tools fastest. They are the ones that build trust methodically: establishing baselines before automating, proving bounded AI value before expanding scope, and earning autonomy through evidence rather than aspiration.
Crawl before you walk. Walk before you run. And measure everything along the way.
Appendix: Starter Prompts for SOC Analysts
These examples are designed for general-purpose chatbots (ChatGPT, M365 Copilot, Claude, etc.) and assume no special security tool integrations. Adapt the specifics to your environment.
Explaining an unfamiliar event: "I'm seeing Windows Event ID 4768 with result code 0x12 in our domain controller logs. Explain what this means, what normal vs. suspicious context looks like, and what I should investigate next."
Drafting an incident summary from raw data: "Here's a set of raw alert fields from our SIEM: [paste fields]. Write a concise incident summary covering: what happened, what systems and accounts are involved, the potential impact, and recommended next steps. Use plain language appropriate for a shift handoff report."
Gut-checking suspicious behavior: "A user account accessed 14 SharePoint sites in 3 minutes, then downloaded 200+ files from a site they'd never visited before. The account has no prior DLP alerts. Walk me through whether this looks like compromised credentials, insider activity, or potentially normal behavior, and what evidence I should look for to distinguish between them."
Translating between query languages: "Convert this SPL query to KQL for Microsoft Sentinel. Preserve the logic exactly and note any functions that don't have a direct equivalent. You have the following tables available to use: [paste tables] and the original SPL query is: [paste query]"
Understanding a script found during investigation: "Analyze this PowerShell script found in a scheduled task on a compromised host. Explain what each section does, flag anything that looks like attacker tooling or obfuscation, and identify any IOCs (domains, IPs, file paths, registry keys) I should search for across the environment: [paste script]"
Organizational policy Q&A: "Our security policy says remote access requires MFA and conditional access compliance. I have a user connecting via VPN from a personal device that passed health check but isn't Entra-joined. Does this violate policy, and what's the correct remediation path? Ask me clarifying questions if you need more information."