Microsoft Sentinel Blog

Turning historical patterns into actionable detection pipelines with Microsoft Sentinel data lake

Ashwin_Patil
Microsoft
Mar 18, 2026

Identifying recurring attacker IPs and password spray attempts with Microsoft Sentinel data lake jobs.

This article is part of the Sentinel data lake practitioner series. In part 1, we introduced the Operationalization Framework — a structured way to turn exploratory notebooks into reliable, scheduled Spark jobs within the Microsoft Sentinel data lake.

Now in Part 2, we go from framework to function — showing how defenders can turn historical data into fresh, actionable insights using modular pipelines built around one of the most persistent threats today: Password Spray attacks.

Why Password Spray Still Matters

Unlike brute-force attacks that hammer one account, password spray campaigns try a few passwords across many accounts, often over days or weeks, to avoid lockouts.
Most detections look at short-term bursts, missing these low-and-slow campaigns that quietly persist. As organizations scale to billions of sign-in events per day, detection teams face an operational dilemma: how to retain long-term behavioral visibility without re-querying terabytes of raw telemetry.

Attackers rotate IPs, leverage shared ASNs (Autonomous System Numbers), and reuse proxy networks. To detect such behavior, analysts need historical memory — visibility into repeated patterns and attacker infrastructure.

Sentinel data lake notebooks for password spray

The new password spray pipeline is a suite of Spark notebooks that implements a modular, cost-efficient detection workflow.
It transforms noisy authentication logs into structured behavioral features through three modular notebooks:

Quick summary of what each notebook does and what it produces:

| Notebook | Description | Output Tables |
| --- | --- | --- |
| data_backfill_setup (optional) | Sets parameters and (optionally) backfills historical days for long-term context. | signin_summary_daily_SPRK_CL, signin_stats_daily_SPRK |
| signinlogs_summaryandstats_daily | Aggregates raw sign-in logs into daily rollups and statistics. | signin_summary_daily_SPRK_CL, signin_stats_daily_SPRK |
| password_spray_features | Computes behavioral features every 4 hours by comparing recent activity against a historical lookback window. | password_spray_features_SPRK_CL |

Together, these notebooks separate daily summaries (the data lake layer) from feature computation (the analytics layer), minimizing cost while keeping analytics fresh.

High Level Architecture

Below is the high-level architecture of the password spray detection pipeline, built on a modular, scalable approach.

Key Components:

Raw Data Ingestion

The pipeline starts with the ingestion of raw sign-in logs from the Sentinel data lake. Historical data is seeded using the data_backfill_setup module, ensuring long-term behavioral visibility for detection. This step is optional; run it on the first execution if you want to backfill historical days for comparison with fresh feature calculations.

Daily Summarization

The signinlogs_summaryandstats_daily notebook processes daily authentication events, creating summary tables and rollups. This separates the “data lake” layer (historical summaries) from the “analytics” layer (real-time analytics), optimizing cost and performance.
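As a simplified sketch of what the daily rollup does — in plain Python rather than Spark, with hypothetical field names (`ip`, `day`, `user`, `success`) standing in for the real sign-in log schema — the summarization groups raw events by IP and day and counts attempts, successes, and distinct users:

```python
from collections import defaultdict

def summarize_daily(events):
    """Roll raw sign-in events up into one summary row per (IP, day).

    `events` is a list of dicts with hypothetical fields:
    ip, day, user, success (bool).
    """
    rollup = defaultdict(lambda: {"attempts": 0, "successes": 0, "users": set()})
    for e in events:
        key = (e["ip"], e["day"])
        rollup[key]["attempts"] += 1
        rollup[key]["successes"] += e["success"]  # bool counts as 0/1
        rollup[key]["users"].add(e["user"])
    # Flatten to summary rows; the user set becomes a distinct-user count
    return [
        {"ip": ip, "day": day, "attempts": v["attempts"],
         "successes": v["successes"], "distinct_users": len(v["users"])}
        for (ip, day), v in rollup.items()
    ]
```

The point of this layer is that downstream feature computation only ever touches these compact rows, never the raw event volume.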

Feature Engineering

The password_spray_features notebook runs every 4 hours, merging recent raw data (the last 4 hours) with 30–90 days of pre-aggregated historical context generated from either the daily summaries or an ad-hoc backfill of historical days. It computes behavioral metrics such as total attempts, distinct users, success rate, entropy normalization, and a weighted spray score, labeling each run as LOW, MEDIUM, or HIGH risk. This table can also be written to the analytics tier if you want to build an alerting workflow on top of it; if you continue to hunt from notebooks, you can keep it in the data lake tier.

Feature Outputs

Lastly, results are written to feature tables that power downstream security operations:

  • Alerts: High-confidence incidents for risky IPs. This requires elevating results to the analytics tier so they are accessible via Advanced Hunting.
  • Threat Hunts: Investigations into recurring ASNs or geographic patterns.
  • Dashboards: Identity-attack KPIs, heatmaps, and trends that run against the summary tables, so you are not querying the raw log table. For dashboarding, the relevant summary tables need to be in the analytics tier.

Analyst Views

The architecture supports advanced analyst views, including investigations by ASN, IP, country, and city. Entropy metrics help correlate related IPs under shared cloud or proxy providers, enabling defenders to identify persistent attacker infrastructure and feed high-risk ASNs into blocklists or enrichment systems.

Inside the Feature Notebook – Detailed Breakdown

The password_spray_features notebook transforms aggregated sign-in data into behavioral indicators that quantify the likelihood of password spray activity.
Rather than relying on simple thresholds (e.g., “X failed logons per minute”), it computes multi-dimensional features capturing attacker behavior over time.

Key Glossary Terms

Below are key glossary terms used throughout this section to describe the process.

  • ASN: Network Autonomous System Number; helps identify ISP or network owner.
  • Username entropy: Shannon entropy quantifies how evenly usernames are distributed across an IP’s attempts. High entropy suggests an attacker spreading attempts broadly; low entropy suggests focused attempts (potentially internal automation).
  • Normalization: To keep features comparable across scales, values are normalized against the global maximum, e.g., distinct_users_norm = distinct_users / max(distinct_users).
  • Spray score: Composite weighted score combining distinct user count, entropy, and success rate.
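The entropy and normalization terms above can be sketched in plain Python. This is a minimal illustration of the definitions (Shannon entropy over an IP’s attempted usernames, and max-scaling), not the notebook’s actual Spark implementation:

```python
import math
from collections import Counter

def username_entropy(usernames):
    """Shannon entropy (bits) of the username distribution for one IP.

    High entropy: attempts spread evenly over many accounts (spray-like).
    Low entropy: attempts focused on one or a few accounts.
    """
    counts = Counter(usernames)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def normalize(values):
    """Scale each value by the global maximum so features share a 0-1 range."""
    peak = max(values)
    return [v / peak if peak else 0.0 for v in values]
```

For example, four attempts spread over four accounts yield 2.0 bits of entropy, while four attempts against a single account yield 0.0 — which is exactly the signal that separates a spray from a user mistyping their own password.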

Notes for Practitioners

  • Entropy helps reduce false positives — differentiating focused logon failures (legitimate user typos) from broad credential sprays.
  • Normalization keeps scores stable — enabling comparisons across time ranges and environments.
  • Spray score risk label tiers support alerting and downstream triage automation, allowing SOC teams to prioritize IPs with high potential impact.
  • The daily summary and feature tables are reusable artifacts — can feed dashboards, threat hunts, UEBA models, or scheduled detections in Sentinel Analytics tier.

Data Input and Time Window

Each execution processes:

  • Recent activity (e.g., last 4 hours)
  • Combined with a historical lookback window (e.g., 30-90 days)

This creates a hybrid snapshot balancing recency with historical persistence, allowing slow-moving campaigns to stand out.
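Under assumed parameters (4-hour recency, 30-day lookback), the window boundaries for one run can be sketched as:

```python
from datetime import datetime, timedelta

def hybrid_window(now, recent_hours=4, lookback_days=30):
    """Return (recent_start, lookback_start) boundaries for one run.

    Recent raw activity covers the last `recent_hours`; historical context
    covers `lookback_days` of pre-aggregated daily summaries before that.
    """
    recent_start = now - timedelta(hours=recent_hours)
    lookback_start = recent_start - timedelta(days=lookback_days)
    return recent_start, lookback_start
```

Recording these boundaries alongside each run (as the output schema does with detection_window_start and detection_window_end) keeps every score reproducible.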

Feature Computation

Each IP address, grouped by ASN, City, and Country, is analyzed to extract behavioral metrics.

Below is a high-level schema transformation diagram showing the flow from raw logs to summary calculation and, finally, feature computation over the recent time window, scheduled to run at a defined frequency.

 

Spray Score Formula

The spray_score combines normalized metrics and an inverse success ratio into a single behavioral likelihood score:

spray_score = 0.5 × distinct_users_norm + 0.2 × (1 − success_rate) + 0.3 × entropy_norm

Explanation:

  • distinct_users_norm (50%) emphasizes spread of attack.
  • (1 - success_rate) (20%) penalizes benign IPs with legitimate logons.
  • entropy_norm (30%) captures randomness typical of distributed attacks.

Each component is rounded to two decimals for readability and consistent scoring.
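A minimal plain-Python sketch of the score computation, using the weights stated above (50% user spread, 20% inverse success rate, 30% entropy) and rounding each component to two decimals:

```python
def spray_score(distinct_users_norm, success_rate, entropy_norm):
    """Weighted spray score combining the three normalized components.

    Weights follow the article: 50% spread, 20% inverse success, 30% entropy.
    Each component is rounded to two decimals for consistent scoring.
    """
    components = (
        round(0.5 * distinct_users_norm, 2),  # spread of attack across accounts
        round(0.2 * (1 - success_rate), 2),   # down-weights mostly-successful (benign) IPs
        round(0.3 * entropy_norm, 2),         # randomness typical of distributed attacks
    )
    return round(sum(components), 2)
```

Because every input is already normalized to 0–1, the weighted sum is itself bounded by 0–1, which is what makes the fixed risk-tier thresholds meaningful across environments.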

Risk Labelling

Since the score is normalized to 0-1, it can be bucketed into qualitative risk tiers.

| Range | Label | Meaning |
| --- | --- | --- |
| < 0.3 | LOW | Likely benign / low spread |
| 0.3–0.6 | MEDIUM | Possible automated scanning or early spray |
| ≥ 0.6 | HIGH | High-confidence password spray behavior |
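The bucketing itself is a straightforward threshold check; a plain-Python sketch using the ranges above:

```python
def risk_label(score):
    """Bucket a normalized spray score (0-1) into qualitative risk tiers."""
    if score >= 0.6:
        return "HIGH"    # high-confidence password spray behavior
    if score >= 0.3:
        return "MEDIUM"  # possible automated scanning or early spray
    return "LOW"         # likely benign / low spread
```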

Output Schema

Each row in the resulting password_spray_features_SPRK_CL table represents a unique IP and its behavioral fingerprint for a given analysis window.

| Column | Description |
| --- | --- |
| IPAddress | Source IP address |
| ASN, City, Country | Enrichment context |
| attempts_total, success_count, distinct_users, days_active | Base metrics |
| username_entropy, distinct_users_norm, entropy_norm, success_rate | Derived features |
| spray_score, spray_score_label | Final behavioral score and label |
| detection_window_start, detection_window_end, run_date | Window boundaries for reproducibility |

 

Here is a visual depicting the flow of feature calculations for the password spray detection notebook — from raw metrics → normalization → scoring → labeling — in a clean horizontal layout.

 

Tracking Adversary Infrastructure

Including ASN, City, Country, and entropy metrics allows defenders to:

  • Correlate related IPs under shared cloud or proxy providers
  • Identify persistent attacker infrastructure across days or weeks
  • Feed high-risk ASNs into TI blocklists or Defender XDR enrichment

Call to Action: Operationalize in 30 minutes

  1. Deploy the three notebooks (backfill, daily summary, features) into your Sentinel data lake workspace using the VS Code extension after cloning the repo locally.
  2. Run data_backfill_setup once (optional) to seed 30–90 days of history.
  3. Schedule signinlogs_summaryandstats_daily to run daily and write summary tables.
  4. Schedule password_spray_features every 4 hours to produce password_spray_features_SPRK_CL.
  5. Operationalize outputs: hunt in the feature table, dashboard on summaries, and (optionally) write high-risk results to the analytics tier for alert rules. Recommended monitoring queries for hunting and analytics are included in the notebook.

Daily Spray Activity Summary

```kusto
password_spray_features_SPRK_CL
| where run_date >= ago(7d)
| summarize
    TotalSprayIPs = dcount(IPAddress),
    HighRiskIPs = dcountif(IPAddress, spray_score_label == "HIGH"),
    TopCountries = make_set(Country, 5)
    by bin(run_date, 1d)
```

Persistent Threat Actors

```kusto
password_spray_features_SPRK_CL
| where spray_score_label in ("HIGH", "MEDIUM")
| where days_active >= 2 // Active for multiple days
| top 20 by spray_score desc
```

Conclusion

Rather than focusing solely on short-lived spikes in telemetry, detections should incorporate historical context to identify persistent adversary behavior. For example, an attacker may attempt a small set of common passwords across many accounts each night while rotating IP addresses, thereby avoiding suspicion within any single 15‑minute window. By correlating low-volume activity over a 30–90 day lookback period, detections can attribute recurring infrastructure to a slow, sustained password-spray campaign and surface persistence—not just isolated activity.

By turning Spark notebooks into modular, operational pipelines within Microsoft Sentinel data lake, we create a repeatable detection architecture — one that scales analytics, reduces costs, and integrates seamlessly into your broader SIEM ecosystem.

As organizations shift from reactive to proactive detection engineering, the Sentinel data lake emerges as the foundation for next-generation behavioral analytics — where every authentication log record has a second life as a feature, a score, or an insight.

Resources

For more resources, see:

Updated Mar 17, 2026
Version 1.0