Streaming and Batch Data Architectures with Microsoft Fabric to Azure Databricks
Authors: Oscar Alvarado (oscaralvarado) and Rafia Aqil (Rafia_Aqil)

Note: This article describes a solution idea. Your cloud architect can use this guidance to help visualize the major components for a typical implementation. Use this article as a starting point to design a well-architected solution that aligns with your workload’s specific requirements.

As organizations adopt Microsoft Fabric as their unified analytics platform, it has become a leading path for ingesting both streaming and batch data into Azure Databricks. This article covers integration approaches through Microsoft Fabric and details the five Fabric-specific paths that connect OneLake/ADLS and Databricks for end-to-end data processing.

Medallion Architecture

The following data flow corresponds to the architecture diagram: Data is ingested through Microsoft Fabric (via Mirroring, RTI, or Data Factory) and lands in OneLake/ADLS. With the medallion pattern, consisting of Bronze, Silver, and Gold storage layers, organizations have flexible access and extendable data processing:

Bronze – Raw data entry point. Data arrives in its source format and is converted to the open, transactional Delta Lake format.
Silver – Optimized for BI and data science. ETL and stream processing tasks filter, clean, transform, join, and aggregate Bronze data into curated datasets using SQL, Python, R, or Scala.
Gold – Enriched data ready for analytics and reporting. Analysts use Power BI, PySpark, SQL, or Excel for insights and queries.

Fabric Integration Paths

Note: This architecture establishes a complete loop-back between Microsoft Fabric and Azure Databricks, enabling Gold layer tables to be mirrored back to Microsoft Fabric for dashboarding through Azure Databricks Mirroring.
The following five paths connect Microsoft Fabric to Azure Databricks:

Fabric Mirroring to OneLake – A low-cost, low-latency turnkey solution that creates a replica of data from operational sources (SQL Server, Azure Cosmos DB, Oracle) in OneLake. Handles the initial load and ongoing CDC changes automatically, keeping data continuously up to date.
Fabric RTI to OneLake – Fabric Real-Time Intelligence ingests streaming event data into OneLake with sub-second latency, enabling real-time analytics on live event streams.
Fabric Data Factory to OneLake – Orchestrates ingestion from diverse sources not covered by Mirroring (such as Sybase or REST APIs) and lands data in OneLake, ensuring complete source coverage.
OneLake to Azure Databricks – Unity Catalog connections to OneLake, secured via Managed Identities from Microsoft Entra ID, allow Databricks to query OneLake data items as a native catalog without data duplication.
Fabric Data Factory to Azure Databricks (direct) – Orchestrates ingestion from diverse sources directly into Azure Data Lake Storage (ADLS), where Azure Databricks picks up the data for medallion architecture processing.

Design Considerations

Area | Updated guidance
Direct RTI-to-Databricks integration | There is still no broad GA direct integration where Fabric RTI and Databricks operate as one native real-time runtime. Integration should be positioned through open protocols, Event Hubs/Kafka-style patterns, OneLake, Delta, and federation.
OneLake federation in Azure Databricks | OneLake federation in Azure Databricks is now the key integration story. It allows Databricks Unity Catalog to query Fabric Lakehouse and Warehouse data in OneLake without copying it. Access is read-only and depends on Fabric tenant settings, workspace permissions, and Databricks Unity Catalog setup.
RTI data availability to Databricks | Data ingested through Fabric RTI can be made available to Databricks by landing or exposing the data into OneLake-backed items, especially Lakehouse/Warehouse patterns. Eventhouse data can be made available in OneLake in Delta format through OneLake availability, but Databricks OneLake federation should be validated against the specific Fabric item type and access path.
Existing Databricks customers | Existing Databricks customers do not need to abandon Databricks. They can use Fabric RTI as the event ingestion, real-time detection, operational alerting, and business action layer, while continuing to use Databricks for engineering, ML, advanced analytics, and Unity Catalog-governed access.
Activator and business action | Fabric Activator is the cleanest business-user action layer. It can monitor streaming events and trigger Teams messages, email, Power Automate flows, Fabric pipelines, notebooks, Spark jobs, Dataflows, UDFs, and other downstream actions. This is a strong differentiator because it lets business users act on events without waiting for batch analytics.
Operations Agents | Operations Agents are in preview and should be positioned carefully. They monitor real-time data from Eventhouse or ontology sources, surface insights, recommend actions, and can connect to Activator/Power Automate action paths. They are not simply a pre-ingestion decision engine before data lands anywhere; they work from configured Fabric knowledge/data sources.
Before landing in Lakehouse | For decisioning before Lakehouse persistence, use Eventstream processing and Activator rules on streams. For AI-assisted operational recommendations, use Operations Agents once the relevant data is available in Eventhouse or ontology.

Requirement-Specific Notes

Data Ingestion

Microsoft Fabric Mirroring currently supports SQL Server, Azure Cosmos DB, and Oracle as source systems.
For sources not yet supported by Mirroring, such as Sybase or REST APIs, use Fabric Data Factory pipelines to ensure full coverage across all data systems. Once data is in the landing zone with the correct format, Mirroring’s CDC replication starts automatically and manages the complexity of merging changes (updates, inserts, and deletes) into Delta tables, keeping data in Fabric continuously up to date. Learn more about open mirroring

Storage Format and Time Travel

OneLake supports Delta tables, enabling schema evolution and time travel across all data stored in the lakehouse. Learn more about OneLake and Delta tables

Security

Encryption at rest: OneLake automatically encrypts all data at rest using Microsoft-managed keys, compliant with FIPS 140-2 standards. Learn more
Encryption in transit: All data in transit is encrypted using TLS 1.2 or higher, securing data movement between Fabric, OneLake, and Azure Databricks. Learn more

Data Governance

OneLake can be registered and scanned by Microsoft Purview, enabling cataloging of stored metadata and data quality profiling. This protects sensitive information, including PHI and PII, across ingestion and analytics workflows. Learn more about Purview with Fabric Lakehouse

Operations and Monitoring

Use the Fabric monitor hub to track pipeline health, Spark application performance, and ingestion job status across all Fabric workloads. Learn more about the Fabric monitor hub

Scenario Details

This architecture applies to any organization that needs to unify streaming and batch data at scale.
Common characteristics include:

Multiple operational data sources (databases, SaaS applications, event streams)
A requirement to process both real-time and historical data in the same platform
Governance and compliance requirements for sensitive data (PHI, PII, financial records)
Analytics consumers spanning BI (Power BI), data science (Databricks notebooks), and ML workloads

Potential Use Cases

Healthcare and life sciences – PHI/PII protection via Purview; real-time patient telemetry + batch EHR analytics
Financial services – Real-time fraud detection streams + batch regulatory reporting
Retail and e-commerce – Streaming clickstream analytics + batch inventory and supply chain processing
Energy and utilities – IoT sensor telemetry streaming + batch consumption analytics

Next Steps

Get started with Microsoft Fabric Mirroring
Build an ETL pipeline with Lakeflow Declarative Pipelines
Configure Unity Catalog with OneLake shortcuts
Monitor Fabric pipelines with the Fabric monitor hub

Deep Dive: Implementing Retrieval-Augmented Generation (RAG) with Azure AI Search
Artificial Intelligence has changed the way businesses handle information, automate tasks, and interact with users. Large Language Models (LLMs) such as GPT-based systems can generate impressive responses, but they have one major limitation: they do not automatically know your private business data, internal documents, or the latest information. https://dellenny.com/deep-dive-implementing-retrieval-augmented-generation-rag-with-azure-ai-search/

Azure Databricks at Databricks Data + AI Summit 2026: updates and new announcements
Databricks Data + AI Summit brings together the global data and AI community in San Francisco to share product news, technical breakthroughs, and customer stories. This year, as usual, we have a lot of Azure Databricks announcements, a strong presence across the event, and a continued focus on helping customers put their data to work across analytics, AI, and enable business productivity. Find us at Data and AI Summit As a Legend Sponsor and Databricks’ long-standing strategic partner, Microsoft is joining Databricks Data + AI Summit during the keynote, multiple breakout sessions, and at the Expo booth. We're also engaging with customers 1:1 to hear from you. Satya Nadella will join Ali Ghodsi, CEO Databricks, in a pre-recorded keynote conversation on the importance of data in AI implementation and the deep integrations we co-engineer. We encourage you to visit us at the Microsoft Booth (Booth # 103) on the Expo floor to chat with the Azure Databricks team, see demos, and learn more about the recent announcements. Azure Databricks Breakout Sessions Unlocking the Microsoft Data & AI Ecosystem with Azure Databricks: From Insight to Impact Wednesday, June 17 | 1:50 PM – 2:30 PM PDT | Speaker: Anavi Nahar, Head of Product, Azure Data Lake Storage & Azure Databricks, Microsoft In today’s data-driven landscape, organizations need more than analytics—they need a unified platform that turns raw data into actionable intelligence across the Microsoft ecosystem. This session explores how Azure Databricks serves as the backbone of modern data architecture, integrating with core Microsoft cloud services and platforms to accelerate innovation. Learn how to use Azure Databricks for scalable data engineering, advanced analytics, and AI-driven solutions while enabling real-time collaboration and governance. Through practical examples and architectural patterns, we’ll show how to eliminate data silos, optimize performance, and empower teams to deliver insights faster. 
Zero-Copy Federated Energy Analytics: ADME + Databricks in Action Wednesday, June 17 | 12:40 PM - 1:20 PM PDT | Speaker: Andy Corran, Principal Product Manager, Azure Databricks, Microsoft Oil and gas companies have standardized on Azure Data Manager for Energy (ADME) as their subsurface system of record, but running analytics and AI on that data has meant copying massive datasets into downstream platforms, breaking governance and slowing every workflow that follows. In this jointly developed Microsoft and Databricks session, we introduce a new zero‑copy, federated path that brings Databricks compute directly to data, with native governance and serverless scale. We walk through the architecture, show the solution in action against live ADME, and share how operators across the industry are accelerating subsurface analytics while keeping ADME as the single source of truth. Unity Catalog External Locations: Extending Governance to OneLake and Beyond Wednesday, Jun 17 | 5:20 PM - 5:40 PM PDT | Speaker: Ljubica Vujovic Boskovic, Senior Product Manager, Databricks In this session, we'll show how External Locations provide a consistent, extensible pattern for connecting Databricks to any storage platform — and walk through what it takes to create External Location for Microsoft OneLake. You'll see the architecture, the setup end-to-end, and a demo reading and writing UC-governed assets directly into OneLake storage without needing to setup any ETL pipelines. Latest announcements We recently announced new ways to build AI apps and agents with Azure Databricks, Copilot Studio, and GitHub Copilot, including authoring Copilot Studio agents that reason over an entire Azure Databricks workspace through one MCP connection. At Microsoft Build, PepsiCo also shared its blueprint for agentic AI, illustrating how Azure Databricks can provide the data foundation for agentic apps. 
This week’s announcements make it easier to use Azure Databricks with the Microsoft tools your teams rely on every day, including Microsoft Teams, M365 Copilot, Excel, SharePoint, Power BI, and OneLake: Genie for Microsoft Teams and M365 Copilot (Beta): You can tag Genie in a Teams thread and get a context-aware answer from your Azure Databricks lakehouse without leaving the conversation. Responses are governed by Unity Catalog, so each answer is scoped to what the user is permitted to see. It’s part of the broader Genie One experience for report generation, reusable agents, low-code apps, and natural-language pipeline design. See it in action in the Databricks + Microsoft co-authored training in AI Skills Navigator Genie in Copilot Cowork (Beta): Available today, Databricks Genie works seamlessly with M365 Copilot Cowork. This integration will allow teams to anchor Cowork’s tasks with the Genie Ontology, bringing trusted data intelligence straight into their workflows Azure Databricks Excel Add-in (Public Preview): This brings governed lakehouse data into Excel without SQL or per-user ODBC setup. Unity Catalog metric views let business logic be defined once and stay consistent across tools, and the add-in supports write-back, so permitted users can push updates from Excel into Databricks. Learn how to set it up. SharePoint Connector (Beta) via Lakeflow Connect. A fully managed connector for point-and-click ingestion pipelines that bring SharePoint content — structured sheets and unstructured PDFs, Word docs, and PowerPoints — into Delta tables, keeping downstream analytics, Genie spaces, and Excel workbooks supplied with current data. Read the documentation here. Azure Databricks OneLake Catalog Federation (Generally Available): The ability to query OneLake data directly from Azure Databricks without pipelines, duplication, or data movement is generally available. 
This announcement, coupled with the Azure Databricks Mirrored Catalog item, enables bidirectional reads between Azure Databricks and OneLake. Learn more here
Storing Unity Catalog Managed Tables in OneLake (Beta): Customers can now use OneLake as a storage location option for Unity Catalog tables in addition to Azure Data Lake Storage (ADLS). Read more on how to do this here.

CustomerLake: a customer data platform inside the lakehouse

Introducing CustomerLake, a Customer Data Platform (CDP) built directly within the lakehouse rather than as a separate application. CustomerLake is now available in Azure Databricks. Two kinds of agents do much of the work:

Profile Agents help assemble business-ready Customer 360 profiles from fragmented sources, reducing the manual effort of stitching customer data together.
Campaign Agents give marketing teams a workspace to segment audiences, recommend next-best actions, activate across channels, and continuously optimize personalized experiences.

Because CustomerLake runs inside your governed storage boundary, customer data, AI models, and governance stay together, avoiding much of the data movement and duplication that come with connecting separate marketing tools. For Azure customers, that means building customer engagement on the same governed lakehouse foundation they already use for analytics and AI, rather than maintaining a parallel stack.

“What excites us most about the CustomerLake and the new CDP capability is the ability to bring customer data together in a way that is actionable, timely, and scalable. By creating a more complete view of each customer, we can better understand behaviors, preferences, and needs across channels, which will help us deliver more personalized experiences and more relevant offers.
Ultimately, we see this as a powerful step toward stronger engagement, deeper loyalty, and better outcomes for both our business and our customers.” Jay Malepati, Global Director of Data Science, Circle K

All of these announcements benefit from built-in governance with Azure Databricks Unity Catalog. By connecting governed lakehouse data to the Microsoft tools your teams already use (Teams, M365 Copilot, Excel, SharePoint, OneLake, and Power BI), these updates make it easier to put trusted AI to work on Azure. To learn more, explore the Azure Databricks documentation and try these capabilities in your own workspace.

Microsoft Teams Presence Report
[New Blog Post] In this article I describe my #PowerShell script, which I have made available for you on #GitHub. This script is used to retroactively display the #MSTeams presence status as an HTML report per user. https://www.msb365.blog/?p=5816 #M365 #MVPbuzz

What to Do When You Hit Capacity in Azure Databricks: Engage, Mitigate, Plan!
Microsoft's Cloud Architects: Paul Singh (PaulSingh), Eduardo Dos Santos (eduardomdossantos), Chris Walk (cwalk), Peter Lo (PeterLo), Tim Orentlikher (tim_orentlikher), Ajmal Hossain (ajmalhossain), Chris Haynes (Chris_Haynes), and Rafia Aqil (Rafia_Aqil)

Start Here: Engage Microsoft

Capacity constraints in Azure Databricks are not an Azure Databricks product issue. Azure Databricks does not own or reserve compute; it dynamically provisions VMs from Azure when clusters are created or scaled. This means cluster creation, autoscaling, or job execution can stall when the underlying VM SKUs are constrained at the regional level. The fastest path to resolution is a structured conversation with your Microsoft account team, who can engage the Azure capacity intake process on your behalf. Create a Quota Support Ticket via Microsoft Support and bring the following to your account team along with your support ticket number. Each field maps directly to what capacity intake teams will ask for; missing fields slow the request.

What to Prepare Before You Reach Out to Your Account Team

Field | What Capacity Intake Needs | Example
Subscription IDs | The exact Azure subscriptions that will host the workspaces and clusters | 7ebee83d-7923-426c-8449-59fd4dff25ab
Region(s) | Primary region, plus any acceptable alternates | East US 2
VM family / SKU | Specific series and version requested | Eadsv5, ESv4, DSv4, DSv2
Core count / new limit | Total vCPU or core count per SKU | 10,000 cores for Eadsv5
Workload characteristic | CPU-bound vs. memory/shuffle-heavy vs. IO-heavy; batch vs. streaming vs. SQL | “Memory-intensive ETL with large joins and shuffles”
Scale and timing | When you need it, ramp profile, peak vs. steady state | “Need by month-end; ramp from 2,000 to 9,650 cores over Q3”
Business context | Business use case | “Migration off AWS”

What “Capacity” Really Means: A Layered Mental Model

Before diving into fixes, it is important to understand what is actually happening behind the scenes.
Capacity constraints can occur at three distinct layers, and solving them requires addressing each one.

Layer 1: Azure Infrastructure

This is the layer most teams underestimate. Capacity here is governed by:

VM SKU availability in the region. D-series and E-series, the two most common Databricks worker families, have repeatedly hit capacity constraints across multiple Azure regions, causing cluster creation failures, autoscale stalls, and provisioning delays.
Regional supply constraints, which are dynamic and shared across all Azure tenants.
vCPU quotas and limits per subscription, which are separate from regional supply. Quota is your subscription’s limit to deploy resources (like a credit card limit); regional capacity is the underlying infrastructure available. Both must be sufficient.

Layer 2: Azure Databricks Platform

The Azure Databricks control plane has its own published ceilings that your architecture must proactively respect. Key limits from the official Azure Databricks resource limits documentation:

Resource | Limit | Scope
Jobs created per hour | 10,000 | Workspace
Tasks running simultaneously | 2,000 | Workspace (Run Job and For Each parent tasks excluded)
Parent tasks running simultaneously (Run Job / For Each) | 750 | Workspace
SQL warehouses | 1,000 | Workspace
Attached notebooks or execution contexts | 145 | Cluster
Virtual machines | 25,000 | Per subscription per region

Note: For limits marked as non-fixed in the official documentation, you can request an increase through your Azure Databricks account team.

Reference: https://learn.microsoft.com/en-us/azure/databricks/resources/limits

Layer 3: Workload (Spark Execution)

Even when both lower layers cooperate, Spark’s own execution model can produce capacity-like symptoms:

Parallelism and task distribution, which dictate how many cores a job can usefully consume.
Memory pressure from joins, shuffles, and skewed keys.
IO demand and caching behavior, including Delta cache effectiveness and Spark cache misuse.
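As a rough illustration of respecting the Layer 2 ceilings proactively, the sketch below compares a planned workload against the workspace limits listed above and flags anything at or beyond 80% of a ceiling. The names `PLATFORM_LIMITS`, `LimitFinding`, and `check_against_limits`, along with the 80% threshold, are assumptions made for this example; the numbers are copied from this article and should be confirmed against the official limits page before use.

```python
from dataclasses import dataclass

# Workspace-level ceilings copied from the limits listed in this article.
# Treat them as a snapshot; the official limits page is the source of truth.
PLATFORM_LIMITS = {
    "jobs_created_per_hour": 10_000,
    "tasks_running_simultaneously": 2_000,
    "parent_tasks_running_simultaneously": 750,
    "sql_warehouses": 1_000,
}


@dataclass
class LimitFinding:
    resource: str
    planned: int
    limit: int
    utilization: float  # planned / limit


def check_against_limits(planned, warn_ratio=0.8):
    """Return a finding for each planned value at or above warn_ratio of its limit."""
    findings = []
    for resource, value in planned.items():
        limit = PLATFORM_LIMITS.get(resource)
        if limit is None:
            continue  # not a limit tracked in this sketch
        utilization = value / limit
        if utilization >= warn_ratio:
            findings.append(LimitFinding(resource, value, limit, utilization))
    return findings


if __name__ == "__main__":
    plan = {"tasks_running_simultaneously": 1_800, "sql_warehouses": 40}
    for finding in check_against_limits(plan):
        print(f"{finding.resource}: {finding.planned}/{finding.limit} "
              f"({finding.utilization:.0%} of limit)")
```

A check like this only covers the platform layer. Subscription quota and regional supply (Layer 1) still have to be confirmed separately.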
Understanding these layers is critical. Retries sometimes succeed because capacity is dynamic: as other workloads complete, nodes are released back to Azure and briefly become available.

Recognizing When You’ve Hit Capacity

Capacity issues rarely present as a single clean error. Instead, they appear as inconsistent behaviors:

Clusters stuck in Pending state
Autoscaling fails or never reaches the desired size
Jobs intermittently fail to start
Retry attempts sometimes succeed

These inconsistencies occur because capacity is shared across Azure tenants and fluctuates throughout the day. Running workloads outside peak business hours in the impacted region’s time zone is one of the most effective short-term mitigations.

Immediate Actions: How to Unblock Your Workloads

When you are actively hitting capacity constraints, speed matters. Reach out to your Microsoft account team and try the following mitigations, which are ordered from quickest to most involved.

Retry and Run During Off-Peak Hours

Capacity availability changes throughout the day as workloads complete and release VMs. Running outside peak business hours for the impacted region significantly improves success rates.

Switch VM SKU or Family

If a specific VM SKU is constrained, switching to another can immediately unblock provisioning.

Move within the same family (for example, DSv4 → DSv5)
Or switch families entirely (for example, D-series → F-series or L-series)

This is one of the most effective but often underused approaches.

Choosing the Right VM Family

Most Databricks environments default to D-series (general purpose) and E-series (memory optimized). These are also the most heavily used and most capacity-constrained VM families.
Consider alternatives based on your workload:

VM Family | Best For | When to Use | Trade-off
D-series | General workloads | Default choice | Often constrained in high-demand regions
E-series | Memory-heavy Spark jobs | Joins, shuffles, analytics | High demand; higher cost
F-series | CPU-intensive jobs | Parsing, transformations | Lower memory per core
L-series | IO-heavy workloads | Delta caching, large datasets | Higher cost; large local NVMe

Practical decision framework:

Memory-bound workloads (joins, shuffles): Move from E-series to L-series. Similar memory per core, plus large local NVMe for Delta caching.
CPU-bound workloads: Move from D-series to F-series. Higher CPU performance at lower cost.
IO-heavy or cache-sensitive workloads: L-series can significantly improve performance and reduce shuffle pressure.

Designing around a single VM family is one of the biggest production risks in Azure Databricks environments.

Implement Regional Diversity in Your Databricks Workload

Because Azure capacity constraints are region- and SKU-specific, it is important to build architectural flexibility into your Databricks deployments. For critical or large-scale workloads, consider deploying multiple Databricks workspaces across different Azure regions to reduce dependency on any single region’s capacity. This approach enables:

improved resilience to regional capacity constraints
greater flexibility in workload placement

Important: Multi-region deployment requires deliberate architecture, including deploying separate workspaces and replicating data and configurations across regions; it is not automatic.

Why Adding More Nodes Is Not Always the Answer

When jobs slow down, the instinct is to scale compute. With Spark, more nodes do not always solve the problem.
Common workload issues that masquerade as capacity problems:

Data skew
Excessive shuffle operations
Inefficient partitioning
Overuse of UDFs

In real workloads, shuffle operations can grow significantly larger than input data, placing heavy pressure on both compute and memory that more nodes cannot relieve.

Smarter optimization strategies:

Reduce shuffle through repartitioning and query optimization
Enable Photon for faster execution
Optimize Delta tables using Z-ordering and compaction
Leverage caching strategically (not just the Spark cache; use the Delta/disk cache)

These optimizations can reduce your dependency on scarce VM capacity altogether.

What to Do When Your Capacity Is Approved

Once Azure approves your capacity request, retaining it requires active steps. Because Azure capacity is dynamic and shared, approved capacity is held only while compute remains actively deployed and running. This is especially important in highly constrained regions. Microsoft recommends the following:

Configure an Instance Pool

For workloads that cannot yet use serverless compute, configure an Azure Databricks Instance Pool with a minimum number of idle nodes aligned to your production requirements. An instance pool pre-allocates and maintains a set of idle, ready-to-use VM instances. When a cluster is created from the pool, it draws from these warm nodes, eliminating the need to request new VMs from the regional Azure capacity pool between job runs.

Key behaviors:

The pool holds a minimum number of nodes continuously, keeping them warm and immediately available.
Clusters attached to the pool pull from warm nodes, avoiding re-acquisition from Azure between runs.
No DBU charges apply while nodes are idle in the pool.
Azure VM infrastructure costs do apply for all minimum idle instances.
Size the pool conservatively, aligned to production need only, to balance capacity retention against ongoing cost.

Important: Instance pools hold idle nodes on a best-effort basis.
Periodic platform events can recycle pool nodes, briefly causing the pool to fall below its configured minimum idle count while Azure re-acquires replacement nodes. Pools significantly improve availability and startup latency, but they do not change the fact that the underlying VMs are still requested from Azure on demand. They are not a hard reservation.

Reference: https://learn.microsoft.com/en-us/azure/databricks/compute/pools

Designing for Resilience: Long-Term Best Practices

To avoid repeated capacity issues, your architecture needs to evolve beyond reactive mitigations.

Plan for Capacity Early

Understand VM quotas and limits before you need them, not after a constraint occurs. Avoid designing around a single SKU. Build flexibility into cluster configurations so you can switch families without re-engineering jobs.

Standardize Compute Configurations

Consistent, policy-driven environments make it easier to adapt when capacity constraints occur. Use Databricks Cluster Policies to constrain cluster creation to approved, available VM families; this prevents teams from inadvertently requesting constrained SKUs.

Move Toward Serverless Where Possible

Serverless compute abstracts capacity management away from the customer. As the Databricks platform expands serverless support, migrating eligible workloads is the most durable long-term strategy. Azure continues to expand infrastructure capacity, but there are no guaranteed timelines for relief in constrained regions.

Note: If your workload supports serverless compute, Databricks recommends using serverless compute instead of pools or classic VM-backed clusters. Serverless removes dependency on specific VM SKUs and regional capacity; scaling is managed by the platform with significantly improved availability. Reference: https://learn.microsoft.com/en-us/azure/databricks/serverless-compute.
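For workloads that stay on classic compute, the instance pool and cluster policy guidance above can be sketched as two JSON payloads. This is a minimal sketch, not a definitive configuration. The helper names `build_instance_pool_spec` and `build_vm_family_policy` are hypothetical. The field names follow the Databricks Instance Pools API and the cluster policy definition format as commonly documented, and the node type IDs are illustrative, so verify both against what your workspace actually lists. The code only builds and prints the payloads; it does not call any API.

```python
import json


def build_instance_pool_spec(name, node_type_id, min_idle, max_capacity,
                             idle_autotermination_minutes=60):
    """Build a request body shaped like an instance pool create call."""
    if min_idle < 0 or max_capacity < min_idle:
        raise ValueError("need 0 <= min_idle <= max_capacity")
    return {
        "instance_pool_name": name,
        "node_type_id": node_type_id,
        # Warm nodes kept ready; VM cost applies while they sit idle.
        "min_idle_instances": min_idle,
        "max_capacity": max_capacity,
        # Applies to idle instances above the minimum.
        "idle_instance_autotermination_minutes": idle_autotermination_minutes,
    }


def build_vm_family_policy(allowed_node_types, default_node_type):
    """Build a policy definition that allowlists approved worker node types."""
    allowed = list(allowed_node_types)
    if default_node_type not in allowed:
        raise ValueError("default node type must be in the allowlist")
    return {
        "node_type_id": {
            "type": "allowlist",
            "values": allowed,
            "defaultValue": default_node_type,
        }
    }


if __name__ == "__main__":
    # Illustrative values only: size min_idle to production need, not peak.
    pool = build_instance_pool_spec("prod-etl-pool", "Standard_E8ds_v5",
                                    min_idle=4, max_capacity=20)
    policy = build_vm_family_policy(
        ["Standard_E8ds_v5", "Standard_L8s_v3", "Standard_F8s_v2"],
        default_node_type="Standard_E8ds_v5",
    )
    print(json.dumps(pool, indent=2))
    print(json.dumps(policy, indent=2))
```

Allowlisting more than one family in the policy keeps the option to switch SKUs open without re-engineering jobs, which is the flexibility the planning guidance above asks for.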
For eligible workloads, including Databricks Jobs (automated workflows), Databricks SQL Warehouses, and Delta Live Tables, serverless compute eliminates VM SKU dependency entirely. Configuration guidance is available in the Azure Databricks deployment guide, Development Section, Step 9.

Multi-Region Strategy for Critical Workloads

For the most critical workloads, evaluate a multi-region deployment as part of your business continuity planning. This is a significant architectural investment (see the FAQ for the full scope), but it is the only approach that provides true regional redundancy. Coordinate this with your Microsoft account team.

Reference: Azure Databricks & Microsoft Fabric Disaster Recovery: The Complete Better‑Together Strategy for Cloud Architects

Final Takeaways

Capacity issues are infrastructure-level constraints, not Databricks product failures
VM family selection is critical: do not rely solely on D-series and E-series
Workload optimization can reduce dependency on scarce resources before requesting more capacity
Serverless compute is Microsoft’s preferred long-term recommendation for eligible workloads
Azure On-Demand Capacity Reservations provide guaranteed capacity for mission-critical scenarios, distinct from instance pools (best-effort) and Reserved Instances (billing discount only)
Architectural flexibility (multi-SKU, multi-region awareness) is your best defense against future constraints

FAQ

Why do retries work?
Capacity in Azure regions is shared across all tenants and fluctuates throughout the day as workloads complete and release VMs. A retry succeeds when capacity temporarily frees up. Retrying during off-peak hours improves success rates significantly.

Why does capacity fluctuate during the day?
Capacity is a function of regional supply and concurrent demand. As workloads complete, nodes are released back to Azure. Peak business hours in the impacted region’s time zone tend to be the tightest windows.
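The retry behavior described in the two answers above can be sketched as a bounded retry loop with exponential backoff and jitter. This is an illustrative pattern, not a Databricks API: `CapacityError` is a hypothetical exception, and `action` stands in for whatever call starts your cluster or job. The delays and attempt count are placeholder values to tune for your environment.

```python
import random
import time


class CapacityError(Exception):
    """Hypothetical error raised when compute cannot be provisioned."""


def retry_with_backoff(action, max_attempts=5, base_delay=30.0, max_delay=900.0,
                       sleep=time.sleep, rng=random.random):
    """Call action() until it succeeds or max_attempts is exhausted.

    The wait before retry n is base_delay * 2**(n-1), capped at max_delay and
    scaled by a random factor in [0.5, 1.0) so that parallel jobs do not all
    retry at the same moment.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return action()
        except CapacityError:
            if attempt == max_attempts:
                raise  # give up and surface the last capacity failure
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            sleep(delay * (0.5 + rng() / 2))
```

Passing `sleep` and `rng` as parameters keeps the function easy to test without real waiting. Backoff only smooths over short-lived contention, so it complements off-peak scheduling and SKU flexibility rather than replacing them.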
Why are instance pools not a hard reservation?
Pools hold a minimum number of nodes on a best-effort basis. Periodic platform events recycle pool nodes, so a pool can briefly fall below its configured minimum idle count while Azure re-acquires replacement nodes. Setting minimum idle to 0 avoids paying for idle VMs at the cost of slower acquisition time. Pools significantly improve availability and startup latency but do not guarantee capacity at the Azure infrastructure level.

Why does serverless behave differently from classic clusters?
Serverless compute removes customer control over individual VM SKUs. Databricks manages the underlying capacity across a shared pool. SKU-swap and pool-based mitigations do not apply. Customer-side levers reduce to retry and off-peak scheduling. The trade-off is that serverless is the simplest and most reliable option when the workload supports it.

Why is changing regions a last resort?
Region changes require redeployment of the Azure Databricks workspace and migration of all dependent artifacts: jobs, clusters, libraries, networking (private endpoints, VNet injection), Unity Catalog assignments, identities, and source data. The destination region must be validated for the same SKU and zonal configuration. For these reasons, region change should always be coordinated with the Microsoft account team and attempted only after preferred mitigations have been exhausted.

Why does VM family selection matter so much for capacity?
Different VM families have different supply curves. D-series and E-series are the most requested Databricks worker families and the ones most frequently constrained. Choosing a SKU based on whether the workload is memory/shuffle-heavy, CPU-bound, or IO-heavy improves both performance and the probability that capacity is available. The capacity team often steers customers toward newer-generation alternatives when supply differs by generation version.

What does the Microsoft account team actually do?
They route the request into the Azure capacity intake process, advise alternate SKUs and regions, surface zonal vs. regional considerations, and provide forward visibility into known constraints. The customer’s job is to bring a complete, accurate workload profile so the account team can advocate effectively. It is also recommended to open an Azure Support ticket. This will save time later, as the capacity planning teams would like to track issues and requests via a support ticket. Once an Azure Support ticket is opened, the ticket number should be shared to the Microsoft Account Team, at a minimum to the Customer Success Account Manager (CSAM), if one is assigned to your organization.195Views0likes0CommentsIssue with notifications in external network
I am having problems seeing notifications from an external network (Microsoft's network, to be exact). I can see the notification count all right. However, upon clicking the bell, I get an error 90% of the time. What is really weird is that every now and then it works just fine: it might take 10 tries of the bell icon, or 50, but at some point the notifications are shown to me all right. I seem to be in a tiny minority of people having this issue. Does anyone have any guidance or a troubleshooting approach? Is this a setting in my home tenant, or in the Microsoft tenant? Account permissions? Does a log or telemetry exist anywhere, so I could start asking the right questions about how to resolve this? If anyone has any insight, that would be greatly appreciated.

Thanks, Maciej

Sentinel Foundry - MCP Server (Preview) (Github Community Release)
I’ve been cooking something that a lot of people in SOC have been struggling with, especially on the engineering side of Microsoft Sentinel. Thanks to the Microsoft Security team for shaping the capabilities of Sentinel even better with Sentinel Data Lake & Modern SecOps. Today’s the day I can finally share it.

Note: This is not an official Microsoft product. It is designed to complement the Sentinel build and make it even better, with much more intelligence.

🚀 Sentinel Foundry is now in public preview with 43 tools. (Sentinel Foundry - MCP Server)

It’s an MCP server built to act like the brain of a strong Sentinel engineer, helping make building, improving, and operating Sentinel far more practical, faster, and honestly more enjoyable. For a lot of teams, the challenge is not understanding what Sentinel can do. The hard part is the engineering work around it:

-> Deciding what data should actually be ingested
-> Building a clean, scalable Sentinel foundation
-> Writing useful detections instead of noisy ones
-> Balancing security value with cost
-> Turning ideas into deployable engineering outputs

That is exactly why I built Sentinel Foundry to help communities grow stronger. It helps with the real engineering tasks behind Sentinel, from architecture thinking to detection design, deployment planning, ingestion strategy, automation ideas, and many of the workflows outlined in the GitHub project.

How does it work?

Here’s one of the flagship prompts I ran with it: “Give me a complete security posture report for our workspace. Score each pillar and tell me what to prioritise.” And within seconds, it produced a structured engineering blueprint that would normally take a lot longer to pull together manually.
You can see example prompts showing what it can do here: https://github.com/prabhukiranveesam/Sentinel-Foundry#what-can-it-do

I want building Sentinel to feel less like repetitive engineering overhead and more like real security engineering that is fast, creative, and enjoyable. If you work with Sentinel as a SOC L2 analyst, engineer, detection engineer, consultant, or architect, I’d genuinely love for you to try it and tell me what you think.

🔗 Public Preview: https://github.com/prabhukiranveesam/Sentinel-Foundry

This is just the start of an AI era, and I’m excited to keep shaping it with more powerful features over the coming days. It is very easy to set up and will be available to all of you at no cost during this month as part of the public preview. Your feedback is extremely valuable in shaping this into a powerful solution.

Detecting AI agents and non-human identities in Microsoft Sentinel: the classic-agent blind spot
Build 2026 made the direction official. The industry is moving from the app era into the agent era, and Microsoft spent a real share of the keynote on securing agents across their lifecycle, from discovering what is exploitable to governing what is running in production. On the identity side the centerpiece is Microsoft Entra Agent ID, now generally available, which gives AI agents first-class identities and extends Conditional Access, Identity Protection, and full audit logging to them. That is good news for agents you build the new way. It is not the whole picture, and the gap is where most SOCs will get hurt first. Modern agents are covered. Classic agents are not. Entra Agent ID draws a hard line between two kinds of agent. Modern agents are created through the Agent ID platform, each backed by an agent identity blueprint. They carry a proper Agent ID, a full audit trail, and the complete set of governance capabilities, including Identity Protection for Agents, which establishes a baseline for an agent's normal activity and flags anomalies automatically. Classic agents are everything that came before, or that gets built outside the platform: AI agents implemented as ordinary service principals or app registrations, for example Copilot Studio agents created before Agent ID was enabled, or any home-grown automation calling Graph with client credentials. In the Entra agent registry they appear with "Has Agent ID: No," and that flag matters, because the Agent ID protections apply to identities that actually hold an Agent ID. Classic agents sit outside Identity Protection for Agents and Conditional Access for Agents. Here is the uncomfortable part. The non-human identities you already run, the service principals behind your pipelines, your integrations, your scripts, your pre-platform Copilot Studio bots, are almost all classic agents. 
They tend to outnumber your human accounts, they have no MFA in any meaningful sense, and a credential added to one does not show up in the Azure portal. The new platform protections do not reach them. Until you migrate them, the only place you get detection coverage on that population is your SIEM. So this is the job Sentinel does that Agent ID does not: detect risky behavior on the classic, service-principal-backed agents that the platform cannot yet protect. The telemetry you have, and the one switch people forget Three tables carry most of the signal. AADServicePrincipalSignInLogs records service principal authentications, the client-credentials sign-ins your agents and automation use. No user, no MFA, just an app proving it holds a secret or certificate. AADManagedIdentitySignInLogs does the same for managed identities. AuditLogs records directory changes, including the one that matters most for persistence: a new credential added to an application or service principal. One practical warning before any of this works. Service principal and managed identity sign-in logs are not streamed by default. You have to enable those categories explicitly in the Entra diagnostic settings feeding your workspace. Plenty of teams write the detection, never check, and never notice the table is empty. Verify that first. Detection 1: a new credential on a service principal or app Adding a secret or certificate to an existing service principal is one of the cleanest persistence techniques in a Microsoft cloud. The attacker compromises a privileged user or app, drops a fresh credential on a service principal that already holds useful Graph permissions, and now has access that survives password resets and session revocation. It maps to MITRE T1098.001, Account Manipulation: Additional Cloud Credentials. For a classic agent it is especially nasty, because there is no Identity Protection baseline watching it. 
// Detection 1: new secret or certificate added to an application or service principal
// MITRE T1098.001 - Account Manipulation: Additional Cloud Credentials
AuditLogs
| where OperationName has_any ("Add service principal", "Certificates and secrets management")
| where Result =~ "success"
| extend Initiator = coalesce(
    tostring(InitiatedBy.user.userPrincipalName),
    tostring(InitiatedBy.app.displayName))
| extend InitiatorIp = tostring(InitiatedBy.user.ipAddress)
| mv-apply Target = TargetResources on (
    // Credential changes can target either object type, so keep both
    where tostring(Target.type) in~ ("Application", "ServicePrincipal")
    | extend TargetName = tostring(Target.displayName),
             TargetId = tostring(Target.id),
             KeyChanges = Target.modifiedProperties
  )
| mv-apply Prop = KeyChanges on (
    where tostring(Prop.displayName) =~ "KeyDescription"
    | extend NewKeys = parse_json(tostring(Prop.newValue)),
             OldKeys = parse_json(tostring(Prop.oldValue))
  )
// Keep only events where at least one key exists after the change that did not exist before
| extend AddedKeys = set_difference(NewKeys, OldKeys)
| where array_length(AddedKeys) > 0
| project TimeGenerated, Initiator, InitiatorIp, TargetName, TargetId, AddedKeys
| order by TimeGenerated desc

The operation filter catches the three shapes this event takes in the log: "Add service principal," "Add service principal credentials," and "Update application - Certificates and secrets management." The modifiedProperties parsing isolates the KeyDescription change, and set_difference confirms a key was actually added rather than removed, so rotating out an old credential does not, on its own, fire the rule.

False positives come from legitimate rotation and from automation that provisions app credentials (CI/CD, infrastructure as code). The initiator is the discriminant. A credential added by your deployment pipeline's service account at the usual time is routine. The same change initiated by an interactive admin out of hours, or by an account that never normally touches app credentials, is what you want to surface. Allow-list the expected initiators, not the targets.
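To make the triage logic concrete, here is a minimal Python sketch of the two decisions the rule and its tuning advice describe: "was a key actually added?" and "is the initiator on the allow-list?". It is an illustration only, not part of the Sentinel rule. `CredentialEvent`, `added_keys`, and `needs_review` are hypothetical names, and the event shape is an assumption that mirrors the projected columns.

```python
from dataclasses import dataclass
from typing import Iterable, List, Tuple


@dataclass(frozen=True)
class CredentialEvent:
    """Hypothetical, simplified view of one audit record."""
    initiator: str
    target_name: str
    old_keys: Tuple[str, ...]
    new_keys: Tuple[str, ...]


def added_keys(event: CredentialEvent) -> List[str]:
    """Keys present after the change but not before (the set_difference step)."""
    previous = set(event.old_keys)
    return [key for key in event.new_keys if key not in previous]


def needs_review(event: CredentialEvent, allowed_initiators: Iterable[str]) -> bool:
    """Flag an event only if a key was added by an initiator that is not allow-listed."""
    if not added_keys(event):
        return False  # pure removal or no change: rotation-out does not fire
    allowed = {name.lower() for name in allowed_initiators}
    return event.initiator.lower() not in allowed
```

Note that the allow-list is keyed on the initiator, as recommended above. A credential added to a sensitive target by an expected pipeline identity is suppressed, while the same change from any other account is surfaced.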
Detection 2: a classic agent signing in from a first-seen IP

A service principal that has only ever authenticated from your Azure regions and suddenly signs in from somewhere new is a strong signal that its credential has been lifted and is being used elsewhere. Service principals have stable, boring network behavior, which makes a first-seen IP a far cleaner indicator for them than it is for roaming human users. This is the behavioral baseline Identity Protection gives you for free on modern agents, rebuilt in KQL for the classic ones it ignores. It maps to MITRE T1078.004, Valid Accounts: Cloud Accounts.

// Detection 2: classic-agent service principal signing in from a previously unseen IP
// MITRE T1078.004 - Valid Accounts: Cloud Accounts
let baseline = 14d;
let detection = 1d;
let KnownIPs = AADServicePrincipalSignInLogs
| where TimeGenerated between (ago(baseline + detection) .. ago(detection))
| where tostring(ResultType) == "0"
| summarize KnownIPSet = make_set(IPAddress) by AppId;
AADServicePrincipalSignInLogs
| where TimeGenerated > ago(detection)
| where tostring(ResultType) == "0"
| lookup kind=leftouter KnownIPs on AppId
// No baseline row (brand-new service principal) is treated as "unseen" explicitly
| where isnull(KnownIPSet) or not(set_has_element(KnownIPSet, IPAddress))
| summarize FirstSeen = min(TimeGenerated), Resources = make_set(ResourceDisplayName, 10)
    by ServicePrincipalName, AppId, IPAddress
| order by FirstSeen desc

The query builds a per-application baseline of source IPs over the previous two weeks, then flags any successful sign-in in the last day from an address outside that set. Two tuning notes. First, brand-new service principals have no baseline, so they surface on first use. That is usually worth seeing once, but you can exclude AppIds younger than the baseline window if it gets noisy. Second, if your agents egress through shifting cloud IP ranges, widen the comparison from an exact IP to the autonomous system number or a known-range allow-list; otherwise you will chase your own infrastructure.

This complements Agent ID, it does not replace it!
The endgame is not to run these rules forever. It is to shrink the population they apply to. Inventory your tenant for agents marked "Has Agent ID: No," prioritize the ones holding sensitive Graph permissions, and migrate them onto the Agent ID platform, where Identity Protection and Conditional Access take over the baselining you are doing here by hand. Microsoft has signaled a migration path from classic to modern agents. Treat these two detections as the coverage you need in the meantime, and as a permanent safety net for anything that never makes the move.

If you do one thing this week: enable the service principal sign-in log category, deploy detection 1, and pull a list of every service principal that had a credential added in the last 90 days. That list alone tends to be more interesting than people expect.

Cheers, Marcel

Designing Reliable Data Platforms: Centralized Failure Logging Framework with Azure Monitor
Introduction

Modern data platforms are no longer just about moving and transforming data. In production, what really matters is reliability and how quickly you can understand and react when something breaks. If you’re using Azure Synapse, ADF, or Microsoft Fabric, you already have built-in monitoring: you can see pipeline runs and error messages. But it doesn’t show you activity-level errors. Pipeline-level errors work well when you’re debugging a single failure, but that approach doesn’t scale. Once you have dozens of pipelines running across multiple environments, failures become harder to track. You find yourself jumping between pipeline runs, scanning activity outputs, and trying to piece together what actually happened. And suddenly, simple questions become difficult to answer:

Which datasets are failing most often?
Are failures concentrated in Bronze, Silver, or Gold?
Is this a one-off issue or a recurring pattern?
What changed between yesterday and today?

At that point, pipeline-level monitoring is no longer enough. You need something more structured.

P.S. The framework can be implemented across both Synapse and Microsoft Fabric environments with minimal changes.

Why we need a custom logging framework

The core issue is that pipeline failures are treated as runtime events, not as data. They live inside pipeline output and are tied to a specific run. This makes them hard to query across time, aggregate across pipelines, correlate across environments, trace to the specific activities that failed inside a pipeline, or integrate into alerting and dashboards in a consistent way. Pipeline failures are visible, but activity failures are not, and neither is operationalized. What’s missing is a central place where all failures, including activity-level logs, are captured in a consistent, structured format, regardless of which pipeline or dataset produced them.
That’s where a custom logging framework comes in. Instead of relying only on built-in monitoring, we introduce a layer that captures failures as structured events, standardizes the payload across pipelines, and sends it to Log Analytics, where it can be queried using KQL. This shifts the model from checking a pipeline when it fails to treating failures as a dataset that can be analyzed, monitored, and improved over time. Once you make that shift, you can build alerts based on patterns instead of reacting to single failures, track reliability across datasets or domains, and identify recurring issues instead of dealing with incidents one by one. It also changes who can use the data: visibility is no longer limited to engineers digging into pipeline runs, and becomes accessible at the platform level for leads and stakeholders. This framework doesn’t replace Synapse monitoring. It complements it by adding a proper observability layer on top.

Architecture

When a pipeline fails in Synapse, the failure is intercepted through a dedicated failure path. At this stage, we don’t just log the error as-is. We pass it through a custom logging framework that transforms the failure into a structured payload. This payload includes key context such as pipeline name, activity, environment, dataset, layer (Bronze/Silver/Gold), error details, and correlation identifiers. The important part here is consistency: every pipeline emits the same schema, regardless of its logic.

Once the payload is constructed, it is sent to Azure Monitor using the Logs Ingestion API. This API acts as the entry point into the monitoring system and decouples the pipelines from the underlying storage implementation. A Data Collection Rule (DCR) sits behind the ingestion layer and defines how incoming data is handled. It acts as a contract for the payload schema and optionally applies transformations before the data is persisted.
Finally, the logs are stored in a custom Log Analytics table, where they become fully queryable using KQL. At this point, failures are no longer tied to a single pipeline run. They are part of a centralized dataset that can be analyzed across time, environments, and domains.

Setting up Log Analytics

Before integrating the logging framework with Synapse, we first need to set up the destination for our logs. This includes creating a Log Analytics workspace, defining a custom table, and configuring the ingestion path using a Data Collection Rule (DCR). The goal is to create a pipeline where structured failure events can be received, validated, and stored in a consistent format.

P.S. All steps mentioned in this blog can be automated with ARM templates.

1. Create a Log Analytics workspace

Start by creating a Log Analytics workspace. This will act as the central store for all failure logs across your data platform. In the Azure Portal:

Navigate to Azure Monitor → Log Analytics workspaces
Create a new workspace in your target subscription and region
Choose a meaningful name (for example: log-analytics-data-domain)

This workspace becomes the single place where all pipeline failures will be collected and queried.

2. Create a custom table for pipeline failures

Instead of relying on generic tables, we define a dedicated custom table to store pipeline failure events. From the Log Analytics workspace:

Go to Tables → Create → Custom table (DCR-based)
Define a table name such as DataDomain_SynapsePipelineErrors_CL (the name has to end with the _CL suffix)

At this stage, you’ll define the schema that represents your logging payload. Typical fields include:

TimeGenerated
PipelineName
PipelineRunId
ActivityName
ActivityType
Status
ErrorCode
ErrorMessage
Severity
Environment
Layer
DatasetName
PartitionDate
WorkspaceName
CorrelationId

The key here is consistency. This schema will be reused across all pipelines, so take the time to define it properly.

3. Create a Data Collection Rule (DCR)

The Data Collection Rule defines how incoming data is ingested into Log Analytics. It acts as both a schema contract and a routing mechanism. In the Azure Portal:

Go to Azure Monitor → Data Collection Rules
Create a new DCR and associate it with your Log Analytics workspace

Within the DCR:

Define a custom stream (for example: DataDomain_SynapsePipelineErrors_CL)
Map this stream to your custom table
Optionally define transformations using KQL (for example, renaming fields or enforcing types)

This step is critical because it decouples your pipelines from the storage layer. If the schema evolves later, you can adjust it here without changing pipeline logic.

4. Configure the Logs Ingestion endpoint

Once the DCR is created, Azure generates an ingestion endpoint that will be used by your pipelines. The endpoint follows this pattern:

https://<dce>.<region>.ingest.monitor.azure.com/dataCollectionRules/<dcrId>/streams/<streamName>?api-version=2023-01-01

This endpoint is what your Synapse pipeline will call using a Web Activity. At this point, you should also:

Enable Managed Identity authentication
Grant the Synapse workspace permission to send data to the DCR

This ensures secure ingestion without using secrets.

5. RBAC for Managed Identity

The Managed Identity used by Synapse or Microsoft Fabric must have the following Azure RBAC role:

Monitoring Metrics Publisher

This role allows the identity to send data through the Azure Monitor Logs Ingestion API. The role should be assigned on the Data Collection Rule (DCR) resource. In the Azure Portal:

Data Collection Rule (DCR) → Access Control (IAM) → Add Role Assignment → Monitoring Metrics Publisher → Select Synapse/Fabric Managed Identity

Without this role assignment, requests to the Logs Ingestion API will fail with authorization errors such as HTTP 403.

6. Validate the setup

Before integrating with pipelines, it’s a good idea to validate that ingestion works.
You can send a test payload (via Postman or a simple script) and then query your table in Log Analytics:

DataDomain_SynapsePipelineErrors_CL
| take 10

If everything is configured correctly, you should see your test records appear.

Integrating with Synapse pipelines

Now that the ingestion layer is ready, the next step is connecting Synapse pipelines so that failures are logged automatically instead of through manual test payloads. The idea is simple: whenever a pipeline activity fails, we capture the failure details, transform them into a structured payload, and send them directly to the Logs Ingestion API. This turns pipeline failures into centralized operational events.

1. Add a failure handling path

Inside your Synapse pipeline, add an On Failure dependency from the activities you want to monitor. Typically, this includes critical activities such as:

Copy Activities
Notebook executions
Stored Procedures
Data Flows
Web Activities

Instead of allowing the pipeline to fail silently, the failure path redirects execution into a dedicated logging step. In most production environments, this is implemented as a reusable child pipeline, for example one named Customized Logs API. This keeps logging logic centralized and avoids duplicating the same implementation across dozens of pipelines.

2. Pass failure metadata as parameters

The logging pipeline should receive operational context from the parent pipeline. Typical parameters include:

Pipeline name
Pipeline run ID
Activity name
Activity type
Error code
Error message
Environment
Layer (Bronze/Silver/Gold)
Dataset name
Severity
Correlation ID

This metadata becomes the foundation of the structured logging payload. The more operational context you capture here, the easier troubleshooting becomes later.

3. Construct the logging payload

Inside the logging pipeline (Customized Logs API), use a dynamic content expression to construct a JSON payload matching the Log Analytics schema.
Example payload:

concat(
  '[{"TimeGenerated":"', utcNow(),
  '","PipelineName":"POC_Test"',
  ',"PipelineRunId":"', pipeline().RunId,
  '","PipelineStatus":"Failed"',
  ',"ActivityName":"TestActivity"',
  ',"ActivityType":"Web"',
  ',"ActivityStatus":"Failed"',
  ',"ErrorCode":"TEST"',
  ',"ErrorMessage":"POC test"',
  ',"Severity":"Warning"',
  ',"Environment":"Test"',
  ',"Layer":"Bronze"',
  ',"ExecutionStage":"POC"',
  ',"DatasetName":"TestDataset"',
  ',"PartitionDate":"', utcNow(),
  '","WorkspaceName":"', pipeline().DataFactory,
  '","TriggerName":"Manual"',
  ',"TriggerTimeUtc":"', utcNow(),
  '","DurationMs":1000',
  ',"RetryCount":0',
  ',"Compute":"Synapse"',
  ',"CorrelationId":"', pipeline().RunId,
  '","Payload":{"source":"test","target":"loganalytics"}}]'
)

The values here are hard-coded for a proof of concept. In the reusable logging pipeline they would come from the parameters passed in step 2. The important part is schema consistency. Every pipeline should emit the same payload structure regardless of which activity failed. This makes downstream querying and dashboarding significantly easier.

4. Send logs using a Web Activity

After constructing the payload, use a Web Activity to send the data to the Logs Ingestion API endpoint configured earlier. Typical configuration:

URL: https://<data-collection-endpoint>.<region>.ingest.monitor.azure.com/dataCollectionRules/<dcr-id>/streams/<stream-name>?api-version=2023-01-01
Method: POST
Authentication: Managed Identity
Resource: https://monitor.azure.com
Headers: { "Content-Type": "application/json" }
Body: Dynamic JSON payload generated in the previous step

I highly recommend using Managed Identity. It avoids storing secrets or credentials inside Synapse pipelines and keeps authentication fully managed by Azure.

5. Validate end-to-end ingestion

Once the pipeline is connected, trigger a controlled failure and verify that the event appears in Log Analytics. Run:

DataDomain_SynapsePipelineErrors_CL
| sort by TimeGenerated desc
| take 20

You should now see real pipeline failures arriving automatically from Synapse. At this point, the framework becomes fully operational.
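For the manual validation step (a test payload sent from a simple script), a Python sketch could look like the following. It is an illustration under stated assumptions, not the author's script. The function names are hypothetical, only a subset of the schema fields is filled in, and the bearer token is assumed to be obtained separately for the https://monitor.azure.com resource. Building the body with a JSON serializer rather than string concatenation also means error messages containing quotes are escaped correctly.

```python
import json
import urllib.request
from datetime import datetime, timezone


def build_failure_record(pipeline_name, run_id, activity_name, error_code,
                         error_message, layer, dataset_name,
                         environment="Test", severity="Error"):
    """Return one failure event using a subset of the schema fields listed earlier."""
    return {
        "TimeGenerated": datetime.now(timezone.utc).isoformat(),
        "PipelineName": pipeline_name,
        "PipelineRunId": run_id,
        "ActivityName": activity_name,
        "ErrorCode": error_code,
        "ErrorMessage": error_message,
        "Severity": severity,
        "Environment": environment,
        "Layer": layer,
        "DatasetName": dataset_name,
        "CorrelationId": run_id,
    }


def ingestion_url(endpoint, dcr_id, stream_name, api_version="2023-01-01"):
    """Build the Logs Ingestion URL following the pattern shown in the setup steps."""
    return (f"{endpoint.rstrip('/')}/dataCollectionRules/{dcr_id}"
            f"/streams/{stream_name}?api-version={api_version}")


def send_records(url, bearer_token, records):
    """POST a JSON array of records and return the HTTP status code."""
    request = urllib.request.Request(
        url,
        data=json.dumps(records).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {bearer_token}",
        },
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return response.status
```

A test run would build one record, wrap it in a list, and pass it to `send_records` with the URL from `ingestion_url`. The record should then appear in the custom table queried above.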
Failures are no longer isolated runtime events buried inside activity outputs. They are centralized, queryable operational records that can be analyzed across the entire platform.

Future Steps

Now that we have a centralized logging framework in place, we can take observability one step further by building operational dashboards in Power BI or Microsoft Fabric to analyze reliability trends across the entire data platform. Instead of reacting to isolated pipeline failures, we can aggregate logs across pipelines, datasets, environments, and medallion layers to identify what is actually causing instability over time. This allows engineering teams to detect recurring error patterns, identify unstable datasets, measure platform reliability, analyze failure spikes after deployments, and understand where operational bottlenecks are concentrated.

By transforming pipeline failures into structured operational telemetry, the framework evolves beyond simple logging into a true observability platform that supports proactive reliability engineering. It helps teams move from reactive firefighting to data-driven operational improvements based on measurable reliability KPIs such as failure trends, MTTR, SLA compliance, severity distribution, and pipeline health scoring.

Links

Tutorial: Send data to Azure Monitor Logs with Logs ingestion API (Azure portal) - Azure Monitor | Microsoft Learn
Medallion Architecture Understanding with Azure Synapse Analytics Example | by Satyam Gawade | Medium

Feedback: Sally Dabbah | LinkedIn