End-to-End Observability for Azure Databricks: From Infrastructure to Internal Application Logging

Dec 07, 2025

Authors: Amudha Palani, Peter Lo, Rafia Aqil

Observability in Azure Databricks is the ability to continuously monitor and troubleshoot the health, performance, and usage of data workloads by capturing metrics, logs, and traces. In a structured observability approach, we consider two broad categories of logging: Internal Databricks Logging (within the Databricks environment) and External Databricks Logging (leveraging Azure services). Each plays a distinct role in providing insights. 

By combining internal and external observability mechanisms, organizations can achieve a comprehensive view: internal logs enable detailed analysis of Spark jobs and data quality, while external logs ensure global visibility, auditing, and integration with broader monitoring dashboards and alerting systems. The article is organized into two main sections: 

  • Infrastructure Logging for Azure Databricks (external observability)
  • Internal Databricks Logging (in-platform observability) 

Considerations 

Addressing key questions upfront ensures your observability strategy is tailored to your organization’s unique workloads, risk profile, and operational needs. By proactively evaluating what to monitor, where to store logs, and who needs access, you can avoid blind spots, streamline incident response, and align monitoring investments with business priorities.

 

  1. What types of workloads are running in Databricks?
    • Why it matters: Different workloads (e.g., batch ETL, streaming pipelines, ML training, interactive notebooks) have distinct performance profiles and failure modes.
    • Business impact: Understanding workload types helps prioritize monitoring for mission-critical processes like real-time fraud detection or daily financial reporting.
  2. What failure scenarios need to be monitored?
    • Examples: Job failures, cluster provisioning errors, quota limits, authentication issues.
    • Business impact: Early detection of failures reduces downtime, improves SLA adherence, and prevents data pipeline disruptions that could affect reporting or customer-facing analytics.
  3. Where should logs be stored and analyzed?
    • Options: Centralized Log Analytics workspace, Azure Storage for archival, Event Hub for streaming analysis.
    • Business impact: Centralized logging enables unified dashboards, cross-team visibility, and faster incident response across data engineering, operations, and compliance teams.
  4. Who needs access to logs and alerts?
    • Stakeholders: Data engineers, platform administrators, security analysts, compliance officers.
    • Business impact: Role-based access ensures that the right teams can act on insights while maintaining data governance and privacy controls.

 

Infrastructure Logging for Azure Databricks 

Approach 1: Diagnostic Settings for Azure Databricks 

Diagnostic settings in Azure Monitor allow you to capture detailed logs and metrics from your Azure Databricks workspace, supporting operational monitoring, troubleshooting, and compliance. By configuring diagnostic settings at the workspace level, administrators can route Databricks logs—including cluster events, job statuses, and audit logs—to destinations such as Log Analytics, Azure Storage, or Event Hub. This enables unified analysis, alerting, and long-term retention of critical operational data.

Configuration Overview 

    • Enable Diagnostic Settings on the Databricks workspace to route logs to a Log Analytics workspace (a programmatic sketch follows this list).
    • These logs can also be combined with the other logs described below for full Azure Databricks observability.
    • Here is a guide to the Azure Databricks diagnostic settings log reference: Configure diagnostic log delivery
    • Implement a tagging strategy so organizations can gain granular visibility into resource consumption and align spending with business priorities.

      • Default tags: Automatically applied by Databricks to cloud-deployed resources.
      • Custom tags: User-defined tags that you can add to compute resources and serverless workloads.
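
A minimal sketch of enabling diagnostic settings programmatically, assuming the azure-mgmt-monitor SDK and placeholder resource IDs; the categories shown are illustrative, and per the best practices below you should enable only the ones you need:

```python
# Hypothetical sketch: enable diagnostic settings on a Databricks workspace
# with azure-mgmt-monitor. Resource IDs and category names are illustrative
# assumptions; adjust to your subscription and workspace.
from azure.identity import DefaultAzureCredential
from azure.mgmt.monitor import MonitorManagementClient

subscription_id = "<subscription-id>"  # assumption: your subscription
workspace_resource_id = (              # assumption: your Databricks workspace
    "/subscriptions/<subscription-id>/resourceGroups/<rg>"
    "/providers/Microsoft.Databricks/workspaces/<workspace-name>"
)
log_analytics_id = (                   # assumption: your Log Analytics workspace
    "/subscriptions/<subscription-id>/resourceGroups/<rg>"
    "/providers/Microsoft.OperationalInsights/workspaces/<la-workspace>"
)

client = MonitorManagementClient(DefaultAzureCredential(), subscription_id)

client.diagnostic_settings.create_or_update(
    resource_uri=workspace_resource_id,
    name="databricks-observability",
    parameters={
        "workspace_id": log_analytics_id,
        "logs": [
            # Enable only the categories you actually need.
            {"category": "jobs", "enabled": True},
            {"category": "clusters", "enabled": True},
            {"category": "accounts", "enabled": True},
        ],
    },
)
```

The same setting can of course be created once in the portal; scripting it is mainly useful when you manage many workspaces and want identical routing everywhere.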

Use Cases 

    • Operational Monitoring: Detect job or resource bottlenecks.
    • Security & Compliance: Audit user actions and enforce governance policies.
    • Incident Response: Correlate Databricks logs with infrastructure events for faster troubleshooting. 

Best Practices 

    • Enable only relevant log categories to optimize cost and performance.
    • Use role-based access control (RBAC) to secure access to logs.

 

Approach 2: Azure Databricks Compute Log Delivery

Compute log delivery in Azure Databricks enables you to automatically collect and archive logs from Spark driver nodes, worker nodes, and cluster events for both all-purpose and job compute resources. When you create a cluster, you can specify a log delivery location—such as DBFS, Azure Storage, or a Unity Catalog volume—where logs are delivered every five minutes and archived hourly. All logs generated up until the compute resource is terminated are guaranteed to be delivered, supporting troubleshooting, auditing, and compliance.

To configure the log delivery location:

    1. On the compute page, click the Advanced toggle.
    2. Click the Logging tab.
    3. Select a destination type: DBFS or Volumes (Preview).
    4. Enter the Log path.

Databricks stores the logs in a subfolder of your chosen log path named after the compute's cluster_id. The same configuration can be set programmatically, as shown below.
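
A minimal sketch of setting a DBFS log delivery location at cluster creation time, assuming the databricks-sdk; the cluster name, runtime, node type, and log path are assumptions:

```python
# Hypothetical sketch: configure compute log delivery via the databricks-sdk.
from databricks.sdk import WorkspaceClient
from databricks.sdk.service.compute import ClusterLogConf, DbfsStorageInfo

w = WorkspaceClient()  # reads auth from the environment / .databrickscfg

cluster = w.clusters.create_and_wait(
    cluster_name="observability-demo",    # assumption
    spark_version="15.4.x-scala2.12",     # assumption: pick a supported runtime
    node_type_id="Standard_DS3_v2",       # assumption
    num_workers=2,
    # Logs are delivered every five minutes to <destination>/<cluster_id>/.
    cluster_log_conf=ClusterLogConf(
        dbfs=DbfsStorageInfo(destination="dbfs:/cluster-logs")
    ),
)
print(f"Logs will land under dbfs:/cluster-logs/{cluster.cluster_id}/")
```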

 

Approach 3: Azure Activity Logs

Whenever you create, update, or delete Databricks resources (such as provisioning a new workspace, scaling a cluster, or modifying network settings), these actions are captured in the Activity Log. This enables teams to track who made changes, when, and what impact those changes had on the environment. Each Activity Log event carries a category, described in the Azure Activity Log event schema. For Databricks, this is especially valuable for:

    • Auditing resource deployments and configuration changes
    • Investigating failed provisioning or quota errors
    • Monitoring compliance with organizational policies
    • Responding to incidents or unauthorized actions
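
For teams that script their audits, the Activity Log can also be queried programmatically; here is a minimal sketch, assuming the azure-mgmt-monitor SDK and a placeholder subscription ID:

```python
# Hypothetical sketch: list recent Databricks-related Activity Log events.
# The subscription ID is a placeholder; the filter grammar follows the
# Activity Log REST API.
from datetime import datetime, timedelta, timezone

from azure.identity import DefaultAzureCredential
from azure.mgmt.monitor import MonitorManagementClient

client = MonitorManagementClient(DefaultAzureCredential(), "<subscription-id>")

now = datetime.now(timezone.utc)
since = now - timedelta(days=1)
fmt = "%Y-%m-%dT%H:%M:%SZ"
flt = (
    f"eventTimestamp ge '{since.strftime(fmt)}' "
    f"and eventTimestamp le '{now.strftime(fmt)}' "
    "and resourceProvider eq 'Microsoft.Databricks'"
)

for event in client.activity_logs.list(filter=flt):
    # Who changed what, and when: raw material for audits and incident response.
    print(event.event_timestamp, event.operation_name.value, event.caller)
```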

Use Cases 

    • Auditing infrastructure-level changes outside the Databricks workspace.
    • Monitoring provisioning delays or resource availability. 

Best Practices 

    • Use Activity Logs in conjunction with other logs for full-stack visibility.
    • Set up alerts for critical infrastructure events.
    • Review logs regularly to ensure compliance and operational health. 

 

Approach 4: Azure Monitor VM Insights

Azure Databricks cluster nodes run on Azure virtual machines (VMs), and their infrastructure-level performance can be monitored using Azure Monitor VM Insights (formerly OMS). This approach provides visibility into resource utilization across individual cluster VMs, helping identify bottlenecks that may affect Spark job performance or overall workload efficiency. 

Configuration Overview: To enable VM performance monitoring, turn on VM Insights for the virtual machines backing your Databricks cluster nodes from Azure Monitor; this installs the monitoring agent that collects guest-level performance data.

Monitored Metrics: Once enabled, VM Insights collects:

    • CPU usage
    • Memory consumption
    • Disk I/O
    • Network throughput
    • Process-level statistics

These metrics help assess whether Spark workloads are constrained by infrastructure limits, such as insufficient memory or high disk latency.
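
Once VM Insights data lands in Log Analytics, it can be queried from code; a minimal sketch, assuming the azure-monitor-query package and that the InsightsMetrics table is being populated:

```python
# Hypothetical sketch: query VM Insights performance counters from Log
# Analytics. The workspace ID is an assumption.
from datetime import timedelta

from azure.identity import DefaultAzureCredential
from azure.monitor.query import LogsQueryClient

client = LogsQueryClient(DefaultAzureCredential())

# Average CPU utilization per cluster VM, in 5-minute bins, over the last hour.
kql = """
InsightsMetrics
| where Namespace == "Processor" and Name == "UtilizationPercentage"
| summarize avg(Val) by Computer, bin(TimeGenerated, 5m)
"""

response = client.query_workspace(
    workspace_id="<log-analytics-workspace-id>",  # assumption
    query=kql,
    timespan=timedelta(hours=1),
)
for table in response.tables:
    for row in table.rows:
        print(row)
```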

Considerations 

    • This is a standard Azure VM monitoring technique and is not specific to Databricks.
    • Use role-based access control (RBAC) to secure access to performance data. 

 

Approach 5: Virtual Network Flow Logs  

For Azure Databricks workspaces deployed in a custom Azure Virtual Network (VNet-injected mode), enabling Virtual Network Flow Logs provides deep visibility into IP traffic flowing through the virtual network. These logs help you monitor and optimize network resources, and they give large enterprises a foundation for intrusion detection. Review common use cases here: Vnet Flow Logs Common Usecases, and how logging works here: Key properties of virtual network flow logs. Follow these steps to set up VNet flow logs: Create a flow log

Configuration Overview 

    • Virtual Network Flow Logs are a feature of Azure Network Watcher.
    • Optionally, logs can be analyzed using Traffic Analytics for deeper insights. 
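
Beyond the portal steps linked above, a flow log can also be created programmatically; a minimal sketch, assuming the azure-mgmt-network SDK and placeholder resource IDs:

```python
# Hypothetical sketch: create a virtual network flow log through Network
# Watcher. Resource names and IDs are assumptions; Network Watcher must
# exist in the workspace's region.
from azure.identity import DefaultAzureCredential
from azure.mgmt.network import NetworkManagementClient

client = NetworkManagementClient(DefaultAzureCredential(), "<subscription-id>")

vnet_id = (
    "/subscriptions/<subscription-id>/resourceGroups/<rg>"
    "/providers/Microsoft.Network/virtualNetworks/<databricks-vnet>"
)
storage_id = (
    "/subscriptions/<subscription-id>/resourceGroups/<rg>"
    "/providers/Microsoft.Storage/storageAccounts/<flowlogstorage>"
)

poller = client.flow_logs.begin_create_or_update(
    resource_group_name="NetworkWatcherRG",        # assumption: default NW group
    network_watcher_name="NetworkWatcher_eastus",  # assumption: region-specific
    flow_log_name="databricks-vnet-flowlog",
    parameters={
        "location": "eastus",            # assumption: workspace region
        "target_resource_id": vnet_id,   # the VNet-injected Databricks VNet
        "storage_id": storage_id,        # destination storage account
        "enabled": True,
    },
)
print(poller.result().provisioning_state)
```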

These logs help identify: 

    • Unexpected or unauthorized traffic
    • Bandwidth usage patterns
    • Effectiveness of NSG rules and network segmentation 

Considerations 

    • Virtual network flow logging is only available for VNet-injected deployment modes.
    • Ensure Network Watcher is enabled in the region where the Databricks workspace is deployed.
    • Use Traffic Analytics to visualize trends and detect anomalies in network flows. 

 

Approach 6: Spark Monitoring Logging & Metrics 

The spark-monitoring library is a Python toolkit designed to interact with the Spark History Server REST API. Its main purpose is to help users programmatically access, analyze, and visualize Spark application metrics and job details after execution. Here’s what it offers:

    • Application Listing: Retrieve a list of all Spark applications available on the History Server, including metadata such as application ID, name, start/end time, and status.
    • Job and Stage Details: Access detailed information about jobs and stages within each application, including execution times, status, and resource usage.
    • Task Metrics: Extract metrics for individual tasks, such as duration, input/output size, and shuffle statistics, supporting performance analysis and bottleneck identification.
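
To make this concrete, here is a minimal sketch of the underlying Spark History Server REST API that such a toolkit wraps, using plain requests against an assumed History Server endpoint:

```python
# Hypothetical sketch: the /api/v1 routes are part of Spark's documented
# monitoring REST API; the host and port (18080 is Spark's default) are
# assumptions.
import requests

HISTORY_SERVER = "http://<history-server-host>:18080"  # assumption

# List applications with their IDs, names, and attempt metadata.
apps = requests.get(f"{HISTORY_SERVER}/api/v1/applications").json()

for app in apps[:5]:
    app_id = app["id"]
    # Per-application job details: status, submission time, stage IDs.
    jobs = requests.get(
        f"{HISTORY_SERVER}/api/v1/applications/{app_id}/jobs"
    ).json()
    failed = [j for j in jobs if j["status"] == "FAILED"]
    print(f"{app['name']} ({app_id}): {len(jobs)} jobs, {len(failed)} failed")
```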

Considerations 

    • The Spark Monitoring Library must be installed; see the Git repository here.
    • Metrics can be exported to external observability platforms for long-term retention and alerting. 

Use cases

    • Automated reporting of Spark job performance and resource usage
    • Batch analysis of completed Spark applications
    • Integration of Spark metrics into external dashboards or monitoring systems
    • Post-execution troubleshooting and optimization

 

Internal Databricks Logging 

Approach 7: Databricks System Tables (Unity Catalog) 

Databricks System Tables are a recent addition to Azure Databricks observability, offering structured, SQL-accessible insights into workspace usage, performance, and cost. These tables reside in the Unity Catalog and are organized into schemas such as system.billing, system.lakeflow, and system.compute. You can enable system tables through these steps: Enable system tables - Databricks

Overview and Capabilities 

When enabled by an administrator, system tables allow users to query operational metadata directly using SQL. A hedged example follows.
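
A minimal sketch, assuming a Databricks notebook context and the system.billing.usage schema, that surfaces the highest-DBU SKUs per workspace over the last 30 days:

```python
# Hypothetical sketch: query billing usage from a Databricks notebook, where
# `spark` and `display` are predefined. The 30-day window and grouping are
# illustrative assumptions.
top_spend = spark.sql("""
    SELECT workspace_id,
           sku_name,
           SUM(usage_quantity) AS total_dbus
    FROM system.billing.usage
    WHERE usage_date >= date_sub(current_date(), 30)
    GROUP BY workspace_id, sku_name
    ORDER BY total_dbus DESC
    LIMIT 10
""")
display(top_spend)  # Databricks notebook helper for rendering DataFrames
```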

Use Cases 

    • Cost Monitoring: Aggregate usage records to identify high-cost jobs or users.
    • Pre-Built Dashboards: Import pre-built usage dashboards into your workspaces to monitor account- and workspace-level usage: Usage dashboards and Create and monitor budgets
    • Operational Efficiency: Track job durations, cluster concurrency, and resource utilization.
    • In-Platform BI: Build dashboards in Databricks SQL to visualize usage trends without relying on external billing tools. 

Best Practices 

    • Schedule regular queries to track cost trends, job performance, and resource usage.
    • Apply role-based access control to restrict sensitive usage data.
    • Integrate system table insights into Databricks SQL dashboards for real-time visibility. 

 

Approach 8: Data Quality Monitoring 

Data Quality Monitoring is a native Azure Databricks feature designed to track data quality and machine learning model performance over time. It enables automated monitoring of Delta tables and ML inference outputs, helping teams detect anomalies, data drift, and reliability issues directly within the Databricks environment. Follow these steps to enable Data Quality Monitoring. Data Quality Monitoring supports three profile types: 

    • Time Series: Monitors time-partitioned data, computing metrics per time window.
    • Inference: Tracks prediction drift and anomalies in model request/response logs.
    • Snapshot: Performs full-table scans to compute metrics across the entire dataset.

When enabling Lakehouse Monitoring, step 5 also lets you turn on data profiling to view Data Profiling Dashboards. A hedged API sketch follows.
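
A minimal sketch of enabling a snapshot-profile monitor programmatically, assuming the databricks-sdk quality monitors API and placeholder table, directory, and schema names:

```python
# Hypothetical sketch: enable a snapshot-profile monitor on a Delta table.
from databricks.sdk import WorkspaceClient
from databricks.sdk.service.catalog import MonitorSnapshot

w = WorkspaceClient()

w.quality_monitors.create(
    table_name="main.sales.orders",                  # assumption: monitored table
    assets_dir="/Workspace/Shared/monitors/orders",  # where dashboard assets land
    output_schema_name="main.monitoring",            # schema for metric tables
    snapshot=MonitorSnapshot(),                      # full-table-scan profile type
)
```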

Use Cases 

    • Data Quality Monitoring: Track null values, column distributions, and schema changes.
    • Model Performance Monitoring: Detect concept drift, prediction anomalies, and accuracy degradation.
    • Operational Reliability: Ensure consistent data pipelines and ML inference behavior. 

 

Approach 9: Databricks SQL Dashboards and Alerts 

Databricks SQL Dashboards and Alerts provide in-platform observability for operational monitoring, enabling teams to visualize metrics and receive notifications based on SQL query results. This approach complements infrastructure-level monitoring by focusing on application-level conditions, data correctness, and workflow health. Users can build dashboards using Databricks SQL or SQL Warehouses by querying system tables (e.g., job runs, billing usage), Data Quality Monitoring metric tables, and custom operational datasets. You can create alerts through these steps: Databricks SQL alerts.

Alerting Features: Databricks SQL supports alerting on query results, allowing users to define conditions that trigger notifications via email or Slack (via webhook integration). Alerts can be configured for scenarios such as the following (an example alert query follows the list):

    • Job failure counts exceeding thresholds
    • Row count drops in critical tables
    • Cost/Workload spikes or resource usage anomalies 
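
As an illustration, a minimal alert-style query, assuming the system.lakeflow.job_run_timeline schema, that counts failed job runs over the last 24 hours:

```python
# Hypothetical sketch: the table and column names are assumptions based on
# the lakeflow system schema; adapt them to your workspace before wiring
# this query to a Databricks SQL alert.
failed_runs = spark.sql("""
    SELECT COUNT(*) AS failed_job_runs
    FROM system.lakeflow.job_run_timeline
    WHERE result_state = 'FAILED'
      AND period_end_time >= current_timestamp() - INTERVAL 24 HOURS
""")
display(failed_runs)  # alert condition: failed_job_runs > threshold
```

A query like this, saved in Databricks SQL and scheduled, becomes the data source for a threshold alert on job failures.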

Considerations 

    • Alerts are query-driven and run on a schedule; ensure queries are optimized for performance.
    • Dashboards and alerts are workspace-specific and require appropriate permissions. 

Best Practices 

    • Use system tables and Data Quality Monitoring metrics as data sources for dashboards.
    • Schedule alerts to run at appropriate intervals (e.g., hourly for job failures).
    • Combine internal alerts with external monitoring for full-stack coverage. 

 

Approach 10: Custom Tags for Workspace-Level Assets 

Custom tags allow organizations to classify and organize Databricks resources (clusters, jobs, pools, notebooks) for better governance, cost tracking, and observability. Tags are key-value pairs applied at the resource level and can be propagated to Azure for billing and monitoring. 

Why Use Custom Tags? 

    • Cost Attribution: Assign tags like Environment=Prod, Project=HealthcareAnalytics to track costs in Azure Cost Management.
    • Governance: Enforce policies based on tags (e.g., restrict high-cost clusters to Environment=Dev).
    • Observability: Filter logs and metrics by tags for dashboards and alerts. 

Taggable Assets 

    • Clusters: Apply tags during cluster creation via the Databricks UI or REST API (see the sketch after this list).
    • Jobs: Include tags in job configurations for workload-level tracking.
    • Instance Pools: Tag pools to manage shared compute resources.
    • Notebooks & Workflows: Use tags in metadata for classification and reporting. 

Best Practices 

    • Define a standard tag taxonomy (e.g., Environment, Owner, CostCenter, Compliance).
    • Validate tags regularly to ensure consistency across workspaces.
    • Use tags in Log Analytics queries for cost and performance dashboards. 

 
