data & ai

11 Topics

Azure AI Foundry vs. Azure Databricks – A Unified Approach to Enterprise Intelligence
Key Insights into Azure AI Foundry and Azure Databricks Complementary Powerhouses: Azure AI Foundry is purpose-built for generative AI application and agent development, focusing on model orchestration and rapid prototyping, while Azure Databricks excels in large-scale data engineering, analytics, and traditional machine learning, forming the data intelligence backbone. Seamless Integration for End-to-End AI: A critical native connector allows AI agents developed in Foundry to access real-time, governed data from Databricks, enabling contextual and data-grounded AI solutions. This integration facilitates a comprehensive AI lifecycle from data preparation to intelligent application deployment. Specialized Roles for Optimal Performance: Enterprises leverage Databricks for its robust data processing, lakehouse architecture, and ML model training capabilities, and then utilize AI Foundry for deploying sophisticated generative AI applications, agents, and managing their lifecycle, ensuring responsible AI practices and scalability. In the rapidly evolving landscape of artificial intelligence, organizations seek robust platforms that can not only handle vast amounts of data but also enable the creation and deployment of intelligent applications. Microsoft Azure offers two powerful, yet distinct, services in this domain: Azure AI Foundry and Azure Databricks. While both contribute to an organization's AI capabilities, they serve different primary functions and are designed to complement each other in building comprehensive, enterprise-grade AI solutions. Decoding the Core Purpose: Foundry for Generative AI, Databricks for Data Intelligence At its heart, the distinction between Azure AI Foundry and Azure Databricks lies in their core objectives and the types of workloads they are optimized for. Understanding these fundamental differences is crucial for strategic deployment and maximizing their combined potential. Azure AI Foundry: The Epicenter for Generative AI and Agents Azure AI Foundry emerges as Microsoft's unified platform specifically engineered for the development, deployment, and management of generative AI applications and AI agents. It represents a consolidation of capabilities from what were formerly Azure AI Studio and Azure OpenAI Studio. Its primary focus is on accelerating the entire lifecycle of generative AI, from initial prototyping to large-scale production deployments. Key Characteristics of Azure AI Foundry: Generative AI Focus: Foundry streamlines the development of large language models (LLMs) and customized generative AI applications, including chatbots and conversational AI. It emphasizes prompt engineering, Retrieval-Augmented Generation (RAG), and agent orchestration. Extensive Model Catalog: It provides access to a vast catalog of over 11,000 foundation models from various publishers, including OpenAI, Meta (Llama 4), Mistral, and others. These models can be deployed via managed compute or serverless API deployments, offering flexibility and choice. Agentic Development: A significant strength of Foundry is its support for building sophisticated AI agents. This includes tools for grounding agents with knowledge, tool calling, comprehensive evaluations, tracing, monitoring, and guardrails to ensure responsible AI practices. Foundry Local further extends this by allowing offline and on-device development. Unified Development Environment: It offers a single management grouping for agents, models, and tools, promoting efficient development and consistent governance across AI projects. Enterprise Readiness: Built-in capabilities such as Role-Based Access Control (RBAC), observability, content safety, and project isolation ensure that AI applications are secure, compliant, and scalable for enterprise use. Figure 1: Conceptual Architecture of Azure AI Foundry illustrating its various components for AI development and deployment. Azure Databricks: The Powerhouse for Data Engineering, Analytics, and Machine Learning Azure Databricks, on the other hand, is an Apache Spark-based data intelligence platform optimized for large-scale data engineering, analytics, and traditional machine learning workloads. It acts as a collaborative workspace for data scientists, data engineers, and ML engineers to process, analyze, and transform massive datasets, and to build and deploy diverse ML models. Key Characteristics of Azure Databricks: Unified Data Analytics Platform: Central to Databricks is its lakehouse architecture, built on Delta Lake, which unifies data warehousing and data lakes. This provides a single platform for data engineering, SQL analytics, and machine learning. Big Data Processing: Excelling in distributed computing, Databricks is ideal for processing large datasets, performing ETL (Extract, Transform, Load) operations, and real-time analytics at scale. Comprehensive ML and AI Workflows: It offers a specialized environment for the full ML lifecycle, including data preparation, feature engineering, model training (both classic and deep learning), and model serving. Tools like MLflow are integrated for tracking, evaluating, and monitoring ML models. Data Intelligence Features: Databricks includes AI-assistive features such as Databricks Assistant and Databricks AI/BI Genie, which enable users to interact with their data using natural language queries to derive insights. Unified Governance with Unity Catalog: Unity Catalog provides a centralized governance solution for all data and AI assets within the lakehouse, ensuring data security, lineage tracking, and access control. Figure 2: The Databricks Data Intelligence Platform with its unified approach to data, analytics, and AI. The Symbiotic Relationship: Integration and Complementary Use Cases While distinct in their primary functions, Azure AI Foundry and Azure Databricks are explicitly designed to work together, forming a powerful, integrated ecosystem for end-to-end AI development and deployment. This synergy is key to building advanced, data-driven AI solutions in the enterprise. Seamless Integration for Enhanced AI Capabilities The integration between the two platforms is a cornerstone of Microsoft's AI strategy, enabling AI agents and generative applications to be grounded in high-quality, governed enterprise data. Key Integration Points: Native Databricks Connector in AI Foundry: A significant development in 2025 is the public preview of a native connector that allows AI agents built in Azure AI Foundry to directly query real-time, governed data from Azure Databricks. This means Foundry agents can leverage Databricks AI/BI Genie to surface data insights and even trigger Databricks Jobs, providing highly contextual and domain-aware responses. Data Grounding for AI Agents: This integration enables AI agents to access structured and unstructured data processed and stored in Databricks, providing the necessary context and knowledge base for more accurate and relevant generative AI outputs. All interactions are auditable within Databricks, maintaining governance and security. Model Crossover and Availability: Foundation models, such as the Llama 4 family, are made available across both platforms. Databricks DBRX models can also appear in the Foundry model catalog, allowing flexibility in where models are trained, deployed, and consumed. Unified Identity and Governance: Both platforms leverage Azure Entra ID for authentication and access control, and Unity Catalog provides unified governance for data and AI assets managed by Databricks, which can then be respected by Foundry agents. Here's a breakdown of how a typical flow might look: Mindmap 1: Illustrates the complementary roles and integration points between Azure Databricks and Azure AI Foundry within an end-to-end AI solution. When to Use Which (and When to Use Both) Choosing between Azure AI Foundry and Azure Databricks, or deciding when to combine them, depends on the specific requirements of your AI project: Choose Azure AI Foundry When You Need To: Build and deploy production-grade generative AI applications and multi-agent systems. Access, evaluate, and benchmark a wide array of foundation models from various providers. Develop AI agents with sophisticated capabilities like tool calling, RAG, and contextual understanding. Implement enterprise-grade guardrails, tracing, monitoring, and content safety for AI applications. Rapidly prototype and iterate on generative AI solutions, including chatbots and copilots. Integrate AI agents deeply with Microsoft 365 and Copilot Studio. Choose Azure Databricks When You Need To: Perform large-scale data engineering, ETL, and data warehousing on a unified lakehouse. Build and train traditional machine learning models (supervised, unsupervised learning, deep learning) at scale. Manage and govern all data and AI assets centrally with Unity Catalog, ensuring data quality and lineage. Conduct complex data analytics, business intelligence (BI), and real-time data processing. Leverage AI-assistive tools like Databricks AI/BI Genie for natural language interaction with data. Require high-performance compute and auto-scaling for data-intensive workloads. Use Both for Comprehensive AI Solutions: The most powerful approach for many enterprises is to leverage both platforms. Azure Databricks can serve as the robust data backbone, handling data ingestion, processing, governance, and traditional ML model training. Azure AI Foundry then sits atop this foundation, consuming the prepared and governed data to build, deploy, and manage intelligent generative AI agents and applications. This allows for: Domain-Aware AI: Foundry agents are grounded in enterprise-specific data from Databricks, leading to more accurate, relevant, and trustworthy AI responses. End-to-End AI Lifecycle: Databricks manages the "data intelligence" part, and Foundry handles the "generative AI application" part, covering the entire spectrum from raw data to intelligent user experience. Optimized Resource Utilization: Each platform focuses on what it does best, leading to more efficient resource allocation and specialized toolsets for different stages of the AI journey. Comparative Analysis: Features and Capabilities To further illustrate their distinct yet complementary nature, let's examine a detailed comparison of their features, capabilities, and typical user bases. Radar Chart 1: This chart visually compares Azure AI Foundry and Azure Databricks across several key dimensions, illustrating their specialized strengths. Azure AI Foundry excels in generative AI and agent orchestration, while Azure Databricks dominates in data engineering, unified data governance, and traditional ML workflows. A Detailed Feature Comparison Feature Category Azure AI Foundry Azure Databricks Primary Focus Generative AI application & agent development, model orchestration Large-scale data engineering, analytics, traditional ML, and AI workflows Data Handling Connects to diverse data sources (e.g., Databricks, Azure AI Search) for grounding AI agents. Not a primary data storage/processing platform. Native data lakehouse architecture (Delta Lake), optimized for big data processing, storage, and real-time analytics. AI/ML Capabilities Foundation models (LLMs), prompt engineering, RAG, agent orchestration, model evaluation, content safety, responsible AI tooling. Traditional ML (supervised/unsupervised), deep learning, feature engineering, MLflow for lifecycle management, Databricks AI/BI Genie. Development Style Low-code agent building, prompt flows, unified SDK/API, templates. Code-first (Python, SQL, Scala, R), notebooks, IDE integrations. Model Access & Deployment Extensive model catalog (11,000+ models), serverless API, managed compute deployments, model benchmarking. Training and serving custom ML models, including deep learning. Models available for deployment through MLflow. Governance & Security Azure-based security & compliance, RBAC, project isolation, content safety guardrails, tracing, evaluations. Unity Catalog for unified data & AI governance, lineage tracking, access control, Entra ID integration. Key Users AI developers, business analysts, citizen developers, AI app builders. Data scientists, data engineers, ML engineers, data analysts. Integration Points Native connector to Databricks AI/BI Genie, Azure AI Search, Microsoft 365, Copilot Studio, Power Platform. Microsoft Fabric, Power BI, Azure AI Foundry, Azure Purview, Azure Monitor, Azure Key Vault. Table 1: A comparative overview of the distinct features and functionalities of Azure AI Foundry and Azure Databricks Concluding Thoughts In essence, Azure AI Foundry and Azure Databricks are not competing platforms but rather essential components of a unified, comprehensive AI strategy within the Azure ecosystem. Azure Databricks provides the robust, scalable foundation for all data engineering, analytics, and traditional machine learning workloads, acting as the "data intelligence platform." Azure AI Foundry then leverages this foundation to specialize in the rapid development, deployment, and operationalization of generative AI applications and intelligent agents. Together, they enable enterprises to unlock the full potential of AI, transforming raw data into powerful, domain-aware, and governed intelligent solutions. Frequently Asked Questions (FAQ) What is the main difference between Azure AI Foundry and Azure Databricks? Azure AI Foundry is specialized for building, deploying, and managing generative AI applications and AI agents, focusing on model orchestration and prompt engineering. Azure Databricks is a data intelligence platform for large-scale data engineering, analytics, and traditional machine learning, built on a Lakehouse architecture. Can Azure AI Foundry and Azure Databricks be used together? Yes, they are designed to work synergistically. Azure AI Foundry can leverage a native connector to access real-time, governed data from Azure Databricks, allowing AI agents to be grounded in enterprise data for more accurate and contextual responses. Which platform should I choose for training large machine learning models? For training large-scale, traditional machine learning, and deep learning models, Azure Databricks is generally the preferred choice due to its robust capabilities for data processing, feature engineering, and ML lifecycle management (MLflow). Azure AI Foundry focuses more on the deployment and orchestration of pre-trained foundation models and generative AI applications. Does Azure AI Foundry replace Azure Machine Learning or Databricks? No, Azure AI Foundry complements these services. It provides a specialized environment for generative AI and agent development, often integrating with data and models managed by Azure Databricks or Azure Machine Learning for comprehensive AI solutions. How do these platforms handle data governance? Azure Databricks utilizes Unity Catalog for unified data and AI governance, providing centralized control over data access and lineage. Azure AI Foundry integrates with Azure-based security and compliance features, ensuring responsible AI practices and data privacy within its generative AI applications.
anishekkamal
Jan 29, 2026 Place Microsoft Mission Critical Blog
2.3KViews
1like
2Comments
Comparing Open-Source vs Closed LLMs for Enterprise Apps
LLMs are advanced AI models trained in vast data. They enable tasks such as summarization, translation, content creation, and data analysis. When companies build applications that use AI, one of the most important decisions they face is choosing the right type of Large Language Model (LLM). There are two main choices: open‑source LLMs and closed or proprietary LLMs. Understanding the differences between them helps businesses decide which option fits their needs, goals, and security requirements.
ashishmahajan
Jan 14, 2026 Place Microsoft Mission Critical Blog
697Views
1like
0Comments
Preparing for Azure PostgreSQL Certificate Authority Rotation: A Comprehensive Operational Guide
The Challenge It started with a standard notification in the Azure Portal: Tracking-ID YK3N-7RZ. A routine Certificate Authority (CA) rotation for Azure Database for PostgreSQL. As Cloud Solution Architects, we’ve seen this scenario play out many times. The moment “certificate rotation” is mentioned, a wave of unease ripples through engineering teams. Let’s be honest: for many of us—ourselves included—certificates represent the edge of our technical “comfort zone.” We know they are critical for security, but the complexity of PKI chains, trust stores, and SSL handshakes can be intimidating. There is a silent fear: “If we touch this, will we break production?” We realized we had a choice. We could treat this as an opportunity, and we could leave that comfort zone. We approached our customer with a proactive proposal: Let’s use this event to stop fearing certificates and start mastering them. Instead of just patching the immediate issue, we used this rotation as a catalyst to review and upgrade the security posture of their database connections. We wanted to move from “hoping it works” to “knowing it’s secure.” The response was overwhelmingly positive. The teams didn’t just want a quick fix; they wanted “help for self-help.” They wanted to understand the mechanics behind sslmode and build the confidence to manage trust stores proactively. This guide is the result of that journey. It is designed to help you navigate the upcoming rotation not with anxiety, but with competence—turning a mandatory maintenance window into a permanent security improvement. Two Levels of Analysis A certificate rotation affects your environment on two distinct levels, requiring different expertise and actions: Level Responsibility Key Questions Actions Platform Level Cloud/Platform Teams Which clusters, services, and namespaces are affected? How do we detect at scale? Azure Service Health monitoring, AKS scanning, infrastructure-wide assessment Application Level Application/Dev Teams What SSL mode? Which trust store? How to update connection strings? Code changes, dependency updates, trust store management This article addresses both levels - providing platform-wide detection strategies (Section 5) and application-specific remediation guidance (Platform-Specific Remediation). Business Impact: In production environments, certificate validation failures cause complete database connection outages. A single missed certificate rotation has caused hours of downtime for enterprise customers, impacting revenue and customer trust. Who’s Affected: DevOps engineers, SREs, database administrators, and platform engineers managing Azure PostgreSQL instances - especially those using: - Java applications with custom JRE cacerts - Containerized workloads with baked-in trust stores - Strict SSL modes (sslmode=verify-full, verify-ca) The Solution What we’ll cover: 🛡️ Reliability: How to prevent database connection outages through proactive certificate management 🔄 Resiliency: Automation strategies that ensure your trust stores stay current 🔒 Security: Maintaining TLS security posture while rotating certificates safely Key Takeaway: This rotation is a client trust topic, not a server change. Applications trusting root CAs (DigiCert Global Root G2, Microsoft RSA Root CA 2017) without intermediate pinning are unaffected. Risk concentrates where strict validation meets custom trust stores. 📦 Platform-Specific Implementation: Detailed remediation guides for Java, .NET, Python, Node.js, and Kubernetes are available in our GitHub Repository. Note: The GitHub Repository. contains community-contributed content provided as-is. Test all scripts in non-production environments before use. 1. Understanding Certificate Authority Rotation What Changes During CA Rotation? Azure Database for PostgreSQL uses TLS/SSL to encrypt client-server connections. The database server presents a certificate chain during the TLS handshake: Certificate Chain Structure: Figure: Certificate chain structure showing the rotation from old intermediate (red, deprecated) to new intermediate (blue, active after rotation). Client applications must trust the root certificates (green) to validate the chain. 📝 Diagram Source: The Mermaid source code for this diagram is available in certificate-chain-diagram.mmd. Why Root Trust Matters Key Principle: If your application trusts the root certificate and allows the chain to be validated dynamically, you are not affected. The risk occurs when: Custom trust stores contain only the old intermediate certificate (not the root) Certificate pinning is implemented at the intermediate level Strict validation is enabled (sslmode=verify-full in PostgreSQL connection strings) 2. Who Is Affected and Why Risk Assessment Matrix Application Type Trust Store SSL Mode Risk Level Action Required Cloud-native app (Azure SDK) OS Trust Store require 🟢 Low None - Azure SDK handles automatically Java app (default JRE) System cacerts verify-ca 🟡 Medium Verify JRE version (11.0.16+, 17.0.4+, 8u381+) Java app (custom cacerts) Custom JKS file verify-full 🔴 High Update custom trust store with new intermediate .NET app (Windows) Windows Cert Store require 🟢 Low None - automatic via Windows Update Python app (certifi) certifi bundle verify-ca 🟡 Medium Update certifi package (pip install --upgrade certifi) Node.js app (default) Built-in CAs verify-ca 🟢 Low None - Node.js 16+, 18+, 20+ auto-updated Container (Alpine) /etc/ssl/certs verify-full 🔴 High Update base image or install ca-certificates-bundle Container (custom) Baked-in certs verify-full 🔴 High Rebuild image with updated trust store How to Read This Matrix Use the above matrix to quickly assess whether your applications are affected by CA rotation. Here is an overview, how you read the matrix: Column Meaning Application Type What kind of application do you have? (e.g., Java, .NET, Container) Trust Store Where does the application store its trusted certificates? SSL Mode How strictly does the application validate the server certificate? Risk Level 🟢 Low / 🟡 Medium / 🔴 High - How likely is a connection failure? Action Required What specific action do you need to take? Risk Level Logic: Risk Level Why? 🟢 Low Automatic updates (OS/Azure SDK) or no certificate validation 🟡 Medium Manual update required but straightforward (e.g., pip install --upgrade certifi) 🔴 High Custom trust store must be manually updated - highest outage risk SSL Mode Security Posture Understanding SSL modes is critical because they determine both security posture AND rotation impact. This creates a dual consideration: SSL Mode Certificate Validation Rotation Impact Security Level Recommendation disable ❌ None ✅ No impact 🔴 INSECURE Never use in production allow ❌ None ✅ No impact 🟠 WEAK Not recommended prefer ❌ Optional ✅ Minimal 🟡 WEAK Not recommended require ❌ No (Npgsql 6.0+) ✅ No impact 🟡 WEAK Upgrade to verify-full verify-ca ✅ Chain only 🔴 Critical 🔵 MODERATE Update trust stores verify-full ✅ Chain + hostname 🔴 Critical 🟢 SECURE Recommended - Update trust stores Key Insight: Applications using weak SSL modes (everything below verify-ca) are technically unaffected by CA rotation but represent security vulnerabilities. The safest path is verify-full with current trust stores. ⚖️ The Security vs. Resilience Trade-off The Paradox: Secure applications (verify-full) have the highest rotation risk 🔴, while insecure applications (require) are unaffected but have security gaps. Teams discovering weak SSL modes during rotation preparation face a critical decision: Option Approach Rotation Impact Security Impact Recommended For 🚀 Quick Fix Keep weak SSL mode (require) ✅ No action needed ⚠️ Security debt remains Emergency situations only 🛡️ Proper Fix Upgrade to verify-full 🔴 Requires trust store updates ✅ Improved security posture All production systems Our Recommendation: Use CA rotation events as an opportunity to improve your security posture. The effort to update trust stores is a one-time investment that pays off in long-term security. Common Scenarios Scenario 1: Enterprise Java Application Problem: Custom trust store created 2+ years ago for PCI compliance Risk: High - contains only old intermediate certificates Solution: Export new intermediate from Azure, import to custom cacerts Scenario 2: Kubernetes Microservices Problem: Init container copies trust store from ConfigMap at startup Risk: High - ConfigMap never updated since initial deployment Solution: Update ConfigMap, redeploy pods with new trust store Scenario 3: Legacy .NET Application Problem: .NET Framework 4.6 on Windows Server 2016 (no Windows Update) Risk: Medium - depends on manual certificate store updates Solution: Import new intermediate to Windows Certificate Store manually 3. Trust Store Overview A trust store is the collection of root and intermediate CA certificates that your application uses to validate server certificates during TLS handshakes. Understanding where your application’s trust store is located determines how you’ll update it for CA rotations. Trust Store Locations by Platform Category Platform Trust Store Location Update Method Auto-Updated? OS Level Windows Cert:\LocalMachine\Root Windows Update ✅ Yes Debian/Ubuntu /etc/ssl/certs/ca-certificates.crt apt upgrade ca-certificates ✅ Yes (with updates) Red Hat/CentOS /etc/pki/tls/certs/ca-bundle.crt yum update ca-certificates ✅ Yes (with updates) Runtime Level Java JRE $JAVA_HOME/lib/security/cacerts Java security updates ✅ With JRE updates Python (certifi) site-packages/certifi/cacert.pem pip install --upgrade certifi ❌ Manual Node.js Bundled with runtime Node.js version upgrade ✅ With Node.js updates Custom Custom JKS Application-specific path keytool -importcert ❌ Manual Container image /etc/ssl/certs (baked-in) Rebuild container image ❌ Manual ConfigMap mount Kubernetes ConfigMap Update ConfigMap, redeploy ❌ Manual Why This Matters for CA Rotation Applications using auto-updated trust stores (OS-managed, current runtime versions) generally handle CA rotations automatically. The risk concentrates in: Custom trust stores created for compliance requirements (PCI-DSS, SOC 2) that are rarely updated Baked-in container certificates from images built months or years ago Outdated runtimes (old JRE versions, frozen Python environments) that haven’t received security updates Air-gapped environments where automatic updates are disabled When planning for CA rotation, focus your assessment efforts on applications in the “Manual” update category. 4. Platform-Specific Remediation 📦 Detailed implementation guides are available in our GitHub repository: azure-certificate-rotation-guide Quick Reference: Remediation by Platform Platform Trust Store Location Update Method Guide Java $JAVA_HOME/lib/security/cacerts Update JRE or manual keytool import java-cacerts.md .NET (Windows) Windows Certificate Store Windows Update (automatic) dotnet-windows.md Python certifi package pip install --upgrade certifi python-certifi.md Node.js Built-in CA bundle Update Node.js version nodejs.md Containers Base image /etc/ssl/certs Rebuild image or ConfigMap containers-kubernetes.md Scripts & Automation Script Purpose Download State Scan-AKS-TrustStores.ps1 Scan all pods in AKS for trust store configurations PowerShell tested validate-connection.sh Test PostgreSQL connection with SSL validation Bash not tested update-cacerts.sh Update Java cacerts with new intermediate Bash not tested 5. Proactive Detection Strategies Database-Level Discovery: Identifying Connected Clients One starting point for impact assessment is querying the PostgreSQL database itself to identify which applications are connecting. We developed a SQL query that joins pg_stat_ssl with pg_stat_activity to reveal active TLS connections, their SSL version, and cipher suites. 🔍 Get the SQL Query: Download the complete detection script from our GitHub repository: detect-clients.sql Important Limitations This query has significant constraints that you must understand before relying on it for CA rotation planning: Limitation Impact Mitigation Point-in-time snapshot Only shows currently connected clients Run query repeatedly over days/weeks to capture periodic jobs and batch processes No certificate details Cannot identify which CA certificate the client is using Requires client-side investigation (trust store analysis) Connection pooling May show pooler instead of actual application Use application_name in connection strings to identify true source Idle connections Long-running connections may be dormant Cross-reference with application activity logs Recommended approach: Use this query to create an initial inventory, then investigate each unique application_name and client_addr combination to determine their trust store configuration and SSL mode. Proactive Monitoring with Azure Monitor To detect certificate-related issues before and after CA rotation, configure Azure Monitor alerts. This enables early warning when SSL handshakes start failing. Why this matters: After CA rotation, applications with outdated trust stores will fail to connect. An alert allows you to detect affected applications quickly rather than waiting for user reports. Official Documentation: For complete guidance on creating and managing alerts, see Azure Monitor Alerts Overview and Create a Log Search Alert. Here is a short example of an Azure Monitor Alert definition as a starting point. { "alertRule": { "name": "PostgreSQL SSL Connection Failures", "severity": 2, "condition": { "query": "AzureDiagnostics | where ResourceType == 'SERVERS' and Category == 'PostgreSQLLogs' and Message contains 'SSL error' | summarize count() by bin(TimeGenerated, 5m)", "threshold": 5, "timeAggregation": "Total", "windowSize": "PT5M" } } } Alert Configuration Notes: Setting Recommended Value Rationale Severity 2 (Warning) Allows investigation without triggering critical incident response Threshold 5 failures/5min Filters noise while catching genuine issues Evaluation Period 5 minutes Balances responsiveness with alert fatigue Action Group Platform Team Ensures quick triage and coordination 6. Production Validation Pre-Rotation Validation Checklist Inventory all applications connecting to Azure PostgreSQL Identify trust store locations for each application Verify root certificate presence in trust stores Test connection with new intermediate in non-production environment Update monitoring alerts for SSL connection failures Prepare rollback plan if issues occur Schedule maintenance window (if required) Notify stakeholders of potential impact Testing Procedure We established a systematic 3-step validation process to ensure zero downtime. This approach moves from isolated testing to gradual production rollout. 🧪 Technical Validation Guide: For the complete list of psql commands, connection string examples for Windows/Linux, and automated testing scripts, please refer to our Validation Guide in the GitHub repository. Connection Testing Strategy The core of our validation strategy was testing connections with explicit sslmode settings. We used the psql command-line tool to simulate different client behaviors. Test Scenario Purpose Expected Result Encryption only (sslmode=require) Verify basic connectivity Connection succeeds even with unknown CA CA validation (sslmode=verify-ca) Verify trust store integrity Connection succeeds only if CA chain is valid Full validation (sslmode=verify-full) Verify strict security compliance Connection succeeds only if CA chain AND hostname match Pro Tip: Test with verify-full and an explicit root CA file containing the new Microsoft/DigiCert root certificates before the rotation date. This validates that your trust stores will work after the intermediate certificate changes. Step 1: Test in Non-Production Validate connections against a test server using the new intermediate certificate (Azure provides test endpoints during the rotation window). Step 2: Canary Deployment Deploy the updated trust store to a single “canary” instance or pod. Monitor: - Connection success rate - Error logs - Response times Step 3: Gradual Rollout Once the canary is stable, proceed with a phased rollout: 1. Update 10% of pods 2. Monitor for 1 hour 3. Update 50% of pods 4. Monitor for 1 hour 5. Complete rollout 7. Best Practices and Lessons Learned Certificate Management Best Practices Practice Guidance Example Trust Root CAs, Not Intermediates Configure trust stores with root CA certificates only. This provides resilience against intermediate certificate rotations. Trust Microsoft TLS RSA Root G2 and DigiCert Global Root G2 instead of specific intermediates Automate Trust Store Updates Use OS-provided trust stores when possible (automatically updated). For custom trust stores, implement CI/CD pipelines. Schedule bi-annual trust store audits Use SSL Mode Appropriately Choose SSL mode based on security requirements. verify-ca is recommended for most scenarios. See Security Posture Matrix in Section 2 Maintain Container Images Rebuild container images monthly to include latest CA certificates. Use init containers for runtime updates. Multi-stage builds with CA certificate update step Avoid Certificate Pinning Never pin intermediate certificates. If pinning is required for compliance, implement automated update processes. Pin only root CA certificates if absolutely necessary SSL Mode Decision Guide SSL Mode Security Level Resilience When to Use require Medium High Encrypted traffic without certificate validation. Use when CA rotation resilience is more important than MITM protection. verify-ca High Medium Validates certificate chain. Recommended for most production scenarios. verify-full Highest Low Strictest validation with hostname matching. Use only when compliance requires it. Organizational Communication Model Effective certificate rotation requires structured communication across multiple layers: Layer Responsibility Key Action Azure Service Health Microsoft publishes announcements to affected subscriptions Monitor Azure Service Health proactively Platform/Cloud Team Receives Azure announcements, triages criticality Follow ITSM processes, assess impact Application Teams Execute application-level changes Update trust stores, validate connections Security Teams Define certificate validation policies Set compliance requirements Ownership and Responsibility Matrix Team Responsibility Deliverable Platform/Cloud Team Monitor Azure Service Health, coordinate response Impact assessment, team notifications Application Teams Application-level changes (connection strings, trust stores) Updated configurations, validation results Security Teams Define certificate policies, compliance requirements Policy documentation, audit reports All Teams (Shared) Certificate lifecycle collaboration Playbooks, escalation paths, training Certificate Rotation Playbook Components Organizations should establish documented playbooks including: Component Recommended Frequency Purpose Trust Store Audits Bi-annual (every 6 months) Ensure certificates are current Certificate Inventory Quarterly review Know what certificates exist where Playbook Updates Annual or after incidents Keep procedures current Team Training Annual Build knowledge and confidence Field Observations: Common Configuration Patterns Pattern Observation Risk Implicit SSL Mode Teams don’t explicitly set sslmode, relying on framework defaults Unexpected behavior during CA rotation Copy-Paste Configurations Connection strings copied without understanding options Works until certificate changes expose gaps Framework-Specific Defaults Java uses JRE trust store, .NET uses Windows Certificate Store, Python depends on certifi package Some require manual updates, some are automatic Framework Trust Store Defaults Framework Default Trust Store Update Method Risk Level Java/Quarkus JRE cacerts Manual or JRE update Medium - requires awareness .NET Windows Certificate Store Windows Update Low - automatic Node.js Bundled certificates Node.js version update Low - automatic Python certifi package pip install --upgrade certifi High - manual intervention required Knowledge and Confidence Challenges Challenge Impact Mitigation Limited certificate knowledge Creates uncertainty and risk-averse behavior Proactive education, hands-on workshops Topic intimidation “Certificates” can seem complex, leading to avoidance Reality: Implementation is straightforward once understood Previous negative experiences Leadership concerns based on past incidents Document successes, share lessons learned Visibility gaps Lack of visibility into application dependencies Maintain certificate inventory, use discovery tools Monitoring Strategy (Recommended for Post-Rotation): While pre-rotation monitoring focuses on inventory, post-rotation monitoring should track: Key Metrics: - Connection failure rates (group by application, SSL error types) - SSL handshake duration (detect performance degradation) - Certificate validation errors (track which certificates fail) - Application error logs (filter for “SSL”, “certificate”, “trust”) Recommended Alerts: - Threshold: >5 SSL connection failures in 5 minutes - Anomaly detection: Connection failure rate increases >50% - Certificate expiry warnings: 30, 14, 7 days before expiration Dashboard Components: - Connection success rate by application - SSL error distribution (validation failures, expired certificates, etc.) - Certificate inventory with expiry dates - Trust store update status across infrastructure These metrics, alerts and thresholds are only starting points and need to be adjusted based on your environment and needs. Post-Rotation Validation and Telemetry Note: This article focuses on preparation for upcoming certificate rotations. Post-rotation metrics and incident data will be collected after the rotation completes and can inform future iterations of this guidance. Recommended Post-Rotation Activities: Here are some thoughts on post-rotation activities that could create more insights on the effectiveness of the preparation. Incident Tracking: After rotation completes, organizations should track: - Production incidents related to SSL/TLS connection failures - Services affected and their business criticality - Mean Time to Detection (MTTD) for certificate-related issues - Mean Time to Resolution (MTTR) from detection to fix Success Metrics to Measure Pre-Rotation Validation: - Number of services inventoried and assessed - Percentage of services requiring trust store updates - Testing coverage (dev, staging, production) Post-Rotation Outcomes: - Zero-downtime success rate (percentage of services with no impact) - Applications requiring emergency patching - Time from rotation to full validation Impact Assessment Telemetry to Collect: - Total connection attempts vs. failures (before and after rotation) - Duration of any service degradation or outages - ustomer-facing impact (user-reported issues, support tickets) - Geographic or subscription-specific patterns Continuous Improvement Post-Rotation Review: - What worked well in the preparation phase? - Which teams or applications were unprepared? - What gaps exist in monitoring or alerting? - How can communication be improved for future rotations? Documentation Updates: - Update playbooks with lessons learned - Refine monitoring queries based on observed patterns - Enhance team training materials - Share anonymized case studies across the organization 8. Engagement & Next Steps Discussion Questions We’d love to hear from the community: What’s your experience with certificate rotations? Have you encountered unexpected connection failures during CA rotation events? Which trust store update method works best for your environment? OS-managed, runtime-bundled, or custom trust stores? How do you handle certificate management in air-gapped environments? What strategies have worked for your organization? Share Your Experience If you’ve implemented proactive certificate management strategies or have lessons learned from CA rotation incidents, we encourage you to: Comment below with your experiences and tips Contribute to the GitHub repository with additional platform guides or scripts Connect with us on LinkedIn to continue the conversation Call to Action Take these steps now to prepare for the CA rotation: Assess your applications - Use the Risk Assessment Matrix (Section 2) to identify which applications use sslmode=verify-ca or verify-full with custom trust stores Import root CA certificates - Add DigiCert Global Root G2 and Microsoft RSA Root CA 2017 to your trust stores Upgrade SSL mode - Change your connection strings to at least sslmode=verify-ca (recommended: verify-full) for improved security Document your changes - Record which applications were updated, what trust stores were modified, and the validation results Automate for the future - Implement proactive certificate management so future CA rotations are handled automatically (OS-managed trust stores, CI/CD pipelines for container images, scheduled trust store audits) 9. Resources Official Documentation Azure PostgreSQL: Azure PostgreSQL SSL/TLS Concepts Azure PostgreSQL - Connect with TLS/SSL PostgreSQL & libpq: PostgreSQL libpq SSL Support - SSL mode options and environment variables PostgreSQL psql Reference - Command-line tool documentation PostgreSQL Server SSL/TLS Configuration Certificate Authorities: DigiCert Root Certificates Microsoft PKI Repository Microsoft Trusted Root Program Community Resources Let’s Encrypt Root Expiration (2021 Incident) NIST SP 800-57: Key Management Guidelines OWASP Certificate Pinning Cheat Sheet Neon Blog: PostgreSQL Connection Security Defaults Tools and Scripts PowerShell AKS Trust Store Scanner (see Platform-Specific Remediation) PostgreSQL Interactive Terminal (psql) PostgreSQL JDBC SSL Documentation Industry Context Certificate rotation challenges are not unique to Azure PostgreSQL. Similar incidents have occurred across the industry: Historical Incidents: - Let’s Encrypt Root Expiration (2021): Widespread impact when DST Root CA X3 expired, affecting older Android devices and legacy systems - DigiCert Root Transitions: Multiple cloud providers experienced customer impact during CA changes - Internal PKI Rotations: Enterprises face similar challenges when rotating internally-issued certificates Relevant Standards: - NIST SP 800-57: Key Management Guidelines (certificate lifecycle best practices) - OWASP Certificate Pinning: Guidance on balancing security and operational resilience - CIS Benchmarks: Recommendations for TLS/SSL configuration in cloud environments Authors Author Role Contact Andreas Semmelmann Cloud Solution Architect, Microsoft LinkedIn Mpho Muthige Cloud Solution Architect, Microsoft LinkedIn Disclaimers Disclaimer: The information in this blog post is provided for general informational purposes only and does not constitute legal, financial, or professional advice. While every effort has been made to ensure the accuracy of the information at the time of publication, Microsoft makes no warranties or representations as to its completeness or accuracy. Product features, availability, and timelines are subject to change without notice. For specific guidance, please consult your legal or compliance advisor. Microsoft Support Statement: This article represents field experiences and community best practices. For official Microsoft support and SLA-backed guidance: Azure Support: https://azure.microsoft.com/support/ Official Documentation: https://learn.microsoft.com/azure/ Microsoft Q&A: https://learn.microsoft.com/answers/ Production Issues: Always open official support tickets for production-impacting problems. Customer Privacy Notice: This article describes real-world scenarios from customer engagements. All customer-specific information has been anonymized. No NDAs or customer confidentiality agreements were violated in creating this content. AI-generated content disclaimer: This content was generated in whole or in part with the assistance of AI tools. AI-generated content may be incorrect or incomplete. Please review and verify before relying on it for critical decisions. See terms Community Contribution: The GitHub repository referenced in this article contains community-contributed scripts and guides. These are provided as-is for educational purposes and should be tested in non-production environments before use. Tags: #AzurePostgreSQL #CertificateRotation #TLS #SSL #TrustStores #Operations #DevOps #SRE #CloudSecurity #AzureDatabase
AndreasSemmelmann
Dec 15, 2025 Place Microsoft Mission Critical Blog
340Views
0likes
0Comments
Diagnose performance issues in Spark jobs through Spark UI.
Agenda Introduction Overview of Spark UI Navigating to Spark UI Jobs Timeline Opening Jobs Timeline Reading Event Timeline Failing Jobs or Executors Diagnosing Failing Jobs Diagnosing Failing Executors Scenario - Memory Issues Scenario - Long Running Jobs Scenario - Identifying Longest Stage Introduction Diagnosing performance issues of job using Spark UI This guide walks you through how to use the Spark UI to diagnose performance issues Overview of Spark UI Job Composition Composed of multiple stages Stages may contain more than one task Task Breakdown Tasks are broken into executors Navigating to Spark UI: Navigating to Cluster's Page Navigate to your cluster’s page: Navigating to Spark UI: Clicking Spark UI Click Spark UI: Jobs Timeline Jobs timeline The jobs timeline is a great starting point for understanding your pipeline or query. It gives you an overview of what was running, how long each step took, and if there were any failures along the way Opening Jobs Timeline Accessing the Jobs Timeline Navigate to the Spark UI Click on the Jobs tab Viewing the Event Timeline Click on Event Timeline Highlighted in red in the screenshot Example Timeline Shows driver and executor 0 being added Failing Jobs or Executors: Example of Failed Job Failed Job Example Indicated by a red status Shown in the event timeline Removed Executors Also indicated by a red status Shown in the event timeline Failing Jobs or Executors: Common Reasons for Executors Being Removed Autoscaling Expected behavior, not an error See Enable autoscaling for more details Compute configuration reference - Azure Databricks | Microsoft Learn Spot instance losses Cloud provider reclaiming your VMs Learn more about Spot instances here Executors running out of memory Diagnosing Failing Jobs: Steps to Diagnose Failing Jobs Identifying Failing Jobs Click on the failing job to access its page Reviewing Failure Details Scroll down to see the failed stage Check the failure reason Diagnosing Failing Jobs: Generic Errors You may get a generic error. Click on the link in the description to see if you can get more info: Diagnosing Failing Jobs: Memory Issues Task Failure Explanation Scroll down the page to see why each task failed Memory issue identified as the cause Scenario – Spot instance , Auto-scaling Diagnosing Failing Executors: Checking Event Log Check Event Log Identify any explanations for executor failures Spot Instances Cloud provider may reclaim spot instances Diagnosing Failing Executors: Navigating to Executors Tab Check Event Log for Executor Loss Look for messages indicating cluster resizing or spot instance loss Navigate to Spark UI Click on the Executors tab Diagnosing Failing Executors: Getting Logs from Failed Executors Here you can get the logs from the failed executors: Scenario - Memory Issues Memory Issues Common cause of problems Requires thorough investigation Quality of Code Potential source of memory issues Needs to be checked for efficiency Data Quality Can affect memory usage Must be organized correctly Spark memory issues - Azure Databricks | Microsoft Learn Identifying Longest Stage Identify the longest stage of the job Scroll to the bottom of the job’s page Locate the list of stages Order the stages by duration Identifying Longest Stage Identify the longest stage of the job Scroll to the bottom of the job’s page Locate the list of stages Order the stages by duration Stage I/O Details High-Level Data Overview Input Output Shuffle Read Shuffle Write Number of Tasks in Long Stage Identifying the number of tasks Helps in pinpointing the issue Look at the specified location to determine the number of tasks Investigating Stage Details Investigate Further if Multiple Tasks Check if the stage has more than one task Click on the link in the stage’s description Get More Info About Longest Stage Click on the link provided Gather detailed information Conclusion Potential Data Skew Issues Data skew can impact performance May cause uneven distribution of data Spelling Errors in Data Incorrect spelling can affect data processing Ensure data accuracy for optimal performance Learn More Navigate to Skew and Spill - Skew and spill - Azure Databricks | Microsoft Learn
PraveenPentareddy
Oct 13, 2025 Place Microsoft Mission Critical Blog
237Views
2likes
0Comments
Reducing SQL Connection Latency for Apps Using Azure AAD Authentication
Challenge: connection latency and token overhead Consider a cloud-native application deployed in Azure App Service or Kubernetes (AKS) that needs to query an Azure SQL Database for real-time data. The application uses Azure Active Directory (AAD) for secure authentication, but every time the application establishes a new connection to the database, it requests a new AAD token. In high-traffic environments where thousands of requests are processed per second, this repetitive token issuance introduces latency and performance degradation. This delay becomes particularly problematic for time-sensitive applications where every millisecond counts. Each token request impacts response times and creates unnecessary resource consumption. Solution: token caching and expiration management To mitigate these delays, we can optimize the authentication process by caching the AAD token and reusing it for the duration of its validity (typically 1 hour to 24 hours). Instead of requesting a new token for every database connection, the token is fetched only when the existing one is near expiration. This approach eliminates the repeated authentication overhead and ensures that the application can maintain seamless connectivity to the database without the performance hit of generating a new token for each request. In addition to reducing latency, this approach reduces the number of HTTP calls made to the Azure Active Directory service, resulting in better resource utilization and lower operational costs. Concrete performance gains: optimized SQL client connection As part of the mitigation, we provide a custom code implementation that uses SqlClient, a supported library, to optimize the connection time. The test was conducted with the S0 database, where using a single process and using connection pooling, we opened a connection, executed the SELECT 1, and closed the connection. During the testing phase with a connection pooler script running for 96 hours (without the AAD token cache), the following results were observed: 10 connections took 1 second, representing 0.866% of total connections. 1 connection took 4 seconds, representing 0.0866%. 1.144 connections took less than 1 second, representing 99.05% of total connections. All executions of SELECT 1 were completed in 0 seconds. These results demonstrate how caching AAD tokens and reusing them effectively reduced connection overhead and improved performance. None of the connections exceeded 5 seconds in duration, while with the default behavior, connections were reaching 30 seconds and more, depending on the environment complexity. Step-by-step implementation Here’s a step-by-step guide on how to implement this solution using C# and the Microsoft.Data.SqlClient package to optimize SQL database connections: Obtain and cache a token: Instead of requesting a new AAD token with every connection, we obtain a token once and cache it. This is done by leveraging Azure Managed Identity to authenticate the application, which eliminates the need to repeatedly authenticate with Azure Active Directory for every database connection. In this step, we fetch the token once and store it securely for reuse. Renew the token only when it’s near expiry We will refresh the token only when it is nearing expiration or has already expired. The application checks the token’s expiration time before attempting to use it. If the token is still valid, it continues to be reused. If it's close to expiration, a new token is fetched. Reuse a single token across multiple connections: The cached token can be used for multiple database connections during its lifetime. Rather than requesting a new token for each new connection, the application will use the same token across all connections until the token is about to expire. Code example: optimized SQL connection management Here’s an example of how you can implement token caching in a C# application using Microsoft.Data.SqlClient. using System; using System.Data.SqlClient; using System.Diagnostics; using System.Threading; using Azure.Identity; namespace SqlConnectionOptimization { public class SqlConnectionManager { private string _accessToken; private DateTimeOffset _tokenExpiration; private readonly string _connectionString = "Server=tcp:servername.database.windows.net,1433;Initial Catalog=DBName;..."; private readonly Stopwatch _stopwatch = new Stopwatch(); public SqlConnectionManager() { _accessToken = string.Empty; _tokenExpiration = DateTimeOffset.UtcNow; } public void Run() { while (true) { // Refresh token if necessary if (IsTokenExpired()) { RefreshToken(); } // Establish connection and perform operations using (var connection = CreateConnection()) { LogExecutionTime("Connected"); ExecuteQuery(connection); LogExecutionTime("Query Executed"); } // Simulate some idle time between operations Log("Waiting before next operation..."); Thread.Sleep(1000); } } private bool IsTokenExpired() { return string.IsNullOrEmpty(_accessToken) || DateTimeOffset.UtcNow.AddMinutes(5) >= _tokenExpiration; } private void RefreshToken() { _stopwatch.Start(); try { var result = FetchAccessToken(); _accessToken = result.Token; _tokenExpiration = result.Expiration; LogExecutionTime("Token Refreshed"); Log($"Token expires at: {_tokenExpiration}"); } catch (Exception ex) { Log($"Error fetching token: {ex.Message}"); } } private (string Token, DateTimeOffset Expiration) FetchAccessToken() { var managedIdentityCredential = new ManagedIdentityCredential(); var tokenRequestContext = new Azure.Core.TokenRequestContext(new[] { "https://database.windows.net/" }); var accessToken = managedIdentityCredential.GetTokenAsync(tokenRequestContext).Result; return (accessToken.Token, accessToken.ExpiresOn.UtcDateTime); } private SqlConnection CreateConnection() { var connection = new SqlConnection(_connectionString) { AccessToken = _accessToken }; int retries = 0; while (true) { try { connection.Open(); return connection; } catch (Exception ex) { retries++; if (retries > 5) { Log($"Error connecting after multiple retries: {ex.Message}"); throw; } Log($"Connection attempt failed. Retrying in {retries} seconds..."); Thread.Sleep(retries * 1000); } } } private void ExecuteQuery(SqlConnection connection) { var query = "SELECT 1"; // Simple query, replace with real logic as needed int retries = 0; while (true) { try { using (var command = new SqlCommand(query, connection)) { command.CommandTimeout = 5; // Adjust timeout for more complex queries command.ExecuteScalar(); } return; } catch (Exception ex) { retries++; if (retries > 5) { Log($"Max retries reached for query execution: {ex.Message}"); throw; } Log($"Query execution failed. Retrying in {retries} seconds..."); Thread.Sleep(retries * 1000); } } } private void Log(string message) { Console.WriteLine($"{DateTime.Now:yyyy-MM-dd HH:mm:ss.fff}: {message}"); } private void LogExecutionTime(string action) { _stopwatch.Stop(); var elapsed = _stopwatch.Elapsed; Log($"{action} - Elapsed time: {elapsed:hh\\:mm\\:ss\\.fff}"); _stopwatch.Reset(); } public static void Main(string[] args) { var manager = new SqlConnectionManager(); manager.Run(); } } } Key points in the code Token Expiration Check: The IsTokenExpired() method checks whether the token has expired by comparing it to the current time. We’ve added a 5-minute buffer for token expiration. This can be adjusted based on your needs. Managed Identity Authentication: The application uses Azure Managed Identity to authenticate and fetch the token, ensuring secure and scalable access to Azure SQL Database without requiring client secrets. Retry Logic: In the event of a connection failure or query execution failure, the system retries a set number of times with exponential backoff, making it resilient to transient network or authentication issues. Conclusion By implementing a token caching and expiration management strategy, applications can dramatically improve the performance and scalability of their database interactions, especially in environments with high request volumes. By leveraging Azure Managed Identity for secure, reusable tokens, you can reduce authentication latency and improve the overall efficiency of your SQL database connections. This approach can also be adapted to any service using Azure SQL Database and Azure Active Directory for authentication. Next steps Benchmarking: Test the implementation in your environment to quantify the performance gains. Error Handling: Extend the retry logic and error handling to better handle transient failures, especially in production environments. Resources: Introducing Configurable Retry Logic in Microsoft.Data.SqlClient v3.0.0-Preview1 Configurable retry logic in SqlClient Troubleshoot transient connection errors Scaling: Consider how this strategy can be applied across multiple services in larger architectures. Consider reading and applying managed identity best practices. Resources: Managed identity best practice recommendations
MelaniaNitu
Oct 03, 2025 Place Microsoft Mission Critical Blog
628Views
2likes
1Comment
A Deep Dive into Spark UI for Job Optimization
Key Insights for Spark Job Optimization The Spark UI is your X-ray into application execution: It provides real-time and post-mortem insights into every job, stage, task, and resource usage, moving you from guesswork to evidence-driven tuning. Systematic analysis is crucial: Start from high-level overviews in the Jobs tab, drill down into Stages for bottlenecks and shuffle operations, examine Tasks for skew and spills, and review Executors for resource allocation issues. Targeted optimizations yield significant gains: Address issues like data skew, excessive shuffles, memory pressure, and inefficient SQL plans with specific techniques such as repartitioning, broadcast joins, Kryo serialization, and proper resource allocation. Apache Spark is a powerful distributed computing framework, but extracting its full potential often requires meticulous optimization. The Spark UI (User Interface) stands as an indispensable tool, offering a detailed, web-based dashboard that provides real-time and historical insights into your Spark applications. It's the diagnostic center that helps you pinpoint performance bottlenecks, understand resource consumption, and identify inefficiencies that may be hindering your jobs. This comprehensive guide will walk you through the process of accessing, navigating, and interpreting the Spark UI, empowering you to translate its rich data into concrete strategies for optimizing your Spark jobs. As of July 1, 2025, modern Spark versions like 4.0.0 place significant emphasis on UI-driven performance tuning, making this a critical skill for any data professional. Accessing and Navigating the Spark UI: Your Diagnostic Gateway Before diving into optimization, you need to know how to access the Spark UI. Its accessibility varies depending on your Spark deployment mode: Local Mode: When running Spark locally, the UI is typically available at http://localhost:4040. Cluster Mode: In cluster environments like YARN, Mesos, or Kubernetes, the UI is usually accessed via the Spark History Server (often at port 18080) for post-mortem analysis, or through the application master's URL while the job is running. Cloud Platforms: On cloud services such as AWS Glue, Databricks, or EMR, the Spark UI is typically integrated into their respective consoles or accessible by enabling Spark event logging. Ensure event logs are configured to roll over to prevent metrics truncation for long-running jobs. Once accessed, the Spark UI is structured into several key tabs, each providing a different lens into your application's behavior: Jobs Tab: High-level overview of all jobs. Stages Tab: Detailed breakdown of stages within a job. Tasks Tab: Granular information about individual task execution. Storage Tab: Insights into cached RDDs and DataFrames. Environment Tab: Spark configuration and system properties. Executors Tab: Resource usage of individual executors. SQL Tab: Specific details for SQL queries and DataFrame operations (if applicable). Deciphering the Spark UI: A Tab-by-Tab Analysis An overview of the Jobs tab in the Apache Spark UI, showing job progress and details. 1. The Jobs Tab: Your Application's Pulse Check The Jobs tab is your initial point of contact for understanding the overall health and progress of your Spark application. It summarizes all submitted jobs, their status (running, completed, failed), duration, and general progress. This tab helps you quickly identify jobs that are stalling, taking excessively long, or have failed outright. What to look for: Overall Duration: Identify jobs that exhibit long durations. These are prime candidates for deeper optimization. Status and Progress: Observe jobs that are stuck or show a high number of failed tasks, indicating potential underlying issues that need immediate attention. Event Timeline: This visual representation of the application's lifecycle, including job execution and executor activity, can reveal patterns of resource contention or uneven parallel execution. 2. The Stages Tab: Unveiling Bottlenecks Stages are the backbone of a Spark job's execution, representing a sequence of tasks that can run together without data shuffling. The Stages tab provides granular details about each stage, making it crucial for pinpointing specific bottlenecks. The Stages tab in Spark UI, displaying detailed information for each stage of a job. Key Metrics and Analysis: Duration: Sort stages by duration to identify the longest-running ones. These are where your optimization efforts will likely yield the greatest impact. Input/Output (I/O) Sizes: High input/output metrics suggest that the stage might be I/O-bound. This points to opportunities for optimizing data formats or storage. Shuffle Read/Write: These are critical metrics. High "Shuffle Read" or "Shuffle Write" values indicate significant data movement between nodes, which is a very expensive operation. This often signals inefficient joins, aggregations, or partitioning. Task Progress and Event Timeline: Within the detail view of a stage, the event timeline visually represents individual task execution. Look for "straggler" tasks – tasks that take significantly longer than others – as this is a strong indicator of data skew where certain partitions hold disproportionately more data or require more computation. DAG Visualization: The Directed Acyclic Graph (DAG) visualization within a stage illustrates the flow of RDDs/DataFrames and the operations applied to them. This visual can simplify understanding complex data transformations and dependencies. For example, if a stage shows 3.2 TB of shuffle read and one task processes 400 GB compared to a median of 25 GB, this immediately highlights a severe data skew issue. 3. The Tasks Tab: Drilling Down to Individual Performance The Tasks tab offers the most granular view, showing execution details for individual tasks within a stage. This is where you can confirm observations from the Stages tab and identify specific issues like out-of-memory errors or high garbage collection times. Critical data points: Executor Run Time: Helps identify slow-running tasks. GC Time (Garbage Collection Time): High GC times indicate memory pressure and inefficient memory management, suggesting a need to optimize memory configurations or data serialization. Shuffle Spill (Memory Bytes Spilled / Disk Bytes Spilled): If tasks are spilling data to disk, it means they ran out of memory. This is a severe performance bottleneck, pointing to insufficient executor memory or inefficient data processing. Host: Sorting the task table by host can reveal skewed executors, where one executor is burdened with significantly more work due to data imbalance. 4. The SQL Tab: Optimizing Your DataFrames and SQL Queries For Spark DataFrame and SQL workloads, the SQL tab is invaluable. It provides detailed information about executed SQL queries, including their duration, associated jobs, and, most importantly, their physical and logical execution plans. Analyzing SQL queries: Physical Plan: This is a textual and graphical representation of how the Spark optimizer decided to execute your query. Look for inefficient join strategies (e.g., unintended Cartesian joins, inefficient Sort-Merge Joins where Broadcast Join would be better), missed filter pushdowns, or unnecessary data shuffles (indicated by "Exchange" operations). Graphical Visualization: This visual simplifies the analysis by showing aggregated information about rows and data processed at each stage of the SQL query. By analyzing the physical plan, you can validate whether your DataFrame transformations or SQL queries are being optimized as expected. For instance, if you've hinted for a broadcast join but the plan shows a Sort-Merge Join with a huge shuffle, you know there's a problem. 5. The Executors Tab: Resource Utilization Deep Dive This tab provides a detailed view of the resources consumed by each executor in your cluster, including CPU cores, allocated memory, used memory, disk usage, and the number of active tasks. It's essential for understanding resource allocation and identifying bottlenecks related to cluster configuration. Key checks: Memory Used vs. Total Memory: Identify if executors are underutilized or overloaded. High memory usage combined with disk spills indicates memory pressure. CPU Cores: Verify if your allocated CPU cores are being efficiently utilized. Low utilization might suggest insufficient parallelism or tasks waiting for resources. Disk Usage: Indicates if tasks are performing large I/O operations or spilling excessive data to disk. Thread Dump: Allows you to inspect the JVM thread dump on each executor for advanced debugging of performance issues. 6. The Storage Tab: Managing Cached Data If your Spark application uses caching or persistence (e.g., via cache() or persist()), the Storage tab provides details about persisted RDDs and DataFrames, including their storage levels (memory, disk, or both), sizes, and partition distribution. Insights from the Storage tab: Memory Management: Ensure cached data is not consuming excessive memory or being spilled to disk unnecessarily. Appropriate Caching Strategy: Verify that frequently accessed datasets are cached with suitable storage levels to minimize recomputation without causing memory overflows. 7. The Environment Tab: Configuration Validation This tab displays all Spark configuration properties, JVM settings, and system environment variables. It's a crucial place to confirm that your Spark application is running with the intended configurations. Key usage: Configuration Validation: Double-check if critical Spark configurations like spark.executor.memory, spark.executor.cores, spark.sql.shuffle.partitions, and spark.serializer are set correctly. Misconfigurations can severely impact performance. Translating UI Insights into Optimization Strategies Once you've analyzed the Spark UI and identified specific bottlenecks, you can apply targeted optimization techniques. This shift from "guess-and-check" to "evidence-driven" tuning can significantly improve job runtimes and reduce costs. 1. Addressing Data Skew Detection: Long "straggler" tasks in the Stage Event Timeline, uneven partition sizes, or highly skewed "Shuffle Read/Write" metrics in the Stages tab. Optimization: Repartitioning: Use repartition(N) or repartitionByRange(N, column) to distribute data more evenly across partitions. For instance, df = df.repartitionByRange(800, "customer_id") for a skewed customer_id key. Salting: For highly skewed join keys, add a random prefix (salt) to the key before joining, then remove it afterward. Adaptive Query Execution (AQE): In Spark 3.2+, enable AQE (spark.sql.adaptive.enabled=true and spark.sql.adaptive.skewJoin.enabled=true). AQE can dynamically detect and mitigate data skew during shuffle operations. 2. Optimizing Shuffles Detection: High "Shuffle Read" and "Shuffle Write" metrics in the Stages tab, indicating excessive data movement. Optimization: Filter Early: Push down filters and projections as early as possible to reduce the amount of data processed and shuffled. Broadcast Joins: For small tables (typically under spark.sql.autoBroadcastJoinThreshold, default 10MB), use broadcast(df) hint or set spark.sql.autoBroadcastJoinThreshold to enable broadcast joins. This avoids a shuffle for the smaller table. Adjust Shuffle Partitions: Configure spark.sql.shuffle.partitions appropriately. A common rule of thumb is 2-4 times the number of total executor cores, ensuring each partition is between 100-200 MB to avoid OOM errors and small file overhead. Coalesce: Use coalesce() for reducing the number of partitions without triggering a full shuffle if data size allows. 3. Memory Management and Garbage Collection Detection: High "Shuffle Spill" (Memory/Disk Bytes Spilled) in the Tasks tab, out-of-memory errors, or significant "GC Time" in the Executors tab or Task details. Optimization: Executor Memory: Increase spark.executor.memory if tasks are spilling to disk. Memory Fractions: Adjust spark.memory.fraction and spark.memory.storageFraction to allocate more memory for execution or caching. Serialization: Use Kryo serialization (spark.serializer=org.apache.spark.serializer.KryoSerializer) for faster and more compact data serialization, reducing memory footprint and network I/O. Caching: Cache only necessary DataFrames that are reused multiple times, and use appropriate storage levels (e.g., MEMORY_AND_DISK). Unpersist data promptly when no longer needed. GC Tuning: For large heaps, consider tuning JVM garbage collector settings, often involving the G1GC algorithm, to minimize GC pauses. High GC time (e.g., >15% of task time) indicates too many small objects. 4. Resource Allocation and Parallelism Detection: Underutilized executors (low CPU usage, many idle cores), tasks waiting for resources in the Jobs/Executors tabs, or dynamic allocation adding/removing executors frequently. Optimization: Executor Cores/Memory: Adjust spark.executor.cores and spark.executor.memory to match your cluster's capacity and workload. Ensure you have enough executors to handle the desired parallelism. Default Parallelism: Set spark.default.parallelism to a value that provides sufficient concurrent tasks, ideally 2-4 times the total number of CPU cores in your cluster. 5. SQL Query and DataFrame Optimization Detection: Inefficient physical plans in the SQL tab, long-running SQL queries, or unnecessary "Exchange" operations. Optimization: Predicate Pushdown: Ensure filters are applied as early as possible (e.g., directly in the data source read) to reduce the amount of data processed. Join Order and Strategy: Reorder joins to place selective filters and smaller tables first. Leverage specific join hints (BROADCAST, SHUFFLE_HASH) where appropriate. Column Pruning: Select only the columns you need, avoiding full table scans. Bucketing and Partitioning: For frequently joined or filtered columns, consider bucketing and partitioning your data to improve performance of joins and aggregations. This bar chart quantifies the common performance bottlenecks in Spark, indicating their typical impact on job execution on a scale of 0 to 10. Higher scores suggest more significant performance degradation. Understanding these high-impact areas helps prioritize optimization efforts. A Practical Example: Tackling Data Skew with the UI Imagine a PySpark ETL job that takes 48 minutes to complete. A quick glance at the Jobs tab shows that "Job 3" accounts for 42 of those minutes. Drilling into Job 3, the Stages tab reveals that "Stage 19" is the culprit, consuming 38 minutes and involving 3.2 TB of shuffle read. Further inspection of Stage 19's Event Timeline within the Stage Detail view immediately highlights a "straggler" task on a specific host (e.g., ip-10-0-4-11). This task processed an anomalous 400 GB of data, compared to the median 25 GB for other tasks in the same stage. This disparity is a classic symptom of data skew, likely caused by a highly skewed key like "customer_id". The Fix: Based on this evidence, an optimization is implemented: df = df.repartitionByRange(800, "customer_id") potentially combined with salting if the skew is severe. After redeploying, the Spark UI confirms the success: Stage 19's runtime drops to 6 minutes, the total job to 12 minutes, and crucially, there's no disk spill and GC time is less than 3%. This example underscores how the Spark UI provides the exact evidence needed to diagnose issues and validate the effectiveness of applied optimizations. Optimizing for the Future: Best Practices and Continuous Improvement Effective use of the Spark UI isn't a one-time activity; it's an ongoing process for continuous optimization. Table of Common Symptoms and Proven Fixes Symptom in UI Root Cause What to Change / Fix Few very long tasks; wide idle band at end of stage (stragglers) Too few partitions or severe data skew repartition(N) or repartitionByRange; for skew: salting, skew join hint, enable AQE skew mitigation Shuffle spill: "Disk Bytes Spilled" > 0 Executor memory insufficient Raise spark.executor.memory / spark.memory.fraction, use Kryo serialization, filter earlier Stage uses SortMergeJoin with huge shuffle where BroadcastJoin was expected Broadcast join not chosen or threshold too low broadcast(df) hint or configure spark.sql.autoBroadcastJoinThreshold GC Time > 15% of Task Time Too many small objects, inefficient memory usage cache() only necessary data, use Dataset encoders or vectorized Parquet reader, increase executor heap but watch GC algorithm Executors idle in timeline; dynamic allocation frequently adds/removes Slots > parallelism; poor partitioning for workload Lower spark.sql.shuffle.partitions, coalesce downstream if appropriate, adjust spark.default.parallelism SQL plan shows multiple "Exchanges" stacking Unnecessary repartitions (e.g., narrow-wide-narrow pattern) Use colocated sort-merge join hints, reuse partitioning columns, analyze query logic for redundant shuffles High I/O metrics in Stages tab (e.g., large input size without sufficient processing) Inefficient data format, full table scans, or lack of predicate pushdown Optimize data formats (e.g., Parquet with snappy compression), apply filters/projections early, leverage partitioning/bucketing in source data Application fails with OutOfMemoryError (OOM) on driver or executor Insufficient driver/executor memory for data or operations Increase spark.driver.memory or spark.executor.memory; reduce partition size or number of partitions; enable off-heap memory if applicable This table summarizes common symptoms observed in the Spark UI, their root causes, and corresponding solutions. It serves as a quick reference guide for targeted optimization efforts. Visualization of Spark UI Concepts This Mermaid mindmap visually organizes the key concepts related to analyzing the Spark UI and optimizing Spark jobs, covering accessing the UI, understanding its various tabs, specific optimization strategies, and overarching best practices for continuous improvement. Conclusion Analyzing the Spark UI is an art and a science, offering an unparalleled view into the inner workings of your Spark applications. By systematically navigating its various tabs—Jobs, Stages, Tasks, SQL, Executors, Storage, and Environment—you can gather crucial evidence to diagnose performance issues such as data skew, excessive shuffles, memory pressure, and inefficient resource allocation. This evidence-driven approach allows you to implement targeted optimizations, whether it's through repartitioning data, adjusting memory configurations, fine-tuning SQL queries, or optimizing resource allocation. Mastering the Spark UI not only transforms you into a more effective Spark developer but also ensures that your big data pipelines run with optimal efficiency, leading to significant reductions in execution time and operational costs. Continuous monitoring and iterative optimization based on UI insights are the keys to maintaining robust and performant Spark applications in production environments. Frequently Asked Questions (FAQ) What is the primary purpose of the Spark UI? The Spark UI serves as a web-based interface for monitoring, debugging, and optimizing Spark applications by providing real-time and historical insights into job execution, resource utilization, and performance bottlenecks. How can I access the Spark UI in a cluster environment? In a cluster environment, the Spark UI can typically be accessed via the Spark History Server (often running on port 18080) for completed jobs, or through the application master's URL while the job is still active. Cloud platforms like AWS Glue or Databricks usually provide direct links in their respective consoles. What does "Shuffle Read/Write" indicate in the Spark UI? "Shuffle Read/Write" metrics in the Stages tab indicate the amount of data transferred between executors across the network during shuffle operations. High values often point to expensive data redistribution, which can be a significant performance bottleneck, typically caused by wide transformations like joins or aggregations. How do "straggler" tasks relate to data skew? "Straggler" tasks are individual tasks within a stage that take significantly longer to complete than others. They are a primary indicator of data skew, where certain data partitions have disproportionately more data or require more computation, leading to uneven work distribution across executors. What are some immediate actions to take if the Spark UI shows high "Shuffle Spill"? High "Shuffle Spill" (data written to disk due to memory limitations) suggests that executors are running out of memory. Immediate actions include increasing spark.executor.memory, optimizing data serialization (e.g., using Kryo), or filtering data earlier to reduce memory footprint. Referenced Sources Performance Tuning - Spark 4.0.0 Documentation - spark.apache.org Diagnose cost and performance issues using the Spark UI - Databricks Documentation Web UI - Spark 4.0.0 Documentation - spark.apache.org Diagnose cost and performance issues using the Spark UI | Databricks Documentation How to interpret Spark UI - Databricks Community - 109593 Apache Spark Performance Tuning: 7 Optimization Tips (2025) Diagnose cost and performance issues using the Spark UI - Azure Databricks | Microsoft Learn Mastering Spark UI Monitoring in PySpark: Optimizing Performance ... Diagnose cost and performance issues using the Spark UI r/dataengineering on Reddit: Beginner’s Guide to Spark UI: How to Monitor and Analyze Spark Jobs Diagnose cost and performance issues using the Spark UI How to Optimize Spark Jobs for Maximum Performance Monitoring and Instrumentation - Spark 4.0.0 Documentation Spark Web UI - Understanding Spark Execution - Spark By {Examples} How to read Spark UI - Stack Overflow
anishekkamal
Aug 11, 2025 Place Microsoft Mission Critical Blog
1.7KViews
3likes
0Comments
Scaling PostgreSQL Connections in Azure: A Deep Dive into Multi-PgBouncer Architectures
This article dives into building a robust and scalable connection pooling layer for Azure Database for PostgreSQL - Flexible Server. You'll learn how to deploy multiple PgBouncer instances, managed by an Azure Load Balancer, to eliminate single points of failure and dramatically boost throughput for your applications.
AlexRosa
Jun 04, 2025 Place Microsoft Mission Critical Blog
536Views
2likes
1Comment
A Practical Guide to Implement End-to-End Data Warehousing PoC Using Microsoft Fabric-Part1
Sometime back we had the opportunity to help a customer implement a PoC using Microsoft Fabric. In a series of blogs, we will demonstrate how we helped our customer from the SfMC team to implement a Data Warehouse PoC using the ELT (Extract, Load, Transform) approach. For this PoC, we used sample data from SQL Server’s well-known sample databases, giving us a straightforward and accessible way to illustrate the capabilities of Microsoft Fabric in a real-world data warehousing scenario. The following were the customer requirements: Build the enterprise grade Data Warehouse solution in Microsoft Fabric Ingest data from 59 diverse sources across 130 different countries Source data from ADLS (JSON) Source data from Azure and on-prem SQL Server databases Other data sources. PoC scope: ELT approach ingest (OLTP DB & ADLS Gen2) data directly into the Warehouse Transformation using stored procedures to load the production schema Solution Summary: Ingest OLTP data from source SQL Server (full load) Meta data driven copy activity to load data into staging schema of the Data Warehouse. Stored procedures to process the staging data along with data from ADLS shortcuts. Load the production Dimension tables. Application’s customer data arrives into ADLS storage as JSON files. Stored Procedures to process the data and load Customer dimension and Fact data using incremental load. Built 6 Dimensions and 1 Fact. Build reports. Sample overview of items that were implemented: Sample report from the PoC: Prerequisites: To implement this PoC, the following are the prerequisites: you may need to download the following files: You can download them from: Download all the required files to local drive (eg: C:\temp). We assume that you already have a Fabric tenant setup with a Fabric Capacity (you will need a F64 capacity to test the co-pilot feature else a F2 capacity will be sufficient). If Fabric is not enabled, use the link below to enable Fabric for your organization. Step by Step Guide to Enable Microsoft Fabric for Microsoft 365 Developer Account You will also need: A logical SQL Server (how to deploy one, how to add local IP address to the firewall exception) ADLS Gen2 Account (how to deploy one, how to create a container) Once created, Once the account is created, navigate to the “Data Storage” section and create a container name it levelup or a name of choice Open the container “levelup”and create a folder called “JSON_FILES” Install Storage Explorer, configure to connect to your Azure subscription, and navigate to storage account container. Upload five JSON files from the downloaded folder “JSON_FILES” to ADLS Gen2 in the "levelup" container under the "JSON_FILES" folder. Upload the following folders by choosing “Upload Folder” from Azure Storage Explorer. Sales_SalesOrderDetail Sales_SalesOrderHeader The above two folders containing Delta files, will be used for creating shortcuts in the Lakehouse which will be used for building the tables within the Warehouse in the subsequent tasks. After the upload, you should have the folders below inside your Levelup container. Next, create a database on the Azure SQL Server by using the bacpac files downloaded earlier: Connect to Azure SQL Logical Server using SSMS (if you don’t have, you can download here) Right click on Databases and select the option “Import Data-Tier Application”. Follow the screen captures below to complete the database bacpac import. Note: Depending upon DTU/V-core chosen, the Import activities might take up 30 mins. Continue to the next blog
Dharmendra_Keshari
Apr 13, 2025 Place Microsoft Mission Critical Blog
1.9KViews
5likes
2Comments
A Practical Guide to Implement End-to-End Data Warehousing PoC Using Microsoft Fabric-Part 3
In this continuation (part 3) of the previous blog, you will create a simple metadata driven pipeline to ingest data into the Data Warehouse and populate the dimension tables. Task4: Create a metadata driven pipeline to ingest data into the Data Warehouse and populate the dimension tables. We will use the table mtd.TablesToCopy created in Task3 of above while building a metadata driven pipeline, to ingest data from the Azure SQL Database. Click on the Workspace icon on the left side navigation and click “+ New Item” > search “Data pipeline” to create a new pipeline and provide a name (eg., PL_Dim_FullLoad_Levelup) Once the pipeline editor opens, click on the Activities tab and add a Stored Procedure activity and name it DeleteAllRowsfromStgTbls in the General tab On the settings tab, use the below snip to populate the details. This stored procedure deletes all the records of the staging dimension tables to facilitate a full load (except stg.DimCustomer) Note: We'll utilize stg.DimCustomer for incremental changes, won't be deleting all rows from the table. Click on the whitespace in the pipeline editor/canvas to add a parameter called “LookupTable” like the below. Add a lookup activity into the canvas. Connect the “On success” port of the “Stored Procedure” activity to the Lookup activity. From the general tab of the lookup activity, rename it to TablesToCopyLookup. In the settings tab, use the below snip to populate the details. Important: Remember to uncheck the First row only Note: In the above snip the Query field is set dynamically. To do so, click on the Add dynamic content as shown below and paste the following code: SELECT * FROM @{pipeline().parameters.LookupTable} The lookup activity will get the table name passed by the pipeline parameter at runtime and run the select statement against the table – mtd.TablesToCopy. Add a ForEach activity to the canvas and connect the “On success” port of the “Lookup” activity to the “ForEach” activity. The ForEach activity will iterate once for each value returned by the Lookup activity. Rename the activity name to ForEach_Row_In_TablesToCopy Set the settings for the ForEach activity as below. The “items” field should be set similar to note section of step 5. Paste the following into the items field - @activity('TablesToCopyLookup').output.value. Double click or edit the ForEach activity and within the ForEach activity, add a Copy Activity, set its name as “Copy_Data_From_SQL_To_Fabric_DW” in the General tab to: Set the source tab settings based on the snip below. In the connection field, setup a new connection to the Azure SQL DB that you created in the pre-requisites section. In the Query field click Add dynamic content and type - SELECT * FROM @{item().tableName}. This will pass the name of the table based on the iteration we are in. Set the destination tab settings based on the snip below. Notice that we will copy the source tables from the Azure SQL database to the staging schema and dynamically pass table names based on the source tables. The schema field is statically set to stg. To create the table names in the data warehouse dynamically, in the table name field, click Add dynamic content and paste - @concat('Dim',split(item().tableName,'.')[1]). This expression, concatenates the string “Dim” to the second part of the two part source table. In the canvas showing the Copy activity, click on Main canvas on the top left corner to exit out of the ForEach activity. Add a Stored procedure activity to the canvas and connect the Success port of the ForEach activity to the stored procedure. Rename the stored procedure activity from the General tab to - CTAS_For_PrdDimTbls. In the settings tab, fill out the details using the snip below Click on the Home tab > Save to save the pipeline and the Play button to execute the pipeline. The pipeline should look like the below. After a successful run, it would have copied all the tables from the Azure SQL Database to the stg schema of the Data Warehouse and would have populated all the Dimensions tables in the production schema.
ChandraMohanSV
Apr 10, 2025 Place Microsoft Mission Critical Blog
396Views
2likes
0Comments
Azure Databricks - Best Practices for Using Spot Instances in Cluster Scaling
Leveraging Spot instances in Azure Databricks can greatly reduce costs; however, we strongly advise against their use for critical production workloads requiring high SLAs. Since Spot instances are subject to availability and can be interrupted at any time, they pose a risk to workload stability. If you still choose to use Spot instances for such workloads, it is essential to follow best practices to mitigate potential risks. Spot instances provide a cost-efficient way to scale compute clusters, but improper configurations may lead to instability and job failures. This blog outlines key best practices for using Spot instances, whether auto-scaling is enabled or disabled. When Auto Scaling is Disabled with Spot Instances Without auto-scaling, Spot instance availability is crucial for successful cluster startup. Here’s what you need to consider: Cluster Availability Ensure that 80% of the total requested nodes are available for startup. For instance, if you request four Spot compute worker nodes, the eviction of even a single node can delay the cluster's launch. Cluster Launch Attributes Use attributes like FALL_BACK_TO_AZURE during cluster launch. This ensures that if Spot instances are unavailable, an on-demand compute node will be provisioned instead, preventing cluster failures. Avoid Using Pools with Spot Instances Creating clusters from pools with Spot instances can introduce instability, especially if the driver node is assigned to a Spot instance. To prevent this, we strongly discourage using pools with Spot instances when launching clusters. When Auto Scaling is Enabled with Spot Instances Auto-scaling allows clusters to dynamically adjust resources, but careful setup is necessary for smooth scaling. On-Demand Nodes First Set the Azure attribute first_on_demand=2 in the job cluster definition. This ensures that the first two nodes (one driver and one worker) are on-demand, stabilizing cluster creation. Autoscaling Settings Enable auto-scaling on the cluster. Set min_workers=1, ensuring that at least one worker is always on-demand. Define the maximum cluster size to prevent over-scaling issues. This setup ensures reliable cluster startup and reduces the risk of job failures. Upscaling Considerations The cluster should always start with on-demand nodes before scaling up with Spot instances. While this approach improves stability, it may slightly increase overall job duration due to the upscaling process. Avoid Using Pools with Spot Instances Just like in the non-auto-scaling setup, avoid creating clusters from pools with Spot instances. Doing so can lead to delayed startups and instability. Final Thoughts By following these best practices, you can maximize the benefits of Spot instances while ensuring cluster stability and efficiency. Whether auto-scaling is enabled or not, prioritizing on-demand instances during startup and carefully managing scaling policies will help mitigate potential risks.
PraveenPentareddy
Apr 08, 2025 Place Microsoft Mission Critical Blog
2.3KViews
2likes
0Comments