Azure AI Foundry vs. Azure Databricks – A Unified Approach to Enterprise Intelligence
Key Insights into Azure AI Foundry and Azure Databricks

- Complementary Powerhouses: Azure AI Foundry is purpose-built for generative AI application and agent development, focusing on model orchestration and rapid prototyping, while Azure Databricks excels in large-scale data engineering, analytics, and traditional machine learning, forming the data intelligence backbone.
- Seamless Integration for End-to-End AI: A critical native connector allows AI agents developed in Foundry to access real-time, governed data from Databricks, enabling contextual and data-grounded AI solutions. This integration facilitates a comprehensive AI lifecycle from data preparation to intelligent application deployment.
- Specialized Roles for Optimal Performance: Enterprises leverage Databricks for its robust data processing, lakehouse architecture, and ML model training capabilities, and then utilize AI Foundry for deploying sophisticated generative AI applications, agents, and managing their lifecycle, ensuring responsible AI practices and scalability.

In the rapidly evolving landscape of artificial intelligence, organizations seek robust platforms that can not only handle vast amounts of data but also enable the creation and deployment of intelligent applications. Microsoft Azure offers two powerful, yet distinct, services in this domain: Azure AI Foundry and Azure Databricks. While both contribute to an organization's AI capabilities, they serve different primary functions and are designed to complement each other in building comprehensive, enterprise-grade AI solutions.

Decoding the Core Purpose: Foundry for Generative AI, Databricks for Data Intelligence

At its heart, the distinction between Azure AI Foundry and Azure Databricks lies in their core objectives and the types of workloads they are optimized for. Understanding these fundamental differences is crucial for strategic deployment and maximizing their combined potential.
Azure AI Foundry: The Epicenter for Generative AI and Agents

Azure AI Foundry emerges as Microsoft's unified platform specifically engineered for the development, deployment, and management of generative AI applications and AI agents. It represents a consolidation of capabilities from what were formerly Azure AI Studio and Azure OpenAI Studio. Its primary focus is on accelerating the entire lifecycle of generative AI, from initial prototyping to large-scale production deployments.

Key Characteristics of Azure AI Foundry:

- Generative AI Focus: Foundry streamlines the development of large language models (LLMs) and customized generative AI applications, including chatbots and conversational AI. It emphasizes prompt engineering, Retrieval-Augmented Generation (RAG), and agent orchestration.
- Extensive Model Catalog: It provides access to a vast catalog of over 11,000 foundation models from various publishers, including OpenAI, Meta (Llama 4), Mistral, and others. These models can be deployed via managed compute or serverless API deployments, offering flexibility and choice.
- Agentic Development: A significant strength of Foundry is its support for building sophisticated AI agents. This includes tools for grounding agents with knowledge, tool calling, comprehensive evaluations, tracing, monitoring, and guardrails to ensure responsible AI practices. Foundry Local further extends this by allowing offline and on-device development.
- Unified Development Environment: It offers a single management grouping for agents, models, and tools, promoting efficient development and consistent governance across AI projects.
- Enterprise Readiness: Built-in capabilities such as Role-Based Access Control (RBAC), observability, content safety, and project isolation ensure that AI applications are secure, compliant, and scalable for enterprise use.

Figure 1: Conceptual Architecture of Azure AI Foundry illustrating its various components for AI development and deployment.
Azure Databricks: The Powerhouse for Data Engineering, Analytics, and Machine Learning

Azure Databricks, on the other hand, is an Apache Spark-based data intelligence platform optimized for large-scale data engineering, analytics, and traditional machine learning workloads. It acts as a collaborative workspace for data scientists, data engineers, and ML engineers to process, analyze, and transform massive datasets, and to build and deploy diverse ML models.

Key Characteristics of Azure Databricks:

- Unified Data Analytics Platform: Central to Databricks is its lakehouse architecture, built on Delta Lake, which unifies data warehousing and data lakes. This provides a single platform for data engineering, SQL analytics, and machine learning.
- Big Data Processing: Excelling in distributed computing, Databricks is ideal for processing large datasets, performing ETL (Extract, Transform, Load) operations, and real-time analytics at scale.
- Comprehensive ML and AI Workflows: It offers a specialized environment for the full ML lifecycle, including data preparation, feature engineering, model training (both classic and deep learning), and model serving. Tools like MLflow are integrated for tracking, evaluating, and monitoring ML models.
- Data Intelligence Features: Databricks includes AI-assistive features such as Databricks Assistant and Databricks AI/BI Genie, which enable users to interact with their data using natural language queries to derive insights.
- Unified Governance with Unity Catalog: Unity Catalog provides a centralized governance solution for all data and AI assets within the lakehouse, ensuring data security, lineage tracking, and access control.

Figure 2: The Databricks Data Intelligence Platform with its unified approach to data, analytics, and AI.
The Symbiotic Relationship: Integration and Complementary Use Cases

While distinct in their primary functions, Azure AI Foundry and Azure Databricks are explicitly designed to work together, forming a powerful, integrated ecosystem for end-to-end AI development and deployment. This synergy is key to building advanced, data-driven AI solutions in the enterprise.

Seamless Integration for Enhanced AI Capabilities

The integration between the two platforms is a cornerstone of Microsoft's AI strategy, enabling AI agents and generative applications to be grounded in high-quality, governed enterprise data.

Key Integration Points:

- Native Databricks Connector in AI Foundry: A significant development in 2025 is the public preview of a native connector that allows AI agents built in Azure AI Foundry to directly query real-time, governed data from Azure Databricks. This means Foundry agents can leverage Databricks AI/BI Genie to surface data insights and even trigger Databricks Jobs, providing highly contextual and domain-aware responses.
- Data Grounding for AI Agents: This integration enables AI agents to access structured and unstructured data processed and stored in Databricks, providing the necessary context and knowledge base for more accurate and relevant generative AI outputs. All interactions are auditable within Databricks, maintaining governance and security.
- Model Crossover and Availability: Foundation models, such as the Llama 4 family, are made available across both platforms. Databricks DBRX models can also appear in the Foundry model catalog, allowing flexibility in where models are trained, deployed, and consumed.
- Unified Identity and Governance: Both platforms use Microsoft Entra ID for authentication and access control, and Unity Catalog provides unified governance for data and AI assets managed by Databricks, which Foundry agents can then respect.
Here's a breakdown of how a typical flow might look:

Mindmap 1: Illustrates the complementary roles and integration points between Azure Databricks and Azure AI Foundry within an end-to-end AI solution.

When to Use Which (and When to Use Both)

Choosing between Azure AI Foundry and Azure Databricks, or deciding when to combine them, depends on the specific requirements of your AI project:

Choose Azure AI Foundry When You Need To:

- Build and deploy production-grade generative AI applications and multi-agent systems.
- Access, evaluate, and benchmark a wide array of foundation models from various providers.
- Develop AI agents with sophisticated capabilities like tool calling, RAG, and contextual understanding.
- Implement enterprise-grade guardrails, tracing, monitoring, and content safety for AI applications.
- Rapidly prototype and iterate on generative AI solutions, including chatbots and copilots.
- Integrate AI agents deeply with Microsoft 365 and Copilot Studio.

Choose Azure Databricks When You Need To:

- Perform large-scale data engineering, ETL, and data warehousing on a unified lakehouse.
- Build and train traditional machine learning models (supervised, unsupervised learning, deep learning) at scale.
- Manage and govern all data and AI assets centrally with Unity Catalog, ensuring data quality and lineage.
- Conduct complex data analytics, business intelligence (BI), and real-time data processing.
- Leverage AI-assistive tools like Databricks AI/BI Genie for natural language interaction with data.
- Require high-performance compute and auto-scaling for data-intensive workloads.

Use Both for Comprehensive AI Solutions:

The most powerful approach for many enterprises is to leverage both platforms. Azure Databricks can serve as the robust data backbone, handling data ingestion, processing, governance, and traditional ML model training.
Azure AI Foundry then sits atop this foundation, consuming the prepared and governed data to build, deploy, and manage intelligent generative AI agents and applications. This allows for:

- Domain-Aware AI: Foundry agents are grounded in enterprise-specific data from Databricks, leading to more accurate, relevant, and trustworthy AI responses.
- End-to-End AI Lifecycle: Databricks manages the "data intelligence" part, and Foundry handles the "generative AI application" part, covering the entire spectrum from raw data to intelligent user experience.
- Optimized Resource Utilization: Each platform focuses on what it does best, leading to more efficient resource allocation and specialized toolsets for different stages of the AI journey.

Comparative Analysis: Features and Capabilities

To further illustrate their distinct yet complementary nature, let's examine a detailed comparison of their features, capabilities, and typical user bases.

Radar Chart 1: This chart visually compares Azure AI Foundry and Azure Databricks across several key dimensions, illustrating their specialized strengths. Azure AI Foundry excels in generative AI and agent orchestration, while Azure Databricks dominates in data engineering, unified data governance, and traditional ML workflows.

A Detailed Feature Comparison

Feature Category | Azure AI Foundry | Azure Databricks
Primary Focus | Generative AI application & agent development, model orchestration | Large-scale data engineering, analytics, traditional ML, and AI workflows
Data Handling | Connects to diverse data sources (e.g., Databricks, Azure AI Search) for grounding AI agents. Not a primary data storage/processing platform. | Native data lakehouse architecture (Delta Lake), optimized for big data processing, storage, and real-time analytics.
AI/ML Capabilities | Foundation models (LLMs), prompt engineering, RAG, agent orchestration, model evaluation, content safety, responsible AI tooling. | Traditional ML (supervised/unsupervised), deep learning, feature engineering, MLflow for lifecycle management, Databricks AI/BI Genie.
Development Style | Low-code agent building, prompt flows, unified SDK/API, templates. | Code-first (Python, SQL, Scala, R), notebooks, IDE integrations.
Model Access & Deployment | Extensive model catalog (11,000+ models), serverless API, managed compute deployments, model benchmarking. | Training and serving custom ML models, including deep learning. Models available for deployment through MLflow.
Governance & Security | Azure-based security & compliance, RBAC, project isolation, content safety guardrails, tracing, evaluations. | Unity Catalog for unified data & AI governance, lineage tracking, access control, Entra ID integration.
Key Users | AI developers, business analysts, citizen developers, AI app builders. | Data scientists, data engineers, ML engineers, data analysts.
Integration Points | Native connector to Databricks AI/BI Genie, Azure AI Search, Microsoft 365, Copilot Studio, Power Platform. | Microsoft Fabric, Power BI, Azure AI Foundry, Azure Purview, Azure Monitor, Azure Key Vault.

Table 1: A comparative overview of the distinct features and functionalities of Azure AI Foundry and Azure Databricks

Concluding Thoughts

In essence, Azure AI Foundry and Azure Databricks are not competing platforms but rather essential components of a unified, comprehensive AI strategy within the Azure ecosystem. Azure Databricks provides the robust, scalable foundation for all data engineering, analytics, and traditional machine learning workloads, acting as the "data intelligence platform." Azure AI Foundry then leverages this foundation to specialize in the rapid development, deployment, and operationalization of generative AI applications and intelligent agents. Together, they enable enterprises to unlock the full potential of AI, transforming raw data into powerful, domain-aware, and governed intelligent solutions.
Frequently Asked Questions (FAQ)

What is the main difference between Azure AI Foundry and Azure Databricks?
Azure AI Foundry is specialized for building, deploying, and managing generative AI applications and AI agents, focusing on model orchestration and prompt engineering. Azure Databricks is a data intelligence platform for large-scale data engineering, analytics, and traditional machine learning, built on a Lakehouse architecture.

Can Azure AI Foundry and Azure Databricks be used together?
Yes, they are designed to work synergistically. Azure AI Foundry can leverage a native connector to access real-time, governed data from Azure Databricks, allowing AI agents to be grounded in enterprise data for more accurate and contextual responses.

Which platform should I choose for training large machine learning models?
For training large-scale, traditional machine learning, and deep learning models, Azure Databricks is generally the preferred choice due to its robust capabilities for data processing, feature engineering, and ML lifecycle management (MLflow). Azure AI Foundry focuses more on the deployment and orchestration of pre-trained foundation models and generative AI applications.

Does Azure AI Foundry replace Azure Machine Learning or Databricks?
No, Azure AI Foundry complements these services. It provides a specialized environment for generative AI and agent development, often integrating with data and models managed by Azure Databricks or Azure Machine Learning for comprehensive AI solutions.

How do these platforms handle data governance?
Azure Databricks utilizes Unity Catalog for unified data and AI governance, providing centralized control over data access and lineage. Azure AI Foundry integrates with Azure-based security and compliance features, ensuring responsible AI practices and data privacy within its generative AI applications.

Preparing for Azure PostgreSQL Certificate Authority Rotation: A Comprehensive Operational Guide
The Challenge

It started with a standard notification in the Azure Portal: Tracking-ID YK3N-7RZ. A routine Certificate Authority (CA) rotation for Azure Database for PostgreSQL.

As Cloud Solution Architects, we’ve seen this scenario play out many times. The moment “certificate rotation” is mentioned, a wave of unease ripples through engineering teams. Let’s be honest: for many of us, ourselves included, certificates represent the edge of our technical “comfort zone.” We know they are critical for security, but the complexity of PKI chains, trust stores, and SSL handshakes can be intimidating. There is a silent fear: “If we touch this, will we break production?”

We realized we had a choice. We could treat this as an opportunity, and we could leave that comfort zone. We approached our customer with a proactive proposal: Let’s use this event to stop fearing certificates and start mastering them. Instead of just patching the immediate issue, we used this rotation as a catalyst to review and upgrade the security posture of their database connections. We wanted to move from “hoping it works” to “knowing it’s secure.”

The response was overwhelmingly positive. The teams didn’t just want a quick fix; they wanted “help for self-help.” They wanted to understand the mechanics behind sslmode and build the confidence to manage trust stores proactively.

This guide is the result of that journey. It is designed to help you navigate the upcoming rotation not with anxiety, but with competence, turning a mandatory maintenance window into a permanent security improvement.

Two Levels of Analysis

A certificate rotation affects your environment on two distinct levels, requiring different expertise and actions:

Level | Responsibility | Key Questions | Actions
Platform Level | Cloud/Platform Teams | Which clusters, services, and namespaces are affected? How do we detect at scale? | Azure Service Health monitoring, AKS scanning, infrastructure-wide assessment
Application Level | Application/Dev Teams | What SSL mode? Which trust store? How to update connection strings? | Code changes, dependency updates, trust store management

This article addresses both levels, providing platform-wide detection strategies (Section 5) and application-specific remediation guidance (Section 4, Platform-Specific Remediation).

Business Impact: In production environments, certificate validation failures cause complete database connection outages. A single missed certificate rotation has caused hours of downtime for enterprise customers, impacting revenue and customer trust.

Who’s Affected: DevOps engineers, SREs, database administrators, and platform engineers managing Azure PostgreSQL instances - especially those using:
- Java applications with custom JRE cacerts
- Containerized workloads with baked-in trust stores
- Strict SSL modes (sslmode=verify-full, verify-ca)

The Solution

What we’ll cover:
- 🛡️ Reliability: How to prevent database connection outages through proactive certificate management
- 🔄 Resiliency: Automation strategies that ensure your trust stores stay current
- 🔒 Security: Maintaining TLS security posture while rotating certificates safely

Key Takeaway: This rotation is a client trust topic, not a server change. Applications trusting root CAs (DigiCert Global Root G2, Microsoft RSA Root CA 2017) without intermediate pinning are unaffected. Risk concentrates where strict validation meets custom trust stores.

📦 Platform-Specific Implementation: Detailed remediation guides for Java, .NET, Python, Node.js, and Kubernetes are available in our GitHub Repository.

Note: The GitHub repository contains community-contributed content provided as-is. Test all scripts in non-production environments before use.

1. Understanding Certificate Authority Rotation

What Changes During CA Rotation?

Azure Database for PostgreSQL uses TLS/SSL to encrypt client-server connections.
The database server presents a certificate chain during the TLS handshake:

Certificate Chain Structure:

Figure: Certificate chain structure showing the rotation from old intermediate (red, deprecated) to new intermediate (blue, active after rotation). Client applications must trust the root certificates (green) to validate the chain.

📝 Diagram Source: The Mermaid source code for this diagram is available in certificate-chain-diagram.mmd.

Why Root Trust Matters

Key Principle: If your application trusts the root certificate and allows the chain to be validated dynamically, you are not affected.

The risk occurs when:
- Custom trust stores contain only the old intermediate certificate (not the root)
- Certificate pinning is implemented at the intermediate level
- Strict validation is enabled (sslmode=verify-full in PostgreSQL connection strings)

2. Who Is Affected and Why

Risk Assessment Matrix

Application Type | Trust Store | SSL Mode | Risk Level | Action Required
Cloud-native app (Azure SDK) | OS Trust Store | require | 🟢 Low | None - Azure SDK handles automatically
Java app (default JRE) | System cacerts | verify-ca | 🟡 Medium | Verify JRE version (11.0.16+, 17.0.4+, 8u381+)
Java app (custom cacerts) | Custom JKS file | verify-full | 🔴 High | Update custom trust store with new intermediate
.NET app (Windows) | Windows Cert Store | require | 🟢 Low | None - automatic via Windows Update
Python app (certifi) | certifi bundle | verify-ca | 🟡 Medium | Update certifi package (pip install --upgrade certifi)
Node.js app (default) | Built-in CAs | verify-ca | 🟢 Low | None - Node.js 16+, 18+, 20+ auto-updated
Container (Alpine) | /etc/ssl/certs | verify-full | 🔴 High | Update base image or install ca-certificates-bundle
Container (custom) | Baked-in certs | verify-full | 🔴 High | Rebuild image with updated trust store

How to Read This Matrix

Use the matrix above to quickly assess whether your applications are affected by CA rotation. The columns are read as follows:

Column | Meaning
Application Type | What kind of application do you have? (e.g., Java, .NET, Container)
Trust Store | Where does the application store its trusted certificates?
SSL Mode | How strictly does the application validate the server certificate?
Risk Level | 🟢 Low / 🟡 Medium / 🔴 High - How likely is a connection failure?
Action Required | What specific action do you need to take?

Risk Level Logic:

Risk Level | Why?
🟢 Low | Automatic updates (OS/Azure SDK) or no certificate validation
🟡 Medium | Manual update required but straightforward (e.g., pip install --upgrade certifi)
🔴 High | Custom trust store must be manually updated - highest outage risk

SSL Mode Security Posture

Understanding SSL modes is critical because they determine both security posture AND rotation impact. This creates a dual consideration:

SSL Mode | Certificate Validation | Rotation Impact | Security Level | Recommendation
disable | ❌ None | ✅ No impact | 🔴 INSECURE | Never use in production
allow | ❌ None | ✅ No impact | 🟠 WEAK | Not recommended
prefer | ❌ Optional | ✅ Minimal | 🟡 WEAK | Not recommended
require | ❌ No (Npgsql 6.0+) | ✅ No impact | 🟡 WEAK | Upgrade to verify-full
verify-ca | ✅ Chain only | 🔴 Critical | 🔵 MODERATE | Update trust stores
verify-full | ✅ Chain + hostname | 🔴 Critical | 🟢 SECURE | Recommended - Update trust stores

Key Insight: Applications using weak SSL modes (everything below verify-ca) are technically unaffected by CA rotation but represent security vulnerabilities. The safest path is verify-full with current trust stores.

⚖️ The Security vs. Resilience Trade-off

The Paradox: Secure applications (verify-full) have the highest rotation risk 🔴, while insecure applications (require) are unaffected but have security gaps.
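The matrix and risk level logic above can be condensed into a small decision function. The following Python sketch is an illustration only: `classify_rotation_risk`, `has_security_gap`, and the trust store category names are hypothetical and were distilled from the tables in this section, not taken from any official assessment tool.

```python
# sslmode values that validate the server certificate chain.
VALIDATING_MODES = {"verify-ca", "verify-full"}
# All libpq sslmode values, used for input checking.
KNOWN_MODES = {"disable", "allow", "prefer", "require"} | VALIDATING_MODES

# Hypothetical trust store categories, loosely following the matrix above.
AUTO_UPDATED = {"os", "azure-sdk"}            # refreshed by OS or SDK updates
PACKAGE_UPDATED = {"jre-default", "certifi"}  # one runtime/package update away
CUSTOM = {"custom-jks", "baked-in-image", "configmap"}  # updated by hand


def classify_rotation_risk(sslmode: str, trust_store: str) -> str:
    """Return 'low', 'medium' or 'high' outage risk for a CA rotation."""
    if sslmode not in KNOWN_MODES:
        raise ValueError(f"unknown sslmode: {sslmode!r}")
    if sslmode not in VALIDATING_MODES:
        # No chain validation, so a new intermediate cannot break the connection.
        return "low"
    if trust_store in AUTO_UPDATED:
        return "low"
    if trust_store in PACKAGE_UPDATED:
        return "medium"
    if trust_store in CUSTOM:
        return "high"
    raise ValueError(f"unknown trust store category: {trust_store!r}")


def has_security_gap(sslmode: str) -> bool:
    """True when the mode skips certificate validation (everything below verify-ca)."""
    if sslmode not in KNOWN_MODES:
        raise ValueError(f"unknown sslmode: {sslmode!r}")
    return sslmode not in VALIDATING_MODES


if __name__ == "__main__":
    for mode, store in [("require", "custom-jks"), ("verify-full", "custom-jks")]:
        print(mode, store, classify_rotation_risk(mode, store), has_security_gap(mode))
```

Keeping the two answers separate mirrors the paradox described above: a connection can be low risk for the rotation and still carry a security gap.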
Teams discovering weak SSL modes during rotation preparation face a critical decision:

Option | Approach | Rotation Impact | Security Impact | Recommended For
🚀 Quick Fix | Keep weak SSL mode (require) | ✅ No action needed | ⚠️ Security debt remains | Emergency situations only
🛡️ Proper Fix | Upgrade to verify-full | 🔴 Requires trust store updates | ✅ Improved security posture | All production systems

Our Recommendation: Use CA rotation events as an opportunity to improve your security posture. The effort to update trust stores is a one-time investment that pays off in long-term security.

Common Scenarios

Scenario 1: Enterprise Java Application
- Problem: Custom trust store created 2+ years ago for PCI compliance
- Risk: High - contains only old intermediate certificates
- Solution: Export new intermediate from Azure, import to custom cacerts

Scenario 2: Kubernetes Microservices
- Problem: Init container copies trust store from ConfigMap at startup
- Risk: High - ConfigMap never updated since initial deployment
- Solution: Update ConfigMap, redeploy pods with new trust store

Scenario 3: Legacy .NET Application
- Problem: .NET Framework 4.6 on Windows Server 2016 (no Windows Update)
- Risk: Medium - depends on manual certificate store updates
- Solution: Import new intermediate to Windows Certificate Store manually

3. Trust Store Overview

A trust store is the collection of root and intermediate CA certificates that your application uses to validate server certificates during TLS handshakes. Understanding where your application’s trust store is located determines how you’ll update it for CA rotations.

Trust Store Locations by Platform

Category | Platform | Trust Store Location | Update Method | Auto-Updated?
OS Level | Windows | Cert:\LocalMachine\Root | Windows Update | ✅ Yes
OS Level | Debian/Ubuntu | /etc/ssl/certs/ca-certificates.crt | apt upgrade ca-certificates | ✅ Yes (with updates)
OS Level | Red Hat/CentOS | /etc/pki/tls/certs/ca-bundle.crt | yum update ca-certificates | ✅ Yes (with updates)
Runtime Level | Java JRE | $JAVA_HOME/lib/security/cacerts | Java security updates | ✅ With JRE updates
Runtime Level | Python (certifi) | site-packages/certifi/cacert.pem | pip install --upgrade certifi | ❌ Manual
Runtime Level | Node.js | Bundled with runtime | Node.js version upgrade | ✅ With Node.js updates
Custom | Custom JKS | Application-specific path | keytool -importcert | ❌ Manual
Custom | Container image | /etc/ssl/certs (baked-in) | Rebuild container image | ❌ Manual
Custom | ConfigMap mount | Kubernetes ConfigMap | Update ConfigMap, redeploy | ❌ Manual

Why This Matters for CA Rotation

Applications using auto-updated trust stores (OS-managed, current runtime versions) generally handle CA rotations automatically. The risk concentrates in:
- Custom trust stores created for compliance requirements (PCI-DSS, SOC 2) that are rarely updated
- Baked-in container certificates from images built months or years ago
- Outdated runtimes (old JRE versions, frozen Python environments) that haven’t received security updates
- Air-gapped environments where automatic updates are disabled

When planning for CA rotation, focus your assessment efforts on applications in the “Manual” update category.

4. Platform-Specific Remediation

📦 Detailed implementation guides are available in our GitHub repository: azure-certificate-rotation-guide

Quick Reference: Remediation by Platform

Platform | Trust Store Location | Update Method | Guide
Java | $JAVA_HOME/lib/security/cacerts | Update JRE or manual keytool import | java-cacerts.md
.NET (Windows) | Windows Certificate Store | Windows Update (automatic) | dotnet-windows.md
Python | certifi package | pip install --upgrade certifi | python-certifi.md
Node.js | Built-in CA bundle | Update Node.js version | nodejs.md
Containers | Base image /etc/ssl/certs | Rebuild image or ConfigMap | containers-kubernetes.md

Scripts & Automation

Script | Purpose | Download | State
Scan-AKS-TrustStores.ps1 | Scan all pods in AKS for trust store configurations | PowerShell | tested
validate-connection.sh | Test PostgreSQL connection with SSL validation | Bash | not tested
update-cacerts.sh | Update Java cacerts with new intermediate | Bash | not tested

5. Proactive Detection Strategies

Database-Level Discovery: Identifying Connected Clients

One starting point for impact assessment is querying the PostgreSQL database itself to identify which applications are connecting. We developed a SQL query that joins pg_stat_ssl with pg_stat_activity to reveal active TLS connections, their SSL version, and cipher suites.
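As a rough illustration of the join described above, the following Python sketch holds one possible form of such a query plus a small helper that groups result rows into an inventory. The query text is an assumption and not the detect-clients.sql script from the repository; the column names come from the standard pg_stat_ssl and pg_stat_activity views, and `summarize_clients` is a hypothetical helper. Executing the query would require a PostgreSQL driver, which is deliberately left out.

```python
from collections import defaultdict

# Assumed form of a detection query (not the script from the repository).
# pg_stat_ssl and pg_stat_activity are joined on the backend process ID.
DETECT_CLIENTS_SQL = """
SELECT a.application_name,
       a.client_addr,
       a.usename,
       s.ssl,
       s.version AS tls_version,
       s.cipher
FROM pg_stat_ssl AS s
JOIN pg_stat_activity AS a ON a.pid = s.pid
WHERE a.client_addr IS NOT NULL
ORDER BY a.application_name, a.client_addr;
"""


def summarize_clients(rows):
    """Group rows (dicts keyed by the column names above) by
    (application_name, client_addr) and count connections per group."""
    inventory = defaultdict(
        lambda: {"connections": 0, "tls_versions": set(), "encrypted": True}
    )
    for row in rows:
        # Connections without application_name are grouped under a placeholder.
        key = (row["application_name"] or "<unnamed>", str(row["client_addr"]))
        entry = inventory[key]
        entry["connections"] += 1
        if row["ssl"]:
            entry["tls_versions"].add(row["tls_version"])
        else:
            entry["encrypted"] = False
    return dict(inventory)


if __name__ == "__main__":
    sample_rows = [
        {"application_name": "billing-api", "client_addr": "10.0.0.4",
         "usename": "app", "ssl": True, "tls_version": "TLSv1.3",
         "cipher": "TLS_AES_256_GCM_SHA384"},
    ]
    for key, info in summarize_clients(sample_rows).items():
        print(key, info)
```

Grouping by application_name and client_addr matches the inventory approach recommended for this query; entries under the `<unnamed>` placeholder point at clients that should set application_name in their connection strings.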
🔍 Get the SQL Query: Download the complete detection script from our GitHub repository: detect-clients.sql

Important Limitations

This query has significant constraints that you must understand before relying on it for CA rotation planning:

Limitation | Impact | Mitigation
Point-in-time snapshot | Only shows currently connected clients | Run query repeatedly over days/weeks to capture periodic jobs and batch processes
No certificate details | Cannot identify which CA certificate the client is using | Requires client-side investigation (trust store analysis)
Connection pooling | May show pooler instead of actual application | Use application_name in connection strings to identify true source
Idle connections | Long-running connections may be dormant | Cross-reference with application activity logs

Recommended approach: Use this query to create an initial inventory, then investigate each unique application_name and client_addr combination to determine their trust store configuration and SSL mode.

Proactive Monitoring with Azure Monitor

To detect certificate-related issues before and after CA rotation, configure Azure Monitor alerts. This enables early warning when SSL handshakes start failing.

Why this matters: After CA rotation, applications with outdated trust stores will fail to connect. An alert allows you to detect affected applications quickly rather than waiting for user reports.

Official Documentation: For complete guidance on creating and managing alerts, see Azure Monitor Alerts Overview and Create a Log Search Alert.

Here is a short example of an Azure Monitor Alert definition as a starting point.
{
  "alertRule": {
    "name": "PostgreSQL SSL Connection Failures",
    "severity": 2,
    "condition": {
      "query": "AzureDiagnostics | where ResourceType == 'SERVERS' and Category == 'PostgreSQLLogs' and Message contains 'SSL error' | summarize count() by bin(TimeGenerated, 5m)",
      "threshold": 5,
      "timeAggregation": "Total",
      "windowSize": "PT5M"
    }
  }
}

Alert Configuration Notes:

Setting | Recommended Value | Rationale
Severity | 2 (Warning) | Allows investigation without triggering critical incident response
Threshold | 5 failures/5min | Filters noise while catching genuine issues
Evaluation Period | 5 minutes | Balances responsiveness with alert fatigue
Action Group | Platform Team | Ensures quick triage and coordination

6. Production Validation

Pre-Rotation Validation Checklist

- Inventory all applications connecting to Azure PostgreSQL
- Identify trust store locations for each application
- Verify root certificate presence in trust stores
- Test connection with new intermediate in non-production environment
- Update monitoring alerts for SSL connection failures
- Prepare rollback plan if issues occur
- Schedule maintenance window (if required)
- Notify stakeholders of potential impact

Testing Procedure

We established a systematic 3-step validation process to ensure zero downtime. This approach moves from isolated testing to gradual production rollout.

🧪 Technical Validation Guide: For the complete list of psql commands, connection string examples for Windows/Linux, and automated testing scripts, please refer to our Validation Guide in the GitHub repository.

Connection Testing Strategy

The core of our validation strategy was testing connections with explicit sslmode settings. We used the psql command-line tool to simulate different client behaviors.
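To make the sslmode-based tests concrete, here is a minimal Python sketch that assembles libpq-style connection strings for the three modes tested and the matching psql invocation. The helper names, host, database, user, and certificate path are placeholders, not values from the Validation Guide; sslmode and sslrootcert are standard libpq connection keywords. The sketch only builds the command and does not connect to anything.

```python
TESTED_MODES = ("require", "verify-ca", "verify-full")


def build_conninfo(host, dbname, user, sslmode, sslrootcert=None, port=5432):
    """Build a libpq keyword/value connection string for one test scenario."""
    if sslmode not in TESTED_MODES:
        raise ValueError(f"sslmode not covered by this test plan: {sslmode!r}")
    values = {"host": host, "port": str(port), "dbname": dbname,
              "user": user, "sslmode": sslmode}
    if sslrootcert is not None:
        # File holding the root CA certificates the client should trust.
        values["sslrootcert"] = sslrootcert
    for key, value in values.items():
        # Simplification: values with whitespace would need libpq quoting.
        if not value or any(ch.isspace() for ch in value):
            raise ValueError(f"unsupported value for {key}: {value!r}")
    return " ".join(f"{key}={value}" for key, value in values.items())


def psql_command(conninfo):
    """Argument list for a psql call that runs a trivial query."""
    return ["psql", conninfo, "-c", "SELECT 1;"]


if __name__ == "__main__":
    for mode in TESTED_MODES:
        conninfo = build_conninfo(
            host="example-server.postgres.database.azure.com",  # placeholder
            dbname="appdb",                                      # placeholder
            user="appuser",                                      # placeholder
            sslmode=mode,
            sslrootcert=None if mode == "require" else "/tmp/root-ca.pem",
        )
        print(psql_command(conninfo))
```

The argument list can be handed to `subprocess.run` in a test environment; the password is expected to come from the usual libpq sources such as PGPASSWORD or a .pgpass file rather than from the connection string.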
| Test Scenario | Purpose | Expected Result |
| --- | --- | --- |
| Encryption only (sslmode=require) | Verify basic connectivity | Connection succeeds even with unknown CA |
| CA validation (sslmode=verify-ca) | Verify trust store integrity | Connection succeeds only if CA chain is valid |
| Full validation (sslmode=verify-full) | Verify strict security compliance | Connection succeeds only if CA chain AND hostname match |

Pro Tip: Test with verify-full and an explicit root CA file containing the new Microsoft/DigiCert root certificates before the rotation date. This validates that your trust stores will work after the intermediate certificate changes.

Step 1: Test in Non-Production

Validate connections against a test server using the new intermediate certificate (Azure provides test endpoints during the rotation window).

Step 2: Canary Deployment

Deploy the updated trust store to a single “canary” instance or pod. Monitor:

- Connection success rate
- Error logs
- Response times

Step 3: Gradual Rollout

Once the canary is stable, proceed with a phased rollout:

1. Update 10% of pods
2. Monitor for 1 hour
3. Update 50% of pods
4. Monitor for 1 hour
5. Complete rollout

7. Best Practices and Lessons Learned

Certificate Management Best Practices

| Practice | Guidance | Example |
| --- | --- | --- |
| Trust Root CAs, Not Intermediates | Configure trust stores with root CA certificates only. This provides resilience against intermediate certificate rotations. | Trust Microsoft TLS RSA Root G2 and DigiCert Global Root G2 instead of specific intermediates |
| Automate Trust Store Updates | Use OS-provided trust stores when possible (automatically updated). For custom trust stores, implement CI/CD pipelines. | Schedule bi-annual trust store audits |
| Use SSL Mode Appropriately | Choose SSL mode based on security requirements. verify-ca is recommended for most scenarios. | See Security Posture Matrix in Section 2 |
| Maintain Container Images | Rebuild container images monthly to include latest CA certificates. Use init containers for runtime updates. | Multi-stage builds with CA certificate update step |
| Avoid Certificate Pinning | Never pin intermediate certificates. If pinning is required for compliance, implement automated update processes. | Pin only root CA certificates if absolutely necessary |

SSL Mode Decision Guide

| SSL Mode | Security Level | Resilience | When to Use |
| --- | --- | --- | --- |
| require | Medium | High | Encrypted traffic without certificate validation. Use when CA rotation resilience is more important than MITM protection. |
| verify-ca | High | Medium | Validates certificate chain. Recommended for most production scenarios. |
| verify-full | Highest | Low | Strictest validation with hostname matching. Use only when compliance requires it. |

Organizational Communication Model

Effective certificate rotation requires structured communication across multiple layers:

| Layer | Responsibility | Key Action |
| --- | --- | --- |
| Azure Service Health | Microsoft publishes announcements to affected subscriptions | Monitor Azure Service Health proactively |
| Platform/Cloud Team | Receives Azure announcements, triages criticality | Follow ITSM processes, assess impact |
| Application Teams | Execute application-level changes | Update trust stores, validate connections |
| Security Teams | Define certificate validation policies | Set compliance requirements |

Ownership and Responsibility Matrix

| Team | Responsibility | Deliverable |
| --- | --- | --- |
| Platform/Cloud Team | Monitor Azure Service Health, coordinate response | Impact assessment, team notifications |
| Application Teams | Application-level changes (connection strings, trust stores) | Updated configurations, validation results |
| Security Teams | Define certificate policies, compliance requirements | Policy documentation, audit reports |
| All Teams (Shared) | Certificate lifecycle collaboration | Playbooks, escalation paths, training |

Certificate Rotation Playbook Components

Organizations should establish documented playbooks including:

| Component | Recommended Frequency | Purpose |
| --- | --- | --- |
| Trust Store Audits | Bi-annual (every 6 months) | Ensure certificates are current |
| Certificate Inventory | Quarterly review | Know what certificates exist where |
| Playbook Updates | Annual or after incidents | Keep procedures current |
| Team Training | Annual | Build knowledge and confidence |

Field Observations: Common Configuration Patterns

| Pattern | Observation | Risk |
| --- | --- | --- |
| Implicit SSL Mode | Teams don’t explicitly set sslmode, relying on framework defaults | Unexpected behavior during CA rotation |
| Copy-Paste Configurations | Connection strings copied without understanding options | Works until certificate changes expose gaps |
| Framework-Specific Defaults | Java uses JRE trust store, .NET uses Windows Certificate Store, Python depends on certifi package | Some require manual updates, some are automatic |

Framework Trust Store Defaults

| Framework | Default Trust Store | Update Method | Risk Level |
| --- | --- | --- | --- |
| Java/Quarkus | JRE cacerts | Manual or JRE update | Medium - requires awareness |
| .NET | Windows Certificate Store | Windows Update | Low - automatic |
| Node.js | Bundled certificates | Node.js version update | Low - automatic |
| Python | certifi package | pip install --upgrade certifi | High - manual intervention required |

Knowledge and Confidence Challenges

| Challenge | Impact | Mitigation |
| --- | --- | --- |
| Limited certificate knowledge | Creates uncertainty and risk-averse behavior | Proactive education, hands-on workshops |
| Topic intimidation | “Certificates” can seem complex, leading to avoidance | Reality: Implementation is straightforward once understood |
| Previous negative experiences | Leadership concerns based on past incidents | Document successes, share lessons learned |
| Visibility gaps | Lack of visibility into application dependencies | Maintain certificate inventory, use discovery tools |

Monitoring Strategy (Recommended for Post-Rotation)

While pre-rotation monitoring focuses on inventory, post-rotation monitoring should track:

Key Metrics:

- Connection failure rates (group by application, SSL error types)
- SSL handshake duration (detect performance degradation)
- Certificate validation errors (track which certificates fail)
- Application error logs (filter for “SSL”, “certificate”, “trust”)

Recommended Alerts:

- Threshold: >5 SSL connection failures in 5 minutes
- Anomaly detection: Connection failure rate increases >50%
- Certificate expiry warnings: 30, 14, 7 days before expiration

Dashboard Components:

- Connection success rate by application
- SSL error distribution (validation failures, expired certificates, etc.)
- Certificate inventory with expiry dates
- Trust store update status across infrastructure

These metrics, alerts, and thresholds are only starting points; adjust them to your environment and needs.

Post-Rotation Validation and Telemetry

Note: This article focuses on preparation for upcoming certificate rotations. Post-rotation metrics and incident data will be collected after the rotation completes and can inform future iterations of this guidance.

Recommended Post-Rotation Activities

The following post-rotation activities can show how effective the preparation was.

Incident Tracking: After rotation completes, organizations should track:

- Production incidents related to SSL/TLS connection failures
- Services affected and their business criticality
- Mean Time to Detection (MTTD) for certificate-related issues
- Mean Time to Resolution (MTTR) from detection to fix

Success Metrics to Measure

Pre-Rotation Validation:

- Number of services inventoried and assessed
- Percentage of services requiring trust store updates
- Testing coverage (dev, staging, production)

Post-Rotation Outcomes:

- Zero-downtime success rate (percentage of services with no impact)
- Applications requiring emergency patching
- Time from rotation to full validation

Impact Assessment

Telemetry to Collect:

- Total connection attempts vs. failures (before and after rotation)
- Duration of any service degradation or outages
- Customer-facing impact (user-reported issues, support tickets)
- Geographic or subscription-specific patterns

Continuous Improvement

Post-Rotation Review:

- What worked well in the preparation phase?
- Which teams or applications were unprepared?
- What gaps exist in monitoring or alerting?
- How can communication be improved for future rotations?

Documentation Updates:

- Update playbooks with lessons learned
- Refine monitoring queries based on observed patterns
- Enhance team training materials
- Share anonymized case studies across the organization

8. Engagement & Next Steps

Discussion Questions

We’d love to hear from the community:

- What’s your experience with certificate rotations? Have you encountered unexpected connection failures during CA rotation events?
- Which trust store update method works best for your environment? OS-managed, runtime-bundled, or custom trust stores?
- How do you handle certificate management in air-gapped environments? What strategies have worked for your organization?

Share Your Experience

If you’ve implemented proactive certificate management strategies or have lessons learned from CA rotation incidents, we encourage you to:

- Comment below with your experiences and tips
- Contribute to the GitHub repository with additional platform guides or scripts
- Connect with us on LinkedIn to continue the conversation

Call to Action

Take these steps now to prepare for the CA rotation:

1. Assess your applications - Use the Risk Assessment Matrix (Section 2) to identify which applications use sslmode=verify-ca or verify-full with custom trust stores
2. Import root CA certificates - Add DigiCert Global Root G2 and Microsoft RSA Root CA 2017 to your trust stores
3. Upgrade SSL mode - Change your connection strings to at least sslmode=verify-ca (recommended: verify-full) for improved security
4. Document your changes - Record which applications were updated, what trust stores were modified, and the validation results
5. Automate for the future - Implement proactive certificate management so future CA rotations are handled automatically (OS-managed trust stores, CI/CD pipelines for container images, scheduled trust store audits)

9. Resources

Official Documentation

Azure PostgreSQL:

- Azure PostgreSQL SSL/TLS Concepts
- Azure PostgreSQL - Connect with TLS/SSL

PostgreSQL & libpq:

- PostgreSQL libpq SSL Support - SSL mode options and environment variables
- PostgreSQL psql Reference - Command-line tool documentation
- PostgreSQL Server SSL/TLS Configuration

Certificate Authorities:

- DigiCert Root Certificates
- Microsoft PKI Repository
- Microsoft Trusted Root Program

Community Resources

- Let’s Encrypt Root Expiration (2021 Incident)
- NIST SP 800-57: Key Management Guidelines
- OWASP Certificate Pinning Cheat Sheet
- Neon Blog: PostgreSQL Connection Security Defaults

Tools and Scripts

- PowerShell AKS Trust Store Scanner (see Platform-Specific Remediation)
- PostgreSQL Interactive Terminal (psql)
- PostgreSQL JDBC SSL Documentation

Industry Context

Certificate rotation challenges are not unique to Azure PostgreSQL. Similar incidents have occurred across the industry:

Historical Incidents:

- Let’s Encrypt Root Expiration (2021): Widespread impact when DST Root CA X3 expired, affecting older Android devices and legacy systems
- DigiCert Root Transitions: Multiple cloud providers experienced customer impact during CA changes
- Internal PKI Rotations: Enterprises face similar challenges when rotating internally-issued certificates

Relevant Standards:

- NIST SP 800-57: Key Management Guidelines (certificate lifecycle best practices)
- OWASP Certificate Pinning: Guidance on balancing security and operational resilience
- CIS Benchmarks: Recommendations for TLS/SSL configuration in cloud environments

Authors

| Author | Role | Contact |
| --- | --- | --- |
| Andreas Semmelmann | Cloud Solution Architect, Microsoft | LinkedIn |
| Mpho Muthige | Cloud Solution Architect, Microsoft | LinkedIn |

Disclaimers

Disclaimer: The information in this blog post is provided for general informational purposes only and does not constitute legal, financial, or professional advice. While every effort has been made to ensure the accuracy of the information at the time of publication, Microsoft makes no warranties or representations as to its completeness or accuracy. Product features, availability, and timelines are subject to change without notice. For specific guidance, please consult your legal or compliance advisor.

Microsoft Support Statement: This article represents field experiences and community best practices. For official Microsoft support and SLA-backed guidance:

- Azure Support: https://azure.microsoft.com/support/
- Official Documentation: https://learn.microsoft.com/azure/
- Microsoft Q&A: https://learn.microsoft.com/answers/
- Production Issues: Always open official support tickets for production-impacting problems.

Customer Privacy Notice: This article describes real-world scenarios from customer engagements. All customer-specific information has been anonymized. No NDAs or customer confidentiality agreements were violated in creating this content.

AI-generated content disclaimer: This content was generated in whole or in part with the assistance of AI tools. AI-generated content may be incorrect or incomplete. Please review and verify before relying on it for critical decisions. See terms

Community Contribution: The GitHub repository referenced in this article contains community-contributed scripts and guides. These are provided as-is for educational purposes and should be tested in non-production environments before use.

Tags: #AzurePostgreSQL #CertificateRotation #TLS #SSL #TrustStores #Operations #DevOps #SRE #CloudSecurity #AzureDatabase

This week's Fabric Engineering Connection calls - Holiday Cheer Edition
🎉 Holiday Cheer Alert! 🎶 Ready to jingle all the way with the Fabric Partner Community? Join us for our Fabric Engineering Connection - Holiday Cheer Edition!

The festivities kick off with a “Name That Tune: Holiday Edition” game, where your competitive spirit could win you fabulous prizes! Bring your brightest “Ho Ho Ho,” your silliest sparkle, and get ready to sleigh the season with us. 🎅✨

Stick around for inspiring presentations from our guest speakers:

🦌 Nellie Gustafsson, Principal PM Manager, with updates on Data Science, AI, and Data Agents (Americas & EMEA call only)
🦌 Shireen Bahadur, Senior Program Manager, and Ajay Jagannathan, Principal Group PM Manager, sharing “What’s New in Database Mirroring”

📅 Americas & EMEA: Wednesday, December 17, 8–9 am PT
📅 APAC: Thursday, December 18, 1–2 am UTC

Show starts on the hour. Enthusiasm mandatory, jingle optional! 🔔

To join, become a member of the Fabric Partner Community Teams Channel (if you are not already): https://lnkd.in/g_PRdfjt

Let’s deck the halls, spread some cheer, and make this celebration one to remember!

Join the Fabric Partner Community for this Week's Fabric Engineering Connection calls!
Are you a Microsoft partner that is interested in data and analytics? After a two-week break for Ignite and Thanksgiving, be sure to join us for this week's Fabric Engineering Connection calls!

Yitzhak Kesselman will be presenting on the brand-new Microsoft Fabric IQ, just announced during Ignite, followed by Shuaijun Ye with Environment Best Practices.

The Americas & EMEA call will take place Wednesday, December 3, from 8-9 am PT and the APAC call is Thursday, December 4, from 1-2 am UTC/Wednesday, December 3, from 5-6 pm PT.

This is your opportunity to learn more, ask questions, and provide feedback. To participate in the call, you must be a member of the Fabric Partner Community Teams channel. To join, complete the participation form at https://aka.ms/JoinFabricPartnerCommunity. We look forward to seeing you at the calls!

Join the Fabric Partner Community for an AMA with Kim Manis
🌟 Exciting News for the Fabric Partner Community! 🌟

Join us for a special AMA (Ask Me Anything) session with the Fabric Leadership Team, featuring Kim Manis, CVP of Product Management for Microsoft Fabric Platform! This is your chance to connect directly with the leaders shaping the future of Microsoft Fabric, ask your platform-related questions, provide feedback, and gain insights into upcoming innovations.

🗓️ Date: Thursday, December 4
⏰ Time: 8–9 am PT
🔗 Submit or upvote your questions now: https://lnkd.in/gKNBvMKq

To participate, make sure you’re a member of the Fabric Partner Community Teams Channel. If you haven’t joined yet, sign up here: https://lnkd.in/g_PRdfjt

Don’t miss this opportunity to engage with the Fabric Leadership Team and help shape the future of data platforms!

Join the Fabric Partner Community for this Week's Fabric Engineering Connection calls!
Are you a Microsoft partner that is interested in data and analytics? Be sure to join us for the next Fabric Engineering Connection calls! 🎉

The Americas & EMEA call will take place Wednesday, October 22, from 8-9 am PT and will feature presentations from Teddy Bercovitz and Gerd Saurer on Fabric Extend Workload Developer Kit, followed by a presentation on Data Protection Capabilities from Yael Biss.

The APAC call is Thursday, October 23, from 1-2 am UTC/Wednesday, October 22, from 5-6 pm PT. Tamer Farag, Trilok Rajesh and Shreya Ghosh will be presenting on Modernizing Legacy Analytics & BI Platforms.

This is your opportunity to learn more, ask questions, and provide feedback. To join the call, you must be a member of the Fabric Partner Community Teams channel. To join, complete the participation form at https://aka.ms/JoinFabricPartnerCommunity. We look forward to seeing you later this week!

Join the Fabric Partner Community for this Week's Fabric Engineering Connection calls!
Are you a Microsoft partner that is interested in data and analytics? Be sure to join us for the next Fabric Engineering Connection calls! 🎉

Sujata Narayana will be sharing a recap of Power BI announcements from FabCon Europe, followed by the latest updates on AI Functions from Virginia Roman.

The Americas & EMEA call will take place Wednesday, October 15, from 8-9 am PT and the APAC call is Thursday, October 16, from 1-2 am UTC/Wednesday, October 15, from 5-6 pm PT.

This is your opportunity to learn more, ask questions, and provide feedback. To join the call, you must be a member of the Fabric Partner Community Teams channel. To join, complete the participation form at https://aka.ms/JoinFabricPartnerCommunity. We look forward to seeing you later this week!

Diagnose performance issues in Spark jobs through Spark UI.
Agenda

- Introduction
- Overview of Spark UI
- Navigating to Spark UI
- Jobs Timeline
- Opening Jobs Timeline
- Reading Event Timeline
- Failing Jobs or Executors
- Diagnosing Failing Jobs
- Diagnosing Failing Executors
- Scenario - Memory Issues
- Scenario - Long Running Jobs
- Scenario - Identifying Longest Stage

Introduction

Diagnosing job performance issues using the Spark UI

This guide walks you through how to use the Spark UI to diagnose performance issues.

Overview of Spark UI

Job Composition
- A job is composed of multiple stages
- Stages may contain more than one task

Task Breakdown
- Tasks are distributed across executors

Navigating to Spark UI

- Navigate to your cluster’s page.
- Click Spark UI.

Jobs Timeline

The jobs timeline is a great starting point for understanding your pipeline or query. It gives you an overview of what was running, how long each step took, and whether there were any failures along the way.

Opening Jobs Timeline

Accessing the Jobs Timeline
- Navigate to the Spark UI
- Click on the Jobs tab

Viewing the Event Timeline
- Click on Event Timeline (highlighted in red in the screenshot)

Example Timeline
- Shows driver and executor 0 being added

Failing Jobs or Executors: Example of Failed Job

Failed Job Example
- Indicated by a red status
- Shown in the event timeline

Removed Executors
- Also indicated by a red status
- Shown in the event timeline

Failing Jobs or Executors: Common Reasons for Executors Being Removed

Autoscaling
- Expected behavior, not an error
- See Enable autoscaling for more details: Compute configuration reference - Azure Databricks | Microsoft Learn

Spot instance losses
- Cloud provider reclaiming your VMs
- Learn more about Spot instances here

Executors running out of memory

Diagnosing Failing Jobs: Steps to Diagnose Failing Jobs

Identifying Failing Jobs
- Click on the failing job to access its page

Reviewing Failure Details
- Scroll down to see the failed stage
- Check the failure reason

Diagnosing Failing Jobs: Generic Errors

You may get a generic error. Click on the link in the description to see if you can get more info.

Diagnosing Failing Jobs: Memory Issues

Task Failure Explanation
- Scroll down the page to see why each task failed
- Memory issue identified as the cause

Scenario - Spot Instance, Auto-scaling

Diagnosing Failing Executors: Checking Event Log

Check Event Log
- Identify any explanations for executor failures

Spot Instances
- Cloud provider may reclaim spot instances

Diagnosing Failing Executors: Navigating to Executors Tab

Check Event Log for Executor Loss
- Look for messages indicating cluster resizing or spot instance loss

Navigate to Spark UI
- Click on the Executors tab

Diagnosing Failing Executors: Getting Logs from Failed Executors

Here you can get the logs from the failed executors.

Scenario - Memory Issues

Memory Issues
- Common cause of problems
- Requires thorough investigation

Quality of Code
- Potential source of memory issues
- Needs to be checked for efficiency

Data Quality
- Can affect memory usage
- Must be organized correctly

Spark memory issues - Azure Databricks | Microsoft Learn

Identifying Longest Stage

Identify the longest stage of the job
- Scroll to the bottom of the job’s page
- Locate the list of stages
- Order the stages by duration

Stage I/O Details

High-Level Data Overview
- Input
- Output
- Shuffle Read
- Shuffle Write

Number of Tasks in Long Stage

- Identifying the number of tasks helps in pinpointing the issue
- Look at the specified location to determine the number of tasks

Investigating Stage Details

Investigate Further if Multiple Tasks
- Check if the stage has more than one task
- Click on the link in the stage’s description

Get More Info About Longest Stage
- Click on the link provided
- Gather detailed information

Conclusion

Potential Data Skew Issues
- Data skew can impact performance
- May cause uneven distribution of data

Spelling Errors in Data
- Incorrect spelling can affect data processing
- Ensure data accuracy for optimal performance

Learn More
- See Skew and spill - Azure Databricks | Microsoft Learn

Join the Fabric Partner Community for this Week's Fabric Engineering Connection calls!
Are you a Microsoft partner that is interested in data and analytics? Be sure to join us for the next Fabric Engineering Connection calls! 🎉

Tom Peplow will be discussing OneLake Diagnostics and Sarab Dua will be joining to cover recent releases and roadmap for network security. Both promise to be presentations you won't want to miss!

The Americas & EMEA call will take place Wednesday, October 8, from 8-9 am PT and the APAC call is Thursday, October 9, from 1-2 am UTC/Wednesday, October 8, from 5-6 pm PT.

This is your opportunity to learn more, ask questions, and provide feedback. To join the call, you must be a member of the Fabric Partner Community Teams channel. To join, complete the participation form at https://aka.ms/JoinFabricPartnerCommunity. We look forward to seeing you later this week!

Reducing SQL Connection Latency for Apps Using Azure AAD Authentication
Challenge: connection latency and token overhead

Consider a cloud-native application deployed in Azure App Service or Kubernetes (AKS) that needs to query an Azure SQL Database for real-time data. The application uses Azure Active Directory (AAD) for secure authentication, but every time the application establishes a new connection to the database, it requests a new AAD token. In high-traffic environments where thousands of requests are processed per second, this repetitive token issuance introduces latency and performance degradation.

This delay becomes particularly problematic for time-sensitive applications where every millisecond counts. Each token request adds to response times and consumes resources unnecessarily.

Solution: token caching and expiration management

To mitigate these delays, we can optimize the authentication process by caching the AAD token and reusing it for the duration of its validity (typically 1 hour to 24 hours). Instead of requesting a new token for every database connection, the token is fetched only when the existing one is near expiration. This approach eliminates the repeated authentication overhead and ensures that the application can maintain seamless connectivity to the database without the performance hit of generating a new token for each request.

In addition to reducing latency, this approach reduces the number of HTTP calls made to the Azure Active Directory service, resulting in better resource utilization and lower operational costs.

Concrete performance gains: optimized SQL client connection

As part of the mitigation, we provide a custom code implementation that uses SqlClient, a supported library, to optimize the connection time. The test ran against an S0 database: a single process, with connection pooling enabled, repeatedly opened a connection, executed SELECT 1, and closed the connection.

During the testing phase with a connection pooler script running for 96 hours (without the AAD token cache), the following results were observed:

- 10 connections took 1 second, representing 0.866% of total connections.
- 1 connection took 4 seconds, representing 0.0866%.
- 1,144 connections took less than 1 second, representing 99.05% of total connections.
- All executions of SELECT 1 were completed in 0 seconds.

These results demonstrate how caching AAD tokens and reusing them effectively reduced connection overhead and improved performance. None of the connections exceeded 5 seconds in duration, whereas with the default behavior connections were reaching 30 seconds or more, depending on the environment complexity.

Step-by-step implementation

Here’s a step-by-step guide on how to implement this solution using C# and the Microsoft.Data.SqlClient package to optimize SQL database connections:

1. Obtain and cache a token: Instead of requesting a new AAD token with every connection, we obtain a token once and cache it. This is done by leveraging Azure Managed Identity to authenticate the application, which eliminates the need to repeatedly authenticate with Azure Active Directory for every database connection. In this step, we fetch the token once and store it securely for reuse.
2. Renew the token only when it’s near expiry: We refresh the token only when it is nearing expiration or has already expired. The application checks the token’s expiration time before attempting to use it. If the token is still valid, it continues to be reused. If it's close to expiration, a new token is fetched.
3. Reuse a single token across multiple connections: The cached token can be used for multiple database connections during its lifetime. Rather than requesting a new token for each new connection, the application uses the same token across all connections until the token is about to expire.
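The three steps above can be sketched independently of any Azure SDK. The Python snippet below is a minimal illustration under stated assumptions, not the article's implementation: `Token`, `TokenCache`, and the `fetch` callable are hypothetical names, and `fetch` stands in for whatever call actually obtains the AAD token (for example, a managed identity request). The 300-second default margin mirrors the 5-minute buffer discussed in this article.

```python
import threading
import time
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Token:
    value: str
    expires_at: float  # Unix timestamp in seconds


class TokenCache:
    """Caches one token and refetches it only when it is close to expiry."""

    def __init__(self, fetch: Callable[[], Token],
                 refresh_margin: float = 300.0,
                 clock: Callable[[], float] = time.time):
        self._fetch = fetch            # hypothetical stand-in for the real token request
        self._margin = refresh_margin  # refresh this many seconds before expiry
        self._clock = clock            # injectable so the logic can be tested
        self._lock = threading.Lock()  # one refresh at a time across threads
        self._token: Optional[Token] = None

    def get(self) -> str:
        with self._lock:
            near_expiry = (
                self._token is None
                or self._clock() + self._margin >= self._token.expires_at
            )
            if near_expiry:
                self._token = self._fetch()
            return self._token.value


if __name__ == "__main__":
    calls = []

    def fake_fetch() -> Token:
        # Pretend token endpoint: each call issues a token valid for one hour.
        calls.append(1)
        return Token(f"token-{len(calls)}", time.time() + 3600)

    cache = TokenCache(fake_fetch)
    for _ in range(1000):
        cache.get()  # every simulated connection asks the cache
    print(f"token requests made: {len(calls)}")
```

Every connection attempt calls `get()`, but the expensive fetch happens only on the first call and again when the cached token enters the refresh margin. The lock is there because pooled connections are often opened from several threads at once.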
Code example: optimized SQL connection management

Here’s an example of how you can implement token caching in a C# application using Microsoft.Data.SqlClient.

    using System;
    using System.Diagnostics;
    using System.Threading;
    using Azure.Core;
    using Azure.Identity;
    using Microsoft.Data.SqlClient;

    namespace SqlConnectionOptimization
    {
        public class SqlConnectionManager
        {
            private string _accessToken;
            private DateTimeOffset _tokenExpiration;
            private readonly string _connectionString = "Server=tcp:servername.database.windows.net,1433;Initial Catalog=DBName;...";
            private readonly Stopwatch _stopwatch = new Stopwatch();

            public SqlConnectionManager()
            {
                _accessToken = string.Empty;
                _tokenExpiration = DateTimeOffset.UtcNow;
            }

            public void Run()
            {
                while (true)
                {
                    // Refresh the token only if it is missing or close to expiry
                    if (IsTokenExpired())
                    {
                        RefreshToken();
                    }

                    // Establish connection and perform operations
                    _stopwatch.Restart();
                    using (var connection = CreateConnection())
                    {
                        LogExecutionTime("Connected");
                        ExecuteQuery(connection);
                        LogExecutionTime("Query Executed");
                    }

                    // Simulate some idle time between operations
                    Log("Waiting before next operation...");
                    Thread.Sleep(1000);
                }
            }

            // Treat the token as expired 5 minutes early so it is never used right at its expiry
            private bool IsTokenExpired()
            {
                return string.IsNullOrEmpty(_accessToken) || DateTimeOffset.UtcNow.AddMinutes(5) >= _tokenExpiration;
            }

            private void RefreshToken()
            {
                _stopwatch.Restart();
                try
                {
                    var result = FetchAccessToken();
                    _accessToken = result.Token;
                    _tokenExpiration = result.Expiration;
                    LogExecutionTime("Token Refreshed");
                    Log($"Token expires at: {_tokenExpiration}");
                }
                catch (Exception ex)
                {
                    Log($"Error fetching token: {ex.Message}");
                }
            }

            private (string Token, DateTimeOffset Expiration) FetchAccessToken()
            {
                var managedIdentityCredential = new ManagedIdentityCredential();
                var tokenRequestContext = new TokenRequestContext(new[] { "https://database.windows.net/.default" });
                // Synchronous call; avoids blocking on .Result of a ValueTask
                var accessToken = managedIdentityCredential.GetToken(tokenRequestContext, CancellationToken.None);
                return (accessToken.Token, accessToken.ExpiresOn);
            }

            private SqlConnection CreateConnection()
            {
                var connection = new SqlConnection(_connectionString) { AccessToken = _accessToken };
                int retries = 0;
                while (true)
                {
                    try
                    {
                        connection.Open();
                        return connection;
                    }
                    catch (Exception ex)
                    {
                        retries++;
                        if (retries > 5)
                        {
                            Log($"Error connecting after multiple retries: {ex.Message}");
                            connection.Dispose();
                            throw;
                        }
                        Log($"Connection attempt failed. Retrying in {retries} seconds...");
                        Thread.Sleep(retries * 1000); // delay grows by 1 second per attempt
                    }
                }
            }

            private void ExecuteQuery(SqlConnection connection)
            {
                var query = "SELECT 1"; // Simple query, replace with real logic as needed
                int retries = 0;
                while (true)
                {
                    try
                    {
                        using (var command = new SqlCommand(query, connection))
                        {
                            command.CommandTimeout = 5; // Adjust timeout for more complex queries
                            command.ExecuteScalar();
                        }
                        return;
                    }
                    catch (Exception ex)
                    {
                        retries++;
                        if (retries > 5)
                        {
                            Log($"Max retries reached for query execution: {ex.Message}");
                            throw;
                        }
                        Log($"Query execution failed. Retrying in {retries} seconds...");
                        Thread.Sleep(retries * 1000);
                    }
                }
            }

            private void Log(string message)
            {
                Console.WriteLine($"{DateTime.Now:yyyy-MM-dd HH:mm:ss.fff}: {message}");
            }

            // Logs the time since the stopwatch was last restarted, then starts timing the next step
            private void LogExecutionTime(string action)
            {
                var elapsed = _stopwatch.Elapsed;
                Log($"{action} - Elapsed time: {elapsed:hh\\:mm\\:ss\\.fff}");
                _stopwatch.Restart();
            }

            public static void Main(string[] args)
            {
                var manager = new SqlConnectionManager();
                manager.Run();
            }
        }
    }

Key points in the code

- Token Expiration Check: The IsTokenExpired() method checks whether the token has expired by comparing its expiration time to the current time. We’ve added a 5-minute buffer before expiration. This can be adjusted based on your needs.
- Managed Identity Authentication: The application uses Azure Managed Identity to authenticate and fetch the token, ensuring secure and scalable access to Azure SQL Database without requiring client secrets.
- Retry Logic: In the event of a connection failure or query execution failure, the system retries up to five times, waiting one second longer before each attempt, making it resilient to transient network or authentication issues.

Conclusion

By implementing a token caching and expiration management strategy, applications can dramatically improve the performance and scalability of their database interactions, especially in environments with high request volumes. By leveraging Azure Managed Identity for secure, reusable tokens, you can reduce authentication latency and improve the overall efficiency of your SQL database connections.

This approach can also be adapted to any service using Azure SQL Database and Azure Active Directory for authentication.

Next steps

- Benchmarking: Test the implementation in your environment to quantify the performance gains.
- Error Handling: Extend the retry logic and error handling to better handle transient failures, especially in production environments. Resources: Introducing Configurable Retry Logic in Microsoft.Data.SqlClient v3.0.0-Preview1; Configurable retry logic in SqlClient; Troubleshoot transient connection errors
- Scaling: Consider how this strategy can be applied across multiple services in larger architectures. Consider reading and applying managed identity best practices. Resources: Managed identity best practice recommendations
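As one possible way to extend the retry logic, note that the C# example above sleeps `retries * 1000` milliseconds between attempts, so the delay grows linearly. The Python sketch below shows an exponential variant with a cap and optional jitter. It is an illustration only: `backoff_delays` and `retry` are hypothetical helper names, the base, cap, and jitter values are arbitrary assumptions, and production code should retry only on errors known to be transient rather than on every exception.

```python
import random
import time


def backoff_delays(max_retries, base=1.0, cap=30.0):
    """Delays in seconds before each retry: base, 2*base, 4*base, ... capped at cap."""
    return [min(cap, base * 2 ** attempt) for attempt in range(max_retries)]


def retry(operation, max_retries=5, base=1.0, cap=30.0, jitter=0.0, sleep=time.sleep):
    """Run operation(); on failure wait with exponential backoff and try again.

    Makes one initial attempt plus up to max_retries retries, then re-raises
    the last error. jitter adds up to that fraction of the delay at random,
    which spreads out retries from many clients failing at the same moment.
    """
    delays = backoff_delays(max_retries, base, cap)
    for attempt in range(max_retries + 1):
        try:
            return operation()
        except Exception:  # assumption: narrow this to transient errors in real code
            if attempt == max_retries:
                raise
            delay = delays[attempt]
            sleep(delay + random.uniform(0, jitter * delay))


if __name__ == "__main__":
    state = {"attempts": 0}

    def flaky_connect():
        # Simulated connection that fails twice before succeeding.
        state["attempts"] += 1
        if state["attempts"] < 3:
            raise ConnectionError("transient failure")
        return "connected"

    # Small base delay so the demo finishes quickly.
    print(retry(flaky_connect, base=0.01, jitter=0.5))
```

Compared with a linear delay, doubling backs off faster when a dependency stays unavailable, and the cap keeps the worst-case wait bounded.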