Azure AI Foundry vs. Azure Databricks – A Unified Approach to Enterprise Intelligence
Key Insights into Azure AI Foundry and Azure Databricks Complementary Powerhouses: Azure AI Foundry is purpose-built for generative AI application and agent development, focusing on model orchestration and rapid prototyping, while Azure Databricks excels in large-scale data engineering, analytics, and traditional machine learning, forming the data intelligence backbone. Seamless Integration for End-to-End AI: A critical native connector allows AI agents developed in Foundry to access real-time, governed data from Databricks, enabling contextual and data-grounded AI solutions. This integration facilitates a comprehensive AI lifecycle from data preparation to intelligent application deployment. Specialized Roles for Optimal Performance: Enterprises leverage Databricks for its robust data processing, lakehouse architecture, and ML model training capabilities, and then utilize AI Foundry for deploying sophisticated generative AI applications, agents, and managing their lifecycle, ensuring responsible AI practices and scalability. In the rapidly evolving landscape of artificial intelligence, organizations seek robust platforms that can not only handle vast amounts of data but also enable the creation and deployment of intelligent applications. Microsoft Azure offers two powerful, yet distinct, services in this domain: Azure AI Foundry and Azure Databricks. While both contribute to an organization's AI capabilities, they serve different primary functions and are designed to complement each other in building comprehensive, enterprise-grade AI solutions. Decoding the Core Purpose: Foundry for Generative AI, Databricks for Data Intelligence At its heart, the distinction between Azure AI Foundry and Azure Databricks lies in their core objectives and the types of workloads they are optimized for. Understanding these fundamental differences is crucial for strategic deployment and maximizing their combined potential. Azure AI Foundry: The Epicenter for Generative AI and Agents Azure AI Foundry emerges as Microsoft's unified platform specifically engineered for the development, deployment, and management of generative AI applications and AI agents. It represents a consolidation of capabilities from what were formerly Azure AI Studio and Azure OpenAI Studio. Its primary focus is on accelerating the entire lifecycle of generative AI, from initial prototyping to large-scale production deployments. Key Characteristics of Azure AI Foundry: Generative AI Focus: Foundry streamlines the development of large language models (LLMs) and customized generative AI applications, including chatbots and conversational AI. It emphasizes prompt engineering, Retrieval-Augmented Generation (RAG), and agent orchestration. Extensive Model Catalog: It provides access to a vast catalog of over 11,000 foundation models from various publishers, including OpenAI, Meta (Llama 4), Mistral, and others. These models can be deployed via managed compute or serverless API deployments, offering flexibility and choice. Agentic Development: A significant strength of Foundry is its support for building sophisticated AI agents. This includes tools for grounding agents with knowledge, tool calling, comprehensive evaluations, tracing, monitoring, and guardrails to ensure responsible AI practices. Foundry Local further extends this by allowing offline and on-device development. 
Unified Development Environment: It offers a single management grouping for agents, models, and tools, promoting efficient development and consistent governance across AI projects. Enterprise Readiness: Built-in capabilities such as Role-Based Access Control (RBAC), observability, content safety, and project isolation ensure that AI applications are secure, compliant, and scalable for enterprise use. Figure 1: Conceptual Architecture of Azure AI Foundry illustrating its various components for AI development and deployment. Azure Databricks: The Powerhouse for Data Engineering, Analytics, and Machine Learning Azure Databricks, on the other hand, is an Apache Spark-based data intelligence platform optimized for large-scale data engineering, analytics, and traditional machine learning workloads. It acts as a collaborative workspace for data scientists, data engineers, and ML engineers to process, analyze, and transform massive datasets, and to build and deploy diverse ML models. Key Characteristics of Azure Databricks: Unified Data Analytics Platform: Central to Databricks is its lakehouse architecture, built on Delta Lake, which unifies data warehousing and data lakes. This provides a single platform for data engineering, SQL analytics, and machine learning. Big Data Processing: Excelling in distributed computing, Databricks is ideal for processing large datasets, performing ETL (Extract, Transform, Load) operations, and real-time analytics at scale. Comprehensive ML and AI Workflows: It offers a specialized environment for the full ML lifecycle, including data preparation, feature engineering, model training (both classic and deep learning), and model serving. Tools like MLflow are integrated for tracking, evaluating, and monitoring ML models. Data Intelligence Features: Databricks includes AI-assistive features such as Databricks Assistant and Databricks AI/BI Genie, which enable users to interact with their data using natural language queries to derive insights. Unified Governance with Unity Catalog: Unity Catalog provides a centralized governance solution for all data and AI assets within the lakehouse, ensuring data security, lineage tracking, and access control. Figure 2: The Databricks Data Intelligence Platform with its unified approach to data, analytics, and AI. The Symbiotic Relationship: Integration and Complementary Use Cases While distinct in their primary functions, Azure AI Foundry and Azure Databricks are explicitly designed to work together, forming a powerful, integrated ecosystem for end-to-end AI development and deployment. This synergy is key to building advanced, data-driven AI solutions in the enterprise. Seamless Integration for Enhanced AI Capabilities The integration between the two platforms is a cornerstone of Microsoft's AI strategy, enabling AI agents and generative applications to be grounded in high-quality, governed enterprise data. Key Integration Points: Native Databricks Connector in AI Foundry: A significant development in 2025 is the public preview of a native connector that allows AI agents built in Azure AI Foundry to directly query real-time, governed data from Azure Databricks. This means Foundry agents can leverage Databricks AI/BI Genie to surface data insights and even trigger Databricks Jobs, providing highly contextual and domain-aware responses. 
Data Grounding for AI Agents: This integration enables AI agents to access structured and unstructured data processed and stored in Databricks, providing the necessary context and knowledge base for more accurate and relevant generative AI outputs. All interactions are auditable within Databricks, maintaining governance and security. Model Crossover and Availability: Foundation models, such as the Llama 4 family, are made available across both platforms. Databricks DBRX models can also appear in the Foundry model catalog, allowing flexibility in where models are trained, deployed, and consumed. Unified Identity and Governance: Both platforms leverage Azure Entra ID for authentication and access control, and Unity Catalog provides unified governance for data and AI assets managed by Databricks, which can then be respected by Foundry agents. Here's a breakdown of how a typical flow might look: Mindmap 1: Illustrates the complementary roles and integration points between Azure Databricks and Azure AI Foundry within an end-to-end AI solution. When to Use Which (and When to Use Both) Choosing between Azure AI Foundry and Azure Databricks, or deciding when to combine them, depends on the specific requirements of your AI project: Choose Azure AI Foundry When You Need To: Build and deploy production-grade generative AI applications and multi-agent systems. Access, evaluate, and benchmark a wide array of foundation models from various providers. Develop AI agents with sophisticated capabilities like tool calling, RAG, and contextual understanding. Implement enterprise-grade guardrails, tracing, monitoring, and content safety for AI applications. Rapidly prototype and iterate on generative AI solutions, including chatbots and copilots. Integrate AI agents deeply with Microsoft 365 and Copilot Studio. Choose Azure Databricks When You Need To: Perform large-scale data engineering, ETL, and data warehousing on a unified lakehouse. Build and train traditional machine learning models (supervised, unsupervised learning, deep learning) at scale. Manage and govern all data and AI assets centrally with Unity Catalog, ensuring data quality and lineage. Conduct complex data analytics, business intelligence (BI), and real-time data processing. Leverage AI-assistive tools like Databricks AI/BI Genie for natural language interaction with data. Require high-performance compute and auto-scaling for data-intensive workloads. Use Both for Comprehensive AI Solutions: The most powerful approach for many enterprises is to leverage both platforms. Azure Databricks can serve as the robust data backbone, handling data ingestion, processing, governance, and traditional ML model training. Azure AI Foundry then sits atop this foundation, consuming the prepared and governed data to build, deploy, and manage intelligent generative AI agents and applications. This allows for: Domain-Aware AI: Foundry agents are grounded in enterprise-specific data from Databricks, leading to more accurate, relevant, and trustworthy AI responses. End-to-End AI Lifecycle: Databricks manages the "data intelligence" part, and Foundry handles the "generative AI application" part, covering the entire spectrum from raw data to intelligent user experience. Optimized Resource Utilization: Each platform focuses on what it does best, leading to more efficient resource allocation and specialized toolsets for different stages of the AI journey. 
Comparative Analysis: Features and Capabilities To further illustrate their distinct yet complementary nature, let's examine a detailed comparison of their features, capabilities, and typical user bases. Radar Chart 1: This chart visually compares Azure AI Foundry and Azure Databricks across several key dimensions, illustrating their specialized strengths. Azure AI Foundry excels in generative AI and agent orchestration, while Azure Databricks dominates in data engineering, unified data governance, and traditional ML workflows. A Detailed Feature Comparison Feature Category Azure AI Foundry Azure Databricks Primary Focus Generative AI application & agent development, model orchestration Large-scale data engineering, analytics, traditional ML, and AI workflows Data Handling Connects to diverse data sources (e.g., Databricks, Azure AI Search) for grounding AI agents. Not a primary data storage/processing platform. Native data lakehouse architecture (Delta Lake), optimized for big data processing, storage, and real-time analytics. AI/ML Capabilities Foundation models (LLMs), prompt engineering, RAG, agent orchestration, model evaluation, content safety, responsible AI tooling. Traditional ML (supervised/unsupervised), deep learning, feature engineering, MLflow for lifecycle management, Databricks AI/BI Genie. Development Style Low-code agent building, prompt flows, unified SDK/API, templates. Code-first (Python, SQL, Scala, R), notebooks, IDE integrations. Model Access & Deployment Extensive model catalog (11,000+ models), serverless API, managed compute deployments, model benchmarking. Training and serving custom ML models, including deep learning. Models available for deployment through MLflow. Governance & Security Azure-based security & compliance, RBAC, project isolation, content safety guardrails, tracing, evaluations. Unity Catalog for unified data & AI governance, lineage tracking, access control, Entra ID integration. Key Users AI developers, business analysts, citizen developers, AI app builders. Data scientists, data engineers, ML engineers, data analysts. Integration Points Native connector to Databricks AI/BI Genie, Azure AI Search, Microsoft 365, Copilot Studio, Power Platform. Microsoft Fabric, Power BI, Azure AI Foundry, Azure Purview, Azure Monitor, Azure Key Vault. Table 1: A comparative overview of the distinct features and functionalities of Azure AI Foundry and Azure Databricks Concluding Thoughts In essence, Azure AI Foundry and Azure Databricks are not competing platforms but rather essential components of a unified, comprehensive AI strategy within the Azure ecosystem. Azure Databricks provides the robust, scalable foundation for all data engineering, analytics, and traditional machine learning workloads, acting as the "data intelligence platform." Azure AI Foundry then leverages this foundation to specialize in the rapid development, deployment, and operationalization of generative AI applications and intelligent agents. Together, they enable enterprises to unlock the full potential of AI, transforming raw data into powerful, domain-aware, and governed intelligent solutions. Frequently Asked Questions (FAQ) What is the main difference between Azure AI Foundry and Azure Databricks? Azure AI Foundry is specialized for building, deploying, and managing generative AI applications and AI agents, focusing on model orchestration and prompt engineering. 
Azure Databricks is a data intelligence platform for large-scale data engineering, analytics, and traditional machine learning, built on a Lakehouse architecture. Can Azure AI Foundry and Azure Databricks be used together? Yes, they are designed to work synergistically. Azure AI Foundry can leverage a native connector to access real-time, governed data from Azure Databricks, allowing AI agents to be grounded in enterprise data for more accurate and contextual responses. Which platform should I choose for training large machine learning models? For training large-scale, traditional machine learning, and deep learning models, Azure Databricks is generally the preferred choice due to its robust capabilities for data processing, feature engineering, and ML lifecycle management (MLflow). Azure AI Foundry focuses more on the deployment and orchestration of pre-trained foundation models and generative AI applications. Does Azure AI Foundry replace Azure Machine Learning or Databricks? No, Azure AI Foundry complements these services. It provides a specialized environment for generative AI and agent development, often integrating with data and models managed by Azure Databricks or Azure Machine Learning for comprehensive AI solutions. How do these platforms handle data governance? Azure Databricks utilizes Unity Catalog for unified data and AI governance, providing centralized control over data access and lineage. Azure AI Foundry integrates with Azure-based security and compliance features, ensuring responsible AI practices and data privacy within its generative AI applications.

Preparing for Azure PostgreSQL Certificate Authority Rotation: A Comprehensive Operational Guide
The Challenge It started with a standard notification in the Azure Portal: Tracking-ID YK3N-7RZ. A routine Certificate Authority (CA) rotation for Azure Database for PostgreSQL. As Cloud Solution Architects, we’ve seen this scenario play out many times. The moment “certificate rotation” is mentioned, a wave of unease ripples through engineering teams. Let’s be honest: for many of us—ourselves included—certificates represent the edge of our technical “comfort zone.” We know they are critical for security, but the complexity of PKI chains, trust stores, and SSL handshakes can be intimidating. There is a silent fear: “If we touch this, will we break production?” We realized we had a choice. We could treat this as an opportunity, and we could leave that comfort zone. We approached our customer with a proactive proposal: Let’s use this event to stop fearing certificates and start mastering them. Instead of just patching the immediate issue, we used this rotation as a catalyst to review and upgrade the security posture of their database connections. We wanted to move from “hoping it works” to “knowing it’s secure.” The response was overwhelmingly positive. The teams didn’t just want a quick fix; they wanted “help for self-help.” They wanted to understand the mechanics behind sslmode and build the confidence to manage trust stores proactively. This guide is the result of that journey. It is designed to help you navigate the upcoming rotation not with anxiety, but with competence—turning a mandatory maintenance window into a permanent security improvement. Two Levels of Analysis A certificate rotation affects your environment on two distinct levels, requiring different expertise and actions: Level Responsibility Key Questions Actions Platform Level Cloud/Platform Teams Which clusters, services, and namespaces are affected? How do we detect at scale? Azure Service Health monitoring, AKS scanning, infrastructure-wide assessment Application Level Application/Dev Teams What SSL mode? Which trust store? How to update connection strings? Code changes, dependency updates, trust store management This article addresses both levels - providing platform-wide detection strategies (Section 5) and application-specific remediation guidance (Platform-Specific Remediation). Business Impact: In production environments, certificate validation failures cause complete database connection outages. A single missed certificate rotation has caused hours of downtime for enterprise customers, impacting revenue and customer trust. Who’s Affected: DevOps engineers, SREs, database administrators, and platform engineers managing Azure PostgreSQL instances - especially those using: - Java applications with custom JRE cacerts - Containerized workloads with baked-in trust stores - Strict SSL modes (sslmode=verify-full, verify-ca) The Solution What we’ll cover: 🛡️ Reliability: How to prevent database connection outages through proactive certificate management 🔄 Resiliency: Automation strategies that ensure your trust stores stay current 🔒 Security: Maintaining TLS security posture while rotating certificates safely Key Takeaway: This rotation is a client trust topic, not a server change. Applications trusting root CAs (DigiCert Global Root G2, Microsoft RSA Root CA 2017) without intermediate pinning are unaffected. Risk concentrates where strict validation meets custom trust stores. 📦 Platform-Specific Implementation: Detailed remediation guides for Java, .NET, Python, Node.js, and Kubernetes are available in our GitHub Repository. 
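Before working through the platform-specific guides, a quick client-side self-check can tell you which chain your server presents today and whether the relevant roots are already trusted. The sketch below is illustrative only: the server name is a placeholder, keytool alias names vary by JDK vendor, and changeit is merely the default cacerts password.

```bash
# 1) Inspect the certificate chain the server presents (OpenSSL 1.1.1+ supports -starttls postgres)
openssl s_client -starttls postgres \
  -connect myserver.postgres.database.azure.com:5432 -showcerts </dev/null \
  | grep -E " s:| i:|Verify return code"

# 2) Check whether DigiCert/Microsoft roots are present in the Java default trust store
#    (aliases differ between JDK builds; 'changeit' is the default store password)
keytool -list -cacerts -storepass changeit | grep -iE "digicert|microsoft"
```

A "Verify return code: 0 (ok)" from openssl means the local OpenSSL trust store can validate the presented chain; the keytool check only shows whether a matching root exists in the default cacerts, not whether your application actually uses that store.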
Note: The GitHub Repository. contains community-contributed content provided as-is. Test all scripts in non-production environments before use. 1. Understanding Certificate Authority Rotation What Changes During CA Rotation? Azure Database for PostgreSQL uses TLS/SSL to encrypt client-server connections. The database server presents a certificate chain during the TLS handshake: Certificate Chain Structure: Figure: Certificate chain structure showing the rotation from old intermediate (red, deprecated) to new intermediate (blue, active after rotation). Client applications must trust the root certificates (green) to validate the chain. 📝 Diagram Source: The Mermaid source code for this diagram is available in certificate-chain-diagram.mmd. Why Root Trust Matters Key Principle: If your application trusts the root certificate and allows the chain to be validated dynamically, you are not affected. The risk occurs when: Custom trust stores contain only the old intermediate certificate (not the root) Certificate pinning is implemented at the intermediate level Strict validation is enabled (sslmode=verify-full in PostgreSQL connection strings) 2. Who Is Affected and Why Risk Assessment Matrix Application Type Trust Store SSL Mode Risk Level Action Required Cloud-native app (Azure SDK) OS Trust Store require 🟢 Low None - Azure SDK handles automatically Java app (default JRE) System cacerts verify-ca 🟡 Medium Verify JRE version (11.0.16+, 17.0.4+, 8u381+) Java app (custom cacerts) Custom JKS file verify-full 🔴 High Update custom trust store with new intermediate .NET app (Windows) Windows Cert Store require 🟢 Low None - automatic via Windows Update Python app (certifi) certifi bundle verify-ca 🟡 Medium Update certifi package (pip install --upgrade certifi) Node.js app (default) Built-in CAs verify-ca 🟢 Low None - Node.js 16+, 18+, 20+ auto-updated Container (Alpine) /etc/ssl/certs verify-full 🔴 High Update base image or install ca-certificates-bundle Container (custom) Baked-in certs verify-full 🔴 High Rebuild image with updated trust store How to Read This Matrix Use the above matrix to quickly assess whether your applications are affected by CA rotation. Here is an overview, how you read the matrix: Column Meaning Application Type What kind of application do you have? (e.g., Java, .NET, Container) Trust Store Where does the application store its trusted certificates? SSL Mode How strictly does the application validate the server certificate? Risk Level 🟢 Low / 🟡 Medium / 🔴 High - How likely is a connection failure? Action Required What specific action do you need to take? Risk Level Logic: Risk Level Why? 🟢 Low Automatic updates (OS/Azure SDK) or no certificate validation 🟡 Medium Manual update required but straightforward (e.g., pip install --upgrade certifi) 🔴 High Custom trust store must be manually updated - highest outage risk SSL Mode Security Posture Understanding SSL modes is critical because they determine both security posture AND rotation impact. 
This creates a dual consideration: SSL Mode Certificate Validation Rotation Impact Security Level Recommendation disable ❌ None ✅ No impact 🔴 INSECURE Never use in production allow ❌ None ✅ No impact 🟠 WEAK Not recommended prefer ❌ Optional ✅ Minimal 🟡 WEAK Not recommended require ❌ No (Npgsql 6.0+) ✅ No impact 🟡 WEAK Upgrade to verify-full verify-ca ✅ Chain only 🔴 Critical 🔵 MODERATE Update trust stores verify-full ✅ Chain + hostname 🔴 Critical 🟢 SECURE Recommended - Update trust stores Key Insight: Applications using weak SSL modes (everything below verify-ca) are technically unaffected by CA rotation but represent security vulnerabilities. The safest path is verify-full with current trust stores. ⚖️ The Security vs. Resilience Trade-off The Paradox: Secure applications (verify-full) have the highest rotation risk 🔴, while insecure applications (require) are unaffected but have security gaps. Teams discovering weak SSL modes during rotation preparation face a critical decision: Option Approach Rotation Impact Security Impact Recommended For 🚀 Quick Fix Keep weak SSL mode (require) ✅ No action needed ⚠️ Security debt remains Emergency situations only 🛡️ Proper Fix Upgrade to verify-full 🔴 Requires trust store updates ✅ Improved security posture All production systems Our Recommendation: Use CA rotation events as an opportunity to improve your security posture. The effort to update trust stores is a one-time investment that pays off in long-term security. Common Scenarios Scenario 1: Enterprise Java Application Problem: Custom trust store created 2+ years ago for PCI compliance Risk: High - contains only old intermediate certificates Solution: Export new intermediate from Azure, import to custom cacerts Scenario 2: Kubernetes Microservices Problem: Init container copies trust store from ConfigMap at startup Risk: High - ConfigMap never updated since initial deployment Solution: Update ConfigMap, redeploy pods with new trust store Scenario 3: Legacy .NET Application Problem: .NET Framework 4.6 on Windows Server 2016 (no Windows Update) Risk: Medium - depends on manual certificate store updates Solution: Import new intermediate to Windows Certificate Store manually 3. Trust Store Overview A trust store is the collection of root and intermediate CA certificates that your application uses to validate server certificates during TLS handshakes. Understanding where your application’s trust store is located determines how you’ll update it for CA rotations. Trust Store Locations by Platform Category Platform Trust Store Location Update Method Auto-Updated? OS Level Windows Cert:\LocalMachine\Root Windows Update ✅ Yes Debian/Ubuntu /etc/ssl/certs/ca-certificates.crt apt upgrade ca-certificates ✅ Yes (with updates) Red Hat/CentOS /etc/pki/tls/certs/ca-bundle.crt yum update ca-certificates ✅ Yes (with updates) Runtime Level Java JRE $JAVA_HOME/lib/security/cacerts Java security updates ✅ With JRE updates Python (certifi) site-packages/certifi/cacert.pem pip install --upgrade certifi ❌ Manual Node.js Bundled with runtime Node.js version upgrade ✅ With Node.js updates Custom Custom JKS Application-specific path keytool -importcert ❌ Manual Container image /etc/ssl/certs (baked-in) Rebuild container image ❌ Manual ConfigMap mount Kubernetes ConfigMap Update ConfigMap, redeploy ❌ Manual Why This Matters for CA Rotation Applications using auto-updated trust stores (OS-managed, current runtime versions) generally handle CA rotations automatically. 
The risk concentrates in: Custom trust stores created for compliance requirements (PCI-DSS, SOC 2) that are rarely updated Baked-in container certificates from images built months or years ago Outdated runtimes (old JRE versions, frozen Python environments) that haven’t received security updates Air-gapped environments where automatic updates are disabled When planning for CA rotation, focus your assessment efforts on applications in the “Manual” update category. 4. Platform-Specific Remediation 📦 Detailed implementation guides are available in our GitHub repository: azure-certificate-rotation-guide Quick Reference: Remediation by Platform Platform Trust Store Location Update Method Guide Java $JAVA_HOME/lib/security/cacerts Update JRE or manual keytool import java-cacerts.md .NET (Windows) Windows Certificate Store Windows Update (automatic) dotnet-windows.md Python certifi package pip install --upgrade certifi python-certifi.md Node.js Built-in CA bundle Update Node.js version nodejs.md Containers Base image /etc/ssl/certs Rebuild image or ConfigMap containers-kubernetes.md Scripts & Automation Script Purpose Download State Scan-AKS-TrustStores.ps1 Scan all pods in AKS for trust store configurations PowerShell tested validate-connection.sh Test PostgreSQL connection with SSL validation Bash not tested update-cacerts.sh Update Java cacerts with new intermediate Bash not tested 5. Proactive Detection Strategies Database-Level Discovery: Identifying Connected Clients One starting point for impact assessment is querying the PostgreSQL database itself to identify which applications are connecting. We developed a SQL query that joins pg_stat_ssl with pg_stat_activity to reveal active TLS connections, their SSL version, and cipher suites. 🔍 Get the SQL Query: Download the complete detection script from our GitHub repository: detect-clients.sql Important Limitations This query has significant constraints that you must understand before relying on it for CA rotation planning: Limitation Impact Mitigation Point-in-time snapshot Only shows currently connected clients Run query repeatedly over days/weeks to capture periodic jobs and batch processes No certificate details Cannot identify which CA certificate the client is using Requires client-side investigation (trust store analysis) Connection pooling May show pooler instead of actual application Use application_name in connection strings to identify true source Idle connections Long-running connections may be dormant Cross-reference with application activity logs Recommended approach: Use this query to create an initial inventory, then investigate each unique application_name and client_addr combination to determine their trust store configuration and SSL mode. Proactive Monitoring with Azure Monitor To detect certificate-related issues before and after CA rotation, configure Azure Monitor alerts. This enables early warning when SSL handshakes start failing. Why this matters: After CA rotation, applications with outdated trust stores will fail to connect. An alert allows you to detect affected applications quickly rather than waiting for user reports. Official Documentation: For complete guidance on creating and managing alerts, see Azure Monitor Alerts Overview and Create a Log Search Alert. Here is a short example of an Azure Monitor Alert definition as a starting point. 
{ "alertRule": { "name": "PostgreSQL SSL Connection Failures", "severity": 2, "condition": { "query": "AzureDiagnostics | where ResourceType == 'SERVERS' and Category == 'PostgreSQLLogs' and Message contains 'SSL error' | summarize count() by bin(TimeGenerated, 5m)", "threshold": 5, "timeAggregation": "Total", "windowSize": "PT5M" } } } Alert Configuration Notes: Setting Recommended Value Rationale Severity 2 (Warning) Allows investigation without triggering critical incident response Threshold 5 failures/5min Filters noise while catching genuine issues Evaluation Period 5 minutes Balances responsiveness with alert fatigue Action Group Platform Team Ensures quick triage and coordination 6. Production Validation Pre-Rotation Validation Checklist Inventory all applications connecting to Azure PostgreSQL Identify trust store locations for each application Verify root certificate presence in trust stores Test connection with new intermediate in non-production environment Update monitoring alerts for SSL connection failures Prepare rollback plan if issues occur Schedule maintenance window (if required) Notify stakeholders of potential impact Testing Procedure We established a systematic 3-step validation process to ensure zero downtime. This approach moves from isolated testing to gradual production rollout. 🧪 Technical Validation Guide: For the complete list of psql commands, connection string examples for Windows/Linux, and automated testing scripts, please refer to our Validation Guide in the GitHub repository. Connection Testing Strategy The core of our validation strategy was testing connections with explicit sslmode settings. We used the psql command-line tool to simulate different client behaviors. Test Scenario Purpose Expected Result Encryption only (sslmode=require) Verify basic connectivity Connection succeeds even with unknown CA CA validation (sslmode=verify-ca) Verify trust store integrity Connection succeeds only if CA chain is valid Full validation (sslmode=verify-full) Verify strict security compliance Connection succeeds only if CA chain AND hostname match Pro Tip: Test with verify-full and an explicit root CA file containing the new Microsoft/DigiCert root certificates before the rotation date. This validates that your trust stores will work after the intermediate certificate changes. Step 1: Test in Non-Production Validate connections against a test server using the new intermediate certificate (Azure provides test endpoints during the rotation window). Step 2: Canary Deployment Deploy the updated trust store to a single “canary” instance or pod. Monitor: - Connection success rate - Error logs - Response times Step 3: Gradual Rollout Once the canary is stable, proceed with a phased rollout: 1. Update 10% of pods 2. Monitor for 1 hour 3. Update 50% of pods 4. Monitor for 1 hour 5. Complete rollout 7. Best Practices and Lessons Learned Certificate Management Best Practices Practice Guidance Example Trust Root CAs, Not Intermediates Configure trust stores with root CA certificates only. This provides resilience against intermediate certificate rotations. Trust Microsoft TLS RSA Root G2 and DigiCert Global Root G2 instead of specific intermediates Automate Trust Store Updates Use OS-provided trust stores when possible (automatically updated). For custom trust stores, implement CI/CD pipelines. Schedule bi-annual trust store audits Use SSL Mode Appropriately Choose SSL mode based on security requirements. verify-ca is recommended for most scenarios. 
See Security Posture Matrix in Section 2 Maintain Container Images Rebuild container images monthly to include latest CA certificates. Use init containers for runtime updates. Multi-stage builds with CA certificate update step Avoid Certificate Pinning Never pin intermediate certificates. If pinning is required for compliance, implement automated update processes. Pin only root CA certificates if absolutely necessary SSL Mode Decision Guide SSL Mode Security Level Resilience When to Use require Medium High Encrypted traffic without certificate validation. Use when CA rotation resilience is more important than MITM protection. verify-ca High Medium Validates certificate chain. Recommended for most production scenarios. verify-full Highest Low Strictest validation with hostname matching. Use only when compliance requires it. Organizational Communication Model Effective certificate rotation requires structured communication across multiple layers: Layer Responsibility Key Action Azure Service Health Microsoft publishes announcements to affected subscriptions Monitor Azure Service Health proactively Platform/Cloud Team Receives Azure announcements, triages criticality Follow ITSM processes, assess impact Application Teams Execute application-level changes Update trust stores, validate connections Security Teams Define certificate validation policies Set compliance requirements Ownership and Responsibility Matrix Team Responsibility Deliverable Platform/Cloud Team Monitor Azure Service Health, coordinate response Impact assessment, team notifications Application Teams Application-level changes (connection strings, trust stores) Updated configurations, validation results Security Teams Define certificate policies, compliance requirements Policy documentation, audit reports All Teams (Shared) Certificate lifecycle collaboration Playbooks, escalation paths, training Certificate Rotation Playbook Components Organizations should establish documented playbooks including: Component Recommended Frequency Purpose Trust Store Audits Bi-annual (every 6 months) Ensure certificates are current Certificate Inventory Quarterly review Know what certificates exist where Playbook Updates Annual or after incidents Keep procedures current Team Training Annual Build knowledge and confidence Field Observations: Common Configuration Patterns Pattern Observation Risk Implicit SSL Mode Teams don’t explicitly set sslmode, relying on framework defaults Unexpected behavior during CA rotation Copy-Paste Configurations Connection strings copied without understanding options Works until certificate changes expose gaps Framework-Specific Defaults Java uses JRE trust store, .NET uses Windows Certificate Store, Python depends on certifi package Some require manual updates, some are automatic Framework Trust Store Defaults Framework Default Trust Store Update Method Risk Level Java/Quarkus JRE cacerts Manual or JRE update Medium - requires awareness .NET Windows Certificate Store Windows Update Low - automatic Node.js Bundled certificates Node.js version update Low - automatic Python certifi package pip install --upgrade certifi High - manual intervention required Knowledge and Confidence Challenges Challenge Impact Mitigation Limited certificate knowledge Creates uncertainty and risk-averse behavior Proactive education, hands-on workshops Topic intimidation “Certificates” can seem complex, leading to avoidance Reality: Implementation is straightforward once understood Previous negative experiences Leadership concerns based on 
past incidents Document successes, share lessons learned Visibility gaps Lack of visibility into application dependencies Maintain certificate inventory, use discovery tools Monitoring Strategy (Recommended for Post-Rotation): While pre-rotation monitoring focuses on inventory, post-rotation monitoring should track: Key Metrics: - Connection failure rates (group by application, SSL error types) - SSL handshake duration (detect performance degradation) - Certificate validation errors (track which certificates fail) - Application error logs (filter for “SSL”, “certificate”, “trust”) Recommended Alerts: - Threshold: >5 SSL connection failures in 5 minutes - Anomaly detection: Connection failure rate increases >50% - Certificate expiry warnings: 30, 14, 7 days before expiration Dashboard Components: - Connection success rate by application - SSL error distribution (validation failures, expired certificates, etc.) - Certificate inventory with expiry dates - Trust store update status across infrastructure These metrics, alerts and thresholds are only starting points and need to be adjusted based on your environment and needs. Post-Rotation Validation and Telemetry Note: This article focuses on preparation for upcoming certificate rotations. Post-rotation metrics and incident data will be collected after the rotation completes and can inform future iterations of this guidance. Recommended Post-Rotation Activities: Here are some thoughts on post-rotation activities that could yield more insight into the effectiveness of the preparation. Incident Tracking: After rotation completes, organizations should track: - Production incidents related to SSL/TLS connection failures - Services affected and their business criticality - Mean Time to Detection (MTTD) for certificate-related issues - Mean Time to Resolution (MTTR) from detection to fix Success Metrics to Measure Pre-Rotation Validation: - Number of services inventoried and assessed - Percentage of services requiring trust store updates - Testing coverage (dev, staging, production) Post-Rotation Outcomes: - Zero-downtime success rate (percentage of services with no impact) - Applications requiring emergency patching - Time from rotation to full validation Impact Assessment Telemetry to Collect: - Total connection attempts vs. failures (before and after rotation) - Duration of any service degradation or outages - Customer-facing impact (user-reported issues, support tickets) - Geographic or subscription-specific patterns Continuous Improvement Post-Rotation Review: - What worked well in the preparation phase? - Which teams or applications were unprepared? - What gaps exist in monitoring or alerting? - How can communication be improved for future rotations? Documentation Updates: - Update playbooks with lessons learned - Refine monitoring queries based on observed patterns - Enhance team training materials - Share anonymized case studies across the organization 8. Engagement & Next Steps Discussion Questions We’d love to hear from the community: What’s your experience with certificate rotations? Have you encountered unexpected connection failures during CA rotation events? Which trust store update method works best for your environment? OS-managed, runtime-bundled, or custom trust stores? How do you handle certificate management in air-gapped environments? What strategies have worked for your organization?
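Before wrapping up, here is a minimal psql sketch of the three test scenarios from Section 6 to make the testing procedure concrete. Treat it as a hedged starting point: the server name, user, and certificate path are placeholders, and the root bundle is assumed to contain the current DigiCert/Microsoft root certificates.

```bash
# Encryption only: succeeds even if the CA is unknown to the client
psql "host=myserver.postgres.database.azure.com port=5432 dbname=postgres user=myadmin sslmode=require" -c "SELECT version();"

# Chain validation: succeeds only if sslrootcert validates the server's certificate chain
psql "host=myserver.postgres.database.azure.com port=5432 dbname=postgres user=myadmin sslmode=verify-ca sslrootcert=/path/to/azure-roots.pem" -c "SELECT version();"

# Full validation: additionally requires the hostname to match the certificate
psql "host=myserver.postgres.database.azure.com port=5432 dbname=postgres user=myadmin sslmode=verify-full sslrootcert=/path/to/azure-roots.pem" \
  -c "SELECT ssl, version, cipher FROM pg_stat_ssl WHERE pid = pg_backend_pid();"
```

If the verify-full test passes against a bundle containing only the root CAs (no intermediates), your client is positioned to ride out the intermediate certificate change.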
Share Your Experience If you’ve implemented proactive certificate management strategies or have lessons learned from CA rotation incidents, we encourage you to: Comment below with your experiences and tips Contribute to the GitHub repository with additional platform guides or scripts Connect with us on LinkedIn to continue the conversation Call to Action Take these steps now to prepare for the CA rotation: Assess your applications - Use the Risk Assessment Matrix (Section 2) to identify which applications use sslmode=verify-ca or verify-full with custom trust stores Import root CA certificates - Add DigiCert Global Root G2 and Microsoft RSA Root CA 2017 to your trust stores Upgrade SSL mode - Change your connection strings to at least sslmode=verify-ca (recommended: verify-full) for improved security Document your changes - Record which applications were updated, what trust stores were modified, and the validation results Automate for the future - Implement proactive certificate management so future CA rotations are handled automatically (OS-managed trust stores, CI/CD pipelines for container images, scheduled trust store audits) 9. Resources Official Documentation Azure PostgreSQL: Azure PostgreSQL SSL/TLS Concepts Azure PostgreSQL - Connect with TLS/SSL PostgreSQL & libpq: PostgreSQL libpq SSL Support - SSL mode options and environment variables PostgreSQL psql Reference - Command-line tool documentation PostgreSQL Server SSL/TLS Configuration Certificate Authorities: DigiCert Root Certificates Microsoft PKI Repository Microsoft Trusted Root Program Community Resources Let’s Encrypt Root Expiration (2021 Incident) NIST SP 800-57: Key Management Guidelines OWASP Certificate Pinning Cheat Sheet Neon Blog: PostgreSQL Connection Security Defaults Tools and Scripts PowerShell AKS Trust Store Scanner (see Platform-Specific Remediation) PostgreSQL Interactive Terminal (psql) PostgreSQL JDBC SSL Documentation Industry Context Certificate rotation challenges are not unique to Azure PostgreSQL. Similar incidents have occurred across the industry: Historical Incidents: - Let’s Encrypt Root Expiration (2021): Widespread impact when DST Root CA X3 expired, affecting older Android devices and legacy systems - DigiCert Root Transitions: Multiple cloud providers experienced customer impact during CA changes - Internal PKI Rotations: Enterprises face similar challenges when rotating internally-issued certificates Relevant Standards: - NIST SP 800-57: Key Management Guidelines (certificate lifecycle best practices) - OWASP Certificate Pinning: Guidance on balancing security and operational resilience - CIS Benchmarks: Recommendations for TLS/SSL configuration in cloud environments Authors Author Role Contact Andreas Semmelmann Cloud Solution Architect, Microsoft LinkedIn Mpho Muthige Cloud Solution Architect, Microsoft LinkedIn Disclaimers Disclaimer: The information in this blog post is provided for general informational purposes only and does not constitute legal, financial, or professional advice. While every effort has been made to ensure the accuracy of the information at the time of publication, Microsoft makes no warranties or representations as to its completeness or accuracy. Product features, availability, and timelines are subject to change without notice. For specific guidance, please consult your legal or compliance advisor. Microsoft Support Statement: This article represents field experiences and community best practices. 
For official Microsoft support and SLA-backed guidance: Azure Support: https://azure.microsoft.com/support/ Official Documentation: https://learn.microsoft.com/azure/ Microsoft Q&A: https://learn.microsoft.com/answers/ Production Issues: Always open official support tickets for production-impacting problems. Customer Privacy Notice: This article describes real-world scenarios from customer engagements. All customer-specific information has been anonymized. No NDAs or customer confidentiality agreements were violated in creating this content. AI-generated content disclaimer: This content was generated in whole or in part with the assistance of AI tools. AI-generated content may be incorrect or incomplete. Please review and verify before relying on it for critical decisions. See terms Community Contribution: The GitHub repository referenced in this article contains community-contributed scripts and guides. These are provided as-is for educational purposes and should be tested in non-production environments before use. Tags: #AzurePostgreSQL #CertificateRotation #TLS #SSL #TrustStores #Operations #DevOps #SRE #CloudSecurity #AzureDatabase

Application Gateway for Containers – A New Way to Ingress into AKS
Introduction If you’re using Azure Kubernetes Service (AKS), you will need a mechanism for accepting and routing HTTP/S traffic to applications running in your AKS cluster. Until recently, this was typically handled by Azure’s Application Gateway Ingress Controller (AGIC) or another Ingress product such as NGINX. With the introduction of the upstream Kubernetes Gateway API project, there’s now a more evolved solution for ingress traffic management. This article will discuss Application Gateway for Containers (AGC) – which is Azure’s latest load balancing solution that implements Gateway API. This post is not an instructional on how to deploy AGC, but it will address the following: What is Gateway API and why is it needed? How does AGC work? How is high availability and resiliency incorporated into AGC? What AGC is not The goal is that you will come away with an understanding of the inner workings of AGC and how it ties into the AKS environment. Let’s get started! Gateway API Overview Before the introduction of Gateway API, Ingress API was the de facto method for routing Layer 7 traffic to applications running in Kubernetes. It provides a simple routing process for HTTP/S traffic but has limitations. For instance, it requires the use of vendor specific annotations for the following: URL rewriting or header modification Routing for gRPC, TCP or UDP based traffic To address these limitations, The Kubernetes Network Special Interest Group (SIG) introduced Gateway API. It consists of a collection of Custom Resource Definitions (CRDs) which extends the Kubernetes API to allow for the creation of custom resources. Gateway API is a more flexible, portable and extensible solution in comparison to its Ingress predecessor. It consists of three components: Gateway Class – provides a standard on how Gateway objects should be configured and behave Gateway – an instantiation of a Gateway Class that implements its configuration Routes – defines routing and protocol-based rules that are mapped to Kubernetes backend services As seen in Fig.1.1, the relative independence of each component in Gateway API allows for a separation of concerns type resource model. For example, developers can focus on creating routes for their apps and platform teams can manage the gateway resources that are utilized by routes. The other benefit is the portability of routes. For example, ones created in AKS can be used with Gateway API deployments in other environments. This flexibility is not possible with Ingress API, due to a lack of standardization across different Ingress controller implementations. Application Gateway for Containers Overview Not to be confused with Application Gateway, Application Gateway for Containers is a load balancing product designed to manage layer 7 traffic intended for applications running in AKS. It supports advanced routing capabilities by leveraging components that bootstrap Gateway API into AKS. The above figure is an illustration of AGC, AKS and how they work together to manage incoming traffic. Let’s break down the diagram in detail to get a better understanding of AGC. The Application Gateway for Containers Frontend serves as the public entry point for client traffic. It is a child resource of AGC and is given an auto-generated FQDN during setup. To avoid using the FQDN, it can be mapped to a CNAME record for DNS resolution. Also, you can have multiple Frontend resources (up to 5) in a single AGC instance. 
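For example, if clients should use a friendly hostname rather than the auto-generated frontend FQDN, a CNAME record in Azure DNS can point at it. This is a hedged sketch with placeholder values (the resource group, zone, record name, and frontend FQDN are all assumptions):

```bash
# Point app.contoso.com at the AGC Frontend's auto-generated FQDN (placeholder value)
az network dns record-set cname set-record \
  --resource-group rg-dns \
  --zone-name contoso.com \
  --record-set-name app \
  --cname example-frontend.fz12.alb.azure.com
```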
The Association child resource is the point of entry into the AKS Cluster and defines where the proxy components live. In the above pic, you will notice a subnet linked to it, which is established via subnet delegation. This is a dedicated subnet that’s also used by the proxy components which send traffic to destination AKS pods. The ALB Controller (which will be described shortly) deploys the proxies into the subnet. Here’s a view of the ALB Controller subnet. It must be at least a /24 subnet and cannot be used for any other resources. In this case, the ALB subnet is deployed within the AKS Virtual Network (VNet); however, this is not a requirement. It can be in a separate VNet that is peered with the AKS virtual network. So, we’ve determined how traffic flows from the AGC frontend resource to the proxy components. But two questions remain: 1) How do the proxy components know which backend services are intended for the incoming request? 2) How is Gateway API leveraged by AGC to utilize advanced routing patterns? This is where the ALB controller comes into play. Before creating the AGC instance, the ALB controller is deployed into AKS. It’s responsible for monitoring HTTP route and Gateway resource configurations. As you can see in the above pic, the ALB controller runs as three pods in AKS: two controller pods and one for bootstrapping. The ALB controller pods have a direct connection to AGC and are responsible for replicating resource configurations to it. To accomplish this, a federated Managed Identity is used which has the AppGW for Containers Configuration Manager role on the AGC Resource Group. Also, the ALB Controller uses this Managed Identity to provision AGC. Alternatively, you can create your own AGC resource via Azure portal, CLI, PowerShell or Infrastructure as Code (IaC). The latter deployment method is done through Azure Resource Manager (ARM). By default, the bootstrap pod is how Gateway API is installed. However, you can disable this behavior by setting the albController.installGatewayApiCRDs parameter to false when you install the ALB Controller using Helm. In Fig.1.8, a kubectl describe command is executed against the bootstrap pod to display its specs. You will notice an Init container applies the Gateway API CRDs into AKS. Init Containers are used to perform initialization tasks that must precede the startup of a main application container. Fig.1.9. Gateway Class object definition output Recall from earlier that Gateway API consists of three resources: Gateway class, Gateway resource and Routes. The ALB Controller will create a Gateway Class object with the name azure-alb-external, as shown above. Fig.1.10. Gateway Resource and HTTPRoute configuration files Fig.1.11. Diagram of traffic splitting between backends The final steps to complete the puzzle are to deploy a Gateway resource which listens for traffic over a protocol/port combination and a Route to define how traffic coming via the Gateway maps to backend services. The Gateway definition has a gatewayClassName spec that references the name of the Gateway Class. In the above example, it listens for HTTP traffic on port 80. And there’s a corresponding HTTPRoute config that splits the traffic across two backend services: backend-v1 receiving 50% of the traffic on port 8080 and backend-v2 receiving the remaining traffic using the same port.
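To make that concrete, the Gateway and HTTPRoute described above might look like the following. This is a hedged sketch: the namespace, resource names, and listener settings are assumptions, and the AGC-specific settings that associate the Gateway with a particular Frontend are omitted here; refer to the AGC documentation for those details.

```yaml
apiVersion: gateway.networking.k8s.io/v1
kind: Gateway
metadata:
  name: gateway-01
  namespace: test-infra
spec:
  gatewayClassName: azure-alb-external   # Gateway Class created by the ALB Controller
  listeners:
  - name: http-listener
    protocol: HTTP
    port: 80
    allowedRoutes:
      namespaces:
        from: Same
---
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: traffic-split-route
  namespace: test-infra
spec:
  parentRefs:
  - name: gateway-01
  rules:
  - backendRefs:
    - name: backend-v1   # receives 50% of requests
      port: 8080
      weight: 50
    - name: backend-v2   # receives the remaining 50%
      port: 8080
      weight: 50
```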
High Availability & Resiliency in AGC When you create an Application Gateway for Containers resource, it’s automatically deployed across Availability zones within the selected region. An Availability Zone (AZ) is a physically unique group of one or more datacenters. Its purpose is to provide intra-regional resiliency at the datacenter level. There are typically three AZs in a region where they are supported. Therefore, if one datacenter in the region goes down, AGC is not impacted. If Availability zones aren’t supported in the selected region, fault and update domains in the form of Availability sets will be leveraged to mitigate against outages and maintenance events. This link provides a list of Azure regions that support Availability zones. To mitigate against regional outages, you can leverage Azure Front Door or Traffic Manager with AGC. Azure Front Door is a Layer 7 routing service that load-balances incoming traffic across regions. It provides Content Delivery Network (CDN), Web Application Firewall (WAF), SSL termination and other capabilities for HTTP/HTTPS traffic. Traffic Manager, on the other hand, uses DNS to direct client requests to the appropriate endpoint based on a specified routing method such as priority, performance, weight or others. What AGC is Not Application Gateway for Containers is not a replacement for Application Gateway. Rather, it’s a new service within the family of Azure load balancing services. Although AGC doesn’t currently have Web Application Firewall (WAF) capabilities like Application Gateway, the feature is currently in private preview and will soon be available. Lastly, AGC is designed specifically for routing requests to containerized applications running in AKS. And unlike Application Gateway, it does not support backend targets such as Azure App Services, VMs, and Virtual Machine Scale Sets (VMSS). Conclusion Over time, it became evident that a new way of managing ingress traffic for containerized workloads was needed. The initial implementations for ingress traffic management were sufficient for simple routing requests but lacked native support for advanced routing needs. In this article, we discussed Microsoft Azure’s newest load balancing solution called Application Gateway for Containers, which builds on the Gateway API for Kubernetes. We explored the components of AGC, how it manages traffic and addressed any potential misconceptions regarding it. For some additional resources, check out the following: What is Application Gateway for Containers? | Microsoft Learn Gateway API | Kubernetes Introduction - Kubernetes Gateway API AGC Supported Regions

Nginx Ingress controller integration with Istio Service Mesh
Introduction Nginx (pronounced as "engine x") is an HTTP web server, reverse proxy, content cache, load balancer, TCP/UDP proxy server, and mail proxy server. It is one of the most common ingress controllers (an ingress is used to bring external traffic into the cluster) used in Kubernetes. I have discussed Istio service mesh in my previous article here: Istio Service Mesh Observability in AKS | Microsoft Community Hub. Setting up the nginx ingress controller with Istio service mesh requires custom configuration and is not as straightforward as using the in-house ingress from Istio. One of my customers faced this issue and I was able to resolve it using the configuration we will discuss in this article. Not all customers can migrate to Istio Ingress when enabling service mesh as they might already have a lot of dependencies on existing ingress rules as well as enterprise agreements with Ingress providers. The main problem with having both the nginx ingress controller and Istio service mesh in the same Kubernetes cluster arises when mTLS is enforced strictly by Istio. TLS vs mTLS Usually when we communicate with a server, we use TLS, in which only the server’s identity is verified using a certificate. The client is verified using secondary methods like username-password, tokens, etc. With distributed attacks increasing in the age of AI, it is critical to implement cryptographically verifiable identities for clients as well. Mutual TLS, or mTLS, is based on this Zero Trust mindset. With mTLS both client and server present a verifiable certificate, which makes man-in-the-middle attacks extremely difficult. Enabling mTLS is one of the primary use cases for using Istio service mesh in a Kubernetes cluster. Sidecar in Istio Sidecars are secondary containers that are injected into a pod and run alongside the main application containers. The Istio sidecar acts like a proxy and intercepts all incoming and outgoing traffic to the application container unless explicitly configured otherwise. The sidecar is how Istio implements its traffic management functionality in the service mesh. In the future there will be an option to operate Istio in a sidecarless fashion using Ambient mode, which is still in development for the Istio add-on for AKS at the time of writing this article. Root cause In the above diagram you can see that Istio sidecar injection is enabled in the application pod namespace but not in the ingress controller namespace. Also, traffic enters the ingress controller through the AKS-exposed internal load balancer. This traffic is HTTPS/TLS based and gets TLS terminated at the ingress controller. This is usually done because otherwise Nginx would not be able to perform many of its functions, like path- and header-based routing, unless it decrypts the traffic. Therefore, traffic going towards the application pods is HTTP based. Now, since mTLS is strictly enforced in the service mesh, it will only accept mTLS traffic; therefore, this traffic gets dropped and the user gets a 502 Bad Gateway error thrown by Nginx. Even if the traffic is re-encrypted and sent to the application pods, which Nginx supports, the request will still get dropped as Istio allows only mTLS, not plain TLS. Solution To solve this problem, we follow these steps: 1. Enable sidecar injection in the Ingress controller namespace: First we will enable sidecar injection in the Ingress controller namespace, so that traffic egressing from the ingress controller pods is mTLS. 2.
Solution

To solve this problem, we follow these steps:

1. Enable sidecar injection in the ingress controller namespace: First, we enable sidecar injection in the ingress controller namespace so that traffic egressing from the ingress controller pods uses mTLS.

2. Exempt external inbound traffic from the sidecar: mTLS is only understood within the AKS cluster, so we have to stop external traffic from being routed through the Istio proxy container and send it directly to the Nginx container. If we don't do this, Istio expects the external traffic to also be mTLS and drops it. After traffic enters Nginx, Nginx decrypts it and sends it onward, where it is intercepted by the istio-proxy sidecar and encrypted with mTLS.

3. Send traffic to the application service instead of directly to the pods: By default, Nginx sends traffic directly to the application pods, as you can see in the root cause diagram. If we continue doing that, Istio will not consider this traffic to be mesh traffic and will drop it. Therefore, for Istio to allow this traffic as part of the mesh, we have to direct it through the application service. Once this is done, Istio lets the traffic through to the application pods.

There are some additional configurations, which we will discuss in the detailed steps below.

Steps to integrate Nginx Ingress Controller with Istio Service mesh

For details on setting up the AKS cluster, enabling Istio and installing the demo application, check out my prior article: Istio Service Mesh Observability in AKS | Microsoft Community Hub, steps 1 through 4. The steps below assume that you already have an AKS cluster with the Istio service mesh installed and that the demo application is installed as discussed in my previous article.

1. Enable mTLS strict mode for the entire service mesh. This enforces mTLS in all namespaces where Istio sidecar injection is enabled.

# Enable mTLS for the entire service mesh
kubectl apply -n aks-istio-system -f - <<EOF
apiVersion: security.istio.io/v1
kind: PeerAuthentication
metadata:
  name: global-mtls
  namespace: aks-istio-system
spec:
  mtls:
    mode: STRICT
EOF

2. Install the Nginx ingress controller if it is not already installed in your AKS cluster.

# Namespace where you want to install the ingress-nginx controller
NAMESPACE=ingress-basic

# Add the ingress-nginx helm repo to your repositories
helm repo add ingress-nginx https://kubernetes.github.io/ingress-nginx
helm repo update

# Install the Nginx ingress controller with the Azure Load Balancer health-probe annotation
# and externalTrafficPolicy set to Local; this is important for the health probe to work correctly with the Azure Load Balancer
helm install ingress-nginx ingress-nginx/ingress-nginx \
  --create-namespace \
  --namespace $NAMESPACE \
  --set controller.service.annotations."service\.beta\.kubernetes\.io/azure-load-balancer-health-probe-request-path"=/healthz \
  --set controller.service.externalTrafficPolicy=Local

3. Create the Ingress object in the application namespace: You need an Ingress object so that Nginx can route traffic to your pods. Refer to nginx-ingress-before.yaml; a rough sketch of its possible shape follows below.

# Apply the Ingress resource for the sample application
kubectl apply -f ./nginx-ingress-before.yaml -n default
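The article does not reproduce nginx-ingress-before.yaml; the repository linked at the end contains the real file. As an illustration only, an Ingress of roughly this shape would produce the behaviour described in the next step. The service name, port, and /test path are assumptions based on the bookinfo demo app and the URL used in step 4, and the actual manifest may differ.

# Hypothetical sketch of nginx-ingress-before.yaml; the actual file in the linked repo may differ
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: bookinfo-ingress
  annotations:
    # rewrite /test to / on the backend; illustrative only
    nginx.ingress.kubernetes.io/rewrite-target: /
spec:
  ingressClassName: nginx
  rules:
    - http:
        paths:
          - path: /test
            pathType: Prefix
            backend:
              service:
                name: productpage
                port:
                  number: 9080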
4. Validate whether you can access the sample app through the Nginx ingress you created: Get the external IP of the ingress controller service of type LoadBalancer.

# Get the external IP for the service
kubectl get services -n ingress-basic

You will get an output as shown below. Copy the external IP and open http://<external-ip>/test in your browser. You will notice that Nginx throws a 502 Bad Gateway error. This is because it could not reach the application pods and get a response: istio-proxy dropped the requests since they were not mTLS. The following steps fix this issue.

5. Enable sidecar injection in the ingress controller namespace: For the application pods to accept traffic from Nginx, it has to be sent with mTLS from the Istio side. To make this possible, we have to enable sidecar injection in the Nginx ingress controller namespace. After adding the label, restart the ingress controller deployment so that sidecars are injected into the Nginx ingress controller pods:

# Get the istio version installed on the AKS cluster
az aks show --resource-group $MY_RESOURCE_GROUP_NAME --name $MY_AKS_CLUSTER_NAME --query 'serviceMeshProfile.istio.revisions'

# Label the namespace with the appropriate istio revision to enable sidecar injection
kubectl label namespace <ingress-controller-namespace> istio.io/rev=asm-1-<version>

# Restart the nginx ingress controller deployment so that sidecars get injected into the pods
kubectl rollout restart deployment/ingress-nginx-controller -n ingress-basic

6. Exempt external inbound traffic from the sidecar: This is required because mTLS is only understood within the AKS cluster and is not meant for external traffic. Now that the sidecar is injected into Nginx, we need to exempt external traffic from being routed through the istio-proxy; otherwise it will be dropped for not being mTLS (it is only TLS). To do this, add the following annotations:

# Edit the nginx controller deployment
kubectl edit deployments -n ingress-basic ingress-nginx-controller

# Disable inbound port redirection to the proxy for all ports (the empty value achieves that)
traffic.sidecar.istio.io/includeInboundPorts: ""

# Explicitly exclude the externally exposed ports from istio-proxy redirection so traffic reaches the ingress controller container directly
traffic.sidecar.istio.io/excludeInboundPorts: "80,443"

Don't exit edit mode yet, as we have one more annotation to add below.

7. Allow the connection between the Nginx ingress controller and the API server: With mTLS enforced for Nginx, it cannot communicate with the Kubernetes API server to monitor and react to changes in Ingress resources, which is what enables dynamic configuration of Nginx. Therefore, we need to exempt the Kubernetes API server IP from mTLS traffic.

# Query the kubernetes API server IP
kubectl get svc kubernetes -o jsonpath='{.spec.clusterIP}'

# Add the annotation to the ingress controller
traffic.sidecar.istio.io/excludeOutboundIPRanges: "KUBE_API_SERVER_IP/32"

The problem with this approach is that AKS doesn't guarantee a static IP for the API server, as it is managed by the platform. The API server IP usually changes during cluster restart or reprovisioning, but that is not guaranteed to happen only in those situations; it can take any IP from the service CIDR, which is a /16 CIDR unless configured explicitly. One option is to have a dedicated subnet for the API server using the VNet integration feature, but this feature is currently in preview with a tentative GA in Q2 2025: API Server VNet Integration in Azure Kubernetes Service (AKS) - Azure Kubernetes Service. After enabling this feature, the API server will always take an IP from the allocated subnet, which can then be used in the annotation above. This is how the final deployment YAML for the Nginx ingress controller should look; note that the annotations go under the pod template, not at the Deployment level. A minimal excerpt is sketched below:
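For reference, this excerpt shows where the three annotations from steps 6 and 7 end up in the ingress-nginx-controller Deployment. Only the relevant fields are shown, and the IP is a placeholder for your own API server service IP.

# Excerpt of the ingress-nginx-controller Deployment after steps 6 and 7 (not a complete manifest)
spec:
  template:
    metadata:
      annotations:
        # do not redirect any inbound ports to the istio-proxy
        traffic.sidecar.istio.io/includeInboundPorts: ""
        # let externally exposed ports reach the nginx container directly
        traffic.sidecar.istio.io/excludeInboundPorts: "80,443"
        # bypass the proxy for outbound calls to the Kubernetes API server (replace with your cluster's value)
        traffic.sidecar.istio.io/excludeOutboundIPRanges: "10.0.0.1/32"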
8. Route traffic through the Istio sidecar once it enters the Ingress object: By default, Nginx sends traffic to the upstream pod IP and port combination. With mTLS enabled, Istio does not recognize this as mesh traffic and drops it. Therefore, it is important to change this behavior and send traffic to the exposed service instead of directly to the backend pods. This is done with the annotations below; you can check the sample here: nginx-ingress-after.yaml.

# Set up nginx to send traffic to the upstream service instead of the pod IP and port
nginx.ingress.kubernetes.io/service-upstream: "true"

# Specify the service FQDN to route traffic to (this is the service that exposes the application pods)
nginx.ingress.kubernetes.io/upstream-vhost: <service>.<namespace>.svc.cluster.local

# Apply the Ingress resource for the sample application
kubectl apply -f ./nginx-ingress-after.yaml -n default

9. Configure the ingress's sidecar to route traffic to services in the mesh: This is only needed if the Ingress object is in a separate namespace from the services it routes traffic to; we don't need it here because our Ingress and application service are in the same namespace. Sidecars know how to route traffic to services in their own namespace, but if you want them to route traffic to a different namespace, you need to allow it in your sidecar configuration, which can be done using the YAML here: Sidecar.yaml.

# Apply the Sidecar yaml in the namespace where the ingress object is deployed
kubectl apply -f Sidecar.yaml -n <ingress-object-namespace>
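The contents of Sidecar.yaml are in the linked repository. As a rough sketch for the cross-namespace case, an Istio Sidecar resource of this shape widens the set of namespaces the sidecar can route to; the namespace placeholders below are illustrative and should match your own setup.

# Hypothetical sketch of Sidecar.yaml; the actual file in the linked repo may differ
apiVersion: networking.istio.io/v1beta1
kind: Sidecar
metadata:
  name: ingress-sidecar
  namespace: <ingress-object-namespace>   # same namespace used in the kubectl apply above
spec:
  egress:
    - hosts:
        - "./*"                        # services in the same namespace
        - "<application-namespace>/*"  # namespace of the services the ingress routes to
        - "aks-istio-system/*"         # Istio control plane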
Validate that the application is accessible: The application should now load at http://<external-ip>/test in your browser.

Conclusion

That's it: once the steps above are followed, traffic should flow as expected between the mTLS-enforced service mesh and the Nginx ingress controller. You can find all the commands and yaml files from this article here. Let me know in the comments below if you have any questions or face any issues with integrating the Nginx ingress controller with the Istio service mesh.

Istio Service Mesh Observability in AKS

Introduction

A service mesh is a dedicated infrastructure layer that manages service-to-service communication in microservices architectures, providing built-in security, traffic control, and observability for distributed systems. Istio is a powerful, open-source service mesh that simplifies managing, securing, and observing microservices communication. It joined the Cloud Native Computing Foundation (CNCF) in 2022 and has become an industry standard for service mesh operations. Azure Kubernetes Service (AKS) is a managed Kubernetes service provided by Microsoft Azure. It allows you to deploy, manage, and scale containerized applications using Kubernetes without needing extensive container orchestration expertise. Observability in an Istio service mesh is crucial for ensuring the reliability, performance, and security of microservices-based applications, and Istio is a powerhouse when it comes to exposing telemetry and understanding the complex flow of traffic between applications. This article is a step-by-step guide for enabling the Istio service mesh in AKS using the Istio add-on and enabling observability using managed Prometheus and Grafana. At the end, we will discuss the Advanced Container Networking Services (ACNS) add-on in AKS, which enables Hubble to help visualize traffic flow within an AKS cluster and service mesh. I wanted to document the process as there are not enough articles available currently to achieve this in AKS, and specifically none that talk about enabling Istio metrics export with mTLS enabled in the AKS cluster (at the time of writing this article 😊).

Metrics scraping architecture

Above is a simplified architecture diagram of how metrics get scraped in AKS by Prometheus. Prometheus is embedded into the Azure Monitor pods (the ama-* pods), and they do the scraping based on the configured scrape rules. Each application pod has an istio-proxy sidecar container to control traffic and collect metrics; this depends on which namespaces have sidecar auto-injection enabled or which pods are explicitly injected with a sidecar. Hubble pods also run on the cluster, using eBPF to capture network flows at Layer 3. Prometheus collects all these metrics and sends them to the Azure Monitor workspace (a managed Prometheus-compatible store) via private endpoint. The managed Grafana instance then pulls this data from the Azure Monitor workspace, again over a private endpoint. The exact scrape rules live in the prometheus-config file applied in step 7 below; a sketch of the Istio-specific scrape job follows.
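This is only an illustration of the general pattern; the downloadable prometheus-config file referenced in step 7 is the authoritative version and may differ in detail. Istio sidecars expose their metrics on port 15090 (named http-envoy-prom) at /stats/prometheus, and a typical job keeps exactly those targets:

# Illustrative scrape job for istio sidecars (the real prometheus-config from step 7 may differ)
global:
  scrape_interval: 30s
scrape_configs:
  - job_name: istio-proxies
    metrics_path: /stats/prometheus
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      # keep only the istio-proxy container's http-envoy-prom port (15090)
      - source_labels: [__meta_kubernetes_pod_container_name, __meta_kubernetes_pod_container_port_name]
        action: keep
        regex: istio-proxy;http-envoy-prom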
Steps for configuring managed Prometheus, Grafana and Hubble

1. Start by logging into the Azure CLI with your account, selecting the default subscription, and defining some variables you will use for creating the resource group and AKS cluster.

# Define variables
export MY_RESOURCE_GROUP_NAME="<your resource group name>"
export REGION="<region where you would like to deploy the cluster>"
export MY_AKS_CLUSTER_NAME="<AKS cluster name>"

# Create a resource group
az group create --name $MY_RESOURCE_GROUP_NAME --location $REGION

# Create an AKS cluster
az aks create --resource-group $MY_RESOURCE_GROUP_NAME --name $MY_AKS_CLUSTER_NAME --node-count 3 --generate-ssh-keys

Once completed, you should be able to see your AKS cluster in the Azure portal.

2. Get credentials for the AKS cluster and verify your connection.

# Get the credentials for the AKS cluster
az aks get-credentials --resource-group $MY_RESOURCE_GROUP_NAME --name $MY_AKS_CLUSTER_NAME

# Verify the connection to your cluster
kubectl get nodes

If the connection is successful and the AKS cluster was created successfully, you should see the nodes that were created as part of your AKS cluster.

3. Enable the Istio add-on for AKS (you might need to install the aks-preview extension for the Azure CLI if it is not already installed). Then verify the installation of Istio and enable sidecar injection in the desired namespace.

# Enable the istio addon on the AKS cluster
az aks mesh enable --resource-group $MY_RESOURCE_GROUP_NAME --name $MY_AKS_CLUSTER_NAME

# Verify istiod (Istio control plane) pods are running successfully
kubectl get pods -n aks-istio-system

# Check the istio revision before enabling sidecar injection
az aks show --resource-group $MY_RESOURCE_GROUP_NAME --name $MY_AKS_CLUSTER_NAME --query 'serviceMeshProfile.istio.revisions'

Based on the output of the above command, use the appropriate label to enable sidecar injection; below, "default" is the namespace where I am enabling sidecar injection.

kubectl label namespace default istio.io/rev=asm-1-22

Sample output:

4. Deploy the sample application and verify its deployment.

# Deploy sample application
kubectl apply -f https://raw.githubusercontent.com/istio/istio/release-1.18/samples/bookinfo/platform/kube/bookinfo.yaml

# Verify services and pods
kubectl get services
kubectl get pods
kubectl port-forward svc/productpage 12002:9080

Sample output:

In the output of "kubectl get pods" above, you will notice that each pod shows 2 containers under the READY column. This is because you enabled sidecar injection in the default namespace; the second container in each pod is the istio-proxy container. After port-forwarding your app to local port 12002, you should be able to access it at http://localhost:12002.

5. Enable mTLS in your service mesh. This is one of the most important use cases of Istio: it lets you enforce mTLS so that only mTLS traffic is allowed in your mesh, improving your cluster security significantly.

# Enable mTLS enforcement for the default namespace in the cluster (copy/paste and run the entire block up to the closing EOF in the terminal)
kubectl apply -n default -f - <<EOF
apiVersion: security.istio.io/v1
kind: PeerAuthentication
metadata:
  name: default
spec:
  mtls:
    mode: STRICT
EOF

# Verify your policy got deployed
kubectl get peerauthentication -n default

Sample output:
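As an optional sanity check (not part of the original steps), you can confirm that STRICT mode really rejects plaintext by calling a meshed service from a pod that has no sidecar. The namespace and image below are only illustrative.

# Run a client pod in a namespace without sidecar injection
kubectl create namespace no-mesh
kubectl run curl-test -n no-mesh --image=curlimages/curl --restart=Never -- sleep 3600

# A plaintext request to a meshed service should now fail (connection reset) instead of returning HTTP 200
kubectl exec -n no-mesh curl-test -- curl -sv http://productpage.default.svc.cluster.local:9080/productpage

# Clean up the test namespace afterwards
kubectl delete namespace no-mesh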
6. Now we will deploy managed Prometheus and Grafana and link them with the AKS cluster. This lets us visualize Prometheus-based metrics from Kubernetes on Grafana dashboards.

# Create the azure monitor resource (managed prometheus resource)
export AZURE_MONITOR_NAME="<your desired name for managed prometheus resource>"
az resource create --resource-group $MY_RESOURCE_GROUP_NAME --namespace microsoft.monitor --resource-type accounts --name $AZURE_MONITOR_NAME --location $REGION --properties '{}'

# Create an Azure Managed Grafana instance
export GRAFANA_NAME="<your desired name for managed grafana resource>"
az grafana create --name $GRAFANA_NAME --resource-group $MY_RESOURCE_GROUP_NAME --location $REGION

# Link Azure Monitor and Azure Managed Grafana to the AKS cluster
grafanaId=$(az grafana show --name $GRAFANA_NAME --resource-group $MY_RESOURCE_GROUP_NAME --query id --output tsv)
azuremonitorId=$(az resource show --resource-group $MY_RESOURCE_GROUP_NAME --name $AZURE_MONITOR_NAME --resource-type "Microsoft.Monitor/accounts" --query id --output tsv)
az aks update --name $MY_AKS_CLUSTER_NAME --resource-group $MY_RESOURCE_GROUP_NAME --enable-azure-monitor-metrics --azure-monitor-workspace-resource-id $azuremonitorId --grafana-resource-id $grafanaId

# Verify Azure Monitor pods are running
kubectl get pods -o wide -n kube-system | grep ama-

Sample output:

In the Azure portal, you can check that the new resources are created. You should then open the Grafana resource and click on the instance URL to open your managed Grafana instance. If you are not able to do so, assign yourself the Grafana Admin role under the Access control pane of the Grafana resource in Azure.

7. Next, configure a job and ConfigMap for Prometheus to scrape metrics from Istio. Download the configmap file prometheus-config from here.

# Create the configmap holding the scrape job for istio metrics
kubectl create configmap ama-metrics-prometheus-config --from-file=prometheus-config -n kube-system

Wait for about 10-15 minutes and then verify whether Istio metrics are being scraped from your cluster. Go to the Prometheus resource on Azure -> Metrics on the left pane -> select "istio_requests_total" and run the query. You should see data appear.

8. Import the Istio Grafana dashboards into your managed Grafana instance. To do this, first find out the version of Istio running on your cluster.

# Get the Istio version installed, for importing the matching dashboards
az aks show --resource-group $MY_RESOURCE_GROUP_NAME --name $MY_AKS_CLUSTER_NAME --query 'serviceMeshProfile.istio.revisions'

Sample output:

After this, go to the following dashboards and download the version matching your Istio version:

Istio Mesh Dashboard | Grafana Labs
Istio Control Plane Dashboard | Grafana Labs
Istio Service SLO Demo | Grafana Labs (only one version is available here)

For each dashboard downloaded above, click on Dashboards in Grafana and then the New -> Import option in the top-right corner. After clicking Import, upload the downloaded JSON file of the dashboard and click Import. Remember to select Azure managed Prometheus as the data source before importing. After this, you should see Istio metrics displayed on the Grafana dashboards.
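Beyond the pre-built dashboards, the standard Istio metrics can also be queried directly in Grafana Explore or against the Azure Monitor workspace. For example, queries along these lines (using Istio's standard metric and label names) show request rate, error ratio, and latency per destination service:

# Per-service request rate over the last 5 minutes
sum by (destination_service_name) (rate(istio_requests_total[5m]))

# 5xx error ratio per destination service
sum by (destination_service_name) (rate(istio_requests_total{response_code=~"5.."}[5m]))
  / sum by (destination_service_name) (rate(istio_requests_total[5m]))

# 90th percentile request latency (milliseconds)
histogram_quantile(0.9, sum by (le, destination_service_name) (rate(istio_request_duration_milliseconds_bucket[5m])))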
9. Now that you have exported Istio metrics and created dashboards, we need a way to visualize traffic flow graphs in AKS. This is critical because, with a complex service mesh, you need to understand how your traffic is flowing. The standard way to do this is with Kiali or Jaeger, which are currently not supported with the Istio add-on for AKS. We will instead use Hubble, an eBPF-based observability tool from the Cilium project that captures network flows at Layer 3, which makes it very efficient. Hubble is brought to non-Cilium AKS clusters through Retina, which is available via the Advanced Container Networking Services (ACNS) add-on. You can download hubble-ui.yaml from here.

# Enable ACNS for the AKS cluster
az aks update --resource-group $MY_RESOURCE_GROUP_NAME --name $MY_AKS_CLUSTER_NAME --enable-acns

# Set up the Hubble UI
kubectl apply -f hubble-ui.yaml
kubectl -n kube-system port-forward svc/hubble-ui 12000:80

Sample output:

Navigate to http://localhost:12000 in your browser to open the Hubble UI.

Conclusion

We have learned how to configure observability for Istio metrics using managed Prometheus and Grafana on AKS and how to visualize network flows using Hubble. You can find the commands and yaml files used in this article here. Let me know in the comments if you face any issues during this implementation. Thank you for reading this article! Happy learning!