Azure AI Foundry vs. Azure Databricks – A Unified Approach to Enterprise Intelligence
Key Insights into Azure AI Foundry and Azure Databricks Complementary Powerhouses: Azure AI Foundry is purpose-built for generative AI application and agent development, focusing on model orchestration and rapid prototyping, while Azure Databricks excels in large-scale data engineering, analytics, and traditional machine learning, forming the data intelligence backbone. Seamless Integration for End-to-End AI: A critical native connector allows AI agents developed in Foundry to access real-time, governed data from Databricks, enabling contextual and data-grounded AI solutions. This integration facilitates a comprehensive AI lifecycle from data preparation to intelligent application deployment. Specialized Roles for Optimal Performance: Enterprises leverage Databricks for its robust data processing, lakehouse architecture, and ML model training capabilities, and then utilize AI Foundry for deploying sophisticated generative AI applications, agents, and managing their lifecycle, ensuring responsible AI practices and scalability. In the rapidly evolving landscape of artificial intelligence, organizations seek robust platforms that can not only handle vast amounts of data but also enable the creation and deployment of intelligent applications. Microsoft Azure offers two powerful, yet distinct, services in this domain: Azure AI Foundry and Azure Databricks. While both contribute to an organization's AI capabilities, they serve different primary functions and are designed to complement each other in building comprehensive, enterprise-grade AI solutions. Decoding the Core Purpose: Foundry for Generative AI, Databricks for Data Intelligence At its heart, the distinction between Azure AI Foundry and Azure Databricks lies in their core objectives and the types of workloads they are optimized for. Understanding these fundamental differences is crucial for strategic deployment and maximizing their combined potential. 
Azure AI Foundry: The Epicenter for Generative AI and Agents Azure AI Foundry emerges as Microsoft's unified platform specifically engineered for the development, deployment, and management of generative AI applications and AI agents. It represents a consolidation of capabilities from what were formerly Azure AI Studio and Azure OpenAI Studio. Its primary focus is on accelerating the entire lifecycle of generative AI, from initial prototyping to large-scale production deployments. Key Characteristics of Azure AI Foundry: Generative AI Focus: Foundry streamlines the development of large language models (LLMs) and customized generative AI applications, including chatbots and conversational AI. It emphasizes prompt engineering, Retrieval-Augmented Generation (RAG), and agent orchestration. Extensive Model Catalog: It provides access to a vast catalog of over 11,000 foundation models from various publishers, including OpenAI, Meta (Llama 4), Mistral, and others. These models can be deployed via managed compute or serverless API deployments, offering flexibility and choice. Agentic Development: A significant strength of Foundry is its support for building sophisticated AI agents. This includes tools for grounding agents with knowledge, tool calling, comprehensive evaluations, tracing, monitoring, and guardrails to ensure responsible AI practices. Foundry Local further extends this by allowing offline and on-device development. Unified Development Environment: It offers a single management grouping for agents, models, and tools, promoting efficient development and consistent governance across AI projects. Enterprise Readiness: Built-in capabilities such as Role-Based Access Control (RBAC), observability, content safety, and project isolation ensure that AI applications are secure, compliant, and scalable for enterprise use. Figure 1: Conceptual Architecture of Azure AI Foundry illustrating its various components for AI development and deployment. 
Azure Databricks: The Powerhouse for Data Engineering, Analytics, and Machine Learning Azure Databricks, on the other hand, is an Apache Spark-based data intelligence platform optimized for large-scale data engineering, analytics, and traditional machine learning workloads. It acts as a collaborative workspace for data scientists, data engineers, and ML engineers to process, analyze, and transform massive datasets, and to build and deploy diverse ML models. Key Characteristics of Azure Databricks: Unified Data Analytics Platform: Central to Databricks is its lakehouse architecture, built on Delta Lake, which unifies data warehousing and data lakes. This provides a single platform for data engineering, SQL analytics, and machine learning. Big Data Processing: Excelling in distributed computing, Databricks is ideal for processing large datasets, performing ETL (Extract, Transform, Load) operations, and real-time analytics at scale. Comprehensive ML and AI Workflows: It offers a specialized environment for the full ML lifecycle, including data preparation, feature engineering, model training (both classic and deep learning), and model serving. Tools like MLflow are integrated for tracking, evaluating, and monitoring ML models. Data Intelligence Features: Databricks includes AI-assistive features such as Databricks Assistant and Databricks AI/BI Genie, which enable users to interact with their data using natural language queries to derive insights. Unified Governance with Unity Catalog: Unity Catalog provides a centralized governance solution for all data and AI assets within the lakehouse, ensuring data security, lineage tracking, and access control. Figure 2: The Databricks Data Intelligence Platform with its unified approach to data, analytics, and AI. 
The Symbiotic Relationship: Integration and Complementary Use Cases

While distinct in their primary functions, Azure AI Foundry and Azure Databricks are explicitly designed to work together, forming an integrated ecosystem for end-to-end AI development and deployment. This synergy is key to building advanced, data-driven AI solutions in the enterprise.

Seamless Integration for Enhanced AI Capabilities

The integration between the two platforms is a cornerstone of Microsoft's AI strategy, enabling AI agents and generative applications to be grounded in high-quality, governed enterprise data.

Key Integration Points:

Native Databricks Connector in AI Foundry: A significant development in 2025 is the public preview of a native connector that allows AI agents built in Azure AI Foundry to directly query real-time, governed data from Azure Databricks. Foundry agents can use Databricks AI/BI Genie to surface data insights and even trigger Databricks Jobs, providing highly contextual and domain-aware responses.

Data Grounding for AI Agents: This integration lets AI agents access structured and unstructured data processed and stored in Databricks, providing the context and knowledge base needed for more accurate and relevant generative AI outputs. All interactions are auditable within Databricks, maintaining governance and security.

Model Crossover and Availability: Foundation models, such as the Llama 4 family, are made available across both platforms. Databricks DBRX models can also appear in the Foundry model catalog, allowing flexibility in where models are trained, deployed, and consumed.

Unified Identity and Governance: Both platforms use Microsoft Entra ID (formerly Azure Active Directory) for authentication and access control, and Unity Catalog provides unified governance for data and AI assets managed by Databricks, which Foundry agents can then respect.
Here's a breakdown of how a typical flow might look: Mindmap 1: Illustrates the complementary roles and integration points between Azure Databricks and Azure AI Foundry within an end-to-end AI solution. When to Use Which (and When to Use Both) Choosing between Azure AI Foundry and Azure Databricks, or deciding when to combine them, depends on the specific requirements of your AI project: Choose Azure AI Foundry When You Need To: Build and deploy production-grade generative AI applications and multi-agent systems. Access, evaluate, and benchmark a wide array of foundation models from various providers. Develop AI agents with sophisticated capabilities like tool calling, RAG, and contextual understanding. Implement enterprise-grade guardrails, tracing, monitoring, and content safety for AI applications. Rapidly prototype and iterate on generative AI solutions, including chatbots and copilots. Integrate AI agents deeply with Microsoft 365 and Copilot Studio. Choose Azure Databricks When You Need To: Perform large-scale data engineering, ETL, and data warehousing on a unified lakehouse. Build and train traditional machine learning models (supervised, unsupervised learning, deep learning) at scale. Manage and govern all data and AI assets centrally with Unity Catalog, ensuring data quality and lineage. Conduct complex data analytics, business intelligence (BI), and real-time data processing. Leverage AI-assistive tools like Databricks AI/BI Genie for natural language interaction with data. Require high-performance compute and auto-scaling for data-intensive workloads. Use Both for Comprehensive AI Solutions: The most powerful approach for many enterprises is to leverage both platforms. Azure Databricks can serve as the robust data backbone, handling data ingestion, processing, governance, and traditional ML model training. 
Azure AI Foundry then sits atop this foundation, consuming the prepared and governed data to build, deploy, and manage intelligent generative AI agents and applications. This allows for:

Domain-Aware AI: Foundry agents are grounded in enterprise-specific data from Databricks, leading to more accurate, relevant, and trustworthy AI responses.

End-to-End AI Lifecycle: Databricks manages the "data intelligence" part, and Foundry handles the "generative AI application" part, covering the entire spectrum from raw data to intelligent user experience.

Optimized Resource Utilization: Each platform focuses on what it does best, leading to more efficient resource allocation and specialized toolsets for different stages of the AI journey.

Comparative Analysis: Features and Capabilities

To further illustrate their distinct yet complementary nature, let's examine a detailed comparison of their features, capabilities, and typical user bases.

Radar Chart 1: This chart visually compares Azure AI Foundry and Azure Databricks across several key dimensions, illustrating their specialized strengths. Azure AI Foundry excels in generative AI and agent orchestration, while Azure Databricks dominates in data engineering, unified data governance, and traditional ML workflows.

A Detailed Feature Comparison

Primary Focus
  Azure AI Foundry: Generative AI application & agent development, model orchestration.
  Azure Databricks: Large-scale data engineering, analytics, traditional ML, and AI workflows.

Data Handling
  Azure AI Foundry: Connects to diverse data sources (e.g., Databricks, Azure AI Search) for grounding AI agents. Not a primary data storage/processing platform.
  Azure Databricks: Native data lakehouse architecture (Delta Lake), optimized for big data processing, storage, and real-time analytics.

AI/ML Capabilities
  Azure AI Foundry: Foundation models (LLMs), prompt engineering, RAG, agent orchestration, model evaluation, content safety, responsible AI tooling.
  Azure Databricks: Traditional ML (supervised/unsupervised), deep learning, feature engineering, MLflow for lifecycle management, Databricks AI/BI Genie.

Development Style
  Azure AI Foundry: Low-code agent building, prompt flows, unified SDK/API, templates.
  Azure Databricks: Code-first (Python, SQL, Scala, R), notebooks, IDE integrations.

Model Access & Deployment
  Azure AI Foundry: Extensive model catalog (11,000+ models), serverless API, managed compute deployments, model benchmarking.
  Azure Databricks: Training and serving custom ML models, including deep learning. Models available for deployment through MLflow.

Governance & Security
  Azure AI Foundry: Azure-based security & compliance, RBAC, project isolation, content safety guardrails, tracing, evaluations.
  Azure Databricks: Unity Catalog for unified data & AI governance, lineage tracking, access control, Entra ID integration.

Key Users
  Azure AI Foundry: AI developers, business analysts, citizen developers, AI app builders.
  Azure Databricks: Data scientists, data engineers, ML engineers, data analysts.

Integration Points
  Azure AI Foundry: Native connector to Databricks AI/BI Genie, Azure AI Search, Microsoft 365, Copilot Studio, Power Platform.
  Azure Databricks: Microsoft Fabric, Power BI, Azure AI Foundry, Microsoft Purview (formerly Azure Purview), Azure Monitor, Azure Key Vault.

Table 1: A comparative overview of the distinct features and functionalities of Azure AI Foundry and Azure Databricks

Concluding Thoughts

In essence, Azure AI Foundry and Azure Databricks are not competing platforms but rather essential components of a unified, comprehensive AI strategy within the Azure ecosystem. Azure Databricks provides the robust, scalable foundation for all data engineering, analytics, and traditional machine learning workloads, acting as the "data intelligence platform." Azure AI Foundry then builds on this foundation to specialize in the rapid development, deployment, and operationalization of generative AI applications and intelligent agents. Together, they enable enterprises to transform raw data into domain-aware, governed intelligent solutions.
Frequently Asked Questions (FAQ)

What is the main difference between Azure AI Foundry and Azure Databricks?
Azure AI Foundry is specialized for building, deploying, and managing generative AI applications and AI agents, focusing on model orchestration and prompt engineering. Azure Databricks is a data intelligence platform for large-scale data engineering, analytics, and traditional machine learning, built on a lakehouse architecture.

Can Azure AI Foundry and Azure Databricks be used together?
Yes, they are designed to work synergistically. Azure AI Foundry can leverage a native connector to access real-time, governed data from Azure Databricks, allowing AI agents to be grounded in enterprise data for more accurate and contextual responses.

Which platform should I choose for training large machine learning models?
For training large-scale traditional machine learning and deep learning models, Azure Databricks is generally the preferred choice due to its robust capabilities for data processing, feature engineering, and ML lifecycle management (MLflow). Azure AI Foundry focuses more on the deployment and orchestration of pre-trained foundation models and generative AI applications.

Does Azure AI Foundry replace Azure Machine Learning or Databricks?
No, Azure AI Foundry complements these services. It provides a specialized environment for generative AI and agent development, often integrating with data and models managed by Azure Databricks or Azure Machine Learning for comprehensive AI solutions.

How do these platforms handle data governance?
Azure Databricks utilizes Unity Catalog for unified data and AI governance, providing centralized control over data access and lineage. Azure AI Foundry integrates with Azure-based security and compliance features, ensuring responsible AI practices and data privacy within its generative AI applications.

Application Gateway for Containers – A New Way to Ingress into AKS
Introduction

If you’re using Azure Kubernetes Service (AKS), you will need a mechanism for accepting and routing HTTP/S traffic to applications running in your AKS cluster. Until recently, this was typically handled by Azure’s Application Gateway Ingress Controller (AGIC) or another Ingress product such as NGINX. With the introduction of the upstream Kubernetes Gateway API project, there’s now a more evolved solution for ingress traffic management. This article discusses Application Gateway for Containers (AGC), Azure’s latest load balancing solution that implements Gateway API. This post is not a tutorial on how to deploy AGC, but it will address the following:

What is Gateway API and why is it needed?
How does AGC work?
How are high availability and resiliency incorporated into AGC?
What AGC is not

The goal is that you will come away with an understanding of the inner workings of AGC and how it ties into the AKS environment. Let’s get started!

Gateway API Overview

Before the introduction of Gateway API, Ingress API was the de facto method for routing Layer 7 traffic to applications running in Kubernetes. It provides a simple routing process for HTTP/S traffic but has limitations. For instance, it requires the use of vendor-specific annotations for the following:

URL rewriting or header modification
Routing for gRPC, TCP or UDP based traffic

To address these limitations, the Kubernetes Network Special Interest Group (SIG) introduced Gateway API. It consists of a collection of Custom Resource Definitions (CRDs) which extend the Kubernetes API to allow for the creation of custom resources. Gateway API is a more flexible, portable and extensible solution in comparison to its Ingress predecessor.
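To make the annotation limitation concrete, the sketch below expresses the same path-prefix rewrite twice: first as an Ingress that depends on an ingress-nginx annotation, then as a Gateway API HTTPRoute that uses the standard URLRewrite filter. All resource names (shop-ingress, shop-route, shop-svc, example-gateway) are hypothetical and chosen only for illustration; treat this as a minimal sketch rather than a configuration taken from the article.

```yaml
# Ingress API: URL rewriting depends on a controller-specific annotation.
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: shop-ingress                      # hypothetical name
  annotations:
    # Understood only by the ingress-nginx controller
    nginx.ingress.kubernetes.io/rewrite-target: /$2
spec:
  ingressClassName: nginx
  rules:
  - http:
      paths:
      - path: /shop(/|$)(.*)
        pathType: ImplementationSpecific
        backend:
          service:
            name: shop-svc                # hypothetical backend service
            port:
              number: 8080
---
# Gateway API: the same intent expressed with a standard filter.
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: shop-route                        # hypothetical name
spec:
  parentRefs:
  - name: example-gateway                 # hypothetical Gateway in the same namespace
  rules:
  - matches:
    - path:
        type: PathPrefix
        value: /shop
    filters:
    - type: URLRewrite
      urlRewrite:
        path:
          type: ReplacePrefixMatch
          replacePrefixMatch: /
    backendRefs:
    - name: shop-svc
      port: 8080
```

The HTTPRoute carries no controller-specific annotation, so the rewrite rule is not tied to a single implementation, which is the kind of portability Gateway API aims for.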
Gateway API consists of three components:

Gateway Class: provides a standard on how Gateway objects should be configured and behave
Gateway: an instantiation of a Gateway Class that implements its configuration
Routes: define routing and protocol-based rules that are mapped to Kubernetes backend services

As seen in Fig.1.1, the relative independence of each component in Gateway API allows for a separation-of-concerns resource model. For example, developers can focus on creating routes for their apps while platform teams manage the gateway resources that those routes use. The other benefit is the portability of routes: routes created in AKS can be used with Gateway API deployments in other environments. This flexibility is not possible with Ingress API, due to a lack of standardization across different Ingress controller implementations.

Application Gateway for Containers Overview

Not to be confused with Application Gateway, Application Gateway for Containers is a load balancing product designed to manage layer 7 traffic intended for applications running in AKS. It supports advanced routing capabilities by leveraging components that bootstrap Gateway API into AKS.

The above figure is an illustration of AGC, AKS and how they work together to manage incoming traffic. Let’s break down the diagram in detail to get a better understanding of AGC.

The Application Gateway for Containers Frontend serves as the public entry point for client traffic. It is a child resource of AGC and is given an auto-generated FQDN during setup. To avoid using that FQDN directly, you can map a CNAME record to it for DNS resolution. You can also have multiple Frontend resources (up to 5) in a single AGC instance.

The Association child resource is the point of entry into the AKS cluster and defines where the proxy components live. In the above figure, you will notice a subnet linked to it, which is established via subnet delegation. This is a dedicated subnet that’s also used by the proxy components which send traffic to destination AKS pods. The ALB Controller (which will be described shortly) deploys the proxies into the subnet.

Here’s a view of the ALB Controller subnet. It must provide a /24 or larger address space (a prefix length of /24 or shorter) and cannot be used for any other resources. In this case, the ALB subnet is deployed within the AKS Virtual Network (VNet); however, this is not a requirement. It can be in a separate VNet that is peered with the AKS virtual network.

So, we’ve determined how traffic flows from the AGC frontend resource to the proxy components. But two questions remain:

1) How do the proxy components know which backend services are intended for the incoming request?
2) How is Gateway API leveraged by AGC to utilize advanced routing patterns?

This is where the ALB Controller comes into play. Before creating the AGC instance, the ALB Controller is deployed into AKS. It’s responsible for monitoring HTTP route and Gateway resource configurations. As you can see in the above figure, the ALB Controller runs as three pods in AKS: two controller pods and one for bootstrapping. The ALB Controller pods have a direct connection to AGC and are responsible for replicating resource configurations to it. To accomplish this, a federated Managed Identity is used which has the AppGW for Containers Configuration Manager role on the AGC Resource Group. The ALB Controller also uses this Managed Identity to provision AGC. Alternatively, you can create your own AGC resource via Azure portal, CLI, PowerShell or Infrastructure as Code (IaC). The latter deployment method is done through Azure Resource Manager (ARM).

By default, the bootstrap pod is how Gateway API is installed. However, you can disable this behavior by setting the albController.installGatewayApiCRDs parameter to false when you install the ALB Controller using Helm. In Fig.1.8, a kubectl describe command is executed against the bootstrap pod to display its specs. You will notice that an Init container applies the Gateway API CRDs into AKS. Init containers are used to perform initialization tasks that must precede the startup of a main application container.

Fig.1.9. Gateway Class object definition output

Recall from earlier that Gateway API consists of three resources: Gateway Class, Gateway resource and Routes. The ALB Controller will create a Gateway Class object with the name azure-alb-external, as shown above.

Fig.1.10. Gateway Resource and HTTPRoute configuration files
Fig.1.11. Diagram of traffic splitting between backends

The final steps to complete the puzzle are to deploy a Gateway resource, which listens for traffic over a protocol/port combination, and a Route to define how traffic coming via the Gateway maps to backend services. The Gateway definition has a gatewayClassName spec that references the name of the Gateway Class. In the above example, it listens for HTTP traffic on port 80. There’s a corresponding HTTPRoute config that splits the traffic across two backend services: backend-v1 receives 50% of the traffic on port 8080 and backend-v2 receives the remaining traffic on the same port.

High Availability & Resiliency in AGC

When you create an Application Gateway for Containers resource, it’s automatically deployed across Availability Zones within the selected region. An Availability Zone (AZ) is a physically unique group of one or more datacenters. Its purpose is to provide intra-regional resiliency at the datacenter level. There are typically three AZs in a region where they are supported. Therefore, if one datacenter in the region goes down, AGC is not impacted. If Availability Zones aren’t supported in the selected region, fault and update domains in the form of Availability Sets will be leveraged to mitigate against outages and maintenance events. This link provides a list of Azure regions that support Availability Zones.
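Returning briefly to the Gateway and HTTPRoute pair described earlier (Fig.1.10 and Fig.1.11), the configuration can be sketched roughly as follows. The gatewayClassName azure-alb-external, the HTTP listener on port 80, and the 50/50 split between backend-v1 and backend-v2 on port 8080 come from the article; the resource names, the test-infra namespace, and the two alb.networking.azure.io annotations (which assume an ALB Controller managed deployment with an ApplicationLoadBalancer resource named alb-test) are illustrative assumptions.

```yaml
apiVersion: gateway.networking.k8s.io/v1
kind: Gateway
metadata:
  name: gateway-01                        # hypothetical name
  namespace: test-infra                   # hypothetical namespace
  annotations:
    # Assumption: AGC is managed by the ALB Controller via an
    # ApplicationLoadBalancer resource with this name and namespace.
    alb.networking.azure.io/alb-name: alb-test
    alb.networking.azure.io/alb-namespace: alb-test-infra
spec:
  gatewayClassName: azure-alb-external    # Gateway Class created by the ALB Controller
  listeners:
  - name: http
    protocol: HTTP
    port: 80
    allowedRoutes:
      namespaces:
        from: Same
---
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: traffic-split-route               # hypothetical name
  namespace: test-infra
spec:
  parentRefs:
  - name: gateway-01                      # attaches the route to the Gateway above
  rules:
  - backendRefs:
    - name: backend-v1
      port: 8080
      weight: 50                          # half of the traffic
    - name: backend-v2
      port: 8080
      weight: 50                          # the remaining half
```

Weights are relative, so shifting traffic gradually (for example 90/10, then 50/50) only requires editing the two weight values on the route; the Gateway itself is left alone.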
To mitigate against regional outages, you can leverage Azure Front Door or Traffic Manager with AGC. Azure Front Door is a Layer 7 routing service that load-balances incoming traffic across regions. It provides content delivery network (CDN), web application firewall (WAF), SSL termination and other capabilities for HTTP/HTTPS traffic. Traffic Manager, by contrast, uses DNS to direct client requests to the appropriate endpoint based on a specified routing method such as priority, performance, weight or others.

What AGC is Not

Application Gateway for Containers is not a replacement for Application Gateway. Rather, it’s a new service within the family of Azure load balancing services. Although AGC doesn’t currently have Web Application Firewall (WAF) capabilities like Application Gateway, the feature is currently in private preview and will soon be available. Lastly, AGC is designed specifically for routing requests to containerized applications running in AKS. Unlike Application Gateway, it does not serve backend targets such as Azure App Services, VMs, and Virtual Machine Scale Sets (VMSS).

Conclusion

Over time, it became evident that a new way of managing ingress traffic for containerized workloads was needed. The initial implementations for ingress traffic management were sufficient for simple routing requests but lacked native support for advanced routing needs. In this article, we discussed Microsoft Azure’s newest load balancing solution, Application Gateway for Containers, which builds on the Gateway API for Kubernetes. We explored the components of AGC and how it manages traffic, and addressed potential misconceptions about it. For some additional resources, check out the following:

What is Application Gateway for Containers? | Microsoft Learn
Gateway API | Kubernetes
Introduction - Kubernetes Gateway API
AGC Supported Regions

Nginx Ingress controller integration with Istio Service Mesh
Introduction

Nginx (pronounced "engine x") is an HTTP web server, reverse proxy, content cache, load balancer, TCP/UDP proxy server, and mail proxy server. It is one of the most common ingress controllers (the component that admits external traffic into the cluster) used in Kubernetes. I have discussed Istio service mesh in my previous article here: Istio Service Mesh Observability in AKS | Microsoft Community Hub. Setting up the nginx ingress controller with Istio service mesh requires custom configuration and is not as straightforward as using the in-house ingress from Istio. One of my customers faced this issue and I was able to resolve it using the configuration we will discuss in this article. Not all customers can migrate to Istio Ingress when enabling service mesh, as they might already have many dependencies on existing ingress rules as well as enterprise agreements with Ingress providers. The main problem with having both the nginx ingress controller and Istio service mesh in the same Kubernetes cluster arises when mTLS is enforced strictly by Istio.

TLS vs mTLS

Usually when we communicate with a server, we use TLS, in which only the server’s identity is verified using a certificate. The client is verified using secondary methods like username-password, tokens etc. With distributed attacks increasing in the age of AI, it is critical to implement cryptographically verifiable identities for clients as well. Mutual TLS, or mTLS, is based on this Zero Trust mindset. With mTLS both client and server present a verifiable certificate, which makes man-in-the-middle attacks extremely difficult. Enabling mTLS is one of the primary use cases of Istio service mesh in a Kubernetes cluster.

Sidecar in Istio

Sidecars are secondary containers which get injected into the pod alongside its main containers. The Istio sidecar acts like a proxy and intercepts all the incoming and outgoing traffic to the application container unless explicitly specified. The sidecar is how Istio implements its traffic management functionality in the service mesh. In the future there will be an option to operate Istio without sidecars using Ambient mode, which is still in development for the Istio add-on for AKS at the time of writing this article.

Root cause

In the above diagram you can see that Istio sidecar injection is enabled in the application pod namespace but not in the ingress controller namespace. Traffic enters the ingress controller through the internal load balancer exposed by AKS. This traffic is HTTPS (TLS) and is TLS-terminated at the ingress controller. This is usual, because Nginx cannot perform many of its functions, such as path and header-based routing, unless it decrypts the traffic. Therefore, traffic going towards the application pods is plain HTTP. Since mTLS is strictly enforced in the service mesh, it will only accept mTLS traffic, so this traffic gets dropped and the user gets a 502 Bad Gateway error from Nginx. Even if the traffic is re-encrypted before being sent to the application pods, which Nginx supports, the request will still get dropped, as Istio allows only mTLS and not plain TLS.

Solution

To solve this problem, we follow these steps:

1. Enable sidecar injection in Ingress controller namespace: First we will enable sidecar injection in the ingress controller namespace, so that traffic leaving the ingress controller pods is mTLS.

2. Exempt external inbound traffic from sidecar: mTLS is only understood within the AKS cluster, so external traffic has to bypass the Istio proxy container and go directly to the nginx container. If we don’t do this, Istio will expect external traffic to also be mTLS and will drop it. After traffic enters Nginx, Nginx decrypts it and sends it out, at which point it is intercepted by the istio-proxy sidecar and encrypted with mTLS.

3. Send traffic to application service instead of pods directly: By default, nginx sends traffic directly to application pods, as you can see in the root cause diagram. If we continue doing that, Istio will not consider this traffic to be mesh traffic and will drop it. Therefore, for Istio to allow this traffic as part of the mesh, we have to direct it through the application service. After this is done, Istio allows this traffic to go through to the application pods.

There are some additional configurations which we will discuss in the detailed steps below.

Steps to integrate Nginx Ingress Controller with Istio Service mesh

For details on setting up the AKS cluster, enabling Istio and installing the demo application, check out my prior article: Istio Service Mesh Observability in AKS | Microsoft Community Hub, steps 1 through 4. The steps below assume that you already have an AKS cluster set up with Istio service mesh installed. The demo application should also be installed as discussed in my previous article.

1. Enable mTLS strict mode for the entire service mesh. This enforces mTLS in all namespaces where Istio sidecar injection is enabled.

    # Enable mTLS for the entire service mesh
    kubectl apply -n aks-istio-system -f - <<EOF
    apiVersion: security.istio.io/v1
    kind: PeerAuthentication
    metadata:
      name: global-mtls
      namespace: aks-istio-system
    spec:
      mtls:
        mode: STRICT
    EOF

2. Install Nginx ingress controller if not installed already in your AKS cluster.
# Namespace where you want to install the ingress-nginx controller NAMESPACE=ingress-basic # Add nginx helm repo to your repositories helm repo add ingress-nginx https://kubernetes.github.io/ingress-nginx helm repo update # Install Nginx Ingress Controller with annotation for Azure Load Balancer and externalTrafficPolicy set to Local # This is important for the health probe to work correctly with the Azure Load Balancer helm install ingress-nginx ingress-nginx/ingress-nginx \ --create-namespace \ --namespace $NAMESPACE \ --set controller.service.annotations."service\.beta\.kubernetes\.io/azure-load-balancer-health-probe-request-path"=/healthz \ --set controller.service.externalTrafficPolicy=Local 3. Create Ingress object in Application namespace: You need to create an Ingress object to allow nginx to route traffic to your pods. Kindly refer nginx-ingress-before.yaml # Apply Ingress Resource for the sample application kubectl apply -f ./nginx-ingress-before.yaml -n default 4. Validate if you are able to access sample app using nginx ingress created: We will get the external IP of the ingress controller service that is of type LoadBalancer. # Get external IP for the service kubectl get services -n ingress-basic You will get an output as shown below: Now copy the IP from above and access http://<external-ip>/test in your browser. You will notice that nginx is throwing 502 Bad Gateway error. This is because it was not able to reach the application pods and get a response as istio-proxy dropped the requests as it was not mTLS. Following steps will fix this issue: 5. Enable sidecar injection in ingress controller namespace : For pods to understand traffic from nginx, it has to be sent with mTLS from istio side. To make this possible we have to enable sidecar injection in nginx ingress controller namespace. 
Label the namespace with the installed Istio revision, then restart the ingress controller deployment so that sidecars are injected into the nginx ingress controller pods:

# Get the istio version installed on the AKS cluster
az aks show --resource-group $MY_RESOURCE_GROUP_NAME --name $MY_AKS_CLUSTER_NAME --query 'serviceMeshProfile.istio.revisions'

# Label namespace with appropriate istio version to enable sidecar injection
kubectl label namespace <ingress-controller-namespace> istio.io/rev=asm-1-<version>

# Restart nginx ingress controller deployment so that sidecars can be injected into the pods
kubectl rollout restart deployment/ingress-nginx-controller -n ingress-basic

6. Exempt external inbound traffic from the sidecar: This is required because mTLS is only understood within the AKS cluster and is not meant for external traffic. Now that a sidecar is injected into nginx, external traffic has to bypass istio-proxy; otherwise it is dropped for not being mTLS (it is only TLS). To do this, add the following annotations:

# Edit nginx controller deployment
kubectl edit deployments -n ingress-basic ingress-nginx-controller

# Disable all inbound port redirection to proxy (an empty string for this property achieves that)
traffic.sidecar.istio.io/includeInboundPorts: ""
# Explicitly exclude the inbound ports on which the cluster is exposed externally, so that this traffic bypasses istio-proxy redirection and goes directly to the ingress controller pods
traffic.sidecar.istio.io/excludeInboundPorts: "80,443"

Do not exit edit mode yet, as there is one more annotation to add below.

7. Allow connection between Nginx ingress controller and API server: With mTLS now enforced for nginx, the controller can no longer communicate with the Kubernetes API server, which it needs in order to monitor changes in Ingress resources and update the NGINX configuration dynamically. Therefore, we need to exempt the Kubernetes API server IP from mTLS traffic.
# Query kubernetes API server IP
kubectl get svc kubernetes -o jsonpath='{.spec.clusterIP}'

# Add annotation to ingress controller
traffic.sidecar.istio.io/excludeOutboundIPRanges: "KUBE_API_SERVER_IP/32"

The problem with this approach is that AKS does not guarantee a static IP for the API server, as it is managed by the platform. The API server IP usually changes during cluster restart or reprovisioning, but changes are not guaranteed to be limited to those events. It can take any IP from the service CIDR, which is a /16 CIDR unless configured explicitly. One option is to give the API server a dedicated subnet using the VNet integration feature, but this feature is currently in preview with tentative GA in Q2 2025: API Server VNet Integration in Azure Kubernetes Service (AKS) - Azure Kubernetes Service. After enabling this feature, the API server always takes an IP from the allocated subnet, which can then be specified in the annotation above.

In the final deployment yaml for the nginx ingress controller, note that the annotations are added under the pod template (spec.template.metadata.annotations) and not at the deployment level.

8. Route traffic to the Istio sidecar once it enters the ingress object: By default, nginx sends traffic to the upstream PodIP and port combination. With mTLS enabled, Istio does not recognize this as mesh traffic and drops it. It is therefore important to change this behavior and send traffic to the exposed service instead of directly to the backend pod. This is done with the annotations below; see the sample nginx-ingress-after.yaml:

# Setup nginx to send traffic to upstream service instead of PodIP and port
nginx.ingress.kubernetes.io/service-upstream: "true"
# Specify the service fqdn where to route the traffic (this is the service that exposes the application pods)
nginx.ingress.kubernetes.io/upstream-vhost: <service>.<namespace>.svc.cluster.local

# Apply Ingress Resource for the sample application
kubectl apply -f ./nginx-ingress-after.yaml -n default

9.
Configure the ingress’s sidecar to route traffic to services in the mesh: This is only needed if the Ingress object is in a different namespace from the services it routes traffic to. We do not need it here, as our ingress and application service are in the same namespace. Sidecars know how to route traffic to services in their own namespace; to route traffic to a different namespace, you need to allow it in the sidecar configuration, which can be done using Sidecar.yaml.

# Apply Sidecar yaml in the namespace where ingress object is deployed
kubectl apply -f Sidecar.yaml -n <ingress-object-namespace>

Validate that the application is accessible: The application should now load at http://<external-ip>/test in your browser.

Conclusion

That’s it. Once the steps above are followed, traffic should flow as expected between the mTLS-enforced service mesh and the nginx ingress controller. You can find all the commands and yaml files from this article here. Let me know in the comments below if you have any questions or face any issues integrating the nginx ingress controller with the Istio service mesh.

Istio Service Mesh Observability in AKS
Introduction

A service mesh is a dedicated infrastructure layer that manages service-to-service communication between microservices in a distributed system, providing built-in security, traffic control, and observability. Istio is a powerful, open-source service mesh that simplifies managing, securing, and observing microservices communication. It joined the Cloud Native Computing Foundation (CNCF) in 2022 and has become an industry standard for service mesh operation. Azure Kubernetes Service (AKS) is a managed Kubernetes service provided by Microsoft Azure. It allows you to deploy, manage, and scale containerized applications using Kubernetes, without needing extensive container orchestration expertise.

Observability in an Istio service mesh is crucial for ensuring the reliability, performance and security of microservices-based applications. Istio is a powerhouse when it comes to exposing telemetry and understanding the complex flow of traffic between applications.

This article is a step-by-step guide for enabling the Istio service mesh in AKS using the Istio addon and enabling observability using managed Prometheus and Grafana. At the end we will discuss the Advanced Container Networking Services (ACNS) addon in AKS, which enables Hubble to help visualize traffic flow within an AKS cluster / service mesh. I wanted to document the process because few articles currently cover it for AKS, and none that I could find talk about enabling Istio metrics export with mTLS enabled in an AKS cluster (at the time of writing this article 😊).

Metrics scraping architecture

The architecture diagram gives a simplified view of how Prometheus scrapes metrics in AKS. Prometheus is embedded into the Azure Monitor pods (ama-pods), which do the scraping based on the scrape configuration that is set.
Each application pod has an istio-proxy sidecar container to control traffic and collect metrics; which pods get one depends on which namespaces have sidecar auto-injection enabled or which pods are explicitly injected with a sidecar. Hubble pods also run on the cluster, using eBPF technology to scrape network flows at Layer 3. Prometheus collects all these metrics and sends them to the Azure Monitor workspace (a customized Prometheus database) via a private endpoint. The managed Grafana instance then pulls this data from the Azure Monitor workspace, again over a private endpoint.

Steps for configuring managed Prometheus, Grafana and Hubble

1. Start by logging in to the Azure CLI with your account and selecting the default subscription, then define some variables that you will use to create the resource group and the AKS cluster.

# Define variables
export MY_RESOURCE_GROUP_NAME="<your resource group name>"
export REGION="<region where you would like to deploy the cluster>"
export MY_AKS_CLUSTER_NAME="<AKS cluster name>"

# Create a resource group
az group create --name $MY_RESOURCE_GROUP_NAME --location $REGION

# Create an AKS cluster
az aks create --resource-group $MY_RESOURCE_GROUP_NAME --name $MY_AKS_CLUSTER_NAME --node-count 3 --generate-ssh-keys

Once completed, you should be able to see your AKS cluster in the Azure portal.

2. Get credentials for the AKS cluster and verify your connection.

# Get the credentials for the AKS cluster
az aks get-credentials --resource-group $MY_RESOURCE_GROUP_NAME --name $MY_AKS_CLUSTER_NAME

# Verify the connection to your cluster
kubectl get nodes

If the connection succeeds and the AKS cluster was created successfully, you should see the nodes that were created as part of your AKS cluster.

3. Enable the Istio addon for AKS (you might need to install the aks-preview extension for the Azure CLI if it is not already installed).
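The aks-preview extension mentioned in step 3 can be installed, or updated if it is already present, with the standard az extension commands. The following is a minimal sketch; the helper name ensure_aks_preview is hypothetical and not part of the original article.

```shell
# Hypothetical helper (not from the original article): make sure the
# aks-preview extension for the Azure CLI is installed and up to date.
ensure_aks_preview() {
  if az extension show --name aks-preview >/dev/null 2>&1; then
    # Extension already installed: bring it to the latest version
    az extension update --name aks-preview
  else
    # Extension missing: install it
    az extension add --name aks-preview
  fi
}
```

Calling ensure_aks_preview once before enabling the addon should be enough; the function only wraps existing az commands and changes nothing on the cluster itself.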
Then verify the installation of Istio and enable sidecar injection in the desired namespace.

# Enable istio addon on AKS cluster
az aks mesh enable --resource-group $MY_RESOURCE_GROUP_NAME --name $MY_AKS_CLUSTER_NAME

# Verify istiod (Istio control plane) pods are running successfully
kubectl get pods -n aks-istio-system

# Get the installed Istio revision (needed for the sidecar injection label)
az aks show --resource-group $MY_RESOURCE_GROUP_NAME --name $MY_AKS_CLUSTER_NAME --query 'serviceMeshProfile.istio.revisions'

Based on the output of the above command, use the appropriate label to enable sidecar injection. Below, “default” is the namespace where I am enabling sidecar injection.

kubectl label namespace default istio.io/rev=asm-1-22

4. Deploy the sample application and verify its deployment.

# Deploy sample application
kubectl apply -f https://raw.githubusercontent.com/istio/istio/release-1.18/samples/bookinfo/platform/kube/bookinfo.yaml

# Verify services and pods
kubectl get services
kubectl get pods

# Forward the productpage service to local port 12002
kubectl port-forward svc/productpage 12002:9080

In the output of “kubectl get pods” you will notice that each pod shows 2 containers under the READY column. This is because you enabled sidecar injection in the default namespace; the 2nd container in each pod is the istio-proxy container. After port forwarding your app to local port 12002, you should be able to access it at http://localhost:12002

5. Enable mTLS in your service mesh. This is one of the most important use cases of Istio: it lets you enforce mTLS so that only mTLS traffic is allowed in your mesh, improving your cluster security significantly.

# Enable mTLS enforcement for default namespace in the cluster (copy / paste and run the entire code block till the 2nd EOF in terminal)
kubectl apply -n default -f - <<EOF
apiVersion: security.istio.io/v1
kind: PeerAuthentication
metadata:
  name: default
spec:
  mtls:
    mode: STRICT
EOF

# Verify your policy got deployed
kubectl get peerauthentication -n default

6.
Now we will deploy managed Prometheus and Grafana and link them with the AKS cluster. This lets us visualize Prometheus-based metrics from Kubernetes on Grafana dashboards.

# Create azure monitor resource (managed prometheus resource)
export AZURE_MONITOR_NAME="<your desired name for managed prometheus resource>"
az resource create --resource-group $MY_RESOURCE_GROUP_NAME --namespace microsoft.monitor --resource-type accounts --name $AZURE_MONITOR_NAME --location $REGION --properties '{}'

# Create Azure Managed Grafana instance
export GRAFANA_NAME="<your desired name for managed grafana resource>"
az grafana create --name $GRAFANA_NAME --resource-group $MY_RESOURCE_GROUP_NAME --location $REGION

# Link Azure Monitor and Azure Managed Grafana to the AKS cluster
grafanaId=$(az grafana show --name $GRAFANA_NAME --resource-group $MY_RESOURCE_GROUP_NAME --query id --output tsv)
azuremonitorId=$(az resource show --resource-group $MY_RESOURCE_GROUP_NAME --name $AZURE_MONITOR_NAME --resource-type "Microsoft.Monitor/accounts" --query id --output tsv)
az aks update --name $MY_AKS_CLUSTER_NAME --resource-group $MY_RESOURCE_GROUP_NAME --enable-azure-monitor-metrics --azure-monitor-workspace-resource-id $azuremonitorId --grafana-resource-id $grafanaId

# Verify Azure monitor pods are running
kubectl get pods -o wide -n kube-system | grep ama-

On the Azure portal, you can check that the new resources have been created. Then open the Grafana resource and click on the instance URL to open your managed Grafana instance. If you are not able to do so, assign yourself the Grafana Admin role under the Access control pane of the Grafana resource on Azure.

7. Now you will need to configure a job and configmap for Prometheus to scrape metrics from Istio. Download the configmap prometheus-config from here.
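If the linked prometheus-config file is not at hand, a stand-in can be written by hand. The sketch below is an assumption and not the author's file: it follows the envoy-stats scrape job from Istio's sample Prometheus configuration, the job name istio-envoy-stats and the 30s interval are illustrative, and it relies on the istio-proxy sidecar exposing its metrics at /stats/prometheus on a container port whose name ends in -envoy-prom.

```shell
# Write a minimal scrape configuration to a file named prometheus-config
# (sketch only; the file linked in the article may differ).
cat > prometheus-config <<'EOF'
global:
  scrape_interval: 30s
scrape_configs:
  - job_name: istio-envoy-stats        # illustrative job name
    metrics_path: /stats/prometheus    # Envoy sidecar metrics endpoint
    kubernetes_sd_configs:
      - role: pod                      # discover every pod in the cluster
    relabel_configs:
      # Keep only container ports whose name ends in -envoy-prom,
      # i.e. the metrics port of the istio-proxy sidecar
      - source_labels: [__meta_kubernetes_pod_container_port_name]
        action: keep
        regex: '.*-envoy-prom'
EOF
```

The file is deliberately named prometheus-config so that it can be passed unchanged to the --from-file argument of the configmap command in this step. Treat the relabeling rule as a starting point and compare it against the author's file before relying on it.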
# Create job and configmap for scraping istio metrics with prometheus
kubectl create configmap ama-metrics-prometheus-config --from-file=prometheus-config -n kube-system

Wait for about 10-15 minutes and then verify whether Istio metrics are being scraped from your cluster. Go to the Prometheus resource on Azure -> Metrics on the left pane -> select “istio_requests_total” and run the query. You should see data appear after that.

8. Import the Istio Grafana dashboards into your managed Grafana instance. To do this, first find out which version of Istio is running on your cluster.

# Get Istio version Installed for importing specific dashboards
az aks show --resource-group $MY_RESOURCE_GROUP_NAME --name $MY_AKS_CLUSTER_NAME --query 'serviceMeshProfile.istio.revisions'

After this, go to the following dashboards and download the version that matches your Istio version:

Istio Mesh Dashboard | Grafana Labs
Istio Control Plane Dashboard | Grafana Labs
Istio Service SLO Demo | Grafana Labs (only 1 version is available here)

For each of the dashboards downloaded above, click on Dashboards in Grafana and choose the New -> Import option in the top right corner. Upload the downloaded json file of the dashboard and click on Import. Remember to select Azure managed Prometheus as the data source before importing. After this, you should see Istio metrics displayed on the Grafana dashboards.

9. Now that you have exported Istio metrics and created dashboards, the next step is to visualize traffic flow graphs in AKS. This is critical because, with a complex service mesh, you need to understand how your traffic is flowing. The standard way to do this is with either Kiali or Jaeger, which are currently not supported with the Istio addon for AKS. We will use Hubble, an eBPF-based technology developed by Cilium that scrapes network flows at Layer 3 (so it is more efficient).
Hubble is ported to non-Cilium AKS clusters using Retina, which is available through the Advanced Container Networking Services (ACNS) addon. You can download hubble-ui.yaml from here.

# Enable ACNS for the AKS cluster
az aks update --resource-group $MY_RESOURCE_GROUP_NAME --name $MY_AKS_CLUSTER_NAME --enable-acns

# Setup Hubble UI
kubectl apply -f hubble-ui.yaml
kubectl -n kube-system port-forward svc/hubble-ui 12000:80

Navigate to http://localhost:12000 in your browser to open the Hubble UI.

Conclusion

We have learned how to configure observability for Istio metrics using managed Prometheus and Grafana on AKS and how to visualize network flows using Hubble. You can find the commands and yaml files used in this article here. Let me know via comments if you face any issues during this implementation. Thank you for reading this article! Happy learning!