Connecting to On-Prem Apache Kafka from an Azure Synapse Spark Notebook in a DEP-Enabled and Managed VNET-based Synapse Workspace
Background
Apache Kafka is an open-source distributed event streaming platform used for high-performance data pipelines and streaming analytics with real-time to near real-time capabilities. There are many on-premises Apache Kafka implementations for various mission-critical applications.
Azure Synapse Analytics is a limitless analytics service that brings together Enterprise Data Warehousing (EDW) and Big Data Analytics (BDA). In Azure Synapse Analytics, Microsoft provides its own implementation of Apache Spark. Azure Synapse Analytics also supports provisioning the workspace inside a Managed Virtual Network (Vnet) as well as enabling Data Exfiltration Protection to enhance security.
However, if your Synapse Workspace uses a Managed Vnet and has Data Exfiltration Protection enabled, you will not be able to connect to your On-Premises data sources. Also, there is no connector available in Azure Data Factory (ADF) or Synapse Pipelines for Apache Kafka. You can, however, integrate with On-Premises Apache Kafka using either the Spark Structured Streaming API or any other library that supports the Kafka Producer/Consumer APIs.
Another option is the Managed Private Endpoint that Azure Synapse Analytics provides for connecting to the outside world from a Managed Vnet-based Synapse Workspace. Currently, however, Managed Private Endpoints do not directly support on-premises Apache Kafka either. Therefore, you will need to take a different route to connect to an On-Prem Apache Kafka cluster from Synapse Spark inside a Managed Vnet.
The easiest and most commonly suggested approach is to use Apache Kafka MirrorMaker to mirror the On-Prem Apache Kafka topics to Azure Event Hubs. You can get more details from this link: Use Apache Kafka MirrorMaker - Azure Event Hubs | Microsoft Docs. While this works fine, it adds complexity to the data pipelines and the overall data architecture. A way to connect directly to On-Prem Apache Kafka makes the data pipeline simpler and saves both time and money. This blog describes how to establish this connectivity and run a Synapse Spark Notebook that connects to On-Premises Apache Kafka directly.
Motivation
Microsoft's official documentation for Azure Data Factory contains a tutorial that explains how to access an On-Premises SQL Server from an Azure Data Factory that is inside a Managed Vnet. You can go through that article here: Access on-premises SQL Server from Data Factory Managed Vnet using Private Endpoint - Azure Data Fac....
Our solution is based upon that article, but to meet our requirements we substituted On-Prem Apache Kafka for On-Prem SQL Server, and instead of an Azure Data Factory inside a Managed Vnet, we used a Synapse Workspace inside a Managed Vnet. The "Forwarding Vnet" concept explained in the above tutorial remains as-is in our approach.
Approach
An Azure Synapse Analytics workspace supports creating a Managed Private Endpoint using Private Link Service. This allows you to define Fully Qualified Domain Names (FQDNs) of the data sources you connect to from the Synapse Workspace. Our approach is to define FQDNs for our On-Prem Apache Kafka brokers. The Private Link Service then relies on a Load Balancer to direct the traffic, based upon load balancing rules, to a backend subnet consisting of two NAT VMs. These VMs have On-Premises connectivity via ExpressRoute or any other network mechanism. Using this architecture, our requests to Apache Kafka hosted on the On-Premises network pass through smoothly.
Setup & Configuration
A. Create Subnets
You can follow the instructions from the URL mentioned below.
B. Create a standard Internal Load Balancer
You can follow the instructions from the URL mentioned below.
C. Create Load Balancer Resources
1. Create a backend pool
You can follow the instructions here:
2. Create a health probe
You can follow the instructions here:
3. Create a load balancer rule
A load balancer rule is used to define how traffic is distributed to the VMs. You define the frontend IP configuration for the incoming traffic and the backend IP pool to receive the traffic. The source and destination port are defined in the rule.
Note:
Here, the Port and Backend Port are both configured as 9092. You should use the port that your Kafka Brokers are configured to use for communication.
Next, create a load balancer rule:
| Setting | Value |
| --- | --- |
| Name | Enter myRule. |
| IP Version | Select IPv4. |
| Frontend IP address | Select LoadBalancerFrontEnd. |
| Protocol | Select TCP. |
| Port | Enter 9092. |
| Backend port | Enter 9092. |
| Backend pool | Select myBackendPool. |
| Health probe | Select myHealthProbe. |
| Idle timeout (minutes) | Move the slider to 15 minutes. |
| TCP reset | Select Disabled. |
4. Leave the rest of the defaults and then select OK.
5. Create a separate Load Balancer Rule for each Kafka Broker by using a different backend port for each Kafka Broker. You can use any port number as long as you do not repeat it within your setup. Please note that later we will route all traffic back to the same port as your On-Premises Kafka Broker. This is just an intermediate arrangement so that Synapse Spark does not have a clear line of sight to each of your On-Prem Kafka Brokers.
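To make the port scheme concrete, here is a small Python sketch of the mapping this step produces. The broker hostnames, frontend IP, and port numbers are illustrative assumptions, not values from your environment:

```python
# Hypothetical example: one load balancer rule per on-prem Kafka broker.
# Each rule gets a unique frontend port; the forwarding rules on the NAT VMs
# (section F below) later translate these back to the brokers' real port (9092).
FRONTEND_PORT_TO_BROKER = {
    9092: "kafka-broker-1.onprem.example:9092",  # assumed broker FQDNs
    9093: "kafka-broker-2.onprem.example:9092",
    9094: "kafka-broker-3.onprem.example:9092",
}

def bootstrap_servers(frontend_host: str) -> str:
    """Build the Kafka bootstrap.servers string a client behind the load
    balancer would use: every broker is reached via a distinct frontend port."""
    return ",".join(f"{frontend_host}:{port}" for port in sorted(FRONTEND_PORT_TO_BROKER))

print(bootstrap_servers("10.1.0.4"))  # -> 10.1.0.4:9092,10.1.0.4:9093,10.1.0.4:9094
```

The key point is that the client never sees the brokers' real addresses; it only sees one frontend with several ports.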
D. Create a private link service
You can follow the instructions here:
You will need as many private link services as the number of on-premises Kafka brokers you want to connect to from a Synapse Spark notebook. Please refer to the above diagram.
E. Create backend servers (NAT VMs)
You can follow the instructions here:
Note:
After provisioning NAT VMs, please ensure that they have the ports open which you are going to use for health probe as well as for Kafka. If those ports are not open, then please add appropriate Inbound NSG rules. Without that, NAT VMs will not be able to communicate properly using a Forwarding Vnet.
F. Create Forwarding Rule(s) to Endpoint(s)
Here, <FQDN/IP> is your target Kafka Broker IP.
3. Run the below command to check the iptables in your backend server VMs. You should see one record in your iptables with your target IP.
sudo iptables -t nat -v -L PREROUTING -n --line-number
Note:
If you have more than one Kafka Broker, you will need to define multiple load balancer rules, as each broker must use a different port to route its traffic to the NAT VMs. There will then be multiple iptables records with different ports. Otherwise, traffic intended for different brokers cannot be told apart. For example,
| | Command run in backend server VM |
| --- | --- |
| Kafka Broker 1 | sudo ./ip_fwd.sh -i eth0 -f 9092 -a <Kafka Broker 1 FQDN/IP> -b 9092 |
| Kafka Broker 2 | sudo ./ip_fwd.sh -i eth0 -f 9093 -a <Kafka Broker 2 FQDN/IP> -b 9092 |
| Kafka Broker 3 | sudo ./ip_fwd.sh -i eth0 -f 9094 -a <Kafka Broker 3 FQDN/IP> -b 9092 |
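Before moving on, a quick TCP reachability check from a machine that can reach the load balancer frontend helps confirm the forwarding rules work. A minimal Python sketch, where the frontend IP and the three ports are assumptions matching the example table above:

```python
import socket

def check_tcp(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port can be established."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

if __name__ == "__main__":
    # Assumed load balancer frontend IP; one port per Kafka broker, as above.
    for port in (9092, 9093, 9094):
        status = "reachable" if check_tcp("10.1.0.4", port) else "unreachable"
        print(port, status)
```

If a port shows as unreachable, re-check the NSG inbound rules on the NAT VMs and the iptables entries from the previous step.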
G. Create a Managed Private Endpoint to Private Link Service
8. Create the private endpoint.
H. Create a Synapse Spark Notebook
Here, we assume you already have the Synapse Spark Pool created to attach and run the notebook. If that is not the case then please follow the link here to create the Synapse Spark Pool first: Quickstart: Create a serverless Apache Spark pool using the Azure portal - Azure Synapse Analytics |...
To use the Spark Structured Streaming API with Kafka, you need spark-sql-kafka-0-10_2.11-2.4.8.jar installed on your Synapse Spark Pool. As our Synapse Workspace is behind a Managed Vnet without internet access, you will need to download the appropriate version of this jar from this Maven repository link (Maven Repository: org.apache.spark » spark-sql-kafka-0-10) (mvnrepository.com) and upload it to the workspace packages.
Once it is available in Synapse Studio’s workspace packages section, you can install it on the Synapse Spark Pool using the Synapse Studio GUI. Please refer to the below URL if you need additional information on package management in a Synapse Workspace:
Package management - Azure Synapse Analytics | Microsoft Docs
Now, we are ready to create a Synapse Spark Notebook. Let us look at that.
We will create a Synapse Spark Notebook to test the connectivity. There are a couple of ways to test connectivity to On-Prem Kafka. We will try Spark’s Structured Streaming API first:
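As a sketch of what such a notebook cell can look like, the following uses the Structured Streaming Kafka source from the jar installed above. The bootstrap address, ports, and topic name are assumptions for illustration; substitute the endpoints from your own load balancer and Managed Private Endpoint setup:

```python
# Sketch of a Synapse Spark Notebook cell (addresses and topic are
# illustrative assumptions). Requires the spark-sql-kafka-0-10 package
# installed on the Spark pool, as described above.
BOOTSTRAP_SERVERS = "10.1.0.4:9092,10.1.0.4:9093,10.1.0.4:9094"  # via the LB rules
TOPIC = "test-topic"  # hypothetical topic name

def read_kafka_stream(spark, topic=TOPIC):
    """Return a streaming DataFrame over the on-prem Kafka topic."""
    return (
        spark.readStream
             .format("kafka")
             .option("kafka.bootstrap.servers", BOOTSTRAP_SERVERS)
             .option("subscribe", topic)
             .option("startingOffsets", "earliest")
             .load()
    )

# In a Synapse notebook, `spark` is the pre-created SparkSession:
# df = read_kafka_stream(spark)
# display(df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)"))
```

If the stream starts and rows appear, the path through the Managed Private Endpoint, Private Link Service, load balancer, and NAT VMs to the On-Prem brokers is working end to end.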