Azure Synapse Analytics Blog

8 MIN READ

Connect DEP Enabled Synapse Spark to On-Prem Apache Kafka

Former Employee

Aug 01, 2022

Connecting On-Prem Apache Kafka from Azure Synapse Spark Notebook from DEP Enabled and Managed VNET-based Synapse Workspace

Background

Apache Kafka is an open-source distributed event streaming platform used for high-performance data pipelines and streaming analytics with real-time to near real-time capabilities. There are many on-premises Apache Kafka implementations for various mission-critical applications.

Azure Synapse Analytics is a limitless analytics service that brings together Enterprise Data Warehousing (EDW) and Big Data Analytics (BDA). In Azure Synapse Analytics, Microsoft provides its own implementation of Apache Spark. Azure Synapse Analytics also supports provisioning the workspace inside a Managed Virtual Network (Vnet) as well as enabling Data Exfiltration Protection to enhance the security.

However, if your Synapse Workspace is using a Managed Vnet and has Data Exfiltration Protection enabled you will not be able to connect to your On-Premises data sources. Also, there is no connector available in Azure Data Factory (ADF) or Synapse Pipeline for Apache Kafka. You can, however, integrate to On-Premises Apache Kafka using either Spark Structured Streaming API or any other Kafka Producer/Consumer supported libraries.

Another options is that Azure Synapse Analytics also provides a Managed Private Endpoint, which is an option when you want to connect to the outside world from a Managed Vnet-based Synapse Workspace. Currently, Managed Private Endpoint does not have direct support for on-premises Apache Kafka either. Therefore, you will need to take a different route to connect an On-Prem Apache Kafka from Synapse Spark which is inside a Managed Vnet.

The easiest and most commonly suggested approach is to use Apache Kafka Mirror Maker to mirror the On-Prem Apache Kafka topics to Azure Event Hubs. You can get more details from this link: Use Apache Kafka MirrorMaker - Azure Event Hubs - Azure Event Hubs | Microsoft Docs. While this works fine, it adds additional complexities to the data pipelines and overall data architecture. If we find a way to connect directly to an On-Prem Apache Kafka then it makes the data pipeline a little bit simpler and will also save time and money. This blog describes how to establish this connectivity and run the Synapse Spark Notebook to connect to On-Premises Apache Kafka directly.

Motivation

Microsoft's official documentation for Azure Data Factory contains a tutorial which explains how to access an On-Premises SQL Server from Azure Data Factory which is inside a Managed Vnet. You can go through that article here: Access on-premises SQL Server from Data Factory Managed Vnet using Private Endpoint - Azure Data Factory | Microsoft Docs.

Although based upon the article's solution, to meet our requirements we needed to substitute On-Prem Apache Kafka for On-Prem SQL Server and instead of an Azure Data Factory inside a Managed Vnet, we used a Synapse Workspace inside a Managed Vnet. The "Forwarding Vnet" concept explained in the above tutorial remains as-is in our approach.

Approach

An Azure Synapse Analytics workspace supports creating a Managed Private Endpoint using Private Link Service. This will allow you to define Fully Qualified Domain Names (FQDNs) of your data sources to you connect to from Synapse Workspace. Our approach is to define FQDNs of our On-Prem Apache Kafka. The Private Link Service will then rely on a Load Balancer to direct the traffic based upon load balancing rules to the backend subnet consisting of two NAT VMs. These VMs will have On-Premises connectivity via Express Route or any other network mechanism. Using this architecture, our requests to access Apache Kafka hosted on On-Premises network will pass through smoothly.

Setup & Configuration

A. Create Subnets

You can follow the instructions from the below mentioned URL for this.

Access on-premises SQL Server from Data Factory Managed VNet using Private Endpoint - Azure Data Factory | Microsoft Docs

B. Create a standard Internal Load Balancer

You can follow the instructions from the below mentioned URL for this.

Access on-premises SQL Server from Data Factory Managed VNet using Private Endpoint - Azure Data Factory | Microsoft Docs

C. Create Load Balancer Resources

1. Create a backend pool

You can follow the instructions here:

Access on-premises SQL Server from Data Factory Managed VNet using Private Endpoint - Azure Data Factory | Microsoft Docs

2. Create a health probe

You can follow the instructions here:

Access on-premises SQL Server from Data Factory Managed VNet using Private Endpoint - Azure Data Factory | Microsoft Docs

3. Create a load balancer rule

A load balancer rule is used to define how traffic is distributed to the VMs. You define the frontend IP configuration for the incoming traffic and the backend IP pool to receive the traffic. The source and destination port are defined in the rule.

Note:

Here, the Port and Backend Port both are configured as 9092. You should use the port which your Kafka Brokers are configured to use for communication.

Next, create a load balancer rule:

Select All services in the left-hand menu, select All resources, and then select myLoadBalancer from the resources list.
Under Settings, select Load-balancing rules, then select Add.
Use these values to configure the load-balancing rule:

Setting	Value
Name	Enter myRule.
IP Version	Select IPv4.
Frontend IP address	Select LoadBalancerFrontEnd.
Protocol	Select TCP.
Port	Enter 9092.
Backend port	Enter 9092.
Backend pool	Select myBackendPool.
Health probe	Select myHealthProbe.
Idle timeout (minutes)	Move the slider to 15 minutes.
TCP reset	Select Disabled.

4. Leave the rest of the defaults and then select OK.

5. Create a separate Load Balancer Rule for each Kafka Broker by using a different backend port for each Kafka Broker. You can use any port number as far as you don’t repeat it for your setup. Please note that later will route all traffic to the same port as of your On-Premises Kafka Broker. This is just an intermediate arrangement so that Synapse Spark does not have a clear line of sight to each of your On-Prem Kafka Brokers.

D. Create a private link service

You can follow the instructions here:

Access on-premises SQL Server from Data Factory Managed VNet using Private Endpoint - Azure Data Factory | Microsoft Docs

You will need as many private link services as the number of on-premises Kafka brokers you want to connect to from a Synapse Spark notebook. Please refer to the above diagram.

E. Create backend servers (NAT VMs)

You can follow the instructions here:

Access on-premises SQL Server from Data Factory Managed VNet using Private Endpoint - Azure Data Factory | Microsoft Docs

Note:

After provisioning NAT VMs, please ensure that they have the ports open which you are going to use for health probe as well as for Kafka. If those ports are not open, then please add appropriate Inbound NSG rules. Without that, NAT VMs will not be able to communicate properly using a Forwarding Vnet.

F. Create Forwarding Rule(s) to Endpoint(s)

Login and copy script ip_fwd.sh to your backend server VMs.
Run the script on with the following options:
sudo ./ip_fwd.sh -i eth0 -f 9092 -a <FQDN/IP> -b 9092

Here, <FQDN/IP> is your target Kafka Broker IP.

3. Run below command and check the iptables in your backend server VMs. You can see one record in your iptables with your target IP.

sudo iptables -t nat -v -L PREROUTING -n --line-number

Note:

If you have more than one Kafka Broker, you will need to define multiple load balancer rule as each broker will have to use a different port to route the traffic to NAT VMs. There will be multiple IP table records with different ports. Otherwise, there will be some issues. For example,

	Command run in backend server VM
Kafka Broker 1	sudo ./ip_fwd.sh -i eth0 -f 9092 -a <Kafka Broker 1 FQDN/IP> -b 9092
Kafka Broker 2	sudo ./ip_fwd.sh -i eth0 -f 9093 -a <Kafka Broker 2 FQDN/IP> -b 9092
Kafka Broker 3	sudo ./ip_fwd.sh -i eth0 -f 9094 -a <Kafka Broker 3 FQDN/IP> -b 9092

G. Create a Managed Private Endpoint to Private Link Service

Select All services in the left-hand menu, select All resources, and then select your data factory from the resources list.
Select Author & Monitor to launch the Data Factory UI in a separate tab.
Go to the Manage tab and then go to the Managed private endpoints section.
Select + New under Managed private endpoints.
Select the Private Link Service tile from the list and select Continue.
Enter the name of private endpoint and select myPrivateLinkService in private link service list.
Add FQDN of your target on-premises Kafka Broker.

Notes:

It accepts only FQDNs so you will not be able to add IP Addresses of your On-Prem Kafka Cluster. If your On-Prem Kafka Cluster does not have FQDNs defined for each of the brokers then as an alternative you can make a host file entry with pseudo FQDNs for each of the On-Prem Kafka Brokes in both NAT VMs.

8. Create private endpoint.

Note:

You CANNOT create a Linked Service and test the connection as a Kafka Connector is not available for Azure Data Factory or Synapse Pipelines.
If there are more than one Kafka Brokers in your On-Prem Kafka Cluster, then please ensure that you create one Private Links Service for each of your On-Prem Kafka Brokers.

H. Create a Synapse Spark Notebook

Here, we assume you already have the Synapse Spark Pool created to attach and run the notebook. If that is not the case then please follow the link here to create the Synapse Spark Pool first: Quickstart: Create a serverless Apache Spark pool using the Azure portal - Azure Synapse Analytics | Microsoft Docs

For using the Spark Structured Streaming API with Kafka, you need to have spark-sql-kafka-0-10_2.11-2.4.8.jar installed on your Synapse Spark Pool. As our Synapse Workspace is behind a managed Vnet without internet access, you will need to download the appropriate version of this jar from this maven repository link (Maven Repository: org.apache.spark » spark-sql-kafka-0-10) (mvnrepository.com) and upload it to the workspace packages.

Once it is available in Synapse Studio’s workspace packages section, you can install it to the Synapse Spark Pool using the Synapse Studio GUI. Please refer below URL if you need additional information on package management in Synapse Workspace:

Package management - Azure Synapse Analytics | Microsoft Docs

Now, we are ready to create a Synapse Spark Notebook. Let us look at that.

We will create Synapse Spark Notebook to test the connectivity. There are a couple of ways you can test the connectivity to On-Prem Kafka. We will try to use Spark’s Structured Streaming API first:

Create a new notebook and enter the following code in its cells.
The above code snippet will create a streaming dataframe using Spark’s Structured Streaming API. You should replace values of kafka.bootstrap.servers and subscribe option as per your setup.

The above code snippet will create a query which will write the streaming messages to the specified path on your Azure Data Lake Storage location along with updating the checkpoint location. Please ensure you change the values for the path and checkpointLocation as per your setup. The writeStream command is responsible for writing your streaming data to Azure Data Lake Storage on the given path in the format which you have specified in the command.

You can try running the above cells one by one to know the status of your streaming queries. If your query is active then it will show you true as an output for the query.isActive cell. If your query is processing some messages then it will show the details of the same as output for query.recentProgress and query.lastProgress

Updated Jul 29, 2022

Version 1.0

Former Employee

Joined May 09, 2022

View Profile

Azure Synapse Analytics Blog

Follow this blog board to get notified when there's new activity