The Integration Runtime (IR) is the compute powering any activity in Azure Data Factory (ADF) or Synapse Pipelines. There are a few types of Integration Runtimes:
Important! Data Flows execute on Spark clusters, while Copy and External transformation activities execute on Windows OS.
You specify which Integration Runtime to connect to in the Linked Service definition:
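As a minimal sketch of what this looks like, the snippet below builds a Linked Service definition in Python and prints it as JSON. The `connectVia` block is where the IR is referenced. The linked service name, the IR name, and the placeholder connection string are assumptions made up for illustration.

```python
import json


def linked_service_with_ir(name, ls_type, type_properties, ir_name):
    """Build a Linked Service definition that points at a specific IR."""
    return {
        "name": name,
        "properties": {
            "type": ls_type,
            "typeProperties": type_properties,
            # connectVia tells the service which Integration Runtime to use
            "connectVia": {
                "referenceName": ir_name,
                "type": "IntegrationRuntimeReference",
            },
        },
    }


# All names below are hypothetical placeholders.
definition = linked_service_with_ir(
    name="ExampleSqlServer",
    ls_type="SqlServer",
    type_properties={"connectionString": "<your connection string>"},
    ir_name="ExampleSelfHostedIR",
)
print(json.dumps(definition, indent=2))
```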
In this blog we will focus on the data movement activity and how to choose the most suitable compute for your scenario.
The first question to ask is whether the data sources that you are connecting to are publicly accessible. If so, then you can use the Azure Integration Runtime to connect to those.
Please note that a single compute connects to both your source and destination data stores. Hence, if either one is in a private network, you need an IR that can reach it, and that is the IR used at runtime.
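The rule above can be written down as a tiny helper. This is only an illustrative sketch: the enum and function names are hypothetical, and it encodes nothing beyond "if either side is private, the copy needs an IR that can reach a private network".

```python
from enum import Enum


class IrKind(Enum):
    AZURE = "Azure IR"
    PRIVATE_CAPABLE = "IR that can reach the private network"


def ir_for_copy(source_is_private: bool, sink_is_private: bool) -> IrKind:
    """One compute serves both ends, so a single private end decides the IR."""
    if source_is_private or sink_is_private:
        return IrKind.PRIVATE_CAPABLE
    return IrKind.AZURE


# A public source with a private sink still needs the private-capable IR.
print(ir_for_copy(source_is_private=False, sink_is_private=True).value)
```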
If your data store is behind a firewall or in a private network (Azure virtual network, on-premises, other cloud virtual networks, etc.), you have a few options:
For the managed VNet Integration Runtime, a dedicated virtual network is created exclusively for you and managed by us. This allows you to create a private endpoint between the IR and your data store. No compute is kept idle in the VNet, so when you execute your first activity on this IR there is some queueing time while we inject the VMs into the VNet. You can use the time-to-live (TTL) setting to avoid this queueing for subsequent activities.
When choosing the managed VNet IR you get a secure, fully isolated, and highly available compute option that allows you to run up to 50 concurrent pipeline activities (such as Lookup, Get Metadata, Delete) and up to 800 concurrent external pipeline activities. For Copy activities specifically, the total Data Integration Unit (DIU) limit per subscription, per region is 2400.
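To sanity-check a planned workload against these numbers, a rough sketch like the following can help. The constants are copied from the paragraph above and should be verified against the current service limits. The `PlannedPeak` class and `exceeded_limits` function are hypothetical names used only for illustration.

```python
from dataclasses import dataclass
from typing import List

# Limits as quoted above; check the current documentation before relying on them.
MAX_PIPELINE_ACTIVITIES = 50
MAX_EXTERNAL_ACTIVITIES = 800
MAX_TOTAL_DIU = 2400


@dataclass
class PlannedPeak:
    pipeline_activities: int   # e.g. Lookup, Get Metadata, Delete
    external_activities: int   # activities that run on external compute
    copy_dius: List[int]       # DIU setting of each concurrent Copy activity


def exceeded_limits(peak: PlannedPeak) -> List[str]:
    """Return a description of every quoted limit the planned peak exceeds."""
    problems = []
    if peak.pipeline_activities > MAX_PIPELINE_ACTIVITIES:
        problems.append("concurrent pipeline activities")
    if peak.external_activities > MAX_EXTERNAL_ACTIVITIES:
        problems.append("concurrent external activities")
    if sum(peak.copy_dius) > MAX_TOTAL_DIU:
        problems.append("total DIU")
    return problems


# Ten concurrent copies at 256 DIU each add up to 2560 DIU, above the quoted 2400.
print(exceeded_limits(PlannedPeak(10, 100, [256] * 10)))
```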
For the self-hosted IR (SHIR), different jobs consume different resources, so the best way to establish what your workload needs is to test with a representative dataset and then extrapolate the results.
For most production scenarios where a data store is in a private network, we will be choosing between options 1 and 2. There isn’t a one-size-fits-all approach. Let’s look at some important considerations to help you navigate this decision.
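One simple way to do that extrapolation is sketched below. It assumes throughput scales linearly with the number of SHIR nodes, which is an optimistic simplification and not something the service guarantees. The function name and the sample numbers are hypothetical, so treat the result as a starting point for further testing.

```python
import math


def estimate_nodes(sample_gb: float, sample_minutes: float,
                   production_gb: float, window_minutes: float) -> int:
    """Estimate SHIR node count from one test run on a single node.

    Assumes linear scaling across nodes (an assumption, not a guarantee).
    """
    measured_throughput = sample_gb / sample_minutes      # GB per minute, one node
    required_throughput = production_gb / window_minutes  # GB per minute needed
    return max(1, math.ceil(required_throughput / measured_throughput))


# Test run: 10 GB in 5 minutes. Target: 600 GB within a 120-minute window.
print(estimate_nodes(10, 5, 600, 120))
```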
In general, if your workload is predictable and you don’t run many concurrent jobs, or you can tolerate the latency caused by queueing jobs due to lack of capacity, you can size your SHIR cluster to fit the load.
In scenarios where it is hard to predict the load because multiple teams or projects share the same SHIR, it is better to choose the managed VNet IR. To put it simply, it is difficult to size something for an unknown load. We can of course oversize it, but then we also end up paying for and managing an oversized SHIR cluster. The managed VNet IR has built-in high availability and can handle many concurrent activities out of the box. Hence you get a serverless option that can handle very high load.
On the other end of the spectrum, we have scenarios where you don’t have concurrent activities running, and maybe your pipeline contains small copy jobs and external transformation activities. In such scenarios, it might be better to look at having a small SHIR. We recommend a two-node cluster with VM sizes appropriate for the workload. You could also have a single-node cluster if your workload can tolerate delays caused by hardware or software failure.
For on-premises data stores, when using the managed VNet IR you need some additional infrastructure to connect to the on-premises environment. This is due to the way private endpoints (and Private Link Service) work. For an example with SQL Server, please see this tutorial.
If this additional infrastructure already exists (e.g. ExpressRoute or site-to-site (S2S) VPN, the load balancer, etc.), then the managed VNet IR is the best option. Otherwise, setting up an SHIR could be the simpler option.
For one-time migrations, you could leverage existing on-premises commodity hardware. For periodic data loads, however, it might be better not to create dependencies on commodity hardware, so that you can decommission it. Hence, running the SHIR on Azure VMs or leveraging the managed VNet IR might be a better solution.
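Some of the considerations above can be collected into a small checklist helper. This is a rough sketch rather than an official decision procedure: the `Workload` fields and the wording of the hints are hypothetical, and the function deliberately returns one hint per consideration instead of a single verdict, since the considerations can point in different directions.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Workload:
    predictable_load: bool
    shared_by_many_teams: bool
    on_premises_source: bool
    connectivity_infra_exists: bool  # e.g. ExpressRoute or S2S VPN, load balancer


def hints(w: Workload) -> List[str]:
    """Return one hint per consideration; weighing them is left to the reader."""
    notes = []
    if w.shared_by_many_teams or not w.predictable_load:
        notes.append("managed VNet IR: the load is hard to size a cluster for")
    else:
        notes.append("SHIR: the cluster can be sized to the known load")
    if w.on_premises_source:
        if w.connectivity_infra_exists:
            notes.append("managed VNet IR: the required infrastructure already exists")
        else:
            notes.append("SHIR: avoids setting up additional infrastructure")
    return notes


for note in hints(Workload(True, False, True, False)):
    print(note)
```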
I hope by this time we’re all on the same page that this decision requires careful consideration of the many aspects that influence it. I also hope that you are now better armed to embark on this decision journey.
Let me know in the comments if you have any questions or there are some important considerations that I have missed.
Further reading
Learn more about integration runtimes
Learn more about the managed virtual network in Azure Data Factory