Forum Discussion
Azure Data Factory - Performance optimization
If you have migrated your on-premises ETL processes to the Azure cloud and are wondering why they are running slower, this post is for you.
As the world marches towards digitization, where data drives many business decisions, it is important for organizations to modernize their analytical processing. Moving data integration processes to the Azure data platform is one of the many paths to modernization. With the power of data, combined with the seamless integrations supported by Azure, organizations can gain valuable insights into customer behavior, market trends, and operational efficiency. By leveraging services like Azure Data Factory, businesses can unlock the true potential of their data, fostering innovation, optimizing processes, and ultimately gaining a competitive edge in today's data-driven economy.
There are various aspects to the modernization journey; in this post we will focus specifically on execution times.
When comparing job performance between legacy and modern data integration platforms, the comparison should not be restricted to execution times alone. There are several other factors I would like to discuss here.
1. Cost:
Modernization or optimization shouldn't be measured only by the reduction in processing time of your ETL processes. Cost is also a very important factor, and performance is best measured in conjunction with the cost of execution. Suppose you paid X amount to run your jobs in Y time on-premises, and post modernization you are paying X/5 to run the same jobs in 2Y time. Would you call that performance degradation or performance calibration?
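As a rough illustration of that trade-off, here is a minimal sketch in Python; all figures are made-up assumptions, not benchmarks:

# Hypothetical numbers to illustrate the cost/time trade-off, not real benchmarks.
onprem_cost, onprem_hours = 10_000.0, 4.0                      # X amount, Y time on-premises
cloud_cost, cloud_hours = onprem_cost / 5, onprem_hours * 2    # X/5 amount, 2Y time after migration

# A simple "cost x time" score per batch run: lower is better on both dimensions.
onprem_score = onprem_cost * onprem_hours
cloud_score = cloud_cost * cloud_hours

print(f"on-premises: {onprem_cost:,.0f} in {onprem_hours}h -> score {onprem_score:,.0f}")
print(f"cloud:       {cloud_cost:,.0f} in {cloud_hours}h -> score {cloud_score:,.0f}")
# The batch is slower, but each run costs five times less; whether that counts as
# "degradation" depends on which of the two dimensions matters more to the business.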
Traditionally there was a distinct segregation of interests: business stakeholders were concerned about execution costs, while their technical counterparts were concerned about execution times. When you move your analytical systems to cloud-based platforms, cost and execution times go pretty much hand in hand.
When you run your ETL jobs on Azure Data Factory, you are probably already saving costs on your data center, infrastructure support staff and various licensing fees. With Azure's pay-as-you-go billing models, you are not even locked in to any upfront capital expenditure.
2. Infrastructure:
Cloud computing brings the big advantage of pay-as-you-go. When you use iPaaS services like Azure Data Factory, Logic Apps, Azure Functions and so on, the infrastructure is provisioned dynamically by Azure. Talking specifically about Azure Data Factory, it is architected to orchestrate different types of independent compute infrastructure within a single data pipeline. For instance, the Execute Pipeline activity, the Copy Data activity and mapping data flows all run on independent infrastructure behind the scenes. This infrastructure is created, managed and destroyed at runtime by Azure.
There are, of course, ways to pre-provision or reuse some of this infrastructure. Keeping resources readily available for immediate use is referred to as pre-warming. This proactive approach can lead to quicker response times when infrastructure workloads fluctuate.
2.1 Pre-warmed resources:
There are various configuration parameters available on the Azure Integration Runtime to enable pre-warmed compute infrastructure for Azure Data Factory resources.
Three settings there ensure that the provisioned infrastructure remains alive for a defined amount of idle time (the time to live) before being destroyed.
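As a sketch of what those settings look like, here is an illustrative Azure Integration Runtime definition expressed as a Python dictionary mirroring the ARM-style dataFlowProperties block; treat the property names and values as assumptions and verify them against a template exported from your own factory:

# Illustrative sketch only: property names mirror the ARM-style definition of a
# managed Azure IR; verify against your own exported template before relying on them.
azure_ir = {
    "name": "AzureIR-PreWarmed",
    "properties": {
        "type": "Managed",
        "typeProperties": {
            "computeProperties": {
                "location": "AutoResolve",
                "dataFlowProperties": {
                    "computeType": "General",  # compute type of the pre-warmed Spark cluster
                    "coreCount": 8,            # number of cores kept ready
                    "timeToLive": 15           # idle minutes before the cluster is destroyed
                }
            }
        }
    }
}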
Pre-warmed resources do help address the cold start of ETL applications to a certain extent; however, the startup behavior still can't be compared to always-on on-premises applications.
2.2 Parallelism and Idle time:
Pre-warmed resources do improve the execution times of ETL applications to a certain extent. However, there are two more factors within this topic that need to be considered.
The time to live configuration makes sure that the compute infrastructure for the Copy activity, the Execute Pipeline activity and mapping data flows remains alive for the configured amount of idle time. But when you start executing multiple parallel instances of such activities, the pre-warmed infrastructure is already engaged, so new infrastructure still needs to be provisioned. This leads to cold starts for some of the activities and slows down processing.
Another aspect of pre-warmed resources is the idle time between your batch runs. Complex scheduling of jobs within ETL batches can prevent you from using pre-warmed resources effectively. Consider a batch of 1,000 ETL jobs. If the batch is spread across 10 hours of execution, with jobs running at various times based on their dependencies, it will still see some cold starts, because resources are left idle for the configured time and eventually destroyed.
It is therefore very important to thoroughly analyze the batch and its dependency patterns. The time to live needs to be configured appropriately, based on the level of parallelism, the idle times between batches and the types of compute resources required. A well-calibrated setting ensures that ETL jobs don't spend a lot of time starting up, while also not over-provisioning resources upfront, which often leads to excessive billing.
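As a back-of-the-envelope sketch of that calibration (every number here is an assumption about your batch, not a measurement), you can estimate how many executions would still hit a cold start for a given time to live:

# Given the idle gaps between executions in a batch, how many runs would still
# hit a cold start for a chosen TTL, and how much warm-but-idle time do you pay for?
gaps_minutes = [5, 40, 8, 90, 12, 3, 60]   # hypothetical idle gaps between jobs
ttl_minutes = 15                            # configured time to live
cold_start_minutes = 4                      # assumed cluster startup overhead

cold_starts = sum(1 for g in gaps_minutes if g > ttl_minutes)
warm_starts = len(gaps_minutes) - cold_starts
billed_idle_minutes = sum(min(g, ttl_minutes) for g in gaps_minutes)  # idle time kept warm

print(f"cold starts: {cold_starts}, warm starts: {warm_starts}")
print(f"startup time saved: {warm_starts * cold_start_minutes} min")
print(f"idle (billed) warm time: {billed_idle_minutes} min")
# Raising the TTL reduces cold starts but increases the idle time you are billed for.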
3. Design & architecture:
Without any doubt, design is a crucial part of any technical system. System design considers many factors, such as the volume of data to be processed, the tools and technology stack, the type of processing, scalability, future readiness and so on. Having said that, it is also crucial to revisit the design of your ETL workload when moving to the Azure cloud.
Large-scale ETL systems were often built years ago and have continued to evolve as the business changed. Because the type, volume and consumption of the data have changed, an overhaul of the system design is usually already overdue. And if the underlying platform is also changing from legacy on-premises to cloud, there is all the more reason for a detailed redesign.
3.1 Latency:
Jobs developed to serve a business requirement are best designed with the deployed tools and technology stack in mind. For instance, running Python, Java, Unix shell or batch scripts on the operating system where the ETL tool itself runs is very efficient: there is negligible to no data latency, because the data is stored on file systems mounted on the same servers.
When the same job is re-platformed to the Azure cloud as is, a set of independent Azure services such as ADLS, Azure Data Factory and Logic Apps runs to achieve the same output. This introduces a significant amount of latency and therefore delay.
3.2 Data Source and target:
If you are using mapping data flows to transform the data, then the sources and targets are very important considerations. Mapping data flows run on the Azure integration runtime, so they cannot connect to on-premises data sources or targets directly. There is often a need to stage the data in a cloud store such as ADLS, Synapse, Azure SQL or Azure Cosmos DB. This staging is an additional step compared to on-premises ETL jobs and can therefore add processing time, as sketched below.
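As an illustrative outline of that staged pattern (the activity, dataset and runtime names below are hypothetical, and the structure is simplified compared to real pipeline JSON), the pipeline typically lands the on-premises data in cloud storage first and only then runs the data flow:

# Hypothetical outline of a staged pipeline; names are made up and the structure
# is simplified compared to an actual ADF pipeline definition.
pipeline_activities = [
    {
        "name": "StageFromOnPrem",
        "type": "Copy",                  # runs via a self-hosted integration runtime
        "source": "OnPremSqlTable",      # on-premises source, reachable only through the self-hosted IR
        "sink": "AdlsStagingFolder"      # cloud staging area the data flow can read
    },
    {
        "name": "TransformStagedData",
        "type": "ExecuteDataFlow",       # runs on the Azure IR's managed Spark cluster
        "dependsOn": ["StageFromOnPrem"],
        "source": "AdlsStagingFolder",
        "sink": "SynapseTargetTable"
    }
]
# The extra copy step is the price of running the transformation on a managed Spark
# cluster that has no direct line of sight to the on-premises network.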
3.3 Spark cluster:
Mapping data flows run on Spark clusters provisioned by Azure as just-in-time compute. A Spark job is itself a series of steps:
Spark submit -> Spark driver -> Spark context -> DAG creation -> code deployment -> processing -> cleanup
All of the above steps are necessary to execute a single mapping data flow. As a result, it is not wise to split your ETL processing into many small data flows. Running many modular processes one after another might be a good design on an on-premises platform with no latency, but the same design ported to Azure Data Factory as is can give unexpected results. Reconsidering the processing pipelines is therefore unavoidable; the rough arithmetic below illustrates why.
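To make that concrete, here is a rough calculation; the per-flow startup overhead is an assumed figure, and real values vary with cluster size, TTL and quick re-use:

# Rough comparison: N small sequential data flows vs. one consolidated flow.
# The startup overhead per flow is an assumed figure, not a measured one.
startup_minutes = 4          # assumed Spark startup overhead per data flow execution
work_minutes_total = 30      # total transformation work, identical in both designs
n_small_flows = 10

many_small = n_small_flows * startup_minutes + work_minutes_total
one_big = 1 * startup_minutes + work_minutes_total

print(f"{n_small_flows} small data flows: ~{many_small} min")
print(f"1 consolidated flow: ~{one_big} min")
# With TTL and quick re-use some of the per-flow startup can be avoided,
# but the overhead never disappears entirely for cold runs.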
Why can't you redesign?
Redesigning a system sounds like a pretty obvious decision, but there is another perspective to consider. In order to redesign a system, it is very important to have a business understanding of the existing jobs. Older systems that have been running for over a decade often struggle with this. As the business evolves, the system undergoes constant in-place changes, team members shuffle across teams or organizations, and documentation remains the most neglected part. As a result, the core business understanding of the system is no longer available. Reverse engineering business requirements from the existing jobs and then redesigning them is a developer's nightmare. It takes significant time and effort, which is often underestimated by decision makers.
It depends on the need of the hour: if it is urgent for the business to move to the cloud, one can choose to lift and shift the code and then continue to calibrate the cost and performance aspects alongside business process improvements. If there isn't any urgency, however, a well-thought-out migration can go a long way.
I am a seasoned ETL developer with more than 12 years of experience working on different ETL technologies, and I have helped multiple clients migrate their ETL platforms. What have your experiences with migration projects been? Do share them in the comments section.