Forum Discussion
karthick_sundarasamy
Jun 02, 2023 (Copper Contributor)
ForEach activity switches from parallel to serial execution after an hour or more
This is specifically about the ForEach activity in an Azure Synapse pipeline. With the settings configured to run the inner activities in parallel, the pipeline behaves as expected for some time. Later on, even though a handful of tasks are still waiting and slots are available, it runs them in series, which makes the job run longer.
The issue is reproducible: I see it every time the production job runs. It causes a very noticeable performance drag in long-running pipelines.
Here is the scenario in detail. The customer has data in ADLS and runs a Synapse dedicated database. Say there are 50 tables to load, and each table is large enough that it takes 10 minutes to load (through a PolyBase Copy activity). In the Synapse pipeline, a Lookup activity fetches all the details (source file location, destination database and table, and so on) and feeds them into a ForEach activity that contains the copy/load. We have set batch count = 5 (this number can be higher) and turned the sequential flag OFF in the ForEach, so 5 table loads can happen in parallel. That way we finish 5 tables every 10 minutes, and it works as expected for an hour or so. Then something like a reset seems to happen in the internal scheduler of the ForEach activity, I think: it starts scheduling the table loads in series, which consistently makes the table loads run long. Please see the screenshot; the Gantt chart explains it well.
In the job below, the last hour of the run (Execute pipeline_deltaload) ran in series, whereas the previous 2 hours ran in parallel.
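For readers who want the setup in concrete terms, here is a rough sketch of the ForEach settings described above, written as a Python dict that mirrors the pipeline JSON, plus the best-case runtime arithmetic. The activity names `ForEachTable` and `LookupTables` are hypothetical placeholders, and the inner Copy activity is left out; only `isSequential`, `batchCount`, and the 50 tables / 10 minutes / batch of 5 figures come from the post.

```python
import json
import math

# Sketch of the ForEach activity definition. "ForEachTable" and
# "LookupTables" are hypothetical names, not taken from the real pipeline.
foreach_activity = {
    "name": "ForEachTable",
    "type": "ForEach",
    "typeProperties": {
        "isSequential": False,  # sequential flag turned OFF
        "batchCount": 5,        # up to 5 iterations in parallel
        "items": {
            "value": "@activity('LookupTables').output.value",
            "type": "Expression",
        },
        "activities": [],       # the Copy activity would go here (omitted)
    },
}


def ideal_minutes(tables: int, minutes_per_table: int, batch_count: int) -> int:
    """Best-case wall-clock time if every slot stays busy until the end."""
    return math.ceil(tables / batch_count) * minutes_per_table


print(json.dumps(foreach_activity, indent=2))
print(ideal_minutes(50, 10, 5))  # 100 minutes with 5 loads in parallel
print(ideal_minutes(50, 10, 1))  # 500 minutes if everything runs in series
```

The gap between those two numbers is the most that falling back to serial execution could cost in this example; the real run sits somewhere in between, because only the tail of the job runs in series.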
We can talk more if further details are needed. Let me know what you folks think about this.
- stevedorpe (Copper Contributor)
Experiencing the same issue in Azure Data Factory with the "ForEach" activity.
Is there currently any workaround?
- qzhou (Copper Contributor)
The parallel tasks of a ForEach activity are scheduled at the beginning of the execution. This means that if a task in one queue takes longer, it delays that whole queue, and even if slots are available in another queue, ADF will not reshuffle the tasks.
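To illustrate why up-front assignment can leave slots idle, here is a small simulation. It is only a simplified model of the behavior described above, not ADF's actual scheduler: it assumes tasks are dealt round-robin into fixed queues, and the durations are made up (30 tasks where every fifth one takes 30 minutes instead of 10). It compares that against a shared queue where each task goes to whichever slot frees up first.

```python
import heapq
from typing import List


def static_makespan(durations: List[float], lanes: int) -> float:
    """Assumed model: tasks are pre-assigned round-robin to fixed queues,
    and each queue runs its own tasks one after another."""
    lane_totals = [0.0] * lanes
    for i, d in enumerate(durations):
        lane_totals[i % lanes] += d
    return max(lane_totals)


def dynamic_makespan(durations: List[float], lanes: int) -> float:
    """Shared queue: the next task goes to whichever slot is free first."""
    free_at = [0.0] * lanes  # time at which each slot becomes free
    heapq.heapify(free_at)
    for d in durations:
        start = heapq.heappop(free_at)      # earliest available slot
        heapq.heappush(free_at, start + d)  # slot is busy until start + d
    return max(free_at)


# Hypothetical workload: 30 tasks, every fifth one is slow (30 min vs 10 min),
# so under round-robin all the slow ones land in the same queue.
durations = [30.0 if i % 5 == 0 else 10.0 for i in range(30)]

print("static :", static_makespan(durations, 5))
print("dynamic:", dynamic_makespan(durations, 5))
print("lower bound:", sum(durations) / 5)
```

In this made-up workload the static model finishes only when the slow queue drains (180 minutes), while the other four queues sit idle after 60 minutes. The shared-queue model finishes much closer to the lower bound. When all tasks take the same time, both models give the same result, which matches the observation that the pipeline looks fine at first and only degrades toward the end.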
- karthick_sundarasamy (Copper Contributor)
OK. In practice, the runtime of the activities inside a ForEach, such as COPY/LOAD, depends on various factors like the size of the data, whatever else is going on around the target database that day, and so on. But the existing design choice in ADF's ForEach scheduling does have a performance impact in production. Is it reasonable to flag this as an area that needs product improvement? If so, with whom or where should we raise it?
Here is the chart from today's prod run (ForEach): if the scheduling were dynamic and aggressive, today's job would have finished around 20 minutes earlier. On some days the delay is as much as an hour because of this. Total tasks scheduled inside the ForEach = 30, and here is how they break down:
Queue1: Total tasks run = 11
Queue2: Total tasks run = 6
Queue3: Total tasks run = 7
Queue4: Total tasks run = 5
Queue5: Total tasks run = 2