ADF - REST API Copy Data Activity - Best Practices
Hey everyone,
I'm relatively new to Azure and am using Azure Data Factory (ADF) to extract data from Vonigo via their API.
Everything is working, but I'm looking for ways to improve efficiency. Right now it takes 30+ minutes to pull all franchise data for a given report.
The challenge is that Vonigo only allows reports to be run for one franchise at a time, and switching franchises requires a separate API call. My current process is:
- Select a franchise from a list
- Run all required reports for that franchise
- Save the results to Blob Storage
- Move to the next franchise and repeat
(We'll eventually ingest the files into our silver layer for transformation.)
Speed Issue
One thing that significantly slows the pipeline down is that many activities are running sequentially. If I disable sequential execution, I've seen cases where data gets written to the wrong destination or associated with the wrong franchise.
Has anyone successfully parallelized a similar process while maintaining data integrity? Are there specific points in the workflow where parallel execution would be safe?
Pagination / Loop Issue
Originally, I used a Lookup activity to inspect the most recently created file. An If activity would then determine whether the file contained any records:
- If records existed, increment the page number and continue.
- If no records existed, end the loop.
This worked, but the Lookup activity added noticeable overhead.
To improve performance, I changed the logic to use the Copy Activity output instead. Specifically, I'm checking the amount of data read from the last API call. Pages with no records appear to consistently return the same data-read value, so I use that to determine when to stop paging.
This approach is much faster, but it feels more fragile since it's relying on an indirect indicator rather than the actual record count.
Would you trust the Copy Activity output in this scenario, stick with the Lookup approach, or recommend a different pattern altogether?
Thanks for any suggestions.
3 Replies
- aziz-saiji (Copper Contributor)
Hi,
You are already on the right track, and your observations are valid. There are two main areas to improve here: safe parallelism and pagination strategy.
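First, regarding parallelism and data integrity. The issue you are seeing when disabling sequential execution is very likely caused by shared state, such as variables being reused across iterations. In ADF, variables are scoped to the pipeline, not to each ForEach iteration, so when iterations run in parallel, one iteration can overwrite a value with Set Variable before another iteration has read it. That is how data ends up in the wrong destination or tagged with the wrong franchise.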
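To make the failure mode concrete, here is a minimal Python simulation (not ADF code) of two iterations that share one variable. The names `shared`, `set_franchise`, and `write_report` are hypothetical stand-ins for a pipeline variable, a Set Variable activity, and a Copy activity, and the interleaving is scripted by hand to mimic one ordering that parallel execution could produce.

```python
# Simulation only: models two parallel iterations that share one variable.
# `shared` stands in for a pipeline-scoped variable; all names are hypothetical.
shared = {"franchise_id": None}


def set_franchise(franchise_id):
    """Stand-in for a Set Variable activity inside the loop."""
    shared["franchise_id"] = franchise_id


def write_report(payload):
    """Stand-in for a Copy activity that builds its sink path from the variable."""
    return {
        "path": f"franchise_{shared['franchise_id']}/report.json",
        "payload": payload,
    }


def run_interleaved():
    # Iterations A and B both set the variable before either one writes,
    # so A's data is written under B's path.
    set_franchise("A")
    set_franchise("B")
    return [write_report("data for A"), write_report("data for B")]


def run_isolated(franchise_ids):
    # Each iteration carries its own value (the role item() plays in ADF),
    # so there is nothing for another iteration to overwrite.
    return [
        {"path": f"franchise_{fid}/report.json", "payload": f"data for {fid}"}
        for fid in franchise_ids
    ]
```

In `run_interleaved`, the first write pairs franchise A's payload with franchise B's path, which matches the symptom described in the question. `run_isolated` avoids it because no value outlives the iteration that owns it.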
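A more reliable approach is to use a ForEach activity with parallel execution enabled (Sequential unchecked, with Batch count controlling how many iterations run at once), where each iteration processes one franchise independently. Do not set variables inside the loop; rely on item() and pipeline parameters instead. Also make sure your sink path is dynamically generated and unique per franchise and per run.
For example, write files to a path like: franchise_{franchiseId}/report_{runId}.json (add the page number as well if each page is written to its own file).
This keeps each franchise's output isolated and avoids overwriting data. It is also recommended to keep pagination sequential within each franchise, while running multiple franchises in parallel. One way to do this is to move the per-franchise work, including the paging loop, into a child pipeline called through Execute Pipeline with the franchise ID as a parameter, so that any page counter variable belongs to that child run only. This gives better performance without compromising data integrity.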
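The shape of that pattern (parallel across franchises, sequential paging within each one, unique sink path per write) can be sketched in plain Python. This is only an illustration of the control flow, not ADF code: `fetch_page` is a hypothetical stand-in for the Vonigo API call, the three-pages-then-empty behavior is made up, and `sink_path` extends the example path with a page number so each page gets its own file.

```python
from concurrent.futures import ThreadPoolExecutor


def sink_path(franchise_id, run_id, page):
    """Build a path that is unique per franchise, run, and page."""
    return f"franchise_{franchise_id}/report_{run_id}_page{page}.json"


def fetch_page(franchise_id, page):
    """Hypothetical stand-in for one API call: three pages of data, then empty."""
    return [f"{franchise_id}-row{page}"] if page <= 3 else []


def process_franchise(franchise_id, run_id):
    """Fetch pages one after another; all state is local to this call."""
    written = {}
    page = 1
    while True:
        rows = fetch_page(franchise_id, page)
        if not rows:
            break
        written[sink_path(franchise_id, run_id, page)] = rows
        page += 1
    return written


def run_all(franchise_ids, run_id, max_parallel=4):
    """Run franchises concurrently, each with its own sequential paging loop."""
    results = {}
    with ThreadPoolExecutor(max_workers=max_parallel) as pool:
        outputs = pool.map(lambda fid: process_franchise(fid, run_id), franchise_ids)
        for written in outputs:
            results.update(written)
    return results
```

Because `page` and `written` live inside `process_franchise`, concurrent franchises cannot touch each other's state, and `max_parallel` plays roughly the role that Batch count plays on a ForEach.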
Second, regarding pagination and loop control. Moving away from Lookup was a good decision from a performance perspective. However, using data read size as a condition is not very robust.
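A better approach is to use the Copy activity's output metrics, especially rowsCopied. If it is present in your activity's output, you can read it with an expression such as @activity('Copy page').output.rowsCopied, where 'Copy page' is a placeholder for your own activity name. Stop the loop when rowsCopied equals zero, which is more reliable than checking the data size, since an empty page can still return a non-zero number of bytes.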
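As a rough illustration of that stop condition, here is a small Python sketch of a paging loop driven by a row count. It is a simulation, not ADF code: `copy_page` is a hypothetical stand-in for one Copy activity run, and the row and byte figures it returns are invented purely to show why the byte count is the weaker signal.

```python
def copy_page(page, pages_with_data=3):
    """Hypothetical stand-in for one Copy activity run; returns output metrics."""
    rows = 100 if page <= pages_with_data else 0
    # Made-up numbers: an empty page can still read a few bytes
    # (for example an empty JSON envelope), so dataRead never reaches zero.
    return {"rowsCopied": rows, "dataRead": 2 + rows * 50}


def page_until_empty(max_pages=1000):
    """Request pages in order and stop at the first one that copies zero rows."""
    total_rows = 0
    for page in range(1, max_pages + 1):
        output = copy_page(page)
        if output["rowsCopied"] == 0:
            return {"pages_with_data": page - 1, "rows": total_rows}
        total_rows += output["rowsCopied"]
    # Safety cap so a misbehaving API cannot keep the loop running forever.
    raise RuntimeError("max_pages reached without seeing an empty page")
```

The `max_pages` cap is a design choice worth carrying over to the real pipeline in some form, so that an API that never returns an empty page cannot loop indefinitely.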
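If the API supports it, an even better option is to rely on response metadata such as a hasMore flag, a next-page link, or a continuation token. This is more stable than inferring behavior from indirect indicators. The ADF REST connector also supports pagination rules on the Copy activity source, which can follow such values within a single Copy activity instead of an outer loop, so it is worth checking whether Vonigo's responses fit one of the supported patterns.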
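A metadata-driven loop could look roughly like the Python sketch below. Whether Vonigo's API exposes anything like this is not known from the thread, so the `hasMore` and `nextToken` field names, the `fetch` function, and the canned responses are all assumptions for illustration.

```python
def fetch(token=None):
    """Hypothetical API call returning records plus paging metadata."""
    pages = {
        None: {"records": [1, 2], "hasMore": True, "nextToken": "t1"},
        "t1": {"records": [3, 4], "hasMore": True, "nextToken": "t2"},
        "t2": {"records": [5], "hasMore": False, "nextToken": None},
    }
    return pages[token]


def fetch_all():
    """Follow the continuation token until the API reports no more pages."""
    records = []
    token = None
    while True:
        response = fetch(token)
        records.extend(response["records"])
        if not response["hasMore"]:
            return records
        token = response["nextToken"]
```

Compared with stopping on an empty page, this saves the final empty request, because the last page that contains data already says that nothing follows.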