Forum Discussion
Itika211999
Jun 14, 2024 · Copper Contributor
Configuring ADF Pipeline to Fetch and Store Large Tables as Single JSON Files
I have 14 tables on a client server that need to be fetched using REST API calls. Each table contains between 71 and 800,000 rows, but the page size for fetching data is limited to 100 rows. The REST...
zack-sponaugle
Jun 17, 2024 · Copper Contributor
Hello Itika211999,
I recommend having a control/metadata table that you use to look up the parameters (e.g., table name) and feed that information into a ForEach activity. That way you don't have to rewrite the pipeline for each table, and you can process multiple tables in parallel.
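To illustrate, here is a minimal sketch of that pattern in pipeline JSON. The control table query, dataset names, and parameter names (ds_control_table, ds_rest_source, ds_adls_json, TableName) are placeholders I made up for the example, not anything specific to your environment:

{
    "name": "pl_copy_all_tables",
    "properties": {
        "activities": [
            {
                "name": "LookupTableList",
                "type": "Lookup",
                "typeProperties": {
                    "source": {
                        "type": "AzureSqlSource",
                        "sqlReaderQuery": "SELECT TableName FROM dbo.ControlTable"
                    },
                    "dataset": {
                        "referenceName": "ds_control_table",
                        "type": "DatasetReference"
                    },
                    "firstRowOnly": false
                }
            },
            {
                "name": "ForEachTable",
                "type": "ForEach",
                "dependsOn": [
                    { "activity": "LookupTableList", "dependencyConditions": [ "Succeeded" ] }
                ],
                "typeProperties": {
                    "items": {
                        "value": "@activity('LookupTableList').output.value",
                        "type": "Expression"
                    },
                    "isSequential": false,
                    "activities": [
                        {
                            "name": "CopyOneTable",
                            "type": "Copy",
                            "typeProperties": {
                                "source": { "type": "RestSource" },
                                "sink": { "type": "JsonSink" }
                            },
                            "inputs": [
                                {
                                    "referenceName": "ds_rest_source",
                                    "type": "DatasetReference",
                                    "parameters": { "tableName": "@item().TableName" }
                                }
                            ],
                            "outputs": [
                                {
                                    "referenceName": "ds_adls_json",
                                    "type": "DatasetReference",
                                    "parameters": { "tableName": "@item().TableName" }
                                }
                            ]
                        }
                    ]
                }
            }
        ]
    }
}

Inside the ForEach, @item().TableName gives you the current table's name, which you can pass into the REST dataset's relative URL and into the sink file path.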
Regarding pagination, the generic REST connector does have built-in support for common pagination patterns. You can find information and examples here:
https://learn.microsoft.com/en-us/azure/data-factory/connector-rest?tabs=data-factory#pagination-support
I would test them out myself but I don't have access to a sample REST API at the moment. If none of those work, let me know and I can share with you what I did to implement "custom" pagination logic for an API that utilized startRow and lastPage parameters. It is not as clean as using the native pagination support offered by ADF but it gets the job done.
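As an example, if your API returns a continuation URL in the response body, the pagination rules on the REST source would look roughly like this. The $.nextLink path is an assumed response property; your API may expose paging through a response header or an offset/page query parameter instead, and the doc linked above lists the rule keys for those cases:

"source": {
    "type": "RestSource",
    "requestMethod": "GET",
    "paginationRules": {
        "AbsoluteUrl": "$.nextLink"
    }
}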
Regarding the JSON side of things, I have never been able to get the Copy activity to append multiple iterations to a single JSON file (it just keeps overwriting the file). What I end up doing instead is writing the data out as separate files in ADLS (one per API call), and then, once that is finished, running a second Copy activity that merges all of those files into a single file. You can do this by setting the second Copy activity's Source tab to a wildcard file path (*.json) and the Sink tab's Copy Behavior setting to "Merge files". Set the file pattern to "Set of objects" or "Array of objects" depending on the exact output format you need.
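Roughly, that second (merge) Copy activity looks like this in JSON; the staging folder path is a placeholder, and the same options are exposed in the Source/Sink tabs of the UI:

"typeProperties": {
    "source": {
        "type": "JsonSource",
        "storeSettings": {
            "type": "AzureBlobFSReadSettings",
            "recursive": true,
            "wildcardFolderPath": "staging/mytable",
            "wildcardFileName": "*.json"
        }
    },
    "sink": {
        "type": "JsonSink",
        "storeSettings": {
            "type": "AzureBlobFSWriteSettings",
            "copyBehavior": "MergeFiles"
        },
        "formatSettings": {
            "type": "JsonWriteSettings",
            "filePattern": "arrayOfObjects"
        }
    }
}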
Let me know if you have any questions.
Regards,
Zack