Forum Discussion
Configuring ADF Pipeline to Fetch and Store Large Tables as Single JSON Files
I have 14 tables on a client server that need to be fetched using REST API calls. Each table contains between 71 and 800,000 rows, but the page size for fetching data is limited to 100 rows. The REST API does not provide a nextPage link; instead, it uses the parameter "startRow" to specify the starting row for each page. For example, setting startRow=101 fetches rows 101-200.
Our goal is to copy the tables from the REST API connector to ADLS storage as .json files and then transfer the data to a SQL Server. Each table should be stored as a single .json file in ADLS, despite being fetched in multiple pages. Thus, we need 14 .json files, one for each table, in the ADLS storage folder.
Could someone guide me on how to configure the copy data activity's source and sink to achieve this? Any leads would be greatly appreciated.
#ADF #copyactivity #pagination
1 Reply
- zack-sponaugle (Copper Contributor)
Hello Itika211999,
I recommend having a Control/Metadata table that you use to look up the per-table parameters (e.g. table name), and then feeding that information into a ForEach activity via a Lookup activity. That way you don't have to build a separate pipeline for each table, and you can copy several tables in parallel.
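Roughly, the skeleton looks something like the following (the dataset, table, and column names are just placeholders, and I'm assuming the control table lives in an Azure SQL database; swap the source type if yours is elsewhere). Your parameterized copy activity, referencing @item().TableName, would go inside the ForEach's "activities" array:

{
    "name": "CopyAllTablesPipeline",
    "properties": {
        "activities": [
            {
                "name": "LookupControlTable",
                "type": "Lookup",
                "typeProperties": {
                    "source": {
                        "type": "AzureSqlSource",
                        "sqlReaderQuery": "SELECT TableName, MaxRows FROM dbo.RestControlTable"
                    },
                    "dataset": { "referenceName": "ControlTableDataset", "type": "DatasetReference" },
                    "firstRowOnly": false
                }
            },
            {
                "name": "ForEachTable",
                "type": "ForEach",
                "dependsOn": [
                    { "activity": "LookupControlTable", "dependencyConditions": [ "Succeeded" ] }
                ],
                "typeProperties": {
                    "items": { "value": "@activity('LookupControlTable').output.value", "type": "Expression" },
                    "isSequential": false,
                    "batchCount": 4,
                    "activities": []
                }
            }
        ]
    }
}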
Regarding pagination, the generic REST connector does have built-in support for several common pagination patterns. You can find information and examples here:
https://learn.microsoft.com/en-us/azure/data-factory/connector-rest?tabs=data-factory#pagination-support
I would test them out myself, but I don't have access to a sample REST API at the moment. If none of those work, let me know and I can share what I did to implement "custom" pagination logic for an API that used startRow and lastPage parameters. It is not as clean as using the native pagination support offered by ADF, but it gets the job done.
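For your startRow pattern specifically, the "Range" pagination rule in that doc looks like a good fit. I haven't been able to run this against your API, so treat it as a sketch: it assumes startRow is passed as a query parameter, and the end value of 800000 is just taken from the largest table you mentioned. The source side of the copy activity would look roughly like this, generating requests with startRow = 1, 101, 201, and so on:

"source": {
    "type": "RestSource",
    "requestMethod": "GET",
    "httpRequestTimeout": "00:01:40",
    "paginationRules": {
        "QueryParameters.startRow": "RANGE:1:800000:100"
    }
}

If you go the Lookup/ForEach route, you could also build the end value dynamically (via dynamic content on the source) from a row count stored in your control table, so small tables don't issue thousands of empty requests.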
Regarding the JSON side of things, I have never been able to get the copy activity to append multiple copy iterations to a single JSON file (it just keeps overwriting the file). What I end up doing instead is writing the data out as separate files in ADLS (one per API call), and then, once that is finished, running a second copy activity that merges all of those files into a single file. You can do this by setting the second copy activity's Source tab to a wildcard file path (*.json) and the Sink tab's Copy behavior setting to "Merge files". Set the file pattern to "Set of objects" or "Array of objects" depending on the exact output format you need. A rough sketch is below.
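Here is roughly what that merge copy activity looks like in JSON (the activity and dataset names and the staging folder path are placeholders; adjust the wildcard path and filePattern to what you need):

{
    "name": "MergePageFiles",
    "type": "Copy",
    "inputs": [ { "referenceName": "StagingJsonDataset", "type": "DatasetReference" } ],
    "outputs": [ { "referenceName": "MergedJsonDataset", "type": "DatasetReference" } ],
    "typeProperties": {
        "source": {
            "type": "JsonSource",
            "storeSettings": {
                "type": "AzureBlobFSReadSettings",
                "recursive": true,
                "wildcardFolderPath": "staging/MyTable",
                "wildcardFileName": "*.json"
            }
        },
        "sink": {
            "type": "JsonSink",
            "storeSettings": {
                "type": "AzureBlobFSWriteSettings",
                "copyBehavior": "MergeFiles"
            },
            "formatSettings": {
                "type": "JsonWriteSettings",
                "filePattern": "arrayOfObjects"
            }
        }
    }
}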
Let me know if you have any questions.
Regards,
Zack