Forum Discussion
Read large volume of files from Blob Storage
Hi!
2 Replies
- ProSolutions (Iron Contributor)
Fjorgego, processing a large volume of files from Blob Storage can be a challenging task. To improve the efficiency and reliability of your implementation, consider the following suggestions:
1. **Optimize ListBlobsOptions:**
- Make sure to set the `setFlatListing(true)` option in `ListBlobsOptions` to retrieve all blobs, regardless of their hierarchy.
   - Increase the `setMaxResultsPerPage` value to fetch a larger batch of blobs per request. This reduces the number of API calls and improves performance.
2. **Use Continuation Token:**
   - When working with a large number of blobs, you might encounter continuation tokens, which indicate that there are more blobs to retrieve. You should handle these tokens properly to paginate through the results (the listing-and-paging sketch after this list covers points 1, 2 and 4).
3. **Parallelization:**
   - It appears that you are already using parallel streams to process blobs concurrently. Make sure to adjust the level of parallelism based on your hardware capabilities and the rate limits of your Blob Storage to achieve optimal performance (see the parallel-processing sketch after this list).
4. **Batch Processing:**
   - Instead of processing blobs one by one, consider processing them in batches. Fetch a reasonable number of blobs at a time, process them, and then move on to the next batch. This can improve efficiency and reduce overhead.
5. **Asynchronous Processing:**
   - For an even more efficient solution, you can use asynchronous programming to retrieve and process blobs concurrently (see the async sketch after this list).
6. **Caching and Local Storage:**
   - Depending on your use case, you could consider caching metadata locally to reduce the number of requests to Blob Storage. However, be mindful of the storage space and data consistency when employing caching (see the caching sketch after this list).
7. **Optimize Metadata and Tags:**
   - Ensure that the metadata and tags associated with each blob are streamlined and minimized to reduce the amount of data that needs to be fetched.
8. **Azure Functions or Durable Functions:**
   - If you have access to Azure Functions, you can leverage them to process blobs in a serverless and scalable manner. Durable Functions can help with managing state and handling retries in case of failures (see the blob-trigger sketch after this list).
9. **Partitioning and Sharding:**
   - Depending on your specific scenario, you could consider partitioning or sharding your data in Blob Storage to spread the processing workload and improve parallelism (see the prefix-partitioning sketch after this list).

Remember that large-scale data processing can be complex, and it's essential to consider the cost, scalability, and performance trade-offs based on your specific use case. Continuously monitor your implementation and make iterative improvements as needed.
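To make points 1, 2 and 4 concrete, here is a minimal sketch against the Azure Storage Blob SDK for Java (v12), where `listBlobs()` already returns a flat listing and `iterableByPage()` follows the continuation token for you. The container name, page size and `processBlob` helper are assumptions for illustration, not taken from your code:

```java
import com.azure.core.http.rest.PagedIterable;
import com.azure.core.http.rest.PagedResponse;
import com.azure.storage.blob.BlobContainerClient;
import com.azure.storage.blob.BlobContainerClientBuilder;
import com.azure.storage.blob.models.BlobItem;
import com.azure.storage.blob.models.BlobListDetails;
import com.azure.storage.blob.models.ListBlobsOptions;

import java.time.Duration;

public class PagedBlobReader {

    public static void main(String[] args) {
        BlobContainerClient container = new BlobContainerClientBuilder()
                .connectionString(System.getenv("AZURE_STORAGE_CONNECTION_STRING")) // assumption: connection-string auth
                .containerName("my-container")                                      // placeholder container name
                .buildClient();

        // Larger pages mean fewer List Blobs calls; listBlobs() is a flat listing in the v12 SDK.
        ListBlobsOptions options = new ListBlobsOptions()
                .setMaxResultsPerPage(5000)
                .setDetails(new BlobListDetails().setRetrieveTags(true)); // only if you need tags per item

        PagedIterable<BlobItem> blobs = container.listBlobs(options, Duration.ofMinutes(2));

        // iterableByPage() handles continuation tokens under the hood; each page is a natural batch.
        for (PagedResponse<BlobItem> page : blobs.iterableByPage()) {
            for (BlobItem item : page.getValue()) {
                processBlob(container, item); // hypothetical per-blob handler
            }
        }
    }

    private static void processBlob(BlobContainerClient container, BlobItem item) {
        // Placeholder for the real work (download, parse, etc.).
        System.out.println("Processing " + item.getName());
    }
}
```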
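For point 3, a dedicated, bounded thread pool makes the degree of parallelism explicit and easy to tune against your cores, bandwidth and the storage account's throttling limits (the common pool behind `parallelStream()` is harder to control). A rough sketch, assuming the per-blob work is a download:

```java
import com.azure.storage.blob.BlobContainerClient;
import com.azure.storage.blob.models.BlobItem;

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

public class ParallelBlobProcessor {

    // Tune this to your hardware and the storage account's limits.
    private static final int PARALLELISM = 16;

    public static void processAll(BlobContainerClient container, List<BlobItem> batch) throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(PARALLELISM);
        List<Future<?>> futures = new ArrayList<>();
        try {
            for (BlobItem item : batch) {
                futures.add(pool.submit(() -> {
                    // Placeholder for the real per-blob work (download, parse, update tag, ...).
                    byte[] content = container.getBlobClient(item.getName())
                            .downloadContent()
                            .toBytes();
                    System.out.println(item.getName() + ": " + content.length + " bytes");
                }));
            }
        } finally {
            pool.shutdown();
            pool.awaitTermination(1, TimeUnit.HOURS);
        }
        // The collected futures can be inspected here for failures and retried if needed.
    }
}
```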
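Point 5 with the async client and Reactor: `flatMap` with a concurrency argument keeps downloads non-blocking while capping the number of requests in flight. The container name and the 32-way concurrency are assumptions:

```java
import com.azure.storage.blob.BlobContainerAsyncClient;
import com.azure.storage.blob.BlobContainerClientBuilder;
import reactor.core.publisher.Flux;

public class AsyncBlobReader {

    public static void main(String[] args) {
        BlobContainerAsyncClient container = new BlobContainerClientBuilder()
                .connectionString(System.getenv("AZURE_STORAGE_CONNECTION_STRING")) // assumption
                .containerName("my-container")                                      // placeholder
                .buildAsyncClient();

        // listBlobs() returns a PagedFlux<BlobItem>; flatMap(..., 32) limits concurrent downloads to 32.
        Flux<String> results = container.listBlobs()
                .flatMap(item -> container.getBlobAsyncClient(item.getName())
                        .downloadContent()
                        .map(data -> item.getName() + ": " + data.toBytes().length + " bytes"), 32);

        // Block only at the edge of the program; inside a reactive pipeline, keep composing instead.
        results.doOnNext(System.out::println).blockLast();
    }
}
```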
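For point 6, even an in-memory map spares repeated metadata calls when the same blobs are touched more than once in a run; a longer-lived cache (local database, Redis, etc.) follows the same shape. Purely illustrative:

```java
import com.azure.storage.blob.BlobContainerClient;

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class BlobTagCache {

    private final BlobContainerClient container;
    // Blob name -> tags, so each blob's tags are fetched from the service at most once per run.
    private final Map<String, Map<String, String>> tagCache = new ConcurrentHashMap<>();

    public BlobTagCache(BlobContainerClient container) {
        this.container = container;
    }

    public Map<String, String> tagsFor(String blobName) {
        // Be mindful of staleness: entries are never refreshed here, which is only
        // acceptable if tags don't change underneath you during processing.
        return tagCache.computeIfAbsent(blobName,
                name -> container.getBlobClient(name).getTags());
    }
}
```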
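Point 8, sketched as an Azure Functions blob trigger in Java; the function name, container name and connection setting are assumptions. Each new or changed blob is pushed to the function, so there is nothing to enumerate yourself (for very high blob counts, an Event Grid-based trigger or a Durable Functions fan-out is often the better fit):

```java
import com.microsoft.azure.functions.ExecutionContext;
import com.microsoft.azure.functions.annotation.BindingName;
import com.microsoft.azure.functions.annotation.BlobTrigger;
import com.microsoft.azure.functions.annotation.FunctionName;

public class BlobTriggeredProcessor {

    // Fires once per blob landing in "my-container" (placeholder name); the platform handles
    // scale-out, so no manual listing or paging is needed.
    @FunctionName("ProcessIncomingBlob")
    public void run(
            @BlobTrigger(name = "content",
                         path = "my-container/{name}",
                         dataType = "binary",
                         connection = "AzureWebJobsStorage") byte[] content,
            @BindingName("name") String blobName,
            final ExecutionContext context) {

        // Placeholder for the real work (parse the file, write results, update a tag, ...).
        context.getLogger().info("Processing " + blobName + " (" + content.length + " bytes)");
    }
}
```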
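And for point 9: if blob names carry a natural partition key (date, customer, hash bucket, ...), each worker can list and process only its own prefix, so the listing itself is split across workers. The prefixes below are invented for illustration:

```java
import com.azure.storage.blob.BlobContainerClient;
import com.azure.storage.blob.models.BlobItem;
import com.azure.storage.blob.models.ListBlobsOptions;

import java.time.Duration;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class PartitionedBlobProcessor {

    public static void processByPrefix(BlobContainerClient container) {
        // Hypothetical partitions; in practice these come from your naming scheme.
        List<String> prefixes = List.of("2024/01/", "2024/02/", "2024/03/");

        ExecutorService pool = Executors.newFixedThreadPool(prefixes.size());
        for (String prefix : prefixes) {
            pool.submit(() -> {
                ListBlobsOptions options = new ListBlobsOptions()
                        .setPrefix(prefix)            // server-side filter: only this partition is listed
                        .setMaxResultsPerPage(5000);
                for (BlobItem item : container.listBlobs(options, Duration.ofMinutes(10))) {
                    System.out.println(prefix + " -> " + item.getName()); // placeholder work
                }
            });
        }
        pool.shutdown();
    }
}
```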
Lastly, reconsider how you update the "wasRead" tag for each blob. Set Blob Tags is a per-blob operation, so many tag updates can't be collapsed into a single request, but issuing the updates concurrently (or asynchronously, right after each download) instead of one at a time can significantly shorten the tag-updating step (see the sketch below).
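A sketch of that with the async client, assuming the tag is literally named "wasRead" and capping the number of in-flight Set Blob Tags calls at 32:

```java
import com.azure.storage.blob.BlobContainerAsyncClient;
import reactor.core.publisher.Flux;

import java.util.List;
import java.util.Map;

public class TagUpdater {

    // Marks each named blob as read, with at most 32 Set Blob Tags calls in flight at once.
    public static void markAsRead(BlobContainerAsyncClient container, List<String> blobNames) {
        Flux.fromIterable(blobNames)
                .flatMap(name -> container.getBlobAsyncClient(name)
                        .setTags(Map.of("wasRead", "true")), 32)
                .blockLast(); // wait for all updates; keep composing instead if the caller is reactive
    }
}
```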
Also, it's worth checking whether the Blob Storage account currently in use can fulfill your scale and throughput requirements; see the scalability targets:
https://learn.microsoft.com/en-us/azure/storage/blobs/scalability-targets