[Design Pattern] Handling race conditions and state in serverless data pipelines
Hello community,

I recently faced a tricky data engineering challenge involving a large set of Parquet files (about 2 million records) that needed to be ingested, transformed, and split into different entities. The hard part wasn't the volume, but the logic: we needed to generate globally unique, sequential IDs for specific columns while keeping the execution time under two hours, and we were restricted to Azure Functions, ADF, and Storage. This created a conflict: we needed parallel processing to meet the time limit, but parallel processing usually breaks sequential ID generation due to race conditions on the counters.

I documented the three architecture patterns we tested to solve this:

1. Sequential processing with ADF (safe, but failed the 2-hour time limit).
2. Parallel processing with external locking/e-tags on Table Storage (too complex, and we still hit issues with inserts).
3. A "Fan-Out/Fan-In" pattern using Azure Durable Functions and Durable Entities.

We ended up going with Durable Entities. Since they act as stateful actors, they allowed us to handle the ID counter state sequentially in memory while the heavy lifting (the transformation) ran in parallel (see the rough sketch at the end of this post). It solved the race condition issue without killing performance.

I wrote a detailed breakdown of the logic and trade-offs here if anyone is interested in the implementation details: https://medium.com/@yahiachames/data-ingestion-pipeline-a-data-engineers-dilemma-and-azure-solutions-7c4b36f11351

I am curious if others have used Durable Entities for this kind of ETL work, or if you usually rely on an external database sequence to handle ID generation in serverless setups?

Thanks,
Chameseddine
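P.S. For anyone who wants a concrete picture of the shape before reading the full article, here is a minimal Python sketch of the pattern, assuming the azure-functions-durable SDK. The names ("counter_entity", "list_chunks", "transform_chunk") and the block-reservation detail are hypothetical placeholders for illustration, not our exact implementation:

```python
# Minimal sketch: Durable Entity as a race-free ID counter + fan-out/fan-in
# orchestrator. In a real Function App the orchestrator, entity, and
# activities would each be registered as separate functions; they are
# combined in one file here for readability.
import azure.durable_functions as df


def counter_entity(context: df.DurableEntityContext):
    # Operations against a single entity key are processed one at a time,
    # so concurrent "reserve" calls can never race on the counter state.
    next_id = context.get_state(lambda: 1)
    if context.operation_name == "reserve":
        count = context.get_input()           # how many IDs the caller needs
        context.set_result(next_id)           # first ID of the reserved block
        context.set_state(next_id + count)    # advance the counter


def orchestrator(context: df.DurableOrchestrationContext):
    # Discover the work items (hypothetical activity).
    chunks = yield context.call_activity("list_chunks", None)

    counter = df.EntityId("counter_entity", "global")
    tasks = []
    for chunk in chunks:
        # ID reservation is sequential: each chunk gets a contiguous block
        # from the single entity instance...
        start_id = yield context.call_entity(counter, "reserve", chunk["rows"])
        # ...while the heavy transformation fans out as parallel activities.
        tasks.append(context.call_activity(
            "transform_chunk", {"chunk": chunk, "start_id": start_id}))

    # Fan-in: wait for all transformations to complete.
    yield context.task_all(tasks)


main = df.Orchestrator.create(orchestrator)
entity = df.Entity.create(counter_entity)
```

Reserving a whole block of IDs per chunk (rather than one call per row) keeps the sequential part of the pipeline tiny, which is what lets the parallel part dominate the runtime.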