[Design Pattern] Handling race conditions and state in serverless data pipelines

Question

Hello community,

I recently faced a tricky data engineering challenge involving a lot of Parquet files (about 2 million records) that needed to be ingested, transformed, and split into different entities.

The hard part wasn't the volume, but the logic. We needed to generate globally unique, sequential IDs for specific columns while keeping the execution time under two hours.

We were restricted to using only Azure Functions, ADF, and Storage. This created a conflict: we needed parallel processing to meet the time limit, but parallel processing usually breaks sequential ID generation due to race conditions on the counters.

I documented the three architecture patterns we tested to solve this:

1. Sequential processing with ADF (safe, but failed the 2-hour time limit).
2. Parallel processing with external locking/e-tags on Table Storage (too complex, and we still hit issues with inserts).
3. A "Fan-Out/Fan-In" pattern using Azure Durable Functions and Durable Entities.

We ended up going with Durable Entities. Since they act as stateful actors, they allowed us to handle the ID counter state sequentially in memory while the heavy lifting (transformation) ran in parallel. It solved the race condition issue without killing performance.

I wrote a detailed breakdown of the logic and trade-offs here if anyone is interested in the implementation details:
https://medium.com/@yahiachames/data-ingestion-pipeline-a-data-engineers-dilemma-and-azure-solutions-7c4b36f11351

I am curious if others have used Durable Entities for this kind of ETL work, or if you usually rely on an external database sequence to handle ID generation in serverless setups?

Thanks,
Chameseddine

rogerval · Answer

Using Durable Entities is a strong choice here. They act as stateful actors and let you maintain a strictly sequential counter while running large-scale parallel transformations. This avoids external locking, reduces complexity, and prevents ID collisions.
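
To make the idea concrete, here is a minimal, framework-free sketch of the pattern in plain Python. It is not the Durable Functions API and not the original poster's implementation: `CounterEntity`, `transform_chunk`, and `run_pipeline` are hypothetical names, a lock stands in for the entity's one-operation-at-a-time execution, and the transform is a placeholder. Workers do the expensive work in parallel and only the short "reserve a block of IDs" call is serialized.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class CounterEntity:
    """Hypothetical stand-in for a Durable Entity that owns the ID counter."""

    def __init__(self, start=1):
        self._next_id = start
        self._lock = threading.Lock()  # models one-operation-at-a-time execution

    def reserve(self, count):
        """Reserve `count` consecutive IDs and return the first one."""
        with self._lock:
            first = self._next_id
            self._next_id += count
            return first


def transform_chunk(chunk, counter):
    # Heavy, order-independent work runs in parallel (placeholder transform).
    transformed = [value.strip().upper() for value in chunk]
    # Only this short reservation call is serialized.
    first_id = counter.reserve(len(transformed))
    return [(first_id + i, value) for i, value in enumerate(transformed)]


def run_pipeline(records, chunk_size=3, workers=4):
    counter = CounterEntity()
    chunks = [records[i:i + chunk_size] for i in range(0, len(records), chunk_size)]
    # Fan-out: one task per chunk.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        parts = list(pool.map(lambda c: transform_chunk(c, counter), chunks))
    # Fan-in: merge the partial results and order them by ID.
    return sorted(row for part in parts for row in part)


rows = run_pipeline([f" rec{i} " for i in range(10)])
ids = [row_id for row_id, _ in rows]
```

Reserving a block per chunk rather than one ID per record keeps traffic to the counter low. The IDs come out unique and gap-free, but which chunk receives which block depends on scheduling; if ID order must follow input order, the blocks could instead be reserved before the fan-out.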

Common alternatives in similar ETL/serverless designs include:

SQL or Cosmos DB sequences when a transactional database already exists.
Queue-based atomic increments, though these are less reliable for strict ID ordering.
ADF sequential mode, which is safe but often too slow for high-volume ingestion.
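
As a rough illustration of the database-backed alternative, the sketch below reserves ID ranges from a counter row inside a single transaction. SQLite (from the Python standard library) is used only so the example runs anywhere; the table name `id_counter` and the helper functions are assumptions made for illustration, and a real deployment would more likely rely on the database's own sequence feature where one exists.

```python
import sqlite3


def create_counter(conn, name):
    """Create the (hypothetical) counter table and seed one named counter at 1."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS id_counter ("
        "name TEXT PRIMARY KEY, next_id INTEGER NOT NULL)"
    )
    conn.execute(
        "INSERT OR IGNORE INTO id_counter (name, next_id) VALUES (?, 1)", (name,)
    )
    conn.commit()


def reserve_ids(conn, name, count):
    """Reserve `count` consecutive IDs and return the first one."""
    with conn:  # one transaction: commit on success, roll back on error
        conn.execute(
            "UPDATE id_counter SET next_id = next_id + ? WHERE name = ?",
            (count, name),
        )
        (next_id,) = conn.execute(
            "SELECT next_id FROM id_counter WHERE name = ?", (name,)
        ).fetchone()
    return next_id - count


conn = sqlite3.connect(":memory:")
create_counter(conn, "customer")
first_a = reserve_ids(conn, "customer", 5)  # first block: IDs 1..5
first_b = reserve_ids(conn, "customer", 3)  # next block starts right after
```

The trade-off is the one raised in the question: this needs a transactional database in the stack, which the Functions/ADF/Storage-only constraint ruled out.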

Your Durable Entities approach is the right balance of scalability, correctness, and simplicity for a serverless-only stack.
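
For comparison, the locking/e-tag route from the question can be sketched as an optimistic retry loop. This is an in-memory simulation only, not the Table Storage SDK: `FakeCounterStore`, `write_if_match`, and `next_id` are hypothetical names, and the integer e-tag is a simplification. It shows the read, conditional write, and retry cycle that every caller has to implement correctly.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class FakeCounterStore:
    """Hypothetical in-memory stand-in for one counter row guarded by an e-tag."""

    def __init__(self):
        self._value = 0
        self._etag = 0
        self._lock = threading.Lock()  # makes each single read or write atomic

    def read(self):
        with self._lock:
            return self._value, self._etag

    def write_if_match(self, new_value, etag):
        """Conditional update: succeeds only if nobody wrote since `etag` was read."""
        with self._lock:
            if etag != self._etag:
                return False
            self._value = new_value
            self._etag += 1
            return True


def next_id(store, lost_races):
    while True:  # optimistic retry loop
        value, etag = store.read()
        if store.write_if_match(value + 1, etag):
            return value + 1
        lost_races.append(1)  # another writer got in first; read again and retry


store, lost_races = FakeCounterStore(), []
with ThreadPoolExecutor(max_workers=8) as pool:
    ids = list(pool.map(lambda _: next_id(store, lost_races), range(200)))
```

The IDs still come out unique, but under contention the number of wasted round trips grows with the number of parallel writers, and the retry logic has to live in every caller instead of in one place.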
