data engineering
2 Topics[Design Pattern] Handling race conditions and state in serverless data pipelines
Hello community, I recently faced a tricky data engineering challenge involving a lot of Parquet files (about 2 million records) that needed to be ingested, transformed, and split into different entities. The hard part wasn't the volume, but the logic. We needed to generate globally unique, sequential IDs for specific columns while keeping the execution time under two hours. We were restricted to using only Azure Functions, ADF, and Storage. This created a conflict: we needed parallel processing to meet the time limit, but parallel processing usually breaks sequential ID generation due to race conditions on the counters. I documented the three architecture patterns we tested to solve this: Sequential processing with ADF (Safe, but failed the 2-hour time limit). 2. Parallel processing with external locking/e-tags on Table Storage (Too complex and we still hit issues with inserts). 3. A "Fan-Out/Fan-In" pattern using Azure Durable Functions and Durable Entities. We ended up going with Durable Entities. Since they act as stateful actors, they allowed us to handle the ID counter state sequentially in memory while the heavy lifting (transformation) ran in parallel. It solved the race condition issue without killing performance. I wrote a detailed breakdown of the logic and trade-offs here if anyone is interested in the implementation details: https://medium.com/@yahiachames/data-ingestion-pipeline-a-data-engineers-dilemma-and-azure-solutions-7c4b36f11351 I am curious if others have used Durable Entities for this kind of ETL work, or if you usually rely on an external database sequence to handle ID generation in serverless setups? Thanks, Chameseddine51Views0likes1CommentPAAS resource metrics using Azure Data Collection Rule to Log Analytics Workspace
Hi Team, I want to build a use case to pull the Azure PAAS resources metrics using azure DCR and push that data metrics to log analytics workspace which eventually will push the data to azure event hub through streaming and final destination as azure postgres to store all the resources metrics information in a centralized table and create KPIs and dashboard for the clients for better utilization of resources. I have not used diagnose setting enabling option since it has its cons like we need to manually enable each resources settings also we get limited information extracted from diagnose setting. But while implementing i saw multiple articles stating DCR is not used for pulling PAAS metrics its only compatible for VM metrics. Want to understand is it possible to use DCR for PAAS metrics? Thanks in advance for any inputs.Solved100Views0likes2Comments