etl
Bulk Insert causing duplicates in target table
We use an ETL tool (C#-based) that uses the BulkCopy method to bulk insert into target tables. Recently we have been facing two issues in our daily loads:

1. Target table with no primary key: the load returns success, but it inserts the same records five times.
2. Target table with a primary key: the load returns failure with a primary key constraint violation, yet it still inserts rows into the target table.

The DBA team is unable to find anything in their logs, so I am confused about why this error occurs. It happens daily in about 10% of the jobs, and once you rerun the same job after 2-3 hours it finishes successfully without the issues mentioned above. Please suggest how I can debug this issue.
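For context on how batch commits interact with constraint failures, here is a rough sketch, not the original tool's code: it assumes .NET's SqlBulkCopy class, and the connection string, table name, and batch size are placeholders. It wraps the bulk load in an explicit transaction so that a primary key violation rolls back the whole load instead of leaving earlier batches committed.

```csharp
using System.Data;
using Microsoft.Data.SqlClient; // System.Data.SqlClient in older projects

public static class BulkLoader
{
    // Bulk-inserts a DataTable into the target table inside one explicit transaction.
    public static void Load(string connectionString, DataTable rows, string targetTable)
    {
        using var connection = new SqlConnection(connectionString);
        connection.Open();
        using SqlTransaction transaction = connection.BeginTransaction();
        try
        {
            using var bulkCopy = new SqlBulkCopy(connection, SqlBulkCopyOptions.Default, transaction)
            {
                DestinationTableName = targetTable,
                BatchSize = 5000 // placeholder batch size
            };
            bulkCopy.WriteToServer(rows);
            transaction.Commit();
        }
        catch
        {
            // Without a surrounding transaction, batches sent before a constraint
            // violation can remain committed in the target table, which would match
            // the second symptom described above.
            transaction.Rollback();
            throw;
        }
    }
}
```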
Leverage Azure Durable Functions to build lightweight ETL Data Jobs

This blog is co-authored by Dr. Magesh Kasthuri, Distinguished Member of Technical Staff (Wipro), and Sanjeev Radhakishin Assudani, Azure COE Principal Architect (Wipro). This blog post aims to provide insights into how Azure Durable Functions can be considered as an alternative design choice for building a lightweight, Azure-native solution for data ingestion and transformation. While the solution discussed in this blog pertains to a healthcare industry customer, the design approach presented here is generic and applicable across industries.

The scenario

A leading healthcare provider planned to modernize its Medicare Auto Enrollment Engine (AEE) and Premium Billing capabilities to enable a robust, scalable, and cost-effective solution across its Medicare business line. One of the key requirements was to build an integration layer from their healthcare administration platform into its database to process the benefit enrollment and maintenance data delivered as hundreds of JSON files. The proposed solution ingests, transforms, and loads the data into their database platform on a daily incremental file and monthly audit file basis. The challenge was to identify the most cost-effective ETL data engine that could handle the complex processing required in the integration layer while remaining lightweight. Below is the list of possible solutions identified:

o Azure Databricks
o MuleSoft APIs
o Azure Logic Apps
o Azure Durable Functions

After careful evaluation, Azure Durable Functions was chosen to build the integration layer. The following objectives were identified:

Azure Durable Functions offer a modernized and scalable solution for building and managing serverless workflows.
Lightweight data jobs can be implemented with Durable Functions, avoiding heavy, compute-intensive services when they are not needed.
Optimized performance, completing the end-to-end enrichment process within hours.

Solution components

In today's data-driven world, the ability to efficiently handle ETL (Extract, Transform, Load) jobs is crucial for any organization looking to gain insights from its data. Azure provides a robust platform to develop native solutions for ETL jobs, utilizing a combination of Azure Data Factory (ADF) pipelines, Azure Durable Functions, Azure SQL Database, and Azure Storage. This article guides you through the process of developing an Azure-native solution for ETL jobs, encompassing data load, ingestion, transformation, and staging activities. The approach avoids Azure Data Lake Storage (ADLS Gen2) and Databricks to prevent cost growth and a heavyweight architecture, and it also helps you define a lightweight reference architecture for high-load data processing jobs.

Architecture Overview

The architecture for an Azure-native ETL solution involves several components working together seamlessly. The key components include:

Azure Data Factory (ADF) Pipeline: Orchestrates data flow and automates ETL processes.
Azure Durable Functions: Handle ingestion and transformation tasks using C# and .NET code.
Azure SQL Database: Used for data enrichment and final storage.
Azure Storage: Stores raw feed files, manages staging activities and temporary data storage.
Application Insights & Monitoring: Provide observability and activity tracking.
Durable Functions Monitor: Provides a UI to debug, monitor, and manage orchestration instances.
Azure Key Vault: Stores secrets such as keys and connection strings.

Architecture Diagram

[Architecture diagram: an ADF pipeline orchestrating Durable Functions for ingestion and transformation, with Azure Blob Storage as the landing zone, Azure SQL Database for enrichment and final storage, Azure Key Vault for secrets, and Application Insights for monitoring]
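To make the flow in the diagram concrete, below is a minimal sketch of how the Durable Functions part of this architecture could be structured in C#. The function names, the string-based record type, and the activity bodies are illustrative assumptions, not the production implementation.

```csharp
using System.Collections.Generic;
using System.Threading.Tasks;
using Microsoft.Azure.WebJobs;
using Microsoft.Azure.WebJobs.Extensions.DurableTask;
using Microsoft.Extensions.Logging;

public static class EnrollmentEtlOrchestration
{
    // Orchestrator: coordinates the ingest -> transform -> load steps for one feed file.
    [FunctionName("EnrollmentEtlOrchestrator")]
    public static async Task RunOrchestrator(
        [OrchestrationTrigger] IDurableOrchestrationContext context)
    {
        // Input: the blob path of the daily incremental (or monthly audit) feed file.
        string feedBlobPath = context.GetInput<string>();

        // Step 1: read and parse the raw JSON feed from Blob Storage.
        List<string> records =
            await context.CallActivityAsync<List<string>>("IngestFeedFile", feedBlobPath);

        // Step 2: transform and enrich the parsed records.
        List<string> enriched =
            await context.CallActivityAsync<List<string>>("TransformRecords", records);

        // Step 3: load the enriched records into Azure SQL Database.
        await context.CallActivityAsync("LoadToSqlDatabase", enriched);
    }

    // Activity: placeholder for reading the feed file from Blob Storage.
    [FunctionName("IngestFeedFile")]
    public static Task<List<string>> IngestFeedFile([ActivityTrigger] string blobPath, ILogger log)
    {
        log.LogInformation("Ingesting feed file {BlobPath}", blobPath);
        return Task.FromResult(new List<string>()); // blob read + JSON parsing would go here
    }

    // Activity: placeholder for the transformation/enrichment rules.
    [FunctionName("TransformRecords")]
    public static Task<List<string>> TransformRecords([ActivityTrigger] List<string> records)
    {
        return Task.FromResult(records); // enrichment logic would go here
    }

    // Activity: placeholder for staging/loading into Azure SQL Database.
    [FunctionName("LoadToSqlDatabase")]
    public static Task LoadToSqlDatabase([ActivityTrigger] List<string> enriched, ILogger log)
    {
        log.LogInformation("Loading {Count} enriched records", enriched.Count);
        return Task.CompletedTask; // bulk copy or stored procedure call would go here
    }
}
```

In an architecture like this, the ADF pipeline would typically start an orchestration instance, for example through an HTTP- or queue-triggered client function calling StartNewAsync, when a new feed file lands in Blob Storage.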
Azure Data Factory (ADF) Pipeline

ADF serves as the backbone of the ETL process. It orchestrates the entire data flow, ensuring that data is moved efficiently from one stage to another. ADF pipelines can be scheduled to run at specific intervals or triggered by events, providing flexibility in managing ETL workflows.

Azure Blob Storage

Azure Blob Storage acts as the initial landing zone for raw feed data. It is highly scalable and cost-effective, making it ideal for storing large volumes of data. Data is loaded into Blob Storage from various sources, ready for further processing.

Azure Durable Functions

Durable Functions are a powerful feature of Azure Functions that allow for long-running, stateful operations. Using C# and .NET code, Durable Functions can perform complex data ingestion and transformation tasks. They provide reliability and scalability, ensuring that data processing is efficient and fault-tolerant.

Azure SQL Database

Azure SQL Database is used for data enrichment and final storage. After the transformation process, data is loaded into the SQL database, where it can be enriched with additional metadata and made ready for analytics and reporting. It provides high performance, security, and availability.

Azure Storage for Staging Activities

During the ETL process, intermediate data needs to be stored temporarily. Azure Storage plays a crucial role in managing these staging activities. It ensures that data is available for subsequent processing steps, maintaining the integrity and flow of the ETL pipeline.

Observability and Monitoring

Application Insights

Application Insights is an essential tool for monitoring the health and performance of your ETL solution. It provides real-time insights into application performance, helping to identify and troubleshoot issues quickly. By tracking metrics and logs, you can ensure that your ETL processes are running smoothly and efficiently.

Activity Tracking

Activity tracking is crucial for understanding the flow and status of data through the ETL pipeline. Logging and monitoring tools can provide detailed information about each step in the process, allowing for better visibility and control. This ensures that any anomalies or failures can be detected and addressed promptly.

Durable Functions Monitor

Durable Functions Monitor is an important tool for listing, monitoring, and debugging the orchestrations inside the Durable Functions app. It can be configured as an extension in Visual Studio Code. It helps to view the different instances of orchestrators and activity functions, and it also shows the time taken to execute them, which is important for tracking the performance of the different steps in the ETL process. The Durable Functions app can also be viewed in the form of a function graph.

Kudu Logs

Kudu logs trace the execution of the different orchestrators, activity functions, and native functions. They help to see the exceptions raised and whether replays are happening for the orchestrators and activity functions.
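Because orchestrator code is replayed as part of the Durable Functions checkpointing model, the same log lines can show up repeatedly in these traces. Below is a minimal, illustrative sketch of replay-aware logging; the orchestrator name and the activity it calls are assumptions.

```csharp
using System.Threading.Tasks;
using Microsoft.Azure.WebJobs;
using Microsoft.Azure.WebJobs.Extensions.DurableTask;
using Microsoft.Extensions.Logging;

public static class ReplayAwareLogging
{
    [FunctionName("LoggingAwareOrchestrator")]
    public static async Task Run(
        [OrchestrationTrigger] IDurableOrchestrationContext context, ILogger log)
    {
        // A replay-safe logger suppresses entries emitted during orchestration replays.
        ILogger safeLog = context.CreateReplaySafeLogger(log);
        safeLog.LogInformation("Starting enrichment for instance {Id}", context.InstanceId);

        // IsReplaying can also be checked directly when logging manually.
        if (!context.IsReplaying)
        {
            log.LogInformation("First (non-replayed) execution of this step.");
        }

        await context.CallActivityAsync("TransformRecords", null);
    }
}
```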
Best Practices for Implementing the Solution

Here are some best practices to ensure the successful implementation of your Azure-native ETL solution:

Design for Scalability: Ensure that your solution can handle increasing data volumes and processing demands by leveraging Azure's scalable services; the fan-out/fan-in pattern sketched at the end of this post is one example.
Optimize Data Storage: Use appropriate data storage solutions for the different stages of the ETL process, balancing cost and performance.
Implement Robust Monitoring: Use Application Insights, Durable Functions Monitor, and other monitoring tools to track performance and detect issues early.
Ensure Data Security: Implement strong security measures to protect sensitive data at rest and in transit.
Automate and Schedule Pipelines: Use ADF to automate and schedule ETL pipelines, reducing manual intervention and ensuring consistency.
Use Durable Functions for Complex Tasks: Leverage Azure Durable Functions for long-running and stateful operations, ensuring reliability and efficiency.

By following these guidelines and leveraging Azure's powerful tools and services, you can develop a robust and efficient ETL solution that meets your data processing needs. Azure provides a flexible and scalable platform, enabling you to handle large data volumes and complex transformations with ease. Embrace the power of Azure to unlock the full potential of your data.
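Relating to the scalability point above, the hundreds of JSON files in the original scenario can be processed with the fan-out/fan-in pattern. The sketch below is an assumption about how that might look, with hypothetical activity names.

```csharp
using System.Collections.Generic;
using System.Threading.Tasks;
using Microsoft.Azure.WebJobs;
using Microsoft.Azure.WebJobs.Extensions.DurableTask;

public static class MonthlyAuditOrchestration
{
    [FunctionName("MonthlyAuditOrchestrator")]
    public static async Task Run([OrchestrationTrigger] IDurableOrchestrationContext context)
    {
        // Hypothetical activity that lists the JSON feed files dropped for this run.
        string[] files = await context.CallActivityAsync<string[]>(
            "ListFeedFiles", context.GetInput<string>());

        // Fan out: each file becomes its own activity invocation running in parallel.
        var tasks = new List<Task>();
        foreach (string file in files)
        {
            tasks.Add(context.CallActivityAsync("ProcessFeedFile", file));
        }

        // Fan in: wait for every file before the final enrichment/load step.
        await Task.WhenAll(tasks);
        await context.CallActivityAsync("FinalizeLoad", context.InstanceId);
    }
}
```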
What Are the Ways to Dynamically Invoke Pipelines in ADF from Another Pipeline?

I am exploring different approaches to dynamically invoke ADF pipelines from within another pipeline as part of a modular and scalable orchestration strategy. My use case involves having multiple reusable pipelines that can be called conditionally or in sequence, based on configuration stored externally (such as in a SQL Managed Instance or another Azure-native source). I am aware of a few patterns, like using the Execute Pipeline activity within a ForEach loop, but I would like to understand the full range of available and supported options for dynamically invoking pipelines from within ADF. Could you please clarify the possible approaches for achieving this? Specifically, I am interested in:

Using ForEach with Execute Pipeline activity: How to structure the control flow for calling multiple pipelines in sequence or in parallel, and how to pass pipeline names dynamically.
Dynamic pipeline name resolution: Is it possible to pass the pipeline name as a parameter to the Execute Pipeline activity? How to handle validation when the pipeline name is dynamic?
Parameterized execution: Best practices for passing dynamic parameters to each pipeline when calling them in a loop or based on external config.
Calling ADF pipelines via REST API or Web Activity: When would this be preferred over the native Execute Pipeline activity? How to handle authentication and response handling? (A rough example of this option is sketched below.)

If there are any recommendations, gotchas, or best practices related to dynamic pipeline orchestration in ADF, I would greatly appreciate your insights. Thanks!
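For the REST API option mentioned above, a rough C# sketch of triggering a pipeline by name follows. The subscription, resource group, factory, and pipeline names are placeholders, and the call assumes an identity (managed identity or service principal) with rights to start pipeline runs; authentication specifics may differ in your environment.

```csharp
using System.Net.Http;
using System.Net.Http.Headers;
using System.Text;
using System.Threading.Tasks;
using Azure.Core;
using Azure.Identity;

public static class AdfPipelineRunner
{
    // Triggers an ADF pipeline by name via the Data Factory REST API (createRun).
    public static async Task<string> TriggerPipelineAsync(
        string subscriptionId, string resourceGroup, string factoryName,
        string pipelineName, string parametersJson)
    {
        // Acquire an ARM token (DefaultAzureCredential covers managed identity locally and in Azure).
        var credential = new DefaultAzureCredential();
        AccessToken token = await credential.GetTokenAsync(
            new TokenRequestContext(new[] { "https://management.azure.com/.default" }));

        string url = $"https://management.azure.com/subscriptions/{subscriptionId}" +
                     $"/resourceGroups/{resourceGroup}/providers/Microsoft.DataFactory" +
                     $"/factories/{factoryName}/pipelines/{pipelineName}/createRun?api-version=2018-06-01";

        // In real code, reuse a single HttpClient instance rather than creating one per call.
        using var http = new HttpClient();
        http.DefaultRequestHeaders.Authorization =
            new AuthenticationHeaderValue("Bearer", token.Token);

        // The request body carries the pipeline parameters as a JSON object.
        HttpResponseMessage response = await http.PostAsync(
            url, new StringContent(parametersJson, Encoding.UTF8, "application/json"));
        response.EnsureSuccessStatusCode();

        // The response body contains the runId of the triggered pipeline run.
        return await response.Content.ReadAsStringAsync();
    }
}
```

The returned runId can then be polled through the pipeline runs API to track status; from inside ADF itself, a Web Activity can make the same call using the factory's managed identity.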
How to Orchestrate ADF Pipelines as Selectable Steps in a Configurable Job

I am working on building a dynamic job orchestration mechanism using Azure Data Factory (ADF). I have multiple pipelines in ADF, and each pipeline represents a distinct step in a larger job. I would like to implement a solution where I can dynamically select or deselect individual pipeline steps (i.e., ADF pipelines) as part of a job. The idea is to configure a job by checking/unchecking steps, and then execute only the selected ones in sequence or based on dependencies.

Available resources for this solution:

Azure Data Factory (ADF)
Azure SQL Managed Instance (SQL MI)
Any other relevant Azure-native service (if needed)

Could you please suggest a solution that meets the following requirements?

Dynamically configure which pipelines (steps) to include in a job.
Add or remove steps without changing hardcoded logic in ADF.
Ensure scalability and maintainability of the orchestration logic.
Keep the solution within the scope of ADF, SQL MI, and potentially other Azure-native services (no external apps or third-party orchestrators).

Any design patterns, architecture recommendations, or examples would be greatly appreciated. Thanks!
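One shape this commonly takes is a step-configuration table in SQL MI that a parent process reads to decide which pipelines to run. The sketch below is a hypothetical illustration: the table name, its columns, and the reuse of TriggerPipelineAsync from the sketch under the previous question are all assumptions. An equivalent pattern inside ADF itself would be a Lookup activity over the same table driving the control flow.

```csharp
using System.Collections.Generic;
using System.Threading.Tasks;
using Microsoft.Data.SqlClient;

public static class JobStepRunner
{
    // Reads enabled steps from a hypothetical dbo.JobStepConfig table and triggers the
    // corresponding ADF pipelines in order.
    public static async Task RunSelectedStepsAsync(
        string sqlConnectionString, string subscriptionId,
        string resourceGroup, string factoryName)
    {
        var steps = new List<(string PipelineName, string ParametersJson)>();

        using (var conn = new SqlConnection(sqlConnectionString))
        {
            await conn.OpenAsync();
            using var cmd = new SqlCommand(
                "SELECT PipelineName, ParametersJson FROM dbo.JobStepConfig " +
                "WHERE IsEnabled = 1 ORDER BY StepOrder", conn);
            using SqlDataReader reader = await cmd.ExecuteReaderAsync();
            while (await reader.ReadAsync())
            {
                steps.Add((reader.GetString(0), reader.GetString(1)));
            }
        }

        foreach (var step in steps)
        {
            // Fires the run; a real job would poll the returned runId for completion
            // before starting the next step if strict sequencing is required.
            await AdfPipelineRunner.TriggerPipelineAsync(
                subscriptionId, resourceGroup, factoryName,
                step.PipelineName, step.ParametersJson);
        }
    }
}
```

Enabling or disabling a step is then just an update to the IsEnabled flag in the configuration table, with no change to the orchestration code or to the ADF pipelines themselves.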