Modern applications require the capability to retrieve modified data from a database in real time to operate effectively. Usually, developers need to build a customized tracking mechanism in their applications, using triggers, timestamp columns, and supplementary tables, to identify changes in data. Developing such a mechanism typically requires significant effort, and the extra triggers and schema changes can introduce considerable performance overhead.
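For illustration, the following is a minimal sketch of what such a hand-rolled tracking mechanism might look like in MySQL. The orders table, the orders_changelog table, and all column names are hypothetical and are used only to show the pattern that CDC replaces.

-- Hypothetical supplementary table that records every change to the orders table.
CREATE TABLE orders_changelog (
    change_id   BIGINT AUTO_INCREMENT PRIMARY KEY,
    order_id    INT         NOT NULL,
    change_type VARCHAR(10) NOT NULL,   -- 'INSERT', 'UPDATE', or 'DELETE'
    changed_at  TIMESTAMP   NOT NULL DEFAULT CURRENT_TIMESTAMP
);

-- A separate trigger is needed for each operation; this one records updates.
CREATE TRIGGER trg_orders_after_update
AFTER UPDATE ON orders
FOR EACH ROW
    INSERT INTO orders_changelog (order_id, change_type)
    VALUES (NEW.order_id, 'UPDATE');

Keeping triggers like this in sync with every table and schema change is exactly the kind of maintenance burden that a log-based CDC approach avoids.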
Real-time data processing is a crucial aspect of nearly every modern data warehouse project. However, one of the biggest hurdles in real-time processing solutions is the ability to efficiently and effectively ingest, process, and store messages in real time, particularly when dealing with high volumes of data. To ensure optimal performance, processing must be conducted in a manner that does not interfere with the ingestion pipeline. In addition to non-blocking processing, the data store must be capable of handling high-volume writes. Further challenges include the need to act on the data quickly, for example by generating real-time alerts or by updating dashboards in real time or near real time. In many cases, the source systems use traditional relational database engines, such as MySQL, that do not offer event-based interfaces.
In this series of blog posts, we introduce an alternative solution that uses the open-source tool Debezium to perform Change Data Capture (CDC) from Azure Database for MySQL – Flexible Server with Apache Kafka. The changes are written to Azure Event Hubs, Azure Stream Analytics performs real-time analytics on the data stream, and the results are then written to Azure Data Lake Storage Gen2 for long-term storage and further analysis using Azure Synapse serverless SQL pools, with insights delivered through Power BI.
Azure Database for MySQL - Flexible Server is a cloud-based solution that provides a fully managed MySQL database service. The service is built on top of Azure's infrastructure and offers greater flexibility and control over database management and configuration. MySQL uses the binary log (binlog) to record all transactions in the order in which they are committed on the database. This includes changes to table schemas as well as changes to the rows in the tables. MySQL uses the binlog mainly for replication and recovery.
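Before setting up CDC, it can be helpful to confirm that the binary log is enabled and uses row-based logging, which Debezium relies on. The statements below are standard MySQL commands; binlog_expire_logs_seconds applies to MySQL 8.0.

-- Confirm that the binary log is enabled and uses the ROW format required for row-level capture.
SHOW VARIABLES LIKE 'log_bin';
SHOW VARIABLES LIKE 'binlog_format';
-- Check how long binlog files are retained before being purged (MySQL 8.0).
SHOW VARIABLES LIKE 'binlog_expire_logs_seconds';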
Debezium is a powerful CDC tool built on top of Kafka Connect. It reads the binlog and produces change events for row-level INSERT, UPDATE, and DELETE operations in real time, publishing them from MySQL into Kafka topics by leveraging the capabilities of Kafka Connect. This allows downstream consumers to efficiently pick up only the changes made since the last synchronization and upload those changes to the cloud. After this data is stored in Azure Data Lake Storage, it can be processed using Azure Synapse serverless SQL pools. Business users can then monitor, analyze, and visualize the data using Power BI.
This solution entails ingesting MySQL data changes from the binary logs and converting the changed rows into JSON messages, which are subsequently sent to Azure Event Hubs. After the messages are received by the event hub, an Azure Stream Analytics (ASA) job distributes the changes into multiple outputs, as shown in the following diagram.
End-to-end serverless streaming platform with Azure Event Hubs for data ingestion
The following services are used in this blog post to stream the changes from Azure Database for MySQL to Power BI.
The following stages outline the process to set up the components involved in this architecture to stream data in real time from the source Azure Database for MySQL Flexible Server.
Each of the above stages is outlined in detail in the upcoming sections.
Before proceeding to the next step, create an Azure Database for MySQL Flexible Server instance and a virtual machine as outlined below. To do so, perform the following steps:
Change Data Capture (CDC) is a technique used to track row-level changes in database tables in response to insert, update, and delete operations. Debezium is a distributed platform that provides a set of Kafka Connect connectors which convert these changes into event streams and send those events to Apache Kafka.
To set up Debezium & Kafka on a Linux Virtual Machine follow the steps outlined in: CDC in Azure Database for MySQL – Flexible Server using Kafka, Debezium, and Azure Event Hubs - Micr...
Azure Event Hubs is a fully managed Platform-as-a-Service (PaaS) data streaming and event ingestion platform capable of processing millions of events per second. Event Hubs can process and store events, data, or telemetry produced by distributed software and devices. Data sent to an event hub can be transformed and stored by using any real-time analytics provider or batching/storage adapters. Azure Event Hubs also provides an Apache Kafka endpoint on an event hub, which enables users to connect to the event hub using the Kafka protocol.
Use the following steps to configure a Stream Analytics job to capture data in Azure Data Lake Storage Gen2.
3. Enter a name to identify your Stream Analytics job. Select Create.
4. Specify the Serialization type of your data in the event hub and the Authentication method that the job will use to connect to Event Hubs. Then select Connect.
5. Once the connection is established successfully, you'll see:
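A Stream Analytics job moves data from its inputs to its outputs using a SQL-like query. The following is a minimal sketch that simply routes every change event to the data lake; the aliases [eventhub-input] and [datalake-output] are placeholders that would match the names you give the Event Hubs input and the Data Lake Storage Gen2 output in your job.

-- Route every change event from the Event Hubs input to the Data Lake Storage Gen2 output.
-- The input and output aliases are placeholders.
SELECT
    *
INTO
    [datalake-output]
FROM
    [eventhub-input]

Once the job is running and files are being captured in Azure Data Lake Storage Gen2, the captured data can be exposed through an external table in Azure Synapse serverless SQL pools, as described in the following steps.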
3. In the New External Table dialog, change Max string length to 250 and continue.
4. A dialog window opens. Select or create a database, provide a name for the external table, and select Open script.
5. A new SQL script opens. Run the script against the database to create the external table (a sketch of what this generated script might look like appears after these steps).
6. Note that the external table can only point to a folder in the data lake, not to individual files.
7. Point the external table to the enriched folder in Data Lake Storage.
8. Save all the work by clicking Publish All.
9. Verify that the external table was created under Data -> Workspace -> SQL database.
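For reference, the generated script typically resembles the sketch below. The data source URL, folder path, file format, table name, and column definitions are all placeholders and will differ depending on your storage account, the serialization format chosen for the capture, and the schema of the captured data.

-- Sketch of a Synapse serverless SQL external table over the captured files.
-- The storage URL, folder, file format, and columns are placeholders.
CREATE EXTERNAL DATA SOURCE enriched_datalake
WITH (LOCATION = 'https://<storage-account>.dfs.core.windows.net/<container>');

CREATE EXTERNAL FILE FORMAT ParquetFormat
WITH (FORMAT_TYPE = PARQUET);

CREATE EXTERNAL TABLE dbo.orders_info (
    order_id   INT,
    order_date DATETIME2,
    amount     DECIMAL(10, 2)
)
WITH (
    LOCATION    = 'enriched/',        -- points to the folder, not to individual files
    DATA_SOURCE = enriched_datalake,
    FILE_FORMAT = ParquetFormat
);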
External tables encapsulate access to files, making the querying experience almost identical to querying local relational data stored in user tables. After the external table is created, you can query it just like any other table:
SELECT TOP 100 * FROM dbo.orders_info
GO
SELECT COUNT(*) FROM dbo.orders_info
GO
2. To connect a Power BI workspace to the Synapse workspace, under External connections, click Linked services. Click + New, click Power BI, and then click Continue.
3. Enter a name for the linked service in the Name field and select the existing Power BI workspace that you want to use for publishing. The Power BI linked connection then appears with that name.
4. Click Create.
5. View the Power BI workspace in Synapse Studio.
6. New reports can be created by clicking + at the top of the Develop tab. Existing reports can be edited by clicking the report name. Any saved changes are written back to the Power BI workspace.
Overall, Debezium, Kafka Connect, Azure Event Hubs, Azure Data Lake Storage, Azure Stream Analytics, Synapse SQL Serverless, and Power BI work together to create a comprehensive, end-to-end data integration, analysis, and visualization solution that can handle real-time data streams from databases, store them in a scalable and cost-effective manner, and provide insights through a powerful BI tool.
To learn more about the services used in this post, check out the following resources:
If you have any feedback or questions about the information provided above, please leave a comment below or email us at AskAzureDBforMySQL@service.microsoft.com. Thank you!