This article reviews a common pattern of streaming data (i.e. real-time message ingestion) in Synapse SQL dedicated pool. It opens a discussion on the simple standard way to implement this, as well as the challenges and drawbacks. It then presents an alternate solution which enables optimal performance and greatly reduces maintenance tasks when using clustered column store indexes. This is aimed at developers, DBAs, architects, and anyone who works with streams of data that are captured in real-time.
Description of this scenario is as follow: we have messages produced by sensors (simulated by Azure container instances) sent to an Event Hub, then processed by an ASA job which redirects its output into a Synapse table named “rawdata”. At this stage there are two options:
query directly “rawdata” table.
use “rawdata” as a staging table, then build a pipeline that process the data every ten minutes and store it in a final fact table. It consists of an UPSERT statement to ensure existing records are updated (in case they are resent), otherwise inserts them into a final fact table named “rawdata_fact”.
General solution Design Architecture
Building blocks of the current solution can be found on the following link
a stored procedure scheduled to update the fact from the staging:
ALTER PROC [dbo].[processrawdata] AS
--We ensure the staging data is locked before its data gets either insert or update the fact table to avoid inconsistencies
UPDATE rawdata SET partitionid=0 WHERE 1=0
--Now the fact update can take place as well as the stating data when processed
MERGE dbo.rawdata_fact AS macible
USING dbo.rawdata AS masource
ON (macible.eventId = masource.eventId
THEN UPDATE SET
WHEN NOT MATCHED BY TARGET THEN
delete from rawdata;
Then we can a simple view on to of both tables so we have the latest data as well as the processed one:
CREATE VIEW [dbo].[v_rawdata]
AS select * from rawdata
select * from rawdata_fact;
And monitor in near real time the average temperature and CO2 values:
select type, avg(value) as average
group by type
We can stream data directly from Stream Analytics into a Synapse SQL dedicated pool. We can avoid the clustered store index maintenance by using an intermediate table and a scheduled pipeline.