FastTrack for Azure

End-to-end analytical pipeline demo with Synapse

Eduardo_Noriega
Jul 22, 2022

While reviewing Azure Synapse, we decided to build an end-to-end solution using the AdventureWorks2019 database, which is accessible to everybody. An example of this type is easy for anyone to replicate and lets us demo Azure Synapse functionality with a good dataset. In this blog we describe the architecture we implemented as well as the overall experience of building such a solution.

For step-by-step guidance, you can follow the readme file of the GitHub repo where we published the result: https://github.com/EduardoN/Synapse_AdventureWorks2019

 

Architecture.

The architecture we implemented is shown in the lower part of this diagram inside the blue rectangle:

 

The upper part of this architecture is described in this other blog post: https://techcommunity.microsoft.com/t5/fasttrack-for-azure/real-time-ingestion-with-event-hub-and-synapse/ba-p/3518504

 

Implementation.

The implementation first consisted of creating an Azure SQL Database instance and deploying the AdventureWorks2019 database on it.

Next, we created a Synapse workspace and connected it to GitHub to get source control, versioning, and easier collaboration. We then worked on building a pipeline to ingest the AdventureWorks2019 database into a Data Lake, as depicted in the Ingest section of the architecture above.

 

To build the pipeline we used one of the ingest templates Synapse pipelines offer, creating the pipeline in a matter of minutes via the built-in wizard. We didn't ingest the entire database, just the handful of tables we needed for a meaningful example. The whole process is explained in this link: https://github.com/EduardoN/Synapse_AdventureWorks2019/blob/main/documentation/Ingest_To_DataLake.md#ingest-adventureworks2019-into-data-lake
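The wizard does all of this through the UI, but it may help to see roughly what the copy step amounts to in code. The following is a hypothetical PySpark equivalent for a single table, not the pipeline from the repo; the server name, credentials, and storage paths are placeholders:

```python
# Hypothetical PySpark equivalent of the ingest pipeline for one table:
# copy Person.Person from Azure SQL Database into the Data Lake as
# parquet. Server, credentials, and paths are placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("IngestToDataLake").getOrCreate()

jdbc_url = (
    "jdbc:sqlserver://<your-server>.database.windows.net:1433;"
    "database=AdventureWorks2019"
)

# Read the source table over JDBC (the SQL Server driver ships with
# Synapse Spark pools).
person_df = (
    spark.read.format("jdbc")
    .option("url", jdbc_url)
    .option("dbtable", "Person.Person")
    .option("user", "<sql-user>")
    .option("password", "<sql-password>")
    .option("driver", "com.microsoft.sqlserver.jdbc.SQLServerDriver")
    .load()
)

# Land the raw copy in the Data Lake as parquet files.
person_df.write.mode("overwrite").parquet(
    "abfss://<container>@<storage>.dfs.core.windows.net/raw/Person"
)
```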

 

Once we had the needed information in parquet files inside Azure Data Lake, we created a Notebook that builds a Delta Lake to massage that data.

 

Synapse made it very easy to create an Apache Spark pool and two PySpark Notebooks that read those parquet files and build a Delta Lake to generate the final model in the gold folder. The two models generated, Person and Product, are described in this link: https://github.com/EduardoN/Synapse_AdventureWorks2019/blob/main/documentation/Synapse_Notebook.md#create-synapse-notebook
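As an illustration of what such a notebook does (a sketch, not the repo's exact code), reading the raw parquet files and writing a curated Delta model might look like this; the paths and the columns kept are assumptions:

```python
# Minimal sketch of the notebook step: read the raw parquet files,
# apply a light transformation, and write the result as a Delta table
# in the gold folder. Paths and columns are assumptions.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

base = "abfss://<container>@<storage>.dfs.core.windows.net"

# Read the raw Person data landed by the ingest pipeline.
person_raw = spark.read.parquet(f"{base}/raw/Person")

# Example "massaging": keep a few columns and derive a full name.
person_gold = (
    person_raw.select("BusinessEntityID", "FirstName", "LastName")
    .withColumn("FullName", F.concat_ws(" ", "FirstName", "LastName"))
)

# Persist the curated model as Delta Lake in the gold folder.
person_gold.write.format("delta").mode("overwrite").save(f"{base}/gold/Person")
```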

 

Next we ingested those models into a dedicated SQL pool. Azure Synapse made it very easy to create the dedicated SQL pool instance; we then created a new pipeline that reads the parquet files in the gold folder and copies the data into two tables in the dedicated SQL pool. That process is described in this link: https://github.com/EduardoN/Synapse_AdventureWorks2019/blob/main/documentation/Ingest_To_DW.md
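We used a pipeline for this step, but the same load can also be expressed in a notebook. Here is a minimal sketch that assumes the dedicated SQL pool connector available inside Synapse notebooks and a pool named SQLPool01, both of which are assumptions:

```python
# Rough code alternative to the second pipeline. This assumes it runs
# inside a Synapse notebook, where the dedicated SQL pool connector
# adds a synapsesql() method to the DataFrame writer; the pool name
# SQLPool01 and the table name are assumptions.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Read the curated Person model from the gold folder.
df = spark.read.format("delta").load(
    "abfss://<container>@<storage>.dfs.core.windows.net/gold/Person"
)

# Write it into the dedicated SQL pool using a three-part name:
# <dedicated-pool-database>.<schema>.<table>.
df.write.mode("overwrite").synapsesql("SQLPool01.dbo.Person")
```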

 

As the last step, we created some visualizations in Power BI to display those models. That completed the end-to-end scenario.

 

Conclusion.

There was a ton of learning in this process. Synapse is a very new tool with a lot of capabilities and great potential. New features are added continuously, and its impact on the data analytics space is remarkable: it makes it very easy to build in a single tool what previously took multiple tools, and it makes interoperability between all the involved services straightforward.

 

I hope that the solution published in GitHub, together with all the companion documentation, makes your learning experience better. As you play with this example, feel free to provide your feedback. It is more than welcome.

Updated Nov 18, 2022
Version 3.0