Build a Logical Enterprise Data Warehouse using Azure Data Lake Storage Gen 2 and Synapse Serverless SQL pools
A common pattern in Modern Data Warehouse architecture is to land your source data in its existing format into a data lake followed by transforming and loading it into an Enterprise Data Warehouse (EDW) for reporting and analytics. In Microsoft Azure, this translates to loading data into Azure Data Lake Storage Gen2 (ADLS), transforming data with Synapse or Azure Data Factory Pipelines, storing data in an Azure Synapse Dedicated Pool database or other Azure relational data store, and building reports over that data in Power BI, Excel, or other reporting tools. Typically, the EDW data is stored in a Star Schema, the optimal design for many reporting and analytical tools like Power BI. This is a great practice for enterprise reporting requirements but has some pitfalls:
- Traditional transformation and load of data to the final EDW data store can be time consuming
- Ingress and storage costs increase when transforming and loading from the landing zone to the final data store for reporting
- Deep knowledge of the data sources is required in order to provide business insights and optimal data warehouse design
- What if your data was transformed and ready for reporting as soon as it is landed in the cloud?
- What if you could eliminate that time consuming ETL process?
- How about not incurring additional costs for moving, transforming and storing data in another location?
- What if you suspect that you will need a traditional EDW but would like to start an Exploratory Data Analysis project in the cloud?
- How about doing this all within T-SQL?
Then Serverless SQL Pools in Azure Synapse Analytics may be the answer for you!
In this post, we'll walk through creating a logical data warehouse over your ADLS data using a Serverless SQL database.
- Synapse Analytics Workspace
- Data in an Azure Data Lake Storage Gen 2
Steps to create a logical EDW via Synapse SQL Serverless
Open Synapse Studio
- Create a SQL Serverless Database in Azure Synapse Analytics Workspace Data hub
- I created two schemas, Bronze and Silver. The views in the Bronze schema represent the ADLS data in its existing format. The views in the Silver schema contain queries over the Bronze views to denormalize the data into Facts and Dimensions. In the Data hub, go to your SQL Serverless database and click on the ellipses to the right of the database name, choose New SQL script, then Empty script
In the empty script, type CREATE SCHEMA Bronze, then click Run:
- Repeat step 3 to create a Silver schema.
- Create the SQL views in your Bronze schema over your ADLS files to define column names for easy querying in t-SQL.
A quick way to do this is to navigate to your ADLS account in the Data hub and generate a New SQL script over the folder (or file) for each entity
You will be prompted for the format (Text, Parquet, or Delta) and then a script will be created for you. Change the script to eliminate the TOP100 from the SELECT statement and add CREATE VIEW. Click Run to create the view. Do this for all of the ADLS files that will be used in your logical data warehouse.
Here are the views I created in my Bronze Schema:
- Create SQL views in the Silver schema to define the transformations for the fact and dimension tables for your star schema.
Below are the views I created in the Silver schema. You can create your own views or download the code I created for the views below
Here is an example of the StockItems dimension table that was denormalized for the star schema:
Create View Silver.StockItems as
SI_A.UnitPackageID as [Selling Package],
SI_A.OuterPackageID as [Buying Package],
From Bronze.StockItems as SI_A
INNER JOIN Bronze.Suppliers S ON SI_A.SupplierID = S.SupplierID
INNER JOIN Bronze.Colors C on SI_A.ColorID = C.ColorID
INNER JOIN Bronze.PackageTypes SP_A on SP_A.PackageTypeID = SI_A.UnitPackageID
INNER JOIN Bronze.SupplierCategories SC on S.SupplierCategoryID = SC.SupplierCategoryID
- Connect to the SQL Serverless Endpoint to query your database in your tool of choice, such as Power BI or SQL Server Management studio. Start by navigating to your Synapse Analytics Workspace in the Azure Portal and copying the Serverless SQL Endpoint:
Connect to the endpoint in SQL Server Management Studio by pasting the Serverless SQL endpoint URL as the Server name:
Here's the results of a query over one of the views:
Connect to the Serverless SQL endpoint in Power BI Desktop:
Load and transform just like any other Power BI data source:
Keeping data in ADLS Gen2 without transforming and loading to another data platform gives users quick access to the data they need for reporting and analytics without the cost and delays due to heavy transformations and data movement. After creating your logical EDW in the Synapse SQL Serverless database, you can further evaluate if Synapse SQL Serverless meets your goals for query performance. If more performance is needed, you can leverage your Synapse SQL Serverless views to move data to another Azure data platform like Azure Synapse Analytics SQL Dedicated pool for further query optimization and performance features.