A modern data analytics architecture centered on the Databricks platform implements what is known as a Data Lakehouse architecture. It combines the traditional Data Lake architecture with functionality that was previously available only on Data Warehouse platforms, such as advanced data management features, support for ACID transactions, schema enforcement, and incremental data loading, all on the same platform.
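To make those warehouse-style features concrete, the short Databricks SQL sketch below shows schema enforcement and an incremental, ACID-compliant upsert on a Delta Lake table. The table and column names are hypothetical.

```sql
-- Hypothetical Delta table; the declared schema is enforced on every
-- write, so rows with mistyped or extra columns are rejected rather
-- than silently stored.
CREATE TABLE sales_orders (
  order_id    BIGINT,
  customer_id BIGINT,
  amount      DECIMAL(18, 2),
  order_date  DATE
) USING DELTA;

-- Incremental loading as a single ACID transaction: the MERGE either
-- commits all of its updates and inserts, or none of them.
MERGE INTO sales_orders AS target
USING sales_orders_increment AS source
  ON target.order_id = source.order_id
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *;
```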
Fig. 1 below illustrates the major components of that architecture, grouped into functional layers from a data workflow perspective. Here we show it implemented on Azure Databricks, an implementation of the Databricks platform optimized for Microsoft Azure cloud services.
Fig. 1 – A modern data analytics architecture with Azure Databricks
Note that other components that usually integrate with this architecture, such as ML model training and serving or a traditional data warehouse, are not explored here. You can find more information about how this architecture integrates with those components in this document.
Enhancing your data analytics architecture with a Data Virtualization layer
Consider the following scenario: due to a merger or acquisition, you gain access to new data stored in different databases, and you want to quickly explore it alongside the data already in your Lakehouse, looking for potential new analytics solutions, without having to develop new data ingestion pipelines into the Lakehouse. In another scenario, you might want to simplify the data management needed to comply with regulations governing how and where the original data is accessed and stored.
For scenarios like these, a data virtualization layer is a valuable addition to the analytics platform architecture.
A data virtualization solution allows data stored on different platforms to be virtually integrated: it hides the complexity of accessing data across different sources and formats and exposes the data at a single location, without replicating it. In this way, disparate data residing on different platforms and in different locations can be analyzed and combined as if it were local and in a single format.
In this architecture, the data virtualization layer is implemented with PolyBase in Microsoft SQL Server, which exposes external data as external tables that can be queried with T-SQL. PolyBase connects to Azure Databricks through ODBC, enabling integration with the curated data managed as Delta Lake tables in the Silver and Gold storage layers. In the same way, it can connect to external OLTP and OLAP data managed by other databases, and it can connect directly to the raw data in Azure Data Lake Storage with read and write access.
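As a sketch of how that ODBC connection can be declared, the T-SQL below creates a PolyBase external data source over the Databricks ODBC driver and maps a Gold-layer Delta table as an external table. The host, HTTP path, token, and table names are placeholders, and the exact CONNECTION_OPTIONS string depends on your driver version, so treat this as an outline rather than copy-paste configuration.

```sql
-- Credential holding a (placeholder) Databricks personal access token.
CREATE DATABASE SCOPED CREDENTIAL databricks_credential
WITH IDENTITY = 'token', SECRET = '<personal-access-token>';

-- External data source over the ODBC driver used by Azure Databricks.
CREATE EXTERNAL DATA SOURCE databricks_lakehouse
WITH (
    LOCATION = 'odbc://<workspace-host>:443',
    CONNECTION_OPTIONS = 'Driver={Simba Spark ODBC Driver}; Host=<workspace-host>; Port=443; SSL=1; ThriftTransport=2; AuthMech=3; HTTPPath=<http-path>',
    CREDENTIAL = databricks_credential,
    PUSHDOWN = ON  -- allow predicates to be evaluated at the source
);

-- Map a hypothetical Gold-layer Delta table for querying with T-SQL.
CREATE EXTERNAL TABLE dbo.gold_sales (
    order_id    BIGINT,
    customer_id BIGINT,
    amount      DECIMAL(18, 2),
    order_date  DATE
)
WITH (
    LOCATION = 'gold.sales',
    DATA_SOURCE = databricks_lakehouse
);
```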
To improve query performance and avoid streaming large amounts of data back to the virtualization layer, PolyBase supports query predicate pushdown, both for the Lakehouse data managed by Azure Databricks and for external data managed by other databases.
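Continuing the hypothetical example above, the filter in the query below is eligible for pushdown, so it can be evaluated on the Databricks side before any rows travel back, and the result can be joined with local SQL Server data as if everything lived in one database.

```sql
-- With PUSHDOWN = ON on the data source, the date filter can be
-- evaluated by Databricks, so only matching Gold-layer rows are
-- streamed back to SQL Server.
SELECT c.customer_name,
       SUM(g.amount) AS total_amount
FROM dbo.gold_sales AS g            -- external: Delta table in Databricks
JOIN dbo.customers  AS c            -- local: SQL Server table
  ON c.customer_id = g.customer_id
WHERE g.order_date >= '2023-01-01'
GROUP BY c.customer_name;
```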
We now briefly describe the roles and major components of the other layers.
Ingestion layer
This layer is responsible for loading data into the Data Lakehouse, either in near real time for streaming data processing or in batches.
Azure Event Hubs acts as the event ingestor for near real-time data ingestion, decoupling the generation of event streams from their consumption by the processing layer.
Azure Data Factory allows the creation of data movement workflows for ingesting data into the storage layer in batches.
Storage layer
The storage layer holds all the unstructured, semi-structured, and structured data that comprises the Data Lakehouse. It implements the Delta Lake storage framework on top of an object storage infrastructure such as Azure Data Lake Storage.
The storage layer implements the Medallion Architecture pattern, a data design pattern that organizes data into specialized layers of representation. Raw ingested data is stored in the Bronze layer. After initial refinement and preparation, typically for tasks such as ML modeling and ad hoc BI, data is stored in the Silver layer. Finally, aggregated and enriched data is stored in the Gold layer, in specialized tables and data models built for project-specific analytics.
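A minimal Databricks SQL sketch of that Bronze-to-Gold flow is shown below. The schema and table names, the ADLS path, and the read_files ingestion call (which requires a recent Databricks runtime) are all illustrative assumptions.

```sql
-- Bronze: raw files landed in Azure Data Lake Storage, ingested as-is.
CREATE TABLE bronze.events USING DELTA AS
SELECT *
FROM read_files('abfss://landing@<storage-account>.dfs.core.windows.net/events/');

-- Silver: filtered and conformed data, ready for ad hoc BI and ML work.
CREATE TABLE silver.events USING DELTA AS
SELECT CAST(event_time AS TIMESTAMP) AS event_time,
       user_id,
       action
FROM bronze.events
WHERE user_id IS NOT NULL;

-- Gold: aggregated, project-specific model for reporting.
CREATE TABLE gold.daily_actions USING DELTA AS
SELECT DATE(event_time) AS day,
       action,
       COUNT(*) AS action_count
FROM silver.events
GROUP BY DATE(event_time), action;
```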
Processing and Serving layers
In these layers we run all the data workloads and computations needed for the functionality the analytics platform provides, such as stream processing for near real-time data ingestion, data transformations, feature engineering, machine learning modeling, and data aggregations.
Apache Spark and Databricks SQL Analytics are the main engines for the tasks listed above, and both are now powered by the Databricks Photon engine.
Presentation layer
This is where BI visualization tools such as Microsoft Power BI plug in. Data in the Lakehouse can be accessed through specialized connectors, such as the Azure Databricks connector built into Power BI, or through generic JDBC and ODBC drivers.
In conclusion
In this post we explored the main components of a Data Lakehouse architecture implemented on Azure Databricks, and we showed how a data virtualization layer built with PolyBase in Microsoft SQL Server can bring more agility and simpler data management when integrating the Lakehouse with external data sources.