Today's companies are dealing with data of many different types, in many different sizes, and arriving at varying frequencies. These companies are looking beyond the limitations of traditional data architectures to enable cloud-scale analytics, data science, and machine learning on all of this data. One architecture pattern that addresses many of the challenges of traditional data architectures is the lakehouse architecture.
Lakehouses combine the low cost and flexibility of data lakes with the reliability and performance of data warehouses. The lakehouse architecture provides several key features including:
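When building a lakehouse architecture, keep these 3 key principles and their associated components in mind: (1) a data lake that stores all your data, with the curated data in an open format that supports ACID transactions; (2) a foundational compute layer built on open standards that can handle all of the core lakehouse use cases; and (3) easy integrations with other services for specialized or new use cases.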
Let's look at how Azure Databricks along with Azure Data Lake Storage and Delta Lake can help build a lakehouse architecture using these 3 principles.
The first part of the first principle is to have a data lake that stores all your data. Azure Data Lake Storage offers a low-cost, secure object store capable of storing data of any size (big or small), of any type (structured or unstructured), and arriving at any speed (fast or slow). The second part of the first principle is to keep the Curated data in the data lake in an open format that supports ACID transactions. Companies often use Delta Lake to build this curated zone of their data lake. Delta Lake is an open format that stores data as Parquet files alongside a transaction log, and it can be stored in Azure Data Lake Storage. Among other things, it supports ACID transactions (covering UPDATE, DELETE, and even MERGE operations), time travel, schema evolution and enforcement, and streaming as both a source and a sink. These features make the Delta Lake format in Azure Data Lake Storage an ideal component for the first principle of the lakehouse architecture.
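To make the Delta Lake features named above more concrete, here is a minimal toy sketch in plain Python. It is not the Delta Lake API and does not reflect how Delta Lake is implemented internally; `ToyVersionedTable`, `SchemaMismatchError`, and the sample rows are hypothetical names made up for illustration. It only models three ideas from the paragraph: a MERGE (upsert) applied as one all-or-nothing commit, schema enforcement, and time travel by reading an earlier version.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


class SchemaMismatchError(ValueError):
    """Raised when a row's columns differ from the table schema."""


@dataclass
class ToyVersionedTable:
    """Hypothetical in-memory stand-in for a versioned table (not Delta Lake)."""
    key: str
    columns: frozenset
    # _versions[n] is the full table state after commit n; version 0 is empty.
    _versions: List[Dict[object, dict]] = field(default_factory=lambda: [{}])

    @property
    def latest_version(self) -> int:
        return len(self._versions) - 1

    def merge(self, rows: List[dict]) -> int:
        """Upsert rows by key as a single commit and return the new version."""
        # Schema enforcement: validate every row before changing anything.
        for row in rows:
            if set(row) != self.columns:
                raise SchemaMismatchError(
                    f"expected {sorted(self.columns)}, got {sorted(row)}"
                )
        # Build the next state on a copy so a failure never leaves a partial write.
        new_state = dict(self._versions[-1])
        for row in rows:
            new_state[row[self.key]] = dict(row)
        self._versions.append(new_state)
        return self.latest_version

    def read(self, version: Optional[int] = None) -> List[dict]:
        """Return rows at a given version (time travel); defaults to the latest."""
        index = self.latest_version if version is None else version
        state = self._versions[index]
        return sorted((dict(r) for r in state.values()), key=lambda r: r[self.key])


# Usage with made-up sample data.
table = ToyVersionedTable(key="id", columns=frozenset({"id", "city"}))
v1 = table.merge([{"id": 1, "city": "Oslo"}, {"id": 2, "city": "Lima"}])
v2 = table.merge([{"id": 2, "city": "Quito"}, {"id": 3, "city": "Cairo"}])

current_rows = table.read()             # state after the second commit
earlier_rows = table.read(version=v1)   # state as of the first commit
```

In this sketch, a row with an unexpected column is rejected before any state changes, so the latest version stays as it was. A real lakehouse would get these guarantees from Delta Lake itself, typically through Spark SQL or the DataFrame API on Azure Databricks, rather than from hand-written code like this.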
The 2nd principle discussed above is to have a foundational compute layer built on open standards that can handle all of the core lakehouse use cases. The Photon-powered Delta Engine found in Azure Databricks is an ideal layer for these core use cases. The Delta Engine is rooted in Apache Spark, supporting all of the Spark APIs along with support for SQL, Python, R, and Scala. In addition, Azure Databricks provides other open source frameworks including:
Alongside the Delta Engine, Azure Databricks also provides a collaborative workspace that includes an integrated notebook environment as well as a SQL Analytics environment designed to make it easier for analysts to write SQL on the data lake, visualize results, build dashboards, and schedule queries and alerts. All of this makes Azure Databricks and the Delta Engine an ideal foundational compute layer for core lakehouse use cases.
The final principle focuses on key integrations between the Curated data lake, foundational compute layer, and other services. This is necessary because there will always be specialized or new use cases that are not "core" lakehouse use cases. Also, different business areas may prefer different or additional tools (especially in the SQL analytics and BI space). A lakehouse built on Azure Data Lake Storage, Delta Lake, and Azure Databricks provides easy integrations for these new or specialized use cases.
To conclude, the lakehouse architecture pattern is one that will continue to be adopted because of its flexibility, cost efficiency, and open standards. Building an architecture with Azure Databricks, Delta Lake, and Azure Data Lake Storage provides a foundation for lakehouse use cases that is open, extensible, and future-proof.
To learn more about Lakehouse architecture, check out this research paper and blog from Databricks and join an Azure Databricks event.