All good points. I think when it comes to cost it is not a simple equation. For example, if it is a really active OLTP application, running extracts directly from it is probably off the table. A readable secondary, a replicated copy, or Change Data Capture of the tables needed for analysis is probably needed in the architecture, even if it is just to get the data onto storage to support Delta Lake. That expense is kind of a sunk cost. Also, if an organization has DBA FTEs who maintain this, adding additional FTEs to replicate the process on the Spark side may be more costly than paying the RDBMS or CDC vendor.

I think when the Data Engineer goes and gets the data by query via JDBC and writes it to storage as required (on a per-project or cross-project basis) as part of a pipeline, that makes sense. I spoke to some contacts at Databricks and they said, as Rodrigo mentions, it depends: for a dimension table (not super big), query the data source directly through the Databricks data sources; for a large table, write it to storage first. I have tacked a few rough sketches of both patterns onto the end of this comment. Agreed that data on storage can be accessed by multiple query (compute) engines (Spark, the database, and the BI tools). I believe in the polyglot persistence concept that Rodrigo has another blog post on, as does James Serra's blog: use the right technology for the right purpose.

Take a look at my diagram from my latest post. Everything above storage in the diagram is coming from the OLTP side. Data Integration could sit both above and below storage as well, because it is using Spark under the covers for the transformations (at least in the case of ADF and Synapse, and I believe Informatica). In fact, Spark should be merged into the Data Integration layer. Also, I just noticed that my Spark layer should extend into the AI and ML zone (I will fix that). Spark and Data Integration can transcend both storage and database, including NoSQL (Cosmos DB) and Data Explorer. Storage is the lowest common denominator for integration. For the streaming and logs at the bottom of the diagram, storage is where Kafka, Event Hub, IoT Hub, and Spark land data.

That being said, if the data is in a database and you can leave it there and use it without writing it to storage, do that; or, in the case of Azure Synapse, build an external table on the data in storage, join it to the database table, and let the SQL engine do the join (also sketched at the end).

Thanks again.
Darwin
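
P.S. Here is a rough sketch of the "pull it via JDBC and land it on storage" pipeline step I mentioned, as it might look in PySpark on Databricks. The connection string, table name, credentials, and storage path are all placeholders for illustration, not anything from Rodrigo's post:

```python
# Sketch of a pipeline step: extract a large table over JDBC and land it on storage.
# All names (server, table, path) are illustrative placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("oltp-extract").getOrCreate()

# Read the large source table over JDBC, ideally against a readable secondary
# or replicated copy rather than the active OLTP primary.
orders = (
    spark.read.format("jdbc")
    .option("url", "jdbc:sqlserver://myserver.database.windows.net;database=Sales")
    .option("dbtable", "dbo.Orders")
    .option("user", "etl_user")
    .option("password", "<secret>")
    .option("numPartitions", 8)                # parallelize the extract
    .option("partitionColumn", "OrderId")
    .option("lowerBound", 1)
    .option("upperBound", 10000000)
    .load()
)

# Land it on storage as Delta so multiple engines (Spark, Synapse, BI tools) can read it.
(orders.write.format("delta")
    .mode("overwrite")
    .save("abfss://datalake@mystorageacct.dfs.core.windows.net/raw/sales/orders"))
```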
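
And the "it depends" guidance from the Databricks folks, sketched the same way: the small dimension table is queried straight from the source at run time, while the large fact table comes from the copy already landed on storage. Again, every table name and path here is made up for the example:

```python
# Sketch of the hybrid pattern: small dimension direct from the database,
# large fact from the Delta copy on storage. Placeholder names throughout.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("hybrid-join").getOrCreate()

jdbc_opts = {
    "url": "jdbc:sqlserver://myserver.database.windows.net;database=Sales",
    "user": "etl_user",
    "password": "<secret>",
}

# Small dimension: query the data source directly over JDBC.
dim_customer = (
    spark.read.format("jdbc")
    .options(**jdbc_opts)
    .option("dbtable", "dbo.DimCustomer")
    .load()
)

# Large fact: read the copy already written to storage instead of hitting the OLTP side.
fact_orders = spark.read.format("delta").load(
    "abfss://datalake@mystorageacct.dfs.core.windows.net/raw/sales/orders"
)

# Let Spark do the join across the two sources.
summary = (
    fact_orders.join(dim_customer, "CustomerId")
    .groupBy("CustomerName")
    .agg(F.sum("Amount").alias("TotalAmount"))
)
summary.show()
```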
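
Finally, a very rough illustration of the Synapse option: leave the data in the database and let the SQL engine join it to an external table built over files on storage. This assumes a dedicated SQL pool where an external data source and file format already exist; the driver, server, schema, and object names are all placeholders, wrapped in Python via pyodbc just to keep one language for the sketches:

```python
# Sketch only: join a storage-backed external table to a regular database table
# and let the Synapse SQL engine do the work. Assumes the external data source
# (MyDataLake) and file format (ParquetFormat) were already created; all names
# are hypothetical.
import pyodbc

conn = pyodbc.connect(
    "Driver={ODBC Driver 17 for SQL Server};"
    "Server=mysynapse.sql.azuresynapse.net;Database=SalesDW;"
    "UID=etl_user;PWD=<secret>",
    autocommit=True,
)
cur = conn.cursor()

# External table over the Parquet files landed on storage.
cur.execute("""
    CREATE EXTERNAL TABLE ext.OrdersStaged (
        OrderId INT, CustomerId INT, Amount DECIMAL(18, 2)
    )
    WITH (LOCATION = '/raw/sales/orders/',
          DATA_SOURCE = MyDataLake,
          FILE_FORMAT = ParquetFormat);
""")

# Join the external table to a table that stays in the database.
rows = cur.execute("""
    SELECT c.CustomerName, SUM(o.Amount) AS TotalAmount
    FROM ext.OrdersStaged AS o
    JOIN dbo.DimCustomer AS c ON c.CustomerId = o.CustomerId
    GROUP BY c.CustomerName;
""").fetchall()

for name, total in rows:
    print(name, total)
```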