Rodrigo, seeing this late. Great blog post. I have always wondered about the best way to organize data lake files. Also, what are your thoughts on exporting RDBMS data to CSV in the data lake versus just landing it in a landing-zone RDBMS and joining the RDBMS data with the file-based data lake data in Spark when you need it? It seems a shame to de-schematize a table to CSV and keep it in sync with changes to the table just to have the data in the data lake. There is the cost of keeping the RDBMS running in the landing zone, but is that worth it to preserve the schema? It would be interesting to see what the consensus is.

I like the idea of keeping the schematized data in a landing-zone RDBMS (maintained with ETL, CDC, or transactional replication) and joining it to file-based sources via Spark or ADF when needed. What do you think is the best practice?
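For what it's worth, this is roughly the kind of join I have in mind. A minimal PySpark sketch, assuming a SQL Server landing zone and an ADLS Gen2 lake; the connection details, table names, paths, and the customer_id key are all made up for illustration:

```python
from pyspark.sql import SparkSession

# Build a Spark session; the JDBC driver jar would need to be on the classpath.
spark = SparkSession.builder.appName("rdbms-datalake-join").getOrCreate()

# Read the schematized table straight from the landing-zone RDBMS over JDBC,
# keeping its schema intact (hypothetical connection details).
orders = (
    spark.read.format("jdbc")
    .option("url", "jdbc:sqlserver://landing-zone-db:1433;databaseName=Sales")
    .option("dbtable", "dbo.Orders")
    .option("user", "etl_user")
    .option("password", "********")
    .load()
)

# Read the file-based data already sitting in the lake (Parquet here, but CSV works too).
clickstream = spark.read.parquet(
    "abfss://lake@account.dfs.core.windows.net/raw/clickstream/"
)

# Join on a shared key and write the result back to the lake,
# without ever exporting the table to CSV.
enriched = clickstream.join(orders, on="customer_id", how="left")
enriched.write.mode("overwrite").parquet(
    "abfss://lake@account.dfs.core.windows.net/curated/enriched_clickstream/"
)
```

The appeal to me is that the schema lives in one place (the landing-zone database) and Spark only sees it at read time, so there is no CSV extract to keep in sync.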