Hello DarwinSchweitzer! Thank you for your answer and feedback. Getting to your question, I think there are a few decision points:
- How big is the data? Does it fit in your RDBMS? How much does it cost to scale your RDBMS, and how big can it go?
- A Data Lake is storage only, with no processing cost to ingest data. How often will data be ingested? Can you save money by pausing the RDBMS? Can you save money by landing data into a Data Lake and using a small RDBMS to read it, or to host only curated/aggregated data?
- Do you want to integrate (join, enrich, etc.) that CSV data with semi-structured or unstructured data? How big is that other data? What is the cheaper solution to mix these two data types?
- Do you want to use that CSV data for something else, like ML? If yes, does your RDBMS support in-database ML?
- Can you benefit from landing that data into a Data Lake to make it available to more than one query engine? Data in an RDBMS can only be used through that RDBMS's query engine. Data in a Data Lake can be queried at the same time by Hive, Polybase, Spark, LLAP, PBI, Jupyter Notebooks, etc. (see the first sketch after this list).
- An RDBMS enforces lots of controls: schema on write, transaction control, referential integrity, locks. Do you need all of that? Does the data come from a source that already did all of that? Or does it come from a source that controls none of it, and you want to keep it as-is so you can also analyze the data quality (see the second sketch after this list)?
- Data Lakes are auto-healing by design, at low cost. Do you need to protect the data? How much does it cost to do that in your RDBMS?
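To make the multi-engine point concrete, here is a minimal PySpark sketch of querying a CSV file directly in the lake. The storage path (`abfss://mylake@mystorage.dfs.core.windows.net/raw/sales.csv`) and the `region`/`amount` columns are hypothetical placeholders; the same file could be read at the same time by Hive, Polybase, or a notebook, with no data movement.

```python
# Minimal sketch: query a CSV sitting in a Data Lake with Spark SQL.
# Path and column names are hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("lake-demo").getOrCreate()

# Schema-on-read: no table has to exist before we query the file.
sales = (spark.read
         .option("header", "true")
         .option("inferSchema", "true")
         .csv("abfss://mylake@mystorage.dfs.core.windows.net/raw/sales.csv"))

sales.createOrReplaceTempView("sales")
spark.sql("SELECT region, SUM(amount) AS total FROM sales GROUP BY region").show()
```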
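And on the schema-on-write point: with schema-on-read you can keep bad rows instead of rejecting them at load time, which is handy for the data-quality analysis I mentioned. A sketch under the same hypothetical path and columns, using Spark's permissive CSV mode:

```python
# Sketch: apply a schema only at read time and keep rows that fail it,
# instead of rejecting them on write as an RDBMS would.
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, DoubleType

spark = SparkSession.builder.appName("schema-on-read").getOrCreate()

schema = StructType([
    StructField("region", StringType(), True),
    StructField("amount", DoubleType(), True),
    StructField("_corrupt_record", StringType(), True),  # holds rows that fail the schema
])

raw = (spark.read
       .option("header", "true")
       .option("mode", "PERMISSIVE")  # keep malformed rows rather than dropping them
       .option("columnNameOfCorruptRecord", "_corrupt_record")
       .schema(schema)
       .csv("abfss://mylake@mystorage.dfs.core.windows.net/raw/sales.csv"))

raw.cache()  # Spark requires caching before querying only the corrupt-record column

# Inspect data quality: rows that failed the schema are flagged, not lost.
raw.filter(raw["_corrupt_record"].isNotNull()).show(truncate=False)
```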
These are the decision points that I can see at the moment. What do you think? Tks!!