SOLVED

Do I really need a data archive in a lake?


Hi

 

I am building a new data warehouse using Azure Cloud.

 

We have a limited number of MySQL relational databases as sources, and the important tables all have history. In other words, I can sum things up over time using simple SQL.
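
For example (table and column names are made up, just to make this concrete): a monthly total is a plain GROUP BY over a history table. A minimal sketch with pandas and SQLAlchemy:

```python
import pandas as pd
from sqlalchemy import create_engine

# Placeholder MySQL connection string.
engine = create_engine("mysql+pymysql://user:pass@mysql-host/sales")

# Monthly totals straight from the history table, a plain GROUP BY.
monthly = pd.read_sql(
    """
    SELECT YEAR(order_date)  AS yr,
           MONTH(order_date) AS mo,
           SUM(amount)       AS total
    FROM   orders_history
    GROUP  BY YEAR(order_date), MONTH(order_date)
    ORDER  BY yr, mo
    """,
    engine,
)
print(monthly.head())
```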

 

Now I have been advised to use a Data Lake with .CSV files as an archive layer in my Data Warehouse.

 

But why?

 

I cannot see why I should spend time moving the data from MySQL to .CSV and finally into an Azure SQL database when I can copy the data directly from MySQL to a staging layer in my Azure SQL database using Azure Data Factory.
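
Just to illustrate the direct path I mean, here is a minimal sketch. It uses pandas/SQLAlchemy rather than Azure Data Factory, purely to show the data flow; all connection strings and table names are placeholders:

```python
import pandas as pd
from sqlalchemy import create_engine

# Source: the MySQL database (placeholder connection string).
mysql_engine = create_engine("mysql+pymysql://user:pass@mysql-host/sales")

# Target: the Azure SQL Database staging layer (placeholder DSN).
azure_engine = create_engine(
    "mssql+pyodbc://user:pass@myserver.database.windows.net/dw"
    "?driver=ODBC+Driver+17+for+SQL+Server"
)

# Read a history table straight from MySQL...
orders = pd.read_sql("SELECT * FROM orders_history", mysql_engine)

# ...and land it in staging, with no intermediate .CSV archive.
orders.to_sql("orders", azure_engine, schema="staging",
              if_exists="replace", index=False)
```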

 

In the longer term we might need to store massive amounts of data from our NoSQL database and from sources without history. In that case the Data Lake is a fast and cheap place to keep them.
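
If that day comes, landing raw documents in the lake is simple. A minimal sketch, assuming the azure-storage-blob package and made-up container and file names:

```python
import json
from azure.storage.blob import BlobServiceClient

# Placeholder connection string for the storage account backing the lake.
service = BlobServiceClient.from_connection_string("<storage-connection-string>")
container = service.get_container_client("raw")

# Dump the documents as-is; storage is cheap, and the schema can be
# decided later, when the data is actually read.
events = [{"id": 1, "payload": {"clicks": 42}}]
blob = container.get_blob_client("events/2024/01/events-0001.json")
blob.upload_blob("\n".join(json.dumps(e) for e in events), overwrite=True)
```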

 

What do you think? Do you agree that I can skip the Data Lake for now?

 

Regards,

Christian

3 Replies
best response confirmed by Christian_Hoffmann (Copper Contributor)
Solution
Yes, you can skip it for now. There are reasons to use the "middle step": once you start getting data from multiple sources, some of them may require more mangling, and there may be extra steps in combining everything, so you might want that middle layer as a sort of staging/conversion area. Data size is also a factor: it might be faster to dump everything into a data lake first and do the additional filtering, sanitising, anonymisation and so on before pushing it to the final database (or whatever the end target is). It is easy to add steps later on. My advice in general is to start simple and not over-engineer without a good reason.
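
As a rough illustration of that middle step (file, column, and table names are hypothetical), a filter/sanitise/anonymise pass before the final load could look like:

```python
import hashlib
import pandas as pd
from sqlalchemy import create_engine

# Read a raw file from the lake (assumes a public or SAS-enabled URL).
df = pd.read_csv("https://<account>.blob.core.windows.net/raw/customers.csv")

# Filter / sanitise: drop rows missing the key.
df = df.dropna(subset=["customer_id"])

# Anonymise: replace e-mail addresses with a one-way hash.
df["email"] = df["email"].map(lambda e: hashlib.sha256(e.encode()).hexdigest())

# Push the cleaned data to the warehouse (placeholder DSN).
engine = create_engine(
    "mssql+pyodbc://user:pass@myserver.database.windows.net/dw"
    "?driver=ODBC+Driver+17+for+SQL+Server"
)
df.to_sql("customers", engine, schema="staging",
          if_exists="replace", index=False)
```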
Thank you for the good advice!

In case you are interested in reading another viewpoint, Melissa Coates & James Serra have written about it in the past.
TL;DR - the answer is almost always "it depends"
https://www.jamesserra.com/archive/2018/11/should-i-load-structured-data-into-my-data-lake/
