SOLVED

Do I really need a data archive in a lake?


Hi

 

I am building a new data warehouse using Azure Cloud.

 

We have a limited number of MySQL relational databases as sources, and the important tables all have history. In other words, I can sum things up over time using simple SQL.
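
For example (table and column names are made up, just to make this concrete): a monthly total is a plain GROUP BY over a history table. A minimal sketch with pandas and SQLAlchemy:

```python
import pandas as pd
from sqlalchemy import create_engine

# Placeholder MySQL connection string.
engine = create_engine("mysql+pymysql://user:pass@mysql-host/sales")

# Monthly totals straight from the history table, a plain GROUP BY.
monthly = pd.read_sql(
    """
    SELECT YEAR(order_date)  AS yr,
           MONTH(order_date) AS mo,
           SUM(amount)       AS total
    FROM   orders_history
    GROUP  BY YEAR(order_date), MONTH(order_date)
    ORDER  BY yr, mo
    """,
    engine,
)
print(monthly.head())
```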

 

Now I have been advised to use a Data Lake with .CSV files as an archive layer in my Data Warehouse.

 

But why?

 

I cannot see why I should spend time moving the data from MySQL to .CSV and finally into an Azure SQL database when I can copy the data directly from MySQL to a staging layer in my Azure SQL database using Azure Data Factory.
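
Just to illustrate the direct path I mean, here is a minimal sketch. It uses pandas/SQLAlchemy rather than Azure Data Factory, purely to show the data flow; all connection strings and table names are placeholders:

```python
import pandas as pd
from sqlalchemy import create_engine

# Source: the MySQL database (placeholder connection string).
mysql_engine = create_engine("mysql+pymysql://user:pass@mysql-host/sales")

# Target: the Azure SQL Database staging layer (placeholder DSN).
azure_engine = create_engine(
    "mssql+pyodbc://user:pass@myserver.database.windows.net/dw"
    "?driver=ODBC+Driver+17+for+SQL+Server"
)

# Read a history table straight from MySQL...
orders = pd.read_sql("SELECT * FROM orders_history", mysql_engine)

# ...and land it in staging, with no intermediate .CSV archive.
orders.to_sql("orders", azure_engine, schema="staging",
              if_exists="replace", index=False)
```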

 

In the longer term we might need to store massive amounts of data from our NoSQL database and from sources without history. In that case the Data Lake is a fast and cheap place to keep them.
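
If that day comes, landing raw documents in the lake is simple. A minimal sketch, assuming the azure-storage-blob package and made-up container and file names:

```python
import json
from azure.storage.blob import BlobServiceClient

# Placeholder connection string for the storage account backing the lake.
service = BlobServiceClient.from_connection_string("<storage-connection-string>")
container = service.get_container_client("raw")

# Dump the documents as-is; storage is cheap, and the schema can be
# decided later, when the data is actually read.
events = [{"id": 1, "payload": {"clicks": 42}}]
blob = container.get_blob_client("events/2024/01/events-0001.json")
blob.upload_blob("\n".join(json.dumps(e) for e in events), overwrite=True)
```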

 

What do you think? Do you agree that I can skip the Data Lake for now?

 

Regards,

Christian

3 Replies
best response confirmed by Christian_Hoffmann (Copper Contributor)
Solution
Yes, you can skip it for now. There are reasons to use the "middle step": once you start getting data from multiple sources, some of them may require more mangling, and there may be extra steps in combining everything, so you might want that middle layer as a sort of staging/conversion area. Data size is also a factor: it might be faster to dump everything into a data lake first and do the additional filtering, sanitising, anonymisation and so on before pushing it to the final database (or whatever the end target is). It is easy to add steps later on. My advice in general is to start simple and not over-engineer without a good reason.
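
As a rough illustration of that middle step (file, column, and table names are hypothetical), a filter/sanitise/anonymise pass before the final load could look like:

```python
import hashlib
import pandas as pd
from sqlalchemy import create_engine

# Read a raw file from the lake (assumes a public or SAS-enabled URL).
df = pd.read_csv("https://<account>.blob.core.windows.net/raw/customers.csv")

# Filter / sanitise: drop rows missing the key.
df = df.dropna(subset=["customer_id"])

# Anonymise: replace e-mail addresses with a one-way hash.
df["email"] = df["email"].map(lambda e: hashlib.sha256(e.encode()).hexdigest())

# Push the cleaned data to the warehouse (placeholder DSN).
engine = create_engine(
    "mssql+pyodbc://user:pass@myserver.database.windows.net/dw"
    "?driver=ODBC+Driver+17+for+SQL+Server"
)
df.to_sql("customers", engine, schema="staging",
          if_exists="replace", index=False)
```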
Thank you for the good advice!

In case you are interested in reading another viewpoint, Melissa Coates & James Serra have written about it in the past.
TL;DR - the answer is almost always "it depends"
https://www.jamesserra.com/archive/2018/11/should-i-load-structured-data-into-my-data-lake/
