How to Organize your Data Lake
Published Feb 19 2020 09:05 AM
Microsoft

Data lakes are one of the best outputs of the big data revolution, enabling cheap and reliable storage for all kinds of data, from relational to unstructured, from small to huge, from static to streaming. While on-prem implementations of this technology face administration and scalability challenges, public clouds have made our lives easier with data-lake-as-a-service offerings, like Azure Data Lake, which provides unlimited scalability and integrated security.

 

But data lakes don’t enforce schema and can easily become data swamps, turning into useless, out-of-control data structures that won’t be as useful as they could be until the mess is cleaned up. In this blog post, you will see some suggestions to avoid this logical and organizational problem.

 

The Challenge 

 

Let’s start from a hypothetical scenario where the data lake is already deployed and multiple teams are using it for mobile apps, log management, IoT, EDW, and other big data scenarios. There are clients uploading images into intelligent agents, applications streaming terabytes of JSON files, multiple relational data silos sending data in CSV format, sensors sending IoT data, and much more. The data sources are in different time zones and use different currencies.

 

Still in our hypothetical scenario, let's say the volume of data is increasing without control and there are numerous folders (directories) nested within each other. Developers are lost, unsure of which query engine to use, and the processes built so far are becoming slower and slower. And where should the outputs be saved? Loaded into a SQL database like Azure SQL DB? Or into an MPP database like Azure Synapse Analytics? Should the developers save the output of their transformations into the data lake itself? Using which format?

 

These are the makings of a hypothetical data swamp. After a few months you may have a perfect storm, where all analytics is compromised because of costs, poor performance, missed SLAs, and wrong calculations.

 

The Solution 

 

Data governance tools like Azure Data Catalog can help to register and access the data assets, but they are not enough to avoid a data swamp. There isn’t a single measure or tool that avoids every possible problem with your data lake, but good practices will protect your environment from this kind of disaster. Here is a list of measures that you may take:

 

Folder Structure 

 

Now let’s start with some suggestions from our experience implementing many data lake projects. The first point is to define a clear directory structure that reflects its usage. Since a data lake is a distributed file system, everything will be a file within a folder. In collaboration with all teams, you can try to create a layered structure like the one below:

 

| Data Lake Layer | Usage | Expected Volume | Path – Per Project | Sub Folders (Granularity) |
|---|---|---|---|---|
| Raw Files | Files without any transformation, stored “as is”, every minute. This is the landing zone of your data lake. | ~ TBs / day | /project-name/raw-files | /year/month/day/hour/minute |
| Raw Data | Now all files are in a queryable format: same time zone and currency, with special characters and duplicates removed. Small files are merged into bigger files, which is a best practice for big data workloads. | ~ GBs / day | /project-name/raw-data | /year/month/day |
| Business Data | Raw data + business rules. Now you start to have the basic aggregations that will help all other analysis. It is a good idea to use parallel processing on top of the distributed file system for this heavy workload. This layer can also be used for data ingestion into DWs or DMs. | ~ MBs / day | /project-name/business-data | /year/month |

 

Some important points about the table above: 

  • Each layer is the input of the next one. 
  • You may want to add the data source after the project name. 
  • All these folders will contain sub-folders per timestamp. Granularity decreases as you move to the next layer. This won’t be a problem for query engines, since they leverage the metastore that maps the root folder as a table or container. 
  • If anything goes out of control, you can easily identify where the problem is happening. You can also create jobs to check and log how the folder sizes evolve. 
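To make the convention concrete, here is a minimal sketch in plain Python of how a pipeline could build the partitioned paths described in the table above. The project name ("telemetry") and the helper itself are hypothetical, not part of any Azure SDK:

```python
from datetime import datetime, timezone

# Hypothetical layer definitions: layer name -> partition granularity.
LAYERS = {
    "raw-files": "%Y/%m/%d/%H/%M",   # /year/month/day/hour/minute
    "raw-data": "%Y/%m/%d",          # /year/month/day
    "business-data": "%Y/%m",        # /year/month
}

def layer_path(project: str, layer: str, ts: datetime) -> str:
    """Build the folder path for a given project, layer, and timestamp."""
    return f"/{project}/{layer}/{ts.strftime(LAYERS[layer])}"

# Example usage for a hypothetical 'telemetry' project.
now = datetime.now(timezone.utc)
print(layer_path("telemetry", "raw-files", now))      # e.g. /telemetry/raw-files/2020/02/19/09/05
print(layer_path("telemetry", "business-data", now))  # e.g. /telemetry/business-data/2020/02
```

Keeping this logic in one place means every job writes to the same predictable folder layout, which is what lets the metastore map each layer's root folder as a table.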

 

Files Format 

 

It is very useful to avoid mixing different file formats or compression codecs in the same folder. This will help preserve the sanity of the developers and data engineers. Another table will help with the organization:

 

| Data Lake Layer | Files Format | Compression | Why |
|---|---|---|---|
| Raw Files | “as is” | Gzip | The same format as the original data, for fast data ingestion. Gzip delivers a good compression ratio for most file types. |
| Raw Data | Sequence Files | Snappy | Sequence files are a good option for the map-reduce programming paradigm, as they can easily be split across data nodes, enabling parallel processing. Another advantage of sequence files is that two or more files can be merged into a bigger one, a typical goal of this layer. Snappy doesn’t give the best compression ratio, but it has excellent compress/decompress performance. |
| Business Data | Parquet Files | Snappy | Good for interactive queries using Presto, Impala, or an external connection like PolyBase, which allows you to query external data from Azure Synapse Analytics. Snappy compression is used again; Parquet files are columnar, making them compressed by nature, so the data volume isn’t a problem anymore. |
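As an illustration of moving data between these layers, here is a minimal PySpark sketch that reads gzipped JSON from the raw-files layer and writes snappy-compressed Parquet into the business layer. The storage account, container, project name, column names, and aggregation are all hypothetical, and the intermediate sequence-file step is skipped for brevity:

```python
from pyspark.sql import SparkSession

# Hypothetical ADLS Gen2 paths following the folder convention above.
RAW_FILES = "abfss://datalake@account.dfs.core.windows.net/telemetry/raw-files/2020/02/19/*/*"
BUSINESS = "abfss://datalake@account.dfs.core.windows.net/telemetry/business-data/2020/02"

spark = SparkSession.builder.appName("raw-to-business").getOrCreate()

# Spark decompresses gzipped JSON transparently based on the file extension.
df = spark.read.json(RAW_FILES)

# Apply hypothetical business rules: deduplicate and aggregate per device and day.
daily = (
    df.dropDuplicates(["device_id", "event_ts"])
      .groupBy("device_id", "event_date")
      .count()
)

# Merge the output into a few larger files and write snappy-compressed Parquet.
(
    daily.coalesce(8)
         .write.mode("overwrite")
         .option("compression", "snappy")
         .parquet(BUSINESS)
)
```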

 

With a fixed organization like this, it is easy to determine which tools should be used for data integration and analysis. And the tool used to access the data will define what kind of professionals will have access to each layer. This point is addressed in the next topic. 

 

Access Control 

 

Now let’s see how we can organize access to each layer. Again, let’s use a table for better visualization. 

 

| Data Lake Layer | Files Format | Compatible Tools | Access for | Access Type | How |
|---|---|---|---|---|---|
| Raw Files | “as is” | Hive, Pig, Java, Python | Data Engineers | Read / Write | Batch jobs |
| Raw Data | Sequence Files | Hive, Python, Impala, Hive2, Drill, Presto | Data Engineers, Data Scientists | Read / Write | Data exploration activities, knowing that the data has not yet been through the business rules. The size of this layer and the state of the data make it unusable for data analysts or end users. |
| Business Data | Parquet Files | Impala, Hive2, Drill, Presto, BI tools, PolyBase, Sqoop, Azure Data Factory | Data Engineers, Data Scientists, Data Analysts | Read Only | Here the data is user friendly and the format is optimized for interactive queries. |

Modern SQL databases that allow external data access can query this data for extra data integrations or for a virtual data warehouse; PolyBase is the key tool to do this on Azure. External data can also be used to feed data marts in the SQL database. 

Tools like Sqoop or ADF can also be used to export the data into SQL databases. 
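One way to enforce the read-only access described above is through the POSIX-like ACLs on ADLS Gen2 folders. The sketch below is an assumption-heavy illustration using the azure-storage-file-datalake Python SDK; the storage account URL, container name, folder, and Azure AD group object ID are all hypothetical placeholders, and your authentication setup will differ:

```python
from azure.identity import DefaultAzureCredential
from azure.storage.filedatalake import DataLakeServiceClient

# Hypothetical storage account, container, and AAD group object ID.
ACCOUNT_URL = "https://account.dfs.core.windows.net"
ANALYSTS_GROUP_ID = "00000000-0000-0000-0000-000000000000"

service = DataLakeServiceClient(ACCOUNT_URL, credential=DefaultAzureCredential())
fs = service.get_file_system_client("datalake")

# Grant the analysts group read + list access on the business layer only.
business = fs.get_directory_client("telemetry/business-data")
acl = (
    "user::rwx,"                      # owning user: full access
    "group::r-x,"                     # owning group: read + list
    "other::---,"                     # everyone else: no access
    f"group:{ANALYSTS_GROUP_ID}:r-x"  # analysts AAD group: read-only
)
business.set_access_control(acl=acl)
```

Data engineers and data scientists would get write ACLs on the raw layers through their own groups, which keeps the layer boundaries in the table above enforceable rather than just documented.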

 

 

Conclusion 

 

If you create methods to enforce this big data architecture, most of the typical problems will be avoided. You can also create monitoring jobs to search for and log problems, which allows you to keep a record of the data state within your data lake. The human factor is decisive, and methodologies like TDSP are useful for framing conversations with the data science team. An open dialog with all stakeholders should take place before, during, and after the process implementation. 
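As a sketch of such a monitoring job, assuming the azure-storage-file-datalake SDK and hypothetical account, container, and project names, you could periodically log the total size of each layer's folder tree and watch how it evolves:

```python
from collections import defaultdict

from azure.identity import DefaultAzureCredential
from azure.storage.filedatalake import DataLakeServiceClient

ACCOUNT_URL = "https://account.dfs.core.windows.net"  # hypothetical account

service = DataLakeServiceClient(ACCOUNT_URL, credential=DefaultAzureCredential())
fs = service.get_file_system_client("datalake")  # hypothetical container

# Sum file sizes per layer folder (e.g. telemetry/raw-files, telemetry/raw-data).
sizes = defaultdict(int)
for path in fs.get_paths(path="telemetry", recursive=True):
    if not path.is_directory:
        layer = "/".join(path.name.split("/")[:2])
        sizes[layer] += path.content_length or 0

for layer, size in sorted(sizes.items()):
    print(f"{layer}: {size / 1024**3:.2f} GiB")
```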

11 Comments
Copper Contributor

As always, a very instructive and concise explanation. Cheers!

Microsoft

Thank you!!

Copper Contributor

Rodrigo, I've been exploring how to enable the Common Data Model in the Data Lake.  This would enable my customer that has "siloed" data to be able to talk the same language across silos.  Wondering if there is a possible hybrid approach with your recommended solution above?

Microsoft

Hello @cobrow 

 

I think that CDM was designed for relational data, but it might work if you don't need transactions. Also, keep in mind that you can translate the suggested solution to a traditional DW:

 

  • raw files = landing zone
  • raw data = staging area
  • business data = ODS

If you test it, let me know if it went well or not.

 

Tks

Copper Contributor

Hi @Rodrigo Souza 

Your approach is developed around projects, which is fine for "output", but what is your take on data that is not project related? E.g. master data (product, customer etc.) that will be reused in many projects. It could also be transaction data, like sales, that will be used in many projects. To me it would be more logical to organize data around sources on the input side, and then each project (e.g. a churn project) would save intermediate tables and results in a project-related structure. 

 

With kind regards

Kristian Appel

 

Microsoft

Hello @Kristian_Appel 

 

For master data you may not need the sub-folders per year. Also, you will probably join it with your "business data", meaning that parquet is a good idea. 

 

I agree with the second point: depending on the objectives, it may be a good idea to organize the data with the source at the first level. But date/time is always important for transactional data, as it allows you to do timeline analysis. This partition strategy also avoids files that are too big or too small; you just need to define the folder granularity.

 

Tks!

Rodrigo, seeing this late.  Great blog post.  I have always wondered about the best way to organize the data lake files.  Also, what are your thoughts on putting RDBMS data in CSV format in the data lake vs. just landing it in a landing zone RDBMS?  Then just join the RDBMS data and file-based data lake data with Spark when you need to.  It seems like a shame to de-schematize a table to CSV and maintain it in sync with changes to the table just to have the data in the data lake.  There is the cost of having the RDBMS running in the landing zone, but is it worth it to keep the schema?  It would be interesting to see what the consensus is.  I like the idea of keeping the data that is schematized in a landing zone RDBMS (maintaining it with ETL, CDC, or transactional replication) and joining via Spark or ADF to file-based data sources when needed.  What do you think is a best practice?

Microsoft

Hello @Darwin Schweitzer ! Thank you for your answer and feedback. Getting to your question, I think that there are some decision points:

  • How big is the data? Does it fit in your RDBMS? How much does it cost to scale your RDBMS? How big can it go?
  • A data lake is storage only, with no processing costs to ingest data. How often will data be ingested? Can you save money by pausing the RDBMS? Can you save money by landing data into a data lake and using a small RDBMS to read it or to host only curated/aggregated data?
  • Are you willing to integrate (join, enrich, etc.) that CSV data with semi-structured or unstructured data? How big is that other data? What is the cheaper solution to mix these two data types? 
  • Are you willing to use that CSV data for something else, like ML? If yes, does your RDBMS support in-database ML?
  • Can you benefit from landing that data into a data lake to make it available to more than one query engine? Data in an RDBMS can only be used through that RDBMS's query engine. Data in a data lake can be queried at the same time by Hive, PolyBase, Spark, LLAP, Power BI, Jupyter Notebooks, etc.
  • An RDBMS enforces lots of controls like schema on write, transaction control, referential integrity, and locks. Do you need all of that? Or does the data come from a data source that did all of that before? Or does it come from a data source that doesn't control any of that, and you want to keep it as is to also analyze the data quality? 
  • Data lakes are auto-healing by design, at low cost. Do you need to protect the data? How much does it cost to do that in your RDBMS?

These are the decision points that I can see at the moment. What do you think? Tks!!

Copper Contributor
@Darwin Schweitzer I don't think that an RDBMS is a really viable solution together with Spark. The reason is that the join you talk about takes place in Spark, not in the database, which means you will have to load the full tables from the RDBMS into Spark, and most RDBMSs are not designed to deliver data like that. Also, the RDBMS is a costly service compared to the data lake.
 
 
 

All good points.  I think when it comes to cost it is not a simple equation.  For example, if it is a really active OLTP application, running extracts from it directly is probably off the table.  An active readable secondary, replicated copy, or Change Data Capture of the tables needed for analysis is probably needed in the architecture, even if it is just to get the data onto storage to support Delta Lake.  That expense is kind of a sunk cost.  Also, if an organization has DBA FTEs that maintain this, adding additional FTEs to replicate the process on the Spark side may be more costly than paying the RDBMS or CDC vendor.  I think when the data engineer goes and gets the data by query via JDBC and writes it to storage as required (per project or cross project) as part of a pipeline, that makes sense.  I spoke to some contacts at Databricks and they said, as Rodrigo mentions, it depends: for a dimension table (not super big), query the data source (Databricks data sources); for a large table, write it to storage.  Agreed that data on storage can be accessed by multiple query (compute) engines (Spark, the database, and the BI tools).  I believe in the polyglot persistence concept that Rodrigo has another blog post on, as does James Serra's blog: use the right technology for the right purpose.  Take a look at the diagram from my latest post.  Everything above storage in the diagram is coming from the OLTP side.  Data integration could be above storage and below storage as well, because it is using (at least in the case of ADF and Synapse, and I believe Informatica) Spark under the covers of the transformation.  In fact, Spark should be munged into the data integration layer.  Also, I just noticed that my Spark layer should extend into the AI and ML zone (will fix that).  Spark and data integration can transcend both storage and database, including NoSQL (Cosmos DB) and Data Explorer.  Storage is the lowest common denominator for integration.  For streaming and logs on the bottom of the diagram, storage is where Kafka, Event Hub, IoT Hub, and Spark land data.  That being said, if data is in a database you can leave it there and use it without writing it to storage, or, in the case of Azure Synapse, build an external table on the data in storage, join it to the database table, and let the SQL engine join the data.  Thanks again.  Darwin
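As an illustration of the pattern discussed in this thread, here is a minimal PySpark sketch of reading a dimension table from the RDBMS over JDBC and joining it with Parquet data already in the lake. The JDBC connection details, table name, column names, and lake paths are all hypothetical:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("rdbms-plus-lake").getOrCreate()

# Small dimension table read directly from the RDBMS over JDBC
# (requires the appropriate JDBC driver on the Spark classpath).
customers = (
    spark.read.format("jdbc")
        .option("url", "jdbc:sqlserver://server.database.windows.net:1433;database=sales")
        .option("dbtable", "dbo.DimCustomer")
        .option("user", "reader")
        .option("password", "<secret>")
        .load()
)

# Large fact data already landed in the lake as Parquet (hypothetical path).
orders = spark.read.parquet(
    "abfss://datalake@account.dfs.core.windows.net/sales/business-data/2020/02"
)

# Join in Spark and keep the enriched result in the lake.
enriched = orders.join(customers, on="customer_id", how="left")
enriched.write.mode("overwrite").parquet(
    "abfss://datalake@account.dfs.core.windows.net/sales/business-data-enriched/2020/02"
)
```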

Microsoft

Hello @Darwin Schweitzer ! Great comments and I agree with you, there is no simple answer... My approach is to always use the best tool for the job! 

 

I didn't know your blog! I will follow it starting today! That's a great diagram; I would say that Cosmos DB can be used for information delivery too, like aggregated data, product catalogs, etc., especially for globally distributed applications.

 

Tks!
