Efficiently Generating and Loading 1 Billion Rows into a Relational Database Content in just an hour

AssafFraenkel

Former Employee

Mar 30, 2023

The challenge:

To generate and load 1 billion rows of data into a relational database on Azure in a significantly shorter time frame than the previous 72 hours.

The problem:

The initial parallel threads were competing for IO and CPU resources, causing internal locking and slowing down the process even further.

Previous solution (scaling issue):

The application generated batches of 1000 rows.
To load 1B rows – the application should insert 1M batches!
Parallel threads generated locking and even deadlocks.

The solution:

Generate 30 million rows in each of 36 files with different first letter/digit in parallel with small modifications of the original application that generate the rows according to complex business logic.
Define 36 identical tables with a constraint on the custom GUID key according to the first letter.
Load the files in parallel to the relevant table according to the first letter and constraint and increase database resources if necessary.
Define one partitioned table according to the first letter with the same key.
Switch in the 36 tables (meta-data operation) into the partitioned table.

Result:

The solution enabled the generation and loading of 1 billion rows of data into a relational database within an hour, significantly improving the previous time frame of 72 hours.

In conclusion, the solution involves parallelizing the data generation process, partitioning the data by the first letter or digit of a custom GUID key, and optimizing the loading process by using parallel threads and increasing database resources if necessary. This approach can significantly reduce the time it takes to generate and load a large volume of data into a relational database and can be applied to all relational databases that support partitioning.

Updated Mar 30, 2023

Version 1.0

data & ai

AssafFraenkel

Former Employee

Joined February 01, 2022

View Profile

FastTrack for Azure

Follow this blog board to get notified when there's new activity

Blog Post

Efficiently Generating and Loading 1 Billion Rows into a Relational Database Content in just an hour