Today, I worked on a very interesting case where our customer wanted to insert millions of rows using Python. We reviewed two alternatives to import the data as fast as possible: using the BCP command line and using the executemany method. Below I would like to share my lessons learned.
Background:
- We have 5 CSV files with 111,100,000 rows in total and 22 columns (20 varchar(6) columns and 2 int columns).
- The Azure SQL Database is a Standard S9.
- Virtual Machine:
- All files are placed on the temporary drive (D:).
- Accelerated networking is disabled.
- 4 vCores and 16 GB of RAM.
- The goal is to import the data as fast as possible.
Lessons Learned:
- I developed two different Python scripts; both read every CSV file but use two different methods to import the data:
- 1) Using the executemany method:
- As I saw that a single thread would take too much time, I configured the script to run 100 threads at the same time.
- The main thread reads the CSV file and, every time it reaches 10,000 rows, it sends this batch of data to a worker thread that executes executemany in the background. The main thread manages the status of the 100 worker threads (a minimal sketch of this pattern follows after the results below).
- Each worker thread creates a temporary table in TEMPDB.
- Once it finishes loading the batch, it runs an INSERT INTO the main table.
- This main table is a heap without any non-clustered indexes.
- The time spent to insert all the data, measured at the database level:
- 40 minutes of execution.
- DB CPU usage: 70%-80%
- Log IO usage: 60%-70%
- Around 3.4 million rows inserted per minute.
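For illustration, here is a minimal sketch of this first approach, assuming pyodbc, a placeholder connection string, hypothetical column names c1..c22, and a main table called main_table; the real script additionally tracks per-thread status, bounds the number of queued batches, and handles errors.

```python
import csv
import pyodbc
from concurrent.futures import ThreadPoolExecutor

# Assumed connection string and schema; replace with your own values.
CONN_STR = ("Driver={ODBC Driver 17 for SQL Server};"
            "Server=tcp:yourserver.database.windows.net;"
            "Database=yourdb;Uid=user;Pwd=password")
BATCH_SIZE = 10000
NUM_WORKERS = 100
# 20 varchar(6) columns plus 2 int columns, per the schema described above.
COLUMNS = ", ".join(f"c{i}" for i in range(1, 23))
PLACEHOLDERS = ", ".join("?" for _ in range(22))


def load_batch(rows):
    """Insert one batch into a session-scoped temp table, then into the main table."""
    conn = pyodbc.connect(CONN_STR, autocommit=False)
    try:
        cur = conn.cursor()
        # The temp table lives in TEMPDB and disappears when this session closes.
        create_cols = ", ".join(f"c{i} varchar(6)" for i in range(1, 21)) + ", c21 int, c22 int"
        cur.execute(f"CREATE TABLE #staging ({create_cols})")
        # CSV values arrive as strings; SQL Server implicitly converts the two int columns.
        cur.executemany(f"INSERT INTO #staging ({COLUMNS}) VALUES ({PLACEHOLDERS})", rows)
        # Move the batch into the main table (a heap with no non-clustered indexes).
        cur.execute(f"INSERT INTO main_table ({COLUMNS}) SELECT {COLUMNS} FROM #staging")
        conn.commit()
    finally:
        conn.close()


def read_batches(path):
    """Yield lists of BATCH_SIZE rows from one CSV file."""
    with open(path, newline="") as f:
        batch = []
        for row in csv.reader(f):
            batch.append(row)
            if len(batch) == BATCH_SIZE:
                yield batch
                batch = []
        if batch:
            yield batch


if __name__ == "__main__":
    files = [f"D:\\data\\file{i}.csv" for i in range(1, 6)]  # assumed file names
    with ThreadPoolExecutor(max_workers=NUM_WORKERS) as pool:
        futures = [pool.submit(load_batch, batch)
                   for path in files
                   for batch in read_batches(path)]
        for fut in futures:
            fut.result()  # surface any worker errors
```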
- 2) Calling BCP.exe
- The main thread reads every CSV and calls bcp.exe with a batch size of 10,000 (see the sketch after the results below).
- The time spent to insert all the data, measured at the database level:
- 20 minutes.
- DB CPU usage: 5%-10%
- Log IO usage: 80%-90%
- Around 7 million rows inserted per minute.
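For reference, a minimal sketch of the second approach, calling bcp.exe from the main thread with subprocess; the server, database, table, credentials, and field terminator are placeholder assumptions, and the -b switch sets the 10,000-row batch size.

```python
import subprocess

# Assumed values for illustration; adjust server, database, table and credentials.
SERVER = "tcp:yourserver.database.windows.net"
DATABASE = "yourdb"
TABLE = "dbo.main_table"
FILES = [f"D:\\data\\file{i}.csv" for i in range(1, 6)]

for path in FILES:
    # -c: character mode, -t: field terminator, -b: batch size of 10,000 rows.
    cmd = [
        "bcp.exe", TABLE, "in", path,
        "-S", SERVER, "-d", DATABASE,
        "-U", "user", "-P", "password",
        "-c", "-t", ",", "-b", "10000",
    ]
    result = subprocess.run(cmd, capture_output=True, text=True)
    print(result.stdout)
    result.check_returncode()  # raise if bcp reported an error
```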
It is clear that BCP is faster, even though executemany uses a prepared TSQL command to improve performance. I will check whether Python has the same option that we have in .NET with SqlBulkCopy, where the performance is much better.
Enjoy!