Today, I worked on a very interesting case where our customer wanted to insert millions of rows using Python. We reviewed two alternatives to import the data as fast as possible: using the BCP command line and using the executemany method. Below I would like to share my lessons learned.
Background:
- We have 5 CSV files with 111,100,000 rows in total and 22 columns (20 varchar(6) columns and 2 int columns).
- The Azure SQL Database is a Standard S9.
- Virtual Machine:
- All files are placed on the temporary drive (D:).
- Accelerated networking is disabled.
- 4 vCores and 16 GB of RAM.
- The goal is to import the data as fast as possible.
Lessons Learned:
- I developed two different Python scripts; both read every CSV file but use two different methods to import the data:
- 1) Using the executemany method:
- As I saw that a single thread would take too much time, I configured the script to run 100 threads at the same time.
- The main thread reads the CSV file and, every time it reaches 10,000 rows, it sends this batch of data to a worker thread that executes executemany in the background. The main thread manages the status of the 100 worker threads (a minimal sketch of this pattern follows after the results below).
- Each worker thread creates a temporary table in TEMPDB.
- Once it finishes loading the batch, it runs an INSERT INTO the main table.
- This main table is a heap without any non-clustered indexes.
- The time spent to insert all the data, measured at the database level:
- 40 minutes of execution.
- DB CPU usage: 70%-80%
- Log IO usage: 60%-70%
- Around 3.4 million rows inserted per minute.
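For illustration, here is a minimal sketch of this first approach, assuming pyodbc, a placeholder connection string, hypothetical column names c1..c22, and a main table called main_table; the real script additionally tracks per-thread status, bounds the number of queued batches, and handles errors.

```python
import csv
import pyodbc
from concurrent.futures import ThreadPoolExecutor

# Assumed connection string and schema; replace with your own values.
CONN_STR = ("Driver={ODBC Driver 17 for SQL Server};"
            "Server=tcp:yourserver.database.windows.net;"
            "Database=yourdb;Uid=user;Pwd=password")
BATCH_SIZE = 10000
NUM_WORKERS = 100
# 20 varchar(6) columns plus 2 int columns, per the schema described above.
COLUMNS = ", ".join(f"c{i}" for i in range(1, 23))
PLACEHOLDERS = ", ".join("?" for _ in range(22))


def load_batch(rows):
    """Insert one batch into a session-scoped temp table, then into the main table."""
    conn = pyodbc.connect(CONN_STR, autocommit=False)
    try:
        cur = conn.cursor()
        # The temp table lives in TEMPDB and disappears when this session closes.
        create_cols = ", ".join(f"c{i} varchar(6)" for i in range(1, 21)) + ", c21 int, c22 int"
        cur.execute(f"CREATE TABLE #staging ({create_cols})")
        # CSV values arrive as strings; SQL Server implicitly converts the two int columns.
        cur.executemany(f"INSERT INTO #staging ({COLUMNS}) VALUES ({PLACEHOLDERS})", rows)
        # Move the batch into the main table (a heap with no non-clustered indexes).
        cur.execute(f"INSERT INTO main_table ({COLUMNS}) SELECT {COLUMNS} FROM #staging")
        conn.commit()
    finally:
        conn.close()


def read_batches(path):
    """Yield lists of BATCH_SIZE rows from one CSV file."""
    with open(path, newline="") as f:
        batch = []
        for row in csv.reader(f):
            batch.append(row)
            if len(batch) == BATCH_SIZE:
                yield batch
                batch = []
        if batch:
            yield batch


if __name__ == "__main__":
    files = [f"D:\\data\\file{i}.csv" for i in range(1, 6)]  # assumed file names
    with ThreadPoolExecutor(max_workers=NUM_WORKERS) as pool:
        futures = [pool.submit(load_batch, batch)
                   for path in files
                   for batch in read_batches(path)]
        for fut in futures:
            fut.result()  # surface any worker errors
```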
- 2) Calling BCP.exe
- The main thread reads every CSV and calls bcp.exe with a batch size of 10,000 (see the sketch after the results below).
- The time spent to insert all the data, measured at the database level:
- 20 minutes.
- DB CPU usage: 5%-10%
- Log IO usage: 80%-90%
- Around 7 million rows inserted per minute.
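For reference, a minimal sketch of the second approach, calling bcp.exe from the main thread with subprocess; the server, database, table, credentials, and field terminator are placeholder assumptions, and the -b switch sets the 10,000-row batch size.

```python
import subprocess

# Assumed values for illustration; adjust server, database, table and credentials.
SERVER = "tcp:yourserver.database.windows.net"
DATABASE = "yourdb"
TABLE = "dbo.main_table"
FILES = [f"D:\\data\\file{i}.csv" for i in range(1, 6)]

for path in FILES:
    # -c: character mode, -t: field terminator, -b: batch size of 10,000 rows.
    cmd = [
        "bcp.exe", TABLE, "in", path,
        "-S", SERVER, "-d", DATABASE,
        "-U", "user", "-P", "password",
        "-c", "-t", ",", "-b", "10000",
    ]
    result = subprocess.run(cmd, capture_output=True, text=True)
    print(result.stdout)
    result.check_returncode()  # raise if bcp reported an error
```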
It is clear that BCP is faster, even though executemany uses a prepared TSQL command to improve performance. I will check whether Python has the same option that we have in .NET with SqlBulkCopy, where the performance is much better.
Enjoy!