Lesson Learned #169: Bulk Insert using Python in Azure SQL

Published 04-17-2021 08:15 AM

Today, I worked on a very interesting case where our customer needed to insert millions of rows using Python. We reviewed two alternatives to import the data as quickly as possible: using the BCP command line tool and using the executemany command. Below I would like to share the lessons learned.

 

Background:

 

  • We have 5 CSV files with 111,100,000 rows and 22 columns (20 varchar(6) and 2 int data type columns).
  • The Azure SQL Database is Standard 9 
  • Virtual Machine:
    • All files are placed on the temporary drive (D:).
    • Accelerated networking is disabled.
    • 4 vCores and 16 GB of RAM.
  • The goal is to import the data as quickly as possible.

 

Lessons Learned:

 

  • I developed two different Python scripts (available at https://github.com/JMNetwalker/InsertDataFaster/tree/main/python); both read every CSV file but use two different methods to import the data:
    • 1) Using the executemany method (see the first sketch after this list):
      • Since a single thread would take too much time, I configured the script to run 100 threads at the same time.
      • The main thread reads the CSV file and, every 10,000 rows, sends that array of data to a worker thread that executes executemany in the background. The script tracks the status of the 100 threads.
        • Each worker thread creates a temporary table in TEMPDB.
        • Once it finishes, it runs an INSERT INTO against the main table.
        • The main table is a heap without any non-clustered indexes.
      • The time spent to insert all the data, measured at the database level:
        • 40 minutes of execution.
        • CPU DB usage: 70%-80%.
        • LOG IO usage: 60%-70%.
        • About 3.4 million rows inserted per minute.
    • 2) Calling BCP.exe (see the second sketch after this list):
      • The main thread reads every CSV file. The batch size is 10,000.
      • The time spent to insert all the data, measured at the database level:
        • 20 minutes of execution.
        • CPU DB usage: 5%-10%.
        • LOG IO usage: 80%-90%.
        • About 7 million rows inserted per minute.
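
A minimal sketch of the first approach, assuming pyodbc and a hypothetical target table dbo.MainTable; the connection string, column count and file paths are placeholders, and the actual scripts are linked above:

```python
# Sketch: read each CSV in the main thread and hand off batches of 10,000 rows
# to worker threads that stage the data in tempdb and then move it to the main table.
import csv
import pyodbc
from concurrent.futures import ThreadPoolExecutor

CONN_STR = ("Driver={ODBC Driver 17 for SQL Server};"
            "Server=tcp:<server>.database.windows.net,1433;"
            "Database=<database>;Uid=<user>;Pwd=<password>;")
BATCH_SIZE = 10000                                   # rows per executemany call
MAX_WORKERS = 100                                    # concurrent insert threads
INSERT_SQL = "INSERT INTO #staging VALUES ({})".format(", ".join("?" * 22))  # one ? per column

def insert_batch(rows):
    """Stage one batch in a session-scoped temp table, then copy it into the main table."""
    cnxn = pyodbc.connect(CONN_STR)
    cursor = cnxn.cursor()
    # Create an empty copy of the target table's structure in tempdb.
    cursor.execute("SELECT * INTO #staging FROM dbo.MainTable WHERE 1 = 0")
    cursor.executemany(INSERT_SQL, rows)
    cursor.execute("INSERT INTO dbo.MainTable SELECT * FROM #staging")
    cnxn.commit()
    cnxn.close()

def load_file(path):
    # Error handling is omitted; the pool waits for all pending batches on exit.
    with open(path, newline="") as f, ThreadPoolExecutor(MAX_WORKERS) as pool:
        batch = []
        for row in csv.reader(f):
            batch.append(row)
            if len(batch) == BATCH_SIZE:
                pool.submit(insert_batch, batch)
                batch = []
        if batch:
            pool.submit(insert_batch, batch)

if __name__ == "__main__":
    for n in range(1, 6):                            # the five CSV files
        load_file(rf"D:\data\file{n}.csv")
```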

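A minimal sketch of the second approach, calling BCP from Python; it assumes bcp.exe is installed on the virtual machine and uses the same placeholder server, credentials, table and file names:

```python
# Sketch: invoke bcp.exe once per CSV file with a 10,000-row batch size.
import subprocess

FILES = [rf"D:\data\file{n}.csv" for n in range(1, 6)]

for path in FILES:
    subprocess.run(
        ["bcp", "dbo.MainTable", "in", path,
         "-S", "tcp:<server>.database.windows.net",
         "-d", "<database>",
         "-U", "<user>", "-P", "<password>",
         "-c",                    # character-mode data file
         "-t", ",",               # comma field terminator
         "-b", "10000"],          # commit every 10,000 rows
        check=True)
```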
 

It is clear that BCP performs better, even though executemany uses a prepared T-SQL command to improve performance. I will check whether Python offers the same option that we have in .NET with SqlBulkCopy (https://docs.microsoft.com/en-us/dotnet/api/system.data.sqlclient.sqlbulkcopy), where the performance is much better.
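
One avenue worth testing on the Python side (not measured in this case) is pyodbc's fast_executemany flag, which sends the whole parameter array to the server in a single bulk operation instead of row by row; a sketch with a placeholder three-column table:

```python
import pyodbc

CONN_STR = ("Driver={ODBC Driver 17 for SQL Server};"
            "Server=tcp:<server>.database.windows.net,1433;"
            "Database=<database>;Uid=<user>;Pwd=<password>;")

rows = [("a", "b", 1), ("c", "d", 2)]     # placeholder batch; real rows come from the CSV reader

cnxn = pyodbc.connect(CONN_STR)
cursor = cnxn.cursor()
cursor.fast_executemany = True            # pack the parameters into a bulk array on the client
cursor.executemany("INSERT INTO dbo.MainTable VALUES (?, ?, ?)", rows)
cnxn.commit()
cnxn.close()
```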

 

Enjoy!
