Just a quickie today. I’m working on a batch CSV-to-INSERT script to process monthly membership dumps that come from another organization. We’re not talking billions of records, but I wanted it to be performant. Based on this post about grouped batch inserts, I came up with a bit of clever code for grouping multiple inserts using the syntax:
INSERT INTO foo (column1, column2)
VALUES (val1, val2), (val3, val4), (val5, val6);
This wraps more than one row into a single statement and speeds things up. I’ve seen claims of about 3x performance on Postgres and as much as 10x on MySQL. The good news is that wrapping inserts into this format is simple. Here’s some code to get you started for looping over an array of arrays (converted by Ben Nadel’s very quick CSVToArray):
<!--- how many rows to commit at a time --->
<cfset incr = 1000 />
<cftransaction>
	<cftry>
		<!--- ii = 1, 1001, 2001, 3001, 4001, ... --->
		<cfloop from="1" to="#len#" step="#incr#" index="ii">
			<cfquery name="insert" datasource="#dsn#">
				INSERT INTO someTable (field1
					,field2
					,field3)
				VALUES
				<!--- jj = 1, 2, 3, ... 1000; then 1001, 1002, ... 2000; ... --->
				<cfloop from="#ii#" to="#min(len, ii + incr - 1)#" index="jj">
					<!--- prevent a trailing-comma error --->
					<cfif ii NEQ jj>, </cfif>
					(<cfqueryparam value="#arrData[jj][1]#" cfsqltype="cf_sql_char" />
					,<cfqueryparam value="#arrData[jj][2]#" cfsqltype="cf_sql_varchar" />
					,<cfqueryparam value="#arrData[jj][3]#" cfsqltype="cf_sql_date" />)
				</cfloop>
			</cfquery>
		</cfloop>
		<cfcatch type="any">
			<cftransaction action="rollback" />
			<cfrethrow />
		</cfcatch>
	</cftry>
</cftransaction>
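For readers outside ColdFusion, here is the same chunked multi-row pattern as a minimal sketch in Python, using the standard-library sqlite3 module. The table and column names mirror the snippet above but are otherwise illustrative, and the chunk size is kept small because SQLite caps the number of bound variables per statement (999 in older builds):

```python
import sqlite3

def batch_insert(conn, rows, incr=300):
    """Insert rows in chunks using one multi-row VALUES statement per chunk.

    incr=300 keeps us at 900 bound variables per statement (3 per row),
    under SQLite's historical 999-variable limit.
    """
    cur = conn.cursor()
    for i in range(0, len(rows), incr):
        chunk = rows[i:i + incr]
        placeholders = ", ".join(["(?, ?, ?)"] * len(chunk))
        sql = f"INSERT INTO someTable (field1, field2, field3) VALUES {placeholders}"
        flat = [v for row in chunk for v in row]  # flatten rows into one parameter list
        cur.execute(sql, flat)
    conn.commit()  # all chunks commit together, as in the transaction above

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE someTable (field1 TEXT, field2 TEXT, field3 TEXT)")
rows = [(f"a{i}", f"b{i}", f"c{i}") for i in range(4720)]
batch_insert(conn, rows)
print(conn.execute("SELECT COUNT(*) FROM someTable").fetchone()[0])
```

As in the CFML version, parameters stay bound rather than interpolated, so the batching doesn’t cost you SQL-injection safety.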
This groups the INSERTs into bunches of 1,000. In my code I wanted the records to either all commit or all fail, so I wrapped everything in a transaction; you could certainly remove that and the corresponding try/catch for your own needs.
Grouping the inserts took about 90ms per 100 records over a total of 4,720 test records, while the one-insert-per-query approach took around 270ms per 100 records. I suspect that, run directly against the database rather than through ColdFusion and JDBC, the absolute gain would be smaller, but the relative result is roughly a 3x improvement in this small test case on my development laptop. The database is PostgreSQL 8.3, but this technique also works with MySQL.
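If you want to reproduce the relative comparison yourself (not my 90ms/270ms numbers, which came from ColdFusion against Postgres), here is a rough micro-benchmark sketch in Python against an in-memory SQLite database. Table names are illustrative and absolute timings will vary widely by machine, database, and driver:

```python
import sqlite3
import time

rows = [(f"a{i}", f"b{i}", f"c{i}") for i in range(4720)]

def setup():
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE t (f1 TEXT, f2 TEXT, f3 TEXT)")
    return conn

# Approach 1: one INSERT statement per row.
conn1 = setup()
t0 = time.perf_counter()
for row in rows:
    conn1.execute("INSERT INTO t (f1, f2, f3) VALUES (?, ?, ?)", row)
conn1.commit()
per_row = time.perf_counter() - t0

# Approach 2: multi-row VALUES in batches of 300
# (kept under SQLite's bound-variable cap).
conn2 = setup()
t0 = time.perf_counter()
for i in range(0, len(rows), 300):
    chunk = rows[i:i + 300]
    sql = "INSERT INTO t (f1, f2, f3) VALUES " + ", ".join(["(?, ?, ?)"] * len(chunk))
    conn2.execute(sql, [v for r in chunk for v in r])
conn2.commit()
batched = time.perf_counter() - t0

print(f"per-row: {per_row:.4f}s, batched: {batched:.4f}s")
```

The grouped form is typically faster because it amortizes per-statement parsing and round-trip overhead; over a network connection to a real database server, the gap tends to widen further.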