Database transactions are normally fast when connected to a local or low-latency SQL server. This analysis will focus on the performance of bulk inserts with varying levels of RTT latency.
As an example of how fast the DBI is able to perform, we'll start off with a very basic example using a SQLite in-memory database and insert 500 rows.
Code: SQLite 500 inserts
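The original listing is not reproduced here; a minimal sketch consistent with the text (the table name, column names, and row contents are assumptions):

```perl
#!/usr/bin/perl
use strict;
use warnings;
use DBI;

# In-memory SQLite database; no server round trips involved.
my $dbh = DBI->connect('dbi:SQLite:dbname=:memory:', '', '',
    { RaiseError => 1, AutoCommit => 1 });

$dbh->do('CREATE TABLE test (id INTEGER, val TEXT)');

# Insert 500 rows one at a time through a prepared statement.
my $sth = $dbh->prepare('INSERT INTO test (id, val) VALUES (?, ?)');
$sth->execute($_, "value $_") for 1 .. 500;

$dbh->disconnect;
```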
This code runs in 0.05 s of wall-time for me, so time is hardly a factor. The DBI is clearly able to perform well and the total time to insert the rows sounds very reasonable.
Let's try it on a remote database server on a lab network without adding any additional latency.
Code: PostgreSQL 500 inserts, network database
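A sketch of the same test against the lab PostgreSQL server (the DSN, credentials, and table layout are placeholders, not the actual lab configuration):

```perl
#!/usr/bin/perl
use strict;
use warnings;
use DBI;

# Connection details are placeholders for the lab server.
my $dbh = DBI->connect('dbi:Pg:dbname=testdb;host=db.lab', 'user', 'pass',
    { RaiseError => 1, AutoCommit => 1 });

$dbh->do('CREATE TABLE IF NOT EXISTS test (id INTEGER, val TEXT)');

# Each execute() is a full client/server round trip.
my $sth = $dbh->prepare('INSERT INTO test (id, val) VALUES (?, ?)');
$sth->execute($_, "value $_") for 1 .. 500;

$dbh->disconnect;
```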
This runs in about 0.40 s of wall-time, which is still fairly negligible.
Let's see what happens if we have the lab router between the client and server inject 40 ms of RTT latency (20 ms each direction), as might be typical for a decent WAN connection to a remote facility.
We'll use the same code from above: PostgreSQL 500 inserts, network database
With the added latency, this now takes 21.55 s of wall-time, about 54 times as long. So what's going on to make some seemingly acceptable latency multiply the run time 54-fold?
The reason latency is such a big deal here is because the DBI is inserting each row individually and waiting for the confirmation from the server before inserting the next row.
Surely there's a way to make it faster, right? Good news: there is. Bad news: the DBI won't do it for you. The DBI does allow drivers to insert records in bulk via a specific method call, but neither the MySQL nor the PostgreSQL driver takes advantage of this.
The docs for the DBI indicate that execute_for_fetch() (which is used by execute_array()) can potentially be implemented by drivers to collect statements in bulk for more efficient execution. This is exactly what's needed here, but it's evidently not used by the two drivers I'm testing with.
Consider this code, which creates that 500-row in-memory database, starts a SELECT statement on it, and calls this execute_for_fetch() routine. We'll use the Time::HiRes module to time just the INSERT portion of the process:
Code: PostgreSQL execute_for_fetch
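A sketch of what such a script might look like, feeding the SQLite SELECT into the PostgreSQL INSERT via execute_for_fetch() (DSNs, credentials, and the table layout are assumptions):

```perl
#!/usr/bin/perl
use strict;
use warnings;
use DBI;
use Time::HiRes qw(gettimeofday tv_interval);

# Source: the 500-row in-memory SQLite database built earlier.
my $sqlite = DBI->connect('dbi:SQLite:dbname=:memory:', '', '',
    { RaiseError => 1 });
$sqlite->do('CREATE TABLE test (id INTEGER, val TEXT)');
my $ins = $sqlite->prepare('INSERT INTO test (id, val) VALUES (?, ?)');
$ins->execute($_, "value $_") for 1 .. 500;

my $sel = $sqlite->prepare('SELECT id, val FROM test');
$sel->execute;

# Destination: the remote PostgreSQL server (placeholder DSN).
my $pg = DBI->connect('dbi:Pg:dbname=testdb;host=db.lab', 'user', 'pass',
    { RaiseError => 1 });
my $sth = $pg->prepare('INSERT INTO test (id, val) VALUES (?, ?)');

# Time only the INSERT portion, feeding rows from the SELECT.
my $t0 = [gettimeofday];
$sth->execute_for_fetch(sub { $sel->fetchrow_arrayref });
printf "INSERT took %.2f s\n", tv_interval($t0);

$pg->disconnect;
$sqlite->disconnect;
```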
This takes 21.41 s to execute. This is no better than the direct INSERT method, so clearly the PostgreSQL driver is not aggregating the statements as the DBI suggests drivers have the "option" of doing. MySQL does no better here.
With the Perl DBI, at least using the MySQL and PostgreSQL drivers, SELECT statements do not incur such a large performance hit because the records are all returned when the statement is executed.
Consider the following code that takes each of the 500 rows inserted earlier and dumps them to a local file:
Code: PostgreSQL select 500 rows, network database
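A sketch of such a script (the output filename and table layout are assumptions):

```perl
#!/usr/bin/perl
use strict;
use warnings;
use DBI;

# Placeholder DSN and credentials for the lab server.
my $dbh = DBI->connect('dbi:Pg:dbname=testdb;host=db.lab', 'user', 'pass',
    { RaiseError => 1 });

# All rows come back with the execute(); fetching is then local.
my $sth = $dbh->prepare('SELECT id, val FROM test');
$sth->execute;

open my $fh, '>', 'rows.txt' or die "open rows.txt: $!";
while (my @row = $sth->fetchrow_array) {
    print {$fh} join("\t", @row), "\n";
}
close $fh;

$dbh->disconnect;
```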
This is only about a 600% increase in run time, mostly due to the extra round trips needed to connect, prepare, and execute the SELECT statement.
Since the DBI (or more specifically the DBD modules in use here) can't perform efficiently enough, we'll cook up an INSERT statement that does the job properly without waiting on 500 sequential RTT delays. This requires preparing a statement where all 500 rows can be inserted at once, which should reduce the latency hit on the execute() call to a single round trip.
Code: PostgreSQL execute bulk insert
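A sketch of the multi-row INSERT approach; the @bulk_values array referenced below holds every bound value, and a matching list of (?, ?) placeholder groups is joined into a single statement (DSN, credentials, and table layout are again placeholders):

```perl
#!/usr/bin/perl
use strict;
use warnings;
use DBI;
use Time::HiRes qw(gettimeofday tv_interval);

my $dbh = DBI->connect('dbi:Pg:dbname=testdb;host=db.lab', 'user', 'pass',
    { RaiseError => 1 });

# Collect every value up front; this is the memory trade-off
# discussed below.
my (@placeholders, @bulk_values);
for my $i (1 .. 500) {
    push @placeholders, '(?, ?)';
    push @bulk_values, $i, "value $i";
}

my $sql = 'INSERT INTO test (id, val) VALUES '
        . join(', ', @placeholders);
my $sth = $dbh->prepare($sql);

# A single execute() means a single round trip for all 500 rows.
my $t0 = [gettimeofday];
$sth->execute(@bulk_values);
printf "INSERT took %.2f s\n", tv_interval($t0);

$dbh->disconnect;
```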
Now the same INSERT workload that took 21.41 s earlier only takes 0.20 s over the same 40 ms latency link. The full wall-time is 0.40 s, which includes the database connection, statement preparation, and disconnection.
There is a bit of a trade-off to the method used here, and this trade-off is likely why DBD driver modules don't implement this feature by default.
In order to prepare the @bulk_values array used in the final method, Perl must allocate memory for the entire set of values to be inserted. This isn't so bad here because we only have 500 rows containing short strings of text. It would be a much bigger deal if we were dealing with records holding many megabytes of data, such as blobs of documents, media files, or similar "large data" objects.
The documentation for execute_for_fetch() specifically notes that the subroutine is allowed to return the same array reference, and typically does. This means the results of each row must be cached somewhere; in the above example, the caching happens in memory. This would not be appropriate where large amounts of memory would be required to store it all.
The method here works well for specific use-cases, but clearly isn't suited as a default for DBD drivers without careful understanding of where the delays occur. It would be foolish to assume network latency is always going to be worse than Perl's memory management for all workloads.
The method used here can potentially be extended upon and made more generic. This would enable users to make logical choices about how to do bulk inserts taking into account the trade-offs listed above.
Some possibilities for such a generic solution might allow the user to control this balance by:
- capping how many rows go into each INSERT call, ensuring that it won't grow boundlessly

This document is available under a Creative Commons Attribution 4.0 International license (CC BY 4.0). Licensing details
Code examples are available under a 3-clause BSD license. Licensing details