Posted on 20 December, 2022
At LogicVein, our products handle a lot of data. In some cases, data arrives periodically, for example monitor data in ThirdEye. Sometimes it arrives in bulk and we need to persist it as quickly as possible. There are various ways to insert data into a database, and depending on the amount of data, the schema, and the communication between the application and the database server, inserting can be quite slow. But there are some simple, general things we can do to make it fast.
We will use a table with four columns for this test: one is the primary key, one is an integer, and two are varchar (a sketch of the table definition follows the setup code below). We will do 100,000 inserts and compare the results of each method. Here is the code to set up the database connection we will use for testing:
import java.sql.Connection
import java.sql.DriverManager
import java.sql.PreparedStatement

fun createConnection(): Connection {
    val url = "jdbc:postgresql://localhost/app?user=app&password=pass"
    return DriverManager.getConnection(url)
}

fun sqlExecute(autoCommit: Boolean = false, block: (conn: Connection) -> Unit) {
    createConnection().use { conn ->
        conn.autoCommit = autoCommit
        block(conn)
        if (!autoCommit) conn.commit()
    }
}

fun measureTime(msg: String, block: () -> Unit) {
    val start = System.currentTimeMillis()
    block()
    val end = System.currentTimeMillis()
    println("Execution time for $msg is ${end - start} ms")
}

fun main() {
    println("Testing insert into db")
    measureTime("testInsertAutoCommit") { testInsertAutoCommit(100000) }
    measureTime("testInsert") { testInsert(100000) }
    measureTime("testBatchInsert") { testBatchInsert(100000) }
    measureTime("testMultiValuesInsert") { testMultiValuesInsert(100000) }
}
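The table definition itself is not shown above. A minimal sketch of a schema matching the earlier description might look like this; the primary key column name and the varchar lengths are assumptions:

CREATE TABLE data1 (
    id   SERIAL PRIMARY KEY, -- assumed column name; the insert code never references it
    col1 INTEGER,
    col2 VARCHAR(255),       -- assumed length
    col3 VARCHAR(255)        -- assumed length
);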
If you don’t set anything, auto-commit is on by default in JDBC: each SQL statement runs in its own transaction. Here is the code for this test and its result:
fun testInsertAutoCommit(count: Int) {
    sqlExecute(true) { conn ->
        val stmt = conn.prepareStatement("INSERT INTO data1 (col1, col2, col3) VALUES (?, ?, ?)")
        for (i in 1..count) {
            stmt.setInt(1, 1)
            stmt.setString(2, "hello")
            stmt.setString(3, "world")
            stmt.executeUpdate()
        }
    }
}

Execution time for testInsertAutoCommit is 23701 ms
It’s quite slow: almost 24 seconds for 100,000 rows. We can improve this with almost no effort by turning off auto-commit.

Let’s run the same test again, but this time we turn off auto-commit by calling sqlExecute with the default value of its autoCommit parameter.
fun testInsert(count: Int) {
    sqlExecute { conn ->
        val stmt = conn.prepareStatement("INSERT INTO data1 (col1, col2, col3) VALUES (?, ?, ?)")
        for (i in 1..count) {
            stmt.setInt(1, 1)
            stmt.setString(2, "hello")
            stmt.setString(3, "world")
            stmt.executeUpdate()
        }
    }
}

Execution time for testInsert is 9091 ms
It’s about 61% faster with minimal effort. There’s still 9 seconds that we could do something about.
Instead of executing each insert statement individually, JDBC allows us to group statements together and execute them as a single batch.
fun testBatchInsert(count: Int) {
    sqlExecute { conn ->
        val stmt = conn.prepareStatement("INSERT INTO data1 (col1, col2, col3) VALUES (?, ?, ?)")
        for (i in 1..count) {
            stmt.setInt(1, 1)
            stmt.setString(2, "hello")
            stmt.setString(3, "world")
            stmt.addBatch()
        }
        stmt.executeBatch()
    }
}

Execution time for testBatchInsert is 1410 ms
We just replaced executeUpdate() with addBatch() and made a call to executeBatch() at the end. The result is 84% faster than the previous attempt and 94% faster than the original one.
All the previous attempts have been on the application side; there are also things we can do on the PostgreSQL side to improve insert speed, such as temporarily dropping indexes and foreign keys and re-creating them after the insert. Before going that far, however, we can try a different insert statement. PostgreSQL supports inserting multiple rows at once by specifying multiple tuples in the VALUES clause of an INSERT statement. For example:
INSERT INTO data1 (col1, col2, col3) VALUES (1, 's1', 's2'), (2, 's3', 's4');
The PostgreSQL JDBC driver has a limit of 32,767 bind parameters per statement, so we have to do the insert in chunks. We use chunks of 10,000 rows here: with 3 parameters per row that is 30,000 parameters, just under the limit; the right chunk size depends on the number of columns. We have to generate the SQL statement dynamically because the number of parameters can vary, but we can reuse the prepared statement as long as the number of parameters has not changed. We can do this with the Kotlin code below.
fun testMultiValuesInsert(count: Int) {
    sqlExecute { conn ->
        var sb: StringBuilder? = null
        var prevSize = -1
        var pstmt: PreparedStatement? = null
        (1..count).chunked(10000)
            .forEach { chunk ->
                // Rebuild the prepared statement only when the chunk size changes
                // (e.g. for a smaller final chunk); flush any pending batch first.
                if (sb == null || prevSize != chunk.size) {
                    sb = StringBuilder("INSERT INTO data1 (col1, col2, col3) VALUES ").apply {
                        append("(?,?,?),".repeat(chunk.size))
                        deleteCharAt(length - 1) // drop the trailing comma
                    }
                    prevSize = chunk.size
                    pstmt?.executeBatch()
                    pstmt = conn.prepareStatement(sb.toString())
                }
                pstmt?.let { pstmt ->
                    var idx = 1
                    chunk.forEach {
                        pstmt.setInt(idx++, 1)
                        pstmt.setString(idx++, "hello")
                        pstmt.setString(idx++, "world")
                    }
                    pstmt.addBatch()
                }
            }
        pstmt?.executeBatch() // flush the final batch
    }
}

Execution time for testMultiValuesInsert is 919 ms
It’s about 35% faster than the batch insert, but the code is more complicated than the other methods. Still, in some situations the extra speed is worth the complexity.
There is another way to do bulk inserts in PostgreSQL that is not shown here: COPY (https://www.postgresql.org/docs/current/sql-copy.html), which copies data from an input stream into a table. It can be faster than all the methods presented here, but the downside is that we have to escape the data to match the format expected by the COPY statement, as opposed to relying on parameter binding. We do use COPY in our product, ThirdEye, where the performance is critical and outweighs the complexity required in the code.
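For illustration only (this is not the ThirdEye code), here is a minimal sketch of what COPY can look like through the CopyManager API of the PostgreSQL JDBC driver; the testCopyInsert function and the CSV formatting choices are assumptions made for this example:

import java.io.StringReader
import org.postgresql.PGConnection

fun testCopyInsert(count: Int) {
    sqlExecute { conn ->
        // Build the rows as CSV. Real data containing commas, quotes, or
        // newlines would need to be escaped according to the CSV rules of COPY.
        val data = buildString {
            repeat(count) { append("1,hello,world\n") }
        }
        // Unwrap the driver-specific connection to get at the COPY API.
        val copyManager = conn.unwrap(PGConnection::class.java).copyAPI
        copyManager.copyIn(
            "COPY data1 (col1, col2, col3) FROM STDIN WITH (FORMAT csv)",
            StringReader(data)
        )
    }
}

For very large datasets you would stream the rows rather than building them all in memory first; copyIn also accepts an InputStream.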
With all that said, these are not the only things we can do to improve performance. You will have to do your own benchmarks, as there are many variables involved: each schema is different and has its own requirements.
Get hands-on experience with ThirdEye for 30 days, free of cost, and assess it by using our evaluation license.