Efficient Postgres Bulk Inserts Take 2
Mon 24 December 2018
In a previous post I outlined a technique for achieving highly efficient bulk inserts from Python into a Postgres database.
The heart of this technique relies on passing multiple rows to Postgres as a single parameter, and using the unnest function to convert that parameter from an array into a set of rows:
INSERT INTO upload_time_test (text, properties)
SELECT unnest(%(texts)s),
       unnest(%(properties)s)
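For completeness, here is one way that query might be driven from Python. This is a minimal sketch rather than the exact code from the previous post: it assumes rows is a list of dicts, engine is a SQLAlchemy engine backed by psycopg2 (which adapts Python lists to Postgres arrays), and it adds explicit array casts to keep the types unambiguous:

import json

# Hypothetical name for the query above, with explicit casts added.
UNNEST_INSERT = """
    INSERT INTO upload_time_test (text, properties)
    SELECT unnest(%(texts)s::text[]),
           unnest(%(properties)s::jsonb[])
"""

with engine.connect() as connection:
    # psycopg2 adapts each Python list to a single ARRAY[...] parameter.
    connection.execute(
        UNNEST_INSERT,
        texts=[row['text'] for row in rows],
        properties=[json.dumps(row['properties']) for row in rows],
    )

Note that each column needs its own unnest call and its own array parameter.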
More recently, I've been using an even more expressive technique that relies on jsonb_array_elements. Like unnest, this function takes a single parameter and unrolls it into multiple rows, but unlike unnest we only need to use the function once, rather than once per column.
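To see the unrolling in isolation, you can call it on a jsonb literal; each element of the array comes back as its own row:

SELECT jsonb_array_elements('[{"id": 1}, {"id": 2}]'::jsonb);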
For example, imagine we have a table like this:
CREATE TABLE test (id INT PRIMARY KEY, firstname TEXT, lastname TEXT, age INT);
We could insert values into it one at a time like this:
INSERT INTO test (id, firstname, lastname, age)
VALUES (%(id)s, %(firstname)s, %(lastname)s, %(age)s)
and run a Python loop over the rows, calling this insert once for every row:
INSERT = """
INSERT INTO test (id, firstname,lastname,age)
VALUES (%(id)s, %(firstname)s, %(lastname)s, %(age)s)
"""
with engine.connect() as connection:
for row in rows:
connection.execute(INSERT,row)
In testing, it took about 8 seconds to insert 10,000 rows using this technique. This clearly doesn't scale to larger datasets. We need a way of inserting multiple rows simultaneously.
Enter jsonb_array_elements:
INSERT = """
INSERT INTO test (id, firstname,lastname,age)
SELECT
(el->>'id')::int,
el->>'firstname',
el->>'lastname',
(el->>'age')::int
FROM (
SELECT jsonb_array_elements(%(data)s) el
) a;
"""
with engine.connect() as connection:
connection.execute(INSERT,data=Json(rows))
This code took only 70 milliseconds to insert 10,000 rows, representing a 100-fold speedup!
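If you want to reproduce the comparison yourself, a rough timing harness looks something like this. It's a sketch, assuming the same engine and INSERT as above and a rows list of 10,000 dicts:

import time
from psycopg2.extras import Json

start = time.perf_counter()
with engine.connect() as connection:
    # The whole dataset travels to the server as one jsonb parameter.
    connection.execute(INSERT, data=Json(rows))
elapsed = time.perf_counter() - start
print(f"Inserted {len(rows)} rows in {elapsed:.3f} seconds")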
A full demonstration of this technique is available at https://github.com/trvrm/bulktest