
Efficient Postgres Bulk Inserts using Psycopg2 and Unnest

Thu 22 October 2015


One of the biggest challenges I face maintaining large Postgres systems is getting data into them in an efficient manner. Postgres is very, very good at maintaining, organising, querying and retrieving data, but inserts themselves can be quite slow.

Various stackoverflow questions suggest using things like COPY, but that assumes your application has direct write access to the machine that is running Postgres, and I work very hard to maintain strict separation of functionality between software components.

So I've been looking for a faster way of inserting, say, 100,000 rows into the database across the wire, and what follows is by far the most efficient technique I've found so far.

Set some visualisation options

%matplotlib inline
import seaborn

Import the libraries we'll be using

import pandas
from numpy import array_split
import psycopg2
from psycopg2.extras import Json
import time
import contextlib

A context manager for timing operations.

@contextlib.contextmanager
def timer(name="duration"):
    'Utility function for timing execution'
    start=time.time()
    yield
    duration=time.time()-start
    print("{0}: {1} second(s)".format(name,duration))

Test setup

My tests are based on creating a fairly simple table and timing how long it takes to perform inserts on it via several different methods. The table has a couple of default columns, a text column, and an HSTORE column.

SETUP_SQL="""
    DROP TABLE IF EXISTS upload_time_test;

    CREATE TABLE upload_time_test(
        uuid uuid primary key default uuid_generate_v4(),
        created timestamp with time zone not null default now(),
        text text not null,
        properties hstore not null default ''::hstore
    );

    GRANT ALL ON upload_time_test TO test;
"""

This is the SQL we'll use for inserting a single row into the database table:

SINGLE_INSERT="""
    INSERT INTO upload_time_test(text,properties)
         VALUES (%(text)s, %(properties)s)
"""

Credentials for connecting to my local test database

HOST='localhost'
DATABASE='test'
USER='test'
PASSWORD='password'

Then we define a couple of simple wrappers around psycopg2

def connect():
    connection= psycopg2.connect(host=HOST,database=DATABASE,user=USER,password=PASSWORD)
    psycopg2.extras.register_hstore(connection)
    return connection
def execute(sql,params={}):
    with connect() as connection:
        with connection.cursor() as cursor:
            cursor.execute(sql,params)

This is the heart of my tests. The Tester class destroys and re-creates the sample table every time we instantiate it.

It provides three different functions for inserting database rows, each based on a different technique.

  • slowInsert() is the slowest, because it creates a new database connection for each row
  • insert() is the approach I had been using up till now. It creates one connection, and re-uses it for each insertion. This is basically what executemany() in psycopg2 is doing.
  • fastInsert() is my new approach, based on using unnest() to unroll a set of arrays passed in through psycopg2
class Tester():
    def __init__(self,count):
        execute(SETUP_SQL)
        self.count=count

        self.data=[
            {
                'text':'Some text',
                'properties': {"key":"value"},
            }
            for i in range(count)
        ]

    def slowInsert(self):
        '''
            Creates a new connection for each insertion
        '''
        for row in self.data:
            text=row['text']
            properties=row['properties']
            execute(SINGLE_INSERT,locals())

    def insert(self):
        '''
            One connection.
            Multiple queries.
        '''
        with connect() as connection:
            with connection.cursor() as cursor:
                for row in self.data:
                    text=row['text']
                    properties=row['properties']
                    cursor.execute(SINGLE_INSERT,locals())


    def fastInsert(self):
        '''
            One connection, one query.
        '''
        sql='''
            INSERT INTO upload_time_test(text,properties)
              SELECT unnest( %(texts)s ) ,
                     unnest( %(properties)s)

        '''

        texts=[r['text'] for r in self.data]
        properties=[r['properties'] for r in self.data]
        execute(sql,locals())
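
As an aside, the insert() loop above is essentially what psycopg2's executemany() does for you. A rough sketch of the equivalent call (my own helper, not part of the timed runs) would be:

def insertWithExecutemany(data):
    '''Roughly equivalent to Tester.insert(), letting psycopg2 drive the loop.'''
    with connect() as connection:
        with connection.cursor() as cursor:
            # each dict in data supplies the %(text)s and %(properties)s parameters
            cursor.executemany(SINGLE_INSERT, data)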

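One caveat about the unnest() approach that I didn't hit in these tests but is worth knowing: if Postgres can't infer the array element types from the parameters, explicit casts inside unnest() resolve the ambiguity. A variant of the query (the name is my own) would be:

# Hypothetical variant: explicit array casts so Postgres knows the element types.
FAST_INSERT_WITH_CASTS='''
    INSERT INTO upload_time_test(text,properties)
      SELECT unnest( %(texts)s::text[] ),
             unnest( %(properties)s::hstore[] )
'''
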
Now that we have the Tester class written, we can see how long each approach takes to insert 1000 rows:

def runTests(iterations):
    tester = Tester(iterations)
    with timer('slow'):
        tester.slowInsert()
    with timer('normal'):
        tester.insert()
    with timer('fast'):
        tester.fastInsert()
runTests(1000)
slow: 7.160489320755005 second(s)
normal: 0.1441025733947754 second(s)
fast: 0.042119503021240234 second(s)

We notice an obvious difference between the approaches.

Re-using the connection makes a huge difference. Inserts run 50 times faster.

Using unnest() runs 3 times faster than that.

What about much bigger data sets?

Next, I wanted to know if this held true for inserting, say, 100,000 rows. I won't bother with the slowest approach, because that's clearly unusable.

tester=Tester(count=100000)
with timer('normal'):
    tester.insert()

tester=Tester(count=100000)
with timer('fast'):
    tester.fastInsert()
normal: 14.866096019744873 second(s)
fast: 3.9566986560821533 second(s)

So even over 100,000 rows we still run nearly 4 times faster using unnest.

Further investigation - mapping insertion rate against number of rows inserted.

I wanted to see the exact relationship between the rate of insertion and the number of rows being inserted.

So first I wrote a couple of functions to measure the insertion rate of our two methods:

def bulkInsertRate(count):
    tester=Tester(count)
    start=time.time()
    tester.fastInsert()
    duration=time.time()-start
    return count/duration

def normalInsertRate(count):
    tester=Tester(count)
    start=time.time()
    tester.insert()
    duration=time.time()-start
    return count/duration

And then we run them with a range of dataset sizes

counts=[50,100,200,500,1000,2000,5000,10000,20000,50000,100000]

rates=[
    {
        'count':count,
        'bulk':bulkInsertRate(count),
        'normal':normalInsertRate(count)
    }
    for count in counts
]

Finally, I use Pandas to plot the output data from these tests.

frame=pandas.DataFrame(rates).set_index('count')
frame
count     bulk            normal
50        4485.694730     3867.856879
100       10159.389609    4847.897547
200       15212.186276    6057.106548
500       27340.842720    7081.049689
1000      33248.545382    7694.657609
2000      35640.695767    7070.777670
5000      41223.200473    8027.790910
10000     40948.723106    7785.005392
20000     42604.387914    7568.314015
50000     40795.233470    7291.552509
100000    27014.354119    6872.935483
frame.plot(logx=True)
(The resulting plot, drawn with a logarithmic x-axis, shows the bulk insert rate climbing to a plateau of roughly 40,000 rows per second, while the normal approach levels off at around 7,000 to 8,000 rows per second.)

Conclusion

Using unnest to load multiple rows simultaneously has the following advantages:

  • It is significantly faster than a regular insert loop, especially when inserting thousands of rows.
  • The benefits of using unnest() increase at least up to 50,000 rows
  • It still allows us to write (reasonably) straightforward parameterised SQL with no string concatenation
  • When I tried this on a remote database, the improvements were even more impressive, presumably as it reduces significantly how much data is transferred back and forth across the network.

Pelican, Atom and Markdown

Fri 26 June 2015


I've been using reStructured text in general to write this blog, but I think I'm going to be switching to Markdown. As an experiment, I'm writing this post in Markdown.

I'm also writing it in the Atom text editor, which has really come on a long way since I last tried it. Specifically, it includes a Markdown preview function, so I can see the effects of the markup that I'm writing as I write it.

Mostly, I want a rapid way of creating and publishing code snippets, without the mental overhead of switching between markup languages. Although reStructured text and markdown are broadly similar, there are subtle differences between them when it comes to things like syntax highlighting. But I've discovered today that if I use the triple-backtick syntax, I can get the same output from Pelican, Atom, and IPython notebooks.

So

    ```python
    def syntax(highlighting=True):
        return "cool huh?"
    ```

yields

def syntax(highlighting=True):
    return "cool huh?"

And if I paste that into a markdown cell in an IPython notebook, I get the same effect, as can be seen here

So this seems to be the general way that the open-source ecosystem is going: Markdown allows me to use the same syntax for my GitHub documentation, my IPython notebooks, and my blog posts.

I do use Sphinx in various places for Python code documentation, so that will still require reStructured text, but elsewhere I think Markdown is the way to go.

Asynchronous Python

Thu 01 January 2015


It's possible to get Python to do Node-like non-blocking requests, which could take away one of the key reasons for using Node.

The following is a full bottle-based python web application.

A client can successfully call /test while another client is waiting for /slowproxy to return a result from a slow web service.

from gevent import monkey; monkey.patch_all()

from bottle import route, run
import time

@route('/sleep/<seconds:int>')
def sleep(seconds):
    time.sleep(seconds)
    return "Slept For {0}".format(seconds)

@route('/test')
def test():
     return 'test'


@route('/slowproxy/<seconds:int>')
def slowproxy(seconds):
    import requests
    url="https://s.nooro.com/sleeptest.php?seconds=%i" %seconds
    response=requests.get(url)
    response.raise_for_status()
    return response.text

run(host='0.0.0.0', port=8080,server='gevent')
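
To convince yourself that the server really isn't blocking, a quick test client (my own sketch, not part of the original post, assuming the app above is running locally on port 8080) can fire a slow request and a fast one at the same time:

import threading
import time

import requests

BASE = "http://localhost:8080"

def timed_get(path):
    start = time.time()
    response = requests.get(BASE + path)
    print("{0} -> {1!r} after {2:.2f}s".format(path, response.text, time.time() - start))

# Start a request that ties up a server-side handler for ~5 seconds...
slow = threading.Thread(target=timed_get, args=("/sleep/5",))
slow.start()

time.sleep(0.5)

# ...and while it's in flight, /test still answers immediately.
timed_get("/test")
slow.join()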

My first attempt used grequests, but apparently that's not even necessary.

I guess that the call to monkey.patch_all() even patches the socket code that requests uses. I'm very impressed.

CoffeeScript

Thu 01 January 2015


Through a bizarre twist of history, the entire client-side web runs on a language that was thrown together in 10 days.

Despite huge investments in their own proprietary technology by the likes of Sun Microsystems, Adobe and Microsoft, this weird little spin-off of Self and Scheme is everywhere, while client-side Java, ActiveX and Flash fade into obscurity.

Unsurprisingly for a language developed so quickly, Javascript is pretty ugly. I'm fond of saying that it's a horrible language, with a really nice language inside trying to get out. It gets some things, like scoping rules, very, very wrong. But it got other things, like anonymous functions, exactly right, long before they were adopted in Java, C#, or C++. Even Python, my favourite language ever, doesn't get them quite right.

Several people have attempted to build a nicer syntax on top of the javascript virtual machine. In fact, the list of languages that compile to JS is startlingly big.

For the last couple of years I've been using CoffeeScript as my standard javascript syntax.

From the project page:

"CoffeeScript is a little language that compiles into JavaScript. Underneath that awkward Java-esque patina, JavaScript has always had a gorgeous heart. CoffeeScript is an attempt to expose the good parts of JavaScript in a simple way."

and I think it achieves this admirably. It doesn't solve all of javascript's problems - you can still get into trouble with the Infamous Loop Problem, but it does make the language considerably more succinct, mostly by stealing ideas from Python and Haskell.

Examples

Function definitions go from

function d(x){
    return 2*x
}

to

d = (x) -> 2*x

This makes for very quick object construction:

square = (x) -> x * x

math =
    root:   Math.sqrt
    square: square
    cube:   (x) -> x * square x

It also borrows Python's list comprehension syntax:

values=(option.value for option in question.options)

The near-complete absence of curly brackets saves a lot of wasted lines in my source code, and lets me see what's going on much more clearly than in raw javascript. On the downside, I do find myself fairly regularly testing out code snippets in the CoffeeScript online compiler to make sure that I've properly understood how they will be interpreted.

Because CoffeeScript is a compiled language, working with it effectively requires integrating the compiler into your toolchain. For my larger projects I've hand-written a tool using Python's Watchdog package to monitor my source code directories and output compiled javascript every time a file changes.

As a nice little extra, my tool jams in a warning message wrapped in an alert call if the compilation fails, so if I introduce a syntax error in my coffeescript, as soon as I refresh the page that is using it I'll be presented with the source of the problem.
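
I haven't published that tool, but a minimal sketch of the idea (my own code, assuming the watchdog package is installed, the coffee compiler is on the PATH, and sources live under ./src) looks something like this:

import json
import subprocess
import time

from watchdog.observers import Observer
from watchdog.events import PatternMatchingEventHandler

class CoffeeCompiler(PatternMatchingEventHandler):
    '''Recompile a .coffee file whenever it changes.'''

    def on_modified(self, event):
        source = event.src_path
        target = source.rsplit('.coffee', 1)[0] + '.js'
        result = subprocess.run(['coffee', '--print', source],
                                capture_output=True, text=True)
        if result.returncode == 0:
            output = result.stdout
        else:
            # On a compilation error, ship an alert() instead, so the browser
            # shows the problem as soon as the page is refreshed.
            output = 'alert({0});'.format(json.dumps(result.stderr))
        with open(target, 'w') as f:
            f.write(output)

if __name__ == '__main__':
    observer = Observer()
    observer.schedule(CoffeeCompiler(patterns=['*.coffee']), path='src', recursive=True)
    observer.start()
    try:
        while True:
            time.sleep(1)
    except KeyboardInterrupt:
        observer.stop()
    observer.join()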

Configuring Systems with Fabric and Cuisine

Thu 01 January 2015


I've mentioned Fabric before on this blog. Because so much of my development time is spent in Python, it makes sense for me to look for system administration tools that are also written in Python. Fabric fits the bill perfectly, and allows me to run tasks remotely on multiple machines simultaneously.

A useful addition to Fabric is Cuisine. Cuisine is a small set of functions that sit on top of Fabric, to abstract common administration operations such as file/dir operations, user/group creation, package install/upgrade, making it easier to write portable administration and deployment scripts.
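
As a rough illustration (my own sketch; the function names are taken from the Cuisine documentation, so check them against the version you install), a fabfile task to prepare a web host might look like this:

# fabfile.py -- run with something like: fab -H deploy@myserver setup_webhost
from fabric.api import task
from cuisine import mode_sudo, package_ensure, user_ensure, dir_ensure

@task
def setup_webhost():
    with mode_sudo():
        package_ensure('nginx')                    # install nginx if it's missing
        user_ensure('webapp')                      # create the application user, idempotently
        dir_ensure('/srv/webapp', owner='webapp')  # make sure the deployment directory exists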

IPython with Python 3

Thu 01 January 2015


This took me longer than I was expecting.

In general when working with IPython I use pip rather than apt-get, as pip tends to have more up-to-date packages.

In the end I found the simplest thing to do was to set up IPython in an isolated virtualenv environment. The main trick is to let virtualenv know what version of Python you want it to use by default.

$ virtualenv --python=python3.4 python_3_demo
$ cd python_3_demo/
$ source ./bin/activate
$ pip install ipython
$ ipython
Python 3.4.0 (default, Apr 11 2014, 13:05:11)
...

In [1]: import sys

In [2]: print(sys.version)
3.4.0 (default, Apr 11 2014, 13:05:11)
[GCC 4.8.2]

And voila, I have Python 3 in the best Python interpreter ever built, and I'm ready to start wrapping my head around byte arrays and UTF-8 encodings.

logrotate

Thu 01 January 2015


I have a large, complex mailing system that processes a significant amount of data every hour. While I'm developing it, I want to know what it's doing, and whether it's having any problems. So I use the excellent python logging library to produce comprehensive monitoring data.

The only problem is that these log files can get pretty big. And because I don't know ahead of time when I'm going to need to hunt through them, I tend to leave the logging system in a fairly verbose state.

Enter logrotate. This is a standard service on Ubuntu that regularly rotates your log files, throwing away old data, compressing middle-age data, and leaving young log files fresh and accessible. Thus you are protected from runaway log file growth and nasty calls in the middle of the night from your monitoring service telling you that your server just died because the hard drives were full.

A default Ubuntu installation comes with logrotate already set up for various services. If you don't have it, install it with apt-get install logrotate, and then it's mostly just a question of copying a file from /etc/logrotate.d/ and modifying it according to your needs.

vi /etc/logrotate.d/myservice

/var/log/myservice/*.log {
  rotate 7
  daily
  compress
  missingok
  notifempty
}

And that's it! The actual invocation of the logrotate command will be triggered regularly by a script in /etc/cron.daily.

You can also force a rotation, a useful option when testing out a new configuration, via

logrotate -f /etc/logrotate.d/myservice

One quick word of warning: if you're using the python logging library, then you'll want to use the WatchedFileHandler class. If the logfile gets rotated out while it's in use, WatchedFileHandler will notice this, close the file stream and open a new one.
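
A minimal sketch of that in Python (the logger name and log path here are just placeholders for whatever your service uses):

import logging
from logging.handlers import WatchedFileHandler

logger = logging.getLogger('myservice')
logger.setLevel(logging.INFO)

# WatchedFileHandler checks the log file on every emit and reopens the stream
# if logrotate has moved the file out from under it.
handler = WatchedFileHandler('/var/log/myservice/myservice.log')
handler.setFormatter(logging.Formatter('%(asctime)s %(levelname)s %(message)s'))
logger.addHandler(handler)

logger.info('Logging configured; this file is safe to rotate.')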

More Ractive

Thu 01 January 2015


I've used Ractive for several projects recently. It's been like a breath of fresh air. I've been writing user interfaces of one kind or another for more than a decade, and keeping what's displayed to the user in sync with what's stored in the data structures has often been a source of frustration.

In the javascript world, I've mostly used backbone and jQuery for creating interactive web pages. While these tools are very good at what they do, I still find myself writing a fair amount of code to update data in the model whenever a user interacts with a control, and to update the control displays whenever data in the model changes.

Enter Ractive. It's not the only library to handle 2-way data binding - Angular, Knockout and React all play in this space, but it's my current favourite.

Anyway, here's a little demo....

Given a ractive template like this:

<p>Type in the boxes below.</p>
<input class="form-control" value="{{var1}}">
<input class="form-control" value="{{var2}}">
<p>Current values are var1:<code>{{var1}}</code>, var2:<code>{{var2}}</code></p>
<button class="btn btn-primary" on-click="changeme">Set var1</button>

and javascript like this:

var ractive = new Ractive({
   el:"#demo",
   template:template,
   data:{var1:'beeblebrox',var2:'lintilla'}
});

We get a live demo (embedded in the original post) in which the two text boxes and the paragraph below stay in sync as you type.

Neat, huh? I haven't had to write any code to manually react to keyup or change events in the input controls - Ractive simply takes care of the fact that I refer to var1 in both the output paragraph and the control value, and binds the two elements together, refreshing them whenever needed.

The code for responding to the button click is simply:

ractive.on('changeme',function(){
    ractive.set('var1','zarniwoop');
});

By setting data in the underlying model, the user interface automatically updates, again without any manual intervention.

UI development might be fun again....

I have the same feeling on discovering Ractive that I had when I was first shown jQuery. All of a sudden, a bunch of boring, fiddly manual tasks are taken care of in an intuitive way. And unlike other frameworks, all ractive does is data-binding. It doesn't try to be a control library, an AJAX toolkit or a Model-View-Controller framework. For those who like all-in-one solutions, this will be a weakness, but as someone who believes in the unix philosophy of building systems from tools that each do one thing well, I'm very impressed.