trvrm.github.io

Software

CoffeeScript

Thu 01 January 2015


Through a bizarre twist of history, the entire client-side web runs on a language that was thrown together in 10 days.

Despite huge investments in their own proprietary technology by the likes of Sun Microsystems, Adobe and Microsoft, this weird little spinoff of Self and Scheme is everywhere, while client-side Java, ActiveX and Flash fade into obscurity.

Unsurprisingly for a language developed so quickly, Javascript is pretty ugly. I'm fond of saying that it's a horrible language, with a really nice language inside trying to get out. It gets some things, like scoping rules, very, very wrong. But it got other things, like anonymous functions, exactly right, long before they were adopted in Java, C#, or C++. Even Python, my favourite language ever, doesn't get them quite right.
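To show what I mean about Python (a quick illustration of my own, not part of the original argument): Python's anonymous functions are restricted to a single expression, so the multi-statement callbacks that are routine in javascript have no direct anonymous equivalent.

```python
# Python's lambda is limited to a single expression:
double = lambda x: 2 * x
assert double(21) == 42

# Anything involving statements (assignment, logging, try/except...)
# needs a full def; there is no multi-statement anonymous function.
def describe(x):
    y = 2 * x  # a statement: not allowed inside a lambda
    return "{0} doubled is {1}".format(x, y)

assert describe(21) == "21 doubled is 42"
```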

Several people have attempted to build a nicer syntax on top of the javascript virtual machine. In fact, the list of languages that compile to JS is startlingly big.

For the last couple of years I've been using CoffeeScript as my standard javascript syntax.

From the project page:

"CoffeeScript is a little language that compiles into JavaScript. Underneath that awkward Java-esque patina, JavaScript has always had a gorgeous heart. CoffeeScript is an attempt to expose the good parts of JavaScript in a simple way."

and I think it achieves this admirably. It doesn't solve all of javascript's problems - you can still get into trouble with the Infamous Loop Problem - but it does make the language considerably more succinct, mostly by stealing ideas from Python and Haskell.

Examples

Function definitions go from

function d(x){
    return 2*x
}

to

d = (x) -> 2*x

This makes for very quick object construction:

square = (x) -> x * x

math =
    root:   Math.sqrt
    square: square
    cube:   (x) -> x * square x

It also borrows Python's list comprehension syntax:

values=(option.value for option in question.options)

The near complete absence of curly brackets saves a lot of wasted lines in my source code, and lets me see what's going on much more clearly than in raw javascript. On the downside, I do find myself fairly regularly testing out code snippets in the CoffeeScript online compiler to make sure that I've properly understood how they will be interpreted.

Because CoffeeScript is a compiled language, working with it effectively requires integrating the compiler into your toolchain. For my larger projects I've hand-written a tool using Python's Watchdog package to monitor my source code directories and output compiled javascript every time a file changes.
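My real tool is built on Watchdog, but the underlying idea is simple enough to sketch with nothing but the standard library: poll the source tree for modified .coffee files and shell out to the coffee compiler for each one. (This polling version is an illustration, not my production code.)

```python
import os
import subprocess
import time

def changed_files(root, mtimes):
    """Return .coffee files under root modified since we last looked.

    mtimes is a dict we use to remember each file's last-seen mtime.
    """
    changed = []
    for dirpath, _, filenames in os.walk(root):
        for name in filenames:
            if not name.endswith('.coffee'):
                continue
            path = os.path.join(dirpath, name)
            mtime = os.path.getmtime(path)
            if mtimes.get(path) != mtime:
                mtimes[path] = mtime
                changed.append(path)
    return changed

def compile_coffee(path):
    # Assumes the `coffee` compiler is on the PATH; -c compiles in place.
    subprocess.call(['coffee', '-c', path])

def watch(root, interval=1.0):
    mtimes = {}
    while True:
        for path in changed_files(root, mtimes):
            compile_coffee(path)
        time.sleep(interval)
```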

As a nice little extra, my tool jams in a warning message wrapped in an alert call if the compilation fails, so if I introduce a syntax error in my coffeescript, as soon as I refresh the page that is using it I'll be presented with the source of the problem.

IPython with Python 3

Thu 01 January 2015


This took me longer than I was expecting.

In general when working with IPython I use pip rather than apt-get, as pip tends to have more up-to-date packages.

In the end I found the simplest thing to do was to set up IPython in an isolated virtualenv environment. The main trick is to let virtualenv know what version of Python you want it to use by default.

$ virtualenv --python=python3.4 python_3_demo
$ cd python_3_demo/
$ source ./bin/activate
$ pip install ipython
$ ipython
Python 3.4.0 (default, Apr 11 2014, 13:05:11)
...

In [1]: import sys

In [2]: print(sys.version)
3.4.0 (default, Apr 11 2014, 13:05:11)
[GCC 4.8.2]

And voilà: I have Python 3 in the best Python interpreter ever built, and I'm ready to start wrapping my head around byte arrays and UTF-8 encodings.
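For example (a minimal sketch of the kind of thing I mean), Python 3 finally makes the split between text and bytes explicit:

```python
# In Python 3, str is text and bytes is raw data; moving between
# them always goes through an explicit encoding.
text = 'voil\u00e0'               # 'voilà'
data = text.encode('utf-8')

assert isinstance(data, bytes)
assert len(text) == 5          # five characters...
assert len(data) == 6          # ...but six bytes: 'à' is two bytes in UTF-8
assert data.decode('utf-8') == text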

More Ractive

Thu 01 January 2015


I've used Ractive for several projects recently. It's been like a breath of fresh air. I've been writing user interfaces of one kind or another for more than a decade, and keeping what's displayed to the user in sync with what's stored in the data structures has often been a source of frustration.

In the javascript world, I've mostly used Backbone and jQuery for creating interactive web pages. While these tools are very good at what they do, I still find myself writing a fair amount of code to update data in the model whenever a user interacts with a control, and to update the control displays whenever data in the model changes.

Enter Ractive. It's not the only library to handle 2-way data binding - Angular, Knockout and React all play in this space, but it's my current favourite.

Anyway, here's a little demo....

Given a ractive template like this:

<p>Type in the boxes below.</p>
<input class="form-control" value="{{var1}}">
<input class="form-control" value="{{var2}}">
<p>Current values are var1:<code>{{var1}}</code>, var2:<code>{{var2}}</code></p>
<button class="btn btn-primary" on-click="changeme">Set var1</button>

and javascript like this:

var ractive = new Ractive({
   el:"#demo",
   template:template,
   data:{var1:'beeblebrox',var2:'lintilla'}
});

We get:

Neat, huh? I haven't had to write any code to manually react to keyup or change events in the input controls - Ractive simply notices that I refer to var1 in both the output paragraph and the control value, binds the two elements together, and refreshes them whenever needed.

The code for responding to the button click is simply:

ractive.on('changeme',function(){
    ractive.set('var1','zarniwoop');
});

By setting data in the underlying model, the user interface automatically updates, again without any manual intervention.

UI development might be fun again....

I have the same feeling on discovering Ractive that I had when I was first shown jQuery. All of a sudden, a bunch of boring, fiddly manual tasks are taken care of in an intuitive way. And unlike other frameworks, all Ractive does is data-binding. It doesn't try to be a control library, an AJAX toolkit or a Model-View-Controller framework. For those who like all-in-one solutions, this will be a weakness, but as someone who believes in the Unix philosophy of building systems from tools that each do one thing well, I'm very impressed.

PDF Generation With Pelican

Thu 01 January 2015


The existing documentation is a little unclear on this, because it says you need to add PDF_GENERATOR=True to your pelicanconf.py file.

This advice is out of date: PDF generation has been moved to a plugin.

So you need to first make sure you have rst2pdf installed:

$ sudo apt-get install rst2pdf

and then add the following to pelicanconf.py

PLUGIN_PATH = '../pelican-plugins'  # or wherever.
PLUGINS = ['pdf']

However, doing this seems to screw up the pygments highlighting on my regular html output. This is because deep in the rst2pdf code, in a file called pygments2style.py, all the pygments elements have their CSS classes prefixed with pygment-. I haven't figured out how to generate HTML and PDF nicely at the same time.

Python Comprehensions

Thu 01 January 2015


Python list comprehensions are one of the most powerful and useful features of the language. However, I've noticed even quite experienced Python programmers using less powerful idioms when a list comprehension would be the perfect solution to their problem, and even though I've been a Python developer for more than a decade, I've recently learned some very nice aspects of this feature.

What's a List Comprehension?

Python is such a strong language in part because of its willingness to steal ideas from other languages. Python list comprehensions are an idea that comes from Haskell. Fundamentally, they are a kind of 'syntactic sugar' for constructing lists from other data sources in a tight, elegant fashion.

One of the things I like most about them is they eliminate the need to manually create loop structures and extra variables. So consider the following:

squares=list()
for i in range(10):
    squares.append(i**2)
squares
[0, 1, 4, 9, 16, 25, 36, 49, 64, 81]

With list comprehensions we can eliminate both the for loop and the calls to append():

[i**2 for i in range(10)]
[0, 1, 4, 9, 16, 25, 36, 49, 64, 81]

Comprehensions work with any kind of iterable as an input source:

[ord(letter) for letter in "hello world"]
[104, 101, 108, 108, 111, 32, 119, 111, 114, 108, 100]
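It's also worth mentioning that a comprehension can filter its input with a trailing if clause:

```python
# A trailing `if` filters the input sequence before the
# left-hand expression is applied.
evens = [i for i in range(10) if i % 2 == 0]
assert evens == [0, 2, 4, 6, 8]

# Filter and transform together: character codes, skipping the space.
codes = [ord(letter) for letter in "hello world" if letter != ' ']
assert codes == [104, 101, 108, 108, 111, 119, 111, 114, 108, 100]
```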

Multiple generators

To make things a little more complex, we can specify more than one input data source:

[(i,j) for i in range(2) for j in range(3)]
[(0, 0), (0, 1), (0, 2), (1, 0), (1, 1), (1, 2)]

Instead of just boring numbers, we could use this to construct some sentences.

[(number, animal) for number in range(3) for animal in ['cats','dogs','elephants']]
[(0, 'cats'),
 (0, 'dogs'),
 (0, 'elephants'),
 (1, 'cats'),
 (1, 'dogs'),
 (1, 'elephants'),
 (2, 'cats'),
 (2, 'dogs'),
 (2, 'elephants')]

Furthermore, we have a lot of control over how we construct the final output objects - we can put any valid python expression in the left-hand-side.

[
    "{0} {1}".format(adjective,animal)
    for adjective in ['red','cute','hungry']
    for animal in ['cat','puppy','hippo']
]
['red cat',
 'red puppy',
 'red hippo',
 'cute cat',
 'cute puppy',
 'cute hippo',
 'hungry cat',
 'hungry puppy',
 'hungry hippo']

or even

[
    "There are {0} {1} {2}".format(number, adjective, animal)
    for number in range(2,4)
    for adjective in ['cute','hungry']
    for animal in ['puppies','bats']
]
['There are 2 cute puppies',
 'There are 2 cute bats',
 'There are 2 hungry puppies',
 'There are 2 hungry bats',
 'There are 3 cute puppies',
 'There are 3 cute bats',
 'There are 3 hungry puppies',
 'There are 3 hungry bats']

Dictionary Comprehensions

An equally powerful construct is the dictionary comprehension. Just like list comprehensions, this enables you to construct python dictionaries using a very similar syntax.

{
    key:value
    for key,value in [
        ('k','v'),
        ('foo','bar'),
        ('this','that')
    ]
}
{'foo': 'bar', 'k': 'v', 'this': 'that'}
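As a slightly more practical illustration (my own example, not from the session above), the same syntax makes inverting a mapping a one-liner:

```python
drinks = {'zaphod': 'pan galactic gargle blaster', 'arthur': 'tea'}

# Swap keys and values in a single expression.
by_drink = {drink: name for name, drink in drinks.items()}

assert by_drink['tea'] == 'arthur'
```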

Armed with these tools, we can write very concise code to transform data from one structure to another. Recently I've found them very helpful for unpacking nested data structures.

Consider a simple org-structure:

departments=[
    {'name':'Manufacturing', 'staff': ["Jacob","Jonah", "Chloe","Liam"]},
    {'name':'Marketing','staff':["Emily","Shawn","Alex"]},
    {'name':'HR','staff':["David","Jessica"]},
    {'name':'Accounts','staff':["Nicole"]}
]

Now let's extract some data from it.

#Department names
[department['name'] for department in departments]
['Manufacturing', 'Marketing', 'HR', 'Accounts']
#Staff count
sum([len(department['staff']) for department in departments])
10
#All staff names
[
    name
    for department in departments
    for name in department['staff']
]
['Jacob',
 'Jonah',
 'Chloe',
 'Liam',
 'Emily',
 'Shawn',
 'Alex',
 'David',
 'Jessica',
 'Nicole']

Note how in the last example the second data-generating clause, department['staff'], used a reference from the first one.

We can take this even further. Let's make our org-chart a little more complicated...

departments=[
    {
        'name':'Manufacturing',
        'staff': [
            {'name':"Jacob",'salary':50000},
            {'name':"Chloe",'salary':60000},
            {'name':"Liam",'salary':70000},
            {'name':"Jonah",'salary':55000},
        ]
    },
    {
        'name':'Marketing',
        'staff':[
            {'name':"Emily",'salary':50000},
            {'name':"Shawn",'salary':45000},
            {'name':"Alex",'salary':40000},
        ]
    },
    {
        'name':'HR',
        'staff':[

            {'name':"David",'salary':50000},
            {'name':"Jessica",'salary':60000},
       ]
    },
    {
        'name':'Accounts',
        'staff':[
            {'name':"Nicole",'salary':40000}
        ]
    }
]

Calculate the total salary:

sum(
    person['salary']
    for department in departments
    for person in department['staff']
)
520000

Now let's calculate the wages bill by department, and put the results in a dictionary

{
    department['name'] : sum(person['salary'] for person in department['staff'])
    for department in departments
}
{'Accounts': 40000, 'HR': 110000, 'Manufacturing': 235000, 'Marketing': 135000}

Conclusion

I've been finding this type of approach very helpful when working with document-oriented data stores. We store a lot of data in JSON documents, either on the file system or in Postgresql. For that data to be useful, we have to be able to quickly mine, explore, select and transform it. Tools like JSONSelect do exist, but JSONSelect is only available in Javascript, and doesn't allow the kind of rich expression-based transforms that Python gives you as you roll up the data.

I also find that this approach avoids many common programming pitfalls: mis-assigned variables, off-by-one errors and so on. You'll note that in all the examples above I never need to create a temporary variable or explicitly construct a for loop.

Reportlab Images in IPython

Thu 01 January 2015


With a bit of work we can get IPython to render ReportLab objects directly to the page as Matplotlib plots.

Huge thanks to github user deeplook, this is basically a modification of this IPython notebook.

First our imports.

from reportlab.lib import colors
from reportlab.graphics import renderPM
from reportlab.graphics.shapes import Drawing, Rect
from reportlab.graphics.charts.linecharts import HorizontalLineChart
from io import BytesIO
from IPython.core import display

Now we create a hook that causes ReportLab drawings to be rendered inline whenever we display one by typing its name.

def display_reportlab_drawing(drawing):
    buff=BytesIO()
    renderPM.drawToFile(drawing,buff,fmt='png',dpi=72)
    data=buff.getvalue()
    ip_img=display.Image(data=data,format='png',embed=True)
    return ip_img._repr_png_()
png_formatter=get_ipython().display_formatter.formatters['image/png']
drd=png_formatter.for_type(Drawing,display_reportlab_drawing)

Now that's done, we can start creating ReportLab objects and see them immediately.

drawing = Drawing(150,100)
drawing.add(Rect(0,0,150,100,strokeColor=colors.black,fillColor=colors.antiquewhite))
drawing
chart=HorizontalLineChart()
drawing.add(chart)
drawing

A Seriously Subtle Bug

Thu 01 January 2015


I build and maintain a number of web applications built using Python, Bottle, and uWSGI. In general, I've found this a very powerful and robust software stack. However, this week we encountered a very strange issue that took us many hours to fully diagnose.

Our first indication that something was wrong was when our automated monitoring tools warned us that one of our sites was offline. We manage our applications through the uWSGI Emperor service, which makes it easy to restart errant applications. Simply touching the config file for the application in question causes it to be reloaded:

$ touch /etc/uwsgi-emperor/vassals/myapp.ini

This brought our systems back up, but obviously didn't explain the problem, and over the coming weeks it recurred several times, usually several days apart. So, obviously my first step was to look at the log files. Our first indication of trouble was a log line from our database connection layer:

OperationalError: could not create socket: too many open files

Which actually led us away from the real cause of the bug to start with - at first we thought that we were simply creating too many database connections. But further examination reassured us that yes, our database layer was fine, our connections were getting opened and closed correctly. Postgres has excellent introspective tools, if you know how to use them; in this case the following is very helpful:

SELECT * FROM pg_stat_activity;

which revealed that we had no more database connections open than expected. So, our next step was the Linux systems administration tool lsof, which lists information about currently open files:

$ sudo lsof > lsof.txt

COMMAND     PID   TID       USER   FD      TYPE             DEVICE  SIZE/OFF       NODE NAME
init          1             root  cwd       DIR                8,1      4096          2 /
init          1             root  rtd       DIR                8,1      4096          2 /
init          1             root  txt       REG                8,1    265848   14422298 /sbin/init
...

... followed by thousands more lines. Armed with this information, we could figure out how many files each process was using.

Enter Pandas

While it would be quite possible to search and filter this data using traditional Unix tools such as awk and grep, I'm finding that more and more I'm staying inside the Python ecosystem for systems administration and analysis tasks. I use the Pandas data analysis library heavily, and it was a perfect fit for this particular task.

$ ipython
import pandas
widths=[9,6,6,11,5,10,19,10,12,200]
frame=pandas.read_fwf('lsof.txt',widths=widths)
frame.columns
Index([u'COMMAND', u'PID', u'TID', u'USER', u'FD', u'TYPE', u'DEVICE', u'SIZE/OFF', u'NODE', u'NAME'], dtype='object')

So now we have a DataFrame (a construct very similar to an Excel worksheet) with a list of every open file on the system, along with the process id and name of the program that is holding it open. Our next step was to ask Pandas to tell us which processes had the most files open:

frame.PID.value_counts().head()
2445     745
2454     745
...

So process 2445 has 745 open files. OK, what is that process?

frame[frame.PID==2445][['USER','COMMAND']]
          USER    COMMAND
3083  www-data  uwsgi-cor
3084  www-data  uwsgi-cor
3085  www-data  uwsgi-cor
...

So we've learned, then, that a uWSGI process belonging to www-data is holding open more than 700 files. Now, under Ubuntu, this is going to be a problem very soon, because the maximum number of files that www-data may have open per-process is 1024.

$ sudo su www-data
$ ulimit -n
1024

So, clearly one of our web application processes is opening files and not closing them again. This is the kind of bug that I hate as a programmer, because it wouldn't show up in development, when I'm frequently restarting the application, or even in testing, but only appears under real-world load. But at least now we have a path towards temporary remediation. So first we simply increased the limits in ulimit so that the service would run longer before this bug re-appeared. But we still wanted to understand why this was happening.

Next Steps

Again, we used Pandas to interrogate the output of lsof, but this time to find out whether there was a pattern to the filenames that were being left open:

frame.NAME.value_counts().head()

Which revealed to us that the vast majority of the files being left open were ones that we were delivering through our Bottle Python application. Specifically, they were being served through the static_file function.

We verified this by hitting the url that was serving up those static files, and watching the output of lsof. Immediately we saw that yes, every time we served that file, the open count for that file went up. So, we clearly had a resource leak on our hands. Now, this surprised me, because usually the memory management and garbage collection in Python is excellent, and I've left the days of manually tracking resources in C long behind me.

So, next I constructed some test cases. Firstly, I ran our software on a test virtual machine to verify that I could recreate the bug. Then, I wrote a very bare-bones Bottle app that simply served a static file:

import bottle

application=bottle.Bottle()

@application.get('/diagnose')
def test():
    return bottle.static_file('cat.jpg', '.')

And I immediately saw that this didn't trigger any kind of file leak. The main difference between the two was that our production application uses Bottle's mounting capability to namespace URLs. So I changed my test application as follows:

import bottle

app=bottle.Bottle()

@app.get('/')
def test():
    return bottle.static_file('cat.jpg', '.')

rootapp=bottle.Bottle()
rootapp.mount("/diagnose", app)
application=rootapp

And lsof indicated that we were leaking files. Every time I hit /diagnose, the open file count for cat.jpg increased by one.

So, we could simply have re-written our application not to use Bottle.mount, but that wasn't good enough for me. I wanted to understand why such a simple change would trigger a resource leak. At this point, it turns out it's good that I have Asperger's, and with it a tendency to hyper-focus on interesting problems, because it took a long time. In fact, I ended up taking the Bottle library and manually stripping it of every line of code that wasn't related to simply handling that single URL, in an attempt to understand exactly what the different code paths were between the leaking program and the safe one.

In doing so, I was greatly aided by the amazing introspective powers of Python. We felt sure that we were dealing with some kind of resource leak - in Python, every file is handled by a file object, and when that object gets cleaned up by garbage collection, the underlying file handle is closed. So first, I replaced the relevant call to the file constructor with my own object derived from file:

class MonitoredFile(file):
    def __init__(self,name,mode):
        logging.info("Opening {0}".format(name))
        file.__init__(self,name,mode)

    def __del__(self):
        logging.info('file.__del__({0})'.format(self.name))

So this object behaves exactly like a regular file, but logs events when it is created and when it is destroyed. And sure enough, I saw that in the file-leaking version of my code, MonitoredFile.__del__() was never getting called. Now in Python an object should get deleted when its reference count drops to zero, and indeed the Python sys library provides the getrefcount function (https://docs.python.org/2/library/sys.html#sys.getrefcount). By adding some logging statements with calls to sys.getrefcount(), I saw that in the leaking version of my code, the refcount for our file object was one higher than in the non-leaking code when it was returned from the main application handler function.

Why should this be? Eventually, by stripping out all extraneous code from the Bottle library, I discovered that in the version that was using Bottle.mount() , the response object was passed twice through the _cast() function. Bottle can handle all sorts of things as response objects - strings, dictionaries, JSON objects, lists, but if it notices that it is handling a file then it treats it specially. The smoking gun code is here: https://github.com/bottlepy/bottle/blob/854fbd7f88aa2f809f54dd724aea7ecf918a3b6e/bottle.py#L913

if hasattr(out, 'read'):
    if 'wsgi.file_wrapper' in request.environ:
        return request.environ['wsgi.file_wrapper'](out)
    elif hasattr(out, 'close') or not hasattr(out, '__iter__'):
        return WSGIFileWrapper(out)

Which looks innocent enough, and indeed is harmless in the first version of our code. But in the second version, our file object gets passed through this code block twice, because it's getting handled recursively. And, indeed, if wsgi.file_wrapper isn't specified, then WSGIFileWrapper is used, and everything is fine. But in our case, we're serving this application via uWSGI, which does define wsgi.file_wrapper. Now, I'm still not 100% clear what this wrapping function is supposed to do, but on inspecting the uWSGI source I see that it is set to call this C function:

PyObject *py_uwsgi_sendfile(PyObject * self, PyObject * args) {

    struct wsgi_request *wsgi_req = py_current_wsgi_req();

    if (!PyArg_ParseTuple(args, "O|i:uwsgi_sendfile", &wsgi_req->async_sendfile, &wsgi_req->sendfile_fd_chunk)) {
        return NULL;
    }


    if (PyFile_Check((PyObject *)wsgi_req->async_sendfile)) {
        Py_INCREF((PyObject *)wsgi_req->async_sendfile);
        wsgi_req->sendfile_fd = PyObject_AsFileDescriptor(wsgi_req->async_sendfile);
    }

    // PEP 333 hack
    wsgi_req->sendfile_obj = wsgi_req->async_sendfile;
    //wsgi_req->sendfile_obj = (void *) PyTuple_New(0);

    Py_INCREF((PyObject *) wsgi_req->sendfile_obj);
    return (PyObject *) wsgi_req->sendfile_obj;
}

And we can clearly see that Py_INCREF is getting called on the file object. So if this function is called twice, presumably the internal reference count is incremented twice, but only decremented once elsewhere.

And indeed, as soon as I added:

if 'wsgi.file_wrapper' in environ:
    del environ['wsgi.file_wrapper']

to my application code, the file leaking stopped.

Concluding Thoughts

At the moment, I'm not exactly sure whether this is a bug or a misunderstanding. I'm not sure what wsgi.file_wrapper is supposed to do - I clearly have more research to do, time permitting. And because this bug only occurred when Bottle and uWSGI interacted - I couldn't trigger it in one or other environment on its own - it's hard to say that either project has a bug. But hopefully this analysis will help prevent others from going through the same headaches I just did.

SQL Magic

Thu 01 January 2015


I'm finding the %sql magic function extremely useful. It turns IPython into a very nice front-end to Postgresql.

First, make sure you have the ipython-sql extension installed:

pip install ipython-sql

https://pypi.python.org/pypi/ipython-sql

Then we load the extension

%load_ext sql

Then we set up our database connection.

%%sql
postgresql://testuser:password@localhost/test
u'Connected: testuser@test'

And now we can start interacting directly with the database as if we were at the psql command line.

%%sql
CREATE TABLE people (first text, last text, drink text);
INSERT INTO people (first,last,drink)
VALUES
    ('zaphod','beeblebrox','pan galactic gargle blaster'),
    ('arthur','dent','tea'),
    ('ford','prefect','old janx spirit')
    ;
Done.
3 rows affected.
[]
%sql select * from people
3 rows affected.
first   last        drink
zaphod  beeblebrox  pan galactic gargle blaster
arthur  dent        tea
ford    prefect     old janx spirit

We can access the results as a python object:

result = %sql select * from people
len(result)
3

And we can even get our recordset as a pandas dataframe

%config SqlMagic.autopandas=True
frame = %sql select * from people
frame
   first   last        drink
0  zaphod  beeblebrox  pan galactic gargle blaster
1  arthur  dent        tea
2  ford    prefect     old janx spirit


frame['first'].str.upper()
0    ZAPHOD
1    ARTHUR
2      FORD
Name: first, dtype: object