trvrm.github.io

Using SQLAlchemy and Postgres functions to produce JSON tree structures from SQL joins

Thu 06 July 2017


More and more I'm discovering that Postgres is an amazingly powerful 'NoSQL' database, as well as the best relational database available today.

Since the introduction of the JSON and JSONB data types, I've been able to store both deeply nested, unstructured data AND highly relational data in the same data store.

But sometimes I need to be able to map between the two domains. For example, I want to perform a join across two tables and return the result as a nested tree structure, perhaps for rendering on a webpage.

In the past I might have aggregated the data manually in a loop, but here I demonstrate some neat tricks to get Postgres to do the heavy lifting for you.

I also show how to wrap these tricks in SQLAlchemy expressions. I've started writing my SQL queries almost exclusively in SQLAlchemy, since I discovered that it allows me to think and reason about queries as collections of composable elements.

import sqlalchemy
import pandas
from sqlalchemy.sql import func,select, literal_column
import functools

pandas.set_option("display.width", 500)
pandas.set_option('display.max_colwidth', 110)

engine = sqlalchemy.create_engine('postgresql+psycopg2://demo:password@localhost/demo')
read = functools.partial(pandas.read_sql,con=engine)

Turn a simple join into a nested JSON structure

Consider a database consisting of two tables: book and author.

Each author may have written multiple books. I want a list of authors, and for each author I want a nested list of the books they have written.

Create the tables

engine.execute('''
    DROP TABLE IF EXISTS book;
    DROP TABLE IF EXISTS author;
    CREATE TABLE IF NOT EXISTS author(
        id int primary key,
        name text
    );

    CREATE TABLE IF NOT EXISTS book(
        id int primary key  ,
        author_id int references author,
        name text
    );
''');

Use SQLAlchemy reflection to get objects that represent each table

metadata = sqlalchemy.MetaData(bind=engine)
metadata.reflect()
tables=metadata.tables
Author = tables['author']
Book = tables['book']

Populate the database

engine.execute(Author.insert().values(id=1,name='Douglas Adams'))
engine.execute(Author.insert().values(id=2,name='JK Rowling'))
engine.execute(Author.insert().values(id=3,name='JRR Tolkien'))

engine.execute(Book.insert().values(id=4, author_id=1,name='The Hitchhikers Guide to the Galaxy'))
engine.execute(Book.insert().values(id=5, author_id=1,name='The Restaurant at the End of the Universe'))
engine.execute(Book.insert().values(id=6, author_id=1,name='Life, The Universe, and Everything'))
engine.execute(Book.insert().values(id=7, author_id=2,name='Harry Potter and the Giant Plot Hole'))
engine.execute(Book.insert().values(id=8, author_id=3,name='The Silmarillion'))
engine.execute(Book.insert().values(id=9, author_id=3,name='The Lord of the Rings'));
read(Author.select())
id name
0 1 Douglas Adams
1 2 JK Rowling
2 3 JRR Tolkien
read(Book.select())
id author_id name
0 4 1 The Hitchhikers Guide to the Galaxy
1 5 1 The Restaurant at the End of the Universe
2 6 1 Life, The Universe, and Everything
3 7 2 Harry Potter and the Giant Plot Hole
4 8 3 The Silmarillion
5 9 3 The Lord of the Rings

The problem: JOIN creates multiple rows per author

read(
    select([
        Author.join(Book)
    ])
)
id name id author_id name
0 1 Douglas Adams 4 1 The Hitchhikers Guide to the Galaxy
1 1 Douglas Adams 5 1 The Restaurant at the End of the Universe
2 1 Douglas Adams 6 1 Life, The Universe, and Everything
3 2 JK Rowling 7 2 Harry Potter and the Giant Plot Hole
4 3 JRR Tolkien 8 3 The Silmarillion
5 3 JRR Tolkien 9 3 The Lord of the Rings

The goal

It would be far more helpful to have a query that returns three rows - one for each author, with each author's books contained in a sub list.

We do that by using two powerful Postgres techniques:

Specifically we use the function json_agg to roll up a set of books into a JSON list

frame = read('''
    WITH author_books AS (SELECT author_id, json_agg(book) FROM book GROUP BY author_id)

    SELECT * FROM author
     JOIN author_books ON author_books.author_id=author.id 

''')

frame
id name author_id json_agg
0 1 Douglas Adams 1 [{'id': 4, 'name': 'The Hitchhikers Guide to the Galaxy', 'author_id': 1}, {'id': 5, 'name': 'The Restarau...
1 3 JRR Tolkien 3 [{'id': 8, 'name': 'The Silmarillion', 'author_id': 3}, {'id': 9, 'name': 'The Lord of the Rings', 'author...
2 2 JK Rowling 2 [{'id': 7, 'name': 'Harry Potter and the Giant Plot Hole', 'author_id': 2}]

And because we're using Pandas dataframes, we even have a convenience function to turn this directly into a nested Python dictionary:

frame.to_dict('records')
[{'author_id': 1,
  'id': 1,
  'json_agg': [{'author_id': 1,
    'id': 4,
    'name': 'The Hitchhikers Guide to the Galaxy'},
   {'author_id': 1,
    'id': 5,
    'name': 'The Restaurant at the End of the Universe'},
   {'author_id': 1, 'id': 6, 'name': 'Life, The Universe, and Everything'}],
  'name': 'Douglas Adams'},
 {'author_id': 3,
  'id': 3,
  'json_agg': [{'author_id': 3, 'id': 8, 'name': 'The Silmarillion'},
   {'author_id': 3, 'id': 9, 'name': 'The Lord of the Rings'}],
  'name': 'JRR Tolkien'},
 {'author_id': 2,
  'id': 2,
  'json_agg': [{'author_id': 2,
    'id': 7,
    'name': 'Harry Potter and the Giant Plot Hole'}],
  'name': 'JK Rowling'}]

So now we have our data in a form which would be very easy to use in the templating language of your choice when building a web application.

Ok, but can I do that in SQLAlchemy?

I'm glad you asked. First I define a little helper function to represent the underlying json_agg function.

def json_agg(table):
    return func.json_agg(literal_column('"'+ table.name+'"'))

Then I create my CTE:

AuthorBooks = (
    select([
        Book.c.author_id,
        json_agg(Book).label('books')
    ])
    .select_from(Book)
    .group_by(Book.c.author_id)
).cte('author_books')

And finally I use my CTE exactly as if it were a real table:

query=(
    select([
        Author,
        AuthorBooks
    ])
    .select_from(
        Author.join(AuthorBooks)
    )
)
frame=read(query)
frame
id name author_id books
0 1 Douglas Adams 1 [{'id': 4, 'name': 'The Hitchhikers Guide to the Galaxy', 'author_id': 1}, {'id': 5, 'name': 'The Restarau...
1 3 JRR Tolkien 3 [{'id': 8, 'name': 'The Silmarillion', 'author_id': 3}, {'id': 9, 'name': 'The Lord of the Rings', 'author...
2 2 JK Rowling 2 [{'id': 7, 'name': 'Harry Potter and the Giant Plot Hole', 'author_id': 2}]

And as before we can turn this into a nested Python data structure

authors = frame.to_dict('records')
authors[0]['name']
'Douglas Adams'
authors[0]['books']
[{'author_id': 1, 'id': 4, 'name': 'The Hitchhikers Guide to the Galaxy'},
 {'author_id': 1,
  'id': 5,
  'name': 'The Restaurant at the End of the Universe'},
 {'author_id': 1, 'id': 6, 'name': 'Life, The Universe, and Everything'}]

What about many-to-many joins?

Let's consider a slightly more complex example, that of 'users' and 'groups'.

(As an extra bit of fun, both user and group are reserved keywords in Postgres, so we have to be careful with our quoting to make this work)

engine.execute('''
    DROP TABLE IF EXISTS "user_to_group";
    DROP TABLE IF EXISTS "group";
    DROP TABLE IF EXISTS "user";


    CREATE TABLE IF NOT EXISTS "group"(
        id int primary key,
        name text
    );
    CREATE TABLE IF NOT EXISTS "user"(
        id int primary key,
        name text
    );
    CREATE TABLE IF NOT EXISTS "user_to_group"(
        group_id int references "group",
        user_id int references "user"
        );
''');
metadata = sqlalchemy.MetaData(bind=engine)
metadata.reflect()
tables=metadata.tables

User        = tables['user']
Group       = tables['group']
UserToGroup = tables['user_to_group']

Populate the data

We'll create three groups and five users, some of whom may be in more than one group

inserts=[
    Group.insert().values(id=1,name='Hobbits'),
    Group.insert().values(id=2,name='Wizards'),
    Group.insert().values(id=3,name='The Fellowship'),

    User.insert().values(id=1,name='Frodo'),
    User.insert().values(id=2,name='Gandalf'),
    User.insert().values(id=3,name='Legolas'),
    User.insert().values(id=4,name='Pippin'),
    User.insert().values(id=5,name='Saruman'),


    UserToGroup.insert().values(user_id=1,group_id=1),
    UserToGroup.insert().values(user_id=4,group_id=1),
    UserToGroup.insert().values(user_id=2,group_id=2),
    UserToGroup.insert().values(user_id=5,group_id=2),


    UserToGroup.insert().values(user_id=1,group_id=3),
    UserToGroup.insert().values(user_id=2,group_id=3),
    UserToGroup.insert().values(user_id=3,group_id=3),
    UserToGroup.insert().values(user_id=4,group_id=3),
]
for i in inserts:
    engine.execute(i)

A simple join gives us all our data, but in a form that may be clunky to work with.

read(Group.join(UserToGroup).join(User).select())
id name group_id user_id id name
0 1 Hobbits 1 1 1 Frodo
1 1 Hobbits 1 4 4 Pippin
2 2 Wizards 2 2 2 Gandalf
3 2 Wizards 2 5 5 Saruman
4 3 The Fellowship 3 1 1 Frodo
5 3 The Fellowship 3 2 2 Gandalf
6 3 The Fellowship 3 3 3 Legolas
7 3 The Fellowship 3 4 4 Pippin

Instead, let's create two queries

In SQL, it looks like this:

read('''
    WITH user_groups AS (
        SELECT user_id, json_agg("group") AS groups
          FROM user_to_group 
          JOIN "group" 
            ON user_to_group.group_id="group".id
        GROUP BY user_id  

    )

    SELECT id,name,groups FROM "user"
     JOIN user_groups on "user".id = user_groups.user_id

''')
id name groups
0 4 Pippin [{'id': 1, 'name': 'Hobbits'}, {'id': 3, 'name': 'The Fellowship'}]
1 1 Frodo [{'id': 1, 'name': 'Hobbits'}, {'id': 3, 'name': 'The Fellowship'}]
2 5 Saruman [{'id': 2, 'name': 'Wizards'}]
3 3 Legolas [{'id': 3, 'name': 'The Fellowship'}]
4 2 Gandalf [{'id': 2, 'name': 'Wizards'}, {'id': 3, 'name': 'The Fellowship'}]

But we want to use SQLAlchemy, so again we create CTE objects to help us

A list of users showing the groups they belong to

UserGroups=(
    select([
        UserToGroup.c.user_id,
        json_agg(Group).label('groups')
    ])
    .select_from(Group.join(UserToGroup))
    .group_by(UserToGroup.c.user_id)
).cte('user_groups')
query=(
    select([
        User,
        UserGroups.c.groups
    ])
    .select_from(
        User.join(UserGroups,User.c.id==UserGroups.c.user_id)
    )
)
read(query)
id name groups
0 4 Pippin [{'id': 1, 'name': 'Hobbits'}, {'id': 3, 'name': 'The Fellowship'}]
1 1 Frodo [{'id': 1, 'name': 'Hobbits'}, {'id': 3, 'name': 'The Fellowship'}]
2 5 Saruman [{'id': 2, 'name': 'Wizards'}]
3 3 Legolas [{'id': 3, 'name': 'The Fellowship'}]
4 2 Gandalf [{'id': 2, 'name': 'Wizards'}, {'id': 3, 'name': 'The Fellowship'}]

A list of groups showing the users that are members of them

GroupUsers=(
    select([
        UserToGroup.c.group_id,
        json_agg(User).label('users')
    ])
    .select_from(User.join(UserToGroup))
    .group_by(UserToGroup.c.group_id)
).cte('group_users')
query=(
    select([
        Group,
        GroupUsers.c.users
    ])
    .select_from(
        Group.join(GroupUsers,Group.c.id==GroupUsers.c.group_id)
    )
)
read(query)
id name users
0 1 Hobbits [{'id': 1, 'name': 'Frodo'}, {'id': 4, 'name': 'Pippin'}]
1 3 The Fellowship [{'id': 1, 'name': 'Frodo'}, {'id': 2, 'name': 'Gandalf'}, {'id': 3, 'name': 'Legolas'}, {'id': 4, 'name':...
2 2 Wizards [{'id': 2, 'name': 'Gandalf'}, {'id': 5, 'name': 'Saruman'}]

And again, we can always convert this to a nested Python data structure

read(query).to_dict('records')
[{'id': 1,
  'name': 'Hobbits',
  'users': [{'id': 1, 'name': 'Frodo'}, {'id': 4, 'name': 'Pippin'}]},
 {'id': 3,
  'name': 'The Fellowship',
  'users': [{'id': 1, 'name': 'Frodo'},
   {'id': 2, 'name': 'Gandalf'},
   {'id': 3, 'name': 'Legolas'},
   {'id': 4, 'name': 'Pippin'}]},
 {'id': 2,
  'name': 'Wizards',
  'users': [{'id': 2, 'name': 'Gandalf'}, {'id': 5, 'name': 'Saruman'}]}]

Conclusion

I demonstrate how to use WITH statements (Common Table Expressions), the json_agg function and SQLAlchemy to quickly convert complex SQL joins into nested Python data structures.

Using techniques like the ones presented here, Postgres can act as a powerful relational data store that can still provide applications with data in helpful forms, such as nested dictionaries and lists.