trvrm.github.io
Using SQLAlchemy and Postgres functions to produce JSON tree structures from SQL joins
Thu 06 July 2017
More and more I'm discovering that Postgres is an amazingly powerful 'NoSQL' database, as well as the best relational database available today.
Since the introduction of the JSON and JSONB data types, I've been able to store both deeply nested, unstructured data AND highly relational data in the same data store.
But sometimes I need to be able to map between the two domains. For example, I want to perform a join across two tables and return the result as a nested tree structure, perhaps for rendering on a webpage.
In the past I might have aggregated the data manually in a loop, but here I demonstrate some neat tricks to get Postgres to do the heavy lifting for you.
I also show how to wrap these tricks in SQLAlchemy expressions. I've started writing my SQL queries almost exclusively in SQLAlchemy, since I discovered that it allows me to think and reason about queries as collections of composable elements.
import sqlalchemy
import pandas
from sqlalchemy.sql import func,select, literal_column
import functools
pandas.set_option("display.width", 500)
pandas.set_option('display.max_colwidth', 110)
engine = sqlalchemy.create_engine('postgresql+psycopg2://demo:password@localhost/demo')
read = functools.partial(pandas.read_sql,con=engine)
Consider a database consisting of two tables: book
and author
.
Each author may have written multiple books. I want a list of authors, and for each author I want a nested list of the books they have written.
engine.execute('''
DROP TABLE IF EXISTS book;
DROP TABLE IF EXISTS author;
CREATE TABLE IF NOT EXISTS author(
id int primary key,
name text
);
CREATE TABLE IF NOT EXISTS book(
id int primary key ,
author_id int references author,
name text
);
''');
metadata = sqlalchemy.MetaData(bind=engine)
metadata.reflect()
tables=metadata.tables
Author = tables['author']
Book = tables['book']
engine.execute(Author.insert().values(id=1,name='Douglas Adams'))
engine.execute(Author.insert().values(id=2,name='JK Rowling'))
engine.execute(Author.insert().values(id=3,name='JRR Tolkien'))
engine.execute(Book.insert().values(id=4, author_id=1,name='The Hitchhikers Guide to the Galaxy'))
engine.execute(Book.insert().values(id=5, author_id=1,name='The Restaurant at the End of the Universe'))
engine.execute(Book.insert().values(id=6, author_id=1,name='Life, The Universe, and Everything'))
engine.execute(Book.insert().values(id=7, author_id=2,name='Harry Potter and the Giant Plot Hole'))
engine.execute(Book.insert().values(id=8, author_id=3,name='The Silmarillion'))
engine.execute(Book.insert().values(id=9, author_id=3,name='The Lord of the Rings'));
read(Author.select())
id | name | |
---|---|---|
0 | 1 | Douglas Adams |
1 | 2 | JK Rowling |
2 | 3 | JRR Tolkien |
read(Book.select())
id | author_id | name | |
---|---|---|---|
0 | 4 | 1 | The Hitchhikers Guide to the Galaxy |
1 | 5 | 1 | The Restaurant at the End of the Universe |
2 | 6 | 1 | Life, The Universe, and Everything |
3 | 7 | 2 | Harry Potter and the Giant Plot Hole |
4 | 8 | 3 | The Silmarillion |
5 | 9 | 3 | The Lord of the Rings |
read(
select([
Author.join(Book)
])
)
id | name | id | author_id | name | |
---|---|---|---|---|---|
0 | 1 | Douglas Adams | 4 | 1 | The Hitchhikers Guide to the Galaxy |
1 | 1 | Douglas Adams | 5 | 1 | The Restaurant at the End of the Universe |
2 | 1 | Douglas Adams | 6 | 1 | Life, The Universe, and Everything |
3 | 2 | JK Rowling | 7 | 2 | Harry Potter and the Giant Plot Hole |
4 | 3 | JRR Tolkien | 8 | 3 | The Silmarillion |
5 | 3 | JRR Tolkien | 9 | 3 | The Lord of the Rings |
It would be far more helpful to have a query that returns three rows - one for each author, with each author's books contained in a sub list.
We do that by using two powerful Postgres techniques:
Specifically we use the function json_agg
to roll up a set of books into a JSON list
frame = read('''
WITH author_books AS (SELECT author_id, json_agg(book) FROM book GROUP BY author_id)
SELECT * FROM author
JOIN author_books ON author_books.author_id=author.id
''')
frame
id | name | author_id | json_agg | |
---|---|---|---|---|
0 | 1 | Douglas Adams | 1 | [{'id': 4, 'name': 'The Hitchhikers Guide to the Galaxy', 'author_id': 1}, {'id': 5, 'name': 'The Restarau... |
1 | 3 | JRR Tolkien | 3 | [{'id': 8, 'name': 'The Silmarillion', 'author_id': 3}, {'id': 9, 'name': 'The Lord of the Rings', 'author... |
2 | 2 | JK Rowling | 2 | [{'id': 7, 'name': 'Harry Potter and the Giant Plot Hole', 'author_id': 2}] |
And because we're using Pandas dataframes, we even have a convenience function to turn this directly into a nested Python dictionary:
frame.to_dict('records')
[{'author_id': 1,
'id': 1,
'json_agg': [{'author_id': 1,
'id': 4,
'name': 'The Hitchhikers Guide to the Galaxy'},
{'author_id': 1,
'id': 5,
'name': 'The Restaurant at the End of the Universe'},
{'author_id': 1, 'id': 6, 'name': 'Life, The Universe, and Everything'}],
'name': 'Douglas Adams'},
{'author_id': 3,
'id': 3,
'json_agg': [{'author_id': 3, 'id': 8, 'name': 'The Silmarillion'},
{'author_id': 3, 'id': 9, 'name': 'The Lord of the Rings'}],
'name': 'JRR Tolkien'},
{'author_id': 2,
'id': 2,
'json_agg': [{'author_id': 2,
'id': 7,
'name': 'Harry Potter and the Giant Plot Hole'}],
'name': 'JK Rowling'}]
So now we have our data in a form which would be very easy to use in the templating language of your choice when building a web application.
I'm glad you asked. First I define a little helper function to represent the underlying json_agg
function.
def json_agg(table):
return func.json_agg(literal_column('"'+ table.name+'"'))
Then I create my CTE:
AuthorBooks = (
select([
Book.c.author_id,
json_agg(Book).label('books')
])
.select_from(Book)
.group_by(Book.c.author_id)
).cte('author_books')
And finally I use my CTE exactly as if it were a real table:
query=(
select([
Author,
AuthorBooks
])
.select_from(
Author.join(AuthorBooks)
)
)
frame=read(query)
frame
id | name | author_id | books | |
---|---|---|---|---|
0 | 1 | Douglas Adams | 1 | [{'id': 4, 'name': 'The Hitchhikers Guide to the Galaxy', 'author_id': 1}, {'id': 5, 'name': 'The Restarau... |
1 | 3 | JRR Tolkien | 3 | [{'id': 8, 'name': 'The Silmarillion', 'author_id': 3}, {'id': 9, 'name': 'The Lord of the Rings', 'author... |
2 | 2 | JK Rowling | 2 | [{'id': 7, 'name': 'Harry Potter and the Giant Plot Hole', 'author_id': 2}] |
And as before we can turn this into a nested Python data structure
authors = frame.to_dict('records')
authors[0]['name']
'Douglas Adams'
authors[0]['books']
[{'author_id': 1, 'id': 4, 'name': 'The Hitchhikers Guide to the Galaxy'},
{'author_id': 1,
'id': 5,
'name': 'The Restaurant at the End of the Universe'},
{'author_id': 1, 'id': 6, 'name': 'Life, The Universe, and Everything'}]
Let's consider a slightly more complex example, that of 'users' and 'groups'.
(As an extra bit of fun, both user
and group
are reserved keywords in Postgres, so we have to be careful with our quoting to make this work)
engine.execute('''
DROP TABLE IF EXISTS "user_to_group";
DROP TABLE IF EXISTS "group";
DROP TABLE IF EXISTS "user";
CREATE TABLE IF NOT EXISTS "group"(
id int primary key,
name text
);
CREATE TABLE IF NOT EXISTS "user"(
id int primary key,
name text
);
CREATE TABLE IF NOT EXISTS "user_to_group"(
group_id int references "group",
user_id int references "user"
);
''');
metadata = sqlalchemy.MetaData(bind=engine)
metadata.reflect()
tables=metadata.tables
User = tables['user']
Group = tables['group']
UserToGroup = tables['user_to_group']
We'll create three groups and five users, some of whom may be in more than one group
inserts=[
Group.insert().values(id=1,name='Hobbits'),
Group.insert().values(id=2,name='Wizards'),
Group.insert().values(id=3,name='The Fellowship'),
User.insert().values(id=1,name='Frodo'),
User.insert().values(id=2,name='Gandalf'),
User.insert().values(id=3,name='Legolas'),
User.insert().values(id=4,name='Pippin'),
User.insert().values(id=5,name='Saruman'),
UserToGroup.insert().values(user_id=1,group_id=1),
UserToGroup.insert().values(user_id=4,group_id=1),
UserToGroup.insert().values(user_id=2,group_id=2),
UserToGroup.insert().values(user_id=5,group_id=2),
UserToGroup.insert().values(user_id=1,group_id=3),
UserToGroup.insert().values(user_id=2,group_id=3),
UserToGroup.insert().values(user_id=3,group_id=3),
UserToGroup.insert().values(user_id=4,group_id=3),
]
for i in inserts:
engine.execute(i)
read(Group.join(UserToGroup).join(User).select())
id | name | group_id | user_id | id | name | |
---|---|---|---|---|---|---|
0 | 1 | Hobbits | 1 | 1 | 1 | Frodo |
1 | 1 | Hobbits | 1 | 4 | 4 | Pippin |
2 | 2 | Wizards | 2 | 2 | 2 | Gandalf |
3 | 2 | Wizards | 2 | 5 | 5 | Saruman |
4 | 3 | The Fellowship | 3 | 1 | 1 | Frodo |
5 | 3 | The Fellowship | 3 | 2 | 2 | Gandalf |
6 | 3 | The Fellowship | 3 | 3 | 3 | Legolas |
7 | 3 | The Fellowship | 3 | 4 | 4 | Pippin |
In SQL, it looks like this:
read('''
WITH user_groups AS (
SELECT user_id, json_agg("group") AS groups
FROM user_to_group
JOIN "group"
ON user_to_group.group_id="group".id
GROUP BY user_id
)
SELECT id,name,groups FROM "user"
JOIN user_groups on "user".id = user_groups.user_id
''')
id | name | groups | |
---|---|---|---|
0 | 4 | Pippin | [{'id': 1, 'name': 'Hobbits'}, {'id': 3, 'name': 'The Fellowship'}] |
1 | 1 | Frodo | [{'id': 1, 'name': 'Hobbits'}, {'id': 3, 'name': 'The Fellowship'}] |
2 | 5 | Saruman | [{'id': 2, 'name': 'Wizards'}] |
3 | 3 | Legolas | [{'id': 3, 'name': 'The Fellowship'}] |
4 | 2 | Gandalf | [{'id': 2, 'name': 'Wizards'}, {'id': 3, 'name': 'The Fellowship'}] |
But we want to use SQLAlchemy, so again we create CTE objects to help us
UserGroups=(
select([
UserToGroup.c.user_id,
json_agg(Group).label('groups')
])
.select_from(Group.join(UserToGroup))
.group_by(UserToGroup.c.user_id)
).cte('user_groups')
query=(
select([
User,
UserGroups.c.groups
])
.select_from(
User.join(UserGroups,User.c.id==UserGroups.c.user_id)
)
)
read(query)
id | name | groups | |
---|---|---|---|
0 | 4 | Pippin | [{'id': 1, 'name': 'Hobbits'}, {'id': 3, 'name': 'The Fellowship'}] |
1 | 1 | Frodo | [{'id': 1, 'name': 'Hobbits'}, {'id': 3, 'name': 'The Fellowship'}] |
2 | 5 | Saruman | [{'id': 2, 'name': 'Wizards'}] |
3 | 3 | Legolas | [{'id': 3, 'name': 'The Fellowship'}] |
4 | 2 | Gandalf | [{'id': 2, 'name': 'Wizards'}, {'id': 3, 'name': 'The Fellowship'}] |
GroupUsers=(
select([
UserToGroup.c.group_id,
json_agg(User).label('users')
])
.select_from(User.join(UserToGroup))
.group_by(UserToGroup.c.group_id)
).cte('group_users')
query=(
select([
Group,
GroupUsers.c.users
])
.select_from(
Group.join(GroupUsers,Group.c.id==GroupUsers.c.group_id)
)
)
read(query)
id | name | users | |
---|---|---|---|
0 | 1 | Hobbits | [{'id': 1, 'name': 'Frodo'}, {'id': 4, 'name': 'Pippin'}] |
1 | 3 | The Fellowship | [{'id': 1, 'name': 'Frodo'}, {'id': 2, 'name': 'Gandalf'}, {'id': 3, 'name': 'Legolas'}, {'id': 4, 'name':... |
2 | 2 | Wizards | [{'id': 2, 'name': 'Gandalf'}, {'id': 5, 'name': 'Saruman'}] |
read(query).to_dict('records')
[{'id': 1,
'name': 'Hobbits',
'users': [{'id': 1, 'name': 'Frodo'}, {'id': 4, 'name': 'Pippin'}]},
{'id': 3,
'name': 'The Fellowship',
'users': [{'id': 1, 'name': 'Frodo'},
{'id': 2, 'name': 'Gandalf'},
{'id': 3, 'name': 'Legolas'},
{'id': 4, 'name': 'Pippin'}]},
{'id': 2,
'name': 'Wizards',
'users': [{'id': 2, 'name': 'Gandalf'}, {'id': 5, 'name': 'Saruman'}]}]
I demonstrate how to use WITH
statements (Common Table Expressions), the json_agg
function and SQLAlchemy to quickly convert complex SQL joins into nested Python data structures.
Using techniques like the ones presented here, Postgres can act as a powerful relational data store that can still provide applications with data in helpful forms, such as nested dictionaries and lists.