trvrm.github.io
Thu 01 January 2015
At my company, I maintain a large distributed. data collection platform
. Pretty much every record we collect needs to be stamped with a
created
field. But because the incoming data comes from sources on
various devices in multiple countries and timezones, making sure that
the timestamps are precise and meaningful can be a challenge.
Postgres can do this very elegantly,
but can also trip you up in subtle ways.
Postgres has two subtly different timestamp data types:
TIMESTAMP
and
TIMESTAMP WITH TIMEZONE
.
The former stores year/month/day/hour/minutes/second/milliseconds, as you’d expect, and the later ALSO stores a timezone offset, expressed in hours.
We can switch between the two using the
AT TIMEZONE
syntax, but, and
here is the tricky bit the function goes BOTH WAYS, and you can easily
get confused if you don’t know what type you’re starting with.
Furthermore, Postgres will sometimes sneakily convert one to the other without you asking.
%load_ext sql
%config SqlMagic.feedback=False
%%sql
postgresql://testuser:password@localhost/test
u'Connected: testuser@test'
%sql SELECT NOW();
now |
---|
2014-08-18 22:33:58.998549-04:00 |
now()
returns a
TIMESTAMP WITH TIME ZONE
. It shows the current
local time, and the offset between that time and UTC
(http://time.is/UTC)
But if we put the output from
now()
into a field that has type
TIMESTAMP
we will get a silent conversion:
%sql SELECT NOW():: timestamp
now |
---|
2014-08-18 22:33:58.998549 |
Which is not the current UTC time. We have stripped the timezone offset right of it. However, if we explicitly do the conversion, we get:
%sql SELECT NOW() AT TIME ZONE 'UTC';
timezone |
---|
2014-08-19 02:33:58.998549 |
Which is the current UTC time: (http://time.is/UTC)
It's worth reviewing the Postgresql documentation on this construct at this point.
Expression | Return Type | Description |
---|---|---|
timestamp without time zone AT TIME ZONE zone | timestamp with time zone | Treat given time stamp without time zone as located in the specified time zone |
timestamp with time zone AT TIME ZONE zone | timestamp without time zone | Convert given time stamp with time zone to the new time zone, with no time zone designation |
name | created |
---|---|
zaphod beeblebrox | 2014-08-18 22:34:03.620583 |
ford prefect | 2014-08-19 02:34:03.621957 |
Note that the second record contains the current UTC time, but the first contains the current time local to the database server. This seems a good idea, and tends to work fine in local testing. But when you try to maintain a system where the database may be in one province, the data collected in another, and then reviewed in a third, you start to understand why this is too simplistic.
The fact that it's 10:12 now in Toronto isn't very helpful for a record that's getting created for a user in Halifax and is monitored from Vancouver.
So it's probably best to save timestamps WITH their timezone so as to avoid any ambiguity. This is the recommendation given here.
In our above example, the simplest approach is to change the table definition:
%%sql
DROP TABLE IF EXISTS test;
CREATE TABLE test(name TEXT, created TIMESTAMP WITH TIME ZONE DEFAULT (NOW() ));
%%sql
INSERT INTO test (name) VALUES ('zaphod beeblebrox');
INSERT INTO test(name,created) VALUES('ford prefect',now() );
SELECT * FROM test;
name | created |
---|---|
zaphod beeblebrox | 2014-08-18 22:35:15.988764-04:00 |
ford prefect | 2014-08-18 22:35:15.989726-04:00 |
So now the dates are globally meaningful. But I still have to be careful, because if I use the wrong date format to populate this table, it'll still get messed up.
%sql INSERT INTO test(name,created) VALUES ('arthur dent',now() at time zone 'utc')
%sql SELECT * FROM test;
name | created |
---|---|
zaphod beeblebrox | 2014-08-18 22:35:15.988764-04:00 |
ford prefect | 2014-08-18 22:35:15.989726-04:00 |
arthur dent | 2014-08-19 02:35:15.990308-04:00 |
Note how arthur dent has completely the wrong created time.
Now, if I want to report on this data, I'm going to now have to specify which timezone I want the dates formatted too:
%sql delete from test WHERE name='arthur dent';
%sql select name, created FROM test;
name | created |
---|---|
zaphod beeblebrox | 2014-08-18 22:35:15.988764-04:00 |
ford prefect | 2014-08-18 22:35:15.989726-04:00 |
gives me timestamps formatted in the timezone of the database server, which isn't necessarily particularly helpful, which may be helpful, but will be less so if the actual users of the data are in a different time zone.
%sql SELECT name, created at time zone 'utc' FROM test;
name | timezone |
---|---|
zaphod beeblebrox | 2014-08-19 02:35:15.988764 |
ford prefect | 2014-08-19 02:35:15.989726 |
gives me the time formatted in the UTC timezone, and
%sql select CREATED at time zone 'CST' FROM test;
timezone |
---|
2014-08-18 20:35:15.988764 |
2014-08-18 20:35:15.989726 |
gives me the time formatted for central standard time.
Now so far we've been letting the database create the timestamps, but sometimes we want to save data provided to us from an external source. In this case it's very important the we know what timezone the incoming data comes from. So our middleware should require that all dates include a timestamp. Fortunately, if we're writing javascript applications, we get this automatically:
%%html
<div id="js-output"></div>
%%javascript
var d = JSON.stringify(new Date())
"2014-08-19T02:41:12.872Z"
import psycopg2,pandas
def execute(sql,params={}):
with psycopg2.connect(database='test') as connection:
with connection.cursor() as cursor:
cursor.execute(sql,params)
So let's imagine that we got this string submitted to us by a client, and we're going to store it in the database via some Python code.
sql="INSERT INTO test (name, created) VALUES ( 'externally created date', %(date)s)"
params=dict(date="2014-08-19T02:35:24.321Z")
execute(sql,params)
%sql SELECT * FROM test
name | created |
---|---|
zaphod beeblebrox | 2014-08-18 22:35:15.988764-04:00 |
ford prefect | 2014-08-18 22:35:15.989726-04:00 |
externally created date | 2014-08-18 22:35:24.321000-04:00 |
And now we're getting to the point where all our timestamp data is both stored and displayed unambiguously.