Aggregates (last/first) not behaving

From: Wes Devauld <wes(at)devauld(dot)ca>
To: pgsql-sql(at)postgresql(dot)org
Subject: Aggregates (last/first) not behaving
Date: 2010-07-22 06:37:04
Message-ID: AANLkTim1g8cejgRokrpTr6a_z6bXeKxOhUSFeTIwMrcK@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-sql

I have PostgreSQL 8.3.9 [PostgreSQL 8.3.9 on i386-apple-darwin10.3.0,
compiled by GCC i686-apple-darwin10-gcc-4.2.1 (GCC) 4.2.1 (Apple Inc. build
5646) (dot 1)]

and the custom first and last aggregates from:

http://wiki.postgresql.org/wiki/First_(aggregate)
http://wiki.postgresql.org/wiki/Last_(aggregate)

I have a simple table, of two columns. The first is a timestamp and is the
primary key, the second is an integer. I've loaded the table up with
values, one for every minute, for a whole year. Some SQL to recreate the
table and the aggregates can be retrieved from:

http://blog.devauld.ca/aggregate_test.zip (File is approximately 180KB)

Now when I try to make use of the first and last aggregates, I get:

# select first(t), last(t) from test group by extract(day from t);
first | last
---------------------+---------------------
2009-01-01 00:00:00 | 2009-01-01 17:02:00
2009-01-02 10:07:00 | 2009-01-02 10:06:00
2009-01-03 20:15:00 | 2009-01-03 20:14:00
2009-01-04 00:00:00 | 2009-01-04 23:59:00
2009-01-05 00:00:00 | 2009-01-05 23:59:00
2009-01-06 16:31:00 | 2009-01-06 16:30:00
2009-01-07 00:00:00 | 2009-01-07 23:49:00
2009-01-08 11:09:00 | 2009-01-08 11:42:00
2009-01-09 11:08:00 | 2009-01-09 00:51:00
2009-01-10 11:33:00 | 2009-01-10 23:37:00
2009-01-11 13:05:00 | 2009-01-11 23:59:00
2009-01-12 23:55:00 | 2009-01-12 23:47:00
2009-01-13 01:50:00 | 2009-01-13 23:36:00
2009-01-14 23:55:00 | 2009-01-14 23:41:00
2009-01-15 00:47:00 | 2009-01-15 23:40:00
2009-01-16 00:29:00 | 2009-01-16 23:38:00
2009-01-17 00:09:00 | 2009-01-17 23:37:00
2009-01-18 23:48:00 | 2009-01-18 23:37:00
2009-01-19 23:56:00 | 2009-01-19 23:39:00
2009-01-20 07:14:00 | 2009-01-20 23:36:00
2009-01-21 23:40:00 | 2009-01-21 23:41:00
2009-01-22 02:57:00 | 2009-01-22 23:40:00
2009-01-23 23:56:00 | 2009-01-23 23:38:00
2009-01-24 09:34:00 | 2009-01-24 23:37:00
2009-01-25 23:50:00 | 2009-01-25 23:37:00
2009-01-26 23:48:00 | 2009-01-26 23:39:00
2009-01-27 06:36:00 | 2009-01-27 23:37:00
2009-01-28 23:59:00 | 2009-01-28 23:41:00
2009-01-29 16:12:00 | 2009-01-29 23:40:00
2009-01-30 21:11:00 | 2009-01-30 23:39:00
2009-01-31 20:12:00 | 2009-01-31 16:20:00
(31 rows)

For some reason the aggregates are not falling into the proper group. I
can't blame timezones as the results are all over the map, and first/last
relationship is broken as in some cases 'last' is chronologically before
'first'

If I explicitly retrieve the values for midnight each day:

# select t, v from test where extract(hour from t) = 0 and extract(minute
from t) = 0;
t | v
---------------------+-------
2009-01-01 00:00:00 | 0
2009-01-02 00:00:00 | 1440
2009-01-03 00:00:00 | 2880
2009-01-04 00:00:00 | 4320
2009-01-05 00:00:00 | 5760
2009-01-06 00:00:00 | 7200
2009-01-07 00:00:00 | 8640
2009-01-08 00:00:00 | 10080
2009-01-09 00:00:00 | 11520
2009-01-10 00:00:00 | 12960
2009-01-11 00:00:00 | 14400
2009-01-12 00:00:00 | 15840
2009-01-13 00:00:00 | 17280
2009-01-14 00:00:00 | 18720
2009-01-15 00:00:00 | 20160
2009-01-16 00:00:00 | 21600
2009-01-17 00:00:00 | 23040
2009-01-18 00:00:00 | 24480
2009-01-19 00:00:00 | 25920
2009-01-20 00:00:00 | 27360
2009-01-21 00:00:00 | 28800
2009-01-22 00:00:00 | 30240
2009-01-23 00:00:00 | 31680
2009-01-24 00:00:00 | 33120
2009-01-25 00:00:00 | 34560
2009-01-26 00:00:00 | 36000
2009-01-27 00:00:00 | 37440
2009-01-28 00:00:00 | 38880
2009-01-29 00:00:00 | 40320
2009-01-30 00:00:00 | 41760
2009-01-31 00:00:00 | 43200
(31 rows)

I get back the values for which I am seeking. The pain is in finding the
last record in the day before. I would have thought that grouping by
date_trunc on month would have yeilded similar results to above:

# select first(t), first(v) from test group by date_trunc('day', t);
first | first
---------------------+-------
2009-01-01 00:00:00 | 0
2009-01-02 10:07:00 | 2047
2009-01-03 20:15:00 | 4095
2009-01-04 00:00:00 | 4320
2009-01-05 00:00:00 | 5760
2009-01-06 17:33:00 | 8253
2009-01-07 16:56:00 | 9656
2009-01-08 17:28:00 | 11128
2009-01-09 21:14:00 | 12794
2009-01-10 05:47:00 | 13307
2009-01-11 16:42:00 | 15402
2009-01-12 16:30:00 | 16830
2009-01-13 20:14:00 | 18494
2009-01-14 23:59:00 | 20159
2009-01-15 22:17:00 | 21497
2009-01-16 23:57:00 | 23037
2009-01-17 18:32:00 | 24152
2009-01-18 20:15:00 | 25695
2009-01-19 07:58:00 | 26398
2009-01-20 22:16:00 | 28696
2009-01-21 17:31:00 | 29851
2009-01-22 16:37:00 | 31237
2009-01-23 23:59:00 | 33119
2009-01-24 21:13:00 | 34393
2009-01-25 22:17:00 | 35897
2009-01-26 16:42:00 | 37002
2009-01-27 16:30:00 | 38430
2009-01-28 16:52:00 | 39892
2009-01-29 23:59:00 | 41759
2009-01-30 10:19:00 | 42379
2009-01-31 14:58:00 | 44098
(31 rows)

Looking at the plan:

# explain select first(t), first(v) from test group by date_trunc('day', t);
QUERY PLAN
-----------------------------------------------------------------------
GroupAggregate (cost=5010.56..6238.16 rows=44640 width=12)
-> Sort (cost=5010.56..5122.16 rows=44640 width=12)
Sort Key: (date_trunc('day'::text, t))
-> Seq Scan on test (cost=0.00..800.00 rows=44640 width=12)
(4 rows)

Now, if I have a much smaller data set:

# delete from test where t > '2009-01-03'; vacuum analyze;
# select first(t), first(v) from test group by date_trunc('day', t);
first | first
---------------------+-------
2009-01-01 00:00:00 | 0
2009-01-03 00:00:00 | 2880
2009-01-02 00:00:00 | 1440
(3 rows)

# explain select first(t), first(v) from test group by date_trunc('day', t);
QUERY PLAN
---------------------------------------------------------------
HashAggregate (cost=73.62..124.04 rows=2881 width=12)
-> Seq Scan on test (cost=0.00..52.01 rows=2881 width=12)
(2 rows)

So, is there a way I can set up my query to deterministically return the
same results each time? Is there another way to get 'first of the month' or
'last for the day' aggregates that work reliably on varying data set sizes?
Can I force the optimizer to not use a GroupAggregate?

Any help would be appreciated,
-W

Responses

Browse pgsql-sql by date

  From Date Subject
Next Message Rainer Stengele 2010-07-22 08:09:16 grouping subsets
Previous Message Kenneth Marshall 2010-07-21 12:38:08 Re: UUID for Postgresql 8.4