Finding latest record for a number of groups in an INSERT-only table

From: Daniel Farina <daniel(at)heroku(dot)com>
To: pgsql-general(at)postgresql(dot)org
Subject: Finding latest record for a number of groups in an INSERT-only table
Date: 2011-07-04 23:48:06
Message-ID: CAAZKuFa2dw9TT5yuPyrYQDDxpA8YFLAOVadsafmAxhNDhqQgJQ@mail.gmail.com
Lists: pgsql-general

This is basically exactly the same as
http://archives.postgresql.org/pgsql-sql/2008-10/msg00009.php; I'm
just asking again, to see if thinking on the problem has changed:

The basic problem, restated, is that one has a relation with tuples like this:

(key, recency_stamp, other_columns*)

Some example data may be:

'ice cream', 0, 'vanilla'
'ice cream', 1, 'chocolate'
'beer', 0, 'ale'
'beer', 3, 'lager'

The goal is to extract from this relation the following:

'ice cream', 1, 'chocolate'
'beer', 3, 'lager'

In my dataset, the number of distinct keys is << the number of rows;
in fact, that ratio keeps shrinking until the data is truncated,
converging at a negligible fraction of a percent -- definitely below
0.1% in short order, and on to 0.01% and 0.001% before long. There is
a standard btree index on (key, recency_stamp), which ideally could be
exploited to compute the 'most recent' relation very quickly.
Approaches I've investigated include DISTINCT ON (probably the
cleanest), window functions (rank plus a predicate), and GROUP BY with
a self-join -- sketches of the first two are below -- but none of them
avoids a full relation scan, which is crippling performance-wise.
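
For concreteness, the two cleaner variants I tried look roughly like
this (the table name "events" and the "payload" column are
illustrative stand-ins for my actual schema):

    -- Hypothetical table: events(key text, recency_stamp bigint, payload text)
    -- with a btree index on (key, recency_stamp).

    -- DISTINCT ON form: easiest to read, but planned as a full scan + sort.
    SELECT DISTINCT ON (key) key, recency_stamp, payload
    FROM events
    ORDER BY key, recency_stamp DESC;

    -- Window-function form: rank rows within each key, keep the top one.
    -- Also planned as a full scan in my hands.
    SELECT key, recency_stamp, payload
    FROM (
        SELECT key, recency_stamp, payload,
               rank() OVER (PARTITION BY key ORDER BY recency_stamp DESC) AS r
        FROM events
    ) ranked
    WHERE r = 1;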

The only way I could find to get the optimizer to do something
non-crippling was the loose index scan trick
(http://wiki.postgresql.org/wiki/Loose_indexscan) to enumerate the
distinct keys quickly, followed by a subquery in the target list that
runs once per distinct value (the optimizer estimates a very high cost
for this even though it is actually rather fast, and that ridiculous
costing can distort plans for subsequent joins against the relation).
This mechanism is obscure enough that I may just write a plpgsql
function as a workaround, as that may well be more lucid.
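
Roughly, the shape of that workaround is the following (again with the
illustrative "events"/"payload" names; the recursive CTE is the recipe
from the wiki page):

    -- Loose index scan: walk the (key, recency_stamp) index one distinct
    -- key at a time instead of scanning the whole relation.
    WITH RECURSIVE distinct_keys AS (
        SELECT min(key) AS key FROM events
        UNION ALL
        SELECT (SELECT min(key) FROM events WHERE key > dk.key)
        FROM distinct_keys dk
        WHERE dk.key IS NOT NULL
    ), latest AS (
        -- Correlated subquery per key: a cheap index probe in practice,
        -- but costed very pessimistically by the planner.
        SELECT dk.key,
               (SELECT max(recency_stamp)
                FROM events e
                WHERE e.key = dk.key) AS recency_stamp
        FROM distinct_keys dk
        WHERE dk.key IS NOT NULL
    )
    -- Join back to pick up the rest of the row; assumes (key, recency_stamp)
    -- identifies a single row, otherwise ties come back as duplicates.
    SELECT e.key, e.recency_stamp, e.payload
    FROM latest l
    JOIN events e
      ON e.key = l.key
     AND e.recency_stamp = l.recency_stamp;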

I am using PostgreSQL 9.

Does anyone have fresh thoughts or suggestions for dealing with
INSERT-mostly tables conceived in this manner?

--
fdr
