Quick Links

Re: Finding latest record for a number of groups in an INSERT-only table

From:	Simon Riggs <simon(at)2ndQuadrant(dot)com>
To:	Daniel Farina <daniel(at)heroku(dot)com>
Cc:	pgsql-general <pgsql-general(at)postgresql(dot)org>
Subject:	Re: Finding latest record for a number of groups in an INSERT-only table
Date:	2011-07-05 07:32:30
Message-ID:	CA+U5nMJ9bm54F-Egr9CSnUL-aR017TQEwFf3uCsrcvgWUDFZLQ@mail.gmail.com
Views:	Whole Thread \| Raw Message \| Download mbox \| Resend email
Thread:
Lists:	pgsql-general

On Tue, Jul 5, 2011 at 12:48 AM, Daniel Farina <daniel(at)heroku(dot)com> wrote:
> This is basically exactly the same as
> http://archives.postgresql.org/pgsql-sql/2008-10/msg00009.php; I'm
> just asking again, to see if thinking on the problem has changed:
>
> The basic problem, restated, is one has a relation with tuples like this:
>
> (key, recency_stamp, other_columns*)
>
> Some example data may be:
>
> 'ice cream', 0, 'vanilla'
> 'ice cream', 1, 'chocolate'
> 'beer', 0, 'ale'
> 'beer', 3, 'lager'
>
> The goal is to extract from this relation the following:
>
> 'ice cream', '1', 'chocolate'
> 'beer', 3, 'lager'
>
> In my dataset, the distinct number of keys is << the number of rows;
> in fact, that ratio grows ever smaller until the data is truncated,
> with a convergence point probably resting at a negligible fraction of
> a percent -- definitely less than 0.1% in short order, easily getting
> to 0.01% and 0.001% in not a vast amount of time. There is a standard
> btree index on (key, recency_stamp), which ideally is exploitable to
> compute the 'most recent' relation very quickly. Approaches I've
> investigated include DISTINCT ON (this may be the cleanest), windowing
> functions (rank, and a predicate), and a GROUP BY with a self-join,
> but none of them avoid a full relation scan, which is just crippling
> performance-wise.
>
> The only way I could find to trick the optimizer into doing something
> non-crippling was to trick it with
> http://wiki.postgresql.org/wiki/Loose_indexscan to find distinct
> values quickly and subsequently using a subquery in the target list
> that the optimizer decides to run over every distinct value (it
> estimates a very high cost for this, even though it is actually rather
> fast, but the ridiculous costing can subsequent joins against the
> relation in unfortunate ways). This mechanism is obscure enough that
> I may just write a plpgsql function as a workaround, as that may well
> be more lucid.

I think its a pretty common requirement and we should be looking to
optimize it if it isn't handled well.

The only problem is that there is a few ways of writing that in SQL
and we would probably need to recognise how to transform between query
types to achieve a common form.

For example, the post you cite uses a correlated subquery whereas its
possible to write it using an IN subselect.

--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

In response to

Finding latest record for a number of groups in an INSERT-only table at 2011-07-04 23:48:06 from Daniel Farina

Responses

Re: Finding latest record for a number of groups in an INSERT-only table at 2011-07-05 07:43:08 from Daniel Farina

Browse pgsql-general by date

	From	Date	Subject
Next Message	Sim Zacks	2011-07-05 07:37:13	Re: Read MS-SQL data into Postgres via ODBC link?
Previous Message	Daniel Farina	2011-07-05 07:13:39	Re: Finding latest record for a number of groups in an INSERT-only table