Quick Links

Re: intermittant performance problem

From:	Alban Hertroys <dalroi(at)solfertje(dot)student(dot)utwente(dot)nl>
To:	Mike Charnoky <noky(at)nextbus(dot)com>
Cc:	Scott Marlowe <scott(dot)marlowe(at)gmail(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-general(at)postgresql(dot)org
Subject:	Re: intermittant performance problem
Date:	2009-03-26 10:30:01
Message-ID:	AF429182-2DE3-4C70-9352-22C116931388@solfertje.student.utwente.nl
Views:	Whole Thread \| Raw Message \| Download mbox \| Resend email
Thread:
Lists:	pgsql-general

On Mar 25, 2009, at 5:09 PM, Mike Charnoky wrote:

> Due to the nature of the sampling (need to limit using several
> parameters with a WHERE clause), I can't just generate random
> numbers to select data that I need. Looks like I am stuck using
> ORDER BY RANDOM(). The only other option at this point seems to be
> to implement TABLESAMPLE, probably starting with the academic work
> that Neil Conway published (http://neilconway.org/talks/hacking/)

I'm not sure I answered to this thread before, but ORDER BY RANDOM is
not the only way to get x random rows out of n rows.
Calculating random numbers isn't particularly cheap, so doing that n
times is going to cost a fair amount of CPU cycles and will require a
sequential scan over the table. If you search the archives you're
bound to stumble on a solution I suggested before that only needs to
call random() x times (instead of n times). It still does a sequential
scan (I think there's currently no way around that unless quasi-random
results are acceptable to you). My solution involves a scrollable
cursor that is used to navigate to x random rows in the (otherwise
unsorted) n rows in the result set.

I tried putting that functionality into a pl/pgsql function, but pl/
pgsql doesn't (yet) support the MOVE FORWARD ALL statement, which you
need to get the upper boundary of the random row number (equals the
number of rows in the result set).
An alternative solution is to wrap your query in SELECT COUNT(*) FROM
(...) AS resultset or something similar, but in that case the query
(which does a seqscan) has to be performed twice! Maybe other PL-
languages fair better there, but from the looks of it not even C-
functions can perform MOVE FORWARD ALL, so I guess they won't.

My last attempt used that approach, but it's obviously not optimal.
I'd much prefer to feed my function a query or a refcursor instead of
a string containing a query. Feeding it a string makes me itch.
Anyway, here's how far I got. It is in a usable state and I'm
interested how it performs on a real data set compared to ordering by
random() or other solutions.

!DSPAM:737,49cb5930129747428277249!

Attachment	Content-Type	Size
sample.sql	application/octet-stream	2.0 KB
unknown_filename	text/plain	495 bytes

In response to

Re: intermittant performance problem at 2009-03-25 16:09:39 from Mike Charnoky

Responses

Re: intermittant performance problem at 2009-03-28 10:55:30 from Alban Hertroys

Browse pgsql-general by date

	From	Date	Subject
Next Message	Tim Uckun	2009-03-26 10:43:44	How to compile a 32 bit version of postgres on a x64 machine.
Previous Message	Albe Laurenz	2009-03-26 09:53:59	Re: Weird encoding behavior