Re: Obtaining random rows from a result set

From: Alban Hertroys <alban(at)magproductions(dot)nl>
To: Martijn van Oosterhout <kleptog(at)svana(dot)org>
Cc: Postgres General <pgsql-general(at)postgresql(dot)org>
Subject: Re: Obtaining random rows from a result set
Date: 2007-09-01 10:44:08
Message-ID: 5F2804AD-A18A-4791-8BAE-4203615E5ED4@magproductions.nl
Lists: pgsql-general

On Aug 31, 2007, at 15:54, Martijn van Oosterhout wrote:

> On Fri, Aug 31, 2007 at 02:42:18PM +0200, Alban Hertroys wrote:
>> Examples:
>> * random(maxrows) would return random rows from the resultset.
>> * median() would return the rows in the middle of the result set
>>   (this would require ordering to be meaningful).
>
> It would be possible to write an aggregate that returns a single
> random value from a set. The algorithm is something like:
>
> n = 1
> v = null
> for each row
>     if random() < 1/n:
>         v = value of row
>     n = n + 1
>
> return v

Doesn't this always return the first record, since random() is always
less than 1/1?
I don't think this method has a linear distribution, but then again I
don't understand what 'value of row' refers to...
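If it refers to the column being aggregated, then my reading of that
algorithm as a custom aggregate would be roughly the following (a sketch
only; the type, function and aggregate names are all made up):

----
-- state = (current pick, number of rows seen so far)
create type random_pick_state as (val text, n bigint);

create function random_pick_step(_state random_pick_state, _item text)
returns random_pick_state
volatile
as $$
BEGIN
    _state.n := _state.n + 1;
    -- keep the new value with probability 1/n
    -- (random() < 1/n written as random() * n < 1)
    if random() * _state.n < 1 then
        _state.val := _item;
    end if;
    return _state;
END;
$$
language plpgsql;

create function random_pick_final(_state random_pick_state)
returns text
as $$
    select ($1).val;
$$
language sql;

create aggregate random_pick(text) (
    sfunc    = random_pick_step,
    stype    = random_pick_state,
    finalfunc = random_pick_final,
    initcond = '(,0)'
);
----

Something like SELECT random_pick(name) FROM foo (table and column made
up) should then return one random value from the set, if I understand
the 1/n replacement correctly.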

> It does require a seqscan though.

I doubt that a seqscan can be entirely avoided when fetching random
rows from a set, at least not until the last random result has been
returned, _unless_ the number of matching records is known before the
random sampling starts.

> If you're asking for 5 random rows
> you probably mean 5 random but distinct rows, which is different to
> just running the above set 5 times in parallel.

Indeed, that is one of the distinctions that need some thought for my
original proposition. I left it out, as it's an implementation detail
(an important one, admittedly).

> I don't know if there's a similar method for median...

I'm not entirely sure, but I think your method is the only one
suggested that doesn't involve calculating random() a million times
(for a million records) to return 5 (random) records.

My suggestion involved a way to calculate random() only when
retrieving records from the result set (only 5 times for a million
records in this case).
For a linearly distributed random set it does require knowing the
number of records in the set, though; an estimate would make it
non-linear (although only slightly, if the estimate is accurate enough).
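In plain SQL, for a single row and with the record count known up
front, that boils down to something like this (table name and count are
made up; repeat it a few times for several rows):

----
-- skip to a random position among the (known) 1000000 rows
SELECT * FROM foo OFFSET floor(random() * 1000000)::int LIMIT 1;
----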

OTOH, I'm starting to think that the last sort step of an order by
can be postponed to the result set fetching cycle under the
conditions that:
- the ordering expression is unrelated to the records involved, and
- only a fraction of the total number of records will be returned.
(Which is somewhat similar to the condition for an index being more
efficient than a seqscan, btw)

Comparing records with each other on something unrelated to them seems
a waste of effort when the result set has already been determined
(just not ordered in any particular way), am I right?

With that change (postponing sorting) my original ORDER BY random()
LIMIT 5 would perform quite adequately, I think: it'd only involve
calculating random() about 5 times, not once for every record in the
result set.
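That is, the query in question being something along these lines
(table name made up):

----
SELECT * FROM foo ORDER BY random() LIMIT 5;
----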

Or is order by random() acting as some kind of shuffling method? Is
that a requirement to get a linearly distributed set to randomly draw
from?
I can see how it wouldn't be linear if you started randomly comparing
records from the beginning of the result set... (which would be the
logical method if you don't know the size of the set beforehand)

I thought of another solution (with only a few calculations of
random()) that can be deployed in existing versions of PG, using a
set-returning function with a scrollable cursor that accepts the query
string as input, like this (in pseudoish-code):

----
create function random(_query text, _limit integer)
returns setof record
volatile
as $$
DECLARE
    _cur     refcursor;
    _cnt     bigint;
    _idx     integer;
    _rowpos  bigint;

    _rec     record;
BEGIN
    -- scrollable, so fetch absolute can jump backwards as well
    open _cur scroll for execute _query;

    -- select total nr of records into _cnt
    execute 'SELECT count(*) FROM (' || _query || ') AS _q' into _cnt;

    for _idx in 1.._limit loop
        -- a random row position in 1.._cnt
        _rowpos := 1 + floor(random() * _cnt);

        fetch absolute _rowpos from _cur into _rec;
        return next _rec;
    end loop;

    close _cur;
    return;
END;
$$
language plpgsql;
----
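As it returns SETOF record, a call has to supply a column definition
list, e.g. (table and columns made up):

----
SELECT *
  FROM random('SELECT id, name FROM foo', 5)
       AS r(id integer, name text);
----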

This method could return the same record twice though; I'll need to
build in some accounting for already used rowpos'es.
Would it be more efficient than the usual methods?

Sorry for the brain dump; I tried to get everything into this single
message. I hope it is at least comprehensible and useful, or failing
that, interesting or mildly amusing.

Regards,

Alban Hertroys
magproductions b.v.
