Quick Links

Re: Selecting K random rows - efficiently!

From:	Paul Tillotson <spam1011(at)adelphia(dot)net>
To:	cluster <skrald(at)amossen(dot)dk>
Cc:	pgsql-general(at)postgresql(dot)org
Subject:	Re: Selecting K random rows - efficiently!
Date:	2007-10-25 02:23:36
Message-ID:	471FFE28.2050405@adelphia.net
Views:	Whole Thread \| Raw Message \| Download mbox \| Resend email
Thread:
Lists:	pgsql-general

cluster wrote:
> It has been suggested [1] that a good way to select K random rows
> efficiently from a table is to
> 1) add a new column, random_number, to the table
> and initialize it with random()
> 2) perform the following query:
> SELECT *
> FROM mydata
> WHERE random_number >= (SELECT RANDOM() OFFSET 0)
> ORDER BY random_number ASC LIMIT <K>
> Here <K> is the number of random rows to pick. E.g. 100.
>
> The benefit in this solution is that the "random_number" column can be
> indexed, allowing the query to be performed using a simple and fast
> index scan.
>
> However, there is a couple of major drawbacks in this method:
>
> 1) The smaller "random_number" is, the less likely is it that it will
> be picked when using this method. Example: A random_number close to
> zero will only have a very small probability to be selected.
When the above query returns L rows (where L < K) then, you need to
append the first K - L rows from the table to simulate a "ring" without
start or end. (Conveniently, this also solves the problem of not
finding K rows because the random start value was too large.)

The second set of rows can certainly be fetched using a second SELECT
statement. Whether this can be computed efficiently as a single SELECT
statement I am not sure but you might try something like this:

(SELECT 1 AS seq, *
FROM mydata
WHERE random_number >= (SELECT RANDOM() OFFSET 0)
ORDER BY random_number ASC LIMIT <K>)
UNION ALL
(SELECT 2 AS seq, *
FROM mydata
ORDER BY random_number ASC LIMIT <K>)
ORDER BY seq ASC, random_number ASC LIMIT K;

This should provide each row with an equal chance of being selected
while requiring the database to fetch at most 2 * K rows.

Regards,

Paul Tillotson

In response to

Selecting K random rows - efficiently! at 2007-10-24 08:35:14 from cluster

Browse pgsql-general by date

	From	Date	Subject
Next Message	Joe Conway	2007-10-25 02:26:16	Re: Crosstab Problems
Previous Message	Chris Travers	2007-10-25 01:10:59	Migration questions for upcoming 8.3 release and fts