Getting random rows from a table

From: Gurjeet Singh <singh(dot)gurjeet(at)gmail(dot)com>
To: PGSQL General <pgsql-general(at)postgresql(dot)org>
Subject: Getting random rows from a table
Date: 2012-09-01 04:23:19
Message-ID: CABwTF4WBW3O9kmxXEURdk+8C+DdpmaCJxUSkRJfi3zhaVbJ4qg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general

This email has been sitting in my drafts folder since Sept 20th, 2008.
Almost 4 years! Getting it off my chest now. I am attaching 2 files for
this solution I developed, because I don't know which is the correct one
(probably both are, with something different in implementation), but I
don't have time or energy to verify that now. I am pretty sure the
randomization works, although it is a tad bit expensive to get random rows.

Although the procedure's parameter names are pretty self descriptive, I'll
explain them in brief here:

p_schemaname : name of the schema where the table resides
p_tablename : name of the table you want to get random rows from
p_columns : column list (AFAIR, these can expressions too)
p_where : the WHERE clause you wish to appy, if any.
p_numrows : how many rows you want in the result.
p_maxretries : how many times to retry when we can't find a row, before
giving up; null implies 'retry forever'

Hi All,

For one of my small experiments (which is obviously backed by PG), I
needed to get a set of random rows from a table. iGoogling around gave me
some pointers, but they all were either not convenient (needed adding a
column), or were not performant enough (sort on huge resultsets!); some had
both the problems. In my test table of about 90 MB containing 1 million
rows, these solutions clearly did not perform well!

One of the solutions I thought of, and which worked too for me, was
using the 'Synchronized Sequential Scans' feature of 8.3. You make one of
your connections do sequential scans on the target table in a loop (jut do
a count(*) on that table in a loop). And when you want to select rows, say
5, from that table, you just fire 'SELECT * FROM mytab LIMIT 5'. Depending
on where the other constantly-looping sequential scan is, you will get 5
rows from a random location in your table.

The problem with this approach is that, that you will always get the
first rows from whichever database block you hit. So, in effect, you will
almost never be able to see al the rows which lie at the end of the blocks
(unless your LIMIT is high enough, or all the rows before that row are
dead).

So I developed another solution, which might work for many cases; and
in cases it doesn't work, the code can be easily be extended/modified to
suit any query type.

Attached is the file containing the definition of plpgsql function
get_random_rows(), using which we can get a specified number of random
rows. This function returns truly random rows from the mentioned table.
Here are two invocations of this function on a test table:

postgres=> explain analyze select * from get_random_rows( null, 'url',
'{url}', 100, null ) as ( a varchar );
NOTICE: Number of misses: 17
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------
Function Scan on get_random_rows (cost=0.00..260.00 rows=1000 width=32)
(actual time=193.790..194.269 rows=100 loops=1)
Total runtime: 195.017 ms
(2 rows)

postgres=> explain analyze select * from get_random_rows( null, 'url',
'{url}', 100, null ) as ( a varchar );
NOTICE: Number of misses: 30
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------
Function Scan on get_random_rows (cost=0.00..260.00 rows=1000 width=32)
(actual time=246.101..246.714 rows=100 loops=1)
Total runtime: 247.452 ms
(2 rows)

postgres=>

--
Gurjeet Singh

Attachment Content-Type Size
get_random_rows.sql application/octet-stream 3.3 KB
get_random_rows_seq.sql application/octet-stream 3.3 KB

Responses

Browse pgsql-general by date

  From Date Subject
Next Message Thalis Kalfigkopoulos 2012-09-01 06:07:09 Re: CASE/WHEN behavior with NULLS
Previous Message Andrew Sullivan 2012-09-01 03:54:34 Re: "Too far out of the mainstream"