From: "Pavel Stehule" <pavel(dot)stehule(at)gmail(dot)com>
To: vincent <vinny(at)xs4all(dot)nl>
Cc: pgsql-general(at)postgresql(dot)org
Subject: Re: Selecting K random rows - efficiently!
Date: 2007-10-30 10:33:31
Message-ID: 162867790710300333m687f4615r15a786a607c5a863@mail.gmail.com
Lists: pgsql-general
2007/10/30, vincent <vinny(at)xs4all(dot)nl>:
> > 2007/10/26, Patrick TJ McPhee <ptjm(at)interlog(dot)com>:
> >> In article <ffnid8$1q2t$1(at)news(dot)hub(dot)org>, cluster <skrald(at)amossen(dot)dk>
> >> wrote:
> >> % > How important is true randomness?
> >> %
> >> % The goal is an even distribution but currently I have not seen any way
> >> % to produce any kind of random sampling efficiently. Notice the word
> >>
> >> How about generating the ctid randomly? You can get the number of pages
> >> from pg_class and estimate the number of rows either using the number
> >> of tuples in pg_class or just based on what you know about the data.
> >> Then just generate two series of random numbers, one from 0 to the
> >> number
> >> of pages and the other from 1 to the number of rows per page, and keep
> >> picking rows until you have enough numbers. Assuming there aren't too
> >> many dead tuples and your estimates are good, this should retrieve n
> >> rows
> >> with roughly n look-ups.
> >>
> >> If your estimates are low, there will be tuples which can never be
> >> selected,
> >> and so far as I know, there's no way to construct a random ctid in a
> >> stock
> >> postgres database, but apart from that it seems like a good plan. If
> >> efficiency is important, you could create a C function which returns a
> >> series of random tids and join on that.
> >> --
> >>
> >
> > SELECT id, ...
> > FROM data
> > WHERE id = ANY(ARRAY(
> > SELECT (random()*max_id)::int
> > FROM generate_series(1,20)))
> > LIMIT 1;
> >
> > -- max_id is external constant
> >
> > Pavel Stehule
>
> That selects records where the id is one of twenty random numbers between
> zero and the maximum id. That's not truely random, nor is it completely
> safe if there are lots of gaps in the values of id. For example, if the
> lowest id is 50000 and the highest is 50040, this will be very likely to
> generate lots of numbers below 50000 and find no records at all.
>
>
There is only one truly safe way: ORDER BY random() LIMIT 1.

If you know that your ids do not have a uniform distribution starting
at 1, you have to modify the expression:

SELECT random()*(max_id - min_id) + min_id

This solution is far from ideal, but it is fast and good enough for
some purposes (web shops, etc.).

Pavel
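To make the gap problem concrete, here is a small sketch in Python using
sqlite3 (an assumption for illustration; the thread is about PostgreSQL,
but the arithmetic is the same). It builds vincent's worst case, a table
whose ids run only from 50000 to 50040, and shows that shifting the
random range by min_id, as suggested above, keeps every candidate inside
the populated range:

```python
import random
import sqlite3

# Hypothetical table mirroring vincent's example: ids 50000..50040 only.
# random()*max_id alone would almost always land below 50000 and miss.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE data (id INTEGER PRIMARY KEY, payload TEXT)")
conn.executemany("INSERT INTO data VALUES (?, ?)",
                 [(i, "row %d" % i) for i in range(50000, 50041)])

min_id, max_id = conn.execute("SELECT min(id), max(id) FROM data").fetchone()

# Shift the random range into [min_id, max_id], oversampling 20
# candidates so at least one is very likely to hit a live id.
candidates = [int(random.random() * (max_id - min_id)) + min_id
              for _ in range(20)]
placeholders = ",".join("?" * len(candidates))
row = conn.execute(
    "SELECT id FROM data WHERE id IN (%s) LIMIT 1" % placeholders,
    candidates).fetchone()
print(row)
```

With a dense id range like this one, every shifted candidate falls on an
existing row, so the query always returns a match; with sparse ids the
oversampling only improves the odds, which is why ORDER BY random()
remains the only fully safe method.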