Re: random record from small set

From: Michael Fuhr <mike(at)fuhr(dot)org>
To: Jeff Davis <jdavis-pgsql(at)empires(dot)org>
Cc: PgSQL General List <pgsql-general(at)postgresql(dot)org>
Subject: Re: random record from small set
Date: 2005-02-15 05:18:43
Message-ID: 20050215051843.GA40527@winnie.fuhr.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general

On Mon, Feb 14, 2005 at 06:15:56PM -0800, Jeff Davis wrote:
>
> * Am I right to use NUMERIC for the chance attribute?

I ran tests with numeric, real, and double precision; double precision
was consistently about 10% faster than the others. I used the
sample data you posted and the PL/pgSQL function shown later in
this message.

> * Does perl's arithmetic leave me with the chance that those numeric
> values don't add up to 1.00 (and in this case that could mean an
> infinite loop)?

I'd suggest looping through the records so you can't possibly end
up in an infinite loop.

> * In my design I'll need a constraint trigger making sure that the
> numbers add up to 1.00.

If the sum must be exactly 1.00 then be careful if you use double
precision -- if you test with the equality operator then the check
might fail because the sum is 0.9999999987.

> Will that be a performance problem for operations on the table that
> don't modify the chance attribute?

Any trigger that you didn't otherwise need will cause a performance
hit. I'd expect a statement-level AFTER trigger to have the lowest
impact since it would run only once per statement, whereas a row-level
trigger might run multiple times per statement. On the other hand,
if you make a lot of updates that don't modify the chance attribute,
then you might want to try a row-level trigger that skips the check
when NEW.chance = OLD.chance. I'd suggesting testing different
methods under expected conditions and see which has the lowest impact.

> * Is there a better way?
> * Does spi_exec_query pull the entire result set into memory at once?

I think it does. I ran some tests with the following PL/pgSQL
function and got significantly faster times than with PL/Perl,
especially as the data set grew:

CREATE FUNCTION randrec() RETURNS integer AS $$
DECLARE
r double precision := random();
accum double precision := 0.0;
row record;
BEGIN
FOR row IN SELECT * FROM r1 LOOP
accum := accum + row.chance;
IF accum >= r THEN
EXIT;
END IF;
END LOOP;

RETURN row.i;
END;
$$ LANGUAGE plpgsql VOLATILE;

SELECT * FROM r1;
i | chance
---+--------
1 | 0.25
2 | 0.20
3 | 0.15
4 | 0.10
5 | 0.30

SELECT i, count(*)
FROM (SELECT randrec() AS i FROM generate_series(1, 10000)) AS s
GROUP BY i
ORDER by i;
i | count
---+-------
1 | 2467
2 | 1939
3 | 1536
4 | 1016
5 | 3042
(5 rows)
Time: 3300.710 ms

Here are the results using the PL/Perl function you posted:

SELECT i, count(*)
FROM (SELECT randrec_perl() AS i FROM generate_series(1, 10000)) AS s
GROUP BY i
ORDER by i;
i | count
---+-------
1 | 2501
2 | 2040
3 | 1463
4 | 994
5 | 3002
(5 rows)
Time: 8765.584 ms

I ran each query several times and those times were typical of both.
With a data set of 100 records, the PL/pgSQL function ran in about
14 seconds, while the PL/Perl function took around 65 seconds.

--
Michael Fuhr
http://www.fuhr.org/~mfuhr/

In response to

Responses

Browse pgsql-general by date

  From Date Subject
Next Message Tom Lane 2005-02-15 05:58:43 Re: Lost rows/data corruption?
Previous Message Andrew Hall 2005-02-15 05:15:24 Re: Lost rows/data corruption?