From:       Michael Fuhr <mike(at)fuhr(dot)org>
To:         Jeff Davis <jdavis-pgsql(at)empires(dot)org>
Cc:         PgSQL General List <pgsql-general(at)postgresql(dot)org>
Subject:    Re: random record from small set
Date:       2005-02-15 05:18:43
Message-ID: 20050215051843.GA40527@winnie.fuhr.org
Lists:      pgsql-general
On Mon, Feb 14, 2005 at 06:15:56PM -0800, Jeff Davis wrote:
>
> * Am I right to use NUMERIC for the chance attribute?
I ran tests with numeric, real, and double precision; double precision
was consistently about 10% faster than the others. I used the
sample data you posted and the PL/pgSQL function shown later in
this message.
> * Does perl's arithmetic leave me with the chance that those numeric
> values don't add up to 1.00 (and in this case that could mean an
> infinite loop)?
I'd suggest looping through the records so you can't possibly end
up in an infinite loop.
> * In my design I'll need a constraint trigger making sure that the
> numbers add up to 1.00.
If the sum must be exactly 1.00, be careful if you use double
precision -- if you test with the equality operator then the check
might fail because the stored sum comes out as something like
0.9999999987 instead of exactly 1.
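For example, a tolerance-based check along these lines sidesteps the
exact-equality problem (a rough sketch; r1 matches your sample data,
but the 1e-9 tolerance is an arbitrary choice):

SELECT abs(sum(chance) - 1.0) < 1e-9 AS chance_sum_ok
FROM r1;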
> Will that be a performance problem for operations on the table that
> don't modify the chance attribute?
Any trigger that you didn't otherwise need will cause a performance
hit. I'd expect a statement-level AFTER trigger to have the lowest
impact since it would run only once per statement, whereas a row-level
trigger might run multiple times per statement. On the other hand,
if you make a lot of updates that don't modify the chance attribute,
then you might want to try a row-level trigger that skips the check
when NEW.chance = OLD.chance. I'd suggest testing the different
methods under expected conditions to see which has the lowest impact.
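To illustrate the row-level variant, here's a rough, untested sketch
(the function and trigger names are made up, and the tolerance is
arbitrary):

CREATE FUNCTION check_chance_sum() RETURNS trigger AS $$
BEGIN
    -- Skip the check when an UPDATE leaves chance untouched.
    IF TG_OP = 'UPDATE' AND NEW.chance = OLD.chance THEN
        RETURN NULL;
    END IF;
    -- Compare the sum against 1.00 with a tolerance instead of
    -- exact equality, to allow for floating-point error.
    IF abs((SELECT sum(chance) FROM r1) - 1.0) > 1e-9 THEN
        RAISE EXCEPTION 'chances in r1 must sum to 1.00';
    END IF;
    RETURN NULL;  -- return value is ignored for AFTER triggers
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER r1_check_chance_sum AFTER INSERT OR UPDATE ON r1
    FOR EACH ROW EXECUTE PROCEDURE check_chance_sum();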
> * Is there a better way?
> * Does spi_exec_query pull the entire result set into memory at once?
I think it does. I ran some tests with the following PL/pgSQL
function and got significantly faster times than with PL/Perl,
especially as the data set grew:
CREATE FUNCTION randrec() RETURNS integer AS $$
DECLARE
    r     double precision := random();
    accum double precision := 0.0;
    row   record;
BEGIN
    -- Accumulate chances until we pass the random value; if the
    -- chances sum to slightly less than 1.00 we simply fall off
    -- the end of the loop, so there's no risk of an infinite loop.
    FOR row IN SELECT * FROM r1 LOOP
        accum := accum + row.chance;
        IF accum >= r THEN
            EXIT;
        END IF;
    END LOOP;
    RETURN row.i;
END;
$$ LANGUAGE plpgsql VOLATILE;
SELECT * FROM r1;
i | chance
---+--------
1 | 0.25
2 | 0.20
3 | 0.15
4 | 0.10
5 | 0.30
SELECT i, count(*)
FROM (SELECT randrec() AS i FROM generate_series(1, 10000)) AS s
GROUP BY i
ORDER BY i;
i | count
---+-------
1 | 2467
2 | 1939
3 | 1536
4 | 1016
5 | 3042
(5 rows)
Time: 3300.710 ms
Here are the results using the PL/Perl function you posted:
SELECT i, count(*)
FROM (SELECT randrec_perl() AS i FROM generate_series(1, 10000)) AS s
GROUP BY i
ORDER BY i;
i | count
---+-------
1 | 2501
2 | 2040
3 | 1463
4 | 994
5 | 3002
(5 rows)
Time: 8765.584 ms
I ran each query several times and those times were typical of both.
With a data set of 100 records, the PL/pgSQL function ran in about
14 seconds, while the PL/Perl function took around 65 seconds.
--
Michael Fuhr
http://www.fuhr.org/~mfuhr/