Re: random record from small set

From: Jeff Davis <jdavis-pgsql(at)empires(dot)org>
To: Michael Fuhr <mike(at)fuhr(dot)org>
Cc: PgSQL General List <pgsql-general(at)postgresql(dot)org>
Subject: Re: random record from small set
Date: 2005-02-15 06:15:14
Message-ID: 1108448114.354.229.camel@jeff
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general

Thanks very much for the information. I had very similar results on my
machine. I will take your advice and use the double-precision values,
since it doesn't affect the probability significantly anyway. As far as
the constraint trigger, I will see if it becomes a problem before I
worry about its performance. As far as whether those values add up to
1.0, I'll just check to make sure it's fairly close to 1.0 :)

The only real difference that I saw was that I didn't notice much
difference if the underlying table's chance attribute was double
precision vs. numeric. I used your generate_series()-based query.

Perhaps that was because I was using ALTER TABLE to modify r1's "chance"
attribute. One thing that I noticed there was that I had to CREATE OR
REPLACE the plpgsql function for that to work. Perhaps it was a bug?
Here's a test case:

test=# create table t1(i int);
CREATE TABLE
test=# insert into t1 values(1);
INSERT 17775 1
test=# create or replace function err() returns int as $$ DECLARE accum
double precision := 0.0; row record; BEGIN FOR row IN SELECT i FROM t1
LOOP accum := accum + row.i; END LOOP; RETURN row.i; END; $$ language
plpgsql;
CREATE FUNCTION
test=# insert into t1 values(2); INSERT 17778 1
test=# select err();
err
-----
2
(1 row)
test=# alter table t1 alter column i type numeric;
ALTER TABLE
test=# select err();
err
-----
10
(1 row)

And if you keep playing around with the type and values you can get
other errors like:
ERROR: type "double precision" value out of range: underflow
CONTEXT: PL/pgSQL function "err" line 1 at assignment
Or:
ERROR: invalid memory alloc request size 4294967290
CONTEXT: PL/pgSQL function "randrec" line 7 at assignment

If any more information would be helpful someone let me know. It looks a
little like a bug; perhaps we should throw an error when dependent
functions are called after the underlying types have changed? Or I
suppose if we can recognize that, it might as well recompile the
function and proceed without error.

Regards,
Jeff Davis

On Mon, 2005-02-14 at 22:18 -0700, Michael Fuhr wrote:
> On Mon, Feb 14, 2005 at 06:15:56PM -0800, Jeff Davis wrote:
> >
> > * Am I right to use NUMERIC for the chance attribute?
>
> I ran tests with numeric, real, and double precision; double precision
> was consistently about 10% faster than the others. I used the
> sample data you posted and the PL/pgSQL function shown later in
> this message.
>
> > * Does perl's arithmetic leave me with the chance that those numeric
> > values don't add up to 1.00 (and in this case that could mean an
> > infinite loop)?
>
> I'd suggest looping through the records so you can't possibly end
> up in an infinite loop.
>
> > * In my design I'll need a constraint trigger making sure that the
> > numbers add up to 1.00.
>
> If the sum must be exactly 1.00 then be careful if you use double
> precision -- if you test with the equality operator then the check
> might fail because the sum is 0.9999999987.
>
> > Will that be a performance problem for operations on the table that
> > don't modify the chance attribute?
>
> Any trigger that you didn't otherwise need will cause a performance
> hit. I'd expect a statement-level AFTER trigger to have the lowest
> impact since it would run only once per statement, whereas a row-level
> trigger might run multiple times per statement. On the other hand,
> if you make a lot of updates that don't modify the chance attribute,
> then you might want to try a row-level trigger that skips the check
> when NEW.chance = OLD.chance. I'd suggesting testing different
> methods under expected conditions and see which has the lowest impact.
>
> > * Is there a better way?
> > * Does spi_exec_query pull the entire result set into memory at once?
>
> I think it does. I ran some tests with the following PL/pgSQL
> function and got significantly faster times than with PL/Perl,
> especially as the data set grew:
>
> CREATE FUNCTION randrec() RETURNS integer AS $$
> DECLARE
> r double precision := random();
> accum double precision := 0.0;
> row record;
> BEGIN
> FOR row IN SELECT * FROM r1 LOOP
> accum := accum + row.chance;
> IF accum >= r THEN
> EXIT;
> END IF;
> END LOOP;
>
> RETURN row.i;
> END;
> $$ LANGUAGE plpgsql VOLATILE;
>
> SELECT * FROM r1;
> i | chance
> ---+--------
> 1 | 0.25
> 2 | 0.20
> 3 | 0.15
> 4 | 0.10
> 5 | 0.30
>
> SELECT i, count(*)
> FROM (SELECT randrec() AS i FROM generate_series(1, 10000)) AS s
> GROUP BY i
> ORDER by i;
> i | count
> ---+-------
> 1 | 2467
> 2 | 1939
> 3 | 1536
> 4 | 1016
> 5 | 3042
> (5 rows)
> Time: 3300.710 ms
>
> Here are the results using the PL/Perl function you posted:
>
> SELECT i, count(*)
> FROM (SELECT randrec_perl() AS i FROM generate_series(1, 10000)) AS s
> GROUP BY i
> ORDER by i;
> i | count
> ---+-------
> 1 | 2501
> 2 | 2040
> 3 | 1463
> 4 | 994
> 5 | 3002
> (5 rows)
> Time: 8765.584 ms
>
> I ran each query several times and those times were typical of both.
> With a data set of 100 records, the PL/pgSQL function ran in about
> 14 seconds, while the PL/Perl function took around 65 seconds.
>

In response to

Browse pgsql-general by date

  From Date Subject
Next Message Rey Arqueza 2005-02-15 07:21:36 Re: /usr/sbin/useradd is needed by postgresql-server-8.0.1-PGDG
Previous Message Tom Lane 2005-02-15 05:58:43 Re: Lost rows/data corruption?