Re: Recursive CTEs and randomness - is there something I'm missing?

From: Adrian Klaver <adrian(dot)klaver(at)aklaver(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Pól Ua Laoínecháin <linehanp(at)tcd(dot)ie>
Cc: pgsql-general <pgsql-general(at)postgresql(dot)org>
Subject: Re: Recursive CTEs and randomness - is there something I'm missing?
Date: 2020-02-28 22:19:34
Message-ID: f3d0e5dc-33fc-0ec1-caa7-d30f042a02bc@aklaver.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general

On 2/28/20 9:45 AM, Tom Lane wrote:
> =?UTF-8?B?UMOzbCBVYSBMYW/DrW5lY2jDoWlu?= <linehanp(at)tcd(dot)ie> writes:
>> The SQL:
>
>> WITH RECURSIVE rand (num, md, a_2_s) AS
>> (
>> SELECT
>> 1,
>> MD5(RANDOM()::TEXT),
>> ARRAY_TO_STRING(ARRAY(SELECT chr((65 + round(random() * 25)) :: integer)
>> FROM GENERATE_SERIES(1, 5)), '')
>> UNION
>> SELECT num + 1,
>> MD5(RANDOM()::TEXT),
>> ARRAY_TO_STRING(ARRAY(SELECT chr((65 + round(random() * 25)) :: integer)
>> FROM GENERATE_SERIES(1, 5)), '')
>> FROM rand
>> WHERE num < 5
>> )
>> SELECT * FROM rand;
>
>> A typical result is shown below:
>
>> 1 974ee059a1902e5ca1ec73c91275984b GYXYS
>> 2 6cf5a974d5859eae23cdb9c310e3a3bf YFDPT
>> 3 fa6be95eb720fe6f80c7c8fb6ba11171 YFDPT
>> 4 fa54913b0bb43de0025b153fd71a5334 YFDPT
>> 5 523fab9bdc6c4c51a89e0d901273fb69 YFDPT
>
>> The fact that the last 4 are identical is not a coincidence. If I put
>> 100 in the GENERATE_SERIES, I still get the same result, the first and
>> second records are different, but ALL subsequent instances of the
>> ARRAY_TO_STRING are identical!
>
> Yeah, it's weird. A look at EXPLAIN VERBOSE offers some insight:
>
> QUERY PLAN
> ------------------------------------------------------------------------------------------------------------------------
> CTE Scan on rand (cost=4.75..5.37 rows=31 width=68)
> Output: rand.num, rand.md, rand.a_2_s
> CTE rand
> -> Recursive Union (cost=0.13..4.75 rows=31 width=68)
> -> Result (cost=0.13..0.15 rows=1 width=68)
> Output: 1, md5((random())::text), array_to_string($1, ''::text)
> InitPlan 1 (returns $1)
> -> Function Scan on pg_catalog.generate_series (cost=0.00..0.13 rows=5 width=32)
> Output: chr((('65'::double precision + round((random() * '25'::double precision))))::integer)
> Function Call: generate_series(1, 5)
> -> WorkTable Scan on rand rand_1 (cost=0.13..0.40 rows=3 width=68)
> Output: (rand_1.num + 1), md5((random())::text), array_to_string($2, ''::text)
> Filter: (rand_1.num < 5)
> InitPlan 2 (returns $2)
> -> Function Scan on pg_catalog.generate_series generate_series_1 (cost=0.00..0.13 rows=5 width=32)
> Output: chr((('65'::double precision + round((random() * '25'::double precision))))::integer)
> Function Call: generate_series(1, 5)
> (17 rows)
>
> The ARRAY sub-selects are being done as initplans, not subplans,
> which means they're only done once not once per row. This is correct
> so far as the planner is concerned because those sub-selects are
> "uncorrelated", ie they use no values from the outer query.
>
> There is room to argue that because the sub-selects contain volatile
> functions, they ought not be optimized into initplans. We have
> traditionally not considered that, however, and I'm afraid that a
> lot of people have written queries that depend on it. For example,
> there's lore out there that replacing
> WHERE mycol < random()
> with
> WHERE mycol < (SELECT random())
> will freeze the random() result as a single value rather than
> computing a new value for each row, which sometimes is what you
> need. These days, better practice would be to put the random()
> call in a CTE, but there's still a lot of code out there that
> does it as above.
>
> For your immediate problem, since you don't care that much
> (I suppose) about exactly how the strings are generated, you
> could fix the issue by making the sub-selects depend on
> "num" somehow. Or possibly there's a way to do it without
> a sub-select. On the whole this looks like a mighty expensive
> way to generate a random string, so I'd be inclined to look
> for other implementations.

An off the cuff Python solution:

CREATE OR REPLACE FUNCTION public.upper_random()
RETURNS character varying
LANGUAGE plpythonu
AS $function$
from string import ascii_uppercase as au
import random
return ''.join(random.sample(au,5))
$function$

SELECT
num,
MD5(RANDOM()::TEXT),
upper_random()
FROM
generate_series(1, 5) AS t(num);

num | md5 | upper_random
-----+----------------------------------+--------------
1 | 5896a9e3efa53027873d7999e58904ae | TRGEB
2 | 9f9677c32a64b6eae73759a69e1acfff | TFQHG
3 | 5aefda5b498215065e01ba697d79caee | KYBZS
4 | 1605b7fa54fef9bdc5c49f9b79810e07 | EUDML
5 | 4ba59880c1c67bca1d1f184bda5350b6 | RFPGO

>
> regards, tom lane
>
>

--
Adrian Klaver
adrian(dot)klaver(at)aklaver(dot)com

In response to

Browse pgsql-general by date

  From Date Subject
Next Message Peter J. Holzer 2020-02-29 19:48:57 Re: trouble making PG use my Perl
Previous Message Tom Lane 2020-02-28 17:45:29 Re: Recursive CTEs and randomness - is there something I'm missing?