Re: [PATCH] Introduce array_shuffle() and array_sample()

From: Martin Kalcher <martin(dot)kalcher(at)aboutsource(dot)net>
To: Dean Rasheed <dean(dot)a(dot)rasheed(at)gmail(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Robert Haas <robertmhaas(at)gmail(dot)com>, Andrew Dunstan <andrew(at)dunslane(dot)net>, John Naylor <john(dot)naylor(at)enterprisedb(dot)com>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: [PATCH] Introduce array_shuffle() and array_sample()
Date: 2022-07-21 11:15:43
Message-ID: f6267e99-4448-b47c-9fe8-ec23690b931f@aboutsource.net
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general pgsql-hackers

Am 21.07.22 um 10:41 schrieb Dean Rasheed:
>
> A couple of quick comments on the current patch:

Thank you for your feedback!

> It's important to mark these new functions as VOLATILE, not IMMUTABLE,
> otherwise they won't work as expected in queries. See
> https://www.postgresql.org/docs/current/xfunc-volatility.html

CREATE FUNCTION marks functions as VOLATILE by default if not explicitly
marked otherwise. I assumed function definitions in pg_proc.dat have the
same behavior. I will fix that.
https://www.postgresql.org/docs/current/sql-createfunction.html

> It would be better to use pg_prng_uint64_range() rather than rand() to
> pick elements. Partly, that's because it uses a higher quality PRNG,
> with a larger internal state, and it ensures that the results are
> unbiased across the range. But more importantly, it interoperates with
> setseed(), allowing predictable sequences of "random" numbers to be
> generated -- something that's useful in writing repeatable regression
> tests.

I agree that we should use pg_prng_uint64_range(). However, in order to
achieve interoperability with setseed() we would have to use
drandom_seed (rather than pg_global_prng_state) as rng state, which is
declared statically in float.c and exclusively used by random(). Do we
want to expose drandom_seed to other functions?

> Assuming these new functions are made to interoperate with setseed(),
> which I think they should be, then they also need to be marked as
> PARALLEL RESTRICTED, rather than PARALLEL SAFE. See
> https://www.postgresql.org/docs/current/parallel-safety.html, which
> explains why setseed() and random() are parallel restricted.

As mentioned above, i assumed the default here is PARALLEL UNSAFE. I'll
fix that.

> In my experience, the requirement for sampling with replacement is
> about as common as the requirement for sampling without replacement,
> so it seems odd to provide one but not the other. Of course, we can
> always add a with-replacement function later, and give it a different
> name. But maybe array_sample() could be used for both, with a
> "with_replacement" boolean parameter?

We can also add a "with_replacement" boolean parameter which is false by
default to array_sample() later. I do not have a strong opinion about
that and will implement it, if desired. Any opinions about
default/no-default?

Martin

In response to

Responses

Browse pgsql-general by date

  From Date Subject
Next Message Dean Rasheed 2022-07-21 12:25:27 Re: [PATCH] Introduce array_shuffle() and array_sample()
Previous Message Daulat 2022-07-21 10:01:44 Re: More than one Cluster on single server (single instance)

Browse pgsql-hackers by date

  From Date Subject
Next Message Alvaro Herrera 2022-07-21 11:17:51 Re: standby recovery fails (tablespace related) (tentative patch and discussion)
Previous Message Thomas Munro 2022-07-21 11:14:57 Re: standby recovery fails (tablespace related) (tentative patch and discussion)