From: | Martin Kalcher <martin(dot)kalcher(at)aboutsource(dot)net> |
---|---|
To: | Fabien COELHO <coelho(at)cri(dot)ensmp(dot)fr> |
Cc: | Dean Rasheed <dean(dot)a(dot)rasheed(at)gmail(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Robert Haas <robertmhaas(at)gmail(dot)com>, Andrew Dunstan <andrew(at)dunslane(dot)net>, John Naylor <john(dot)naylor(at)enterprisedb(dot)com>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org> |
Subject: | Re: [PATCH] Introduce array_shuffle() and array_sample() |
Date: | 2022-07-24 19:29:39 |
Message-ID: | f5af7698-d1eb-bdf4-9969-82b18ab4f8cf@aboutsource.net |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-general pgsql-hackers |
Am 24.07.22 um 10:15 schrieb Fabien COELHO:
>
> My 0,02€ about this patch:
Thank you for your feedback. I attached a patch, that addresses most of
your points.
> Use (void) when declaring no parameters in headers or in functions.
Done.
> Should the exchange be skipped when i == k?
The additional branch is actually slower (on my machine, tested with an
10M element int array) for 1D-arrays, which i assume is the most common
case. Even with a 2D-array with a subarray size of 1M there is no
difference in execution time. i == k should be relatively rare for large
arrays, given a good prng.
> I do not see the point of having *only* inline functions in a c file.
> The external functions should not be inlined?
Done.
> The random and array shuffling functions share a common state. I'm
> wondering whether it should ot should not be so. It seems ok, but then
> ISTM that the documentation suggests implicitely that setseed applies to
> random() only, which is not the case anymore after the patch.
I really like the idea of a prng state owned by the user, that is used
by all user facing random functions, but see that their might be
concerns about this change. I would update the setseed() documentation,
if this proposal is accepted. Do you think my patch should already
include the documentation update?
> If more samples are required than the number of elements, it does not
> error out. I'm wondering whether it should.
The second argument to array_sample() is treated like a LIMIT, rather
than the actual number of elements. I updated the documentation. My
use-case is: take max random items from an array of unknown size.
> Also, the sampling should not return its result in order when the number
> of elements required is the full array, ISTM that it should be shuffled
> there as well.
You are the second person asking for this. It's done. I am thinking
about ditching array_sample() and replace it with a second signature for
array_shuffle():
array_shuffle(array anyarray) -> anyarray
array_shuffle(array anyarray, limit int) -> anyarray
What do you think?
> I must say that without significant knowledge of the array internal
> implementation, the swap code looks pretty strange. ISTM that the code
> would be clearer if pointers and array references style were not
> intermixed.
Done. Went with pointers.
> Maybe you could add a test with a 3D array? Some sample with NULLs?
Done.
> Unrelated: I notice again that (postgre)SQL does not provide a way to
> generate random integers. I do not see why not. Should we provide one?
Maybe. I think it is outside the scope of this patch.
Thank you, Martin
Attachment | Content-Type | Size |
---|---|---|
0001-Introduce-array_shuffle-and-array_sample.patch | text/x-patch | 16.7 KB |
From | Date | Subject | |
---|---|---|---|
Next Message | Fabien COELHO | 2022-07-24 19:42:48 | Re: [PATCH] Introduce array_shuffle() and array_sample() |
Previous Message | xavier | 2022-07-24 16:50:02 | Re: Queries in another user's tables (SOLVED) |
From | Date | Subject | |
---|---|---|---|
Next Message | Fabien COELHO | 2022-07-24 19:42:48 | Re: [PATCH] Introduce array_shuffle() and array_sample() |
Previous Message | Noah Misch | 2022-07-24 19:10:17 | Re: fairywren hung in pg_basebackup tests |