From: | "Joel Jacobson" <joel(at)compiler(dot)org> |
---|---|
To: | "Tomas Vondra" <tomas(dot)vondra(at)enterprisedb(dot)com>, "Andrew Dunstan" <andrew(at)dunslane(dot)net>, "jian he" <jian(dot)universality(at)gmail(dot)com> |
Cc: | "Tom Dunstan" <pgsql(at)tomd(dot)cc>, pgsql-hackers(at)lists(dot)postgresql(dot)org |
Subject: | Re: Do we want a hashset type? |
Date: | 2023-06-19 22:50:55 |
Message-ID: | 6734dc1f-90d4-4039-a64b-13ebfcf581da@app.fastmail.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-hackers |
On Mon, Jun 19, 2023, at 14:59, Tomas Vondra wrote:
> What unexpected issues you mean? Sure, if someone uses multisets as if
> they were sets (so ignoring the handling of duplicates), things will go
> booom! quickly.
The unexpected issues I had in mind are subtle bugs due to treating multisets
as sets, which could go undetected due to having no duplicates initially.
Multisets might initially therefore seem equal, but later diverge due to
different element counts, leading to hard-to-detect issues.
> I imagined (if we ended up doing MULTISET) we'd provide interface (e.g.
> operators) that'd allow perhaps help with this.
Might help. But still think providing both structures would be a more foolproof
solution, offering users the choice to select what's best for their use-case.
>> Despite SQL's multiset possibility, a distinct hashset type is my preference,
>> helping appropriate data structure choice and reducing misuse.
>>
>> The necessity of multisets is vague beyond standards compliance.
>
> True - we haven't had any requests/proposal to implement MULTISETs.
>
> I've looked at the SQL standard primarily to check if maybe there's some
> precedent that'd give us guidance on the SQL syntax etc. And I think
> multisets are that - even if we end up not implementing them, it'd be
> sad to have unnecessarily inconsistent syntax (in case someone decides
> to add multisets in the future).
>
> We could invent "SET" data type, so while standard has ARRAY / MULTISET,
> we'd have ARRAY / MULTISET / SET, and the difference between the last
> two would be just handling of duplicates.
Is the idea to use the "SET" keyword for the syntax?
Isn't it a risk that will be confusing, since "SET" is currently
only used for configuration and update operations?
> The other way to look at sets is that they are pretty similar to arrays,
> except that there are no duplicates and order does not matter. Sure, the
> on-disk format and code is different, but from the SQL perspective it'd
> be nice to allow using sets in most places where arrays are allowed
> (which is what the standard does for MULTISETS, more or less).
>
> That'd mean we could probably search through gram.y for places working
> with arrays ("ARRAY array_expr", "ARRAY select_with_parens", ...) and
> make them work with sets too, say by having SET_SUBLINK instead of
> ARRAY_SUBLINK, set_expression instead of array_expression, etc.
>
> This might be also "consistent" with defining hashset type using CREATE
> TYPE with ELEMENT, because we consider the type to be "array". So that
> would be polymorphic type, but we don't have pre-defined array for every
> type (and I'm not sure we want to).
>
> Of course, maybe there's some fatal flaw in these idea, I don't know.
> And I don't want to move the goalposts too far - but it seems like this
> might make some stuff actually simpler to implement (by piggy-backing on
> the existing array infrastructure).
I think it's very interesting thoughts and ambitions.
I wonder though, from a user-perspective, if a new hashset type still
wouldn't just be considered simpler, than introducing new SQL syntax?
However, it would be interesting to see how the piggy-backing on the
existing array infrastructure would look in practise code-wise though.
I think it's still meaningful to continue hacking on the int4-type
hashset extension, to see if we can agree on the semantics,
especially around null handling and sorting.
> A mostly unrelated thought - I wonder if this might be somehow related
> to the foreign key array patch ([1] might be the most recent attempt in
> this direction). Not to hashset itself, but I recalled these patches
> because it'd mean we don't need the separate "edges" link table (so the
> hashset column would be the think backing the FK).
>
> [1]
> https://www.postgresql.org/message-id/CAJvoCut7zELHnBSC8HrM6p-R6q-NiBN1STKhqnK5fPE-9%3DGq3g%40mail.gmail.com
I remember that one! We tried to revive that one, but didn't manage to keep it alive.
It's a really good idea though. Good idea to see if there might be synergies
between arrays and hashsets in this area, since if we envision the elements in
a hashset mostly will be PKs, then it would be nice to enforce reference
integrity.
/Joel
From | Date | Subject | |
---|---|---|---|
Next Message | Jeff Davis | 2023-06-19 22:58:55 | Re: pgsql: Fix search_path to a safe value during maintenance operations. |
Previous Message | Nathan Bossart | 2023-06-19 21:55:34 | Re: allow granting CLUSTER, REFRESH MATERIALIZED VIEW, and REINDEX |