Quick Links

Re: Do we want a hashset type?

From:	Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com>
To:	Joel Jacobson <joel(at)compiler(dot)org>, Andrew Dunstan <andrew(at)dunslane(dot)net>, jian he <jian(dot)universality(at)gmail(dot)com>
Cc:	Tom Dunstan <pgsql(at)tomd(dot)cc>, pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject:	Re: Do we want a hashset type?
Date:	2023-06-20 00:04:27
Message-ID:	136e97fe-a41c-7e32-f58f-312f1a0e4caa@enterprisedb.com
Views:	Whole Thread \| Raw Message \| Download mbox \| Resend email
Thread:
Lists:	pgsql-hackers

On 6/20/23 00:50, Joel Jacobson wrote:
> On Mon, Jun 19, 2023, at 14:59, Tomas Vondra wrote:
>> What unexpected issues you mean? Sure, if someone uses multisets as if
>> they were sets (so ignoring the handling of duplicates), things will go
>> booom! quickly.
>
> The unexpected issues I had in mind are subtle bugs due to treating multisets
> as sets, which could go undetected due to having no duplicates initially.
> Multisets might initially therefore seem equal, but later diverge due to
> different element counts, leading to hard-to-detect issues.
>

Understood.

>> I imagined (if we ended up doing MULTISET) we'd provide interface (e.g.
>> operators) that'd allow perhaps help with this.
>
> Might help. But still think providing both structures would be a more foolproof
> solution, offering users the choice to select what's best for their use-case.
>

Yeah. Not confusing people is better.

>>> Despite SQL's multiset possibility, a distinct hashset type is my preference,
>>> helping appropriate data structure choice and reducing misuse.
>>>
>>> The necessity of multisets is vague beyond standards compliance.
>>
>> True - we haven't had any requests/proposal to implement MULTISETs.
>>
>> I've looked at the SQL standard primarily to check if maybe there's some
>> precedent that'd give us guidance on the SQL syntax etc. And I think
>> multisets are that - even if we end up not implementing them, it'd be
>> sad to have unnecessarily inconsistent syntax (in case someone decides
>> to add multisets in the future).
>>
>> We could invent "SET" data type, so while standard has ARRAY / MULTISET,
>> we'd have ARRAY / MULTISET / SET, and the difference between the last
>> two would be just handling of duplicates.
>
> Is the idea to use the "SET" keyword for the syntax?
> Isn't it a risk that will be confusing, since "SET" is currently
> only used for configuration and update operations?
>

I haven't tried doing that, so not sure if there would be any conflicts
in the grammar. But I can't think of a case that'd be confusing for
users - when setting internal GUC variables it's a completely different
context, there's no use for SQL-level collections (arrays, sets, ...).

For UPDATE, it'd be pretty clear too, I think. It's possible to do

UPDATE table SET col = SET[1,2,3]

and it's clear the first is the command SET, while the second is a set
constructor. For SELECT there'd be conflict, and for ALTER TABLE it'd be
possible to do

ALTER TABLE table ALTER COLUMN col SET DEFAULT SET[1,2,3];

Seems clear to me too, I think.

>> The other way to look at sets is that they are pretty similar to arrays,
>> except that there are no duplicates and order does not matter. Sure, the
>> on-disk format and code is different, but from the SQL perspective it'd
>> be nice to allow using sets in most places where arrays are allowed
>> (which is what the standard does for MULTISETS, more or less).
>>
>> That'd mean we could probably search through gram.y for places working
>> with arrays ("ARRAY array_expr", "ARRAY select_with_parens", ...) and
>> make them work with sets too, say by having SET_SUBLINK instead of
>> ARRAY_SUBLINK, set_expression instead of array_expression, etc.
>>
>> This might be also "consistent" with defining hashset type using CREATE
>> TYPE with ELEMENT, because we consider the type to be "array". So that
>> would be polymorphic type, but we don't have pre-defined array for every
>> type (and I'm not sure we want to).
>>
>> Of course, maybe there's some fatal flaw in these idea, I don't know.
>> And I don't want to move the goalposts too far - but it seems like this
>> might make some stuff actually simpler to implement (by piggy-backing on
>> the existing array infrastructure).
>
> I think it's very interesting thoughts and ambitions.
>
> I wonder though, from a user-perspective, if a new hashset type still
> wouldn't just be considered simpler, than introducing new SQL syntax?
>

It's a matter of personal taste, I guess. I'm fine with calling function
API and what not, but a sensible SQL syntax seems nicer.

> However, it would be interesting to see how the piggy-backing on the
> existing array infrastructure would look in practise code-wise though.
>
> I think it's still meaningful to continue hacking on the int4-type
> hashset extension, to see if we can agree on the semantics,
> especially around null handling and sorting.
>

Definitely. It certainly was not my intention to derail the work by
proposing more and more stuff. So feel free to pursue what makes sense
to you / helps the use case.

TBH I don't particularly see why we'd want to sort sets.

I wonder if the SQL standard says something about these things (for
MULTISETs), especially for the NULL handling. If it does, I'd try to
stick with those rules.

>> A mostly unrelated thought - I wonder if this might be somehow related
>> to the foreign key array patch ([1] might be the most recent attempt in
>> this direction). Not to hashset itself, but I recalled these patches
>> because it'd mean we don't need the separate "edges" link table (so the
>> hashset column would be the think backing the FK).
>>
>> [1]
>> https://www.postgresql.org/message-id/CAJvoCut7zELHnBSC8HrM6p-R6q-NiBN1STKhqnK5fPE-9%3DGq3g%40mail.gmail.com
>
> I remember that one! We tried to revive that one, but didn't manage to keep it alive.
> It's a really good idea though. Good idea to see if there might be synergies
> between arrays and hashsets in this area, since if we envision the elements in
> a hashset mostly will be PKs, then it would be nice to enforce reference
> integrity.

I haven't followed that at all, but I wonder how difficult would it be
to also support other collection types (like sets) and not just arrays.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

In response to

Re: Do we want a hashset type? at 2023-06-19 22:50:55 from Joel Jacobson

Responses

Re: Do we want a hashset type? at 2023-06-20 05:38:56 from Joel Jacobson

Browse pgsql-hackers by date

	From	Date	Subject
Next Message	Bruce Momjian	2023-06-20 00:42:45	Re: Partial aggregates pushdown
Previous Message	Tristan Partin	2023-06-19 23:49:05	Re: Make pgbench exit on SIGINT more reliably