Re: Do we want a hashset type?

From: Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com>
To: Andrew Dunstan <andrew(at)dunslane(dot)net>, Joel Jacobson <joel(at)compiler(dot)org>, jian he <jian(dot)universality(at)gmail(dot)com>
Cc: Tom Dunstan <pgsql(at)tomd(dot)cc>, pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject: Re: Do we want a hashset type?
Date: 2023-06-23 13:12:37
Message-ID: c8cb89a0-3cb8-7b70-c01d-4548128d9be3@enterprisedb.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On 6/23/23 13:47, Andrew Dunstan wrote:
>
> On 2023-06-23 Fr 04:23, Joel Jacobson wrote:
>> On Fri, Jun 23, 2023, at 08:40, jian he wrote:
>>> I played around array_func.c
>>> many of the code can be used for multiset data type.
>>> now I imagine multiset as something like one dimension array. (nested
>>> is somehow beyond the imagination...).
>> Are you suggesting it might be a better idea to start over completely
>> and work on a new code base that is based on arrayfuncs.c,
>> and aim for MULTISET/SET or anyhashset from start, that would not
>> only support int4/int8/uuid but any type?
>>
>
> Before we run too far down this rabbit hole, let's discuss the storage
> implications of using multisets. ISTM that for small base datums like
> integers it will be a substantial increase in size, since you'll need an
> addition int for the item count, unless some very clever tricks are played.
>

I honestly don't quite understand what exactly is meant by the proposal
to "reuse array_func.c for multisets". We're implementing sets, not
multisets (those were mentioned only to illustrate behavior). And the
whole point is that sets are not arrays - no duplicates, ordering does
not matter (so no index).

I mentioned that maybe we can model sets based on arrays (say, gram.y
would do similar stuff for SET[] and ARRAY[], polymorphism), not that we
should store sets as arrays. Would it be possible - maybe, if we extend
arrays to also maintain some hash hash table. But I'd bet that'll just
make arrays more complex, and will make sets slower.

Or maybe I just don't understand the proposal. Perhaps it'd be best if
jian wrote a patch illustrating the idea, and showing how it performs
compared to the current approach.

As for the storage size, I don't think an extra "count" field would make
any measurable difference. If we're storing a hash table, we're bound to
have a couple percent of wasted space due to load factor (likely between
0.75 and 0.9).

> As for this older discussion referred to upthread, if the SQL Standards
> Committee hasn't acted on it by now it seem reasonable to think they are
> unlikely to.
>

AFAIK multisets are included in SQL 2023, pretty much matching the draft
we discussed earlier. Yeah, it's unlikely to change in the future.

> Just for reference, Here's some description of Oracle's suport for
> Multisets from
> <https://docs.oracle.com/en/database/oracle/oracle-database/23/sqlrf/Oracle-Support-for-Optional-Features-of-SQLFoundation2011.html#GUID-3BA98AEC-FAAD-4F21-A6AD-F696B5D36D56>:
>

good to know

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Ashutosh Bapat 2023-06-23 13:18:25 Re: logical decoding and replication of sequences, take 2
Previous Message Andrew Dunstan 2023-06-23 13:06:43 Re: Bytea PL/Perl transform