Re: Do we want a hashset type?

From: Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com>
To: Joel Jacobson <joel(at)compiler(dot)org>, pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject: Re: Do we want a hashset type?
Date: 2023-06-10 20:12:36
Message-ID: 9eb5931c-6375-760b-bbc6-3b1b01eeeb22@enterprisedb.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On 6/9/23 12:58, Joel Jacobson wrote:
> On Thu, Jun 8, 2023, at 12:19, Tomas Vondra wrote:
>> Would you be interested in helping with / working on some of that? I
>> don't have immediate need for this stuff, so it's not very high on my
>> TODO list.
>
> Sure, I'm willing to help!
>
> I've attached a patch that works on some of the items on your list,
> including some additions to the README.md.
>
> There were a bunch of places where `maxelements / 8` caused bugs,
> that had to be changed to do proper integer ceiling division:
>
> - values = (int32 *) (set->data + set->maxelements / 8);
> + values = (int32 *) (set->data + (set->maxelements + 7) / 8);
>
> Side note: I wonder if it would be good to add CEIL_DIV and FLOOR_DIV macros
> to the PostgreSQL source code in general, since it's easy to make this mistake,
> and quite verbose/error-prone to write it out manually everywhere.
> Such macros could simplify code in e.g. numeric.c.
>

Yeah, it'd be good to have macros to calculate the sizes. We'll need
that in many places.

>> There's a bunch of stuff that needs to be improved to make this properly
>> usable, like:
>>
>> 1) better hash table implementation
> TODO
>
>> 2) input/output functions
> I've attempted to implement these.
> I thought comma separated values wrapped around curly braces felt as the most natural format,
> example:
> SELECT '{1,2,3}'::hashset;
>

+1 to that. I'd mimic the array in/out functions as much as possible.

>> 3) support for other types (now it only works with int32)
> TODO
>

I think we should decide what types we want/need to support, and add one
or two types early. Otherwise we'll have code / on-disk format making
various assumptions about the type length etc.

I have no idea what types people use as node IDs - is it likely we'll
need to support types passed by reference / varlena types? Or can we
just assume it's int/bigint?

>> 4) I wonder if this might be done as an array-like polymorphic type.
> That would be nice!
> I guess the work-around would be to store the actual value of non-int type
> in a lookup table, and then hash the int-based primary key in such table.
>
> Do you think later implementing polymorphic type support would
> mean a more or less complete rewrite, or can we carry on with int32-support
> and add it later on?
>

I don't know. From the storage perspective it doesn't matter much, I
think, we would not need to change that. But I think adding a
polymorphic type (array-like) would require changes to grammar, and
that's not possible for an extension. If there's a way, I'm not aware of
it and I don't recall an extension doing that.

>> 5) more efficient storage format, with versioning etc.
> TODO
>

I think the main question is whether to serialize the hash table as is,
or compact it in some way. The current code just uses the same thing for
both cases - on-disk format and in-memory representation (aggstate).
That's simple, but it also means the on-disk value is likely not well
compressible (because it's ~50% random data.

I've thought about serializing just the values (as a simple array), but
that defeats the whole purpose of fast membership checks. I have two ideas:

a) sort the data, and use binary search for this compact variant (and
then expand it into "full" hash table if needed)

b) use a more compact hash table (with load factor much closer to 1.0)

Not sure which if this option is the right one, each has cost for
converting between formats (but large on-disk value is not free either).

That's roughly what I did for the tdigest extension.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Andres Freund 2023-06-10 20:24:24 Re: abi-compliance-checker
Previous Message Andres Freund 2023-06-10 19:48:46 Re: abi-compliance-checker