Re: Do we want a hashset type?

From: jian he <jian(dot)universality(at)gmail(dot)com>
To: Joel Jacobson <joel(at)compiler(dot)org>
Cc: Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com>, pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject: Re: Do we want a hashset type?
Date: 2023-06-09 11:33:09
Message-ID: CACJufxE=QbDzYYay4Z6MJ=BqdkYiHi_P34YOKduyQ0yyXmMhOw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Fri, Jun 9, 2023 at 6:58 PM Joel Jacobson <joel(at)compiler(dot)org> wrote:

> On Thu, Jun 8, 2023, at 12:19, Tomas Vondra wrote:
> > Would you be interested in helping with / working on some of that? I
> > don't have immediate need for this stuff, so it's not very high on my
> > TODO list.
>
> Sure, I'm willing to help!
>
> I've attached a patch that works on some of the items on your list,
> including some additions to the README.md.
>
> There were a bunch of places where `maxelements / 8` caused bugs,
> that had to be changed to do proper integer ceiling division:
>
> - values = (int32 *) (set->data + set->maxelements / 8);
> + values = (int32 *) (set->data + (set->maxelements + 7) / 8);
>
> Side note: I wonder if it would be good to add CEIL_DIV and FLOOR_DIV
> macros
> to the PostgreSQL source code in general, since it's easy to make this
> mistake,
> and quite verbose/error-prone to write it out manually everywhere.
> Such macros could simplify code in e.g. numeric.c.
>
> > There's a bunch of stuff that needs to be improved to make this properly
> > usable, like:
> >
> > 1) better hash table implementation
> TODO
>
> > 2) input/output functions
> I've attempted to implement these.
> I thought comma separated values wrapped around curly braces felt as the
> most natural format,
> example:
> SELECT '{1,2,3}'::hashset;
>
> > 3) support for other types (now it only works with int32)
> TODO
>
> > 4) I wonder if this might be done as an array-like polymorphic type.
> That would be nice!
> I guess the work-around would be to store the actual value of non-int type
> in a lookup table, and then hash the int-based primary key in such table.
>
> Do you think later implementing polymorphic type support would
> mean a more or less complete rewrite, or can we carry on with int32-support
> and add it later on?
>
> > 5) more efficient storage format, with versioning etc.
> TODO
>
> > 6) regression tests
> I've added some regression tests.
>
> > Right. IMHO the query language is a separate thing, you still need to
> > evaluate the query somehow - which is where hashset applies.
>
> Good point, I fully agree.
>
> /Joel

Hi, I am quite new about C.....
The following function I have 3 questions.
1. 7691,4201, I assume they are just random prime ints?
2. I don't get the last return set, even the return type should be bool.
3. I don't understand 13 in hash = (hash + 13) % set->maxelements;

static bool
hashset_contains_element(hashset_t *set, int32 value)
{
int byte;
int bit;
uint32 hash;
char *bitmap;
int32 *values;

hash = ((uint32) value * 7691 + 4201) % set->maxelements;

bitmap = set->data;
values = (int32 *) (set->data + set->maxelements / 8);

while (true)
{
byte = (hash / 8);
bit = (hash % 8);

/* found an empty slot, value is not there */
if ((bitmap[byte] & (0x01 << bit)) == 0)
return false;

/* is it the same value? */
if (values[hash] == value)
return true;

/* move to the next element */
hash = (hash + 13) % set->maxelements;
}

return set;
}

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Fujii.Yuki@df.MitsubishiElectric.co.jp 2023-06-09 11:38:37 RE: Partial aggregates pushdown
Previous Message Amit Langote 2023-06-09 11:25:32 Re: Views no longer in rangeTabls?