Re: Do we want a hashset type?

From: "Joel Jacobson" <joel(at)compiler(dot)org>
To: "Tomas Vondra" <tomas(dot)vondra(at)enterprisedb(dot)com>, pgsql-hackers(at)lists(dot)postgresql(dot)org, "Andrew Dunstan" <andrew(at)dunslane(dot)net>, "jian he" <jian(dot)universality(at)gmail(dot)com>
Subject: Re: Do we want a hashset type?
Date: 2023-06-12 17:34:58
Message-ID: 645be3ab-d4fd-4368-82c6-f7639aea71be@app.fastmail.com
Lists: pgsql-hackers

On Sat, Jun 10, 2023, at 22:12, Tomas Vondra wrote:
>>> 1) better hash table implementation

I noticed that src/include/common/hashfn.h provides an implementation
of the Jenkins/lookup3 hash function, and thought maybe
we could simply use it in hashset?
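
To make the idea concrete, here is a minimal sketch of how the
hashset could hash an int4 element using the existing lookup3-based
helper in hashfn.h (int4hashset_hash_element is a made-up name,
just for illustration):

#include "postgres.h"
#include "common/hashfn.h"

/* Hash one int4 element with the lookup3-based helper. */
static inline uint32
int4hashset_hash_element(int32 value)
{
    return hash_bytes_uint32((uint32) value);
}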

However, I noticed that according to SMHasher [1],
Jenkins/lookup3 has some quality problems:
"UB, 28% bias, collisions, 30% distr, BIC"

Not sure if that's accurate, or whether it's simply not a problem
for the PostgreSQL implementation?

According to SMHasher, the two fastest 32/64-bit hash functions
for non-cryptographic purposes without any quality problems
that are also portable seem to be these two:

wyhash v4.1 (64-bit) [2]
MiB/sec: 22513.04
cycl./hash: 29.01
size: 474

xxh3low (xxHash v3, 64-bit, low 32-bits part) [3]
MiB/sec: 20722.94
cycl./hash: 30.26
size: 756

[1] https://github.com/rurban/smhasher
[2] https://github.com/wangyi-fudan/wyhash
[3] https://github.com/Cyan4973/xxHash
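
For comparison, using xxh3low from the hashset code would amount to
taking the low 32 bits of XXH3_64bits(). A minimal sketch, assuming
we vendor or link xxHash (which is not in the tree today):

#include <stdint.h>
#include "xxhash.h"  /* XXH3_64bits() */

/* 32-bit element hash: low half of the 64-bit XXH3 value. */
static inline uint32_t
hashset_hash_int32_xxh3(int32_t value)
{
    return (uint32_t) XXH3_64bits(&value, sizeof(value));
}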

>>> 5) more efficient storage format, with versioning etc.
> I think the main question is whether to serialize the hash table as is,
> or compact it in some way. The current code just uses the same thing for
> both cases - on-disk format and in-memory representation (aggstate).
> That's simple, but it also means the on-disk value is likely not well
> compressible (because it's ~50% random data).
>
> I've thought about serializing just the values (as a simple array), but
> that defeats the whole purpose of fast membership checks. I have two ideas:
>
> a) sort the data, and use binary search for this compact variant (and
> then expand it into "full" hash table if needed)
>
> b) use a more compact hash table (with load factor much closer to 1.0)
>
> Not sure which of these options is the right one, each has cost for
> converting between formats (but large on-disk value is not free either).
>
> That's roughly what I did for the tdigest extension.
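
If it helps the discussion, here is my rough reading of option (a)
as code: store just the sorted int4 values in the compact format,
and answer membership checks with a binary search (all names here
are illustrative, not a proposal for the actual API):

#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

/* Comparator for qsort() over int32 values. */
static int
int32_cmp(const void *a, const void *b)
{
    int32_t va = *(const int32_t *) a;
    int32_t vb = *(const int32_t *) b;

    return (va > vb) - (va < vb);
}

/* Sort the values once, when serializing to the compact format. */
static void
hashset_compact(int32_t *values, size_t nvalues)
{
    qsort(values, nvalues, sizeof(int32_t), int32_cmp);
}

/* Membership check on the compact (sorted) representation. */
static bool
hashset_compact_contains(const int32_t *values, size_t nvalues,
                         int32_t key)
{
    size_t lo = 0;
    size_t hi = nvalues;

    while (lo < hi)
    {
        size_t mid = lo + (hi - lo) / 2;

        if (values[mid] < key)
            lo = mid + 1;
        else if (values[mid] > key)
            hi = mid;
        else
            return true;
    }
    return false;
}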

Is the choice of hash function (and its in-memory representation)
independent of the on-disk format question? That is, could we
experiment with and evaluate different hash functions first,
optimising for speed and quality, and then, once done, proceed
to optimise for space? Or are the two intertwined somehow?

/Joel
