From: "Joel Jacobson" <joel(at)compiler(dot)org>
To: "Tomas Vondra" <tomas(dot)vondra(at)enterprisedb(dot)com>, pgsql-hackers(at)lists(dot)postgresql(dot)org, "Andrew Dunstan" <andrew(at)dunslane(dot)net>, "jian he" <jian(dot)universality(at)gmail(dot)com>
Subject: Re: Do we want a hashset type?
Date: 2023-06-12 17:34:58
Message-ID: 645be3ab-d4fd-4368-82c6-f7639aea71be@app.fastmail.com
Lists: pgsql-hackers
On Sat, Jun 10, 2023, at 22:12, Tomas Vondra wrote:
>>> 1) better hash table implementation
I noticed src/include/common/hashfn.h provides an implementation
of the Jenkins/lookup3 hash function, and thought maybe
we could simply use it in hashset?

However, according to SMHasher [1], Jenkins/lookup3
has some quality problems:
"UB, 28% bias, collisions, 30% distr, BIC"

Not sure if that's accurate, or whether it would be a problem
for the PostgreSQL implementation in practice?
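If we went that way, hooking the element hashing up to hashfn.h
could look roughly like this sketch (hashset_hash_elem() is a
hypothetical name, assuming int4 elements; not existing hashset code):

    #include "postgres.h"
    #include "common/hashfn.h"

    /* Hedged sketch: hash one int4 element with the lookup3-based
     * helpers from src/include/common/hashfn.h. */
    static inline uint32
    hashset_hash_elem(int32 value)
    {
        /* hash_bytes_uint32() is lookup3's fast path for a single
         * 32-bit key */
        return hash_bytes_uint32((uint32) value);
    }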
According to SMHasher, the two fastest 32/64-bit hash functions
for non-cryptographic purposes without any quality problems
that are also portable seem to be these two:
wyhash v4.1 (64-bit) [2]
    MiB/sec:    22513.04
    cycl./hash: 29.01
    size:       474

xxh3low (xxHash v3, 64-bit, low 32 bits) [3]
    MiB/sec:    20722.94
    cycl./hash: 30.26
    size:       756
[1] https://github.com/rurban/smhasher
[2] https://github.com/wangyi-fudan/wyhash
[3] https://github.com/Cyan4973/xxHash
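For clarity, "xxh3low" just means truncating XXH3's 64-bit output
to its low 32 bits. A hedged sketch, assuming the single-header
xxHash API from [3] (not part of PostgreSQL):

    #define XXH_INLINE_ALL
    #include "xxhash.h"     /* from [3] */
    #include <stdint.h>

    static inline uint32_t
    xxh3low(const void *data, size_t len)
    {
        /* keep only the low 32 bits of the 64-bit XXH3 hash */
        return (uint32_t) XXH3_64bits(data, len);
    }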
>>> 5) more efficient storage format, with versioning etc.
> I think the main question is whether to serialize the hash table as is,
> or compact it in some way. The current code just uses the same thing for
> both cases - on-disk format and in-memory representation (aggstate).
> That's simple, but it also means the on-disk value is likely not well
> compressible (because it's ~50% random data).
>
> I've thought about serializing just the values (as a simple array), but
> that defeats the whole purpose of fast membership checks. I have two ideas:
>
> a) sort the data, and use binary search for this compact variant (and
> then expand it into "full" hash table if needed)
>
> b) use a more compact hash table (with load factor much closer to 1.0)
>
> Not sure which of these options is the right one; each has a cost for
> converting between formats (but a large on-disk value is not free either).
>
> That's roughly what I did for the tdigest extension.
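To make option (a) concrete, membership testing against the compact
sorted variant could be a plain lower-bound binary search, something
like this sketch (illustrative names, assuming int4 elements; not
actual hashset or tdigest code):

    #include <stdbool.h>
    #include <stddef.h>
    #include <stdint.h>

    /* lower-bound binary search over a sorted int32 array */
    static bool
    sorted_set_contains(const int32_t *vals, size_t nvals, int32_t key)
    {
        size_t lo = 0, hi = nvals;

        while (lo < hi)
        {
            size_t mid = lo + (hi - lo) / 2;

            if (vals[mid] < key)
                lo = mid + 1;
            else
                hi = mid;
        }
        return lo < nvals && vals[lo] == key;
    }

That keeps the on-disk value compact and sorted (so presumably more
compressible), at the cost of O(log n) probes until it's expanded
into a full hash table.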
Is the choice of hash function (and its in-memory representation)
independent of the on-disk format question? That is, could we work
on experimenting with and evaluating different hash functions first,
to optimise for speed and quality, and then, when done, proceed
to optimise for space, or are the two intertwined somehow?
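If they are independent, one way to keep them decoupled would be to
hide both behind a small ops struct, so hash functions can be swapped
during benchmarking without touching the storage code. A purely
hypothetical sketch (none of these names exist in the patch):

    #include <stddef.h>
    #include <stdint.h>

    /* swappable hash function: lookup3, wyhash, xxh3low, ... */
    typedef uint32_t (*hashset_hashfn) (const void *key, size_t len,
                                        uint32_t seed);

    typedef struct hashset_ops
    {
        hashset_hashfn hash;
        /* (de)serialization kept separate, so the on-disk format can
         * evolve independently of the hash function choice */
        size_t (*serialize) (const void *set, char *buf);
        void  *(*deserialize) (const char *buf, size_t len);
    } hashset_ops;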
/Joel