From: "Joel Jacobson" <joel(at)compiler(dot)org>
To: "Tomas Vondra" <tomas(dot)vondra(at)enterprisedb(dot)com>, pgsql-hackers(at)lists(dot)postgresql(dot)org, "Andrew Dunstan" <andrew(at)dunslane(dot)net>, "jian he" <jian(dot)universality(at)gmail(dot)com>
Subject: Re: Do we want a hashset type?
Date: 2023-06-12 20:36:01
Message-ID: f342049a-efb7-45e4-b161-10b2eb63d045@app.fastmail.com
Lists: pgsql-hackers
On Mon, Jun 12, 2023, at 21:58, Tomas Vondra wrote:
> But those are numbers for large keys - if you restrict the input to
> 4B-16B (which is what we planned to do by focusing on int, bigint and
> uuid), there's no significant difference:
Oh, sorry, I completely failed to read the meaning of the columns.
> lookup3:
>
> Small key speed test - 4-byte keys - 30.17 cycles/hash
> Small key speed test - 8-byte keys - 31.00 cycles/hash
> Small key speed test - 16-byte keys - 49.00 cycles/hash
>
> xxh3low:
>
> Small key speed test - 4-byte keys - 29.00 cycles/hash
> Small key speed test - 8-byte keys - 29.58 cycles/hash
> Small key speed test - 16-byte keys - 37.00 cycles/hash
The winner of the "Small key speed test" competition seems to be ahash64 ("ahash 64bit"):
Small key speed test - 4-byte keys - 24.00 cycles/hash
Small key speed test - 8-byte keys - 24.00 cycles/hash
Small key speed test - 16-byte keys - 26.98 cycles/hash
Looks like it's a popular one; e.g., it's the default hasher in Rust's hashbrown crate, which provides the implementation behind std::collections::HashSet.
Another notable property of ahash64 is that it's DoS-resistant,
but that isn't crucial for our use case, since we exclusively target
auto-generated primary keys, which are not influenced by end users.
> Not sure what you mean by "optimizing for space" - I imagined the
> on-disk format would just use the same hash table with a tiny amount of
> free space (say 10%, not ~50%).
With "optimizing for space" I meant trying to find some alternative or
intermediate data structure that is more likely to be compressible,
like your idea of sorting the data.
What I wondered was whether the on-disk format would be affected by
the choice of hash function. I guess it wouldn't be, if the hashset
is rebuilt by reading the on-disk format and adding the elements
one by one.
But I thought maybe something more advanced could be done,
where conversion between the on-disk format and the in-memory
format happens without naively iterating through all elements,
i.e. something cheaper than O(n).
No idea what that mechanism would be, though.
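
For illustration, here's a minimal sketch of the naive O(n) rebuild
we're taking for granted above (all names are hypothetical, not an
actual API; assuming the on-disk format is a dense array of int4
elements and the in-memory table uses open addressing):

/*
 * Hypothetical sketch, not actual API: rebuild an in-memory hashset
 * from an on-disk format that stores just a dense array of elements.
 */
typedef struct hashset_t
{
    uint32      nbuckets;       /* capacity, a power of two */
    int32      *values;         /* bucket array */
    bool       *used;           /* occupancy flags */
} hashset_t;

extern hashset_t *hashset_create(uint32 nelements); /* sizes with ~10% free space */
extern uint32 hashset_hash(int32 value);            /* whichever function we settle on */

static void
hashset_add(hashset_t *set, int32 value)
{
    uint32      pos = hashset_hash(value) & (set->nbuckets - 1);

    /* linear probing */
    while (set->used[pos])
    {
        if (set->values[pos] == value)
            return;             /* already present */
        pos = (pos + 1) & (set->nbuckets - 1);
    }
    set->used[pos] = true;
    set->values[pos] = value;
}

/* The naive conversion: one pass over the serialized elements, so O(n). */
static hashset_t *
hashset_deserialize(const int32 *elements, uint32 nelements)
{
    hashset_t  *set = hashset_create(nelements);

    for (uint32 i = 0; i < nelements; i++)
        hashset_add(set, elements[i]);

    return set;
}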
> My suggestion is to be lazy, just use the lookup3 we have in hashfn.c
> (through hash_bytes or something), and at the same time make it possible
> to switch to a different function in the future. I'd store an ID of the
> hash function in the set, so that we can support a different algorithm
> in the future, if we choose to.
Sounds good to me.
Smart idea to include the hash function algorithm ID in the set,
so a different one can be supported in the future!
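
To make that concrete, a rough sketch of what it could look like
(hash_bytes() is the existing lookup3-based function in hashfn.c;
the header layout and all other names here are hypothetical):

#include "postgres.h"
#include "common/hashfn.h"      /* hash_bytes(), i.e. lookup3 */

/* Hypothetical on-disk layout: record which algorithm built the set. */
typedef enum hashset_hashfn_id
{
    HASHSET_HASH_LOOKUP3 = 1
    /* future: HASHSET_HASH_XXH3LOW, HASHSET_HASH_AHASH64, ... */
} hashset_hashfn_id;

typedef struct hashset_header
{
    uint8       hashfn_id;      /* which hash function built this set */
    uint32      nelements;
    /* ... bucket data follows ... */
} hashset_header;

static uint32
hashset_hash_element(uint8 hashfn_id, const unsigned char *key, int keylen)
{
    switch (hashfn_id)
    {
        case HASHSET_HASH_LOOKUP3:
            return hash_bytes(key, keylen);
        default:
            elog(ERROR, "unrecognized hashset hash function ID: %d", hashfn_id);
            return 0;           /* keep compiler quiet */
    }
}

That way the reader always hashes with the same function the writer
used, even if a newer release defaults to a different algorithm.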
/Joel