Re: Do we want a hashset type?

From: Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com>
To: Joel Jacobson <joel(at)compiler(dot)org>, pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject: Re: Do we want a hashset type?
Date: 2023-06-06 11:20:40
Message-ID: 44ac4916-5a7b-fd02-e967-4c733ae000fe@enterprisedb.com
Lists: pgsql-hackers


On 6/5/23 21:52, Joel Jacobson wrote:
> On Mon, Jun 5, 2023, at 01:44, Tomas Vondra wrote:
>> Anyway, I hacked together a trivial set backed by an open addressing
>> hash table:
>>
>> https://github.com/tvondra/hashset
>>
>> It's super-basic, providing just some bare minimum of functions, but
>> hopefully good enough for experiments.
>>
>> - hashset data type - hash table in varlena
>> - hashset_init - create new hashset
>> - hashset_add / hashset_contains - add value / check membership
>> - hashset_merge - merge two hashsets
>> - hashset_count - count elements
>> - hashset_to_array - return elements as an array
>> - hashset(int) aggregate
>>
>> This allows rewriting the recursive query like this:
>>
>> WITH RECURSIVE friends_of_friends AS (
> ...
>
> Nice! I get similar results, 737 ms (hashset) vs 1809 ms (array_agg).
>
> I think if you just add one more hashset function, it will be a win against roaringbitmap, which is 400 ms.
>
> The missing function is an agg that takes hashset as input and returns hashset, similar to roaringbitmap's rb_or_agg().
>
> With such a function, we could add an adjacent nodes hashset column to the `nodes` table, which would eliminate the need to scan the edges table for graph traversal:
>

I added a trivial version of such an aggregate, hashset(hashset), and if I
rewrite the CTE like this:

WITH RECURSIVE friends_of_friends AS (
    SELECT
        (SELECT hashset(v) FROM (VALUES (5867)) AS s(v)) AS current,
        0 AS depth
    UNION ALL
    SELECT
        new_current,
        friends_of_friends.depth + 1
    FROM
        friends_of_friends
        CROSS JOIN LATERAL (
            SELECT
                hashset(edges.to_node) AS new_current
            FROM
                edges
            WHERE
                from_node =
                    ANY(hashset_to_array(friends_of_friends.current))
        ) q
    WHERE
        friends_of_friends.depth < 3
)
SELECT
    depth,
    hashset_count(current)
FROM
    friends_of_friends
WHERE
    depth = 3;

it cuts the timing to about 50% on my laptop, so maybe it'll be ~300ms
on your system. There are plenty of opportunities for further
improvement: the hash table implementation is pretty naive/silly, the
on-disk format is wasteful, and so on.
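To make the internals concrete, here is a minimal sketch of an integer set backed by open addressing with linear probing, mirroring the add / contains / merge / count operations listed above. This is purely illustrative and not taken from the repository; the names, signatures, and the fixed capacity (no growing/rehashing) are assumptions for brevity:

```c
/* Illustrative open-addressing integer set with linear probing.
 * NOT the extension's actual code -- a simplified sketch. */
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

typedef struct {
    int32_t  capacity;   /* number of slots */
    int32_t  nelements;  /* number of occupied slots */
    bool    *used;       /* per-slot occupancy flags */
    int32_t *values;     /* per-slot values */
} hset;

static hset *hset_init(int32_t capacity)
{
    hset *s = malloc(sizeof(hset));

    s->capacity = capacity;
    s->nelements = 0;
    s->used = calloc(capacity, sizeof(bool));
    s->values = malloc(capacity * sizeof(int32_t));
    return s;
}

/* trivial multiplicative hash; a real version would mix bits better */
static uint32_t hset_hash(int32_t value)
{
    return ((uint32_t) value) * 2654435761u;
}

/* linear probing: scan forward from the home slot until we find
 * either the value itself or a free slot */
static void hset_add(hset *s, int32_t value)
{
    uint32_t pos = hset_hash(value) % s->capacity;

    while (s->used[pos])
    {
        if (s->values[pos] == value)
            return;                     /* already present */
        pos = (pos + 1) % s->capacity;
    }
    s->used[pos] = true;
    s->values[pos] = value;
    s->nelements++;
    assert(s->nelements < s->capacity); /* keep one slot free so probes stop */
}

static bool hset_contains(const hset *s, int32_t value)
{
    uint32_t pos = hset_hash(value) % s->capacity;

    while (s->used[pos])
    {
        if (s->values[pos] == value)
            return true;
        pos = (pos + 1) % s->capacity;
    }
    return false;
}

/* merge = add every element of b into a */
static void hset_merge(hset *a, const hset *b)
{
    for (int32_t i = 0; i < b->capacity; i++)
        if (b->used[i])
            hset_add(a, b->values[i]);
}

static int32_t hset_count(const hset *s)
{
    return s->nelements;
}
```

A production version would grow and rehash once the fill factor gets high, and would serialize the slots into a varlena for on-disk storage (where the per-slot bool array is one obvious source of waste).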

But before spending more time on that, it'd be interesting to know what
would be a competitive timing. I mean, what would be "good enough"? What
timings are achievable with graph databases?

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
