From: Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com>
To: Joel Jacobson <joel(at)compiler(dot)org>, pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject: Re: Do we want a hashset type?
Date: 2023-05-31 16:59:35
Message-ID: a7d114c4-6cf9-d7eb-0a11-7cb57bffd2d1@enterprisedb.com
Lists: pgsql-hackers
On 5/31/23 17:40, Joel Jacobson wrote:
> On Wed, May 31, 2023, at 16:53, Tomas Vondra wrote:
>> I think this needs a better explanation - what exactly is a hashset in
>> this context? Something like an array with a hash for faster lookup of
>> unique elements, or what?
>
> In this context, by "hashset" I am indeed referring to a data structure similar
> to an array, where each element would be unique, and lookups would be faster
> than with arrays for a larger number of elements, thanks to hash-based lookups.
>
Why would you want hash-based lookups? It should be fairly trivial to
implement as a user-defined data type, but in what cases are you going
to ask "does the hashset contain X"?
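To be clear about what I have in mind, I imagine the interface would look
something like this (all the names are made up for illustration, nothing
like this exists yet):

    -- hypothetical hashset column storing adjacent node IDs
    CREATE TABLE graph_nodes (
        node_id   bigint PRIMARY KEY,
        neighbors int8hashset   -- hypothetical type name
    );

    -- the membership test I'm asking about
    SELECT node_id
      FROM graph_nodes
     WHERE hashset_contains(neighbors, 42);   -- hypothetical function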
> This data structure would store identifiers (IDs) of the nodes, not the complete
> nodes themselves.
>
How does storing just the IDs solve anything? Isn't the main challenge
the random I/O when fetching the adjacent nodes? This does not really
improve that, no?
>> Presumably it'd store whole adjacent nodes, not just some sort of node
>> ID. So what if a node is adjacent to many other nodes? What if a node is
>> added/deleted/modified?
>
> That would require updating the hashset, which should be close to O(1) in
> practical applications.
>
But you need to modify the hashsets of all the adjacent nodes: deleting a
node, for example, means updating the stored hashset of every node it was
adjacent to. Also, O(1) doesn't mean it's cheap. I wonder how expensive
it'd be in practice.
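Just to illustrate what I mean, with a hypothetical graph_nodes(node_id,
neighbors) table it would be something like this (hashset_remove and
hashset_contains are made up as well):

    -- removing node 42 touches the hashset of every node adjacent to it
    UPDATE graph_nodes
       SET neighbors = hashset_remove(neighbors, 42)
     WHERE hashset_contains(neighbors, 42);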
>> AFAICS the main problem is the lookups of adjacent nodes, generating a
>> lot of random I/O etc. Presumably it's not that hard to keep the
>> "relational" schema with tables for vertices/edges, and then an auxiliary
>> table with adjacent nodes grouped by node, possibly maintained by a
>> couple of triggers. A bit like an "aggregated index", except the queries
>> would have to use it explicitly.
>
> Yes, an auxiliary table would be good, since we don't want to duplicate all
> node-related data, and only store the IDs in the adjacent-nodes hashset.
>
I may be missing something, but as mentioned, I don't quite see how this
would help. What exactly would this save us? If you create an index on
the edge node IDs, you should get the adjacent nodes pretty cheaply from
an index-only scan. Granted, a pre-built hashset is going to be faster,
but if the next step is fetching the node info, that's going to do a lot
of random I/O, dwarfing all of this.
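To make sure we're talking about the same thing, the kind of schema I
have in mind is roughly this (names are just placeholders):

    CREATE TABLE nodes (
        id bigint PRIMARY KEY
        -- plus whatever per-node data is needed
    );

    CREATE TABLE edges (
        src bigint NOT NULL REFERENCES nodes(id),
        dst bigint NOT NULL REFERENCES nodes(id),
        PRIMARY KEY (src, dst)
    );

    -- the (src, dst) index should make this an index-only scan
    SELECT dst FROM edges WHERE src = 42;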
It's entirely possible I'm missing some important aspect. It'd be very
useful to have a couple of example queries illustrating the issue that
we could use to actually test different approaches.
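For example (just a guess at the workload on my part, assuming an
edges(src, dst) table like the one sketched above), something like a
two-hop neighborhood query:

    -- nodes reachable within two hops of node 42
    SELECT DISTINCT e2.dst
      FROM edges e1
      JOIN edges e2 ON e2.src = e1.dst
     WHERE e1.src = 42;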
regards
--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company