From: | Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com> |
---|---|
To: | Joel Jacobson <joel(at)compiler(dot)org>, pgsql-hackers(at)lists(dot)postgresql(dot)org |
Subject: | Re: Do we want a hashset type? |
Date: | 2023-06-08 10:19:07 |
Message-ID: | 86f8a027-e6d4-65c0-7349-faa3301c8b8b@enterprisedb.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-hackers |
On 6/8/23 11:41, Joel Jacobson wrote:
> On Wed, Jun 7, 2023, at 19:37, Tomas Vondra wrote:
>> Interesting, considering how dumb the the hash table implementation is.
>
> That's promising.
>
Yeah, not bad for sleep-deprived on-plane hacking ...
There's a bunch of stuff that needs to be improved to make this properly
usable, like:
1) better hash table implementation
2) input/output functions
3) support for other types (now it only works with int32)
4) I wonder if this might be done as an array-like polymorphic type.
5) more efficient storage format, with versioning etc.
6) regression tests
Would you be interested in helping with / working on some of that? I
don't have immediate need for this stuff, so it's not very high on my
TODO list.
>>> I tested Neo4j and the results are surprising; it appears to be significantly *slower*.
>>> However, I've probably misunderstood something, maybe I need to add some index or something.
>>> Even so, it's interesting it's apparently not fast "by default".
>>>
>>
>> No idea how to fix that, but it's rather suspicious.
>
> I've had a graph-db expert review my benchmark, and he suggested adding an index:
>
> CREATE INDEX FOR (n:User) ON (n.id)
>
> This did improve the execution time for Neo4j a bit, down from 819 ms to 528 ms, but PostgreSQL 299 ms is still a win.
>
> Benchmark here: https://github.com/joelonsql/graph-query-benchmarks
> Note, in this benchmark, I only test the naive RECURSIVE CTE approach using array_agg(DISTINCT ...).
> And I couldn't even test the most connected user with Neo4j, the query never finish for some reason,
> so I had to test with a less connected user.
>
Interesting. I'd have expected the graph db to be much faster.
> The graph expert also said that other more realistic graph use-cases might be "multi-relational",
> and pointed me to a link: https://github.com/totogo/awesome-knowledge-graph#knowledge-graph-dataset
> No idea how such multi-relational datasets would affect the benchmarks.
>
Not sure either, but I don't have ambition to improve everything at
once. If the hashset improves one practical use case, fine with me.
> I think we have already strong indicators that PostgreSQL with a hashset type will from a relative
> performance perspective, do just fine processing basic graph queries, even with large datasets.
>
> Then there will always be the case when users primarily write very different graph queries all day long,
> who might prefer a graph query *language*, like SQL/PGQ in SQL:2023, Cypher or Gremlin.
>
Right. IMHO the query language is a separate thing, you still need to
evaluate the query somehow - which is where hashset applies.
regards
--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
From | Date | Subject | |
---|---|---|---|
Next Message | Amit Kapila | 2023-06-08 10:29:38 | Re: running logical replication as the subscription owner |
Previous Message | Hannu Krosing | 2023-06-08 10:15:58 | Re: Let's make PostgreSQL multi-threaded |