Re: allowing broader use of simplehash

From: Andres Freund <andres(at)anarazel(dot)de>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: allowing broader use of simplehash
Date: 2019-12-10 21:59:54
Message-ID: 20191210215954.tmtwamwy7w47wmu3@alap3.anarazel.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi,

Neat!

On 2019-12-10 13:07:02 -0500, Robert Haas wrote:
> I recently became annoyed while working on patch A that I could not
> use simplehash in shared memory, and then I became annoyed again while
> working on patch B that I could not use simplehash in frontend code.
> So here are a few patches for discussion.

I wanted to use it in frontend code a couple times as well.

> A significant problem in either case is that a simplehash wants to
> live in a memory context; no such thing exists either for data in
> shared memory nor in frontend code. However, it seems to be quite easy
> to provide a way for simplehash to be defined so that it doesn't care
> about memory contexts. See 0001.

I wonder if we shouldn't instead just go for an "implicit" memory
context instead. It's a bit ugly to have a growing set of different
signatures.

> As far as frontend code goes, the only other problem I found is that
> it makes use of elog() to signal some internal-ish messages. It seemed
> to me that the easiest thing to do was, if FRONTEND is defined, use
> pg_log_error(...) instead of elog(ERROR, ...). For the one instance of
> elog(LOG, ...) in simplehash.h, I chose to use pg_log_info(). It's not
> really equivalent, but it's probably the closest thing that currently
> exists, and I think it's good enough for what's basically a debugging
> message. See 0002.

Yea, I think that's fine.

> I think those changes would also be enough to allow simplehash to be
> used in a dynamic shared area (DSA). Using it in the main shared
> memory segment seems more problematic, because simplehash relies on
> being able to resize the hash table. Shared hash tables must have a
> fixed maximum size, but with dynahash, we can count on being able to
> use all of the entries without significant performance degradation.
> simplehash, on the other hand, uses linear probing and relies on being
> able to grow the hash table as a way of escaping collisions. By
> default, the load factor is not permitted to drop below 0.1, so to
> mimic the collision-avoidance behavior that we get in backend-private
> uses of simplehash, we'd have to overallocate by 10x, which doesn't
> seem desirable.

It'd be fine to set SH_GROW_MIN_FILLFACTOR to something higher, for many
uses. I've only added that after the fact, because somebody demonstrated
a workload with SQL level data that had a *lot* of conflicts with our
hash functions. But that shouldn't be a concern for most other uses.

> I'd really like to have an alternative to dynahash, which is awkward
> to use and probably not particularly fast, but I'm not sure simplehash
> is it. Maybe what we really need is a third (or nth) hash table
> implementation.

I think it depends a bit on what use-cases you'd like to cover? I think
there's unfortunately a lot of tradeoffs here that are hard to square:

1) For performance, we want the hashtable code to be specialized for the
specific key/value combination. I'm not aware of a way to do that
without some code generation thing like simplehash. Being able to use
simpler pointer math by having fixed sizes, and avoiding indirect
function calls, is important for performance.

It's fairly annoying to have to do the song-and-dance for simplehash
when it's just a local lookup table or something.

2) For performance, using a chained hashtable turns out to be
problematic, if the hashtable will often get big (for small tables
the CPU caches makes it ok). It's hard to avoid reallocations
(and/or smoother growth) for non-chaining tables however.

3) For lots of one-off uses of hashtables that aren't performance
critical, we want a *simple* API. That IMO would mean that key/value
end up being separately allocated pointers, and that just a
comparator is provided when creating the hashtable.

4) For some hashtables it's important to be very concurrent - but it's
considerably harder to do that with an open addressing one.

While I don't think it's possible to avoid compromise on all these
aspects, I think it'd be a lot more realistic to have one implementation
fulfilling most needs (except perhaps the concurrency part) if we didn't
have the limitations of C. This kind of thing really is one where
e.g. C++ style templates are just extremely hard to beat in C.

Greetings,

Andres Freund

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Jens-Wolfhard Schicke-Uffmann 2019-12-10 22:08:34 Re: Contention on LWLock buffer_content, due to SHARED lock(?)
Previous Message Jens-Wolfhard Schicke-Uffmann 2019-12-10 21:44:17 Re: Contention on LWLock buffer_content, due to SHARED lock(?)