Re: RFC: Packing the buffer lookup table

From: Matthias van de Meent <boekewurm+postgres(at)gmail(dot)com>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, James Hunter <james(dot)hunter(dot)pg(at)gmail(dot)com>, Zhang Mingli <zmlpostgres(at)gmail(dot)com>
Subject: Re: RFC: Packing the buffer lookup table
Date: 2025-02-04 18:58:36
Message-ID: CAEze2WiRo4Zu71jwxYmqjq6XK814Avf2-kytaL6n=BreZR2ZbA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Thu, 30 Jan 2025 at 08:48, Matthias van de Meent
<boekewurm+postgres(at)gmail(dot)com> wrote:
>
> Together that results in the following prototype patchset.

Here's an alternative patch, which replaces dynahash in the buffer
lookup table with an open-coded replacement that has fewer
indirections during lookups, and allows for a much increased locality.

Overall, it has less memory usage than our current dynahash in all
cases; but compared to my previous patch it improves memory only in
some cases, in others it uses more memory.

The main design is a separate chaining hash table (similar to and
derived from code of Dynahash) to provide efficient access to free
elements. The innovation over Dynahash is that in this new hash table
the stored elements are slotted between the bucket's element pointers,
thus allowing access to hash elements without further cache misses if
the element that stores the bucket's data is stored on the same cache
line (very possible). Further, the implementation uses int-based
offset pointers, reducing pointer size overhead on 64-bit system and
limiting the hash table to 2^31 elements (for a buffer mapping table
that seems to be fine, given our current buffer count limit of
2^30-1).
Finally, it shows up with accurate memory usage in
pg_shmem_allocations due to the usage of a single allocation,
improving user debugability of resources vs dynahash.

0001 implements this hash table, with some minimal optimizations.
0002 adds the optimization where the hash table prefers to pick the
local hash element if it's not already in use.

Future optimizations could implement memory prefetching (so that
buffer tag compares won't have cache misses for memory stalls),
cacheline-local entry selection, and other optimizations, but that'll
need more effort which I don't want to put into it before getting a
green light on "this might actually be a good step in the right
direction".

As to why simplehash was not chosen: Simplehash doesn't (can't?)
support partitioning due to its open addressing model. By taking the
locality of open addressing and combining it with the partitioning
capabilities of separate chaining we get the best of both worlds (or
at least closer than dynahash got).

Kind regards,

Matthias van de Meent
Neon (https://neon.tech)

PS. This isn't a perfect solution; it still causes approximately
random memory accesses when we look up sequential pages of a relation.
However, the cost of these memory accesses can now be significantly
reduced without breaking APIs or significantly impacting memory
overheads. Replacing the hashtable with another seems fairly
straightforward, while replacing the hashtable with e.g. radixtree.h
would be a very significant undertaking of shared memory allocations
that I'm not comfortable with writing myself.

Attachment Content-Type Size
v1-0001-BufTable-Use-optimized-hash-table-for-buffer-look.patch application/octet-stream 15.4 KB
v1-0002-BufTable-Optimize-allocation-of-BufTable-entries.patch application/octet-stream 8.0 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Matthias van de Meent 2025-02-04 19:03:10 Re: hash_search_with_hash_value is high in "perf top" on a replica
Previous Message Alvaro Herrera 2025-02-04 18:41:12 Re: Support for NO INHERIT to INHERIT state change with named NOT NULL constraints