Re: RFC: Packing the buffer lookup table

From: Andres Freund <andres(at)anarazel(dot)de>
To: Matthias van de Meent <boekewurm+postgres(at)gmail(dot)com>
Cc: PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: RFC: Packing the buffer lookup table
Date: 2025-02-05 01:14:17
Message-ID: aut3y6gd7gunp5ujlo7bi32rt3zey3qplucnp64y7cvzv6utdv@rw6zbfnos2sl
Lists: pgsql-hackers

Hi,

On 2025-01-30 08:48:56 +0100, Matthias van de Meent wrote:
> Some time ago I noticed that every buffer table entry is quite large at 40
> bytes (+8): 16 bytes of HASHELEMENT header (of which the last 4 bytes are
> padding), 20 bytes of BufferTag, and 4 bytes for the offset into the shared
> buffers array, with generally 8 more bytes used for the bucket pointers.
> (32-bit systems: 32 (+4) bytes)
>
> Does anyone know why we must have the buffer tag in the buffer table?
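
For reference, the entry layout being discussed, paraphrased from
src/include/utils/hsearch.h and src/backend/storage/buffer/buf_table.c
(sizes for 64-bit builds):

typedef struct HASHELEMENT
{
    struct HASHELEMENT *link;   /* 8 bytes: next entry in bucket chain */
    uint32      hashvalue;      /* 4 bytes: entry's hash value */
                                /* + 4 bytes of alignment padding */
} HASHELEMENT;                  /* 16 bytes */

typedef struct BufferLookupEnt
{
    BufferTag   key;            /* 20 bytes: rlocator, fork, block */
    int         id;             /* 4 bytes: index into shared buffers */
} BufferLookupEnt;              /* 24 bytes, stored after the header */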

It turns out to substantially improve the CPU cache hit ratio on concurrent
workloads. The BufferDesc is obviously frequently modified. Each modification
(pin, lock) pulls the cacheline into modified state on a single core, causing
higher latency when that data is accessed on other cores. That's obviously
not great for a crucial hashtable... I think it mainly matters for things
like inner index pages etc.

It turns out that these misses due to dirtying already cost us rather dearly
when running read-heavy workloads on bigger machines, mainly due to nbtree
code doing things like BufferGetBlockNumber().
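
To illustrate: the tag lives in the same struct, and typically the same
cacheline, as the atomically updated state word, so even a pure read like
BufferGetBlockNumber() can stall on pin/unpin traffic from other cores.
Abridged from buf_internals.h and bufmgr.c:

typedef struct BufferDesc
{
    BufferTag   tag;            /* ID of page contained in buffer */
    int         buf_id;         /* buffer's index number (from 0) */
    pg_atomic_uint32 state;     /* refcount, usage count, flags; updated
                                 * atomically on every pin/unpin */
    /* ... wait_backend_pgprocno, freeNext, content_lock ... */
} BufferDesc;

BlockNumber
BufferGetBlockNumber(Buffer buffer)
{
    BufferDesc *bufHdr = GetBufferDescriptor(buffer - 1);

    /* reads the tag from the cacheline other cores keep dirtying */
    return bufHdr->tag.blockNum;
}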

It'd be interesting to benchmark how a separate, more densely packed
BufferTag array, shared by the hashtable and the BufferDesc itself, would
perform. Obviously that'd be a somewhat painful change.
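
Roughly along these lines (a sketch only, with entirely hypothetical names):

/*
 * Hypothetical sketch: a read-mostly, densely packed tag array indexed
 * by buffer id, referenced by both the hash table and the descriptors
 * (which would drop their embedded tag).  Tags change only at buffer
 * replacement, so these cachelines could stay in shared state instead
 * of bouncing around with pin/unpin traffic.
 */
extern BufferTag *SharedBufferTags;  /* NBuffers entries in shared memory */

static inline const BufferTag *
BufTagArrayGet(int buf_id)
{
    return &SharedBufferTags[buf_id];
}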

> It seems to me we can follow the offset pointer into the shared BufferDesc
> array whenever we find out we need to compare the tags (as opposed to just
> the hash, which is stored and present in HASHELEMENT). If we decide to just
> follow the pointer, we can immediately shave 16 bytes (40%) off the lookup
> table's per-element size, or 24 if we pack the 4-byte shared buffer offset
> into the unused bytes in HASHELEMENT, reducing the memory usage of that
> hash table by ~50%: We'd have 16 bytes for every
> ELEMENT+shared_buffer_offset, plus 8 bytes for every bucket pointer (of
> which there are approximately as many as there are elements), resulting in
> 24 bytes /max_alloc elements.

It does make the code more branchy...
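
E.g. the tag comparison on a hash probe would turn into something like this
(hypothetical names; HashElementGetBufferId stands in for however the offset
ends up packed into the element):

static inline bool
BufTableEntryMatches(HASHELEMENT *elem, uint32 hashvalue,
                     const BufferTag *tag)
{
    int         buf_id;

    if (elem->hashvalue != hashvalue)
        return false;           /* cheap: stays within the hash element */

    /* extra dependent load and branch: chase the stored offset into the
     * (possibly contended) BufferDesc array to compare the full tags */
    buf_id = HashElementGetBufferId(elem);
    return BufferTagsEqual(&GetBufferDescriptor(buf_id)->tag, tag);
}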

> Does anyone have an idea on how to best benchmark this kind of patch, apart
> from "running pgbench"? Other ideas on how to improve this? Specific
> concerns?

I'd recommend benchmarking at least the following workloads, all fully shared
buffer cache resident:

1) sequential scans

Tests: very low likelihood of CPU cache hits

2) sequential scan with a parameterized index nested loop over a small table

Tests: mixes a low cache-hit likelihood (seqscan) with a high cache-hit rate
(index scan)

3) small range index scan with param nestloop join to index scan

Tests: Very high cache hit ratio

It's unfortunately fairly important to test these both with a single client
*and* a large number of clients on a multi-socket server. The latter makes
cache miss latency much more visible.

It might be worth looking at perf c2c profiles before/after; I'd expect them
to change noticeably. Might also be worth looking at perf stat for cache
misses, hitm, etc.

Greetings,

Andres Freund
