From: Andres Freund <andres(at)anarazel(dot)de>
To: Matthias van de Meent <boekewurm+postgres(at)gmail(dot)com>
Cc: PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: RFC: Packing the buffer lookup table
Date: 2025-02-05 01:14:17
Message-ID: aut3y6gd7gunp5ujlo7bi32rt3zey3qplucnp64y7cvzv6utdv@rw6zbfnos2sl
Lists: pgsql-hackers
Hi,
On 2025-01-30 08:48:56 +0100, Matthias van de Meent wrote:
> Some time ago I noticed that every buffer table entry is quite large at 40
> bytes (+8): 16 bytes of HASHELEMENT header (of which the last 4 bytes are
> padding), 20 bytes of BufferTag, and 4 bytes for the offset into the shared
> buffers array, with generally 8 more bytes used for the bucket pointers.
> (32-bit systems: 32 (+4) bytes)
>
> Does anyone know why we must have the buffer tag in the buffer table?
Having the tag in the entry turns out to actually substantially improve the
CPU cache hit ratio on concurrent workloads. The BufferDesc is obviously
frequently modified: each modification (pin, lock) will pull the cacheline
into modified state on a single core, causing higher latency when accessing
that data from other cores. That's obviously not great for a crucial hashtable... I think it
mainly matters for things like inner index pages etc.
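For reference, the entry layout being discussed is roughly the following
(simplified from hsearch.h, buf_internals.h and buf_table.c; stand-in typedefs
added here so the sketch is self-contained, and field names differ a bit
across versions):

#include <stdint.h>

typedef uint32_t Oid;            /* stand-ins for the usual PostgreSQL typedefs */
typedef uint32_t RelFileNumber;
typedef uint32_t BlockNumber;
typedef int32_t ForkNumber;      /* an enum in the real code */

/* dynahash element header: 16 bytes on 64-bit, the last 4 are padding */
typedef struct HASHELEMENT
{
    struct HASHELEMENT *link;    /* next element in the bucket chain */
    uint32_t    hashvalue;       /* cached hash of the key */
} HASHELEMENT;

/* 20-byte key identifying a buffered page */
typedef struct BufferTag
{
    Oid         spcOid;
    Oid         dbOid;
    RelFileNumber relNumber;
    ForkNumber  forkNum;
    BlockNumber blockNum;
} BufferTag;

/* per-entry payload stored right after the HASHELEMENT: 24 more bytes */
typedef struct BufferLookupEnt
{
    BufferTag   key;             /* tag compared here, without touching the
                                  * frequently dirtied BufferDesc cacheline */
    int         id;              /* index into the shared BufferDesc array */
} BufferLookupEnt;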
It turns out that these misses due to dirtying already cost us rather dearly
when running read-heavy workloads on bigger machines, mainly due to nbtree
code doing things like BufferGetBlockNumber().
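For context, BufferGetBlockNumber() boils down to roughly this (simplified
from bufmgr.c, with the local-buffer branch and asserts dropped); the block
number is read out of the BufferDesc's tag, i.e. from the same cacheline that
every pin/unpin writes to:

BlockNumber
BufferGetBlockNumber(Buffer buffer)
{
    BufferDesc *bufHdr = GetBufferDescriptor(buffer - 1);

    return bufHdr->tag.blockNum;    /* load from the contended BufferDesc line */
}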
It'd be interesting to benchmark how a separate, more densely packed BufferTag
array, shared by the hashtable and the BufferDesc itself, would perform.
Obviously that'd be a somewhat painful change.
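Something along these lines, maybe (hypothetical sketch, names made up): a
read-mostly tag array indexed by buffer id, consulted by the lookup table and
by accessors like BufferGetBlockNumber(), so neither has to read the
BufferDesc:

/* hypothetical: NBuffers densely packed tags in shared memory, written only
 * when a buffer changes identity, not on every pin/lock */
static BufferTag *SharedBufferTags;

static inline BlockNumber
BufferGetBlockNumberPacked(Buffer buffer)
{
    /* reads a cacheline that pins and lock acquisitions never dirty */
    return SharedBufferTags[buffer - 1].blockNum;
}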
> It seems to me we can follow the offset pointer into the shared BufferDesc
> array whenever we find out we need to compare the tags (as opposed to just
> the hash, which is stored and present in HASHELEMENT). If we decide to just
> follow the pointer, we can immediately shave 16 bytes (40%) off the lookup
> table's per-element size, or 24 if we pack the 4-byte shared buffer offset
> into the unused bytes in HASHELEMENT, reducing the memory usage of that
> hash table by ~50%: We'd have 16 bytes for every
> ELEMENT+shared_buffer_offset, plus 8 bytes for every bucket pointer (of
> which there are approximately as many as there are elements), resulting in
> 24 bytes /max_alloc elements.
It does make the code more branchy...
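Roughly what the lookup would turn into (hypothetical sketch, illustrative
names only): a matching hash value is no longer sufficient on its own, the
BufferDesc behind the entry has to be dereferenced to confirm the tag, adding
a dependent load and an extra branch on the hot path:

/* hypothetical packed entry: the dynahash header with the buffer id squeezed
 * into what is currently padding; the tag itself is gone */
typedef struct PackedBufferEnt
{
    struct PackedBufferEnt *link;
    uint32_t    hashvalue;
    int32_t     id;              /* index into the shared BufferDesc array */
} PackedBufferEnt;

static int
BufTableLookupPacked(PackedBufferEnt *bucket, const BufferTag *tag,
                     uint32_t hashvalue)
{
    for (PackedBufferEnt *ent = bucket; ent != NULL; ent = ent->link)
    {
        if (ent->hashvalue != hashvalue)
            continue;            /* cheap filter, same as today */

        /* extra indirection + branch: fetch the tag via the BufferDesc */
        if (BufferTagsEqual(&GetBufferDescriptor(ent->id)->tag, tag))
            return ent->id;
    }

    return -1;                   /* not found */
}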
> Does anyone have an idea on how to best benchmark this kind of patch, apart
> from "running pgbench"? Other ideas on how to improve this? Specific
> concerns?
I'd recommend benchmarking at least the following workloads, all fully shared
buffer cache resident:
1) sequential scans
Tests: very low CPU cache hit rate
2) sequential scan with a parameterized index nested loop over a small table
Tests: mixes a low cache hit rate (seqscan) with a high cache hit rate
(index scan)
3) small range index scan with param nestloop join to index scan
Tests: Very high cache hit ratio
It's unfortunately fairly important to test these both with a single client
*and* a large number of clients on a multi-socket server. The latter makes
cache miss latency much more visible.
It might be worth looking at perf c2c profiles before/after; I'd expect them
to change noticeably. Might also be worth looking at perf stat for cache
misses, hitm, etc.
Greetings,
Andres Freund