Re: Avoiding superfluous buffer locking during nbtree backwards scans

From: Peter Geoghegan <pg(at)bowt(dot)ie>
To: PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Cc: Matthias van de Meent <boekewurm+postgres(at)gmail(dot)com>, Tomas Vondra <tomas(at)vondra(dot)me>
Subject: Re: Avoiding superfluous buffer locking during nbtree backwards scans
Date: 2024-10-10 17:41:13
Message-ID: CAH2-Wzmq9NHGbx4zuP6Yn4LO2cJK=gQuK318QuXq4QMqmQh4WA@mail.gmail.com
Lists: pgsql-hackers

On Tue, Oct 8, 2024 at 12:52 PM Peter Geoghegan <pg(at)bowt(dot)ie> wrote:
> The code in v3-0001-Avoid-unneeded-nbtree-backwards-scan-buffer-locks.patch
> looks reasonable to me. I don't think that there are any impediments
> to committing it sometime this week.

I see significant benefits for parallel index-only backwards scans
with this patch: it isn't hard to see roughly a 10% decrease in
execution time. I imagine that the parallel case benefits
disproportionately because the work we're avoiding is work that takes
place while a backend has seized the scan. Not bad!

I've been looking at this patch in more detail. I think that you
(Matthias) have gone too far in the direction of making the backwards
scan handling within _bt_readnextpage match the forwards scan
handling. You fail to update blkno when we have to deal with a
concurrent split/deletion that changes which left sibling page we need
to read from next -- v3 still used the potentially-stale blkno passed
in by _bt_readnextpage's caller.

Attached v4 deals with that issue. It also goes further in the
direction of simplifying the code in and around _bt_readnextpage.

The code in this area has always seemed to me to be quite a lot more
complicated than it really needed to be. Now seems like a good time to
refactor and simplify. It's likely that we'll need to do more work
nearby to enable prefetching during plain index scans, without
regressing the kill_prior_tuple/_bt_killitems mechanism -- Tomas
Vondra is currently working on that. The mechanism in question is
closely tied to the stuff that I'm simplifying now, particularly the
_bt_drop_lock_and_maybe_pin/TID-recycle-safety business.

The main simplification new to v4 is that it isolates the need to call
_bt_drop_lock_and_maybe_pin to only 2 functions: _bt_readnextpage, and
its new _bt_readfirstpage "sibling" function. The functions have
similar preconditions, and identical postconditions -- all of which
are now clearly documented.
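
To give a flavor of the contract, here's a paraphrase in comment form
(wording mine, not lifted from v4):

/*
 * Paraphrased contract for _bt_readfirstpage/_bt_readnextpage:
 *
 * Precondition: no buffer lock is held on any page; the caller has
 * determined which page to read next (or, for parallel scans, expects
 * the scan's shared state to determine it).
 *
 * Postcondition: so->currPos describes a page with tuples that match
 * the scan's keys, or the scan has ended.  No buffer lock is held on
 * return; the pin is dropped (via _bt_drop_lock_and_maybe_pin)
 * whenever it isn't needed for TID-recycle safety.
 */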

It's not just _bt_drop_lock_and_maybe_pin that's only called from
these 2 functions in v4. All calls to _bt_readpage (plus their
associated PredicateLockPage calls) are now also isolated to those
same 2 functions. This shift enables removing the
_bt_parallel_readpage function entirely -- the point at the top of
_bt_first where _bt_parallel_readpage used to be called can now call
_bt_readnextpage directly instead. ISTM that having a separate
_bt_parallel_readpage function never made much sense -- it was
papering over the fact that the interfaces it uses are so confusing.
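
Schematically, the change to _bt_first's parallel preamble looks
something like this (a simplified sketch with master's argument lists,
not the actual v4 diff):

if (blkno != P_NONE)
{
    /*
     * Another worker already primed the scan.  Master calls the
     * _bt_parallel_readpage wrapper here; with v4 we call
     * _bt_readnextpage directly instead, since the wrapper is gone.
     */
    if (!_bt_readnextpage(scan, blkno, dir))
        return false;
}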

v4 also demotes _bt_steppage to a helper function that sets up a call
to _bt_readnextpage. Note that the 2017 parallel index scan commit
split _bt_steppage in two: it became the current _bt_steppage code,
plus the current _bt_readnextpage code. Making that relationship
explicit seems much clearer. I'm essentially finishing off the process
of splitting _bt_steppage up.
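
In outline, that leaves _bt_steppage looking something like this (a
sketch of the division of labor only -- details and field names are my
guesses at the patch's shape, not quotes from v4):

static bool
_bt_steppage(IndexScanDesc scan, ScanDirection dir)
{
    BTScanOpaque so = (BTScanOpaque) scan->opaque;
    BlockNumber blkno;

    /* ... so->currPos bookkeeping (markPos, _bt_killitems) elided ... */

    /* the next page comes from links saved when we read currPos */
    blkno = ScanDirectionIsForward(dir) ?
        so->currPos.nextPage : so->currPos.prevPage;

    /* consistently unpin before _bt_readnextpage can seize the scan */
    BTScanPosUnpinIfPinned(so->currPos);

    /* _bt_readnextpage does all of the real work from here */
    return _bt_readnextpage(scan, blkno, dir);
}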

Side note: all of these simplifications were enabled by making the
preconditions for calling _bt_readnextpage work independently of the
scan direction (an idea already present in earlier versions of
Matthias' patch). We now always do a
"BTScanPosUnpinIfPinned(so->currPos)" within _bt_steppage, so it's
easy to talk clearly about the preconditions/postconditions for
_bt_readnextpage without directly considering its _bt_steppage caller
(still the main caller, including during parallel index scans).
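
For reference, the macro is already direction-agnostic (quoting
nbtree.h from memory, so double-check me):

#define BTScanPosUnpinIfPinned(scanpos) \
    do { \
        if (BTScanPosIsPinned(scanpos)) \
            BTScanPosUnpin(scanpos); \
    } while (0)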

Performance improvement bonus: We do that
"BTScanPosUnpinIfPinned(so->currPos)" at a slightly different point in
_bt_steppage in v4, as compared to master, even during forwards scans:
we consistently unpin *before* the required next call to
_bt_parallel_seize takes place (unless, of course, it's a serial scan,
which'll never need to call _bt_parallel_seize). Presumably this is
why I see a ~5% increase in throughput for parallel index-only forward
(standard) scans with this v4 -- all parallel scans (backwards and
forwards alike) now receive at least some benefit from this patch. It
makes a certain amount of sense; we should do as little as possible
during any window where our backend has seized the scan.
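
Schematically, the ordering change amounts to this (collapsing the
_bt_steppage/_bt_readnextpage boundary, and using master's
_bt_parallel_seize argument list -- a sketch, not the v4 code):

BTScanPosUnpinIfPinned(so->currPos);    /* cheap, and done unseized */

if (scan->parallel_scan != NULL)
{
    /*
     * Serialization point: no other worker can advance the scan while
     * we have it seized, so the less we do here, the better.
     */
    if (!_bt_parallel_seize(scan, &blkno, false))
        return false;       /* scan already finished */
}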

I do need to do more testing, but the direction of v4 seems promising.

--
Peter Geoghegan

Attachment Content-Type Size
v4-0001-Optimize-and-simplify-nbtree-backwards-scans.patch application/octet-stream 33.3 KB
