Re: Avoiding superfluous buffer locking during nbtree backwards scans

From: Peter Geoghegan <pg(at)bowt(dot)ie>
To: PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Avoiding superfluous buffer locking during nbtree backwards scans
Date: 2024-10-08 16:52:40
Message-ID: CAH2-WzkBTuFv7W2+84jJT8mWZLXVL0GHq2hMUTn6c9Vw=eYrCw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Fri, Jul 5, 2024 at 12:47 PM Peter Geoghegan <pg(at)bowt(dot)ie> wrote:
> On HEAD, the following query requires 24 buffer hits (I'm getting a
> standard/forward index scan for this):
>
> select
> abalance
> from
> pgbench_accounts
> where
> aid in (1, 500, 1000, 1500, 2000, 3000) order by aid asc;
>
> However, we don't see that with the backwards scan variant:
>
> select
> abalance
> from
> pgbench_accounts
> where
> aid in (1, 500, 1000, 1500, 2000, 3000) order by aid desc;
>
> We actually see 30 buffer hits for this on HEAD. Whereas with the
> patch, both variants get only 24 buffer hits -- there's parity, at
> least in cases like these two.

I'll now present a similar example that shows the remaining issue that
my patch (which became my commit 3f44959f) didn't address, and how the
issue if fixed by Matthias' more recent follow-up patch:

$ pgbench -i -s 1

Now run:

explain (analyze, buffers, costs off, summary off)
select
abalance
from
pgbench_accounts order by aid;

With a warm cache, this query gets a full index scan requiring 1915
buffer hits. However, we still see worse performance for an equivalent
backwards scan:

explain (analyze, buffers, costs off, summary off)
select
abalance
from
pgbench_accounts order by aid desc;

On master, the latter query gets 2188 buffer hits -- so clearly my
commit 3f44959f left money on the table. These extra buffer hits just
don't make sense.

Matthias' patch (I tested
v3-0001-Avoid-unneeded-nbtree-backwards-scan-buffer-locks.patch) will
result in both variants requiring 1915 buffer hits. (Plus, of course,
query execution should be faster with Matthias' new approach.)

Though we do need special extra steps for dealing with concurrent page
splits during backwards scans (steps involving a reread of the
original page when its sibling link turns out to be stale), there
hasn't been a concurrent split here -- which is why I find the
inconsistency to be unnatural. Matthias' patch fixes the problem by
passing _bt_walk_left the left link that will now have been stashed in
the local scan state back when the page was read by _bt_readpage, so
that it (_bt_walk_left) can optimistically *start* there (and *not*
start at the page that has already been read, as on the master
branch).

Of course, the left link might be stale, but it was always inherently
necessary for _bt_walk_left to be able to deal with that. Technically
we're being more optimistic about it now, but that optimism is
extremely well justified (concurrent page splits are rare). Arguably,
this latest patch from Matthias actually makes the code simpler. It
makes the backwards scan case more similar to the forwards scan case.
This is another missed opportunity for commit 2ed5b87f96, I'd say --
that 2015 commit left things in a slightly odd in-between state, which
this patch fully corrects.

Note: my example was deliberately constructed to use an index scan
backwards, rather than an index-only scan backwards. While both types
of backwards scan will benefit in the same way (less buffer lock
traffic), you'll only see a reduction in buffers hit for an
EXPLAIN(ANALYZE, BUFFERS) involving a plain index scan backwards. This
is due to the fact that the mechanism added by commit 2ed5b87f96 will
only drop a leaf page buffer pin early for plain index scans, never
for index-only scans. My example also wouldn't make the difference
apparent with an unlogged table, for the same reason -- this
difference isn't really important, but seems worth noting to avoid any
confusion.

A nice side benefit of this approach is that it'll make it easier to
add better coverage for _bt_walk_left. The case where _bt_walk_left
notices a stale left link will still be very rare (we're really not
being all that optimistic in doing things this way), but it'll now be
easier to hit on purpose.

The code in v3-0001-Avoid-unneeded-nbtree-backwards-scan-buffer-locks.patch
looks reasonable to me. I don't think that there are any impediments
to committing it sometime this week.

--
Peter Geoghegan

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Joel Jacobson 2024-10-08 16:58:34 Re: [PATCH] pg_permissions
Previous Message Tom Lane 2024-10-08 16:47:08 Re: overflow bug for inhcounts