Scrollable cursors that use an nbtree index scan are subtly broken with array keys (bug affecting 17+)

From: Peter Geoghegan <pg(at)bowt(dot)ie>
To: PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Scrollable cursors that use an nbtree index scan are subtly broken with array keys (bug affecting 17+)
Date: 2024-10-29 16:01:24
Message-ID: CAH2-Wznv49bFsE2jkt4GuZ0tU2C91dEST=50egzjY2FeOcHL4Q@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Attached test case illustrates how a bug in nbtree's handling of
primitive index scan scheduling can lead to wrong answers when a
scrollable cursor is used. This happens when the scan direction
changes for a scan with array keys, at exactly the wrong time, when
certain precise conditions are met (more on that below, and in the
attached test case). The bug in question is from my commit 5bf748b8.
Both 17 and master are affected.

Attached is the fix. It's fairly simple. I propose to address the
issue by making sure to unset the "needs primitive index scan" flag
when we detect that it is invalid for the current (also recently
changed) scan direction. This happens at the point that we "change our
mind and step to the page to the left, and not to the right as
initially expected" (since the test case happens to work by starting
the scan in the forward direction and then switching it to scan
backwards). In other words, with the patch applied we'll *reliably*
detect and recover from violations of _bt_readpage's soft assumption
that the scan direction won't change before the prim scan it schedules
is allowed to get started -- _bt_steppage is the right choke point for
making sure the soft assumption cannot ever really be violated.

FWIW the bug is far more subtle than the fix itself may suggest. Our
general approach to dealing with scrollable cursors that change
direction mostly works already. The test case reveals one particular
way in which it might not quite work as intended.

To recap, we generally won't get confused about primitive scan
direction in this way (in spite of the so->needPrimScan flag being
set) because we'll usually *also* find the currPos "moreLeft" flag set
to 'true' by the time _bt_readnextpage is reached (since we set both
moreLeft and moreRight to 'true' within _bt_readfirstpage iff
_bt_readfirstpage finds so->needPrimScan set from before). As a result
of all this, the so->needPrimScan flag won't usually be effective at
all in most cases where the scan direction subsequently changes before
we leave the page in question. Everything generally works out because
the scan's array keys will simply get reset (in a way that considers
the new scan direction) when we next call
_bt_readpage/_bt_checkkeys/_bt_advance_array_keys (this time in the
opposite scan direction to last time). And so _bt_first won't ever get
the chance to get called, so everything still works out. My test case
is designed in such a way as to make recovering from the problem via
_bt_readfirstpage's "set both moreLeft and moreRight to 'true' within
_bt_readfirstpage iff _bt_readfirstpage finds so->needPrimScan set
from before" behavior not save us -- we don't get that behavior for
the very first _bt_first, which is how things work within the test
case (the test case's initial call to _bt_readfirstpage will set
moreLeft to 'false', since it'll be dealing with the first page for
the entire top-level scan, not just one individual primitive index
scan).

Another very specific adversarial aspect of my test case seems worth
drawing attention to, for context: the test case fetches tuples from
the cursor in a way that is precisely aligned to a boundary between a
pair of sibling leaf pages. IOW, for things to visibly break as they
do in the test case, the scan needs to reach the precise point of
scheduling the next primitive index scan, *without* fetching even a
single additional index tuple in the same direction (same direction as
the one _bt_readpage used when it scheduled that same primitive index
scan) before backing up/changing the scan direction. And so if we were
to (say) fetch 11 rows (instead of 10) in the first fetch statement,
then we'll actually end up performing the scheduled primitive scan as
expected, in order to be able to fetch that 11th index tuple -- a
factor that'll deny _bt_first the opportunity to get called with the
wrong array keys for the then-current scan direction.

The repro is complicated, but fortunately the fix itself is much simpler.

--
Peter Geoghegan

Attachment Content-Type Size
v1-0001-Avoid-nbtree-array-scrollable-cursor-confusion.patch application/octet-stream 1.8 KB
btfirst_array_confusion_repro.sql application/octet-stream 2.4 KB

Browse pgsql-hackers by date

  From Date Subject
Next Message Jesper Pedersen 2024-10-29 16:03:04 Re: protocol-level wait-for-LSN
Previous Message vignesh C 2024-10-29 15:20:15 Re: Pgoutput not capturing the generated columns