pgsql: Avoid parallel nbtree index scan hangs with SAOPs.

From: Peter Geoghegan <pg(at)bowt(dot)ie>
To: pgsql-committers(at)lists(dot)postgresql(dot)org
Subject: pgsql: Avoid parallel nbtree index scan hangs with SAOPs.
Date: 2024-09-17 15:10:55
Message-ID: E1sqZqw-001WYc-QE@gemulon.postgresql.org
Lists: pgsql-committers

Avoid parallel nbtree index scan hangs with SAOPs.

Commit 5bf748b8, which enhanced nbtree ScalarArrayOp execution, made
parallel index scans work with the new design for arrays via explicit
scheduling of primitive index scans. A backend that successfully
scheduled the scan's next primitive index scan saved its backend local
array keys in shared memory. Any backend could pick up the scheduled
primitive scan within _bt_first. This scheme decouples scheduling a
primitive scan from starting the scan (by performing another descent of
the index via a _bt_search call from _bt_first) to make things robust.

The scheme had a deadlock hazard, at least when the leader process
participated in the scan. _bt_parallel_seize had a code path that made
backends that were not in an immediate position to start a scheduled
primitive index scan wait for some other backend to do so instead.
Under the right circumstances, the leader process could wait here
forever: the leader would wait for any other backend to start the
primitive scan, while every worker was busy waiting on the leader to
consume tuples from the scan's tuple queue.
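
The circular wait described above can be written down as a small predicate. This is only an illustrative model, not code from nbtree.c: the WaitSnapshot struct, its fields, and scan_is_deadlocked are hypothetical names, and the model assumes each process is blocked on exactly one thing.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical snapshot of what each participating process is blocked on. */
typedef struct
{
    /* Leader sleeps until some other backend starts the scheduled primscan */
    bool        leader_waits_for_primscan_start;
    /* Per worker: true if it is blocked until the leader drains its queue */
    const bool *worker_blocked_on_queue;
    size_t      nworkers;
} WaitSnapshot;

/*
 * Returns true when the wait-for relationships form a cycle: the leader
 * waits on the workers, and every worker waits on the leader.
 */
static bool
scan_is_deadlocked(const WaitSnapshot *snap)
{
    assert(snap != NULL);

    if (!snap->leader_waits_for_primscan_start)
        return false;           /* leader can still consume tuples */

    for (size_t i = 0; i < snap->nworkers; i++)
    {
        if (!snap->worker_blocked_on_queue[i])
            return false;       /* this worker can start the primitive scan */
    }

    return true;                /* nobody is able to make progress */
}
```

A single worker that is not blocked on its tuple queue is enough to break the cycle in this model, which is why the hang needed "the right circumstances" to appear.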

To fix, don't wait for a scheduled primitive index scan to be started by
some other eligible backend from within _bt_parallel_seize (when the
calling backend isn't in a position to do so itself). Return false
instead, while recording that the scan has a scheduled primitive index
scan in backend local state. This leaves the backend in the same state
as the existing case where a backend schedules (or tries to schedule)
another primitive index scan from within _bt_advance_array_keys, before
calling _bt_parallel_seize. _bt_parallel_seize already handles that
case by returning false without waiting, and without unsetting the
backend local state. Leaving the backend in this state enables it to
start a previously scheduled primitive index scan once it gets back to
_bt_first.

Oversight in commit 5bf748b8, which enhanced nbtree ScalarArrayOp
execution.

Matthias van de Meent, with tweaks by me.

Author: Matthias van de Meent <boekewurm+postgres(at)gmail(dot)com>
Reported-By: Tomas Vondra <tomas(at)vondra(dot)me>
Reviewed-By: Peter Geoghegan <pg(at)bowt(dot)ie>
Discussion: https://postgr.es/m/CAH2-WzmMGaPa32u9x_FvEbPTUkP5e95i=QxR8054nvCRydP-sw@mail.gmail.com
Backpatch: 17-, where nbtree SAOP execution was enhanced.

Branch
------
master

Details
-------
https://git.postgresql.org/pg/commitdiff/d8adfc18bebfb1b69b456a00e67453775a77f594

Modified Files
--------------
src/backend/access/nbtree/nbtree.c | 53 ++++++++++++++++++++++++--------------
1 file changed, 33 insertions(+), 20 deletions(-)
