Re: Limiting overshoot in nbtree's parallel SAOP index scans

From: Peter Geoghegan <pg(at)bowt(dot)ie>
To: Matthias van de Meent <boekewurm(at)gmail(dot)com>
Cc: PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Limiting overshoot in nbtree's parallel SAOP index scans
Date: 2024-10-16 18:52:04
Message-ID: CAH2-Wz=k56btO_MMsA4VQSFVb4es4OsVCOk5m1+tepgyduqa0w@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Fri, Oct 11, 2024 at 10:27 AM Matthias van de Meent
<boekewurm(at)gmail(dot)com> wrote:
> With the introduction of the new SAOP handling in PG17, however, the
> shared state has become a bit more muddied. Because the default has
> changed from "stop scanning at the end of a SAOP value's range" to
> "continue scanning, unless ...", and because it is relatively
> expensive to determine whether we need to stop or continue we release
> the parallel scan to continue before we know if a new primitive scan
> would be preferred.

It's not just relatively expensive. It's essential that _bt_readpage
be able to release the parallel scan at the earliest opportunity (just
as soon as the next sibling link is known). Whereas a parallel backend
worker will only decide that it's likely a good idea to do another
primitive index scan when _bt_readpage has finished executing.

Clearly it would not be sensible to wait until _bt_readpage had gotten
as far as deciding on the need for another descent of the index. To do
so would essentially serialize the scan. There are clear advantages to
doing as little coordination as possible among backends. (I don't
think that you disagree with any of this, but just want to establish
context for anybody else that's reading along -- you seem to want to
do some limited/conditional form of this behavior.)

> Problem
> -------
> The new parallel index scan code can only get into a new primitive
> scan IFF no parallel worker has taken the next block in the time
> between the call to _bt_parallel_release at the start of _bt_readpage,
> and the call to _bt_parallel_primscan_schedule through _bt_checkkeys
> slightly later in _bt_readpage. This makes the code susceptible to
> race conditions

Do you have a test case?

I was aware of the issues in this area when designing the
_bt_parallel_primscan_schedule mechanism. I believe it's true that the
scan is only strictly guaranteed to be able to make forward progress
via naively scanning the next leaf page, again and again. This is a
fairly theoretical point, though.

I believe that it's practically impossible for this to cause us
problems. That would require a group of backends that all want to
start another primitive index scan, but each block each other out,
constantly. Not just once, mind you -- again and again (enough to be
noticeable). This would have to happen in spite of the fact that we
only release one backend per _bt_parallel_release call
(_bt_parallel_release uses ConditionVariableSignal under the hood,
which only releases the oldest worker waiting on the CV, not all
backends).

There is no signalling of condition variables within
_bt_parallel_primscan_schedule, either. So you just have this one call
to ConditionVariableSignal. Why should it be possible to get a
stampede of parallel backends? (I'm not 100% sure if your concern is
even theoretically valid.)

> Given the above, a parallel index scan with ScanKey `id = ANY
> (minvalue, maxvalue)` and bad enough timing/luck could thus scan the
> whole index, while just 2 primitive index scans (the behaviour before
> PG17) are sufficient.
>
> In practice it is quite unlikely that parallel index scans have such
> bad luck, but I don't think we should have luck or timing as a big
> factor for the performance picture in parallel index scans.

It's also possible for a hash table to degenerate when there are
excessively many hash collisions. This is certainly possible with
adversarial inputs -- it's not hard to write a test case that'll
generate some. So hash tables generally also "have luck as a big
factor", in about the same sense. No?

I don't think that it's fundamentally unreasonable to have concerns
about things like this, of course. Safety-critical avionics systems
tend to just not use hash tables at all.

> I think I've found a reasonably elegant solution, which limits the
> number of buffers we can fail to start a new primitive scan to the
> number of parallel workers + 1: If we detect that a parallel worker
> in the same primscan range thinks this is the right moment to start a
> new primscan (but couldn't due to concurrent advancement), we don't
> release the parallel scan immediately, but instead only release it
> after reading the pages contents, to find out if we really should
> start a new primitive scan. If so, we do that, and if not, we instead
> mark that the primitive scan has reached a new primscan range, do some
> cleanup, and then continue business as usual.

I'm opposed to this patch. It'll introduce a behavior that's more
difficult to reason about and test then the one that it purports to
fix.

--
Peter Geoghegan

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Joel Jacobson 2024-10-16 19:13:55 Re: New "raw" COPY format
Previous Message Joel Jacobson 2024-10-16 18:30:47 Re: New "raw" COPY format