Re: BitmapHeapScan streaming read user and prelim refactoring

From: Andres Freund <andres(at)anarazel(dot)de>
To: Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com>
Cc: Melanie Plageman <melanieplageman(at)gmail(dot)com>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, Dilip Kumar <dilipbalaut(at)gmail(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Nazir Bilal Yavuz <byavuz81(at)gmail(dot)com>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>
Subject: Re: BitmapHeapScan streaming read user and prelim refactoring
Date: 2024-03-17 16:38:09
Message-ID: 20240317163809.2y6h7btfuo76r4z7@alap3.anarazel.de
Lists: pgsql-hackers

Hi,

On 2024-03-16 21:25:18 +0100, Tomas Vondra wrote:
> On 3/16/24 20:12, Andres Freund wrote:
> > That would address some of the worst behaviour, but it doesn't really seem to
> > address the underlying problem of the two iterators being modified
> > independently. ISTM the proper fix would be to protect the state of the
> > iterators with a single lock, rather than pushing down the locking into the
> > bitmap code. OTOH, we'll only need one lock going forward, so being economic
> > in the effort of fixing this is also important.
> >
>
> Can you share some details about how you identified the problem, counted
> the prefetches that happen too late, etc? I'd like to try to reproduce
> this to understand the issue better.

There are two aspects. Originally I couldn't reliably reproduce the regression
with Melanie's repro on my laptop. I was finally able to do so after I
a) changed the block device's read_ahead_kb to 0
b) used effective_io_concurrency=1

That made the difference between the BitmapAdjustPrefetchIterator() locations
very significant, something like 2.3s vs 12s.

Besides a lot of other things, I finally added debugging fprintfs printing the
pid, the action (prefetch or read), and the block number. Even tiny excerpts of
the large amount of output that generates show that the two iterators were out
of sync.
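
The debugging output was roughly along these lines (just a sketch, the helper
name and exact call sites are invented, not literally what I had in my tree):

#include <stdio.h>
#include <unistd.h>

/* one line per event: pid, action ("prefetch" or "read"), block number */
static void
trace_block_event(const char *action, unsigned int blockno)
{
	fprintf(stderr, "%d %s %u\n", (int) getpid(), action, blockno);
}

called just before issuing the posix_fadvise and just before actually reading
the block, so the output can easily be grepped/sorted per pid.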

> If I understand correctly, what may happen is that a worker reads blocks
> from the "prefetch" iterator, but before it manages to issue the
> posix_fadvise, some other worker already did pread. Or can the iterators
> get "out of sync" in a more fundamental way?

I agree that the current scheme of using two shared iterators has some fairly
fundamental raciness. But I suspect there's more than that going on right now.

Moving BitmapAdjustPrefetchIterator() later drastically increases the raciness,
because it means table_scan_bitmap_next_block() happens between advancing the
"real" iterator and advancing the "prefetch" iterator.

An example scenario that, I think, leads to the iterators being out of sync,
without there being races between iterator advancement and completing
prefetching:

start:
real -> block 0
prefetch -> block 0
prefetch_pages = 0
prefetch_target = 1

W1: tbm_shared_iterate(real) -> block 0
W2: tbm_shared_iterate(real) -> block 1
W1: BitmapAdjustPrefetchIterator() -> tbm_shared_iterate(prefetch) -> 0
W2: BitmapAdjustPrefetchIterator() -> tbm_shared_iterate(prefetch) -> 1
W1: read block 0
W2: read block 1
W1: BitmapPrefetch() -> prefetch_pages++ -> 1, tbm_shared_iterate(prefetch) -> 2, prefetch block 2
W2: BitmapPrefetch() -> nothing, as prefetch_pages == prefetch_target

W1: tbm_shared_iterate(real) -> block 2
W2: tbm_shared_iterate(real) -> block 3

W2: BitmapAdjustPrefetchIterator() -> prefetch_pages--
W2: read block 3
W2: BitmapPrefetch() -> prefetch_pages++, tbm_shared_iterate(prefetch) -> 3, prefetch block 3

So afaict here we end up prefetching a block that the *same process* had just
read.
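
FWIW, the interleaving above can be replayed with a small standalone program.
It's purely a toy model (all the names are invented and the adjust/prefetch
logic is boiled down to the bare minimum), but it ends the same way, with W2
prefetching the block it itself just read:

#include <stdio.h>

static int	real_next = 0;		/* next block the "real" iterator returns */
static int	prefetch_next = 0;	/* next block the "prefetch" iterator returns */
static int	prefetch_pages = 0;
static int	prefetch_target = 1;

static int
iterate_real(int worker)
{
	int			block = real_next++;

	printf("W%d: real iterator -> block %d\n", worker, block);
	return block;
}

static void
adjust_prefetch_iterator(int worker)
{
	/* either "consume" an already prefetched page or catch the iterator up */
	if (prefetch_pages > 0)
		printf("W%d: adjust: prefetch_pages-- -> %d\n", worker, --prefetch_pages);
	else
		printf("W%d: adjust: prefetch iterator -> block %d\n", worker, prefetch_next++);
}

static void
bitmap_prefetch(int worker)
{
	if (prefetch_pages >= prefetch_target)
	{
		printf("W%d: prefetch: nothing, prefetch_pages == prefetch_target\n", worker);
		return;
	}
	while (prefetch_pages < prefetch_target)
	{
		prefetch_pages++;
		printf("W%d: prefetch block %d (prefetch_pages -> %d)\n",
			   worker, prefetch_next++, prefetch_pages);
	}
}

int
main(void)
{
	iterate_real(1);			/* W1: block 0 */
	iterate_real(2);			/* W2: block 1 */
	adjust_prefetch_iterator(1);
	adjust_prefetch_iterator(2);
	printf("W1: read block 0\n");
	printf("W2: read block 1\n");
	bitmap_prefetch(1);			/* prefetches block 2 */
	bitmap_prefetch(2);			/* nothing to do */

	iterate_real(1);			/* W1: block 2 */
	iterate_real(2);			/* W2: block 3 */
	adjust_prefetch_iterator(2);	/* just decrements prefetch_pages */
	printf("W2: read block 3\n");
	bitmap_prefetch(2);			/* prefetches block 3, which W2 just read */

	return 0;
}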

ISTM that the idea of somehow "catching up" in BitmapAdjustPrefetchIterator(),
separately from advancing the "real" iterator, is pretty ugly for non-parallel
BHS and just straight up broken in the parallel case.
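
To sketch what I mean above by protecting the iterators with a single lock
(again just a toy, not the actual bitmap heap scan code, and all the names are
invented): keep both positions behind one lock and only ever advance them
together, so they cannot drift apart.

#include <pthread.h>

typedef struct SharedIterState
{
	pthread_mutex_t lock;
	int			read_next;		/* next block to actually read */
	int			prefetch_next;	/* next block to prefetch */
} SharedIterState;

/*
 * Advance both positions under one lock.  Returns the block this worker
 * should read; *prefetch_block is set to the block it should prefetch,
 * or -1 if the prefetch position is already far enough ahead.
 */
static int
shared_iter_next(SharedIterState *st, int prefetch_distance, int *prefetch_block)
{
	int			block;

	pthread_mutex_lock(&st->lock);
	block = st->read_next++;
	if (st->prefetch_next < st->read_next)
		st->prefetch_next = st->read_next;	/* never prefetch an already read block */
	if (st->prefetch_next < st->read_next + prefetch_distance)
		*prefetch_block = st->prefetch_next++;
	else
		*prefetch_block = -1;
	pthread_mutex_unlock(&st->lock);
	return block;
}

That glosses over the prefetch_target ramp-up and end-of-scan handling, but the
point is just that the two positions can only ever be observed and advanced
consistently.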

Greetings,

Andres Freund
