From: Tomas Vondra <tomas(at)vondra(dot)me>
To: Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, Melanie Plageman <melanieplageman(at)gmail(dot)com>
Cc: Nazir Bilal Yavuz <byavuz81(at)gmail(dot)com>, Dilip Kumar <dilipbalaut(at)gmail(dot)com>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, Andres Freund <andres(at)anarazel(dot)de>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: BitmapHeapScan streaming read user and prelim refactoring
Date: 2025-02-13 12:47:02
Message-ID: 1e59efd9-4a83-494c-8016-b145fd4e2cca@vondra.me
Lists: pgsql-hackers
On 2/13/25 05:15, Thomas Munro wrote:
> On Thu, Feb 13, 2025 at 1:40 PM Melanie Plageman
> <melanieplageman(at)gmail(dot)com> wrote:
>> On Sun, Feb 9, 2025 at 9:27 AM Tomas Vondra <tomas(at)vondra(dot)me> wrote:
>>> For the nvme RAID (device: raid-nvme), it looks almost exactly the
>>> same, except that with parallel query (page 27) there's a clear area of
>>> regression with eic=1 (look for "column" of red cells). That's a bit
>>> unfortunate, because eic=1 is the default value.
>>
>> So, I feel pretty confident after even more analysis today with Thomas
>> Munro that all of the parallel cases with effective_io_concurrency ==
>> 1 are because master cheats effective_io_concurrency. Therefore, I'm
>> not too worried about those for now. We probably will have to increase
>> effective_io_concurrency before merging this patch, though.
>
> Yeah. Not only did we see it issuing up to 20 or so unauthorised
> overlapping POSIX_FADV_WILLNEED calls, but in the case we studied they
> didn't really do anything at all because it was *postfetching* (!), ie
> issuing advice for blocks it had already read. The I/O was sequential
> (reading all or most blocks in order), so the misdirected advice at
> least kept out of the kernel's way, and it did some really effective
> readahead. Meanwhile the stream version played by the rules and
> respected eic=1, staying only one block ahead. We know that advice for
> sequential blocks blocks readahead, so that seems to explain the red
> for that parameter/test permutation. (Hypothesis based on some
> quick reading: I guess kernel pages faulted in that way don't get
> marked with the internal "readahead was here" flag, which would
> normally cause the following pread() to trigger more asynchronous
> readahead with ever larger physical size). But that code
> in master is also *trying* to do exactly what we're doing, it's just
> failing because of some combination of bugs in the parallel scan code
> (?). It would be absurd to call it the winner: we're setting out
> with the same goal, but for that particular set of parameters and test
> setup, apparently two (?) bugs make a right. Melanie mentioned that
> some other losing cases also involved unauthorised amounts of
> overlapping advice, except there it was random and beneficial and the
> iterators didn't get out of sync with each other, so again it beat us
> in a few red patches, seemingly with a different accidental behaviour.
> Meanwhile we obediently trundled along with just 1 concurrent I/O or
> whatever, and lost.
>
Makes sense, and it likely explains the differences. I agree we should not
hold this against the patch series; it's master that's doing something
wrong. And in most of these cases we wouldn't even do a bitmap scan anyway,
I think; it only happens because we forced the plan.

BTW, how are you investigating those I/O requests? strace? iosnoop? perf?
Something else?
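(On Linux I'd expect the prefetch advice to show up as plain
posix_fadvise(POSIX_FADV_WILLNEED) calls on the relation segment files,
i.e. roughly the standalone sketch below; the file path and block number
are made up, it's just to illustrate what one would look for in a trace.)

#define _XOPEN_SOURCE 600
#include <fcntl.h>
#include <stdio.h>

int
main(void)
{
	/* hypothetical 8kB-block relation segment; any seekable file works */
	int		fd = open("base/16384/16385", O_RDONLY);
	off_t	blksz = 8192;
	int		rc;

	if (fd < 0)
		return 1;

	/* hint that block 42 will be read soon, without actually reading it */
	rc = posix_fadvise(fd, 42 * blksz, blksz, POSIX_FADV_WILLNEED);
	if (rc != 0)
		fprintf(stderr, "posix_fadvise: error %d\n", rc);

	return 0;
}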
> On top of that, read_stream.c is also *trying* to avoid issuing advice
> for access patterns that look sequential, and *trying* to do I/O
> combining, which all works OK in serial BHS, but it breaks down in
> parallel because:
>
>> Thomas mentioned this to me off-list, and I think he's right. We
>> likely need to rethink the way parallel bitmap heap scan workers get
>> block assignments for reading and prefetching to make it more similar
>> to parallel sequential scan. The workers should probably get
>> assignments of a range of blocks. On master, each worker does end up
>> issuing reads/fadvises for a bunch of blocks in a row -- even though
>> that isn't the intent of the parallel bitmap table scan
>> implementation. We are losing some of that with the patch -- but only
>> because it is behaving as you would expect given the implementation
>> and design. I don't consider that a blocker, though.
>
> + /*
> + * The maximum number of tuples per page is not large (typically 256 with
> + * 8K pages, or 1024 with 32K pages). So there's not much point in making
> + * the per-page bitmaps variable size. We just legislate that the size is
> + * this:
> + */
> + OffsetNumber offsets[MaxHeapTuplesPerPage];
> } TBMIterateResult;
>
> Seems to be 291? So sizeof(TBMIterateResult) must be somewhere
> around 588 bytes? There's one of those for every buffer, and we'll
> support queues of up to effective_io_concurrency (1-1000) *
> io_combine_limit (1-128?), which would require ~75MB of
> per-buffer-data with maximum settings. That's a lot of memory to have
> to touch as you whizz around the circular queue even when you don't
> really need it. With more typical numbers like 10 * 32 it's ~90kB,
> which I guess is OK. I wonder if it would be worth trying to put a
> small object in there instead that could be expanded to the results
> later, cf f6bef362 for TIDStore, but I'm not sure if it's a blocker,
> just an observation.
Not sure I follow. Doesn't f6bef362 do quite the opposite, i.e. remove the
repalloc() calls and replace them with a local on-stack buffer sized for
MaxOffsetNumber items? Which is 2048 for 8kB pages, I think.
Note: Is it correct to use MaxHeapTuplesPerPage for TBMIterateResult, or
should it not be tied to heap?
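Just to make sure we're comparing the same things, here's how I read the
two sizing options (a rough sketch assuming 8kB pages; the struct name and
field layout are made up, not the actual patch):

#include "postgres.h"
#include "access/htup_details.h"	/* MaxHeapTuplesPerPage: 291 for 8kB pages */
#include "storage/block.h"
#include "storage/off.h"			/* MaxOffsetNumber: 2048 for 8kB pages */

typedef struct SketchIterateResult
{
	BlockNumber		blockno;
	int				ntuples;
	bool			recheck;

	/*
	 * Fixed-size, heap-specific sizing as in the quoted hunk: the array
	 * alone is 291 * sizeof(OffsetNumber) = 582 bytes, so the ~588 bytes
	 * estimated above is in the right ballpark.  Sizing it by
	 * MaxOffsetNumber instead (not tied to heap) would make the array
	 * 2048 * 2 = 4096 bytes.
	 */
	OffsetNumber	offsets[MaxHeapTuplesPerPage];
} SketchIterateResult;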
I don't have a clear opinion on sizing the offsets buffer - whether it
should be static (like above) or dynamic (expanded using repalloc). But
I'd probably be more worried about efficiency for "normal" configs than
about memory usage when everything is maxed out. The 75MB is not
negligible, but it's also not terrible for machines that would need such
values for these GUCs. Saving memory is nice, but if it means "normal"
configs spend much more time resizing the buffers, that doesn't seem like
a good tradeoff to me.
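For reference, the worst case spelled out, taking the ~588-byte estimate
and the GUC ranges quoted above at face value:

    1000 (effective_io_concurrency) * 128 (io_combine_limit) * ~588 bytes
        = ~75,264,000 bytes, i.e. the ~75MB figure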
The one thing I'd be worried about is the allocations getting too large to
be cached by the memory context (so that we end up doing malloc every
time). But I don't think that's the case here; the chunks would need to be
larger than 8kB.
regards
--
Tomas Vondra