From: Thomas Munro <thomas(dot)munro(at)gmail(dot)com>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Some read stream improvements
Date: 2025-03-13 14:58:52
Message-ID: CA+hUKGKH2mwj=mpCobC4ZyBdR4ToW2hpQcYFDivj40fxf4yQow@mail.gmail.com
Lists: pgsql-hackers
On Wed, Mar 12, 2025 at 8:29 AM Andres Freund <andres(at)anarazel(dot)de> wrote:
> On 2025-03-12 07:35:46 +1300, Thomas Munro wrote:
> > On Thu, Feb 27, 2025 at 11:20 PM Andres Freund <andres(at)anarazel(dot)de> wrote:
> > > On 2025-02-27 11:19:55 +1300, Thomas Munro wrote:
> > I wonder if we should use temp_buffers - 100? Then leave the minimum GUC
> > value at 100 still, so you have an easy way to test with 0, 1,
> > ... additional buffers?
>
> I think that just makes it harder to test the exhaustion scenario without
> really fixing anything?
Not sure I agree yet, but I'll come back to this in a bit (I think
there may be something here worth thinking about some more, but it's
not in the way of committing these patches).
> > +/* see GetAdditionalPinLimit() */
> > +uint32
> > +GetAdditionalLocalPinLimit(void)
> > +{
> > +    Assert(NLocalPinnedBuffers <= num_temp_buffers);
> > +    return num_temp_buffers - NLocalPinnedBuffers;
> > +}
>
> This doesn't behave quite the way GetAdditionalPinLimit() does - it can return
> 0. Which makes some sense, pinning an additional buffer will always
> fail. Perhaps worth calling out though?
No, GetAdditionalPinLimit() works that way too. It's only
LimitAdditional[Local]PinLimit() that has the special "you can always
have one" logic that I needed to escape from. But yes, I should
highlight that in a comment: done, above GetAdditionalPinLimit().
GetAdditionalLocalPinLimit() just tells you to see the shared version.
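To make the contract concrete, here's a sketch (my own illustration,
not code from the patches; plan_read_size() is a hypothetical
caller-side helper) of how a caller that must make progress would
treat a zero return value:

/*
 * Sketch only.  GetAdditionalPinLimit() can legitimately return 0.  A
 * caller that holds no pins yet still needs one buffer to make
 * progress; that "you can always have one" rule lives in
 * LimitAdditional[Local]PinLimit(), not here, so the caller applies it
 * explicitly.
 */
static int
plan_read_size(int wanted_nblocks, int pinned_buffers)
{
    uint32      budget = GetAdditionalPinLimit();   /* may be 0 */

    if (pinned_buffers == 0 && budget == 0)
        budget = 1;             /* forced minimum so the stream advances */

    return Min(wanted_nblocks, (int) budget);
}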
> >  static void
> >  read_stream_look_ahead(ReadStream *stream, bool suppress_advice)
> >  {
> >      while (stream->ios_in_progress < stream->max_ios &&
> > -           stream->pinned_buffers + stream->pending_read_nblocks < stream->distance)
> > +           ((stream->pinned_buffers == 0 && stream->distance > 0) ||
> > +            stream->pinned_buffers + stream->pending_read_nblocks < stream->distance))
>
> What does the new "stream->pinned_buffers == 0 && stream->distance > 0" really
> mean? And when would it be true when the pre-existing condition wouldn't
> already be true?
Well, the reason is basically that the distance can get chomped lower
than pending_read_nblocks if you recently hit the pin limit (for
example, the distance might drop to 2 while a 4-block pending read is
still queued), and we have to be able to start that I/O anyway, or at
least one block of it, to make progress. But I realised that was a
stupid way to handle it, and actually I had screwed up further down
too; the right way is just:
@@ -382,15 +435,25 @@ read_stream_look_ahead(ReadStream *stream, bool suppress_advice)
      * io_combine_limit size once more buffers have been consumed.  However,
      * if we've already reached io_combine_limit, or we've reached the
      * distance limit and there isn't anything pinned yet, or the callback has
-     * signaled end-of-stream, we start the read immediately.
+     * signaled end-of-stream, we start the read immediately.  Note that the
+     * pending read could even exceed the distance goal, if the latter was
+     * reduced on buffer limit exhaustion.
      */
     if (stream->pending_read_nblocks > 0 &&
         (stream->pending_read_nblocks == stream->io_combine_limit ||
-         (stream->pending_read_nblocks == stream->distance &&
+         (stream->pending_read_nblocks >= stream->distance &&
           stream->pinned_buffers == 0) ||
          stream->distance == 0) &&
         stream->ios_in_progress < stream->max_ios)
         read_stream_start_pending_read(stream, suppress_advice);
> >  {
> >      BlockNumber blocknum;
> >      int16       buffer_index;
> >      void       *per_buffer_data;
> >
> > -    if (stream->pending_read_nblocks == io_combine_limit)
> > +    /* If have a pending read that can't be extended, start it now. */
> > +    Assert(stream->pinned_buffers + stream->pending_read_nblocks <=
> > +           stream->max_pinned_buffers);
> > +    if (stream->pending_read_nblocks == io_combine_limit ||
> > +        (stream->pinned_buffers == 0 &&
> > +         stream->pending_read_nblocks == stream->max_pinned_buffers))
> >      {
> >          read_stream_start_pending_read(stream, suppress_advice);
> >          suppress_advice = false;
> > @@ -360,14 +409,15 @@ read_stream_look_ahead(ReadStream *stream, bool suppress_advice)
> >          /* We have to start the pending read before we can build another. */
> >          while (stream->pending_read_nblocks > 0)
> >          {
> > -            read_stream_start_pending_read(stream, suppress_advice);
> > -            suppress_advice = false;
> > -            if (stream->ios_in_progress == stream->max_ios)
> > +            if (!read_stream_start_pending_read(stream, suppress_advice) ||
> > +                stream->ios_in_progress == stream->max_ios)
> >              {
> > -                /* And we've hit the limit.  Rewind, and stop here. */
> > +                /* And we've hit a buffer or I/O limit.  Rewind and wait. */
> >                  read_stream_unget_block(stream, blocknum);
> >                  return;
> >              }
> > +
> > +            suppress_advice = false;
> >          }
>
> If read_stream_start_pending_read() returns false because we hit the pin
> limit, does it really help to call read_stream_unget_block()? IIUC that'll
> defer one block for later - but what about the other buffers in a multi-block
> read?
They are already represented by (pending_read_blocknum,
pending_read_nblocks). We were unable to append this block number to
the pending read because the read is already full-sized or the newly
acquired block number isn't consecutive, but we were also unable to
start the pending read to get it out of the way. That was a
pre-existing edge case that you could already hit if the read turned
out to be short: i.e. the remaining part of the pending read is still
in your way, and now you've reached stream->max_ios so you can't start
it. So we had to put the new block number aside for later.
This change piggy-backs on that approach for buffer starvation:
read_stream_start_pending_read() can now decline to start even a
partial read. In fact it usually declines, unless it is forced to
accept a short read because stream->pinned_buffers == 0 (i.e. we have
to do whatever we can to make progress).
It's OK that pending_read_nblocks exceeds what we can start right now:
we still remember it, and we can always start it one block at a time
if it comes to it. Make sense?
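To spell that out, the decline logic has roughly this shape (a
simplified sketch of read_stream_start_pending_read() as described
above, not the actual patch; the local variable names are mine):

static bool
read_stream_start_pending_read_sketch(ReadStream *stream)
{
    int16       nblocks = stream->pending_read_nblocks;
    uint32      budget = GetAdditionalPinLimit();   /* may be 0 */

    if ((uint32) nblocks > budget)
    {
        /* Can't pin enough buffers for the whole pending read. */
        if (stream->pinned_buffers > 0)
            return false;       /* decline; caller rewinds and waits */

        /* Nothing pinned: accept a short read to guarantee progress. */
        nblocks = Max(1, (int16) budget);
    }

    /* ... pin nblocks buffers and start the (possibly shortened) read ... */
    return true;
}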
> > @@ -260,16 +261,30 @@ read_stream_start_pending_read(ReadStream *stream, bool suppress_advice)
> >      else
> >          Assert(stream->next_buffer_index == stream->oldest_buffer_index);
> >
> > -    /*
> > -     * If advice hasn't been suppressed, this system supports it, and this
> > -     * isn't a strictly sequential pattern, then we'll issue advice.
> > -     */
> > -    if (!suppress_advice &&
> > -        stream->advice_enabled &&
> > -        stream->pending_read_blocknum != stream->seq_blocknum)
> > +    /* Do we need to issue read-ahead advice? */
> > +    flags = 0;
> > +    if (stream->advice_enabled)
> > +    {
> >          flags = READ_BUFFERS_ISSUE_ADVICE;
> > -    else
> > -        flags = 0;
> > +
> > +        if (stream->pending_read_blocknum == stream->seq_blocknum)
> > +        {
> > +            /*
> > +             * Suppress advice if our WaitReadBuffers() calls have caught up
> > +             * with the first advice we issued for this sequential run.
> > +             */
> > +            if (stream->seq_start == InvalidBlockNumber)
> > +                suppress_advice = true;
> > +        }
> > +        else
> > +        {
> > +            /* Random jump, so start a new sequential run. */
> > +            stream->seq_start = stream->pending_read_blocknum;
> > +        }
> > +
> > +        if (suppress_advice)
> > +            flags = 0;
> > +    }
>
> Seems a bit confusing to first set
> flags = READ_BUFFERS_ISSUE_ADVICE
> to then later unset it again. Maybe just set it in if (!suppress_advice)?
Yeah that was a bit too tangly. I found a better expression of that
logic, which also removed that annoying suppress_advice function
argument. I hope this is much clearer.
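For reference, the reworked logic reads roughly like this (sketched
from the description above, not the exact committed hunk):

    /* Do we need to issue read-ahead advice? */
    flags = 0;
    if (stream->advice_enabled)
    {
        if (stream->pending_read_blocknum != stream->seq_blocknum)
        {
            /* Random jump: remember the new run and issue advice. */
            stream->seq_start = stream->pending_read_blocknum;
            flags = READ_BUFFERS_ISSUE_ADVICE;
        }
        else if (stream->seq_start != InvalidBlockNumber)
        {
            /* Sequential, but not yet caught up with this run's start. */
            flags = READ_BUFFERS_ISSUE_ADVICE;
        }
        /* else: caught up, so stay quiet and let the kernel see the
         * pattern from our sequential reads */
    }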
> >      * Skip the initial ramp-up phase if the caller says we're going to be
> > @@ -825,6 +842,15 @@ read_stream_next_buffer(ReadStream *stream, void **per_buffer_data)
> >              distance = stream->distance * 2;
> >              distance = Min(distance, stream->max_pinned_buffers);
> >              stream->distance = distance;
> > +
> > +            /*
> > +             * If we've caught up with the first advice issued for the current
> > +             * sequential run, cancel further advice until the next random
> > +             * jump.  The kernel should be able to see the pattern now that
> > +             * we're issuing sequential preadv() calls.
> > +             */
> > +            if (stream->ios[io_index].op.blocknum == stream->seq_start)
> > +                stream->seq_start = InvalidBlockNumber;
>
> So stream->seq_start doesn't really denote the start of sequentialness, it
> denotes up to where the caller needs to process before we disable sequential
> access. Maybe add a comment to it and rename it to something like
> ->seq_until_processed?
WFM.
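Something like this for the field, I imagine (the comment wording is
mine):

    /*
     * Advice for the current sequential run keeps being issued until the
     * caller has processed the read that starts at this block; after that
     * it's InvalidBlockNumber, suppressing advice until the next random
     * jump.
     */
    BlockNumber seq_until_processed;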
> Other than this the approach seems to make sense!
Cool, so I'm planning to start pushing the earlier ones tomorrow.
Here also are the buffer forwarding ones, rebased on top.
Here's some strace art showing the old and new advice behaviour for
patch 0003. I traced ANALYZE with io_combine_limit=8 and used
different default_statistics_target settings to find interesting test
cases. The "bracket" on the right shows a sequential range of blocks.
Master only calls fadvise once per sequential chunk:
fadvise ●──────────────────────╮││ 3 0.000006
││╰─► pread 1 676..676 2 0.000007
fadvise ●─────────────────────╮││ 3 0.000006
││╰──► pread 1 678..678 2 0.000007
fadvise ●────────────────────╮││ 3 0.000007
││╰───► pread 3 680..682 2 0.000031
│╰────► pread 6 684..689 1 0.000015
╰─────► pread 8 691..698 ─╮ 0 0.000018
pread 8 699..706 │ 0 0.000016
fadvise ●────────────────────────╮ │ 1 0.000007
│ pread 8 707..714 │ 1 0.000019
│ pread 7 715..721 ─╯ 1 0.000017
╰─► pread 8 723..730 ─╮ 0 0.000016
pread 8 731..738 │ 0 0.000019
fadvise ●────────────────────────╮ │ 1 0.000007
│ pread 8 739..746 │ 1 0.000018
│ pread 5 747..751 ─╯ 1 0.000013
The patch can call fadvise three times when that's the configured I/O
concurrency level, because that controls when the pread() calls catch
up with the first block of the run:
fadvise ●────────────────────╮││ 3 0.000007
││╰───► pread 2 255..256 2 0.000014
fadvise ●───────────────────╮││ 3 0.000007
││╰────► pread 8 258..265 ─╮ 2 0.000035
│╰─────► preadv 8 266..273 │ 1 0.000021
╰──────► pread 8 274..281 │ 0 0.000017
fadvise ●────────────────────────╮ │ 1 0.000007
│ pread 8 282..289 │ 1 0.000017
fadvise ●───────────────────────╮│ │ 2 0.000007
││ pread 6 290..295 ─╯ 2 0.000015
fadvise ●──────────────────────╮││ 3 0.000007
││╰─► pread 8 297..304 ─╮ 2 0.000015
fadvise ●─────────────────────╮││ │ 3 0.000007
││╰──► pread 8 305..312 │ 2 0.000017
Purely sequential streams still get none:
pread 1 0..0 ─╮ 0 0.000016
pread 2 1..2 │ 0 0.000014
pread 4 3..6 │ 0 0.000021
pread 8 7..14 │ 0 0.000034
... blah blah blah ...
pread 8 4455..4462 │ 0 0.000029
pread 8 4463..4470 │ 0 0.000026
pread 8 4471..4478 │ 0 0.000020
pread 1 4479..4479 ─╯ 0 0.000010
Attachments:
  v3-0001-Improve-buffer-manager-API-for-backend-pin-limits.patch (text/x-patch, 6.9 KB)
  v3-0002-Respect-pin-limits-accurately-in-read_stream.c.patch (text/x-patch, 9.6 KB)
  v3-0003-Improve-read-stream-advice-for-large-random-chunk.patch (text/x-patch, 7.4 KB)
  v3-0004-Look-ahead-more-when-sequential-in-read_stream.c.patch (text/x-patch, 7.2 KB)
  v3-0005-Support-buffer-forwarding-in-read_stream.c.patch (text/x-patch, 8.7 KB)
  v3-0006-Support-buffer-forwarding-in-StartReadBuffers.patch (text/x-patch, 10.5 KB)