Re: Some read stream improvements

From: Andres Freund <andres(at)anarazel(dot)de>
To: Thomas Munro <thomas(dot)munro(at)gmail(dot)com>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Some read stream improvements
Date: 2025-03-11 19:29:29
Message-ID: 4zrbkuxva3yecbtmsgxhdu6lsu3cld7ejtbunocy3z46p7jkp5@ew2k27u2fxsm
Lists: pgsql-hackers

Hi,

On 2025-03-12 07:35:46 +1300, Thomas Munro wrote:
> On Thu, Feb 27, 2025 at 11:20 PM Andres Freund <andres(at)anarazel(dot)de> wrote:
> > On 2025-02-27 11:19:55 +1300, Thomas Munro wrote:
> > > On Wed, Feb 26, 2025 at 10:55 PM Andres Freund <andres(at)anarazel(dot)de> wrote:
> > > > I was working on expanding tests for AIO and as part of that wrote a test for
> > > > temp tables -- our coverage is fairly awful, there were many times during AIO
> > > > development where I knew I had trivially reachable temp table specific bugs
> > > > but all tests passed.
> > > >
> > > > The test for that does trigger the problem described above and is fixed by the
> > > > patches in this thread (which I included in the other thread):
>
> Here is a subset of those patches again:
>
> 1. Per-backend buffer limit, take III. Now the check is in
> read_stream_start_pending_read() so TOC == TOU.
>
> Annoyingly, test cases like the one below still fail, despite
> following the rules. The other streams eat all the buffers and then
> one gets an allowance of zero, but uses its right to take one pin
> anyway to make progress, and there isn't one.

I think that may be ok. If there are no unpinned buffers, it seems expected
that starting a new stream will fail. That's not the same as what we did
previously, namely failing inside a read stream that had already started
successfully, because we issued large IOs even though only a small number of
unpinned buffers remained.
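
To make that "right to one pin" behaviour concrete, here is roughly how I
picture the allowance being computed (just a sketch; apart from the two
Get*PinLimit() functions the names are made up):

/*
 * Sketch only, not the patch: clamp how many more buffers a stream may pin,
 * while keeping the right to one pin so that a stream that has already
 * started can always make progress.
 */
static uint32
stream_pin_allowance(bool stream_is_temp, uint32 wanted_nblocks)
{
    uint32      limit;

    limit = stream_is_temp ?
        GetAdditionalLocalPinLimit() :  /* may legitimately return 0 */
        GetAdditionalPinLimit();

    /* An already-started stream may always take one pin to make progress. */
    limit = Max(limit, 1);

    return Min(wanted_nblocks, limit);
}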

> I wonder if we should use temp_buffers - 100? Then leave the minimum GUC
> value at 100 still, so you have an easy way to test with 0, 1,
> ... additional buffers?

I think that just makes it harder to test the exhaustion scenario without
really fixing anything?

> 2. It shouldn't give up issuing random advice immediately after a
> jump, or it could stall on (say) the second 128kB of a 256kB
> sequential chunk (ie the strace you showed on the BHS thread). It
> only makes sense to assume kernel readahead takes over once you've
> actually *read* sequentially. In practice this makes it a lot more
> aggressive about advice (like the BHS code in master): it only gives
> up if the whole look-ahead window is sequential.

> 3. Change the distance algorithm to care only about hits and misses,
> not sequential heuristics. It made at least some sense before, but it
> doesn't make sense for AIO, and even in synchronous mode it means that
> you hit random jumps with insufficient look-ahead, so I don't think we
> should keep it.
>
> I also realised that the sequential heuristics are confused by that
> hidden trailing block thing, so in contrived pattern testing a
> hit-miss-hit-miss... sequence would be considered sequential, and even
> if you fix that (the forwarding patches above fix that), an exact
> hit-miss-hit-miss pattern also gets stuck between distances 1 and 2
> (double, decrement, double, ...). It might be worth waiting a bit
> longer before decrementing, IDK.
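
To illustrate that oscillation, a standalone toy model of the double-on-miss /
decrement-on-hit behaviour (not the actual read_stream.c logic):

/*
 * Toy model only: a miss doubles the distance (clamped to the maximum), a
 * hit decrements it, so a strict hit-miss-hit-miss pattern just bounces
 * between 1 and 2 instead of building up look-ahead.
 */
static int
toy_next_distance(int distance, bool hit, int max_distance)
{
    if (hit)
        return distance > 1 ? distance - 1 : 1;
    return distance * 2 < max_distance ? distance * 2 : max_distance;
}
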
>
> I'll rebase the others and post soon.

> +
> +/* see GetAdditionalPinLimit() */
> +uint32
> +GetAdditionalLocalPinLimit(void)
> +{
> + Assert(NLocalPinnedBuffers <= num_temp_buffers);
> + return num_temp_buffers - NLocalPinnedBuffers;
> +}

This doesn't behave quite the way GetAdditionalPinLimit() does: it can return
0. That makes some sense, since pinning an additional buffer would always fail
at that point, but it's perhaps worth calling out.
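
Something along these lines is what I had in mind, wording just a sketch (the
function body is unchanged from your patch):

/*
 * Return the number of additional local (temporary-table) buffers this
 * backend could pin.  Unlike GetAdditionalPinLimit(), this can return 0, in
 * which case pinning any further local buffer will fail.
 *
 * See GetAdditionalPinLimit().
 */
uint32
GetAdditionalLocalPinLimit(void)
{
    Assert(NLocalPinnedBuffers <= num_temp_buffers);
    return num_temp_buffers - NLocalPinnedBuffers;
}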

> static void
> read_stream_look_ahead(ReadStream *stream, bool suppress_advice)
> {
> while (stream->ios_in_progress < stream->max_ios &&
> - stream->pinned_buffers + stream->pending_read_nblocks < stream->distance)
> + ((stream->pinned_buffers == 0 && stream->distance > 0) ||
> + stream->pinned_buffers + stream->pending_read_nblocks < stream->distance))

What does the new "stream->pinned_buffers == 0 && stream->distance > 0" really
mean? And when would it be true when the pre-existing condition wouldn't
already be true?

> {
> BlockNumber blocknum;
> int16 buffer_index;
> void *per_buffer_data;
>
> - if (stream->pending_read_nblocks == io_combine_limit)
> + /* If have a pending read that can't be extended, start it now. */
> + Assert(stream->pinned_buffers + stream->pending_read_nblocks <=
> + stream->max_pinned_buffers);
> + if (stream->pending_read_nblocks == io_combine_limit ||
> + (stream->pinned_buffers == 0 &&
> + stream->pending_read_nblocks == stream->max_pinned_buffers))
> {
> read_stream_start_pending_read(stream, suppress_advice);
> suppress_advice = false;
> @@ -360,14 +409,15 @@ read_stream_look_ahead(ReadStream *stream, bool suppress_advice)
> /* We have to start the pending read before we can build another. */
> while (stream->pending_read_nblocks > 0)
> {
> - read_stream_start_pending_read(stream, suppress_advice);
> - suppress_advice = false;
> - if (stream->ios_in_progress == stream->max_ios)
> + if (!read_stream_start_pending_read(stream, suppress_advice) ||
> + stream->ios_in_progress == stream->max_ios)
> {
> - /* And we've hit the limit. Rewind, and stop here. */
> + /* And we've hit a buffer or I/O limit. Rewind and wait. */
> read_stream_unget_block(stream, blocknum);
> return;
> }
> +
> + suppress_advice = false;
> }

If read_stream_start_pending_read() returns false because we hit the pin
limit, does it really help to call read_stream_unget_block()? IIUC that'll
defer one block for later - but what about the other buffers in a multi-block
read?

> @@ -260,16 +261,30 @@ read_stream_start_pending_read(ReadStream *stream, bool suppress_advice)
> else
> Assert(stream->next_buffer_index == stream->oldest_buffer_index);
>
> - /*
> - * If advice hasn't been suppressed, this system supports it, and this
> - * isn't a strictly sequential pattern, then we'll issue advice.
> - */
> - if (!suppress_advice &&
> - stream->advice_enabled &&
> - stream->pending_read_blocknum != stream->seq_blocknum)
> + /* Do we need to issue read-ahead advice? */
> + flags = 0;
> + if (stream->advice_enabled)
> + {
> flags = READ_BUFFERS_ISSUE_ADVICE;
> - else
> - flags = 0;
> +
> + if (stream->pending_read_blocknum == stream->seq_blocknum)
> + {
> + /*
> + * Suppress advice if our WaitReadBuffers() calls have caught up
> + * with the first advice we issued for this sequential run.
> + */
> + if (stream->seq_start == InvalidBlockNumber)
> + suppress_advice = true;
> + }
> + else
> + {
> + /* Random jump, so start a new sequential run. */
> + stream->seq_start = stream->pending_read_blocknum;
> + }
> +
> + if (suppress_advice)
> + flags = 0;
> + }

It seems a bit confusing to first set
  flags = READ_BUFFERS_ISSUE_ADVICE
and then later unset it again. Maybe just set it inside if (!suppress_advice)?
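
I.e. something along these lines (untested sketch of the same hunk):

    /* Do we need to issue read-ahead advice? */
    flags = 0;
    if (stream->advice_enabled)
    {
        if (stream->pending_read_blocknum == stream->seq_blocknum)
        {
            /*
             * Suppress advice if our WaitReadBuffers() calls have caught up
             * with the first advice we issued for this sequential run.
             */
            if (stream->seq_start == InvalidBlockNumber)
                suppress_advice = true;
        }
        else
        {
            /* Random jump, so start a new sequential run. */
            stream->seq_start = stream->pending_read_blocknum;
        }

        if (!suppress_advice)
            flags = READ_BUFFERS_ISSUE_ADVICE;
    }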

> * Skip the initial ramp-up phase if the caller says we're going to be
> @@ -825,6 +842,15 @@ read_stream_next_buffer(ReadStream *stream, void **per_buffer_data)
> distance = stream->distance * 2;
> distance = Min(distance, stream->max_pinned_buffers);
> stream->distance = distance;
> +
> + /*
> + * If we've caught up with the first advice issued for the current
> + * sequential run, cancel further advice until the next random
> + * jump. The kernel should be able to see the pattern now that
> + * we're issuing sequential preadv() calls.
> + */
> + if (stream->ios[io_index].op.blocknum == stream->seq_start)
> + stream->seq_start = InvalidBlockNumber;

So stream->seq_start doesn't really denote the start of the sequential run;
it denotes the point up to which the caller needs to have processed before we
stop issuing advice. Maybe add a comment to it and rename it to something like
->seq_until_processed?
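
E.g. something like this (name and comment wording just a suggestion):

    /*
     * The block number the consumer has to process past before we assume
     * kernel read-ahead has taken over and stop issuing advice for the
     * current sequential run; InvalidBlockNumber once we have caught up and
     * advice is suppressed.
     */
    BlockNumber seq_until_processed;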

Other than this the approach seems to make sense!

Greetings,

Andres Freund
