Quick Links

Re: Streaming relation data out of order

From:	Thomas Munro <thomas(dot)munro(at)gmail(dot)com>
To:	Andres Freund <andres(at)anarazel(dot)de>
Cc:	Robert Haas <robertmhaas(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Streaming relation data out of order
Date:	2025-04-11 00:50:43
Message-ID:	CA+hUKG+tiy_u5fpf_c0cNHcZ5ZQOZ7A7a48SHY8nrU7AhQQP6Q@mail.gmail.com
Views:	Whole Thread \| Raw Message \| Download mbox \| Resend email
Thread:
Lists:	pgsql-hackers

On Thu, Apr 10, 2025 at 7:35 AM Andres Freund <andres(at)anarazel(dot)de> wrote:
> 1) Increase distance by * 2 + read_size on a miss
>
> This would help us to increase with distance = 1, where * 2 only increases
> distance by 1, just as -1 obviously reduces it by 1.
>
> It'd also somewhat address the over-weighing of hits compared to misses.
>
> 2) Only reduce distance if we have no outstanding IOs

Interesting. That's similar in some ways to one of my own attempts:
I had an extra thing called sustain, and that had to be counted down
to zero before distance started decrementing. I didn't want the
distance itself to get higher than it already did by doubling once per
(combined) IO, I just wanted to defer the decay. (Names from an
earlier abandoned attempt to use synthesiser terminology: attack,
decay, sustain, release; cute but useless.) I tried a few different
recipes, but just adding read sizes to sustain works out pretty close
to your pleasingly simple rule 2: sustain until we can't see any IOs
in the lookahead window. But I also tried adding a bit more for
longer sustain, sustain += read_size * 2 or whatever, so you don't
give up your window so easily on a run of hits, and it also achieve
escape velocity from the gravity of 1. IDK, I hadn't formed any
strong opinions yet and need to test more stuff, hence lack of
concrete patches. Thanks, this was helpful, I'll play around with it!

> > +/*
> > + * Increment buffer index by n, wrapping around at queue size.
> > + */
> > +static inline void
> > +read_stream_advance_buffer_n(ReadStream *stream, int16 *index, int16 n)
> > +{
> > + Assert(*index >= 0);
> > + Assert(*index < stream->queue_size);
> > + Assert(n <= stream->io_combine_limit);
> > +
> > + *index += n;
> > + if (*index >= stream->queue_size)
> > + *index -= stream->queue_size;
> > +
> > + Assert(*index >= 0);
> > + Assert(*index < stream->queue_size);
> > +}
>
> Not sure if that's it's possible to have a large enough queue_size, but this
> doesn't look like it's safe against *index + n > PG_INT16_MAX. The value would
> wrap into the negative.

No, because the queue is sized such that indexes into this overflow
space also fit under INT16_MAX:

/*
* If starting a multi-block I/O near the end of the queue, we might
* temporarily need extra space for overflowing buffers before they are
* moved to regular circular position. This is the maximum extra space we
* could need.
*/
queue_overflow = io_combine_limit - 1;

The patch clearly lacks a comment to explain that mystery. Will add. Thanks!

> I think all these variables really ought to be unsigned. None of them ever
> could be negative afaict.

Maybe... I tend to default to signed variables when there isn't a good
reason to prefer unsigned. But I have also been mulling a change here
since:

commit 06fb5612c970b3af95aca3db5a955669b07537ca
Author: Thomas Munro <tmunro(at)postgresql(dot)org>
Date: Wed Mar 19 15:23:12 2025 +1300

Increase io_combine_limit range to 1MB.

Unfortunately uint16 still needs to worry about UINT16_MAX
(effective_io_concurrency = 1000 * io_combine_limit = 128 = 128,000
buffers + a few, still one bit short; before the above commit the cap
was 32,000 buffers + a few, which is why int16 was enough when the
code was written). Maybe we should just go directly to int, or uint32
if you insist, and stop worrying about overflow. I think we agreed
that 128,000 buffers = 1000MB of in-progress I/O seemed unrealistic
and 32K ought to be enough for anybody™, but really we're talking
about saving a handful of bytes in struct ReadStream in exchange for a
risk of bugs. Will write a patch to try that soon...

In response to

Re: Streaming relation data out of order at 2025-04-09 19:35:43 from Andres Freund

Browse pgsql-hackers by date

	From	Date	Subject
Next Message	Michael Paquier	2025-04-11 01:10:46	Re: sync_standbys_defined read/write race on startup
Previous Message	Sami Imseih	2025-04-11 00:38:05	Re: n_ins_since_vacuum stats for aborted transactions