Re: Streaming relation data out of order

From: Andres Freund <andres(at)anarazel(dot)de>
To: Thomas Munro <thomas(dot)munro(at)gmail(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Streaming relation data out of order
Date: 2025-04-09 19:35:43
Message-ID: zha6yshh32n3nejoiqitf3xtley5srz6dp5t2cm3qlb4zt3wqp@gwfkwmr3beaz
Lists: pgsql-hackers

Hi,

On 2025-04-09 19:17:00 +1200, Thomas Munro wrote:
> One weird issue, not just with reordering, is that read_stream.c's
> distance cools off too easily with some simple test patterns. Start
> at 1, double on miss, subtract one on hit, repeat, and you can be
> trapped alternating between 1 and 2 when you'd certainly benefit from
> IO concurrency and also reordering. It may need a longer memory.

Just having a hit in itself really is no reason to reduce the distance
IMO. It's completely normal to have a few hits when doing readahead, but a few
hits in a row don't mean the IO latency of the subsequent miss doesn't
matter anymore. But if our entire lookahead consists only of hits, we actually
are in a pattern with no misses for the moment, and can reduce the distance.

Another issue is that we, afaict, over-weight hits rather substantially due to
IO combining. A single one-block hit (hits are never bigger) decreases the
distance, but a combined read of size N only affects the distance once, not N
times.

I.e. what I am suggesting is something like:

1) Increase distance by * 2 + read_size on a miss

This would help us grow from distance = 1, where * 2 alone only increases the
distance by 1, just as -1 obviously reduces it by 1.

It'd also somewhat address the over-weighting of hits compared to misses.

2) Only reduce distance if we have no outstanding IOs

> Subject: [PATCH v2 1/4] Move read stream modulo arithmetic into functions.
>
> Several places have open-coded circular index arithmetic. Make some
> common functions for better readability and consistent assertion
> checking. This avoids adding yet more open-coding in later patches.
> ---
> src/backend/storage/aio/read_stream.c | 93 +++++++++++++++++++++------
> 1 file changed, 74 insertions(+), 19 deletions(-)
>
> diff --git a/src/backend/storage/aio/read_stream.c b/src/backend/storage/aio/read_stream.c
> index 0e7f5557f5c..0d3dd65bfed 100644
> --- a/src/backend/storage/aio/read_stream.c
> +++ b/src/backend/storage/aio/read_stream.c
> @@ -216,6 +216,67 @@ read_stream_unget_block(ReadStream *stream, BlockNumber blocknum)
> stream->buffered_blocknum = blocknum;
> }
>
> +/*
> + * Increment buffer index, wrapping around at queue size.
> + */
> +static inline void
> +read_stream_advance_buffer(ReadStream *stream, int16 *index)
> +{
> + Assert(*index >= 0);
> + Assert(*index < stream->queue_size);
> +
> + *index += 1;
> + if (*index == stream->queue_size)
> + *index = 0;
> +}
> +/*
> + * Increment buffer index by n, wrapping around at queue size.
> + */
> +static inline void
> +read_stream_advance_buffer_n(ReadStream *stream, int16 *index, int16 n)
> +{
> + Assert(*index >= 0);
> + Assert(*index < stream->queue_size);
> + Assert(n <= stream->io_combine_limit);
> +
> + *index += n;
> + if (*index >= stream->queue_size)
> + *index -= stream->queue_size;
> +
> + Assert(*index >= 0);
> + Assert(*index < stream->queue_size);
> +}

Not sure whether it's possible to have a large enough queue_size, but this
doesn't look like it's safe against *index + n > PG_INT16_MAX. The value would
wrap into the negative range.

I think all these variables really ought to be unsigned. None of them could
ever be negative afaict, and afaict unsigned arithmetic would lead to the
correct behaviour when wrapping around.

Greetings,

Andres Freund
