From: Thomas Munro <thomas(dot)munro(at)gmail(dot)com>
To: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Cc: Andres Freund <andres(at)anarazel(dot)de>
Subject: Some read stream improvements
Date: 2025-02-17 04:55:09
Message-ID: CA+hUKGK_=4CVmMHvsHjOVrK6t4F=LBpFzsrr3R+aJYN8kcTfWg@mail.gmail.com
Lists: pgsql-hackers
Hi,
Here are some patches that address some of Andres's feedback since the
AIO v2 rebase[1], anticipate out-of-order streams, and make some other
minor improvements. They are independent of the main AIO patch set
and apply to master, hence separate thread.
0001-Refactor-read_stream.c-s-circular-arithmetic.patch
This just replaces open-coded arithmetic with inline functions. They
will be used a lot more in later work, and provide central places to
put assertions that were not checked as uniformly as I would like.
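To illustrate the idea (the names and details here are hypothetical,
not necessarily what the patch uses), such a helper might look like
this:

    #include <assert.h>
    #include <stdint.h>

    /*
     * Advance a circular queue index by n slots, wrapping at queue_size.
     * A branch instead of % avoids a division and works for queue sizes
     * that aren't powers of two, and the assertions give the range
     * checks one central home.
     */
    static inline int16_t
    stream_advance_index(int16_t index, int16_t n, int16_t queue_size)
    {
        assert(index >= 0 && index < queue_size);
        assert(n >= 0 && n <= queue_size);

        index = (int16_t) (index + n);
        if (index >= queue_size)
            index = (int16_t) (index - queue_size);
        return index;
    }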
0002-Allow-more-buffers-for-sequential-read-streams.patch
Currently we're only generating I/O concurrency for random I/O, and I
originally guesstimated that random I/O wouldn't typically be able to
combine more than about 4 blocks on average when capping the buffer
queue. Debatable perhaps, but soon obsolete: for true asynchronous
I/O, we'll need full concurrency * full combine limit, so we'll
potentially want to pin more buffers. This just changes that
hard-coded 4 to io_combine_limit.
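For concreteness, here is a sketch of the sizing change, modelled on
the heuristic in master's read_stream_begin_impl() (the real code has
more clamping and overflow guards):

    #define Max(a, b) ((a) > (b) ? (a) : (b))

    static int
    choose_max_pinned_buffers(int max_ios, int io_combine_limit)
    {
        /* Before: guess that random I/O combines ~4 blocks on average. */
        /* return Max(max_ios * 4, io_combine_limit); */

        /* After: full concurrency at the full combine limit. */
        return Max(max_ios * io_combine_limit, io_combine_limit);
    }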
0003-Improve-buffer-pool-API-for-per-backend-pin-limits.patch
In preparation for the next patch, provide some new bufmgr APIs.
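Conceptually, a stream needs to be able to ask the buffer manager at
any time how many more buffers it may fairly pin, rather than receiving
a one-off budget. Hypothetical declarations along those lines (my
sketch of the shape, not necessarily the patch's spelling):

    /* This backend's total "fair share" of pinned buffers. */
    extern uint32 GetPinLimit(void);

    /* How many more pins this backend could acquire right now. */
    extern uint32 GetAdditionalPinLimit(void);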
0004-Respect-pin-limits-accurately-in-read_stream.c.patch
The current coding computes the remaining "fair share" of the buffer
pool for this backend only at stream initialisation. It's hard, but
not impossible, for one backend using multiple streams at the same
time to pin more than 1/max_connections of the buffer pool (the
intended cap). This patch moves the time of check to the time of use,
so the limit is respected strictly. I avoided adding any changes to
the fast path for all-cached streams, which only pin one buffer at a
time.
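A sketch of the time-of-use pattern, reusing the hypothetical
GetAdditionalPinLimit() from above (the real accounting in
read_stream.c is more involved):

    /*
     * Decide how many more buffers to pin right now.  The headroom is
     * re-checked on every call rather than cached at stream start, so
     * multiple streams in one backend can't jointly exceed the limit.
     */
    static int
    buffers_to_pin_now(int want, int already_pinned)
    {
        uint32 headroom = GetAdditionalPinLimit();

        /* Always allow at least one pin so the stream makes progress. */
        if (headroom == 0)
            return already_pinned == 0 ? 1 : 0;

        return (uint32) want <= headroom ? want : (int) headroom;
    }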
This is basically a TOCTOU (time-of-check-to-time-of-use) bug, and
could in theory be back-patched, but I don't personally think that's
necessary yet, since it's so hard to actually break anything with
v17's rather limited use of streams. The only concrete problem I can
think of right now would require many backends all to create many
CURSORs that stream cold data, and to FETCH in a particular order only
after they've all been constructed -- which is also a recipe for
running out of buffers without using streams, albeit not quite as
quickly.
0005-Support-buffer-forwarding-in-read_stream.c.patch
0006-Support-buffer-forwarding-in-StartReadBuffers.patch
Background: StartReadBuffers() reports a short read when it finds a
cached block that divides a potential I/O. For example, if you ask to
read 8 blocks, and the 6th one turns out to be a hit, i.e. already
valid in the buffer pool, then we only need to read the first 5 from
disk. Perhaps the 7th and 8th blocks will also need I/O, but we don't
want StartReadBuffers() to start *two* I/Os, so the 6th block
terminates our search. The question is what to do with that 6th
block. Currently we'll tell the caller that the size of the operation
is 6 blocks -- that is, we'll silently include the hit that prevented
further combining, even though we'll only actually do the read for the
first 5 blocks. And if someone else manages to make any buffers valid
before you call WaitReadBuffers(), we'll just silently skip over them
and loop.
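In code, a caller's view of that example looks roughly like this
(simplified; see bufmgr.h for the real interface, and assume the
operation has already been set up for the relation):

    ReadBuffersOperation op;     /* relation/fork fields assumed set up */
    Buffer  buffers[8];
    int     nblocks = 8;
    int     flags = 0;

    /* Ask for blocks 100..107; block 105 (the 6th) is already cached. */
    if (StartReadBuffers(&op, buffers, 100, &nblocks, flags))
        WaitReadBuffers(&op);

    /*
     * On master, nblocks comes back as 6: blocks 100..104 are read from
     * disk, the hit at 105 is silently included, and blocks 106..107
     * must be requested by a later call.
     */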
That was simple enough, but having hits that are invisibly mixed up
with misses prevents us from implementing fully general reordering of
hits, an important optimisation that reduces I/O stalls for consumers
that can cope with out-of-order data (I'll post a new patch for that
separately).
A simple solution we rejected was to unpin such buffers and then repin
them later, but that would waste cycles, allow eviction and
potentially mess up the usage count. The AIO patch set also
introduces another case involving potentially many buffers: it moves
the BM_IO_IN_PROGRESS negotiation into StartReadBuffers(), which can
then also report a short read for middle blocks, by which point it has
already acquired all the pins.
The solution we agreed on is to introduce a way for StartReadBuffers()
to communicate with future calls, and "forward" pinned buffers between
calls. The function arguments don't change, but its "buffers"
argument becomes an in/out array: one StartReadBuffers() call can
leave extra pinned buffers after the ones that were included in a
short read (*nblocks), and then when you retry (or possibly extend)
the rest of the read, you have to pass them back in. That is easy for
the read stream code, as it can just leave them in its circular queue
for the next call to take as input. It only needs to be aware of them
for pin limit accounting and stream reset (including early shutdown).
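Continuing the earlier example, my reading of the contract is roughly
this (simplified):

    Buffer  buffers[8];
    int     nblocks = 8;

    if (StartReadBuffers(&op, buffers, 100, &nblocks, flags))
        WaitReadBuffers(&op);

    /*
     * With forwarding, nblocks can come back as 5 (the blocks actually
     * read), while buffers[5] still holds a pin on the hit at block
     * 105, forwarded to the next call.
     */
    int     rest = 3;            /* blocks 105..107 */

    /* The forwarded pin must be passed back in at the same position. */
    if (StartReadBuffers(&op, &buffers[5], 105, &rest, flags))
        WaitReadBuffers(&op);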
One goal was to avoid introducing any new instructions into
ReadBuffer() or the read stream fast path, so StartReadBuffer(), the
single-block specialisation, doesn't support forwarding: a one-block
read can't be split anyway, but it also doesn't accept forwarded
buffers as input.
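For reference, the single-block variant keeps its simple shape (as on
master):

    /* Single-block specialisation: no read splitting, no forwarding. */
    bool StartReadBuffer(ReadBuffersOperation *operation,
                         Buffer *buffer,
                         BlockNumber blocknum,
                         int flags);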
I plan to share some new C-level tests and illustrations of the
internal states of read_stream.c separately, as the complexity is
obviously increasing a bit (especially with out-of-order streams,
about which more soon).
Attachment | Content-Type | Size |
---|---|---|
v1-0001-Refactor-read_stream.c-s-circular-arithmetic.patch | text/x-patch | 4.6 KB |
v1-0002-Allow-more-buffers-for-sequential-read-streams.patch | text/x-patch | 2.4 KB |
v1-0003-Improve-buffer-pool-API-for-per-backend-pin-limit.patch | text/x-patch | 6.3 KB |
v1-0004-Respect-pin-limits-accurately-in-read_stream.c.patch | text/x-patch | 9.9 KB |
v1-0005-Support-buffer-forwarding-in-read_stream.c.patch | text/x-patch | 8.8 KB |
v1-0006-Support-buffer-forwarding-in-StartReadBuffers.patch | text/x-patch | 10.4 KB |