From: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
To: Thomas Munro <thomas(dot)munro(at)gmail(dot)com>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Melanie Plageman <melanieplageman(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>
Subject: Re: Streaming I/O, vectored I/O (WIP)
Date: 2024-01-11 14:31:22
Message-ID: 31a70f95-a2ad-44e3-9ac0-b79178948f77@iki.fi
Lists: pgsql-hackers
On 11/01/2024 05:19, Thomas Munro wrote:
> On Thu, Jan 11, 2024 at 8:58 AM Heikki Linnakangas <hlinnaka(at)iki(dot)fi> wrote:
>> On 10/01/2024 06:13, Thomas Munro wrote:
>>> Bikeshedding call: I am open to better suggestions for the names
>>> PrepareReadBuffer() and CompleteReadBuffers(), they seem a little
>>> grammatically clumsy.
>>
>> How will these functions work in the brave new async I/O world? I assume
>> PrepareReadBuffer() will initiate the async I/O, and
>> CompleteReadBuffers() will wait for it to complete. How about
>> StartReadBuffer() and WaitReadBuffer()? Or StartBufferRead() and
>> WaitBufferRead()?
>
> What we have imagined so far is that the asynchronous version would
> probably have three steps, like this:
>
> * PrepareReadBuffer() -> pins one buffer, reports if found or IO/zeroing needed
> * StartReadBuffers() -> starts the I/O for n contiguous !found buffers
> * CompleteReadBuffers() -> waits, completes as necessary
Ok. It feels surprising to have three steps. I understand that you need
two steps, one to start the I/Os and another to wait for them to finish,
but why do you need separate Prepare and Start steps? What can you do in
between them? (You explained that below; I'm just saying that that was my
initial reaction when seeing the API. It is surprising.)
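To make sure I understand the division of labour, here's roughly the call
pattern I imagine for those three steps, in the simple case where none of
the blocks are in cache yet. The argument lists are just my guesses, not
what's actually in your patch:
/* just a sketch; signatures and the array size are made up */
Buffer		buffers[16];
bool		found;

/* 1. Pin each buffer and learn whether it needs I/O or zeroing */
for (int i = 0; i < nblocks; i++)
	buffers[i] = PrepareReadBuffer(bmr, forkNum, blkno + i,
								   strategy, &found);

/* 2. Start one combined I/O for the contiguous run of !found buffers */
StartReadBuffers(buffers, nblocks);

/* 3. Wait for the I/O to finish before actually using the pages */
CompleteReadBuffers(buffers, nblocks);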
>> If StartReadBuffer() starts the async I/O, the idea that you can call
>> ZeroBuffer() instead of WaitReadBuffer() doesn't work. I think
>> StartReadBuffer() needs to take ReadBufferMode, and do the zeroing for
>> you in RBM_ZERO_* modes.
>
> Yeah, good thoughts, and topics that have occupied me for some time
> now. I also thought that StartReadBuffer() should take
> ReadBufferMode, but I came to the conclusion that it probably shouldn't,
> for the following reason:
>
> I started working on all this by trying to implement the most
> complicated case I could imagine, streaming recovery, and then working
> back to the easy cases that just do scans with RBM_NORMAL. In
> recovery, we can predict that a block will be zeroed using WAL flags,
> and pre-existing cross-checks at redo time that enforce that the flags
> and redo code definitely agree on that, but we can't predict which
> exact ReadBufferMode the redo code will use, RBM_ZERO_AND_LOCK or
> RBM_ZERO_AND_CLEANUP_LOCK (or mode=RBM_NORMAL and
> get_cleanup_lock=true, as the comment warns them not to, but I
> digress).
>
> That's OK, because we can't take locks while looking ahead in recovery
> anyway (the redo routine carefully controls lock order/protocol), so
> the code to actually do the locking needs to be somewhere near the
> output end of the stream when the redo code calls
> XLogReadBufferForRedoExtended(). But if you try to use RBM_XXX in
> these interfaces, it begins to look pretty funny: the streaming
> callback needs to be able to say which ReadBufferMode, but anywhere
> near Prepare*(), Start*() or even Complete*() is too soon, so maybe we
> need to invent a new value RBM_WILL_ZERO that doesn't yet say which of
> the zero modes to use, and then the redo routine needs to pass in the
> RBM_ZERO_AND_{LOCK,CLEANUP_LOCK} value to
> XLogReadBufferForRedoExtended() and it does it in a separate step
> anyway, so we are ignoring ReadBufferMode. But that feels just wrong
> -- we'd be using the RBM modes but implementing them only partially.
I see. When you're about to zero the page, there's not much point in
splitting the operation into Prepare/Start/Complete stages anyway.
You're not actually doing any I/O. Perhaps it's best to have a separate
"Buffer ZeroBuffer(Relation, ForkNumber, BlockNumber, lockmode)"
function that does the same as
ReadBuffer(RBM_ZERO_AND_[LOCK|CLEANUP_LOCK]) today.
> The _ZERO_ON_ERROR aspect is a case where CompleteReadBuffers() is the
> right time and makes sense to process as a batch, so it becomes a
> flag.
+1
>> Putting all that together, I propose:
>>
>> /*
>> * Initiate reading a block from disk to the buffer cache.
>> *
>> * XXX: Until we have async I/O, this just allocates the buffer in the
>> * buffer cache. The actual I/O happens in WaitReadBuffer().
>> */
>> Buffer
>> StartReadBuffer(BufferManagerRelation bmr,
>> ForkNumber forkNum,
>> BlockNumber blockNum,
>> BufferAccessStrategy strategy,
>> ReadBufferMode mode,
>> bool *foundPtr);
>>
>> /*
>> * Wait for a read that was started earlier with StartReadBuffer() to
>> * finish.
>> *
>> * XXX: Until we have async I/O, this is the function that actually performs
>> * the I/O. If multiple I/Os have been started with StartReadBuffer(), this
>> * will try to perform all of them in one syscall. Subsequent calls to
>> * WaitReadBuffer(), for those other buffers, will finish quickly.
>> */
>> void
>> WaitReadBuffer(Buffer buf);
>
> I'm confused about where the extra state lives that would allow the
> communication required to build a larger I/O. In the AIO branch, it
> does look a little more like that, but there is more magic state and
> machinery hiding behind the curtain: the backend's pending I/O list
> builds up a chain of I/Os, and when you try to wait, if it hasn't
> already been submitted to the kernel/bgworkers yet it will be, and
> before that merging will happen. So you get bigger I/Os without
> having to say so explicitly.
Yeah, I was thinking that there would be a global variable that holds a
list of operations started with StartReadBuffer().
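Something like this, I mean; the struct and the names are just made up for
illustration:
#define MAX_PENDING_READS	16		/* arbitrary limit for this sketch */

/* one entry per StartReadBuffer() call that hasn't been waited on yet */
typedef struct PendingRead
{
	Buffer		buffer;			/* the buffer pinned by StartReadBuffer() */
	SMgrRelation reln;			/* which relation/fork/block to read */
	ForkNumber	forknum;
	BlockNumber blocknum;
} PendingRead;

static PendingRead pendingReads[MAX_PENDING_READS];
static int	numPendingReads = 0;
WaitReadBuffer() would then scan that array, merge entries for adjacent
blocks, and perform the reads with as few preadv() calls as possible.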
> For this synchronous version (and hopefully soon a more efficient
> improved version in the AIO branch), we want to take advantage of the
> client's pre-existing and more cheaply obtained knowledge of ranges.
Ok.
> In an asynchronous version, that BM_IO_IN_PROGRESS negotiation would
> take place in StartReadBuffers() instead, which would be responsible
> for kicking off asynchronous I/Os (= asking a background worker to
> call pread(), or equivalent fancy kernel async APIs). One question is
> what it does if it finds a block in the middle that chops the read up,
> or for that matter a segment boundary. I don't think we have to
> decide now, but the two options seem to be that it starts one single
> I/O and reports its size, making it the client's problem to call again
> with the rest, or that it starts more than one I/O and they are
> somehow chained together so that the caller doesn't know about that
> and can later wait for all of them to finish using just one <handle
> thing>.
>
> (The reason why I'm not 100% certain how it will look is that the
> real, working code in the aio branch right now doesn't actually expose
> a vector/nblocks bufmgr interface at all, yet. Andres's original
> prototype had a single-block Start*(), Complete*() design, but a lower
> level of the AIO system notices if pending read operations are
> adjacent and could be merged. While discussing all this we decided it
> was a bit strange to have lower code deal with allocating, chaining
> and processing lots of separate I/O objects in shared memory, when
> higher level code could often work in bigger ranges up front, and then
> interact with the AIO subsystem with many fewer objects and steps.
> Also, the present simple and lightweight synchronous proposal lacks
> the whole subsystem that could do that by magic.)
Hmm, let's sketch out what this would look like with the approach where
you always start one I/O and report the size:
/*
* Initiate reading a range of blocks from disk to the buffer cache.
*
* If the pages were already found in the buffer cache, returns true.
* Otherwise false, and the caller must call WaitReadBufferRange() to
* wait for the I/O to finish, before accessing the buffers.
*
* 'buffers' is a caller-supplied array large enough to hold (endBlk -
* startBlk) buffers. It is filled with the buffers that the pages are
* read into.
*
* This always starts a read of at least one block, and tries to
* initiate one read I/O for the whole range if possible. But if the
* read cannot be performed as a single I/O, a partial read is started,
* and *endBlk is updated to reflect the range for which the read was
* started. The caller can make another call to read the rest of the
* range. A partial read can occur if some, but not all, of the pages
* are already in the buffer cache, or because the range crosses a
* segment boundary.
*
* XXX: Until we have async I/O, this just allocates the buffers in the
* buffer cache. And perhaps calls smgrprefetch(). The actual I/O
* happens in WaitReadBufferRange().
*/
bool
StartReadBufferRange(BufferManagerRelation bmr,
ForkNumber forkNum,
BlockNumber startBlk,
BlockNumber *endBlk,
BufferAccessStrategy strategy,
Buffer *buffers);
/*
* Wait for a read that was started earlier with StartReadBufferRange()
* to finish.
*
* XXX: Until we have async I/O, this is the function that actually performs
* the I/O. StartReadBufferRange already checked that the pages can be
* read in one preadv() syscall. However, it's possible that another
* backend performed the read for some of the pages in between. In that
* case this will perform multiple syscalls, after all.
*/
void
WaitReadBufferRange(Buffer *buffers, int nbuffers, bool zero_on_error);
/*
* Allocates a buffer for the given page in the buffer cache, and locks
* the page. No I/O is initiated. The caller must initialize it and mark
* the buffer dirty before releasing the lock.
*
* This is equivalent to ReadBuffer(RBM_ZERO_AND_LOCK) or
* ReadBuffer(RBM_ZERO_AND_CLEANUP_LOCK).
*/
Buffer
ZeroBuffer(BufferManagerRelation bmr,
ForkNumber forkNum,
BlockNumber blockNum,
BufferAccessStrategy strategy,
bool cleanup_lock);
This range-oriented API fits the callers pretty well: the streaming read
API works with block ranges already. There is no need for separate
Prepare and Start steps.
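For example, I'd expect a caller that wants blocks [startBlk, endBlk) to
drive these functions with a loop along these lines (untested, error
handling omitted, and the array size is made up):
while (startBlk < endBlk)
{
	BlockNumber thisEnd = endBlk;
	Buffer		buffers[MAX_RANGE_BUFFERS];
	bool		found;

	/* start reading as much of the remaining range as possible */
	found = StartReadBufferRange(bmr, forkNum, startBlk, &thisEnd,
								 strategy, buffers);
	if (!found)
		WaitReadBufferRange(buffers, thisEnd - startBlk, false);

	/* ... use the (thisEnd - startBlk) buffers ... */

	startBlk = thisEnd;
}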
One weakness is that if StartReadBufferRange() finds that the range is
"chopped up", it needs to return and throw away the work it had to do to
look up the next buffer. So in the extreme case that every other block
in the range is in the buffer cache, each call would look up two buffers
in the buffer cache, startBlk and startBlk + 1, but only return one
buffer to the caller.
--
Heikki Linnakangas
Neon (https://neon.tech)