Re: Streaming I/O, vectored I/O (WIP)

From: Thomas Munro <thomas(dot)munro(at)gmail(dot)com>
To: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Melanie Plageman <melanieplageman(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>
Subject: Re: Streaming I/O, vectored I/O (WIP)
Date: 2024-01-11 23:32:21
Message-ID: CA+hUKG+ay_ht+Ge4BUKG_01+SuQF7Z93JMmy1an+0syGM6o5SQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Fri, Jan 12, 2024 at 3:31 AM Heikki Linnakangas <hlinnaka(at)iki(dot)fi> wrote:
> Ok. It feels surprising to have three steps. I understand that you need
> two steps, one to start the I/O and another to wait for them to finish,
> but why do you need separate Prepare and Start steps? What can you do in
> between them? (You explained that. I'm just saying that that's my
> initial reaction when seeing that API. It is surprising.)

Actually I don't think I explained that very well. First, some more
detail about how a two-step version would work:

* we only want to start one I/O in an StartReadBuffers() call, because
otherwise it is hard/impossible for the caller to cap concurrency I/O
* therefore StartReadBuffers() can handle sequences matching /^H*I*H*/
("H" = hit, "I" = miss, I/O) in one call
* in the asynchronous version, "I" in that pattern means we got
BM_IO_IN_PROGRESS
* in the synchronous version, "I" means that it's not valid, not
BM_IN_IN_PROGRESS, but we won't actually try to get BM_IO_IN_PROGRESS
until the later Complete/Wait call (and then the answer might chagne,
but we'll just deal with that by looping in the synchronous version)
* streaming_read.c has to deal with buffering up work that
StartReadBuffers() didn't accept
* that's actually quite easy, you just use the rest to create a new
range in the next slot

Previously I thought the requirement to deal with buffering future
stuff that StartReadBuffers() couldn't accept yet was a pain, and life
became so much simpler once I deleted all that and exposed
PrepareReadBuffer() to the calling code. Perhaps I just hadn't done a
good enough job of that.

The other disadvantage you reminded me of was the duplicate buffer
lookup in certain unlucky patterns, which I had forgotten about in my
previous email. But I guess it's not fatal to the idea and there is a
potential partial mitigation. (See below).

A third thing was the requirement for communication between
StartReadBuffers() and CompleteReadBuffers() which I originally had an
"opaque" object that the caller has to keep around that held private
state. It seemed nice to go back to talking just about buffer
numbers, but that's not really an argument for anything...

OK, I'm going to try the two-step version (again) with interfaces
along the lines you sketched out... more soon.

> I see. When you're about to zero the page, there's not much point in
> splitting the operation into Prepare/Start/Complete stages anyway.
> You're not actually doing any I/O. Perhaps it's best to have a separate
> "Buffer ZeroBuffer(Relation, ForkNumber, BlockNumber, lockmode)"
> function that does the same as
> ReadBuffer(RBM_ZERO_AND_[LOCK|CLEANUP_LOCK]) today.

That makes sense, but... hmm, sometimes just allocating a page
generates I/O if it has to evict a dirty buffer. Nothing in this code
does anything fancy about that, but imagine some hypothetical future
thing that manages to do that asynchronously -- then we might want to
take advantage of the ability to stream even a zeroed page, ie doing
something ahead of time? Just a thought for another day, and perhaps
that is just an argument for including it in the streaming read API,
but it doesn't mean that the bufmgr.c API can't be as you say.

> One weakness is that if StartReadBufferRange() finds that the range is
> "chopped up", it needs to return and throw away the work it had to do to
> look up the next buffer. So in the extreme case that every other block
> in the range is in the buffer cache, each call would look up two buffers
> in the buffer cache, startBlk and startBlk + 1, but only return one
> buffer to the caller.

Yeah, right. This was one of the observations that influenced my
PrepareReadBuffer() three-step thing that I'd forgotten. To spell
that out with an example, suppose the buffer pool contains every odd
numbered block. Successive StartReadBuffers() calls would process
"HMHm", "MHm", "MHm"... where "m" represents a miss that we can't do
anything with for a block we'll look up in the buffer pool again in
the next call. With the PrepareReadBuffer() design, that miss just
starts a new range and we don't have to look it up again. Hmm, I
suppose that could be mitigated somewhat with ReadRecentBuffer() if we
can find somewhere decent to store it.

BTW it was while thinking about and testing cases like that that I
found Palak Chaturvedi's https://commitfest.postgresql.org/46/4426/
extremely useful. It can kick out every second page or any other
range-chopping scenario you can express in a WHERE clause. I would
quite like to get that tool into the tree...

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Michael Paquier 2024-01-11 23:35:42 Re: Adding facility for injection points (or probe points?) for more advanced tests
Previous Message Euler Taveira 2024-01-11 23:00:18 Re: speed up a logical replica setup