Quick Links

Re: ReadRecentBuffer() doesn't scale well

From:	Andres Freund <andres(at)anarazel(dot)de>
To:	Thomas Munro <thomas(dot)munro(at)gmail(dot)com>
Cc:	pgsql-hackers(at)postgresql(dot)org, Peter Geoghegan <pg(at)bowt(dot)ie>
Subject:	Re: ReadRecentBuffer() doesn't scale well
Date:	2023-06-27 04:09:31
Message-ID:	20230627040931.km2ek55e2ri4a3rv@awork3.anarazel.de
Views:	Whole Thread \| Raw Message \| Download mbox \| Resend email
Thread:
Lists:	pgsql-hackers

Hi,

On 2023-06-27 15:33:57 +1200, Thomas Munro wrote:
> On Tue, Jun 27, 2023 at 2:05 PM Andres Freund <andres(at)anarazel(dot)de> wrote:
> > Unfortunately it scaled way worse at first. This is not an inherent issue, but
> > due to an implementation choice in ReadRecentBuffer(). Whereas the normal
> > BufferAlloc() path uses PinBuffer(), ReadRecentBuffer() first does
> > LockBufHdr(), checks if the buffer ID is the same and then uses
> > PinBuffer_Locked().
> >
> > The problem with that is that PinBuffer() takes care to not hold the buffer
> > header spinlock, it uses compare_exchange to atomically acquire the pin, while
> > guaranteing nobody holds the lock. When holding the buffer header spinlock,
> > there obviously is the risk of being scheduled out (or even just not have
> > exclusive access to the cacheline).
>
> Yeah. Aside from inherent nastiness of user-space spinlocks

I've been wondering about making our backoff path use futexes, after some
adaptive spinning.

> > The fairly obvious solution to this is to just use PinBuffer() and just unpin
> > the buffer if its identity was changed concurrently. There could be an
> > unlocked pre-check as well. However, there's the following comment in
> > ReadRecentBuffer():
> > * It's now safe to pin the buffer. We can't pin first and ask
> > * questions later, because it might confuse code paths like
> > * InvalidateBuffer() if we pinned a random non-matching buffer.
> > */
> >
> > But I'm not sure I buy that - there's plenty other things that can briefly
> > acquire a buffer pin (e.g. checkpointer, reclaiming the buffer for other
> > contents, etc).
>
> I may well have been too cautious with that. The worst thing I can
> think of right now is that InvalidateBuffer() would busy loop (as it
> already does in other rare cases) when it sees a pin.

Right. Particularly if we were to add a pre-check for the tag to match, that
should be extremely rare.

> > Another difference between using PinBuffer() and PinBuffer_locked() is that
> > the latter does not adjust a buffer's usagecount.
> >
> > Leaving the scalability issue aside, isn't it somewhat odd that optimizing a
> > codepath to use ReadRecentBuffer() instead of ReadBuffer() leads to not
> > increasing usagecount anymore?
>
> Yeah, that is not great. The simplification you suggest would fix
> that too, though I guess it would also bump the usage count of buffers
> that don't have the tag we expected; that's obviously rare and erring
> on a better side though.

Yea, I'm not worried about that. If somebody is, we could just add code to
decrement the usagecount again.

> > FWIW, once that's fixed, using ReadRecentBuffer() for _bt_getroot(), caching
> > the root page's buffer id in RelationData, seems a noticeable win. About 7% in
> > a concurrent, read-only pgbench that utilizes batches of 10. And it should be
> > easy to get much bigger wins, e.g. with a index nested loop with a relatively
> > small index on the inner side.
>
> Wooo, that's better than I was hoping. Thanks for trying it out! I
> think, for the complexity involved (ie very little)

I don't really have a concrete thought for where to store the id of the recent
buffer. I just added a new field into some padding in RelationData, but we
might go for something fancier.

> smgr_targblock could be another easy-to-cache candidate, ie a place where
> there is a single interesting hot page that we're already keeping track of
> with no requirement for new backend-local mapping machinery.

I wonder if we should simple add a generic field for such a Buffer to
RelationData, that the AM can use as it desires. For btree that would be the
root page, for heap the target block ...

> it's a nice result, and worth considering even though it's also a solid clue
> that we could do much better than this with a (yet to be designed)
> longer-lived pin scheme.

Indeed. PinBuffer() is pretty hot after the change. As is the buffer content
lock.

Particularly for the root page, it'd be really interesting to come up with a
scheme that keeps an offline copy of the root page while also pinning the real
root page. I think we should be able to have a post-check that can figure out
if the copied root page is out of date after searching it, without needing the
content lock.

Greetings,

Andres Freund

In response to

Re: ReadRecentBuffer() doesn't scale well at 2023-06-27 03:33:57 from Thomas Munro

Responses

Re: ReadRecentBuffer() doesn't scale well at 2023-06-27 04:32:22 from Peter Geoghegan
Re: ReadRecentBuffer() doesn't scale well at 2023-06-27 11:49:48 from Ants Aasma

Browse pgsql-hackers by date

	From	Date	Subject
Next Message	Dilip Kumar	2023-06-27 04:12:28	Re: Improving btree performance through specializing by key shape, take 2
Previous Message	Jaime Casanova	2023-06-27 04:05:43	Assert !bms_overlap(joinrel->relids, required_outer)