From: | Andres Freund <andres(at)anarazel(dot)de> |
---|---|
To: | Thomas Munro <thomas(dot)munro(at)gmail(dot)com> |
Cc: | pgsql-hackers(at)postgresql(dot)org, Peter Geoghegan <pg(at)bowt(dot)ie> |
Subject: | Re: ReadRecentBuffer() doesn't scale well |
Date: | 2023-06-27 04:09:31 |
Message-ID: | 20230627040931.km2ek55e2ri4a3rv@awork3.anarazel.de |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-hackers |
Hi,
On 2023-06-27 15:33:57 +1200, Thomas Munro wrote:
> On Tue, Jun 27, 2023 at 2:05 PM Andres Freund <andres(at)anarazel(dot)de> wrote:
> > Unfortunately it scaled way worse at first. This is not an inherent issue, but
> > due to an implementation choice in ReadRecentBuffer(). Whereas the normal
> > BufferAlloc() path uses PinBuffer(), ReadRecentBuffer() first does
> > LockBufHdr(), checks if the buffer ID is the same and then uses
> > PinBuffer_Locked().
> >
> > The problem with that is that PinBuffer() takes care to not hold the buffer
> > header spinlock, it uses compare_exchange to atomically acquire the pin, while
> > guaranteing nobody holds the lock. When holding the buffer header spinlock,
> > there obviously is the risk of being scheduled out (or even just not have
> > exclusive access to the cacheline).
>
> Yeah. Aside from inherent nastiness of user-space spinlocks
I've been wondering about making our backoff path use futexes, after some
adaptive spinning.
> > The fairly obvious solution to this is to just use PinBuffer() and just unpin
> > the buffer if its identity was changed concurrently. There could be an
> > unlocked pre-check as well. However, there's the following comment in
> > ReadRecentBuffer():
> > * It's now safe to pin the buffer. We can't pin first and ask
> > * questions later, because it might confuse code paths like
> > * InvalidateBuffer() if we pinned a random non-matching buffer.
> > */
> >
> > But I'm not sure I buy that - there's plenty other things that can briefly
> > acquire a buffer pin (e.g. checkpointer, reclaiming the buffer for other
> > contents, etc).
>
> I may well have been too cautious with that. The worst thing I can
> think of right now is that InvalidateBuffer() would busy loop (as it
> already does in other rare cases) when it sees a pin.
Right. Particularly if we were to add a pre-check for the tag to match, that
should be extremely rare.
> > Another difference between using PinBuffer() and PinBuffer_locked() is that
> > the latter does not adjust a buffer's usagecount.
> >
> > Leaving the scalability issue aside, isn't it somewhat odd that optimizing a
> > codepath to use ReadRecentBuffer() instead of ReadBuffer() leads to not
> > increasing usagecount anymore?
>
> Yeah, that is not great. The simplification you suggest would fix
> that too, though I guess it would also bump the usage count of buffers
> that don't have the tag we expected; that's obviously rare and erring
> on a better side though.
Yea, I'm not worried about that. If somebody is, we could just add code to
decrement the usagecount again.
> > FWIW, once that's fixed, using ReadRecentBuffer() for _bt_getroot(), caching
> > the root page's buffer id in RelationData, seems a noticeable win. About 7% in
> > a concurrent, read-only pgbench that utilizes batches of 10. And it should be
> > easy to get much bigger wins, e.g. with a index nested loop with a relatively
> > small index on the inner side.
>
> Wooo, that's better than I was hoping. Thanks for trying it out! I
> think, for the complexity involved (ie very little)
I don't really have a concrete thought for where to store the id of the recent
buffer. I just added a new field into some padding in RelationData, but we
might go for something fancier.
> smgr_targblock could be another easy-to-cache candidate, ie a place where
> there is a single interesting hot page that we're already keeping track of
> with no requirement for new backend-local mapping machinery.
I wonder if we should simple add a generic field for such a Buffer to
RelationData, that the AM can use as it desires. For btree that would be the
root page, for heap the target block ...
> it's a nice result, and worth considering even though it's also a solid clue
> that we could do much better than this with a (yet to be designed)
> longer-lived pin scheme.
Indeed. PinBuffer() is pretty hot after the change. As is the buffer content
lock.
Particularly for the root page, it'd be really interesting to come up with a
scheme that keeps an offline copy of the root page while also pinning the real
root page. I think we should be able to have a post-check that can figure out
if the copied root page is out of date after searching it, without needing the
content lock.
Greetings,
Andres Freund
From | Date | Subject | |
---|---|---|---|
Next Message | Dilip Kumar | 2023-06-27 04:12:28 | Re: Improving btree performance through specializing by key shape, take 2 |
Previous Message | Jaime Casanova | 2023-06-27 04:05:43 | Assert !bms_overlap(joinrel->relids, required_outer) |