From: Andres Freund <andres(at)anarazel(dot)de>
To: Jacob Brazeal <jacob(dot)brazeal(at)gmail(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: MAX_BACKENDS size (comment accuracy)
Date: 2025-02-12 15:28:42
Message-ID: q2dxkmnulej2duygjlozhrdazoyq6scwdbdrxubtvtmlfdfvfe@omnwtz37ba7c
Lists: pgsql-hackers

Hi,

On 2025-02-09 12:41:58 -0800, Jacob Brazeal wrote:
> > Halving the size of LWLock and laying
> > the ground work for making the wait-list lock-free imo would be very well
> > worth the reduction in an unrealistic limit...
>
> BTW, I spent a week or two researching the lock-free queue idea,
> specifically looking at what it might look like to have a Michael-Scott
> type lock-free queue adapted to the waitlist. In the end, it seems like a
> fairly challenging project, and probably beyond my powers, but I'm very
> curious if you had some ideas around it that I'm missing. Here is a summary
> of my digging:

> The proclist has several advantages. We can always store a waitlist entry
> for a given process in the same spot in shared memory, so there's no memory
> allocation or cleanup to worry about. It's a doubly-linked list and hence
> easy to remove items from the middle of the list (given a lock is acquired).

> A Michael-Scott style lock-free queue would require giving up these
> advantages:
> * When you pop a node from a queue, it is repurposed into a temporary
> "dummy" node at the head of the queue. This means that if we want to add a
> process to a new queue, we can't reuse the memory from its previous queue
> node, because it might be in use as a dummy node. So we can't store
> waitlist entries in PGPROCs anymore; we'd probably want some kind of
> freelist. This also means that we need to eventually free nodes that have
> been popped, and the next time they're used, it might be by a different
> process.
> * The queue only supports popping from the head. So we can't directly
> support the current logic, which can skip some nodes in the waitlist to
> wake up others. Two possible workarounds I see: 1) split the waitlist into
> 3 different queues, one for each type (shared, exclusive, wait for val),
> or 2) have a lock-free "waitlist for the waitlist" that accumulates new
> entries and then take out a lock when we actually want to release nodes.
>
> Finally, we'd need to set up some mechanism for safely freeing queue nodes
> when we're done with them, like hazard pointers or maybe epoch-based
> reclamation, which looks like a slightly better fit to me for our case, but
> both mechanisms are fairly tricky and I think we'd have to seriously
> consider finding some existing implementation that's well-tested and
> compatibly licensed instead of writing our own. (Or, maybe writing and
> testing our own version is the actual heart of this project.)

Yea, I think a general lock free doubly linked list would be hard.

> So, as I say, in the end this feels beyond my powers, but it was very
> interesting to look into and I'm curious if you had some notes lying around
> on it.

As I was suggesting upthread, I was thinking that we would continue to use
something like proclist, except that we'd "inline" the head and tail pointers
into a now-64bit state variable.

Generic lock free queues are decidedly nontrivial to write using
compare-exchange etc as a primitive, primarily due to ABA issues. Addressing
that in turn will often require some safe memory reclamation mechanism. But I
think we actually have a considerably simpler case here than a general
lock-free doubly linked list.

E.g.:

1) While waiting for an lwlock, a process will not do anything else (there
is the corner case of enqueueing while the lock is concurrently released,
however)

2) Memory lifetime is not a concern from a need-to-free POV (it could be a
concern from an ABA POV)

3) The set of potential list members is strictly limited and a fairly small
number (2^18)

This means we can represent prev/next in list members as a single 64bit
value that can be CASed without requiring 128bit CAS (which scales
very badly)

4) The wait list does not have to be completely lock & wait free; it's
sufficient for the common cases to be lock free.

We can e.g. do things like have a small ABA avoidance counter and fall back
to something like the existing spinlock whenever it overflows.

Another aspect that could be interesting is that we effectively have two
different wait queues. One for shared lockers and one for exclusive lockers. I
wonder if we could benefit from representing them as two singly-linked lists,
instead of one doubly-linked list. There's also LW_WAIT_UNTIL_FREE, but
that's basically an exclusive lock and I think we'll get rid of it before
long.

One annoying thing is that, afaict, we can't rely on pointers to lwlocks
being stable across processes, which makes some schemes harder.

I started to write up a scheme using the above restrictions, but it was taking
a bit, and I really gotta focus on stuff that can make it into the next
release...

Greetings,

Andres Freund
