From: Andres Freund <andres(at)anarazel(dot)de>
To: Noah Misch <noah(at)leadboat(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, Robert Haas <robertmhaas(at)gmail(dot)com>, Jakub Wartak <jakub(dot)wartak(at)enterprisedb(dot)com>, Jelte Fennema-Nio <postgres(at)jeltef(dot)nl>, Antonin Houska <ah(at)cybertec(dot)at>
Subject: Re: AIO v2.5
Date: 2025-04-01 16:51:53
Message-ID: g5eisego74jjmdqck2ge4r3bunnjk4m56o7omdec6pnzdp42nf@gcici63b6iyf
Lists: pgsql-hackers
Hi,
On 2025-04-01 09:07:27 -0700, Noah Misch wrote:
> On Tue, Apr 01, 2025 at 11:55:20AM -0400, Andres Freund wrote:
> > WRT the locking issues, I've been wondering whether we could make
> > LWLockWaitForVar() work for that purpose, but I doubt it's the right approach.
> > Probably better to get rid of the LWLock*Var functions and go for the approach
> > I had in v1, namely a version of LWLockAcquire() with a callback that gets
> > called between LWLockQueueSelf() and PGSemaphoreLock(), which can cause the
> > lock acquisition to abort.
>
> What are the best thing(s) to read to understand the locking issues?
Unfortunately I think it's our discussion from a few days/weeks ago.
The problem basically is that functions like LockBuffer(EXCLUSIVE) need to be
able to non-racily
a) wait for in-flight IOs
b) acquire the content lock
If you just do it naively like this:
else if (mode == BUFFER_LOCK_EXCLUSIVE)
{
    /* wait for any IO that is already in flight... */
    if (pg_atomic_read_u32(&buf->state) & BM_IO_IN_PROGRESS)
        WaitIO(buf);
    /* ...but nothing stops new IO from starting right here */
    LWLockAcquire(content_lock, LW_EXCLUSIVE);
}
you obviously could have another backend start new IO between the WaitIO() and
the LWLockAcquire(). If that other backend then doesn't consume the
completion of that IO, the current backend could end up endlessly waiting for
the IO. I don't see a way to avoid that with narrow changes just to LockBuffer().
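To spell the race out (simplified, names just illustrative):

/*
 * A = the backend in LockBuffer(), B = some other backend:
 *
 *   A: sees BM_IO_IN_PROGRESS unset (or WaitIO() returns)
 *   B: starts a new asynchronous IO on the same buffer
 *   A: LWLockAcquire(content_lock, LW_EXCLUSIVE) now has to wait for
 *      that IO, but does so sleeping on the lwlock
 *   B: goes on to other work without consuming the completion
 *
 * Nothing guarantees that anybody ever processes B's IO, so A could
 * wait forever.
 */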
We need some infrastructure that allows us to avoid that issue. One approach
could be to integrate more tightly with lwlock.c. If
1) anyone starting IO were to wake up all waiters for the LWLock
2) the waiting side checked that there is no IO in progress *after*
   LWLockQueueSelf(), but before PGSemaphoreLock()
then the backend doing LockBuffer() would be guaranteed to get the chance to
wait for the IO, rather than for the lwlock.
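Very roughly, 2) could look something like the following inside lwlock.c.
That's just meant to show the shape of it - the function name is made up and
all the real bookkeeping (lwWaiting, extraWaits, releaseOK, held_lwlocks, ...)
is waved away, so don't read it as working code:

bool
LWLockAcquireOrAbort(LWLock *lock, LWLockMode mode,
                     bool (*abort_cond) (void *arg), void *arg)
{
    for (;;)
    {
        /* uncontended fast path, as in LWLockAcquire() */
        if (!LWLockAttemptLock(lock, mode))
            return true;

        /* contended: put ourselves on the wait list first */
        LWLockQueueSelf(lock, mode);

        /* the lock might have been released while we queued */
        if (!LWLockAttemptLock(lock, mode))
        {
            LWLockDequeueSelf(lock);
            return true;
        }

        /*
         * This is 2) above.  We are on the wait list, so thanks to 1)
         * any IO started from here on will wake us up; checking the
         * caller's condition now therefore can't be defeated by a
         * concurrently started IO.
         */
        if (abort_cond(arg))
        {
            LWLockDequeueSelf(lock);
            return false;       /* caller deals with the IO, then retries */
        }

        /* sleep; when woken, start over from the top */
        PGSemaphoreLock(MyProc->sem);
    }
}

LockBuffer(BUFFER_LOCK_EXCLUSIVE) could then loop along the lines of

    while (!LWLockAcquireOrAbort(content_lock, LW_EXCLUSIVE,
                                 buf_io_in_progress_cb, buf))
        WaitIO(buf);

(callback name equally made up, checking BM_IO_IN_PROGRESS in the buffer
state), so the backend always ends up waiting for the IO in a place where it
can actually consume the completion, instead of sleeping on the lwlock
semaphore.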
But there might be better approaches.
I'm not really convinced that using generic lwlocks for buffer locking is the
best idea. There are just too many special things about buffers. E.g. we have
rather massive NUMA scalability issues due to the amount of lock traffic from
buffer header and content lock atomic operations, particularly on things like the
uppermost levels of a btree. I've played with ideas like super-pinning and
locking btree root pages, which move all the overhead to the side that wants
to exclusively lock such a page - but that doesn't really make sense for
lwlocks in general.
Greetings,
Andres Freund