From: Andres Freund <andres(at)anarazel(dot)de>
To: Yura Sokolov <y(dot)sokolov(at)postgrespro(dot)ru>
Cc: James Pang <jamespang886(at)gmail(dot)com>, pgsql-performance(at)lists(dot)postgresql(dot)org
Subject: Re: many sessions wait on LWlock WALWrite suddenly
Date: 2025-04-15 10:00:48
Message-ID: l23ybh777m5u7goxpzznqtobbgxanm65k7yxqvstwh3lze5mn4@3x5dkdtwzpj6
Lists: pgsql-performance
Hi,
On 2025-04-15 12:16:40 +0300, Yura Sokolov wrote:
> On 11.04.2025 17:36, James Pang wrote:
> > PG v14.8: during peak time, we suddenly see hundreds of active sessions
> > waiting on LWLock WALWrite at the same time, but we did not find any issue
> > with storage.
> > Any suggestions?
>
> No real suggestions...
>
> There is a single WALWrite lock.
That's true - but it's worth specifically calling out that a lot of WALWrite
lock wait events typically do not indicate real lock contention. Very often we
flush WAL for many sessions at once; in those cases the WALWrite lock wait
events just indicate that all those sessions are actually waiting for the WAL
IO to complete.
It'd be good if we could report a different wait event for the case of just
waiting for WAL IO to complete, but right now that's not entirely trivial to
do reliably. Perhaps we could at least do the minimal thing and report a
different wait event if we reach XLogFlush() with an LSN that's already in the
process of being written out?
> As a result, backends wait for each other - or, in other words, they wait
> for the latest of them!!! All backends wait until the WAL record written by
> the latest of them has been written and fsynced to disk.
They don't necessarily wait for the *latest* write, they just wait for the
latest write as of the time they started waiting.
FWIW, in the v1 AIO prototype I had split up the locking for this so that we'd
not unnecessarily need to wait for previous writes in many cases -
unfortunately, for *many* types of storage that turns out to be a significant
loss (most extremely on non-enterprise Samsung SSDs). The "maximal" group
commit behaviour minimizes the number of durable writes that need to be done,
and that is a significant benefit on many forms of storage. On other storage
it's a significant benefit to have multiple concurrent flushes, but that's a
hard tuning problem - I spent many months trying to get it right, and I never
fully got there.
> (Andres, IIUC it looks to be the main bottleneck on the way to increasing
> NUM_XLOGINSERT_LOCKS. Right?)
I don't think that the "single" WALWriteLock is a blocker to increasing
NUM_XLOGINSERT_LOCKS to a meaningful degree.
However, I think there's somewhat of an *inverse* relationship. To
efficiently flush WAL in smaller increments, we need a cheap way of
identifying the number of backends that need to wait up to a certain LSN. For
that I think we may need a refinement of the WALInsertLock infrastructure.
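To illustrate the bookkeeping problem: with a flat per-backend array of flush
targets (purely illustrative, not existing code, and MAX_BACKENDS is just a
sketch constant), answering "how many backends does flushing up to a given LSN
satisfy?" means scanning every backend slot on every flush; a cheap answer
presumably needs those targets kept in LSN order instead.

```c
#include <stdatomic.h>
#include <stdint.h>

typedef uint64_t XLogRecPtr;

#define MAX_BACKENDS 1024          /* sketch-only constant */

/* Per-backend flush target; 0 means "not waiting". */
static _Atomic XLogRecPtr wait_upto[MAX_BACKENDS];

/* A backend announces the LSN it needs flushed before it goes to sleep. */
void
register_flush_wait(int backend_id, XLogRecPtr lsn)
{
    atomic_store(&wait_upto[backend_id], lsn);
}

/*
 * How many backends would be satisfied by flushing up to 'candidate'?
 * With a flat array this is an O(MAX_BACKENDS) scan on every flush; an
 * LSN-ordered structure would make the same question cheap to answer.
 */
int
backends_satisfied_by(XLogRecPtr candidate)
{
    int n = 0;

    for (int i = 0; i < MAX_BACKENDS; i++)
    {
        XLogRecPtr target = atomic_load(&wait_upto[i]);

        if (target != 0 && target <= candidate)
            n++;
    }
    return n;
}
```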
I think the main blockers for increasing NUM_XLOGINSERT_LOCKS are:
1) Increasing NUM_XLOGINSERT_LOCKS leads to more contention on insertpos_lck,
and spinlocks scale really badly under heavy contention.
2) There are common codepaths where we need to iterate over all
NUM_XLOGINSERT_LOCKS slots; that turns out to become rather expensive, since
the relevant cachelines are very commonly not going to be in the local CPU
cache (sketched below).
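A sketch of what 2) looks like in practice (simplified stand-ins, not the
actual PostgreSQL code): each slot lives on its own cacheline, and learning
how far all in-progress insertions have advanced means touching every one of
them.

```c
#include <stdatomic.h>
#include <stdint.h>

typedef uint64_t XLogRecPtr;

#define NUM_XLOGINSERT_LOCKS 8     /* the current value upstream */
#define CACHELINE_SIZE 64

typedef struct
{
    /* LSN this slot's in-progress insertion has reached; 0 if idle. */
    _Atomic XLogRecPtr insertingAt;
    char        pad[CACHELINE_SIZE - sizeof(_Atomic XLogRecPtr)];
} InsertSlot;

static InsertSlot slots[NUM_XLOGINSERT_LOCKS];

/*
 * Up to which LSN have all insertions finished?  Every iteration touches a
 * different cacheline that is frequently not in the local CPU cache, and
 * raising NUM_XLOGINSERT_LOCKS makes this walk proportionally more expensive.
 */
XLogRecPtr
insertions_finished_before(XLogRecPtr upto)
{
    XLogRecPtr finished_upto = upto;

    for (int i = 0; i < NUM_XLOGINSERT_LOCKS; i++)
    {
        XLogRecPtr insertingat = atomic_load(&slots[i].insertingAt);

        if (insertingat != 0 && insertingat < finished_upto)
            finished_upto = insertingat;
    }
    return finished_upto;
}
```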
I think we can redesign the mechanism so that there's an LSN-ordered
ringbuffer of in-progress insertions, with the reservation being a single
64-bit atomic increment, without the need for a low limit like
NUM_XLOGINSERT_LOCKS (the ring size needs to be limited, but I didn't see a
disadvantage to using something like MaxConnections * 2).
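Very roughly, and glossing over a lot, a sketch of that shape could look like
the following. This is just an illustration, not a worked-out design: to let a
single 64-bit fetch-add do both the space reservation and the slot assignment,
the sketch pretends every record has a fixed size (real WAL records vary, so a
real design needs a richer reservation step), and it elides backpressure when
the ring is full. RING_SIZE stands in for something like MaxConnections * 2.

```c
#include <stdatomic.h>
#include <stdint.h>

typedef uint64_t XLogRecPtr;

#define RECORD_SIZE 128           /* sketch-only: pretend records are fixed size */
#define RING_SIZE   2048          /* stand-in for MaxConnections * 2 */

typedef struct
{
    /* 0 while the copy into WAL buffers is still in flight */
    _Atomic XLogRecPtr end_lsn;
} InsertEntry;

static _Atomic uint64_t reserved_upto;   /* next free byte position in WAL */
static InsertEntry ring[RING_SIZE];

/* Reserve WAL space: a single atomic increment, no per-slot spinlock. */
XLogRecPtr
reserve_insertion(void)
{
    XLogRecPtr  start = atomic_fetch_add(&reserved_upto, RECORD_SIZE);
    InsertEntry *e = &ring[(start / RECORD_SIZE) % RING_SIZE];

    atomic_store(&e->end_lsn, 0);     /* mark as in progress */
    return start;
}

/* Mark the insertion done once its data has been copied into WAL buffers. */
void
finish_insertion(XLogRecPtr start)
{
    InsertEntry *e = &ring[(start / RECORD_SIZE) % RING_SIZE];

    atomic_store(&e->end_lsn, start + RECORD_SIZE);
}

/*
 * Up to which LSN have *all* insertions completed?  Walk the ring in LSN
 * order from the last known-complete position and stop at the first entry
 * still in flight; stale values from earlier ring laps fail the equality
 * check and stop the walk as well.
 */
XLogRecPtr
insertions_finished_upto(XLogRecPtr known_complete)
{
    XLogRecPtr upto = known_complete;
    XLogRecPtr reserved = atomic_load(&reserved_upto);

    while (upto < reserved)
    {
        InsertEntry *e = &ring[(upto / RECORD_SIZE) % RING_SIZE];

        if (atomic_load(&e->end_lsn) != upto + RECORD_SIZE)
            break;
        upto += RECORD_SIZE;
    }
    return upto;
}
```

The reservation path has no spinlock at all, and because slot order matches
LSN order in this sketch, the completeness walk never needs to look at more
entries than there are insertions still in flight.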
Greetings,
Andres Freund