| From: | Yura Sokolov <y(dot)sokolov(at)postgrespro(dot)ru> | 
|---|---|
| To: | Andres Freund <andres(at)anarazel(dot)de> | 
| Cc: | James Pang <jamespang886(at)gmail(dot)com>, pgsql-performance(at)lists(dot)postgresql(dot)org | 
| Subject: | Re: many sessions wait on LWlock WALWrite suddenly | 
| Date: | 2025-04-15 10:58:55 | 
| Message-ID: | c81611f5-ad93-420b-9bc2-d1576028c337@postgrespro.ru | 
| Lists: | pgsql-performance | 
15.04.2025 13:53, Andres Freund wrote:
> Hi,
> 
> On 2025-04-15 13:44:09 +0300, Yura Sokolov wrote:
>> 15.04.2025 13:00, Andres Freund wrote:
>>> 1) Increasing NUM_XLOGINSERT_LOCKS allows more contention on insertpos_lck and
>>>    spinlocks scale really badly under heavy contention
>>>
>>> I think we can redesign the mechanism so that there's an LSN ordered
>>> ringbuffer of in-progress insertions, with the reservation being a single
>>> 64bit atomic increment, without the need for a low limit like
>>> NUM_XLOGINSERT_LOCKS (the ring size needs to be limited, but I didn't see a
>>> disadvantage with using something like MaxConnections * 2).
>>
>> There is such an attempt at [1], and Zhiguo says it shows really promising
>> results.
>>
>> No, I did it not with a ring buffer, but rather with a hash table. But it is
>> still lock-free.
> 
> I don't find that approach particularly promising - I do think we want this to
> be an ordered datastructure, not something as fundamentally unordered as a
> hashtable.
I've tried to construct such a thing, but the "Switch WAL" record didn't
allow me to finish the design: a "Switch WAL" record has no fixed size, and
it is allowed to not be inserted at all. That breaks the ordering.
Probably I just didn't think hard enough to find a workaround.
And certainly I thought about it only for log reservation, not for waiting
for an insertion to complete, nor for waiting for a write to complete.
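
To make the "Switch WAL" problem concrete, here is a minimal sketch in C11.
It is not PostgreSQL code: reserved_pos, SEG_SIZE, reserve_fixed() and
reserve_switch() are hypothetical names, and page headers are ignored. A
record of known size can be reserved with one atomic add, but a switch
record pads to the end of the current segment (and is skipped at a segment
boundary), so its size depends on the current position and a plain
fetch-add is not enough:

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical segment size in bytes, tiny on purpose for illustration. */
#define SEG_SIZE 1024u

/* Hypothetical shared counter: next free byte position in the log. */
static _Atomic uint64_t reserved_pos;

/* Fixed-size record: a single atomic add reserves [start, start + size). */
static uint64_t reserve_fixed(uint64_t size)
{
    return atomic_fetch_add(&reserved_pos, size);
}

/*
 * Switch-like record: it consumes the rest of the current segment, and is
 * not inserted at all when the position already sits on a segment boundary.
 * Its size is only known after reading the position, hence the CAS loop.
 */
static bool reserve_switch(uint64_t *start, uint64_t *end)
{
    uint64_t cur = atomic_load(&reserved_pos);

    for (;;)
    {
        if (cur % SEG_SIZE == 0)
            return false;           /* nothing to switch, reserve nothing */

        uint64_t next = (cur / SEG_SIZE + 1) * SEG_SIZE;

        /* On failure, cur is reloaded with the current value and we retry. */
        if (atomic_compare_exchange_weak(&reserved_pos, &cur, next))
        {
            *start = cur;
            *end = next;
            return true;
        }
    }
}
```

In a ring of in-progress insertions the "not inserted" case would also mean
that a slot is sometimes not consumed, which is the ordering problem
described above.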
>> And then everything gets stuck on the WALWrite lock.
> 
> That will often, but not always, mean that you're just hitting the IO
> throughput of the storage device.  Right now it's too hard to tell the
> difference, hence the suggestion to make the wait events more informative.
> 
> 
>>> However, I think there's somewhat of an *inverse* relationship.  To
>>> efficiently flush WAL in smaller increments, we need a cheap way of
>>> identifying the number of backends that need to wait up to a certain LSN.
>>
>> I believe LWLockWaitForVar should be redone:
>> - currently it waits for the variable to change (i.e. to be distinct from
>> the provided value),
>> - but I believe it should wait for the variable to be greater than the provided value.
> 
> I think we should simply get rid of the mechanism altogether :)
> 
> 
>> This way:
>> - a WALInsertLock waiter will not wake up on every change of insertingAt
>> - the process that writes and fsyncs WAL will be able to wake waiters on
>> every fsync, instead of at the end of the whole write.
>>
>> It will reduce the overhead of waiting on WALInsertLock a lot, and will greatly
>> reduce the time spent waiting on the WALWrite lock.
> 
>> Btw, insertingAt has to be filled in at the start of copying a WAL record to
>> the WAL buffers. Yes, we believe copying a small WAL record is fast, but when
>> a lot of WAL inserters are doing their job, we needlessly sleep on their
>> WALInsertLock although they are already in the future.
> 
> Yes, that's a problem - but it also adds some overhead.  I think we'll be
> better off going with the ringbuffer approach where insertions are naturally
> ordered and we can wait for precisely the insertions that we need to.
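
For reference, the difference between the two waiting rules discussed in
the quoted text can be modelled with a small sketch. It is only a
single-threaded model, not the LWLock code: wakeups_on_change() and
wakeups_on_reach() are hypothetical helpers that count how often one waiter
would be woken while a variable such as insertingAt goes through a series
of updates, and lock release is ignored:

```c
#include <stddef.h>
#include <stdint.h>

/*
 * "Wake when the value differs from the one I last saw": the waiter is
 * woken on every change, rechecks, and sleeps again until target is reached.
 */
static size_t wakeups_on_change(const uint64_t *updates, size_t n,
                                uint64_t initial, uint64_t target)
{
    size_t   wakeups = 0;
    uint64_t seen = initial;

    for (size_t i = 0; i < n; i++)
    {
        if (updates[i] != seen)
        {
            wakeups++;
            seen = updates[i];
            if (seen >= target)
                break;              /* the waiter is finally done */
        }
    }
    return wakeups;
}

/*
 * "Wake when the value has reached my target": the waiter is woken at most
 * once, by the first update that reaches or passes the target.
 */
static size_t wakeups_on_reach(const uint64_t *updates, size_t n,
                               uint64_t target)
{
    for (size_t i = 0; i < n; i++)
    {
        if (updates[i] >= target)
            return 1;
    }
    return 0;
}
```

With updates 100, 200, 300, 400 and a waiter that needs 350, the first rule
wakes the waiter four times and the second wakes it once.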
-- 
regards
Yura Sokolov aka funny-falcon