Re: Issue with the PRNG used by Postgres

From: Andres Freund <andres(at)anarazel(dot)de>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Parag Paul <parag(dot)paul(at)gmail(dot)com>, pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject: Re: Issue with the PRNG used by Postgres
Date: 2024-04-10 19:08:21
Message-ID: 20240410190821.yhquanxyhpqtkett@awork3.anarazel.de

Hi,

On 2024-04-10 14:02:20 -0400, Tom Lane wrote:
> On third thought ... while I still think this is a misuse of
> perform_spin_delay and we should change it, I'm not sure it'll do
> anything to address Parag's problem, because IIUC he's seeing actual
> "stuck spinlock" reports. That implies that the inner loop of
> LWLockWaitListLock slept NUM_DELAYS times without ever seeing
> LW_FLAG_LOCKED clear. What I'm suggesting would change the triggering
> condition to "NUM_DELAYS sleeps without acquiring the lock", which is
> strictly more likely to happen, so it's not going to help him. It's
> certainly still well out in we-shouldn't-get-there territory, though.
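
To make the two triggering conditions concrete, here is a small self-contained
model (in the spirit of LWLockWaitListLock, but not the actual lwlock.c code:
the lock word, spin_delay() and the panic path are stand-ins, and the
spin-before-sleep behaviour of perform_spin_delay is omitted):

#include <stdatomic.h>
#include <stdio.h>
#include <stdlib.h>

#define FLAG_LOCKED 1u          /* stand-in for LW_FLAG_LOCKED */
#define NUM_DELAYS  1000        /* same constant name as s_lock.c */

static atomic_uint lock_state;

/* stand-in for perform_spin_delay(); every call counts as one "sleep" */
static void
spin_delay(int *ndelays)
{
    if (++(*ndelays) > NUM_DELAYS)
    {
        fprintf(stderr, "stuck spinlock detected\n");
        abort();
    }
}

/*
 * Current rule: the sleep counter lives in the inner loop, so the panic
 * fires only after NUM_DELAYS sleeps without ever seeing the flag clear.
 */
static void
wait_list_lock_current(void)
{
    for (;;)
    {
        unsigned    old = atomic_fetch_or(&lock_state, FLAG_LOCKED);
        int         ndelays = 0;

        if (!(old & FLAG_LOCKED))
            return;             /* acquired */

        while (atomic_load(&lock_state) & FLAG_LOCKED)
            spin_delay(&ndelays);   /* counter resets on every retry */
    }
}

/*
 * Proposed rule: the counter persists across retries, so the panic fires
 * after NUM_DELAYS sleeps without acquiring the lock -- a strictly more
 * likely event.
 */
static void
wait_list_lock_proposed(void)
{
    int         ndelays = 0;

    for (;;)
    {
        unsigned    old = atomic_fetch_or(&lock_state, FLAG_LOCKED);

        if (!(old & FLAG_LOCKED))
            return;             /* acquired */

        while (atomic_load(&lock_state) & FLAG_LOCKED)
            spin_delay(&ndelays);   /* counter survives retries */
    }
}

int
main(void)
{
    wait_list_lock_current();
    atomic_store(&lock_state, 0);
    wait_list_lock_proposed();
    printf("both variants acquired the (uncontended) lock\n");
    return 0;
}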

I think it could exacerbate the issue. Parag reported ~7k connections on a
128 core machine. The buffer replacement logic in < 16 tries to lock the old
and new lock partitions at once. That can lead to quite bad "chains" of
dependent lwlocks, occasionally putting all the pressure on a single lwlock.
With 7k waiters on a single spinlock, a higher frequency of wakeups makes it
much more likely that the process holding the spinlock will be put to sleep.

This is greatly exacerbated by the issue fixed in a4adc31f690: once the
waitqueue is long, the spinlock will be held for an extended amount of time.
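
To put a rough number on that effect, here is a toy model (not the pre-fix
PostgreSQL code; all names are invented, and it assumes the expensive part
was a linear scan of the wait queue while the wait-list lock is held):

#include <stdio.h>
#include <stdlib.h>

typedef struct Waiter
{
    int            procno;
    struct Waiter *next;
} Waiter;

static Waiter *wait_list;           /* head of the wait queue */
static volatile int list_locked;    /* stand-in for LW_FLAG_LOCKED */

/* remove one waiter; returns how many entries were scanned under the lock */
static int
dequeue_self(int procno)
{
    int         scanned = 0;

    list_locked = 1;                /* LWLockWaitListLock() stand-in */
    for (Waiter **p = &wait_list; *p != NULL; p = &(*p)->next)
    {
        scanned++;
        if ((*p)->procno == procno)
        {
            Waiter     *victim = *p;

            *p = victim->next;
            free(victim);
            break;
        }
    }
    list_locked = 0;                /* LWLockWaitListUnlock() stand-in */
    return scanned;
}

int
main(void)
{
    int         nwaiters = 7000;    /* roughly the reported connection count */

    /* build the queue; waiter 0 ends up at the tail */
    for (int i = 0; i < nwaiters; i++)
    {
        Waiter     *w = malloc(sizeof(Waiter));

        w->procno = i;
        w->next = wait_list;
        wait_list = w;
    }

    /* worst case: the whole queue is walked while the lock is held */
    printf("entries scanned under the lock: %d\n", dequeue_self(0));
    return 0;
}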

Greetings,

Andres Freund
