Re: Issue with the PRNG used by Postgres

From: "Andrey M(dot) Borodin" <x4mmm(at)yandex-team(dot)ru>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Parag Paul <parag(dot)paul(at)gmail(dot)com>, pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject: Re: Issue with the PRNG used by Postgres
Date: 2024-06-11 06:26:38
Message-ID: 239D257E-6740-4644-BBDA-9600A45592AF@yandex-team.ru
Lists: pgsql-hackers

FWIW, yesterday we had one more reproduction of a stuck spinlock panic that does not look like an actual stuck spinlock.

I don’t see any valuable diagnostic information. The reproduction happened on a hot standby. There’s a message in the primary’s logs at the same time, but it does not seem to be related:
"process 3918804 acquired ShareLock on transaction 909261926 after 2716.594 ms"
PostgreSQL 14.11
The VM with this node does not seem heavily loaded; according to monitoring there were just 2 busy backends before the panic shutdown.

> On 16 Apr 2024, at 20:54, Andres Freund <andres(at)anarazel(dot)de> wrote:
>
> Hi,
>
> On 2024-04-15 10:54:16 -0400, Robert Haas wrote:
>> On Fri, Apr 12, 2024 at 3:33 PM Andres Freund <andres(at)anarazel(dot)de> wrote:
>>> Here's a patch implementing this approach. I confirmed that before we trigger
>>> the stuck spinlock logic very quickly and after we don't. However, if most
>>> sleeps are interrupted, it can delay the stuck spinlock detection a good
>>> bit. But that seems much better than triggering it too quickly.
>>
>> +1 for doing something about this. I'm not sure if it goes far enough,
>> but it definitely seems much better than doing nothing.
>
> One thing I started to be worried about is whether a patch ought to prevent
> the timeout used by perform_spin_delay() from increasing when
> interrupted. Otherwise a few signals can trigger quite long waits.
>
> But as I can't quite see a way to make this accurate in the backbranches, I
> suspect something like what I posted is still a good first version.
>

What kind of inaccuracy do you see?
The code in perform_spin_delay() does not seem to be much different across REL_11_STABLE..REL_12_STABLE.
The only difference I see is how the random number is generated.

Thanks!

Best regards, Andrey Borodin.
