Quick Links

Re: the s_lock_stuck on perform_spin_delay

From:	Andy Fan <zhihuifan1213(at)163(dot)com>
To:	Andres Freund <andres(at)anarazel(dot)de>
Cc:	Thomas Munro <tmunro(at)postgresql(dot)org>, pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject:	Re: the s_lock_stuck on perform_spin_delay
Date:	2024-01-05 02:20:39
Message-ID:	87wmsoo56u.fsf@163.com
Views:	Whole Thread \| Raw Message \| Download mbox \| Resend email
Thread:
Lists:	pgsql-hackers

Hi,

Andres Freund <andres(at)anarazel(dot)de> writes:

>
> On 2024-01-04 14:59:06 +0800, Andy Fan wrote:
>> My question is if someone doesn't obey the rule by mistake (everyone
>> can make mistake), shall we PANIC on a production environment? IMO I
>> think it can be a WARNING on a production environment and be a stuck
>> when 'ifdef USE_ASSERT_CHECKING'.
>>
>> [...]
>>
>> I notice this issue actually because of the patch "Cache relation
>> sizes?" from Thomas Munro [1], where the latest patch[2] still have the
>> following code.
>> + sr = smgr_alloc_sr(); <-- HERE a spin lock is hold
>> +
>> + /* Upgrade to exclusive lock so we can create a mapping. */
>> + LWLockAcquire(mapping_lock, LW_EXCLUSIVE); <-- HERE a complex
>> operation is needed. it may take a long time.
>>
>> Our internal testing system found more misuses on our own PG version.
>
>> I think a experienced engineer like Thomas can make this mistake and the
>> patch was reviewed by 3 peoples, the bug is still there. It is not easy
>> to say just don't do it.
>
> I don't follow this argument - just ignoring the problem,

I agree with you but I'm feeling you ignored my post at [1], where I
said for the known issue, it should be fixed at the first chance.

> which emitting a
> WARNING basically is, doesn't reduce the impact of the bug, it *increases* the
> impact, because now the system will not recover from the bug without explicit
> operator intervention. During that time the system might be essentially
> unresponsive, because all backends end up contending for some spinlock, which
> makes investigating such issues very hard.

Acutally they are doing pg_usleep at the most time.

Besides what Robert said, one more reason to question PANIC is that: PAINC
can't always make the system recovery faster because: a). In the most
system, PANIC makes a core dump which take times and spaces. b). After
the reboot, all the caches like relcache, plancache, fdcache need to be
rebuit. c). Customer needs to handle failure better or else they will be
hurt *more often*. All of such sense cause slowness as well.

>
> I think we should add infrastructure to detect bugs like this during
> development,

The commit 2 in [1] does something like this. for the details, I missed the
check for memory allocation case as you suggested at [2], but checked
heavyweight lock as well. others should be same IIUC.

> but not PANICing when this happens in production seems completely
> non-viable.
>

Not sure what does *this* exactly means. If it means the bug in Thomas's
patch, I absoluately agree with you(since it is a known bug and it
should be fixed). If it means the general *unknown* case, it's something
we talked above.

I'm also agree that some LLVM static checker should be pretty good
ideally, it just requires more knowledge base and future maintain
effort. I am willing to have a try shortly.

[1] https://www.postgresql.org/message-id/871qaxp3ly.fsf%40163.com
[2]
https://www.postgresql.org/message-id/20240104225403.dgmbbfffmm3srpgq%40awork3.anarazel.de

--
Best Regards
Andy Fan

In response to

Re: the s_lock_stuck on perform_spin_delay at 2024-01-04 23:06:28 from Andres Freund

Responses

Re: the s_lock_stuck on perform_spin_delay at 2024-01-05 19:27:13 from Andres Freund

Browse pgsql-hackers by date

	From	Date	Subject
Next Message	Andres Freund	2024-01-05 02:21:07	Re: the s_lock_stuck on perform_spin_delay
Previous Message	John Naylor	2024-01-05 02:03:39	Re: add AVX2 support to simd.h