From: | Andy Fan <zhihuifan1213(at)163(dot)com> |
---|---|
To: | PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org> |
Cc: | Thomas Munro <tmunro(at)postgresql(dot)org> |
Subject: | the s_lock_stuck on perform_spin_delay |
Date: | 2024-01-04 06:59:06 |
Message-ID: | 87plyhvama.fsf@163.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-hackers |
Hi,
from src/backend/storage/lmgr/README:
"""
Spinlocks. These are intended for *very* short-term locks. If a lock
is to be held more than a few dozen instructions, or across any sort of
kernel call (or even a call to a nontrivial subroutine), don't use a
spinlock. Spinlocks are primarily used as infrastructure for lightweight
locks.
"""
I totally agree with this and IIUC spin lock is usually used with the
following functions.
#define init_local_spin_delay(status) ..
void perform_spin_delay(SpinDelayStatus *status);
void finish_spin_delay(SpinDelayStatus *status);
During the perform_spin_delay, we have the following codes:
void
perform_spin_delay(SpinDelayStatus *status)
/* Block the process every spins_per_delay tries */
if (++(status->spins) >= spins_per_delay)
{
if (++(status->delays) > NUM_DELAYS)
s_lock_stuck(status->file, status->line, status->func);
the s_lock_stuck will PAINC the entire system.
My question is if someone doesn't obey the rule by mistake (everyone
can make mistake), shall we PANIC on a production environment? IMO I
think it can be a WARNING on a production environment and be a stuck
when 'ifdef USE_ASSERT_CHECKING'.
People may think spin lock may consume too much CPU, but it is not true
in the discussed scene since perform_spin_delay have pg_usleep in it,
and the MAX_DELAY_USEC is 1 second and MIN_DELAY_USEC is 0.001s.
I notice this issue actually because of the patch "Cache relation
sizes?" from Thomas Munro [1], where the latest patch[2] still have the
following code.
+ sr = smgr_alloc_sr(); <-- HERE a spin lock is hold
+
+ /* Upgrade to exclusive lock so we can create a mapping. */
+ LWLockAcquire(mapping_lock, LW_EXCLUSIVE); <-- HERE a complex
operation is needed. it may take a long time.
Our internal testing system found more misuses on our own PG version.
I think a experienced engineer like Thomas can make this mistake and the
patch was reviewed by 3 peoples, the bug is still there. It is not easy
to say just don't do it.
the attached code show the prototype in my mind. Any feedback is welcome.
[1]
https://www.postgresql.org/message-id/CA%2BhUKGJg%2BgqCs0dgo94L%3D1J9pDp5hKkotji9A05k2nhYQhF4%2Bw%40mail.gmail.com
[2]
https://www.postgresql.org/message-id/attachment/123659/v5-0001-WIP-Track-relation-sizes-in-shared-memory.patch
--
Best Regards
Andy Fan
Attachment | Content-Type | Size |
---|---|---|
v1-0001-Don-t-call-s_lock_stuck-in-production-environment.patch | text/x-diff | 1.7 KB |
From | Date | Subject | |
---|---|---|---|
Next Message | Ashutosh Bapat | 2024-01-04 07:00:05 | Re: speed up a logical replica setup |
Previous Message | Peter Smith | 2024-01-04 06:53:44 | Re: GUC names in messages |