From: | Soumyadeep Chakraborty <soumyadeep2007(at)gmail(dot)com> |
---|---|
To: | Andres Freund <andres(at)anarazel(dot)de> |
Cc: | Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org>, Sandeep Thakkar <sandeep(dot)thakkar(at)enterprisedb(dot)com>, Andrew Dunstan <andrew(at)dunslane(dot)net>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, buildfarm-admins(at)lists(dot)postgresql(dot)org, CM Team <cm(at)enterprisedb(dot)com> |
Subject: | Re: "ERROR: latch already owned" on gharial |
Date: | 2024-02-10 01:56:19 |
Message-ID: | CAE-ML+-iXbkxeM6hO9gTU0wfkmqW=mMFKfmu4M9anG7Tz+f-Rg@mail.gmail.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-hackers |
Hey,
Deeply appreciate both your input!
On Thu, Feb 8, 2024 at 4:57 AM Heikki Linnakangas <hlinnaka(at)iki(dot)fi> wrote:
> Hmm, there is a pair of SpinLockAcquire() and SpinLockRelease() in
> ProcKill(), before step 3 can happen. Comment in spin.h about
> SpinLockAcquire/Release:
>
> > * Load and store operations in calling code are guaranteed not to be
> > * reordered with respect to these operations, because they include a
> > * compiler barrier. (Before PostgreSQL 9.5, callers needed to use a
> > * volatile qualifier to access data protected by spinlocks.)
>
> That talks about a compiler barrier, though, not a memory barrier. But
> looking at the implementations in s_lock.h, I believe they do act as
> memory barrier, too.
>
> So you might indeed have that problem on 9.4, but AFAICS not on later
> versions.
Yes 9.4 does not have 0709b7ee72e, which I'm assuming you are referring to?
Reading src/backend/storage/lmgr/README.barrier: For x86, to avoid reordering
between a load and a store, we need something that prevents both CPU and
compiler reordering. pg_memory_barrier() fits the bill.
Here we can have pid X's read of latch->owner_pid=Y reordered to precede
pid Y's store of latch->owner_pid = 0. The compiler barrier in S_UNLOCK() will
prevent compiler reordering but not CPU reordering of the above.
#define S_UNLOCK(lock) \
do { __asm__ __volatile__("" : : : "memory"); *(lock) = 0; } while (0)
which is equivalent to a:
#define pg_compiler_barrier_impl() __asm__ __volatile__("" ::: "memory")
But maybe both CPU and memory reordering will be prevented by the tas() in
S_LOCK() which does a lock and xchgb?
Is the above acting as BOTH a compiler and CPU barrier, like the lock; addl
stuff in pg_memory_barrier_impl()?
If yes, then the picture would look like this:
Pid Y in DisownLatch(), Pid X in OwnLatch()
Y: LOAD latch->ownerPid
...
Y: STORE latch->ownerPid = 0
...
// returning PGPROC to freeList
Y:S_LOCK(ProcStructLock) <--- tas() prevents X: LOAD latch->ownerPid
from preceding this
...
... <-------- X: LOAD latch->ownerPid can't get here anyway as spinlock is held
...
Y:S_UNLOCK(ProcStructLock)
...
X: S_LOCK(ProcStructLock) // to retrieve PGPROC from freeList
...
X: S_UNLOCK(ProcStructLock)
...
X: LOAD latch->ownerPid
And this issue is not caused due to 9.4 missing 0709b7ee72e, which
changed S_UNLOCK
exclusively.
If no, then we would need the patch that does an explicit pg_memory_barrier()
at the end of DisownLatch() for PG HEAD.
On Thu, Feb 8, 2024 at 1:41 PM Andres Freund <andres(at)anarazel(dot)de> wrote:
> Right. I wonder if the issue istead could be something similar to what was
> fixed in 8fb13dd6ab5b and more generally in 97550c0711972a. If two procs go
> through proc_exit() for the same process, you can get all kinds of weird
> mixed up resource ownership. The bug fixed in 8fb13dd6ab5b wouldn't apply,
> but it's pretty easy to introduce similar bugs in other places, so it seems
> quite plausible that greenplum might have done so. We also did have more
> proc_exit()s in signal handlers in older branches, so it might just be an
> issue that also was present before.
Hmm, the pids X and Y in the example provided upthread don't spawn off any
children (like by calling system()) - they are just regular backends. So its
not possible for them to receive TERM and try to proc_exit() w/ the same
PGPROC. So that is not the issue, I guess?
Regards,
Soumyadeep (VMware)
From | Date | Subject | |
---|---|---|---|
Next Message | Alexander Lakhin | 2024-02-10 03:00:01 | Re: failure in 019_replslot_limit |
Previous Message | Michael Paquier | 2024-02-10 01:19:15 | Re: Small fix on query_id_enabled |