Re: "ERROR: latch already owned" on gharial

From: Soumyadeep Chakraborty <soumyadeep2007(at)gmail(dot)com>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org>, Sandeep Thakkar <sandeep(dot)thakkar(at)enterprisedb(dot)com>, Andrew Dunstan <andrew(at)dunslane(dot)net>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, buildfarm-admins(at)lists(dot)postgresql(dot)org, CM Team <cm(at)enterprisedb(dot)com>
Subject: Re: "ERROR: latch already owned" on gharial
Date: 2024-02-10 01:56:19
Message-ID: CAE-ML+-iXbkxeM6hO9gTU0wfkmqW=mMFKfmu4M9anG7Tz+f-Rg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hey,

Deeply appreciate both your input!

On Thu, Feb 8, 2024 at 4:57 AM Heikki Linnakangas <hlinnaka(at)iki(dot)fi> wrote:
> Hmm, there is a pair of SpinLockAcquire() and SpinLockRelease() in
> ProcKill(), before step 3 can happen. Comment in spin.h about
> SpinLockAcquire/Release:
>
> > * Load and store operations in calling code are guaranteed not to be
> > * reordered with respect to these operations, because they include a
> > * compiler barrier. (Before PostgreSQL 9.5, callers needed to use a
> > * volatile qualifier to access data protected by spinlocks.)
>
> That talks about a compiler barrier, though, not a memory barrier. But
> looking at the implementations in s_lock.h, I believe they do act as
> memory barrier, too.
>
> So you might indeed have that problem on 9.4, but AFAICS not on later
> versions.

Yes 9.4 does not have 0709b7ee72e, which I'm assuming you are referring to?

Reading src/backend/storage/lmgr/README.barrier: For x86, to avoid reordering
between a load and a store, we need something that prevents both CPU and
compiler reordering. pg_memory_barrier() fits the bill.

Here we can have pid X's read of latch->owner_pid=Y reordered to precede
pid Y's store of latch->owner_pid = 0. The compiler barrier in S_UNLOCK() will
prevent compiler reordering but not CPU reordering of the above.

#define S_UNLOCK(lock) \
do { __asm__ __volatile__("" : : : "memory"); *(lock) = 0; } while (0)
which is equivalent to a:
#define pg_compiler_barrier_impl() __asm__ __volatile__("" ::: "memory")

But maybe both CPU and memory reordering will be prevented by the tas() in
S_LOCK() which does a lock and xchgb?

Is the above acting as BOTH a compiler and CPU barrier, like the lock; addl
stuff in pg_memory_barrier_impl()?

If yes, then the picture would look like this:

Pid Y in DisownLatch(), Pid X in OwnLatch()

Y: LOAD latch->ownerPid
...
Y: STORE latch->ownerPid = 0
...
// returning PGPROC to freeList
Y:S_LOCK(ProcStructLock) <--- tas() prevents X: LOAD latch->ownerPid
from preceding this
...
... <-------- X: LOAD latch->ownerPid can't get here anyway as spinlock is held
...
Y:S_UNLOCK(ProcStructLock)
...
X: S_LOCK(ProcStructLock) // to retrieve PGPROC from freeList
...
X: S_UNLOCK(ProcStructLock)
...
X: LOAD latch->ownerPid

And this issue is not caused due to 9.4 missing 0709b7ee72e, which
changed S_UNLOCK
exclusively.

If no, then we would need the patch that does an explicit pg_memory_barrier()
at the end of DisownLatch() for PG HEAD.

On Thu, Feb 8, 2024 at 1:41 PM Andres Freund <andres(at)anarazel(dot)de> wrote:

> Right. I wonder if the issue istead could be something similar to what was
> fixed in 8fb13dd6ab5b and more generally in 97550c0711972a. If two procs go
> through proc_exit() for the same process, you can get all kinds of weird
> mixed up resource ownership. The bug fixed in 8fb13dd6ab5b wouldn't apply,
> but it's pretty easy to introduce similar bugs in other places, so it seems
> quite plausible that greenplum might have done so. We also did have more
> proc_exit()s in signal handlers in older branches, so it might just be an
> issue that also was present before.

Hmm, the pids X and Y in the example provided upthread don't spawn off any
children (like by calling system()) - they are just regular backends. So its
not possible for them to receive TERM and try to proc_exit() w/ the same
PGPROC. So that is not the issue, I guess?

Regards,
Soumyadeep (VMware)

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Alexander Lakhin 2024-02-10 03:00:01 Re: failure in 019_replslot_limit
Previous Message Michael Paquier 2024-02-10 01:19:15 Re: Small fix on query_id_enabled