Re: Inval reliability, especially for inplace updates

From: Noah Misch <noah(at)leadboat(dot)com>
To: Alexander Lakhin <exclusion(at)gmail(dot)com>
Cc: Nitin Motiani <nitinmotiani(at)google(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Inval reliability, especially for inplace updates
Date: 2024-10-31 20:01:39
Message-ID: 20241031200139.b4@rfd.leadboat.com
Lists: pgsql-hackers

On Thu, Oct 31, 2024 at 05:00:02PM +0300, Alexander Lakhin wrote:
> I've accidentally discovered an incorrect behaviour caused by commit
> 4eac5a1fa. Running this script:

Thanks. This looks important.

> parallel -j40 --linebuffer --tag .../reproi.sh ::: `seq 40`

This didn't reproduce it for me, at -j20, -j40, or -j80. I tested at commit
fb7e27a. At what commit(s) does it reproduce for you? At what commits, if
any, did your test not reproduce this?

> All three autovacuum workers (1143263, 1143320, 1143403) are also waiting
> for the same buffer lock:
> #5  0x0000561dd715f1fe in PGSemaphoreLock (sema=0x7fed9a817338) at pg_sema.c:327
> #6  0x0000561dd722fe02 in LWLockAcquire (lock=0x7fed9ad9b4e4, mode=LW_SHARED) at lwlock.c:1318
> #7  0x0000561dd71f8423 in LockBuffer (buffer=36, mode=1) at bufmgr.c:4182

Can you share the full backtrace for the autovacuum workers?

This looks like four backends all waiting for BUFFER_LOCK_SHARE on the same
pg_class page. One backend is in CREATE TABLE, and three are in autovacuum.
No other process appears to hold the BUFFER_LOCK_EXCLUSIVE that would be
blocking these four.
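
If one of the stuck backends is still around, inspecting the lock itself
might narrow that down. A possible gdb incantation, assuming buffer 36 from
your LockBuffer() frame (BufferDescriptors[] is zero-based, so buffer 36 is
slot 35):

  (gdb) p BufferDescriptors[35].bufferdesc.content_lock

If LW_VAL_EXCLUSIVE (1 << 24) is set in "state", some process does hold the
lock exclusively; if only shared-refcount bits and LW_FLAG_HAS_WAITERS
appear, there's no exclusive holder to find.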

> Also as a side note, these processes can't be terminated with SIGTERM, I
> have to kill them.

That suggests they're trying to acquire one LWLock while holding another.
I'll recreate your CREATE TABLE stack trace and study its conditions. It's
not readily clear to me how that path would end up holding the relevant
LWLocks.
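
The SIGTERM part is expected while they're stuck there, though:
LWLockAcquire() holds off cancel/die interrupts until the matching
LWLockRelease(), and the semaphore sleep doesn't run CHECK_FOR_INTERRUPTS().
Paraphrasing lwlock.c, not an exact quote:

    bool
    LWLockAcquire(LWLock *lock, LWLockMode mode)
    {
        ...
        /* lock out cancel/die interrupts until LWLockRelease() */
        HOLD_INTERRUPTS();

        for (;;)
        {
            /* try to take the lock; on failure, join the wait queue ... */
            ...
            /* ... and sleep.  die() can only set ProcDiePending here. */
            PGSemaphoreLock(proc->sem);
            ...
        }
        ...
    }

So SIGTERM just sets a flag that nothing services until the lock is acquired
and later released.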

Guessing how this happened did lead me to a bad decision in commit a07e03f,
but I expect fixing that bad decision won't fix the hang you captured. That
commit made index_update_stats() needlessly call RelationGetNumberOfBlocks()
and visibilitymap_count() with a pg_class heap buffer lock held. Both do I/O,
and the latter can exclusive-lock a visibility map buffer. The attached patch
corrects that. Since the hang you captured involved a pg_class heap buffer
lock, I don't think this patch will fix that hang. The other inplace updaters
are free from similar badness.
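
For concreteness, the reordering in index_update_stats() amounts to roughly
this sketch (argument lists elided; the attached patch is authoritative):

    /*
     * Compute stats before locking the pg_class buffer.
     * RelationGetNumberOfBlocks() does I/O, and visibilitymap_count()
     * can exclusive-lock a visibility map buffer.
     */
    relpages = RelationGetNumberOfBlocks(rel);
    if (rel->rd_rel->relkind != RELKIND_INDEX)
        visibilitymap_count(rel, &relallvisible, NULL);
    else
        relallvisible = 0;

    /* only now take the pg_class heap buffer lock for the inplace update */
    systable_inplace_update_begin(pg_class, ClassOidIndexId, ...);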

Attachment Content-Type Size
inplace230-index_update_stats-io-before-buflock-v1.patch text/plain 3.3 KB
