Re: Inval reliability, especially for inplace updates

From: Alexander Lakhin <exclusion@gmail.com>
To: Noah Misch <noah@leadboat.com>
Cc: Nitin Motiani <nitinmotiani@google.com>, Andres Freund <andres@anarazel.de>, pgsql-hackers@postgresql.org
Subject: Re: Inval reliability, especially for inplace updates
Date: 2024-10-31 14:00:02
Message-ID: 10ec0bc3-5933-1189-6bb8-5dec4114558e@gmail.com
Lists: pgsql-hackers

Hello Noah,

31.10.2024 04:39, Noah Misch wrote:
> I had pushed this during the indicated week, before your mail. Reverting it
> is an option. Let's see if more opinions arrive.

I've accidentally discovered an incorrect behaviour caused by commit
4eac5a1fa. Running this script:
# Each iteration creates 50 tables and then drops them; the run is
# considered hung if psql does not finish within 60 seconds.
for ((j=1; j<=100; j++)); do
  echo "iteration $j"

  cat << 'EOF' | timeout 60 psql >>psql-$SID.log || { res=1; echo "hanged on iteration $j"; break; }
SELECT format('CREATE TABLE t%s (a int, b text);', g) FROM generate_series(1, 50) g
\gexec

SELECT format('DROP TABLE t%s;', g) FROM generate_series(1, 50) g
\gexec
EOF
done

with
autovacuum = on
autovacuum_naptime = 1s
autovacuum_vacuum_threshold = 1
autovacuum_analyze_threshold = 1

in parallel, with each of 40 instances using a separate server (the full script is attached), like:
parallel -j40 --linebuffer --tag .../reproi.sh ::: `seq 40`

I can catch the following:
...
3       hanged on iteration 51
...
19      hanged on iteration 64
...
39      hanged on iteration 99

And after the script run, I see the server processes hanging:
law      1081433       1  0 16:22 ?        00:00:00 .../usr/local/pgsql/bin/postgres
law      1081452 1081433  0 16:22 ?        00:00:00 postgres: checkpointer
law      1081453 1081433  0 16:22 ?        00:00:00 postgres: background writer
law      1081460 1081433  0 16:22 ?        00:00:00 postgres: walwriter
law      1081462 1081433  0 16:22 ?        00:00:00 postgres: autovacuum launcher
law      1081464 1081433  0 16:22 ?        00:00:00 postgres: logical replication launcher
law      1143065 1081433  0 16:32 ?        00:00:00 postgres: postgres postgres [local] CREATE TABLE
law      1143263 1081433  0 16:32 ?        00:00:00 postgres: autovacuum worker postgres
law      1143320 1081433  0 16:32 ?        00:00:00 postgres: autovacuum worker postgres
law      1143403 1081433  0 16:32 ?        00:00:00 postgres: autovacuum worker
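
For reference, the backtrace below comes from attaching gdb to the stuck CREATE TABLE backend; the same can be done non-interactively:

# dump the stack of the stuck backend without an interactive gdb session
gdb -p 1143065 -batch -ex bt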

Attaching to process 1143065
...
(gdb) bt
#0  __futex_abstimed_wait_common64 (private=<optimized out>, cancel=true, abstime=0x0, op=265, expected=0,
    futex_word=0x7fed9a8171b8) at ./nptl/futex-internal.c:57
#1  __futex_abstimed_wait_common (cancel=true, private=<optimized out>, abstime=0x0, clockid=0, expected=0,
    futex_word=0x7fed9a8171b8) at ./nptl/futex-internal.c:87
#2  __GI___futex_abstimed_wait_cancelable64 (futex_word=futex_word@entry=0x7fed9a8171b8, expected=expected@entry=0,
    clockid=clockid@entry=0, abstime=abstime@entry=0x0, private=<optimized out>) at ./nptl/futex-internal.c:139
#3  0x00007feda4674c5f in do_futex_wait (sem=sem@entry=0x7fed9a8171b8, abstime=0x0, clockid=0)
    at ./nptl/sem_waitcommon.c:111
#4  0x00007feda4674cf8 in __new_sem_wait_slow64 (sem=0x7fed9a8171b8, abstime=0x0, clockid=0) at ./nptl/sem_waitcommon.c:183
#5  0x0000561dd715f1fe in PGSemaphoreLock (sema=0x7fed9a8171b8) at pg_sema.c:327
#6  0x0000561dd722fe02 in LWLockAcquire (lock=0x7fed9ad9b4e4, mode=LW_SHARED) at lwlock.c:1318
#7  0x0000561dd71f8423 in LockBuffer (buffer=36, mode=1) at bufmgr.c:4182
#8  0x0000561dd6d4f506 in heapam_index_fetch_tuple (scan=0x561dd8cb6588, tid=0x561dd8cb64d0, snapshot=0x561dd8bfee28,
    slot=0x561dd8cb75a0, call_again=0x561dd8cb64d6, all_dead=0x7ffdd63842c6) at heapam_handler.c:146
...
(the full backtrace is attached)

All three autovacuum workers (1143263, 1143320, 1143403) are also waiting
for the same buffer lock:
#5  0x0000561dd715f1fe in PGSemaphoreLock (sema=0x7fed9a817338) at pg_sema.c:327
#6  0x0000561dd722fe02 in LWLockAcquire (lock=0x7fed9ad9b4e4, mode=LW_SHARED) at lwlock.c:1318
#7  0x0000561dd71f8423 in LockBuffer (buffer=36, mode=1) at bufmgr.c:4182

This can probably be reproduced with an explicit VACUUM of pg_class/pg_type/... as well, but I
haven't found the exact combination needed yet.
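I imagine a variant like the following (an untested sketch; which catalogs and what
timing are needed is exactly the combination I haven't found):

# Untested sketch: replace autovacuum with explicit VACUUMs of catalogs
# touched by CREATE/DROP TABLE, run concurrently with the create/drop
# loop on the same server.
for ((j=1; j<=100; j++)); do
  psql -c "VACUUM pg_class, pg_type, pg_attribute, pg_depend" >/dev/null &
  cat << 'EOF' | timeout 60 psql >/dev/null || { echo "hanged on iteration $j"; break; }
SELECT format('CREATE TABLE t%s (a int, b text);', g) FROM generate_series(1, 50) g
\gexec
SELECT format('DROP TABLE t%s;', g) FROM generate_series(1, 50) g
\gexec
EOF
  wait
done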

Also, as a side note: these processes can't be terminated with SIGTERM; I
have to kill them.
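
That is, cleanup after a failed run requires something like:

# SIGTERM has no effect on the stuck backends; SIGKILL (which makes the
# postmaster crash-restart the whole cluster) is the only way to remove them:
kill -9 1143065 1143263 1143320 1143403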

Initially I saw this on a slowed-down VM, but with the attached patch
applied, I could reproduce it on my workstation too.

Best regards,
Alexander

Attachment Content-Type Size
reproi.sh application/x-shellscript 1.3 KB
delay-before-lockbuffer.patch text/x-patch 624 bytes
1143065-bt.txt text/plain 6.4 KB
