From: Alexander Lakhin <exclusion(at)gmail(dot)com>
To: Noah Misch <noah(at)leadboat(dot)com>
Cc: Nitin Motiani <nitinmotiani(at)google(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Inval reliability, especially for inplace updates
Date: 2024-10-31 14:00:02
Message-ID: 10ec0bc3-5933-1189-6bb8-5dec4114558e@gmail.com
Lists: pgsql-hackers
Hello Noah,
31.10.2024 04:39, Noah Misch wrote:
> I had pushed this during the indicated week, before your mail. Reverting it
> is an option. Let's see if more opinions arrive.
I've accidentally discovered an incorrect behaviour caused by commit
4eac5a1fa. Running this script:
for ((j=1; j<=100; j++)); do
  echo "iteration $j"
  cat << 'EOF' | timeout 60 psql >>psql-$SID.log || { res=1; echo "hanged on iteration $j"; break; }
SELECT format('CREATE TABLE t%s (a int, b text);', g) FROM generate_series(1, 50) g
\gexec
SELECT format('DROP TABLE t%s;', g) FROM generate_series(1, 50) g
\gexec
EOF
done
with
autovacuum = on
autovacuum_naptime = 1s
autovacuum_vacuum_threshold = 1
autovacuum_analyze_threshold = 1
in parallel using separate servers (the full script is attached), like:
parallel -j40 --linebuffer --tag .../reproi.sh ::: `seq 40`
I can catch the following:
...
3 hanged on iteration 51
...
19 hanged on iteration 64
...
39 hanged on iteration 99
And after the script run, I see the server processes hanging:
law 1081433 1 0 16:22 ? 00:00:00 .../usr/local/pgsql/bin/postgres
law 1081452 1081433 0 16:22 ? 00:00:00 postgres: checkpointer
law 1081453 1081433 0 16:22 ? 00:00:00 postgres: background writer
law 1081460 1081433 0 16:22 ? 00:00:00 postgres: walwriter
law 1081462 1081433 0 16:22 ? 00:00:00 postgres: autovacuum launcher
law 1081464 1081433 0 16:22 ? 00:00:00 postgres: logical replication launcher
law 1143065 1081433 0 16:32 ? 00:00:00 postgres: postgres postgres [local] CREATE TABLE
law 1143263 1081433 0 16:32 ? 00:00:00 postgres: autovacuum worker postgres
law 1143320 1081433 0 16:32 ? 00:00:00 postgres: autovacuum worker postgres
law 1143403 1081433 0 16:32 ? 00:00:00 postgres: autovacuum worker
Attaching to process 1143065
...
(gdb) bt
#0 __futex_abstimed_wait_common64 (private=<optimized out>, cancel=true, abstime=0x0, op=265, expected=0,
futex_word=0x7fed9a8171b8) at ./nptl/futex-internal.c:57
#1 __futex_abstimed_wait_common (cancel=true, private=<optimized out>, abstime=0x0, clockid=0, expected=0,
futex_word=0x7fed9a8171b8) at ./nptl/futex-internal.c:87
#2 __GI___futex_abstimed_wait_cancelable64 (futex_word=futex_word(at)entry=0x7fed9a8171b8, expected=expected(at)entry=0,
clockid=clockid(at)entry=0, abstime=abstime(at)entry=0x0, private=<optimized out>) at ./nptl/futex-internal.c:139
#3 0x00007feda4674c5f in do_futex_wait (sem=sem(at)entry=0x7fed9a8171b8, abstime=0x0, clockid=0) at
./nptl/sem_waitcommon.c:111
#4 0x00007feda4674cf8 in __new_sem_wait_slow64 (sem=0x7fed9a8171b8, abstime=0x0, clockid=0) at ./nptl/sem_waitcommon.c:183
#5 0x0000561dd715f1fe in PGSemaphoreLock (sema=0x7fed9a8171b8) at pg_sema.c:327
#6 0x0000561dd722fe02 in LWLockAcquire (lock=0x7fed9ad9b4e4, mode=LW_SHARED) at lwlock.c:1318
#7 0x0000561dd71f8423 in LockBuffer (buffer=36, mode=1) at bufmgr.c:4182
#8 0x0000561dd6d4f506 in heapam_index_fetch_tuple (scan=0x561dd8cb6588, tid=0x561dd8cb64d0, snapshot=0x561dd8bfee28,
slot=0x561dd8cb75a0, call_again=0x561dd8cb64d6, all_dead=0x7ffdd63842c6) at heapam_handler.c:146
...
(the full backtrace is attached)
All three autovacuum workers (1143263, 1143320, 1143403) are also waiting
for the same buffer lock:
#5 0x0000561dd715f1fe in PGSemaphoreLock (sema=0x7fed9a817338) at pg_sema.c:327
#6 0x0000561dd722fe02 in LWLockAcquire (lock=0x7fed9ad9b4e4, mode=LW_SHARED) at lwlock.c:1318
#7 0x0000561dd71f8423 in LockBuffer (buffer=36, mode=1) at bufmgr.c:4182
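For anyone digging further, the stuck lock can be examined in place; a minimal
sketch, assuming a build with debug symbols and using the LWLock address
0x7fed9ad9b4e4 taken from the backtraces above:

# Dump the state word of the contested LWLock from one of the waiters.
# If LW_VAL_EXCLUSIVE (0x01000000) is set in the state word, some backend
# acquired the lock exclusively and never released it.
gdb -p 1143263 -batch \
  -ex 'print/x ((LWLock *) 0x7fed9ad9b4e4)->state.value'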
Probably, this can be reproduced with VACUUM pg_class/pg_type/..., but I
haven't found the exact combination needed yet.
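For reference, a manual variant of what I have in mind would look something
like this (an untested sketch; the table names are arbitrary, and it likely
also needs an injected delay, as in the attached patch, to widen the race
window):

for ((i=1; i<=1000; i++)); do
  # VACUUM the affected catalogs concurrently with catalog churn from DDL
  psql -c "VACUUM pg_class, pg_type, pg_attribute" &
  psql -c "CREATE TABLE tv$i (a int, b text); DROP TABLE tv$i;"
  wait
done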
Also, as a side note, these processes can't be terminated with SIGTERM; I
have to kill them with SIGKILL.
Initially I saw this on a slowed-down VM, but with the attached patch applied
I could reproduce it on my workstation too.
Best regards,
Alexander
Attachments:
  reproi.sh (application/x-shellscript, 1.3 KB)
  delay-before-lockbuffer.patch (text/x-patch, 624 bytes)
  1143065-bt.txt (text/plain, 6.4 KB)