Re: BUG #17257: (auto)vacuum hangs within lazy_scan_prune()

From: Alexander Lakhin <exclusion(at)gmail(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: PostgreSQL mailing lists <pgsql-bugs(at)lists(dot)postgresql(dot)org>
Subject: Re: BUG #17257: (auto)vacuum hangs within lazy_scan_prune()
Date: 2024-04-07 10:00:00
Message-ID: 5cbe0b03-d6f3-501d-3849-534568b0e776@gmail.com
Lists: pgsql-bugs

Hi Robert,

05.04.2024 23:20, Robert Haas wrote:
> On Fri, Oct 29, 2021 at 9:30 AM Alexander Lakhin <exclusion(at)gmail(dot)com> wrote:
>> I can propose the debugging patch to reproduce the issue that replaces
>> the hang with the assert and modifies a pair of crash-causing test
>> scripts to simplify the reproducing. (Sorry, I have no time now to prune
>> down the scripts further as I have to leave for a week.)
> Just FYI, I tried to reproduce this today on v16, using this formula,
> with some hacking around to try to get it working on my MacBook, and I
> couldn't get it to crash.

I've refreshed the script and simplified it a bit so that it no longer
relies on Linux specifics. This works for me (on REL_14_0, with the patch
applied, CPPFLAGS="-O0" ./configure --enable-debug --enable-cassert ...):
echo "
autovacuum=off
fsync=off
" >> "$PGDATA/postgresql.conf"

pg_ctl -w -l server.log start

export PGDATABASE=regression
createdb regression

echo "
vacuum (verbose, skip_locked, index_cleanup off) pg_catalog.pg_class;
select pg_sleep(random()/50);
" >/tmp/17257/pseudo-autovacuum.sql

pgbench -n -f /tmp/17257/inherit.sql -C -T 1200 >pgbench-1.log 2>&1 &
pgbench -n -f /tmp/17257/vacuum.sql -C -T 1200 >pgbench-2.log 2>&1 &
pgbench -n -f /tmp/17257/pseudo-autovacuum.sql -C -c 10 -T 1200 >pgbench-3.log 2>&1 &
wait
grep -E "(TRAP|terminated)" server.log

(Please use the attached inherit.sql and vacuum.sql, which are excerpts from
src/test/regress/sql/{inherit,vacuum}.sql.)
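
To record the time to first failure without sitting through the whole
1200-second run, the final "wait" and "grep" could be replaced by a polling
loop. The following is only a sketch: the function name wait_for_crash and its
arguments are hypothetical and not part of the script above, and it assumes a
"date" that supports +%s (true on Linux and macOS).

```shell
# Hypothetical helper, not part of the original script.
# Polls LOGFILE once per second until a crash signature (TRAP or
# "terminated") shows up, then prints the elapsed seconds and returns 0.
# Returns 1 if nothing is found within TIMEOUT seconds (default 1200).
wait_for_crash() {
    log=$1
    timeout=${2:-1200}
    start=$(date +%s)
    while :; do
        # Same pattern as the grep at the end of the reproduction script.
        if grep -E -q "(TRAP|terminated)" "$log" 2>/dev/null; then
            echo "$(( $(date +%s) - start ))"
            return 0
        fi
        if [ "$(( $(date +%s) - start ))" -ge "$timeout" ]; then
            return 1
        fi
        sleep 1
    done
}
```

It would be called as "wait_for_crash server.log 1200" after the three pgbench
instances have been started in the background.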

With PGDATA placed on tmpfs, this script failed for me after 1m31s,
2m35s, 4m12s:
TRAP: FailedAssertion("numretries < 100", File: "vacuumlazy.c", Line: 1726, PID: 951498)

Another possible outcome:
TRAP: FailedAssertion("relid == targetRelId", File: "relcache.c", Line: 1062, PID: 1257766)

And also:
2024-04-07 05:03:21.656 UTC [2905313] LOG:  server process (PID 2984687) was terminated by signal 6: Aborted
2024-04-07 05:03:21.656 UTC [2905313] DETAIL:  Failed process was running: create table matest0 (id serial primary key,
name text);
With the stack trace:
...
#4  0x00007fc30b4007f3 in __GI_abort () at ./stdlib/abort.c:79
#5  0x0000559f50220719 in index_delete_sort_cmp (deltid1=0x559f523a9f40, deltid2=0x7ffd2f9623f8) at heapam.c:7582
#6  0x0000559f50220847 in index_delete_sort (delstate=0x7ffd2f9636f0) at heapam.c:7623
...
(as in [1])
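
Since a run can end in any of the three ways shown above, it may help to
tally them per log when repeating the test. A minimal sketch, assuming the
signatures appear in server.log exactly as quoted here; the function name
summarize_failures is made up for illustration:

```shell
# Hypothetical helper, not part of the original script.
# Prints "<count><TAB><signature>" for each failure signature seen in
# this report, counting matching lines in the given log file.
summarize_failures() {
    log=$1
    for pat in 'numretries < 100' 'relid == targetRelId' 'terminated by signal 6'; do
        # grep -c exits non-zero when there are no matches (or the file is
        # missing), so fall back to 0 explicitly.
        n=$(grep -c -F -- "$pat" "$log" 2>/dev/null) || n=0
        printf '%s\t%s\n' "$n" "$pat"
    done
}
```

For example, "summarize_failures server.log" after each run gives a quick
picture of which outcome was hit.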

But on dad1539ae I got no failures in 3 runs (the same holds on
REL_16_STABLE with a slightly modified lazy_scan_prune patch).

[1] https://www.postgresql.org/message-id/17255-14c0ac58d0f9b583%40postgresql.org

Best regards,
Alexander

Attachment                            Content-Type     Size
inherit.sql                           application/sql  17.4 KB
vacuum.sql                            application/sql  5.5 KB
assert-in-lazy_scan_prune-loop.patch  text/x-patch     581 bytes
