deadlock-soft isolation test is failing on newest Fedora

From: Alexander Lakhin <exclusion(at)gmail(dot)com>
To: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: deadlock-soft isolation test is failing on newest Fedora
Date: 2024-11-17 18:00:01
Message-ID: b25b15cd-a1b5-c0f1-3cc9-514acc3c23a6@gmail.com

Hello hackers,

I'd like to bring your attention to an interesting failure produced by
caiman twice: [1] and [2].

As far as I can see, this animal runs on Fedora and gets OS updates on a
daily basis. At the time of the first failure it had kernel
6.12.0-0.rc6.20241108git906bd684e4b1.55.fc42.x86_64, and at the time of the
second, 6.12.0-0.rc7.59.fc42.x86_64.

I've created a VirtualBox (7.1.2) VM with Fedora 42 from
Fedora-Server-dvd-x86_64-Rawhide-20241114.n.0.iso and reproduced the issue
locally. On a fully updated OS, with kernel 6.12.0-0.rc7.59.fc42.x86_64
(I also tried 6.11.0-63.fc42.x86_64 and got the same failures), I run
`make check -C src/test/isolation` in a loop, with isolation_schedule
containing 100 instances of the deadlock-soft test (a rough sketch of the
loop follows the output below), and see, for instance:
ITERATION 81
...
ok 47        - deadlock-soft                             279 ms
not ok 48    - deadlock-soft                          728711 ms
ok 49        - deadlock-soft                            1775 ms

regression.diffs is the same as in the caiman logs.
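
For reference, here is a rough sketch of that loop (the paths, iteration
count, and schedule contents are illustrative, not the exact commands I ran):

# assumes an already configured and built source tree
cd /path/to/postgresql
# overwrite the schedule with 100 copies of the test
printf 'test: deadlock-soft\n%.0s' $(seq 100) > src/test/isolation/isolation_schedule
# run the suite repeatedly, stopping at the first failing iteration
for i in $(seq 1 200); do
    echo "ITERATION $i"
    make check -C src/test/isolation || break
done
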
(Unfortunately, unlike pg_regress.c, the buildfarm client doesn't enable
log_lock_waits (or log_autovacuum_min_duration) for installcheck stages,
so those installcheck logs (inst/logfile) don't contain helpful information
for this case.)
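
One could enable those GUCs by hand on an installcheck instance with
something like the following (just a sketch, assuming superuser access):

# enable lock-wait and autovacuum logging on a running instance
psql -d postgres -c "ALTER SYSTEM SET log_lock_waits = on;"
psql -d postgres -c "ALTER SYSTEM SET log_autovacuum_min_duration = 0;"
psql -d postgres -c "SELECT pg_reload_conf();"
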
But locally I can see the corresponding lines in postmaster.log:
...
2024-11-17 10:47:47.558 GMT [17448:12] isolation/deadlock-soft/d2 LOG:  process 17448 avoided deadlock for
AccessShareLock on relation 16661 of database 16384 by rearranging queue order after 20.549 ms
2024-11-17 10:47:58.019 GMT [17457:11] isolation/deadlock-soft/e1 LOG:  process 17457 avoided deadlock for
AccessExclusiveLock on relation 16667 of database 16384 by rearranging queue order after 10297.867 ms
2024-11-17 10:59:58.112 GMT [17502:12] isolation/deadlock-soft/d2 LOG:  process 17502 avoided deadlock for
AccessShareLock on relation 16673 of database 16384 by rearranging queue order after 1597.951 ms
...
(That is, the test failed when it was e1, not d2, that avoided the deadlock;
I also saw such failures with d1.)

And another case (produced on Fedora 41, with 6.11.7-300.fc41.x86_64):
2024-11-17 15:05:55.825 GMT [6281:8] isolation/deadlock-soft/e1 LOG:  process 6281 avoided deadlock for
AccessExclusiveLock on relation 16943 of database 16384 by rearranging queue order after 13092.228 ms
2024-11-17 15:05:55.825 GMT [6281:9] isolation/deadlock-soft/e1 DETAIL:  Processes holding the lock: 6279, 6280. Wait
queue: 6281.
2024-11-17 15:05:55.825 GMT [6281:10] isolation/deadlock-soft/e1 STATEMENT:  LOCK TABLE a1 IN ACCESS EXCLUSIVE MODE;
2024-11-17 15:05:55.825 GMT [6281:11] isolation/deadlock-soft/e1 LOG:  process 6281 still waiting for
AccessExclusiveLock on relation 16943 of database 16384 after 13092.228 ms
2024-11-17 15:05:55.825 GMT [6281:12] isolation/deadlock-soft/e1 DETAIL:  Processes holding the lock: 6279, 6280. Wait
queue: 6281.
2024-11-17 15:05:55.825 GMT [6281:13] isolation/deadlock-soft/e1 STATEMENT:  LOCK TABLE a1 IN ACCESS EXCLUSIVE MODE;
2024-11-17 15:06:22.932 GMT [6282:8] isolation/deadlock-soft/e2 LOG:  process 6282 still waiting for AccessExclusiveLock
on relation 16946 of database 16384 after 32320.210 ms
2024-11-17 15:06:22.932 GMT [6282:9] isolation/deadlock-soft/e2 DETAIL:  Process holding the lock: 6280. Wait queue:
6282, 6279.
2024-11-17 15:06:22.932 GMT [6282:10] isolation/deadlock-soft/e2 STATEMENT:  LOCK TABLE a2 IN ACCESS EXCLUSIVE MODE;
2024-11-17 15:07:06.540 GMT [6279:9] isolation/deadlock-soft/d1 LOG:  process 6279 still waiting for AccessShareLock on
relation 16946 of database 16384 after 75890.853 ms
2024-11-17 15:07:06.540 GMT [6279:10] isolation/deadlock-soft/d1 DETAIL:  Process holding the lock: 6280. Wait queue:
6282, 6279.
2024-11-17 15:07:06.540 GMT [6279:11] isolation/deadlock-soft/d1 STATEMENT:  LOCK TABLE a2 IN ACCESS SHARE MODE;
2024-11-17 15:10:01.258 GMT [5524:1] LOG:  checkpoint starting: time
2024-11-17 15:11:50.775 GMT [6295:1] [unknown] LOG:  connection received: host=[local]
2024-11-17 15:11:50.775 GMT [6279:12] isolation/deadlock-soft/d1 ERROR:  canceling statement due to user request
2024-11-17 15:11:50.775 GMT [6279:13] isolation/deadlock-soft/d1 STATEMENT:  LOCK TABLE a2 IN ACCESS SHARE MODE;
...

This failure was accompanied by the following report in the system journal:
Nov 17 15:07:01 localhost.localdomain kernel: rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
Nov 17 15:07:01 localhost.localdomain kernel: rcu:      0-...!: (2 ticks this GP) idle=5f58/0/0x0 softirq=7659/7660
fqs=1 (false positive?)
Nov 17 15:07:01 localhost.localdomain kernel: rcu:      2-...!: (1 GPs behind) idle=caa0/0/0x0 softirq=6821/6822 fqs=1
(false positive?)
Nov 17 15:07:01 localhost.localdomain kernel: rcu:      (detected by 8, t=60030 jiffies, g=26137, q=80 ncpus=12)
Nov 17 15:07:01 localhost.localdomain kernel: Sending NMI from CPU 8 to CPUs 0:
...

I saw such messages on Fedora 42 too, and it looks like there is no 1:1
correlation between these CPU stall reports and the test failures. Most
likely, a stall is reported only when it lasts more than 60 seconds, but the
test can fail even with a shorter stall; on the other hand, even with a long
stall the test can pass, depending on which session resolves the soft
deadlock.
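
A simple way to check whether a stall report coincided with a failing
iteration (the command is just an illustration):

# scan recent kernel messages for RCU stall reports
journalctl -k --since "1 hour ago" | grep -i "rcu.*stall"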

So this looks like a kernel/OS issue, but I wonder: are these alternative
soft-deadlock resolutions legal in principle? And if so, maybe the test
should expect them...
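
If they are legal, one option might be a variant expected file, which
pg_regress-based runners compare against as well (a sketch, assuming that
mechanism applies to isolation tests too):

cd src/test/isolation
# add a second accepted output for runs where e1 (or d1) resolves the deadlock
cp expected/deadlock-soft.out expected/deadlock-soft_1.out
# then edit deadlock-soft_1.out to match the alternative resolution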

[1] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=caiman&dt=2024-11-11%2016%3A02%3A04
[2] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=caiman&dt=2024-11-16%2004%3A24%3A24

Best regards,
Alexander
