From: | Andres Freund <andres(at)anarazel(dot)de> |
---|---|
To: | pgsql-hackers(at)postgresql(dot)org, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> |
Subject: | Re: deadlock-hard flakiness |
Date: | 2023-02-08 22:11:45 |
Message-ID: | 20230208221145.bwzhancellclrgia@awork3.anarazel.de |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-hackers |
Hi,
On 2023-02-07 17:10:21 -0800, Andres Freund wrote:
> diff -U3 /tmp/cirrus-ci-build/src/test/isolation/expected/deadlock-hard.out /tmp/cirrus-ci-build/build/testrun/isolation-running/isolation/results/deadlock-hard.out
> --- /tmp/cirrus-ci-build/src/test/isolation/expected/deadlock-hard.out 2023-02-07 05:32:34.536429000 +0000
> +++ /tmp/cirrus-ci-build/build/testrun/isolation-running/isolation/results/deadlock-hard.out 2023-02-07 05:40:33.833908000 +0000
> @@ -25,10 +25,11 @@
> step s6a7: <... completed>
> step s6c: COMMIT;
> step s5a6: <... completed>
> -step s5c: COMMIT;
> +step s5c: COMMIT; <waiting ...>
> step s4a5: <... completed>
> step s4c: COMMIT;
> step s3a4: <... completed>
> +step s5c: <... completed>
> step s3c: COMMIT;
> step s2a3: <... completed>
> step s2c: COMMIT;
>
>
> Commit 741d7f1047f fixed a similar issue in deadlock-hard. But it looks like
> we need something more. But perhaps this isn't an output ordering issue:
>
> How can we end up with s5c getting reported as waiting? I don't see how s5c
> could end up blocking on anything?
After looking through isolationtester's blocking detection logic I started to
suspect that what we're seeing is not being blocked by a heavyweight lock, but
by a snapshot. So I added logging to
pg_isolation_test_session_is_blocked(). Took a while to reproduce the issue,
but indeed:
https://cirrus-ci.com/task/4901334571286528
https://api.cirrus-ci.com/v1/artifact/task/4901334571286528/testrun/build/testrun/isolation-running/isolation/regression.diffs
https://api.cirrus-ci.com/v1/artifact/task/4901334571286528/testrun/build/testrun/runningcheck.log
indicates that we indeed were blocked by a snapshot:
2023-02-08 21:30:12.123 UTC [9276][client backend] [isolation/deadlock-hard/control connection][3/8971:0] LOG: pid 9280 blocked due to snapshot by pid: 0
...
2023-02-08 21:30:12.155 UTC [9276][client backend] [isolation/deadlock-hard/control connection][3/8973:0] LOG: pid 9278 blocked due to snapshot by pid: 0
Unclear why we end up without a pid. It looks like 2PC removes the pid from
the field? In the problematic case the prepared_xacts test is indeed
scheduled concurrently:
2023-02-08 21:30:12.100 UTC [9397][client backend] [pg_regress/prepared_xacts][23/1296:39171] ERROR: transaction identifier "foo3" is already in use
2023-02-08 21:30:12.100 UTC [9397][client backend] [pg_regress/prepared_xacts][23/1296:39171] STATEMENT: PREPARE TRANSACTION 'foo3';
foo3 for example does use SERIALIZABLE.
I don't really understand how GetSafeSnapshotBlockingPids() can end up finding
deadlock-hard's sessions being blocked by a safe snapshot. Afaict nothing uses
serializable in that test. How can SXACT_FLAG_DEFERRABLE_WAITING be set for
the sxact of a backend that never did serializable? Are we possibly forgetting
to clear it or such?
I don't think it should affect the reports here, but I did break something
when removing SHMQueue - GetSafeSnapshotBlockingPids() doesn't check
output_size anymore. Will fix. Thomas, any chance you could do a pass through
96003717645 to see if I screwed up other things? I stared a lot at that
change, but I obviously did miss at least one thing.
Greetings,
Andres Freund
From | Date | Subject | |
---|---|---|---|
Next Message | Andres Freund | 2023-02-08 22:15:28 | Re: deadlock-hard flakiness |
Previous Message | Andrew Dunstan | 2023-02-08 22:09:45 | Re: run pgindent on a regular basis / scripted manner |