From: | Andres Freund <andres(at)anarazel(dot)de> |
---|---|
To: | Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Fujii Masao <fujii(at)postgresql(dot)org> |
Cc: | pgsql-hackers(at)lists(dot)postgresql(dot)org |
Subject: | Re: Crash in new pgstats code |
Date: | 2022-04-16 21:36:33 |
Message-ID: | 20220416213633.4gfzputl3wbla55p@alap3.anarazel.de |
Lists: | pgsql-hackers |
Hi,
On 2022-04-16 12:13:09 -0700, Andres Freund wrote:
> What confuses me so far is what already had generated stats before
> reaching pgstat_reset_after_failure() (so that the bug could even be hit
> in t/025_stuck_on_old_timeline.pl).
I see part of the problem: archiver stats. Even in 14 (and presumably
before), we do work that can generate archiver stats
(e.g. ReadCheckpointRecord()) before pgstat_reset_all(). It's not the
end of the world, but it doesn't seem great.
But since archiver stats are fixed-numbered stats (and thus not in the
hash table), they'd not trigger the backtrace we saw here.
One thing that's interesting is that the failing tests have:
2022-04-15 12:07:48.828 UTC [675922][walreceiver][:0] FATAL: could not link file "pg_wal/xlogtemp.675922" to "pg_wal/00000002.history": File exists
which I haven't seen locally. Looks like we have some race between the
startup process and walreceiver? That seems not great. I'm a bit
confused that walreceiver and archiving are both active at the same time
in the first place - that doesn't seem right as things are set up
currently.
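For illustration only: "File exists" is the EEXIST error that link(2) reports when the target name is already present. That is the outcome when two processes each write a temporary file and then try to link it to the same final name. The sketch below reproduces that pattern in Python. It is not PostgreSQL's code; the helper names install_via_link and demo are hypothetical, and the two "writers" run one after the other rather than as real concurrent processes.

```python
import os
import tempfile


def install_via_link(tmp_path, final_path):
    """Hypothetical helper: link tmp_path to final_path, then drop tmp_path.

    Returns False if final_path already exists (EEXIST), True otherwise.
    """
    try:
        os.link(tmp_path, final_path)  # raises FileExistsError on EEXIST
    except FileExistsError:
        return False
    finally:
        os.unlink(tmp_path)  # the temporary name is removed either way
    return True


def demo():
    """Two writers install the same final file; only the first link succeeds."""
    with tempfile.TemporaryDirectory() as d:
        final = os.path.join(d, "00000002.history")
        results = []
        for writer in ("startup", "walreceiver"):  # assumed participants
            tmp = os.path.join(d, "xlogtemp." + writer)
            with open(tmp, "w") as f:
                f.write(writer)
            results.append(install_via_link(tmp, final))
        with open(final) as f:
            winner = f.read()
    return results, winner
```

Whichever writer links second gets the error, so if such a race exists, which process fails would depend on timing. That would be consistent with the failure not reproducing locally.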
Greetings,
Andres Freund