Re: prevent immature WAL streaming

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org>
Cc: Andres Freund <andres(at)anarazel(dot)de>, Andrew Dunstan <andrew(at)dunslane(dot)net>, "Bossart, Nathan" <bossartn(at)amazon(dot)com>, Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>, "masao(dot)fujii(at)oss(dot)nttdata(dot)com" <masao(dot)fujii(at)oss(dot)nttdata(dot)com>, "pgsql-hackers(at)lists(dot)postgresql(dot)org" <pgsql-hackers(at)lists(dot)postgresql(dot)org>, "mengjuan(dot)cmj(at)alibaba-inc(dot)com" <mengjuan(dot)cmj(at)alibaba-inc(dot)com>, "Jakub(dot)Wartak(at)tomtom(dot)com" <Jakub(dot)Wartak(at)tomtom(dot)com>, Ryo Matsumura <matsumura(dot)ryo(at)fujitsu(dot)com>
Subject: Re: prevent immature WAL streaming
Date: 2021-11-23 19:04:19
Message-ID: 45597.1637694259@sss.pgh.pa.us
Lists: pgsql-hackers

We're *still* not out of the woods with 026_overwrite_contrecord.pl.
We continue to see occasional "mismatching overwritten LSN" failures
further down in the test, where it tries to start up the standby:

 sysname    | branch        | snapshot            | stage         | l
------------+---------------+---------------------+---------------+------------------------------------------------------------------------------------------------------------
 spurfowl   | REL_13_STABLE | 2021-10-18 03:56:26 | recoveryCheck | 2021-10-18 00:08:09.324 EDT [2455:6] FATAL: mismatching overwritten LSN 0/1FFE018 -> 0/1FFE000
 sidewinder | HEAD          | 2021-10-19 04:32:36 | recoveryCheck | 2021-10-19 06:46:23.168 CEST [26393:6] FATAL: mismatching overwritten LSN 0/1FFE018 -> 0/1FFE000
 francolin  | REL9_6_STABLE | 2021-10-26 01:41:39 | recoveryCheck | 2021-10-26 01:48:05.646 UTC [3417202][][1/0:0] FATAL: mismatching overwritten LSN 0/1FFE018 -> 0/1FFE000
 petalura   | HEAD          | 2021-11-05 00:20:03 | recoveryCheck | 2021-11-05 02:58:12.146 CET [61848fb3.28d157:6] FATAL: mismatching overwritten LSN 0/1FFE018 -> 0/1FFE000
 lapwing    | REL_11_STABLE | 2021-11-05 17:24:49 | recoveryCheck | 2021-11-05 17:39:29.741 UTC [9831:6] FATAL: mismatching overwritten LSN 0/1FFE014 -> 0/1FFE000
 morepork   | HEAD          | 2021-11-10 02:51:12 | recoveryCheck | 2021-11-10 04:03:33.576 CET [73561:6] FATAL: mismatching overwritten LSN 0/1FFE018 -> 0/1FFE000
 petalura   | HEAD          | 2021-11-16 15:20:03 | recoveryCheck | 2021-11-16 18:16:47.875 CET [6193e77f.35b87f:6] FATAL: mismatching overwritten LSN 0/1FFE018 -> 0/1FFE000
 morepork   | HEAD          | 2021-11-17 03:45:36 | recoveryCheck | 2021-11-17 04:57:04.359 CET [32089:6] FATAL: mismatching overwritten LSN 0/1FFE018 -> 0/1FFE000
 spurfowl   | REL_10_STABLE | 2021-11-22 22:21:03 | recoveryCheck | 2021-11-22 17:29:35.520 EST [16011:6] FATAL: mismatching overwritten LSN 0/1FFE018 -> 0/1FFE000
(9 rows)

Looking at adjacent successful runs, it seems that the exact point
where the "missing contrecord" starts varies substantially, even after
our previous fix to disable autovacuum in this test. How could that be?

It's probably for the best though, because I think this is exposing
an actual bug that we would not have seen if the start point were
completely consistent. I have not dug into the code, but it looks to
me as though, if the "consistent recovery state" is reached exactly at
a page boundary (0/1FFE000 in all these cases), the standby expects
the OVERWRITE_CONTRECORD record to point at that boundary. But the
record actually points to the first WAL record on that page
(0/1FFE018, or 0/1FFE014 on lapwing), resulting in a bogus failure.
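
To make the arithmetic behind that guess concrete, here is a small
Python sketch (not PostgreSQL code) that takes the two LSNs out of a
logged message and checks where each falls within its WAL page. It
assumes the default 8 kB WAL page size; XLOG_BLCKSZ, parse_lsn and
analyze are names chosen for this illustration only.

```python
import re

XLOG_BLCKSZ = 8192  # assumption: default 8 kB WAL page size


def parse_lsn(text):
    """Convert an 'X/X' LSN string into a single integer byte position."""
    hi, lo = text.split("/")
    return (int(hi, 16) << 32) | int(lo, 16)


def analyze(message):
    """Return (offset of the first LSN within its page,
    whether the second LSN is page-aligned,
    whether both LSNs lie on the same page)."""
    first, second = re.search(r"LSN (\S+) -> (\S+)", message).groups()
    a, b = parse_lsn(first), parse_lsn(second)
    return (
        a % XLOG_BLCKSZ,
        b % XLOG_BLCKSZ == 0,
        a // XLOG_BLCKSZ == b // XLOG_BLCKSZ,
    )


# The two distinct messages seen in the buildfarm reports above.
print(analyze("FATAL: mismatching overwritten LSN 0/1FFE018 -> 0/1FFE000"))
# -> (24, True, True)
print(analyze("FATAL: mismatching overwritten LSN 0/1FFE014 -> 0/1FFE000"))
# -> (20, True, True)
```

In every report the second LSN sits exactly on a page boundary and the
first one is 24 bytes (20 on lapwing) into the same page. That would be
consistent with one value being the page start and the other the first
record after the page header, though the log itself does not say what
occupies those leading bytes.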

regards, tom lane
