Re: Fix primary crash continually with invalid checkpoint after promote

From: Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>
To: tgl(at)sss(dot)pgh(dot)pa(dot)us
Cc: 875941708(at)qq(dot)com, pgsql-hackers(at)lists(dot)postgresql(dot)org, masao(dot)fujii(at)oss(dot)nttdata(dot)com, nathandbossart(at)gmail(dot)com
Subject: Re: Fix primary crash continually with invalid checkpoint after promote
Date: 2022-04-27 02:24:11
Message-ID: 20220427.112411.551209151727752749.horikyota.ntt@gmail.com
Lists: pgsql-bugs pgsql-hackers

At Tue, 26 Apr 2022 15:47:13 -0400, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote in
> "Zhao Rui" <875941708(at)qq(dot)com> writes:
> > Newly promoted primary may leave an invalid checkpoint.
> > In function CreateRestartPoint, control file is updated and old wals are removed. But in some situations, control file is not updated, old wals are still removed. Thus produces an invalid checkpoint with nonexistent wal. Crucial log: "invalid primary checkpoint record", "could not locate a valid checkpoint record".
>
> I believe this is the same issue being discussed here:
>
> https://www.postgresql.org/message-id/flat/20220316.102444.2193181487576617583.horikyota.ntt%40gmail.com
>
> but Horiguchi-san's proposed fix looks quite different from yours.

The root cause is that CreateRestartPoint fails to update the last
checkpoint in the control file if archive recovery ends at an
unfortunate time. My proposal fixes that root cause.
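
To make the failure mode concrete, here is a rough toy model of the
sequence described above. It is only a sketch, not PostgreSQL code:
ToyCluster, toy_restartpoint, the integer segment numbers and the
update_control_unconditionally flag are all hypothetical names invented
for illustration, and the "fixed" branch is merely one reading of
"always record the restartpoint", not the actual patch.

```python
from dataclasses import dataclass, field


@dataclass
class ToyCluster:
    """Toy stand-in for a standby; not PostgreSQL's real data structures."""
    wal_segments: set = field(default_factory=set)  # WAL segment numbers on disk
    control_checkpoint: int = 0    # segment holding the checkpoint named in the control file
    in_archive_recovery: bool = True


def toy_restartpoint(cluster, new_checkpoint_seg, update_control_unconditionally):
    """Model a restartpoint: record the new checkpoint, then remove old WAL."""
    # Assumed condition: the control file is only updated while the server
    # still considers itself to be in archive recovery.
    if cluster.in_archive_recovery or update_control_unconditionally:
        cluster.control_checkpoint = new_checkpoint_seg
    # Old segments are removed whether or not the control file was updated.
    cluster.wal_segments = {
        seg for seg in cluster.wal_segments if seg >= new_checkpoint_seg
    }


def can_start_crash_recovery(cluster):
    """Crash recovery needs the WAL that holds the recorded checkpoint."""
    return cluster.control_checkpoint in cluster.wal_segments


def simulate(update_control_unconditionally):
    """Promotion lands while a restartpoint at segment 4 is in progress."""
    cluster = ToyCluster(wal_segments={1, 2, 3, 4, 5}, control_checkpoint=2)
    cluster.in_archive_recovery = False  # archive recovery ends (promotion)
    toy_restartpoint(cluster, 4, update_control_unconditionally)
    return can_start_crash_recovery(cluster)
```

In this model simulate(False) leaves the control file pointing at
segment 2 after segments 1-3 are gone, which corresponds to the
"could not locate a valid checkpoint record" situation, while
simulate(True) keeps the control file and the remaining WAL consistent.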

Zhao Rui's proposal is to retain WAL files according to (the wrong
content of) the control file.

Aside from the fact that it may let slots be invalidated earlier, it's
not great that an actually performed restartpoint is forgotten, which
may cause the next crash recovery to start from an already performed
checkpoint.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center
