From: | James Coleman <jtc331(at)gmail(dot)com> |
---|---|
To: | Bharath Rupireddy <bharath(dot)rupireddyforpostgres(at)gmail(dot)com> |
Cc: | pgsql-hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: pg_rewind: warn when checkpoint hasn't happened after promotion |
Date: | 2022-06-06 12:10:19 |
Message-ID: | CAAaqYe9JxLvqqF=ZGfnqUsw+KBLUu_Rgf37+OtKdR49mhHLZGw@mail.gmail.com |
Lists: | pgsql-hackers |
On Sat, Jun 4, 2022 at 9:39 AM Bharath Rupireddy
<bharath(dot)rupireddyforpostgres(at)gmail(dot)com> wrote:
>
> On Sat, Jun 4, 2022 at 6:29 PM James Coleman <jtc331(at)gmail(dot)com> wrote:
> >
> > A few weeks back I sent a bug report [1] directly to the -bugs mailing
> > list, and I haven't seen any activity on it (maybe this is because I
> > emailed directly instead of using the form?), but I got some time to
> > take a look and concluded that a first-level fix is pretty simple.
> >
> > A quick background refresher: after promoting a standby, rewinding the
> > former primary requires that a checkpoint have been completed on the
> > new primary after promotion. This is correctly documented. However,
> > pg_rewind incorrectly reports to the user that a rewind isn't
> > necessary because the source and target are on the same timeline.
> >
> > Specifically, this happens when the control file on the newly promoted
> > server looks like:
> >
> > Latest checkpoint's TimeLineID: 4
> > Latest checkpoint's PrevTimeLineID: 4
> > ...
> > Min recovery ending loc's timeline: 5
> >
> > Attached is a patch that detects this condition and reports it as an
> > error to the user.
> >
> > In the spirit of the new-ish "ensure shutdown" functionality I could
> > imagine extending this to automatically issue a checkpoint when this
> > situation is detected. I haven't started to code that up, however,
> > wanting to first get buy-in on that.
> >
> > 1: https://www.postgresql.org/message-id/CAAaqYe8b2DBbooTprY4v=BiZEd9qBqVLq+FD9j617eQFjk1KvQ@mail.gmail.com
>
> Thanks. I had a quick look over the issue and patch - just a thought -
> can't we let pg_rewind issue a checkpoint on the new primary instead
> of erroring out, maybe optionally? It might sound like too much, but it
> helps pg_rewind be self-reliant, i.e. it avoids needing an external
> actor to detect the error and issue a checkpoint on the new primary
> before pg_rewind can successfully run on the old primary and repair it
> for use as a new standby.
That's what I had suggested as a "further improvement" option in the
last paragraph :)

But I think agreement on this more basic solution would still be good
(even if I add the automatic checkpointing in this thread). Given that we
currently explicitly misinform the user of pg_rewind, I think this is
a bug that should be considered for backpatching, and the simpler
"fail if detected" patch is probably the only thing we could
backpatch.
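
To make the condition concrete, here is a rough Python sketch (not the
attached patch, which lives in pg_rewind's C code) that checks
pg_controldata-style output for the situation described above. The two
field labels come from the control file excerpt quoted earlier; the
function names are hypothetical, and treating "min recovery timeline
greater than the latest checkpoint's timeline" as the signal is an
assumption about how to express the condition, not a statement of what
the patch does.

```python
def parse_controldata(text):
    """Parse 'label: value' lines (pg_controldata-style text) into a dict."""
    fields = {}
    for line in text.splitlines():
        # Split on the first colon only; values such as timestamps may contain colons.
        label, sep, value = line.partition(":")
        if sep:
            fields[label.strip()] = value.strip()
    return fields


def checkpoint_pending_after_promotion(text):
    """Return True if the control data looks like a promoted server that
    has not yet completed a checkpoint on its new timeline.

    Assumption: that state shows up as the min recovery timeline being
    higher than the latest checkpoint's timeline.
    """
    fields = parse_controldata(text)
    checkpoint_tli = int(fields["Latest checkpoint's TimeLineID"])
    min_recovery_tli = int(fields["Min recovery ending loc's timeline"])
    return min_recovery_tli > checkpoint_tli


if __name__ == "__main__":
    sample = (
        "Latest checkpoint's TimeLineID:       4\n"
        "Latest checkpoint's PrevTimeLineID:   4\n"
        "Min recovery ending loc's timeline:   5\n"
    )
    if checkpoint_pending_after_promotion(sample):
        print("no checkpoint since promotion; run CHECKPOINT on the new primary first")
```

With the values from the example control file (checkpoint timeline 4,
min recovery timeline 5) the check fires, which is the case where
pg_rewind currently reports that no rewind is needed.
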
Thanks for taking a look,
James Coleman