From: Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>
To: hlinnaka(at)iki(dot)fi
Cc: pgsql-hackers(at)postgresql(dot)org, sfrost(at)snowman(dot)net, alexk(at)hintbits(dot)com, michael(dot)paquier(at)gmail(dot)com
Subject: Re: pg_rewind race condition just after promotion
Date: 2020-12-08 04:45:33
Message-ID: 20201208.134533.202575181178574697.horikyota.ntt@gmail.com
Lists: pgsql-hackers
At Mon, 7 Dec 2020 20:13:25 +0200, Heikki Linnakangas <hlinnaka(at)iki(dot)fi> wrote in
> There's a race condition between the checkpoint at promotion and
> pg_rewind. When a server is promoted, the startup process writes an
> end-of-recovery checkpoint that includes the new TLI, and the server
> is immediately opened for business. The startup process requests the
> checkpointer process to perform a checkpoint, but it can take a few
> seconds or more to complete. If you run pg_rewind, using the just
> promoted server as the source, pg_rewind will think that the server is
> still on the old timeline, because it only looks at TLI in the control
> file's copy of the checkpoint record. That's not updated until the
> checkpoint is finished.
>
> This isn't a new issue. Stephen Frost first reported it back in 2015
> [1]. Back then, it was deemed just a small annoyance, and we just
> worked around it in the tests by issuing a checkpoint command after
> promotion, to wait for the checkpoint to finish. I just ran into it
> again today, with the new pg_rewind test, and silenced it in a
> similar way.
I (or we) have faced this too, and avoided it by checking for the
history file on the primary.
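
For the record, here is a minimal sketch of that workaround (illustrative
only: the connection string, the "00000002.history" file name, and the use
of pg_ls_waldir() are my assumptions, and pg_ls_waldir() needs superuser or
pg_monitor privileges): poll the promoted server until the new timeline
history file appears in pg_wal, and only then invoke pg_rewind. The tests
instead issue a CHECKPOINT on the promoted node to wait for the
end-of-recovery checkpoint to finish.

/*
 * Illustrative workaround sketch, not project code: wait until the
 * promoted server's new timeline history file shows up in pg_wal.
 * Assumes libpq and PostgreSQL >= 10 (for pg_ls_waldir()); the
 * connection string and "00000002.history" are placeholders.
 */
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <libpq-fe.h>

static int
history_file_exists(PGconn *conn)
{
    PGresult   *res;
    int         found = 0;

    res = PQexec(conn,
                 "SELECT count(*) FROM pg_ls_waldir() "
                 "WHERE name = '00000002.history'");
    if (PQresultStatus(res) == PGRES_TUPLES_OK && PQntuples(res) == 1)
        found = (strcmp(PQgetvalue(res, 0, 0), "0") != 0);
    PQclear(res);
    return found;
}

int
main(void)
{
    PGconn     *conn = PQconnectdb("host=primary dbname=postgres");

    if (PQstatus(conn) != CONNECTION_OK)
    {
        fprintf(stderr, "connection failed: %s", PQerrorMessage(conn));
        return 1;
    }
    while (!history_file_exists(conn))
        sleep(1);               /* poll until the promotion is visible */
    PQfinish(conn);
    return 0;                   /* now safe(r) to run pg_rewind */
}
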
> I think we should fix this properly. I'm not sure if it can lead to a
> broken cluster, but at least it can cause pg_rewind to fail
> unnecessarily and in a user-unfriendly way. But this is actually
> pretty simple to fix. pg_rewind looks at the control file to find out
> the timeline the server is on. When promotion happens, the startup
> process updates minRecoveryPoint and minRecoveryPointTLI fields in the
> control file. We just need to read it from there. Patch attached.
Looks fine to me. I'm a bit concerned about making sourceHistory
needlessly file-local, but on the other hand, unifying sourceHistory
and targetHistory looks better.
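
To make that concrete, here is how I read the patch's idea, as a
standalone sketch (the trimmed-down structs and the chooseSourceTimeline()
helper are mine, not the patch; the real ControlFileData lives in
src/include/catalog/pg_control.h): rather than trusting only the
checkpoint copy of the TLI, also consult minRecoveryPointTLI, which the
startup process updates at promotion, before the end-of-recovery
checkpoint completes.

/*
 * Sketch only, not the actual patch: choose the source server's
 * timeline from the control file.  The structs below keep just the
 * two fields that matter for the race described above.
 */
#include <stdint.h>

typedef uint32_t TimeLineID;

typedef struct
{
    TimeLineID  ThisTimeLineID;         /* TLI of the last checkpoint */
} CheckPointStub;

typedef struct
{
    CheckPointStub checkPointCopy;      /* copy of the checkpoint record;
                                         * stale until the promotion
                                         * checkpoint completes */
    TimeLineID     minRecoveryPointTLI; /* updated by the startup process
                                         * at promotion */
} ControlFileStub;

/*
 * pg_rewind used to look only at checkPointCopy.ThisTimeLineID.
 * Taking the newer of the two TLIs makes a just-promoted source
 * report its new timeline immediately.
 */
static TimeLineID
chooseSourceTimeline(const ControlFileStub *cf)
{
    return (cf->minRecoveryPointTLI > cf->checkPointCopy.ThisTimeLineID)
        ? cf->minRecoveryPointTLI
        : cf->checkPointCopy.ThisTimeLineID;
}
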
For the test part, that change doesn't necessarily catch the failure
of the current version, but I *believe* the previous code is the result
of an actual failure in the past, so the test would probabilistically
(or depending on the platform?) hit the failure if it happened.
> I think we should also backpatch this. Back in 2015, we decided that
> we can live with this, but it's always been a bit bogus, and seems
> simple enough to fix.
I don't think this changes any successful behavior; it only rescues
the failure case, so +1 for back-patching.
> Thoughts?
>
> [1]
> https://www.postgresql.org/message-id/20150428180253.GU30322%40tamriel.snowman.net
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center