From: | Heikki Linnakangas <hlinnaka(at)iki(dot)fi> |
---|---|
To: | Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com> |
Cc: | pgsql-hackers(at)postgresql(dot)org, sfrost(at)snowman(dot)net, alexk(at)hintbits(dot)com, michael(dot)paquier(at)gmail(dot)com |
Subject: | Re: pg_rewind race condition just after promotion |
Date: | 2020-12-09 13:35:18 |
Message-ID: | 22b18d17-0556-da06-3e55-b7311ac39d3d@iki.fi |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-hackers |
On 08/12/2020 06:45, Kyotaro Horiguchi wrote:
> At Mon, 7 Dec 2020 20:13:25 +0200, Heikki Linnakangas <hlinnaka(at)iki(dot)fi> wrote in
>> I think we should fix this properly. I'm not sure if it can lead to a
>> broken cluster, but at least it can cause pg_rewind to fail
>> unnecessarily and in a user-unfriendly way. But this is actually
>> pretty simple to fix. pg_rewind looks at the control file to find out
>> the timeline the server is on. When promotion happens, the startup
>> process updates minRecoveryPoint and minRecoveryPointTLI fields in the
>> control file. We just need to read it from there. Patch attached.
>
> Looks fine to me. A bit concerned about making sourceHistory
> needlessly file-local but on the other hand unifying sourceHistory and
> targetHistory looks better.
Looking closer, findCommonAncestorTimeline() was freeing sourceHistory,
which was pretty horrible when it's a file-local variable. I changed it
so that both the source and target histories are passed to
findCommonAncestorTimeline() as arguments. That seems more clear.
> For the test part, that change doesn't necessariry catch the failure
> of the current version, but I *believe* the prevous code is the result
> of an actual failure in the past so the test probablistically (or
> dependently on platforms?) hits the failure if it happned.
Right. I think the current test coverage is good enough. We've been
bitten by this a few times by now, when we've forgotten to add the
manual checkpoint commands to new tests, and the buildfarm has caught it
pretty quickly.
>> I think we should also backpatch this. Back in 2015, we decided that
>> we can live with this, but it's always been a bit bogus, and seems
>> simple enough to fix.
>
> I don't think this changes any successful behavior and it just saves
> the failure case so +1 for back-patching.
Thanks for the review! New patch version attached.
- Heikki
Attachment | Content-Type | Size |
---|---|---|
v2-0001-pg_rewind-Fix-determining-TLI-when-server-was-jus.patch | text/x-patch | 11.0 KB |
From | Date | Subject | |
---|---|---|---|
Next Message | Alvaro Herrera | 2020-12-09 13:56:45 | Re: [bug fix] ALTER TABLE SET LOGGED/UNLOGGED on a partitioned table does nothing silently |
Previous Message | Konstantin Knizhnik | 2020-12-09 13:28:12 | Re: On login trigger: take three |