From: Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>
To: michael(at)paquier(dot)xyz
Cc: simseih(at)amazon(dot)com, alvherre(at)alvh(dot)no-ip(dot)org, pgsql-hackers(at)postgresql(dot)org
Subject: Re: [BUG] Panic due to incorrect missingContrecPtr after promotion
Date: 2022-06-28 00:46:27
Message-ID: 20220628.094627.1229111489487982500.horikyota.ntt@gmail.com
Lists: pgsql-hackers
At Mon, 27 Jun 2022 15:02:11 +0900, Michael Paquier <michael(at)paquier(dot)xyz> wrote in
> On Fri, Jun 24, 2022 at 04:17:34PM +0000, Imseih (AWS), Sami wrote:
> > It has been difficult to get a generic repro, but the way we reproduce
> > is through our test suite. To give more details, we are running tests
> > in which we constantly fail over and promote standbys. The issue
> > surfaces after we have gone through a few promotions, which occur
> > every few hours or so (not really important, but to give context).
>
> Hmm. Could you describe exactly the failover scenario you are using?
> Is the test using a set of cascading standbys linked to the promoted
> one? Are the standbys recycled from the promoted nodes with pg_rewind
> or created from scratch with a new base backup taken from the
> freshly-promoted primary? I have been looking more at this thread
> through the day but I don't see a remaining issue. It could be
> perfectly possible that we are missing a piece related to the handling
> of those new overwrite contrecords in some cases, like in a rewind.
>
> > I am adding some additional debugging to see if I can draw a better
> > picture of what is happening. Will also give aborted_contrec_reset_3.patch
> > a go, although I suspect it will not handle the specific case we are dealing with.
>
> Yeah, this is not going to change things much if you are still seeing
> an issue. This patch does not change the logic, aka it just
True.  That is a significant hint about what happened at the time.

- Are there only two hosts in the replication set?  I am wondering
  whether it is a cascading setup or not.

- Exactly what do you perform at every failover?  In particular, do
  the steps include pg_rewind, and do you copy pg_wal and/or archived
  WAL files between the failover hosts?  (A rough sketch of the
  pg_rewind variant I have in mind follows below.)
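To make the question concrete, here is a rough TAP-style sketch, not
your actual suite, of the pg_rewind-based recycling step I am asking
about (node names and settings are made up for illustration):

    use strict;
    use warnings;
    use PostgreSQL::Test::Cluster;
    use PostgreSQL::Test::Utils;
    use Test::More;

    # Initial primary; wal_log_hints is required for pg_rewind unless
    # data checksums are enabled.
    my $old_primary = PostgreSQL::Test::Cluster->new('old_primary');
    $old_primary->init(allows_streaming => 1);
    $old_primary->append_conf('postgresql.conf', 'wal_log_hints = on');
    $old_primary->start;

    # Create a streaming standby from a base backup.
    $old_primary->backup('b1');
    my $new_primary = PostgreSQL::Test::Cluster->new('new_primary');
    $new_primary->init_from_backup($old_primary, 'b1', has_streaming => 1);
    $new_primary->start;

    # Fail over: stop the old primary abruptly and promote the standby.
    $old_primary->stop('immediate');
    $new_primary->promote;

    # Recycle the old primary with pg_rewind so it can rejoin the new
    # primary as a standby, instead of rebuilding it from scratch.
    # (pg_rewind on v13+ brings a crashed target to a clean shutdown
    # state by itself before rewinding.)
    command_ok(
        [
            'pg_rewind',
            '--target-pgdata', $old_primary->data_dir,
            '--source-server', $new_primary->connstr('postgres'),
        ],
        'pg_rewind of the old primary succeeds');
    $old_primary->enable_streaming($new_primary);
    $old_primary->start;

    done_testing();

Whether your steps look like this, or instead recreate the standby from
a fresh base backup, changes which WAL the rejoining node replays around
the aborted continuation record.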
> simplifies the tracking of the continuation record data, resetting it
> when a complete record has been read. Saying that, getting rid of the
> dependency on StandbyMode because we cannot promote in the middle of a
> record is nice (my memories around that were a bit blurry but even
> recovery_target_lsn would not recover in the middle of a continuation
> record), and this is not a bug, so there is limited reason to backpatch
> this part of the change.
Agreed.  In the first place, my "repro" (or the test case) is a bit too
contrived to happen in the field.
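For reference, a minimal TAP-style sketch of the kind of repeated-
promotion loop discussed above (recycling each standby from a fresh
base backup of the newly promoted node; names and the iteration count
are arbitrary) could look like the following:

    use strict;
    use warnings;
    use PostgreSQL::Test::Cluster;
    use PostgreSQL::Test::Utils;
    use Test::More;

    my $primary = PostgreSQL::Test::Cluster->new('primary');
    $primary->init(allows_streaming => 1);
    $primary->start;

    foreach my $round (1 .. 3)
    {
        # Build a fresh streaming standby from a base backup of the
        # current primary.
        $primary->backup("bkp_$round");
        my $standby = PostgreSQL::Test::Cluster->new("standby_$round");
        $standby->init_from_backup($primary, "bkp_$round",
            has_streaming => 1);
        $standby->start;

        # Generate some WAL so that the failover may land in the middle
        # of a record crossing a page boundary.
        $primary->safe_psql('postgres',
              'CREATE TABLE IF NOT EXISTS t (a int);'
            . ' INSERT INTO t SELECT generate_series(1, 1000)');
        $primary->wait_for_catchup($standby);

        # Fail over and make the promoted standby the next primary.
        $primary->stop('immediate');
        $standby->promote;
        $primary = $standby;
    }

    ok(1, 'survived repeated promotions');
    done_testing();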
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center