Re: Assertion failure with summarize_wal enabled during pg_createsubscriber

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Michael Paquier <michael(at)paquier(dot)xyz>
Cc: Fujii Masao <masao(dot)fujii(at)oss(dot)nttdata(dot)com>, Postgres hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Assertion failure with summarize_wal enabled during pg_createsubscriber
Date: 2024-07-03 17:07:11
Message-ID: CA+TgmobLaJTxCHgdh04rfsUMEhP_ceDbiF0M=gtw5jG4q_zPbg@mail.gmail.com
Lists: pgsql-hackers

On Mon, Jul 1, 2024 at 2:08 AM Michael Paquier <michael(at)paquier(dot)xyz> wrote:
> Nope. So, Open Item, here we go.

Some initial investigation:

In this test case, pg_createsubscriber, during the "starting the
subscriber" section of its work, starts up the database in the "sub"
directory as a standby. It enters standby mode, begins redo, and is
then promoted, selecting timeline 2. The WAL summarizer is supposed to
end summarization at the point at which timeline 1 ended and then
resume summarizing from the beginning of timeline 2. But instead, it
fails an assertion:

Assert(switchpoint >= state->EndRecPtr);

This assertion is trying to verify that, when a new timeline is
spawned, we don't read past the switchpoint on the original timeline.
Here, we have apparently done that. In one test, I got switchpoint =
0/51000510, state->EndRecPtr = 0/51000600. According to pg_waldump, on
timeline 1, we have this record at that LSN:

rmgr: Heap len (rec/tot): 54/ 54, tx: 2313637, lsn: 0/51000510, prev 0/510004D0, desc: DELETE xmax: 2313637, off: 3, infobits: [KEYS_UPDATED], flags: 0x00, blkref #0: rel 1663/5/6104 blk 0

And on timeline 2, we have this at that LSN:

rmgr: XLOG len (rec/tot): 114/ 114, tx: 0, lsn: 0/51000510, prev 0/510004D0, desc: CHECKPOINT_SHUTDOWN redo 0/51000510; tli 2; prev tli 1; fpw true; xid 0:2313638; oid 24576; multi 1; offset 0; oldest xid 730 in DB 1; oldest multi 1 in DB 1; oldest/newest commit timestamp xid: 0/0; oldest running xid 0; shutdown
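
To make the failing comparison concrete, here is a minimal standalone
sketch that parses the two LSNs reported above and measures how far the
reader got past the switchpoint. XLogRecPtr is modeled as a plain
64-bit integer, and parse_lsn() and bytes_read_past_switchpoint() are
hypothetical helpers written for this illustration, not PostgreSQL
functions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Stand-in for PostgreSQL's XLogRecPtr: a 64-bit byte position in WAL. */
typedef uint64_t XLogRecPtr;

/*
 * Parse the "%X/%X" text form of an LSN (high 32 bits, slash, low 32
 * bits).  Hypothetical helper; returns false on malformed input.
 */
static bool
parse_lsn(const char *str, XLogRecPtr *lsn)
{
    unsigned int hi;
    unsigned int lo;
    char         extra;

    assert(str != NULL && lsn != NULL);

    /* Exactly two conversions must succeed; a third means trailing junk. */
    if (sscanf(str, "%X/%X%c", &hi, &lo, &extra) != 2)
        return false;

    *lsn = ((XLogRecPtr) hi << 32) | lo;
    return true;
}

/*
 * Number of bytes by which the reader has gone past the switchpoint,
 * or 0 if it has not.  The assertion in question requires this to be 0.
 */
static uint64_t
bytes_read_past_switchpoint(XLogRecPtr switchpoint, XLogRecPtr end_rec_ptr)
{
    return end_rec_ptr > switchpoint ? end_rec_ptr - switchpoint : 0;
}
```

With switchpoint = 0/51000510 and state->EndRecPtr = 0/51000600, the
second helper returns 0xF0, i.e. the reader was 240 bytes beyond the
switchpoint on timeline 1 when the check ran.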

It appears that pg_createsubscriber creates a recovery configuration that includes:

recovery_target_timeline = 'latest'
recovery_target_inclusive = true
recovery_target_lsn = '%X/%X'

...where %X/%X represents a valid LSN.
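
For readers unfamiliar with the %X/%X notation: it prints the upper and
lower 32 bits of the 64-bit LSN as two hexadecimal numbers. The sketch
below builds such a setting line from an integer LSN.
format_recovery_target_lsn() is a hypothetical helper for illustration
only, and the LSN used in the note after it is made up, not the target
from this test run.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Stand-in for PostgreSQL's XLogRecPtr: a 64-bit byte position in WAL. */
typedef uint64_t XLogRecPtr;

/*
 * Write "recovery_target_lsn = 'hi/lo'" into buf, where hi and lo are
 * the upper and lower 32 bits of lsn in hexadecimal.  Returns the
 * snprintf() result.  Hypothetical helper, not a PostgreSQL function.
 */
static int
format_recovery_target_lsn(char *buf, size_t buflen, XLogRecPtr lsn)
{
    assert(buf != NULL && buflen > 0);

    return snprintf(buf, buflen, "recovery_target_lsn = '%X/%X'",
                    (unsigned int) (lsn >> 32),
                    (unsigned int) (lsn & 0xFFFFFFFFu));
}
```

For example, the made-up LSN 0x16B374D48 would be written as
recovery_target_lsn = '1/6B374D48'.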

I think the problem here is that the WAL summarizer believes that when
a new timeline appears, it should pick up from where the old timeline
ended. And here, that doesn't happen: the new timeline branches off
before the end of the old timeline, because of the recovery target.
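
A toy model of that situation, under my own simplifying assumptions:
each timeline owns a half-open LSN range, and the new timeline starts
exactly at the switchpoint. TimelineSpan and timeline_owning_lsn() are
hypothetical names for this sketch and are not the server's actual
timeline-history code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in for PostgreSQL's XLogRecPtr: a 64-bit byte position in WAL. */
typedef uint64_t XLogRecPtr;

/* Hypothetical description of the LSN range one timeline owns. */
typedef struct
{
    uint32_t   tli;     /* timeline ID */
    XLogRecPtr begin;   /* first LSN that belongs to this timeline */
    XLogRecPtr end;     /* first LSN that does NOT; 0 means still open */
} TimelineSpan;

/*
 * Return the ID of the timeline whose range covers lsn, or 0 if none
 * does.  WAL that was written on an old timeline at or after its "end"
 * still exists on disk, but is not part of this history.
 */
static uint32_t
timeline_owning_lsn(const TimelineSpan *history, size_t n, XLogRecPtr lsn)
{
    assert(history != NULL);

    for (size_t i = 0; i < n; i++)
    {
        bool after_begin = history[i].begin <= lsn;
        bool before_end = history[i].end == 0 || lsn < history[i].end;

        if (after_begin && before_end)
            return history[i].tli;
    }
    return 0;
}
```

With a history of {1, 0, 0x51000510} followed by {2, 0x51000510, 0},
LSN 0/510004D0 belongs to timeline 1, while 0/51000510 and everything
after it belong to timeline 2. The Heap DELETE record that timeline 1
has at 0/51000510 lies outside that history, so a reader that consumes
it on timeline 1 has gone past the switchpoint.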

I'm not yet sure what should be done about this. The obvious answer is
"remove the assertion," and maybe that is all we need to do. However,
I'm not quite sure what the actual behavior will be if we just do
that, so I think more investigation is needed. I'll keep looking at
this, although given the US holiday I may not have results until next
week.

--
Robert Haas
EDB: http://www.enterprisedb.com
