RE: [PoC] pg_upgrade: allow to upgrade publisher node

From: "Zhijie Hou (Fujitsu)" <houzj(dot)fnst(at)fujitsu(dot)com>
To: "Hayato Kuroda (Fujitsu)" <kuroda(dot)hayato(at)fujitsu(dot)com>, 'Amit Kapila' <amit(dot)kapila16(at)gmail(dot)com>
Cc: Peter Smith <smithpb2250(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Bruce Momjian <bruce(at)momjian(dot)us>, Julien Rouhaud <rjuju123(at)gmail(dot)com>, vignesh C <vignesh21(at)gmail(dot)com>, Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>, Dilip Kumar <dilipbalaut(at)gmail(dot)com>
Subject: RE: [PoC] pg_upgrade: allow to upgrade publisher node
Date: 2023-09-14 03:10:38
Message-ID: OS0PR01MB571640E1B58741979A5E586594F7A@OS0PR01MB5716.jpnprd01.prod.outlook.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Wednesday, September 13, 2023 9:52 PM Kuroda, Hayato/黒田 隼人 <kuroda(dot)hayato(at)fujitsu(dot)com> wrote:
>
> Dear Amit,
>
> Thank you for reviewing! Before making a patch I can reply the important point.
>
> > 1. One thing to note is that if user checks whether the old cluster is
> > upgradable with --check option and then try to upgrade, that will also
> > fail. Because during the --check run there would at least one
> > additional shutdown checkpoint WAL and then in the next run the slots
> > position won't match. Note, I am saying this in context of using
> > --check option with not-running old cluster. Won't that be surprising
> > to users? One possibility is that we document such a behaviour and
> > other is that we go back to WAL reading design where we can ignore
> > known WAL records like shutdown checkpoint, XLOG_RUNNING_XACTS, etc.
>
> Good catch, we have never considered the case that --check is executed for
> stopped cluster. You are right, the old cluster is turned on/off during the
> check and it generates SHUTDOWN_CHECKPOINT. This leads that
> confirmed_flush is
> behind the latest checkpoint lsn.
>
> Here are other approaches we came up with:
>
> 1. adds WARNING message when the --check is executed and slots are
> checked.
> We can say like:
>
> ```
> ...
> Checking for valid logical replication slots
> WARNING: this check generated WALs
> Next pg_uprade would fail.
> Please ensure again that all WALs are replicated.
> ...
> ```
>
>
> 2. adds hint message in the FATAL error when the confirmed_flush is not same
> as
> the latest checkpoint:
>
> ```
> ...
> Checking for valid logical replication slots fatal
>
> Your installation contains invalid logical replication slots.
> These slots can't be copied, so this cluster cannot be upgraded.
> Consider removing such slots or consuming the pending WAL if any,
> and then restart the upgrade.
> If you did pg_upgrade --check before this run, it may be a cause.
> Please start clusters and confirm again that all changes are
> replicated.
> A list of invalid logical replication slots is in the file:
> ```
>
> 3. requests users to do pg_upgrade --check on backup database, if old cluster
> has logical slots. Basically they save a whole of cluster before doing
> pg_uprade,
> so it may be acceptable. This is not a modification of codes.
>

Here are some more ideas about the issue for reference.

1) Extending the controlfile.

We can dd a new field (e.g. non_upgrade_checkPoint) to record the last check point
ptr happened in non-upgrade mode. The new field won't be updated due to
"pg_upgrade --check", so pg_upgrade can use this LSN to compare with the slot's
confirmed_flush_lsn.

Pros: User can smoothly upgrade the cluster even if they run "pg_upgrade
--check" in advance.

Cons: Not sure if this is a enough reason to introduce new field in
controlfile.

-----------

2) Advance the slot's confirmed_flush_lsn in pg_upgrade if the check passes.

Introducing an upgrade support SQL function
(binary_upgrade_advance_logical_slot_lsn()) to set a
flag(catch_confirmed_lsn_up) on server side. On server side, when trying to
flush the slot in shutdown checkpoint(CheckPointReplicationSlots), we update
the slot's confirmed_flush_lsn to the lsn of the current checkpoint if
catch_confirmed_lsn_up is set.

Pros: User can smoothly upgrade the cluster even if they run "pg_upgrade
--check" in advance.

Cons: Although we have some examples for using functions
(binary_upgrade_set_next_pg_enum_oid ...) to set some variables during upgrade
, but not sure if it's a standard behavior to change the slot's lsn during
upgrade.

-----------

3) Introduce a new pg_upgrade option(e.g. skip_slot_check), and suggest if user
already did the upgrade check for stopped server, they can use this option
when trying to upgrade later.

Pros: Can save some efforts for user to advance each slot's lsn.

Cons: I didn't see similar options in pg_upgrade, might need some agreement.

Best Regards,
Hou zj

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Dilip Kumar 2023-09-14 03:44:40 Re: [PoC] pg_upgrade: allow to upgrade publisher node
Previous Message Andres Freund 2023-09-14 02:22:40 Re: BufferUsage counters' values have changed