From: | "Hayato Kuroda (Fujitsu)" <kuroda(dot)hayato(at)fujitsu(dot)com> |
---|---|
To: | 'Peter Smith' <smithpb2250(at)gmail(dot)com> |
Cc: | PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Bruce Momjian <bruce(at)momjian(dot)us>, Julien Rouhaud <rjuju123(at)gmail(dot)com>, vignesh C <vignesh21(at)gmail(dot)com>, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>, "Zhijie Hou (Fujitsu)" <houzj(dot)fnst(at)fujitsu(dot)com> |
Subject: | RE: [PoC] pg_upgrade: allow to upgrade publisher node |
Date: | 2023-08-18 13:51:36 |
Message-ID: | TYAPR01MB5866562EF047F2C9DDD1F9DEF51BA@TYAPR01MB5866.jpnprd01.prod.outlook.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-hackers |
Dear Peter,
PSA new version patch set.
> Here are some review comments for the patch v21-0003
>
> ======
> Commit message
>
> 1.
> pg_upgrade fails if the old node has slots which status is 'lost' or they do not
> consume all WAL records. These are needed for prevent the data loss.
>
> ~
>
> Maybe some minor brush-up like:
>
> SUGGESTION
> In order to prevent data loss, pg_upgrade will fail if the old node
> has slots with the status 'lost', or with unconsumed WAL records.
Improved.
> src/bin/pg_upgrade/check.c
>
> 2. check_for_confirmed_flush_lsn
>
> + /* Check that all logical slots are not in 'lost' state. */
> + res = executeQueryOrDie(conn,
> + "SELECT slot_name FROM pg_catalog.pg_replication_slots "
> + "WHERE temporary = false AND wal_status = 'lost';");
> +
> + ntups = PQntuples(res);
> + i_slotname = PQfnumber(res, "slot_name");
> +
> + for (i = 0; i < ntups; i++)
> + {
> + is_error = true;
> +
> + pg_log(PG_WARNING,
> + "\nWARNING: logical replication slot \"%s\" is obsolete.",
> + PQgetvalue(res, i, i_slotname));
> + }
> +
> + PQclear(res);
> +
> + if (is_error)
> + pg_fatal("logical replication slots not to be in 'lost' state.");
> +
>
> 2a. (GENERAL)
> The above code for checking lost state seems out of place in this
> function which is meant for checking confirmed flush lsn.
>
> Maybe you jammed both kinds of logic into one function to save on the
> extra PGconn or something but IMO two separate functions would be
> better. e.g.
> - check_for_lost_slots
> - check_for_confirmed_flush_lsn
Separated into check_for_lost_slots and check_for_confirmed_flush_lsn.
> 2b.
> + /* Check that all logical slots are not in 'lost' state. */
>
> SUGGESTION
> /* Check there are no logical replication slots with a 'lost' state. */
Changed.
> 2c.
> + res = executeQueryOrDie(conn,
> + "SELECT slot_name FROM pg_catalog.pg_replication_slots "
> + "WHERE temporary = false AND wal_status = 'lost';");
>
> This SQL fragment is very much like others in previous patches. Be
> sure to make all the cases and clauses consistent with all those
> similar SQL fragments.
Unified the order. Note that they could not be the completely the same.
> 2d.
> + is_error = true;
>
> That doesn't need to be in the loop. Better to just say:
> is_error = (ntups > 0);
Removed the variable.
> 2e.
> There is a mix of terms in the WARNING and in the pg_fatal -- e.g.
> "obsolete" versus "lost". Is it OK?
Unified to 'lost'.
> 2f.
> + pg_fatal("logical replication slots not to be in 'lost' state.");
>
> English? And maybe it should be much more verbose...
>
> "Upgrade of this installation is not allowed because one or more
> logical replication slots with a state of 'lost' were detected."
I checked other pg_fatal() and the statement like "Upgrade of this installation is not allowed"
could not be found. So I used later part.
> 3. check_for_confirmed_flush_lsn
>
> + /*
> + * Check that all logical replication slots have reached the latest
> + * checkpoint position (SHUTDOWN_CHECKPOINT record). This checks cannot
> be
> + * done in case of live_check because the server has not been written the
> + * SHUTDOWN_CHECKPOINT record yet.
> + */
> + if (!live_check)
> + {
> + res = executeQueryOrDie(conn,
> + "SELECT slot_name FROM pg_catalog.pg_replication_slots "
> + "WHERE confirmed_flush_lsn != '%X/%X' AND temporary = false;",
> + old_cluster.controldata.chkpnt_latest_upper,
> + old_cluster.controldata.chkpnt_latest_lower);
> +
> + ntups = PQntuples(res);
> + i_slotname = PQfnumber(res, "slot_name");
> +
> + for (i = 0; i < ntups; i++)
> + {
> + is_error = true;
> +
> + pg_log(PG_WARNING,
> + "\nWARNING: logical replication slot \"%s\" has not consumed WALs yet",
> + PQgetvalue(res, i, i_slotname));
> + }
> +
> + PQclear(res);
> + PQfinish(conn);
> +
> + if (is_error)
> + pg_fatal("All logical replication slots consumed all the WALs.");
>
> ~
>
> 3a.
> /This checks/This check/
The comment was no longer needed, because the caller checks live_check variable.
More detail, please see my another post [1].
> 3b.
> I don't think the separation of
> chkpnt_latest_upper/chkpnt_latest_lower is needed like this. AFAIK
> there is an LSN_FORMAT_ARGS(lsn) macro designed for handling exactly
> this kind of parameter substitution.
Fixed to use the macro.
Previously I considered that the header "access/xlogdefs.h" could not be included
from pg_upgrade, and it was the reason why I did not use. But it seemed my
misunderstanding - I could include the file.
> 3c.
> + is_error = true;
>
> That doesn't need to be in the loop. Better to just say:
> is_error = (ntups > 0);
Removed.
> 3d.
> + pg_fatal("All logical replication slots consumed all the WALs.");
>
> The message seems backward. shouldn't it say something like:
> "Upgrade of this installation is not allowed because one or more
> logical replication slots still have unconsumed WAL records."
I used only later part, see above reply.
> src/bin/pg_upgrade/controldata.c
>
> 4. get_control_data
>
> + /*
> + * Upper and lower part of LSN must be read and stored
> + * separately because it is reported as %X/%X format.
> + */
> + cluster->controldata.chkpnt_latest_upper =
> + strtoul(p, &slash, 16);
> + cluster->controldata.chkpnt_latest_lower =
> + strtoul(++slash, NULL, 16);
>
> I felt that this field separation code is maybe not necessary. Please
> refer to other review comments in this post.
Hmm. I thought they must be read separately even if we stored as XLogRecPtr (uint64).
This is because the pg_controldata reports the LSN as %X/%X style. Am I missing something?
```
$ pg_controldata -D data_N1/ | grep "Latest checkpoint location"
Latest checkpoint location: 0/153C8D0
```
> src/bin/pg_upgrade/pg_upgrade.h
>
> 5. ControlData
>
> +
> + uint32 chkpnt_latest_upper;
> + uint32 chkpnt_latest_lower;
> } ControlData;
>
> ~
>
> Actually, I did not recognise the reason why this cannot be stored
> properly as a single XLogRecPtr field. Please see other review
> comments in this post.
Changed to use XLogRecPtr. See above comment.
> .../t/003_logical_replication_slots.pl
>
> 6. GENERAL
>
> Many of the changes to this file are just renaming the
> 'old_node'/'new_node' to 'old_publisher'/'new_publisher'.
>
> This seems a basic change not really associated with this patch 0003.
> To reduce the code churn, this change should be moved into the earlier
> patch where this test file (003_logical_replication_slots.pl) was
> first introduced,
Moved these renaming to 0002.
> 7.
>
> # Cause a failure at the start of pg_upgrade because slot do not finish
> # consuming all the WALs
>
> ~
>
> Can you give a more detailed explanation in the comment of how this
> test case achieves what it says?
Slightly reworded above and this comment. How do you think?
> src/test/regress/sql/misc_functions.sql
>
> 8.
> @@ -236,4 +236,4 @@ SELECT * FROM pg_split_walfile_name('invalid');
> SELECT segment_number > 0 AS ok_segment_number, timeline_id
> FROM pg_split_walfile_name('000000010000000100000000');
> SELECT segment_number > 0 AS ok_segment_number, timeline_id
> - FROM pg_split_walfile_name('ffffffFF00000001000000af');
> + FROM pg_split_walfile_name('ffffffFF00000001000000af');
> \ No newline at end of file
>
> ~
>
> What is this change for? It looks like maybe some accidental
> whitespace change happened.
It was unexpected, removed.
Best Regards,
Hayato Kuroda
FUJITSU LIMITED
Attachment | Content-Type | Size |
---|---|---|
v22-0001-Always-persist-to-disk-logical-slots-during-a-sh.patch | application/octet-stream | 4.9 KB |
v22-0002-pg_upgrade-Allow-to-replicate-logical-replicatio.patch | application/octet-stream | 23.1 KB |
v22-0003-pg_upgrade-Add-check-function-for-logical-replic.patch | application/octet-stream | 10.7 KB |
From | Date | Subject | |
---|---|---|---|
Next Message | Rui Zhao | 2023-08-18 14:47:04 | Re: pg_upgrade fails with in-place tablespace |
Previous Message | Hayato Kuroda (Fujitsu) | 2023-08-18 13:43:10 | RE: [PoC] pg_upgrade: allow to upgrade publisher node |