From: | "Hayato Kuroda (Fujitsu)" <kuroda(dot)hayato(at)fujitsu(dot)com> |
---|---|
To: | 'Peter Smith' <smithpb2250(at)gmail(dot)com> |
Cc: | PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Bruce Momjian <bruce(at)momjian(dot)us>, Julien Rouhaud <rjuju123(at)gmail(dot)com>, vignesh C <vignesh21(at)gmail(dot)com>, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>, "Zhijie Hou (Fujitsu)" <houzj(dot)fnst(at)fujitsu(dot)com> |
Subject: | RE: [PoC] pg_upgrade: allow to upgrade publisher node |
Date: | 2023-08-23 02:43:32 |
Message-ID: | TYAPR01MB58668021BB233D129B466122F51CA@TYAPR01MB5866.jpnprd01.prod.outlook.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-hackers |
Dear Peter,
Thanks for giving comments! New version will be available
in the upcoming post.
>
1. check_for_lost_slots
+ /* logical slots can be migrated since PG17. */
+ if (GET_MAJOR_VERSION(cluster->major_version) <= 1600)
+ return;
1a
Maybe the comment should start uppercase for consistency with others.
>
Seems right, but I revisit check_and_dump_old_cluster() again and found that
some version-specific checks are done outside the checking function.
So I started to follow the style and then the part is moved to
check_and_dump_old_cluster(). Also, version checking for new cluster is also
moved to check_new_cluster(). Is it OK for you?
>
1b.
IMO if you check < 1700 instead of <= 1600 it will be a better match with the comment.
>
Per suggestion from Amit, I used < 1700. Some other changes in 0002 were reverted.
>
2. check_for_lost_slots
+ for (i = 0; i < ntups; i++)
+ {
+ pg_log(PG_WARNING,
+ "\nWARNING: logical replication slot \"%s\" is in 'lost' state.",
+ PQgetvalue(res, i, i_slotname));
+ }
+
+
The braces {} are not needed anymore
>
Fixed.
>
3. check_for_confirmed_flush_lsn
+ /* logical slots can be migrated since PG17. */
+ if (GET_MAJOR_VERSION(cluster->major_version) <= 1600)
+ return;
3a.
Maybe the comment should start uppercase for consistency with others.
>
Per reply for comment 1, the part was no longer needed.
>
3b.
IMO if you check < 1700 instead of <= 1600 it will be a better match with the comment.
>
Per suggestion from Amit, I used < 1700.
>
4. check_for_confirmed_flush_lsn
+ for (i = 0; i < ntups; i++)
+ {
+ pg_log(PG_WARNING,
+ "\nWARNING: logical replication slot \"%s\" has not consumed WALs yet",
+ PQgetvalue(res, i, i_slotname));
+ }
+
The braces {} are not needed anymore
>
Fixed.
>
5. get_control_data
+ /*
+ * Gather latest checkpoint location if the cluster is newer or
+ * equal to 17. This is used for upgrading logical replication
+ * slots.
+ */
+ if (GET_MAJOR_VERSION(cluster->major_version) >= 17)
5a.
/newer or equal to 17/PG17 or later/
>
Fixed.
>
5b.
>= 17 should be >= 1700
>
Per suggestion from Amit, I used < 1700.
>
6. get_control_data
+ {
+ char *slash = NULL;
+ uint64 upper_lsn, lower_lsn;
+
+ p = strchr(p, ':');
+
+ if (p == NULL || strlen(p) <= 1)
+ pg_fatal("%d: controldata retrieval problem", __LINE__);
+
+ p++; /* remove ':' char */
+
+ p = strpbrk(p, "01234567890ABCDEF");
+
+ /*
+ * Upper and lower part of LSN must be read separately
+ * because it is reported as %X/%X format.
+ */
+ upper_lsn = strtoul(p, &slash, 16);
+ lower_lsn = strtoul(++slash, NULL, 16);
+
+ /* And combine them */
+ cluster->controldata.chkpnt_latest =
+ (upper_lsn << 32) | lower_lsn;
+ }
Should 'upper_lsn' and 'lower_lsn' be declared as uint32? That seems a better mirror for LSN_FORMAT_ARGS.
>
Changed the definition to uint32, and a cast was added.
>
7. get_logical_slot_infos
+
+ /*
+ * Do additional checks if slots are found on the old node. If something is
+ * found on the new node, a subsequent function
+ * check_new_cluster_is_empty() would report the name of slots and raise a
+ * fatal error.
+ */
+ if (cluster == &old_cluster && slot_count)
+ {
+ check_for_lost_slots(cluster);
+
+ if (!live_check)
+ check_for_confirmed_flush_lsn(cluster);
+ }
It somehow doesn't feel right for these extra checks to be jammed into this function, just because you conveniently have the slot_count available.
On the NEW cluster side, there was extra checking in the check_new_cluster() function.
For consistency, I think this OLD cluster checking should be done in the check_and_dump_old_cluster() function -- see the "Check for various failure cases" comment -- IMO this new fragment belongs there with the other checks.
>
All the checks were moved to check_and_dump_old_cluster(), and adds a check for its major version.
>
8.
bool date_is_int;
bool float8_pass_by_value;
uint32 data_checksum_version;
+
+ XLogRecPtr chkpnt_latest;
} ControlData;
I don't think the new field is particularly different from all the others that it needs a blank line separator.
>
I removed the blank. Actually I wondered where the attribute should be, but kept at last.
>
9.
# Initialize old node
my $old_publisher = PostgreSQL::Test::Cluster->new('old_publisher');
$old_publisher->init(allows_streaming => 'logical');
-$old_publisher->start;
# Initialize new node
my $new_publisher = PostgreSQL::Test::Cluster->new('new_publisher');
$new_publisher->init(allows_streaming => 'replica');
-my $bindir = $new_publisher->config_data('--bindir');
+# Initialize subscriber node
+my $subscriber = PostgreSQL::Test::Cluster->new('subscriber');
+$subscriber->init(allows_streaming => 'logical');
-$old_publisher->stop;
+my $bindir = $new_publisher->config_data('--bindir');
~
Are those removal of the old_publisher start/stop changes that actually should be done in the 0002 patch?
>
Yes, It should be removed from 0002.
>
10.
$old_publisher->safe_psql(
'postgres', qq[
SELECT pg_create_logical_replication_slot('test_slot2', 'test_decoding', false, true);
+ SELECT count(*) FROM pg_logical_slot_get_changes('test_slot1', NULL, NULL);
]);
~
What is the purpose of the added SELECT? It doesn't seem covered by the comment.
>
The SELECT statement is needed to trigger the failure caused by the insufficient
max_replication_slots. Checking on new cluster is started after old servers are
verified, so if the step is omitted, another error is reported:
```
Checking confirmed_flush_lsn for logical replication slots
WARNING: logical replication slot "test_slot1" has not consumed WALs yet
One or more logical replication slots still have unconsumed WAL records.
```
I added a comment about it.
>
11.
# Remove an unnecessary slot and generate WALs. These records would not be
# consumed before doing pg_upgrade, so that the upcoming test would fail.
$old_publisher->start;
$old_publisher->safe_psql(
'postgres', qq[
SELECT pg_drop_replication_slot('test_slot2');
CREATE TABLE tbl AS SELECT generate_series(1, 10) AS a;
]);
$old_publisher->stop;
Minor rewording of comment sentence.
SUGGESTION
Because these WAL records do not get consumed it will cause the upcoming pg_upgrade test to fail.
>
Added.
>
12.
# Cause a failure at the start of pg_upgrade because the slot still have
# unconsumed WAL records
~
/still have/still has/
>
Fixed.
Best Regards,
Hayato Kuroda
FUJITSU LIMITED
From | Date | Subject | |
---|---|---|---|
Next Message | Hayato Kuroda (Fujitsu) | 2023-08-23 02:44:44 | RE: [PoC] pg_upgrade: allow to upgrade publisher node |
Previous Message | Zhijie Hou (Fujitsu) | 2023-08-23 02:27:24 | RE: subscription/015_stream sometimes breaks |