Re: Synchronizing slots from primary to standby

From: Peter Smith <smithpb2250(at)gmail(dot)com>
To: "Zhijie Hou (Fujitsu)" <houzj(dot)fnst(at)fujitsu(dot)com>
Cc: shveta malik <shveta(dot)malik(at)gmail(dot)com>, Bertrand Drouvot <bertranddrouvot(dot)pg(at)gmail(dot)com>, Ajin Cherian <itsajin(at)gmail(dot)com>, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>, Dilip Kumar <dilipbalaut(at)gmail(dot)com>, Nisha Moond <nisha(dot)moond412(at)gmail(dot)com>, "Hayato Kuroda (Fujitsu)" <kuroda(dot)hayato(at)fujitsu(dot)com>, Bharath Rupireddy <bharath(dot)rupireddyforpostgres(at)gmail(dot)com>, Peter Eisentraut <peter(dot)eisentraut(at)enterprisedb(dot)com>, Bruce Momjian <bruce(at)momjian(dot)us>, Ashutosh Sharma <ashu(dot)coek88(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org>
Subject: Re: Synchronizing slots from primary to standby
Date: 2024-05-23 05:33:32
Message-ID: CAHut+PthA-Pp0h7a2xi0v+QTOROfpsxC8NqWBhgxYy1ZEX=THg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Here are some review comments for the docs patch v3-0001.

======
Commit message

1.
This patch adds detailed documentation for the slot sync feature
including examples to guide users on how to verify that all slots have
been successfully synchronized to the standby server and how to
confirm whether the subscription can continue subscribing to
publications on the promoted standby server.

~

This may be easier to read if you put it in bullet form like below:

SUGGESTION

It includes examples to guide the user:

* How to verify that all slots have been successfully synchronized to
the standby server

* How to confirm whether the subscription can continue subscribing to
publications on the promoted standby server

======
doc/src/sgml/high-availability.sgml

2.
+ <para>
+ If you have opted for synchronization of logical slots (see
+ <xref linkend="logicaldecoding-replication-slots-synchronization"/>),
+ then before switching to the standby server, it is recommended to check
+ if the logical slots synchronized on the standby server are ready
+ for failover. This can be done by following the steps described in
+ <xref linkend="logical-replication-failover"/>.
+ </para>
+

Maybe it is better to call this feature "logical replication slot
synchronization" to be more consistent with the title of section
47.2.3.

SUGGESTION
If you have opted for logical replication slot synchronization (see ...

======
doc/src/sgml/logical-replication.sgml

3.
+ <para>
+ When the publisher server is the primary server of a streaming replication,
+ the logical slots on that primary server can be synchronized to the standby
+ server by specifying <literal>failover = true</literal> when creating
+ subscriptions for those publications. Enabling failover ensures a seamless
+ transition of those subscriptions after the standby is promoted. They can
+ continue subscribing to publications now on the new primary server without
+ losing any data that has been flushed to the new primary server.
+ </para>
+

3a.
BEFORE
When the publisher server is the primary server of...

SUGGESTION
When publications are defined on the primary server of...

~

3b.
Enabling failover...

Maybe say "Enabling the failover parameter..." and IMO there should
also be a link to the CREATE SUBSCRIPTION failover parameter so the
user can easily navigate there to read more about it.

~

3c.
BEFORE
They can continue subscribing to publications now on the new primary
server without losing any data that has been flushed to the new
primary server.

SUGGESTION (removes some extra info I did not think was needed)
They can continue subscribing to publications now on the new primary
server without any loss of data.

~~~

4.
+ <para>
+ Because the slot synchronization logic copies asynchronously, it is
+ necessary to confirm that replication slots have been synced to the standby
+ server before the failover happens. Furthermore, to ensure a successful
+ failover, the standby server must not be lagging behind the subscriber. It
+ is highly recommended to use
+ <link linkend="guc-standby-slot-names"><varname>standby_slot_names</varname></link>
+ to prevent the subscriber from consuming changes faster than the
hot standby.
+ To confirm that the standby server is indeed ready for failover, follow
+ these 2 steps:
+ </para>

IMO the last sentence "To confirm..." should be a new paragraph.

~~~

5.
+ <para>
+ Firstly, on the subscriber node, use the following SQL to identify
+ which slots should be synced to the standby that we plan to promote.

AND

+ <para>
+ Next, check that the logical replication slots identified above exist on
+ the standby server and are ready for failover.

~~

5a.
I don't think you need to say "Firstly," and "Next," because the order
to do these steps is already self-evident.

~

5b.
Patch says "on the subscriber node", but isn't that the simplest case?
e.g. maybe there are multiple nodes having subscriptions for these
publications. Maybe the sentence needs to account for case of
subscribers on >1 nodes.

Is there no way to discover this information by querying the publisher?

~~~

6.
+<programlisting>
+test_sub=# SELECT
+ array_agg(slotname) AS slots
+ FROM

...

+<programlisting>
+test_standby=# SELECT slot_name, (synced AND NOT temporary AND NOT
conflicting) AS failover_ready
+ FROM pg_replication_slots
+ WHERE slot_name IN ('sub1','sub2','sub3');

The example SQL for "1a" refers to 'slotname', but the example SQL for
"1b" refers to "slot_name" (i.e. with underscore). It might be better
if those are consistently called 'slot_name'.

~~~

7.
+ <step performance="required">
+ <para>
+ Confirm that the standby server is not lagging behind the subscribers.
+ This step can be skipped if
+ <link linkend="guc-standby-slot-names"><varname>standby_slot_names</varname></link>
+ has been correctly configured. If standby_slot_names is not configured
+ correctly, it is highly recommended to run this step after the primary
+ server is down, otherwise the results of the query may vary at different
+ points of time due to the ongoing replication on the logical subscribers
+ from the primary server.
+ </para>

7a.
I felt that the step should just say "Confirm that the standby server
is not lagging behind the subscribers.". So the text "This step can be
skipped..." should be a separate paragraph.

~

7b.
The 2nd standby_slot_names should use a varname font.

~

7c.
/may vary at different points in time due to/can vary due to/

~~~~

8.
+ <para>
+ Firstly, on the subscriber node check the last replayed WAL.
+ This step needs to be run on the database(s) that includes the failover
+ enabled subscription(s), to find the last replayed WAL on
each database.

8a.
Don't need to say "Firstly,"

~

8b.
The text "This step..." can be simplified as below:

BEFORE
This step needs to be run on the database(s) that includes the
failover enabled subscription(s), to find the last replayed WAL on
each database.

SUGGESTION
This step needs to be run on any database that includes
failover-enabled subscriptions.

~~~

9.
+ <para>
+ Next, on the standby server check that the last-received WAL location
+ is ahead of the replayed WAL location(s) on the subscriber identified
+ above. If the above SQL result was NULL, it means the subscriber has not
+ yet replayed any WAL, so the standby server must be ahead of the
+ subscriber, and this step can be skipped.

Don't need to say "Next,"

~~~

10.
+ <para>
+ If the result (<literal>failover_ready</literal>) of both above steps is
+ true, existing subscriptions will be able to continue without
losing any data
+ that has been flushed to the new primary server.
+ </para>

Let's word this more like the same sentence top of the page. See
review comment #3c

SUGGESTION
If the result (<literal>failover_ready</literal>) of both steps above
is true, then existing subscriptions can continue subscribing to
publications now on the new primary server without any loss of data.

======
Kind Regards,
Peter Smith.
Fujitsu Australia

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Peter Eisentraut 2024-05-23 05:59:39 Re: AIX support
Previous Message Hayato Kuroda (Fujitsu) 2024-05-23 05:26:38 RE: Pgoutput not capturing the generated columns