Re: speed up a logical replica setup

From: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
To: Alexander Lakhin <exclusion(at)gmail(dot)com>, Hou, Zhijie/侯 志杰 <houzj(dot)fnst(at)fujitsu(dot)com>
Cc: "Hayato Kuroda (Fujitsu)" <kuroda(dot)hayato(at)fujitsu(dot)com>, "pgsql-hackers(at)lists(dot)postgresql(dot)org" <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: speed up a logical replica setup
Date: 2024-07-11 10:21:16
Message-ID: CAA4eK1KG7+jUdONhsELY2poyBrmesNnkt-m5QSgPhgyyM6cFHQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Tue, Jul 9, 2024 at 4:30 PM Alexander Lakhin <exclusion(at)gmail(dot)com> wrote:
>
> Please look at another failure of the test [1]:
> [13:28:05.647](2.460s) not ok 26 - failover slot is synced
> [13:28:05.648](0.001s) # Failed test 'failover slot is synced'
> # at /home/bf/bf-build/skink-master/HEAD/pgsql/src/bin/pg_basebackup/t/040_pg_createsubscriber.pl line 307.
> [13:28:05.648](0.000s) # got: ''
> # expected: 'failover_slot'
>
> with 040_pg_createsubscriber_node_s.log containing:
> 2024-07-08 13:28:05.369 UTC [3985464][client backend][0/2:0] LOG: statement: SELECT pg_sync_replication_slots()
> 2024-07-08 13:28:05.557 UTC [3985464][client backend][0/2:0] LOG: could not sync slot "failover_slot" as remote slot
> precedes local slot
> 2024-07-08 13:28:05.557 UTC [3985464][client backend][0/2:0] DETAIL: Remote slot has LSN 0/30047B8 and catalog xmin
> 743, but local slot has LSN 0/30047B8 and catalog xmin 744.
>
> I could not reproduce it locally, but I've discovered that that subtest
> somehow depends on pg_createsubscriber executed for the
> 'primary contains unmet conditions on node P' check. For example with this
> test modification:
> @@ -249,7 +249,7 @@ command_fails(
> $node_p->connstr($db1), '--socket-directory',
> $node_s->host, '--subscriber-port',
> $node_s->port, '--database',
> - $db1, '--database',
> + 'XXX', '--database',
> $db2
> ],
> 'primary contains unmet conditions on node P');
>
> I see the same failure:
> 2024-07-09 10:19:43.284 UTC [938890] 040_pg_createsubscriber.pl LOG: statement: SELECT pg_sync_replication_slots()
> 2024-07-09 10:19:43.292 UTC [938890] 040_pg_createsubscriber.pl LOG: could not sync slot "failover_slot" as remote slot
> precedes local slot
> 2024-07-09 10:19:43.292 UTC [938890] 040_pg_createsubscriber.pl DETAIL: Remote slot has LSN 0/3004780 and catalog xmin
> 743, but local slot has LSN 0/3004780 and catalog xmin 744.
>
> Thus maybe even a normal pg_createsubscriber run can affect the primary
> server (it's catalog xmin) differently?
>

Yes, pg_createsubscriber can affect the primary server's catalog xmin
because it starts the standby server that can send HSFeedback (See
XLogWalRcvSendHSFeedback()), which can advance the physical slot's
xmin corresponding the following Insert in the test:

# Insert another row on node P and wait node S to catch up
$node_p->safe_psql($db1, "INSERT INTO tbl1 VALUES('second row')");
$node_p->wait_for_replay_catchup($node_s);

In the success case, pg_createsubscriber is able to send HSFeedback
and in the failure case, it won't. We can see the following logs in
040_pg_createsubscriber_node_p.log:

2024-07-08 13:28:00.872 UTC [3982331][walsender][:0] FATAL: the
database system is starting up
2024-07-08 13:28:00.875 UTC [3982328][startup][:0] LOG: database
system was shut down at 2024-07-08 13:28:00 UTC
2024-07-08 13:28:01.105 UTC [3981996][postmaster][:0] LOG: database
system is ready to accept connections

This shows that when the test 'primary contains unmet conditions on
node P' starts the standby server the corresponding primary node was
not ready because we just restarted node_p before that test and didn't
ensure that the node_p is up and ready to accept connections before
starting the pg_createsubscriber test.

Even in the successful cases where the standby is able to connect to
primary for test 'primary contains unmet conditions on node P', there
is no guarantee that xmin of the physical slot will be updated at
least, we don't have anything in the test to ensure the same.

Now as before creating logical replication, we didn't ensure that the
physical slot's xmin has been caught up to the latest value, the test
can lead to failure like: "Remote slot has LSN 0/3004780 and catalog
xmin 743, but local slot has LSN 0/3004780 and catalog xmin 744".

The xmin on standby could have been advanced due to the following Insert:
# Insert another row on node P and wait node S to catch up
$node_p->safe_psql($db1, "INSERT INTO tbl1 VALUES('second row')");
$node_p->wait_for_replay_catchup($node_s);

We don't wait for the xmin to catch up corresponding to this insert
and I don't know if there is a way to do that. So, we should move this
Insert to after the call to pg_sync_replication_slots(). It won't
impact the general test of pg_createsubscriber.

Thanks to Hou-San for helping me in the analysis of this BF failure.

--
With Regards,
Amit Kapila.

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Amit Kapila 2024-07-11 10:26:00 Re: speed up a logical replica setup
Previous Message Aleksander Alekseev 2024-07-11 10:11:05 Re: XID formatting and SLRU refactorings (was: Add 64-bit XIDs into PostgreSQL 15)