RE: speed up a logical replica setup

From: "Zhijie Hou (Fujitsu)" <houzj(dot)fnst(at)fujitsu(dot)com>
To: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Alexander Lakhin <exclusion(at)gmail(dot)com>
Cc: "Hayato Kuroda (Fujitsu)" <kuroda(dot)hayato(at)fujitsu(dot)com>, "pgsql-hackers(at)lists(dot)postgresql(dot)org" <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: RE: speed up a logical replica setup
Date: 2024-07-11 11:22:03
Message-ID: OS0PR01MB571670F060D9BC88FB2F61B794A52@OS0PR01MB5716.jpnprd01.prod.outlook.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Thursday, July 11, 2024 6:21 PM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
>
> On Tue, Jul 9, 2024 at 4:30 PM Alexander Lakhin <exclusion(at)gmail(dot)com>
> wrote:
> >
> > Please look at another failure of the test [1]:
> > [13:28:05.647](2.460s) not ok 26 - failover slot is synced
> > [13:28:05.648](0.001s) # Failed test 'failover slot is synced'
> > # at
> /home/bf/bf-build/skink-master/HEAD/pgsql/src/bin/pg_basebackup/t/04
> 0_pg_createsubscriber.pl line 307.
> > [13:28:05.648](0.000s) # got: ''
> > # expected: 'failover_slot'
> >
> > with 040_pg_createsubscriber_node_s.log containing:
> > 2024-07-08 13:28:05.369 UTC [3985464][client backend][0/2:0] LOG:
> > statement: SELECT pg_sync_replication_slots()
> > 2024-07-08 13:28:05.557 UTC [3985464][client backend][0/2:0] LOG:
> > could not sync slot "failover_slot" as remote slot precedes local slot
> > 2024-07-08 13:28:05.557 UTC [3985464][client backend][0/2:0] DETAIL:
> > Remote slot has LSN 0/30047B8 and catalog xmin 743, but local slot has LSN
> 0/30047B8 and catalog xmin 744.
> >
> > I could not reproduce it locally, but I've discovered that that
> > subtest somehow depends on pg_createsubscriber executed for the
> > 'primary contains unmet conditions on node P' check. For example with
> > this test modification:
> > @@ -249,7 +249,7 @@ command_fails(
> > $node_p->connstr($db1), '--socket-directory',
> > $node_s->host, '--subscriber-port',
> > $node_s->port, '--database',
> > - $db1, '--database',
> > + 'XXX', '--database',
> > $db2
> > ],
> > 'primary contains unmet conditions on node P');
> >
> > I see the same failure:
> > 2024-07-09 10:19:43.284 UTC [938890] 040_pg_createsubscriber.pl LOG:
> > statement: SELECT pg_sync_replication_slots()
> > 2024-07-09 10:19:43.292 UTC [938890] 040_pg_createsubscriber.pl LOG:
> > could not sync slot "failover_slot" as remote slot precedes local slot
> > 2024-07-09 10:19:43.292 UTC [938890] 040_pg_createsubscriber.pl
> > DETAIL: Remote slot has LSN 0/3004780 and catalog xmin 743, but local
> slot has LSN 0/3004780 and catalog xmin 744.
> >
> > Thus maybe even a normal pg_createsubscriber run can affect the
> > primary server (it's catalog xmin) differently?
> >
>
> Yes, pg_createsubscriber can affect the primary server's catalog xmin because
> it starts the standby server that can send HSFeedback (See
> XLogWalRcvSendHSFeedback()), which can advance the physical slot's xmin
> corresponding the following Insert in the test:
>
> # Insert another row on node P and wait node S to catch up
> $node_p->safe_psql($db1, "INSERT INTO tbl1 VALUES('second row')");
> $node_p->wait_for_replay_catchup($node_s);
>
> In the success case, pg_createsubscriber is able to send HSFeedback and in
> the failure case, it won't. We can see the following logs in
> 040_pg_createsubscriber_node_p.log:
>
> 2024-07-08 13:28:00.872 UTC [3982331][walsender][:0] FATAL: the database
> system is starting up
> 2024-07-08 13:28:00.875 UTC [3982328][startup][:0] LOG: database system
> was shut down at 2024-07-08 13:28:00 UTC
> 2024-07-08 13:28:01.105 UTC [3981996][postmaster][:0] LOG: database
> system is ready to accept connections
>
> This shows that when the test 'primary contains unmet conditions on node P'
> starts the standby server the corresponding primary node was not ready
> because we just restarted node_p before that test and didn't ensure that the
> node_p is up and ready to accept connections before starting the
> pg_createsubscriber test.
>
> Even in the successful cases where the standby is able to connect to primary
> for test 'primary contains unmet conditions on node P', there is no guarantee
> that xmin of the physical slot will be updated at least, we don't have anything in
> the test to ensure the same.
>
> Now as before creating logical replication, we didn't ensure that the physical
> slot's xmin has been caught up to the latest value, the test can lead to failure
> like: "Remote slot has LSN 0/3004780 and catalog xmin 743, but local slot has
> LSN 0/3004780 and catalog xmin 744".
>
> The xmin on standby could have been advanced due to the following Insert:
> # Insert another row on node P and wait node S to catch up
> $node_p->safe_psql($db1, "INSERT INTO tbl1 VALUES('second row')");
> $node_p->wait_for_replay_catchup($node_s);
>
> We don't wait for the xmin to catch up corresponding to this insert and I don't
> know if there is a way to do that. So, we should move this Insert to after the call
> to pg_sync_replication_slots(). It won't impact the general test of
> pg_createsubscriber.

The analysis and suggestion look reasonable to me.
Here is a small patch which does the same.

> Thanks to Hou-San for helping me in the analysis of this BF failure.

Best Regards,
Hou zj

Attachment Content-Type Size
0001-fix-unstable-test-in-040_pg_createsubscriber.patch application/octet-stream 1.8 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Andrew Dunstan 2024-07-11 11:29:30 Re: tests fail on windows with default git settings
Previous Message Fujii Masao 2024-07-11 11:14:39 Re: MERGE/SPLIT partition commands should create new partitions in the parent's tablespace?