Re: speed up a logical replica setup

From: Alexander Lakhin <exclusion(at)gmail(dot)com>
To: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Hou, Zhijie/侯 志杰 <houzj(dot)fnst(at)fujitsu(dot)com>
Cc: "Hayato Kuroda (Fujitsu)" <kuroda(dot)hayato(at)fujitsu(dot)com>, "pgsql-hackers(at)lists(dot)postgresql(dot)org" <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: speed up a logical replica setup
Date: 2024-07-11 17:00:00
Message-ID: 68de6498-0449-a113-dd03-e198dded0bac@gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hello Amit and Hou-San,

11.07.2024 13:21, Amit Kapila wrote:
>
> We don't wait for the xmin to catch up corresponding to this insert
> and I don't know if there is a way to do that. So, we should move this
> Insert to after the call to pg_sync_replication_slots(). It won't
> impact the general test of pg_createsubscriber.
>
> Thanks to Hou-San for helping me in the analysis of this BF failure.

Thank you for investigating that issue!

May I ask you to look at another failure of the test occurred today [1]?
The failure log contains:
recovery_target_lsn = '0/30098D0'
pg_createsubscriber: starting the subscriber
...
pg_createsubscriber: server was started
pg_createsubscriber: waiting for the target server to reach the consistent state
...
2024-07-11 07:40:10.837 UTC [2948531][postmaster][:0] LOG:  received fast shutdown request

Though what I'm seeing after a successful run is:
recovery_target_lsn = '0/3009860'
pg_createsubscriber: starting the subscriber
...
pg_createsubscriber: server was started
pg_createsubscriber: waiting for the target server to reach the consistent state
...
2024-07-11 15:19:40.733 UTC [207517] 040_pg_createsubscriber.pl LOG:  statement: SELECT pg_catalog.pg_is_in_recovery()
2024-07-11 15:19:41.635 UTC [207514] LOG:  recovery stopping after WAL location (LSN) "0/3009860"
2024-07-11 15:19:41.635 UTC [207514] LOG:  redo done at 0/3009860 system usage: CPU: user: 0.00 s, system: 0.00 s,
elapsed: 21.00 s

I've managed to reproduce the failure locally. With the bgwriter mod:
-#define LOG_SNAPSHOT_INTERVAL_MS 15000
+#define LOG_SNAPSHOT_INTERVAL_MS 150

and wal_debug=on, when running 5 test instances with parallel, I get the
failure with the following log:
recovery_target_lsn = '0/3009A20'
pg_createsubscriber: starting the subscriber

2024-07-11 14:40:04.551 UTC [205589:72][startup][33/0:0] LOG:  REDO @ 0/30099E8; LSN 0/3009A20: prev 0/30099B0; xid 0;
len 24 - Standby/RUNNING_XACTS: nextXid 747 latestCompletedXid 746 oldestRunningXid 747
# ^^^ the last REDO record in the log
...
pg_createsubscriber: server was started
pg_createsubscriber: waiting for the target server to reach the consistent state
...
pg_createsubscriber: server was stopped
pg_createsubscriber: error: recovery timed out
...
[14:43:06.011](181.800s) not ok 30 - run pg_createsubscriber on node S
[14:43:06.012](0.000s)
[14:43:06.012](0.000s) #   Failed test 'run pg_createsubscriber on node S'
#   at t/040_pg_createsubscriber.pl line 356.

$ pg_waldump -p src/bin/pg_basebackup_1/tmp_check/t_040_pg_createsubscriber_node_s_data/pgdata/pg_wal/
000000010000000000000003 000000010000000000000003 | tail -2
rmgr: Standby     len (rec/tot):     50/    50, tx:          0, lsn: 0/030099B0, prev 0/03009948, desc: RUNNING_XACTS
nextXid 747 latestCompletedXid 746 oldestRunningXid 747
rmgr: Standby     len (rec/tot):     50/    50, tx:          0, lsn: 0/030099E8, prev 0/030099B0, desc: RUNNING_XACTS
nextXid 747 latestCompletedXid 746 oldestRunningXid 747

Whilst
$ pg_waldump -p src/bin/pg_basebackup_1/tmp_check/t_040_pg_createsubscriber_node_p_data/pgdata/pg_wal/
000000010000000000000003 000000010000000000000003 | tail
rmgr: Standby     len (rec/tot):     50/    50, tx:          0, lsn: 0/030099B0, prev 0/03009948, desc: RUNNING_XACTS
nextXid 747 latestCompletedXid 746 oldestRunningXid 747
rmgr: Standby     len (rec/tot):     50/    50, tx:          0, lsn: 0/030099E8, prev 0/030099B0, desc: RUNNING_XACTS
nextXid 747 latestCompletedXid 746 oldestRunningXid 747
rmgr: Heap2       len (rec/tot):     60/    60, tx:        747, lsn: 0/03009A20, prev 0/030099E8, desc: NEW_CID rel:
1663/16384/6104, tid: 0/1, cmin: 4294967295, cmax: 0, combo: 4294967295
rmgr: Heap        len (rec/tot):     54/    54, tx:        747, lsn: 0/03009A60, prev 0/03009A20, desc: DELETE xmax:
747, off: 1, infobits: [KEYS_UPDATED], flags: 0x00, blkref #0: rel 1663/16384/6104 blk 0
rmgr: Transaction len (rec/tot):     78/    78, tx:        747, lsn: 0/03009A98, prev 0/03009A60, desc: INVALIDATION ;
inval msgs: catcache 49 catcache 46 relcache 0
rmgr: Transaction len (rec/tot):     98/    98, tx:        747, lsn: 0/03009AE8, prev 0/03009A98, desc: COMMIT
2024-07-11 14:43:05.994561 UTC; relcache init file inval dbid 16384 tsid 1663; inval msgs: catcache 49 catcache 46
relcache 0
rmgr: Heap2       len (rec/tot):     60/    60, tx:        748, lsn: 0/03009B50, prev 0/03009AE8, desc: NEW_CID rel:
1663/16385/6104, tid: 0/1, cmin: 4294967295, cmax: 0, combo: 4294967295
rmgr: Heap        len (rec/tot):     54/    54, tx:        748, lsn: 0/03009B90, prev 0/03009B50, desc: DELETE xmax:
748, off: 1, infobits: [KEYS_UPDATED], flags: 0x00, blkref #0: rel 1663/16385/6104 blk 0
rmgr: Transaction len (rec/tot):     78/    78, tx:        748, lsn: 0/03009BC8, prev 0/03009B90, desc: INVALIDATION ;
inval msgs: catcache 49 catcache 46 relcache 0
rmgr: Transaction len (rec/tot):     98/    98, tx:        748, lsn: 0/03009C18, prev 0/03009BC8, desc: COMMIT
2024-07-11 14:43:06.008619 UTC; relcache init file inval dbid 16385 tsid 1663; inval msgs: catcache 49 catcache 46
relcache 0

[1] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=olingo&dt=2024-07-11%2007%3A25%3A12

Best regards,
Alexander

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Tom Lane 2024-07-11 17:30:23 Re: Improve the connection failure error messages
Previous Message Alena Rybakina 2024-07-11 16:29:02 Re: POC, WIP: OR-clause support for indexes