From: Alexander Lakhin <exclusion(at)gmail(dot)com>
To: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Hou, Zhijie/侯 志杰 <houzj(dot)fnst(at)fujitsu(dot)com>
Cc: "Hayato Kuroda (Fujitsu)" <kuroda(dot)hayato(at)fujitsu(dot)com>, "pgsql-hackers(at)lists(dot)postgresql(dot)org" <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: speed up a logical replica setup
Date: 2024-07-11 17:00:00
Message-ID: 68de6498-0449-a113-dd03-e198dded0bac@gmail.com
Lists: pgsql-hackers

Hello Amit and Hou-San,
11.07.2024 13:21, Amit Kapila wrote:
>
> We don't wait for the xmin to catch up corresponding to this insert
> and I don't know if there is a way to do that. So, we should move this
> Insert to after the call to pg_sync_replication_slots(). It won't
> impact the general test of pg_createsubscriber.
>
> Thanks to Hou-San for helping me in the analysis of this BF failure.
Thank you for investigating that issue!
May I ask you to look at another failure of the same test that occurred today [1]?
The failure log contains:
recovery_target_lsn = '0/30098D0'
pg_createsubscriber: starting the subscriber
...
pg_createsubscriber: server was started
pg_createsubscriber: waiting for the target server to reach the consistent state
...
2024-07-11 07:40:10.837 UTC [2948531][postmaster][:0] LOG: received fast shutdown request
Though what I'm seeing after a successful run is:
recovery_target_lsn = '0/3009860'
pg_createsubscriber: starting the subscriber
...
pg_createsubscriber: server was started
pg_createsubscriber: waiting for the target server to reach the consistent state
...
2024-07-11 15:19:40.733 UTC [207517] 040_pg_createsubscriber.pl LOG: statement: SELECT pg_catalog.pg_is_in_recovery()
2024-07-11 15:19:41.635 UTC [207514] LOG: recovery stopping after WAL location (LSN) "0/3009860"
2024-07-11 15:19:41.635 UTC [207514] LOG: redo done at 0/3009860 system usage: CPU: user: 0.00 s, system: 0.00 s, elapsed: 21.00 s
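In both runs, the "waiting for the target server to reach the consistent state" phase is pg_createsubscriber repeatedly issuing the statement seen in the log above, SELECT pg_catalog.pg_is_in_recovery(). A minimal libpq sketch of such a polling loop (a hypothetical helper for illustration, not the actual pg_createsubscriber code):

#include <string.h>
#include <unistd.h>
#include <libpq-fe.h>

/* Hypothetical helper: poll until the server leaves recovery or
 * timeout_secs elapse, checking once per second. */
static int
wait_for_end_of_recovery(PGconn *conn, int timeout_secs)
{
    for (int elapsed = 0; elapsed < timeout_secs; elapsed++)
    {
        PGresult *res = PQexec(conn, "SELECT pg_catalog.pg_is_in_recovery()");

        if (PQresultStatus(res) != PGRES_TUPLES_OK)
        {
            PQclear(res);
            return -1;          /* query failed */
        }
        /* pg_is_in_recovery() flips to 'f' once the recovery target is reached */
        if (strcmp(PQgetvalue(res, 0, 0), "f") == 0)
        {
            PQclear(res);
            return 0;           /* recovery finished */
        }
        PQclear(res);
        sleep(1);
    }
    return 1;                   /* cf. "recovery timed out" below */
}

In the failed run, the poll apparently just keeps getting 't' back until the timeout expires.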
I've managed to reproduce the failure locally. With the bgwriter mod:
-#define LOG_SNAPSHOT_INTERVAL_MS 15000
+#define LOG_SNAPSHOT_INTERVAL_MS 150
and wal_debug=on, when running 5 test instances in parallel, I get the
failure with the following log:
recovery_target_lsn = '0/3009A20'
pg_createsubscriber: starting the subscriber
2024-07-11 14:40:04.551 UTC [205589:72][startup][33/0:0] LOG: REDO @ 0/30099E8; LSN 0/3009A20: prev 0/30099B0; xid 0;
len 24 - Standby/RUNNING_XACTS: nextXid 747 latestCompletedXid 746 oldestRunningXid 747
# ^^^ the last REDO record in the log
...
pg_createsubscriber: server was started
pg_createsubscriber: waiting for the target server to reach the consistent state
...
pg_createsubscriber: server was stopped
pg_createsubscriber: error: recovery timed out
...
[14:43:06.011](181.800s) not ok 30 - run pg_createsubscriber on node S
[14:43:06.012](0.000s)
[14:43:06.012](0.000s) # Failed test 'run pg_createsubscriber on node S'
# at t/040_pg_createsubscriber.pl line 356.
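A note on the bgwriter mod above: LOG_SNAPSHOT_INTERVAL_MS (defined in bgwriter.c) sets the minimum interval between standby snapshots logged by the background writer. A condensed paraphrase of that logic (not the verbatim source; last_snapshot_ts and last_snapshot_lsn persist across iterations of the main loop):

/* Inside the background writer's main loop (condensed paraphrase) */
TimestampTz now = GetCurrentTimestamp();

/*
 * Log a new standby snapshot only if the interval has elapsed and some
 * "important" WAL has been written since the previous snapshot.
 */
if (now >= TimestampTzPlusMilliseconds(last_snapshot_ts,
                                       LOG_SNAPSHOT_INTERVAL_MS) &&
    last_snapshot_lsn <= GetLastImportantRecPtr())
{
    /* Emits a Standby/RUNNING_XACTS record like the ones in the logs */
    last_snapshot_lsn = LogStandbySnapshot();
    last_snapshot_ts = now;
}

Cutting the interval from 15 s to 150 ms makes RUNNING_XACTS records much more frequent, which apparently increases the chance that such a record is the last thing the standby receives before recovery starts.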
$ pg_waldump -p src/bin/pg_basebackup_1/tmp_check/t_040_pg_createsubscriber_node_s_data/pgdata/pg_wal/ \
    000000010000000000000003 000000010000000000000003 | tail -2
rmgr: Standby len (rec/tot): 50/ 50, tx: 0, lsn: 0/030099B0, prev 0/03009948, desc: RUNNING_XACTS
nextXid 747 latestCompletedXid 746 oldestRunningXid 747
rmgr: Standby len (rec/tot): 50/ 50, tx: 0, lsn: 0/030099E8, prev 0/030099B0, desc: RUNNING_XACTS
nextXid 747 latestCompletedXid 746 oldestRunningXid 747
Whilst on the primary the WAL continues past that point:
$ pg_waldump -p src/bin/pg_basebackup_1/tmp_check/t_040_pg_createsubscriber_node_p_data/pgdata/pg_wal/ \
    000000010000000000000003 000000010000000000000003 | tail
rmgr: Standby len (rec/tot): 50/ 50, tx: 0, lsn: 0/030099B0, prev 0/03009948, desc: RUNNING_XACTS
nextXid 747 latestCompletedXid 746 oldestRunningXid 747
rmgr: Standby len (rec/tot): 50/ 50, tx: 0, lsn: 0/030099E8, prev 0/030099B0, desc: RUNNING_XACTS
nextXid 747 latestCompletedXid 746 oldestRunningXid 747
rmgr: Heap2 len (rec/tot): 60/ 60, tx: 747, lsn: 0/03009A20, prev 0/030099E8, desc: NEW_CID rel:
1663/16384/6104, tid: 0/1, cmin: 4294967295, cmax: 0, combo: 4294967295
rmgr: Heap len (rec/tot): 54/ 54, tx: 747, lsn: 0/03009A60, prev 0/03009A20, desc: DELETE xmax:
747, off: 1, infobits: [KEYS_UPDATED], flags: 0x00, blkref #0: rel 1663/16384/6104 blk 0
rmgr: Transaction len (rec/tot): 78/ 78, tx: 747, lsn: 0/03009A98, prev 0/03009A60, desc: INVALIDATION ;
inval msgs: catcache 49 catcache 46 relcache 0
rmgr: Transaction len (rec/tot): 98/ 98, tx: 747, lsn: 0/03009AE8, prev 0/03009A98, desc: COMMIT
2024-07-11 14:43:05.994561 UTC; relcache init file inval dbid 16384 tsid 1663; inval msgs: catcache 49 catcache 46
relcache 0
rmgr: Heap2 len (rec/tot): 60/ 60, tx: 748, lsn: 0/03009B50, prev 0/03009AE8, desc: NEW_CID rel:
1663/16385/6104, tid: 0/1, cmin: 4294967295, cmax: 0, combo: 4294967295
rmgr: Heap len (rec/tot): 54/ 54, tx: 748, lsn: 0/03009B90, prev 0/03009B50, desc: DELETE xmax:
748, off: 1, infobits: [KEYS_UPDATED], flags: 0x00, blkref #0: rel 1663/16385/6104 blk 0
rmgr: Transaction len (rec/tot): 78/ 78, tx: 748, lsn: 0/03009BC8, prev 0/03009B90, desc: INVALIDATION ;
inval msgs: catcache 49 catcache 46 relcache 0
rmgr: Transaction len (rec/tot): 98/ 98, tx: 748, lsn: 0/03009C18, prev 0/03009BC8, desc: COMMIT
2024-07-11 14:43:06.008619 UTC; relcache init file inval dbid 16385 tsid 1663; inval msgs: catcache 49 catcache 46
relcache 0
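That is, the recovery target 0/3009A20 equals the end LSN of the last record node S ever received (the RUNNING_XACTS record starting at 0/30099E8), while the next record, which starts at 0/3009A20, exists only on node P. If I understand the recovery code correctly, the stop condition for an inclusive recovery_target_lsn compares the target against the start LSN of each replayed record, roughly (a simplified paraphrase of recoveryStopsAfter() in xlogrecovery.c, not the verbatim source):

/* Simplified paraphrase; see recoveryStopsAfter() in xlogrecovery.c
 * for the real logic. Called after reading each WAL record. */
static bool
recovery_stops_after(XLogReaderState *record)
{
    /*
     * For an inclusive recovery_target_lsn, recovery stops only after
     * replaying a record whose start LSN is at or past the target.
     */
    if (recoveryTarget == RECOVERY_TARGET_LSN &&
        recoveryTargetInclusive &&
        record->ReadRecPtr >= recoveryTargetLSN)
        return true;
    return false;
}

If that reading is right, node S can never satisfy the target unless at least one more record starting at or beyond 0/3009A20 reaches it, so the standby just waits until pg_createsubscriber gives up.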
[1] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=olingo&dt=2024-07-11%2007%3A25%3A12
Best regards,
Alexander