From: | Alexander Lakhin <exclusion(at)gmail(dot)com> |
---|---|
To: | Alexander Korotkov <aekorotkov(at)gmail(dot)com>, Kevin Hale Boyes <kcboyes(at)gmail(dot)com> |
Cc: | Kartyshov Ivan <i(dot)kartyshov(at)postgrespro(dot)ru>, Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>, hlinnaka(at)iki(dot)fi, alvherre(at)alvh(dot)no-ip(dot)org, pashkin(dot)elfe(at)gmail(dot)com, bharath(dot)rupireddyforpostgres(at)gmail(dot)com, euler(at)eulerto(dot)com, thomas(dot)munro(at)gmail(dot)com, peter(at)eisentraut(dot)org, amit(dot)kapila16(at)gmail(dot)com, dilipbalaut(at)gmail(dot)com, smithpb2250(at)gmail(dot)com, pgsql-hackers(at)lists(dot)postgresql(dot)org |
Subject: | Re: [HACKERS] make async slave to wait for lsn to be replayed |
Date: | 2024-09-13 12:00:00 |
Message-ID: | 1d7b08f2-64a2-77fb-c666-c9a74c68eeda@gmail.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-hackers |
Hi Alexander,
10.08.2024 20:18, Alexander Korotkov wrote:
> On Sat, Aug 10, 2024 at 7:33 PM Alexander Korotkov <aekorotkov(at)gmail(dot)com> wrote:
>> On Tue, Aug 6, 2024 at 5:17 AM Alexander Korotkov <aekorotkov(at)gmail(dot)com> wrote:
>> ...
>> Here is a revised version of the patchset. I've fixed some typos,
>> identation, etc. I'm going to push this once it passes cfbot.
> The next revison of the patchset fixes uninitialized variable usage
> spotted by cfbot.
When running check-world on a rather slow armv7 device, I came across the
043_wal_replay_wait.pl test failure:
t/043_wal_replay_wait.pl .............. 7/? # Tests were run but no plan was declared and done_testing() was not seen.
# Looks like your test exited with 29 just after 8.
regress_log_043_wal_replay_wait contains:
...
01234[21:58:56.370](1.594s) ok 7 - multiple LSN waiters reported consistent data
### Promoting node "standby"
# Running: pg_ctl -D .../src/test/recovery/tmp_check/t_043_wal_replay_wait_standby_data/pgdata -l
.../src/test/recovery/tmp_check/log/043_wal_replay_wait_standby.log promote
waiting for server to promote.... done
server promoted
[21:58:56.637](0.268s) ok 8 - got error after standby promote
error running SQL: 'psql:<stdin>:1: ERROR: recovery is not in progress
HINT: Waiting for LSN can only be executed during recovery.'
while running 'psql -XAtq -d port=10228 host=/tmp/Ftj8qpTQht dbname='postgres' -f - -v ON_ERROR_STOP=1' with sql 'CALL
pg_wal_replay_wait('0/300D0E8');' at .../src/test/recovery/../../../src/test/perl/PostgreSQL/Test/Cluster.pm line 2140.
043_wal_replay_wait_standby.log contains:
2024-09-12 21:58:56.518 UTC [15220:1] [unknown] LOG: connection received: host=[local]
2024-09-12 21:58:56.520 UTC [15220:2] [unknown] LOG: connection authenticated: user="android" method=trust
(.../src/test/recovery/tmp_check/t_043_wal_replay_wait_standby_data/pgdata/pg_hba.conf:117)
2024-09-12 21:58:56.520 UTC [15220:3] [unknown] LOG: connection authorized: user=android database=postgres
application_name=043_wal_replay_wait.pl
2024-09-12 21:58:56.527 UTC [15220:4] 043_wal_replay_wait.pl LOG: statement: CALL pg_wal_replay_wait('2/570CB4E8');
2024-09-12 21:58:56.535 UTC [15123:7] LOG: received promote request
2024-09-12 21:58:56.535 UTC [15124:2] FATAL: terminating walreceiver process due to administrator command
2024-09-12 21:58:56.537 UTC [15123:8] LOG: invalid record length at 0/300D0B0: expected at least 24, got 0
2024-09-12 21:58:56.537 UTC [15123:9] LOG: redo done at 0/300D088 system usage: CPU: user: 0.01 s, system: 0.00 s,
elapsed: 14.23 s
2024-09-12 21:58:56.537 UTC [15123:10] LOG: last completed transaction was at log time 2024-09-12 21:58:55.322831+00
2024-09-12 21:58:56.540 UTC [15123:11] LOG: selected new timeline ID: 2
2024-09-12 21:58:56.589 UTC [15123:12] LOG: archive recovery complete
2024-09-12 21:58:56.590 UTC [15220:5] 043_wal_replay_wait.pl ERROR: recovery is not in progress
2024-09-12 21:58:56.590 UTC [15220:6] 043_wal_replay_wait.pl DETAIL: Recovery ended before replaying target LSN
2/570CB4E8; last replay LSN 0/300D0B0.
2024-09-12 21:58:56.591 UTC [15121:1] LOG: checkpoint starting: force
2024-09-12 21:58:56.592 UTC [15220:7] 043_wal_replay_wait.pl LOG: disconnection: session time: 0:00:00.075 user=android
database=postgres host=[local]
2024-09-12 21:58:56.595 UTC [15120:4] LOG: database system is ready to accept connections
2024-09-12 21:58:56.665 UTC [15227:1] [unknown] LOG: connection received: host=[local]
2024-09-12 21:58:56.668 UTC [15227:2] [unknown] LOG: connection authenticated: user="android" method=trust
(.../src/test/recovery/tmp_check/t_043_wal_replay_wait_standby_data/pgdata/pg_hba.conf:117)
2024-09-12 21:58:56.668 UTC [15227:3] [unknown] LOG: connection authorized: user=android database=postgres
application_name=043_wal_replay_wait.pl
2024-09-12 21:58:56.675 UTC [15227:4] 043_wal_replay_wait.pl LOG: statement: CALL pg_wal_replay_wait('0/300D0E8');
2024-09-12 21:58:56.677 UTC [15227:5] 043_wal_replay_wait.pl ERROR: recovery is not in progress
2024-09-12 21:58:56.677 UTC [15227:6] 043_wal_replay_wait.pl HINT: Waiting for LSN can only be executed during recovery.
2024-09-12 21:58:56.679 UTC [15227:7] 043_wal_replay_wait.pl LOG: disconnection: session time: 0:00:00.015 user=android
database=postgres host=[local]
Note that last replay LSN is 300D0B0, but the latter pg_wal_replay_wait
call wants to wait for 300D0E8.
pg_waldump -p src/test/recovery/tmp_check/t_043_wal_replay_wait_primary_data/pgdata/pg_wal/ 000000010000000000000003
shows:
rmgr: Heap len (rec/tot): 59/ 59, tx: 748, lsn: 0/0300D048, prev 0/0300D020, desc: INSERT off: 35,
flags: 0x00, blkref #0: rel 1663/5/16384 blk 0
rmgr: Transaction len (rec/tot): 34/ 34, tx: 748, lsn: 0/0300D088, prev 0/0300D048, desc: COMMIT
2024-09-12 21:58:55.322831 UTC
rmgr: Standby len (rec/tot): 50/ 50, tx: 0, lsn: 0/0300D0B0, prev 0/0300D088, desc: RUNNING_XACTS
nextXid 749 latestCompletedXid 748 oldestRunningXid 749
I could reproduce this failure on my workstation with bgworker modified
as below:
--- a/src/backend/postmaster/bgwriter.c
+++ b/src/backend/postmaster/bgwriter.c
@@ -69 +69 @@ int BgWriterDelay = 200;
-#define LOG_SNAPSHOT_INTERVAL_MS 15000
+#define LOG_SNAPSHOT_INTERVAL_MS 15
@@ -307 +307 @@ BackgroundWriterMain(char *startup_data, size_t startup_data_len)
- BgWriterDelay /* ms */ , WAIT_EVENT_BGWRITER_MAIN);
+ 1 /* ms */ , WAIT_EVENT_BGWRITER_MAIN);
When looking at the test, I noticed probably a typo in the test message:
wait for already replayed LSN exists immediately ...
shouldn't it be "exits" there (though maybe the whole phrase could be
improved)?
I also suspect that "achieve" is not suitable word in the context of LSNs
and timeouts. Maybe you would find it appropriate to replace it with
"reach"?
Best regards,
Alexander
From | Date | Subject | |
---|---|---|---|
Next Message | vignesh C | 2024-09-13 12:20:21 | Re: Conflict Detection and Resolution |
Previous Message | vignesh C | 2024-09-13 11:56:52 | Re: Pgoutput not capturing the generated columns |