From: | Alexander Lakhin <exclusion(at)gmail(dot)com> |
---|---|
To: | pgsql-hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | 001_rep_changes.pl fails due to publisher stuck on shutdown |
Date: | 2024-05-29 11:00:00 |
Message-ID: | f15d665f-4cd1-4894-037c-afdbe369287e@gmail.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-hackers |
Hello hackers,
As a recent buildfarm test failure [1] shows:
[14:33:02.374](0.333s) ok 23 - update works with dropped subscriber column
### Stopping node "publisher" using mode fast
# Running: pg_ctl -D
/home/bf/bf-build/adder/HEAD/pgsql.build/testrun/subscription/001_rep_changes/data/t_001_rep_changes_publisher_data/pgdata
-m fast stop
waiting for server to shut down.. ... ... ... .. failed
pg_ctl: server does not shut down
# pg_ctl stop failed: 256
# Postmaster PID for node "publisher" is 2222549
[14:39:04.375](362.001s) Bail out! pg_ctl stop failed
001_rep_changes_publisher.log
2024-05-16 14:33:02.907 UTC [2238704][client backend][4/22:0] LOG: statement: DELETE FROM tab_rep
2024-05-16 14:33:02.925 UTC [2238704][client backend][:0] LOG: disconnection: session time: 0:00:00.078 user=bf
database=postgres host=[local]
2024-05-16 14:33:02.939 UTC [2222549][postmaster][:0] LOG: received fast shutdown request
2024-05-16 14:33:03.000 UTC [2222549][postmaster][:0] LOG: aborting any active transactions
2024-05-16 14:33:03.049 UTC [2222549][postmaster][:0] LOG: background worker "logical replication launcher" (PID
2223110) exited with exit code 1
2024-05-16 14:33:03.062 UTC [2222901][checkpointer][:0] LOG: shutting down
2024-05-16 14:39:04.377 UTC [2222549][postmaster][:0] LOG: received immediate shutdown request
2024-05-16 14:39:04.382 UTC [2222549][postmaster][:0] LOG: database system is shut down
the publisher node may hang on stopping.
I reproduced the failure (with aggressive autovacuum) locally and
discovered that it happens because:
1) checkpointer calls WalSndInitStopping() (which sends
PROCSIG_WALSND_INIT_STOPPING to walsender), and then spins inside
WalSndWaitStopping() indefinitely, because:
2) walsender receives the signal, sets got_STOPPING = true, but can't exit
WalSndLoop():
3) it never sets got_SIGUSR2 (to get into WalSndDone()) in
XLogSendLogical():
4) it never sets WalSndCaughtUp to true:
5) logical_decoding_ctx->reader->EndRecPtr can't reach flushPtr in
XLogSendLogical():
6) EndRecPtr doesn't advance in XLogNextRecord():
7) XLogDecodeNextRecord() fails do decode a record that crosses a page
boundary:
8) ReadPageInternal() (commented "Wait for the next page to become
available") constantly returns XLREAD_FAIL:
9) state->routine.page_read()/logical_read_xlog_page() constantly returns
-1:
10) flushptr = WalSndWaitForWal() stops advancing, because
got_STOPPING == true (see 2).
That is, walsender doesn't let itself to catch up, if it gets the stop
signal when it's lagging behind and decoding a record requires reading
the next wal page.
Please look at the reproducing test (based on 001_rep_changes.pl) attached.
If fails for me as below:
# 17
Bailout called. Further testing stopped: pg_ctl stop failed
FAILED--Further testing stopped: pg_ctl stop failed
make: *** [Makefile:21: check] Ошибка 255
[1] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=adder&dt=2024-05-16%2014%3A22%3A38
[2] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=dikkop&dt=2024-04-24%2014%3A38%3A35 (apparently the same)
Best regards,
Alexander
Attachment | Content-Type | Size |
---|---|---|
099_walsender_stop.pl | application/x-perl | 1.7 KB |
From | Date | Subject | |
---|---|---|---|
Next Message | Maiquel Grassi | 2024-05-29 12:37:21 | RE: Psql meta-command conninfo+ |
Previous Message | ISHAN CHHANGANI . | 2024-05-29 10:45:09 | Timeout gets unset on a syntax error. |