walreceiver termination

From: Justin King <kingpin867(at)gmail(dot)com>
To: pgsql-general(at)lists(dot)postgresql(dot)org
Subject: walreceiver termination
Date: 2020-04-23 17:07:12
Message-ID: CAE39h21Qbw65hf48BEAMjLriNA4JcrFvKJi0srDRNR6YMZXfXw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general

We've seen unexpected termination of the WAL receiver process. This
stops streaming replication, but the replica stays available --
restarting the server resumes streaming replication where it left off.
We've seen this across nearly every recent version of PG, (9.4, 9.5,
11.x, 12.x) -- anything omitted is one we haven't used.

I don't have an explanation for the cause, but I was able to set
logging to "debug5" and run an strace of the walrecevier PID when it
eventually happened. It appears as if the SIGTERM is coming from the
"postgres: startup" process. The only notable thing that I've seen
discussion of (in the past) and where it might be relevant here is
that these replicas all reside on a ZFS filesystem (on Linux). In
fact, everywhere we've seen issues, the database sits on a ZFS pool.
We also have another PG cluster running on non-ZFS (ext4) and have
never experienced this problem.

Any suggestions?

Logging snippets are below:

postgresql.log
Apr 23 09:09:13 royce postgres[2229]: [43141904-1] 2020-04-23
09:09:13.663 GMT [2229] DEBUG: record known xact 8677704
latestObservedXid 8677704
Apr 23 09:09:13 royce postgres[2229]: [43141904-2] 2020-04-23
09:09:13.663 GMT [2229] CONTEXT: WAL redo at 2D6/C259A6E0 for
Btree/INSERT_LEAF: off 88
Apr 23 09:09:13 royce postgres[2229]: [43141905-1] 2020-04-23
09:09:13.663 GMT [2229] DEBUG: record known xact 8677704
latestObservedXid 8677704
Apr 23 09:09:13 royce postgres[2229]: [43141905-2] 2020-04-23
09:09:13.663 GMT [2229] CONTEXT: WAL redo at 2D6/C259A720 for
Heap/HEAP_CONFIRM: off 6
Apr 23 09:09:13 royce postgres[2229]: [43141906-1] 2020-04-23
09:09:13.663 GMT [2229] DEBUG: record known xact 8677704
latestObservedXid 8677704
Apr 23 09:09:13 royce postgres[2229]: [43141906-2] 2020-04-23
09:09:13.663 GMT [2229] CONTEXT: WAL redo at 2D6/C259A750 for
Heap/INSERT: off 7 flags 0x0C
Apr 23 09:09:13 royce postgres[2229]: [43141907-1] 2020-04-23
09:09:13.663 GMT [2229] LOG: incorrect resource manager data checksum
in record at 2D6/C259AB90
Apr 23 09:09:13 royce postgres[2299]: [1448555-1] 2020-04-23
09:09:13.663 GMT [2299] FATAL: terminating walreceiver process due to
administrator command
Apr 23 09:09:13 royce postgres[2299]: [1448556-1] 2020-04-23
09:09:13.663 GMT [2299] DEBUG: shmem_exit(1): 1 before_shmem_exit
callbacks to make
Apr 23 09:09:13 royce postgres[2299]: [1448557-1] 2020-04-23
09:09:13.663 GMT [2299] DEBUG: shmem_exit(1): 6 on_shmem_exit
callbacks to make
Apr 23 09:09:13 royce postgres[2229]: [43141908-1] 2020-04-23
09:09:13.664 GMT [2229] DEBUG: switched WAL source from stream to
archive after failure
Apr 23 09:09:13 royce postgres[2229]: [43141909-1] 2020-04-23
09:09:13.664 GMT [2229] LOG: incorrect resource manager data checksum
in record at 2D6/C259AB90
Apr 23 09:09:13 royce postgres[2299]: [1448558-1] 2020-04-23
09:09:13.664 GMT [2299] DEBUG: proc_exit(1): 2 callbacks to make
Apr 23 09:09:13 royce postgres[2229]: [43141910-1] 2020-04-23
09:09:13.664 GMT [2229] DEBUG: switched WAL source from archive to
stream after failure
Apr 23 09:09:13 royce postgres[2229]: [43141911-1] 2020-04-23
09:09:13.664 GMT [2229] LOG: incorrect resource manager data checksum
in record at 2D6/C259AB90
Apr 23 09:09:13 royce postgres[2299]: [1448559-1] 2020-04-23
09:09:13.664 GMT [2299] DEBUG: exit(1)
Apr 23 09:09:13 royce postgres[2229]: [43141912-1] 2020-04-23
09:09:13.664 GMT [2229] DEBUG: switched WAL source from stream to
archive after failure
Apr 23 09:09:13 royce postgres[2229]: [43141913-1] 2020-04-23
09:09:13.664 GMT [2229] DEBUG: incorrect resource manager data
checksum in record at 2D6/C259AB90
Apr 23 09:09:13 royce postgres[2229]: [43141914-1] 2020-04-23
09:09:13.664 GMT [2229] DEBUG: switched WAL source from archive to
stream after failure
Apr 23 09:09:13 royce postgres[2299]: [1448560-1] 2020-04-23
09:09:13.664 GMT [2299] DEBUG: shmem_exit(-1): 0 before_shmem_exit
callbacks to make
Apr 23 09:09:13 royce postgres[2229]: [43141915-1] 2020-04-23
09:09:13.664 GMT [2229] LOG: incorrect resource manager data checksum
in record at 2D

process id list:
2224 ? Ss 0:24 /usr/lib/postgresql/12/bin/postgres -D
/var/lib/postgresql/12/main -c
unix_socket_directories=/var/run/postgresql -c
config_file=/etc/postgresql/12/main/postgresql.conf -c
hba_file=/etc/postgresql/12/main/pg_hba.conf -c
ident_file=/etc/postgresql/12/main/pg_ident.conf -c
external_pid_file=/var/run/postgresql/12-main.pid
2229 ? Ss 7:02 postgres: startup
2230 ? Ss 6:09 postgres: checkpointer
2231 ? Ss 0:08 postgres: background writer
2298 ? Ss 1:29 postgres: stats collector
2299 ? Ss 6:11 postgres: walreceiver

strace of walreceiver:
kill(2229, SIGUSR1) = 0
getpid() = 2299
sendto(3<socket:[208548671]>, "<135>Apr 23 09:09:13 postgres[2299]:
[1448554-1] 2020-04-23 09:09:13.663 GMT [2299] DEBUG: sending write
2D6/C25A0000 flush 2D6/C25A0000 apply 2D6/C25915F0", 156,
MSG_NOSIGNAL, NULL, 0) = 156
sendto(4<socket:[208548975]>,
"\x64\x00\x00\x00\x26\x72\x00\x00\x02\xd6\xc2\x5a\x00\x00\x00\x00\x02\xd6\xc2\x5a\x00\x00\x00\x00\x02\xd6\xc2\x59\x15\xf0\x00\x02\x46\xf0\x5f\x04\xbe\x9e\x00",
39, MSG_NOSIGNAL, NULL, 0) = 39
epoll_create1(EPOLL_CLOEXEC) = 6<anon_inode:[eventpoll]>
epoll_ctl(6<anon_inode:[eventpoll]>, EPOLL_CTL_ADD,
10<pipe:[208548970]>, {EPOLLIN|EPOLLERR|EPOLLHUP, {u32=347640776,
u64=94863290307528}}) = 0
epoll_ctl(6<anon_inode:[eventpoll]>, EPOLL_CTL_ADD,
7<pipe:[208548681]>, {EPOLLIN|EPOLLERR|EPOLLHUP, {u32=347640800,
u64=94863290307552}}) = 0
epoll_ctl(6<anon_inode:[eventpoll]>, EPOLL_CTL_ADD,
4<socket:[208548975]>, {EPOLLIN|EPOLLERR|EPOLLHUP, {u32=347640824,
u64=94863290307576}}) = 0
epoll_wait(6<anon_inode:[eventpoll]>, 0x564714b89410, 1, 100) = -1
EINTR (Interrupted system call)
--- SIGTERM {si_signo=SIGTERM, si_code=SI_USER, si_pid=2229, si_uid=112} ---
write(11<pipe:[208548970]>, "\x00", 1) = 1
rt_sigreturn({mask=[]}) = -1 EINTR (Interrupted system call)
close(6<anon_inode:[eventpoll]>) = 0
getpid() = 2299
sendto(3<socket:[208548671]>, "<131>Apr 23 09:09:13 postgres[2299]:
[1448555-1] 2020-04-23 09:09:13.663 GMT [2299] FATAL: terminating
walreceiver process due to administrator command", 152, MSG_NOSIGNAL,
NULL, 0) = 152
getpid() = 2299
sendto(3<socket:[208548671]>, "<135>Apr 23 09:09:13 postgres[2299]:
[1448556-1] 2020-04-23 09:09:13.663 GMT [2299] DEBUG: shmem_exit(1):
1 before_shmem_exit callbacks to make", 144, MSG_NOSIGNAL, NULL, 0) =
144
getpid() = 2299
sendto(3<socket:[208548671]>, "<135>Apr 23 09:09:13 postgres[2299]:
[1448557-1] 2020-04-23 09:09:13.663 GMT [2299] DEBUG: shmem_exit(1):
6 on_shmem_exit callbacks to make", 140, MSG_NOSIGNAL, NULL, 0) = 140
sendto(4<socket:[208548975]>, "\x58\x00\x00\x00\x04", 5, MSG_NOSIGNAL,
NULL, 0) = 5

Any suggestions?

Thanks-
Justin

Responses

Browse pgsql-general by date

  From Date Subject
Next Message David G. Johnston 2020-04-23 17:08:00 Re: relationship of backend_start, query_start, state_change
Previous Message David G. Johnston 2020-04-23 17:00:59 Re: relationship of backend_start, query_start, state_change