WAL Receiver Segmentation Fault

From:	Phil Sorber <phil(at)omniti(dot)com>
To:	pgsql-bugs(at)postgresql(dot)org
Subject:	WAL Receiver Segmentation Fault
Date:	2012-12-28 18:55:36
Message-ID:	CADAkt-iewgG3TBbt4oF5eA6+1H7mEc5iSvo9cvB4Aa34fbPNBQ@mail.gmail.com
Lists:	pgsql-bugs

Postgres 9.0.11 running as a hot standby.

The master was restarted and the standby went into a segmentation
fault loop. A hard stop/start fixed it. Here are the pertinent logs,
with excess and identifying information removed:

2012-12-28 03:39:14 UTC [16850]: [2-1] FATAL: replication terminated
by primary server
zcat: /mnt/dbmount/walarchive/0000000300001A01000000D5.gz: No such
file or directory
2012-12-28 03:39:14 UTC [16801]: [21-1] LOG: record with zero length
at 1A01/D5000078
zcat: /mnt/dbmount/walarchive/0000000300001A01000000D5.gz: No such
file or directory
2012-12-28 03:39:14 UTC [16798]: [2-1] LOG: WAL receiver process
(PID 16671) was terminated by signal 11: Segmentation fault
2012-12-28 03:39:14 UTC [16798]: [3-1] LOG: terminating any other
active server processes
2012-12-28 03:39:15 UTC [16798]: [4-1] LOG: all server processes
terminated; reinitializing
2012-12-28 03:39:15 UTC [16673]: [1-1] LOG: database system was
interrupted while in recovery at log time 2012-12-28 03:35:47 UTC
2012-12-28 03:39:15 UTC [16673]: [2-1] HINT: If this has occurred
more than once some data might be corrupted and you might need to
choose an earlier recovery target.
zcat: /mnt/dbmount/walarchive/00000004.history.gz: No such file or directory
zcat: /mnt/dbmount/walarchive/00000003.history.gz: No such file or directory
2012-12-28 03:39:16 UTC [16673]: [3-1] LOG: entering standby mode
zcat: /mnt/dbmount/walarchive/0000000300001A0100000092.gz: No such
file or directory
zcat: /mnt/dbmount/walarchive/0000000300001A010000007D.gz: No such
file or directory
2012-12-28 03:39:16 UTC [16673]: [4-1] LOG: redo starts at 1A01/7D00C500
zcat: /mnt/dbmount/walarchive/0000000300001A010000007E.gz: No such
file or directory
zcat: /mnt/dbmount/walarchive/0000000300001A010000007F.gz: No such
file or directory
...
zcat: /mnt/dbmount/walarchive/0000000300001A01000000C0.gz: No such
file or directory
zcat: /mnt/dbmount/walarchive/0000000300001A01000000C1.gz: No such
file or directory
2012-12-28 03:39:24 UTC [16681]: [1-1] LOG: restartpoint starting: xlog
zcat: /mnt/dbmount/walarchive/0000000300001A01000000C2.gz: No such
file or directory
zcat: /mnt/dbmount/walarchive/0000000300001A01000000C3.gz: No such
file or directory
...
zcat: /mnt/dbmount/walarchive/0000000300001A01000000D3.gz: No such
file or directory
zcat: /mnt/dbmount/walarchive/0000000300001A01000000D4.gz: No such
file or directory
2012-12-28 03:39:28 UTC [16673]: [5-1] LOG: consistent recovery
state reached at 1A01/D430F1A0
2012-12-28 03:39:28 UTC [16798]: [5-1] LOG: database system is ready
to accept read only connections
zcat: /mnt/dbmount/walarchive/0000000300001A01000000D5.gz: No such
file or directory
2012-12-28 03:39:28 UTC [16673]: [6-1] LOG: record with zero length
at 1A01/D5000078
zcat: /mnt/dbmount/walarchive/0000000300001A01000000D5.gz: No such
file or directory
2012-12-28 03:39:28 UTC [16798]: [6-1] LOG: WAL receiver process
(PID 16870) was terminated by signal 11: Segmentation fault
2012-12-28 03:39:28 UTC [16798]: [7-1] LOG: terminating any other
active server processes
2012-12-28 03:39:28 UTC [16798]: [8-1] LOG: all server processes
terminated; reinitializing
2012-12-28 03:39:30 UTC [16871]: [1-1] LOG: database system was
interrupted while in recovery at log time 2012-12-28 03:35:47 UTC
2012-12-28 03:39:30 UTC [16871]: [2-1] HINT: If this has occurred
more than once some data might be corrupted and you might need to
choose an earlier recovery target.
zcat: /mnt/dbmount/walarchive/00000004.history.gz: No such file or directory
zcat: /mnt/dbmount/walarchive/00000003.history.gz: No such file or directory
2012-12-28 03:39:30 UTC [16871]: [3-1] LOG: entering standby mode
zcat: /mnt/dbmount/walarchive/0000000300001A0100000092.gz: No such
file or directory
zcat: /mnt/dbmount/walarchive/0000000300001A010000007D.gz: No such
file or directory
2012-12-28 03:39:30 UTC [16871]: [4-1] LOG: redo starts at 1A01/7D00C500
zcat: /mnt/dbmount/walarchive/0000000300001A010000007E.gz: No such
file or directory
zcat: /mnt/dbmount/walarchive/0000000300001A010000007F.gz: No such
file or directory
...
zcat: /mnt/dbmount/walarchive/0000000300001A01000000C0.gz: No such
file or directory
zcat: /mnt/dbmount/walarchive/0000000300001A01000000C1.gz: No such
file or directory
2012-12-28 03:39:38 UTC [16883]: [1-1] LOG: restartpoint starting: xlog
zcat: /mnt/dbmount/walarchive/0000000300001A01000000C2.gz: No such
file or directory
zcat: /mnt/dbmount/walarchive/0000000300001A01000000C3.gz: No such
file or directory
...
zcat: /mnt/dbmount/walarchive/0000000300001A01000000D3.gz: No such
file or directory
zcat: /mnt/dbmount/walarchive/0000000300001A01000000D4.gz: No such
file or directory
2012-12-28 03:39:41 UTC [16871]: [5-1] LOG: consistent recovery
state reached at 1A01/D430F1A0
2012-12-28 03:39:41 UTC [16798]: [9-1] LOG: database system is ready
to accept read only connections
zcat: /mnt/dbmount/walarchive/0000000300001A01000000D5.gz: No such
file or directory
2012-12-28 03:39:41 UTC [16871]: [6-1] LOG: record with zero length
at 1A01/D5000078
zcat: /mnt/dbmount/walarchive/0000000300001A01000000D5.gz: No such
file or directory
2012-12-28 03:39:41 UTC [16798]: [10-1] LOG: WAL receiver process
(PID 17144) was terminated by signal 11: Segmentation fault
2012-12-28 03:39:41 UTC [16798]: [11-1] LOG: terminating any other
active server processes
2012-12-28 03:39:42 UTC [16798]: [12-1] LOG: all server processes
terminated; reinitializing
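
To make the loop above easier to quantify, here is a minimal sketch that pulls the WAL receiver crash reports out of a log like this one. It is not part of the original report: the function name `wal_receiver_crashes` and the regular expression are illustrative assumptions, written against the log line prefix shown in this message (timestamp, PID in brackets, line counter) and tolerant of the line wrapping introduced by the mail client.

```python
import re

# Matches the postmaster's crash report even when the mail client wrapped
# the line between "process" and "(PID ...)". The prefix layout is assumed
# to be the one shown in the logs of this message.
CRASH_RE = re.compile(
    r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) UTC \[\d+\]: \[[\d-]+\] LOG:\s+"
    r"WAL receiver process\s+\(PID (\d+)\)\s+was terminated by signal (\d+)",
    re.MULTILINE,
)


def wal_receiver_crashes(log_text):
    """Return a (timestamp, pid, signal) tuple for each WAL receiver crash."""
    return [(ts, int(pid), int(sig)) for ts, pid, sig in CRASH_RE.findall(log_text)]


# The three crash reports quoted above, copied verbatim.
SAMPLE = """\
2012-12-28 03:39:14 UTC [16798]: [2-1] LOG: WAL receiver process
(PID 16671) was terminated by signal 11: Segmentation fault
2012-12-28 03:39:28 UTC [16798]: [6-1] LOG: WAL receiver process
(PID 16870) was terminated by signal 11: Segmentation fault
2012-12-28 03:39:41 UTC [16798]: [10-1] LOG: WAL receiver process
(PID 17144) was terminated by signal 11: Segmentation fault
"""

for ts, pid, sig in wal_receiver_crashes(SAMPLE):
    print(ts, pid, sig)
```

Run against the excerpt, this lists three crashes roughly 13 to 14 seconds apart, each with a different WAL receiver PID under the same postmaster (PID 16798).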

It basically kept doing that over and over until I stopped and started it:

2012-12-28 03:58:22 UTC [16798]: [161-1] LOG: received fast shutdown request
2012-12-28 03:58:22 UTC [983]: [1-1] LOG: shutting down
2012-12-28 03:58:22 UTC [983]: [2-1] LOG: database system is shut down
2012-12-28 03:58:48 UTC [1219]: [1-1] LOG: database system was shut
down in recovery at 2012-12-28 03:58:22 UTC
zcat: /mnt/dbmount/walarchive/00000004.history.gz: No such file or directory
zcat: /mnt/dbmount/walarchive/00000003.history.gz: No such file or directory
2012-12-28 03:58:48 UTC [1219]: [2-1] LOG: entering standby mode
2012-12-28 03:58:48 UTC [1219]: [3-1] LOG: restored log file
"0000000300001A01000000C1" from archive
2012-12-28 03:58:48 UTC [1219]: [4-1] LOG: restored log file
"0000000300001A01000000AF" from archive
2012-12-28 03:58:48 UTC [1219]: [5-1] LOG: redo starts at 1A01/AF010A98
2012-12-28 03:58:48 UTC [1219]: [6-1] LOG: restored log file
"0000000300001A01000000B0" from archive
2012-12-28 03:58:48 UTC [1219]: [7-1] LOG: restored log file
"0000000300001A01000000B1" from archive
...
2012-12-28 03:59:10 UTC [1219]: [50-1] LOG: restored log file
"0000000300001A01000000DC" from archive
2012-12-28 03:59:10 UTC [1219]: [51-1] LOG: restored log file
"0000000300001A01000000DD" from archive
2012-12-28 03:59:10 UTC [1219]: [52-1] LOG: consistent recovery
state reached at 1A01/DDED8528
2012-12-28 03:59:10 UTC [1215]: [1-1] LOG: database system is ready
to accept read only connections
2012-12-28 03:59:10 UTC [1219]: [53-1] LOG: restored log file
"0000000300001A01000000DE" from archive
zcat: /mnt/dbmount/walarchive/0000000300001A01000000DF.gz: No such
file or directory
2012-12-28 03:59:10 UTC [1219]: [54-1] LOG: unexpected pageaddr
1A00/F4000000 in log file 6657, segment 223, offset 0
zcat: /mnt/dbmount/walarchive/0000000300001A01000000DF.gz: No such
file or directory
2012-12-28 03:59:10 UTC [1700]: [1-1] LOG: streaming replication
successfully connected to primary
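
For readers matching the log messages to the archive file names, here is a small sketch of the arithmetic, not part of the original report. It assumes the pre-9.3 WAL file naming scheme (timeline, log id, and segment number, each as 8 hex digits), the default 16 MB segment size, and timeline 3 as read off the file names above. The helper names `wal_segment_name` and `segment_for_location` are hypothetical.

```python
def wal_segment_name(timeline, log_id, seg_no):
    """Build the 24-hex-digit WAL segment file name (pre-9.3 layout)."""
    return "%08X%08X%08X" % (timeline, log_id, seg_no)


def segment_for_location(timeline, location, seg_size=16 * 1024 * 1024):
    """Map a WAL location such as '1A01/D5000078' to its segment file name.

    Assumes the default 16 MB segment size.
    """
    log_hex, offset_hex = location.split("/")
    return wal_segment_name(timeline, int(log_hex, 16), int(offset_hex, 16) // seg_size)


# "log file 6657, segment 223" from the "unexpected pageaddr" message
print(wal_segment_name(3, 6657, 223))            # 0000000300001A01000000DF

# "record with zero length at 1A01/D5000078" from the crash loop
print(segment_for_location(3, "1A01/D5000078"))  # 0000000300001A01000000D5
```

Under those assumptions, both results match the files zcat reported as missing right around the corresponding messages.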

I'll note that /mnt/dbmount is on NFS. That might be related to the
problem, but I did nothing to NFS at any point to fix this. The standby
also never attempted to connect to the primary when it couldn't find
the archive.

If there is any more info I can provide, let me know. This is a
production DB so I won't be able to do any disruptive testing. Based
on what I have seen so far, I think this would be difficult to
replicate anyway.

I did a search and this was the only related thing I could find:

http://archives.postgresql.org/pgsql-bugs/2010-04/msg00080.php

Responses

Re: WAL Receiver Segmentation Fault at 2012-12-28 22:30:41 from Heikki Linnakangas
