BUG #14189: Corrupted WAL through replication protocol

From: root(at)simply(dot)name
To: pgsql-bugs(at)postgresql(dot)org
Subject: BUG #14189: Corrupted WAL through replication protocol
Date: 2016-06-14 10:34:15
Message-ID: 20160614103415.5796.6885@wrigleys.postgresql.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-bugs

The following bug has been logged on the website:

Bug reference: 14189
Logged by: Vladimir Borodin
Email address: root(at)simply(dot)name
PostgreSQL version: 9.5.2
Operating system: RHEL 6
Description:

Hi all.

One of our streaming physical standbys got a bad WAL through replication
protocol (we use replication slots) after network flap between primary and
this standby. Another streaming standby and archive got a valid WAL. Below
are some details.

Log on primary:
[ 2016-06-14 11:04:04.623 MSK
xdb2018f_mail_yandex_net,repl,[unknown],25201,00000 ]:LOG: terminating
walsender process due to replication timeout

Log on standby:
[ 2016-06-14 11:04:04.625 MSK ,,,19103,XX000 ]:FATAL: could not receive
data from WAL stream: server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.

[ 2016-06-14 11:04:04.625 MSK ,,,32631,XX000 ]:FATAL: invalid memory alloc
request size 2684370944
[ 2016-06-14 11:04:04.628 MSK ,,,32624,00000 ]:LOG: startup process (PID
32631) exited with exit code 1
[ 2016-06-14 11:04:04.628 MSK ,,,32624,00000 ]:LOG: terminating any other
active server processes

All other attempts to start/restart fail with the same error message because
of corrupted WAL on standby's filesystem.

[ 2016-06-14 11:04:04.672 MSK [unknown],web,maildb,9491,57P03 ]:FATAL: the
database system is in recovery mode
[ 2016-06-14 11:05:47.046 MSK ,,,12869,00000 ]:LOG: database system was
interrupted while in recovery at log time 2016-06-14 10:31:25 MSK
[ 2016-06-14 11:05:47.046 MSK ,,,12869,00000 ]:HINT: If this has occurred
more than once some data might be corrupted and you might need to choose an
earlier recovery target.
[ 2016-06-14 11:05:50.229 MSK ,,,12869,00000 ]:LOG: entering standby mode
[ 2016-06-14 11:05:50.355 MSK ,,,12869,00000 ]:LOG: redo starts at
24DA/8F0097D0
[ 2016-06-14 11:06:13.375 MSK ,,,12869,00000 ]:LOG: consistent recovery
state reached at 24DB/71EA04C8
[ 2016-06-14 11:06:13.376 MSK ,,,12860,00000 ]:LOG: database system is
ready to accept read only connections
[ 2016-06-14 11:06:13.389 MSK ,,,12869,XX000 ]:FATAL: invalid memory alloc
request size 2684370944
[ 2016-06-14 11:06:13.391 MSK ,,,12860,00000 ]:LOG: startup process (PID
12869) exited with exit code 1
[ 2016-06-14 11:06:13.392 MSK ,,,12860,00000 ]:LOG: terminating any other
active server processes

WAL file 00000001000024DB00000072 on two standbys differs:

root(at)xdb2018f /tmp # md5sum /tmp/{normal,problem}/00000001000024DB00000072
dfd123289ac7a7dbe8c1355654c2cfc7 /tmp/normal/00000001000024DB00000072
b8aae9910f9e833cf21e1d241051a732 /tmp/problem/00000001000024DB00000072
root(at)xdb2018f /tmp #

But not the whole file, only after certain LSN:

root(at)xdb2018f ~ # /usr/pgsql-9.5/bin/pg_xlogdump
/tmp/normal/00000001000024DB00000072 /tmp/normal/00000001000024DB00000072 -e
24DB/72095CC8 | md5sum
24a24db0b0d32ec5d42eac61955c231b -
root(at)xdb2018f ~ # /usr/pgsql-9.5/bin/pg_xlogdump
/tmp/problem/00000001000024DB00000072 /tmp/problem/00000001000024DB00000072
-e 24DB/72095CC8 | md5sum
24a24db0b0d32ec5d42eac61955c231b -
root(at)xdb2018f ~ #

root(at)xdb2018f /tmp # /usr/pgsql-9.5/bin/pg_xlogdump
/tmp/normal/00000001000024DB00000072 /tmp/normal/00000001000024DB00000072 |
fgrep -C1 24DB/72095CC8
rmgr: Gin len (rec/tot): 2/ 3875, tx: 193820712, lsn:
24DB/72094DA0, prev 24DB/72094460, desc: INSERT isdata: F isleaf: T
isdelete: T, blkref #0: rel 1663/16400/18030 blk 368350 FPW
rmgr: Gin len (rec/tot): 2/ 388, tx: 193820712, lsn:
24DB/72095CC8, prev 24DB/72094DA0, desc: INSERT isdata: F isleaf: T
isdelete: T, blkref #0: rel 1663/16400/18030 blk 368350
rmgr: Gin len (rec/tot): 2/ 412, tx: 193820712, lsn:
24DB/72095E50, prev 24DB/72095CC8, desc: INSERT isdata: F isleaf: T
isdelete: T, blkref #0: rel 1663/16400/18030 blk 368350
rmgr: Btree len (rec/tot): 2/ 3323, tx: 193820704, lsn:
24DB/72095FF0, prev 24DB/72095E50, desc: INSERT_LEAF off 94, blkref #0: rel
1663/16400/17565 blk 7488040 FPW
root(at)xdb2018f /tmp #

root(at)xdb2018f /tmp # /usr/pgsql-9.5/bin/pg_xlogdump
/tmp/problem/00000001000024DB00000072 /tmp/problem/00000001000024DB00000072
| fgrep -C1 24DB/72095CC8
pg_xlogdump: FATAL: error in WAL record at 24DB/72095E50: unexpected
pageaddr 24D9/23096000 in log segment 00000000000024DB00000072, offset
614400
rmgr: Gin len (rec/tot): 2/ 3875, tx: 193820712, lsn:
24DB/72094DA0, prev 24DB/72094460, desc: INSERT isdata: F isleaf: T
isdelete: T, blkref #0: rel 1663/16400/18030 blk 368350 FPW

rmgr: Gin len (rec/tot): 2/ 388, tx: 193820712, lsn:
24DB/72095CC8, prev 24DB/72094DA0, desc: INSERT isdata: F isleaf: T
isdelete: T, blkref #0: rel 1663/16400/18030 blk 368350
rmgr: Gin len (rec/tot): 2/ 412, tx: 193820712, lsn:
24DB/72095E50, prev 24DB/72095CC8, desc: INSERT isdata: F isleaf: T
isdelete: T, blkref #0: rel 1663/16400/18030 blk 368350
root(at)xdb2018f /tmp #

Both WALs (normal and problem could be downloaded from
https://yadi.sk/d/LjVH8p4zsUHty) Other WALs after 00000001000024DB00000072
also differ.

Recovery.conf on standby:
root(at)xdb2018f ~ # cat /var/lib/pgsql/9.5/data/recovery.conf
standby_mode = 'on'
recovery_target_timeline = 'latest'
primary_conninfo = 'host=xdb2018g.mail.yandex.net port=5432 user=repl
application_name=xdb2018f_mail_yandex_net'
primary_slot_name = 'xdb2018f_mail_yandex_net'
root(at)xdb2018f ~ #

Replications settings in postgresql.conf (the same on both primary and all
standbys):
xdb2018f/postgres R # SELECT name, setting FROM pg_settings
WHERE category LIKE 'Write-Ahead Log%' OR category LIKE 'Replication%';
name | setting
------------------------------+---------------
archive_command | cp %p wals/%f
archive_mode | on
archive_timeout | 0
checkpoint_completion_target | 0.9
checkpoint_timeout | 3600
checkpoint_warning | 420
commit_delay | 0
commit_siblings | 5
fsync | on
full_page_writes | on
hot_standby | on
hot_standby_feedback | on
max_replication_slots | 10
max_standby_archive_delay | 30000
max_standby_streaming_delay | 30000
max_wal_senders | 10
max_wal_size | 1024
min_wal_size | 64
synchronous_commit | on
synchronous_standby_names |
track_commit_timestamp | off
vacuum_defer_cleanup_age | 0
wal_buffers | 2048
wal_compression | on
wal_keep_segments | 64
wal_level | hot_standby
wal_log_hints | off
wal_receiver_status_interval | 1
wal_receiver_timeout | 60000
wal_retrieve_retry_interval | 5000
wal_sender_timeout | 3000
wal_sync_method | fdatasync
wal_writer_delay | 200
(33 rows)

Time: 1.007 ms
xdb2018f/postgres R #

PGDATA and pg_xlog directory are both mdadm raid10 arrays and mounted with
standard options:
root(at)xdb2018f ~ # mount | fgrep pgsql
/dev/md4 on /var/lib/pgsql/9.5/data type ext4 (rw,noatime,nodiratime)
/dev/md3 on /var/lib/pgsql/9.5/data/pg_xlog type ext4
(rw,noatime,nodiratime)
root(at)xdb2018f ~ #

Should I provide some more diagnostics?

Responses

Browse pgsql-bugs by date

  From Date Subject
Next Message sideuxb-ky.consultant 2016-06-14 10:39:13 BUG #14190: 'silent_mode=on', 'pg_ctl restart -w' or 'pg_ctl start -w' reports error
Previous Message gabrimonfa 2016-06-14 09:48:46 BUG #14188: "FOR SELECT DISTINCT, ORDER BY expressions must appear IN SELECT list" error and table alias