Streaming replication with sync slave, but disconnects due to missing WAL segments

From: Mads(dot)Tandrup(at)schneider-electric(dot)com
To: pgsql-general(at)postgresql(dot)org
Subject: Streaming replication with sync slave, but disconnects due to missing WAL segments
Date: 2013-06-04 13:25:47
Message-ID: OF87DBD177.6C324ADB-ONC1257B80.00483254-C1257B80.0049C624@apcc.com
Lists: pgsql-general


Hi all

I have a question about sync streaming replication.

I have two PostgreSQL 9.1 servers set up with streaming replication. On the
master node the slave is configured as a synchronous standby. I've verified
that pg_stat_replication shows sync_state = sync for the slave node.

It all seems to work fine, but I have noticed that sometimes, when I restore
backups created by pg_dump, the slave node disconnects with the following
messages in the PostgreSQL log:
2013-06-03 13:13:48 GMT 4271 FATAL: could not receive data from WAL stream: SSL connection has been closed unexpectedly
2013-06-03 13:13:53 GMT 4270 LOG: invalid magic number 0000 in log file 15, segment 65, offset 11665408
2013-06-03 13:13:54 GMT 36428 LOG: streaming replication successfully connected to primary
2013-06-03 13:13:54 GMT 36428 FATAL: could not receive data from WAL stream: FATAL: requested WAL segment 000000010000000F00000041 has already been removed
2013-06-03 13:13:58 GMT 36458 LOG: streaming replication successfully connected to primary
2013-06-03 13:13:58 GMT 36458 FATAL: could not receive data from WAL stream: FATAL: requested WAL segment 000000010000000F00000041 has already been removed
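
As an aside on reading these messages: the segment named in the FATAL lines
appears to be the same one the slave reported as "log file 15, segment 65".
Below is a minimal sketch of that mapping, assuming the usual WAL file naming
of three 8-digit uppercase hex fields (timeline, log file, segment). The
helper name wal_segment_name is hypothetical, and the timeline value 1 is
simply read off the start of the file name in the log.

```python
def wal_segment_name(timeline: int, log_id: int, segment: int) -> str:
    """Format a WAL segment file name as three 8-digit uppercase hex fields."""
    return "%08X%08X%08X" % (timeline, log_id, segment)


# Values taken from the slave log: log file 15, segment 65.
# Timeline 1 is an assumption read from the file name in the FATAL message.
name = wal_segment_name(1, 15, 65)
print(name)  # 000000010000000F00000041
```

So the segment the slave keeps asking for is the one it was in the middle of
receiving when the connection dropped.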

On the master I get this in the log file in the same timespan:
2013-06-03 13:13:47 GMT 1471 LOG: checkpoints are occurring too frequently (2 seconds apart)
2013-06-03 13:13:47 GMT 1471 HINT: Consider increasing the configuration parameter "checkpoint_segments".
2013-06-03 13:13:48 GMT 6189 [unknown] FATAL: requested WAL segment 000000010000000F00000041 has already been removed
2013-06-03 13:13:48 GMT 6189 [unknown] LOG: disconnection: session time: 77:37:37.684 user=root database= host=10.216.80.38 port=56114
2013-06-03 13:13:49 GMT 1471 LOG: checkpoints are occurring too frequently (2 seconds apart)
2013-06-03 13:13:49 GMT 1471 HINT: Consider increasing the configuration parameter "checkpoint_segments".
2013-06-03 13:13:51 GMT 1471 LOG: checkpoints are occurring too frequently (2 seconds apart)
2013-06-03 13:13:51 GMT 1471 HINT: Consider increasing the configuration parameter "checkpoint_segments".
2013-06-03 13:13:51 GMT 1468 LOG: received SIGHUP, reloading configuration files
2013-06-03 13:13:51 GMT 1468 LOG: parameter "synchronous_standby_names" removed from configuration file, reset to default
2013-06-03 13:13:53 GMT 1471 LOG: checkpoints are occurring too frequently (2 seconds apart)
2013-06-03 13:13:53 GMT 1471 HINT: Consider increasing the configuration parameter "checkpoint_segments".
2013-06-03 13:13:53 GMT 44063 [unknown] LOG: connection received: host=10.216.80.38 port=34038
2013-06-03 13:13:54 GMT 44063 [unknown] LOG: replication connection authorized: user=root
2013-06-03 13:13:54 GMT 44063 [unknown] FATAL: requested WAL segment 000000010000000F00000041 has already been removed
2013-06-03 13:13:54 GMT 44063 [unknown] LOG: disconnection: session time: 0:00:00.090 user=root database= host=10.216.80.38 port=34038

What I don't understand is how the slave node can miss a WAL segment when
it is supposed to be synchronous.

Shouldn't synchronous replication prevent the master from continuing if the
slave is not able to fetch WAL segments fast enough?

I have only noticed it while restoring a database. But the general load on
the DB has not been that high, so I'm not sure if it can occur with other
workloads.

Best regards,
Mads
