the database system is shutting down - terminating walsender process due to replication timeout

From: Zarko Aleksic <zarko(dot)aleksic(at)live(dot)com>
To: "pgsql-general(at)postgresql(dot)org" <pgsql-general(at)postgresql(dot)org>
Subject: the database system is shutting down - terminating walsender process due to replication timeout
Date: 2017-10-30 17:35:58
Message-ID: CY4PR04MB1224DF879B60CBD87D7345F29A590@CY4PR04MB1224.namprd04.prod.outlook.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general


Greetings everyone,

I'm looking for a bit of help understanding a particular behavior we are seeing with our PostgreSQL 9.6. After issuing a service shutdown command with "systemctl stop" on RHEL 7 our PostgreSQL instance started behaving weirdly. For the first time it wouldn't shutdown so easily / quickly.

From the logs we could see that standby nodes that were trying to connect were rejected due to database being shutdown. After wal_sender_timeout and wal_receiver_timeout (default 60s) were reached the database finally shut down. It seems that walsender process was preventing the shutdown of the master database - until timeout was reached, a behavior we didn't experience before.

Does anyone know why would this happen?

We have 1 standby node in our primary site (same subnet as master DB), and two standbys in a remote site. WAL archiving is enabled to the remote site with rsync command that worked normally during this time and generally completes within a couple of seconds - definitely less than a minute. So we kind of ruled out WAL archiving.

Would shutting down remote site standbys prevent this kind of delay, they usually have a couple of seconds lag in terms of replication (pg_xlog) location?

Here is the postgres log from master that was being shutdown.

2017-10-26 22:04:01 CDT [1701]: [6-1] user=,db= LOG: received fast shutdown request
2017-10-26 22:04:01 CDT [1701]: [7-1] user=,db= LOG: aborting any active transactions
2017-10-26 22:04:01 CDT [1711]: [2-1] user=,db= LOG: dbms_aq launcher shutting down
2017-10-26 22:04:01 CDT [1708]: [2-1] user=,db= LOG: autovacuum launcher shutting down
2017-10-26 22:04:01 CDT [1705]: [9971-1] user=,db= LOG: shutting down
2017-10-26 22:04:01 CDT [1705]: [9972-1] user=,db= LOG: checkpoint starting: shutdown immediate
2017-10-26 22:04:01 CDT [1705]: [9973-1] user=,db= LOG: checkpoint complete: wrote 54 buffers (0.0%); 0 transaction log file(s) added, 0 removed, 1 recycled; write=0.017 s, sync=0.001 s, total=0.026 s; sync files=12, longest=0.000 s, average=0.000 s; distance=1100 kB, estimate=3307 kB
2017-10-26 22:04:01 CDT [1705]: [9974-1] user=,db= LOG: database system is shut down

2017-10-26 22:04:36 CDT [8763]: [1-1] "Local site standby IP"(49606) user=replication_user ,db=[unknown] FATAL: the database system is shutting down

[OMITTED - a bunch of "FATAL: the database system is shutting down" messages]

2017-10-26 22:05:00 CDT [2669]: [1-1] "Disaster recovery site standby IP" (55498) user=replication_user,db=[unknown] LOG: terminating walsender process due to replication timeout

2017-10-26 22:05:00 CDT [1465]: [1-1] "Disaster recovery site standby IP"(36948) user=replication_user,db=[unknown] LOG: terminating walsender process due to replication timeout
2017-10-26 22:05:01 CDT [1701]: [8-1] user=,db= LOG: database system is shut down

Standbys had the same message regarding walreceiver being terminated due to replication timeout.

Any help is greatly appreciated.

Thanks,
Zarko

In response to

Responses

Browse pgsql-general by date

  From Date Subject
Next Message Zarko Aleksic 2017-10-30 17:36:59 the database system is shutting down - terminating walsender process due to replication timeout
Previous Message Zarko Aleksic 2017-10-30 17:31:01 the database system is shutting down - terminating walsender process due to replication timeout