From: | Noah Misch <noah(at)leadboat(dot)com> |
---|---|
To: | pgsql-hackers(at)postgresql(dot)org |
Subject: | wal_sender_timeout should ignore server-side latency |
Date: | 2018-08-26 03:46:00 |
Message-ID: | 20180826034600.GA1105084@rfd.leadboat.com |
Lists: | pgsql-hackers |
WalSndLoop() does this, simplifying considerably:
    for (;;)
    {
        /* does: last_reply_timestamp = GetCurrentTimestamp() */
        ProcessRepliesIfAny();
        send_data();    /* e.g. XLogSendPhysical(), which calls XLogRead() */
        WalSndCheckTimeOut(GetCurrentTimestamp());
    }
A consequence is that any time spent in the send_data() callback counts
against the timeout. In particular, if a single send_data() takes longer than
wal_sender_timeout, the client is powerless to prevent a timeout. This
disagrees with the wal_sender_timeout documentation ("Terminate replication
connections that are inactive longer than the specified number of
milliseconds. This is useful for the sending server to detect a standby crash
or network outage"). I find it undesirable.
The fix, attached, is to interpret the timeout relative to a timestamp taken
before ProcessRepliesIfAny() polls the socket. If that timestamp is at least
wal_sender_timeout later than the last reply, we can terminate with
confidence. This adds one gettimeofday() per ProcessRepliesIfAny() call that
finds no replies, which feels cheap enough.
We've seen a number of wal_sender_timeout buildfarm failures on systems with
I/O performance trouble:
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=tern&dt=2018-08-16%2020:55:57
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=tern&dt=2018-06-30%2020:38:10
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=hornet&dt=2018-04-12%2018:12:36
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=mandrill&dt=2018-01-13%2005:01:17
https://postgr.es/m/flat/20170604211229.GA1528911@rfd.leadboat.com
Fixing $SUBJECT won't necessarily cure that, because an I/O stall on the
client side can still cause a failure. We'd need something like threads or
async I/O to avoid that. I mention a less-important corner case in the
WalSndCheckTimeOut() header comment. You can simulate slow XLogSendPhysical()
to explore these problems on any system:
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -65,2 +65,3 @@
 #include "libpq/pqformat.h"
+#include "libpq/pqsignal.h"
 #include "miscadmin.h"
@@ -2731,2 +2732,5 @@ XLogSendPhysical(void)
 	enlargeStringInfo(&output_message, nbytes);
+	PG_SETMASK(&BlockSig);
+	pg_usleep(65 * 1000 * 1000);
+	PG_SETMASK(&UnBlockSig);
 	XLogRead(&output_message.data[output_message.len], startptr, nbytes);
Attachment | Content-Type | Size |
---|---|---|
wal_sender_timeout-server-independent-v1.patch | text/plain | 6.3 KB |