Re: how is the WAL receiver process stopped and restarted when the network connection is broken and then restored?

From: Rui Hai Jiang <ruihaij(at)gmail(dot)com>
To: Craig Ringer <craig(at)2ndquadrant(dot)com>
Cc: PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: how is the WAL receiver process stopped and restarted when the network connection is broken and then restored?
Date: 2016-06-23 14:56:19
Message-ID: CAEri+mLJjVD301LvKNmaGnr_VdmsHLBks7vaC2bWvYGLJjjuRw@mail.gmail.com
Lists: pgsql-hackers

Thank you Craig for your suggestion.

I followed the clue and spent the whole day digging into the code.

Finally I figured out how the WAL receiver exits and restarts.

Question-1. How the WAL receiver process exits
===============================================
When the network connection is broken, the WAL receiver can no longer
communicate with the WAL sender. Once it has received nothing from the WAL
sender for the period set by wal_receiver_timeout, the WAL receiver process
exits by calling "ereport(ERROR,...)".

Calling ereport(ERROR,...) causes the WAL receiver process to exit, but
calling ereport(LOG,...) doesn't.
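
As a hedged illustration of why ERROR ends the process here: as far as I
understand it, the error-reporting code escalates ERROR to FATAL when the
process has no error handler to jump back to, and the WAL receiver appears
to be such a process. The toy model below is an assumption-laden sketch, not
PostgreSQL code; the names Level, effective_level and process_exits are
hypothetical.

```c
#include <stdbool.h>

/* Hypothetical, simplified subset of the server's error levels. */
typedef enum { LVL_LOG, LVL_ERROR, LVL_FATAL } Level;

/*
 * Assumed rule: an ERROR raised in a process that has no error
 * handler installed is treated as FATAL.
 */
Level
effective_level(Level requested, bool have_handler)
{
    if (requested == LVL_ERROR && !have_handler)
        return LVL_FATAL;
    return requested;
}

/* In this model, only FATAL terminates the process. */
bool
process_exits(Level requested, bool have_handler)
{
    return effective_level(requested, have_handler) == LVL_FATAL;
}
```

In this model, process_exits(LVL_ERROR, false) is true (the WAL receiver
case), while process_exits(LVL_LOG, false) is false.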

WalReceiverMain(void)
{
    /* ... */
    len = walrcv_receive(NAPTIME_PER_CYCLE, &buf);
    if (len != 0)
    {
        /* process the received data (omitted) */
    }
    else
    {
        /* nothing received in this cycle: check the timeout */
        if (wal_receiver_timeout > 0)
        {
            if (now >= timeout)
                ereport(ERROR,
                        (errmsg("terminating walreceiver due to timeout")));
        }
    }
}
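
The timeout check above can be sketched as a small standalone function. This
is only an illustration under stated assumptions, not the actual walreceiver
code: the type ts_ms, the struct RecvState and the function
receiver_timed_out are hypothetical names, and the assumption that the
deadline is "time of last receive plus wal_receiver_timeout" is my reading
of the logic.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for a server timestamp, in milliseconds. */
typedef int64_t ts_ms;

/* Hypothetical per-receiver state. */
typedef struct
{
    ts_ms last_recv;   /* time of the most recent successful receive */
    int   timeout_ms;  /* plays the role of wal_receiver_timeout; 0 disables */
} RecvState;

/*
 * Called once per polling cycle. len is the number of bytes received in
 * this cycle (0 means nothing arrived). Returns true at the point where
 * the excerpt above calls ereport(ERROR, ...).
 */
bool
receiver_timed_out(RecvState *st, int len, ts_ms now)
{
    if (len != 0)
    {
        st->last_recv = now;    /* any traffic pushes the deadline out */
        return false;
    }

    if (st->timeout_ms > 0)
    {
        ts_ms deadline = st->last_recv + st->timeout_ms;

        if (now >= deadline)
            return true;
    }
    return false;
}
```

With timeout_ms set to 60000 and last_recv at 0, a silent cycle at
now = 59999 does not time out, while one at now = 60000 does.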

Question-2. How WAL receiver process starts again
=====================================================

On the standby side, the startup process is responsible for recovery
processing. If streaming replication is configured and the startup process
finds that the WAL receiver process is not running, it notifies the
postmaster to start the WAL receiver process. Note: This is also how the
WAL receiver process starts for the first time!

(1) The startup process notifies the postmaster to start the WAL receiver
by sending it a SIGUSR1.

RequestXLogStreaming()
{
    if (launch)
        SendPostmasterSignal(PMSIGNAL_START_WALRECEIVER);
}

SendPostmasterSignal(PMSignalReason reason)
{
    kill(PostmasterPid, SIGUSR1);
}

(2) The postmaster receives SIGUSR1 and starts the WAL receiver process.

sigusr1_handler(SIGNAL_ARGS)
{
    WalReceiverPID = StartWalReceiver();
}
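
The two steps above follow a common Unix pattern: a child process asks its
parent to do something by sending SIGUSR1, and the parent's handler records
the request. Below is a minimal standalone sketch of that pattern using only
POSIX calls. It is not PostgreSQL code; the names start_requested,
on_sigusr1 and demo_request_via_sigusr1 are hypothetical, and the parent
here only records the request instead of starting a worker.

```c
#define _POSIX_C_SOURCE 200809L

#include <signal.h>
#include <stdbool.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Set by the signal handler; plays the role of the "start requested" flag. */
static volatile sig_atomic_t start_requested = 0;

static void
on_sigusr1(int signo)
{
    (void) signo;
    start_requested = 1;
}

/*
 * Forks a child that signals its parent with SIGUSR1 (the "startup
 * process" role), then waits in the parent (the "postmaster" role)
 * until the handler has run. Returns true if the request was seen.
 */
bool
demo_request_via_sigusr1(void)
{
    struct sigaction sa;
    sigset_t    block, old, waitmask;
    pid_t       child;

    sa.sa_handler = on_sigusr1;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = 0;
    if (sigaction(SIGUSR1, &sa, NULL) != 0)
        return false;

    /* Block SIGUSR1 so it cannot arrive before we are ready to wait. */
    sigemptyset(&block);
    sigaddset(&block, SIGUSR1);
    if (sigprocmask(SIG_BLOCK, &block, &old) != 0)
        return false;
    waitmask = old;
    sigdelset(&waitmask, SIGUSR1);

    start_requested = 0;
    child = fork();
    if (child < 0)
    {
        sigprocmask(SIG_SETMASK, &old, NULL);
        return false;
    }
    if (child == 0)
    {
        /* Child: send the request to the parent and exit. */
        kill(getppid(), SIGUSR1);
        _exit(0);
    }

    /* Parent: atomically unblock SIGUSR1 and sleep until it arrives. */
    while (!start_requested)
        sigsuspend(&waitmask);

    sigprocmask(SIG_SETMASK, &old, NULL);
    waitpid(child, NULL, 0);
    return start_requested == 1;
}
```

Blocking the signal before fork() and waiting with sigsuspend() avoids the
race where the signal arrives before the parent starts waiting.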

Please let me know if my understanding is incorrect.

thanks,
Rui Hai
