Re: [COMMITTERS] pgsql: Replication lag tracking for walsenders

From: Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Simon Riggs <simon(at)2ndquadrant(dot)com>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [COMMITTERS] pgsql: Replication lag tracking for walsenders
Date: 2017-04-28 05:10:42
Message-ID: CAEepm=18jxXeMJF8A6GkozC-ssjmJRqSN_86hun68i+N7ga9nA@mail.gmail.com
Lists: pgsql-committers pgsql-hackers

On Mon, Apr 24, 2017 at 2:53 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com> writes:
>> Every machine sees the LSN moving backwards, but the code path that
>> had the assertion is only reached if it decides to interpolate, which
>> is timing-dependent: there needs to be a future sample in the lag
>> tracking buffer, which I guess is not the case in those runs.
>
> I'm dissatisfied with this explanation because if it's just timing,
> it doesn't seem very likely that some machines would reproduce the
> failure every single time while others never would. Maybe that can be
> blamed on kernel scheduler vagaries + different numbers of cores, but
> I can't escape the feeling that there's something here we've not
> fully understood.

Ahh. It is timing-dependent, but what I said before wasn't right.
The write LSN always jumps back to the start of the current segment,
but some machines don't report that upstream because of timing
effects.

I found a machine that doesn't reproduce the earlier assertion failure
(or rather, doesn't trigger an elog I put into the code path where the
assertion used to be). It's a VirtualBox VM running Debian 8 on amd64,
and it doesn't matter whether I give it 1 or 4 CPU cores.

After the timeline change, standby2 restarts WAL streaming from
0/3000000, and standby1 sends it two messages: a 128KB message followed
by a smaller one. On this virtual machine, the inner message-processing
loop in walreceiver.c reliably reads both messages from the socket one
after the other, taking it all the way to 0/3028470 (end of WAL). Only
then does it call XLogWalRcvSendReply, so the position reported to the
upstream server never goes backwards.

On my other machines, it reliably reads the first message, runs out of
data on the socket and falls out of the inner loop, and calls
XLogWalRcvSendReply to report the intermediate location 0/3020000.
Then the socket is soon ready again, so we go around the outer loop
once more and finally report 0/3028470.
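
To make the timing dependency concrete, the relevant structure in
walreceiver.c looks roughly like this (paraphrased, not the literal
source):

    for (;;)                        /* outer loop: wait for socket */
    {
        len = walrcv_receive(wrconn, &buf, &wait_fd);
        while (len > 0)             /* inner loop: drain the socket */
        {
            /* advances LogstreamResult.Write past this message's WAL */
            XLogWalRcvProcessMsg(buf[0], &buf[1], len - 1);
            len = walrcv_receive(wrconn, &buf, &wait_fd);
        }

        /* only now is our position reported upstream */
        XLogWalRcvSendReply(false, false);

        /* ... wait for the socket to become readable again ... */
    }

So whether the intermediate position 0/3020000 is ever reported
upstream depends entirely on whether the inner loop happens to drain
both messages in one pass.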

> [...]
>
> So you're right that the standby's reported "write" position can go
> backward, but it seems pretty darn odd that the flush and apply
> positions didn't go backward too. Is there a bug there?

Flush doesn't go backwards because XLogWalRcvFlush does nothing unless
Flush < Write. I am not sure whether it constitutes a bug that we
rewrite the start of the current segment without flushing it.
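
For reference, the guard in XLogWalRcvFlush looks roughly like this
(condensed from walreceiver.c):

    static void
    XLogWalRcvFlush(bool dying)
    {
        /* the flush position can only ever move forward */
        if (LogstreamResult.Flush < LogstreamResult.Write)
        {
            issue_xlog_fsync(recvFile, recvSegNo);
            LogstreamResult.Flush = LogstreamResult.Write;
            /* ... update shared memory, send a reply unless dying ... */
        }
    }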

A timeline change isn't actually necessary for the
rewind-to-start-of-segment behaviour, by the way: we always do that at
the start of streaming, as described in RequestXLogStreaming (sketched
below). But it's only a problem for timeline changes, because in other
cases we're talking to a fresh walsender that doesn't have any state
that would have triggered the (now removed) assertion. I think the
change committed in 84638808 was therefore correct.
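
For anyone following along, the rewind comes from the segment
alignment near the top of RequestXLogStreaming in xlog.c, roughly:

    /*
     * Always start streaming at the beginning of a segment, so that
     * we never create a segment with no records in its first half.
     */
    if (recptr % XLogSegSize != 0)
        recptr -= recptr % XLogSegSize;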

> I remain of the opinion that if we can't tell from the transmitted
> data whether a timeline switch has caused the position to go backward,
> then that's a protocol shortcoming that ought to be fixed.

Another approach might have been to teach the switch case for
T_StartReplicationCmd in exec_replication_command to clear all
LagTracker state, since that is the point at which the remote
walreceiver has requested a new stream and all subsequent messages
will refer to that stream.
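
Something like this untested sketch, in exec_replication_command's
switch (LagTracker being the static struct in walsender.c):

    case T_StartReplicationCmd:
        {
            StartReplicationCmd *cmd = (StartReplicationCmd *) cmd_node;

            /* hypothetical: forget samples from any previous stream */
            memset(&LagTracker, 0, sizeof(LagTracker));

            if (cmd->kind == REPLICATION_KIND_PHYSICAL)
                StartReplication(cmd);
            else
                StartLogicalReplication(cmd);
            break;
        }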

--
Thomas Munro
http://www.enterprisedb.com
