Re: Re: [COMMITTERS] pgsql: Send new protocol keepalive messages to standby servers.

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
Cc: Simon Riggs <simon(at)2ndquadrant(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Re: [COMMITTERS] pgsql: Send new protocol keepalive messages to standby servers.
Date: 2012-05-31 19:49:11
Message-ID: CA+TgmoZb1xW3fcG4mAS5ApuM_Ki3ey+Y5C+aSgNKX4r9HMrqdw@mail.gmail.com
Lists: pgsql-committers pgsql-hackers

On Wed, May 30, 2012 at 12:17 PM, Fujii Masao <masao(dot)fujii(at)gmail(dot)com> wrote:
> OTOH, I wonder whether we really need to send keepalive messages
> periodically to calculate a network latency. ISTM we don't unless a network
> latency varies from situation to situation so frequently and we'd like to
> monitor that in almost real time.

I didn't look at this patch too carefully when it was committed.
Looking at it more carefully now, it looks to me like this patch does
two different things. One is to add a function called
GetReplicationApplyDelay(), which returns the number of milliseconds
since replay was fully caught up. So if you were last caught up 5
minutes ago and you have replayed 4 minutes and 50 seconds worth of
WAL during that time, this function will return 5 minutes, not 10
seconds. That is not what I would call "apply delay", which I would
define as how far behind you are NOW, not how long it's been since you
weren't behind at all.
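
To make the distinction concrete, here is a minimal Python sketch of the two definitions applied to the numbers above. The function names are hypothetical and do not come from the patch. It assumes a single shared clock, and it assumes that "4 minutes and 50 seconds worth of WAL" means WAL the master generated over that span.

```python
# Illustrative sketch only: hypothetical helpers, not PostgreSQL code.
# All times are milliseconds on one shared clock (no skew assumed).

def ms_since_fully_caught_up(now_ms, last_caught_up_ms):
    """Time elapsed since replay was last fully caught up."""
    return now_ms - last_caught_up_ms

def ms_behind_now(now_ms, last_replayed_wal_ms):
    """How far behind replay is right now, measured as the age of
    the most recently replayed WAL."""
    return now_ms - last_replayed_wal_ms

last_caught_up_ms = 0
now_ms = 5 * 60 * 1000                      # caught up 5 minutes ago
# Assumption: 4 min 50 s of master-generated WAL replayed since then.
last_replayed_wal_ms = last_caught_up_ms + (4 * 60 + 50) * 1000

reported = ms_since_fully_caught_up(now_ms, last_caught_up_ms)
behind = ms_behind_now(now_ms, last_replayed_wal_ms)
print(reported, behind)                     # 300000 10000
```

The first number is the 5 minutes described above, and the second is the 10 seconds that "how far behind you are NOW" would give.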

The second thing it does is add a function called
GetReplicationTransferLatency(). The return value of this function is
the difference between the slave's clock at the time the last master
keepalive was processed and the master's clock at the time that
keepalive was generated. I think that in practice, unless network
time synchronization is in use, this is mostly going to be computing
the clock skew between the master and the slave. If time
synchronization is in use, then as you say it'll be a very jittery
measure of master-slave network latency, which can be monitored
perfectly well from outside PG.
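
A small Python sketch of why this value is dominated by clock skew may help. It models the measurement as slave clock at receipt minus master clock at send. The function and parameter names are hypothetical, the master's clock is taken as the reference, and the numbers are invented for illustration.

```python
# Illustrative sketch only: hypothetical model, not PostgreSQL code.
# All values are microseconds; the master's clock is the reference.

def measured_transfer_latency_us(master_send_us, network_latency_us,
                                 slave_clock_offset_us):
    """Slave clock when the keepalive is processed, minus the master
    clock when it was generated."""
    master_stamp = master_send_us
    # The slave reads its own clock, which is off by slave_clock_offset_us.
    slave_stamp = master_send_us + network_latency_us + slave_clock_offset_us
    return slave_stamp - master_stamp

# Unsynchronized clocks: slave is 2 s ahead, the network takes 500 us.
unsynced = measured_transfer_latency_us(1_000_000, 500, 2_000_000)

# Synchronized clocks: the same measurement reduces to network latency.
synced = measured_transfer_latency_us(1_000_000, 500, 0)

print(unsynced, synced)                     # 2000500 500
```

In the unsynchronized case the result is almost entirely skew. In the synchronized case it is the one-way latency of that single keepalive, so it moves with every change in network delay.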

Now, measuring time skew is potentially a useful thing to do, if we
believe that this will actually give us an accurate measurement of
what the time skew is, because there are a whole series of things that
people want to do which involve subtracting a slave timestamp from a
master timestamp. Tom has persistently rebuffed all such proposals on
the grounds that there might be time skew, so in theory we could make
those things possible by having a way to measure time skew, which this
does. The method would be: given a slave timestamp, add the estimated
time skew to find an equivalent master timestamp, and then subtract.
Using a method of this type would allow us to compute a *real* apply
delay. Woohoo! Unfortunately, if time synchronization IS in use,
then the system clocks are probably already synchronized three to six
orders of magnitude more precisely than what this method can measure,
so the effect of using GetReplicationTransferLatency() to adjust slave
timestamps will be to massively reduce the accuracy of such
calculations. However, I've thus far been unable to convince anyone
that this is a bad idea, so maybe this is where we're gonna end up.
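
The adjustment described above can be sketched in Python as follows. Everything here is hypothetical: the helper names, the sign convention (skew is master clock minus slave clock, so that adding it to a slave timestamp gives a master-equivalent one), and the numbers. The skew estimate comes from a single keepalive and cannot separate skew from network latency.

```python
# Illustrative sketch only: hypothetical helpers, not PostgreSQL code.
# All values are microseconds.

def estimate_skew_us(master_send_stamp_us, slave_receive_stamp_us):
    """Estimated skew as master minus slave. Network latency is not
    subtracted out, so it ends up as error in the estimate."""
    return master_send_stamp_us - slave_receive_stamp_us

def adjusted_delay_us(master_stamp_us, slave_stamp_us, skew_us):
    """Map the slave timestamp to master time, then subtract."""
    return (slave_stamp_us + skew_us) - master_stamp_us

# Case 1: slave clock is 2 s ahead; keepalive latency is 500 us.
skew1 = estimate_skew_us(10_000_000, 12_000_500)
# Record stamped 20_000_000 on the master, replayed 10_000 us later;
# the slave clock reads 22_010_000 at that moment.
raw1 = 22_010_000 - 20_000_000
adj1 = adjusted_delay_us(20_000_000, 22_010_000, skew1)

# Case 2: clocks already synchronized; keepalive happened to take 40 ms.
skew2 = estimate_skew_us(10_000_000, 10_040_000)
raw2 = 20_010_000 - 20_000_000
adj2 = adjusted_delay_us(20_000_000, 20_010_000, skew2)

print(raw1, adj1)                           # 2010000 9500
print(raw2, adj2)                           # 10000 -30000
```

With unsynchronized clocks the adjustment recovers the true 10 ms delay to within the keepalive's latency. With synchronized clocks the unadjusted subtraction was already exact, and the adjustment replaces it with a value that is off by the full latency of one keepalive.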

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
