Re: Dealing with latency to replication slave; what to do?

From: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>
To: Rory Falloon <rfalloon(at)gmail(dot)com>
Cc: Andres Freund <andres(at)anarazel(dot)de>, Postgres General <pgsql-general(at)postgresql(dot)org>
Subject: Re: Dealing with latency to replication slave; what to do?
Date: 2018-07-25 02:13:31
Message-ID: CAMkU=1xdV6gt-0fgh9_=e=E+pN-xLPd3+t+MaCUNmpt-5cZ-FA@mail.gmail.com
Lists: pgsql-general

Please don't top-post; it is not the custom on this list.

On Tue, Jul 24, 2018 at 4:08 PM, Rory Falloon <rfalloon(at)gmail(dot)com> wrote:

> On Tue, Jul 24, 2018 at 4:02 PM Andres Freund <andres(at)anarazel(dot)de> wrote:
>
>> Hi,
>>
>> On 2018-07-24 15:39:32 -0400, Rory Falloon wrote:
>> > Looking for any tips here on how to best maintain a replication slave
>> > which is operating under some latency between networks - around 230ms.
>> > On a good day/week, replication will keep up for a number of days, but
>> > however, when the link is under higher than average usage, keeping
>> > replication active can last merely minutes before falling behind again.
>> >
>> > 2018-07-24 18:46:14 GMTLOG: database system is ready to accept read only connections
>> > 2018-07-24 18:46:15 GMTLOG: started streaming WAL from primary at 2B/93000000 on timeline 1
>> > 2018-07-24 18:59:28 GMTLOG: incomplete startup packet
>> > 2018-07-24 19:15:36 GMTLOG: incomplete startup packet
>> > 2018-07-24 19:15:36 GMTLOG: incomplete startup packet
>> > 2018-07-24 19:15:37 GMTLOG: incomplete startup packet
>> >
>> > As you can see above, it lasted about half an hour before falling out of
>> > sync.
>>
>> How can we see that from the above? The "incomplete startup messages"
>> are independent of streaming rep? I think you need to show us more logs.
>>
>>
>>
> regarding your first reply, I was inferring that from the fact I saw those
> messages at the same time the replication stream fell behind. What other
> logs would be more pertinent to this situation?
>

This is circular. You think it lost sync because you saw some message you
didn't recognize, and then you think the error message was related to it
losing sync because they occurred at the same time. What evidence do you
have that it has lost sync at all? From the log file you posted, it seems
the server is running fine and is just getting probed by a port scanner, or
perhaps by a monitoring tool.
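
One way to get actual evidence is to compare WAL positions (LSNs) on the two servers. The sketch below only shows the arithmetic on LSN strings such as the 2B/93000000 in the posted log. Where the two values come from is an assumption: for example pg_current_wal_lsn() on the primary and pg_last_wal_replay_lsn() on the standby in PostgreSQL 10 and later (older releases use the "xlog" names). The primary position used in the example is made up for illustration.

```python
def lsn_to_int(lsn: str) -> int:
    """Convert an LSN such as '2B/93000000' to an absolute byte position.

    The part before the slash is the high 32 bits, the part after it
    the low 32 bits, both written in hexadecimal.
    """
    high, low = lsn.split("/")
    return (int(high, 16) << 32) | int(low, 16)


def lag_bytes(primary_lsn: str, standby_lsn: str) -> int:
    """Bytes of WAL the standby has not yet reached."""
    return lsn_to_int(primary_lsn) - lsn_to_int(standby_lsn)


# Standby position taken from the posted log; the primary position is a
# hypothetical value chosen only to show the calculation.
print(lag_bytes("2B/95000000", "2B/93000000"))
```

A lag that keeps growing while the link is busy would support the claim that the standby is falling behind; a lag that stays near zero would not.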

If it had lost sync, you would be getting log messages about "requested WAL
segment has already been removed".
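
To make the distinction concrete, here is a minimal sketch that counts the two kinds of message in a standby log. The function name and the category labels are made up for illustration. The pattern for the lost-sync message allows for a segment file name between "segment" and "has", which is an assumption about how the full message is worded on the server.

```python
import re

# Message that indicates the standby needs WAL the primary no longer has.
LOST_SYNC = re.compile(r"requested WAL segment (\S+ )?has already been removed")
# Message logged when a client connects and goes away without finishing
# the startup handshake; unrelated to streaming replication.
PROBE = re.compile(r"incomplete startup packet")


def classify(lines):
    """Count log lines by what they say about replication."""
    counts = {"lost_sync": 0, "probe": 0, "other": 0}
    for line in lines:
        if LOST_SYNC.search(line):
            counts["lost_sync"] += 1
        elif PROBE.search(line):
            counts["probe"] += 1
        else:
            counts["other"] += 1
    return counts


# The six lines from the log posted earlier in this thread.
posted_log = [
    "2018-07-24 18:46:14 GMTLOG: database system is ready to accept read only connections",
    "2018-07-24 18:46:15 GMTLOG: started streaming WAL from primary at 2B/93000000 on timeline 1",
    "2018-07-24 18:59:28 GMTLOG: incomplete startup packet",
    "2018-07-24 19:15:36 GMTLOG: incomplete startup packet",
    "2018-07-24 19:15:36 GMTLOG: incomplete startup packet",
    "2018-07-24 19:15:37 GMTLOG: incomplete startup packet",
]
print(classify(posted_log))
```

On the posted log this reports four probe messages and no lost-sync messages, which is the point being made above.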

Cheers,

Jeff

On Tue, Jul 24, 2018 at 4:08 PM, Rory Falloon <rfalloon(at)gmail(dot)com> wrote:

> On Tue, Jul 24, 2018 at 4:02 PM Andres Freund <andres(at)anarazel(dot)de> wrote:
>
>> Hi,
>>
>> On 2018-07-24 15:39:32 -0400, Rory Falloon wrote:
>> > On the master, I have wal_keep_segments=128. What is happening when I
>> > see "incomplete startup packet" - is it simply the slave has fallen
>> > behind, and cannot 'catch up' using the wal segments quick enough? I
>> > assume the slave is using the wal segments to replay changes and assuming
>> > there are enough wal segments to cover the period it cannot stream
>> > properly, it will eventually recover?
>>
>> You might want to look into replication slots to ensure the primary
>> keeps the necessary segments, but not more, around. You might also want
>> to look at wal_compression, to reduce the bandwidth usage.
>>
>> Greetings,
>>
>> Andres Freund
>>
>
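
For a rough sense of what wal_keep_segments=128 buys, the sketch below works out how much WAL that setting retains and how long it would last if the primary generates WAL faster than the link can ship it. It assumes the default 16 MB segment size, and the two rates in the example are hypothetical numbers, not measurements from this thread. The primary can keep more than this minimum around, so treat the result as a lower bound.

```python
MIB = 1024 * 1024
# Default WAL segment size; an assumption about this particular server.
SEGMENT_BYTES = 16 * MIB


def retained_wal_bytes(wal_keep_segments, segment_bytes=SEGMENT_BYTES):
    """Minimum amount of WAL the primary keeps for standbys."""
    return wal_keep_segments * segment_bytes


def seconds_until_removed(wal_keep_segments, wal_rate, link_rate):
    """Seconds before the standby needs a segment that is already gone.

    wal_rate is how fast the primary writes WAL and link_rate is how fast
    the standby can receive it, both in bytes per second. Returns None
    when the link keeps up, because the backlog never grows.
    """
    deficit = wal_rate - link_rate
    if deficit <= 0:
        return None
    return retained_wal_bytes(wal_keep_segments) / deficit


print(retained_wal_bytes(128) // MIB, "MiB retained")
# Hypothetical rates: primary writes 5 MiB/s, link delivers 3 MiB/s.
print(seconds_until_removed(128, 5 * MIB, 3 * MIB), "seconds of headroom")
```

A replication slot, as suggested above, removes this guesswork because the primary then keeps exactly the segments the standby still needs.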
