Quick Links

Re: replication lag despite corrective config

From:	Rene Romero Benavides <rene(dot)romero(dot)b(at)gmail(dot)com>
To:	wyatt(dot)alt(at)gmail(dot)com
Cc:	pgsql-general(at)postgresql(dot)org
Subject:	Re: replication lag despite corrective config
Date:	2018-11-21 19:20:25
Message-ID:	CANaGW09f0z_VN7yHa_Gzye7R2=-QfPeE428Q1HkxPULfFQfpow@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Thread:
Lists:	pgsql-general

How big have been the delays after the new settings? I guess significantly
lower than before, right? how much have they decreased?

Am Mi., 21. Nov. 2018 um 13:18 Uhr schrieb Rene Romero Benavides <
rene(dot)romero(dot)b(at)gmail(dot)com>:

> You're welcome.
> Since last Saturday when you addressed the 10 hour delay, with the new
> settings, have you seen more of such delay incidents? what the previous
> settings were?
> Beware that hot_standby_feedback = on and such long queries in the replica
> can increase bloat in the master, are you measuring bloat? if so, do you
> notice a significant increase?
>
> Am Di., 20. Nov. 2018 um 23:08 Uhr schrieb Wyatt Alt <wyatt(dot)alt(at)gmail(dot)com
> >:
>
>> Hi Rene,
>> On 11/19/18 8:46 PM, Rene Romero Benavides wrote:
>>
>> Not sure about the root cause but I can make these observations and raise
>> some questions:
>> 1) 9.6.6 is five bug fix versions behind
>>
>> Valid point to raise.
>>
>> 2) 300GB is so big a table, wouldn't make sense to you to partition it ?
>> 2a) or if it's partitioned, doesn't the time of creation or dropping of
>> new partitions match the time of the conflict?
>>
>> Partitioning is in the works, but none at the moment.
>>
>>
>> 3) can you track long running transactions on the master?
>> 4) what are the isolation levels on master / replica?
>>
>> Transaction times on the master max out around two minutes. On the
>> replica they are much longer -- numerous 1 - 2 hour transactions per day,
>> and occasional ones as long as 10 - 20 hours. Isolation levels are read
>> committed everywhere.
>>
>> 5) check for active locks in the replica, I guess you should see some
>> blocked transactions during big delays, I've seen this in the past when
>> standby_feedback is turned off.
>> 6) any out of the ordinary messages in the replica's logs? any evidence
>> that has been canceling statements?
>>
>> I'll make a note to record the active locks next time. I haven't seen
>> anything unusual in the logs during these incidents, but have observed
>> statements getting canceled at other times, which is why I think the config
>> mostly works.
>>
>> 7) are master and replica exactly the same in terms of resources and main
>> parameters?
>> 8) how is performance in both nodes while the big delay is happening? IO
>> / cpu load / etc.
>>
>> This brings up a good detail I forgot to mention originally. During the
>> last incident, IO utilization on the replica was near 100%, and had been
>> for several hours, which I believe was due to the long queries I canceled.
>> Now that I think about it, I wonder if the lag may have arisen from IO
>> contention between the query and WAL replay, rather than a query conflict.
>>
>>
>> Also, check this out:
>> https://www.alibabacloud.com/forum/read-383
>>
>> Thanks, interesting reading.
>>
>>
>> Am Mo., 19. Nov. 2018 um 21:46 Uhr schrieb Wyatt Alt <wyatt(dot)alt(at)gmail(dot)com
>> >:
>>
>>> Sorry, I see now there was a similar question a few days ago:
>>>
>>> https://www.postgresql.org/message-id/CAJw4d1WtzOdYzd8Nq2=uFK+Z0JY0L_pfg9TvCWPrmt3NCZq9GA@mail.gmail.com
>>>
>>> Two ideas proposed (aside from disconnects):
>>> * Autovacuum is truncating a page on the master and taking an
>>> AccessExclusiveLock on the table in use on the replica
>>> * A "pin conflict", which I'm still unfamiliar with.
>>>
>>> The user's response says they are in the first bucket, but the argument
>>> relies on max_standby_streaming_delay set to -1, while mine is 5 minutes. I
>>> need to understand pin conflicts better, but the likely scenario Andrew
>>> outlined doesn't apply to me. My offending queries were doing bitmap heap
>>> scans on a 300GB table.
>>>
>>> Reading the thread I see Andres ask for the "precise conflict" the user
>>> gets -- is there a way I can get that without a datadir? And to re-frame
>>> the original question, are there causes of replication lag that
>>> max_standby_streaming_delay would not be expected to prevent, that would be
>>> resolved by killing long standby queries? If so, what's the best way to
>>> confirm?
>>>
>>> Wyatt
>>>
>>> On Mon, Nov 19, 2018 at 5:46 PM Wyatt Alt <wyatt(dot)alt(at)gmail(dot)com> wrote:
>>>
>>>> I've been struggling to eliminate replication lag on a Postgres 9.6.6
>>>> instance on Amazon RDS. I believe the lag is caused by early cleanup
>>>> conflicts from vacuums on the master, because I can reliably resolve it by
>>>> killing long-running queries on the standby. I most recently saw ten hours
>>>> of lag on Saturday and addressed it this way.
>>>>
>>>> The standby is running with
>>>> hot_standby_feedback = on
>>>> max_standby_streaming_delay = 5min
>>>> max_standby_archive_delay = 30s
>>>>
>>>> I am not using replication slots on the primary due to reported
>>>> negative interactions with pg_repack on large tables.
>>>>
>>>> My rationale for the first two settings is that hot_standby_feedback
>>>> should address my issues almost all the time, but that
>>>> max_standby_streaming_delay would sometimes be necessary as a fallback, for
>>>> instance in cases of a transient connection loss between the standby and
>>>> primary. I believe these settings are mostly working, because lag is less
>>>> frequent than it was when I configured them.
>>>>
>>>> My questions are,
>>>> * Am I overlooking anything in my configuration?
>>>> * What would explain lag caused by query conflicts given the
>>>> max_standby_streaming_delay setting? Shouldn't those queries be getting
>>>> killed?
>>>> * Is there any particular diagnostic info I should be collecting on the
>>>> next occurrence, to help me figure out the cause? Note that as I'm on RDS,
>>>> I don't have direct access to the datadir -- just psql.
>>>>
>>>> Thanks for any advice!
>>>> Wyatt
>>>>
>>>
>>
>> --
>> El genio es 1% inspiración y 99% transpiración.
>> Thomas Alva Edison
>> http://pglearn.blogspot.mx/
>>
>>
>
> --
> El genio es 1% inspiración y 99% transpiración.
> Thomas Alva Edison
> http://pglearn.blogspot.mx/
>
>

--
El genio es 1% inspiración y 99% transpiración.
Thomas Alva Edison
http://pglearn.blogspot.mx/

In response to

Re: replication lag despite corrective config at 2018-11-21 19:18:08 from Rene Romero Benavides

Browse pgsql-general by date

	From	Date	Subject
Next Message	Adrian Klaver	2018-11-21 23:54:32	Re: Parallel query and number of connections.
Previous Message	Rene Romero Benavides	2018-11-21 19:18:08	Re: replication lag despite corrective config