From: | Rene Romero Benavides <rene(dot)romero(dot)b(at)gmail(dot)com> |
---|---|
To: | wyatt(dot)alt(at)gmail(dot)com |
Cc: | pgsql-general(at)postgresql(dot)org |
Subject: | Re: replication lag despite corrective config |
Date: | 2018-11-21 19:18:08 |
Message-ID: | CANaGW08FD2unmd0RXhvLzBY=kDCp4hkqFX2WzZZUTmv2G-d7qQ@mail.gmail.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-general |
You're welcome.
Since last Saturday when you addressed the 10 hour delay, with the new
settings, have you seen more of such delay incidents? what the previous
settings were?
Beware that hot_standby_feedback = on and such long queries in the replica
can increase bloat in the master, are you measuring bloat? if so, do you
notice a significant increase?
Am Di., 20. Nov. 2018 um 23:08 Uhr schrieb Wyatt Alt <wyatt(dot)alt(at)gmail(dot)com>:
> Hi Rene,
> On 11/19/18 8:46 PM, Rene Romero Benavides wrote:
>
> Not sure about the root cause but I can make these observations and raise
> some questions:
> 1) 9.6.6 is five bug fix versions behind
>
> Valid point to raise.
>
> 2) 300GB is so big a table, wouldn't make sense to you to partition it ?
> 2a) or if it's partitioned, doesn't the time of creation or dropping of
> new partitions match the time of the conflict?
>
> Partitioning is in the works, but none at the moment.
>
>
> 3) can you track long running transactions on the master?
> 4) what are the isolation levels on master / replica?
>
> Transaction times on the master max out around two minutes. On the replica
> they are much longer -- numerous 1 - 2 hour transactions per day, and
> occasional ones as long as 10 - 20 hours. Isolation levels are read
> committed everywhere.
>
> 5) check for active locks in the replica, I guess you should see some
> blocked transactions during big delays, I've seen this in the past when
> standby_feedback is turned off.
> 6) any out of the ordinary messages in the replica's logs? any evidence
> that has been canceling statements?
>
> I'll make a note to record the active locks next time. I haven't seen
> anything unusual in the logs during these incidents, but have observed
> statements getting canceled at other times, which is why I think the config
> mostly works.
>
> 7) are master and replica exactly the same in terms of resources and main
> parameters?
> 8) how is performance in both nodes while the big delay is happening? IO /
> cpu load / etc.
>
> This brings up a good detail I forgot to mention originally. During the
> last incident, IO utilization on the replica was near 100%, and had been
> for several hours, which I believe was due to the long queries I canceled.
> Now that I think about it, I wonder if the lag may have arisen from IO
> contention between the query and WAL replay, rather than a query conflict.
>
>
> Also, check this out:
> https://www.alibabacloud.com/forum/read-383
>
> Thanks, interesting reading.
>
>
> Am Mo., 19. Nov. 2018 um 21:46 Uhr schrieb Wyatt Alt <wyatt(dot)alt(at)gmail(dot)com
> >:
>
>> Sorry, I see now there was a similar question a few days ago:
>>
>> https://www.postgresql.org/message-id/CAJw4d1WtzOdYzd8Nq2=uFK+Z0JY0L_pfg9TvCWPrmt3NCZq9GA@mail.gmail.com
>>
>> Two ideas proposed (aside from disconnects):
>> * Autovacuum is truncating a page on the master and taking an
>> AccessExclusiveLock on the table in use on the replica
>> * A "pin conflict", which I'm still unfamiliar with.
>>
>> The user's response says they are in the first bucket, but the argument
>> relies on max_standby_streaming_delay set to -1, while mine is 5 minutes. I
>> need to understand pin conflicts better, but the likely scenario Andrew
>> outlined doesn't apply to me. My offending queries were doing bitmap heap
>> scans on a 300GB table.
>>
>> Reading the thread I see Andres ask for the "precise conflict" the user
>> gets -- is there a way I can get that without a datadir? And to re-frame
>> the original question, are there causes of replication lag that
>> max_standby_streaming_delay would not be expected to prevent, that would be
>> resolved by killing long standby queries? If so, what's the best way to
>> confirm?
>>
>> Wyatt
>>
>> On Mon, Nov 19, 2018 at 5:46 PM Wyatt Alt <wyatt(dot)alt(at)gmail(dot)com> wrote:
>>
>>> I've been struggling to eliminate replication lag on a Postgres 9.6.6
>>> instance on Amazon RDS. I believe the lag is caused by early cleanup
>>> conflicts from vacuums on the master, because I can reliably resolve it by
>>> killing long-running queries on the standby. I most recently saw ten hours
>>> of lag on Saturday and addressed it this way.
>>>
>>> The standby is running with
>>> hot_standby_feedback = on
>>> max_standby_streaming_delay = 5min
>>> max_standby_archive_delay = 30s
>>>
>>> I am not using replication slots on the primary due to reported negative
>>> interactions with pg_repack on large tables.
>>>
>>> My rationale for the first two settings is that hot_standby_feedback
>>> should address my issues almost all the time, but that
>>> max_standby_streaming_delay would sometimes be necessary as a fallback, for
>>> instance in cases of a transient connection loss between the standby and
>>> primary. I believe these settings are mostly working, because lag is less
>>> frequent than it was when I configured them.
>>>
>>> My questions are,
>>> * Am I overlooking anything in my configuration?
>>> * What would explain lag caused by query conflicts given the
>>> max_standby_streaming_delay setting? Shouldn't those queries be getting
>>> killed?
>>> * Is there any particular diagnostic info I should be collecting on the
>>> next occurrence, to help me figure out the cause? Note that as I'm on RDS,
>>> I don't have direct access to the datadir -- just psql.
>>>
>>> Thanks for any advice!
>>> Wyatt
>>>
>>
>
> --
> El genio es 1% inspiración y 99% transpiración.
> Thomas Alva Edison
> http://pglearn.blogspot.mx/
>
>
--
El genio es 1% inspiración y 99% transpiración.
Thomas Alva Edison
http://pglearn.blogspot.mx/
From | Date | Subject | |
---|---|---|---|
Next Message | Rene Romero Benavides | 2018-11-21 19:20:25 | Re: replication lag despite corrective config |
Previous Message | Jean-Marc Lessard | 2018-11-21 18:45:28 | RE: pg_dump out of memory for large table with LOB |