Re: conflict with recovery when delay is gone

From: Radoslav Nedyalkov <rnedyalkov(at)gmail(dot)com>
To: Mohamed Wael Khobalatte <mkhobalatte(at)grubhub(dot)com>
Cc: pgsql-general <pgsql-general(at)postgresql(dot)org>
Subject: Re: conflict with recovery when delay is gone
Date: 2020-11-15 11:49:33
Message-ID: CANhtRia0Gu+qVVHoUWtj59pDN8yqowSC4qjmqCrLMgMPR-=pHQ@mail.gmail.com
Lists: pgsql-general

On Sun, Nov 15, 2020 at 12:48 AM Mohamed Wael Khobalatte <mkhobalatte(at)grubhub(dot)com> wrote:

>
>
> On Sat, Nov 14, 2020 at 2:46 PM Radoslav Nedyalkov <rnedyalkov(at)gmail(dot)com>
> wrote:
>
>>
>>
>> On Fri, Nov 13, 2020 at 8:13 PM Radoslav Nedyalkov <rnedyalkov(at)gmail(dot)com>
>> wrote:
>>
>>>
>>>
>>> On Fri, Nov 13, 2020 at 7:37 PM Laurenz Albe <laurenz(dot)albe(at)cybertec(dot)at>
>>> wrote:
>>>
>>>> On Fri, 2020-11-13 at 15:24 +0200, Radoslav Nedyalkov wrote:
>>>> > On a very busy master-standby setup which runs typical OLAP processing
>>>> > (long-running, write-heavy statements), we're getting on the standby:
>>>> >
>>>> > ERROR: canceling statement due to conflict with recovery
>>>> > FATAL: terminating connection due to conflict with recovery
>>>> >
>>>> > The weird thing is that cancellations usually happen after the standby
>>>> > has experienced a huge delay (2h), still below the allowed maximum (3h).
>>>> > Even recently started statements get cancelled when the delay is already
>>>> > back at zero.
>>>> >
>>>> > Sometimes the situation eases after an hour or so.
>>>> > Restarting the server helps instantly.
>>>> >
>>>> > It is PostgreSQL 11.8 on CentOS 7 with huge pages, shared_buffers 196G out of 748G.
>>>> >
>>>> > What phenomenon could we be facing?
>>>>
>>>> Hard to say. Perhaps an unusual kind of replication conflict?
>>>>
>>>> What is in "pg_stat_database_conflicts" on the standby server?
>>>>
>>>
>>> db01=# select * from pg_stat_database_conflicts;
>>>  datid |  datname  | confl_tablespace | confl_lock | confl_snapshot | confl_bufferpin | confl_deadlock
>>> -------+-----------+------------------+------------+----------------+-----------------+----------------
>>>  13877 | template0 |                0 |          0 |              0 |               0 |              0
>>>  16400 | template1 |                0 |          0 |              0 |               0 |              0
>>>  16402 | postgres  |                0 |          0 |              0 |               0 |              0
>>>  16401 | db01      |                0 |          0 |             51 |               0 |              0
>>> (4 rows)
>>>
>>> On a freshly restarted standby we've just seen similar behaviour after a
>>> 2-hour delay and a slow catch-up.
>>> confl_snapshot is 51 and we have exactly the same number of cancelled
>>> statements.
>>>
>>>
>> No luck so far. Searching for an explanation, I found that we fall into the
>> unexplained case where
>> snapshot conflicts happen even though hot_standby_feedback is on.
>>
>> Thanks,
>> Rado
>>
>>
>
> Perhaps you have a value set for old_snapshot_threshold? If not, do the
> walreceiver connections drop out?
>

old_snapshot_threshold is -1 on both master and replica.
walreceiver does not drop.
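
For anyone following the thread, here is a minimal sketch of how the counters from pg_stat_database_conflicts could be summarised per database to see which conflict type accounts for the cancellations. The helper name dominant_conflicts and the hard-coded rows are illustrative assumptions (the values are copied from the output quoted above); in practice the rows would come from querying the view on the standby.

```python
# Hypothetical helper, not part of PostgreSQL: summarises the conflict
# counters of pg_stat_database_conflicts per database.
CONFLICT_COLUMNS = (
    "confl_tablespace",
    "confl_lock",
    "confl_snapshot",
    "confl_bufferpin",
    "confl_deadlock",
)


def dominant_conflicts(rows):
    """Return {datname: (column, count)} for databases with any conflicts."""
    result = {}
    for row in rows:
        counts = {col: int(row[col]) for col in CONFLICT_COLUMNS}
        # Pick the conflict type with the highest counter for this database.
        column, count = max(counts.items(), key=lambda kv: kv[1])
        if count > 0:
            result[row["datname"]] = (column, count)
    return result


# Values copied from the standby output quoted earlier in this thread.
raw = [
    ("template0", 0, 0, 0, 0, 0),
    ("template1", 0, 0, 0, 0, 0),
    ("postgres", 0, 0, 0, 0, 0),
    ("db01", 0, 0, 51, 0, 0),
]
rows = [dict(zip(("datname",) + CONFLICT_COLUMNS, values)) for values in raw]

print(dominant_conflicts(rows))  # {'db01': ('confl_snapshot', 51)}
```

With the numbers above, only db01 shows up, and all 51 conflicts are snapshot conflicts, which matches the count of cancelled statements.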
