Re: Taking into account syncrep position in flush_lsn reported by apply worker

From: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
To: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
Cc: Arseny Sher <ars(at)neon(dot)tech>, pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject: Re: Taking into account syncrep position in flush_lsn reported by apply worker
Date: 2024-08-21 07:10:50
Message-ID: f592bea4-b07d-462c-a915-bb23485d6826@iki.fi
Lists: pgsql-hackers

On 21/08/2024 09:25, Amit Kapila wrote:
> On Wed, Aug 21, 2024 at 2:25 AM Heikki Linnakangas <hlinnaka(at)iki(dot)fi> wrote:
>>
>> On 14/08/2024 16:54, Arseny Sher wrote:
>>> On 8/13/24 06:35, Amit Kapila wrote:
>>>> On Mon, Aug 12, 2024 at 3:43 PM Arseny Sher <ars(at)neon(dot)tech> wrote:
>>>>>
>>>>> Sorry for the poor formatting of the message above, this should be
>>>>> better:
>>>>>
>>>>> Hey. Currently synchronous_commit is disabled for logical apply worker
>>>>> on the ground that reported flush_lsn includes only locally flushed data
>>>>> so slot (publisher) preserves everything higher than this, and so in
>>>>> case of subscriber restart no data is lost. However, imagine that
>>>>> subscriber is made highly available by standby to which synchronous
>>>>> replication is enabled. Then reported flush_lsn is ignorant of this
>>>>> synchronous replication progress, and in case of failover data loss may
>>>>> occur if subscriber managed to ack flush_lsn ahead of syncrep.
>>>>
>>>> Can't the same be achieved by enabling the synchronous_commit
>>>> parameter for a subscription?
>>>
>>> Nope, because it would force WAL flush and wait for replication to the
>>> standby in the apply worker, slowing it down. The logic missing
>>> currently is not to wait for the synchronous commit, but still mind its
>>> progress in the flush_lsn reporting.
>>
>> I think this patch makes sense. I'm not sure we've actually made any
>> promises on it, but it feels wrong that the slot's LSN might be advanced
>> past the LSN that's been acknowledged by the replica, if
>> synchronous replication is configured. I see little downside in making
>> that promise.
>
> One possible downside of such a promise could be that the publisher
> may slow down for sync replication because it has to wait for all the
> configured sync_standbys of subscribers to acknowledge the LSN. I
> don't know how many applications can be impacted due to this if we do
> it by default but if we feel there won't be any such cases or they
> will be in the minority then it is okay to proceed with this.

It only slows down updating the flush LSN on the publisher, which is
updated quite lazily anyway.

A more serious scenario is if the sync replica crashes or is not
responding at all. In that case, the flush LSN on the publisher cannot
advance, and WAL starts to accumulate. However, if a sync replica is not
responding, that's very painful for the (subscribing) server anyway: all
commits will hang waiting for the replica. Holding back the flush LSN on
the publisher seems like a minor problem compared to that.
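
To make the behaviour under discussion concrete, here is a minimal sketch of the rule as I understand it, not the patch itself. The function name reported_flush_lsn and the use of plain integers for LSNs are assumptions made for illustration: when synchronous replication is configured, the flush position reported to the publisher is capped by what the sync replica has acknowledged.

```python
from typing import Optional


def reported_flush_lsn(local_flush: int, syncrep_flush: Optional[int]) -> int:
    """Return the flush LSN the apply worker would report to the publisher.

    local_flush   -- LSN flushed to local disk on the subscriber
    syncrep_flush -- LSN acknowledged by the synchronous replica, or None
                     when synchronous replication is not configured
    (Hypothetical model: LSNs are plain integers here.)
    """
    if syncrep_flush is None:
        # No sync replica: only the local flush position matters.
        return local_flush
    # Never report past what the sync replica has acknowledged.
    return min(local_flush, syncrep_flush)


# Replica stuck at 250 while the subscriber keeps flushing locally:
# the reported position stays at 250, so the publisher retains WAL.
stalled = [reported_flush_lsn(lsn, 250) for lsn in (300, 400, 500)]
```

With a replica that never advances, every call returns the same value no matter how far the local flush moves, which is the WAL accumulation case described above.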

It would be good to have some kind of escape hatch though. If you get
into that situation, is there a way to advance the publisher's flush LSN
even though the synchronous replica has crashed? You can temporarily
turn off synchronous replication on the subscriber. That will release
any stuck COMMITs on the subscriber too. In theory you might not want
that, but in practice stuck COMMITs are so painful that if you are taking
manual action, you probably do want to release them as well.
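
As a toy model of that escape hatch (all names here are hypothetical and this is not PostgreSQL code), the sketch below tracks a subscriber whose sync replica has stopped responding. On a real server, turning synchronous replication off would typically mean emptying synchronous_standby_names and reloading the configuration; the model only mirrors the two effects discussed above: stuck commits are released and the reported flush LSN can advance again.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Subscriber:
    """Hypothetical model of a subscriber with one synchronous replica."""
    local_flush: int = 0
    standby_flush: int = 0
    sync_enabled: bool = True
    waiting_commits: List[int] = field(default_factory=list)

    def commit(self, lsn: int) -> None:
        # The commit is flushed locally right away ...
        self.local_flush = max(self.local_flush, lsn)
        # ... but hangs until the sync replica has acknowledged it.
        if self.sync_enabled and self.standby_flush < lsn:
            self.waiting_commits.append(lsn)

    def reported_flush_lsn(self) -> int:
        # With sync replication on, do not report past the replica.
        if self.sync_enabled:
            return min(self.local_flush, self.standby_flush)
        return self.local_flush

    def disable_sync_replication(self) -> List[int]:
        # Escape hatch: stop waiting for the replica, release stuck commits.
        self.sync_enabled = False
        released, self.waiting_commits = self.waiting_commits, []
        return released


sub = Subscriber(standby_flush=100)   # replica acknowledged up to 100, then died
sub.commit(100)                       # already acknowledged, does not wait
sub.commit(200)                       # hangs: replica never acknowledges 200
before = sub.reported_flush_lsn()     # held back at the replica's position
released = sub.disable_sync_replication()
after = sub.reported_flush_lsn()      # free to follow the local flush again
```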

--
Heikki Linnakangas
Neon (https://neon.tech)
