Re: Fix slot synchronization with two_phase decoding enabled

From: Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>
To: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
Cc: Nisha Moond <nisha(dot)moond412(at)gmail(dot)com>, "Zhijie Hou (Fujitsu)" <houzj(dot)fnst(at)fujitsu(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Fix slot synchronization with two_phase decoding enabled
Date: 2025-04-25 16:26:40
Message-ID: CAD21AoDtqcEHEneo5tbdG1NAG3hqwoxF4RZROPKB3W3PHgh9dg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Fri, Apr 25, 2025 at 3:43 AM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
>
> On Fri, Apr 25, 2025 at 6:02 AM Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com> wrote:
> >
> > I realized that users who create a logical slot using
> > pg_create_logical_replication_slot() would not be able to enable both
> > options at slot creation, and there is no easy way to enable the
> > failover after two_phase-enabled-slot creation. Users would need to
> > use ALTER_REPLICATION_SLOT replication command, which seems
> > unrealistics for users to use. On the other hand, if we allow creating
> > a logical slot with enabling failover and two_phase using SQL API,
> > there is still a chance for this bug to occur. Would it be worth
> > considering that if a logical slot is created with enabling failover
> > and two_phase using SQL API, we create the slot with only
> > two_phase=true, then advance the slot until the slot satisfies
> > restart_lsn >= two_phase_at, and then enable the failover?
> >
>
> This means we either need to maintain somewhere that user has provided
> failover flag till restart_lsn >= two_phase_at or and then set
> failover flag in the slot

I was thinking of this idea.

> or initially mark it but enable the
> functionality of failover when we reach the condition restart_lsn >=
> two_phase_at.

IIUC the slot could be synchronized to the standby as soon as we
complete DecodingContextFindStartpoint() for a failover-enabled slot.
So we would need some mechanisms to make sure that the slot is not
synchronized while we're waiting to reach the condition restart_lsn >=
two_phase_at even if the failover is enabled.

> Both seem to have different kinds of problems. The first
> idea seems to have an issue with persistence, which means we can lose
> track of the flag after the restart.

I think we can do this series of operations while the slot is not
persistent, that is the slot is still RS_EPHEMERAL.

> The second can mislead the user
> for a long period in cases where prepare and commit have a large time
> gap. I feel this will introduce complexity either in the form of code
> or in giving the information to the user.

Agreed. Both ways introduce complexity so we need to consider the
user-unfriendliness (by not having a proper way to enable failover for
the two_phase-enabled-slot using SQL API) vs. risk (of introducing
complexity).

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Andres Freund 2025-04-25 16:44:14 Re: AIO v2.5
Previous Message Xuneng Zhou 2025-04-25 16:17:31 Re: Add pg_buffercache_mark_dirty[_all] functions to the pg_buffercache