Quick Links

Re: Add an option to skip loading missing publication to avoid logical replication failure

From:	Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>
To:	Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
Cc:	vignesh C <vignesh21(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject:	Re: Add an option to skip loading missing publication to avoid logical replication failure
Date:	2025-03-09 03:29:49
Message-ID:	CAD21AoAXs93QdXyg6T5V7L0n+TG5-Aa3JN0cj1E0+oX5Swtv=g@mail.gmail.com
Views:	Whole Thread \| Raw Message \| Download mbox \| Resend email
Thread:
Lists:	pgsql-hackers

On Tue, Mar 4, 2025 at 9:04 PM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
>
> On Tue, Mar 4, 2025 at 12:23 PM vignesh C <vignesh21(at)gmail(dot)com> wrote:
> >
> > There is almost negligible dip with the above suggested way, the test
> > results for the same is given below(execution time is in milli
> > seconds):
> > Brach/records | 100 | 1000 | 10000 | 100000 | 1000000
> > Head | 10.25 | 15.85 | 65.53 | 569.15 | 9194.19
> > Patch | 10.25 | 15.84 | 65.91 | 571.75 | 9208.66
> > % diff | 0.00 | 0.06 | -0.58 | -0.46 | -0.16
> >
> > There is a performance dip in the range of 0 to 0.58 percent.
> > The attached patch has the changes for the same. The test script used
> > is also attached.
> >
>
> The patch still needs more review but the change has negligible
> performance impact. The next step is to get more opinions on whether
> we should add a new subscription option (say
> skip_not_existant_publication) for this work. See patch v1-0002-* in
> email [1]. The problem summary is explained in email [2] and in the
> commit message of the 0001 patch in this thread. But still, let me
> write briefly for the ease of others.
>
> The problem is that ALTER SUBSCRIPTION ... SET PUBLICATION ... will
> lead to restarting of apply worker, and after the restart, the apply
> worker will use the existing slot and replication origin corresponding
> to the subscription. Now, it is possible that before the restart, the
> origin has not been updated, and the WAL start location points to a
> location before where PUBLICATION pointed to by SET PUBLICATION
> exists. This leads to an error: "ERROR: publication "pub1" does not
> exist". Once this error occurs, apply worker will never be able to
> proceed and will always return the same error. For users, this is a
> problem because they would have created a publication before executing
> ALTER SUBSCRIPTION ... SET PUBLICATION .. and now they have no way to
> proceed.
>
> The solution we came up with is to skip loading the publication if the
> publication does not exist. We load the publication later and update
> the relation entry when the publication gets created.
>
> The two main concerns with this idea, as shared in email [3], are
> performance implications of this change and the possibility of current
> behaviour expectations from the users.
>
> We came up with a solution where the performance impact is negligible,
> as shown in the tests [4]. For that, we won't try to reload the
> skipped/missing publication for each change but will attempt it only
> when any new publication is created/dropped for a valid relation entry
> in RelationSyncCache (maintained by pgoutput).

Thank you for summarizing the issue. That helps catch up a lot.

>
> The new option skip_not_existant_publication is to address the second
> concern "Imagine you have a subscriber using two publications p1 and
> p2, and someone comes around and drops p1 by mistake. With the
> proposed patch, the subscription will notice this, but it'll continue
> sending data ignoring the missing publication. Yes, it will continue
> working, but it's quite possible this breaks the subscriber and it's
> be better to fail and stop replicating.".

I think that in this particular situation the current behavior would
be likely to miss more changes than the patch'ed behavior case.

After the logical replication stops, the user would have to alter the
subscription to subscribe to only p1 by executing 'ALTER SUBSCRIPTION
... SET PUBLICATION p1' in order to resume the logical replication. In
any case, the publisher might be receiving further changes but the
subscriber would end up missing changes for tables associated with p2
generated while p2 doesn't exist. Even if the user re-creates the
publication p2 after that, it would be hard for users to re-alter the
subscription to get changes for tables associated with p1 and p2 from
the exact point of p2 being created. Therefore, the subscriber could
end up missing some changes that happened between 'CREATE PUBLICATION
p2' and 'ALTER SUBSCRIPTION ... SET PUBLICATION p1, p2'.

On the other hand, with the patch, the publication can send the
changes to tables associated with p1 and p2 as soon as it decodes the
WAL record of re-CREATE PUBLICATION p2.

>
> I see the point of adding such an option to avoid breaking the current
> applications (if there are any) that are relying on current behaviour.
> But OTOH, I am not sure if users expect us to fail explicitly in such
> scenarios.

On the side note, with the patch since we ignore missing publications
we will be able to create a subscription with whatever publications
regardless their existence like:

CREATE SUBSCRIPTION ... PUBLICATION pub1, pub2, pub3, pub4, pub5, ..., pub1000;

The walsender corresponding to such subscriber can stream changes as
soon as the listed publications are created on the publisher and
REFRESH PUBLICATION is executed.

>
> This is a long-standing behaviour for which we get reports from time
> to time, and once analyzing a failure, Tom also looked at it and
> agreed that we don't have much choice to avoid skipping non-existent
> publications [5]. But we never concluded as to whether skipping should
> be a default behavior or an optional one. So, we need more opinions on
> it.

I'm leaning toward making the skipping behavior a default as I could
not find a good benefit for the current behavior (i.e., stopping
logical replication due to missing publications).

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com

In response to

Re: Add an option to skip loading missing publication to avoid logical replication failure at 2025-03-05 05:04:08 from Amit Kapila

Responses

Re: Add an option to skip loading missing publication to avoid logical replication failure at 2025-03-10 03:33:54 from Amit Kapila

Browse pgsql-hackers by date

	From	Date	Subject
Next Message	Tatsuo Ishii	2025-03-09 06:39:57	Re: Add RESPECT/IGNORE NULLS and FROM FIRST/LAST options
Previous Message	Alexander Korotkov	2025-03-09 02:53:29	Re: pgsql: reindexdb: Add the index-level REINDEX with multiple jobs