Re: Add an option to skip loading missing publication to avoid logical replication failure

From: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
To: vignesh C <vignesh21(at)gmail(dot)com>
Cc: PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Add an option to skip loading missing publication to avoid logical replication failure
Date: 2025-03-05 05:04:08
Message-ID: CAA4eK1L02YLsBZjdEi0cxnkdj=vvPUerRZQ8Nx6TT=y24=9+Ng@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Tue, Mar 4, 2025 at 12:23 PM vignesh C <vignesh21(at)gmail(dot)com> wrote:
>
> There is almost negligible dip with the above suggested way, the test
> results for the same is given below(execution time is in milli
> seconds):
> Brach/records | 100 | 1000 | 10000 | 100000 | 1000000
> Head | 10.25 | 15.85 | 65.53 | 569.15 | 9194.19
> Patch | 10.25 | 15.84 | 65.91 | 571.75 | 9208.66
> % diff | 0.00 | 0.06 | -0.58 | -0.46 | -0.16
>
> There is a performance dip in the range of 0 to 0.58 percent.
> The attached patch has the changes for the same. The test script used
> is also attached.
>

The patch still needs more review but the change has negligible
performance impact. The next step is to get more opinions on whether
we should add a new subscription option (say
skip_not_existant_publication) for this work. See patch v1-0002-* in
email [1]. The problem summary is explained in email [2] and in the
commit message of the 0001 patch in this thread. But still, let me
write briefly for the ease of others.

The problem is that ALTER SUBSCRIPTION ... SET PUBLICATION ... will
lead to restarting of apply worker, and after the restart, the apply
worker will use the existing slot and replication origin corresponding
to the subscription. Now, it is possible that before the restart, the
origin has not been updated, and the WAL start location points to a
location before where PUBLICATION pointed to by SET PUBLICATION
exists. This leads to an error: "ERROR: publication "pub1" does not
exist". Once this error occurs, apply worker will never be able to
proceed and will always return the same error. For users, this is a
problem because they would have created a publication before executing
ALTER SUBSCRIPTION ... SET PUBLICATION .. and now they have no way to
proceed.

The solution we came up with is to skip loading the publication if the
publication does not exist. We load the publication later and update
the relation entry when the publication gets created.

The two main concerns with this idea, as shared in email [3], are
performance implications of this change and the possibility of current
behaviour expectations from the users.

We came up with a solution where the performance impact is negligible,
as shown in the tests [4]. For that, we won't try to reload the
skipped/missing publication for each change but will attempt it only
when any new publication is created/dropped for a valid relation entry
in RelationSyncCache (maintained by pgoutput).

The new option skip_not_existant_publication is to address the second
concern "Imagine you have a subscriber using two publications p1 and
p2, and someone comes around and drops p1 by mistake. With the
proposed patch, the subscription will notice this, but it'll continue
sending data ignoring the missing publication. Yes, it will continue
working, but it's quite possible this breaks the subscriber and it's
be better to fail and stop replicating.".

I see the point of adding such an option to avoid breaking the current
applications (if there are any) that are relying on current behaviour.
But OTOH, I am not sure if users expect us to fail explicitly in such
scenarios.

This is a long-standing behaviour for which we get reports from time
to time, and once analyzing a failure, Tom also looked at it and
agreed that we don't have much choice to avoid skipping non-existent
publications [5]. But we never concluded as to whether skipping should
be a default behavior or an optional one. So, we need more opinions on
it.

Thoughts?

[1] - https://www.postgresql.org/message-id/CALDaNm0-n8FGAorM%2BbTxkzn%2BAOUyx5%3DL_XmnvOP6T24%2B-NcBKg%40mail.gmail.com
[2] - https://www.postgresql.org/message-id/CAA4eK1Lc%3DNDV1HrY2gNasFK90MtysnA575a%2Brd0p%2BPOjXN%2BSpw%40mail.gmail.com
[3] - https://www.postgresql.org/message-id/dc08add3-10a8-738b-983a-191c7406707b%40enterprisedb.com
[4] - https://www.postgresql.org/message-id/CALDaNm2Xkm1M-ik2RLJZ9rMhW2zW2GRLL6ePyZJbXcAjOVwzXg%40mail.gmail.com
[5] - https://www.postgresql.org/message-id/631312.1707251789%40sss.pgh.pa.us

--
With Regards,
Amit Kapila.

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Alex Friedman 2025-03-05 05:06:10 Re: Doc fix of aggressive vacuum threshold for multixact members storage
Previous Message Michael Paquier 2025-03-05 04:45:12 Re: [PATCH] pg_stat_activity: make slow/hanging authentication more visible