Re: Add an option to skip loading missing publication to avoid logical replication failure

From: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
To: Dilip Kumar <dilipbalaut(at)gmail(dot)com>
Cc: vignesh C <vignesh21(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Add an option to skip loading missing publication to avoid logical replication failure
Date: 2025-03-10 05:24:06
Message-ID: CAA4eK1KxSbo0XG4xXRm_6VX6Q5U1w1w8s3qbX8dKCKaSvw2dfQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Mon, Mar 10, 2025 at 10:15 AM Dilip Kumar <dilipbalaut(at)gmail(dot)com> wrote:
>
> On Mon, Mar 10, 2025 at 9:33 AM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
>>
>> On Tue, Mar 4, 2025 at 6:54 PM vignesh C <vignesh21(at)gmail(dot)com> wrote:
>> >
>> > On further thinking, I felt the use of publications_updated variable
>> > is not required we can use publications_valid itself which will be set
>> > if the publication system table is invalidated. Here is a patch for
>> > the same.
>> >
>>
>> The patch relies on the fact that whenever a publication's data is
>> invalidated, it will also invalidate all the RelSyncEntires as per
>> publication_invalidation_cb. But note that we are discussing removing
>> that inefficiency in the thread [1]. So, we should try to rebuild the
>> entry when we have skipped the required publication previously.
>>
>> Apart from this, please consider updating the docs, as mentioned in my
>> response to Sawada-San's email.
>
>
> I'm not sure I fully understand it, but based on your previous email and the initial email from Vignesh, if IIUC, the issue occurs when a publication is created after a certain LSN. When ALTER SUBSCRIPTION ... SET PUBLICATION is executed, the subscriber workers restart and request the changes based on restart_lsn, which is at an earlier LSN in the WAL than the LSN at which the publication was created. This leads to an error, and we are addressing this behavior as part of the fix by skipping the changes which are between the restart_lsn of subscriber and the lsn at which publication is created and this behavior looks fine.
>

Yes, your understanding is correct, but note that as such, the patch
is simply skipping the missing publication. The skipped changes are
because those were on the table that is not part of any publication
w.r.t historic snapshot we have at the point of time.

>> BTW, I am planning to commit this only on HEAD as this is a behavior
>> change. Please let me know if you guys think otherwise.
>
>
> Somehow this looks like a bug fix which should be backported no? Am I missing something?
>

We can consider this a bug-fix and backpatch it, but I am afraid that
because this is a behavior change, some users may not like it. Also, I
don't remember seeing public reports for this behavior; that is
probably because it is hard to hit. FYI, we found this via BF
failures. So, I thought it would be better to get this field tested
via HEAD only at this point in time.

--
With Regards,
Amit Kapila.

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Michael Paquier 2025-03-10 05:25:28 Re: Back-patch of: avoid multiple hard links to same WAL file after a crash
Previous Message torikoshia 2025-03-10 05:10:27 Re: RFC: Logging plan of the running query