Re: Subscription sometimes loses txns after initial table sync

From: Shlok Kyal <shlok(dot)kyal(dot)oss(at)gmail(dot)com>
To: "Zhijie Hou (Fujitsu)" <houzj(dot)fnst(at)fujitsu(dot)com>
Cc: Pritam Baral <pritam(at)pritambaral(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Subscription sometimes loses txns after initial table sync
Date: 2024-12-11 04:50:49
Message-ID: CANhcyEUcY8YxMC0zBS3WQWxUkcTJQ_80rzV8Eu1y2e-sFVxLrg@mail.gmail.com
Lists: pgsql-hackers

On Tue, 10 Dec 2024 at 07:24, Zhijie Hou (Fujitsu)
<houzj(dot)fnst(at)fujitsu(dot)com> wrote:
>
> On Monday, December 9, 2024 9:21 PM Pritam Baral <pritam(at)pritambaral(dot)com> wrote:
> > To: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
> > Subject: Subscription sometimes loses txns after initial table sync
> >
> > This was discovered when testing the plan for a major version upgrade via
> > logical replication. Said plan requires that some tables be synced before
> > others. So I implemented it using ALTER PUBLICATION ... ADD TABLE ...
> > followed by ALTER SUBSCRIPTION ... REFRESH PUBLICATION. A test for
> > correctness revealed that sometimes, for some tables added this way, txns
> > after the initial data copy are lost by the subscription.
> >
> > A reproducer script is attached. It has been tested with PG 17.2, 14.15, and
> > even 12.22 (on either side of the replication setup). The script runs at a
> > default scale of 100 tables with 10k inserts each. This scale is enough to
> > demonstrate a failure rate of 1% to 9% of tables on my modest laptop.
> >
> > In attempts to analyse why this happens, it has been observed that the sender
> > sometimes does not pick up a published table, even when the receiver that
> > started the sender process has seen the table as available (as returned by
> > pg_get_publication_tables()) and has thus begun COPYing its data. When the
> > COPY finishes (and the tablesync worker is finished), the apply loop on the
> > receiver expects to receive (and apply) subsequent changes for such tables,
> > but simply isn't sent any. This was observed by dumping every CopyData
> > message sent over the wire.
> >
> > The attached script (like the original migration plan) uses a single publication
> > and adds tables to it successively. Curiously, when the script was changed to
> > use a dedicated publication per table (and thus, ALTER SUBSCRIPTION ...
> > ADD PUBLICATION instead of ALTER SUBSCRIPTION ... REFRESH PUBLICATION),
> > the no. of tables with data loss jumped to 100%.
>
> Thanks for reporting the issue.
>
> The described behavior looks similar to another bug discussed in [1]. If
> possible, could you please check if the latest patch in that thread can fix the
> bug you reported?
>
> If it does, it would be helpful to share the feedback in that thread.
>
> [1] https://www.postgresql.org/message-id/flat/de52b282-1166-1180-45a2-8d8917ca74c6%40enterprisedb.com
>

Hi,

I tried to reproduce the issue on the HEAD and REL_17_STABLE branches and
found that it is intermittent for me. I ran the script provided in [1]
50 times on both branches and reproduced the issue 4 times on HEAD and
5 times on REL_17_STABLE.
I then applied the patches in [2] to both branches and ran the script
50 times again. I was not able to reproduce the issue with the patches
applied.
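
As a rough sanity check on the sample size, the sketch below estimates how
likely 50 clean runs would be by luck alone if the patches had changed
nothing. It is only an illustration under stated assumptions: that "50 times"
means 50 runs per branch, that runs are independent, and that the true
per-run failure rate equals the observed one. None of these is established
by the tests above.

```python
# Counts taken from the message; "50 runs per branch" is an assumption.
RUNS = 50
observed_failures = {"HEAD": 4, "REL_17_STABLE": 5}


def chance_of_clean_sweep(failures, runs, patched_runs):
    """Probability of zero failures in `patched_runs` independent runs,
    assuming the per-run failure rate is still failures / runs."""
    rate = failures / runs
    return (1.0 - rate) ** patched_runs


results = {
    branch: chance_of_clean_sweep(failures, RUNS, RUNS)
    for branch, failures in observed_failures.items()
}

for branch, p in results.items():
    rate = observed_failures[branch] / RUNS
    print(f"{branch}: observed rate {rate:.0%}, "
          f"chance of {RUNS} clean runs by luck {p:.2%}")
```

Under those assumptions the chance of a false "fixed" result is roughly 1.5%
for HEAD and 0.5% for REL_17_STABLE, so the clean runs are suggestive but
not conclusive on their own.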

I think, as Hou-san suggested, that the patches in [2] can fix this issue.

[1]: https://www.postgresql.org/message-id/8b595156-d8b6-4b53-a788-7d945726cd2f%40pritambaral.com
[2]: https://www.postgresql.org/message-id/flat/de52b282-1166-1180-45a2-8d8917ca74c6%40enterprisedb.com

Thanks and Regards,
Shlok Kyal
