RE: Subscription sometimes loses txns after initial table sync

From: "Zhijie Hou (Fujitsu)" <houzj(dot)fnst(at)fujitsu(dot)com>
To: Pritam Baral <pritam(at)pritambaral(dot)com>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: RE: Subscription sometimes loses txns after initial table sync
Date: 2024-12-10 01:53:51
Message-ID: OS0PR01MB5716CC07BD28832B02D86F52943D2@OS0PR01MB5716.jpnprd01.prod.outlook.com
Lists: pgsql-hackers

On Monday, December 9, 2024 9:21 PM Pritam Baral <pritam(at)pritambaral(dot)com> wrote:
> To: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
> Subject: Subscription sometimes loses txns after initial table sync
>
> This was discovered when testing the plan for a major version upgrade via
> logical replication. Said plan requires that some tables be synced before
> others. So I implemented it using ALTER PUBLICATION ... ADD TABLE ... followed
> by ALTER SUBSCRIPTION ... REFRESH PUBLICATION. A test for correctness revealed
> that sometimes, for some tables added this way, txns after the initial data copy
> are lost by the subscription.
>
> A reproducer script is attached. It has been tested with PG 17.2, 14.15, and
> even 12.22 (on either side of the replication setup). The script runs at a
> default scale of 100 tables with 10k inserts each. This scale is enough to
> demonstrate a failure rate of 1% to 9% of tables on my modest laptop.
>
> In attempts to analyse why this happens, it has been observed that the sender
> sometimes does not pick up a published table, even when the receiver that
> started the sender process has seen the table as available (as returned by
> pg_get_publication_tables()) and has thus begun COPYing its data. When the COPY
> finishes (and the tablesync worker is finished), the apply loop on the receiver
> expects to receive (and apply) subsequent changes for such tables, but simply
> isn't sent any. This was observed by dumping every CopyData message sent over
> the wire.
>
> The attached script (like the original migration plan) uses a single publication
> and adds tables to it successively. Curiously, when the script was changed to
> use a dedicated publication per table (and thus, ALTER SUBSCRIPTION ... ADD
> PUBLICATION instead of ALTER SUBSCRIPTION ... REFRESH PUBLICATION), the no. of
> tables with data loss jumped to 100%.

Thanks for reporting the issue.

The described behavior looks similar to another bug discussed in [1]. If
possible, could you please check whether the latest patch in that thread fixes
the bug you reported?

If it does, it would be helpful to share the feedback in that thread.

[1] https://www.postgresql.org/message-id/flat/de52b282-1166-1180-45a2-8d8917ca74c6%40enterprisedb.com

Best Regards,
Hou zj
