RE: Subscription sometimes loses txns after initial table sync

From:	"Zhijie Hou (Fujitsu)" <houzj(dot)fnst(at)fujitsu(dot)com>
To:	Pritam Baral <pritam(at)pritambaral(dot)com>
Cc:	pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	RE: Subscription sometimes loses txns after initial table sync
Date:	2024-12-10 01:53:51
Message-ID:	OS0PR01MB5716CC07BD28832B02D86F52943D2@OS0PR01MB5716.jpnprd01.prod.outlook.com
Lists:	pgsql-hackers

On Monday, December 9, 2024 9:21 PM Pritam Baral <pritam(at)pritambaral(dot)com> wrote:
> To: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
> Subject: Subscription sometimes loses txns after initial table sync
>
> This was discovered when testing the plan for a major version upgrade via
> logical replication. Said plan requires that some tables be synced before
> others. So I implemented it using ALTER PUBLICATION ... ADD TABLE ... followed
> by ALTER SUBSCRIPTION ... REFRESH PUBLICATION. A test for correctness revealed
> that sometimes, for some tables added this way, txns after the initial data copy
> are lost by the subscription.
>
> A reproducer script is attached. It has been tested with PG 17.2, 14.15, and
> even 12.22 (on either side of the replication setup). The script runs at a
> default scale of 100 tables with 10k inserts each. This scale is enough to
> demonstrate a failure rate of 1% to 9% of tables on my modest laptop.
>
> In attempts to analyse why this happens, it has been observed that the sender
> sometimes does not pick up a published table, even when the receiver that
> started the sender process has seen the table as available (as returned by
> pg_get_publication_tables()) and has thus begun COPYing its data. When the COPY
> finishes (and the tablesync worker is finished), the apply loop on the receiver
> expects to receive (and apply) subsequent changes for such tables, but simply
> isn't sent any. This was observed by dumping every CopyData message sent over
> the wire.
>
> The attached script (like the original migration plan) uses a single publication
> and adds tables to it successively. Curiously, when the script was changed to
> use a dedicated publication per table (and thus, ALTER SUBSCRIPTION ... ADD
> PUBLICATION instead of ALTER SUBSCRIPTION ... REFRESH PUBLICATION), the no. of
> tables with data loss jumped to 100%.
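
For readers without the attachment, the staged workflow quoted above can be sketched roughly as follows. This is only an illustration and not the reporter's reproducer script: the helper names, the publication name `pub`, and the subscription name `sub` are assumptions. The sketch only builds the statement sequence and compares row counts that the caller has already collected from each side.

```python
# Hypothetical sketch; NOT the reproducer attached to the original report.
# The names "pub" and "sub" and both helper functions are assumptions.

def staged_sync_statements(tables, publication="pub", subscription="sub"):
    """Yield (node, sql) pairs that add tables to one publication in order.

    Table names are interpolated as-is, so they are assumed to be
    trusted, already-quoted identifiers.
    """
    for table in tables:
        # Publisher side: start publishing the next table.
        yield ("publisher", f"ALTER PUBLICATION {publication} ADD TABLE {table};")
        # Subscriber side: notice the new table and begin its initial copy.
        yield ("subscriber", f"ALTER SUBSCRIPTION {subscription} REFRESH PUBLICATION;")


def tables_with_lost_rows(publisher_counts, subscriber_counts):
    """Return the tables whose subscriber row count is below the publisher's."""
    return sorted(
        table
        for table, expected in publisher_counts.items()
        if subscriber_counts.get(table, 0) < expected
    )
```

A row-count comparison like this is only meaningful once writes on the publisher have stopped and the subscription has had time to catch up; otherwise ordinary replication lag would be reported as loss.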

Thanks for reporting the issue.

The described behavior looks similar to another bug discussed in [1]. If
possible, could you please check whether the latest patch in that thread fixes
the bug you reported?

If it does, it would be helpful to share that feedback in the thread.

[1] https://www.postgresql.org/message-id/flat/de52b282-1166-1180-45a2-8d8917ca74c6%40enterprisedb.com

Best Regards,
Hou zj

In response to

Subscription sometimes loses txns after initial table sync at 2024-12-09 13:20:41 from Pritam Baral

Responses

Re: Subscription sometimes loses txns after initial table sync at 2024-12-11 04:50:49 from Shlok Kyal
