From: | "Zhijie Hou (Fujitsu)" <houzj(dot)fnst(at)fujitsu(dot)com> |
---|---|
To: | Pritam Baral <pritam(at)pritambaral(dot)com> |
Cc: | pgsql-hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | RE: Subscription sometimes loses txns after initial table sync |
Date: | 2024-12-10 01:53:51 |
Message-ID: | OS0PR01MB5716CC07BD28832B02D86F52943D2@OS0PR01MB5716.jpnprd01.prod.outlook.com |
Lists: | pgsql-hackers |
On Monday, December 9, 2024 9:21 PM Pritam Baral <pritam(at)pritambaral(dot)com> wrote:
> To: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
> Subject: Subscription sometimes loses txns after initial table sync
>
> This was discovered when testing the plan for a major version upgrade via
> logical replication. Said plan requires that some tables be synced before
> others. So I implemented it using ALTER PUBLICATION ... ADD TABLE ... followed
> by ALTER SUBSCRIPTION ... REFRESH PUBLICATION. A test for correctness revealed
> that sometimes, for some tables added this way, txns after the initial data copy
> are lost by the subscription.
>
> A reproducer script is attached. It has been tested with PG 17.2, 14.15, and
> even 12.22 (on either side of the replication setup). The script runs at a
> default scale of 100 tables with 10k inserts each. This scale is enough to
> demonstrate a failure rate of 1% to 9% of tables on my modest laptop.
>
> In attempts to analyse why this happens, it has been observed that the sender
> sometimes does not pick up a published table, even when the receiver that
> started the sender process has seen the table as available (as returned by
> pg_get_publication_tables()) and has thus begun COPYing its data. When the COPY
> finishes (and the tablesync worker is finished), the apply loop on the receiver
> expects to receive (and apply) subsequent changes for such tables, but simply
> isn't sent any. This was observed by dumping every CopyData message sent over
> the wire.
>
> The attached script (like the original migration plan) uses a single publication
> and adds tables to it successively. Curiously, when the script was changed to
> use a dedicated publication per table (and thus, ALTER SUBSCRIPTION ... ADD
> PUBLICATION instead of ALTER SUBSCRIPTION ... REFRESH PUBLICATION), the no. of
> tables with data loss jumped to 100%.
Thanks for reporting the issue.

The described behavior looks similar to another bug discussed in [1]. If
possible, could you please check whether the latest patch in that thread fixes
the bug you reported? If it does, it would be helpful to share the feedback in
that thread.

[1] https://www.postgresql.org/message-id/flat/de52b282-1166-1180-45a2-8d8917ca74c6%40enterprisedb.com

Best Regards,
Hou zj