Quick Links

Re: long-standing data loss bug in initial sync of logical replication

From:	Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>
To:	Shlok Kyal <shlok(dot)kyal(dot)oss(at)gmail(dot)com>
Cc:	Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, vignesh C <vignesh21(at)gmail(dot)com>, Nitin Motiani <nitinmotiani(at)google(dot)com>, Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject:	Re: long-standing data loss bug in initial sync of logical replication
Date:	2024-12-11 07:06:26
Message-ID:	CAD21AoDoWc8MWTyKtmNF_606bcW6J0gV==r=VmPXKUN-e3o9ew@mail.gmail.com
Views:	Whole Thread \| Raw Message \| Download mbox \| Resend email
Thread:
Lists:	pgsql-hackers

On Tue, Oct 8, 2024 at 2:51 AM Shlok Kyal <shlok(dot)kyal(dot)oss(at)gmail(dot)com> wrote:
>
> On Wed, 31 Jul 2024 at 03:27, Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com> wrote:
> >
> > On Wed, Jul 24, 2024 at 9:53 PM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
> > >
> > > On Wed, Jul 17, 2024 at 5:25 PM vignesh C <vignesh21(at)gmail(dot)com> wrote:
> > > >
> > > > On Wed, 17 Jul 2024 at 11:54, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
> > > > >
> > > > > On Tue, Jul 16, 2024 at 6:54 PM vignesh C <vignesh21(at)gmail(dot)com> wrote:
> > > > >
> > > > > BTW, I noticed that we don't take any table-level locks for Create
> > > > > Publication .. For ALL TABLES (and Drop Publication). Can that create
> > > > > a similar problem? I haven't tested so not sure but even if there is a
> > > > > problem for the Create case, it should lead to some ERROR like missing
> > > > > publication.
> > > >
> > > > I tested these scenarios, and as you expected, it throws an error for
> > > > the create publication case:
> > > > 2024-07-17 14:50:01.145 IST [481526] 481526 ERROR: could not receive
> > > > data from WAL stream: ERROR: publication "pub1" does not exist
> > > > CONTEXT: slot "sub1", output plugin "pgoutput", in the change
> > > > callback, associated LSN 0/1510CD8
> > > > 2024-07-17 14:50:01.147 IST [481450] 481450 LOG: background worker
> > > > "logical replication apply worker" (PID 481526) exited with exit code
> > > > 1
> > > >
> > > > The steps for this process are as follows:
> > > > 1) Create tables in both the publisher and subscriber.
> > > > 2) On the publisher: Create a replication slot.
> > > > 3) On the subscriber: Create a subscription using the slot created by
> > > > the publisher.
> > > > 4) On the publisher:
> > > > 4.a) Session 1: BEGIN; INSERT INTO T1;
> > > > 4.b) Session 2: CREATE PUBLICATION FOR ALL TABLES
> > > > 4.c) Session 1: COMMIT;
> > > >
> > > > Since we are throwing out a "publication does not exist" error, there
> > > > is no inconsistency issue here.
> > > >
> > > > However, an issue persists with DROP ALL TABLES publication, where
> > > > data continues to replicate even after the publication is dropped.
> > > > This happens because the open transaction consumes the invalidation,
> > > > causing the publications to be revalidated using old snapshot. As a
> > > > result, both the open transactions and the subsequent transactions are
> > > > getting replicated.
> > > >
> > > > We can reproduce this issue by following these steps in a logical
> > > > replication setup with an "ALL TABLES" publication:
> > > > On the publisher:
> > > > Session 1: BEGIN; INSERT INTO T1 VALUES (val1);
> > > > In another session on the publisher:
> > > > Session 2: DROP PUBLICATION
> > > > Back in Session 1 on the publisher:
> > > > COMMIT;
> > > > Finally, in Session 1 on the publisher:
> > > > INSERT INTO T1 VALUES (val2);
> > > >
> > > > Even after dropping the publication, both val1 and val2 are still
> > > > being replicated to the subscriber. This means that both the
> > > > in-progress concurrent transaction and the subsequent transactions are
> > > > being replicated.
> > > >
> > > > I don't think locking all tables is a viable solution in this case, as
> > > > it would require asking the user to refrain from performing any
> > > > operations on any of the tables in the database while creating a
> > > > publication.
> > > >
> > >
> > > Indeed, locking all tables in the database to prevent concurrent DMLs
> > > for this scenario also looks odd to me. The other alternative
> > > previously suggested by Andres is to distribute catalog modifying
> > > transactions to all concurrent in-progress transactions [1] but as
> > > mentioned this could add an overhead. One possibility to reduce
> > > overhead is that we selectively distribute invalidations for
> > > catalogs-related publications but I haven't analyzed the feasibility.
> > >
> > > We need more opinions to decide here, so let me summarize the problem
> > > and solutions discussed. As explained with an example in an email [1],
> > > the problem related to logical decoding is that it doesn't process
> > > invalidations corresponding to DDLs for the already in-progress
> > > transactions. We discussed preventing DMLs in the first place when
> > > concurrent DDLs like ALTER PUBLICATION ... ADD TABLE ... are in
> > > progress. The solution discussed was to acquire
> > > ShareUpdateExclusiveLock for all the tables being added via such
> > > commands. Further analysis revealed that the same handling is required
> > > for ALTER PUBLICATION ... ADD TABLES IN SCHEMA which means locking all
> > > the tables in the specified schemas. Then DROP PUBLICATION also seems
> > > to have similar symptoms which means in the worst case (where
> > > publication is for ALL TABLES) we have to lock all the tables in the
> > > database. We are not sure if that is good so the other alternative we
> > > can pursue is to distribute invalidations in logical decoding
> > > infrastructure [1] which has its downsides.
> > >
> > > Thoughts?
> >
> > Thank you for summarizing the problem and solutions!
> >
> > I think it's worth trying the idea of distributing invalidation
> > messages, and we will see if there could be overheads or any further
> > obstacles. IIUC this approach would resolve another issue we discussed
> > before too[1].
> >
> > Regards,
> >
> > [1] https://www.postgresql.org/message-id/CAD21AoAenVqiMjpN-PvGHL1N9DWnHSq673bfgr6phmBUzx=kLQ@mail.gmail.com
> >
>
> Hi Sawada-san,
>
> I have tested the scenario shared by you on the thread [1]. And I
> confirm that the latest patch [2] fixes this issue.
>

I confirmed that the proposed patch fixes these issues. I have one
question about the patch:

In the main loop in SnapBuildDistributeSnapshotAndInval(), we have the
following code:

/*
* If we don't have a base snapshot yet, there are no changes in this
* transaction which in turn implies we don't yet need a snapshot at
* all. We'll add a snapshot when the first change gets queued.
*
* NB: This works correctly even for subtransactions because
* ReorderBufferAssignChild() takes care to transfer the base snapshot
* to the top-level transaction, and while iterating the changequeue
* we'll get the change from the subtxn.
*/
if (!ReorderBufferXidHasBaseSnapshot(builder->reorder, txn->xid))
continue;

Is there any case where we need to distribute inval messages to
transactions that don't have the base snapshot yet but eventually need
the inval messages?

Overall, with this idea, we distribute invalidation messages to all
concurrent decoded transactions. It could introduce performance
regressions by several causes. For example, we could end up
invalidating RelationSyncCache entries in more cases. While this is
addressed by your selectively cache invalidation patch, there is still
5% regression. We might need to accept a certain amount of regressions
for making it correct but it would be better to figure out where these
regressions come from. Other than that, I think the performance
regression could happen due to the costs of distributing invalidation
messages. You've already observed there is 1~3% performance regression
in cases where we distribute a large amount of invalidation messages
to one concurrently decoded transaction[1]. I guess that the
selectively cache invalidation idea would not help this case. Also, I
think we might want to test other cases like where we distribute a
small amount of invalidation messages to many concurrently decoded
transactions.

Regards,

[1] https://www.postgresql.org/message-id/CANhcyEX%2BC3G68W51myHWfbpAdmSXDwHdMsWUa%2BzHBF_QKKvZMw%40mail.gmail.com

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com

In response to

Re: long-standing data loss bug in initial sync of logical replication at 2024-10-08 09:51:38 from Shlok Kyal

Responses

Re: long-standing data loss bug in initial sync of logical replication at 2025-02-24 09:49:49 from Amit Kapila
Re: long-standing data loss bug in initial sync of logical replication at 2025-02-25 10:26:29 from Benoit Lobréau
Re: long-standing data loss bug in initial sync of logical replication at 2025-02-26 03:51:29 from Shlok Kyal

Browse pgsql-hackers by date

	From	Date	Subject
Next Message	John Naylor	2024-12-11 07:08:58	Re: CRC32C Parallel Computation Optimization on ARM
Previous Message	Peter Eisentraut	2024-12-11 07:05:05	advanced patch feedback session at FOSDEM, responses needed