From: | Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com> |
---|---|
To: | Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> |
Cc: | "Hayato Kuroda (Fujitsu)" <kuroda(dot)hayato(at)fujitsu(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, "Zhijie Hou (Fujitsu)" <houzj(dot)fnst(at)fujitsu(dot)com>, Shlok Kyal <shlok(dot)kyal(dot)oss(at)gmail(dot)com>, vignesh C <vignesh21(at)gmail(dot)com>, Nitin Motiani <nitinmotiani(at)google(dot)com>, Andres Freund <andres(at)anarazel(dot)de> |
Subject: | Re: long-standing data loss bug in initial sync of logical replication |
Date: | 2025-04-23 16:58:20 |
Message-ID: | CAD21AoAhU3kp8shYqP=ExiFDZ9sZxpFb5WzLa0p+vEe5j+7CWQ@mail.gmail.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-hackers |
On Tue, Apr 22, 2025 at 11:31 PM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
>
> On Tue, Apr 22, 2025 at 10:57 PM Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com> wrote:
> >
> > On Tue, Mar 18, 2025 at 2:55 AM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
> > >
> > > On Mon, Mar 17, 2025 at 4:56 PM Hayato Kuroda (Fujitsu)
> > > <kuroda(dot)hayato(at)fujitsu(dot)com> wrote:
> > > >
> > > > > Regarding the PG13, it cannot be
> > > > > applied
> > > > > as-is thus some adjustments are needed. I will share it in upcoming posts.
> > > >
> > > > Here is a patch set for PG13. Apart from PG14-17, the patch could be created as-is,
> > > > because...
> > > >
> > > > 1. WAL record for invalidation messages (XLOG_XACT_INVALIDATIONS) does not exist.
> > > > 2. Thus the ReorderBufferChange for the invalidation does not exist.
> > > > Our patch tries to distribute it but cannot be done as-is.
> > > > 3. Codes assumed that invalidation messages can be added only once.
> > > > 4. The timing when invalidation messages are consumed is limited:
> > > > a. COMMAND_ID change is poped,
> > > > b. start of decoding a transaction, or
> > > > c. end of decoding a transaction.
> > > >
> > > > Above means that invalidations cannot be executed while being decoded.
> > > > I created two patch sets to resolve the data loss issue. 0001 has less code
> > > > changes but could resolve a part of issue, 0002 has huge changes but provides a
> > > > complete solution.
> > > >
> > > > 0001 - mostly same as patches for other versions. ReorderBufferAddInvalidations()
> > > > was adjusted to allow being called several times. As I said above,
> > > > 0001 cannot execute inval messages while decoding the transacitons.
> > > > 0002 - introduces new ReorderBufferChange type to indicate inval messages.
> > > > It would be handled like PG14+.
> > > >
> > > > Here is an example. Assuming that the table foo exists on both nodes, a
> > > > publication "pub" which publishes all tables, and a subscription "sub" which
> > > > subscribes "pub". What if the workload is executed?
> > > >
> > > > ```
> > > > S1 S2
> > > > BEGIN;
> > > > INSERT INTO foo VALUES (1)
> > > > ALTER PUBLICATION pub RENAME TO pub_renamed;
> > > > INSERT INTO foo VALUES (2)
> > > > COMMIT;
> > > > LR -> ?
> > > > ```
> > > >
> > > > With 0001, tuples (1) and (2) would be replicated to the subscriber.
> > > > An error "publication "pub" does not exist" would raise when new changes are done
> > > > later.
> > > >
> > > > 0001+0002 works more aggressively; the error would raise when S1 transaction is decoded.
> > > > The behavior is same as for patched PG14-PG17.
> > > >
> > >
> > > I understand that with 0001 the fix is partial in the sense that
> > > because invalidations are processed at the transaction end, the
> > > changes of concurrent DDL will only be reflected for the next
> > > transaction. Now, on one hand, it is prudent to not add a new type of
> > > ReorderBufferChange in the backbranch (v13) but the change is not that
> > > invasive, so we can go with it as well. My preference would be to go
> > > with just 0001 for v13 to minimize the risk of adding new bugs or
> > > breaking something unintentionally.
> > >
> > > Sawada-San, and others involved here, do you have any suggestions on
> > > this matter?
> >
> > Sorry for the late response.
> >
> > I agree with just 0001 for v13 as 0002 seems invasive. Given that v13
> > would have only a few releases until EOL and 0001 can deal with some
> > cases in question, I'd like to avoid such invasive changes in v13.
> >
>
> Fair enough. OTOH, we can leave the 13 branch considering following:
> (a) it is near EOL, (b) bug happens in rare cases (when the DDLs like
> ALTER PUBLICATION ... ADD TABLE ... or ALTER TYPE ... that don't take
> a strong lock on table happens concurrently to DMLs on the tables
> involved in the DDL.), and (c) the complete fix is invasive, even
> partial fix is not simple. I have a slight fear that if we make any
> mistake in fixing it partially (of course, we can't see any today), we
> may not even get a chance to fix it.
>
> Now, if the above convinces you or someone else not to push the
> partial fix in 13, then fine; otherwise, I'll push the 0001 to 13 day
> after tomorrow.
I've considered the above points. I guess (b), particularly executing
ALTER PUBLICATION .. ADD TABLE while the target table is being
updated, might not be rare depending on systems. Given that this bug
causes a silent data-loss on the subscriber that is hard for users to
realize, it could ultimately depend on to what extent we can mitigate
the problem with only 0001 and there is a workaround when the problem
happens.
Kuroda-san already shared[1] the analysis of what happens with and
without 0002 patch, but let me try with the example close to the
original data-loss problem[2]:
Consider the following scenario:
S1: CREATE TABLE d(data text not null);
S1: INSERT INTO d VALUES('d1');
S2: BEGIN;
S2: INSERT INTO d VALUES('d2');
S1: ALTER PUBLICATION pb ADD TABLE d;
S2: INSERT INTO d VALUES('d3');
S2: COMMIT
S2: INSERT INTO d VALUES('d4');
S1: INSERT INTO d VALUES('d5');
Without 0001 and 0002 (i.e. as of today), the walsender fails to send
all changes to table 'd' until it invalidates its caches for some
reasons.
With only 0001, the walsender sends 'd4' insertion or later.
WIth both 0001 and 0002, the wansender sends 'd3' insertion or later.
ISTM the difference between without both 0001 and 0002 and with 0001
is significant. So I think it's worth applying 0001 for v13.
Regards,
[1] https://www.postgresql.org/message-id/OSCPR01MB149664A485A89B0AC6FB7BA71F5DF2%40OSCPR01MB14966.jpnprd01.prod.outlook.com
[2] https://www.postgresql.org/message-id/20231118025445.crhaeeuvoe2g5dv6%40awork3.anarazel.de
--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
From | Date | Subject | |
---|---|---|---|
Next Message | Devrim Gündüz | 2025-04-23 17:01:05 | Re: Built-in Raft replication |
Previous Message | Nikita Malakhov | 2025-04-23 16:54:36 | Re: SQL:2023 JSON simplified accessor support |