From: | Shlok Kyal <shlok(dot)kyal(dot)oss(at)gmail(dot)com> |
---|---|
To: | Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> |
Cc: | Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>, "Hayato Kuroda (Fujitsu)" <kuroda(dot)hayato(at)fujitsu(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, "Zhijie Hou (Fujitsu)" <houzj(dot)fnst(at)fujitsu(dot)com>, vignesh C <vignesh21(at)gmail(dot)com>, Nitin Motiani <nitinmotiani(at)google(dot)com>, Andres Freund <andres(at)anarazel(dot)de> |
Subject: | Re: long-standing data loss bug in initial sync of logical replication |
Date: | 2025-04-25 05:15:39 |
Message-ID: | CANhcyEXsObdjkjxEnq10aJumDpa5J6aiPzgTh_w4KCWRYHLw6Q@mail.gmail.com |
Lists: | pgsql-hackers |
On Thu, 24 Apr 2025 at 14:39, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
>
> On Wed, Apr 23, 2025 at 10:28 PM Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com> wrote:
> >
> > >
> > > Fair enough. OTOH, we can leave the 13 branch considering the following:
> > > (a) it is near EOL, (b) the bug happens in rare cases (when DDLs like
> > > ALTER PUBLICATION ... ADD TABLE ... or ALTER TYPE ... that don't take
> > > a strong lock on the table happen concurrently with DMLs on the tables
> > > involved in the DDL), and (c) the complete fix is invasive; even a
> > > partial fix is not simple. I have a slight fear that if we make any
> > > mistake in fixing it partially (of course, we can't see any today), we
> > > may not even get a chance to fix it.
> > >
> > > Now, if the above convinces you or someone else not to push the
> > > partial fix in 13, then fine; otherwise, I'll push 0001 to 13 the day
> > > after tomorrow.
> >
> > I've considered the above points. I guess (b), particularly executing
> > ALTER PUBLICATION .. ADD TABLE while the target table is being
> > updated, might not be rare depending on systems. Given that this bug
> > causes a silent data-loss on the subscriber that is hard for users to
> > realize, it could ultimately depend on to what extent we can mitigate
> > the problem with only 0001 and whether there is a workaround when the
> > problem happens.
> >
> > Kuroda-san already shared[1] the analysis of what happens with and
> > without 0002 patch, but let me try with the example close to the
> > original data-loss problem[2]:
> >
> > Consider the following scenario:
> >
> > S1: CREATE TABLE d(data text not null);
> > S1: INSERT INTO d VALUES('d1');
> > S2: BEGIN;
> > S2: INSERT INTO d VALUES('d2');
> > S1: ALTER PUBLICATION pb ADD TABLE d;
> > S2: INSERT INTO d VALUES('d3');
> > S2: COMMIT;
> > S2: INSERT INTO d VALUES('d4');
> > S1: INSERT INTO d VALUES('d5');
> >
> > Without 0001 and 0002 (i.e. as of today), the walsender fails to send
> > all changes to table 'd' until it invalidates its caches for some
> > reason.
> >
> > With only 0001, the walsender sends 'd4' insertion or later.
> >
> > With both 0001 and 0002, the walsender sends 'd3' insertion or later.
> >
> > ISTM the difference between having neither 0001 nor 0002 and having
> > only 0001 is significant. So I think it's worth applying 0001 for v13.
> >
>
> Pushed to v13 as well, thanks for sharing the feedback.
>
In the commits, I noticed that two file names are misspelled:
invalidation_distrubution.out and invalidation_distrubution.spec
("distrubution" instead of "distribution"). The misspelling is present
in all branches from REL_13 to HEAD. I have attached patches to fix it.
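
As a reading aid for the scenario quoted above, here is a minimal Python
sketch (not part of the attached patches) that tabulates which INSERTs the
walsender is described as sending under each patch combination. The names
INSERTS, FIRST_SENT and streamed_rows are hypothetical, and the table only
encodes the outcomes stated in the quoted mail; it does not model the actual
cache invalidation logic, nor what the initial table sync copies.

```python
# Values inserted into table 'd', in the order they appear in the quoted scenario.
INSERTS = ["d1", "d2", "d3", "d4", "d5"]

# First insert the walsender is described as sending, keyed by the set of
# applied patches. None means nothing is sent until the walsender
# invalidates its caches for some unrelated reason.
FIRST_SENT = {
    frozenset(): None,                   # neither patch (behaviour before the fix)
    frozenset({"0001"}): "d4",           # only 0001
    frozenset({"0001", "0002"}): "d3",   # both 0001 and 0002
}


def streamed_rows(patches):
    """Return the inserts sent by the walsender for a given patch set."""
    first = FIRST_SENT[frozenset(patches)]
    if first is None:
        return []
    # "'dN' insertion or later" means dN and every insert that follows it.
    return INSERTS[INSERTS.index(first):]


if __name__ == "__main__":
    for patches in (set(), {"0001"}, {"0001", "0002"}):
        label = ", ".join(sorted(patches)) or "no patches"
        print(f"{label}: {streamed_rows(patches)}")
```

For example, streamed_rows({"0001"}) returns ['d4', 'd5'], which corresponds
to the 0001-only case discussed for v13.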
Thanks and Regards,
Shlok Kyal
Attachment | Content-Type | Size |
---|---|---|
v1_HEAD-0001-Fix-spelling-for-file-names.patch | application/octet-stream | 2.7 KB |
v1_REL_15-0001-Fix-spelling-for-file-names.patch | application/octet-stream | 2.1 KB |
v1_REL_13-0001-Fix-spelling-for-file-names.patch | application/octet-stream | 2.1 KB |
v1_REL_16-0001-Fix-spelling-for-file-names.patch | application/octet-stream | 2.7 KB |
v1_REL_14-0001-Fix-spelling-for-file-names.patch | application/octet-stream | 2.1 KB |
v1_REL_17-0001-Fix-spelling-for-file-names.patch | application/octet-stream | 2.7 KB |