Re: long-standing data loss bug in initial sync of logical replication

From: vignesh C <vignesh21(at)gmail(dot)com>
To: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
Cc: Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: long-standing data loss bug in initial sync of logical replication
Date: 2024-07-09 14:43:42
Message-ID: CALDaNm14MFRQ9vkTXY1QTJdWHmHhXZPFj3Zv3PcqF9CdJukbUQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Tue, 9 Jul 2024 at 17:05, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
>
> On Mon, Jul 1, 2024 at 10:51 AM vignesh C <vignesh21(at)gmail(dot)com> wrote:
> >
> >
> > This issue is present in all supported versions. I was able to
> > reproduce it using the steps recommended by Andres and Tomas's
> > scripts. I also conducted a small test through TAP tests to verify the
> > problem. Attached is the alternate_lock_HEAD.patch, which includes the
> > lock modification(Tomas's change) and the TAP test.
> >
>
> @@ -1568,7 +1568,7 @@ OpenTableList(List *tables)
> /* Allow query cancel in case this takes a long time */
> CHECK_FOR_INTERRUPTS();
>
> - rel = table_openrv(t->relation, ShareUpdateExclusiveLock);
> + rel = table_openrv(t->relation, ShareRowExclusiveLock);
>
> The comment just above this code ("Open, share-lock, and check all the
> explicitly-specified relations") needs modification. It would be
> better to explain the reason of why we would need SRE lock here.

Updated comments for the same.

> > To reproduce the issue in the HEAD version, we cannot use the same
> > test as in the alternate_lock_HEAD patch because the behavior changes
> > slightly after the fix to wait for the lock until the open transaction
> > completes.
> >
>
> But won't the test that reproduces the problem in HEAD be successful
> after the code change? If so, can't we use the same test instead of
> slight modification to verify the lock mode?

Before the patch fix, the ALTER PUBLICATION command would succeed
immediately. Now, the ALTER PUBLICATION command waits until it
acquires the ShareRowExclusiveLock. This change means that in test
cases, previously we waited until the table was added to the
publication, whereas now, after applying the patch, we wait until the
ALTER PUBLICATION command is actively waiting for the
ShareRowExclusiveLock. This waiting step ensures consistent execution
and sequencing of tests each time.

> > The attached issue_reproduce_testcase_head.patch can be
> > used to reproduce the issue through TAP test in HEAD.
> > The changes made in the HEAD version do not directly apply to older
> > branches. For PG14, PG13, and PG12 branches, you can use the
> > alternate_lock_PG14.patch.
> >
>
> Why didn't you include the test in the back branches? If it is due to
> background psql stuff, then won't commit
> (https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=187b8991f70fc3d2a13dc709edd408a8df0be055)
> can address it?

Indeed, I initially believed it wasn't available. Currently, I haven't
incorporated the back branch patch, but I plan to include it in a
subsequent version once there are no review comments on the HEAD
patch.

The updated v2 version patch has the fix for the comments.

Regards,
Vignesh

Attachment Content-Type Size
v2-0001-Fix-random-data-loss-during-logical-replication.patch text/x-patch 5.7 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message David E. Wheeler 2024-07-09 14:45:37 Re: Document DateStyle effect on jsonpath string()
Previous Message Ashutosh Bapat 2024-07-09 14:41:41 Re: Optimize WindowAgg's use of tuplestores