Re: long-standing data loss bug in initial sync of logical replication

From: Nitin Motiani <nitinmotiani(at)google(dot)com>
To: vignesh C <vignesh21(at)gmail(dot)com>
Cc: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: long-standing data loss bug in initial sync of logical replication
Date: 2024-07-10 17:52:36
Message-ID: CAH5HC95UsGrD1ez0vfNPCLP6vE3ptmaOkUECB5JsqLOCYjRW3w@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Wed, Jul 10, 2024 at 10:39 PM vignesh C <vignesh21(at)gmail(dot)com> wrote:
>
> On Wed, 10 Jul 2024 at 12:28, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
> > The patch missed to use the ShareRowExclusiveLock for partitions, see
> > attached. I haven't tested it but they should also face the same
> > problem. Apart from that, I have changed the comments in a few places
> > in the patch.
>
> I could not hit the updated ShareRowExclusiveLock changes through the
> partition table, instead I could verify it using the inheritance
> table. Added a test for the same and also attaching the backbranch
> patch.
>

Hi,

I tested alternative-experimental-fix-lock.patch provided by Tomas
(replaces SUE with SRE in OpenTableList). I believe there are a couple
of scenarios the patch does not cover.

1. It doesn't handle the case of "ALTER PUBLICATION <pub> ADD TABLES
IN SCHEMA <schema>".

I took crash-test.sh provided by Tomas and modified it to add all
tables in the schema to publication using the following command :

ALTER PUBLICATION p ADD TABLES IN SCHEMA public

The modified script is attached (crash-test-with-schema.sh). With this
script, I can reproduce the issue even with the patch applied. This is
because the code path to add a schema to the publication doesn't go
through OpenTableList.

I have also attached a script run-test-with-schema.sh to run
crash-test-with-schema.sh in a loop with randomly generated parameters
(modified from run.sh provided by Tomas).

2. The second issue is a deadlock which happens when the alter
publication command is run for a comma separated list of tables.

I created another script create-test-tables-order-reverse.sh. This
script runs a command like the following :

ALTER PUBLICATION p ADD TABLE test_2,test_1

Running the above script, I was able to get a deadlock error (the
output is attached in deadlock.txt). In the alter publication command,
I added the tables in the reverse order to increase the probability of
the deadlock. But it should happen with any order of tables.

I am not sure if the deadlock is a major issue because detecting the
deadlock is better than data loss. The schema issue is probably more
important. I didn't test it out with the latest patches sent by
Vignesh but since the code changes in that patch are also in
OpenTableList, I think the schema scenario won't be covered by those.

Thanks & Regards,
Nitin Motiani
Google

Attachment Content-Type Size
deadlock.txt text/plain 6.8 KB
crash-test-with-schema.sh application/x-sh 9.1 KB
run-test-with-schema.sh application/x-sh 1.2 KB
crash-test-tables-order-reverse.sh application/x-sh 9.1 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Kirill Reshke 2024-07-10 17:57:45 Re: Allow non-superuser to cancel superuser tasks.
Previous Message Imran Zaheer 2024-07-10 17:06:40 Re: Windows: openssl & gssapi dislike each other