Re: Data loss on logical replication, 12.12 to 14.5, ALTER SUBSCRIPTION

From: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
To: Michail Nikolaev <michail(dot)nikolaev(at)gmail(dot)com>
Cc: PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Data loss on logical replication, 12.12 to 14.5, ALTER SUBSCRIPTION
Date: 2023-01-03 12:00:09
Message-ID: CAA4eK1Lr5bYT=JPKzsfxM0O0VdkO4cr-4jjY1SNZEuYMDZozcw@mail.gmail.com
Lists: pgsql-hackers

On Tue, Jan 3, 2023 at 2:14 PM Michail Nikolaev
<michail(dot)nikolaev(at)gmail(dot)com> wrote:
>
> > The point which is not completely clear from your description is the
> > timing of missing records. In one of your previous emails, you seem to
> > have indicated that the data missed from Table B is from the time when
> > the initial sync for Table B was in-progress, right? Also, from your
> > description, it seems there is no error or restart that happened
> > during the time of initial sync for Table B. Is that understanding
> > correct?
>
> Yes and yes.
> * B sync started - 08:08:34
> * lost records are created - 09:49:xx
> * B initial sync finished - 10:19:08
> * I/O error with WAL - 10:19:22
> * SIGTERM - 10:35:20
>
> "Finished" here is `logical replication table synchronization worker
> for subscription "cloud_production_main_sub_v4", table "B" has
> finished`.
> As far as I know, this refers to the COPY command.
>
> > I am not able to see how these steps can lead to the problem.
>
> One idea I have here: it may be related to the patch that forbids
> canceling queries while waiting for synchronous replication
> acknowledgement [1].
> It is applied to the Postgres in the cloud we were using [2]. We started
> to see such errors at 10:24:18:
>
> `The COMMIT record has already flushed to WAL locally and might
> not have been replicated to the standby. We must wait here.`
>

Does that by any chance mean you are using a non-community version of
Postgres which has some other changes?

> I wonder whether it could be some tricky race caused by downtime of the
> synchronous replica, with queries stuck waiting for ACK forever?
>

It is possible, but ideally, in that case, the client should submit
such a transaction again.
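
As a rough illustration of that kind of client-side resubmission, here is
a minimal Python sketch. It is not taken from the thread: the names
`TransientCommitError` and `run_with_retry` are hypothetical, and the
sketch assumes the transaction is idempotent, since an earlier attempt may
already have been flushed to WAL locally even though the client saw an
error.

```python
import time


class TransientCommitError(Exception):
    """Hypothetical error: the commit failed or its outcome is unknown."""


def run_with_retry(txn, max_attempts=3, delay_seconds=0.0):
    """Call txn() until it succeeds or max_attempts is exhausted.

    Assumption: txn is idempotent (for example, an upsert keyed on a
    client-generated ID), so running it again after an ambiguous commit
    does not duplicate data.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            return txn()
        except TransientCommitError as exc:
            last_error = exc
            # Pause before the next attempt, but not after the last one.
            if attempt < max_attempts:
                time.sleep(delay_seconds)
    # Every attempt failed: surface the most recent error to the caller.
    raise last_error
```

In a real client, `txn` would wrap the driver calls that open, execute,
and commit the transaction, and the caught exception would be whatever
the driver raises for a canceled or failed commit.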

--
With Regards,
Amit Kapila.
