From: | Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> |
---|---|
To: | Michail Nikolaev <michail(dot)nikolaev(at)gmail(dot)com> |
Cc: | PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: Data loss on logical replication, 12.12 to 14.5, ALTER SUBSCRIPTION |
Date: | 2023-01-03 12:00:09 |
Message-ID: | CAA4eK1Lr5bYT=JPKzsfxM0O0VdkO4cr-4jjY1SNZEuYMDZozcw@mail.gmail.com |
Lists: | pgsql-hackers |
On Tue, Jan 3, 2023 at 2:14 PM Michail Nikolaev
<michail(dot)nikolaev(at)gmail(dot)com> wrote:
>
> > The point which is not completely clear from your description is the
> > timing of missing records. In one of your previous emails, you seem to
> > have indicated that the data missed from Table B is from the time when
> > the initial sync for Table B was in-progress, right? Also, from your
> > description, it seems there is no error or restart that happened
> > during the time of initial sync for Table B. Is that understanding
> > correct?
>
> Yes and yes.
> * B sync started - 08:08:34
> * lost records are created - 09:49:xx
> * B initial sync finished - 10:19:08
> * I/O error with WAL - 10:19:22
> * SIGTERM - 10:35:20
>
> "Finished" here is `logical replication table synchronization worker
> for subscription "cloud_production_main_sub_v4", table "B" has
> finished`.
> As far as I know, it is about COPY command.
>
> > I am not able to see how these steps can lead to the problem.
>
> One idea I have here - it is something related to the patch about
> forbidding of canceling queries while waiting for synchronous
> replication acknowledgement [1].
> It is applied to Postgres in the cloud we were using [2]. We started
> to see such errors at 10:24:18:
>
> `The COMMIT record has already flushed to WAL locally and might
> not have been replicated to the standby. We must wait here.`
>
Does that by any chance mean you are using a non-community version of
Postgres which has some other changes?

> I wonder could it be some tricky race because of downtime of
> synchronous replica and queries stuck waiting for ACK forever?
>
It is possible, but ideally, in that case, the client should retry
such a transaction.
--
With Regards,
Amit Kapila.