Re: Excessive number of replication slots for 12->14 logical replication

From: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
To: Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>
Cc: Ajin Cherian <itsajin(at)gmail(dot)com>, Peter Smith <smithpb2250(at)gmail(dot)com>, "houzj(dot)fnst(at)fujitsu(dot)com" <houzj(dot)fnst(at)fujitsu(dot)com>, Hubert Lubaczewski <depesz(at)depesz(dot)com>, PostgreSQL mailing lists <pgsql-bugs(at)lists(dot)postgresql(dot)org>
Subject: Re: Excessive number of replication slots for 12->14 logical replication
Date: 2022-09-10 10:22:57
Message-ID: CAA4eK1KpXQRLswLkqLiWx61DBbL4x1NBSRxpLYmSJzr3gRYc7A@mail.gmail.com
Lists: pgsql-bugs

On Sat, Sep 10, 2022 at 11:45 AM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
>
> On Sat, Sep 10, 2022 at 11:06 AM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
> >
> > On Sat, Sep 10, 2022 at 3:19 AM Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com> wrote:
> >
> > One thing is not clear to me how the first time error: "could not find
> > record while sending logically-decoded data ..." can happen due to
> > this commit? Also, based on the origin even if the client sends a
> > prior location (0/0 in this case) but the server will still start from
> > the location where the client has confirmed the commit (aka
> > confirmed_flush location).
> >
>
> I missed the point that if the 'origin_lsn' is ahead of the
> 'confirmed_flush' location then it will start from some prior location
> which I think will be problematic.
>

I am able to reproduce the behavior seen in the buildfarm (BF) failure
with the help of a debugger, by introducing an artificial error in
libpqrcv_endstreaming and by ensuring that the apply worker skips the
transaction that performs an operation on a table the sync worker is
still copying. I also have to suppress keepalive messages from the
publisher; otherwise, they move the confirmed_flush location ahead of
origin_lsn. So it is clear that this commit caused the BF failure,
even though the first error seen, "ERROR: could not find record while
sending logically-decoded data: missing contrecord at 0/1CCF9F0", was
not due to this commit.
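
To make the interaction between the requested start location,
confirmed_flush, and origin_lsn concrete, here is a minimal Python
sketch of the rule as I understand it. It is an illustration only, not
PostgreSQL code: the function names and the LSN values are
hypothetical, and it assumes the server falls back to confirmed_flush
whenever the client asks for 0/0 or for anything behind
confirmed_flush.

```python
INVALID_LSN = 0  # stands in for 0/0 (hypothetical constant for this sketch)


def parse_lsn(text: str) -> int:
    """Convert an LSN written as 'hi/lo' in hex into a 64-bit integer."""
    hi, lo = text.split("/")
    return (int(hi, 16) << 32) | int(lo, 16)


def effective_start_lsn(requested: int, confirmed_flush: int) -> int:
    """Pick the decoding start point (assumed rule, see the lead-in).

    A request of 0/0, or of any location behind confirmed_flush, is
    moved forward to confirmed_flush; a later request is used as-is.
    """
    if requested == INVALID_LSN or requested < confirmed_flush:
        return confirmed_flush
    return requested


def resends_applied_changes(start: int, origin_lsn: int) -> bool:
    """True if decoding starts before what the subscriber already applied."""
    return start < origin_lsn


# Hypothetical values: keepalives were suppressed, so confirmed_flush
# lags behind the origin_lsn recorded on the subscriber.
confirmed_flush = parse_lsn("0/1000")
origin_lsn = parse_lsn("0/2000")

start = effective_start_lsn(parse_lsn("0/0"), confirmed_flush)
print(hex(start), resends_applied_changes(start, origin_lsn))
```

With these made-up numbers the client sends 0/0, the server starts at
confirmed_flush, and that is behind origin_lsn, which is the
problematic case described above.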

I don't have any better ideas to solve this at this stage than what
Hou-San has mentioned in his email [1]. What do you think?

[1] - https://www.postgresql.org/message-id/OS0PR01MB5716E128E78C6CECD15C718394429%40OS0PR01MB5716.jpnprd01.prod.outlook.com

--
With Regards,
Amit Kapila.
