Re: Excessive number of replication slots for 12->14 logical replication

From: Bowen Shi <zxwsbg12138(at)gmail(dot)com>
To: "Zhijie Hou (Fujitsu)" <houzj(dot)fnst(at)fujitsu(dot)com>
Cc: vignesh C <vignesh21(at)gmail(dot)com>, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>, Ajin Cherian <itsajin(at)gmail(dot)com>, Peter Smith <smithpb2250(at)gmail(dot)com>, Hubert Lubaczewski <depesz(at)depesz(dot)com>, PostgreSQL mailing lists <pgsql-bugs(at)lists(dot)postgresql(dot)org>
Subject: Re: Excessive number of replication slots for 12->14 logical replication
Date: 2024-01-22 02:26:19
Message-ID: CAM_vCueG60nGH4ScKA5SeN9RbcV3UDhqM1TJw8qvEp_xDHwdhw@mail.gmail.com
Lists: pgsql-bugs

> I think the reason for these origin/slot ERRORs could be that the table sync
> worker doesn't drop the origin and slot on ERROR (the table sync worker only
> drops these after finishing the sync in process_syncing_tables_for_sync).

> So, if one table sync worker exits due to an ERROR, the apply worker may keep
> trying to start more workers while the origin of the previously errored table
> sync worker has not been dropped, causing a bunch of origin/slot ERRORs.

Right, that's the problem.
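
As a rough illustration of that mechanism, here is a toy model in Python. It is not PostgreSQL code: the names Cluster, run_tablesync and simulate are made up for this sketch, and it assumes that every failed sync attempt leaves exactly one slot/origin entry behind unless it is cleaned up on the error path.

```python
# Toy model of the leak described above; this is NOT PostgreSQL code.
# All names here (Cluster, run_tablesync, simulate) are hypothetical.

class SlotsExhausted(Exception):
    """Raised when no free slot / origin state entry is left."""


class Cluster:
    """Tracks slot/origin entries against a max_replication_slots limit."""

    def __init__(self, max_replication_slots):
        self.max_replication_slots = max_replication_slots
        self.in_use = set()

    def acquire(self, name):
        if len(self.in_use) >= self.max_replication_slots:
            raise SlotsExhausted(name)
        self.in_use.add(name)

    def release(self, name):
        self.in_use.discard(name)


def run_tablesync(cluster, name, copy_fails, cleanup_on_error):
    """Run one sync attempt; return True if the sync finished."""
    cluster.acquire(name)
    if copy_fails:
        # Error exit: the entry is only released if cleanup is enabled.
        if cleanup_on_error:
            cluster.release(name)
        return False
    # Normal path: the entry is dropped after the sync finishes.
    cluster.release(name)
    return True


def simulate(attempts, max_slots, cleanup_on_error):
    """Retry failing syncs; return (leaked_entries, exhaustion_errors)."""
    cluster = Cluster(max_slots)
    exhausted = 0
    for i in range(attempts):
        try:
            run_tablesync(cluster, f"sync_{i}", True, cleanup_on_error)
        except SlotsExhausted:
            exhausted += 1
    return len(cluster.in_use), exhausted


if __name__ == "__main__":
    print(simulate(20, 10, cleanup_on_error=False))
    print(simulate(20, 10, cleanup_on_error=True))
```

In this model, once the leaked entries reach the limit, every later attempt fails with the "no free slot" style error, whatever the original failure was. Releasing the entry on the error path keeps the count at zero.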

> BTW, for the first root ERROR ("COPY: ERROR: could not find record while
> sending logically-decoded data: missing contrecord at xxxx/xxxxxxxxx"), which
> causes the subsequent slot/origin errors, I am not sure what would cause this.

That first error is not important in itself; it is just what happened to occur
at that time. Any other error that makes a table sync worker exit could lead to
the same result described above.

Bowen Shi

Zhijie Hou (Fujitsu) <houzj(dot)fnst(at)fujitsu(dot)com> wrote on Sun, Jan 21, 2024 at 21:32:

> On Saturday, January 20, 2024 12:40 AM vignesh C <vignesh21(at)gmail(dot)com>
> wrote:
>
> Hi,
>
> >
> > On Thu, 18 Jan 2024 at 13:00, Bowen Shi <zxwsbg12138(at)gmail(dot)com> wrote:
> > >
> > > Dear all,
> > >
> > > I encountered a similar problem when I used logical replication to
> > > replicate databases from pg 16 to pg 16.
> > >
> > > I started 3 subscriptions in parallel, and the subscriber's
> > > postgresql.conf is as follows:
> > > max_replication_slots = 10
> > > max_sync_workers_per_subscription = 2
> > >
> > > However, after 3 minutes, I found three COPY errors on the subscriber:
> > > "error while shutting down streaming COPY: ERROR: could not find record
> > > while sending logically-decoded data: missing contrecord at
> > > xxxx/xxxxxxxxx"
> > > Then, the subscriber began to print a large number of errors: "could not
> > > find free replication state slot for replication origin with ID 11,
> > > Increase max_replication_slots and try again."
> > >
> > > And the publisher was full of pg_xxx_sync_xxxxxxx slots, printing lots of
> > > "all replication slots are in use, Free one or increase
> > > max_replication_slots."
> > >
> > > This issue is very similar to
> > > https://www.postgresql.org/message-id/flat/20220714115155.GA5439%40depesz.com .
> > > When the table sync worker encounters an error and exits while copying a
> > > table, the replication origin is not deleted. New table sync workers then
> > > create sync slots on the publisher and exit without dropping them.
> >
> > I tried various tests with the suggested configuration, but I did not hit
> > this scenario. I was able to simulate this problem with a smaller value of
> > max_replication_slots, but the behavior is as expected in that case.
> > If you have a test case or logs for this, could you please share them? That
> > would make it easier to reconstruct the sequence of events and get a clear
> > picture of what is happening.
>
> I think the reason for these origin/slot ERRORs could be that the table sync
> worker doesn't drop the origin and slot on ERROR (the table sync worker only
> drops these after finishing the sync in process_syncing_tables_for_sync).
>
> So, if one table sync worker exits due to an ERROR, the apply worker may keep
> trying to start more workers while the origin of the previously errored table
> sync worker has not been dropped, causing a bunch of origin/slot ERRORs.
>
> If the above reason is correct, maybe we could somehow drop the origin and
> slots on ERROR exit as well, although it needs some analysis.
>
> BTW, for the first root ERROR ("COPY: ERROR: could not find record while
> sending logically-decoded data: missing contrecord at xxxx/xxxxxxxxx"), which
> causes the subsequent slot/origin errors, I am not sure what would cause this.
>
> As Vignesh mentioned, it would be better to provide the log files from both
> the publisher and the subscriber for further analysis.
>
> Best Regards,
> Hou zj
>
