| From: | "Zhijie Hou (Fujitsu)" <houzj(dot)fnst(at)fujitsu(dot)com> | 
|---|---|
| To: | vignesh C <vignesh21(at)gmail(dot)com>, Bowen Shi <zxwsbg12138(at)gmail(dot)com> | 
| Cc: | Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>, Ajin Cherian <itsajin(at)gmail(dot)com>, Peter Smith <smithpb2250(at)gmail(dot)com>, Hubert Lubaczewski <depesz(at)depesz(dot)com>, PostgreSQL mailing lists <pgsql-bugs(at)lists(dot)postgresql(dot)org> | 
| Subject: | RE: Excessive number of replication slots for 12->14 logical replication | 
| Date: | 2024-01-21 13:32:02 | 
| Message-ID: | OS0PR01MB57165FF10C478BFBB837696A94762@OS0PR01MB5716.jpnprd01.prod.outlook.com | 
| Lists: | pgsql-bugs | 
On Saturday, January 20, 2024 12:40 AM vignesh C <vignesh21(at)gmail(dot)com> wrote:
Hi,
> 
> On Thu, 18 Jan 2024 at 13:00, Bowen Shi <zxwsbg12138(at)gmail(dot)com> wrote:
> >
> > Dears,
> >
> > I encountered a similar problem when I used logical replication to replicate
> > databases from pg 16 to pg 16.
> >
> > I started 3 subscription in parallel, and  subscriber's postgresql.conf is
> > following:
> > max_replication_slots = 10
> > max_sync_workers_per_subscription = 2
> >
> > However, after 3 minutes, I found three COPY errors in subscriber:
> > "error while shutting down streaming COPY: ERROR:  could not find record
> > while sending logically-decoded data: missing contrecord at xxxx/xxxxxxxxx""
> > Then,  the subscriber began to print a large number of errors: "could not find
> > free replication state slot for replication origin with ID 11, Increase
> > max_replication_slots and try again."
> >
> > And the publisher was full of pg_xxx_sync_xxxxxxx slots, printing lots of "all
> > replication slots are in use, Free one or increase max_replication_slots."
> >
> > This question is very similar to
> > https://www.postgresql.org/message-id/flat/20220714115155.GA5439%40depesz.com .
> > When the table sync worker encounters an error and exits while copying
> > a table, the replication origin will not be deleted. And new table sync workers
> > would create sync slot in the publisher and then exit without dropping them.
> 
> I had tried various tests with the suggested configuration, but I did not hit this
> scenario. I was able to simulate this problem with a lesser number of
> max_replication_slots, but the behavior is as expected in this case.
> If you have a test case or logs for this, can you share it please. It will be easier to
> generate the sequence of things that is happening and to project a clear picture
> of what is happening.

I think the reason for these origin/slot ERRORs could be that the table sync worker
doesn't drop its origin and slot on ERROR (the table sync worker only drops these
after finishing the sync, in process_syncing_tables_for_sync).

So, if a table sync worker exits due to an ERROR, the apply worker may keep trying
to start more workers while the origin of the previously errored table sync
worker has not been dropped, causing a bunch of origin/slot ERRORs.

If the above reasoning is correct, maybe we could somehow drop the origin and
slot on ERROR exit as well, although that needs some analysis.
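
To illustrate the arithmetic behind this hypothesis, here is a rough toy model in Python. It is not PostgreSQL code: the function name and the accounting are assumptions made only for illustration. It assumes that each apply worker holds one origin state, that each running table sync worker holds one, and that each table sync worker that exited on ERROR leaves one behind that is not reused by the workers counted here.

```python
# Toy model only: not PostgreSQL code, and all names here are hypothetical.

def starved_sync_workers(max_states, subscriptions, sync_workers_per_sub,
                         leftover_origins):
    """Return how many wanted table sync workers cannot get an origin state.

    Assumptions (the hypothesis under discussion, heavily simplified):
    - each apply worker occupies one origin state,
    - each running table sync worker needs one origin state,
    - each leftover origin from an errored sync worker still occupies one.
    """
    apply_states = subscriptions
    wanted_sync = subscriptions * sync_workers_per_sub
    free = max(max_states - apply_states - leftover_origins, 0)
    return max(wanted_sync - free, 0)


# Reported settings: 3 subscriptions, max_sync_workers_per_subscription = 2,
# max_replication_slots = 10.
print(starved_sync_workers(10, 3, 2, 0))  # no leftover origins
print(starved_sync_workers(10, 3, 2, 3))  # three sync workers errored out
```

Under these assumptions the reported settings need 3 x (1 + 2) = 9 origin states when nothing fails, which leaves a single spare out of 10, so a few leftover origins would already be enough to produce the "could not find free replication state slot" ERRORs.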

BTW, as for the root ERROR ("COPY: ERROR:  could not find record while
sending logically-decoded data: missing contrecord at xxxx/xxxxxxxxx") that
triggers the subsequent slot/origin ERRORs, I am not sure what would cause it.

As Vignesh mentioned, it would be better to provide the log files from both the
publisher and the subscriber for further analysis.

Best Regards,
Hou zj