RE: Excessive number of replication slots for 12->14 logical replication

From: "houzj(dot)fnst(at)fujitsu(dot)com" <houzj(dot)fnst(at)fujitsu(dot)com>
To: Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
Cc: Ajin Cherian <itsajin(at)gmail(dot)com>, Peter Smith <smithpb2250(at)gmail(dot)com>, Hubert Lubaczewski <depesz(at)depesz(dot)com>, PostgreSQL mailing lists <pgsql-bugs(at)lists(dot)postgresql(dot)org>
Subject: RE: Excessive number of replication slots for 12->14 logical replication
Date: 2022-09-10 10:34:23
Message-ID: OS0PR01MB5716E1D75CFDB08E939E521794429@OS0PR01MB5716.jpnprd01.prod.outlook.com
Lists: pgsql-bugs

On Saturday, September 10, 2022 11:41 AM houzj(dot)fnst(at)fujitsu(dot)com wrote:
>
> On Saturday, September 10, 2022 5:49 AM Masahiko Sawada
> <sawada(dot)mshk(at)gmail(dot)com> wrote:
> >
> > On Tue, Aug 30, 2022 at 3:44 PM Amit Kapila
> > <amit(dot)kapila16(at)gmail(dot)com> wrote:
> > >
> > > On Fri, Aug 26, 2022 at 7:04 AM Amit Kapila
> > > <amit(dot)kapila16(at)gmail(dot)com> wrote:
> > > >
> > > > Thanks for the testing. I'll push this sometime early next week (by
> > > > Tuesday) unless Sawada-San or someone else has any comments on it.
> > > >
> > >
> > > Pushed.
> >
> > Tom reported buildfarm failures[1]. I investigated the cause and
> > concluded that this commit is responsible.
> >
> > In process_syncing_tables_for_sync(), we have the following code:
> >
> > UpdateSubscriptionRelState(MyLogicalRepWorker->subid,
> >                            MyLogicalRepWorker->relid,
> >                            MyLogicalRepWorker->relstate,
> >                            MyLogicalRepWorker->relstate_lsn);
> >
> > ReplicationOriginNameForTablesync(MyLogicalRepWorker->subid,
> >                                   MyLogicalRepWorker->relid,
> >                                   originname,
> >                                   sizeof(originname));
> > replorigin_session_reset();
> > replorigin_session_origin = InvalidRepOriginId;
> > replorigin_session_origin_lsn = InvalidXLogRecPtr;
> > replorigin_session_origin_timestamp = 0;
> >
> > /*
> >  * We expect that origin must be present. The concurrent operations
> >  * that remove origin like a refresh for the subscription take an
> >  * access exclusive lock on pg_subscription which prevent the previous
> >  * operation to update the rel state to SUBREL_STATE_SYNCDONE to
> >  * succeed.
> >  */
> > replorigin_drop_by_name(originname, false, false);
> >
> > /*
> >  * End streaming so that LogRepWorkerWalRcvConn can be used to drop
> >  * the slot.
> >  */
> > walrcv_endstreaming(LogRepWorkerWalRcvConn, &tli);
> >
> > /*
> >  * Cleanup the tablesync slot.
> >  *
> >  * This has to be done after the data changes because otherwise if
> >  * there is an error while doing the database operations we won't be
> >  * able to rollback dropped slot.
> >  */
> > ReplicationSlotNameForTablesync(MyLogicalRepWorker->subid,
> >                                 MyLogicalRepWorker->relid,
> >                                 syncslotname,
> >                                 sizeof(syncslotname));
> >
> > If the tablesync worker errors out at walrcv_endstreaming(), we assumed
> > that both the drop of the replication origin and the update of the
> > relstate would be rolled back. That assumption is wrong: the replication
> > origin is not dropped, but its in-memory state is reset anyway.
> > Therefore, after the tablesync worker restarts, it starts logical
> > replication from starting point 0/0 and ends up applying transactions
> > that have already been applied.
>
> Thanks for the analysis!
>
> I think you are right. replorigin_drop_by_name() clears the
> remote_lsn/local_lsn in shared memory, and that change is not rolled back
> if we then fail to drop the origin (see the first sketch below the quoted
> text).
>
> Per off-list discussion with Amit, to fix this problem I think we need to
> drop the origin in the tablesync worker after committing the transaction
> that sets the relstate to SYNCDONE. That guarantees the worker won't be
> restarted even if we subsequently fail to drop the origin. Besides, we
> need to add the origin-drop code back to the apply worker, in case the
> tablesync worker failed to drop the origin before exiting (which seems a
> rare case); see the second sketch below. I will share the patch if we
> agree on the fix.

Attached is a draft patch; I'm sharing it here so that others can take a look
at the basic logic. I will keep testing and improving it.
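
For reference, the ordering the fix aims for is roughly the following (a
sketch only; the exact arguments, e.g. missing_ok = true, are my assumption
and may differ in the final patch):

    /* Tablesync worker: make SYNCDONE durable before touching the origin. */
    UpdateSubscriptionRelState(MyLogicalRepWorker->subid,
                               MyLogicalRepWorker->relid,
                               SUBREL_STATE_SYNCDONE,
                               MyLogicalRepWorker->relstate_lsn);
    CommitTransactionCommand();     /* once committed, this worker is not
                                     * restarted for the relation, even if
                                     * the origin drop below fails */

    replorigin_session_reset();
    replorigin_drop_by_name(originname, true /* missing_ok */, false);

    /*
     * Apply worker (process_syncing_tables_for_apply): when it sees
     * SYNCDONE, retry the drop with missing_ok = true, covering the rare
     * case where the tablesync worker exited before dropping the origin.
     */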

Best regards,
Hou zj

Attachment: 0001-Fix-unexpected-clean-of-in-memory-remote_lsn-due-to-.patch
(application/octet-stream, 7.5 KB)
