Re: Single transaction in the tablesync worker?

From: Peter Smith <smithpb2250(at)gmail(dot)com>
To: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
Cc: Craig Ringer <craig(dot)ringer(at)enterprisedb(dot)com>, Ashutosh Bapat <ashutosh(dot)bapat(dot)oss(at)gmail(dot)com>, Ashutosh Bapat <ashutosh(dot)bapat(at)enterprisedb(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Petr Jelinek <petr(dot)jelinek(at)enterprisedb(dot)com>, Petr Jelinek <petr(at)2ndquadrant(dot)com>, Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, Ajin Cherian <itsajin(at)gmail(dot)com>
Subject: Re: Single transaction in the tablesync worker?
Date: 2020-12-22 11:28:21
Message-ID: CAHut+PuT=vp9uDTnpRNLYt+LhrB5XFcKroVSgfcBfW7hj_e3_w@mail.gmail.com
Lists: pgsql-hackers

On Mon, Dec 21, 2020 at 11:36 PM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
>
> On Mon, Dec 21, 2020 at 3:17 PM Peter Smith <smithpb2250(at)gmail(dot)com> wrote:
> >
> > On Mon, Dec 21, 2020 at 4:23 PM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
> >
> > > Few other comments:
> > > ==================
> >
> > Thanks for your feedback.
> >
> > > 1.
> > > * FIXME 3 - Crashed tablesync workers may also have remaining slots
> > > because I don't think
> > > + * such workers are even iterated by this loop, and nobody else is
> > > removing them.
> > > + */
> > > + if (slotname)
> > > + {
> > >
> > > The above FIXME is not clear to me. Actually, the crashed workers
> > > should restart, finish their work, and drop the slots. So not sure
> > > what exactly this FIXME refers to?
> >
> > Yes, normally if the tablesync can complete it should behave like that.
> > But I think there are other scenarios where it may be unable to
> > clean up after itself. For example:
> >
> > i) Maybe the crashed tablesync worker cannot finish. e.g. a row insert
> > handled by tablesync can hit a PK violation, which will also crash
> > every re-launched/replacement tablesync worker, again and again.
> > This can be reproduced in the debugger. If DropSubscription
> > doesn't clean up the tablesync's slot then nobody will.
> >
> > ii) Also, the DROP SUBSCRIPTION code has locking (see code comment) "to
> > ensure that the launcher doesn't restart new worker during dropping
> > the subscription".
> >
>
> Yeah, I have also read that comment but do you know how it is
> preventing relaunch? How does the subscription lock help?

Hmmm. I did see there is a matching lock in get_subscription_list of
launcher.c, which may be what that code comment was referring to. But
I am still unsure how this lock prevents anybody (e.g.
process_syncing_tables_for_apply) from executing another
logicalrep_worker_launch.
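
For reference (recalling the code from memory, so treat this as a sketch
rather than a verified quote), I think the relevant locks are that DROP
SUBSCRIPTION opens pg_subscription with AccessExclusiveLock while the
launcher's get_subscription_list opens it with AccessShareLock:

    /* In DropSubscription(), per the comment quoted above (from memory): */
    rel = table_open(SubscriptionRelationId, AccessExclusiveLock);

    /* In get_subscription_list() of launcher.c (from memory): */
    rel = table_open(SubscriptionRelationId, AccessShareLock);

Those two lock modes conflict, so the launcher's scan of pg_subscription
would block until the DROP commits; whether that also covers a relaunch
initiated by process_syncing_tables_for_apply is the part I am unsure
about.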

>
> > So executing DROP SUBSCRIPTION will prevent a newly
> > crashed tablesync from re-launching, which means it won't be able to
> > take care of its own slot. If DropSubscription doesn't clean up that
> > tablesync's slot then nobody will.
> >
>
>
> > >
> > > 2.
> > > DropSubscription()
> > > {
> > > ..
> > > ReplicationSlotDropAtPubNode(
> > > + NULL,
> > > + conninfo, /* use conninfo to make a new connection. */
> > > + subname,
> > > + syncslotname);
> > > ..
> > > }
> > >
> > > With the above call, it will form a connection with the publisher and
> > > drop the required slots. I think we need to save the connection info
> > > so that we don't need to connect/disconnect for each slot to be
> > > dropped. Later in this function, we again connect and drop the apply
> > > worker slot. I think we should connect just once and drop both the
> > > apply and tablesync slots, if any.
> >
> > OK. IIUC this is a suggestion for more efficient connection usage,
> > rather than an actual bug, right?
> >
>
> Yes, it is for effective connection usage.
>

I have addressed this in the latest patch [v6].
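
To illustrate the single-connection approach, the shape I had in mind is
something like the below (an untested sketch inside DropSubscription, not
necessarily the exact v6 code; sync_slot_names is a placeholder for however
the tablesync slot names were collected, and the ReplicationSlotDropAtPubNode
arguments follow the signature quoted earlier):

    WalReceiverConn *wrconn;
    char       *err = NULL;
    ListCell   *lc;

    /* Connect to the publisher once and reuse the connection for every slot. */
    wrconn = walrcv_connect(conninfo, true, subname, &err);
    if (wrconn == NULL)
        ereport(ERROR,
                (errmsg("could not connect to publisher when attempting to drop replication slots: %s",
                        err)));

    /* Drop any slots left behind by tablesync workers. */
    foreach(lc, sync_slot_names)
        ReplicationSlotDropAtPubNode(wrconn, NULL, subname, (char *) lfirst(lc));

    /* Drop the apply worker's slot, if the subscription has one. */
    if (slotname)
        ReplicationSlotDropAtPubNode(wrconn, NULL, subname, slotname);

    walrcv_disconnect(wrconn);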

> >
> > >
> > > 3.
> > > ReplicationSlotDropAtPubNode(WalReceiverConn *wrconn_given, char
> > > *conninfo, char *subname, char *slotname)
> > > {
> > > ..
> > > + PG_TRY();
> > > ..
> > > + PG_CATCH();
> > > + {
> > > + /* NOP. Just gobble any ERROR. */
> > > + }
> > > + PG_END_TRY();
> > >
> > > Why are we suppressing the error instead of handling it in
> > > the same way as we do while dropping the apply worker slot in
> > > DropSubscription?
> >
> > This function is common - it is also called from the tablesync
> > finish_sync_worker. But in the finish_sync_worker case I wanted to
> > avoid throwing an ERROR which would cause the tablesync to crash and
> > relaunch (and crash/relaunch/repeat...) when all it was trying to do
> > in the first place was just clean up and exit the process. Perhaps the
> > error suppression should be conditional, depending on where this
> > function is called from?
> >
>
> Yeah, that could be one way and if you follow my previous suggestion
> this function might change a bit more.

I have addressed this in the latest patch [v6].
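
To show what I mean by making the suppression conditional, something along
these lines could work (an untested sketch only; the failOK flag and the
function name here are illustrative, not necessarily what the patch ended up
with):

    /*
     * Drop a slot on the publisher, letting the caller decide whether a
     * failure is fatal.  finish_sync_worker would pass failOK = true so the
     * worker can still clean up and exit; DropSubscription would pass
     * failOK = false so problems are reported.
     */
    static void
    drop_slot_at_publisher(WalReceiverConn *wrconn, char *slotname, bool failOK)
    {
        PG_TRY();
        {
            StringInfoData cmd;
            WalRcvExecResult *res;

            initStringInfo(&cmd);
            appendStringInfo(&cmd, "DROP_REPLICATION_SLOT %s WAIT",
                             quote_identifier(slotname));

            res = walrcv_exec(wrconn, cmd.data, 0, NULL);
            if (res->status != WALRCV_OK_COMMAND)
                ereport(ERROR,
                        (errmsg("could not drop the replication slot \"%s\" on publisher",
                                slotname),
                         errdetail("The error was: %s", res->err)));

            walrcv_clear_result(res);
            pfree(cmd.data);
        }
        PG_CATCH();
        {
            if (!failOK)
                PG_RE_THROW();

            /* Gobble the error so a cleanup-only caller doesn't crash and relaunch. */
            FlushErrorState();
        }
        PG_END_TRY();
    }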

---
[v6] https://www.postgresql.org/message-id/CAHut%2BPuCLty2HGNT6neyOcUmBNxOLo%3DybQ2Yv-nTR4kFY-8QLw%40mail.gmail.com

Kind Regards,
Peter Smith.
Fujitsu Australia.
