Re: Single transaction in the tablesync worker?

From: Peter Smith <smithpb2250(at)gmail(dot)com>
To: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
Cc: PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Single transaction in the tablesync worker?
Date: 2021-02-01 05:53:33
Message-ID: CAHut+PtkZbYbNCnVLawWyh_k4C39ddDMdhmtTf6OhDc+MuaV4A@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Mon, Feb 1, 2021 at 3:44 PM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
>
> On Mon, Feb 1, 2021 at 9:38 AM Peter Smith <smithpb2250(at)gmail(dot)com> wrote:
> >
> > On Mon, Feb 1, 2021 at 1:54 PM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
> > >
> > > On Mon, Feb 1, 2021 at 6:48 AM Peter Smith <smithpb2250(at)gmail(dot)com> wrote:
> > > >
> > > > On Sun, Jan 31, 2021 at 12:19 AM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
> > > > >
> > > > > I have made the below changes in the patch. Let me know what you think
> > > > > about these?
> > > > > 1. It was a bit difficult to understand the code in DropSubscription
> > > > > so I have rearranged the code to match the way we are doing in HEAD
> > > > > where we drop the slots at the end after finishing all the other
> > > > > cleanup.
> > > >
> > > > There was a reason why the v22 logic was different from HEAD.
> > > >
> > > > The broken connection leaves dangling slots which is unavoidable.
> > > >
> > >
> > > I think this is true only when the user specifically requested it by
> > > the use of "ALTER SUBSCRIPTION ... SET (slot_name = NONE)", right?
> > > Otherwise, we give an error on a broken connection. Also, if that is
> > > true then is there a reason to pass missing_ok as true while dropping
> > > tablesync slots?
> > >
> >
> > AFAIK there is always a potential race with DropSubscription dropping
> > slots. The DropSubscription might be running at exactly the same time
> > the apply worker has just dropped the very same tablesync slot.
> >
>
> We stopped the workers before getting a list of NotReady relations and
> then we try to drop the corresponding slots. So, how such a race
> condition can happen? Note, because we have a lock on pg_subscrition,
> there is no chance that the workers can restart till the transaction
> end.

OK. I think I was forgetting the logicalrep_worker_stop would also go
into a loop waiting for the worker process to die. So even if the
tablesync worker does simultaneously drop it's own slot, I think it
will certainly at least be in SYNCDONE state before DropSubscription
does anything else with that worker.

>
> > By
> > saying missing_ok = true it means DropSubscription would not give
> > ERROR in such a case, so at least the DROP SUBSCRIPTION would not fail
> > with an unexpected error.
> >
> > >
> > > > But,
> > > > whereas the user knows the name of the Subscription slot (they named
> > > > it), there is no easy way for them to know the names of the remaining
> > > > tablesync slots unless we log them.
> > > >
> > > > That is why the v22 code was written to process the tablesync slots
> > > > even for wrconn == NULL so their name could be logged:
> > > > elog(WARNING, "no connection; cannot drop tablesync slot \"%s\".",
> > > > syncslotname);
> > > >
> > > > The v23 patch removed this dangling slot name information, so it makes
> > > > it difficult for the user to know what tablesync slots to cleanup.
> > > >
> > >
> > > Okay, then can we think of combining with the existing error of the
> > > replication slot? I think that might produce a very long message, so
> > > another idea could be to LOG a separate WARNING for each such slot
> > > just before giving the error.
> >
> > There may be many subscribed tables so I agree combining to one
> > message might be too long. Yes, we can add another loop to output the
> > necessary information. But, isn’t logging each tablesync slot WARNING
> > before the subscription slot ERROR exactly the behaviour which already
> > existed in v22. IIUC the DropSubscription refactoring in V23 was not
> > done for any functional change, but was intended only to make the code
> > simpler, but how is that goal achieved if v23 ends up needing 3 loops
> > where v22 only needed 1 loop to do the same thing?
> >
>
> No, there is a functionality change as well. The way we have code in
> v22 can easily lead to a problem when we have dropped the slots but
> get an error while removing origins or an entry from subscription rel.
> In such cases, we won't be able to rollback the drop of slots but the
> other database operations will be rolled back. This is the reason we
> have to drop the slots at the end. We need to ensure the same thing
> for AlterSubscription_refresh. Does this make sense now?
>

OK.

----
Kind Regards,
Peter Smith.
Fujitsu Australia.

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Tom Lane 2021-02-01 05:57:53 Re: [sqlsmith] Failed assertion during partition pruning
Previous Message Tom Lane 2021-02-01 05:48:57 Re: [sqlsmith] Failed assertion during partition pruning