From: vignesh C <vignesh21(at)gmail(dot)com>
To: "houzj(dot)fnst(at)fujitsu(dot)com" <houzj(dot)fnst(at)fujitsu(dot)com>
Cc: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, "pgsql-hackers(at)lists(dot)postgresql(dot)org" <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Deadlock between logrep apply worker and tablesync worker
Date: 2023-01-30 10:32:52
Message-ID: CALDaNm2_xXBSjccM_Wc57CdJ6XT7Zzdq32Qxt8B+4fQHK4Hyww@mail.gmail.com
Lists: pgsql-hackers
On Mon, 30 Jan 2023 at 13:00, houzj(dot)fnst(at)fujitsu(dot)com
<houzj(dot)fnst(at)fujitsu(dot)com> wrote:
>
> On Monday, January 30, 2023 2:32 PM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
> >
> > On Mon, Jan 30, 2023 at 9:20 AM vignesh C <vignesh21(at)gmail(dot)com> wrote:
> > >
> > > On Sat, 28 Jan 2023 at 11:26, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
> > > >
> > > > One thing that looks a bit odd is that we will anyway have a similar
> > > > check in replorigin_drop_guts() which is a static function and
> > > > called from only one place, so, will it be required to check at both places?
> > >
> > > There is a possibility that the initial check verifying that the
> > > replication origin exists in replorigin_drop_by_name succeeds, but
> > > that either the tablesync worker or the apply worker then drops the
> > > replication origin before replorigin_drop_guts runs,
> > >
> >
> > Won't locking on the particular origin prevent concurrent drops? IIUC, the
> > drop happens after the patch acquires the lock on the origin.
>
> Yes, I think the existence check in replorigin_drop_guts is unnecessary, as we
> already lock the origin before reaching it. The check in replorigin_drop_guts
> is a custom check after calling SearchSysCache1 to get the tuple, but the error
> cannot occur because no concurrent drop can be performed while the lock is held.
>
> To make this simpler, one idea is to move the code that gets the tuple from
> the system cache into replorigin_drop_by_name(). After locking the origin, we
> can fetch the tuple and do the existence check there, and then reuse the same
> tuple to perform the origin delete. With this approach we only need to check
> origin existence once, after locking. BTW, if we do this, it would be better to
> rename replorigin_drop_guts() to something like replorigin_state_clear(), as the
> function then only clears the in-memory information.
>
> The code could be like:
>
> -------
> replorigin_drop_by_name(const char *name, bool missing_ok, bool nowait)
> ...
>     /*
>      * Lock the origin to prevent concurrent drops. We keep the lock until
>      * the end of transaction.
>      */
>     LockSharedObject(ReplicationOriginRelationId, roident, 0,
>                      AccessExclusiveLock);
>
>     tuple = SearchSysCache1(REPLORIGIDENT, ObjectIdGetDatum(roident));
>     if (!HeapTupleIsValid(tuple))
>     {
>         if (!missing_ok)
>             elog(ERROR, "cache lookup failed for replication origin with ID %d",
>                  roident);
>
>         return;
>     }
>
>     replorigin_state_clear(rel, roident, nowait);
>
>     /*
>      * Now, we can delete the catalog entry.
>      */
>     CatalogTupleDelete(rel, &tuple->t_self);
>     ReleaseSysCache(tuple);
>
>     CommandCounterIncrement();
> ...
+1 for this change, as it removes the redundant existence check. I will
post an updated version with this change.
Regards,
Vignesh