Re: [bug fix] prepared transaction might be lost when max_prepared_transactions is zero on the subscriber

From: shveta malik <shveta(dot)malik(at)gmail(dot)com>
To: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
Cc: "Zhijie Hou (Fujitsu)" <houzj(dot)fnst(at)fujitsu(dot)com>, "Hayato Kuroda (Fujitsu)" <kuroda(dot)hayato(at)fujitsu(dot)com>, "pgsql-hackers(at)lists(dot)postgresql(dot)org" <pgsql-hackers(at)lists(dot)postgresql(dot)org>, shveta malik <shveta(dot)malik(at)gmail(dot)com>
Subject: Re: [bug fix] prepared transaction might be lost when max_prepared_transactions is zero on the subscriber
Date: 2024-08-09 04:04:54
Message-ID: CAJpy0uDLUAsz3iaAUAcxFztLuLwJ91r+pxNMU5Q2bXKijMAwDQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Thu, Aug 8, 2024 at 6:08 PM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
>
> On Thu, Aug 8, 2024 at 3:41 PM Zhijie Hou (Fujitsu)
> <houzj(dot)fnst(at)fujitsu(dot)com> wrote:
> >
> > On Thursday, August 8, 2024 6:01 PM shveta malik <shveta(dot)malik(at)gmail(dot)com> wrote:
> > >
> > > On Thu, Aug 8, 2024 at 12:03 PM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
> > > wrote:
> > > >
> > > > On Thu, Aug 8, 2024 at 10:37 AM Hayato Kuroda (Fujitsu)
> > > > <kuroda(dot)hayato(at)fujitsu(dot)com> wrote:
> > > > >
> > > > ...
> > > > >
> > > > > An easiest fix is to reset session replication origin before calling
> > > > > the RecordTransactionAbort(). I think this can happen when 1)
> > > > > LogicalRepApplyLoop() raises an ERROR or 2) apply worker exits.
> > > Attached patch can fix the issue on HEAD.
> > > > >
> > > >
> > > > Few comments:
> > > > =============
> > > > *
> > > > @@ -4409,6 +4409,14 @@ start_apply(XLogRecPtr origin_startpos)
> > > > }
> > > > PG_CATCH();
> > > > {
> > > > + /*
> > > > + * Reset the origin data to prevent the advancement of origin
> > > > + progress
> > > > + * if the transaction failed to apply.
> > > > + */
> > > > + replorigin_session_origin = InvalidRepOriginId;
> > > > + replorigin_session_origin_lsn = InvalidXLogRecPtr;
> > > > + replorigin_session_origin_timestamp = 0;
> > > >
> > > > Can't we call replorigin_reset() instead here?
> > > >
> > > > *
> > > > + /*
> > > > + * Register a callback to reset the origin state before aborting the
> > > > + * transaction in ShutdownPostgres(). This is to prevent the
> > > > + advancement
> > > > + * of origin progress if the transaction failed to apply.
> > > > + */
> > > > + before_shmem_exit(replorigin_reset, (Datum) 0);
> > > >
> > > > I think we need this despite resetting the origin-related variables in
> > > > PG_CATCH block to handle FATAL error cases, right? If so, can we use
> > > > PG_ENSURE_ERROR_CLEANUP() instead of PG_CATCH()?
> > >
> > > +1
> > >
> > > Basic tests work fine on this patch. Just thinking out loud,
> > > SetupApplyOrSyncWorker() is called for table-sync worker as well and IIUC
> > > tablesync worker does not deal with 2PC txns. So do we even need to register
> > > replorigin_reset() for tablesync worker as well? If we may hit such an issue in
> > > general, then perhaps we need it in table-sync worker otherwise not. It
> > > needs some investigation. Thoughts?
> >
> > I think this is a general issue that can occur not only due to 2PC. IIUC, this
> > problem should arise if any ERROR arises after setting the
> > replorigin_session_origin_lsn but before the CommitTransactionCommand is
> > completed. If so, I think we should register it for tablesync worker as well.
> >
>
> As pointed out earlier, won't using PG_ENSURE_ERROR_CLEANUP() instead
> of PG_CATCH() be enough?

Yes, I think it should suffice. IIUC, we are going to change
'replorigin_session_origin_lsn' only in start_apply() and not before
that, and thus ensuring its reset during any ERROR or FATAL in
start_apply() is good enough. And I guess we don't want this
origin-reset to be called during regular shutdown, isn't it? But
registering it through before_shmem_exit() will make the
reset-function to be called during normal shutdown as well.
And to answer my previous question (as Hou-San also pointed out), we
do need it in table-sync worker as well. So a change in start_apply
will make sure the fix is valid for both apply and tablesync worker.

thanks
Shveta

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Noah Misch 2024-08-09 04:05:46 Re: Don't overwrite scan key in systable_beginscan()
Previous Message shveta malik 2024-08-09 03:33:10 Re: Found issues related with logical replication and 2PC