Re: [bug fix] prepared transaction might be lost when max_prepared_transactions is zero on the subscriber

From: shveta malik <shveta(dot)malik(at)gmail(dot)com>
To: "Hayato Kuroda (Fujitsu)" <kuroda(dot)hayato(at)fujitsu(dot)com>
Cc: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, "Zhijie Hou (Fujitsu)" <houzj(dot)fnst(at)fujitsu(dot)com>, "pgsql-hackers(at)lists(dot)postgresql(dot)org" <pgsql-hackers(at)lists(dot)postgresql(dot)org>, shveta malik <shveta(dot)malik(at)gmail(dot)com>
Subject: Re: [bug fix] prepared transaction might be lost when max_prepared_transactions is zero on the subscriber
Date: 2024-08-12 10:06:52
Message-ID: CAJpy0uA_BMM_Rb-1a7SNubDJ0rhTya5fACAHa2SmzgdnWRf63Q@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Fri, Aug 9, 2024 at 2:39 PM Hayato Kuroda (Fujitsu)
<kuroda(dot)hayato(at)fujitsu(dot)com> wrote:
>
> Dear Amit, Shveta, Hou,
>
> Thanks for giving many comments! I've updated the patch.
>
> > @@ -4409,6 +4409,14 @@ start_apply(XLogRecPtr origin_startpos)
> > }
> > PG_CATCH();
> > {
> > + /*
> > + * Reset the origin data to prevent the advancement of origin progress
> > + * if the transaction failed to apply.
> > + */
> > + replorigin_session_origin = InvalidRepOriginId;
> > + replorigin_session_origin_lsn = InvalidXLogRecPtr;
> > + replorigin_session_origin_timestamp = 0;
> >
> > Can't we call replorigin_reset() instead here?
>
> I didn't use the function because arguments of calling function looked strange,
> but ideally I can. Fixed.
>
> > + /*
> > + * Register a callback to reset the origin state before aborting the
> > + * transaction in ShutdownPostgres(). This is to prevent the advancement
> > + * of origin progress if the transaction failed to apply.
> > + */
> > + before_shmem_exit(replorigin_reset, (Datum) 0);
> >
> > I think we need this despite resetting the origin-related variables in
> > PG_CATCH block to handle FATAL error cases, right? If so, can we use
> > PG_ENSURE_ERROR_CLEANUP() instead of PG_CATCH()?
>
> There are two reasons to add a shmem-exit callback. One is to support a FATAL,
> another one is to support the case that user does the shutdown request while
> applying changes. In this case, I think ShutdownPostgres() can be called so that
> the session origin may advance.

Agree that we need the 'reset' during shutdown flow as well. Details at [1]

> However, I think we cannot use PG_ENSURE_ERROR_CLEANUP()/PG_END_ENSURE_ERROR_CLEANUP
> macros here. According to codes, it assumes that any before-shmem callbacks are
> not registered within the block because the cleanup function is registered and canceled
> within the macro. LogicalRepApplyLoop() can register the function when
> it handles COMMIT PREPARED message so it breaks the rule.

Yes, on reanalyzing, we can not use PG_ENSURE_ERROR_CLEANUP in this
flow due to the limitation of cancel_before_shmem_exit() that it can
cancel only the last registered callback, while in our flow we have
other callbacks also registered after we register our reset one.

[1]
Shutdown analysis:

I did a test where we make apply worker wait for say 0sec right after
it updates 'replorigin_session_origin_lsn' in
apply_handle_prepare_internal() (say it at code-point1). During this
wait, we triggered a subscriber shutdown.Under normal circumstances,
everything works fine: after the wait, the apply worker processes the
SIGTERM (via LogicalRepApplyLoop-->ProcessInterrupts()) only after the
prepare phase is complete, meaning the PREPARE LSN is flushed, and the
origin LSN is correctly advanced in EndPrepare() before the worker
shuts down. But, if we insert a LOG statement between code-point1 and
EndPrepare(), the apply worker processes the SIGTERM during the LOG
operation, as errfinish() triggers CHECK_FOR_INTERRUPTS at the end,
which causes the origin LSN to be incorrectly advanced during
shutdown. And thus the subsequent COMMIT PREPARED on the publisher
results in ERROR on subscriber; as the 'PREPARE' is lost on the
subscriber and is not resent by the publisher. ERROR: prepared
transaction with identifier "pg_gid_16403_757" does not exist

A similar problem can also occur without introducing any additional
LOG statements, but by simply setting log_min_messages=debug5. This
causes the apply worker to output a few DEBUG messages upon receiving
a shutdown signal (after code-point1) before it reaches EndPrepare().
As a result, it ends up processing the SIGTERM (during logging)and
invoking AbortOutOfAnyTransaction(), which incorrectly advances the
origin LSN.

thanks
Shveta

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message shveta malik 2024-08-12 10:10:03 Re: [bug fix] prepared transaction might be lost when max_prepared_transactions is zero on the subscriber
Previous Message jian he 2024-08-12 10:03:22 minor comments issue in ResultRelInfo src/include/nodes/execnodes.h