Re: BUG #18055: logical decoding core on AllocateSnapshotBuilder()

From: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
To: "Zhijie Hou (Fujitsu)" <houzj(dot)fnst(at)fujitsu(dot)com>
Cc: "ocean_li_996(at)163(dot)com" <ocean_li_996(at)163(dot)com>, "pgsql-bugs(at)lists(dot)postgresql(dot)org" <pgsql-bugs(at)lists(dot)postgresql(dot)org>
Subject: Re: BUG #18055: logical decoding core on AllocateSnapshotBuilder()
Date: 2023-08-17 10:03:20
Message-ID: CAA4eK1LXD6iv2oEL8a-qYx6Y+MPv7eBim-dF-NCsJ-QL8z3j1Q@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-bugs

On Thu, Aug 17, 2023 at 12:21 PM Zhijie Hou (Fujitsu)
<houzj(dot)fnst(at)fujitsu(dot)com> wrote:
>
> On Tuesday, August 15, 2023 12:05 AM PG Bug reporting form <noreply(at)postgresql(dot)org> wrote:
> >
> > The following bug has been logged on the website:
> >
> > Bug reference: 18055
> > Logged by: ocean li
> > Email address: ocean_li_996(at)163(dot)com
> > PostgreSQL version: 11.9
> > Operating system: centos7 5.10.84 x86_64
> > Description:
> >
> > For testing logical decoding module, *pg_logical_slot_get_changes* function
> > is used. Sometimes i got an core whose stack was like that:
> > ...
> > And in level #3 of stack above, NInitialRunningXacts is 2 and
> > InitialRunningXacts is not NULL observed in one of cores.
> >
> > Using of NInitialRunningXacts and InitialRunningXacts are clear. Currently, the
> > core, as far as i know, maybe caused by this way: an ERROR raised when calling
> > *pg_logical_slot_get_changes_guts* function. The code part of
> > PG_CATCH() doses not reset NInitialRunningXacts and InitialRunningXacts.
> > Then, calling pg_logical_slot_get_changes_guts again, the core may occur.
> > Unfortunately, I couldn't find a minimal reproduction case. However, I
> > observed an *ERROR: canceling statement due to statement timeout* logged
> > before each core occurred. (For some reason, I can't provide the information of
> > log)
>
> Thanks for the report and the fix!
>
> I can also reproduce by the steps[1] in PG15~11(Note that we need to change
> LOG_SNAPSHOT_INTERVAL_MS to a bigger number to avoid extra running xacts wal
> records).
>
> About the patch:
>
> ---
> /* free context, call shutdown callback */
> FreeDecodingContext(ctx);
>
> ReplicationSlotRelease();
> InvalidateSystemCaches();
> }
> PG_CATCH();
> {
> +
> + /* free context, call shutdown callback */
> + FreeDecodingContext(ctx);
> +
> + ReplicationSlotRelease();
>
> I think we could not directly call the cleanup functions here. There could be
> two risks:
> 1) it's possible either 'ctx' hasn't been initialized before the error happen,
> we don't need to clean it up in this case
> 2) It's possible 'ctx' or 'MyReplicationSlot' been be cleaned up normally
> before entering the catch() block which means we will double cleanup(free) it.
>
> To improve this, I think we can use PG_FINALLY() here and move all these cleanup functions
> in it and do the null check for 'ctx' before cleaning up.
>
> Just share one small patch based on your fix for reference, feel free to update it.
>

- LogicalDecodingContext *ctx;
+ LogicalDecodingContext *ctx = NULL;

Don't we need to use volatile for ctx? See the following comment in
elog.h: "Note: if a local variable of the function containing PG_TRY
is modified in the PG_TRY section and used in the PG_CATCH section,
that variable must be declared "volatile" for POSIX compliance."?

--
With Regards,
Amit Kapila.

In response to

Responses

Browse pgsql-bugs by date

  From Date Subject
Next Message Amit Kapila 2023-08-17 10:07:10 Re: BUG #18055: logical decoding core on AllocateSnapshotBuilder()
Previous Message Zhijie Hou (Fujitsu) 2023-08-17 06:51:02 RE: BUG #18055: logical decoding core on AllocateSnapshotBuilder()