RE: BUG #18055: logical decoding core on AllocateSnapshotBuilder()

From: "Zhijie Hou (Fujitsu)" <houzj(dot)fnst(at)fujitsu(dot)com>
To: Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>
Cc: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, "ocean_li_996(at)163(dot)com" <ocean_li_996(at)163(dot)com>, "pgsql-bugs(at)lists(dot)postgresql(dot)org" <pgsql-bugs(at)lists(dot)postgresql(dot)org>
Subject: RE: BUG #18055: logical decoding core on AllocateSnapshotBuilder()
Date: 2023-08-18 04:21:53
Message-ID: OS0PR01MB57167C62D7DA4A8EBBC92B0A941BA@OS0PR01MB5716.jpnprd01.prod.outlook.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-bugs

On Thursday, August 17, 2023 10:04 PM Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com> wrote:
>
> On Thu, Aug 17, 2023 at 9:10 PM Zhijie Hou (Fujitsu) <houzj(dot)fnst(at)fujitsu(dot)com>
> wrote:
> >
> > On Thursday, August 17, 2023 6:07 PM Amit Kapila
> <amit(dot)kapila16(at)gmail(dot)com> wrote:
> > >
> > > On Thu, Aug 17, 2023 at 3:33 PM Amit Kapila
> > > <amit(dot)kapila16(at)gmail(dot)com>
> > > wrote:
> > > >
> > > > On Thu, Aug 17, 2023 at 12:21 PM Zhijie Hou (Fujitsu)
> > > > <houzj(dot)fnst(at)fujitsu(dot)com> wrote:
> > > > >
> > > > > On Tuesday, August 15, 2023 12:05 AM PG Bug reporting form
> > > <noreply(at)postgresql(dot)org> wrote:
> > > > > >
> > > > > > The following bug has been logged on the website:
> > > > > >
> > > > > > Bug reference: 18055
> > > > > > Logged by: ocean li
> > > > > > Email address: ocean_li_996(at)163(dot)com
> > > > > > PostgreSQL version: 11.9
> > > > > > Operating system: centos7 5.10.84 x86_64
> > > > > > Description:
> > > > > >
> > > > > > For testing logical decoding module,
> > > > > > *pg_logical_slot_get_changes* function is used. Sometimes i got an
> core whose stack was like that:
> > > > > > ...
> > > > > > And in level #3 of stack above, NInitialRunningXacts is 2 and
> > > > > > InitialRunningXacts is not NULL observed in one of cores.
> > > > > >
> > > > > > Using of NInitialRunningXacts and InitialRunningXacts are clear.
> > > > > > Currently, the core, as far as i know, maybe caused by this way:
> > > > > > an ERROR raised when calling
> > > > > > *pg_logical_slot_get_changes_guts* function. The code part of
> > > > > > PG_CATCH() doses not reset NInitialRunningXacts and
> > > InitialRunningXacts.
> > > > > > Then, calling pg_logical_slot_get_changes_guts again, the core
> > > > > > may
> > > occur.
> > > > > > Unfortunately, I couldn't find a minimal reproduction case.
> > > > > > However, I observed an *ERROR: canceling statement due to
> > > > > > statement timeout* logged before each core occurred. (For
> > > > > > some reason, I can't provide the information of
> > > > > > log)
> > > > >
> > > > > Thanks for the report and the fix!
> > > > >
> > > > > I can also reproduce by the steps[1] in PG15~11(Note that we
> > > > > need to change LOG_SNAPSHOT_INTERVAL_MS to a bigger number to
> > > > > avoid extra running xacts wal records).
> > > > >
> > > > > About the patch:
> > > > >
> > > > > ---
> > > > > /* free context, call shutdown callback */
> > > > > FreeDecodingContext(ctx);
> > > > >
> > > > > ReplicationSlotRelease();
> > > > > InvalidateSystemCaches();
> > > > > }
> > > > > PG_CATCH();
> > > > > {
> > > > > +
> > > > > + /* free context, call shutdown callback */
> > > > > + FreeDecodingContext(ctx);
> > > > > +
> > > > > + ReplicationSlotRelease();
> > > > >
> > > > > I think we could not directly call the cleanup functions here.
> > > > > There could be two risks:
> > > > > 1) it's possible either 'ctx' hasn't been initialized before the
> > > > > error happen, we don't need to clean it up in this case
> > > > > 2) It's possible 'ctx' or 'MyReplicationSlot' been be cleaned up
> > > > > normally before entering the catch() block which means we will
> > > > > double
> > > cleanup(free) it.
> > > > >
> > > > > To improve this, I think we can use PG_FINALLY() here and move
> > > > > all these cleanup functions in it and do the null check for
> > > > > 'ctx' before cleaning
> > > up.
> > > > >
> > > > > Just share one small patch based on your fix for reference, feel
> > > > > free to
> > > update it.
> > > > >
> > > >
> > > > - LogicalDecodingContext *ctx;
> > > > + LogicalDecodingContext *ctx = NULL;
> > > >
> > > > Don't we need to use volatile for ctx? See the following comment
> > > > in
> > > > elog.h: "Note: if a local variable of the function containing
> > > > PG_TRY is modified in the PG_TRY section and used in the PG_CATCH
> > > > section, that variable must be declared "volatile" for POSIX compliance."?
> > > >
> >
> > Thanks for pointing it out, I think we need to use volatile for it.
> >
> > >
> > > BTW, don't we need a similar change in
> pg_logical_replication_slot_advance()?
> >
> > You are right, I can reproduce the same following the steps in [1] and
> > replace the last two pg_logical_slot_get_changes function calls with
> > two "SELECT pg_replication_slot_advance('isolation_slot', (select
> > pg_current_wal_lsn()));"
>
> I've not tested but the same is true in a case where
> pg_create_logical_replication_slot() raises an error (e.g., cancelled while
> DecodingContextFindStartpoint()) after creating the decoding context?

Indeed, I can reproduce this case[1] as well, thanks!

Here is a new version patch which addressed this based on PG15.

Apart from this, I am thinking we’d better also apply the fix on HEAD even if
it doesn't have such flag and Assert. Because in error case(HEAD), we will skip
calling shutdown callbacks for output plugins, so it could cause other
problems. For example, in pgoutput we free and reset the hash table in shutdown
callbacks, if we skip this, and next time we consume changes from another slot,
we would use an expired hash table which can cause problems. I can provide
patches for all branches if the fix is OK.

[1]
~~ reproduction steps (Need to change LOG_SNAPSHOT_INTERVAL_MS to a bigger number)

session 1:
-----
SELECT 'init' FROM pg_create_logical_replication_slot('isolation_slot_1', 'test_decoding');
-----

session 2:
-----
- Start a transaction
BEGIN;
CREATE TABLE test(a int);
INSERT INTO test VALUES(1);
-----

session 3:
-----
- Note, to let the current slot restore another snapshot, we need to use gdb to block the current backend at
- SnapBuildFindSnapshot(), the backend should have logged the running_xacts WAL
- before reaching SnapBuildFindSnapshot.

SELECT 'init' FROM pg_create_logical_replication_slot('isolation_slot_2', 'test_decoding');
-----

session 1:
-----
- The current backend will serialize the snapshot when decoding the running_xacts WAL, and the snapshot
- can be used by other slots (e.g. isolation_slot_2)

SELECT data FROM pg_logical_slot_get_changes('isolation_slot_1', NULL, NULL, 'skip-empty-xacts', '1', 'include-xids', '0');
-----

session 3:
-----
- Release the process and we can see if restore the snaphost and set NInitialRunningXacts to 1.
- Then let it report an ERROR and try to create the slot again after ERROR.
SELECT 'init' FROM pg_create_logical_replication_slot('isolation_slot_2', 'test_decoding');
-----

Best Regards,
Hou zj

Attachment Content-Type Size
v4-0001-cleanup-decoding-context-in-error-cases.patch application/octet-stream 3.6 KB

In response to

Responses

Browse pgsql-bugs by date

  From Date Subject
Next Message Michael Paquier 2023-08-18 04:55:32 Re: BUG #17928: Standby fails to decode WAL on termination of primary
Previous Message Thomas Munro 2023-08-18 02:30:31 Re: BUG #17928: Standby fails to decode WAL on termination of primary