RE: Slow catchup of 2PC (twophase) transactions on replica in LR

From: "Hayato Kuroda (Fujitsu)" <kuroda(dot)hayato(at)fujitsu(dot)com>
To: 'Peter Smith' <smithpb2250(at)gmail(dot)com>
Cc: "pgsql-hackers(at)lists(dot)postgresql(dot)org" <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Ajin Cherian <itsajin(at)gmail(dot)com>, Давыдов Виталий <v(dot)davydov(at)postgrespro(dot)ru>
Subject: RE: Slow catchup of 2PC (twophase) transactions on replica in LR
Date: 2024-05-09 08:54:36
Message-ID: OSBPR01MB2552D3618A0BAB99ECDDD589F5E62@OSBPR01MB2552.jpnprd01.prod.outlook.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Dear Peter,

> ======
> Commit message
>
> 1.
> IIUC there is quite a lot of subtlety and details about why the slot
> option needs to be changed only when altering "true" to "false", but
> not when altering "false" to "true".
>
> It also should explain why PreventInTransactionBlock is only needed
> when altering two_phase "true" to "false".
>
> Please include a commit message to describe all those tricky details.

Added.

> ======
> src/backend/commands/subscriptioncmds.c
>
> 2. AlterSubscription
>
> - PreventInTransactionBlock(isTopLevel, "ALTER SUBSCRIPTION ... SET
> (two_phase)");
> + if (!opts.twophase)
> + PreventInTransactionBlock(isTopLevel,
> + "ALTER SUBSCRIPTION ... SET (two_phase = off)");
>
> IMO this needs a comment to explain why PreventInTransactionBlock is
> only needed when changing the 'two_phase' option from on to off.

Added. Thoutht?

> 3. AlterSubscription
>
> /*
> * Try to acquire the connection necessary for altering slot.
> *
> * This has to be at the end because otherwise if there is an error while
> * doing the database operations we won't be able to rollback altered
> * slot.
> */
> if (replaces[Anum_pg_subscription_subfailover - 1] ||
> replaces[Anum_pg_subscription_subtwophasestate - 1])
> {
> bool must_use_password;
> char *err;
> WalReceiverConn *wrconn;
> bool failover_needs_to_be_updated;
> bool two_phase_needs_to_be_updated;
>
> /* Load the library providing us libpq calls. */
> load_file("libpqwalreceiver", false);
>
> /* Try to connect to the publisher. */
> must_use_password = sub->passwordrequired && !sub->ownersuperuser;
> wrconn = walrcv_connect(sub->conninfo, true, true, must_use_password,
> sub->name, &err);
> if (!wrconn)
> ereport(ERROR,
> (errcode(ERRCODE_CONNECTION_FAILURE),
> errmsg("could not connect to the publisher: %s", err)));
>
> /*
> * Consider which slot option must be altered.
> *
> * We must alter the failover option whenever subfailover is updated.
> * Two_phase, however, is altered only when changing true to false.
> */
> failover_needs_to_be_updated =
> replaces[Anum_pg_subscription_subfailover - 1];
> two_phase_needs_to_be_updated =
> (replaces[Anum_pg_subscription_subtwophasestate - 1] &&
> !opts.twophase);
>
> PG_TRY();
> {
> if (two_phase_needs_to_be_updated || failover_needs_to_be_updated)
> walrcv_alter_slot(wrconn, sub->slotname,
> failover_needs_to_be_updated ? &opts.failover : NULL,
> two_phase_needs_to_be_updated ? &opts.twophase : NULL);
> }
> PG_FINALLY();
> {
> walrcv_disconnect(wrconn);
> }
> PG_END_TRY();
> }
>
> 3a.
> The block comment "Consider which slot option must be altered..." says
> WHEN those options need to be updated, but it doesn't say WHY. e.g.
> why only update the 'two_phase' when it is being disabled but not when
> it is being enabled? In other words, I think there needs to be more
> background/reason details given in this comment.
>
> ~~~
>
> 3b.
> Can't those 2 new variable assignments be done up-front and guard this
> entire "if-block" instead of the current replaces[] guarding it? Then
> the code is somewhat simplified.
>
> SUGGESTION:
> /*
> * <improved comment here to explain these variables>
> */
> update_failover = replaces[Anum_pg_subscription_subfailover - 1];
> update_two_phase = (replaces[Anum_pg_subscription_subtwophasestate -
> 1] && !opts.twophase);
>
> /*
> * Try to acquire the connection necessary for altering slot.
> *
> * This has to be at the end because otherwise if there is an error while
> * doing the database operations we won't be able to rollback altered
> * slot.
> */
> if (update_failover || update_two_phase)
> {
> ...
>
> /* Load the library providing us libpq calls. */
> load_file("libpqwalreceiver", false);
>
> /* Try to connect to the publisher. */
> must_use_password = sub->passwordrequired && !sub->ownersuperuser;
> wrconn = walrcv_connect(sub->conninfo, true, true,
> must_use_password, sub->name, &err);
> if (!wrconn)
> ereport(ERROR, ...);
>
> PG_TRY();
> {
> walrcv_alter_slot(wrconn, sub->slotname,
> update_failover ? &opts.failover : NULL,
> update_two_phase ? &opts.twophase : NULL);
> }
> PG_FINALLY();
> {
> walrcv_disconnect(wrconn);
> }
> PG_END_TRY();
> }

Two variables were added and comments were updated.

> .../libpqwalreceiver/libpqwalreceiver.c
>
> 4.
> + appendStringInfo(&cmd, "ALTER_REPLICATION_SLOT %s ( ",
> + quote_identifier(slotname));
> +
> + if (failover)
> + appendStringInfo(&cmd, "FAILOVER %s ",
> + (*failover) ? "true" : "false");
> +
> + if (two_phase)
> + appendStringInfo(&cmd, "TWO_PHASE %s%s ",
> + (*two_phase) ? "true" : "false",
> + failover ? ", " : "");
> +
> + appendStringInfoString(&cmd, ");");
>
> 4a.
> IIUC the comma logic here was broken in v7 when you swapped the order.
> Anyway, IMO it will be better NOT to try combining that comma logic
> with the existing appendStringInfo. Doing it separately is both easier
> and less error-prone.
>
> Furthermore, the parentheses like "(*two_phase)" instead of just
> "*two_phase" seemed a bit overkill.
>
> SUGGESTION:
> + if (failover)
> + appendStringInfo(&cmd, "FAILOVER %s",
> + *failover ? "true" : "false");
> +
> + if (failover && two_phase)
> + appendStringInfo(&cmd, ", ");
> +
> + if (two_phase)
> + appendStringInfo(&cmd, "TWO_PHASE %s",
> + *two_phase ? "true" : "false");
> +
> + appendStringInfoString(&cmd, " );");

Fixed.

> 4b.
> Like I said above, IMO the current separator logic in v7 is broken. So
> it is a bit concerning the tests all passed anyway. How did that
> happen? I think this indicates that there needs to be an additional
> test scenario where both 'failover' and 'two_phase' get altered at the
> same time so this code gets exercised properly.

Right, it was added.

> ======
> src/test/subscription/t/099_twophase_added.pl
>
> 5.
> +# Define pre-existing tables on both nodes
>
> Why say they are "pre-existing"? They are not pre-existing because you
> are creating them right here!

Removed the word.

> 6.
> +######
> +# Check the case that prepared transactions exist on publisher node
> +######
>
> I think this needs a slightly more detailed comment.
>
> SUGGESTION (this is just an example, but you can surely improve it)
>
> # Check the case that prepared transactions exist on the publisher node.
> #
> # Since two_phase is "off", then normally this PREPARE will do nothing until
> # the COMMIT PREPARED, but in this test, we toggle the two_phase to "on" again
> # before the COMMIT PREPARED happens.

Changed with adjustments.

> 7.
> Maybe this test case needs a few more one-line comments for each of
> the sub-steps. e.g.:
>
> # prepare a transaction to insert some rows to the table
>
> # verify the prepared tx is not yet replicated to the subscriber
> (because 'two_phase = off')
>
> # toggle the two_phase to 'on' *before* the COMMIT PREPARED
>
> # verify the inserted rows got replicated ok

Modified like yours, but changed based on the suggestion by Grammarly.

> 8.
> IIUC this test will behave the same even if you DON'T do the toggle
> 'two_phase = on'. So I wonder is there something more you can do to
> test this scenario more convincingly?

I found an indicator. When the apply starts, it outputs the current status of
two_phase option. I added wait_for_log() to ensure below appeared. Thought?

```
ereport(DEBUG1,
(errmsg_internal("logical replication apply worker for subscription \"%s\" two_phase is %s",
MySubscription->name,
MySubscription->twophasestate == LOGICALREP_TWOPHASE_STATE_DISABLED ? "DISABLED" :
MySubscription->twophasestate == LOGICALREP_TWOPHASE_STATE_PENDING ? "PENDING" :
MySubscription->twophasestate == LOGICALREP_TWOPHASE_STATE_ENABLED ? "ENABLED" :
"?")));
```
Best Regards,
Hayato Kuroda
FUJITSU LIMITED
https://www.fujitsu.com/

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Hayato Kuroda (Fujitsu) 2024-05-09 09:10:53 RE: Slow catchup of 2PC (twophase) transactions on replica in LR
Previous Message Hayato Kuroda (Fujitsu) 2024-05-09 08:26:19 RE: Slow catchup of 2PC (twophase) transactions on replica in LR