RE: Slow catchup of 2PC (twophase) transactions on replica in LR

From: "Hayato Kuroda (Fujitsu)" <kuroda(dot)hayato(at)fujitsu(dot)com>
To: 'Amit Kapila' <amit(dot)kapila16(at)gmail(dot)com>
Cc: Vitaly Davydov <v(dot)davydov(at)postgrespro(dot)ru>, Ajin Cherian <itsajin(at)gmail(dot)com>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, "pgsql-hackers(at)lists(dot)postgresql(dot)org" <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: RE: Slow catchup of 2PC (twophase) transactions on replica in LR
Date: 2024-07-09 11:49:38
Message-ID: OSBPR01MB2552961D068087129E97048CF5DB2@OSBPR01MB2552.jpnprd01.prod.outlook.com
Lists: pgsql-hackers

Dear Amit,

> > I see that in 0003/0004, the patch first aborts pending prepared
> > transactions, update's catalog, and then change slot's property via
> > walrcv_alter_slot. What if there is any ERROR (say the remote node is
> > not reachable or there is an error while updating the catalog) after
> > we abort the pending prepared transaction? Won't we end up with lost
> > prepared transactions in such a case?

Yes, this can happen with v16, because FinishPreparedTransaction() itself is not
a transactional operation. In the example below, the subscription was altered after
stopping the publisher. You can see that the prepared transactions were rolled back
even though the command failed.

```
subscriber=# SELECT gid FROM pg_prepared_xacts ;
gid
------------------
pg_gid_16390_741
pg_gid_16390_742
(2 rows)
subscriber=# ALTER SUBSCRIPTION sub SET (TWO_PHASE = off, FORCE_ALTER = on);
NOTICE: requested altering to two_phase = false but there are prepared transactions done by the subscription
DETAIL: Such transactions are being rollbacked.
ERROR: could not connect to the publisher: connection to server on socket "/tmp/.s.PGSQL.5431" failed: No such file or directory
Is the server running locally and accepting connections on that socket?
subscriber=# SELECT gid FROM pg_prepared_xacts ;
gid
-----
(0 rows)
```

> Considering the above is a problem the other possibility I thought of
> is to change the order like abort prepared xacts after slot update.
> That is also dangerous because any failure while aborting could make a
> slot change permanent whereas the subscription option will still be
> old value. Now, because the slot's two_phase property is off, at
> commit, it can resend the entire transaction which can create a
> problem because the corresponding prepared transaction will already be
> present.

I feel it is a rare case, but still possible, e.g., due to a race condition on
TwoPhaseStateLock, OOM, disk failures, and so on.
And since prepared transactions hold locks, the duplicated arrival of a transaction
may cause table-lock failures.

> One more thing to think about in this regard is what if we fail after
> aborting a few prepared transactions and not all?

It's a bit hard to emulate, but I imagine some of the prepared transactions would remain.

> At this stage, I am not able to think of a good solution for these
> problems. So, if we don't get a solution for these, we can document
> that users can first manually abort prepared transactions and then
> switch off the two_phase option using Alter Subscription command.

I'm also not sure what we should do. Ideally, it would be nice to make
FinishPreparedTransaction() transactional, but I'm not sure that is realistic. So
the changes for aborting prepared transactions have been removed, and a
documentation patch has been added instead.
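
For reference, the manual procedure that the documentation would describe might look roughly like the following on the subscriber. This is only a sketch under several assumptions: the subscription is named sub and the GIDs are the ones shown in the example above, the two_phase option is alterable as proposed by this patch set (it is not part of a released PostgreSQL at the time of this message), and disabling the subscription around the change is my assumption rather than something stated in this mail.

```sql
-- Assumption: stop the apply worker first so that no new prepared
-- transactions arrive while we clean up.
ALTER SUBSCRIPTION sub DISABLE;

-- List the prepared transactions created by the subscription.
-- The 'pg_gid_' prefix matches the GIDs shown in the example above.
SELECT gid FROM pg_prepared_xacts WHERE gid LIKE 'pg_gid_%';

-- Roll back each listed transaction by hand. ROLLBACK PREPARED cannot
-- run inside a transaction block, so issue each one separately.
ROLLBACK PREPARED 'pg_gid_16390_741';
ROLLBACK PREPARED 'pg_gid_16390_742';

-- Only after no such transactions remain, switch the option off.
-- Syntax as proposed by the patch set under discussion.
ALTER SUBSCRIPTION sub SET (two_phase = off);

ALTER SUBSCRIPTION sub ENABLE;
```

Because the rollback is done before the subscription is altered, a failure in the later ALTER SUBSCRIPTION step leaves the user in a known state: the prepared transactions are already gone by the user's own explicit action, not as a hidden side effect of a failed command.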

Here is a summary of the updates to the patches. The patch that drops prepared
transactions has been removed for now.

0001 - The code for SUBOPT_TWOPHASE_COMMIT is moved as requested in [1].
Also, the checks for failover and two_phase are unified into one function.
0002 - Updated accordingly. An argument is added to the check function.
0003 - Contains the documentation changes requested in [2].

[1]: https://www.postgresql.org/message-id/CAA4eK1%2BFRrL_fLWLsWQGHZRESg39ixzDX_S9hU8D7aFtU%2Ba8uQ%40mail.gmail.com
[2]: https://www.postgresql.org/message-id/CAA4eK1Khy_YWFoQ1HOF_tGtiixD8YoTg86coX1-ckxt8vK3U%3DQ%40mail.gmail.com

Best Regards,
Hayato Kuroda
FUJITSU LIMITED
https://www.fujitsu.com/

Attachment Content-Type Size
v17-0001-Allow-altering-of-two_phase-option-of-a-SUBSCRIP.patch application/octet-stream 30.2 KB
v17-0002-Alter-slot-option-two_phase-only-when-altering-t.patch application/octet-stream 13.7 KB
v17-0003-Notify-users-to-roll-back-prepared-transactions.patch application/octet-stream 2.2 KB
