Re: Transactions involving multiple postgres foreign servers, take 2

From: Masahiro Ikeda <ikedamsh(at)oss(dot)nttdata(dot)com>
To: Masahiko Sawada <masahiko(dot)sawada(at)2ndquadrant(dot)com>
Cc: Fujii Masao <masao(dot)fujii(at)oss(dot)nttdata(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Muhammad Usama <m(dot)usama(at)gmail(dot)com>, amul sul <sulamul(at)gmail(dot)com>, Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>, Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>, Álvaro Herrera <alvherre(at)2ndquadrant(dot)com>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, Ildar Musin <ildar(at)adjust(dot)com>, Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp>, Chris Travers <chris(dot)travers(at)adjust(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, "Tsunakawa, Takayuki" <tsunakawa(dot)takay(at)jp(dot)fujitsu(dot)com>, Tatsuo Ishii <ishii(at)sraoss(dot)co(dot)jp>
Subject: Re: Transactions involving multiple postgres foreign servers, take 2
Date: 2020-07-18 00:44:05
Message-ID: 8fddc794e9bf8f3153627e622c23e992@oss.nttdata.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On 2020-07-17 15:55, Masahiko Sawada wrote:
> On Fri, 17 Jul 2020 at 11:06, Masahiro Ikeda <ikedamsh(at)oss(dot)nttdata(dot)com>
> wrote:
>>
>> On 2020-07-16 13:16, Masahiko Sawada wrote:
>> > On Tue, 14 Jul 2020 at 17:24, Masahiro Ikeda <ikedamsh(at)oss(dot)nttdata(dot)com>
>> > wrote:
>> >>
>> >> > I've attached the latest version patches. I've incorporated the review
>> >> > comments I got so far and improved locking strategy.
>> >>
>> >> I want to ask a question about streaming replication with 2PC.
>> >> Are you going to support 2PC with streaming replication?
>> >>
>> >> I tried streaming replication using v23 patches.
>> >> I confirm that 2PC works with streaming replication,
>> >> which there are primary/standby coordinator.
>> >>
>> >> But, in my understanding, the WAL of "PREPARE" and
>> >> "COMMIT/ABORT PREPARED" can't be replicated to the standby server in
>> >> sync.
>> >>
>> >> If this is right, the unresolved transaction can be occurred.
>> >>
>> >> For example,
>> >>
>> >> 1. PREPARE is done
>> >> 2. crash primary before the WAL related to PREPARE is
>> >> replicated to the standby server
>> >> 3. promote standby server // but can't execute "ABORT PREPARED"
>> >>
>> >> In above case, the remote server has the unresolved transaction.
>> >> Can we solve this problem to support in-sync replication?
>> >>
>> >> But, I think some users use async replication for performance.
>> >> Do we need to document the limitation or make another solution?
>> >>
>> >
>> > IIUC with synchronous replication, we can guarantee that WAL records
>> > are written on both primary and replicas when the client got an
>> > acknowledgment of commit. We don't replicate each WAL records
>> > generated during transaction one by one in sync. In the case you
>> > described, the client will get an error due to the server crash.
>> > Therefore I think the user cannot expect WAL records generated so far
>> > has been replicated. The same issue could happen also when the user
>> > executes PREPARE TRANSACTION and the server crashes.
>>
>> Thanks! I didn't noticed the behavior when a user executes PREPARE
>> TRANSACTION is same.
>>
>> IIUC with 2PC, there is a different point between (1)PREPARE
>> TRANSACTION
>> and (2)2PC.
>> The point is that whether the client can know when the server crashed
>> and it's global tx id.
>>
>> If (1)PREPARE TRANSACTION is failed, it's ok the client execute same
>> command
>> because if the remote server is already prepared the command will be
>> ignored.
>>
>> But, if (2)2PC is failed with coordinator crash, the client can't know
>> what operations should be done.
>>
>> If the old coordinator already executed PREPARED, there are some
>> transaction which should be ABORT PREPARED.
>> But if the PREPARED WAL is not sent to the standby, the new
>> coordinator
>> can't execute ABORT PREPARED.
>> And the client can't know which remote servers have PREPARED
>> transactions which should be ABORTED either.
>>
>> Even if the client can know that, only the old coordinator knows its
>> global transaction id.
>> Only the database administrator can analyze the old coordinator's log
>> and then execute the appropriate commands manually, right?
>
> I think that's right. In the case of the coordinator crash, the user
> can look orphaned foreign prepared transactions by checking the
> 'identifier' column of pg_foreign_xacts on the new standby server and
> the prepared transactions on the remote servers.

I think there is a case we can't check orphaned foreign
prepared transaction in pg_foreign_xacts view on the new standby server.
It confuses users and database administrators.

If the primary coordinator crashes after preparing foreign transaction,
but before sending XLOG_FDWXACT_INSERT records to the standby server,
the standby server can't restore their transaction status and
pg_foreign_xacts view doesn't show the prepared foreign transactions.

To send XLOG_FDWXACT_INSERT records asynchronously leads this problem.

>> > To prevent this
>> > issue, I think we would need to send each WAL records in sync but I'm
>> > not sure it's reasonable behavior, and as long as we write WAL in the
>> > local and then send it to replicas we would need a smart mechanism to
>> > prevent this situation.
>>
>> I agree. To send each 2PC WAL records in sync must be with a large
>> performance impact.
>> At least, we need to document the limitation and how to handle this
>> situation.
>
> Ok. I'll add it.

Thanks a lot.

>> > Related to the pointing out by Ikeda-san, I realized that with the
>> > current patch the backend waits for synchronous replication and then
>> > waits for foreign transaction resolution. But it should be reversed.
>> > Otherwise, it could lead to data loss even when the client got an
>> > acknowledgment of commit. Also, when the user is using both atomic
>> > commit and synchronous replication and wants to cancel waiting, he/she
>> > will need to press ctl-c twice with the current patch, which also
>> > should be fixed.
>>
>> I'm sorry that I can't understood.
>>
>> In my understanding, if COMMIT WAL is replicated to the standby in
>> sync,
>> the standby server can resolve the transaction after crash recovery in
>> promoted phase.
>>
>> If reversed, there are some situation which can't guarantee atomic
>> commit.
>> In case that some foreign transaction resolutions are succeed but
>> others
>> are failed(and COMMIT WAL is not replicated),
>> the standby must ABORT PREPARED because the COMMIT WAL is not
>> replicated.
>> This means that some foreign transactions are COMMITE PREPARED
>> executed
>> by primary coordinator,
>> other foreign transactions can be ABORT PREPARED executed by secondary
>> coordinator.
>
> You're right. Thank you for pointing out!
>
> If the coordinator crashes after the client gets acknowledgment of the
> successful commit of the transaction but before sending
> XLOG_FDWXACT_REMOVE record to the replicas, the FdwXact entries are
> left on the replicas even after failover. But since we require FDW to
> tolerate the error of undefined prepared transactions in
> COMMIT/ROLLBACK PREPARED it won’t be a critical problem.

I agree. It's ok that the primary coordinator sends
XLOG_FDWXACT_REMOVE records asynchronously.

Regards,
--
Masahiro Ikeda
NTT DATA CORPORATION

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Peter Geoghegan 2020-07-18 00:53:00 Re: Using Valgrind to detect faulty buffer accesses (no pin or buffer content lock held)
Previous Message Jeff Davis 2020-07-18 00:13:15 Re: Default setting for enable_hashagg_disk