From: | "tsunakawa(dot)takay(at)fujitsu(dot)com" <tsunakawa(dot)takay(at)fujitsu(dot)com> |
---|---|
To: | Fujii Masao <masao(dot)fujii(at)oss(dot)nttdata(dot)com> |
Cc: | 'Amit Kapila' <amit(dot)kapila16(at)gmail(dot)com>, Masahiko Sawada <masahiko(dot)sawada(at)2ndquadrant(dot)com>, Muhammad Usama <m(dot)usama(at)gmail(dot)com>, Masahiro Ikeda <ikedamsh(at)oss(dot)nttdata(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, amul sul <sulamul(at)gmail(dot)com>, Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>, Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>, Álvaro Herrera <alvherre(at)2ndquadrant(dot)com>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, Ildar Musin <ildar(at)adjust(dot)com>, Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp>, Chris Travers <chris(dot)travers(at)adjust(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Tatsuo Ishii <ishii(at)sraoss(dot)co(dot)jp> |
Subject: | RE: Transactions involving multiple postgres foreign servers, take 2 |
Date: | 2020-09-08 04:00:45 |
Message-ID: | TYAPR01MB29906E7F30E4C6FE6843998EFE290@TYAPR01MB2990.jpnprd01.prod.outlook.com |
Lists: | pgsql-hackers |
From: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
> I intend to say that the global-visibility work can impact this in a
> major way and we have analyzed that to some extent during a discussion
> on the other thread. So, I think without having a complete
> design/solution that addresses both the 2PC and global-visibility, it
> is not apparent what is the right way to proceed. It seems to me that
> rather than working on individual (or smaller) parts one needs to come
> up with a bigger picture (or overall design) and then once we have
> figured that out correctly, it would be easier to decide which parts
> can go first.
I'm really sorry that I keep getting later and later (times ten) in publishing the revised scale-out design wiki for discussing the big picture! I don't know why it's taking me so long; I feel as if I were held captive in a time prison (of course, nobody is holding me captive; I'm just late). Please wait a few days.
Meanwhile, so that development can proceed, let me comment on atomic commit and global visibility.
* We need to hear from Andrey about their investigation into whether Clock-SI may be covered by a Microsoft patent, and whether we can avoid it.
* I have a feeling that we can adopt the algorithm used by Spanner, CockroachDB, and YugabyteDB. That is, 2PC for multi-node atomic commit, Paxos or Raft for replica synchronization (during commit) to make 2PC more highly available, and timestamp-based global visibility. However, with the timestamp-based approach, a database instance shuts down when its node's clock drifts too far from the clocks of the other nodes.
* Alternatively, we might use Commitment Ordering (CO), described at the link below, which doesn't require timestamps or any other information to be transferred among the cluster nodes. However, it seems to require tracking the order of read and write operations among concurrent transactions to ensure the correct commit order, so I'm not sure about its performance. The MVCO paper seems to present the information we need, but I haven't fully understood it yet (it's difficult). Could anyone kindly interpret it?
Commitment ordering (CO) - yoavraz2
https://sites.google.com/site/yoavraz2/the_principle_of_co
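To make the "correct commit order" condition concrete, here is a minimal sketch of the basic CO rule as it is commonly stated (not the MVCO variant, and not anything from the patch): if an operation of T1 precedes a conflicting operation of T2 (same item, at least one write), then T1 must commit before T2. The history encoding and the function names are hypothetical, chosen only for illustration.

```python
from itertools import combinations

# Hypothetical encoding: a history is a list of (txn, op, item) tuples in
# execution order, where op is "r" (read) or "w" (write).


def conflicts(a, b):
    """Two operations conflict if they belong to different transactions,
    touch the same item, and at least one of them is a write."""
    (ta, oa, xa), (tb, ob, xb) = a, b
    return ta != tb and xa == xb and "w" in (oa, ob)


def precedence_edges(history):
    """Return the (T1, T2) pairs such that an operation of T1 precedes a
    conflicting operation of T2 in the history."""
    edges = set()
    for i, j in combinations(range(len(history)), 2):  # always i < j
        if conflicts(history[i], history[j]):
            edges.add((history[i][0], history[j][0]))
    return edges


def is_commitment_ordered(history, commit_order):
    """CO condition: for every conflict edge T1 -> T2 between committed
    transactions, T1 must appear before T2 in the commit order."""
    position = {t: k for k, t in enumerate(commit_order)}
    return all(
        position[t1] < position[t2]
        for t1, t2 in precedence_edges(history)
        if t1 in position and t2 in position
    )


history = [("T1", "w", "x"), ("T2", "r", "x"), ("T2", "w", "y"), ("T3", "r", "y")]
print(is_commitment_ordered(history, ["T1", "T2", "T3"]))  # True
print(is_commitment_ordered(history, ["T2", "T1", "T3"]))  # False
```

This check is offline; a real scheduler would have to delay or abort commits on the fly to keep the order, which is where the bookkeeping and performance concern comes from. As the linked CO page describes it, enforcing CO locally on each node, combined with an atomic commit protocol such as 2PC, is what gives global serializability without exchanging timestamps.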
As for Sawada-san's 2PC patch, which I find interesting purely as an FDW enhancement, I have raised the following issues to be addressed:
1. Make the FDW API implementable by FDWs other than postgres_fdw (as Amit-san kindly pointed out). I think oracle_fdw and jdbc_fdw would be good examples to consider, whereas MySQL may not be, because it exposes the XA feature as SQL statements rather than as the C functions defined in the XA specification.
2. 2PC processing is queued and serialized in a single background worker, which severely limits transaction throughput. Each backend should perform 2PC itself.
3. postgres_fdw cannot detect remote updates when a UDF executed on a remote node updates data.
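Points 1 and 2 can be illustrated together with a minimal 2PC sketch, assuming a hypothetical per-server participant interface (the class and method names are illustrative and are not PostgreSQL's FDW API or the patch's design). A postgres_fdw-style participant could map these methods onto PREPARE TRANSACTION, COMMIT PREPARED, and ROLLBACK PREPARED, and an XA-based FDW onto xa_prepare, xa_commit, and xa_rollback. The coordinator runs in the calling backend instead of in a shared worker queue.

```python
from abc import ABC, abstractmethod


class TwoPhaseParticipant(ABC):
    """Hypothetical per-foreign-server interface (names are illustrative)."""

    @abstractmethod
    def prepare(self, gid: str) -> bool:
        """Phase 1: persist the local transaction and vote; True means yes.
        A participant voting no is assumed to roll back its own work."""

    @abstractmethod
    def commit_prepared(self, gid: str) -> None:
        """Phase 2 after a commit decision."""

    @abstractmethod
    def rollback_prepared(self, gid: str) -> None:
        """Phase 2 after an abort decision, for already-prepared work."""

    @abstractmethod
    def rollback(self) -> None:
        """Abort a local transaction that was never prepared."""


def two_phase_commit(participants, gid):
    """Drive 2PC from the calling backend. Returns True if committed."""
    participants = list(participants)
    for i, p in enumerate(participants):
        if not p.prepare(gid):
            # A single no vote aborts the global transaction everywhere.
            for q in participants[:i]:
                q.rollback_prepared(gid)
            for q in participants[i + 1:]:
                q.rollback()
            return False
    # All voted yes. A real coordinator must durably record the commit
    # decision here, before phase 2, so in-doubt transactions can be
    # resolved after a crash (omitted in this sketch).
    for p in participants:
        p.commit_prepared(gid)
    return True


class InMemoryServer(TwoPhaseParticipant):
    """Toy participant that only records state transitions."""

    def __init__(self, name, vote_yes=True):
        self.name, self.vote_yes, self.state = name, vote_yes, "active"

    def prepare(self, gid):
        self.state = "prepared" if self.vote_yes else "aborted"
        return self.vote_yes

    def commit_prepared(self, gid):
        self.state = "committed"

    def rollback_prepared(self, gid):
        self.state = "aborted"

    def rollback(self):
        self.state = "aborted"


servers = [InMemoryServer("s1"), InMemoryServer("s2", vote_yes=False)]
print(two_phase_commit(servers, "gx1"), [s.state for s in servers])
```

Because each backend drives its own prepare and commit rounds, independent transactions do not wait on one another in a single queue; only the durable decision record would need coordination.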
Regards
Takayuki Tsunakawa