Re: eXtensible Transaction Manager API

From: Michael Paquier <michael(dot)paquier(at)gmail(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Simon Riggs <simon(at)2ndquadrant(dot)com>, Konstantin Knizhnik <k(dot)knizhnik(at)postgrespro(dot)ru>, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: eXtensible Transaction Manager API
Date: 2015-11-13 13:35:18
Message-ID: CAB7nPqQJLP=GdTZhAhk0ca3CQ3uExGOUQJcC4Fx_wKJqHG3UhA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Tue, Nov 10, 2015 at 3:46 AM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> On Sun, Nov 8, 2015 at 6:35 PM, Michael Paquier
> <michael(dot)paquier(at)gmail(dot)com> wrote:
>> Sure. Now imagine that the pg_twophase entry is corrupted for this
>> transaction on one node. This would trigger a PANIC on it, and
>> transaction would not be committed everywhere.
>
> If the database is corrupted, there's no way to guarantee that
> anything works as planned. This is like saying that criticizing
> somebody's disaster recovery plan on the basis that it will be
> inadequate if the entire planet earth is destroyed.

As well as there could be FS, OS, network problems... To come back to
the point, my point is simply that I found surprising the sentence of
Konstantin upthread saying that if commit fails on some of the nodes
we should rollback the prepared transaction on all nodes. In the
example given, in the phase after calling dtm_end_prepare, say we
perform COMMIT PREPARED correctly on node 1, but then failed it on
node 2 because a meteor has hit a server, it seems that we cannot
rollback, instead we had better rolling in a backup and be sure that
the transaction gets committed. How would you rollback the transaction
already committed on node 1? But perhaps I missed something...

>> I am aware of the fact
>> that by definition PREPARE TRANSACTION ensures that a transaction will
>> be committed with COMMIT PREPARED, just trying to see any corner cases
>> with the approach proposed. The DTM approach is actually rather close
>> to what a GTM in Postgres-XC does :)
>
> Yes. I think that we should try to learn as much as possible from the
> XC experience, but that doesn't mean we should incorporate XC's fuzzy
> thinking about 2PC into PG. We should not.

Yep.

> One point I'd like to mention is that it's absolutely critical to
> design this in a way that minimizes network roundtrips without
> compromising correctness. XC's GTM proxy suggests that they failed to
> do that. I think we really need to look at what's going to be on the
> other sides of the proposed APIs and think about whether it's going to
> be possible to have a strong local caching layer that keeps network
> roundtrips to a minimum. We should consider whether the need for such
> a caching layer has any impact on what the APIs should look like.

At this time, the number of round trips needed particularly for READ
COMMITTED transactions that need a new snapshot for each query was
really a performance killer. We used DBT-1 (TPC-W) which is less
OLTP-like than DBT-2 (TPC-C), still with DBT-1 the scalability limit
was quickly reached with 10-20 nodes..

> For example, consider a 10-node cluster where each node has 32 cores
> and 32 clients, and each client is running lots of short-running SQL
> statements. The demand for snapshots will be intense. If every
> backend separately requests a snapshot for every SQL statement from
> the coordinator, that's probably going to be terrible. We can make it
> the problem of the stuff behind the DTM API to figure out a way to
> avoid that, but maybe that's going to result in every DTM needing to
> solve the same problems.

This recalls a couple of things, though in 2009 I was not playing with
servers of this scale.

> Another thing that I think we need to consider is fault-tolerance.
> For example, suppose that transaction commit and snapshot coordination
> services are being provided by a central server which keeps track of
> the global commit ordering. When that server gets hit by a freak bold
> of lightning and melted into a heap of slag, somebody else needs to
> take over. Again, this would live below the API proposed here, but I
> think it really deserves some thought before we get too far down the
> path. XC didn't start thinking about how to add fault-tolerance until
> quite late in the project, I think, and the same could be said of
> PostgreSQL itself: some newer systems have easier-to-use fault
> tolerance mechanisms because it was designed in from the beginning.
> Distributed systems by nature need high availability to a far greater
> degree than single systems, because when there are more nodes, node
> failures are more frequent.

Your memories on the matter are right. In the case of XC, the SPOF
that is GTM has been somewhat made more stable with the creation of a
GTM standby, though it did not solve the scalability limit of having a
centralized snapshot facility. It actually increased the load on the
whole system because for short transactions. Alea jacta est.
--
Michael

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Amit Kapila 2015-11-13 13:38:53 Re: Parallel Seq Scan
Previous Message Amit Kapila 2015-11-13 13:34:01 Re: Parallel Seq Scan