Re: Question concerning XTM (eXtensible Transaction Manager API)

From: Kevin Grittner <kgrittn(at)ymail(dot)com>
To: konstantin knizhnik <k(dot)knizhnik(at)postgrespro(dot)ru>, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
Cc: Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Question concerning XTM (eXtensible Transaction Manager API)
Date: 2015-11-17 15:27:34
Message-ID: 1945168568.5065550.1447774054158.JavaMail.yahoo@mail.yahoo.com
Lists: pgsql-hackers

On Tuesday, November 17, 2015 12:43 AM, konstantin knizhnik <k(dot)knizhnik(at)postgrespro(dot)ru> wrote:
> On Nov 16, 2015, at 11:21 PM, Kevin Grittner wrote:

>> If you are saying that DTM tries to roll back a transaction after
>> any participating server has entered the RecordTransactionCommit()
>> critical section, then IMO it is broken.  Full stop.  That can't
>> work with any reasonable semantics as far as I can see.
>
> DTM is not trying to roll back a committed transaction.
> What it tries to do is to hide this commit.
> As I already wrote, the idea was to implement "lightweight" 2PC
> because the prepared transactions mechanism in PostgreSQL adds too
> much overhead and causes some problems with recovery.

The point remains that there must be *some* "point of no return"
beyond which rollback (or "hiding") is not possible.  Until that
point, all heavyweight locks held by the transaction must be
maintained without interruption, data modifications made by the
transaction must not be visible, and any attempt to update or
delete data updated or deleted by the transaction must block or
throw an error.  It sounds like you are attempting to move where
this "point of no return" falls, but it isn't as clear as I would
like.  It seems like all participating nodes are responsible for
notifying the arbiter that they have completed, and until then
the arbiter gets involved in every visibility check, overriding
the "normal" value?

> The transaction is normally committed in xlog, so that it can
> always be recovered in case of node fault.
> But before setting the corresponding bit(s) in CLOG and releasing
> locks we first contact the arbiter to get the global status of the
> transaction.
> If it is successfully locally committed by all nodes, then the
> arbiter approves the commit and the commit of the transaction
> completes normally.
> Otherwise the arbiter rejects the commit. In this case DTM marks
> the transaction as aborted in CLOG and returns an error to the
> client.
> XLOG is not changed and in case of failure PostgreSQL will try to
> replay this transaction.
> But during recovery it also tries to restore the transaction
> status in CLOG.
> And at this place DTM contacts the arbiter to learn the status of
> the transaction.
> If it is marked as aborted in the arbiter's CLOG, then it will
> also be marked as aborted in the local CLOG.
> And according to PostgreSQL visibility rules no other transaction
> will see changes made by this transaction.
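
If I'm reading that right, the local commit path is roughly the
following.  This is just a sketch to check my understanding -- the
helper names are made up, not your code:

/* Sketch of the commit sequence as I understand it -- hypothetical
 * helper names, not the actual DTM code. */
#include <stdbool.h>
#include <stdio.h>

static bool WriteCommitRecordToXlog(void) { puts("commit in WAL"); return true; }
static bool ArbiterApprovesCommit(void)   { return true; }  /* global vote */
static void SetClogStatus(const char *st) { printf("CLOG: %s\n", st); }
static void ReleaseLocks(void)            { puts("locks released"); }

static bool
CommitDistributedTransaction(void)
{
    /* 1. Commit is made durable locally, exactly as today. */
    if (!WriteCommitRecordToXlog())
        return false;

    /* 2. Before touching CLOG or releasing locks, ask the arbiter. */
    if (ArbiterApprovesCommit())
    {
        SetClogStatus("committed");
        ReleaseLocks();
        return true;
    }

    /* 3. Arbiter said no: WAL keeps the record, but CLOG marks abort,
     *    so the visibility rules keep the changes hidden. */
    SetClogStatus("aborted");
    ReleaseLocks();
    return false;           /* report an error to the client */
}

int
main(void)
{
    return CommitDistributedTransaction() ? 0 : 1;
}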

If a node goes through crash and recovery after it has written its
commit information to xlog, how are its heavyweight locks, etc.,
maintained throughout?  For example, does each arbiter node have
the complete set of heavyweight locks?  (Basically, all the
information which can be written to files in pg_twophase must be
held somewhere by all arbiter nodes, and used where appropriate.)

If a participating node is lost after some other nodes have told
the arbiter that they have committed, and the lost node will never
be able to indicate that it is committed or rolled back, what is
the mechanism for resolving that?

>>> We can not just call elog(ERROR,...) in the SetTransactionStatus
>>> implementation because inside a critical section it causes a
>>> Postgres crash with a panic message. So we have to remember that
>>> the transaction is rejected and report the error later, after
>>> exit from the critical section:
>>
>> I don't believe that is a good plan.  You should not enter the
>> critical section for recording that a commit is complete until all
>> the work for the commit is done except for telling all the servers
>> that all servers are ready.
>
> It is a good point.
> Maybe it is the reason for the performance scalability problems we
> have noticed with DTM.

Well, certainly the first phase of two-phase commit can take place
in parallel, and once that is complete then the second phase
(commit or rollback of all the participating prepared transactions)
can take place in parallel.  There is no need to serialize that.
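
From the coordinator's side there is nothing that forces either
phase to be serialized; with plain libpq it can look roughly like
this (the connection strings, the table, and the GID are
placeholders, and real code would obviously need error handling on
the connections):

/* Sketch: driving both phases of 2PC in parallel with libpq. */
#include <stdio.h>
#include <libpq-fe.h>

#define NNODES 3

static int
send_to_all(PGconn *conns[], const char *sql)
{
    int ok = 1;

    for (int i = 0; i < NNODES; i++)    /* fire the command everywhere */
        PQsendQuery(conns[i], sql);

    for (int i = 0; i < NNODES; i++)    /* then collect the results */
    {
        PGresult *res;

        while ((res = PQgetResult(conns[i])) != NULL)
        {
            if (PQresultStatus(res) != PGRES_COMMAND_OK)
                ok = 0;
            PQclear(res);
        }
    }
    return ok;
}

int
main(void)
{
    const char *conninfo[NNODES] = { "host=node1", "host=node2", "host=node3" };
    PGconn *conns[NNODES];

    for (int i = 0; i < NNODES; i++)
        conns[i] = PQconnectdb(conninfo[i]);

    send_to_all(conns, "BEGIN");
    send_to_all(conns, "INSERT INTO t VALUES (1)");

    /* Phase 1: all prepares issued in parallel. */
    if (send_to_all(conns, "PREPARE TRANSACTION 'gtx1'"))
        send_to_all(conns, "COMMIT PREPARED 'gtx1'");    /* Phase 2, also parallel */
    else
        send_to_all(conns, "ROLLBACK PREPARED 'gtx1'");

    for (int i = 0; i < NNODES; i++)
        PQfinish(conns[i]);
    return 0;
}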

> Sorry, some clarification.
> We got a 10x performance slowdown caused by 2PC under very heavy
> load on the IBM system with 256 cores.
> On "normal" servers the slowdown from 2PC is smaller, about 2x.

That suggests some contention point, probably on spinlocks.  Were
you able to identify the particular hot spot(s)?

On Tuesday, November 17, 2015 3:09 AM, konstantin knizhnik <k(dot)knizhnik(at)postgrespro(dot)ru> wrote:
> On Nov 17, 2015, at 10:44 AM, Amit Kapila wrote:

>> I think the general idea is that if the commit is WAL-logged, then
>> the operation is considered committed on the local node, and commit
>> should happen on any node only once prepare from all nodes is
>> successful.  And after that the transaction is not supposed to
>> abort.  But I think you are trying to optimize the DTM in some way
>> to not follow that kind of protocol.
>
> DTM is still following the 2PC protocol:
> First the transaction is saved in WAL at all nodes, and only after
> that is the commit completed at all nodes.

So, essentially you are treating the traditional commit point as
phase 1 in a new approach to two-phase commit, and adding another
layer to override normal visibility checking and record locks
(etc.) past that point?

> We try to avoid maintaining separate log files for 2PC (as is done
> now for prepared transactions) and do not want to change the logic
> of working with WAL.
>
> The DTM approach is based on the assumption that PostgreSQL's CLOG
> and visibility rules allow us to "hide" a transaction even if it is
> committed in WAL.

I see where you could get a performance benefit from not recording
(and cleaning up) persistent state for a transaction in the
pg_twophase directory between the time the transaction is prepared
and when it is committed (which should normally be a very short
period of time, but must survive crashes and communication
failures).  Essentially you are trying to keep that in RAM instead,
and counting on multiple processes at different locations
redundantly (and synchronously) storing this data to ensure
persistence, rather than writing the data to disk files which are
deleted as soon as the prepared transaction is committed or rolled
back.

I wonder whether it might not be safer to just do that -- rather
than trying to develop a whole new way of implementing two-phase
commit, just come up with a new way to persist the information
which must survive between the prepare and the later commit or
rollback of the prepared transaction.  Essentially, provide hooks
for persisting the data when preparing a transaction, and the
arbiter would set those hooks to functions which send the data to
it.  Likewise with the release of the information (normally a very
small fraction of a second later).  The rest of the arbiter code
becomes a distributed transaction manager.  It's not a trivial job
to get that right, but at least it is a very well-understood
problem, and it is not likely to take as long to develop and to
shake out tricky data-eating bugs.
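
In other words, something shaped roughly like the following.  The
names and structure are mine, not a concrete API proposal: the
arbiter supplies the callbacks and the in-core code stops touching
pg_twophase files.  The in-memory "arbiter" here is only a stand-in
to keep the sketch self-contained:

/* Sketch of hook points for persisting prepared-transaction state
 * somewhere other than pg_twophase -- hypothetical, not an existing
 * API. */
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

typedef struct TwoPhaseStateHooks
{
    /* called when a transaction is prepared, instead of writing a file */
    bool (*persist)(const char *gid, const void *state, size_t len);
    /* called at COMMIT/ROLLBACK PREPARED or during recovery */
    bool (*fetch)(const char *gid, void *buf, size_t buflen, size_t *len);
    /* called once the prepared transaction is finished */
    bool (*release)(const char *gid);
} TwoPhaseStateHooks;

/* A trivial in-memory stand-in for "send it to the arbiter". */
static char stored_gid[64];
static char stored_state[256];
static size_t stored_len;

static bool
arbiter_persist(const char *gid, const void *state, size_t len)
{
    if (len > sizeof(stored_state))
        return false;
    snprintf(stored_gid, sizeof(stored_gid), "%s", gid);
    memcpy(stored_state, state, len);
    stored_len = len;
    return true;
}

static bool
arbiter_fetch(const char *gid, void *buf, size_t buflen, size_t *len)
{
    if (strcmp(gid, stored_gid) != 0 || stored_len > buflen)
        return false;
    memcpy(buf, stored_state, stored_len);
    *len = stored_len;
    return true;
}

static bool
arbiter_release(const char *gid)
{
    (void) gid;
    stored_len = 0;
    return true;
}

static const TwoPhaseStateHooks arbiter_hooks = {
    arbiter_persist, arbiter_fetch, arbiter_release
};

int
main(void)
{
    char buf[256];
    size_t len;

    arbiter_hooks.persist("gtx1", "lock and xact state", 20);
    if (arbiter_hooks.fetch("gtx1", buf, sizeof(buf), &len))
        printf("recovered %zu bytes for gtx1\n", len);
    arbiter_hooks.release("gtx1");
    return 0;
}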

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
