Re: eXtensible Transaction Manager API

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Bruce Momjian <bruce(at)momjian(dot)us>
Cc: Michael Paquier <michael(dot)paquier(at)gmail(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>, Konstantin Knizhnik <k(dot)knizhnik(at)postgrespro(dot)ru>, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: eXtensible Transaction Manager API
Date: 2015-12-03 22:41:49
Message-ID: CA+TgmoZXypBBMUAQ9Mj6Y9mcfkvZhVbLWmW17kHhFZBEvar=nA@mail.gmail.com
Lists: pgsql-hackers

On Tue, Dec 1, 2015 at 9:19 AM, Bruce Momjian <bruce(at)momjian(dot)us> wrote:
> On Tue, Nov 17, 2015 at 12:48:38PM -0500, Robert Haas wrote:
>> > At this time, the number of round trips needed particularly for READ
>> > COMMITTED transactions that need a new snapshot for each query was
>> > really a performance killer. We used DBT-1 (TPC-W), which is less
>> > OLTP-like than DBT-2 (TPC-C); still, with DBT-1 the scalability limit
>> > was quickly reached with 10-20 nodes.
>>
>> Yeah. I think this merits a good bit of thought. Superficially, at
>> least, it seems that every time you need a snapshot - which in the
>> case of READ COMMITTED is for every SQL statement - you need a network
>> roundtrip to the snapshot server. If multiple backends request a
>> snapshot in very quick succession, you might be able to do a sort of
>> "group commit" thing where you send a single request to the server and
>> they all use the resulting snapshot, but it seems hard to get very far
>> with such an optimization. For example, if backend 1 sends a snapshot
>> request and backend 2 then realizes that it also needs a snapshot, it
>> can't just wait for the reply from backend 1 and use that one. The
>> user might have committed a transaction someplace else and then kicked
>> off a transaction on backend 2 afterward, expecting it to see the work
>> committed earlier. But the snapshot returned to backend 1 might have
>> been taken before that. So, all in all, this seems rather crippling.
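
(To make that timing hazard concrete: a minimal sketch, not PostgreSQL
code and with all names hypothetical, of the guard a request-batching
scheme would need before letting a backend share an in-flight snapshot
request.)

#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef struct SnapshotRequest
{
    uint64_t    sent_at;            /* monotonic time the request left */
    bool        reply_received;     /* has the server answered yet? */
    /* ... snapshot payload would follow ... */
} SnapshotRequest;

/*
 * A backend whose need for a snapshot arose at 'needed_at' may share an
 * in-flight request only if that request was sent at or after that
 * moment; otherwise the returned snapshot might predate a commit the
 * backend is entitled to see, which is exactly the hazard above.
 */
static bool
can_piggyback(const SnapshotRequest *inflight, uint64_t needed_at)
{
    return inflight != NULL &&
        !inflight->reply_received &&
        inflight->sent_at >= needed_at;
}
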
>>
>> Things are better if the system has a single coordinator node that is
>> also the arbiter of commits and snapshots. Then, it can always take a
>> snapshot locally with no network roundtrip, and when it reaches out to
>> a shard, it can pass along the snapshot information with the SQL query
>> (or query plan) it has to send anyway. But then the single
>> coordinator seems like it becomes a bottleneck. As soon as you have
>> multiple coordinators, one of them has got to be the arbiter of global
>> ordering, and now all of the other coordinators have to talk to it
>> constantly.
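
(Again not real code: a hypothetical wire format, though the
xmin/xmax/xip fields mirror how PostgreSQL actually represents a
snapshot. The point is that the snapshot rides along with the query,
so the shard learns what it may see without any extra round trip.)

#include <stdint.h>

typedef struct ShardRequest
{
    uint32_t    snapshot_xmin;  /* xids < xmin are settled everywhere */
    uint32_t    snapshot_xmax;  /* xids >= xmax are not yet visible */
    uint32_t    snapshot_nxip;  /* length of the xip array */
    uint32_t   *snapshot_xip;   /* xids in progress at snapshot time */
    const char *query;          /* SQL text (or serialized plan) to run */
} ShardRequest;
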
>
> I think the performance benefits of having a single coordinator are
> going to require us to implement different snapshot/transaction code
> paths for single coordinator and multi-coordinator cases. :-( That is,
> people who can get by with only a single coordinator are going to want
> to do that to avoid the overhead of multiple coordinators, while those
> using multiple coordinators are going to have to live with that
> overhead.
>
> I think an open question is what workloads can use a single coordinator?
> Read-only? Long queries? Are those cases also ones where the
> snapshot/transaction overhead is negligible, meaning we don't need the
> single coordinator optimizations?

Well, the cost of the coordinator is basically a per-snapshot cost,
which means in effect a per-transaction cost. So if your typical
query runtimes are measured in seconds or minutes, it's probably going
to be really hard to swamp the coordinator. If they're measured in
milliseconds or microseconds, it's likely to be a huge problem. I
think this is why a lot of NoSQL systems have abandoned transactions
as a way of doing business. The overhead of isolating transactions is
significant even on a single-server system when there are many CPUs
and lots of short-running queries (cf.
0e141c0fbb211bdd23783afa731e3eef95c9ad7a and previous efforts in the
same area) but in a multi-server environment it just gets crushingly
expensive.
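
To put rough, purely illustrative numbers on that: 100 backends
averaging 1 ms per statement under READ COMMITTED would fire roughly
100 * 1000 = 100,000 snapshot requests per second at the coordinator,
each paying a network round trip, while the same 100 backends running
10-second queries would generate about 10 requests per second, which
no coordinator will ever notice.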

I think there are ways to reduce the cost of this. Some distributed
systems have solved it by retreating from snapshot isolation and going
back to using locks. This can improve scalability if you've got lots
of transactions but they're very short and rarely touch the same data.
Locking individual data elements (or partitions) doesn't require a
central system that knows about all commits - each individual node can
just release locks for each transaction as it completes. More
generally, if you could avoid having to make a decision about whether
transaction A precedes or follows transaction B unless transactions A
and B actually touch the same data - an idea we already use for SSI -
then you could get much better scalability. But I doubt that can be
made to work without a deeper rethink of our transaction system.
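
As a sketch of that lock-based alternative (hypothetical structures,
not any real system's code), the essential property is that commit
touches only node-local state:

#include <stdint.h>
#include <stdlib.h>

typedef struct LockEntry
{
    uint64_t    item_id;            /* the tuple or partition locked */
    struct LockEntry *next;
} LockEntry;

typedef struct LocalTxn
{
    uint64_t    local_xid;          /* node-local transaction id */
    LockEntry  *locks;              /* locks this branch holds here */
} LocalTxn;

/*
 * Commit the local branch of a distributed transaction.  Nothing here
 * consults a central commit arbiter: the transaction's ordering against
 * its neighbors was fixed when the locks were granted, so the node can
 * simply release them and move on.
 */
static void
local_commit(LocalTxn *txn)
{
    LockEntry  *e = txn->locks;

    while (e != NULL)
    {
        LockEntry  *next = e->next;

        /* here one would wake any waiter queued on e->item_id */
        free(e);
        e = next;
    }
    txn->locks = NULL;
}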

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
