From: Kevin Grittner <kgrittn(at)gmail(dot)com>
To: Craig Ringer <craig(at)2ndquadrant(dot)com>
Cc: Simon Riggs <simon(at)2ndquadrant(dot)com>, Konstantin Knizhnik <k(dot)knizhnik(at)postgrespro(dot)ru>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: The plan for FDW-based sharding
Date: 2016-03-07 15:18:32
Message-ID: CACjxUsO=Chvy7GVetaPRhh7PVnN1H+OxOEHgMqizxhmAoVgCog@mail.gmail.com
Lists: pgsql-hackers

On Mon, Mar 7, 2016 at 6:13 AM, Craig Ringer <craig(at)2ndquadrant(dot)com> wrote:
> On 5 March 2016 at 23:41, Kevin Grittner <kgrittn(at)gmail(dot)com> wrote:

>> The only place you *need* to vary from commit order for correctness
>> is when there are overlapping SERIALIZABLE transactions, one
>> modifies data and commits, and another reads the old version of the
>> data but commits later.
>
> Ah, right. So here, even though X1 commits before X2 running concurrently
> under SSI, the logical order in which the xacts could've occurred serially
> is that where xact 2 runs and commits before X1, since xact 2 doesn't depend
> on xact 1. X2 read the old row version before xact 1 modified it, and
> logically occurs before xact1 in the serial rearrangement.

Right, because X2 is *seeing* data in a state that existed before X1 ran.
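To make the ordering point concrete, here is a toy sketch (plain Python, nothing from PostgreSQL itself) of why commit order and apparent serial order diverge: X1 updates a row and commits first, but X2 read the *old* version of that row, so there is a rw-dependency forcing X2 before X1 in any apparent serial order.

```python
# X1 commits first on the wire, but the rw-conflict says X2 must
# precede X1 in the apparent order of execution.
from graphlib import TopologicalSorter

commit_order = ["X1", "X2"]     # order in which COMMITs happened

# Predecessor map: X1's predecessors include X2, because X2 read
# the pre-update version of data that X1 overwrote.
rw_edges = {"X1": {"X2"}}

apparent_order = list(TopologicalSorter(rw_edges).static_order())

print(commit_order)    # ['X1', 'X2']
print(apparent_order)  # ['X2', 'X1']
```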

> I don't fully grasp how that can lead to a situation where xacts can commit
> in an order that's valid upstream but not valid as a downstream apply order.

With SSI, it can matter whether an intermediate state is *read*.

> I presume we're looking at read-only logical replicas here (rather than
> multimaster),

I have not worked out how this works with MMR. I'm not sure that
there is one clear answer to that.

> and it's only a concern for SERIALIZABLE xacts since a READ
> COMMITTED xact on the master and replica would both be able to see the state
> where X1 is committed but X2 isn't yet.

REPEATABLE READ would allow the anomaly to be seen, too, if a
transaction acquired its snapshot between the two commits.

> But I don't see how a read-only xact
> in SERIALIZABLE on the replica can get different results to what it'd get
> with SSI on the master. It's entirely possible for a read xact on the master
> to get a snapshot after X1 commits and after X2 commits, same as READ
> COMMITTED. SSI shouldn't AFAIK come into play with no writes to create a
> pivot. Is that wrong?

As mentioned earlier in this thread, look at the examples in this
section of the Wiki page, and imagine that the READ ONLY
transaction involved did *not* run on the primary, but *did* run on
the replica:

https://wiki.postgresql.org/wiki/SSI#Read_Only_Transactions
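As a small illustration of what those examples are getting at (values and names invented here, not taken from the wiki page): take the two transactions above plus a read-only observer whose snapshot falls between the two commits. The observer sees a database state that no serial execution of X1 and X2 can produce.

```python
# X1 overwrites x; X2 read the old x and writes y; the only valid
# serial order is therefore X2 then X1.  A reader with a snapshot
# between the commits sees X1's write without X2's.
initial = {"x": 1, "y": 1}

def run_x1(db):
    db = dict(db)
    db["x"] = 2          # X1 overwrites x
    return db

def run_x2(db):
    assert db["x"] == 1  # X2 saw the pre-update version of x
    db = dict(db)
    db["y"] = 2
    return db

# States reachable in the apparent serial order X2 -> X1:
s0 = initial
s1 = run_x2(s0)
s2 = run_x1(s1)
serial_states = [s0, s1, s2]

# What the between-commits snapshot sees:
r_snapshot = run_x1(initial)

print(r_snapshot)                   # {'x': 2, 'y': 1}
print(r_snapshot in serial_states)  # False -- matches no serial state
```

(Running `run_x2(run_x1(initial))` would trip the assertion, which is the code's way of saying the serial order X1 then X2 is not consistent with what X2 actually read.)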

> If we applied this sequence to the downstream in commit order we'd still get
> correct results on the heap after applying both.

... eventually.

> We'd have an intermediate
> state where X1 is committed but X2 isn't, but we can have the same on the
> master. SSI doesn't AFAIK mask X1 from becoming visible in a snapshot until
> X2 commits or anything, right?

If that intermediate state is *seen* on the master, a transaction
is rolled back.

>> The key is that
>> there is a read-write dependency (a/k/a rw-conflict) between the
>> two transactions which tells you that the second to commit has to
>> come before the first in any graph of apparent order of execution.
>
> Yeah, I get that part. How does that stop a 3rd SERIALIZABLE xact from
> getting a snapshot between the two commits and reading from there?

Serializable Snapshot Isolation doesn't generally block anything
that REPEATABLE READ (which is straight Snapshot Isolation) doesn't
block -- unless you explicitly request READ ONLY DEFERRABLE.  What
it does is monitor for situations that can present anomalies and
roll back transactions as necessary to prevent anomalies in
successfully committed transactions.  We tried very hard to avoid
rolling back a transaction that could fail a second time on
conflict with the same set of transactions, although there were
some corner cases where that could not be avoided when a
transaction was PREPARED and not yet committed.  Another possibly
useful fact is that we were able to guarantee that whenever there
was a rollback, some SERIALIZABLE transaction which overlaps the
one being rolled back has modified data and successfully committed
-- ensuring that there is some forward progress even in worst-case
situations.

>> The tricky part is that when there are two overlapping SERIALIZABLE
>> transactions and one of them has modified data and committed, and
>> there is an overlapping SERIALIZABLE transaction which is not READ
>> ONLY which has not yet reached completion (COMMIT or ROLLBACK) the
>> correct ordering remains in doubt -- there is no way to know which
>> might need to commit first, or whether it even matters. I am
>> skeptical about whether in logical replication (including MMR), it
>> is going to be possible to manage this by finding "safe snapshots".
>> The only alternative I can see, though, is to suspend replication
>> while correct transaction ordering remains in doubt. A big READ
>> ONLY transaction would not cause a replication stall, but a big
>> READ WRITE transaction could cause an indefinite stall. Simon
>> seemed to be saying that this is unacceptable, but I tend to think
>> it is a viable approach for some workloads, especially if the READ
>> ONLY transaction property is used when possible.
>
> We already have huge replication stalls when big write xacts occur. We don't
> start sending any data for the xact to a peer until it commits, and once we
> start we don't send any other xact data until that xact is received (and
> probably applied) by the peer.
>
> I'd like to address that by introducing xact streaming / interleaved xacts,
> where we stream big xacts on the wire as they occur and buffer them on the
> peer, possibly speculatively applying them too. This requires that
> individual row changes be tagged with subxact IDs and that
> subxact-to-top-level-xact mapping info be sent, so the peer can accumulate
> the right xacts into the right buffers. Basically offloading reorder
> buffering to the peer.
>
> That same mechanism would let replication continue while logical
> serializable commit-order is in-doubt, blocking only the actual commit from
> proceeding, and only on those xacts. I think.

That makes sense to me.
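For what it's worth, the peer-side buffering Craig describes can be sketched in a few lines (all names invented for illustration; this is a toy, not the logical decoding code): row changes arrive interleaved and tagged with a subxact id, subxact-to-top-level mapping records let the peer file them into the right buffer, and a whole buffer is released only at commit -- which is exactly the point that could be held while serializable apply order is still in doubt.

```python
# Toy reorder buffer on the receiving peer.
from collections import defaultdict

class PeerReorderBuffer:
    def __init__(self):
        self.sub_to_top = {}              # subxid -> top-level xid
        self.buffers = defaultdict(list)  # top-level xid -> changes

    def map_subxact(self, subxid, topxid):
        self.sub_to_top[subxid] = topxid

    def row_change(self, subxid, change):
        # A top-level xact acts as its own "subxact" if unmapped.
        top = self.sub_to_top.get(subxid, subxid)
        self.buffers[top].append(change)

    def commit(self, topxid):
        """Release the accumulated buffer for application; this is
        the step that can wait on in-doubt commit ordering."""
        return self.buffers.pop(topxid, [])

buf = PeerReorderBuffer()
buf.map_subxact("sub1", "X1")
buf.row_change("sub1", "INSERT a")  # streamed before X1 commits
buf.row_change("X2", "UPDATE b")    # another xact, interleaved
print(buf.commit("X1"))             # ['INSERT a']
```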

> That said I'm still clearly more fuzzy about the details of what SSI does,
> what it guarantees and how it works than I thought I was, so I may just be
> handwaving pointlessly at this point. I'd better read some code...

You might want to also review the paper presented at the VLDB
conference:

http://vldb.org/pvldb/vol5/p1850_danrkports_vldb2012.pdf

Really I think the key is to consider it a monitor on top of the
Snapshot Isolation of REPEATABLE READ, one which looks for patterns
in read-write dependencies and transaction boundaries (the points
where snapshots are acquired and commits successfully complete) and
cancels transactions as necessary to prevent anomalies.  The
patterns that are used were recognized over the course of many
years of research into the topic by groups at MIT, Sydney, and
elsewhere.  Dan and I managed to extend the theory with respect to
READ ONLY transactions in a way that was reviewed by some of the
prior researchers and stood up to peer review at the VLDB
conference.  Getting your head around all the conditions involved
in making an anomaly possible is a bit of work, but it is all well
grounded in both theory and practical research.
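A drastically simplified sketch of that "monitor" view (this omits nearly everything the real implementation and the VLDB paper cover, such as commit-order checks that cut false positives): track rw-antidependencies between concurrent transactions, and flag the dangerous structure -- a pivot transaction with both an inbound and an outbound rw-edge -- which is a necessary condition for an anomaly.

```python
# Each edge reader -rw-> writer means the reader saw a version the
# writer replaced.  A txn with edges in both directions is a pivot.
class Txn:
    def __init__(self, name):
        self.name = name
        self.out_rw = False  # read data a concurrent txn wrote over
        self.in_rw = False   # a concurrent txn read data we wrote over

def rw_conflict(reader, writer):
    """Record reader -rw-> writer; return any txn that has just
    become a pivot (candidate for cancellation)."""
    reader.out_rw = True
    writer.in_rw = True
    return [t.name for t in (reader, writer) if t.in_rw and t.out_rw]

t1, t2, t3 = Txn("T1"), Txn("T2"), Txn("T3")
print(rw_conflict(t1, t2))  # []     -- one edge alone is harmless
print(rw_conflict(t2, t3))  # ['T2'] -- T2 is a pivot
```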

I will admit that getting your head around the internal workings of
SSI is one or two orders of magnitude more work than getting your
head around S2PL. The bright side is that end users don't need to
do that to be able to *use* it effectively.

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
