Jan Wieck wrote:
> WARNING: This is getting long ...
>
> Postgres-R is a very interesting and inspiring idea. And I've been
> kicking that concept around for a while now. What I don't like about
> it is that it requires fundamental changes in the lock mechanism and
> that it is based on the assumption of very low lock conflict.
>
> <explain-PG-R>
> In Postgres-R a committing transaction sends its workset (WS - a list
> of all updates done in this transaction) to the group communication
> system (GC). The GC guarantees total order, meaning that all nodes
> will receive all WSs in the same order, no matter in what order they
> were sent.
>
> If a node receives back its own WS before any error occurred, it goes
> ahead and finalizes the commit. If it receives a foreign WS, it has to
> apply the whole WS and commit it before it can process anything else.
> If a local transaction, still in progress or waiting for its WS to
> come back, holds a lock that is required to process such a remote WS,
> the local transaction needs to be aborted to release its resources ...
> it lost the total order race.
> </explain-PG-R>
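>
> In rough Python, that delivery rule boils down to the toy sketch
> below. All the names and the lock representation (a set of row keys)
> are made up just to illustrate; the real thing lives in the lock
> manager.
>
>     class Txn:
>         def __init__(self, name, locks):
>             self.name = name
>             self.locks = set(locks)  # row keys this local txn holds
>             self.state = "in-progress"
>
>     def on_ws_delivered(is_own_ws, ws_locks, ws_txn, local_txns):
>         """Handle one WS in the total order established by the GC."""
>         if is_own_ws:
>             ws_txn.state = "committed"  # won the total order race
>             return
>         for t in local_txns:            # foreign WS: abort every local
>             if t.locks & ws_locks:      # txn holding a conflicting lock
>                 t.state = "aborted"     # ... it lost the race
>         # now apply and commit the remote WS before anything else
>
>     t1 = Txn("local-1", {"district-7"})
>     on_ws_delivered(False, {"district-7"}, None, [t1])
>     print(t1.state)                     # -> aborted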
>
> Postgres-R requires that all remote WSs are applied and committed
> before a local transaction can commit. Otherwise it couldn't correctly
> detect a lock conflict. So there will not be any read-ahead. And since
> the total order really counts here, it cannot apply any two remote WSs
> in parallel; in a race condition a later WS in the total order could
> run faster and block an earlier one, so we have to squeeze all remote
> WSs through one single replication work process. And all the locally
> parallel executed transactions that are waiting for their WSs to come
> back have to wait until that poor little worker is done with the whole
> pile. Bye-bye concurrency. And I don't know how the GC will deal with
> the backlog either. It could well choke on it.
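>
> To make the bottleneck concrete, here is a minimal sketch (plain
> Python, everything hypothetical) of what that single worker amounts
> to: one queue in total order, one consumer, and local commits gated
> on the queue being drained.
>
>     from queue import Queue
>
>     remote_ws = Queue()  # every remote WS, in the GC's total order
>
>     def replication_worker(apply_and_commit):
>         # Strictly one WS at a time: each one applied and committed
>         # before the next is even looked at, or the total order
>         # could be violated.
>         while not remote_ws.empty():
>             apply_and_commit(remote_ws.get())
>
>     def local_commit_allowed():
>         # A local transaction may only finalize once every remote WS
>         # ordered before its own has been applied.
>         return remote_ws.empty()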
>
> I do not see how this will scale well in a multi-SMP-system cluster.
> At least the serialization of WSs will become a horror if there is
> significant lock contention, like in a standard TPC-C on the district
> row containing the order number counter. I don't know for sure, but I
> suspect that with this kind of bottleneck, Postgres-R will have to
> roll back more than 50% of its transactions when there are more than 4
> nodes under heavy load (like in a benchmark run). That will suck ...
>
>
> But ... initially I said that it is an inspiring concept ... soooo ...
>
> I am currently hacking around with some C+PL/TclU+Spread constructs
> that might form a rude kind of prototype creature.
>
> My changes to the Postgres-R concept are that there will be as many
> replicating slave processes as there are masters in total out in the
> cluster ... yes, it will try to utilize all the CPUs in the cluster!
> For failover reliability, a committing transaction will hold before
> finalizing the commit and send its "I'm ready" to the GC. Every
> replicator that reaches the same state sends "I'm ready" too. Spread
> guarantees in SAFE_MESS mode that messages are delivered to all nodes
> in a group, or that at least LEAVE/DISCONNECT messages are delivered
> before. So if a node receives "I'm ready" from more than 50% of the
> nodes, there is only a very small window in which multiple nodes would
> have to fail in the same split second so that the majority of nodes
> does NOT commit. A node that reported "I'm ready" but lost more than
> 50% of the cluster before committing has to roll back and rejoin, or
> wait for operator intervention.
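>
> The commit decision itself is then just vote counting. A toy version
> (the message plumbing and Spread's real API are deliberately left
> out):
>
>     def may_finalize(ready_votes, cluster_size):
>         # True once MORE than half of the nodes reported "I'm ready".
>         return ready_votes > cluster_size // 2
>
>     def after_majority_lost(reported_ready):
>         # A node that said "I'm ready" but then lost contact with
>         # more than 50% of the cluster must roll back and rejoin
>         # (or wait for the operator).
>         return "rollback-and-rejoin" if reported_ready else "rollback"
>
>     assert may_finalize(3, 5) and not may_finalize(2, 5)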
>
> Now the idea is to split up the communication into per-transaction GC
> distribution groups. So working master backends and their associated
> replication backends will join/leave a unique group for every
> transaction in the cluster. This way, the per-process communication is
> reduced to the required minimum.
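>
> In sketch form (gc.join/gc.leave are hypothetical wrappers; Spread's
> actual group calls are not reproduced here):
>
>     def replicate_transaction(gc, xid, run_transaction):
>         group = "txn-%d" % xid     # one unique group per transaction
>         gc.join(group)             # master + its replicators join
>         try:
>             run_transaction(group) # only members see this traffic
>         finally:
>             gc.leave(group)        # group disappears with the txn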
>
>
> As said, I am hacking on some code ...
>
>
> Jan
>
> Chris Travers wrote:
>
>> Tom Lane wrote:
>>
>>> Chris Travers <chris(at)travelamericas(dot)com> writes:
>>>
>>>
>>>> Yes I have. Postgres-r is not a high-availability solution which is
>>>> capable of transparent failover,
>>>>
>>>
>>>
>>> What makes you say that? My understanding is it's supposed to survive
>>> loss of individual servers.
>>>
>>> regards, tom lane
>>>
>>>
>>>
>>>
>> My mistake. I must have gotten them confused with another
>> (asynchronous) replication project.
>>
>> Best Wishes,
>> Chris Travers
>>
>>
>
>
>
As my British friends would say, "Bully for you", and I applaud you
playing with, struggling with, and learning from this for our sakes.
Jeez, all I think about is me, huh?