From: Jan Wieck <JanWieck(at)Yahoo(dot)com>
To: Chris Travers <chris(at)travelamericas(dot)com>
Cc: Richard Huxton <dev(at)archonet(dot)com>, pgsql-general(at)postgresql(dot)org
Subject: Re: Feature Request for 7.5
Date: 2003-12-03 15:20:03
Message-ID: 3FCDFF23.7060404@Yahoo.com
Lists: pgsql-general
The following is more or less a brain dump ... not fully thought out
and not to be considered a proposal at this time.
The synchronous multi-master solution I have in mind needs a few
support features that do not currently exist in the backend. One is
non-blocking locks; another is a callback mechanism invoked just before
a transaction is marked committed in clog.
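Just to illustrate what I mean (these names are made up, nothing like
this exists in the backend today), the two hooks could look roughly
like:

    /* Hypothetical backend additions - illustration only. */

    /* Callback invoked when a transaction is ready to commit, but
     * before its status is stamped in clog.  Returning false would
     * force the transaction to abort instead. */
    typedef bool (*PreCommitCallback) (TransactionId xid, void *arg);
    extern void RegisterPreCommitCallback(PreCommitCallback cb, void *arg);

    /* Non-blocking row lock: returns false immediately instead of
     * sleeping when somebody else holds the lock. */
    extern bool ConditionalLockTupleNoWait(Relation rel, ItemPointer tid);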
It will use reliable group communication (GC) that can guarantee total
order. There is an AFTER trigger on all replicated tables. A daemon
started for every database will create a number of threads/subprocesses.
Each of these workers has its own separate DB connection and is a member
of a different group in the GC. The number of these groups determines
the maximum number of concurrent UPDATE transactions the cluster can
handle.
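In rough C, the daemon startup could be as simple as this (libpq plus
an imagined GC library; gc_join() and friends are invented names):

    /* One worker per replication group, each with its own DB
     * connection and its own GC group membership. */
    int i;

    for (i = 0; i < num_groups; i++)
    {
        Worker *w = &workers[i];

        w->conn  = PQconnectdb(conninfo);   /* private DB connection */
        w->group = gc_join(gc_handle, group_name(dbname, i));
        pthread_create(&w->thread, NULL, worker_main, w);
    }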
On the first trigger call inside a transaction (that is, on the first
modifying statement), the trigger allocates one of the replication
groups (possibly waiting for one to become free). The transaction is
now in contact with one daemon thread on every database in the cluster.
Subsequent trigger calls send the replication data into this group.
There is no need to wait for the other cluster members as long as the
GC guarantees FIFO order per sender.
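A sketch of the trigger side (again invented names, not real backend
calls):

    /* First modifying statement in this transaction? */
    if (my_group == NULL)
        my_group = alloc_replication_group();  /* may wait for a free one */

    /* FIFO by sender is sufficient here - fire and forget,
     * no waiting for the other cluster members. */
    gc_send(my_group, GC_FIFO, encode_change(trigdata));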
When the transaction commits, it sends a commit message into the group.
This message uses a different service type: total order. The transaction
then waits for every member of the replication group to reply with the
same message. Once every member has replied, all have agreed to commit
and all stand just before stamping clog.
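The commit path then becomes something like this (invented names
again):

    /* Switch to totally ordered delivery and collect the votes. */
    int      votes = 0;
    Message *msg;

    gc_send(my_group, GC_TOTAL_ORDER, COMMIT_MSG);

    while (votes < group_size(my_group))
    {
        msg = gc_receive(my_group);

        if (msg->type == ABORT_MSG || msg->type == LEAVE_MSG)
            elog(ERROR, "replication group aborted");  /* roll back */
        if (msg->type == COMMIT_MSG)
            votes++;        /* one more member ready to stamp clog */
    }
    /* Everyone agreed - now stamp clog. */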
Since the service type is total order, the GC guarantees that either
all members get the messages in the same order, or, if one member
cannot receive a message, a corresponding LEAVE message is generated.
Also, all replication threads use non-blocking locking: if any of them
ever finds a locked row, it sends an ABORT message into the group,
causing the whole group to roll back.
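On the receiving side, a worker would do something like this (using the
hypothetical non-blocking lock from above):

    /* Try to lock the row a remote change wants to update. */
    if (!ConditionalLockTupleNoWait(rel, &tid))
    {
        /* A local transaction holds the row - give up and tell
         * the whole group to roll back. */
        gc_send(my_group, GC_TOTAL_ORDER, ABORT_MSG);
        elog(ERROR, "row locked locally - aborting replication group");
    }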
This way, either all members of the group reach the "just before
stamping clog" state together and know that everyone got there, or they
get an ABORT or LEAVE message from one of their co-workers and roll
back.
There is a gap between reporting "ready" and actually stamping clog in
which a database might crash. This would cause all other cluster members
to go ahead and commit while the crashed DB does not. But this can
happen only on a crash, and a restarting database must rejoin/resync
with the cluster anyway and treat its own data as suspect. So this is
not really a problem.
With this synchronous model, read-only transactions can be handled on
every node without involving replication at all - this is the scaling
part. The total throughput of UPDATE transactions is limited by the
slowest cluster member and does not scale, but that is true for all
synchronous solutions.
Jan
Chris Travers wrote:
> Interesting feedback.
>
> It strikes me that, for many sorts of databases, multimaster synchronous
> replication is not the best solution for the reasons that Scott, Jan, et
> al. have raised. I am wondering how commercial RDBMSs get around this
> problem? There are several possibilities that I can think of-- have a write
> master, and many read-only slaves (available at the moment, iirc).
> Replication could then occur at the tuple level using linked databases,
> triggers, etc. Rewrite rules could then allow one to use the slaves to
> "funnel" the queries back up to the master. It seems to me that latency
> would be a killer on this sort of solution, though everything would
> effectively occur on all databases in the same order, but recovering from a
> crash of the master could be complicated and result in additional
> downtime...
>
> The other solution (still not "guaranteed" to work in all cases) is that
> every proxy could be hardwired to attempt to contact databases in a set
> order. This would also avoid deadlocks. Note that if sufficient business
> logic is built into the database, one would be guaranteed that a single
> "consistent" view would be maintained at any given time (conflicts would
> result in the minority of up to 50 percent of the servers needing to go
> through the recovery process-- not killing uptime, but certainly killing
> performance).
>
> However, it seems to me that the only solution for many of these databases
> is to have a "cluster in a box" solution where you have a system comprised
> entirely of redundant, hot-swappable hardware so that nearly anything can be
> swapped out if it breaks. In this case, we should be able to just run
> PostgreSQL as is....
>
--
#======================================================================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me. #
#================================================== JanWieck(at)Yahoo(dot)com #