Re: Multimaster

From: Konstantin Knizhnik <k(dot)knizhnik(at)postgrespro(dot)ru>
To: Craig Ringer <craig(at)2ndquadrant(dot)com>
Cc: Simon Riggs <simon(at)2ndquadrant(dot)com>, Postgresql General <pgsql-general(at)postgresql(dot)org>
Subject: Re: Multimaster
Date: 2016-04-18 08:28:02
Message-ID: 57149A92.7020605@postgrespro.ru
Lists: pgsql-general

Hi,
Thank you for your response.

On 17.04.2016 15:30, Craig Ringer wrote:
> I intend to make the same split in pglogical its self - a receiver and
> apply worker split. Though my intent is to have them communicate via a
> shared memory segment until/unless the apply worker gets too far
> behind and spills to disk.
>

In the multimaster case the "too far behind" scenario can never happen, so
here is yet another difference between the asynchronous and synchronous
replication approaches. For asynchronous replication, a replica falling far
behind the master is quite normal and has to be handled without blocking the
master. For synchronous replication this situation is simply not possible,
and all this "spill to disk" machinery just adds extra overhead.
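
Just to make the contrast concrete, here is a rough sketch of the two
back-pressure strategies (illustrative Python, invented names, not pglogical
or multimaster code): the synchronous variant simply blocks the producer on a
bounded queue, while the asynchronous one keeps the producer running and
spills overflow to disk.

import queue
import tempfile

class SyncChangeQueue:
    """Synchronous style: bounded in-memory queue, producer blocks when full."""
    def __init__(self, max_entries=1024):
        self._q = queue.Queue(maxsize=max_entries)

    def put(self, change):
        self._q.put(change, block=True)      # back-pressure instead of spilling

    def get(self):
        return self._q.get()

class AsyncChangeQueue:
    """Asynchronous style: never block the producer, spill overflow to disk."""
    def __init__(self, max_entries=1024):
        self._q = queue.Queue(maxsize=max_entries)
        self._spill = tempfile.TemporaryFile()

    def put(self, change):
        try:
            self._q.put(change, block=False)
        except queue.Full:
            self._spill.write((change + "\n").encode())   # the extra overhead

    def get(self):
        # Simplified: a real implementation would also re-read spilled
        # changes in order once the in-memory queue is drained.
        return self._q.get()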

It seems to me that the pglogical plugin is becoming too universal, trying to
address a lot of different issues and play many different roles.
Here are some use cases for logical replication that I see (I am quite sure
that you know more):
1. Asynchronous replication (including geo-replication) - this is actually BDR.
2. Logical backup: transferring data to a different database (including a
newer version of Postgres).
3. Change notification: there are many different subscribers which may be
interested in receiving notifications about database changes.
As far as I know, the new JDBC driver is going to use logical replication to
receive update streams. It can also be used for update/invalidation of
caches in ORMs.
4. Synchronous replication: multimaster.

> Any vacant worker form this pool can dequeue this work and proceed it.
>
>
> How do you handle correctness of ordering though? A naïve approach
> will suffer from a variety of anomalies when subject to
> insert/delete/insert write patterns, among other things. You can also
> get lost updates, rows deleted upstream that don't get deleted
> downstream and various other exciting ordering issues.
>
> At absolute minimum you'd have to commit on the downstream in the same
> commit order as the upstream.. This can deadlock. So when you get a
> deadlock you'd abort the xacts of the deadlocked worker and all xacts
> with later commit timestamps, then retry the lot.

We are not enforcing commit order as Galera does. Consistency is enforced by
the DTM, which ensures that transactions at all nodes are given consistent
snapshots and are assigned the same CSNs. We also have a global deadlock
detection algorithm which builds a global lock graph (but false positives are
still possible, because this graph is built incrementally and so does not
correspond to any single global snapshot).
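
To illustrate the idea (just a sketch, not our actual DTM code; the
transaction ids are invented): "waiter waits for holder" edges are collected
from all nodes, merged into one graph, and a cycle in that graph is reported
as a deadlock.

from collections import defaultdict

def find_cycle(edges):
    # edges: (waiter_xid, holder_xid) pairs gathered from all nodes
    graph = defaultdict(list)
    for waiter, holder in edges:
        graph[waiter].append(holder)

    WHITE, GREY, BLACK = 0, 1, 2
    color = defaultdict(int)
    parent = {}

    def dfs(v):
        color[v] = GREY
        for w in graph[v]:
            if color[w] == GREY:              # back edge -> cycle found
                cycle, x = [w], v
                while x != w:
                    cycle.append(x)
                    x = parent[x]
                return cycle
            if color[w] == WHITE:
                parent[w] = v
                found = dfs(w)
                if found:
                    return found
        color[v] = BLACK
        return None

    for v in list(graph):
        if color[v] == WHITE:
            found = dfs(v)
            if found:
                return found
    return None

# Edges from different nodes are collected at slightly different moments,
# so the merged graph may show a cycle that never existed at any single
# instant -- this is exactly the source of the false positives.
print(find_cycle([("n1:100", "n2:205"), ("n2:205", "n1:100")]))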

>
> BDR has enough trouble with this when applying transactions from
> multiple peer nodes. To a degree it just throws its hands up and gives
> up - in particular, it can't tell the difference between an
> insert/update conflict and an update/delete conflict. But that's
> between loosely coupled nodes where we explicitly document that some
> kinds of anomalies are permitted. I can't imagine it being OK to have
> an even more complex set of possible anomalies occur when simply
> replaying transactions from a single peer...

We should definitely do more testing here, but right now we do not have any
tests that cause synchronization anomalies.

>
> It is certainly possible with this approach that order of applying
> transactions can be not the same at different nodes.
>
>
> Well, it can produce downright wrong results, and the results even in
> a single-master case will be all over the place.
>
> But it is not a problem if we have DTM.
>
>
> How does that follow?

Multimaster is just a particular (and the simplest) case of distributed
transactions. What is specific about multimaster is that the same transaction
has to be applied at all nodes and that selects can be executed at any node.
The goal of the DTM is to provide consistent execution of distributed
transactions. If it is able to do that for arbitrary transactions, then it
can certainly do it for multimaster.
I cannot give you a formal proof here that our DTM is able to solve all these
problems. Certainly there may be bugs in the implementation, and this is why
we need to do more testing. But we are not "reinventing the wheel": our DTM
is based on existing approaches.
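
For example, the role of CSNs can be shown with this toy sketch (not the real
DTM, just the visibility rule): the arbiter assigns the same commit sequence
number to a transaction on every node, and a snapshot sees exactly the
transactions committed with CSN less than or equal to the snapshot's CSN, so
reads on any node give the same answer.

class Node:
    def __init__(self):
        self.commit_csn = {}                  # xid -> CSN, filled at commit

    def commit(self, xid, csn):
        # The DTM hands out the same CSN for this xid on every node.
        self.commit_csn[xid] = csn

    def visible(self, xid, snapshot_csn):
        csn = self.commit_csn.get(xid)
        return csn is not None and csn <= snapshot_csn

nodes = [Node(), Node()]
for n in nodes:
    n.commit(xid=42, csn=1005)                # same CSN on both nodes

# A snapshot "taken at" CSN 1010 sees the transaction on every node,
# a snapshot taken at CSN 1000 sees it on none of them.
assert all(n.visible(42, snapshot_csn=1010) for n in nodes)
assert not any(n.visible(42, snapshot_csn=1000) for n in nodes)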

> The only exception is recovery of multimaster node. In this case
> we have to apply transaction exactly in the same order as them
> were applied at the original node performing recovery. It is done
> by applying changes in recovery mode by pglogical_receiver itself.
>
>
> I'm not sure I understand what you area saying here.

Sorry for being unclear.
I just said that normally transactions are applied concurrently by multiple
workers, and the DTM is used to enforce consistency.
But in case of recovery (when some node has crashed and then reconnects to
the cluster), we perform recovery of this node sequentially, by a single
worker. In this case the DTM is not used (because the other nodes are far
ahead), and to restore the same state of the node we need to apply changes in
exactly the same order as at the source node. In this case the content of the
target (recovered) node will be the same as that of the source node.
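
Schematically (invented names, just to show the two modes of the receiver):

def receive_loop(stream, workers, in_recovery, apply_locally):
    for xact in stream:                       # transactions arrive in commit order
        if in_recovery():
            # Recovery: the receiver itself applies every transaction,
            # strictly in the original commit order, without the DTM.
            apply_locally(xact)
        else:
            # Normal operation: hand the transaction to any vacant worker;
            # the DTM, not the apply order, guarantees consistency.
            workers.dispatch(xact)

class WorkerPool:
    def dispatch(self, xact):
        print("dispatched", xact)

receive_loop(["T1", "T2"], WorkerPool(),
             in_recovery=lambda: False, apply_locally=print)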

>> We also need 2PC support but this code was sent to you by
>> Stas, so I hope that sometime it will be included in
>> PostgreSQL core and pglogical plugin.
>>
>>
>> I never got a response to my suggestion that testing of upstream
>> DDL is needed for that. I want to see more on how you plan to
>> handle DDL on the upstream side that changes the table structure
>> and acquires strong locks. Especially when it's combined with row
>> changes in the same prepared xacts.
>
> We are now replicating DDL in the way similar with one used in
> BDR: DDL statements are inserted in special table and are replayed
> at destination node as part of transaction.
>
> We have also alternative implementation done by Artur Zakirov
> <a(dot)zakirov(at)postgrespro(dot)ru <mailto:a(dot)zakirov(at)postgrespro(dot)ru>>
> which is using custom WAL records:
> https://gitlab.postgrespro.ru/pgpro-dev/postgrespro/tree/logical_deparse
> Patch for custom WAL records was committed in 9.6, so we are going
> to switch to this approach.
>
>
> How does that really improve anything over using a table?

It is a more straightforward approach, isn't it? You can either try to
reconstruct the DDL from the low-level sequence of updates to the system
catalogs, which is difficult and not always possible, or you have to somehow
add the original DDL statements to the log.
That can be done using some special table, or by storing the information
directly in the WAL (if custom WAL records are supported).
Certainly, in the latter case the logical protocol has to be extended to
support playback of user-defined WAL records.
But it seems to be a universal mechanism which can be used not only for DDL.

I agree that custom WAL records add no performance or functionality
advantages over using a table.
This is why we still haven't switched to them. But IMHO the approach of
inserting DDL (or any other user-defined information) into a special table
looks like a hack.
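
From the apply side the two options look roughly like this (a sketch only;
the table name, message kinds and callbacks are invented, this is neither
BDR's nor our real protocol):

def apply_change(change, execute_sql, apply_tuple):
    kind = change["kind"]
    if kind == "insert" and change["table"] == "replicated_ddl_queue":
        # "Special table" approach: the original DDL text travels as an
        # ordinary row insert and is replayed as a statement downstream.
        execute_sql(change["row"]["ddl_text"])
    elif kind == "message":
        # "Custom WAL record" approach: the DDL text arrives as a separate
        # logical message, so the protocol itself has to know about it,
        # but the same mechanism works for any user-defined payload.
        execute_sql(change["payload"])
    else:
        apply_tuple(change)                   # ordinary DML change

# Stub usage:
log = []
apply_change({"kind": "insert", "table": "replicated_ddl_queue",
              "row": {"ddl_text": "ALTER TABLE t ADD COLUMN c int"}},
             execute_sql=log.append, apply_tuple=log.append)
print(log)                                    # ['ALTER TABLE t ADD COLUMN c int']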

>
> This doesn't address what I asked above though, which is whether you
> have tried doing ALTER TABLE in a 2PC xact with your 2PC replication
> patch, especially one that also makes row changes.
>
> Well, recently I have made attempt to merge our code with the
> latest version of pglogical plugin (because our original
> implementation of multimaster was based on the code partly taken
> fro BDR) but finally have to postpone most of changes. My primary
> intention was to support metadata caching. But presence of
> multiple apply workers make it not possible to implement it in the
> same way as it is done node in pglogical plugin.
>
>
> Not with a simplistic implementation of multiple workers that just
> round-robin process transactions, no. Your receiver will have to be
> smart enough to read the protocol stream and write the metadata
> changes to a separate stream all the workers read. Which is awkward.
>
> I think you'll probably need your receiver to act as a metadata broker
> for the apply workers in the end.
>
> Also now pglogical plugin contains a lot of code which performs
> mapping between source and target database schemas. So it it is
> assumed that them may be different.
> But it is not true in case of multimaster and I do not want to pay
> extra cost for the functionality we do not need.
>
>
> All it's really doing is mapping upstream to downstream tables by
> name, since the oids will be different.

Really?
Why then do you send all table metadata (information about attributes) and
handle invalidation messages?
What is the purpose of the "mapping to local relation, filled as needed"
fields in PGLogicalRelation if you are not going to perform such mapping?

Multimaster really only needs to map local to remote OIDs. We do not need to
provide any attribute mapping or to handle catalog invalidations.

--
> Craig Ringer http://www.2ndQuadrant.com/
> PostgreSQL Development, 24x7 Support, Training & Services

--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
