| From: | Konstantin Knizhnik <k(dot)knizhnik(at)postgrespro(dot)ru> | 
|---|---|
| To: | Craig Ringer <craig(at)2ndquadrant(dot)com> | 
| Cc: | Simon Riggs <simon(at)2ndquadrant(dot)com>, Postgresql General <pgsql-general(at)postgresql(dot)org> | 
| Subject: | Re: Multimaster | 
| Date: | 2016-04-18 08:28:02 | 
| Message-ID: | 57149A92.7020605@postgrespro.ru | 
| Lists: | pgsql-general | 
Hi,
Thank you for your response.
On 17.04.2016 15:30, Craig Ringer wrote:
> I intend to make the same split in pglogical itself - a receiver and 
> apply worker split. Though my intent is to have them communicate via a 
> shared memory segment until/unless the apply worker gets too far 
> behind and spills to disk.
>
In the multimaster case the "too far behind" scenario can never happen, so 
here is yet another difference between the asynchronous and synchronous 
replication approaches. For asynchronous replication, a replica falling far 
behind the master is quite normal and has to be handled without blocking 
the master. For synchronous replication this is not possible, so all this 
"spill to disk" machinery just adds extra overhead.
It seems to me that the pglogical plugin is becoming too universal, 
trying to address a lot of different issues and play different roles.
Here are some use cases for logical replication which I see (I am quite 
sure that you know more):
1. Asynchronous replication (including geo-replication) - this is 
actually BDR.
2. Logical backup: transferring data to a different database (including a 
newer version of Postgres).
3. Change notification: there are many different subscribers which can 
be interested in receiving notifications about database changes.
As far as I know, the new JDBC driver is going to use logical replication 
to receive update streams. It can also be used for update/invalidation of 
caches in ORMs.
4. Synchronous replication: multimaster.
>     Any vacant worker from this pool can dequeue this work and process it.
>
>
> How do you handle correctness of ordering though? A naïve approach 
> will suffer from a variety of anomalies when subject to 
> insert/delete/insert write patterns, among other things. You can also 
> get lost updates, rows deleted upstream that don't get deleted 
> downstream and various other exciting ordering issues.
>
> At absolute minimum you'd have to commit on the downstream in the same 
> commit order as the upstream. This can deadlock. So when you get a 
> deadlock you'd abort the xacts of the deadlocked worker and all xacts 
> with later commit timestamps, then retry the lot.
We are not enforcing commit order as Galera does. Consistency is 
enforced by the DTM, which ensures that transactions at all nodes are given 
consistent snapshots and are assigned the same CSNs. We also have a global 
deadlock detection algorithm which builds a global lock graph (but false 
positives are still possible, because this graph is built incrementally 
and so it does not correspond to a single global snapshot).
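To give a feeling of what the detector does (only a sketch with invented names, not the actual pg_dtm code): every node ships its local wait-for edges, they are merged into one graph, and a cycle in the merged graph is treated as a global deadlock. Since the edges arrive incrementally rather than from one global snapshot, a detected cycle may be a false positive, as I said above.

```c
#include <stdbool.h>
#include <string.h>

#define MAX_XACTS 64                       /* toy limit for the sketch */

static bool edge[MAX_XACTS][MAX_XACTS];    /* edge[w][h]: xact w waits for xact h */
static int  color[MAX_XACTS];              /* 0 = unvisited, 1 = on DFS path, 2 = done */

/* Called for every wait-for edge reported by some node. */
void
add_wait_edge(int waiter, int holder)
{
    edge[waiter][holder] = true;
}

static bool
dfs(int v)
{
    color[v] = 1;
    for (int u = 0; u < MAX_XACTS; u++)
    {
        if (!edge[v][u])
            continue;
        if (color[u] == 1)                 /* back edge: cycle, i.e. deadlock */
            return true;
        if (color[u] == 0 && dfs(u))
            return true;
    }
    color[v] = 2;
    return false;
}

/* True if the merged (possibly stale) graph currently contains a cycle. */
bool
global_deadlock_detected(void)
{
    memset(color, 0, sizeof(color));
    for (int v = 0; v < MAX_XACTS; v++)
        if (color[v] == 0 && dfs(v))
            return true;
    return false;
}
```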
>
> BDR has enough trouble with this when applying transactions from 
> multiple peer nodes. To a degree it just throws its hands up and gives 
> up - in particular, it can't tell the difference between an 
> insert/update conflict and an update/delete conflict. But that's 
> between loosely coupled nodes where we explicitly document that some 
> kinds of anomalies are permitted. I can't imagine it being OK to have 
> an even more complex set of possible anomalies occur when simply 
> replaying transactions from a single peer...
We should definitely perform more testing here, but so far none of our 
tests has caused any synchronization anomalies.
>
>     It is certainly possible with this approach that order of applying
>     transactions can be not the same at different nodes.
>
>
> Well, it can produce downright wrong results, and the results even in 
> a single-master case will be all over the place.
>
>     But it is not a problem if we have DTM.
>
>
> How does that follow?
Multimaster is just a particular (and the simplest) case of distributed 
transactions. What is specific to multimaster is that the same transaction has 
to be applied at all nodes and that selects can be executed at any node. 
The goal of the DTM is to provide consistent execution of distributed 
transactions. If it can do that for arbitrary transactions, then it can 
certainly do it for multimaster.
I cannot give you a formal proof here that our DTM is able to solve all 
these problems. Certainly there may be bugs in the implementation,
and this is why we need to perform more testing. But we are not 
"reinventing the wheel": our DTM is based on existing approaches.
>     The only exception is recovery of a multimaster node. In this case
>     we have to apply transactions in exactly the same order as they
>     were applied at the original node performing recovery. It is done
>     by applying changes in recovery mode by pglogical_receiver itself.
>
>
> I'm not sure I understand what you are saying here.
Sorry for being unclear.
I just said that normally transactions are applied concurrently by 
multiple workers, and the DTM is used to enforce consistency.
But in case of recovery (when some node has crashed and then reconnects to 
the cluster), we perform recovery of this node sequentially, by a single 
worker. In this case the DTM is not used (because the other nodes are far 
ahead), and to restore the same state of the node we need to apply changes 
in exactly the same order as at the source node. After recovery the content of 
the target (recovered) node should be the same as that of the source node.
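Schematically it looks like this (just a sketch, the structure and function names are invented, not our actual code):

```c
typedef struct ReplMessage ReplMessage;     /* one decoded transaction from the sender */

extern void enqueue_for_apply_pool(ReplMessage *msg);  /* any vacant apply worker picks it up */
extern void apply_transaction(ReplMessage *msg);       /* apply directly in the receiver */

void
receiver_dispatch(ReplMessage *msg, bool node_in_recovery)
{
    if (node_in_recovery)
        apply_transaction(msg);         /* single worker, exactly the source commit order */
    else
        enqueue_for_apply_pool(msg);    /* concurrent apply; DTM provides consistent snapshots/CSNs */
}
```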
>>         We also need 2PC support but this code was sent to you by
>>         Stas, so I hope that sometime it will be included in
>>         PostgreSQL core and pglogical plugin.
>>
>>
>>     I never got a response to my suggestion that testing of upstream
>>     DDL is needed for that. I want to see more on how you plan to
>>     handle DDL on the upstream side that changes the table structure
>>     and acquires strong locks. Especially when it's combined with row
>>     changes in the same prepared xacts.
>
>     We are now replicating DDL in a way similar to the one used in
>     BDR: DDL statements are inserted into a special table and are replayed
>     at the destination node as part of the transaction.
>
>     We have also alternative implementation done by Artur Zakirov
>     <a(dot)zakirov(at)postgrespro(dot)ru <mailto:a(dot)zakirov(at)postgrespro(dot)ru>>
>     which is using custom WAL records:
>     https://gitlab.postgrespro.ru/pgpro-dev/postgrespro/tree/logical_deparse
>     Patch for custom WAL records was committed in 9.6, so we are going
>     to switch to this approach.
>
>
> How does that really improve anything over using a table?
It is a more straightforward approach, isn't it? You can either try to 
reconstruct the DDL from the low-level sequence of system catalog updates, 
which is difficult and not always possible, or you have to somehow add the 
original DDL statements to the log.
That can be done using some special table, or by storing the information 
directly in the log (if custom WAL records are supported).
Certainly, in the latter case the logical protocol has to be extended to 
support playback of user-defined WAL records.
But it seems to be a universal mechanism which can be used not only for DDL.
I agree that custom WAL records add no performance or functionality advantages 
over using a table.
This is why we still haven't switched to them. But IMHO the approach of 
inserting DDL (or any other user-defined information) into a special table 
looks like a hack.
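Just to make the comparison concrete, here is roughly what the table-based variant looks like (a simplified sketch with invented names, not the actual BDR or multimaster code; the hook signature is the one from 9.5/9.6): a ProcessUtility hook stores the original statement text in an ordinary replicated table, so the apply worker replays the DDL as part of the same transaction.

```c
#include "postgres.h"
#include "fmgr.h"
#include "catalog/pg_type.h"
#include "executor/spi.h"
#include "tcop/utility.h"
#include "utils/builtins.h"

PG_MODULE_MAGIC;

static ProcessUtility_hook_type prev_ProcessUtility = NULL;

static void
ddl_capture_ProcessUtility(Node *parsetree, const char *queryString,
                           ProcessUtilityContext context, ParamListInfo params,
                           DestReceiver *dest, char *completionTag)
{
    /* Execute the DDL itself, chaining to any previously installed hook. */
    if (prev_ProcessUtility)
        prev_ProcessUtility(parsetree, queryString, context, params,
                            dest, completionTag);
    else
        standard_ProcessUtility(parsetree, queryString, context, params,
                                dest, completionTag);

    /*
     * Record the original statement text in a replicated table (the table
     * name is invented for this sketch); the apply worker on the other
     * node executes it as part of the same transaction.
     */
    if (context == PROCESS_UTILITY_TOPLEVEL)
    {
        Oid   argtype = TEXTOID;
        Datum arg = CStringGetTextDatum(queryString);

        SPI_connect();
        SPI_execute_with_args("insert into mtm.ddl_log(query) values ($1)",
                              1, &argtype, &arg, NULL, false, 0);
        SPI_finish();
    }
}

void
_PG_init(void)
{
    prev_ProcessUtility = ProcessUtility_hook;
    ProcessUtility_hook = ddl_capture_ProcessUtility;
}
```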
>
> This doesn't address what I asked above though, which is whether you 
> have tried doing ALTER TABLE in a 2PC xact with your 2PC replication 
> patch, especially one that also makes row changes.
>
>     Well, recently I made an attempt to merge our code with the
>     latest version of the pglogical plugin (because our original
>     implementation of multimaster was based on code partly taken
>     from BDR) but finally had to postpone most of the changes. My primary
>     intention was to support metadata caching. But the presence of
>     multiple apply workers makes it impossible to implement it in the
>     same way as it is done in the pglogical plugin.
>
>
> Not with a simplistic implementation of multiple workers that just 
> round-robin process transactions, no. Your receiver will have to be 
> smart enough to read the protocol stream and write the metadata 
> changes to a separate stream all the workers read. Which is awkward.
>
> I think you'll probably need your receiver to act as a metadata broker 
> for the apply workers in the end.
>
>     Also, the pglogical plugin now contains a lot of code which performs
>     mapping between source and target database schemas. So it is
>     assumed that they may be different.
>     But this is not true in the case of multimaster, and I do not want to pay
>     extra cost for functionality we do not need.
>
>
> All it's really doing is mapping upstream to downstream tables by 
> name, since the oids will be different.
Really?
Why then do you send all table metadata (information about attributes) and 
handle invalidation messages?
What is the purpose of the "mapping to local relation, filled as needed" 
fields in PGLogicalRelation if you are not going to perform such mapping?
Multimaster really needs to map local to remote OIDs. We do not need to 
provide any attribute mapping or handle catalog invalidations.
-- 
>  Craig Ringer http://www.2ndQuadrant.com/
>  PostgreSQL Development, 24x7 Support, Training & Services
-- 
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company