Re: Multimaster

From: Craig Ringer <craig(at)2ndquadrant(dot)com>
To: konstantin knizhnik <k(dot)knizhnik(at)postgrespro(dot)ru>
Cc: Simon Riggs <simon(at)2ndquadrant(dot)com>, Postgresql General <pgsql-general(at)postgresql(dot)org>
Subject: Re: Multimaster
Date: 2016-04-17 12:30:42
Message-ID: CAMsr+YFH6e540wniVOutbEdxewkt8AswwAyWmk=kQ27iMRwJyQ@mail.gmail.com
Lists: pgsql-general

On 14 April 2016 at 17:14, konstantin knizhnik <k(dot)knizhnik(at)postgrespro(dot)ru>
wrote:

>
> On Apr 14, 2016, at 8:41 AM, Craig Ringer wrote:
>
> On 1 April 2016 at 19:50, Konstantin Knizhnik <k(dot)knizhnik(at)postgrespro(dot)ru>
> wrote:
>
> Right now the main problem is parallel apply: we need to apply changes
>> concurrently to avoid unintended dependencies causing deadlocks and provide
>> reasonable performance.
>>
>
> How do you intend to approach that?
>
>
> Actually we already have a working implementation of multimaster...
> There is a pool of pglogical executors. pglogical_receiver just reads the
> transaction body from the connection and appends it to a ready-for-execution queue.
>

I intend to make the same split in pglogical itself - a receiver and
apply worker split - though my intent is to have them communicate via a
shared memory segment until/unless the apply worker gets too far behind
and we have to spill to disk.
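
Roughly like this - just a sketch, the struct and function names are made
up, though shm_mq and BufFile are the real APIs I'd expect to use:

/* Sketch only - hypothetical names, not actual pglogical code. */
#include "postgres.h"
#include "storage/buffile.h"
#include "storage/shm_mq.h"

/* One queue per apply worker, attached at worker startup. */
typedef struct ApplyQueue
{
    shm_mq_handle *mqh;        /* shared memory queue to the apply worker */
    BufFile       *spill_file; /* non-NULL once we have started spilling  */
} ApplyQueue;

/*
 * Called by the receiver for each decoded message. Try the shared memory
 * queue first; if the apply worker is too far behind (queue full), fall
 * back to spilling the message to a temp file the worker drains later.
 */
static void
queue_message(ApplyQueue *q, const char *data, Size len)
{
    if (q->spill_file == NULL)
    {
        shm_mq_result res = shm_mq_send(q->mqh, len, data, true /* nowait */ );

        if (res == SHM_MQ_SUCCESS)
            return;
        if (res == SHM_MQ_DETACHED)
            elog(ERROR, "apply worker exited unexpectedly");

        /* SHM_MQ_WOULD_BLOCK: the worker is behind, start spilling */
        q->spill_file = BufFileCreateTemp(false);
    }

    BufFileWrite(q->spill_file, (void *) &len, sizeof(len));
    BufFileWrite(q->spill_file, (void *) data, len);
}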

> Any vacant worker from this pool can dequeue this work and process it.
>

How do you handle correctness of ordering though? A naïve approach will
suffer from a variety of anomalies when subject to insert/delete/insert
write patterns, among other things. You can also get lost updates, rows
deleted upstream that don't get deleted downstream and various other
exciting ordering issues.

At absolute minimum you'd have to commit on the downstream in the same
commit order as the upstream. This can deadlock. So when you get a
deadlock you'd abort the xacts of the deadlocked worker and all xacts with
later commit timestamps, then retry the lot.
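
In rough pseudo-code - every helper here is hypothetical, it's only meant
to illustrate the ordering constraint and the retry:

#include "postgres.h"
#include "utils/timestamp.h"

/* Illustration only - all of these names are made up. */
typedef struct RemoteXact RemoteXact;

extern TimestampTz remote_commit_time(RemoteXact *rx);
extern void begin_apply_transaction(void);
extern void apply_changes(RemoteXact *rx);
extern void wait_for_commit_turn(TimestampTz commit_time);
extern bool try_commit_apply_transaction(void);    /* false on deadlock */
extern void abort_apply_transaction(void);
extern void requeue_xacts_after(TimestampTz commit_time);

static void
apply_in_commit_order(RemoteXact *rx)
{
    for (;;)
    {
        begin_apply_transaction();
        apply_changes(rx);          /* replay the decoded row changes */

        /*
         * Must not commit before every xact with an earlier upstream
         * commit timestamp has committed, or downstream visibility order
         * diverges from the upstream.
         */
        wait_for_commit_turn(remote_commit_time(rx));

        if (try_commit_apply_transaction())
            return;

        /*
         * Deadlock: abort this xact, abort and requeue every xact with a
         * later upstream commit timestamp, then retry the lot in order.
         */
        abort_apply_transaction();
        requeue_xacts_after(remote_commit_time(rx));
    }
}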

BDR has enough trouble with this when applying transactions from multiple
peer nodes. To a degree it just throws its hands up and gives up - in
particular, it can't tell the difference between an insert/update conflict
and an update/delete conflict. But that's between loosely coupled nodes
where we explicitly document that some kinds of anomalies are permitted. I
can't imagine it being OK to have an even more complex set of possible
anomalies occur when simply replaying transactions from a single peer...

> It is certainly possible with this approach that the order of applying
> transactions may not be the same at different nodes.
>

Well, it can produce downright wrong results, and the results even in a
single-master case will be all over the place.

> But it is not a problem if we have a DTM.
>

How does that follow?

> The only exception is recovery of a multimaster node. In this case we have
> to apply transactions in exactly the same order as they were applied at the
> original node performing recovery. It is done by applying changes in
> recovery mode by pglogical_receiver itself.
>

I'm not sure I understand what you are saying here.

> We also need 2PC support but this code was sent to you by Stas, so I hope
>> that sometime it will be included in PostgreSQL core and pglogical plugin.
>>
>
> I never got a response to my suggestion that testing of upstream DDL is
> needed for that. I want to see more on how you plan to handle DDL on the
> upstream side that changes the table structure and acquires strong locks.
> Especially when it's combined with row changes in the same prepared xacts.
>
>
> We are now replicating DDL in a way similar to the one used in BDR: DDL
> statements are inserted into a special table and are replayed at the
> destination node as part of the transaction.
>
> We also have an alternative implementation, done by Artur Zakirov <
> a(dot)zakirov(at)postgrespro(dot)ru>, which uses custom WAL records:
> https://gitlab.postgrespro.ru/pgpro-dev/postgrespro/tree/logical_deparse
> The patch for custom WAL records was committed in 9.6, so we are going to
> switch to this approach.
>

How does that really improve anything over using a table?

This doesn't address what I asked above though, which is whether you have
tried doing ALTER TABLE in a 2PC xact with your 2PC replication patch,
especially one that also makes row changes.
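
I mean the trivial case, something like this (table and GID names made up;
needs max_prepared_transactions > 0 on both ends):

/* Mixing DDL and row changes in one prepared xact - names are examples. */
#include <stdio.h>
#include <stdlib.h>
#include <libpq-fe.h>

static void
run(PGconn *conn, const char *sql)
{
    PGresult   *res = PQexec(conn, sql);

    if (PQresultStatus(res) != PGRES_COMMAND_OK)
    {
        fprintf(stderr, "\"%s\" failed: %s", sql, PQerrorMessage(conn));
        PQclear(res);
        PQfinish(conn);
        exit(1);
    }
    PQclear(res);
}

int
main(void)
{
    PGconn     *conn = PQconnectdb("dbname=postgres");

    if (PQstatus(conn) != CONNECTION_OK)
    {
        fprintf(stderr, "connection failed: %s", PQerrorMessage(conn));
        return 1;
    }

    run(conn, "CREATE TABLE IF NOT EXISTS some_table (id integer)");
    run(conn, "BEGIN");
    run(conn, "ALTER TABLE some_table ADD COLUMN extra_col integer");
    run(conn, "INSERT INTO some_table (id, extra_col) VALUES (1, 42)");
    run(conn, "PREPARE TRANSACTION 'ddl_plus_rows'");
    /* What does the downstream do with that, before COMMIT PREPARED? */
    run(conn, "COMMIT PREPARED 'ddl_plus_rows'");

    PQfinish(conn);
    return 0;
}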

> Well, recently I made an attempt to merge our code with the latest
> version of the pglogical plugin (because our original implementation of
> multimaster was based on code partly taken from BDR) but finally had to
> postpone most of the changes. My primary intention was to support metadata
> caching. But the presence of multiple apply workers makes it impossible to
> implement it in the same way as it is done in the pglogical plugin.
>

Not with a simplistic implementation of multiple workers that just
round-robin process transactions, no. Your receiver will have to be smart
enough to read the protocol stream and write the metadata changes to a
separate stream all the workers read. Which is awkward.

I think you'll probably need your receiver to act as a metadata broker for
the apply workers in the end.
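
Something like this on the receiver side - again hypothetical, building on
the earlier sketch:

/*
 * Sketch only - MsgKind is made up; ApplyQueue/queue_message as in the
 * earlier sketch. The receiver routes relation metadata to every worker
 * and row changes only to the worker the xact is assigned to.
 */
typedef enum MsgKind
{
    MSG_RELATION_METADATA,      /* relation/attribute descriptions */
    MSG_CHANGE                  /* insert/update/delete/commit payload */
} MsgKind;

static void
receiver_dispatch(MsgKind kind, const char *data, Size len,
                  ApplyQueue *queues, int nworkers, int assigned_worker)
{
    int         i;

    if (kind == MSG_RELATION_METADATA)
    {
        /* Every worker keeps its own metadata cache, so broadcast. */
        for (i = 0; i < nworkers; i++)
            queue_message(&queues[i], data, len);
    }
    else
    {
        /* All changes for a given xact go to the same worker. */
        queue_message(&queues[assigned_worker], data, len);
    }
}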

> Also, the pglogical plugin now contains a lot of code which performs
> mapping between source and target database schemas, so it is assumed that
> they may be different.
> But that is not true in the case of multimaster, and I do not want to pay
> extra cost for functionality we do not need.
>

All it's really doing is mapping upstream to downstream tables by name,
since the oids will be different.
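
Per relation it boils down to little more than this (in terms of core
APIs; the wrapper name is made up):

/* Resolve the upstream's (schema, relname) to a local relation by name. */
#include "postgres.h"
#include "catalog/namespace.h"
#include "nodes/makefuncs.h"
#include "storage/lockdefs.h"

static Oid
lookup_local_relation(const char *nspname, const char *relname)
{
    RangeVar   *rv = makeRangeVar((char *) nspname, (char *) relname, -1);

    /* missing_ok = false: error out if the table doesn't exist locally */
    return RangeVarGetRelid(rv, RowExclusiveLock, false);
}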

Are you attempting to force table oids to be the same on all nodes, so you
can rely on direct 1:1 table oid mappings? 'cos that seems fragile...

> We can try to prepare our "wish list" for pglogical plugin.
>

That would be useful.

--
Craig Ringer http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
