Re: Multimaster

From: Craig Ringer <craig(at)2ndquadrant(dot)com>
To: Konstantin Knizhnik <k(dot)knizhnik(at)postgrespro(dot)ru>
Cc: Simon Riggs <simon(at)2ndquadrant(dot)com>, Postgresql General <pgsql-general(at)postgresql(dot)org>
Subject: Re: Multimaster
Date: 2016-04-19 12:56:28
Message-ID: CAMsr+YEuW7HbCwBQzoQJuPaMh8i0O7VKvcNuoy-_tgTw_OJDiA@mail.gmail.com
Lists: pgsql-general

On 18 April 2016 at 16:28, Konstantin Knizhnik <k(dot)knizhnik(at)postgrespro(dot)ru>
wrote:

> I intend to make the same split in pglogical itself - a receiver and
> apply worker split. Though my intent is to have them communicate via a
> shared memory segment until/unless the apply worker gets too far behind and
> spills to disk.
>
>
> In the case of multimaster the "too far behind" scenario can never happen.
>

I disagree. In the case of tightly coupled synchronous multi-master it
can't happen, sure. But that's hardly the only case of multi-master out
there.

I expect you'll want the ability to weaken synchronous guarantees for some
commits anyway, like we have with physical replication's synchronous_commit
= remote_write, synchronous_commit = local, etc. In that case lag becomes
relevant again.
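
To illustrate, this is the kind of per-commit weakening we already have on
the physical side; something analogous would presumably be wanted for
logical MM too (the table here is just made up for the example):

  CREATE TABLE audit_log (msg text);

  BEGIN;
  -- don't wait for the synchronous standby for this commit only
  SET LOCAL synchronous_commit = local;
  INSERT INTO audit_log (msg) VALUES ('low-value write, ok to lose on failover');
  COMMIT;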

You might also want to be able to spool a big tx to temporary storage even
as you apply it, if you're running over a WAN or something. That way if you
crash during apply you don't have to transfer the data over the WAN again.
Like we do with physical replication, where we write the WAL to disk then
replay from disk.

I agree that spilling to disk isn't needed for the simplest cases of
synchronous logical MM. But it's far from useless.

> It seems to me that the pglogical plugin is now becoming too universal, trying
> to address a lot of different issues and play different roles.
>

I'm not convinced. They're all closely related, overlapping, and require
much of the same functionality. While some use cases don't need certain
pieces of functionality, they can still be _useful_. Asynchronous MM
replication doesn't need table mapping and transforms, for example ...
except that in reality lots of the flexibility offered by replication sets,
table mapping, etc is actually really handy in MM too.
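
As a rough sketch of what I mean, using something like pglogical's SQL-level
interface (the set and table names are invented), a node can keep high-churn
scratch tables out of replication entirely while still replicating the rest:

  SELECT pglogical.create_replication_set('reporting');
  SELECT pglogical.replication_set_add_table('reporting', 'public.orders');
  -- staging/scratch tables are simply never added to a set, so a subscriber
  -- to 'reporting' never sees them and they stay node-local

That kind of selectivity turns out to be just as handy in MM setups as in
single-master ones.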

We may well want to move much of that into core and have much thinner
plugins, but the direction Andres, Robert etc are talking about seems to be
more along the lines of a fully in-core logical replication subsystem.
It'll need to (eventually) meet all these sorts of needs.

Before you start cutting or assuming you need something very separate I
suggest taking a closer look at why each piece is there, whether there's
truly any significant performance impact, and whether it can be avoided
without just cutting out the functionality entirely.

> 1. Asynchronous replication (including georeplication) - this is actually
> BDR.
>

Well, BDR is asynchronous MM. There's also the single-master case and
related ones for non-overlapping multimaster where any given set of tables
is only written to on one node.

> 2. Logical backup: transfer data to a different database (including a new
> version of Postgres)
>

I think that's more HA than logical backup. Needs to be able to be
synchronous or asynchronous, much like our current phys.rep.

Closely related but not quite the same is logical read replicas/standbys.

> 3. Change notification: there are many different subscribers which can be
> interested in receiving notifications about database changes.
>

Yep. I suspect we'll want a json output plugin for this, separate to
pglogical etc, but we'll need to move a bunch of functionality from
pglogical into core so it can be shared rather than duplicated.
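
In the meantime a notification consumer can already poll a slot with the
in-core test_decoding plugin; a json plugin would mostly just change the
output format (the slot name below is arbitrary):

  SELECT * FROM pg_create_logical_replication_slot('notify_slot', 'test_decoding');

  -- read the pending changes without consuming them
  SELECT * FROM pg_logical_slot_peek_changes('notify_slot', NULL, NULL);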

> 4. Synchronous replication: multimaster
>

"Synchronous multimaster". Not all multimastrer is synchronous, not all
synchronous replication is multimaster.

> We are not enforcing the order of commits as Galera does. Consistency is
> enforced by the DTM, which ensures that transactions at all nodes are given
> consistent snapshots and assigned the same CSNs. We also have a global
> deadlock detection algorithm which builds a global lock graph (but false
> positives are still possible because this graph is built incrementally and
> so it doesn't correspond to some global snapshot).
>

OK, so you're relying on a GTM to determine safe, conflict-free apply
orderings.

I'm ... curious ... about how you do that. Do you have a global lock
manager too? How do you determine ordering for things that in a
single-master case are addressed via unique b-tree indexes, not (just)
heavyweight locking?

>
> Multimaster is just a particular (and the simplest) case of distributed
> transactions. What is specific to multimaster is that the same transaction
> has to be applied at all nodes and that selects can be executed at any node.
>

That's the specification of your symmetric, synchronous, tightly-coupled
multimaster design, yes. Which sounds like it's intended to be transparent
or near-transparent multi-master clustering.

>
>> The only exception is recovery of a multimaster node. In this case we have
>> to apply transactions in exactly the same order as they were applied at the
>> original node performing recovery. It is done by applying changes in
>> recovery mode by the pglogical_receiver itself.
>>
>
> I'm not sure I understand what you are saying here.
>
>
> Sorry for being unclear.
> I just said that normally transactions are applied concurrently by
> multiple workers and the DTM is used to enforce consistency.
> But in the case of recovery (when some node has crashed and then reconnects
> to the cluster), we perform recovery of this node sequentially, by a single
> worker. In this case the DTM is not used (because the other nodes are far
> ahead) and to restore the same state of the node we need to apply changes in
> exactly the same order as at the source node. In this case the content of
> the target (recovered) node should be the same as that of the source node.
>

OK, that makes perfect sense.

Presumably in this case you could save a local snapshot of the DTM's
knowledge of the correct apply ordering of those tx's as you apply, so when
you crash you can consult that saved ordering information and still
parallelize apply. That can come later, though.

>> We are now replicating DDL in a way similar to the one used in BDR: DDL
>> statements are inserted into a special table and are replayed at the
>> destination node as part of the transaction.
>>
>> We also have an alternative implementation, done by Artur Zakirov
>> <a(dot)zakirov(at)postgrespro(dot)ru>, which uses custom WAL records:
>> https://gitlab.postgrespro.ru/pgpro-dev/postgrespro/tree/logical_deparse
>> The patch for custom WAL records was committed in 9.6, so we are going to
>> switch to this approach.
>>
>
> How does that really improve anything over using a table?
>
>
> It is a more straightforward approach, isn't it? You can either try to
> restore DDL from the low-level sequence of updates to the system catalogue -
> but that is difficult and not always possible.
>

Understatement of the century ;)

> Or you need to somehow add the original DDL statements to the log.
>

Actually you need to be able to add normalized statements to the xlog. The
original DDL text isn't quite good enough due to issues with search_path
among other things. Hence DDL deparse.
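
A trivial example of the search_path problem, to show why deparse is needed
(schema and table names invented):

  CREATE SCHEMA app;
  -- on the origin this was run with a non-default search_path:
  SET search_path = app, public;
  CREATE TABLE customers (id integer PRIMARY KEY);
  -- replaying the raw text on a node with a different search_path can create
  -- the table in the wrong schema; deparse instead records something
  -- equivalent to the schema-qualified form:
  -- CREATE TABLE app.customers (id integer PRIMARY KEY);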

> I agree that custom WAL adds no performance or functionality advantages
> over using a table.
> This is why we still haven't switched to it. But IMHO the approach of
> inserting DDL (or any other user-defined information) into a special table
> looks like a hack.
>

Yeah, it is a hack. Logical WAL messages do provide a cleaner way to do it,
though with the minor downside that they're opaque to the user, who can't
see what DDL is being done / due to be done anymore. I'd rather do it with
generic logical WAL messages in future, now that they're in core.
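
With that in place the DDL capture hook can write the statement directly
into WAL instead of inserting it into a queue table; very roughly (the
prefix string here is just an example - it's whatever the output plugin
agrees to filter on):

  SELECT pg_logical_emit_message(
      true,       -- transactional: decoded and replayed with the commit
      'mm_ddl',   -- prefix the output plugin filters on
      'CREATE TABLE public.t (id integer PRIMARY KEY)');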

>> Also, the pglogical plugin now contains a lot of code which performs
>> mapping between source and target database schemas. So it is assumed that
>> they may be different.
>> But that is not true in the case of multimaster and I do not want to pay
>> extra cost for functionality we do not need.
>>
>
> All it's really doing is mapping upstream to downstream tables by name,
> since the oids will be different.
>
>
> Really?
> Why then do you send all table metadata (information about attributes) and
> handle invalidation messages?
>

Right, you meant columns, not tables.

See DESIGN.md.

We can't just use attno since column drops on one node will cause attno to
differ even if the user-visible table schema is the same.
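
A quick way to see the problem (hypothetical table): after a drop and
re-add, the attnos diverge from a node that created the final shape
directly.

  CREATE TABLE t (a integer, b integer);
  ALTER TABLE t DROP COLUMN a;
  ALTER TABLE t ADD COLUMN c integer;
  -- here b = attnum 2 and c = attnum 3, with attnum 1 left as a dropped
  -- placeholder; a node that just did CREATE TABLE t (b integer, c integer)
  -- has b = 1 and c = 2, even though the visible schemas match
  SELECT attname, attnum, attisdropped
  FROM pg_attribute
  WHERE attrelid = 't'::regclass AND attnum > 0;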

BDR solves this (now) by either initializing nodes from a physical
pg_basebackup of another node, including dropped cols etc, or using
pg_dump's binary upgrade mode to preserve dropped columns when bringing a
node up from a logical copy.

That's not viable for general purpose logical replication like pglogical,
so we send a table attribute mapping.

I agree that this can be avoided if the system can guarantee that the
upstream and downstream tables have exactly the same structure, including
dropped columns. It can only guarantee that when it has DDL replication
and all DDL is either replicated or blocked from being run. That's the
approach BDR tries to take, and it works - with problems. One of those
problems you won't have, because it's caused by the need to sync up the
otherwise asynchronous cluster so there are no outstanding
committed-but-not-replayed changes for the old table structure on any node
before we change the structure on all nodes. But others - coverage of
DDL replication, problems with full table rewrites, etc - you will have.

I think it would be reasonable for pglogical to offer the option of sending
a minimal table metadata message that simply says that it expects the
downstream to deal with the upstream attnos exactly as-is, either by having
them exactly the same or managing its own translations. In this case column
mapping etc can be omitted. Feel free to send a patch.

> Multimaster really needs to map local to remote OIDs. We do not need to
> provide any attribute mapping or handle catalog invalidations.
>

For synchronous tightly-coupled multi-master with a GTM and GLM that
doesn't allow non-replicated DDL, yes, I agree.

--
Craig Ringer http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
