Re: Replication identifiers, take 3

From:	Andres Freund <andres(at)2ndquadrant(dot)com>
To:	Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
Cc:	Robert Haas <robertmhaas(at)gmail(dot)com>, Steve Singer <steve(at)ssinger(dot)info>, Petr Jelinek <petr(at)2ndquadrant(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Replication identifiers, take 3
Date:	2014-10-02 09:30:06
Message-ID:	20141002093006.GG7158@awork2.anarazel.de
Lists:	pgsql-hackers

On 2014-10-02 11:49:31 +0300, Heikki Linnakangas wrote:
> On 09/23/2014 09:24 PM, Andres Freund wrote:
> >I've previously started two threads about replication identifiers. Check
> >http://archives.postgresql.org/message-id/20131114172632.GE7522%40alap2.anarazel.de
> >and
> >http://archives.postgresql.org/message-id/20131211153833.GB25227%40awork2.anarazel.de
> >.
> >
> >They've also been discussed in the course of another thread:
> >http://archives.postgresql.org/message-id/20140617165011.GA3115%40awork2.anarazel.de
>
> And even earlier here:
> http://www.postgresql.org/message-id/flat/1339586927-13156-10-git-send-email-andres(at)2ndquadrant(dot)com#1339586927-13156-10-git-send-email-andres@2ndquadrant.com
> The thread branched a lot, the relevant branch is the one with subject
> "[PATCH 10/16] Introduce the concept that wal has a 'origin' node"

Right. Long time ago already ;)

> >== Identify the origin of changes ==
> >
> >Say you're building a replication solution that allows two nodes to
> >insert into the same table on two nodes. Ignoring conflict resolution
> >and similar fun, one needs to prevent the same change being replayed
> >over and over. In logical replication the changes to the heap have to
> >be WAL logged, and thus the *replay* of changes from a remote node
> >produces WAL which then will be decoded again.
> >
> >To avoid that it's very useful to tag individual changes/transactions
> >with their 'origin'. I.e. mark changes that have been directly
> >triggered by the user sending SQL as originating 'locally' and changes
> >originating from replaying another node's changes as originating
> >somewhere else.
> >
> >If that origin is exposed to logical decoding output plugins they can
> >easily check whether to stream out the changes/transactions or not.
> >
> >
> >It is possible to do this by adding extra columns to every table and
> >storing the origin of a row in there, but that a) permanently needs
> >storage and b) makes things much more invasive.
>
> An origin column in the table itself helps tremendously to debug issues with
> the replication system. In many if not most scenarios, I think you'd want to
> have that extra column, even if it's not strictly required.

I don't think you'll have much success convincing actual customers of
that. It's one thing to increase the size of the WAL stream a bit, it's
entirely different to persistently increase the table size of all their
tables.

> >What I've previously suggested (and which works well in BDR) is to add
> >the internal id to the XLogRecord struct. There's 2 free bytes of
> >padding that can be used for that purpose.
>
> Adding a field to XLogRecord for this feels wrong. This is for *logical*
> replication - why do you need to mess with something as physical as the WAL
> record format?

XLogRecord isn't all that "physical". It doesn't encode anything in that
regard but the fact that there's backup blocks in the record. It's
essentially just an implementation detail of logging. Whether that's
physical or logical doesn't really matter much.

There are basically three primary reasons I think it's a good idea to
add it there:

a) There are many different types of records where it's useful to add
the origin. Adding the information to all of these will make things
more complicated, use more space, and be more fragile. And I'm pretty
sure that the number of things people will want to expose over
logical replication will increase.

I know of at least two things that have at least some working code:
Exposing 2PC to logical decoding to allow optionally synchronous
replication, and allowing transactional/nontransactional 'messages'
to be sent via the WAL without writing to a table.

Now, we could add a framework to attach general information to every
record - but I have a very hard time seeing how this will be of
comparable complexity *and* efficiency.

b) It's dead simple with a pretty darn low cost. Both from a runtime as
well as a maintenance perspective.

c) There needs to be crash recovery integration anyway to compute the
state of how far replication succeeded before crashing. So it's not
like we could make this completely extensible without core code
knowing.
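
To make the "2 free bytes of padding" point concrete, here is a minimal
sketch. The two structs are simplified stand-ins written for this
illustration, not PostgreSQL's actual XLogRecord definition, and every
name in them (SketchHeaderPlain, SketchHeaderWithOrigin, origin_id,
sketch_origin_overhead) is hypothetical. It assumes a common ABI where
a uint64_t member is aligned to at least 4 bytes, so that two bytes of
padding follow the two single-byte fields.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical record header WITHOUT an origin field. */
typedef struct SketchHeaderPlain
{
    uint32_t tot_len;   /* total record length */
    uint32_t xid;       /* transaction id */
    uint32_t len;       /* length of the main data */
    uint8_t  info;      /* flag bits */
    uint8_t  rmid;      /* resource manager id */
    /* on common ABIs: 2 bytes of alignment padding sit here */
    uint64_t prev;      /* pointer to the previous record */
    uint32_t crc;       /* checksum */
} SketchHeaderPlain;

/* Same header, with the padding bytes given a name and a purpose. */
typedef struct SketchHeaderWithOrigin
{
    uint32_t tot_len;
    uint32_t xid;
    uint32_t len;
    uint8_t  info;
    uint8_t  rmid;
    uint16_t origin_id; /* occupies the former padding bytes */
    uint64_t prev;
    uint32_t crc;
} SketchHeaderWithOrigin;

/* Returns how many bytes the origin field adds to every record header. */
static size_t
sketch_origin_overhead(void)
{
    /* The fields after origin_id must not have moved. */
    assert(offsetof(SketchHeaderWithOrigin, prev) ==
           offsetof(SketchHeaderPlain, prev));
    return sizeof(SketchHeaderWithOrigin) - sizeof(SketchHeaderPlain);
}
```

Under the stated alignment assumption the function returns 0: the
header does not grow, which is the "pretty darn low cost" argument in
b).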

> And who's to say that a node ID is the most useful piece of information for
> a replication system to add to the WAL header. I can easily imagine that
> you'd want to put a changeset ID or something else in there, instead. (I
> mentioned another example of this in
> http://www.postgresql.org/message-id/4FE17043.60403@enterprisedb.com)

I'm on board with adding an extensible facility to attach data to
successful transactions. There've been at least two people asking me
directly about how to e.g. attach user information to transactions.

I don't think that's equivalent to what I'm talking about here
though. One important thing about this proposal is that it allows
skipping nearly all records (the exception being cache invalidations)
with an uninteresting origin id *before* decoding them, without having
to keep any per-transaction state about 'uninteresting' transactions.
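
As a rough sketch of that filtering idea, and not the actual decoding
code: assume each record header carries a 16-bit origin id and a flag
saying whether the record is a cache invalidation. The struct, the
function names and the convention that origin 0 means "local" are all
assumptions made up for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumption for this sketch: origin id 0 marks locally made changes. */
#define SKETCH_LOCAL_ORIGIN 0

/* Hypothetical, heavily simplified record header. */
typedef struct SketchRecord
{
    uint32_t total_len;       /* record length, unused by the filter */
    uint16_t origin_id;       /* node the change originally came from */
    int      is_cache_inval;  /* nonzero for cache invalidation records */
} SketchRecord;

/*
 * Decide from the header alone whether a record has to be decoded.
 * Cache invalidations are always processed; everything else only if it
 * carries the origin we are interested in. No per-transaction state is
 * consulted or kept.
 */
static int
sketch_needs_decoding(const SketchRecord *rec, uint16_t wanted_origin)
{
    assert(rec != NULL);
    if (rec->is_cache_inval)
        return 1;
    return rec->origin_id == wanted_origin;
}

/* Count how many of n records would actually reach the decoder. */
static size_t
sketch_count_decoded(const SketchRecord *recs, size_t n,
                     uint16_t wanted_origin)
{
    size_t decoded = 0;

    for (size_t i = 0; i < n; i++)
    {
        if (sketch_needs_decoding(&recs[i], wanted_origin))
            decoded++;
    }
    return decoded;
}
```

The point of the sketch is that the decision uses nothing but fields in
the record header, so records replayed from another node can be dropped
before any of their contents are looked at.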

> If we need additional information added to WAL records, for extensions, then
> that should be made in an extensible fashion

I can see how we'd do that for individual records (e.g. the various
commit records, after combining them), but I have a hard time seeing
the cost of doing that for all records being worth it. Especially as it
seems likely to require significant increases in WAL volume?

> IIRC (I couldn't find a link
> right now), when we discussed the changes to heap_insert et al for
> wal_level=logical, I already argued back then that we should make it
> possible for extensions to annotate WAL records, with things like "this is
> the primary key", or whatever information is needed for conflict resolution,
> or handling loops. I don't like it that we're adding little pieces of
> information to the WAL format, bit by bit.

I don't think this is "adding little pieces of information to the WAL
format, bit by bit". It's a relatively central piece for allowing
efficient and maintainable logical replication.

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
