Re: [RFC][PATCH] Logical Replication/BDR prototype and architecture

From: Andres Freund <andres(at)2ndquadrant(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: [RFC][PATCH] Logical Replication/BDR prototype and architecture
Date: 2012-06-20 10:15:55
Message-ID: 201206201215.55772.andres@2ndquadrant.com
Lists: pgsql-hackers

Hi Robert, Hi All!

On Wednesday, June 20, 2012 03:08:48 AM Robert Haas wrote:
> On Tue, Jun 19, 2012 at 2:23 PM, Andres Freund <andres(at)2ndquadrant(dot)com> wrote:
> >> Well, the words are fuzzy, but I would define logical replication to
> >> be something which is independent of the binary format in which stuff
> >> gets stored on disk. If it's not independent of the disk format, then
> >> you can't do heterogeneous replication (between versions, or between
> >> products). That precise limitation is the main thing that drives
> >> people to use anything other than SR in the first place, IME.
> >
> > Not in mine. The main limitation I see is that you cannot write anything
> > on the standby, which sucks majorly for many things. It's pretty much
> > impossible to "fix" that for SR outside of very limited cases.
> > While many scenarios don't need multimaster, *many* need to write outside
> > of the standby's replication set.
> Well, that's certainly a common problem, even if it's not IME the most
> common, but I don't think we need to argue about which one is more
> common, because I'm not arguing against it. The point, though, is
> that if the logical format is independent of the on-disk format, the
> things we can do are a strict superset of the things we can do if it
> isn't. I don't want to insist that catalogs be the same (or else you
> get garbage when you decode tuples). I want to tolerate the fact that
> they may very well be different. That will in no way preclude writing
> outside the standby's replication set, nor will it prevent
> multi-master replication. It will, however, enable heterogeneous
> replication, which is a very important use case. It will also mean
> that innocent mistakes (like somehow ending up with a column that is
> text on one server and numeric on another server) produce
> comprehensible error messages, rather than garbage.
I agree with most of that. I think some parts of the above need to be
optional, because otherwise you lose too much in other scenarios.
I *definitely* want to build the *infrastructure* which makes it easy to
implement all of the above, but I find it a bit much to require that from the
get-go. It's important that everything is reusable for that, yes. Does a
patchset that wants to implement tightly coupled multimaster need to implement
everything for that? No.
If we raise the barrier for anything around this topic that high we will *NEVER*
get anywhere. It's a huge topic, with loads of people wanting loads of different
things, and that will hurt people wanting some feature which matches 90% of
the proposed goals *far* more.

> > It's not only the logging side which is a limitation in today's replication
> > scenarios. The apply side scales even worse because it's *very* hard to
> > distribute it between multiple backends.

> I don't think that making LCR format = on-disk format is going to
> solve that problem. To solve that problem, we need to track
> dependencies between transactions, so that if tuple A is modified by
> T1 and T2, in that order, we apply T1 before T2. But if T3, which
> committed after both T1 and T2, touches none of the same data as T1
> or T2, then we can apply it in parallel, so long as we don't commit
> until T1 and T2 have committed (because allowing T3 to commit early
> would produce a serialization anomaly from the point of view of a
> concurrent reader).
Well, doing apply at such a low level, without re-encoding the data, increased
throughput nearly threefold even for trivial types. So it pushes off the point
where we need to do the above quite a bit.
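
For concreteness, though: the dependency tracking you describe might look
roughly like the sketch below. This is a made-up illustration, not code from
the patch; it assumes each reassembled transaction carries hashes of the
tuples it modified:

#include <stdbool.h>
#include <stdio.h>

/* Hypothetical sketch of commit-order-preserving parallel apply: a
 * transaction may be applied concurrently with earlier ones only if
 * their touched-tuple sets are disjoint; commits always happen in the
 * original commit order. */
typedef struct TxnInfo
{
    unsigned long xid;           /* transaction id, in commit order */
    const unsigned long *keys;   /* hashes of the tuples it modified */
    int nkeys;
} TxnInfo;

static bool
touches_same_data(const TxnInfo *a, const TxnInfo *b)
{
    for (int i = 0; i < a->nkeys; i++)
        for (int j = 0; j < b->nkeys; j++)
            if (a->keys[i] == b->keys[j])
                return true;
    return false;
}

/* May txns[n] start applying while txns[0..n-1] are still in flight?
 * Yes, unless it modifies a tuple one of them also modifies. Either
 * way its commit must wait for all earlier transactions, to avoid
 * serialization anomalies for concurrent readers. */
static bool
can_apply_in_parallel(const TxnInfo *txns, int n)
{
    for (int i = 0; i < n; i++)
        if (touches_same_data(&txns[i], &txns[n]))
            return false;
    return true;
}

int
main(void)
{
    unsigned long a[] = {0xA}, b[] = {0xB};
    TxnInfo txns[] = { {1, a, 1}, {2, a, 1}, {3, b, 1} };

    /* T2 modifies the same tuple as T1: must wait for T1. */
    printf("T2 parallel-safe: %d\n", can_apply_in_parallel(txns, 1));
    /* T3 touches unrelated data: apply in parallel, but hold its
     * commit until T1 and T2 have committed. */
    printf("T3 parallel-safe: %d\n", can_apply_in_parallel(txns, 2));
    return 0;
}

In your example above, T2 fails the check against T1 and has to wait, while
T3 passes and can be applied concurrently, with its commit still held back
until T1 and T2 have committed.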

> >> Because the routines that decode tuples don't include enough sanity
> >> checks to prevent running off the end of the block, or even the end of
> >> memory completely. Consider a corrupt TOAST pointer that indicates
> >> that there is a gigabyte of data stored in an 8kB block. One of the
> >> common symptoms of corruption IME is TOAST requests for -3 bytes of
> >> memory.
> > Yes, but we need to put safeguards against that sort of thing anyway. So
> > sure, we can have bugs but this is not a fundamental limitation.
> There's a reason we haven't done that already, though: it's probably
> going to stink for performance. If it turns out that it doesn't stink
> for performance, great. But if it causes a 5% slowdown on common use
> cases, I suspect we're not gonna do it, and I bet I can construct a
> case where it's worse than that (think: 400 column table with lots of
> varlenas, sorting by column 400 to return column 399). I think it's
> treading on dangerous ground to assume we're going to be able to "just
> go fix" this.
I am talking about ensuring that the catalog is the same on the decoding site,
not about making all decoding totally safe in the face of corrupted
information.

> > PostGIS uses one information table in a few more complex functions, but
> > not in anything low-level, evidenced by the fact that it was totally
> > normal for that to go out of sync before 2.0.
> >
> > But even if such a thing would be needed, it wouldn't be problematic to
> > make extension configuration tables be replicated as well.
> Ugh. That's a hack on top of a hack. Now it all works great if type
> X is installed as an extension but if it isn't installed as an
> extension then the world blows up.
Then introduce a storage attribute (or something similar) which conveys the
same information.

> > I have played with several ideas:
> >
> > 1.)
> > keep the decoding catalog in sync with command/event triggers, correctly
> > replicating oids. If those log into some internal event table, it's easy to
> > keep the catalog in a correct transactional state, because the events
> > from that table get decoded in the transaction and replayed at exactly
> > the right spot *after* it has been reassembled. The locking on
> > the generating side takes care of the concurrency aspects.
> I am not following this one completely.
If (and yes, that's a somewhat big if) we had event triggers which could
reconstruct equivalent DDL statements, with some additions to preserve oids,
you could keep a second catalog in sync. That catalog can be part of a full
database or just a decoding instance.
If those event triggers log into some system table, the WAL entries of those
INSERTs will be at exactly the right point in the WAL stream *and* at the
right point in the transaction, so the change is applied exactly when you
decode (or apply) the WAL contents after reassembling transactions.
That makes most (all?) of the syscache/snapshot catalog consistency problems
go away.
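
To illustrate the replay side of that, a rough sketch; every name in here is
made up:

#include <stdint.h>
#include <stdio.h>

/* Hypothetical decoded change: for ordinary relations, data holds the
 * decoded tuple; for the ddl log table, it holds the logged DDL text. */
typedef struct Change
{
    uint32_t relid;      /* relation the change applies to */
    const char *data;
} Change;

#define DDL_LOG_RELID 9999   /* made-up oid of the event/ddl log table */

/* Stub: would update the decoding instance's catalog. */
static void
apply_ddl_to_decoding_catalog(const char *ddl)
{
    printf("catalog change: %s\n", ddl);
}

/* Stub: would decode/apply an ordinary row change. */
static void
apply_row_change(const Change *c)
{
    printf("row change on %u: %s\n", c->relid, c->data);
}

/* Replay one reassembled transaction in order: a logged DDL statement
 * takes effect exactly where it appeared, so later row changes in the
 * same transaction are decoded against the new table definitions. */
static void
replay_transaction(const Change *changes, int nchanges)
{
    for (int i = 0; i < nchanges; i++)
    {
        if (changes[i].relid == DDL_LOG_RELID)
            apply_ddl_to_decoding_catalog(changes[i].data);
        else
            apply_row_change(&changes[i]);
    }
}

int
main(void)
{
    Change txn[] = {
        { 16385, "INSERT (1, 'foo')" },
        { DDL_LOG_RELID, "ALTER TABLE t ADD COLUMN b int" },
        { 16385, "INSERT (2, 'bar', 42)" },
    };
    /* The first INSERT decodes against the old definition of t, the
     * second against the new one, because the DDL sits between them. */
    replay_transaction(txn, 3);
    return 0;
}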

> > 2.)
> > Keep the decoding site up to date by replicating the catalog via normal
> > recovery mechanisms
> This surely seems better than #1, since it won't do amazingly weird
> things if the user bypasses the event triggers.
It has the disadvantage that it cannot be used to keep tightly coupled
instances in sync without a proxy instance in between.

Why should the user be able to bypass event triggers? If we design event
triggers to be bypassable by anything but explicit actions from a superuser,
we have made grave design errors.
Sure, a superuser can screw that up, but then he already has *loads* of ways
to corrupt an instance.

> > 3.)
> > Fully versioned catalog
> One possible way of doing this would be to have the LCR generator run
> on the primary, but hold back RecentGlobalXmin until it's captured the
> information that it needs. It seems like as long as tuples can't get
> pruned, the information you need must still be there, as long as you
> can figure out which snapshot you need to read it under. But since
> you know the commit ordering, it seems like you ought to be able to
> figure out what SnapshotNow would have looked like at any given point
> in the WAL stream. So you could, at that point in the WAL stream,
> read the master's catalogs under what we might call SnapshotThen.
Yes, I considered it before but never got comfortable enough with the idea.
Sounds like it would involve some trickery, but it might be possible. I am
happy to go that route if people tentatively agree that the resulting
ugliness/intricate code is acceptable. Sounds like fun.
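
To make sure we mean the same thing, a very rough sketch of such a
"SnapshotThen" visibility check; the names are made up and this ignores
subtransactions and all the actually hard parts:

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

typedef uint32_t Xid;

/* Snapshot as of a past point in the WAL stream, reconstructed from
 * the known commit ordering up to that point. */
typedef struct ThenSnapshot
{
    Xid xmax;               /* first xid not yet started at that point */
    const Xid *committed;   /* sorted xids known committed before then */
    int ncommitted;
} ThenSnapshot;

static bool
xid_committed_then(const ThenSnapshot *s, Xid xid)
{
    int lo = 0, hi = s->ncommitted;

    if (xid >= s->xmax)
        return false;       /* started after the point of interest */
    while (lo < hi)         /* binary search the commit list */
    {
        int mid = (lo + hi) / 2;

        if (s->committed[mid] < xid)
            lo = mid + 1;
        else if (s->committed[mid] > xid)
            hi = mid;
        else
            return true;
    }
    return false;
}

/* A catalog tuple is visible under SnapshotThen if its inserting
 * transaction had committed by then and its deleting one had not. */
static bool
tuple_visible_then(const ThenSnapshot *s, Xid xmin, Xid xmax_del)
{
    if (!xid_committed_then(s, xmin))
        return false;
    if (xmax_del != 0 && xid_committed_then(s, xmax_del))
        return false;
    return true;
}

int
main(void)
{
    Xid committed[] = {100, 103};
    ThenSnapshot s = {105, committed, 2};

    /* Inserted by 100, deleted by 104 (not yet committed at that
     * point): still visible under SnapshotThen. */
    printf("%d\n", tuple_visible_then(&s, 100, 104));
    return 0;
}

The hard part is of course producing the commit list and xmax for an
arbitrary point in the WAL stream, plus holding back RecentGlobalXmin so the
tuples are still there; the check itself is the easy bit.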

> > 4.)
> > Log enough information in the WAL stream to make decoding possible using
> > only the WAL stream.
> >
> > Advantages:
> > * Decoding can optionally be done on the master
> > * No catalog syncing/access required
> > * it's possible to make this architecture-independent
> >
> > Disadvantages:
> > * high to very high implementation overhead depending on efficiency aims
> > * high space overhead in the WAL because at least all the catalog
> > information needs to be logged in a transactional manner repeatedly
> > * misuses WAL far more than other methods
> > * significant new complexity in somewhat critical code paths (heapam.c)
> > * insanely high space overhead if the decoding should be possible in an
> > architecture-independent manner
>
> I'm not really convinced that the WAL overhead has to be that much
> with this method. Most of the information you need about the catalogs
> only needs to be logged when it changes, or once per checkpoint cycle,
> or once per transaction, or once per transaction per checkpoint cycle.
> I will concede that it looks somewhat complex, but I am not convinced
> that it's undoable.
I am not saying it's impossible to achieve only moderate space overhead, but I
have a hard time believing it's possible to do this in a manner that's
realistically implementable *and* acceptable to Tom.
I think I am more worried about the complexities introduced than the space
overhead...

> > 5.)
> > The actually good idea. Yours?
> Hey, look, an elephant!
One can dream...

Andres
--
Andres Freund                     http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
