From: Andres Freund <andres(at)2ndquadrant(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: [RFC][PATCH] Logical Replication/BDR prototype and architecture
Date: 2012-06-19 18:23:20
Message-ID: 201206192023.20589.andres@2ndquadrant.com
Lists: pgsql-hackers
Hi,
The most important part, even for people not following my discussion with
Robert, is at the bottom, where the possible WAL decoding strategies are laid
out.
On Tuesday, June 19, 2012 03:20:58 AM Robert Haas wrote:
> On Sat, Jun 16, 2012 at 7:43 AM, Andres Freund <andres(at)2ndquadrant(dot)com>
wrote:
> >> > Hm. Yes, you could do that. But I have to say I don't really see a
> >> > point. Maybe the fact that I do envision multimaster systems at some
> >> > point is clouding my judgement though as its far less easy in that
> >> > case.
> >>
> >> Why? I don't think that particularly changes anything.
> >
> > Because it makes conflict detection very hard. I also don't think its a
> > feature worth supporting. Whats the use-case of updating records you
> > cannot properly identify?
> Don't ask me; I just work here. I think it's something that some
> people want, though. I mean, if you don't support replicating a table
> without a primary key, then you can't even run pgbench in a
> replication environment.
Well, I have no problem with INSERT-only tables not having a PK. And
pgbench_history is the only pgbench table that doesn't have a pkey? And that's
only truncated...
> Is that an important workload? Well,
> objectively, no. But I guarantee you that other people with more
> realistic workloads than that will complain if we don't have it.
> Absolutely required on day one? Probably not. Completely useless
> appendage that no one wants? Not that, either.
Maybe. And I don't really care, so if others see that as important I am happy
to appease them ;).
> >> In my view, a logical replication solution is precisely one in which
> >> the catalogs don't need to be in sync. If the catalogs have to be in
> >> sync, it's not logical replication. ISTM that what you're talking
> >> about is sort of a hybrid between physical replication (pages) and
> >> logical replication (tuples) - you want to ship around raw binary
> >> tuple data, but not entire pages.
> > Ok, thats a valid point. Simon argued at the cluster summit that
> > everything thats not physical is logical. Which has some appeal because
> > it seems hard to agree what exactly logical rep is. So definition by
> > exclusion makes kind of sense ;)
> Well, the words are fuzzy, but I would define logical replication to
> be something which is independent of the binary format in which stuff
> gets stored on disk. If it's not independent of the disk format, then
> you can't do heterogenous replication (between versions, or between
> products). That precise limitation is the main thing that drives
> people to use anything other than SR in the first place, IME.
Not in mine. The main limitation I see is that you cannot write anything on
the standby. Which sucks majorly for many things. It's pretty much impossible
to "fix" that for SR outside of very limited cases.
While many scenarios don't need multimaster, *many* need to write outside of
the standby's replication set.
> > I think what you categorized as "hybrid logical/physical" rep solves an
> > important use-case thats very hard to solve at the moment. Before my
> > 2ndquadrant days I had several clients which had huge problems using the
> > trigger-based solutions because their overhead simply was too big a
> > burden on the master. They couldn't use SR either because every
> > consuming database kept loads of local data.
> > I think such scenarios are getting more and more common.
> I think this is to some extent true, but I also think you're
> conflating two different things. Change extraction via triggers
> introduces overhead that can be eliminated by reconstructing tuples
> from WAL in the background rather than forcing them to be inserted
> into a shadow table (and re-WAL-logged!) in the foreground. I will
> grant that shipping the tuple as a binary blob rather than as text
> eliminates additional overhead on both ends, but it also closes off a
> lot of important use cases. As I noted in my previous email, I think
> that ought to be a performance optimization that we do, if at all,
> when it's been proven safe, not a baked-in part of the design. Even a
> solution that decodes WAL to text tuples and ships those around and
> reinserts them via pure SQL should be significantly faster than the
> replication solutions we have today; if it isn't, something's wrong.
It's not only the logging side which is a limitation in today's replication
scenarios. The apply side scales even worse because it's *very* hard to
distribute it between multiple backends.
> >> The problem with that is it's going to be tough to make robust. Users
> >> could easily end up with answers that are total nonsense, or probably
> >> even crash the server.
> > Why?
> Because the routines that decode tuples don't include enough sanity
> checks to prevent running off the end of the block, or even the end of
> memory completely. Consider a corrupt TOAST pointer that indicates
> that there is a gigabyte of data stored in an 8kB block. One of the
> common symptoms of corruption IME is TOAST requests for -3 bytes of
> memory.
Yes, but we need to put safeguards against that sort of thing anyway. So sure,
we can have bugs but this is not a fundamental limitation.
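To make that a bit more concrete, the kind of per-attribute safeguard I have
in mind while walking a tuple looks roughly like this (just a sketch to
illustrate the idea, not code from the patch; check_tuple_attribute() is a
made-up helper):

#include "postgres.h"
#include "catalog/pg_attribute.h"

/*
 * Sketch of a defensive check per attribute while decoding a heap tuple:
 * refuse to follow obviously broken length words instead of reading past
 * the end of the record or asking for absurd amounts of memory.
 */
static bool
check_tuple_attribute(Form_pg_attribute attr, char *data, Size remaining)
{
	if (attr->attlen > 0)
		return (Size) attr->attlen <= remaining;	/* fixed-width types */

	if (attr->attlen == -1)		/* varlena, possibly toasted */
	{
		Size		len;

		if (remaining < 1)
			return false;
		if (!VARATT_IS_1B(data) && remaining < VARHDRSZ)
			return false;		/* a 4-byte header doesn't even fit */
		len = VARSIZE_ANY(data);
		return len >= 1 && len <= remaining;
	}

	/* attlen == -2: cstring, must be terminated within the record */
	return memchr(data, '\0', remaining) != NULL;
}

Anything failing such a check would error out cleanly instead of crashing.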
> And, of course, even if you could avoid crashing, interpreting what
> was originally intended as a series of int4s as a varlena isn't likely
> to produce anything terribly meaningful. Tuple data isn't
> self-identifying; that's why this is such a hard problem.
Yes, sure.
> >> To step back and talk about DDL more generally, you've mentioned a few
> >> times the idea of using an SR instance that has been filtered down to
> >> just the system catalogs as a means of generating logical change
> >> records. However, as things stand today, there's no reason to suppose
> >> that replicating anything less than the entire cluster is sufficient.
> >> For example, you can't translate enum labels to strings without access
> >> to the pg_enum catalog, which would be there, because enums are
> >> built-in types. But someone could supply a similar user-defined type
> >> that uses a user-defined table to do those lookups, and now you've got
> >> a problem. I think this is a contractual problem, not a technical
> >> one. From the point of view of logical replication, it would be nice
> >> if type output functions were basically guaranteed to look at nothing
> >> but the datum they get passed as an argument, or at the very least
> >> nothing other than the system catalogs, but there is no such
> >> guarantee. And, without such a guarantee, I don't believe that we can
> >> create a high-performance, robust, in-core replication solution.
> >
> > I don't think thats a valid argument. Any such solution existing today
> > fails to work properly with dump/restore and such because it implies
> > dependencies that they do not know about. The "internal" tables will
> > possibly be restored later than the tables using them and such. So
> > your data format *has* to deal with loading/outputting data without such
> > anyway.
> Do you know for certain that PostGIS doesn't do anything of this type?
> Or what about something like an SE-Linux label cache, where we might
> arrange to create labels as they are used and associate them with
> integer tags?
PostGIS uses one information table in a few more complex functions, but not in
anything low-level. Evidenced by the fact that it was totally normal for that
to go out of sync before 2.0.
But even if such a thing were needed, it wouldn't be problematic to make
extension configuration tables be replicated as well.
> > You could reduce the space overhead by only adding that information only
> > the first time after a table has changed (and then regularly after a
> > checkpoint or so) but doing so seems to be introducing too much
> > complexity.
> Well, I dunno: it is complicated, but I'm worried that the design
> you've got is awfully complicated, too. Requiring an extra PG
> instance with a very specific configuration that furthermore uses an
> untested WAL-filtering methodology that excludes everything but the
> system catalogs seems like an administrative nightmare, and I remain
> unconvinced that it is safe.
Well, I assumed we would have a utility to create such an instance via the
replication protocol which does all the necessary things.
I personally don't want to do this in the very first version of an applied
patch. I think solving this on a binary level, without decoding, is the
implementation-wise smallest, already usable set of features. Yes, it doesn't
provide many things people want (cross-arch, replication into text, ...) but
it provides most of the infrastructure for that. If we try to do everything at
once we will *never* get anywhere.
The WAL filtering machinery is a very nice building block to improve the SR
experience as well, btw, by removing unneeded full-page writes, filtering on
one database, and such.
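To illustrate the filtering-on-one-database part (rough sketch only;
record_get_rnode() is a stand-in for per-rmgr code that digs the RelFileNode
out of the individual record types, where there is one at all):

#include "postgres.h"
#include "access/xlog.h"
#include "storage/relfilenode.h"

static bool
keep_record_for_database(XLogRecord *record, Oid target_dboid)
{
	RelFileNode rnode;

	/* records without an associated relation (xact, clog, ...) are kept */
	if (!record_get_rnode(record, &rnode))
		return true;

	/* shared relations live in database 0 and are always needed */
	if (rnode.dbNode == InvalidOid)
		return true;

	return rnode.dbNode == target_dboid;
}

Whether filtered-out records get dropped from the stream entirely or replaced
by some small placeholder record is one of the open questions there.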
> In fact, I have a strong feeling that it isn't safe, but if you're not
> convinced by the argument already laid out then I'm not sure I can convince
> you of it right this minute.
> What happens if you have a crash on the WAL generation machine?
> You'll have to rewind to the most recent restartpoint, and you can't
> use the catalogs until you've reached the minimum recovery point. Is
> that going to mess you up?
With "Wal generation machine" you mean the catalog-only instance?
If yes: Very good point.
The only real problem I know with that are checkpoints with overflowed subxid
snapshots. Otherwise you always can restart with the last checkpoint and
exactly replay to the former location. The minimum recovery point cannot be
ahead the point you just replayed. Everytime the recovery point is updated you
need to serialize the applycache to disk but thats already required to be
possible because otherwise you would need unbounded amounts of memory.
For the overflowed snapshot problem I had toyed with the idea of simply
serializing the state of the KnownAssignedXids machinery when replaying
checkpoints and at RecoveryPoints to solve that issue after the initial clone
for HS in general.
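Just to show the shape of that idea (a toy sketch that would live in or near
procarray.c; get_known_assigned_xids() is a made-up accessor that doesn't
exist today, TOTAL_MAX_CACHED_SUBXIDS is the existing sizing of the
KnownAssignedXids array):

#include "postgres.h"
#include "storage/fd.h"

/*
 * Dump the currently known-assigned xids to a file alongside the
 * checkpoint/restartpoint they belong to, so recovery restarted from that
 * checkpoint can become consistent immediately, even if the subxid array
 * had overflowed.
 */
static void
SerializeKnownAssignedXids(const char *path)
{
	TransactionId *xids;
	int			nxids;
	FILE	   *fp;

	xids = palloc(TOTAL_MAX_CACHED_SUBXIDS * sizeof(TransactionId));
	nxids = get_known_assigned_xids(xids, TOTAL_MAX_CACHED_SUBXIDS);

	fp = AllocateFile(path, PG_BINARY_W);
	if (fp == NULL)
		elog(ERROR, "could not create \"%s\": %m", path);

	if (fwrite(&nxids, sizeof(nxids), 1, fp) != 1)
		elog(ERROR, "could not write \"%s\": %m", path);
	if (nxids > 0 &&
		fwrite(xids, sizeof(TransactionId), nxids, fp) != (size_t) nxids)
		elog(ERROR, "could not write \"%s\": %m", path);

	FreeFile(fp);
	pfree(xids);
}

On replay of the corresponding checkpoint the file would be read back and fed
into the KnownAssignedXids machinery before continuing.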
> >> And then maybe we handle poorly-behaved types by pushing some of the
> >> work into the foreground task that's generating the WAL: in the worst
> >> case, the process logs a record before each insert/update/delete
> >> containing the text representation of any values that are going to be
> >> hard to decode. In some cases (e.g. records all of whose constituent
> >> fields are well-behaved types) we could instead log enough additional
> >> information about the type to permit blind decoding.
> >
> > I think this is prohibitively expensive from a development, runtime,
> > space and maintenance standpoint.
> > For databases using things where decoding is rather expensive (e.g.
> > postgis) you wouldn't really improve much above the old trigger based
> > solutions. Its a return to "log everything twice".
>
> Well, if the PostGIS types are poorly behaved under the definition I
> proposed, that implies they won't work at all under your scheme. I
> think putting a replication solution into core that won't support
> PostGIS is dead on arrival. If they're well-behaved, then there's no
> double-logging.
I am pretty sure it's not badly behaved. But how should the code know that? Do
you want each type to explicitly say that it's unsafe if it is?
In that case you could just as well mark all such tables, which would also make
reliable pg_dump/restore possible. Assuming you buy the duplicated catalog
idea. Which you do not ;)
> > Sorry if I seem pigheaded here, but I fail to see why all that would buy
> > us anything but loads of complexity while losing many potential
> > advantages.
> The current system that you are proposing is very complex and has a
> number of holes at present, some of which you've already mentioned in
> previous emails. There's a lot of advantage in picking a design that
> allows you to put together a working prototype relatively quickly, but
> I have a sinking feeling that your chosen design is going to be very
> hard to bullet-proof and not very user-friendly. If we could find a
> way to run it all inside a single server I think we would be way ahead
> on both fronts.
Don't get me wrong: if I believed another approach were realistic I would
definitely prefer not going for that. I do find the requirement of keeping
multiple copies of the catalog around pretty damn ugly.
The problem is just that to support basically arbitrary decoding requirements
you need to provide at least those pieces of information in a transactionally
consistent manner:
* the data
* table names
* column names
* type information
* replication configuration
I have yet to see anybody point out a way to provide all that information
without huge problems with the complexity/space-usage/performance triangle.
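Just to spell out what I mean, here is a made-up illustration of what a single
decoded change would have to carry (not an actual struct from the patch):

#include "postgres.h"
#include "access/htup.h"

/*
 * Illustration only: the shape of one decoded change, showing the pieces
 * of information listed above that have to be available in a
 * transactionally consistent fashion at decoding time.
 */
typedef struct LogicalChange
{
	TransactionId xid;			/* originating transaction */
	char		action;			/* 'I'nsert, 'U'pdate or 'D'elete */

	/* table name */
	char	   *schemaname;
	char	   *relname;

	/* column names and type information, as of the change's snapshot */
	int			natts;
	char	  **attnames;
	Oid		   *atttypids;

	/* the data itself: old and new versions of the row */
	HeapTuple	oldtuple;
	HeapTuple	newtuple;

	/* replication configuration, e.g. which set the table belongs to */
	Oid			replication_set;
} LogicalChange;

Every one of those fields has to match what the catalogs looked like at the
time the change was originally made, which is exactly the hard part.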
I have played with several ideas:
1.)
Keep the decoding catalog in sync with command/event triggers, correctly
replicating oids. If those log into some internal event table, it's easy to keep
the catalog in a correct transactional state, because the events from that
table get decoded in the transaction and replayed at exactly the right spot in
there *after* it has been reassembled. The locking on the generating side
takes care of the concurrency aspects.
Advantages:
* minimal overhead (space, performance)
* allows additional tables/indexes/triggers if you take care with oid
allocation
* easy transactionally correct catalog behaviour
* the decoding instance can be used to store all data in a highly efficient
manner (no decoding, no full detoasting, ...)
* the decoding instance is fully writable without problems if you don't
generate conflicts (separate tables, non-overlapping writes, whatever)
* implementable in a pretty unintrusive way
Disadvantages:
* the table structure of replicated tables needs to be *exactly* the same
* the type definitions + support procs need to be similar enough to read the
data
* error checking of the above isn't easy
* full version/architecture compatibility required
* a second instance required even if you want to replicate into some other
system/architecture/version
2.)
Keep the decoding site up to date by replicating the catalog via normal
recovery mechanisms
Advantages:
* most of the technology is already there
* minimal overhead (space, performance)
* no danger of out of sync catalogs
* no support for command triggers required that can keep a catalog in sync,
including oids
Disadvantages:
* driving the catalog recovery that way requires somewhat intricate code as it
needs to be done in lockstep with decoding the WAL stream
* requires an additional feature to guarantee HS always has enough information
to be queryable
* some complex logic/low-level fudging required to keep the transactional
behaviour sensible when querying the catalog
* full version/architecture compatibility required
* the decoding site will always be read-only
3.)
Fully versioned catalog
Advantages:
* Decoding is done on the master in an asynchronous fashion
* low overhead during normal DML execution, not much additional code in that
path
* can be very efficient if architecture/version are the same
* version/architecture compatibility can be done transparently by falling back
to textual versions on mismatch
Disadvantages:
* catalog versioning is complex to implement
* space overhead for all users, even without using logical replication
* I can't see -hackers signing off
* decoding has to happen on the master which might not be what people want
performancewise
4.)
Log enough information in the WAL stream to make decoding possible using only
the WAL stream.
Advantages:
* Decoding can optionally be done on the master
* No catalog syncing/access required
* it's possible to make this architecture-independent
Disadvantages:
* high to very high implementation overhead depending on efficiency aims
* high space overhead in the WAL because at least all the catalog information
needs to be logged in a transactional manner repeatedly
* misuses WAL far more than other methods
* significant new complexity in somewhat critical code paths (heapam.c)
* insanely high space overhead if the decoding should be possible
architecture-independently
5.)
The actually good idea. Yours?
Greetings,
Andres
--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services