From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Andres Freund <andres(at)2ndquadrant(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: [RFC][PATCH] Logical Replication/BDR prototype and architecture
Date: 2012-06-19 01:20:58
Message-ID: CA+TgmoZ3DhYrO5-OT3OTv8n47Uy6pwJfacbygNWf2V=_ZJZehg@mail.gmail.com
Lists: pgsql-hackers
On Sat, Jun 16, 2012 at 7:43 AM, Andres Freund <andres(at)2ndquadrant(dot)com> wrote:
>> > Hm. Yes, you could do that. But I have to say I don't really see a point.
>> > Maybe the fact that I do envision multimaster systems at some point is
>> > clouding my judgement though as its far less easy in that case.
>> Why? I don't think that particularly changes anything.
> Because it makes conflict detection very hard. I also don't think its a
> feature worth supporting. Whats the use-case of updating records you cannot
> properly identify?
Don't ask me; I just work here. I think it's something that some
people want, though. I mean, if you don't support replicating a table
without a primary key, then you can't even run pgbench in a
replication environment. Is that an important workload? Well,
objectively, no. But I guarantee you that other people with more
realistic workloads than that will complain if we don't have it.
Absolutely required on day one? Probably not. Completely useless
appendage that no one wants? Not that, either.
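To make that concrete, consider pgbench's history table, which has neither a
primary key nor any unique index (the DDL below mirrors what pgbench creates,
give or take; the duplicate insert is purely illustrative):

    CREATE TABLE pgbench_history (
        tid    int,
        bid    int,
        aid    int,
        delta  int,
        mtime  timestamp,
        filler char(22)
    );

    -- Two physically identical rows.  A logical change record describing an
    -- UPDATE or DELETE of one of them has no way to say which row it meant,
    -- short of shipping something like the ctid, which is meaningless on the
    -- other side.
    INSERT INTO pgbench_history (tid, bid, aid, delta, mtime)
    VALUES (1, 1, 1, 100, now()),
           (1, 1, 1, 100, now());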
>> In my view, a logical replication solution is precisely one in which
>> the catalogs don't need to be in sync. If the catalogs have to be in
>> sync, it's not logical replication. ISTM that what you're talking
>> about is sort of a hybrid between physical replication (pages) and
>> logical replication (tuples) - you want to ship around raw binary
>> tuple data, but not entire pages.
> Ok, thats a valid point. Simon argued at the cluster summit that everything
> thats not physical is logical. Which has some appeal because it seems hard to
> agree what exactly logical rep is. So definition by exclusion makes kind of
> sense ;)
Well, the words are fuzzy, but I would define logical replication to
be something which is independent of the binary format in which stuff
gets stored on disk. If it's not independent of the disk format, then
you can't do heterogeneous replication (between versions, or between
products). That precise limitation is the main thing that drives
people to use anything other than SR in the first place, IME.
> I think what you categorized as "hybrid logical/physical" rep solves an
> important use-case thats very hard to solve at the moment. Before my
> 2ndquadrant days I had several clients which had huge problems using the trigger
> based solutions because their overhead simply was too big a burden on the
> master. They couldn't use SR either because every consuming database kept
> loads of local data.
> I think such scenarios are getting more and more common.
I think this is to some extent true, but I also think you're
conflating two different things. Change extraction via triggers
introduces overhead that can be eliminated by reconstructing tuples
from WAL in the background rather than forcing them to be inserted
into a shadow table (and re-WAL-logged!) in the foreground. I will
grant that shipping the tuple as a binary blob rather than as text
eliminates additional overhead on both ends, but it also closes off a
lot of important use cases. As I noted in my previous email, I think
that ought to be a performance optimization that we do, if at all,
when it's been proven safe, not a baked-in part of the design. Even a
solution that decodes WAL to text tuples and ships those around and
reinserts them via pure SQL should be significantly faster than the
replication solutions we have today; if it isn't, something's wrong.
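For reference, here's roughly the shape of the trigger-based approach I'm
talking about, heavily simplified (the table, function, and trigger names are
made up for illustration, not lifted from Slony or londiste): every change
gets rendered to text and inserted into a shadow table in the foreground path
of the transaction, and that insert gets WAL-logged all over again.

    CREATE TABLE replication_log (
        id       bigserial PRIMARY KEY,
        relname  text NOT NULL,
        op       text NOT NULL,     -- 'INSERT', 'UPDATE', or 'DELETE'
        row_data text               -- textual form of the new (or old) row
    );

    CREATE FUNCTION capture_change() RETURNS trigger
    LANGUAGE plpgsql AS $$
    BEGIN
        IF TG_OP = 'DELETE' THEN
            INSERT INTO replication_log (relname, op, row_data)
            VALUES (TG_TABLE_NAME, TG_OP, OLD::text);
            RETURN OLD;
        ELSE
            INSERT INTO replication_log (relname, op, row_data)
            VALUES (TG_TABLE_NAME, TG_OP, NEW::text);
            RETURN NEW;
        END IF;
    END;
    $$;

    CREATE TRIGGER capture_changes
    AFTER INSERT OR UPDATE OR DELETE ON some_replicated_table
    FOR EACH ROW EXECUTE PROCEDURE capture_change();

Everything that trigger does is work the WAL-decoding approach moves out of
the foreground; the part that's genuinely in dispute is whether the payload
it ships should be text or raw binary tuples.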
>> The problem with that is it's going to be tough to make robust. Users could
>> easily end up with answers that are total nonsense, or probably even crash
>> the server.
> Why?
Because the routines that decode tuples don't include enough sanity
checks to prevent running off the end of the block, or even the end of
memory completely. Consider a corrupt TOAST pointer that indicates
that there is a gigabyte of data stored in an 8kB block. One of the
common symptoms of corruption IME is TOAST requests for -3 bytes of
memory.
And, of course, even if you could avoid crashing, interpreting what
was originally intended as a series of int4s as a varlena isn't likely
to produce anything terribly meaningful. Tuple data isn't
self-identifying; that's why this is such a hard problem.
>> To step back and talk about DDL more generally, you've mentioned a few
>> times the idea of using an SR instance that has been filtered down to
>> just the system catalogs as a means of generating logical change
>> records. However, as things stand today, there's no reason to suppose
>> that replicating anything less than the entire cluster is sufficient.
>> For example, you can't translate enum labels to strings without access
>> to the pg_enum catalog, which would be there, because enums are
>> built-in types. But someone could supply a similar user-defined type
>> that uses a user-defined table to do those lookups, and now you've got
>> a problem. I think this is a contractual problem, not a technical
>> one. From the point of view of logical replication, it would be nice
>> if type output functions were basically guaranteed to look at nothing
>> but the datum they get passed as an argument, or at the very least
>> nothing other than the system catalogs, but there is no such
>> guarantee. And, without such a guarantee, I don't believe that we can
>> create a high-performance, robust, in-core replication solution.
> I don't think thats a valid argument. Any such solution existing today fails
> to work properly with dump/restore and such because it implies dependencies
> that they do not know about. The "internal" tables will possibly be restored
> later than the tables using them and such. So your data format *has* to
> deal with loading/outputting data without such anyway.
Do you know for certain that PostGIS doesn't do anything of this type?
Or what about something like an SE-Linux label cache, where we might
arrange to create labels as they are used and associate them with
integer tags?
> And yes, you obviously can implement it without needing a side-table for output.
> Just as a string which is checked during input.
That misses the point - if people wanted labels represented by a
string rather than an integer, they would have just used a string and
stuffed a check or foreign key constraint in there.
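In other words, the two designs look something like this (made-up table
names; the real concern is a user-defined type whose output function does the
lookup internally, like the hypothetical SE-Linux label cache above, but a
plain side table shows the shape of the problem):

    -- Labels stored as integers: rendering a row as text requires consulting
    -- label_map, which a decoder working only from WAL plus the system
    -- catalogs does not have.
    CREATE TABLE label_map (
        label_id   int  PRIMARY KEY,
        label_text text NOT NULL UNIQUE
    );

    CREATE TABLE widget (
        widget_name text,
        label_id    int REFERENCES label_map
    );

    -- Labels stored as strings, checked at input time: trivially decodable,
    -- but it gives up exactly what the integer scheme was buying in the
    -- first place.
    CREATE TABLE widget_by_name (
        widget_name text,
        label       text CHECK (label IN ('unclassified', 'secret', 'top secret'))
    );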
> You could reduce the space overhead by adding that information only the
> first time after a table has changed (and then regularly after a checkpoint or
> so) but doing so seems to be introducing too much complexity.
Well, I dunno: it is complicated, but I'm worried that the design
you've got is awfully complicated, too. Requiring an extra PG
instance with a very specific configuration that furthermore uses an
untested WAL-filtering methodology that excludes everything but the
system catalogs seems like an administrative nightmare, and I remain
unconvinced that it is safe. In fact, I have a strong feeling that it
isn't safe, but if you're not convinced by the argument already laid
out then I'm not sure I can convince you of it right this minute.
What happens if you have a crash on the WAL generation machine?
You'll have to rewind to the most recent restartpoint, and you can't
use the catalogs until you've reached the minimum recovery point. Is
that going to mess you up?
>> And then maybe we handle poorly-behaved types by pushing some of the
>> work into the foreground task that's generating the WAL: in the worst
>> case, the process logs a record before each insert/update/delete
>> containing the text representation of any values that are going to be
>> hard to decode. In some cases (e.g. records all of whose constituent
>> fields are well-behaved types) we could instead log enough additional
>> information about the type to permit blind decoding.
> I think this is prohibitively expensive from a development, runtime, space and
> maintenance standpoint.
> For databases using things where decoding is rather expensive (e.g. postgis) you
> wouldn't really improve much above the old trigger based solutions. Its a
> return to "log everything twice".
Well, if the PostGIS types are poorly behaved under the definition I
proposed, that implies they won't work at all under your scheme. I
think putting a replication solution into core that won't support
PostGIS is dead on arrival. If they're well-behaved, then there's no
double-logging.
> Sorry if I seem pigheaded here, but I fail to see why all that would buy us
> anything but loads of complexity while losing many potential advantages.
The current system that you are proposing is very complex and has a
number of holes at present, some of which you've already mentioned in
previous emails. There's a lot of advantage in picking a design that
allows you to put together a working prototype relatively quickly, but
I have a sinking feeling that your chosen design is going to be very
hard to bullet-proof and not very user-friendly. If we could find a
way to run it all inside a single server I think we would be way ahead
on both fronts.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company