From: Andres Freund <andres(at)2ndquadrant(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Peter Eisentraut <peter_e(at)gmx(dot)net>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: logical changeset generation v6
Date: 2013-09-19 14:43:02
Message-ID: 20130919144302.GB15812@awork2.anarazel.de
Lists: pgsql-hackers
Hi Robert,
On 2013-09-19 10:02:31 -0400, Robert Haas wrote:
> On Tue, Sep 17, 2013 at 10:31 AM, Andres Freund <andres(at)2ndquadrant(dot)com> wrote:
> > Rebased patches attached.
>
> I spent a bit of time looking at these patches yesterday and today.
> It seems to me that there's a fair amount of stylistic cleanup that is
> still needed, and some pretty bad naming choices, and some FIXMEs that
> should probably be fixed, but for an initial review email it seems
> more helpful to focus on high-level design, so here goes.
Thanks for looking at it.
Yes, I think the high-level stuff is the important bit.
As you note, the documentation needs to be written and that's certainly
not a small task. But doing so before the high-level design is agreed
upon makes it too likely that it will need to be entirely scrapped.
> - Looking specifically at the 0004 patch, I think that the
> RecentGlobalDataXmin changes are logically separate from the rest of
> the patch, and that if we're going to commit them at all, it should be
> separate from the rest of this. I think this is basically a
> performance optimization. AFAICS, there's no correctness problem with
> the idea of continuing to maintain a single RecentGlobalXmin; it's
> just that you'll potentially end up with quite a lot of bloat. But
> that's not an argument that those two things HAVE to be committed
> together; either could be done first, and independently of the other.
> Also, these changes are quite complex and it's difficult to form a
> judgement as to whether this idea is safe when they are intermingled
> with the rest of the logical replication stuff.
Up until v3 the RecentGlobalDataXmin stuff wasn't included and reviewers
(primarily Peter G. on -hackers and Greg Stark at pgconf.eu) remarked on
that and considered it critical. I argued for a considerable amount of
time that it shouldn't be done in an initial patch and then gave in.
They have a point though: if you e.g. replicate a pgbench -j16 workload,
the addition of RecentGlobalDataXmin reduces the performance impact of
replication from about 60% to less than 5% in my measurements. It turns
out heap pruning is damn critical for that kind of workload.
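To make the idea concrete, here's a rough standalone sketch (the variable
and function names are just stand-ins, not the patch's actual code):
decoding only needs old *catalog* tuple versions, so user tables can be
pruned against a much more aggressive horizon than catalogs:

/* Illustrative sketch only -- simplified stand-ins for the real horizons. */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

typedef uint32_t TransactionId;

/* Oldest xmin any snapshot (including logical decoding) might still need. */
static TransactionId RecentGlobalXmin = 900;
/* Oldest xmin needed for *user* data only; decoding only pins catalog
 * tuples, so this horizon can advance much further. */
static TransactionId RecentGlobalDataXmin = 1400;

/* Pick the pruning horizon for a relation: catalogs must honour the
 * conservative horizon, plain user tables may use the aggressive one. */
static TransactionId
PruningHorizonFor(bool is_catalog_relation)
{
    return is_catalog_relation ? RecentGlobalXmin : RecentGlobalDataXmin;
}

static bool
TupleIsPrunable(TransactionId tuple_xmax, bool is_catalog_relation)
{
    /* A dead tuple is only removable once its deleter precedes the horizon. */
    return tuple_xmax != 0 && tuple_xmax < PruningHorizonFor(is_catalog_relation);
}

int
main(void)
{
    /* A tuple deleted by xid 1000: prunable in a user table, but kept in
     * a catalog while decoding may still need to look at it. */
    printf("user table: %d\n", TupleIsPrunable(1000, false)); /* 1 */
    printf("catalog:    %d\n", TupleIsPrunable(1000, true));  /* 0 */
    return 0;
}

That's the whole trick: the pgbench tables keep getting pruned as usual,
and only the (comparatively small) catalogs are held back by decoding.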
> More generally, the thing that bugs me about this approach is that
> logical replication is not really special, but what you've done here
> MAKES it special. There are plenty of other situations where we are
> too aggressive about holding back xmin. A single-statement
> transaction that selects from a single table holds back xmin for the
> entire cluster, and that is Very Bad. It would probably be unfair to
> say, well, you have to solve that problem first. But on the other
> hand, how do we know that the approach you've adopted here isn't going
> to make the more general problem harder to solve? It seems invasive
> at a pretty low level.
The reason why I think it's actually different is that the user has
control over how long transactions run on the primary. They don't really
control how fast a replication consumer consumes and how often it sends
feedback messages.
> I think we should at least spend some time
> thinking about what *general* solutions to this problem would look
> like and then decide whether this approach is likely to be
> forward-compatible with those solutions.
I thought about the general case for a good bit and decided that all
solutions that work in a more general scenario are complex enough that I
don't want to implement them. And I don't really see any backward
compatibility concerns here - removing the logic of using a separate
horizon for user tables in contrast to system tables is pretty trivial
and shouldn't have any external effect. Except pegging the horizon more,
but that's what the new approach would fix, right?
> - Given that we don't reassemble transactions until commit time, why
> do we need to ensure that XIDs are logged before their sub-XIDs
> appear in WAL?
Currently it's important to know where the oldest transaction that is
still alive started, to determine from where we need to restart
decoding. That's done by keeping an LSN-ordered list of in-progress
toplevel transactions. The above invariant makes it cheap to maintain
that list.
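Roughly, the idea looks like this (a simplified standalone sketch with
made-up names, not the reorderbuffer code): because a toplevel xid hits
WAL before any of its subxids, a transaction can simply be appended to
the list when its xid is first seen, and the list stays LSN-sorted for
free:

/* Illustrative sketch of why "parent xid logged before subxids" helps. */
#include <stdint.h>
#include <stdio.h>

typedef uint64_t XLogRecPtr;
typedef uint32_t TransactionId;

typedef struct InProgressTxn
{
    TransactionId xid;
    XLogRecPtr    first_lsn;         /* LSN of the txn's first WAL record */
    struct InProgressTxn *next;      /* kept in ascending first_lsn order */
} InProgressTxn;

static InProgressTxn *toplevel_txns = NULL;

/* Because the toplevel xid is guaranteed to be logged before any of its
 * subxids appear in WAL, new transactions always attach at the tail and
 * the list stays LSN-sorted without searching. */
static void
txn_started(InProgressTxn *txn)
{
    InProgressTxn **tail = &toplevel_txns;

    while (*tail)
        tail = &(*tail)->next;
    txn->next = NULL;
    *tail = txn;
}

/* Decoding must be able to restart from the start of the oldest
 * still-in-progress transaction: simply the list head. */
static XLogRecPtr
restart_lsn(void)
{
    return toplevel_txns ? toplevel_txns->first_lsn : 0;
}

int
main(void)
{
    InProgressTxn a = {.xid = 700, .first_lsn = 0x1000};
    InProgressTxn b = {.xid = 701, .first_lsn = 0x1400};

    txn_started(&a);
    txn_started(&b);
    printf("restart from %llx\n", (unsigned long long) restart_lsn());
    return 0;
}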
> As I've said previously, I actually think that
> on-the-fly reassembly is probably going to turn out to be very
> important. But if we're not getting that, do we really need this?
It's also preparatory for supporting that.
I agree that it's pretty important, but after actually having
implemented a replication solution using this, I still think that most
use cases won't use it when it's available. I plan to work on
implementing that.
> Also, while I'm trying to keep this email focused on high-level
> concerns, I have to say that guaranteedlyLogged has got to be one of
> the worst variable names I've ever seen, starting (but not ending)
> with the fact that guaranteedly is not a word. I'm also tempted to
> say that all of the wal_level=logical changes should be split out as
> their own patch, separate from the decoding stuff. Personally, I
> would find that much easier to review, although I admit it's less
> critical here than for the RecentGlobalDataXmin stuff.
I can do that again, and it actually was that way in the past. But
there's no user for it before the later patch, and it's hard to
understand the reasoning for the changed WAL logging separately; that's
why I merged it at some point.
> - If XLOG_HEAP_INPLACE is not decoded, doesn't that mean that this
> facility can't be used to replicate a system catalog table? Is that a
> restriction we should enforce/document somehow?
Currently catalog tables aren't replicated, yes. They simply are skipped
during decoding. XLOG_HEAP_INPLACE isn't the primary reason for that
though.
Do you see a use case for it?
> - The new code is rather light on comments. decode.c is extremely
> light.
Will improve. I think most of the other code is better commented, but it
still could use quite a bit of improvement nonetheless.
> - It still bothers me that we're going to have mandatory slots for
> logical replication and no slots for physical replication. Why are
> slots mandatory in one case and not even allowable in the other? Is
> it sensible to think about slotless logical replication - or maybe I
> should say, is it any LESS sensible than slotless physical
> replication?
Well, as you know, I do want to have slots for physical replication as
well. But there actually is a fundamental reason why we need them for
logical rep and not for physical: In physical replication, if the xmin
progresses too far, client queries will be cancelled. Annoying but not
fatal. In logical replication we will not be able to continue
replicating since we cannot decode the WAL stream without a valid
catalog snapshot. If xmin already has progressed too far the tuples
won't be there anymore.
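A simplified sketch of what the slot buys us (hypothetical names and
fields, not the actual slot code): each logical slot pins the catalog
rows its consumer still needs, and the catalog horizon can only advance
to the minimum over all slots:

/* Illustrative sketch only. */
#include <stdint.h>
#include <stdio.h>

typedef uint32_t TransactionId;
#define InvalidTransactionId 0

typedef struct LogicalSlot
{
    const char    *name;
    TransactionId  catalog_xmin;   /* oldest catalog tuple version needed */
} LogicalSlot;

/* The catalog pruning horizon may only advance as far as the most
 * lagging slot still requires. */
static TransactionId
catalog_horizon(const LogicalSlot *slots, int nslots, TransactionId global_xmin)
{
    TransactionId horizon = global_xmin;

    for (int i = 0; i < nslots; i++)
    {
        if (slots[i].catalog_xmin != InvalidTransactionId &&
            slots[i].catalog_xmin < horizon)
            horizon = slots[i].catalog_xmin;
    }
    return horizon;
}

int
main(void)
{
    LogicalSlot slots[] = {
        {"fast_consumer", 1500},
        {"slow_consumer", 1200},   /* lagging consumer holds the horizon back */
    };

    /* Without slots the horizon would be 1600 and catalog tuples the slow
     * consumer still needs for decoding would already have been pruned. */
    printf("catalog horizon: %u\n", catalog_horizon(slots, 2, 1600));
    return 0;
}

Without the slot there is no such minimum, and the horizon simply runs
ahead of the slowest consumer.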
If people think this needs to be a general facility from the start, I
can be convinced that way, but I think there's so much to discuss around
the semantics and different usecases that I'd much prefer to discuss
that later.
> - What is the purpose of (Un)SuspendDecodingSnapshots? It seems that
> should be explained somewhere. I have my doubts about how safe that
> is.
I'll document the details if they aren't documented right now. Consider
what happens if somebody does something like "VACUUM FULL pg_am;". If we
were to build the relation descriptor of pg_am in a "historical
snapshot", as you coined it, we'd have the wrong filenode in there, and
consequently any future lookups in pg_am would open a file that doesn't
exist.
That problem only exists for non-nailed relations that are accessed
during decoding.
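As a toy illustration of that hazard (all names and numbers made up): a
lookup under the historical snapshot returns the pre-rewrite row and
thus a relfilenode whose file has already been unlinked:

/* Illustrative sketch of the pg_am/relfilenode hazard. */
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

typedef uint32_t TransactionId;

typedef struct CatalogRowVersion
{
    TransactionId xmin;         /* inserted by */
    TransactionId xmax;         /* superseded by (0 = still live) */
    uint32_t      relfilenode;  /* on-disk file backing the relation */
} CatalogRowVersion;

/* Two versions of the pg_am row: before and after "VACUUM FULL pg_am;",
 * which rewrote the catalog into a new file. */
static const CatalogRowVersion pg_am_versions[] = {
    {.xmin = 100,  .xmax = 2000, .relfilenode = 2601},   /* old file, unlinked */
    {.xmin = 2000, .xmax = 0,    .relfilenode = 16401},  /* current file */
};

/* Return the relfilenode visible to a snapshot taken at snapshot_xmin:
 * a row version is visible if it was inserted before the snapshot and
 * not yet superseded as of the snapshot. */
static uint32_t
visible_relfilenode(TransactionId snapshot_xmin)
{
    for (size_t i = 0; i < sizeof(pg_am_versions) / sizeof(pg_am_versions[0]); i++)
    {
        const CatalogRowVersion *v = &pg_am_versions[i];

        if (v->xmin < snapshot_xmin && (v->xmax == 0 || v->xmax >= snapshot_xmin))
            return v->relfilenode;
    }
    return 0;
}

int
main(void)
{
    /* The historical snapshot points at the old, already-removed file;
     * only a current snapshot finds the real one. */
    printf("historical snapshot: relfilenode %u\n", visible_relfilenode(1500));
    printf("current snapshot:    relfilenode %u\n", visible_relfilenode(3000));
    return 0;
}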
> And I definitely think that SetupDecodingSnapshots() is not OK.
> Overwriting the satisfies functions in static pointers may be a great
> way to make sure you've covered all bases during development, but I
> can't see us wanting that ugliness in the official sources.
Yes, I don't like it either. I am not sure what to replace it with
though. It's easy enough to fit something in GetCatalogSnapshot() and I
don't have a problem with that, but I am less happy with adding code
like that to GetSnapshotData() for callers that use explicit snapshots.
> - I don't really like "time travel" as a name for reconstructing a
> previous snapshot of a catalog. Maybe it's as good as anything, but
> it also doesn't help that "during decoding" is used in some places to
> refer to the same concept.
Heh, I think that's me trying to avoid repeating the same term over and
over subconsciously.
> I wonder if we should call these "logical replication snapshots" or
> "historical MVCC snapshots" or somesuch and then try to make the
> terminology consistent throughout. ReorderBufferTXN->does_timetravel
> really means "time travel will be needed to decode what this
> transaction did", which is not really the same thing.
Hm. ->does_timetravel really is badly named. Yuck. Should be
'->modifies_catalog' or similar.
I'll think about whether I can agree with either of the suggested terms
or can come up with a better one. Until then I'll try to make the
comments more consistent.
Thanks!
Andres Freund
--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services