From: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Nikhil Sontakke <nikhils(at)2ndquadrant(dot)com>, David Steele <david(at)pgmasters(dot)net>, Andres Freund <andres(at)anarazel(dot)de>, Andrew Dunstan <andrew(dot)dunstan(at)2ndquadrant(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>, Craig Ringer <craig(at)2ndquadrant(dot)com>, Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, Petr Jelinek <petr(dot)jelinek(at)2ndquadrant(dot)com>, Sokolov Yura <y(dot)sokolov(at)postgrespro(dot)ru>, Stas Kelvich <s(dot)kelvich(at)postgrespro(dot)ru>, Dmitry Dolgov <9erthalion6(at)gmail(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [HACKERS] logical decoding of two-phase transactions
Date: 2018-07-18 14:08:37
Message-ID: c9005196-e858-5dfd-80f3-faa076f7102f@2ndquadrant.com
Lists: pgsql-hackers
On 07/17/2018 08:10 PM, Robert Haas wrote:
> On Mon, Jul 16, 2018 at 3:25 PM, Tomas Vondra
> <tomas(dot)vondra(at)2ndquadrant(dot)com> wrote:
>> Oh, right, I forgot the patch also adds the leader into the group, for
>> some reason (I agree it's unclear why that would be necessary, as you
>> pointed out later).
>>
>> But all this is happening while holding the partition lock (in exclusive
>> mode). And the decoding backends do synchronize on the lock correctly
>> (although, man, the rechecks are confusing ...).
>>
>> But now I see ProcKill accesses decodeGroupLeader in multiple places,
>> and only the first one is protected by the lock, for some reason
>> (interestingly enough, the one in the lockGroupLeader block). Is that
>> what you mean?
>
> I haven't traced out the control flow completely, but it sure looks to
> me like there are places where decodeGroupLeader is checked without
> holding any LWLock at all. Also, it looks to me like in some places
> (like where we're trying to find a PGPROC by XID) we use ProcArrayLock,
> and in others -- I guess where we're checking the decodeGroupBlah
> stuff -- we are using the lock manager locks. I don't know how safe
> that is, and there are not a lot of comments justifying it. I also
> wonder why we're using the lock manager locks to protect the
> decodeGroup stuff rather than backendLock.
>
>> FWIW I suspect the ProcKill part is borked due to incorrectly resolved
>> merge conflict or something, per my initial response from today.
>
> Yeah, I wasn't seeing the code the way I thought you were describing it
> in that response, but I'm dumb this week so maybe I just
> misunderstood.
>
>>> I think that's probably not going to work out, but of course it's up
>>> to you how you want to spend your time!
>>
>> Well, yeah. I'm sure I could think of more fun things to do, but OTOH I
>> also have patches that require the capability to decode in-progress
>> transactions.
>
> It's not a matter of fun; it's a matter of whether it can be made to
> work. Don't get me wrong -- I want the ability to decode in-progress
> transactions. I complained about that aspect of the design to Andres
> when I was reviewing and committing logical slots & logical decoding,
> and I complained about it probably more than I complained about any
> other aspect of it, largely because it instantaneously generates a
> large lag when a bulk load commits. But not liking something about
> the way things are is not the same as knowing how to make them better.
> I believe there is a way to make it work because I believe there's a
> way to make anything work. But I suspect that it's at least one order
> of magnitude more complex than this patch currently is, and likely an
> altogether different design.
>
Sure, it may turn out not to work - but how do you know until you try?
We have a well-known theater play here, in which one of the actors keeps
blowing tobacco smoke into a sink, to see whether gold can be created
that way. Which is foolish, but his reasoning is "Someone had to try, to
be sure!" So we're in the phase of blowing tobacco smoke, kinda ;-)
Also, you often discover solutions while investigating approaches that
initially seem unworkable. Or entirely new approaches. It has certainly
happened to me, many times.
There's a great book/movie "Touching the Void" [1] about a climber
falling into a deep crevasse. Unable to climb up, he decides to crawl
down - which is obviously foolish, but he happens to find a way out.
I suppose we're kinda doing the same thing here - crawling down a
crevasse (while still smoking and blowing the tobacco smoke into a sink,
which we happened to find in the crevasse or something).
Anyway, I have no clear idea what changes to the original design of
logical decoding would make implementing this easier now. The decoding
is in general quite constrained by how our transam and WAL stuff works.
I suppose Andres thought about this aspect, and I guess he concluded
that (a) it's not needed for v1, and (b) adding it later would require
about the same effort. So in the "better" case we'd have ended up
waiting much longer for logical decoding; in the worst case we would
not have it at all.
>> But the way I understand it, it pretty much *is* a list of waiters,
>> along with a couple of flags to allow the processes to notify the other
>> side about lock/unlock/abort. It does resemble the lock groups, but I
>> don't think it has the same goals.
>
> So the parts that aren't relevant shouldn't be copied over.
>
I'm not sure which parts aren't relevant, but in general I agree that
stuff that is not necessary should not be copied over.
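
For anyone following along, here's roughly the shape of the thing as I
read the patch - a simplified sketch, not the actual code; only
decodeGroupLeader appears verbatim in this thread, and the list/link
field names are my paraphrase. It assumes the usual internals headers
(storage/proc.h, storage/lock.h, storage/lwlock.h, lib/ilist.h):

    /*
     * New PGPROC fields (names approximate):
     *   PGPROC     *decodeGroupLeader;   -- decode group leader, or NULL
     *   dlist_head  decodeGroupMembers;  -- list of members, if leader
     *   dlist_node  decodeGroupLink;     -- our link in leader's list
     */
    static void
    decode_group_detach(PGPROC *proc)
    {
        if (proc->decodeGroupLeader != NULL)
        {
            PGPROC     *leader = proc->decodeGroupLeader;
            LWLock     *leader_lwlock = LockHashPartitionLockByProc(leader);

            LWLockAcquire(leader_lwlock, LW_EXCLUSIVE);

            /*
             * Recheck under the lock - these are the rechecks I called
             * confusing: the group may have changed between the unlocked
             * test above and acquiring the partition lock.
             */
            if (proc->decodeGroupLeader != NULL)
            {
                dlist_delete(&proc->decodeGroupLink);
                proc->decodeGroupLeader = NULL;
            }

            LWLockRelease(leader_lwlock);
        }
    }

The concern upthread is precisely that some reads of decodeGroupLeader
happen without this partition lock (or under ProcArrayLock instead),
with few comments arguing why that is safe.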
>>> That having been said, I still don't see how that's really going to
>>> work. Just to take one example, suppose that the leader is trying to
>>> ERROR out, and the decoding workers are blocked waiting for a lock
>>> held by the leader. The system has no way of detecting this deadlock
>>> and resolving it automatically, which certainly seems unacceptable.
>>> The only way that's going to work is if the leader waits for the
>>> worker by trying to acquire a lock held by the worker. Then the
>>> deadlock detector would know to abort some transaction. But that
>>> doesn't really work either - the deadlock was created by the
>>> foreground process trying to abort, and if the deadlock detector
>>> chooses that process as its victim, what then? We're already trying
>>> to abort, and the abort code isn't supposed to throw further errors,
>>> or fail in any way, lest we break all kinds of other things. Not to
>>> mention the fact that running the deadlock detector in the abort path
>>> isn't really safe to begin with, again because we can't throw errors
>>> when we're already in an abort path.
>>
>> Fair point, not sure. I'll leave this up to Nikhil.
>
> That's fine, but please understand that I think there's a basic design
> flaw here that just can't be overcome with any amount of hacking on
> the details here. I think we need a much higher-level consideration
> of the problem here and probably a lot of new infrastructure to
> support it. One idea might be to initially support decoding of
> in-progress transactions only if they don't modify any catalog state.
The problem is that you don't know whether a transaction will do DDL at
some later point that you haven't decoded yet (or that perhaps runs
concurrently with the decoding) - think of a large COPY followed only
afterwards by an ALTER TABLE. So I don't see how you could easily
exclude such transactions from the decoding ...
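
To put the same thing in (purely hypothetical) code - the reorderbuffer
does already track whether a transaction touched the catalogs, but with
in-progress decoding that knowledge can arrive too late:

    /*
     * Hypothetical sketch, not actual reorderbuffer.c code. By the time
     * has_catalog_changes becomes true, earlier changes of the same
     * transaction may already have been streamed downstream, so
     * "exclude catalog-modifying transactions" can't be decided up front.
     */
    static void
    stream_next_change(ReorderBuffer *rb, ReorderBufferTXN *txn)
    {
        if (txn->has_catalog_changes)
            elog(ERROR, "transaction %u turned out to modify the catalogs "
                 "after streaming of it already began", txn->xid);

        /* ... otherwise hand the next change to the output plugin ... */
    }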
> That would leave out a bunch of cases we'd probably like to support,
> such as CREATE TABLE + COPY in the same transaction, but it would
> likely dodge a lot of really hard problems, too, and we could improve
> things later. One approach to the problem of catalog changes would be
> to prevent catalog tuples from being removed even after transaction
> abort until such time as there's no decoding in progress that might
> care about them. That is not by itself sufficient because a
> transaction can abort after inserting a heap tuple but before
> inserting an index tuple and we can't look at the catalog when it's in an
> inconsistent state like that and expect reasonable results. But it
> helps: for example, if you are decoding a transaction that has
> inserted a WAL record with a cmin or cmax value of 4, and you know
> that none of the catalog records created by that transaction can have
> been pruned, then it should be safe to use a snapshot with CID 3 or
> smaller to decode the catalogs. So consider a case like:
>
> BEGIN;
> CREATE TABLE blah ... -- command ID 0
> COPY blah FROM '/tmp/blah' ... -- command ID 1
>
> Once we see the COPY show up in the WAL, it should be safe to decode
> the CREATE TABLE command and figure out what a snapshot with command
> ID 0 can see (again, assuming we've suppressed pruning in the catalogs
> in a sufficiently-well-considered way). Then, as long as the COPY
> command doesn't do any DML via a trigger or a datatype input function
> (!) or whatever, we should be able to use that snapshot to decode the
> data inserted by COPY.
One obvious issue with this is that it does not actually help with
reducing the replication lag, which is pretty much the main goal of
this whole effort: if the COPY is a big data load, waiting until it
completes gains us almost nothing.
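
That said, the CID rule itself seems sound to me, as far as it goes. As
I understand it, the invariant is simply this (hypothetical helper, not
actual snapmgr/reorderbuffer code):

    /*
     * Per Robert's rule above: if the transaction being decoded has
     * written a WAL record carrying cmin/cmax = N, all of its commands
     * with CID < N have completed, so - with catalog pruning suppressed -
     * a catalog snapshot is safe iff it cannot see the still-unfinished
     * command.
     */
    static inline bool
    catalog_snapshot_is_safe(CommandId snapshot_cid, CommandId last_wal_cid)
    {
        /* e.g. last_wal_cid = 4 means snapshots with CID <= 3 are safe */
        return snapshot_cid < last_wal_cid;
    }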
> I'm not quite sure what happens if the COPY
> does do some DML or something like that -- we might have to stop
> decoding until the following command begins in the live transaction,
> or something like that. Or maybe we don't have to do that. I'm not
> totally sure how the command counter is managed for catalog snapshots.
> However it works in detail, we will get into trouble if we ever use a
> catalog snapshot that can see a change that the live transaction is
> still in the midst of making. Even with pruning prevented, we can
> only count on the catalogs to be in a consistent state once the live
> transaction has finished the command -- otherwise, for example, it
> might have increased pg_class.relnatts but not yet added the
> pg_attribute entry at the time it aborts, or something like that. I'm
> blathering a little bit but hopefully you get the point: I think the
> way forward is for somebody to think carefully through how and under
> what circumstances using a catalog snapshot can be made safe even if
> an abort has occurred afterwards -- not trying to postpone the abort,
> which I think is never going to be right.
>
But isn't this (delaying the catalog cleanup etc.) pretty much the
original approach, implemented by the original patch - the one you also
claimed to be unworkable, IIRC? And how does it address the problems
with broken HOT chains, for example? Those issues were pretty much the
reason we started looking at alternative approaches, like delaying the
abort ...
I wonder if disabling HOT on catalogs with wal_level=logical would be an
option here. I'm not sure how important HOT on catalogs is in practice
(it surely does not help with the typical catalog bloat issue, namely
temporary tables, because that's mostly insert+delete). I suppose we
could disable it only when there's a replication slot indicating support
for decoding of in-progress transactions, so that plain logical decoding
would still get HOT.
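
In code terms, I'd imagine something like the following in the
heap_update() path - a hypothetical sketch only, with the surrounding
logic heavily simplified from heapam.c:

    /*
     * Sketch of the idea above, not a real patch: keep the existing
     * "no indexed column changed" test, but additionally refuse HOT for
     * system catalogs when logical decoding is enabled, so that aborted
     * catalog tuple versions stay on regular (non-HOT) update chains.
     * The per-slot refinement would need a flag in shared memory
     * instead of the plain wal_level test used here.
     */
    bool        use_hot_update = false;

    if (!bms_overlap(modified_attrs, hot_attrs) &&
        !(XLogLogicalInfoActive() && IsCatalogRelation(relation)))
        use_hot_update = true;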
I'm sure there will be other obstacles, not just the HOT chain stuff,
but it would be one step closer to a solution.
regards
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services