Quick Links

Re: abort-time portal cleanup

From:	Robert Haas <robertmhaas(at)gmail(dot)com>
To:	Andres Freund <andres(at)anarazel(dot)de>
Cc:	PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject:	Re: abort-time portal cleanup
Date:	2019-09-27 14:35:47
Message-ID:	CA+TgmoZt5+Rk3_61U1qW-Goz_QKVWO-3j_ohJfXjo1gdg6fPhw@mail.gmail.com
Views:	Whole Thread \| Raw Message \| Download mbox \| Resend email
Thread:
Lists:	pgsql-hackers

On Tue, Sep 24, 2019 at 4:45 PM Andres Freund <andres(at)anarazel(dot)de> wrote:
> Hm. Doing that cleanup requires digging through all the portals etc. I'd
> rather rely on less state being correct than more during FATAL
> processing.

I agree with the principle, but the advantage of removing the hash
table entries is that it protects against any other code that might
try to access the portal machinery later; I thought that was
worthwhile enough to justify doing this here. I don't feel
super-strongly about it, but I do still like that approach.

> > The second wrinkle is that there might be an active portal. Apart
> > from the FATAL case already mentioned, I think the only way this can
> > happen is some C code that calls purposefully calls AbortTransaction()
> > in the middle of executing a command. It can't be an ERROR, because
> > then the portal would be marked as failed before we get here, and it
> > can't be an explicit ROLLBACK, because as noted above, that case was
> > changed 15 years ago. It's got to be some other case where C code
> > calls AbortTransaction() voluntarily in the middle of a statement. For
> > over a decade, there were no cases at all of this type, but the code
> > in this function catered to hypothetical cases by marking the portal
> > failed. By 2016, Noah had figured out that this was bogus, and that
> > any future cases would likely require different handling, but Tom and
> > I shouted him down:
>
> Hm. But wouldn't doing so run into a ton of problems anyway? I mean we
> need to do a ton of checks and special case hangups to make
> CommitTransactionCommand();StartTransactionCommand(); work for VACUUM,
> CIC, ...

Right... those are the kinds of cases I'm talking about here, just for
abort rather than commit.

> The cases where one can use AbortTransaction() (via
> AbortCurrentTransaction() presumably) are ones where either there's no
> surrounding code relying on the transaction (e.g. autovacuum,
> postgres.c), or where special care has been taken with portals
> (e.g. _SPI_rollback()). We didn't have the pin mechanism back then, so
> I think even if we accept your/Tom's reasoning from back then (I don't
> really), it's outdated now that the pin mechanism exists.

It isn't, actually. To respond to this and also your question below
about why I'm looking at active portal rather than pinned portals, try
adding this debugging code to AtAbort_Portals:

+ if (portal->status == PORTAL_ACTIVE)
+ elog(NOTICE, "this portal is ACTIVE and %spinned",
+ portal->portalPinned ? "" : "NOT ");

Then run 'make -C src/pl/plpgsql check' and check
src/pl/plpgsql/src/regression.diffs and you'll see a whole lot of
this:

+NOTICE: this portal is ACTIVE and NOT pinned

The PLs pin the portals they generate internally, but they don't force
the surrounding portal in which the toplevel query is executing to be
pinned. AFAICT, pinning is mostly guarding against explicit
user-initiated drops of portals that were automatically generated by a
PL, whereas the portal's state is about tracking what the system is
doing with the portal.

(I think this could be a lot better documented than it is, but looking
at the commit history, I'm fairly sure that's what is happening here.)

> I'd be happy if we added some defenses against such bogus cases being
> introduced (i.e. erroring out if we encounter an active portal during
> abort processing).

Erroring out during error handling is probably a bad idea, but also, see above.

> > Stepping back a bit, stored procedures are a good example of a command
> > that uses multiple transactions internally. We have a few others, such
> > as VACUUM, but at present, that case only commits transactions
> > internally; it does not roll them back internally. If it did, it
> > would want the same thing that procedures want, namely, to leave the
> > active portal alone. It doesn't quite work to leave the active portal
> > completely alone, because the portal has got a pointer to a
> > ResourceOwner which is about to be freed by end-of-transaction
> > cleanup;
>
> Well, that's why _SPI_commit()/_SPI_rollback() do a HoldPinnedPortals(),
> which, via HoldPortal(), sets portal->resowner to = NULL.

Right, but it's still necessary for AtAbort_Portals() to do the same
thing, again because the top-level portal isn't pinned. If you apply
my patch, comment out the line that does portal->resowner = NULL; for
an active portal, and run make -C src/pl/plpgsql check, it will seg
fault inside exec_simple_query -> PortalDrop -> ResourceOwnerRelease
-> etc.

> > The current code also calls PortalReleaseCachedPlan in this case; I'm
> > not 100% certain whether that's appropriate or not.
>
> I think it's appropriate, because we cannot guarantee that the plan is
> still usable. Besides normal plan invalidation issues, the catalog
> contents the plan might rely on might have only existed in the aborted
> transaction - which seems like a fatal problem. That's why holding
> portals persists the portalstore. Which then also obsoletes the plan.

In a case like this, it can't really be an actual planable statement.
If the executor were involved, aborting the running transaction and
starting a new one would certainly result in a crash, because the
memory context in which we had saved all of our working state would be
vanish out from under us. It's got to be a procedure call or maybe
some kind of DDL command that has been specially-crafted to survive a
mid-command abort. It's not clear to me that the issues for plan
invalidation and catalog contents are the same in those kinds of cases
as they are for planable statements, but perhaps they are.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

In response to

Re: abort-time portal cleanup at 2019-09-24 20:45:10 from Andres Freund

Browse pgsql-hackers by date

	From	Date	Subject
Next Message	Peter Eisentraut	2019-09-27 14:36:49	Re: Standby accepts recovery_target_timeline setting?
Previous Message	Alvaro Herrera	2019-09-27 14:28:18	Re: Two pg_rewind patches (auto generate recovery conf and ensure clean shutdown)