Quick Links

Re: abort-time portal cleanup

From:	Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
To:	Robert Haas <robertmhaas(at)gmail(dot)com>
Cc:	PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject:	Re: abort-time portal cleanup
Date:	2019-09-24 09:30:48
Message-ID:	CAA4eK1J8aurBB+bqV-Y9S8MgDDnthzXfXvZYbudPZaTjD31HwQ@mail.gmail.com
Views:	Whole Thread \| Raw Message \| Download mbox \| Resend email
Thread:
Lists:	pgsql-hackers

On Fri, Sep 13, 2019 at 2:13 AM Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
>
> I've been studying At{Sub,}{Abort,Cleanup}_Portals() for the last few
> days and have come to the conclusion that the code is not entirely up
> to our usual standards. I believe that a good deal of the reason for
> this is attributable to the poor quality of the code comments in this
> area, although there are perhaps some other contributing factors as
> well, such as bullheadedness on my part and perhaps others.
>
> The trouble starts with the header comment for AtAbort_Portals, which
> states that "At this point we run the cleanup hook if present, but we
> can't release the portal's memory until the cleanup call." At the time
> this logic was introduced in commit
> de28dc9a04c4df5d711815b7a518501b43535a26 (2003-05-02),
> AtAbort_Portals() affected all non-held portals without caring whether
> they were active or not, and, UserAbortTransactionBlock() called
> AbortTransaction() directly, so typing "ROLLBACK;" would cause
> AtAbort_Portals() to be reached from within PortalRun(). Even if
> PortalRun() managed to return without crashing, the caller would next
> try to call PortalDrop() on what was now an invalid pointer. However,
> commit 8f9f1986034a2273e09ad10671e10d1adda21d1f (2004-09-16) changed
> things so that UserAbortEndTransaction() just sets things up so that
> the subsequent call to CommitTransactionCommand() would call
> AbortTransaction() instead of trying to do it right away, and that
> doesn't happen until after we're done with the portal. As far as I
> can see, that change made this comment mostly false, but the comment
> has nevertheless managed to survive for another ~15 years. I think we
> can, and in fact should, just drop the portal right here.
>
> As far as actually making that work, there are a few wrinkles. The
> first is that we might be in the middle of FATAL. In that case, unlike
> the ROLLBACK case, a call to PortalRun() is still on the stack, but
> we'll exit the process rather than returning, so the fact that we've
> created a dangling pointer for the caller won't matter. However, as
> shown by commit ad9a274778d2d88c46b90309212b92ee7fdf9afe (2018-02-01)
> and the report that led up to it at
> https://www.postgresql.org/message-id/20180128034531.h6o4w3727ifof3jy%40alap3.anarazel.de
> it's not a good idea to try to clean up the portal in that case,
> because we might've already started shutting down critical systems.
> It seems not only risky but also unnecessary: our process-local state
> is about to go away, and the executor shouldn't need to clean up any
> shared memory state that won't also get cleaned up by some other
> mechanism. So, it seems to me that if we reach AtAbort_Portals()
> during FATAL processing, we should either (1) do nothing at all and
> just return or (2) forget about all the existing portals without
> cleaning them up and then return. The second option seems a little
> safer to me, because it guarantees that if we somehow reach code that
> might try to look up a portal later, it won't find anything. But I
> think it's arguable.
>

I agree with your position on this.

>
> Attached is a patch that (1) removes At(Sub)Cleanup_Portals() entirely
> in favor of dropping portals on the spot in At(Sub)Abort_Portals(),
> (2) overhauls the comments in this area, and (3) makes
> AtSubAbort_Portals() consistent with AtAbort_Portals().

The overall idea seems good to me, but I have a few comments on the changes.

1.
@@ -2756,7 +2756,6 @@ CleanupTransaction(void)
/*
* do abort cleanup processing
*/
- AtCleanup_Portals(); /* now safe to release portal
memory */
AtEOXact_Snapshot(false, true); /* and release the transaction's
snapshots */

CurrentResourceOwner = NULL; /* and resource owner */
@@ -5032,8 +5031,6 @@ CleanupSubTransaction(void)
elog(WARNING, "CleanupSubTransaction while in %s state",
TransStateAsString(s->state));

- AtSubCleanup_Portals(s->subTransactionId);
-

After this cleanup, I think we don't need At(Sub)Abort_Portals in
AbortOutOfAnyTransaction() for the states TBLOCK_(SUB)ABORT and
friends. This is because AbortTransaction itself would have zapped the
portal.

2. You seem to forgot removing AtCleanup_Portals() from portal.h

3.
/*
- * If it was created in the current transaction, we
can't do normal
- * shutdown on a READY portal either; it might refer to
objects
- * created in the failed transaction. See comments in
- * AtSubAbort_Portals.
- */
- if (portal->status == PORTAL_READY)
- MarkPortalFailed(portal);
-

Why it is safe to remove this check? It has been explained in commit
7981c342 why we need that check. I don't see any explanation in email
or patch which justifies this code removal. Is it because you removed
PortalCleanup? If so, that is still called from PortalDrop?

4.
-AtCleanup_Portals(void)
-{
- HASH_SEQ_STATUS status;
- PortalHashEnt *hentry;
-
- hash_seq_init(&status, PortalHashTable);
-
- while ((hentry = (PortalHashEnt *) hash_seq_search(&status)) !=
NULL)
- {
- Portal portal = hentry->portal;
-
- /*
- * Do not touch active portals --- this can only happen
in the case of
- * a multi-transaction command.
+ * If the status is PORTAL_ACTIVE, then we must be
executing a command
+ * that uses multiple transactions internally. In that
case, the
+ * command in question must be one that does not
internally rely on
+ * any transaction-lifetime resources, because they
would disappear
+ * in the upcoming transaction-wide cleanup.
*/
if (portal->status == PORTAL_ACTIVE)

I am not able to understand how we can reach with the portal state as
'active' for a multi-transaction command. It seems wherever we mark
portal as active, we don't relinquish the control unless its state is
changed. Can you share some example where this can happen?

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

In response to

abort-time portal cleanup at 2019-09-12 20:42:46 from Robert Haas

Responses

Re: abort-time portal cleanup at 2019-10-07 16:14:52 from Robert Haas

Browse pgsql-hackers by date

	From	Date	Subject
Next Message	Fujii Masao	2019-09-24 09:41:02	log message in proto.c
Previous Message	Kyotaro Horiguchi	2019-09-24 08:41:46	Re: Index Skip Scan