From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Simon Riggs <simon(dot)riggs(at)enterprisedb(dot)com>
Cc: PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Bug in wait time when waiting on nested subtransaction
Date: 2022-11-28 17:38:16
Message-ID: CA+TgmobzZWa5LGaC1vyQy0okEu8L0SbM_dXTSogrwVco_qEq2A@mail.gmail.com
Lists: pgsql-hackers

On Mon, Nov 28, 2022 at 10:28 AM Simon Riggs
<simon(dot)riggs(at)enterprisedb(dot)com> wrote:
> So we have these options
>
> 1. Removing the XactLockTableDelete() call in CommitSubTransaction().
> That releases lock waiters earlier than expected, which requires
> pushups in XactLockTableWait() to cope with that (which are currently
> broken). Why do we do it? To save shmem space in the lock table should
> anyone want to run a transaction that contains thousands of
> subtransactions, or more. So option (1) alone would eventually cause
> us to run out of space in the lock table and a transaction would
> receive ERRORs rather than be allowed to run for a long time.

This seems unprincipled to me. The point of having a lock on the
subtransaction in the lock table is so that people can wait for the
subtransaction to end. If we don't remove the lock table entry at
subtransaction end, then that lock table entry doesn't serve that
purpose any more.
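
For reference, my reading of the current lifecycle in xact.c and
lmgr.c, much simplified:

/* when the subtransaction is assigned an XID (AssignTransactionId) */
XactLockTableInsert(subxid);   /* take ExclusiveLock on our own XID */

/* at subtransaction commit (CommitSubTransaction) */
XactLockTableDelete(subxid);   /* release it, waking any waiters */

Waiters take ShareLock on that entry, so they block precisely until
subtransaction end -- the behavior that option (1) would give up.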

> 2. In XactLockTableWait(), replace the call to SubTransGetParent(), so
> we go up the levels one by one as we did before. However, (2) causes
> huge subtrans contention, and if we implemented that and backpatched
> it, the performance issues could be significant. So my feeling is
> that if we do (2) then we should not backpatch it.

What I find suspicious about the coding of this function is that it
doesn't distinguish between commits and aborts at all. Like, if a
subtransaction commits, the changes don't become globally visible
until the parent transaction also commits. If a subtransaction aborts,
though, what happens to the top-level XID doesn't seem to matter at
all. The comment seems to agree:

* Note that this does the right thing for subtransactions: if we wait on a
* subtransaction, we will exit as soon as it aborts or its top parent commits.

I feel like what I'd expect to see given this comment is code which
(1) waits until the supplied XID is no longer running, (2) checks
whether the XID aborted, and if so returns at once, and (3) otherwise
recurses to the parent XID. But the code doesn't do that. Perhaps
that's not actually the right thing to do, since it seems like a big
behavior change, but then I don't understand the comment.
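
To make that concrete, the loop I'd have guessed at from the comment
is something like this (untested sketch; it ignores pg_subtrans
truncation and assumes the lock table entry survives until the
subtransaction is truly no longer in progress, which is exactly the
point at issue in option (1)):

for (;;)
{
    LOCKTAG     tag;

    /* (1) block until this xid is no longer running */
    SET_LOCKTAG_TRANSACTION(tag, xid);
    (void) LockAcquire(&tag, ShareLock, false, false);
    LockRelease(&tag, ShareLock, false);

    /* (2) if it aborted, its changes can never become visible */
    if (TransactionIdDidAbort(xid))
        return;

    /* (3) it committed, so its fate rests with its parent */
    xid = SubTransGetParent(xid);
    if (!TransactionIdIsValid(xid))
        return;     /* that was the top level, and it committed */
}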

Incidentally, one possible optimization here to reduce locking
traffic would be to try waiting for the top-parent first using a
conditional lock acquisition. If that works, cool. If not, go back
around and work up the tree level by level. Since that path would only
be taken in the unhappy case where we're probably going to have to
wait anyway, the cost probably wouldn't be too bad.
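
Roughly, and assuming we can still get at the topmost XID cheaply
(untested):

LOCKTAG         tag;
TransactionId   topxid = SubTransGetTopmostTransaction(xid);

SET_LOCKTAG_TRANSACTION(tag, topxid);
if (LockAcquire(&tag, ShareLock, false, true) != LOCKACQUIRE_NOT_AVAIL)
{
    /* top parent has already ended; no tree walk needed */
    LockRelease(&tag, ShareLock, false);
}
else
{
    /*
     * Top parent is still running, so we're going to have to wait
     * anyway: fall back to climbing up from xid one level at a time.
     */
}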

> My preferred solution would be a mix of the above, call it option (3)
>
> 3.
> a) Include the lock table entry for the first 64 subtransactions only,
> so that we limit shmem. For those first 64 entries, have subtrans
> point directly to the top, since this makes a subtrans lookup into an
> O(1) operation, which is important for performance of later actions.
>
> b) For any subtransactions after the first 64, delete the subxid lock on
> subcommit, to save shmem, but make subtrans point to the immediate
> parent (only), not the topxid. That fixes the bug, but causes
> performance problems in XidInMVCCSnapshot() and others, so we also do
> c) and d)
>
> c) At top level commit, go back and alter subtrans again for subxids
> so now it points to the topxid, so that we avoid O(N) behavior in
> XidInMVCCSnapshot() and other callers. Additional cost for longer
> transactions, but it saves considerable cost in later callers that
> need to call GetTopmostTransaction.
>
> d) Optimize SubTransGetTopmostTransaction() so it retrieves entries
> page-at-a-time. This will reduce the contention of repeatedly
> re-visiting the same page(s) and ensure that a page is less often
> paged out when we are still using it.

I'm honestly not very sanguine about changing pg_subtrans to point
straight to the topmost XID. It's only OK to do that if there's
absolutely nothing that needs to know the full tree structure, and the
present case looks like an instance where we would like to have the
full tree structure. I would not be surprised if there are others.
That said, it seems a lot less likely that this would be an issue once
the top-level transaction is no longer running. At that point, all the
subtransactions are no longer running either: they either committed or
they rolled back, and I can't see a reason why any code should care
about anything other than which of those two things happened. So I
think your idea in (c) might have some possibilities.
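
If I understand (c) correctly, it amounts to something like the
following at top-level commit time, with subxids[]/nsubxids standing
in for the list of committed child XIDs, and hand-waving away WAL and
recovery; note too that SubTransSetParent() currently asserts that a
valid parent is never overwritten, so that rule would need relaxing:

/*
 * Repoint every committed subxid straight at the top-level XID, so
 * that later SubTransGetTopmostTransaction() calls become O(1).
 */
for (int i = 0; i < nsubxids; i++)
    SubTransSetParent(subxids[i], topxid);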

You could also flip that idea around and have readers replace
immediate parent pointers with top-parent pointers opportunistically,
but I'm not sure that's actually any better. As you present it in (c)
above, there's a risk of going back and updating pg_subtrans state
that no one will ever look at. But if you flipped it around and did it
on the read side, then you'd have the risk of a bunch of backends
trying to do it at the same time. I'm not sure whether either of those
things is a big problem in practice, or whether both are, or neither.
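
For comparison, the read-side variant might look like this inside
SubTransGetTopmostTransaction() (same caveat about the overwrite
assertion as above, though concurrent backends would at least all be
storing the same value):

TransactionId   parentXid = xid;
TransactionId   topXid = xid;

/* walk up the tree as we do today... */
while (TransactionIdIsValid(parentXid))
{
    topXid = parentXid;
    parentXid = SubTransGetParent(parentXid);
}

/* ...then opportunistically shortcut future lookups */
if (!TransactionIdEquals(topXid, xid))
    SubTransSetParent(xid, topXid);

return topXid;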

I agree that it looks possible to optimize
SubTransGetTopmostTransaction better for the case where we want to
traverse multiple or all levels, so that fewer pages are read. I don't
know to what degree that would affect user-observable performance, but
I can believe that it could be a win.
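
The version I can imagine, as it might look inside subtrans.c, is
roughly this (sketch only; it ignores the TransactionXmin cutoff the
real function applies, and it holds the SLRU lock for a little longer
per page):

for (;;)
{
    int             pageno = TransactionIdToPage(xid);
    int             slotno;
    TransactionId   parent;

    /* returns with SubtransSLRULock held */
    slotno = SimpleLruReadPage_ReadOnly(SubTransCtl, pageno, xid);

    /* follow the chain for as long as it stays on this page */
    for (;;)
    {
        TransactionId *ptr = (TransactionId *)
            SubTransCtl->shared->page_buffer[slotno];

        parent = ptr[TransactionIdToEntry(xid)];
        if (!TransactionIdIsValid(parent) ||
            TransactionIdToPage(parent) != pageno)
            break;
        xid = parent;
    }
    LWLockRelease(SubtransSLRULock);

    if (!TransactionIdIsValid(parent))
        return xid;         /* xid is the topmost transaction */
    xid = parent;           /* chain continues on another page */
}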

--
Robert Haas
EDB: http://www.enterprisedb.com
