TruncateMultiXact() bugs

From: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
To: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: TruncateMultiXact() bugs
Date: 2024-06-14 11:37:35
Message-ID: ccc66933-31c1-4f6a-bf4b-45fef0d4f22e@iki.fi
Lists: pgsql-hackers

I was testing multixid wraparound when I ran into this assertion
failure:

> TRAP: failed Assert("CritSectionCount == 0 || (context)->allowInCritSection"), File: "../src/backend/utils/mmgr/mcxt.c", Line: 1353, PID: 920981
> postgres: autovacuum worker template0(ExceptionalCondition+0x6e)[0x560a501e866e]
> postgres: autovacuum worker template0(+0x5dce3d)[0x560a50217e3d]
> postgres: autovacuum worker template0(ForwardSyncRequest+0x8e)[0x560a4ffec95e]
> postgres: autovacuum worker template0(RegisterSyncRequest+0x2b)[0x560a50091eeb]
> postgres: autovacuum worker template0(+0x187b0a)[0x560a4fdc2b0a]
> postgres: autovacuum worker template0(SlruDeleteSegment+0x101)[0x560a4fdc2ab1]
> postgres: autovacuum worker template0(TruncateMultiXact+0x2fb)[0x560a4fdbde1b]
> postgres: autovacuum worker template0(vac_update_datfrozenxid+0x4b3)[0x560a4febd2f3]
> postgres: autovacuum worker template0(+0x3adf66)[0x560a4ffe8f66]
> postgres: autovacuum worker template0(AutoVacWorkerMain+0x3ed)[0x560a4ffe7c2d]
> postgres: autovacuum worker template0(+0x3b1ead)[0x560a4ffecead]
> postgres: autovacuum worker template0(+0x3b620e)[0x560a4fff120e]
> postgres: autovacuum worker template0(+0x3b3fbb)[0x560a4ffeefbb]
> postgres: autovacuum worker template0(+0x2f724e)[0x560a4ff3224e]
> /lib/x86_64-linux-gnu/libc.so.6(+0x27c8a)[0x7f62cc642c8a]
> /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x85)[0x7f62cc642d45]
> postgres: autovacuum worker template0(_start+0x21)[0x560a4fd16f31]
> 2024-06-14 13:11:02.025 EEST [920971] LOG: server process (PID 920981) was terminated by signal 6: Aborted
> 2024-06-14 13:11:02.025 EEST [920971] DETAIL: Failed process was running: autovacuum: VACUUM pg_toast.pg_toast_13407 (to prevent wraparound)

The attached Python script reproduces this pretty reliably. It's a
reduced version of a larger test script I was working on; it could
probably be simplified further for this particular issue.

Looking at the code, it's pretty clear how it happens:

1. TruncateMultiXact does START_CRIT_SECTION();

2. In the critical section, it calls PerformMembersTruncation() ->
SlruDeleteSegment() -> SlruInternalDeleteSegment() ->
RegisterSyncRequest() -> ForwardSyncRequest()

3. If the fsync request queue is full, ForwardSyncRequest() calls
CompactCheckpointerRequestQueue(), which calls palloc0(). Memory
allocation is not allowed in a critical section, hence the assertion
failure.

A straightforward fix is to add a check to
CompactCheckpointerRequestQueue() so that it bails out without
compacting if it's called in a critical section. That would also cover
any other cases where RegisterSyncRequest() is called in a critical
section. I haven't searched for other such cases.
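
To illustrate the idea, here is a minimal standalone sketch in C. It is
a toy model, not the code from the attached patch: SyncQueue,
compact_queue(), forward_request() and crit_section_count are
hypothetical stand-ins for the shared request queue,
CompactCheckpointerRequestQueue(), ForwardSyncRequest() and
CritSectionCount. It only shows the shape of the guard: compaction
needs scratch memory, so it refuses to run inside a critical section
and the caller sees "queue full" instead.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Toy stand-ins; none of these are the real PostgreSQL definitions. */
#define SYNC_QUEUE_SIZE 4

static int crit_section_count = 0;  /* models CritSectionCount */

typedef struct
{
    int requests[SYNC_QUEUE_SIZE];  /* e.g. segment numbers */
    int num_requests;
} SyncQueue;

/*
 * Drop duplicate requests. This needs scratch memory, so it bails out
 * without compacting when called inside a critical section.
 * Returns true if at least one slot was freed.
 */
static bool
compact_queue(SyncQueue *q)
{
    bool *skip;
    bool  freed;
    int   kept = 0;

    /* The guard: no allocations in a critical section. */
    if (crit_section_count > 0)
        return false;

    skip = calloc(SYNC_QUEUE_SIZE, sizeof(bool));
    if (skip == NULL)
        return false;

    /* Mark every later duplicate of an entry we are keeping. */
    for (int i = 0; i < q->num_requests; i++)
    {
        if (skip[i])
            continue;
        for (int j = i + 1; j < q->num_requests; j++)
            if (q->requests[j] == q->requests[i])
                skip[j] = true;
    }

    /* Slide the surviving entries to the front of the array. */
    for (int i = 0; i < q->num_requests; i++)
        if (!skip[i])
            q->requests[kept++] = q->requests[i];

    free(skip);
    freed = kept < q->num_requests;
    q->num_requests = kept;
    return freed;
}

/* Queue a request; returns false if the caller must retry later. */
static bool
forward_request(SyncQueue *q, int segno)
{
    if (q->num_requests >= SYNC_QUEUE_SIZE && !compact_queue(q))
        return false;
    q->requests[q->num_requests++] = segno;
    return true;
}
```

With a full queue, forward_request() fails while crit_section_count is
nonzero and succeeds once it is back to zero, provided the queue holds
duplicates to drop. The cost of the guard is that a caller in a critical
section has to retry rather than get the queue compacted.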

But wait, there is more!

After applying that fix in CompactCheckpointerRequestQueue(), the test
script often gets stuck. There's a deadlock between the checkpointer
and the autovacuum backend trimming the SLRUs:

1. TruncateMultiXact does this:

MyProc->delayChkptFlags |= DELAY_CHKPT_START;

2. It then makes that call to PerformMembersTruncation() and
RegisterSyncRequest(). If it cannot queue the request, it sleeps a
little and retries. But the checkpointer is stuck waiting for the
autovacuum backend, because of delayChkptFlags, and will never clear the
queue.

To fix, I propose to add AbsorbSyncRequests() calls to the wait-loops in
CreateCheckPoint().
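
The deadlock and the proposed fix can be sketched with a small
single-threaded toy model in C. Everything here is hypothetical and
heavily simplified: Sim, backend_step(), absorb() and
checkpointer_wait() only stand in for the backend's retry loop,
AbsorbSyncRequests() and the wait-loop in CreateCheckPoint(). The two
processes are interleaved by hand instead of running concurrently.

```c
#include <assert.h>
#include <stdbool.h>

#define SIM_QUEUE_SIZE 2

typedef struct
{
    int  queued;             /* pending requests in the shared queue */
    bool delay_chkpt_start;  /* models DELAY_CHKPT_START on the backend */
    bool request_pending;    /* backend still has a request to queue */
} Sim;

/* One iteration of the backend's "try to queue, else sleep" loop. */
static void
backend_step(Sim *s)
{
    if (!s->request_pending)
        return;
    if (s->queued < SIM_QUEUE_SIZE)
    {
        s->queued++;
        s->request_pending = false;
        s->delay_chkpt_start = false;  /* truncation is done */
    }
    /* else: queue is full, sleep a little and retry */
}

/* Models the checkpointer draining the shared queue. */
static void
absorb(Sim *s)
{
    s->queued = 0;
}

/*
 * Models the checkpointer waiting for the delay flag to clear.
 * Returns the number of iterations waited, or -1 if the flag was
 * still set after max_iters iterations (i.e. we are stuck).
 */
static int
checkpointer_wait(Sim *s, bool absorb_in_loop, int max_iters)
{
    for (int i = 0; i < max_iters; i++)
    {
        if (!s->delay_chkpt_start)
            return i;
        if (absorb_in_loop)
            absorb(s);
        backend_step(s);  /* the backend runs in between our sleeps */
    }
    return -1;
}
```

Starting from a full queue, the wait never finishes without the absorb
call, because the backend can never queue its request. With the absorb
call inside the loop, the queue drains, the backend queues its request
and clears the flag, and the wait ends on the next iteration.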

The attached patch fixes both of those issues.

I can't help thinking that TruncateMultiXact() should perhaps not have
such a long critical section. TruncateCLOG() doesn't do that. But it was
added for good reasons in commit 4f627f897367, and this fix seems
appropriate for the stable branches anyway, even if we come up with
something better for master.

--
Heikki Linnakangas
Neon (https://neon.tech)

Attachment                                   Content-Type   Size
repro-multixid-trim-assertion.py             text/x-python  6.0 KB
0001-Fix-bugs-in-MultiXact-truncation.patch  text/x-patch   7.1 KB
