vac_truncate_clog()'s bogus check leads to bogusness

From: Andres Freund <andres(at)anarazel(dot)de>
To: pgsql-hackers(at)postgresql(dot)org, Noah Misch <noah(at)leadboat(dot)com>
Subject: vac_truncate_clog()'s bogus check leads to bogusness
Date: 2023-06-21 22:12:08
Message-ID: 20230621221208.vhsqgduwfpzwxnpg@awork3.anarazel.de
Lists: pgsql-hackers

Hi,

When vac_truncate_clog() returns early, due to one of these paths:

    /*
     * Do not truncate CLOG if we seem to have suffered wraparound already;
     * the computed minimum XID might be bogus.  This case should now be
     * impossible due to the defenses in GetNewTransactionId, but we keep the
     * test anyway.
     */
    if (frozenAlreadyWrapped)
    {
        ereport(WARNING,
                (errmsg("some databases have not been vacuumed in over 2 billion transactions"),
                 errdetail("You might have already suffered transaction-wraparound data loss.")));
        return;
    }

    /* chicken out if data is bogus in any other way */
    if (bogus)
        return;

we haven't released the lwlock that we acquired earlier:

/* Restrict task to one backend per cluster; see SimpleLruTruncate(). */
LWLockAcquire(WrapLimitsVacuumLock, LW_EXCLUSIVE);

as this isn't a path raising an error, the lock isn't released during abort.
Until there's some cause for the session to call LWLockReleaseAll(), the lock
is held. Until then neither the process holding the lock, nor any other
process, can finish vacuuming. We don't even have an assert against a
self-deadlock with an already held lock, oddly enough.
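
To make the failure mode concrete, here is a minimal standalone sketch of the pattern, not PostgreSQL code. ToyLock, toy_lock_try_acquire(), toy_lock_release(), truncate_buggy() and truncate_fixed() are all hypothetical names invented for illustration. The toy lock reports failure where a real LWLockAcquire() would block, and the warning text in the fixed variant is only an assumption about what a louder bail-out could look like.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical stand-in for an LWLock: it only tracks whether it is held. */
typedef struct ToyLock
{
    bool held;
} ToyLock;

/* Returns false where a real LWLockAcquire() would block forever. */
static bool
toy_lock_try_acquire(ToyLock *lock)
{
    if (lock->held)
        return false;
    lock->held = true;
    return true;
}

static void
toy_lock_release(ToyLock *lock)
{
    assert(lock->held);
    lock->held = false;
}

/* Same shape as the problem: a non-error early return leaks the lock. */
static bool
truncate_buggy(ToyLock *lock, bool bogus)
{
    if (!toy_lock_try_acquire(lock))
        return false;           /* the real code would hang here */

    if (bogus)
        return true;            /* BUG: returns with the lock still held */

    /* ... the truncation work would happen here ... */
    toy_lock_release(lock);
    return true;
}

/* One possible shape of a fix: release, and say something, before bailing. */
static bool
truncate_fixed(ToyLock *lock, bool bogus)
{
    if (!toy_lock_try_acquire(lock))
        return false;

    if (bogus)
    {
        /* Assumed message text, for illustration only. */
        fprintf(stderr, "WARNING: bogus XID found, skipping truncation\n");
        toy_lock_release(lock);
        return true;
    }

    /* ... the truncation work would happen here ... */
    toy_lock_release(lock);
    return true;
}
```

Calling truncate_buggy() twice on the same lock with bogus set to true succeeds the first time and fails the second time, which corresponds to the second VACUUM hanging. With truncate_fixed() both calls succeed and the lock is free afterwards.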

This is somewhat nasty - there's no real way to get out of this without an
immediate restart, and it's hard to pinpoint the problem as well :(.

Ok, the subject line is not the most precise, but it was just too good an
opportunity.

To reproduce (only on a throwaway system please!):

CREATE DATABASE invalid;
UPDATE pg_database SET datfrozenxid = '10002' WHERE datname = 'invalid';
DROP TABLE IF EXISTS foo_tbl; CREATE TABLE foo_tbl(); DROP TABLE foo_tbl; VACUUM FREEZE;
DROP TABLE IF EXISTS foo_tbl; CREATE TABLE foo_tbl(); DROP TABLE foo_tbl; VACUUM FREEZE;
<hang>

Found this while writing a test for the fix for partial dropping of
databases [1].

Separately, I think it's quite bad that we *silently* return from
vac_truncate_clog() when finding a bogus xid. That's a quite severe condition,
we should at least tell the user about it.

Greetings,

Andres Freund

[1] https://postgr.es/m/20230621190204.nsaelabojxppiuix%40awork3.anarazel.de
